hamilton VS awesome-pipeline

Compare hamilton vs awesome-pipeline and see what their differences are.

hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage and metadata. It runs and scales everywhere Python does. (by DAGWorks-Inc)

awesome-pipeline

A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin (by pditommaso)
              hamilton                      awesome-pipeline
Mentions      20                            10
Stars         1,312                         5,904
Growth        8.2%                          -
Activity      9.8                           5.7
Last commit   6 days ago                    2 months ago
Language      Jupyter Notebook              -
License       BSD 3-Clause Clear License    -
The number of mentions indicates the total number of mentions we've tracked, plus the number of user-suggested alternatives.
Stars - the number of stars a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

hamilton

Posts with mentions or reviews of hamilton. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-04-26.
  • Building an Email Assistant Application with Burr
    6 projects | dev.to | 26 Apr 2024
Note that this uses simple OpenAI calls — you can replace this with Langchain, LlamaIndex, Hamilton (or something else) if you prefer more abstraction, and delegate to whatever LLM you like to use. And you should probably use something a little more concrete (e.g. instructor) to guarantee output shape.
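    As a minimal sketch of that last suggestion, assuming the openai, pydantic, and instructor packages (the EmailReply model is a made-up example, not from the post):

      import instructor
      from openai import OpenAI
      from pydantic import BaseModel

      class EmailReply(BaseModel):  # hypothetical output shape, for illustration
          subject: str
          body: str

      client = instructor.from_openai(OpenAI())

      # instructor validates (and retries) the completion against the model,
      # so downstream code can rely on the output shape.
      reply = client.chat.completions.create(
          model="gpt-4o-mini",
          response_model=EmailReply,
          messages=[{"role": "user", "content": "Draft a short reply to: ..."}],
      )
      print(reply.subject)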
  • Using IPython Jupyter Magic commands to improve the notebook experience
    1 project | dev.to | 3 Mar 2024
    In this post, we’ll show how your team can turn any utility function(s) into reusable IPython Jupyter magics for a better notebook experience. As an example, we’ll use Hamilton, my open source library, to motivate the creation of a magic that facilitates better development ergonomics for using it. You needn’t know what Hamilton is to understand this post.
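    The registration pattern itself is small. A generic, hedged sketch (the df_summary magic and its helper are made-up examples, not the magic built in the post):

      from IPython import get_ipython
      from IPython.core.magic import register_line_magic

      def summarize_dataframe(df):
          """Plain utility function, usable with or without the magic."""
          print(df.shape, list(df.columns))

      # Run inside an IPython/Jupyter session to register the magic.
      @register_line_magic
      def df_summary(line):
          """Usage in a cell: %df_summary my_df"""
          summarize_dataframe(get_ipython().user_ns[line.strip()])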
  • FastUI: Build Better UIs Faster
    12 projects | news.ycombinator.com | 1 Mar 2024
    We built an app with it -- https://blog.dagworks.io/p/building-a-lightweight-experiment. You can see the code here https://github.com/DAGWorks-Inc/hamilton/blob/main/hamilton/....

    Usually we've been prototyping with Streamlit, but at times found it to be clunky. FastUI still has rough edges, but we made it work for our lightweight app.

  • Show HN: On Garbage Collection and Memory Optimization in Hamilton
    1 project | news.ycombinator.com | 24 Oct 2023
  • Facebook Prophet: library for generating forecasts from any time series data
    7 projects | news.ycombinator.com | 26 Sep 2023
    Is this library old news? Is there anything new that they've added that's noteworthy enough to take it for another spin?

    [disclaimer: I'm a maintainer of Hamilton] Otherwise, FYI: Prophet gels well with https://github.com/DAGWorks-Inc/hamilton for setting up your features and dataset for fitting & prediction.

  • Show HN: Declarative Spark Transformations with Hamilton
    1 project | news.ycombinator.com | 24 Aug 2023
  • Langchain Is Pointless
    16 projects | news.ycombinator.com | 8 Jul 2023
    I had been hearing these pains from Langchain users for quite a while. Suffice it to say, I think:

    1. Too many layers of OO abstraction are a liability in production contexts. I'm biased, but a more functional approach is a better way to model what's going on: it's easier to test, to wrap a function with cross-cutting concerns, and therefore to reason about.

    2. As fast as the field is moving, the layers of abstraction actually hurt your ability to customize without really diving into the details of the framework, or require you to step outside it -- in which case, why use it?

    Otherwise, I definitely love the small amount of code you need to write to get an LLM application up with Langchain. However, you read code more often than you write it, so this brevity is a trade-off: would you prefer to reduce your time debugging a production outage, or your time building the application? There's no right answer, other than "it depends".

    To that end - we've come up with a post showing how one might use Hamilton (https://github.com/dagWorks-Inc/hamilton) to easily create a workflow to ingest data into a vector database that I think has a great production story. https://open.substack.com/pub/dagworks/p/building-a-maintain...

    Note: Hamilton can cover your MLOps as well as LLMOps needs; you'll invariably be connecting LLM applications with traditional data/ML pipelines because LLMs don't solve everything -- but that's a post for another day.
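    To make the functional style concrete, here is a minimal sketch of a Hamilton dataflow; the module, column, and config names are made up, and the driver call reflects Hamilton's documented entry point:

      # --- my_dataflow.py ---
      # Each function is a node in the DAG; its parameter names declare
      # its upstream dependencies, so the code documents its own lineage.
      import pandas as pd

      def raw_docs(source_path: str) -> pd.DataFrame:
          return pd.read_csv(source_path)

      def cleaned_docs(raw_docs: pd.DataFrame) -> pd.DataFrame:
          return raw_docs.dropna(subset=["text"])

      def doc_lengths(cleaned_docs: pd.DataFrame) -> pd.Series:
          return cleaned_docs["text"].str.len()

      # --- run.py ---
      from hamilton import driver
      import my_dataflow

      dr = driver.Driver({"source_path": "docs.csv"}, my_dataflow)
      print(dr.execute(["doc_lengths"]))  # computes only what's requested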

  • Free access to beta product I'm building that I'd love feedback on
    1 project | /r/quants | 31 May 2023
    This is me. I drive an open source library, Hamilton, that people doing time-series/ML work love to use. I'm building a paid product around it at DAGWorks, and I'm after feedback on our current version. Can I entice anyone to:
  • IPyflow: Reactive Python Notebooks in Jupyter(Lab)
    5 projects | news.ycombinator.com | 10 May 2023
    From a nuts and bolts perspective, I've been thinking of building some reactivity on top of https://github.com/dagworks-inc/hamilton (author here) that could get at this. (If you have a use case that could be documented, I'd appreciate it.)
  • Data lineage
    1 project | /r/mlops | 15 Apr 2023
    Most people don't track lineage because it's difficult (though if you use something like https://github.com/DAGWorks-Inc/hamilton to write your pipeline - author here - it can come almost for free).
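    As an illustration of why lineage can come almost for free here: dependencies are just function parameters, so the driver can answer lineage questions from the code itself. A hedged sketch, assuming the introspection helpers in recent Hamilton releases (check your version's Driver docs for exact names):

      from hamilton import driver
      import my_dataflow  # the hypothetical module sketched earlier

      dr = driver.Driver({"source_path": "docs.csv"}, my_dataflow)

      # Upstream lineage of a node, derived from the function signatures:
      print(dr.what_is_upstream_of("doc_lengths"))

      # Render the whole dataflow to an image (requires graphviz):
      dr.display_all_functions("dataflow.png")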

awesome-pipeline

Posts with mentions or reviews of awesome-pipeline. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-01.
  • Orchestration: Thoughts on Dagster, Airflow and Prefect?
    3 projects | /r/dataengineering | 1 Jun 2023
    There are a truly huge number of options in this space; see for example https://github.com/pditommaso/awesome-pipeline. Many of them are very niche, half-baked, or abandonware.
  • Launch HN: DAGWorks – ML platform for data science teams
    7 projects | news.ycombinator.com | 7 Mar 2023
    As a long-time fan of DAG-oriented tools, congrats on the launch. Maybe you can get added here https://github.com/pditommaso/awesome-pipeline now or in the future...

    This is a problem space I've worked in and been thinking about for a very, very long time. I've extensively used Airflow (bad), DBT (good-ish), Luigi (good), drake (abandoned), tested many more, and written two of my own.

    It's important to remember that DAG tools exist to solve two primary problems that arise from one underlying cause. Those problems are 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is that data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently and avoid re-computing things.

    Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.

    A couple more things...

    It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on there, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (glances at 380-model DBT project...), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).

    I don't see any mention of development workflow. As a few examples, DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, etc etc. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.

    Another thing I notice is that your model seems oriented around flowing data through the program as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere, e.g. by a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi tracks task state through Targets, which typically live in the World State. Again, there's no right or wrong here, only tradeoffs, but leaning on durable records of task state directly facilitates "resumption from partial failure" as mentioned above.

    Best of luck.
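    The two fundamentals named in that comment, declarative ordering/parallelism and resumption from partial failure, can be sketched with the Python standard library. This is a generic illustration, not code from any of the tools discussed; run_task and outputs_exist are hypothetical stand-ins:

      from concurrent.futures import ThreadPoolExecutor
      from graphlib import TopologicalSorter

      # Dependencies are declared as data; execution order is derived from them.
      deps = {"clean": {"extract"}, "features": {"clean"}, "train": {"features"}}

      def outputs_exist(task: str) -> bool:
          return False  # stand-in: check for the task's durable outputs

      def run_task(task: str) -> None:
          print("running", task)

      ts = TopologicalSorter(deps)
      ts.prepare()
      with ThreadPoolExecutor() as pool:
          while ts.is_active():
              ready = list(ts.get_ready())  # all tasks whose deps are satisfied
              todo = [t for t in ready if not outputs_exist(t)]  # resume: skip finished work
              list(pool.map(run_task, todo))  # run this wave in parallel, wait for it
              for t in ready:
                  ts.done(t)  # unblock downstream tasks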

  • Any good resources for R for bioinformatics
    2 projects | /r/labrats | 17 Nov 2022
  • Alternatives to nextflow?
    6 projects | /r/bioinformatics | 26 Oct 2022
    Hi everyone. So I've been using nextflow for about a month or so, having developed a few pipelines, and I've found the debugging experience absolutely abysmal. Although nextflow has great observability with tower, and great community support with nf-core, the uninformative error messages are souring the experience for me. There are soooo many pipeline frameworks out there, but I'm wondering if anyone has come across one similar to nextflow in offering observability, a strong community behind it, multiple executors (container image based preferably) and an awesome debugging experience? I would favor a python based approach, but not sure snakemake is the one I'm looking for.
  • A General Workflow Engine
    1 project | /r/developersIndia | 9 Nov 2021
    My answer is more from a design/product point of view. If you mean code-execution workflow management, then there are a bunch of packages listed in this awesome list.
  • [Discussion] Applied machine learning implementation debate. Is OOP approach towards data preprocessing in python an overkill?
    3 projects | /r/MachineLearning | 3 Nov 2021
    I'd focus more on understanding the issues in depth before jumping to a solution. Otherwise, you would be adding hassle with some - bluntly speaking - opinionated and inflexible boilerplate code which not many people will like using.

    You mention some issues: code that is non-obvious to understand, and hard to execute and replicate. Bad code which does not follow engineering best practices (ideas from SOLID etc.) does not get better if you force the author to introduce certain classes. You can suggest some basics (e.g. a common code formatter, meaningful variable names, short functions, no hard-coded values, ...), but I'm afraid you cannot educate non-engineers in a single-day workshop. I would not focus on that at first. However, there is no excuse for writing bad code and then expecting others to fix it. As you say, data engineering is part of data science skills; you are "junior" if you cannot write reproducible code.

    Being hard to execute and replicate is theoretically easy to fix. Force everyone to (at least hypothetically) submit their code into a testing environment where it will be automatically executed on a fresh machine. This means that, first, they have to specify exactly which libraries need to be installed. Second, they need to externalize all configuration - in particular, data input and data output paths. Not a single value should be hard-coded! And finally, they need a *single* command which can be run to execute the whole(!) pipeline. If they fail on any of these parts... they should try again. Work that does not pass this test is considered unfinished by the author. Basically, you are introducing an automated, infallible test.

    Regarding your code, I'd really not try that direction. Even these few lines already look unclear and over-engineered. The csv format is hard-coded into the code; if it changes to parquet, you'd have to touch the code. The processing object has fixed data paths, for which there is no reason in a job that should take care of pure processing. Exporting data is also not something a processing job should handle. And what if you have multiple input and output datasets? You would not have any of these issues if you had kept to the most simple solution: a function `process(data1, data2, ...) -> result_data` where dataframes are passed in and out. It would also mean zero additional libraries or boilerplate.

    I highly doubt that a function `main_pipe(...)` will fix the malpractices some people may fall into. There are two small features which are useful beyond a plain function, though: automatically generating a visual DAG from the code, and quickly checking whether input requirements are satisfied before heavy code is run. You can still put any mature DAG library on top, which probably already includes experience from a lot of developers - no need to rewrite that. I'm not sure which one is best (metaflow, luigi, airflow, ... https://github.com/pditommaso/awesome-pipeline - no idea), but many come with a lot of features. If you want a bit more scaffolding to more easily understand foreign projects, you could look at https://github.com/quantumblacklabs/kedro, but maybe that's already too much. Fix the "single command replication-from-scratch" requirement first.
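    A minimal sketch of the plain-function style advocated in that comment (the function, column, and file names are made up for illustration):

      # Pure processing: no paths, no file formats, no export logic inside.
      import pandas as pd

      def process(sales: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
          merged = sales.merge(customers, on="customer_id")
          return merged.groupby("region", as_index=False)["amount"].sum()

      # I/O stays at the edges, so swapping csv for parquet touches only this part:
      if __name__ == "__main__":
          sales = pd.read_csv("sales.csv")
          customers = pd.read_csv("customers.csv")
          process(sales, customers).to_parquet("result.parquet")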
  • Experiences with workflow managers implemented in Haskell (funflow, porcupine, bioshake, ?)
    5 projects | /r/haskell | 3 Oct 2021
    There are a billion of them out there (https://github.com/pditommaso/awesome-pipeline), so deciding which one to choose is not exactly easy. Most of my colleagues rely on Nextflow and Snakemake, so I should consider these, but before I start to learn an entirely new language I wanted to explore the Haskell ecosystem for possible solutions. Strong typing should in theory be a perfect match for a pipeline manager, and having this in Haskell would simplify replacing some of my R code with Haskell eventually.
  • what do you think about airflow?
    2 projects | /r/dataengineering | 2 Oct 2021
    I found this list of other "awesome pipelines" https://github.com/pditommaso/awesome-pipeline
  • Workflow Orchestration
    1 project | /r/ML_Eng | 17 Sep 2021
  • Your impression of {targets}? (r package)
    3 projects | /r/Rlanguage | 2 May 2021
    Have been trying to find the right pipeline tool to manage my R workflows. If it's more complicated than building a view in SQL, currently I develop a package, write a simple "mypackage::do_the_thing()" script, and schedule the script with taskscheduleR (the add-in for RStudio is nice). Side note: I am running Windows 10.

What are some alternatives?

When comparing hamilton and awesome-pipeline you can also consider the following projects:

dagster - An orchestration platform for the development, production, and observation of data assets.

Kedro - Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

tree-of-thought-llm - [NeurIPS 2023] Tree of Thoughts: Deliberate Problem Solving with Large Language Models

targets - Function-oriented Make-like declarative workflows for R

haystack - LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

snowpark-python - Snowflake Snowpark Python API

astro-sdk - Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.

aipl - Array-Inspired Pipeline Language

bionix - Functional highly reproducible bioinformatics pipelines

vscode-reactive-jupyter - A simple Reactive Python Extension for Visual Studio Code

funflow - Functional workflows