hamilton VS Dask

Compare hamilton and Dask to see how they differ.

hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage and metadata. Runs and scales everywhere Python does. (by DAGWorks-Inc)
               hamilton                      Dask
Mentions       19                            32
Stars          1,272                         11,906
Growth         9.7%                          1.6%
Activity       9.8                           9.7
Last commit    3 days ago                    6 days ago
Language       Jupyter Notebook              Python
License        BSD 3-clause Clear License    BSD 3-clause "New" or "Revised" License
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

hamilton

Posts with mentions or reviews of hamilton. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-03-01.
  • FastUI: Build Better UIs Faster
    12 projects | news.ycombinator.com | 1 Mar 2024
    We built an app with it -- https://blog.dagworks.io/p/building-a-lightweight-experiment. You can see the code here https://github.com/DAGWorks-Inc/hamilton/blob/main/hamilton/....

    We've usually prototyped with Streamlit, but at times found it clunky. FastUI still has rough edges, but we made it work for our lightweight app.

  • Facebook Prophet: library for generating forecasts from any time series data
    7 projects | news.ycombinator.com | 26 Sep 2023
    This library is old news? Is there anything new that they've added that's noteworthy to take it for another spin?

    [disclaimer I'm a maintainer of Hamilton] Otherwise FYI Prophet gels well with https://github.com/DAGWorks-Inc/hamilton for setting up your features and dataset for fitting & prediction[/disclaimer].

  • Langchain Is Pointless
    16 projects | news.ycombinator.com | 8 Jul 2023
    I had been hearing these pains from Langchain users for quite a while. Suffice to say I think:

    1. too many layers of OO abstractions are a liability in production contexts. I'm biased, but a more functional approach is a better way to model what's going on. It's easier to test, wrap a function with concerns, and therefore reason about.

    2. as fast as the field is moving, the layers of abstractions actually hurt your ability to customize without really diving into the details of the framework, or requiring you to step outside it -- in which case, why use it?

    Otherwise I definitely love the small amount of code you need to write to get an LLM application up with Langchain. However you read code more often than you write it, in which case this brevity is a trade-off. Would you prefer to reduce your time debugging a production outage? or building the application? There's no right answer, other than "it depends".

    To that end - we've come up with a post showing how one might use Hamilton (https://github.com/dagWorks-Inc/hamilton) to easily create a workflow to ingest data into a vector database that I think has a great production story. https://open.substack.com/pub/dagworks/p/building-a-maintain...

    Note: Hamilton can cover your MLOps as well as LLMOps needs; you'll invariably be connecting LLM applications with traditional data/ML pipelines because LLMs don't solve everything -- but that's a post for another day.

    16 projects | news.ycombinator.com | 8 Jul 2023
    Totally! As a person driving a project like https://github.com/DAGWorks-Inc/hamilton I couldn't agree more!
  • IPyflow: Reactive Python Notebooks in Jupyter(Lab)
    5 projects | news.ycombinator.com | 10 May 2023
    From a nuts and bolts perspective, I've been thinking of building some reactivity on top of https://github.com/dagworks-inc/hamilton (author here) that could get at this. (If you have a use case that could be documented, I'd appreciate it.)
  • Needs advice for choosing tools for my team. We use AWS.
    2 projects | /r/mlops | 25 Mar 2023
    Otherwise, I'm biased here, but check out https://github.com/dagworks-inc/hamilton - it could be your universal layer that expresses how things should flow, that is orchestration system agnostic, which would make it easy to migrate between systems easily.
  • Launch HN: DAGWorks – ML platform for data science teams
    7 projects | news.ycombinator.com | 7 Mar 2023
    Hey HN! We’re Stefan and Elijah, co-founders of DAGWorks (https://www.dagworks.io). We’re on a mission to eliminate the insane inefficiency of building and maintaining ML pipelines in production.

    DAGWorks is based on Hamilton, an open-source project that we created and recently forked (https://github.com/dagworks-inc/hamilton). Hamilton is a set of high-level conventions for Python functions that can be automatically converted into working ETL pipelines. To that, we're adding a closed-source offering that goes a step further, plugging these functions into a wide array of production ML stacks.

    ML pipelines consist of computational steps (code + data) that produce a working statistical model that a business can use. A typical pipeline might be (1) pull raw data (Extract), (2) transform that data into inputs for the model (Transform), (3) define a statistical model (Transform), (4) use that statistical model to predict on another data set (Transform) and (5) push that data for downstream use (Load). Instead of “pipeline” you might hear people call this “workflow”, “ETL” (Extract-Transform-Load), and so on.

    Maintaining these things in production is insanely inefficient because you need both data scientists and software engineers to do it. Data scientists know the models and data, but most can't write the code needed to get things working in production infrastructure. For example, a lot of mid-size companies out there use Snowflake to store data, Pandas/Spark to transform it, and something like Databricks' MLflow to handle model serving. Engineers can handle the latter, but mostly aren't experts in the ML stuff. It's a classic impedance mismatch, with all the horror stories you'd expect: when data scientists make a change, engineers (or data scientists who aren’t engineers) have to manually propagate the change in production. We've talked to teams who are spending as much as 50% of their time doing this. That's not just expensive, it's grunt work; those engineers should be working on something else! Basically, maintaining ML pipelines over time sucks for most teams.

    One way out is to hire people who combine both skills, i.e. data scientists who can also write production code. But these are rare and expensive, and in our experience they usually are only expert at one side of the equation and not as good at the other.

    The other way is to build your own platform to automatically integrate models + data into your production stack. That way the data scientists can maintain their own work without needing to hand things off to engineers. However, most companies can't afford to make this investment, and even for the ones that can, such in-house layers tend to end up in spaghetti code and tech debt hell, because they're not the company's core product.

    Elijah and I have been building data and ML tooling for the last 7 years, most recently at Stitch Fix, where we built an ML platform that served over 100 data scientists from various modeling disciplines (some of our blog posts, like [1], hit the front page of HN - thanks!). We saw first hand the issues teams encountered with ML pipelines.

    Most companies running ML in production need a ratio of 1:1 or 1:2 data scientists to engineers. At bigger companies like Stitch Fix, the ratio is more like 1:10—way more efficient—because they can afford to build the kind of platform described above. With DAGWorks, we want to bring the power of an intuitive ML Pipeline platform to all data science teams, so a ratio of 1:1 is no longer required. A junior data scientist should be able to easily and safely write production code without deep knowledge of underlying infrastructure.

    We decided to build our startup around Hamilton, in large part due to the reception that it got here [2] - thanks HN! We came up with Hamilton while we were at Stitch Fix (note: if you start an open-source project at an employer, we recommend forking it right away when you start a company. We only just did that and left behind ~900 stars...). We are betting on it being our abstraction layer to enable our vision of how to go about building and maintaining ML pipelines, given what we learned at Stitch Fix. We believe a solution has to have an open source component to be successful (we invite you to check out the code). In terms of why the name DAGWorks? We named the company after Directed Acyclic Graphs because we think the DAG representation, which Hamilton also provides, is key.

    A quick primer on Hamilton. With Hamilton we use a new paradigm in Python (well not quite “new” as pytest fixtures use this approach) for defining model pipelines. Users write declarative functions instead of writing procedural code. For example, rather than writing the following pandas code:

      df['col_c'] = df['col_a'] + df['col_b']
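
    A minimal sketch of what a declarative equivalent might look like, assuming the convention described elsewhere in the thread: the function name is the output and the parameter names identify the upstream inputs. The sample data is hypothetical, and the function is called directly here rather than through Hamilton itself.

```python
import pandas as pd


# Declarative style (illustrative sketch): the function is named after the
# column it produces, and its parameters are named after the columns it needs.
def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    """Element-wise sum of col_a and col_b."""
    return col_a + col_b


# Hypothetical sample data. The inputs are passed by hand here; a framework
# would match them to the parameters by name.
df = pd.DataFrame({"col_a": [1, 2, 3], "col_b": [10, 20, 30]})
result = col_c(df["col_a"], df["col_b"])
```

    Because the transform is an ordinary function, it can be unit tested with small Series inputs and no pipeline machinery.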
    7 projects | news.ycombinator.com | 7 Mar 2023
    will just give out the specific ones you want. In the latter case, it's a little trickier but doable -- we were just going over this with a user recently actually! https://github.com/DAGWorks-Inc/hamilton/issues/90
    7 projects | news.ycombinator.com | 7 Mar 2023
    Yeah! So we actually have an integration with polars. See https://github.com/DAGWorks-Inc/hamilton/blob/5c8e564d19ff23....

    To be clear, the specific paradigm we're referring to is this way of writing transforms as functions where the parameter name is the upstream dependency -- not the notion of delayed execution.
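
    The idea that a parameter name is the upstream dependency can be sketched with the standard library alone. This is not Hamilton's implementation or API: `resolve`, `FUNCTIONS`, and the sample functions are hypothetical names used only to show how a graph can be derived from function signatures.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def col_a() -> list:
    return [1, 2, 3]


def col_b() -> list:
    return [10, 20, 30]


def col_c(col_a: list, col_b: list) -> list:
    # The parameter names col_a and col_b name the upstream nodes.
    return [a + b for a, b in zip(col_a, col_b)]


def resolve(
    name: str,
    functions: Dict[str, Callable[..., Any]],
    cache: Optional[Dict[str, Any]] = None,
) -> Any:
    """Compute node `name`, first resolving each parameter as a node of the same name."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = functions[name]
        params = inspect.signature(fn).parameters
        kwargs = {p: resolve(p, functions, cache) for p in params}
        cache[name] = fn(**kwargs)
    return cache[name]


# Hypothetical registry: node name -> function that computes it.
FUNCTIONS = {f.__name__: f for f in (col_a, col_b, col_c)}
result = resolve("col_c", FUNCTIONS)
```

    Only the nodes that `col_c` depends on are computed, and each is computed once because results are cached by name. The sketch assumes the dependency graph has no cycles.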

    I think there are two different concepts here though:

    1. How the transforms are executed


Dask

Posts with mentions or reviews of Dask. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-15.

What are some alternatives?

When comparing hamilton and Dask you can also consider the following projects:

Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Numba - NumPy aware dynamic Python compiler using LLVM

Kedro - Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

NetworkX - Network Analysis in Python

Pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Interactive Parallel Computing with IPython - IPython Parallel: Interactive Parallel Computing in Python

statsmodels - Statsmodels: statistical modeling and econometrics in Python

PyMC - Bayesian Modeling and Probabilistic Programming in Python

blaze - NumPy and Pandas interface to Big Data

Ray - Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

orange - 🍊 📊 💡 Orange: Interactive data analysis