awesome-pipeline

A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin (by pditommaso)

Awesome-pipeline Alternatives

Similar projects and alternatives to awesome-pipeline

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number indicates a more frequently mentioned awesome-pipeline alternative or a closer match.

awesome-pipeline reviews and mentions

Posts with mentions or reviews of awesome-pipeline. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-01.
  • Orchestration: Thoughts on Dagster, Airflow and Prefect?
    3 projects | /r/dataengineering | 1 Jun 2023
    There are a truly huge number of options in this space, see for example https://github.com/pditommaso/awesome-pipeline Many of them are very niche / half-baked / abandonware.
  • Launch HN: DAGWorks – ML platform for data science teams
    7 projects | news.ycombinator.com | 7 Mar 2023
    As a long-time fan of DAG-oriented tools, congrats on the launch. Maybe you can get added here https://github.com/pditommaso/awesome-pipeline now or in the future...

    This is a problem space I've worked in and been thinking about for a very, very long time. I've extensively used Airflow (bad), DBT (good-ish), Luigi (good), drake (abandoned), tested many more, and written two of my own.

    It's important to remember that DAG tools exist to solve two primary problems that arise from one underlying cause. Those problems are 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is: data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently and avoid re-computing things.
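
    For illustration, here is a minimal framework-free sketch of those two properties (step names and output files are hypothetical): ordering and parallelism are derived from the declared dependencies, and resumption works by skipping steps whose outputs already exist.

        import os
        from concurrent.futures import ThreadPoolExecutor

        # Hypothetical pipeline: each step declares its dependencies and its output file.
        STEPS = {
            "extract":  {"deps": [],          "out": "raw.csv"},
            "clean":    {"deps": ["extract"], "out": "clean.csv"},
            "features": {"deps": ["clean"],   "out": "features.csv"},
            "report":   {"deps": ["clean"],   "out": "report.html"},
        }

        def run_step(name):
            step = STEPS[name]
            if os.path.exists(step["out"]):   # resumption: skip already-finished work
                return
            open(step["out"], "w").close()    # stand-in for the real computation

        def ready(done):
            # Steps whose dependencies are all satisfied and that have not run yet.
            return [n for n, s in STEPS.items()
                    if n not in done and all(d in done for d in s["deps"])]

        done = set()
        while len(done) < len(STEPS):
            batch = ready(done)                 # independent steps...
            with ThreadPoolExecutor() as pool:  # ...run in parallel
                list(pool.map(run_step, batch))
            done.update(batch)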

    Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.

    A couple more things...

    It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on there, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (glances at 380-model DBT project...), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).
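
    For illustration, a minimal sketch of the two styles side by side (task names are hypothetical; exact Luigi and Airflow signatures vary a little between versions):

        # "Local" style (Luigi, DBT): each task declares its upstream dependency
        # right next to its own definition, via requires().
        import luigi

        class Extract(luigi.Task):
            def run(self):
                ...

        class Clean(luigi.Task):
            def requires(self):   # dependency declared locally, on the task itself
                return Extract()
            def run(self):
                ...

        # "Central" style (Airflow): tasks are defined first, then wired together
        # in one place on the DAG object.
        from datetime import datetime
        from airflow import DAG
        from airflow.operators.bash import BashOperator

        with DAG(dag_id="example", start_date=datetime(2023, 1, 1)) as dag:
            extract = BashOperator(task_id="extract", bash_command="echo extract")
            clean = BashOperator(task_id="clean", bash_command="echo clean")
            extract >> clean      # dependencies collected centrally, in the DAG file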

    I don't see any mention of development workflow. As a few examples, DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, etc etc. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.
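
    For illustration, a rough sketch of what running an arbitrary subgraph looks like in practice (module and task names are hypothetical; the Luigi CLI invocation and dbt selection flags in the comments follow their documented syntax):

        import luigi

        class BuildFeatures(luigi.Task):   # hypothetical stand-in task
            def run(self):
                ...

        if __name__ == "__main__":
            # Any task can be made the terminal task; Luigi resolves and runs
            # everything it requires() first.
            luigi.build([BuildFeatures()], local_scheduler=True)

        # The same thing via Luigi's auto-generated CLI:
        #   python -m luigi --module my_pipeline BuildFeatures --local-scheduler
        #
        # dbt's model selection covers similar subgraph runs, e.g.:
        #   dbt run --select my_model        # just this model
        #   dbt run --select +my_model       # the model plus all its ancestors
        #   dbt run --select my_model+       # the model plus all its descendants
        #   dbt run --select tag:nightly     # every model carrying a tag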

    Another thing I notice is that it seems like your model is oriented around flowing data through the program, as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere, e.g. a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi tracks task state through Targets, which typically live in the World State. Again, there's no right or wrong here, only tradeoffs, but leaning on durable records of task state directly facilitates "resumption from partial failure" as mentioned above.
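
    For illustration, a minimal sketch of task state living in the World State, Luigi-style (paths are hypothetical): the Target both receives the output and serves as the durable record that the task finished, which is what makes resumption after a partial failure cheap.

        import luigi

        class RawData(luigi.ExternalTask):        # produced elsewhere, e.g. a SQL export
            def output(self):
                return luigi.LocalTarget("data/raw.csv")

        class CleanData(luigi.Task):
            def requires(self):
                return RawData()

            def output(self):                     # durable completion record on disk
                return luigi.LocalTarget("data/clean.csv")

            def run(self):
                with self.input().open("r") as src, self.output().open("w") as dst:
                    for line in src:
                        dst.write(line.strip().lower() + "\n")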

    Best of luck.

  • Any good resources for R for bioinformatics
    2 projects | /r/labrats | 17 Nov 2022
  • Alternatives to nextflow?
    6 projects | /r/bioinformatics | 26 Oct 2022
    Hi everyone. So I've been using nextflow for about a month or so, having developed a few pipelines, and I've found the debugging experience absolutely abysmal. Although nextflow has great observability with tower, and great community support with nf-core, the uninformative error messages are souring the experience for me. There are soooo many pipeline frameworks out there, but I'm wondering if anyone has come across one similar to nextflow in offering observability, a strong community behind it, multiple executors (container-image based, preferably) and an awesome debugging experience? I would favor a python based approach, but I'm not sure snakemake is the one I'm looking for.
  • A General Workflow Engine
    1 project | /r/developersIndia | 9 Nov 2021
    My answer is more from a design/product point of view. If you mean code-execution workflow management, then there are a bunch of packages listed in this awesome list.
  • [Discussion] Applied machine learning implementation debate. Is OOP approach towards data preprocessing in python an overkill?
    3 projects | /r/MachineLearning | 3 Nov 2021
    I'd focus more on understanding the issues in depth before jumping to a solution. Otherwise, you would be adding hassle with some - bluntly speaking - opinionated and inflexible boilerplate code which not many people will like using. You mention some issues: code that is non-obvious to understand, and code that is hard to execute and replicate. Bad code which does not follow engineering best practices (ideas from SOLID etc.) does not get better if you force the author to introduce certain classes. You can suggest some basics (e.g. a common code formatter, meaningful variable names, short functions, no hard-coded values, ...), but I'm afraid you cannot educate non-engineers in a single-day workshop. I would not focus on that at first.

    However, there is no excuse for writing bad code and then expecting others to fix it. As you say, data engineering is part of data science skills; you are "junior" if you cannot write reproducible code. Being hard to execute and replicate is theoretically easy to fix. Force everyone to (at least hypothetically) submit their code into a testing environment where it will be automatically executed on a fresh machine. This means that, first, they have to specify exactly which libraries need to be installed. Second, they need to externalize all configuration - in particular, data input and data output paths. Not a single value should be hard-coded in code! And finally, they need a *single* command which can be run to execute the whole(!) pipeline. If they fail on any of these parts... they should try again. Work that does not pass this test is considered unfinished by the author. Basically you are introducing an automated, infallible test.

    Regarding your code, I'd really not go in that direction. In particular, even these few lines already look unclear and over-engineered. The csv format is hard-coded into the code; if it changes to parquet, you'd have to touch the code. The processing object has data paths fixed, for which there is no reason in a job that should take care of pure processing. Exporting data is also not something that a processing job should handle. And what if you have multiple input and output datasets? You would not have any of these issues if you had kept to the most simple solution: a function `process(data1, data2, ...) -> result_data` where dataframes are passed in and out (see the sketch below). It would also mean zero additional libraries or boilerplate. I highly doubt that a function `main_pipe(...)` will fix the malpractices some people may commit.

    There are two small features which are useful beyond a plain function, though: automatically generating a visual DAG from the code, and quickly checking whether input requirements are satisfied before heavy code is run. You can still put any mature DAG library on top, which probably already embodies the experience of a lot of developers. No need to rewrite that. I'm not sure which one is best (metaflow, luigi, airflow, ... https://github.com/pditommaso/awesome-pipeline - no idea), but many come with a lot of features. If you want a bit more scaffolding to make unfamiliar projects easier to understand, you could look at https://github.com/quantumblacklabs/kedro, but maybe that's already too much. Fix the "single command replication-from-scratch" requirement first.
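
    For illustration, a minimal sketch of the plain-function approach described above, with configuration externalized and a single entry-point command (file, column, and argument names are hypothetical):

        import argparse
        import pandas as pd

        def process(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
            """Pure processing: dataframes in, dataframe out; no paths, no I/O."""
            merged = orders.merge(customers, on="customer_id", how="left")
            return merged.groupby("country", as_index=False)["amount"].sum()

        def main():
            # All configuration (here: input and output paths) lives outside the
            # processing code, so the whole pipeline runs from a single command:
            #   python pipeline.py --orders orders.csv --customers customers.csv --out result.csv
            parser = argparse.ArgumentParser()
            parser.add_argument("--orders", required=True)
            parser.add_argument("--customers", required=True)
            parser.add_argument("--out", required=True)
            args = parser.parse_args()

            result = process(pd.read_csv(args.orders), pd.read_csv(args.customers))
            result.to_csv(args.out, index=False)

        if __name__ == "__main__":
            main()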
  • Experiences with workflow managers implemented in Haskell (funflow, porcupine, bioshake, ?)
    5 projects | /r/haskell | 3 Oct 2021
    There are a billion of them out there (https://github.com/pditommaso/awesome-pipeline), so the decision which one to choose is not exactly easy. Most of my colleagues rely on Nextflow and Snakemake, so I should consider these, but before I start to learn an entirely new language I wanted to explore the Haskell ecosystem for possible solutions. Strong typing should in theory be a perfect match for a pipeline manager. And having this in Haskell would simplify replacing some of my R code with Haskell eventually.
  • what do you think about airflow?
    2 projects | /r/dataengineering | 2 Oct 2021
    I found this list of other "awesome pipelines" https://github.com/pditommaso/awesome-pipeline
  • Workflow Orchestration
    1 project | /r/ML_Eng | 17 Sep 2021
  • Your impression of {targets}? (r package)
    3 projects | /r/Rlanguage | 2 May 2021
    Have been trying to find the right pipeline tool to manage my R workflows. If it's more complicated than building a view in SQL, currently I develop a package, write a simple "mypackage::do_the_thing()" script, and schedule the script w/ taskscheduleR (the add-in for RStudio is nice). Side note, I am running Windows 10.