awesome-pipeline VS nextflow

Compare awesome-pipeline vs nextflow and see what their differences are.

                awesome-pipeline    nextflow
Mentions        10                  9
Stars           5,913               2,544
Growth          -                   0.9%
Activity        5.6                 9.7
Last commit     7 days ago          about 11 hours ago
Language        -                   Groovy
License         -                   Apache License 2.0
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

awesome-pipeline

Posts with mentions or reviews of awesome-pipeline. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-01.
  • Orchestration: Thoughts on Dagster, Airflow and Prefect?
    3 projects | /r/dataengineering | 1 Jun 2023
    There are a truly huge number of options in this space, see for example https://github.com/pditommaso/awesome-pipeline Many of them are very niche / half-baked / abandonware.
  • Launch HN: DAGWorks – ML platform for data science teams
    7 projects | news.ycombinator.com | 7 Mar 2023
    As a long-time fan of DAG-oriented tools, congrats on the launch. Maybe you can get added here https://github.com/pditommaso/awesome-pipeline now or in the future...

    This is a problem space I've worked in and been thinking about for a very, very long time. I've extensively used Airflow (bad), DBT (good-ish), Luigi (good), drake (abandoned), tested many more, and written two of my own.

    It's important to remember that DAG tools exist to solve two primary problems, that arise from one underlying cause. Those problems are 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is: data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently, and avoid re-computing things.
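
    For illustration, here is a minimal sketch of those two features - declarative ordering/parallelism and resumption via durable completion markers. The task graph and the marker-file scheme are made up; this is not any particular framework's API.

    ```python
    # Minimal sketch: execution order falls out of declared dependencies,
    # and "done" marker files give cheap resumption after partial failure.
    # (Hypothetical task graph; not any real framework's API.)
    import os
    from concurrent.futures import ThreadPoolExecutor

    TASKS = {  # name -> (dependencies, action)
        "extract": ([], lambda: print("extracting")),
        "clean":   (["extract"], lambda: print("cleaning")),
        "report":  (["clean"], lambda: print("reporting")),
    }

    def run(name, state_dir="state"):
        os.makedirs(state_dir, exist_ok=True)
        marker = os.path.join(state_dir, name + ".done")
        if os.path.exists(marker):        # resume: completed work is skipped
            return
        deps, action = TASKS[name]
        if deps:
            # Independent dependencies run in parallel -- the schedule is
            # derived from the graph, never written out by hand.
            with ThreadPoolExecutor() as pool:
                list(pool.map(run, deps))
        action()
        open(marker, "w").close()         # durable record of completion

    run("report")
    ```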

    Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.

    A couple more things...

    It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on there, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (glances at 380-model DBT project...), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).
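
    To make the two styles concrete, a toy sketch (the Task class here is hypothetical, not DBT's, Luigi's, or Airflow's actual API):

    ```python
    class Task:
        def __init__(self, name, deps=()):
            self.name, self.deps = name, list(deps)

    # "Local" style (DBT, Luigi): each task names its upstreams where it
    # is defined, so the full graph ends up scattered across the codebase.
    extract = Task("extract")
    clean   = Task("clean", deps=[extract])
    report  = Task("report", deps=[clean])

    # "Central" style (Airflow): bare tasks, with the graph assembled in
    # one place, keeping the high-level flow readable in a single file.
    e, c, r = Task("extract"), Task("clean"), Task("report")
    for up, down in [(e, c), (c, r)]:
        down.deps.append(up)
    ```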

    I don't see any mention of development workflow. As a few examples, DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, etc etc. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.
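
    The kind of subgraph selection that makes iteration pleasant can be sketched in a few lines - roughly what DBT's model selection and Luigi's pick-any-terminal-task CLI give you (the DEPS graph here is invented):

    ```python
    # Pick any task as the terminal one and run only it plus its
    # ancestors, instead of the whole pipeline.
    DEPS = {"extract": [], "clean": ["extract"], "report": ["clean"]}

    def ancestors(task):
        out = set()
        for dep in DEPS[task]:
            out |= {dep} | ancestors(dep)
        return out

    to_run = sorted(ancestors("clean") | {"clean"})
    print(to_run)  # ['clean', 'extract']
    ```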

    Another thing I notice is that it seems like your model is oriented around flowing data through the program, as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere, e.g. a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi tracks task state through Targets, which typically live in the World State (sketched below). Again, there's no right or wrong here, only tradeoffs, but leaning on durable records of task state directly facilitates "resumption from partial failure" as mentioned above.
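
    As a concrete example of World-State task state, a minimal Luigi pair of tasks (Luigi's Task/LocalTarget API is real; the file paths are made up): a task counts as complete exactly when its output Target exists, which is what makes resumption cheap.

    ```python
    import luigi

    class Extract(luigi.Task):
        def output(self):
            # The Target lives in the World State; its existence IS the
            # durable record that this task completed.
            return luigi.LocalTarget("data/raw.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("id,value\n1,42\n")

    class Clean(luigi.Task):
        def requires(self):
            return Extract()

        def output(self):
            return luigi.LocalTarget("data/clean.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read())

    if __name__ == "__main__":
        # On rerun, any task whose Target already exists is simply skipped.
        luigi.build([Clean()], local_scheduler=True)
    ```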

    Best of luck.

  • Any good resources for R for bioinformatics
    2 projects | /r/labrats | 17 Nov 2022
  • Alternatives to nextflow?
    6 projects | /r/bioinformatics | 26 Oct 2022
    Hi everyone. So I've been using nextflow for about a month or so, having developed a few pipelines, and I've found the debugging experience absolutely abysmal. Although nextflow has great observability with Tower and great community support with nf-core, the uninformative error messages are souring the experience for me. There are so many pipeline frameworks out there, but I'm wondering if anyone has come across one similar to nextflow in offering observability, a strong community behind it, multiple executors (container-image based, preferably), and an awesome debugging experience? I would favor a python-based approach, but I'm not sure snakemake is the one I'm looking for.
  • A General Workflow Engine
    1 project | /r/developersIndia | 9 Nov 2021
    My answer is more from design/product point of view. If you mean code execution workflow management, then there are a bunch of packages listed in this awesome list.
  • [Discussion] Applied machine learning implementation debate. Is OOP approach towards data preprocessing in python an overkill?
    3 projects | /r/MachineLearning | 3 Nov 2021
    I'd focus more on understanding the issues in depth before jumping to a solution. Otherwise you would be adding hassle with some - bluntly speaking - opinionated and inflexible boilerplate code which not many people will like using.

    You mention some issues: code that is non-obvious to understand, and code that is hard to execute and replicate. Bad code which does not follow engineering best practices (ideas from SOLID etc.) does not get better if you force the author to introduce certain classes. You can suggest some basics (e.g. a common code formatter, meaningful variable names, short functions, no hard-coded values, ...), but I'm afraid you cannot educate non-engineers in a single-day workshop. I would not focus on that at first. However, there is no excuse for writing bad code and then expecting others to fix it. As you say, data engineering is part of data science skills; you are "junior" if you cannot write reproducible code.

    Being hard to execute and replicate is theoretically easy to fix. Force everyone to (at least hypothetically) submit their code into a testing environment where it will be automatically executed on a fresh machine. This means that, first, they have to specify exactly which libraries need to be installed. Second, they need to externalize all configuration - in particular data input and data output paths. Not a single value should be hard-coded! And finally, they need a *single* command which can be run to execute the whole(!) pipeline. If they fail on any of these parts... they should try again. Work that does not pass this test is considered unfinished by the author. Basically you are introducing an automated, infallible test.

    Regarding your code, I'd really not try that direction. Even these few lines already look unclear and over-engineered. The csv format is hard-coded; if it changes to parquet you'd have to touch the code. The processing object has data paths fixed, for which there is no reason in a job that should take care of pure processing. Exporting data is also not something a processing job should handle. And what if you have multiple input and output datasets? You would have none of these issues if you had kept to the simplest solution: a function `process(data1, data2, ...) -> result_data` where dataframes are passed in and out (a sketch follows below). It would also mean zero additional libraries or boilerplate. I highly doubt that a function `main_pipe(...)` will fix the malpractices some people may commit.

    There are two small features which are useful beyond a plain function, though: automatically generating a visual DAG from the code, and quickly checking whether input requirements are satisfied before heavy code is run. You can still put any mature DAG library on top, which probably already incorporates experience from a lot of developers; no need to rewrite that. I'm not sure which one is best (metaflow, luigi, airflow, ... https://github.com/pditommaso/awesome-pipeline - no idea), but many come with a lot of features. If you want a bit more scaffolding to make foreign projects easier to understand, you could look at https://github.com/quantumblacklabs/kedro, but maybe that's already too much. Fix the "single command replication-from-scratch" requirement first.
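
    A minimal sketch of that recommendation - a pure process() function, externalized paths, and a single replay command. All names and environment variables here are illustrative, not prescriptive.

    ```python
    import os
    import pandas as pd

    def process(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
        # Pure transformation: no I/O, no hard-coded paths, trivially testable.
        return orders.merge(customers, on="customer_id")

    def main():
        # All paths come from configuration, never from the code itself.
        orders    = pd.read_csv(os.environ["ORDERS_PATH"])
        customers = pd.read_csv(os.environ["CUSTOMERS_PATH"])
        process(orders, customers).to_csv(os.environ["RESULT_PATH"], index=False)

    if __name__ == "__main__":
        main()  # the single command that replays the whole pipeline
    ```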
  • Experiences with workflow managers implemented in Haskell (funflow, porcupine, bioshake, ?)
    5 projects | /r/haskell | 3 Oct 2021
    There are a billion of them out there (https://github.com/pditommaso/awesome-pipeline), so the decision of which one to choose is not exactly easy. Most of my colleagues rely on Nextflow and Snakemake, so I should consider those, but before I start to learn an entirely new language I wanted to explore the Haskell ecosystem for possible solutions. Strong typing should, in theory, be a perfect match for a pipeline manager. And having this in Haskell would simplify replacing some of my R code with Haskell eventually.
  • what do you think about airflow?
    2 projects | /r/dataengineering | 2 Oct 2021
    I found this list of other "awesome pipelines" https://github.com/pditommaso/awesome-pipeline
  • Workflow Orchestration
    1 project | /r/ML_Eng | 17 Sep 2021
  • Your impression of {targets}? (r package)
    3 projects | /r/Rlanguage | 2 May 2021
    Have been trying to find the right pipeline tool to manage my R workflows. If it's more complicated than building a view in SQL, currently I develop a package, write a simple "mypackage::do_the_thing()" script, and schedule the script w/ taskscheduleR (the add-in for Rstudio is nice). Side note, I am running windows 10.

nextflow

Posts with mentions or reviews of nextflow. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-08-10.
  • Nextflow: Data-Driven Computational Pipelines
    9 projects | news.ycombinator.com | 10 Aug 2023
    > It's been a while since you can rerun/resume Nextflow pipelines

    Yes, you can resume, but you need your whole upstream DAG to be present. Snakemake can rerun a job when only that job's direct dependencies are present, which lets you manage disk usage neatly, or archive an intermediate state of a project and rerun things from there.
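
    Roughly, the semantics being described are make-style: a job is rerunnable whenever its direct inputs exist on disk, regardless of whether anything further upstream still does. A sketch of that check (paths invented, not Snakemake's implementation):

    ```python
    import os

    def needs_rerun(output, inputs):
        # Only the *direct* inputs must exist; upstream intermediates can
        # be deleted or archived without blocking a rerun of this job.
        missing = [p for p in inputs if not os.path.exists(p)]
        if missing:
            raise FileNotFoundError(f"direct inputs required: {missing}")
        if not os.path.exists(output):
            return True
        return any(os.path.getmtime(p) > os.path.getmtime(output) for p in inputs)

    if __name__ == "__main__":
        os.makedirs("results", exist_ok=True)
        open("results/table.csv", "a").close()
        print(needs_rerun("results/plot.png", ["results/table.csv"]))  # True
    ```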

    > and yes, you can have dry runs in Nextflow

    You have stubs, which really isn't the same thing.

    > I have no idea what you're referring to with the 'arbitrary limit of 1000 parallel jobs' though

    I was referring to this issue: https://github.com/nextflow-io/nextflow/issues/1871. Except the discussion doesn't do the issue full justice. Nextflow spawns each job in a separate thread, and when it tries to spawn 1000+ Condor jobs it dies with a cryptic error message. The -Dnxf.pool.type=sync and -Dnxf.pool.maxThreads=N options prevent resuming and rerunning the pipeline.

    > As for deleting temporary files, there are features that allow you to do a few things related to that, and other features being implemented.

    There are some hacks for this - but nothing I would feel safe integrating into a production tool. They are implementing something - you're right - and that's been the case for several years now, so we'll see.

    Snakemake has all that out of the box.

  • Alternatives to nextflow?
    6 projects | /r/bioinformatics | 26 Oct 2022
    For now, I think that the best place to track this / get your voice heard is this GitHub Discussions post (which covers many things - error reporting is one of them). https://github.com/nextflow-io/nextflow/discussions/3107
  • HyperQueue: ergonomic HPC task executor written in Rust
    4 projects | /r/rust | 12 Oct 2022
  • Nextflow vs Snakemake
    2 projects | /r/bioinformatics | 29 Jul 2022
    We could spend the day pointing to things we wish were different, but that doesn't change the fact that Nextflow is the leader when it comes to workflow orchestration. And feel free to create a new issue in the GitHub repository if you wish to request a feature :)
  • Feel very hard writing nextflow pipeline.
    2 projects | /r/bioinformatics | 11 May 2022
    The nextflow devs have been talking about this for a while on GitHub. Looks like they're implementing something along these lines using schemas, like they do for nf-core. GitHub discussion.
  • Need a statically typed Python replacement
    1 project | /r/learnprogramming | 28 Dec 2021
    Groovy definitely scales up just fine, I think, but I never used it myself outside of little snippets embedded in my DSL. I know it's considered by some to be "dead", so it's interesting to see what other JVM-ecosystem users think of it.

What are some alternatives?

When comparing awesome-pipeline and nextflow you can also consider the following projects:

Kedro - Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

galaxy - Data intensive science for everyone.

targets - Function-oriented Make-like declarative workflows for R

argo - Workflow Engine for Kubernetes

Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

ploomber - The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

astro-sdk - Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.

singularity - Singularity has been renamed to Apptainer as part of us moving the project to the Linux Foundation. This repo has been persisted as a snapshot right before the changes.

hamilton - Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage and metadata. Runs and scales everywhere python does.

bionix - Functional highly reproducible bioinformatics pipelines

devops-resources - DevOps resources - Linux, Jenkins, AWS, SRE, Prometheus, Docker, Python, Ansible, Git, Kubernetes, Terraform, OpenStack, SQL, NoSQL, Azure, GCP