Nextflow: Data-Driven Computational Pipelines

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • common-workflow-language

    Repository for the CWL standards. Use https://cwl.discourse.group/ for support 😊

  • https://www.commonwl.org/

    https://github.com/common-workflow-language/common-workflow-...

  • nextflow

    A DSL for data-driven computational pipelines

  • > It's been a while since you can rerun/resume Nextflow pipelines

    Yes, you can resume, but you need your whole upstream DAG to be present. Snakemake can rerun a job when only the direct dependencies of that job are present, which lets you neatly manage disk usage, or archive an intermediate state of a project and rerun things from there.

    > and yes, you can have dry runs in Nextflow

    You have stubs, which really isn't the same thing.

    > I have no idea what you're referring to with the 'arbitrary limit of 1000 parallel jobs' though

    I was referring to this issue: https://github.com/nextflow-io/nextflow/issues/1871. The discussion there doesn't do the issue full justice, though. Nextflow spawns each job in a separate thread, and when it tries to spawn 1000+ Condor jobs it dies with a cryptic error message. The workaround of setting -Dnxf.pool.type=sync and -Dnxf.pool.maxThreads=N breaks the ability to resume, so Nextflow attempts to rerun the pipeline.

    > As for deleting temporary files, there are features that allow you to do a few things related to that, and other features being implemented.

    There are some hacks for this - but nothing I would feel safe to integrate into a production tool. They are implementing something - you're right - and it's been the case for several years now, so we'll see.

    Snakemake has all that out of the box.
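The rerun behaviour described in the comment above (deciding from a job's direct inputs and outputs only, so that older upstream intermediates can be deleted or archived) can be illustrated with a make-style timestamp check. This is a minimal sketch under stated assumptions, not Snakemake's or Nextflow's actual logic: the function name `needs_rerun` is hypothetical, and real engines also take things like code and parameter changes into account.

```python
from pathlib import Path


def needs_rerun(inputs, outputs):
    """Decide whether a job must run, looking only at its direct inputs/outputs.

    Hypothetical helper for illustration; not part of any workflow engine.
    """
    inputs = [Path(p) for p in inputs]
    outputs = [Path(p) for p in outputs]

    # A job that declares no outputs cannot be proven up to date.
    if not outputs:
        return True

    if all(p.exists() for p in outputs):
        existing_inputs = [p for p in inputs if p.exists()]
        # Outputs are present and the inputs were archived or deleted:
        # nothing upstream is needed, so the job is considered done.
        if not existing_inputs:
            return False
        newest_input = max(p.stat().st_mtime_ns for p in existing_inputs)
        oldest_output = min(p.stat().st_mtime_ns for p in outputs)
        # Rerun only if some input changed after the outputs were written.
        return newest_input > oldest_output

    # At least one output is missing, so the direct inputs must be available.
    missing = [str(p) for p in inputs if not p.exists()]
    if missing:
        raise FileNotFoundError(f"cannot rebuild, missing direct inputs: {missing}")
    return True
```

With a check like this, a job whose outputs exist is skipped even after its inputs have been removed, which is what makes it possible to drop intermediate files and still restart from a later stage.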

  • cgpipe

    cgpipe - minimum viable HPC pipeline

  • I do too... and have similar opinions. I wrote my own tool years back for pipelines because it was always frustrating (I started at roughly the same time as Nextflow).

    Allowing files to be marked as transient (temp) and re-running from arbitrary points in time are definitely among the things I support, as is conditional logic within the pipeline for job definition and resource usage. For me though, one of the biggest things is that I like having composable pipelines, so each part of the larger workflow can be developed independently. They can interact with each other (DAG) and use existing dependencies, but they don't have to exist in the same document/script. I work on large WGS datasets, so thousands of jobs per patient isn't uncommon.

    Happy to talk more if you're interested.

    https://github.com/compgen-io/cgpipe

    (And yes, you can dry run the entire thing. It will write out a bash script if you want to see exactly what is going to run without submitting jobs.)
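The dry-run idea mentioned above (emit a bash script showing exactly what would run, without submitting anything) can be sketched as follows. This is only an illustration under stated assumptions: the `Job` class and `render_dry_run` function are hypothetical names, and none of this reflects cgpipe's actual syntax or implementation. Jobs are ordered with the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Job:
    """Hypothetical job record: a name, a shell command, and upstream job names."""
    name: str
    command: str
    depends_on: list = field(default_factory=list)


def render_dry_run(jobs):
    """Return a bash script listing every command in dependency order."""
    by_name = {job.name: job for job in jobs}

    # Reject dependencies on jobs that were never defined.
    for job in jobs:
        for dep in job.depends_on:
            if dep not in by_name:
                raise ValueError(f"job {job.name!r} depends on unknown job {dep!r}")

    # Map each job to its predecessors; static_order() yields predecessors first.
    graph = {job.name: set(job.depends_on) for job in jobs}
    lines = ["#!/bin/bash", "set -euo pipefail"]
    for name in TopologicalSorter(graph).static_order():
        lines.append(f"# job: {name}")
        lines.append(by_name[name].command)
    return "\n".join(lines) + "\n"


# Example pipeline with placeholder commands (not real tool invocations).
pipeline = [
    Job("call", "echo call variants", depends_on=["sort"]),
    Job("align", "echo align reads"),
    Job("sort", "echo sort alignments", depends_on=["align"]),
]
script = render_dry_run(pipeline)
```

Printing `script` or writing it to a file gives a reviewable record of the planned commands; a real tool would submit each job to a scheduler instead of listing it.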

  • huey

    a little task queue for python

  • I've considered using Nextflow for bioinformatics pipelines but have yet to take the plunge. At work, I develop a proteomics pipeline that is composed of huey¹ tasks (Python library; simple alternative to Celery) which either use subprocess to call out to some external tool, or are just pure Python. It runs in a worker container which is created by Docker Swarm, and all containers pull jobs from Redis. For our scale, it works great. However, I don't have control over the resource utilization of individual steps, and in the past I've had issues with the pipeline blocking as a result of how I was chaining tasks together. I think something like Nextflow would remove these limitations, but one thing I think I would miss is the ability to debug individual pipeline steps locally with an interactive debugger. As far as I can tell, Nextflow has logging/tracing facilities but nothing quite like an interactive debugger. I'd be happy to be told I'm wrong, or even that I'm doing it wrong.

    ____

    ¹ https://github.com/coleifer/huey/

  • redun

    Yet another redundant workflow engine

  • I'm personally a huge fan of redun¹ for running computational pipelines. It's pure Python, it's easy to learn and debug, and it has automatic caching, retries, provenance logging, and great integration with AWS Batch for running large jobs. I've been really impressed with how easy it is to run a job to completion that fans out to thousands of AWS spot instances at once.

    I've used Nextflow in the past, and I found it much harder to use. Learning another DSL is annoying, documentation was sparse, I constantly ran into bugs, and it was hard to debug in general. I don't know how much it's changed over the past 3 years though.

    ¹ https://github.com/insitro/redun

  • Kedro

    Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

  • Interesting, thanks for sharing. I'll definitely take a look, although at this point I am so comfortable with Snakemake that it is a bit hard to imagine what would convince me to move to another tool. But I like the idea of composable pipelines: I am building a tool (too early to share) that would allow layering Snakemake pipelines on top of each other using semi-automatic data annotations, similar to how it is done in Kedro (https://github.com/kedro-org/kedro).
