-
mandala
A powerful and easy to use Python framework for experiment tracking and incremental computing
-
tes-azure-legacy
Discontinued [DEPRECATED] - A GA4GH Task Execution Service (TES) compatible implementation for Azure Compute
-
oxen-release
Lightning fast data version control system for structured and unstructured machine learning datasets. We aim to make versioning datasets as easy as versioning code.
Snakemake is great, but it does feel like just a slightly more modern Make.
I am pretty excited about research projects that tie the recipe and the computation closer together so that you do not preserve just the last recipe, but the whole history of exploratory computation and analysis.
E.g. mandala (https://github.com/amakelov/mandala), a project of a colleague of mine which is basically semantic git for your computational graph and data at the same time.
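To make the idea of keeping the whole history of a computation a bit more concrete, here is a minimal sketch in plain Python. It is not mandala's API. The names `track`, `STORE` and `HISTORY` are hypothetical and made up for illustration. The assumption is simply that each call is identified by a hash of the function name and its arguments, so repeated calls are served from storage and every distinct call stays on record.

```python
import functools
import hashlib
import pickle

# Hypothetical in-memory stores, for illustration only (not mandala's API).
STORE = {}    # call hash -> computed result
HISTORY = []  # one entry per distinct call, in the order it was first computed


def track(func):
    """Memoize func by a hash of its name and arguments, and log each new call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Assumes the arguments are picklable and pickle deterministically.
        payload = pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
        key = hashlib.sha256(payload).hexdigest()
        if key not in STORE:
            # First time this exact call is seen: compute and record it.
            STORE[key] = func(*args, **kwargs)
            HISTORY.append((func.__qualname__, args, kwargs, key))
        return STORE[key]
    return wrapper
```

A real tool would presumably persist the store on disk and also track changes to the function's code, which this sketch leaves out.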
Snakemake is a beautiful project that evolves and improves so fast. Years ago I realized I needed to up my game from the usual bash-based NGS data processing pipelines I was writing. Based on several recommendations I chose Snakemake. I have never regretted it. It worked perfectly on our PBS cluster, then on our Slurm cluster. I took some steps to make it run on K8s, which it supports, and most recently I'm still/again happy with my choice of Snakemake because it (together with Nextflow) seems to be the chosen framework for GA4GH's cloud work stream's "products" like WES and TES [0]. This seems to be the tech stack that Amazon Omics and Microsoft Genomics focus on [1].
I owe a lot to Snakemake and Johannes Köster. I hope some day I can repay him and his project.
[0] https://www.ga4gh.org/work_stream/cloud/
[1] https://github.com/Microsoft/tes-azure
Super cool! Would love to see an integration with Oxen and their data version control https://github.com/Oxen-AI/oxen-release
For a very different approach, check out make-booster:
https://github.com/david-a-wheeler/make-booster
Make-booster provides utility routines intended to greatly simplify data processing (particularly a data pipeline) using GNU make. It includes some mechanisms specifically to help Python, as well as general-purpose mechanisms that can be useful in any system. In particular, it helps reliably reproduce results, and it automatically determines what needs to run and runs only that (producing a significant speedup in most cases). Released as open source software.
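The "determines what needs to run and runs only that" part is the classic Make staleness rule. Here is a rough Python sketch of that rule, not of make-booster itself. The function names `is_stale` and `build` are hypothetical, and the assumption is the usual timestamp comparison: a target is rebuilt if it is missing or older than any of its dependencies.

```python
import os


def is_stale(target, deps):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in deps)


def build(target, deps, recipe):
    """Run recipe() only when target is stale; return whether it ran."""
    if is_stale(target, deps):
        recipe()
        return True
    return False
```

Calling `build` twice in a row with unchanged inputs should run the recipe only the first time, which is where the speedup comes from.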
1. Command-line tools are often used in steps of a bioinformatics pipeline, and they bridge the gap (e.g. https://github.com/snakemake/snakemake-wrappers).