We're wasting money by only supporting gzip for raw DNA files

community

1 19 0.0

An open community with an interest in developing and using new technologies for tensor data storage. (by zarr-developers)

The Zarr format is used in some genomics workflows (see https://github.com/zarr-developers/community/issues/19) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc), as well as many filters.
tamtools

1 3 10.0 Python

Create and manage hybrid reference assemblies to consolidate two original DNA alignments against different reference assemblies.

A few years ago I built some tools (https://github.com/tf318/tamtools) to store alignments against two different reference assemblies efficiently, taking advantage of the fact that most of each alignment against the different assemblies would in fact be the same, just shifted in position.
The intent was to extend this to store alignments against multiple references as new references are published, and probably to rewrite it in Rust or C rather than build on the initial Python effort.
In retrospect I would be interested to know whether this domain-specific compression effort, with zstd applied to the resulting "hybrid" alignment, would be more efficient than just letting zstd do its own thing with a full set of individual alignments against the different references.
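A toy version of the "let the compressor do its own thing" side of that comparison can be sketched in a few lines. This is only an illustration under loud assumptions: the data is synthetic (random ACGT bytes with a 0.1% substitution rate, not real alignments), the helper names are hypothetical, and Python's standard-library lzma stands in for zstd, which is not in the standard library of most Python versions. It does not answer the question for real data; it only shows how the measurement could be set up.

```python
import lzma
import random


def make_sequence(length, rng):
    """Return `length` random bases as bytes (synthetic stand-in for an alignment)."""
    return bytes(rng.choices(b"ACGT", k=length))


def mutate(seq, rate, rng):
    """Copy `seq`, replacing each base with a random one with probability `rate`."""
    out = bytearray(seq)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.choice(b"ACGT")
    return bytes(out)


def compressed_size(data):
    """Size in bytes after general-purpose compression (lzma here, as a stand-in for zstd)."""
    return len(lzma.compress(data))


rng = random.Random(0)                # fixed seed so the run is repeatable
first = make_sequence(200_000, rng)   # "alignment" against reference A
second = mutate(first, 0.001, rng)    # nearly identical "alignment" against reference B

separate = compressed_size(first) + compressed_size(second)
together = compressed_size(first + second)

print("compressed separately:", separate, "bytes")
print("compressed as one stream:", together, "bytes")
```

On data like this, the single stream should come out noticeably smaller, because the compressor can encode the second copy as references back into the first. Whether a hand-built hybrid alignment beats that on real data is exactly the open question above.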
Hail

5 934 9.8 Python

Cloud-native genomic dataframes and batch computing
zstd

105 22,293 9.6 C

Zstandard - Fast real-time compression algorithm

zstd has a long range mode, which lets it find redundancies a gigabyte away. Try --long and --long=31 for very long range mode.
zstd has delta / patch mode, which creates a file that stores the "patch" to create a new file from an old (reference) file. See https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
See the man page: https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md
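As a rough sketch of how the two modes above might be driven from a script, the helpers below only build the zstd command lines. The function names and file names are hypothetical. The flags (`--long=N`, `--patch-from=FILE`, `-d`, `-o`) are the ones described in the zstd man page. Per the man page, a window log above 27 also has to be passed to the decompressor, which is why the decompression helper repeats `--long`.

```python
def zstd_long_cmd(src, window_log=31):
    """argv for long-range compression of `src` (writes `src`.zst by default)."""
    return ["zstd", f"--long={window_log}", src]


def zstd_long_decompress_cmd(src_zst, window_log=31):
    """argv for decompression; large windows must be allowed explicitly."""
    return ["zstd", "-d", f"--long={window_log}", src_zst]


def zstd_patch_cmd(old, new, patch):
    """argv that writes `patch`, the data needed to rebuild `new` from `old`."""
    return ["zstd", f"--patch-from={old}", new, "-o", patch]


def zstd_apply_patch_cmd(old, patch, out):
    """argv that rebuilds the new file at `out` from `old` plus `patch`."""
    return ["zstd", "-d", f"--patch-from={old}", patch, "-o", out]


# Hypothetical file names, shown only to illustrate the resulting commands.
print(" ".join(zstd_long_cmd("reads.fastq")))
print(" ".join(zstd_patch_cmd("sample_v1.vcf", "sample_v2.vcf", "v1_to_v2.zst")))
```

Each list can be handed to `subprocess.run(cmd, check=True)` on a machine where the zstd command-line tool is installed.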
BEETL

1 94 1.2 C++

BEETL

If you have re-sequencing data from model species (which applies to >80% of generated sequencing data), the storage issue is often solved by using the CRAM/BAM formats. The FASTQ can be reconstructed as long as unmapped reads are stored in the file.
More general (pre-alignment) sequence compression methods never really took off (e.g. https://github.com/BEETL/BEETL), probably because it helps so much to have a common format that most workflows can start from. Here, replacing gzip with zstd would be lower-hanging fruit to start with.
genozip

2 147 9.2 C

A modern compressor for genomic files (FASTQ, SAM/BAM/CRAM, VCF, FASTA, GFF/GTF/GVF, 23andMe...), up to 5x better than gzip and faster too

You might want to check out Genozip - a modern compressor for FASTQ, BAM/CRAM, VCF etc.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

AWS doesn't make sense for scientific computing
1 project | news.ycombinator.com | 7 Oct 2022
Has anyone stored/queried VCFs and their variant records in a relational database?
2 projects | /r/bioinformatics | 12 Nov 2022
Introducing Genozip - a file compressor for BAM, FASTQ, VCF etc
1 project | /r/bioinformatics | 23 Jul 2021
Help running hap.py
1 project | /r/bioinformatics | 22 Nov 2022
Software engineers: consider working on genomics
6 projects | news.ycombinator.com | 19 Nov 2022

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Compression Genomics Vcf Big Data Genetics
Post date: 9 Jan 2023
