We're wasting money by only supporting gzip for raw DNA files

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • community

    An open community with an interest in developing and using new technologies for tensor data storage. (by zarr-developers)

    The Zarr format is used in some genomics workflows (see https://github.com/zarr-developers/community/issues/19) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, and Blosc) as well as many filters.

  • tamtools

    Create and manage hybrid reference assemblies to consolidate two original DNA alignments against different reference assemblies.

    A few years ago I built some tools (https://github.com/tf318/tamtools) to store alignments against two different reference assemblies efficiently, taking advantage of the fact that most of each alignment against the two assemblies is in fact the same, just shifted in position.

    The intent was to extend this to store alignments against multiple references as new references are published, and probably to rewrite it in Rust or C rather than keep the initial Python implementation.

    In retrospect, I would be interested to know whether this domain-specific compression effort, with zstd applied to the resulting "hybrid" alignment, would be more efficient than just letting zstd do its own thing on a full set of individual alignments against the different references.

  • Hail

    Cloud-native genomic dataframes and batch computing

  • zstd

    Zstandard - Fast real-time compression algorithm

    zstd has a long-range mode, which lets it find redundancies a gigabyte away. Try --long for long-range mode and --long=31 for very long-range mode.
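
    To see why the match window matters, here is a minimal sketch using only the Python standard library. It does not call zstd: zlib (DEFLATE, 32 KiB window) and lzma (8 MiB dictionary at the default preset) stand in for a short-range and a long-range compressor. The 256 KiB block size is an arbitrary choice for illustration.

```python
import lzma
import os
import zlib

# A block of random bytes is incompressible on its own,
# so any savings must come from spotting the repeated copy.
block = os.urandom(256 * 1024)
data = block + block  # the second copy starts 256 KiB after the first

# DEFLATE can only reference data up to 32 KiB back, so it misses the repeat.
small_window = zlib.compress(data, 9)

# LZMA's default dictionary is far larger than 256 KiB, so it finds the repeat.
large_window = lzma.compress(data)

print("input bytes:       ", len(data))
print("zlib (32 KiB):     ", len(small_window))
print("lzma (large dict): ", len(large_window))
```

    The short-window output stays about as large as the input, while the large-window output shrinks to roughly the size of one block. The same effect is what a larger zstd window is meant to capture on inputs whose repeats are far apart.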

    zstd also has a delta/patch mode, which creates a file that stores the "patch" needed to create a new file from an old (reference) file. See https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
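
    The idea behind patch mode can be sketched with the standard library by using the old file as a preset dictionary for zlib. This is an analogy, not zstd's own mechanism: zlib only looks 32 KiB back, so the sketch keeps the reference small. The helper names make_patch and apply_patch are hypothetical.

```python
import os
import zlib


def make_patch(old: bytes, new: bytes) -> bytes:
    """Compress `new`, letting the compressor reference bytes from `old`."""
    comp = zlib.compressobj(level=9, zdict=old)
    return comp.compress(new) + comp.flush()


def apply_patch(old: bytes, patch: bytes) -> bytes:
    """Rebuild `new` from the patch; the same `old` must be supplied."""
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(patch) + decomp.flush()


# Reference: 16 KiB of random bytes (fits inside zlib's 32 KiB window).
old = os.urandom(16 * 1024)
# New version: the reference with a short insertion in the middle.
new = old[:8000] + b"ACGT" * 10 + old[8000:]

patch = make_patch(old, new)
plain = zlib.compress(new, 9)  # same data compressed without the reference

print("new file bytes:     ", len(new))
print("compressed alone:   ", len(plain))
print("patch against old:  ", len(patch))
```

    The patch is only useful to someone who already has the reference file, which is the same trade-off as with any delta format.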

    See the man page: https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md

  • BEETL

    If you have re-sequencing data of model species (which applies to >80% of generated sequencing data), the storage issue is often solved using the CRAM/BAM formats. The FASTQ can be reconstructed if unmapped reads are stored in the file.

    More general (pre-alignment) sequence compression methods never really took off (e.g. https://github.com/BEETL/BEETL), probably because it helps so much to have a common format that most workflows can start with. Here, replacing gzip with zstd would be lower-hanging fruit to start with.

  • genozip

    A modern compressor for genomic files (FASTQ, SAM/BAM/CRAM, VCF, FASTA, GFF/GTF/GVF, 23andMe...), up to 5x better than gzip and faster too

    You might want to check out Genozip - a modern compressor for FASTQ, BAM/CRAM, VCF etc.
