-
community
An open community with an interest in developing and using new technologies for tensor data storage. (by zarr-developers)
The Zarr format is used in some genomics workflows (see https://github.com/zarr-developers/community/issues/19) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc), as well as many filters.
-
tamtools
Create and manage hybrid reference assemblies to consolidate two original DNA alignments against different reference assemblies.
A few years ago I built some tools (https://github.com/tf318/tamtools) to store alignments against two different reference assemblies efficiently, taking advantage of the fact that most of each alignment to the different assemblies is in fact the same, just shifted in position.
The intent was to enhance this to store alignments against multiple references as new references are published, and probably to rewrite in Rust or C rather than the initial Python effort.
In retrospect I would be interested to know whether this domain-specific compression effort, with zstd applied to the resulting "hybrid" alignment, would be more efficient than just letting zstd do its own thing with a full set of individual alignments against the different references.
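One way to approach that question is to measure both layouts on the same data. The sketch below is a toy illustration only: it is not the tamtools format, the record layout and the 95% shared fraction are made-up assumptions, and it uses Python's built-in zlib as a stand-in for zstd, so the numbers it prints say nothing about real alignments.

```python
import random
import zlib


def make_alignments(n_reads, shift, seed=0):
    """Toy data: (read name, position on reference A, position on reference B)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_reads):
        pos_a = rng.randrange(1_000_000)
        # Assumption: 95% of reads differ between assemblies only by a constant shift.
        if rng.random() < 0.95:
            pos_b = pos_a + shift
        else:
            pos_b = rng.randrange(1_000_000)
        rows.append((f"read{i}", pos_a, pos_b))
    return rows


def separate_encoding(rows):
    """Two independent alignment tables, one per reference, concatenated."""
    table_a = "".join(f"{name}\t{pa}\n" for name, pa, _ in rows)
    table_b = "".join(f"{name}\t{pb}\n" for name, _, pb in rows)
    return (table_a + table_b).encode()


def hybrid_encoding(rows, shift):
    """One table; the B position is written only when it is not pos_a + shift."""
    lines = []
    for name, pa, pb in rows:
        if pb == pa + shift:
            lines.append(f"{name}\t{pa}\n")
        else:
            lines.append(f"{name}\t{pa}\t{pb}\n")
    return "".join(lines).encode()


def decode_hybrid(blob, shift):
    """Rebuild the full (name, pos_a, pos_b) rows from the hybrid table."""
    rows = []
    for line in blob.decode().splitlines():
        fields = line.split("\t")
        pa = int(fields[1])
        pb = int(fields[2]) if len(fields) == 3 else pa + shift
        rows.append((fields[0], pa, pb))
    return rows


if __name__ == "__main__":
    shift = 137  # hypothetical offset between the two assemblies
    rows = make_alignments(2000, shift)
    for label, blob in [("separate", separate_encoding(rows)),
                        ("hybrid", hybrid_encoding(rows, shift))]:
        print(label, len(blob), "bytes raw,", len(zlib.compress(blob, 9)), "bytes compressed")
```

A real comparison would swap in actual alignment records and zstd itself (ideally with long-range matching enabled), since a general-purpose compressor can only exploit the shared structure if both copies fall inside its match window.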
-
zstd has a long-range mode, which lets it find redundancies a gigabyte away. Try --long for the default long-range window (2^27 bytes, i.e. 128 MiB) or --long=31 for a very long range 2 GiB window.
zstd also has a delta / patch mode (--patch-from), which creates a file that stores the "patch" needed to create a new file from an old (reference) file. See https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
See the man page: https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md
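As a rough sketch of how the two modes above might be driven from a script, the helpers below build zstd command lines for long-range compression and for creating and applying a patch. The helper names and the file names in the usage example are hypothetical; only the `--long`, `--patch-from`, `-d` and `-o` flags come from the zstd CLI, and the man page remains the authority on their exact behaviour.

```python
import shutil
import subprocess


def long_range_cmd(src, window_log=31):
    """Compress src with long-distance matching and a 2**window_log byte window."""
    return ["zstd", f"--long={window_log}", src]


def make_patch_cmd(old, new, patch):
    """Write to `patch` a compressed delta that turns `old` into `new`."""
    return ["zstd", f"--patch-from={old}", new, "-o", patch]


def apply_patch_cmd(old, patch, out):
    """Rebuild the new file at `out` from `old` plus `patch`."""
    return ["zstd", "-d", f"--patch-from={old}", patch, "-o", out]


def run(cmd):
    """Run a zstd command, failing early if the CLI is not installed."""
    if shutil.which("zstd") is None:
        raise RuntimeError("zstd CLI not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(long_range_cmd("reads.fastq")))
    print(" ".join(make_patch_cmd("sample_v1.sam", "sample_v2.sam", "v1_to_v2.zst")))
    print(" ".join(apply_patch_cmd("sample_v1.sam", "v1_to_v2.zst", "sample_v2.sam")))
```

Note that, per the man page, a file compressed with a window log above 27 needs the matching `--long=` value (or a sufficient `--memory=` limit) passed again when decompressing.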
-
If you have re-sequencing data of model species (which applies to >80% of generated sequencing data), the storage issue is often solved using CRAM/BAM formats. The FASTQ can be reconstructed if unmapped reads are stored in the file.
More general (pre-alignment) sequence compression methods never really took off (e.g. https://github.com/BEETL/BEETL), probably because it helps so much to have a common format that most workflows can start with. Here, replacing gzip with zstd would be lower-hanging fruit to start with.
-
genozip
A modern compressor for genomic files (FASTQ, SAM/BAM/CRAM, VCF, FASTA, GFF/GTF/GVF, 23andMe...), up to 5x better than gzip and faster too
You might want to check out Genozip - a modern compressor for FASTQ, BAM/CRAM, VCF etc.