scikit-bio
scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources. (by scikit-bio)
biofast
Benchmarking programming languages/implementations for common tasks in Bioinformatics (by lh3)
| | scikit-bio | biofast |
---|---|---|
Mentions | 2 | 3 |
Stars | 833 | 175 |
Growth | 0.8% | - |
Activity | 8.8 | 0.0 |
Last commit | 7 days ago | over 2 years ago |
Language | Python | C |
License | BSD 3-clause "New" or "Revised" License | - |
Mentions - the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
scikit-bio
Posts with mentions or reviews of scikit-bio.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2021-09-23.
- What are some of the bioinformatic projects I could do on python as a beginner?
- Why I Use Nim instead of Python for Data Processing
You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8 bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4 bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [2]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2 bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library—Python, Nim, or otherwise—that made using the most efficient encoding for a sequence transparent to the developer.
[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...
[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation
[3]
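The encodings discussed in the post above can be sketched in a few lines of standard-library Python. This is only an illustration, not scikit-bio's implementation: the names `COMPLEMENT_TABLE`, `reverse_complement`, `encode_2bit`, and `decode_2bit` are hypothetical. The first half uses a byte lookup table (one byte per base, so the extended IUPAC codes fit); the second packs an unambiguous A/C/G/T sequence into an integer at 2 bits per base.

```python
# Hypothetical helper names; a sketch, not any library's API.

# One byte per base: IUPAC codes and their complements, in matching order.
_IUPAC = b"ACGTRYSWKMBDHVN"
_COMP = b"TGCAYRSWMKVHDBN"
COMPLEMENT_TABLE = bytes.maketrans(_IUPAC + _IUPAC.lower(),
                                   _COMP + _COMP.lower())


def reverse_complement(seq: bytes) -> bytes:
    """Complement every base via the lookup table, then reverse."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]


# Two bits per base: only valid for unambiguous, upper-case A/C/G/T.
_TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def encode_2bit(seq: str) -> int:
    """Pack a short A/C/G/T sequence into a single integer."""
    value = 0
    for base in seq:
        value = (value << 2) | _TWO_BIT[base]  # KeyError on ambiguous bases
    return value


def decode_2bit(value: int, length: int) -> str:
    """Unpack an integer produced by encode_2bit back into a string."""
    bases = []
    for _ in range(length):
        bases.append(_BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


print(reverse_complement(b"ACGTN"))  # b'NACGT'
print(encode_2bit("ACGT"))           # 27
print(decode_2bit(27, 4))            # ACGT
```

The length has to be stored alongside the integer, because leading `A` bases encode as zero bits and would otherwise be lost on decoding.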
biofast
Posts with mentions or reviews of biofast.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2021-09-23.
- Parsing huge files in Python
FYI: the python packages I mentioned earlier can all directly read gzip'd fastq files. See also this repo for examples.
- Does Rust Support Reading in FATSA files?
needletail is rated in the Heng Li benchmark (https://github.com/lh3/biofast/)
- Why I Use Nim instead of Python for Data Processing
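The "Parsing huge files in Python" post above mentions reading gzip'd FASTQ files directly but does not name the packages here. As a rough illustration of the idea using only the standard library, the sketch below streams records from a gzip-compressed file. The function name `read_fastq` is hypothetical, and the code assumes strict four-line FASTQ records (no wrapped sequence or quality lines).

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a gzip-compressed FASTQ file.

    Assumes each record occupies exactly four lines.
    """
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip("\n")
            # Drop the leading '@' from the header line.
            yield header[1:].rstrip("\n"), seq, qual
```

Because the function is a generator, only one record is held in memory at a time, which is what matters for very large files.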
What are some alternatives?
When comparing scikit-bio and biofast you can also consider the following projects:
PrimesResult - The results of the Dave Plummer's Primes Drag Race
nimtorch - PyTorch - Python + Nim
nimpylib - Some python standard library functions ported to Nim
readfq - Fast multi-line FASTA/Q reader in several programming languages
viroiddb - A curated database of all available viroid-like RNA sequences
RecursiveFactorization.jl
Primes - Prime Number Projects in C#/C++/Python
benchmarks - Some benchmarks of different languages