Comparing the similarity between a set of protein sequences

This page summarizes the projects mentioned and recommended in the original post on /r/genomics

  • Biopython

    Official git repository for Biopython (originally converted from CVS)

  • Usearch will do all-against-all comparisons, cluster sequences, and produce alignments for each cluster. You can set the clustering threshold (the proportion of identical residues). The alignments are in FASTA format, which is pretty standard. If all you want is basic similarity, it might be easiest to just write something that calculates normalized Hamming distances (typically called p-distances in the molecular evolution literature) between pairs of aligned sequences. I suspect the Biopython FASTA reader (you can install Biopython from https://biopython.org/) will be good enough.
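
A minimal sketch of the p-distance idea from the comment above, not anything Usearch or Biopython ships. It assumes the sequences are already aligned to the same length, and it skips any column where either sequence has a gap (that gap handling is my assumption; other conventions exist). The function names `p_distance`, `pairwise_p_distances`, and `load_fasta` are made up for illustration; only `Bio.SeqIO.parse` is a real Biopython call.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are ignored (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def pairwise_p_distances(seqs):
    """Map each unordered pair of IDs to its p-distance.

    `seqs` is a dict of {sequence_id: aligned_sequence_string}.
    """
    return {
        (i, j): p_distance(seqs[i], seqs[j])
        for i, j in combinations(seqs, 2)
    }


def load_fasta(path):
    """Read an aligned FASTA file into {id: sequence} using Biopython."""
    from Bio import SeqIO  # imported here so the rest works without Biopython

    return {rec.id: str(rec.seq) for rec in SeqIO.parse(path, "fasta")}
```

Something like `pairwise_p_distances(load_fasta("aligned.fasta"))` would then give one distance per pair, where `aligned.fasta` is a placeholder file name. Similarity is just one minus the distance.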

  • diamond

    Accelerated BLAST compatible local sequence aligner. (by bbuchfink)

  • Diamond (https://github.com/bbuchfink/diamond) might help. It has a protein sequence clustering option. You could cluster your sequences and then take the centroid of each cluster. Vary the BLAST parameters to increase or decrease the number of clusters.
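
A rough sketch of the "take the centroid of each cluster" step. It assumes the clustering output is a two-column, tab-separated table with the representative (centroid) ID first and the member ID second. As far as I know that is the layout Diamond's clustering writes, but check against your own output before relying on it. `parse_clusters` is a hypothetical helper name, not part of Diamond.

```python
from collections import defaultdict


def parse_clusters(lines):
    """Group member IDs under their representative ID.

    Assumes each non-empty line is "<representative>\t<member>".
    Returns {representative_id: [member_id, ...]} in input order.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


# Example with made-up IDs standing in for a real clustering table.
example = ["seqA\tseqA", "seqA\tseqB", "seqC\tseqC"]
clusters = parse_clusters(example)
centroids = list(clusters)  # one representative per cluster
```

With a real file you would pass an open file handle instead of the `example` list. The number of keys in `clusters` tells you how many clusters you got for a given set of parameters.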
