-
esm
Evolutionary Scale Modeling (esm): Pretrained language models for proteins (by facebookresearch)
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
I was supposed to be reply to another comment. The GitHub is from 2022:
https://github.com/salesforce/progen
When evaluating this work, it’s important to remember that the functional labels on each of the 290 million input sequences were originally assigned by HMM as part of the pfam project, so the model is predicting a prediction.
Furthermore, the authors must engage a lot of human curation to ensure the sequences they generate are active. First, they pick an easy target. Second, they employ by-hand classical bioinformatics techniques on their predicted sequences after they are generated. For example, they manually align them and select those which contain specific important amino acids at specific positions which are present in 100% of functional proteins of that class, and are required for function. This is all done by a human bioinformatics expert before they test the “generated” sequences.
One other comment, in protein science, a sequence with 40% identity to another sequence is not “very different” if it is homologous. Since this model is essentially generating homologs from a particular class, it’s no surprise at a pairwise amino acid level, the generated sequences have this degree of similarity. Take proteins in any functional family and compare them. They will have the same overall 3-D structure—called their “fold”—yet have pairwise sequence identities much lower than 30–40%.
Not to be negative. I really enjoyed reading this paper and I think the work is important. Some related work by Meta AI is the ESM series of models [1] trained on the same data (the UniProt dataset [2]).
One thing I wonder is about the vocabulary size of this model. The number of tokens is 26 for the 20 amino acids and some extras, whereas for a LLM like Meta’s LLaMa the vocab size is 32,000. I wonder how that changes training and inference, and how we can adopt the transformer architecture for this scenario.
1. https://github.com/facebookresearch/esm
2. https://www.uniprot.org/help/downloads
Related posts
-
What is a recent scientific discovery that you find exciting?
-
[R] Large language models generate functional protein sequences across diverse families
-
Salesforce/progen: projects and models for protein engineering and design
-
1-Jun-2023
-
Basaran is an open-source alternative to the OpenAI text completion API