Large language models generate functional protein sequences across families

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • progen

    Official release of the ProGen models

  • This was supposed to be a reply to another comment. The GitHub repo is from 2022:

    https://github.com/salesforce/progen

  • esm

    Evolutionary Scale Modeling (esm): Pretrained language models for proteins (by facebookresearch)

  • When evaluating this work, it’s important to remember that the functional labels on each of the 290 million input sequences were originally assigned by an HMM as part of the Pfam project, so the model is predicting a prediction.

    Furthermore, the authors apply a lot of human curation to ensure the sequences they generate are active. First, they pick an easy target. Second, they apply classical bioinformatics techniques by hand to the sequences after they are generated. For example, they manually align them and keep only those that contain, at specific positions, key amino acids that are present in 100% of functional proteins of that class and are required for function (a toy sketch of this kind of filter appears after the references below). All of this is done by a human bioinformatics expert before the “generated” sequences are tested.

    One other comment: in protein science, a sequence with 40% identity to another sequence is not “very different” if the two are homologous. Since this model is essentially generating homologs from a particular class, it’s no surprise that, at the pairwise amino-acid level, the generated sequences show this degree of similarity (a small illustration of how pairwise identity is computed is sketched below). Take proteins in any functional family and compare them: they will share the same overall 3-D structure, called their “fold”, yet have pairwise sequence identities much lower than 30–40%.

    Not to be negative. I really enjoyed reading this paper and I think the work is important. Some related work by Meta AI is the ESM series of models [1] trained on the same data (the UniProt dataset [2]).

    One thing I wonder about is the vocabulary size of this model. The number of tokens is 26, covering the 20 amino acids plus some extras, whereas for an LLM like Meta’s LLaMA the vocab size is 32,000. I wonder how that changes training and inference, and how the transformer architecture can be adapted to this scenario (a rough back-of-the-envelope sketch is included below).

    1. https://github.com/facebookresearch/esm

    2. https://www.uniprot.org/help/downloads
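
The curation step described in the comment above is easy to picture in code. Below is a minimal, hypothetical sketch (not from the paper, and with made-up alignment columns and residues) of a conserved-residue filter: after aligning generated sequences to the family alignment, keep only the rows that carry the invariant residues at the expected columns.

```python
# Hypothetical conserved positions (0-based alignment columns) and the residue
# required at each; real values would come from the curated family alignment.
REQUIRED_RESIDUES = [(3, "H"), (7, "D"), (11, "S")]

def passes_conservation_filter(aligned_seq: str) -> bool:
    """True if an aligned row carries every required residue at its column."""
    return all(
        col < len(aligned_seq) and aligned_seq[col] == aa
        for col, aa in REQUIRED_RESIDUES
    )

# Toy aligned rows (gaps as '-'); only the first keeps all three residues.
rows = ["MKAHLVGDTRIS-KE", "MKAQLVGDTRIS-KE", "MKAHLVGETRIS-KE"]
print([r for r in rows if passes_conservation_filter(r)])
# -> ['MKAHLVGDTRIS-KE']
```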

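On the 40%-identity point, pairwise identity is just the fraction of matching residues over the columns of an alignment where neither row has a gap. A small illustration with toy sequences:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither aligned row has a gap."""
    assert len(a) == len(b), "rows must come from the same alignment"
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in cols) / len(cols)

# Toy homologs: same layout, many differing positions.
print(round(percent_identity("MKAHLVGDTRIS-KE", "MRAHIVGDSKIS-QE"), 1))
```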
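On the vocabulary question, here is a rough sketch (the special tokens and hidden width below are assumptions, not values from the ProGen or LLaMA papers). Protein models tokenize one residue per token, so the vocabulary is the 20 amino acids plus a handful of specials, and the embedding and output-projection matrices are tiny compared with a 32,000-entry subword vocabulary.

```python
# Per-residue "tokenizer" with a ~26-token vocabulary (assumed special tokens).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>", "X", "B"]  # illustrative extras
vocab = SPECIALS + list(AMINO_ACIDS)
token_to_id = {t: i for i, t in enumerate(vocab)}

def encode(seq: str) -> list[int]:
    """One token per residue; unknown characters map to <unk>."""
    unk = token_to_id["<unk>"]
    ids = [token_to_id.get(c, unk) for c in seq]
    return [token_to_id["<bos>"]] + ids + [token_to_id["<eos>"]]

print(len(vocab), encode("MKAHLV"))

# Size of the embedding table alone, for an assumed hidden width:
d_model = 4096
print(len(vocab) * d_model)   # ~1e5 parameters for a 26-token vocab
print(32_000 * d_model)       # ~1.3e8 parameters for a 32k subword vocab
```

A smaller vocabulary mainly shrinks the embedding and softmax layers and means each token carries less information, so sequences are longer in tokens for the same content; the transformer blocks themselves are unchanged.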


Related posts

  • What is a recent scientific discovery that you find exciting?

    2 projects | /r/AskScienceDiscussion | 7 May 2023
  • [R] Large language models generate functional protein sequences across diverse families

    1 project | /r/MachineLearning | 26 Feb 2023
  • Salesforce/progen: projects and models for protein engineering and design

    1 project | news.ycombinator.com | 29 Jan 2023
  • 1-Jun-2023

    2 projects | /r/dailyainews | 2 Jun 2023
  • Basaran is an open-source alternative to the OpenAI text completion API

    1 project | news.ycombinator.com | 31 May 2023