Large language models generate functional protein sequences across families

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • progen

    Official release of the ProGen models

  • This was supposed to be a reply to another comment. The GitHub repo is from 2022:

    https://github.com/salesforce/progen

  • esm

    Evolutionary Scale Modeling (esm): Pretrained language models for proteins (by facebookresearch)

  • When evaluating this work, it’s important to remember that the functional labels on each of the 290 million input sequences were originally assigned by an HMM as part of the Pfam project, so the model is predicting a prediction.

    Furthermore, the authors apply a lot of human curation to ensure the sequences they generate are active. First, they pick an easy target. Second, they apply classical bioinformatics techniques by hand to the sequences after they are generated. For example, they manually align them and keep only those that contain, at specific positions, key amino acids that are present in 100% of functional proteins of that class and are required for function (a toy sketch of this kind of filter appears after the references below). All of this is done by a human bioinformatics expert before the “generated” sequences are tested.

    One other comment: in protein science, a sequence with 40% identity to another sequence is not “very different” if the two are homologous. Since this model is essentially generating homologs from a particular class, it’s no surprise that, at the pairwise amino-acid level, the generated sequences show this degree of similarity (a small illustration of how pairwise identity is computed is sketched below). Take proteins in any functional family and compare them: they will share the same overall 3-D structure, called their “fold”, yet have pairwise sequence identities much lower than 30–40%.

    Not to be negative. I really enjoyed reading this paper and I think the work is important. Some related work by Meta AI is the ESM series of models [1] trained on the same data (the UniProt dataset [2]).

    One thing I wonder about is the vocabulary size of this model. The number of tokens is 26, covering the 20 amino acids plus some extras, whereas for an LLM like Meta’s LLaMA the vocab size is 32,000. I wonder how that changes training and inference, and how the transformer architecture can be adapted to this scenario (a rough back-of-the-envelope sketch is included below).

    1. https://github.com/facebookresearch/esm

    2. https://www.uniprot.org/help/downloads
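
The curation step described in the comment above is easy to picture in code. Below is a minimal, hypothetical sketch (not from the paper, and with made-up alignment columns and residues) of a conserved-residue filter: after aligning generated sequences to the family alignment, keep only the rows that carry the invariant residues at the expected columns.

```python
# Hypothetical conserved positions (0-based alignment columns) and the residue
# required at each; real values would come from the curated family alignment.
REQUIRED_RESIDUES = [(3, "H"), (7, "D"), (11, "S")]

def passes_conservation_filter(aligned_seq: str) -> bool:
    """True if an aligned row carries every required residue at its column."""
    return all(
        col < len(aligned_seq) and aligned_seq[col] == aa
        for col, aa in REQUIRED_RESIDUES
    )

# Toy aligned rows (gaps as '-'); only the first keeps all three residues.
rows = ["MKAHLVGDTRIS-KE", "MKAQLVGDTRIS-KE", "MKAHLVGETRIS-KE"]
print([r for r in rows if passes_conservation_filter(r)])
# -> ['MKAHLVGDTRIS-KE']
```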

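On the 40%-identity point, pairwise identity is just the fraction of matching residues over the columns of an alignment where neither row has a gap. A small illustration with toy sequences:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither aligned row has a gap."""
    assert len(a) == len(b), "rows must come from the same alignment"
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in cols) / len(cols)

# Toy homologs: same layout, many differing positions.
print(round(percent_identity("MKAHLVGDTRIS-KE", "MRAHIVGDSKIS-QE"), 1))
```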
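On the vocabulary question, here is a rough sketch (the special tokens and hidden width below are assumptions, not values from the ProGen or LLaMA papers). Protein models tokenize one residue per token, so the vocabulary is the 20 amino acids plus a handful of specials, and the embedding and output-projection matrices are tiny compared with a 32,000-entry subword vocabulary.

```python
# Per-residue "tokenizer" with a ~26-token vocabulary (assumed special tokens).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>", "X", "B"]  # illustrative extras
vocab = SPECIALS + list(AMINO_ACIDS)
token_to_id = {t: i for i, t in enumerate(vocab)}

def encode(seq: str) -> list[int]:
    """One token per residue; unknown characters map to <unk>."""
    unk = token_to_id["<unk>"]
    ids = [token_to_id.get(c, unk) for c in seq]
    return [token_to_id["<bos>"]] + ids + [token_to_id["<eos>"]]

print(len(vocab), encode("MKAHLV"))

# Size of the embedding table alone, for an assumed hidden width:
d_model = 4096
print(len(vocab) * d_model)   # ~1e5 parameters for a 26-token vocab
print(32_000 * d_model)       # ~1.3e8 parameters for a 32k subword vocab
```

A smaller vocabulary mainly shrinks the embedding and softmax layers and means each token carries less information, so sequences are longer in tokens for the same content; the transformer blocks themselves are unchanged.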


Related posts

  • What is a recent scientific discovery that you find exciting?

    2 projects | /r/AskScienceDiscussion | 7 May 2023
  • [R] Large language models generate functional protein sequences across diverse families

    1 project | /r/MachineLearning | 26 Feb 2023
  • Salesforce/progen: projects and models for protein engineering and design

    1 project | news.ycombinator.com | 29 Jan 2023
  • 1-Jun-2023

    2 projects | /r/dailyainews | 2 Jun 2023
  • Basaran is an open-source alternative to the OpenAI text completion API

    1 project | news.ycombinator.com | 31 May 2023