Hot Topics #10 (August 1, 2022)
RL via sequence modeling, self-supervised learning for antibodies, rotamer-free protein sequence design, offline RL and more.
Decision Transformer: Reinforcement Learning via Sequence Modeling; Chen et al.; NeurIPS 2021
Abstract: We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
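The return-conditioning idea is compact enough to sketch. Below is a minimal, illustrative PyTorch version, not the authors' implementation: each timestep contributes three tokens (return-to-go, state, action), a causally masked encoder attends over the interleaved sequence, and the next action is read off each state-token position. All dimensions and the simple encoder are assumptions.

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    def __init__(self, state_dim, act_dim, d_model=128, n_layers=3,
                 n_heads=4, max_len=64):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)         # return-to-go token
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.pos = nn.Embedding(3 * max_len, d_model)  # one position per token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states),
             self.embed_action(actions)], dim=2,
        ).reshape(B, 3 * T, -1)   # interleave (return, state, action) per step
        tokens = tokens + self.pos(torch.arange(3 * T, device=tokens.device))
        causal = torch.triu(torch.full((3 * T, 3 * T), float("-inf"),
                                       device=tokens.device), diagonal=1)
        h = self.encoder(tokens, mask=causal)    # causally masked attention
        return self.predict_action(h[:, 1::3])   # act from each state token

model = DecisionTransformerSketch(state_dim=17, act_dim=6)
acts = model(torch.randn(2, 10, 1), torch.randn(2, 10, 17), torch.randn(2, 10, 6))
print(acts.shape)  # torch.Size([2, 10, 6])
```

At test time, one would condition on a high desired return-to-go and feed the generated actions back in autoregressively.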
Deciphering the language of antibodies using self-supervised learning; Leem et al.; July 8, 2022
Abstract: An individual’s B cell receptor (BCR) repertoire encodes information about past immune responses and potential for future disease protection. Deciphering the information stored in BCR sequence datasets will transform our understanding of disease and enable discovery of novel diagnostics and antibody therapeutics. A key challenge of BCR sequence analysis is the prediction of BCR properties from their amino acid sequence alone. Here, we present an antibody-specific language model, Antibody-specific Bidirectional Encoder Representation from Transformers (AntiBERTa), which provides a contextualized representation of BCR sequences. Following pre-training, we show that AntiBERTa embeddings capture biologically relevant information, generalizable to a range of applications. As a case study, we fine-tune AntiBERTa to predict paratope positions from an antibody sequence, outperforming public tools across multiple metrics. To our knowledge, AntiBERTa is the deepest protein-family-specific language model, providing a rich representation of BCRs. AntiBERTa embeddings are primed for multiple downstream tasks and can improve our understanding of the language of antibodies.
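The paratope case study follows a familiar fine-tuning recipe: a pre-trained masked language model plus a per-residue (token-level) classification head. Here is a hedged sketch of that recipe with Hugging Face transformers; the checkpoint path is a placeholder rather than AntiBERTa's published weights, and the one-residue-per-token input format is an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint; substitute a real antibody language model.
ckpt = "path/to/antibody-lm"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    ckpt, num_labels=2)  # label 1 = paratope (antigen-binding) residue

heavy = "EVQLVESGGGLVQPGGSLRLSCAAS"  # toy heavy-chain fragment
# many protein LMs tokenize one residue per token, hence the space-joining
inputs = tokenizer(" ".join(heavy), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, 2)
paratope_prob = logits.softmax(dim=-1)[0, :, 1]  # per-residue probability
```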
AlphaFold reveals the structure of the protein universe; DeepMind Blog; July 28, 2022
Abstract: It’s been one year since we released and open-sourced AlphaFold, our AI system to predict the 3D structure of a protein just from its 1D amino acid sequence, and created the AlphaFold Protein Structure Database (AlphaFold DB) to freely share this scientific knowledge with the world. Proteins are the building blocks of life; they underpin every biological process in every living thing. And because a protein’s shape is closely linked with its function, knowing a protein’s structure unlocks a greater understanding of what it does and how it works. We hoped this groundbreaking resource would help accelerate scientific research and discovery globally, and that other teams could learn from and build on the advances we made with AlphaFold to create further breakthroughs. That hope has become a reality far quicker than we had dared to dream. Just twelve months later, AlphaFold has been accessed by more than half a million researchers and used to accelerate progress on important real-world problems ranging from plastic pollution to antibiotic resistance.
Today, I’m incredibly excited to share the next stage of this journey. In partnership with EMBL’s European Bioinformatics Institute (EMBL-EBI), we’re now releasing predicted structures for nearly all catalogued proteins known to science, which will expand the AlphaFold DB by over 200x, from nearly 1 million structures to over 200 million, with the potential to dramatically increase our understanding of biology.
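If you want to poke at the expanded database yourself, pulling a predicted model is a single HTTP request. A minimal sketch, assuming the public file-naming scheme current at the time of writing (the version suffix may change between releases):

```python
import urllib.request

uniprot_id = "P69905"  # human hemoglobin subunit alpha, as an example
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v3.pdb"
urllib.request.urlretrieve(url, f"AF-{uniprot_id}.pdb")
```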
Rotamer-free protein sequence design based on deep learning and self-consistency; Liu et al.; July 21, 2022
Abstract: Several previously proposed deep learning methods to design amino acid sequences that autonomously fold into a given protein backbone yielded promising results in computational tests but did not outperform conventional energy function-based methods in wet experiments. Here we present the ABACUS-R method, which uses an encoder–decoder network trained using a multitask learning strategy to predict the sidechain type of a central residue from its three-dimensional local environment, which includes, besides other features, the types but not the conformations of the surrounding sidechains. This eliminates the need to reconstruct and optimize sidechain structures, and drastically simplifies the sequence design process. Thus, iteratively applying the encoder–decoder to different central residues can produce self-consistent overall sequences for a target backbone. Results of wet experiments, including five structures solved by X-ray crystallography, show that ABACUS-R outperforms state-of-the-art energy function-based methods in success rate and design precision.
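The self-consistency loop is the algorithmic heart of the method, and it is easy to caricature in code. A schematic version, not the authors' implementation, with `predict_residue` standing in for the trained encoder–decoder network:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def design_sequence(backbone, predict_residue, n_residues, max_iters=50):
    # start from a random sequence; only residue types, no rotamers
    seq = [random.choice(AMINO_ACIDS) for _ in range(n_residues)]
    for _ in range(max_iters):
        changed = 0
        for i in random.sample(range(n_residues), n_residues):
            # the network sees the backbone geometry around residue i plus
            # the current *types* (not conformations) of nearby sidechains
            new_aa = predict_residue(backbone, seq, i)
            if new_aa != seq[i]:
                seq[i] = new_aa
                changed += 1
        if changed == 0:  # fixed point: the sequence is self-consistent
            break
    return "".join(seq)
```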
Offline Reinforcement Learning at Multiple Frequencies; Burns et al.; July 26, 2022
Abstract: Leveraging many sources of offline robot data requires grappling with the heterogeneity of such data. In this paper, we focus on one particular aspect of heterogeneity: learning from offline data collected at different control frequencies. Across labs, the discretization of controllers, sampling rates of sensors, and demands of a task of interest may differ, giving rise to a mixture of frequencies in an aggregated dataset. We study how well offline reinforcement learning (RL) algorithms can accommodate data with a mixture of frequencies during training. We observe that the Q-value propagates at different rates for different discretizations, leading to a number of learning challenges for off-the-shelf offline RL. We present a simple yet effective solution that enforces consistency in the rate of Q-value updates to stabilize learning. By scaling the value of N in N-step returns with the discretization size, we effectively balance Q-value propagation, leading to more stable convergence. On three simulated robotic control problems, we empirically find that this simple approach outperforms naïve mixing by 50% on average.
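The fix is concrete enough to write down: choose N so that every N-step backup spans roughly the same wall-clock horizon regardless of the control frequency at which the data was logged. A sketch of that scaling as we read the abstract; the function and its parameters are illustrative, not the authors' code:

```python
import numpy as np

def n_step_targets(rewards, values, gamma, dt, base_dt=0.1, base_n=1):
    """N-step TD targets with N scaled to the control timestep `dt`.

    rewards: (T,) rewards from one trajectory logged at timestep dt
    values:  (T,) bootstrapped state values V(s_t)
    """
    n = max(1, round(base_n * base_dt / dt))  # finer control -> larger N
    T = len(rewards)
    targets = np.empty(T)
    for t in range(T):
        horizon = min(n, T - 1 - t)
        ret = sum(gamma**k * rewards[t + k] for k in range(horizon))
        targets[t] = ret + gamma**horizon * values[t + horizon]
    return targets
```

Data logged at dt = 0.05 then gets N = 2 backups where data at dt = 0.1 gets N = 1, so Q-values propagate at a matched wall-clock rate across the mixture.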
Structural basis for Cas9 off-target activity; Pacesa et al.; November 18, 2021
Abstract: The target DNA specificity of the CRISPR-associated genome editor nuclease Cas9 is determined by complementarity to a 20-nucleotide segment in its guide RNA. However, Cas9 can bind and cleave partially complementary off-target sequences, which raises safety concerns for its use in clinical applications. Here we report crystallographic structures of Cas9 bound to bona fide off-target substrates, revealing that off-target binding is enabled by a range of non-canonical base pairing interactions and preservation of base stacking within the guide–off-target heteroduplex. Off-target sites containing single-nucleotide deletions relative to the guide RNA are accommodated by base skipping rather than RNA bulge formation. Additionally, PAM-distal mismatches result in duplex unpairing and induce a conformational change of the Cas9 REC lobe that perturbs its conformational activation. Together, these insights provide a structural rationale for the off-target activity of Cas9 and contribute to the improved rational design of guide RNAs and off-target prediction algorithms.
Machine Learning in Structural Biology Workshop; December 3, 2022 @ NeurIPS 2022 Conference
About: In only a few years, structural biology, the study of the 3D structure or shape of proteins and other biomolecules, has been transformed by breakthroughs from machine learning algorithms. Machine learning models are now routinely used by experimentalists to predict structures that can help answer real biological questions (e.g. AlphaFold) and to accelerate the experimental process of structure determination (e.g. computer vision algorithms for cryo-electron microscopy), and they have become a new industry standard for bioengineering new protein therapeutics (e.g. large language models for protein design). Despite all this progress, there are still many active and open challenges for the field, such as modeling protein dynamics, predicting higher-order complexes, pushing towards generalization of protein folding physics, and relating the structure of proteins to the in vivo and contextual nature of their underlying function. These challenges are diverse and interdisciplinary, motivating new kinds of machine learning systems and requiring the development and maturation of standard benchmarks and datasets.
In this exciting time for the field, our workshop, “Machine Learning in Structural Biology” (MLSB), seeks to bring together relevant experts, practitioners, and students across a broad community to focus on these challenges and opportunities. We believe that uniting these communities at our workshop, including geometric and graph learning researchers, NLP researchers, and structural biologists with domain expertise, can help spur new ideas, spark collaborations, and advance the impact of machine learning in structural biology. Progress at this intersection promises to unlock new scientific discoveries and the ability to design novel medicines.
Hierarchical Generation of Molecular Graphs using Structural Motifs; Jin et al.; ICML 2020; Paper
Chemiscope; A graphical tool for the interactive exploration of materials and molecular databases, correlating local and global structural descriptors with the physical properties of the different systems, as well as a library of reusable components for building new interfaces. Demo
Mokapot; Fast and flexible semi-supervised learning for peptide detection.
mokapot is fundamentally a Python implementation of the semi-supervised learning algorithm first introduced by Percolator. We developed mokapot to add flexibility to our analyses, whether to try something experimental, such as swapping Percolator's linear support vector machine classifier for a non-linear gradient boosting classifier, or to train a joint model across experiments while retaining valid, per-experiment confidence estimates. We designed mokapot to be extensible and to support the analysis of additional types of proteomics data, such as cross-linked peptides from cross-linking mass spectrometry experiments. mokapot offers basic functionality from the command line, but using it as a Python package unlocks maximum flexibility, as sketched below.
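A brief sketch of that Python workflow, including the classifier swap mentioned above, based on mokapot's documented interface at the time of writing (check the current docs if signatures have moved):

```python
import mokapot
from sklearn.ensemble import GradientBoostingClassifier

psms = mokapot.read_pin("psms.pin")          # PSMs in Percolator input format
# the default is a linear SVM, as in Percolator; here, the non-linear swap
model = mokapot.Model(GradientBoostingClassifier())
results, trained_models = mokapot.brew(psms, model)
results.to_txt()                             # confidence estimates to disk
```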
For more information, check out our documentation.