Hot Topics #9 (July 25, 2022)
Mixture policies, peptide design, protein representation learning, deep chemical models, protein homodimer design, biosynthetic gene cluster detection, and more.

Strength Through Diversity: Robust Behavior Learning via Mixture Policies; Seyde et al.; 2022
Abstract: Efficiency in robot learning is highly dependent on hyperparameters. Robot morphology and task structure differ widely and finding the optimal setting typically requires sequential or parallel repetition of experiments, strongly increasing the interaction count. We propose a training method that only relies on a single trial by enabling agents to select and combine controller designs conditioned on the task. Our Hyperparameter Mixture Policies (HMPs) feature diverse sub-policies that vary in distribution types and parameterization, reducing the impact of design choices and unlocking synergies between low-level components. We demonstrate strong performance on continuous control tasks, including a simulated ANYmal robot, showing that HMPs yield robust, data-efficient learning.
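The core architectural idea is easy to sketch: a state-conditioned categorical head selects one of several heterogeneous sub-policies (e.g., a Gaussian controller and a bang-bang controller), which then samples the action. Below is a minimal illustration in Python; the toy linear networks are stand-ins for the paper's learned components, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianSubPolicy:
    """Continuous sub-policy: action ~ N(mu(state), sigma)."""
    def __init__(self, act_dim, sigma=0.3):
        self.w = rng.normal(scale=0.1, size=(act_dim, 4))  # toy linear mean
        self.sigma = sigma
    def sample(self, state):
        mu = np.tanh(self.w @ state)
        return np.clip(mu + self.sigma * rng.normal(size=mu.shape), -1, 1)

class BangBangSubPolicy:
    """Discrete sub-policy: each action dimension sits at an extreme."""
    def __init__(self, act_dim):
        self.w = rng.normal(scale=0.1, size=(act_dim, 4))
    def sample(self, state):
        return np.sign(self.w @ state)  # -1 or +1 per action dimension

class MixturePolicy:
    """Samples a sub-policy from a state-conditioned categorical, then acts."""
    def __init__(self, subs):
        self.subs = subs
        self.logits_w = rng.normal(scale=0.1, size=(len(subs), 4))
    def act(self, state):
        logits = self.logits_w @ state
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        k = rng.choice(len(self.subs), p=probs)  # pick a controller design
        return self.subs[k].sample(state), k

policy = MixturePolicy([GaussianSubPolicy(2), BangBangSubPolicy(2)])
action, chosen = policy.act(rng.normal(size=4))
print(chosen, action)
```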
EvoBind: in silico directed evolution of peptide binders with AlphaFold; Bryant & Elofsson; July 23, 2022
Abstract: Currently, there is no accurate method to computationally design peptide binders towards a specific protein interface using only a target structure. Experimental methods such as phage display can produce strong binders, but it is impossible to know where these bind without solving the structures. Using AlphaFold2 (AF) and other AI methods to distinguish true binders has proven highly successful but relies on the availability of binding scaffolds. Here, we develop EvoBind, an in silico directed-evolution platform based on AF that designs peptide binders towards an interface using only sequence information. We show that AF can distinguish between native and mutated peptide binders using the pLDDT score and find that AF adapts the receptor interface structure to the binders during optimisation. We analyse previously designed minibinder proteins and show that AF can distinguish designed binders from non-binders. We compare ELISA ratios of different peptide binders and find that affinity cannot be distinguished among binders, possibly due to varying binding sites and low AF confidence. We test the recovery of binding motifs and find that up to 75% of motifs are recovered. In principle, EvoBind can be used to design binders towards any interface, provided that AF can predict these. We expect that EvoBind will aid experimentalists substantially, providing a starting point for further laboratory analysis and optimisation. We hope that the use of AI-based methods will make binder design significantly cheaper and more accurate in tackling unmet clinical needs. EvoBind is freely available at: https://colab.research.google.com/github/patrickbryant1/EvoBind/blob/master/EvoBind.ipynb
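At its core this is a mutate-score-select loop in which AlphaFold's interface confidence serves as the fitness function. A runnable toy version of that loop follows; the trivial hydrophobicity score is a stand-in for the expensive AF pLDDT call, so treat this as a sketch of the control flow only.

```python
import random

random.seed(0)
AA = "ACDEFGHIKLMNPQRSTVWY"

def mutate(pep):
    """Point-mutate one random position of the peptide."""
    i = random.randrange(len(pep))
    return pep[:i] + random.choice(AA) + pep[i + 1:]

def score(pep):
    """Stand-in fitness. EvoBind scores candidates with AlphaFold's
    interface pLDDT (plus distance terms); here a toy hydrophobicity
    proxy keeps the loop runnable end to end."""
    return sum(pep.count(a) for a in "FILVWY") / len(pep)

peptide = "".join(random.choices(AA, k=12))  # random starting binder
best = (score(peptide), peptide)
for step in range(200):                      # greedy (1+1) evolution
    cand = mutate(best[1])
    s = score(cand)
    if s > best[0]:
        best = (s, cand)                     # keep the fitter candidate
print(best)
```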
Neural Scaling of Deep Chemical Models; Frey et al.; May 16, 2022
Abstract: Massive scale, both in terms of data availability and computation, enables significant breakthroughs in key application areas of deep learning such as natural language processing (NLP) and computer vision. There is emerging evidence that scale may be a key ingredient in scientific deep learning, but the importance of physical priors in scientific domains makes the strategies and benefits of scaling uncertain. Here, we investigate neural scaling behavior in large chemical models by varying model and dataset sizes over many orders of magnitude, studying models with over one billion parameters, pre-trained on datasets of up to ten million datapoints. We consider large language models for generative chemistry and graph neural networks for machine-learned interatomic potentials. To enable large-scale scientific deep learning studies under resource constraints, we develop the Training Performance Estimation (TPE) framework to reduce the costs of scalable hyperparameter optimization by up to 90%. Using this framework, we discover empirical neural scaling relations for deep chemical models and investigate the interplay between physical priors and scale. Potential applications of large, pre-trained models for "prompt engineering" and unsupervised representation learning of molecules are shown.
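The headline analysis is fitting power-law scaling relations of the form L(N) = a·N^(−α) + c to loss-versus-scale measurements. A minimal sketch of that fit on synthetic data (not the paper's numbers or coefficients):

```python
import numpy as np

# Synthetic (model size, validation loss) pairs -- NOT the paper's data,
# just shaped like a typical neural scaling curve L(N) = a * N**-alpha + c.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 1e9])
loss = 2.5 * N**-0.12 + 0.4 + np.random.default_rng(0).normal(0, 0.002, N.size)

c = loss.min() * 0.98                      # crude irreducible-loss estimate
slope, intercept = np.polyfit(np.log(N), np.log(loss - c), 1)
print(f"fitted exponent alpha ~ {-slope:.3f}, prefactor a ~ {np.exp(intercept):.3f}")
```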
Scaffolding protein functional sites using deep learning; Wang et al.; July 21, 2022
Abstract: The binding and catalytic functions of proteins are generally mediated by a small number of functional residues held in place by the overall protein structure. Here, we describe deep learning approaches for scaffolding such functional sites without needing to prespecify the fold or secondary structure of the scaffold. The first approach, “constrained hallucination,” optimizes sequences such that their predicted structures contain the desired functional site. The second approach, “inpainting,” starts from the functional site and fills in additional sequence and structure to create a viable protein scaffold in a single forward pass through a specifically trained RoseTTAFold network. We use these two methods to design candidate immunogens, receptor traps, metalloproteins, enzymes, and protein-binding proteins and validate the designs using a combination of in silico and experimental tests.
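Constrained hallucination amounts to MCMC/simulated annealing over sequence space against a loss computed from a structure-prediction network. The sketch below keeps that loop but swaps in a trivial toy loss, since running RoseTTAFold at every step is out of scope here; it is an illustration of the optimization scheme, not the authors' code.

```python
import math
import random

random.seed(1)
AA = "ACDEFGHIKLMNPQRSTVWY"

def structure_loss(seq):
    """Stand-in objective. In the paper, a structure-prediction network is
    run on `seq` and the loss scores how well the predicted structure
    recapitulates the functional-site geometry; here a toy composition
    target keeps the loop runnable."""
    return abs(seq.count("G") - 4) + abs(seq.count("W") - 2)

seq = "".join(random.choices(AA, k=60))      # random starting sequence
loss, T = structure_loss(seq), 1.0
for step in range(2000):                     # simulated-annealing MCMC
    i = random.randrange(len(seq))
    cand = seq[:i] + random.choice(AA) + seq[i + 1:]
    cand_loss = structure_loss(cand)
    if cand_loss < loss or random.random() < math.exp(-(cand_loss - loss) / T):
        seq, loss = cand, cand_loss          # accept the mutation
    T = max(0.05, T * 0.999)                 # cool the temperature
print(loss, seq)
```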
De novo design of protein homodimers containing tunable symmetric protein pockets; Hicks et al.; July 21, 2022
Abstract: Function follows form in biology, and the binding of small molecules requires proteins with pockets that match the shape of the ligand. For design of binding to symmetric ligands, protein homo-oligomers with matching symmetry are advantageous as each protein subunit can make identical interactions with the ligand. Here, we describe a general approach to designing hyperstable C2 symmetric proteins with pockets of diverse size and shape. We first designed repeat proteins that sample a continuum of curvatures but have low helical rise, then docked these into C2 symmetric homodimers to generate an extensive range of C2 symmetric cavities. We used this approach to design thousands of C2 symmetric homodimers, and characterized 101 of them experimentally. Of these, the geometries of 31 were confirmed by small-angle X-ray scattering and 2 were shown by crystallographic analyses to be in close agreement with the computational design models. These scaffolds provide a rich set of starting points for binding a wide range of C2 symmetric compounds.
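The C2 constraint itself is simple linear algebra: the second subunit is the first rotated 180° about the symmetry axis, so applying the rotation twice returns the original coordinates. A small sketch with toy coordinates (an illustration of the symmetry operation, not the design pipeline):

```python
import numpy as np

def c2_partner(coords, axis=np.array([0.0, 0.0, 1.0])):
    """Generate the C2 symmetry mate: rotate coordinates 180 degrees about
    `axis` through the origin."""
    axis = axis / np.linalg.norm(axis)
    # Rodrigues' formula at theta = pi reduces to R = 2*outer(u,u) - I
    R = 2.0 * np.outer(axis, axis) - np.eye(3)
    return coords @ R.T

subunit = np.random.default_rng(0).normal(size=(50, 3))   # toy CA trace
mate = c2_partner(subunit)
assert np.allclose(c2_partner(mate), subunit)             # C2 means R @ R = I
dimer = np.vstack([subunit, mate])
print(dimer.shape)
```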
Language models of protein sequences at the scale of evolution enable accurate structure prediction; Lin et al.; July 21, 2022
Abstract: Large language models have recently been shown to develop emergent capabilities with scale, going beyond simple pattern matching to perform higher level reasoning and generate lifelike images and text. While language models trained on protein sequences have been studied at a smaller scale, little is known about what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters, the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for sequences with low perplexity that are well understood by the language model. ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales.
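For reference, single-sequence prediction with ESMFold looks roughly like the sketch below via the facebookresearch/esm package; the call names (esm.pretrained.esmfold_v1, infer_pdb) follow that repo's README and may differ between releases, so treat them as assumptions to verify.

```python
import torch
import esm  # facebookresearch/esm

model = esm.pretrained.esmfold_v1().eval()
if torch.cuda.is_available():
    model = model.cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy input sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)       # end-to-end, no MSA search
with open("prediction.pdb", "w") as fh:
    fh.write(pdb_string)
```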
LIMO: Latent Inceptionism for Targeted Molecule Generation; Eckmann et al.; June 17, 2022
Abstract: Generation of drug-like molecules with high binding affinity to target proteins remains a difficult and resource-intensive task in drug discovery. Existing approaches primarily employ reinforcement learning, Markov sampling, or deep generative models guided by Gaussian processes, which can be prohibitively slow when generating molecules with high binding affinity calculated by computationally-expensive physics-based methods. We present Latent Inceptionism on Molecules (LIMO), which significantly accelerates molecule generation with an inceptionism-like technique. LIMO employs a variational autoencoder-generated latent space and property prediction by two neural networks in sequence to enable faster gradient-based reverse-optimization of molecular properties. Comprehensive experiments show that LIMO performs competitively on benchmark tasks and markedly outperforms state-of-the-art techniques on the novel task of generating drug-like compounds with high binding affinity, reaching nanomolar range against two protein targets. We corroborate these docking-based results with more accurate molecular dynamics-based calculations of absolute binding free energy and show that one of our generated drug-like compounds has a predicted KD (a measure of binding affinity) of 6×10⁻¹⁴ M against the human estrogen receptor, well beyond the affinities of typical early-stage drug candidates and most FDA-approved drugs to their respective targets. Code is available at this https URL.
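LIMO's key move is optimizing the latent code itself: the decoder and property predictor stay frozen while gradients flow back into z. A minimal PyTorch sketch of that move, with hypothetical stand-in networks rather than the authors' architecture:

```python
import torch

latent_dim, vocab = 64, 32
# Hypothetical placeholders for a trained VAE decoder and property predictor.
decoder = torch.nn.Sequential(torch.nn.Linear(latent_dim, 128),
                              torch.nn.ReLU(), torch.nn.Linear(128, vocab))
predictor = torch.nn.Sequential(torch.nn.Linear(vocab, 64),
                                torch.nn.ReLU(), torch.nn.Linear(64, 1))
for p in list(decoder.parameters()) + list(predictor.parameters()):
    p.requires_grad_(False)                 # networks frozen; only z moves

z = torch.randn(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)
for step in range(100):
    opt.zero_grad()
    mol_repr = decoder(z)                   # differentiable "molecule"
    loss = predictor(mol_repr).mean()       # e.g., predicted binding energy
    loss.backward()                         # gradients flow back into z
    opt.step()
print(loss.item())
```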
Learning inverse folding from millions of predicted structures; Hsu et al.; April 10, 2022
Abstract: We consider the problem of predicting a protein sequence from its backbone atom coordinates. Machine learning approaches to this problem to date have been limited by the number of available experimentally determined protein structures. We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods. The model generalizes to a variety of more complex tasks including design of protein complexes, partially masked structures, binding interfaces, and multiple states.
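A hedged usage sketch of the released ESM-IF1 model, following the facebookresearch/esm README; the function names and the PDB path below are assumptions to check against the current release.

```python
import esm
import esm.inverse_folding  # facebookresearch/esm

model, alphabet = esm.pretrained.esm_if1_gvp4_x256()
model = model.eval()

# "backbone.pdb" and chain "A" are placeholders for a real structure file.
coords, native_seq = esm.inverse_folding.util.load_coords("backbone.pdb", "A")
sampled_seq = model.sample(coords, temperature=1.0)  # design a new sequence
print(native_seq)
print(sampled_seq)
```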
Deep self-supervised learning for biosynthetic gene cluster detection and product classification; Rios-Martinez et al.; July 23, 2022
Abstract: Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.
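The pretraining recipe is standard masked language modeling, just over protein-domain tokens instead of amino acids. A toy, self-contained version follows; the Pfam-style IDs are made up, and a generic transformer layer stands in for the paper's convolutional BiGCARP model.

```python
import random
import torch

random.seed(0)
domains = ["PF00109", "PF02801", "PF00108", "PF08659", "PF00975"]  # toy vocab
vocab = {d: i for i, d in enumerate(domains)}
MASK = len(vocab)                                       # extra [MASK] token id

bgc = [vocab[d] for d in ["PF00109", "PF02801", "PF08659", "PF00975"]]
tokens = torch.tensor([bgc])
masked = tokens.clone()
pos = random.randrange(tokens.size(1))
masked[0, pos] = MASK                                   # hide one domain

emb = torch.nn.Embedding(len(vocab) + 1, 32)
enc = torch.nn.TransformerEncoderLayer(32, 4, 64, batch_first=True)
head = torch.nn.Linear(32, len(vocab))

logits = head(enc(emb(masked)))                         # (1, L, vocab)
loss = torch.nn.functional.cross_entropy(logits[0, pos][None],
                                         tokens[0, pos][None])
loss.backward()                                         # one MLM training step
print(float(loss))
```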
VegaFusion: provides server-side acceleration for the Vega visualization grammar. While not limited to Python, an initial application of VegaFusion is accelerating the Altair Python interface to Vega-Lite.
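A hedged usage sketch: per the VegaFusion docs at the time, the Jupyter integration is enabled roughly as below. The package and function names (vegafusion_jupyter, vf.enable) are assumptions to verify against the project documentation.

```python
import numpy as np
import pandas as pd
import altair as alt
import vegafusion_jupyter as vf  # assumed package name -- check the docs

vf.enable()  # route Altair charts through the VegaFusion renderer

df = pd.DataFrame({"x": np.random.default_rng(0).normal(size=200_000)})
chart = alt.Chart(df).mark_bar().encode(
    alt.X("x", bin=alt.Bin(maxbins=50)),  # binning now runs server-side
    y="count()",
)
chart
```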
FermiNet (Fermionic Neural Networks): an implementation of the Fermionic Neural Network for ab initio electronic structure calculations.
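The essential physical constraint FermiNet builds in is antisymmetry under electron exchange, enforced via determinants. A toy illustration of that principle, with fixed Gaussian orbitals in place of learned networks (not the DeepMind implementation):

```python
import numpy as np

def orbitals(r):
    """Toy single-particle orbitals phi_k(r_i): Gaussians at fixed centers."""
    centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    d2 = ((r[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)                    # shape (n_electrons, n_orbitals)

def psi(r):
    """Slater-determinant wavefunction: antisymmetric by construction."""
    return np.linalg.det(orbitals(r))

r = np.random.default_rng(0).normal(size=(3, 3))  # 3 electrons in 3D
swapped = r[[1, 0, 2]]                            # exchange electrons 0 and 1
assert np.isclose(psi(swapped), -psi(r))          # sign flips: antisymmetry
print(psi(r))
```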
Biosynthetic gene cluster (BiG) convolutional autoencoding representations of proteins (CARP): this repo contains training and plotting code for the paper Deep self-supervised learning for biosynthetic gene cluster detection and product classification. Model weights, data, and some results are available on Zenodo. If you'd like to use BiGCARP, the easiest way is through the protein sequence models repo.