Hot Topics #8 (July 21, 2022)
Antibody affinity, protein properties, peptide design, planning through language models, and more.
Editor's note: Sorry for yet another delayed post. This week I was driving on a cross-country road trip. Next week's post will be released on Monday as usual. Please enjoy this week's roundup!
Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space; Makowski et al.; July 1, 2022
Abstract: Therapeutic antibody development requires selection and engineering of molecules with high affinity and other drug-like biophysical properties. Co-optimization of multiple antibody properties remains a difficult and time-consuming process that impedes drug development. Here we evaluate the use of machine learning to simplify antibody co-optimization for a clinical-stage antibody (emibetuzumab) that displays high levels of both on-target (antigen) and off-target (non-specific) binding. We mutate sites in the antibody complementarity-determining regions, sort the antibody libraries for high and low levels of affinity and non-specific binding, and deep sequence the enriched libraries. Interestingly, machine learning models trained on datasets with binary labels enable predictions of continuous metrics that are strongly correlated with antibody affinity and non-specific binding. These models illustrate strong tradeoffs between these two properties, as increases in affinity along the co-optimal (Pareto) frontier require progressive reductions in specificity. Notably, models trained with deep learning features enable prediction of novel antibody mutations that co-optimize affinity and specificity beyond what is possible for the original antibody library. These findings demonstrate the power of machine learning models to greatly expand the exploration of novel antibody sequence space and accelerate the development of highly potent, drug-like antibodies.
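To make the co-optimal (Pareto) frontier mentioned in the abstract concrete, here is a minimal Python sketch of how such a frontier can be extracted from a set of scored variants. It assumes each variant has two predicted scores, affinity and specificity, that both scores are to be maximized, and that higher is better for each. The function name `pareto_front` and the example scores are hypothetical illustrations, not the authors' code or data.

```python
from typing import List, Tuple

Score = Tuple[float, float]  # (affinity, specificity), both to be maximized


def pareto_front(points: List[Score]) -> List[Score]:
    """Return the points that no other point dominates.

    A point is dominated when another point is at least as good on both
    objectives and strictly better on at least one of them.
    """
    front = []
    for i, (a_i, s_i) in enumerate(points):
        dominated = any(
            (a_j >= a_i and s_j >= s_i) and (a_j > a_i or s_j > s_i)
            for j, (a_j, s_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((a_i, s_i))
    # Sort by affinity so the tradeoff reads from low to high affinity.
    return sorted(front)


# Hypothetical (affinity, specificity) scores for five variants.
variants = [(0.2, 0.9), (0.5, 0.7), (0.4, 0.6), (0.8, 0.3), (0.7, 0.2)]
print(pareto_front(variants))
```

Along the returned frontier, each step up in affinity comes with a step down in specificity, which is the tradeoff the abstract describes. The pairwise check is quadratic in the number of variants, which is fine for a sketch but would likely need a faster method for a full deep-sequenced library.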
Learning functional properties of proteins with language models; Unsal et al.; March 21, 2022
Abstract: Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study we conducted a detailed investigation over protein representation learning by first categorizing/explaining each approach, subsequently benchmarking their performances on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method over the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.
Fast and Flexible Protein Design Using Deep Graph Neural Networks; Strokach et al.; October 21, 2020
Abstract: Protein structure and function is determined by the arrangement of the linear sequence of amino acids in 3D space. We show that a deep graph neural network, ProteinSolver, can precisely design sequences that fold into a predetermined shape by phrasing this challenge as a constraint satisfaction problem (CSP), akin to Sudoku puzzles. We trained ProteinSolver on over 70,000,000 real protein sequences corresponding to over 80,000 structures. We show that our method rapidly designs new protein sequences and benchmark them in silico using energy-based scores, molecular dynamics, and structure prediction methods. As a proof-of-principle validation, we use ProteinSolver to generate sequences that match the structure of serum albumin, then synthesize the top-scoring design and validate it in vitro using circular dichroism. ProteinSolver is freely available at http://design.proteinsolver.org and https://gitlab.com/ostrokach/proteinsolver. A record of this paper’s transparent peer review process is included in the Supplemental Information.
Dual use of artificial-intelligence-powered drug discovery; Urbina et al.; March 7, 2022
Abstract: The Swiss Federal Institute for NBC (nuclear, biological and chemical) Protection, Spiez Laboratory, convenes the ‘convergence’ conference series set up by the Swiss government to identify developments in chemistry, biology and enabling technologies that may have implications for the Chemical and Biological Weapons Conventions. Meeting every two years, the conferences bring together an international group of scientific and disarmament experts to explore the current state of the art in the chemical and biological fields and their trajectories, to think through potential security implications and to consider how these implications can most effectively be managed internationally. The meeting convenes for three days of discussion on the possibilities of harm, should the intent be there, from cutting-edge chemical and biological technologies. Our drug discovery company received an invitation to contribute a presentation on how AI technologies for drug discovery could potentially be misused.
De novo designed peptides for cellular delivery and subcellular localisation; Rhys et al.; July 14, 2022
Abstract: Increasingly, it is possible to design peptide and protein assemblies de novo from first principles or computationally. This approach provides new routes to functional synthetic polypeptides, including designs to target and bind proteins of interest. Much of this work has been developed in vitro. Therefore, a challenge is to deliver de novo polypeptides efficiently to sites of action within cells. Here we describe the design, characterisation, intracellular delivery, and subcellular localisation of a de novo synthetic peptide system. This system comprises a dual-function basic peptide, programmed both for cell penetration and target binding, and a complementary acidic peptide that can be fused to proteins of interest and introduced into cells using synthetic DNA. The designs are characterised in vitro using biophysical methods and X-ray crystallography. The utility of the system for delivery into mammalian cells and subcellular targeting is demonstrated by marking organelles and actively engaging functional protein complexes.
Peptide binding specificity prediction using fine-tuned protein structure prediction networks; Motmaen et al.; July 13, 2022
Abstract: Peptide binding proteins play key roles in biology, and predicting their binding specificity is a long-standing challenge. While considerable protein structural information is available, the most successful current methods use sequence information alone, in part because it has been a challenge to model the subtle structural changes accompanying sequence substitutions. Protein structure prediction networks such as AlphaFold model sequence-structure relationships very accurately, and we reasoned that if it were possible to specifically train such networks on binding data, more generalizable models could be created. We show that placing a classifier on top of the AlphaFold network and fine-tuning the combined network parameters for both classification and structure prediction accuracy leads to a model with strong generalizable performance on a wide range of Class I and Class II peptide-MHC interactions that approaches the overall performance of the state-of-the-art NetMHCpan sequence-based method. The peptide-MHC optimized model shows excellent performance in distinguishing binding and non-binding peptides to SH3 and PDZ domains. This ability to generalize well beyond the training set far exceeds that of sequence only models, and should be particularly powerful for systems where less experimental data is available.
Inner Monologue: Embodied Reasoning through Planning with Language Models; Huang et al.; July 12, 2022
Abstract: Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robotics. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent’s own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, object recognition, scene description, and human interaction. We find that closed-loop language feedback significantly improves high level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in a real kitchen environment.
Effective Mutation Rate Adaptation through Group Elite Selection; Kumar et al.; April 11, 2022
Abstract: Evolutionary algorithms are sensitive to the mutation rate (MR); no single value of this parameter works well across domains. Self-adaptive MR approaches have been proposed but they tend to be brittle: Sometimes they decay the MR to zero, thus halting evolution. To make self-adaptive MR robust, this paper introduces the Group Elite Selection of Mutation Rates (GESMR) algorithm. GESMR co-evolves a population of solutions and a population of MRs, such that each MR is assigned to a group of solutions. The resulting best mutational change in the group, instead of average mutational change, is used for MR selection during evolution, thus avoiding the vanishing MR problem. With the same number of function evaluations and with almost no overhead, GESMR converges faster and to better solutions than previous approaches on a wide range of continuous test optimization problems. GESMR also scales well to high-dimensional neuroevolution for supervised image-classification tasks and for reinforcement learning control tasks. Remarkably, GESMR produces MRs that are optimal in the long-term, as demonstrated through a comprehensive look-ahead grid search. Thus, GESMR and its theoretical and empirical analysis demonstrate how self-adaptation can be harnessed to improve performance in several applications of evolutionary computation.
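The group-elite idea in the abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' GESMR implementation: the toy objective `sphere`, the Gaussian mutation operator, the top-quarter parent selection, and the log-scale perturbation of the elite rate are all choices made here for the example. The one detail taken from the abstract is that each mutation rate is scored by the best improvement in its group of solutions rather than by the average.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def sphere(x: Vector) -> float:
    """Toy objective to minimize; the optimum is 0 at the origin."""
    return sum(v * v for v in x)


def gesmr_sketch(
    fitness: Callable[[Vector], float],
    dim: int = 5,
    n_solutions: int = 32,
    n_rates: int = 4,
    generations: int = 50,
    seed: int = 0,
) -> Tuple[float, float, List[float]]:
    """Return (initial best fitness, final best fitness, final mutation rates)."""
    rng = random.Random(seed)
    group_size = n_solutions // n_rates
    solutions = [[rng.uniform(-3.0, 3.0) for _ in range(dim)] for _ in range(n_solutions)]
    rates = [10 ** rng.uniform(-3.0, 0.0) for _ in range(n_rates)]
    best = min(solutions, key=fitness)
    initial_best = fitness(best)

    for _generation in range(generations):
        # Parents are the top quarter of the current population (an assumption).
        parents = sorted(solutions, key=fitness)[: max(1, n_solutions // 4)]
        children: List[Vector] = []
        group_scores: List[float] = []
        for rate in rates:
            improvements = []
            for _member in range(group_size):
                parent = rng.choice(parents)
                child = [v + rng.gauss(0.0, rate) for v in parent]
                children.append(child)
                improvements.append(fitness(parent) - fitness(child))
            # Group elite: score the rate by its BEST child, not the group average.
            group_scores.append(max(improvements))

        # Keep the best solution found so far, then fill up with children.
        best = min(children + [best], key=fitness)
        solutions = [best] + children[: n_solutions - 1]

        # Keep the best-scoring rate and refill the rest by perturbing it.
        elite_rate = rates[group_scores.index(max(group_scores))]
        rates = [elite_rate] + [
            elite_rate * 2.0 ** rng.uniform(-1.0, 1.0) for _ in range(n_rates - 1)
        ]

    return initial_best, fitness(best), rates


initial, final, final_rates = gesmr_sketch(sphere)
print(f"best fitness: {initial:.4f} -> {final:.4f}; rates: {final_rates}")
```

Scoring a rate by its best child is what keeps large rates alive in this sketch: most large mutations are harmful, so an average would favor ever smaller rates, while the maximum rewards a rate as soon as one of its children makes real progress.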
A synthetic protein-level neural network in mammalian cells; Chen et al.; July 11, 2022
Abstract: Artificial neural networks provide a powerful paradigm for information processing that has transformed diverse fields. Within living cells, genetically encoded synthetic molecular networks could, in principle, harness principles of neural computation to classify molecular signals. Here, we combine de novo designed protein heterodimers and engineered viral proteases to implement a synthetic protein circuit that performs winner-take-all neural network computation. This “perceptein” circuit includes modules that compute weighted sums of input protein concentrations through reversible binding interactions, and allow for self-activation and mutual inhibition of protein components using irreversible proteolytic cleavage reactions. Altogether, these interactions comprise a network of 310 chemical reactions stemming from 8 expressed protein species. The complete system achieves signal classification with tunable decision boundaries in mammalian cells. These results demonstrate how engineered protein-based networks can enable programmable signal classification in living cells.
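As a point of reference for the computation the circuit is described as performing, here is a minimal Python sketch of an idealized winner-take-all layer: each output unit computes a weighted sum of the inputs, and only the unit with the largest sum stays active. This abstracts away all of the chemistry (binding, cleavage, concentrations); the function name `winner_take_all` and the weights are hypothetical and do not come from the paper.

```python
from typing import List


def winner_take_all(inputs: List[float], weights: List[List[float]]) -> List[int]:
    """Return a one-hot list marking the output unit with the largest weighted sum.

    weights[k][i] is the weight of input i on output unit k. Ties go to the
    lowest-index unit, which is an arbitrary choice made for this sketch.
    """
    sums = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    winner = max(range(len(sums)), key=lambda k: sums[k])
    return [1 if k == winner else 0 for k in range(len(sums))]


# Hypothetical weights for a two-input, two-output classifier.
weights = [
    [1.0, 0.5],  # unit 0 listens mostly to input 0
    [0.5, 1.0],  # unit 1 listens mostly to input 1
]
print(winner_take_all([2.0, 1.0], weights))  # weighted sums are 2.5 and 2.0
print(winner_take_all([1.0, 3.0], weights))  # weighted sums are 2.5 and 3.5
```

Changing the weights moves the boundary between the regions of input space where each unit wins, which is the idealized counterpart of the tunable decision boundaries reported in the abstract.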
espaloma: Extensible Surrogate Potential Optimized by Message-passing Algorithms: Molecular mechanics (MM) potentials have long been a workhorse of computational chemistry. Leveraging accuracy and speed, these functional forms find use in a wide variety of applications in biomolecular modeling and drug discovery, from rapid virtual screening to detailed free energy calculations. Traditionally, MM potentials have relied on human-curated, inflexible, and poorly extensible discrete chemical perception rules (atom types) for applying parameters to small molecules or biopolymers, making it difficult to optimize both types and parameters to fit quantum chemical or physical property data. Here, we propose an alternative approach that uses graph neural networks to perceive chemical environments, producing continuous atom embeddings from which valence and nonbonded parameters can be predicted using invariance-preserving layers. Since all stages are built from smooth neural functions, the entire process, spanning chemical perception to parameter assignment, is modular and end-to-end differentiable with respect to model parameters, allowing new force fields to be easily constructed, extended, and applied to arbitrary molecules. We show that this approach is not only sufficiently expressive to reproduce legacy atom types, but that it can learn and extend existing molecular mechanics force fields, construct entirely new force fields applicable to both biopolymers and small molecules from quantum chemical calculations, and even learn to accurately predict free energies from experimental observables.
alphafold_finetune: Python code for fine-tuning AlphaFold to perform protein-peptide binding predictions. This repository is a collaborative effort: Justas Dauparas implemented the AlphaFold changes necessary for fine-tuning and wrote a template of the fine-tuning script. Amir Motmaen and Phil Bradley further developed and extensively tested the fine-tuning and inference scripts in the context of protein-peptide binding.