Hot Topics #6 (July 5, 2022)
Repurposing AlphaFold for protein design, unsupervised learning of protein sequences, a complete cell atlas of aging, RNA binding + metabolism, and human heuristics for AI-generated language.
Editor’s Note: Sorry the post is one day late this week; I took advantage of the holiday weekend and took yesterday off. I hope everyone had a pleasant weekend. Next week I’ll resume the regular posting schedule.
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences; Rives et al.; December 15, 2020
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
The complete cell atlas of an aging multicellular organism; Roux et al.; June 16, 2022
Abstract: Here we describe a single-cell atlas of aging for the nematode Caenorhabditis elegans. This unique resource describes the expression across adulthood of over 20,000 genes among 211 groups of cells that correspond to virtually every cell type in this organism. Our findings suggest that C. elegans aging is not random and stochastic in nature, but rather characterized by coordinated changes in functionally related metabolic and stress-response genes in a highly cell-type specific fashion. Aging signatures of different cell types are largely different from one another, downregulation of energy metabolism being the only nearly universal change. Some biological pathways, such as genes associated with translation, DNA repair and the ER unfolded protein response, exhibited strong (in some cases opposite) changes in subsets of cell types, but many more were limited to a single cell type. Similarly, the rates at which cells aged, measured as genome-wide expression changes, differed between cell types; some of these differences were tested and validated in vivo by measuring age-dependent changes in mitochondrial morphology. In some, but not all, cell types, aging was characterized by an increase in cell-to-cell variance. Finally, we identified a set of transcription factors whose activities changed coordinately across many cell types with age. This set was strongly enriched for stress-resistance TFs known to influence the rate of aging. We tested other members of this set, and discovered that some, such as GEI-3, likely also regulate the rate of aging. Our dataset can be accessed and queried at c.elegans.aging.atlas.research.calicolabs.com/.
A computationally-enhanced hiCLIP atlas reveals Staufen1 RNA binding features and links 3’ UTR structure to RNA metabolism; Chakrabarti et al.; June 16, 2022
Abstract: The structure of mRNA molecules plays an important role in its interactions with trans-acting factors, notably RNA binding proteins (RBPs), thus contributing to the functional consequences of this interplay. However, current transcriptome-wide experimental methods to chart these interactions are limited by their poor sensitivity. Here we extend the hiCLIP atlas of duplexes bound by Staufen1 (STAU1) ∼10-fold, through careful consideration of experimental assumptions, and the development of bespoke computational methods which we apply to existing data. We present Tosca, a Nextflow computational pipeline for the processing, analysis and visualisation of proximity ligation sequencing data generally. We use our extended duplex atlas to discover insights into the RNA selectivity of STAU1, revealing the importance of structural symmetry and duplex-span-dependent nucleotide composition. Furthermore, we identify heterogeneity in the relationship between STAU1-bound 3’ UTRs and metabolism of the associated RNAs that we relate to RNA structure: transcripts with short-range proximal 3’ UTR duplexes have high degradation rates, but those with long-range duplexes have low rates. Overall, our work enables the integrative analysis of proximity ligation data delivering insights into specific features and effects of RBP-RNA structure interactions.
Smart-design of universally decorated nano-particles for drug delivery applications driven by active transport; Halbi et al.; June 16, 2022
Abstract: Targeting the cell nucleus remains a challenge for drug delivery. Here we present a universal platform for smart design of nano-particles (NPs) decoration that allows recruitment of multiple dynein motors to drive their active motion towards the nucleus. The uniqueness of our approach is based on using: (i) a spacer polymer, commonly Biotin-Polyethylene-glycol-thiol (B-PEG-SH), whose grafting density and molecular weight can be tuned thereby allowing NP transport optimization, and (ii) protein binding peptides, like cell penetrating, NLS, or cancer targeting, peptides. Universal chemistry is employed to link peptides to the PEG free-end. To manifest our platform, we use a SV40T large antigen-originating NLS peptide. Our modular design allows tuning the number of recruited motors, and to replace the NLS by a variety of other localization signal molecules. Our control of the NP decoration scheme, and the modularity of our platform, carries great advantage for nano-carrier design for drug delivery applications.
Human Heuristics for AI-Generated Language Are Flawed; Jakesch et al.; June 15, 2022
Abstract: Human communication is increasingly intermixed with language generated by AI. Across chat, email, and social media, AI systems produce smart replies, autocompletes, and translations. AI-generated language is often not identified as such but poses as human language, raising concerns about novel forms of deception and manipulation. Here, we study how humans discern whether one of the most personal and consequential forms of language - a self-presentation - was generated by AI. In six experiments, participants (N = 4,600) tried to detect self-presentations generated by state-of-the-art language models. Across professional, hospitality, and dating settings, we find that humans are unable to detect AI-generated self-presentations. Our findings show that human judgments of AI-generated language are handicapped by intuitive but flawed heuristics such as associating first-person pronouns, spontaneous wording, or family topics with humanity. We demonstrate that these heuristics make human judgment of generated language predictable and manipulable, allowing AI systems to produce language perceived as more human than human. We discuss solutions, such as AI accents, to reduce the deceptive potential of generated language, limiting the subversion of human intuition.
AlphaDesign: A de novo protein design framework based on AlphaFold; Jendrusch et al.; October 12, 2021
Abstract: De novo protein design is a longstanding fundamental goal of synthetic biology, but has been hindered by the difficulty in reliable prediction of accurate high-resolution protein structures from sequence. Recent advances in the accuracy of protein structure prediction methods, such as AlphaFold (AF), have facilitated proteome scale structural predictions of monomeric proteins. Here we develop AlphaDesign, a computational framework for de novo protein design that embeds AF as an oracle within an optimisable design process. Our framework enables rapid prediction of completely novel protein monomers starting from random sequences. These are shown to adopt a diverse array of folds within the known protein space. A recent and unexpected utility of AF to predict the structure of protein complexes, further allows our framework to design higher-order complexes. Subsequently a range of predictions are made for monomers, homodimers, heterodimers as well as higher-order homo-oligomers - trimers to hexamers. Our analyses also show potential for designing proteins that bind to a pre-specified target protein. Structural integrity of predicted structures is validated and confirmed by standard ab initio folding and structural analysis methods as well as more extensively by performing rigorous all-atom molecular dynamics simulations and analysing the corresponding structural flexibility, intramonomer and interfacial amino-acid contacts. These analyses demonstrate widespread maintenance of structural integrity and suggests that our framework allows for fairly accurate protein design. Strikingly, our approach also reveals the capacity of AF to predict proteins that switch conformation upon complex formation, such as involving switches from α-helices to β-sheets during amyloid filament formation. 
Correspondingly, when integrated into our design framework, our approach reveals de novo design of a subset of proteins that switch conformation between monomeric and oligomeric state.
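To make the idea of "a structure predictor as an oracle inside an optimisable design process" concrete, here is a minimal sketch. It is not the AlphaDesign implementation: `toy_oracle` is a hypothetical stand-in for an AlphaFold-derived confidence score, and the search is a simple greedy point-mutation loop chosen for brevity, which may differ from the optimiser the authors use.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_oracle(sequence: str) -> float:
    """Hypothetical stand-in for a structure-confidence score in [0, 1].

    A real pipeline would call a structure predictor here; this toy version
    just returns the fraction of residues from an arbitrary favoured set.
    """
    return sum(res in "AELMK" for res in sequence) / len(sequence)


def design(length: int, steps: int, oracle=toy_oracle, seed: int = 0):
    """Greedy point-mutation search starting from a random sequence."""
    rng = random.Random(seed)
    best = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    best_score = oracle(best)
    for _ in range(steps):
        # Propose a single-residue substitution at a random position.
        pos = rng.randrange(length)
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        score = oracle(candidate)
        # Keep the mutation only if the oracle does not score it worse.
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    sequence, score = design(length=30, steps=500)
    print(sequence, round(score, 3))
```

Swapping `toy_oracle` for a function that runs a real predictor and returns a confidence metric would turn this loop into a (slow) design procedure; the loop itself never needs to know how the score is computed.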
Using AlphaFold for Rapid and Accurate Fixed Backbone Protein Design; Moffat et al.; August 26, 2021
Abstract: The prediction of protein structure and the design of novel protein sequences and structures have long been intertwined. The recently released AlphaFold has heralded a new generation of accurate protein structure prediction, but the extent to which this affects protein design stands yet unexplored. Here we develop a rapid and effective approach for fixed backbone computational protein design, leveraging the predictive power of AlphaFold. For several designs we demonstrate that not only are the AlphaFold predicted structures in agreement with the desired backbones, but they are also supported by the structure predictions of other supervised methods as well as ab initio folding. These results suggest that AlphaFold, and methods like it, are able to facilitate the development of a new range of novel and accurate protein design methodologies.
Design of protein-binding proteins from the target structure alone; Cao et al.; March 24, 2022
Abstract: The design of proteins that bind to a specific site on the surface of a target protein using no information other than the three-dimensional structure of the target remains a challenge. Here we describe a general solution to this problem that starts with a broad exploration of the vast space of possible binding modes to a selected region of a protein surface, and then intensifies the search in the vicinity of the most promising binding modes. We demonstrate the broad applicability of this approach through the de novo design of binding proteins to 12 diverse protein targets with different shapes and surface properties. Biophysical characterization shows that the binders, which are all smaller than 65 amino acids, are hyperstable and, following experimental optimization, bind their targets with nanomolar to picomolar affinities. We succeeded in solving crystal structures of five of the binder–target complexes, and all five closely match the corresponding computational design models. Experimental data on nearly half a million computational designs and hundreds of thousands of point mutants provide detailed feedback on the strengths and limitations of the method and of our current understanding of protein–protein interactions, and should guide improvements of both. Our approach enables the targeted design of binders to sites of interest on a wide variety of proteins for therapeutic and diagnostic applications.
ProtBert; Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is trained on uppercase amino acids: it only works with capital letter amino acids.
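Because the model only accepts capital-letter amino acids, raw sequences usually need a small clean-up step before tokenization. The sketch below is a minimal example of that step. It assumes the commonly used ProtBert convention of space-separated residues, with the rare or ambiguous residues U, Z, O and B mapped to X; the helper name `prepare_protbert_input` is hypothetical and not part of any library.

```python
import re


def prepare_protbert_input(sequence: str) -> str:
    """Format a raw protein sequence as uppercase, space-separated residues."""
    # Drop any whitespace or line breaks, then force uppercase letters.
    cleaned = re.sub(r"\s+", "", sequence).upper()
    # Assumption: map rare/ambiguous residues (U, Z, O, B) to the unknown residue X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # Assumption: the tokenizer expects one residue per whitespace-separated token.
    return " ".join(cleaned)


if __name__ == "__main__":
    print(prepare_protbert_input("mkTaybu"))  # prints "M K T A Y X X"
```

The returned string is what would then be passed to the model's tokenizer.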
ProtGPT2; ProtGPT2 (preprint) is a language model that speaks the protein language and can be used for de novo protein design and engineering. ProtGPT2 generated sequences conserve natural proteins' critical features (amino acid propensities, secondary structural content, and globularity) while exploring unseen regions of the protein space.
Elegy; A high-level API for deep learning in JAX.
Main Features
😀 Easy-to-use: Elegy provides a Keras-like high-level API that makes it very easy to use for most common tasks.
💪 Flexible: Elegy provides a PyTorch Lightning-like low-level API that offers maximum flexibility when needed.
🔌 Compatible: Elegy supports various frameworks and data sources including Flax & Haiku Modules, Optax Optimizers, TensorFlow Datasets, PyTorch DataLoaders, and more.