Hot Topics #2 (June 6, 2022)
Generalist agents, protein design, and neural network equivariance.
A Generalist Agent. Reed et al.; May 19, 2022
Abstract: Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
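The core engineering move here is serialization: every modality is flattened into one token stream so a single autoregressive transformer can consume them all. Below is a toy Python sketch of that idea; the token ranges, bin count, and encoder functions are illustrative assumptions, not Gato's actual tokenization scheme.

    # Toy sketch: flatten text and continuous signals into one token stream.
    # Vocabulary offsets and encoders here are hypothetical, not Gato's.

    def tokenize_text(s, vocab_offset=0):
        # Toy byte-level text tokens.
        return [vocab_offset + b for b in s.encode("utf-8")]

    def tokenize_continuous(values, bins=1024, vocab_offset=50000):
        # Discretize continuous observations/actions (e.g. joint torques)
        # into a fixed number of bins, represented as sequence tokens.
        tokens = []
        for v in values:
            v = max(-1.0, min(1.0, v))                # clamp to [-1, 1]
            idx = int((v + 1.0) / 2.0 * (bins - 1))
            tokens.append(vocab_offset + idx)
        return tokens

    # One training sequence interleaving an instruction, a proprioceptive
    # observation, and an action; the model predicts the next token.
    episode = (
        tokenize_text("stack the red block")
        + tokenize_continuous([0.12, -0.40, 0.88])    # observation
        + tokenize_continuous([0.05, 0.00, -0.10])    # action
    )
    print(episode[:8])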
Multi-Game Decision Transformers. Lee et al.; May 30, 2022
Abstract: A longstanding goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model -- with a single set of weights -- trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction.
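The decision-transformer framing the paper builds on turns offline RL into ordinary sequence modeling: each trajectory becomes a stream of (return-to-go, observation, action) triples, and the model learns to predict actions conditioned on the return it should achieve. A toy sketch of that data layout; the format below is illustrative, not the paper's exact tokenization.

    # Toy sketch of decision-transformer training sequences.

    def returns_to_go(rewards):
        # Suffix sums: the return remaining from each timestep onward.
        rtg, total = [], 0.0
        for r in reversed(rewards):
            total += r
            rtg.append(total)
        return list(reversed(rtg))

    def build_sequence(observations, actions, rewards):
        rtg = returns_to_go(rewards)
        seq = []
        for g, o, a in zip(rtg, observations, actions):
            seq.extend([("rtg", g), ("obs", o), ("act", a)])
        return seq

    # Toy trajectory: 3 steps of an Atari-like episode.
    seq = build_sequence(observations=["s0", "s1", "s2"],
                         actions=[2, 3, 0],
                         rewards=[0.0, 1.0, 0.0])
    print(seq)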
A general-purpose protein design framework based on mining sequence–structure relationships in known protein structures. Zhou et al.; December 31, 2019
Abstract: Current state-of-the-art approaches to computational protein design (CPD) aim to capture the determinants of structure from physical principles. While this has led to many successful designs, it does have strong limitations associated with inaccuracies in physical modeling, such that a reliable general solution to CPD has yet to be found. Here, we propose a design framework—one based on identifying and applying patterns of sequence–structure compatibility found in known proteins, rather than approximating them from models of interatomic interactions. We carry out extensive computational analyses and an experimental validation for our method. Our results strongly argue that the Protein Data Bank is now sufficiently large to enable proteins to be designed by using only examples of structural motifs from unrelated proteins. Because our method is likely to have orthogonal strengths relative to existing techniques, it could represent an important step toward removing remaining barriers to robust CPD.
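The paper's central bet is that sequence statistics mined from structurally similar fragments can replace a physical energy function. The sketch below illustrates that style of design with assumed machinery: fragments are compared by their internal pairwise-distance matrices, and a toy in-memory database stands in for the PDB-scale motif library the paper actually mines.

    # Toy sketch of design-by-mining: score amino acids at a position by
    # pooling statistics from structurally similar database fragments.

    import numpy as np
    from collections import Counter

    def descriptor(coords):
        # Represent a backbone fragment by its internal pairwise distances,
        # which are invariant to rotation and translation.
        d = coords[:, None, :] - coords[None, :, :]
        return np.sqrt((d ** 2).sum(-1))

    def mine_profile(query_coords, database, rmsd_cut=1.0):
        # Pool central-residue identities from all fragments whose distance
        # matrices closely match the query fragment's.
        q = descriptor(query_coords)
        hits = Counter()
        for coords, central_aa in database:
            if np.sqrt(((descriptor(coords) - q) ** 2).mean()) < rmsd_cut:
                hits[central_aa] += 1
        total = sum(hits.values())
        return {aa: n / total for aa, n in hits.items()} if total else {}

    # Toy database: (fragment coordinates, amino acid at central position).
    rng = np.random.default_rng(0)
    frag = rng.normal(size=(5, 3))
    db = [(frag + rng.normal(scale=0.05, size=(5, 3)), aa) for aa in "LLVIA"]
    print(mine_profile(frag, db))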
Protein sequence design by conformational landscape optimization. Norn et al.; March 12, 2021
Abstract: The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen’s thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the desired structure is the lowest energy state. As this calculation involves not only all possible amino acid sequences but also, all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest-energy amino acid sequence for the desired structure, often checking by protein structure prediction in a second step that the desired structure is indeed the lowest-energy conformation for the designed sequence, and typically discarding a large fraction of designed sequences for which this is not the case. Here, we show that by backpropagating gradients through the transform-restrained Rosetta (trRosetta) structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures in a single calculation. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by conformational landscape optimization with the standard energy-based sequence design methodology in Rosetta and show that the former can result in energy landscapes with fewer alternative energy minima. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.
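The key trick is treating the sequence as a continuous, differentiable object: per-position amino-acid logits are softmaxed into a "soft" sequence, pushed through the structure predictor, and updated by gradient descent on a loss against the target geometry. Here is a minimal PyTorch sketch in which a random linear layer stands in for trRosetta; everything except the gradient-through-the-predictor pattern is an assumption.

    # Minimal sketch: backpropagate a structure loss into sequence logits.
    import torch

    L, A = 40, 20                                 # length, amino-acid alphabet
    predictor = torch.nn.Linear(L * A, L * L)     # stand-in for trRosetta
    target = torch.rand(L * L)                    # target inter-residue geometry

    logits = torch.zeros(L, A, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=0.1)

    for step in range(200):
        soft_seq = torch.softmax(logits, dim=-1)  # relaxed sequence
        pred = predictor(soft_seq.flatten())
        loss = torch.nn.functional.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()                           # gradients flow back...
        opt.step()                                # ...into the sequence

    designed = logits.argmax(dim=-1)              # discretize at the end
    print(loss.item(), designed[:10])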
Protein sequence profile prediction using ProtAlbert transformer. Behjati et al.; September 28, 2021
Abstract: Protein sequences can be viewed as a language; therefore, we benefit from using models initially developed for natural languages, such as transformers. ProtAlbert is one of the best pre-trained transformers on protein sequences, and its efficiency enables us to run the model on longer sequences with less computation power while performing similarly to the other pre-trained transformers. This paper includes two main parts: transformer analysis and profile prediction. In the first part, we propose five algorithms to assess the attention heads in different layers of ProtAlbert for five protein characteristics: nearest-neighbor interactions, type of amino acids, biochemical and biophysical properties of amino acids, protein secondary structure, and protein tertiary structure. These algorithms are performed on 55 proteins extracted from CASP13 and three case-study proteins whose sequences, experimental tertiary structures, and HSSP profiles are available. This assessment shows that although the model is only pre-trained on protein sequences, attention heads in the layers of ProtAlbert are representative of some protein family characteristics. This conclusion leads to the second part of our work: an algorithm called PA_SPP for protein sequence profile prediction by pre-trained ProtAlbert using masked-language modeling. PA_SPP can help researchers predict an HSSP profile when the database contains no sequences similar to the query.
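Profile prediction by masked-language modeling is easy to prototype: mask a position and read off the model's distribution over amino acids there. A sketch using the Hugging Face transformers API; the checkpoint name follows the public ProtTrans release (Rostlab/prot_albert) and the spaced-residue input convention, both of which should be treated as assumptions here.

    # Sketch: per-position amino-acid distribution via masked-LM inference.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    name = "Rostlab/prot_albert"                  # assumed checkpoint name
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)

    seq = "M K T A Y I A K Q R"                   # spaced residues (ProtTrans style)
    pos = 4                                       # residue position to profile
    residues = seq.split()
    residues[pos] = tok.mask_token
    inputs = tok(" ".join(residues), return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    mask_idx = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    probs = logits[0, mask_idx].softmax(-1)
    top = probs.topk(5)
    print([(tok.convert_ids_to_tokens(i.item()), p.item())
           for i, p in zip(top.indices, top.values)])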
De novo protein design by deep network hallucination. Anishchenko et al.; July 23, 2020
Abstract: There has been considerable recent progress in protein structure prediction using deep neural networks to infer distance constraints from amino acid residue co-evolution [1–3]. We investigated whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generated random amino acid sequences and input them into the trRosetta structure prediction network to predict starting distance maps, which, as expected, are quite featureless. We then carried out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (KL-divergence) between the distance distributions predicted by the network and the background distribution. Optimization from different random starting points resulted in a wide range of proteins with diverse sequences and all alpha, all beta sheet, and mixed alpha-beta structures. We obtained synthetic genes encoding 129 of these network-hallucinated sequences, expressed and purified the proteins in E. coli, and found that 27 folded to monomeric stable structures with circular dichroism spectra consistent with the hallucinated structures. Thus, deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute, alongside traditional physically based models, to the de novo design of proteins with new functions.
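The optimization loop is simple enough to caricature in a few lines: mutate a random sequence and keep changes that push the predictor's output distribution away from the featureless background, as measured by KL divergence. In the toy sketch below a random linear map stands in for trRosetta, and greedy uphill moves replace the paper's Monte Carlo acceptance rule.

    # Toy hallucination loop: maximize KL(predicted || background) by mutation.
    import numpy as np

    rng = np.random.default_rng(0)
    AAS = "ACDEFGHIKLMNPQRSTVWY"
    L, BINS = 30, 16
    W = rng.normal(size=(len(AAS), BINS)) * 0.5   # toy per-residue "predictor"
    background = np.full(BINS, 1.0 / BINS)        # featureless prior

    def predicted_dist(seq):
        # Toy "distance distribution": softmax over pooled residue features.
        x = W[[AAS.index(a) for a in seq]].mean(axis=0)
        e = np.exp(x - x.max())
        return e / e.sum()

    def kl(p, q):
        return float((p * np.log(p / q)).sum())

    seq = list(rng.choice(list(AAS), size=L))
    score = kl(predicted_dist(seq), background)
    for step in range(2000):
        i = rng.integers(L)
        trial = seq.copy()
        trial[i] = rng.choice(list(AAS))
        s = kl(predicted_dist(trial), background)
        if s > score:                             # greedy uphill moves only
            seq, score = trial, s
    print(score, "".join(seq))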
Naturally Occurring Equivariance in Neural Networks. Olah et al.; December 8, 2020
Abstract: Convolutional neural networks contain a hidden world of symmetries within themselves. This symmetry is a powerful tool in understanding the features and circuits inside neural networks. It also suggests that efforts to design neural networks with additional symmetries baked in may be on a promising track. To see these symmetries, we need to look at the individual neurons inside convolutional neural networks and the circuits that connect them. It turns out that many neurons are slightly transformed versions of the same basic feature. This includes rotated copies of the same feature, scaled copies, flipped copies, features detecting different colors, and much more. We sometimes call this phenomenon “equivariance,” since it means that switching the neurons is equivalent to transforming the input. Equivariance can be seen as a kind of “circuit motif,” an abstract recurring pattern across circuits analogous to motifs in systems biology. It can also be seen as a kind of larger-scale “structural phenomenon” (similar to weight banding and branch specialization), since a given equivariance type is often widespread in some layers and rare in others. In this article, we’ll focus on examples of equivariance in InceptionV1 trained on ImageNet, but we’ve observed at least some equivariance in every model trained on natural images we’ve studied.
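For the exact (rather than learned) version of this symmetry, convolution itself is the canonical example: shifting the input and then convolving gives the same result as convolving and then shifting the output. A small numeric check in NumPy:

    # Translation equivariance of convolution: transform-then-convolve equals
    # convolve-then-transform. Exact here; boundary effects can break strict
    # equality in general.
    import numpy as np

    x = np.array([0., 1., 3., 2., 0., 0., 0., 0.])
    k = np.array([1., -1.])                       # tiny edge-detector kernel

    def shift(v, n=2):
        return np.roll(v, n)

    lhs = np.convolve(shift(x), k, mode="same")   # transform, then convolve
    rhs = shift(np.convolve(x, k, mode="same"))   # convolve, then transform
    print(np.allclose(lhs, rhs))                  # True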
AI is Ushering In a New Scientific Revolution. Bryan McMahon; June 4, 2022
First Paragraph: Since the discovery of DNA’s structure in the 1950s, biologists have sought to tie lengths of genetic code to a range of cellular parts and processes—including, for example, the mRNA transcription of specific antibodies that powers the now-famous mRNA vaccines. Despite the progress in sequencing and understanding the genome since then, one big missing link remained: biologists lacked a way to accurately and efficiently predict the 3-D shape of an unknown protein using just its DNA or RNA source code. In biology, structure determines function. What a protein does in a cell depends on its shape. Cylindrical with a hollow middle makes for a good membrane receptor, while U-shaped enzymes catalyze chemical reactions in their fjord-like cavities. Being able to predict or even design proteins would be a leap forward in our understanding of human disease and unlock new treatments for a range of conditions.
Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking. Ganea et al.; March 15, 2022
Abstract: Protein complex formation is a central problem in biology, being involved in most of the cell's processes, and essential for applications, e.g. drug design or protein engineering. We tackle rigid body protein-protein docking, i.e., computationally predicting the 3D structure of a protein-protein complex from the individual unbound structures, assuming no conformational change within the proteins happens during binding. We design a novel pairwise-independent SE(3)-equivariant graph matching network to predict the rotation and translation to place one of the proteins at the right docked position relative to the second protein. We mathematically guarantee a basic principle: the predicted complex is always identical regardless of the initial locations and orientations of the two structures. Our model, named EquiDock, approximates the binding pockets and predicts the docking poses using keypoint matching and alignment, achieved through optimal transport and a differentiable Kabsch algorithm. Empirically, we achieve significant running time improvements and often outperform existing docking software despite not relying on heavy candidate sampling, structure refinement, or templates.
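The differentiable Kabsch step is the part that is easy to show in isolation: given two matched 3-D keypoint sets, the optimal rotation and translation come out in closed form from an SVD. A minimal NumPy sketch of the classical algorithm (the paper wraps this in a learned, SE(3)-equivariant keypoint-matching network):

    # Classical Kabsch alignment: recover the rigid motion between matched
    # point sets P, Q (each N x 3) in closed form.
    import numpy as np

    def kabsch(P, Q):
        p_mean, q_mean = P.mean(0), Q.mean(0)
        H = (P - p_mean).T @ (Q - q_mean)         # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = q_mean - R @ p_mean
        return R, t

    # Sanity check: recover a known rigid motion of random keypoints.
    rng = np.random.default_rng(0)
    P = rng.normal(size=(10, 3))
    theta = 0.7
    R_true = np.array([[np.cos(theta), -np.sin(theta), 0.],
                       [np.sin(theta),  np.cos(theta), 0.],
                       [0., 0., 1.]])
    Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])
    R, t = kabsch(P, Q)
    print(np.allclose(R, R_true), np.allclose(t, [1.0, -2.0, 0.5]))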
MedCLIP; Search for medical images with natural language, powered by a CLIP model fine-tuned on the Radiology Objects in COntext (ROCO) dataset.
DALL-E Mega Training Report; Follow live as the open-source DALL-E Mega model is trained.
DALL-E Mini; DALL·E mini is an AI model that generates images from any prompt you give!