Hot Topics #5 (June 27, 2022)
Playing Minecraft with video pretraining, evolving robots through large models and differentiable programming, ML for proteins, and learning neural network approximations for transcription dynamics.
Learning to Play Minecraft with Video PreTraining (VPT); OpenAI; June 23, 2022
First paragraph: We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Our model uses the native human interface of keypresses and mouse movements, making it quite general, and represents a step towards general computer-using agents.
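The key trick in VPT is the data flow: train a small inverse dynamics model (IDM) on the contractor-labeled clips, use it to pseudo-label the huge unlabeled video corpus, then behavior-clone on the result. A minimal sketch of that pipeline, with frames reduced to integers and the IDM reduced to a transition table (all names hypothetical, not OpenAI's code):

```python
# Toy sketch of the VPT data pipeline. Frames are reduced to integers and the
# inverse dynamics model (IDM) to a transition table; all names are hypothetical.

def train_idm(labeled_clips):
    """'Train' the IDM on contractor data: map frame transitions to actions."""
    return {(before, after): action for before, after, action in labeled_clips}

def pseudo_label(idm, video):
    """Infer the action between each pair of consecutive unlabeled frames."""
    return [(f0, idm.get((f0, f1), "noop"))
            for f0, f1 in zip(video, video[1:])]

# small contractor-labeled set: (frame_t, frame_t+1, action)
labeled = [(0, 1, "forward"), (1, 1, "jump")]
idm = train_idm(labeled)

# large unlabeled video corpus, now pseudo-labeled for behavior cloning
dataset = pseudo_label(idm, [0, 1, 1, 0])
```

The real IDM is non-causal (it sees future frames), which makes action inference much easier than the behavior-cloning task itself.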
Evolution through Large Models; Lehman et al.; June 17, 2022
Abstract: This paper pursues the insight that large language models (LLMs) trained to generate code can vastly improve the effectiveness of mutation operators applied to programs in genetic programming (GP). Because such LLMs benefit from training data that includes sequential changes and modifications, they can approximate likely changes that humans would make. To highlight the breadth of implications of such evolution through large models (ELM), in the main experiment ELM combined with MAP-Elites generates hundreds of thousands of functional examples of Python programs that output working ambulating robots in the Sodarace domain, which the original LLM had never seen in pre-training. These examples then help to bootstrap training a new conditional language model that can output the right walker for a particular terrain. The ability to bootstrap new models that can output appropriate artifacts for a given context in a domain where zero training data was previously available carries implications for open-endedness, deep learning, and reinforcement learning. These implications are explored here in depth in the hope of inspiring new directions of research now opened up by ELM.
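The core loop is ordinary quality-diversity search with the LLM standing in for the mutation operator. A toy sketch under heavy assumptions (programs reduced to integers, the LLM replaced by a random-edit stub, and a simple fitness/niche split for the MAP-Elites archive):

```python
import random

def llm_mutate(program):
    """Stand-in for the LLM-based mutation operator; real ELM prompts a code
    model for a plausible diff. Here 'programs' are just integers."""
    return program + random.choice([-2, -1, 1, 2])

def evaluate(program):
    fitness = -abs(program - 10)   # toy objective
    niche = program % 5            # behavior descriptor -> MAP-Elites cell
    return fitness, niche

random.seed(0)
archive = {}                       # niche -> (fitness, program)
seed_fit, seed_niche = evaluate(0)
archive[seed_niche] = (seed_fit, 0)

for _ in range(300):
    _, parent = random.choice(list(archive.values()))
    child = llm_mutate(parent)
    fit, niche = evaluate(child)
    if niche not in archive or fit > archive[niche][0]:
        archive[niche] = (fit, child)   # keep the elite of each niche
```

In the paper the niches are Sodarace walker behaviors and the archive of elites becomes training data for a new conditional model.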
The road to fully programmable protein catalysis; Lovelock et al.; June 1, 2022
Abstract: The ability to design efficient enzymes from scratch would have a profound effect on chemistry, biotechnology and medicine. Rapid progress in protein engineering over the past decade makes us optimistic that this ambition is within reach. The development of artificial enzymes containing metal cofactors and noncanonical organocatalytic groups shows how protein structure can be optimized to harness the reactivity of nonproteinogenic elements. In parallel, computational methods have been used to design protein catalysts for diverse reactions on the basis of fundamental principles of transition state stabilization. Although the activities of designed catalysts have been quite low, extensive laboratory evolution has been used to generate efficient enzymes. Structural analysis of these systems has revealed the high degree of precision that will be needed to design catalysts with greater activity. To this end, emerging protein design methods, including deep learning, hold particular promise for improving model accuracy. Here we take stock of key developments in the field and highlight new opportunities for innovation that should allow us to transition beyond the current state of the art and enable the robust design of biocatalysts to address societal needs.
Hallucinating protein assemblies; Wicky et al.; June 9, 2022
Abstract: Deep learning generative approaches provide an opportunity to broadly explore protein structure space beyond the sequences and structures of natural proteins. Here we use deep network hallucination to generate a wide range of symmetric protein homo-oligomers given only a specification of the number of protomers and the protomer length. Crystal structures of 7 designs are very close to the computational models (median RMSD: 0.6 Å), as are 3 cryoEM structures of giant rings with up to 1550 residues, C33 symmetry, and 10 nm in diameter; all differ considerably from previously solved structures. Our results highlight the rich diversity of new protein structures that can be created using deep learning, and pave the way for the design of increasingly complex nanomachines and biomaterials.
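The quoted comparison metric (median RMSD of 0.6 Å) is the root-mean-square deviation over matched atom positions. A minimal sketch, assuming the two structures are already optimally superimposed (real comparisons first align them, e.g. with the Kabsch algorithm):

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD over matched atoms, assuming the structures are pre-aligned
    (real pipelines superimpose first, e.g. via the Kabsch algorithm)."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# two 2-atom toy "structures" differing by 1 Å at one atom
r = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 1)])
```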
PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding; Xu et al.; June 5, 2022
Abstract: We are now witnessing significant progress by deep learning methods on a variety of protein tasks and datasets. However, the field lacks a standard benchmark for evaluating the performance of different methods, which hinders progress. In this paper, we propose such a benchmark called PEER, a comprehensive and multi-task benchmark for Protein sEquence undERstanding. PEER provides a set of diverse protein understanding tasks, including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction. We evaluate different types of sequence-based methods for each task, including traditional feature engineering approaches, different sequence encoding methods, and large-scale pre-trained protein language models. We also investigate the performance of these methods in the multi-task learning setting. Experimental results show that large-scale pre-trained protein language models achieve the best performance on most individual tasks, and that jointly training multiple tasks further boosts performance. The datasets and source code for this benchmark will be open-sourced soon.
Learning inverse folding from millions of predicted structures; Hsu et al.; April 10, 2022
Abstract: We consider the problem of predicting a protein sequence from its backbone atom coordinates. Machine learning approaches to this problem to date have been limited by the number of available experimentally determined protein structures. We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods. The model generalizes to a variety of more complex tasks including design of protein complexes, partially masked structures, binding interfaces, and multiple states.
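The headline numbers (51% native sequence recovery, 72% for buried residues) use a simple metric: the fraction of backbone positions at which the designed residue matches the native one. A sketch with a hypothetical 7-residue example:

```python
def sequence_recovery(native, designed):
    """Fraction of positions where the designed residue matches the native one."""
    assert len(native) == len(designed)
    return sum(a == b for a, b in zip(native, designed)) / len(native)

# hypothetical 7-residue pair with two substitutions (T->S, I->L)
r = sequence_recovery("MKTAYIA", "MKSAYLA")
```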
Spectral neural approximations for models of transcription dynamics; Gorin et al.; June 16, 2022
Abstract: Transcriptional systems involving discrete, stochastic events are naturally modeled using Chemical Master Equations (CMEs). These can be solved for microstate probabilities over time and state space for a better understanding of biological rates and system dynamics. However, closed form solutions to CMEs are available in only the simplest cases. Probing systems of higher complexity is challenging due to the computational cost of finding solutions and often compromises accuracy by treating infinite systems as finite. We use statistical understanding of system behavior and the generalizability of neural networks to approximate steady-state joint distribution solutions for a two-species model of the life cycle of RNA. We define a set of kernel functions using moments of the system and learn optimal weights for kernel functions with a neural network trained to minimize statistical distance between approximated and numerically calculated distributions. We show that this method of kernel weight regression (KWR) approximation is as accurate as lower-order generating-function solutions to the system, but faster; KWR approximation reduces the time for likelihood evaluation by several orders of magnitude. KWR also generalizes to produce probability predictions for system rates outside of training sets, thereby enabling efficient transcriptional parameter exploration and system analysis.
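The core idea is to express the distribution as a weighted sum of moment-informed kernels and fit the weights against a numerically computed reference. A minimal sketch under stated assumptions: Gaussian kernels on a truncated discrete state space, a single target distribution, and weights fit by direct gradient descent rather than the paper's neural network that predicts weights from system rates:

```python
import math

def kernel(x, mu, s):
    # unnormalized Gaussian kernel over the discrete state space
    return math.exp(-((x - mu) ** 2) / (2 * s * s))

# reference steady-state distribution on a truncated state space
# (stand-in for a numerically solved chemical master equation)
target = [kernel(x, 8.0, 2.0) for x in range(20)]
z = sum(target)
target = [t / z for t in target]

# kernel basis placed using coarse moment information (centers are assumptions)
basis = [[kernel(x, c, 2.0) for x in range(20)] for c in (4.0, 8.0, 12.0)]

# learn kernel weights by gradient descent on squared distance to the target
w = [1.0 / 3] * 3
for _ in range(2000):
    approx = [sum(wk * b[i] for wk, b in zip(w, basis)) for i in range(20)]
    grad = [2 * sum((approx[i] - target[i]) * b[i] for i in range(20)) for b in basis]
    w = [wk - 0.05 * g for wk, g in zip(w, grad)]

approx = [sum(wk * b[i] for wk, b in zip(w, basis)) for i in range(20)]
l1_error = sum(abs(a - t) for a, t in zip(approx, target))
```

Once the weights are learned as a function of the rate parameters, likelihood evaluation reduces to a cheap weighted kernel sum, which is where the speedup over repeatedly solving the CME comes from.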
CellScape: Protein structure visualization with vector graphics cartoons; Silvestre-Ryan et al.; June 16, 2022
Motivation: Illustrative renderings of proteins are useful aids for scientific communication and education. Nevertheless, few software packages exist to automate the generation of these visualizations.
Results: We introduce CellScape, a tool designed to generate 2D molecular cartoons from atomic coordinates and combine them into larger cellular scenes. These illustrations can outline protein regions at different levels of detail. Unlike the raster images produced by most molecular visualization tools, these illustrations are vector graphics, making them easily editable and composable with other graphics.
Availability and Implementation: CellScape is implemented in Python 3 and freely available at https://github.com/jordisr/cellscape. It can be run as a command-line tool or interactively in a Jupyter notebook.
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning; Sung et al.; June 13, 2022
Abstract: Fine-tuning large pre-trained models on downstream tasks has recently been adopted in a variety of domains. However, it is costly to update the entire parameter set of a large pre-trained model. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g., only 2% of the parameters) inside a pre-trained backbone network for a new task, they reduce the training memory requirement by at most 30%. This is because gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements far more substantially. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network: a small, separate network that takes intermediate activations as input via shortcut connections (ladders) from the backbone and makes predictions. LST has significantly lower memory requirements than previous methods because it does not require backpropagation through the backbone network, only through the side network and ladder connections. We evaluate our method with various models (T5, CLIP-T5) on both NLP (GLUE) and vision-language (VQA, GQA, NLVR2, MSCOCO) tasks. Relative to fine-tuning the whole network, LST saves 69% of the memory cost, while other methods save only 26% at similar parameter usage (2.7x greater memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in the low-memory regime. To further exploit this memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods. The same trend holds in our experiments on VL tasks.
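The architectural point is easy to see in miniature: the backbone runs forward-only, and the trainable parameters all live in a side path fed by ladder shortcuts, so gradients never enter the backbone. A structural sketch with hypothetical scalar "activations" and hand-picked weights (the real backbone and side network are transformer blocks with learned gating):

```python
def backbone(x, frozen_layers):
    """Frozen pretrained backbone: produces intermediate activations only."""
    acts, h = [], x
    for weight in frozen_layers:   # never updated during transfer
        h = h * weight             # stand-in for a transformer block
        acts.append(h)
    return acts

def side_network(x, acts, side_weights, gates):
    """Small trainable side network fed by ladder shortcut connections."""
    h = x
    for a, weight, g in zip(acts, side_weights, gates):
        h = g * a + (1 - g) * h    # ladder shortcut mixes in a backbone activation
        h = h * weight             # small trainable side block
    return h

frozen = [1.0, 2.0, 0.5]           # pretrained, frozen
side = [0.1, 0.1, 0.1]             # trainable
gates = [0.5, 0.5, 0.5]            # trainable mixing gates

acts = backbone(3.0, frozen)       # forward-only; no gradients needed here
out = side_network(3.0, acts, side, gates)
```

Because the loss depends only on `side` and `gates`, backpropagation (in a framework like PyTorch, with the backbone activations detached) touches only the narrow side path, which is the source of the memory savings.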
Severe Damage Recovery in Evolving Soft Robots through Differentiable Programming; Horibe et al.; June 14, 2022
Abstract: Biological systems are very robust to morphological damage, but artificial systems (robots) are currently not. In this paper we present a system based on neural cellular automata, in which locomoting robots are evolved and then given the ability to regenerate their morphology from damage through gradient-based training. Our approach thus combines the benefits of evolution to discover a wide range of different robot morphologies, with the efficiency of supervised training for robustness through differentiable update rules. The resulting neural cellular automata are able to grow virtual robots capable of regaining more than 80% of their functionality, even after severe types of morphological damage.
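The regeneration mechanism can be sketched with a toy 1D cellular automaton: each cell updates from its local neighborhood, so surviving tissue regrows the damaged region over repeated steps. In the paper the update rule is a neural network trained by differentiable programming and the cells are voxels of a 3D soft robot; here a hand-written rule stands in for the learned one:

```python
def step(cells):
    """Hand-written local rule: a dead cell (0) revives if a neighbor is alive.
    In the paper this update rule is a neural network trained by gradient descent."""
    n = len(cells)
    return [1 if cells[i] or cells[(i - 1) % n] or cells[(i + 1) % n] else 0
            for i in range(n)]

body = [1] * 8                     # intact virtual morphology
damaged = body[:]
for i in range(2, 6):              # sever half of the cells
    damaged[i] = 0

for _ in range(3):                 # regrow from the surviving tissue
    damaged = step(damaged)
```

Since the update is local and identical everywhere, the same rule handles damage anywhere in the body, which is what makes the learned version robust to unseen damage patterns.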
benchmark_VAE; This library implements some of the most common (Variational) Autoencoder models. In particular, it makes it possible to run benchmark experiments and comparisons by training the models with the same autoencoding neural network architecture. The "make your own autoencoder" feature allows you to train any of these models with your own data and your own encoder and decoder networks.
CHESS human protein structure database; Open access to 3D structure predictions for 194,780 human protein isoforms.
Chemical Graph 3D Molecule Modeler; View 3D graphs of different molecules by using the SMILES graph syntax.