Hot Topics #17 (Feb. 6, 2023)
RL for learning programs, LLMs for protein sequence generation, generating musical accompaniments, and more.
Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs: Liu et al.; Jan 30, 2023
Abstract: Aiming to produce reinforcement learning (RL) policies that are human-interpretable and can generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space to continuously parameterize diverse programs from a pre-generated program dataset, and then searches for a task-solving program in the learned program embedding space when given a task. Despite encouraging results, the program policies that LEAPS can produce are limited by the distribution of the program dataset. Furthermore, during searching, LEAPS evaluates each candidate program solely based on its return, failing to precisely reward correct parts of programs and penalize incorrect parts. To address these issues, we propose to learn a meta-policy that composes a series of programs sampled from the learned program embedding space. By composing programs, our proposed method can produce program policies that describe out-of-distributionally complex behaviors and directly assign credits to programs that induce desired behaviors. We design and conduct extensive experiments in the Karel domain. The experimental results show that our proposed framework outperforms baselines. The ablation studies confirm the limitations of LEAPS and justify our design choices.
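The composition idea can be illustrated with a small, self-contained sketch (not the authors' code): a meta-policy emits one latent vector per macro-step, each latent is decoded into a discrete program from a library, and the return is credited per program rather than per episode. The program library, nearest-prototype decoder, and toy environment below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
library = {"move": np.array([1.0, 0.0]), "turn": np.array([0.0, 1.0])}  # toy program prototypes

def decode(z):
    # Nearest-prototype decoding from the latent space to a concrete program name.
    return min(library, key=lambda name: np.linalg.norm(z - library[name]))

def execute(program, state):
    # Toy dynamics: "move" advances the agent, "turn" does not; reward = progress made.
    new_state = state + (1 if program == "move" else 0)
    return new_state, float(new_state - state)

state, per_program_credit = 0, []
for _ in range(5):                                # compose a series of programs
    z = rng.standard_normal(2)                    # meta-policy action (stubbed as noise)
    program = decode(z)
    state, reward = execute(program, state)
    per_program_credit.append((program, reward))  # credit assigned to each program separately
print(per_program_credit)
```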
Large language models generate functional protein sequences across diverse families: Madani et al.; Jan 26, 2023
Abstract: Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
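The controllable-generation interface described above can be sketched roughly as follows: a control tag (e.g., a protein family label) is prepended to the token sequence and an autoregressive model samples amino acids one at a time. The uniform next-token "model" and the `<lysozyme>` tag below are illustrative placeholders, not ProGen itself.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def next_token_distribution(context_tokens):
    # Placeholder for a trained language model: equal probability per residue.
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def generate(control_tag, length=30, seed=0):
    random.seed(seed)
    tokens = [control_tag]                      # the control tag conditions the generation
    for _ in range(length):
        probs = next_token_distribution(tokens)
        residues, weights = zip(*probs.items())
        tokens.append(random.choices(residues, weights=weights)[0])
    return "".join(tokens[1:])                  # drop the tag, keep the sequence

print(generate("<lysozyme>"))
```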
SingSong: Generating Musical Accompaniments from Singing: Donahue et al.; Jan 30, 2023
Abstract: We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus of music audio to produce aligned pairs of vocals and instrumental sources. Then, we adapt AudioLM (Borsos et al., 2022) -- a state-of-the-art approach for unconditional audio generation -- to be suitable for conditional "audio-to-audio" generation tasks, and train it on the source-separated (vocal, instrumental) pairs. In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong compared to those from a strong retrieval baseline.
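The data-preparation step can be sketched in a few lines: a source separator turns mixed music into aligned (vocal, instrumental) pairs, which then serve as (conditioning input, generation target) examples. The `separate_sources` function below is a hypothetical placeholder, not the separation model used in the paper.

```python
import numpy as np

def separate_sources(mix):
    # Placeholder separator: pretend half of the signal is vocals.
    vocals = 0.5 * mix
    instrumental = mix - vocals
    return vocals, instrumental

def build_pairs(corpus):
    pairs = []
    for mix in corpus:
        vocals, instrumental = separate_sources(mix)
        pairs.append((vocals, instrumental))   # conditioning input, generation target
    return pairs

corpus = [np.random.default_rng(i).standard_normal(16000) for i in range(3)]  # three fake clips
print(len(build_pairs(corpus)), "aligned (vocal, instrumental) pairs")
```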
Sound examples at this https URL
REPLUG: Retrieval-Augmented Black Box Language Models: Shi et al.; Jan 30, 2023
Abstract: We introduce REPLUG, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model. Unlike prior retrieval-augmented LMs that train language models with special cross attention mechanisms to encode the retrieved text, REPLUG simply prepends retrieved documents to the input for the frozen black-box LM. This simple design can be easily applied to any existing retrieval and language models. Furthermore, we show that the LM can be used to supervise the retrieval model, which can then find documents that help the LM make better predictions. Our experiments demonstrate that REPLUG with the tuned retriever significantly improves the performance of GPT-3 (175B) on language modeling by 6.3%, as well as the performance of Codex on five-shot MMLU by 5.1%.
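The black-box recipe is simple enough to sketch: retrieve the documents most similar to the query and prepend them to the prompt sent to a frozen LM. The lexical-overlap retriever and `frozen_lm` stub below are illustrative only; REPLUG itself uses a tuneable dense retriever.

```python
def score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)            # crude lexical similarity

def retrieve(query, docs, k=2):
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def frozen_lm(prompt):
    # Placeholder for a black-box LM call (e.g., an API request); no gradients needed.
    return f"[LM output for prompt of {len(prompt)} characters]"

docs = ["REPLUG prepends retrieved text to the input.",
        "Source separation yields vocal and instrumental stems.",
        "Retrieval augmentation helps language modeling."]
query = "How does retrieval augmentation help a language model?"
prompt = "\n\n".join(retrieve(query, docs)) + "\n\n" + query   # retrieved documents go first
print(frozen_lm(prompt))
```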
New AI classifier for indicating AI-written text: OpenAI Team; Jan 31, 2023
Abstract: We’ve trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe good classifiers can inform mitigations for false claims that AI-generated text was written by a human: for example, running automated misinformation campaigns, using AI tools for academic dishonesty, and positioning an AI chatbot as a human.
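For intuition only, a toy version of the task (not OpenAI's classifier) can be built with an off-the-shelf text classifier; the two-example "dataset" below is fabricated, and a real detector needs far more data and stronger features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I jotted this note down on the train this morning.",
         "As an AI language model, I can provide a detailed overview of the topic."]
labels = ["human", "ai"]

detector = make_pipeline(TfidfVectorizer(), LogisticRegression())  # simple supervised baseline
detector.fit(texts, labels)
print(detector.predict(["Certainly! Here is a comprehensive summary."]))
```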
Learning to Synthesize Programs as Interpretable and Generalizable Policies: Trivedi et al.; Aug 31, 2021
Abstract: Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced with DRL methods are not human-interpretable and often have difficulty generalizing to novel scenarios. To address these issues, prior works explore learning programmatic policies that are more interpretable and structured for generalization. Yet, these works either employ limited policy representations (e.g. decision trees, state machines, or predefined program templates) or require stronger supervision (e.g. input/output state pairs or expert demonstrations). We present a framework that instead learns to synthesize a program, which details the procedure to solve a task in a flexible and expressive manner, solely from reward signals. To alleviate the difficulty of learning to compose programs to induce the desired agent behavior from scratch, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner and then search over the learned program embedding space to yield a program that maximizes the return for a given task. Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies. We also justify the necessity of the proposed two-stage learning scheme as well as analyze various methods for learning the program embedding.
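The second stage, searching the learned embedding space for a return-maximizing program, can be sketched with a generic gradient-free optimizer. The cross-entropy-method loop and toy return function below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def toy_return(z):
    # Placeholder for "decode z into a program, run it, measure episode return".
    target = np.array([0.5, -1.0, 2.0])
    return -np.sum((z - target) ** 2)

def latent_search(return_fn, dim=3, iters=30, pop=64, elite=8, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        candidates = mean + std * rng.standard_normal((pop, dim))
        returns = np.array([return_fn(z) for z in candidates])
        elites = candidates[np.argsort(returns)[-elite:]]     # keep the best candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean

print(latent_search(toy_return).round(2))   # converges near the toy optimum
```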
Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets; Pocket2Mol uses equivariant graph neural networks to improve the sampling efficiency and molecule quality over previous structure-based drug design models.
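For readers unfamiliar with equivariant message passing, here is a compact illustrative step in the spirit of EGNN-style updates; it is not Pocket2Mol's actual architecture, and the toy message function is an assumption.

```python
import numpy as np

def equivariant_step(h, x):
    """h: node features (n, d); x: 3D coordinates (n, 3). Returns updated (h, x)."""
    n = len(x)
    diff = x[:, None, :] - x[None, :, :]              # pairwise displacement vectors
    dist2 = (diff ** 2).sum(-1, keepdims=True)        # invariant squared distances
    msg = np.tanh(dist2 + h.sum(-1)[None, :, None])   # toy edge messages (invariant)
    h_new = h + msg.sum(axis=1) / n                   # aggregate messages into features
    x_new = x + (diff * msg).sum(axis=1) / n          # coordinate update stays equivariant
    return h_new, x_new

rng = np.random.default_rng(0)
h, x = rng.standard_normal((4, 1)), rng.standard_normal((4, 3))
h2, x2 = equivariant_step(h, x)
print(x2.shape)  # rotating x before the step rotates x2 identically
```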
Fusion models for Atomic and molecular STructures (FAST); Predicting accurate protein-ligand binding affinity is important in drug discovery. This code implements a fusion network model that combines Spatial Graph CNN and 3D CNN models to improve binding affinity prediction. The code is written in Python with TensorFlow and PyTorch.
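The late-fusion idea can be sketched in PyTorch: embeddings from a graph branch and a 3D-grid branch are concatenated and passed to a regression head that predicts binding affinity. The tiny linear layers below are placeholders standing in for the actual Spatial Graph CNN and 3D CNN branches.

```python
import torch
import torch.nn as nn

class FusionAffinity(nn.Module):
    def __init__(self, graph_dim=16, grid_dim=16, hidden=32):
        super().__init__()
        self.graph_branch = nn.Linear(graph_dim, hidden)   # stand-in for a graph CNN
        self.grid_branch = nn.Linear(grid_dim, hidden)     # stand-in for a 3D CNN
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, 1))

    def forward(self, graph_feats, grid_feats):
        fused = torch.cat([self.graph_branch(graph_feats),
                           self.grid_branch(grid_feats)], dim=-1)
        return self.head(fused)                            # predicted binding affinity

model = FusionAffinity()
affinity = model(torch.randn(2, 16), torch.randn(2, 16))
print(affinity.shape)  # torch.Size([2, 1])
```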
LightDock; LightDock is a protein-protein, protein-peptide and protein-DNA docking framework based on the Glowworm Swarm Optimization (GSO) algorithm.
The LightDock framework is highly versatile, with many options that users can extend and tune: it accepts any user-defined scoring function, supports local gradient-free minimization, allows the simulation to be restrained from the start to focus on user-assigned interacting regions, and supports residue restraints in both the receptor and ligand partners.
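For a feel of the underlying optimizer, here is a bare-bones Glowworm Swarm Optimization step: each glowworm updates its luciferin from the scoring function and moves toward a brighter neighbor. The constants and toy scoring function are illustrative, not LightDock's.

```python
import numpy as np

rng = np.random.default_rng(0)

def scoring(x):                      # placeholder for a docking scoring function
    return -np.sum(x ** 2, axis=-1)  # higher is better near the origin

positions = rng.uniform(-5, 5, size=(20, 3))   # 20 glowworms in a 3D search space
luciferin = np.zeros(20)
rho, gamma, step = 0.4, 0.6, 0.1

for _ in range(50):
    luciferin = (1 - rho) * luciferin + gamma * scoring(positions)   # luciferin update
    for i in range(len(positions)):
        brighter = np.flatnonzero(luciferin > luciferin[i])          # candidate neighbors
        if brighter.size:
            j = rng.choice(brighter)                                 # pick a brighter worm
            direction = positions[j] - positions[i]
            positions[i] += step * direction / (np.linalg.norm(direction) + 1e-9)

print(positions.mean(axis=0).round(2))   # swarm drifts toward the high-scoring region
```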
ML for Drug Discovery Workshop:
Overview from website:
We are at a pivotal moment in healthcare, characterized by unprecedented scientific and technological progress in recent years together with the promise of personalized medicine to radically transform the way we provide care to patients. However, drug discovery has become an increasingly challenging endeavor: not only has the success rate of developing new therapeutics been historically low, but this rate has been steadily declining. The average cost to bring a new drug to market (factoring in failures) is now estimated at $2.6 billion – 140% higher than a decade earlier.
Machine learning-based approaches present a unique opportunity to address this challenge. Last year, the first ‘Machine Learning for Drug Discovery’ (MLDD) workshop at ICLR 2022 brought together hundreds of attendees and world-class experts in ML for drug discovery. The second edition of the MLDD workshop aims to bring together the community to discuss cutting-edge research in this area on the following three themes, covering the end-to-end drug discovery process:
Genetic & molecular representation learning: Methods aiming to learn compact, lower-dimensional representations of high-dimensional structured biological objects (e.g., DNA, proteins, small molecules). The objective is then to leverage these representations in disease prediction models (e.g., variant effect prediction) or to quantify the affinity between two biological entities (e.g., binding between antibody and viral proteins) to support drug and vaccine design.
Molecule optimization & target identification: Approaches to enhance the identification or generation of new molecules that optimize specific properties of interest (e.g., drug-likeness, solubility). This is crucial for efficient large-scale screening of drug precursors and the design of protein biotherapeutics.
Biological experiment design: Methods to guide the design and execution of complex biological experiments (e.g., active learning), in particular the efficient exploration of experiment spaces that span hundreds of billions of potential configurations. The overarching goal is to uncover causal relationships between genes and pathologies and subsequently identify more promising drug targets.
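As a concrete illustration of the active-learning idea mentioned in the last theme, here is a toy acquisition loop: from a pool of candidate experiments, repeatedly run the one a surrogate model is least certain about, then retrain. The random-forest surrogate and synthetic assay below are placeholders, not any particular workshop method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
pool = rng.uniform(size=(500, 5))        # unlabeled candidate experiments

def assay(x):                            # hidden ground-truth response (stand-in)
    return x[:, 0] + x[:, 1]

labeled = list(rng.choice(len(pool), 10, replace=False))
for _ in range(20):                      # 20 rounds of experiment selection
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(pool[labeled], assay(pool[labeled]))
    tree_preds = np.stack([t.predict(pool) for t in model.estimators_])
    uncertainty = tree_preds.std(axis=0)          # disagreement across trees
    uncertainty[labeled] = -np.inf                # never repeat an experiment
    labeled.append(int(np.argmax(uncertainty)))   # run the most informative one

print(f"ran {len(labeled)} experiments")
```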
The workshop will feature talks from leading researchers and pioneers in ML applied to drug discovery, a community challenge, as well as spotlight presentations and poster sessions for accepted papers.