Hot Topics #12 (August 29, 2022)
Classification of speakers in audio, quantification of venom-induced hemorrhage, protein engineering via Bayesian optimization, and more.

Molecular and in vivo studies of a glutamate-class prolyl-endopeptidase for coeliac disease therapy, Amo-Maestro et al.; August 1, 2022
Abstract: The digestion of gluten generates toxic peptides, among which a highly immunogenic proline-rich 33-mer from wheat α-gliadin, that trigger coeliac disease. Neprosin from the pitcher plant is a reported prolyl endopeptidase. Here, we produce recombinant neprosin and its mutants, and find that full-length neprosin is a zymogen, which is self-activated at gastric pH by the release of an all-β pro-domain via a pH-switch mechanism featuring a lysine plug. The catalytic domain is an atypical 7+8-stranded β-sandwich with an extended active-site cleft containing an unprecedented pair of catalytic glutamates. Neprosin efficiently degrades both gliadin and the 33-mer in vitro under gastric conditions and is reversibly inactivated at pH > 5. Moreover, co-administration of gliadin and the neprosin zymogen at the ratio 500:1 reduces the abundance of the 33-mer in the small intestine of mice by up to 90%. Neprosin therefore founds a family of eukaryotic glutamate endopeptidases that fulfils requisites for a therapeutic glutenase.
Generative Extraction of Audio Classifiers for Speaker Identification, Afonja et al.; July 26, 2022
Abstract: It is perhaps no longer surprising that machine learning models, especially deep neural networks, are particularly vulnerable to attacks. One such vulnerability that has been well studied is model extraction: a phenomenon in which the attacker attempts to steal a victim's model by training a surrogate model to mimic the decision boundaries of the victim model. Previous works have demonstrated the effectiveness of such an attack and its devastating consequences, but much of this work has been done primarily for image and text processing tasks. Our work is the first attempt to perform model extraction on audio classification models. We are motivated by an attacker whose goal is to mimic the behavior of the victim's model trained to identify a speaker. This is particularly problematic in security-sensitive domains such as biometric authentication. We find that prior model extraction techniques, where the attacker naively uses a proxy dataset to attack a potential victim's model, fail. We therefore propose the use of a generative model to create a sufficiently large and diverse pool of synthetic attack queries. We find that our approach is able to extract a victim's model trained on LibriSpeech using queries synthesized with a proxy dataset based off of VoxCeleb; we achieve a test accuracy of 84.41% with a budget of 3 million queries.
ALOHA: AI-guided tool for the quantification of venom-induced haemorrhage in mice, Jenkins et al.; August 5, 2022
Abstract: Venom-induced haemorrhage constitutes a severe pathology in snakebite envenomings, especially those inflicted by viperid species. In order both to explore venom compositions accurately and to evaluate the efficacy of viperid antivenoms for the neutralisation of haemorrhagic activity, it is essential to have available a precise, quantitative tool for empirically determining venom-induced haemorrhage. Thus, we have built on our prior approach and developed a new AI-guided tool (ALOHA) for the quantification of venom-induced haemorrhage in mice. Using a smartphone, it takes less than a minute to take a photo, upload the image, and receive accurate information on the magnitude of a venom-induced haemorrhagic lesion in mice. This substantially decreases analysis time, reduces human error, and does not require expert haemorrhage analysis skills. Furthermore, its open access web-based graphical user interface makes it easy to use and implement in laboratories across the globe. Together, this will reduce the resources required to preclinically assess and control the quality of antivenoms, whilst also expediting the profiling of haemorrhagic activity in venoms for the wider toxinology community.
Efficient base-catalysed Kemp elimination in an engineered ancestral enzyme, Gutierrez-Rus et al.; July 30, 2022
Abstract: The routine generation of enzymes with completely new active sites is one of the major unsolved problems in protein engineering. Advances in this field have been so far modest, perhaps due, at least in part, to the widespread use of modern natural proteins as scaffolds for de novo engineering. Most modern proteins are highly evolved and specialized, and, consequently, difficult to repurpose for completely new functionalities. Conceivably, resurrected ancestral proteins with the biophysical properties that promote evolvability, such as high stability and conformational diversity, could provide better scaffolds for de novo enzyme generation. Kemp elimination, a non-natural reaction that provides a simple model of proton abstraction from carbon, has been extensively used as a benchmark in de novo enzyme engineering. Here, we present an engineered ancestral β-lactamase with a new active site capable of efficiently catalysing the Kemp elimination. Our Kemp eliminase is the outcome of a minimalist design based on a single function-generating mutation followed by sharply-focused, low-throughput library screening. Yet, its catalytic parameters (kcat/KM = 2·10⁵ M⁻¹ s⁻¹, kcat = 635 s⁻¹) compare favourably with the average modern natural enzyme and with the best proton-abstraction de novo Kemp eliminases reported in the literature. General implications of our results for de novo enzyme engineering are discussed.
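A quick aside on the two catalytic parameters quoted in this abstract: dividing the turnover number by the catalytic efficiency gives the Michaelis constant they imply. The short Python sketch below does only that arithmetic. The variable names are ours, and because the abstract gives the efficiency to a single significant figure, the result should be read as a rough estimate of about 3 mM rather than a value reported by the authors.

```python
# Values as quoted in the abstract (rounded by the authors).
kcat = 635.0       # turnover number, s^-1
efficiency = 2e5   # catalytic efficiency kcat/KM, M^-1 s^-1

# Rearranging efficiency = kcat / KM gives KM = kcat / efficiency (in mol/L).
km_molar = kcat / efficiency
km_millimolar = km_molar * 1e3

print(f"implied KM is roughly {km_millimolar:.1f} mM")
```

A KM in the millimolar range alongside a kcat in the hundreds per second suggests the high efficiency comes mostly from fast turnover rather than tight substrate binding, though that reading rests on the rounded figures above.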
Protein engineering via Bayesian optimization-guided evolutionary algorithm and robotic experiments, Hu et al.; August 12, 2022
Abstract: Protein engineering aims to find top functional sequences in a vast design space. For such an expensive “black-box” function optimization problem, Bayesian optimization is a principled sample-efficient approach, which is guided by a surrogate model of the objective function. Unfortunately, Bayesian optimization is computationally intractable with the vast search space. Even worse, it proposes sequences sequentially, making it incompatible with batched wet-lab measurement. Here, we report a scalable and batched method, Bayesian Optimization-guided EVOlutionary (BO-EVO) algorithm, to guide multiple rounds of robotic experiments to explore protein fitness landscapes of combinatorial mutagenesis libraries. We first examined various design specifications based on an empirical landscape of protein G domain B1. Then, BO-EVO was successfully generalized to another empirical landscape of an Escherichia coli kinase PhoQ, as well as simulated NK landscapes with up to moderate epistasis. This approach was then applied to guide robotic library creation and screening to engineer enzyme specificity of RhlA, a key biosynthetic enzyme for rhamnolipid biosurfactants. A 4.8-fold improvement in producing a target rhamnolipid congener was achieved after examining less than 1% of all possible mutants after 4 iterations. Overall, BO-EVO proves to be an efficient and general approach to guide combinatorial protein engineering without prior knowledge.
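To make the idea of a batched, surrogate-guided evolutionary loop more concrete, here is a toy Python sketch. It is not the authors' BO-EVO implementation and is not Bayesian optimization proper: the "assay" is a made-up additive landscape, the surrogate is a simple per-site running mean, and the count-based bonus is only a crude stand-in for posterior uncertainty. All names (`make_toy_landscape`, `SiteSurrogate`, `run`) and all parameter values are hypothetical.

```python
import random
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids
N_SITES = 4                        # hypothetical number of mutated positions


def make_toy_landscape(seed=0):
    """Hidden additive fitness function standing in for a wet-lab assay."""
    rng = random.Random(seed)
    weights = [{aa: rng.random() for aa in ALPHABET} for _ in range(N_SITES)]
    return lambda seq: sum(weights[i][aa] for i, aa in enumerate(seq))


class SiteSurrogate:
    """Per-site running means plus a count-based exploration bonus."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, seq, y):
        for i, aa in enumerate(seq):
            self.total[(i, aa)] += y / len(seq)
            self.count[(i, aa)] += 1

    def acquisition(self, seq, beta=0.5):
        score = 0.0
        for i, aa in enumerate(seq):
            n = self.count[(i, aa)]
            mean = self.total[(i, aa)] / n if n else 0.5  # 0.5 = arbitrary prior
            score += mean + beta / (1 + n) ** 0.5         # less data, larger bonus
        return score


def mutate(seq, rng):
    """Replace one randomly chosen position with a random residue."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(ALPHABET) + seq[i + 1:]


def run(rounds=4, batch=24, seed=1):
    rng = random.Random(seed)
    fitness = make_toy_landscape()
    model = SiteSurrogate()
    measured = {}

    # Round 0: a random starting batch of distinct sequences.
    proposals = []
    while len(proposals) < batch:
        s = "".join(rng.choice(ALPHABET) for _ in range(N_SITES))
        if s not in proposals:
            proposals.append(s)

    for _ in range(rounds):
        # "Measure" the whole batch at once, then refit the surrogate.
        for s in proposals:
            measured[s] = fitness(s)
            model.update(s, measured[s])
        # Evolutionary step: mutate the best sequences seen so far.
        parents = sorted(measured, key=measured.get, reverse=True)[:batch]
        pool = {mutate(rng.choice(parents), rng) for _ in range(20 * batch)}
        pool = sorted(pool - set(measured))
        # Surrogate-guided step: keep the top-scoring unmeasured candidates.
        proposals = sorted(pool, key=model.acquisition, reverse=True)[:batch]
    return measured


measured = run()
best = max(measured, key=measured.get)
print(len(measured), "sequences measured out of", len(ALPHABET) ** N_SITES)
print("best:", best, round(measured[best], 3))
```

The design point the sketch tries to convey is the division of labour: mutation generates a cheap candidate pool, the surrogate ranks it, and only one batch per round goes to the expensive measurement. In this toy setting the loop measures at most 96 of the 160,000 possible four-site variants; those numbers belong to the toy and say nothing about the library sizes in the paper.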
High-throughput sequencing analysis of nuclear-encoded mitochondrial genes reveals a genetic signature of human longevity, Gonzalez et al.; August 10, 2022
Abstract: Mitochondrial dysfunction is a well-known contributor to aging and age-related diseases. The precise mechanisms through which mitochondria impact human lifespan, however, remain unclear. We hypothesize that humans with exceptional longevity harbor rare variants in nuclear-encoded mitochondrial genes (mitonuclear genes) that confer resistance against age-related mitochondrial dysfunction. Here we report an integrated functional genomics study to identify rare functional variants in ~ 660 mitonuclear candidate genes discovered by target capture sequencing analysis of 496 centenarians and 572 controls of Ashkenazi Jewish descent. We identify and prioritize longevity-associated variants, genes, and mitochondrial pathways that are enriched with rare variants. We provide functional gene variants such as those in MTOR (Y2396Lfs*29), CPS1 (T1406N), and MFN2 (G548*) as well as LRPPRC (S1378G) that is predicted to affect mitochondrial translation. Taken together, our results suggest a functional role for specific mitonuclear genes and pathways in human longevity.
OmegaFold protein folding on Google Colab; Released code
This is the release code for the paper "High-resolution de novo structure prediction from primary sequence".
We will continue to optimize this repository for ease of use, for instance by reducing the GPU memory required to run inference on long proteins and possibly releasing stronger models.
Evolutionary Scale Modeling (esm): Pretrained language models for proteins; This repository contains code and pre-trained weights for Transformer protein language models from Facebook AI Research, including our state-of-the-art ESM-2 and MSA Transformer, as well as ESM-1v for predicting variant effects and ESM-IF1 for inverse folding. Transformer protein language models were introduced in our paper, "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019).