Hot Topics #30 (July 15, 2024)
Non-standard proteins and AlphaFold3, fusing approaches for protein structure prediction, predicting mutation effects, learning temporal distances, and more.
Non-standard proteins in the lenses of AlphaFold3 - case study of amyloids: Wojciechowska et al.: July 12, 2024
Abstract: Motivation: The recent release of AlphaFold3 raises questions about its powers and limitations. Here, we analyze the potential of AlphaFold3 to correctly reproduce amyloid structures, an example of multimeric proteins characterized by polymorphism and low representation in protein structure databases. Results: We show that AlphaFold3 is capable of producing amyloid-like assemblies with high similarity to experimental structures, although its results are affected by the number of monomers in the predicted fibril. It produces structurally diverse models of some amyloid proteins, which could reflect the polymorphism observed in nature. We hypothesize that the lower emphasis on multiple sequence alignment (MSA) in AlphaFold3 improves the quality of the results, since for this class of proteins sequence homology is not necessary for structural similarity. Notably, the structural landscape obtained from the modeling does not reflect the real one governed by thermodynamics, which nevertheless does not hamper the modeling of amyloid proteins. Finally, AlphaFold3 opens the door to fast structural modeling of fibril-like structures, including their polymorphic nature.
StarFunc: fusing template-based and deep learning approaches for accurate protein function prediction: Zhang et al.: May 18, 2024
Abstract: Deep learning has significantly advanced the development of high-performance methods for protein function prediction. Nonetheless, even for state-of-the-art deep learning approaches, template information remains an indispensable component in most cases. While many function prediction methods use templates identified through sequence homology or protein-protein interactions, very few methods detect templates through structural similarity, even though protein structures are the basis of their functions. Here, we describe our development of StarFunc, a composite approach that integrates state-of-the-art deep learning models seamlessly with template information from sequence homology, protein-protein interaction partners, proteins with similar structures, and protein domain families. Large-scale benchmarking and blind testing in the 5th Critical Assessment of Function Annotation (CAFA5) consistently demonstrate StarFunc’s advantage when compared to both state-of-the-art deep learning methods and conventional template-based predictors.
Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning: Wu et al.: May 16, 2024
Abstract: Protein-protein binding plays a key role in a variety of fundamental biological processes, and thus predicting the effects of amino acid mutations on protein-protein binding is crucial. To tackle the scarcity of annotated mutation data, pre-training with massive unlabeled data has emerged as a promising solution. However, this process faces a series of challenges: (1) complex higher-order dependencies among multiple (more than paired) structural scales have not yet been fully captured; (2) it is rarely explored how mutations alter the local conformation of the surrounding microenvironment; (3) pre-training is costly, both in data size and computational burden. In this paper, we first construct a hierarchical prompt codebook to record common microenvironmental patterns at different structural scales independently. Then, we develop a novel codebook pre-training task, namely masked microenvironment modeling, to model the joint distribution of each mutation with its residue types, angular statistics, and local conformational changes in the microenvironment. With the constructed prompt codebook, we encode the microenvironment around each mutation into multiple hierarchical prompts and combine them to flexibly provide information to wild-type and mutated protein complexes about their microenvironmental differences. Such a hierarchical prompt learning framework has demonstrated superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction and a case study of optimizing human antibodies against SARS-CoV-2.
Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making: Myers et al.: June 24, 2024
Abstract: Temporal distances lie at the heart of many algorithms for planning, control, and reinforcement learning that involve reaching goals, allowing one to estimate the transit time between two states. However, prior attempts to define such temporal distances in stochastic settings have been stymied by an important limitation: these prior approaches do not satisfy the triangle inequality. This is not merely a definitional concern, but translates to an inability to generalize and find shortest paths. In this paper, we build on prior work in contrastive learning and quasimetrics to show how successor features learned by contrastive learning (after a change of variables) form a temporal distance that does satisfy the triangle inequality, even in stochastic settings. Importantly, this temporal distance is computationally efficient to estimate, even in high-dimensional and stochastic settings. Experiments in controlled settings and benchmark suites demonstrate that an RL algorithm based on these new temporal distances exhibits combinatorial generalization (i.e., "stitching") and can sometimes learn more quickly than prior methods, including those based on quasimetrics.
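The quasimetric point in the abstract is easy to see in a toy setting: on a directed graph, shortest-path transit times can be asymmetric (d(a, b) ≠ d(b, a)), yet they still satisfy the triangle inequality that the paper argues prior stochastic temporal distances lack. A minimal sketch (the graph and names are illustrative, not from the paper):

```python
import itertools

INF = float("inf")

# Toy deterministic environment as a directed graph; the temporal distance
# d(x, y) is the minimum number of steps needed to reach y from x.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

# Floyd-Warshall all-pairs shortest paths.
d = {(u, v): (0 if u == v else INF) for u in nodes for v in nodes}
for u, v in edges:
    d[(u, v)] = 1
for k in nodes:
    for i in nodes:
        for j in nodes:
            d[(i, j)] = min(d[(i, j)], d[(i, k)] + d[(k, j)])

# The distance is asymmetric (a quasimetric, not a metric)...
assert d[("a", "b")] == 1 and d[("b", "a")] == 2

# ...yet it satisfies the triangle inequality for every triple of states,
# which is the structural property needed for stitching shortest paths.
for x, y, z in itertools.product(nodes, repeat=3):
    assert d[(x, z)] <= d[(x, y)] + d[(y, z)]
```

The paper's contribution is showing that a distance with exactly this structure can be recovered from contrastively learned successor features in stochastic, high-dimensional settings, where naive definitions (e.g., expected hitting times) break the inequality.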
Unsupervised evolution of protein and antibody complexes with a structure-informed language model: Shanker et al.: July 4, 2024
Abstract: Large language models trained on sequence information alone can learn high-level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here, we show that a general protein language model augmented with protein structure backbone coordinates can guide evolution for diverse proteins without the need to model individual functional tasks. We also demonstrate that ESM-IF1, which was only trained on single-chain structures, can be extended to engineer protein complexes. Using this approach, we screened about 30 variants of two therapeutic clinical antibodies used to treat severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. We achieved up to 25-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants of concern BQ.1.1 and XBB.1.5, respectively. These findings highlight the advantage of integrating structural information to identify efficient protein evolution trajectories without requiring any task-specific training data.
The Illustrated AlphaFold, by Elena P. Simon, is “a visual walkthrough of the AlphaFold3 architecture, with more details and diagrams than you were probably looking for.”
py3Dmol allows you to view PDB structures inside a Jupyter notebook. See examples.
Load Balancing, by samwho.dev, is a post explaining how load balancers distribute requests across servers.
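The simplest strategy such a post builds from is round robin: hand each incoming request to the next server in a fixed cyclic order, so every server receives an equal share. A minimal sketch (class and server names are hypothetical):

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin load balancer: requests are dispatched to
    servers in a fixed cyclic order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        # Pick the next server in the rotation and pair it with the request.
        server = next(self._cycle)
        return server, request

lb = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [lb.route(f"req-{i}")[0] for i in range(6)]
# Six requests across three servers: each server handles exactly two.
```

Round robin assumes all requests cost roughly the same; when they don't, fancier strategies (e.g., weighting servers or tracking in-flight requests) spread load more evenly, which is the kind of trade-off the post visualizes.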
AlphaPulldown is a Python package that streamlines protein-protein interaction screens and high-throughput modelling of higher-order oligomers using AlphaFold-Multimer.