Machine Learning Seminar

Design and Discovery in the Protein Fitness Landscape

by

Stefano Martiniani
Department of Chemical Engineering and Materials Science
University of Minnesota

Wednesday, December 16, 2020
3:30–4:30 pm

Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability (quantified by expression, solubility, and stability) hinders commercialization. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies have enabled a high-throughput (HT) developability dataset for 10⁵ of the 10²⁰ possible variants of the protein scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from the HT dataset and transfer the knowledge to predict recombinant expression beyond the observed sequences. Our model predicts expression levels 42% closer to the experimental variance than a non-embedded control. We then seek to exploit this knowledge to design new sequences with high developability. To overcome the intractability of a brute-force search over the 10²⁰ possible Gp2 variants, we descend through the protein fitness landscape by Nested Sampling, a Monte Carlo scheme for Bayesian parameter estimation and model selection that is particularly suited to the analysis of multimodal distributions. In addition to identifying high-developability libraries, we obtain unprecedented insight into the structure of the protein fitness landscape through a “topographical” analysis and statistical mechanical interpretation of the results.
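
For readers unfamiliar with Nested Sampling, the following is a minimal toy sketch of the general idea and not the speaker's implementation. It assumes a hypothetical four-letter alphabet, eight mutable positions, and a made-up energy function (mismatches against an arbitrary target string) standing in for negative developability. A set of live sequences is maintained; at each step the highest-energy sequence is discarded, its energy becomes the new ceiling, and a replacement is drawn by a random walk confined strictly below that ceiling. Ties in the discrete energy are handled crudely, and no evidence or prior-volume estimate is computed.

```python
import random

ALPHABET = "ACDE"    # hypothetical reduced alphabet, for illustration only
LENGTH = 8           # hypothetical number of mutable positions
TARGET = "ACDEACDE"  # arbitrary target defining the toy energy


def energy(seq):
    """Toy energy: number of positions that differ from TARGET (lower is better)."""
    return sum(a != b for a, b in zip(seq, TARGET))


def random_sequence(rng):
    """Draw a sequence uniformly at random from the toy sequence space."""
    return "".join(rng.choice(ALPHABET) for _ in range(LENGTH))


def constrained_walk(seq, e_max, rng, steps=20):
    """Propose single-site mutations; accept only those with energy below e_max."""
    for _ in range(steps):
        pos = rng.randrange(LENGTH)
        trial = seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]
        if energy(trial) < e_max:
            seq = trial
    return seq


def nested_sampling(n_live=50, n_iter=300, seed=0):
    """Return the final live set and the sequence of energy ceilings visited."""
    rng = random.Random(seed)
    live = [random_sequence(rng) for _ in range(n_live)]
    thresholds = []
    for _ in range(n_iter):
        # The worst live point sets the new energy ceiling.
        worst = max(range(n_live), key=lambda i: energy(live[i]))
        e_max = energy(live[worst])
        thresholds.append(e_max)
        # Starting points must already lie strictly below the ceiling.
        below = [s for s in live if energy(s) < e_max]
        if not below:
            break  # every live point ties at the ceiling; stop the descent
        live[worst] = constrained_walk(rng.choice(below), e_max, rng)
    return live, thresholds
```

Calling nested_sampling() returns the surviving live sequences together with the non-increasing list of energy ceilings, which is the raw material for the kind of landscape analysis the abstract describes.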


Stefano Martiniani is an Assistant Professor in the Department of Chemical Engineering and Materials Science at the University of Minnesota and a member of the graduate faculty in the School of Physics & Astronomy and in Data Science. Prior to joining UMN, he was a postdoc at the Center for Soft Matter Research at New York University and a Gates Scholar at the University of Cambridge, where he obtained an MPhil in Scientific Computing and a PhD in Theoretical Chemistry. Stefano’s research focuses on the design of novel theoretical and computational frameworks to address open problems in science and engineering. His work draws primarily from statistical and computational physics, dynamical systems, and machine learning. His theoretical interests span the energy landscapes of disordered systems, neuronal dynamics, bio/molecular design and simulation, and soft matter.