NEW YORK – A team led by researchers at the Technical University of Munich has developed a new machine learning-based approach to de novo protein design.
Detailed in a paper published Thursday in Science, the method employs a form of deep network hallucination-based modeling in which the use of a "relaxed sequence space" allows for more efficient design of protein backbones.
Using the method, called relaxed sequence optimization (RSO), the researchers designed and produced more than 100 proteins, including some consisting of as many as 1,000 amino acids, and validated the structures of five using either X-ray diffraction or cryo-electron microscopy.
According to the authors, RSO could enable a variety of applications including the design of protein complexes and of proteins "approaching the size of therapeutically relevant protein scaffolds such as antibodies."
RSO is based on the idea that allowing structure prediction tools to operate outside the space of physically possible protein sequences could improve the efficiency of their predictions. In devising the approach, the authors built on existing work on gradient descent-based hallucination for protein design. In that work, researchers designed proteins by entering a sequence into the AlphaFold2 (AF2) protein structure prediction tool and calculating the "loss" between the predicted structure and the target structure. They then backpropagated this loss through AF2 to produce a gradient used to update the input sequence, moving it closer to one that yields the target structure.
The authors noted that this updating process does not typically produce a defined amino acid sequence but "a logit-like or position-specific scoring matrix (PSSM)" in which "each residue position is populated seemingly by a superposition of all the 20 amino acids, each with a specific numeral weight."
They added that researchers using this approach have commonly forced such relaxed sequences into defined sequences before feeding them back into the prediction tool, but that this causes "substantial deviations away from the optimal gradient direction." In RSO, on the other hand, users iterate on the relaxed sequences, a process that the TUM researchers found produced "rapid and stable convergence and improved performance relative to previous protocols."
The team used RSO to design 85 proteins between 100 and 300 amino acids in length. They characterized eight of those using size-exclusion chromatography and circular dichroism spectroscopy, finding that their measurements agreed with those expected for the designed protein structures. For the remaining proteins, they found that 58 percent had molecular weights matching expectation.
They also designed and produced five other proteins consisting of 200, 400, 600, 950, and 1,000 amino acids, characterizing the first three by X-ray diffraction and the last two by cryo-EM, finding that they were good matches to the predicted structures.
Additionally, the researchers designed interacting proteins, producing a heterodimer design in which "individual monomers stayed monomeric when expressed separately, but formed a dimeric complex when mixed."
"RSO achieves high designability and efficiently generates promising in silico candidates for large proteins, including tasks such as site scaffolding and binder generation," they wrote.