Deep learning language models have shown promise in several biotechnology applications, including protein design and engineering. Scientists at the University of California – San Francisco have developed an AI system capable of generating artificial enzymes from scratch.
Their system, called ProGen, uses next-token prediction to assemble amino acid sequences into artificial proteins. In testing, some of the resulting enzymes worked just as well as those found in nature, even when their artificially generated amino acid sequences differed significantly from any known natural protein.
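To make "next-token prediction" concrete, here is a minimal Python sketch of how a sequence can be assembled one residue at a time. It is not ProGen: `toy_next_token_probs` is a hypothetical stand-in for a trained model, and the end-of-sequence token, the starting methionine and the length limit are assumptions made purely for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
END = "*"  # hypothetical end-of-sequence token (an assumption)


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a trained language model.

    A real model would compute these probabilities from the prefix using
    learned weights. This toy version only makes stopping more likely as
    the sequence grows, and treats every residue as equally likely.
    """
    p_end = min(0.9, len(prefix) / 50)
    p_each = (1.0 - p_end) / len(AMINO_ACIDS)
    probs = {aa: p_each for aa in AMINO_ACIDS}
    probs[END] = p_end
    return probs


def generate(next_token_probs, max_len=60, seed=0):
    """Sample a sequence token by token until END or max_len is reached."""
    rng = random.Random(seed)
    seq = "M"  # assumption: start from methionine
    while len(seq) < max_len:
        probs = next_token_probs(seq)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == END:
            break
        seq += token  # the chosen token becomes part of the next prefix
    return seq


if __name__ == "__main__":
    print(generate(toy_next_token_probs))
```

The loop is the part that carries over to a real system: each sampled residue is appended to the prefix, and the model is asked again for the next one.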
Scientists said the new technology could become more powerful than directed evolution, the Nobel Prize-winning protein design technology, and that it will reinvigorate the 50-year-old field of protein engineering by accelerating the development of new proteins that could be used for almost anything, from therapeutics to degrading plastic.
James Fraser, Ph.D., professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy, said: “The artificial designs perform much better than designs that were inspired by the evolutionary process. The language model is learning aspects of evolution, but it is different from the normal evolutionary process.”
“We can now tailor the generation of these properties to specific effects. For example, an enzyme that is incredibly thermostable, or likes acidic environments, or won’t interact with other proteins.”
To build the model, the scientists fed it the amino acid sequences of 280 million different proteins of all kinds and gave it several weeks to process the data. They then fine-tuned the model by priming it with 56,000 sequences from five lysozyme families, along with some contextual information about these proteins.
The model quickly generated one million sequences, and the research team selected 100 of them to test, based on how closely they resembled the sequences of natural proteins and how naturalistic the underlying “grammar” and “semantics” of the AI proteins’ amino acids were.
From this first batch of 100 proteins, which Tierra Biosciences screened in vitro, the team made five artificial proteins to test in cells and compared their activity to that of hen egg white lysozyme (HEWL), an enzyme found in the whites of chicken eggs. Human tears, saliva and milk all contain similar lysozymes, which act as antimicrobial defenses against bacteria and fungi.
Two of the artificial enzymes were able to break down bacterial cell walls with activity comparable to HEWL, even though their sequences were only about 18% identical to one another.
Just one mutation in a natural protein can stop it from working. Yet in a separate round of screening, the scientists found that the AI-generated enzymes showed activity even when as little as 31.4% of their sequence resembled any known natural protein.
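Figures like these are usually sequence-identity percentages: the share of aligned positions at which two proteins carry the same amino acid. The Python sketch below shows one simple way to compute such a number. It assumes the two sequences have already been aligned to equal length, with "-" marking gaps, and it divides by the full alignment length. Other conventions for the denominator exist, and the article does not say which one the researchers used. The two fragments are made up for illustration and are not real lysozyme sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns where both sequences share a residue.

    Assumes the inputs are already aligned (equal length, gaps marked with
    `gap`). Columns containing a gap never count as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Made-up toy fragments, not real protein sequences.
    a = "MKALIV-LGA"
    b = "MRALLVTLGS"
    print(f"{percent_identity(a, b):.1f}% identical")
```

In practice the alignment step itself is handled by dedicated tools, and the result depends on the alignment method and scoring scheme chosen.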
The AI was even able to learn how the enzymes should be shaped simply by studying the raw sequence data. Measured by X-ray crystallography, the atomic structures of the artificial proteins looked just as they should, even though the sequences were unlike anything seen before.
Nikhil Naik, Ph.D., director of AI Research at Salesforce Research and the paper’s senior author, said: “If you train sequence-based models with a lot of data, they are really powerful at learning structure and rules. They learn which words can occur next to each other, and also composition.”
Ali Madani, Ph.D., founder of Profluent Bio, a former research scientist at Salesforce Research, and the paper’s first author, said: “Given the limitless possibilities, it is remarkable that the model can generate working enzymes so easily.”
“The ability to generate functional proteins from scratch shows that we are entering a new era of protein design. This is a versatile new tool available to protein engineers and we look forward to its therapeutic applications.”
Journal reference:
- Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos, Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser, Nikhil Naik. Large language models generate functional protein sequences in diverse families. Nature Biotechnology, 2023; DOI: 10.1038/s41587-022-01618-2