AI tool AlphaGenome predicts how one typo can change a genetic story

The model can predict changes in 11 biological activities across 1 million DNA letters

Ladderlike fluorescent patterns representing DNA sequence in blue, orange, yellow and green are overlaid on an image of an open book.

A new deep learning AI model called AlphaGenome is opening up the DNA instruction book and making it easier to read.

Nenov/Moment/Getty Images; TEK IMAGE/SCIENCE PHOTO LIBRARY/Getty Images

A new deep-learning AI model may help scientists better decipher the plot of the genetic instruction book and learn how typos alter the story.

AlphaGenome, created by Google DeepMind, is the latest in an ever-improving line of AI models built to analyze vast stretches of DNA. The previous front-runner, a model called Borzoi, could predict molecular signposts in stretches of DNA 500,000 bases long. AlphaGenome can analyze 1 million DNA building blocks at a time, researchers report January 28 in Nature. The model may have practical implications for diagnosing rare genetic diseases, identifying cancer-driving mutations, designing synthetic DNA sequences or therapeutic RNAs and better understanding basic biology.

“AlphaGenome is not just a bigger model in terms of context length, but it actually is quite a leap forward in its overall utility,” says Anshul Kundaje, a computational biologist at Stanford University who develops AI models for genomics.

For instance, a genetic change may have no effect on nearby genes but could change the activity of genes far away. Because AlphaGenome examines longer stretches of DNA, it is more likely to spot such long-distance relationships.

But AlphaGenome isn’t perfect. Unpublished data from Kundaje’s lab indicates the model struggles with predicting how gene activity changes in individuals. Right now, the model is a tool for uncovering basic biology, not something doctors could use to diagnose or treat patients.

AlphaGenome has “maxed out” what this type of model can do, Kundaje says. He predicts the next big leap will come from scientists generating new types of data for the model or its descendants to analyze.

AlphaGenome can pinpoint biologically important spots down to single-base resolution, says Peter Koo, a computational biologist at Cold Spring Harbor Laboratory in New York. That’s much higher resolution than Borzoi, which flagged points of biological interest in 32-base-pair bins.

That’s a big task, considering that the model’s reference is the 3-billion-base-long human genome, often called a genetic instruction book. The book is actually a multivolume, choose-your-own-adventure, pop-up encyclopedia.

Genes, the short stories of the book, are told in small phrases that can be rearranged, shortened or skipped. In between the story fragments are passages that may contain instructions for how to read a different story entirely. Pages and chapters are intricately folded into each other so that pulling a tab in one passage causes something to pop up chapters away.

Much of the book is filled with what many people once thought was nonsense but is often essential reading material. Researchers have cataloged a dizzying array of punctuation marks, origami-like creases, syntax swaps, margin scribbles and other types of biological grammar that cells use to make sense of the book.

AlphaGenome’s task is to take a string of DNA letters and predict how plot points, punctuation and other variations affect 11 distinct biological processes, including RNA splicing, gene activity levels and certain protein-DNA interactions. The model considers 5,930 data points from studies of human DNA and 1,128 from studies of mouse DNA. With those data, the AI can predict how changing a single letter, or base, in the million-base string alters the story.
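One common way to score a single-letter change with a sequence model is to run the model on the original sequence and on the altered one, then compare the two predictions. The sketch below illustrates only that comparison. The function names and the motif-counting `toy_model` are hypothetical stand-ins, not AlphaGenome or its API, and a real model would return many predicted tracks rather than one number.

```python
def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts 'TATA' motifs."""
    return sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))


def mutate(seq, pos, alt):
    """Return a copy of seq with the base at 0-based position pos set to alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(model, seq, pos, alt):
    """Predicted change caused by the variant: alternate minus reference."""
    return model(mutate(seq, pos, alt)) - model(seq)


reference = "GGTATAGG"
print(variant_effect(toy_model, reference, 3, "C"))  # -1: the motif is destroyed
print(variant_effect(toy_model, reference, 0, "C"))  # 0: the motif is untouched
```

The same pattern extends to scanning every position in a sequence, which is one way to rank which typos are predicted to matter most.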

Specialized computational models that predict subsets of these biological functions have been in use for years, but AlphaGenome outperforms them on most measures and does particularly well at identifying some features in different types of cells, the researchers report. For example, AlphaGenome identified gene activity changes in certain cell types 14.7 percent better than Borzoi2.

“By doing well on so many different genomic tasks simultaneously, we believe this demonstrates that the model has learned a powerful general representation of DNA sequences and the complex processes these sequences encode,” said Natasha Latysheva of Google DeepMind January 27 during a news briefing.

The tool could make things easier for researchers who are trying to understand how the genome works, says Judit García González, a human geneticist at the Icahn School of Medicine at Mount Sinai in New York City. Before AlphaGenome, a researcher “might need to use three different tools with their own caveats, and [have] to learn how they work, for predicting say 20 different genomic functional consequences,” she says. Now, AlphaGenome unites all those in one tool.

AlphaGenome isn’t an entirely new invention. It builds on previous models but uses aspects of those models in clever ways. “There is no single innovation in AlphaGenome that one can pinpoint as a critical innovation. It’s really a system of lots of tricks and engineering,” Koo says.

AlphaGenome used one trick called ensemble distillation that Koo’s lab has been experimenting with. That strategy pretrains multiple copies of the model, each on computationally mutated DNA. Those models serve as teachers to a single student model that averages their outputs.

It’s like having 60 history professors give their account of an important event, Koo says. “If you consider the consensus across what every historian agrees, what overlaps across their story lines, that is probably what might actually be true.”

The consensus, he says, “tends to be more reliable than trusting any individual model.”
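The teacher-and-student idea can be sketched with a toy example. This is an illustration under loud assumptions, not DeepMind's implementation: each "teacher" here is a simple straight-line fit trained on a randomly perturbed copy of the inputs (a stand-in for computationally mutated DNA), and the "student" is fit to the average of the teachers' predictions. All names are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def train_teachers(xs, ys, n_teachers, noise, rng):
    """Train each teacher on its own perturbed copy of the inputs."""
    teachers = []
    for _ in range(n_teachers):
        perturbed = [x + rng.gauss(0, noise) for x in xs]
        teachers.append(fit_line(perturbed, ys))
    return teachers


def distill(teachers, xs):
    """Fit one student to the teachers' averaged predictions."""
    targets = [mean(a * x + b for a, b in teachers) for x in xs]
    return fit_line(xs, targets)


rng = random.Random(0)
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
teachers = train_teachers(xs, ys, n_teachers=5, noise=0.5, rng=rng)
student = distill(teachers, xs)
print("teacher slopes:", [round(a, 3) for a, _ in teachers])
print("student slope:", round(student[0], 3))
```

Because every model in this toy is a straight line, the student ends up with the average of the teachers' slopes and intercepts. With deep networks the student is instead trained to match the averaged outputs, so a single model can stand in for the whole ensemble.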

Tina Hesman Saey is the senior staff writer and reports on molecular biology. She has a Ph.D. in molecular genetics from Washington University in St. Louis and a master’s degree in science journalism from Boston University.