When it comes to whole genomes, the bigger they are, the easier they seem to fall.
Last week, at the annual meeting of the American Association for the Advancement of Science in Washington, D.C., scientists announced the completion of the largest genome yet decoded, that of the fruit fly Drosophila melanogaster.
Using a controversial approach called shotgun sequencing, a team at Celera Genomics in Rockville, Md., and several academic institutions decoded 97 percent of the 120 million bases that make up the protein-coding portion of the fruit fly genome. The scientists don’t yet have the technology to tackle an additional 60-million-base section of the fruit fly genome that they say contains very few genes.
Many genome researchers have criticized Celera for using the shotgun sequencing method, but the technique has succeeded beyond even the most optimistic predictions, says Gerald Rubin, a geneticist who heads a fruit fly research project at the University of California, Berkeley.
Shotgun sequencing resembles doing a jigsaw puzzle. The researchers randomly cut the Drosophila genome into small pieces. They then determined the nucleotide sequence of each short segment of DNA—a process that took only 4 months in what Celera calls its “sequencing factory”—and used a computer to detect overlaps and thereby determine the order of the segments.
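The jigsaw analogy can be made concrete with a toy example. The sketch below is not Celera's assembler, whose details the article does not give. It is a minimal greedy illustration that assumes error-free fragments, no repeats, and made-up sequences. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Toy model only: assumes perfect reads and a repeat-free sequence.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the rest stay as separate pieces
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three shuffled, overlapping pieces of an invented 16-base sequence.
pieces = ["CTGAAGTC", "ATGCGTAC", "GTACCTGA"]
print(greedy_assemble(pieces))
```

If two different regions shared the same overlap, the greedy merge could join the wrong pieces. That is the ambiguity that repeated DNA sequences introduce.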
Other researchers had predicted that the many repeated DNA sequences in the fruit fly genome would make it impossible to determine unambiguously the segments’ correct order. Repetitive DNA sequences are the source of many gaps in the sequences of other organisms, including the worm Caenorhabditis elegans (SN: 12/12/98, p. 372: http://www.sciencenews.org/sn_arc98/12_12_98/Fob1.htm).
The Celera team filtered repetitive sequences out of the pool of DNA puzzle pieces before they started putting together the rest of the genome. “If you can just avoid [repetitive sequences] until the very end, you’re home free,” said Celera’s Eugene W. Myers at the meeting.
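The article does not say how the team recognized repetitive pieces. One common heuristic, assumed here purely for illustration, is to flag any fragment containing a short subsequence (a k-mer) that turns up in more fragments than ordinary overlap would explain. The function names, the value of `k`, and the `max_count` threshold below are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All length-k substrings of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def split_by_repeats(fragments, k=4, max_count=2):
    """Separate fragments into (unique, repetitive).

    A fragment is called repetitive if any of its k-mers occurs in more
    than `max_count` fragments. Assumed heuristic, not Celera's method.
    """
    # Count how many fragments contain each k-mer (once per fragment).
    counts = Counter(km for f in fragments for km in set(kmers(f, k)))
    unique, repetitive = [], []
    for f in fragments:
        if any(counts[km] > max_count for km in kmers(f, k)):
            repetitive.append(f)
        else:
            unique.append(f)
    return unique, repetitive


# Invented reads: two overlap normally; three share the repeat "TTTT".
reads = ["ATGCGTAC", "GTACCTGA", "CATTTTGG", "AGTTTTCA", "GCTTTTAC"]
unique, repetitive = split_by_repeats(reads)
print(unique)
print(repetitive)
```

In a scheme like this, the unique fragments would be assembled first and the set-aside repetitive ones placed afterward. That matches Myers's description of avoiding repeats "until the very end."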
Although 3 percent of the protein-coding part of the fruit fly genome remains unidentified, the problem isn’t defects in data processing, said Myers. Rather, the researchers’ set of small pieces was missing those parts of the genome. The team should fill in the holes within the next year, Myers said. He noted that 1.76 million bases of the C. elegans genome, about 1.8 percent, are still unknown more than a year after its announced completion.
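For a sense of scale, the percentages quoted in this article can be turned into base counts. The short calculation below uses only figures stated in the article. The C. elegans total is back-calculated from the rounded 1.8 percent figure, so it is an approximation, and the function name is hypothetical.

```python
def genome_gap_figures():
    """Convert the article's percentages into approximate base counts."""
    fly_coding_bases = 120_000_000                # protein-coding portion, per the article
    fly_decoded = fly_coding_bases * 97 // 100    # 97 percent decoded
    fly_missing = fly_coding_bases - fly_decoded  # the remaining 3 percent

    worm_missing = 1_760_000                      # C. elegans bases still unknown
    # Implied genome size, from the rounded "about 1.8 percent" figure.
    worm_total_estimate = worm_missing / 0.018
    return fly_decoded, fly_missing, worm_total_estimate


decoded, missing, worm_total = genome_gap_figures()
print(f"Fruit fly bases decoded: {decoded:,}")
print(f"Fruit fly bases missing: {missing:,}")
print(f"Implied C. elegans genome size: about {worm_total / 1e6:.0f} million bases")
```

On these figures, the fruit fly gap of about 3.6 million bases is roughly twice the number of bases still missing from the worm genome.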
Despite its size, the Drosophila genome took less time to assemble than either the worm or yeast genome. The human genome should be even more straightforward because human DNA repeats are easier to tell apart than those in the fruit fly, said Myers.
The researchers identified 13,601 genes in the segments they sequenced. About half encode proteins with no known function, says Celera’s Mark Adams. Some of the newly identified genes are of particular interest. The fruit fly version of the cancer gene p53 “popped right out” of the genome, although other methods to find the gene in Drosophila had failed, he noted.