9.4 Sequencing Genomes 331
TABLE 9.2 Genome Comparisons

Organism Type   Species                    Number of Chromosomes (a)   Number of Genes (b)   Genome Size (Mb)
Bacterium       Escherichia coli           1                           ∼4400                 4.6 (d)
Yeast           Saccharomyces cerevisiae   16                          ∼6000                 12.5
Worm            Caenorhabditis elegans     6                           ∼22,000               100.3
Fly             Drosophila melanogaster    4                           ∼17,000               122.7
Mustard weed    Arabidopsis thaliana       5                           ∼28,000               135
Mouse           Mus musculus               20                          ∼27,000               2,700
Human           Homo sapiens               23                          ∼27,000               3,300
Lungfish        Protopterus aethiopicus    14                          ??                    133,000
Canopy plant    Paris japonica             5 (c)                       ??                    152,400

(a) Haploid chromosome complement except where indicated.
(b) Includes non-protein-coding genes.
(c) This species is an octoploid; 5 is the basic chromosome number (see Chapter 13).
(d) E. coli genomes vary in size; 4.6 Mb is a representative length (see Chapter 14).
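One way to read Table 9.2 is to divide each gene count by the genome size. The short Python sketch below does this for the seven species whose gene numbers are listed; the lungfish and the canopy plant are left out because their gene counts are unknown. The function names and the "genes per Mb" measure are illustrative choices, not part of the table, and the inputs are the approximate values shown above.

```python
# Approximate values transcribed from Table 9.2:
# species -> (number of genes, genome size in Mb).
# Lungfish and canopy plant are omitted because their gene counts are listed as unknown.
GENOMES = {
    "Escherichia coli": (4400, 4.6),
    "Saccharomyces cerevisiae": (6000, 12.5),
    "Caenorhabditis elegans": (22000, 100.3),
    "Drosophila melanogaster": (17000, 122.7),
    "Arabidopsis thaliana": (28000, 135.0),
    "Mus musculus": (27000, 2700.0),
    "Homo sapiens": (27000, 3300.0),
}


def genes_per_mb(genes, size_mb):
    """Return the average number of genes per megabase of genome."""
    return genes / size_mb


def density_table(genomes):
    """Map each species to its gene density, rounded to one decimal place."""
    return {
        species: round(genes_per_mb(genes, size_mb), 1)
        for species, (genes, size_mb) in genomes.items()
    }


if __name__ == "__main__":
    densities = density_table(GENOMES)
    # Print from the most gene-dense genome to the least gene-dense.
    for species, density in sorted(
        densities.items(), key=lambda item: item[1], reverse=True
    ):
        print(f"{species:26s} {density:8.1f} genes/Mb")
```

With these inputs, the human and mouse genomes come out at roughly 10 genes per Mb or fewer, while E. coli has several hundred genes per Mb, so genome size and gene number clearly do not scale together.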
genome of the plant Paris japonica. Thus, the information content of a genome is not necessarily proportional to the complexity of the organism.

The large size of some genomes, including the human genome, presents major challenges for their ultimate characterization and analysis. If any single DNA sequencing run can yield at most 1000 bases of information, then you might think you would need to obtain at least 3 million such sequences to determine the human genome’s entire sequence. In fact this is a gross underestimate, because as we discussed previously, you would really need to examine at least five times this number of clones from a genomic library to ensure just a 95% chance that each portion of the genome would be represented once. How can you do so many DNA sequencing runs? And how can you deal with the immense amount of data you would obtain so that you could somehow figure out how these millions of small, 1000-base snippets are ordered with respect to each other in the intact genome?

The basic concept behind the method now used to sequence complex genomes, called the whole-genome shotgun strategy, is easy to explain: Determine DNA sequences, each about 1000 bases long, from both ends (paired ends) of random human genomic DNA inserts of millions of individual BAC (bacterial artificial chromosome) clones from a genomic library, and then look for overlaps between the sequences so that they can be assembled to reconstruct the sequence of the entire genome (Fig. 9.8). (Shotgun refers to the fact that the clones are chosen randomly for sequencing.) Ideally, in the case of the human genome, the ultimate output would be 24 linear strings of nucleotide sequences, one for each chromosome (the autosomes and the X and the Y).

When the Human Genome Project began in 1990, whole-genome shotgun strategies were thought to be impossible. One problem was recognized very early: As will be discussed later, genomes contain many kinds of repetitive DNA sequences, each of which can be located in many positions scattered throughout the genome. Many repeated sequences are longer than a typical sequence read of 1000 bp. This fact makes it impossible to assemble the genome from random reads. The reason is that unique sequences on one side of a long repeat from one particular genomic location cannot be present in the same read as unique sequences from the other side of the repeat (Fig. 9.9a). Scientists eventually realized that paired-end sequencing of random clones would make the whole-genome shotgun strategy possible. We will explain this method, but first we describe an alternative strategy researchers used before this conceptual breakthrough.

To get around the assembly problem presented by the existence of long repeats, the first genome scientists tried a divide-and-conquer method called a hierarchical strategy (Fig. 9.8). They first separated the genome into large chunks by cloning 200–300 kb fragments in BAC vectors, and then they applied strategies (not discussed here) to determine the order of the inserts with respect to each other in the original genome. Note that the genomic DNA fragments were generated using a method (such as sonication) that cleaves different copies of the genome at different locations, resulting in overlapping fragments (Fig. 9.8). These methods allowed the researchers to determine the smallest set of BAC clones with the least amount of overlap that could cover the entire genome (the so-called minimal tiling path). The scientists then determined the DNA sequence of the entire insert in each BAC clone of the minimal tiling path so as to reconstruct the genome