Page 37 - Genetics_From_Genes_to_Genomes_6th_FULL_Part3

9.4 Sequencing Genomes   331


TABLE 9.2      Genome Comparisons

Organism                                    Number of        Number of    Genome
Type            Species                     Chromosomes a    Genes b      Size (Mb)
Bacterium       Escherichia coli            1                ∼4400        4.6 d
Yeast           Saccharomyces cerevisiae    16               ∼6000        12.5
Worm            Caenorhabditis elegans      6                ∼22,000      100.3
Fly             Drosophila melanogaster     4                ∼17,000      122.7
Mustard weed    Arabidopsis thaliana        5                ∼28,000      135
Mouse           Mus musculus                20               ∼27,000      2,700
Human           Homo sapiens                23               ∼27,000      3,300
Lungfish        Protopterus aethiopicus     14               ??           133,000
Canopy plant    Paris japonica              5 c              ??           152,400

a Haploid chromosome complement except where indicated.
b Includes non-protein-coding genes.
c This species is an octoploid; 5 is the basic chromosome number (see Chapter 13).
d E. coli genomes vary in size; 4.6 Mb is a representative length (see Chapter 14).
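
The figures in Table 9.2 can be turned into a rough gene density (genes per Mb) to see how loosely gene number tracks genome size. The following is a minimal sketch, not part of the original text: the names `TABLE_9_2`, `genes_per_mb`, and `density_table` are hypothetical, the gene counts are the approximate values from the table, and the lungfish and Paris japonica are left out because the table gives no gene count for them.

```python
# Values transcribed from Table 9.2: (species, approximate gene count, genome size in Mb).
# Lungfish and Paris japonica are omitted: the table lists their gene counts as "??".
TABLE_9_2 = [
    ("Escherichia coli", 4400, 4.6),
    ("Saccharomyces cerevisiae", 6000, 12.5),
    ("Caenorhabditis elegans", 22000, 100.3),
    ("Drosophila melanogaster", 17000, 122.7),
    ("Arabidopsis thaliana", 28000, 135.0),
    ("Mus musculus", 27000, 2700.0),
    ("Homo sapiens", 27000, 3300.0),
]


def genes_per_mb(genes, size_mb):
    """Return the approximate number of genes per megabase of genome."""
    return genes / size_mb


def density_table(rows=TABLE_9_2):
    """Return (species, genes per Mb rounded to one decimal) for each row."""
    return [(species, round(genes_per_mb(genes, size_mb), 1))
            for species, genes, size_mb in rows]


if __name__ == "__main__":
    for species, density in density_table():
        print(f"{species:<26} {density:>8.1f} genes/Mb")
```

With these inputs, the density falls from roughly 950 genes per Mb in E. coli to under 10 genes per Mb in humans, even though the human gene count is only about six times larger. The numbers are only as precise as the approximate gene counts in the table.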


genome of the plant Paris japonica. Thus, the information content of a genome is not
necessarily proportional to the complexity of the organism.
   The large size of some genomes, including the human genome, presents major challenges
for their ultimate characterization and analysis. If any single DNA sequencing run can
yield at most 1000 bases of information, then you might think you would need to obtain at
least 3 million such sequences to determine the human genome’s entire sequence. In fact,
this is a gross underestimate, because as we discussed previously, you would really need
to examine at least five times this number of clones from a genomic library to ensure just
a 95% chance that each portion of the genome would be represented once. How can you do so
many DNA sequencing runs? And how can you deal with the immense amount of data you would
obtain so that you could somehow figure out how these millions of small, 1000-base
snippets are ordered with respect to each other in the intact genome?
   The basic concept behind the method now used to sequence complex genomes, called the
whole-genome shotgun strategy, is easy to explain: Determine DNA sequences, each about
1000 bases long, from both ends (paired ends) of random human genomic DNA inserts of
millions of individual BAC (bacterial artificial chromosome) clones from a genomic library,
and then look for overlaps between the sequences so that they can be assembled to
reconstruct the sequence of the entire genome (Fig. 9.8). (Shotgun refers to the fact that
the clones are chosen randomly for sequencing.) Ideally, in the case of the human genome,
the ultimate output would be 24 linear strings of nucleotide sequences, one for each
chromosome (the autosomes and the X and the Y).
   When the Human Genome Project began in 1990, whole-genome shotgun strategies were
thought to be impossible. One problem was recognized very early: As will be discussed
later, genomes contain many kinds of repetitive DNA sequences, each of which can be
located in many positions scattered throughout the genome. Many repeated sequences are
longer than a typical sequence read of 1000 bp. This fact makes it impossible to assemble
the genome from random reads. The reason is that unique sequences on one side of a long
repeat from one particular genomic location cannot be present in the same read as unique
sequences from the other side of the repeat (Fig. 9.9a). Scientists eventually realized
that paired-end sequencing of random clones would make the whole-genome shotgun strategy
possible. We will explain this method, but first we describe an alternative strategy
researchers used before this conceptual breakthrough.
   To get around the assembly problem presented by the existence of long repeats, the
first genome scientists tried a divide-and-conquer method called a hierarchical strategy
(Fig. 9.8). They first separated the genome into large chunks by cloning 200–300 kb
fragments in BAC vectors, and then they applied strategies (not discussed here) to
determine the order of the inserts with respect to each other in the original genome. Note
that the genomic DNA fragments were generated using a method (such as sonication) that
cleaves different copies of the genome at different locations, resulting in overlapping
fragments (Fig. 9.8). These methods allowed the researchers to determine the smallest set
of BAC clones with the least amount of overlap that could cover the entire genome (the
so-called minimal tiling path). The scientists then determined the DNA sequence of the
entire insert in each BAC clone of the minimal tiling path so as to reconstruct the genome