
342    Chapter 10   Genome Annotation


10.1   Finding the Genes in Genomes

learning objectives

1.  Explain why a long open reading frame suggests the existence of a protein-coding exon.
2.  Describe how scientists predict the location of genes by identifying sequences conserved in the genomes of widely divergent species.
3.  Discuss the use of reverse transcriptase in the construction of a cDNA library.
4.  Compare the information that can be obtained from genomic and cDNA libraries.

Genes are the key functional elements of genomes. In this section, we focus on methods to locate genes within genomic DNA sequences. You will see that information useful for the annotation of the genes within the human genome can be found in the sequence of the genome itself, in the sequences of the genomes of species other than humans, and in the characterization of RNA molecules in human cells. These methods have successfully located and characterized more than 27,000 genes in the human genome, but in spite of all of these efforts, the task is still incomplete; some genes undoubtedly remain to be found.

Open Reading Frames (ORFs) Help Locate Protein-Coding Genes

One way to look specifically for regions that might correspond to the exons of protein-coding genes is to scan genomic DNA sequences for long open reading frames (ORFs); that is, stretches of nucleotides that have a reading frame of triplets uninterrupted by a stop codon. As you remember from Chapter 8’s discussion of the genetic code, the four nucleotides can be arranged into 4³ = 64 possible triplets, of which three (TAA, TAG, and TGA written as DNA) signify stop. Thus, as a very rough estimate, if you looked at any random sequence of DNA starting at any one nucleotide, you would on average run into a stop codon after about 64/3 ≈ 21 triplets. If that nucleotide begins a reading frame that continues without a stop for significantly more than 21 triplets, there is a good chance that the DNA in this region is not a random set of nucleotides, but instead actually encodes amino acids within a protein (Fig. 10.1).

This method is useful but far from foolproof. Genomes are so large that regions that do not correspond to genes might occasionally contain a long ORF by chance. On the other hand, because many genes in higher eukaryotes are interrupted by introns, some protein-coding exons are so small that they would not be identified as ORFs unless other information was available.

One type of additional information that could potentially aid computer programs in identifying genes is the fact that the splice acceptor and splice donor sites at intron/exon boundaries are composed of characteristic consensus sequences (review Fig. 8.15). Genome analysis programs can thus connect potential exons together and see if a long ORF suggestive of a gene would result.

Whole-Genome Comparisons Distinguish Genomic Elements Conserved by Natural Selection

The whole-genome shotgun approach to the sequencing of genomes described in Chapter 9 has been so successful that scientists have already deciphered the genomes of thousands of different species. Researchers can exploit this tremendous amount of information to look for regions of DNA that are similar in diverse organisms. Such regions usually, though not always, correspond to genes.

The justification for comparing genomes goes all the way back to Charles Darwin. Nearly a century before the DNA double helix was discovered, he proposed the evolution of species from now-extinct ancestors by a process of descent with modification. We now know that the actual entity undergoing descent with modification is the DNA sequence that defines an organism’s genome. The modifications are random mutations that occur in DNA. Natural selection is the process whereby mutations that confer an advantage to the individuals carrying them will spread throughout a population, while deleterious mutations will disappear. The challenge is to trace such molecular evolution at the DNA level.
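
The idea of looking for regions of DNA that are similar in diverse organisms can be illustrated with a small sketch. It is not the method used by real genome-comparison programs, which must first align whole genomes and usually compare many species at once. The sketch assumes two short sequences that are already aligned and of equal length, and it scores percent identity in sliding windows. The sequences `species_a` and `species_b`, the function names, the window size, and the 90% threshold are all hypothetical values chosen for illustration.

```python
def window_identity(seq_a, seq_b, window=10, step=5):
    """Return (start, fraction identical) for each sliding window.

    Assumes seq_a and seq_b are already aligned and the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(x == y for x, y in zip(a, b))
        results.append((start, matches / window))
    return results


def conserved_windows(seq_a, seq_b, window=10, step=5, threshold=0.9):
    """Return start positions of windows at or above the identity threshold."""
    return [start
            for start, identity in window_identity(seq_a, seq_b, window, step)
            if identity >= threshold]


# Hypothetical toy sequences: identical at both ends, divergent in the middle.
species_a = "ATGGCCAAGT" + "TTGACCTAGG" + "CATCGATCGA"
species_b = "ATGGCCAAGT" + "TAGTCGTTGC" + "CATCGATCGA"

for start, identity in window_identity(species_a, species_b):
    print(f"window at {start:2d}: {identity:.0%} identical")

print("conserved window starts:", conserved_windows(species_a, species_b))
```

In this toy case the windows at the two ends are flagged as conserved and the middle is not. In real genomes, such conserved blocks are only candidates: as the text notes, they usually, though not always, correspond to genes.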



              Figure 10.1  Open reading frames (ORFs). Any sequence of DNA can be read in any of six different reading frames (three from one
              strand, three from the other strand). Reading frames uninterrupted by stop codons (red) are ORFs. A long ORF suggests that the region may
              be part of a protein-coding exon. In this example, only one reading frame (Frame 5) is open.
                                     Frame 1  5' .  .  .CCG  ATG  CTG  AAT  AGC  GTA  GAG  GTT  AGG  TAA  TCA  TCA.  .  . 3'
                                     Frame 2  5' .  .  .   CGA  TGC   TGA  ATA  GCG  TAG  AGG  TTA  GGT  AAT  CAT  CA.  .  . 3'
                                     Frame 3  5' .  .  .      GAT  GCT  GAA  TAG  CGT  AGA  GGT  TAG  GTA  ATC  ATC  A.  .  . 3'

                                            3' .  .  .GGC  TAC  GAC  TTA  TCG  CAT  CTC  CAA  TCC  ATT  AGT  AGT .  .  . 5'  Frame 4
                                            3' .  .  .GG  CTA  CGA  CTT  ATC  GCA  TCT  CCA  ATC  CAT  TAG  TAG   .  .  . 5'  Frame 5
                                            3' .  .  .G  GCT  ACG  ACT  TAT  CGC  ATC  TCC   AAT  CCA  TTA  GTA     .  .  . 5'  Frame 6
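
The six-frame reading shown in Figure 10.1 can be reproduced with a short sketch. The code below uses the top-strand sequence from the figure, reads it in three frames, then reads the reverse complement (the bottom strand, 5' to 3') in three frames, and reports which frames contain no stop codon. The function names are illustrative, and the numbering of Frames 4 to 6 by offset from the 5' end of the bottom strand is an assumption chosen to match the figure. The last function gives the chance that a random frame stays open for `n` triplets, under the simplifying assumption (the same one behind the 64/3 ≈ 21 estimate) that all 64 triplets are equally likely and independent.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def open_frames(top_strand):
    """Return the frame numbers (1-6) that contain no stop codon.

    Frames 1-3 read the top strand at offsets 0-2; frames 4-6 read the
    reverse complement at offsets 0-2 (numbering assumed to match Fig. 10.1).
    """
    strands = [top_strand, reverse_complement(top_strand)]
    result = []
    for strand_index, strand in enumerate(strands):
        for offset in range(3):
            frame_number = strand_index * 3 + offset + 1
            if not any(c in STOP_CODONS for c in codons(strand, offset)):
                result.append(frame_number)
    return result


def prob_open_by_chance(n_triplets):
    """Chance that n random triplets in a row contain no stop codon.

    Assumes equal, independent triplet frequencies (61 of 64 are non-stop).
    """
    return (61 / 64) ** n_triplets


# Top strand of the sequence shown in Figure 10.1, written 5' to 3'.
figure_sequence = "CCGATGCTGAATAGCGTAGAGGTTAGGTAATCATCA"

print("open frames:", open_frames(figure_sequence))
print("P(open for 21 triplets):", round(prob_open_by_chance(21), 3))
print("P(open for 100 triplets):", round(prob_open_by_chance(100), 4))
```

For the figure's sequence, only Frame 5 comes out open, in agreement with the caption. Under the stated assumption, a random frame stays open for 100 triplets less than 1% of the time. In a genome with billions of nucleotides, though, even rare events occur many times, which is why a long ORF is suggestive rather than conclusive.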