Identification of anonymous maize coding sequences by evolutionary
considerations
--Winkler, RG
In the past year I have worked with two anonymous maize cDNAs, neither of which had an obvious preferred reading frame. Neither had an AUG near its 5' end, and stop codons did not eliminate any of the reading frames from consideration. Although the sequence of one was extended by RACE, there still was no AUG near the 5' end. In both cases I was able to define a reading frame using a simple evolutionary consideration: because the third base of a codon is under reduced or little selective pressure, it diverges more rapidly than the first two bases.
Identification of reading frame by interspecies and intraspecies comparisons. If one compares the sequence of an anonymous gene or expressed sequence tag (EST) with the EST database (dbEST) and identifies a related gene with sufficient overlap, then the reading frame with the greatest variation at the third position is predicted to be the correct reading frame: the third base of the correct reading frame will vary at a much higher rate than its first and second bases. The translated products of the two genes can also be compared in each frame to give a similar result, since the correct frame should yield the most conserved peptide. This criterion can be applied in multiple ways: 1) from any genomic or cDNA sequence to ESTs, 2) for gene families or duplicate genes within a species or even between species, for example between duplicate maize genes, 3) between any two ESTs. It could in fact be used to systematically identify the reading frames of anonymous ESTs or genomic sequences.
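The third-position criterion can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure actually used for the two cDNAs: it assumes the two related sequences have already been aligned to equal length (gaps as "-"), and the function names `mismatches_by_phase` and `predict_frame` are hypothetical. It performs no statistical test, so a small excess of mismatches in one phase should not be over-interpreted.

```python
def mismatches_by_phase(seq_a, seq_b):
    """Count mismatches at aligned positions i, grouped by i % 3."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        # Gap columns carry no information about codon position; skip them.
        if a == "-" or b == "-":
            continue
        if a != b:
            counts[i % 3] += 1
    return counts


def predict_frame(seq_a, seq_b):
    """Return (codon_start_offset, counts).

    The phase with the most mismatches is taken to be the third codon
    position; the codon therefore starts one position after it (mod 3).
    Ties are resolved in favour of the lowest phase, which is arbitrary.
    """
    counts = mismatches_by_phase(seq_a, seq_b)
    third_phase = max(range(3), key=lambda p: counts[p])
    return (third_phase + 1) % 3, counts
```

The returned offset is the 0-based position in the alignment at which the first complete codon of the predicted frame begins. With real data the alignment would come from a database search against dbEST, and frameshifting gaps would have to be handled before the phases are compared.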
Identification of reading frame by identification of internal duplications within a gene. In addition to gene duplication, a second driving force in evolution is internal duplication to produce repeated peptide units. Internal duplications within a gene can therefore also be used to predict the correct reading frame by the same third-position criterion. This can be approached by matrix analysis of the peptides predicted from each frame to determine which frame conserves the peptide repeats. This approach could also be used to systematically define the reading frames of many anonymous ESTs, and could similarly be applied to genomic sequences as a test of whether duplicated sequences are protein coding.
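The internal-duplication test can be illustrated in the same spirit. The sketch below translates a sequence in the three forward frames using the standard genetic code and scores each frame by the number of short peptide words that recur further along the translation. Counting shared words is a simplified stand-in for a full dot-matrix comparison, and the names `translate`, `repeat_score`, and `best_repeat_frame`, as well as the default word length of 3, are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna from offset; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def repeat_score(peptide, k=3):
    """Count non-overlapping pairs of identical k-residue words."""
    score = 0
    for i in range(len(peptide) - k + 1):
        word = peptide[i:i + k]
        if "*" in word or "X" in word:
            continue  # words spanning stops or unknowns are not informative
        for j in range(i + k, len(peptide) - k + 1):
            if peptide[j:j + k] == word:
                score += 1
    return score


def best_repeat_frame(dna, k=3):
    """Return (offset, scores): the frame whose peptide repeats itself most."""
    scores = [repeat_score(translate(dna, f), k) for f in range(3)]
    return max(range(3), key=lambda f: scores[f]), scores
```

When two copies of a repeat unit differ only by synonymous third-position changes, the peptide repeat survives in the correct frame but is lost in the other two, so the correct frame receives the highest score. Low-complexity sequence can inflate the score in any frame, so a longer word length or a scoring matrix would be needed for real data.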
Identification of the limits of the coding sequence by interspecies and intraspecies comparisons. A related criterion can be used to predict the gene product of a genomic or cDNA sequence. Since coding sequences are under much greater selective pressure than the 5' and 3' untranslated sequences, interspecies comparisons can be used to predict coding regions. This has been done in the past for many known genes: the human to mouse comparison is very powerful, as are interspecies comparisons in plants. The rapid increase of EST data makes this approach more widely applicable. When I compared an anonymous, fully sequenced maize cDNA with dbEST, I observed that a peptide of 80 amino acids was conserved between maize and rice and between maize and Arabidopsis (the rice and Arabidopsis genes were obtained and fully sequenced). In addition to establishing the correct reading frame, this suggested that the entire protein was 80 amino acids long, as there was no conservation beyond this region. This was surprising because the first AUG of the maize gene was at bp 300, which implies an unusually long 5' leader sequence, and there were no stop codons in the first 300 bp. Although exon-sharing is a possible explanation for this conservation, it is not likely, as this transcript is single copy.
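One rough way to locate the limits of conservation is a sliding-window identity scan over an alignment of the two species' sequences. The sketch below is an assumption-laden illustration rather than the comparison described above: it expects a pre-aligned pair of equal length, and the function names, the window size, and the identity threshold are all hypothetical choices that would need tuning for real data.

```python
def window_identity(seq_a, seq_b, window=30):
    """Fraction of identical, ungapped positions in each sliding window."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = [a == b and a != "-" for a, b in zip(seq_a.upper(), seq_b.upper())]
    return [sum(matches[i:i + window]) / window
            for i in range(len(matches) - window + 1)]


def conserved_span(seq_a, seq_b, window=30, threshold=0.7):
    """Return (start, end) of the longest run of windows at or above
    threshold, as 0-based half-open alignment coordinates, or None."""
    identities = window_identity(seq_a, seq_b, window)
    best = None
    run_start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, value in enumerate(identities + [float("-inf")]):
        if value >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            span = (run_start, i - 1 + window)
            if best is None or span[1] - span[0] > best[1] - best[0]:
                best = span
            run_start = None
    return best
```

The boundaries returned are accurate only to within roughly one window length, so they suggest where conservation ends rather than pinpointing the first or last codon. The same scan could be run on the translated sequences instead of the nucleotides.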
The value of these approaches is that by simple computer comparisons one can rapidly derive testable hypotheses that predict the coding frame and coding region of an anonymous sequence. Once a peptide is identified it is much easier to start deriving hypotheses on its function by further analysis.