doi:10.3181/0704-MR-97
© 2008 Society for Experimental Biology and Medicine
|
Harlan Robins*1,
Michael Krasnitz and
Arnold J. Levine
* Computational Biology Group, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109; and Institute for Advanced Study, Princeton, New Jersey 08540
1 Computational Biology Group, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N, Seattle, WA 98109. E-mail: hrobins{at}fhcrc.org
Abstract |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
A new algorithm has been constructed for finding under- andoverrepresented oligonucleotide motifs in the protein codingregions of genomes that have been normalized for G/C content,codon usage, and amino acid order. This Robins-Krasnitz algorithmhas been employed to compare the oligonucleotide frequenciesbetween many different prokaryotic genomes. Evidence is presenteddemonstrating that at least some of these sequence motifs arefunctionally important and selected for or against during theevolution of these prokaryotes. The applications of this methodinclude the optimization of protein expression for syntheticgenes in foreign organisms, identification of novel oligonucleotidesignals used by the organism and the examination of evolutionaryrelationships not dependent upon different gene sequence trees.
Keywords: motifs, codon usage, relative entropy
Introduction |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
The degeneracy of the genetic code allows for an enormous number of nucleotide sequences to encode the same protein. A protein of typical length (e.g. 300 amino acids) can be translated from well over 1050 different nucleotide sequences. This number is close to the estimate of the total number of protons that make up the planet Earth. As this number is extraordinarily large, the set of open reading frames (ORFs) within the genome of every organism has the capacity to carry vastly more information than just the amino acid order. Only a portion of this information, namely A + T/G + C content, the frequency of codon usage, and the amino acid content of all of the proteins of an organism has been employed to explain the unique DNA sequences of genomes.(1–3) But these variables appear insufficient to explain the stability of the E. coli DNA sequence.(4) Why does a specific organism like E. coli, isolated from many different parts of the world, have a very similar nucleotide sequence? Why doesn’t that sequence drift and still produce all of the same proteins (due to selection) found in all the strains of E. coli? Is it possible that inherent in the genome of E. coli is additionalinformation that preserves that sequence by selection?
Any genome of an organism can be divided into the portion ofa gene which encodes the amino acid sequence of a protein andthe non-coding portion that contains the regulatory motifs thatimpact the levels of a gene product, the timing of its expression,and the place (cell type) where it is produced. Clearly thesequence motifs or signals that have biological function areselected for and retained at high frequencies in a genome. Arethere similar sets of sequence information in the protein codingregions of the genome which result from the degeneracy of thegenetic code and preferential use of selected codons? In thisdiscussion the search for functional sequence motifs or signalsin the genome of an organism will be restricted to the proteincoding regions of the genome. Of course, there is the confoundingeffect that some motifs are conserved for nonfunctional reasons.Only experiments can distinguish the truly functional motifs.The same approaches employed here can also be usefully appliedto the regulatory regions of genomes and this is work in progress.
The proportion of each codon used for a given amino acid varies widely between organisms as well as between genes in the same organism. For example, in Saccharomyces cerevisiae the highly expressed genes have different codon usage than poorly expressed genes (3). The typical explanation is that isoaccepting tRNAs have very different concentrations within a yeast cell, so the tRNAs that occur in lower abundance can cause rate-limiting steps in translation of an mRNA. The yeast genome selects against those rare codons that correspond to these low abundance tRNAs. The highly expressed genes have a stronger negative selection pressure on the rare codons than poorly expressed genes. Is this a sufficient explanation to shape the unique DNA sequence and use of codons of S. cerevisiae? The results reviewed here suggest other additional factors may play a role. Yet another example of this issue is the LINE-1 retrotransposon elements found in the genomes of many organisms. These LINE-1 elements are very poorly expressed in the somatic cells of their hosts. This is in part due to the heavy methylation of cytosine residues which decreases transcription rates but a second factor, inherent in the nucleotide sequences of LINE-1 elements, slows the rate of transcription of these genomes. Changing the nucleotides in the third position of codons in a LINE-1 element (no amino acid changes) increases the rate of transcription of these elements many fold (5, 6). This can impact the frequency of transposition and the forces of selection for these sequence motifs. Because the poor expression is a transcriptional problem, not translation, this increase due to codon change is not related to codon usage in translation. Could it be that inherent in these LINE-1 elements are sequence motifs that modulate transcription rates or RNA processing rates? Yet a third example of a possible role for nucleotide sequence motifs in protein coding regions of genes, which are not related to codon usage, is found in the Human Immunodeficiency Virus. The DNA copy of this virus integrated into the host cell genome is transcribed poorly and RNA processing of HIV mRNA is inefficient. Indeed two viral proteins are required to enhance transcription (Tat) and RNA processing (Rev). In addition to these aids in gene expression, however, synonymous changes in the third positions of this HIV DNA sequence greatly increases the gene expression of this virus even in the absence of Tat and Rev (7). Thus, there is growing evidence that the RNA sequence of an organism contains sequence motifs, other than codons, which impart biological information. Despite the direct biological relevance of codon usage, there has been very little effort to search for additional information within coding regions (8–11). Here we review the Robins-Krasnitz algorithm, a new method to search this vast amount of additional information contained in the coding region of each organism for biologically functional signals (12). In particular, the algorithm systematicallydetermines the set of small motifs which are under-and overrepresentedin a genome.
This search algorithm has overcome two major obstacles. First, the algorithm attempts to find signals which are obscured by the already known biological properties: coding for proteins and codon usage. These make up a set of complicated constraints that must be controlled. Second, motifs come in large families that are highly codependent. For example, CpG is significantly underrepresented in the human genome because a common mutation is CU through the removal of an amine group. This negative evolutionarypressure significantly reduces the entire superfamily of motifsthat include CG as a substring, e.g. ACG, CCG, CGA, etc. Additionally,the mutation increases the number of TAs and its superfamily.Disentangling the primary object of selection is a difficultchallenge. Taking advantage of powerful information theory tools,the Robins-Krasnitz algorithm overcomes both of these obstacles.One drawback of the algorithm in the present form is that itis limited to exact motifs. Extending the formulation to includedegenerate motifs is a difficult problem.
These obstacles, particularly the first one, are a specific case of a general problem in bioinformatics as well as other areas of computational biology (13). Finding a pattern or signalrequires a null-hypothesis for what the genome would look likewithout signal. An incorrect null-hypothesis (or backgroundmodel) leads to the discovery of false signals. Our backgroundmodel needs to account for as much biological knowledge as possiblewithout damping the signal too dramatically. For the case offinding transcription factor binding sites in upstream regionsof genes, we have very little biological knowledge to use. Genomicalgorithms have had much difficulty separating out false signals.In the case of coding regions, we have substantially more biologicalknowledge to incorporate into our background model. For codingregions, we must be careful of the opposite issue, over constrainingthe background and losing the signal. Of the two constraintswe consider, amino acid order is not controversial. The interestingdecision is how to include codon usage. We are limited for practicalreasons to two choices; either we can fix codon usage in thewhole genome or we can fix it individually in each gene. Wehave strong biological evidence that codon usage within a genehas a large effect on rates of gene expression in bacteria.Additionally, the variation of codon usage between genes inthe same genome varies widely, with the variation strongly correlatedwith expression. If overall codon usage in the genome is usedas a constraint, we will be ignoring substantial biologicalinformation. On a practical level, fixing codon usage in eachgene also fixes mononucleotide content. Lifting this constraintcreates false signals that correlate with GC content.
This review is divided into five sections. Sections one andtwo present the algorithm, explaining how both obstacles describedabove are overcome. The third section provides evidence forbiological function of the motifs. The fourth section presentsa set of possible applications for this information. The lastsection explores the reasons why such coding region sequencemotifs might arise in different organisms and differ in differentorganisms.
1) Creating a Maximal Entropy Distribution |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
The Robins-Krasnitz algorithm finds short nucleotide motifsin coding regions that are independent of amino acid order andcodon usage. Figure 1 contains a flow chart outlining the algorithm.Codon usage, for this paper, is defined to mean the total fractionof each codon used in a given gene. Finding a set of over- andunderrepresented motifs requires a clear definition of the backgroundor expected distribution of motifs. In other words, we mustdetermine a standard of comparison to give the terms over-andunderrepresented meaning. Because we are searching for signalsthat are orthogonal to the known information, amino acid orderand codon usage, we treat them as constraints. The goal is tofind the most random distribution of motifs from a given setof sequences subject to the constraints that amino acid orderand codon usage are held constant. A probability distributionwhich is completely unstructured except for a set of constraintsis called a maximal entropy distribution (MED). For the problemat hand, the constraints are too complicated to analyticallyfind the MED. However, we can extract the distribution usinga Monte Carlo procedure as long as we randomize at each stepsuch that the constraints are obeyed exactly and no other conditionsare imposed.
The Monte Carlo procedure is implemented using what we willrefer to as a shuffling program. For each coding sequence inthe set of genes of interest, we consider each type of aminoacid separately. At each step in the Monte Carlo, all the codonsin a gene (and then all of the genes in a genome) that are translatedinto the same amino acid are randomly permuted. For example,if a protein has four different Leucines coded by four differentcodons, each iteration of the Monte Carlo will produce a newnucleotide sequence, where the four codons for Leucine are randomlypermuted amongst themselves in the gene. See Figure 2 for anillustration of this process. The same randomization is donefor all 20 amino acids at each step in the Monte Carlo. Thisprocedure clearly obeys all of the necessary requirements togenerate a MED. The constraints are met by construction at eachiteration. Amino acid order is never changed as we are onlypermuting the codons that code for the same amino acid. Additionally,the codon usage for each amino acid in each gene is held constantbecause the permutation does not change the number of any codonin any gene. The shuffling procedure only moves the codons aroundwithin the same gene. Any additional variation beyond the shufflingprocedure would violate one of the constraints, therefore, extractingthe distribution as the average of a number of iterations ofthis Monte Carlo procedure approaches the MED. The distributionconverges quickly, so computational time is not an issue evenfor the full human genome. Convergence usually occurs by 20iterations, but we use 50 to be conservative.
The shuffling procedure described above gives a set of randomizedsequences. How do we extract a probability distribution of motifsfrom a set of sequences? As long as the number of occurrencesof each motif found in the total set of sequences is reasonablylarge, we can form a probability distribution, estimating theprobability of a given motif by its fraction in the set of allmotifs. If a certain motif of length four occurs 2000 timesin the set of sequences and the total number of motifs of lengthfour is 100,000, then we estimate the probability of that motifto be 1/50. An estimate of the variation for the count of aspecific motif is the square root of the count. For counts assmall as 30, the fraction is still a reasonable estimate forthe probability. For the MED, we take the average of a set ofrandomizations, the Monte Carlo iterations.
For practical purposes, we actually only need to keep track of the set of the longest motifs we wish to consider. All shorter motifs can be retrieved from this set. We determine the longest length motif that is statistically robust for the set of sequences we are analyzing. For the case of a bacterium with 1 Mb of coding sequence, we can find motifs up to length seven. Since 106/47~ 60, all the motifs of length seven would be expected to havereasonably large counts. Then, we move a sliding window of thislength across the sets of sequences, moving one base at a time.This process determines the counts for the largest length motifs.However, the shorter length motifs are over counted, becausethey appear in more than one window each. The formula to deconstructthe counts for all the shorter length motifs is found in Box1A. An example is found in Box 1B. The estimated probabilitiesfor each motif are simply the counts divided by the total lengthof sequence. We now have two distributions to compare: one isthe motif probabilities from the real set of sequences and theother is the MED distribution which is found by averaging themotif probabilities from a number of Monte Carlo iterations.
Box 1A. Extracting the Maximal Entropy Distribution from a Set of Sequences.
NMED(w) is the average count for each word w from the Monte Carlo runs. Wimax is the set of words of maximum length where i = 0, . . . , 4Lmax–1. All counts NMED(w) can be determined from NMED(Wimax), but a couple of additional quantities need to be defined. C(Wimax,w) is the number of times the word w is contained in the word Wimax. L(w) is the length of word w. The equation to extract the counts of all the shorter motifs from the counts of the longest motifs is
PMED(w) = NMED(w)/L is the probability of motif w, where L is the total length of sequence. PR(w) = NR(w)/L is the probability of motif w in the real sequence (this is found by counting). |
Box 1B. An Example.
Consider the case where the maximum length is 7, which is the maximum size motif that we can consider for most bacterial genomes. Let w = AAC and W2577 = AACAAAC, and where the label 257 is the base four conversion of the motif with A = 0, C = 1, G = 2, T = 3, i.e. 0010001 = 4^4 + 4^0. L(w) = 3, C(W7257,w) = 2, and L(Wmax) = 7. To get NMED(w), this process is repeated for all 47 Wimax terms. The values are then added in the formula above. |
2) Using Relative Entropy to Choose Over- and Underrepresented Motifs |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
The shuffling procedure defines two distributions, the realdistribution found from the actual sequence and the MED whichwe use as the surrogate for the background. The first obstacleis overcome, because the MED exactly conserves the constraints.Now we need a method for choosing under- and overrepresentedmotifs. The standard we use is from information theory. Themotif that contributes the most bits of information to the differencebetween the real distribution and the MED is the first motifwe choose. Using information theory has the nice feature ofputting all results in the same units, number of bits. Thisallows us to compare motifs of different lengths and motifsthat are either over- or underrepresented. The formula we employto compute the motif contributing the most bits of informationbetween the two distributions is the Kullback-Leibler distanceor the Relative Entropy. The formula is found in Box 2. We computethe Relative Entropy contribution for each motif and pick outthe one with the largest value.
Box 2. The Relative Entropy or Kullback-Leibler Distance.
The Kullback-Leibler distance between the real and MED probability distributions is
A figure of merit that measures the extent to which *any* word w, of length 2 to max, contributes to DKL is
This can also be thought of as a Kullback-Leibler distance between two probability distributions, namely the coarse-grained real and background distributions where we only know if a given word is or is not w. |
The other choice of distance is a symmetrized version of theKullback-Leibler (KL) distance, called the Jensen-Shannon (JS)divergence. The practical difference is small for our applicationas the KL distance is symmetric to lowest order in our application.The JS divergence also smoothes the result, which removes someof the signal. JS converges faster than KL upon rescaling, sowe use KL.
Now that we have found the most under- or over-represented motif in the sequence, we have to tackle the second obstacle. We need to pick out the motif which is the next most under- or overrepresented. However, we cannot simply take the motif which has the next largest Relative Entropy. This is because the motifs are overlapping, so under- or overrepresentation of a given motif affects the distribution of all the other motifs. The example of CpG from the introduction illustrates this point. If we take the human genome, CG will have the largest Relative Entropy. However, all eight trimers which contain CG as a subset fall within the top 50 highest Relative Entropy of all motifs. This is simply an artifact of the selection against CG. We must first remove the contribution of CG from the MED before recalculating the Relative Entropy to find the next motif. If we call the motif w, we rescale all motifs that contain w, by the same amount such that the rescaled MED has the same distribution for w as the real distribution. This forces the Relative Entropy of w to zero and, at the same time, removes the contribution of wfrom all other motifs.
The mathematical details of the rescaling are as follows with many of the definitions in Boxes 1 and 2. After a motif w is found to be over- or underrepresented, its contribution to the difference between the real distribution and the MED needs to be eliminated. The maximal entropy distribution is rescaled in a minimal way, such that the contribution of w becomes identical in both the real distribution and the MED. For the rescaling to be minimal, the ratios of frequencies of words Wimax of length max that contain w the same number of times should not change. That is, all words Wimax with the same C(Wimax,w) must be rescaled by an equal factor. Consider a coarse graining of the detailed probability distributions. The MED is defined as the set of words Wimax of length max, with the probabilities PMED(Wimax). The set of Wimax is partitioned into disjoint subsets where each element of a given subset contains the motif w an equalnumber of times. These sets are
with J = {0, … , max – 1} and
These disjoint subsets, KJ(w), need to be rescaled such thatthe probability of being in a given subset is equal in the realdistribution and the MED. Define
These are well-defined probability distributions because they are grouped elements from the old probability distribution (and their probabilities are added). A rescaling that factors out the contribution of w while conserving probability is givenby
for all i. Note that with this rescaled distribution the figure of merit for w is now Srescaled(w) = 0, so the contribution of w to DKL has been removed. Now, the whole procedure can be repeated to find the next word w, etc.
It is not hard to see that in this iterative algorithm, the MEDPMED(Wimax) converges to the real distribution PR(Wimax). This is because DKL is monotonically decreasing (proof is found in Robins et al. 2005). It is well known that DKL is nonnegative, and is 0 if and only if the two distributions are identical. The algorithm will not halt before convergence is achieved since S(w) = 0 for all w also implies that the real and background distributions are identical. Finally, DKL cannot converge toa positive value, since one could then find a word that wouldreduce it below that value.
The procedure is iterated, so that we remove the contribution of one motif at a time from contributing to the Relative Entropy through rescaling of the MED. Then, we choose the next motif. The rescaling automatically over-comes the second obstacle. The impact of selection on a motif w, affects all other motifs,especially its supersets and subsets. Our scaling proceduresystematically handles all of these motifs at the same time.
Iterations are performed until the largest Relative Entropy for any motif is small enough that we would expect it by chance. This cutoff is the point where it becomes likely that chance fluctuations would create the most significant remaining word, appropriately corrected for multiple hypotheses (the set of all words of length w). The cutoff occurs when the selected word w satisfies
where (w) is the standard deviation of the background count for w. For all our applications we stopped after 100 iterations,which is substantially below the cutoff.
The algorithm gives a set of motifs of varying lengths whichare under- or overrepresented in the coding region. Each motifvaries in a statistically significant manner between the realand Maximal Entropy Distributions. Having this list, the questionthen arises; do these sequence motifs found under- or overrepresentedin the coding regions of genes have any biological function?
3) Evidence for Biological Function |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
A computational sequence analysis like the one preformed above can provide evidence for function in two ways. Our functionality tests are applied to bacterial genomes because of the wide range of sequence data available on the NCBI website. Below we show that these over- and underrepresented sequence motifs have been evolving with the bacteria over billions of years of evolution. We need to be a bit careful here, because motifs can appear conserved for reasons other than direct biological function. In particular, motifs can be also be conserved by historical accident as well as by unknown biochemical constraints. Since we are considering billions of years of evolution, the rate of accidental conservation should be very low. However, we are not able to separate biochemical constraints from functional elements. Next, we can, in some circumstances, directly associate a motif with a known function. Within the set of underrepresented bacterial motifs, most of the known restriction sites are found. For example, Table 1 shows the results of the top 100 most under- and overrepresented motifs from the coding regions of the E. coli genome. We suggest that the algorithm presented is a validmethod to determine an exhaustive list of restriction sitesthat are found in different bacterial organisms. This also showsthat different species and genera of bacteria have differentsequence motifs over- or under-represented. There is no universalsequence motif avoided or preferred by all bacterial organismssequenced.
Before providing evidence for evolutionary conservation, we must first provide a necessary check that the motifs are real properties of bacterial genomes. We choose a particular genome, in our case E. coli, and randomly divide the full set of genes in half. We run our algorithm on both sets independently and focus on the top 100 motifs from each half. This self consistency check requires both lists to be the same up to minor fluctuations. This is exactly what we get. There are minor variations in order, but every one of the top 80 motifs from one half is within the top 100 on the other list. We repeated this process for many random divisions and always pass this consistency check. We examine two different Null hypotheses relating to get statistics for the previous statement. We reject the Null hypothesis that two different sets of motifs from separate runs are drawn from independent distributions with a P-value << 10–7. We are unable to reject the Null hypothesis that the two sets of motifs from different runs are drawn from the same distribution, P-value = 4. We are unable to find evidence for different motifsin separate divisions of the genome.
Additionally, we have analyzed the frame bias for the startof the motifs. We asked the question, are the motifs preferentiallybeginning in a particular reading frame. If the answer was yes,this could either be an artifact of the method or be a biologicallyfunctional. However, there is no frame preference overall. Thedistribution among individual motifs agrees with a random distribution(data not shown).
We use the E. coli example to explore the relative effect ofthe constraints. If we remove the constraint on amino acid orderin each gene, we maintain the constraints on GC bias and codonusage. Lifting this constraint makes the least change betweenthe distributions. The result is that 40% of the top 100 motifsare completely different, while 60% are the same. Removing furtherconstraints drops the overlap of motifs below 25%.
How do the motifs in one bacterium relate to the motifs in another bacterium? Using the list of motifs the Robins-Krasnitz algorithm produces for each of the 164 bacterial species found in the NCBI database, we defined a Hamming distance as a metric. We constructed a phylogenetic tree of these bacteria based entirely on the motifs and this metric (Figs. 3A, 3B). The tree formed from over- or under-represented sequences was then compared with an evolutionary tree constructed from the ribosomal gene family sequences, which is considered at this time the gold standard. The tree created by the Robins-Krasnitz algorithm captures most of the standard bacterial taxonomy with a whole or partial genome approach (12). This distribution of listsof over- and underrepresented sequences for each organism shouldpermit the assembly and classification of other unknown organismsfrom their sequence data. Thus, this method can be used to takeadvantage of shotgun sequence data, because it does not requirealignment of orthologous genes. As long as a sufficient amountof sequence from a bacterium, e.g. over 50,000 bases, are sequenced,this method can accurately place the unknown organism into aregion of the bacterial phylogenetic tree.
The last piece of evidence for biological function is found in the relationship between viruses and their hosts. As viruses are intercellular parasites, their genomes are subject to similar selection pressures to the host genomes. There has been a large amount of work comparing bacterial genomes to their infecting phage genomes. Codon usage and dinucleotide content have both been utilized with some success to illuminate features that are shared preferentially between phage genomes and their hosts (14). The motifs from the Robins-Krasnitz algorithm capture substantially more information about the relationship between phages and hosts. These over- and underrepresented motifs provide a prediction accuracy of phages to their known hosts of almost double the other methods. Table 2 displays the results of the Robins-Krasnitz algorithm applied to 131 phages with known bacterial hosts. The algorithm is used to match the phages to the set of 108 bacterial genera using a scoring algorithm described in Robins et al. 2005.
4) Applications |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
The set of applications of the Robins-Krasnitz algorithm maybe grouped into three different categories. The first is a directutilization of the algorithm to enhance medical research. Thesecond is as a bioinformatics tool to assist in the identificationand categorization of defined motifs or signals in the genome.Third, the motifs themselves can be used as a stepping stonefor basic research. We have shown indirectly that the motifshave biological function. Future experiments will be requiredto determine the possible biological roles of each of thesemotifs.
Two projects are presently underway for direct application ofthese motifs for medical research. The first project extrapolatesthe phage/bacterial analysis to humans and our viruses. Likebacteria, many of the motifs derived from the human genome matchthe motifs from our infecting viruses. However, viral genescan have regulatory differences when compared to the sequencesin the human genome. The evolutionary pressures on the viralgenomes may result in design differences when compared to thosein the human genome. These differences may well be reflectedin differences in the motifs detected by the Robins-Krasnitzalgorithm. There are striking differences between the motifsfound in the mRNA of HIV-1 and those found in the set of allhuman genes in their coding regions. As mentioned previouslythe genes of HIV are expressed very poorly in their human host.Focusing on a single motif which is overrepresented in the HIV-1genome and underrepresented in the coding regions of the humangenome, we have recoded the Gag gene of HIV-1 to remove thissignal. This simple change in coding doubled the expressionof Gag in a human kidney cell line. The same recoded sequencewas then used as a DNA vaccine in a mouse model. The anti-gagimmune response was six fold higher for our recoded sequence.This avenue of research is presently ongoing and would be afirst step in improving an HIV DNA vaccine. It might also helpexplain the modest success of the HIV DNA vaccines.
The second practical application is to increase the gene expressionof coding regions of genes from one host expressed in a secondhost. Human or viral genes are often expressed in foreign organisms,usually bacteria, yeast, or baculoviruses, for either medicalpurposes or basic research applications. There is a promisingcomputational project underway to create a tool which combinescodon optimization and the contribution of the motifs to recreatea sequence for a particular protein which has increased expressionin a given organism.
Additionally, the motifs can be used as a bioinformatic tool to determine the origin and classification of DNA sequences of unknown origin or to assemble a genome from its parts when multiple organisms compose a sample for DNA sequencing. With ongoing shotgun sequencing projects to explore the bacterial diversity in the Sargasso Sea, the air above Manhattan, and the human digestive track, among others, classifying bacterial phylogeny based on novel sequence is a useful tool (15). Forthis purpose a genome wide property that accurately classifiesa genome will be essential. The common tool presently used isdinucleotide content. This method has advantages over the RobinsKrasnitz algorithm for very short sequences, but as is demonstratedby phage/host analysis, our algorithm becomes significantlybetter than the other methods when the sequences reach a fewtens of thousands of bases.
Finally, as shown in section 3, the motifs themselves appearto have biological function. We expect a subset of these motifsto be protein binding sequences. Further work needs to be doneto determine which proteins are binding these particular sequencesand what their biological function could be in a cell. Havingthese overrepresented sequences provides the tools to isolateand identify DNA sequence specific binding proteins and thento determine their functions. Some of the DNA sequence motifscould be providing physical properties such as flexibility forbending of DNA, packaging DNA in a cell or interactions withRNA polymerases. These DNA sequence motifs could function inthe DNA or the mRNA produced from that DNA sequence. One wayto distinguish between these alternatives is to look for thereverse complement of an over- or underrepresented sequencein the non-coding strand of DNA (note all of the motifs discussedhere come from the coding strand). An analysis of the genesin the human genome with a modified Robins-Krasnitz algorithmhas identified overrepresented sequences that function in bothDNA and RNA molecules (transcription factor binding sites, splicejunctions, splice enhancers, etc.). More than half of thesesequence motifs have unknown functions.
If it is correct that these sequence motifs that are over-and underrepresented in a genome are derived from bases in the third position of the coding regions of a gene and the motifs that are formed are selected for or against, then some third position mutations may not be neutral to selection even though they don’t change an amino acid. Synonymous single nucleotide polymorphisms that have a phenotype could well uncover sequence motifs (out of reading frame) with functions that have important consequences. For example a synonymous single nucleotide sequence in the multidrug resistance gene-1 alters the functionality of that gene product (16). Many of the mutations in the IARC p53 gene mutant databasereside in the third position but no one has tested them fortheir functional consequences, if any. Clearly these ideas bringup some new directions for research in this area.
5) Why Do These DNA Sequence Motifs Arise in the Genome? |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
This review and the prior publication upon which it is based (12) demonstrates that:
- The coding regions of genes in a genome contain sequence motifs (oligonucleotide sequences) that are significantly over- or underrepresented (compared to the MED) even when codon usage, G + C content and amino acid order is normalized.
- The DNA sequence motifs differ in different organisms and are distributed uniformly around the genome (a real exception to this is the G/C rich regions in the genomes of mammals, birds and some reptiles).
- Several tests suggest that these over- and underrepresented sequences are functional and they can be used to form an evolutionary tree much like the ribosomal gene family.
- The best classifier of the host of a bacteriophage is based upon these sequence motifs.
- Several of these sequence motifs can be assigned functions in the cell.
These observations provide good evidence that these sequencesignals in a genome are under positive or negative evolutionaryselection pressures and so genome sequences are shaped by manydiverse sources. But it could be that these sequence motifsarise in a more complex way. The mutation rate and even thenature or the frequency of different types of mutations (transitions,transversions, direction of mutations; A to G or G to A) candiffer in different organisms. Mutation rates can be set bythe variation in the environment or by diversity of the environment.Thus, there is an evolutionary bias for some sequence motifsin a genome and this could differ between organisms, be distributedevenly around the genome, provide evolutionary diversity andan evolutionary tree of related organisms and would impact uponboth the host and viral genomes. It seems likely that some ofthese sequence motifs could arise from a bias in the mutationtype and frequency and might not have a functional consequenceeven though it was selected for by the host. It may be likelythat both functional and non-functional sequence motifs residein this distribution of over- and underrepresented sequences.What is clear is that such sequence motifs identify importantprocesses in the evolution of the host and contain informationabout those processes that will need to be decoded in the future.
|
|
|
|
|
Footnotes |
---|
This work was supported in part by the Simons Foundation, theAmbrose Monell Foundation, and the Leon Levy Foundation.
References |
---|
TOP Abstract Introduction 1) Creating a Maximal… 2) Using Relative Entropy… 3) Evidence for Biological… 4) Applications 5) Why Do These… References |
---|
- Plotkin JB, Robins H, Levine AJ. Tissue-specific codon usage and the expression of human genes. Proc Natl Acad Sci U S A 101:12588–12591, 2004.[Abstract/Free Full Text]
- Vinogradov AE. Isochores and tissue-specificity. Nucleic Acids Res 31:5212–5220, 2003.[Abstract/Free Full Text]
- Sharp PM, Li WH. The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15:1281–1295, 1987.[Abstract/Free Full Text]
- Fuglsang A. Evolution of prokaryotic DNA: intragenic and extragenic divergences observed with orthologs from three related species. Mol Biol Evol 21:1152–1159, 2004.[Abstract/Free Full Text]
- Han JS, Boeke JD. A highly active synthetic mammalian retrotransposon. Nature 429:314–318, 2004.[Medline]
- Han JS, Szak ST, Boeke JD. Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature 429:268–274, 2004.[Medline]
- Schneider R, Campbell M, Nasioulas G, Felber BK, Pavlakis GN. Inactivation of the human immunodeficiency virus type 1 inhibitory elements allows Rev-independent expression of Gag and Gag/protease and particle formation. J Virol 71:4892–4903, 1997.[Abstract/Free Full Text]
- Fairbrother WG, Yeo GW, Yeh R, Goldstein P, Mawson M, Sharp PA, Burge CB. RESCUE-ESE identifies candidate exonic splicing enhancers in vertebrate exons. Nucleic Acids Res 32:W187–190, 2004.[Abstract/Free Full Text]
- Fuglsang A. The relationship between palindrome avoidance and intragenic codon usage variations: a Monte Carlo study. Biochem Biophys Res Commun 316:755–762, 2004.[Medline]
- Karlin S, Cardon LR. Computational DNA sequence analysis. Annu Rev Microbiol 48:619–654, 1994.[Medline]
- Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11:377–394, 2004.[Medline]
- Robins H, Krasnitz M, Barak H, Levine AJ. A relative-entropy algorithm for genomic fingerprinting captures host-phage similarities. J Bacteriol 187:8370–8374, 2005.[Abstract/Free Full Text]
- Artzy-Randrup Y, Fleishman SJ, Ben-Tal N, Stone L. Comment on “Network motifs: simple building blocks of complex networks” and “Superfamilies of evolved and designed networks”. Science (New York, NY) 305:1107; author reply 1107, 2004.
- Blaisdell BE, Campbell AM, Karlin S. Similarities and dissimilarities of phage genomes. Proc Natl Acad Sci U S A 93:5854–5859, 1996.[Abstract/Free Full Text]
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO. Environmental genome shotgun sequencing of the Sargasso Sea. Science (New York, NY) 304:66–74, 2004.
- Kimchi-Sarfaty C, Oh JM, Kim IW, Sauna ZE, Calcagno AM, Ambudkar SV, Gottesman MM. A “silent” polymorphism in the MDR1 gene changes substrate specificity. Science (New York, NY) 315:525–528, 2007.
CiteULike Complore Connotea Delicious Digg Reddit Technorati What’s this?
|