Phylogenetic invariants[1] are polynomial relationships between the frequencies of various site patterns in an idealized DNA multiple sequence alignment. They have received substantial study in the field of biomathematics, and they can be used to choose among phylogenetic tree topologies in an empirical setting. The primary advantage of phylogenetic invariants relative to other methods of phylogenetic estimation like maximum likelihood or Bayesian MCMC analyses is that invariants can yield information about the tree without requiring the estimation of branch lengths of model parameters. The idea of using phylogenetic invariants was introduced independently by James Cavender and Joseph Felsenstein[2] and by James A. Lake[3] in 1987.
At this point the number of programs that allow empirical datasets to be analyzed using invariants is limited. However, phylogenetic invariants may provide solutions to other problems in phylogenetics and they represent an area of active research for that reason. Felsenstein stated it best when he said, "invariants are worth attention, not for what they do for us now, but what they might lead to in the future." (p. 390)
If we consider a multiple sequence alignment with t taxa and no gaps or missing data (i.e., an idealized multiple sequence alignment), there are 4t possible site patterns. For example, there are 256 possible site patterns for four taxa (fAAAA, fAAAC, fAAAG, … fTTTT), which can be written as a vector. This site pattern frequency vector has 255 degrees of freedom because the frequencies must sum to one. However, any set of site pattern frequencies that resulted from some specific process of sequence evolution on a specific tree must obey many constraints. and therefore have many fewer degrees of freedom. Thus, there should be polynomials involving those frequencies that take on a value of zero if the DNA sequences were generated on a specific tree given a particular substitution model.
Invariants are formulas in the expected pattern frequencies, not the observed pattern frequencies. When they are computed using the observed pattern frequencies, we will usually find that they are not precisely zero even when the model and tree topology are correct. By testing whether such polynomials for various trees are 'nearly zero' when evaluated on the observed frequencies of patterns in real data sequences one should be able infer which tree best explains the data.
Some invariants are straightforward consequences of symmetries in the model of nucleotide substitution and they will take on a value of zero regardless of the underlying tree topology. For example, if we assume the Jukes-Cantor model of sequence evolution and a four-taxon tree we expect:
fACAT-fCGCA=0
This is a simple outgrowth of the fact that base frequencies are constrained to be equal under the Jukes-Cantor model. Thus, they are called symmetry invariants. The equation shown above is only one of a large number of symmetry invariants for the Jukes-Cantor model; in fact, there are a total of 241 symmetry invariants for that model.
4x | xxxx (e.g., AAAA, CCCC, ...) | 1 | 4 | 3 | |
3x, 1y | xxxy (e.g., AAAC, AACA, ...) | 4 | 12 | 44 | |
2x, 2y | xxyy (e.g., AACC, ACCA, ...) | 3 | 12 | 33 | |
2x, 1y, 1z | xxyz (e.g., AACG, ACGA, ...) | 6 | 24 | 138 | |
1x, 1y, 1z, 1w | xyzw (e.g., ACGT, CGTA, ...) | 1 | 24 | 23 | |
Totals = | 15 | 241 |
JC69* | Jukes-Cantor | |
K80* | Kimura two-parameter | |
K81* | Kimura three-parameter | |
SSM (CS05) | Strand-specific model | |
GMM | General Markov model |
Phylogenetic invariants, which are defined as the subset of invariants that take on a value of zero only when the sequences were (or were not) generated on a specific topology, are likely to be the most useful invariants for phylogenetic studies.
Lake's invariants (which he called "evolutionary parsimony") provide an excellent example of phylogenetic invariants. Lake's invariants involve quartets, two of which (the incorrect topologies) yield values of zero and one of which yields a value greater than zero. This can be used to construct a test based on following invariant relationship, which holds for the two incorrect trees when sites evolve under the Kimura two-parameter model of sequence evolution:
f1133+f1234=f1233+f1134
The indices of these site pattern frequencies indicate the bases scored relative to the base in the first taxon (which we call taxon A). If base 1 is a purine, then base 2 is the other purine and bases 3 and 4 are the pyrimidines. If base 1 is a pyrimidine, then base 2 is the other pyrimidine and. bases 3 and 4 are the purines.
We will call three possible quartet trees TX [T<sub>X</sub> is ((A,B),(C,D)); in [[newick format]]], TY [T<sub>Y</sub> is ((A,C),(B,D)); in newick format], and TZ [T<sub>Z</sub> is ((A,D),(B,C)); in newick format]. We can calculate three values from the data to identify the best topology given the data:
X=N1133-N1233-N1134+N1234
Y=N1313-N132-N1314+N1324
Z=N1331-N1332-N1341+N1342
Lake broke these values up into a "parsimony-like term" (
P=N1133+N1234
B=N1233+N1134
\chi2=(P-B)2/(P+B)
A classic study by John Huelsenbeck and David Hillis[10] found that Lake's invariants converges on the true tree over all of the branch length space they examined when the underlying model of evolution is the Kimura two-parameter model. However, they also found that Lake's invariants are very inefficient (large amounts of data are necessary to converge on the correct tree). This inefficiency has caused most empiricists to abandon the use of Lake's invariants. Also, because Lake's invariants are based on the Kimura two-parameter model phylogenetic estimation using Lake's invariants may not yield the true tree when the model that generated the data strongly violates that model.
The low efficiency of Lake's invariants reflects the fact that it used a limited set of generators for the phylogenetic invariants. Casanellas et al.[11] introduced methods to derive a much larger set of set of generators for DNA data and this has led to the development of invariants methods that are as efficient as maximum likelihood methods.[12] Several of these methods have implementations that are practical for analyses of empirical datasets.
Eriksson[13] proposed an invariants method for the general Markov model based on singular value decomposition (SVD) of matrices generated by "flattening" the nucleotides associated with each of the leaves (i.e., the site pattern frequency spectrum). Different flattening matrices are produced for each topology. However, comparisons of the original Eriksson SVD method (ErikSVD) to neighbor joining and the maximum likelihood approach implemented in the PHYLIP program dnaml were mixed; ErikSVD underperformed the other two methods when used with simulated data but it appeared to perform better than dnaml when applied to an empirical mammalian dataset based on an early release of data from the ENCODE project. The original ErikSVD method was improved by Fernández-Sánchez and Casanellas,[14] who proposed a normalization they called Erik+2. The original ErikSVD method is statistically consistent (it converges on. the true tree. as the empirical distribution approaches the theoretical distribution); the Erik+2 normalization improves the performance of the method given finite datasets. It has been implemented in the software package PAUP* as an option for the SVDquartets method.
"Squangles" (stochastic quartet tangles[15]) represents another example of an invariants method[16] hat has been implemented in software package that is practical to be used with empirical datasets. Squangles permit the choice among the three possible quartets assuming that DNA sequences have evolved under the general Markov model; the quartets can then be assembled using a supertree method. There are three squangles that are useful for differentiating among quartets, which can be denoted as q1(f), q2(f), and q3(f) (f is a 256 element vector containing the site frequency spectrum). Each q has 66,744 terms and together they satisfy the linear relation q1 + q2 + q3 = 0 (i.e., up to linear dependence there are only two q values). Each possible quartet has different expected values for q1, q2, and q3:
((A,B),(C,D)); | AB|CD (or 12|34) | 0 | -u | u | |
((A,C),(B,D)); | AC|BD (or 13|24) | v | 0 | -v | |
((A,D),(B,C)); | AD|BC (or 14|23) | -w | w | 0 |