library(phangorn)
library(TeachingPopGen)

Exploring Gene Genealogies

One of the two key tenets of Darwinian evolution is common descent - the idea that any two organisms share a common ancestor. The same principle holds for alleles - two homologous sequences, regardless of their current differences, at some point in the past shared a common ancestor. The problem is that, unlike reproductively isolated species, the relationships are reticulated, so that different pairs of alleles have different ancestries. The objective of coalescence analysis is to develop the necessary theory to generate expectations regarding the process, and then analyze actual data within that framework. The theory can be daunting, however it is worth the effort, since it offers the promise of providing robust insights into population and species histories.

A Visual introduction

Before delving into the theory, we can consider the problem qualitatively. The data we have consist of homologous DNA sequences, so we know those sequences have some pattern of ancestry. A phylogeneticist interested in that history would use one or more available algorithm to construct a phylogenetic tree, under the assumption that the most similar sequences share the most recent common ancestry. We can do that with our acp29 data, reading it as follows:

data(acp29) #read the data from a local file
names(acp29)<-c(1:17) # rename the sequences to clean up the output

Plotting a tree

Let’s start with a possible tree that purports to trace our sequences to a common ancestor. The easiest way to do that is with a function in the package phangorn(Schliep 2011) (which may need to be installed in the usual way)

#install.packages("phangorn") # uncomment if you need to install
library (phangorn) #note that this make take a while to load
tree.rat <-pratchet(as.phyDat(acp29),trace=0) # Do parsimony; requires conversion of data to different type of object
plot(tree.rat) # Plot the data, ignoring inferred branch lengths

Don’t take this tree too seriously; it is simply a possible gene genealogy, and there are no doubt many more. Remember - we are dealing with alleles in a population, not genes in different species. So we need to dig further.

Furthermore, note that we are now dealing with expectations for for a tree, not for some summary measure. Thus, it is not particularly obvious how we could generate a quantitative distribution of trees that could result in the data (remember - 17 sequences of 702 base pairs, with 15 segregating sites.). As we will see below, we can generate trees based on those paramters - as many as we want - but how do we reduce them down to summary statistics that can be further analyzed?

Generating random trees

We reture to program ms. Remember that it does is to simulate sequence evolution under the neutral model - that is, given either a value of θ or S (as well as, when appropriate, some other population parameters), it will generate simulated data sets; in addition it can, if the appropriate options are set, generate the data necessary for illustrating each set as a tree. ms can be run either from the command line or as part of the package phyclust (Chen and W.-C. 2011); to the greatest extent possible we will use the latter approach.

Running the simulations

So what we will do is to generate 6 random data sets, each with 17 sequences, 702 base pairs and 15 segregating sites. The numbers following the -r parameter indicate that there is 0 recombination and 702 total positions, and the -T option indicates that we wish to have descriptions of the trees generated.

tr <-ms(nsam=17,nreps=6,opts ="-s 15 -r 0 702 -T" )

Plotting the trees

We will talk a lot more about using ms output, but for now, we can use the read.tree function in ape(Paradis et al. 2004) to extract the data and then plot the trees:

par(mfrow=c(3,2))
trees <-read.tree(text=tr)
sapply(trees,plot)

par(mfrow=c(1,1))

These could be thought of as the gene genealogies of six different genes, all sampled from the same individuals. What we see is a veritable forest (and if we increased nreps, we could get even more - ms is VERY fast, so that generation of 1000 simulations is a simple matter). The question now becomes how to summarize the simulations in a way that is amenable to our usual form of analysis. That is where we will go next.

Summary

Everything we have covered assumes that a given set of alleles has a common ancestor (usually referred to as MRCA, or most recent common ancestor). We have also assumed that the processes that gave rise to the set of alleles we’ve sequenced are the stochastic processes of mutation and drift, which are the components that go into θ, the neutral parameter. What we see, however, is that there are, even for a small sample of alleles, a nearly infinite number of geneologies that can give rise to the observed data. What we will now do is to divert into theory for a bit and ask how we might use some of the work we have done in analyzing DNA polymorphisms to see how we can better quantify the coalescent process.

Part 2. The Underlying Theory

So keep in mind - our essential questions is the following: Given an observed set of allelic DNA sequences, what can we infer about the evolutionary processes that gave rise to them? For now, we’re going to simplify the question by only asking whether the data can be explained by the processes of mutation and drift occurring in a finite population. Furthermore, since we are working backwards from extant species, we needn’t be concerned about the vast number of new variants that would have been lost by the processes we’ve already discussed.

The approach, which we already outlined, is to make inferences about the ancestry of alleles. The classic approach, developed by Kingman (1982) is to consider the problem from the standpoint of the whole population. However, in most cases, the data one works from consists of a sample drawn from that data. We will briefly consider the first approach, but focus most of our attention on the second.

Looking back one generation

First, in a population of N individuals (2N alleles), and assuming that self-fertilization is possible, what is the probability that two lineages coalesce in the previous generation? That is simply \(\frac{1}{2N}\), since there are 2N alleles in the population. And given that, the probability of two lineages not coalescing is simply \(1-\frac{1}{2N}\).

Going back further.

Suppose we go back r generations. What we want to figure first is the probability of two lineages not coalescing. It’s a fairly easy derivation to show that this is just going to be \((1-1/2N)^r\). We can then use that to ask the following question - what is the probability that two genes first find a common ancestor in generation r? To get that, we combine the probability that they have not coalesced by generation r-1 with the probability (1/2N) that they do in generation r. This leads to

\(P\{coalescence\,in\,generation\,r\}= (1-\frac{1}{2N})^{r-1}*\frac{1}{2N}\)

Let’s look at this a little more closely. What we are saying is that there is a certain probability, over some time interval t, that an event (coalescence) occurs, which, we’ve seen is 1/2N. Now for the next interval t, the same probability exists, only this now a sample size of (1-1/(2N)). This would continue on out as t -> infinity. What this is is a geometric distribution. We can summarize it as follows:

Suppose there is some event that occurs with probability p during a particular time period. What we are examining is the time to the first occurrence of the event. In our case, the event is the most recent coalescent event.

Rather than get into the math of it, let’s look at what our expectation would be, using simulation. Let’s assume a population of 100, with 2N thus = 200, and generate 1000 geometrically distributed random values

N <-100
geo <-rgeom(1000,1/(2*N))
mean(geo)
## [1] 211.205
hist(geo,main="Time to Coalescence, N=100",xlab="Generations")

And we see (which we could also show analytically) that the mean time to coalescence is, in fact 2N. For that reason, from here on out, we will scale time in units of 2N, so that we have, for this example, the following:

t generations
1 200
2 400
3 600

etc.

For sample size n

Our main interest is in looking at an actual set of allelic DNA sequences in a coalescent framework. In so doing, there are two quantities that are useful - time to the most recent common ancestor (TMRCA) and total branch length. We will take a brief analytical look at both to see how we can generate expectations and then see how we can use ms to examine their distributions. Note that in what follows, the lower case “n” will refer to the number of sequences in our experimental sample, as opposed to the upper case N, the effective size of the population

TMRCA

So, when scaling time by 2N, we see that the expected rate of coalescence for two alleles is 1. But in a real experiment, we would have some sample of n alleles. We already saw when talking about nucleotide diversity that there are n(n-1)/2 pairs of sequences in a sample of n, thus we have the same number of pairs of sequences coalescing. so if the rate of 1 pair of sequences coalescing is 1, the rate for for n sequences is simply n(n-1)/2. So here is the important point - the time for the first coalescent event to occur is the reciprocal of the rate or

\(\hat{t}_{coalescence\,1}=\frac{1}{\frac{n(n-1)}{2}}\)

and rearranging, we get

\(E(t_{1})=\frac{2}{(n(n-1)}\)

Now let’s go back one more step. If two lineages have coalesced into 1, we now have a total of n-1 remaining. So we repeat the process, only now substituting n-1 for n in the above equations and get

\(E(t_{2})=\frac{2}{(n-1)(n-2)}\)

And for generation k,

\(E(t_{k})=\frac{2}{k(k-1)}\)

And we would continue this until there is only one lineage left, at which we have reached the most recent common ancestor. And how long will that take? it will simply be the sum of all of the tk’s, or

\(E(TMRCA) = ∑_{k=2}^{n}\frac{2}{k(k-1)}\)

An Example

Let’s use the example of k=4 and figure out the times to the first, second, and third coalescent events:

n <-c(4:2) # number of lineages working from the present backwards through the coalescent events

t <-1/(n*(n-1)/2) # calculate the tune to each event
t
## [1] 0.1666667 0.3333333 1.0000000
sum(t) #sum to get E(TMRCA)
## [1] 1.5

And we see that the expectation is that E(TMRCA) is 1.5. Keep this value in mind; we will return to it shortly.

Total branch length

In looking at our tree, we can also ask what the total lengths of the branches are. Based on our previous calculations, this actually turns out to be easy. if we have k lineages at a given point, each with an expected time to coalescence (which we’ve already determined) then for the tree length in going from k to k-1 is simply k(E(tk)). For the entire tree, therefore, we can sum this, like we did before, over k=2 to k=n. The algebra of this is pretty straightforward - it ends up to be

\(E[tree\,length]=2∑_{k=1}^{n-1}\frac{1}{k}\)

In our example, therefore

lngth <-2*sum(1/c(1:3))
lngth
## [1] 3.666667

And we see that the total tree length is 3.6666667.

Simulating the distributions of TMRCA and Total Tree Length

So we have expectations for our two quantities, but as always we’d like to look at their distributions. This is actually not all that hard to do analytically, but as a means of further exploring ms, we can do it, as we have so often, by simulation. For now, we will stick with the example on n=4. However, what we will do is to ask ms to do 1000 simulations of coalescence of four sequences, each with 1 segragating site (the choice of this number is completely arbitrary at this point).

sims <-ms(nsam=4, nreps=1000,opts=("-s 1 -L"))

Because we specified the -L option, part of the information returned is TMRCA and tree length. We will use the read.ms.output function (in the course package) to get the data into a usable form:

sims.out <-read.ms.output(txt=sims)
## 100  200  300  400  500  600  700  800  900  1000

Note that this returns a list of objects; what we are interested is in sims$times:

colnames(sims.out$times) <-c("TMRCA","Length")
head(sims.out$times)
##          TMRCA    Length
## [1,] 0.4049056 0.9857655
## [2,] 0.8504165 2.5265317
## [3,] 1.5684915 3.8566498
## [4,] 0.6194265 1.4058521
## [5,] 0.3960569 1.0428001
## [6,] 0.3295677 0.7883637

The first column of this is simply the estimates of TMRCA for each of the 1000 simulations; the second column is the total tree length. We can look at their means and do quick plots of them as follows:

tms <-2*sims.out$times
sim.mean <-apply(tms,2,mean) #ms scales in terms of 4N, so we've mutliplied by 2 for comparison's sake.
par(mfrow=c(2,1))
sapply(1:2,function(x){ hist(tms[,x],main=colnames(tms)[x],xlab=" ")
                        abline(v=sim.mean[x],col="red")})

sim.mean

The top panel shows the distribution of TMRCA; the bottom one that of total length. In both cases, the red lines indicate the means. Remembering that we arrived at the expected TMRCA to be 1.5 and that of total length to be 3.6666667, we see that our simulated values are quite close to that - 1.5128706 and 3.7036213.

Summary

Remember where we want to be. We want to be able to find summary statistics that describe the coalescent process, which we can then use to compare observed data to the expected distribution. Now we have two numbers for coalescence - TMRCA and total tree length - and we can simulate both their expected values and their distributions. But remember the summary statistics we have for DNA - π (mean pairwise differences) and S (number of segregating sites). How can we bring these together? That’s where we’re headed next.

Applying Coalescence to Data

So at this point, we have the following:

DNA variation

We have seen how, with aligned allelic sequences, we can quantitate the amount of variation in a population in terms of the number of segregatinv sites (S) and the average pairwise difference between sequences (π). We have also seen, that akin to the distribution of allele frequencies, we can visualize the distribution of variation in the population in terms of the site frequency spectrum. However, thus far, it is not obvious how we can compare what we have observed to what is expected under a mutation-drift model.

Coalescence

Using some fairly basic logic, we have seen that we can start with the simple idea that if there are 2N gametes, in generation t, the probability of a single coalescent event occurring in generation t-1 is 1/2N and extend that to calculate two expectations for n alleles:

  1. \(E(TMRCA) = ∑_{k=2}^{n}\frac{2}{k(k-1)}\), and
  2. \(E[tree\,length]=2∑_{k=1}^{n-1}\frac{1}{k}\)

So what we need to do is to bring these together, that is to use coalescent theory to generate expectations for our measures of sequence variation

The infinite sites model

Up to this point, we have been using the infinite allele model, which says in essence that every mutation happens only once, so any homozgote carries two alleles that are identical by descent. The infinite sites model is a variant on this theme. It assumes that mutations can occur at an infinite number of sites, so again every mutation is unique, and no more than one mutation will ever occur at a particular site. This actually turns out to be a fairly good approximation of the mutational process for the time scale with which we are dealing.

π and coalescence.

This turns out to be a fairly simple relationship. Since every mutation can occur only once, the number of differences between to sequences is simply the number of mutations that has occurred. Now consider a single coalescent event - two lineages converging into 1. From what has gone before, we would expect that to occur in 2N generations. Now, suppose mutations are occurring on both branches at a rate of µ per site per generation. It is simple to see, that in 2N generations, we would expect 2Nµ mutations to occur. But that’s for only one lineage leading to the coalescence; the same is occurring on the other lineage as well. Thus, the total number of mutations separating the two lineages, or dij, is simply 4Nµ, or our old friend θ.

But of course, in real data, we have n sequences and n(n-1)/2 comparisons that we average to get π. We can now combine the above result with that to get the following:

NS7

NS7

We can see, by some very simple rearrangement, that the expectation for π, based on coalescent theory, is simply θ! Therefore, the observed value of π can be used as an estimator of θ - it is often referred to as Tajima’s estimator. This is a simple but remarkable result. Remember that we first encountered θ in the context of classical drift/mutation theory. Now we see that it recurs in the context of DNA sequence coalescence, and in fact we have obtained a very simple estimator of it. However, in this case, in contrast to what we saw previously, we can obtain an estimator of θ from readily available data.

An aside

Before going on, there is a point that bears emphasis. Remember that π as we have been using it is the average number of differences per sequence of length n. μ, on the other hand, is usually expressed as mutations/per base/unit time. We need to keep that in mind if we use our relationship

E(θ) = E(4Neμ) = π

to do a simple calculation and estimate Ne. In this case we need to express π per base to make the units consistent. Let’s do so with the acp data:

data(acp29)
pi <-nuc.div(acp29)
pi
## [1] 0.004210659

So what is μ? There are lots of estimates out there, but we can use one estimated by (???) based on actual quantitation of mutations that occurred in a controlled laboratory system. That estimate is 8.4*10-9/generation, so we calculate as follows:

mu <-8.4e-9
Ne <-pi/(4*mu)
signif(Ne,3)
## [1] 125000

So if this were all there was to it, we could conclude that the effective population size of D. melanogaster is on the order of 125,000. Given the number of flies that are attracted to the average banana, this should seem low. This is something we will return to, but before doing so, we need to return to our main narrative.

S and coalescence

We now turn to S, the number of segregating sites. Under the infinite sites model, it also has a simple interpretation - it is the total number of mutations that has occurred in the time between the MRCA and the sequences being analyzed (remember, one site mutates only once, so sites with identical bases are, by definition, ibd). So what we now have to ask is, given a sample of n sequences, how many mutations would we expect to occur?

We start by remembering our formula for total tree length, given above. We also remember that, in our discussion of the expectation for π, we saw that the expected number of mutations in time t is tθ/2 (half the number that separate two coalescing lineages). So given those relationships, we can easily derive the following

NS3
NS4 NS5

And we see that now we have a second estimator of θ, also known as the Watterson estimator, or θw, which is based on S and n, the number of sequences in the sample.

And, as we did with the Tajima estimator above, we can calculate Ne with the Watterson one as well. Note again, that we have to express S on a per site basis:

S <-length(seg.sites(acp29))/702 #Calculate S per site
k <-length(acp29) # determine the number of sequences
x <- c(1:(k-1))  # determine denminator of the estimator
denom <-sum(1/x) 
th.w <-S/denom # calculate theta
Ne.w <-th.w/(4*mu) #And estimate.
signif(Ne.w,3)
## [1] 188000

We now have two estimates of Ne, which are similar but not identical. As we shall see, the difference between these two estimators leads us to Tajima’s D statistic, one of the most widely used statistics in molecular population genetics.

Conclusion

So now it appears we may have something useful from an empirical standard. We can sample a number of alleles, easily calculate π and S, and with some further simple manipulation based on coalescence, obtain two estimators of θ. Of course, as noted earlier, it is still challenging to interpret what this parameter means, given that it is based on the product of two numbers, one large and one small, that are inherently difficult to estimate. But nevertheless, we now have a framework in which we can start asking questions about the evolutionary histories of genes, and by extension, the populations from whence they were drawn. But before we can jump headlong into analyzing the data, we need to explore the relationship between our two estimators of θ more closely. Then we will need to see how we can use simulation to determine whether actual values, determined from real data, are consistent with the model of mutation and drift that has been inherent in our considerations so far.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

library(knitr)

Exploring Further

So, combining analysis of DNA sequence variation with coalescence theory, we see that we have two simple estimators of θ, one based on the number of segregating sites in the data (S) and the other, which is in fact just the mean pairwise differences, or π. Since they are estimates of the same paramter (and are both unbiased), we might expect them to yield the same value, right? Let’s explore that further.

A simple tree

So what we will do is to use ms to get us a very simple data set. We will then look at the tree, first with the mutations as they were simulated, and second with the same number of mutations (S) but with them occurring on different branches.

Generate the tree

First, we will use ms to generate a tree and dataset with 6 samples and three segregating sites

set.seed(123)
tr <-ms(nsam=6, nreps=1, opts=("-s 3 -T"))
plot(read.tree(text=tr),use.edge.length=FALSE)

Look at the data

Now let’s see the actual gametes that gave rise to this tree

tr.out <-read.ms.output(txt=tr) #Get data in usable form
gams <-as.data.frame(tr.out$gametes)
colnames(gams) <-c("p1","p2","p3")
kable(gams,align="l")
p1 p2 p3
0 0 0
0 1 1
0 1 1
1 1 1
0 1 1
0 0 0

As an exercise, you should now be able to walk through the tree, remembering that the ancestral haplotype was (0 0 0) and position the mutations accordingly. Having done that, we now want to calculate the two estimators of θ.

Calculating π

Remember that we just need to take every pair of sequences, count the number of differences, and divide the sum of those differences by n(n-1)/2. The following function will do that

picalc <- function (gams) {
  k <-nrow(gams)
  l <-k-1
  pi <-0
  for (i in 1:l){
    for (j in 2:k){
      pi <-pi+sum(abs(gams[i,]-gams[j,]))
    }
  }
  pi <-pi/((k*(k-1))/2)
  pi
}

And we can plug in our results from the simulation to get

pi.sim <-picalc(gams)
pi.sim
## [1] 1.6

And that, of course, is our estimator of θ

Estimating θw

For the Watterson estimator of θ remember that we need to divide S by ∑(1/k) summed over 1 to k-1

th.w <-3/sum(1/c(1:5))
th.w
## [1] 1.313869

And we see that we get a similar, but not identical number. So what gives?

Comparing the Two Estimates

Look back at what we have. One of our two estimators, the Watterson one, is based only on S, which in the infinite sites model, is simply the total number of mutations. It says nothing about when and where the mutations occurred. π, on the other hand, involves comparison of sequences in a pairwise fashion, so we might intuitively expect that the position of the mutational events on the geneology might have an effect on it.

We can illustrate this with the tree above. We will keep the number of segregating sites only move the location at which they occur.

We can make up a dataset as follows:

g2 <-as.data.frame(rbind(c(0,0,0),c(1,0,0),c(0,0,0), c(0,0,0),c(0,1,0),c(0,0,1)))
colnames(g2) <-colnames(gams)
kable(g2,align="l")
p1 p2 p3
0 0 0
1 0 0
0 0 0
0 0 0
0 1 0
0 0 1

And as you map these mutations, you will see that they all occur in the tip branches, and as a result, each mutation is represented only once in the data set, and each in a different sequence. So now we can calculate π with our function and compare it to what we got with the original data.

pi.2 <-picalc(g2)

And we get a value of 1.4, which can be compared with the value from the original data set of 1.6. Not a huge difference, but a difference nevertheless. And remember - the Watterson estimator is identical for the two scenarios. So we need to explore the relationship between π and the location of mutations on the tree more closely.

π, S, and the Site Frequency Spectrum

Remember that the site frequency spectrum simply consists of the numbers of sites that differ at 1,2, . . .n/2 among the sequences sampled. It should be very obvious that if we take the sum of all of the classes in the SFS, what we get is S. For π, the relationship is a bit more complicated, but with a little bit of thought, we can get to the following relationship:

π = [∑(i)(n-i)ηi]/((n*(n-1))/2)

where ηi is the number of sites in class i of the SFS.

Don’t worry too much about the math of the above expression; what we want to ask is which of the classes of the SFS makes the most significant contribution to π. To get at that, we simply have to ask when n(n-i), which is the average pairwise difference for a single site when there are i derived sequences, is maximized. We can visualize that over the range of i=1:n, where n=10.

n <-10
i <-c(0:10)
v <-i*(n-i)
plot(i,v,type="l")

If you don’t want to fight with the math, then remember the analogy between π at the sequence level and heterozygosity at the allelic level. We know that heterozygosity is maximized when p=q=.5, so the conclusion we draw here is similar - π is maximized when i=n/2, which would be the case for sites in which f(0) = f(1) = .5

Based on this, look back at our two hypothetical situations. In the first case, with mutations that divide the tree earlier in the process, we get a higher value of π than we do when they occur in tip branches, each separating only one sequence from all of the rest. Now, as we look at the data matrix for the original simulation, we see that p2 and p3 have frequencies of (2,4), closer to that maximum than the frequencies of (1,5) that all three sites have in the second case.

Conclusion.

We can now see the basis for the different estimates we get for θ - one is simply based on mutation number, while the other incorporates the distribution of those mutations on the gene geneology. Again, given that θ is a single parameter, what we have is two estimators of the same thing, so

E[π] = E[θw] = θ

So if our estimates are different, is it significant? And if it is significant, what do we conclude? Before going on to the next section, you might want to view the video below, in which Mohammed Noor explores this question and introduces Tajima’s D statistic.

The Tajima Statistic (D)

Now that we have our two estimators, we need to compare them and then somehow determine if they differ by more than we would expect by chance. This brings us to one of the most widely used statistics in molecular population genetics - Tajima’s D. What we will see is that, while the calculation of the estimator is straightforward, its distribution is another one (like the Ewens estimator) that is best determined by simulation. In addition, it is important to note at the outset is that there is not necessarily a single explanation for an outcome that differs significantly from expectation. What we will do first, therefore, is to define the measure and then look at some real data.

The Formula

The easiest way to see what is going on is to look at the formula (9.2 in Nielsen and Slatkin (2013)):

\(\large{D=\frac{\hat{\theta}_T - \hat{\theta}_W}{\sqrt{\hat{V}(\hat{\theta}_T - \hat{\theta}_W)}}}\)

In this formula, \(\theta_T\) is the notation used for π and \(\theta_W\) is the Watterson estimator, based as we’ve seen previously on the number of segregating sites. the denominator is the square root of the variance of the difference between the two estimators; it is thus always positive. The numerator, on the other hand can be either positive or negative. It is positive when π is large, which we’ve already seen occurs when there are many intermediate frequency sites; it is small (and D negative) when there are lots of sites with only a few variants each.

Some data

But enough of hypotheticals. What we want to do is to look at actual data and calculate D. Once we have, we’ll then focus on interpreting the result. For this, we will use popset 345105052, from the Drosophila mauritania spaghetti squash gene, which were imported into Clustal. A deletion-free segment was then saved to use in this example

library(TeachingPopGen)
data(spaghetti)
dat <-spaghetti
dat
## 42 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 702 
## 
## Labels: gi|345105052|gb|JF815874.1| gi|345105054|gb|JF815875.1| gi|345105056|gb|JF815876.1| gi|345105058|gb|JF815877.1| gi|345105060|gb|JF815878.1| gi|345105062|gb|JF815879.1| ...
## 
## Base composition:
##     a     c     g     t 
## 0.278 0.196 0.244 0.281

As before, we can plot the site frequency spectrum

plot(site.spectrum(dat))

length(seg.sites(dat))
## [1] 92

And we see that most of the 92 segregating sites fall in the first class, with one sequence differing from all of the rest. The function for the Tajima test is simply

D <-tajima.test(dat)
D
## $D
## [1] -2.799181
## 
## $Pval.normal
## [1] 0.005123244
## 
## $Pval.beta
## [1] 9.796647e-06

We note that three values are returned:

  1. $D is the actual value of the statistic. In this case we see that it is negative, suggesting that the tendency is towards a disproportionate number of low frequency variants.
  2. $Pval.normal and $Pval.beta are estimates of the probability of the data, based on the assumption of the distribution of D being either normal or a beta distribution. These values are approximations at best; we will see that simulation provides a much better means of comparing our results to an expected distribution.

What does this tell us?

The negative value of D tells us that there are more low frequency sites than we would expect. What does that suggest about the evolutionary history of the gene and species? We will explore this further, but for now, we can think of a few possibilities:

  1. Most mutations are detrimental, so that they tend to be eliminated by selection, so that their frequencies are low (a negative selection model).
  2. One mutation has arisen recently and is advantageous, leading to its increase in frequency, and thus decrease in relative frequency of others
  3. The population has undergone a recent expansion, so that there are many more recent rather than ancient branches. In that case, there would be many more mutations that would be represented only a few times in the sample, leading to an excess of low frequency sites.

The first two hypotheses invoke natural selection as an explanation, something we will explore in depth in the second part of the course. The third does not - it suggests that in fact, patterns of sequence diversity may be signatures of past demographic events. If so, then perhaps the opportunity exists to use data like these to make inferences about those historical processes.

Conclusions

As noted, Tajima’s D test is one of the most widely used statistics in molecular population genetics, however we’ve left two problems outstanding. First, while the expected value under our mutation-drift model is zero (or close to it), we don’t know a lot about its distribution. Second, even if we see a departure from expectation, the alternative biological hypotheses are complex and diverse. To explore these questions, in the next unit we will see how simulation can provide real insight.

Combining Theory and Data

Having now seen how we can begin to compare observed data to the expectations of the drift/mutation coalescence model, we now want to delve a bit more deeply. In what follows, we will address three related areas, all of which involve combining data with simulations run in ms. In essence, the common approach will be to compare observed results (in the form of S, π, and D) to those generated as a result of repeated simulations.

Assessing the Tajima Statistic

At this point, we need to address two questions. First, goven some value of Tajima’s D statistic, how do we determine its significance? Second, if it is significant, what does it tell us about the evolution of the gene and/or population?

The first question is probably of lesser significance - indeed, as we have seen, the tajima.test function addresses it by comparing the value to either a normal or a beta distribution, and a rule of thumb that works pretty well is that if -1.8 < D <1.8, then its difference from expectation is not significant. However, addressing the question does give us an opportunity to explore how we can use ms to calculate summary statistics, such as Tajima’s D, and analyze their distribution.

For this unit, we will work with an aligned portion of the spaghetti squash data from D. mauritiana NCBI Popset 345105052:

library(TeachingPopGen)
data(spaghetti)
dat <-spaghetti
n <-length(dat) # number of sequences
nbase=length(dat[[1]]) # get the length of one sequence
s <-length(seg.sites(dat))
pi <-nuc.div(dat)*nbase
nbase;s;pi
## [1] 702
## [1] 92
## [1] 5.249084

So we see that we have a set of sequences, each of which is 702 bases long, with S= 92 and π = 5.2490836. Now, we can calculate the Tajima statistic:

D <-tajima.test(dat)
D$D
## [1] -2.799181

And, as we saw before, the value of -2.7991808 that we obtain suggests that there may in fact be a significant difference from expectation. To look at that more closely, we can use ms to generate 1000 simulations, in which S = 92 and there are 42 sequences of 702 nucleotides. We will then convert the results into a manageable form as always.

ss.ms <-ms(nsam=n,nreps=1000,opts=(paste("-s",s,"-r",0,nbase,sep=" ")))

Now we will use the function ms.sumstats to calculate summary statistics on all 1000 simulations

ss.sum <-ms.sumstats(ss.ms)
## 100  200  300  400  500  600  700  800  900  1000
head(ss.sum)
##    S       pi     th.W    Tajima.D     th.H         H
## 1 92 24.70848 21.38076  0.56460397 18.02323  6.685250
## 2 92 21.13937 21.38076 -0.04095612  9.15331 11.986063
## 3 92 17.38792 21.38076 -0.67745436 18.56330 -1.175377
## 4 92 32.84669 21.38076  1.94539133 22.51916 10.327526
## 5 92 21.05226 21.38076 -0.05573549 14.65505  6.397213
## 6 92 29.55865 21.38076  1.38751934 23.41696  6.141696

Notice that the Watterson estimator is the same for all simulations. This is to be expected, since, as we know, it is derived solely from the number of segregating sites, which is fixed in this example.

And what we have is a nicely arranged data frame, containing six statistics. Two of them, th.H and H, we will ignore for now. However Tajima.D is the Tajima statistic, so we can look at the distribution of it in the usual way.

colhist(ss.sum$Tajima.D,xlab="D",ylab="Number",main="Distribution of Tajima Statistic")
Dmean <-mean(ss.sum$Tajima.D)
Dmean
## [1] -0.1314268
abline(v=Dmean, col="blue")

So we see that, based on the simulation:

  1. The expectation, shown by the blue line, is close to zero - not surprising (the reason it is not zero is not important to us)
  2. Our observed value of D, -2.7991808, is significantly different from the expectation, in the negative direction. This suggests, as we’ve mentioned before, that in these data, there are an excess of singletons, but how we can explain that remains a problem.

Estimating θ

Overview

We have seen that both S and π lead directly to estimates of θ, and that Tajima’s statistic can be used to detemine whether the results are significantly different. And we have a data set for which the latter is the case. So what should we go with?

One of the things ms will do is, given both the -s (S) and -t (θ) parameters to return the probability of getting S, given the simulated geneology and the specified value of θ. Let’s try that with the spaghetti squash data. We will now do a single ms simulation based on those values, structure the results with read.ms.output, and see what the probability of getting the results would be. We will use π as our estimator of θ.

set.seed(1234)
ms.out <-ms(nsam=n, nreps=1,opts=(paste("-s",s,"-t", pi, "-r 0", nbase, sep=" "))) #ms call from phyclust; paste function used to create option string
ms.proc <-read.ms.output(txt=ms.out)
ms.proc$probs
## [1] 7.65891e-20

As always, the results will vary depending on the simulation run , but what we see here is that the probability of getting the data given use of π as an estimator is very small - 7.6589110^{-20}. So, perhaps unsurprisingly given the results of the Tajima test, it is highly unlikely that we would see 92 segregating sites, given that theta is estimated based on π.

Putting a Bayesian Spin on the Problem

At this point, we are assuming that there is one parameter, θ, which describes the data, so what we want to do is to find what value of that has the highest probability. What if we make a prior assumption about the distribution of that parameter, and apply Bayesian logic to estimate it from the resulting posterior distribution? Actually, Hudson provides a script to do just that in the distribution of ms; it has been modified below so that it will run with the phyclust function (note that it would run a lot faster using the command line version).

So as a prior, let’s draw possible values of θ from a uniform distribution between 0 and 20:

th <-runif(10000,1,20)
head(th)
## [1] 11.898153  4.585277 15.407041  2.760979 12.824812  8.902351

Now we will use each of those values to do a single ms simulation for 702 bases

out <-lapply(th,function(x) ms(nsam=25,nreps=1,opts=(paste("-t ",x,"-r",0,nbase,sep=""))))

Now we need to read the data appropriately, in order that we can identify those simulations that resulted in 92 segregating sites. Because of the fact that our simulations were run separately for each value of theta, we need to use the lapply function so that the output from each one is read corrrectly

out.ms <-lapply(out,ms.sumstats)
#gams <-sapply(out.ms, function(x) x$gametes)
ms.sum <-do.call(rbind.data.frame,out.ms)
head(ms.sum)
##    S        pi      th.W   Tajima.D       th.H          H
## 1 48  8.853333 12.712005 -1.1664260 21.2300000 -12.376667
## 2 10  2.393333  2.648334 -0.3180734  0.9400000   1.453333
## 3 45 10.526667 11.917505 -0.4470781  5.2233333   5.303333
## 4  7  1.920000  1.853834  0.1103631  0.6633333   1.256667
## 5 55 21.293333 14.565839  1.7853980 17.3733333   3.920000
## 6 65 14.800000 17.214174 -0.5455808 41.2000000 -26.400000

And we get the typical kind of output. So let’s first look graphically at the distribution of segregating sites:

hist(ms.sum$S)

Now, what we have to do is to plot a distribution of values of θ for all of those which have a particular value of segsites. So we will select those values of theta that gave rise to 92 segregating sites.

post <-th[which(ms.sum$S==s)]

What we have done is to select, from the total of 10000 simulations done, the 20 that resulted in 92 segregating sites. Now we can plot it:

plot(density(post), xlim=c(0,20), xlab="theta",main=paste("Posterior Distribution, S=",s,sep=" "))
rug(post)
pmean <-mean(post)
pquant <-quantile(post,c(.025,.975))
abline(v=mean(post),col="red")

pmean; pquant
## [1] 17.55685
##     2.5%    97.5% 
## 12.63820 19.86839

And what we see is that the mean value of that posterior distribution, that is, the mean value of θ for all the simulations that resulted in 42 segregating sites, is 17.5568511. However, note also that the credible interval ranges from 12.6382048 to 19.868392, which covers a lot of territory.

There are many reasons not to place a lot of faith in this result. In addition to the size of the credible interval, the shape of the distribution is somewhat irregular. More importantly from our standpoint, it assumes that θ is the only parameter underlying the data. As we will see, that is not necessarily a reasonable assumption.

Looking at demographic scenarios

Finally, we really want to address what forces - demographic, selection, etc. - might underlie any departure from expectation. To do that, we’re going to look at only one example as a means of framing the problem we face. Suppose we use our results so far to set S to 42 (the number observed in the data), and let’s compare two demographic models - one with constant population size, and one in which the population has been undergoing exponential growth.

Tree topology

To simplify our display, suppose we only have 10 sequences. First we’ll do our standard coalescent simulation, using the -T option to get the Newick tree for the data.

set.seed(4321)
m.const <- ms(nsam=10, nreps=1, opts=("-s 92 -r 0 702 -T"))

Now, let’s construct a growth scenario. Suppose our current population is 10^5 , but it is the result of growth at the rate of 2% per year and there are 5 generations per year. To factor growth into ms, we need to convert the yearly growth rate into the growth rate per 4N generations. We can do so as follows:

N <-100000
r <-.02 #growth rate per year
gen <-5 #generations per year
nyears <-(4*N)/5 # 1 coalescent unit in years
g <-r*nyears # growth rate in coalescence units
g
## [1] 1600

Now, we can put that into ms using the -G parameter and a value:

m.out <-ms(nsam=10, nreps=1, opts=(paste("-s 92 -T -r 0 702 -G",g,sep=" ")))

And now we plot the two trees

tr.const <-read.tree(text=m.const)
tr.growth <-read.tree(text=m.out)
par(mfrow=c(2,1))
plot(tr.const,main="Constant")
plot(tr.growth,main="Growth")

par(mfrow=c(1,1))

And what we see is that, in the growth model, we have much longer tip branches than we do on the constant one. Per our earlier discussion of D and π, our expectation would be a site frequency spectrum with an excess of singletons, and thus a negative value of D.

The site frequency spectra

We can, of course, also visualize these data as site frequency spectra

out.const <-read.ms.output(m.const)
out.gr <-read.ms.output(m.out)
par(mfrow=c(1,2))
sfs(out.const$gametes[[1]])
## [[1]]
## [1] 10
## 
## [[2]]
## [1] 19 45  0  4  0  9  0 15  0
sfs(out.gr$gametes[[1]])

## [[1]]
## [1] 10
## 
## [[2]]
## [1] 80  6  4  0  0  0  2  0  0
par(mfrow=c(1,1))

Notice that with the growth scenario (on the right) our the spectrum we see is quite similar to that for the original data, in that there appears (at least qualitatively) to be an excess of singletons:

plot(site.spectrum(dat))
## Warning in site.spectrum(dat): 2 sites with more than two states were
## ignored

The impact on D

Let’s consider our original data, with 92 segregating sites, and look at the expected distribution of D under this scenario, using the value of theta we got from our Bayesian estimation:

ms.gr <-ms(nsam=42,nreps=1000,opts=(paste("-s",n, "-r 0 702 -G",g,sep=" ")))
growth.sum <-ms.sumstats(ms.gr)
## 100  200  300  400  500  600  700  800  900  1000
head(growth.sum)
##    S       pi     th.W  Tajima.D      th.H         H
## 1 42 3.397213 9.760783 -2.282387 1.2369338 2.1602787
## 2 42 3.807201 9.760783 -2.135339 3.1196283 0.6875726
## 3 42 4.234611 9.760783 -1.982042 0.7409988 3.4936121
## 4 42 2.617886 9.760783 -2.561904 0.1138211 2.5040650
## 5 42 2.973287 9.760783 -2.434434 0.2462253 2.7270616
## 6 42 2.831591 9.760783 -2.485255 0.1927991 2.6387921
Dg.mean <-mean(growth.sum$Tajima.D,na.rm=TRUE)
Dg.mean
## [1] -2.33726

And what we see is as expected. The mean value for D is -2.3372598, which is comparable to out observed value. Just to complete the story, we can do a rough histogram of the simulated values

par(mfrow=c(1,1))
d.gr <-growth.sum$Tajima.D[!is.na(growth.sum$Tajima.D)]
colhist(d.gr,xlab-"D", ylab="Number",main="Distribution of D, Growth Model")
Dg.mean <-mean(growth.sum$Tajima.D,na.rm=TRUE)
abline (v=Dg.mean, col="red")
abline(v=D$D,col="darkgreen")

Dg.mean
## [1] -2.33726

And we see that the mean value of D is -2.3372598, which is negative as expected. Remembering that the value of D we calculated from our original data was -2.7991808, we might be tempted to conclude that the growth scenario better fits the data than does the original constant population size neutral model (although we see that the observed value of D, shown by the green line, does differ significantly from expectation). But of course, there are no doubt many other scenarios that could fit as well.

Conclusion

What we have done is to briefly introduce some of the kinds of analyses we can do, comparing observed nucleotide sequence data to coalescent simulations. The major points we’ve focused on are

  1. We can generate an expected distribution of the Tajima statistic and compare an observed value to it, providing the best test for determining whether a significant difference between observed data and the coalescent expectation exists.
  2. If we assume that population size has been constant, and that only it and mutation rate have been factors in generating the observed data, then we can use Bayesian logic to come up with at least a preliminary estimate of theta.
  3. Demographic history can be a critical factor in the coalescent process.

These last two points lead to a very interesting (and practical) question. Suppose we wanted to use a Bayesian approach, where we wished to work from prior distributions of both θ and growth rate. In that case, we would need a way to sample across a large number of possible combinations of the two. And we might also want to incorporate a prior distribution of µ as well, even further complicating the story. This is where Monte Carlo methods come in, in which random sampling of the possible parameter combinations is used to go from a multidimensional prior distribution space to a posterior distribution that can provide estimates of the parameters that make up that space. We will explore that approach in the future.

References

Chen, W.-C., 2011 Overlapping Codon Model, Phylogenetic Clustering, and Alternative Partial Expectation Conditional Maximization Algorithm.

Kingman J., 1982 On the Genealogy of Large Populations. Journal of applied probability 19: 27–43.

Nielsen R., Slatkin M., 2013 An Introduction to Population Genetics: Theory and Applications. Macmillan Education.

Paradis E., Claude J., Strimmer K., 2004 A{PE}: analyses of phylogenetics and evolution in {R} language. Bioinformatics 20: 289–290.

Schliep K. P., 2011 phangorn: phylogenetic analysis in R. Bioinformatics 27: 592–593.