Background

We have all seen examples in which sequence variation has been used to make inferences about past population history. But how is that done? In fact, what is involved is no more than a (very) sophisticated application of coalescent theory. At its core, of course, the key to any coalescent model is an estimate of θ, which in turn needs to be parsed into its two components, Ne and μ. Other factors - migration, recombination, subdivision, etc. are also important, but for now we will focus on those two. First, we will see what we can learn from simple simulations, in which the gene genealogy is known. Then we will turn to some Bayesian methods for working with real data.

Skyline plots

For now, we are going to consider mutation to be a constant. Thus, over the time course of a particular coalescent lineage, the only parameter that can vary is N. Further more, at any given time T, there will be a certain number of lineages, and those lineages will persist for some interval of time. The theory involved here is complex, but in essence Pybus et al. (2000) and Strimmer and Pybus (2001) showed that, given estimates of those over the time course of a robust phylogeny, one could estimate N for each interval. These can then be plotted against time (measured, at least initially, in substitution rate), to generate what is known as a skyline plot. We will look at some simple examples below.

The simple case - N is constant

So let’s have ms generate us a phylogeny of 10 sequences, with θ equal to 5, returning a tree

set.seed(1234)
tr <-ms(nsam=20, nreps=1, opts=("-t 5 -T"))
tr <-read.tree(text=tr)

Now we can look at the tree, along with a simple skyline plot of the tree

plot(tr)
sk <-skyline(tr)
plot(sk)

Note that in all of these plots, the Y axis is in actuality Ne multiplied by the substitution rate (something that we would have to provide).

But remember that we are assuming that periods of rapid coalescence are associated with population size changes, but we are basing this on the mutational process being linearly related to time. Since mutation is a stochasitic process, some of the inferred fluctuation in population size is in fact due to that randomness. We thus need to estimate some minimum interval for which proportionality holds (i. e. correct for error) and collapse the shorter branches. There is a function in ape that uses a liklihood procedure to estimate the best value of that minimum (ε). First, let’s look at the length of the coalescent intervals (in units of subsitutions) calculated from the data:

tr.ci <-coalescent.intervals(tr)
data.frame(tr.ci$lineages,tr.ci$interval.length)
##    tr.ci.lineages tr.ci.interval.length
## 1              20          0.0057214573
## 2              19          0.0013035461
## 3              18          0.0013278760
## 4              17          0.0022316966
## 5              16          0.0051246677
## 6              15          0.0062913243
## 7              14          0.0063191038
## 8              13          0.0206344724
## 9              12          0.0048714131
## 10             11          0.0280374810
## 11             10          0.0132057667
## 12              9          0.0038176402
## 13              8          0.0001407266
## 14              7          0.0103887469
## 15              6          0.0369786471
## 16              5          0.0361812115
## 17              4          0.2172224373
## 18              3          0.1140160561
## 19              2          0.3526494503

Now, we will let the computer smooth the data by calculating a factor epsilon that is designed to remove the random fluctuations from the data:

eps <-find.skyline.epsilon(tr.ci)
## Searching for the optimal epsilon... epsilon = 0.8664637
eps
## [1] 0.8664637

And if we plug that into our skyline plot, and compare it with our previous plot

plot(skyline(tr), main="Uncorrected")
plot(skyline(tr,eps),main=paste("epsilon = ",round(eps,3)))

We see that, as we modeled originally, this is in fact a constant size population.

A population growth model.

A while back, we looked at the effect of population growth on coalescence. Our scenario is repeated below:

Suppose our current population is 105 , but it is the result of growth at the rate of 2% per year and there are 5 generations per year. To factor growth into ms, we need to convert the yearly growth rate into the growth rate per 4N generations. We will also do more “sequencing” by examining 100 samples. We can do so as follows:

N <-100000
r <-.02 #growth rate per year
gen <-5 #generations per year
nyears <-(4*N)/5 # 1 coalescent unit in years
g <-r*nyears # growth rate in coalescence units
m.out <-ms(nsam=100, nreps=1, opts=(paste("-t 12 -T -r 0 702 -G",g,sep=" ")))
tree.gr <-read.tree(text=m.out)

And we can repeat what we did above to get an initial skyline plot

plot(tree.gr)
plot(skyline(tree.gr))

But what about epsilon in this case? Becuase the average branch length is longer, we might intuitively expect that it would be smaller. But let’s examine that

ci.gr <-coalescent.intervals(tree.gr)
eps.gr <-find.skyline.epsilon(ci.gr)
## Searching for the optimal epsilon... epsilon = 0.0003031039

And it is - 3.031039410^{-4}, as opposed to 0.8664637 in the previous case. And we can look at the skyline plots with and without the collapsing and we see

plot(skyline(tree.gr),main="Uncorrected")
plot(skyline(tree.gr,eps.gr),main=paste("Epsilon = ",round(eps.gr,5)))

Note that what we see are stepwise changes - obviously an approximation. And when we find the minimum interval for which noise is filtered out, we see that what was in fact a continuous process gives a signal of a single stepwise one.

Can we detect a bottleneck?

Our final simulation will be of a bottleneck event (taken directly from ms documentation)

tr.bot <-ms(nsam= 150,nreps=1, opts=("-t 6.4 -G 14 -eG .2 0 -eN .3 .5 -T"))
tr.bot <-read.tree(text=tr.bot)
par(mfrow=c(1,1))
plot(tr.bot)
plot(skyline(tr.bot))

Note that with the simple skyline plot, there is no obvious evidence of a bottleneck. But what about if we figure epsilon as before?

ci.bot <-coalescent.intervals(tr.bot)
eps.bot <-find.skyline.epsilon(ci.bot)
## Searching for the optimal epsilon... epsilon = 0.0317345
plot(skyline(tr.bot,eps.bot))

And we see a pretty good indicator of a bottleneck followed by growth. This illustrates an important point: One should never attempt to draw conclusions from an uncorrected skyline plot. Calculation of the minimum coalescence interval (ε) is essential.

Some real data

Up to now, aside from the fact that these are simulated data, there is another issue. That is, everything is scaled in terms of units of substitutions. What we really want, of course, is time, but for that we need to have some empirical estimate of the rate of substitutions that has occurred per year. In fact, the ape package provides some sample data for 193 human deficiency virus sequences, in which that rate could be determined by direct measurement. We can look at these data as follows:

data("hivtree.newick") # load sample data
tr.hiv <-read.tree(text=hivtree.newick)
tr.hiv
## 
## Phylogenetic tree with 193 tips and 192 internal nodes.
## 
## Tip labels:
##  A97DCA1EQTB52, A97DCA1MBFE185, A97DCA1MBS12, A97DCA1MBS30, A97DCA1SJDS17, A97DCA1KCD9, ...
## 
## Rooted; includes branch lengths.
ci.hiv <- coalescent.intervals(tr.hiv)
eps.hiv <-find.skyline.epsilon(ci.hiv)
## Searching for the optimal epsilon... epsilon = 0.01191938
plot(skyline(tr.hiv),show.years=TRUE, subst.rate=.0023,present.year=1997)
plot(skyline(ci.hiv,eps.hiv),show.years=TRUE, subst.rate=.0023,present.year=1997)

Notice that his is a much larger data set than what we’ve been modeling. We have also incorporated an empirically determined substitution rate (substitutions per year) so we can scale the X axis in years before proesent. The Y axis remains scaled by Neµ, but not that is is a logarithmic scale

MCMC estimation

The stepwise model we have been using so far, one that assumes population changes are associated with coalescence events, is clearly not biologically realistic. This method, implemented in APE, derives a smooth population size function based on the given geneologies. The paper describing the work is quite straightforward (at least as theoretical papers go) and is worthy of at least brief perusal Drummond et al. (2005) developed methodology for “smoothing” these curves - in essence it takes an MCMC approach to find a set of parameters that results in a smooth population growth function based on one or more gene geneologies.

The Bottleneck model

So we can use the mcmc.popsize function to determine the prior distribution of N’s; note that we have a burnin cycle of 1000 followed by an MCMC cycle of 10000

tr.bot.mc <-mcmc.popsize(tr.bot,10000,,burn.in=1000,progress.bar=FALSE)

Now we can plot this and compare it to our original skyline plot

par(mfrow=c(1,1))
pop.bot <-extract.popsize(tr.bot.mc)
plot(skyline(tr.bot,eps.bot))
lines(pop.bot,col="red")

And what emerges is a fairly good manifestation of the population growth following the bottleneck.

HIV Sequence

But now we can do the same analysis with real data - the West African HIV samples we looked at previously. We’ll do the same MCMC cycle (which will take a while)

mc.hiv <-mcmc.popsize(tr.hiv,10000,burn.in=2000,progress.bar=FALSE)

And we can now look, as before, superimposing the mcmc fit on the smoothed skyline plot (ε = `r eps.hiv)

par(mfrow=c(1,1))
pop.hiv <-extract.popsize(mc.hiv)
plot(skyline(tr.hiv,eps.hiv),show.years=TRUE, subst.rate=.0023,present.year=1997)
lines(pop.hiv,col="red",show.years=TRUE, subst.rate=.0023,present.year=1997)

And we see more or less what we would expect to see - constant growth from the 1940-1975 period, with an acceleration of growth occurring in the 1975-1990 period, coincident with the spread of the AIDS pandemic.

Arctic Fur Seals

Hoffman et al. (2011) have reported mitochondrial DNA for fur seals from the North Atlantic. We will look more closely at their data when we introduce BEAST; however, we can do a little bit of rough analysis with what we’ve done so far

We should also say a little about animal mitochondrial DNA and why it is so widely used for reconstruction of recent population structure and demographics. It main features can be summarized as follows:

  1. It does not recombine
  2. It has relatively short coalescence times
  3. It contains regions that are both sufficiently conserved to be used in species identification (“DNA barcoding”) but also ones that are highly variable within species (notably the D loop, or origin of replication).

So to proceed, we need to do two things:

  1. We need to build a test phylogeny
  2. We need to apply the above analyses to it.
  3. Based on an estimate of mutation rate, then we should get a profile of the population, scaled in years.
data(furseals)
image.DNAbin(as.matrix(fur.dat))

fur.dat
## 246 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 263 
## 
## Labels: AGP03001 AGP03007 AGP03011 AGP03022 AGP03023 AGP03025 ...
## 
## Base composition:
##     a     c     g     t 
## 0.321 0.281 0.135 0.263
length(seg.sites(fur.dat))
## [1] 45

Now make a distance matrix

fur.dist <-dist.dna(fur.dat)

Now use the simplest possible tree building algorithm - UPGMA - to specify the inferred geneology, after which we will caclulate skyline plots, both with and without correction for mutational stochasticity

library(phangorn)
#tr.fur.u <-NJ(fur.dist)
tr.fur.u <-upgma(fur.dist)
plot(tr.fur.u,show.tip.label = FALSE)

As such, it’s not very informative (nor, in all likelihood, very accurate). But we can use it as input for skyline analysis as follows, first withous smoothing and then with:

#plot(skyline(tr.fur.u))
ci.fur <-coalescent.intervals(tr.fur.u)
 eps.fur <- find.skyline.epsilon(ci.fur)
## Searching for the optimal epsilon... epsilon = 0.002073082
plot(skyline(tr.fur.u,eps.fur),ylim=c(.002,1))

So we could hypothesize a model of historical population growth followed by a more recent decline. But we should be able to improve the model by mcmc analysis:

mc.fur <-mcmc.popsize(ci.fur,10000,burn.in=1000,,thinning=10,progress.bar=FALSE)
fur.pop <-extract.popsize(mc.fur)

And this can be plotted as before

rn <-rev(range(fur.pop[,1]))
plot(fur.pop[,1],fur.pop[,3], xlim=rn,type="l",xlab="time (# Substitutions)",ylab="Ne X Subst. Rate")
lines(fur.pop[,1],fur.pop[,4],col="red")
lines(fur.pop[,1],fur.pop[,5],col="red")

What we see is that there is clear evidence for an historically stable population, but based on this modeling, while there is some evidence of recent decline, it is not dramatic

Putting a time frame on this

Hoffman et al draw on other sources to estimate per site substitution rates; below we have converted them to the number expected for the 263 base pair fragmnet under investigation, to get Mu, an estimate of the rate of substitutions per sequence

gtime <-9.89
mu <-2.71*10^-6
Mu <-(mu/gtime)*263
Mu
## [1] 7.206572e-05

We can now scale both axes by the substitution rate and replot to get what we really want - effective population size vs. time before present:

yrs <-fur.pop[,1]/Mu
yrs <-2012-yrs
plot(yrs,fur.pop[,3]/Mu,type="l",xlab="Yrs. Before Present",ylab="Ne")
lines(yrs,fur.pop[,4]/Mu,col="red")
lines(yrs,fur.pop[,5]/Mu,col="red")

Note that the dates are obviously quite sensitive to the mutation rate; the one employed is given as a conservative estimate by the authors. Nevertheless, based on these fixed parameters, what we see is an historically large population, with an indication of a recent decline

Conclusion

In fact, as we shall see, in order to do real analyses such as this, we really need to turn to BEAST, which allows us to build much more detailed (and hopefully realistic) models, in which we set priors for all of the parameters involved in tree estimation. Remember that what we have based our analysis on hear is the simplest possible algorithm (UPGMA), one that is rarely if ever used in modern analysis. So at this point you should have a grasp on the basics - Skyline plots associate the length of branches and the occurrence of coalescent events with demographic events, thus allowing some inference about past population processes. But what we’ve looked at so far is two fairly simple models. In the first case, we proceed from the assumption that substitutions occur linearly with time, and that coalescent events reflect changes in population size only. We have subsequently changed that to average over small branches, thus reducing noise in the system, and we see that the situation becomes more simplified. Finally, by what is basically a Bayesian curve-fitting algorithm, we can go from a stepwise model to a more realistic continuous one. But remember - the original data for skyline plot analysis is the inferred genealogy, and as we know, a particular set of sequences could be the result of a myriad of different ones. Hence, you should treat everything we have done to this point as a demonstration only, NOT as an rigorous analytical approach to data analysis You may want to use it for preliminary data examination, however, for a rigorous analysis, we need to step back and consider how this analysis might be refined by treating the geneaolgies as a random variable - that is, to explore tree-space, find the distribution of geneolgies that best fit the data, and then go from there. In some cases, we may also be able to empirically estimate substitution rates. This is where BEAST comes in.

Bonus 1. Doing MCMC on the growth model

Note that what follows is very rough code - treat it as such!!

So from our earlier model, we created tree.gr, a simulation of a 2% growth model. We can do the mcmc bit on it.

gr.mc <-mcmc.popsize(tree.gr,10000,burn.in=1000,progress.bar=FALSE)
gr.pop <-extract.popsize(gr.mc)
plot(skyline(tree.gr,eps.gr))
lines(gr.pop,col="red")

And we see a much nicer fit than we got with the previous models.

Bonus 2. What about spaghetti squash as a substitute for fur?

data(spaghetti)
spaghetti
## 42 DNA sequences in binary format stored in a list.
## 
## All sequences of same length: 702 
## 
## Labels: gi|345105052|gb|JF815874.1| gi|345105054|gb|JF815875.1| gi|345105056|gb|JF815876.1| gi|345105058|gb|JF815877.1| gi|345105060|gb|JF815878.1| gi|345105062|gb|JF815879.1| ...
## 
## Base composition:
##     a     c     g     t 
## 0.278 0.196 0.244 0.281
ss.dist <-dist.dna(spaghetti)
ss.tr <-upgma(ss.dist)
plot(skyline(ss.tr))

add in eps:

ss.ci <-coalescent.intervals(ss.tr)
ss.eps <-find.skyline.epsilon(ss.ci)
## Searching for the optimal epsilon... epsilon = 0.0004881657
ss.eps
## [1] 0.0004881657

And plot

plot(skyline(ss.tr,ss.eps))

OK, so it looks like a population expansion model. Now mcmc it.

ss.mc <-mcmc.popsize(ss.tr,10000,burn.in=1000,progress.bar=FALSE)

And do the final bit

ss.pop <-extract.popsize(ss.mc)
plot(skyline(ss.tr,ss.eps))
lines(ss.pop,col="red")

BEAST - An Introduction

So up to this point, we’ve seen that we’ve used coalescent theory, starting from a set of nucleotide sequences, to make inferences about past population history, and in fact we managed to achieve some at least semi-plausible results. However, none of the work we have done, either in class or in lab, would even begin to appropach publishable standards. Think again what our work has been based on:

  1. We assumed the phylogeny, based as it was on progressively grouping similar sequences, was the true gene geneaolgy.
  2. We assumed that “a mutation is a mutation is a mutation” (more on this in a minute).
  3. We assumed that the substitution rate was constant with time.

With those assumptions, we found that we could take a Bayesian approach, starting with the coalescent intervals derived from the tree, and find a distribution of values for N0 . . . N(TMRCA) that best fit the data. But in fact, as we shall see, there is uncertainty in all three of our assumptions, so what we really want to do is something like the following:

  1. Select a model of mutation that is appropriate for our data
  2. Select a “molecular clock” model that best relates our mutation model to units of time.
  3. Based on those, generate a distribution of gene geneaologies that best explain our data
  4. Use that posterior distribution to estimate the gene genealogy, and from that do our Skyline (or other analyses).

BEAST stands for Bayesian Evolutionary Analysis by Sampling Trees - as its name implies, it does all of the following. It is an incredibly complex program, but if managed correctly, it can be very powerful. Thus, before delving into the actual workings of the program, we need to get at at least a few of the underpinnings. For an in-depth introduction to both theory and practice with BEAST, see Drummond and Bouckaert (2015).

Dealing with Forests of Trees

As we shall see, the output of BEAST consists of two files - one contains trees that were accepted as part of the posterior distribution of trees, drawn from the prior distribution that we specified. As an illustration, we will very simplistically at part of the output of a simple BEAST run, performed on some simulated data generated by ms and ms2dna. We’ve done this before - we are simulating 10 sequences of length 703 with 15 segregating sites

system("ms 10 1 -s 15 -r 0 703 -seed 1 2 3 | ms2dna >./BEAST/demo1.fasta")

We can visualize them in the usual way:

seqs <-read.FASTA("../BeastDemo/demo1.fasta")
image.DNAbin(as.matrix(seqs))

. However, rather than staying in R, we will detour (in class) to the BEAST pipeline, doing the following:

  1. Load the data into BEAUTi, the program that prepares the model for BEAST analysis. In so doing, we will, with two exceptions. First, we will specify that we want to use a constant size coalescent model as our prior distribution of trees. Second, We will reduce the number of MCMC cycles to 1,000,000 and specify a burn-in (early trees we discard) of 100,000. Note also that we will only save every 1000th tree.
  2. Run BEAST
  3. Return to R to see some of what we got.

We now return to look at the output. All of it was stored (at our choosing) in the folder BEAST. Its contents are now

list.files("./BeastDemo")
## character(0)

The key files are demo1.log and demo1.trees. Our focus for now will be on the trees. They are stored in what is called a “nexus” format, and they can be easily read by R.

trees <-read.nexus("../BeastDemo/demo1.trees")
length(trees)
## [1] 1001

So we see we have 1001 of them. What we can now do is to sample a few of them and look at their topology:

par(mfrow=c(2,2))
s <-sample(length(trees),4)
x <-sapply(s, function(x) plot(trees[[x]]))

Since both BEAST and our sampling in R are stochastic in nature, the output will vary from run to run. However, notice a few things:

  1. The overall topology of the trees are quite similar. This is to be expected - remember that they were all generated from the same data set, and that furthermore, BEAST has generated (mysteriously at this point) a posterior distribution consisting of trees with the highest likelihood of being the actual one.
  2. The big problem is that, while the nodes mostly connect the same samples, the branch lengths (and the relative timing of the nodes) vary. This is a problem - after all, those are what we are using in our coalescent analysis. This can be visualized by plotting the same four trees, only telling R to ignore branch lengths:
x <-sapply(s, function(x) plot(trees[[x]],use.edge.length=FALSE))

par(mfrow=c(1,1))

Now we see they are much more similar (although still not identical).

So obviously, at this point we need to think more about where the distribution of branch lengths came from. To do so, we need to diverge into mutation theory.

The log file - what is in it

As noted above, BEAST produces two output files, one containing the distribution of trees and the other calculations derived from each of those trees. The log file is almost never used directly - rather it is used as input for analysis in TRACER. However, for completeness sake, we can briefly look at its contents:

dat.log <-read.table("../BeastDemo/demo1.log",header=TRUE)
head(dat.log)
##   Sample posterior likelihood    prior treeLikelihood  TreeHeight
## 1      0 -1039.623  -1087.026 47.40235      -1087.026 0.007748052
## 2   1000 -1038.702  -1085.874 47.17218      -1085.874 0.007449725
## 3   2000 -1031.286  -1088.728 57.44162      -1088.728 0.004433778
## 4   3000 -1039.004  -1085.130 46.12577      -1085.130 0.009592086
## 5   4000 -1037.354  -1085.601 48.24714      -1085.601 0.006894678
## 6   5000 -1029.513  -1088.792 59.27873      -1088.792 0.004545260
##       popSize CoalescentConstant
## 1 0.003377296           41.71167
## 2 0.001964661           40.93975
## 3 0.001295105           50.79246
## 4 0.005800010           40.97588
## 5 0.001967077           42.01593
## 6 0.001183514           52.53946

About the only one of these headings that makes any sense at this point is popSize; in fact, by visualizing it, we can see the posterior distribution of population size estimate from our data:

colhist(dat.log$popSize,xlab="Scaled Population Size",ylab="Frequency",main="Population Size Posterior Distribution")
abline(v=median(dat.log$popSize),col="darkred")

We can also get a preview of one of the kinds of output we will be looking at:

plot(dat.log$popSize,xlab="Sample",ylab="N", main="Population Size Scaled by Substitution Rate",type="l")

What we see is evidence for good “mixing” - that is, there are no evident correlations among sequentially accepted values in the distribution.

As we will see however, we can learn much more from this output.

A Note on Rates

With BEAST, one can either use a predetermined substitution rate (preferably derived from independent data) or ask the program to estimate it. In either case, the rate that BEAST works from is substitutions/site/unit time. And if you can estimate of the rate as part of the model (i. e. you have one or more calibration points), you do need to specify a prior distribution. Here, the authors recommend selecting a log-normal distribution, for which you specify a mean and standard deviation. In one of the worked examples (RSV), for example, the recommendation is for a log-normal distribution with mean of -2 and standard deviation of 1.25. This distribution looks like the following

x <-seq(.001,2,.001)
y <-dlnorm(x,-2,1.25)
plot(x,y,type="l")

mean(y)
## [1] 0.4922053
quantile(y,c(0.025,.975))
##       2.5%      97.5% 
## 0.01678136 4.20239757

So in this case, you would be restricting your prior to a range of something like .015 to 4.2 substitutions/site/year, a range that is reasonable for viral evolution.

Summary

So at this point, we need to review the basic approach we are using

  1. We want to create a posterior distribution of gene genealogies that best explain the data.
  2. To do so, we have to specify a tree model (constant, skyline, etc.)
  3. We need to select a mutation model and set appropriate prior distributions for its parameters.
  4. With all that (plus more we haven’t gotten to yet), we put BEAST to work to find the posteriors.

And then comes the analysis of the results. We will be moving to that next, but first, we’ll work through a couple of real data examples.

References

Drummond A. J., Rambaut A., Shapiro B., Pybus O. G., 2005 Bayesian coalescent inference of past population dynamics from molecular sequences. Molecular biology and evolution 22: 1185–92.

Drummond A. J., Bouckaert R. R., 2015 Bayesian Evolutionary Analysis with BEAST. Cambridge University Press.

Hoffman J. I., Grant S. M., Forcada J., Phillips C. D., 2011 Bayesian inference of a historical bottleneck in a heavily exploited marine mammal. Molecular Ecology 20: 3989–4008.

Pybus O. G., Rambaut A., Harvey P. H., Rienzo A. D., Wilson A. C., Donnelly P., Tavaré S., Edwards A. W. F., Felsenstein J., Felsenstein J., Fu Y.-X., Grassly N. C., Harvey P. H., Holmes E. C., Griffiths R. C., Tavaré S., Hasegawa M., Kishino H., Yano T., Holmes E. C., Pybus O. G., Harvey P. H., Kuhner M. K., Yamato J., Felsenstein J., Kuhner M. K., Yamato J., Felsenstein J., Leitner T., Albert J., Leitner T., Kumar S., Albert J., Li W. H., Tanimuri M., Sharp P. M., Nee S., Holmes E. C., Rambaut A., Harvey P. H., Polanski A., Kimmel M., Chakraborty R., Pybus O. G., Holmes E. C., Harvey P. H., Rambaut A., Grassly N. C., Rayfield M. A., Downing R. G., Baggs J., Hu J., Pieniazek D., Robbins K. E., Kostrikis L. G., Brown T. M., Anzala O., Shin S., Rodrigo A. G., Felsenstein J., Slatkin M., Hudson R. R., Swofford D. L., 2000 An Integrated Framework for the Inference of Viral Population History From Reconstructed Genealogies. Genetics 155: 1597–1601.

Strimmer K., Pybus O. G., 2001 Exploring the Demographic History of DNA Sequences Using the Generalized Skyline Plot. Molecular Biology and Evolution 18: 2298–2305.