Population Genetics in 2016

Population genetics has a rich history. The Hardy-Weinberg equilibrium dates back to the early part of the 20th century; the rich theoretical work of Fisher, Wright, and Haldane in the 1920s and 1930s and the classic empirical work of Dobzhansky and others laid the groundwork for the selectionist-neutralist controversy of the mid-century; and the 1980s saw the Kingman (1982) coalescent and the beginnings of DNA sequence-based analysis of genetic variation. But for most of its history, population genetics has been a field rich in theory but starving for data.

That has now changed. In the last twenty years, we have seen two revolutionary changes - facile DNA sequencing and the growth of the internet as a medium for data and information exchange. Thus, while population genetics theory continues to flourish and grow, it does so in a data-rich environment. The classical approaches - analytical formulation and simulation - remain central, but the hypotheses developed using those approaches can now be readily tested empirically.

Given this new environment, new pedagogical approaches are needed. It is no longer sufficient to limit even basic instruction in the discipline to the “big A little a” philosophy. More specifically:

  1. The “struggle to measure variation” (Lewontin 1974) is over. In the era of thousands of human genomes and 101 dog genomes, detection and quantification of genetic variation is no longer an experimental challenge (although making the genotype-phenotype connection remains one).
  2. The single-gene approach to the field is only the starting point - variation can now be (and indeed must be) considered at the genomic level.
  3. The computational power of even the most basic laptop computer makes large-scale data retrieval and analysis straightforward.
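In that spirit, even the classical single-locus calculation is something a few lines of code can make concrete. As a minimal sketch (the function name here is illustrative, not from any package), given the frequency p of allele A, the Hardy-Weinberg expected genotype frequencies are p², 2pq, and q²:

```r
# Expected Hardy-Weinberg genotype frequencies at a biallelic locus,
# given p = freq(A); q = freq(a) = 1 - p
hw_expected <- function(p) {
  q <- 1 - p
  c(AA = p^2, Aa = 2 * p * q, aa = q^2)
}

hw_expected(0.6)  # AA = 0.36, Aa = 0.48, aa = 0.16
```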

Analytical vs. Numerical Approaches

If you were to open any classical population genetics text to a random page, chances are you would find equations. Lots of them. And some of them would be quite awful. Here’s an example, taken from Felsenstein (2015):

[Equations from Felsenstein (2015) not reproduced here.]

To put it mildly, such equations are intimidating, and many people (myself included) find this approach rather uninformative on its own. So what’s the alternative?

In a quantitative field like this, some algebra (and occasionally higher math) is unavoidable. However, numerical simulation can be a much more intuitive approach. Rather than deriving exact mathematical solutions to problems, we can ask our computer to simulate the results of a particular set of processes and then examine those simulated results to determine how well they explain observed phenomena.
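To give a flavor of what this looks like in practice, here is a minimal sketch (the function name is mine, not from any package) that simulates genetic drift by treating each generation as a binomial draw of 2N gametes:

```r
# Genetic drift in a diploid population of size N: each generation, the
# new allele frequency is a binomial draw of 2N gametes based on the
# current frequency.
drift_sim <- function(N, p0, ngen) {
  p <- numeric(ngen)
  p[1] <- p0
  for (g in 2:ngen) {
    p[g] <- rbinom(1, 2 * N, p[g - 1]) / (2 * N)
  }
  p
}

set.seed(123)  # for reproducibility
p <- drift_sim(N = 100, p0 = 0.5, ngen = 200)
plot(p, type = "l", xlab = "Generation", ylab = "Frequency of A")
```

Run it a few times without the seed and the essential point emerges more vividly than any diffusion equation conveys it: every replicate wanders differently, and smaller values of N wander faster.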

That is the approach we will be using, to the greatest extent possible. And that also addresses the question of why what follows is built within the R programming language (R Core Team 2015). Quite simply, it is a freely available programming environment that makes this approach feasible with a minimum of coding required. In addition, it runs on multiple operating systems and is highly extensible, so that over the years, simple functions have been developed to address complex problems, and all of that code and material is readily available from central repositories (e.g., CRAN, Bioconductor, and GitHub).

The Elements of this Project

The basic elements of the project are described below. Click on each tab to get more information about various aspects.

Publication and Revision

This project is (and will continue to be) published as a set of interconnected html pages. The reason for this is twofold. First, by so doing, it is possible to include not only text and code, but also external figures, videos, etc. when appropriate. There is some risk to that, in that links to external material can be ephemeral; however, my philosophy is that while some dead links may be inevitable, they are the price we pay for the information that can be incorporated from active ones.

And this leads to the second reason for this approach, which is revision. By using web-based publishing, it is a simple matter to make changes in content as needed and make the most recent versions available to readers. Thus, each chapter includes a date, which is that of the most recent revision.

Text and Code

Most of the material in this project has been developed in the RStudio environment and published as html directly from it. For the most part, raw code is shown; however, the reader can opt to hide some or all of it. The original Rmd code is available on GitHub (and of course, the GitHub-savvy reader could choose to clone the repository and tinker with the code at will).

TeachingPopGen

TeachingPopGen, or TPG, is an R package developed in conjunction with this project; it is envisioned as a way to flatten the learning curve for users who are not conversant with coding in general or R in particular. It was built on the following principles:

  1. Installation of TPG should provide the user with all (or almost all) of the functions and methods necessary to work through the material covered.
  2. The user should have to spend a minimum of time “munging” data. That is, data used in the project should be easily accessed and manipulated.
  3. Functions used are derived as much as possible from base R.
  4. Data structures are kept simple. Ideally, the user should have to understand little more than vectors, matrices, data frames (probably the most important), and lists.

TeachingPopGen is definitely a work in progress - as of this writing (January 08 2017), documentation is far from complete, and it is not ready for prime time on CRAN. It is, however, available on GitHub, and can be installed with the following commands:

install.packages("devtools") #Only necessary if devtools is not already installed.
library(devtools)
install_github("bjcochrane/TeachingPopGen")
library(TeachingPopGen)

R vs. not R

The R purist (and there are many) thinks first about how a problem can be approached purely in that environment. But in the spirit of trying to be accessible to users with varying computer proficiency, there are references to other programs, and at times data manipulations outside of R may be recommended. Again, the goal is not to train programmers but rather to teach population genetics. Thus, for example, Excel, especially with the GenAlEx add-on, is often a better way to get at some problems. Similarly, it would be foolish to ignore elegant standalone packages like STRUCTURE and BEAST. But all of those packages produce numerical output, and at times, working with that output in R provides unique new opportunities for exploration.


Overview of R and RStudio

From the outset, we need to recognize that there are lots of valuable standalone programs for evolutionary genetic analysis, many of which operate on Windows, Mac and Linux platforms and do so in a user-friendly fashion. However, there is one serious downside to exclusive dependence on such packages - many of them require specialized data formats, and outputs from one such program are often not readily input into another.

R (R Core Team 2015), on the other hand, has a couple of advantages:

It is a data-oriented programming language,

so virtually any data can be input in some form or another. For example, one can directly read data from HapMap, ALFRED, Dryad, and other such databases, and with some manipulation (a.k.a. munging) get them into a format that can be used for a wide variety of applications.
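As a small illustration (the counts below are made up purely for this example), a tab-delimited table of genotype counts like one downloaded from such a database can be read straight into a data frame and manipulated immediately:

```r
# A tiny stand-in for a tab-delimited download; read.table() accepts
# local file paths and URLs in exactly the same way as inline text.
raw <- "pop\tAA\tAa\taa\nA\t36\t48\t16\nB\t25\t50\t25"
geno <- read.table(text = raw, header = TRUE, sep = "\t")

geno$n <- geno$AA + geno$Aa + geno$aa       # individuals sampled per population
geno$p <- (geno$AA + geno$Aa / 2) / geno$n  # frequency of the A allele
geno
```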

R is vector-based

For those who remember the days of having to wrestle with subscript indices in languages like FORTRAN and BASIC, R is a delight. For example, suppose you wish to create a vector of the integers from 1 to 100. It is as easy as

x <- 1:100
x
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

No need to set dimensions or any of that stuff. And we can then do simple manipulations on those objects, such as

x*2
##   [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34
##  [18]  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68
##  [35]  70  72  74  76  78  80  82  84  86  88  90  92  94  96  98 100 102
##  [52] 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136
##  [69] 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170
##  [86] 172 174 176 178 180 182 184 186 188 190 192 194 196 198 200

And every element is multiplied by 2.

And we can do plots.

Working with our same vector, we can easily get a simple working plot by

plot(x)

R is eminently extensible

with both built-in functions that carry out complex tasks based on simple commands, as well as add-on packages that provide additional capabilities for specialized tasks. For example, suppose we want to generate a bunch of random binomial variates, akin to doing 100 experiments of flipping a coin 100 times and plotting how many heads we get. We can write

h <- rbinom(100, 100, .5)
hist(h)

And then there are packages.

R is open source, so there are legions of skilled programmers specifically developing add-on packages that can be installed and provide additional functionality. We will use a lot of these packages, including

  • ape (Paradis et al. 2004) and pegas (Paradis 2010) - provide essential functions for population genetic analysis and tree manipulation
  • phyclust (Chen 2011) - an independently developed package that has the interesting ability to run some otherwise command-line standalone programs (ms (Hudson 2002), seq-gen) directly within R
  • rehh (Gautier and Vitalis 2012) - brings the world of Pardis Sabeti and extended haplotype analysis to R. Data munging is challenging, but once done, it is rewarding.

RStudio

By itself, R can be intimidating. In particular, it is a “command line” program, meaning that each command is executed as soon as it is entered, which is not necessarily desirable when performing complex tasks. The program becomes much more functional if multiple lines of code can be entered simultaneously, and, importantly, if the code can be annotated such that a reader can understand what’s being done.

RStudio is an “Integrated Development Environment”, or IDE, that greatly facilitates working with R, and quite importantly documenting that work.

Installation is easy

Go to RStudio.org, look for the download for your platform (Windows, Mac, or whatever), and download it. The Windows version will probably have to be unzipped and executed; the Mac version will come as a DMG file that has to be installed. But be assured that in just a few minutes (I did it while waiting to board a plane) you will have the system up and running.

When you open RStudio, you are presented with four panes.

  1. The upper left pane is where you do most of the work; it is here that you can type sets of commands, run them, and see what happens.
  2. You can enter single commands in the lower left pane, which will execute when you hit return; however, the more valuable role of that pane is to display the numerical results that arise from the command blocks you run in the upper left pane.
  3. The upper right pane I think of as a bookkeeping area. It provides a list of all of the objects that are available in memory (and thus can be used in the current session), giving each object’s name as well as its nature (vector, matrix, list, data frame, and many others).
  4. The lower right pane is where the graphs appear, and this is the fun stuff. R is exceptionally good at providing visualizations of data in ways that can be readily interpreted; much of what we do will focus on interpretation of this output.

Creating a “Markdown” file for Program Execution

This is the coolest feature of RStudio. The best place to start is with the video below, in which Roger Peng of Johns Hopkins runs you through the basics of what we’ll be doing.

Suppose I’ve written a file that includes some text (such as this) as well as “chunks” of code that can be executed. The code is separated by the following delimiters:

```{r}
#Code goes here
```

When the HTML is rendered, or knitted, you will see the actual code in a shaded box, followed by the result of that code’s execution in an unshaded one. Everything outside the chunk will be treated as regular text and formatted accordingly.

Publishing as HTML

So up to this point, we have completed a narrative describing some basics of R and RStudio, and we have embedded a few chunks of code to illustrate some points. What we would now like to do is publish our writing in a format that is broadly accessible to others. To that end, RStudio has the package “knitr” built in, which will render a markdown document (such as this) in standard html, accessible from any browser. Doing so is simple - either click the “Knit HTML” button at the top of the window, or press Shift-Command-H. If you haven’t done so already, you will be prompted for a file name for saving the R Markdown file (it will have the suffix .Rmd added). Two additional files will then be generated - a markdown file, which can be accessed by other programs for format tweaking and so forth, and, most importantly, an html file that can then be opened in any browser.

Getting Help

The philosophy of this project is that the reader should learn by doing. Thus, we won’t spend a whole lot of time dealing with the minutiae of the R programming language; rather we will introduce particular functions and concepts in the context of actual problems. However, there are several places the user can turn for quick assistance:

R Help pages

Simply entering ?function on the command line (in the lower left pane) will bring up the standard help page for that function. While these pages are sometimes cryptic, they should include

  • A brief description of what the function does
  • A description of its syntax
  • The nature of the arguments that the function uses
  • The value that is returned
  • (usually) a few examples of its uses

In the case of packages, the easiest way to get documentation in RStudio is to select the “Packages” tab in the lower right pane of the RStudio display and click on the package of interest. That will bring up a page with links to information on that package, and most importantly to the help files for all of the functions contained within it.

Markdown syntax

Markdown is probably the easiest way to embed basic formatting information into a document in such a way that it can be rendered into a publishable format. A good general introduction to it can be found here; a description of its use in RStudio can be found here.

Summary

So at this point, where are we?

  1. We have R and RStudio up and running
  2. We have performed some basic data retrieval and made a bar plot
  3. We have rendered our narrative and code into HTML so that it can be accessed on the web.

But, you ask, what good is all of this? How do we address real questions with real data? To get there, we need to spend a bit of time looking at at least a few of R’s capabilities with respect to data manipulation.


Other Software

Below you will find a brief introduction to some of the analytical methods of modern population genetics. The references included are ones you should review carefully prior to starting any data analysis that incorporates them. Note that these are typically not the documentation for the programs themselves (which is provided elsewhere); rather, they are research papers that describe the methods and assumptions that underlie the particular implementations.

ms

While there are many more sophisticated simulators of the neutral coalescent, ms remains a stalwart, and it is what we will use throughout the course. For a very nice introduction to coalescent theory, see Hudson (1990). The actual program ms is described by Hudson (2002).

Bayesian Phylogenetic Inference with BEAST

There is so much to BEAST (Bayesian Evolutionary Analysis by Simulating Trees) that it is impossible to summarize its capabilities in a couple of paragraphs. For a broad overview, the recent book by Drummond and Bouckaert (2015) is the place to start - it combines a reasonably readable introduction to the theory, along with an introduction to some specific applications.

One of the temptations in a BEAST analysis is to use tip dates to calibrate the tree. Used appropriately, this can be very powerful; however, Rieux and Balloux (2016) provide essential background, as well as some suggestions as to how to test whether such an approach is valid with a given data set.

Structure

The original method was described by Pritchard et al. (2000); however, at that point the methodology did not incorporate the possibility of admixture into the algorithm. This was subsequently added (Falush et al. 2003; Falush et al. 2007) and is now a routine part of STRUCTURE analyses.

Approximate Bayesian Computation

Before delving into this area, you should read Csillery et al. (2010), which provides a great overview of the method and its potential. Remember that the key purpose of ABC is to compare different evolutionary models by simulating a set of summary statistics under each and then determining in which posterior distribution those statistics calculated from observed data are most likely to fall.
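The core of the rejection flavor of ABC can be sketched in a few lines (a deliberately toy example, estimating a single coin-flip probability rather than comparing evolutionary models): draw parameters from the prior, simulate the summary statistic under each draw, and keep only the draws whose simulated statistic lands close to the observed value.

```r
# Toy rejection ABC: infer a coin's heads probability from one observation
# of 62 heads in 100 flips.
set.seed(1)
obs_heads <- 62
p_prior   <- runif(50000)                     # uniform prior on p
sim_heads <- rbinom(50000, 100, p_prior)      # simulated statistic per draw
keep      <- abs(sim_heads - obs_heads) <= 2  # rejection step (tolerance of 2)
posterior <- p_prior[keep]

mean(posterior)  # close to the observed proportion, 0.62
hist(posterior, xlab = "p", main = "Approximate posterior")
```

The same skeleton extends to model comparison: simulate the statistics under each candidate model and compare how often each model's draws are accepted, which is essentially the model-choice logic reviewed by Csillery et al. (2010).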


Bibliographic Essay

Historical

It is hard to believe that, prior to about 1920, there were many who considered Mendelian genetics to be incompatible with Darwinian evolution. It was the seminal work of, among others, R. A. Fisher, J. B. S. Haldane, and Sewall Wright that led to their reconciliation and what is now known as the “Modern Synthesis” of evolutionary biology. The late historian, biologist, and iconoclast Will Provine wrote of the early days of that period of resolution; the book that resulted from his Master’s thesis (Provine 2001) is well worth reading. Provine also wrote a seminal biography of Sewall Wright, one of the greatest geneticists of the twentieth century (Provine 1986).

Two of the most important academic lineages were those of Theodosius Dobzhansky and James Crow. Dobzhansky and his colleagues focused more on empirical population genetics, while Crow and his students (particularly Motoo Kimura) made seminal contributions to the theoretical underpinnings of the field. Two good places to start are with Dobzhansky’s Genetics of the Evolutionary Process (Dobzhansky 1970) and Kimura’s The Neutral Theory of Molecular Evolution (Kimura 1985). For those who wish to delve deeper into their original work, collected volumes of Dobzhansky’s papers (Lewontin et al. 1981) and Kimura’s (Kimura 1995) are well worth perusal.

Texts

There are basically two types of population genetics texts - the comprehensive and the overview. The classic in the former category is Crow and Kimura (1964); there are those who argue that anyone who wants to acquire a deep and basic understanding of the subject should start by digesting it in its entirety. It is, however, more than fifty years old, and much has happened since then. For more modern takes, see Hamilton (2009), Charlesworth and Charlesworth, or Hedrick.

The shorter versions tend to be much more accessible; however, that can be both a blessing and a curse, as they often overdo the compression of difficult concepts. The classic in this genre is Mettler and Gregg; three others worthy of note are Crow (1986), Gillespie, and Hartl. The most recent of these books, Nielsen and Slatkin (2013), is one that we will refer to regularly as we progress.

Finally, Joe Felsenstein has produced a comprehensive online treatment of theoretical population genetics (Felsenstein 2015). It is free of charge and can be downloaded as a pdf file.

Collections


References

Chen W.-C., 2011 Overlapping Codon Model, Phylogenetic Clustering, and Alternative Partial Expectation Conditional Maximization Algorithm.

Crow J. F., Kimura M., 1964 An Introduction to Population Genetics Theory. Blackburn Press.

Crow J. F., 1986 Basic Concepts in Population, Quantitative, and Evolutionary Genetics. W. H. Freeman & Co.

Csillery K., Blum M. G. B., Gaggiotti O. E., Francois O., 2010 Approximate Bayesian Computation (ABC) in practice. Trends in Ecology and Evolution 25: 410–418.

Dobzhansky T., 1970 Genetics of the Evolutionary Process. Columbia University Press.

Drummond A. J., Bouckaert R. R., 2015 Bayesian Evolutionary Analysis with BEAST. Cambridge University Press.

Falush D., Stephens M., Pritchard J. K., 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–87.

Falush D., Stephens M., Pritchard J. K., 2007 Inference of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes 7: 574–578.

Felsenstein J., 2015 Theoretical Population Genetics. 509 pp.

Gautier M., Vitalis R., 2012 rehh: an R package to detect footprints of selection in genome-wide SNP data from haplotype structure. Bioinformatics 28: 1176–1177.

Hamilton M., 2009 Population Genetics. Wiley-Blackwell.

Hudson R. R., 1990 Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology 7: 1–44.

Hudson R. R., 2002 Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics Applications Note 18: 337–338.

Kimura M., 1985 The Neutral Theory of Molecular Evolution. Cambridge University Press.

Kimura M., 1995 Population Genetics, Molecular Evolution, and the Neutral Theory: Selected Papers. University Of Chicago Press.

Kingman J., 1982 On the Genealogy of Large Populations. Journal of Applied Probability 19: 27–43.

Lewontin R. C., 1974 The Genetic Basis of Evolutionary Change. Columbia University Press.

Lewontin R. C., Moore J., Provine W. B., Wallace B., 1981 Dobzhansky’s Genetics of Natural Populations I–XLIII. Columbia University Press.

Nielsen R., Slatkin M., 2013 An Introduction to Population Genetics: Theory and Applications. Macmillan Education.

Paradis E., Claude J., Strimmer K., 2004 APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289–290.

Paradis E., 2010 pegas: an R package for population genetics with an integrated-modular approach. Bioinformatics 26: 419–420.

Pritchard J. K., Stephens M., Donnelly P., 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Provine W. B., 1986 Sewall Wright and Evolutionary Biology (Science and Its Conceptual Foundations). University of Chicago Press.

Provine W. B., 2001 The Origins of Theoretical Population Genetics. University of Chicago Press.

R Core Team, 2015 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Rieux A., Balloux F., 2016 Inferences from tip-calibrated phylogenies: a review and a practical guide. Molecular Ecology 25: 1911–1924.