The Basics

Of all of the assumptions of Hardy-Weinberg, the one that is always violated is that of infinite population size. While many populations are so large that we needn’t be concerned about it too much, in fact many others are sufficiently small that we need to explore this further. And in fact, we shall see that that population size, in concert with mutation rate, are two of the most critical parameters in determining the genetic structure of real populations

What does it mean to be homozygous?

Before getting into the heart of the matter, we’re going to take a short detour and think about homozygosity. How does it arise? We can think of two possibilities

  1. They are identical in state - are same in sequence or whatever, regardless of ancestry
  2. They are identical by descent - identical by virtue of having descended from a common ancestor.

We need to think a bit about the difference here. If, for example, we assume that a particular allele can only arise once (by a single mutation), then the distinction goes away - if we know enough about ancestry we will find the last common one. On the other hand, if particular mutations can arise multiple times, then the second possibility comes into the picture. For now, however (and for the next few weeks), we are going to assume that identity in state does in fact imply identity by descent - that is, every new mutation is unique, and thus identical alleles are identical by descent.

Departure from Hardy Weinberg

Let’s look at some real data. In the human data we looked previously, we saw that departures from Hardy Weinberg were minimal. Let’s now look at another example, some microsatellite data from the legume Medicago trunculata (Ronfort et al. 2006). The data are included in TPG; we will se them to do the following:

  1. We look for departures from Hardy Weinberg
  2. We will examine some sample loci to see what’s going on.

So first we can look at the data and see what it consists of

data(Medicago)
Medicago
## Allelic data frame: 346 individuals
##                     13 loci
##                     1 additional variable

Next, we can look at the thirteen loci individually and determine whether they conform to Hardy-Weinberg expectations:

hw.test(Medicago)
##                chi^2   df Pr(chi^2 >) Pr.exact
## FMT08      7256.3008  253           0        0
## FMT11      5379.8523  171           0        0
## MTPG85C    4819.3046  120           0        0
## FMT90      6862.7288  231           0        0
## FMT07      9817.4937  496           0        0
## MAL367466  1865.1465   21           0        0
## FMT50      8175.9001  406           0        0
## ENPB1      4607.3651  190           0        0
## MTIC37     4053.5620  120           0        0
## MTIC59     5818.6628  325           0        0
## FMTBN1    15083.5677 1378           0        0
## MTIC243     916.8731   10           0        0
## MTIC126     330.0008    3           0        0

And we see that, in fact all loci are wildly out of equilibrium. But let’s look at the data for one locus:

med.sum <-summary(Medicago)
med.sum$MTIC243
## $genotype
## 01/01 02/02 02/03 02/04 04/04 05/05 
##     6   292     1     2     9    17 
## 
## $allele
##  01  02  03  04  05 
##  12 587   1  20  34

And what we see is

  1. We are dealing with multiple alleles
  2. Many of the genotype numbers are small, making our χ2 numbers a bit suspect.
  3. But here’s the important part - if we visually inspect the numbers above, we see that there are a total of 3 heterozygotes out of the total sample of 346 individuals, so our observed heterozygosity is 0.009. In contrast, remember that

\(H_{exp}=1-\sum{p_i^2}\)

Which can be calculated for this locus by the following

hexp <- heterozygosity(med.sum$MTIC243$allele)
hexp
## [1] 0.1907124

and of course

Hobs <-3/346
Fi <-1-Hobs/hexp
Fi
## [1] 0.9545361

So quite obviously, for this locus (and, as it turns out for all of the others as well), the departure from Hardy-Weinberg is the result of a severe reduction in heterozygosity. So what gives?

The Sampling Effect of reproduction

We are all familiar with sampling, in the context of collecting data from a sample of a larger population and using those data to estimate parameters for the population as a whole. It is important to understand, however, that in population genetics, there is also the biological sampling that takes place during reproduction. That is, in a finite population, while an essentially infinite number of gametes can be produced (and for a ballelic locus, allele frequencies of p and q), only a limited number of those gametes actually give rise to the next generation. In a diploid population with constant size N, that number, of course, will be 2N. This model is what is known as a Wright-Fisher Population. And such a population, the parameters about which we’ve been talking - notably allele frequencies and linkage disequilibrium, will be subject to the sampling “error” that results (we’ve already seen this in simulated data). So the question we will be dealing with is how that biological sampling process, a random one, affects the genetics of the evolutionary process.

Nonrandom Mating - an Extreme Case

Suppose we set up the following experiment. We start with a population of heterozygotes (so p=q=.5). Each generation, after gametes are produced, we then set up a large number of populations, each founded by one individual with two gametes sampled from that pool. We then repeat the process for subsequent generations and look at the number of populations that have 0, 1, or 2 a alleles. We can model that as follows:

par(mfrow=c(5,1),mar=c(1,5,1,1))
g <-c(0,1,0)

for(i in 1:5){
t <-rbind(c(1,0,0),c(.25,.5,.25),c(0,0,1))
g <-g%*%t
cols <-heat.colors(5)



barplot(g,names.arg =c("0","1","2"), xlab="# a alleles",ylab="Frequency", col=cols[i])

}

So what we are seeing is an extreme form of genetic drift, the result of which is a set of populations where a is either lost or fixed. Note, however, that if we were to look at the frequency of the alleles across all populations combined, they would not have changed - p = q = 0.5. But in such a combined population, of course, the number of heterozygotes would be very low. So what we see are the following:

  1. An increase in the variance of allele frequencies across populations
  2. A decline in overall heterozygosity
  3. Random fixation or loss of alleles.

A More realistic Example

Now, suppose we start with a set of replicate populations, each with 16 individuals that are all heterozygotes for some gene. What do we expect to happen? Below are the results of simulating that experiment (the rather crude R code used is hidden)

And what we see is that while initially all populations were heterozygous, by the end, all are fixed for one of the two alleles. So does this happen in real life? Buri (1956) performed just this experiment, using populations of Drosophila melanogaster that were initially heterozygous for two alleles of the bw locus, one that affects eye color. His results are shown below

Buri

Buri

So we see that in the laboratory, we get results very similar to those we predicted. So what has happened? As in the case of our initial hypothetical example, we see an increase in allele frequency variance among replicate populations, an overall decline in heterozygosity, and random fixation and loss of alleles.

Tracking allele frequencies over time

Thus far, we have only been looking at genotype distributions. For example, in the Buri experiment, we did not delve into the dynamics of change in individual populations - we only pointed out that we went from populations consisting of heterozygotes (p=.5) to ones in which one allele or the other is fixed (p=0 or 1) Now we want to look at that process.

To do so, we can use the driftPlot function in the TeachingPopGen function. It uses as input a vector of population sizes, the length of which is the number of generations to be simulated. The number of simulations can also be set with the parameter nreps=n; the default value of n (used if none is specified) is 10.

So, for example, let’s look at 100 generations of allele frequency change in a population of N=100,000, where p = q = .5

par(mfrow=c(1,1)) #set a single graphics pane
set.seed(3454) #set a random number generator seed (not strictly necessary)
ngen <-rep(100000,100) # create the vector of population sizes
sim1.out <-driftPlot(ngen) #run the simulation

head(sim1.out)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
## [1,] 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000
## [2,] 0.499345 0.500545 0.500465 0.500560 0.500555 0.500220 0.499690
## [3,] 0.501335 0.501915 0.499575 0.499910 0.501075 0.500210 0.500375
## [4,] 0.501360 0.501255 0.500835 0.500240 0.502990 0.498485 0.501865
## [5,] 0.501040 0.501925 0.500680 0.500485 0.503710 0.500795 0.500470
## [6,] 0.502435 0.501815 0.500200 0.501675 0.502940 0.499065 0.500290
##          [,8]     [,9]    [,10]
## [1,] 0.500000 0.500000 0.500000
## [2,] 0.500135 0.499310 0.498905
## [3,] 0.500745 0.499335 0.499165
## [4,] 0.500430 0.498845 0.498060
## [5,] 0.500215 0.500330 0.497935
## [6,] 0.501395 0.500225 0.495960

This gives us a plot that shows very little change. In addition to the plot, however, the function returns a matrix, in which the rows are generations and the columns are simulations. This will become valuable in a little bit.

Now let’s look at N=1000

sim2.out <-driftPlot(rep(1000,100))

And we see that there is more scatter in the trajectories of the replicate populations. Now let’s run it one more time with N=100 and analyze the results more closely

set.seed(9987)
sim3.out <-driftPlot(rep(100,100))

One thing should immediately jump out - we have populations that have gone to either fixation or loss, and the variance among replicates has grown large. What about the mean value of p for all populations? Here is where we use the matrix output of the drifPlot function. In the plot statement below, we use the apply function, which works as follows

  1. Starts with a matrix or data frame
  2. based on the parameter provided, looks at rows(1) or columns (2)
  3. Applies a function (mean in this case) to the selected dimension.
plot(apply(sim3.out,1,mean),type="l", xlab="Generation", ylab="p", main="Mean Allele Frequency", ylim=c(0,1))

We see vary little change. But what about the variance?

# par(mfrow=c(1,2))
plot(apply(sim3.out,1,var),main="Variance Among Replicates",xlab="Generation",ylab="Variance")

And we see that it increases, more or less monotonically, with time. Now finally, what happens to expected heterozygosity (2pq)? Here, in the apply function, we create our own function, one which takes each element of a row, calculates 2pq, and then averages over all replicates:

Fhat <-apply(sim3.out,1,function(p) mean(2*p*(1-p)))
plot(Fhat, main="Mean Heterozygosity among Replicates",xlab="Generation",ylab="Heterozygosity")

It declines monotonically

These two observations lead to concept of “Effective Population size” , or Ne, which is defined as the size of a Wright-Fisher population that would exhibit the same amount of either increase in variance or level of genetic drift as observed in an actual population.

Summary

What we have done is to develop an understanding of the basics of genetic drift, a process that all finite populations undergo. In so doing, we have reached a definition of effective population size, one of the most important parameters in population genetics. However, as is so often the case, new questions arise, in particular

  1. Our definition is based on what we saw by simulating allele frequency dynamics in replicate populations. How can use real data gathered from a sample of a single population, to arrive at an estimate of Ne?
  2. We’ve seen now, both in the Buri experiment and in our last simulation, that the long-term effect of genetic drift is fixation or loss of an allele, meaning a net loss of genetic variation with time. However, we also know that mutation is acting in the opposite fashion to introduce new variation. How do these two stochastic processes interact?

These questions will occupy much of our attention in the next several weeks.

More on Effective Population Size

Let’s start by reminding ourselves where we were. We saw that, in finite populations, which otherwise fulfill Hardy Weinberg conditions, variance in allele frequencies among populations tends to increase, while heterozygosity decreases. For convenience sake, we will redo some the last few plots from that exercise

library(TeachingPopGen)
set.seed(5678)
n100 <-driftPlot(rep(100,100)) #run simulations, plot them, and return the results as a list

Now what we want to do is to focus on the decline in heterozygosity and look at this further, so let’s plot it

het100 <-2*n100*(1-n100) #calculate all heterozygosities
Fhat <-apply(het100,1,mean) #determine per generation mean
plot(Fhat, main="Mean Heterozygosity within Replicates",type="l",xlab="Generation",ylab="Mean 2pq")

Analytical solution to decay of Heterozygosity.

For a moment, let’s get away from our simulations and look for an anlytical solution to the question. Let’s assume our population consists of individuals who carry two alleles that are identical by descent (ibd; homozygous) and ones who are not (heterozygotes), and the frequencies of the two are respectively F and 1-F.

fexp <-function(Nm){
  ngen <-c(1:100)
Ht <-(1-(1/(2*Nm)))^ngen*.5
Ht
  }
plot(Fhat, main="Mean Heterozygosity within Replicates",type="l",ylab="Heterozygosity")
lines(fexp(100),col="red")

And we see that we get a pretty good match between our simulation and the theoretical expectation.

Population bottlenecks

Thus far, we have modeled a constant population size (N=100). Suppose there had been a bottleneck in the past:

Generation size
1-40 100
41-50 10
51-100 100

It is fairly straightforward to run our simulation again, only now specifying these numbers for the population sizes over the 100 generations modeled:

set.seed(3454)
N <-c(rep(100,40),rep(10,10), rep(100,50))
nbot <-driftPlot(N,10)

And as we did before, we can look at the decline in heterozygosity:

Fhat <-apply(nbot,1,function(x) mean(2*x*(1-x)))
plot(Fhat, main="Mean Heterozygosity within Replicates",type="l",ylab="Heterozygosity")

and we see a drop in heterozygosity concomitant with the bottleneck. So what is the effective size of this population? If we start with the current census count (100) and look at what we get

Fhat <-apply(nbot,1,function(x) mean(2*x*(1-x)))
f100 <-fexp(100)
plot(Fhat, main="Mean Heterozygosity within Replicates",type="l")
lines(f100,col="red")

The fit is not very good. What if we knew the population history and used the mean?

nmean <-mean(N)
nmean
## [1] 91
fmean <-fexp(nmean)
plot(Fhat, main="Mean Heterozygosity within Replicates",type="l")
lines(fmean,col="darkgreen")

We still don’t have a particularly good fit. But suppose we use the harmonic mean, which is mathematically more sensitive to low values:

Initially, we see that the fit is not great. However, suppose we use the harmonic mean of N instead of the current census count

nharm <-1/mean(1/N)
nharm
## [1] 52.63158

And plot the expected trajectory of F

fharm <-fexp(nharm)
plot(Fhat, main="Mean Heterozygosity within Replicates",type="l")
lines (fharm, col="blue")

And we see a much better fit to the data. And there is a logical biological reason for this. When a bottleneck occurs, genetic variation is lost, and in particular low frequency alleles may be eliminated. Thus, absent new mutation (which we can ignore over the time period we are examining here), that loss of variation persists. Thus, those time periods in which population numbers are low make have a disproportianely large impact on heterozygosity and thus Ne. The same is true of the harmonic mean of a series of numbers - the reciprocal of a small number is large, and it is those reciprocals that are being averaged.

Summary

At this point, the customary approach to understanding effective population size is to do similar theoretical calculations based on unequal sex ratio, population size variation, etc. However, we are still not where we want to be, which is to be able to take a single population sample and use it to infer Ne. Also, all of our analyses have been based on the assumption of a diploid population. For a haploid system such as mitochondrial DNA, Ne is reduced by 1/2, and in the case of uniparental inheritance (as is the case for bothe mitochondria and chloroplasts) it is reduced by 1/2 again.

So overall, then, we’ve been able to develop some theory about genetic drift, effective population size, and mutation. What we now need to work towards are ways that we can use that theory to make inferences about real populations.

Combining Drift and Mutation

Gaving looked at drift and mutation independently, we now need to examine them jointly. In doing so, we want to bear in mind a few things:

  1. In any finite population, the fate of a particular allele is ultimately fixation or loss.
  2. The fate of most new mutations is also loss, however if it does persist, then it becomes subject to drift dynamics.
  3. The results of these processes is what we observe when we assess genetic variation in a population sample.

At this point, it is important to distinguish two approaches that have been used in population genetics. The first (and what we have been using thus far) can be thought of as the “look forward” approach, that is, we ask “given a set of assumptions, what do we predict will happen?” The second, the “look back approach” asks “Given what we observed, what can we infer about the processes that have taken place?” Much of the power of modern population genetics comes from the latter. However, it is instructive to at least briefly consider the former, as it served as the basis for much of the “look back” theory, usually referred to as “coalescence”.

The infinite alleles model

The theoretical structure we are going to describe long predates the era of DNA sequencing - much of the groundwork was laid in the 1960’s through 1980’s by Kimura, Crow, Ohta and others. Much of the theory was developed based on the “infinite alleles” model, which, simply stated, assumed that every new mutation is unique, so that an individual who is homozygous in fact must possess two alleles that are identical by descent. New mutations happen at some frequency µ, and for the most part, reverse mutation is ignored.

Note that the model was first developed during the isozyme era, when all we could determine, based on electrophoretic comparison of allozymes, was that two alleles were the same or different. There was no way of determining the molecular basis of observed differences, so that in fact there was no way to identify how many mutational steps led to those differences. The ability to infer that came later, with the advent of DNA sequence comparison as a means of quantitating genetic diversity.

Probability of identity by descent

So given this model, P{ibd}, which we will call Ft, becomes a critical parameter. Because identity in state implies identity by descent, Ft is best thought of as the homozygosity (∑Pi2 ) of the population. Given a population with some level of homozygosity Ft, we consider first what we would predict F(t+1) to be. This is fairly simple:

\(F_{t+1}=Pr\{two\,identical\,alleles\}+Pr\{two\,different alleles\}*F_{t}\)

\(=\frac{1}{2N}+(1-\frac{1}{2N})*F_{t}\)

Now, suppose we add mutation into the equation. In the case of the 1/2N individuals who received two copies of the same allele, they will be homozygous only if neither allele underwent mutation; the probability of that occurring is (1-µ)(1-µ) or (1-µ)2. In the case of the others, they will be homozygous only if the two alleles they possess happen to be identical by descent and neither have undergone mutation; again, the latter probability is simply (1-µ)2.

So we can now add modify our equation above to be

\(F_{t+1} = \frac{1}{2N}(1-\mu)^2+(1-\frac{1}{1-2N})F_{t}(1-\mu)^2\)

And finally, we can ask what happens at equilibrium, when F(t+1) = F(t). We needn’t go through the algebra, but if we do, and assume all terms containing µ2 and u/N ≈ 0, then we come to the following:

\(\hat{F} = \frac{1}{1+4N\mu}\)

To be more precise, however, we need to remember from our look at drift that in real populations, the predictor of heterozygosity dynamics is not N, but rather the effective population size Ne, so we need to rewrite this as

\(\hat{F} = \frac{1}{1+4N_{e}\mu}\)

The term 4Neµ is an incredibly important one as we will see - in fact it has earned its own greek letter designation - θ.

What does this mean?

We need to explore some properties of this relationship. First and foremost, it involves the product of two quantities, Ne and µ, that are extremely difficult to determine empirically. Nevertheless, it does allow us to make some predictions about things like heterozygosity and allele frequency distributions in finite populations in which the infinite allele model holds. We will consider those below:

Expected homozygosity and heterozygosity.

In looking at our equation, we see that in fact F is inversely proportional to θ. Remembering also that F is simply expected homozygosity, (1-F) is expected heterozygosity. We can plot these very quickly and see what we see.

th <-seq(0,10,.01) #create a vector of values of theta
f <-1/(1+th) #calculate f over that range
plot(th,f, type="l", xlab="Theta",ylab="Proportion",main="Homozygosity and Heterozygosity as Functions of Theta")
lines(th,1-f, lty=2)

So we see, not surprisingly, that as θ increases (that is, mutation rate and/or population size increase), so does heterozygosity.

Estimating theta from real data

For the following, we will use some allozyme frequency data for the Adk-1 locus in Drosophila willistoni (Ayala and Tracey 1974; cited in Hartl and Clark 2006).

freqs <-c(.574,.309,.114,.003) # frequencies of four alleles
f.adk <-sum(freqs^2)  #calculate homozygosity.
th.adk <-(1/f.adk-1)
th.adk
## [1] 1.283303

In fact, this value is higher than most reported for similar data in other species of Drosophila, but how do we interpret it? Given the nature of theta, and the fact that this is electrophoretic data, this is a problem.

What about expected allele frequencies?

One of the remarkable features of what we have developed is that we are saying nothing about allele frequencies. We used them to calculate F, that’s true, but we subsumed them into a single measure - homozygosity - and have not incorporated their actual frequency spectrum. As an illustration, we could imagine that in another survey, such as the one described above, all four allele frequencies were equal:

freq.eq <-rep(.25,4)
barplot(rbind(freqs,freq.eq),beside=TRUE)

f.eq <-sum(freq.eq^2)
f.eq; f.adk
## [1] 0.25
## [1] 0.437962

And of course we can also calculate theta for the equal frequency case:

th.eq <-1/f.eq-1
th.eq
## [1] 3

We do see that homozygosity is lower and θ higher, but which is more consistent with the infinite alleles model? To do that, we need to determine the expected distribution of allele frequencies and then compare our experimental results to them. This is where the work we did earlier with simulating distributions comes into play.

The Ewens Formula and Sampling distribution.

So we want to ask the following question: “Given a sample of size n with k different alleles, how do the frequencies of those alleles compare to those expected in an equilibrium population under the infinite sites model”? In a seminal paper, Ewens (1972) addressed this issue. First, he showed that the expected number of alleles (k) was the following:

\(E(k)=\int_{0}^{1} x^{-1}(1-x)^\theta\{1-(1-x)^{2n}\}dx\)

\(=\frac{\theta}{\theta}+\frac{\theta}{\theta+1} + . . . + \frac{\theta}{\theta+2n-1}\)

The remarkable part of this equation is that k is simply a function of θ and the sample size (n). We saw how we could calculate θ from our data; it would now be possible to establish the expected number of alleles and sample size, given varying values of theta. But as is the case with any expectation, we need to examine its distribution. Here, Watterson (1977) provided an approach. Put simply, he suggested that we could do the following:

  1. Calculate Ft, expected homozygosity or ∑pi2, in the normal way from our experimental data. Note that under the infinite alleles model, homozygosity implies identity by descent, so for this case, in fact this is our old familiar value of F.
  2. Based on n and k (the number of samples and the number of alleles in our data) simulate the expected distribution of Ft (just as we did earlier for the binary and normal distributions)
  3. Compare our observed value to that distribution and ask if the deviation from expectation is significant at the .05 level.

The problem at the time (1980 or thereabouts) was that in all of this, we are assuming that we can detect every allele, and with electrophoresis as our tool, that is not possible. However, Jerry Coyne in Richard Lewontin’s lab developed a technique known as sequential electrophoresis, that came close to meeting this assumption. Keith et al (1985) applied the technique to the Xdh locus in 89 homozygous lines of Drosophila pseudoobscura and obtained the following results:

Xdh <- c(52,9,8,4,4,2,2,1,1,1,1,1,1,1,1) # vector of observed allele numbers
k =length(Xdh) # number of alleles = k
n = sum(Xdh) #number of samples = n

There are thus 15 alleles in a sample of 89 total alleles. We can quickly visualize their distribution (or allele frequency spectrum):

barplot(Xdh/n, xlab="Allele",ylab="Frequency", main="Observed Distribution of Xdh Alleles",names.arg=c(1:k))

Now, per Watterson, we can calculate Ft from the data and address the following:

  1. For 1000 simulated equilibrium populations, how is F distributed?
  2. How does this compare to observed F in actual population?

First, we will calculate Ft from the data:

Fx <-fhat(Xdh)
Fx
## [1] 0.3657366

Now, we can look at how our observed Ft compares to the distribution of 1000 simulated values, providing the function Ewens with n, k, and our expected homozygosity:

Ewens(n,k,Fx)
## [1] 0.1710547
##      2.5%     97.5% 
## 0.1081934 0.3152885
freq.even <-rep(k/n,k)
Fx.even <- fhat(freq.even)
abline(v=Fx.even,col="blue")

And we see that indeed our observed value of F is significantly different from that expected based on the infinite sites model. So what is going on?

To address that, let’s look at what happens when, for the same number of alleles, homozygosity is minimized. Remember that that is when all of the allele frequencies are equal. In the plot above, that is shown by the blue line. We see that in fact our actual data are at the opposite end of the spectrum, suggesting the overabundance of one or more alleles relative to expectation. Keith et al. interpreted this as evidence for purifying selection, but there are questions that can be raised:

  1. Expected distribution is typically J shaped, which is what was observed in these data.
  2. Is it an equilibrium distribution? Based on one sample, we have no way of knowing
  3. Are all alleles detected? In this case, it may be true with respect to amino acids, but it is not with respect to nucleotides - silent substitutions cannot be detected by protein electrophoresis.

Another application - The Lactase gene in humans

As one final example, let’s use a data set that is exhaustive with respect to allele detection and where we have a priori expectations with respect to evolutionary history. Lactase persistence, the ability to digest lactose post-weaning, is well recognized as a gene that has been subject to selection in human populations in which dairy is part of the diet. Thus, we can compare dairy-adapted populations with those in which it hasn’t occurred and ask the same question - is the distribution of allele frequencies in these populations consistent with the neutral infinite alleles model, and if not, are departures consistent with our expectations based on known histories of selection?

To address this, we will look at data from HapMap - SNP’s in a 50,000 kb region including the lct gene. We will use data from two dairy-adapted populations (CEU and MKK) and one non-adapted one (YRI). In this analysis, we are going to make the simplifying assumption that recombination is not a factor - clearly this is not strictly true, but given the small size of the region in question, its contribution to patterns of variation is likely small.

First we will read the data and, for processing ease, combine them into a list. Note that the data have been previously processed

data(Lactase) #Loads the data as a list, with one element per population

Next, let’s visualize the allele frequency distributions for the three populations

dat.all=Lactase
par(mfrow=c(3,1))
dist.plot(dat.all[[1]],t="YRI")
## [1] 19
## [1] 230
dist.plot(dat.all[[2]],t="MKK")
## [1] 21
## [1] 285
dist.plot(dat.all[[3]],t="CEU")

## [1] 10
## [1] 234

By themselves these are suggestive, but not definitive. The two dairy-adapted populations, MKK and CEU, have one haplotype that predominates, while the distribution in the YRI population is more even. Note also that both African populations have more haplotypes. So let’s apply the Ewens/Watterson test to these:

set.seed(4348)
out <-sapply(dat.all,Ewens.hap)
## [1] 0.1511531
## [1] 0.1727533
##      2.5%     97.5% 
## 0.1011626 0.3318091

## [1] 0.4900585
## [1] 0.1597286
##       2.5%      97.5% 
## 0.09411819 0.29784734

## [1] 0.6126452
## [1] 0.3193167
##      2.5%     97.5% 
## 0.1720779 0.6157773

The order of the plots is the same as in the previous figure, and we see that indeed, both of the dairy-adapted populations have values of F that are significantly different from expectation, and that difference is in the positive direction, implying over-representation of one allele. These results are entirely consistent with the hypothesis that in the CEU and MKK populations, one allele has been the target of positive selection.

Conclusions

What we’ve been doing here is laying groundwork for what follows. As noted at the outset, all of our logic has been based on the “look forward” approach. We have used a combination of theory and simulations to generate predictions regarding expectations under a particular model, and then have compared those expectations with observations from real populations. In so doing, we have seen the importance of identity by descent as an analytical statistic for finite populations and introduced the neutral parameter θ, one that we will use extensively in the future.

But it bears repeating that most of the analytical approaches we have used so far were developed prior to the advent of population-level DNA sequencing. In essence, as in the lactase case we are still treating alleles as discrete units. In so doing, we are ignoring much of the information that is there, in particular, given two sequences, how many differences are there between them? What sites vary and what sites don’t? As we shall see, once we take those factors into consideration, we can move beyond the “what-ifs” of the look forward approach and instead ask “what did” by looking backwards in time from the data at hand.

References

Ayala F. J., Tracey M. L., 1974 Genetic differentiation within and between species of the Drosophila willistoni group. Proceedings of the National Academy of Sciences of the United States of America 71: 999–1003.

Buri P., 1956 Gene Frequency in Small Populations of Mutant Drosophila. Evolution 10: 367–402.

Ewens W. J., 1972 The sampling theory of selectively neutral alleles. Theoretical Population Biology 3: 87–112.

Hartl D. L., Clark A. G., 2006 Principles of Population Genetics, Fourth Edition. Sinauer Associates, Inc.

Keith T. P., Brooks L. D., Lewontin R. C., Martinez-Cruzado J. C., Rigby D. L., 1985 Nearly identical distributions of Xanthine dehydrogenase in two populations of Drosophila pseudoobscura. Molecular Biology and Evolution 2: 206–216.

Ronfort J., Bataillon T., Santoni S., Delalande M., David J. L., Prosperi J.-M., 2006 Microsatellite diversity and broad scale geographic structure in a model legume: building a set of nested core collection for studying naturally occurring variation in Medicago truncatula. BMC Plant Biology 6: 28.

Watterson G. A., 1977 Heterosis or neutrality? Genetics 85: 789–814.