What is it?

The starting point for population genetics is the Hardy-Weinberg equilibrium. It is the basis on which everything else is built, so a thorough understanding of what it says and what it assumes is critical. In this unit, we will briefly go through the basics (the familiar p and q approach). We will then examine it in more detail.

A Sample Problem

So let's approach this as a problem in data analysis. The following data were part of the opening day "quiz" - genotype counts for Single Nucleotide Polymorphism (SNP) rs3739020 from the 1000 Genomes Project.

Genotype Number
G G 6
G T 28
T T 66

We want to ask two questions:

  1. What are the allele frequencies (i.e. what fraction of the alleles are G and what fraction are T)?
  2. Most importantly, if these individuals were to mate at random, what would be the allele frequencies in the F1 generation?

Determining Allele Frequencies

So for question 1, let's proceed as follows. First of all, we will create a vector of the observed data.

obs <- c(6,28,66)

Note that c() is a function that concatenates a number of values, in this case integers, in order to create a vector.

Now, let’s ask how many G’s there are. Recognizing that every GG homozygote has two and every GT heterozygote has one, we can calculate that by

nA <-2*obs[1]+obs[2]
nA
## [1] 40

And to make that a frequency, we need to divide it by the total number of alleles, which, since these are diploid individuals, is simply two times the number of individuals genotyped (N).

N <-sum(obs)
p <-nA/(2*N)
p
## [1] 0.2

Now, we could repeat that process to determine the frequency of T, but here one of the simplest but most important relationships comes into play. Note that this is a biallelic locus, so if a particular allele is not G, it has to be T. Thus, if we set the frequency of T to be q, then p+q must equal 1, or

q <-1-p
q
## [1] 0.8
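
These steps can be bundled into a small helper function for reuse. This is just a convenience sketch for this unit (freqs is our own throwaway name, not a function from any package):

freqs <- function(obs) {
  # obs is a vector of genotype counts: c(AA, Aa, aa)
  N <- sum(obs)
  p <- (2*obs[1] + obs[2])/(2*N)
  c(p=p, q=1-p)
}
freqs(c(6,28,66))
##   p   q 
## 0.2 0.8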

Random Mating

Now, suppose we turn the question around. Given p and q, as calculated above, and assuming that random mating then occurs, what genotype frequencies (and numbers) would we expect to see? There are two ways we can approach this, one fairly simple and one a bit more involved. First, the simple one. Think of the population as originating from a pool of alleles with frequencies p and q. If that is the case, then we can calculate the expected probabilities of the three genotypes (GG, GT, and TT) as

\(P\{GG\} = P\{G\} \cdot P\{G\} = p^2\)

\(P\{TT\} = P\{T\} \cdot P\{T\} = q^2\)

\(P\{GT\} = P\{G\}P\{T\} + P\{T\}P\{G\} = pq + qp = 2pq\)

We can make a vector of those frequencies as follows

fexp <-c(p^2,2*p*q,q^2)
fexp
## [1] 0.04 0.32 0.64

And finally, given those frequencies, what numbers of genotypes would we expect to see in a sample size of N? Here, R makes it very easy to multiply the frequency vector by N

Nexp <-fexp*N # multiplies each element of fexp by N
as.integer(Nexp) # get rid of decimal positions
## [1]  4 32 64

Contrast that with our observed numbers

obs
## [1]  6 28 66

And they look pretty close. Of course we need to test that statistically (which we will in a bit), but before we do so, let’s see what’s happened to p and q as a result of random mating:

p1 <-(2*Nexp[1]+Nexp[2])/(2*N)
q1 <- 1-p1
p1;q1
## [1] 0.2
## [1] 0.8

And again, contrast those values with what we determined from the original data:

p;q
## [1] 0.2
## [1] 0.8

And note: p and q have not changed.
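
In fact, we can iterate the gamete-pool logic over several generations and confirm that nothing changes (a quick sketch):

p <-0.2
for (gen in 1:5) {
  f <-c(p^2, 2*p*(1-p), (1-p)^2) # genotype frequencies after random mating
  p <-f[1] + f[2]/2 # allele frequency in the gamete pool for the next generation
}
p
## [1] 0.2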

Another Sample Problem

Below are some (completely made-up) data. Use them to

  1. Calculate p and q.
  2. Calculate expected genotype frequencies.
  3. Calculate expected genotype numbers.

Genotype Number
AA 69
Aa 101
aa 30

obs <-c(69,101,30)
p <-(2*obs[1]+obs[2])/(2*sum(obs))
p
## [1] 0.5975
q <-1-p
genos <-c(p^2,2*p*q,q^2)
genos
## [1] 0.3570063 0.4809875 0.1620062
genos.exp <-genos*sum(obs)
genos.exp
## [1] 71.40125 96.19750 32.40125

And Why is This Important?

So what have we shown? We can summarize as follows:

  1. Given a random mating population with allele frequencies p and q, the expected genotype frequencies are \(p^2\), \(2pq\), and \(q^2\).
  2. In an ideal population, allele frequencies do not change.
  3. One generation of random mating is sufficient to restore HWE at autosomal loci (X-linked loci are a bit different)
  4. And what is an "ideal" population? It is one with the following properties:
  • Infinite size
  • Random mating
  • No mutation
  • No migration
  • No selection

In other words, a population that is in Hardy-Weinberg Equilibrium is not evolving. Thus, it becomes the null hypothesis from which we work; hence a deep understanding of its properties is essential if we are to thoroughly understand the genetics of the evolutionary process.

A more detailed derivation

In the above, we simply treated mating as involving every individual contributing equally to a pool of gametes, with those gametes then being drawn two at a time to create zygotes. In fact, of course, more typically individuals mate with one another and it is their gametes that combine. Felsenstein (2015, pg. 6) works this up by first showing the expected frequencies of all possible random matings:

From this, the offspring genotype frequencies can be calculated. (Felsenstein's tables of mating frequencies and resulting offspring proportions are not reproduced here.) Inspection of the result shows that the expected genotype frequencies are indeed \(p^2\), \(2pq\), and \(q^2\).
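
A numerical version of the same argument is easy to set up. The sketch below (our own code, not Felsenstein's) starts with genotype frequencies that are not in HW proportions but that have p = 0.2, enumerates all nine ordered matings, and sums the Mendelian offspring contributions:

f <-c(0.1, 0.2, 0.7) # parental genotype frequencies (AA, Aa, aa); not HW, but p = 0.2
pA <-c(1, 0.5, 0) # probability that a gamete from each genotype carries A
off <-c(0,0,0)
for (i in 1:3) {
  for (j in 1:3) {
    a1 <-pA[i]; a2 <-pA[j] # gamete probabilities for this mating pair
    mendel <-c(a1*a2, a1*(1-a2)+(1-a1)*a2, (1-a1)*(1-a2)) # offspring AA, Aa, aa
    off <-off + f[i]*f[j]*mendel # weight by the frequency of the mating
  }
}
off
## [1] 0.04 0.32 0.64

Even though the parents were not in Hardy-Weinberg proportions, one generation of random mating yields them.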

Calculations Based on Genotype Frequencies

I personally like the above approach, in that it starts with the rawest of raw data - a count of genotypes. However, at times problems are formulated in terms of genotype frequencies rather than numbers. In those cases, the typical formula given is

\(p = f(AA) + \frac{1}{2}f(Aa)\)

But let's look at this a little more closely. Suppose we multiply the right-hand side by 2N/2N. Then we get

\(p = \frac{2N f(AA) + 2N \cdot \frac{1}{2} f(Aa)}{2N}\)

But what is this? \(N f(AA)\) is simply the number of AA genotypes counted, and \(N f(Aa)\) is the number of Aa, so this whole equation can be rewritten as

\(p = \frac{2\#(AA) + \#(Aa)}{2N}\)

Which is exactly the same formula as above.
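
A quick numerical check, reusing the first sample problem, confirms that the two formulas agree:

obs <-c(6,28,66)
f <-obs/sum(obs) # genotype frequencies
f[1] + 0.5*f[2] # frequency-based formula
## [1] 0.2
(2*obs[1] + obs[2])/(2*sum(obs)) # count-based formula
## [1] 0.2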

So to wrap it up, when you start a Hardy-Weinberg problem, the first question to ask is whether you are given genotype numbers or genotype frequencies, and go from there. What you do not want to do is to treat the numbers as frequencies or vice versa.

Exploring in More Depth

So at this point, we've seen how we can calculate allele frequencies, and we've also seen that in an ideal population those frequencies do not change - the population does not evolve. And it is worth pausing at this point to briefly summarize the properties of an ideal population:

  1. It is infinitely large
  2. Mating is random
  3. There is no mutation
  4. There is no migration
  5. There is no selection

Much of the rest of our time will be focused on understanding departures from one or more of these conditions - how do we detect them and what are their consequences? Before that, however, we need to delve into the Hardy-Weinberg Equilibrium a bit more deeply.

Testing for departures from Hardy Weinberg

This is a subject we will be exploring in more depth in a later unit, but for starters, let’s keep it simple. Assume we can observe some genotypes:

Genotype N
AA 56
Aa 78
aa 22

We can calculate the frequencies of the two alleles as we did before:

obs <-c(56,78,22)
N <-sum(obs)
p <-(2*obs[1]+obs[2])/(2*N)
q <- 1-p
p; q
## [1] 0.6089744
## [1] 0.3910256

And from those, we can calculate our expected genotype numbers

exp <-N*c(p^2,2*p*q,q^2)
exp
## [1] 57.85256 74.29487 23.85256

So now we have observed and expected numbers; it would seem a logical step to apply the chi-square test to see if the difference between the two is significant. For more details, see the unit on the chi-squared distribution; at this point, remember that what we are doing is quantifying the departure of categorical data from the expectations of a hypothesis. In our case, the categories are genotypes; the hypothesis is that the population is in Hardy-Weinberg equilibrium, and the data are the observed genotype numbers, which are compared to those expected under HWE.

We’ve done the calculations above, so χ2 can be calculated easily by

chi <-sum((obs-exp)^2/exp)
chi
## [1] 0.3879836

But how many degrees of freedom? Our rule of thumb previously was that degrees of freedom are one less than the number of classes, suggesting that it might be two in this case. However, there is another wrinkle. In this case, we used the actual data to estimate a parameter used in determining our expected values - p. That costs us another degree of freedom. So in the case of testing Hardy-Weinberg with data from a biallelic locus, we have only one degree of freedom.
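
We can check this claim by simulation. The sketch below (our own quick demonstration) draws repeated samples from a true Hardy-Weinberg population, re-estimates p from each sample, and computes the resulting χ2 values:

set.seed(123)
sim.chi <- replicate(10000, {
  g <- as.vector(rmultinom(1, 156, c(.25,.5,.25))) # sample 156 individuals from a HW population
  p.hat <- (2*g[1]+g[2])/(2*sum(g)) # re-estimate p from the sample
  e <- sum(g)*c(p.hat^2, 2*p.hat*(1-p.hat), (1-p.hat)^2) # expected numbers
  sum((g-e)^2/e)
})
mean(sim.chi) # close to 1, the mean of a 1 df chi-square (2 df would give 2)
quantile(sim.chi,.95) # close to 3.84, the 1 df cutoff, not the 2 df value of 5.99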

So what is the probability that we would observe this much deviation based on chance alone (in which case we would fail to reject our hypothesis)? One way to answer that is what we did with coin flips - generate a bunch of χ2 distributed random variables and see where the 5% cutoff falls. In this case, we will do a one-tailed test, since the range of our statistic is from zero to infinity (with zero the value obtained if the observed data exactly match the expectation)

set.seed(123)
ch <- rchisq(100000,1)
colhist(ch,tail=1, xr=c(0,20),xlab="chi-square",ylab="Number",main="Are the Data Consistent with HWE?")
abline(v=chi,col="blue")

quantile(ch,.95)
##      95% 
## 3.851331

And we see that the data in fact fall within the range expected to be observed as a result of chance.

We can also calculate the probability of obtaining a χ2 value this large or larger as follows:

pr <-pchisq(chi,1,lower.tail=FALSE)
pr
## [1] 0.5333613

And we see that the probability of observing this much or more deviation from Hardy-Weinberg expectations is about 53%, far higher than our normal standard of 5% for rejecting the null hypothesis. So we can conclude in this case that the hypothesis cannot be rejected.

The more conventional approach

When you’ve done χ2 testing in the past, you’ve probably used a table that lists degrees of freedom and cutoff values that have been determined analytically. For example, for one degree of freedom, that value is 3.84; we see that it is very similar to the number we obtained from our simulation above. A small version of the table is in the unit on the χ2 distribution; it is also shown below:

library(knitr) # provides kable()
df <-c(1:10)
chicrit.05 <-sapply(df, function(x) qchisq(.95,x))
chicrit.01 <-sapply(df, function(x) qchisq(.99,x))
kable(data.frame(df,p.05 = chicrit.05,p.01=chicrit.01))
df p.05 p.01
1 3.841459 6.634897
2 5.991465 9.210340
3 7.814728 11.344867
4 9.487729 13.276704
5 11.070498 15.086272
6 12.591587 16.811894
7 14.067140 18.475307
8 15.507313 20.090235
9 16.918978 21.665994
10 18.307038 23.209251

A coding note

The hw() function in TeachingPopGen will accomplish much of what we have done so far. Given some set of single-locus, biallelic genotype counts, it will return all the basic information. For example

dat.obs <-c(55,75,12) #observed numbers of AA, Aa, and aa
dat.hw <-hw(dat.obs)
## [1] p= 0.651408450704225 q= 0.348591549295775
##      obs exp
## [1,]  55  60
## [2,]  75  64
## [3,]  12  17
## [1] chi squared = 3.772 p =  0.052 with 1 d. f.
## [1] F =  -0.163

Note that it will print the results to the console; it also returns a list, in this case to the variable dat.hw:

str(dat.hw)
## List of 6
##  $ Observed    : num [1:3] 55 75 12
##  $ Expected    : num [1:3] 60.3 64.5 17.3
##  $ Allele_freqs: num [1:2] 0.651 0.349
##  $ Chisq       : num 3.77
##  $ prob        : num 0.0521
##  $ F           : num -0.163

Homozygosity, Heterozygosity, and F

So by now we’ve looked at the chi-squared approach to testing whether observed genotype numbers are consistent with the prediction of Hardy-Weinberg. There is, however, another way we can make this comparison - one that is less statistical, but which (as we shall see) is very relevant biologically.

A couple of definitions

We will use the following terms extensively in what follows:

  • Homozygosity - the frequency of homozygotes (of any sort, e.g. AA and aa in the biallelic case) in a sample or population
  • Heterozygosity - the total frequency of heterozygotes in a sample or population

Expected Values

As is the case with any such measure, we need to have expectations. These are pretty straightforward:

  1. Expected homozygosity is simply \(\sum p_i^2\), where the \(p_i\)'s are the frequencies of the k alleles in the population
  2. Expected heterozygosity is easy for the biallelic case - it is simply 2p(1-p). That could be extended to the case of multiple alleles, but remembering that every individual is either a homozygote or a heterozygote, a simpler calculation is

E(heterozygosity) = 1 - E(homozygosity) = \(1 - \sum p_i^2\)
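
As a quick sketch, this is one line of R for any number of alleles (exp.het is a throwaway helper, not a package function):

exp.het <- function(p) 1 - sum(p^2) # p is a vector of allele frequencies
exp.het(c(0.2,0.8)) # biallelic case: equals 2pq
## [1] 0.32
exp.het(c(0.5,0.3,0.2)) # three alleles
## [1] 0.62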

Comparing some observations

Now that we have our expectations, we can turn to some data. These are real data and a staple of chapter-end problems in genetics and evolution texts. The organisms are bears; the phenotypes are brown (dominant) vs. white (recessive), and the genotypes at a single base in the melanocortin receptor gene, which is responsible for the difference, are as follows:

Genotype Phenotype N
AA Brown 42
AG Brown 24
GG White 21

So we can do some basic calculations

genos <-c(AA=42,AG=24,GG=21)
genos
## AA AG GG 
## 42 24 21

And we can do our basic Hardy-Weinberg calculations as we have before

hw.genos <-hw(genos)
## [1] p= 0.620689655172414 q= 0.379310344827586
##    obs exp
## AA  42  34
## AG  24  41
## GG  21  13
## [1] chi squared = 14.922 p =  0 with 1 d. f.
## [1] F =  0.4141

And we see that this is not an ideal population. In fact, by inspection, we see that the observed number of heterozygotes, 24, is well below the expected number, about 41. We want to make this more quantitative by comparing the observed and expected homozygosity and heterozygosity. Since these must add up to one, and for reasons that will become clear down the road, we'll focus on heterozygosity. Furthermore, from here on out, we'll work with frequencies rather than numbers (although the latter would work equally well).

So first, what are the observed frequencies?

genos.f.obs <-genos/sum(genos)
genos.f.obs
##        AA        AG        GG 
## 0.4827586 0.2758621 0.2413793

The observed heterozygosity is thus 0.276; to make the notation easier, we’ll assign that a name

Hobs <-genos.f.obs[2]

Now, what about the expected? Remembering that that is 2pq, we calculate it as

allele.freqs <-hw.genos$Allele_freqs
Hexp <-2*allele.freqs[1]*allele.freqs[2]
unname(Hexp)
## [1] 0.470868

Note that an alternate way of calculating this would simply be

homo.exp <-(allele.freqs)^2
homo.exp <-sum(homo.exp)
unname(1-homo.exp)
## [1] 0.470868

Which gives us the same value.

The F statistic

So what we want to do now is to come up with a numerical value to compare observed and expected heterozygosity. Consider the ratio Hobs/Hexp. Obviously, if the two are equal (that is, we have a Hardy-Weinberg population), then the ratio will be 1; if no heterozygotes are observed, the ratio will be zero. Thus, by subtracting the ratio from 1 - that is, F = 1 - Hobs/Hexp - we get a statistic that increases from zero to one with increasing departure from HW. Thus for our bear case here, we find that

f.bears <-1-Hobs/Hexp
unname(f.bears)
## [1] 0.4141414

This statistic has a variety of names - Wright's F Statistic and the Fixation Index are two common ones. We won't say more about it at this point; however, we will be using it extensively in the future. But the key points to realize are

  1. If a population is in HWE, then Hobs=Hexp so F= 1-1 = 0
  2. If there are fewer heterozygotes observed than expected, then F will be positive (as in the bear case)
  3. If there are more heterozygotes observed than expected, then F is negative
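
As a sketch, F for a biallelic locus can be computed directly from genotype counts; hw() reports it as well, so this standalone version (our own) is just for illustration:

f.stat <- function(obs) { # obs = counts of (AA, Aa, aa)
  N <- sum(obs)
  p <- (2*obs[1] + obs[2])/(2*N)
  Hobs <- obs[2]/N # observed heterozygosity
  Hexp <- 2*p*(1-p) # expected heterozygosity
  1 - Hobs/Hexp
}
f.stat(c(42,24,21)) # the bear data
## [1] 0.4141414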

So what is a significant value of F? At this point, we will leave that question open, but we will come back to it.

Finally, as an aside, the explanation for the bear data is that bears preferentially mate with others of their own color. Thus, brown × white matings occur less frequently than they would in a random mating population, reducing the frequency of heterozygotes.

A Few Wrinkles

Dominance

Until now, we've assumed we can identify all genotypes and then count alleles, divide by the total, and have an estimate of p. However, what if we can't? For example, consider a typical general genetics problem - the frequency of cystic fibrosis in Caucasians is 1/3000; what is the frequency of the allele? One approach (the only one in this case) - if we assume HW, then

freq <-1/3000
q <-sqrt(freq)
q
## [1] 0.01825742

But note that we have assumed that the genotypes are in Hardy-Weinberg proportion by estimating q as the square root of the frequency of recessive homozygotes. Hence, no statistical test for departure from HW is possible.

Multiple Alleles

Much of what we will work with this semester will involve biallelic data (for example, most SNPs are polymorphisms of two bases - sites with three are quite rare). However, there will be some cases in which there are multiple alleles - in fact this is the norm for microsatellite loci. In those cases, calculating allele frequencies is a fairly straightforward extension of the biallelic one; probably the easiest way is to work from genotype frequencies, in which case, for the frequency of allele i

\(p_i = f(A_iA_i) + \frac{1}{2}\sum_{j \neq i} f(A_iA_j)\)

And also, by extension, from an observed set of genotypes, we can estimate the allele frequencies and then estimate expected genotype frequencies and numbers, noting that if

\(f(A_1) = p\)
\(f(A_2) = q\)
\(f(A_3) = r\)

then

\(f(A_1A_1)_{exp} = p^2\)
\(f(A_1A_2)_{exp} = 2pq\)
\(f(A_1A_3)_{exp} = 2pr\)

and so forth, for six total genotypes.
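
In R, the full set of expected genotype frequencies can be generated in one step with an outer product (a quick sketch using made-up frequencies):

p <-c(0.5, 0.3, 0.2) # example allele frequencies for A1, A2, A3
gmat <-outer(p, p) # entry [i,j] is p_i * p_j
diag(gmat) # homozygote frequencies: p_i^2
## [1] 0.25 0.09 0.04
2*gmat[upper.tri(gmat)] # heterozygote frequencies: 2 p_i p_j
## [1] 0.30 0.20 0.12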

And of course, we can again do a chi-square test. In this case, we estimate two parameters from the data (p and q; r follows from them), so along with the usual one for the number of categories, we lose a total of three degrees of freedom, giving us 6 - 3 = 3.

But be forewarned - as the number of alleles increases, the number of possible genotypes increases, and thus the expected numbers of particular genotypes can become very small (approaching zero in the case of homozygotes for low-frequency alleles), so χ2 testing becomes problematic. In fact, the number of genotypes can be calculated as follows.

Given k alleles, there are obviously k possible homozygotes. The number of possible heterozygotes is (k(k-1))/2. So we can plot this as

k <-1:8
ngenos <-k+(k*(k-1))/2
plot(k,ngenos, xlab="number of alleles", ylab="number of genotypes")

Do we have the best estimator of p?

This may seem a bit nitpicky, but it is a nice illustration of the principle of likelihood maximization. Consider the following biallelic data

genos <-c(11,41,53)

Using a TPG function, we can calculate all of the above

 hw.out <-hw(genos)
## [1] p= 0.3 q= 0.7
##      obs exp
## [1,]  11   9
## [2,]  41  44
## [3,]  53  51
## [1] chi squared = 0.519 p =  0.471 with 1 d. f.
## [1] F =  0.0703

And we see that, based on our estimate of p, we cannot reject Hardy-Weinberg. But is this the best estimator? We could consider a couple of other possibilities:

  1. Use the square root of the frequency of AA

sqrt(genos[1]/sum(genos))
## [1] 0.3236694

  2. Use one minus the square root of the frequency of aa

1-sqrt(genos[3]/sum(genos))
## [1] 0.289534

So we have three estimates - .3, .32, and .29. Which is the best? Intuitively, the .3 estimate seems best, since we used the most information to obtain it - but is it?

R. A. Fisher investigated this problem; Felsenstein provides the necessary mathematical information. In essence, what we can do is the following:

  1. For a range of parameters (p in this case), calculate P(X|p), the likelihood of the data X given p. Note that heretofore we have been estimating P(p|X), the probability of a particular value of p given the observed data X.
  2. Intuitively, this likelihood is going to cover a wide range. In the example above, we know that p cannot equal zero or 1; furthermore, it is highly unlikely that it would be, say, .05 or .9.
  3. So what we can do, for each value of p in the range .01 to .99, is calculate the exact probability (just as we did when we introduced the binomial distribution) that we would get the data if p were in fact truly that value, and take the logarithm of it to scale it appropriately.
  4. Finally, we can plot the log likelihood against p and look for the point at which it is maximized. The function maxp will do that for us.
maxp(genos)

## [1] 0.3 0.3

And we see that, in fact, the value of p with the maximum likelihood associated with it is .3, the same as what we calculated initially.
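
For the curious, here is a minimal sketch of the kind of grid search that maxp presumably performs (our own code; the actual implementation may differ):

p.grid <-seq(.01,.99,.001)
loglik <-sapply(p.grid, function(p)
  dmultinom(genos, prob=c(p^2, 2*p*(1-p), (1-p)^2), log=TRUE)) # log likelihood under HW
p.grid[which.max(loglik)] # the maximum likelihood estimate
## [1] 0.3
plot(p.grid, loglik, type="l", xlab="p", ylab="log likelihood")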

Again, this may seem like a somewhat meaningless exercise, but it does introduce the concept of maximum likelihood estimation, something that can be extremely useful in more complex situations. And, as we did in Bayesian analysis, we are using the data to actually estimate p, not asking (as the frequentist does) whether the deviation from a previously determined parameter is likely to be due to chance.

ABO Blood Types - When Multiple Alleles and Dominance Collide

We are all familiar with the ABO blood type system. There are three alleles - A, B, and O. A and B are codominant, so that an AB heterozygote has the AB blood type. However, O is recessive to both A and B, so, for example, an AO individual has type A blood, as does an AA one. These relationships are summarized as follows:

Genotypes Blood Type Number Phenotype Frequency
AA, AO A \(n_{A}\) \(p^2_{A}+2p_{A}p_{O}\)
BB, BO B \(n_{B}\) \(p^2_{B}+2p_{B}p_{O}\)
AB AB \(n_{AB}\) \(2p_{A}p_{B}\)
OO O \(n_{O}\) \(p^2_{O}\)

And using our multiple allele logic, we should be able to calculate the frequency of the A allele as

\(P_{A} = \frac{2n_{AA}+n_{AO}+n_{AB}}{2n}\)

But there’s a problem - we can’t distinguish between AA and AO individuals, so how do we plug in the numbers?

Again, Felsenstein (2015) suggests an approach (see also Weir (1996)), known as "expectation maximization", to get at this problem. Look again at the equation above. Our problem is that, while we can get \(n\), \(n_{AB}\), and \(n_{O}\) directly from the data, AA and AO individuals both have type A blood. A similar problem exists for the frequency of B as well.

Felsenstein (pages 31 and 32) shows that the three values we wish to estimate - \(p_{A}\), \(p_{B}\), and \(p_{O}\) - can be expressed as follows:

\(P_{A}=\frac{2(\frac{p_{A}+p_{O}}{p_{A}+2p_{O}})n_{A}+n_{AB}}{2n}\)

\(P_{B}=\frac{2(\frac{p_{B}+p_{O}}{p_{B}+2p_{O}})n_{B}+n_{AB}}{2n}\)

\(P_{O}=\frac{\frac{p_{O}}{p_{A}+2p_{O}}n_{A}+\frac{p_{O}}{p_{B}+2p_{O}}n_{B}+n_{O}}{n}\)

We also know that the three frequencies must sum to one.

But wait a minute! In each case, calculation of the frequency is dependent on knowing that frequency, that is, the quantity we are trying to estimate is necessary in order to make the estimate. So how to proceed?

Look again at these equations. If we plug some set of frequencies into the right sides, we will get some values on the left - no problem. And if we happened to have guessed the right ones, we would be done. But of course that is almost never going to occur, rather we’ll get a new set of p’s. We could then use those as our new values on the right and repeat the process. Eventually (we hope) we would reach some equilibrium value, in which the numbers we calculate on the left are the same as the ones we plug in on the right.

So we can work through an example. Suppose we have the following data from an actual sample

ABO <-c(212,103,39,148) # vector of counts of A, B, AB and O

We've written a simple function, p.em, to carry out the expectation maximization process, starting with an initial estimate of .3 for each of the three frequencies. We can apply that function to these data:

p <-p.em(ABO)

p
## [1] 0.2944973 0.1540032 0.5514996
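
The iteration itself is simple to write. Here is a minimal sketch of the idea behind p.em (our own code; the packaged function also produces diagnostic plots):

abo.em <- function(counts, p=c(.3,.3,.3), niter=25) {
  nA <- counts[1]; nB <- counts[2]; nAB <- counts[3]; nO <- counts[4]
  n <- sum(counts)
  for (i in 1:niter) { # plug the current estimates into the right-hand sides
    pA <- p[1]; pB <- p[2]; pO <- p[3]
    p <- c((2*((pA+pO)/(pA+2*pO))*nA + nAB)/(2*n),
           (2*((pB+pO)/(pB+2*pO))*nB + nAB)/(2*n),
           ((pO/(pA+2*pO))*nA + (pO/(pB+2*pO))*nB + nO)/n)
  }
  p
}
abo.em(ABO) # converges to roughly .29, .15, .55, matching p.em above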

These plots show the first 10 iterations of the expectation maximization process, and we see that the estimates of the three allele frequencies quickly stabilize at the equilibrium values of .29, .15, and .55. We can now calculate the expected numbers in each blood type class, assuming Hardy-Weinberg holds

OExp <- p[3]^2

ABExp <-2*(p[1]*p[2])

Aexp <-p[1]^2+2*p[1]*p[3]

Bexp <-p[2]^2+2*p[2]*p[3]

exp <-c(Aexp,Bexp,ABExp,OExp)*sum(ABO)

exp
## [1] 206.60255  97.17833  45.53493 152.68419

And we can then use our function for a quick chi-square test with 1 df (remembering that two parameters were estimated, which, along with the usual one for the number of classes, costs us three of the four degrees of freedom)

chixw(ABO,exp)
##   chi.sq prob d.f.
## 1   1.57 0.21    1

And we see that in fact we have done a pretty good job of explaining the data.

So what have we done? As Felsenstein points out, at first glance, it may seem to be an “exercise in ad-hockery”. However, he then goes on to point out that what we have done is to find the maximum likelihood estimates of the allele frequencies, and we have done so in a way that is relatively quick and insensitive to the starting values used.

The Relationship between Genotype and Allele Frequencies

Think about the relationship between p and heterozygosity (the fraction of individuals that are heterozygous). At the extremes - p = 0 and p = 1 - heterozygosity is obviously zero, but what happens in between? A very important point to recognize is that while the frequency of heterozygotes is a function of p, the frequency of homozygotes is a function of \(p^2\). What that means qualitatively is that as p gets small, \(p^2\) gets very small, meaning that, in a finite sample, it is quite possible that there will be no homozygotes for the minor allele present.

Thinking about rare alleles

In diploids, a particular allele (call it a) can exist in one of two states - either in a homozygote or a heterozygote. If its frequency is q, and the size of the population is N, then there are 2Nq copies of the allele, of which \(2Nq^2\) occur in homozygotes. Thus, the fraction of a alleles that occur in homozygotes is

\(\frac{2Nq^2}{2Nq} = q\)

And the fraction in heterozygotes is thus

\(1 - q = p\)

Felsenstein describes this relationship particularly well (pg. 9).

And that leads us to the following critical prediction: most copies of a rare allele are found in heterozygotes, while most copies of a common allele are found in homozygotes. We can examine this graphically by plotting q (from 0 to .9) against q/p (the ratio of the allele's frequency in homozygotes to its frequency in heterozygotes) as follows:

q <-seq(0,.9,.001)
p <-1-q
fhom <-q/p # fraction of copies in homozygotes relative to heterozygotes
plot(q, fhom,type="l", ylab="homozygote/heterozygote", main="Homozygote/Heterozygote Ratio as a Function of q")

And we see that as q increases, the ratio climbs ever more steeply (q/p = q/(1-q) is hyperbolic). This becomes even more extreme as q approaches 1:

q <-seq(.9,1,.001)
p <-1-q
fhom <-q/p
plot(q, fhom,type="l", ylab="homozygote/heterozygote", main="Homozygote/Heterozygote Ratio as a Function of q")

In fact, in this range we can use the approximation q ≈ 1, so that the ratio is then ≈ 1/p:

plot(q, fhom,type="l", ylab="homozygote/heterozygote", main="Homozygote/Heterozygote Ratio as a Function of q")
lines(q,1/p,col="red", type="l")

And we see that indeed the fit is quite close

Exploring the entire range of allele frequencies

Now we want to examine expected heterozygosity over the entire range of possible values for p (0 to 1). We can explore this graphically, using what is called a de Finetti diagram (or sometimes a ternary plot). It is best to work through a few examples. Fortunately, the R package HardyWeinberg has excellent capabilities in this regard.

library(HardyWeinberg)

We will start by generating a plot that shows the theoretical relationship between p and heterozygosity. First, however, we need to generate simulated multiple-locus data; for example, 100 loci in a sample of 100 individuals

dat <-HWData(100,100)

This returns a matrix of numbers, consisting of one row per locus and one column per genotype. We can look at a little of it as follows:

head(dat)
##      AA AB BB
## [1,] 62 36  2
## [2,]  8 37 55
## [3,]  0 30 70
## [4,] 91  8  1
## [5,] 23 50 27
## [6,] 77 22  1

And its structure is pretty clear. We will come back to using these numbers in a minute, but first, we’d like to visualize the theoretical relationship between p and heterozygosity. We can do so as follows:

HWTernaryPlot(dat, hwcurve=TRUE,addmarkers=FALSE,region=0,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Theoretical Relationship between p and H",cex.main=1.2)
abline(h=.5,lty=2)

The parabola indicates the expected heterozygosity given the value of p plotted on the x axis. What we see is that heterozygosity is maximized at 0.5, when p equals .5 (indicated by the dashed line).

Now we can add our simulated data to the plot. Remember that we generated 100 simulated data points (genotype distributions for 100 loci in a sample size of 100, all drawn from the expected HW distribution). Let's put them onto the de Finetti diagram

HWTernaryPlot(dat, hwcurve=TRUE,addmarkers=TRUE,region=0,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Adding 100 Loci from N=100",cex.main=1.2)

And we see we have a bunch of scatter (although the theoretical curve is followed pretty well). But how many points fall within the chi-square acceptance region?

options(warn=-1)
HWTernaryPlot(dat, hwcurve=TRUE,addmarkers=TRUE,region=1,vbounds=FALSE,axis=2,vertexlab=c("0","","1"))

Remember, we are testing at the 5% level, so it is not surprising that a few points fall outside of the acceptance region. But the key point to remember is that the data we simulated here were 100 loci from 100 individuals in a Hardy-Weinberg population. In our next exercise, we will take this one step further and look at some real data.


HW Analysis of SNP data

It's time for some real data. We've seen that we can test for Hardy-Weinberg, and that in simulated data from finite populations, we do see (not unexpectedly) some departures from our expectations due simply to chance. But the question is, if we were to look at real data, what would we see?

Loading the data

To do this, we will work from an old human variation project called HapMap (it has largely been superseded by the 1000 Genomes Project). These data consist of single nucleotide polymorphism genotypes of individuals from a variety of globally distributed populations. For the most part, SNPs are biallelic (each one has a major allele and a minor allele). For our exercise, we will use data for about 48,000,000 base pairs (from position 42505066 to 90394354) of chromosome 6 from the Yoruba (YRI) population, a sample of 90 individuals.

For demonstration purposes, a sample of the raw data from HapMap (400 SNPs out of a total of nearly 43000) is in TeachingPopGen, so we can access it as follows:

data(ch6.yri)

This returns the raw data (ch6.yri) for the region in question; we will now use a built-in function to clean the data. In doing so, we will eliminate all of those SNPs that have fewer than 5 individuals in any genotypic class.

ch6.smp <-datprep(ch6.yri,min=5)
load("../../QPopgen/Data2015/HwGenomeData") # to run in class

We can look at the data we have briefly as follows:

head(ch6.sub)
##           AA Aa aa
## rs260249   4 20 35
## rs9381174 97 15  1
## rs260251   0 19 41
## rs260253  36 55 22
## rs260254  16 30 14
## rs260255   0 37 76
print(paste("Number of Loci = ",nrow(ch6.sub),sep=""),quote=FALSE)
## [1] Number of Loci = 42826

And we see that it is simply a data frame, containing biallelic genotype numbers from the population for each of the SNPs that met our criteria.

Testing for Hardy Weinberg

Now, we want to determine how many of these are in Hardy Weinberg Equilibrium. We will do this in two ways.

Ternary Plot of a subset of the data

We saw in the last unit that we can visualize departures from Hardy-Weinberg on a ternary plot. We can do so first on a sample of these data (number of loci = 200):

## This is code run in class only (on the complete data set)
smp <-sample(1:nrow(ch6.sub),200)
ch6.samp <-ch6.sub[smp,]
trn <-  HWTernaryPlot(ch6.samp, hwcurve=TRUE,addmarkers=TRUE,region=1,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Chromosome 6 - 200 Random SNPS",cex.main=1.2)

and then on the complete set:

trn <- HWTernaryPlot(ch6.sub, hwcurve=TRUE,addmarkers=TRUE,region=1,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Chromosome 6 - All SNPs",cex.main=1.2)

And we get a pattern very similar to what we saw with the simulated data. And by looking at the structure of the data returned:

str(trn)
## List of 5
##  $ minp       : num 0.21
##  $ maxp       : num 0.79
##  $ inrange    : int 22460
##  $ percinrange: num 52.4
##  $ nsignif    : int 2476

we see that there are in fact 2476 SNPs whose χ2 values depart significantly from Hardy-Weinberg expectations.

Chi-square test on the entire data set

Next, let's look at the distribution of chi-squared values for the entire data set. We get into some slightly sophisticated R functions here, but the nice thing is that it takes only a few steps:

options(warn=-1)
hwtest <-apply(ch6.sub,1,hw,print=FALSE) # use the hw function in TeachingPopGen to test each SNP
Hw.ch <-sapply(hwtest,function(x) x$Chisq) # create a vector of the calculated chi-square values
q <-quantile(Hw.ch,.95) # determine the upper quantile
colhist(Hw.ch,tail=1,xlab="chi-squared",ylab="Number",main="Chi Square Distribution, ch. 6")

So we see that indeed, some of the SNPs do show departure from Hardy-Weinberg. How many? For this, we will use the "book value" for the 5% cutoff with 1 degree of freedom, 3.84

dep <-length(Hw.ch[Hw.ch>3.84])
dep
## [1] 2478

And as a fraction of the total

f.sig <-dep/nrow(ch6.sub)
f.sig
## [1] 0.05786205

So of the 42826 SNPs examined, 2478 indeed show departure from Hardy-Weinberg expectations. This is quite close to the 5% we would expect by chance.

So is there anything significant about any of those SNPs in the "red zone"? In bioinformatics, a more stringent probability cutoff is usually applied in cases of multiple simultaneous tests like this (the Bonferroni correction) - that is, the normal cutoff (0.05) is divided by the number of comparisons being made. In this case, that would be

p.cut <- .05/nrow(ch6.sub)
p.cut
## [1] 1.167515e-06

And we can do a little bit of manipulation of our hwtest results to determine if we have any such SNPs

hw.p <- sapply(hwtest,function(x) x$prob)
snp.out <-which(hw.p<p.cut)
length(snp.out)
## [1] 18

The distribution of F

Before addressing that, we can also look at the mean and distribution of Wright’s F statistic

Hw.f <-sapply(hwtest,function(x) x$F)
colhist(Hw.f)
f.mean <-mean(Hw.f)
abline(v=f.mean, col="red")

f.range <-range(Hw.f)
f.mean
## [1] 0.00387696
f.range
## [1] -0.4520548  0.6491228

So we see that the mean of F is very close to zero, although there are outliers that might be worth exploring further.

The Chromosomal Distribution of F values

So what if we want to look at the distribution of chi-squared or F values based on chromosomal position? Doing so might reveal regions of the chromosome that show systematic departure from Hardy-Weinberg expectations and thus might be targets of further investigation. Those positions are in the original data set we loaded (ch6.yri), so we can extract them and plot them for our selected data as follows:

hw.smp <- apply(ch6.smp, 1,hw,print=FALSE)
chi.smp <-sapply(hw.smp, function (x) x$Chisq)
f.smp <-sapply(hw.smp, function (x) x$F)
snps <-rownames(ch6.smp)
pos <-ch6.yri$V3[rownames(ch6.yri) %in% snps]
par(mfrow=c(2,1))
plot(pos/1000000,chi.smp,xlab="Chromosomal Position",ylab="chi-squared",pch=20,main="Chi-squared")
abline(h=3.84,lty=2,col="blue")
plot(pos/1000000,f.smp,pch=20,xlab="Chromosomal Position (MB)",ylab="F",main="F")
abline(h=0,col="red")
abline(h=c(-.2,.2),lty=2,col="blue")

par(mfrow=c(1,1))

What constitutes a "significant" departure from zero? That is not an easy question to address at this point, but we'll somewhat arbitrarily consider any value of F outside of ± 0.2 to be significant. For chi-squared, we'll use the standard p < 0.05 significance level for one degree of freedom (3.84). These values are indicated on the two plots with dashed blue lines. There are a few bits of information we can glean

  1. The chi-squared plot shows the positions of the eight SNPs that show significant departures from HW expectations. They appear to be fairly randomly distributed over the chromosome.
  2. Note that when the correction described above is applied, none of them are significant.
  3. We see a similar number of SNPs with values of F outside of the range of ± 0.2. However, we note that some show positive values and others negative. As a thought question, how do these two classes of SNPs differ with respect to the nature of their departure from Hardy-Weinberg expectations?

Summary

There are a few reasons for undertaking this exercise.

  1. We need to remember that the Hardy-Weinberg equilibrium (or departure therefrom) is a property of the population, not of any individual locus. Therefore, since some loci will, by chance, depart from the expected proportions in a given sample, we need to be conservative about extrapolating from single-locus data to population-level properties.
  2. When we do see outliers, we have to proceed with caution. For example, in bioinformatics, departure from Hardy-Weinberg is used as an indicator of possible problems with the raw data - if heterozygotes are erroneously scored as homozygotes, then there will be apparent departures.
  3. In cases where an allele occurs at low frequency, the chi-squared test is fraught with danger. In such cases, and especially in typical sample sizes (N~100), the expected numbers of homozygotes of the low frequency allele are very small, which can artificially inflate chi-squared values.

References

Felsenstein, Joseph. 2015. “Theoretical Population Genetics.” Online. http://evolution.gs.washington.edu/pgbook/pgbook.pdf.

Weir, B S. 1996. Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Genetic Data Analysis. Sinauer Associates. https://books.google.com/books?id=e9QPAQAAMAAJ.