An Exact Test of Hardy-Weinberg

We're going to start with a simple reality - with the exception of populations with regular non-random mating schemes (which we will explore briefly soon), significant departures from Hardy-Weinberg are uncommon. Nevertheless, there is an extremely large literature on the subject - why? One reason, which is relevant to some of the data we will be using, is that the chi-squared approximation performs poorly when sample sizes are small and/or the number of alleles is large. The latter condition is common for microsatellite data. So some investigation of tests beyond the chi-squared approximation is worth the effort.

The Exact Test of HWE

Remember for a minute how we approached the binomial. First, given our experiment in which we observed 17 out of 20 heads in flips of a fair coin, we calculated the exact probability of getting that particular outcome. We then used the alternative approach of simulating the outcomes of repeated trials and determining how often we saw a deviation from the expected outcome as large as or larger than the one we actually observed. In the case of the χ2 distribution, we did something similar - we asked how often the value of chi-square would be equal to or greater than that observed, given the appropriate number of degrees of freedom.
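As a quick refresher, a minimal sketch of those two approaches in R might look like the following (the number of replicates is arbitrary, and the cutoff of 7 is simply the observed deviation, |17 - 10|):

# Exact probability of exactly 17 heads in 20 flips of a fair coin
dbinom(17, 20, 0.5)
# Simulation: how often does the deviation from the expected 10 heads
# equal or exceed the deviation we observed (|17 - 10| = 7)?
flips <- rbinom(10000, 20, 0.5)
mean(abs(flips - 10) >= 7)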

We're going to do something similar here. First, we are going to ask how we can calculate and interpret the exact probability that we would get a given distribution of genotypes, given some allele frequency and the assumption of Hardy-Weinberg. Then we will extend this to multiple-allele situations, in which case, even with a good computer, direct computation becomes impractical.

What follows is derived from Genetic Data Analysis II (Weir, 1996), pp 99-105. Note that Weir's algebraic nomenclature is somewhat iconoclastic; we will try to go through his arguments using more familiar terminology.

First, let's start with some two-allele data - frequencies of Pgm alleles in mosquitoes. We have genotyped 40 individuals, 0 of which are AA, 19 of which are heterozygous (AB), and 21 of which are BB. We can do our quick calculations

pgm.geno <-c(0,19,21)
n <-sum(pgm.geno)
pgm.hw <-hw(pgm.geno)
## [1] p= 0.2375 q= 0.7625
##      obs exp
## [1,]   0   2
## [2,]  19  14
## [3,]  21  23
## [1] chi squared = 3.881 p =  0.049 with 1 d. f.
## [1] F =  -0.3115

And we can see that with an unadjusted chi-squared test, we would reject the hypothesis of HWE. However, note also that the expected number of AA homozygotes is small (2), so this result could be questioned. So let's proceed as follows, focusing on calculating the probability of getting the observed number of heterozygotes. First, let's determine the number of A alleles in the sample

n.A <-2*pgm.geno[1]+pgm.geno[2]
n.A
## [1] 19

First, let's calculate the exact probability of our data. This is a complicated calculation, based on the binomial distribution, but rather than worry about it (see Weir if you are interested), we'll use the R function HWExact in the package HardyWeinberg:

pgm.pr <-HWExact(pgm.geno)
pgm.pr$pofthesample
##         19 
## 0.05938107

So the exact probability of getting 19 heterozygotes is .06. But the question we want to get at (as always) is: what is the probability of observing as much or greater deviation from the expectation? To get at that we will do the following

  1. We will calculate the probability of each possible set of genotypes, given our sample size and number of A alleles.
  2. We will then rank them from the least likely to the most likely.
  3. We will then calculate the cumulative probability associated with each.

So with 19 A alleles, how many heterozygotes could we have? Well, since 19 is odd, there is no way we could have zero heterozygotes - each AA homozygote accounts for two A alleles, so no matter how many homozygotes there are, one A allele will always be left over, and since it can't be split, it has to be in a heterozygote. The same argument rules out any even number of heterozygotes - if we put two of the 19 A alleles into heterozygotes, that leaves 17 alleles that would have to be in homozygotes, which is impossible. Thus, the number of heterozygotes possible is 1, 3, 5, . . . up to 19. Thus

n.AB <-seq(1,19,2)
n.AB
##  [1]  1  3  5  7  9 11 13 15 17 19

Now, what about AA homozygotes? Well, since each of them carries two A alleles, we simply subtract the number of A alleles found in heterozygotes from 19 and divide by two

n.AA <-(n.A-n.AB)/2
n.AA
##  [1] 9 8 7 6 5 4 3 2 1 0

And finally, the number of BB individuals is simply everyone who is left. We'll calculate that and then make a small data frame out of the data.

n.BB <-n-n.AA-n.AB
pgm.df <-data.frame(n.AA,n.AB,n.BB)
kable (pgm.df)
n.AA n.AB n.BB
9 1 30
8 3 29
7 5 28
6 7 27
5 9 26
4 11 25
3 13 24
2 15 23
1 17 22
0 19 21

Now what is the exact probability associated with each? In fact, the HWExact function has already calculated them - they are the first element in the list (pgm.pr$prob), so we can add them to our data frame

Prob <-format(pgm.pr$prob,scientific = FALSE,digits=1)
Prob <-as.numeric(Prob)
pgm.df <-cbind(pgm.df,Prob)
kable(pgm.df)
n.AA n.AB n.BB Prob
9 1 30 0.0000000
8 3 29 0.0000026
7 5 28 0.0001222
6 7 27 0.0022802
5 9 26 0.0205222
4 11 25 0.0970139
3 13 24 0.2487536
2 15 23 0.3411478
1 17 22 0.2307764
0 19 21 0.0593811

If we do a quick plot of these, we get the following

barplot(pgm.df$Prob,names.arg=pgm.df$n.AB)

Suppose we rearrange this, so that we go from the least likely to the most likely. This takes a little R code munging:

ord <-order(pgm.df$Prob)
pgm.df.sort <-pgm.df[ord,]
kable (pgm.df.sort)
n.AA n.AB n.BB Prob
1 9 1 30 0.0000000
2 8 3 29 0.0000026
3 7 5 28 0.0001222
4 6 7 27 0.0022802
5 5 9 26 0.0205222
10 0 19 21 0.0593811
6 4 11 25 0.0970139
9 1 17 22 0.2307764
7 3 13 24 0.2487536
8 2 15 23 0.3411478

Now the genotypes are sorted from least likely to most likely. And the last thing we will do is to calculate the cumulative probability - that is, the probability of observing that much deviation, or more, by chance.

Cumulative.Prob <-cumsum(pgm.df.sort$Prob)
pgm.df.sort <-cbind(pgm.df.sort,Cumulative.Prob)
kable( pgm.df.sort)
n.AA n.AB n.BB Prob Cumulative.Prob
1 9 1 30 0.0000000 0.0000000
2 8 3 29 0.0000026 0.0000026
3 7 5 28 0.0001222 0.0001248
4 6 7 27 0.0022802 0.0024050
5 5 9 26 0.0205222 0.0229272
10 0 19 21 0.0593811 0.0823083
6 4 11 25 0.0970139 0.1793222
9 1 17 22 0.2307764 0.4100986
7 3 13 24 0.2487536 0.6588522
8 2 15 23 0.3411478 1.0000000

And we can visualize this as follows:

barplot(pgm.df.sort$Cumulative.Prob,names.arg = pgm.df.sort$n.AB,xlab="Number Heterozygotes",ylab="Cumulative Probability")
abline(h=.05,col="red")

The red line is drawn at p=0.05; the interpretation is that, given that there are 19 A alleles, we would expect to see nine or fewer heterozygotes less than 5 percent of the time. Thus, if we did see that, we would reject HWE. Note also that our simple chi-squared test led us to reject HWE in the original case (19 heterozygotes); based on the calculation of exact probabilities, we do not.

As things get more complicated

All of this is well and good, and we can use this approach quite easily for the biallelic case. But think of what we are doing - we are enumerating every possible genotype distribution and calculating the exact probability of observing it. In this case, that really only took 10 calculations. And even if our sample size were much larger, the computational load would not be too awful.

The problem comes with multiple alleles. With two alleles, there are 3 possible genotypes; with three there are 6 (three homozygotes, three heterozygotes); with four there are 10; in general, with k alleles there are k(k+1)/2. With microsatellites, for example, the number of alleles and possible genotype distributions can be extremely large, so computation becomes impractical (indeed, most implementations of the exact test are limited to the biallelic case).
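A quick calculation (n.genotypes is just a throwaway helper, written here for scale) shows how fast the number of genotypes grows:

# number of possible genotypes for k alleles: k homozygotes plus choose(k, 2) heterozygotes
n.genotypes <- function(k) k + choose(k, 2)   # equivalently k*(k+1)/2
n.genotypes(2:4)
## [1]  3  6 10
n.genotypes(25)   # e.g., a microsatellite locus with 25 alleles
## [1] 325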

Guo and Thompson (1992) proposed randomization, or Monte Carlo methods, to address this. The basic idea is to take the alleles in a sample, mix them together, and then draw them out two by two to create a population that might be expected based on random mating. The exact probability can then be calculated and the process repeated, until there is a distribution of possibilities against which observed data can be compared. There is actually much more mathematical sophistication involved, but for now this description should suit our needs.
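In crude outline, the resampling idea looks something like the sketch below. This is only the intuition - the actual Guo and Thompson algorithm works with the exact conditional probabilities of genotype tables and a Markov chain, not a naive shuffle - and shuffle.genotypes is a hypothetical illustration, not a function from any of the packages we are using.

# Pool the 2N alleles from the observed sample, mix them, and re-pair them at
# random to build one sample that could arise under random mating; repeating
# this many times gives a null distribution to compare the observed data against.
shuffle.genotypes <- function(alleles) {    # 'alleles' is a vector of 2N allele labels
  shuffled <- sample(alleles)               # mix the alleles together
  matrix(shuffled, ncol = 2, byrow = TRUE)  # draw them back out two at a time
}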

So we're going to briefly use some real data - data for one locus (DQR1) from the human major histocompatibility complex, for 271 samples, available in the R package gap. It takes a bunch of manipulation to extract, and since we will actually be doing these procedures using other programs, the code for doing so is hidden (but can be found in the .Rmd file).

##   DQR.a1 DQR.a2
## 1      4      9
## 2      4      7
## 3     22     21
## 4      6      6
## 5      9     21
## 6     22     17
## [1] 271

Each line corresponds to an individual (there are a total of 271), and the two columns are allele identifiers, numbered 1 through n. So how many different alleles are there?

## [1] 25

And that gives an idea of the scope of the problem. There are 25 possible homozygous genotypes and 300 heterozygous ones - there is simply no practical way to calculate the exact probability for each. So we're going to use our sampling approach (after a little bit more hidden munging) to determine whether or not we should accept HWE for this sample.

And we can display some of the results:

out$p.value
## [1] 0.055002
out$p.value.se
## [1] 0.005422998

And we see that, for these data, we do not reject HWE (although it is close, something unusual for human data).

Conclusions

In many cases, especially biallelic ones, simple χ2 analysis is sufficient to test Hardy-Weinberg, and we will resort to it often. However, with multiallelic data (notably microsatellites, which we will use frequently), even carefully adjusted chi-squared tests can be misleading. Hence, methods like those used above, or similar ones, are preferable.

And fortunately for us, this method has been implemented at GenePop on the Web. In our next laboratory exercise, we will explore that resource, the Dryad Data Repository, and Genalex as relatively easy tools for doing these sorts of analyses.

Estimating LD From Genotype Data

In our consideration of measures of linkage disequilibrium (D, D’, r) we proceeded from the assumption that we could, in some fashion, count gametes. In a haploid population, of course, this is no problem. However, in the case of diploids, while we can infer the gametic combinations of most individuals, in some (the double heterozygotes) we cannot. Thus, we need to find a way to estimate linkage disequilibrium from what is effectively incomplete data.

As we have done previously, we will restrict our attention to the two-locus case, with two alleles at each locus. A number of approaches have been proposed (and are reviewed quite well in Weir); as an example, we will focus on one proposed forty years ago by Hill (1974).

An example

In the reference cited above, Hill (Table 3a) uses an example of codominant data, in which 1000 individuals were scored for their MN and Ss blood types. To keep our nomenclature consistent, we will refer to the alleles controlling MN as A and a; those controlling Ss will be designated B and b. So the first thing we will need to do is load the data, which are available in TeachingPopGen:

library(TeachingPopGen)
data(HillData)

The data are imported in a matrix named gmat, which we can inspect:

gmat
##    SS  Ss  ss
## MM 57 140 101
## MN 39 224 226
## NN  3  54 156

It consists of the observed numbers for all nine possible two-locus genotypes. To make the data consistent with our A, B nomenclature, we can change the row and column names:

rownames(gmat) <-c("AA","Aa","aa")
colnames(gmat) <-c("BB","Bb","bb")
gmat
##    BB  Bb  bb
## AA 57 140 101
## Aa 39 224 226
## aa  3  54 156

Now, let's look at these in terms of gametic composition. Some are really easy - all of the double homozygotes (for example AABB) consist of two identical gametes, so we can state that in the total pool of 2000 gametes, there must be 2*57, or 114, that are derived from AABB individuals and are in fact AB gametes. We can also accurately infer the gametes that make up single heterozygotes. So using AABb (of which there are 140) as an example, we can see that these genotypes, in terms of gametes, have to be (AB)/(Ab), meaning that this class contributes 140 AB and 140 Ab gametes to the pool. In this fashion, we can count the gametes for eight of the nine genotypes.
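As a quick check, here is one way to tally the directly countable AB gametes from the renamed gmat matrix (n.AB.known is just an illustrative name, not something used elsewhere in this document):

# two AB gametes from each AABB individual, one from each AABb and each AaBB individual
n.AB.known <- 2 * gmat["AA", "BB"] + gmat["AA", "Bb"] + gmat["Aa", "BB"]
n.AB.known   # 2*57 + 140 + 39 = 293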

The problem is the ninth - the double heterozygotes (AaBb). In fact, these individuals could be either (AB)/(ab) or (Ab)/(aB). So how do we account for this?

To do this, Hill proposed use of a procedure called expectation maximization. We won't go into the details, as the mathematics are a bit involved, but in essence we do the following:

  1. Count those gametes that can be directly identified
  2. Based on those, select a possible starting frequency for the AB gametes
  3. Use these two to re-estimate the AB frequency (that is, use the number actually counted plus the number that would be contributed by the double heterozygotes given the estimate of AB frequency used)
  4. Estimate D based on these numbers
  5. Repeat steps 3 and 4, using the newly calculated value
  6. Continue this iterative process until a stable value of D is obtained.
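
For the curious, a bare-bones sketch of this iteration might look like the following. Note that em.D is a hypothetical helper written purely for illustration - it is not the TeachingPopGen HillD() code, and its starting value (and hence its early iterates) may differ from what HillD() reports below, although both should converge on essentially the same value of D.

em.D <- function(counts, n.iter = 20) {      # counts: 3x3 matrix, rows AA/Aa/aa, cols BB/Bb/bb
  N <- sum(counts)                                           # number of individuals
  p <- (2 * sum(counts[1, ]) + sum(counts[2, ])) / (2 * N)   # f(A)
  r <- (2 * sum(counts[, 1]) + sum(counts[, 2])) / (2 * N)   # f(B)
  known.AB <- 2 * counts[1, 1] + counts[1, 2] + counts[2, 1] # directly countable AB gametes
  f.AB <- p * r                                              # starting value (assumes D = 0)
  for (i in 1:n.iter) {
    # expected fraction of double heterozygotes that are (AB)/(ab) rather than (Ab)/(aB)
    num <- f.AB * (1 - p - r + f.AB)
    den <- num + (p - f.AB) * (r - f.AB)
    # re-estimate f(AB): counted gametes plus the share contributed by AaBb individuals
    f.AB <- (known.AB + counts[2, 2] * num / den) / (2 * N)
  }
  f.AB - p * r                                               # D = f(AB) - f(A)f(B)
}
em.D(gmat)   # should converge to roughly 0.07 for the Hill data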

So let’s see how this works with these data (HillD is a function in TeachingPopGen that does the math). Below are the results of 20 iterations, shown graphically and numerically.

HillD(gmat)

##    Iteration          D
## 1          1 0.07082000
## 2          2 0.07034666
## 3          3 0.07014933
## 4          4 0.07006685
## 5          5 0.07003235
## 6          6 0.07001791
## 7          7 0.07001186
## 8          8 0.07000933
## 9          9 0.07000827
## 10        10 0.07000783
## 11        11 0.07000765
## 12        12 0.07000757
## 13        13 0.07000754
## 14        14 0.07000752
## 15        15 0.07000752
## 16        16 0.07000751
## 17        17 0.07000751
## 18        18 0.07000751
## 19        19 0.07000751
## 20        20 0.07000751

It is quite evident from the plot produced that an equilibrium value of D is quickly reached - in this case 0.07. But is this correct? To test the concept, we’ll do a couple of simulations:

Independent Assortment

First, let's generate a simulated data set with two loci that assort independently, so that D = 0. We will set the value of p (f(A)) at .2 and that of r (f(B)) at .6.

A <- rbinom(1000,2,.2)
B <-rbinom(1000,2,.6)
genos <-table(A,B)
genos
##    B
## A     0   1   2
##   0  96 309 227
##   1  50 154 123
##   2  10  17  14

Now we can run HillD to see what we get:

d1 <-HillD(genos)

print(paste("Final Estimate of D = ",signif(d1[20,2],4)))
## [1] "Final Estimate of D =  -0.002008"

And we see that D converges on a value close to zero, as expected.

If D is not Zero

We also need to test whether, given a dataset that was generated with some nonzero value of D, the Hill approach returns a correct value. To do this, we need to do the following:

  1. Calculate the gamete frequencies based on some prespecified values of allele frequencies and D.
  2. Use those gamete frequencies to generate expected genotype frequencies and numbers.
  3. Use expectation maximization to calculate D.

Step 2 above is actually rather laborious - there are 10 possible (unordered) combinations of gametes, and the expected frequency of each (assuming random mating) has to be calculated individually. We will do it with another function in TeachingPopGen, which accepts as input the allele frequencies p and r, as well as the prespecified value of D:
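
To give a sense of what that step-2 calculation involves, here is a rough sketch. make.genos is a hypothetical stand-in written for illustration (it is not the TeachingPopGen genotype.generate() code), but with the inputs used below it yields the same counts.

make.genos <- function(p, r, D, N = 1000) {
  # gamete frequencies implied by the allele frequencies and D
  g <- c(AB = p * r + D,
         Ab = p * (1 - r) - D,
         aB = (1 - p) * r - D,
         ab = (1 - p) * (1 - r) + D)
  copies.A <- c(1, 1, 0, 0)    # copies of A carried by each gamete type
  copies.B <- c(1, 0, 1, 0)    # copies of B carried by each gamete type
  geno <- matrix(0, 3, 3)      # rows/cols indexed by copies of A and B (0, 1, or 2)
  for (i in 1:4) for (j in 1:4) {                 # random union of gametes
    a <- copies.A[i] + copies.A[j] + 1
    b <- copies.B[i] + copies.B[j] + 1
    geno[a, b] <- geno[a, b] + g[i] * g[j]
  }
  geno <- geno[3:1, 3:1]       # reorder so that AA/BB is the top-left cell
  dimnames(geno) <- list(c("AA", "Aa", "aa"), c("BB", "Bb", "bb"))
  round(N * geno)              # expected genotype numbers in a sample of N individuals
}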

genos <-genotype.generate(.4,.7,.1)
genos
##     BB  Bb bb
## AA 144  15  0
## Aa 243 226 11
## aa 102 179 78

And we can do our Hill calculation as before:

d1 <-HillD(genos)

print(paste("Final Estimate of D = ",signif(d1[20,2],4)))
## [1] "Final Estimate of D =  0.1006"

And again we see rapid convergence to a value quite close to the actual value.

Conclusion

This is one of several means of estimating D based on real data. In fact, the issue of “phasing” - that is, determining linkage relationships in heterozygotes - is one that has become quite sophisticated. Fortunately for us, we can use HapMap data, in which these methods have been applied, and work from gametes that were inferred from genotype data with a high degree of confidence.