The starting point for in population genetics is the Hardy Weinberg equilibrium. It is the basis on which everything is built, so a thorough understanding of what it says and what it assumes is critical. In this unit, we will briefly go through the basics of it ( the familiar p and q approach). We will then examine it in more detail.
So let’s approach this as a problem in data analysis. The following are the data that were part of the opening day “quiz” - frequencies of Single Nucleotide Polymorphism (SNP) number rs3739020 from the 1000 Genomes Project.
Genotype | Number |
---|---|
G G | 6 |
G T | 28 |
T T | 66 |
We want to ask two questions:
So for question 1 let’s proceed as follows. First of all, we will create a vector with of the observed data.
obs <- c(6,28,66)
Note that c() is a function that concatenates a number of values, in this case integers, in order to create a vector.
Now, let’s ask how many G’s there are. Recognizing that every GG homozygote has two and every GT heterozygote has one, we can calculate that by
nA <-2*obs[1]+obs[2]
nA
## [1] 40
And to make that a frequency, we need to divide it by the total number of alleles, which, since these are diploid individuals, is simply two times the number of individuals genotyped (N).
N <-sum(obs)
p <-nA/(2*N)
p
## [1] 0.2
Now, we could repeat that process to determine the frequency of T, but here one of the simplest but most important relationships comes in to play. Note that this is a biallelic locus, so if a particular allele is not G, it has to be T. Thus, if we set the frequency of T to be q, then p+q must equal 1, or
q <-1-p
q
## [1] 0.8
Now, suppose we turn the question around. Given p and q, as calculated above, and assuming that random mating then occurs, what genotype numbers frequencies would we expect to see? There are two ways we can approach this, one fairly simple and one a bit more involved. First, the simple one. Think of the population as orginating from a pool of alleles with frequencies p and q. If that is the case, then we can calculate the expected probabilities of the three genotypes (GG, GT, and TT) as
p{GG} = p{G}p{G} = p2
p{TT} = p{T}p{T} = q2
p{GT} =p{GT} or p{TG}
= P{G} * P{T} + P{T} *P{G}
= pq + qp
= 2pq
We can can make a vector of those frequencies as follows
fexp <-c(p^2,2*p*q,q^2)
fexp
## [1] 0.04 0.32 0.64
And finally, given those frequencies, what numbers of genotypes would we expect to see in a sample size of N? Here, R makes it very easy to multiply the frequency vector by N
Nexp <-fexp*N # multiplies each element of fexp by N
as.integer(Nexp) # get rid of decimal positions
## [1] 4 32 64
Contrast that with our observed numbers
obs
## [1] 6 28 66
And they look pretty close. Of course we need to test that statistically (which we will in a bit), but before we do so, let’s see what’s happened to p and q as a result of random mating:
p1 <-(2*Nexp[1]+Nexp[2])/(2*N)
q1 <- 1-p1
p1;q1
## [1] 0.2
## [1] 0.8
And again, contrast those values with what we determined from the original data:
p;q
## [1] 0.2
## [1] 0.8
And note. p and q have not changed.
Below are some (completely made-up) data. Use them to
Genotype | Number |
---|---|
AA | 69 |
Aa | 101 |
aa | 30 |
obs <-c(69,101,30)
p <-(2*obs[1]+obs[2])/(2*sum(obs))
p
## [1] 0.5975
q <-1-p
genos <-c(p^2,2*p*q,q^2)
genos
## [1] 0.3570063 0.4809875 0.1620062
gebis2 <-genos*sum(obs)
So what have we shown? We can summarize as follows:
- Infinite size
- Random mating
- No Mutation
- No Migration
- No selection
In other words, A population that is in Hardy Weinberg Equilibrium is not evolving. Thus, it becomes the null hypothesis from which we work; hence a deep understanding of its properties is essential if we are to thoroughly understand the genetics of the evolutionary process.
In the above, we simply treated mating as involving every individual contributing equally to a pool of gametes and then those gametes being drawn two at a time to create zygotes. In fact, of course, more typically, individuals are mating with one another and it is their gametes that are combining. Felsenstein (2016 version, pg.6) Works this up by first, showing the expected frequencies of all possible random matings:
From this, the following can be calculated:
And inspection of the right side should show you that the expected genotype frequencies are indeed p2, 2pq, and q2.
I personally like the above approach, in that it starts with the rawest of raw data - a count of genotypes. However, at times problems are formulated in terms of genotype frequencies rather than numbers. In those cases, the typical formula given is
p = f(AA) + 1/2(f(Aa))
But let’s look at this a little more closely - Suppose we multiply it by 2N/2N. Then we get
P =(2Nf(AA))+2N(1/2(f(Aa)))/2N
But what is this? N * f(AA) is simply the number of AA genotypes counted, and N *F(Aa) is the number of Aa, so this whole equation can be rewritten as
p=(2 * #(AA) + # (Aa))/2N
Which is exactly the same formula as above.
So to wrap it up, when you start a Hardy-Weinberg problem, the first question to ask is whether you are given genotype Numbers or frequencies and go from there. What you do not want to do is to treat the numbers as frequencies or vice versa.
So at this point, we’ve seen how we can calculate allele frequencies, and we’ve also seen that in an ideal population, those frequencies do not change - the population does not evolve. And it is worth it at this point to briefly summarize the properties of an ideal population:
Much of the rest of our time will be focused on understanding the consequences departures from one or more of these conditions - how do we detect them and what are their consequences? Before that, however, we need to delve into the Hardy Weinberg Equilibrium a bit more deeply.
This is a subject we will be exploring in more depth in a later unit, but for starters, let’s keep it simple. Assume we can observe some genotypes:
Genotype | N |
---|---|
AA | 56 |
Aa | 78 |
aa | 22 |
We can calculate the frequencies of the two alleles as we did before:
obs <-c(56,78,22)
N <-sum(obs)
p <-(2*obs[1]+obs[2])/(2*N)
q <- 1-p
p; q
## [1] 0.6089744
## [1] 0.3910256
And from those, we can calculate our expected genotype numbers
exp <-N*c(p^2,2*p*q,q^2)
exp
## [1] 57.85256 74.29487 23.85256
So now we have observed and expected numbers; it would seem a logical step to apply the chi-square test to see if the difference between the two is significant. For more details, see the unit on the chi-squared distribution; at this point, remember that what we are doing is quantitating the departure from the expectations of an hypothesis for categorical data. In our case, the categories are genotypes; the hypothesis is that the population is in Hardy-Weinberg equilibrium, and the data are the observed genotype numbers, which are compared to those expected based on HWE.
We’ve done the calculations above, so χ2 can be calculated easily by
chi <-sum((obs-exp)^2/exp)
chi
## [1] 0.3879836
But how many degrees of freedom? Our rule of thumb previously was that degrees of freedom are one less than the number of classes, suggesting that it might be two in this case. However there is another wrinkle. In this case we used the actual data to estimate a parameter used in determining our expected values - p. That costs us another degree of freedom. So in the case of testing Hardy Weinberg with data from a biallelic locus, we only have one degree of freedom.
So what is the probability that we would observe this much deviation based on chance alone (in which case we would accept our hypothesis)? One way to do that is what we did with coin flips - generate a bunch of χ2 distributed random variables and see where the 5% cutoff falls. In this case, we will do a one tail test, since the range of our statistic is from zero to infinity (with zero the value obtained if the observed data exactly match the expectation)
set.seed(123)
ch <- rchisq(100000,1)
colhist(ch,tail=1, xr=c(0,20),xlab="chi-square",ylab="Number",main="Are the Data Consistent with HWE?")
abline(v=chi,col="blue")
quantile(ch,.95)
## 95%
## 3.851331
And we see that the data in fact fall within the range expected be observed as a result of chance.
We can also calculate the probability of this particular χ2 value as follows:
pr <-qchisq(chi,.95)
pr
## [1] 0.2277098
And we see that the probability of observing this much or more deviation from Hardy-Weinberg expectations is 22%, far higher than our normal standard of 5% for rejecting the null hypothesis. So we can conclude in this case that the hypothesis cannot be rejected.
When you’ve done χ2 testing in the past, you’ve probably used a table that lists degrees of freedom and cutoff values that have been determined analytically. For example, for one degree of freedom, that value is 3.84; we see that it is very similar to the number we obtained from our simulation above. A small version of the table is in the unit on the χ2 distribution; it is also shown below:
df <-c(1:10)
chicrit.05 <-sapply(df, function(x) qchisq(.95,x))
chicrit.01 <-sapply(df,function (x) qchisq(.99,x))
kable(data.frame(df,p.05 = chicrit.05,p.01=chicrit.01))
df | p.05 | p.01 |
---|---|---|
1 | 3.841459 | 6.634897 |
2 | 5.991465 | 9.210340 |
3 | 7.814728 | 11.344867 |
4 | 9.487729 | 13.276704 |
5 | 11.070498 | 15.086272 |
6 | 12.591587 | 16.811894 |
7 | 14.067140 | 18.475307 |
8 | 15.507313 | 20.090235 |
9 | 16.918978 | 21.665994 |
10 | 18.307038 | 23.209251 |
the hw() function in TeachingPopGen will accomplish much of what we have done so far. Given some set of single locus, biallelic genotypes, it will return all the basic information. For example
dat.obs <-c(55,75,12) #observed numbers of AA, Aa, and aa
dat.hw <-hw(dat.obs)
## [1] p= 0.651408450704225 q= 0.348591549295775
## obs exp
## [1,] 55 60
## [2,] 75 64
## [3,] 12 17
## [1] chi squared = 3.772 p = 0.052 with 1 d. f.
## [1] F = -0.163
Note that it will print the results to the console; it also returns a list, in this case to the variable dat.hw:
str(dat.hw)
## List of 6
## $ Observed : num [1:3] 55 75 12
## $ Expected : num [1:3] 60.3 64.5 17.3
## $ Allele_freqs: num [1:2] 0.651 0.349
## $ Chisq : num 3.77
## $ prob : num 0.0521
## $ F : num -0.163
So by now we’ve looked at the chi-squared approach to testing whether observed genotype numbers are consistent with the prediction of Hardy-Weinberg. There is, however, another way we can make this comparison - one that is less statistical, but which (as we shall see) is very relevant biologically.
We will use the following terms extensively in what follows:
As is the case with any such measure, we need to have expectations. These are pretty straightforward, especially:
E(heterozygosity)=1-E(homozygosity) = 1- ∑pi2
Now that we have our expectations, we can turn to some data. These come from real data and are a staple of chapter-end problems in Genetics and Evolution texts. The organisms are brown bears, the phenotypes are brown (dominant) vs. white(recessive) and the genotypes of a single base in the melanocortin receptor gene which is responsible for the difference are as follows:
Genotype | Phenotype | N |
---|---|---|
AA | Brown | 42 |
AG | Brown | 24 |
GG | White | 21 |
So we can do some basic calculations
genos <-c(AA=42,AG=24,GG=21)
genos
## AA AG GG
## 42 24 21
And we can do our basic Hardy-Weinberg calculations as we have before
hw.genos <-hw(genos)
## [1] p= 0.620689655172414 q= 0.379310344827586
## obs exp
## AA 42 34
## AG 24 41
## GG 21 13
## [1] chi squared = 14.922 p = 0 with 1 d. f.
## [1] F = 0.4141
And we see that this is not an ideal population. In fact, by inspection, we see that the observed number of heterozygotes, 24, is less than the expected one, 40.9655172. But we want to make this more quantitative, comparing the observed and expected homozygosity and heterozygosity. But since these must add up to one, and for reasons that will become clear down the road, we’ll focus on heterozygosity. Furthermore, from here on out, we’ll work with frequencies rather than numbers (although the latter would work equally well).
So first, what are the observed frequencies?
genos.f.obs <-genos/sum(genos)
genos.f.obs
## AA AG GG
## 0.4827586 0.2758621 0.2413793
The observed heterozygosity is thus 0.276; to make the notation easier, we’ll assign that a name
Hobs <-genos.f.obs[2]
Now, what about the expected? Remembering that that is 2pq, we calculate it as
allele.freqs <-hw.genos$Allele_freqs
Hexp <-2*allele.freqs[1]*allele.freqs[2]
unname(Hexp)
## [1] 0.470868
Note that an alternate way of calculating this would simply be
homo.exp <-(allele.freqs)^2
homo.exp <-sum(homo.exp)
unname(1-homo.exp)
## [1] 0.470868
Which gives us the same value.
So what we want to do now is to come up with a numerical value to compare observed and expected heterozygosity. Obviously, if they are equal (that is, we have a Hardy-Weinberg population), then the ratio will be 1. Similarly, if there are no heterozygotes observed, then the ratio would be zero. Thus, by subtracting the ratio from 1, we get a statistic that increases from zero to one, proportionate with increasing departure from HW. Thus for our bear case here, we find that
f.bears <-1-Hobs/Hexp
unname(f.bears)
## [1] 0.4141414
This statistic has a variety of names - Wright’s F Statistic and the Fixation Index are two common ones. We won’t say more about it at this point, however we will be using it extensively in the future. But the key points to realize are
So what is a significant value of F? At this point, we will leave that question open, but we will come back to it.
Finally, as an aside, the explanation for the bear data is that bears preferentially mate with ones of their own color. Thus, brown X white matings occur less frequently than they would in a random mating population, thus reducing the frequency of heterozygotes.
Until now, we’eve assumed we can identify all genoytpes and then count alleles, divide by the total and have estimate of p. However, what if we can’t? For example, consider a typical general genetics problem - the frequency of cystic fibrosis in Caucasians is 1/3000; what is frequency of allele? One approach (the only one in this case) - if we assume HW, then
freq <-1/3000
q <-sqrt(freq)
q
## [1] 0.01825742
But note that we have assumed that the genotypes are in Hardy-Weinberg proportion by estimating q as the square root of the frequency of recessive homozygotes. Hence, no statistical test for departure from HW is possible.
Much of what we will work with this semester will involve biallelic data (for example, most SNPs are polymorphisms of two bases - sites with three are quite rare). However, there will be some cases in which there are multiple alleles - in fact this is the norm for microsatellite loci. In those cases, calculating allele frequencies is a fairly straightforward extension of the biallelic one; probably the easiest way is to work from genotype frequencies, in which case, for the frequency of allele i
pi = f(homozygotes)+ 1/2∑ f(heterozygotes)
And also, by extension, from an observed set of genotypes, we can estimate n allele frequencies and then estimate expected genotype frequencies and numbers, noting that if
f(A1) = p
f(A2) = q
f(A3) = r
then
f(A1A1)exp = p2
f(A1A2)exp = 2pq
f(A1A3)exp = 2pr
etc. for six total genotypes
And of course, we can again do a chi square test, with degrees of freedom; since we estimate both p and q from the data, we lose a total of three degrees of freedom (one for each parameter estimated and one for the number of categories) giving us a total of 6-3=3.
But be forewarned, however - as i increases, the number of possible genotypes increases, and thus the expected numbers of particular genotypes can become very small (approaching zero in the case of homozygotes for low frequency alleles,) so χ2 testing becomes problematic. In fact the number of genotypes can be calculated as follows.
Given k alleles, there are obviously k possible homozygotes. The number of possible heterozygotes is (k(k-1))/2. So we can plot this as
k <-1:8
ngenos <-k+(k*(k-1))/2
plot(k,ngenos, xlab="number of alleles", ylab="number of genotypes")
This may seem a bit nitpicky, but it is a nice illustration of the principle of likelihood maximization. Consider the following biallelic data
genos <-c(11,41,53)
Using a TPG function, we can calculate all of the above
hw.out <-hw(genos)
## [1] p= 0.3 q= 0.7
## obs exp
## [1,] 11 9
## [2,] 41 44
## [3,] 53 51
## [1] chi squared = 0.519 p = 0.471 with 1 d. f.
## [1] F = 0.0703
And we see that, based on our estimate of p, we cannot reject Hardy-Weinberg. But is this the best estmator? We could consider a couple of other possibilities
sqrt(genos[1]/sum(genos))
## [1] 0.3236694
Or one minus the square root of the number of aa
1-sqrt(genos[3]/sum(genos))
## [1] 0.289534
So we have three estimates - .3, .32, and ,29. Which is the best? Intuitively, since we used the most information to get .3, so it seems best, but is it?
R. A. Fisher investigated this problem; Felsenstein provides the necessary mathematical information. In essence, what we can do is the following:
maxp(genos)
## [1] 0.3 0.3
And we see that, in fact, the value of P that has the maximum likelihood associated with it is .3, the same as what we calculated initially.
Again, this may seem like a somewhat meaningless exercise, but it does introduce the concept of maximum likelihood estimation, something that can be extremely useful in more complex situations. And, as we did in Bayesian analysis, we are using the data to actually estimate p, not ask (as the frequentist does) whether the deviation from a previously determined paramter is likely to be due to chance
We are all familiar with the ABO blood type system. There are typically three alleles, A, B and i. A and B are codominant, so that an AB heterozygote has the AB blood type. However, i is recessive to both A and B so, for example, an Ai individual has type A blood, as does an AA one. These relationships are summarized as follows:
Genotypes | Blood Type | Number | Phenotype Frequency |
---|---|---|---|
AA, AO | A | \(n_{A}\) | \(p^2_{A}+2p_{A}p_{O}\) |
BB, BO | B | \(n_{B}\) | \(p^2_{B}+2p_{B}p_{O}\) |
AB | AB | \(n_{AB}\) | \(2p_{A}p_{B}\) |
OO | O | \(n_{0}\) | \(p^2_{O}\) |
And using our multiple allele logic, we should be able to calculate the frequency of the A allele as
\(P_{A} = \frac{2n_{AA}+n_{AO}+n_{AB}}{2n}\)
But there’s a problem - we can’t distinguish between AA and AO individuals, so how do we plug in the numbers?
Again, Felsenstein (2015) suggests an approach (see also Weir (1996)) known as “expectation maximization” to get at this problem. Look again at the equation above. Our problem is that, while we can get n and nAB directly from the data, AA and AO individuals both have type A blood. The similar problem would exist for the frequency of B as well.
Felsenstein(page 31 and 32) shows the three values we wish to estimate, p A , p B and p i , can be expressed as follows:
\(P_{A}=\frac{2(\frac{p_{A}+p_{O}}{p_{A}+2p_{O}})n_{A}+n_{AB}}{2n}\)
\(P_{B}=\frac{2(\frac{p_{B}+p_{O}}{p_{B}+2p_{O}})n_{A}+n_{AB}}{2n}\)
\(P_{O}=\frac{\frac{p_{O}}{p_{A}+2p_{O}}n_{A}+\frac{p_{O}}{p_{B}+2p_{O}}n_{B}+n_{O}}{n}\)
We also know that the three frequencies must sum to one.
But wait a minute! In each case, calculation of the frequency is dependent on knowing that frequency, that is, the quantity we are trying to estimate is necessary in order to make the estimate. So how to proceed?
Look again at these equations. If we plug some set of frequencies into the right sides, we will get some values on the left - no problem. And if we happened to have guessed the right ones, we would be done. But of course that is almost never going to occur, rather we’ll get a new set of p’s. We could then use those as our new values on the right and repeat the process. Eventually (we hope) we would reach some equilibrium value, in which the numbers we calculate on the left are the same as the ones we plug in on the right.
So we can work through an example. Suppose we have the following data from and actual sample
ABO <-c(212,103,39,148) # vector of counts of A, B, AB and O
We’ve written a simple function to carry out the expectation maximization process, starting with an initial estimate that all three frequencies are .3 We can apply that function to these data:
p <-p.em(ABO)
p
## [1] 0.2944973 0.1540032 0.5514996
These plots show the first 10 iterations of the estimation maximization process, and we see that the estimates of the three allele frequencies quickly stabilize at the equilibrium values of .29,.12, and .55. We can now calculate expected genotype numbers, assuming Hardy Weinberg holds
OExp <- p[3]^2
ABExp <-2*(p[1]*p[2])
Aexp <-p[1]^2+2*p[1]*p[3]
Bexp <-p[2]^2+2*p[2]*p[3]
exp <-c(Aexp,Bexp,ABExp,OExp)*sum(ABO)
exp
## [1] 206.60255 97.17833 45.53493 152.68419
And we can then use our function for a quick chi square with 1 df (remembering that 2 parameters are estimated and one df lost for the data)
chixw(ABO,exp)
## chi.sq prob d.f.
## 1 1.57 0.21 1
And we see that in fact we have done a pretty good job of explaining the data.
So what have we done? As Felsenstein points out, at first glance, it may seem to be an “exercise in ad-hockery”. However, he then goes on to point out that what we have done is to find the maximum likelihood estimates of the allele frequencies, and we have done so in a way that is relatively quick and insensitive to the starting values used.
Think about relationship between p and heterozygosity (fraction of individuals that are heterozygous). At extremes - p= 0 and p=1, heterozygosity is obviously zero, but what what happens in between? A very important point to recognize is that while the frequency of heterozygotes is a function of p, the frequency of homozygotes is a function of p^2. What that means qualitatively is that as p gets small, p^2 gets very small, meaning that, in a finite sample, it is quite possible that there will be no homozygotes for the minor allele present.
In diploids particular allele (call it a) can exist in one of two states - either in a homozygote or a heterozygote. If its frequency is q, and the size of the population is N, then there are 2N copies of the allele, and the number that are in homozygotes is simply 2Np2. Thus, we can determine the fraction of a alleles that occur in homozygotes to be
2Nq2/2Nq = q
And the fraction that is in heterozygotes is thus
1-q = p
Felsenstein describes this relationship particularly well (pg. 9).
And that leads us to the following critical prediction: Most rare alleles are found in heterozygotes, while most common ones are found in heterozygotes. We can examine this graphically by plotting q (from 0 to .99) against q/p (the ratio of the frequency in heterozygotes to that in homozygotes) as follows:
q <-seq(0,.9,.001)
p <-1-q
fhet <-q/p
plot(q, fhet,type="l", ylab="heterozygote/homozygote", main="Heterozygote/Homozygote as a Function of p")
And we see that as q increases, the ratio increases exponentially. This becomes even more extreme as we approach 1:
q <-seq(.9,1,.001)
p <-1-q
fhet <-q/p
plot(q, fhet,type="l", ylab="heterozygote/homozygote", main="Heterozygote/Homozygote as a Function of p")
In fact, in this range, we can use the approximation of q ≈ 1, so that the ratio then ≈ 1/p:
plot(q, fhet,type="l", ylab="heterozygote/homozygote", main="Heterozygote/Homozygote as a Function of p")
lines(q,1/p,col="red", type="l")
And we see that indeed the fit is quite close
Now we want to examine expected heterozygosity for the entire range of possible values for p (0 to 1). We can explore this graphically, using what is called a DeFinetti diagram (or sometimes a ternary plot). It is best to work through a few by example. Fortunately, the R package HardyWeinberg has excellent capabilities in this regard.
library(HardyWeinberg)
We will start by generating a plot that shows the theoretical relationship between p and heterozygosity. First, however, we need to generate simulated multiple locus data; for example 100 genes in a sample of 100
dat <-(HWData(100,100))
This returns a matrix numbers, consisting of one row per locus and one column per genotype. We can look at a little of it as follows :
head(dat)
## AA AB BB
## [1,] 62 36 2
## [2,] 8 37 55
## [3,] 0 30 70
## [4,] 91 8 1
## [5,] 23 50 27
## [6,] 77 22 1
And its structure is pretty clear. We will come back to using these numbers in a minute, but first, we’d like to visualize the theoretical relationship between p and heterozygosity. We can do so as follows:
HWTernaryPlot(dat, hwcurve=TRUE,addmarkers=FALSE,region=0,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Theoretical Relationship betweeen p and H",cex.main=1.2)
abline(h=.5,lty=2)
The parabola in this curve indicates the expected heterozygosity given the value of p plotted on the x axis. What we see is that heterozygosity is maximized at 0.5 when p equals .5 (indicated by the dashed line).
Now we can add our simulated data to the plot. Remember that we generated 100 simulated data points (genotype distributions for 100 alleles in a sample size of 100, all drawn from expected HW distribution). Let’s put them onto the de Finetti diagram
HWTernaryPlot(dat, hwcurve=TRUE,addmarkers=TRUE,region=0,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Adding 100 Loci from N=100",cex.main=1.2)
And we see we have a bunch of scatter (although the theoretical curve is followed pretty well.) But what falls within chi square acceptance value?
options(warn=-1)
HWTernaryPlot(dat, hwcurve=TRUE,addmarkers=TRUE,region=1,vbounds=FALSE,axis=2,vertexlab=c("0","","1"))
Remember, testing at the 5 % level, so not surprising that a few will fall outside of acceptance region. But the key point to remember is that the data we have simulated here were 100 loci from 100 individuals in a Hardy Weinberg Population. In our next exercise, we will take this one step further and look at some real data.
library(HardyWeinberg)
It’s time for some real data. We’ve seen that we can test Hardy-Weinberg, and that in simulated data from finite populations, we do see (not unexpectly) some departures from our expectation due simply to chance. But the question is, if we were to look at real data, what would we see?
To do this, we will work from an old human variation project called HapMap (it has largely been superceded by the 1000 Genomes Project). These data consist of single nucleotide polymorphism genotypes of individuals from a variety of globally distributed populations. For the most part, SNPs are biallelic data (each one has a major allele and a minor allele). For our exercise, we will use data for about 48,000,000 base pairs (from position 42505066 to 90394354) from chromosome 6 from the Yoruba (YRI) population, a sample of 90 individuals.
For demonstration purposes, a sample of the raw data from hapmap (400 SNP’s out of a total of nearly 43000) are in TeachingPopGen, so we can access it by the following:
data(ch6.yri)
This returns the raw data (ch6.raw) for the region in question; we will now use a built-in function to clean the data. In doing so, we will eliminate all of those SNPs which have fewer than 5 individuals in any genotypic class.
ch6.smp <-datprep(ch6.yri,min=5)
load("../../QPopgen/Data2015/HwGenomeData") # to run in class
We can look at the data we have briefly as follows:
head(ch6.sub)
## AA Aa aa
## rs260249 4 20 35
## rs9381174 97 15 1
## rs260251 0 19 41
## rs260253 36 55 22
## rs260254 16 30 14
## rs260255 0 37 76
print(paste("Number of Loci = ",nrow(ch6.sub),sep=""),quote=FALSE)
## [1] Number of Loci = 42826
And we see that it is simply a data frame, containing bilocus genotype numbers from the population for each of the SNP’s that met our criteria.
Now, we want to determine how many of these are in Hardy Weinberg Equilibrium. We will do this in two ways.
We saw in the last unit that we can visualize departures from Hardy-Weinberg on a ternary plot. We can do so first on a sample of these data (number of loci=200:
## This is code run in class only (on the complete data set)
smp <-sample(1:nrow(ch6.sub),200)
ch6.samp <-ch6.sub[smp,]
trn <- HWTernaryPlot(ch6.samp, hwcurve=TRUE,addmarkers=TRUE,region=1,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Chromosome 6 - 200 Random SNPS",cex.main=1.2)
and then on the complete set:
trn <- HWTernaryPlot(ch6.sub, hwcurve=TRUE,addmarkers=TRUE,region=1,vbounds=FALSE,axis=2,vertexlab=c("0","","1"),axislab="P",main="Chromosome 6 - 100 Random SNPS",cex.main=1.2)
And we get a pattern very similar to what we did based on the simulated data. And by looking at the structure of the data returned:
str(trn)
## List of 5
## $ minp : num 0.21
## $ maxp : num 0.79
## $ inrange : int 22460
## $ percinrange: num 52.4
## $ nsignif : int 2476
we see that there are in fact 2476 SNPS that differ significantly from χ2 expectations.
Next, let’s look at the distribution of chi-squared values for the entire data set. We get into some slightly sophisticated R functions here, but the nice thing is that takes only a few steps:
options(warn=-1)
hwtest <-apply(ch6.sub,1,hw,print=FALSE) #use the hw function in TeachingPopGen to test each snp
Hw.ch<-sapply(hwtest,function(x) x$Chisq) # Create a vector of the calculated chi square values
q <-quantile(Hw.ch,.95) # determine the upper quantile
colhist(Hw.ch,tail=1,xlab="chi-squared",ylab="Number",main="Chi Square Distribution, ch. 6")
So we see that indeed, some of the SNP’s do show departure from Hardy Weinberg. How many? For this, we would most likely be using the “book value” for the 5% cutoff with 1 degree of freedom, or 3.84
dep <-length(Hw.ch[Hw.ch>3.84])
dep
## [1] 2478
And as a fraction of the total
f.sig <-dep/nrow(ch6.sub)
f.sig
## [1] 0.05786205
So of the 42826 SNPs, examined, 2478 indeed do show departure from Hardy Weinberg expectations. This is actually similar to the 5% we would expect by chance.
So is there anything significant about any of those snps in the “red zone”? In bioinformatics, a more stringent probability test is usually applied for cases of multiple simulatneous tests like this - that is to divide the normal cutoff percent (0.05) by the number of comparisons being made. In this case, that would be
p.cut <- .05/nrow(ch6.sub)
p.cut
## [1] 1.167515e-06
And we can do a little bit of manipulation of our hwtest results to determine if we have any such SNPS
hw.p <- sapply(hwtest,function(x) x$prob)
snp.out <-which(hw.p<p.cut)
length(snp.out)
## [1] 18
Before addressing that, we can also look at the mean and distribution of Wright’s F statistic
Hw.f <-sapply(hwtest,function(x) x$F)
colhist(Hw.f)
f.mean <-mean(Hw.f)
abline(v=f.mean, col="red")
f.range <-range(Hw.f)
f.mean
## [1] 0.00387696
f.range
## [1] -0.4520548 0.6491228
So we see that the mean of f is very close to zero, although there are outliers that might be worth exploring further.
So what if we want to look at the distribution of chi-squared or F values based on chromosomal position? Doing so might show regions of the chromosome that show systematic departure from Hardy-Weinberg expectations and thus might be targets of further investigation. Those positions were in our original data set that we loaded (ch6.yri), so we can extract them and plot them for our selected data as follows:
hw.smp <- apply(ch6.smp, 1,hw,print=FALSE)
chi.smp <-sapply(hw.smp, function (x) x$Chisq)
f.smp <-sapply(hw.smp, function (x) x$F)
snps <-rownames(ch6.smp)
pos <-ch6.yri$V3[rownames(ch6.yri) %in% snps]
par(mfrow=c(2,1))
plot(pos/1000000,chi.smp,xlab="Chromosomal Position",ylab="chi-squared",pch=20,main="Chi-squared")
abline(h=3.84,lty=2,col="blue")
plot(pos/1000000,f.smp,pch=20,xlab="Chromosomal Position (MB)",ylab="F",main="F")
abline(h=0,col="red")
abline(h=c(-.2,.2),lty=2,col="blue")
par(mfrow=c(1,1))
What constitutes a “significant” departure from zero? That is not an easy question to address at this point, but we’ll somewhat arbitrarily consider any value greater than &plusnm; 0.2 to be significant. For chi-squared, we’ll use the standard p <0.05 significance level for one degree of freedom (3.84). These values are indicated on the two plots with dashed blue lines. There are a few bits of information we can glean
There are a few reasons for undertaking this exercise.
Felsenstein, Joseph. 2015. “Theoretical Population Genetics.” Online. http://evolution.gs.washington.edu/pgbook/pgbook.pdf.
Weir, B S. 1996. Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Genetic Data Analysis. Sinauer Associates. https://books.google.com/books?id=e9QPAQAAMAAJ.