What is it?

Up to this point, we have by and large been treating different genes as independent entities, either by looking at them one at a time, or when looking at multiple genes, assuming that (as Mendel proposed in the “law” of independent assortment) they behave independently of one another. But, of course, we know that genes occur on chromosomes, and that the closer two loci are to one another, the more likely they are to cosegregate in meiosis. So the question then becomes how this impacts the distribution of allelic variation in populations. In this unit we will look at this question theoretically; we will subsequently address how we can infer the effects of linkage from real data.

Allele and gamete frequencies - What is the relationship?

To begin with, we will consider two biallelic loci, A and B. The frequencies of all alleles (A1,A2; B1, B2) - p, q, r and s - will be set to 0.5. What we want to do is to ask what will happen as a result of random mating under two extreme scenarios:

  1. The genes assort independently (meaning that they are either on separate chromosomes or on the same chromosome separated by more than 50 cM)
  2. The genes are completely linked - that is, no recombination occurs between them.

Now, let’s suppose we start with an artificial population in which all individuals are of the genotype A1 B1/A2 B2. Put another way there are only two haplotypes present in the population, and all individuals are heterozygous for them. We let them undergo random mating; what will we expect to see?

  1. If they assort independently, then the expected frequencies of the four possible gametes will simply be the products of the frequencies of the alleles that comprise that gamete. Thus for example, the expected frequency of A1B1 gametes would be pr, or .5 * .5 = .25
  2. If no recombination occurs, then we would expect the two original types of gametes - A1B1 and A2B2 - to occur in equal frequency, but the other two possibilities (A1B2 and A2B1) would be absent.

We can summarize these expectations as follows:

Gamete Scenario 1 Scenario 2
A1 B1 pr = .25 .5
A1 B2 ps = .25 0
A2 B1 qr = .25 0
A2 B2 qs = .25 .5

D - a summary statistic

Now, suppose, instead of having done the controlled cross, we actually were determining haplotype frequencies in a population. Under the null hypothesis of independent assortment, we would expect results like case 1 above to occur - that is, the frequencies of gametes should be simply the products of the frequencies of the alleles making up that gamete. So let’s define a statistic D to be simply the difference between the observed and the expected number of gametes or

\(\large{D = f(A_iB_i)_{obs} - f(A_iB_i)_{exp}}\)

\(\large{= f(A_iB_i) - f(A_i)f(B_i)}\)

Here’s a trivial little R function to do that calculation. It takes as input the frequency of two alleles, one at each locus, and that of the gamete containing those alleles.

Dtriv = function(p,r,obs){ #input two allele frequencies and the observed gamete frequency
  obs-p*r
}

Now let’s use that function to calculate D for our two scenarios above, using the given allele frequencies and the alternative frequencies of the A1B1 gamete. First, case 1:

d1 <-Dtriv(.5,.5,.25)
d1
## [1] 0

And it is zero, as expected. Now for the second case.

d2 <-Dtriv(.5,.5,.5)
d2
## [1] 0.25

And we get a value of linkage disequilibrium of .25, indicating that the observed frequency of the gamete is this much greater than that expected based on independent assortment.

We can do the same calculation for A1B2 under scenario 2 as follows:

d2a <-Dtriv(.5,.5,0)
d2a
## [1] -0.25

And we get the same value with opposite sign, indicating that there are fewer gametes observed than expected.

An intermediate case

Suppose we actually observe something between the two extremes of complete independence and complete linkage, for example

Gamete Frequency
A1 B1 .4
A2 B1 .1
A1 B2 .1
A2 B2 .4

Now we can repeat these two calculations, one for a coupling gamete and one for a repulsion one. Note that the allele frequencies have not changed, only the gamete frequencies

Dc <-Dtriv(.5,.5,.4)
Dr <-Dtriv(.5,.5,.1)
Dc; Dr
## [1] 0.15
## [1] -0.15

And we see that the absolute value of D is now .15, intermediate to what we saw in the two extreme cases.

Finally, given our constraint that all the allele frequencies=.5, we can plot D against all possible observed gamete frequencies 0 to .5:

gf <-seq(0,.5,.01)
Df <-sapply(gf, function(x) Dtriv(.5,.5,x))
plot(gf,Df,type="l", xlab="Observed Gamete Frequency", ylab="D",main="D vs. observed gamete frequency, p=r=.5",col="red")
abline(h=0)

And we see that, in this case, D is a linear function of gamete frequency, ranging from a minimum value of -.25 to a maximum value of .25

When p ≠ r

To this point, we have been assuming that we have equal allele frequencies (.5) at each of the two loci. What if that constraint is relaxed? What happens if we have different allele frequencies? eg, suppose we have the following gametic frequencies:

gf <-c(.3,.5,.1,.1) # frequencies of 11, 12, 21, and 22 gametes

The first thing we need to do is to calculate the allele frequencies, which we can do by adding together gametes that contain a particular allele:

p <- gf[1]+gf[2]
q <-1-p
r <-gf[1]+gf[3]
s <-1-r
p;q;r;s
## [1] 0.8
## [1] 0.2
## [1] 0.4
## [1] 0.6

Now, we can calculate D for the four gametes:

D1 <-gf[1]-p*r
D2 <-gf[2]-p*s
D3 <-gf[3]-q*r
D4 <-gf[4]-q*s
D1; D2; D3; D4
## [1] -0.02
## [1] 0.02
## [1] 0.02
## [1] -0.02

And we see that the values we obtain are much lower. But referring back to the figure above, how do these values relate to the maximum and minimum possible values of D?

Nielsen and Slatkin (page 110-111) have a nice explanation of this. Their argument is summarized briefly below.

First, as we have seen above, the gametes can be described as follows:

\(\large{f_{AB} = f_Af_B+D}\)

\(\large{f_{Ab}=f_Af_b-D}\)

\(\large{f_{aB}=f_af_B-D}\)

\(\large{f_{ab}=f_af_b+D}\)

Second, knowing that gametes cannot have a negative frequency, we can write the following inequalities.

\(\large{0 \leq f_Af_B+D}\)

\(\large{0 \leq f_Af_b-D}\)

\(\large{0 \leq f_af_B-D}\)

\(\large{0 \leq f_af_b+D}\)

And from these, we can do a little rearrangement to come up with the following:

\(\large{+D \leq \min{(f_Af_b,f_af_B)}}\)

\(\large{-D \leq \min{(f_Af_B,f_aF_b)}}\)

We can explore this relationship by writing another simple R function, which will take as inputs the four allele frequencies and return values of Dmx and Dmin

Dmxmn <- function(p,q,r,s){
  Dmax <-min(p*s,q*r)
  Dmin <-min(p*r,q*s)
  c(Dmax, 0-Dmin)
  }

And we can use it for a couple of examples. First, let’s do the case of four equal allele frequencies

Dmxmn(.5,.5,.5,.5)
## [1]  0.25 -0.25

And we get the values that we saw above - -.25 ≤ D≤ .25.

But now let’s use our example where p = .8 and r = .4

Dm2 <- Dmxmn(p,q,r,s)
Dm2
## [1]  0.08 -0.12

and now -.12 <D <.08

This creates a problem. Suppose we observe a value of D = .08. Should we consider it significant? In the second case, we almost certainly should - after all, it is Dmax. But in the first case (p=r=.5) Dmax is .25, so our observed value is something like 1/3 of the maximum possible.

So how do we address this problem?

D’

In the early 1970’s, Lewontin proposed the following measure, based on what we have done so far.

D’ = D/Dmax for D>0 D’ = D/Dmin for D<0

What that does is to normalize D by Dmax and Dmin, so that

0 ≤ D’ ≤ 1

However, there is a problem with this. Consider again our case of p=.8 r=.4. We saw that Dmax and Dmin are

Dm2
## [1]  0.08 -0.12

Now, suppose we observe a value of D=.01. What is D’?

Dp1 <-.01/Dm2[1]
Dp1
## [1] 0.125

Alternatively, suppose the observed value of D is -.01. Now

Dp2 <- -.01/Dm2[2]
Dp2
## [1] 0.08333333

The difference between the two values of D that were observed is .02; the difference between the two values of D’ is

Dp1-Dp2
## [1] 0.04166667

Contrast now that with a similar difference in observed D, only where both values are positive

Dp3 <-.01/Dm2[1]
Dp4 <-.03/Dm2[1]
Dp3; Dp4; Dp4-Dp3
## [1] 0.125
## [1] 0.375
## [1] 0.25

So the same absolute magnitude of difference in observed values of D gives quite different differences in D’. Thus, for statistical analysis, this measure is problematic. However, again per @nielsen2013introduction:

  1. If one gamete is missing from the sample, then D’= 1
  2. In the case of a polymorphic locus (A/a) and a monomorphic one (B/B), if a new mutation occurs, such that there is now one Ab or ab gamete, D’ will also equal 1, since until recombination occurs, one gamete will be missing.

r2

As we saw with D’, one of the problems we have to deal with is that, depending on which gametes (coupling or repulsion) are in excess, the sign of D can differ. However, we consider both to be departures from expectation, so we would like both to be positive. As we did with the chi square, we could square our observed values

Consider the example we did, in which we observed that for p=.8 and r = .4, D= ± .01. If we square either the positive or negative value, we get

Dsq <-D1^2
Dsq
## [1] 4e-04

Now, we need to normalize this to account for allele frequencies. We can do by dividing by the product of the four allele frequencies

prod <-p*q*r*s
r1 <-Dsq/prod
r1
## [1] 0.01041667

The real benefit if this is that, in fact we’ve calculated something that is well know by statisticians - the correlation coefficient (see Neilsen and Slatkin, Box 6.3). This can be converted to a chi squared value by multiplying it by sample size, so, for example, if we’g counted 100 individuals in our experiment, we would do the following

N=100
chi <-r1*N
chi
## [1] 1.041667

With one degree of freedom, the critical value of chi-squared is

qchisq(.95,1)
## [1] 3.841459

so we would conclude that the departure from linkage equilibrium is not significant.

And one final note - While certainly r2 is more tractable than D’ with respect to significance testing, it still suffers from the “compression problem” = that is, unless p = r = .01 the maximum value of r is less than one, and it differs depending on whether D is positive or negative. To illustrate this, consider the Dmax and Dmin values we calculated earlier:

Dm2[1]^2/prod
## [1] 0.1666667
Dm2[2]^2/prod
## [1] 0.375

So we see that the sign problem remains

Summary

What we have done here is to look at how we would analyze gametic data from a real population in order to determine whether those frequencies are in fact those we would predict based on the hypothesis of the statistical independence of the two loci in question. However, a number of questions remain:

  1. Is linkage disequilibrium a real phenomenon?
  2. If so, from whence does it arise?
  3. If a Hardy-Weinberg population is in linkage disequilibrium, what happens to it with time?

Finally, it is very important to note that all of the above assumes that we can count gametes in our experimental population. This is not a trivial problem, since by simple genotyping, it is not possible to distinguish between double heterozygotes that are in coupling (AB/ab) or repulsion (Ab/aB).

We will address these questions soon. But before doing so, we have to remember that recombination is an ongoing process and ask what that says, in theory, about the fate of linkage disequilibrium.


The Fate of Linkage Diequilibrium

Having seen how we measure linkage disequilibrium, we need to address what the fate of disequilibrium would be in an equilibrium population. We will do so in the context of a very simple model, that of two genes on the same chromosome, between which recombination occurs with some frequency r.

The scenario

Suppose we have two genes that are linked but undergo recombination some percent of the time (call it r). Some fraction of those gametes are of the type \(A_1B_1\). We let the population mate at random (and assume recombination occurs in both sexes). What happens to \(f(A_1B_1)\) after 1 generation?

We would get these gametes from two possible sources

  1. Nonrecombinant - these would be all of the A1B1 gametes that do not undergo recombination. Since a fraction r do recombine, the nonrecombinant fraction would be 1-r, and \(f'(A_1B_1)=(1-r)f(AB)\)
  2. Recombinant - Assuming that mating is random, then for that fraction of gametes that undergo recombination ,r, the probability of it being \(A_1B_1\) is simply r multiplied by the allele frequencies, or \(r(f(A_1)f(B_1))\).

Taken together, the expected frequency in the next generation (f’(A1B1)) is simply the sum of these, or

\(f'(A_1B_1) = (1-r)(fA_1B_1)+rf(A_1)f(B_1)\)

Remmber also that

\(D = f(A_1B_1) - f(A)f(B)\)

we can subtract \(f(A_1)f(B_1)\) from both sides, do a little algebra (see Felsenstein (2015), pg 18 for more details), and get

\(D'=(1-r)D_0\)

If we then repeat this process for a second generation

\(D'' =(1-r)^2 Do\)

And at time t

\(Dt = (1-r)^t Do \)

  1. So let’s look at this graphically for various values of r
Do <-.25
r <-c(.001,.01,.05,.1,.5)
r <-rev(r) # simple way to reverse order of elements; not needed if they had been input from largest to smallest
t <-c(1:50)
D <- sapply(r, function(x) Do*(1-x)^t )
matplot(D, type="l",main="Decay of Linkage Disequilibrium", xlab ="Generation", ylab="D",lty=1)

From these results, two results are worth noting:

  1. LD can exist briefly, even for unlinked loci
  2. Ultimately, however, it will decay assymptotically to zero, even with very tight linkage

Thus, if LD is observed, suggests that this is a nonequilibrium population, the cause of which would need to be explored. One important application, however, is in population-based association mapping, that is, looking for markers that cosegregate with genes contributing to a phenotype of interest. In fact, this is nothing more than finding markers that are in LD with condition of interest even if the marker itself is not in a causal gene. We will explore further as we go along.

Finally - if we are doing this in a finite population, as always we will see scatter. For example, we can ask what we would expect to see for the same recombination fractions in populations of 1000 and 100:

par(mfrow=c(2,1)) #create two panels in the plot window
sapply(c(1000,100),Dfin) # Do two simulations, one with N=1000 and one with N=100

par(mfrow=c(1,1)) #restore single graphic panel

As will become a common theme throughout, we see that in finite populations (which, indeed, all populations are), random factors can lead to departures from expectations, especially in small populations.

A Real Example

And, just to bring some actual data below, Clegg et al. (1978) did an interesting experiment in which they monitored changes in allele and gamete frequencies of two allozyme loci initially in linkage disequilibrium with a dominant lethal allele in Drosopshila melanogaster. In four replicate populations, they observed the following:

Clegg

Clegg

In three of the four replicate populations, we see that from generation 8-10 onward, LD increases. So what is happening? Some insight can be gained from looking at the decline in heterozygosity at the locus with the lethal allele (Sb):

Clegg2

Clegg2

Their conclusion was that, as a result of recombination, particular haploytpes were generated that were free of the negative selection acting at the Sb locus and conferred a selective advantage, so they increased in frequency.

Conclusions

So is linkage equilibrium, like the Hardy-Weinberg equilibrium, an expected property of ideal populations? The answer is yes, but with a major caveat. That is, in the case of Hardy-Weinberg equilibrium, it is easy to show that if a population falls out of equilibrium, then a generation of random mating (at least for autosomal loci) will restore it. Not so in the case of linkage equilibrium, however - as we have seen, if LD exists, the population does return to equilibrium (D=0), but only exponentially, with the rate dependent on the frequency of recombination between the loci in question.


Visualizing Multilocus Linkage Disequilibrium

While it is techinically possible to think of linkage disequilibrium among more than two loci, in fact the calculations are onerous and the values rapidly converge on zero. Thus the usual procedure when considering a large number of linked SNPs is to calculate the data in a pairwise fashion, and then construct a “heatmap”, in which a color matrix is produced, and the intensity of the color (usually yellow to red) is correlated with the degree of LD.

HapMap can produe these quite readily, but we an also do them in R. In what follows, we will work through an example as a means of better understanding how to read and interpret heat maps.

Preparing R

We will be using two R packages in the following - TeachingPopGen (as always) and LDheatmap. If they aren’t already installed, the following installation commands need to be run. Otherwise they can be skipped

install.packages("LDheatmap")
library(devtools)
install_github("bjcochrane/TeachingPopGen")

Then, as always, we need to put the two packages into our environment:

library(LDheatmap)
library(TeachingPopGen)

A hypothetical example

Suppose we have the following situation: We have genotyped three SNPs (A, B and C) in a population, determined the haplotype relationships (not a trivial undertaking but doable) and found The following values for pairwise linkage disequilibrium:

SNP1 SNP2 D
A B .25
A C .10
B C .15

One way we can visualize this is as a heatmap:

Note that the strongest LD is between locus A and B; in the heatmap the block connecting the two is red. It is somewhat less between B and C, hence the lighted color. And the least is between A and C; that color is lighter still.

The data

We will load some hapmap data off the web and then examine it. It was originally downladed as genotypes. We will load one set from the CEU (European) population and the other from CHB (Chinese)

hya.ceu <-read.hapmap.genotype("http://users.miamioh.edu/cochrabj/Data/hyal2.ceu.genotype") #European
hya.chb <-read.hapmap.genotype("http://users.miamioh.edu/cochrabj/Data/hyal2.chb.genotype")

And we can inspect them with the summary command

summary(hya.ceu)
##                      Length Class      Mode   
## Genotypes            39     data.frame list   
## Chromosomal Position 39     -none-     numeric
summary(hya.chb)
##                      Length Class      Mode   
## Genotypes            35     data.frame list   
## Chromosomal Position 35     -none-     numeric

This tells us that there are 39 SNP genotypes in the CEU dataset and 35 in the CEU. We can look at one of the genotypes as follows:

summary(hya.ceu$Genotypes[1])
##  rs7641224      
##  C/C:156 (95%)  
##  C/T:  9 ( 5%)

And that shows us that it includes 165 genotypes of the snp rs7641224, 156 of which are C/C and 9 of which are C/T.

And where on the chromosome are they?

range(hya.ceu$`Chromosomal Position`)
## [1] 50282115 50373893

So we are dealing with snps contained within an approximately 100,000 base pair region.

The Analysis

So with that in hand, we can proceed with the analysis and plot the heat maps for the two populations

par(mfrow=c(1,1))
LDheatmap(hya.ceu[[1]],hya.ceu[[2]],col=heat.colors(20),text=FALSE,title="CEU")

LDheatmap(hya.chb[[1]],hya.chb[[2]],col=heat.colors(20),text=FALSE,title="CHB")

Note that, in general, LD is higher in the Chinese population than in the European one. In fact, this is an example we may return to later - it has been hypothesized by Ding et al. (2014) that the segment in question contains a haplotype derived from Denisovans, which has been subsequently the target of positive selection in Asian populations.


References

Clegg M. T., Kidwell J. F., Kidwell M. G., 1978 Dynamics of correlated genetic systems III. Behaviour of chromosomal segments under lethal selection. Genetica 48: 95–106.

Ding Q., Hu Y., Xu S., Wang J., Jin L., 2014 Neanderthal introgression at chromosome 3p21.31 was under positive natural selection in east asians. Molecular Biology and Evolution 31: 683–695.

Felsenstein J., 2015 Theoretical Population Genetics.: 509.