The Basic Problem - Binomial Variables

In the simplest case, working with genetic data is an exercise in the manipulation of binomial (or, in the multi-allelic case, multinomial) random variables. For now, consider the case of a biallelic locus, with alleles A and a, both with equal frequencies (p=q=.5). If we think of a population as consisting of a collection of gametes, then drawing samples from it would be akin to a coin flip. To determine the probability of a particular outcome, we could consider combinations and permutations, and then apply the binomial theorem. Alternatively, the same result can be obtained via simulation, and in a way that leads more directly to dealing with more complex situations.
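
As a preview of that simulation approach, here is a minimal sketch of drawing gametes from such a population in R (the sample size of 40 gametes is an arbitrary choice for illustration):

set.seed(1) # set the seed so the example is reproducible
gametes <-sample(c("A","a"),size=40,replace=TRUE,prob=c(.5,.5)) # draw 40 gametes; each draw is equivalent to a coin flip
table(gametes) # count the A and a alleles in the sample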

Flipping a Coin

Where better to start a consideration of probability distributions (in particular the binomial) than with a coin flip? We can do a trial - flip a coin - that has two possible outcomes (heads or tails), and assuming it’s a fair coin, we can hypothesize that

P{HEAD} = p = .5
P{TAIL} = q = .5

and since these are the only possible outcomes (we will ignore the improbable outcome of the coin landing on its edge), the two probabilities sum to 1. And if we were to do N trials, our expectation is that we would observe Np heads and Nq tails. But of course, we know that chance plays a role as well. So what we have to do is ask, given this hypothesis and expectation, how can we characterize the expected distribution of outcomes?

The Experiment

Suppose we start with the following question: “If you were to flip a coin 20 times and get 17 heads, would you consider it to be a fair coin?” In practice, in a class of junior and senior biology majors, the results were as follows:

Answer    Percent
Yes       53%
No        32%
Not Sure  15%

Analysis

As we will do throughout, we will have the computer simulate the experiment 10000 times and plot the results. When we do so, we will also identify the 5% of the 10000 outcomes that deviate most from the expectation. R is very good at helping us do this. First, we can simulate 10000 replicates of 20 flips of a fair coin (P{H} = P{T} = .5):

set.seed(3972) #set the random number generator seed so that results are reproducible
flips <-rbinom(10000,20,.5) # do 10000 replications of 20 flips of a fair coin
length(flips)
## [1] 10000
head(flips)
## [1]  6 13 10  9 12 10

We can easily determine the mean of the outcome:

mean(flips)
## [1] 10.006

This is very close to 10, which is what we would intuitively expect (and indeed, in the analytical world, E(H) = pN = .5*20 = 10).

And we can do a quick and dirty plot of those data

hist(flips,xlab="Number of Heads",ylab="Number", main="10000 Replicates of 20 Coin Flips", xlim =c(0,20))

But let’s return to our question. We’ve actually done the experiment once, and we observed 17 heads. We can consider two mutually exclusive hypotheses:

  • H0 (the null hypothesis) - the coin is fair (P{Head} = .5)
  • H1 (the alternative hypothesis) - the coin is not fair (P{Head} ≠ .5)

Obviously, had we gotten 10 heads, that would have matched our expectation, and we would accept the null hypothesis. But we got 17. Is that by chance, or should we reject our null hypothesis (that the coin is fair)?

There are two ways to do this - analytically and numerically. In the analytical approach, we want to calculate the exact probability of getting 17 heads. Here, as you can find in any statistics text, the binomial formula can be applied; for k successes in n trials

P{k successes in n trials} = [n!/((n-k)! k!)] * p^k * q^(n-k)

The left part of the expression (the one with the factorials) gives the number of combinations of flips resulting in k heads that are possible; the right half is the probability of any one of them occurring.

This is easy to code in R:

N=20 #number of trials
H=17 #number of heads observed
p <- q <-.5  #probabilities of heads and tails
P17 <-factorial(N)/(factorial(N-H)*factorial(H))*p^H*q^(N-H)
P17
## [1] 0.001087189

So we see that the exact probability of getting 17 heads is about 1 in 1000 - not very likely. And as a side note, we can ask R to tell us how many of our simulated experiments resulted in 17 heads by the following:

Obs17 <-length(which(flips==17))
Obs17
## [1] 7

So our expectation (calculated by multiplying the exact probability by 10,000) based on the binomial was 11 and we observed 7.
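
In code, that expected count (using the objects defined above) is simply:

round(P17*length(flips)) # expected number of the 10000 trials showing exactly 17 heads; about 11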

But while R can do this calculation easily, it is a pain to do manually, and even in R, as N and K get large, the computational load gets considerable. Furthermore, in real life, we are often interested in a slightly different question - what is the probability of getting as much or greater deviation from our expected outcome (17 observed when 10 are expected in this case)? Here the analytical solution can quickly become very onerous (although we will see that such an exact test of Hardy-Weinberg is sometimes necessary).
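
As an aside, R’s built-in binomial functions can do both of these calculations directly, without the factorial arithmetic. A brief sketch (this is a check, not the simulation approach we will use):

dbinom(17,20,.5) # exact probability of exactly 17 heads in 20 flips; matches the manual calculation above
pbinom(16,20,.5,lower.tail=FALSE) # probability of 17 or more heads
2*pbinom(16,20,.5,lower.tail=FALSE) # both tails: a deviation this large in either direction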

To do this, we will use the following standard: if we expect to see as much or more deviation from the expectation of the null hypothesis in less than 5% of replicated trials, then we reject the null hypothesis. In our case, the results suggest we may have gotten “too many” heads; however, in the general case the deviation could be in either direction - that is, either too many or too few.

Thus, let’s ask the question, given our simulated results, which outcomes fall in that category. In other words, of the 10000 simulations of 20 coin flips, which gave outcomes that occurred less than 5% of the time? The R quantile() function does that; as implemented below, it finds the two tails of the distribution

q <-quantile(flips,c(.025,.975))
q
##  2.5% 97.5% 
##     6    14

So what this says is that roughly 2.5% of the trials resulted in fewer than 6 heads, while roughly 2.5% resulted in more than 14. And of course, we observed 17. We can illustrate this graphically, using the colhist() function from TeachingPopGen, which will color in those portions of the distribution that fall within the two tails, and abline(), which will draw a line showing where our outcome fell

colhist(flips,tail=2,xr=c(0,20),xlab="# Heads",ylab="# Occurrences",main="Is it a Fair Coin?")
abline(v=17,col="blue")

And we see that our outcome (17 heads) falls well outside the range of deviation from our expectation that we would expect to see by chance.

Conclusions

Thus, we can see that our observed value (shown by the blue vertical line) lies far outside the bulk of the distribution of outcomes in the simulations; in fact, only 7 of the 10000 trials resulted in 17 heads. So, based on the standard we set (probability of as much or greater deviation occurring by chance < 5%), we reject our null hypothesis.

But what do we know about this particular coin? We have rejected the hypothesis that it is fair. But note that all our alternative hypothesis says is that it is not fair. It tells us nothing about what the value of the true probability of getting a head, with that particular coin. We will come back to that point in detail when we consider Bayesian statistics.

Summary

In the preceding, we have used what statisticians call the “frequentist” approach to hypothesis testing. It can be summarized as follows.

  1. Based on a null hypothesis, we predict the expected value of some parameter of the distribution of possible outcomes. In this case it is pN; the general term for such a parameter is θ.
  2. We ask, based on a specified rejection criterion (in this case 5%) whether the outcome we observe is consistent with that expectation.
  3. If it is not, we reject the null hypothesis.

This is, in fact, the classic approach to testing genetic data. Every introductory genetics text presents it as the means for determining, for example, whether the observed results of a particular cross are consistent with Mendelian predictions. We will explore that further in the next section, but do keep in mind that, as a means of determining estimates of the actual values of the parameters of an unknown distribution, the frequentist approach has its limits.

The Normal Distribution

In the last unit, we used simulation to test whether a particular outcome (17 heads in 20 coin tosses) was consistent with the hypothesis that it is a fair coin. We did so by simulating the experiment 10000 times and asking how often we got results comparable to the one observed. That such results occurred very infrequently led us to reject our null hypothesis.

But note that this was a very simple case - two possible outcomes, each with equal probabilities. We also saw that we could calculate the exact probability of a particular outcome, in that case using the binomial equation. Isn’t that the preferred approach?

Well, no, for at least 2 reasons.

  1. While R had no problem with the algebra, the calculation is in fact computationally intense, especially as N gets large. In the days of manual calculations, this was particularly problematic.
  2. This works for a binomial random variable (and can be extended to a multinomial one like the roll of a die), but it is not readily generalizable to random variables with other distributions.

So we need to go one or two steps further.

Same approach, Different Distribution

Let’s move away from a situation in which there are a finite number of discrete possible outcomes to one in which the possibilities are continuously variable. A good example is a physical trait like height. We are all familiar with the “bell curve” distribution of such a trait - if we measure a sample of a population, we wouldn’t be surprised to see a distribution something like this

x <- seq(100,260,length=100)   #set a range of possible heights
hx <-dnorm(x,180,30) # calculates the density function, or theoretical exact probability, for every value of x, assuming mean of 180 and standard deviation of 30
plot(x,hx, type="l",xlab="Height (cm)", main="Hypothetical Distribution of Human Height",yaxt="n",ylab="Frequency")
abline(v=180,col="red")

This is, of course, a normal distribution, with some mean (180 cm in this case) and a standard deviation. So, as we did with respect to the coin flip, we can work through a problem.

The problem.

We weigh a group of long-distance runners and find that the mean weight is 163 pounds, and the standard deviation of those weights is 10. An individual shows up who weighs 185 pounds. Is he a runner?

The Analysis

So again, we can use our simulation approach. We will create some data of the sort we might obtain by weighing 100 runners, look at the distribution, and see where our mystery person falls:

set.seed(123)
runners <-rnorm(100,163,10)
hist(runners,xlab="Weight",freq=FALSE,ylab="Probability",main="Weight of Long-Distance Runners",breaks=20)
curve(dnorm(x,163,10),col="blue",add=TRUE) # Add the theoretical distribution curve
abline(v=185,col="red")

And what we see is a pretty good normal distribution. The bars represent the distribution of actual measurements; the blue line is the theoretical distribution, given the mean and the standard deviation. The red line shows where our mystery person weighs in - definitely in the upper extreme of the distribution, but is this the result of chance variation (i.e., is he likely to be a runner or not)?

We can do the same thing we did with the binomial case. First, we can numerically determine the upper and lower 2.5% limits

as.integer(quantile(runners, c(.025,.975)))
## [1] 146 182

And we can do this on a version of the plot

colhist(runners,tail=2,xr=c(140,190),xlab="Weight",ylab="Number",main="Is the Mystery Person a Runner?")
abline(v=185,col="blue")

And we see that, based on this analysis, just as in the case of the coin flip with 17 of 20 heads, we would reject the hypothesis that the 185 pound individual is a runner.
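
As an aside, because we specified the theoretical distribution (mean 163, standard deviation 10), we could also compute the analytical tail probability directly; this is a check on the simulation, not part of the approach we are developing:

pnorm(185,mean=163,sd=10,lower.tail=FALSE) # probability of a runner weighing 185 pounds or more; about 0.014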

Conclusions

So the simulation approach has worked again, and we can reject our null hypothesis. While there were some runners whose weight deviated from the mean as much or more than did our subject, they comprise less than 5% of the total. And, in fact, since we estimated the distribution from the data (that is, by actually measuring the runners), we are not necessarily constrained by the nature of that distribution. But, as in the case of our coin flips, we are determining the probability that one particular sample is a member of a distribution. More frequently, however, we have a sample of some size n; we would like to know if it is in fact drawn from a population with a distribution with some set of parameters, known or unknown. This will be our next challenge.

Summary

So we have seen that, given a particular distribution, it is in fact possible to calculate the exact probability of a particular outcome, however that can be computationally impractical. In contrast, a numerical approach is much more generalizable, in that one can simply determine where that outcome falls relative to the overall distribution. What we will do next is to see how we can determine if a sample of individuals is in fact drawn from a population described by a particular null hypothesis.

And one final note. At this point, we have looked at one case in which outcomes are discrete (the coin flip) and one where they are continuous (height or weight). In fact, this is one of the major challenges in genetic analysis. Alleles and genotypes are discrete, but phenotypes, on which natural selection acts, are often continuous. In fact, when Mendel’s paper first became widely recognized, this problem seemed intractable. Indeed, vestiges of that apparent tension remain today - the study of the distribution of alleles and genotypes is usually referred to as “population genetics”, while investigation of phenotypic variation is called “quantitative genetics”. We shall see that much progress has been made in reconciling the two, but much more remains to be done.

The Chi-squared Distribution

Background

Our goal now will be to ask the question raised at the end of the last unit. Given that we have performed some measures on a sample of individuals, how can we assess the null hypothesis that it is in fact a sample of a population with a distribution described by some set of parameters? Let’s work from the following assumptions.

  1. We are working with categorical data. That is, we can place every individual we measure into one of a finite number of categories.
  2. We have a null hypothesis that allows us to predict the expected numbers of individuals in the sample that should fall into each category

What we want to do is to determine the extent to which the numbers of individuals observed in each category differ from those expected based upon our hypothesis. And we need to take into account three other considerations

  1. The sign of the deviation between observed and expected numbers is immaterial - we want to treat positive and negative differences between the two equivalently.
  2. We need to weight the magnitude of the deviation between observed and expected in each category with respect to the number expected in that category. The reason for doing so is actually fairly simple. Suppose we expect to observe 100 and we actually observe 95. Intuitively, we would say that such a deviation (observed-expected = 5) is of little consequence - likely due to chance. However, if we expected 10 and we observed 5, while the absolute value of the deviation is the same (5), intuition tells us that in this case it may be much more significant (see the quick calculation following this list).
  3. The number of possible categories matters. Put loosely, the more categories that exist, the greater the opportunity for deviation
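
A quick numerical illustration of that second consideration, using the hypothetical counts given above:

(100-95)^2/100 # 100 expected, 95 observed; weighted squared deviation = 0.25
(10-5)^2/10 # same absolute deviation, but only 10 expected; weighted squared deviation = 2.5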

The problem

So the best way to go at this is to work from a problem. Since our interest is genetics, why don’t we work from some of Gregor Mendel’s data? We’ll use his original F2 data from a cross of true-breeding strains with round and yellow seeds to ones with wrinkled and green seeds. We can set those up as a data frame in R.

The analysis

Entering the data

First, we need to create a data frame that contains Mendel’s observations. That can be done as follows:

seeds <-c("Round Yellow","Round Green","Wrinkled Yellow", "Wrinkled Green") # The four phenotypes
obs <-c(315,108,101,32) # The number observed
dat <-data.frame(obs,row.names=seeds) # Place the observations into a dataframe, with one entry per phenotype
dat
##                 obs
## Round Yellow    315
## Round Green     108
## Wrinkled Yellow 101
## Wrinkled Green   32

Calculating Expected Values

So that’s what was observed. Now, what was expected? Remember, Mendel, with his insight, predicted that for a dihybrid cross, in which there is one dominant and one recessive allele at each locus, the four possible phenotypes should occur in a 9:3:3:1 ratio. Here, the “vectorization” feature of R comes into play:

ratio <-c(9/16,3/16,3/16,1/16) # make a vector of the expected ratio
exp <-sum(obs)*ratio # calculate expected numbers by multiplying the total plants observed by the expected ratio
dat <-cbind(dat,exp) # add those numbers to the data frame and view
dat
##                 obs    exp
## Round Yellow    315 312.75
## Round Green     108 104.25
## Wrinkled Yellow 101 104.25
## Wrinkled Green   32  34.75

Calculating the difference between observed and expected.

So now we have observed and expected values - are the differences between them due to chance? To assess that, we need a summary statistic - a value, computed from what we have, that we can compare to an expected distribution. The simplest approach is to take the differences between the observed and expected numbers and do something with them. However, remembering our first consideration above - that positive and negative deviations need to be treated equivalently - we need to make all of those differences positive. The most mathematically tractable way to do that is to square them:

diff <-(dat$obs-dat$exp)^2 #again, we perform operations on each element of two vectors (observed and expected)
dat <-cbind (dat,diff) #add the results to the data frame as a column
dat
##                 obs    exp    diff
## Round Yellow    315 312.75  5.0625
## Round Green     108 104.25 14.0625
## Wrinkled Yellow 101 104.25 10.5625
## Wrinkled Green   32  34.75  7.5625

Accounting for the magnitude of the expected number

To incorporate our second consideration, we’ll divide the difference we just calculated (the variable diff) by the appropriate expected numbers, giving us the difference weighted by expected category number (diff.w).

diff.w <-diff/dat$exp
dat <-cbind(dat,diff.w)
dat
##                 obs    exp    diff     diff.w
## Round Yellow    315 312.75  5.0625 0.01618705
## Round Green     108 104.25 14.0625 0.13489209
## Wrinkled Yellow 101 104.25 10.5625 0.10131894
## Wrinkled Green   32  34.75  7.5625 0.21762590

Calculating the summary statistic

And finally, we can take the sum of these, called χ2, to be our summary statistic - an overall measure of the weighted difference between observed and expected.

chi <-sum(dat$diff.w)
chi
## [1] 0.470024

So what does this mean? What we want to do is to compare this result to the distribution of values we would expect to obtain if we did this experiment repeatedly. Before doing so, however, we have our third consideration.

Accounting for the number of categories

Remember that, again thinking intuitively, the more possible categories there are, the larger a total deviation (and thus χ2) we would expect to observe. So we need to ask, how is this statistic distributed when we have k classes (in this case 4)? Here’s where the idea of degrees of freedom comes into play. For now, let’s think of it as follows. Remember that the total frequency of all classes of outcomes must add up to 1. Thus, in this case, if we look at a sample and determine it is not in class 1, 2, or 3, then we know it must be in class 4. To a statistician, that means there are three degrees of freedom in the data. For now, degrees of freedom (df) can be thought of as one less than the total number of categories; when we get to Hardy-Weinberg, we will see that there can be other factors involved as well.

So is it chance deviation?

To review, we’ve determined from Mendel’s data that χ2 = 0.47; since there are four categories into which we’ve classified the data, there are three degrees of freedom. We can now use the simulation approach that we have used previously. First, we’ll have R generate 100,000 values of chi-squared based on 3 degrees of freedom

set.seed(123)
ch <- rchisq(100000,3)

Now, remember that, by squaring the difference between observed and expected, all deviation, regardless of its original direction, is now going to make a positive contribution to chi squared. So we can ask which of the simulated values fall in the range that occurs 5% of the time, due simply to chance

quantile(ch, .95)
##      95% 
## 7.859045

So, as we have done before, we conclude that if we were to observe a value of χ2 this large or larger, we would reject our null hypothesis that the observed numbers are consistent with those predicted based on our hypothesis. But the value we actually observed is 0.47, which is far less. We can illustrate this below:

colhist(ch,tail=1, xr=c(0,20),xlab="Chi-squared",ylab="Number",main="Did Mendel's Data Fit?")
abline(v=chi,col="blue")

So in this case we can accept the null hypothesis and conclude that Mendel’s laws of segregation and assortment do indeed explain his results.

Note that in this case, we used a one-tailed test, as opposed to the two-tailed approach we have used previously. This is because, as we described, we are looking at the square of the deviation from expectation, a number that will always be positive and will become larger as the deviation becomes greater. Thus, we are only concerned with the upper 5% of the distribution, consisting of those values of χ2 that are least likely to be observed by chance.
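
As a check on the simulation (and not part of the walkthrough above), R’s built-in chi-squared functions give essentially the same answer:

pchisq(chi,df=3,lower.tail=FALSE) # exact probability of a chi-squared value of 0.47 or greater; well above 0.05
chisq.test(dat$obs,p=ratio) # the same test in a single step, using the observed counts and the 9:3:3:1 expectation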

Conclusions

What have we accomplished here?

  1. We have compared an experimental result with the distribution expected based on a biological hypothesis.
  2. We have done so in a computationally straightforward fashion
  3. We have done so without prior knowledge regarding the expected distribution of repeated samples drawn from the original population.

We will come back to this approach repeatedly; in doing so, we needn’t go through the whole simulation process each time. In fact, the critical values of the χ2 distribution are widely available, and indeed they are built into R. The table below provides the .05 and .01 cutoffs for 1 to 10 degrees of freedom:

df <-c(1:10)
chicrit.05 <-sapply(df, function(x) qchisq(.95,x))
chicrit.01 <-sapply(df,function (x) qchisq(.99,x))
printx(data.frame(cbind(chicrit.05,chicrit.01),row.names=df))
df   chicrit.05   chicrit.01
1          3.84         6.63
2          5.99         9.21
3          7.81        11.34
4          9.49        13.28
5         11.07        15.09
6         12.59        16.81
7         14.07        18.48
8         15.51        20.09
9         16.92        21.67
10        18.31        23.21

Bayesian Inference

Up until now, we have been working from the standpoint of null hypothesis testing. That is, we have asked: given some expected value of a parameter or set of parameters, how frequently would we see a departure from that expectation as great or greater than the one we see in our data? But, as has been pointed out before, if we reject our null hypothesis based on this analysis, all we really know is that the null hypothesis doesn’t explain the data. We do not directly obtain an alternative expected value of our parameter.

In Bayesian statistics (named after the Reverend Thomas Bayes, an 18th century English clergyman who first proposed Bayes’ theorem), we turn things around a bit. We start by using whatever prior information we might have about the parameter we are estimating, and then ask, given a particular outcome, how that distribution should be refined (giving us a posterior distribution). In other words, can we use the data to work towards an actual estimate of the parameter(s) in question?
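
In its simplest form, Bayes’ theorem says that, for a hypothesis H and observed data X,

P{H|X} = P{X|H} * P{H} / P{X}

where P{H} is the prior probability of the hypothesis, P{X|H} is the likelihood of the data given that hypothesis, and P{X} is the total probability of the data, summed over all hypotheses under consideration. The forensic example that follows applies this formula directly.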

A Forensic Example

We are going to come back to coin flips in a bit, but before doing so, let’s look at a real example from forensics, taken from Shoemaker, Painter, and Weir (1999), a very nice introduction to Bayesian statistics for genetics. We are in New Zealand, and we have a genotype obtained from a blood sample from a crime scene. There are three ethnic groups in New Zealand - Caucasian, Maori, and Western Polynesian - and we would like to determine the probability that the sample came from a member of each of these groups, given what we know about the ethnic makeup of New Zealand.

So we start with our prior information, which in this case is simply the distribution of the three ethnic groups in the population as a whole:

pr.c <-.819 #proportion that is Caucasian
pr.m <-.137 # Maori
pr.wp <-.044 # Western Polynesian

We then need the likelihoods that the particular genotype would be found in these populations (presumably done by some prior population-based genotyping)

l.c <-3.96*10^-9
l.m <-1.18*10^-8
l.wp <-1.91*10^-7

These can be thought of as “the probability of getting the genotype from a member of a particular ethnic group”.

Next, we can calculate the total probability that this blood sample would be found in this population (p{X}) by multiplying each likelihood by its corresponding prior probability and summing:

p.X <-pr.c*l.c+pr.m*l.m+pr.wp*l.wp
p.X
## [1] 1.326384e-08

So, overall, it’s a rare genotype. But we have it in our sample - that is a given. What we want to know is the probability that the sample came from an individual of each of the three ethnicities. To calculate these posterior probabilities, we do the following:

ps.c <-(pr.c*l.c)/p.X
ps.m <-(pr.m*l.m)/p.X
ps.wp <-(pr.wp*l.wp)/p.X
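
One way to tabulate the priors and posteriors side by side (the original analysis does not show its tabulation code, so this is just a minimal sketch using the values computed above) is:

data.frame(Prior=c(pr.c,pr.m,pr.wp),Posterior=c(ps.c,ps.m,ps.wp),row.names=c("Caucasian","Maori","W.Polynesian")) # assemble and print the comparison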

Ethnicity      Prior  Posterior
Caucasian      0.819  0.2445174
Maori          0.137  0.1218802
W.Polynesian   0.044  0.6336023

So what we see is that, for example, despite the fact that Western Polynesians comprise the smallest fraction of the total population (4.4%), in fact the probability that the particular blood sample came from one of that group is the highest.

Priors as continuous distributions

In the example above, we had a finite number of possibilities - three (Caucasian, Maori, or Western Polynesian). But what if we were dealing with something for which a continuous range of outcomes is possible? How then do we calculate p{X}, the sum of all possible values weighted by the likelihood of each? In some cases, integral calculus can come to the rescue, but often the problem is too complex for such an analytical solution.

An example

Returning to our coin-flipping obsession: suppose you flip a coin 10 times and get 8 heads. Is it a fair coin? Remember, the standard way of getting at that question is to ask, “given that it is a fair coin, what is the probability that the outcome is 8 heads?” But that is based on the assumption that the coin is fair (i.e., prob{head} = 0.5) - if that hypothesis is rejected, the test does not by itself provide any guidance as to what an alternative hypothesis might be.

So let’s reframe the question as a Bayesian would. Suppose that the true probability of getting a head is somewhere between 0 and 1, but we don’t know what it is. How can we find it? One way would be to conduct an experiment in which we have a magical coin for which p{head} varies between 0 and 1 from trial to trial. If we could do that, we could then ask which trials, with what probabilities, led to the outcome of 8 heads.

Not an easy experiment to do in real life, but simulating it is straightforward. Let’s suppose we do a million trials, each consisting of 10 tosses of a coin whose probability of coming up heads is some value between zero and one. First, we can simulate those probabilities:

set.seed(123)
p <-runif(1000000,0,1) # generate 1 million random numbers between 0 and 1
head(p)
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565

And we can quickly plot these values as a histogram:

par(mfrow=c(1,1))
hist(p,xlab="p", main="Prior Distribution of p")

This is the epitome of what statisticians call an “uninformative prior”.

Now, let’s take each of those values of p and generate the hypothetical results of tossing a coin with that probability of success 10 times

outcomes <-rbinom(1000000,10,p)
head(outcomes)
## [1] 3 9 2 8 9 1

And we can look at the distribution of outcomes we got

hist(outcomes)

And we see that (not surprisingly) they are evenly distributed from 0 to 10. But what we want to do is focus on those trials in which the outcome was in fact 8 heads, and ask what values of p in our initial uniform prior led to that outcome

post <-p[which(outcomes==8)]
head(post)
## [1] 0.8830174 0.8998250 0.8895393 0.6907053 0.7584595 0.4659625

We can then plot this posterior distribution and calculate its mean and upper and lower confidence intervals:

colhist(post,xr=c(0,1),xlab="p",ylab="Number",main="Posterior Distribution, # Heads = 8")
mean(post)
## [1] 0.7504267
quantile(post,(c(.025,.975)))
##      2.5%     97.5% 
## 0.4833942 0.9399202
abline(v=mean(post),col="blue")

And we see that, if we consider the mean of this distribution to be our expectation of p, then it is about .75.

As we have done previously, we have shaded the upper and lower tails of the distribution. However, here they have a new meaning. What we are saying is that, given our prior distribution and the particular outcome (8 heads), we conclude that there is a 95% chance that the true value of p falls within the range (in this case) of ~.48 to ~.94. This is known as a credible interval, and it is quite different from the confidence intervals we have considered earlier. Remember that those describe the range of outcomes you would expect to see, given some null hypothesis about p (e.g., H0: p = 0.5).

So, what have we done:

  1. We made an assumption about the distribution of the probabilities of getting a head (some value between 0 and 1).
  2. Based on that distribution, we simulated the outcomes of coin flips using randomly selected coins from that distribution.
  3. We then selected as our posterior distribution all of those that gave the observed outcome (8 heads) and used the mean of that as our new estimate of the true value of p(heads).

Refining the estimate of p

Suppose I just told you the results of this experiment and asked you what your best estimate of p might be. Most likely it would be .8, yet what we saw from our Bayesian analysis is that we got something slightly lower - .75 or so. Why the discrepancy?

Remember - we had a completely uninformative prior - we started by assuming that any outcome was possible. However, having done the experiment, we saw that now the distribution of p was no longer uniform; rather it is a unimodal one with a mean of .75. What would happen if we did our experiment again and got the same outcome? Now, we can use this new (posterior) distribution as our prior. And we could continue to do this, refining our prior after each experiment. The code below will do that for 50 iterations; we will then plot the mean value of p obtained for each iteration.

out.mean <-rep(0,50)
for(i in 1:50){
  outcomes <-rbinom(1000000,10,p)
  post1 <-p[which(outcomes==8)]
  out.mean[i] <-mean(post1)
  p <-sample(post1,1000000,replace=TRUE)
}

We can look at this in a couple of ways. First, we can plot the change in the mean of the posterior distribution of p with successive iterations:

plot(out.mean, xlab="Iteration",ylab="p", type="l",main="Iterative Estimation of p")

And we see that, as we expected intuitively, it approaches a value of .8. We can also look at the final posterior distribution of p (after 50 iterations):

par(mfrow=c(1,2))
colhist(post,xlab="P",ylab="Frequency",main="Initial Posterior Distribution")
colhist(post1,xlab="P",ylab="Frequency",main="Final Posterior Distribution",xr=c(0,1)) 

Two features are evident. First, as we already saw, the mean of the distribution is very close to .8, which is what we would intuitively expect given that we keep getting 8 heads out of 10. Second, the distribution is much narrower than the one we obtained from our initial uniform prior; comparing the 95% limits of the two posterior distributions makes the point:

quantile(post,c(.025,.975))
##      2.5%     97.5% 
## 0.4833942 0.9399202
quantile(post1,c(.025,.975))
##      2.5%     97.5% 
## 0.7616568 0.8329505

So as we’ve added more data, refining our prior distribution as we do so, the accuracy of our estimate has increased. This is as it should be.

Summary

Thomas Bayes first proposed his theorem in the 18th century, yet it is only within the past 20 years or so that Bayesian statistics have been widely used. There are multiple reasons for this; however, two are worth mentioning:

  1. While the mathematics for simple cases are straightforward, as the number of parameters to be estimated increases and/or the prior distributions become more complex, significant computational power is required.
  2. There is always the question as to whether the logic is inherently subjective and therefore suspect. After all, we started all the analyses we have done with an assumption about what we thought the truth to be and then analyzed the data in that context.

With respect to the first point, we’ve seen, in a simple case, how we can easily use computer simulation to handle the problem of modeling a prior. Subsequently, we will see how sampling can be used to further refine this process. Regarding the second point, one thing to keep in mind is that the selection of a prior distribution is important - if, for example, in the New Zealand case, there were in fact a fourth ethnic group that was not included in the prior distribution, then the posterior probabilities obtained would obviously be flawed. So the challenge is to define one’s priors sufficiently broadly to cover all possibilities, but also sufficiently focused that, via whatever algorithm is used, a well-supported outcome can be achieved in a manageable way.

Finally, as we will see, there are many population genetics problems in which, while we may have some prior information about how particular parameters may be distributed, we do not have a specific null hypothesis against which we can test the data. In such cases, the Bayesian approach can be very powerful.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

References

Shoemaker, Jennifer S., Ian S. Painter, and Bruce S. Weir. 1999. “Bayesian statistics in genetics: A guide for the uninitiated.” Trends in Genetics 15 (9): 354–58. doi:10.1016/S0168-9525(99)01751-5.