DNA Sequence Variation
- Some data
  - Number of segregating sites
  - Average pairwise differences
- Summary
The Site Frequency Spectrum
- Some real data
- Simulating Expected Data
Mitochondrial DNA and Haplotypes - A Special Case
- An example with dogs
  - Visualization 1. A Tree
  - Visualization 2. A Network
- Conclusion
References

DNA Sequence Variation

The most basic level of genetic variation is, of course, that of DNA sequence. Indeed, we have been thinking about that implicitly up until this point; now, before going further, we need to make it explicit. So our first objective will be to consider how we make such comparisons; from there we will ask how we can use those comparisons to make evolutionary inferences.

Some data

As an example, we are going to use data from the acp29 locus in Drosophila melanogaster, Popset 11245776, downloaded as a FASTA file. We will use the rentrez package to access the data directly:

Note that at this point, we have saved the downloaded data as a local file, which we will be able to access in the future.

We see that there are 17 sequences, each 702 base pairs in length. We’re going to do two processing steps, one trivial an one critical:

1.As a cleanup for now, we will rename them as simply 1 through 17.
2. We want to be sure they are properly aligned. To o so, we will use the add-on program muscle (which must be indepenently installed so that it can be called from R).

names(acp29) <-c(1:17)
acp.mat <-muscle(acp29) # makes the data a little easier to work with

Note that our data (acp.mat) is now a DNAbin object, which is simply a matrix of sequences - one row per sequence and one column per position. This is a data type used extensively in ape and related packages.

Now, we need to visualize them, for a couple of reasons. First, we need to check our alignment, so that, in doing our analyses, we are as certain as we can be that we are comparing evolutionarily homologous positions. Second, we need to see, at least qualitatively, how variable the sequences are.

To do this, we will use the image.DNAbin function, part of the package “ape”

image.DNAbin(acp.mat,cex=.8)

And with this, we can, at least qualitatively, address our question. The fact that most bases are identical in all 17 sequences tells us that the sequence alignment looks good; the fact that there are some exceptions to this says that there are is in fact variation.

But how much variation? There are two measures that we will use extensively for this:

Number of segregating sites

Since we are interested in variation, our attention should be focused on those sites that are in fact polymorphic. ape provides a function to give us those sites:

acp.sites <-seg.sites(acp29)
acp.sites

##  [1]  66  87 159 168 176 338 339 429 458 474 522 534 540 546 579

And we can count them rather easily

acp.nss <-length(acp.sites)
acp.nss

## [1] 15

And we see there are 15 out of the total of 702 positions that show some degree of variation. These are what we need to focus on - from a population perspective the remaining 687 sites tell us nothing.

Average pairwise differences

OK, so we have 15 segregating sites. But there is another question we need to ask - on average, how many differences are there between two sequences?

Let’s choose a couple of sequences at random and see how many differences there are between them.

seqs <-sample(1:17,2)
seqs

## [1]  7 11

Now let’s see how many segregating sites there are.

ss <-length(seg.sites(acp.mat[seqs,]))
ss

## [1] 3

And we see there are 3. If we reran the above block of code, we could select two other sequences, and we’d get another number. In fact, there are n(n-1)/2 such comparisons; the average of them is our measure of diversity. Again, there is an R function (in the package pegas) that gets at this

acp.div <-nuc.div(acp.mat)
acp.div

## [1] 0.004210659

But this is an important point to note. What we see here is the average diversity per nucleotide. In other words, if we were to pick a random position from two random sequences, this would be the probability that they would be different bases. We will use this occasionally, but for now, something that will be much more useful will be π, or the average number of differences per sequence. We can get this simply by multiplying the diversity number by the length of the sequence (702 in this case).

acp.pi <-ncol(acp.mat)*acp.div
acp.pi

## [1] 2.955882

Summary

In this brief introduction, we have seen how we can quantify DNA sequence variation based on estimates of two parameters - the number of segregating sites and the average pairwise difference between sites. To an extent, these are analogous to the measures used in the electrophoresis days - the fraction of loci that are polymorphic and average heterozygosity. However, as we shall see, there is much more that can be learned based on sequence variation, especially when it comes to making inferences about past evolutionary processes. To get there, we have to consider a few more factors:

What are the processes that underly generation of variation? In other words, how can we model the mutational process?
What are the dynamics of the interaction between mutation and finite population size?
How can we make inferences about the processes that gave rise to an observed pattern of sequence variation?

The Site Frequency Spectrum

Thus far, we have talked about the use of nucleotide diversity (π) and number of segregating sites (S) as possible summary statistics for characterizing DNA sequence variation. However, there is more that can be extracted from a typical data set. In particular, neither of these address the question of allele frequencies. But when we are talking about DNA sequences, what do we mean by “allele”? Let’s consider sites as loci; further we can assume that each site has one of two states - ancestral or derived. Furthermore, we assume that a particular site only undergoes a single mutation, so that all sequence differences we observe when we compare two sequences are the result of unique mutational events. This is what is known as the infinite sites model.

Thus, we might want to use as a measure site-based allele frequencies as the fraction of sequences that contain the derived allele at that site. Let’s do so by simulating a simple data set. To do so, we are going to us Rick Hudson’s program ms for the first time (Hudson 2002). It is an incredibly powerful program, but its basic function is simple - given some parameters such as θ or S, it will generate samples that would be generated based on the infinite sites model. It is distributed as c++ code that can be compiled and run from the command line; fortunately for us it has been incorporated into the r package phyclust, which is loaded along with TeachingPopGen.

So we are going illustrate site frequency spectra, we will use ms to generate 10 samples with 10 segregating sites. These will be coded as 0’s and 1’s, indicating ancestral and derived alleles, but remember, in real life each allele would be a nucleotide (A, G, C or T), and two such alleles (one ancestral and one derived) would be segregating at each position.

library(TeachingPopGen)
set.seed(123)  
rand <-ms(nsam=10, nreps=1,opts="-s 10")  
rand <-read.ms.output(txt=rand)  
x <-rand$gametes[[1]]  
x

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0    0    1    1    0    0    0    0    0     0
##  [2,]    0    0    0    0    0    0    0    0    0     0
##  [3,]    0    1    0    0    0    1    1    0    0     0
##  [4,]    0    0    0    0    0    0    0    0    0     0
##  [5,]    0    0    0    0    0    0    0    0    1     0
##  [6,]    1    0    0    0    0    0    0    1    0     1
##  [7,]    0    0    0    0    0    0    0    0    0     0
##  [8,]    0    0    0    0    0    0    0    0    1     0
##  [9,]    0    0    1    1    1    0    0    0    0     0
## [10,]    0    0    0    0    0    0    0    0    1     0

Now, we can visually inspect this and see, for example, that sites 1,2,5,6,8, 9 and 10 contain one derived allele each, 3 and 4 contain 2, and so on. But R can get the whole spectrum much more readily, and we can then plot a histogram.

spec <-sfs(x)

spec

## [[1]]
## [1] 10
## 
## [[2]]
## [1] 7 2 1 0 0 0 0 0 0

And we see the spectrum plotted graphically; in addition the function returns a list (here given the name spec) that has as elements the number of sequences (spec[[1]]) and the actual spectrum (spec[[2]]).

Some real data

But let’s go to something more realistic. Let’s return to the acp29 data. The data are in FASTA format. By the following, we can read it in as a DNAbin object and plot the sfs:

#data(acp29) #Load data from TeachingPopGen
acp29 <-read.FASTA("../Data/acp29.fasta") #Read the file we created in the last unit
spec.acp <-site.spectrum(acp29)
plot(spec.acp)

So we have 10 positions which are “singletons” 16 of the 17 sequences have one base and one has another. We then have 1 each where 2 and 3 sequences differ from the rest., and so forth. The question then becomes what to do with them - as with the examples we did with the Ewens data, we need to compare these to some sort of expectation.

Note that this is what is called a “folded” site frequency spectrum. That means that, in contrast to our simulated example, the ancestral state is not known. In this case, what we do is to treat all positions in which a base has a frequency of 1 as the first class, again for those with two, and so forth up to the any class that has equal numbers of two bases. In this situation (and in contrast to the unfolded case, where ancestry is known), the maximum frequency class is n/2.

Simulating Expected Data

But the question is, what should that distribution look like. Nielsen and Slatkin provide an analytical solution for the expectation - see pp. 54 and 55. But in our philosophy of using simulation as an approach, we will eschew the equations and simulate the result as follows.

First, we have seen that the data set contains 17 sequences of 703 bases. We need to know the number of segregating sites

length(seg.sites(acp29))

## [1] 15

There are 15. So now we will use ms to simulate that, and then send the results to ms2dna to produce simulated bases (rather than 1’s and zeros). The results are stored in a fasta file (called test.fasta) Note that in this case, we are not using ms as implemented in phyclust; rather we are making a system call directly to ms, rerouting the outcome to ms2dna, which converts the output to simulated DNA sequences, and having that saved as a fasta file.

system("ms 17 1 -s 15 -r 0 703 -seed 1 2 3 | ms2dna >test.fasta") # numbers set to mimic acp data; r parameter is essential

Now we will read that file and look at its SFS., compared with that of acp

test.dna <-read.FASTA("test.fasta")

par(mfrow=c(2,1))
test.sfs <-site.spectrum(test.dna)
plot(spec.acp,main="Observed")
plot(test.sfs, main="Simulated")

So is what we observe consistent with neutral expectations? Based on a single simulation, that’s hard to say. Suppose we repeat the simulation 10 times:

par(mfrow=c(5,2))
for(i in 1:10){
  system("ms 17 1 -s 15 -r 0 703 | ms2dna >test.fasta") # numbers set to mimic acp data; r parameter is essential
  test.dna <-read.FASTA("test.fasta")
  test.sfs <-site.spectrum(test.dna)
plot(test.sfs,main=paste("Simulation", i, sep=" "))
}

par(mfrow=c(1,1))

We see that we get a variety of spectra. So how to sort through them? Before we can address that, we need to turn to coalescent theory.

library(knitr)
library(rentrez)

Mitochondrial DNA and Haplotypes - A Special Case

Organellar DNA has a rich history in population genetics. In the late 1970’s, John Avise, then starting his career, proposed that animal mitochondrial DNA has some unique properties that make it suitable for use in population genetic studies. In particular

It is genetically haploid.
It does not recombine
It is uniparentally (usually maternally) inherited.
Experimental work (at the time involving restriction site analysis) showed it to be highly variable.

Fast forward to the present, and if you a keyword search on “mitochondrial” in popset returns over 15,000 hits - in fact they far and away the most frequent class of sequence sets in that database. Note, however, that they are by and large restricted to animals - in plants, chloroplast DNA has been used successfully for similar purposes, however its lower level of variation reduces its utility somewhat.

While there is much we could say about the structure of mtDNA and its applicability to evolutionary questions, two features should be mentioned:

The coi gene, a relatively conserved sequence, has been widely employed as a “DNA Barcode” to identify samples at the species level.
The “D loop”, the origin of replication, is a very A/T-rich region that is highly variable among individuals; this is an excellent sequence to use for population structure studies.

The intent here is to introduce mtDNA manipulation, so that you can consider it for further explorations. We will start by showing how we can use it to construct a network of haplotypes and ask whether it reveals meaningful relations among individuals in a population. We will then use it to briefly introduce BEAST, a Bayesian inference approach to making inferences about past population structure.

An example with dogs

The data we are going to look at are from popset 322367799, a collection of control region sequences from various dog breeds. The data have previously been downlaoded as a FASTA file, so we can read it in as follows:

dat.rnt <-entrez_search(db="popset",term="322367799")
dat.raw <-entrez_fetch("popset",id=dat.rnt$ids,rettype="fasta")
write(dat.raw,file="dog.fasta")
dog.dat <-read.FASTA("dog.fasta")
dog.dat

## 17 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 1270.588 
##    Shortest sequence: 1248 
##     Longest sequence: 1272 
## 
## Labels: HQ845266.1 Canis lupus familiaris isolate K_30 breed Chihuahua control region, partial sequence; mitochondrial HQ845267.1 Canis lupus familiaris isolate K_31 breed Shih Tzu control region, partial sequence; mitochondrial HQ845268.1 Canis lupus familiaris isolate K_32 breed Jack Russel Terrier control region, partial sequence; mitochondrial HQ845269.1 Canis lupus familiaris isolate K_33 breed Hovawart control region, partial sequence; mitochondrial HQ845270.1 Canis lupus familiaris isolate K_34 breed Malthezer Bichon control region, partial sequence; mitochondrial HQ845271.1 Canis lupus familiaris isolate K_35 breed Poodle x Shih Tzu control region, partial sequence; mitochondrial ...
## 
## Base composition:
##     a     c     g     t 
## 0.285 0.260 0.137 0.318

We need to edit the names of each sequence to something meaningful, in this case the breed. We can use the following trick for this case (for others, another parsing method might have to be applied)

nms <- names(dog.dat)
nms<- sapply(nms, function(x) word(x,8,9))
names(dog.dat) <- nms
dog.dat

## 17 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 1270.588 
##    Shortest sequence: 1248 
##     Longest sequence: 1272 
## 
## Labels: Chihuahua control Shih Tzu Jack Russel Hovawart control Malthezer Bichon Poodle x ...
## 
## Base composition:
##     a     c     g     t 
## 0.285 0.260 0.137 0.318

As always, it is important to ensure that the sequences are aligned, which we will do with muscle (remember that the binary file must be accessible by R).

dat.aln <-muscle(dog.dat)

And with that, we can visualize the alignment:

image.DNAbin(dat.aln,cex.lab=.5)

We see that there is a huge gap in the middle of it. Here’s where one of the great advantages of the matrix format comes in - we can readily select the first 600 nucleotides, which are free of gaps, and use them for our subsequent analysis:

dat.aln.sub <-dat.aln[,1:600]
image.DNAbin(dat.aln.sub,cex.lab=.5)

And that looks better. Note that the names of the sequences were edited prior to loading the data as well - full original designations can be found in the popset link.

Now we are going to identify how many distinct mtDNA haploytpes are present in the sample:

dat.hap <-haplotype(dat.aln.sub)
dat.hap

## 
## Haplotypes extracted from: dat.aln.sub 
## 
##     Number of haplotypes: 12 
##          Sequence length: 600 
## 
## Haplotype labels and frequencies:
## 
##    I   II  III   IV    V   VI  VII VIII   IX    X   XI  XII 
##    2    3    2    1    1    1    2    1    1    1    1    1

So among the 17 sequences in the data set, there are 12 different haplotypes. We can now visualize them in two ways.

Visualization 1. A Tree

One of the most familiar ways to visualize relationships among aligned sequences is as a phylogenetic tree. In this case, what we will do is to use the number of differences between pairs of species as a measure of their divergence and use the resulting matrix to generate a tree. Note that for this we are treating all differences equally and are not taking the position of the differences into account.

dat.dist <-dist.dna(dat.aln.sub) #calculate the distance matrix
plot(nj(dat.dist),type="phylo",cex=.8) # plot a "neighbor joining" tree

But in fact, while useful, tree visualization like this is not entirely appropriate for population data, as it implies ancestor-descendant relationships that may or may not exist. In fact, we should be thinking of this as a network of haplotypes, which differ by one or more bases.

Visualization 2. A Network

Another way of thinking about this is asking, given two haplotypes, how many substitutions would be required to go from one to the other. We would then extend that logic to create a network of all 12 haplotypes that requires the fewest steps. The code is a bit tricky, but it and the results are shown below:

net <-haploNet(dat.hap)
fq <-attr(net,"freq")
plot(net,size=fq,threshold=0,cex=.5)

The size of the circles indicate the number of sequences for each haplotype (for example, the smallest circles like X indicate a haplotype that occurred only once in the sample, while the largest on - II - actually had three). Note also that breeds are not shown - to do so would be a graphical nightmare. Instead, we can do a little bit of munging and generate a table that provides that information

sample.no <-attr(dat.hap,"index")


breed.by.haplo <-sapply(sample.no, function(x) rownames(dat.aln.sub)[x])
sorted <-sapply(breed.by.haplo,paste,collapse=" ")
haplo <-as.roman(1:12)
kable(data.frame(Haplotype=as.character(haplo),Breed=sorted))

Haplotype	Breed
I	Chihuahua control Dwarf Schnauzer
II	Shih Tzu Poodle x Poodle x
III	Jack Russel Jack Russel
IV	Hovawart control
V	Malthezer Bichon
VI	Argos German
VII	German Shepherd German Shepherd
VIII	Maltheser control
IX	Border Collie
X	Golden Retriever
XI	German Shepherd
XII	Cocker Spaniel

Conclusion

So looking at these results, we can’t say that we’ve learned a lot about these dog breeds. But after all, our sample size, both in terms of haplotypes and sequenced bases, is quite small. Nevertheless, it illustrates the potential of the approach.

References

Hudson R. R., 2002 Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics Applications Note 18: 337–338.

Introduction to DNA Sequence Variation

February 20 2017