Strains were aligned and analyzed as previously described. The VCF files contain a wealth of information, some of which I'm not entirely sure what it means. Therefore I'll limit my analysis to 2 basic pieces of information: where the SNP is (record.POS)
and the number of "alternate" (not exactly sure what alternate means in this context) alleles called (record.INFO['AC'])
. I could change it to record.INFO['AF']
because it's a float of the frequency, but having it in the range of 1-100 works pretty well for visualization. These python blurbs utilize PyVCF.
Step 1: where are the SNP's?
First I just want to get an idea of how diverse the 27 samples are, so I read in all the VCF files, and then store their position in a big list. After that I plot them as a histogram. I played around with the number of bins to optimize for aesthetics and plotting performance. The overall shape doesn't change much if you plot 25k bins vs 2500 bins vs what I settled on, 253 bins (roughly 100 bp per bin). Plus, at 25k and 2500, the colors don't really show up, and that will come in handy for overlaying information later.
import vcf, os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
fig1 = plt.figure(figsize=(18, 3))
snp_plot = fig1.add_subplot(111)
flist = os.listdir('../mito_vcf')
snp = []
snp_dp = []
for files in flist:
vcf_reader = vcf.Reader(open('../mito_vcf/%s' % files, 'r'))
for record in vcf_reader:
snp.append(record.POS)
snp_dp.append(record.INFO['DP'])
snp_plot.hist(snp, 253, facecolor='red', alpha=0.5, edgecolor='none', label='SNP frequency') #this gives ~100bp per bin
snp_plot.set_xlabel('Sequence Position, all SNPs. Median read depth = %s, range = %s to %s' % (np.median(snp_dp), min(snp_dp), max(snp_dp)))
snp_plot.set_ylabel('count')
snp_plot.set_ylim(0, 40)
snp_plot.set_xlim(0, 25000)
snp_plot.legend()
plt.show()
We have differences!
So there are definitely hot spots for sequence diversity, which is interesting in and of itself. But I want to know if there are multiple sequences seen at those sites at a frequency of 1% or greater. Also notice that the read depth is rather insane, so we can be confident that for most sites we have ample observations to call 1% sequence differences.
Now I want to see which of these SNP sites are heteroplsmic, so I replot, tossing anything with an AC value of 100 (ie - just a SNP, not a heteroplasmic SNP).
h_snp = []
fig2 = plt.figure(figsize=(18, 3))
hsnp_plot = fig2.add_subplot(111)
hsnp_dp = []
hsnp_ac = []
for files in flist:
vcf_reader = vcf.Reader(open('../mito_vcf/%s' % files, 'r'))
for record in vcf_reader:
if record.INFO['AC'][0] < 100: # only want heteroplasmic sites
h_snp.append(record.POS)
hsnp_dp.append(record.INFO['DP'])
hsnp_ac.append(record.INFO['AC'][0])
hsnp_plot.hist(snp, 253, facecolor='red', alpha=0.5, edgecolor='none', label='All SNPs') #this gives ~100bp per bin
hsnp_plot.hist(h_snp, 253, facecolor='blue', alpha=0.5, edgecolor='none', label='Variable SNPs')
hsnp_plot.set_xlabel('Sequence Position, heteroplasmic SNPs. Median read depth = %s, range = %s to %s' % (np.median(snp), min(snp), max(snp)))
hsnp_plot.set_ylabel('count')
hsnp_plot.set_ylim(0, 40)
hsnp_plot.set_xlim(0, 25000)
hsnp_plot.legend()
plt.show()
Still differences!
Curioser and curioser. As is clear from the overlayed plot, most of the differences seen are not merely single site SNP's, they also have some degree of sequence variation. The most obvious dropout is the big spike around 2000, which demonstrates the need for controls: it's a sequencing error in the published H99 reference. Even better, the median read depth goes up when you toss non-variable sites.
Just to make sure there isn't some population of low read depth sites lurking, I've plotted histograms before and after culling, and scatter plots of number of alleles called vs read depth.
hist_plot = plt.figure(figsize=(18, 9))
gs = gridspec.GridSpec(1, 2, width_ratios=[1, 1])
dp_hist = plt.subplot(gs[0])
dp_hist.hist(snp_dp, 80, alpha=0.5, edgecolor='none', label='All SNPs')
dp_hist.hist(hsnp_dp, 80, alpha=0.5, edgecolor='none', label='Heteroplasmic SNPs')
# dp_hist.set_xlim(0, 10000)
dp_hist.legend()
dp_scatter = plt.subplot(gs[1])
dp_scatter.scatter(hsnp_ac, hsnp_dp)
dp_scatter.set_ylim(100, 80000)
dp_scatter.set_xlim(0, 100)
dp_scatter.set_xlabel('Number of alternate called alleles')
dp_scatter.set_ylabel('Read depth')
dp_scatter.set_yscale('log')
dp_scatter.grid(True)
plt.show()
Yes, but just how variable are they?
While interesting, the map of variable SNP's doesn't tell me how variable they are, just if they are. But, I do have the number of alternate alleles called, so I should be able to plot the SNP positions in x and y, and vary the size of the marker by how variant they are. The radii calculation may look odd, but it eliminates FreeBayes' arbitrary calling of what an "alternate" is. If there are two major alleles, saying one site is 75% one variant is the same as saying it is 25% the other. The multiplication is just a factor to get it to pop a little more.
import math
fig3 = plt.figure(figsize=(18, 5))
het_plot = fig3.add_subplot(111)
hx = []
hy = []
hr = []
ycount = 1
strain_dict = {}
for files in flist:
vcf_reader = vcf.Reader(open('../mito_vcf/%s' % files, 'r'))
record_name = files[:-19]
strain_dict[ycount]=record_name
for record in vcf_reader:
if record.INFO['AC'][0] < 100: # only want heteroplasmic sites
hx.append(record.POS)
hy.append(ycount)
radii = 3*(50-(math.fabs(record.INFO['AC'][0]-50)))
hr.append(radii) # make the radius bigger the lower the allele count
ycount = ycount+1
het_plot.scatter(hx, hy, c='purple', s=hr, alpha=0.5, edgecolor='none')
het_plot.set_ylim(0, 27)
het_plot.set_xlim(0, 25000)
het_plot.set_xlabel('Sequence Position, radius indicates percentage variation')
het_plot.set_ylabel('Strain')
plt.show()
Oooo... pretty
I will fully admit to doing a little dance when this figure rendered for the first time. Just look at it, how can any self respecting data junkie not think that it cool? There are some obvious patterns that show up. It would appear that there are both strains with high heteroplasmy (strains 10, 14 and 22) and positions with high heteroplasmy (5000, 6000 and the stretch between 15k and 20k).
Next up, it would be nice to know where these fall, so let's read in the gene annotation and make a crude histogram of where the exons are.
f = open('../H99_mitochondria.gtf', 'r')
ex = []
ey = []
for line in f: # make a histogram that equals the exon boundries
if line.split()[2] == 'exon':
count = int(line.split()[3])
stop = int(line.split()[4])
while count < stop:
ex.append(count)
ey.append(1)
count = count+1
fig4 = plt.figure(figsize=(18, 1))
exon = fig4.add_subplot(111)
exon.hist(ex, 253, facecolor='green', alpha=0.7, edgecolor='none')
exon.set_xlabel('Exons')
exon.set_ylim(0, 1)
exon.set_xlim(0, 25000)
plt.show()
If my matplotlib-fu was better, I'm sure I could create an axis with gene labels and everything. But, it ain't.
Let's put it all together, shall we?
import sys
all_fig = plt.figure(figsize=(18, 12))
gs = gridspec.GridSpec(4, 1, height_ratios=[3, 3, 5, 1])
asnp_plot = plt.subplot(gs[0])
asnp_plot.hist(snp, 253, facecolor='red', alpha=0.5, edgecolor='none', label='SNP frequency')
asnp_plot.set_xlabel('Sequence Position, all SNPs. Median read depth = %s, range = %s to %s' % (np.median(snp), min(snp), max(snp)))
asnp_plot.set_ylabel('count')
asnp_plot.set_ylim(0, 40)
asnp_plot.set_xlim(0, 25000)
asnp_plot.legend()
ahsnp_plot = plt.subplot(gs[1])
ahsnp_plot.hist(snp, 253, facecolor='red', alpha=0.5, edgecolor='none', label='all SNPs') #this gives ~100bp per bin
ahsnp_plot.hist(h_snp, 253, facecolor='blue', alpha=0.5, edgecolor='none', label='variable SNPs')
ahsnp_plot.set_xlabel('Sequence Position, heteroplasmic SNPs. Median read depth = %s, range = %s to %s' % (np.median(snp), min(snp), max(snp)))
ahsnp_plot.set_ylabel('count')
ahsnp_plot.set_ylim(0, 40)
ahsnp_plot.set_xlim(0, 25000)
ahsnp_plot.legend()
ahet_plot = plt.subplot(gs[2])
ahet_plot.scatter(hx, hy, c='purple', s=hr, alpha=0.5, edgecolor='none')
ahet_plot.set_ylim(0, 27)
ahet_plot.set_xlim(0, 25000)
ahet_plot.set_xlabel('Sequence Position, radius indicates percentage variation')
ahet_plot.set_ylabel('Strain')
aexon = plt.subplot(gs[3])
aexon.hist(ex, 253, facecolor='green', alpha=0.7, edgecolor='none')
aexon.set_xlabel('Exons')
aexon.set_ylim(0, 1)
aexon.set_xlim(0, 25000)
plt.show()
ncount = 1
print "Strain numbers with names"
print "-------------------------"
for keys in strain_dict:
if ncount % 5 == 0:
sys.stdout.write("%s) %s\n" % (keys,strain_dict[keys]))
else:
sys.stdout.write("%s) %s\t" % (keys,strain_dict[keys]))
ncount+=1
This is just everything put together, with the additional information about the strain names. So from this plot, as the last, we can tell that strains Bt206, C23 and MW_RSA852 have the greatest variability/heteroplasmy. Qualitatively, at least. I need a way to quantify things a bit better.
But wait, there's more!
So I was curious how this would be reflected in a traditional assessment of diversity. So I called a consensus sequence in samtools for all 27 strains, using the suggested command in the documentation:
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
I then aligned them all in MAFFT using the L-INS-i (more accurate) option, and loaded that alignment into MEGA5 and calculated a Maximum Likelihood tree using the default options with 1000 bootstrap replications. Below is that tree, with Bt206, C23 and MW_RSA852 highlighted. For some reason all the axis labels turned miniscule when I exported from MEGA5, but the conculsion is still obvious: while C23 and MW_RSA852 are outliers, Bt206, by straight consensus, is indistinguishable from the reference sequence, H99!