Using FreeBayes to measure mitochondrial heteroplasmy

Background

I have genomic data collected on 10 strains of Cryptococcus neoformans, a human fungal pathogen. It was collected a few years ago, and is, in a word, terrible. It is SOLiD3 or SOLid4 50bp single end data, 20M-30M reads per sample, but a best half survive quality trimming. In an attempt to salvage something out of the data, I noticed that while coverage of the genome was pretty awful, coverage of the mitochondria wasn't bad. Even better, while mapping is very sensitive to using the right variant (var neoformans, common name JEC21 vs var grubii, common name H99), the mitochondrial sequence of H99 appears good enough to map either. To double check, I wondered if NCBI had published their raw data for the JEC21 sequence. I didn't find it, but what I did find was something even better, a veritable gold mine: 27 strains of C. neoformans, all determined to at least 500X using a combination of paried end and single end Illumina.

This is an odd case of open science, in that the Broad Institute has deposited the raw read data and I can't find a manuscript to go with it! I've contacted a person at Duke who I think is working on it, but so far (two weeks later) have received no response. As expected from Broad, the data is amazing, and the proper controls have been done, including resequencing the reference strain, H99. Out of curiousity, I decided to download a read set and try mapping it in order to get some experience working with good data. On pulling up the SAM files in Geneious, I saw something kind of odd: it was obvious that certain positions in the mitochondria contained multiple valid reads for different sequences. So I did the logical next step; ask someone on twitter for help:

Ok, it wasn't just anyone. It still blows me away that Jonathan is so generous with his time over twitter. Anyway, with that simple exchange, a new project (to me, remember I have no idea what the folks at Broad are doing with this data) was born. I hope to address what mitochondrial heteroplasmy is and means (if anything) in the next post, but for the time being I just wanted to know it it existed in significant quantity in C. neoformans. It would be tempting to think so, as mating in this fungi consist of nuclear fusion, but surprisingly... no. Even more strange (to me), mating factor type seems to determine which population "wins" after nuclear fusion. The caveat to those two papers is that they weren't measured by high coverage sequencing, and I know what I saw.

After the fold are the programs/strategies for manipulating the raw read data into a Variant Call Format file. My choice of programs are somewhat arbitrary. FreeBayes has a "ploidy" and "pooled" option that seem to do what I want: treat the mitochondria like a population of bacteria. I used Novoalign because I had a free 30 day trial and wanted to put it through its paces with some good data, and Mick Watson sings its praises re:accuracy, which seems relatively important for this analysis. The rest is my kludgy python. This ipython notebook is saved to my public github repository, found here.

Data file manipulation

First I needed to get the data, all 100+ GB of it.

  • downloaded the entire SRP015886 experiment runs and converted them to FASTQ using fastq-dump in a bash script.
  • Ran Trimmomatic on all with the following options (bash script as well):

Paired end:

java -classpath /usr/local/Trimmomatic-0.22/trimmomatic-0.22.jar \
    org.usadellab.trimmomatic.TrimmomaticPE -threads 8 -phred33 -trimlog \
    $dir.trim.log $dir\_1.fastq.gz $dir\_2.fastq.gz \
        $dir\_paired_1.fq.gz $dir\_unpaired_1.fq.gz \
        $dir\_paired_2.fq.gz $dir\_unpaired_2.fq.gz \
    ILLUMINACLIP:../illuminaClip.fa:2:40:15 LEADING:15 \
    TRAILING:15 SLIDINGWINDOW:4:20 MINLEN:36

Single end:

java -classpath /usr/local/Trimmomatic-0.22/trimmomatic-0.22.jar \
    org.usadellab.trimmomatic.TrimmomaticSE -threads 8 -phred33 -trimlog \
    $dir.trim.log $dir.fastq.gz \
    $dir\_trimmed.fq.gz  \
    ILLUMINACLIP:../illuminaClip.fa:2:40:15 LEADING:15 \
    TRAILING:15 SLIDINGWINDOW:4:20 MINLEN:36

Downloaded the SRP015886.xml from the ENA at EBI (note to self, always get read data from ENA when possible)

These were also all done before I figured out that variables are share between cells in iPython notebooks, so reading in the XML file every time in order to do a new analysis is totally uneccesary. But changing it after the fact would be a waste of time.

Ran novoindex on the H99.fasta file

Fasta download from Broad for both genome and mitochondria, cat'ed together.

novoindex H99.nix H99.fasta

Wrote python blurb to create bash script

Reads in the XML, looks for all the runs, then extracts read length and sdev when appropriate. Novoalign can use read length and sdev. Mapped all the read files seperately, as Novoalign suggests doing that and then merging the SAM files.
As it will take 2-3 days, running mapping from within python notebook seemed like a bad idea, hence the scripts.

In [107]:
import xml.etree.ElementTree as ET
tree = ET.parse('SRP015886.xml')
root = tree.getroot()

f = open('align_all.sh', 'w')
f.write("#!/bin/bash\n")

for child in root:
    ID = child.find('IDENTIFIERS').find('PRIMARY_ID').text
    try:
        paired = child.find('DESIGN').find('LIBRARY_DESCRIPTOR').find('LIBRARY_LAYOUT').find('PAIRED')
        pdev = int(float(paired.attrib['NOMINAL_SDEV']))
        plength = int(float(paired.attrib['NOMINAL_LENGTH']))
    except AttributeError:
        pdev = 0
        plength = 0
    links = child.find('EXPERIMENT_LINKS')
    for exps in links:
        if exps.find('XREF_LINK').find('DB').text == 'ENA-RUN':
            runs = exps.find('XREF_LINK').find('ID').text.split(',')
            for lanes in runs:
                if pdev:
                    f.write("cd %s\n" % (lanes))
                    f.write("novoalign -o SAM -d ../H99.nix -f %s_paired_1.fq.gz %s_paired_2.fq.gz -i %i,%i > %s_H99.SAM \n" % \
                            (lanes,lanes,plength,pdev,lanes))
                    f.write("cd ..\n")
                else:
                    f.write("cd %s\n" % (lanes))
                    f.write("novoalign -o SAM -d ../H99.nix -f %s_trimmed.fq.gz > %s_H99.SAM \n" % \
                            (lanes,lanes))
                    f.write("cd ..\n")
f.close()

Screwed up and missed SRR646261-SRR646262, should be SRR646261,SRR646262.

For reasons that aren't clear, certain runs are denoted in the XML file as a range (SRR646261-SRR646262 in this example, but there are others), while most are a list.
Had to run from that point on. Now need to make a list of samples and their associated SAM/BAM files.

In [120]:
import xml.etree.ElementTree as ET
tree = ET.parse('SRP015886.xml')
root = tree.getroot()

samples = {}

for child in root:
    ID = child.find('DESIGN').find('SAMPLE_DESCRIPTOR').attrib['refname']
    cellsplit = ID.split(' ')
    sample = cellsplit[len(cellsplit)-1].strip()
    links = child.find('EXPERIMENT_LINKS')
    for exps in links:
        if exps.find('XREF_LINK').find('DB').text == 'ENA-RUN':
            runs = exps.find('XREF_LINK').find('ID').text.split(',')
    if samples.has_key(sample):
        for names in runs:
            samples[sample].append(names)
    else:
        samples[sample]=[]
        for names in runs:
            samples[sample].append(names)

f = open('sam_merge.sh', 'w')
f.write("#!/bin/bash\n")
for keys in samples:
    f.write("mkdir %s\n" % keys)
    sams = []
    for dirs in samples[keys]:
        sams.append(dirs+".bam")
        f.write("cd %s\n" % dirs)
        f.write("samtools view -bS %s_H99.SAM > %s_H99.bam\n" % (dirs,dirs))
        f.write("rm %s_H99.SAM\n" % dirs)
        f.write("ln -s %s_H99.bam ../%s/%s.bam\n" % (dirs,keys,dirs))
        f.write("cd ..\n")
    f.write("cd %s\n" % keys)
    f.write("samtools merge %s.bam %s\n" % (keys,' '.join(sams)))
    f.write("samtools sort %s.bam %s.sorted\n" % (keys,keys))
    f.write("samtools index %s.sorted.bam\n" % keys)
    f.write("cd ..\n")
f.close()

Damnit, apparently that link command doesn't work, need to write another cleanup script.
I've also noticed that the forum suggested sorting, then merging. This merges, then sorts. Hrm.
The new BAM files are named from the XML, by strain name.

In [123]:
import xml.etree.ElementTree as ET
tree = ET.parse('SRP015886.xml')
root = tree.getroot()

samples = {}

for child in root:
    ID = child.find('DESIGN').find('SAMPLE_DESCRIPTOR').attrib['refname']
    cellsplit = ID.split(' ')
    sample = cellsplit[len(cellsplit)-1].strip()
    links = child.find('EXPERIMENT_LINKS')
    for exps in links:
        if exps.find('XREF_LINK').find('DB').text == 'ENA-RUN':
            runs = exps.find('XREF_LINK').find('ID').text.split(',')
    if samples.has_key(sample):
        for names in runs:
            samples[sample].append(names)
    else:
        samples[sample]=[]
        for names in runs:
            samples[sample].append(names)

f = open('sam_cleanup.sh', 'w')
f.write("#!/bin/bash\n")
for keys in samples:
    sams = []
    for dirs in samples[keys]:
        sams.append(dirs+"_H99.bam")
    f.write("cd %s\n" % keys)
    f.write("samtools merge %s.bam %s\n" % (keys,' '.join(sams)))
    f.write("samtools sort %s.bam %s.sorted\n" % (keys,keys))
    f.write("samtools index %s.sorted.bam\n" % keys)
    f.write("cd ..\n")
f.close()

Running Freebayes with ploidy 4

Before finding out about a different strategy below, tried --ploidy 4

In [125]:
import xml.etree.ElementTree as ET
tree = ET.parse('SRP015886.xml')
root = tree.getroot()

samples = {}

for child in root:
    ID = child.find('DESIGN').find('SAMPLE_DESCRIPTOR').attrib['refname']
    cellsplit = ID.split(' ')
    sample = cellsplit[len(cellsplit)-1].strip()
    links = child.find('EXPERIMENT_LINKS')
    for exps in links:
        if exps.find('XREF_LINK').find('DB').text == 'ENA-RUN':
            runs = exps.find('XREF_LINK').find('ID').text.split(',')
    if samples.has_key(sample):
        for names in runs:
            samples[sample].append(names)
    else:
        samples[sample]=[]
        for names in runs:
            samples[sample].append(names)

f = open('variant_call.sh', 'w')
f.write("#!/bin/bash\n")
for keys in samples:
    f.write("cd %s\n" % keys)
    f.write("samtools view -b %s.sorted.bam H99_mitochondria > %s_mito.bam\n" % (keys,keys))
    f.write("samtools index %s_mito.bam\n" % keys)
    f.write("freebayes -0 -b %s_mito.bam -f ../H99_mitochondria.fasta -v %s_mito_ploidy4.vcf --ploidy 4 --min-alternate-fraction 0.01 --max-complex-gap 50\n" % (keys,keys))
    f.write("cd ..\n")
f.close()

Different FreeBayes run

According to this wiki at UTexas), mixed bacteria can be tested using --ploidy 100 and --pooled. This matches what I've been thinking in the first place: just treat all the mitochondria in the cell as a mixed population of bacteria.

In [126]:
import xml.etree.ElementTree as ET
tree = ET.parse('SRP015886.xml')
root = tree.getroot()

samples = {}

for child in root:
    ID = child.find('DESIGN').find('SAMPLE_DESCRIPTOR').attrib['refname']
    cellsplit = ID.split(' ')
    sample = cellsplit[len(cellsplit)-1].strip()
    links = child.find('EXPERIMENT_LINKS')
    for exps in links:
        if exps.find('XREF_LINK').find('DB').text == 'ENA-RUN':
            runs = exps.find('XREF_LINK').find('ID').text.split(',')
    if samples.has_key(sample):
        for names in runs:
            samples[sample].append(names)
    else:
        samples[sample]=[]
        for names in runs:
            samples[sample].append(names)

f = open('freebayes100.sh', 'w')
f.write("#!/bin/bash\n")
for keys in samples:
    f.write("cd %s\n" % keys)
    f.write("freebayes -b %s_mito.bam -f ../H99_mitochondria.fasta -v %s_mito_ploidy100.vcf --ploidy 100 --pooled --min-alternate-total 3 --max-complex-gap 50\n" % (keys,keys))
    f.write("cd ..\n")
f.close()

OK! Now I have 27 vcf files, I need a way to visualize what is going on. Tomorrow I will post my visualization of the data in iPython notebook and matplotlib.

In []:

Pages

Categories

Tags