Aligning bisulfite-converted DNA reads to a genome

Bisulfite is used to detect methylated cytosines. It converts unmethylated Cs to Ts, but it leaves methylated Cs intact. If we then sequence the DNA and align it to a reference genome, we can infer cytosine methylation.

To align the DNA accurately, we should take the C->T conversion into account. Here is how to do it with LAST.

Let's assume we have bisulfite-converted DNA reads in a file called "reads.fastq" (in fastq-sanger format), and the genome is in "mygenome.fa" (in fasta format). We will also assume that all the reads are from the converted strand, and not its reverse-complement (i.e. they have C->T conversions and not G->A conversions).

First, we need to run lastdb twice, for forward-strand and reverse-strand alignments:

lastdb -u bisulfite_f.seed my_f mygenome.fa
lastdb -u bisulfite_r.seed my_r mygenome.fa

Then find alignments, one strand at a time:

lastal -p bisulfite_f.mat -s1 -Q1 -e120 my_f reads.fastq > temp_f
lastal -p bisulfite_r.mat -s0 -Q1 -e120 my_r reads.fastq > temp_r

Take care not to mistype these commands, especially f/r and s1/s0. Finally, merge the alignments and estimate which one represents the genomic source of each read:

last-merge-batches.py temp_f temp_r | last-map-probs.py -s150 > myalns.maf

These commands refer to files (bisulfite_f.seed etc), which are in the examples directory. You need to specify exactly where they are (e.g. "-u examples/bisulfite_f.seed").

Avoiding biased methylation estimates

Imagine that one genomic cytosine is methylated in 50% of cells in your sample, so that 50% of reads covering it have C and 50% have T. It is possible that the reads with C are easier to align, so we align more of them. The methylation rate would then look higher than 50%.

We can avoid this bias by converting all Cs in the reads to Ts, before aligning them:

perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_f - > temp_f
perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_r - > temp_r

This perl command assumes that the fastq file has uppercase sequences, and no line-wrapping or blank lines. It converts Cs to lowercase Ts: lowercase has no effect on the alignment, but it lets you see where the Cs were in the output.

Tips

Paired-end DNA reads

Let us assume that reads1.fastq are all from the converted strand (i.e. they have C->T conversions) and reads2.fastq are all from the reverse-complement (i.e. they have G->A conversions). It gets a bit complicated:

lastdb -u bisulfite_f.seed my_f mygenome.fa
lastdb -u bisulfite_r.seed my_r mygenome.fa

lastal -p bisulfite_f.mat -s1 -Q1 -e120 -i1 my_f reads1.fastq > t1f
lastal -p bisulfite_r.mat -s0 -Q1 -e120 -i1 my_r reads1.fastq > t1r
last-merge-batches.py t1f t1r > t1

lastal -p bisulfite_f.mat -s0 -Q1 -e120 -i1 my_f reads2.fastq > t2f
lastal -p bisulfite_r.mat -s1 -Q1 -e120 -i1 my_r reads2.fastq > t2r
last-merge-batches.py t2f t2r > t2

last-pair-probs.py t1 t2 > results.maf

Alternatively, you can ignore the pairing and align all the reads as if they were unpaired.