This script reads alignments of query sequences to a genome, and estimates the genomic source of each part of each query. It allows different parts of one query to come from different parts of the genome (e.g. if the query crosses a rearrangement breakpoint).
These commands align the queries in "q.fastq" to "genome.fasta":
lastdb -m1111110 myDB genome.fasta lastal -Q1 -e120 myDB q.fastq | last-split-probs.py -s150 > out.maf
(The purpose of -e120 and -s150 is explained in the last-map-probs document.)
The output is in MAF(-like) format:
a score=150 mismap=0.000413 s chr21 15963638 25 + 48129895 TCAGATGAGGACCTAATTTATTACT s query7 50 25 + 75 TCAGATGAGGACCTAATTTATTACT q query7 EBEEC@CE=EEE?FEDAED5?@@D@ p !#$'BBBBBBBBBBBBBBBBBBBBB
The "mismap" is the estimated probability that this part of the query does not come from this part of the genome. The line starting with "p" indicates the probability that each base does not come from this part of the genome. It uses a compact code:
Symbol Error probability Symbol Error probability ! 0.79 -- 1 0 0.025 -- 0.032 " 0.63 -- 0.79 1 0.02 -- 0.025 # 0.5 -- 0.63 2 0.016 -- 0.02 $ 0.4 -- 0.5 3 0.013 -- 0.016 % 0.32 -- 0.4 4 0.01 -- 0.013 & 0.25 -- 0.32 5 0.0079 -- 0.01 ' 0.2 -- 0.25 6 0.0063 -- 0.0079 ( 0.16 -- 0.2 7 0.005 -- 0.0063 ) 0.13 -- 0.16 8 0.004 -- 0.005 * 0.1 -- 0.13 9 0.0032 -- 0.004 + 0.079 -- 0.1 : 0.0025 -- 0.0032 , 0.063 -- 0.079 ; 0.002 -- 0.0025 - 0.05 -- 0.063 < 0.0016 -- 0.002 . 0.04 -- 0.05 = 0.0013 -- 0.0016 / 0.032 -- 0.04 > 0.001 -- 0.0013
Other symbols indicate lower error probabilities, and "~" is the lowest possible. In general:
Error probability <= 10 ^ -((ASCII value - 33) / 10)
The "mismap" is simply the lowest probability from the "p" line.
-h, --help Show a help message, with default option values. -m M, --mismap=M Don't write alignments with mismap probability > M. -s S, --score=S Don't write alignments with score < S. -b P, --break-prob=P Assume a breakpoint probability per base of P. -n, --no-split Don't write split-alignments. Instead, write the original alignments, annotated with "p" lines and mismap probabilities.
The "typical usage" discards alignment parts that are unreliable (low score or high mismap probability). To see a complete split-alignment for each query, do this:
lastal -Q1 -e120 myDB q.fastq | last-split-probs.py -m1 > out.maf
It is a bit slow. We hope to develop a faster version, but this version is already useful.