last-split-probs

This script reads alignments of query sequences to a genome, and estimates the genomic source of each part of each query. It allows different parts of one query to come from different parts of the genome (e.g. if the query crosses a rearrangement breakpoint).

Typical usage

These commands align the queries in "q.fastq" to "genome.fasta":

lastdb -m1111110 myDB genome.fasta
lastal -Q1 -e120 myDB q.fastq | last-split-probs.py -s150 > out.maf

(The purpose of -e120 and -s150 is explained in the last-map-probs document.)

Output

The output is in MAF(-like) format:

a score=150 mismap=0.000413
s chr21  15963638 25 + 48129895 TCAGATGAGGACCTAATTTATTACT
s query7       50 25 +       75 TCAGATGAGGACCTAATTTATTACT
q query7                        EBEEC@CE=EEE?FEDAED5?@@D@
p                               !#$'BBBBBBBBBBBBBBBBBBBBB

The "mismap" is the estimated probability that this part of the query does not come from this part of the genome. The line starting with "p" indicates the probability that each base does not come from this part of the genome. It uses a compact code:

Symbol Error probability Symbol Error probability
! 0.79 -- 1 0 0.025 -- 0.032
" 0.63 -- 0.79 1 0.02 -- 0.025
# 0.5 -- 0.63 2 0.016 -- 0.02
$ 0.4 -- 0.5 3 0.013 -- 0.016
% 0.32 -- 0.4 4 0.01 -- 0.013
& 0.25 -- 0.32 5 0.0079 -- 0.01
' 0.2 -- 0.25 6 0.0063 -- 0.0079
( 0.16 -- 0.2 7 0.005 -- 0.0063
) 0.13 -- 0.16 8 0.004 -- 0.005
* 0.1 -- 0.13 9 0.0032 -- 0.004
+ 0.079 -- 0.1 : 0.0025 -- 0.0032
, 0.063 -- 0.079 ; 0.002 -- 0.0025
- 0.05 -- 0.063 < 0.0016 -- 0.002
. 0.04 -- 0.05 = 0.0013 -- 0.0016
/ 0.032 -- 0.04 > 0.001 -- 0.0013

Other symbols indicate lower error probabilities, and "~" is the lowest possible. In general:

Error probability <= 10 ^ -((ASCII value - 33) / 10)

The "mismap" is simply the lowest probability from the "p" line.

Options

-h, --help Show a help message, with default option values.
-m M, --mismap=M
 Don't write alignments with mismap probability > M.
-s S, --score=S
 Don't write alignments with score < S.
-b P, --break-prob=P
 Assume a breakpoint probability per base of P.
-n, --no-split Don't write split-alignments. Instead, write the original alignments, annotated with "p" lines and mismap probabilities.

Alternative usage

The "typical usage" discards alignment parts that are unreliable (low score or high mismap probability). To see a complete split-alignment for each query, do this:

lastal -Q1 -e120 myDB q.fastq | last-split-probs.py -m1 > out.maf

Details

Limitations

It is a bit slow. We hope to develop a faster version, but this version is already useful.