It can read in the result of a previous compseq analysis and use this to set the expected frequencies of the subsequences.
Unless you tell 'compseq' otherwise, it expects every word to be equally likely. For nucleotide sequences, the 'Expected' frequency of any dimer is therefore 1/16 - this is simply the inverse of the number of possible dimers (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT).
Similarly, the 'Expected' frequency of any trimer is 1/64, etc.
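The default 'Expected' frequency described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not compseq's own code; the function name and the default alphabet size of 4 (nucleotides) are assumptions made for the example.

```python
from fractions import Fraction


def uniform_expected_frequency(word_size: int, alphabet_size: int = 4) -> Fraction:
    """Expected frequency of any one word when all words are equally likely.

    alphabet_size=4 assumes nucleotide sequences (A, C, G, T).
    """
    if word_size < 1:
        raise ValueError("word_size must be at least 1")
    # The number of possible words is alphabet_size ** word_size;
    # the uniform expected frequency is its inverse.
    return Fraction(1, alphabet_size ** word_size)


print(uniform_expected_frequency(2))  # 1/16 for dimers
print(uniform_expected_frequency(3))  # 1/64 for trimers
```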
Obviously this is not the case in real sequences - there will be bias in favour of some words.
Compseq cannot otherwise guess what the 'Expected' frequency is. You can, however, tell it what the Expected frequencies are by giving compseq the output of the analysis of another set of sequences, produced by a previous compseq run.
So you take a set of sequences that are representative of the type of sequence you expect and you run compseq on it to get your expected sequence frequencies.
You then run compseq on the sequences you wish to investigate, giving it the expected frequencies that you established above. You tell compseq which file holds the expected frequencies by specifying it with '-infile filename' on the command-line.
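The two-step procedure above might be scripted roughly as follows. This sketch only builds and prints the two command lines. The file names are hypothetical, and the '-sequence', '-word' and '-outfile' qualifiers are assumptions that should be checked against 'compseq -help' on your installation; only '-infile' is taken from the text above.

```python
import shlex


def compseq_command(sequence, word_size, outfile, expected_file=None):
    """Build a compseq argument list (assumed qualifiers, see the note above)."""
    cmd = ["compseq", "-sequence", sequence,
           "-word", str(word_size), "-outfile", outfile]
    if expected_file is not None:
        # Use a previous compseq output as the source of expected frequencies.
        cmd += ["-infile", expected_file]
    return cmd


# Step 1: analyse a representative set of sequences (hypothetical file names).
reference_run = compseq_command("reference.fasta", 2, "reference.comp")
# Step 2: analyse the sequences of interest against those frequencies.
# The word size must be the same in both runs.
study_run = compseq_command("study.fasta", 2, "study.comp",
                            expected_file="reference.comp")

for cmd in (reference_run, study_run):
    print(shlex.join(cmd))
```

To actually run the commands, each list could be passed to subprocess.run(cmd, check=True).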
Header information and comments are preceded by a '#' character at the start of the line.
The Word size and the Total count are then given on separate lines.
The headers of the columns of results are preceded by a '#'.
The results columns are: the sub-sequence word, the observed frequency, the expected frequency (which is read from the input file if one is given, otherwise it is simply the inverse of the number of words of the specified size that can be constructed), and the ratio of the observed to the expected frequency.
After a blank line at the end, the result for 'Other' words is given - this is the number of words whose sequence contains IUPAC ambiguity codes or other unusual characters.
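A file laid out as described above could be read with a small parser like the one below. This is a sketch under assumptions: it treats the columns as whitespace-separated with the word first, and the sample report is invented for illustration, so check both against a real compseq output file before relying on it.

```python
def parse_compseq_report(text):
    """Parse a compseq-style report into a dictionary (illustrative only)."""
    report = {"word_size": None, "total_count": None, "words": {}, "other": None}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and column headers
        fields = line.split()
        if line.startswith("Word size"):
            report["word_size"] = int(fields[-1])
        elif line.startswith("Total count"):
            report["total_count"] = int(fields[-1])
        elif fields[0] == "Other":
            report["other"] = [float(x) for x in fields[1:]]
        else:
            # First field is the word; the remaining fields are numeric columns.
            report["words"][fields[0]] = [float(x) for x in fields[1:]]
    return report


# Hypothetical sample data, not real compseq output.
SAMPLE = """\
# Hypothetical report laid out as described above
Word size 2
Total count 100
# Word  Obs  Exp  Obs/Exp
AA 0.125 0.0625 2.0
AC 0.0625 0.0625 1.0

Other 0 0 0
"""

report = parse_compseq_report(SAMPLE)
print(report["word_size"], report["total_count"])
print(report["words"]["AA"])
```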
The input data file format is exactly the same as the output file format.
It expects to read in a previous output file of this program. An error is produced if the word size of the current compseq job and that of the output file being read in are different.
You can produce very large output files if you choose large values of wordsize.
If you run out of memory, the program may crash with an error message that depends on your machine's operating system, but will probably be a warning about writing to memory that the program does not own (e.g. "Segmentation fault" on a Solaris machine).
This is not a bug; it is a feature of the way this program grabs large amounts of memory.
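To get a feel for how quickly the number of possible words grows with the word size, the following sketch prints the count for a few word sizes. It assumes a four-letter nucleotide alphabet and is an estimate of scale only, not a measurement of compseq's actual output size or memory use.

```python
def number_of_words(word_size, alphabet_size=4):
    """Number of distinct words of the given size (4 letters assumed: A, C, G, T)."""
    return alphabet_size ** word_size


for k in (2, 4, 8, 12):
    print(f"word size {k:2d}: {number_of_words(k):>10,d} possible words")
```

For protein sequences the same function can be called with alphabet_size=20, which grows far faster.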