getorf

Function

Description

This program finds and outputs the sequences of open reading frames (ORFs).

The ORFs can be defined as regions of a specified minimum size between STOP codons or between START and STOP codons.

The ORFs can be output as the nucleotide sequence or as the translation.

The program can also output the region around the START codon, the initial STOP codon, or the terminating STOP codon of an ORF, for those analysing the properties of these regions.

The START and STOP codons are defined in the Genetic Code tables. A suitable Genetic Code table can be selected for the organism you are investigating.
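To make the "regions between STOP codons" definition concrete, here is a minimal Python sketch that scans the three forward frames of a sequence. It is an illustration only, not getorf's actual implementation: the function name orfs_between_stops is hypothetical, the STOP codons are assumed to be those of the Standard Code, and only the forward strand is considered.

```python
# Assumption: the three STOP codons of the Standard Code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orfs_between_stops(seq, min_size=30):
    """Return (start, end, frame) tuples for STOP-free regions.

    Positions are 1-based and inclusive; the STOP codon itself is excluded.
    Hypothetical helper for illustration, forward strand only.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        region_start = frame  # 0-based index where the current region begins
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                # The region spans 0-based [region_start, pos).
                if pos - region_start >= min_size:
                    found.append((region_start + 1, pos, frame + 1))
                region_start = pos + 3
        # Trailing region: runs to the last complete codon in this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - region_start >= min_size:
            found.append((region_start + 1, end, frame + 1))
    return found
```

For example, calling orfs_between_stops on a START codon followed by ten alanine codons and a STOP reports a 33-base region in frame 1, which passes the default minimum size of 30 bases.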

Usage

Command line arguments


Input file format

getorf reads any nucleic acid sequence specified as a USA (Uniform Sequence Address).

Output file format

The output is a sequence file containing predicted open reading frames longer than the minimum size, which defaults to 30 bases (i.e. 10 amino acids).

The name of each ORF sequence is constructed from the name of the input sequence, with an underscore character ('_') and a unique ordinal number appended. The description of each ORF sequence is constructed from the description of the input sequence, with the start and end positions of the ORF prepended.

The unique number appended to the name is used only to create new, unique sequence names. It does not convey any information about the order, position or sense strand of the ORFs.

If the ORF has been found in the reverse sense, then the start position will be larger than the end position. The numbering uses the forward-sense positions, but read in the reverse sense. For example, an output entry labelled >ECLACI_3 [465 - 49] is a reverse-sense ORF running from position 465 to 49. The description will also contain '(REVERSE SENSE)'.
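The naming and numbering rules can be sketched in Python as follows. This is an illustration under stated assumptions rather than getorf's code: the function orf_header and its parameters are hypothetical, the exact spacing and ordering of the description fields is assumed, and a reverse-sense ORF is assumed to arrive with coordinates measured on the reverse-complemented sequence.

```python
def orf_header(name, description, ordinal, start, end, seq_len, reverse=False):
    """Build an output header line for one ORF (hypothetical helper).

    start and end are 1-based positions on the strand the ORF was found on.
    For reverse-sense ORFs they are converted to forward-sense positions,
    so the reported start is larger than the reported end.
    """
    if reverse:
        # Position p on the reverse complement is seq_len - p + 1 on the
        # forward strand.
        start, end = seq_len - start + 1, seq_len - end + 1
    parts = [f"[{start} - {end}]"]
    if reverse:
        parts.append("(REVERSE SENSE)")
    if description:
        parts.append(description)
    return f">{name}_{ordinal} " + " ".join(parts)
```

With a hypothetical input sequence of 1000 bases, a reverse-complement ORF at positions 536 to 952 is reported as [465 - 49], matching the form of the example above.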

If the sequence has been specified as a circular genome (using the command-line switch '-circular'), then ORFs can potentially continue past the 'end' of the input sequence (the breakpoint of the circular genome) and into the 'start' of the sequence again. This is dealt with by appending the sequence to itself three times and reporting long ORFs that are found in this extended sequence. Any ORF that is longer than three times the sequence length (i.e. one that continues without hitting a STOP at any point in the genome) will be reported as being a maximum of three times the length of the input sequence. Note that the end position of an ORF in a circular genome can be greater than the length of the input sequence if the ORF crosses the breakpoint. If the ORF crosses the breakpoint, then the text '(ORF crosses the breakpoint)' will be added to the description of the output sequence.
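The idea of searching an extended copy of a circular sequence can be sketched as below. This is a simplified illustration, not getorf's implementation: the function circular_orfs is hypothetical, the Standard Code STOP codons are assumed, only the forward strand is scanned, and ORFs that run off the end of the extended sequence (including the STOP-free case described above) are not handled. To avoid reporting the same ORF more than once, the sketch keeps only ORFs that begin in the middle copy, where the preceding STOP is always visible, and reports positions relative to that copy.

```python
# Assumption: the three STOP codons of the Standard Code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def circular_orfs(seq, min_size=30):
    """Return (start, end, note) for STOP-to-STOP ORFs in a circular sequence.

    Positions are 1-based; end may exceed len(seq) when the ORF wraps.
    Hypothetical helper for illustration only.
    """
    n = len(seq)
    extended = seq.upper() * 3  # three copies of the input, end to end
    results = []
    for frame in range(3):
        region_start = None  # unknown until the first STOP in this frame
        for pos in range(frame, len(extended) - 2, 3):
            if extended[pos:pos + 3] not in STOP_CODONS:
                continue
            # Keep ORFs that start in the middle copy and are long enough.
            if (region_start is not None
                    and n <= region_start < 2 * n
                    and pos - region_start >= min_size):
                start = region_start - n + 1
                end = pos - n
                note = "(ORF crosses the breakpoint)" if end > n else ""
                results.append((start, end, note))
            region_start = pos + 3
    return results
```

For a 39-base circular sequence with a single STOP codon at positions 19 to 21, the sketch reports one 36-base ORF starting at position 22 and ending at position 57, that is, 18 bases past the breakpoint.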

Data files

The START and STOP codons used by getorf are defined in the Genetic Code data files. By default, Genetic Code file EGC.0 is used.

The default file EGC.0 is the 'Standard Code' with the rarely used alternate START codons omitted; it has only the normal 'AUG' START codon. The 'Standard Code' with the rarely used alternate START codons included is Genetic Code file EGC.1.

It is expected that users will sometimes wish to customise a Genetic Code file. To do this, use the program embossdata.

Notes

If you have selected one of the options to report regions around a START or STOP codon, then note that any such region that crosses the beginning or end of the sequence will not be reported.
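The boundary rule in this note can be illustrated with a short Python sketch. The function region_around_codon and its flank parameter are hypothetical names chosen for this example; they do not correspond to getorf options.

```python
def region_around_codon(seq, codon_start, flank):
    """Return the codon plus `flank` bases on each side, or None.

    codon_start is the 1-based position of the codon's first base.
    None is returned when the region would extend past either end of
    the sequence, mirroring the rule that such regions are not reported.
    """
    left = codon_start - 1 - flank       # 0-based, inclusive
    right = codon_start - 1 + 3 + flank  # 0-based, exclusive
    if left < 0 or right > len(seq):
        return None
    return seq[left:right]
```

For instance, with the sequence AAAATGCCCC and the codon at position 4, a flank of 3 bases fits inside the sequence, whereas a flank of 4 bases would start before the first base and so yields no region.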

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

'-sbegin' and '-send' do not work with this program.

Author(s)

History

Target users

Comments