Do It Yourself Annotator - diya.pm
1.0
A simple diya script:
use diya;
$pipeline = diya->new;
$pipeline->read_conf;
$pipeline->run;
The script can be run like this:
diya-script.pl --conf diya.conf seq1.fa seq2.fa ...
diya is an open source tool used to build annotation pipelines. A pipeline is a series of steps linking the various stages of sequence annotation into a concise process. The software is designed to use sequences as input. These could be complete genomes or the result of shotgun sequencing of a genome library. A possible output would be a fully annotated genomic sequence in Genbank format.
You can also use this Genbank file as input and load GFF into a backend database for viewing with tools like GBrowse.
A pipeline may be executed on a single computer or on a cluster. On a cluster, diya currently supports only the Sun Grid Engine (SGE) platform.
All diya pipelines are made up of parser or script steps that are executed in a specific order. The details on the parser and script steps for a given pipeline are contained in a single XML configuration, or conf, file.
The diya.pm Perl module is the controller module for a diya annotation pipeline. This module reads the configuration file that describes the pipeline, executes each step in the pipeline, and launches specific parser modules when required. It also keeps track of the input and output files and keeps all these files in a single output directory.
A parser step performs the analysis in the pipeline. For every diya parser step there is a bioinformatics application that produces output and a corresponding Perl module that parses the application output and creates an annotated Genbank file. A parser step can act at any time in the pipeline.
A script step in diya is simpler than a parser step. Its output is not parsed, so it does not require a corresponding Perl module. A script step may do something like move a file, format a database, or send an alert. A script step can act at any time in the pipeline.
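This is a rough description of how a pipeline works in the diya.pm code:
The new() method creates the pipeline object.
The read_conf() method reads the configuration file.
When the run() method is called, it will iterate over all the steps in the pipeline.
If there is more than one input sequence, run() starts again with the next sequence.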
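The control flow described above can be pictured as two nested loops. This is only a toy model in plain Perl, not code taken from diya.pm: run_step() is a hypothetical placeholder that does not call any real application, and the step and file names are borrowed from the examples in this document.

```perl
use strict;
use warnings;

# Toy model of the control flow only; run_step() is a hypothetical
# placeholder and does not execute any application.
sub run_step {
    my ($step, $file) = @_;
    return "$step($file)";
}

my @order      = qw(tRNAscanSE glimmer blastall);   # as in an <order> section
my @inputfiles = qw(seq1.fa seq2.fa);               # as given on the command line

my @log;
for my $file (@inputfiles) {    # the run starts again for each sequence
    for my $step (@order) {     # every step is visited in the given order
        push @log, run_step($step, $file);
    }
}
print "$_\n" for @log;
```

All steps complete for seq1.fa before any step is started for seq2.fa.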
A good way to watch what diya is doing is to run it with verbose set to 1. For example:
diya.pl --conf diya.conf --verbose 1
Installation details are in the INSTALL file. diya uses BioPerl, and you will also need to install some other Perl modules from CPAN.
Consider setting the $DIYAHOME environment variable. By default diya uses this directory when it looks for a diya configuration file and when it creates output directories. If you do not have this set, make sure to tell diya where your configuration file is using -conf or -use_conf; see more about this below.
This package comes with a number of test scripts in the t/ directory that run automatically if you type:
>perl Makefile.PL
>make
>make test
Most of the test scripts run bioinformatics applications, specifically blastall, formatdb, tRNAscan-SE, and glimmer3. The scripts are written such that they will skip many tests if these applications are not found in /usr/local/bin. If you want to run these tests and you have these applications installed then you may need to edit the *conf files found in t/data to enter the correct paths.
Most of the information about the pipeline is stored in a configuration file in XML format. The configuration file that comes with the package is called diya.conf but you can create your own configuration files and call them whatever you want. There are also example *conf files in the t/data and examples directories in this package.
The configuration file contains different sections. These sections can appear in any order in the file.
<?xml version="1.0" encoding="UTF-8"?>
<conf>
  <script>
    <name>download</name>
    <executable>download-genome.pl</executable>
    <command>-id MYID -out OUTPUTFILE</command>
    <home>/Users/bosborne/diya/diya/branches/0.4.0/examples</home>
    <inputfrom></inputfrom>
  </script>
  <parser>
    <name>blastp</name>
    <executable>blastall</executable>
    <command>-i INPUTFILE -d MYDB -p blastp -o OUTPUTFILE</command>
    <home>/usr/local/bin</home>
    <inputformat>fasta</inputformat>
    <inputfrom>download</inputfrom>
  </parser>
  <run>
    <mode>serial</mode>
  </run>
  <order>
    <names>download blastp</names>
  </order>
</conf>
You might run the pipeline using this configuration file like this:
diya.pl -conf download-blastp.conf -set MYDB=/opt/gb/at.fa -set MYID=3
The order section tells diya which script and parser steps to run and in what order. The names of the steps are separated by spaces. For example:
<order>
  <names>tRNAscanSE glimmer blastall</names>
</order>
You will see below that every parser or script section has a line for its name, like this:
<name>tRNAscanSE</name>
You use that same name in the names line. This means that there has to be a corresponding parser or script section for each name in the names section.
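The rule above can be checked mechanically before a pipeline is run. The following is a minimal sketch in plain Perl, not code taken from diya.pm: the %conf hash is a hypothetical stand-in for a parsed configuration file, and missing_steps() is an illustrative helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed conf file (not diya's internal structure).
my %conf = (
    parser => [ { name => 'tRNAscanSE' }, { name => 'glimmer' } ],
    script => [ { name => 'download' } ],
    order  => { names => 'download tRNAscanSE glimmer blastall' },
);

# Return the names in the order section that have no matching
# parser or script section.
sub missing_steps {
    my ($conf) = @_;
    my %defined = map { ($_->{name} => 1) }
                  @{ $conf->{parser} || [] }, @{ $conf->{script} || [] };
    return grep { !$defined{$_} } split ' ', $conf->{order}{names};
}

my @missing = missing_steps(\%conf);
print "No section found for: @missing\n" if @missing;
```

Here blastall is listed in the names line but has no section of its own, so it is the only name reported.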
The run section tells diya whether to run the pipeline on a cluster or not. For example:
<run>
  <mode>serial</mode>
</run>
You can set diya to run in sge or serial mode. These are the only 2 possible values.
A parser section describes a parser step. An example for the application tRNAscan-SE:
<parser>
  <executable>tRNAscan-SE</executable>
  <home>/usr/local/bin</home>
  <command>-B -o OUTPUTFILE INPUTFILE</command>
  <name>tRNAscanSE</name>
  <inputformat>fasta</inputformat>
  <inputfrom></inputfrom>
</parser>
executable: The name of the application. This should be the actual name, not a synonym. Required.
home: The directory where the application is found. Required.
command: The command that has to be run, without the application name. Note that the file names in the command do not have to be real file names. Instead you can use INPUTFILE and OUTPUTFILE in place of actual input and output file names. See more on this in WRITING YOUR OWN conf* FILES.
name: The arbitrary name for the step. It does not have to be the same as the executable but if the step is a parser then this has to be the same as the name of the Perl module that parses the executable output. The only rule is that no punctuation or spaces are allowed in the name. For example, a name could be tRNAscanSE, but not tRNAscan-SE (the reason for this is that spaces and punctuation are not allowed in a Perl module name). Required.
In addition, you may want to have different steps in a pipeline that use the same application or script, but in different ways. This way you can assign a different name to each of these steps.
inputformat: The sequence format for the input file. Optional. If no inputformat is set then fasta format is assumed.
If inputformat is set then diya will determine the format of the input file for the given step. If this format is different from the inputformat of the step then diya will create a new file of the correct format and make it the new input file for the step.
inputfrom: Optional. Use this if you want the output file from one parser or script step to be used as the input file for another parser or script step. For example, if the input file for 'stepA' should be created by 'stepB' do this:
<parser>
  <name>stepA</name>
  <inputfrom>stepB</inputfrom>
  <executable>mixmaster</executable>
  <home>/usr/local/bin</home>
  <command>INPUTFILE</command>
  <inputformat></inputformat>
</parser>
If you do not specify inputfrom for any step then it is assumed that the input file comes from the command-line. For example, if you run diya like this:
diya-script.pl --conf diya.conf seq1.fa
then the input file will be 'seq1.fa' when there is no inputfrom for a given step.
A script step simply executes and its output is not parsed. For example, you may need to copy a sequence file from some location before running a pipeline. Or you may want to send the pipeline output somewhere or send an email alert after the pipeline is done. You would write script steps for these purposes. An example:
<script>
  <name>formatdb</name>
  <executable>formatdb.sh</executable>
  <command>INPUTFILE</command>
  <home>/Users/bosborne/diya/branches/0.4.0/examples</home>
  <inputfrom>extractCDS</inputfrom>
</script>
executable: The name of the script. This should be the actual name, not a synonym. Required.
name: The name for this step in the pipeline. Required.
home: The directory where the script is found. Required.
command: The command that has to be run, without the executable name. Optional.
inputfrom: Optional. Use this if you want the output file from a parser or script step to be used as the input file for a script step.
You can pass options to your pipeline object when you create it with the new() method.
When you set verbose to 1 the pipeline object will print out useful diagnostic messages. Set verbose to true like this:
my $pipeline = diya->new(-verbose => 1)
Setting verbose is optional; the default value is 0 (false).
The possible values for mode are serial and sge. Set mode like this:
my $pipeline = diya->new(-mode => 'sge')
Setting mode is optional; the default value is serial.
The use_conf option specifies the configuration file for the pipeline. The conf file can have any name as long as it has the correct format. An example:
my $pipeline = diya->new( -use_conf => "~/myconf.conf" )
Setting use_conf is optional. If it is not set then diya will look for a file named diya.conf in your $DIYAHOME directory or in the current working directory.
The outputdir option specifies the output directory for the pipeline. An example:
my $pipeline = diya->new( -outputdir => "~/myfiles" )
Setting outputdir is optional. If it is not set then diya will create an output directory in your $DIYAHOME directory using a timestamp, for example "2008-06-29-11:35:38-diya".
You will run diya using a fairly simple script since most of the details are in the configuration file. The diya.pl script that comes with diya is an example.
diya scripts are run like this:
% diya.pl [options] [input files]
Set the verbosity level, 0 or 1.
diya.pl --verbose 1
Run the batch in serial mode, or sge mode if SGE is available.
diya.pl --mode serial
Use the given conf file. If you use this option, the given conf file will be used and any conf file specified in the new() method will be ignored.
diya.pl --conf new-diya.conf
Set the output directory.
diya.pl --outputdir /tmp/mypipeline
You can also modify your commands dynamically from the command-line. For example, you might want to run blastall and create an output file with a specific name. Here is an example blastall command from a *conf file:
<command>-p blastp -d ran.fa -i INPUTFILE -o MYOUTPUTFILE</command>
You could run diya.pl like this:
diya.pl --set MYOUTPUTFILE=blastp.out
And an output file called blastp.out would be created.
You can add these "wild card" words anywhere you want to in the command line of the *conf file. The only rule is that you should not use the words INPUTFILE, OUTPUTFILE, or OUTPUTDIR, because these are already used by diya. One way to make sure your "wild card" is unique is to prefix it with 'MY'. We also suggest capitalizing these words, for clarity.
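The substitution described above can be sketched in a few lines of plain Perl. This is an illustration under stated assumptions, not the code in diya.pm: fill_wildcards() is a hypothetical helper, and the hash passed to it stands in for the NAME=VALUE pairs given with --set.

```perl
use strict;
use warnings;

# Replace each wild card word in a command string with its value.
# The $values hash reference stands in for --set NAME=VALUE pairs
# (hypothetical helper, not part of diya).
sub fill_wildcards {
    my ($command, $values) = @_;
    # These words are reserved by diya, so they are never replaced here.
    my %reserved = map { ($_ => 1) } qw(INPUTFILE OUTPUTFILE OUTPUTDIR);
    for my $word (keys %$values) {
        next if $reserved{$word};
        # \b keeps MYOUTPUTFILE from being confused with OUTPUTFILE.
        $command =~ s/\b\Q$word\E\b/$values->{$word}/g;
    }
    return $command;
}

my $command = '-p blastp -d ran.fa -i INPUTFILE -o MYOUTPUTFILE';
print fill_wildcards($command, { MYOUTPUTFILE => 'blastp.out' }), "\n";
```

Matching on whole words is one reason a 'MY' prefix is a safe choice: MYOUTPUTFILE contains the reserved word OUTPUTFILE but is still treated as a separate word.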
Suppose you want your Perl module to be able to get some value from the command-line and use it as a variable, e.g. $MYDATABASE. First add the variable name to the @EXPORT_OK array in diya.pm. Then modify the use diya; line in your Perl module, for example:
use diya qw($MYDATABASE);
After these modifications you should be able to do the following:
diya.pl --set MYDATABASE=ncbi Seq.fa
And the variable $MYDATABASE will have the value ncbi in your Perl module when diya.pl runs.
When you use --set you are creating global variables that can be used in your own Perl modules so make sure that your variable names do not collide with diya variables. One way to do this is to use variable names that are all capitalized, or prefix the name with 'MY'.
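For comparison, repeated NAME=VALUE options can also be collected into a hash with the core Getopt::Long module, which sidesteps name collisions because nothing is exported as a global. This is an alternative sketch, not how diya.pm handles --set, and parse_set_options() is a hypothetical helper name.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse repeated --set NAME=VALUE options into a hash. Keeping the values
# in a hash is an assumption of this sketch; diya itself creates globals.
sub parse_set_options {
    my (@args) = @_;
    my %set;
    GetOptionsFromArray(\@args, 'set=s%' => \%set)
        or die "Could not parse command-line options\n";
    return (\%set, \@args);    # what is left in @args are the input files
}

my ($set, $files) = parse_set_options(
    qw(--set MYDATABASE=ncbi --set MYID=3 Seq.fa)
);
print "$_=$set->{$_}\n" for sort keys %$set;
print "input files: @$files\n";
```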
THE CONFIGURATION FILE section discusses the structure of the *conf file but in order to create your own files you will need to understand some of the internal details of diya.
When diya runs it can create the names of input and output files. This makes it easy for diya to keep track of files since one of its jobs is to pass the output of one step to the next step as input. diya uses a timestamp and the name of the step to create file names, for example:
2008_08_07_10_19_53-create-fasta-db.out
The file above was created by the create-fasta-db step, as you can see from its name. This file could be the input for some other step, and you would indicate this by using the inputfrom field. For example:
<parser>
  <inputfrom>create-fasta-db</inputfrom>
  <executable>blastall</executable>
  <home>/usr/local/bin</home>
  <command>-p blastp -i INPUTFILE -d MYDATABASE -o OUTPUTFILE </command>
  <name>blastpCDS</name>
  <inputformat></inputformat>
</parser>
The block above says that the INPUTFILE should come from the create-fasta-db step. When diya runs and the actual command is constructed this part of the command line:
-p blastp -i INPUTFILE
will be transformed into something like:
-p blastp -i /tmp/2008_08_07_10_19_53-create-fasta-db.out
INPUTFILE has a second meaning, which is the name of the sequence file passed to the diya script. Recall that you can run a diya script like this:
mydiya.pl --conf my.conf NC_123456.fa
If a given step has no inputfrom value then the value of INPUTFILE will be the name of the sequence file set from the command-line, or "NC_123456.fa" in the example above.
This does not mean that you have to use INPUTFILE in each step. It means that when INPUTFILE is present in a command line it will be substituted in one of these 2 ways, depending on whether or not there is an inputfrom value.
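The two meanings of INPUTFILE can be summarized in a short sketch. It is written in plain Perl under stated assumptions and is not taken from diya.pm: make_outputfilename() and resolve_inputfile() are hypothetical helpers, the step and output hashes are illustrative structures, and the file name pattern simply follows the example file names shown earlier in this section.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build an output file name from a timestamp and the step name,
# following the pattern of the example file names in this section.
sub make_outputfilename {
    my ($step, $time) = @_;
    $time //= time;
    return strftime('%Y_%m_%d_%H_%M_%S', localtime($time)) . "-$step.out";
}

# Choose the value substituted for INPUTFILE in a step's command.
# $step is a hash with an optional 'inputfrom' key and $outputs maps
# step names to the files they produced (both hypothetical structures).
sub resolve_inputfile {
    my ($step, $outputs, $cli_file) = @_;
    my $from = $step->{inputfrom};
    # No inputfrom value: use the sequence file from the command line.
    return $cli_file unless defined $from && length $from;
    # Otherwise use the output file of the named step.
    return $outputs->{$from}
        // die "Step '$from' has not produced an output file\n";
}

my %outputs = ( 'create-fasta-db' => make_outputfilename('create-fasta-db') );
print resolve_inputfile({ inputfrom => 'create-fasta-db' }, \%outputs, 'NC_123456.fa'), "\n";
print resolve_inputfile({ inputfrom => '' }, \%outputs, 'NC_123456.fa'), "\n";
```

The first call yields the timestamped file of the create-fasta-db step, and the second falls back to the file named on the command line.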
The OUTPUTFILE from one step will frequently be used as the INPUTFILE to another step. Thus you may need to explicitly create a file with the name contained in OUTPUTFILE using the command line. An example script block:
<script>
  <name>create-fasta-db</name>
  <executable>createdb.pl</executable>
  <command>-o OUTPUTFILE</command>
  <home>~/scripts</home>
  <inputfrom></inputfrom>
</script>
When this step runs a command like this will be executed:
~/scripts/createdb.pl -o /tmp/2008_08_07_10_19_53-create-fasta-db.out
Here ~/scripts/createdb.pl will use the file name provided by diya and put its output into that file. An alternative is to redirect the output of an application into an output file. For example:
<script>
  <name>create-fasta-db</name>
  <executable>createdb.pl</executable>
  <command> > OUTPUTFILE</command>
  <home>~/scripts</home>
  <inputfrom></inputfrom>
</script>
By default diya creates an output directory to contain all the input and output files created during a pipeline run, something like:
2008_11_21_22_02_19_diya/
You can get this name in order to use it in your script or parser steps, something like:
<script>
  <name>move-file</name>
  <executable>move-file.pl</executable>
  <command>-o OUTPUTDIR</command>
  <home>~/scripts</home>
  <inputfrom></inputfrom>
</script>
In this example the name of the diya output directory will be passed to the script.
The authors use diya to annotate sequences and save these annotations as GenBank files. They routinely convert these files to GFF, load the GFF into Bio::DB::GFF databases, and visualize the annotations using GBrowse. The conversion script used is diya-genbank2gff3.pl in the scripts directory. This script is a modification of the genbank2gff3.pl script that comes with BioPerl.
Andrew Stewart, andrew.stewart@med.navy.mil
Brian Osborne, briano@bioteam.net
Tim Read, timothy.read@med.navy.mil
Public methods are listed first, followed by private methods prefixed with '_'.
Name    : new
Usage   : $diya = diya->new()
Function: create a diya pipeline object
Returns : a diya object
Args    : -verbose (optional), 1 or 0, set the verbosity level
          -use_conf (optional), name of the conf file to be used
          -outputdir (optional), the directory where results will reside
          -mode (optional), 'serial' is default
Example : my $pipeline = diya->new(-verbose   => 1,
                                   -use_conf  => "latest.conf",
                                   -outputdir => 'mydir',
                                   -mode      => 'sge' );

Name    : read_conf
Usage   : $diya->read_conf("my_conf_file")
Function: read a diya conf file
Returns : 1 on success
Args    : the name of the conf file to be read (optional)
Example : $pipeline->read_conf() or
          $pipeline->read_conf("latest.conf")

Name    : run
Usage   : $diya->run()
Function: run a diya pipeline
Returns : 1 on success
Args    :
Example : $pipeline->run

Name    : new_parser
Usage   : $parser = $module->new_parser
Function: instantiate a new parser object
Returns : a new parser object
Args    : none
Example :

Name    : order
Usage   : $diya->order( @array ) or $order = $diya->order
Function: get or set the order of the steps to be run
Returns : array of step names
Args    : To set pass an array of one or more step names
          the parsers and scripts must exist in the conf file in the
          <parser> and <script> sections
Example : $pipeline->order( qw(tRNAscanSE blastall) ) or
          $pipeline->order('tRNAscanSE') or
          @my_order = $self->order

Name    : write_conf
Usage   : $diya->write_conf("my_new_conf_file")
Function: write a conf file - if no name is supplied then the new file
          will be given a name of format <timestamp>-diya.conf
          (e.g. 2008-06-29-11:35:38-diya.conf )
Returns : the name of the conf file that was written
Args    : the name of the conf file that will be written (optional)
Example : $pipeline->write_conf() or
          $pipeline->write_conf("version2.conf")

Name    : verbose
Usage   : $diya->verbose($num) or $verbose_level = $diya->verbose
Function: get or set the verbose level
Returns : the verbose level
Args    :
Example : $pipeline->verbose(1)

Name    : project
Usage   : $diya->project($num) or $project = $diya->project
Function: get or set the NCBI project number
Returns : the NCBI project number
Args    :
Example : $pipeline->project(1355)

Name    : outputdir
Usage   : $diya->outputdir()
Function: get or set the name of the output directory, where all of the
          files created by the pipeline will be written - the object
          will try and create the directory if it does not exist
          if no output directory is specified then an output directory
          will be created based on a timestamp,
          e.g. "2008-06-29-11:35:38-diya"
Returns : name of the output directory or 0 if no output directory is set
Args    :
Example : $diya->outputdir("pipe-output")

Name    : mode
Usage   :
Function: get or set the mode corresponding to a pipeline
Returns : "serial" or "sge"
Args    :
Example : $pipeline->mode("serial")

Name    : cleanup
Usage   : $diya->cleanup
Function: remove extraneous files created when a pipeline is run
Returns : 1 on success
Args    : none
Example :

Name    : inputfile
Usage   : $diya->inputfile('NC.gbk')
Function: Get or set the names of the input sequence files
Returns :
Args    :
Example : $self->inputfile("234.fa") or
          $self->inputfile( qw(234.fa AB.fa) )

Name    : _next_inputfile
Usage   :
Function: Get the name of the next input sequence file, remove the last
          from the queue
Returns :
Args    :
Example :

Name    : _execute
Usage   : $self->_execute($command)
Function: encapsulate the serial and sge execution logic
Returns : none
Args    : command
Example :

Name    : _reconstruct_sequence
Usage   : $self->_reconstruct_sequence()
Function: reconstruct the sequence object. Used only when mode is sge.
          When mode is sge, _execute() will generate an intermediate
          script performing the $parse->parse($diya). In the intermediate
          script, this method is called. Please see _execute
Returns : none
Args    : none
Example : $self->_reconstruct_sequence()

Name    : _check_executable
Usage   : $self->_check_executable($step)
Function: checks to see that the executable exists
Returns : 1 or die
Args    : Name of step
Example : $self->_check_executable($step)

Name    : _check_input_sequence
Usage   : $self->_check_input_sequence($step)
Function: checks to see that the format of the input sequence file is
          correct, if the format is not correct then it creates a
          sequence file of the correct format
Returns : The name of the file of the correct format
Args    : Name of step
Example : $self->_check_input_sequence($step)

Name    : _check_inputfile
Usage   : $self->_check_inputfile($step)
Function: records the input file name for the step - this may be the
          input sequence for the entire pipeline but a step may also
          use the output of another step as input
Returns : the name of the input file for the step
Args    : name of step
Example : $self->_check_inputfile($step)

Name    : _check_outputdir
Usage   : $diya->_check_outputdir
Function: checks that there is an output directory - if no output
          directory is defined then a directory name will be made using
          a timestamp (e.g. "2008-06-29-11:35:38-diya") and this
          directory will be in the diya home directory - if an output
          directory is defined but does not exist then we will attempt
          to create it
Returns : name of output directory, on success
Args    : none
Example :

Name    : _make_command
Usage   : $pipeline->_make_command($step)
Function: to create a complete command using information from the conf
          file and any command-line options - private method called
          by run()
Returns : a command, ready to execute
Args    : the name of the parser step (e.g. "tRNAscanSE")
Example : $pipeline->_make_command($parser)

Name    : _make_outputfilename
Usage   : $diya->_make_outputfilename($parser)
Function: create an output file name with parser step name and
          timestamp, private method, called by run()
Returns : output file name
Args    : step name
Example :

Name    : _lastsgeid
Usage   : $diya->_lastsgeid()
Function: set or get the last sge job id submitted by current process,
          used only internally for job id tracking.
Returns : last sge job id submitted by current process.
Args    :
Example : $pipeline->_lastsgeid(53)

Name    : _outputfile
Usage   : $file = $self->_outputfile($parser)
Function: get the output file name for a given parser step, private
          method called by run() or a parser module
Returns : output file name
Args    : step name
Example : $file = $self->_outputfile($step)

Name    : _greeting
Usage   : $diya->_greeting
Function: print a greeting - private method, called by new()
Returns : nothing
Args    :
Example :

Name    : _help
Usage   : $diya->_help
Function: print the POD - works only if DIYA is installed
Returns : nothing
Args    :
Example :

Name    : _diyahome
Usage   :
Function: add the path to the diya home directory to the object - the
          path to the diya package comes from the env $DIYAHOME. If this
          is not set then try to use the current working directory.
          Private method, called by new()
Returns : the diya home directory
Args    : none
Example :

Name    : _conf
Usage   :
Function: get or set a hash representing the conf file - private method,
          called by read_conf() and write_conf()
Returns : a hash reference representing the conf file
Args    :
Example : $conf = $diya->_conf

Name    : _executable
Usage   :
Function: return the executable name corresponding to a parser or
          script, private method called by _make_command and
          _check_executable
Returns : executable name
Args    : parser or script name
Example : $exe = $self->_executable($name)

Name    : _parsers
Usage   :
Function: return all the parser names in the conf file, private method
          called by order()
Returns : array of parser names
Args    : none
Example : @parsers = $self->_parsers

Name    : _scripts
Usage   :
Function: return all the script names in the conf file, private method
          called by order()
Returns : array of script names
Args    : none
Example : @scripts = $self->_scripts

Name    : _sequence
Usage   : $diya->_sequence($seq) or $seq = $diya->_sequence
Function: get or set the DNA sequence object, private method used by
          parser modules and by _check_input_sequence()
Returns : the sequence object
Args    :
Example : $pipeline->_sequence($seq)

Name    : _home
Usage   :
Function: return the home or location corresponding to an executable,
          private method called by _make_command and _check_executable
Returns :
Args    : parser or script name
Example : $path = $self->_home($exe)

Name    : _inputfrom
Usage   :
Function: return the inputfrom field corresponding to a parser or script
Returns :
Args    : parser or script name
Example : $path = $self->_inputfrom($module)

Name    : _command
Usage   :
Function: return the command corresponding to a step, private method
          called by _make_command()
Returns :
Args    : step name
Example : $args = $self->_command($step)

Name    : _inputformat
Usage   :
Function: return the file format required by a parser step, private
          method called by run()
Returns : format name, or 0 if no value is found
Args    : parser step name
Example : $format = $self->_inputformat($step)

Name    : _use_conf
Usage   : $diya->_use_conf("conf_file")
Function: add the name of the conf file being used to the object -
          private method, called by read_conf()
Returns : the name of the conf file being used
Args    :
Example :

Name    : _initialize
Usage   : $diya->_initialize
Function: add parameters to the diya object - strips the dash used by
          named parameters, private method called by new()
Returns :
Args    :
Example :

Name    : _get_type
Usage   : $type = $diya->_get_type($step)
Function: return 'script' or 'parser', private method called by run()
Returns : 'script' or 'parser'
Args    :
Example :

Name    : _load_app_module
Usage   : $diya->_load_app_module($module)
Function: call require on the given module, private method called
          by run()
Returns : full name of module, e.g. "diya::tRNAscanSE"
Args    : name of the module, e.g. 'tRNAscanSE'
Example :

Name    : _get_options
Usage   :
Function: get command-line options - the --set option is used to create
          globals that can be imported into a parser module or used
          directly in this module
Returns : 1 on success
Args    : none
Example : At the command-line:
          diya.pl --set REFD=~/mydb.ref