Running a pipeline

Running tRNAnalysis from the command line is straightforward. If you installed trnanalysis using conda, all of the software dependencies should already be installed and you are ready to go. A step-by-step tutorial pipeline is also available.

Introduction

This pipeline requires the following input:

  • a single-end fastq file. If you have paired-end data, we recommend either merging the read pairs into a single file (for example with FLASH) or using only the first read of each pair.
  • a bowtie indexed genome
  • an Ensembl GTF file. We recommend that you download our GTF files, which have been sanitised for this workflow. However, you can use your own, provided that all of the chromosomes are listed according to the Ensembl annotations (i.e. the chromosomes are named chr1, chr2, etc.).
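If you supply your own GTF, a quick check of the chromosome names before starting can save a failed run. The following is a minimal sketch, not part of tRNAnalysis: the function name unexpected_seqnames is hypothetical, and it assumes only that the first tab-separated column of each GTF record is the sequence name and that valid names start with "chr".

```python
from typing import Iterable, Set


def unexpected_seqnames(gtf_lines: Iterable[str], prefix: str = "chr") -> Set[str]:
    """Return the sequence names in a GTF that do not start with `prefix`.

    Hypothetical helper for illustration; not part of tRNAnalysis.
    """
    bad = set()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The first tab-separated column of a GTF record is the sequence name.
        seqname = line.split("\t", 1)[0]
        if not seqname.startswith(prefix):
            bad.add(seqname)
    return bad
```

For a gzipped file, the lines could be supplied with gzip.open(path, "rt"). An empty result means every record uses a chr-prefixed name. Unplaced scaffolds without the prefix will also be reported, so treat the output as a hint rather than a verdict.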

Optionally, to make the pipeline run faster, you can also use a downloaded tRNAscan-SE output. The most time-consuming part of the pipeline is running tRNAscan-SE to identify tRNAs across the genome. To speed up pipeline execution, we have pre-run tRNAscan-SE and generated outputs that you can download. You can then tell the pipeline the location of the file using the yml configuration file.

Running tRNAnalysis

Command line usage information is available by running:

trnanalysis --help

The basic syntax for running tRNAnalysis is:

trnanalysis [workflow options] [workflow arguments]

Workflow options can be one of the following:

make <task>

run all tasks required to build task

show <task>

show tasks required to build task without executing them

plot <task>

plot an image of the workflow showing the pipeline state for task (requires inkscape)

touch <task>

touch files without running task or its pre-requisites. This sets the timestamps for files in task and its pre-requisites such that they will seem up-to-date to the pipeline.

config

write a new configuration file pipeline.ini with default values. An existing configuration file will not be overwritten.

clone <srcdir>

clone a pipeline from srcdir into the current directory. Cloning attempts to conserve disk space by linking.

Fastq naming convention

tRNAnalysis assumes that input fastq files follow the naming convention below, with the read number inserted between fastq and gz. This means that regular expressions do not have to account for the read number within the sample name. It is also more explicit:

sample1-condition-R1.fastq.1.gz
sample1-condition-R2.fastq.2.gz
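To check file names against this convention before starting a run, a regular expression along the following lines could be used. This is a sketch under assumptions: the pattern is inferred from the two example names above, and read_number is a hypothetical helper, not a tRNAnalysis function.

```python
import re
from typing import Optional

# Assumed pattern: any stem, then ".fastq.", a read number (1 or 2), then ".gz".
FASTQ_NAME = re.compile(r"^(?P<stem>.+)\.fastq\.(?P<read>[12])\.gz$")


def read_number(filename: str) -> Optional[int]:
    """Return the read number encoded in the file name, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    return int(match.group("read")) if match else None
```

A name such as sample1-condition-R1.fastq.gz, which lacks the read number between fastq and gz, yields None and would need renaming.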

Additional options

In addition to running tRNAnalysis with the default command line options, running trnanalysis with --help will show additional options for the workflow arguments. These modify the way the pipeline is run.

--no-cluster

This option allows the pipeline to run locally.

--input-validation

This option will check the pipeline.ini file for missing values before the pipeline starts.

--debug

Add debugging information to the console and not the logfile.

--dry-run

Perform a dry run of the pipeline (do not execute shell commands).

--exceptions

Echo exceptions immediately as they occur.

-c, --checksums

Set the level of ruffus checksums.
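When launching the pipeline from a script, the command can be assembled as an argument list that follows the syntax given earlier (workflow option first, workflow arguments after). The sketch below is illustrative only: build_command is a hypothetical helper, and the combination shown uses just the build_report task and the --no-cluster and --dry-run options described in this document.

```python
import shlex
from typing import List, Sequence


def build_command(action: str, task: str, arguments: Sequence[str] = ()) -> List[str]:
    """Assemble `trnanalysis <action> <task> [arguments]` as an argument list.

    Hypothetical helper for illustration; it only builds the list and runs nothing.
    """
    return ["trnanalysis", action, task, *arguments]


# A local dry run of the report task, using only options listed above.
command = build_command("make", "build_report", ["--no-cluster", "--dry-run"])
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once trnanalysis is installed and on the PATH.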

Building tRNAnalysis reports

Reports are generated using the following command once the full command has completed:

trnanalysis make build_report

Troubleshooting

Many things can go wrong while running the pipeline. Look out for:

  • bad input format. The pipeline does not perform sanity checks on the input format. If the input is bad, you might see wrong or missing results or an error message.
  • pipeline disruptions. Problems with the cluster, the file system or the controlling terminal might all cause the pipeline to abort.
  • bugs. The pipeline makes many implicit assumptions about the input files and the programs it runs. If program versions change or inputs change, the pipeline might not be able to deal with it. The result will be wrong or missing results or an error message.

If tRNAnalysis aborts, locate the step that caused the error by reading the logfiles and the error messages on stderr (nohup.out). See if you can understand the error and guess the likely problem (new program versions, badly formatted input, …). If you are able to fix the error, remove the output files of the step in which the error occurred and restart the pipeline. Processing should resume at the appropriate point.

Note

Look out for upstream errors. For example, if the pipeline errors and stops, it may already have created the output file for the failed task. When the pipeline is started again, it will then move on to the next function even though that file is empty. To fix this, delete the files created by the last task that ran before restarting the pipeline.
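Empty output files left behind by an aborted task can be located with a short script before restarting. This is a minimal sketch rather than a tRNAnalysis feature: empty_files is a hypothetical helper, and it assumes that a zero-byte file is a sign of an interrupted step.

```python
from pathlib import Path
from typing import List


def empty_files(directory: str) -> List[Path]:
    """Return all zero-byte regular files below `directory`, sorted by path.

    Hypothetical helper for illustration; not part of tRNAnalysis.
    """
    root = Path(directory)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0)
```

Calling empty_files(".") from the working directory lists the candidates. Some pipelines create empty marker files on purpose, so review the list before deleting anything.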

Common errors

One of the most common errors when running tRNAnalysis is:

GLOBAL_SESSION = drmaa.Session()
NameError: name 'drmaa' is not defined

This error occurs because you are not connected to the cluster, so the drmaa module could not be loaded. Either connect to the cluster or run the pipeline in local mode by adding --no-cluster as a command line option.
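Whether the drmaa module can be loaded in the current environment can be tested before launching the pipeline. The sketch below is an assumption-laden illustration: can_import is a hypothetical helper, and the set of exceptions caught is a guess at how a missing DRMAA installation surfaces on import.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if `module_name` imports cleanly in this environment."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError, RuntimeError):
        # ImportError: the Python package is missing.
        # OSError/RuntimeError: assumed to cover the case where the package is
        # installed but the underlying DRMAA library cannot be located.
        return False
    return True


# If this is False, running with --no-cluster is the likely workaround.
use_cluster = can_import("drmaa")
```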