BioX-Workflow

Introduction

Most bioinformatics workflows start with a set of samples and process those samples in one or more steps. Most are also bash based, and no one wants to reinvent the wheel by rewriting their code in perl/python/whatever.

Once your configuration is all set, process your entire workflow by running

    biox-workflow.pl --workflow workflow.yml > workflow.sh

Alternately, to select an exact rule

    biox-workflow.pl --workflow workflow.yml --select_rules bowtie2 > rule1.sh

To match a set of rules using a regexp, pass the pattern to --match_rules. The first command below matches all rules that contain 'gatk', such as 'gatk_realign_indels' or 'rule_gatk'

    biox-workflow.pl --workflow workflow.yml --match_rules gatk > gatk.sh
    #Match only those rules beginning with gatk
    biox-workflow.pl --workflow workflow.yml --match_rules "^gatk" > gatk.sh
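
The difference between the two patterns can be previewed without running a workflow. The sketch below is only an illustration: it uses grep on a made-up list of rule names (taken from the examples above) to show which names an unanchored and an anchored pattern would pick up. It does not call biox-workflow.pl, and it assumes --match_rules treats simple patterns like these the way grep -E does.

```shell
#!/usr/bin/env bash
# Hypothetical rule names, one per line (not read from a real workflow.yml).
rules="bowtie2
gatk_realign_indels
rule_gatk"

# Unanchored pattern: matches any rule name containing 'gatk'.
contains=$(printf '%s\n' "$rules" | grep -E 'gatk')

# Anchored pattern: matches only rule names beginning with 'gatk'.
begins=$(printf '%s\n' "$rules" | grep -E '^gatk')

printf 'contains gatk:\n%s\n' "$contains"
printf 'begins with gatk:\n%s\n' "$begins"
```

Here 'gatk' selects both gatk_realign_indels and rule_gatk, while '^gatk' selects only gatk_realign_indels.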

In Depth

Samples

For example, with our samples test1.vcf and test2.vcf, we want to bgzip and annotate them using snpeff, and then parse the output using vcf-to-table.pl (shameless plug for BioX::Wrapper::Annovar).

BioX::Workflow assumes you have a set of inputs, known as samples, and that these inputs will carry on through your pipeline. There are some exceptions to this, which we will explore with the resample option.

BioX::Workflow also assumes your samples are files or directories. They may also be people, frogs, or cells, but first and foremost they are files.

Structure

BioX::Workflow also makes several assumptions about your output structure. It assumes each of your processes/rules outputs to a distinct directory. These directories will be created and automatically named based on your process name. Each of the assumptions BioX::Workflow makes can be overridden either globally or locally.

It also assumes the indir of each rule is the outdir of the previous rule.
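
As a rough picture of that chaining, the sketch below walks a list of rules and creates one directory per rule, feeding each rule's outdir to the next rule as its indir. It is plain bash written for illustration, not output generated by BioX::Workflow: the rule names, the starting directory name raw, and the exact directory layout are all assumptions.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics the indir/outdir chaining idea in a temp directory.
set -euo pipefail

workdir="$(mktemp -d)"
rules=(bgzip snpeff vcf_to_table)   # hypothetical rule names

# Assumed starting point: the first rule reads from a 'raw' directory.
indir="$workdir/raw"
mkdir -p "$indir"

for rule in "${rules[@]}"; do
    # Each rule writes to a directory named after the rule itself.
    outdir="$workdir/$rule"
    mkdir -p "$outdir"
    echo "$rule: indir=${indir#"$workdir"/} outdir=${outdir#"$workdir"/}"
    # The next rule reads from where this rule wrote.
    indir="$outdir"
done
```

Each printed line shows a rule reading from the directory the previous rule wrote to, which is the default behaviour described above before any overrides are applied.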

All the things can be modified!

All the variables can be modified from their defaults in order to enable custom control of your workflow.


Acknowledgements

Before version 0.03

This module was originally developed at and for Weill Cornell Medical College in Qatar within the ITS Advanced Computing Team. With approval from WCMC-Q, this information was generalized and put on GitHub, for which the authors would like to express their gratitude.

As of version 0.03

This module's continuing development is supported by NYU Abu Dhabi in the Center for Genomics and Systems Biology. With approval from NYUAD, this information was generalized and put on Bitbucket, for which the authors would like to express their gratitude.