WDL-based pipelines for whole genome and exome sequencing analysis

Date:

November 14-16, 2018. Co-authored with Muñoz-Barrera A, Rubio-Rodríguez LA, Lorenzo-Salazar JM, Flores C, Pérez-González CJ and Roda-García JL. Abstract Booklet #19, page 40.

Abstract

Next Generation Sequencing data analysis comprises a series of computational tasks frequently based on the use of command line tools. These analyses are defined in workflows that group all the necessary tasks, improving data processing performance and results interpretation. Some Domain Specific Languages (DSLs), such as WDL and Nextflow, have been recently created to define and program complex pipelines, as well as to improve the parallelization, the scalability and the reusability. We have developed a complete pipeline programmed in WDL via scripting and Rabix Composer based on the Broad Institute’s best practices and the Genome Analysis Toolkit (GATK4) to analyze whole‐genome (WGS) and whole‐exome (WES) data. This pipeline is able to process data from one or several WGS/WES of multiple individuals. It allows to parallelize the analysis in multiple tasks and nodes, and runs in environments based on Docker Swarm or in high‐performance computing systems such as TeideHPC at ITER (http://teidehpc.iter.es/en/home). Interesting features are: possibility to run on a HPC infrastructure connecting the Cromwell engine and the SLURM scheduler; starts from BCL data; demultiplexing of samples pooled across the flowcell; data processing both on a per‐lane and a per‐sample basis; possibility to handle hg19 and hg38 reference genomes; programmed to restart from every step in case of fail. For benchmarking, we are following the guidelines of the Truth and Consistency precisionFDA challenges using Genome In A Bottle Consortium released genomes data. A full pipeline is currently running on TeideHPC to analyze WGS and WES germline data produced by an Illumina HiSeq4000 sequencing platform for research purposes. These resources are available at: https://github.com/genomicsITER/wdl Funded by Ministerio de Ciencia, Innovación y Universidades (RTC‐2017‐6471‐1; MINECO/AEI/FEDER, UE). This work has been supported by the CEDeI program (Centro de Excelencia de Desarrollo e Innovación, Cabildo de Tenerife). The authors also thankfully acknowledge the computer resources and the technical support provided by TARO Research Group of the University of La Laguna.