Development of a dockerized Bioinformatic architecture based on Big Data technologies to process data from Next Generation Sequencing

Date:

November 14-16, 2018. Co-authored with Rubio-Rodríguez LA, Muñoz-Barrera A, Roda-García JL, Pérez-González CJ, González-Yanes P, Flores C and Lorenzo-Salazar JM. Abstract Booklet #22, page 44.

Abstract

Next Generation Sequencing (NGS) data imposes major challenges for processing due to its volume and complexity. Analysis of NGS data is generally based on many third party software, which are sometimes complex to install, configure and usually have dependencies that may arise in portability and reproducibility issues. Thus, it is necessary to develop infrastructures to store, manage and analyse massive genomic data in an efficient, scalable and reproducible way. We have developed a Docker container‐based infrastructure comprised of Bioinformatics and Big Data tools (Hadoop and Spark) for NGS data analysis. We have created Docker images for the Hadoop and Spark components and bioinformatics tools, including QC applications (FastQC, MultiQC, Qualimap2), aligners (BWA), variant callers (GATK4, Platypus), and other complementary software (JupyterLab). The infrastructure was deployed as a cluster of 20 commodity hardware nodes in a computer classroom at the ESIT‐ULL Computing Center and can be easily scaled. We have used Docker Compose for multi‐container definition and Docker Swarm for container orchestration and cluster management. To ensure data integrity, genomic data is stored in a distributed and a redundant manner across all nodes by using HDFS technology. This solution allowed university students to keep using their computers during data processing, so we could take advantage of existing equipment with no additional cost. As this is a non‐dedicated infrastructure, it has been necessary to keep in mind the continuous addition and removal of nodes from the cluster. Docker images are available at: https://hub.docker.com/r/taroull/. Dockerfiles for deploying the described infrastructure are available at: https://github.com/lubertorubior/docker‐esit Funded by Ministerio de Ciencia, Innovación y Universidades (RTC‐2017‐6471‐1; MINECO/AEI/FEDER, UE). This work has been supported by the CEDeI program (Centro de Excelencia de Desarrollo e Innovación, Cabildo de Tenerife).