Variant Calling Benchmark
Not Only Human
Variant calling is the identification of nucleotide differences from a reference sequence at a
given position in an individual genome or transcriptome. Variants include single nucleotide
polymorphisms (SNPs), insertions/deletions (indels) and structural variants. One of the most popular
variant calling applications is the Genome Analysis Toolkit (GATK) from the Broad Institute. GATK is
often combined with BWA to compose a variant calling workflow focusing on SNPs and indels. Since we
published the Dell HPC System for Genomics White Paper last year, GATK has changed significantly. The
key variant calling step, UnifiedGenotyper, is no longer recommended in the GATK Best Practices. Hence,
here we recreate the BWA-GATK pipeline according to the recommended practice and test it on whole
genome sequencing data from mammals and plants in addition to human whole genome sequencing data.
This is part of Dell’s effort to help customers estimate the infrastructure needs of their various
genomics workloads by providing a comprehensive benchmark.
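To make the small variant classes named above concrete, the following minimal Python sketch classifies a variant from its reference and alternate alleles, as they would appear in a VCF record. The function name `classify_variant` and the category labels are illustrative assumptions and not part of GATK. Structural variants are larger events that usually need dedicated callers, so this simple length comparison does not cover them.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a simple variant by comparing allele lengths.

    Hypothetical helper for illustration only; not a GATK function.
    """
    if not ref or not alt:
        raise ValueError("ref and alt alleles must be non-empty")
    if ref == alt:
        return "no variant"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"        # single-base substitution
    if len(alt) > len(ref):
        return "insertion"  # alternate allele adds bases
    if len(alt) < len(ref):
        return "deletion"   # alternate allele removes bases
    return "multi-base substitution"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "ATT"), ("CTG", "C")]:
        print(ref, alt, classify_variant(ref, alt))
```

Insertions and deletions together form the indel class that the BWA-GATK workflow targets alongside SNPs.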
Variant Analysis for Whole Genome Sequencing data
System
The detailed configuration is described in the Dell HPC System for Genomics White Paper; the system
configuration and software are summarized in Table 1.
Table 1 Server configuration and software

| Component | Detail |
|---|---|
| Server | 40x PowerEdge FC430 in FX2 chassis |
| Processor | 2x Intel® Xeon® E5-2695 v3 (14 cores each) per server; 1120 cores in total |
| Memory | 128GB - 8x 16GB RDIMM, 2133 MT/s, Dual Rank, x4 Data Width |
| Storage | 480TB IEEL (Lustre) |
| Interconnect | InfiniBand FDR |
| OS | Red Hat Enterprise Linux 6.6 |
| Cluster Management tool | Bright Cluster Manager 7.1 |
| Short Sequence Aligner | BWA 0.7.2-r1039 |
| Variant Analysis | GATK 3.5 |
| Utilities | sambamba 0.6.0, samtools 1.2.1 |
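The totals in Table 1 can be checked with a few lines of arithmetic, which may also help when scaling the configuration up or down. This is a minimal sketch: the variable names are illustrative, and it assumes that the 128GB memory figure applies to each server.

```python
# Figures taken from Table 1; memory is assumed to be per server.
SERVERS = 40                # PowerEdge FC430 nodes
SOCKETS_PER_SERVER = 2      # dual-processor nodes
CORES_PER_SOCKET = 14       # Intel Xeon E5-2695 v3
MEMORY_GB_PER_SERVER = 128  # 8 x 16GB RDIMM

cores_per_server = SOCKETS_PER_SERVER * CORES_PER_SOCKET
total_cores = SERVERS * cores_per_server
total_memory_gb = SERVERS * MEMORY_GB_PER_SERVER
memory_gb_per_core = MEMORY_GB_PER_SERVER / cores_per_server

print(f"cores per server:  {cores_per_server}")
print(f"total cores:       {total_cores}")
print(f"total memory (GB): {total_memory_gb}")
print(f"memory per core:   {memory_gb_per_core:.2f} GB")
```

The computed core count matches the 1120 cores stated in the table.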
BWA-GATK pipeline
The current version of GATK is 3.5, and the workflow tested here was obtained from the workshop
‘GATK Best Practices and Beyond’. This workshop introduces a new workflow with three phases:
Best Practices Phase 1: Pre-processing
Best Practices Phase 2A: Calling germline variants
Best Practices Phase 2B: Calling somatic variants
Best Practices Phase 3: Preliminary analyses
Here we tested Phase 1, Phase 2A and Phase 3 for the germline variant calling pipeline. The details of
the commands used in the benchmark are listed below.
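Before the individual commands, the overall order of the workflow can be sketched as a small Python structure. The phase labels come from the list above. The step names and the tools assigned to each step are assumptions based on typical GATK 3.x usage and the utilities in Table 1, not the benchmark's exact commands. `Step`, `WORKFLOW` and `germline_steps` are hypothetical names.

```python
from typing import List, NamedTuple, Sequence


class Step(NamedTuple):
    phase: str  # Best Practices phase label
    name: str   # what the step does
    tool: str   # assumed tool; not confirmed by the benchmark text


# Assumed step breakdown for illustration only.
WORKFLOW: List[Step] = [
    Step("1", "Align and sort reads", "bwa mem + sambamba sort"),
    Step("1", "Mark duplicates", "sambamba markdup"),
    Step("1", "Recalibrate base qualities", "GATK BaseRecalibrator + PrintReads"),
    Step("2A", "Call germline variants per sample", "GATK HaplotypeCaller"),
    Step("2A", "Joint genotyping", "GATK GenotypeGVCFs"),
    Step("2B", "Call somatic variants", "GATK MuTect2"),
    Step("3", "Recalibrate variant scores", "GATK VariantRecalibrator + ApplyRecalibration"),
]

# Phases exercised for the germline pipeline (Phase 2B is somatic).
TESTED_PHASES = ("1", "2A", "3")


def germline_steps(workflow: Sequence[Step],
                   phases: Sequence[str] = TESTED_PHASES) -> List[Step]:
    """Return the steps belonging to the tested phases, in workflow order."""
    return [step for step in workflow if step.phase in phases]


if __name__ == "__main__":
    for step in germline_steps(WORKFLOW):
        print(f"Phase {step.phase}: {step.name} ({step.tool})")
```

Keeping the phase label on every step makes it easy to drop the somatic branch, as done here, or to time each phase separately when sizing a cluster.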
