The Genome Analysis Toolkit (GATK) Copyright (c) 2009 The Broad Institute Overview -------- The Genome Analysis Toolkit (GATK) is a structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers. The GATK solves the data management challenge by separating data access patterns from analysis algorithms, using the functional programming philosophy of Map/Reduce. Consequently, the GATK is structured into data traversals and data walkers that interact through a programming contract in which the traversal provides a series of units of data to the walker, and the walker consumes each datum to generate an output for each datum. Because many tools to analyze next-generation sequencing data access the data in a very similar way, the GATK can provide a small but nearly comprehensive set of traversal types that satisfying the data access needs of the majority of analysis tools. For example, traversals "by each sequencer read" and "by every read covering each locus in a genome" are common throughout many tools such as counting reads, building base quality histograms, reporting average coverage of the genome, and calling SNPs. The small number of these traversals, shared among many tools enables the core GATK development team to optimize such traversals for correctness, stability, CPU performance, memory footprint, and in many cases to even automatically parallelize calculations. Moreover, since the traversal engine encapsulates the complexity of efficiently accessing the next-generation sequencing data, researchers and developers are free to focus on their specific analysis algorithms. This not only vastly improves productivity of the developers, who can quickly write new analyses, but also results in tools that are efficient and robust and can benefit from improvement to a common data management engine. Capabilities ------------ The GenomeAnalysisTK development environment is currently provided as a platform-independent Java programming language library. The core system works with the nascent standard Sequence Alignment/Map (SAM) format to represent reads using a production-quality SAM library developed at the Broad. The system can access a variety of metadata files such as dbSNP, Hapmap, RefSeq as well as work with genotype and SNP files in GLF, Geli, and other common formats. The core system handles read data from Illumina/Solexa, SOLiD, and Roche/454. The current GATK engine can process all of the 1000 genomes data representing ~5Tb of data from these three technologies produced from multiple sequencing centers and aligned to the human reference genome with multiple aligners. The GATK currently provides traversals by each read (ByRead traversal), by all reads covering each locus in the genome (ByLoci traversal), and by all reads within pre-specified intervals on the genome (ByWindow traversal). Dependencies ------------ The GATK relies on a Java 6-compatible JRE. At the time of this writing, the GATK team tests with Sun JRE version 1.6.0_12-b04. Additionally, the GATK requires as inputs a sorted, indexed BAM file containing aligned reads and a fasta-format reference with associated dictionary file (.dict)and index (.fasta.fai). Instructions for preparing input files are available here: http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files The bundled 'resources' directory contains an example BAM and fasta. Getting Started --------------- The GATK is distributed with a few standard analyses, including PrintReads, Pileup, and DepthOfCoverage. More information on the included walkers is available here: http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers To print the reads of the included sample data, untar the package into the GenomeAnalysisTK directory and run the following command: java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \ -T PrintReads \ -R GenomeAnalysisTK/resources/exampleFASTA.fasta \ -I GenomeAnalysisTK/resources/exampleBAM.bam Support ------- Documentation for the GATK is available at http://www.broadinstitute.org/gsa/wiki. For help using the GATK, developing analyses with the GATK, bug reports, or feature requests, please email gsadevelopers@broadinstitute.org.