2009-07-16 05:00:08 +08:00

The Genome Analysis Toolkit (GATK)
Copyright (c) 2009 The Broad Institute

Overview
--------
The Genome Analysis Toolkit (GATK) is a structured programming
framework designed to enable rapid development of efficient and robust
analysis tools for data from next-generation DNA sequencers. The GATK
solves the data management challenge by separating data access patterns
from analysis algorithms, using the functional programming philosophy
of Map/Reduce. Consequently, the GATK is structured into data
traversals and data walkers that interact through a programming
contract: the traversal provides a series of units of data to the
walker, and the walker consumes each datum and generates an output
for it.
Because many tools that analyze next-generation sequencing data access
the data in very similar ways, the GATK can provide a small but
nearly comprehensive set of traversal types that satisfies the data
access needs of the majority of analysis tools. For example,
traversals "by each sequencer read" and "by every read covering
each locus in a genome" are common to many tools, such as those for
counting reads, building base quality histograms, reporting average
coverage of the genome, and calling SNPs. Because these few
traversals are shared among many tools, the core GATK development
team can optimize them for correctness, stability, CPU performance,
and memory footprint, and in many cases can even parallelize
calculations automatically. Moreover, since the traversal engine
encapsulates the complexity of efficiently accessing the
next-generation sequencing data, researchers and developers are free
to focus on their specific analysis algorithms. This not only vastly
improves the productivity of developers, who can quickly write new
analyses, but also results in tools that are efficient and robust and
that benefit from improvements to a common data management engine.

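The traversal/walker contract described in the Overview can be
illustrated with a minimal Java sketch. All names here (Walker,
Traversal, CountReadsWalker) are hypothetical and chosen for
illustration only; they are not the GATK's actual classes, and reads
are stood in for by plain strings. The sketch assumes a Map/Reduce
shape in which the walker maps each datum to an output and the
traversal folds those outputs into one result.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout: a sketch of the traversal/walker
// contract, not the GATK's actual API.
interface Walker<D, M, R> {
    M map(D datum);            // produce one output per datum
    R reduceInit();            // starting value for the reduction
    R reduce(M value, R sum);  // fold one map output into the running result
}

final class Traversal {
    // The traversal owns data access: it hands each datum to the walker
    // and combines the per-datum outputs.
    static <D, M, R> R traverse(Iterable<D> data, Walker<D, M, R> walker) {
        R sum = walker.reduceInit();
        for (D datum : data) {
            sum = walker.reduce(walker.map(datum), sum);
        }
        return sum;
    }
}

// A walker that counts reads: map emits 1 per read, reduce adds them up.
final class CountReadsWalker implements Walker<String, Integer, Long> {
    public Integer map(String read) { return 1; }
    public Long reduceInit() { return 0L; }
    public Long reduce(Integer value, Long sum) { return sum + value; }
}

class WalkerSketch {
    public static void main(String[] args) {
        // Reads are represented as bare base strings (an assumption).
        List<String> reads = Arrays.asList("ACGT", "TTGA", "GGCC");
        long n = Traversal.traverse(reads, new CountReadsWalker());
        System.out.println("reads: " + n);  // prints "reads: 3"
    }
}
```

In this sketch the walker never touches files or iteration order, so
the same walker could in principle be driven by a different or
parallel traversal without changes.
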
Capabilities
------------
The GenomeAnalysisTK development environment is currently provided as
a platform-independent Java library. The core system represents reads
in the nascent standard Sequence Alignment/Map (SAM) format, using a
production-quality SAM library developed at the Broad. The system can
access a variety of metadata files, such as dbSNP, HapMap, and RefSeq,
and can work with genotype and SNP files in GLF, Geli, and other
common formats. The core system handles read data from
Illumina/Solexa, SOLiD, and Roche/454. The current GATK engine can
process all of the 1000 Genomes data, representing ~5Tb of data from
these three technologies, produced by multiple sequencing centers and
aligned to the human reference genome with multiple aligners. The
GATK currently provides traversals by each read (ByRead traversal),
by all reads covering each locus in the genome (ByLoci traversal),
and by all reads within pre-specified intervals on the genome
(ByWindow traversal).

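To make the idea of a by-locus traversal concrete, the following
sketch computes the depth of coverage at each position of a small
interval. It is a deliberately naive illustration under stated
assumptions: SimpleRead and LocusTraversalSketch are hypothetical
names, reads are assumed to align without gaps, and every read is
rescanned for every locus. It does not show how the GATK engine
implements its ByLoci traversal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified read: a 1-based alignment start and its bases.
// Assumes an ungapped alignment, so the read covers start..end inclusive.
final class SimpleRead {
    final int start;
    final String bases;

    SimpleRead(int start, String bases) {
        this.start = start;
        this.bases = bases;
    }

    int end() { return start + bases.length() - 1; }

    boolean covers(int locus) { return start <= locus && locus <= end(); }
}

final class LocusTraversalSketch {
    // For each position in [from, to], collect the reads covering it
    // and record how many there are.
    static int[] depthByLocus(List<SimpleRead> reads, int from, int to) {
        int[] depth = new int[to - from + 1];
        for (int locus = from; locus <= to; locus++) {
            List<SimpleRead> pileup = new ArrayList<SimpleRead>();
            for (SimpleRead r : reads) {
                if (r.covers(locus)) {
                    pileup.add(r);
                }
            }
            depth[locus - from] = pileup.size();
        }
        return depth;
    }

    public static void main(String[] args) {
        List<SimpleRead> reads = Arrays.asList(
                new SimpleRead(1, "ACGT"),   // covers 1..4
                new SimpleRead(3, "GTAC"),   // covers 3..6
                new SimpleRead(4, "TA"));    // covers 4..5
        // prints "[1, 1, 2, 3, 2, 1]"
        System.out.println(Arrays.toString(depthByLocus(reads, 1, 6)));
    }
}
```

A by-read traversal would instead visit each SimpleRead once; the
by-locus view is what tools such as coverage reporting and SNP calling
need, since they reason about all reads at a position together.
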
Dependencies
------------
The GATK relies on a Java 6-compatible JRE. At the time of this writing,
the GATK team tests with Sun JRE version 1.6.0_12-b04. Additionally, the
GATK requires as inputs a sorted, indexed BAM file containing aligned reads
and a FASTA-format reference with an associated dictionary file (.dict) and
index (.fasta.fai).

Instructions for preparing input files are available here:

http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files

The bundled 'resources' directory contains an example BAM and FASTA.

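Before running a walker, it can help to confirm that the reference's
companion files are in place. The sketch below is a hypothetical
helper, not part of the GATK. It assumes the dictionary sits next to
the reference with the final extension replaced by .dict, and that the
index is the reference file name with .fai appended; adjust the naming
if your files follow a different layout.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks that a reference and its companions exist.
final class ReferenceFilesCheck {
    // Assumed naming: ref.fasta -> ref.dict (final extension replaced).
    static File dictFor(File fasta) {
        String name = fasta.getName();
        int dot = name.lastIndexOf('.');
        String stem = dot < 0 ? name : name.substring(0, dot);
        return new File(fasta.getParentFile(), stem + ".dict");
    }

    // Assumed naming: ref.fasta -> ref.fasta.fai (suffix appended).
    static File indexFor(File fasta) {
        return new File(fasta.getParentFile(), fasta.getName() + ".fai");
    }

    // Returns every expected file that is absent or not a regular file.
    static List<File> missing(File fasta) {
        List<File> out = new ArrayList<File>();
        File[] expected = { fasta, dictFor(fasta), indexFor(fasta) };
        for (File f : expected) {
            if (!f.isFile()) {
                out.add(f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        File fasta = new File(args.length > 0
                ? args[0]
                : "GenomeAnalysisTK/resources/exampleFASTA.fasta");
        List<File> absent = missing(fasta);
        if (absent.isEmpty()) {
            System.out.println("Reference, dictionary, and index found.");
        } else {
            for (File f : absent) {
                System.out.println("Missing: " + f.getPath());
            }
        }
    }
}
```

The check covers only the reference side; the BAM file and its index
would need a similar test using whatever index naming your tools
produce.
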
Getting Started
---------------
The GATK is distributed with a few standard analyses, including PrintReads,
Pileup, and DepthOfCoverage. More information on the included walkers is
available here:

http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers

To print the reads of the included sample data, untar the package into
the GenomeAnalysisTK directory and run the following command:

java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
  -T PrintReads \
  -R GenomeAnalysisTK/resources/exampleFASTA.fasta \
  -I GenomeAnalysisTK/resources/exampleBAM.bam

Support
-------
Documentation for the GATK is available at http://www.broadinstitute.org/gsa/wiki.
For help using the GATK or developing analyses with it, or to report bugs
or request features, please email gsadevelopers@broadinstitute.org.