89 lines
4.2 KiB
Plaintext
89 lines
4.2 KiB
Plaintext
The Genome Analysis Toolkit (GATK)
|
||
Copyright (c) 2009 The Broad Institute
|
||
|
||
Overview
|
||
--------
|
||
The Genome Analysis Toolkit (GATK) is a structured programming
|
||
framework designed to enable rapid development of efficient and robust
|
||
analysis tools for next-generation DNA sequencers. The GATK solves
|
||
the data management challenge by separating data access patterns from
|
||
analysis algorithms, using the functional programming philosophy of
|
||
Map/Reduce. Consequently, the GATK is structured into data traversals
|
||
and data walkers that interact through a programming contract in which
|
||
the traversal provides a series of units of data to the walker, and
|
||
the walker consumes each datum to generate an output for each datum.
|
||
Because many tools to analyze next-generation sequencing data access
|
||
the data in a very similar way, the GATK can provide a small but
|
||
nearly comprehensive set of traversal types that satisfying the data
|
||
access needs of the majority of analysis tools. For example,
|
||
traversals -Y´by each sequencer read¡ and ´by every read covering
|
||
each locus in a genome¡ are common throughout many tools such as
|
||
counting reads, building base quality histograms, reporting average
|
||
coverage of the genome, and calling SNPs. The small number of these
|
||
traversals, shared among many tools enables the core GATK development
|
||
team to optimize such traversals for correctness, stability, CPU
|
||
performance, memory footprint, and in many cases to even automatically
|
||
parallelize calculations. Moreover, since the traversal engine
|
||
encapsulates the complexity of efficiently accessing the
|
||
next-generation sequencing data, researchers and developers are free
|
||
to focus on their specific analysis algorithms. This not only vastly
|
||
improves productivity of the developers, who can quickly write new
|
||
analyses, but also results in tools that are efficient and robust and
|
||
can benefit from improvement to a common data management engine.
|
||
|
||
Capabilities
|
||
------------
|
||
The GenomeAnalysisTK development environment is currently provided as
|
||
a platform-independent Java programming language library. The core
|
||
system works with the nascent standard Sequence Alignment/Map (SAM)
|
||
format to represent reads using a production-quality SAM library
|
||
developed at the Broad. The system can access a variety of metadata
|
||
files such as dbSNP, Hapmap, RefSeq as well as work with genotype and
|
||
SNP files in GLF, Geli, and other common formats. The core system
|
||
handles read data from Illumina/Solexa, SOLiD, and Roche/454. The
|
||
current GATK engine can process all of the 1000 genomes data
|
||
representing ~5Tb of data from these three technologies produced from
|
||
multiple sequencing centers and aligned to the human reference genome
|
||
with multiple aligners. The GATK currently provides traversals by
|
||
each read (ByRead traversal), by all reads covering each locus in the
|
||
genome (ByLoci traversal), and by all reads within pre-specified
|
||
intervals on the genome (ByWindow traversal).
|
||
|
||
Dependencies
|
||
------------
|
||
The GATK relies on a Java 6-compatible JRE. At the time of this writing,
|
||
the GATK team tests with Sun JRE version 1.6.0_12-b04.
|
||
|
||
Additionally, a sorted, indexed BAM file containing aligned reads and a
|
||
fasta-format reference with associated dictionary file (.dict) and
|
||
index (.fasta.fai) are required inputs to the tool.
|
||
|
||
Instructions for preparing input files are available at the site below:
|
||
|
||
http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files
|
||
|
||
An example BAM and fasta are provided in the 'resources' directory of
|
||
the GATK.
|
||
|
||
Getting Started
|
||
---------------
|
||
The GATK is distributed with a few standard analyses, including PrintReads,
|
||
Pileup, and DepthOfCoverage. More information on the included walkers is
|
||
available on the following wiki page:
|
||
|
||
http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers
|
||
|
||
To run PrintReads on the included sample data, run the following command:
|
||
|
||
java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
|
||
-T PrintReads \
|
||
-R GenomeAnalysisTK/resources/exampleFASTA.fasta \
|
||
-I GenomeAnalysisTK/resources/exampleBAM.bam
|
||
|
||
Support
|
||
-------
|
||
Documentation for the GATK is available at
|
||
http://www.broadinstitute.org/gsa/wiki. For help using or developing
|
||
for the GATK, bug reports, or feature requests, please email
|
||
gsadevelopers@broadinstitute.org.
|