The Genome Analysis Toolkit (GATK)
Copyright (c) 2009 The Broad Institute

Overview
--------
The Genome Analysis Toolkit (GATK) is a structured programming
framework designed to enable rapid development of efficient and robust
analysis tools for next-generation DNA sequencers. The GATK solves
the data management challenge by separating data access patterns from
analysis algorithms, using the functional programming philosophy of
Map/Reduce. Consequently, the GATK is structured into data traversals
and data walkers that interact through a programming contract in which
the traversal provides a series of units of data to the walker, and
the walker consumes each datum and generates an output for it.

Because many tools that analyze next-generation sequencing data access
the data in a very similar way, the GATK can provide a small but
nearly comprehensive set of traversal types that satisfy the data
access needs of the majority of analysis tools. For example,
traversals "by each sequencer read" and "by every read covering
each locus in a genome" are common to many tasks, such as
counting reads, building base quality histograms, reporting average
coverage of the genome, and calling SNPs. Because a small number of
traversals is shared among many tools, the core GATK development
team can optimize them for correctness, stability, CPU
performance, and memory footprint, and in many cases can even
parallelize calculations automatically. Moreover, since the traversal
engine encapsulates the complexity of efficiently accessing the
next-generation sequencing data, researchers and developers are free
to focus on their specific analysis algorithms. This not only vastly
improves the productivity of developers, who can quickly write new
analyses, but also results in tools that are efficient and robust and
can benefit from improvements to a common data management engine.
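
The traversal/walker contract described above can be sketched in a few
lines of Java. This is a minimal illustration under stated assumptions,
not the GATK API: the names Walker, CountReadsWalker, TraversalSketch,
map, reduceInit, and reduce are hypothetical, and reads are stood in for
by plain strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not the GATK API.

/** The walker side of the contract: map each datum, then fold the results. */
interface Walker<D, M, R> {
    M map(D datum);            // per-datum analysis
    R reduceInit();            // starting value for the fold
    R reduce(M value, R sum);  // combine one mapped value into the running result
}

/** A toy walker that counts the reads it is shown. */
class CountReadsWalker implements Walker<String, Integer, Long> {
    public Integer map(String read) { return 1; }
    public Long reduceInit() { return 0L; }
    public Long reduce(Integer value, Long sum) { return sum + value; }
}

class TraversalSketch {
    /** The traversal side: it alone decides how the data is iterated. */
    static <D, M, R> R traverse(Iterable<D> data, Walker<D, M, R> walker) {
        R result = walker.reduceInit();
        for (D datum : data) {
            result = walker.reduce(walker.map(datum), result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList("ACGT", "TTGA", "GGCC");
        System.out.println(traverse(reads, new CountReadsWalker())); // prints 3
    }
}
```

Swapping in a different walker (for example, one that accumulates a
base quality histogram) would leave traverse unchanged, which is the
separation of data access from analysis that the Overview describes.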

Capabilities
------------
The GenomeAnalysisTK development environment is currently provided as
a platform-independent Java library. The core system represents reads
in the nascent standard Sequence Alignment/Map (SAM) format, using a
production-quality SAM library developed at the Broad. The system can
access a variety of metadata files such as dbSNP, HapMap, and RefSeq,
and can work with genotype and SNP files in GLF, Geli, and other
common formats. The core system handles read data from
Illumina/Solexa, SOLiD, and Roche/454. The current GATK engine can
process all of the 1000 Genomes data, representing ~5Tb of data from
these three technologies, produced at multiple sequencing centers and
aligned to the human reference genome with multiple aligners. The
GATK currently provides traversals by each read (ByRead traversal),
by all reads covering each locus in the genome (ByLoci traversal),
and by all reads within pre-specified intervals on the genome
(ByWindow traversal).
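
To illustrate how a by-locus view differs from a by-read view, the
sketch below regroups a list of reads into per-locus pileups. It is a
toy under loud assumptions: AlignedRead, LocusTraversalSketch, and
byLocus are hypothetical names, reads are treated as ungapped intervals
on a single contig, and everything is held in memory, which an engine
handling terabytes of data would presumably avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical types for illustration only; not the GATK API.

/** An ungapped alignment: 1-based inclusive start and a length in bases. */
class AlignedRead {
    final String name;
    final int start;
    final int length;

    AlignedRead(String name, int start, int length) {
        this.name = name;
        this.start = start;
        this.length = length;
    }
}

class LocusTraversalSketch {
    /** Regroups reads into per-locus pileups, keyed by position in ascending order. */
    static Map<Integer, List<String>> byLocus(List<AlignedRead> reads) {
        Map<Integer, List<String>> pileups = new TreeMap<>();
        for (AlignedRead read : reads) {
            for (int pos = read.start; pos < read.start + read.length; pos++) {
                pileups.computeIfAbsent(pos, k -> new ArrayList<>()).add(read.name);
            }
        }
        return pileups;
    }

    public static void main(String[] args) {
        List<AlignedRead> reads = Arrays.asList(
                new AlignedRead("r1", 1, 4),   // covers loci 1-4
                new AlignedRead("r2", 3, 4));  // covers loci 3-6
        Map<Integer, List<String>> pileups = byLocus(reads);
        System.out.println(pileups.get(3)); // prints [r1, r2]
        System.out.println(pileups.get(5)); // prints [r2]
    }
}
```

The size of each pileup is the depth of coverage at that locus, so a
coverage report reduces to reading list sizes from the map.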

Dependencies
------------
The GATK relies on a Java 6-compatible JRE. At the time of this writing,
the GATK team tests with Sun JRE version 1.6.0_12-b04.

Additionally, a sorted, indexed BAM file containing aligned reads and a
fasta-format reference with associated dictionary file (.dict) and
index (.fasta.fai) are required inputs to the tool.

Instructions for preparing input files are available at the site below:

http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files

An example BAM and fasta are provided in the 'resources' directory of
the GATK.
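
The reference companion files named above can be checked before
launching a run. The following sketch assumes that the dictionary
replaces the reference's final extension with .dict and that the index
appends .fai to the full file name, as the extensions above suggest.
The class and method names are hypothetical and not part of the GATK,
and the BAM index is not checked because its file name is not given
here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the GATK.
class ReferenceFilesSketch {
    /** Assumed naming: reference.fasta -> reference.dict. */
    static String dictionaryFor(String fastaPath) {
        int slash = Math.max(fastaPath.lastIndexOf('/'), fastaPath.lastIndexOf('\\'));
        int dot = fastaPath.lastIndexOf('.');
        // Only strip an extension that belongs to the file name itself.
        String stem = dot > slash + 1 ? fastaPath.substring(0, dot) : fastaPath;
        return stem + ".dict";
    }

    /** Assumed naming: reference.fasta -> reference.fasta.fai. */
    static String indexFor(String fastaPath) {
        return fastaPath + ".fai";
    }

    /** Returns the paths among reference, dictionary, and index that are not regular files. */
    static List<String> missingFiles(String fastaPath) {
        List<String> missing = new ArrayList<>();
        String[] expected = { fastaPath, dictionaryFor(fastaPath), indexFor(fastaPath) };
        for (String path : expected) {
            if (!new File(path).isFile()) {
                missing.add(path);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String fasta = args.length > 0
                ? args[0]
                : "GenomeAnalysisTK/resources/exampleFASTA.fasta";
        List<String> missing = missingFiles(fasta);
        System.out.println(missing.isEmpty() ? "All reference files found." : "Missing: " + missing);
    }
}
```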

Getting Started
---------------
The GATK is distributed with a few standard analyses, including PrintReads,
Pileup, and DepthOfCoverage. More information on the included walkers is
available on the following wiki page:

http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers

To run PrintReads on the included sample data, run the following command:

java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
  -T PrintReads \
  -R GenomeAnalysisTK/resources/exampleFASTA.fasta \
  -I GenomeAnalysisTK/resources/exampleBAM.bam

Support
-------
Documentation for the GATK is available at
http://www.broadinstitute.org/gsa/wiki. For help using or developing
for the GATK, or to submit bug reports or feature requests, please
email gsadevelopers@broadinstitute.org.