GATK readme.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1262 348d0f76-0448-11de-a6fe-93d51630548a
This commit is contained in:
parent
e1055bcc4c
commit
a04f205a7f
|
|
@ -0,0 +1,88 @@
|
|||
The Genome Analysis Toolkit (GATK)
|
||||
Copyright (c) 2009 The Broad Institute
|
||||
|
||||
Overview
|
||||
--------
|
||||
The Genome Analysis Toolkit (GATK) is a structured programming
|
||||
framework designed to enable rapid development of efficient and robust
|
||||
analysis tools for next-generation DNA sequencers. The GATK solves
|
||||
the data management challenge by separating data access patterns from
|
||||
analysis algorithms, using the functional programming philosophy of
|
||||
Map/Reduce. Consequently, the GATK is structured into data traversals
|
||||
and data walkers that interact through a programming contract in which
|
||||
the traversal provides a series of units of data to the walker, and
|
||||
the walker consumes each datum to generate an output for each datum.
|
||||
Because many tools to analyze next-generation sequencing data access
|
||||
the data in a very similar way, the GATK can provide a small but
|
||||
nearly comprehensive set of traversal types that satisfying the data
|
||||
access needs of the majority of analysis tools. For example,
|
||||
traversals -Y´by each sequencer read¡ and ´by every read covering
|
||||
each locus in a genome¡ are common throughout many tools such as
|
||||
counting reads, building base quality histograms, reporting average
|
||||
coverage of the genome, and calling SNPs. The small number of these
|
||||
traversals, shared among many tools enables the core GATK development
|
||||
team to optimize such traversals for correctness, stability, CPU
|
||||
performance, memory footprint, and in many cases to even automatically
|
||||
parallelize calculations. Moreover, since the traversal engine
|
||||
encapsulates the complexity of efficiently accessing the
|
||||
next-generation sequencing data, researchers and developers are free
|
||||
to focus on their specific analysis algorithms. This not only vastly
|
||||
improves productivity of the developers, who can quickly write new
|
||||
analyses, but also results in tools that are efficient and robust and
|
||||
can benefit from improvement to a common data management engine.
|
||||
|
||||
Capabilities
|
||||
------------
|
||||
The GenomeAnalysisTK development environment is currently provided as
|
||||
a platform-independent Java programming language library. The core
|
||||
system works with the nascent standard Sequence Alignment/Map (SAM)
|
||||
format to represent reads using a production-quality SAM library
|
||||
developed at the Broad. The system can access a variety of metadata
|
||||
files such as dbSNP, Hapmap, RefSeq as well as work with genotype and
|
||||
SNP files in GLF, Geli, and other common formats. The core system
|
||||
handles read data from Illumina/Solexa, SOLiD, and Roche/454. The
|
||||
current GATK engine can process all of the 1000 genomes data
|
||||
representing ~5Tb of data from these three technologies produced from
|
||||
multiple sequencing centers and aligned to the human reference genome
|
||||
with multiple aligners. The GATK currently provides traversals by
|
||||
each read (ByRead traversal), by all reads covering each locus in the
|
||||
genome (ByLoci traversal), and by all reads within pre-specified
|
||||
intervals on the genome (ByWindow traversal).
|
||||
|
||||
Dependencies
|
||||
------------
|
||||
The GATK relies on a Java 6-compatible JRE. At the time of this writing,
|
||||
the GATK team tests with Sun JRE version 1.6.0_12-b04.
|
||||
|
||||
Additionally, a sorted, indexed BAM file containing aligned reads and a
|
||||
fasta-format reference with associated dictionary file (.dict) and
|
||||
index (.fasta.fai) are required inputs to the tool.
|
||||
|
||||
Instructions for preparing input files are available at the site below:
|
||||
|
||||
http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files
|
||||
|
||||
An example BAM and fasta are provided in the 'resources' directory of
|
||||
the GATK.
|
||||
|
||||
Getting Started
|
||||
---------------
|
||||
The GATK is distributed with a few standard analyses, including PrintReads,
|
||||
Pileup, and DepthOfCoverage. More information on the included walkers is
|
||||
available on the following wiki page:
|
||||
|
||||
http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers
|
||||
|
||||
To run PrintReads on the included sample data, run the following command:
|
||||
|
||||
java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
|
||||
-T PrintReads \
|
||||
-R GenomeAnalysisTK/resources/exampleFASTA.fasta \
|
||||
-I GenomeAnalysisTK/resources/exampleBAM.bam
|
||||
|
||||
Support
|
||||
-------
|
||||
Documentation for the GATK is available at
|
||||
http://www.broadinstitute.org/gsa/wiki. For help using or developing
|
||||
for the GATK, bug reports, or feature requests, please email
|
||||
gsadevelopers@broadinstitute.org.
|
||||
Loading…
Reference in New Issue