gatk-3.8/doc/README

The Genome Analysis Toolkit (GATK) 
Copyright (c) 2009 The Broad Institute 

Overview 
-------- 
The Genome Analysis Toolkit (GATK) is a structured programming
framework designed to enable rapid development of efficient and robust
analysis tools for next-generation DNA sequencers.  The GATK solves
the data management challenge by separating data access patterns from
analysis algorithms, using the functional programming philosophy of
Map/Reduce.  Consequently, the GATK is structured into data traversals
and data walkers that interact through a programming contract in which
the traversal provides a series of units of data to the walker, and
the walker consumes each datum and generates an output for it.
Because many tools to analyze next-generation sequencing data access
the data in a very similar way, the GATK can provide a small but
nearly comprehensive set of traversal types that satisfy the data
access needs of the majority of analysis tools.  For example,
traversals "by each sequencer read" and "by every read covering
each locus in a genome" are common throughout many tools such as
counting reads, building base quality histograms, reporting average
coverage of the genome, and calling SNPs.  The small number of these
traversals, shared among many tools, enables the core GATK development
team to optimize them for correctness, stability, CPU performance, and
memory footprint, and in many cases even to parallelize calculations
automatically.  Moreover, since the traversal engine
encapsulates the complexity of efficiently accessing the
next-generation sequencing data, researchers and developers are free
to focus on their specific analysis algorithms.  This not only vastly
improves productivity of the developers, who can quickly write new
analyses, but also results in tools that are efficient and robust and
can benefit from improvement to a common data management engine.
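
The traversal/walker contract described above can be sketched in a few
lines of Java.  This is a minimal illustration only, not the actual GATK
API: the names WalkerSketch, Walker, traverse, and CountReadsWalker are
hypothetical, and reads are modeled as plain strings.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a traversal/walker contract; not the GATK API. */
class WalkerSketch {

    /** A walker maps each datum to a value and folds the values together. */
    interface Walker<D, M, R> {
        M map(D datum);            // per-datum analysis
        R reduceInit();            // starting value for the accumulation
        R reduce(M value, R sum);  // combine one mapped value with the total
    }

    /** The traversal owns data access and feeds each datum to the walker. */
    static <D, M, R> R traverse(Iterable<D> data, Walker<D, M, R> walker) {
        R sum = walker.reduceInit();
        for (D datum : data) {
            sum = walker.reduce(walker.map(datum), sum);
        }
        return sum;
    }

    /** Example walker: counts reads by mapping each read to 1 and summing. */
    static class CountReadsWalker implements Walker<String, Integer, Integer> {
        public Integer map(String read) { return 1; }
        public Integer reduceInit() { return 0; }
        public Integer reduce(Integer value, Integer sum) { return value + sum; }
    }

    static int countReads(List<String> reads) {
        return traverse(reads, new CountReadsWalker());
    }

    public static void main(String[] args) {
        // Reads are stand-in strings here; a real traversal would supply
        // aligned read records.
        List<String> reads = Arrays.asList("ACGT", "GGCA", "TTAG");
        System.out.println(countReads(reads)); // prints 3
    }
}
```

Because the walker never touches the data source directly, the same
walker would run unchanged if the traversal were later optimized or
parallelized.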

Capabilities 
------------ 
The GenomeAnalysisTK development environment is currently provided as
a platform-independent Java programming language library.  The core
system works with the nascent standard Sequence Alignment/Map (SAM)
format to represent reads using a production-quality SAM library
developed at the Broad.  The system can access a variety of metadata
files such as dbSNP, HapMap, and RefSeq, and can work with genotype and
SNP files in GLF, Geli, and other common formats.  The core system
handles read data from Illumina/Solexa, SOLiD, and Roche/454.  The
current GATK engine can process all of the 1000 Genomes data,
representing ~5Tb of data from these three technologies, produced by
multiple sequencing centers and aligned to the human reference genome
with multiple aligners.  The GATK currently provides traversals by
each read (ByRead traversal), by all reads covering each locus in the
genome (ByLoci traversal), and by all reads within pre-specified
intervals on the genome (ByWindow traversal).
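
To make the difference between the by-locus and by-window views
concrete, here is a small Java sketch.  It is an illustration under
simplifying assumptions, not GATK code: the class and method names are
hypothetical, reads are ungapped and on a single contig, coordinates are
1-based and inclusive, and "within an interval" is taken to mean
"overlapping the interval".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of by-locus and by-window views over reads. */
class TraversalSketch {

    /** Simplified aligned read: ungapped, 1-based inclusive coordinates. */
    static class Read {
        final String name;
        final int start;
        final int length;

        Read(String name, int start, int length) {
            this.name = name;
            this.start = start;
            this.length = length;
        }

        int end() { return start + length - 1; }
    }

    /** By-locus view: for each covered position, the reads covering it. */
    static Map<Integer, List<Read>> byLocus(List<Read> reads) {
        Map<Integer, List<Read>> pileups = new TreeMap<Integer, List<Read>>();
        for (Read read : reads) {
            for (int pos = read.start; pos <= read.end(); pos++) {
                List<Read> covering = pileups.get(pos);
                if (covering == null) {
                    covering = new ArrayList<Read>();
                    pileups.put(pos, covering);
                }
                covering.add(read);
            }
        }
        return pileups;
    }

    /** By-window view: reads overlapping [windowStart, windowEnd]. */
    static List<Read> byWindow(List<Read> reads, int windowStart, int windowEnd) {
        List<Read> overlapping = new ArrayList<Read>();
        for (Read read : reads) {
            if (read.start <= windowEnd && read.end() >= windowStart) {
                overlapping.add(read);
            }
        }
        return overlapping;
    }

    public static void main(String[] args) {
        // r1 covers positions 1-4, r2 covers positions 3-6.
        List<Read> reads = Arrays.asList(new Read("r1", 1, 4), new Read("r2", 3, 4));
        for (Map.Entry<Integer, List<Read>> e : byLocus(reads).entrySet()) {
            System.out.println("locus " + e.getKey() + " depth " + e.getValue().size());
        }
        System.out.println("reads in window 5-8: " + byWindow(reads, 5, 8).size());
    }
}
```

A by-read traversal is simply the plain loop over the list; the other
two views regroup the same reads by position or by interval.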

Dependencies
------------
The GATK relies on a Java 6-compatible JRE.  At the time of this writing,
the GATK team tests with Sun JRE version 1.6.0_12-b04.  Additionally, the
GATK requires as inputs a sorted, indexed BAM file containing aligned reads
and a FASTA-format reference with an associated dictionary file (.dict) and
index (.fasta.fai).
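
As a rough pre-flight check, the companion file names can be derived
from the reference path.  The sketch below is not part of the GATK; the
class and method names are hypothetical, and it assumes the dictionary
replaces the reference's extension with .dict while the index appends
.fai to the full file name.

```java
import java.io.File;

/** Hypothetical helper that derives the reference companion file names. */
class ReferenceFilesSketch {

    /** Assumed convention: example.fasta -> example.dict */
    static File dictionaryFor(File fasta) {
        String name = fasta.getName();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return new File(fasta.getParentFile(), base + ".dict");
    }

    /** Assumed convention: example.fasta -> example.fasta.fai */
    static File indexFor(File fasta) {
        return new File(fasta.getParentFile(), fasta.getName() + ".fai");
    }

    public static void main(String[] args) {
        // Defaults to the bundled example reference named in this README.
        File fasta = new File(args.length > 0
                ? args[0]
                : "GenomeAnalysisTK/resources/exampleFASTA.fasta");
        File[] required = { fasta, dictionaryFor(fasta), indexFor(fasta) };
        for (File f : required) {
            System.out.println((f.isFile() ? "found    " : "MISSING  ") + f.getPath());
        }
    }
}
```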

Instructions for preparing input files are available here:

http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files

The bundled 'resources' directory contains an example BAM and FASTA.

Getting Started
---------------
The GATK is distributed with a few standard analyses, including PrintReads,
Pileup, and DepthOfCoverage.  More information on the included walkers is
available here:

http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers

To print the reads of the included sample data, untar the package into
the GenomeAnalysisTK directory and run the following command:

java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
     -T PrintReads \
     -R GenomeAnalysisTK/resources/exampleFASTA.fasta \
     -I GenomeAnalysisTK/resources/exampleBAM.bam

Support
-------
Documentation for the GATK is available at http://www.broadinstitute.org/gsa/wiki.  
For help using the GATK or developing analyses with it, or to submit bug
reports or feature requests, please email gsadevelopers@broadinstitute.org.