GATK readme.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1262 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 21:00:08 +00:00 · 2009-07-15 21:00:08 +00:00 · a04f205a7f
parent e1055bcc4c
commit a04f205a7f
1 changed files with 88 additions and 0 deletions
--- a/doc/README
+++ b/doc/README
@@ -0,0 +1,88 @@
+The Genome Analysis Toolkit (GATK) 
+Copyright (c) 2009 The Broad Institute 
+
+Overview 
+-------- 
+The Genome Analysis Toolkit (GATK) is a structured programming
+framework designed to enable rapid development of efficient and robust
+analysis tools for next-generation DNA sequencers.  The GATK solves
+the data management challenge by separating data access patterns from
+analysis algorithms, using the functional programming philosophy of
+Map/Reduce.  Consequently, the GATK is structured into data traversals
+and data walkers that interact through a programming contract in which
+the traversal provides a series of units of data to the walker, and
+the walker consumes each datum and generates an output for it.
+Because many tools to analyze next-generation sequencing data access
+the data in a very similar way, the GATK can provide a small but
+nearly comprehensive set of traversal types that satisfy the data
+access needs of the majority of analysis tools.  For example,
+traversals "by each sequencer read" and "by every read covering
+each locus in a genome" are common to many tools, such as those for
+counting reads, building base quality histograms, reporting average
+coverage of the genome, and calling SNPs.  Because these few
+traversals are shared among many tools, the core GATK development
+team can optimize them for correctness, stability, CPU performance,
+and memory footprint, and in many cases can even parallelize
+calculations automatically.  Moreover, since the traversal engine
+encapsulates the complexity of efficiently accessing the
+next-generation sequencing data, researchers and developers are free
+to focus on their specific analysis algorithms.  This not only vastly
+improves the productivity of developers, who can quickly write new
+analyses, but also results in tools that are efficient and robust and
+that benefit from improvements to a common data management engine.
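+
+As an illustration of the traversal/walker contract described above,
+the following is a minimal sketch, not the GATK's actual API.  The
+names Walker, traverse, CountReads, map, reduceInit, and reduce are
+hypothetical, and reads are modeled as plain strings purely for the
+sake of the example.

```java
import java.util.Arrays;
import java.util.List;

public class WalkerSketch {

    /** Hypothetical walker contract; an assumption, not the GATK's real interface. */
    interface Walker<D, M, R> {
        M map(D datum);            // analyze a single unit of data
        R reduceInit();            // starting value for the accumulated result
        R reduce(M value, R sum);  // fold one map result into the running result
    }

    /** Hypothetical traversal: owns data access and hands each datum to the walker. */
    static <D, M, R> R traverse(Iterable<D> data, Walker<D, M, R> walker) {
        R sum = walker.reduceInit();
        for (D datum : data) {
            sum = walker.reduce(walker.map(datum), sum);
        }
        return sum;
    }

    /** Example walker that counts reads; each read contributes 1 to the total. */
    static class CountReads implements Walker<String, Integer, Integer> {
        public Integer map(String read) { return 1; }
        public Integer reduceInit() { return 0; }
        public Integer reduce(Integer value, Integer sum) { return sum + value; }
    }

    public static void main(String[] args) {
        // Stand-in data: three made-up read sequences.
        List<String> reads = Arrays.asList("ACGT", "TTGA", "GGCA");
        int count = traverse(reads, new CountReads());
        System.out.println("reads seen: " + count);
    }
}
```

+In this sketch the walker never touches the data source, so a
+different traversal could be swapped in without changing CountReads.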
+
+Capabilities 
+------------ 
+The GenomeAnalysisTK development environment is currently provided as
+a platform-independent Java programming language library.  The core
+system represents reads in the nascent standard Sequence Alignment/Map
+(SAM) format, using a production-quality SAM library developed at the
+Broad.  The system can access a variety of metadata files such as
+dbSNP, HapMap, and RefSeq, and can work with genotype and SNP files
+in GLF, Geli, and other common formats.  The core system handles read
+data from Illumina/Solexa, SOLiD, and Roche/454.  The current GATK
+engine can process all of the 1000 Genomes Project data, representing
+~5Tb of data from these three technologies, produced by multiple
+sequencing centers and aligned to the human reference genome with
+multiple aligners.  The GATK currently provides traversals by
+each read (ByRead traversal), by all reads covering each locus in the
+genome (ByLoci traversal), and by all reads within pre-specified
+intervals on the genome (ByWindow traversal).
+
+Dependencies
+------------
+The GATK relies on a Java 6-compatible JRE.  At the time of this writing,
+the GATK team tests with Sun JRE version 1.6.0_12-b04.  
+
+Additionally, a sorted, indexed BAM file containing aligned reads and a 
+fasta-format reference with associated dictionary file (.dict) and 
+index (.fasta.fai) are required inputs to the tool.  
+
+Instructions for preparing input files are available at the site below:
+
+http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files
+
+An example BAM and fasta are provided in the 'resources' directory of
+the GATK.
+
+Getting Started
+---------------
+The GATK is distributed with a few standard analyses, including PrintReads,
+Pileup, and DepthOfCoverage.  More information on the included walkers is
+available on the following wiki page:
+
+http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers
+
+To run PrintReads on the included sample data, run the following command:
+
+java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
+     -T PrintReads \
+     -R GenomeAnalysisTK/resources/exampleFASTA.fasta \
+     -I GenomeAnalysisTK/resources/exampleBAM.bam
+
+Support
+-------
+Documentation for the GATK is available at
+http://www.broadinstitute.org/gsa/wiki.  For help using or developing
+for the GATK, bug reports, or feature requests, please email
+gsadevelopers@broadinstitute.org.