diff --git a/doc/README b/doc/README new file mode 100644 index 000000000..e01bc67b6 --- /dev/null +++ b/doc/README @@ -0,0 +1,88 @@ +The Genome Analysis Toolkit (GATK) +Copyright (c) 2009 The Broad Institute + +Overview +-------- +The Genome Analysis Toolkit (GATK) is a structured programming +framework designed to enable rapid development of efficient and robust +analysis tools for next-generation DNA sequencers. The GATK solves +the data management challenge by separating data access patterns from +analysis algorithms, using the functional programming philosophy of +Map/Reduce. Consequently, the GATK is structured into data traversals +and data walkers that interact through a programming contract in which +the traversal provides a series of units of data to the walker, and +the walker consumes each datum to generate an output for each datum. +Because many tools to analyze next-generation sequencing data access +the data in a very similar way, the GATK can provide a small but +nearly comprehensive set of traversal types that satisfying the data +access needs of the majority of analysis tools. For example, +traversals -Y´by each sequencer readˇ and ´by every read covering +each locus in a genomeˇ are common throughout many tools such as +counting reads, building base quality histograms, reporting average +coverage of the genome, and calling SNPs. The small number of these +traversals, shared among many tools enables the core GATK development +team to optimize such traversals for correctness, stability, CPU +performance, memory footprint, and in many cases to even automatically +parallelize calculations. Moreover, since the traversal engine +encapsulates the complexity of efficiently accessing the +next-generation sequencing data, researchers and developers are free +to focus on their specific analysis algorithms. This not only vastly +improves productivity of the developers, who can quickly write new +analyses, but also results in tools that are efficient and robust and +can benefit from improvement to a common data management engine. + +Capabilities +------------ +The GenomeAnalysisTK development environment is currently provided as +a platform-independent Java programming language library. The core +system works with the nascent standard Sequence Alignment/Map (SAM) +format to represent reads using a production-quality SAM library +developed at the Broad. The system can access a variety of metadata +files such as dbSNP, Hapmap, RefSeq as well as work with genotype and +SNP files in GLF, Geli, and other common formats. The core system +handles read data from Illumina/Solexa, SOLiD, and Roche/454. The +current GATK engine can process all of the 1000 genomes data +representing ~5Tb of data from these three technologies produced from +multiple sequencing centers and aligned to the human reference genome +with multiple aligners. The GATK currently provides traversals by +each read (ByRead traversal), by all reads covering each locus in the +genome (ByLoci traversal), and by all reads within pre-specified +intervals on the genome (ByWindow traversal). + +Dependencies +------------ +The GATK relies on a Java 6-compatible JRE. At the time of this writing, +the GATK team tests with Sun JRE version 1.6.0_12-b04. + +Additionally, a sorted, indexed BAM file containing aligned reads and a +fasta-format reference with associated dictionary file (.dict) and +index (.fasta.fai) are required inputs to the tool. + +Instructions for preparing input files are available at the site below: + +http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files + +An example BAM and fasta are provided in the 'resources' directory of +the GATK. + +Getting Started +--------------- +The GATK is distributed with a few standard analyses, including PrintReads, +Pileup, and DepthOfCoverage. More information on the included walkers is +available on the following wiki page: + +http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers + +To run PrintReads on the included sample data, run the following command: + +java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \ + -T PrintReads \ + -R GenomeAnalysisTK/resources/exampleFASTA.fasta \ + -I GenomeAnalysisTK/resources/exampleBAM.bam + +Support +------- +Documentation for the GATK is available at +http://www.broadinstitute.org/gsa/wiki. For help using or developing +for the GATK, bug reports, or feature requests, please email +gsadevelopers@broadinstitute.org.