gatk-3.8/doc/README

89 lines
4.2 KiB
Plaintext
Raw Blame History

This file contains invisible Unicode characters!

This file contains invisible Unicode characters that may be processed differently from what appears below. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to reveal hidden characters.

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

The Genome Analysis Toolkit (GATK)
Copyright (c) 2009 The Broad Institute
Overview
--------
The Genome Analysis Toolkit (GATK) is a structured programming
framework designed to enable rapid development of efficient and robust
analysis tools for next-generation DNA sequencers. The GATK solves
the data management challenge by separating data access patterns from
analysis algorithms, using the functional programming philosophy of
Map/Reduce. Consequently, the GATK is structured into data traversals
and data walkers that interact through a programming contract in which
the traversal provides a series of units of data to the walker, and
the walker consumes each datum to generate an output for each datum.
Because many tools to analyze next-generation sequencing data access
the data in a very similar way, the GATK can provide a small but
nearly comprehensive set of traversal types that satisfying the data
access needs of the majority of analysis tools. For example,
traversals -Y´by each sequencer read¡ and ´by every read covering
each locus in a genome¡ are common throughout many tools such as
counting reads, building base quality histograms, reporting average
coverage of the genome, and calling SNPs. The small number of these
traversals, shared among many tools enables the core GATK development
team to optimize such traversals for correctness, stability, CPU
performance, memory footprint, and in many cases to even automatically
parallelize calculations. Moreover, since the traversal engine
encapsulates the complexity of efficiently accessing the
next-generation sequencing data, researchers and developers are free
to focus on their specific analysis algorithms. This not only vastly
improves productivity of the developers, who can quickly write new
analyses, but also results in tools that are efficient and robust and
can benefit from improvement to a common data management engine.
Capabilities
------------
The GenomeAnalysisTK development environment is currently provided as
a platform-independent Java programming language library. The core
system works with the nascent standard Sequence Alignment/Map (SAM)
format to represent reads using a production-quality SAM library
developed at the Broad. The system can access a variety of metadata
files such as dbSNP, Hapmap, RefSeq as well as work with genotype and
SNP files in GLF, Geli, and other common formats. The core system
handles read data from Illumina/Solexa, SOLiD, and Roche/454. The
current GATK engine can process all of the 1000 genomes data
representing ~5Tb of data from these three technologies produced from
multiple sequencing centers and aligned to the human reference genome
with multiple aligners. The GATK currently provides traversals by
each read (ByRead traversal), by all reads covering each locus in the
genome (ByLoci traversal), and by all reads within pre-specified
intervals on the genome (ByWindow traversal).
Dependencies
------------
The GATK relies on a Java 6-compatible JRE. At the time of this writing,
the GATK team tests with Sun JRE version 1.6.0_12-b04.
Additionally, a sorted, indexed BAM file containing aligned reads and a
fasta-format reference with associated dictionary file (.dict) and
index (.fasta.fai) are required inputs to the tool.
Instructions for preparing input files are available at the site below:
http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files
An example BAM and fasta are provided in the 'resources' directory of
the GATK.
Getting Started
---------------
The GATK is distributed with a few standard analyses, including PrintReads,
Pileup, and DepthOfCoverage. More information on the included walkers is
available on the following wiki page:
http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers
To run PrintReads on the included sample data, run the following command:
java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
-T PrintReads \
-R GenomeAnalysisTK/resources/exampleFASTA.fasta \
-I GenomeAnalysisTK/resources/exampleBAM.bam
Support
-------
Documentation for the GATK is available at
http://www.broadinstitute.org/gsa/wiki. For help using or developing
for the GATK, bug reports, or feature requests, please email
gsadevelopers@broadinstitute.org.