\documentclass[11pt,fullpage]{article}
\usepackage[urlcolor=blue,colorlinks=true]{hyperref}

\oddsidemargin 0.0in
\textwidth 6.5in

\begin{document}

\title{Getting Started with the Genome Analysis Toolkit (GATK)}
\author{Matt Hanna}
\date{Created March 16, 2009\\ Updated \today}
\maketitle

\section{Build Prerequisites}

GATK requires JDK 1.6 and Ant 1.7.1 to compile.

\section{Getting and Building the Source}

The GATK source is located in the Sting svn repository and is built with Ant using the build.xml in the root directory. Download and build the source as follows:

\begin{verbatim}
svn co https://svnrepos/Sting/trunk Sting
cd Sting
ant
\end{verbatim}

\section{Getting Started}

The core concept behind GATK is the walker, a class that implements the three core operations: filtering, mapping, and reducing.

\begin{description}
\item [filter] Reduces the size of the dataset by applying a predicate.
\item [map] Applies a function to each individual element in a dataset, effectively `mapping' it to a new element.
\item [reduce] Inductively combines the elements of a list. The base case is supplied by the reduceInit() function, and the inductive step is performed by the reduce() function.
\end{description}

Users of the GATK provide a walker to run their analyses. The engine produces a result by first filtering the dataset, then running the map operation on each remaining element, and finally reducing the outputs of the map operation to a single result.

\section{Creating a Walker}

To be loaded by GATK, the walker must satisfy the following properties:

\begin{enumerate}
\item It must be a loose class, not packaged into a jar file.
\item It must be in the unnamed package (in other words, the source should not start with a package declaration).
\item It must subclass one of the basic walkers in the org.broadinstitute.sting.gatk.walkers package: ReadWalker or LociWalker.
\item It must live in the directory \$STING\_HOME/dist/walkers.
\end{enumerate}

\section{Example}

This walker prints a line of output for each read it sees and computes the total number of reads by mapping every read to 1 and summing the results.

\begin{samepage}
Copy the following text into the file \$STING\_HOME/dist/walkers/HelloWalker.java:

\begin{verbatim}
import net.sf.samtools.SAMRecord;
import org.broadinstitute.sting.gatk.LocusContext;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;

/**
 * Define a class extending from ReadWalker with types
 * <Integer,Long>: map() returns an Integer and reduce()
 * returns a Long.
 */
public class HelloWalker extends ReadWalker<Integer,Long> {
    // Number of reads seen so far.
    private Long currentRead = 0L;

    // Maps each read to the value 1.
    public Integer map(LocusContext context, SAMRecord read) {
        System.out.printf("Hello read %d%n", ++currentRead);
        return 1;
    }

    // Provides an initial value for the reduce function.
    public Long reduceInit() {
        return 0L;
    }

    // Defines how to compute the reduction given a value in the list.
    public Long reduce(Integer value, Long sum) {
        return sum + value;
    }
}
\end{verbatim}
\end{samepage}

To compile the walker:

\begin{verbatim}
setenv CLASSPATH $STING_HOME/dist/GenomeAnalysisTK.jar:$STING_HOME/dist/sam-1.0.jar
javac HelloWalker.java
\end{verbatim}

To run the walker:

\begin{verbatim}
mkdir -p $STING_HOME/dist/walkers
java -Xmx4096m -jar dist/GenomeAnalysisTK.jar \
    -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta \
    -I /broad/1KG/legacy_data/trio/na12878.bam -T Hello \
    -L chr1:10000000-10000100 -l WARN
\end{verbatim}

This command runs the walker across a subsection of chromosome 1, operating on the reads that align to that subsection. To see more information from the GATK about what it is doing, change the logging level (-l) to INFO.

\end{document}