Merge pull request #1615 from broadinstitute/gvda_archive_docs
Archive GATK3-specific docs from the forum
Commit: a906e24010

@ -0,0 +1,63 @@
## (howto) Map and mark duplicates

http://gatkforums.broadinstitute.org/gatk/discussion/2799/howto-map-and-mark-duplicates

<blockquote>
<h4>See <a href="http://gatkforums.broadinstitute.org/gatk/discussion/6747">Tutorial#6747</a> for a comparison of <em>MarkDuplicates</em> and <em>MarkDuplicatesWithMateCigar</em>, downloadable example data to follow along, and additional commentary.</h4>
</blockquote>
<hr />
<h4>Objective</h4>
<p>Map the read data to the reference and mark duplicates.</p>
<h4>Prerequisites</h4>
<ul>
<li>This tutorial assumes adapter sequences have been removed.</li>
</ul>
<h4>Steps</h4>
<ol>
<li>Identify read group information</li>
<li>Generate a SAM file containing aligned reads</li>
<li>Convert to BAM, sort and mark duplicates</li>
</ol>
<hr />
<h3>1. Identify read group information</h3>
<p>Read group information is essential for downstream GATK functionality; the GATK will not run without a read group tag. Enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, see the SAM specification.</p>
<h4>Action</h4>
<p>Compose the read group identifier in the following format:</p>
<pre><code class="pre_md">@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1</code></pre>
<p>where <code>\t</code> stands for the tab character.</p>
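Since the read group string is usually assembled in a pipeline script, a small sketch may help. The helper name is illustrative, not part of the tutorial; the field values are the tutorial's placeholders:

```python
# Illustrative helper (not from the tutorial): build an @RG string for bwa's -R.
# bwa expects the literal two-character sequence '\t' between fields and
# expands it to real tabs in the SAM header.
def make_read_group(rg_id, sample, platform, library, unit):
    fields = [("ID", rg_id), ("SM", sample), ("PL", platform),
              ("LB", library), ("PU", unit)]
    return "@RG" + "".join("\\t%s:%s" % (k, v) for k, v in fields)

rg = make_read_group("group1", "sample1", "illumina", "lib1", "unit1")
# rg now holds: @RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1
```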
<hr />
<h3>2. Generate a SAM file containing aligned reads</h3>
<h4>Action</h4>
<p>Run the following BWA command, replacing <code>&lt;read group info&gt;</code> with the read group identifier you composed in the previous step:</p>
<pre><code class="pre_md">bwa mem -M -R '&lt;read group info&gt;' -p reference.fa raw_reads.fq > aligned_reads.sam</code></pre>
<p><em>The <code>-M</code> flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).</em></p>
<h4>Expected Result</h4>
<p>This creates a file called <code>aligned_reads.sam</code> containing the aligned reads from all input files, combined, annotated and aligned to the same reference.</p>
<p>Note that this command is specific to paired-end data in an interleaved FASTQ file (read pairs in the same file, with each forward read followed directly by its reverse mate), which is what the tutorial file provides. To map other kinds of datasets (e.g. single-end reads, or paired-end reads in separate forward/reverse files) you will need to adapt the command accordingly. See the BWA documentation for exact usage and more options.</p>
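As a sketch of how the invocation changes per input layout (file names are hypothetical placeholders, and the helper is illustrative, not part of BWA):

```python
# Illustrative sketch: assemble bwa mem argument lists for common input layouts.
def bwa_mem_args(reference, fastqs, read_group, interleaved=False):
    args = ["bwa", "mem", "-M", "-R", read_group]
    if interleaved:
        args.append("-p")  # -p: the single FASTQ holds interleaved read pairs
    return args + [reference] + list(fastqs)

rg = "@RG\\tID:group1\\tSM:sample1\\tPL:illumina\\tLB:lib1\\tPU:unit1"
# Interleaved paired-end, as in this tutorial:
cmd_interleaved = bwa_mem_args("reference.fa", ["raw_reads.fq"], rg, interleaved=True)
# Paired-end reads split across forward/reverse files (no -p; two FASTQ arguments):
cmd_paired = bwa_mem_args("reference.fa", ["reads_1.fq", "reads_2.fq"], rg)
```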
<hr />
<h3>3. Convert to BAM, sort and mark duplicates</h3>
<p>These initial pre-processing operations format the data to suit the requirements of the GATK tools.</p>
<h4>Action</h4>
<p>Run the following Picard command to sort the SAM file and convert it to BAM:</p>
<pre><code class="pre_md">java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate</code></pre>
<h4>Expected Result</h4>
<p>This creates a file called <code>sorted_reads.bam</code> containing the aligned reads sorted by coordinate.</p>
<h4>Action</h4>
<p>Run the following Picard command to mark duplicates:</p>
<pre><code class="pre_md">java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt</code></pre>
<h4>Expected Result</h4>
<p>This creates a sorted BAM file called <code>dedup_reads.bam</code> with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called <code>metrics.txt</code> containing (can you guess?) metrics.</p>
<h4>Action</h4>
<p>Run the following Picard command to index the BAM file:</p>
<pre><code class="pre_md">java -jar picard.jar BuildBamIndex \
    INPUT=dedup_reads.bam</code></pre>
<h4>Expected Result</h4>
<p>This creates an index file for the BAM file called <code>dedup_reads.bai</code>.</p>

@ -0,0 +1,44 @@

## (howto) Perform local realignment around indels

http://gatkforums.broadinstitute.org/gatk/discussion/2800/howto-perform-local-realignment-around-indels

<h3>NOTE: This tutorial has been replaced by a more recent and much improved version that you can find <a href="https://www.broadinstitute.org/gatk/guide/article?id=7156">here</a>.</h3>
<h4>Objective</h4>
<p>Perform local realignment around indels to correct mapping-related artifacts.</p>
<h4>Prerequisites</h4>
<ul>
<li>TBD</li>
</ul>
<h4>Steps</h4>
<ol>
<li>Create a target list of intervals to be realigned</li>
<li>Perform realignment of the target intervals</li>
</ol>
<hr />
<h3>1. Create a target list of intervals to be realigned</h3>
<h4>Action</h4>
<p>Run the following GATK command:</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R reference.fa \
    -I dedup_reads.bam \
    -L 20 \
    -known gold_indels.vcf \
    -o realignment_targets.list</code></pre>
<h4>Expected Result</h4>
<p>This creates a file called <code>realignment_targets.list</code> containing the list of intervals that the program identified as needing realignment within our target, chromosome 20.</p>
<p>The list of known indel sites (<code>gold_indels.vcf</code>) is used as a set of realignment targets. Only use it if such a list exists for your organism.</p>
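The entries in the output list use GATK's contig:start-stop interval notation. A small parser sketch, assuming that convention (the helper name is illustrative):

```python
# Illustrative parser for GATK-style interval strings such as "20:1000-2000".
# A bare contig name ("20") means the whole contig; "20:150" means one position.
def parse_interval(text):
    text = text.strip()
    if ":" not in text:
        return (text, None, None)
    contig, span = text.split(":", 1)
    start, _, stop = span.partition("-")
    return (contig, int(start), int(stop) if stop else int(start))

intervals = [parse_interval(line) for line in ["20:9999900-9999950", "20"]]
```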
<hr />
<h3>2. Perform realignment of the target intervals</h3>
<h4>Action</h4>
<p>Run the following GATK command:</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R reference.fa \
    -I dedup_reads.bam \
    -targetIntervals realignment_targets.list \
    -known gold_indels.vcf \
    -o realigned_reads.bam</code></pre>
<h4>Expected Result</h4>
<p>This creates a file called <code>realigned_reads.bam</code> containing all the original reads, but with better local alignments in the regions that were realigned.</p>
<p>Note that here we didn't include the <code>-L 20</code> argument. It's not necessary, since the program will only run on the target intervals we provide.</p>

@ -0,0 +1,45 @@

## (howto) Prepare a reference for use with BWA and GATK

http://gatkforums.broadinstitute.org/gatk/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk

<h3>NOTE: This tutorial has been replaced by a more recent version that uses GRCh38 that you can find <a href="https://www.broadinstitute.org/gatk/guide/article?id=8017">here</a>.</h3>
<hr />
<h4>Objective</h4>
<p>Prepare a reference sequence so that it is suitable for use with BWA and GATK.</p>
<h4>Prerequisites</h4>
<ul>
<li>Installed BWA</li>
<li>Installed SAMtools</li>
<li>Installed Picard</li>
</ul>
<h4>Steps</h4>
<ol>
<li>Generate the BWA index</li>
<li>Generate the FASTA file index</li>
<li>Generate the sequence dictionary</li>
</ol>
<hr />
<h3>1. Generate the BWA index</h3>
<h4>Action</h4>
<p>Run the following BWA command:</p>
<pre><code class="pre_md">bwa index -a bwtsw reference.fa</code></pre>
<p>where <code>-a bwtsw</code> specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.</p>
<h4>Expected Result</h4>
<p>This creates a collection of files used by BWA to perform the alignment.</p>
<hr />
<h3>2. Generate the FASTA file index</h3>
<h4>Action</h4>
<p>Run the following SAMtools command:</p>
<pre><code class="pre_md">samtools faidx reference.fa</code></pre>
<h4>Expected Result</h4>
<p>This creates a file called <code>reference.fa.fai</code>, with one record per contig in the FASTA reference file. Each record is composed of the contig name, size, location, basesPerLine and bytesPerLine.</p>
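Because the index is plain tab-separated text, it is easy to inspect programmatically. A sketch (the sample lines, including the byte offsets, are made up for illustration):

```python
# Illustrative reader for a .fai index. Each tab-separated record holds:
# name, length, byte offset, basesPerLine, bytesPerLine.
def contig_sizes(fai_text):
    sizes = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            sizes[name] = int(length)
    return sizes

sample = "20\t63025520\t5\t60\t61\nMT\t16569\t64075829\t60\t61"
sizes = contig_sizes(sample)
```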
<hr />
<h3>3. Generate the sequence dictionary</h3>
<h4>Action</h4>
<p>Run the following Picard command:</p>
<pre><code class="pre_md">java -jar picard.jar CreateSequenceDictionary \
    REFERENCE=reference.fa \
    OUTPUT=reference.dict</code></pre>
<p>Note that this is the new syntax for use with the latest version of Picard. Older versions used a slightly different syntax because all the tools were in separate jars, so you'd call e.g. <code>java -jar CreateSequenceDictionary.jar</code> directly.</p>
<h4>Expected Result</h4>
<p>This creates a file called <code>reference.dict</code> formatted like a SAM header, describing the contents of your reference FASTA file.</p>
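After the three steps, the reference should be accompanied by the BWA index files, the <code>.fai</code> index, and the <code>.dict</code> dictionary. A sketch that checks for all of them before launching a pipeline (the BWA suffix list reflects what <code>bwa index</code> produces; the helper name is illustrative):

```python
import os

# Companion files expected next to reference.fa after the steps above.
BWA_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]   # from `bwa index`

def missing_companions(fasta_path):
    base, _ = os.path.splitext(fasta_path)               # reference.fa -> reference
    wanted = [fasta_path + s for s in BWA_SUFFIXES]
    wanted.append(fasta_path + ".fai")                   # from `samtools faidx`
    wanted.append(base + ".dict")                        # from CreateSequenceDictionary
    return [p for p in wanted if not os.path.exists(p)]
```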

@ -0,0 +1,177 @@

## Adding Genomic Annotations Using SnpEff and VariantAnnotator

http://gatkforums.broadinstitute.org/gatk/discussion/50/adding-genomic-annotations-using-snpeff-and-variantannotator

<h3>This article is out of date and no longer applicable. At this time, we do not provide support for performing functional annotation. Programs that we are aware of and that our collaborators use successfully include Oncotator and Variant Effect Predictor (VEP).</h3>
<hr />
<p><em>Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.</em></p>
<h3>Introduction</h3>
<p>Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our inability to annotate indels led us to employ a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that <a href="http://snpeff.sourceforge.net/">SnpEff</a> best meets our needs as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually-updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.</p>
<h3>SnpEff Setup and Usage</h3>
<p>Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from <a href="http://sourceforge.net/projects/snpeff/files/">this page</a> (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from <a href="http://snpeff.sourceforge.net/download.html">here</a>.</p>
<p>After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:</p>
<pre><code class="pre_md">database_repository = http://sourceforge.net/projects/snpeff/files/databases/</code></pre>
<p>Then, download one or more databases using SnpEff's built-in download command:</p>
<pre><code class="pre_md">java -jar snpEff.jar download GRCh37.64</code></pre>
<p>You can find a list of available databases <a href="http://snpeff.sourceforge.net/download.html">here</a>. The human genome databases have <strong>GRCh</strong> or <strong>hg</strong> in their names. You can also download the databases directly from the SnpEff website, if you prefer.</p>
<p>The download command by default puts the databases into a subdirectory called <strong>data</strong> within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the <code>data_dir</code> entry in the file <code>snpEff.config</code> to point to the correct directory.</p>
<p>Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the <a href="http://snpeff.sourceforge.net/">SnpEff home page</a>.</p>
<h3>Supported SnpEff Versions</h3>
<p>If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version <strong>2.0.5</strong>. <em>Note: newer versions are currently unsupported by the GATK, as we haven't yet had the resources to test them.</em></p>
<h3>Current Recommended Best Practices When Running SnpEff</h3>
<p>These best practices are based on our analysis of various snpEff/database versions as described in detail in the <strong>Analyses of SnpEff Annotations Across Versions</strong> section below.</p>
<ul>
<li>
<p>We recommend using only the <strong>GRCh37.64</strong> database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however, as we have not yet tested the fixed version of the database, we continue to recommend using only GRCh37.64 for now.</p>
</li>
<li>
<p>We recommend always running with <code>-onlyCoding true</code> with human databases (e.g., the GRCh37.* databases). Setting <code>-onlyCoding false</code> causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The <code>-onlyCoding false</code> option should <em>only</em> be used with databases that lack protein coding information.</p>
</li>
<li>Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.</li>
</ul>
<h3>Analyses of SnpEff Annotations Across Versions</h3>
<p>See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions <a href="http://www.broadinstitute.org/gatk/media/docs/SnpEff_snps_comparison_of_available_versions.pdf">here</a>.</p>
<ul>
<li>
<p>Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.</p>
</li>
<li>snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.</li>
</ul>
<p>See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 <a href="http://www.broadinstitute.org/gatk/media/docs/SnpEff_snps_ensembl_64_vs_65.pdf">here</a>.</p>
<ul>
<li>
<p>The GRCh37.64 database gives good results on the condition that you run snpEff with the <code>-onlyCoding true</code> option. The <code>-onlyCoding false</code> option causes snpEff to mark <em>all</em> transcripts as coding, and so produces many false-positive Missense annotations.</p>
</li>
<li>The GRCh37.65 database gives results that are as poor as those you get with the <code>-onlyCoding false</code> option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.</li>
</ul>
<p>See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions <a href="http://www.broadinstitute.org/gatk/media/docs/SnpEff_indels.pdf">here</a>.</p>
<ul>
<li>snpEff's indel annotations are highly concordant with those of a high-quality set of genomic annotations from the 1000 Genomes project. This is true across all snpEff/database versions tested.</li>
</ul>
<h3>Example SnpEff Usage with a VCF Input File</h3>
<p>Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the <code>data_dir</code> as defined in <code>snpEff.config</code>.</p>
<pre><code class="pre_md">java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf</code></pre>
<p>In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:</p>
<pre><code class="pre_md">EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.</code></pre>
<p>And here is the precise layout with all the subfields:</p>
<pre><code class="pre_md">EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...</code></pre>
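A sketch of parsing this layout, with the sample annotation values borrowed from the example records later in this article. Note the naive comma split assumes no commas inside subfields:

```python
# Illustrative parser for the EFF layout shown above.
SUBFIELDS = ["Effect_Impact", "Effect_Functional_Class", "Codon_Change",
             "Amino_Acid_Change", "Gene_Name", "Gene_BioType", "Coding",
             "Transcript_ID", "Exon_ID"]

def parse_eff(eff_value):
    effects = []
    for entry in eff_value.split(","):        # assumes no commas inside subfields
        name, _, rest = entry.partition("(")
        values = rest.rstrip(")").split("|")
        effects.append((name, dict(zip(SUBFIELDS, values))))
    return effects

eff = ("SYNONYMOUS_CODING(LOW|SILENT|ggC/ggT|G215|SAMD11|"
       "protein_coding|CODING|ENST00000342066|exon_1_874655_874840)")
effects = parse_eff(eff)
```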
<p>It's also possible to get SnpEff to output in a (non-VCF) text format with one effect per line. See the <a href="http://snpeff.sourceforge.net/">SnpEff home page</a> for full details.</p>
<h3>Adding SnpEff Annotations using VariantAnnotator</h3>
<p>Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.</p>
<p>There are two different options for doing this:</p>
<h4>Option 1: Annotate with only the highest-impact effect for each variant</h4>
<p><em>NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.</em></p>
<p>The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.</p>
<p>When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:</p>
<ul>
<li>
<p>Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.</p>
</li>
<li>
<p>Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.</p>
</li>
<li>Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.</li>
</ul>
<p>Example usage, where <code>1000G.exomes.vcf</code> is the file to annotate and <code>snpEff_output.vcf</code> is the SnpEff VCF output generated by running SnpEff on that file:</p>
<pre><code class="pre_md">java -jar dist/GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -A SnpEff \
    --variant 1000G.exomes.vcf \
    --snpEffFile snpEff_output.vcf \
    -L 1000G.exomes.vcf \
    -o out.vcf</code></pre>
<p>VariantAnnotator adds some or all of the following INFO field annotations to each variant record:</p>
<ul>
<li><code>SNPEFF_EFFECT</code> - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)</li>
<li><code>SNPEFF_IMPACT</code> - Impact of the highest-impact effect resulting from the current variant (<code>HIGH</code>, <code>MODERATE</code>, <code>LOW</code>, or <code>MODIFIER</code>)</li>
<li><code>SNPEFF_FUNCTIONAL_CLASS</code> - Functional class of the highest-impact effect resulting from the current variant (<code>NONE</code>, <code>SILENT</code>, <code>MISSENSE</code>, or <code>NONSENSE</code>)</li>
<li><code>SNPEFF_CODON_CHANGE</code> - Old/New codon for the highest-impact effect resulting from the current variant</li>
<li><code>SNPEFF_AMINO_ACID_CHANGE</code> - Old/New amino acid for the highest-impact effect resulting from the current variant</li>
<li><code>SNPEFF_GENE_NAME</code> - Gene name for the highest-impact effect resulting from the current variant</li>
<li><code>SNPEFF_GENE_BIOTYPE</code> - Gene biotype for the highest-impact effect resulting from the current variant</li>
<li><code>SNPEFF_TRANSCRIPT_ID</code> - Transcript ID for the highest-impact effect resulting from the current variant</li>
<li><code>SNPEFF_EXON_ID</code> - Exon ID for the highest-impact effect resulting from the current variant</li>
</ul>
<p>Example VCF records annotated using SnpEff and VariantAnnotator:</p>
<pre><code class="pre_md">1 874779 . C T 279.94 . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
MQ=54.49;MQ0=10;MQRankSum=0.982;QD=13.33;ReadPosRankSum=-0.060;SB=-120.09;SNPEFF_AMINO_ACID_CHANGE=G215;SNPEFF_CODON_CHANGE=ggC/ggT;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066

1 874816 . C CT 2527.52 . AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
MQ=53.37;MQ0=6;MQRankSum=-1.388;QD=5.92;ReadPosRankSum=-1.932;SB=-741.06;SNPEFF_EFFECT=FRAME_SHIFT;SNPEFF_EXON_ID=exon_1_874655_874840;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066</code></pre>
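These flat SNPEFF_* keys are straightforward to pull out of a record's INFO column. A sketch using an abbreviated INFO string from the first example above:

```python
# Illustrative extraction of SNPEFF_* annotations from a VCF INFO string.
def snpeff_annotations(info):
    ann = {}
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key.startswith("SNPEFF_"):
            ann[key] = value
    return ann

info = ("MQ=54.49;SNPEFF_EFFECT=SYNONYMOUS_CODING;"
        "SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=LOW")
ann = snpeff_annotations(info)
```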
<h4>Option 2: Annotate with all effects for each variant</h4>
<p>VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.</p>
<p>Here's an example of how to do this, where <code>1000G.exomes.vcf</code> is again the file to annotate and <code>snpEff_output.vcf</code> is the SnpEff VCF output generated by running SnpEff on that file:</p>
<pre><code class="pre_md">java -jar dist/GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -E resource.EFF \
    --variant 1000G.exomes.vcf \
    --resource snpEff_output.vcf \
    -L 1000G.exomes.vcf \
    -o out.vcf</code></pre>
<p>Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.</p>
<h3>List of Genomic Effects</h3>
<p>Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on <a href="http://snpeff.sourceforge.net/faq.html">this page</a>.</p>
<h4>High-Impact Effects</h4>
<ul>
<li>SPLICE_SITE_ACCEPTOR</li>
<li>SPLICE_SITE_DONOR</li>
<li>START_LOST</li>
<li>EXON_DELETED</li>
<li>FRAME_SHIFT</li>
<li>STOP_GAINED</li>
<li>STOP_LOST</li>
</ul>
<h4>Moderate-Impact Effects</h4>
<ul>
<li>NON_SYNONYMOUS_CODING</li>
<li>CODON_CHANGE <i>(note: this effect is used by SnpEff only for MNPs, not SNPs)</i></li>
<li>CODON_INSERTION</li>
<li>CODON_CHANGE_PLUS_CODON_INSERTION</li>
<li>CODON_DELETION</li>
<li>CODON_CHANGE_PLUS_CODON_DELETION</li>
<li>UTR_5_DELETED</li>
<li>UTR_3_DELETED</li>
</ul>
<h4>Low-Impact Effects</h4>
<ul>
<li>SYNONYMOUS_START</li>
<li>NON_SYNONYMOUS_START</li>
<li>START_GAINED</li>
<li>SYNONYMOUS_CODING</li>
<li>SYNONYMOUS_STOP</li>
<li>NON_SYNONYMOUS_STOP</li>
</ul>
<h4>Modifiers</h4>
<ul>
<li>NONE</li>
<li>CHROMOSOME</li>
<li>CUSTOM</li>
<li>CDS</li>
<li>GENE</li>
<li>TRANSCRIPT</li>
<li>EXON</li>
<li>INTRON_CONSERVED</li>
<li>UTR_5_PRIME</li>
<li>UTR_3_PRIME</li>
<li>DOWNSTREAM</li>
<li>INTRAGENIC</li>
<li>INTERGENIC</li>
<li>INTERGENIC_CONSERVED</li>
<li>UPSTREAM</li>
<li>REGULATION</li>
<li>INTRON</li>
</ul>
<h3>Functional Classes</h3>
<p>SnpEff assigns a functional class to certain effects, in addition to an impact:</p>
<ul>
<li><code>NONSENSE</code>: assigned to point mutations that result in the creation of a new stop codon</li>
<li><code>MISSENSE</code>: assigned to point mutations that result in an amino acid change, but not a new stop codon</li>
<li><code>SILENT</code>: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon</li>
<li><code>NONE</code>: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)</li>
</ul>
<p>The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.</p>
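The two-level ordering described above (impact category first, functional class as tie-breaker) can be sketched as follows; the effect tuples are illustrative, not VariantAnnotator's actual data structures:

```python
# Illustrative ranking: lower rank = more significant.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}
CLASS_RANK = {"NONSENSE": 0, "MISSENSE": 1, "SILENT": 2, "NONE": 3}

def most_significant(effects):
    # effects: iterable of (impact, functional_class, effect_name) tuples
    return min(effects, key=lambda e: (IMPACT_RANK[e[0]], CLASS_RANK[e[1]]))

best = most_significant([
    ("MODERATE", "NONE", "CODON_CHANGE"),
    ("MODERATE", "MISSENSE", "NON_SYNONYMOUS_CODING"),
    ("LOW", "SILENT", "SYNONYMOUS_CODING"),
])
# The MODERATE tie is broken in favor of the MISSENSE effect.
```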

@ -0,0 +1,336 @@

## BWA/C Bindings - RETIRED

http://gatkforums.broadinstitute.org/gatk/discussion/60/bwa-c-bindings-retired

<h3>Please note that this article has not been updated in a very long time and may no longer be applicable. Use at your own risk.</h3>
<hr />
<h3>Sting BWA/C Bindings</h3>
<p><b><span style="color:red">WARNING: This tool was experimental and unsupported and never made it beyond a beta version. Use at your own risk.</span></b>
</p><p>The GSA group has made bindings available for Heng Li's <a rel="nofollow" class="external text" href="http://bio-bwa.sourceforge.net/">Burrows-Wheeler Aligner (BWA)</a>. Our aligner bindings present additional functionality to the user not traditionally available with BWA. BWA standalone is optimized to do fast, low-memory alignments from <a rel="nofollow" class="external text" href="http://maq.sourceforge.net/fastq.shtml">Fastq</a> to <a rel="nofollow" class="external text" href="http://samtools.sourceforge.net/SAM1.pdf">BAM</a>. While our bindings aim to provide support for reasonably fast, reasonably low memory alignment, we add the capacity to do exploratory data analyses. The bindings can provide all alignments for a given read, allowing a user to walk over the alignments and see information not typically provided in the BAM format. Users of the bindings can 'go deep', selectively relaxing alignment parameters one read at a time, looking for the best alignments at a site.
</p><p>The BWA/C bindings should be thought of as alpha release quality. However, we aim to be particularly responsive to issues in the bindings as they arise. Because of the bindings' alpha state, some functionality is limited; see the Limitations section below for more details on what features are currently supported.
</p>
<table id="toc" class="toc"><tr><td><div id="toctitle"><h2>Contents</h2></div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#A_note_about_using_the_bindings"><span class="tocnumber">1</span> <span class="toctext">A note about using the bindings</span></a>
<ul>
<li class="toclevel-2 tocsection-2"><a href="#bash"><span class="tocnumber">1.1</span> <span class="toctext">bash</span></a></li>
<li class="toclevel-2 tocsection-3"><a href="#csh"><span class="tocnumber">1.2</span> <span class="toctext">csh</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-4"><a href="#Preparing_to_use_the_aligner"><span class="tocnumber">2</span> <span class="toctext">Preparing to use the aligner</span></a>
<ul>
<li class="toclevel-2 tocsection-5"><a href="#Within_the_Broad_Institute"><span class="tocnumber">2.1</span> <span class="toctext">Within the Broad Institute</span></a></li>
<li class="toclevel-2 tocsection-6"><a href="#Outside_of_the_Broad_Institute"><span class="tocnumber">2.2</span> <span class="toctext">Outside of the Broad Institute</span></a></li>
</ul>
</li>
<li class="toclevel-1 tocsection-7"><a href="#Using_the_existing_GATK_alignment_walkers"><span class="tocnumber">3</span> <span class="toctext">Using the existing GATK alignment walkers</span></a></li>
<li class="toclevel-1 tocsection-8"><a href="#Writing_new_GATK_walkers_utilizing_alignment_bindings"><span class="tocnumber">4</span> <span class="toctext">Writing new GATK walkers utilizing alignment bindings</span></a></li>
<li class="toclevel-1 tocsection-9"><a href="#Running_the_aligner_outside_of_the_GATK"><span class="tocnumber">5</span> <span class="toctext">Running the aligner outside of the GATK</span></a></li>
<li class="toclevel-1 tocsection-10"><a href="#Limitations"><span class="tocnumber">6</span> <span class="toctext">Limitations</span></a></li>
<li class="toclevel-1 tocsection-11"><a href="#Example:_analysis_of_alignments_with_the_BWA_bindings"><span class="tocnumber">7</span> <span class="toctext">Example: analysis of alignments with the BWA bindings</span></a></li>
<li class="toclevel-1 tocsection-12"><a href="#Validation_methods"><span class="tocnumber">8</span> <span class="toctext">Validation methods</span></a></li>
<li class="toclevel-1 tocsection-13"><a href="#Unsupported:_using_the_BWA.2FC_bindings_from_within_Matlab"><span class="tocnumber">9</span> <span class="toctext">Unsupported: using the BWA/C bindings from within Matlab</span></a></li>
</ul>
</td></tr></table>
<h2><span class="mw-headline" id="A_note_about_using_the_bindings"> A note about using the bindings </span></h2>
<p>Whenever native code is called from Java, the user must help Java find the proper shared library. Java looks for shared libraries in two places: on the system-wide library search path, and in a Java system property specified on the command line. To add libbwa.so to the global library search path, add the following to your .my.bashrc, .my.cshrc, or other shell startup file:
</p>
<h5><span class="mw-headline" id="bash"> bash </span></h5>
<pre>
export LD_LIBRARY_PATH=/humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH
</pre>
<h5><span class="mw-headline" id="csh"> csh </span></h5>
<pre>
setenv LD_LIBRARY_PATH /humgen/gsa-scr1/GATK_Data/bwa/stable:$LD_LIBRARY_PATH
</pre>
<p>To specify the location of libbwa.so directly on the command line, use the java.library.path system property as follows:
</p>
<pre>
java -Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T AlignmentValidation \
-I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta
</pre>
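Whether the JVM can actually see libbwa.so can be checked from plain Java before involving the GATK at all. This is a generic, self-contained sketch (no Sting/GATK classes involved); "bwa" is simply the base name that System.loadLibrary would map to libbwa.so on Linux:

```java
import java.io.File;

// Sketch: verify that a native library is visible to the JVM.
// "bwa" is the base name System.loadLibrary uses to locate libbwa.so.
public class NativeLibCheck {

    /** True if any directory on the given search path contains the mapped library file. */
    public static boolean isOnPath(String searchPath, String libName) {
        String fileName = System.mapLibraryName(libName); // "bwa" -> "libbwa.so" on Linux
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (new File(dir, fileName).isFile()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path", "");
        System.out.println("java.library.path = " + path);
        if (!isOnPath(path, "bwa")) {
            System.out.println("libbwa.so not found; System.loadLibrary(\"bwa\") would throw UnsatisfiedLinkError");
        }
    }
}
```

If the check fails, fix LD_LIBRARY_PATH or the -Djava.library.path value before launching the GATK.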
<h2><span class="mw-headline" id="Preparing_to_use_the_aligner"> Preparing to use the aligner </span></h2>
<h3><span class="mw-headline" id="Within_the_Broad_Institute"> Within the Broad Institute </span></h3>
<p>We provide internally accessible versions of both the BWA shared library and precomputed BWA indices for two commonly used human references at the Broad (Homo_sapiens_assembly18.fasta and human_b36_both.fasta). These files live in the following directory:
</p>
<pre>
/humgen/gsa-scr1/GATK_Data/bwa/stable
</pre>
<h3><span class="mw-headline" id="Outside_of_the_Broad_Institute"> Outside of the Broad Institute </span></h3>
<p>Two steps are required to prepare to use the aligner: building the shared library and using BWA/C to generate an index of the reference sequence.
</p><p>The Java bindings to the aligner are available through the <a rel="nofollow" class="external text" href="https://github.com/broadgsa/gatk">Sting</a> repository. A precompiled version of the bindings is available for Linux in c/bwa/libbwa.so.1. To build the aligner from source:
</p>
<ul><li> Fetch the latest BWA source via svn from <a rel="nofollow" class="external text" href="https://bio-bwa.svn.sourceforge.net/svnroot/bio-bwa">SourceForge</a>, then configure and build BWA:
</li></ul>
<pre>
sh autogen.sh
./configure
make
</pre>
<ul><li> Download the latest version of Sting from our <a rel="nofollow" class="external text" href="https://github.com/broadgsa/gatk">Github repository</a>.
</li><li> Customize the variables at the top of one of the build scripts (c/bwa/build_linux.sh, c/bwa/build_mac.sh) for your environment, then run that script.
</li></ul>
<p>To build an index of the reference sequence, use the BWA/C executable directly:
</p>
<pre>
bwa index -a bwtsw <your reference sequence>.fasta
</pre>
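The walkers described below assume the index support files sit alongside the reference FASTA, so it is worth confirming the index is complete before running them. A minimal sketch; note that the suffix list here reflects BWA's usual `bwa index` outputs and is an assumption, not something this document specifies:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: check that the BWA index support files sit alongside the reference
// FASTA. The suffix list is an assumption based on BWA's usual output.
public class IndexCheck {
    static final String[] SUFFIXES = {".amb", ".ann", ".bwt", ".pac", ".sa"};

    /** Returns the index-file suffixes that are missing next to the given reference. */
    public static List<String> missingSupportFiles(String referenceFasta) {
        List<String> missing = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            if (!new File(referenceFasta + suffix).isFile()) missing.add(suffix);
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(missingSupportFiles("human_b36_both.fasta"));
    }
}
```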
<h2><span class="mw-headline" id="Using_the_existing_GATK_alignment_walkers"> Using the existing GATK alignment walkers </span></h2>
<p>Two walkers are provided for end users of the GATK. The first of the stock walkers is Align, which can align an unmapped BAM file or realign a mapped BAM file.
</p>
<pre>
java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T Align \
-I NA12878_Pilot1_20.unmapped.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta \
-U \
-ob human.unsorted.bam
</pre>
<p>Most of the available parameters here are standard GATK. -T specifies that the alignment analysis should be used; -I specifies the unmapped BAM file to align, and -R specifies the reference to which to align. By default, this walker assumes that the BWA index support files live alongside the reference. If these files are stored elsewhere, the optional -BWT argument can be used to specify their location. By default, alignments are emitted to the console in SAM format. Alignments can be spooled to disk in SAM format using the -o option, or in BAM format using the -ob option.
</p><p>The other stock walker is AlignmentValidation, which computes all possible alignments based on the BWA default configuration settings and makes sure at least one of the top alignments matches the alignment stored in the read.
</p>
<pre>
java \
-Djava.library.path=/humgen/gsa-scr1/GATK_Data/bwa/stable \
-jar dist/GenomeAnalysisTK.jar \
-T AlignmentValidation \
-I /humgen/gsa-hphome1/hanna/reference/1kg/NA12878_Pilot1_20.bwa.bam \
-R /humgen/gsa-scr1/GATK_Data/bwa/human_b36_both.fasta
</pre>
<p>Options for the AlignmentValidation walker are identical to those of the Align walker, except that the AlignmentValidation walker's only output is an exception if validation fails.
</p><p>Another sample walker of limited scope, CountBestAlignmentsWalker, is available for review; it is discussed in the example section below.
</p>
<h2><span class="mw-headline" id="Writing_new_GATK_walkers_utilizing_alignment_bindings"> Writing new GATK walkers utilizing alignment bindings </span></h2>
<p>A BWA/C aligner can be created on the fly using the org.broadinstitute.sting.alignment.bwa.c.BWACAligner constructor. The bindings have two sets of interfaces: one that returns all possible alignments, and one that randomly selects an alignment from a list of the top-scoring alignments as chosen by BWA.
</p><p>To iterate through all alignments, use the following method:
</p>
<pre>
/**
 * Get an iterator of alignments, batched by mapping quality.
 * @param bases List of bases.
 * @return Iterator to alignments.
 */
public Iterable<Alignment[]> getAllAlignments(final byte[] bases);
</pre>
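As a self-contained illustration of the "batched by mapping quality" contract, the following sketch groups stand-in alignments by score, best batch first. Hit is a hypothetical stand-in type for illustration, not the Sting Alignment class:

```java
import java.util.*;

// Sketch: group alignments by score, best (lowest) score first, mimicking the
// batched Iterable<Alignment[]> contract. "Hit" is a hypothetical stand-in.
public class ScoreBatching {
    public static final class Hit {
        public final int position, score; // lower score = better alignment
        public Hit(int position, int score) { this.position = position; this.score = score; }
    }

    /** Batch hits by score; each array holds all hits of one score, best batch first. */
    public static List<Hit[]> batchByScore(List<Hit> hits) {
        SortedMap<Integer, List<Hit>> byScore = new TreeMap<>(); // ascending score
        for (Hit h : hits) byScore.computeIfAbsent(h.score, k -> new ArrayList<>()).add(h);
        List<Hit[]> batches = new ArrayList<>();
        for (List<Hit> batch : byScore.values()) batches.add(batch.toArray(new Hit[0]));
        return batches;
    }
}
```

With two exact matches (score 0) and one mismatch hit (score 2), the first batch holds both exact matches, just as the first call to next() on the real iterator supplies all exact matches.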
<p>The call will return an Iterable which batches alignments by score. Each call to next() on the provided iterator will return all Alignments of a given score, ordered best to worst. For example, given a read sequence with at least one match on the genome, the first call to next() will supply all exact matches, and subsequent calls to next() will give alignments judged to be inferior by BWA (alignments containing mismatches, gap opens, or gap extensions).
</p><p>Alignments can be transformed to reads using the following static method in org.broadinstitute.sting.alignment.Alignment:
</p>
<pre>
/**
 * Creates a read directly from an alignment.
 * @param alignment The alignment to convert to a read.
 * @param unmappedRead Source of the unmapped read. Should have bases, quality scores, and flags.
 * @param newSAMHeader The new SAM header to use in creating this read. Can be null, but if so, the sequence
 *                     dictionary in the
 * @return A mapped alignment.
 */
public static SAMRecord convertToRead(Alignment alignment, SAMRecord unmappedRead, SAMFileHeader newSAMHeader);
</pre>
<p>A convenience method is available which allows the user to get SAMRecords directly from the aligner.
</p>
<pre>
/**
 * Get an iterator of aligned reads, batched by mapping quality.
 * @param read Read to align.
 * @param newHeader Optional new header to use when aligning the read. If present, it must be null.
 * @return Iterator to alignments.
 */
public Iterable<SAMRecord[]> alignAll(final SAMRecord read, final SAMFileHeader newHeader);
</pre>
<p>To return a single read randomly selected by the bindings, use one of the following methods:
</p>
<pre>
/**
 * Allow the aligner to choose one alignment randomly from the pile of best alignments.
 * @param bases Bases to align.
 * @return An alignment.
 */
public Alignment getBestAlignment(final byte[] bases);

/**
 * Align the read to the reference.
 * @param read Read to align.
 * @param header Optional header to drop in place.
 * @return A list of the alignments.
 */
public SAMRecord align(final SAMRecord read, final SAMFileHeader header);
</pre>
<p>The org.broadinstitute.sting.alignment.bwa.BWAConfiguration argument allows the user to specify parameters normally passed to 'bwa aln'. Available parameters are:
</p>
<ul><li> Maximum edit distance (-n)
</li><li> Maximum gap opens (-o)
</li><li> Maximum gap extensions (-e)
</li><li> Disallow an indel within INT bp towards the ends (-i)
</li><li> Mismatch penalty (-M)
</li><li> Gap open penalty (-O)
</li><li> Gap extension penalty (-E)
</li></ul>
<p>Settings must be supplied to the constructor; leaving any BWAConfiguration field unset means that BWA should use its default value for that argument. Configuration settings can be updated at any time using the BWACAligner updateConfiguration method.
</p>
<pre>
public void updateConfiguration(BWAConfiguration configuration);
</pre>
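The leave-unset-for-default convention described above can be sketched with nullable boxed fields. The field names here are hypothetical illustrations of the pattern, not the real BWAConfiguration API:

```java
// Sketch of the "leave a field null to get BWA's default" convention.
// Field names are hypothetical illustrations, not the real BWAConfiguration.
public class AlnConfig {
    public Integer maximumEditDistance;  // -n; null means use BWA's default
    public Integer maximumGapOpens;      // -o
    public Integer mismatchPenalty;      // -M

    /** Render only the explicitly-set options as command-line flags. */
    public String toCommandLine() {
        StringBuilder sb = new StringBuilder();
        if (maximumEditDistance != null) sb.append(" -n ").append(maximumEditDistance);
        if (maximumGapOpens != null) sb.append(" -o ").append(maximumGapOpens);
        if (mismatchPenalty != null) sb.append(" -M ").append(mismatchPenalty);
        return sb.toString().trim();
    }
}
```

Boxed Integer (rather than int) is what makes "unset" representable at all: a primitive field would silently default to 0, which is a valid option value.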
<h2><span class="mw-headline" id="Running_the_aligner_outside_of_the_GATK"> Running the aligner outside of the GATK </span></h2>
<p>The BWA/C bindings were written with running outside of the GATK in mind, but this workflow has never been tested. If you would like to run the bindings outside of the GATK, you will need:
</p>
<ul><li> The BWA shared object, libbwa.so.1
</li><li> The packaged version of Aligner.jar
</li></ul>
<p>To build the packaged version of the aligner, run the following commands:
</p>
<pre>
cp $STING_HOME/lib/bcel-*.jar ~/.ant/lib
ant package -Dexecutable=Aligner
</pre>
<p>These commands will extract all classes required to run the aligner and place them in $STING_HOME/dist/packages/Aligner/Aligner.jar. You can then specify this one jar in your project's dependencies.
</p>
<h2><span class="mw-headline" id="Limitations"> Limitations </span></h2>
<p>The BWA/C bindings are currently in an alpha state, but they are extensively supported. Because of the bindings' alpha state, some functionality is limited. The limitations of these bindings include:
</p>
<ul><li> Only single-end alignment is supported. However, a paired-end module could be implemented as a simple extension that finds the jointly optimal placement of both singly aligned ends.
</li><li> Color space alignments are not currently supported.
</li><li> Only a limited number of parameters from BWA's extensive parameter list are supported. The current list of supported parameters is specified in the 'Writing new GATK walkers utilizing alignment bindings' section above.
</li><li> The system is not as heavily memory-optimized as the standalone BWA/C implementation. The JVM, by default, uses slightly over 4 GB of resident memory when running BWA on human data. We have not done extensive testing on the behavior of the BWA/C bindings under memory pressure.
</li><li> There is a slight performance penalty when using the BWA/C bindings. On 6.9M reads of human data, standalone BWA/C takes roughly 45 min to run 'bwa aln', 5 min to run 'bwa samse', and another 1.5 min to convert the resulting SAM file to a BAM. Aligning the same dataset using the Java bindings takes approximately 55 minutes.
</li><li> The GATK requires that its input BAMs be sorted and indexed. Before using the Align or AlignmentValidation walker, you must sort and index your unmapped BAM file. Note that this is a limitation of the GATK, not of the aligner itself. Using the alignment support files outside of the GATK eliminates this requirement.
</li></ul>
<h2><span class="mw-headline" id="Example:_analysis_of_alignments_with_the_BWA_bindings"> Example: analysis of alignments with the BWA bindings </span></h2>
<p>In order to validate that the Java bindings compute the same alignments as standalone BWA/C, we modified the BWA source to gather the number of equally scoring alignments per read and the frequency of each count. We then implemented the same tally as a walker written in the GATK. We computed this distribution over a set of 36bp human reads and found the distributions to be identical.
</p><p>The relevant parts of the walker follow.
</p>
<pre>
public class CountBestAlignmentsWalker extends ReadWalker<Integer,Integer> {
    /**
     * The supporting BWT index generated using bwa index.
     */
    @Argument(fullName="BWTPrefix",shortName="BWT",doc="Index files generated by bwa index -a bwtsw",required=false)
    String prefix = null;

    /**
     * The actual aligner.
     */
    private Aligner aligner = null;

    private SortedMap<Integer,Integer> alignmentFrequencies = new TreeMap<Integer,Integer>();

    /**
     * Create an aligner object. The aligner object will load and hold the BWT until close() is called.
     */
    @Override
    public void initialize() {
        BWTFiles bwtFiles = new BWTFiles(prefix);
        BWAConfiguration configuration = new BWAConfiguration();
        aligner = new BWACAligner(bwtFiles,configuration);
    }

    /**
     * Aligns a read to the given reference.
     * @param ref Reference over the read. Read will most likely be unmapped, so ref will be null.
     * @param read Read to align.
     * @return Number of alignments found for this read.
     */
    @Override
    public Integer map(char[] ref, SAMRecord read) {
        Iterator<Alignment[]> alignmentIterator = aligner.getAllAlignments(read.getReadBases()).iterator();
        if(alignmentIterator.hasNext()) {
            int numAlignments = alignmentIterator.next().length;
            if(alignmentFrequencies.containsKey(numAlignments))
                alignmentFrequencies.put(numAlignments,alignmentFrequencies.get(numAlignments)+1);
            else
                alignmentFrequencies.put(numAlignments,1);
        }
        return 1;
    }

    /**
     * Initial value for reduce. In this case, processed reads will be counted.
     * @return 0, indicating no reads yet processed.
     */
    @Override
    public Integer reduceInit() { return 0; }

    /**
     * Calculates the number of reads processed.
     * @param value Number of reads processed by this map.
     * @param sum Number of reads processed before this map.
     * @return Number of reads processed up to and including this map.
     */
    @Override
    public Integer reduce(Integer value, Integer sum) {
        return value + sum;
    }

    /**
     * Cleanup.
     * @param result Number of reads processed.
     */
    @Override
    public void onTraversalDone(Integer result) {
        aligner.close();
        for(Map.Entry<Integer,Integer> alignmentFrequency: alignmentFrequencies.entrySet())
            out.printf("%d\t%d%n", alignmentFrequency.getKey(), alignmentFrequency.getValue());
        super.onTraversalDone(result);
    }
}
</pre>
<p>This walker can be run within the svn version of the GATK using -T CountBestAlignments.
</p><p>The resulting placement count frequency is shown in the graph below. The number of placements clearly follows an exponential distribution.
</p><p><a href="/gsa/wiki/index.php/File:Bwa_dist.png" class="image"><img alt="Bwa dist.png" src="/gsa/wiki/images/7/77/Bwa_dist.png" width="640" height="480" /></a>
</p>
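The frequency bookkeeping in map() above can be exercised on its own; this standalone sketch tallies the same placement-count histogram, using Map.merge in place of the containsKey/get/put sequence:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: tally how many reads had a given number of equally good placements,
// the same bookkeeping CountBestAlignmentsWalker.map() performs per read.
public class PlacementHistogram {
    public static SortedMap<Integer, Integer> tally(int[] placementsPerRead) {
        SortedMap<Integer, Integer> frequencies = new TreeMap<>();
        for (int n : placementsPerRead) frequencies.merge(n, 1, Integer::sum);
        return frequencies;
    }
}
```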
<h2><span class="mw-headline" id="Validation_methods"> Validation methods </span></h2>
<p>Two major techniques were used to validate the Java bindings against the current BWA implementation.
</p>
<ul><li> Fastq files from E. coli and from NA12878 chr20 were aligned using standalone BWA with BWA's default settings. The aligned SAM files were sorted, indexed, and fed into the alignment validation walker, which verified that one of the top-scoring matches from the BWA bindings matched the alignment produced by standalone BWA.
</li><li> Fastq files from E. coli and from NA12878 chr20 were aligned using the GATK Align walker, then fed back into the GATK's alignment validation walker.
</li><li> The distribution of the alignment frequency was compared between standalone BWA and the Java bindings and was found to be identical.
</li></ul>
<p>As an ongoing validation strategy, we will use the GATK integration test suite to align a small unmapped BAM file with human data. The contents of the unmapped BAM file will be aligned and written to disk. The md5 of the resulting file will be calculated and compared to a known-good md5.
</p>
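The md5-against-known-good comparison is straightforward in plain Java with MessageDigest. A sketch; the file-reading step and the actual known-good digest are left out, since the real test data is not part of this document:

```java
import java.security.MessageDigest;

// Sketch: compute an MD5 digest of some contents as a hex string, to compare
// against a known-good value, as the integration test strategy describes.
public class Md5Check {
    public static String md5Hex(byte[] contents) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(contents);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString(); // e.g. md5Hex(new byte[0]) is the well-known d41d8cd98f00b204e9800998ecf8427e
    }
}
```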
<h2><span class="mw-headline" id="Unsupported:_using_the_BWA.2FC_bindings_from_within_Matlab"> Unsupported: using the BWA/C bindings from within Matlab </span></h2>
<p>Some users are attempting to use the BWA/C bindings from within Matlab. To run the GATK within Matlab, you'll need to add libbwa.so to your library path through the librarypath.txt file. The librarypath.txt file normally lives in $matlabroot/toolbox/local. Within the Broad Institute, the $matlabroot/toolbox/local/librarypath.txt file is shared; therefore, you'll have to create a librarypath.txt file in the working directory from which you execute Matlab.
</p>
<pre>
##
## FILE: librarypath.txt
##
## Entries:
## o path_to_jnifile
## o [alpha,glnx86,sol2,unix,win32,mac]=path_to_jnifile
## o $matlabroot/path_to_jnifile
## o $jre_home/path_to_jnifile
##
$matlabroot/bin/$arch
/humgen/gsa-scr1/GATK_Data/bwa/stable
</pre>
<p>Once you've edited the library path, you can verify that Matlab has picked up your modified file by running the following command:
</p>
<pre>
>> java.lang.System.getProperty('java.library.path')

ans =
/broad/tools/apps/matlab2009b/bin/glnxa64:/humgen/gsa-scr1/GATK_Data/bwa/stable
</pre>
<p>Once the location of libbwa.so has been added to the library path, you can use the BWACAligner just as you would any other Java class in Matlab:
</p>
<pre>
>> javaclasspath({'/humgen/gsa-scr1/hanna/src/Sting/dist/packages/Aligner/Aligner.jar'})
>> import org.broadinstitute.sting.alignment.bwa.BWTFiles
>> import org.broadinstitute.sting.alignment.bwa.BWAConfiguration
>> import org.broadinstitute.sting.alignment.bwa.c.BWACAligner
>> x = BWACAligner(BWTFiles('/humgen/gsa-scr1/GATK_Data/bwa/Homo_sapiens_assembly18.fasta'),BWAConfiguration())
>> y=x.getAllAlignments(uint8('CCAATAACCAAGGCTGTTAGGTATTTTATCAGCAATGTGGGATAAGCAC'));
</pre>
<p>We don't have the resources to directly support using the BWA/C bindings from within Matlab, but if you report problems to us, we will try to address them.
</p>
@ -0,0 +1,158 @@
## Data Processing Pipeline - RETIRED

http://gatkforums.broadinstitute.org/gatk/discussion/41/data-processing-pipeline-retired

<h3>Please note that the DataProcessingPipeline qscript is no longer available. We are looking into the possibility of producing some new Qscripts that will be more appropriate for sharing with the public.</h3>
<p><em>The DPP script was only provided as an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have made the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what the processing steps are and what parameters you need to set/adjust.</em></p>
<h2>Data Processing Pipeline</h2>
<p>The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to <em>analysis ready</em> BAMs for the GATK. </p>
<h3>Introduction</h3>
<p>Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described <a href="http://www.broadinstitute.org/gatk/guide/topic?name=best-practices">here</a>. This pipeline performs the following steps: indel cleaning, duplicate marking, and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of <em>analysis ready</em> BAM files (one per sample sequenced).</p>
<h3>Requirements</h3>
<p>This pipeline is a <a href="http://www.broadinstitute.org/gatk/guide/article?id=1306">Queue</a> script that uses tools from the GATK, <a href="http://picard.sourceforge.net/">Picard</a> and <a href="http://bio-bwa.sourceforge.net/">BWA</a> (optional) software suites, which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.</p>
<p><strong>Warning:</strong> This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible, however, with some effort, to modify it for your needs.</p>
<h3>Command-line arguments</h3>
<h4>Required Parameters</h4>
<table border="1" cellpadding="2" width="100%">
<tr>
<th scope="col" width="15%"> Argument (short-name) </th>
<th scope="col" width="25%"> Argument (long-name) </th>
<th scope="col"> Description </th></tr>
<tr>
<td> -i <BAM file / BAM list> </td>
<td> --input <BAM file / BAM list> </td>
<td> Input BAM file, or list of BAM files. </td></tr>
<tr>
<td> -R <fasta> </td>
<td> --reference <fasta> </td>
<td> Reference fasta file. </td></tr>
<tr>
<td> -D <vcf> </td>
<td> --dbsnp <dbsnp vcf> </td>
<td> dbSNP ROD to use (must be in VCF format). </td></tr></table>
<h4>Optional Parameters</h4>
<table border="1" cellpadding="2" width="100%">
<tr>
<th scope="col" width="15%"> Argument (short-name) </th>
<th scope="col" width="25%"> Argument (long-name) </th>
<th scope="col"> Description </th></tr>
<tr>
<td> -indels <vcf> </td>
<td> --extra_indels <vcf> </td>
<td> VCF files to use as reference indels for Indel Realignment. </td></tr>
<tr>
<td> -bwa <path> </td>
<td> --path_to_bwa <path> </td>
<td> The path to the bwa binary (BAM files have usually already been mapped, but use this option if you want to remap them). </td></tr>
<tr>
<td> -outputDir <path> </td>
<td> --output_directory <path> </td>
<td> Output path for the processed BAM files. </td></tr>
<tr>
<td> -L <GATK interval string> </td>
<td> --gatk_interval_string <GATK interval string> </td>
<td> The -L interval string to be used by the GATK; output BAMs at this interval only. </td></tr>
<tr>
<td> -intervals <GATK interval file> </td>
<td> --gatk_interval_file <GATK interval file> </td>
<td> An <i>intervals</i> file to be used by the GATK; output BAMs at these intervals only. </td></tr></table>
<h4>Modes of Operation (also optional parameters)</h4>
<table border="1" cellpadding="2" width="100%">
<tr>
<th scope="col" width="15%"> Argument (short-name) </th>
<th scope="col" width="25%"> Argument (long-name) </th>
<th scope="col"> Description </th></tr>
<tr>
<td> -p <name> </td>
<td> --project <name> </td>
<td> The project name determines the final output (BAM file) base name. For example, NA12878 yields NA12878.processed.bam. </td></tr>
<tr>
<td> -knowns </td>
<td> --knowns_only </td>
<td> Perform cleaning on knowns only. </td></tr>
<tr>
<td> -sw </td>
<td> --use_smith_waterman </td>
<td> Perform cleaning using Smith-Waterman. </td></tr>
<tr>
<td> -bwase </td>
<td> --use_bwa_single_ended </td>
<td> Decompose the input BAM file and fully realign it using BWA, assuming single-ended reads. </td></tr>
<tr>
<td> -bwape </td>
<td> --use_bwa_pair_ended </td>
<td> Decompose the input BAM file and fully realign it using BWA, assuming paired-end reads. </td></tr></table>
<h2>The Pipeline</h2>
<p>Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis-ready reads (BAM file):</p>
<p><img src="https://us.v-cdn.net/5019796/uploads/FileUpload/55/0a67f9e1b7962a14c422e993f34643.jpeg" alt="the data processing pipeline" /></p>
<p>Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:</p>
<h3>BWA alignment</h3>
<p>This option is for datasets that have already been processed using a different pipeline or different criteria, and that you want to reprocess using this pipeline. One example is a BAM file that has been processed at the lane level, or that skipped some of the best-practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample-level BAMs and entering the pipeline.</p>
<h3>Sample Level Processing</h3>
<p>This is where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration. </p>
<h4>Indel Realignment</h4>
<p>This is a two-step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional Smith-Waterman realignment. The Indel Realigner also fixes mate-pair information for reads that get realigned.</p>
<h4>Base Quality Score Recalibration</h4>
<p>This is a crucial step that re-adjusts the quality scores using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, and Dinucleotide Covariate.</p>
<h3>The Outputs</h3>
<p>The Data Processing Pipeline produces three types of output for each file: a fully processed BAM file, a validation report on the input and output BAM files, and an analysis of the base quality scores before and after recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output. </p>
<h4>Processed Bam File</h4>
<p>The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the BAMs in the dataset. This file is named <project name>.cohort.list, and each sample BAM file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.</p>
<h4>Validation Files</h4>
<p>We validate each unprocessed sample-level BAM file and each final processed sample-level BAM file. The validation is performed using <a href="http://picard.sourceforge.net/">Picard</a>'s ValidateSamFile. Because the parameters of this validation are very strict, we don't require the input BAM to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation files are named <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.</p>
<p>Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on the state of your BAM file. Some errors are not critical, and the output files (both pre.validation and post.validation) should give you some input on how to make your dataset better organized in the BAM format.</p>
<h4>Base Quality Score Recalibration Analysis</h4>
<p>PDF plots of the base qualities are generated before and after recalibration for further analysis of the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail <a href="http://www.broadinstitute.org/gatk/guide/article?id=44">here</a>. The plots are created in directories named <project name>.<sample name>.pre and <project name>.<sample name>.post.</p>
<h3>Examples</h3>
<ol>
<li>
<p>Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without BWA):</p>
<pre><code class="pre_md">java \
    -Xmx4g \
    -Djava.io.tmpdir=/path/to/tmpdir \
    -jar path/to/GATK/Queue.jar \
    -S path/to/DataProcessingPipeline.scala \
    -p myFancyProjectName \
    -i myDataSet.list \
    -R reference.fasta \
    -D dbSNP.vcf \
    -run</code class="pre_md"></pre>
</li>
<li>
<p>Performing realignment and the full data processing pipeline on one paired-end BAM file:</p>
<pre><code class="pre_md">java \
    -Xmx4g \
    -Djava.io.tmpdir=/path/to/tmpdir \
    -jar path/to/Queue.jar \
    -S path/to/DataProcessingPipeline.scala \
    -bwa path/to/bwa \
    -i test.bam \
    -R reference.fasta \
    -D dbSNP.vcf \
    -p myProjectWithRealignment \
    -bwape \
    -run</code class="pre_md"></pre>
</li>
</ol>
@ -0,0 +1,16 @@
## Errors about BAM or VCF files not being ordered properly
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/58/errors-about-bam-or-vcf-files-not-being-ordered-properly
|
||||
|
||||
<h3>This article has been deprecated</h3>
|
||||
<h4>For a more recent version please see <a href="https://www.broadinstitute.org/gatk/guide/article?id=1328">https://www.broadinstitute.org/gatk/guide/article?id=1328</a></h4>
|
||||
<hr />
|
||||
<p>This error occurs when for example, a collaborator gives you a BAM that's derived from what was originally the same reference as you are using, but for whatever reason the contigs are not sorted in the same order .The GATK can be particular about the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1204">ordering of a BAM file</a> so it will fail with an error in this case. </p>
|
||||
<p>So what do you do? You use a Picard tool called ReorderSam to, well, reorder your BAM file. </p>
|
||||
<p>Here's an example usage where we reorder a BAM file that was sorted lexicographically so that the output will be another BAM, but this time sorted karyotypically: </p>
|
||||
<pre><code class="pre_md">java -jar picard.jar ReorderSam \
    I=lexicographic.bam \
    O=karyotypic.bam \
    REFERENCE=Homo_sapiens_assembly18.karyotypic.fasta</code class="pre_md"></pre>
|
||||
<p>This tool requires that you have a correctly sorted version of the reference sequence you used to align your reads. Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!</p>
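The contig-matching behavior described above can be illustrated with a small sketch (pure Python with hypothetical contig lists; the real tool operates on BAM headers and the reference sequence dictionary):

```python
def reorder_plan(bam_contigs, ref_contigs):
    """Map each BAM contig to its position in the new reference's contig order.

    Contigs absent from the new reference are dropped, mirroring ReorderSam's
    behavior of discarding reads that have no equivalent contig.
    """
    ref_order = {name: i for i, name in enumerate(ref_contigs)}
    kept = [c for c in bam_contigs if c in ref_order]
    dropped = [c for c in bam_contigs if c not in ref_order]
    # reads are written out following the new reference's contig order
    return sorted(kept, key=lambda c: ref_order[c]), dropped

lexicographic = ['chr1', 'chr10', 'chr2', 'chrUn_gl000211']
karyotypic = ['chr1', 'chr2', 'chr10']
print(reorder_plan(lexicographic, karyotypic))
# (['chr1', 'chr2', 'chr10'], ['chrUn_gl000211'])
```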
|
||||
<p>This tool is part of the <a href="https://broadinstitute.github.io/picard/command-line-overview.html#ReorderSam">Picard package</a>.</p>
|
||||
|
|
@ -0,0 +1,76 @@
|
|||
## Genotype and Validate
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/61/genotype-and-validate
|
||||
|
||||
<h3>Please note that this article has not been updated in a very long time and may no longer be applicable. Use at your own risk.</h3>
|
||||
<hr />
|
||||
<h3>Introduction</h3>
|
||||
<p>Genotype and Validate is a tool to assess the quality of a technology dataset for calling SNPs and indels, given a secondary (validation) data source. </p>
|
||||
<p>The simplest scenario is when you have a VCF of hand-annotated SNPs and indels, and you want to know how well a particular technology performs calling them. With a dataset (BAM file) generated by the technology under test and the hand-annotated VCF, you can run GenotypeAndValidate to assess the accuracy of the calls made with the new technology's dataset.</p>
|
||||
<p>Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.</p>
|
||||
<h3>Command-line arguments</h3>
|
||||
<p>Usage of GenotypeAndValidate and its command line arguments are described <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_validation_GenotypeAndValidate.html">here</a>.</p>
|
||||
<h3>The VCF Annotations</h3>
|
||||
<p>The annotations can be either true positive (T) or false positive (F). 'T' means the site is known to be a true SNP/indel, while 'F' means it is known not to be a SNP/indel even though the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.</p>
|
||||
<h3>The Outputs</h3>
|
||||
<p>GenotypeAndValidate has two outputs. The <em>truth table</em> and the <em>optional VCF file</em>. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this: </p>
|
||||
<table border="1" cellpadding="2" align="center">
|
||||
<tr>
|
||||
<th scope="col">
|
||||
</th>
|
||||
<th scope="col"> ALT
|
||||
</th>
|
||||
<th scope="col"> REF
|
||||
</th>
|
||||
<th scope="col"> Predictive Value
|
||||
</th></tr>
|
||||
<tr>
|
||||
<td> <b>called alt</b> </td>
|
||||
<td> True Positive (TP) </td>
|
||||
<td> False Positive (FP) </td>
|
||||
<td> Positive PV
|
||||
</td></tr>
|
||||
<tr>
|
||||
<td> <b>called ref</b> </td>
|
||||
<td> False Negative (FN) </td>
|
||||
<td> True Negative (TN) </td>
|
||||
<td> Negative PV
|
||||
</td></tr></table>
|
||||
<p>The <strong>positive predictive value</strong> (PPV) is the proportion of subjects with <em>positive</em> test results who are correctly diagnosed.</p>
|
||||
<p>The <strong>negative predictive value</strong> (NPV) is the proportion of subjects with a <em>negative</em> test result who are correctly diagnosed.</p>
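For reference, the two predictive values follow directly from the counts in the truth table (a quick sketch with hypothetical counts, not part of the GenotypeAndValidate output):

```python
def predictive_values(tp, fp, fn, tn):
    """Compute PPV and NPV from a 2x2 truth table."""
    ppv = tp / (tp + fp)  # proportion of alt calls that are true positives
    npv = tn / (tn + fn)  # proportion of ref calls that are true negatives
    return ppv, npv

# Hypothetical counts for illustration
ppv, npv = predictive_values(tp=90, fp=10, fn=5, tn=95)
print(ppv, npv)  # 0.9 0.95
```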
|
||||
<p>The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).</p>
|
||||
<h3>Additional Details</h3>
|
||||
<ul>
|
||||
<li>
|
||||
<p>You should always use -BTI alleles, so that the GATK only looks at the sites in the VCF file; this speeds up the process considerably. (This will soon be added as a default GATK engine mode.)</p>
|
||||
</li>
|
||||
<li>The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).</li>
|
||||
</ul>
|
||||
<h3>Examples</h3>
|
||||
<p>Genotypes BAM file from new technology using the VCF as a truth dataset:</p>
|
||||
<pre><code class="pre_md">java \
|
||||
-jar /GenomeAnalysisTK.jar \
|
||||
-T GenotypeAndValidate \
|
||||
-R human_g1k_v37.fasta \
|
||||
-I myNewTechReads.bam \
|
||||
-alleles handAnnotatedVCF.vcf \
|
||||
-BTI alleles \
|
||||
-o gav.vcf</code class="pre_md"></pre>
|
||||
<p>An annotated VCF example (info field clipped for clarity)</p>
|
||||
<pre><code class="pre_md">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
|
||||
1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1
|
||||
1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./.
|
||||
13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./.
|
||||
1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.</code class="pre_md"></pre>
|
||||
<p>Using a BAM file as the truth dataset:</p>
|
||||
<pre><code class="pre_md">java \
|
||||
-jar /GenomeAnalysisTK.jar \
|
||||
-T GenotypeAndValidate \
|
||||
-R human_g1k_v37.fasta \
|
||||
-I myTruthDataset.bam \
|
||||
-alleles callsToValidate.vcf \
|
||||
-BTI alleles \
|
||||
-bt \
|
||||
-o gav.vcf</code class="pre_md"></pre>
|
||||
<p>Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:</p>
|
||||
<p><img src="http://www.broadinstitute.org/gatk/media/pics/PbGenotypeAndValidate.jpg" alt="PacBio PbGenotypeAndValidate results" /></p>
|
||||
|
|
@ -0,0 +1,26 @@
|
|||
## How to get and install Firepony
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6020/how-to-get-and-install-firepony
|
||||
|
||||
<p>Binary packages for various versions of Linux are available at <a href="http://packages.shadau.com/">http://packages.shadau.com/</a></p>
|
||||
<p>Below are installation instructions for Debian, Ubuntu, CentOS and Fedora. For other Linux distributions, the Firepony source code is available at <a href="https://github.com/broadinstitute/firepony">https://github.com/broadinstitute/firepony</a> along with compilation instructions.</p>
|
||||
<hr />
|
||||
<h3>On Debian or Ubuntu systems</h3>
|
||||
<p>The following commands can be used to install Firepony:</p>
|
||||
<pre><code class="pre_md">sudo apt-get install software-properties-common
|
||||
sudo add-apt-repository http://packages.shadau.com/
|
||||
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-key 285514D704F4CDB7
|
||||
sudo apt-get update
|
||||
sudo apt-get install firepony</code class="pre_md"></pre>
|
||||
<p>Once this initial install is done, updates will be automatically installed as part of the standard Ubuntu/Debian update procedure.</p>
|
||||
<hr />
|
||||
<h3>On CentOS 7 and Fedora 21 systems</h3>
|
||||
<p>On CentOS 7, the following commands can be used to install Firepony:</p>
|
||||
<pre><code class="pre_md">sudo curl -o /etc/yum.repos.d/packages.shadau.com.repo \
|
||||
http://packages.shadau.com/rpm/centos-7/packages.shadau.com.repo
|
||||
sudo yum install firepony</code class="pre_md"></pre>
|
||||
<p>For Fedora 21, use the following sequence of commands:</p>
|
||||
<pre><code class="pre_md">sudo curl -o /etc/yum.repos.d/packages.shadau.com.repo \
|
||||
http://packages.shadau.com/rpm/fedora-21/packages.shadau.com.repo
|
||||
sudo yum install firepony</code class="pre_md"></pre>
|
||||
<p>Any subsequent updates will automatically be installed when running <code>yum update</code>.</p>
|
||||
|
|
@ -0,0 +1,46 @@
|
|||
## How to use Firepony
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6021/how-to-use-firepony
|
||||
|
||||
<p>Firepony can be run with the following command line arguments:</p>
|
||||
<pre><code class="pre_md">firepony -r &lt;reference FASTA file&gt; -s &lt;SNP database file&gt; -o &lt;output table file&gt; &lt;input alignment file&gt;</code class="pre_md"></pre>
|
||||
<p>where:</p>
|
||||
<ul>
|
||||
<li><code>-r</code> specifies the path to the reference file (in uncompressed FASTA format, equivalent to GATK option <code>-R</code>)</li>
|
||||
<li><code>-s</code> specifies the path to the SNP database file (in BCF or VCF format, equivalent to GATK option <code>-knownSites</code>). </li>
|
||||
</ul>
|
||||
<p>Firepony will load an index for the reference file if it exists, which enables on-demand loading of reference sequences as the SNP database is loaded.</p>
|
||||
<p>For example, the following GATK command line:</p>
|
||||
<pre><code class="pre_md">java -Xmx8g GenomeAnalysisTK-3.4.jar \
|
||||
-T BaseRecalibrator \
|
||||
-I NA12878D_HiSeqX_R1.deduplicated.bam \
|
||||
-R /store/ref/hs37d5.fa \
|
||||
-knownSites /store/dbsnp/dbsnp_138.b37.vcf \
|
||||
-o recal_data.table</code class="pre_md"></pre>
|
||||
<p>would be replaced by the following Firepony command line:</p>
|
||||
<pre><code class="pre_md">firepony \
|
||||
-r /store/ref/hs37d5.fa -s /store/dbsnp/dbsnp_138.b37.vcf \
|
||||
-o recal_data.table NA12878D_HiSeqX_R1.deduplicated.bam</code class="pre_md"></pre>
|
||||
<p>Additional command-line options are described in the help output, which you can display by running: </p>
<pre><code class="pre_md">firepony --help</code class="pre_md"></pre>
|
||||
<p>Note that it is recommended to use the BCF format rather than VCF for SNP databases when running Firepony. Both generate the same results, but loading BCF files is much more efficient.</p>
|
||||
<p>At the moment, Firepony only supports recalibrating Illumina reads with the default GATK BQSR parameters, listed below in BQSR table format. Expanding the parameter set as well as the number of supported instruments will be done based on user feedback.</p>
|
||||
<pre><code class="pre_md">#:GATKTable:Arguments:Recalibration argument collection values used in this run
|
||||
Argument Value
|
||||
binary_tag_name null
|
||||
covariate ReadGroupCovariate,QualityScoreCovariate,ContextCovariate,CycleCovariate
|
||||
default_platform null
|
||||
deletions_default_quality 45
|
||||
force_platform null
|
||||
indels_context_size 3
|
||||
insertions_default_quality 45
|
||||
low_quality_tail 2
|
||||
maximum_cycle_value 500
|
||||
mismatches_context_size 2
|
||||
mismatches_default_quality -1
|
||||
no_standard_covs false
|
||||
quantizing_levels 16
|
||||
recalibration_report null
|
||||
run_without_dbsnp false
|
||||
solid_nocall_strategy THROW_EXCEPTION
|
||||
solid_recal_mode SET_Q_ZERO</code class="pre_md"></pre>
|
||||
|
|
@ -0,0 +1,117 @@
|
|||
## Merging batched call sets - RETIRED
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/46/merging-batched-call-sets-retired
|
||||
|
||||
<h3>This procedure is deprecated since it is no longer necessary and goes against our Best Practices recommendations. For calling variants on multiple samples, use the <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices workflow</a> for performing variant discovery using HaplotypeCaller.</h3>
|
||||
<hr />
|
||||
<h3>Introduction</h3>
|
||||
<p>Three-stage procedure:</p>
|
||||
<ul>
|
||||
<li>
|
||||
<p>Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Take the master sites VCF and genotype each sample BAM file at these sites</p>
|
||||
</li>
|
||||
<li>(Optionally) Merge the single sample VCFs into a master VCF file</li>
|
||||
</ul>
|
||||
<h3>Creating the master set of sites: SNPs and Indels</h3>
|
||||
<p>The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:</p>
|
||||
<p>Batch 1:</p>
|
||||
<pre><code class="pre_md">##fileformat=VCFv4.0
|
||||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891
|
||||
20 9999996 . A ATC . PASS . GT:GQ 0/1:30
|
||||
20 10000000 . T G . PASS . GT:GQ 0/1:30
|
||||
20 10000117 . C T . FAIL . GT:GQ 0/1:30
|
||||
20 10000211 . C T . PASS . GT:GQ 0/1:30
|
||||
20 10001436 . A AGG . PASS . GT:GQ 1/1:30</code class="pre_md"></pre>
|
||||
<p>Batch 2:</p>
|
||||
<pre><code class="pre_md">##fileformat=VCFv4.0
|
||||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
|
||||
20 9999996 . A ATC . PASS . GT:GQ 0/1:30
|
||||
20 10000117 . C T . FAIL . GT:GQ 0/1:30
|
||||
20 10000211 . C T . FAIL . GT:GQ 0/1:30
|
||||
20 10000598 . T A . PASS . GT:GQ 1/1:30
|
||||
20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30</code class="pre_md"></pre>
|
||||
<p>In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below: </p>
|
||||
<p>Master VCF:</p>
|
||||
<pre><code class="pre_md">20 9999996 . A ATC . PASS . GT:GQ 0/1:30 [pass in both]
|
||||
20 10000000 . T G . PASS . GT:GQ 0/1:30 [only in batch 1]
|
||||
20 10000117 . C T . FAIL . GT:GQ 0/1:30 [fail in both]
|
||||
20 10000211 . C T . FAIL . GT:GQ 0/1:30 [pass in 1, fail in 2, choice is unclear]
|
||||
20 10000598 . T A . PASS . GT:GQ 1/1:30 [only in batch 2]
|
||||
20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30 [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]</code class="pre_md"></pre>
|
||||
<p>These issues fall into the following categories:</p>
|
||||
<ul>
|
||||
<li>For sites present in all VCFs (20:9999996 above) where the alleles agree and the site passes filters in every batch, the site can obviously be considered "PASS" in the master VCF</li>
|
||||
<li>Some sites may be PASS in one batch, but absent in others (20:10000000 and 20:10000598), which occurs when the site is polymorphic in one batch but all samples are reference or no-called in the other batch</li>
|
||||
<li>Similarly, sites that fail in all batches in which they occur can be safely filtered out, or included as failing filters in the master VCF (20:10000117)</li>
|
||||
</ul>
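The bookkeeping rules above can be sketched in a few lines (an illustration of the classification logic only; the actual merge should be done with CombineVariants, and the conflicting PASS/FAIL case remains a project decision):

```python
def classify_site(statuses):
    """Classify a site given its filter status per batch.

    statuses: dict mapping batch name -> 'PASS', 'FAIL', or None (site absent).
    Returns a label loosely mirroring CombineVariants' set= annotation.
    """
    present = {b: s for b, s in statuses.items() if s is not None}
    if all(s == 'PASS' for s in present.values()):
        if len(present) == len(statuses):
            return 'Intersection'         # PASS everywhere
        return '-'.join(sorted(present))  # PASS only in some batches
    if all(s == 'FAIL' for s in present.values()):
        return 'FilteredInAll'
    return 'conflicting'                  # PASS in some, FAIL in others

# Site 20:10000000 above: polymorphic only in batch 1
print(classify_site({'one': 'PASS', 'two': None}))  # one
```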
|
||||
<p>There are two difficult situations that must be resolved according to the needs of the project merging the batches:</p>
|
||||
<ul>
|
||||
<li>Some sites may be PASS in some batches but FAIL in others. This might indicate that either:
<ul>
<li>The site is truly polymorphic, but due to limited coverage, poor sequencing, or other issues it is flagged as unreliable in some batches. In these cases, it makes sense to include the site</li>
<li>The site is actually a common machine artifact that just happened to escape standard filtering in a few batches. In these cases, you would obviously like to filter out the site</li>
<li>Even more complicated, it is possible that in the PASS batches you have found a reliable allele (C/T, for example) while in others there's no alt allele but actually a low-frequency error, which is flagged as failing. Ideally, here you could filter out the failing allele from the FAIL batches, and keep the passing ones</li>
</ul>
</li>
<li>Some sites may have multiple segregating alleles in each batch. Such sites are often errors, but in some cases may be actual multi-allelic sites, in particular for indels.</li>
|
||||
</ul>
|
||||
<p>Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up to the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.</p>
|
||||
<p>The GATK tool <a href="http://www.broadinstitute.org/gatk/guide/article?id=53">CombineVariants</a> can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like <a href="http://www.broadinstitute.org/gatk/guide/article?id=54">SelectVariants</a> one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:</p>
|
||||
<pre><code class="pre_md">java -jar dist/GenomeAnalysisTK.jar \
|
||||
-T CombineVariants \
|
||||
-R human_g1k_v37.fasta \
|
||||
-V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \
|
||||
--sites_only \
|
||||
-minimalVCF \
|
||||
-o master.vcf</code class="pre_md"></pre>
|
||||
<p>producing the following VCF:</p>
|
||||
<pre><code class="pre_md">##fileformat=VCFv4.0
|
||||
#CHROM POS ID REF ALT QUAL FILTER INFO
|
||||
20 9999996 . A ACT . PASS set=Intersection
|
||||
20 10000000 . T G . PASS set=one
|
||||
20 10000117 . C T . FAIL set=FilteredInAll
|
||||
20 10000211 . C T . PASS set=filterIntwo-one
|
||||
20 10000598 . T A . PASS set=two
|
||||
20 10001436 . A AGG,AGGCT . PASS set=Intersection</code class="pre_md"></pre>
|
||||
<h3>Genotyping your samples at these sites</h3>
|
||||
<p>Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1237">UnifiedGenotyper</a> to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations. </p>
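Picking the most likely configuration from the genotype likelihoods amounts to taking the genotype with the best (lowest) Phred-scaled likelihood. A minimal sketch, assuming PL ordering as in the VCF spec (ref/ref, ref/alt, alt/alt):

```python
def call_genotype(pl, alleles=('C', 'T')):
    """Pick the most likely genotype from Phred-scaled likelihoods (lower = more likely).

    pl: likelihoods ordered as (ref/ref, ref/alt, alt/alt), as in the VCF PL field.
    """
    ref, alt = alleles
    genotypes = [(ref, ref), (ref, alt), (alt, alt)]
    best = min(range(3), key=lambda i: pl[i])
    return genotypes[best]

# PLs for site 10000211 in the example output below: 888,0,870 -> heterozygote
print(call_genotype((888, 0, 870)))  # ('C', 'T')
```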
|
||||
<p>As a concrete example command line, you can genotype the master.vcf file using the NA12878 sample from the resource bundle with the following command:</p>
|
||||
<pre><code class="pre_md">java -Xmx2g -jar dist/GenomeAnalysisTK.jar \
|
||||
-T UnifiedGenotyper \
|
||||
-R bundle/b37/human_g1k_v37.fasta \
|
||||
-I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \
|
||||
-alleles master.vcf \
|
||||
-L master.vcf \
|
||||
-gt_mode GENOTYPE_GIVEN_ALLELES \
|
||||
-out_mode EMIT_ALL_SITES \
|
||||
-stand_call_conf 0.0 \
|
||||
-glm BOTH \
|
||||
-G none</code class="pre_md"></pre>
|
||||
<p>The <code>-L master.vcf</code> argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode. </p>
|
||||
<p>The last item, <code>-G none</code>, prevents the UG from computing annotations you don't need. This command produces something like the following output:</p>
|
||||
<pre><code class="pre_md">##fileformat=VCFv4.0
|
||||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
|
||||
20 9999996 . A ACT 4576.19 . . GT:DP:GQ:PL 1/1:76:99:4576,229,0
|
||||
20 10000000 . T G 0 . . GT:DP:GQ:PL 0/0:79:99:0,238,3093
|
||||
20 10000211 . C T 857.79 . . GT:AD:DP:GQ:PL 0/1:28,27:55:99:888,0,870
|
||||
20 10000598 . T A 1800.57 . . GT:AD:DP:GQ:PL 1/1:0,48:48:99:1834,144,0
|
||||
20 10001436 . A AGG,AGGCT 1921.12 . . GT:DP:GQ:PL 0/2:49:84.06:1960,2065,0,2695,222,84</code class="pre_md"></pre>
|
||||
<p>Several things should be noted here:</p>
|
||||
<ul>
|
||||
<li>As the genotype likelihoods calculation evolves, especially for indels, the exact results of this command may change. </li>
|
||||
<li>The command will emit sites that are hom-ref in the sample at the site, but the -stand_call_conf 0.0 argument should be provided so that they aren't tagged as "LowQual" by the UnifiedGenotyper.</li>
|
||||
<li>The filtered site 10000117 in the master.vcf is not genotyped by the UG, as it doesn't pass filters and so is considered bad by the GATK UG. If you want to determine the genotypes for all sites, independent of filtering, you must unfilter all of your records in master.vcf, and if desired, restore the filter strings for these records later.</li>
|
||||
</ul>
|
||||
<p>This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:</p>
|
||||
<pre><code class="pre_md">foreach sample in samples:
|
||||
run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf
|
||||
end</code class="pre_md"></pre>
|
||||
<h3>(Optional) Merging the sample VCFs together</h3>
|
||||
<p>You can use a similar command for <a href="http://www.broadinstitute.org/gatk/guide/article?id=53">CombineVariants</a> above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:</p>
|
||||
<pre><code class="pre_md">java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf</code class="pre_md"></pre>
|
||||
<h3>General notes</h3>
|
||||
<ul>
|
||||
<li>Because the GATK uses dynamic downsampling of reads, it is possible for truly marginal calls to change likelihoods from discovery (processing the BAM incrementally) vs. genotyping (jumping into the BAM). Consequently, do not be surprised to see minor differences in the genotypes for samples from discovery and genotyping.</li>
|
||||
<li>More advanced users may want to consider grouping several samples together for genotyping. For example, 100 samples could be genotyped in 10 groups of 10 samples, resulting in only 10 VCF files. Merging the 10 VCF files may be faster (or just easier to manage) than 100 individual VCFs.</li>
|
||||
<li>Sometimes, using this method, a monomorphic site within a batch will be identified as polymorphic in one or more samples within that same batch. This is because the UnifiedGenotyper applies a frequency prior to determine whether a site is likely to be monomorphic. If the site is monomorphic, it is either not output, or if EMIT_ALL_SITES is thrown, reference genotypes are output. If the site is determined to be polymorphic, genotypes are assigned greedily (as of GATK-v1.4). Calling single-sample reduces the effect of the prior, so sites which were considered monomorphic within a batch could be considered polymorphic within a sub-batch.</li>
|
||||
</ul>
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
## Moved: (How to) Create a snippet of reads corresponding to a genomic interval
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6530/moved-how-to-create-a-snippet-of-reads-corresponding-to-a-genomic-interval
|
||||
|
||||
This discussion has been <a href="http://gatkforums.broadinstitute.org/discussion/6517/how-to-create-a-snippet-of-reads-corresponding-to-a-genomic-interval">moved</a>.
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
## Moved: (How to) Efficiently map and clean up short read sequence data
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6573/moved-how-to-efficiently-map-and-clean-up-short-read-sequence-data
|
||||
|
||||
This discussion has been <a href="http://gatkforums.broadinstitute.org/discussion/6483/how-to-efficiently-map-and-clean-up-short-read-sequence-data">moved</a>.
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
## Moved: (How to) Generate an unmapped BAM from FASTQ or aligned BAM
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6538/moved-how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam
|
||||
|
||||
This discussion has been <a href="http://gatkforums.broadinstitute.org/discussion/6484/how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam">moved</a>.
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
## Moved: (How to) Mark duplicates with MarkDuplicates or MarkDuplicatesWithMateCigar
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6873/moved-how-to-mark-duplicates-with-markduplicates-or-markduplicateswithmatecigar
|
||||
|
||||
This discussion has been <a href="http://gatkforums.broadinstitute.org/gatk/discussion/6747/how-to-mark-duplicates-with-markduplicates-or-markduplicateswithmatecigar">moved</a>.
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
## Moved: (How to) Visualize an alignment with IGV
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6606/moved-how-to-visualize-an-alignment-with-igv
|
||||
|
||||
This discussion has been <a href="http://gatkforums.broadinstitute.org/discussion/6491/how-to-visualize-an-alignment-with-igv">moved</a>.
|
||||
|
|
@ -0,0 +1,50 @@
|
|||
## Per-base alignment qualities (BAQ) in the GATK
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1326/per-base-alignment-qualities-baq-in-the-gatk
|
||||
|
||||
<h3>This article is out of date and no longer applicable. BAQs are no longer used in GATK.</h3>
|
||||
<hr />
|
||||
<h3>1. Introduction</h3>
|
||||
<p>The GATK provides an implementation of the Per-Base Alignment Qualities (BAQ) developed by Heng Li in late 2010. See <a href="http://samtools.sourceforge.net/mpileup.shtml">this SamTools page</a> for more details.</p>
|
||||
<hr />
|
||||
<h3>2. Using BAQ</h3>
|
||||
<p>The BAQ algorithm is applied by the GATK engine itself, which means that all GATK walkers can potentially benefit from it. By default, BAQ is OFF, meaning that the engine will not use BAQ quality scores at all. </p>
|
||||
<p>The GATK engine accepts the argument <code>-baq</code> with the following <code>enum</code> values: </p>
|
||||
<pre><code class="pre_md">public enum CalculationMode {
|
||||
OFF, // don't apply a BAQ at all, the default
|
||||
CALCULATE_AS_NECESSARY, // do HMM BAQ calculation on the fly, as necessary, if there's no tag
|
||||
RECALCULATE // do HMM BAQ calculation on the fly, regardless of whether there's a tag present
|
||||
}</code class="pre_md"></pre>
|
||||
<p>If you want to enable BAQ, the usual thing to do is <code>CALCULATE_AS_NECESSARY</code>, which will calculate BAQ values if they are not in the <code>BQ</code> read tag. If your reads are already tagged with <code>BQ</code> values, then the GATK will use those. <code>RECALCULATE</code> will always recalculate the <code>BAQ</code>, regardless of the tag, which is useful if you are experimenting with the gap open penalty (see below).</p>
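The effect of each mode can be summarized in a short sketch (an illustration of the rules just described, not the engine's actual code; <code>compute</code> is a stand-in for the HMM BAQ calculation):

```python
def baq_qualities(mode, read_quals, bq_tag=None, compute=lambda q: q):
    """Decide which base qualities the engine uses under each -baq mode.

    mode: 'OFF', 'CALCULATE_AS_NECESSARY', or 'RECALCULATE'
    bq_tag: pre-computed BAQ qualities from the read's BQ tag, if any
    compute: stand-in for the on-the-fly HMM BAQ calculation
    """
    if mode == 'OFF':
        return read_quals  # BAQ not applied at all (the default)
    if mode == 'CALCULATE_AS_NECESSARY':
        # use the BQ tag if present, otherwise calculate on the fly
        return bq_tag if bq_tag is not None else compute(read_quals)
    if mode == 'RECALCULATE':
        return compute(read_quals)  # always recompute, regardless of the tag
    raise ValueError(mode)
```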
|
||||
<p>If you are really an expert, the GATK allows you to specify the BAQ gap open penalty (<code>-baqGOP</code>) to use in the HMM. This value defaults to 40, a good value for whole genomes and exomes for highly sensitive calls. However, if you are analyzing exome data only, you may want to use 30, which seems to result in a more specific call set. We are still experimenting with these values. Some walkers, where BAQ would corrupt their analyses, forbid the use of BAQ and will throw an exception if <code>-baq</code> is provided.</p>
|
||||
<hr />
|
||||
<h3>3. Some example uses of the BAQ in the GATK</h3>
|
||||
<ul>
|
||||
<li>
|
||||
<p>For UnifiedGenotyper to get more specific SNP calls.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>For PrintReads to write out a BAM file with BAQ tagged reads</p>
|
||||
</li>
|
||||
<li>For TableRecalibrator or IndelRealigner to write out a BAM file with BAQ tagged reads. Make sure you use <code>-baq RECALCULATE</code> so the engine knows to recalculate the BAQ after these tools have updated the base quality scores or the read alignments. Note that both of these tools will not use the BAQ values on input, but will write out the tags for analysis tools that will use them.</li>
|
||||
</ul>
|
||||
<p>Note that some tools should not have BAQ applied to them.</p>
|
||||
<p>This last option will be particularly useful for people who are already doing base quality score recalibration. Suppose I have a pipeline that does:</p>
|
||||
<pre><code class="pre_md">RealignerTargetCreator
|
||||
IndelRealigner
|
||||
|
||||
BaseRecalibrator
|
||||
PrintReads (with --BQSR input)
|
||||
|
||||
UnifiedGenotyper</code class="pre_md"></pre>
|
||||
<p>A highly efficient BAQ extended pipeline would look like</p>
|
||||
<pre><code class="pre_md">RealignerTargetCreator
|
||||
IndelRealigner // don't bother with BAQ here, since we will calculate it in table recalibrator
|
||||
|
||||
BaseRecalibrator
|
||||
PrintReads (with --BQSR input) -baq RECALCULATE // now the reads will have a BAQ tag added. Slows the tool down some
|
||||
|
||||
UnifiedGenotyper -baq CALCULATE_AS_NECESSARY // UG will use the tags from TableRecalibrate, keeping UG fast</code class="pre_md"></pre>
|
||||
<hr />
|
||||
<h3>4. BAQ and walker control</h3>
|
||||
<p>Walkers can control how the BAQ calculation is applied via the <code>@BAQMode</code> annotation. The calculation can be applied as a tag, by overwriting the quality scores, or by returning only the BAQ-capped quality scores. Additionally, walkers can be set up to have BAQ applied to the incoming reads (<code>ON_INPUT</code>, the default), to output reads (<code>ON_OUTPUT</code>), or <code>HANDLED_BY_WALKER</code>, which means that calling into the BAQ system is the responsibility of the individual walker.</p>
|
||||
|
|
@ -0,0 +1,90 @@
|
|||
## Statistical methods used by GATK tools
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/4732/statistical-methods-used-by-gatk-tools
|
||||
|
||||
<p><strong>This document is out of date; see individual method documents in the <a href="https://software.broadinstitute.org/gatk/documentation/topic?name=methods">Methods and Algorithms</a> section.</strong></p>
|
||||
<h3>List of documented methods below</h3>
|
||||
<ul>
|
||||
<li>Inbreeding Coefficient</li>
|
||||
<li>Rank Sum Test</li>
|
||||
</ul>
|
||||
<hr />
|
||||
<h2>Inbreeding Coefficient</h2>
|
||||
<h3>Overview</h3>
|
||||
<p>Although the name Inbreeding Coefficient suggests it is a measure of inbreeding, Inbreeding Coefficient measures the excess heterozygosity at a variant site. It can be used as a proxy for poor mapping: sites that have high Inbreeding Coefficients are typically locations in the genome where the mapping is bad, and reads in the region mismatch it because they belong elsewhere. At least 10 samples are required (preferably many more) in order for this annotation to be calculated properly.</p>
|
||||
<h3>Theory</h3>
|
||||
<p>The <a href="https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle">Wikipedia article about Hardy-Weinberg principle</a> includes some very helpful information on the theoretical underpinnings of the test, as Inbreeding Coefficient relies on the math behind the Hardy-Weinberg Principle.</p>
|
||||
<h3>Use in GATK</h3>
|
||||
<p>We calculate Inbreeding Coefficient as 1 - (# observed heterozygotes)/(# expected heterozygotes). The number of observed heterozygotes is counted directly from the data. The number of expected heterozygotes is 2pq multiplied by the total number of genotypes, where p is the frequency of the reference allele and q is the frequency of the alternate allele (AF); please see the Hardy-Weinberg Principle link above. A value of 0 suggests the site is in Hardy-Weinberg equilibrium. Negative values of Inbreeding Coefficient mean there are more heterozygotes than expected and suggest a site with bad mapping. A nice side effect is that this metric captures a common error mode in variant calling, in which all calls come out heterozygous. This is why we recommend filtering out variants with strongly negative Inbreeding Coefficients. Although positive values suggest too few heterozygotes, we do not recommend filtering out positive values because they can arise from admixture of different ethnic populations. </p>
<h4>Please note: Inbreeding Coefficient is not robust to violations of the assumption that samples are unrelated. We have found that relatedness does break down the assumptions Inbreeding Coefficient is based on. For family samples, it depends on how many families and samples you have: with only 3 families, Inbreeding Coefficient is not going to work, but with 10,000 samples and just a few families it should be fine. Also, if you pass in a pedigree file (*.ped), it will be used to calculate Inbreeding Coefficient using only the founders (i.e. individuals whose parents aren't in the callset), and as long as there are >= 10 of those, the results should be pretty good.</h4>
<h3>Example: Inbreeding Coefficient</h3>
<p>In this example, let's say we are working with 100 human samples, and we are trying to calculate Inbreeding Coefficient at a site that has A for the reference allele and T for the alternate allele. </p>
<h4>Step 1: Count the number of samples that have each genotype (hom-ref, het, hom-var)</h4>
<p>A/A (hom-ref): 51
A/T (het): 11
T/T (hom-var): 38</p>
<h4>Step 2: Get all necessary information to solve the equation</h4>
<p>We need to find the # observed hets and # expected hets. </p>
<p>number of observed hets = 11 (from the number of observed A/T genotypes given above)</p>
<p>number of expected hets = 2pq * total genotypes (2pq is the frequency of heterozygotes according to Hardy-Weinberg equilibrium; we multiply that frequency by the total number of genotypes in the population to get the expected number of heterozygotes.)</p>
<p>p = frequency of ref allele = (# ref alleles)/(total # alleles) = (2 * 51 + 11)/(2 * 51 + 11 * 2 + 38 * 2) = 113/200 = 0.565
q = frequency of alt allele = (# alt alleles)/(total # alleles) = (2 * 38 + 11)/(2 * 51 + 11 * 2 + 38 * 2) = 87/200 = 0.435</p>
<h4>Remember that homozygous genotypes have two copies of the allele of interest (because we're assuming diploid organisms).</h4>
<p>number of expected hets = 2pq * 100 = 2 * 0.565 * 0.435 * 100 = 49.155</p>
<h4>Step 3: Plug in the Numbers</h4>
<p>Inbreeding Coefficient = 1 - (# observed hets)/(# expected hets) = 1 - (11/49.155) = 0.776</p>
<h4>Step 4: Interpret the output</h4>
<p>Our Inbreeding Coefficient is 0.776. Because it is positive, there are fewer heterozygotes than expected under the Hardy-Weinberg Principle. Too few heterozygotes can imply inbreeding. However, we do not recommend filtering this site out, because there may be a mixture of ethnicities in the cohort, and some ethnicities may be hom-ref while others are hom-var. </p>
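<p>The steps of this worked example can be condensed into a short script. This is an illustrative sketch of the calculation only; the function name and structure are ours, not GATK's implementation:</p>

```python
# Illustrative sketch of the Inbreeding Coefficient calculation above.
# Not GATK code: just the arithmetic from the worked example.
def inbreeding_coefficient(hom_ref, het, hom_var):
    n = hom_ref + het + hom_var               # total genotypes
    p = (2 * hom_ref + het) / (2.0 * n)       # ref allele frequency
    q = (2 * hom_var + het) / (2.0 * n)       # alt allele frequency
    expected_hets = 2 * p * q * n             # Hardy-Weinberg expectation
    return 1 - het / expected_hets

# Counts from Step 1: 51 hom-ref, 11 het, 38 hom-var
print(round(inbreeding_coefficient(51, 11, 38), 3))  # 0.776
```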
<h2>Rank Sum Test</h2>
<h3>Overview</h3>
<p>The Rank Sum Test, also known as the Mann-Whitney-Wilcoxon U-test after its developers (who are variously credited, in subsets and in different orders, depending on the source), is a statistical test that aims to determine whether there is a significant difference between the values of two populations of data.</p>
<h3>Theory</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wikipedia article about the Rank Sum Test</a> includes some very helpful information on the theoretical underpinnings of the test, as well as various examples of how it can be applied. </p>
<h3>Use in GATK</h3>
<p>This test is used by several GATK annotations, including two standard annotations that are used for variant recalibration in the Best Practices: <a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityRankSumTest.php">MappingQualityRankSum</a> and <a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_ReadPosRankSumTest.php">ReadPosRankSum</a>. In all cases, the idea is to check, for a given candidate variant, whether the properties of the data that support the reference allele are similar to those of the data that support a variant allele. If they are not similar, we conclude that there may be some technical bias and that the candidate variant may be an artifact. </p>
<h3>Example: BaseQualityRankSumTest</h3>
<p><em>Note: this example applies Method 2 from the Wikipedia article linked above.</em></p>
<p>In this example, we have a set of 20 reads, 10 of which support the reference allele and 10 of which support the alternate allele. At first glance, that looks like a clear heterozygous 0/1 site. But to be thorough in our analysis and to account for any technical bias, we want to determine if there is a significant difference in the base qualities of the bases that support the reference allele vs. the bases that support the alternate allele. </p>
<p>Before we proceed, we must define our null hypothesis and alternate hypothesis. </p>
<p><em>Null hypothesis:</em> There is <strong>no</strong> difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.</p>
<p><em>Alternate hypothesis:</em> There <strong>is</strong> a difference in the base qualities that support the reference allele and the base qualities that support the alternate allele.</p>
<h4>Step 1: List the relevant observations</h4>
<p>Reference allele base qualities: 20, 25, 26, 30, 32, 40, 47, 50, 53, 60
Alternate allele base qualities: 0, 7, 10, 17, 20, 21, 30, 34, 40, 45</p>
<h4>Step 2: Rank the observations</h4>
<p>First, we arrange all the observations (base qualities) into a list of values ordered from lowest to highest (reference bases are in bold).</p>
<p>0, 7, 10, 17, <strong>20</strong>, 20, 21, <strong>25</strong>, <strong>26</strong>, <strong>30</strong>, 30, <strong>32</strong>, 34, <strong>40</strong>, 40, 45, <strong>47</strong>, <strong>50</strong>, <strong>53</strong>, <strong>60</strong></p>
<p>Next we determine the ranks of the values. Since there are 20 observations (the base qualities), we have 20 ranks to assign. Whenever observations are tied, each receives the midpoint of the ranks they span. For example, 20(ref) and 20(alt) are tied for ranks 5 and 6, so we assign each observation a rank of (5+6)/2 = 5.5.</p>
<p>The ranks from the above list are (reference ranks are in bold):</p>
<p>1, 2, 3, 4, <strong>5.5</strong>, 5.5, 7, <strong>8</strong>, <strong>9</strong>, <strong>10.5</strong>, 10.5, <strong>12</strong>, 13, <strong>14.5</strong>, 14.5, 16, <strong>17</strong>, <strong>18</strong>, <strong>19</strong>, <strong>20</strong></p>
<h4>Step 3: Add up the ranks for each group</h4>
<p>We now need to add up the ranks for the base qualities that came from the reference allele and the alternate allele.</p>
<p>$$ Rank_{ref} = 133.5 $$</p>
<p>$$ Rank_{alt} = 76.5 $$</p>
<h4>Step 4: Calculate U for each group</h4>
<p>U is a statistic that tells us the difference between the two rank totals. We can use the U statistic to calculate the z-score (explained below), which will give us our p-value.</p>
<p>Calculate U for each group (n = number of observations in each sample):</p>
<p>$$ U_{ref} = n_{ref} \cdot n_{alt} + \frac{ n_{ref} (n_{ref} + 1) }{ 2 } - Rank_{ref} $$</p>
<p>$$ U_{alt} = n_{alt} \cdot n_{ref} + \frac{ n_{alt} (n_{alt} + 1) }{ 2 } - Rank_{alt} $$</p>
<p>$$ U_{ref} = 10 \cdot 10 + \frac{ 10 \cdot 11 }{ 2 } - 133.5 = 21.5 $$</p>
<p>$$ U_{alt} = 10 \cdot 10 + \frac{ 10 \cdot 11 }{ 2 } - 76.5 = 78.5 $$</p>
<h4>Step 5: Calculate the overall z-score</h4>
<p>Next, we need to calculate the z-score, which will allow us to get the p-value. The z-score is a normalized score that allows us to compare the probability of the U score occurring in our distribution.
<a href="https://statistics.laerd.com/statistical-guides/standard-score.php">https://statistics.laerd.com/statistical-guides/standard-score.php</a></p>
<p>The equation to get the z-score is:</p>
<p>$$ z = \frac{U - mu}{u} $$ </p>
<p>Breaking this equation down:</p>
<p>$$ z = \text{z-score} $$</p>
<p>$$ U = \text{lowest of the U scores calculated in previous steps} $$</p>
<p>$$ mu = \text{mean of the U scores above} = \frac{ n_{ref} \cdot n_{alt} }{ 2 } $$</p>
<p>$$ u = \text{standard deviation of U} = \sqrt{ \frac{ n_{ref} \cdot n_{alt} \cdot (n_{ref} + n_{alt} + 1) }{ 12 } } $$</p>
<p>To calculate our z:</p>
<p>$$ U = 21.5 $$</p>
<p>$$ mu = \frac{ 10 \cdot 10 }{ 2 } = 50 $$</p>
<p>$$ u = \sqrt{ \frac{ 10 \cdot 10 \cdot (10 + 10 + 1) }{ 12 } } = 13.229 $$</p>
<p>So altogether we have: </p>
<p>$$ z = \frac{ 21.5 - 50 }{ 13.229 } = -2.154 $$</p>
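<p>Steps 2 through 5 can be reproduced with a short script. This is an illustrative re-implementation of the worked example, written by us for this walkthrough; it is not GATK's actual code:</p>

```python
# Illustrative rank-sum z-score calculation, following Steps 2-5 above.
def rank_sum_z(ref_quals, alt_quals):
    # Pool the observations, remembering which group each came from.
    pooled = sorted([(q, "ref") for q in ref_quals] + [(q, "alt") for q in alt_quals])
    rank_sums = {"ref": 0.0, "alt": 0.0}
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the midpoint rank.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midpoint_rank = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += midpoint_rank
        i = j
    n_ref, n_alt = len(ref_quals), len(alt_quals)
    u_ref = n_ref * n_alt + n_ref * (n_ref + 1) / 2.0 - rank_sums["ref"]
    u_alt = n_ref * n_alt + n_alt * (n_alt + 1) / 2.0 - rank_sums["alt"]
    u = min(u_ref, u_alt)                   # lowest U score
    mu = n_ref * n_alt / 2.0                # mean of U
    sigma = (n_ref * n_alt * (n_ref + n_alt + 1) / 12.0) ** 0.5
    return (u - mu) / sigma

ref_quals = [20, 25, 26, 30, 32, 40, 47, 50, 53, 60]
alt_quals = [0, 7, 10, 17, 20, 21, 30, 34, 40, 45]
print(round(rank_sum_z(ref_quals, alt_quals), 3))  # -2.154
```

With the base qualities listed in Step 1, this yields the same z of about -2.154 as the manual calculation.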
<h4>Step 6: Calculate and interpret the p-value</h4>
<p>The p-value is the probability of obtaining a z-score at least as extreme as the one we got, assuming the null hypothesis is true. In our example, it tells us how likely we would be to see a difference this large in the base qualities if there were truly no difference between the reference-supporting and alternate-supporting bases. The lower the p-value, the less plausible the null hypothesis becomes.</p>
<p>Going to a z-score table, or just using a <a href="http://graphpad.com/quickcalcs/pValue2/">p-value calculator</a>, we find the p-value to be 0.0312.</p>
<p>This means that if there were truly no difference between the two groups, we would observe a difference this extreme only about 3% of the time. Using a p-value cutoff of 0.05, we have enough evidence to <strong>reject our null hypothesis</strong> that there is no difference in the base qualities of the reference and alternate alleles. This indicates there is some bias and that the alternate allele is less well supported by the data than the allele counts suggest.</p>
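<p>The two-sided p-value can also be computed directly from the z-score using the standard normal distribution; a minimal sketch:</p>

```python
import math

def two_sided_p(z):
    # Probability mass in both tails of the standard normal beyond |z|.
    # math.erfc(x) equals twice the normal tail probability P(Z > x * sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

print(round(two_sided_p(-2.154), 4))  # 0.0312
```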
## Using Variant Annotator

http://gatkforums.broadinstitute.org/gatk/discussion/49/using-variant-annotator
<h3>This document is out of date and has been retired. Please see the Annotation documentation in the Tool Docs as well as various other Guide articles for better materials on annotating variants.</h3>
<hr />
<p>2 SNPs with significant strand bias:</p>
<img src="http://www.broadinstitute.org/gatk/media/pics/StrandFailure.png" />
<p>Several SNPs with excessive coverage:</p>
<img src="http://www.broadinstitute.org/gatk/media/pics/DoCFailure.png" />
<p><strong>For a complete, detailed argument reference, refer to the GATK document page <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_VariantAnnotator.html">here</a>.</strong></p>
<h3>Introduction</h3>
<p>In addition to true variation, variant callers emit a number of false positives. Some of these false positives can be detected and rejected by various statistical tests. VariantAnnotator provides a way of annotating variant calls in preparation for executing these tests.</p>
<p>Description of the haplotype score annotation:</p>
<img src="http://www.broadinstitute.org/gatk/media/pics/HaplotypeScore.png" />
<h3>Examples of Available Annotations</h3>
<p>The list below is not comprehensive. Please use the <code>--list</code> argument to get a list of all possible annotations available. Also, see <a href="http://www.broadinstitute.org/gatk/guide/article?id=1268">the FAQ article on understanding the Unified Genotyper's VCF files</a> for a description of some of the more standard annotations.</p>
<ul>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_BaseQualityRankSumTest.html">BaseQualityRankSumTest</a> (BaseQRankSum)</li>
<li><a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_Coverage.php">DepthOfCoverage</a> (DP)</li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_FisherStrand.html">FisherStrand</a> (FS)</li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_HaplotypeScore.html">HaplotypeScore</a> (HaplotypeScore)</li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_MappingQualityRankSumTest.html">MappingQualityRankSumTest</a> (MQRankSum)</li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_MappingQualityZero.html">MappingQualityZero</a> (MQ0)</li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_QualByDepth.html">QualByDepth</a> (QD)</li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_ReadPosRankSumTest.html">ReadPositionRankSumTest</a> (ReadPosRankSum)</li>
<li><a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_RMSMappingQuality.html">RMSMappingQuality</a> (MQ)</li>
<li><a href="http://www.broadinstitute.org/gatk/guide/article?id=50">SnpEff</a>: Add genomic annotations using the third-party tool SnpEff with VariantAnnotator</li>
</ul>
<p>Note that technically the VariantAnnotator does not require reads (from a BAM file) to run; if no reads are provided, only those Annotations which don't use reads (e.g. Chromosome Counts) will be added. But most Annotations do require reads. <strong>When running the tool we recommend that you add the <code>-L</code> argument with the variant rod to your command line for efficiency and speed.</strong></p>
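<p>As an illustration, a typical GATK 3 invocation of VariantAnnotator might look something like the following. The file names are placeholders, and the annotation chosen (<code>FisherStrand</code>) is just an example; consult the tool documentation linked above for the authoritative argument list:</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar \
   -T VariantAnnotator \
   -R reference.fasta \
   -I input.bam \
   -V input.vcf \
   -L input.vcf \
   -A FisherStrand \
   -o annotated.vcf </code class="pre_md"></pre>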
## Walkthrough of the Oct 2013 GATK workshop hands-on session

http://gatkforums.broadinstitute.org/gatk/discussion/3366/walkthrough-of-the-oct-2013-gatk-workshop-hands-on-session
<h4>Note: the exact data files we used in this tutorial are no longer available. However, you can use the files in the resource bundle to work through this tutorial. You may need to adapt the filenames accordingly.</h4>
<hr />
<h3>Map and mark duplicates</h3>
<p><a href="http://gatkforums.broadinstitute.org/discussion/2799/howto-map-and-mark-duplicates">http://gatkforums.broadinstitute.org/discussion/2799/howto-map-and-mark-duplicates</a></p>
<p><em>We start with aligned (mapped) and deduplicated (dedupped) reads in a BAM file, to save time.</em></p>
<h4>- Generate index</h4>
<p>Create an index file to enable fast seeking through the file.</p>
<pre><code class="pre_md">java -jar BuildBamIndex.jar I=dedupped_20.bam</code class="pre_md"></pre>
<h4>- Prepare reference to work with GATK</h4>
<p><a href="http://gatkforums.broadinstitute.org/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk">http://gatkforums.broadinstitute.org/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk</a></p>
<p>Create a dictionary file and index for the reference.</p>
<pre><code class="pre_md">java -jar CreateSequenceDictionary.jar R=human_b37_20.fasta O=human_b37_20.dict

samtools faidx human_b37_20.fasta </code class="pre_md"></pre>
<hr />
<h3>Getting to know GATK</h3>
<h4>- Run a simple walker: CountReads</h4>
<p>Identify basic syntax, console output: version, command recap line, progress estimates, result if applicable.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T CountReads -R human_b37_20.fasta -I dedupped_20.bam -L 20</code class="pre_md"></pre>
<h4>- Add a filter to count how many duplicates were marked</h4>
<p>Look at the filtering summary.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T CountReads -R human_b37_20.fasta -I dedupped_20.bam -L 20 -rf DuplicateRead</code class="pre_md"></pre>
<h4>- Demonstrate how to select a subset of read data</h4>
<p>This can come in handy for bug reports.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T PrintReads -R human_b37_20.fasta -I dedupped_20.bam -L 20:10000000-11000000 -o snippet.bam</code class="pre_md"></pre>
<h4>- Demonstrate the equivalent for variant calls</h4>
<p>Refer to the docs for many other capabilities, including selecting by sample name and more complex queries.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_b37_20.fasta -V dbsnp_b37_20.vcf -o snippet.vcf -L 20:10000000-11000000</code class="pre_md"></pre>
<hr />
|
||||
<h3>Back to data processing</h3>
|
||||
<h4>- Realign around Indels</h4>
|
||||
<p><a href="http://gatkforums.broadinstitute.org/discussion/2800/howto-perform-local-realignment-around-indels">http://gatkforums.broadinstitute.org/discussion/2800/howto-perform-local-realignment-around-indels</a></p>
|
||||
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_b37_20.fasta -I dedupped_20.bam -known indels_b37_20.vcf -o target_intervals.list -L 20
|
||||
|
||||
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_b37_20.fasta -I dedupped_20.bam -known indels_b37_20.vcf -targetIntervals target_intervals.list -o realigned_20.bam -L 20 </code class="pre_md"></pre>
|
||||
<h4>- Base recalibration</h4>
|
||||
<p><a href="http://gatkforums.broadinstitute.org/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr">http://gatkforums.broadinstitute.org/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr</a></p>
|
||||
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_b37_20.fasta -I realigned_20.bam -knownSites dbsnp_b37_20.vcf -knownSites indels_b37_20.vcf -o recal_20.table -L 20
|
||||
|
||||
java -jar GenomeAnalysisTK.jar -T PrintReads -R human_b37_20.fasta -I realigned_20.bam -BQSR recal_20.table -o recal_20.bam -L 20
|
||||
|
||||
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_b37_20.fasta -I recalibrated_20.bam -knownSites dbsnp_b37_20.vcf -knownSites indels_b37_20.vcf -o post_recal_20.table -L 20
|
||||
|
||||
java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_b37_20.fasta -before recal_20.table -after post_recal_20.table -plots recalibration_plots.pdf -L 20 </code class="pre_md"></pre>
|
||||
<h4>- ReduceReads</h4>
|
||||
<p><a href="http://gatkforums.broadinstitute.org/discussion/2802/howto-compress-read-data-with-reducereads">http://gatkforums.broadinstitute.org/discussion/2802/howto-compress-read-data-with-reducereads</a></p>
|
||||
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T ReduceReads -R human_b37_20.fasta -I recalibrated_20.bam -o reduced_20.bam -L 20 </code class="pre_md"></pre>
|
||||
<h4>- HaplotypeCaller</h4>
|
||||
<p><a href="http://gatkforums.broadinstitute.org/discussion/2803/howto-call-variants-on-a-diploid-genome-with-the-haplotypecaller">http://gatkforums.broadinstitute.org/discussion/2803/howto-call-variants-on-a-diploid-genome-with-the-haplotypecaller</a></p>
|
||||
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I reduced_20.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o variants_20.vcf -L 20 </code class="pre_md"></pre>
|
||||
## What is Firepony and what can I expect from it?

http://gatkforums.broadinstitute.org/gatk/discussion/6019/what-is-firepony-and-what-can-i-expect-from-it
<h3>Firepony in a nutshell</h3>
<p>Firepony is a base quality score recalibrator for aligned read data sets. It recalculates the quality scores for each nucleotide in a SAM/BAM file based on the original quality data generated by the sequencer plus the empirical data obtained by running alignment.</p>
<p>The algorithm is a re-engineering of the base quality score recalibrator in the Genome Analysis Toolkit. It generates identical results, but runs much faster.</p>
<p><strong>Note that this tool was written by external collaborators of the GATK team and is their sole responsibility. To be clear, Firepony is not part of the official GATK software and is not tested/validated by the GATK developers. Use at your own risk.</strong></p>
<hr />
<h3>How Firepony fits into your existing processing pipeline (workflow and command line usage)</h3>
<p>Firepony is meant to be a drop-in replacement for the BQSR step in GATK. The output of Firepony is a table that can be used as input for the PrintReads tool in GATK.</p>
<p>Existing pipelines can be modified by replacing the BQSR step (i.e., running GATK with the <code>-T BaseRecalibrator</code> argument) with Firepony, as outlined in the accompanying documentation.</p>
<hr />
<h3>Technical requirements and expected performance</h3>
<p>Firepony runs on Linux systems based on Intel CPUs with 64-bit support and at least 16GB of RAM. It can optionally make use of NVIDIA GPUs (Kepler class or higher with at least 4GB of memory) for higher performance.</p>
<p>Compared to GATK, Firepony runs anywhere from 5x to 12x faster, depending on the specific hardware and data set used. The output of Firepony is compatible with GATK, meaning it can be used by subsequent processing steps that rely on GATK.</p>
## Where can I get more information about high-throughput sequencing concepts and terms?

http://gatkforums.broadinstitute.org/gatk/discussion/1321/where-can-i-get-more-information-about-high-throughput-sequencing-concepts-and-terms

<p><strong>This article has been retired, as the resources it cites are somewhat out of date. For an introduction to GATK and sequence analysis, see the Best Practices section of the website, which contains a lot of intro-level information and references useful resources.</strong></p>
<p>We know this field can be confusing or even overwhelming to newcomers, and getting to grips with a large and varied toolkit like the GATK can be a big challenge. We have produced a presentation that we hope will help you review all the background information that you need to know in order to use the GATK:</p>
<ul>
<li>Introduction to High-Throughput Sequencing Analysis: all you need to know to use the GATK: <a href="http://www.broadinstitute.org/gatk/events/3093/GATKwh1-BP-0A-Intro_to_NGS.pdf">slides</a> and <a href="http://www.broadinstitute.org/videos/broade-introduction-ngs-gatk">video</a></li>
</ul>
<p>In addition, the following links feature a lot of useful educational material about concepts and terminology related to next-generation sequencing:</p>
<ul>
<li>
<p><a href="http://en.wikipedia.org/wiki/DNA_sequencing">DNA sequencing (Wikipedia)</a> </p>
<p>A basic review of the sequencing process.</p>
</li>
<li>
<p><a href="http://www.nature.com/nrg/journal/v11/n1/full/nrg2626.html">Sequencing technologies, the next generation (M. Metzker, Nature Reviews Genetics)</a> </p>
<p>An excellent, detailed overview of the myriad next-gen sequencing methodologies.</p>
</li>
<li>
<p><a href="http://www.nature.com/nmeth/journal/v7/n7/full/nmeth0710-495.html">Next-generation sequencing: adjusting to data overload (M. Baker, Nature Methods)</a> </p>
<p>A nice piece explaining the problems inherent in trying to analyze terabytes of data. The GATK addresses this issue by requiring all datasets to be in reference order, so only small chunks of the genome need to be in memory at once, as explained <a href="http://gatkforums.broadinstitute.org/discussion/1320/how-does-the-gatk-handle-these-huge-ngs-datasets">here</a>.</p>
</li>
<li><a href="https://www.dropbox.com/s/f09g6br4bq5o7hw/NGS%20intro%20v1.pptx.pdf">Primer on NGS analysis, from Broad Institute Primers in Medical Genetics</a></li>
</ul>
## Workshop walkthrough (Brussels 2014)

http://gatkforums.broadinstitute.org/gatk/discussion/4327/workshop-walkthrough-brussels-2014

<h4>Note: this is a walkthrough of a hands-on GATK tutorial given at the Royal Institute of Natural Sciences on June 26, 2014 in Brussels, Belgium. It is intended to be performed with version 3.1-2 of the GATK and the corresponding data bundle.</h4>
<h3>Data files</h3>
<p>We start with a BAM file called "NA12878.wgs.1lib.bam" (along with its index, "NA12878.wgs.1lib.bai") containing Illumina sequence reads from our favorite test subject, NA12878, that have been mapped using BWA-mem and processed using Picard tools according to the instructions here:</p>
<p><a href="http://www.broadinstitute.org/gatk/guide/article?id=2799">http://www.broadinstitute.org/gatk/guide/article?id=2799</a></p>
<p>Note that this file only contains sequence for a small region of chromosome 20, in order to minimize the file size and speed up the processing steps, for demonstration purposes. Normally you would run the steps in this tutorial on the entire genome (or exome). </p>
<p>This subsetted file was prepared by extracting read group 20GAV.1 from the CEUTrio.HiSeq.WGS.b37.NA12878.bam that is available in our resource bundle, using the following command:</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I CEUTrio.HiSeq.WGS.b37.NA12878.bam -o NA12878.wgs.1lib.bam -L 20 -rf SingleReadGroup -goodRG 20GAV.1</code class="pre_md"></pre>
<p>(We'll explain later in the tutorial how to use this kind of utility function to manipulate BAM files.)</p>
<p>We also have our human genome reference, called "human_g1k_v37.fasta", which has been prepared according to the instructions here:</p>
<p><a href="http://www.broadinstitute.org/gatk/guide/article?id=2798">http://www.broadinstitute.org/gatk/guide/article?id=2798</a></p>
<p>We will walk through both of these tutorials to explain the processing, but without actually running the steps, to save time.</p>
<p>And finally, we have a few resource files containing known variants (dbSNP, Mills indels). These files are all available in the resource bundle on our FTP server. See here for access instructions:</p>
<p><a href="http://www.broadinstitute.org/gatk/guide/article?id=1215">http://www.broadinstitute.org/gatk/guide/article?id=1215</a></p>
<hr />
<h2>DAY 1</h2>
<h3>Prelude: BAM manipulation with Picard and Samtools</h3>
<h4>- Viewing BAM file information</h4>
<p>See also the Samtools docs:</p>
<p><a href="http://samtools.sourceforge.net/samtools.shtml">http://samtools.sourceforge.net/samtools.shtml</a> </p>
<h4>- Reverting a BAM file</h4>
<p>Clean the BAM we are using of previous GATK processing using this Picard command:</p>
<pre><code class="pre_md">java -jar RevertSam.jar I=NA12878.wgs.1lib.bam O=aligned_reads_20.bam RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=false SORT_ORDER=coordinate</code class="pre_md"></pre>
<p>Note that it is possible to revert the file to FastQ format by setting REMOVE_ALIGNMENT_INFORMATION=true, but this method leads to biases in the alignment process, so if you want to do that, the better method is to follow the instructions given here:</p>
<p><a href="http://www.broadinstitute.org/gatk/guide/article?id=2908">http://www.broadinstitute.org/gatk/guide/article?id=2908</a></p>
<p>See also the Picard docs:</p>
<p><a href="http://picard.sourceforge.net/command-line-overview.shtml">http://picard.sourceforge.net/command-line-overview.shtml</a> </p>
<h3>Mark Duplicates</h3>
<p>See the penultimate step of <a href="http://www.broadinstitute.org/gatk/guide/article?id=2799">http://www.broadinstitute.org/gatk/guide/article?id=2799</a></p>
<p>After a few minutes, the file (which we'll call "dedupped_20.bam") is ready for use with GATK.</p>
<h3>Interlude: tour of the documentation, website, forum, etc. Also show how to access the bundle on the FTP server with FileZilla.</h3>
<h3>Getting to know GATK</h3>
<p>Before starting to run the GATK Best Practices, we are going to learn about the basic syntax of GATK, how the results are output, how to interpret error messages, and so on.</p>
<h4>- Run a simple walker: CountReads</h4>
<p>Identify basic syntax, console output: version, command recap line, progress estimates, result if applicable.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20</code class="pre_md"></pre>
<h4>- Add a filter to count how many duplicates were marked</h4>
<p>Look at the filtering summary.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T CountReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20 -rf DuplicateRead</code class="pre_md"></pre>
<h4>- Demonstrate how to select a subset of read data</h4>
<p>This can come in handy for bug reports.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I dedupped_20.bam -L 20:10000000-11000000 -o snippet.bam</code class="pre_md"></pre>
<p>Also show how a bug report should be formatted and submitted. See
<a href="http://www.broadinstitute.org/gatk/guide/article?id=1894">http://www.broadinstitute.org/gatk/guide/article?id=1894</a></p>
<h4>- Demonstrate the equivalent for variant calls</h4>
<p>Refer to the docs for many other capabilities, including selecting by sample name and more complex queries.</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta -V dbsnp_b37_20.vcf -o snippet.vcf -L 20:10000000-11000000</code class="pre_md"></pre>
<p>See <a href="http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html">http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html</a></p>
<hr />
<h3>GATK Best Practices for data processing (DNA-seq)</h3>
<p>These steps should typically be performed per lane of data. Here we are running the tools on a small slice of the data, to save time and disk space, but normally you would run on the entire genome or exome. This is especially important for BQSR, which does not work well on small amounts of data.</p>
<p>Now let's pick up where we left off after Marking Duplicates.</p>
<h4>- Realign around Indels</h4>
<p>See <a href="http://gatkforums.broadinstitute.org/discussion/2800/howto-perform-local-realignment-around-indels">http://gatkforums.broadinstitute.org/discussion/2800/howto-perform-local-realignment-around-indels</a></p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37.vcf -o target_intervals.list -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37.fasta -I dedupped_20.bam -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals target_intervals.list -o realigned_20.bam -L 20:10000000-11000000 </code class="pre_md"></pre>
|
||||
<h4>- Base recalibration</h4>
|
||||
<p>See <a href="http://gatkforums.broadinstitute.org/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr">http://gatkforums.broadinstitute.org/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr</a></p>

<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I realigned_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37.fasta -I realigned_20.bam -BQSR recal_20.table -o recal_20.bam -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I recal_20.bam -knownSites dbsnp_b37_20.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -o post_recal_20.table -L 20:10000000-11000000

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_20.table -after post_recal_20.table -plots recalibration_plots.pdf -L 20:10000000-11000000</code class="pre_md"></pre>

<hr />

<h3>GATK Best Practices for variant calling (DNA seq)</h3>

<h4>- Run HaplotypeCaller in regular mode</h4>

<p>See <a href="http://www.broadinstitute.org/gatk/guide/article?id=2803">http://www.broadinstitute.org/gatk/guide/article?id=2803</a></p>

<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.vcf -L 20:10000000-11000000</code class="pre_md"></pre>

<p>Look at the VCF as text and in IGV, and compare with the BAM file.</p>

<h4>- Run HaplotypeCaller in GVCF mode (banded and BP_RESOLUTION)</h4>

<p>See <a href="http://www.broadinstitute.org/gatk/guide/article?id=3893">http://www.broadinstitute.org/gatk/guide/article?id=3893</a></p>

<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I recal_20.bam -o raw_hc_20.g.vcf -L 20:10000000-11000000 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000</code class="pre_md"></pre>

<p>Compare to the regular VCF.</p>
## [How to] Generate a BAM for variant discovery (long)

http://gatkforums.broadinstitute.org/gatk/discussion/5969/how-to-generate-a-bam-for-variant-discovery-long

<h3>This document is an archived rough draft of <a href="https://software.broadinstitute.org/gatk/documentation/article?id=6483">Tutorial#6483</a>. Please use the public tutorial. If you are interested in aligning to GRCh38, then please refer to a separate tutorial, <a href="https://software.broadinstitute.org/gatk/documentation/article?id=8017">Tutorial#8017</a>.</h3>

<hr />

<p>[work in progress--I am breaking this up into smaller chunks]
<a name="top"></a>
This document in part replaces the previous post <a href="http://gatkforums.broadinstitute.org/discussion/2908/howto-revert-a-bam-file-to-fastq-format">(howto) Revert a BAM file to FastQ format</a> that uses HTSlib commands. The workflow assumes familiarity with the concepts given in <a href="http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files">Collected FAQs about BAM files</a>.</p>

<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/72/6abbf031529f1a8287a302adb454aa.png" height="270" align="right" border="9"/>

<p>We outline steps to preprocess Illumina and similar tech DNA sequence reads for use in GATK's variant discovery workflow. This preprocessing workflow involves marking adapter sequences using <a href="https://broadinstitute.github.io/picard/command-line-overview.html#MarkIlluminaAdapters">MarkIlluminaAdapters</a> so they contribute minimally to alignments, alignment using the <a href="http://bio-bwa.sourceforge.net/bwa.shtml#3">BWA</a> aligner's maximal exact match (MEM) algorithm, and preserving and adjusting read data and read metadata using <a href="https://broadinstitute.github.io/picard/command-line-overview.html#MergeBamAlignment">MergeBamAlignment</a> for consistency and comparability of downstream results with analyses from the Broad Institute. With the exception of BWA, we use the most current versions of tools as of this writing. The workflow results in an aligned BAM file with appropriate meta information that is ready for processing with MarkDuplicates.</p>

<p>This workflow applies to three common types of sequence read files: (A) aligned BAMs that need realignment, (B) FASTQ format data and (C) raw sequencing data in BAM format. If you have raw data in BAM format (C), given appropriate read group fields, you can start with step 2. The other two formats require conversion to unmapped BAM (uBAM). We use Picard's <a href="http://broadinstitute.github.io/picard/command-line-overview.html#RevertSam">RevertSam</a> to convert an aligned BAM (A) or Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam">FastqToSam</a> to convert a FASTQ (B) to the uBAM.</p>

<p>We address options relevant to processing reads extracted from an interval as well as options to process large files, in our case a ~150G file called <code>Solexa-272222</code>. The tutorial uses a smaller file of reads aligning to a genomic interval, called <code>snippet</code>, derived from <code>Solexa-272222</code>, for faster processing. The example commands apply to the larger file. Some comments on the workflow:</p>

<ul>
<li>The workflow reflects a <em>lossless</em> operating procedure that retains original FASTQ read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one need only store the final BAM file.</li>
<li>When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.</li>
<li>Finally, when I call default options within a command, follow suit to ensure the same results.</li>
</ul>

<hr />

<h4>The steps of the workflow are as follows.</h4>

<ol>
<li><a href="#step1">Generate an unmapped BAM (uBAM)</a>
(A) Convert the FASTQ to uBAM and add read group information using FastqToSam
(B1) [Optional] Extract reads in a genomic interval from aligned BAM
(B2) Convert aligned BAM to uBAM and discard problematic records using RevertSam</li>
<li><a href="#step2">Mark adapter sequences using MarkIlluminaAdapters</a></li>
<li><a href="#step3">Convert uBAM to FASTQ and assign adapter bases low qualities using SamToFastq</a></li>
<li><a href="#step4">Align reads and flag secondary hits using BWA MEM</a></li>
<li>[Optional] <a href="#step5">Pipe steps 3 & 4 and collect alignment metrics</a></li>
<li>[Optional] <a href="#step6">Sort, index and convert alignment to a BAM using SortSam and visualize on IGV</a></li>
<li><a href="#step7">Restore altered data and apply & adjust meta information using MergeBamAlignment</a></li>
</ol>

<hr />

<p><a name="step1"></a></p>

<h3>1. Generate an unmapped BAM (uBAM)</h3>

<p>The goal is to produce an unmapped BAM file with <em>appropriate</em> read group (@RG) information that differentiates not only samples, but also factors that contribute to technical artifacts. To see the read group information for a BAM file, use the following command.</p>

<pre><code class="pre_md">samtools view -H Solexa-272222.bam | grep '@RG'</code class="pre_md"></pre>

<p>This prints the lines starting with @RG within the header. Our tutorial file's single @RG line is shown below. The file has the read group fields required by this workflow as well as extra fields for record keeping. Two read group fields, <code>ID</code> and <code>PU</code>, appropriately differentiate flow cell lane, marked by <code>.2</code>, a factor that contributes to batch effects.</p>

<pre><code class="pre_md">@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI</code class="pre_md"></pre>

<ul>
<li>GATK's variant discovery workflow requires the <code>ID</code>, <code>SM</code> and <code>LB</code> fields and recommends the <code>PL</code> field.</li>
<li>Each <code>@RG</code> line has a unique <code>ID</code> that differentiates read groups. It is the lowest denominator that differentiates factors contributing to technical batch effects and is repeatedly indicated by the <code>RG</code> tag for each read record. Thus, the length of this field contributes to file size.</li>
<li><code>SM</code> indicates the sample name and, within a collection of samples, <code>LB</code> indicates if the same sample was sequenced in multiple lanes. See item 8 of <a href="http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files">Collected FAQs about BAM files</a> for more detail.</li>
<li><code>PU</code> is not required by any GATK tool. If present, it is used by BQSR instead of <code>ID</code>. It is required by Picard's AddOrReplaceReadGroups but not by FastqToSam.</li>
</ul>

<p>If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's <a href="http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups">AddOrReplaceReadGroups</a> to add or appropriately rename the read group fields.</p>
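<p>As a rough sketch, such a fix-up might look like the following. The input and output file names here are placeholders, and the read group values mirror the tutorial's @RG line; adjust all of them to your own data. Note the tool assigns the one specified read group to all reads in the file.</p>

```shell
# Hypothetical sketch: assign corrected read group fields to a BAM.
# input.bam and output_rg.bam are placeholder names; the RG values
# shown copy the tutorial's @RG line and must match your own data.
java -Xmx8G -jar /path/picard.jar AddOrReplaceReadGroups \
    I=input.bam \
    O=output_rg.bam \
    RGID=H0164.2 \
    RGPU=H0164ALXX140820.2 \
    RGSM=NA12878 \
    RGLB=Solexa-272222 \
    RGPL=illumina
```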

<p>Here we illustrate how to derive both <code>ID</code> and <code>PU</code> fields from query names. We break down the common portion of two different read query names from the tutorial file.</p>

<pre><code class="pre_md">H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

#Breaking down the common portion of the query names:
H0164____________ # portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ # portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 # portion of @RG ID and PU fields indicating flow cell lane</code class="pre_md"></pre>
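<p>The breakdown above can be sketched as a small shell recipe that derives candidate <code>ID</code> and <code>PU</code> values from a query name. The five-character flow cell prefix is an assumption specific to this run's naming scheme; other schemes differ.</p>

```shell
# Derive candidate @RG ID and PU values from a read query name.
# Assumes the first colon-delimited field is the run barcode whose
# first 5 characters name the flow cell, as in this tutorial's data.
qname="H0164ALXX140820:2:1101:10003:23460"
barcode=$(echo "$qname" | cut -d: -f1)      # H0164ALXX140820
lane=$(echo "$qname" | cut -d: -f2)         # 2
id="$(echo "$barcode" | cut -c1-5).$lane"   # H0164.2
pu="$barcode.$lane"                         # H0164ALXX140820.2
echo "ID=$id PU=$pu"
```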

<hr />
<h4>(A) Convert the FASTQ to uBAM and add read group information using FastqToSam</h4>

<p>Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam">FastqToSam</a> transforms a FASTQ file to unmapped BAM, requires two read group fields and makes specification of the other read group fields optional. In the command below we note which fields are required for our workflow. All other read group fields are optional.</p>

<pre><code class="pre_md">java -Xmx8G -jar /seq/software/picard/current/bin/picard.jar FastqToSam \
FASTQ=snippet_XT_interleaved.fq \ #our single tutorial file contains both reads in a pair
OUTPUT=snippet_FastqToSam_PU.bam \
READ_GROUP_NAME=H0164.2 \ # required; changed from default of A
SAMPLE_NAME=NA12878 \ # required
LIBRARY_NAME=Solexa-272222 \ # required
PLATFORM_UNIT=H0164ALXX140820.2 \
PLATFORM=illumina \ # recommended
SEQUENCING_CENTER=BI \
RUN_DATE=2014-08-20T00:00:00-0400</code class="pre_md"></pre>

<p>Some details on select parameters:</p>

<ul>
<li><code>QUALITY_FORMAT</code> is detected automatically if unspecified.</li>
<li><code>SORT_ORDER</code> by default is queryname.</li>
<li>Specify both <code>FASTQ</code> and <code>FASTQ2</code> for paired reads in separate files.</li>
<li><code>PLATFORM_UNIT</code> is often in run_barcode.lane format. Include it if the sample is multiplexed.</li>
<li><code>RUN_DATE</code> is in <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601 date format</a>.</li>
</ul>

<hr />

<h4>(B1) [Optional] Extract reads in a genomic interval from aligned BAM</h4>

<p>We want to test our reversion process on a subset of the tutorial file before committing to reverting the entire BAM. This process requires the reads in the BAM to be aligned to a reference genome and produces a BAM containing reads from a genomic interval.</p>

<pre><code class="pre_md">java -Xmx8G -jar /path/GenomeAnalysisTK.jar \
-T PrintReads \
-R /path/human_g1k_v37_decoy.fasta \
-L 10:90000000-100000000 \ # this is the retained interval
-I Solexa-272222.bam -o snippet.bam # snippet.bam is newly created</code class="pre_md"></pre>

<ul>
<li>This seems a good time to bring this up. In the command, the <code>-Xmx8G</code> Java option sets the maximum heap size, or memory usage, to eight gigabytes. We want to both cap Java's use of memory so the system doesn't slow down and allow enough memory for the tool to run without causing an out-of-memory error. The <code>-Xmx</code> settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. I have heard up to 16G specified and have also omitted this option for small files. To find a system's default maximum heap size, type <code>java -XX:+PrintFlagsFinal -version</code>, and look for <code>MaxHeapSize</code>. Note that any setting beyond available memory spills to storage and slows a system down. If <a href="https://www.broadinstitute.org/gatk/guide/article?id=1975">multithreading</a>, increase memory proportionately to the number of threads; e.g. if 1G is the minimum required for one thread, then use 2G for two threads.</li>
<li>This step is for our tutorial only. For applying interval lists, e.g. to whole exome data, see <a href="http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals">When should I use -L to pass in a list of intervals</a>.</li>
</ul>

<hr />

<h4>(B2) Convert aligned BAM to uBAM and discard problematic records using RevertSam</h4>

<p>We use Picard's RevertSam to remove alignment information. The resulting unmapped BAM (uBAM) has two uses in this workflow: (1) for processing through the MarkIlluminaAdapters branch of the workflow, and (2) for application of read group, read sequence and other read meta information to the aligned read file in the MergeBamAlignment branch of the workflow. The RevertSam parameters we specify remove information pertaining to previous alignments, including program group records and standard alignment flags and tags that would otherwise transfer over in the MergeBamAlignment step. We remove nonstandard alignment tags with the <code>ATTRIBUTE_TO_CLEAR</code> option. For example, we clear the <code>XT</code> tag using this option so that it is free for use by MarkIlluminaAdapters. Our settings also reset <a href="https://broadinstitute.github.io/picard/explain-flags.html">flags</a> to unmapped values, e.g. 77 and 141 for paired reads. Additionally, we invoke the <code>SANITIZE</code> option to remove reads that cause problems for MarkIlluminaAdapters. Our tutorial's <code>snippet</code> requires such filtering while <code>Solexa-272222</code> does not.</p>

<p>For our particular file, we use the following parameters.</p>

<pre><code class="pre_md">java -Xmx8G -jar /path/picard.jar RevertSam \
I=snippet.bam \
O=snippet_revert.bam \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \ # informational; does not affect processing
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true #default</code class="pre_md"></pre>

<p>To process large files, also designate a temporary directory.</p>

<pre><code class="pre_md"> TMP_DIR=/path/shlee # sets a temporary directory for the tool</code class="pre_md"></pre>

<p>We change these settings for RevertSam:</p>

<ul>
<li><code>SANITIZE</code> If the BAM file contains problematic reads, such as might arise from taking a genomic interval of reads (Step 1), then RevertSam's <code>SANITIZE</code> option removes them. Our workflow's downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities.</li>
<li>
<p><code>MAX_DISCARD_FRACTION</code> is set to a stricter threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. If the discarded fraction exceeds the threshold, the tool additionally reports it via an exception as it finishes processing. For example:</p>

<pre><code class="pre_md">Exception in thread "main" picard.PicardException: Discarded 0.947% which is above MAX_DISCARD_FRACTION of 0.500% </code class="pre_md"></pre>
</li>
<li>
<p><code>ATTRIBUTE_TO_CLEAR</code> is set to clear more than the default standard tags, which are the NM, UQ, PG, MD, MQ, SA, MC, and AS tags. The AS tag is removed by default for Picard releases starting 9/2015. Remove all other tags, such as the XT tag needed by MarkIlluminaAdapters, by specifying each with the <code>ATTRIBUTE_TO_CLEAR</code> option. To list all tags within my BAM, I used the command below to get RG, OC, XN, OP, <em>SA</em>, <em>MD</em>, <em>NM</em>, <em>PG</em>, <em>UQ</em>, <em>MC</em>, <em>MQ</em>, <em>AS</em>, XT, and <em>OQ</em> tags. Those removed by default and by <code>RESTORE_ORIGINAL_QUALITIES</code> are italicized. See your aligner's documentation and the <a href="http://samtools.sourceforge.net/SAM1.pdf">Sequence Alignment/Map Format Specification</a> for descriptions of tags.</p>

<pre><code class="pre_md">samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' </code class="pre_md"></pre>
</li>
</ul>
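<p>To see what the awk filter in that pipeline does, here is the same first-occurrence idiom run on a tiny inline list of tag names (the input is illustrative, not real BAM output):</p>

```shell
# awk prints each tag name only the first time it is seen,
# so repeated tags collapse to a unique list in input order.
printf 'RG\nOC\nRG\nXN\nOC\n' | awk '{ if(!x[$1]++) { print } }'
```

This prints RG, OC, and XN, each on its own line, in the order they first appear.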

<p>Some comments on options kept at default:</p>

<ul>
<li><code>SORT_ORDER</code>=queryname
For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate-sorted reads cause the aligner to incorrectly estimate insert size from blocks of paired reads, as they are not randomly distributed.</li>
<li><code>RESTORE_ORIGINAL_QUALITIES</code>=true
Restoring original base qualities to the QUAL field requires OQ tags listing the original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality + 33 for the tutorial data. After restoring the QUAL field, RevertSam removes the tag.</li>
<li><code>REMOVE_ALIGNMENT_INFORMATION</code>=true will remove program group records and alignment information. It also invokes the default <code>ATTRIBUTE_TO_CLEAR</code> parameter, which removes standard alignment tags.</li>
</ul>

<p>For snippet.bam, <code>SANITIZE</code> removes 25,909 out of 2,735,539 (0.947%) reads, leaving us with 2,709,630 reads. The intact <code>Solexa-272222</code> BAM retains all of its reads. The example shows a read pair before and after RevertSam.</p>

<pre><code class="pre_md">#original BAM
H0164ALXX140820:2:1101:10003:23460 83 10 91515318 60 151M = 91515130 -339 CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:0 MQ:i:0 OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA UQ:i:0 AS:i:151
H0164ALXX140820:2:1101:10003:23460 163 10 91515130 0 60M91S = 91515318 339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1; MC:Z:151M MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:2 MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### UQ:i:49 AS:i:50
#after RevertSam (step 1.B2)
H0164ALXX140820:2:1101:10003:23460 77 * 0 0 * * 0 0 TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:23460 141 * 0 0 * * 0 0 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2</code class="pre_md"></pre>

<p><a href="#top">back to top</a></p>

<hr />

<p><a name="step2"></a></p>

<h3>2. Mark adapter sequences using MarkIlluminaAdapters</h3>

<p>Previously we cleared the XT tag from our BAM so that Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#MarkIlluminaAdapters">MarkIlluminaAdapters</a> can use it to mark adapter sequences. SamToFastq (step 3) will use these in turn to assign low base quality scores to the adapter bases, effectively removing their contribution to read alignment and alignment scoring metrics. For the tutorial data, adapter sequences have already been removed from the <em>beginning</em> of reads. We want to additionally effectively remove any adapter sequences at the <em>ends</em> of reads arising from read-through to adapters in read pairs with shorter inserts.</p>

<pre><code class="pre_md">java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=snippet_revert.bam \
O=snippet_revertmark.bam \
M=snippet_revertmark.metrics.txt \ #naming required
TMP_DIR=/path/shlee # optional to process large files</code class="pre_md"></pre>

<ul>
<li>By default, the tool uses Illumina adapter sequences. This is sufficient for our tutorial data. Specify other adapter sequences as outlined in the <a href="https://broadinstitute.github.io/picard/command-line-overview.html#MarkIlluminaAdapters">tool documentation</a>.</li>
<li>Only reads with adapter sequence are marked with the tag in XT:i:[#] format, where # denotes the starting position of the adapter sequence.</li>
</ul>

<p>The example shows a read pair marked with the XT tag by MarkIlluminaAdapters. This is a different pair than shown previously, as the <code>H0164ALXX140820:2:1101:10003:23460</code> reads do not contain adapter sequence. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.</p>
<pre><code class="pre_md">#after MarkIlluminaAdapters (step 2)
H0164ALXX140820:2:1101:15118:25288 77 * 0 0 * * 0 0
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--A-FAJA-F<J7A--AFJ7AJ7AJ-FJ7-JJJ-F-J---7J---7FF-JAJJ<A7JFAFAA7--FF----AF-7<JF<JFA-7<F-FF-J RG:Z:H0164.2 XT:i:63
H0164ALXX140820:2:1101:15118:25288 141 * 0 0 * * 0 0
GTCATGGCTGGACGCAGTGGCTCATACCTGTAATCCCAGCACTTTTGGAGGCTGAGGCAGGTAGATCGGAAGCGCCTCGTGTAGGGAGAGAGGGTTAACAAAAATGTAGATACCGGAGGTCGCCGTAAAATAAAAAAGTAGCAAGGAGTAG
AAFFFJJJJJAJJJJJFJJJJ<JFJJJJJJJJFJJJJFJ<FJJJJAJJJJJJJJFJJJ7JJ--JJJ<J<-FJ7F--<-J7--7AJJA-J------J7F<-77--F--FFJ---J-J-J--A-7<<----J-7-J-FJ--J--FA####### RG:Z:H0164.2 XT:i:63
#after SamToFastq (step 3)
@H0164ALXX140820:2:1101:15118:25288/1
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
+
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--#########################################################################################
@H0164ALXX140820:2:1101:15118:25288/2
GTCATGGCTGGACGCAGTGGCTCATACCTGTAATCCCAGCACTTTTGGAGGCTGAGGCAGGTAGATCGGAAGCGCCTCGTGTAGGGAGAGAGGGTTAACAAAAATGTAGATACCGGAGGTCGCCGTAAAATAAAAAAGTAGCAAGGAGTAG
+
AAFFFJJJJJAJJJJJFJJJJ<JFJJJJJJJJFJJJJFJ<FJJJJAJJJJJJJJFJJJ7JJ-#########################################################################################
#after MergeBamAlignment (step 7)
H0164ALXX140820:2:1101:15118:25288 99 10 99151971 60 151M = 99152350 440
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--A-FAJA-F<J7A--AFJ7AJ7AJ-FJ7-JJJ-F-J---7J---7FF-JAJJ<A7JFAFAA7--FF----AF-7<JF<JFA-7<F-FF-J MC:Z:90S61M MD:Z:74T10T3A37T23 PG:Z:bwamem RG:Z:H0164.2 NM:i:4 MQ:i:60 UQ:i:48 AS:i:131 XS:i:40
H0164ALXX140820:2:1101:15118:25288 147 10 99152350 60 90S61M = 99151971 -440
CTACTCCTTGCTACTTTTTTATTTTACGGCGACCTCCGGTATCTACATTTTTGTTAACCCTCTCTCCCTACACGAGGCGCTTCCGATCTACCTGCCTCAGCCTCCAAAAGTGCTGGGATTACAGGTATGAGCCACTGCGTCCAGCCATGAC
#######AF--J--JF-J-7-J----<<7-A--J-J-J---JFF--F--77-<F7J------J-AJJA7--7J-<--F7JF-<J<JJJ--JJ7JJJFJJJJJJJJAJJJJF<JFJJJJFJJJJJJJJFJ<JJJJFJJJJJAJJJJJFFFAA MC:Z:151M MD:Z:61 PG:Z:bwamem RG:Z:H0164.2 NM:i:0 MQ:i:60 UQ:i:0 AS:i:61 XS:i:50</code class="pre_md"></pre>
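<p>To tally the marked reads yourself (the counts reported in the next paragraph), a one-liner like the following can be used. This is a sketch that assumes samtools is installed; it simply greps the alignment records for the XT adapter-marking tag.</p>

```shell
# Count reads carrying an XT:i: adapter-marking tag.
# snippet_revertmark.bam is the tutorial's output from step 2.
samtools view snippet_revertmark.bam | grep -c 'XT:i:'
```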

<p>Snippet_revertmark.bam marks 5,810 reads (0.21%) with XT, while Solexa-272222_revertmark.bam marks 3,236,552 reads (0.39%). We plot the metrics data using RStudio.
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/6e/2da8652875645713c23e45fddef790.png" height="230" border="9"/> <img src="https://us.v-cdn.net/5019796/uploads/FileUpload/a1/59e1837963fe37d0577d466f3c56b2.png" height="230" border="9" /></p>

<p><a href="#top">back to top</a></p>

<hr />

<p><a name="step3"></a></p>

<h3>3. Convert uBAM to FASTQ using SamToFastq</h3>

<p>Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove adapter sequences previously marked with the XT tag. All extant meta data, i.e. alignment information, flags and tags, are purged in this transformation.</p>

<pre><code class="pre_md">java -Xmx8G -jar /path/picard.jar SamToFastq \
I=snippet_revertmark.bam \
FASTQ=snippet_XT_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true \
TMP_DIR=/path/shlee # optional to process large files</code class="pre_md"></pre>

<ul>
<li>
<p>By specifying <code>CLIPPING_ATTRIBUTE</code>=XT and <code>CLIPPING_ACTION</code>=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score on the Phred scale. This effectively removes the adapter portion of sequences from contributing to read alignment and alignment scoring metrics. This reassignment is temporary, as we will restore the original base quality scores after alignment in step 7.</p>
</li>
<li>
<p>For our paired reads sample we set SamToFastq's <code>INTERLEAVE</code> to true. During the conversion to FASTQ format, the query names of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file.</p>

<p>The <a href="http://bio-bwa.sourceforge.net/bwa.shtml">BWA aligner</a> accepts interleaved FASTQ files given the <code>-p</code> option. This option indicates that the i-th and the (i+1)-th reads constitute a read pair.</p>
</li>
<li>We change the <code>NON_PF</code>, aka <code>INCLUDE_NON_PF_READS</code>, option from default to true. SamToFastq will then retain reads marked by the 0x200 flag bit, <a href="https://github.com/samtools/hts-specs/issues/85">which some consider archaic</a>, denoting reads that do not pass quality controls. These reads are also known as failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.</li>
</ul>
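<p>As a quick sanity check of the encoding at work here: a Phred score of 2 under the Phred+33 convention is ASCII 35, the <code>#</code> character that fills the clipped portions of the FASTQ records shown in step 2's example.</p>

```shell
# Phred quality 2 encoded as Phred+33 ASCII: 2 + 33 = 35, i.e. '#'.
awk 'BEGIN { printf "%c\n", 2 + 33 }'
```

This prints a single `#`, matching the runs of `#` symbols in the clipped quality strings.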

<h4>[Optional] Compress the FASTQ using gzip</h4>

<p>This step is optional, and irrelevant if you pipe steps 3 and 4 as we outline in step 5.</p>

<p>BWA handles both FASTQ and gzipped FASTQ files natively--that is, BWA works on both file types directly. To compress, use the UNIX gzip utility.</p>

<pre><code class="pre_md">gzip snippet_XT_interleaved.fq #replaces the file with snippet_XT_interleaved.fq.gz</code class="pre_md"></pre>

<p><a href="#top">back to top</a></p>

<hr />

<p><a name="step4"></a></p>

<h3>4. Align reads and flag secondary hits using BWA MEM</h3>

<p>GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA MEM) algorithm (<a href="http://arxiv.org/abs/1303.3997">Li 2013 reference</a>; <a href="http://bioinformatics.oxfordjournals.org/content/30/20/2843.long">Li 2014 benchmarks</a>; <a href="http://bio-bwa.sourceforge.net/">homepage</a>; <a href="http://bio-bwa.sourceforge.net/bwa.shtml">manual</a>). BWA MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.</p>

<ul>
<li>We use BWA v 0.7.7.r441, the same aligner used by the Broad's Genomics Platform as of this writing (9/2015).</li>
<li>Alignment is a compute-intensive process. For faster processing, use a reference genome with decoy sequences, also called a <a href="http://www.cureffi.org/2013/02/01/the-decoy-genome/">decoy genome</a>. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Epstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference, which the aligner would otherwise spend an inordinate amount of time trying to align as split reads. <a href="https://www.broadinstitute.org/gatk/guide/article.php?id=1213">GATK's resource bundle</a> provides a standard decoy genome from the <a href="http://www.1000genomes.org/">1000 Genomes Project</a>.</li>
<li>Aligning our <code>snippet</code> reads from a genomic interval against either a portion or the whole genome is not equivalent to aligning our entire file and taking a new <code>slice</code> from the same genomic interval.</li>
</ul>

<p><strong>Index the reference genome file for BWA.</strong> Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's <code>index</code> function to the reference genome file, e.g. <code>human_g1k_v37_decoy.fasta</code>. This produces five index files with the extensions <code>amb</code>, <code>ann</code>, <code>bwt</code>, <code>pac</code> and <code>sa</code>.</p>

<pre><code class="pre_md">bwa index -a bwtsw human_g1k_v37_decoy.fasta</code class="pre_md"></pre>

<p><strong>Align using BWA MEM.</strong> The tool automatically locates the index files within the same folder as the reference FASTA file. In the alignment command, <code>></code> redirects the alignment output to the named file.</p>
|
||||
<ul>
|
||||
<li>The aligned file is in SAM format even if given a BAM extension and retains the sort order of the FASTQ file. Thus, our aligned tutorial file remains sorted by query name. </li>
|
||||
<li>
|
||||
<p>BWA automatically creates a program group record (@PG) in the header that gives the ID, group name, group version, and command line information. </p>
|
||||
<pre><code class="pre_md">/path/bwa mem -M -t 7 -p \
/path/Homo_sapiens_assembly19.fasta \ #reference genome
Solexa-272222_interleavedXT.fq > Solexa-272222_markXT_aln.sam</code class="pre_md"></pre>
|
||||
</li>
|
||||
</ul>
|
||||
<p>We invoke three options in the command. </p>
|
||||
<ul>
|
||||
<li><code>-M</code> to flag shorter split hits as secondary.
|
||||
This is optional for Picard compatibility. However, if we want MergeBamAlignment to reassign proper pair alignments, we need to mark secondary alignments. </li>
|
||||
<li><code>-p</code> to indicate the given file contains interleaved paired reads.</li>
|
||||
<li>
|
||||
<p><code>-t</code> followed by the number of processor threads to use concurrently. Check your server or system's total number of threads with the following command.</p>
|
||||
<pre><code class="pre_md">getconf _NPROCESSORS_ONLN #thanks Kate</code class="pre_md"></pre>
|
||||
</li>
|
||||
</ul>
|
||||
<p>MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, the point of this workflow is to take advantage of the features offered by MergeBamAlignment that allow for the scalable, <em>lossless</em> operating procedure practiced by Broad's Genomics Platform and to produce comparable metrics.</p>
|
||||
<p><a href="#top">back to top</a></p>
|
||||
<hr />
|
||||
<p><a name="step5"></a></p>
|
||||
<h3>5. [Optional] Pipe steps 3 & 4 and collect alignment metrics</h3>
|
||||
<p><strong>Piping processes saves time and space.</strong> Our tutorial's resulting SAM file is small enough to easily view, manipulate and store. For larger data, however, consider using <a href="https://en.wikipedia.org/wiki/Pipeline_(Unix)">Unix pipelines</a>. Piping allows streaming data in the processor's input-output (I/O) device directly to the next process for efficient processing and storage. We recommend piping steps 3 and 4 so as to avoid rereading and storing the large intermediate FASTQ file. </p>
|
||||
<p>You may additionally extend piping to include step 6's SortSam. Steps 3-4-6 are piped in the example command below to generate an aligned BAM file and index. [For the larger file, I couldn't pipe Step 7's MergeBamAlignment.]</p>
|
||||
<pre><code class="pre_md">#overview of command structure
|
||||
[step 3's SamToFastq] | [step 4's bwa mem] | [step 6's SortSam]
|
||||
|
||||
#for our file
|
||||
java -Xmx8G -jar /path/picard.jar SamToFastq I=snippet_revertmark.bam \
|
||||
FASTQ=/dev/stdout \
|
||||
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
|
||||
TMP_DIR=/path/shlee | \
|
||||
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta \
|
||||
/dev/stdin | \ #to stop piping here, add '> snippet_piped.sam'
|
||||
java -Xmx8G -jar /path/picard.jar SortSam \
|
||||
INPUT=/dev/stdin \
|
||||
OUTPUT=snippet_piped.bam \
|
||||
SORT_ORDER=coordinate CREATE_INDEX=true \
|
||||
TMP_DIR=/path/shlee</code class="pre_md"></pre>
|
||||
<p><strong>Calculate alignment metrics using Picard tools.</strong> Picard offers a variety of metrics collecting tools, e.g. <a href="https://broadinstitute.github.io/picard/command-line-overview.html#CollectAlignmentSummaryMetrics">CollectAlignmentSummaryMetrics</a>, <a href="http://broadinstitute.github.io/picard/command-line-overview.html#CollectWgsMetrics">CollectWgsMetrics</a> and <a href="http://broadinstitute.github.io/picard/command-line-overview.html#CollectInsertSizeMetrics">CollectInsertSizeMetrics</a>. Some tools give more detailed metrics if given the reference sequence. See <a href="https://broadinstitute.github.io/picard/picard-metric-definitions.html">Picard for metrics definitions</a>. Metrics calculations will differ if run on the BAM directly from alignment (BWA) versus on the merged BAM (MergeBamAlignment). See [link--get from G] for guidelines on when to run tools. </p>
|
||||
<pre><code class="pre_md">java -Xmx8G -jar /path/picard.jar CollectAlignmentSummaryMetrics \
|
||||
R=/path/Homo_sapiens_assembly19.fasta \
|
||||
INPUT=slice.bam \
|
||||
OUTPUT=slice_bam_metrics.txt \
|
||||
TMP_DIR=/path/shlee # optional to process large files</code class="pre_md"></pre>
|
||||
<p>For example, percent chimeras is a calculated metric. Our tutorial alignment of the whole data set gives 0.019% (BWA) or 0.0034% (MergeBamAlignment) chimeric paired reads. The genomic interval defined in step 1 reports 0.0032% chimeric paired reads. In contrast, the aligned <code>snippet</code> gives 0.0012% (BWA) or 0.00002% (MergeBamAlignment) chimeric paired reads. This illustrates in part the differences I alluded to at the beginning of step 4.</p>
|
||||
<p><a href="#top">back to top</a></p>
|
||||
<hr />
|
||||
<p><a name="step6"></a></p>
|
||||
<h3>6. [Optional] Sort, index and convert alignment to a BAM using SortSam and visualize on IGV</h3>
|
||||
<p><strong>Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#SortSam">SortSam</a> sorts, indexes and converts between SAM and BAM formats.</strong> For file manipulations and to view aligned reads using the <a href="http://www.broadinstitute.org/igv/">Integrative Genomics Viewer (IGV)</a>, the SAM or BAM file must be coordinate-sorted and indexed. Some Picard tools, such as MergeBamAlignment in step 7, by default coordinate sort and can use the standard <code>CREATE_INDEX</code> option. If you didn't create an index in step 7, or want to convert to BAM and index the alignment file from step 4, then use Picard's SortSam. The index file will have a <code>bai</code> extension; note that indexing requires a coordinate-sorted BAM.</p>
|
||||
<pre><code class="pre_md">java -Xmx8G -jar /path/picard.jar SortSam \
|
||||
INPUT=Solexa-272222_markXT_aln.sam \
|
||||
OUTPUT=Solexa-272222_markXT_aln.bam \ #extension here specifies format conversion
|
||||
SORT_ORDER=coordinate \
|
||||
CREATE_INDEX=true \ # a standard option for Picard commands
|
||||
TMP_DIR=/path/shlee # optional to process large files</code class="pre_md"></pre>
|
||||
<p><strong>View aligned reads using the <a href="http://www.broadinstitute.org/igv/">Integrative Genomics Viewer (IGV)</a>.</strong> Of the multiple IGV versions, the Java Web Start <code>jnlp</code> version allows the highest memory, as of this writing 10 GB for machines with 64-bit Java. </p>
|
||||
<ul>
|
||||
<li>To run the <code>jnlp</code> version of IGV, you may need to adjust your system's <em>Java Control Panel</em> settings, e.g. enable Java content in the browser. Also, when first opening the <code>jnlp</code>, overcome Mac OS X's gatekeeper function by right-clicking the saved <code>jnlp</code> and selecting <em>Open</em> <em>with Java Web Start</em>. </li>
|
||||
<li>Load the appropriate reference genome. For our tutorial this is <em>Human (b37)</em>. </li>
|
||||
<li>Go to <em>View</em>><em>Preferences</em> and make sure the settings under the <em>Alignments</em> tab allow you to view reads of interest, e.g. duplicate reads. Default settings are tuned to genomic sequence libraries. Right-click on a track to access a menu of additional viewing modes. See <a href="http://www.broadinstitute.org/igv/AlignmentData">Viewing Alignments</a> in IGV documentation for details.</li>
|
||||
<li>Go to <em>File</em>><em>Load from</em> and either load alignments from <em>File</em> or <em>URL</em>. </li>
|
||||
</ul>
|
||||
<p>Here, IGV displays our example chimeric pair, <code>H0164ALXX140820:2:1101:10003:23460</code> at its alignment loci. BWA's secondary alignment designation causes the mates on chromosome 10 to display as unpaired in IGV's paired view. MergeBamAlignment corrects for this when it switches the secondary alignment designation. Mates display as paired on chromosome 10. </p>
|
||||
<p>Visualizing alignments in such a manner makes apparent certain convergent information. For example, we see that the chimeric region on chromosome 2 is a low complexity GC-rich region, apparent by the predominantly yellow coloring (representing guanine) of the reference region. We know there are many multimapping reads because reads with MAPQ score of zero are filled in white versus gray, and the region is down-sampled, as indicated by the underscoring in the log-scaled coverage chart. We can infer reads in this chromosome 2 region are poorly mapped based on the region's low complexity, depth of reads and prevalence of low MAPQ reads. </p>
|
||||
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/a4/d4b30ab2ad0b6595539ee25e115ffd.png" border="9"/>
|
||||
<p><a href="#top">back to top</a></p>
|
||||
<hr />
|
||||
<p><a name="step7"></a></p>
|
||||
<h3>7. Restore altered data and apply & adjust meta information using MergeBamAlignment</h3>
|
||||
<p>Our alignment file lacks read group information and certain tags, such as the mate CIGAR (MC) tag. It has hard-clipped sequences and altered base qualities. The alignment also has some mapping artifacts we would like to correct for congruent accounting. Finally, the alignment records require coordinate sorting and indexing. </p>
|
||||
<p>We use Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#MergeBamAlignment">MergeBamAlignment</a> to address all of these needs to produce a <em>raw</em> BAM file that is ready for GATK's variant discovery workflow. MergeBamAlignment takes metadata from a SAM or BAM file of unmapped reads (uBAM) and merges it with a SAM or BAM file containing alignment records for a <em>subset</em> of those reads. Metadata include read group information, read sequences, base quality scores and tags. The tool applies read group information from the uBAM and retains the program group information from the aligned file. In restoring original sequences, MergeBamAlignment adjusts CIGAR strings from hard-clipped to soft-clipped. The tool adjusts <a href="https://broadinstitute.github.io/picard/explain-flags.html">flag</a> values, e.g. changes primary alignment designations according to a user-specified strategy, for desired congruency. Optional parameters allow introduction of additional metadata, e.g. user-specified program group information or nonstandard aligner-generated tags. If the alignment file is missing reads present in the unaligned file, these are retained as unaligned records. Finally, alignment records are coordinate sorted, meaning they are ordered by chromosomal mapping position.</p>
|
||||
<ul>
|
||||
<li>To simply edit read group information, see Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups">AddOrReplaceReadGroups</a>. To simply concatenate read records into one file, use Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#GatherBamFiles">GatherBamFiles</a>. An advantage of using MergeBamAlignment over AddOrReplaceReadGroups is the ability to transfer mixed read groups to reads in a single file.</li>
|
||||
<li>Consider what <code>PRIMARY_ALIGNMENT_STRATEGY</code> option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. </li>
|
||||
<li>MergeBamAlignment retains secondary alignments with the <code>INCLUDE_SECONDARY_ALIGNMENTS</code> parameter. It may be that alignments marked as secondary are truer to biology or at least reveal useful insight.</li>
|
||||
</ul>
|
||||
<p>A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the <code>-M</code> option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and other flags, e.g. read mapped in proper pair and mate unmapped flags, to fix mapping artifacts. We only note some changes made by MergeBamAlignment to our tutorial data and by no means comprehensively list its features.</p>
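The primary/secondary/supplementary designations described above live in the SAM flag field as individual bits, so any flag value can be decoded by bit tests. A minimal POSIX shell sketch (flag names follow the SAM specification; no aligner needed):

```shell
# Decode a SAM flag integer into its component bits (subset of the SAM spec).
decode_flag() {
  f=$1; out=""
  [ $((f & 1)) -ne 0 ]    && out="$out PAIRED"
  [ $((f & 2)) -ne 0 ]    && out="$out PROPER_PAIR"
  [ $((f & 8)) -ne 0 ]    && out="$out MUNMAP"
  [ $((f & 16)) -ne 0 ]   && out="$out REVERSE"
  [ $((f & 32)) -ne 0 ]   && out="$out MREVERSE"
  [ $((f & 64)) -ne 0 ]   && out="$out READ1"
  [ $((f & 128)) -ne 0 ]  && out="$out READ2"
  [ $((f & 256)) -ne 0 ]  && out="$out SECONDARY"
  [ $((f & 2048)) -ne 0 ] && out="$out SUPPLEMENTARY"
  echo "$f:$out"
}
decode_flag 417   # -> 417: PAIRED MREVERSE READ2 SECONDARY
decode_flag 163   # -> 163: PAIRED PROPER_PAIR MREVERSE READ2
```

Flag 417 is the secondary designation BWA's <code>-M</code> assigns to the shorter split hit; 163 is a primary, properly paired record.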
|
||||
<pre><code class="pre_md">java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
|
||||
R=/path/Homo_sapiens_assembly19.fasta \
|
||||
UNMAPPED_BAM=Solexa-272222_revertclean.bam \
|
||||
ALIGNED_BAM=Solexa-272222_markXT_aln.sam \
|
||||
O=Solexa-272222_merge_IGV_raw.bam \ #output file name in SAM or BAM format
|
||||
CREATE_INDEX=true \ #standard option for any Picard command
|
||||
ADD_MATE_CIGAR=true \ #default; adds MC tag
|
||||
CLIP_ADAPTERS=false \ #changed from default
|
||||
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not overlap
|
||||
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
|
||||
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
|
||||
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
|
||||
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain alignment tags starting with X, Y, or Z
|
||||
TMP_DIR=/path/shlee #optional to process large files</code class="pre_md"></pre>
|
||||
<p>You need not invoke <code>PROGRAM</code> options as BWA's program group information is sufficient and transfers from the alignment during merging. If, for whatever reason, you need to apply program group information by a different means, then use MergeBamAlignment to assign each of the following program group options. Example information is given. </p>
|
||||
<pre><code class="pre_md"> PROGRAM_RECORD_ID=bwa \
|
||||
PROGRAM_GROUP_NAME=bwamem \
|
||||
PROGRAM_GROUP_VERSION=0.7.7-r441 \
|
||||
PROGRAM_GROUP_COMMAND_LINE='/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta Solexa-272222_interleavedXT.fq > Solexa-272222_markXT_aln.sam' \ </code class="pre_md"></pre>
|
||||
<p>In the command, we change <code>CLIP_ADAPTERS</code>, <code>MAX_INSERTIONS_OR_DELETIONS</code> and <code>PRIMARY_ALIGNMENT_STRATEGY</code> values from default, and invoke other optional parameters.</p>
|
||||
<ul>
|
||||
<li>The path to the reference FASTA given by <code>R</code> should also contain the corresponding sequence dictionary with the same base name and extension <code>.dict</code>. Create a sequence dictionary using Picard's <a href="http://broadinstitute.github.io/picard/command-line-overview.html#CreateSequenceDictionary">CreateSequenceDictionary</a>.</li>
|
||||
<li><code>CLIP_ADAPTERS</code>=false leaves reads unclipped.</li>
|
||||
<li><code>MAX_INSERTIONS_OR_DELETIONS</code>=-1 retains reads regardless of the number of insertions and deletions. Default is 1.</li>
|
||||
<li><code>PRIMARY_ALIGNMENT_STRATEGY</code>=MostDistant marks primary alignments based on the alignment <em>pair</em> with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.</li>
|
||||
<li><code>ATTRIBUTES_TO_RETAIN</code> is specified to carry over the XS tag from the alignment, which for BWA reports suboptimal alignment scores. The XS tag is not necessary for our workflow. We retain it to illustrate that the tool only carries over select alignment information unless specified otherwise. For our tutorial data, this is the only additional unaccounted tag from the alignment.</li>
|
||||
<li>Because we have left the <code>ALIGNER_PROPER_PAIR_FLAGS</code> parameter at the default false value, MergeBamAlignment may reassign <em>proper pair</em> designations made by the aligner. </li>
|
||||
<li>By default the merged file is coordinate sorted. We set <code>CREATE_INDEX</code>=true to additionally create the <code>bai</code> index.</li>
|
||||
</ul>
|
||||
<p>Original base quality score restoration is illustrated in Step 3. The following example shows a read pair for which MergeBamAlignment adjusts multiple other information fields. The query name is listed thrice because we have paired reads where one of the reads has two alignment loci, on chromosome 2 and on chromosome 10. The mate is mapped with high MAPQ to chromosome 10. The two loci align 69 and 60 nucleotide regions, respectively, and the aligned regions coincide by 15 bases. A good portion of the chromosome 2 aligned region has low base quality scores. The <code>NM</code> tag indicates that the chromosome 2 alignment requires one change to match the reference, while the chromosome 10 read requires two changes and this is also reflected in the <code>MD</code> tags that provide the mismatching positions. When tallying alignment scores, given by the <code>AS</code> tag, aligners penalize mismatching positions, here apparently by five points per mismatch, e.g. 60 matches minus two mismatches multiplied by five gives an alignment score of 50. Both read records have values for the <code>XS</code> (suboptimal alignment score) and <code>SA</code> (chimeric alignment) tags that indicate a split read. Flag values, set by BWA, indicate the chromosome 2 record is primary and the chromosome 10 record is secondary. </p>
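The alignment score arithmetic above can be checked directly, assuming the scoring inferred in the text of one point per matched base and a five-point penalty per mismatch:

```shell
# AS tag arithmetic for the chromosome 10 record: 60 aligned bases,
# 2 mismatches, ~5-point mismatch penalty (scoring inferred, not from the BWA docs).
matches=60
mismatches=2
penalty=5
echo $((matches - mismatches * penalty))   # -> 50, matching AS:i:50
```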
|
||||
<pre><code class="pre_md">#aligned reads from step 4
|
||||
H0164ALXX140820:2:1101:10003:23460 177 2 33141435 0 37S69M45S 10 91515318 0
|
||||
GGGTGGGAGGGGGGGAGAGAGGGGTGGGAGAGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAAGGAAAGGAGGGAGGGAGGGAGCAAGGAAGGAAGGAAGGAAAGA ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< NM:i:1 MD:Z:51G17 AS:i:64 XS:i:64 SA:Z:10,91515130,+,60M91S,0,2;
|
||||
|
||||
H0164ALXX140820:2:1101:10003:23460 417 10 91515130 0 60M91H = 91515318 339
|
||||
TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF NM:i:2 MD:Z:48T4T6 AS:i:50 XS:i:36 SA:Z:2,33141435,-,37S69M45S,0,1;
|
||||
|
||||
H0164ALXX140820:2:1101:10003:23460 113 10 91515318 60 151M 2 33141435 0
|
||||
CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA <FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA NM:i:0 MD:Z:151 AS:i:151 XS:i:0
|
||||
|
||||
#after merging (step 7)
|
||||
H0164ALXX140820:2:1101:10003:23460 409 2 33141435 0 37S69M45S = 33141435 0
|
||||
GGGTGGGAGGGGGGGAGAGAGGGGTGGGAGAGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAAGGAAAGGAGGGAGGGAGGGAGCAAGGAAGGAAGGAAGGAAAGA ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< SA:Z:10,91515130,+,60M91S,0,2; MD:Z:51G17 PG:Z:bwamem RG:Z:H0164.2 NM:i:1 UQ:i:2 AS:i:64 XS:i:64
|
||||
|
||||
H0164ALXX140820:2:1101:10003:23460 163 10 91515130 0 60M91S = 91515318 339
|
||||
TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< SA:Z:2,33141435,-,37S69M45S,0,1; MC:Z:151M MD:Z:48T4T6 PG:Z:bwamem RG:Z:H0164.2 NM:i:2 MQ:i:60 UQ:i:4 AS:i:50 XS:i:36
|
||||
|
||||
H0164ALXX140820:2:1101:10003:23460 83 10 91515318 60 151M = 91515130 -339
|
||||
CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA <FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA MC:Z:60M91S MD:Z:151 PG:Z:bwamem RG:Z:H0164.2 NM:i:0 MQ:i:0 UQ:i:0 AS:i:151 XS:i:0</code class="pre_md"></pre>
|
||||
<ul>
|
||||
<li>For the read with two alignments, the aligner hard-clipped the alignment on chromosome 10 giving a CIGAR string of 60M91H and a truncated read sequence. MergeBamAlignment restores this chromosome 10 alignment with a full read sequence and adjusts the CIGAR string to 60M91S, which soft-clips the previously hard-clipped region without loss of alignment specificity. </li>
|
||||
<li>Both chromosome 2 and chromosome 10 alignments have zero mapping qualities to indicate multiple equally likely mappings. The similar alignment scores of 64 and 50, given by the <code>AS</code> tag, contribute in part to this ambiguity. Additionally, because we asked the aligner to flag shorter split reads as secondary, with the <code>-M</code> option, it assigned a <code>417</code> <a href="https://broadinstitute.github.io/picard/explain-flags.html">flag</a> to the shorter split chromosome 10 alignment. This makes the chromosome 2 alignment for this read the primary alignment. We set our <code>PRIMARY_ALIGNMENT_STRATEGY</code> to MostDistant which asks the tool to consider the best <em>pair</em> to mark as primary from the primary and secondary records. MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment (<code>163</code> flag) and the chromosome 2 mapping as secondary (<code>409</code> flag). </li>
|
||||
<li>MergeBamAlignment updates read group <code>RG</code> information, program group <code>PG</code> information and mate CIGAR <code>MC</code> tags as specified by our command for reads and in the header section. The tool retains <code>SA</code>, <code>MD</code>, <code>NM</code> and <code>AS</code> tags from the alignment, given these are not present in the uBAM. The tool additionally adds <code>UQ</code> (the Phred likelihood of the segment) and <code>MQ</code> (mapping quality of the mate/next segment) tags if applicable. The following table summarizes changes to our tutorial data's tags during the workflow.</li>
|
||||
</ul>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align: center;">original</th>
|
||||
<th style="text-align: center;">RevertSam</th>
|
||||
<th style="text-align: center;">BWA MEM</th>
|
||||
<th style="text-align: center;">MergeBamAlignment</th>
|
||||
<th></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align: center;">RG</td>
|
||||
<td style="text-align: center;">RG</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">RG</td>
|
||||
<td>read group</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">PG</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">PG</td>
|
||||
<td style="text-align: center;">PG</td>
|
||||
<td>program group</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">OC</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td>original cigar</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">XN</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td># of ambiguous bases in ref</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">OP</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td>original mapping position</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">SA</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">SA</td>
|
||||
<td style="text-align: center;">SA</td>
|
||||
<td>chimeric alignment</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">MD</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">MD</td>
|
||||
<td style="text-align: center;">MD</td>
|
||||
<td>string for mismatching positions</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">NM</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">NM</td>
|
||||
<td style="text-align: center;">NM</td>
|
||||
<td># of mismatches</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">AS</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">AS</td>
|
||||
<td style="text-align: center;">AS</td>
|
||||
<td>alignment score</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">UQ</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">UQ</td>
|
||||
<td>Phred likelihood of the segment</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">MC</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">MC</td>
|
||||
<td>CIGAR string for mate</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">MQ</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">MQ</td>
|
||||
<td>mapping quality of the mate</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">OQ</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td>original base quality</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;">XT</td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td>tool specific</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;"></td>
|
||||
<td style="text-align: center;">XS</td>
|
||||
<td style="text-align: center;">XS</td>
|
||||
<td>BWA's secondary alignment score</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<ul>
|
||||
<li>In our example, we retained the aligner generated <code>XS</code> tag, for secondary alignment scores, with the <code>ATTRIBUTES_TO_RETAIN</code> option.</li>
|
||||
</ul>
|
||||
<p>After merging our whole tutorial file, our unmapped read records increase by 620, from 5,334,323 to 5,334,943, due to changes in flag designations and not because any reads failed to map. Our total read records remain the same at 828,846,200 for our 819,728,254 original reads, giving ~1.11% multi-record reads.</p>
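The multi-record percentage can be reproduced from the counts above:

```shell
# Fraction of alignment records beyond one per read, from the tutorial counts.
total_records=828846200
original_reads=819728254
extra=$((total_records - original_reads))
awk -v e="$extra" -v r="$original_reads" 'BEGIN { printf "%.2f%%\n", 100 * e / r }'   # -> 1.11%
```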
|
||||
<p><a href="#top">back to top</a></p>
|
||||
<hr />
|
||||
|
|
@ -0,0 +1,28 @@
|
|||
## (howto) Set up remote debugging in IntelliJ
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/4712/howto-set-up-remote-debugging-in-intellij
|
||||
|
||||
<p>Remote debugging is a powerful tool but requires a little bit of setup. Here is the 3-step process to an easier life.</p>
|
||||
<h3>1. Set up the remote config in IntelliJ</h3>
|
||||
<p>Do the following in IntelliJ: </p>
|
||||
<ul>
|
||||
<li>
|
||||
<p>Run -> Edit Configurations -> Add new configuration (+ symbol top left) -> Remote</p>
|
||||
</li>
|
||||
<li>Fill in the appropriate host (<code>gsa[_machine#_].broadinstitute.org</code>) and port number (<em>xxxxx</em>), where <em>xxxxx</em> is a 5-digit port number you make up to avoid accidentally connecting to someone else's debug session. Press OK. Add breakpoint(s) where you want them in the code.</li>
|
||||
</ul>
|
||||
<h3>2. Run the tool on gsa machine</h3>
|
||||
<p>Run the GATK command from the server with </p>
|
||||
<pre>
|
||||
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=<i>5-digit_port_number</i> \
|
||||
-jar <i>_toolName_</i> \
|
||||
<i>args</i>
|
||||
</pre>
|
||||
<p>GATK will wait for IntelliJ to attach before it actually starts running.</p>
|
||||
<h3>3. Chase bug(s) in IntelliJ</h3>
|
||||
<p>Go to IntelliJ</p>
|
||||
<ul>
|
||||
<li>Run -> Debug -> Select the configuration you just created.</li>
|
||||
</ul>
|
||||
<p>Now chase.</p>
|
||||
<p>You can also add the <code>agentlib</code> business as an alias in your <code>.profile</code> or <code>.my.bashrc</code> on the server like I did. Boom.</p>
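Such an alias might look like the following in your shell rc file (the alias name and 5-digit port are made up; substitute your own):

```shell
# Hypothetical alias for launching a jar under the remote-debug agent;
# 50505 is an invented port number, pick your own.
alias jdebug='java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=50505 -jar'
```

With this in place, `jdebug toolName.jar args` replaces the longer command.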
|
||||
|
|
@ -0,0 +1,31 @@
|
|||
## (howto) Speed up GATK compilation
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/5784/howto-speed-up-gatk-compilation
|
||||
|
||||
<hr />
|
||||
<p>TL;DR: <code>mvn -Ddisable.shadepackage verify</code></p>
|
||||
<hr />
|
||||
<h3>Background</h3>
|
||||
<p>In addition to Queue's GATK-wrapper codegen, relatively slow Scala compilation, etc., there's still a lot of legacy compatibility from our <code>ant</code> days in the Maven scripts. Our <code>mvn verify</code> behaves more like when one runs <code>ant</code>, and builds <em>everything</em> needed to bundle the GATK.</p>
|
||||
<p>As of GATK 3.4, by default the build for the "protected" code generates jar files that contain every class needed for running, one for the GATK and one for Queue. This is done by the <a href="https://maven.apache.org/plugins/maven-shade-plugin/">Maven shade plugin</a>, and each is called the "package jar". But there's a way to generate a jar file that only contains <code>META-INF/MANIFEST.MF</code> pointers to the dependency jar files, instead of zipping/shading them up. Each of these is the "executable jar"; FYI, these are always generated since doing so takes seconds, not minutes.</p>
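To see the manifest-pointer idea concretely: an executable jar's manifest lists its dependencies on a <code>Class-Path</code> line instead of bundling their classes. A toy sketch constructing and reading back such a manifest (the jar names and main class are illustrative, not the actual GATK build output):

```shell
# Build a toy manifest like the one inside an "executable jar" and read back
# its Class-Path pointers; nothing here touches a real GATK build.
mkdir -p demo/META-INF
cat > demo/META-INF/MANIFEST.MF <<'EOF'
Manifest-Version: 1.0
Main-Class: org.example.Main
Class-Path: lib/htsjdk.jar lib/picard.jar
EOF
grep '^Class-Path' demo/META-INF/MANIFEST.MF   # -> Class-Path: lib/htsjdk.jar lib/picard.jar
```

This is why copying just the executable jar elsewhere breaks it: the relative Class-Path entries no longer resolve.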
|
||||
<hr />
|
||||
<h3>Instructions for fast compilation</h3>
|
||||
<p>While developing and recompiling Queue, disable the shaded jar with <code>-Ddisable.shadepackage</code>. Then run <code>java -jar target/executable/Queue.jar ...</code>. If you need to transfer this jar to another machine / directory, you can't copy (or rsync) just the jar; you'll need the entire executable directory.</p>
|
||||
<pre><code class="pre_md"># Total expected time, on a local disk, with Queue:
|
||||
# ~5.0 min from clean
|
||||
# ~1.5 min per recompile
|
||||
mvn -Ddisable.shadepackage verify
|
||||
|
||||
# always available
|
||||
java -jar target/executable/Queue.jar --help
|
||||
|
||||
# not found when shade disabled
|
||||
java -jar target/package/Queue.jar --help</code class="pre_md"></pre>
|
||||
<p>If one is only developing for the GATK, skip Queue by adding <code>-P\!queue</code> also.</p>
|
||||
<pre><code class="pre_md">mvn -Ddisable.shadepackage -P\!queue verify
|
||||
|
||||
# always available
|
||||
java -jar target/executable/GenomeAnalysisTK.jar --help
|
||||
|
||||
# not found when queue profile disabled
|
||||
java -jar target/executable/Queue.jar --help</code class="pre_md"></pre>
|
||||
|
|
@ -0,0 +1,49 @@
|
|||
## Accessing reads: AlignmentContext and ReadBackedPileup
http://gatkforums.broadinstitute.org/gatk/discussion/1322/accessing-reads-alignmentcontext-and-readbackedpileup

<h3>1. Introduction</h3>
<p>The AlignmentContext and ReadBackedPileup classes work together to provide the read data associated with a given locus. This section details the tools the GATK provides for working with collections of aligned reads.</p>
<h3>2. What are read backed pileups?</h3>
<p>Read backed pileups are objects that contain all of the reads, and their offsets, that "pile up" at a locus on the genome. They are the basic input data for the GATK LocusWalkers, and underlie most of the locus-based analysis tools such as the recalibrator and the SNP caller. Unfortunately, there are many ways to view this data, and version one grew unwieldy trying to support all of them. Version two of ReadBackedPileup presents a consistent and clean interface for working with pileup data, and also supports the <code>iterable()</code> interface to enable the convenient <code>for ( PileupElement p : pileup )</code> for-each loop.</p>
<h3>3. How do I get a ReadBackedPileup and/or how do I create one?</h3>
<p>The best way is simply to grab the pileup (the underlying representation of the locus data) from your <code>AlignmentContext</code> object in <code>map</code>:</p>
<pre><code class="pre_md">public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    ReadBackedPileup pileup = context.getPileup();
    ...
}</code class="pre_md"></pre>
<p>This aligns your calculations with the GATK core infrastructure, and avoids any unnecessary data copying from the engine to your walker.</p>
<h4>If you are trying to create your own, the best constructor is:</h4>
<pre><code class="pre_md">public ReadBackedPileup(GenomeLoc loc, ArrayList<PileupElement> pileup )</code class="pre_md"></pre>
<p>requiring only a list of PileupElements, in order of read / offset in the pileup.</p>
<h4>From List<SAMRecord> and List<Integer> offsets</h4>
<p>If you happen to have lists of SAMRecords and integer offsets into them, you can construct a <code>ReadBackedPileup</code> this way:</p>
<pre><code class="pre_md">public ReadBackedPileup(GenomeLoc loc, List<SAMRecord> reads, List<Integer> offsets )</code class="pre_md"></pre>
<h3>4. What's the best way to use them?</h3>
<h4>Best way if you just need reads, bases and quals</h4>
<pre><code class="pre_md">for ( PileupElement p : pileup ) {
    System.out.printf("%c %c %d%n", p.getBase(), p.getSecondBase(), p.getQual());
    // you can get the read itself too using p.getRead()
}</code class="pre_md"></pre>
<p>This is the most efficient way to get the data, and should be used whenever possible.</p>
<h4>I just want a vector of bases and quals</h4>
<p>You can use:</p>
<pre><code class="pre_md">public byte[] getBases()
public byte[] getSecondaryBases()
public byte[] getQuals()</code class="pre_md"></pre>
<p>to get the bases and quals as a <code>byte[]</code> array, which is the underlying base representation in the SAM-JDK.</p>
<h4>All I care about are counts of bases</h4>
<p>Use the following function to get counts of A, C, G, T in order:</p>
<pre><code class="pre_md">public int[] getBaseCounts()</code class="pre_md"></pre>
<p>which returns an <code>int[4]</code> vector with counts according to <code>BaseUtils.simpleBaseToBaseIndex</code> for each base.</p>
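<p>As a hedged, self-contained illustration of consuming such a vector (here <code>baseToIndex</code> is a stand-in for <code>BaseUtils.simpleBaseToBaseIndex</code>, whose real implementation is not shown in this document, and the <code>counts</code> array plays the role of <code>pileup.getBaseCounts()</code>):</p>

```java
// Sketch only: baseToIndex stands in for BaseUtils.simpleBaseToBaseIndex
// (mapping A, C, G, T to indices 0..3), and counts mimics getBaseCounts().
public class BaseCountsDemo {
    static int baseToIndex(final char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1; // ambiguous bases (e.g. N) are not counted
        }
    }

    public static void main(final String[] args) {
        final int[] counts = new int[4]; // A, C, G, T in order
        for (final char b : "AACGTTT".toCharArray()) {
            final int i = baseToIndex(b);
            if (i >= 0) counts[i]++;
        }
        System.out.printf("A=%d C=%d G=%d T=%d%n",
                counts[0], counts[1], counts[2], counts[3]);
    }
}
```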
<h4>Can I view just the reads for a given sample, read group, or any other arbitrary filter?</h4>
<p>The GATK can stratify pileups by sample very efficiently, and less efficiently stratify by read group, strand, mapping quality, base quality, or any arbitrary filter function. The sample-specific functions can be called as follows:</p>
<pre><code class="pre_md">pileup.getSamples();
pileup.getPileupForSample(String sampleName);</code class="pre_md"></pre>
<p>In addition to the rich set of filtering primitives built into <code>ReadBackedPileup</code>, you can supply your own primitives by implementing a PileupElementFilter:</p>
<pre><code class="pre_md">public interface PileupElementFilter {
    public boolean allow(final PileupElement pileupElement);
}</code class="pre_md"></pre>
<p>and passing it to <code>ReadBackedPileup</code>'s generic filter function:</p>
<pre><code class="pre_md">public ReadBackedPileup getFilteredPileup(PileupElementFilter filter);</code class="pre_md"></pre>
<p>See <code>ReadBackedPileup</code>'s java documentation for a complete list of built-in filtering primitives.</p>
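<p>A hedged, self-contained sketch of the idea (the real GATK classes are not reproduced here, so <code>MiniElement</code> stands in for <code>PileupElement</code> and <code>applyFilter</code> for <code>getFilteredPileup</code>; only the filter interface shape matches the document above):</p>

```java
// Illustrative only: MiniElement and applyFilter are stand-ins for the
// real PileupElement and ReadBackedPileup.getFilteredPileup.
import java.util.ArrayList;
import java.util.List;

public class FilterDemo {
    interface PileupElementFilter {
        boolean allow(final MiniElement pileupElement);
    }

    static class MiniElement {
        final int qual;
        MiniElement(final int qual) { this.qual = qual; }
        int getQual() { return qual; }
    }

    // stand-in for getFilteredPileup: keep only elements the filter allows
    static List<MiniElement> applyFilter(final List<MiniElement> pileup,
                                         final PileupElementFilter filter) {
        final List<MiniElement> kept = new ArrayList<>();
        for (final MiniElement e : pileup) {
            if (filter.allow(e)) kept.add(e);
        }
        return kept;
    }

    public static void main(final String[] args) {
        final List<MiniElement> pileup = new ArrayList<>();
        pileup.add(new MiniElement(10));
        pileup.add(new MiniElement(30));
        pileup.add(new MiniElement(40));

        // keep only bases with quality >= 20
        final PileupElementFilter highQual = e -> e.getQual() >= 20;
        System.out.println(applyFilter(pileup, highQual).size()); // prints 2
    }
}
```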
<h4>Historical: StratifiedAlignmentContext</h4>
<p>While <code>ReadBackedPileup</code> is the preferred mechanism for accessing aligned reads, some walkers still use <code>StratifiedAlignmentContext</code> to carve up selections of reads. If you find functions that you require in <code>StratifiedAlignmentContext</code> that have no analog in <code>ReadBackedPileup</code>, please let us know and we'll port the required functions for you.</p>
## Adding and updating dependencies [RETIRED]
http://gatkforums.broadinstitute.org/gatk/discussion/1352/adding-and-updating-dependencies-retired

<h2>Adding Third-party Dependencies</h2>
<p>The GATK build system uses the <a href="http://ant.apache.org/ivy/">Ivy dependency manager</a> to make it easy for our users to add additional dependencies. Ivy can pull the latest jars and their dependencies from the <a href="http://mvnrepository.com">Maven repository</a>, making adding or updating a dependency as simple as adding a new line to the <code>ivy.xml</code> file.</p>
<p>If your tool is available in the Maven repository, add a line to the <code>ivy.xml</code> file similar to the following:</p>
<pre><code class="pre_md"><dependency org="junit" name="junit" rev="4.4" /></code class="pre_md"></pre>
<p>If you would like to add a dependency on a tool that is not available in the Maven repository, please email <a href="mailto:gsahelp@broadinstitute.org">gsahelp@broadinstitute.org</a>.</p>
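<p>For context, a minimal sketch of where such a line sits inside an <code>ivy.xml</code> file. The organisation and module names below are illustrative, not taken from the GATK's actual <code>ivy.xml</code>:</p>

```xml
<ivy-module version="2.0">
    <info organisation="org.example" module="mytool"/>
    <dependencies>
        <!-- each dependency is one line: org, name, and revision -->
        <dependency org="junit" name="junit" rev="4.4" />
        <dependency org="log4j" name="log4j" rev="1.2.15" />
    </dependencies>
</ivy-module>
```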
<h2>Updating SAM-JDK and Picard</h2>
<p>Because we work so closely with the SAM-JDK/Picard team and are critically dependent on the code they produce, we have a special procedure for updating the SAM/Picard jars. Please use the following procedure when updating <code>sam-*.jar</code> or <code>picard-*.jar</code>.</p>
<ul>
<li>
<p>Download and build the latest versions of <a href="http://picard.svn.sourceforge.net/svnroot/picard/trunk/">Picard public</a> and <a href="https://svnrepos.broad.mit.edu/picard/trunk">Picard private</a> (restricted to Broad Institute users) from their respective svns.</p>
</li>
<li>
<p>Get the latest svn revisions for Picard public and Picard private by running the following commands:</p>
<pre><code class="pre_md">svn info $PICARD_PUBLIC_HOME | grep "Revision"
svn info $PICARD_PRIVATE_HOME | grep "Revision"</code class="pre_md"></pre>
</li>
</ul>
<h3>Updating the Picard public jars</h3>
<ul>
<li>
<p>Rename the jars and xmls in <code>$STING_HOME/settings/repository/net.sf</code> to <code>{picard|sam}-$PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV.{jar|xml}</code></p>
</li>
<li>
<p>Update the jars in <code>$STING_HOME/settings/repository/net.sf</code> with their newer equivalents in <code>$PICARD_PUBLIC_HOME/dist/picard_lib</code>.</p>
</li>
<li>Update the xmls in <code>$STING_HOME/settings/repository/net.sf</code> with the appropriate version number (<code>$PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV</code>).</li>
</ul>
<h3>Updating the Picard private jar</h3>
<ul>
<li>
<p>Create the Picard private jar with the following command:</p>
<pre><code class="pre_md">ant clean package -Dexecutable=PicardPrivate -Dpicard.dist.dir=${PICARD_PRIVATE_HOME}/dist</code class="pre_md"></pre>
</li>
<li>
<p>Rename <code>picard-private-parts-*.jar</code> in <code>$STING_HOME/settings/repository/edu.mit.broad</code> to <code>picard-private-parts-$PICARD_PRIVATE_SVN_REV.jar</code>.</p>
</li>
<li>
<p>Update <code>picard-private-parts-*.jar</code> in <code>$STING_HOME/settings/repository/edu.mit.broad</code> with the <code>picard-private-parts.jar</code> in <code>$STING_HOME/dist/packages/picard-private-parts</code>.</p>
</li>
<li>Update the xml in <code>$STING_HOME/settings/repository/edu.mit.broad</code> to reflect the new revision and publication date.</li>
</ul>
## Collecting output
http://gatkforums.broadinstitute.org/gatk/discussion/1341/collecting-output

<h2>1. Analysis output overview</h2>
<p>In theory, analysis output can go to any class implementing the <code>OutputStream</code> interface. In practice, three types of classes are commonly used: <code>PrintStream</code>s for plain text files, <code>SAMFileWriter</code>s for BAM files, and <code>VCFWriter</code>s for VCF files.</p>
<h2>2. PrintStream</h2>
<p>To declare a basic <code>PrintStream</code> for output, use the following declaration syntax:</p>
<pre><code class="pre_md">@Output
public PrintStream out;</code class="pre_md"></pre>
<p>And use it just as you would any other PrintStream:</p>
<pre><code class="pre_md">out.println("Hello, world!");</code class="pre_md"></pre>
<p>By default, <code>@Output</code> streams prepopulate <code>fullName</code>, <code>shortName</code>, <code>required</code>, and <code>doc</code>. <code>required</code> in this context means that the GATK will always fill in the contents of the <code>out</code> field for you. If the user specifies no <code>--out</code> command-line argument, the <code>out</code> field will be prepopulated with a stream pointing to <code>System.out</code>.</p>
<p>If your walker outputs a custom format that requires more than simple concatenation by Queue, you should also implement a custom <code>Gatherer</code>.</p>
<h2>3. SAMFileWriter</h2>
<p>For some applications, you might need to manage your own SAM readers and writers directly from inside your walker. Current best practice for creating these readers/writers is to declare arguments of type <code>SAMFileReader</code> or <code>SAMFileWriter</code>, as in the following example:</p>
<pre><code class="pre_md">@Output
SAMFileWriter outputBamFile = null;</code class="pre_md"></pre>
<p>If you do not specify the full name and short name, the writer will provide system default names for these arguments. Creating a <code>SAMFileWriter</code> in this way will create the type of writer most commonly used by members of the GSA group at the Broad Institute -- it will use the same header as the input BAM and require presorted data. To change either of these attributes, use the <code>StingSAMFileWriter</code> interface instead:</p>
<pre><code class="pre_md">@Output
StingSAMFileWriter outputBamFile = null;</code class="pre_md"></pre>
<p>and later, in <code>initialize()</code>, run one or both of the following methods:</p>
<pre><code class="pre_md">outputBamFile.writeHeader(customHeader);
outputBamFile.setPresorted(false);</code class="pre_md"></pre>
<p>You can change the header or presorted state until the first alignment is written to the file.</p>
<h2>4. VCFWriter</h2>
<p><code>VCFWriter</code> outputs behave similarly to <code>PrintStream</code>s and <code>SAMFileWriter</code>s. Declare a <code>VCFWriter</code> as follows:</p>
<pre><code class="pre_md">@Output(doc="File to which variants should be written",required=true)
protected VCFWriter writer = null;</code class="pre_md"></pre>
<h2>5. Debugging Output</h2>
<p>The walkers provide a protected logger instance. Users can adjust the debug level of the walkers using the <code>-l</code> command-line option.</p>
<p>Turning on verbose logging can produce more output than is really necessary. To selectively turn on logging for a class or package, specify a <code>log4j.properties</code> file from the command line as follows:</p>
<pre><code class="pre_md">-Dlog4j.configuration=file:///<your development root>/Sting/java/config/log4j.properties</code class="pre_md"></pre>
<p>An example <code>log4j.properties</code> file is available in the <code>java/config</code> directory of the Git repository.</p>
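<p>As a hedged sketch of what such a file might contain (the package name and appender layout below are illustrative; consult the example file in <code>java/config</code> for the real settings), a <code>log4j.properties</code> that enables DEBUG for a single package could look like:</p>

```properties
# Root logger stays at INFO, writing to the console
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %-5p %c - %m%n

# Turn on verbose logging only for the package you are debugging
# (package name below is illustrative)
log4j.logger.org.broadinstitute.sting.gatk.walkers=DEBUG
```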
## Documenting walkers
http://gatkforums.broadinstitute.org/gatk/discussion/1346/documenting-walkers

<p>The GATK discovers walker documentation by reading it out of the Javadoc, Sun's design pattern for providing documentation for packages and classes. This page provides an extremely brief explanation of how to write Javadoc; more information on writing Javadoc comments can be found in <a href="http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html">Sun's documentation</a>.</p>
<h2>1. Adding walker and package descriptions to the help text</h2>
<p>The GATK's build system uses the Javadoc parser to extract the Javadoc for classes and packages and embed its contents in the help system. If you add Javadoc to your package or walker, it will automatically appear in the help. The Javadoc parser picks up standard Javadoc comments, such as the following, taken from PrintReadsWalker:</p>
<pre><code class="pre_md">/**
 * This walker prints out the input reads in SAM format. Alternatively, the walker can write reads into a specified BAM file.
 */</code class="pre_md"></pre>
<p>You can add Javadoc to your package by creating a special file, <code>package-info.java</code>, in the package directory. This file should consist of the Javadoc for your package plus a package descriptor line. One such example follows:</p>
<pre><code class="pre_md">/**
 * @help.display.name Miscellaneous walkers (experimental)
 */
package org.broadinstitute.sting.playground.gatk.walkers;</code class="pre_md"></pre>
<p>Additionally, the GATK provides three extra custom tags for overriding the information that ultimately makes it into the help.</p>
<ul>
<li>
<p><code>@help.display.name</code> Changes the name of the package as it appears in the help. Note that the name of the walker cannot be changed, as it must be passed verbatim to the <code>-T</code> argument.</p>
</li>
<li>
<p><code>@help.summary</code> Changes the description which appears in the right-hand column of the help text. This is useful if you'd like to provide a more concise description of the walker that should appear in the help.</p>
</li>
<li><code>@help.description</code> Changes the description which appears at the bottom of the help text when <code>-T <your walker> --help</code> is specified. This is useful if you'd like to present a more complete description of your walker.</li>
</ul>
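<p>Putting the tags together, a walker's Javadoc might look like the following skeleton. The walker name and descriptions here are invented for illustration, not taken from the GATK codebase:</p>

```java
/**
 * Counts the reads overlapping each locus.
 *
 * @help.summary Counts reads per locus
 * @help.description A longer description shown when --help is specified for
 *                   this walker, explaining inputs, outputs, and caveats.
 */
public class CountLociWalker /* extends LocusWalker<Integer, Integer> */ {
    // walker body omitted; only the Javadoc matters for the help system
}
```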
<h2>2. Hiding experimental walkers (use sparingly, please!)</h2>
<p>Walkers can be hidden from the documentation system by adding the <code>@Hidden</code> annotation to the top of the walker class. <code>@Hidden</code> walkers can still be run from the command line, but their documentation will not be visible to end users. Please use this functionality sparingly, to avoid walkers with hidden command-line options that are required for production use.</p>
<h2>3. Disabling building of help</h2>
<p>Because building our help text is heavyweight and can dramatically increase compile time on some systems, we have a mechanism to disable help generation. Compile with the following command:</p>
<pre><code class="pre_md">ant -Ddisable.help=true</code class="pre_md"></pre>
## Frequently asked questions about Scala
http://gatkforums.broadinstitute.org/gatk/discussion/1315/frequently-asked-questions-about-scala

<h3>1. What is Scala?</h3>
<p>Scala combines an object-oriented framework with a functional programming language. For a good introduction see the free online book <a href="http://programming-scala.labs.oreilly.com/">Programming Scala</a>.</p>
<p>The following are extremely brief answers to frequently asked questions about Scala which often pop up when first viewing or editing QScripts. For more information on Scala there are a multitude of resources available around the web, including the <a href="http://www.scala-lang.org/">Scala home page</a> and the online <a href="http://www.scala-lang.org/api/2.8.1/index.html">Scala Doc</a>.</p>
<h3>2. Where do I learn more about Scala?</h3>
<ul>
<li><a href="http://www.scala-lang.org">http://www.scala-lang.org</a></li>
<li><a href="http://programming-scala.labs.oreilly.com">http://programming-scala.labs.oreilly.com</a></li>
<li><a href="http://www.scala-lang.org/docu/files/ScalaByExample.pdf">http://www.scala-lang.org/docu/files/ScalaByExample.pdf</a></li>
<li><a href="http://devcheatsheet.com/tag/scala/">http://devcheatsheet.com/tag/scala/</a></li>
<li><a href="http://davetron5000.github.com/scala-style/index.html">http://davetron5000.github.com/scala-style/index.html</a></li>
</ul>
<h3>3. What is the difference between <code>var</code> and <code>val</code>?</h3>
<p><code>var</code> declares a value you can later modify, while <code>val</code> is similar to <code>final</code> in Java.</p>
<h3>4. What is the difference between Scala collections and Java collections? / Why do I get the error: type mismatch?</h3>
<p>Because the GATK and Queue are a mix of Scala and Java, sometimes you'll run into problems when you need a Scala collection and instead a Java collection is returned.</p>
<pre><code class="pre_md">MyQScript.scala:39: error: type mismatch;
 found   : java.util.List[java.lang.String]
 required: scala.List[String]
        val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)</code class="pre_md"></pre>
<p>Use the implicit definitions in <code>JavaConversions</code> to automatically convert the basic Java collections to and from Scala collections.</p>
<pre><code class="pre_md">import collection.JavaConversions._</code class="pre_md"></pre>
<p>Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often by assigning the result back to the original variable.</p>
<pre><code class="pre_md">var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC

var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)</code class="pre_md"></pre>
<h3>5. How do I append to a list?</h3>
<p>Use the <code>:+</code> operator for a single value.</p>
<pre><code class="pre_md">var myList = List.empty[String]
myList :+= "a"
myList :+= "b"
myList :+= "c"</code class="pre_md"></pre>
<p>Use <code>++</code> for appending a list.</p>
<pre><code class="pre_md">var myList = List.empty[String]
myList ++= List("a", "b", "c")</code class="pre_md"></pre>
<h3>6. How do I add to a set?</h3>
<p>Use the <code>+</code> operator.</p>
<pre><code class="pre_md">var mySet = Set.empty[String]
mySet += "a"
mySet += "b"
mySet += "c"</code class="pre_md"></pre>
<h3>7. How do I add to a map?</h3>
<p>Use the <code>+</code> and <code>-></code> operators.</p>
<pre><code class="pre_md">var myMap = Map.empty[String,Int]
myMap += "a" -> 1
myMap += "b" -> 2
myMap += "c" -> 3</code class="pre_md"></pre>
<h3>8. What are Option, Some, and None?</h3>
<p><code>Option</code> is a Scala generic type that holds either some generic value or <code>None</code>. Queue often uses it to represent primitives that may be null.</p>
<pre><code class="pre_md">var myNullableInt1: Option[Int] = Some(1)
var myNullableInt2: Option[Int] = None</code class="pre_md"></pre>
<h3>9. What is _ / What is the underscore?</h3>
<p><a href="http://blog.normation.com/2010/07/01/scala-dreaded-underscore-psug/">François Armand</a>'s slide deck is a good introduction: <a href="http://www.slideshare.net/normation/scala-dreaded">http://www.slideshare.net/normation/scala-dreaded</a></p>
<p>To quote from his slides:</p>
<pre><code class="pre_md">Give me a variable name but
- I don't care of what it is
- and/or
- don't want to pollute my namespace with it</code class="pre_md"></pre>
<h3>10. How do I format a String?</h3>
<p>Use the <code>.format()</code> method.</p>
<p>This Java snippet:</p>
<pre><code class="pre_md">String formatted = String.format("%s %d", myString, myInt);</code class="pre_md"></pre>
<p>in Scala would be:</p>
<pre><code class="pre_md">val formatted = "%s %d".format(myString, myInt)</code class="pre_md"></pre>
<h3>11. Can I use Scala Enumerations as QScript @Arguments?</h3>
<p>No. Currently Scala's <code>Enumeration</code> class does not interact with the Java reflection API in a way that could be used for Queue command-line arguments. You can use Java <code>enum</code>s if, for example, you are importing a Java-based walker's <code>enum</code> type.</p>
<p>If/when we find a workaround for Queue we'll update this entry. In the meantime, try using a String.</p>
## GATK development process and coding standards
http://gatkforums.broadinstitute.org/gatk/discussion/2129/gatk-development-process-and-coding-standards

<h2>Introduction</h2>
<p>This document describes the current GATK coding standards for documentation and unit testing. The overall goal is that all functions be well documented, have unit tests, and conform to the coding conventions described in this guideline. It is primarily meant as an internal reference for team members, but we are making it public to provide an example of how we work. There are a few mentions of specific team member responsibilities and who to contact with questions; please disregard those, as they will not be applicable to you.</p>
<h2>Coding conventions</h2>
<h3>General conventions</h3>
<p>The Genome Analysis Toolkit generally follows Java coding standards and good practices, which can be viewed <a href="http://www.oracle.com/technetwork/java/codeconvtoc-136057.html">at Sun's site</a>.</p>
<p>The original coding standard document for the GATK was developed in early 2009. It remains a reasonable starting point but may be superseded by statements on this page (<a href="https://us.v-cdn.net/5019796/uploads/FileUpload/18/a199e46fbc5c5e08866e8136db7192.pdf">available as a PDF</a>).</p>
<h3>Size of functions and functional programming style</h3>
<p>Code in the GATK should be structured into clear, simple, and testable functions. Clear means that a function takes a limited number of arguments, most of which are values it does not modify, and in general returns newly allocated results rather than directly modifying its input arguments (functional style). The maximum size of a function should be approximately one screen's worth of real estate (no more than 80 lines), including inline comments. If you are writing functions much larger than this, you must refactor your code into modular components.</p>
<h3>Code duplication</h3>
<p>Do not duplicate code. If you find yourself wanting to make a copy of some functionality, refactor the code you want to duplicate and enhance it. Duplicating code introduces bugs, makes the system harder to maintain, and requires more work, since you end up with a new function that must be tested, as opposed to expanding the tests on the existing functionality.</p>
<h3>Documentation</h3>
<p>Functions must be documented following the Javadoc conventions. That means the first line of the comment should be a simple statement of the purpose of the function. Following that is an expanded description of the function, covering edge case conditions, requirements on the arguments, state changes, etc. Finally come the @param and @return fields, which should describe the meaning of each function argument and the restrictions on the values allowed or returned. In general, the @return field should describe the types and ranges of the returned values, not the meaning of the result, as that should be in the body of the documentation.</p>
<h3>Testing for valid inputs and contracts</h3>
<p>The GATK uses Contracts for Java to help us enforce code quality during testing. See <a href="http://code.google.com/p/cofoja/">CoFoJa</a> for more information. If you've never programmed with contracts, read their excellent description <a href="http://code.google.com/p/cofoja/wiki/AddContracts">Adding contracts to a stack</a>. Contracts are only enabled while we are testing the code (unit tests and integration tests), not during normal execution, so contracts can be reasonably expensive to compute. They are best used to enforce assumptions about the state of class variables and return results.</p>
<p>Contracts are tricky when it comes to input arguments. The best practice is simple:</p>
<ul>
<li>Public functions with arguments should explicitly test those input arguments for good values with live Java code (such as in the example below). Because the function is public, you don't know what the caller will be passing in, so you have to check and ensure quality.</li>
<li>Private functions with arguments should use contracts instead. Because the function is private, the author of the code controls use of the function, and the contracts enforce good use. In principle the quality of the inputs can be assumed at runtime, since only the author controlled calls to the function, and input QC should have happened elsewhere.</li>
</ul>
<p>Below is an example private function that makes good use of input argument contracts:</p>
<pre><code class="pre_md">/**
 * Helper function to write out an IGV formatted line to out, at loc, with values
 *
 * http://www.broadinstitute.org/software/igv/IGV
 *
 * @param out a non-null PrintStream where we'll write our line
 * @param loc the location of values
 * @param featureName string name of this feature (see IGV format)
 * @param values the floating point values to associate with loc and feature name in out
 */
@Requires({
        "out != null",
        "loc != null",
        "values.length > 0"
})
private void printIGVFormatRow(final PrintStream out, final GenomeLoc loc, final String featureName, final double ... values) {
    // note that start and stop are 0 based, but the stop is exclusive so we don't subtract 1
    out.printf("%s\t%d\t%d\t%s", loc.getContig(), loc.getStart() - 1, loc.getStop(), featureName);
    for ( final double value : values )
        out.print(String.format("\t%.3f", value));
    out.println();
}</code class="pre_md"></pre>
<h3>Final variables</h3>
<p>Final Java fields cannot be reassigned once set. Nearly all variables you write should be final, unless they are obviously accumulator results or other things you actually want to modify. Nearly all of your function arguments should be final. Being final stops incorrect reassignments (a major source of bugs) and more clearly captures the flow of information through the code.</p>
<h3>An example high-quality GATK function</h3>
<pre><code class="pre_md">/**
 * Get the reference bases from referenceReader spanned by the extended location of this active region,
 * including additional padding bp on either side. If this expanded region would exceed the boundaries
 * of the active region's contig, the returned result will be truncated to only include on-genome reference
 * bases.
 *
 * @param referenceReader the source of the reference genome bases
 * @param padding the padding, in bp, we want to add to either side of this active region extended region
 * @param genomeLoc a non-null genome loc indicating the base span of the bp we'd like to get the reference for
 * @return a non-null array of bytes holding the reference bases in referenceReader
 */
@Ensures("result != null")
public byte[] getReference( final IndexedFastaSequenceFile referenceReader, final int padding, final GenomeLoc genomeLoc ) {
    if ( referenceReader == null ) throw new IllegalArgumentException("referenceReader cannot be null");
    if ( padding < 0 ) throw new IllegalArgumentException("padding must be a positive integer but got " + padding);
    if ( genomeLoc == null ) throw new IllegalArgumentException("genomeLoc cannot be null");
    if ( genomeLoc.size() == 0 ) throw new IllegalArgumentException("GenomeLoc must have size > 0 but got " + genomeLoc);

    final byte[] reference = referenceReader.getSubsequenceAt( genomeLoc.getContig(),
            Math.max(1, genomeLoc.getStart() - padding),
            Math.min(referenceReader.getSequenceDictionary().getSequence(genomeLoc.getContig()).getSequenceLength(), genomeLoc.getStop() + padding) ).getBases();

    return reference;
}</code class="pre_md"></pre>
<h2>Unit testing</h2>
<p>All classes and methods in the GATK should have unit tests to ensure that they work properly, and to protect you and others who may want to extend, modify, enhance, or optimize your code. The GATK development team assumes that anything that isn't unit tested is broken. Perhaps right now it isn't broken, but with a team of 10 people it will become broken soon if you don't ensure its correctness going forward with unit tests.</p>
<p>Walkers are a particularly complex issue. Unit testing the map and reduce results is very hard, and in my view largely unnecessary. That said, you should write your walkers and supporting classes in such a way that all of the complex data processing functions are separated from the map and reduce functions, and those should be unit tested properly.</p>
<p>Code coverage tells you how much of your class, at the statement or function level, has unit testing coverage. The GATK development standard is to reach something >80% method coverage (and ideally >80% statement coverage). The target is flexible, as some methods are trivial (they just call into another method) and so may not need coverage. At the statement level, you get deducted from 100% for branches that check for things you perhaps don't care about, such as illegal arguments, so reaching 100% statement-level coverage is unrealistic for most classes.</p>
<p>You can find more information about generating code coverage results at <a href="http://gatkforums.broadinstitute.org/discussion/2002/clover-coverage-analysis-with-ant#latest">Analyzing coverage with Clover</a>.</p>
<p>We've created a unit testing example template in the GATK codebase that provides examples of creating core GATK data structures from scratch for unit testing. The code is in the class ExampleToCopyUnitTest and can be viewed directly on GitHub: <a href="https://github.com/broadinstitute/gsa-unstable/blob/master/public/java/test/org/broadinstitute/sting/ExampleToCopyUnitTest.java">ExampleToCopyUnitTest</a>.</p>
<h2>The GSA workflow</h2>
<p>As of GATK 2.5, we are moving to a full code review process, which has the following benefits:</p>
<ul>
<li>Reducing obvious coding bugs by having other eyes on the code</li>
<li>Reducing code duplication, as reviewers will be able to see duplicated code within the commit and potentially across the codebase</li>
<li>Ensuring that coding quality standards are met (style and unit testing)</li>
<li>Setting a higher code quality standard for the master GATK unstable branch</li>
<li>Providing detailed coding feedback to newer developers, so they can improve their skills over time</li>
</ul>
|
||||
<h3>The GSA workflow in words:</h3>
<ul>
<li>Create a new branch to start any work. Never work on master.
<ul>
<li>Branch names must follow the convention <code>[author prefix]_[feature name]_[JIRA ticket]</code> (e.g. rp_pairhmm_GSA-232).</li>
</ul></li>
<li>Make frequent commits.</li>
<li>Push your branch to origin frequently (branch -> branch).</li>
<li>When you're done, rewrite your commit history to tell a compelling story: <a href="http://git-scm.com/book/en/Git-Tools-Rewriting-History">Git Tools Rewriting History</a>.</li>
<li>Push your rewritten history, and request a code review.
<ul>
<li>The entire GSA team will review your code.</li>
<li>Mark DePristo assigns the reviewer responsible for making the judgment based on all reviews and merging your code into master.</li>
</ul></li>
<li>If your pull request gets rejected, follow the comments from the team to fix it, and repeat the workflow until you're ready to submit a new pull request.</li>
<li>If your pull request is accepted, the reviewer will merge it and remove your remote branch.</li>
</ul>
<h3>Example GSA workflow in the command line:</h3>
|
||||
<pre><code class="pre_md"># starting a new feature
|
||||
git checkout -b rp_pairhmm_GSA-332
|
||||
git commit -av
|
||||
git push -u origin rp_pairhmm_GSA-332
|
||||
|
||||
# doing work on existing feature
|
||||
git commit -av
|
||||
git push
|
||||
|
||||
# ready to submit pull-request
|
||||
git fetch origin
|
||||
git rebase -i origin/master
|
||||
git push -f
|
||||
|
||||
# after being accepted, delete your branch
|
||||
git checkout master
|
||||
git pull
|
||||
git branch -d rp_pairhmm_GSA-332
|
||||
(the reviewer will remove your github branch)</code class="pre_md"></pre>
|
||||
<h3>Commit histories and rebasing</h3>
<p>You must commit your code in small commit blocks, with commit messages that follow git best practices: the first line of the commit summarizes its purpose, followed by <code>--</code> lines that describe the changes in more detail. For example, here's a recent commit that meets these criteria, which added unit tests to the GenomeLocParser:</p>
<pre><code class="pre_md">Refactoring and unit testing GenomeLocParser

-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts.  The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit</code></pre>
<p>Now, git encourages you to commit code often and to develop your code in whatever order works best for you. So it's common to end up with 20 commits, all with brief, cryptic commit messages, that you want to push into the master branch. It is not acceptable to push such changes. You need to use the git rebase command to reorganize your commit history into a small number of clear commits with clear messages.</p>
<p>Here is a recommended git workflow using rebase:</p>
<ol>
<li>
<p>Start every project by creating a new branch for it. From your master branch, type the following command (replacing "myBranch" with an appropriate name for the new branch):</p>
<pre><code class="pre_md">git checkout -b myBranch</code></pre>
<p>Note that you only include the <em>-b</em> when you're first creating the branch. After a branch is already created, you can switch to it by typing the checkout command without the <em>-b</em>: "git checkout myBranch"</p>
<p>Also note that since you're always starting a new branch from master, you should keep your master branch up-to-date by occasionally doing a "git pull" while your master branch is checked out. You shouldn't do any actual work on your master branch, however.</p>
</li>
<li>
<p>When you want to update your branch with the latest commits from the central repo, type this while your branch is checked out:</p>
<pre><code class="pre_md">git fetch && git rebase origin/master</code></pre>
<p>If there are conflicts while updating your branch, git will tell you what additional commands to use.</p>
<p>If you need to combine or reorder your commits, add "-i" to the above command, like so:</p>
<pre><code class="pre_md">git fetch && git rebase -i origin/master</code></pre>
<p>If you want to edit your commits without also retrieving any new commits, omit the "git fetch" from the above command.</p>
</li>
</ol>
<p>If you find the above commands cumbersome or hard to remember, create aliases for them using the following commands:</p>
<pre><code class="pre_md">git config --global alias.up '!git fetch && git rebase origin/master'
git config --global alias.edit '!git fetch && git rebase -i origin/master'
git config --global alias.done '!git push origin HEAD:master'</code></pre>
<p>Then you can type "git up" to update your branch, "git edit" to combine/reorder commits, and "git done" to push your branch.</p>
<p>Here are more useful tutorials on how to use rebase:</p>
<ul>
<li><a href="http://git-scm.com/book/en/Git-Tools-Rewriting-History">Git Tools Rewriting History</a></li>
<li><a href="http://www.reviewboard.org/docs/codebase/dev/git/clean-commits/">Keeping commit histories clean</a></li>
<li><a href="http://darwinweb.net/articles/the-case-for-git-rebase">The case for git rebase</a></li>
<li><a href="http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html">Squashing commits with rebase</a></li>
</ul>
<p>If you need help with rebasing, talk to Mauricio or David and they will help you out.</p>
## How to access the picard and htsjdk repository (containing samtools-jdk, tribble, and variant)

http://gatkforums.broadinstitute.org/gatk/discussion/2194/how-to-access-the-picard-and-htsjdk-repository-containing-samtools-jdk-tribble-and-variant
<p>The picard repository on GitHub contains all public Picard tools. Library code lives under the htsjdk, which includes the samtools-jdk, tribble, and variant packages (the latter includes VariantContext and associated classes as well as the VCF/BCF codecs).</p>
<p>If you just need to check out the sources and don't need to make any commits into the picard repository, the command is:</p>
<pre><code class="pre_md">git clone https://github.com/broadinstitute/picard.git</code></pre>
<p>Then, within the picard directory, clone the htsjdk:</p>
<pre><code class="pre_md">cd picard
git clone https://github.com/samtools/htsjdk.git</code></pre>
<p>Then you can attach the <code>picard/src/java</code> and <code>picard/htsjdk/src/java</code> directories in IntelliJ as source directories (File -> Project Structure -> Libraries -> click the plus sign -> "Attach Files or Directories" in the latest IntelliJ).</p>
<p>To build picard and the htsjdk all at once, type <code>ant</code> from within the picard directory. To run tests, type <code>ant test</code>.</p>
<p>If you do need to make commits into the picard repository, you'll first need to create a GitHub account, fork picard or htsjdk, make your changes, and then issue a pull request. For more info on pull requests, see: <a href="https://help.github.com/articles/using-pull-requests">https://help.github.com/articles/using-pull-requests</a></p>
## How to include GATK in a Maven project

http://gatkforums.broadinstitute.org/gatk/discussion/6214/how-to-include-gatk-in-a-maven-project
<p>GATK 3.x releases are not currently published to Maven Central. But it is possible to install the GATK into your local repository, where Maven can then pick it up as a dependency.</p>
<hr />
<p><strong>TL;DR</strong> Clone GATK 3.4, <code>mvn install</code>, then use the GATK as you would any other artifact.</p>
<hr />
<p>The repository you should use depends on your goal.</p>
<p>If you want to build your own analysis tools on top of the GATK engine (not including the GATK analysis tools), with the option of distributing your project to others, you should clone the <a href="https://github.com/broadgsa/gatk"><code>gatk</code></a> repo.</p>
<p>If you want to integrate the full GATK into a project for in-house purposes (redistribution is not allowed under the licensing terms), in which your tools can call GATK tools directly, you should clone <a href="https://github.com/broadgsa/gatk-protected"><code>gatk-protected</code></a>. This can be done by running the following commands:</p>
<pre><code>: 'GATK 3.4 code has known issues with the Java 8 compiler. Make sure you are using Java 7.'
java -version

: 'The entire GATK repo is relatively large. This only downloads 3.4.'
git clone -b 3.4 --depth 1 git@github.com:broadgsa/gatk-protected.git gatk-protected-3.4
cd gatk-protected-3.4

: 'Install the GATK into the local ~/.m2/repository, where your project can then refer to it.'
mvn install

: 'Build the "external example" as a demo of using the GATK as a library.'
cd public/external-example
mvn verify
java -jar target/external-example-1.0-SNAPSHOT.jar -T MyExampleWalker --help</code></pre>
<p>After the GATK is installed, add this dependency to your Maven project, and all other GATK dependencies will be included as well:</p>
<pre><code><dependency>
    <groupId>org.broadinstitute.gatk</groupId>
    <artifactId>gatk-tools-protected</artifactId>
    <version>3.4</version>
</dependency></code></pre>
<p>One thing you might run into is that the GATK artifacts, and hence the external-example, transitively depend on artifacts that are also not in Central. They are instead committed under the path <code>public/repo</code>. Like the <code>public/external-example/pom.xml</code>, your Maven project may need to include this directory as an additional repository. That said, <code>mvn install</code> <em>should</em> copy the artifacts to <code>~/.m2/repository</code> for you. For example, after the install, you should have a directory <code>~/.m2/repository/com/google/code/cofoja/cofoja</code>.</p>
<p>If you do need to add the GATK's public repo as a repository, use a repository element like the one below:</p>
<pre><code><repositories>
    <repository>
        <id>gatk.public.repo.local</id>
        <name>GATK Public Local Repository</name>
        <url>file:/Users/someuser/src/gatk-protected-3.4/public/repo</url>
    </repository>
</repositories></code></pre>
<p>Since the GATK is not in Central, each developer will need to install GATK 3.4 once. Or, as an advanced step, you may also want to explore publishing the GATK on one of your shared local systems. If you have a shared filesystem you'd like to use as a repository, publish GATK 3.4 to that directory using <code>mvn install -Dmaven.repo.local=/mount/path/to/shared/repo</code>, and then add a corresponding repository element to your Maven project. If your team is using a Maven repository manager such as Artifactory or Nexus, we can't provide guidance for publishing "third party" artifacts there, but it should theoretically be possible, with instructions hopefully available through either Maven or the repository manager's help forums.</p>
## How to make a walker compatible with multi-threading

http://gatkforums.broadinstitute.org/gatk/discussion/2867/how-to-make-a-walker-compatible-with-multi-threading

<p>This document provides an overview of the steps required to make a walker multi-threadable using the <code>-nct</code> and <code>-nt</code> arguments, which make use of the <code>NanoSchedulable</code> and <code>TreeReducible</code> interfaces, respectively.</p>
<hr />
<h3>NanoSchedulable / <code>-nct</code></h3>
<p>Providing <code>-nct</code> support requires that you certify that your walker's <code>map()</code> method is thread-safe -- e.g., if any data structures are shared across <code>map()</code> calls, access to them must be properly synchronized. Once your <code>map()</code> method is thread-safe, you can implement the <code>NanoSchedulable</code> interface, an empty marker interface with no methods that declares your walker as having a <code>map()</code> method that's safe to parallelize:</p>
<pre><code class="pre_md">/**
 * Root parallelism interface.  Walkers that implement this
 * declare that their map function is thread-safe and so multiple
 * map calls can be run in parallel in the same JVM instance.
 */
public interface NanoSchedulable {
}</code></pre>
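To show the kind of thread-safety you are certifying, here is a self-contained sketch in plain Java (the class and field names are hypothetical, not GATK code): a shared counter touched by every `map()` call is made safe with an atomic, so concurrent invocations cannot lose updates.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

// Illustrative stand-in for a walker whose map() touches shared state.
// AtomicLong makes concurrent map() calls safe, which is the property the
// NanoSchedulable marker interface asks you to certify for -nct.
public class ThreadSafeMapDemo {
    private final AtomicLong sitesSeen = new AtomicLong();

    // Stand-in for a walker's map() body.
    void map() {
        sitesSeen.incrementAndGet();
    }

    public static void main(String[] args) {
        ThreadSafeMapDemo walker = new ThreadSafeMapDemo();
        // Simulate the engine invoking map() from many threads (-nct style).
        IntStream.range(0, 100_000).parallel().forEach(i -> walker.map());
        if (walker.sitesSeen.get() != 100_000) throw new AssertionError("lost updates");
        System.out.println(walker.sitesSeen.get());
    }
}
```

Had `sitesSeen` been a plain `long` incremented with `++`, the parallel run would intermittently undercount; that data race is exactly what would make a walker unsafe to mark as `NanoSchedulable`.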
<hr />
<h3>TreeReducible / <code>-nt</code></h3>
<p>Providing <code>-nt</code> support requires that both <code>map()</code> and <code>reduce()</code> be thread-safe, and you also need to implement the <code>TreeReducible</code> interface. Implementing <code>TreeReducible</code> requires you to write a <code>treeReduce()</code> method that tells the engine how to combine the results of multiple <code>reduce()</code> calls:</p>
<pre><code class="pre_md">public interface TreeReducible<ReduceType> {
    /**
     * A composite, 'reduce of reduces' function.
     * @param lhs 'left-most' portion of data in the composite reduce.
     * @param rhs 'right-most' portion of data in the composite reduce.
     * @return The composite reduce type.
     */
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);
}</code></pre>
<p>This method differs from <code>reduce()</code> in that while <code>reduce()</code> adds the result of a <em>single</em> <code>map()</code> call onto a running total, <code>treeReduce()</code> takes the aggregated results from multiple map/reduce tasks that have been run in parallel and combines them. So, <code>lhs</code> and <code>rhs</code> might each represent the final result from several hundred map/reduce calls.</p>
<p>Example <code>treeReduce()</code> implementation from the UnifiedGenotyper:</p>
<pre><code class="pre_md">public UGStatistics treeReduce(UGStatistics lhs, UGStatistics rhs) {
    lhs.nBasesCallable += rhs.nBasesCallable;
    lhs.nBasesCalledConfidently += rhs.nBasesCalledConfidently;
    lhs.nBasesVisited += rhs.nBasesVisited;
    lhs.nCallsMade += rhs.nCallsMade;
    return lhs;
}</code></pre>
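The combine step can be sketched as a self-contained toy in plain Java (the `Stats` class below is a hypothetical stand-in for something like `UGStatistics`, not a real GATK class). The key property to notice is that the operation is associative, so combining partial results in any tree order gives the same total as a single-threaded run.

```java
// Toy model of the treeReduce pattern: fold the right-hand partial result
// into the left-hand one and return it, mirroring the UnifiedGenotyper style.
public class TreeReduceDemo {
    static class Stats {
        long nCallsMade;
        Stats(long n) { nCallsMade = n; }
    }

    // Combines two already-reduced shards; must be associative for -nt.
    static Stats treeReduce(Stats lhs, Stats rhs) {
        lhs.nCallsMade += rhs.nCallsMade;
        return lhs;
    }

    public static void main(String[] args) {
        // Two parallel shards, each the aggregate of many reduce() calls.
        Stats shardA = new Stats(120);
        Stats shardB = new Stats(80);
        Stats total = treeReduce(shardA, shardB);
        if (total.nCallsMade != 200) throw new AssertionError("bad combine");
        System.out.println(total.nCallsMade);
    }
}
```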
## Managing user inputs

http://gatkforums.broadinstitute.org/gatk/discussion/1325/managing-user-inputs
<h3>1. Naming walkers</h3>
<p>Users identify which GATK walker to run by specifying a walker name via the <code>--analysis_type</code> command-line argument. By default, the GATK derives the walker name from the walker class by taking the class name, removing packaging information from the start, and removing the trailing text <code>Walker</code> from the end of the name, if it exists. For example, the GATK would, by default, assign the name <code>PrintReads</code> to the walker class <code>org.broadinstitute.sting.gatk.walkers.PrintReadsWalker</code>. To override the default walker name, annotate your walker class with <code>@WalkerName("<my name>")</code>.</p>
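The default derivation can be sketched in a few lines (a hypothetical re-implementation for illustration, not the GATK's actual code):

```java
// Sketch of the default walker-name rule described above: strip the package
// prefix, then strip a trailing "Walker" suffix if present.
public class WalkerNameDemo {
    static String deriveName(String className) {
        String simple = className.substring(className.lastIndexOf('.') + 1);
        return simple.endsWith("Walker")
                ? simple.substring(0, simple.length() - "Walker".length())
                : simple;
    }

    public static void main(String[] args) {
        String name = deriveName("org.broadinstitute.sting.gatk.walkers.PrintReadsWalker");
        if (!name.equals("PrintReads")) throw new AssertionError(name);
        System.out.println(name);
    }
}
```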
<h3>2. Requiring / allowing primary inputs</h3>
<p>Walkers can flag exactly which primary data sources are allowed and required for a given walker. Reads, the reference, and reference-ordered data are currently considered primary data sources. Different traversal types have different default requirements for reads and reference, but currently no traversal types require reference-ordered data by default. You can add requirements to your walker with the <code>@Requires</code> / <code>@Allows</code> annotations as follows:</p>
<pre><code class="pre_md">@Requires(DataSource.READS)
@Requires({DataSource.READS,DataSource.REFERENCE})
@Requires(value={DataSource.READS,DataSource.REFERENCE})
@Requires(value=DataSource.REFERENCE)</code></pre>
<p>By default, all parameters are allowed unless you lock them down with the <code>@Allows</code> attribute. The command:</p>
<pre><code class="pre_md">@Allows(value={DataSource.READS,DataSource.REFERENCE})</code></pre>
<p>will only allow the reads and the reference. Any other primary data sources will cause the system to exit with an error.</p>
<p>Note that as of August 2011, the GATK no longer supports RMD via the <code>@Requires</code> and <code>@Allows</code> syntax, as RMD inputs have moved to the standard <code>@Argument</code> system.</p>
<h3>3. Command-line argument tagging</h3>
<p>Any command-line argument can be tagged with a comma-separated list of freeform tags.</p>
<p>The syntax for tags is as follows:</p>
<pre><code class="pre_md">-<argument>:<tag1>,<tag2>,<tag3> <argument value></code></pre>
<p>for example:</p>
<pre><code class="pre_md">-I:tumor <my tumor data>.bam
-eval:VCF yri.trio.chr1.vcf</code></pre>
<p>There is currently no mechanism in the GATK to validate either the number of tags supplied or the content of those tags.</p>
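The tag syntax can be illustrated with a small, self-contained sketch (this shows the format only; it is not the GATK's actual argument parser):

```java
import java.util.Arrays;
import java.util.List;

// Illustration of the tag format above: everything after the first ':' in the
// argument name is a comma-separated tag list.
public class TagSyntaxDemo {
    static List<String> parseTags(String argument) {
        int colon = argument.indexOf(':');
        if (colon < 0) return List.of();               // no tags supplied
        return Arrays.asList(argument.substring(colon + 1).split(","));
    }

    public static void main(String[] args) {
        if (!parseTags("-I:tumor").equals(List.of("tumor"))) throw new AssertionError();
        if (!parseTags("-eval:VCF").equals(List.of("VCF"))) throw new AssertionError();
        System.out.println(parseTags("-arg:tag1,tag2,tag3"));
    }
}
```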
<p>Tags can be accessed from within a walker by calling <code>getToolkit().getTags(argumentValue)</code>, where <code>argumentValue</code> is the parsed contents of the command-line argument to inspect.</p>
<h4>Applications</h4>
<p>The GATK currently has comprehensive support for tags on two built-in argument types:</p>
<ul>
<li>
<p><code>-I,--input_file <input_file></code></p>
<p>Input BAM files and BAM file lists can be tagged with any type. When a BAM file list is tagged, the tag is applied to each listed BAM file.</p>
<p>From within a walker, use the following code to access the supplied tag or tags:</p>
<pre><code class="pre_md">getToolkit().getReaderIDForRead(read).getTags();</code></pre>
</li>
<li>
<p>Input RODs, e.g. <code>-V <rod></code> or <code>-eval <rod></code></p>
<p>Tags are used to specify ROD name and ROD type. There is currently no support for adding additional tags. See the ROD system documentation for more details.</p>
</li>
</ul>
<h3>4. Adding additional command-line arguments</h3>
<p>Users can create command-line arguments for walkers by creating public member variables annotated with <code>@Argument</code> in the walker. The <code>@Argument</code> annotation takes a number of different parameters:</p>
<ul>
<li>
<p><code>fullName</code></p>
<p>The full name of this argument. Defaults to the <code>toLowerCase()</code>'d member name. When specifying <code>fullName</code> on the command line, prefix it with a double dash (<code>--</code>).</p>
</li>
<li>
<p><code>shortName</code></p>
<p>The alternate, short name for this argument. Defaults to the first letter of the member name. When specifying <code>shortName</code> on the command line, prefix it with a single dash (<code>-</code>).</p>
</li>
<li>
<p><code>doc</code></p>
<p>Documentation for this argument. It will appear in help output when a user either requests help with the <code>--help</code> (<code>-h</code>) argument or specifies an invalid set of arguments. <code>doc</code> is the only annotation parameter that is always required.</p>
</li>
<li>
<p><code>required</code></p>
<p>Whether the argument is required when used with this walker. The default is <code>required = true</code>.</p>
</li>
<li>
<p><code>exclusiveOf</code></p>
<p>Specifies that this argument is mutually exclusive of another argument in the same walker. Defaults to not being mutually exclusive of any other arguments.</p>
</li>
<li>
<p><code>validation</code></p>
<p>Specifies a regular expression used to validate the contents of the command-line argument. If the text provided by the user does not match this regex, the GATK will abort with an error.</p>
</li>
</ul>
<p>By default, all command-line arguments will appear in the help system. To prevent new and debugging arguments from appearing in the help system, you can add the <code>@Hidden</code> tag below the <code>@Argument</code> annotation, hiding it from the help system but still allowing users to supply it on the command line. Please use this functionality sparingly, to avoid walkers with hidden command-line options that are required for production use.</p>
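To show how an annotation-driven argument system of this kind can work, here is a self-contained toy sketch: a runtime annotation on a public field, discovered by reflection and filled from a command-line value. The nested `@Argument` annotation and `HelloWalker` class below are invented stand-ins, not the real GATK classes.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;

public class ArgumentDemo {
    // Toy version of an @Argument-style annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Argument {
        String fullName() default "";
        String doc();
        boolean required() default true;
    }

    static class HelloWalker {
        @Argument(fullName = "myint", doc = "my integer")
        public int myInt;
    }

    public static void main(String[] args) throws Exception {
        HelloWalker walker = new HelloWalker();
        // Pretend the user passed "--myint 6": find the annotated public
        // field by its fullName and set it reflectively.
        for (Field f : HelloWalker.class.getFields()) {
            Argument a = f.getAnnotation(Argument.class);
            if (a != null && a.fullName().equals("myint")) {
                f.setInt(walker, Integer.parseInt("6"));
            }
        }
        if (walker.myInt != 6) throw new AssertionError();
        System.out.println(walker.myInt);
    }
}
```

The real system adds type inference, collection handling, and validation on top of this basic reflection pattern.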
<h4>Passing command-line arguments</h4>
<p>Arguments can be passed to the walker using either the full name or the short name. If passing arguments using the full name, the syntax is <code>--<arg full name> <value></code>:</p>
<pre><code class="pre_md">--myint 6</code></pre>
<p>If passing arguments using the short name, the syntax is <code>-<arg short name> <value></code>. Note that there is a space between the short name and the value:</p>
<pre><code class="pre_md">-m 6</code></pre>
<p>Boolean (class) and boolean (primitive) arguments are special in that they require no value. The presence of the flag indicates true, and its absence indicates false. The following example sets a flag to true:</p>
<pre><code class="pre_md">-B</code></pre>
<h4>Supplemental command-line argument annotations</h4>
<p>Two additional annotations can influence the behavior of command-line arguments.</p>
<ul>
<li>
<p><code>@Hidden</code></p>
<p>Adding this annotation to an <code>@Argument</code> tells the help system to avoid displaying any evidence that this argument exists. This can be used to add additional debugging arguments that aren't suitable for mass consumption.</p>
</li>
<li>
<p><code>@Deprecated</code></p>
<p>Forces the GATK to throw an exception if this argument is supplied on the command line. This can be used to supply extra documentation to the user as command-line parameters change for walkers that are in flux.</p>
</li>
</ul>
<h4>Examples</h4>
<p>Create a required int parameter with full name <code>--myint</code> and short name <code>-m</code>. Pass this argument by adding <code>--myint 6</code> or <code>-m 6</code> to the command line.</p>
<pre><code class="pre_md">import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(doc="my integer")
    public int myInt;</code></pre>
<p>Create an optional float parameter with full name <code>--myFloatingPointArgument</code> and short name <code>-m</code>. Pass this argument by adding <code>--myFloatingPointArgument 2.71</code> or <code>-m 2.71</code>.</p>
<pre><code class="pre_md">import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(fullName="myFloatingPointArgument",doc="a floating point argument",required=false)
    public float myFloat;</code></pre>
<p>The GATK will parse the argument differently depending on the type of the public member variable. Many different argument types are supported, including primitives and their wrappers, arrays, typed and untyped collections, and any type with a String constructor. When the GATK cannot completely infer the type (such as in the case of untyped collections), it will assume that the argument is a String. The GATK is aware of concrete implementations of some interfaces and abstract classes: if the argument's member variable is of type <code>List</code> or <code>Set</code>, the GATK will fill the member variable with a concrete <code>ArrayList</code> or <code>TreeSet</code>, respectively. Maps are not currently supported.</p>
<h3>5. Additional argument types: @Input, @Output</h3>
<p>Besides <code>@Argument</code>, the GATK provides two additional annotations for command-line arguments: <code>@Input</code> and <code>@Output</code>. These two are very similar to <code>@Argument</code> but act as flags to indicate dataflow to <a href="http://gatkforums.broadinstitute.org/discussion/1306/overview-of-queue">Queue</a>, our pipeline management software.</p>
<ul>
<li>
<p>The <code>@Input</code> tag indicates that the contents of the tagged field represent a file that will be read by the walker.</p>
</li>
<li>The <code>@Output</code> tag indicates that the contents of the tagged field represent a file that will be written by the walker, for consumption by downstream walkers.</li>
</ul>
<p>We're still determining the best way to model walker dependencies in our pipeline. As we determine best practices, we'll post them here.</p>
<h3>6. Getting access to Reference Meta Data (RMD) with @Input and RodBinding<T></h3>
<p>As of August 2011, the GATK provides a clean mechanism for creating walker <code>@Input</code> arguments and using these arguments to access Reference Meta Data provided by the <code>RefMetaDataTracker</code> in the <code>map()</code> call. This mechanism is preferred to the old implicit string-based mechanism, which has been retired.</p>
<p>At a very high level, a <code>RodBinding</code> provides a handle for a walker to obtain, from a <code>map()</code> call, the <code>Feature</code> records that <code>Tribble</code> produces for a command line binding provided by the user. This can be as simple as a one-to-one binding between a single command line argument and a track, or as complex as an argument accepting multiple command line values, each with a specific name. <code>RodBindings</code> are generic and type specific, so you can require users to provide files that emit <code>VariantContext</code>s, <code>BedTable</code>s, etc., or simply the root type <code>Feature</code> from <code>Tribble</code>. Critically, <code>RodBindings</code> interact nicely with the GATKDocs system, so you can provide summary and detailed documentation for each <code>RodBinding</code> accepted by your walker.</p>
<h4>A single ROD file argument</h4>
<p>Suppose you have a walker that uses a single track of <code>VariantContext</code>s, such as <code>SelectVariants</code>, in its calculation. You declare a standard GATK-style <code>@Input</code> argument in the walker, of type <code>RodBinding<VariantContext></code>:</p>
<pre><code class="pre_md">@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;</code></pre>
<p>This will require the user to provide a command line option <code>--variant:vcf my.vcf</code> to your walker. To get access to your variants, in the <code>map()</code> function you provide the <code>variants</code> variable to the tracker, as in:</p>
<pre><code class="pre_md">Collection<VariantContext> vcs = tracker.getValues(variants, context.getLocation());</code></pre>
<p>which returns all of the <code>VariantContext</code>s in <code>variants</code> that start at <code>context.getLocation()</code>. See <code>RefMetaDataTracker</code> in the javadocs for the full range of getter routines.</p>
<p>Note that, as with regular Tribble tracks, you have to provide the <code>Tribble</code> type of the file as a tag to the argument (<code>:vcf</code>). The system now checks up front that the corresponding <code>Tribble</code> codec produces <code>Features</code> that are type-compatible with the type of the <code>RodBinding<T></code>.</p>
<h4>RodBindings are generic</h4>
<p>The <code>RodBinding</code> class is generic, parameterized as <code>RodBinding<T extends Feature></code>. The type parameter <code>T</code> describes the type of the <code>Feature</code> required by the walker. The best practice for declaring a <code>RodBinding</code> is to choose the most general <code>Feature</code> type that will allow your walker to work. For example, if all you really care about is whether a <code>Feature</code> overlaps the site in <code>map()</code>, you can use <code>Feature</code> itself, declaring a <code>RodBinding<Feature></code>, which supports this and allows any <code>Tribble</code> type to be provided. If you are manipulating <code>VariantContext</code>s, you should declare a <code>RodBinding<VariantContext></code>, which will automatically restrict the user to providing <code>Tribble</code> types that can create an object consistent with the <code>VariantContext</code> class (a <code>VariantContext</code> itself or a subclass).</p>
<p>Note that in multi-argument <code>RodBindings</code>, such as a <code>List<RodBinding<T>></code> argument, the system will require all files provided here to produce an object of type <code>T</code>. So a <code>List<RodBinding<VariantContext>></code> argument requires all <code>-arg</code> command line arguments to bind to files that produce <code>VariantContext</code>s.</p>
<h4>An argument that can be provided any number of times</h4>
<p>The <code>RodBinding</code> system supports the standard <code>@Argument</code> style of allowing a vararg argument by wrapping it in a Java collection. For example, if you want to allow users to provide any number of comp tracks to your walker, simply declare a <code>List<RodBinding<VariantContext>></code> field:</p>
<pre><code class="pre_md">@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;</code></pre>
<p>With this declaration, your walker will accept any number of <code>-comp</code> arguments, as in:</p>
<pre><code class="pre_md">-comp:vcf 1.vcf -comp:vcf 2.vcf -comp:vcf 3.vcf</code></pre>
<p>For such a command line, the <code>comps</code> field would be initialized to a <code>List</code> of three <code>RodBindings</code>, the first bound to <code>1.vcf</code>, the second to <code>2.vcf</code>, and the third to <code>3.vcf</code>.</p>
<p>Because this is a required argument, at least one <code>-comp</code> must be provided. Vararg <code>@Input</code> <code>RodBindings</code> can be optional, but you should follow proper varargs style to get the best results.</p>
<h4>Proper handling of optional arguments</h4>
|
||||
<p>If you want to make a <code>RodBinding</code> optional, you first need to tell the <code>@Input</code> argument that it is optional (<code>required=false</code>):</p>
|
||||
<pre><code class="pre_md">@Input(fullName="discordance", required=false)
|
||||
private RodBinding<VariantContext> discordanceTrack;</code class="pre_md"></pre>
|
||||
<p>The GATK automatically sets such a field to the value returned by the special static constructor method <code>makeUnbound(Class c)</code>, which creates a special "unbound" <code>RodBinding</code>. This unbound object is type safe, can be safely passed to the <code>RefMetaDataTracker</code> get methods, and is guaranteed to never return any values. It also returns <code>false</code> when its <code>isBound()</code> method is called.</p>
|
||||
<p>An example usage of <code>isBound</code> is to conditionally add header lines, as in:</p>
|
||||
<pre><code class="pre_md">if ( mask.isBound() ) {
|
||||
hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));
|
||||
}</code class="pre_md"></pre>
|
||||
<p>The case for vararg-style <code>RodBindings</code> is slightly different. If you want users to be able to omit the <code>-comp</code> track entirely, as above, you should initialize the collection to the appropriate <code>emptyList</code>/<code>emptySet</code> from <code>Collections</code>:</p>
|
||||
<pre><code class="pre_md">@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=false)
|
||||
public List<RodBinding<VariantContext>> comps = Collections.emptyList();</code class="pre_md"></pre>
|
||||
<p>which will ensure that <code>comps.isEmpty()</code> is true when no <code>-comp</code> is provided.</p>
|
||||
<h4>Implicit and explicit names for RodBindings</h4>
|
||||
<pre><code class="pre_md">@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
|
||||
public RodBinding<VariantContext> variants;</code class="pre_md"></pre>
|
||||
<p>By default, the <code>getName()</code> method in <code>RodBinding</code> returns the <code>fullName</code> of the <code>@Input</code>. This can be overridden on the command line by providing not one but two tags. The first tag is interpreted as the name for the binding, and the second as the type, as in:</p>
|
||||
<pre><code class="pre_md">-variant:vcf foo.vcf => getName() == "variant"
|
||||
-variant:foo,vcf foo.vcf => getName() == "foo"</code class="pre_md"></pre>
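<p>The tag-splitting behavior can be sketched in plain Java. This is an illustrative parser with a hypothetical <code>TagParser</code> helper, not the GATK's actual argument-parsing code:</p>

```java
// Illustrative sketch (hypothetical helper, NOT the GATK argument parser):
// splitting the tag portion of "-arg:tags file" into a binding name and a
// file type. With a single tag, the tag is the type and the name falls back
// to the argument's fullName.
class TagParser {
    static String[] parse(String fullName, String tags) {
        String[] parts = tags.split(",");
        return parts.length == 2
                ? new String[]{ parts[0], parts[1] }  // explicit name, then type
                : new String[]{ fullName, parts[0] }; // name defaults to fullName
    }
}
```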
|
||||
<p>This capability is useful when users need to provide more meaningful names for arguments, especially with variable arguments. For example, <code>VariantEval</code> has a <code>List<RodBinding<VariantContext>></code> <code>comps</code> field, whose elements may be <code>dbsnp</code>, <code>hapmap</code>, etc. This would be declared as:</p>
|
||||
<pre><code class="pre_md">@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
|
||||
public List<RodBinding<VariantContext>> comps;</code class="pre_md"></pre>
|
||||
<p>where a normal command line usage would look like:</p>
|
||||
<pre><code class="pre_md">-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf</code class="pre_md"></pre>
|
||||
<p>In the code, you might have a loop that looks like:</p>
|
||||
<pre><code class="pre_md">for ( final RodBinding<VariantContext> comp : comps )
    for ( final VariantContext vc : tracker.getValues(comp, context.getLocation()) )
        out.printf("%s has a binding at %s%n", comp.getName(), getToolkit().getGenomeLocParser().createGenomeLoc(vc)); </code class="pre_md"></pre>
|
||||
<p>which would print out lines that included things like:</p>
|
||||
<pre><code class="pre_md">hapmap has a binding at 1:10
|
||||
omni has a binding at 1:20
|
||||
hapmap has a binding at 1:30
|
||||
1000g has a binding at 1:30</code class="pre_md"></pre>
|
||||
<p>This last example raises a question -- what happens with <code>getName()</code> when explicit names are not provided? The system goes out of its way to provide reasonable default names for the bindings: </p>
|
||||
<ul>
|
||||
<li>
|
||||
<p>The first occurrence is named for the <code>fullName</code>, here <code>comp</code></p>
|
||||
</li>
|
||||
<li>Subsequent occurrences are postfixed with an integer count, starting at 2, so <code>comp2</code>, <code>comp3</code>, etc.</li>
|
||||
</ul>
|
||||
<p>In the above example, the command line </p>
|
||||
<pre><code class="pre_md">-comp:vcf hapmap.vcf -comp:vcf omni.vcf -comp:vcf 1000g.vcf</code class="pre_md"></pre>
|
||||
<p>would emit</p>
|
||||
<pre><code class="pre_md">comp has a binding at 1:10
|
||||
comp2 has a binding at 1:20
|
||||
comp has a binding at 1:30
|
||||
comp3 has a binding at 1:30</code class="pre_md"></pre>
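<p>The default-naming rule above can be sketched as follows. This is illustrative code with a hypothetical <code>DefaultNamer</code> class, not the GATK implementation:</p>

```java
import java.util.*;

// Illustrative sketch (NOT GATK code): deriving default names for repeated
// occurrences of one argument -- the first keeps the fullName, later ones
// are postfixed with a counter starting at 2.
class DefaultNamer {
    static List<String> nameAll(String fullName, int occurrences) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= occurrences; i++) {
            names.add(i == 1 ? fullName : fullName + i);
        }
        return names;
    }
}
```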
|
||||
<h4>Dynamic type resolution</h4>
|
||||
<p>The new <code>RodBinding</code> system supports a simple form of dynamic type resolution. If the input file type can be uniquely associated with a single <code>Tribble</code> type (as VCF can), then you can omit the type entirely from the command-line binding of a <code>RodBinding</code>!</p>
|
||||
<p>So whereas a full command line would look like:</p>
|
||||
<pre><code class="pre_md">-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf</code class="pre_md"></pre>
|
||||
<p>because these are VCF files they could technically be provided as:</p>
|
||||
<pre><code class="pre_md">-comp:hapmap hapmap.vcf -comp:omni omni.vcf -comp:1000g 1000g.vcf</code class="pre_md"></pre>
|
||||
<p>If you don't care about naming, you can now say:</p>
|
||||
<pre><code class="pre_md">-comp hapmap.vcf -comp omni.vcf -comp 1000g.vcf</code class="pre_md"></pre>
|
||||
<h4>Best practice for documenting a RodBinding</h4>
|
||||
<p>The best practice is simple: use a javadoc style comment above the <code>@Input</code> annotation, with the standard first line summary and subsequent detailed discussion of the meaning of the argument. These are then picked up by the GATKdocs system and added to the standard walker docs, following the standard structure of GATKDocs <code>@Argument</code> docs. Below is a best practice documentation example from <code>SelectVariants</code>, which accepts a required variant track and two optional discordance and concordance tracks.</p>
|
||||
<pre><code class="pre_md">public class SelectVariants extends RodWalker<Integer, Integer> {
|
||||
/**
|
||||
* Variants from this file are sent through the filtering and modifying routines as directed
|
||||
* by the arguments to SelectVariants, and finally are emitted.
|
||||
*/
|
||||
@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
|
||||
public RodBinding<VariantContext> variants;
|
||||
|
||||
/**
|
||||
* A site is considered discordant if there exists some sample in eval that has a non-reference genotype
|
||||
* and either the site isn't present in this track, the sample isn't present in this track,
|
||||
* or the sample is called reference in this track.
|
||||
*/
|
||||
@Input(fullName="discordance", shortName = "disc", doc="Output variants that were not called in this Feature comparison track", required=false)
|
||||
private RodBinding<VariantContext> discordanceTrack;
|
||||
|
||||
/**
|
||||
* A site is considered concordant if (1) we are not looking for specific samples and there is a variant called
|
||||
* in both variants and concordance tracks or (2) every sample present in eval is present in the concordance
|
||||
* track and they have the sample genotype call.
|
||||
*/
|
||||
@Input(fullName="concordance", shortName = "conc", doc="Output variants that were also called in this Feature comparison track", required=false)
|
||||
private RodBinding<VariantContext> concordanceTrack;
|
||||
}</code class="pre_md"></pre>
|
||||
<p>Note how much better the above version is compared to the old pre-<code>RodBinding</code> syntax (code below). There, the required <code>variant</code> argument doesn't show up as a formal argument in the GATK, unlike the conceptually similar <code>@Argument</code>s for <code>discordanceRodName</code> and <code>concordanceRodName</code>, which in turn have no type restrictions. There's also no place to document the <code>variant</code> argument, so the system is effectively blind to this essential argument.</p>
|
||||
<pre><code class="pre_md">@Requires(value={},referenceMetaData=@RMD(name="variant", type=VariantContext.class))
|
||||
public class SelectVariants extends RodWalker<Integer, Integer> {
|
||||
@Argument(fullName="discordance", shortName = "disc", doc="Output variants that were not called on a ROD comparison track. Use -disc ROD_NAME", required=false)
|
||||
private String discordanceRodName = "";
|
||||
|
||||
@Argument(fullName="concordance", shortName = "conc", doc="Output variants that were also called on a ROD comparison track. Use -conc ROD_NAME", required=false)
|
||||
private String concordanceRodName = "";
|
||||
}</code class="pre_md"></pre>
|
||||
<h4>RodBinding examples</h4>
|
||||
<p>In these examples, we have declared two <code>RodBindings</code> in the Walker</p>
|
||||
<pre><code class="pre_md">@Input(fullName="mask", doc="Input ROD mask", required=false)
public RodBinding<Feature> mask = RodBinding.makeUnbound(Feature.class);

@Input(fullName="comp", doc="Comparison track", required=false)
public List<RodBinding<VariantContext>> comps = new ArrayList<RodBinding<VariantContext>>();</code class="pre_md"></pre>
|
||||
<ul>
|
||||
<li>
|
||||
<p>Get the first value</p>
|
||||
<p><code>Feature f = tracker.getFirstValue(mask)</code></p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Get all of the values at a location</p>
|
||||
<p><code>Collection<Feature> fs = tracker.getValues(mask, thisGenomeLoc)</code></p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Get all of the features here, regardless of track </p>
|
||||
<p><code>Collection<Feature> fs = tracker.getValues(Feature.class)</code></p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Determine whether an optional RodBinding was provided:</p>

<pre><code class="pre_md">if ( mask.isBound() ) // writes out the mask header line, if one was provided
    hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));

if ( ! comps.isEmpty() )
    logger.info("At least one comp was provided");</code class="pre_md"></pre>
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage in Queue scripts</h4>
|
||||
<p>In <a href="http://gatkforums.broadinstitute.org/discussion/1307/queue-pipeline-scripts-qscripts">QScripts</a> when you need to tag a file use the class <code>TaggedFile</code> which extends from <code>java.io.File</code>.</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align: left;">Example</th>
|
||||
<th style="text-align: left;">in the QScript</th>
|
||||
<th style="text-align: left;">on the Command Line</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align: left;">Untagged VCF</td>
|
||||
<td style="text-align: left;"><code>myWalker.variant = new File("my.vcf")</code></td>
|
||||
<td style="text-align: left;"><code>-V my.vcf</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">Tagged VCF</td>
|
||||
<td style="text-align: left;"><code>myWalker.variant = new TaggedFile("my.vcf", "VCF")</code></td>
|
||||
<td style="text-align: left;"><code>-V:VCF my.vcf</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">Tagged VCF</td>
|
||||
<td style="text-align: left;"><code>myWalker.variant = new TaggedFile("my.vcf", "VCF,custom=value")</code></td>
|
||||
<td style="text-align: left;"><code>-V:VCF,custom=value my.vcf</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">Labeling a tumor</td>
|
||||
<td style="text-align: left;"><code>myWalker.input_file :+= new TaggedFile("mytumor.bam", "tumor")</code></td>
|
||||
<td style="text-align: left;"><code>-I:tumor mytumor.bam</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<h4>Notes</h4>
|
||||
<p>You no longer need to (nor can you) use <code>@Requires</code> and <code>@Allows</code> for ROD data. This system is now retired.</p>
|
||||
## Managing walker data presentation and flow control
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1351/managing-walker-data-presentation-and-flow-control
|
||||
|
||||
<p>The primary goal of the GATK is to provide a suite of small data access patterns that can easily be parallelized and otherwise externally managed. As such, rather than asking walker authors how to iterate over a data stream, the GATK asks the user how data should be presented. </p>
|
||||
<h2>Locus walkers</h2>
|
||||
<p>Walk over the data set one location (single-base locus) at a time, presenting all overlapping reads, reference bases, and reference-ordered data.</p>
|
||||
<h3>1. Switching between covered and uncovered loci</h3>
|
||||
<p>The <code>@By</code> attribute can be used to control whether locus walkers see all loci or just covered loci. To switch between viewing all loci and covered loci, apply one of the following attributes:</p>
|
||||
<pre><code class="pre_md">@By(DataSource.REFERENCE)
|
||||
@By(DataSource.READS)</code class="pre_md"></pre>
|
||||
<h3>2. Filtering defaults</h3>
|
||||
<p>By default, the following filters are automatically added to every locus walker.</p>
|
||||
<ul>
|
||||
<li>Reads with nonsensical alignments</li>
<li>Unmapped reads</li>
<li>Non-primary alignments</li>
<li>Duplicate reads</li>
<li>Reads failing vendor quality checks</li>
|
||||
</ul>
|
||||
<h2>ROD walkers</h2>
|
||||
<p>These walkers walk over the data set one location at a time, but only at locations covered by reference-ordered data. They are essentially a special case of locus walkers. ROD walkers are read-free traversals that operate over Reference Ordered Data and the reference genome <strong>at sites where there is ROD information</strong>. They are geared for high-performance traversal of many RODs and the reference, such as VariantEval and CallSetConcordance. Programmatically they are nearly identical to <code>RefWalker<M,T></code> traversals, with the following few quirks.</p>
|
||||
<h3>1. Differences from a RefWalker</h3>
|
||||
<ul>
|
||||
<li>
|
||||
<p>ROD walkers are invoked only at sites where at least one non-interval ROD is bound. For example, if you are exploring dbSNP and some GELI call set, the map function of a ROD walker will be invoked at all sites where there is a dbSNP record or a GELI record.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Because of this skipping, ROD walkers receive a context object that reports the number of reference bases skipped between map calls: </p>

<pre><code class="pre_md">nSites += context.getSkippedBases() + 1; // the skipped bases plus the current location</code class="pre_md"></pre>
|
||||
</li>
|
||||
</ul>
|
||||
<p>In order to get the final count of skipped bases at the end of an interval (or chromosome), the map function is called one last time with null <code>ReferenceContext</code> and <code>RefMetaDataTracker</code> objects. The alignment context can then be accessed to get the bases skipped between the last (and final) ROD and the end of the current interval. </p>
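<p>The bookkeeping described above can be sketched in plain Java. This is an illustrative stand-in with a hypothetical <code>SiteCounter</code> class, not GATK code, including the final map call that contributes only its skipped bases:</p>

```java
// Illustrative sketch (NOT GATK classes): accumulating the total number of
// reference sites visited by a ROD traversal, where each map() call reports
// how many bases were skipped since the previous call. The final call (the
// one made with null ref/tracker) contributes only its skipped bases.
class SiteCounter {
    private long nSites = 0;

    void onMapCall(int skippedBases, boolean finalCall) {
        nSites += skippedBases;       // bases with no ROD between calls
        if (!finalCall) nSites += 1;  // the current ROD-covered site itself
    }

    long total() { return nSites; }
}
```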
|
||||
<h3>2. Filtering defaults</h3>
|
||||
<p>ROD walkers inherit the same filters as locus walkers:</p>
|
||||
<ul>
|
||||
<li>Reads with nonsensical alignments</li>
<li>Unmapped reads</li>
<li>Non-primary alignments</li>
<li>Duplicate reads</li>
<li>Reads failing vendor quality checks</li>
|
||||
</ul>
|
||||
<h3>3. Example changeover of VariantEval</h3>

<p>Changing over to a <code>RodWalker</code> is very easy -- here's the new top of VariantEval, switching it from its old <code>RefWalker</code> state:</p>
|
||||
<pre><code class="pre_md">//public class VariantEvalWalker extends RefWalker<Integer, Integer> {
|
||||
public class VariantEvalWalker extends RodWalker<Integer, Integer> {</code class="pre_md"></pre>
|
||||
<p>The map function must now capture the number of skipped bases and protect itself from the final interval map calls:</p>
|
||||
<pre><code class="pre_md">public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
|
||||
nMappedSites += context.getSkippedBases();
|
||||
|
||||
if ( ref == null ) { // we are seeing the last site
|
||||
return 0;
|
||||
}
|
||||
|
||||
nMappedSites++;</code class="pre_md"></pre>
|
||||
<p>That's all there is to it!</p>
|
||||
<h3>4. Performance improvements</h3>
|
||||
<p>A ROD walker can be very efficient compared to a RefWalker in the situation where you have sparse RODs. Here is a comparison of ROD vs. Ref walker implementation of VariantEval:</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align: left;"></th>
|
||||
<th style="text-align: left;">RODWalker</th>
|
||||
<th style="text-align: left;">RefWalker</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align: left;">dbSNP and 1KG Pilot 2 SNP calls on chr1</td>
|
||||
<td style="text-align: left;">164 s (user)</td>
|
||||
<td style="text-align: left;">768 s (user)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">Just 1KG Pilot 2 SNP calls on chr1</td>
|
||||
<td style="text-align: left;">54 s (user)</td>
|
||||
<td style="text-align: left;">666 s (user)</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<h2>Read walkers</h2>
|
||||
<p>Read walkers walk over the data set one read at a time, presenting all overlapping reference bases and reference-ordered data.</p>
|
||||
<h3>Filtering defaults</h3>
|
||||
<p>By default, the following filters are automatically added to every read walker.</p>
|
||||
<ul>
|
||||
<li>Reads with nonsensical alignments</li>
|
||||
</ul>
|
||||
<h2>Read pair walkers</h2>
|
||||
<p>Read pair walkers walk over a queryname-sorted BAM, presenting each mate and its pair. No reference bases or reference-ordered data are presented.</p>
|
||||
<h3>Filtering defaults</h3>
|
||||
<p>By default, the following filters are automatically added to every read pair walker.</p>
|
||||
<ul>
|
||||
<li>Reads with nonsensical alignments</li>
|
||||
</ul>
|
||||
<h2>Duplicate walkers</h2>
|
||||
<p>Duplicate walkers walk over a read and all its marked duplicates. No reference bases or reference-ordered data are presented.</p>
|
||||
<h3>Filtering defaults</h3>
|
||||
<p>By default, the following filters are automatically added to every duplicate walker.</p>
|
||||
<ul>
|
||||
<li>Reads with nonsensical alignments</li>
|
||||
<li>Unmapped reads</li>
|
||||
<li>Non-primary alignments</li>
|
||||
</ul>
|
||||
## Migration from Apache Ant to Apache Maven
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/3437/migration-from-apache-ant-to-apache-maven
|
||||
|
||||
<h1>Overview</h1>
|
||||
<hr />
|
||||
<p><strong>We're replacing Ant with Maven. To build, run <code>mvn verify</code>.</strong></p>
|
||||
<h2>Background</h2>
|
||||
<p>In the early days of the Genome Analysis Toolkit (GATK), the code base separated the GATK genomics engine from the core java utilities, encompassed in a wider project called Sting. During this time, the build tool of choice was the relatively flexible Java build tool <a href="http://ant.apache.org">Apache Ant</a>, run via the command <code>ant</code>.</p>
|
||||
<p>As our code base expanded to more and more packages, groups internal and external to GSA and the Broad have expressed interest in using portions of Sting/GATK as modules in larger projects. Unfortunately, over time many parts of the GATK and Sting intermingled, producing the current situation where developers find it easier to copy the monolithic GATK, or individual Java files, instead of using the tools as libraries.</p>
|
||||
<p>The goal of this first stage is to split the parts of the monolithic Sting/GATK into easily recognizable sub artifacts. The tool used to accomplish this task is <a href="http://maven.apache.org">Apache Maven</a>, also known as <em>Maven</em>, and run via the command <code>mvn</code>. Maven convention encourages developers to separate code, and accompanying resources, into a hierarchical structure of reusable artifacts. Maven attempts to avoid build configuration, preferring source repositories to lay out code in a conventional structure. When needed, a Maven configuration file called <em>pom.xml</em> specifies each artifact's build configuration, that one may think of as similar to an Ant <em>build.xml</em>.</p>
|
||||
<p>The actual migration consisted of zero changes to the contents of existing Java source files, easing git merges and rebasing. The Java files from public, protected, and private have all moved into Maven conventional child artifacts, with each artifact containing a separate <em>pom.xml</em>.</p>
|
||||
<h1>Examples</h1>
|
||||
<h2>Obtaining the GATK with Maven support</h2>
|
||||
<p>Clone the repository:</p>
|
||||
<pre><code class="pre_md">git clone ssh://git@github.com/broadinstitute/gsa-unstable.git
cd gsa-unstable</code class="pre_md"></pre>
|
||||
<h2>Building GATK and Queue</h2>
|
||||
<p>Clone the repository:</p>
|
||||
<pre><code class="pre_md">git clone ssh://git@github.com/broadinstitute/gsa-unstable.git
cd gsa-unstable</code class="pre_md"></pre>
|
||||
<p>If running on a Broad server, add maven to your environment via the dotkit:</p>
|
||||
<p><code>reuse Maven-3.0.3</code></p>
|
||||
<p>Build all of Sting, including packaged versions of the GATK and Queue:</p>
|
||||
<p><code>mvn verify</code></p>
|
||||
<p>The packaged, executable jar files will be output to:</p>
|
||||
<pre><code class="pre_md">public/gatk-package/target/gatk-package-2.8-SNAPSHOT.jar
public/queue-package/target/queue-package-2.8-SNAPSHOT.jar</code class="pre_md"></pre>
|
||||
<p>Find equivalent maven commands for existing ant targets:</p>
|
||||
<p><code>./ant-bridge.sh <target> <properties></code></p>
|
||||
<p>Example output:</p>
|
||||
<pre><code class="pre_md">$ ./ant-bridge.sh fasttest -Dsingle=GATKKeyUnitTest
Equivalent maven command
mvn verify -Dsting.committests.skipped=false -pl private/gatk-private -am -Dresource.bundle.skip=true -Dit.test=disabled -Dtest=GATKKeyUnitTest
$</code class="pre_md"></pre>
|
||||
<h2>Running the GATK and Queue</h2>
|
||||
<p>To run the GATK, or copy the compiled jar, find the packaged jar under public/gatk-package/target</p>
|
||||
<p><code>public/gatk-package/target/gatk-package-2.8-SNAPSHOT.jar</code></p>
|
||||
<p>To run Queue, the jar is under the similarly named public/queue-package/target</p>
|
||||
<p><code>public/queue-package/target/queue-package-2.8-SNAPSHOT.jar</code></p>
|
||||
<p><strong>NOTE:</strong> Unlike builds with Ant, you <em>cannot</em> execute the jar file built by the gatk-framework module. This is because maven does not include dependent artifacts in the target folder with assembled framework jar. Instead, use the packaged jars, listed above, that contain all the classes and resources needed to run the GATK, or Queue.</p>
|
||||
<h2>Excluding Queue</h2>
|
||||
<p><em>NOTE:</em> If you make changes to sting-utils, gatk-framework, or any other dependencies <em>and</em> disable queue, you may accidentally end up breaking the full repository build without knowing.</p>
|
||||
<p>The Queue build contributes the majority of the Sting project build time. To exclude Queue from your build, run Maven with either (the already shell-escaped) <code>-P\!queue</code> or <code>-Ddisable.queue</code>. Currently the latter property also disables the Maven queue profile, which allows one more semi-permanent option for disabling the Queue build as part of the Sting repository: configure your local Maven settings to always pass the property <code>-Ddisable.queue</code> by adding and activating a custom profile in your local <code>~/.m2/settings.xml</code>:</p>
|
||||
<pre><code class="pre_md">$ cat ~/.m2/settings.xml
<settings>
    <!-- Other settings.xml changes... -->

    <!-- Define a new profile to set disable.queue -->
    <profiles>
        <profile>
            <id>disable.queue</id>
            <properties>
                <disable.queue>true</disable.queue>
            </properties>
        </profile>
    </profiles>

    <!-- Activate the profile defined above -->
    <activeProfiles>
        <activeProfile>disable.queue</activeProfile>
    </activeProfiles>
</settings>
$</code class="pre_md"></pre>
|
||||
<h2>Using the GATK framework as a module</h2>
|
||||
<p>Currently the GATK artifacts are not available via any centralized repository. To build code using the GATK you must still have a checkout of the GATK source code, and install the artifacts to your local Maven repository (by default <code>~/.m2/repository</code>). The installation copies the artifacts to your local repo so that they may be used by your external project. The source checkout itself provides several artifacts under <code>public/repo</code> that your project will require.</p>
|
||||
<p>After updating to the latest version of the Sting source code, install the Sting artifacts via:</p>
|
||||
<p><code>mvn install</code></p>
|
||||
<p>After the GATK has been installed locally, include the artifact gatk-framework as a library in your own source repository.</p>
|
||||
<p>In Apache Maven add this dependency:</p>
|
||||
<pre><code class="pre_md"><dependency>
    <groupId>org.broadinstitute.sting</groupId>
    <artifactId>gatk-framework</artifactId>
    <version>2.8-SNAPSHOT</version>
</dependency></code class="pre_md"></pre>
|
||||
<p>For Apache Ivy, you may need to specify <code>~/.m2/repository</code> as a local repo. Once the local repository has been configured, ivy may find the dependency via:</p>
|
||||
<p><code><dependency org="org.broadinstitute.sting" name="gatk-framework" rev="2.8-SNAPSHOT" /></code></p>
|
||||
<p>If you decide to also use Maven to build your project, your source code should go under the conventional directory <code>src/main/java</code>, and the <code>pom.xml</code> contains any special configuration for your project. An example <code>pom.xml</code> and Maven-conventional project structure can be found in:</p>
|
||||
<p><code>public/external-example</code></p>
|
||||
<h2>Moved directories</h2>
|
||||
<p>If you have an old git branch that needs to be merged, you may need to know where to move files in order for your classes to now build with Maven. In general, most directories were moved with minimal or no changes.</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th><strong>Old directory</strong></th>
|
||||
<th><strong>New maven directory</strong></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>private/java/src/</td>
|
||||
<td>private/gatk-private/src/main/java/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>private/R/scripts/</td>
|
||||
<td>private/gatk-private/src/main/resources/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>private/java/test/</td>
|
||||
<td>private/gatk-private/src/test/java/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>private/testdata/</td>
|
||||
<td>private/gatk-private/src/test/resources/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>private/scala/qscript/</td>
|
||||
<td>private/queue-private/src/main/qscripts/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>private/scala/src/</td>
|
||||
<td>private/queue-private/src/main/scala/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>private/scala/test/</td>
|
||||
<td>private/queue-private/src/test/scala/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>protected/java/src/</td>
|
||||
<td>protected/gatk-protected/src/main/java/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>protected/java/test/</td>
|
||||
<td>protected/gatk-protected/src/test/java/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>public/java/src/</td>
|
||||
<td>public/gatk-framework/src/main/java/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>public/java/test/</td>
|
||||
<td>public/gatk-framework/src/test/java/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>public/testdata/</td>
|
||||
<td>public/gatk-framework/src/test/resources/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>public/scala/qscript/</td>
|
||||
<td>public/queue-framework/src/main/qscripts/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>public/scala/src/</td>
|
||||
<td>public/queue-framework/src/main/scala/</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>public/scala/test/</td>
|
||||
<td>public/queue-framework/src/test/scala/</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<h1>Future Directions</h1>
|
||||
<h2>Further segregate source code</h2>
|
||||
<p>Currently, the artifacts sting-utils and gatk-framework contain intertwined code bases. This leads to the current setup where all sting-utils code is actually found in the gatk-framework artifact, including generic utilities that could be used by other software modules. In the future, all elements under <code>org.broadinstitute.sting.gatk</code> will be located in gatk-framework, while all other packages under <code>org.broadinstitute.sting</code> will be evaluated and then separated into the gatk-framework or sting-utils artifacts.</p>
|
||||
<h2>Publishing artifacts</h2>
|
||||
<p>Tangentially related to segregating sting-utils and gatk-framework, the current Sting and GATK artifacts are ineligible to be pushed to the <a href="http://search.maven.org">Maven Central Repository</a> due to several issues:</p>
|
||||
<ul>
|
||||
<li>Need to provide trivial workflow for Picard, and possibly SnpEff, to submit to central</li>
|
||||
<li>Missing <a href="https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide#SonatypeOSSMavenRepositoryUsageGuide-6.CentralSyncRequirement">meta files</a> for the jars:
|
||||
<ul>
|
||||
<li>*-sources.jar</li>
|
||||
<li>*-javadoc.jar</li>
|
||||
<li>*.md5</li>
|
||||
<li>*.sha1</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
<p><em>NOTE:</em> Artifact jars do NOT need to actually be in Central, and may be available as pom reference only, for example <a href="http://central.maven.org/maven2/com/oracle/ojdbc14/">Oracle ojdbc</a>.</p>
|
||||
<p>In the near term, we could use a private repo based on <a href="http://www.jfrog.com/home/v_artifactorycloud_overview">Artifactory</a> or <a href="http://www.sonatype.org/nexus">Nexus</a> (<a href="http://docs.codehaus.org/display/MAVENUSER/Maven+Repository+Manager+Feature+Matrix">comparison</a>). After more work adding, cleaning up, or centrally publishing all of Sting's dependencies, we may then publish to the main Central repo. Or we could move to a social service like <a href="https://bintray.com">BinTray</a> (think GitHub vs. Git).</p>
|
||||
<h1>Status Updates</h1>
|
||||
<h2>February 13, 2014</h2>
|
||||
<p>Maven is now the default in gsa-unstable's master branch. For GATK developers, the git migration is effectively complete. Software engineers are resolving a few remaining issues related to the automated build and testing infrastructure, but the basic workflow for developers should now be up to date.</p>
|
||||
<h2>January 30, 2014</h2>
|
||||
<p>The migration to Maven has begun in the <a href="https://github.com/broadinstitute/gsa-unstable">gsa-unstable repository</a> on the <code>ks_new_maven_build_system</code> branch.</p>
|
||||
<h2>November 5, 2013</h2>
|
||||
<p>The maven port of the existing ant build resides in the <a href="https://github.com/broadinstitute/gsa-qc">gsa-qc repository</a>.</p>
|
||||
<p>This is an old branch of Sting/GATK, with the existing files relocated to Maven appropriate locations, pom.xml files added, along with basic resources to assist in artifact generation.</p>
|
||||
## Notes on downsampling in HC/M2
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/8028/notes-on-downsampling-in-hc-m2
|
||||
|
||||
<p><strong>This document aims to record some developer notes for posterity. Contents were generated July 24, 2015 and are not guaranteed to be up to date. No support guarantee either.</strong></p>
|
||||
<hr />
|
||||
<h3>Arguments and Parameters</h3>
|
||||
<ul>
|
||||
<li>"@Downsample" annotation in the class definition for HC/M2 controls the height of the pileup used for active region determination -- HC has a default coverage of 500, while M2 takes on the ActiveRegionWalker default, which is 1000</li>
|
||||
<li>"@ActiveRegionTraversalParameters" argument controls the maximum number of reads that can possibly be processed; has default maxReadsToHoldInMemoryPerSample() = 30,000 and across all samples maxReadsToHoldTotal() = 10,000,000 -- these are not currently overridden in HC or M2</li>
|
||||
<li>maxReadsInRegionPerSample and minReadsPerAlignmentStart (arguments in HC, hard-coded in M2 right now) loosely control the number of reads that go into the assembly step; default is 10K and 10 for HC, hard-coded 1000 and 5 for M2</li>
|
||||
</ul>
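<p>These caps combine as described under Relevant Code below: the ReservoirDownsampler capacity is min(maxReadsToHoldTotal, maxReadsToHoldInMemoryPerSample * nSamples). A quick arithmetic sketch (the helper name is invented for illustration): the per-sample limit binds for small cohorts, the total limit for very large ones.</p>

```java
// Illustrative only: effectiveCapacity is a made-up helper name, but the
// min(total, perSample * nSamples) rule mirrors the defaults quoted above.
final class TraversalLimits {
    static long effectiveCapacity(long maxReadsToHoldTotal,
                                  long maxReadsToHoldInMemoryPerSample,
                                  int nSamples) {
        return Math.min(maxReadsToHoldTotal, maxReadsToHoldInMemoryPerSample * nSamples);
    }

    public static void main(String[] args) {
        // With the defaults above: 10 samples -> 300,000 reads held;
        // 500 samples -> capped by the 10,000,000 total.
        System.out.println(effectiveCapacity(10_000_000L, 30_000L, 10));
        System.out.println(effectiveCapacity(10_000_000L, 30_000L, 500));
    }
}
```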
|
||||
<h3>Relevant Code</h3>
|
||||
<ul>
|
||||
<li>TraverseActiveRegions.java does a lot of the data management (including some downsampling) and does the iteration over ActiveRegions in traverse()</li>
|
||||
<li>TraverseActiveRegions takes in the ActiveRegionTraversalParameters annotations and creates a TAROrderedReadCache (this is where all the reads that get passed to the Walker are stored)</li>
|
||||
<li>TAROrderedReadCache contains a ReservoirDownsampler </li>
|
||||
<li>ReservoirDownsampler is unbiased with regard to read start position; gets initialized with maxCapacity = min(maxReadsToHoldTotal, maxReadsToHoldInMemoryPerSample*nSamples)</li>
|
||||
<li>Reads that go into the Walker's map() function get downsampled by the ReservoirDownsampler to exactly maxCapacity if they exceed the maxCapacity -- at this point this is the most reads you can ever use for calculations</li>
|
||||
<li>Reads that go into the assembly step (already filtered for MQ) get downsampled by the LevelingDownsampler to approximately maxReadsInRegionPerSample if the number of reads exceeds maxReadsInRegionPerSample
|
||||
<ul>
|
||||
<li>(my maxReadsInRegionPerSample in one step-through was 1037, but my downsampleReads was 3003 over 100bp, so it seems pretty approximate)</li>
|
||||
</ul></li>
|
||||
<li>The LevelingDownsampler is intentionally biased because it maintains a minimum coverage at each base as specified by minReadsPerAlignmentStart</li>
|
||||
<li>ActiveRegionTrimmer.Result trimmingResult in the Walker's map() function recovers reads (up to the TAROrderedReadCache maxCapacity) by pulling them from the originalActiveRegion, but trims them to variation events found in the (potentially downsampled) assembly</li>
|
||||
<li>Genotyping is performed based largely on the set of reads going into the map() function (M2 filters for quality with filterNonPassingReads before genotyping)</li>
|
||||
</ul>
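<p>The ReservoirDownsampler's position-unbiased behavior is textbook reservoir sampling. A generic sketch (this is not the GATK class; plain objects stand in for reads):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Algorithm R: every element seen so far has equal probability of being kept,
// which is why the downsampling is unbiased with regard to read start position.
final class Reservoir<T> {
    private final List<T> kept = new ArrayList<>();
    private final int maxCapacity;
    private final Random rng;
    private long seen = 0;

    Reservoir(int maxCapacity, long seed) {
        this.maxCapacity = maxCapacity;
        this.rng = new Random(seed);
    }

    void add(T item) {
        seen++;
        if (kept.size() < maxCapacity) {
            kept.add(item);              // still filling the reservoir
        } else {
            long j = (long) (rng.nextDouble() * seen);
            if (j < maxCapacity) {
                kept.set((int) j, item); // replace with probability cap/seen
            }
        }
    }

    List<T> contents() { return kept; }
}
```

<p>Once more reads arrive than the capacity, the output holds exactly <code>maxCapacity</code> reads, matching the map() behavior described above.</p>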
|
||||
<h3>Worst Case M2 Behavior</h3>
|
||||
<ul>
|
||||
<li>Highest coverage M2 call on CRSP NA12878 SM-612V4.bam vs SM-612V3.bam normal-normal pair occurs at 7:100645781 with 4000-5000X coverage, also coverage can exceed 7000X in other BAMs</li>
|
||||
<li>A lot of exons have coverage exceeding the 1000X cutoff for ActiveRegion determination with isActive(), but even with downsampling to 1000X we should still trigger ActiveRegions down to around allele fraction of ~0.8% for 4000X</li>
|
||||
<li>Even the highest coverage exon in the CRSP NA12878 normal-normal calling doesn't exceed the default limit for the ReservoirDownsampler (i.e. all reads will have the potential to get genotyped)</li>
|
||||
<li>In this super high coverage exon, reads are getting downsampled to ~3000 before they go into the assembly (again, controlled by maxReadsInRegionPerSample and minReadsPerAlignmentStart)
|
||||
<ul>
|
||||
<li>Here that's a retention of about 12.5% of reads, which seems pretty aggressive</li>
|
||||
<li>The maxReadsInRegionPerSample value is 10% of what it is for HC</li>
|
||||
<li>Increasing maxReadsInRegionPerSample for M2 may increase sensitivity (although honestly not based on my LUAD comparison vs. M1) but will drastically increase assembly time</li>
|
||||
</ul></li>
|
||||
<li>All reads that pass quality filters are genotyped according to the variants found using the downsampled assembly set</li>
|
||||
</ul>
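<p>The ~0.8% figure above follows from the fact that unbiased downsampling preserves allele fraction in expectation. A quick arithmetic sketch (the helper name is hypothetical):</p>

```java
// Hypothetical helper: unbiased downsampling to `cap` reads preserves allele
// fraction in expectation, so expected alt support is min(depth, cap) * af.
final class DownsampleMath {
    static double expectedAltReads(int depth, int cap, double alleleFraction) {
        return Math.min(depth, cap) * alleleFraction;
    }

    public static void main(String[] args) {
        // 4000X downsampled to the 1000X isActive() cutoff at AF 0.8%
        // still leaves ~8 expected supporting reads.
        System.out.println(expectedAltReads(4000, 1000, 0.008));
    }
}
```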
|
||||
|
|
@ -0,0 +1,113 @@
|
|||
## Output management
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1327/output-management
|
||||
|
||||
<h3>1. Introduction</h3>
|
||||
<p>When running either single-threaded or in shared-memory parallelism mode, the GATK guarantees that output written to an output stream created via the <code>@Argument</code> mechanism will ultimately be assembled in genomic order. In order to assemble the final output file, the GATK will write the output generated from each thread into a temporary output file, ultimately assembling the data via a central coordinating thread. There are three major elements in the GATK that facilitate this functionality:</p>
|
||||
<ul>
|
||||
<li>
|
||||
<p>Stub</p>
|
||||
<p>The front-end interface to the output management system. Stubs will be injected into the walker by the command-line argument system and relay information from the walker to the output management system. There will be one stub per invocation of the GATK.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Storage</p>
|
||||
<p>The back end interface, responsible for creating, writing and deleting temporary output files as well as merging their contents back into the primary output file. One Storage object will exist per shard processed in the GATK.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>OutputTracker</p>
|
||||
<p>The dispatcher; ultimately connects the stub object's output creation request back to the most appropriate storage object to satisfy that request. One OutputTracker will exist per GATK invocation.</p>
|
||||
</li>
|
||||
</ul>
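<p>The division of labor can be sketched as a toy model (the class names mirror the roles above, but the bodies are invented, and strings stand in for genomic output):</p>

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the three roles: the stub forwards every call, the tracker
// dispatches stub -> storage, and storage actually buffers the output.
interface Storage { void write(String s); String contents(); }

final class StringStorage implements Storage {
    private final StringBuilder buf = new StringBuilder();
    public void write(String s) { buf.append(s); }
    public String contents() { return buf.toString(); }
}

final class Tracker {
    private final Map<Object, Storage> byStub = new HashMap<>();
    Storage getStorage(Object stub) {
        // create the storage object lazily, one per stub
        return byStub.computeIfAbsent(stub, k -> new StringStorage());
    }
}

final class WriterStub {
    private Tracker tracker;
    void register(Tracker t) { this.tracker = t; }              // injected by the engine
    void write(String s) { tracker.getStorage(this).write(s); } // forward, never store
}
```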
|
||||
<h3>2. Basic Mechanism</h3>
|
||||
<p>Stubs are directly injected into the walker through the GATK's command-line argument parser as a go-between from walker to output management system. When a walker calls into the stub, its first responsibility is to call into the output tracker to retrieve an appropriate storage object. The behavior of the OutputTracker from this point forward depends mainly on the parallelization mode of the traversal.</p>
|
||||
<h4>If the traversal is single-threaded:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<p>The OutputTracker (implemented as DirectOutputTracker) will create the storage object if necessary and return it to the stub.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>The stub will forward the request to the provided storage object. </p>
|
||||
</li>
|
||||
<li>At the end of the traversal, the microscheduler will request that the OutputTracker finalize and close the file.</li>
|
||||
</ul>
|
||||
<h4>If the traversal is multi-threaded using shared-memory parallelism:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<p>The OutputTracker (implemented as ThreadLocalOutputTracker) will look for a storage object associated with this thread via a ThreadLocal. </p>
|
||||
</li>
|
||||
<li>
|
||||
<p>If no such storage object exists, it will be created pointing to a temporary file. </p>
|
||||
</li>
|
||||
<li>
|
||||
<p>At the end of <strong>each shard processed</strong>, that file will be closed and an OutputMergeTask will be created so that the shared-memory parallelism code can merge the output at its leisure.</p>
|
||||
</li>
|
||||
<li>The shared-memory parallelism code will merge when a fixed number of temporary files appear in the input queue. The constant used to determine this frequency is fixed at compile time (see <code>HierarchicalMicroScheduler.MAX_OUTSTANDING_OUTPUT_MERGES</code>).</li>
|
||||
</ul>
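<p>The per-thread storage idea can be sketched with a plain <code>ThreadLocal</code> (a generic illustration, not the ThreadLocalOutputTracker itself; in-memory buffers stand in for temporary files):</p>

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Each thread lazily receives its own store, so writers never contend;
// a separate merge step combines the stores afterwards.
final class PerThreadOutput {
    private final List<StringBuilder> allStores = new CopyOnWriteArrayList<>();
    private final ThreadLocal<StringBuilder> perThread = ThreadLocal.withInitial(() -> {
        StringBuilder store = new StringBuilder(); // stands in for a temp file
        allStores.add(store);
        return store;
    });

    void write(String s) { perThread.get().append(s); }

    String mergeAll() {                            // the deferred merge step
        StringBuilder out = new StringBuilder();
        for (StringBuilder store : allStores) out.append(store);
        return out.toString();
    }
}
```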
|
||||
<h3>3. Using output management</h3>
|
||||
<p>To use the output management system, declare a field in your walker of one of the existing core output types, coupled with either an <code>@Argument</code> or <code>@Output</code> annotation.</p>
|
||||
<pre><code class="pre_md">@Output(doc="Write output to this BAM filename instead of STDOUT")
|
||||
SAMFileWriter out;</code class="pre_md"></pre>
|
||||
<p>Currently supported output types are SAM/BAM (declare SAMFileWriter), VCF (declare VCFWriter), and any non-buffering stream extending from OutputStream.</p>
|
||||
<h3>4. Implementing a new output type</h3>
|
||||
<p>To create a new output type, three types must be implemented: Stub, Storage, and ArgumentTypeDescriptor.</p>
|
||||
<h4>To implement Stub</h4>
|
||||
<p>Create a new Stub class, extending/inheriting the core output type's interface and implementing the Stub interface.</p>
|
||||
<pre><code class="pre_md">OutputStreamStub extends OutputStream implements Stub<OutputStream> {</code class="pre_md"></pre>
|
||||
<p>Implement a register function so that the engine can provide the stub with the session's OutputTracker.</p>
|
||||
<pre><code class="pre_md">public void register( OutputTracker outputTracker ) {
|
||||
this.outputTracker = outputTracker;
|
||||
}</code class="pre_md"></pre>
|
||||
<p>Add as fields any parameters necessary for the storage object to create temporary storage.</p>
|
||||
<pre><code class="pre_md">private final File targetFile;
|
||||
public File getOutputFile() { return targetFile; }</code class="pre_md"></pre>
|
||||
<p>Implement/override every method in the core output type's interface to pass along calls to the appropriate storage object via the OutputTracker.</p>
|
||||
<pre><code class="pre_md">public void write( byte[] b, int off, int len ) throws IOException {
|
||||
outputTracker.getStorage(this).write(b, off, len);
|
||||
}</code class="pre_md"></pre>
|
||||
<h4>To implement Storage</h4>
|
||||
<p>Create a Storage class, again extending/inheriting the core output type's interface and implementing the Storage interface.</p>
|
||||
<pre><code class="pre_md">public class OutputStreamStorage extends OutputStream implements Storage<OutputStream> {</code class="pre_md"></pre>
|
||||
<p>Implement constructors that will accept just the Stub or Stub + alternate file path and create a repository for data, and a close function that will close that repository.</p>
|
||||
<pre><code class="pre_md">public OutputStreamStorage( OutputStreamStub stub ) { ... }
|
||||
public OutputStreamStorage( OutputStreamStub stub, File file ) { ... }
|
||||
public void close() { ... }</code class="pre_md"></pre>
|
||||
<p>Implement a <code>mergeInto</code> function capable of reconstituting the file created by the constructor, dumping it back into the core output type's interface, and removing the source file.</p>
|
||||
<pre><code class="pre_md">public void mergeInto( OutputStream targetStream ) { ... }</code class="pre_md"></pre>
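<p>For an OutputStream-backed storage, a minimal <code>mergeInto</code> might just replay the temporary file into the target and then delete it. A sketch under that assumption, using plain NIO rather than the GATK's actual I/O utilities:</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileMerge {
    // Replay the shard's temporary file into the final stream, then remove it.
    static void mergeInto(Path tempFile, OutputStream target) throws IOException {
        Files.copy(tempFile, target);
        Files.delete(tempFile);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("merge", ".tmp");
        Files.write(p, "hello".getBytes());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        mergeInto(p, out);
        System.out.println(out);             // the merged bytes
        System.out.println(Files.exists(p)); // the temporary is gone
    }
}
```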
|
||||
<p>Add a block to <code>StorageFactory.createStorage()</code> capable of creating the new storage object. <strong>TODO: use reflection to generate the storage classes.</strong></p>
|
||||
<pre><code class="pre_md"> if(stub instanceof OutputStreamStub) {
|
||||
if( file != null )
|
||||
storage = new OutputStreamStorage((OutputStreamStub)stub,file);
|
||||
else
|
||||
storage = new OutputStreamStorage((OutputStreamStub)stub);
|
||||
}</code class="pre_md"></pre>
|
||||
<h4>To implement ArgumentTypeDescriptor</h4>
|
||||
<p>Create a new object inheriting from type <code>ArgumentTypeDescriptor</code>. Note that the <code>ArgumentTypeDescriptor</code> does NOT need to support the core output type's interface.</p>
|
||||
<pre><code class="pre_md">public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {</code class="pre_md"></pre>
|
||||
<p>Implement a truth function indicating which types this <code>ArgumentTypeDescriptor</code> can service.</p>
|
||||
<pre><code class="pre_md"> @Override
|
||||
public boolean supports( Class type ) {
|
||||
        return OutputStream.class.isAssignableFrom(type);
|
||||
}</code class="pre_md"></pre>
|
||||
<p>Implement a parse function that constructs the new Stub object. The function should register this type as an output by calling <code>engine.addOutput(stub)</code>.</p>
|
||||
<pre><code class="pre_md"> public Object parse( ParsingEngine parsingEngine, ArgumentSource source, Type type, ArgumentMatches matches ) {
|
||||
...
|
||||
OutputStreamStub stub = new OutputStreamStub(new File(fileName));
|
||||
...
|
||||
engine.addOutput(stub);
|
||||
....
|
||||
return stub;
|
||||
}</code class="pre_md"></pre>
|
||||
<p>Add a creator for this new ArgumentTypeDescriptor in <code>CommandLineExecutable.getArgumentTypeDescriptors()</code>.</p>
|
||||
<pre><code class="pre_md"> protected Collection<ArgumentTypeDescriptor> getArgumentTypeDescriptors() {
|
||||
return Arrays.asList( new VCFWriterArgumentTypeDescriptor(engine,System.out,argumentSources),
|
||||
new SAMFileWriterArgumentTypeDescriptor(engine,System.out),
|
||||
new OutputStreamArgumentTypeDescriptor(engine,System.out) );
|
||||
}</code class="pre_md"></pre>
|
||||
<p>After creating these three objects, the new output type should be ready for usage as described above.</p>
|
||||
<h3>5. Outstanding issues</h3>
|
||||
<ul>
|
||||
<li>
|
||||
<p>Only non-buffering output streams are currently supported by the GATK. Of particular note, <code>PrintWriter</code> will appear to drop records if created by the command-line argument system; use <code>PrintStream</code> instead.</p>
|
||||
</li>
|
||||
<li>For efficiency, the GATK does not reduce output files together following the tree pattern used by shared-memory parallelism; output merges happen via an independent queue. Because of this, output merges happening during a <code>treeReduce</code> may not behave correctly.</li>
|
||||
</ul>
|
||||
|
|
@ -0,0 +1,32 @@
|
|||
## Scala resources
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1897/scala-resources
|
||||
|
||||
<h2>References for Scala development</h2>
|
||||
<p>The online course <a href="https://www.coursera.org/course/progfun">Functional Programming Principles in Scala</a> taught by Martin Odersky, creator of Scala, and a <a href="https://github.com/lrytz/progfun-wiki/blob/gh-pages/CheatSheet.md">Cheat Sheet</a> for that course</p>
|
||||
<p><a href="http://www.scala-lang.org/docu/files/ScalaByExample.pdf">Scala by Example (PDF)</a> - also by Martin Odersky</p>
|
||||
<p><a href="http://www.artima.com/scalazine/articles/steps.html">First Steps to Scala</a></p>
|
||||
<p><a href="http://programming-scala.labs.oreilly.com">Programming Scala</a> - O'Reilly Media</p>
|
||||
<p><a href="http://twitter.github.com/scala_school/">Scala School</a> - Twitter</p>
|
||||
<p><a href="http://davetron5000.github.com/scala-style/index.html">Scala Style Guide</a></p>
|
||||
<p><a href="http://www.cis.upenn.edu/~matuszek/Concise%20Guides/Concise%20Scala.html">A Concise Introduction To Scala</a></p>
|
||||
<p><a href="http://jim-mcbeath.blogspot.com/2008/12/scala-operator-cheat-sheet.html">Scala Operator Cheat Sheet</a></p>
|
||||
<p><a href="http://www.scala-lang.org/node/104">A Tour of Scala</a></p>
|
||||
<h4>Stack Overflow</h4>
|
||||
<ul>
|
||||
<li><a href="http://stackoverflow.com/questions/7888944/scala-punctuation-aka-symbols-operators">Scala Punctuation (aka symbols, operators)</a></li>
|
||||
<li><a href="http://stackoverflow.com/questions/8000903/what-are-all-the-uses-of-an-underscore-in-scala">What are all the uses of an underscore in Scala?</a></li>
|
||||
</ul>
|
||||
<h4>A Conversation with Martin Odersky</h4>
|
||||
<ol>
|
||||
<li><a href="http://www.artima.com/scalazine/articles/origins_of_scala.html">The Origins of Scala</a></li>
|
||||
<li><a href="http://www.artima.com/scalazine/articles/goals_of_scala.html">The Goals of Scala's Design</a></li>
|
||||
<li><a href="http://www.artima.com/scalazine/articles/scalas_type_system.html">The Purpose of Scala's Type System</a></li>
|
||||
<li><a href="http://www.artima.com/scalazine/articles/pattern_matchingP.html">The Point of Pattern Matching in Scala</a></li>
|
||||
</ol>
|
||||
<h4>Scala Collections for the Easily Bored</h4>
|
||||
<ol>
|
||||
<li><a href="http://www.codecommit.com/blog/scala/scala-collections-for-the-easily-bored-part-1">A Tale of Two Flavors</a></li>
|
||||
<li><a href="http://www.codecommit.com/blog/scala/scala-collections-for-the-easily-bored-part-2">One at a Time</a></li>
|
||||
<li><a href="http://www.codecommit.com/blog/scala/scala-collections-for-the-easily-bored-part-3">All at Once</a></li>
|
||||
</ol>
|
||||
|
|
@ -0,0 +1,48 @@
|
|||
## Seeing deletion spanning reads in LocusWalkers
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1348/seeing-deletion-spanning-reads-in-locuswalkers
|
||||
|
||||
<h2>1. Introduction</h2>
|
||||
<p>The <code>LocusTraversal</code> now supports passing walkers reads that have deletions spanning the current locus. This is useful in many situations, for example when you want to calculate coverage or call variants and need to avoid calling variants where there are a lot of deletions.</p>
|
||||
<p>Currently, the system by default will not pass you deletion-spanning reads. In order to see them, you need to override the function:</p>
|
||||
<pre><code class="pre_md">/**
|
||||
* (conceptual static) method that states whether you want to see reads piling up at a locus
|
||||
* that contain a deletion at the locus.
|
||||
*
|
||||
* ref: ATCTGA
|
||||
* read1: ATCTGA
|
||||
* read2: AT--GA
|
||||
*
|
||||
* Normally, the locus iterator only returns a list of read1 at this locus at position 3, but
|
||||
* if this function returns true, then the system will return (read1, read2) with offsets
|
||||
* of (3, -1). The -1 offset indicates a deletion in the read.
|
||||
*
|
||||
* @return false if you don't want to see deletions, or true if you do
|
||||
*/
|
||||
public boolean includeReadsWithDeletionAtLoci() { return true; }</code class="pre_md"></pre>
|
||||
<p>in your walker. Now you will start seeing deletion-spanning reads in your walker. These reads are flagged with offsets of -1, so that you can:</p>
|
||||
<pre><code class="pre_md"> for ( int i = 0; i < context.getReads().size(); i++ ) {
|
||||
SAMRecord read = context.getReads().get(i);
|
||||
int offset = context.getOffsets().get(i);
|
||||
|
||||
if ( offset == -1 )
|
||||
nDeletionReads++;
|
||||
else
|
||||
nCleanReads++;
|
||||
}</code class="pre_md"></pre>
|
||||
<p>There are also two convenience functions in <code>AlignmentContext</code> to extract subsets of the reads with and without spanning deletions:</p>
|
||||
<pre><code class="pre_md">/**
|
||||
* Returns only the reads in ac that do not contain spanning deletions of this locus
|
||||
*
|
||||
* @param ac
|
||||
* @return
|
||||
*/
|
||||
public static AlignmentContext withoutSpanningDeletions( AlignmentContext ac );
|
||||
|
||||
/**
|
||||
* Returns only the reads in ac that do contain spanning deletions of this locus
|
||||
*
|
||||
* @param ac
|
||||
* @return
|
||||
*/
|
||||
public static AlignmentContext withSpanningDeletions( AlignmentContext ac );</code class="pre_md"></pre>
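<p>The same split can be sketched generically using the -1 offset convention (plain ints stand in for reads here; the real methods operate on <code>AlignmentContext</code>):</p>

```java
import java.util.ArrayList;
import java.util.List;

// A -1 offset marks a read whose deletion spans the locus; everything else
// carries a genuine base offset into the read.
final class SpanningDeletionSplit {
    static List<Integer> withoutSpanningDeletions(List<Integer> offsets) {
        List<Integer> kept = new ArrayList<>();
        for (int off : offsets) if (off != -1) kept.add(off);
        return kept;
    }

    static List<Integer> withSpanningDeletions(List<Integer> offsets) {
        List<Integer> kept = new ArrayList<>();
        for (int off : offsets) if (off == -1) kept.add(off);
        return kept;
    }
}
```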
|
||||
|
|
@ -0,0 +1,85 @@
|
|||
## Setting up your dev environment: Maven and IntelliJ for GATK 3+
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/4023/setting-up-your-dev-environment-maven-and-intellij-for-gatk-3
|
||||
|
||||
<h3>Overview</h3>
|
||||
<p>Since GATK 3.0, we use Apache Maven (instead of Ant) as our build system, and IntelliJ as our IDE (Integrated Development Environment). This document describes how to get set up to use Maven as well as how to create an IntelliJ project around our Maven project structure.</p>
|
||||
<h3>Before you start</h3>
|
||||
<ul>
|
||||
<li>Ensure that you have git clones of our repositories on your machine. See <a href="http://www.broadinstitute.org/gatk/guide/article?id=4022">this document</a> for details on obtaining the GATK source code from our Git repos.</li>
|
||||
</ul>
|
||||
<h3>Setting up Maven</h3>
|
||||
<ol>
|
||||
<li>
|
||||
<p>Check whether you can run <code>mvn --version</code> on your machine. If you can't, install Maven from <a href="http://maven.apache.org/">here</a>.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Ensure that the JAVA_HOME environment variable is properly set. If it's not, add the appropriate line to your shell's startup file:</p>
|
||||
<p>for tcsh: </p>
|
||||
<pre><code class="pre_md">setenv JAVA_HOME `/usr/libexec/java_home`</code class="pre_md"></pre>
|
||||
<p>for bash: </p>
|
||||
<pre><code class="pre_md">export JAVA_HOME=`/usr/libexec/java_home`</code class="pre_md"></pre>
|
||||
</li>
|
||||
</ol>
|
||||
<p>Note that the commands above use backticks, not single quotes.</p>
|
||||
<h3>Basic Maven usage</h3>
|
||||
<ol>
|
||||
<li>
|
||||
<p>To compile everything, type:</p>
|
||||
<pre><code class="pre_md">mvn verify</code class="pre_md"></pre>
|
||||
</li>
|
||||
<li>
|
||||
<p>To compile the GATK but not Queue (much faster!), the command is:</p>
|
||||
<pre><code class="pre_md">mvn verify -P\!queue</code class="pre_md"></pre>
|
||||
<p>Note that the <code>!</code> needs to be escaped with a backslash to avoid interpretation by the shell.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>To obtain a clean working directory, type:</p>
|
||||
<pre><code class="pre_md">mvn clean</code class="pre_md"></pre>
|
||||
</li>
|
||||
<li>
|
||||
<p>If you're used to using ant to compile the GATK, you should be able to feed your old ant commands to the <code>ant-bridge.sh</code> script in the root directory. For example:</p>
|
||||
<pre><code class="pre_md">./ant-bridge.sh test -Dsingle=MyTestClass</code class="pre_md"></pre>
|
||||
</li>
|
||||
</ol>
|
||||
<h3>Setting up IntelliJ</h3>
|
||||
<ol>
|
||||
<li>
|
||||
<p>Run <code>mvn test-compile</code> in your git clone's root directory.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Open IntelliJ</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>File -> import project, select your git clone directory, then click "ok"</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>On the next screen, select "import project from external model", then "maven", then click "next"</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Click "next" on the next screen without changing any defaults -- in particular:</p>
|
||||
<ul>
|
||||
<li>DON'T check "Import maven projects automatically" </li>
|
||||
<li>DON'T check "Create module groups for multi-module maven projects"</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>
|
||||
<p>On the "Select Profiles" screen, make sure private and protected ARE checked, then click "next".</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>On the next screen, the "gatk-aggregator" project should already be checked for you -- if not, then check it. Click "next".</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Select the 1.7 SDK, then click "next".</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Select an appropriate project name (can be anything), then click "next" (or "finish", depending on your version of IntelliJ).</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Click "Finish" to create the new IntelliJ project.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>That's it! Due to Maven magic, everything else will be set up for you automatically, including modules, libraries, Scala facets, etc.</p>
|
||||
</li>
|
||||
<li>You will see a popup "Maven projects need to be imported" on every IntelliJ startup. You should click import unless you're working on the actual pom files that make up the build system.</li>
|
||||
</ol>
|
||||
|
|
@ -0,0 +1,736 @@
|
|||
## Sting to GATK renaming
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/4173/sting-to-gatk-renaming
|
||||
|
||||
<h1>Overview</h1>
|
||||
<p>The GATK 3.2 source code uses new java package names, directory paths, and executable jars. Post GATK 3.2, any patches submitted via pull requests should also include classes moved to the appropriate artifact.</p>
|
||||
<p>Note that the document includes references to the <code>private</code> module, which is part of our internal development codebase but is not available to the general public. </p>
|
||||
<h1>Summary</h1>
|
||||
<p>A long-term goal for the GATK is to separate out reusable parts and eventually make them available as compiled libraries via centralized binary repositories. Before publishing, a number of steps must be completed. One of the larger steps was completed for GATK 3.2, when the code base renamed all references of Sting to GATK.</p>
|
||||
<p>Currently implemented changes include:</p>
|
||||
<ul>
|
||||
<li>Java/Scala package names changed from org.broadinstitute.sting to org.broadinstitute.gatk</li>
|
||||
<li>Renamed Maven artifacts including new directories</li>
|
||||
</ul>
|
||||
<p>As of May 16, 2014, remaining TODOs ahead of publishing to central include:</p>
|
||||
<ul>
|
||||
<li>Uploading all transitive GATK dependencies to central repositories</li>
|
||||
<li>Separating a bit more of the intertwined utility, engine, and tool classes</li>
|
||||
</ul>
|
||||
<p>Now that the new package names and Maven artifacts are available, any pull request should include ensuring that updated classes are also moved into the correct GATK Maven artifact. While there are a significant number of classes, cleaning up as we go along will allow the larger task to be completed in a distributed fashion.</p>
|
||||
<p>The full lists of new Maven artifacts and renamed packages are below under [Renamed Artifact Directories]. For those developers in the middle of a <code>git rebase</code> around commits before and after 3.2, here is an abridged mapping of renamed directories for those trying to locate files:</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Old Maven Artifact</th>
|
||||
<th>New Maven Artifact</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>public/sting-root</code></td>
|
||||
<td><code>public/gatk-root</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>public/sting-utils</code></td>
|
||||
<td><code>public/gatk-utils</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>public/gatk-framework</code></td>
|
||||
<td><code>public/gatk-tools-public</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>public/queue-framework</code></td>
|
||||
<td><code>public/gatk-queue</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>protected/gatk-protected</code></td>
|
||||
<td><code>protected/gatk-tools-protected</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>private/gatk-private</code></td>
|
||||
<td><code>private/gatk-tools-private</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>private/queue-private</code></td>
|
||||
<td><code>private/gatk-queue-private</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>QScripts are no longer located with the Queue engine, and instead are now located with the GATK wrappers implemented as Queue extensions. See [Separated Queue Extensions] for more info.</p>
|
||||
<h1>Changes</h1>
|
||||
<h2>Separating the GATK Engine and Tools</h2>
|
||||
<p>Starting with GATK 3.2, separate Maven utility artifacts exist to separate reusable portions of the GATK engine apart from tool specific implementations. The biggest impact this will have on developers is the separation of the walkers packages.</p>
|
||||
<p>In GATK versions <= 3.1 there was one package for both the base classes and the implementations of walkers:</p>
|
||||
<ul>
|
||||
<li>org.broadinstitute.sting.gatk.walkers</li>
|
||||
</ul>
|
||||
<p>In GATK versions >= 3.2 there are two packages. The first contains the base interfaces, annotations, etc.; the second is for the concrete tools implemented as walkers:</p>
|
||||
<ul>
|
||||
<li>
|
||||
<p>org.broadinstitute.<strong>gatk.engine</strong>.walkers</p>
|
||||
<ul>
|
||||
<li>Ex: ReadWalker, LocusWalker, @PartitionBy, @Requires, etc.</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>org.broadinstitute.<strong>gatk.tools</strong>.walkers
|
||||
<ul>
|
||||
<li>Ex: PrintReads, VariantEval, IndelRealigner, HaplotypeCaller, etc.</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
<h2>Renamed Binary Packages</h2>
|
||||
<p>Previously, depending on how the source code was compiled, the executable gatk-package-3.1.jar and queue-package-3.1.jar (aka GenomeAnalysisTK.jar and Queue.jar) contained various mixes of public/protected/private code. For example, if the private directory was present when the source code was compiled, the same artifact named gatk-package-3.1.jar might, or might not contain private code.</p>
|
||||
<p>Starting with 3.2, there are two versions of the jar created, each with specific file contents.</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>New Maven Artifact</th>
|
||||
<th>Alias in the /target folder</th>
|
||||
<th>Packaged contents</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>gatk-package-distribution-3.2.jar</td>
|
||||
<td>GenomeAnalysisTK.jar</td>
|
||||
<td>public,protected</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>gatk-package-internal-3.2.jar</td>
|
||||
<td>GenomeAnalysisTK-internal.jar</td>
|
||||
<td>public,protected,private</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>gatk-queue-package-distribution-3.2.jar</td>
|
||||
<td>Queue.jar</td>
|
||||
<td>public,protected</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>gatk-queue-package-internal-3.2.jar</td>
|
||||
<td>Queue-internal.jar</td>
|
||||
<td>public,protected,private</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<h2>Separated Queue Extensions</h2>
|
||||
<p>When creating a packaged version of Queue, the GATKExtensionsGenerator builds Queue engine compatible command line wrappers around each GATK walker. Previously, the wrappers were generated during the compilation of the Queue framework. Similar to the binary packages, depending on who built the source code, queue-framework-3.1.jar would contain various mixes of public/protected/private wrappers.</p>
|
||||
<p>Starting with GATK 3.2, the gatk-queue-3.2.jar only contains code for the Queue engine. Generated and manually created extensions for wrapping any other command line programs are all included in separate artifacts. Due to a current limitation regarding how the generator uses reflection, the generator cannot build wrappers for just private classes without also generating protected and public classes. Thus, three different Maven artifacts are generated, containing different mixes of public, protected and private wrappers.</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Extensions Artifact</th>
|
||||
<th>Generated wrappers for GATK tools</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>gatk-queue-extensions-public-3.2.jar</td>
|
||||
<td>public <em>only</em></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>gatk-queue-extensions-distribution-3.2.jar</td>
|
||||
<td>public,protected</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>gatk-queue-extensions-internal-3.2.jar</td>
|
||||
<td>public,protected,private</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>As for QScripts that used to be located with the framework, they are now located with the generated wrappers.</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Old QScripts Artifact Directory</th>
|
||||
<th>New QScripts Artifact Directory</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><code>public/queue-framework/src/main/qscripts</code></td>
|
||||
<td><code>public/gatk-queue-extensions-public/src/main/qscripts</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><code>private/queue-private/src/main/qscripts</code></td>
|
||||
<td><code>private/gatk-queue-extensions-internal/src/main/qscripts</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
<h2>Renamed Artifact Directories</h2>
<p>The following table maps artifact names before and after GATK 3.2. Beyond the engine changes, the packaging updates and extensions changes above also drove this Maven artifact refactoring: the packaging artifacts have been split from a single public version into protected and private versions, and new Queue extensions artifacts have been added as well.</p>
<table class="table table-striped">
<thead>
<tr><th>Maven Artifact <= GATK 3.1</th><th>Maven Artifact >= GATK 3.2</th></tr>
</thead>
<tbody>
<tr><td><code>/pom.xml</code> <em>(sting-aggregator)</em></td><td><code>/pom.xml</code> <em>(gatk-aggregator)</em></td></tr>
<tr><td><code>public/sting-root</code></td><td><code>public/gatk-root</code></td></tr>
<tr><td><code>public/sting-utils</code></td><td><code>public/gatk-utils</code></td></tr>
<tr><td><em>none</em></td><td><code>public/gatk-engine</code></td></tr>
<tr><td><code>public/gatk-framework</code></td><td><code>public/gatk-tools-public</code></td></tr>
<tr><td><code>public/queue-framework</code></td><td><code>public/gatk-queue</code></td></tr>
<tr><td><code>public/gatk-queue-extgen</code></td><td><code>public/gatk-queue-extensions-generator</code></td></tr>
<tr><td><code>protected/gatk-protected</code></td><td><code>protected/gatk-tools-protected</code></td></tr>
<tr><td><code>private/gatk-private</code></td><td><code>private/gatk-tools-private</code></td></tr>
<tr><td><code>private/queue-private</code></td><td><code>private/gatk-queue-private</code></td></tr>
<tr><td><code>public/gatk-package</code></td><td><code>protected/gatk-package-distribution</code></td></tr>
<tr><td><code>public/queue-package</code></td><td><code>protected/gatk-queue-package-distribution</code></td></tr>
<tr><td><em>none</em></td><td><code>private/gatk-package-internal</code></td></tr>
<tr><td><em>none</em></td><td><code>private/gatk-queue-package-internal</code></td></tr>
<tr><td><em>none</em></td><td><code>public/gatk-queue-extensions-public</code></td></tr>
<tr><td><em>none</em></td><td><code>protected/gatk-queue-extensions-distribution</code></td></tr>
<tr><td><em>none</em></td><td><code>private/gatk-queue-extensions-internal</code></td></tr>
</tbody>
</table>
<p><em>A note regarding the aggregator:</em></p>
<p>The aggregator is the pom.xml in the top-level directory of the GATK source code. When someone clones the GATK source code and runs <code>mvn</code> in the top-level directory, the aggregator is the pom.xml that is executed.</p>
<p>The root is a pom.xml that contains all common Maven configuration. A couple of dependent pom.xml files inherit configuration from the root but are <em>NOT</em> aggregated during normal source compilation.</p>
<p>As of GATK 3.2, these un-aggregated child artifacts are VectorPairHMM and picard-maven. They do not run by default with each invocation of <code>mvn</code> on the GATK source code.</p>
<p>For more clarification on Maven inheritance vs. aggregation, see the Maven <a href="http://maven.apache.org/guides/introduction/introduction-to-the-pom.html#Project_Inheritance_vs_Project_Aggregation">introduction to the POM</a>.</p>
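<p>The distinction can be illustrated with two simplified, hypothetical pom.xml fragments (these are not the actual GATK build files): aggregation lists modules that get built together, while inheritance points a child at a parent for shared configuration, whether or not the aggregator lists that child.</p>

```xml
<!-- Aggregator (top-level pom.xml): running mvn here builds the listed modules -->
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>example-aggregator</artifactId>
    <version>1.0</version>
    <packaging>pom</packaging>
    <modules>
        <module>public/example-root</module>
    </modules>
</project>

<!-- A child that INHERITS common configuration from a root pom;
     it is not built unless an aggregator lists it as a module -->
<project>
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.example</groupId>
        <artifactId>example-root</artifactId>
        <version>1.0</version>
    </parent>
    <artifactId>example-child</artifactId>
</project>
```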
<h2>Renamed Java/Scala Package Names</h2>
<p>In GATK 3.2, except for classes with Sting in the name, all file names are unchanged. To locate migrated files under the new Java package names, developers should either use <a href="http://www.jetbrains.com/idea/webhelp/navigating-to-class-file-or-symbol-by-name.html">IntelliJ IDEA navigation</a> or <code>/bin/find</code> to locate the same file they used previously.</p>
<p>The biggest change most developers will face is the new package names for GATK classes. Code entanglement does not permit simply moving the classes into the correct Maven artifacts, as a small number of lines of code must be edited inside a large number of files. So, after the renaming, only a very small number of classes were moved out of the incorrect Maven artifacts, as examples.</p>
<p>As of May 16, 2014, the migrated GATK package distribution is as follows. The table includes only main classes; it excludes all tests, renamed files such as StingException, certain private Queue wrappers, and qscripts renamed to end in *.scala.</p>
<table class="table table-striped">
<thead>
<tr><th>Scope</th><th>Type</th><th><= 3.1 Artifact</th><th><= 3.1 Package</th><th>>= 3.2 Artifact</th><th>>= 3.2 Package</th><th style="text-align: right;">Files</th></tr>
</thead>
<tbody>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s</td><td>gatk-utils</td><td>o.b.g</td><td style="text-align: right;">4</td></tr>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s.gatk</td><td>gatk-engine</td><td>o.b.g.engine</td><td style="text-align: right;">2</td></tr>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s</td><td>gatk-tools-public</td><td>o.b.g</td><td style="text-align: right;">202</td></tr>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s</td><td>gatk-tools-public</td><td>o.b.g.utils</td><td style="text-align: right;">49</td></tr>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s</td><td>gatk-tools-public</td><td>o.b.g.engine</td><td style="text-align: right;">34</td></tr>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s.gatk</td><td>gatk-tools-public</td><td>o.b.g.engine</td><td style="text-align: right;">244</td></tr>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s.gatk</td><td>gatk-tools-public</td><td>o.b.g.tools</td><td style="text-align: right;">134</td></tr>
<tr><td>public</td><td>java</td><td>gatk-framework</td><td>o.b.s.gatk</td><td>gatk-tools-public</td><td>o.b.g.tools.walkers</td><td style="text-align: right;">2</td></tr>
<tr><td>protected</td><td>java</td><td>gatk-protected</td><td>o.b.s</td><td>gatk-tools-protected</td><td>o.b.g</td><td style="text-align: right;">44</td></tr>
<tr><td>protected</td><td>java</td><td>gatk-protected</td><td>o.b.s.gatk</td><td>gatk-tools-protected</td><td>o.b.g.engine</td><td style="text-align: right;">1</td></tr>
<tr><td>protected</td><td>java</td><td>gatk-protected</td><td>o.b.s.gatk</td><td>gatk-tools-protected</td><td>o.b.g.tools</td><td style="text-align: right;">209</td></tr>
<tr><td>private</td><td>java</td><td>gatk-private</td><td>o.b.s</td><td>gatk-tools-private</td><td>o.b.g</td><td style="text-align: right;">23</td></tr>
<tr><td>private</td><td>java</td><td>gatk-private</td><td>o.b.s</td><td>gatk-tools-private</td><td>o.b.g.utils</td><td style="text-align: right;">7</td></tr>
<tr><td>private</td><td>java</td><td>gatk-private</td><td>o.b.s.gatk</td><td>gatk-tools-private</td><td>o.b.g.engine</td><td style="text-align: right;">5</td></tr>
<tr><td>private</td><td>java</td><td>gatk-private</td><td>o.b.s.gatk</td><td>gatk-tools-private</td><td>o.b.g.tools</td><td style="text-align: right;">133</td></tr>
<tr><td>public</td><td>java</td><td>queue-framework</td><td>o.b.s</td><td>gatk-queue</td><td>o.b.g</td><td style="text-align: right;">2</td></tr>
<tr><td>public</td><td>scala</td><td>queue-framework</td><td>o.b.s</td><td>gatk-queue</td><td>o.b.g</td><td style="text-align: right;">72</td></tr>
<tr><td>public</td><td>scala</td><td>queue-framework</td><td>o.b.s</td><td>gatk-queue-extensions-public</td><td>o.b.g</td><td style="text-align: right;">31</td></tr>
<tr><td>public</td><td>qscripts</td><td>queue-framework</td><td>o.b.s</td><td>gatk-queue-extensions-public</td><td>o.b.g</td><td style="text-align: right;">12</td></tr>
<tr><td>private</td><td>scala</td><td>queue-private</td><td>o.b.s</td><td>gatk-queue-private</td><td>o.b.g</td><td style="text-align: right;">2</td></tr>
<tr><td>private</td><td>qscripts</td><td>queue-private</td><td>o.b.s</td><td>gatk-queue-extensions-internal</td><td>o.b.g</td><td style="text-align: right;">118</td></tr>
</tbody>
</table>
<p><strong>During all future code modifications and pull requests, classes should be refactored into the correct artifacts and packages as follows.</strong></p>
<p>All non-engine tools should be in the tools artifacts, with appropriate sub-package names.</p>
<table class="table table-striped">
<thead>
<tr><th>Scope</th><th>Type</th><th>Artifact</th><th>Package(s)</th></tr>
</thead>
<tbody>
<tr><td>public</td><td>java</td><td>gatk-utils</td><td>o.b.g.utils</td></tr>
<tr><td>public</td><td>java</td><td>gatk-engine</td><td>o.b.g.engine</td></tr>
<tr><td>public</td><td>java</td><td>gatk-tools-public</td><td>o.b.g.tools.walkers</td></tr>
<tr><td>public</td><td>java</td><td>gatk-tools-public</td><td>o.b.g.tools.*</td></tr>
<tr><td>protected</td><td>java</td><td>gatk-tools-protected</td><td>o.b.g.tools.walkers</td></tr>
<tr><td>protected</td><td>java</td><td>gatk-tools-protected</td><td>o.b.g.tools.*</td></tr>
<tr><td>private</td><td>java</td><td>gatk-tools-private</td><td>o.b.g.tools.walkers</td></tr>
<tr><td>private</td><td>java</td><td>gatk-tools-private</td><td>o.b.g.tools.*</td></tr>
<tr><td>public</td><td>java</td><td>gatk-queue</td><td>o.b.g.queue</td></tr>
<tr><td>public</td><td>scala</td><td>gatk-queue</td><td>o.b.g.queue</td></tr>
<tr><td>public</td><td>scala</td><td>gatk-queue-extensions-public</td><td>o.b.g.queue.extensions</td></tr>
<tr><td>public</td><td>qscripts</td><td>gatk-queue-extensions-public</td><td>o.b.g.queue.qscripts</td></tr>
<tr><td>private</td><td>scala</td><td>gatk-queue-private</td><td>o.b.g.queue</td></tr>
<tr><td>private</td><td>qscripts</td><td>gatk-queue-extensions-internal</td><td>o.b.g.queue.qscripts</td></tr>
</tbody>
</table>
<h2>Renamed Classes</h2>
<p>The following class names were updated to replace Sting with GATK.</p>
<table class="table table-striped">
<thead>
<tr><th>Old Sting class</th><th>New GATK class</th></tr>
</thead>
<tbody>
<tr><td><code>ArtificialStingSAMFileWriter</code></td><td><code>ArtificialGATKSAMFileWriter</code></td></tr>
<tr><td><code>ReviewedStingException</code></td><td><code>ReviewedGATKException</code></td></tr>
<tr><td><code>StingException</code></td><td><code>GATKException</code></td></tr>
<tr><td><code>StingSAMFileWriter</code></td><td><code>GATKSAMFileWriter</code></td></tr>
<tr><td><code>StingSAMIterator</code></td><td><code>GATKSAMIterator</code></td></tr>
<tr><td><code>StingSAMIteratorAdapter</code></td><td><code>GATKSAMIteratorAdapter</code></td></tr>
<tr><td><code>StingSAMRecordIterator</code></td><td><code>GATKSAMRecordIterator</code></td></tr>
<tr><td><code>StingTextReporter</code></td><td><code>GATKTextReporter</code></td></tr>
</tbody>
</table>
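<p>Because these renames are purely mechanical, they can be applied to a stranded branch with a scripted search-and-replace. A sketch (the <code>sed</code> invocation and scratch file are illustrative, not part of the official migration; apply longer names before shorter ones, e.g. <code>ReviewedStingException</code> before <code>StingException</code>, so the substitutions do not collide):</p>

```shell
# Create a scratch file containing one of the old Sting class names,
# then rewrite it to the new GATK name from the table above.
work=$(mktemp -d)
cat > "$work/Example.java" <<'EOF'
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
class Example { void fail() { throw new ReviewedStingException("boom"); } }
EOF

# -i.bak keeps a backup file and works with both GNU and BSD sed
sed -i.bak \
    -e 's/ReviewedStingException/ReviewedGATKException/g' \
    -e 's/StingSAMFileWriter/GATKSAMFileWriter/g' \
    "$work/Example.java"

grep -c 'ReviewedGATKException' "$work/Example.java"   # -> 2 (both lines updated)
```

Note that this only renames classes; the package-level `o.b.s` to `o.b.g` moves from the previous section are a separate step.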
<h1>Common Git/Maven Issues</h1>
<h2>Renamed files</h2>
<p>The 3.2 renaming patch is actually split into two commits. The first commit renames the files without making any content changes, while the second changes the contents of the files without changing any file paths.</p>
<p>When dealing with renamed files, it is best to work from a clean directory during rebasing; it will be easier for you to track files that you may not have added to git.</p>
<p>After running a git rebase or merge, you may first run into problems with files that you renamed and that were moved during the GATK 3.2 package renaming. As a general rule, the renaming only changes directory names. The exceptions to this rule are classes such as StingException that were renamed to GATKException, as listed under [Renamed Classes]. The workflow for resolving these merge issues is to find the list of your renamed files, put your content in the correct location, then register the changes with git.</p>
<p>To obtain the list of renamed directories and files:</p>
<ol>
<li>Use <code>git status</code> to get a list of affected files</li>
<li>Find the common old directory and file name under "both deleted"</li>
<li>Find your new file name under "added by them" (yes, you are "them")</li>
<li>Find the new directory under "added by us"</li>
</ol>
<p>Then, to resolve the issue for each file:</p>
<ol>
<li>Move your copy of the renamed file to the new directory</li>
<li><code>git rm</code> the old paths as appropriate</li>
<li><code>git add</code> the new path</li>
<li>Repeat for the other files until <code>git status</code> shows "all conflicts fixed"</li>
</ol>
<p>Upon first rebasing you will see a lot of text. <strong>At this moment, you can ignore most of it, and use <code>git status</code> instead.</strong></p>
<p>For the purposes of illustration, while running <code>git rebase</code> it is perfectly normal to see something similar to:</p>
<pre><code class="pre_md">$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: <<< Your first commit message here >>>
Using index info to reconstruct a base tree...
A       protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
A       protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java
<<< Other files that you renamed. >>>
warning: squelched 12 whitespace errors
warning: 34 lines add whitespace errors.
Falling back to patching base and 3-way merge...
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java"->"protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java" in branch "HEAD" rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java"->"protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java" in "<<< Your first commit message here >>>"
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java"->"protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java" in branch "HEAD" rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java"->"protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java" in "<<< Your first commit message here >>>"
Failed to merge in the changes.
Patch failed at 0001 Example conflict.
The copy of the patch that failed is found in:
   /Users/zzuser/src/gsa-unstable/.git/rebase-apply/patch

When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".

$</code class="pre_md"></pre>
<p>While everything you need to resolve the issue is technically in the message above, it may be much easier to track what's going on using <code>git status</code>.</p>
<pre><code class="pre_md">$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
  (fix conflicts and then run "git rebase --continue")
  (use "git rebase --skip" to skip this patch)
  (use "git rebase --abort" to check out the original branch)

Unmerged paths:
  (use "git reset HEAD <file>..." to unstage)
  (use "git add/rm <file>..." as appropriate to mark resolution)

        added by them: protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
        both deleted:  protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
        added by them: protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java
        both deleted:  protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngineUnitTest.java
        added by us:   protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java
        added by us:   protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        <<< possible untracked files if your working directory is not clean >>>

no changes added to commit (use "git add" and/or "git commit -a")
$ </code class="pre_md"></pre>
<p>Let's look at the main java file as an example. If you are having issues figuring out the new directory and new file name, they are all listed in the output.</p>
<pre><code class="pre_md">Path in the common ancestor branch:
| old source directory | old package name | old file name |
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java

Path in the new master branch before merge:
| new source directory | new package name | old file name |
protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java

Path in your branch before merge:
| old source directory | old package name | new file name |
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java

Path in your branch post merge:
| new source directory | new package name | new file name |
protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java</code class="pre_md"></pre>
<p>After identifying the new paths to use post merge, apply the following workflow to each file:</p>
<ol>
<li>Move or copy your version of the renamed file to the new directory</li>
<li><code>git rm</code> the three old file paths: the common ancestor, the old directory with the new file name, and the new directory with the old file name</li>
<li><code>git add</code> the new file name in the new directory</li>
</ol>
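<p>The three steps can be sketched as shell commands. The snippet below builds a throwaway repository so the commands are runnable as-is; during an actual rebase you would instead substitute the paths reported by <code>git status</code> (the directory and file names here are stand-ins):</p>

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email demo@example.org
git config user.name demo

# Stand-ins for the old and new directories from the rename tables
old_dir=protected/gatk-protected/src/main/java
new_dir=protected/gatk-tools-protected/src/main/java
mkdir -p "$old_dir" "$new_dir"

# Your renamed file is stranded in the old (pre-3.2) directory
echo 'class HaplotypeCallerGenotypingEngine {}' > "$old_dir/HaplotypeCallerGenotypingEngine.java"
git add -A && git commit -qm 'renamed file in old directory'

# 1. Move your copy of the renamed file to the new directory
mv "$old_dir/HaplotypeCallerGenotypingEngine.java" "$new_dir/"
# 2. git rm the old path (git rm is fine with the working-tree copy already gone)
git rm -q "$old_dir/HaplotypeCallerGenotypingEngine.java"
# 3. git add the new path
git add "$new_dir/HaplotypeCallerGenotypingEngine.java"

git status --short    # the rename is now staged
```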
<p>After you have processed all the files correctly, the output of <code>git status</code> should show "all conflicts fixed" and all of your files renamed.</p>
<pre><code class="pre_md">$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
  (all conflicts fixed: run "git rebase --continue")

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        renamed: protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngine.java -> protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java
        renamed: protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/GenotypingEngineUnitTest.java -> protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngineUnitTest.java

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        <<< possible untracked files if your working directory is not clean >>>

$</code class="pre_md"></pre>
<p>Continue your rebase, handling other merges as normal.</p>
<pre><code class="pre_md">$ git rebase --continue</code class="pre_md"></pre>
<h2>Fixing imports</h2>
<p>Because all the package names are different in 3.2, while rebasing you may run into conflicts due to imports you also changed. Use your favorite editor to fix the imports within the files, then try recompiling, and repeat as necessary until your code works.</p>
<p>While editing the conflicting files with a basic text editor may work, IntelliJ IDEA also offers a special merge tool, available via the menu:</p>
<pre><code class="pre_md">VCS > Git > Resolve Conflicts...</code class="pre_md"></pre>
<p>For each file, click on the "Merge" button in the first dialog. Use the various buttons in the <a href="https://www.jetbrains.com/idea/webhelp/resolving-conflicts.html">Conflict Resolution Tool</a> to automatically accept any changes that are not in conflict. Then find and edit any remaining conflicts that require further manual intervention.</p>
<p>Once you begin editing the import statements in the three-way merge tool, another IntelliJ IDEA 13.1 feature that may speed up repairing blocks of import statements is <a href="http://blog.jetbrains.com/idea/2014/03/intellij-idea-13-1-rc-introduces-sublime-text-style-multiple-selections/">Multiple Selections</a>. Find a block of import lines that need the same changes, hold down the option key as you drag your cursor vertically down the edit point on each import line, then begin typing or deleting text across the multiple lines.</p>
<h2>Switching branches</h2>
<p>Even after a successful merge, you may still run into stale GATK code or links from modifications made before and after the 3.2 package renaming. To significantly reduce the chances of this, run <code>mvn clean</code> <em>before</em> and then again <em>after</em> switching branches.</p>
<p>If this doesn't work, run <code>mvn clean && git status</code>, looking for any directories that shouldn't be in the current branch. It is possible that some files were not correctly moved, including classes or test resources. Find the files still in the old directories via a command such as <code>find public/gatk-framework -type f</code>. Then move them to the correct new directories and commit them into git.</p>
<h2>Slow Builds with Queue and Private</h2>
<p>Due to the [Renamed Binary Packages], the separate artifacts including and excluding private code are now packaged during the Maven package build lifecycle.</p>
<p>When building packages, if you only require the GATK tools, you can significantly speed up the default packaging time by running <code>mvn verify -P\!queue</code>.</p>
<p>Alternatively, if you do not require building the private source, disable private compilation via <code>mvn verify -P\!private</code>.</p>
<p>The two may be combined as well: <code>mvn verify -P\!queue,\!private</code>.</p>
<p>The exclamation mark is a shell metacharacter that must be escaped, in the above case with a backslash. Shell quotes may also be used: <code>mvn verify -P'!queue,!private'</code>.</p>
<p>Alternatively, developers with access to private may often want to disable packaging the protected distributions. In this case, use the <code>gsadev</code> profile, via <code>mvn verify -Pgsadev</code> or, excluding Queue, <code>mvn verify -Pgsadev,\!queue</code>.</p>
<h2>Stale symlinks</h2>
<p>Users see errors from Maven when an unclean git repository is updated. Because BaseTest.java currently hardcodes relative paths to "public/testdata", Maven creates symbolic links all over the file system to help the various tests in different modules find the relative path "<current module>/public/testdata".</p>
<p>However, our Maven support has evolved from 2.8, to 3.0, to now the 3.2 renaming, and each step has changed the symbolic link's target directory. Whenever a stale symbolic link to an old testdata directory remains in the user's folder, Maven refuses to remove the link: it does not know why the link points to the wrong folder (answer: the link is from an old git checkout) and assumes it is a bug in the build.</p>
<p>If you do not have a stale or unclean Maven repo when updating git via merge/rebase/checkout, you will never see this issue.</p>
<p>The script that removes the stale symlinks, <code>public/src/main/scripts/shell/delete_maven_links.sh</code>, should run automatically during a <code>mvn test-compile</code> or <code>mvn verify</code>.</p>
## Tribble
http://gatkforums.broadinstitute.org/gatk/discussion/1349/tribble
<h2>1. Overview</h2>
<p>The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed searching of reference-ordered data regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which were incorporated into the Tribble library.</p>
<h2>2. Architecture Overview</h2>
<p>Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over known feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a <code>BasicFeatureReader</code>, you can query for features that span a specific location, or get an iterator over all the records in the file.</p>
<h2>3. Developer Overview</h2>
<p>For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the feature class, which is your underlying record type.</p>
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/cc/f41b5df64878ee361ba5e4b78047ce.png" />
<p>In more detail, the two important classes are:</p>
<ul>
<li>
<p><strong>Feature</strong></p>
<p>This is the genomically oriented feature that represents the underlying data in the input file. For instance, in the VCF format this is the variant call, including quality information, the reference base, and the alternate base. The information required to implement a feature is the chromosome name, the start position, and the stop position. The start and stop positions represent a closed, one-based interval, i.e. the first base in chromosome one is chr1:1-1.</p>
</li>
<li>
<p><strong>FeatureCodec</strong></p>
<p>This class takes in a line of text (from an input source, whether it's a file, a compressed file, or an http link) and produces the above feature.</p>
</li>
</ul>
<p>To implement your new format in Tribble, you need to implement the two classes above (in an appropriately named subfolder in the Tribble checkout). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:</p>
<pre><code class="pre_md">public interface Feature {

    /**
     * Return the feature's reference sequence name, e.g. chromosome or contig
     */
    public String getChr();

    /**
     * Return the start position in 1-based coordinates (first base is 1)
     */
    public int getStart();

    /**
     * Return the end position following 1-based fully closed conventions. The length of a feature is
     * end - start + 1;
     */
    public int getEnd();
}</code class="pre_md"></pre>
<p>And the interface for FeatureCodec:</p>
<pre><code class="pre_md">/**
 * The base interface for classes that read in features.
 * @param <T> the feature type this codec reads
 */
public interface FeatureCodec<T extends Feature> {
    /**
     * Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
     *
     * @param line the input line to decode
     * @return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
     *         a comment)
     */
    public Feature decodeLoc(String line);

    /**
     * Decode a line as a Feature.
     *
     * @param line the input line to decode
     * @return the Feature encoded by the line, or null if the line does not represent a feature (e.g. is
     *         a comment)
     */
    public T decode(String line);

    /**
     * This function returns the type of object the codec generates. This is allowed to be Feature in the case
     * where conditionally different types are generated. Be as specific as you can, though.
     *
     * This function is used by reflection-based tools, so we can know the underlying type.
     *
     * @return the feature type this codec generates.
     */
    public Class<T> getFeatureType();

    /**
     * Read and return the header, or null if there is no header.
     *
     * @return the header object
     */
    public Object readHeader(LineReader reader);
}</code class="pre_md"></pre>
|
||||
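<p>To make the contract concrete, here is a self-contained sketch of a codec for a hypothetical three-column, tab-delimited interval format. The format, the class names, and the slimmed-down <code>Feature</code> interface (re-declared here so the sketch compiles on its own) are illustrative only, not part of Tribble:</p>

```java
// Simplified re-declaration of the Tribble Feature interface above, so this sketch is self-contained.
interface Feature {
    String getChr();
    int getStart();
    int getEnd();
}

// A hypothetical "SimpleInterval" format: one feature per line, "contig<TAB>start<TAB>end",
// with '#' introducing comment lines.
class SimpleIntervalFeature implements Feature {
    private final String chr;
    private final int start, end;

    SimpleIntervalFeature(String chr, int start, int end) {
        this.chr = chr;
        this.start = start;
        this.end = end;
    }

    public String getChr() { return chr; }
    public int getStart()  { return start; }
    public int getEnd()    { return end; }
}

class SimpleIntervalCodec {
    /** Decode a line as a feature, or return null for comments and blank lines. */
    public SimpleIntervalFeature decode(String line) {
        if (line.isEmpty() || line.startsWith("#"))
            return null;                       // not a feature (e.g. a comment)
        String[] fields = line.split("\t");
        // 1-based, fully closed coordinates, matching the Feature contract above
        return new SimpleIntervalFeature(fields[0],
                                         Integer.parseInt(fields[1]),
                                         Integer.parseInt(fields[2]));
    }

    /** Be as specific as possible about the generated type, per the FeatureCodec contract. */
    public Class<SimpleIntervalFeature> getFeatureType() {
        return SimpleIntervalFeature.class;
    }
}
```

Note how the codec, not the Feature, owns all knowledge of the on-disk representation; the returned object is pure in-memory data.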
<h2>4. Supported Formats</h2>
<p>The following formats are supported in Tribble:</p>
<ul>
<li>VCF Format</li>
<li>DbSNP Format</li>
<li>BED Format</li>
<li>GATK Interval Format</li>
</ul>
<h2>5. Updating the Tribble, htsjdk, and/or Picard library</h2>
<p>Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.</p>
<p><em>NOTE:</em> Any directory starting with <code>~</code> may be different on your machine, depending on where you cloned the various repositories for gsa-unstable, picard, and htsjdk.</p>
<p>A Maven script to install Picard into the local repository is located under <code>gsa-unstable/private/picard-maven</code>. To operate, it requires a symbolic link named <code>picard</code> pointing to a working checkout of the <a href="http://github.com/broadinstitute/picard">picard github repository</a>. <em>NOTE:</em> compiling Picard <a href="http://broadinstitute.github.io/picard">requires</a> an <a href="http://github.com/samtools/htsjdk">htsjdk github repository</a> checkout available at <code>picard/htsjdk</code>, either as a subdirectory or another symbolic link. The final full path should be <code>gsa-unstable/private/picard-maven/picard/htsjdk</code>.</p>
<pre><code class="pre_md">cd ~/src/gsa-unstable
cd private/picard-maven
ln -s ~/src/picard picard</code class="pre_md"></pre>
<p>Create a git branch of Picard and/or htsjdk and make your changes. To install your changes into the GATK you must run <code>mvn install</code> in the <code>private/picard-maven</code> directory. This compiles and copies the jars into <code>gsa-unstable/public/repo</code>, and updates <code>gsa-unstable/gatk-root/pom.xml</code> with the corresponding version. While you are making changes, your revision of Picard and htsjdk will be labeled with <code>-SNAPSHOT</code>.</p>
<pre><code class="pre_md">cd ~/src/gsa-unstable
cd private/picard-maven
mvn install</code class="pre_md"></pre>
<p>Continue testing in the GATK. Once your changes and updated tests for Picard/htsjdk are complete, push your branch and submit your pull request to the Picard and/or htsjdk github. After your Picard/htsjdk patches are accepted, switch your Picard/htsjdk branches back to the master branch. <em>NOTE:</em> Leave your gsa-unstable branch on your development branch!</p>
<pre><code class="pre_md">cd ~/src/picard
ant clean
git checkout master
git fetch
git rebase
cd htsjdk
git checkout master
git fetch
git rebase</code class="pre_md"></pre>
<p><em>NOTE:</em> The version numbers of the old and new Picard/htsjdk will vary, and during active development will end with <code>-SNAPSHOT</code>. If needed, you may push a <code>-SNAPSHOT</code> version for testing on Bamboo, but you should NOT submit a pull request with a <code>-SNAPSHOT</code> version. <code>-SNAPSHOT</code> indicates that your local changes are not reproducible from source control.</p>
<p>When ready, run <code>mvn install</code> once more to create the non-<code>-SNAPSHOT</code> versions under <code>gsa-unstable/public/repo</code>. In that directory, <code>git add</code> the new versions, and <code>git rm</code> the old versions.</p>
<pre><code class="pre_md">cd ~/src/gsa-unstable
cd public/repo
git add picard/picard/1.115.1499/
git add samtools/htsjdk/1.115.1509/
git rm -r picard/picard/1.112.1452/
git rm -r samtools/htsjdk/1.112.1452/</code class="pre_md"></pre>
<p>Commit and then push your gsa-unstable branch, then issue a pull request for review.</p>

@ -0,0 +1,102 @@
## Using DiffEngine to summarize differences between structured data files

http://gatkforums.broadinstitute.org/gatk/discussion/1299/using-diffengine-to-summarize-differences-between-structured-data-files
<h3>1. What is DiffEngine?</h3>
<p>DiffEngine is a summarizing difference engine that allows you to compare two structured files -- such as BAMs and VCFs -- to find the differences between them. This is primarily useful in regression testing or optimization, where you want to ensure that the differences are the ones you expect and no others.</p>
<h3>2. The summarized differences</h3>
<p>The GATK contains a summarizing difference engine called DiffEngine that compares hierarchical data structures to emit:</p>
<ul>
<li>
<p>A list of specific differences between the two data structures. This is similar to saying the value in field A in record 1 in file F differs from the value in field A in record 1 in file G.</p>
</li>
<li>A summarized list of differences ordered by frequency of the difference. This output is similar to saying field A differed in 50 records between files F and G.</li>
</ul>
<h3>3. The DiffObjects walker</h3>
<p>The GATK contains a private walker called DiffObjects that gives you access to the DiffEngine capabilities on the command line. Simply provide the walker with the master and test files and it will emit summarized differences for you.</p>
<h3>4. Understanding the output</h3>
<p>The DiffEngine system compares two hierarchical data structures for specific differences in the values of named nodes. Suppose I have two trees:</p>
<pre><code class="pre_md">Tree1=(A=1 B=(C=2 D=3))
Tree2=(A=1 B=(C=3 D=3 E=4))
Tree3=(A=1 B=(C=4 D=3 E=4))</code class="pre_md"></pre>
<p>where every node in the tree is named or is a raw value (here all leaf values are integers). The DiffEngine traverses these data structures by name, identifies equivalent nodes by fully qualified names (<code>Tree1.A</code> is distinct from <code>Tree2.A</code>), and determines where their values are equal (<code>Tree1.A=1</code> and <code>Tree2.A=1</code>, so they are).</p>
<p>These itemized differences are listed as:</p>
<pre><code class="pre_md">Tree1.B.C=2 != Tree2.B.C=3
Tree1.B.C=2 != Tree3.B.C=4
Tree2.B.C=3 != Tree3.B.C=4
Tree1.B.E=MISSING != Tree2.B.E=4</code class="pre_md"></pre>
<p>This is conceptually very similar to the output of the unix command-line tool <code>diff</code>. What's nice about DiffEngine, though, is that it computes similarity among the itemized differences and displays the count of each named difference in the system. In the above example, the field <code>C</code> is not equal three times, while the missing <code>E</code> in <code>Tree1</code> occurs only once. So the summary is:</p>
<pre><code class="pre_md">*.B.C : 3
*.B.E : 1</code class="pre_md"></pre>
<p>where the <code>*</code> operator indicates that any named field matches. This output is sorted by counts, and provides an immediate picture of the commonly occurring differences between the files.</p>
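<p>The collapse from itemized difference paths to counted wildcard patterns can be sketched in a few lines of Java. The class and method names here are invented for illustration; the real DiffEngine is considerably more general:</p>

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the summarization idea: replace the root name of each itemized
// difference path with '*' (any named root matches) and count occurrences.
class DiffSummary {
    static Map<String, Integer> summarize(List<String> itemizedPaths) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String path : itemizedPaths) {
            // "Tree1.B.C" -> "*.B.C"
            String pattern = "*" + path.substring(path.indexOf('.'));
            counts.merge(pattern, 1, Integer::sum);
        }
        return counts;
    }
}
```

Sorting the resulting map by descending count then reproduces summaries like <code>*.B.C : 3</code> above.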
<p>Below is a detailed example of two VCF files that differ because of a bug in the <code>AC</code>, <code>AF</code>, and <code>AN</code> counting routines, detected by the <code>integrationtest</code> framework (more below). You can see that although there are many specific instances of these differences between the two files, the summarized differences provide an immediate picture that the <code>AC</code>, <code>AF</code>, and <code>AN</code> fields are the major causes of the differences.</p>
<pre><code class="pre_md">[testng] path count
[testng] *.*.*.AC 6
[testng] *.*.*.AF 6
[testng] *.*.*.AN 6
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000598.AC 1</code class="pre_md"></pre>
<h3>5. Integration tests</h3>
<p>The DiffEngine codebase that supports these calculations is integrated into the <code>integrationtest</code> framework, so that when a test fails the system automatically summarizes the differences between the master MD5 file and the failing MD5 file, if it is an understood type. When a test fails you will see in the integration test logs not only the basic information, but also the detailed DiffEngine output.</p>
<p>For example, in the output below I broke the GATK BAQ calculation, and the integration test DiffEngine clearly identifies that all of the records differ in their <code>BQ</code> tag value in the two BAM files:</p>
<pre><code class="pre_md">/humgen/1kg/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam -o /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp -L 1:10,000,000-10,100,000 -baq RECALCULATE -et NO_ET
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] ##### MD5 file is up to date: integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] Checking MD5 for /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp [calculated=e5147656858fc4a5f470177b94b1fc1b, expected=4ac691bde1ba1301a59857694fda6ae2]
[testng] ##### Test testPrintReadsRecalBAQ is going fail #####
[testng] ##### Path to expected file (MD5=4ac691bde1ba1301a59857694fda6ae2): integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest
[testng] ##### Path to calculated file (MD5=e5147656858fc4a5f470177b94b1fc1b): integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##### Diff command: diff integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##:GATKReport.v0.1 diffences : Summarized differences between the master and test files.
[testng] See http://www.broadinstitute.org/gsa/wiki/index.php/DiffObjectsWalker_and_SummarizedDifferences for more information
[testng] Difference NumberOfOccurrences
[testng] *.*.*.BQ 895
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:2:266:272:361.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:245:474:254.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:255:178:160.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:158:682:495.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:195:591:884.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:165:236:848.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:191:223:910.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:286:279:434.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:2:106:516:354.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:3:102:580:518.BQ 1
[testng]
[testng] Note that the above list is not comprehensive. At most 20 lines of output, and 10 specific differences will be listed. Please use -T DiffObjects -R public/testdata/exampleFASTA.fasta -m integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest -t integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest to explore the differences more freely</code class="pre_md"></pre>
<h3>6. Adding your own DiffableObjects to the system</h3>
<p>The system dynamically finds all classes that implement the following simple interface:</p>
<pre><code class="pre_md">public interface DiffableReader {
    @Ensures("result != null")
    /**
     * Return the name of this DiffableReader type. For example, the VCF reader returns 'VCF' and the
     * bam reader 'BAM'
     */
    public String getName();

    @Ensures("result != null")
    @Requires("file != null")
    /**
     * Read up to maxElementsToRead DiffElements from file, and return them.
     */
    public DiffElement readFromFile(File file, int maxElementsToRead);

    /**
     * Return true if the file can be read into DiffElement objects with this reader. This should
     * be uniquely true/false for all readers, as the system will use the first reader that can read the
     * file. This routine should never throw an exception. The VCF reader, for example, looks at the
     * first line of the file for the ##format=VCF4.1 header, and the BAM reader for the BAM_MAGIC value.
     * @param file the file to test
     * @return true if this reader can read the file
     */
    @Requires("file != null")
    public boolean canRead(File file);</code class="pre_md"></pre>
<p>See the VCF and BAM DiffableReaders for example implementations. If you extend this to new object types, both the DiffObjects walker and the <code>integrationtest</code> framework will automatically work with your new file type.</p>
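<p>As an illustration of the <code>canRead</code> contract, a reader for a hypothetical format whose files begin with a <code>##format=MYFMT</code> line might sniff the header like this. All names here are invented; the point is that the check is cheap, deterministic, and never throws:</p>

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Sketch of a canRead() implementation for a hypothetical format whose files
// start with a "##format=MYFMT" header line.
class MyFormatSniffer {
    /** Pure check on the first line, kept separate so the logic is testable without IO. */
    static boolean headerLooksRight(String firstLine) {
        return firstLine != null && firstLine.startsWith("##format=MYFMT");
    }

    static boolean canRead(File file) {
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            return headerLooksRight(r.readLine());
        } catch (IOException e) {
            return false;  // the contract says: never throw, just answer "not mine"
        }
    }
}
```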
@ -0,0 +1,56 @@
## Writing GATKdocs for your walkers

http://gatkforums.broadinstitute.org/gatk/discussion/1324/writing-gatkdocs-for-your-walkers

<p>The GATKDocs are what we call <a href="http://www.broadinstitute.org/gatk/gatkdocs/">"Technical Documentation"</a> in the Guide section of this website. The HTML pages are generated automatically at build time from specific blocks of documentation in the source code.</p>
<p>The best place to look for example documentation for a GATK walker is the GATKDocsExample walker in <code>org.broadinstitute.sting.gatk.examples</code>, available <a href="https://github.com/broadgsa/gatk/blob/master/public/java/src/org/broadinstitute/sting/gatk/examples/GATKDocsExample.java">here</a>.</p>
<p>Below is a reproduction of that file from August 11, 2011:</p>
<pre><code class="pre_md">/**
 * [Short one sentence description of this walker]
 *
 * <p>
 * [Functionality of this walker]
 * </p>
 *
 * <h2>Input</h2>
 * <p>
 * [Input description]
 * </p>
 *
 * <h2>Output</h2>
 * <p>
 * [Output description]
 * </p>
 *
 * <h2>Examples</h2>
 * PRE-TAG
 * java
 *     -jar GenomeAnalysisTK.jar
 *     -T $WalkerName
 * PRE-TAG
 *
 * @category Walker Category
 * @author Your Name
 * @since Date created
 */
public class GATKDocsExample extends RodWalker<Integer, Integer> {
    /**
     * Put detailed documentation about the argument here. No need to duplicate the summary information
     * in the doc annotation field, as that will be added before this text in the documentation page.
     *
     * Notes:
     * <ul>
     *     <li>This field can contain HTML as a normal javadoc</li>
     *     <li>Don't include information about the default value, as gatkdocs adds this automatically</li>
     *     <li>Try your best to describe in detail the behavior of the argument, as ultimately confusing
     *         docs here will just result in user posts on the forum</li>
     * </ul>
     */
    @Argument(fullName="full", shortName="short", doc="Brief summary of argument [~ 80 characters of text]", required=false)
    private boolean myWalkerArgument = false;

    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) { return 0; }
    public Integer reduceInit() { return 0; }
    public Integer reduce(Integer value, Integer sum) { return value + sum; }
    public void onTraversalDone(Integer result) { }
}</code class="pre_md"></pre>
@ -0,0 +1,60 @@
## Writing and working with reference metadata classes

http://gatkforums.broadinstitute.org/gatk/discussion/1350/writing-and-working-with-reference-metadata-classes

<h2>Brief introduction to reference metadata (RMDs)</h2>
<p><em>Note that the <code>-B</code> flag referred to below is deprecated; these docs need to be updated.</em></p>
<p>The GATK allows you to process arbitrary numbers of reference metadata (RMD) files inside of walkers (previously we called this reference ordered data, or ROD). Common RMDs are things like dbSNP, VCF call files, and refseq annotations. The only real constraints on RMD files are that:</p>
<ul>
<li>
<p>They must contain information necessary to provide contig and position data for each element to the GATK engine, so it knows with which loci to associate each RMD element.</p>
</li>
<li>
<p>The file must be sorted with regard to the reference fasta file so that data can be accessed sequentially by the engine.</p>
</li>
<li>The file must have a <a href="http://gatkforums.broadinstitute.org/discussion/1349/tribble">Tribble</a> RMD parsing class associated with the file type so that elements in the RMD file can be parsed by the engine.</li>
</ul>
<p>Inside of the GATK, the RMD system has the concept of RMD tracks, which associate an arbitrary string name with the data in the associated RMD file. For example, the <code>VariantEval</code> module uses the named track <code>eval</code> to get calls for evaluation, and <code>dbsnp</code> as the track containing the database of known variants.</p>
<h2>How do I get reference metadata files into my walker?</h2>
<p>RMD files are extremely easy to get into the GATK using the <code>-B</code> syntax:</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:variant,VCF calls.vcf</code class="pre_md"></pre>
<p>In this example, the GATK will attempt to parse the file <code>calls.vcf</code> using the VCF parser and bind the VCF data to the RMD track named <code>variant</code>.</p>
<p>In general, you can provide as many RMD bindings to the GATK as you like:</p>
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:calls1,VCF calls1.vcf -B:calls2,VCF calls2.vcf</code class="pre_md"></pre>
<p>This works just as well. Some modules may require specifically named RMD tracks -- like <code>eval</code> above -- and some are happy to just assess all RMD tracks of a certain class and work with those -- like <code>VariantsToVCF</code>.</p>
<h3>1. Directly getting access to a single named track</h3>
<p>In this snippet from <code>SNPDensityWalker</code>, we grab the <code>eval</code> track as a <code>VariantContext</code> object, only for the variants that are of type SNP:</p>
<pre><code class="pre_md">public Pair<VariantContext, GenomeLoc> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    VariantContext vc = tracker.getVariantContext(ref, "eval", EnumSet.of(VariantContext.Type.SNP), context.getLocation(), false);
}</code class="pre_md"></pre>
<h3>2. Grabbing anything that's convertible to a VariantContext</h3>
<p>From <code>VariantsToVCF</code> we call the helper function <code>tracker.getVariantContexts</code> to look at all of the RMDs and convert what it can to <code>VariantContext</code> objects.</p>
<pre><code class="pre_md">Allele refAllele = new Allele(Character.toString(ref.getBase()), true);
Collection<VariantContext> contexts = tracker.getVariantContexts(INPUT_RMD_NAME, ALLOWED_VARIANT_CONTEXT_TYPES, context.getLocation(), refAllele, true, false);</code class="pre_md"></pre>
<h3>3. Looking at all of the RMDs</h3>
<p>Here's a totally general code snippet from <code>PileupWalker.java</code>. This code, as you can see, iterates over all of the GATKFeature objects in the reference ordered data, converting each RMD to a string and capturing these strings in a list. It finally grabs the dbSNP binding specifically for a more detailed string conversion, and then binds them all up in a single string for display along with the read pileup.</p>
<pre><code class="pre_md">private String getReferenceOrderedData( RefMetaDataTracker tracker ) {
    ArrayList<String> rodStrings = new ArrayList<String>();
    for ( GATKFeature datum : tracker.getAllRods() ) {
        if ( datum != null && ! (datum.getUnderlyingObject() instanceof DbSNPFeature)) {
            rodStrings.add(((ReferenceOrderedDatum)datum.getUnderlyingObject()).toSimpleString()); // TODO: Aaron: this line still survives, try to remove it
        }
    }
    String rodString = Utils.join(", ", rodStrings);

    DbSNPFeature dbsnp = tracker.lookup(DbSNPHelper.STANDARD_DBSNP_TRACK_NAME, DbSNPFeature.class);

    if ( dbsnp != null)
        rodString += DbSNPHelper.toMediumString(dbsnp);

    if ( !rodString.equals("") )
        rodString = "[ROD: " + rodString + "]";

    return rodString;
}</code class="pre_md"></pre>
<h2>How do I write my own RMD types?</h2>
<p>Tracks of reference metadata are loaded using the <a href="http://gatkforums.broadinstitute.org/discussion/1349/tribble">Tribble</a> infrastructure. Tracks are loaded using the feature codec and underlying type information. See the <a href="http://gatkforums.broadinstitute.org/discussion/1349/tribble">Tribble documentation</a> for more information.</p>
<p>Tribble codecs that are in the classpath are automatically found; the GATK discovers all classes that implement the <code>FeatureCodec</code> interface. Name resolution occurs using the <code>-B</code> type parameter, i.e. if the user specified:</p>
<pre><code class="pre_md">-B:calls1,VCF calls1.vcf</code class="pre_md"></pre>
<p>the GATK looks for a <code>FeatureCodec</code> called <code>VCFCodec.java</code> to decode the record type. Alternately, if the user specified:</p>
<pre><code class="pre_md">-B:calls1,MYAwesomeFormat calls1.maft</code class="pre_md"></pre>
<p>the GATK would look for a codec called <code>MYAwesomeFormatCodec.java</code>. This look-up is not case sensitive, i.e. it will resolve <code>MyAwEsOmEfOrMaT</code> as well, though why you would want to write something so painfully ugly to read is beyond us.</p>
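<p>The look-up just described boils down to appending <code>Codec</code> to the track type and matching class names case-insensitively. A toy sketch of that resolution (the candidate list here stands in for classpath discovery, and the class name is invented):</p>

```java
import java.util.List;

// Sketch of the case-insensitive "<format name>Codec" lookup described above.
class CodecResolver {
    static String resolve(String formatName, List<String> codecClassNames) {
        String wanted = formatName + "Codec";
        for (String candidate : codecClassNames) {
            if (candidate.equalsIgnoreCase(wanted))
                return candidate;   // first case-insensitive match wins
        }
        return null;  // no codec found for this format name
    }
}
```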
@ -0,0 +1,133 @@
## Writing unit tests for walkers

http://gatkforums.broadinstitute.org/gatk/discussion/1339/writing-unit-tests-for-walkers

<h2>1. Testing core walkers is critical</h2>
<p>Most GATK walkers are really too complex to test easily using the standard unit test framework. It's just not feasible to make artificial read piles and then extrapolate from simple tests passing whether the system as a whole is working correctly. However, we need some way to determine whether changes to the core of the GATK are altering the expected output of complex walkers like BaseRecalibrator or SingleSampleGenotyper. In addition to correctness, we want to make sure that the performance of key walkers isn't degrading over time, so that calling SNPs, cleaning indels, etc., isn't slowly getting slower. Since we are now using a bamboo server to automatically build and run unit tests (as well as measure their runtimes), we want to put as many good walker tests into the test framework as we can, so we capture performance metrics over time.</p>
<h2>2. The WalkerTest framework</h2>
<p>To make this testing process easier, we've created a <code>WalkerTest</code> framework that lets you invoke the GATK using command-line GATK commands in the <code>JUnit</code> system and test for changes in your output files by comparing the current ant build results to a previous run via an MD5 sum. It's a bit coarse-grained, but it will work to ensure that changes to key walkers are detected quickly by the system, and authors can either update the expected MD5s or go track down bugs.</p>
<p>The system is fairly straightforward to use. Ultimately we will end up with <code>JUnit</code>-style tests in the unit testing structure. The piece of code below checks the MD5 of the SingleSampleGenotyper's GELI text output at LOD 3 and LOD 10.</p>
<pre><code class="pre_md">package org.broadinstitute.sting.gatk.walkers.genotyper;

import org.broadinstitute.sting.WalkerTest;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;

public class SingleSampleGenotyperTest extends WalkerTest {
    @Test
    public void testLOD() {
        HashMap<Double, String> e = new HashMap<Double, String>();
        e.put( 10.0, "e4c51dca6f1fa999f4399b7412829534" );
        e.put( 3.0, "d804c24d49669235e3660e92e664ba1a" );

        for ( Map.Entry<Double, String> entry : e.entrySet() ) {
            WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
                "-T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout %s --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod " + entry.getKey(), 1,
                Arrays.asList(entry.getValue()));
            executeTest("testLOD", spec);
        }
    }
}</code class="pre_md"></pre>
<p>The fundamental piece here is to inherit from <code>WalkerTest</code>. This gives you access to the <code>executeTest()</code> function that consumes a <code>WalkerTestSpec</code>:</p>
<pre><code class="pre_md">public WalkerTestSpec(String args, int nOutputFiles, List<String> md5s)</code class="pre_md"></pre>
<p>The <code>WalkerTestSpec</code> takes regular, command-line style GATK arguments describing what you want to run, the number of output files the walker will generate, and your expected MD5s for each of these output files. The args string can contain <code>%s</code> <code>String.format</code> specifications, and for each of the <code>nOutputFiles</code>, the <code>executeTest()</code> function will (1) generate a <code>tmp</code> file for output and (2) call <code>String.format</code> on your args to fill in the tmp output files in your arguments string. For example, in the above argument string <code>varout</code> is followed by <code>%s</code>, so our single SingleSampleGenotyper output is the variant output file.</p>
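<p>The placeholder filling is ordinary <code>String.format</code>; a minimal sketch of the idea (the class and method names are invented for illustration):</p>

```java
// Sketch: WalkerTestSpec-style argument strings contain one %s per output file;
// the framework substitutes the generated temp-file paths in order.
class ArgFiller {
    static String fill(String argTemplate, Object... tmpOutputFiles) {
        return String.format(argTemplate, tmpOutputFiles);
    }
}
```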
<h2>3. Example output</h2>
<p>When you add a <code>WalkerTest</code>-inherited unit test to the GATK and then <code>build test</code>, you'll see output that looks like:</p>
<pre><code class="pre_md">[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest -
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest - </code class="pre_md"></pre>
<h2>4. Recommended location for GATK testing data</h2>
|
||||
<p>We keep all of the permenant GATK testing data in:</p>
|
||||
<pre><code class="pre_md">/humgen/gsa-scr1/GATK_Data/Validation_Data/</code class="pre_md"></pre>
|
||||
<p>A good set of data to use for walker testing is the CEU daughter data from 1000 Genomes:</p>
|
||||
<pre><code class="pre_md">gsa2 ~/dev/GenomeAnalysisTK/trunk > ls -ltr /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.bam /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.calls
|
||||
-rw-rw-r--+ 1 depristo wga 51M 2009-09-03 07:56 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam
|
||||
-rw-rw-r--+ 1 depristo wga 185K 2009-09-04 13:21 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.variants.geli.calls
|
||||
-rw-rw-r--+ 1 depristo wga 164M 2009-09-04 13:22 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.genotypes.geli.calls
|
||||
-rw-rw-r--+ 1 depristo wga 24M 2009-09-04 15:00 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SOLID.bam
|
||||
-rw-rw-r--+ 1 depristo wga 12M 2009-09-04 15:01 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.454.bam
|
||||
-rw-r--r--+ 1 depristo wga 91M 2009-09-04 15:02 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam</code class="pre_md"></pre>
<h2>5. Test dependencies</h2>
<p>The tests depend on a variety of input files that are generally confined to three mount points on the internal Broad network:</p>
<pre><code class="pre_md">*/seq/
*/humgen/1kg/
*/humgen/gsa-hpprojects/GATK/Data/Validation_Data/</code class="pre_md"></pre>
<p>To run the unit and integration tests you'll need access to these files. They may have different mount points on your machine (say, if you're running remotely over the VPN and have mounted the directories on your own machine).</p>
<h2>6. MD5 database and comparing MD5 results</h2>
<p>Every file that generates an MD5 sum as part of the WalkerTest framework will be copied to <code><MD5>.integrationtest</code> in the <code>integrationtests</code> subdirectory of the GATK trunk. This MD5 database of results enables you to easily examine the results of an integration test, as well as compare the results of a test before and after a code change. For example, below is a test of the UnifiedGenotyper where, due to a code change, the output VCF differs from the VCF with the expected MD5 value recorded in the test code itself. The test provides the paths to the two result files as well as a diff command to compare the expected to the observed output:</p>
<pre><code class="pre_md">[junit] --------------------------------------------------------------------------------
[junit] Executing test testParameter[-genotype] with GATK arguments: -T UnifiedGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05997727998894311741.tmp -L 1:10,000,000-10,010,000 -genotype
[junit] ##### MD5 file is up to date: integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] Checking MD5 for /tmp/walktest.tmp_param.05997727998894311741.tmp [calculated=ab20d4953b13c3fc3060d12c7c6fe29d, expected=0ac7ab893a3f550cb1b8c34f28baedf6]
[junit] ##### Test testParameter[-genotype] is going fail #####
[junit] ##### Path to expected file (MD5=0ac7ab893a3f550cb1b8c34f28baedf6): integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest
[junit] ##### Path to calculated file (MD5=ab20d4953b13c3fc3060d12c7c6fe29d): integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] ##### Diff command: diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest</code class="pre_md"></pre>
<p>Examining the diff, we see a few lines where the <code>DP</code> count has changed in the new code:</p>
<pre><code class="pre_md">> diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest | head
385,387c385,387
< 1 10000345 . A . 106.54 . AN=2;DP=33;Dels=0.00;MQ=89.17;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
< 1 10000346 . A . 103.75 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
< 1 10000347 . A . 109.79 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78.04
---
> 1 10000345 . A . 106.54 . AN=2;DP=32;Dels=0.00;MQ=89.50;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
> 1 10000346 . A . 103.75 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
> 1 10000347 . A . 109.79 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78</code class="pre_md"></pre>
<p>Whether this is the expected change is up to you to decide, but the system makes it as easy as possible to see the consequences of your code change.</p>
<h2>7. Testing for Exceptions</h2>
<p>The walker test framework supports an additional syntax for ensuring that a particular Java exception is thrown when a walker executes, using a simple alternate version of the <code>WalkerSpec</code> object. Rather than specifying the MD5 of the result, you provide a single subclass of <code>Exception.class</code> and the testing framework will ensure that when the walker runs, an instance (class or subclass) of your expected exception is thrown. The system also flags the test if no exception is thrown.</p>
<p>For example, the following code tests that the GATK can detect and error out when incompatible VCF and FASTA files are given:</p>
<pre><code class="pre_md">@Test public void fail8() { executeTest("hg18lex-v-b36", test(lexHG18, callsB36)); }

private WalkerTest.WalkerTestSpec test(String ref, String vcf) {
    return new WalkerTest.WalkerTestSpec("-T VariantsToTable -M 10 -B:two,vcf "
            + vcf + " -F POS,CHROM -R "
            + ref + " -o %s",
            1, UserException.IncompatibleSequenceDictionaries.class);
}</code class="pre_md"></pre>
<p>During the integration test this looks like:</p>
<pre><code class="pre_md">[junit] Executing test hg18lex-v-b36 with GATK arguments: -T VariantsToTable -M 10 -B:two,vcf /humgen/gsa-hpprojects/GATK/data/Validation_Data/lowpass.N3.chr1.raw.vcf -F POS,CHROM -R /humgen/gsa-hpprojects/GATK/data/Validation_Data/lexFasta/lex.hg18.fasta -o /tmp/walktest.tmp_param.05541601616101756852.tmp -l WARN -et NO_ET
[junit] [junit] Wanted exception class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries, saw class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries
[junit] => hg18lex-v-b36 PASSED</code class="pre_md"></pre>
<h2>8. Miscellaneous information</h2>
<ul>
<li>
<p>Please do not put any extremely long tests in the regular <code>ant build test</code> target. We are currently splitting the system into fast and slow tests so that unit tests can be run in < 3 minutes, while reserving a separate target for long-running regression tests. More information on that will be posted.</p>
</li>
<li>
<p>An expected MD5 string of <code>""</code> means don't check for equality between the calculated and expected MD5s. This is useful if you are just writing a new test and don't yet know the true output.</p>
</li>
<li>
<p>Override <code>parameterize() { return true; }</code> if you want the system to run your calculations across all tests without throwing an error when MD5s don't match.</p>
</li>
<li>
<p>If your tests suddenly stop producing matching MD5s, you can (1) look at the <code>.tmp</code> output files directly or (2) grab the printed GATK command-line options and explore what is happening.</p>
</li>
<li>
<p>You can always run a GATK walker on the command line and then run md5sum on its output files to obtain the expected MD5 results outside of the testing framework.</p>
</li>
<li>Don't worry about the duplication of lines in the output; it's just an annoyance of having two global loggers. Eventually we'll fix this.</li>
</ul>
@ -0,0 +1,68 @@
## Writing walkers

http://gatkforums.broadinstitute.org/gatk/discussion/1302/writing-walkers

<h3>1. Introduction</h3>
<p>The core concept behind GATK tools is the walker, a class that implements the three core operations: <strong>filtering</strong>, <strong>mapping</strong>, and <strong>reducing</strong>.</p>
<ul>
<li>
<p><strong>filter</strong>
Reduces the size of the dataset by applying a predicate.</p>
</li>
<li>
<p><strong>map</strong>
Applies a function to each individual element in a dataset, effectively <em>mapping</em> it to a new element.</p>
</li>
<li><strong>reduce</strong>
Inductively combines the elements of a list. The base case is supplied by the <code>reduceInit()</code> function, and the inductive step is performed by the <code>reduce()</code> function.</li>
</ul>
<p>Users of the GATK provide a walker to run their analyses. The engine produces a result by first filtering the dataset, then running the map operation, and finally reducing the mapped values to a single result.</p>
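<p>The filter/map/reduce contract can be sketched as a plain Java loop, independent of the GATK engine. All class and method names below are illustrative, not GATK API; the map/reduce pair here mimics a read counter:</p>

```java
import java.util.List;

// Framework-free sketch of the walker contract: the engine filters each
// element, maps the survivors, and folds the mapped values into one result.
public class WalkerSketch {
    // filter: keep only elements satisfying a predicate
    static boolean filter(int read) { return read >= 0; }

    // map: transform one element into an intermediate value
    // (a read counter effectively maps every read to 1)
    static int map(int read) { return 1; }

    // reduceInit/reduce: base case and inductive step of the fold
    static int reduceInit() { return 0; }
    static int reduce(int value, int sum) { return value + sum; }

    // The engine's traversal: filter, then map, then reduce
    static int traverse(List<Integer> reads) {
        int sum = reduceInit();
        for (int read : reads) {
            if (!filter(read)) continue;
            sum = reduce(map(read), sum);
        }
        return sum;
    }

    public static void main(String[] args) {
        // counts the three "reads" that pass the filter
        System.out.println(traverse(List.of(5, -1, 7, 3)));
    }
}
```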
<h3>2. Creating a Walker</h3>
<p>To be usable by the GATK, the walker must satisfy the following properties:</p>
<ul>
<li>
<p>It must subclass one of the basic walkers in the <code>org.broadinstitute.sting.gatk.walkers</code> package, usually ReadWalker or LocusWalker.</p>
<ul>
<li>
<p>Locus walkers present all the reads, reference bases, and reference-ordered data that overlap a single base in the reference. Locus walkers are best used for analyses that look at each locus independently, such as genotyping.</p>
</li>
<li>
<p>Read walkers present only one read at a time, as well as the reference bases and reference-ordered data that overlap that read.</p>
</li>
<li>Besides read walkers and locus walkers, the GATK features several other data access patterns, described <a href="http://www.broadinstitute.org/gatk/guide/article?id=1351">here</a>.</li>
</ul>
</li>
<li>The compiled class or jar must be on the current classpath. The Java classpath can be controlled using either the <code>$CLASSPATH</code> environment variable or the JVM's <code>-cp</code> option.</li>
</ul>
<h3>3. Examples</h3>
<p>The best way to get started with the GATK is to explore the walkers we've written. Here are the best walkers to look at when getting started:</p>
<ul>
<li>
<p>CountLoci</p>
<p>This is the simplest locus walker in our codebase. It counts the number of loci walked over in a single run of the GATK.</p>
</li>
</ul>
<p><code>$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLociWalker.java</code></p>
<ul>
<li>
<p>CountReads</p>
<p>This is the simplest read walker in our codebase. It counts the number of reads walked over in a single run of the GATK.</p>
</li>
</ul>
<p><code>$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadsWalker.java</code></p>
<ul>
<li>
<p>GATKPaperGenotyper</p>
<p>This is a more sophisticated example, taken from our recent paper in Genome Research (and using our ReadBackedPileup to select and filter reads). It is an extremely basic Bayesian genotyper that demonstrates how to output data to a stream and execute simple base operations.</p>
</li>
</ul>
<p><code>$STING_HOME/java/src/org/broadinstitute/sting/gatk/examples/papergenotyper/GATKPaperGenotyper.java</code></p>
<p><strong>Please note that the walker above is NOT the UnifiedGenotyper. While conceptually similar to the UnifiedGenotyper, the GATKPaperGenotyper uses a much simpler calling model for increased clarity and readability.</strong></p>
<h3>4. External walkers and the 'external' directory</h3>
<p>The GATK can absorb external walkers placed in a directory of your choosing. By default, that directory is called 'external' and is relative to the Sting git root directory (for example, <code>~/src/Sting/external</code>). However, you can choose to place that directory anywhere on the filesystem and specify its complete path using the ant <code>external.dir</code> property.</p>
<pre><code class="pre_md">ant -Dexternal.dir=~/src/external</code class="pre_md"></pre>
<p>The GATK will check each directory under the external directory (but not the external directory itself!) for small build scripts. These build scripts must contain at least a <code>compile</code> target that compiles your walker and places the resulting class file into the GATK's class file output directory. The following is a sample compile target:</p>
<pre><code class="pre_md"><target name="compile" depends="init">
    <javac srcdir="." destdir="${build.dir}" classpath="${gatk.classpath}" />
</target></code class="pre_md"></pre>
<p>As a convenience, the <code>build.dir</code> ant property will be predefined to be the GATK's class file output directory and the <code>gatk.classpath</code> property will be predefined to be the GATK's core classpath. Once this structure is defined, any invocation of the ant build scripts will build the contents of the external directory as well as the GATK itself.</p>
@ -0,0 +1,55 @@
## Writing walkers in Scala

http://gatkforums.broadinstitute.org/gatk/discussion/1354/writing-walkers-in-scala

<h2>1. Install scala somewhere</h2>
<p>At the Broad, we typically put it somewhere like this:</p>
<pre><code class="pre_md">/home/radon01/depristo/work/local/scala-2.7.5.final</code class="pre_md"></pre>
<p>Next, create a symlink from this directory to <code>trunk/scala/installation</code>:</p>
<pre><code class="pre_md">ln -s /home/radon01/depristo/work/local/scala-2.7.5.final trunk/scala/installation</code class="pre_md"></pre>
<h2>2. Setting up your path</h2>
<p>Right now the only way to get Scala walkers into the GATK is by explicitly setting your <code>CLASSPATH</code> in your <code>.my.cshrc</code> file:</p>
<pre><code class="pre_md">setenv CLASSPATH /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/FourBaseRecaller.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GenomeAnalysisTK.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/Playground.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/StingUtils.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/bcel-5.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/colt-1.2.0.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/google-collections-0.9.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/javassist-3.7.ga.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/junit-4.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/log4j-1.2.15.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-1.02.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-private-875.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/reflections-0.9.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/sam-1.01.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/simple-xml-2.0.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar</code class="pre_md"></pre>
<p>Really this needs to be manually updated whenever any of the libraries are updated. If you see this error:</p>
<pre><code class="pre_md">Caused by: java.lang.RuntimeException: java.util.zip.ZipException: error in opening zip file
at org.reflections.util.VirtualFile.iterable(VirtualFile.java:79)
at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:169)
at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:167)
at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:43)
at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:41)
at org.reflections.util.FluentIterable$ForkIterator.computeNext(FluentIterable.java:81)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
at org.reflections.util.FluentIterable$FilterIterator.computeNext(FluentIterable.java:102)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
at org.reflections.util.FluentIterable$TransformIterator.computeNext(FluentIterable.java:124)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
at org.reflections.Reflections.scan(Reflections.java:69)
at org.reflections.Reflections.<init>(Reflections.java:47)
at org.broadinstitute.sting.utils.PackageUtils.<clinit>(PackageUtils.java:23)</code class="pre_md"></pre>
<p>It's because the libraries aren't updated. Basically, just do an <code>ls</code> of your <code>trunk/dist</code> directory after the GATK has been built, make this your classpath as above, and tack on:</p>
<pre><code class="pre_md">/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar</code class="pre_md"></pre>
<p>A command that almost works (but you'll need to replace the spaces with colons) is:</p>
<pre><code class="pre_md">#setenv CLASSPATH $CLASSPATH `ls /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/*.jar` /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar</code class="pre_md"></pre>
<h2>3. Building scala code</h2>
<p>All of the Scala source code lives in <code>scala/src</code>, which you build using <code>ant scala</code>.</p>
<p>There are already some example Scala walkers in <code>scala/src</code>, so after doing a standard checkout, installing Scala, and setting up your environment, you should be able to run something like:</p>
<pre><code class="pre_md">gsa2 ~/dev/GenomeAnalysisTK/trunk > ant scala
Buildfile: build.xml

init.scala:

scala:
[echo] Sting: Compiling scala!
[scalac] Compiling 2 source files to /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/scala/classes
[scalac] warning: there were deprecation warnings; re-run with -deprecation for details
[scalac] one warning found
[scalac] Compile suceeded with 1 warning; see the compiler output for details.
[delete] Deleting: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar
[jar] Building jar: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar</code class="pre_md"></pre>
<h2>4. Invoking a scala walker</h2>
<p>Until we can include Scala walkers along with the main GATK jar (avoiding the classpath issue too), you have to invoke your Scala walkers using this syntax:</p>
<pre><code class="pre_md">java -Xmx2048m org.broadinstitute.sting.gatk.CommandLineGATK -T BaseTransitionTableCalculator -R /broad/1KG/reference/human_b36_both.fasta -I /broad/1KG/DCC_merged/freeze5/NA12878.pilot2.SLX.bam -l INFO -L 1:1-100</code class="pre_md"></pre>
<p>Here, the <code>BaseTransitionTableCalculator</code> walker is written in Scala and is loaded into the system by the GATK walker manager. Otherwise everything looks like a normal GATK module.</p>
@ -0,0 +1,6 @@
## Bait bias

http://gatkforums.broadinstitute.org/gatk/discussion/6333/bait-bias

<p>Bait bias (single bait bias or reference bias artifact) is a type of artifact that affects data generated through <a href="http://gatkforums.broadinstitute.org/gatk/discussion/6331">hybrid selection</a> methods.</p>
<p>These artifacts occur during or after the target selection step, and correlate with substitution rates that are biased or higher for sites having one base on the reference/positive strand relative to sites having the complementary base on that strand. For example, a G>T artifact during the target selection step might result in a higher (G>T)/(C>A) substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the flip (C positive)/(G negative). This is known as the <strong>"G-Ref"</strong> artifact.</p>
@ -0,0 +1,19 @@
## Biallelic vs Multiallelic sites

http://gatkforums.broadinstitute.org/gatk/discussion/6455/biallelic-vs-multiallelic-sites

<p>A <strong>biallelic</strong> site is a specific locus in a genome that contains two observed alleles, counting the reference as one, and therefore allowing for one variant allele. In practical terms, this is what you would call a site where, across multiple samples in a cohort, you have evidence for a single non-reference allele. Shown below is a toy example in which the consensus sequences for samples 1-3 have a <em>deletion</em> at position 7. Sample 4 matches the reference. This is considered a biallelic site because there are only two possible alleles-- a deletion, or the reference allele <code>G</code>.</p>
<pre><code>           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T - C G
Sample 3 : A T A T A T - C G
Sample 4 : A T A T A T G C G</code></pre>
<hr />
<p>A <strong>multiallelic</strong> site is a specific locus in a genome that contains three or more observed alleles, again counting the reference as one, and therefore allowing for two or more variant alleles. This is what you would call a site where, across multiple samples in a cohort, you see evidence for two or more non-reference alleles. Shown below is a toy example in which the consensus sequences for samples 1-3 have a <em>deletion</em> or a <em>SNP</em> at the 7th position. Sample 4 matches the reference. This is considered a multiallelic site because there are four possible alleles-- a deletion, the reference allele <code>G</code>, a <code>C</code> (SNP), or a <code>T</code> (SNP). True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.</p>
<pre><code>           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T C C G
Sample 3 : A T A T A T T C G
Sample 4 : A T A T A T G C G</code></pre>
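<p>The distinction comes down to counting distinct observed alleles at the site, reference included. A toy sketch (all names are illustrative):</p>

```java
import java.util.Set;

// Classify a site by the number of distinct alleles observed across
// samples, counting the reference allele as one of them.
public class SiteClassifier {
    static String classify(Set<String> observedAlleles) {
        int n = observedAlleles.size();
        if (n <= 1) return "monomorphic"; // only the reference allele seen
        return n == 2 ? "biallelic" : "multiallelic";
    }

    public static void main(String[] args) {
        // First toy example: reference G plus a deletion ("-")
        System.out.println(classify(Set.of("G", "-")));
        // Second toy example: G, deletion, C, and T
        System.out.println(classify(Set.of("G", "-", "C", "T")));
    }
}
```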
@ -0,0 +1,6 @@
## Bisulfite sequencing / Cytosine methylation

http://gatkforums.broadinstitute.org/gatk/discussion/6330/bisulfite-sequencing-cytosine-methylation

<p>Cytosine methylation is a key component in epigenetic regulation of gene expression and frequently occurs at CpG sites throughout the genome. Bisulfite sequencing is a technique used to analyze genome-wide methylation profiles at single-nucleotide resolution <a href="http://nar.oxfordjournals.org/content/33/18/5868.short"><strong>[doi:10.1093/nar/gki901]</strong></a>. Sodium bisulfite efficiently and selectively deaminates unmethylated cytosine residues to uracil without affecting 5-methylcytosine (methylated). Using restriction enzymes and PCR to enrich for regions of the genome that have high CpG content, the resulting reduced genome comprises ~1% of the original genome but includes key regulatory sequences as well as repeated regions.</p>
<p>The protocol involves several steps. First, genomic DNA is digested with a restriction endonuclease such as MspI, which targets CG dinucleotides. This results in DNA fragments with CG at the ends. Next, the fragments are size-selected (via gel electrophoresis), which facilitates the enrichment of CpG-containing sequences. This is followed by bisulfite treatment, which converts unmethylated C nucleotides to uracil (U) while methylated cytosines remain intact. The bisulfite-treated DNA is amplified with a proofreading-deficient DNA polymerase to facilitate amplification of both methylated cytosines and the C -> U converted bases. After PCR amplification, each original unmethylated cytosine will read as a T (+ strand) or an A (- strand), while a methylated C will remain a C (+ strand) or a G (- strand). The PCR products are then sequenced using conventional methods and aligned to a reference.</p>
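<p>The plus-strand read-out logic can be sketched as a toy conversion function, assuming a made-up per-position methylation mask (illustrative only):</p>

```java
// Sketch of the plus-strand read-out after bisulfite treatment and PCR:
// an unmethylated C is read as T, a methylated C stays C; other bases
// are unchanged. The boolean mask marks methylated positions.
public class BisulfiteSketch {
    static String plusStrandReadout(String seq, boolean[] methylated) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = 0; i < seq.length(); i++) {
            char b = seq.charAt(i);
            out.append(b == 'C' && !methylated[i] ? 'T' : b);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A CpG-containing fragment with only the second C methylated:
        // the unmethylated C converts to T, the methylated C survives.
        System.out.println(plusStrandReadout("ACGCG",
                new boolean[] {false, false, false, true, false})); // ATGCG
    }
}
```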
@ -0,0 +1,44 @@
## Downsampling

http://gatkforums.broadinstitute.org/gatk/discussion/1323/downsampling

<h4>Downsampling is a process by which read depth is reduced, either at a particular position or within a region.</h4>
<p>Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.</p>
<p>Note that there is also a proportional "downsample to fraction" mechanism that is mostly intended for testing the effect of different overall coverage means on analysis results.</p>
<p>See below for details of how this is implemented and controlled in GATK.</p>
<hr />
<h2>1. Downsampling to a target coverage</h2>
<p>The principle of this downsampling type is to downsample reads to a given capping threshold of coverage. Its purpose is to get rid of excessive coverage, because above a certain depth, having additional data is not informative and imposes unreasonable computational costs. The downsampling process takes two different forms depending on the type of analysis it is used with. For locus-based traversals (LocusWalkers like UnifiedGenotyper and ActiveRegionWalkers like HaplotypeCaller), downsample_to_coverage controls the maximum depth of coverage at each locus. For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position. For ReadWalkers you will typically need to use much lower dcov values than you would with LocusWalkers to see an effect.</p>
<p>Note that this downsampling option does not produce an unbiased random sampling from all available reads at each locus: instead, the primary goal of the to-coverage downsampler is to maintain an even representation of reads from all alignment start positions when removing excess coverage. For a truly unbiased random sampling of reads, use -dfrac instead. Also note that the coverage target is an approximate goal that is not guaranteed to be met exactly: the downsampling algorithm will under some circumstances retain slightly more or less coverage than requested.</p>
<h3>Defaults</h3>
<p>The GATK's default downsampler (invoked by <code>-dcov</code>) exhibits the following properties:</p>
<ul>
<li>The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.</li>
<li>The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.</li>
<li>The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.</li>
</ul>
<p>By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.</p>
<h3>Customizing</h3>
<p>From the command line:</p>
<ul>
<li>To disable the downsampler, specify <code>-dt NONE</code>.</li>
<li>To change the default coverage per-sample, specify the desired coverage to the <code>-dcov</code> option.</li>
</ul>
<p>To modify the walker's default behavior:</p>
<ul>
<li>Add the @Downsample interface to the top of your walker. Override the downsampling type by changing the <code>by=<value></code>. Override the downsampling depth by changing the <code>toCoverage=<value></code>.</li>
</ul>
<h3>Algorithm details</h3>
<p>The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:</p>
<p>For each sample:</p>
<ul>
<li>Select <sample size> reads with the next alignment start.</li>
<li>While the number of existing reads + the number of incoming reads is greater than the target sample size, walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.</li>
<li>If we have n slots available, where n >= 1, randomly select n of the incoming reads and add them to the pileup.</li>
<li>Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start, throw it out, and add one randomly selected read from the new pileup.</li>
</ul>
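<p>The slot-based admission step can be sketched in plain Java. This is an illustrative simplification, not the actual GATK downsampler; each read is represented by just its alignment start position:</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Per-sample admission step for one batch of incoming reads that share
// the next alignment start: admit random incoming reads while slots
// remain under targetSize; with zero slots, evict the read with the
// least alignment start before admitting one randomly chosen read.
public class DownsamplerSketch {
    static void admit(List<Integer> pileup, List<Integer> incoming, int targetSize, Random rng) {
        List<Integer> pool = new ArrayList<>(incoming);
        int slots = targetSize - pileup.size();
        for (int i = 0; i < slots && !pool.isEmpty(); i++) {
            // slots available: admit randomly selected incoming reads
            pileup.add(pool.remove(rng.nextInt(pool.size())));
        }
        if (slots <= 0 && !pool.isEmpty()) {
            // zero slots: evict the leftmost read, admit one incoming read
            pileup.remove(Collections.min(pileup));
            pileup.add(pool.remove(rng.nextInt(pool.size())));
        }
    }

    public static void main(String[] args) {
        List<Integer> pileup = new ArrayList<>(List.of(100, 101, 102, 103));
        admit(pileup, List.of(105, 105, 105), 4, new Random(0));
        // still 4 reads; the read starting at 100 has been evicted
        System.out.println(pileup);
    }
}
```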
<hr />
<h2>2. Downsampling to a fraction of the coverage</h2>
<p>Reads will be downsampled so that the specified fraction remains; e.g. if you specify <code>-dfrac 0.25</code>, three-quarters of the reads will be removed, and the remaining quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.</p>
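<p>A minimal sketch of fractional downsampling, assuming each read is kept independently with probability equal to the fraction (illustrative, not the GATK implementation):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Unbiased fractional downsampling: each read survives independently
// with probability `fraction`.
public class FractionDownsampler {
    static <T> List<T> downsampleToFraction(List<T> reads, double fraction, Random rng) {
        List<T> kept = new ArrayList<>();
        for (T read : reads) {
            if (rng.nextDouble() < fraction) kept.add(read);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> reads = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        // with -dfrac 0.25, roughly one quarter of the reads survive
        System.out.println(downsampleToFraction(reads, 0.25, new Random()).size());
    }
}
```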
@ -0,0 +1,9 @@
## Heterozygosity

http://gatkforums.broadinstitute.org/gatk/discussion/8603/heterozygosity

<h3>Heterozygosity in population genetics</h3>
<p>In the context of population genetics, heterozygosity can refer to the fraction of individuals in a given population that are heterozygous at a given locus, or the fraction of loci that are heterozygous in an individual. See the Wikipedia entries on <a href="http://en.wikipedia.org/wiki/Zygosity#Heterozygosity_in_population_genetics">Heterozygosity</a> and <a href="https://en.wikipedia.org/wiki/Coalescent_theory">Coalescent Theory</a>, as well as the book "Population Genetics: A Concise Guide" by John H. Gillespie, for further details on the related theory.</p>
<h3>Heterozygosity in GATK</h3>
<p>In GATK genotyping, we use an "expected heterozygosity" value to compute the prior probability that a locus is non-reference. Given the expected heterozygosity <code>hets</code>, we calculate the probability of N samples being hom-ref at a site as <code>1 - sum_{i=1}^{2N} (hets / i)</code>. The default value provided for humans is <code>hets = 1e-3</code>; a value of 0.001 implies that two randomly chosen chromosomes from the population of organisms would differ from each other at a rate of 1 in 1000 bp. In this context <code>hets</code> is analogous to the parameter <code>theta</code> from population genetics. The <code>hets</code> parameter value can be modified if desired.</p>
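<p>A quick worked example of this prior, assuming the formula above (the sum over i = 1..2N of hets/i; class and method names are illustrative):</p>

```java
// Prior probability that a site is non-reference under the expected-
// heterozygosity model: P(non-ref) = sum_{i=1..2N} hets / i, so
// P(all N samples hom-ref) = 1 - that sum.
public class HetPrior {
    static double probNonRef(double hets, int nSamples) {
        double sum = 0.0;
        for (int i = 1; i <= 2 * nSamples; i++) {
            sum += hets / i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Human default hets = 1e-3, one diploid sample (2N = 2):
        // P(non-ref) = 0.001/1 + 0.001/2 = 0.0015
        System.out.println(1.0 - probNonRef(1e-3, 1));
    }
}
```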
<p>Note that this quantity has nothing to do with the likelihood of any given sample having a heterozygous genotype, which in the GATK is purely determined by the probability of the observed data P(D | AB) under the model that there may be an AB heterozygous genotype. The posterior probability of this AB genotype would use the <code>hets</code> prior, but the GATK only uses this posterior probability in determining the probability that a site is polymorphic. So changing the <code>hets</code> parameter only increases the chance that a site will be called non-reference across all samples, but doesn't actually change the output genotype likelihoods at all, as these aren't <em>posterior</em> probabilities. The one quantity that changes whether the GATK considers the possibility of a heterozygous genotype at all is the <em>ploidy</em>, which describes how many copies of each chromosome each individual in the species carries.</p>
## Hybrid selection
http://gatkforums.broadinstitute.org/gatk/discussion/6331/hybrid-selection
<p>Hybrid selection is a method that enables selection of specific sequences from a pool of genomic DNA for targeted sequencing analyses via pull-down assays. Typical applications include the selection of exome sequences or pathogen-specific sequences in complex biological samples. Hybrid selection involves the use of <strong>baits</strong> to select the desired fragments.</p>

<p>Briefly, baits are RNA (or sometimes DNA) molecules synthesized with biotinylated nucleotides. The biotinylated nucleotides are ligands for streptavidin, enabling RNA:DNA hybrids to be captured in solution. The hybridization targets are sheared genomic DNA fragments, which have been "polished" with synthetic adapters to facilitate PCR cloning downstream. Hybridization of the baits with the denatured targets is followed by selective capture of the RNA:DNA "hybrids" using streptavidin-coated beads via pull-down assays or columns.</p>
<p>Systematic errors, ultimately leading to sequence bias and incorrect variant calls, can arise at several steps. See the GATK dictionary entries <a href="http://gatkforums.broadinstitute.org/gatk/discussion/6333">bait bias</a> and <a href="http://gatkforums.broadinstitute.org/gatk/discussion/6332">pre-adapter artifacts</a> for more details.</p>
<p>Please see the following <a href="http://www.nature.com/nbt/journal/v27/n2/abs/nbt.1523.html">reference</a> for the theory behind this technique.</p>
## Jumping libraries
http://gatkforums.broadinstitute.org/gatk/discussion/6326/jumping-libraries
<p>Jumping libraries are created to bypass regions that are difficult to align or map, such as those containing repetitive DNA sequences. Briefly, the DNA of interest is identified and cut into fragments, either with restriction enzymes or by shearing. The size-selected fragments are ligated to adapters for bead capture and circularized. After bead capture, the DNA is linearized with restriction enzymes and can be sequenced using adapter primers facing in outward [reverse/forward (RF)] directions. These library inserts are considered jumping because the ends originate from distal genomic DNA sequences and are ligated adjacent to one another during circularization. Potential artifacts of this method include small inserts (lacking the linearizing restriction enzyme sequence), which yield inward-facing [forward/reverse (FR)] (non-jumping) read pairs. In addition, chimeras arise when the paired ends fall on different chromosomes, or when the insert size exceeds the maximum of 100 kb or twice the mode of the insert size for outward-facing pairs. For additional information, see <a href="http://www.wikipedia.org/wiki/Jumping_library#Paired-end_sequencing">the Wikipedia article</a>.</p>
## Likelihoods and Probabilities
http://gatkforums.broadinstitute.org/gatk/discussion/7860/likelihoods-and-probabilities
<p>There are several instances in the GATK documentation where you will encounter the terms "likelihood" and "probability", because key tools in the variant discovery workflow rely heavily on Bayesian statistics. For example, the <a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php">HaplotypeCaller</a>, our most prominent germline SNP and indel caller, uses Bayesian statistics to <a href="https://www.broadinstitute.org/gatk/guide/article?id=4442">determine genotypes</a>. </p>
<h4>So what do likelihood and probability mean and how are they related to each other in the Bayesian context?</h4>

<p>In Bayesian statistics (as opposed to <a href="https://xkcd.com/1132/">frequentist statistics</a>), we are typically trying to evaluate the <a href="https://en.wikipedia.org/wiki/Posterior_probability">posterior probability</a> of a hypothesis (H) based on a series of observations (data, D).</p>

<p><strong>Bayes' rule</strong> states that</p>

<p>$${P(H|D)}=\frac{P(H)P(D|H)}{P(D)}$$</p>

<p>where the bit we care about most, <strong>P(D|H)</strong>, is the <strong>probability of observing D given the hypothesis H</strong>. This can also be formulated as <strong>L(H|D)</strong>, i.e. the <strong>likelihood of the hypothesis H given the observation D</strong>:</p>

<p>$$P(D|H)=L(H|D)$$</p>

<p>We use the term <strong>likelihood</strong> instead of <strong>probability</strong> to describe the term on the right because we cannot calculate a meaningful probability distribution on a hypothesis, which by definition is binary (it will either be true or false) -- but we <em>can</em> determine the likelihood that a hypothesis is true or false given a set of observations. For a more detailed explanation of these concepts, please see the following lesson (<a href="http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading11.pdf">http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading11.pdf</a>).</p>
<p>Now you may wonder, what about the posterior probability P(H|D) that we eventually calculate through Bayes' rule? Isn't that a "probability of a hypothesis"? Well yes; in Bayesian statistics, we <em>can</em> calculate a <em>posterior</em> probability distribution on a hypothesis, because its probability distribution is <em>relative</em> to all of the other competing hypotheses (<a href="http://www.smbc-comics.com/index.php?id=4127">http://www.smbc-comics.com/index.php?id=4127</a>). Tadaa. </p>
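To make this concrete, here is a small sketch in plain Python that applies Bayes' rule across a set of competing hypotheses. The genotype hypotheses and all numbers are invented for illustration; these are not GATK's actual priors or likelihoods.

```python
# Competing genotype hypotheses with priors P(H) and likelihoods P(D|H).
# All numbers are made up for illustration.
priors = {"hom-ref": 0.998, "het": 0.001, "hom-var": 0.001}
likelihoods = {"hom-ref": 1e-6, "het": 0.02, "hom-var": 1e-4}  # P(D|H) = L(H|D)

# P(D) is the sum over all hypotheses; dividing by it is what makes the
# posteriors a proper probability distribution (they sum to 1) relative
# to the competing hypotheses.
p_data = sum(priors[h] * likelihoods[h] for h in priors)
posteriors = {h: priors[h] * likelihoods[h] / p_data for h in priors}

print(posteriors)  # the "het" hypothesis dominates despite its small prior
```

Note that the likelihoods alone need not sum to one; only the posteriors, normalized over the full set of hypotheses, do.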
<p>See <a href="https://www.broadinstitute.org/gatk/guide/article?id=4442">this HaplotypeCaller doc article</a> for a worked out explanation of how we calculate and use genotype likelihoods in germline variant calling.</p>

<p>So always remember this, if nothing else: the terms likelihood and probability are <em>not</em> interchangeable in the Bayesian context, even though they are often used interchangeably in common English.</p>

<p>A special thanks to Jon M. Bloom PhD (MIT) for his assistance in the preparation of this article.</p>
## Mate unmapped records
http://gatkforums.broadinstitute.org/gatk/discussion/6976/mate-unmapped-records
<h3>Mate unmapped records are identifiable using the <code>8</code> SAM flag.</h3>
<p>It is possible for a BAM to have multiple types of mate-unmapped records. These mate-unmapped records are distinct from mate-missing records, where the mate is altogether absent from the BAM. Of the three types of mate-unmapped records listed below, we describe only the first two in this dictionary entry.</p>
<ol>
<li>Singly mapping pair.</li>
<li>A secondary/supplementary record is flagged as mate-unmapped but the mate is in fact mapped.</li>
<li>Both reads in a pair are unmapped.</li>
</ol>
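Since mate-unmapped status is bit 0x8 of the SAM FLAG field, it can be checked with a bitwise AND. A minimal sketch in plain Python (the example flag values are constructed for illustration):

```python
MATE_UNMAPPED = 0x8   # SAM flag bit: next segment in the template unmapped
READ_UNMAPPED = 0x4   # SAM flag bit: this segment unmapped

def mate_unmapped(flag):
    """True if the record's mate is flagged as unmapped."""
    return bool(flag & MATE_UNMAPPED)

# 73 = 1 (paired) + 8 (mate unmapped) + 64 (first in pair):
# the mapped read of a singly mapping pair.
print(mate_unmapped(73))   # True
# 99 = 1 (paired) + 2 (proper pair) + 32 (mate reverse) + 64 (first in pair).
print(mate_unmapped(99))   # False
```

The same bitwise test is what `samtools view -f 8` performs when filtering for mate-unmapped records.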
<hr />

<h3>(1) Singly mapping pair</h3>
<p>A mapped read's unmapped mate is marked in its SAM record in a way that allows the pair to sort together. If you look at one of these unmapped records, alignment columns 3 and 4 (RNAME and POS) indicate that it aligns, in fact identically to the mapped mate. What is distinct is the asterisk <code>*</code> in the CIGAR field (column 6), which indicates the record is unmapped. This allows us to (i) identify the unmapped read as having passed through the aligner, and (ii) keep the pairs together in file manipulations that use either coordinate-sorted or queryname-sorted BAMs. For example, when the reads in a genomic interval are extracted to create a new BAM, the pair remains together. For file manipulations that depend on such sorting, we can deduce that these mate-unmapped records are immune to becoming missing mates.</p>
<h3>(2) Mate-unmapped record whose mate is in fact mapped</h3>
<p>The second type of mate-unmapped record applies to multimapping read sets processed through MergeBamAlignment, as in <a href="http://gatkforums.broadinstitute.org/gatk/discussion/6483/how-to-map-and-clean-up-short-read-sequence-data-efficiently#latest">Tutorial#6483</a>. Besides reassigning primary and secondary flags within multimapping sets according to a user-specified strategy, MergeBamAlignment marks secondary records with the mate-unmapped flag. Specifically, after BWA-MEM alignment, the records in a multimapping set are each <em>mate-mapped</em>. After going through MergeBamAlignment, the secondary records become <em>mate-unmapped</em>, while the primary alignments remain <em>mate-mapped</em>. This effectively minimizes the association between secondary records and their previous mates.</p>
<hr />

<h3>How do tools treat them differently?</h3>
<p>GATK tools typically exclude secondary/supplementary records from consideration. However, tools will process the mapped read of a singly mapping pair. For example, MarkDuplicates skips secondary records but does mark duplicate singly mapping reads.</p>
## OxoG oxidative artifacts
http://gatkforums.broadinstitute.org/gatk/discussion/6328/oxog-oxidative-artifacts
<p>Oxidation of guanine to 8-oxoguanine is one of the most common <strong>pre-adapter artifacts</strong> associated with genomic library preparation, arising from a combination of heat, shearing, and metal contaminants in a sample (doi: 10.1093/nar/gks1443). The 8-oxoguanine base can pair with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification.</p>

<p>This occurs when a G on the template strand is oxidized, giving it an affinity for binding to A rather than the usual C. Thus, PCR will introduce apparent G>T substitutions in read 1 and C>A substitutions in read 2. In the resulting alignments, a given G>T or C>A observation could be:</p>
<ol>
<li>a true mutation</li>
<li>an 8-oxoguanine artifact</li>
<li>some other kind of artifact</li>
</ol>
<p>The (C→A)/(G→T) variants tend to occur in specific sequence contexts, e.g. CCG→CAG (doi:10.1093/nar/gks1443). Although they occur at relatively low frequencies, these artifacts can have profound impacts on variant calling fidelity (doi:10.1093/nar/gks1443).</p>
## PF reads / Illumina chastity filter
http://gatkforums.broadinstitute.org/gatk/discussion/6329/pf-reads-illumina-chastity-filter
<p>Illumina sequencers perform an internal quality filtering procedure called <strong>chastity filter</strong>, and reads that pass this filter are called <strong>PF</strong> for <strong>pass-filter</strong>. According to Illumina, <strong>chastity</strong> is defined as the ratio of the brightest base intensity divided by the sum of the brightest and second brightest base intensities. Clusters of reads pass the filter if no more than 1 base call has a chastity value below 0.6 in the first 25 cycles. This filtration process removes the least reliable clusters from the image analysis results.</p>
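The chastity calculation described above can be sketched directly. A toy example in plain Python (the intensity values and function names are invented for illustration; this is not Illumina's implementation):

```python
def chastity(intensities):
    """Chastity = brightest intensity / (brightest + second brightest)."""
    top_two = sorted(intensities, reverse=True)[:2]
    return top_two[0] / (top_two[0] + top_two[1])

def passes_filter(per_cycle_intensities, threshold=0.6, max_failures=1, window=25):
    """A cluster passes if no more than one base call in the first 25 cycles
    has chastity below 0.6."""
    failures = sum(1 for cycle in per_cycle_intensities[:window]
                   if chastity(cycle) < threshold)
    return failures <= max_failures

# One cycle's intensities for the four channels A, C, G, T (hypothetical):
print(chastity([900, 150, 80, 40]))  # 900/1050, about 0.857: a "chaste" call
```

A clean, single-template cluster gives one dominant channel per cycle (high chastity); a mixed cluster splits its signal across channels and fails the test.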
<p>For additional information on chastity filters, please see:</p>

<ul>
<li>Illumina, Inc. (2015). Calculating Percent Passing Filter for Patterned and Non-Patterned Flow Cells: A comparison of methods for calculating percent passing filter on Illumina flow cells</li>
<li>Illumina, Inc. (2014). HiSeq X System user guide</li>
</ul>
<p>Both articles can be found at <a href="http://www.illumina.com">http://www.illumina.com</a></p>
## Paired-end / mate-pair
http://gatkforums.broadinstitute.org/gatk/discussion/6327/paired-end-mate-pair
<p>In paired-end sequencing, the library preparation yields a set of fragments, and the machine sequences each fragment from both ends. For example, given a 300bp contiguous fragment, the machine might sequence bases 1-75 (forward direction) and bases 225-300 (reverse direction) of the fragment.</p>

<p>In mate-pair sequencing, the library preparation yields fragments whose two ends originate from locations that are distal to each other in the genome, and the resulting read pairs are in the opposite orientation to that of paired-end read pairs.</p>
<p>The three read orientation categories are forward reverse (FR), reverse forward (RF), and reverse-reverse/forward-forward (TANDEM). In general, paired-end reads tend to be in an FR orientation, have relatively small inserts (~300 - 500 bp), and are particularly useful for the sequencing of fragments that contain short repeat regions. Mate-pair fragments are generally in an RF conformation, contain larger inserts (~3 kb), and enable sequence coverage of genomic regions containing large structural rearrangements. Tandem reads can result from inversions and rearrangements during library preparation.</p>
<p>Here is a more illustrative example:</p>

<p><strong>FR:</strong> 5' --F--> <--R-- 5' (in slang called "innie" because they point inward)</p>

<p><strong>RF:</strong> <--R-- 5' 5' --F--> (in slang called "outie" because they point outward)</p>

<p><strong>TANDEM:</strong> 5' --F--> 5' --F--> or <--R-- 5' <--R-- 5'</p>

<p>The figure below illustrates this graphically along with the SAM flags that correspond to the FR and RF configurations.</p>
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/e3/c9e87118d6e8c4b2a4e014d97a1b22.png" />
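The FR/RF/TANDEM categories can be recovered from the SAM FLAG strand bits (0x10 read reverse, 0x20 mate reverse) and the pair's positions. A rough sketch in plain Python, assuming both reads map to the same contig (not a GATK utility):

```python
READ_REVERSE = 0x10  # SAM flag bit: this read on the reverse strand
MATE_REVERSE = 0x20  # SAM flag bit: the mate on the reverse strand

def orientation(flag, pos, mate_pos):
    """Classify a pair as FR, RF, or TANDEM from one record's flag and positions."""
    read_rev = bool(flag & READ_REVERSE)
    mate_rev = bool(flag & MATE_REVERSE)
    if read_rev == mate_rev:
        return "TANDEM"  # both forward or both reverse
    # FR ("innie") means the leftmost read is on the forward strand.
    leftmost_is_forward = (not read_rev) if pos <= mate_pos else (not mate_rev)
    return "FR" if leftmost_is_forward else "RF"

# A paired, forward read at position 100 with a reverse mate at 400: an "innie".
print(orientation(flag=0x1 | 0x20, pos=100, mate_pos=400))  # FR
```

Either record of the pair gives the same classification, since the flag carries both strand bits.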
<p>For detailed explanations of library construction strategies (for Illumina sequencers) and how read orientations are determined, please see:</p>

<ul>
<li><a href="http://www.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.html">Illumina paired-end sequencing documentation (webpage)</a></li>
<li><a href="http://www.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf">Illumina Nextera mate-pair processing documentation (pdf)</a></li>
</ul>
## Parallelism
http://gatkforums.broadinstitute.org/gatk/discussion/1988/parallelism
<p><em>This document explains the concepts involved and how they are applied within the GATK (and Crom+WDL or Queue where applicable). For specific configuration recommendations, see the companion document on <a href="http://www.broadinstitute.org/gatk/guide/article?id=1975">parallelizing GATK tools</a>.</em></p>
<hr />

<h2>1. The concept of parallelism</h2>
<p>Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (<em>i.e.</em> waiting for each operation to finish before starting the next one).</p>
<p>Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.</p>

<p>This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.</p>

<p>The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see <a href="http://www.broadinstitute.org/gatk/guide/article?id=1754">doc</a> for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.</p>
<h3>A quick warning about tradeoffs</h3>
<p>Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.</p>

<p>Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.</p>

<h3>Parallel computing in practice (sort of)</h3>
<p>OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?</p>
<p>Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.</p>

<p>Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:</p>

<ul>
<li><code>open the file, count the number of lines in the file, tell us the number, close the file</code></li>
</ul>

<p><em>Note that <code>tell us the number</code> can mean writing it to the console, or storing it somewhere for use later on.</em></p>

<p>Now let's say we want to know the number of words on each line. The set of instructions would be:</p>
<ul>
<li><code>open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number</code></li>
</ul>

<p>And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.</p>
<p>So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:</p>
<ul>
<li>
<p><code>open the file, index the lines</code></p>
</li>
<li><code>read the first line, count the number of words, tell us the number</code></li>
<li><code>read the second line, count the number of words, tell us the number</code></li>
<li><code>read the third line, count the number of words, tell us the number</code></li>
<li>
<p><code>[repeat for all lines]</code></p>
</li>
<li><code>collect final results and close the file</code></li>
</ul>

<p>Here, the <code>read the Nth line</code> steps can be performed in parallel, because they are all independent operations.</p>
<p>You'll notice that we added a step, <code>index the lines</code>. That's a little bit of preliminary work that allows us to perform the <code>read the Nth line</code> steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.</p>

<p>Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.</p>
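The word-count example above maps naturally onto a thread pool. A minimal sketch in plain Python (the input lines are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(line):
    # The independent "middle" step: one line in, one word count out.
    return len(line.split())

def parallel_word_counts(lines, workers=4):
    # Scatter: each line is handled by a thread from the pool;
    # gather: map collects results back in the original line order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_words, lines))

print(parallel_word_counts(["open the file", "count the words", "tell us the number"]))
# [3, 3, 4]
```

The pool plays the role of the "middle" subsets: each line is processed independently, and the results are gathered at the end in their original order.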
<hr />

<h2>2. Parallelizing the GATK</h2>

<p>There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand the different <em>levels of computing</em> that are involved.</p>

<h3>A quick word about levels of computing</h3>

<p>By <em>levels of computing</em>, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster or cloud.</p>
<ul>
<li>
<p><strong>Core:</strong> the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Greek terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.</p>
</li>
<li>
<p><strong>Machine:</strong> the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say <strong>machine</strong>. Except if the machine is part of a cluster, in which case it's called a <strong>node</strong>.</p>
</li>
<li><strong>Cluster or cloud:</strong> the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called <strong>nodes</strong>) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a <strong>server farm</strong> or a <strong>load-sharing facility</strong>.</li>
</ul>
<p>Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of <strong>multi-threading</strong> at the core and machine levels, and <strong>scatter-gather</strong> at the cluster level.</p>
<h3>Multi-threading</h3>

<p>In computing, a <strong>thread of execution</strong> is a set of instructions that the program issues to the processor to get work done. In <strong>single-threading mode</strong>, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In <strong>multi-threading mode</strong>, the program may send several threads to the processor at the same time.</p>

<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/2e/0f426b616b548a3f11b6928a20a324.png" />
<p>Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results. </p>
<p>If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.</p>

<p><em>Hey, if you have a better example, let us know in the forum and we'll use that instead.</em></p>

<p>Alright, now that you understand the idea of multithreading, let's get practical: how do we get the GATK to use multi-threading?</p>
<p>There are two options for multi-threading with the GATK, controlled by the arguments <code>-nt</code> and <code>-nct</code>, respectively. They can be combined, since they act at different levels of computing:</p>
<ul>
<li>
<p><code>-nt</code> / <code>--num_threads</code> controls the number of <strong>data threads</strong> sent to the processor (acting at the <strong>machine</strong> level)</p>
</li>
<li><code>-nct</code> / <code>--num_cpu_threads_per_data_thread</code> controls the number of <strong>CPU threads</strong> allocated to each data thread (acting at the <strong>core</strong> level).</li>
</ul>

<p>Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports <code>-nt</code> but not <code>-nct</code>, while IndelRealigner does not support either of these options.</p>
<p>In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a <a href="http://gatkforums.broadinstitute.org/discussion/1975/recommendations-for-parallelizing-gatk-tools">companion document</a> on parallelizing the GATK.</p>

<h3>Scatter-gather</h3>

<p>If you Google it, you'll find that the term <strong>scatter-gather</strong> can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.</p>
<p>So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?</p>

<p>As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves separate programs. There are two pipelining solutions that we support for scatter-gathering GATK jobs, Crom+WDL and Queue. They are quite different, but both are able to generate separate GATK jobs (each with its own command-line) to achieve the instructions given in a script.</p>

<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/14/96d5bba42167a599a60ed7ac58602f.png" />
<p>At the simplest level, the script can involve a single GATK tool*. In that case, the execution engine (Cromwell or Queue) will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, the engine will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).</p>
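The scatter step, temporary files, and gather step can be sketched in a few lines of plain Python (a toy stand-in for an execution engine; `process_interval` represents running a tool on one portion of the data, and the interval names are invented):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def process_interval(interval):
    # Stand-in for one scattered job: "run the tool" on one portion of
    # the data and write its result to a temporary file.
    fd, path = tempfile.mkstemp(suffix=".part")
    with os.fdopen(fd, "w") as out:
        out.write(f"results for {interval}\n")
    return path

intervals = ["chr1", "chr2", "chr3"]

# Scatter: one independent job per interval.
with ThreadPoolExecutor() as pool:
    part_files = list(pool.map(process_interval, intervals))

# Gather: collate the temporary outputs into the final output file,
# in the original interval order, then clean up the parts.
with open("final_output.txt", "w") as final:
    for path in part_files:
        with open(path) as part:
            final.write(part.read())
        os.remove(path)
```

The real engines do far more (job submission, retries, dependency tracking), but the scatter-into-temp-files / gather-into-final-output shape is the same.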
<p><em>Note that Queue and Cromwell have additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1306">Queue documentation</a>. To learn more about Crom+WDL, see the <a href="https://software.broadinstitute.org/wdl/">WDL website</a>.</em></p>

<h3>Compare and combine</h3>
<p>So you see, scatter-gather is a very different process from multi-threading because the parallelization happens <strong>outside</strong> of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But an execution engine like Queue or Cromwell can dispatch scattered GATK jobs to different machines in a computing cluster or on a cloud platform by interfacing with the appropriate job management software.</p>
<p>That being said, multithreading has the great advantage that the cores within a machine all have access to shared memory with very high bandwidth capacity. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network costs.</p>
<p>The good news is that you can combine scatter-gather and multithreading: use Queue or Cromwell to scatter GATK jobs to different nodes on your cluster or cloud platform, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.</p>
<p>Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.</p>
## Pedigree / PED files
http://gatkforums.broadinstitute.org/gatk/discussion/7696/pedigree-ped-files
<p>A pedigree is a structured description of the familial relationships between samples. </p>
<p>Some GATK tools are capable of incorporating pedigree information in the analysis they perform if provided in the form of a PED file through the <code>--pedigree</code> (or <code>-ped</code>) argument.</p>

<hr />

<h3>PED file format</h3>

<p>PED files are tabular text files describing meta-data about the samples. See <a href="http://www.broadinstitute.org/mpg/tagger/faq.html">http://www.broadinstitute.org/mpg/tagger/faq.html</a> and <a href="http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped">http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped</a> for more information.</p>

<p>The PED file is a white-space (space or tab) delimited file in which the first six columns are mandatory:</p>
<ul>
<li>Family ID</li>
<li>Individual ID</li>
<li>Paternal ID</li>
<li>Maternal ID</li>
<li>Sex (1=male; 2=female; other=unknown)</li>
<li>Phenotype</li>
</ul>
<p>The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a person. If an individual's sex is unknown, then any character other than 1 or 2 can be used in the fifth column.</p>
<p>A PED file must have 1 and only 1 phenotype in the sixth column. The phenotype can be either a quantitative trait or an "affected status" column: GATK will automatically detect which type (i.e. based on whether a value other than 0, 1, 2 or the missing genotype code is observed).</p>

<p>Affected status should be coded as follows:</p>

<ul>
<li>-9 missing</li>
<li>0 missing</li>
<li>1 unaffected</li>
<li>2 affected</li>
</ul>
<p>If any value outside of -9,0,1,2 is detected, then the samples are assumed to have phenotype values, interpreted as string phenotype values.</p>
|
||||
<p>Note that genotypes (column 7 onwards) cannot be specified to the GATK.</p>
|
||||
<p>You can add a comment to a PED or MAP file by starting the line with a # character. The rest of that line will be ignored, so make sure none of the IDs start with this character.</p>
|
||||
<p>Each -ped argument can be tagged with NO_FAMILY_ID, NO_PARENTS, NO_SEX, NO_PHENOTYPE to tell the GATK PED parser that the corresponding fields are missing from the ped file.</p>
|
||||
<h4>Example</h4>
|
||||
<p>Here are two individuals (one row = one person):</p>
|
||||
<pre>
|
||||
FAM001 1 0 0 1 2
|
||||
FAM001 2 0 0 1 2
|
||||
</pre>
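<p>Since the mandatory columns are positional and whitespace-delimited, a file like the one above can be read with a few lines of code. This is an illustrative sketch, not GATK code; the <code>parse_ped</code> helper and its field names are our own:</p>

```python
def parse_ped(lines):
    """Parse the six mandatory PED columns from an iterable of lines.

    Lines starting with '#' are comments and are skipped, per the format.
    """
    people = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        # Only the first six columns are mandatory; ignore the rest.
        fam, ind, dad, mom, sex, pheno = line.split()[:6]
        people.append({
            "family": fam,
            "individual": ind,
            "father": dad,
            "mother": mom,
            # 1 = male, 2 = female, anything else = unknown
            "sex": {"1": "male", "2": "female"}.get(sex, "unknown"),
            "phenotype": pheno,
        })
    return people

records = parse_ped(["FAM001 1 0 0 1 2", "FAM001 2 0 0 1 2"])
```

<p>For the two-individual example above, this yields two records, each a male (sex code 1) with affected status 2.</p>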
@ -0,0 +1,69 @@
## Phred-scaled Quality Scores

http://gatkforums.broadinstitute.org/gatk/discussion/4260/phred-scaled-quality-scores

<p>You may have noticed that many of the scores output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see <a href="http://en.wikipedia.org/wiki/Phred_quality_score">this Wikipedia article</a> for more historical background). Phred scores are now widely used to represent probabilities and confidence scores in other contexts of genome science.</p>
<h3>Phred scale in context</h3>
<p>In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer. </p>
<p>In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample. </p>
<h3>Phred scale in practice</h3>
<p>In today's sequencing output, by convention, most usable Phred-scaled base quality scores range from 2 to 40, with some variation in the range depending on the origin of the sequence data (see the <a href="https://en.wikipedia.org/wiki/FASTQ_format#Encoding">FASTQ format</a> documentation for details). However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A <strong>higher score</strong> indicates a higher probability that a particular decision is <strong>correct</strong>, while conversely, a <strong>lower score</strong> indicates a higher probability that the decision is <strong>incorrect</strong>. </p>
<p>The Phred quality score (Q) is logarithmically related to the error probability (E).</p>
<p>$$ Q = -10 \log_{10} E $$</p>
<p>So we can interpret this score as an estimate of <strong>error</strong>, where the error is <em>e.g.</em> the probability that the base is called <strong>incorrectly</strong> by the sequencer, but we can also interpret it as an estimate of <strong>accuracy</strong>, where the accuracy is <em>e.g.</em> the probability that the base was identified <strong>correctly</strong> by the sequencer. Depending on how we decide to express it, we can make the following calculations:</p>
<p>If we want the probability of error (E), we take:</p>
<p>$$ E = 10 ^{-\left(\frac{Q}{10}\right)} $$ </p>
<p>And conversely, if we want to express this as the estimate of accuracy (A), we simply take </p>
<p>$$
\begin{eqnarray}
A &=& 1 - E \nonumber \\
&=& 1 - 10 ^{-\left(\frac{Q}{10}\right)} \nonumber
\end{eqnarray}
$$</p>
<p>Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.</p>
<p>For many purposes, a Phred score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error. </p>
<table class="table table-striped">
<thead>
<tr>
<th>Phred Quality Score</th>
<th>Error</th>
<th>Accuracy (1 - Error)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>1/10 = 10%</td>
<td>90%</td>
</tr>
<tr>
<td>20</td>
<td>1/100 = 1%</td>
<td>99%</td>
</tr>
<tr>
<td>30</td>
<td>1/1000 = 0.1%</td>
<td>99.9%</td>
</tr>
<tr>
<td>40</td>
<td>1/10000 = 0.01%</td>
<td>99.99%</td>
</tr>
<tr>
<td>50</td>
<td>1/100000 = 0.001%</td>
<td>99.999%</td>
</tr>
<tr>
<td>60</td>
<td>1/1000000 = 0.0001%</td>
<td>99.9999%</td>
</tr>
</tbody>
</table>
<p>And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities. </p>
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/78/663145e9df43db3efe5df4d0b88cf4.png" />
<p>The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically. </p>
<p>Note: you can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is very steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both curves level out. This is why Q20 is a good cutoff score for many basic purposes.</p>
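<p>The conversions above are easy to check numerically. A minimal sketch (the function names are ours, purely illustrative) that reproduces the rows of the table:</p>

```python
import math

def phred_to_error(q):
    """E = 10^(-Q/10): the probability that the call is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(e):
    """Q = -10 * log10(E): the Phred score for a given error probability."""
    return -10 * math.log10(e)

def accuracy(q):
    """A = 1 - E: the probability that the call is correct."""
    return 1 - phred_to_error(q)

# e.g. Q20 corresponds to a 1% error rate and 99% accuracy,
# matching the Q20 row of the table above.
```
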
@ -0,0 +1,6 @@
## Pre-adapter artifacts (in hybrid selection)

http://gatkforums.broadinstitute.org/gatk/discussion/6332/pre-adapter-artifacts-in-hybrid-selection

<p>Various sources of error affect the <strong>hybrid selection</strong> (HS) process. Pre-adapter artifacts are those that arise in the preparation step(s) prior to the ligation of the PCR adapters. Because these artifacts occur on the original template strand, before the addition of adapters, they correlate with read number and orientation in a specific way.</p>
<p>A classic example is the shearing of target genomic DNA, which leads to oxidation of guanine at position 8, producing <strong>8-oxoguanine</strong> (8-OxoG, OxoG) (doi:10.1093/nar/gks1443) (see also the OxoG entry in this dictionary). </p>
@ -0,0 +1,65 @@
## Read groups

http://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

<p>There is no formal definition of what a read group is, but in practice this term refers to a set of reads that were generated from a single run of a sequencing instrument. </p>
<p>In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, each subset of reads originating from a separate library run on that lane will constitute a separate read group.</p>
<p>Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the <a href="http://samtools.github.io/hts-specs/">official SAM specification</a>. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See <a href="http://www.broadinstitute.org/gatk/guide/article?id=59">this article</a> for common problems related to read groups.</p>
<p>To see the read group information for a BAM file, use the following command. </p>
<pre><code class="pre_md">samtools view -H sample.bam | grep '@RG'</code class="pre_md"></pre>
<p>This prints the lines starting with <code>@RG</code> within the header, e.g. as shown in the example below. </p>
<pre><code class="pre_md">@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI</code class="pre_md"></pre>
<hr />
<h3>Meaning of the read group fields required by GATK</h3>
<ul>
<li>
<p><code>ID</code> = <strong>Read group identifier</strong>
This tag identifies which read group each read belongs to, so each read group's <code>ID</code> must be unique. It is referenced both in the read group definition line in the file header (starting with <code>@RG</code>) and in the <code>RG:Z</code> tag for each read record. Note that some Picard tools have the ability to modify <code>ID</code>s when merging SAM files in order to avoid collisions. In Illumina data, read group <code>ID</code>s are composed using the flowcell name and lane number, making them a globally unique identifier across all sequencing data in the world.
<em>Use for BQSR:</em> <code>ID</code> is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since all reads within a read group are assumed to share the same error model. </p>
</li>
<li>
<p><code>PU</code> = <strong>Platform Unit</strong>
The <code>PU</code> holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. The <code>PU</code> is not required by GATK, but it takes precedence over <code>ID</code> for base recalibration if it is present. In the example shown earlier, two read group fields, <code>ID</code> and <code>PU</code>, appropriately differentiate flow cell lane, marked by <code>.2</code>, a factor that contributes to batch effects. </p>
</li>
<li>
<p><code>SM</code> = <strong>Sample</strong>
The name of the sample sequenced in this read group. GATK tools treat all read groups with the same <code>SM</code> value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the <code>SM</code> field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name. </p>
</li>
<li>
<p><code>PL</code> = <strong>Platform/technology used to produce the read</strong>
This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. </p>
</li>
<li><code>LB</code> = <strong>DNA preparation library identifier</strong>
MarkDuplicates uses the <code>LB</code> field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes. </li>
</ul>
<p>If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's <a href="http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups">AddOrReplaceReadGroups</a> to add or appropriately rename the read group fields as outlined <a href="http://gatkforums.broadinstitute.org/discussion/2909/">here</a>.</p>
<hr />
<h3>Deriving <code>ID</code> and <code>PU</code> fields from read names</h3>
<p>Here we illustrate how to derive both <code>ID</code> and <code>PU</code> fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portion of each read name, which comes after the flow cell lane and is separated by colons, consists of the tile number, the x-coordinate of the cluster and the y-coordinate of the cluster. </p>
<pre><code class="pre_md">H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288</code class="pre_md"></pre>
<p>Breaking down the common portion of the query names:</p>
<pre><code class="pre_md">H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane</code class="pre_md"></pre>
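<p>Following this breakdown, deriving the two fields is a simple string operation. The sketch below is ours (not a GATK or Picard utility) and assumes the Broad convention shown, where the first five characters of the read name prefix name the flow cell and the remainder is the run barcode; the sample barcode that the full <code>PU</code> format calls for is omitted, as it is in the header example earlier in this entry:</p>

```python
def derive_rg_fields(read_name, flowcell_name_len=5):
    """Derive @RG ID and PU from a read name formed as
    {FLOWCELL}{BARCODE}:{LANE}:{TILE}:{X}:{Y} (Broad convention).

    The split point between flow cell name and barcode (here 5
    characters) is convention-specific, not a SAM requirement.
    """
    prefix, lane = read_name.split(":")[:2]
    flowcell = prefix[:flowcell_name_len]  # e.g. "H0164"
    rg_id = flowcell + "." + lane          # e.g. "H0164.2"
    rg_pu = prefix + "." + lane            # e.g. "H0164ALXX140820.2"
    return rg_id, rg_pu
```

<p>Both example read names above come from the same flow cell and lane, so they yield the same <code>ID</code> and <code>PU</code>.</p>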
<hr />
<h3>Multi-sample and multiplexed example</h3>
<p>Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following <code>@RG</code> fields in the header:</p>
<pre><code class="pre_md">Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400

Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400

Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400</code class="pre_md"></pre>
<p>Note the hierarchical relationship of read groups (unique for each lane) to libraries (each sequenced on two lanes) and samples (each spread across four lanes, two lanes per library).</p>
@ -0,0 +1,79 @@
## Reference Genome Components

http://gatkforums.broadinstitute.org/gatk/discussion/7857/reference-genome-components

<h4>Document is in <code>BETA</code>. It may be incomplete and/or inaccurate. Post suggestions to the <em>Comments</em> section.</h4>
<hr />
<p><a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/"><img src="https://us.v-cdn.net/5019796/uploads/FileUpload/a3/19c0c82fd10f748e04201847c89d70.png" align="right" height="210" style="margin:0px 0px 5px 10px"/></a> This document defines several components of a reference genome. We use the human GRCh38/hg38 assembly to illustrate.</p>
<p>GRCh38/hg38 is the assembly of the human genome released in December 2013 that uses alternate or <strong>ALT</strong> contigs to represent common complex variation, including <a href="https://en.wikipedia.org/wiki/Human_leukocyte_antigen">HLA</a> loci. Alternate contigs are also present in past assemblies, but not to the extent we see with GRCh38. Many of the improvements in GRCh38 are the result of other genome sequencing and analysis projects, including the <a href="http://www.1000genomes.org/">1000 Genomes Project</a>. </p>
<p>The ideogram is from the <em>Genome Reference Consortium</em> website and showcases GRCh38.p7. The zoomed region illustrates how the regions in blue are full of Ns. </p>
<p><strong>Analysis set</strong> reference genomes have special features to accommodate sequence read alignment. This type of genome reference can differ from the reference you use to browse the genome.</p>
<ul>
<li>For example, the <a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/">GRCh38 analysis set</a> <strong>hard-masks</strong>, i.e. replaces with Ns, a proportion of homologous centromeric and genomic <a href="https://en.wikipedia.org/wiki/Satellite_DNA">repeat arrays</a> (on chromosomes 5, 14, 19, 21, & 22) and two PAR (pseudoautosomal) regions on chromosome Y. Confirm the set you are using by viewing a PAR region of the Y chromosome on IGV as shown in the figure below. The chrY locations of PAR1 and PAR2 on GRCh38 are chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415.
<a href="https://us.v-cdn.net/5019796/uploads/FileUpload/83/c5938ded241dd754b8e8c148467338.png"><img src="https://us.v-cdn.net/5019796/uploads/FileUpload/83/c5938ded241dd754b8e8c148467338.png" align="" height="150" style=""/></a>
The sequence in the reference set is a mix of uppercase and lowercase letters. The lowercase letters represent <strong>soft-masked</strong> sequence corresponding to repeats from <a href="http://www.repeatmasker.org/">RepeatMasker</a> and <a href="https://tandem.bu.edu/trf/trf.html">Tandem Repeats Finder</a>. </li>
<li>The GRCh38 analysis sets also include a contig to siphon off reads corresponding to the Epstein-Barr virus sequence, as well as <strong>decoy</strong> contigs. The EBV contig can help correct for artifacts stemming from immortalization of human blood lymphocytes with <a href="https://en.wikipedia.org/wiki/Epstein%E2%80%93Barr_virus#Transformation_of_B-lymphocytes">EBV transformation</a>, and can capture endogenous EBV sequence, as <a href="http://gbe.oxfordjournals.org/content/6/4/846.full">EBV naturally infects B cells</a> in ~90% of the world population. Heng Li provides the decoy contigs.</li>
</ul>
<hr />
<h2>Nomenclature: words to describe components of reference genomes</h2>
<ul>
<li>
<p>A <strong>contig</strong> is a contiguous sequence without gaps.</p>
</li>
<li>
<p><strong>Alternate contigs</strong>, <strong>alternate scaffolds</strong> or <strong>alternate loci</strong> allow for representation of diverging haplotypes. These regions are too complex for a single representation. Identify ALT contigs by their <code>_alt</code> suffix.</p>
<p>The GRCh38 ALT contigs total 109Mb in length and span 60Mb of the primary assembly. Alternate contig sequences range from novel, to highly diverged, to nearly identical to the corresponding primary assembly sequence. Sequences that are highly diverged from the primary assembly contribute only a few million bases; most subsequences of ALT contigs are fairly similar to the primary assembly. This means that if we align sequence reads to GRCh38+ALT blindly, we obtain many multi-mapping reads with zero mapping quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.</p>
</li>
<li>
<p><strong>Primary assembly</strong> refers to the collection of (i) assembled chromosomes, (ii) unlocalized sequences and (iii) unplaced sequences. It represents a non-redundant haploid genome.</p>
<p>(i) <strong>Assembled chromosomes</strong> for hg38 are chromosomes 1–22 (<code>chr1</code>–<code>chr22</code>), X (<code>chrX</code>), Y (<code>chrY</code>) and Mitochondrial (<code>chrM</code>).
(ii) <strong>Unlocalized</strong> sequences are on a specific chromosome but have unknown order or orientation. Identify them by the <code>_random</code> suffix.
(iii) <strong>Unplaced</strong> sequences are on an unknown chromosome. Identify them by the <code>chrU_</code> prefix.</p>
</li>
<li>
<p><strong>PAR</strong> stands for <a href="https://en.wikipedia.org/wiki/Pseudoautosomal_region">pseudoautosomal region</a>. PAR regions in mammalian X and Y chromosomes allow for recombination between the sex chromosomes. Because the PAR sequences together create a diploid or <em>pseudo-autosomal</em> sequence region, the X and Y chromosome sequences are intentionally identical in the genome assembly. <em>Analysis set</em> genomes further hard-mask two of the Y chromosome PAR regions so as to allow mapping of reads solely to the X chromosome PAR regions. </p>
</li>
<li>
<p>Different <strong>assemblies</strong> shift coordinates for loci and are released infrequently. Hg19 and hg38 represent two different major assemblies. Comparing data from different assemblies requires lift-over tools that adjust genomic coordinates to match loci, at times imperfectly. In the special case of hg19 and GRCh37, the primary assembly coordinates of loci are identical but patch updates differ. Also, the naming conventions of the references differ, e.g. the use of chr1 versus 1 to indicate chromosome 1, such that these also require lift-over to compare data. GRCh38/hg38 unifies the assemblies and the naming conventions.</p>
</li>
<li>
<p><strong>Patches</strong> are regional fixes that are released periodically for a given assembly. GRCh38.p7 indicates the seventh patched minor release of GRCh38. <a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/patches.shtml">This NCBI page</a> explains patches in more detail. Patches add information to the assembly without disrupting the chromosome coordinates; they improve representation without affecting coordinate stability. The two types of patches, fix and novel, represent different types of sequence.</p>
<p>(i) <strong>Fix patches</strong> represent sequences that will replace primary assembly sequence in the next major assembly release. When interpreting data, fix patches should take precedence over the chromosomes.
(ii) <strong>Novel patches</strong> represent alternate loci. When interpreting data, treat novel patches as population sequence variants.</p>
</li>
</ul>
<hr />
<h2>The GATK perspective on reference genomes</h2>
<p>Within GATK documentation, <a href="https://software.broadinstitute.org/gatk/documentation/article?id=8017">Tutorial#8017</a> outlines how to map reads in an alternate-contig-aware manner and discusses some of the implications of mapping reads to reference genomes with alternate contigs. </p>
<p>GATK tools allow for use of a genomic <a href="http://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals">intervals list</a> that tells tools which regions of the genome the tools should act on. Judicious use of an intervals list, e.g. one that excludes regions of Ns and low-complexity repeat regions in the genome, makes processes more efficient. This brings us to the next point.</p>
<h4>Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.</h4>
<ul>
<li>For example, <code>HLA-A*01:01:01:01</code> is a new contig in GRCh38. The colons are a new feature of contig naming in GRCh38 compared to prior assemblies. This has implications for using the <code>-L</code> option of GATK, as the option also uses the colon as a delimiter to distinguish between the contig and the genomic coordinates.</li>
<li>When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use <code>-L chr1:1-100</code>. This also works for our HLA contig, e.g. <code>-L HLA-A*01:01:01:01:1-100</code>.</li>
<li>
<p>However, when passing in an entire contig, for contigs with colons in the name you must add <code>:1+</code> to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not as genomic coordinates.</p>
<pre><code class="pre_md"> -L HLA-A*01:01:01:01:1+</code class="pre_md"></pre>
</li>
</ul>
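<p>To see why the trailing <code>:1+</code> disambiguates, consider how an interval string must be parsed when contig names may themselves contain colons. The following is an illustrative sketch of the parsing problem, not GATK's actual parser; it anchors on the coordinate suffix rather than splitting at the first colon:</p>

```python
import re

def split_interval(spec):
    """Sketch: split an interval spec into (contig, start, stop).

    A trailing ':<start>-<stop>' or ':<start>+' is treated as
    coordinates; everything before it is the contig name. A bare name
    without a coordinate suffix is ambiguous for colon-containing
    contigs, which is why the ':1+' workaround exists.
    """
    m = re.match(r"^(.+):(\d+)-(\d+)$", spec)
    if m:
        return m.group(1), int(m.group(2)), int(m.group(3))
    m = re.match(r"^(.+):(\d+)\+$", spec)
    if m:
        return m.group(1), int(m.group(2)), None  # open-ended: to contig end
    return spec, None, None  # whole contig, only safe if the name has no colons
```

<p>With this approach, <code>HLA-A*01:01:01:01:1+</code> resolves unambiguously to the contig <code>HLA-A*01:01:01:01</code> starting at position 1.</p>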
<h3>Viewing CRAM alignments on genome browsers</h3>
<p>Because CRAM compression depends on the alignment reference genome, tools that use CRAM files ensure correct decompression by comparing reference contig <a href="https://en.wikipedia.org/wiki/MD5">MD5 hash</a> values. These are sensitive to any changes in the sequence, e.g. masking with Ns. This can have implications for viewing alignments in genome browsers when there is a mismatch between the reference that is loaded in the browser and the reference that was used in alignment. If you are using a version of tools for which this is an issue, be sure to load the original analysis set reference genome to view the CRAM alignments.</p>
<h3>Should I switch to a newer reference?</h3>
<p>Yes, you should. In addition to adding many alternate contigs, GRCh38 corrects thousands of SNPs and indels in the GRCh37 assembly that are absent in the population and are likely sequencing artifacts. It also includes synthetic centromeric sequence and updates non-nuclear genomic sequence.</p>
<p>The ability to recognize alternate haplotypes for loci is a drastic improvement that GRCh38 makes possible. Going forward, expanding genomics data will help identify variants for alternate haplotypes, improve existing alternate haplotypes and add new ones, and give us a better accounting of alternate haplotypes within populations. We are already seeing improvements and additions in the patch releases to reference genomes, e.g. the seven minor releases of GRCh38 available at the time of this writing. </p>
<p>Note that variants produced by alternate haplotypes when they are represented on the primary assembly may or may not be present in data resources, e.g. dbSNP. This could have varying degrees of impact, including negligible, for any process that relies on known variant sites. Consider the impact this discrepant coverage in data resources may have for your research aims, and weigh it against the impact of missing variants because their sequence context is unaccounted for in previous assemblies.</p>
<hr />
<h2>External resources</h2>
<ol>
<li><code>New 11/16/2016</code> For a brief history and discussion of challenges in using GRCh38, see the 2015 <em>Genome Biology</em> article <em>Extending reference assembly models</em> by Church et al. (DOI: <a href="https://dx.doi.org/10.1186/s13059-015-0587-3">10.1186/s13059-015-0587-3</a>).</li>
<li>For press releases highlighting improvements in GRCh38 from December 2013, see <a href="http://www.ncbi.nlm.nih.gov/news/12-23-2013-grch38-released/">http://www.ncbi.nlm.nih.gov/news/12-23-2013-grch38-released/</a> and <a href="http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html">http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html</a>. The latter post summarizes major improvements, including the correction of thousands of SNPs and indels in GRCh37 not seen in the population and the inclusion of synthetic centromeric sequence.</li>
<li>Recent releases of BWA, e.g. v0.7.15+, handle ALT contig mapping and HLA typing. See the <a href="https://github.com/lh3/bwa">BWA repository</a> for information, and these pages for <a href="https://sourceforge.net/projects/bio-bwa/files/">download</a> and <a href="http://gatkforums.broadinstitute.org/wdl/discussion/2899">installation instructions</a>.</li>
<li>The <a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/">Genome Reference Consortium (GRC)</a> provides human, mouse, zebrafish and chicken sequences, and <a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/">this particular webpage</a> gives an overview of GRCh38. Namely, an interactive chromosome ideogram marks regions with corresponding alternate loci, regions with fix patches and regions containing novel patches. For additional assembly terminology, see <a href="http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml">http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml</a>.</li>
<li>
<p>The <a href="http://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Human&db=hg38">UCSC Genome Browser</a> allows browsing and download of genomes, including <em>analysis sets</em>, from many different species. For more details on the difference between the GRCh38 reference and analysis sets, see <code>ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/README.txt</code> and <code>ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/README.txt</code>, respectively. In addition, the site provides annotation files, e.g. <a href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">here</a> is the annotation database for GRCh38. Within this particular page, the file named <a href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/gap.txt.gz">gap.txt.gz</a> catalogues the gapped regions of the assembly that are full of Ns. For our illustration above, the corresponding region in this file shows:</p>
<pre><code class="pre_md"> 585 chr14 0 10000 1 N 10000 telomere no
1 chr14 10000 16000000 2 N 15990000 short_arm no
707 chr14 16022537 16022637 4 N 100 contig no</code class="pre_md"></pre>
</li>
<li>The <a href="http://www.broadinstitute.org/igv/home">Integrative Genomics Viewer</a> is a desktop application for viewing genomics data, including alignments. The tool accesses reference genomes you provide via file or URL, or that it hosts over a server. The numerous hosted reference genomes include GRCh38. See <a href="http://www.broadinstitute.org/igv/Genomes">this page</a> for information on hosted reference genomes. For the most up-to-date list of hosted genomes, open IGV and go to <em>Genomes</em> > <em>Load Genome From Server</em>; a menu lists genomes you can make available in the main genome dropdown menu. </li>
</ol>
<hr />
@ -0,0 +1,15 @@
|
|||
## Spanning or overlapping deletions
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6926/spanning-or-overlapping-deletions
|
||||
|
||||
<p>We use the term <strong>spanning deletion</strong> or <strong>overlapping deletion</strong> to refer to a deletion that spans a position of interest. </p>
|
||||
<p>The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the <a href="https://samtools.github.io/hts-specs/VCFv4.3.pdf">VCF v4.3 specification</a> reserves the <code>*</code> allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <code><*></code> used to denote symbolic alternate alleles.</p>
|
||||
<hr />
|
||||
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/3e/6389220c22db2a69857811121c3dd1.png" width="400" align="right" border="10"/>
|
||||
<p>Here we illustrate with four human samples. Bob and Lian each have a heterozygous <code>A</code> to <code>T</code> single polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar each are heterozygous for the same 9 bp deletion. Omar and Bob's other allele is the reference <code>A</code>.</p>
|
||||
<p><strong>What are the genotypes for each individual at position 20?</strong> For Bob, the reference A and variant T alleles are clearly present for a genotype of <code>A/T</code>.</p>
|
||||
<p>What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. To notate the deletion as we do single nucleotide deletions is technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk <code>*</code> at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is <code>T/*</code>.</p>
|
||||
<p>At the sample-level, Kyra and Omar would not have records for position 20. However, we are comparing multiple samples and so we indicate the spanning deletion at position 20 with <code>*</code>. Omar's genotype is <code>A/*</code> and Kyra's is <code>*/*</code>. </p>
|
||||
<hr />
|
||||
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/f6/72037063701d84343ea6469fe64d2e.png" height="180" align="right" border="10"/>
|
||||
<p><strong>In the VCF</strong>, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk <code>*</code> under the <code>ALT</code> column. The spanning deletion is then referred to in the genotype <code>GT</code> for Kyra, Lian and Omar. Alternatively, a VCF may avoid referencing the spanning deletion altogether by listing the overlapped variant together with the deletion in a single record. This is shown in the second example VCF at position 14. </p>
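<p>For illustration, here is a minimal sketch of the first style as hypothetical VCF records for our four samples (the REF bases and all non-genotype fields are placeholders, not real data; sample columns are Bob, Kyra, Lian, Omar):</p>

```text
#CHROM  POS  ID  REF         ALT  QUAL  FILTER  INFO  FORMAT  Bob  Kyra  Lian  Omar
20      14   .   GAAAAAAAAA  G    .     .       .     GT      0/0  1/1   0/1   0/1
20      20   .   A           T,*  .     .       .     GT      0/1  2/2   1/2   0/2
```

<p>At position 20, allele index 2 is the <code>*</code> spanning deletion, giving Lian <code>T/*</code> (1/2), Omar <code>A/*</code> (0/2) and Kyra <code>*/*</code> (2/2), exactly as described above.</p>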
|
||||
|
|
@ -0,0 +1,12 @@
|
|||
## At what point should I merge read group BAM files belonging to the same sample into a single file?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/6057/at-what-point-should-i-merge-read-group-bam-files-belonging-to-the-same-sample-into-a-single-file
|
||||
|
||||
<p>It is fairly common to have multiple read groups for a sample, either from sequencing multiple libraries or from spreading a library across multiple lanes. It seems this causes a lot of confusion, and people often tell us they're not sure how to organize the data for the pre-processing steps or how to feed the data into HaplotypeCaller. </p>
|
||||
<p>Well, there are several options for organizing the processing. We have a fairly detailed FAQ article that describes <a href="https://www.broadinstitute.org/gatk/guide/article?id=3060">our preferred workflow for pre-processing data from multiplexed sequencing and multi-library designs</a>. But in this article we describe, at a simpler level, the two main options, depending on how you want to provide the analysis-ready BAM files to the variant caller. </p>
|
||||
<h3>To produce a combined per-sample bam file to feed to HaplotypeCaller (most common)</h3>
|
||||
<p>The simplest thing to do is to input all the bam files that belong to that sample, either at the MarkDuplicates step, the Indel Realignment step or at the BQSR step. The choice depends mostly on how deep the coverage is. High depth means a lot of data to process at the same time, which slows down Indel Realignment. This is because Indel Realignment ignores all read group information and simply processes all reads together. BQSR doesn't suffer from that problem because it processes read groups separately. In either case, when you input all the files together, the bam that gets written out with the processed data will include all the libraries / read groups in one handy per-sample file. </p>
|
||||
<p><em>Note: We do not require the PU field in the RG, however, BQSR will consider the PU field over all other fields.</em></p>
|
||||
<h3>To produce a separate bam file for each read group (less common)</h3>
|
||||
<p>Another option is to keep all the bam files separate until variant calling, and then input them to HaplotypeCaller together. You can do this by simply running Indel Realignment and BQSR on each of the bams separately. You can then input all of the bams into HaplotypeCaller at once. This works even if you want to run HaplotypeCaller in GVCF mode, which can only be done on a single sample at a time. As long as the SM tags are identical, HaplotypeCaller will recognize that it's a single-sample run. This is because the GATK engine will merge the data before presenting it to the HaplotypeCaller tool, so HaplotypeCaller does not know nor care whether the data came from many files or one file.</p>
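<p>To see why identical <code>SM</code> tags yield a single-sample run, consider this toy Python sketch (the <code>@RG</code> header lines are hypothetical, and the real merging happens inside the GATK engine; this just illustrates the grouping rule):</p>

```python
# Toy illustration: reads from many BAMs are pooled by the SM tag of their
# read group, so files with identical SM values collapse into one sample.
def sample_of(rg_line):
    """Extract the SM field from an @RG header line."""
    fields = dict(f.split(":", 1) for f in rg_line.split("\t")[1:])
    return fields["SM"]

rg_lines = [
    "@RG\tID:lane1\tSM:NA12878\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:NA12878\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane3\tSM:NA12878\tLB:lib2\tPL:ILLUMINA",
]
samples = {sample_of(line) for line in rg_lines}
print(len(samples))  # one distinct sample, so the run is treated as single-sample
```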
|
||||
<p><em>Note: If you input many bam files into Indel Realigner, the default output is one bam file. However, you can output one bam file for each input bam file by using <a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php#--nWayOut"><code>-nWayOut</code></a>.</em></p>
|
||||
|
|
@ -0,0 +1,8 @@
|
|||
## Can I apply the germline variant joint calling workflow to my RNAseq data?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/7363/can-i-apply-the-germline-variant-joint-calling-workflow-to-my-rnaseq-data
|
||||
|
||||
<p>We have <strong>not yet validated</strong> the joint genotyping methods (HaplotypeCaller in <code>-ERC GVCF</code> mode per-sample then GenotypeGVCFs per-cohort) on RNAseq data. Our standard recommendation is to process RNAseq samples individually as laid out in the RNAseq-specific documentation. </p>
|
||||
<p>However, we know that a lot of people have been trying out the joint genotyping workflow on RNAseq data, and there do not seem to be any major technical problems. You are welcome to try it on your own data, with the caveat that we cannot guarantee correctness of results, and may not be able to help you if something goes wrong. Please be sure to examine your results carefully and critically.</p>
|
||||
<p>If you do pursue this, you will need to pre-process your samples according to our RNAseq-specific documentation, then switch to the GVCF workflow at the HaplotypeCaller stage. For filtering, it will be up to you to determine whether the hard filtering or VQSR filtering method produces better results. We have not tested any of this so we cannot provide a recommendation. Be prepared to do a lot of analysis to validate the quality of your results. </p>
|
||||
<p>Good luck!</p>
|
||||
|
|
@ -0,0 +1,19 @@
|
|||
## Can I use GATK on non-diploid organisms?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1214/can-i-use-gatk-on-non-diploid-organisms
|
||||
|
||||
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/d3/424549fd16f54f89339a95b6634461.jpg" />
|
||||
<p>In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations. </p>
|
||||
<h3>Ploidy-related functionalities</h3>
|
||||
<p>As of version 3.3, HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the <code>-ploidy</code> argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you're running the <code>-ERC GVCF</code> workflow, you'll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don't require you to specify the <code>-ploidy</code> argument.</p>
|
||||
<p>For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the <code>-ploidy</code> argument. </p>
|
||||
<h3>Cases where ploidy needs to be specified</h3>
|
||||
<ol>
|
||||
<li>Native variant calling in haploid or polyploid organisms. </li>
|
||||
<li>Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample". </li>
|
||||
<li>Pooled validation/genotyping at known sites. </li>
|
||||
</ol>
|
||||
<p>For normal organism ploidy, you just set the <code>-ploidy</code> argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. <code>(Ploidy per individual) * (Individuals in pool)</code>.</p>
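<p>The pooled-sequencing arithmetic above can be sketched in a couple of lines of Python (a trivial helper, just to make the rule concrete):</p>

```python
# Ploidy to pass via -ploidy for a pooled sample: each barcoded "sample"
# carries the chromosomes of every individual in the pool.
def pool_ploidy(ploidy_per_individual, individuals_in_pool):
    return ploidy_per_individual * individuals_in_pool

# e.g. a pool of 12 diploid organisms sharing one barcode:
print(pool_ploidy(2, 12))  # -> 24
```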
|
||||
<h2>Important limitations</h2>
|
||||
<p>Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: <code>QUAL</code>, <code>QD</code>, <code>SB</code>, <code>FS</code>, <code>AC</code>, <code>AF</code>, and Genotype annotations such as <code>PL</code>, <code>AD</code>, <code>GT</code>, etc.</p>
|
||||
<p>You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors. </p>
|
||||
|
|
@ -0,0 +1,18 @@
|
|||
## Can I use different versions of the GATK at different steps of my analysis?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/3536/can-i-use-different-versions-of-the-gatk-at-different-steps-of-my-analysis
|
||||
|
||||
<p>Short answer: NO. </p>
|
||||
<p>Medium answer: no, at least not if you want to run a low-risk pipeline.</p>
|
||||
<p>Long answer: see below for details.</p>
|
||||
<hr />
|
||||
<p><strong>The rationale</strong></p>
|
||||
<p>There are several reasons why you might want to do this: you're using the latest version of GATK and one of the tools has a show-stopping bug, so you'd like to use an older, pre-bug version of that tool, but still use the latest version of all the other tools; or maybe you've been using an older version of GATK and you'd like to use a new tool, but keep using the rest in the version that you've been using to process hundreds of samples already.</p>
|
||||
<p><strong>The problem: compatibility is not guaranteed</strong></p>
|
||||
<p>In many cases, when we modify one tool in the GATK, we need to make adjustments to other tools that interact either directly or indirectly with the data consumed or produced by the upgraded tool. If you mix and match tools from different versions of GATK, you risk running into compatibility issues. For example, HaplotypeCaller expects a BAM compressed by ReduceReads to have its data annotated in a certain way. If the information is formatted differently than what the HC expects (because that's how the corresponding ReduceReads from the same version does it), it can blow up -- or worse, do the wrong thing but not tell you there's a problem.</p>
|
||||
<p><strong>But what if the tools/tasks are in unrelated workflows?</strong></p>
|
||||
<p>Would it really be so bad to use CountReads from GATK version 2.7 for a quick QC check that's not actually part of my pipeline, which uses version 2.5? Well, maaaaybe not, but we still think it's a source of error, and we do our damnedest to eliminate those.</p>
|
||||
<p><strong>The conclusion</strong></p>
|
||||
<p>You shouldn't use tools from different versions within the same workflow, that's for sure. We don't think it's worth the risks. If there's a show-stopping bug, let us know and we promise to fix it as soon as (humanly) possible. For the rest, either accept that you're stuck with the version you started your study with (we may be able to help with workarounds for known issues), or upgrade your entire workflow and start your analysis from scratch. Depending on how far along you are, one of those options will be less painful; go with that. </p>
|
||||
<p><strong>The plea bargain, and a warning</strong></p>
|
||||
<p>If despite our dire warnings you're still going to mix and match tool versions, fine, we can't stop you. But be really careful, and check the release notes document of every version involved. And keep in mind that when things go wrong, we will deny you support if we think you've been reckless. </p>
|
||||
|
|
@ -0,0 +1,10 @@
|
|||
## Collected FAQs about VCF files
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1318/collected-faqs-about-vcf-files
|
||||
|
||||
<h3>1. What file formats do you support for variant callsets?</h3>
|
||||
<p>We support the <a href="http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0">Variant Call Format (VCF)</a> for variant callsets. No other file formats are supported.</p>
|
||||
<h3>2. How can I know if my VCF file is valid?</h3>
|
||||
<p><a href="http://vcftools.sourceforge.net/">VCFTools</a> contains a <a href="http://vcftools.sourceforge.net/docs.html#validator">validation tool</a> that will allow you to verify it.</p>
|
||||
<h3>3. Are you planning to include any converters from different formats or allow different input formats than VCF?</h3>
|
||||
<p>No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts. </p>
|
||||
|
|
@ -0,0 +1,90 @@
|
|||
## Collected FAQs about input files for sequence read data (BAM/CRAM)
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cram
|
||||
|
||||
<h3>1. What file formats do you support for sequence data input?</h3>
|
||||
<p>The GATK supports the <a href="http://samtools.github.io/hts-specs/">BAM</a> format for reads, quality scores, alignments, and metadata (<em>e.g.</em> the lane of sequencing, center of origin, sample name, etc.). Starting with version 3.5, the <a href="http://samtools.github.io/hts-specs/">CRAM</a> format is supported as well. SAM format is not supported but can be easily converted with Picard tools. </p>
|
||||
<hr />
|
||||
<h3>2. How do I get my data into BAM format?</h3>
|
||||
<p>The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at <a href="http://broadinstitute.github.io/picard/">Picard</a> and <a href="http://samtools.sourceforge.net/">Samtools</a> for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See <a href="http://bio-bwa.sourceforge.net/bwa.shtml">BWA</a> for one such aligner.</p>
|
||||
<hr />
|
||||
<h3>3. What are the formatting requirements for my BAM file(s)?</h3>
|
||||
<p>Every BAM/CRAM file must satisfy the following requirements:</p>
|
||||
<ul>
|
||||
<li>It must be aligned to one of the references described <a href="http://www.broadinstitute.org/gatk/guide/article?id=1204">here</a>.</li>
|
||||
<li>It must be sorted in <strong>coordinate order</strong> (not by queryname and not "unsorted").</li>
|
||||
<li>It must list the <a href="http://www.broadinstitute.org/gatk/guide/article?id=6472">read groups</a> with sample names in the header.</li>
|
||||
<li>Every read must belong to a read group.</li>
|
||||
<li>The BAM file must pass Picard <a href="https://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile">ValidateSamFile</a> validation.</li>
|
||||
</ul>
|
||||
<p>See the <a href="http://samtools.github.io/hts-specs/">official BAM specification</a> for more information on what constitutes a valid BAM file.</p>
|
||||
<hr />
|
||||
<h3>4. What is the canonical ordering of human reference contigs in a BAM file?</h3>
|
||||
<p>It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:</p>
|
||||
<p>Human genome reference consortium standard ordering and names (b3x):
|
||||
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...</p>
|
||||
<p>UCSC convention (hg1x):
|
||||
chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...</p>
|
||||
<hr />
|
||||
<h3>5. How can I tell if my BAM file is sorted properly?</h3>
|
||||
<p>The easiest way to do it is to download Samtools and run the following command to examine the header of your file:</p>
|
||||
<pre><code class="pre_md">$ samtools view -H /path/to/my.bam
|
||||
@HD VN:1.0 GO:none SO:coordinate
|
||||
@SQ SN:1 LN:247249719
|
||||
@SQ SN:2 LN:242951149
|
||||
@SQ SN:3 LN:199501827
|
||||
@SQ SN:4 LN:191273063
|
||||
@SQ SN:5 LN:180857866
|
||||
@SQ SN:6 LN:170899992
|
||||
@SQ SN:7 LN:158821424
|
||||
@SQ SN:8 LN:146274826
|
||||
@SQ SN:9 LN:140273252
|
||||
@SQ SN:10 LN:135374737
|
||||
@SQ SN:11 LN:134452384
|
||||
@SQ SN:12 LN:132349534
|
||||
@SQ SN:13 LN:114142980
|
||||
@SQ SN:14 LN:106368585
|
||||
@SQ SN:15 LN:100338915
|
||||
@SQ SN:16 LN:88827254
|
||||
@SQ SN:17 LN:78774742
|
||||
@SQ SN:18 LN:76117153
|
||||
@SQ SN:19 LN:63811651
|
||||
@SQ SN:20 LN:62435964
|
||||
@SQ SN:21 LN:46944323
|
||||
@SQ SN:22 LN:49691432
|
||||
@SQ SN:X LN:154913754
|
||||
@SQ SN:Y LN:57772954
|
||||
@SQ SN:MT LN:16571
|
||||
@SQ SN:NT_113887 LN:3994
|
||||
...</code class="pre_md"></pre>
|
||||
<p>If the order of the contigs here matches the contig ordering specified above, and the <code>SO:coordinate</code> flag appears in your header, then your contig and read ordering satisfies the GATK requirements.</p>
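<p>The same check can be automated with a short Python sketch (not a GATK tool; it just applies the two criteria above to header text captured from <code>samtools view -H</code>, against whatever canonical contig ordering applies to your reference):</p>

```python
# Minimal sanity check of a BAM header dumped as text:
# SO:coordinate must be set, and the @SQ contigs must follow the expected order.
def is_coordinate_sorted(header_text, expected_contigs):
    lines = header_text.splitlines()
    so_ok = any("SO:coordinate" in line
                for line in lines if line.startswith("@HD"))
    contigs = [f[3:] for line in lines if line.startswith("@SQ")
               for f in line.split("\t") if f.startswith("SN:")]
    # the header contigs must appear in the same relative order as expected
    order_ok = contigs == [c for c in expected_contigs if c in contigs]
    return so_ok and order_ok

header = "@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373"
print(is_coordinate_sorted(header, ["1", "2", "3"]))  # True
```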
|
||||
<hr />
|
||||
<h3>6. My BAM file isn't sorted that way. How can I fix it?</h3>
|
||||
<p><a href="http://picard.sourceforge.net/">Picard</a> offers a tool called <a href="http://picard.sourceforge.net/command-line-overview.shtml#SortSam">SortSam</a> that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the <a href="http://picard.sourceforge.net/command-line-overview.shtml">ReorderSam</a> command to make a BAM file's @SQ order match another reference sequence.</p>
|
||||
<hr />
|
||||
<h3>7. How can I tell if my BAM file has read group and sample information?</h3>
|
||||
<p>A quick Unix command using Samtools will do the trick:</p>
|
||||
<pre><code class="pre_md">$ samtools view -H /path/to/my.bam | grep '^@RG'
|
||||
@RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm
|
||||
@RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm
|
||||
@RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
|
||||
@RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
|
||||
@RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
|
||||
...</code class="pre_md"></pre>
|
||||
<p>The presence of the <code>@RG</code> tags indicates the presence of read groups. Each read group has an <code>SM</code> tag, indicating the sample from which the reads belonging to that read group originate.</p>
|
||||
<p>In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,</p>
|
||||
<pre><code class="pre_md">$ samtools view /path/to/my.bam | head
|
||||
EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
|
||||
EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85
|
||||
EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
|
||||
EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
|
||||
...</code class="pre_md"></pre>
|
||||
<p>membership in a read group is specified by the <code>RG:Z:*</code> tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).</p>
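<p>As a sketch, pulling the read group out of a single SAM record looks like this in Python (the record below is abbreviated and hypothetical; real records have full SEQ/QUAL strings):</p>

```python
# A SAM record has 11 mandatory tab-separated columns; optional fields
# follow, and the read group is the one of the form RG:Z:<id>.
def read_group_of(sam_line):
    for field in sam_line.split("\t")[11:]:
        if field.startswith("RG:Z:"):
            return field[5:]
    return None  # a read with no RG tag would violate the requirements above

read = ("EAS139_44:2:61:681:18781\t35\t1\t1\t0\t51M\t=\t9\t59\t"
        "TAACCC\tB<>;==\tRG:Z:4\tNM:i:0")
print(read_group_of(read))  # 4
```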
|
||||
<hr />
|
||||
<h3>8. My BAM file doesn't have read group and sample information. Do I really need it?</h3>
|
||||
<p>Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information. You can use Picard's <a href="https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups">AddOrReplaceReadGroups</a> tool to add read group information.</p>
|
||||
<hr />
|
||||
<h3>9. What's the best way to create a subset of my BAM file containing only reads over a small interval?</h3>
|
||||
<p>You can use the GATK to do the following:</p>
|
||||
<pre><code class="pre_md">java -jar GenomeAnalysisTK.jar -R reference.fasta -I full_input.bam -T PrintReads -L chr1:10-20 -o subset_input.bam</code class="pre_md"></pre>
|
||||
<p>and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.</p>
|
||||
|
|
@ -0,0 +1,40 @@
|
|||
## Collected FAQs about interval lists
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1319/collected-faqs-about-interval-lists
|
||||
|
||||
<h2>1. Can GATK tools be restricted to specific intervals instead of processing the entire reference?</h2>
|
||||
<p>Absolutely. Just use the <code>-L</code> argument to provide the list of intervals you wish to run on. Or you can use <code>-XL</code> to <em>exclude</em> intervals, e.g. to blacklist genome regions that are problematic. </p>
|
||||
<hr />
|
||||
<h2>2. What file formats does GATK support for interval lists?</h2>
|
||||
<p>GATK supports several types of interval list formats: Picard-style <code>.interval_list</code>, GATK-style <code>.list</code>, BED files with extension <code>.bed</code>, and VCF files. </p>
|
||||
<h3>A. Picard-style <code>.interval_list</code></h3>
|
||||
<p>Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <code><chr> <start> <stop> + <target_name></code>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0). </p>
|
||||
<pre><code class="pre_md">@HD VN:1.0 SO:coordinate
|
||||
@SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens
|
||||
@SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens
|
||||
1 30366 30503 + target_1
|
||||
1 69089 70010 + target_2
|
||||
1 367657 368599 + target_3
|
||||
1 621094 622036 + target_4
|
||||
1 861320 861395 + target_5
|
||||
1 865533 865718 + target_6</code class="pre_md"></pre>
|
||||
<p>This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).</p>
|
||||
<h3>B. GATK-style <code>.list</code> or <code>.intervals</code></h3>
|
||||
<p>This is a simpler format, where intervals are in the form <code>&lt;chr&gt;:&lt;start&gt;-&lt;stop&gt;</code>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <code>&lt;chr&gt;</code> part is strictly required; if you just want to specify chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <code>&lt;chr&gt;:&lt;start&gt;-&lt;stop&gt;</code> and <code>&lt;chr&gt;</code> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.</p>
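<p>A minimal Python sketch of parsing just the two forms described here (the real GATK engine parser also handles other cases, so treat this as illustrative only):</p>

```python
import re

# Parse a GATK-style interval: "chr" alone, or "chr:start-stop"
# (1-based, inclusive coordinates).
def parse_interval(s):
    m = re.fullmatch(r"([^:]+)(?::(\d+)-(\d+))?", s)
    if not m:
        raise ValueError(f"not a GATK-style interval: {s}")
    chrom, start, stop = m.group(1), m.group(2), m.group(3)
    return (chrom, int(start), int(stop)) if start else (chrom, None, None)

print(parse_interval("20:1000-2000"))  # ('20', 1000, 2000)
print(parse_interval("chrX"))          # ('chrX', None, None)
```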
|
||||
<h3>C. BED files with extension <code>.bed</code></h3>
|
||||
<p>We also accept the widely-used BED format, where intervals are in the form <code><chr> <start> <stop></code>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the <code>.bed</code> extension and interprets the coordinate system accordingly.</p>
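<p>The offset rule can be made concrete with a one-line Python helper (BED is 0-based with an exclusive stop, so the start gains 1 while the stop carries over unchanged into 1-based inclusive coordinates):</p>

```python
# Convert a 0-based, half-open BED interval to the 1-based, inclusive
# coordinates used by the other interval formats.
def bed_to_one_based(chrom, start, stop):
    return (chrom, start + 1, stop)

print(bed_to_one_based("1", 999, 2000))  # ('1', 1000, 2000)
```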
|
||||
<h3>D. VCF files</h3>
|
||||
<p>Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. <code>-ip 100</code> in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.</p>
|
||||
<hr />
|
||||
<h2>3. Is there a required order of intervals?</h2>
|
||||
<p>Yes, thanks for asking. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is for efficiency reasons. </p>
|
||||
<hr />
|
||||
<h2>4. Can I provide multiple sets of intervals?</h2>
|
||||
<p>Sure, no problem -- just pass them in using separate <code>-L</code> arguments. You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified with the <a href="https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--interval_set_rule"><code>--interval_set_rule</code></a> argument.</p>
|
||||
<hr />
|
||||
<h2>5. How will GATK handle intervals that abut or overlap?</h2>
|
||||
<p>Very gracefully. By default the GATK engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by setting an <a href="https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--interval_merging"><code>interval_merging</code></a> rule.</p>
|
||||
<hr />
|
||||
<h2>6. What's the best way to pad intervals?</h2>
|
||||
<p>You can use the <a href="https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--interval_padding"><code>-ip</code></a> engine argument to add padding on the fly. No need to produce separate padded targets files. Sweet, right? </p>
|
||||
<p>Note that if intervals that previously didn't abut or overlap before you added padding now do so, by default the GATK engine will merge them as described above. This behavior can be modified by setting an <a href="https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--interval_merging"><code>interval_merging</code></a> rule.</p>
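<p>Padding followed by merging can be sketched in a few lines of Python (an illustration of the default engine behavior described above, not the GATK's actual implementation):</p>

```python
# Pad 1-based inclusive (start, stop) intervals, as -ip does, then merge
# any that now abut or overlap into single intervals.
def pad_and_merge(intervals, padding):
    padded = sorted((max(1, s - padding), e + padding) for s, e in intervals)
    merged = []
    for s, e in padded:
        if merged and s <= merged[-1][1] + 1:  # overlapping or abutting
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Two targets 60 bp apart become one interval once each gains 30 bp of padding:
print(pad_and_merge([(100, 200), (260, 300)], 30))  # [(70, 330)]
```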
|
||||
|
|
@ -0,0 +1,18 @@
|
|||
## How can I access the GSA public FTP server?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1215/how-can-i-access-the-gsa-public-ftp-server
|
||||
|
||||
<p><strong>NOTE: This article will be deprecated in the near future as this information will be consolidated elsewhere.</strong></p>
|
||||
<p>We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.</p>
|
||||
<p>There are two logins to choose from depending on whether you want to upload or download something:</p>
|
||||
<h3>Downloading</h3>
|
||||
<pre><code class="pre_md">location: ftp.broadinstitute.org
|
||||
username: gsapubftp-anonymous
|
||||
password: <blank></code class="pre_md"></pre>
|
||||
<h3>Uploading</h3>
|
||||
<pre><code class="pre_md">location: ftp.broadinstitute.org
|
||||
username: gsapubftp
|
||||
password: 5WvQWSfi</code class="pre_md"></pre>
|
||||
<h3>Using a browser as FTP client</h3>
|
||||
<p>If you use your browser as FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link (for downloading only):</p>
|
||||
<p><a href="ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle">ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle</a></p>
|
||||
|
|
@ -0,0 +1,14 @@
|
|||
## How can I invoke read filters and their arguments?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/2338/how-can-i-invoke-read-filters-and-their-arguments
|
||||
|
||||
<p>Most GATK tools apply several read filters by default. You can look up exactly what the defaults are for each tool in their respective <a href="http://www.broadinstitute.org/gatk/gatkdocs/">Technical Documentation</a> pages. </p>
|
||||
<p>But sometimes you want to specify additional filters yourself (and before you ask, no, you cannot disable the default read filters used by a given tool). This is how you do it:</p>
|
||||
<p>The <code>--read_filter</code> argument (or <code>-rf</code> for short) allows you to apply whatever read filters you'd like. For example, to add the <code>MaxReadLengthFilter</code> filter to <code>PrintReads</code>, you just add this to your command line:</p>
|
||||
<pre><code class="pre_md">--read_filter MaxReadLength </code class="pre_md"></pre>
|
||||
<h4>Notice that when you specify a read filter, you need to strip the Filter part of its name off!</h4>
|
||||
<p>The read filter will be applied with its default value (which you can also look up in the Tech Docs for that filter). Now, if you want to specify a different value from the default, you pass the relevant argument by adding this right after the read filter:</p>
|
||||
<pre><code class="pre_md">--read_filter MaxReadLength --maxReadLength 76</code class="pre_md"></pre>
|
||||
<p>It's important that you pass the argument right after the filter itself, otherwise the command line parser won't know that they're supposed to go together.</p>
|
||||
<p>And of course, you can add as many filters as you like by using multiple copies of the <code>--read_filter</code> parameter:</p>
|
||||
<pre><code class="pre_md">--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead</code class="pre_md"></pre>
|
||||
|
|
@ -0,0 +1,114 @@
|
|||
## How can I prepare a FASTA file to use as reference?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference
|
||||
|
||||
<p>This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant <a href="http://www.broadinstitute.org/gatk/guide/article?id=2798">tutorial</a>.</p>
|
||||
<h3>Why these steps are necessary</h3>
|
||||
<p>The GATK uses two files to access and safety-check the reference file: a <code>.dict</code> dictionary of the contig names and sizes, and a <code>.fai</code> fasta index file that allows efficient random access to the reference bases. You have to generate these files in order to be able to use a FASTA file as reference.</p>
|
||||
<p><strong>NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.</strong></p>
|
||||
<h3>Creating the fasta sequence dictionary file</h3>
|
||||
<p>We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file. </p>
<pre><code class="pre_md">> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2% 0+0k 0+0io 2pf+0w</code class="pre_md"></pre>
<p>This produces a SAM-style header file describing the contents of our fasta file.</p>
<pre><code class="pre_md">> cat Homo_sapiens_assembly18.dict
@HD VN:1.0 SO:unsorted
@SQ SN:chrM LN:16571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ SN:chr1 LN:247249719 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ SN:chr2 LN:242951149 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d
@SQ SN:chr3 LN:199501827 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ SN:chr4 LN:191273063 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2
@SQ SN:chr5 LN:180857866 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858
@SQ SN:chr6 LN:170899992 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583
@SQ SN:chr7 LN:158821424 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb
@SQ SN:chr8 LN:146274826 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ SN:chr9 LN:140273252 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b
@SQ SN:chr10 LN:135374737 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ SN:chr11 LN:134452384 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ SN:chr12 LN:132349534 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716
@SQ SN:chr13 LN:114142980 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587
@SQ SN:chr14 LN:106368585 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ SN:chr15 LN:100338915 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ SN:chr16 LN:88827254 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3e81884229e8dc6b7f258169ec8da246
@SQ SN:chr17 LN:78774742 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2a5c95ed99c5298bb107f313c7044588
@SQ SN:chr18 LN:76117153 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3d11df432bcdc1407835d5ef2ce62634
@SQ SN:chr19 LN:63811651 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2f1a59077cfad51df907ac25723bff28
@SQ SN:chr20 LN:62435964 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ SN:chr21 LN:46944323 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ SN:chr22 LN:49691432 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2041e6a0c914b48dd537922cca63acb8
@SQ SN:chrX LN:154913754 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ SN:chrY LN:57772954 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ SN:chr1_random LN:1663265 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cc05cb1554258add2eb62e88c0746394
@SQ SN:chr2_random LN:185571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ SN:chr3_random LN:749256 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cc571e918ac18afa0b2053262cadab6
@SQ SN:chr4_random LN:842648 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cab2949ccf26ee0f69a875412c93740
@SQ SN:chr5_random LN:143687 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:05926bdbff978d4a0906862eb3f773d0
@SQ SN:chr6_random LN:1875562 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d62eb2919ba7b9c1d382c011c5218094
@SQ SN:chr7_random LN:549659 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ SN:chr8_random LN:943810 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ SN:chr9_random LN:1146434 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ SN:chr10_random LN:113275 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:50be2d2c6720dabeff497ffb53189daa
@SQ SN:chr11_random LN:215294 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfc93adc30c621d5c83eee3f0d841624
@SQ SN:chr13_random LN:186858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:563531689f3dbd691331fd6c5730a88b
@SQ SN:chr15_random LN:784346 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bf885e99940d2d439d83eba791804a48
@SQ SN:chr16_random LN:105485 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ SN:chr17_random LN:2617613 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ SN:chr19_random LN:301858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:420ce95da035386cc8c63094288c49e2
@SQ SN:chr21_random LN:1679693 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:a7252115bfe5bb5525f34d039eecd096
@SQ SN:chr22_random LN:257318 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4f2d259b82f7647d3b668063cf18378b
@SQ SN:chrX_random LN:1719168 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f4d71e0758986c15e5455bf3e14e5d6f</code class="pre_md"></pre>
<h3>Creating the fasta index file</h3>
<p>We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.</p>
<pre><code class="pre_md">> samtools faidx Homo_sapiens_assembly18.fasta
108.446u 3.384s 2:44.61 67.9% 0+0k 0+0io 0pf+0w</code class="pre_md"></pre>
<p>This produces a text file with one record per line for each of the fasta contigs. Each record is of the form: contig, size, location, basesPerLine and bytesPerLine. The index file produced above looks like this:</p>
<pre><code class="pre_md">> cat Homo_sapiens_assembly18.fasta.fai
chrM 16571 6 50 51
chr1 247249719 16915 50 51
chr2 242951149 252211635 50 51
chr3 199501827 500021813 50 51
chr4 191273063 703513683 50 51
chr5 180857866 898612214 50 51
chr6 170899992 1083087244 50 51
chr7 158821424 1257405242 50 51
chr8 146274826 1419403101 50 51
chr9 140273252 1568603430 50 51
chr10 135374737 1711682155 50 51
chr11 134452384 1849764394 50 51
chr12 132349534 1986905833 50 51
chr13 114142980 2121902365 50 51
chr14 106368585 2238328212 50 51
chr15 100338915 2346824176 50 51
chr16 88827254 2449169877 50 51
chr17 78774742 2539773684 50 51
chr18 76117153 2620123928 50 51
chr19 63811651 2697763432 50 51
chr20 62435964 2762851324 50 51
chr21 46944323 2826536015 50 51
chr22 49691432 2874419232 50 51
chrX 154913754 2925104499 50 51
chrY 57772954 3083116535 50 51
chr1_random 1663265 3142044962 50 51
chr2_random 185571 3143741506 50 51
chr3_random 749256 3143930802 50 51
chr4_random 842648 3144695057 50 51
chr5_random 143687 3145554571 50 51
chr6_random 1875562 3145701145 50 51
chr7_random 549659 3147614232 50 51
chr8_random 943810 3148174898 50 51
chr9_random 1146434 3149137598 50 51
chr10_random 113275 3150306975 50 51
chr11_random 215294 3150422530 50 51
chr13_random 186858 3150642144 50 51
chr15_random 784346 3150832754 50 51
chr16_random 105485 3151632801 50 51
chr17_random 2617613 3151740410 50 51
chr18_random 4262 3154410390 50 51
chr19_random 301858 3154414752 50 51
chr21_random 1679693 3154722662 50 51
chr22_random 257318 3156435963 50 51
chrX_random 1719168 3156698441 50 51</code class="pre_md"></pre>
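<p>To illustrate how these index records enable random access, here is a small Python sketch (the function name is ours, not part of samtools) that computes the byte offset of a given 1-based reference position from one <code>.fai</code> record, using the chrM record above as an example:</p>

```python
def fasta_offset(location, bases_per_line, bytes_per_line, pos):
    """Byte offset in the FASTA file of the 1-based position `pos`,
    given one .fai record (location = byte offset of the contig's
    first base, just past the '>' header line)."""
    zero_based = pos - 1
    # Each full line of sequence occupies bytes_per_line bytes on disk
    # (bases plus the newline), but only bases_per_line bases.
    full_lines, within_line = divmod(zero_based, bases_per_line)
    return location + full_lines * bytes_per_line + within_line

# With the chrM record above (location 6, 50 bases / 51 bytes per line):
# fasta_offset(6, 50, 51, 1) == 6, the contig's very first base.
```

This is exactly why the index lets tools seek directly to contig:pos instead of reading the file from the top.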
## How can I turn on or customize forum notifications?

http://gatkforums.broadinstitute.org/gatk/discussion/27/how-can-i-turn-on-or-customize-forum-notifications

<p>By default, the forum does not send notification messages about new comments or discussions. If you want to turn on notifications or customize the type of notifications you want to receive (email, popup message, etc.), you need to do the following:</p>
<ul>
<li>Go to your profile page by clicking on your user name (in blue box, top left corner);</li>
<li>Click on "Edit Profile" (button with silhouette of person, top right corner);</li>
<li>In the menu on the left, click on "Notification Preferences";</li>
<li>Select the categories that you want to follow and the type of notification you want to receive;</li>
<li>Be sure to click on Save Preferences.</li>
</ul>
<p>To specifically get new GATK announcements, scroll down to "Category Notifications" and tick off the "Announcements" category for email notification for discussions (and comments if you really want to know everything).</p>
## How can I use parallelism to make GATK tools run faster?

http://gatkforums.broadinstitute.org/gatk/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster

<p><em>This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.</em></p>
<h3>Overview</h3>
<p>As explained in the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1988">primer on parallelism for the GATK</a>, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using <a href="https://software.broadinstitute.org/gatk/documentation/pipelines">Queue or Cromwell/WDL</a>).</p>
<h3>Multi-threading options</h3>
<p>There are two options for multi-threading with the GATK, controlled by the arguments <code>-nt</code> and <code>-nct</code>, respectively, which can be combined:</p>
<ul>
<li><code>-nt / --num_threads</code> controls the number of <strong>data threads</strong> sent to the processor</li>
<li><code>-nct / --num_cpu_threads_per_data_thread</code> controls the number of <strong>CPU threads</strong> allocated to each data thread</li>
</ul>
<p>For more information on how these multi-threading options work, please read the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1988">primer on parallelism for the GATK</a>.</p>
<h4>Memory considerations for multi-threading</h4>
<p>Each data thread needs to be given the full amount of memory you'd normally give a single run. So if you're running a tool that normally requires 2 Gb of memory, a run with <code>-nt 4</code> will use 8 Gb of memory. In contrast, CPU threads share the memory allocated to their "mother" data thread, so you don't need to worry about allocating memory based on the number of CPU threads you use.</p>
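<p>The memory arithmetic above can be made concrete with a tiny sketch (the helper name is ours; it simply restates the rule, it is not a GATK utility):</p>

```python
def total_memory_gb(per_run_gb, nt=1, nct=1):
    """Total memory needed for a multi-threaded GATK run: each data
    thread (-nt) needs a full copy of the single-run memory, while CPU
    threads (-nct) share their data thread's allocation."""
    # nct is accepted only to show it does not change the requirement.
    return per_run_gb * nt

# A tool needing 2 Gb run with -nt 4 needs 8 Gb, whatever -nct is.
```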
<h4>Additional consideration when using <code>-nct</code> with versions 2.2 and 2.3</h4>
<p>Because of the way the <code>-nct</code> option was originally implemented, in versions 2.2 and 2.3 there is one CPU thread that is reserved by the system to "manage" the rest. So if you use <code>-nct</code>, you'll only really start seeing a speedup with <code>-nct 3</code> (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation available in versions 2.4 and up.</p>
<h3>Scatter-gather</h3>
<p>For more details on scatter-gather, see the <a href="http://gatkforums.broadinstitute.org/discussion/1988/a-primer-on-parallelism-with-the-gatk">primer on parallelism for the GATK</a> and the documentation on <a href="https://software.broadinstitute.org/gatk/documentation/pipelines">pipelining options</a>.</p>
<h3>Applicability of parallelism to the major GATK tools</h3>
<p>Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.</p>
<table class="table table-striped">
<thead>
<tr><th style="text-align: left;">Tool</th><th style="text-align: left;">Full name</th><th style="text-align: left;">Type of traversal</th><th style="text-align: center;">NT</th><th style="text-align: center;">NCT</th><th style="text-align: center;">SG</th></tr>
</thead>
<tbody>
<tr><td style="text-align: left;">RTC</td><td style="text-align: left;">RealignerTargetCreator</td><td style="text-align: left;">RodWalker</td><td style="text-align: center;">+</td><td style="text-align: center;">-</td><td style="text-align: center;">-</td></tr>
<tr><td style="text-align: left;">IR</td><td style="text-align: left;">IndelRealigner</td><td style="text-align: left;">ReadWalker</td><td style="text-align: center;">-</td><td style="text-align: center;">-</td><td style="text-align: center;">+</td></tr>
<tr><td style="text-align: left;">BR</td><td style="text-align: left;">BaseRecalibrator</td><td style="text-align: left;">LocusWalker</td><td style="text-align: center;">-</td><td style="text-align: center;">+</td><td style="text-align: center;">+</td></tr>
<tr><td style="text-align: left;">PR</td><td style="text-align: left;">PrintReads</td><td style="text-align: left;">ReadWalker</td><td style="text-align: center;">-</td><td style="text-align: center;">+</td><td style="text-align: center;">-</td></tr>
<tr><td style="text-align: left;">RR</td><td style="text-align: left;">ReduceReads</td><td style="text-align: left;">ReadWalker</td><td style="text-align: center;">-</td><td style="text-align: center;">-</td><td style="text-align: center;">+</td></tr>
<tr><td style="text-align: left;">HC</td><td style="text-align: left;">HaplotypeCaller</td><td style="text-align: left;">ActiveRegionWalker</td><td style="text-align: center;">-</td><td style="text-align: center;">(+)</td><td style="text-align: center;">+</td></tr>
<tr><td style="text-align: left;">UG</td><td style="text-align: left;">UnifiedGenotyper</td><td style="text-align: left;">LocusWalker</td><td style="text-align: center;">+</td><td style="text-align: center;">+</td><td style="text-align: center;">+</td></tr>
</tbody>
</table>
<p>Note that while HaplotypeCaller supports <code>-nct</code> in principle, many users have reported that it is not very stable (random crashes may occur; when there is no crash, results are correct). We prefer not to use this option with HC; use it at your own risk. </p>
<h3>Recommended configurations</h3>
<p>The table below summarizes configurations that we typically use for our own projects (one per tool, except that we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and we cannot guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you. </p>
<table class="table table-striped">
<thead>
<tr><th style="text-align: left;">Tool</th><th style="text-align: center;">RTC</th><th style="text-align: center;">IR</th><th style="text-align: center;">BR</th><th style="text-align: center;">PR</th><th style="text-align: center;">RR</th><th style="text-align: center;">HC</th><th style="text-align: center;">UG</th></tr>
</thead>
<tbody>
<tr><td style="text-align: left;">Available modes</td><td style="text-align: center;">NT</td><td style="text-align: center;">SG</td><td style="text-align: center;">NCT,SG</td><td style="text-align: center;">NCT</td><td style="text-align: center;">SG</td><td style="text-align: center;">NCT,SG</td><td style="text-align: center;">NT,NCT,SG</td></tr>
<tr><td style="text-align: left;">Cluster nodes</td><td style="text-align: center;">1</td><td style="text-align: center;">4</td><td style="text-align: center;">4</td><td style="text-align: center;">1</td><td style="text-align: center;">4</td><td style="text-align: center;">4</td><td style="text-align: center;">4 / 4 / 4</td></tr>
<tr><td style="text-align: left;">CPU threads (<code>-nct</code>)</td><td style="text-align: center;">1</td><td style="text-align: center;">1</td><td style="text-align: center;">8</td><td style="text-align: center;">4-8</td><td style="text-align: center;">1</td><td style="text-align: center;">4</td><td style="text-align: center;">3 / 6 / 24</td></tr>
<tr><td style="text-align: left;">Data threads (<code>-nt</code>)</td><td style="text-align: center;">24</td><td style="text-align: center;">1</td><td style="text-align: center;">1</td><td style="text-align: center;">1</td><td style="text-align: center;">1</td><td style="text-align: center;">1</td><td style="text-align: center;">8 / 4 / 1</td></tr>
<tr><td style="text-align: left;">Memory (Gb)</td><td style="text-align: center;">48</td><td style="text-align: center;">4</td><td style="text-align: center;">4</td><td style="text-align: center;">4</td><td style="text-align: center;">4</td><td style="text-align: center;">16</td><td style="text-align: center;">32 / 16 / 4</td></tr>
</tbody>
</table>
<p>Here NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue or another data parallelization framework. For more details on scatter-gather, see the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1988">primer on parallelism for the GATK</a> and the documentation on <a href="https://software.broadinstitute.org/gatk/documentation/pipelines">pipelining options</a>.</p>
## How do I submit a detailed bug report?

http://gatkforums.broadinstitute.org/gatk/discussion/1894/how-do-i-submit-a-detailed-bug-report

<p><strong><em>Note: only do this if you have been explicitly asked to do so.</em></strong></p>
<h3>Scenario:</h3>
<p>You posted a question about a problem you had with GATK tools, we answered that we think it's a bug, and we asked you to submit a detailed bug report. </p>
<h3>Here's what you need to provide:</h3>
<ul>
<li>The exact command line that you used when you had the problem (in a text file)</li>
<li>The full log output (program output in the console) from the start of the run to the end or error message (in a text file)</li>
<li>A snippet of the BAM file if applicable, along with the index (.bai) file associated with it</li>
<li>If a non-standard reference (i.e. one not available in our resource bundle) was used, the .fasta, .fai and .dict files for the reference</li>
<li>Any other relevant files, such as recalibration plots</li>
</ul>
<p>A snippet file is a slice of the original BAM file which contains the problematic region and is sufficient to reproduce the error. We need it in order to reproduce the problem on our end, which is the first necessary step to finding and fixing the bug. We ask you to provide this as a snippet rather than the full file so that you don't have to upload (and we don't have to process) huge giga-scale files. </p>
<h3>Here's how you create a snippet file:</h3>
<ul>
<li>Look at the error message and see if it cites a specific position where the error occurred</li>
<li>If not, identify the region that caused the problem by running with the <code>-L</code> argument and progressively narrowing down the interval</li>
<li>Once you have the region, use PrintReads with <code>-L</code> to write the problematic region (with 500 bp padding on either side) to a new file -- this is your snippet file</li>
<li>Test your command line on this snippet file to make sure you can still reproduce the error on it</li>
</ul>
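<p>The interval-narrowing step is just a bisection over the genomic coordinates. A toy Python sketch (entirely ours; <code>fails</code> stands in for re-running the failing command with a given <code>-L</code> interval):</p>

```python
def narrow_interval(start, end, fails):
    """Repeatedly split [start, end] and keep a half that still
    reproduces the failure; `fails(lo, hi)` stands in for re-running
    the tool restricted to -L lo-hi and checking for the error."""
    while end - start > 1:
        mid = (start + end) // 2
        if fails(start, mid):
            end = mid
        elif fails(mid + 1, end):
            start = mid + 1
        else:
            break  # the failure needs the whole interval; stop here
    return start, end
```

In practice each `fails` call is a full tool run, so a handful of halvings is usually enough to get the interval down to something snippet-sized.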
<h3>And finally, here's how you send us the files:</h3>
<ul>
<li>Put all those files into a <code>.zip</code> or <code>.tar.gz</code> archive</li>
<li>
<p>Upload them onto our FTP server with the following credentials:</p>
<pre><code>location: ftp.broadinstitute.org
username: gsapubftp
password: 5WvQWSfi</code></pre>
</li>
<li>Post in the original discussion thread that you have done this</li>
<li>Be sure to tell us the name of your archive file!</li>
</ul>
<h3>We will get back to you -- hopefully with a bug fix! -- as soon as we can.</h3>
## How does the GATK handle these huge NGS datasets?

http://gatkforums.broadinstitute.org/gatk/discussion/1320/how-does-the-gatk-handle-these-huge-ngs-datasets

<p>Imagine a simple question like, "What's the depth of coverage at position A of the genome?" </p>
<p>First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.</p>
<p>Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.</p>
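<p>The benefit of reference-ordered data can be sketched with a toy example: once read start positions are sorted, binary search finds the reads spanning a position without touching anything else. This simplified Python sketch (ours, assuming all reads have the same fixed length) shows the idea:</p>

```python
from bisect import bisect_left, bisect_right

def reads_spanning(sorted_starts, read_length, pos):
    """Indices of reads whose interval [start, start + read_length - 1]
    contains pos, given start positions sorted in reference order.
    Two binary searches replace a scan over every read."""
    lo = bisect_left(sorted_starts, pos - read_length + 1)
    hi = bisect_right(sorted_starts, pos)
    return list(range(lo, hi))
```

Real BAM indexing is more sophisticated (reads vary in length and are binned), but the principle is the same: sorted data turns a linear scan into a seek.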
<p>This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. It is such a fundamental rule of working with the GATK that we call it the Central Dogma of the GATK:</p>
<h4>All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the <a href="http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept">canonical reference sequences</a>.</h4>
## How should I cite GATK in my own publications?

http://gatkforums.broadinstitute.org/gatk/discussion/6201/how-should-i-cite-gatk-in-my-own-publications

<p>To date we have published three papers on GATK (citation details below). The ideal way to cite the GATK is to cite all three together, as in:</p>
<blockquote>
<p>We sequenced 10 samples on 10 lanes on an Illumina HiSeq 2000, aligned the resulting reads to the hg19 reference genome with BWA (Li & Durbin), applied GATK <strong>(McKenna <em>et al.</em>, 2010)</strong> base quality score recalibration, indel realignment, duplicate removal, and performed SNP and INDEL discovery and genotyping across all 10 samples simultaneously using standard hard filtering parameters or variant quality score recalibration according to GATK Best Practices recommendations <strong>(DePristo <em>et al.</em>, 2011; Van der Auwera <em>et al.</em>, 2013)</strong>.</p>
</blockquote>
<hr />
<h3>McKenna <em>et al.</em> 2010 : Original description of the GATK framework</h3>
<p>The first GATK paper covers the computational philosophy underlying the GATK and is a good citation for the GATK in general.</p>
<p><strong>The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data</strong> McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 <em>GENOME RESEARCH 20:1297-303</em> </p>
<p><a href="http://dx.doi.org/10.1101/gr.107524.110">Article</a> | <a href="http://www.ncbi.nlm.nih.gov/pubmed?term=20644199">PubMed</a></p>
<hr />
<h3>DePristo <em>et al.</em> 2011 : First incarnation of the Best Practices workflow</h3>
<p>The second GATK paper describes in more detail some of the key tools commonly used in the GATK for high-throughput sequencing data processing and variant discovery. The paper covers base quality score recalibration, indel realignment, SNP calling with UnifiedGenotyper, variant quality score recalibration, and their application to deep whole genome, whole exome, and low-pass multi-sample calling. This is a good citation if you use the GATK for variant discovery. </p>
<p><strong>A framework for variation discovery and genotyping using next-generation DNA sequencing data</strong> DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M, 2011 <em>NATURE GENETICS 43:491-498</em> </p>
<p><a href="http://dx.doi.org/10.1038/ng.806">Article</a> | <a href="http://www.ncbi.nlm.nih.gov/pubmed?term=21478889">PubMed</a></p>
<p>Note that the workflow described in this paper corresponds to the version 1.x to 2.x best practices. Some key steps for variant discovery have been significantly modified in later versions (3.x onwards), so this paper should not be used as a definitive guide to variant discovery with GATK. For that, please see our online documentation guide.</p>
<hr />
<h3>Van der Auwera <em>et al.</em> 2013 : Hands-on tutorial with step-by-step explanations</h3>
<p>The third GATK paper describes the Best Practices for Variant Discovery (version 2.x). It is intended mainly as a learning resource for first-time users and as a protocol reference. This is a good citation to include in a Materials and Methods section. </p>
<p><strong>From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline</strong> Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M, 2013 <em>CURRENT PROTOCOLS IN BIOINFORMATICS 43:11.10.1-11.10.33</em> </p>
<p><a href="http://dx.doi.org/10.1002/0471250953.bi1110s43">Article</a> | <a href="http://www.ncbi.nlm.nih.gov/pubmed/?term=25431634">PubMed</a></p>
<p>Remember that as our work continues and our Best Practices recommendations evolve, specific command lines, argument values and even tool choices described in the paper become obsolete. Be sure to always refer to our Best Practices documentation for the most up-to-date and version-appropriate recommendations.</p>
## How should I pre-process data from multiplexed sequencing and multi-library designs?

http://gatkforums.broadinstitute.org/gatk/discussion/3060/how-should-i-pre-process-data-from-multiplexed-sequencing-and-multi-library-designs

<p>Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase. </p>
<p>However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample, all of which should have distinct <a href="https://www.broadinstitute.org/gatk/guide/article?id=6472">read group</a> IDs (RGID). </p>
<p>At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all the read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and <a href="https://www.broadinstitute.org/gatk/events/slides/1506/GATKwr8-A-3-GATK_Best_Practices_and_Broad_pipelines.pdf">this presentation</a> for more details.</p>
<p><em>Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and that each of them was sequenced on one or more lanes.</em></p>
<h3>Example</h3>
<p>Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, <em>sampleA</em> and <em>sampleB</em>, which were each sequenced on two lanes, <em>lane1</em> and <em>lane2</em>:</p>
<ul>
<li>sampleA_lane1.fq</li>
<li>sampleA_lane2.fq</li>
<li>sampleB_lane1.fq</li>
<li>sampleB_lane2.fq</li>
</ul>
<p>These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (e.g. sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21, and so on).</p>
|
||||
<h4>1. Run initial steps per-readgroup once</h4>
|
||||
<p>Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a <a href="https://www.broadinstitute.org/gatk/guide/article?id=6472">read group</a>), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the <a href="https://www.broadinstitute.org/gatk/guide/article?id=6472">read groups</a> dictionary entry for guidance. </p>
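<p>For a single read group, the mapping and sorting commands might look something like the following sketch, using BWA and Picard. The reference file name and the read group field values are illustrative placeholders that you would adapt to your own data:</p>

<pre><code class="pre_md"># Map one read group's interleaved FASTQ, assigning read group info at mapping time
bwa mem -M -p -R '@RG\tID:A1\tSM:sampleA\tPL:illumina\tLB:lib1\tPU:lane1' \
    ref.fasta sampleA_lane1.fq > sampleA_rgA1.sam

# Convert to BAM and coordinate-sort
java -jar picard.jar SortSam \
    INPUT=sampleA_rgA1.sam \
    OUTPUT=sampleA_rgA1.bam \
    SORT_ORDER=coordinate</code class="pre_md"></pre>

<p>Repeat this for each read group (A2, B1, B2), keeping the RGID distinct each time.</p>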
<p>The example data becomes:</p>

<ul>

<li>sampleA_rgA1.bam</li>

<li>sampleA_rgA2.bam</li>

<li>sampleB_rgB1.bam</li>

<li>sampleB_rgB2.bam</li>

</ul>

<p>At this point we mark duplicates in each read group BAM file (dedup), which allows us to estimate the complexity of the corresponding library of origin as a quality control step. This step is optional. </p>

<p>The example data becomes:</p>

<ul>

<li>sampleA_rgA1.dedup.bam</li>

<li>sampleA_rgA2.dedup.bam</li>

<li>sampleB_rgB1.dedup.bam</li>

<li>sampleB_rgB2.dedup.bam</li>

</ul>

<p>Technically this first run of marking duplicates is not necessary, because we will run it again per-sample, and that per-sample marking would be enough to achieve the desired result. To reiterate, we only do this round of marking duplicates for QC purposes. </p>

<h4>2. Merge read groups and mark duplicates per sample (aggregation + dedup)</h4>

<p>Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, but for the sake of efficiency we combine this with the per-readgroup duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command). </p>
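<p>Continuing the example, merging and marking duplicates for sampleA can be done in one Picard MarkDuplicates command by passing both read group BAMs as inputs. This is a sketch, with file names taken from the example above:</p>

<pre><code class="pre_md"># Merge + dedup: multiple INPUT arguments produce a single merged, duplicate-marked BAM
java -jar picard.jar MarkDuplicates \
    INPUT=sampleA_rgA1.bam \
    INPUT=sampleA_rgA2.bam \
    OUTPUT=sampleA.merged.dedup.bam \
    METRICS_FILE=sampleA.dedup.metrics</code class="pre_md"></pre>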
<p>The example data becomes:</p>

<ul>

<li>sampleA.merged.dedup.bam</li>

<li>sampleB.merged.dedup.bam</li>

</ul>

<p>To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes, in addition to optical duplicates (which are by definition only per-lane). </p>

<h4>3. Remaining per-sample pre-processing</h4>

<p>Then you run indel realignment (optional) and base recalibration (BQSR). </p>

<p>The example data becomes:</p>

<ul>

<li>sampleA.merged.dedup.(realn).recal.bam</li>

<li>sampleB.merged.dedup.(realn).recal.bam</li>

</ul>

<p>Realigning around indels per-sample leads to consistent alignments across all lanes within a sample. This step is only necessary if you will be using a locus-based variant caller like MuTect 1 or UnifiedGenotyper (for legacy reasons). If you will be using HaplotypeCaller or MuTect2, you do not need to perform indel realignment. </p>

<p>Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly, of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology). </p>
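<p>As a sketch using GATK 3.x syntax, the per-sample realignment and recalibration steps might look like this. The reference and known-sites file names (indels.vcf, dbsnp.vcf) are placeholders for the resources appropriate to your genome build:</p>

<pre><code class="pre_md"># Indel realignment (optional; only for locus-based callers)
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta \
    -I sampleA.merged.dedup.bam -known indels.vcf -o sampleA.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta \
    -I sampleA.merged.dedup.bam -targetIntervals sampleA.intervals \
    -known indels.vcf -o sampleA.merged.dedup.realn.bam

# Base recalibration (BQSR); applied per-read group automatically
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta \
    -I sampleA.merged.dedup.realn.bam -knownSites dbsnp.vcf -o sampleA.recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta \
    -I sampleA.merged.dedup.realn.bam -BQSR sampleA.recal.table \
    -o sampleA.merged.dedup.realn.recal.bam</code class="pre_md"></pre>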
<p><em>People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normal pairs -- we do recommend realigning the tumor and the normal together in general, to avoid slight alignment differences between the two tissue types.</em></p>

@ -0,0 +1,11 @@
## How should I select samples for a Panel of Normals for somatic analysis?

http://gatkforums.broadinstitute.org/gatk/discussion/7366/how-should-i-select-samples-for-a-panel-of-normals-for-somatic-analysis

<p>The Panel of Normals (PoN) plays two important roles in somatic variant analysis: </p>

<ol>

<li>Exclude germline variant sites that are found in the normals, to avoid calling them as potential somatic variants in the tumor;</li>

<li>Exclude technical artifacts that arise from particular techniques (e.g. sample preservation) and technologies (e.g. library capture, sequencing chemistry).</li>

</ol>

<p>Given these roles, the most important selection criteria are the technical properties of how the normal data was generated. It's very important to use normals that are as technically similar as possible to the tumor. Also, the samples should come from subjects who were young and healthy, to minimize the chance of using as a normal a sample from someone with an undiagnosed tumor.</p>

<p>If possible, it is better to use normals generated from the same type of tissue, because if the tissues were preserved differently, the artifact patterns may differ. </p>

@ -0,0 +1,45 @@
## I'm new to GATK. Where do I start?

http://gatkforums.broadinstitute.org/gatk/discussion/4863/im-new-to-gatk-where-do-i-start

<p>If this is your first rodeo, you're probably asking yourself:</p>

<ul>

<li>
<p><strong>What can GATK do for me?</strong>
Identify variants in a bunch of sample sequences, with great sensitivity and specificity.</p>
</li>

<li>
<p><strong>How do I get GATK to do that?</strong>
You run the recommended <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices</a> steps, one by one, from start to finish, as described in the <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices documentation</a>.</p>
</li>

<li>
<p><strong>No but really, how do I know what to do?</strong>
For each step in the <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices</a>, there is a tutorial that details how to run the tools involved, with example commands. The idea is to daisy-chain all those tutorials, in the order that they're referenced in the <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices</a> doc, into a pipeline.</p>
</li>

<li>
<p><strong>Oh, you mean I can just copy/paste all the tutorial commands as they are?</strong>
Not quite, because there are a few things that need to be tweaked. For example, the tutorials use the <code>-L/--intervals</code> argument to restrict analysis for demo purposes, but depending on your data and experimental design, you may need to remove it (e.g. for WGS) or adapt it (e.g. for WEx). Hopefully it's explained clearly enough in the tutorials.</p>
</li>

<li>
<p><strong>Why don't you just provide one script that runs all the tools?</strong>
It's really hard to build and maintain a one-size-fits-all pipeline solution. Really really hard. And not nearly as much fun as developing new analysis methods. We do provide a pipelining program called Queue that has the advantage of understanding GATK argument syntax natively, but you still have to actually write scripts yourself in Scala to use it. Sorry. Maybe one day we will be able to offer GATK analysis on the Cloud. But not today. </p>
</li>

<li>
<p><strong>What if I want to know what a command line argument does or change a parameter?</strong>
First, check out the <a href="https://www.broadinstitute.org/gatk/guide/article?id=4669">basic GATK command syntax FAQ</a> if it's your first time using GATK, then consult the relevant <a href="https://www.broadinstitute.org/gatk/guide/tooldocs/index">Tool Documentation</a> page. Keep in mind that some arguments are "engine parameters" that are shared by many tools, and are listed in a <a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php">separate document</a>. Also, you can always use the search box to find an argument description really quickly. </p>
</li>

<li>
<p><strong>The documentation seems chaotic. Is there any logic to how it's organized?</strong>
Sort of. (And, ouch. Tough crowd.) The main category names should be obvious enough (if not, see the "Documentation Categories" tab). Within categories, everything is just in alphabetical order. In the future, we're going to try to provide more use-case based structure, but for now this is what we have. The best way to find practical information is to either go from the <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices</a> doc (which provides links to all FAQs, method articles and tutorials directly related to a given step), or use the search box and search-by-tag functions (see the "Search" tab). Be sure to also check out the <a href="https://www.broadinstitute.org/gatk/guide/presentations">Presentations section</a>, which provides workshop materials and videos that explain a lot of the motivation and methods behind the Best Practices. </p>
</li>

<li>
<p><strong>Does GATK include other tools besides the ones in the Best Practices?</strong>
Oh sure, there's a whole bunch of them, all listed in the <a href="https://www.broadinstitute.org/gatk/guide/tooldocs/index">Tool Documentation</a> section, categorized by type of analysis. But be aware that anything that's not part of the <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices</a> is most likely either a tool that was written for a one-off analysis years ago, an experimental feature that we're still not sure is actually useful, or an accessory utility that can be used in many different ways and takes expert inside knowledge to use properly. All these may be buggy, insufficiently documented, or both. We provide support for them as well as humanly possible, but ultimately, you use them at your own risk. </p>
</li>

<li>
<p><strong>Why do the answers to these questions keep getting longer and longer?</strong>
I don't know what you're talking about. </p>
</li>

<li><strong>What else should I know before I start?</strong>
You should probably browse the titles of the <a href="https://www.broadinstitute.org/gatk/guide/topic?name=faqs">Frequently Asked Questions</a> -- there will be at least a handful you'll want to read, but it's hard for us to predict which ones.</li>

</ul>

@ -0,0 +1,18 @@
## Lane, Library, Sample and Cohort -- what do they mean and why are they important?

http://gatkforums.broadinstitute.org/gatk/discussion/3059/lane-library-sample-and-cohort-what-do-they-mean-and-why-are-they-important

<p>There are four major organizational units for next-generation DNA sequencing processes that are used throughout the GATK documentation:</p>

<ul>

<li>
<p><strong>Lane:</strong> The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane. </p>
</li>

<li>
<p><strong>Library:</strong> A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots of the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.</p>
</li>

<li>
<p><strong>Sample:</strong> A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.</p>
</li>

<li><strong>Cohort:</strong> A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individuals in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.</li>

</ul>

<p>Note that many GATK commands can be run at the lane level, but will give better results when they see all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand the best way to group or partition your data for that particular process.</p>

@ -0,0 +1,31 @@
## Should I analyze my samples alone or together?

http://gatkforums.broadinstitute.org/gatk/discussion/4150/should-i-analyze-my-samples-alone-or-together

<h3>Together is (almost always) better than alone</h3>

<p>We recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our <a href="https://www.broadinstitute.org/gatk/guide/best-practices">Best Practices</a> workflow. That workflow includes a joint analysis step that empowers variant discovery by providing the ability to leverage population-wide information from a cohort of multiple samples, allowing us to detect variants with great sensitivity and genotype samples as accurately as possible. Our workflow recommendations provide a way to do this that is scalable and allows incremental processing of the sequencing data. </p>

<p>The key point is that you don’t actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that allows us to decouple the initial identification of potential variant sites (i.e. variant calling) from the genotyping step, which is the only part that really needs to be done jointly. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per-sample in <code>-ERC GVCF</code> mode, followed by a joint genotyping step on all samples in the cohort, as described in <a href="http://www.broadinstitute.org/gatk/guide/article?id=3893">this method article</a>. This achieves what we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.</p>
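<p>The decoupled workflow boils down to two commands per cohort. This is a sketch using GATK 3.x syntax; the reference and sample file names are placeholders:</p>

<pre><code class="pre_md"># Step 1: call variants per-sample in GVCF mode (repeat for each sample)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta \
    -I sample1.bam -ERC GVCF -o sample1.g.vcf

# Step 2: joint genotyping across all per-sample GVCFs
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta \
    -V sample1.g.vcf -V sample2.g.vcf -o cohort.vcf</code class="pre_md"></pre>

<p>When a new sample arrives, you only need to run Step 1 on that sample and then re-run the comparatively cheap Step 2, which is what makes the analysis incremental.</p>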
<p><strong>Why "almost always"?</strong> Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method. For most studies this is an acceptable tradeoff (one that is further reduced by high-quality sequencing data), but if you are very specifically looking for singletons, you may need to do some careful evaluation before committing to this method.</p>

<hr />

<h3>Previously established cohort analysis strategies</h3>

<p>Until recently, three strategies were available for variant discovery in multiple samples:</p>

<p><strong>- single sample calling:</strong> sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;<br />
<strong>- batch calling:</strong> sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;<br />
<strong>- joint calling:</strong> variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort. </p>

<p>The best of these, from the point of view of variant discovery, was joint calling, because it provided the following benefits: </p>

<h4>1. Clearer distinction between homozygous reference sites and sites with missing data</h4>

<p>Batch-calling does not output a genotype call at sites where no member in the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.</p>

<h4>2. Greater sensitivity for low-frequency variants</h4>

<p>By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However this does not apply to singletons, which are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size -- so that singletons themselves have less chance of happening in the first place.</p>

<h4>3. Greater ability to filter out false positives</h4>

<p>The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.</p>

<p><a href="https://us.v-cdn.net/5019796/uploads/FileUpload/40/3d322e97441f1918626854d56c2574.png"><img src="https://us.v-cdn.net/5019796/uploads/FileUpload/40/3d322e97441f1918626854d56c2574.png" /></a></p>

<p><strong>Figure 1:</strong> <em>(left) Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.</em></p>

<hr />

<h3>Drawbacks of traditional joint calling (all steps performed multi-sample)</h3>

<p>There are two major problems with the joint calling strategy. </p>

<p><strong>- Scaling &amp; infrastructure</strong><br />
Joint calling scales very badly -- the calculations involved in variant calling (especially those used by the HaplotypeCaller) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at Broad, where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.</p>

<p><strong>- The N+1 problem</strong><br />
When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether it’s because the analysis is time-sensitive or your PI is breathing down your back). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately, the joint calling approach doesn’t allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.</p>

<h4>Both of these problems are solved by the single-sample calling + joint genotyping workflow.</h4>

@ -0,0 +1,14 @@
## Should I use UnifiedGenotyper or HaplotypeCaller to call variants on my data?

http://gatkforums.broadinstitute.org/gatk/discussion/3151/should-i-use-unifiedgenotyper-or-haplotypecaller-to-call-variants-on-my-data

<p><strong>Use HaplotypeCaller!</strong></p>

<p>The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also offers several unique capabilities, such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data. </p>
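<p>As a minimal sketch, a basic HaplotypeCaller run in GATK 3.x looks like this (the reference, input and output file names are placeholders):</p>

<pre><code class="pre_md"># Call SNPs and indels on one sample with HaplotypeCaller
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta \
    -I sample1.bam -o sample1.variants.vcf</code class="pre_md"></pre>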
<p><strong>As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.</strong></p>

<p><em>Caveats for older versions</em></p>

<p>If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:</p>

<ul>

<li>If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot) </li>

<li>If you are working with pooled samples (also due to the HC’s limitation regarding ploidy) </li>

<li>If you want to analyze more than 100 samples at a time (for performance reasons; this applies to versions 2.x) </li>

</ul>

@ -0,0 +1,49 @@
## What's in the resource bundle and how can I get it?

http://gatkforums.broadinstitute.org/gatk/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it

<p><strong>NOTE: we recently made some changes to the bundle on the FTP server; see the <a href="https://software.broadinstitute.org/gatk/download/bundle">Resource Bundle</a> page for details. In a nutshell: minor directory structure changes, and the Hg38 bundle now mirrors the cloud version.</strong></p>

<hr />

<h3>1. Accessing the bundle</h3>

<p>See the <a href="https://software.broadinstitute.org/gatk/download/bundle">Resource Bundle</a> page. In a nutshell, there's a Google Cloud bucket and an FTP server. The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too. </p>
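<p>For example, the FTP server can be accessed anonymously with the <code>gsapubftp-anonymous</code> username; a download of the b37 reference might look like the sketch below. The exact directory layout and file names depend on the current bundle version, so treat the path as illustrative and browse the server to confirm:</p>

<pre><code class="pre_md"># Anonymous FTP download of a b37 bundle file (path is illustrative)
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/human_g1k_v37.fasta.gz</code class="pre_md"></pre>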
<hr />

<h3>2. GRCh38/Hg38 Resources: the soon-to-be Standard Set</h3>

<p>This contains all the resource files needed for Best Practices short variant discovery in whole-genome sequencing data (WGS). Exome files and an itemized resource list are coming soon(ish). </p>

<hr />

<h4>All resources below this point are available only on the FTP server, not on the cloud.</h4>

<hr />

<h3>3. b37 Resources: the Standard Data Set pending completion of the Hg38 bundle</h3>

<ul>

<li>Reference sequence (standard 1000 Genomes fasta) along with fai and dict files</li>

<li>dbSNP in VCF. This includes two files:

<ul>

<li>A recent dbSNP release (build 138)</li>

<li>The same file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.</li>

</ul></li>

<li>HapMap genotypes and sites VCFs</li>

<li>OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites-only VCF </li>

<li>The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:

<ul>

<li>1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)</li>

<li>Mills_and_1000G_gold_standard.indels.b37.sites.vcf</li>

</ul></li>

<li>The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf</li>

<li>A large-scale standard single sample BAM file for testing:

<ul>

<li>NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20</li>

<li>A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.</li>

</ul></li>

<li>The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)</li>

</ul>

<p>Additionally, these files all have supplementary indices, statistics, and other QC data available.</p>

<hr />

<h3>4. hg19 Resources: lifted over from b37</h3>

<p>Includes the UCSC-style hg19 reference along with all lifted-over VCF files.</p>

<hr />

<h3>5. hg18 Resources: lifted over from b37</h3>

<p>Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.</p>

<p>Also includes a chain file to lift over to b37.</p>

<hr />

<h3>6. b36 Resources: lifted over from b37</h3>

<p>Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.</p>

<p>Also includes a chain file to lift over to b37.</p>

@ -0,0 +1,12 @@
## What are the prerequisites for running GATK?

http://gatkforums.broadinstitute.org/gatk/discussion/1852/what-are-the-prerequisites-for-running-gatk

<h3>1. Operating system</h3>

<p>The GATK runs natively on most if not all flavors of UNIX, including MacOSX, Linux and BSD. It is possible to get it running on Windows using Cygwin, but we don't provide any support or instructions for that.</p>

<h3>2. Java 7 / 1.7</h3>

<p>The GATK is a Java-based program, so you'll need to have Java installed on your machine. The Java version should be 1.7 (at this time we don't officially support 1.8, and 1.6 no longer works). You can check what version you have by typing <code>java -version</code> at the command line. <a href="http://www.broadinstitute.org/gatk/guide/article?id=1200">This article</a> has some more details about what to do if you don't have the right version. Note that at this time we only support the Sun/Oracle Java JDK; OpenJDK is not supported. </p>

<h3>3. R dependencies</h3>

<p>Some of the GATK tools produce plots using R, so if you want to get the plots you'll need to have R and Rscript installed, as well as several R libraries. Full details can be found in the <a href="http://www.broadinstitute.org/gatk/guide/article?id=2899">Tutorial on installing required software</a>.</p>

<h3>4. Familiarity with command-line programs</h3>

<p>The GATK does not have a Graphical User Interface (GUI). You don't open it by clicking on the <code>.jar</code> file; you have to use the Console (or Terminal) to input commands. If this is all new to you, we recommend you first learn about that and follow some <a href="http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything">online tutorials</a> before trying to use the GATK. It's not difficult, but you'll need to learn some jargon and get used to living without a mouse. Trust us, it's a liberating experience :)</p>

@ -0,0 +1,11 @@
## What do I need to do before attending a workshop hands-on session?

http://gatkforums.broadinstitute.org/gatk/discussion/4610/what-do-i-need-to-do-before-attending-a-workshop-hands-on-session

<p>So you're going to a GATK workshop, and you've been selected to participate in a hands-on session? Fantastic! We're looking forward to walking you through some exercises that will help you master the tools. However -- in order to make the best of the time we have together, we'd like to ask you to come prepared. Specifically, <em>if the workshop hosts are not providing machines and you have been asked to bring your own laptop</em>, please complete the following steps:</p>

<h4>- Download and install all necessary software as described in <a href="https://www.broadinstitute.org/gatk/guide/article?id=7098">this tutorial</a>.</h4>

<p>Note that if you are a Mac user, you may need to install Apple's Xcode Tools, which are free but fairly large, so plan ahead because it can take a loooong time to download them if your connection is anything less than super-fast.</p>

<h4>- Download the tutorial bundle from the link provided by the workshop organizers.</h4>

<p>This will typically be provided by email two to three weeks before the date of the workshop. </p>

<p>At the start of the session, we'll give you handouts with a walkthrough of the session so you can follow along and take notes (highly recommended!). </p>

<p>With that, you should be all set. See you soon!</p>

|
|||
## What do the VariantEval modules do?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/2361/what-do-the-varianteval-modules-do
|
||||
|
||||
<p>VariantEval accepts two types of modules: stratification and evaluation modules.</p>
|
||||
<ul>
|
||||
<li>Stratification modules will stratify (group) the variants based on certain properties. </li>
|
||||
<li>Evaluation modules will compute certain metrics for the variants</li>
|
||||
</ul>
|
||||
<h3>CpG</h3>
|
||||
<p>CpG is a three-state stratification:</p>
|
||||
<ul>
|
||||
<li>The locus is a CpG site ("CpG")</li>
|
||||
<li>The locus is not a CpG site ("non_CpG")</li>
|
||||
<li>The locus is either a CpG or not a CpG site ("all")</li>
|
||||
</ul>
|
||||
<p>A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.</p>
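<p>The definition above can be sketched as a short Python check (illustrative only, not GATK code; positions are 0-based here for simplicity):</p>

```python
def is_cpg_site(reference, pos):
    """A site is CpG when the reference base is C and the adjacent
    base in the 3' direction is G."""
    return (
        pos + 1 < len(reference)
        and reference[pos] == "C"
        and reference[pos + 1] == "G"
    )

print(is_cpg_site("ACGT", 1))  # True: the C at index 1 is followed by G
print(is_cpg_site("ACGT", 2))  # False: the base at index 2 is G, not C
```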
|
||||
<h3>EvalRod</h3>
|
||||
<p>EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.</p>
|
||||
<h3>Sample</h3>
|
||||
<p>Sample is an N-state stratification, where N is the number of samples in the eval files.</p>
|
||||
<h3>Filter</h3>
|
||||
<p>Filter is a three-state stratification:</p>
|
||||
<ul>
|
||||
<li>The locus passes QC filters ("called")</li>
|
||||
<li>The locus fails QC filters ("filtered")</li>
|
||||
<li>The locus either passes or fails QC filters ("raw")</li>
|
||||
</ul>
|
||||
<h3>FunctionalClass</h3>
|
||||
<p>FunctionalClass is a four-state stratification:</p>
|
||||
<ul>
|
||||
<li>The locus is a synonymous site ("silent")</li>
|
||||
<li>The locus is a missense site ("missense")</li>
|
||||
<li>The locus is a nonsense site ("nonsense")</li>
|
||||
<li>The locus is of any functional class ("any")</li>
|
||||
</ul>
|
||||
<h3>CompRod</h3>
|
||||
<p>CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.</p>
|
||||
<h3>Degeneracy</h3>
|
||||
<p>Degeneracy is a six-state stratification:</p>
|
||||
<ul>
|
||||
<li>The underlying base position in the codon is 1-fold degenerate ("1-fold")</li>
|
||||
<li>The underlying base position in the codon is 2-fold degenerate ("2-fold")</li>
|
||||
<li>The underlying base position in the codon is 3-fold degenerate ("3-fold")</li>
|
||||
<li>The underlying base position in the codon is 4-fold degenerate ("4-fold")</li>
|
||||
<li>The underlying base position in the codon is 6-fold degenerate ("6-fold")</li>
|
||||
<li>The underlying base position in the codon is degenerate at any level ("all")</li>
|
||||
</ul>
|
||||
<p>See the <a href="http://en.wikipedia.org/wiki/Genetic_code#Degeneracy">Wikipedia page on degeneracy</a> for more information.</p>
|
||||
<h3>JexlExpression</h3>
|
||||
<p>JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See the documentation article on using JEXL expressions.</p>
|
||||
<h3>Novelty</h3>
|
||||
<p>Novelty is a three-state stratification:</p>
|
||||
<ul>
|
||||
<li>The locus overlaps the knowns comp track (usually the dbSNP track) ("known")</li>
|
||||
<li>The locus does not overlap the knowns comp track ("novel")</li>
|
||||
<li>The locus either overlaps or does not overlap the knowns comp track ("all")</li>
|
||||
</ul>
|
||||
<h3>CountVariants</h3>
|
||||
<p>CountVariants is an evaluation module that computes the following metrics:</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align: left;">Metric</th>
|
||||
<th style="text-align: left;">Definition</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align: left;">nProcessedLoci</td>
|
||||
<td style="text-align: left;">Number of processed loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nCalledLoci</td>
|
||||
<td style="text-align: left;">Number of called loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nRefLoci</td>
|
||||
<td style="text-align: left;">Number of reference loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nVariantLoci</td>
|
||||
<td style="text-align: left;">Number of variant loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">variantRate</td>
|
||||
<td style="text-align: left;">Variants per loci rate</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">variantRatePerBp</td>
|
||||
<td style="text-align: left;">Number of variants per base</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nSNPs</td>
|
||||
<td style="text-align: left;">Number of SNP loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nInsertions</td>
|
||||
<td style="text-align: left;">Number of insertions</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nDeletions</td>
|
||||
<td style="text-align: left;">Number of deletions</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nComplex</td>
|
||||
<td style="text-align: left;">Number of complex loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nNoCalls</td>
|
||||
<td style="text-align: left;">Number of no-call loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nHets</td>
|
||||
<td style="text-align: left;">Number of het loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nHomRef</td>
|
||||
<td style="text-align: left;">Number of hom ref loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nHomVar</td>
|
||||
<td style="text-align: left;">Number of hom var loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nSingletons</td>
|
||||
<td style="text-align: left;">Number of singletons</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">heterozygosity</td>
|
||||
<td style="text-align: left;">Heterozygosity per locus rate</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">heterozygosityPerBp</td>
|
||||
<td style="text-align: left;">Heterozygosity per base pair</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">hetHomRatio</td>
|
||||
<td style="text-align: left;">Heterozygosity to homozygosity ratio</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">indelRate</td>
|
||||
<td style="text-align: left;">Indel rate (insertion count + deletion count)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">indelRatePerBp</td>
|
||||
<td style="text-align: left;">Indel rate per base pair</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">deletionInsertionRatio</td>
|
||||
<td style="text-align: left;">Deletion to insertion ratio</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
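<p>The classification underlying the nSNPs, nInsertions, nDeletions and nComplex counts above can be sketched for a biallelic record as a comparison of allele lengths (a Python illustration, not GATK code):</p>

```python
def classify_variant(ref, alt):
    """Crude classification of a biallelic ref/alt pair by allele length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"

print(classify_variant("T", "C"))   # SNP
print(classify_variant("T", "TA"))  # insertion
print(classify_variant("TA", "T"))  # deletion
```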
|
||||
<h3>CompOverlap</h3>
|
||||
<p>CompOverlap is an evaluation module that computes the following metrics:</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align: left;">Metric</th>
|
||||
<th style="text-align: left;">Definition</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align: left;">nEvalSNPs</td>
|
||||
<td style="text-align: left;">number of eval SNP sites</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nCompSNPs</td>
|
||||
<td style="text-align: left;">number of comp SNP sites</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">novelSites</td>
|
||||
<td style="text-align: left;">number of eval sites outside of comp sites</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nVariantsAtComp</td>
|
||||
<td style="text-align: left;">number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">compRate</td>
|
||||
<td style="text-align: left;">percentage of eval sites at comp sites</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nConcordant</td>
|
||||
<td style="text-align: left;">number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">concordantRate</td>
|
||||
<td style="text-align: left;">the concordance rate</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<h4>Understanding the output of CompOverlap</h4>
|
||||
<p>A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):</p>
|
||||
<pre><code class="pre_md"> $ grep -v '##' dbsnp.vcf
|
||||
#CHROM POS ID REF ALT QUAL FILTER INFO
|
||||
1 10327 rs112750067 T C . . ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132</code class="pre_md"></pre>
|
||||
<p>Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:</p>
|
||||
<pre><code class="pre_md"> $ grep -v '##' eval_correct_allele.vcf
|
||||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6
|
||||
1 10327 . T C 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059</code class="pre_md"></pre>
|
||||
<p>Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:</p>
|
||||
<pre><code class="pre_md"> $ grep -v '##' eval_incorrect_allele.vcf
|
||||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6
|
||||
1 10327 . T A 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059</code class="pre_md"></pre>
|
||||
<p>Running VariantEval with just the CompOverlap module:</p>
|
||||
<pre><code class="pre_md"> $ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
|
||||
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
|
||||
-L 1:10327 \
|
||||
-B:dbsnp,VCF dbsnp.vcf \
|
||||
-B:eval_correct_allele,VCF eval_correct_allele.vcf \
|
||||
-B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
|
||||
-noEV \
|
||||
-EV CompOverlap \
|
||||
-o eval.table</code class="pre_md"></pre>
|
||||
<p>We find that the eval.table file contains the following:</p>
|
||||
<pre><code class="pre_md"> $ grep -v '##' eval.table | column -t
|
||||
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
|
||||
CompOverlap dbsnp eval_correct_allele none all 1 1 0 1 100.00000000 1 100.00000000
|
||||
CompOverlap dbsnp eval_correct_allele none known 1 1 0 1 100.00000000 1 100.00000000
|
||||
CompOverlap dbsnp eval_correct_allele none novel 0 0 0 0 0.00000000 0 0.00000000
|
||||
CompOverlap dbsnp eval_incorrect_allele none all 1 1 0 1 100.00000000 0 0.00000000
|
||||
CompOverlap dbsnp eval_incorrect_allele none known 1 1 0 1 100.00000000 0 0.00000000
|
||||
CompOverlap dbsnp eval_incorrect_allele none novel 0 0 0 0 0.00000000 0 0.00000000</code class="pre_md"></pre>
|
||||
<p>As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.</p>
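<p>The distinction the example demonstrates can be sketched in Python (illustrative only, not GATK code): a site counts toward nVariantsAtComp when the locus matches a comp record, and toward nConcordant only when the alternate allele also matches.</p>

```python
def overlap_status(eval_rec, comp_rec):
    """Compare an eval record against a comp (e.g. dbSNP) record."""
    if (eval_rec["chrom"], eval_rec["pos"]) != (comp_rec["chrom"], comp_rec["pos"]):
        return "novel"
    # Same locus: concordant only if the alternate allele matches too.
    return "concordant" if eval_rec["alt"] == comp_rec["alt"] else "at_comp_only"

dbsnp = {"chrom": "1", "pos": 10327, "alt": "C"}
print(overlap_status({"chrom": "1", "pos": 10327, "alt": "C"}, dbsnp))  # concordant
print(overlap_status({"chrom": "1", "pos": 10327, "alt": "A"}, dbsnp))  # at_comp_only
```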
|
||||
<h3>TiTvVariantEvaluator</h3>
|
||||
<p>TiTvVariantEvaluator is an evaluation module that computes the following metrics:</p>
|
||||
<table class="table table-striped">
|
||||
<thead>
|
||||
<tr>
|
||||
<th style="text-align: left;">Metric</th>
|
||||
<th style="text-align: left;">Definition</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td style="text-align: left;">nTi</td>
|
||||
<td style="text-align: left;">number of transition loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nTv</td>
|
||||
<td style="text-align: left;">number of transversion loci</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">tiTvRatio</td>
|
||||
<td style="text-align: left;">the transition to transversion ratio</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nTiInComp</td>
|
||||
<td style="text-align: left;">number of comp transition sites</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">nTvInComp</td>
|
||||
<td style="text-align: left;">number of comp transversion sites</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: left;">TiTvRatioStandard</td>
|
||||
<td style="text-align: left;">the transition to transversion ratio for comp sites</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
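<p>The transition/transversion distinction behind the nTi and nTv counts is purely chemical: purine&lt;-&gt;purine or pyrimidine&lt;-&gt;pyrimidine substitutions are transitions, everything else is a transversion. A minimal Python sketch (not GATK code):</p>

```python
PURINES = {"A", "G"}  # pyrimidines are C and T

def titv_class(ref, alt):
    """Classify a SNP as a transition or transversion."""
    if (ref in PURINES) == (alt in PURINES):
        return "transition"   # both purines, or both pyrimidines
    return "transversion"     # purine <-> pyrimidine

print(titv_class("A", "G"))  # transition
print(titv_class("A", "T"))  # transversion
```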
|
||||
|
|
|
|||
## What input files does the GATK accept / require?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1204/what-input-files-does-the-gatk-accept-require
|
||||
|
||||
<p>All analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:</p>
|
||||
<ul>
|
||||
<li>Reference genome sequence</li>
|
||||
<li>Sequencing reads</li>
|
||||
<li>Intervals of interest</li>
|
||||
<li>Reference-ordered data</li>
|
||||
</ul>
|
||||
<p>This article describes the corresponding file formats that are acceptable for use with the GATK.</p>
|
||||
<hr />
|
||||
<h3>1. Reference Genome Sequence</h3>
|
||||
<p>The GATK requires the reference sequence in a single file in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. All the standard IUPAC bases are accepted, but keep in mind that non-standard bases (i.e. other than ACGT, such as W) will be ignored (i.e. those positions in the genome will be skipped). </p>
|
||||
<p><strong>Some users have reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.</strong> </p>
|
||||
<p>Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see <a href="http://www.broadinstitute.org/gatk/guide/article?id=1601">this article</a> for more information on preparing FASTA reference sequences for use with the GATK.</p>
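<p>A minimal sketch of how one might detect the Windows line-ending contamination described above (illustrative Python, not part of GATK; it simply flags lines containing carriage-return bytes):</p>

```python
def find_carriage_returns(fasta_text):
    """Return 1-based line numbers whose content contains a carriage return."""
    return [
        i
        for i, line in enumerate(fasta_text.split("\n"), start=1)
        if "\r" in line
    ]

# A file saved with Windows (CRLF) line endings:
print(find_carriage_returns(">chr1\r\nACGT\r\n"))  # [1, 2]
# A clean Unix (LF) file:
print(find_carriage_returns(">chr1\nACGT\n"))      # []
```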
|
||||
<h4>Important note about human genome reference versions</h4>
|
||||
<p>If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The names and order of the contigs in the reference you used must exactly match the canonical ordering of one of the official references. These orderings are defined by historical karyotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ..., 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though technically unnecessary, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.</p>
|
||||
<p><strong>Our Best Practice recommendation is that you use a standard GATK reference from the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1213">GATK resource bundle</a>.</strong></p>
|
||||
<hr />
|
||||
<h3>2. Sequencing Reads</h3>
|
||||
<p>The only input format for sequence reads that the GATK itself supports is the Sequence Alignment/Map (SAM) format. See the SAM specification for more details on the SAM/BAM format, as well as <a href="http://samtools.sourceforge.net/">Samtools</a> and <a href="http://picard.sourceforge.net/">Picard</a>, two complementary sets of utilities for working with SAM/BAM files.</p>
|
||||
<p>If you don't find the information you need in this section, please see our <a href="http://www.broadinstitute.org/gatk/guide/article?id=1317">FAQs on BAM files</a>.</p>
|
||||
<p>If you are starting out your pipeline with raw reads (typically in FASTQ format) you'll need to make sure that when you map those reads to the reference and produce a BAM file, the resulting BAM file is fully compliant with the GATK requirements. See the Best Practices documentation for detailed instructions on how to do this. </p>
|
||||
<p>In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:</p>
|
||||
<ul>
|
||||
<li>The file must be binary (with <code>.bam</code> file extension).</li>
|
||||
<li>The file must be indexed.</li>
|
||||
<li>The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).</li>
|
||||
<li>The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).</li>
|
||||
<li>Each read in the file must be associated with exactly one read group.</li>
|
||||
</ul>
|
||||
<p>Below is an example of a well-formed SAM header and alignment records (with the @SQ dictionary truncated to show only the first two chromosomes for brevity): </p>
|
||||
<pre><code class="pre_md">@HD VN:1.0 GO:none SO:coordinate
|
||||
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
|
||||
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
|
||||
@RG ID:ERR000162 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
|
||||
@RG ID:ERR000252 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
|
||||
@RG ID:ERR001684 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
|
||||
@RG ID:ERR001685 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
|
||||
@PG ID:GATK TableRecalibration VN:v2.2.16 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau
|
||||
t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b4eb71ee878d3706246b7c1dbef69299
|
||||
@PG ID:bwa VN:0.5.5
|
||||
ERR001685.4315085 16 1 9997 25 35M * 0 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@? XT:A:U XN:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001685 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
|
||||
ERR001689.1165834 117 1 9997 0 * = 9997 0 CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@ RG:Z:ERR001689 OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
|
||||
ERR001689.1165834 185 1 9997 25 35M = 9997 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT 758A:?>>8?=@@>>?;4<>=??@@==??@?==?8 XT:A:U XN:i:4 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001689 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
|
||||
ERR001688.2681347 117 1 9998 0 * = 9998 0 CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG 5@BA@A6B???A?B??>B@B??>B@B??>BAB??? RG:Z:ERR001688 OQ:Z:=>>>><4><<?><?????????????????????? </code class="pre_md"></pre>
|
||||
<h4>Note about fixing BAM files with alternative sortings</h4>
|
||||
<p>The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use <a href="http://picard.sourceforge.net/command-line-overview.shtml#ReorderSam">ReorderSam</a>. </p>
|
||||
<hr />
|
||||
<h3>3. Intervals of interest</h3>
|
||||
<p>The GATK accepts interval files for processing subsets of the genome in several different formats. Please see the <a href="http://www.broadinstitute.org/gatk/guide/article?id=1319">FAQs on interval lists</a> for details.</p>
|
||||
<hr />
|
||||
<h3>4. Reference Ordered Data (ROD) file formats</h3>
|
||||
<p>The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:</p>
|
||||
<pre><code class="pre_md">-argumentName:name,type file</code class="pre_md"></pre>
|
||||
<p>Where <code>name</code> is the name in the GATK tool (like "eval" in VariantEval), <code>type</code> is the type of the file, such as VCF or dbSNP, and <code>file</code> is the path to the file containing the ROD data.</p>
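<p>As an illustration only (the helper below is hypothetical and not part of GATK), the binding syntax can be composed by concatenating the three pieces:</p>

```python
def rod_binding(argument, name, rod_type, path):
    """Compose a '-argumentName:name,type file' ROD binding string.
    Purely illustrative: a real command line passes this as two tokens."""
    return "-{}:{},{} {}".format(argument, name, rod_type, path)

# e.g. binding a VCF as the "eval" track of VariantEval:
print(rod_binding("B", "eval", "VCF", "calls.vcf"))  # -B:eval,VCF calls.vcf
```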
|
||||
<p>The GATK supports several common file formats for reading ROD data:</p>
|
||||
<ul>
|
||||
<li><a href="http://www.1000genomes.org/wiki/analysis/variant-call-format/">VCF</a> : VCF type, the recommended format for representing variant loci and genotype calls. The GATK will only process valid VCF files; <a href="http://vcftools.sourceforge.net/">VCFTools</a> provides the official VCF validator. See <a href="http://vcftools.sourceforge.net/VCF-poster.pdf">here</a> for a useful poster detailing the VCF specification.</li>
|
||||
<li>UCSC-formatted dbSNP : dbSNP type, UCSC dbSNP database output</li>
|
||||
<li>BED : BED type, a general purpose format for representing genomic interval data, useful for masks and other interval outputs. <strong>Please note that the BED format is 0-based while most other formats are 1-based.</strong></li>
|
||||
</ul>
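<p>The 0-based versus 1-based distinction noted for BED trips up many users; BED intervals are also half-open (the end coordinate is exclusive). A small Python sketch of the conversion (illustrative, not GATK code):</p>

```python
def bed_to_one_based(start, end):
    """BED 0-based half-open interval -> 1-based inclusive coordinates."""
    return start + 1, end  # half-open end equals the inclusive end

def one_based_to_bed(start, end):
    """1-based inclusive coordinates -> BED 0-based half-open interval."""
    return start - 1, end

# The first 100 bases of a contig:
print(bed_to_one_based(0, 100))  # (1, 100)
print(one_based_to_bed(1, 100))  # (0, 100)
```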
|
||||
<p><strong>Note that we no longer support the PED format. See <a href="http://atgu.mgh.harvard.edu/plinkseq/output.shtml">here</a> for converting .ped files to VCF.</strong></p>
|
||||
<p>If you need additional information on VCF files, please see our FAQs on VCF files <a href="http://www.broadinstitute.org/gatk/guide/article?id=1318">here</a> and <a href="http://www.broadinstitute.org/gatk/guide/article?id=1268">here</a>.</p>
|
||||
|
|
|
|||
## What is "Phone Home" and how does it affect me?
|
||||
|
||||
http://gatkforums.broadinstitute.org/gatk/discussion/1250/what-is-phone-home-and-how-does-it-affect-me
|
||||
|
||||
<p>In GATK versions produced between September 2010 and May 2016, the GATK had a "Phone Home" usage reporting feature that sent us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature was enabled by default and required a key to be disabled (for running offline or for regulatory reasons).</p>
|
||||
<p><strong>The Phone Home feature was removed in version 3.6.</strong> Keys are no longer necessary, so if you had one, you can stop using it. We do not expect that including Phone Home arguments in GATK command lines would cause any errors (so this should not break any scripts), but let us know if you run into any trouble.</p>
|
||||
<p>Note that keys remain necessary for disabling Phone Home in older versions of GATK. See further below for details on how to obtain a key. </p>
|
||||
<hr />
|
||||
<h3>How Phone Home helped development</h3>
|
||||
<p>At the time, the information provided by the Phone Home feature was critical in driving improvements to the GATK:</p>
|
||||
<ul>
|
||||
<li>By recording detailed information about each error that occurs, it enabled GATK developers to <strong>identify and fix previously-unknown bugs</strong> in the GATK. </li>
|
||||
<li>It allowed us to better understand how the GATK is used in practice and <strong>adjust our documentation and development goals</strong> for common use cases.</li>
|
||||
<li>It gave us a picture of <strong>which versions</strong> of the GATK are in use over time, and how successful we've been at encouraging users to migrate from obsolete or broken versions of the GATK to newer, improved versions.</li>
|
||||
<li>It told us <strong>which tools</strong> were most commonly used, allowing us to monitor the adoption of newly-released tools and abandonment of outdated tools.</li>
|
||||
<li>It provided us with a sense of the <strong>overall size of our user base</strong> and the major organizations/institutions using the GATK.</li>
|
||||
</ul>
|
||||
<hr />
|
||||
<h3>What information was sent to us</h3>
|
||||
<p>Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.</p>
|
||||
<h4>A successful run:</h4>
|
||||
<pre><code class="pre_md"><GATK-run-report>
|
||||
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
|
||||
<start-time>2012/03/10 20.21.19</start-time>
|
||||
<end-time>2012/03/10 20.21.19</end-time>
|
||||
<run-time>0</run-time>
|
||||
<walker-name>CountReads</walker-name>
|
||||
<svn-version>1.4-483-g63ecdb2</svn-version>
|
||||
<total-memory>85000192</total-memory>
|
||||
<max-memory>129957888</max-memory>
|
||||
<user-name>depristo</user-name>
|
||||
<host-name>10.0.1.10</host-name>
|
||||
<java>Apple Inc.-1.6.0_26</java>
|
||||
<machine>Mac OS X-x86_64</machine>
|
||||
<iterations>105</iterations>
|
||||
</GATK-run-report></code class="pre_md"></pre>
|
||||
<h4>A run where an exception has occurred:</h4>
|
||||
<pre><code class="pre_md"><GATK-run-report>
|
||||
<id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id>
|
||||
<exception>
|
||||
<message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message>
|
||||
<stacktrace class="java.util.ArrayList">
|
||||
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string>
|
||||
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
|
||||
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
|
||||
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
|
||||
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
|
||||
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
|
||||
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
|
||||
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
|
||||
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
|
||||
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
|
||||
</stacktrace>
|
||||
<cause>
|
||||
<message>Position: &apos;10,000,001x&apos; contains invalid chars.</message>
|
||||
<stacktrace class="java.util.ArrayList">
|
||||
<string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string>
|
||||
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string>
|
||||
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
|
||||
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
|
||||
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
|
||||
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
|
||||
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
|
||||
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
|
||||
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
|
||||
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
|
||||
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
|
||||
</stacktrace>
|
||||
<is-user-exception>false</is-user-exception>
|
||||
</cause>
|
||||
<is-user-exception>true</is-user-exception>
|
||||
</exception>
|
||||
<start-time>2012/03/10 20.19.52</start-time>
|
||||
<end-time>2012/03/10 20.19.52</end-time>
|
||||
<run-time>0</run-time>
|
||||
<walker-name>CountReads</walker-name>
|
||||
<svn-version>1.4-483-g63ecdb2</svn-version>
|
||||
<total-memory>85000192</total-memory>
|
||||
<max-memory>129957888</max-memory>
|
||||
<user-name>depristo</user-name>
|
||||
<host-name>10.0.1.10</host-name>
|
||||
<java>Apple Inc.-1.6.0_26</java>
|
||||
<machine>Mac OS X-x86_64</machine>
|
||||
<iterations>0</iterations>
|
||||
</GATK-run-report></code class="pre_md"></pre>
|
||||
<p><strong>Note that as of GATK 1.5, we no longer collected information about the command line executed, the working directory, or the tmp directory.</strong></p>
|
||||
<hr />
|
||||
<h3>Disabling Phone Home</h3>
|
||||
<p>Versions of GATK older than 3.6 attempted to "phone home" as a normal part of each run. However, we recognized that some of our users need to run the GATK with the Phone Home disabled. To enable this, we provided an option (<code>-et NO_ET</code> ) in GATK 1.5 and later to disable the Phone Home feature. To use this option, you need to contact us to request a key. Instructions for doing so are below.</p>
|
||||
<h4>How to obtain and use a GATK key</h4>
|
||||
<p>To obtain a GATK key, please fill out the <a href="http://www.broadinstitute.org/gatk/request-key">request form</a>. </p>
|
||||
<p>Running the GATK with a key is simple: you just need to append a <code>-K your.key</code> argument to your customary command line, where <code>your.key</code> is the path to the key file you obtained from us:</p>
|
||||
<pre><code class="pre_md">java -jar dist/GenomeAnalysisTK.jar \
|
||||
-T PrintReads \
|
||||
-I public/testdata/exampleBAM.bam \
|
||||
-R public/testdata/exampleFASTA.fasta \
|
||||
-et NO_ET \
|
||||
-K your.key</code class="pre_md"></pre>
|
||||
<p>The <code>-K</code> argument is only necessary when running the GATK with the <code>NO_ET</code> option.</p>
|
||||
<h4>Troubleshooting key-related problems</h4>
|
||||
<ul>
|
||||
<li>Corrupt/Unreadable/Revoked Keys</li>
|
||||
</ul>
|
||||
<p>If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please apply for a new key.</p>
|
||||
<ul>
|
||||
<li>GATK Public Key Not Found</li>
|
||||
</ul>
|
||||
<p>If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try <a href="http://www.broadinstitute.org/gatk/download">downloading</a> it again. If you're compiling from source, try re-compiling. If all else fails, please ask for help on our <a href="http://gatkforums.broadinstitute.org/">community forum</a>.</p>
|
||||
|
|
|
|||
## What is GATK-Lite and how does it relate to "full" GATK 2.x? [RETIRED]

http://gatkforums.broadinstitute.org/gatk/discussion/1720/what-is-gatk-lite-and-how-does-it-relate-to-full-gatk-2-x-retired

<p><strong>Please note that GATK-Lite was retired in February 2013 when version 2.4 was released. See the announcement <a href="http://www.broadinstitute.org/gatk/guide/article?id=2091">here</a>.</strong></p>
<hr />
<p>You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original <a href="http://en.wikipedia.org/wiki/MIT_License">MIT license</a>). </p>
<p>But what's in the box? What can GATK-Lite do -- or rather, what can it <strong>not</strong> do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power? </p>
<p>To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.</p>
<h3>First, you need to understand the two core components of the GATK: the engine and the tools (see picture below).</h3>
<p>As explained <a href="http://www.broadinstitute.org/gatk/about/#what-is-the-gatk">here</a>, the <strong>engine</strong> handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK was a car, that would be the engine and chassis. What we call the <strong>tools</strong> are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.</p>
<p><img src="http://www.broadinstitute.org/gatk/img/core_gatk2.png" alt="Core GATK components" /></p>
<h3>Second is how we work on developing the GATK, and what it means for how improvements are shared (or not) between Lite and Full.</h3>
<p>We do all our development work on a single codebase. This means that everything -- the engine and all tools -- is on one common workbench. There are <strong>no</strong> different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.</p>
<p>The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements end up in both GATK-Lite and GATK-Full. So in terms of the power, speed and robustness of the GATK that are determined by the engine, there is no difference between them. </p>
<p>For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the <code>.jar</code> files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.</p>
<h3>So there are two basic types of differences between the tools available in the Lite and Full builds (see picture below).</h3>
<ol>
<li>
<p>We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.</p>
</li>
<li>We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.</li>
</ol>
<p><img src="http://www.broadinstitute.org/gatk/img/lite_vs_2x.png" alt="Tools in Lite vs. Full" /></p>
<p>Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences: </p>
<ol>
<li>
<p>The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.</p>
</li>
<li>Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower. </li>
</ol>
<h3>So, to summarize:</h3>
<p>The underlying engine is exactly the same in both GATK-Lite and GATK-Full. Most functionalities are available in both builds, performed by the same tools. Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better. New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build. </p>
<p>We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further! </p>
## What is Map/Reduce and why are GATK tools called "walkers"?

http://gatkforums.broadinstitute.org/gatk/discussion/1754/what-is-map-reduce-and-why-are-gatk-tools-called-walkers

<h3>Overview</h3>
<p>One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.</p>
<p>Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.</p>
<h3>Map/Reduce</h3>
<p><strong>Map/Reduce</strong> is the technique we use to achieve this. It consists of three steps, formally called <code>filter</code>, <code>map</code> and <code>reduce</code>. Let’s apply it to an example case where we want to find the average depth of coverage in our dataset for a certain region of the genome.</p>
<ul>
<li>
<p><code>filter</code> determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.</p>
</li>
<li>
<p><code>map</code> applies the function, <em>i.e.</em> performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.</p>
</li>
<li><code>reduce</code> combines the elements in the list of results output by the <code>map</code> function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.</li>
</ul>
<p>This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.</p>
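<p>The three steps above can be sketched in Python. This is a toy illustration of the pattern, not GATK code: the pileup lookup is a stand-in dictionary for what a real tool would compute from a BAM file, and all names are ours.</p>

```python
from functools import reduce

# Toy stand-in for a BAM file: depth of coverage at a few reference positions.
# A real tool would derive these from read pileups.
pileup_depths = {100: 30, 101: 32, 102: 28, 103: 35, 104: 30}

def filter_step(region_start, region_end):
    """filter: list the reference positions in the region of interest."""
    return range(region_start, region_end + 1)

def map_step(position):
    """map: look up the depth of coverage at one position."""
    return pileup_depths.get(position, 0)

def reduce_step(depths):
    """reduce: combine per-position depths into the average coverage."""
    total = reduce(lambda acc, d: acc + d, depths, 0)
    return total / len(depths)

positions = filter_step(100, 104)
depths = [map_step(pos) for pos in positions]
print(reduce_step(depths))  # → 31.0
```

<p>Because each <code>map_step</code> call depends only on one position, the middle stage can be farmed out to many workers and the results folded together afterward, which is exactly what makes the technique easy to parallelize.</p>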
<h3>Walkers, filters and traversal types</h3>
<p>All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them <strong>walkers</strong>: because they “walk” across the genome, getting things done.</p>
<p>Note that even though it’s not included in the Map/Reduce technique’s name, the <code>filter</code> step is very important. It determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.</p>
<p>Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on <strong>traversal engines</strong>, which are little programs that are designed to “traverse” the data (<em>i.e.</em> walk through the data) in specific ways.</p>
<p>There are three major types of traversal: <strong>Locus Traversal</strong>, <strong>Read Traversal</strong> and <strong>Active Region Traversal</strong>. In our interval coverage example, the tool’s filter uses the <strong>Locus Traversal</strong> engine, which walks through the data by locus, <em>i.e.</em> by position along the reference genome. Because of that, the tool is classified as a <strong>Locus Walker</strong>. Similarly, the <strong>Read Traversal</strong> engine is used, you’ve guessed it, by <strong>Read Walkers</strong>. </p>
<p>The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often. </p>
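<p>To make the division of labor concrete, here is a minimal Python sketch of the engine/walker split (the real GATK is written in Java, and these class and method names are hypothetical, chosen only to illustrate the pattern): the traversal engine owns the loop over the data, and a walker supplies only the <code>map</code> and <code>reduce</code> callbacks.</p>

```python
class LocusTraversalEngine:
    """Walks the data locus by locus and drives a walker's map/reduce steps."""
    def __init__(self, loci):
        self.loci = loci  # e.g. (position, pileup depth) pairs

    def traverse(self, walker):
        result = walker.reduce_init()
        for locus, depth in self.loci:
            mapped = walker.map(locus, depth)        # analyze one locus
            result = walker.reduce(mapped, result)   # fold into running result
        return result

class CoverageWalker:
    """A 'Locus Walker': counts loci and sums depth to get average coverage."""
    def reduce_init(self):
        return (0, 0)  # (number of loci seen, total depth)

    def map(self, locus, depth):
        return depth

    def reduce(self, mapped, acc):
        n, total = acc
        return (n + 1, total + mapped)

engine = LocusTraversalEngine([(100, 30), (101, 32), (102, 28)])
n, total = engine.traverse(CoverageWalker())
print(total / n)  # → 30.0
```

<p>The payoff of this design is that a new tool only has to define what happens at each locus (or read); all the traversal, filtering and bookkeeping stay in the shared engine.</p>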
<h3>Further reading</h3>
<p><a href="http://www.broadinstitute.org/gatk/guide/article?id=1988">A primer on parallelism with the GATK</a><br />
<a href="http://www.broadinstitute.org/gatk/guide/article?id=1975">How can I use parallelism to make GATK tools run faster?</a></p>
## What is a GVCF and how is it different from a 'regular' VCF?

http://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf

<h3>Overview</h3>
<p>GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation <a href="http://vcftools.sourceforge.net/specs.html">here</a>), but a Genomic VCF contains extra information. </p>
<p>This document explains what that extra information is and how you can use it to empower your variant analyses. </p>
<h3>Important caveat</h3>
<p>What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with <code>--output_mode EMIT_ALL_SITES</code>). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.</p>
<h3>General comparison of VCF vs. gVCF</h3>
<p>The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do <a href="http://www.broadinstitute.org/gatk/guide/article?id=3893">joint analysis of a cohort</a> in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in <a href="http://www.broadinstitute.org/gatk/guide/article?id=4042">reference model</a>.</p>
<img src="https://us.v-cdn.net/5019796/uploads/FileUpload/e6/bd853ec3eca81ccde698c73c02731e.png" />
<p>Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the <code>BP_RESOLUTION</code> gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses. </p>
<h3>The two types of gVCFs</h3>
<p>As you can see in the figure above, there are two options you can use with <code>-ERC</code>: <code>GVCF</code> and <code>BP_RESOLUTION</code>. With <code>BP_RESOLUTION</code>, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With <code>GVCF</code>, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the <code>##GVCFBlock</code> lines of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the <code>-GVCF</code> option. </p>
<h3>Example gVCF file</h3>
<p>This is a banded gVCF produced by HaplotypeCaller with the <code>-GVCF</code> option. </p>
<h4>Header:</h4>
<p>As you can see in the first line, the basic file format is a valid version 4.1 VCF:</p>
<pre><code class="pre_md">##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta</code></pre>
<p>Toward the middle (after the <code>##FORMAT</code> lines) you see the <code>##GVCFBlock</code> lines, repeated here for clarity:</p>
<pre><code class="pre_md">##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)</code></pre>
<p>which indicate the GQ ranges used for banding (corresponding to the boundaries <code>[5, 20, 60]</code>). </p>
<p>You can also see the definition of the <code>MIN_DP</code> annotation in the <code>##FORMAT</code> lines. </p>
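<p>Those header lines are machine-readable, so the band boundaries can be recovered programmatically. A minimal sketch, assuming the header format shown above (the regex and function name are ours, not part of any GATK API):</p>

```python
import re

def parse_gvcf_block_boundaries(header_lines):
    """Extract sorted GQ band boundaries from ##GVCFBlock header lines."""
    pattern = re.compile(r"##GVCFBlock=minGQ=(\d+)\(inclusive\),maxGQ=(\d+)\(exclusive\)")
    boundaries = set()
    for line in header_lines:
        m = pattern.match(line)
        if m:
            boundaries.add(int(m.group(1)))
    # Drop the implicit 0 lower bound; what remains are the band edges.
    boundaries.discard(0)
    return sorted(boundaries)

header = [
    "##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)",
    "##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)",
    "##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)",
    "##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)",
]
print(parse_gvcf_block_boundaries(header))  # → [5, 20, 60]
```

<p>Note that the blocks appear in lexical rather than numeric order in the header; sorting the parsed minGQ values recovers the boundaries.</p>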
<h4>Records</h4>
<p>The first thing you'll notice, hopefully, is the <code><NON_REF></code> symbolic allele listed in every record's <code>ALT</code> field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.</p>
<p>The second thing to look for is the <code>END</code> tag in the <code>INFO</code> field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116. </p>
<pre><code class="pre_md">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 10000000 . T <NON_REF> . . END=10000116 GT:DP:GQ:MIN_DP:PL 0/0:44:99:38:0,89,1385
20 10000117 . C T,<NON_REF> 612.77 . BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235 GT:AD:DP:GQ:PL:SB 0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10
20 10000118 . T <NON_REF> . . END=10000210 GT:DP:GQ:MIN_DP:PL 0/0:42:99:38:0,80,1314
20 10000211 . C T,<NON_REF> 638.77 . BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549 GT:AD:DP:GQ:PL:SB 0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10
20 10000212 . A <NON_REF> . . END=10000438 GT:DP:GQ:MIN_DP:PL 0/0:52:99:42:0,99,1403
20 10000439 . T G,<NON_REF> 1737.77 . DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20 10000440 . T <NON_REF> . . END=10000597 GT:DP:GQ:MIN_DP:PL 0/0:56:99:49:0,120,1800
20 10000598 . T A,<NON_REF> 1754.77 . DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20 10000599 . T <NON_REF> . . END=10000693 GT:DP:GQ:MIN_DP:PL 0/0:51:99:47:0,120,1800
20 10000694 . G A,<NON_REF> 961.77 . BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB 0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22
20 10000695 . G <NON_REF> . . END=10000757 GT:DP:GQ:MIN_DP:PL 0/0:48:99:45:0,120,1800
20 10000758 . T A,<NON_REF> 1663.77 . DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20 10000759 . A <NON_REF> . . END=10001018 GT:DP:GQ:MIN_DP:PL 0/0:40:99:28:0,65,1080
20 10001019 . T G,<NON_REF> 93.77 . BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB 0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3
20 10001020 . C <NON_REF> . . END=10001020 GT:DP:GQ:MIN_DP:PL 0/0:26:72:26:0,72,1080
20 10001021 . T <NON_REF> . . END=10001021 GT:DP:GQ:MIN_DP:PL 0/0:25:37:25:0,37,909
20 10001022 . C <NON_REF> . . END=10001297 GT:DP:GQ:MIN_DP:PL 0/0:30:87:25:0,72,831
20 10001298 . T A,<NON_REF> 1404.77 . DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20 10001299 . C <NON_REF> . . END=10001386 GT:DP:GQ:MIN_DP:PL 0/0:43:99:39:0,95,1226
20 10001387 . C <NON_REF> . . END=10001418 GT:DP:GQ:MIN_DP:PL 0/0:41:42:39:0,21,315
20 10001419 . T <NON_REF> . . END=10001425 GT:DP:GQ:MIN_DP:PL 0/0:45:12:42:0,9,135
20 10001426 . A <NON_REF> . . END=10001427 GT:DP:GQ:MIN_DP:PL 0/0:49:0:48:0,0,1282
20 10001428 . T <NON_REF> . . END=10001428 GT:DP:GQ:MIN_DP:PL 0/0:49:21:49:0,21,315
20 10001429 . G <NON_REF> . . END=10001429 GT:DP:GQ:MIN_DP:PL 0/0:47:18:47:0,18,270
20 10001430 . G <NON_REF> . . END=10001431 GT:DP:GQ:MIN_DP:PL 0/0:45:0:44:0,0,1121
20 10001432 . A <NON_REF> . . END=10001432 GT:DP:GQ:MIN_DP:PL 0/0:43:18:43:0,18,270
20 10001433 . T <NON_REF> . . END=10001433 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,1201
20 10001434 . G <NON_REF> . . END=10001434 GT:DP:GQ:MIN_DP:PL 0/0:44:18:44:0,18,270
20 10001435 . A <NON_REF> . . END=10001435 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,1130
20 10001436 . A AAGGCT,<NON_REF> 1845.73 . DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20 10001437 . A <NON_REF> . . END=10001437 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,0</code></pre>
<p>Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different GQ ranges (which are defined in the header).</p>
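<p>The merging rule can be sketched as follows: assign each site's GQ to a band, and start a new block whenever the band changes. This is an illustrative reimplementation under stated assumptions, not the actual HaplotypeCaller code; the boundaries <code>[5, 20, 60]</code> come from the header above, and the sample GQ values mirror positions 10001426-10001429 in the snippet.</p>

```python
from bisect import bisect_right

BOUNDARIES = [5, 20, 60]  # band edges from the ##GVCFBlock header lines

def gq_band(gq):
    """Return the index of the GQ band a genotype quality falls into:
    [0,5) -> 0, [5,20) -> 1, [20,60) -> 2, [60,inf) -> 3."""
    return bisect_right(BOUNDARIES, gq)

def group_into_blocks(sites):
    """Group consecutive (position, GQ) sites into blocks of the same band."""
    blocks = []
    for pos, gq in sites:
        band = gq_band(gq)
        if blocks and blocks[-1]["band"] == band:
            blocks[-1]["end"] = pos  # same band: extend the current block
        else:
            blocks.append({"start": pos, "end": pos, "band": band})
    return blocks

# GQ values like those at positions 10001426-10001429 above: 0, 0, 21, 18.
sites = [(10001426, 0), (10001427, 0), (10001428, 21), (10001429, 18)]
print(group_into_blocks(sites))  # three blocks, because the GQ band changes twice
```

<p>The first two sites share the lowest band and collapse into one block, while GQ 21 and GQ 18 fall on opposite sides of the 20 boundary and therefore each start a new block, just as in the records above.</p>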