From a6f632874bd04c383e9172e61a6d312af5e74c4c Mon Sep 17 00:00:00 2001
From: Geraldine Van der Auwera
+ * The basic operation of the HaplotypeCaller proceeds as follows:
+ *
+ * The program determines which regions of the genome it needs to operate on, based on the presence of significant
+ * evidence for variation.
+ *
+ * For each ActiveRegion, the program builds a De Bruijn-like graph to reassemble the ActiveRegion, and identifies
+ * the possible haplotypes present in the data. The program then realigns each haplotype against the reference
+ * haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.
+ *
+ * For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the
+ * PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are
+ * then marginalized to obtain the likelihoods of alleles for each potentially variant site given the read data.
+ *
+ * For each potentially variant site, the program applies Bayes’ rule, using the likelihoods of alleles given the
+ * read data to calculate the likelihoods of each genotype per sample given the read data observed for that
+ * sample. The most likely genotype is then assigned to the sample.
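
The genotype assignment step described above can be illustrated with a small sketch. This is not GATK code: the class name, method names, the three-genotype layout (hom-ref, het, hom-var), the flat prior, and the likelihood values are all hypothetical and chosen only for illustration. It assumes per-genotype likelihoods are already available in log10 space and shows Bayes' rule followed by picking the most likely genotype.

```java
import java.util.Arrays;

/** Illustrative only; not part of GATK. All names and numbers are hypothetical. */
final class GenotypePosteriorSketch {

    /**
     * Applies Bayes' rule in log10 space: the posterior of each genotype is
     * proportional to prior(genotype) * likelihood(data | genotype).
     */
    static double[] posteriors(double[] log10Likelihoods, double[] log10Priors) {
        if (log10Likelihoods.length != log10Priors.length) {
            throw new IllegalArgumentException("one prior is required per genotype");
        }
        int n = log10Likelihoods.length;
        double[] joint = new double[n];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            joint[i] = log10Likelihoods[i] + log10Priors[i];
            max = Math.max(max, joint[i]);
        }
        // Subtract the maximum before exponentiating to limit underflow.
        double[] post = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            post[i] = Math.pow(10.0, joint[i] - max);
            sum += post[i];
        }
        // Normalize so the posteriors sum to 1.
        for (int i = 0; i < n; i++) {
            post[i] /= sum;
        }
        return post;
    }

    /** Returns the index of the genotype with the highest posterior. */
    static int mostLikely(double[] posteriors) {
        int best = 0;
        for (int i = 1; i < posteriors.length; i++) {
            if (posteriors[i] > posteriors[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical log10 likelihoods for hom-ref, het, hom-var at one site.
        double[] likelihoods = {-6.0, -0.5, -4.0};
        double flat = Math.log10(1.0 / 3.0); // assumed flat prior
        double[] priors = {flat, flat, flat};
        double[] post = posteriors(likelihoods, priors);
        System.out.println(Arrays.toString(post) + " -> genotype index " + mostLikely(post));
    }
}
```

With a flat prior the posterior ranking follows the likelihoods directly; a non-flat prior would shift the result toward the genotypes it favors.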
* Input bam file(s) from which to make calls
@@ -114,23 +144,71 @@ import java.util.*;
* These are example commands that show how to run HaplotypeCaller for typical use cases. Square brackets ("[ ]")
+ * indicate optional arguments. Note that parameter values shown here may not be the latest recommended; see the
+ * Best Practices documentation for detailed recommendations.
+ *
+ * 1. Define active regions
+ *
+ *
+ * 2. Determine haplotypes by re-assembly of the active region
+ *
+ *
+ * 3. Determine likelihoods of the haplotypes given the read data
+ *
+ *
+ * 4. Assign sample genotypes
+ *
+ *
* Input
* Examples
+ *
+ *
+ * Single-sample all-sites calling on DNAseq (for GVCF-based cohort analysis workflow)
+ *
+ * java \
+ * -jar GenomeAnalysisTK.jar \
+ * -T HaplotypeCaller \
+ * -R reference/human_g1k_v37.fasta \
+ * -I sample1.bam \
+ * --emitRefConfidence GVCF \
+ * --variant_index_type LINEAR \
+ * --variant_index_parameter 128000 \
+ * [--dbsnp dbSNP.vcf] \
+ * [-L targets.interval_list] \
+ * -o output.raw.snps.indels.g.vcf
+ *
+ *
*
* java
* -jar GenomeAnalysisTK.jar
* -T HaplotypeCaller
* -R reference/human_g1k_v37.fasta
* -I sample1.bam [-I sample2.bam ...] \
- * --dbsnp dbSNP.vcf \
- * -stand_call_conf [50.0] \
- * -stand_emit_conf 10.0 \
- * [-L targets.interval_list]
+ * [--dbsnp dbSNP.vcf] \
+ * [-stand_call_conf 30] \
+ * [-stand_emit_conf 10] \
+ * [-L targets.interval_list] \
* -o output.raw.snps.indels.vcf
*
+ *
+ *
+ *
+ *
+ * java
+ * -jar GenomeAnalysisTK.jar
+ * -T HaplotypeCaller
+ * -R reference/human_g1k_v37.fasta
+ * -I sample1.bam \
+ * -recoverDanglingHeads \
+ * -dontUseSoftClippedBases \
+ * [--dbsnp dbSNP.vcf] \
+ * -stand_call_conf 20 \
+ * -stand_emit_conf 20 \
+ * -o output.raw.snps.indels.vcf
+ *
+ *
*
*
- * GenotypeGVCFs merges gVCF records that were produced as part of the "single sample discovery" pipeline using
- * the '-ERC GVCF' mode of the Haplotype Caller. This tool performs the multi-sample joint aggregation
+ * GenotypeGVCFs merges gVCF records that were produced as part of the reference model-based variant discovery pipeline (see documentation for more details) using
+ * the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode of the HaplotypeCaller. This tool performs the multi-sample joint aggregation
* step and merges the records together in a sophisticated manner.
*
* At all positions of the target, this tool will combine all spanning records, produce correct genotype likelihoods,
* re-genotype the newly merged record, and then re-annotate it.
*
- * Note that this tool cannot work with just any gVCF files - they must have been produced with the Haplotype Caller,
+ * Note that this tool cannot work with just any gVCF files - they must have been produced with the HaplotypeCaller,
* which uses a sophisticated reference model to produce accurate genotype likelihoods for every position in the target.
*
*