-- The previous algorithm would compute the likelihood of each haplotype pooled across samples. This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes. The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects. After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model. It's worse than the current version in both single and multiple samples:
1000G EUR samples:
10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
per_sample SNPS 50 0 5 8 1
per_sample INDELS 6 0 7 2 1
pooled SNPS 49 0 6 8 1
pooled INDELS 5 0 8 2 1
100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
per_sample SNPS 144 0 22 28 1
per_sample INDELS 28 1 16 9 11
pooled SNPS 143 0 23 28 1
pooled INDELS 27 1 17 9 11
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds