A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.
Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.
This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.
We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)
Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
-This was a dependency of the test suite, but not the GATK proper,
which caused problems when running the test suite on the packaged
GATK jar at release time
-Use GATKVCFUtils.readVCF() instead
-Switch to using new GSA AWS account for storage of phone home data
-Use DNS-compliant bucket names, as per Amazon's best practices
-Encrypt publicly-distributed version of credentials. Grant only PutObject
permission, and only for the relevant buckets.
-Store non-distributed credentials in private/GATKLogs/newAWSAccountCredentials
for now -- need to integrate with existing python/shell scripts
later to get the log downloading working with the new account
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
Why wasn't it there before, you ask
----------------------------------
Before I was running it separately (by hand), but now it's integrated in
the FullProcessingPipeline.
Integration was a pain because of Queue's limitation of only allowing 1
@Output file. This forced me to write the ugliest piece of code of my
life, but it's working and it's processing the YRI from scratch using
that right now. So I'm happy... somewhat.
Other changes to the pipeline
-----------------------------
* Add --filter_bases_not_stored to the IndelRealigner step -- sometimes BAM files have reads with no bases stored in the unmapped section (no idea why) but this disrupts the pipeline.
* Change adaptor marking parameter to "dual indexed" instead of "pair-ended" -- for PCR Free data.
There is now a command-line option to set the model to use in the HC.
Incorporated Ryan's current (unmerged) branch in for most of these changes.
Because small changes to the math can have drastic effects, I decided not to let users tweak
the calculations themselves. Instead they can select either NONE, CONSERVATIVE (the default),
or AGGRESSIVE.
Note that any base insertion/deletion qualities from BQSR are still used.
Also, note that the repeat unit x repeat length approach gave very poor results against the KB,
so it is not included as an option here.
* add interleaved fastq option to sam2fastq
* add optional adapter trimming path
* add "skip_revert" option to skip reverting the bams (sometimes useful -- hidden parameter)
* add a walker that reads in one bam file and outputs N bam files, one for each read group in the original bam. This is a very important step in any BAM reprocessing pipeline.
I am using this new pipeline to process the CEU and YRI PCR Free WGS
trios.
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Erics comments
-- When provided, this argument causes us to only emit the selected samples into the VCF. No INFO field annotations (AC for example) or other features are modified. It's current primary use is for efficiently evaluating joint calling.
-- Add integration test for onlyEmitSamples
1) TP reviews with 0/0 genotypes were killing those sites and making them appear as assessed FPs even when correctly called!
Fixed this by changing the logic in the assessor to allow discordant genotypes through as FPs.
Also, isMonomorphic() in the MongoGenotype needs to check whether the genotype is discordant.
Added unit test for this case.
2) Minor code cleanup in the Assessor class.
The most important being the renaming of isUsableCall() to isNotUsableCall() since that's what it is returning.