We have generalized the processing script to be able to handle multiple scenarios. Originally it was
designed for PCR free data only, we added all the steps necessary to start from fastq and process
RNA-seq as well as non-human data. This is our go to script in TechDev.
* add optional "starting from fastq" path to the pipeline
* add mark duplicates (optionally) to the pipeline
* add an option to run with the mouse data (without dbsnp and with single ended fastq)
* add option to process RNA-seq data from topHat (add RG and reassign mapping quality if necessary)
* add option to filter or include reads with N in the cigar string
* add parameter to allow keeping the intermediate files
-- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
Quick fix the missing column header in the QualifyMissingIntervals
report.
Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block
-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.
Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.
This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.
We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)
Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
-This was a dependency of the test suite, but not the GATK proper,
which caused problems when running the test suite on the packaged
GATK jar at release time
-Use GATKVCFUtils.readVCF() instead
-Switch to using new GSA AWS account for storage of phone home data
-Use DNS-compliant bucket names, as per Amazon's best practices
-Encrypt publicly-distributed version of credentials. Grant only PutObject
permission, and only for the relevant buckets.
-Store non-distributed credentials in private/GATKLogs/newAWSAccountCredentials
for now -- need to integrate with existing python/shell scripts
later to get the log downloading working with the new account
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Erics comments
-- Bugfix for BAMs containing reads without real (M,I,D,N) operators. Simply needed to set validation stringency to SILENT in the read. Added a BadCigar filter to the SAMRecord stream anyway
-- Add capture all sites mode to AssessNA12878: will write all sites to the badSites VCF, regardless of whether they are bad. It's useful if you essentially want to annotate a VCF with KB information for later analysis, such as computing ROC curves
-- Add ignore filters mode to AssessNA12878: will as expected treat all sites in the input VCF calls as PASS, even if the site has a FILTER field setting
-- Add minPNonRef argument to AssessNA12878: this will consider a site not called even if the NA12878 genotype is not 0/0 if the PLs are present and the PL for 0/0 isn't greater than this value. It allows us to easily differentiate low confidence non-ref sites obtained via multi-sample calling from highly confident non-ref calls that might be real TP or FPs
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.
Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting *tons* of re-compute.
Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.