* add interleaved fastq option to sam2fastq
* add optional adapter trimming path
* add "skip_revert" option to skip reverting the bams (sometimes useful -- hidden parameter)
* add a walker that reads in one bam file and outputs N bam files, one for each read group in the original bam. This is a very important step in any BAM reprocessing pipeline.
I am using this new pipeline to process the CEU and YRI PCR Free WGS
trios.
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Erics comments
-- When provided, this argument causes us to only emit the selected samples into the VCF. No INFO field annotations (AC for example) or other features are modified. It's current primary use is for efficiently evaluating joint calling.
-- Add integration test for onlyEmitSamples
1) TP reviews with 0/0 genotypes were killing those sites and making them appear as assessed FPs even when correctly called!
Fixed this by changing the logic in the assessor to allow discordant genotypes through as FPs.
Also, isMonomorphic() in the MongoGenotype needs to check whether the genotype is discordant.
Added unit test for this case.
2) Minor code cleanup in the Assessor class.
The most important being the renaming of isUsableCall() to isNotUsableCall() since that's what it is returning.
-- Previous version used overlaps on the full GenomeLoc of the variant in the KB, which meant that deletions that didn't start in an interval would be included in an interval, which isn't the behavior of the tribble and so caused a mismatch when assessing variants in the knowledgebase
-- Bugfix for BAMs containing reads without real (M,I,D,N) operators. Simply needed to set validation stringency to SILENT in the read. Added a BadCigar filter to the SAMRecord stream anyway
-- Add capture all sites mode to AssessNA12878: will write all sites to the badSites VCF, regardless of whether they are bad. It's useful if you essentially want to annotate a VCF with KB information for later analysis, such as computing ROC curves
-- Add ignore filters mode to AssessNA12878: will as expected treat all sites in the input VCF calls as PASS, even if the site has a FILTER field setting
-- Add minPNonRef argument to AssessNA12878: this will consider a site not called even if the NA12878 genotype is not 0/0 if the PLs are present and the PL for 0/0 isn't greater than this value. It allows us to easily differentiate low confidence non-ref sites obtained via multi-sample calling from highly confident non-ref calls that might be real TP or FPs
-- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models. However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model.
-- Now the maxGaussians argument only applies to the positive model
-- This update builds a GMM for the negative model with a default 4 max gaussians (though this can be controlled via command line parameter)
-- Removes the percentBadVariants argument. The only way to control how many variants are included in the negative model is with minNumBad
-- Reduced the minNumBad argument default to 1000 from 2500
-- Update MD5s for VQSR. md5s changed significantly due to underlying changes in the default GMM model. Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected.
-- minNumBad is now numBad
-- Plot all negative training points as well, since this significantly changes our view of the GMM PDF
-- In the case where there's some variation to assembly and evaluate but the resulting haplotypes don't result in any called variants, the reference model would exception out with "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype". Now we detect this case and emit the standard no variation output.
-- [delivers #54625060]
-- The previous version of TribbleIndexedFeatureReader.query() would open a RandomAccessFile each time the GATK crossed a shard boundary. When running with -L wex.intervals (or any time there were lots of intervals) we'd be opening and closing enormous numbers of files, radically slowing down the GATK. With these patched versions of Tribble we see something like the following performance improvements:
SelectVariants with -L wex.intervals on my local machine against non-local file
pre-patch => 3 hours
post-patch => 30 seconds
Calculates reference bias based on the AD genotype field instead of AB. This is slightly more meaningful for indels and still a good estimator for snps.
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.
Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting *tons* of re-compute.
Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
1. Removing old legacy code that was capping the positional depth for reduced reads to 127.
Unfortunately this cap affectively performs biased down-sampling and throws off e.g. FS numbers.
Added end to end unit test that depth counts in RR can be higher than max byte.
Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127.
2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads.
Refactored it so that it can only remove non-consensus reads.