-User must provide a mapping file via the new --sample_rename_mapping_file argument.
The mapping file must contain a mapping from absolute bam file path to new sample name
(format is described in the docs for the argument).
-Requires that each bam file listed in the mapping file contain only one sample
in its header (it may contain multiple read groups for that sample, however).
The engine enforces this, and throws a UserException if on-the-fly renaming is
requested for a multi-sample bam.
-Not all bam files for a traversal need to be listed in the mapping file.
-On-the-fly renaming is done as the VERY first step after creating the SAMFileReaders
in SAMDataSource (before the headers are even merged), to prevent possible consistency
issues.
-Renaming is done ONCE at traversal start for each SAMReaders resource creation in the
SAMResourcePool; this effectively means once per -nt thread.
-Comprehensive unit/integration tests
Known issues:
-If you specify the absolute path to a bam in the mapping file, and then
provide a path to that same bam to -I using SYMLINKS, the renaming won't
work. The absolute paths will look different to the engine because the
symlink is present in one path and not in the other.
GSA-974 #resolve
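
The symlink issue noted above can be reproduced outside the engine. The following is a minimal sketch, assuming a filesystem that permits creating symbolic links; the class and method names are hypothetical and nothing here is taken from the engine's code. It only shows why two absolute paths to the same bam can compare as different, and that resolving symlinks (which the commit above does not do) makes them compare as equal.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of the GATK engine.
final class SymlinkPathDemo {
    /** True when both paths have the same absolute form (symlinks are NOT resolved). */
    static boolean sameAbsolutePath(Path a, Path b) {
        return a.toAbsolutePath().normalize().equals(b.toAbsolutePath().normalize());
    }

    /** True when both paths resolve to the same real file (symlinks are followed). */
    static boolean sameRealPath(Path a, Path b) {
        try {
            return a.toRealPath().equals(b.toRealPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp file plus a symlink to it; returns {sameAbsolutePath, sameRealPath}. */
    static boolean[] compareViaSymlink() {
        try {
            Path dir = Files.createTempDirectory("rename-demo");
            Path bam = Files.createFile(dir.resolve("sample.bam"));
            Path link = Files.createSymbolicLink(dir.resolve("link.bam"), bam);
            return new boolean[] { sameAbsolutePath(bam, link), sameRealPath(bam, link) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = compareViaSymlink();
        System.out.println("same absolute path: " + result[0]);
        System.out.println("same real path:     " + result[1]);
    }
}
```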
Merged bug fix from Stable into Unstable
Conflicts:
protected/java/test/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindowUnitTest.java
Previous fixes and tests covered only trailing soft-clips. Now that up-front
hard-clipping is working properly, though, we were failing on those in the tool.
Added a patch for this as well as a separate test independent of the soft-clips
to make sure that it's working properly.
So the compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which has the 10 values; all other statistics and the regular PLs are computed as before.
Note that the integration test had to be disabled, as the BCF2 codec apparently doesn't support writing into genotype fields other than PL, DP, AD, GQ, FT, and GT.
1. All NA12878DBWalkers that export/emit sites need to do so in order; also one should be able
to use -L with them and not have it iterate over all possible sites.
Updated ExportReviews and ExtractConsensusSites to adhere to these constraints.
2. Added the option to AssessNA12878 to have it ignore FNs that overlap with a provided VCF.
This is useful if you have a list of sites from reviews that are okay to be missed in
particular techs only (because for some reason there is coverage but no evidence of the
alternate allele in them) - intended to be used with Jenkins.
3. Hooked up the logic of complex events all the way through the KB.
Now the consensus incorporates whether a call is complex and the assessor does not penalize for them.
4. Fixed long-standing bug that I managed to find accidentally:
AssessNA12878 was closing its DB connection before its final call to includeMissingCalls().
5. Hooked up the per-call confidences through the KB.
We no longer have a 2-tiered priority system in the KB (reviews and everything else) but instead
use a quasi-Bayesian estimator (will update to proper Bayesian treatment if needed).
Now ImportCallset and ImportReviews assign confidences as appropriate.
Also needed to fix up the consensus logic for calls with UNKNOWN status.
-Two SAMReaderIDs that pointed at the same underlying bam file through
a relative vs. an absolute path were not being treated as equal, and
had different hash codes. This was causing problems in the engine, since
SAMReaderIDs are often used as the keys of HashMaps.
-Fix: explicitly use the absolute path to the encapsulated bam file in
hashCode() and equals()
-Added tests to ensure this doesn't break again
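
The fix described above can be sketched as follows. This is a minimal illustration, assuming an ID class that wraps a java.io.File; ReaderId is a hypothetical stand-in and not the actual SAMReaderID source. Both equals() and hashCode() are derived from the absolute path, so a relative and an absolute reference to the same file land in the same HashMap bucket and compare as equal.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for SAMReaderID; not the engine's actual class.
final class ReaderId {
    private final File samFile;

    ReaderId(String path) {
        this.samFile = new File(path);
    }

    /** Absolute path of the wrapped file; relative paths resolve against the working directory. */
    String absolutePath() {
        return samFile.getAbsolutePath();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ReaderId)) return false;
        return absolutePath().equals(((ReaderId) other).absolutePath());
    }

    @Override
    public int hashCode() {
        // Must be derived from the same value equals() compares.
        return absolutePath().hashCode();
    }

    public static void main(String[] args) {
        ReaderId relative = new ReaderId("sample.bam");
        ReaderId absolute = new ReaderId(new File("sample.bam").getAbsolutePath());

        Map<ReaderId, String> readers = new HashMap<>();
        readers.put(relative, "reader-1");

        System.out.println("equal: " + relative.equals(absolute));
        System.out.println("lookup via absolute path: " + readers.get(absolute));
    }
}
```

As the known issue in the renaming commit points out, absolute paths do not resolve symlinks, so this approach still treats a symlinked path and its target as different files.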
Problem
-------
Qualify Missing Intervals only accepted GATK-formatted interval files for its coding sequence and bait parameters.
Solution
-------
There is no reason for such a limitation. I erased all the code that did the parsing and used IntervalUtils to parse the files instead, so the tool now handles any type of interval file that the GATK can handle.
PS: Also added an average depth column to the output.
- Added integration test to show that providing a contamination value and providing the same value via a file result in the same VCF
- Overrode default contamination value in test
1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout.
2. Updates to the KB assessment functionality:
a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
b. Lower the threshold on FS for FPs that would easily be filtered, since it's only single-sample calling.
3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived
the DeBruijn assembler.
4. Improvements to the likelihoods for the HC
a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need
to divide the error probability e by 3 because the observed base could have come from any of the non-observed alleles.
b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is
dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more
than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
turned into Q0s for now.
c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real
sites with off-kilter allele balances and we occasionally lose them to contamination removal.
5. Improved the dangling tail merging implementation.
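
Items 4a and 4b above can be sketched as follows. This is a hedged illustration only: the class, method names, and signatures are hypothetical and not the HaplotypeCaller or PairHMM code, and the PCR error quality of Q40 used in main is an assumed value chosen because half of it gives the Q20 cap mentioned above.

```java
// Hypothetical sketch of the likelihood adjustments; not the actual HC/PairHMM code.
final class HcLikelihoodSketch {
    /** Converts a phred-scaled quality into an error probability e. */
    static double errorProbability(int qual) {
        return Math.pow(10.0, -qual / 10.0);
    }

    /**
     * Probability of observing readBase given haplotypeBase.
     * Tristate correction: a mismatch gets e / 3, because the erroneous base
     * could be any one of the three non-matching bases.
     */
    static double emissionProbability(char readBase, char haplotypeBase, int qual) {
        double e = errorProbability(qual);
        return readBase == haplotypeBase ? 1.0 - e : e / 3.0;
    }

    /**
     * Quality for a position covered by both mates of an overlapping pair.
     * Agreeing bases are capped at half the phred-scaled PCR error rate;
     * disagreeing bases are set to Q0. The reads themselves are not merged.
     */
    static int overlapQual(char firstMateBase, char secondMateBase, int qual, int pcrErrorQual) {
        if (firstMateBase != secondMateBase) {
            return 0;
        }
        return Math.min(qual, pcrErrorQual / 2);
    }

    public static void main(String[] args) {
        int pcrErrorQual = 40; // assumed value; half of it is the Q20 cap
        System.out.println("mismatch emission at Q20: " + emissionProbability('A', 'C', 20));
        System.out.println("agreeing Q35 bases:       " + overlapQual('A', 'A', 35, pcrErrorQual));
        System.out.println("disagreeing bases:        " + overlapQual('A', 'C', 35, pcrErrorQual));
    }
}
```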
-Add ability to manually specify dependencies on the command line. This allows one
to specify, for example, that all walkers depend on the GeneralCallingPipeline
QScript, even though they don't have any compile-time dependencies on that QScript.
-Check that the provided walker class is valid in DependencyAnalyzer.xml
-Check ant exit status in the front-end script
-Fix bug where analyzer would give incorrect results if the list of changed
Java classes was empty
-This test is failing intermittently for unexplained reasons (see GSA-943)
-In the interest of keeping the rest of the pipeline test suite running, it's
best to disable this one test until GSA-943 is resolved