504 Commits (59484dfae404edbd6ddc4e383a433a8b653e69cc)

Author SHA1 Message Date
Mark DePristo 5632c13bf2 Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush().
-- Now compressed output VCFs are properly block-compressed (i.e., they are actually smaller than the uncompressed VCF)
2012-12-12 10:27:07 -05:00
Mark DePristo dd52a70d45 Fix AFCalcResult unit test
-- I was simply passing in the wrong values into the function.  Fixed the calls, and expanded the docs on what needs to be passed in.
2012-12-11 10:40:12 -05:00
Mauricio Carneiro 8a115edbaf ReduceReads is now scattered by contig
It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.
2012-12-10 15:25:27 -05:00
Eric Banks bdda63d973 Related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed. 2012-12-10 14:47:04 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Eric Banks 406adb8d44 The allele biased downsampling should not abort if there's a reduced read. Rather it should always keep the RR and downsample only original reads in the pileup. 2012-12-05 23:15:36 -05:00
Mark DePristo d0cab795b7 Got caught in the middle of a bad integration test, that was fixed in independent push. Moved test bam into testdata. 2012-12-05 14:49:22 -05:00
Eric Banks ef87b18e09 In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then). 2012-12-05 02:00:35 -05:00
Eric Banks 726332db79 Disabling the testNoCmdLineHeaderStdout test in UG because it keeps crashing when I run it locally 2012-12-05 00:54:00 -05:00
Eric Banks bca860723a Updating tests to handle bad validation data files (that used the wrong qual score encoding); overrides push from stable. 2012-12-03 22:01:07 -05:00
Ryan Poplin d5ed184691 Updating the HC integration test md5s. According to the NA12878 knowledge base this commit cuts down the FP rate by more than 50 percent with no loss in sensitivity. 2012-12-03 15:38:59 -05:00
Ryan Poplin 156d6a5e0b misc minor bug fixes to GenotypingEngine. 2012-12-03 12:47:35 -05:00
Ryan Poplin 18b002c99c Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 10:08:56 -05:00
Ryan Poplin 1bdf17ef53 Reworking of how the likelihood calculation is organized in the HaplotypeCaller to facilitate the inclusion of per allele downsampling. We now use the downsampling for both the GL calculations and the annotation calculations. 2012-12-02 11:58:32 -05:00
Mark DePristo 2849889af5 Updating md5 for UG 2012-12-01 14:24:19 -05:00
Joel Thibault 198923b597 Add ActiveRegionReadState handling 2012-11-28 13:59:57 -05:00
Mark DePristo c676853731 Merged bug fix from Stable into Unstable. Updating md5s
Conflicts:
	protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java
2012-11-28 12:54:36 -05:00
Mark DePristo a1d6461121 Critical bugfix to AFCalcResult affecting UG/HC quality score emission thresholds
As reported by Menachem Fromer: a critical bug in AFCalcResult:

Specifically, the implementation:
    public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) {
        return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef;
    }

seems incorrect and should probably be:

getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef

The issue here is that the 30 represents a Phred-scaled probability of *error* and it's currently being compared to a log probability of *non-error*.

Instead, we need to require that our probability of error be less than the error threshold.
This bug has only a minor impact on the calls -- hardly any sites change -- which is good.  But the inverted logic affects multi-allelic sites significantly.  Basically you only hit this logic with multiple alleles, and in that case it's including extra alt alleles incorrectly, and throwing out good ones.

Change was to create a new function that properly handles thresholds that are PhredScaled quality scores:

    /**
     * Same as #isPolymorphic but takes a phred-scaled quality score as input
     */
    public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) {
        if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 ");
        final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual));
        return isPolymorphic(allele, log10Threshold);
    }
2012-11-28 12:08:02 -05:00
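
The threshold conversion described in the commit above can be illustrated with a small standalone sketch. It assumes that `QualityUtils.qualToProb` maps a Phred-scaled quality Q to the probability of no error, 1 - 10^(-Q/10); the commit message does not show that method, so this is an assumption. The class `PhredThresholdSketch` and its method names are hypothetical and are not part of the code base.

```java
// Sketch only: PhredThresholdSketch and its methods are hypothetical names,
// not part of the GATK code base.
final class PhredThresholdSketch {
    /** Assumed convention: probability that a call with this Phred quality is NOT an error. */
    static double qualToProb(final double phredQual) {
        return 1.0 - Math.pow(10.0, -phredQual / 10.0);
    }

    /** Converts a Phred-scaled error threshold into a log10 threshold on P(AF > 0). */
    static double log10ThresholdFromPhred(final double phredQual) {
        if (phredQual < 0) throw new IllegalArgumentException("phredQual " + phredQual + " < 0");
        return Math.log10(qualToProb(phredQual));
    }

    /** An allele passes when log10 P(AF > 0) is at least the converted threshold. */
    static boolean isPolymorphic(final double log10PosteriorAFGt0, final double phredQual) {
        return log10PosteriorAFGt0 >= log10ThresholdFromPhred(phredQual);
    }

    public static void main(final String[] args) {
        // Q30 means P(error) = 0.001, so P(AF > 0) must be at least 0.999.
        System.out.println(log10ThresholdFromPhred(30.0));
        System.out.println(isPolymorphic(Math.log10(0.9995), 30.0)); // above the threshold
        System.out.println(isPolymorphic(Math.log10(0.99), 30.0));   // below the threshold
    }
}
```

The point of the sketch is that both sides of the comparison end up on the same scale (log10 probability of non-error), which is what the original comparison against a raw Phred value of 30 did not do.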
Mauricio Carneiro 97fd5de260 Merging latest CMI updates with UNSTABLE 2012-11-27 09:08:00 -05:00
Ryan Poplin 59cef880d1 Updating HC integration tests because experimental, HC-specific annotations have been removed. 2012-11-26 12:20:07 -05:00
Ryan Poplin c3b7dd1374 Misc cleanup in the HaplotypeCaller. Cleaning up unused arguments after recent changes to HC-GenotypingEngine 2012-11-26 12:19:11 -05:00
Ryan Poplin fedc4fde6c Merged bug fix from Stable into Unstable 2012-11-25 21:55:55 -05:00
Ryan Poplin d978cfe835 Soft clipped bases shouldn't be counted in the delocalized BQSR. 2012-11-25 21:55:29 -05:00
Eric Banks 937ac7290f Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results. 2012-11-20 16:13:29 -05:00
Eric Banks f0b8a0228f Quick fix for HC refactoring: when copying over Haplotype objects, make sure to copy over the artificial allele used to create it too. 2012-11-19 09:57:55 -05:00
Eric Banks ff180a8e02 Significant refactoring of the Haplotype Caller to handle problems with GGA. The main fix is that we now maintain a mapping from 'original' allele to 'Smith-Waterman-based' allele so that we no longer need to do a (buggy) matching throughout the calling process. 2012-11-19 09:09:57 -05:00
Mauricio Carneiro 8b749673bc centralize header element removal in reduce reads 2012-11-14 13:59:34 -05:00
Mauricio Carneiro e35fd1c717 Merging CMI-0.5.0 and GATK-2.2 together. 2012-11-14 10:42:03 -05:00
Mauricio Carneiro a079d8d0d1 Breaking the utility to write @PG tags for SAMFileWriters and StingSAMFileWriters 2012-11-14 10:33:22 -05:00
Mauricio Carneiro dba31018f4 Implementation of BySampleSAMFileWriter
ReduceReads now works with the n-way-out capability, splitting by sample.
DEV-27 #resolve #time 3m
2012-11-14 10:33:22 -05:00
Mauricio Carneiro a17cd54b68 Co-Reduction implementation in ReduceReads
ReduceReads now co-reduces bams if they're passed in together with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples.
Also:
  * Added integrationtest for co-reduction
  * Fixed bug with new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5's accordingly)

DEV-200 #resolve #time 8m
2012-11-14 10:33:21 -05:00
Eric Banks e93d461910 Adding integration test to BQSR for the csv file 2012-11-09 09:11:04 -05:00
Eric Banks 2da76db945 Updating integration tests 2012-11-06 22:23:05 -08:00
Eric Banks 0a2dded093 Fixes for bugs uncovered by unit tests 2012-11-06 16:07:40 -08:00
Eric Banks b07106b3a7 Reimplement the allele biased downsampling to be smarter. Now we don't blindly pull n% of reads off of each allele. Instead, we try all possible genotype conformations for the contaminating sample and choose the one that provides the best genotype for the target sample (based heuristically on allele balance). This method allows us to save some of the reads that belong to the target sample, which should make Daniel M happy. Added unit tests to test the biased downsampling functionality. 2012-11-06 14:39:58 -08:00
Mark DePristo 1444cd753b Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome
-- The logic for determining active regions was a bit broken in the HC when intervals were used in the system
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered.  Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determining active / inactive state.  This means that each exon will stand on its own when deciding if it should be active or inactive
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
2012-11-01 15:34:04 -04:00
Eric Banks f8af8a2355 Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite. 2012-10-31 21:28:07 -04:00
Guillermo del Angel 51a9ce28e1 Merge remote-tracking branch 'unstable/master' into develop 2012-10-31 10:29:48 -04:00
Ryan Poplin 4e661847b2 DelocalizedBaseRecalibrator becomes the BaseRecalibrator. 2012-10-29 12:53:39 -04:00
David Roazen 35483a7eef Update MD5s for PrintReads with BQSR Integration Test
The MD5s for these tests were changed in commit 87435f1074615b2cd016f042980109fd53962c8d
to match the output of a broken version of BaseRecalibration. With the patch in
commit c397102ecc1fd1d2cd8f209a8f358ab4a60b50a7, the output once again matches the
*original* MD5s for these tests, and does not vary as you increase -nct.

Final resolution to GSA-632
2012-10-26 14:25:25 -04:00
Eric Banks ed11b7dab2 Fix UG parallelization test 2012-10-26 12:10:44 -04:00
Eric Banks 7a706ed345 Fix some of the broken integration tests 2012-10-26 11:23:44 -04:00
Eric Banks b06f689d4b Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-26 02:13:26 -04:00
Eric Banks a53e03d525 Do not let reduced reads get removed in the contamination down-sampling 2012-10-26 02:13:04 -04:00
Eric Banks bf3d61ce82 The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it. 2012-10-26 01:04:51 -04:00
Mark DePristo cc8c12b954 Committing a broken version of BaseRecalibration
-- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails.  But that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
2012-10-25 14:46:35 -04:00
Eric Banks df9e0b7045 Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-25 02:49:54 -04:00
Eric Banks 72714ee43e Minor patches to get the contamination down-sampling working for indels. Adding @Hidden logging output for easy debugging. 2012-10-25 02:47:42 -04:00
Eric Banks c6b57fffda Added allele biased down-sampling capabilities to the PerReadAlleleLikelihoodMap object, which means that both the UG and HC can use this functionality. Note that it's only available in protected, so GATK-lite users won't be allowed to enable it. Needs more testing. 2012-10-24 22:52:25 -04:00
Eric Banks 9da7bbf689 Refactoring the PerReadAlleleLikelihoodMap in preparation for adding contamination downsampling into protected only. 2012-10-24 15:49:07 -04:00
David Roazen 02018ca764 Legacy BaseRecalibrator walker is neither TreeReducible nor NanoSchedulable
The old BaseRecalibrator walker is not and never will be thread-safe, since it's a
LocusWalker that uses read attributes to track state.

ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe
at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1
for testing purposes, but wait for further testing to be done before using it
for production purposes in multithreaded mode.
2012-10-24 15:22:50 -04:00
David Roazen 991658acf4 BQSR: use more granular locking for concurrency control
-With this change, BQSR performance scales properly by thread rather
 than gaining nothing from additional threads.
-Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct
 (NanoScheduler)
-Removes high-level locks in the recalibration engines and NestedIntegerArray
 in favor of maximally-granular locks on and around manipulation of the leaf
 nodes of the NestedIntegerArray.
-NestedIntegerArray now creates all interior nodes upfront rather than on
 the fly to avoid the need for locking during tree traversals. This uses
 more memory in the initial part of BQSR runs, but the BQSR would eventually
 converge to use this memory anyway over the course of a typical run.

IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator
walker with multiple threads. The BaseRecalibrator walker is not and never will be
thread-safe, as it's a LocusWalker that uses read attributes to track
state information. ONLY the newer DelocalizedBaseRecalibrator can be made
thread-safe (and will hopefully be made so in my subsequent commits). This
commit addresses performance, not correctness.
2012-10-24 15:22:50 -04:00
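
The locking scheme described in the commit above (all nodes created up front, locks held only around leaf updates) can be sketched roughly as follows. This is not the NestedIntegerArray implementation: `LeafLockedCountsSketch`, its two-dimensional layout, and the counter names are hypothetical and chosen only to keep the example small.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: a hypothetical 2-D table standing in for a nested array of counters.
final class LeafLockedCountsSketch {
    /** One leaf: two counters that must change together, guarded by the leaf's own monitor. */
    static final class Leaf {
        private long observations;
        private long mismatches;

        synchronized void record(final boolean isMismatch) {
            observations++;
            if (isMismatch) mismatches++;
        }

        synchronized long[] snapshot() {
            return new long[] { observations, mismatches };
        }
    }

    private final Leaf[][] leaves;

    /** Every leaf is allocated here, so later lookups never mutate the table structure. */
    LeafLockedCountsSketch(final int dim0, final int dim1) {
        leaves = new Leaf[dim0][dim1];
        for (int i = 0; i < dim0; i++)
            for (int j = 0; j < dim1; j++)
                leaves[i][j] = new Leaf();
    }

    /** Traversal takes no lock; only the leaf update is synchronized. */
    void record(final int i, final int j, final boolean isMismatch) {
        leaves[i][j].record(isMismatch);
    }

    long[] snapshot(final int i, final int j) {
        return leaves[i][j].snapshot();
    }

    /** Hammers a single leaf from several threads and returns its final counters. */
    static long[] runDemo(final int nThreads, final int updatesPerThread) {
        final LeafLockedCountsSketch table = new LeafLockedCountsSketch(4, 4);
        final ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int t = 0; t < nThreads; t++) {
            pool.submit(() -> {
                for (int k = 0; k < updatesPerThread; k++) table.record(1, 2, k % 2 == 0);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) throw new IllegalStateException("timed out");
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return table.snapshot(1, 2);
    }

    public static void main(final String[] args) {
        final long[] result = runDemo(4, 1000);
        System.out.println(result[0] + " observations, " + result[1] + " mismatches");
    }
}
```

Threads that update different leaves never contend, which is the property the commit relies on for scaling. The trade-off is the one the message names: allocating every leaf up front costs memory early in the run.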
Ryan Poplin a27ee26481 updating HC integration test. 2012-10-24 14:08:39 -04:00
Ryan Poplin 094db7bf24 We now require at least 10 samples to merge variants into complex events in the HC. Added a new population based bam for the complex event integration test. 2012-10-24 14:07:36 -04:00
Mauricio Carneiro 4cd1a92358 Updating RR integration tests
Forgot to update the integration tests after merging DEV-117 with optimizations from GATK main repo.
2012-10-23 11:26:26 -04:00
Mauricio Carneiro c210b7cde4 Merge GATK repo into CMI-GATK
Bringing in the following relevant changes:
	* Fixes the indel realigner N-Way out null pointer exception DEV-10
	* Optimizations to ReduceReads that bring the run time to 1/3rd.

Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java

DEV-10 #resolve #time 2m
2012-10-23 10:59:11 -04:00
Mauricio Carneiro bbf7a0fb09 Adding integration test to ReduceReads coreduction
DEV-117 #resolve
2012-10-23 10:56:33 -04:00
Mark DePristo 90f59803fd MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles
-- All constructors, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic.  That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele.  The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating.  This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime.  Might be worth it though
2012-10-22 13:47:56 -04:00
Eric Banks ccae6a5b92 Fixed the RR bug I (knowingly) introduced last week: turns out we can't trust a context size's worth of data from the previous marking. I think Mauricio warned me about this but I forgot. 2012-10-22 11:48:34 -04:00
Mark DePristo 9f2851d769 Updating UnifiedGenotyperGeneralPloidyIntegrationTest following rebasing
-- Created a JIRA ticket https://jira.broadinstitute.org/browse/GSA-623 for Guillermo to look at the differences as the multi-allelic nature of many sites seems to change with the new more protected infrastructure.  This may be due to implementation issues in the pooled caller, problems with my interface, or could be a genuine improvement.
2012-10-21 20:23:11 -04:00
Mark DePristo d21e42608a Updating integration tests for minor changes due to switching to EXACT_INDEPENDENT model by default 2012-10-21 12:43:46 -04:00
Mark DePristo 0fcd358ace Original EXACT model implementation lives, providing another reference (bi-allelic only) EXACT model
-- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case
-- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations
-- Added systematic unit tests covering this implementation, and comparing it to others
-- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0, even after summing up multiple likelihoods, which is just not safe to do and was causing us to lose likelihood in some cases
-- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right
2012-10-21 12:42:31 -04:00
Mark DePristo eaffb814d3 IndependentExactAFCalc is now the default EXACT model implementation
-- Changed UG / HC to use this one via the StandardCallerArgumentCollection
-- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums
2012-10-21 12:42:31 -04:00
Mark DePristo 326f429270 Bugfixes to make new AFCalc system pass integrationtests
-- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests
-- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel, return a null VC when the only allele we get from our final alleles to use method is the reference base
-- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF == 0) = X for some meaningful value of X.  Added unit test to ensure this behavior is correct
-- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order.  Add contract to ensure this doesn't ever happen again
-- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model
2012-10-21 12:42:31 -04:00
Mark DePristo 695cf83675 More docs and contracts for classes in genotyper.afcalc
-- Future protection of the output of GeneralPloidyExactAFCalc, which produces in some cases bad likelihoods (positive values)
2012-10-21 12:42:31 -04:00
Mark DePristo 99c9031cb4 Merge AFCalcResultTracker into StateTracker, cleanup
-- These two classes were really the same, and now they are actually the same!
-- Cleaned up the interfaces, removed duplicate data
-- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed)
-- Moved goodProbability and goodProbabilityVector utilities to MathUtils.  Very useful for contracts!
2012-10-21 12:42:31 -04:00
Guillermo del Angel e9b7324dc1 Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-21 12:38:49 -04:00
Guillermo del Angel 67b9e7319e Fix for integration tests: new criterion in AF exact calculation model to trim alleles based on likelihoods does produce better results and resulting alleles changed in 2 sites at integration tests (and all subsequent sites after this had minor annotation differences due to RankSum dithering) 2012-10-21 12:38:33 -04:00
Eric Banks 0616b98551 Not sure why we were setting the UAC variables instead of the simpleUAC ones when that's what we wanted. 2012-10-21 08:26:26 -04:00
Eric Banks 2c624f76c8 Refactoring the Unified (and Standard) Argument Collections because it was really ugly that the subclass had to do all the cloning for the super class. The clone() method is really not recommended best practice in Java anyways, so I changed it so that we use standard overloaded constructors. Confirmed that the Haplotype Caller --help docs do not include UG-specific arguments. 2012-10-20 20:35:54 -04:00
Ryan Poplin a647f1e076 Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore. 2012-10-20 16:38:18 -04:00
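
The haplotype ordering mentioned in the commit above can be sketched as follows: sort the haplotypes lexicographically, then for each one find the first base that differs from its predecessor, since matrix columns before that index could in principle be reused. The class `HaplotypeOrderSketch` and its methods are hypothetical, and the sketch does not include the HMM itself.

```java
import java.util.Arrays;

// Sketch only: hypothetical helper, not the HaplotypeCaller implementation.
final class HaplotypeOrderSketch {
    /** Index of the first position where a and b differ, or the shorter length if one is a prefix. */
    static int firstDifference(final byte[] a, final byte[] b) {
        final int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++)
            if (a[i] != b[i]) return i;
        return n;
    }

    /** Lexicographic comparison of base arrays; a proper prefix sorts first. */
    static int compareBases(final byte[] a, final byte[] b) {
        final int i = firstDifference(a, b);
        if (i < Math.min(a.length, b.length)) return Byte.compare(a[i], b[i]);
        return Integer.compare(a.length, b.length);
    }

    /**
     * Sorts the haplotypes in place and returns, for each one, the first column
     * that differs from the previous haplotype (0 for the first haplotype).
     */
    static int[] restartColumns(final byte[][] haplotypes) {
        Arrays.sort(haplotypes, HaplotypeOrderSketch::compareBases);
        final int[] restart = new int[haplotypes.length];
        for (int h = 1; h < haplotypes.length; h++)
            restart[h] = firstDifference(haplotypes[h - 1], haplotypes[h]);
        return restart;
    }

    public static void main(final String[] args) {
        final byte[][] haplotypes = {
            "ACGTA".getBytes(), "ACGAA".getBytes(), "ACTTA".getBytes(), "ACGTT".getBytes()
        };
        System.out.println(Arrays.toString(restartColumns(haplotypes)));
    }
}
```

Sorting places haplotypes with long shared prefixes next to each other, so the restart index tends to be later and less of each matrix needs to be recomputed.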
Eric Banks 4622896312 Oops, killed contracts 2012-10-19 13:04:05 -04:00
Eric Banks f7bd4998fc No need for dummy GLs 2012-10-19 12:13:59 -04:00
Eric Banks deca564aef Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-19 12:01:49 -04:00
Eric Banks d3cf37dfaf Bug fix for general ploidy model: when choosing the most likely alternate allele(s), you need to weight the likelihood mass by the ploidy of the specific alleles (otherwise all alt alleles will have the same probability). This fixes Yossi's issue with pooled validation calling. This may break integration tests, but I will leave that to GdA to handle. 2012-10-19 12:01:45 -04:00
Eric Banks 27d8d3f51e RR optimization: don't recalculate the entire bitset of variant sites for every read added to the sliding window. Instead, reuse as much of the previously calculated bitset as you can (basically from the window start until the start of the new read minus the context size). In some awfully performing regions this cuts down the runtime in half, although in others this doesn't seem to help much (so clearly something else is going on). Note that I still need to fix one last bug here, but it's almost done. 2012-10-19 11:59:34 -04:00
Ryan Poplin b4e69239dd In order to be considered an informative read in the PerReadAlleleLikelihoodMap it has to be informative compared to all other alleles not just the worst allele. Also, fixing a bug when there is only one allele in the map. 2012-10-18 14:31:15 -04:00
Eric Banks 20ffbcc86e RR optimization: profiling was showing that the BaseCounts class was a major bottleneck because the underlying implementation was a HashMap. Given that the map index was an indexable Enum anyways, it makes a lot more sense to implement as a native array. Knocks 30% off the runtime in bad regions. 2012-10-17 21:44:53 -04:00
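
A minimal sketch of the idea in the commit above: replace a HashMap keyed by an enum with a plain array indexed by the enum's ordinal. The class `BaseCountsSketch` and its `Base` enum are hypothetical stand-ins, not the real BaseCounts class. The sketch also keeps a running total, in the spirit of the related "don't keep computing it from scratch" commit further down this log.

```java
// Sketch only: hypothetical stand-in for an enum-keyed counter class.
final class BaseCountsSketch {
    /** Hypothetical base enum; the real code may define different constants. */
    enum Base { A, C, G, T, N }

    // One slot per enum constant, addressed by ordinal(): no hashing, no boxing.
    private final int[] counts = new int[Base.values().length];
    // Maintained on every update so that total() is O(1).
    private int total;

    void incr(final Base base) {
        counts[base.ordinal()]++;
        total++;
    }

    int count(final Base base) {
        return counts[base.ordinal()];
    }

    int total() {
        return total;
    }

    public static void main(final String[] args) {
        final BaseCountsSketch counts = new BaseCountsSketch();
        counts.incr(Base.A);
        counts.incr(Base.A);
        counts.incr(Base.T);
        System.out.println(counts.count(Base.A) + " of " + counts.total());
    }
}
```

`java.util.EnumMap` would be a standard-library alternative that is also array-backed, but it stores boxed `Integer` values, so a primitive `int[]` is the leaner choice for hot counting code.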
Mauricio Carneiro 32ee2c7dff Refactored the compression interface per sample in ReduceReads
The CompressionStash is now responsible for keeping track of all intervals that must be kept uncompressed by all samples. In general this is a list generated by a tumor sample that will enforce all normal samples to abide.
  - Updated ReduceReads integration tests
  - Sliding Window is now using the CompressionStash (single sample).

DEV-104 #resolve #time 3m
2012-10-17 16:40:40 -04:00
Mauricio Carneiro b57df6cac8 Bringing CMI changes into the main GATK repo.
Merge remote-tracking branch 'cmi/master'
2012-10-17 15:23:19 -04:00
Mark DePristo fa93681f51 Scalability test for EXACT models 2012-10-17 14:15:11 -04:00
Mark DePristo c9e7a947c2 Improve interface of ExactCallLogger, use it to have a more informative AFCalcPerformanceTest 2012-10-17 14:15:11 -04:00
Eric Banks 33df1afe0e More BaseCounts optimizations for RR. 2012-10-17 00:55:44 -04:00
Eric Banks 19e2b5f0d5 RR optimization: since total count in BaseCounts is requested so often, don't keep computing it from scratch each time. 2012-10-17 00:44:23 -04:00
Mark DePristo 9bcefadd4e Refactor ExactCallLogger into a separate class
-- Update minor integration tests with NanoSchedule due to qual accuracy update
2012-10-16 13:30:09 -04:00
Mark DePristo c74d7061fe Added AFCalcResultUnitTest
-- Ensures that the posteriors remain within reasonable ranges.  Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good.  Now tests ensure that the normalization process preserves log10 precision where possible
-- Updated MathUtils to make this possible
2012-10-16 08:11:06 -04:00
Mark DePristo 9b0ab4e941 Cleanup IndependentAllelesDiploidExactAFCalc
-- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all
-- Added lots of contracts and docs, still more to come.
-- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't just do the default thing
-- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.
2012-10-16 08:11:06 -04:00
Mark DePristo d1511e38ad Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest
-- Superseded by IndependentAFCalc
-- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it.
-- A few misc. bug fixes discovered during running the performance test
2012-10-16 08:11:06 -04:00
Ryan Poplin 31be807664 Updating missed integration test. 2012-10-15 22:31:52 -04:00
Ryan Poplin d27ae67bb6 Updating the multi-step UG integration test. 2012-10-15 22:30:01 -04:00
Mauricio Carneiro a234bacb02 Making nContigs parameter hidden in ReduceReads
For now, the het reduction should only be performed for diploids (n=2). We haven't really tested it for other ploidy so it should remain hidden until someone braves it out.
2012-10-15 13:49:08 -04:00
Ryan Poplin 25be94fbb8 Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10. 2012-10-15 13:24:32 -04:00
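
For reference, the quantity that a log10-sum-of-log10 routine computes is log10(10^x1 + ... + 10^xn). The sketch below is an exact (non-approximate) version using the usual max-shift trick to avoid underflow. It does not reproduce how `MathUtils.approximateLog10SumLog10` achieves its approximation, which this log does not show; `Log10SumSketch` is a hypothetical name.

```java
// Sketch only: exact computation, not the approximate routine named in the commit.
final class Log10SumSketch {
    /** Returns log10 of the sum of 10^v over all v in log10Values. */
    static double log10SumLog10(final double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (final double v : log10Values) max = Math.max(max, v);
        // Empty input or all-zero probabilities: the sum is 0, whose log10 is -Infinity.
        if (max == Double.NEGATIVE_INFINITY) return max;

        // Shift by the maximum so the largest term is 10^0 = 1 and nothing underflows to 0.
        double sum = 0.0;
        for (final double v : log10Values) sum += Math.pow(10.0, v - max);
        return max + Math.log10(sum);
    }

    public static void main(final String[] args) {
        System.out.println(log10SumLog10(new double[] { -1.0, -1.0 }));
        // A naive 10^-1000 + 10^-1000 would underflow to 0; the shifted form does not.
        System.out.println(log10SumLog10(new double[] { -1000.0, -1000.0 }));
    }
}
```

An approximate implementation trades some of this accuracy for speed, which is why tightening its tolerance from 1E-3 to 1E-4 can shift downstream genotyper outputs slightly.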
Mark DePristo dcf8af42a8 Finalizing IndependentAllelesDiploidExactAFCalc
-- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors
-- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef.  Not ideal.  Finally caught by some tests, but good god it almost made it into the code
-- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0
-- Massive new suite of unit tests to ensure that bi-allelic and tri-allelic events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly.  ID'd some of the bugs below
-- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests
-- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when there were no informative GLs.
2012-10-15 08:21:03 -04:00
Mark DePristo 6b639f51f0 Finalizing new exact model and tests
-- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior.
-- Tests that theta^n.alt.alleles is being applied correctly
-- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java
-- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine
2012-10-15 07:53:57 -04:00
Mark DePristo 2d72265f7d AFCalcUnit test a more appropriate name 2012-10-15 07:53:57 -04:00
Mark DePristo cb857d1640 AFCalcs must be made by factory method now
-- AFCalcFactory is the only way to make AFCalcs now.  There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions.  The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles.
-- AFCalc no longer has UAC constructor -- only AFCalcFactory does.  Code cleanup throughout
-- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next)
-- It's now possible to run the UG / HC with any of the exact models currently in the system.
-- Code cleanup throughout the system, reorganizing the unit tests in particular
2012-10-15 07:53:56 -04:00
Mark DePristo 6bbe750e03 Continuing work on IndependentAllelesDiploidExactAFCalc
-- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly.  A long way towards the right answer now, but still not there
-- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan
-- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there
-- New convenience method in VariantContext that looks up the allele index in the alleles
2012-10-15 07:53:56 -04:00
Mark DePristo 176b74095d Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation
-- Still not working, but I know what's wrong
-- Many tests disabled, that need to be re-enabled
2012-10-15 07:53:56 -04:00
Mark DePristo 91aeddeb5a Steps on the way to a fully described and semantically meaningful AFCalcResult
-- AFCalcResult now sports isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold
-- Lots of contracts for AFCalcResult
-- Slowly killing off AFCalcResultsTracker
-- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele
-- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system
-- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.
2012-10-15 07:53:56 -04:00
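The change to normalizeFromLog10 described above can be illustrated with a minimal sketch. This is not the MathUtils implementation: the class name, method name, and constant name are hypothetical, and only the -10000 floor comes from the commit message. The sketch assumes at least one input value is finite.

```java
public final class Log10NormalizeSketch {
    // Floor taken from the commit message; the constant name is hypothetical.
    static final double LOG10_FLOOR = -10000.0;

    // Normalizes log10 values so they sum to 1 in real space, keeping the
    // result in log10 space. Assumes at least one entry is finite.
    static double[] normalizeKeepLog10(double[] log10Values) {
        // Subtract the maximum before exponentiating to avoid underflow.
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        for (double v : log10Values) {
            sum += Math.pow(10.0, v - max);
        }
        double log10Sum = Math.log10(sum);

        double[] out = new double[log10Values.length];
        for (int i = 0; i < log10Values.length; i++) {
            double normalized = log10Values[i] - max - log10Sum;
            // A probability of 0 would be -Infinity here; clamp it to the floor.
            out[i] = Math.max(normalized, LOG10_FLOOR);
        }
        return out;
    }
}
```

With this clamp, downstream arithmetic on the normalized values never sees -Infinity, which is the stated motivation for the change.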
Mark DePristo 4f1b1c4228 Intermediate commit II on simplifying AFCalcResult
-- All of the code now uses the AFCalc object, not the now package-protected AFCalcResultTracker.  Nearly all unit tests pass; the exception is one that fails a contract due to -Infinity values from normalizeLog10, which will be dealt with in a subsequent commit.
-- Changed the way that UnifiedGenotyper decides if the best model is non-ref.  Previously looked at the MAP AC, but the MAP AC values are no longer provided by AFCalcResult.  This is on purpose, because the MAP isn't a meaningful quantity for the exact model (i.e., everything is going to go to MLE AC in some upcoming commit).  If you want to understand why come talk to me.  Now uses the isPolymorphic function and the EMIT confidence, so that if pNonRef > EMIT then the site is poly, otherwise it's mono.
2012-10-15 07:53:56 -04:00
Mark DePristo 06687bfaf6 Intermediate commit on simplifying AFCalcResult
-- Renamed old class to AFCalcResultTracker.  This object is now allocated by the AFCalc itself, since it is heavy-weight and was badly optimized in the UG with a thread-local variable. Now, since there's already an AFCalc thread-local there, we get that optimization for free.
-- Removed the interface to provide the AFCalcResultTracker to getlog10PNonRef.
-- Wrote new, clean but unused AFCalcResult object that will soon replace the tracker as the external interface to the AFCalc model results, leaving the tracker as an internal tracker structure.  This will allow me to (1) finally test things exhaustively, as the contracts on this class are clear (2) finalize the IndependentAllelesDiploidExactAFCalc class as it can work with a meaningfully defined result across each object
2012-10-15 07:53:56 -04:00
Mark DePristo c82aa01e0e Generalize testing infrastructure to allow us to run specific n.samples calculation 2012-10-15 07:53:55 -04:00
Mark DePristo ec935f76f6 Initial implementation and tests for IndependentAllelesDiploidExactAFCalc
-- This model separates each of N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic case, and runs the exact model on each independently to handle the multi-allelic case.  This is very fast, scaling at O(n.alt.alleles x n.samples)
-- Many outstanding TODOs in order to truly pass unit tests
-- Added proper unit tests for the pNonRef calculation, which all of the models pass
2012-10-15 07:53:55 -04:00
Mark DePristo 5a4e2a5fa4 Test code to ensure that pNonRef is being computed correctly for at least 1 genotype, bi and tri allelic 2012-10-15 07:53:55 -04:00
Mark DePristo ee2f12e2ac Simpler naming convention for AlleleFrequencyCalculation => AFCalc 2012-10-15 07:53:55 -04:00
Mark DePristo cf3f9d6ee8 Reorganize and cleanup AFCalculations
-- Now contained in a package called afcalc
-- Extracted standalone classes from private static classes in ExactAF
-- Most fields are now private, with accessors
-- Overall cleaner organization now
2012-10-15 07:53:55 -04:00
Mark DePristo 13211231c7 Restructure and cleanup ExactAFCalculations
-- Now there's no duplication between exact old and constrained models.  The behavior is controlled by an overloaded abstract function
-- No more static function to access the linear exact model -- you have to create the surrounding class.  Updated code in the system
-- Everything passes unit tests
2012-10-15 07:53:54 -04:00
Mark DePristo 99ad7b2d71 GeneralPloidyExact should use indel max alt alleles 2012-10-15 07:53:54 -04:00
Mark DePristo bf276baca0 Don't try to compute full exact model for > 100 samples 2012-10-15 07:53:54 -04:00
Mark DePristo b924e9ebb4 Add OptimizedDiploidExactAF to PerformanceTesting framework 2012-10-15 07:53:54 -04:00
Mark DePristo f800f3fb88 Optimized diploid exact AF calculation uses maxACs to stop the calculation by maxAC by allele
-- Added unit tests to ensure the approximation isn't so far from our reference implementation (DiploidExactAFCalculation)
2012-10-15 07:53:54 -04:00
Mark DePristo efad215edb Greedy version of function to compute the max achievable AC for each alt allele
-- walks over the genotypes in VC, and computes for each alt allele the maximum AC we need to consider in that alt allele dimension.  Does the calculation based on the PLs in each genotype g, choosing to update the max AC for the alt alleles corresponding to that PL.  Only takes the first lowest PL, if there are multiple genotype configurations with the same PL value.  It takes values in the order of the alt alleles.
2012-10-15 07:53:54 -04:00
Mark DePristo 7666a58773 Function to compute the max achievable AC for each alt allele
-- Additional minor cleanup of ExactAFCalculation
2012-10-15 07:53:53 -04:00
Guillermo del Angel 5971006678 Bug fix when running nondiploid mode in UG with EMIT_ALL_SITES: if site was reference-only, QUAL is produced OK but genotypes were being set to no-call because of unnecessary likelihood normalization. May change integration test md5 which I'll fix later today 2012-10-12 12:45:55 -04:00
Ryan Poplin 2a9ee89c19 Turning on allele trimming for the haplotype caller. 2012-10-10 10:47:26 -04:00
Eric Banks be9fcba546 Don't allow triggering of polyploid consensus creation in regions where there is more than one het, as it just doesn't work properly. We could probably refactor at some point to make it work, but it's not worth doing that now (especially as it should be rare to have multiple proximal known hets in a single sample exome). 2012-10-07 16:32:48 -04:00
Eric Banks 08ac80c080 RR bug: when the last base in the window around the polyploid consensus is filtered (low quality), the filtered consensus is not flushed and subsequent filtered bases (but importantly not contiguous to this one) are just added to this position. In other words, bases were being added to the wrong genomic positions. Fixed. 2012-10-07 10:52:01 -04:00
Eric Banks e8a6460a33 After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too. 2012-10-05 16:37:42 -04:00
Yossi Farjoun ef90beb827 - forgot to use git rm to delete a file from git. Now that VCF is deleted.
- uncommented a HC test that I missed.
2012-10-05 16:14:51 -04:00
Yossi Farjoun d419a33ed1 * Added an integration test for AD annotation in the Haplotype caller.
* Corrected FS Annotation for UG as for AD.
* HC still does not annotate ReducedReads correctly (for FS nor AD)
2012-10-05 15:23:59 -04:00
Eric Banks f840d9edbd HC test should continue using 3 alt alleles for indels 2012-10-05 02:03:34 -04:00
Eric Banks c66ef17cd0 Add a separate max alt alleles argument for indels that defaults to 2 instead of 3. PLEASE TAKE NOTE. 2012-10-04 13:52:14 -04:00
Eric Banks e13e61673b Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-04 10:54:23 -04:00
Eric Banks dfddc4bb0e Protect against cases where there are counts but no quals 2012-10-04 10:52:30 -04:00
Eric Banks 0c46845c92 Refactored the BaseCounts classes so that they are safer and allow for calculations on the most probable base (which is not necessarily the most common base). 2012-10-04 10:37:11 -04:00
Mark DePristo b6e20e083a Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact
-- Will be removed.  Only committing now to fix public -> private dependency
2012-10-03 20:16:38 -07:00
Mark DePristo 51cafa73e6 Removing public -> private dependency 2012-10-03 20:05:03 -07:00
Mark DePristo f8ef4332de Count the number of evaluations in AFResult; expand unit tests
-- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples
-- Added unit tests for priors (flat and human)
-- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)
2012-10-03 19:55:11 -07:00
Mark DePristo de941ddbbe Cleanup Exact model, better unit tests
-- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic).  More assert statements to ensure quality of the result.
-- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts.  Made mutation functions all protected
-- No longer need to call reset on your AlleleFrequencyCalculationResult -- it's done for you in the calculation function.  reset is a protected method now, so it's all cleaner and nicer this way
-- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef
2012-10-03 19:55:11 -07:00
Mark DePristo 3e01a76590 Clean up AlleleFrequencyCalculation classes
-- Added a true base class that only does truly common tasks (like manage call logging)
   -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract
   -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact
   -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation
   -- All unit tests pass
2012-10-03 19:55:11 -07:00
Mark DePristo 1c52db4cdd Add exactCallsLog output file to ExactModel and StandardCallerArgumentCollection
-- This allows us to log all of the information about the exact model call (alleles, priors, PLs, result, and runtime) to a file for later debugging / optimization
2012-10-03 19:55:11 -07:00
Eric Banks 2df5be702c Added an argument to RR to allow polyploid consensus creation (by default it is turned off). This will eventually be replaced by the known SNPs track trigger. 2012-09-28 11:44:25 -04:00
Eric Banks 11a71e0390 RR bug: when determining the most common base at a position, break ties by which base has the highest sum of base qualities. Otherwise, sites with 1 Q2 N and 1 Q30 C are ending up as Ns in the consensus. I think perhaps we don't even care about which base has the most observations - it should just be determined by which has the highest sum of base qualities - but I'm not sure that's what users would expect. 2012-09-24 21:46:14 -04:00
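The tie-breaking rule described in the commit above can be sketched as follows. This is an illustration only, not the ReduceReads code: the class, method, and parameter names are hypothetical, and the per-base counts and quality sums are assumed to be supplied as parallel arrays.

```java
public final class MostCommonBaseSketch {
    // Returns the base with the most observations; ties are broken by the
    // highest sum of base qualities. All names here are hypothetical.
    static char mostCommonBase(char[] bases, int[] counts, long[] qualSums) {
        int best = 0;
        for (int i = 1; i < bases.length; i++) {
            boolean moreObservations = counts[i] > counts[best];
            boolean tieWithHigherQuals =
                    counts[i] == counts[best] && qualSums[i] > qualSums[best];
            if (moreObservations || tieWithHigherQuals) {
                best = i;
            }
        }
        return bases[best];
    }
}
```

For the case in the commit message, one Q2 N and one Q30 C tie on count, so the higher quality sum selects C rather than N.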
Eric Banks 6a73265a06 RR bug: we were adding synthetic reads from the header only before the variant region, which meant that reads that overlap the variant region but that weren't used for the consensus (because e.g. of low base quality for the spanning base) were never being used at all. Instead, add synthetic reads from before and spanning the variant region. 2012-09-24 13:29:37 -04:00
Eric Banks ef680e1e13 RR fix: push the header removal all the way into the inner loops so that we literally remove a read from the general header only if it was added to the polyploid header. Add comments. 2012-09-24 11:14:18 -04:00
Eric Banks 0187f04a90 Proper fix for a previous RR bug fix: only remove reads from the header if they were actually used in the creation of the polyploid consensus. 2012-09-23 00:39:19 -04:00
Eric Banks 344083051b Reverting the fix to the generalized ploidy exact model since it cannot handle it computationally. Will file this in the JIRA. 2012-09-22 23:07:28 -04:00
Eric Banks ced652b3dd RR bug: we need to call removeFromHeader() for reads that were used in creating a polyploid consensus or else they are reused later in creating synthetic reads. In the worst case, this bug caused the tool to create 2 copies of the reduced read. 2012-09-22 21:50:10 -04:00
Eric Banks 60b93acf7d RR bug: we need to test that the mapping and base quals are >= the MIN values and not just >. This was causing us to drop Q20 bases. 2012-09-22 21:32:29 -04:00
Eric Banks dcd31e654d Turn off RR tests while I debug 2012-09-21 17:26:00 -04:00
Eric Banks 21251c29c2 Off-by-one error in sliding window manifests itself at end of a coverage region dropping the last covered base. 2012-09-21 17:22:30 -04:00
Mauricio Carneiro 2c3dc291c0 Added positive/negative strand to the synthetic reads 2012-09-21 10:00:48 -04:00
Mauricio Carneiro 51cb5098e4 Fixed the alignment issues with reads that started with empty consensus headers 2012-09-21 10:00:47 -04:00
Mauricio Carneiro aa1d2f3a5b Not every consensus is well aligned. Need to check more, but starting position has been fixed. 2012-09-21 10:00:45 -04:00
Mauricio Carneiro 97874b92d1 Program runs, but the consensus reads are all out of place and need more tags 2012-09-21 10:00:44 -04:00
Mauricio Carneiro 3494a52ddc another intermediate commit to update changes from stable 2012-09-21 10:00:43 -04:00
Mauricio Carneiro a89ff7b5dd Intermediate commit to resolve conflicts coming from stable 2012-09-21 10:00:41 -04:00
Eric Banks 1316b579f0 Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table. 2012-09-20 14:14:34 -04:00
Eric Banks 4b7edc72d1 Fixing edge case bug in the Exact model (both standard and generalized) where we could abort prematurely in the special case of multiple polymorphic alleles and samples with widely different depths of coverage (e.g. exome and low-pass). In these cases it was possible to call the site bi-allelic when in fact it was multi-allelic (but it wouldn't cause it to create a monomorphic call). 2012-09-20 10:59:42 -04:00
Mauricio Carneiro ee31a54a03 Merged bug fix from Stable into Unstable 2012-09-19 16:09:45 -04:00
Mauricio Carneiro 7cf9911924 Fixed ReduceReads bug where variant regions were missing.
This affected variant regions with more than 100 reads and less than 250 reads. Only bams reduced with GATK v2 and 2.1 were affected.
2012-09-19 16:09:08 -04:00
Ryan Poplin 26e35e5ee2 updating BQSR integration tests 2012-09-19 14:10:34 -04:00
Ryan Poplin b99099f05c The BaseRecalibrator and DelocalizedBaseRecalibrator have gotten out of sync. Fixing. 2012-09-19 12:30:26 -04:00
Ryan Poplin 7a7103a757 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-19 10:39:18 -04:00
Guillermo del Angel bebd5c14b8 Update general ploidy md5's due to bad merge of md5's in previous commit, and new shortened interval definition for EMIT_ALL_CONFIDENT_SITES was buggy 2012-09-18 20:12:15 -04:00
Guillermo del Angel ca010160a9 Merge fix 2012-09-14 14:05:21 -04:00
Guillermo del Angel 6b37350bc0 Two hairy bugs in pool caller: a) Site error model wasn't counting errors in insertions correctly - Alleles passed in had padded ref byte, but event base in PileupElement doesn't have it. As a result, mismatch rate was grossly overestimated with insertions and we missed several calls we should have made. Integration test reflects changes. b) Adding a ref GL to the exact model is correct mathematically but AFResult wasn't filled properly. As a result, QUAL was junk in pure ref sites, and in all other sites the last ref GL introduced wasn't properly updating Pr(AF>0). c) Added integration test that covers -out_mode EMIT_ALL_CONFIDENT_SITES. Not fully sure if the math is 100% correct (for both diploid and generalized case) but at least now diploid and non-diploid cases behave similarly. md5 of this new test will fail since it's taking me a long time to run so I'll update from Bamboo output shortly 2012-09-14 13:13:22 -04:00
Eric Banks 0206e09a6a Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 15:18:27 -04:00
Eric Banks d94d0d15c2 Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout. 2012-09-12 15:15:40 -04:00
Ryan Poplin c9111bb23e Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 14:46:50 -04:00
Ryan Poplin 849a2b8839 Adding HC integration test for _structural_ insertions and deletions. 2012-09-12 12:23:00 -04:00
Eric Banks 994a4ff387 Track all outputs from BQSR (.table, .csv, and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings). 2012-09-12 11:24:53 -04:00
Mark DePristo bfbf1686cd Fixed nasty bug with defaulting to diploid no-call genotypes
-- For the pooled caller we were writing diploid no-calls even when other samples were haploid.  Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing.
-- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all.
-- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)
2012-09-12 07:08:03 -04:00
Ryan Poplin 35d15278af Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-11 14:34:17 -04:00
Guillermo del Angel 13831106d5 Fix GSA-535: storing likelihoods in allele map was busted when running HaplotypeCaller, only the last likelihood of a haplotype was being stored, as opposed to the max likelihood of all haplotypes mapping to an allele 2012-09-11 11:01:26 -04:00
Ryan Poplin aa9829b55c fixing typo 2012-09-10 13:36:37 -04:00
Guillermo del Angel 10c720cbba Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-10 09:56:47 -04:00
Guillermo del Angel 2d4b00833b Bug fix for logging likelihoods in new read allele map: reads which were filtered out were being excluded from map, but they should be included in annotations 2012-09-09 20:35:45 -04:00
Ryan Poplin 36913706c0 Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events. 2012-09-09 13:47:54 -04:00
Ryan Poplin 688fc9fb56 Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events. 2012-09-09 10:36:09 -04:00
David Roazen cb84a6473f Downsampling: experimental engine integration
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet

-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.

-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
2012-09-06 15:03:27 -04:00
Yossi Farjoun d6884e705a Revert "fixed a typo in StringText.properties"
This reverts commit b74c1c17e748f75e59d23545084b983e2a8d2fa6.
2012-09-05 15:21:00 -04:00
Yossi Farjoun f4b39a7545 Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable
merging trivially after a commit
2012-09-05 14:33:39 -04:00
Yossi Farjoun 6e517df5d9 fixed a typo in StringText.properties 2012-09-05 14:33:08 -04:00
Ryan Poplin 9cc1a9931b Resolving merge conflicts. 2012-09-04 10:47:38 -04:00
Ryan Poplin c9944d81ef Skip array needs to also be used in the updateDataForRead function of the delocalized BQSR. 2012-09-04 10:33:37 -04:00
Mark DePristo 1b0ce511a6 Updating BQSR tests due to my change to reset BQSR calibration data 2012-08-31 19:51:09 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, or handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR throws an exception when run in parallel, which a subsequent commit will fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Mark DePristo 1200848bbf Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
2012-08-30 10:15:10 -04:00
Ryan Poplin 57d997f06f Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start 2012-08-30 10:10:43 -04:00
Ryan Poplin 35baf0b155 This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test. 2012-08-30 09:07:36 -04:00
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Mark DePristo 0f4acaae1b Update MD5s with new FS score 2012-08-28 08:06:47 -04:00
Mark DePristo b3fd74f0c4 HaplotypeCaller forbids BAQ 2012-08-24 13:25:05 -04:00
Ryan Poplin fe3069b278 Merged bug fix from Stable into Unstable 2012-08-22 14:40:34 -04:00
Ryan Poplin e5cfdb4811 Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt() 2012-08-22 14:39:35 -04:00
Ryan Poplin 63213e8eb5 Expanding the HaplotypeCaller integration tests to cover a wider range of data 2012-08-22 14:18:44 -04:00
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00
Guillermo del Angel 6a8cf1c84a Enable and adapt HaplotypeScore and MappingQualityZero as active region annotations now that we have per-read likelihoods passed in to annotations 2012-08-21 14:35:40 -04:00
Guillermo del Angel d0644b3565 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:35:23 -04:00
Ryan Poplin 94e7f677ad Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:21:47 -04:00
Guillermo del Angel 418ace463a More merge conflict resolution 2012-08-21 10:15:52 -04:00
Ryan Poplin 10961db3ce Another round of FindBugs fixes. Object returns its internal reference to an externally mutable array. Very dangerous. 2012-08-21 09:35:55 -04:00
Ryan Poplin 605acaae9c Another round of FindBugs fixes. Object internally stores a reference to an externally mutable array. Very dangerous. 2012-08-21 09:33:58 -04:00
Ryan Poplin 55b7949d68 Another round of FindBugs fixes. Comparator doesn't implement Serializable. 2012-08-21 09:20:55 -04:00
Eric Banks 286b658fab Re-enabling parallelism in the BaseRecalibrator now that the release is out. 2012-08-20 21:25:14 -04:00
Guillermo del Angel 7bbd2a7a20 Fixing merge conflicts 2012-08-20 20:38:25 -04:00
Ryan Poplin 77fbaec044 Another round of FindBugs fixes. Class implements its own compareTo() but uses base Object.equals() which can lead to unpredictable behavior. 2012-08-20 16:55:00 -04:00
Ryan Poplin a9472c1980 Another round of FindBugs fixes. Inefficient use of keySet iterator instead of entrySet iterator. 2012-08-20 16:11:45 -04:00
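For readers unfamiliar with this FindBugs pattern, the sketch below shows the general shape of the fix. It is a generic illustration, not code from this commit; the class and method names are hypothetical.

```java
import java.util.Map;

public final class EntrySetSketch {
    // Flagged pattern: iterating keySet() and calling get() performs an
    // extra lookup for every key.
    static int sumViaKeySet(Map<String, Integer> counts) {
        int total = 0;
        for (String key : counts.keySet()) {
            total += counts.get(key);
        }
        return total;
    }

    // Preferred pattern: entrySet() yields key and value together, so no
    // second lookup is needed.
    static int sumViaEntrySet(Map<String, Integer> counts) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }
}
```

Both methods return the same result; the second simply avoids the redundant map lookup inside the loop.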
Ryan Poplin 5db3bd6fd2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 15:28:57 -04:00
Ryan Poplin 464d49509a Pulling out common caller arguments into its own StandardCallerArgumentCollection base class so that every caller isn't exposed to the unused arguments from every other caller. 2012-08-20 15:28:39 -04:00
Ryan Poplin c67d708c51 Bug fix in HaplotypeCaller for non-regular bases in the reference or reads. Those events don't get created any more. Bug fix for advanced GenotypeFullActiveRegion mode: custom variant annotations created by the HC don't make sense when in this mode so don't try to calculate them. 2012-08-20 13:41:08 -04:00
Eric Banks 154f65e0de Temporarily disabling multi-threaded usage of BaseRecalibrator for performance reasons. 2012-08-20 12:43:17 -04:00
Guillermo del Angel 963ad03f8b Second step of interface cleanup for variant annotator: several bug fixes, don't hash pileup elements to Maps because the hashCode() for a pileup element is not implemented and strange things can happen. Still several things to do, not done yet 2012-08-19 21:18:18 -04:00
Guillermo del Angel b61ecc7c19 Fix merge conflicts 2012-08-16 20:45:52 -04:00
Guillermo del Angel d26183e0ec First preliminary big refactoring of UG annotation engine. Goals: a) Remove gigantic hack that cached per-read haplotype likelihoods in a static array so that annotations would go back and retrieve them, b) unify interface for annotations between HaplotypeCaller and UnifiedGenotyper, c) as a consequence, removed and cleaned duplicated code. As a bonus, annotations now have more relevant info to help them compute values.
Major idea is that per-read haplotype likelihoods are now stored in a single unified object of class PerReadAlleleLikelihoodMap. Class implementation in theory hides internal storage details from outside work (still may need work cleaning up interface), and this object(or rather, a Map from Sample->perReadAlleleLikelihoodMap) is produced by UGCalcLikelihoods. The genotype calculation is also able to potentially use this info if needed. All InfoFieldAnnotations now get an extra argument with this map. Currently, this map is only produced for indels in UG, or for all variants within HaplotypeCaller. If this map is absent (SNPs in UG), the old Pileup interface is used, but it's avoided whenever possible. FORMAT annotations are not yet changed but will be focus of second step. Major benefit will be that annotations will be able to very easily discard non-informative reads for certain events. HaplotypeCaller also uses this new class, and no longer hard-codes the mapping of allele ->list(reads) but instead uses the same objects and interfaces as the rest of the modules. Code still needs further testing/cleaning/reviewing/debugging
2012-08-16 20:36:53 -04:00
Eric Banks 05cbf1c8c0 FindBugs 'Efficiency' fixes 2012-08-16 15:40:52 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Eric Banks 2df04dc48a Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests. 2012-08-16 13:05:17 -04:00
Eric Banks 9035b554fb Adding tests for the --solid_nocall_strategy argument 2012-08-15 23:13:24 -04:00
Mark DePristo 3556c36668 Disable general ploidy integration tests because they are running forever 2012-08-15 21:13:16 -04:00
Mark DePristo 243af0adb1 Expanded the BQSR reporting script
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libraries now, including gplots, grid, and gsalib to run correctly
2012-08-12 13:45:14 -04:00
Eric Banks eca9613356 Adding support of X and = CIGAR operators to the GATK 2012-08-10 14:54:07 -04:00
Ryan Poplin 2a113977a9 Resolving merge conflicts with the new MD5s 2012-08-10 11:47:00 -04:00
Ryan Poplin 5f82ffd5d8 Adding LowQual filter to the output of the HaplotypeCaller. 2012-08-10 11:25:14 -04:00
Ryan Poplin 9887bc4410 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 16:31:06 -04:00
Mauricio Carneiro abb168e1ba Merged bug fix from Stable into Unstable 2012-08-09 16:09:58 -04:00
Mauricio Carneiro 67d4148b32 Fixing bug reported by Thomas in the forum where reads were soft-clipped beyond the limits of the contig and ReduceReads was failing with a NoSuchElement exception. Now we hard clip anything that goes beyond the boundaries of the contig. 2012-08-09 15:58:18 -04:00
Mauricio Carneiro 58420098ac Merged bug fix from Stable into Unstable 2012-08-09 13:02:23 -04:00
Mauricio Carneiro c6132ebe26 Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out. Added Guillermo's bug report as an integration test 2012-08-09 13:02:11 -04:00
Ryan Poplin e48727dae3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 10:31:10 -04:00
Guillermo del Angel 5be7e0621d Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 09:58:34 -04:00
Guillermo del Angel 71ee8d87b3 Rename per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarify wording in VCF header 2012-08-09 09:58:20 -04:00
Mauricio Carneiro 250ffd2ad7 Merged bug fix from Stable into Unstable 2012-08-08 15:50:07 -04:00
Mauricio Carneiro 78c1556186 Fixing ReduceReads downsampling bug -- downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception 2012-08-08 15:49:31 -04:00
Ryan Poplin 1223d77546 Removing argument from HaplotypeCaller that was made unnecessary by recent improvements to triggering around large events 2012-08-08 15:13:20 -04:00
Eric Banks 4b2e3cec0b Quick pass of FindBugs 'inefficient use of keySet iterator instead of entrySet iterator' fixes for core tools. 2012-08-08 14:29:41 -04:00
Guillermo del Angel 3e2752667c Intermediate checkin for ReducedReads with HaplotypeCaller - change min read count over k-mer to average count over k-mer when doing assembly of a reduced read (not optimal, currently trying max and then will decide on best approach), fix merge conflicts 2012-08-08 12:07:33 -04:00
Eric Banks 2c76f71a03 Update -maxAlleles argument in integration tests 2012-08-06 22:48:04 -04:00
Guillermo del Angel c66a896b8e Fix UG integration test broken by new -maxAltAlleles nomenclature 2012-08-06 21:29:21 -04:00
Guillermo del Angel 97c5ed4feb Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 20:22:31 -04:00
Guillermo del Angel 238d55cb61 Fixes for running HaplotypeCaller with reduced reads: a) minor refactoring, pulled out code to compute mean representative count to ReadUtils, b) Don't use min representative count over kmer when constructing de Bruijn graph - this creates many paths with multiplicity=1 and makes us lose a lot of SNP's at edge of capture targets. Use mean instead 2012-08-06 20:22:12 -04:00
Ryan Poplin d85b38e4da Updating HaplotypeCaller integration tests 2012-08-06 12:02:19 -04:00
Ryan Poplin b8709d8c67 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 11:41:28 -04:00
Ryan Poplin 973d1d47ed Merging together the computeDiploidHaplotypeLikelihoods functions in the HaplotypeCaller's LikelihoodEngine so they both benefit from the ReducedRead's RepresentativeCount 2012-08-06 11:40:07 -04:00
Ryan Poplin b7eec2fd0e Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly. 2012-08-05 12:29:10 -04:00
Guillermo del Angel d2e8eb7b23 Fixed 2 haplotype caller unit tests: a) new interface for addReadLikelihoods() including read counts, b) disable test that tests basic DeBruijn graph assembly, not ready yet 2012-08-03 14:26:51 -04:00
Ryan Poplin c3b6e2b143 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-03 13:14:43 -04:00
Ryan Poplin ff80f17721 Using PathComparatorTotalScore in the assembly graph traversal does a better job of capturing low frequency branches that are inside high frequency haplotypes. 2012-08-03 13:14:37 -04:00
Guillermo del Angel 6f8e7692d4 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-03 12:24:37 -04:00
Guillermo del Angel 9e25b209e0 First pass of implementation of Reduced Reads with HaplotypeCaller. Main changes: a) Active region: scale PL's by representative count to determine whether region is active. b) Scale per-read, per-haplotype likelihoods by read representative counts. A read representative count is (temporarily) defined as the average representative count over all bases in read, TBD whether this is good enough to avoid biases in GL's. c) DeBruijn assembler inserts kmers N times in graph, where N is min representative count of read over kmer span - TBD again whether this is the best approach. d) Bug fixes in FragmentUtils: logic to merge fragments was wrong in cases where there is discrepancy of overlaps between unclipped/soft clipped bases. Didn't affect things before but RR makes hard-clipped bases in CIGARs more prevalent, so this was exposed. e) Cache read representative counts along with read likelihoods associated with a Haplotype. Code can/should be cleaned up and unified with PairHMMIndelErrorModelCode, as well as refactored to support arbitrary ploidy in HaplotypeCaller 2012-08-03 12:24:23 -04:00
Ryan Poplin 3ece4c4993 Merged bug fix from Stable into Unstable 2012-08-02 11:41:36 -04:00
Ryan Poplin cb8bc18aeb Fix for error in HaplotypeCaller. HC has a UG argument collection for the UG engine but some of those arguments aren't appropriate to set. 2012-08-02 11:41:06 -04:00
Guillermo del Angel 9ac72dbd4d Merged bug fix from Stable into Unstable 2012-08-01 10:56:45 -04:00
Guillermo del Angel 01265f78e6 Add sanity check and possible bug fix for forum user: if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) in pool mode, empty allele list, signifying site can't be genotyped 2012-08-01 10:50:00 -04:00
Guillermo del Angel 4a23f3cd11 Simple cleanup of pool caller code - since usage is much more general than just calling pools, AF calculation models and GL calculation models are renamed from Pool -> GeneralPloidy. Also, don't have users specify special arguments for -glm and -pnrm. Instead, when running UG with sample ploidy != 2, the correct general ploidy modules are automatically detected and loaded. -glm now reverts to old [SNP|INDEL|BOTH] usage 2012-07-31 16:34:20 -04:00
Mark DePristo e00ed8bc5e Cleanup BQSR classes
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration.  It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers.  As code becomes embedded throughout GATK it should be refactored to live in utils
-- Removed unnecessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces.  This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
Guillermo del Angel e6b326c189 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 21:32:19 -04:00