Mark DePristo
c82aa01e0e
Generalize testing infrastructure to allow us to run specific n.samples calculation
2012-10-15 07:53:55 -04:00
Mark DePristo
ec935f76f6
Initial implementation and tests for IndependentAllelesDiploidExactAFCalc
...
-- This model separates each of N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic case, and runs the exact model on each independently to handle the multi-allelic case. This is very fast, scaling at O(n.alt.alleles x n.samples)
-- Many outstanding TODOs in order to truly pass unit tests
-- Added proper unit tests for the pNonRef calculation, which all of the models pass
2012-10-15 07:53:55 -04:00
Mark DePristo
5a4e2a5fa4
Test code to ensure that pNonRef is being computed correctly for at least 1 genotype, bi and tri allelic
2012-10-15 07:53:55 -04:00
Mark DePristo
ee2f12e2ac
Simpler naming convention for AlleleFrequencyCalculation => AFCalc
2012-10-15 07:53:55 -04:00
Mark DePristo
cf3f9d6ee8
Reorganize and cleanup AFCalculations
...
-- Now contained in a package called afcalc
-- Extracted standard alone classes from private static classes in ExactAF
-- Most fields are now private, with accessors
-- Overall cleaner organization now
2012-10-15 07:53:55 -04:00
Mark DePristo
13211231c7
Restructure and cleanup ExactAFCalculations
...
-- Now there's no duplication between exact old and constrained models. The behavior is controlled by an overloaded abstract function
-- No more static function to access the linear exact model -- you have to create the surrounding class. Updated code in the system
-- Everything passes unit tests
2012-10-15 07:53:54 -04:00
Mark DePristo
99ad7b2d71
GeneralPloidyExact should use indel max alt alleles
2012-10-15 07:53:54 -04:00
Mark DePristo
bf276baca0
Don't try to compute full exact model for > 100 samples
2012-10-15 07:53:54 -04:00
Mark DePristo
b924e9ebb4
Add OptimizedDiploidExactAF to PerformanceTesting framework
2012-10-15 07:53:54 -04:00
Mark DePristo
f800f3fb88
Optimized diploid exact AF calculation uses maxACs to stop the calculation by maxAC by allele
...
-- Added unit tests to ensure the approximation isn't so far from our reference implementation (DiploidExactAFCalculation)
2012-10-15 07:53:54 -04:00
Mark DePristo
efad215edb
Greedy version of function to compute the max achievable AC for each alt allele
...
-- walks over the genotypes in VC, and computes for each alt allele the maximum AC we need to consider in that alt allele dimension. Does the calculation based on the PLs in each genotype g, choosing to update the max AC for the alt alleles corresponding to that PL. Only takes the first lowest PL, if there are multiple genotype configurations with the same PL value. It takes values in the order of the alt alleles.
2012-10-15 07:53:54 -04:00
Mark DePristo
7666a58773
Function to compute the max achievable AC for each alt allele
...
-- Additional minor cleanup of ExactAFCalculation
2012-10-15 07:53:53 -04:00
Guillermo del Angel
5971006678
Bug fix when running nondiploid mode in UG with EMIT_ALL_SITES: if site was reference-only, QUAL is produced OK but genotypes were being set to no-call because of unnecessary likelihood normalization. May change integration test md5 which I'll fix later today
2012-10-12 12:45:55 -04:00
Ryan Poplin
2a9ee89c19
Turning on allele trimming for the haplotype caller.
2012-10-10 10:47:26 -04:00
Eric Banks
be9fcba546
Don't allow triggering of polyploid consensus creation in regions where there is more than one het, as it just doesn't work properly. We could probably refactor at some point to make it work, but it's not worth doing that now (especially as it should be rare to have multiple proximal known hets in a single sample exome).
2012-10-07 16:32:48 -04:00
Eric Banks
08ac80c080
RR bug: when the last base in the window around the polyploid consensus is filtered (low quality), the filtered consensus is not flushed and subsequent filtered bases (but importantly not contiguous to this one) are just added to this position. In other words, bases were being added to the wrong genomic positions. Fixed.
2012-10-07 10:52:01 -04:00
Eric Banks
e8a6460a33
After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too.
2012-10-05 16:37:42 -04:00
Yossi Farjoun
ef90beb827
- forgot to use git rm to delete a file from git. Now that VCF is deleted.
...
- uncommented a HC test that I missed.
2012-10-05 16:14:51 -04:00
Yossi Farjoun
d419a33ed1
* Added an integration test for AD annotation in the Haplotype caller.
...
* Corrected FS Anotation for UG as for AD.
* HC still does not annotate ReducedReads correctly (for FS nor AD)
2012-10-05 15:23:59 -04:00
Eric Banks
f840d9edbd
HC test should continue using 3 alt alleles for indels
2012-10-05 02:03:34 -04:00
Eric Banks
c66ef17cd0
Add a separate max alt alleles argument for indels that defaults to 2 instead of 3. PLEASE TAKE NOTE.
2012-10-04 13:52:14 -04:00
Eric Banks
e13e61673b
Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-10-04 10:54:23 -04:00
Eric Banks
dfddc4bb0e
Protect against cases where there are counts but no quals
2012-10-04 10:52:30 -04:00
Eric Banks
0c46845c92
Refactored the BaseCounts classes so that they are safer and allow for calculations on the most probable base (which is not necessarily the most common base).
2012-10-04 10:37:11 -04:00
Mark DePristo
b6e20e083a
Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact
...
-- Will be removed. Only commiting now to fix public -> private dependency
2012-10-03 20:16:38 -07:00
Mark DePristo
51cafa73e6
Removing public -> private dependency
2012-10-03 20:05:03 -07:00
Mark DePristo
f8ef4332de
Count the number of evaluations in AFResult; expand unit tests
...
-- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples
-- Added unittests for priors (flat and human)
-- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)
2012-10-03 19:55:11 -07:00
Mark DePristo
de941ddbbe
Cleanup Exact model, better unit tests
...
-- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic). More assert statements to ensure quality of the result.
-- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts. Made mutation functions all protected
-- No longer need to call reset on your AlleleFrequencyCalculationResult -- it'd done for you in the calculation function. reset is a protected method now, so it's all cleaner and nicer this way
-- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef
2012-10-03 19:55:11 -07:00
Mark DePristo
3e01a76590
Clean up AlleleFrequencyCalculation classes
...
-- Added a true base class that only does truly common tasks (like manage call logging)
-- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract
-- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact
-- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation
-- All unit tests pass
2012-10-03 19:55:11 -07:00
Mark DePristo
1c52db4cdd
Add exactCallsLog output file to ExactModel and StandardCallerArgumentCollection
...
-- This allows us to log all of the information about the exact model call (alleles, priors, PLs, result, and runtime) to a file for later debugging / optimization
2012-10-03 19:55:11 -07:00
Eric Banks
2df5be702c
Added an argument to RR to allow polyploid consensus creation (by default it is turned off). This will eventually be replaced by the known SNPs track trigger.
2012-09-28 11:44:25 -04:00
Eric Banks
11a71e0390
RR bug: when determining the most common base at a position, break ties by which base has the highest sum of base qualities. Otherwise, sites with 1 Q2 N and 1 Q30 C are ending up as Ns in the consensus. I think perhaps we don't even care about which base has the most observations - it should just be determined by which has the highest sum of base qualities - but I'm not sure that's what users would expect.
2012-09-24 21:46:14 -04:00
Eric Banks
6a73265a06
RR bug: we were adding synthetic reads from the header only before the variant region, which meant that reads that overlap the variant region but that weren't used for the consensus (because e.g. of low base quality for the spanning base) were never being used at all. Instead, add synthetic reads from before and spanning the variant region.
2012-09-24 13:29:37 -04:00
Eric Banks
ef680e1e13
RR fix: push the header removal all the way into the inner loops so that we literally remove a read from the general header only if it was added to the polyploid header. Add comments.
2012-09-24 11:14:18 -04:00
Eric Banks
0187f04a90
Proper fix for a previous RR bug fix: only remove reads from the header if they were actually used in the creation of the polyploid consensus.
2012-09-23 00:39:19 -04:00
Eric Banks
344083051b
Reverting the fix to the generalized ploidy exact model since it cannot handle it computationally. Will file this in the JIRA.
2012-09-22 23:07:28 -04:00
Eric Banks
ced652b3dd
RR bug: we need to call removeFromHeader() for reads that were used in creating a polyploid consensus or else they are reused later in creating synthetic reads. In the worst case, this bug caused the tool to create 2 copies of the reduced read.
2012-09-22 21:50:10 -04:00
Eric Banks
60b93acf7d
RR bug: we need to test that the mapping and base quals are >= the MIN values and not just >. This was causing us to drop Q20 bases.
2012-09-22 21:32:29 -04:00
Eric Banks
dcd31e654d
Turn off RR tests while I debug
2012-09-21 17:26:00 -04:00
Eric Banks
21251c29c2
Off-by-one error in sliding window manifests itself at end of a coverage region dropping the last covered base.
2012-09-21 17:22:30 -04:00
Mauricio Carneiro
2c3dc291c0
Added positive/negative strand to the synthetic reads
2012-09-21 10:00:48 -04:00
Mauricio Carneiro
51cb5098e4
Fixed the alignment issues with reads that started with empty consensus headers
2012-09-21 10:00:47 -04:00
Mauricio Carneiro
aa1d2f3a5b
Not every consensus is well aligned. Need to check more, but starting position has been fixed.
2012-09-21 10:00:45 -04:00
Mauricio Carneiro
97874b92d1
Program runs, but the consensus reads are all out of place and need more tags
2012-09-21 10:00:44 -04:00
Mauricio Carneiro
3494a52ddc
another intermediate commit to update changes from stable
2012-09-21 10:00:43 -04:00
Mauricio Carneiro
a89ff7b5dd
Intermediate commit to resolve conflicts coming from stable
2012-09-21 10:00:41 -04:00
Eric Banks
1316b579f0
Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table.
2012-09-20 14:14:34 -04:00
Eric Banks
4b7edc72d1
Fixing edge case bug in the Exact model (both standard and generalized) where we could abort prematurely in the special case of multiple polymorphic alleles and samples with widely different depths of coverage (e.g. exome and low-pass). In these cases it was possible to call the site bi-allelic when in fact it was multi-allelic (but it wouldn't cause it to create a monomorphic call).
2012-09-20 10:59:42 -04:00
Mauricio Carneiro
ee31a54a03
Merged bug fix from Stable into Unstable
2012-09-19 16:09:45 -04:00
Mauricio Carneiro
7cf9911924
Fixed ReduceReads bug where variant regions were missing.
...
This affected variant regions with more than 100 reads and less than 250 reads. Only bams reduced with GATK v2 and 2.1 were affected.
2012-09-19 16:09:08 -04:00
Ryan Poplin
26e35e5ee2
updating BQSR integration tests
2012-09-19 14:10:34 -04:00
Ryan Poplin
b99099f05c
The BaseRecalibrator and DelocalizedBaseRecalibrator have gotten out of sync. Fixing.
2012-09-19 12:30:26 -04:00
Ryan Poplin
7a7103a757
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-09-19 10:39:18 -04:00
Guillermo del Angel
bebd5c14b8
Update general ploidy md5's due to bad merge of md5's in previous commit, and new shortened interval definition for EMIT_ALL_CONFIDENT_SITES was buggy
2012-09-18 20:12:15 -04:00
Guillermo del Angel
ca010160a9
Merge fix
2012-09-14 14:05:21 -04:00
Guillermo del Angel
6b37350bc0
Two hairy bugs in pool caller: a) Site error model wasn't counting errors in insertions correctly - Alleles passed in had padded ref byte, but event base in PileupElement doesn't have it. As a result, mismatch rate was grossly overestimated with insertions and we missed several calls we should have made. Integration test reflects changes. b) Adding a ref GL to the exact model is correct mathematically but AFResult wasn't filled properly. As a result, QUAL was junk in pure ref sites, and in all other sites the last ref GL introduced wasn't properly updating Pr(AF>0). c) Added integration test that covers -out_mode EMIT_ALL_CONFIDENT_SITES. Not fully sure if the math is 100% correct (for both diploid and generalized case) but at least now diploid and non-diploid cases behave similarly. md5 of this new test will fail since it's taking me a long time to run so I'll update from Bamboo output shortly
2012-09-14 13:13:22 -04:00
Eric Banks
0206e09a6a
Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-09-12 15:18:27 -04:00
Eric Banks
d94d0d15c2
Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout.
2012-09-12 15:15:40 -04:00
Ryan Poplin
c9111bb23e
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-09-12 14:46:50 -04:00
Ryan Poplin
849a2b8839
Adding HC integration test for _structural_ insertions and deletions.
2012-09-12 12:23:00 -04:00
Eric Banks
994a4ff387
Track all outputs from BQSR (.table, .csv., and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings).
2012-09-12 11:24:53 -04:00
Mark DePristo
bfbf1686cd
Fixed nasty bug with defaulting to diploid no-call genotypes
...
-- For the pooled caller we were writing diploid no-calls even when other samples were haploid. Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing.
-- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all.
-- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)
2012-09-12 07:08:03 -04:00
Ryan Poplin
35d15278af
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-09-11 14:34:17 -04:00
Guillermo del Angel
13831106d5
Fix GSA-535: storing likelihoods in allele map was busted when running HaplotypeCaller, only the last likelihood of a haplotype was being stored, as opposed to the max likelihood of all haplotypes mapping to an allele
2012-09-11 11:01:26 -04:00
Ryan Poplin
aa9829b55c
fixing typo
2012-09-10 13:36:37 -04:00
Guillermo del Angel
10c720cbba
Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-09-10 09:56:47 -04:00
Guillermo del Angel
2d4b00833b
Bug fix for logging likelihoods in new read allele map: reads which were filtered out were being excluded from map, but they should be included in annotations
2012-09-09 20:35:45 -04:00
Ryan Poplin
36913706c0
Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events.
2012-09-09 13:47:54 -04:00
Ryan Poplin
688fc9fb56
Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events.
2012-09-09 10:36:09 -04:00
David Roazen
cb84a6473f
Downsampling: experimental engine integration
...
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet
-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.
-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
2012-09-06 15:03:27 -04:00
Yossi Farjoun
d6884e705a
Revert "fixed a typo in StringText.properties"
...
This reverts commit b74c1c17e748f75e59d23545084b983e2a8d2fa6.
2012-09-05 15:21:00 -04:00
Yossi Farjoun
f4b39a7545
Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable
...
merging trivially after a commit
2012-09-05 14:33:39 -04:00
Yossi Farjoun
6e517df5d9
fixed a typo in StringText.properties
2012-09-05 14:33:08 -04:00
Ryan Poplin
9cc1a9931b
Resolving merge conflicts.
2012-09-04 10:47:38 -04:00
Ryan Poplin
c9944d81ef
Skip array needs to also be used in the updateDataForRead function of the delocalized BQSR.
2012-09-04 10:33:37 -04:00
Mark DePristo
1b0ce511a6
Updating BQSR tests due to my change to reset BQSR calibration data
2012-08-31 19:51:09 -04:00
Mark DePristo
817ece37a2
General infrastructure for ReadTransformers
...
-- These are like read filters but can be applied either on input, on output, of handled by the walker
-- Previous example of BAQ now uses the general framework
-- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties! Yeah!
-- BQSR now uses this framework. We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR is excepting in parallel, which subsequent commit with fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Mark DePristo
1200848bbf
Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers
...
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
2012-08-30 10:15:10 -04:00
Ryan Poplin
57d997f06f
Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start
2012-08-30 10:10:43 -04:00
Ryan Poplin
35baf0b155
This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test.
2012-08-30 09:07:36 -04:00
Ryan Poplin
e12ae65d33
Changing the commenting style in the BQSR
2012-08-29 11:27:45 -04:00
Ryan Poplin
18eca3544e
Initial commit of the delocalized BQSR written as a read walker.
2012-08-28 15:24:20 -04:00
Mark DePristo
0f4acaae1b
Update MD5s with new FS score
2012-08-28 08:06:47 -04:00
Mark DePristo
b3fd74f0c4
HaplotypeCaller forbids BAQ
2012-08-24 13:25:05 -04:00
Ryan Poplin
fe3069b278
Merged bug fix from Stable into Unstable
2012-08-22 14:40:34 -04:00
Ryan Poplin
e5cfdb4811
Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt()
2012-08-22 14:39:35 -04:00
Ryan Poplin
63213e8eb5
Expanding the HaplotypeCaller integration tests to cover a wider range of data
2012-08-22 14:18:44 -04:00
Guillermo del Angel
901f47d8af
Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still
2012-08-22 11:38:51 -04:00
Guillermo del Angel
6a8cf1c84a
Enable and adapt HaplotypeScore and MappingQualityZero as active region annotations now that we have per-read likelihoods passed in to annotations
2012-08-21 14:35:40 -04:00
Guillermo del Angel
d0644b3565
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-21 10:35:23 -04:00
Ryan Poplin
94e7f677ad
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-21 10:21:47 -04:00
Guillermo del Angel
418ace463a
More merge conflict resolution
2012-08-21 10:15:52 -04:00
Ryan Poplin
10961db3ce
Another round of FindBugs fixes. Object returns its internal reference to an externally mutable array. Very dangerous.
2012-08-21 09:35:55 -04:00
Ryan Poplin
605acaae9c
Another round of FindBugs fixes. Object internally stores a reference to an externally mutable array. Very dangerous.
2012-08-21 09:33:58 -04:00
Ryan Poplin
55b7949d68
Another round of FindBugs fixes. Comparator doesn't implement Serializable.
2012-08-21 09:20:55 -04:00
Eric Banks
286b658fab
Re-enabling parallelism in the BaseRecalibrator now that the release is out.
2012-08-20 21:25:14 -04:00
Guillermo del Angel
7bbd2a7a20
Fixing merge conflicts
2012-08-20 20:38:25 -04:00
Ryan Poplin
77fbaec044
Another round of FindBugs fixes. Class implements its own compareTo() but uses base Object.equals() which can lead to unpredictable behavior.
2012-08-20 16:55:00 -04:00
Ryan Poplin
a9472c1980
Another round of FindBugs fixes. Inefficient use of keySet iterator instead of entrySet iterator.
2012-08-20 16:11:45 -04:00
Ryan Poplin
5db3bd6fd2
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-20 15:28:57 -04:00
Ryan Poplin
464d49509a
Pulling out common caller arguments into its own StandardCallerArgumentCollection base class so that every caller isn't exposed to the unused arguments from every other caller.
2012-08-20 15:28:39 -04:00
Ryan Poplin
c67d708c51
Bug fix in HaplotypeCaller for non-regular bases in the reference or reads. Those events don't get created any more. Bug fix for advanced GenotypeFullActiveRegion mode: custom variant annotations created by the HC don't make sense when in this mode so don't try to calculate them.
2012-08-20 13:41:08 -04:00
Eric Banks
154f65e0de
Temporarily disabling multi-threaded usage of BaseRecalibrator for performance reasons.
2012-08-20 12:43:17 -04:00
Guillermo del Angel
963ad03f8b
Second step of interface cleanup for variant annotator: several bug fixes, don't hash pileup elements to Maps because the hashCode() for a pileup element is not implemented and strange things can happen. Still several things to do, not done yet
2012-08-19 21:18:18 -04:00
Guillermo del Angel
b61ecc7c19
Fix merge conflicts
2012-08-16 20:45:52 -04:00
Guillermo del Angel
d26183e0ec
First preliminary big refactoring of UG annotation engine. Goals: a) Remove gigantic hack that cached per-read haplotype likelihoods in a static array so that annotations would go back and retrieve them, b) unify interface for annotations between HaplotypeCaller and UnifiedGenotyper, c) as a consequence, removed and cleaned duplicated code. As a bonus, annotations have now more relevant info to help them compute values.
...
Major idea is that per-read haplotype likelihoods are now stored in a single unified object of class PerReadAlleleLikelihoodMap. Class implementation in theory hides internal storage details from outside work (still may need work cleaning up interface), and this object(or rather, a Map from Sample->perReadAlleleLikelihoodMap) is produced by UGCalcLikelihoods. The genotype calculation is also able to potentially use this info if needed. All InfoFieldAnnotations now get an extra argument with this map. Currently, this map is only produced for indels in UG, or for all variants within HaplotypeCaller. If this map is absent (SNPs in UG), the old Pileup interface is used, but it's avoided whenever possible. FORMAT annotations are not yet changed but will be focus of second step. Major benefit will be that annotations will be able to very easily discard non-informative reads for certain events. HaplotypeCaller also uses this new class, and no longer hard-codes the mapping of allele ->list(reads) but instead uses the same objects and interfaces as the rest of the modules. Code still needs further testing/cleaning/reviewing/debugging
2012-08-16 20:36:53 -04:00
Eric Banks
05cbf1c8c0
FindBugs 'Efficiency' fixes
2012-08-16 15:40:52 -04:00
Eric Banks
dac3958461
Killing off some FindBugs 'Usability' issues
2012-08-16 13:32:44 -04:00
Eric Banks
2df04dc48a
Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests.
2012-08-16 13:05:17 -04:00
Eric Banks
9035b554fb
Adding tests for the --solid_nocall_strategy argument
2012-08-15 23:13:24 -04:00
Mark DePristo
3556c36668
Disable general ploidy integration tests because they are running forever
2012-08-15 21:13:16 -04:00
Mark DePristo
243af0adb1
Expanded the BQSR reporting script
...
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libaries now, include gplots, grid, and gsalib to run correctly
2012-08-12 13:45:14 -04:00
Eric Banks
eca9613356
Adding support of X and = CIGAR operators to the GATK
2012-08-10 14:54:07 -04:00
Ryan Poplin
2a113977a9
Resolving merge conflicts with the new MD5s
2012-08-10 11:47:00 -04:00
Ryan Poplin
5f82ffd5d8
Adding LowQual filter to the output of the HaplotypeCaller.
2012-08-10 11:25:14 -04:00
Ryan Poplin
9887bc4410
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-09 16:31:06 -04:00
Mauricio Carneiro
abb168e1ba
Merged bug fix from Stable into Unstable
2012-08-09 16:09:58 -04:00
Mauricio Carneiro
67d4148b32
Fixing but reported by Thomas in the forum where reads were soft-clipped beyond the limits of the contig and ReduceReads was failing with a NoSuchElement exception. Now we hard clip anything that goes beyond the boundaries of the contig.
2012-08-09 15:58:18 -04:00
Mauricio Carneiro
58420098ac
Merged bug fix from Stable into Unstable
2012-08-09 13:02:23 -04:00
Mauricio Carneiro
c6132ebe26
Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out. Added Guillermo's bug report as an integration test
2012-08-09 13:02:11 -04:00
Ryan Poplin
e48727dae3
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-09 10:31:10 -04:00
Guillermo del Angel
5be7e0621d
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-09 09:58:34 -04:00
Guillermo del Angel
71ee8d87b3
Rename per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarify wording in VCF header
2012-08-09 09:58:20 -04:00
Mauricio Carneiro
250ffd2ad7
Merged bug fix from Stable into Unstable
2012-08-08 15:50:07 -04:00
Mauricio Carneiro
78c1556186
Fixing ReduceReads downsampling bug -- downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception
2012-08-08 15:49:31 -04:00
Ryan Poplin
1223d77546
Removing argument from HaplotypeCaller that was made unneccesary by recent improvements to triggering around large events
2012-08-08 15:13:20 -04:00
Eric Banks
4b2e3cec0b
Quick pass of FindBugs 'inefficient use of keySet iterator instead of entrySet iterator' fixes for core tools.
2012-08-08 14:29:41 -04:00
Guillermo del Angel
3e2752667c
Intermediate checkin for ReducedReads with HaplotypeCaller - change min read count over k-mer to average count over k-mer when doing assembly of a reduced read (not optimal, currently trying max and then will decide on best approach), fix merge conflicts
2012-08-08 12:07:33 -04:00
Eric Banks
2c76f71a03
Update -maxAlleles argument in integration tests
2012-08-06 22:48:04 -04:00
Guillermo del Angel
c66a896b8e
Fix UG integration test broken by new -maxAltAlleles nomenclature
2012-08-06 21:29:21 -04:00
Guillermo del Angel
97c5ed4feb
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-06 20:22:31 -04:00
Guillermo del Angel
238d55cb61
Fixes for running HaplotypeCaller with reduced reads: a) minor refactoring, pulled out code to compute mean representative count to ReadUtils, b) Don't use min representative count over kmer when constructing de Bruijn graph - this creates many paths with multiplicity=1 and makes us lose a lot of SNP's at edge of capture targets. Use mean instead
2012-08-06 20:22:12 -04:00
Ryan Poplin
d85b38e4da
Updating HaplotypeCaller integration tests
2012-08-06 12:02:19 -04:00
Ryan Poplin
b8709d8c67
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-06 11:41:28 -04:00
Ryan Poplin
973d1d47ed
Merging together the computeDiploidHaplotypeLikelihoods functions in the HaplotypeCaller's LikelihoodEngine so they both benefit from the ReducedRead's RepresentativeCount
2012-08-06 11:40:07 -04:00
Ryan Poplin
b7eec2fd0e
Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly.
2012-08-05 12:29:10 -04:00
Guillermo del Angel
d2e8eb7b23
Fixed 2 haplotype caller unit tests: a) new interface for addReadLikelihoods() including read counts, b) disable test that test basic DeBruijn graph assembly, not ready yet
2012-08-03 14:26:51 -04:00
Ryan Poplin
c3b6e2b143
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-03 13:14:43 -04:00
Ryan Poplin
ff80f17721
Using PathComparatorTotalScore in the assembly graph traversal does a better job of capturing low frequency branches that are inside high frequnecy haplotypes.
2012-08-03 13:14:37 -04:00
Guillermo del Angel
6f8e7692d4
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-03 12:24:37 -04:00
Guillermo del Angel
9e25b209e0
First pass of implementation of Reduced Reads with HaplotypeCaller. Main changes: a) Active region: scale PL's by representative count to determine whether region is active. b) Scale per-read, per-haplotype likelihoods by read representative counts. A read representative count is (temporarily) defined as the average representative count over all bases in read, TBD whether this is good enough to avoid biases in GL's. c) DeBruijn assembler inserts kmers N times in graph, where N is min representative count of read over kmer span - TBD again whether this is the best approach. d) Bug fixes in FragmentUtils: logic to merge fragments was wrong in cases where there is discrepancy of overlaps between unclipped/soft clipped bases. Didn't affect things before but RR makes prevalence of hard-clipped bases in CIGARs more prevalent so this was exposed. e) Cache read representative counts along with read likelihoods associated with a Haplotype. Code can/should be cleaned up and unified with PairHMMIndelErrorModelCode, as well as refactored to support arbitrary ploidy in HaplotypeCaller
2012-08-03 12:24:23 -04:00
Ryan Poplin
3ece4c4993
Merged bug fix from Stable into Unstable
2012-08-02 11:41:36 -04:00
Ryan Poplin
cb8bc18aeb
Fix for error in HaplotypeCaller. HC has a UG argument collection for the UG engine but some of those arguments aren't appropriate to set.
2012-08-02 11:41:06 -04:00
Guillermo del Angel
9ac72dbd4d
Merged bug fix from Stable into Unstable
2012-08-01 10:56:45 -04:00
Guillermo del Angel
01265f78e6
Add sanity check and possible bug fix for forum user: if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) in pool mode, empty allele list, signifying site can't be genotyped
2012-08-01 10:50:00 -04:00
Guillermo del Angel
4a23f3cd11
Simple cleanup of pool caller code - since usage is much more general than just calling pools, AF calculation models and GL calculation models are renamed from Pool -> GeneralPloidy. Also, don't have users specify special arguments for -glm and -pnrm. Instead, when running UG with sample ploidy != 2, the correct general ploidy modules are automatically detected and loaded. -glm now reverts to old [SNP|INDEL|BOTH] usage
2012-07-31 16:34:20 -04:00
Mark DePristo
e00ed8bc5e
Cleanup BQSR classes
...
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration. It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers. As code becomes embedded throughout GATK its should be refactored to live in utils
-- Removed unncessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces. This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
Guillermo del Angel
e6b326c189
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-30 21:32:19 -04:00
Guillermo del Angel
6c9d3ec155
Remerge after changes to allele construction code. More cleanups/fixes to artificial read pileup provider
2012-07-30 21:32:03 -04:00
Ryan Poplin
3dabb90eb0
Updating example active region walker integration test.
2012-07-30 21:26:16 -04:00