Commit Graph

292 Commits (ec3bf9f36283e16b00a090bf75161213bb3f63dc)

Author SHA1 Message Date
Guillermo del Angel 971ded341b Swap java Random generator for GATK one to ensure test determinism 2013-02-04 10:57:34 -05:00
Guillermo del Angel f31bf37a6f First step in better BQSR unit tests for covariates (not done yet): more test coverage in basic covariates, test logging several read groups/read lengths and more combinations simultaneously.
Add basic Javadocs headers for PerReadAlleleLikehoodMap.
2013-02-03 15:31:30 -05:00
Eric Banks 03df5e6ee6 - Added more comprehensive tests for consensus creation to RR. Still need to add tests for I/D ops.
- Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure).
- Also added minor cleanup of BaseUtils
2013-02-01 15:37:19 -05:00
Ryan Poplin 2fee000dba Adding unit tests for KBestPaths class and fixing edge case bugs. 2013-02-01 13:51:31 -05:00
David Roazen c6581e4953 Update MD5s to reflect version number change in the BAM header
I've confirmed via a script that all of these differences only
involve the version number bump in the BAM headers and nothing
else:

< @HD   VN:1.0  GO:none SO:coordinate
---
> @HD   VN:1.4  GO:none SO:coordinate
2013-02-01 13:51:31 -05:00
Guillermo del Angel a520058ef6 Add option to specify maximum STR length to RepeatCovariates from command line to ease testing 2013-02-01 13:51:31 -05:00
Ryan Poplin 495bca3d1a Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-31 10:12:26 -05:00
Ryan Poplin ca6968d038 Use base List and Map types in the GenotypingEngineUnitTest. 2013-01-31 10:12:18 -05:00
Eric Banks 75ceddf9e5 Adding new unit tests for RR. These tests took a frustratingly long time to get to pass, but now we have a framework for
testing the adding of reads into the SlidingWindow plus consensus creation.  Will flesh these out more after I take care of
some other items on my plate.
2013-01-31 09:46:38 -05:00
Ryan Poplin 5f4a063def Breaking up my massive commits into smaller pieces that I can successfully merge and digest. This one enables downsampling in the HaplotypeCaller (by lowering the default dcov to 20) and removes my long-standing, temporary region-based downsampling. 2013-01-30 16:14:07 -05:00
Ryan Poplin ff8ba03249 Updating BQSR integration test md5s to reflect the updates to the hierarchicalBayesianQualityEstimate function 2013-01-30 13:30:18 -05:00
Ryan Poplin 85dabd321f Adding unit tests for hierarchicalBayesianQualityEstimate function 2013-01-30 13:26:07 -05:00
Ryan Poplin 07fe3dd1ef Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-30 13:19:24 -05:00
David Roazen 9985f82a7a Move BaseUtils back to the GATK by request, along with associated utility methods 2013-01-30 13:09:44 -05:00
Ryan Poplin 2967776458 The Empirical quality column in the recalibration report can't be compared in the BQSRGatherer because the value is calculated using the Bayesian estimate with different priors. This value should never be used from a recalibration report anyway except during plotting. 2013-01-30 12:28:14 -05:00
Eric Banks 9025567cb8 Refactoring the SimpleGenomeLoc into the now public utility UnvalidatingGenomeLoc and the RR-specific FinishedGenomeLoc.
Moved the merging utility methods into GenomeLoc and moved the unit tests around accordingly.
2013-01-30 10:45:29 -05:00
Mark DePristo 92c5635e19 Cleanup, document, and unit test ActiveRegion
-- All functions tested.  In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads.  New version should be correct
-- Enforce correct ordering of supporting states in constructor
-- Enforce read ordering when adding reads to an active region in add
-- Fix bug in HaplotypeCaller map with new updating read spans.  Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence
-- Encapsulate isActive field with an accessor function
-- Make sure that all state lists are unmodifiable, and that the docs are clear about this
-- ActiveRegion equalsExceptReads is for testing only, so make it package protected
-- ActiveRegion.hardClipToRegion must resort reads as they can become out of order
-- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region.  The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed.  Has a minor impact on PLs and RankSumTest values
-- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy
2013-01-30 09:47:12 -05:00
Mauricio Carneiro 29fd536c28 Updating licenses manually
Please check that your commit hook is properly pointing at ../../private/shell/pre-commit

Conflicts:
	public/java/test/org/broadinstitute/variant/VariantBaseTest.java
2013-01-29 17:27:53 -05:00
David Roazen a536e1da84 Move some VCF/VariantContext methods back to the GATK based on feedback
-Moved some of the more specialized / complex VariantContext and VCF utility
 methods back to the GATK.

-Due to this re-shuffling, was able to return things like the Pair class back
 to the GATK as well.
2013-01-29 16:56:55 -05:00
Eric Banks e4ec899a87 First pass at adding unit tests for the RR framework: I have added 3 tests and all 3 uncovered RR bugs!
One of the fixes was critical: SlidingWindow was not converting between global and relative positions correctly.
Besides not being correct, it was resulting in a massive slow down of the RR traversal.
That fix definitely breaks at least one of the integration tests, but it's not worth changing md5s now because I'll be
changing things all over RR for the next few days, so I am going to let that test fail indefinitely until I can confirm
general correctness of the tool.
2013-01-29 15:51:07 -05:00
Guillermo del Angel 1d5b29e764 Unit tests for repeat covariates: generate 100 random reads consisting of tandem repeat units of random content and size, and check that covariates match expected values at all positions in reads.
Fixed corner case where value of covariate at border between 2 tandem repeats of different length/content wasn't consistent
2013-01-29 15:23:02 -05:00
Ryan Poplin 35543b9cba updating BQSR integration test values for the PR half of BQSR. 2013-01-29 09:47:57 -05:00
Guillermo del Angel ff799cc79a Fixed bad merge 2013-01-28 20:04:25 -05:00
Guillermo del Angel 5995f01a01 Big intermediate commit (mostly so that I don't have to go again through merge/rebase hell) in expanding BQSR capabilities. Far from done yet:
a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
2013-01-28 19:55:46 -05:00
David Roazen f63f27aa13 org.broadinstitute.variant refactor, part 2
-removed sting dependencies from test classes
-removed org.apache.log4j dependency
-misc cleanup
2013-01-28 09:03:46 -05:00
Ami Levy-Moonshine b4447cdca2 In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the samples names are unique in VariantContextUtils.simpleMerge for each VCs. It couse to a bug that was reported on the forum (when a VCs had 2 VC from the same sample).
Now we will check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unitTests in CVunitTest.java.
2013-01-25 15:49:51 -05:00
Mark DePristo 3f95f39be3 Updating HC md5s for new cutting algorithm and default band pass filter parameters 2013-01-25 11:07:29 -05:00
Eric Banks 6dd0e1ddd6 Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants.
This allows us to move SelectVariants into the public suite of tools now.
2013-01-25 09:42:04 -05:00
Chris Hartl a3b98daf1a Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-23 14:49:34 -05:00
Chris Hartl 7fcfa4668c Since GenotypeConcordance is now a standalone walker, remove the old GenotypeConcordance evaluation module and the associated integration tests. 2013-01-23 14:47:23 -05:00
Mark DePristo 8026199e4c Updating md5s for CountReadsInActiveRegions and HaplotypeCaller to reflect new activity profile mechanics
-- In this process I discovered a few missed sites in the old code.  The new approach actually produces better HC results than the previous version.
2013-01-23 13:46:01 -05:00
Chris Hartl c500e1d8ac Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-22 15:31:30 -05:00
Chris Hartl 7060e01a8e Fix for broken unit test plus some minor changes to comments. Unit tests were broken by my pulling the site status utility function into the enum. Thankfully the unit tests caught my silly duplication of a line. 2013-01-22 15:14:41 -05:00
Mauricio Carneiro 7b8b064165 Last manual license update (hopefully)
if everyone updates their git hook accordingly, this will be the last time I have to manually run the script.

GSATDG-5
2013-01-18 16:13:07 -05:00
Eric Banks cac439bc5e Optimized the Allele Biased Downsampling: now it doesn't re-sort the pileup but just removes reads from the original one.
Added a small fix that slightly changed md5s.
2013-01-18 11:17:31 -05:00
Chris Hartl 08d2da9057 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-18 10:28:45 -05:00
Chris Hartl bf5748a538 Forgot to actually put in the md5. Also with the new change to record pairing and filtering, the multiple-records integration test changed: the indel records (T/TG | T/TGACA) are matched up (rather than left separate) resulting in properly identifying mismatching alleles, rather than HET-UNAVAILABLE and UNAVAILABLE-HET. Very nice. 2013-01-18 10:25:36 -05:00
Chris Hartl 91030e9afa Bugfix: records that get paired up during the resolution of multiple-records-per-site were not going into genotype-level filtering. Caught via testing.
Testing for moltenized output, and for genotype-level filtering. This tool is now fully functional. There are three todo items:

1) Docs
2) An additional output table that gives concordance proportions normalized by records in both eval and comp (not just total in eval or total in comp)
3) Code cleanup for table creation (putting a table together the way I do takes -way- too many lines of code)
2013-01-18 09:49:48 -05:00
Eric Banks 39c73a6cf5 1. Ryan and I noticed that the FisherStrand annotation was completely busted for indels with reduced reads; fixed.
2. While making the previous fix and unifying FS for SNPs and indels, I noticed that FS was slightly broken in the general case for indels too; fixed.
3. I also fixed a minor bug in the Allele Biased Downsampling code for reduced reads.
2013-01-18 03:35:48 -05:00
Eric Banks 6a903f2c23 I finally gave up on trying to get the Haplotype/Allele merging to work in the HaplotypeCaller.
I've resigned myself instead to create a mapping from Allele to Haplotype.  It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
2013-01-18 01:21:08 -05:00
Eric Banks 953592421b I think we got out of sync with the HC tests as we were clobbering each other's changes. Only differences here are to some RankSumTest values. 2013-01-17 09:19:21 -05:00
Eric Banks a623cca89a Bug fix for HaplotypeCaller, as reported on the forum: when reduced reads didn't completely overlap a deletion call,
we were incorrectly trying to find the reference position of a base on the read that didn't exist.
Added integration test to cover this case.
2013-01-16 22:47:58 -05:00
Mark DePristo 2a42b47e4a Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
2013-01-16 15:30:00 -05:00
Eric Banks d18dbcbac1 Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base.
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0?  When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
2013-01-16 14:55:33 -05:00
Eric Banks 392b5cbcdf The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).

As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases.  This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests.  Fortunately that walker is being deprecated soon...
2013-01-16 10:22:43 -05:00
Eric Banks 4fb3e48099 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 00:13:38 -05:00
Chris Hartl 327169b283 Refactor the method that identifies the site overlap type into the type enum class (so it can be used elsewhere potentially).
Completed todo item: for sites like

(eval)
20   12345   A    C
20   12345   A    AC

(comp)
20   12345   A    C
20   12345   A    ACCC

the records will be matched by the presence of a non-empty intersection of alleles. Any leftover records are then paired with an empty variant context (as though the call was unique). This has one somewhat counterintuitive feature, which is that normally

(eval)
20  12345  A   AC
(comp)
20  12345  A   ACCC

would be classified as 'ALLELES_DO_NOT_MATCH' (and not counted in genotype tables), in the presence of the SNP, they're counted as EVAL_ONLY and TRUTH_ONLY respectively.

+ integration test
2013-01-15 12:13:45 -05:00
Mark DePristo 3c37ea014b Retire original TraverseActiveRegion, leaving only the new optimized version
-- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests
2013-01-15 10:24:45 -05:00
Eric Banks 94800771e3 1. Initial implementation of bam writing for the HaplotypeCaller with -bam argument; currently only assembled haplotypes are emitted.
2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs.  Until the HC uses meta data though, this won't work.
2013-01-15 10:19:18 -05:00
Chris Hartl 682c59ff04 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-14 13:27:34 -05:00
Chris Hartl 61bc334df1 Ensure output table formatting does not contain NaNs. For (0 eval ref calls)/(0 comp ref calls), set the proportion to 0.00.
Added integration tests (checked against manual tabulation)
2013-01-14 09:21:30 -05:00
Ryan Poplin a7fe334a3f calculating the md5s for the new tests. 2013-01-11 15:43:52 -05:00
Ryan Poplin 65afec2a53 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-11 15:22:52 -05:00
Mark DePristo 85b529cced Updating MD5s in HC and UG that changed due to new LIBS
-- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in input list.  Not ideal)
-- Trivial read count differences in HC
2013-01-11 15:17:19 -05:00
Mark DePristo fb9eb3d4ee PileupElement and LIBS cleanup
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
2013-01-11 15:17:17 -05:00
Mark DePristo b53286cc3c HaplotypeCaller mode to skip assembly and genotyping for performance testing
-- Added HCPerformance evaluation Qscript
-- Added some docs about one of the HC integration tests
-- HaplotypeCaller / ART performance evaluation script
2013-01-11 15:17:16 -05:00
Ryan Poplin e952296c10 Adding HC GGA integration test to cover duplicated input alleles. 2013-01-11 15:01:27 -05:00
Ryan Poplin 7f7f40f851 Adding additional HC GGA integration tests to cover more complicated input alleles. 2013-01-11 14:36:21 -05:00
Mauricio Carneiro 2a4ccfe6fd Updated all JAVA file licenses accordingly
GSATDG-5
2013-01-10 17:06:41 -05:00
Chris Hartl 31a5f88c4f Expanded unit tests to cover the Concordance Metrics class fairly uniformly. 2013-01-10 14:33:47 -05:00
Chris Hartl c1de92b511 Add in some todo items 2013-01-09 13:16:06 -05:00
Chris Hartl 8d126161e2 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-09 13:15:04 -05:00
Eric Banks 4fa439d89e Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependancies now 2013-01-09 11:06:10 -05:00
Chris Hartl b56754606b Initial break-out of GenotypeConcordance as a standalone walker. Some basic functionality testing. Currently performs only a pairwise comparison, but is very careful about proper tabulation through the GenotypeType enum. 2013-01-09 00:34:07 -05:00
Eric Banks ee7d85c6e6 Move around the DiploidGenotype classes (so it can be used by the GATKPaperGenotyper) 2013-01-08 15:53:11 -05:00
Eric Banks b099e2b4ae Moving integration tests to protected 2013-01-08 09:34:08 -05:00
Ryan Poplin 4f95f850b3 Bug fix in the HC's allele mapping for multi-allelic events. Using the allele alone as a key isn't sufficient because alleles change when the reference allele changes during VariantContextUtils.simpleMerge for multi-allelic events. 2013-01-07 11:05:44 -05:00
Eric Banks 52067f0549 Handle merge conflicts 2013-01-06 12:29:12 -05:00
Chris Hartl 41bc416b65 Remove AAL and update MD5s. 2013-01-04 16:46:14 -05:00
Eric Banks dd7f5e2be7 Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests. 2013-01-04 14:43:11 -05:00
Chris Hartl 3753209584 One md5sum slipped past in the HC integration test. 2013-01-02 15:09:28 -05:00
Chris Hartl e1d09ab0db QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests *should* all pass. 2013-01-02 14:41:29 -05:00
Ryan Poplin c8cd6ac465 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-20 14:58:04 -05:00
Ryan Poplin a098888f4d Updating missed UG md5 2012-12-20 14:57:53 -05:00
Tad Jordan b491c177ff Added functionality of outputting sorted GATKReport Tables
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
2012-12-20 14:02:21 -05:00
Ryan Poplin 54e5c84018 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-19 11:31:40 -05:00
Ryan Poplin aa39037be8 updating UG integration tests. 2012-12-19 11:31:35 -05:00
Eric Banks 70479cb71d RR bug fix: we were failing when a read started with an insertion just at the edge of the consensus region.
The weird part is that the comments claimed it was doing what it was supposed to, but it didn't actually do it.
Now we maintain the last header element of the consensus (but without bases and quals) if it adjoins an element with an insertion.

Added the user's test file as an integration test.
2012-12-19 10:59:07 -05:00
David Roazen 07b369ca7e Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.

-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)

-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)

-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
2012-12-19 10:25:22 -05:00
Ryan Poplin 92185dd5f4 updating HC integration tests. 2012-12-19 10:12:07 -05:00
Ryan Poplin 98f18b5f9e Changing the HC over to using the non-contamination-downsampled read maps for the purposes of annotations. This behavior now matches the UG. There is a new command line option to go back to the older behavior to explore the differences. 2012-12-17 11:27:44 -05:00
Mauricio Carneiro 5f1afb4136 Fixing an off-by-one clipping error in ReduceReads for reads off the contig
Reads that are soft-clipped off the contig (before the beginning of the contig) were being soft-clipped to position 0 instead of 1 because of an off-by-one issue. Fixed and included in the integration test.
2012-12-13 22:10:11 -05:00
Mauricio Carneiro 74344a3871 Bringing in the changes from the CMI repo 2012-12-13 21:59:37 -05:00
Mark DePristo aeab932c63 Actual working version of unflushing VCFWriter
-- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
2012-12-13 16:15:08 -05:00
Mauricio Carneiro 33290bfe0c Added integration test to catch the read off contig in ReduceReads.
So upstream changes won't break it again.
2012-12-12 13:49:54 -05:00
Mark DePristo 5632c13bf2 Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush().
-- Now compressed output VCFs are properly blocked compressed (i.e., they are actually smaller than the uncompressed VCF)
2012-12-12 10:27:07 -05:00
Mark DePristo dd52a70d45 Fix AFCalcResult unit test
-- I was simply passing in the wrong values into the function.  Fixed the calls, and expanded the docs on what needs to be passed in.
2012-12-11 10:40:12 -05:00
Eric Banks bdda63d973 Related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed. 2012-12-10 14:47:04 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Mark DePristo d0cab795b7 Got caught in the middle of a bad integration test, that was fixed in independent push. Moved test bam into testdata. 2012-12-05 14:49:22 -05:00
Eric Banks ef87b18e09 In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then). 2012-12-05 02:00:35 -05:00
Eric Banks 726332db79 Disabling the testNoCmdLineHeaderStdout test in UG because it keeps crashing when I run it locally 2012-12-05 00:54:00 -05:00
Eric Banks bca860723a Updating tests to handle bad validation data files (that used the wrong qual score encoding); overrides push from stable. 2012-12-03 22:01:07 -05:00
Ryan Poplin d5ed184691 Updating the HC integration test md5s. According to the NA12878 knowledge base this commit cuts down the FP rate by more than 50 percent with no loss in sensitivity. 2012-12-03 15:38:59 -05:00
Ryan Poplin 156d6a5e0b misc minor bug fixes to GenotypingEngine. 2012-12-03 12:47:35 -05:00
Mark DePristo 2849889af5 Updating md5 for UG 2012-12-01 14:24:19 -05:00
Mark DePristo c676853731 Merged bug fix from Stable into Unstable. Updating md5s
Conflicts:
	protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java
2012-11-28 12:54:36 -05:00
Mark DePristo a1d6461121 Critical bugfix to AFCalcResult affecting UG/HC quality score emission thresholds
As reported by Menachem Fromer: a critical bug in AFCalcResult:

Specifically, the implementation:
    public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) {
        return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef;
    }

seems incorrect and should probably be:

getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef

The issue here is that the 30 represents a Phred-scaled probability of *error* and it's currently being compared to a log probability of *non-error*.

Instead, we need to require that our probability of error be less than the error threshold.
This bug has only a minor impact on the calls -- hardly any sites change -- which is good.  But the inverted logic effects multi-allelic sites significantly.  Basically you only hit this logic with multiple alleles, and in that case it'\s including extra alt alleles incorrectly, and throwing out good ones.

Change was to create a new function that properly handles thresholds that are PhredScaled quality scores:

    /**
     * Same as #isPolymorphic but takes a phred-scaled quality score as input
     */
    public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) {
        if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 ");
        final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual));
        return isPolymorphic(allele, log10Threshold);
    }
2012-11-28 12:08:02 -05:00
Ryan Poplin 59cef880d1 Updating HC integration tests because experimental, HC-specific annotations have been removed. 2012-11-26 12:20:07 -05:00