11274 Commits (bba63a3b0ed94bf4d604d5a7f15e33f0f52fa930)

Author SHA1 Message Date
Eric Banks bba63a3b0e Fix for GSA-615: UnifiedGenotyperEngine.getGLModelsToUse takes 5% of the runtime of UG, should be optimized away. 2012-12-12 20:25:45 +00:00
Ryan Poplin 211a6e78ea Further related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed. 2012-12-12 14:53:02 -05:00
Mark DePristo 5632c13bf2 Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush().
-- Now compressed output VCFs are properly block compressed (i.e., they are actually smaller than the uncompressed VCF)
2012-12-12 10:27:07 -05:00
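The effect described in the GSA-681 commit above can be reproduced outside GATK. The following is a minimal sketch, not the GATK writer: it uses Python's `zlib` rather than a block-compressed (BGZF) stream, and the VCF-like records and the `compress_lines` helper are made up for illustration. It only shows that forcing a flush after every record prevents the compressor from exploiting redundancy across records.

```python
import zlib

def compress_lines(lines, flush_every_line):
    """Deflate the given byte lines, optionally forcing a flush after each one."""
    comp = zlib.compressobj()
    out = bytearray()
    for line in lines:
        out += comp.compress(line)
        if flush_every_line:
            # Z_FULL_FLUSH ends the current block and resets the dictionary,
            # so each record is compressed in isolation.
            out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush()  # finish the stream
    return bytes(out)

# Hypothetical, highly repetitive VCF-like records.
lines = [f"chr20\t{pos}\t.\tA\tG\t50\tPASS\t.\n".encode() for pos in range(1, 2001)]

flushed = compress_lines(lines, flush_every_line=True)
buffered = compress_lines(lines, flush_every_line=False)
raw_size = sum(len(line) for line in lines)

print(f"raw={raw_size} flushed={len(flushed)} buffered={len(buffered)}")
```

Both outputs decompress to the same bytes; only the size differs, which is why removing the unnecessary flush shrinks the file without changing its content.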
Mark DePristo dd52a70d45 Fix AFCalcResult unit test
-- I was simply passing in the wrong values into the function.  Fixed the calls, and expanded the docs on what needs to be passed in.
2012-12-11 10:40:12 -05:00
Ami Levy-Moonshine 6bf31065e3 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-11 10:34:50 -05:00
Ami Levy-Moonshine 2f99569dda change the md5 in one of the CV integration tests, since it wasn't using the priority list when printing the origin of the annotation (the setValue field) 2012-12-10 22:48:15 -05:00
Ami Levy-Moonshine 2e3284f306 Continue to fix the case where PRIORITIZE is used but no priority list is given. While fixing that case I also removed unnecessary sorting when the priority list is not provided. When the priority list is not provided, it will continue to be null. Thus, the number of original Variant Contexts should be given as a new parameter to simpleMerge (since priority might be null). This new parameter is used for checking if there are filtered VCs when annotationOrigin is true. 2012-12-10 22:23:58 -05:00
Mauricio Carneiro 8a115edbaf ReduceReads is now scattered by contig
It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.
2012-12-10 15:25:27 -05:00
Eric Banks bdda63d973 Related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed. 2012-12-10 14:47:04 -05:00
Ryan Poplin ceb5431dcb Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-10 12:24:47 -05:00
Ryan Poplin c84ff9d75e Adding explicit true negative assessment category to the AssessNA12878 walker. 2012-12-10 12:24:43 -05:00
Ami Levy-Moonshine 573ace4403 restore the right version of VariantContextUtils.java in my unstable dir 2012-12-10 10:28:56 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
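To make the per-start-position cap from the commit above concrete, here is a small sketch of limiting the number of reads kept per alignment start position per sample. It is not the GATK downsampler: reservoir sampling is swapped in as one simple way to enforce such a cap, and the `downsample_by_start` function, the `(sample, start, name)` tuple layout, and the `seed` parameter are all hypothetical.

```python
import random
from collections import defaultdict

def downsample_by_start(reads, dcov, seed=0):
    """Keep at most `dcov` reads per (sample, alignment start) via reservoir sampling.

    `reads` is an iterable of (sample, start, name) tuples; this layout is an
    assumption made for the sketch, not a GATK data structure.
    """
    rng = random.Random(seed)
    kept = defaultdict(list)   # (sample, start) -> retained reads
    seen = defaultdict(int)    # (sample, start) -> reads seen so far
    for read in reads:
        key = (read[0], read[1])
        seen[key] += 1
        bucket = kept[key]
        if len(bucket) < dcov:
            bucket.append(read)
        else:
            # Replace a retained read with probability dcov / seen[key].
            j = rng.randrange(seen[key])
            if j < dcov:
                bucket[j] = read
    return dict(kept)

# Ten reads start at position 100, two at position 101, all from one sample.
reads = [("NA12878", 100, f"read{i}") for i in range(10)]
reads += [("NA12878", 101, f"read{i}") for i in range(10, 12)]
result = downsample_by_start(reads, dcov=3)
```

Positions with fewer reads than the cap are left untouched, so only deep pileups at a single start position are thinned.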
Ami Levy-Moonshine 5460c96137 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-09 23:43:57 -05:00
Ami Levy-Moonshine 3a420d163e (1) changes in catVariants (work still under development) (2) changes to CV to throw an error when GenotypeMergeType is PRIORITIZE but no priority list (rod_priority_list) is given. Reported by TechnicalVault on the forum on Nov 14 2012 2012-12-09 23:40:03 -05:00
Eric Banks 2637f512f8 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-09 02:09:40 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Mark DePristo 9b6ee0576f Fix bugs in the consensus genotype creation algorithm for the NA12878 KB
-- Was screwing up mixed reviewed / non-reviewed sites.  Now only considers reviewed calls, if any are present, or all calls if no reviewed sites are found
-- Was just taking the first genotype, now it properly looks at all of the genotype calls and makes a reasonable guess what the answer should be
-- Added unit tests for the consensus creation algorithm
2012-12-08 13:18:07 -05:00
Mark DePristo bf8421eeb7 Fixes GSA-671 / AFCalcResult.log10pNonRefByAllele should really be log10pRefByAllele
-- The current implementation of AFCalcResult contains a map from allele -> log10pNonRef. The only use of this field is to support the isPolymorphic function per allele. The call to this function looks like isPolymorphic(allele, QUAL). The QUAL is a phred-scaled threshold where you want to include alleles where the log10pNonRef >= QUAL (appropriately transformed). The problem is that when log10pNonRef is large, it quickly gets set to 0, while its complementary log10pRef value has a meaningful log10 value. For example, if log10pRef = -100 (not an uncommonly large value), log10pNonRef = 0.0.
-- In order to preserve precision and allow us to more finely differentiate high QUAL from low QUAL (but still poly) sites we should store log10pRef values instead, and test that log10pRef <= threshold.
-- See https://jira.broadinstitute.org/browse/GSA-671 for more information.
2012-12-07 16:03:40 -05:00
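The precision argument in the GSA-671 commit above can be checked numerically. The sketch below is illustrative only and is not the AFCalcResult code: `log10_p_nonref` and `is_polymorphic` are hypothetical helpers, and the conversion of a phred-scaled QUAL to a log10 threshold (QUAL / -10) is an assumption about the "appropriately transformed" step.

```python
import math

def log10_p_nonref(log10_p_ref):
    """Derive log10(P(non-ref)) = log10(1 - 10^log10_p_ref) in double precision."""
    return math.log10(1.0 - 10.0 ** log10_p_ref)

def is_polymorphic(log10_p_ref, min_qual):
    """Hypothetical test on the stored log10pRef value.

    A phred-scaled QUAL of q is assumed to correspond to a log10 threshold
    of -q / 10, so a site passes when log10_p_ref <= -q / 10.
    """
    return log10_p_ref <= min_qual / -10.0

# Two very different sites collapse to the same value once converted,
# because 1 - 1e-100 and 1 - 1e-20 both round to exactly 1.0.
strong = log10_p_nonref(-100.0)
weaker = log10_p_nonref(-20.0)
print(strong, weaker)

# Keeping log10pRef preserves the difference and still supports thresholding.
print(is_polymorphic(-100.0, 30.0), is_polymorphic(-2.0, 30.0))
```

Storing the small probability in log space keeps its full dynamic range, whereas its complement saturates at log10(1) = 0 long before typical values are reached.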
Ryan Poplin 3355216366 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-07 15:35:17 -05:00
Ryan Poplin 9573648f85 Changes to count the sites which might be present in some of the input rods but not present at all in other rods. Now loop over the input rod names instead of looping over the tracker results. 2012-12-07 15:35:08 -05:00
Joel Thibault 3b0e3767bf Add a test for a read that extends off the end of chr1 2012-12-07 14:07:15 -05:00
Joel Thibault cc4e3ec589 Update TODO list 2012-12-06 12:06:47 -05:00
Mark DePristo abd94b2976 Bugfix for handling invalid records in NA12878 KB
-- The previous approach tried to remove the entire MongoVariantContext, which was prone to error when the record was malformed.  Now just grabs the _id and uses it to remove the bad record.
2012-12-06 10:24:24 -05:00
Eric Banks 406adb8d44 The allele biased downsampling should not abort if there's a reduced read. Rather it should always keep the RR and downsample only original reads in the pileup. 2012-12-05 23:15:36 -05:00
Mark DePristo dbf721968d PrintReads large-scale test to protect against another major low-level performance issue 2012-12-05 21:36:27 -05:00
Ryan Poplin 00c23bf704 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 15:53:05 -05:00
Ryan Poplin 234ff64556 Changes to AssessNA12878 to allow for 100s of input callsets to assess against the database. 2012-12-05 15:52:57 -05:00
Ami Levy-Moonshine 5d78a61f7a Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 15:07:12 -05:00
Mark DePristo d0cab795b7 Got caught in the middle of a bad integration test that was fixed in an independent push. Moved test bam into testdata. 2012-12-05 14:49:22 -05:00
Mark DePristo 465694078e Major performance improvement to the GATK engine
-- The NanoScheduler timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers.  Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests.
-- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x.  For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement).
-- Took this opportunity to improve the GATK ProgressMeter.  NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress.  This removes all inner loop calls to the GATK timers.
-- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402
2012-12-05 14:49:22 -05:00
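The ProgressMeter design described in the commit above (the hot path only records the maximum position seen, and a separate daemon thread periodically wakes up to print) can be sketched as follows. This is a Python illustration of the pattern, not the GATK Java classes: the `ProgressMeter` class here, its `report` callback, and the interval value are assumptions, and it assumes a single thread calls `notify_of_progress`.

```python
import threading

class ProgressMeter:
    """Tracks the furthest position seen; a daemon thread does the reporting."""

    def __init__(self, interval_seconds, report=print):
        self._max_position = 0
        self._interval = interval_seconds
        self._report = report
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def notify_of_progress(self, position):
        # Inner-loop call: no timers and no I/O, just a comparison and a store.
        if position > self._max_position:
            self._max_position = position

    def _run(self):
        # Event.wait returns False on timeout, True once stop() has been called.
        while not self._stop.wait(self._interval):
            self._report(self._max_position)

    def stop(self):
        self._stop.set()
        self._thread.join()
        self._report(self._max_position)  # final report after the work is done

reports = []
meter = ProgressMeter(interval_seconds=0.01, report=reports.append)
meter.start()
for pos in range(1, 100001):
    meter.notify_of_progress(pos)
meter.stop()
```

The cost of reading the clock and formatting output is paid once per interval by the reporting thread instead of once per record by the traversal loop.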
Mark DePristo 2b601571e7 Better error handling in NanoScheduler
-- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown.  Errors, like out of memory, would cause the whole system to die.  This bugfix resolves that issue
2012-12-05 14:49:22 -05:00
Mark DePristo 51dbb562c9 Reduce amount of debugging information from NA12878KnowledgeBaseServer 2012-12-05 14:49:22 -05:00
Mauricio Carneiro efe256ec09 binary search implementation to find the minimum coverage
speeds up the walker from 7 days to 12 minutes on chr20.
2012-12-05 14:45:57 -05:00
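The commit above does not say what the walker searches over, so the following is only a generic sketch of the technique it names: a binary search for the smallest coverage value that satisfies a monotone condition, replacing a linear scan. The `min_coverage` function, the `is_sufficient` predicate, and the search bounds are hypothetical.

```python
def min_coverage(lo, hi, is_sufficient):
    """Return the smallest c in [lo, hi] for which is_sufficient(c) is True.

    Assumes the predicate is monotone: once it is True for some coverage,
    it stays True for every larger coverage. Returns None if even `hi` fails.
    """
    if not is_sufficient(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_sufficient(mid):
            hi = mid        # mid works, so the answer is mid or smaller
        else:
            lo = mid + 1    # mid fails, so the answer is larger than mid
    return lo

# Hypothetical predicate that records how often it is evaluated.
calls = []
def enough(coverage):
    calls.append(coverage)
    return coverage >= 37

best = min_coverage(1, 1000, enough)
print(best, len(calls))
```

When each evaluation of the predicate is expensive, cutting the number of evaluations from linear to logarithmic in the range size is the kind of change that can account for a large speedup.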
Eric Banks 0c925856cb Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 02:00:39 -05:00
Eric Banks ef87b18e09 In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then). 2012-12-05 02:00:35 -05:00
Mauricio Carneiro 13896356ad Added bootstrapping and fixed the GLM model of the FMCC 2012-12-05 01:32:19 -05:00
Mauricio Carneiro 30f013aeb0 Added a copy() method for ReadBackedPileups
necessary to create new alignment contexts with hard-copies of the pileup.
2012-12-05 01:32:18 -05:00
Mauricio Carneiro 6feda540a4 Better error message for SimpleGATKReports 2012-12-05 01:32:18 -05:00
Eric Banks 726332db79 Disabling the testNoCmdLineHeaderStdout test in UG because it keeps crashing when I run it locally 2012-12-05 00:54:00 -05:00
Randal Moore 8d2d0253a2 introduce a level of indirection for the forum URLs - this new function will allow me a place to morph the URL into something that is supported by Confluence
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-12-03 22:33:02 -05:00
Eric Banks 1af41754e3 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 22:01:11 -05:00
Eric Banks bca860723a Updating tests to handle bad validation data files (that used the wrong qual score encoding); overrides push from stable. 2012-12-03 22:01:07 -05:00
Eric Banks 387c0defed don't change md5 here because I am handling it separately from unstable with a better command-line in the test 2012-12-03 21:49:45 -05:00
Eric Banks ef95757311 Fix MD5 because of a need to fix a busted bam file in our validation directory (it used the wrong quality score encoding...) 2012-12-03 21:46:46 -05:00
Menachem Fromer 472381245a Allow for more refined control of memory and queues to run with 2012-12-03 17:07:03 -05:00
Eric Banks 67932b357d Bug fix for RR: don't let the softclip start position be less than 1 2012-12-03 15:59:14 -05:00
Ryan Poplin d5ed184691 Updating the HC integration test md5s. According to the NA12878 knowledge base this commit cuts down the FP rate by more than 50 percent with no loss in sensitivity. 2012-12-03 15:38:59 -05:00
Ryan Poplin a47da9bb2f Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 14:30:14 -05:00
Ryan Poplin 156d6a5e0b misc minor bug fixes to GenotypingEngine. 2012-12-03 12:47:35 -05:00