Commit Graph

11281 Commits (a52e3c7e15a1bca6221df4d881642cfce590f84d)

Author SHA1 Message Date
Mauricio Carneiro a52e3c7e15 Revert "Bug fix for RR: don't let the softclip start position be less than 1"
This introduced a bug in ReduceReads by deactivating its hard clipping of the out-of-bounds soft clips (especially in the MT).
DEV-322 #resolve #time 4m

This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.
2012-12-12 13:09:39 -05:00
Guillermo del Angel 216f92276c Disable scatter-gather with PrintReads since we're already setting high nct so it's unnecessary 2012-12-12 09:06:37 -05:00
Kristian Cibulskis 0e5b1093fb Initial implementation of contamination estimation; tested on a single gene (which doesn't have enough data), waiting to test on exome/chr20 2012-12-11 15:59:59 -05:00
Mauricio Carneiro 19372225af Merge Broad's GATK and CMI gatk 2012-12-10 15:33:38 -05:00
Mauricio Carneiro 8a115edbaf ReduceReads is now scattered by contig
It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.
2012-12-10 15:25:27 -05:00
Eric Banks bdda63d973 Related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed. 2012-12-10 14:47:04 -05:00
Ryan Poplin ceb5431dcb Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-10 12:24:47 -05:00
Ryan Poplin c84ff9d75e Adding explicit true negative assessment category to the AssessNA12878 walker. 2012-12-10 12:24:43 -05:00
Ami Levy-Moonshine 573ace4403 restore the right version of VariantContextUtils.java in my unstable dir 2012-12-10 10:28:56 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Ami Levy-Moonshine 5460c96137 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-09 23:43:57 -05:00
Ami Levy-Moonshine 3a420d163e (1) changes in catVariants (work still under development) (2) changes to CV to throw an error when GenotypeMergeType is PRIORITIZE but no priority (rod_priority_list) is given. Reported by TechnicalVault on the forum on Nov 14 2012 2012-12-09 23:40:03 -05:00
Eric Banks 2637f512f8 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-09 02:09:40 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Mark DePristo 9b6ee0576f Fix bugs in the consensus genotype creation algorithm for the NA12878 KB
-- Was screwing up mixed reviewed / non-reviewed sites.  Now only considers reviewed calls, if any are present, or all calls if no reviewed sites are found
-- Was just taking the first genotype, now it properly looks at all of the genotype calls and makes a reasonable guess what the answer should be
-- Added unit tests for the consensus creation algorithm
2012-12-08 13:18:07 -05:00
Mark DePristo bf8421eeb7 Fixes GSA-671 / AFCalcResult.log10pNonRefByAllele should really be log10pRefByAllele
-- The current implementation of AFCalcResult contains a map from allele -> log10pNonRef. The only use of this field is to support the isPolymorphic function per allele. The call to this function looks like isPolymorphic(allele, QUAL). The QUAL is a phred-scaled threshold where you want to include alleles where the log10pNonRef >= QUAL (appropriately transformed). The problem is that when log10pNonRef is large, it quickly gets set to 0, while its complementary log10pRef value has a meaningful log10 value. For example, if log10pRef = -100 (not an uncommonly large value), log10pNonRef = 0.0.
-- In order to preserve precision and allow us to more finely differentiate high QUAL from low QUAL (but still poly) sites we should store log10pRef values instead, and test that log10pRef <= threshold.
-- See https://jira.broadinstitute.org/browse/GSA-671 for more information.
2012-12-07 16:03:40 -05:00
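The precision problem described in the GSA-671 commit can be reproduced with ordinary double-precision arithmetic. The sketch below is illustrative only and is not the GATK code: the function names and the phred-to-log10 conversion (threshold = -QUAL / 10) are assumptions made for this example.

```python
import math


def log10_p_nonref(log10_p_ref):
    """Derive log10 P(non-ref) = log10(1 - P(ref)) in plain double precision."""
    p_ref = 10.0 ** log10_p_ref
    return math.log10(1.0 - p_ref)


def is_polymorphic(log10_p_ref, min_qual):
    """Hypothetical per-allele test on the stored log10 P(ref).

    Assumes the phred-scaled QUAL threshold maps to log10 space as -QUAL / 10.
    """
    return log10_p_ref <= -min_qual / 10.0


strong_site = -100.0  # very confident non-ref call
weak_site = -20.0     # much weaker, but still polymorphic

# Both collapse to the same value, so log10 P(non-ref) cannot rank them.
print(log10_p_nonref(strong_site), log10_p_nonref(weak_site))

# The stored log10 P(ref) still separates them at a high QUAL threshold.
print(is_polymorphic(strong_site, 500), is_polymorphic(weak_site, 500))
```

Once P(ref) drops below roughly 1e-16, 1 - P(ref) rounds to exactly 1.0, which is why every confident site ends up with the same log10 P(non-ref).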
Ryan Poplin 3355216366 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-07 15:35:17 -05:00
Ryan Poplin 9573648f85 Changes to count the sites which might be present in some of the input rods but not present at all in other rods. Now loop over the input rod names instead of looping over the tracker results. 2012-12-07 15:35:08 -05:00
Joel Thibault 3b0e3767bf Add a test for a read that extends off the end of chr1 2012-12-07 14:07:15 -05:00
Mauricio Carneiro 58e39a8468 Enabling 4-way parallel by default in FastQ2BAM
DEV-317
2012-12-06 17:27:54 -05:00
Joel Thibault cc4e3ec589 Update TODO list 2012-12-06 12:06:47 -05:00
Mark DePristo abd94b2976 Bugfix for handling invalid records in NA12878 KB
-- The previous approach tried to remove the entire MongoVariantContext, which was prone to error when the record was malformed.  Now just grabs the _id and uses it to remove the bad record.
2012-12-06 10:24:24 -05:00
Eric Banks 406adb8d44 The allele biased downsampling should not abort if there's a reduced read. Rather it should always keep the RR and downsample only original reads in the pileup. 2012-12-05 23:15:36 -05:00
Mauricio Carneiro 6d22f4f737 Bringing latest performance updates from the GATK to CMI 2012-12-05 21:40:03 -05:00
Mark DePristo dbf721968d PrintReads large-scale test to protect against another major low-level performance issue 2012-12-05 21:36:27 -05:00
Ryan Poplin 00c23bf704 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 15:53:05 -05:00
Ryan Poplin 234ff64556 Changes to AssessNA12878 to allow for 100s of input callsets to assess against the database. 2012-12-05 15:52:57 -05:00
Ami Levy-Moonshine 5d78a61f7a Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 15:07:12 -05:00
Mark DePristo d0cab795b7 Got caught in the middle of a bad integration test that was fixed in an independent push. Moved test bam into testdata. 2012-12-05 14:49:22 -05:00
Mark DePristo 465694078e Major performance improvement to the GATK engine
-- The NanoScheduler timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers.  Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests.
-- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x.  For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement).
-- Took this opportunity to improve the GATK ProgressMeter.  NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress.  This removes all inner loop calls to the GATK timers.
-- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402
2012-12-05 14:49:22 -05:00
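The progress-meter pattern described in this commit (the inner loop only records the maximum position seen, and a separate daemon thread wakes up periodically to report it) can be sketched as follows. This is a minimal illustration in Python, not the GATK implementation: the class layout, the `report` callback, and the final report on `stop()` are assumptions made for the example.

```python
import threading


class ProgressMeter:
    """Tracks the maximum position seen; a daemon thread reports it periodically."""

    def __init__(self, interval_seconds=1.0, report=print):
        self._max_position = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._interval = interval_seconds
        self._report = report  # hypothetical callback, defaults to printing
        self._thread = threading.Thread(target=self._run, daemon=True)

    def notify_of_progress(self, position):
        # Inner-loop call: a compare-and-store only, with no timer calls.
        with self._lock:
            if position > self._max_position:
                self._max_position = position

    def max_position(self):
        with self._lock:
            return self._max_position

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._report(self.max_position())

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        self._report(self.max_position())  # final report after the loop ends
```

A traversal would call `start()` once, `notify_of_progress(pos)` for each record, and `stop()` at the end; all clock handling stays in the reporting thread.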
Mark DePristo 2b601571e7 Better error handling in NanoScheduler
-- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown.  Errors, like out of memory, would cause the whole system to die.  This bugfix resolves that issue
2012-12-05 14:49:22 -05:00
Mark DePristo 51dbb562c9 Reduce amount of debugging information from NA12878KnowledgeBaseServer 2012-12-05 14:49:22 -05:00
Mauricio Carneiro efe256ec09 binary search implementation to find the minimum coverage
speeds up the walker from 7 days to 12 minutes on chr20.
2012-12-05 14:45:57 -05:00
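The commit message does not show the walker's code, so the following is only a generic sketch of the technique it names: a binary search for the smallest coverage value that satisfies some criterion. The `passes` predicate and the search bounds are hypothetical, and the search is only valid if the predicate is monotone (once it holds at some coverage, it holds at every higher coverage).

```python
def find_min_coverage(passes, lo, hi):
    """Return the smallest coverage c in [lo, hi] for which passes(c) is true.

    Assumes passes is monotone (false ... false, true ... true).
    Returns None if even the upper bound fails.
    """
    if not passes(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid       # mid works, so the answer is mid or lower
        else:
            lo = mid + 1   # mid fails, so the answer is above mid
    return lo
```

Compared with testing every coverage value in turn, this needs only about log2(hi - lo) evaluations of the predicate, which matters when each evaluation is expensive.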
Eric Banks 0c925856cb Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 02:00:39 -05:00
Eric Banks ef87b18e09 In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then). 2012-12-05 02:00:35 -05:00
Mauricio Carneiro 13896356ad Added bootstrapping and fixed the GLM model of the FMCC 2012-12-05 01:32:19 -05:00
Mauricio Carneiro 30f013aeb0 Added a copy() method for ReadBackedPileups
necessary to create new alignment contexts with hard-copies of the pileup.
2012-12-05 01:32:18 -05:00
Mauricio Carneiro 6feda540a4 Better error message for SimpleGATKReports 2012-12-05 01:32:18 -05:00
Eric Banks 726332db79 Disabling the testNoCmdLineHeaderStdout test in UG because it keeps crashing when I run it locally 2012-12-05 00:54:00 -05:00
kshakir 61bde6210b Restored RemoteFile push and pull in base QScript. 2012-12-04 12:34:07 -05:00
Randal Moore 8d2d0253a2 introduce a level of indirection for the forum URLs - this new function will allow me a place to morph the URL into something that is supported by Confluence
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-12-03 22:33:02 -05:00
Eric Banks 1af41754e3 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 22:01:11 -05:00
Eric Banks bca860723a Updating tests to handle bad validation data files (that used the wrong qual score encoding); overrides push from stable. 2012-12-03 22:01:07 -05:00
Eric Banks 387c0defed don't change md5 here because I am handling it separately from unstable with a better command-line in the test 2012-12-03 21:49:45 -05:00
Eric Banks ef95757311 Fix MD5 because of a need to fix a busted bam file in our validation directory (it used the wrong quality score encoding...) 2012-12-03 21:46:46 -05:00
Guillermo del Angel 4ced2e4ffc Merge branch 'develop' of github.com:broadinstitute/cmi-gatk into develop 2012-12-03 20:14:43 -05:00
Guillermo del Angel c2c6b858e3 Better checks/more flexibility in fastq2bam parsing. Immediate benefit: we can now process normal-only samples, and metadata should be able to specify tumor/normal pairs in any order. Hard-coded hacks removed. DEV-134 #resolve #time 3m 2012-12-03 20:14:37 -05:00
Menachem Fromer 472381245a Allow for more refined control of memory and queues to run with 2012-12-03 17:07:03 -05:00
Eric Banks 67932b357d Bug fix for RR: don't let the softclip start position be less than 1 2012-12-03 15:59:14 -05:00
Ryan Poplin d5ed184691 Updating the HC integration test md5s. According to the NA12878 knowledge base this commit cuts down the FP rate by more than 50 percent with no loss in sensitivity. 2012-12-03 15:38:59 -05:00