8363 Commits (7c58d8e37d490c890baac6421b6c309b4297716d)

Author SHA1 Message Date
Ryan Poplin 7c58d8e37d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-15 12:52:46 -05:00
Ryan Poplin f38ed69fd0 Workaround for a known adapter clipping issue. Temporary fix while adapter clipping is being rewritten. 2011-12-15 12:52:34 -05:00
Mauricio Carneiro 4748ae0a14 Bugfix: Softclips before Hardclips weren't being accounted for
Caught a bug in the hard clipper: when hard clipping softclipped bases, it did not account for them in the resulting cigar string if a hard clipped base already came immediately after them.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
2011-12-15 12:17:25 -05:00
Mauricio Carneiro 62a2e335bc Changing HardClipper contract to allow UNMAPPED reads
Shifted the contract to the functions that operate on reference-based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.
2011-12-15 11:08:19 -05:00
Ryan Poplin 598a21d01c Adding new downsampling argument to haplotype caller qscript. 2011-12-15 08:49:21 -05:00
Ryan Poplin 568d972991 Adding downsample per region argument to the haplotype caller 2011-12-15 08:43:48 -05:00
Matt Hanna 9333b678b5 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 18:05:44 -05:00
Matt Hanna 6fb4be1a09 Cache header merger. 2011-12-14 18:05:31 -05:00
Ryan Poplin 9dbd0ef06a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 17:11:56 -05:00
Ryan Poplin 3283d96bf2 Reducing the memory usage and runtime of the haplotype caller integration tests so that Eric can run them on his laptop. 2011-12-14 17:11:41 -05:00
Mauricio Carneiro 50dee86d7f Added unit test to catch Ryan's exception
Unit test to catch the special case that broke the clipping op, fixed in the previous commit.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro 128bdf9c09 Create artificial reads with "default" parameters
* added functions to create synthetic reads for unit testing with reasonable default parameters
* added more functions to create synthetic reads based on cigar string + bases and quals.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro c85100ce9c Fix ClippingOp bug when performing multiple hardclip ops
bug: When performing multiple hard clip operations on a read that has indels, if hardclip N+1 requested a clip inside an indel already removed by one of the previous hardclips (1..N), the hard clipper would go out of bounds.

fix: dynamically adjust the boundaries according to the new hardclipped read length. (This maintains the current contract that hardclipping will never return a read starting or ending in an indel.)
2011-12-14 16:57:47 -05:00
Eric Banks de5928ac5a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 16:24:56 -05:00
Eric Banks 4fddac9f22 Updating busted integration tests 2011-12-14 16:24:43 -05:00
Mark DePristo 30e5531e1b GATKPerformanceOverTime includes CombineVariants, CountCovariates, TableRecalibrator, and SelectVariants
-- Updated R script as well
2011-12-14 16:15:04 -05:00
Mark DePristo 01e547eed3 Parallel SAMDataSource initialization
-- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount
-- Added logger.info message noting progress and cost of reading low-level BAM data.
2011-12-14 16:14:26 -05:00
Mark DePristo 71b4bb12b7 Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
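The two fixes described in the subsetSamples commit above can be illustrated with a minimal sketch. This is not the GATK code: the function name, the dict-based genotype lookup, and the string genotypes are all hypothetical, chosen only to show the intended behavior (skip absent samples, and take no shortcut when the counts happen to match).

```python
def subset_samples(genotypes_by_sample, requested_samples):
    """Return genotypes for the requested samples, skipping absent ones.

    genotypes_by_sample: dict of sample name -> genotype (hypothetical shape).
    requested_samples: iterable of sample names to keep.
    """
    subset = []
    # Deliberately no "same count, so return everything" shortcut:
    # equal counts do not imply that the same samples were requested.
    for sample in requested_samples:
        genotype = genotypes_by_sample.get(sample)
        if genotype is not None:
            # Absent samples are skipped; None is never appended.
            subset.append(genotype)
    return subset


known = {"NA12878": "0/1", "NA12891": "1/1"}
# Two samples requested, two known, but one requested sample is missing.
print(subset_samples(known, ["NA12878", "NA99999"]))  # prints ['0/1']
```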
Mark DePristo 7ac8966184 G1K phased I table now includes calculation for chrX 2011-12-14 16:14:25 -05:00
Eric Banks e90d77e531 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 15:32:26 -05:00
Eric Banks 35fc2e13c3 Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all. 2011-12-14 15:31:09 -05:00
Eric Banks 1e90d602a4 Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles. 2011-12-14 13:38:20 -05:00
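As a rough illustration of the kind of cache the commit above describes, the sketch below precomputes, for each number of alternate alleles, the allele pair behind every PL index. It assumes the standard VCF genotype ordering, where the diploid genotype (j, k) with j <= k sits at index k*(k+1)/2 + j and allele 0 is the reference. The function and variable names are hypothetical and are not taken from the GATK source.

```python
def build_pl_index_cache(max_alt_alleles):
    """Map PL index -> (allele1, allele2) for 0..max_alt_alleles alt alleles.

    cache[n][i] is the allele pair for PL index i at a site with n
    alternate alleles, following the VCF ordering k*(k+1)/2 + j.
    """
    cache = []
    for num_alt in range(max_alt_alleles + 1):
        num_alleles = num_alt + 1  # alternates plus the reference allele
        # Enumerating k in the outer loop and j <= k in the inner loop
        # yields the pairs in increasing PL-index order.
        pairs = [(j, k) for k in range(num_alleles) for j in range(k + 1)]
        cache.append(pairs)
    return cache


PL_INDEX_TO_ALLELES = build_pl_index_cache(3)
# Biallelic site: the PL order is 0/0, 0/1, 1/1.
print(PL_INDEX_TO_ALLELES[1])  # prints [(0, 0), (0, 1), (1, 1)]
```

Building the table once up front trades a small amount of memory for avoiding the index arithmetic on every lookup.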
Eric Banks 988d60091f Forgot to add in the new result class 2011-12-14 13:37:15 -05:00
Ryan Poplin 4c077f9155 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 12:15:46 -05:00
Ryan Poplin 08e0889f0a Adding multi sample haplotype caller integration test. Changing interval list to include multi-allelic event. Fix to force a consistent ordering of the best alleles so that the multi-allelic alleles and GLs come out in a deterministic order. 2011-12-14 12:15:30 -05:00
Eric Banks 106bf13056 Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays. 2011-12-14 12:05:50 -05:00
Eric Banks 7648521718 Add check for mixed genotype so that we don't exception out for a valid record 2011-12-14 11:26:43 -05:00
Eric Banks 9497e9492c Bug fix for complex records: do not ever reverse clip out a complete allele. 2011-12-14 11:21:28 -05:00
Eric Banks 9740ae2090 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 10:43:59 -05:00
Eric Banks 09a5a9eac0 Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number. 2011-12-14 10:43:52 -05:00
Eric Banks d3f4a5a901 Fail gracefully when encountering malformed VCFs without enough data columns 2011-12-14 10:37:38 -05:00
Ryan Poplin e061e236ab Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 10:24:48 -05:00
Ryan Poplin 23f15851c1 Turn off haplotype caller verbose debug output by default. 2011-12-14 10:24:33 -05:00
Eric Banks 079932ba2a The log10cache needs to be larger if we want to handle 10K samples in the UG. 2011-12-13 23:36:10 -05:00
Mark DePristo 6d6bed1ccc Linear time ROC calculation 2011-12-13 18:46:16 -05:00
Mark DePristo 7dd5c74591 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-13 18:19:41 -05:00
Mark DePristo ebbdd02569 V1 of random forest analysis script 2011-12-13 18:19:16 -05:00
Ryan Poplin cd390277d0 Adding temporary read filter to HaplotypeCaller integration test while ReadClipper contracts are being worked out. 2011-12-13 17:41:10 -05:00
Ryan Poplin 7fa1ab1bae Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test 2011-12-13 17:19:40 -05:00
Ryan Poplin 7a386b45a5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-13 15:59:52 -05:00
Ryan Poplin 32a1e729ba Bug fix in HaplotypeCaller for multiallelics with SNP and indel starting at same locus 2011-12-13 15:59:43 -05:00
Eric Banks e47a113c9f Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right? 2011-12-12 23:02:45 -05:00
Mauricio Carneiro 5cc1e72fdb Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro 663184ee9d Added test mode to PPP
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s updated accordingly
2011-12-12 18:29:06 -05:00
Mauricio Carneiro a3c3d72313 Added test mode to DPP
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s are now dependent on BWA version
2011-12-12 18:29:06 -05:00
Mauricio Carneiro a70a0f25fb Better debug output for SAMDataSource
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo 1ba1717ad8 Queue framework to compute UG, CountLoci and other walkers performance across GATK versions
- Includes // TODO with optimization targets for ExactAFCalculationModel
2011-12-12 17:39:52 -05:00
Mark DePristo d03425df2f TODO optimization targets 2011-12-12 17:39:51 -05:00
Mauricio Carneiro 3519a897c4 Merged bug fix from Stable into Unstable 2011-12-12 11:00:47 -05:00
Mauricio Carneiro c8b1c92a6c Updating the other half of the PPP 2011-12-12 10:55:41 -05:00