Ryan Poplin
7c58d8e37d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-15 12:52:46 -05:00
Ryan Poplin
f38ed69fd0
Workaround for a known adapter clipping issue. Temporary fix while adapter clipping is being rewritten.
2011-12-15 12:52:34 -05:00
Mauricio Carneiro
4748ae0a14
Bugfix: Softclips before Hardclips weren't being accounted for
* caught a bug in the hard clipper: when hard clipping soft-clipped bases, the resulting cigar string did not account for them if a hard-clipped base immediately followed.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
2011-12-15 12:17:25 -05:00
Mauricio Carneiro
62a2e335bc
Changing HardClipper contract to allow UNMAPPED reads
Shifted the contract to the functions that operate on reference-based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.
2011-12-15 11:08:19 -05:00
Ryan Poplin
598a21d01c
Adding new downsampling argument to haplotype caller qscript.
2011-12-15 08:49:21 -05:00
Ryan Poplin
568d972991
Adding downsample per region argument to the haplotype caller
2011-12-15 08:43:48 -05:00
Matt Hanna
9333b678b5
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 18:05:44 -05:00
Matt Hanna
6fb4be1a09
Cache header merger.
2011-12-14 18:05:31 -05:00
Ryan Poplin
9dbd0ef06a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 17:11:56 -05:00
Ryan Poplin
3283d96bf2
Reducing the memory usage and runtime of the haplotype caller integration tests so that Eric can run them on his laptop.
2011-12-14 17:11:41 -05:00
Mauricio Carneiro
50dee86d7f
Added unit test to catch Ryan's exception
Unit test to catch the special case that broke the clipping op, fixed in the previous commit.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro
128bdf9c09
Create artificial reads with "default" parameters
* added functions to create synthetic reads for unit testing with reasonable default parameters
* added more functions to create synthetic reads based on cigar string + bases and quals.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro
c85100ce9c
Fix ClippingOp bug when performing multiple hardclip ops
bug: When performing multiple hard clip operations on a read that has indels, if hardclip N+1 requests a clip inside an indel that was already removed by one of the previous hardclips (1..N), the hard clipper would go out of bounds.
fix: dynamically adjust the boundaries according to the new hardclipped read length. (This maintains the current contract that hardclipping will never return a read starting or ending in an indel.)
2011-12-14 16:57:47 -05:00
Eric Banks
de5928ac5a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 16:24:56 -05:00
Eric Banks
4fddac9f22
Updating busted integration tests
2011-12-14 16:24:43 -05:00
Mark DePristo
30e5531e1b
GATKPerformanceOverTime includes CombineVariants, CountCovariates, TableRecalibrator, and SelectVariants
-- Updated R script as well
2011-12-14 16:15:04 -05:00
Mark DePristo
01e547eed3
Parallel SAMDataSource initialization
-- Uses 8 threads to load BAM files and indices in parallel, significantly decreasing the cost of reading thousands of BAM files
-- Added logger.info message noting progress and cost of reading low-level BAM data.
2011-12-14 16:14:26 -05:00
Mark DePristo
71b4bb12b7
Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Mark DePristo
7ac8966184
G1K phase I table now includes calculation for chrX
2011-12-14 16:14:25 -05:00
Eric Banks
e90d77e531
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 15:32:26 -05:00
Eric Banks
35fc2e13c3
Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all.
2011-12-14 15:31:09 -05:00
Eric Banks
1e90d602a4
Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.
2011-12-14 13:38:20 -05:00
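A minimal sketch of the kind of up-front cache the commit above describes, for illustration only. It assumes the PL ordering defined in the VCF specification, where the genotype with allele indices j <= k sits at index k*(k+1)/2 + j; the commit may have used a different layout. The class name `PlIndexCache`, the bound `MAX_ALT_ALLELES`, and both method names are hypothetical and do not come from the GATK source.

```java
// Hypothetical illustration; not the GATK implementation.
final class PlIndexCache {
    // Assumed upper bound on alternate alleles; the real limit is not stated in the commit.
    static final int MAX_ALT_ALLELES = 3;

    // CACHE[numAltAlleles][plIndex] = {alleleIndex1, alleleIndex2}; allele 0 is the reference.
    private static final int[][][] CACHE = new int[MAX_ALT_ALLELES + 1][][];

    static {
        // Fill the table once, for every possible number of alternate alleles.
        for (int numAlt = 0; numAlt <= MAX_ALT_ALLELES; numAlt++) {
            int numAlleles = numAlt + 1;
            int numGenotypes = numAlleles * (numAlleles + 1) / 2;
            int[][] pairs = new int[numGenotypes][];
            for (int k = 0; k < numAlleles; k++) {
                for (int j = 0; j <= k; j++) {
                    pairs[plIndex(j, k)] = new int[] {j, k};
                }
            }
            CACHE[numAlt] = pairs;
        }
    }

    // PL index of the diploid genotype j/k, assuming j <= k (VCF spec ordering).
    static int plIndex(int j, int k) {
        return k * (k + 1) / 2 + j;
    }

    // Cached lookup: which pair of alleles does this PL index represent?
    static int[] allelePair(int numAltAlleles, int plIndex) {
        return CACHE[numAltAlleles][plIndex].clone();
    }
}
```

With two alternate alleles, the six PL slots map to 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 in that order, so a lookup replaces recomputing the pair inside the genotyping loop.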
Eric Banks
988d60091f
Forgot to add in the new result class
2011-12-14 13:37:15 -05:00
Ryan Poplin
4c077f9155
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 12:15:46 -05:00
Ryan Poplin
08e0889f0a
Adding multi sample haplotype caller integration test. Changing interval list to include multi-allelic event. Fix to force a consistent ordering of the best alleles so that the multi-allelic alleles and GLs come out in a deterministic order.
2011-12-14 12:15:30 -05:00
Eric Banks
106bf13056
Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays.
2011-12-14 12:05:50 -05:00
Eric Banks
7648521718
Add check for mixed genotype so that we don't throw an exception on a valid record
2011-12-14 11:26:43 -05:00
Eric Banks
9497e9492c
Bug fix for complex records: do not ever reverse clip out a complete allele.
2011-12-14 11:21:28 -05:00
Eric Banks
9740ae2090
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 10:43:59 -05:00
Eric Banks
09a5a9eac0
Don't update lineNo for decodeLoc - only for decode (otherwise lines get double-counted). Even so, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us, and the error messages note that it's an approximate line number.
2011-12-14 10:43:52 -05:00
Eric Banks
d3f4a5a901
Fail gracefully when encountering malformed VCFs without enough data columns
2011-12-14 10:37:38 -05:00
Ryan Poplin
e061e236ab
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 10:24:48 -05:00
Ryan Poplin
23f15851c1
Turn off haplotype caller verbose debug output by default.
2011-12-14 10:24:33 -05:00
Eric Banks
079932ba2a
The log10cache needs to be larger if we want to handle 10K samples in the UG.
2011-12-13 23:36:10 -05:00
Mark DePristo
6d6bed1ccc
Linear time ROC calculation
2011-12-13 18:46:16 -05:00
Mark DePristo
7dd5c74591
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-13 18:19:41 -05:00
Mark DePristo
ebbdd02569
V1 of random forest analysis script
2011-12-13 18:19:16 -05:00
Ryan Poplin
cd390277d0
Adding temporary read filter to HaplotypeCaller integration test while ReadClipper contracts are being worked out.
2011-12-13 17:41:10 -05:00
Ryan Poplin
7fa1ab1bae
Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test
2011-12-13 17:19:40 -05:00
Ryan Poplin
7a386b45a5
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-13 15:59:52 -05:00
Ryan Poplin
32a1e729ba
Bug fix in HaplotypeCaller for multiallelics with SNP and indel starting at same locus
2011-12-13 15:59:43 -05:00
Eric Banks
e47a113c9f
Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right?
2011-12-12 23:02:45 -05:00
Mauricio Carneiro
5cc1e72fdb
Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro
663184ee9d
Added test mode to PPP
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s updated accordingly
2011-12-12 18:29:06 -05:00
Mauricio Carneiro
a3c3d72313
Added test mode to DPP
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s are now dependent on BWA version
2011-12-12 18:29:06 -05:00
Mauricio Carneiro
a70a0f25fb
Better debug output for SAMDataSource
Output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo
1ba1717ad8
Queue framework to compute UG, CountLoci and other walkers performance across GATK versions
- Includes // TODO with optimization targets for ExactAFCalculationModel
2011-12-12 17:39:52 -05:00
Mark DePristo
d03425df2f
TODO optimization targets
2011-12-12 17:39:51 -05:00
Mauricio Carneiro
3519a897c4
Merged bug fix from Stable into Unstable
2011-12-12 11:00:47 -05:00
Mauricio Carneiro
c8b1c92a6c
Updating the other half of the PPP
2011-12-12 10:55:41 -05:00