8351 Commits (c85100ce9cee310c013b8564a308dcd608f96098)

Author SHA1 Message Date
Mauricio Carneiro c85100ce9c Fix ClippingOp bug when performing multiple hardclip ops
bug: When performing multiple hard clip operations on a read that has indels, if hard clip N+1 requests to clip inside an indel that has already been removed by one of the previous N hard clips, the hard clipper would go out of bounds.

fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).
2011-12-14 16:57:47 -05:00
Eric Banks de5928ac5a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 16:24:56 -05:00
Eric Banks 4fddac9f22 Updating busted integration tests 2011-12-14 16:24:43 -05:00
Mark DePristo 30e5531e1b GATKPerformanceOverTime includes CombineVariants, CountCovariates, TableRecalibrator, and SelectVariants
-- Updated R script as well
2011-12-14 16:15:04 -05:00
Mark DePristo 01e547eed3 Parallel SAMDataSource initialization
-- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount
-- Added logger.info message noting progress and cost of reading low-level BAM data.
2011-12-14 16:14:26 -05:00
Mark DePristo 71b4bb12b7 Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
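A minimal sketch of the corrected subsetting logic described in the commit above. This is not the GATK code: the function name `subset_samples`, the dict-based genotype store, and the string genotypes are assumptions made for illustration only.

```python
def subset_samples(genotypes, requested):
    """Return the genotypes of the requested samples, in request order.

    genotypes: dict mapping sample name -> genotype (hypothetical layout)
    requested: iterable of sample names
    """
    subset = []
    for name in requested:
        genotype = genotypes.get(name)
        # A requested sample may be absent: skip it instead of appending None.
        if genotype is not None:
            subset.append(genotype)
    # Deliberately no "same count, so return everything" shortcut: equal counts
    # do not imply that the requested names match the known ones.
    return subset


known = {"A": "0/1", "B": "1/1"}
print(subset_samples(known, ["A", "C"]))
```

Here two samples are requested and two genotypes are known, yet only the genotype for `A` is returned, because `C` is missing.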
Mark DePristo 7ac8966184 G1K phase I table now includes calculation for chrX 2011-12-14 16:14:25 -05:00
Eric Banks e90d77e531 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 15:32:26 -05:00
Eric Banks 35fc2e13c3 Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all. 2011-12-14 15:31:09 -05:00
Eric Banks 1e90d602a4 Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles. 2011-12-14 13:38:20 -05:00
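One way to picture the cache described in the commit above, as a sketch rather than the actual GATK implementation. It assumes the standard VCF ordering of diploid genotype likelihoods, where the allele pair (j, k) with j <= k sits at index k*(k+1)/2 + j. The names `pl_index` and `build_pl_cache` are hypothetical.

```python
def pl_index(j, k):
    """PL index of the unordered diploid allele pair (j, k) in VCF ordering."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def build_pl_cache(max_alt_alleles):
    """Precompute, for every number of alt alleles up to the maximum,
    the list that maps a PL index to the allele pair it represents.

    cache[num_alt][i] is the (j, k) pair at PL index i, where allele 0 is
    the reference and alleles 1..num_alt are the alternates.
    """
    cache = []
    for num_alt in range(max_alt_alleles + 1):
        num_alleles = num_alt + 1
        pairs = [(j, k) for k in range(num_alleles) for j in range(k + 1)]
        cache.append(pairs)
    return cache


cache = build_pl_cache(2)
print(cache[1])  # bi-allelic site
print(cache[2])  # tri-allelic site
```

Building the table once up front replaces a per-call index computation with a plain list lookup, which is the kind of saving the commit message refers to.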
Eric Banks 988d60091f Forgot to add in the new result class 2011-12-14 13:37:15 -05:00
Ryan Poplin 4c077f9155 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 12:15:46 -05:00
Ryan Poplin 08e0889f0a Adding multi sample haplotype caller integration test. Changing interval list to include multi-allelic event. Fix to force a consistent ordering of the best alleles so that the multi-allelic alleles and GLs come out in a deterministic order. 2011-12-14 12:15:30 -05:00
Eric Banks 106bf13056 Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays. 2011-12-14 12:05:50 -05:00
Eric Banks 7648521718 Add check for mixed genotype so that we don't throw an exception for a valid record 2011-12-14 11:26:43 -05:00
Eric Banks 9497e9492c Bug fix for complex records: do not ever reverse clip out a complete allele. 2011-12-14 11:21:28 -05:00
Eric Banks 9740ae2090 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 10:43:59 -05:00
Eric Banks 09a5a9eac0 Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number. 2011-12-14 10:43:52 -05:00
Eric Banks d3f4a5a901 Fail gracefully when encountering malformed VCFs without enough data columns 2011-12-14 10:37:38 -05:00
Ryan Poplin e061e236ab Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 10:24:48 -05:00
Ryan Poplin 23f15851c1 Turn off haplotype caller verbose debug output by default. 2011-12-14 10:24:33 -05:00
Eric Banks 079932ba2a The log10cache needs to be larger if we want to handle 10K samples in the UG. 2011-12-13 23:36:10 -05:00
Mark DePristo 6d6bed1ccc Linear time ROC calculation 2011-12-13 18:46:16 -05:00
Mark DePristo 7dd5c74591 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-13 18:19:41 -05:00
Mark DePristo ebbdd02569 V1 of random forest analysis script 2011-12-13 18:19:16 -05:00
Ryan Poplin cd390277d0 Adding temporary read filter to HaplotypeCaller integration test while ReadClipper contracts are being worked out. 2011-12-13 17:41:10 -05:00
Ryan Poplin 7fa1ab1bae Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test 2011-12-13 17:19:40 -05:00
Ryan Poplin 7a386b45a5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-13 15:59:52 -05:00
Ryan Poplin 32a1e729ba Bug fix in HaplotypeCaller for multiallelics with SNP and indel starting at same locus 2011-12-13 15:59:43 -05:00
Eric Banks e47a113c9f Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right? 2011-12-12 23:02:45 -05:00
Mauricio Carneiro 5cc1e72fdb Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro 663184ee9d Added test mode to PPP
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s updated accordingly
2011-12-12 18:29:06 -05:00
Mauricio Carneiro a3c3d72313 Added test mode to DPP
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s are now dependent on BWA version
2011-12-12 18:29:06 -05:00
Mauricio Carneiro a70a0f25fb Better debug output for SAMDataSource
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo 1ba1717ad8 Queue framework to compute UG, CountLoci and other walkers performance across GATK versions
- Includes // TODO with optimization targets for ExactAFCalculationModel
2011-12-12 17:39:52 -05:00
Mark DePristo d03425df2f TODO optimization targets 2011-12-12 17:39:51 -05:00
Mauricio Carneiro 3519a897c4 Merged bug fix from Stable into Unstable 2011-12-12 11:00:47 -05:00
Mauricio Carneiro c8b1c92a6c Updating the other half of the PPP 2011-12-12 10:55:41 -05:00
Mauricio Carneiro 2a32ebe104 Bringing Laurent's Mendelian Violation changes to the main repo -- he promised to follow the guidelines next time 2011-12-12 09:52:08 -05:00
Mauricio Carneiro 1008c453ec Merge remote-tracking branch 'lau/master' into laurent 2011-12-12 09:50:58 -05:00
Mauricio Carneiro 52c64b971f Updating MD5s -- really don't know why it didn't update before 2011-12-12 09:48:58 -05:00
Laurent Francioli 7cf27bb66e Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval. 2011-12-12 12:22:43 +01:00
Laurent Francioli 025bdfe2cc Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-12 12:19:44 +01:00
Mauricio Carneiro ed91461c49 Data Processing Pipeline Test
* Added standard pipeline test for the DPP
* Added a full BWA pipeline test for the DPP
* Included the extra files for the reference needed by BWA (to be used by DPP and PPP tests)
2011-12-12 00:24:51 -05:00
Mauricio Carneiro cca8a18608 PPP pipeline test
* added a pipeline test to the Pacbio Processing Pipeline.
* updated exampleBAM with more complete RG information so we can use it in a wider variety of pipeline tests
* added exampleDBSNP.vcf file with only chromosome 1 in the range of the exampleFASTA.fasta reference for pipeline tests
2011-12-11 17:32:21 -05:00
Eric Banks 7b6338c742 Merge branch 'master' into trialleles 2011-12-11 00:28:46 -05:00
Eric Banks 7c4b9338ad The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now. 2011-12-11 00:23:33 -05:00
Eric Banks 044f211a30 Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly. 2011-12-10 23:57:14 -05:00
Mauricio Carneiro 21ac3b59d7 Merged bug fix from Stable into Unstable 2011-12-09 16:51:46 -05:00
Mauricio Carneiro 13905c00b3 Updating PacbioProcessingPipeline to new Queue standards 2011-12-09 16:51:02 -05:00