Eric Banks
de5928ac5a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 16:24:56 -05:00
Eric Banks
4fddac9f22
Updating busted integration tests
2011-12-14 16:24:43 -05:00
Mark DePristo
30e5531e1b
GATKPerformanceOverTime includes CombineVariants, CountCovariates, TableRecalibrator, and SelectVariants
...
-- Updated R script as well
2011-12-14 16:15:04 -05:00
Mark DePristo
01e547eed3
Parallel SAMDataSource initialization
...
-- Uses 8 threads to load BAM files and indices in parallel, significantly decreasing the cost of reading thousands of BAM files
-- Added logger.info message noting progress and cost of reading low-level BAM data.
2011-12-14 16:14:26 -05:00
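The parallel initialization described in this commit can be illustrated with a minimal sketch. It is not the GATK implementation: `open_readers_in_parallel` and the `open_reader` callback are hypothetical names, and only the pool size of 8 and the progress/cost log message come from the commit text.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("SAMDataSource")


def open_readers_in_parallel(paths, open_reader, n_threads=8):
    """Open every file in `paths` using a fixed pool of worker threads.

    `open_reader` is a hypothetical callback that opens one BAM file and
    its index. Results come back in the same order as `paths`.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves input order even though the work runs concurrently.
        readers = list(pool.map(open_reader, paths))
    elapsed = time.perf_counter() - start
    logger.info("Loaded %d BAM files in %.2f s", len(readers), elapsed)
    return readers
```

Opening a file and its index is mostly I/O-bound work, which is why a thread pool is a plausible fit for this kind of startup cost.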
Mark DePristo
71b4bb12b7
Bug fix for incorrect logic in subsetSamples
...
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for a logic failure: when the number of requested samples equaled the number of known genotypes, all of the records were returned, which isn't correct when some requested samples are missing.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
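The two cases fixed in this commit can be shown with a small sketch. This is not the GATK code: `subset_samples` and the dict-based genotype store are assumptions made for illustration only.

```python
def subset_samples(genotypes, requested):
    """Return the genotypes for the requested sample names.

    `genotypes` is a hypothetical dict mapping sample name -> genotype.
    Samples that are not present are skipped rather than added as None,
    and there is no shortcut based on comparing counts.
    """
    return [genotypes[name] for name in requested if name in genotypes]
```

With `genotypes = {"A": "0/1", "B": "1/1"}` and `requested = ["A", "C"]`, the counts match (two and two), yet only the genotype for `A` should be returned. A count-based shortcut would wrongly return both records.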
Mark DePristo
7ac8966184
G1K phased I table now includes calculation for chrX
2011-12-14 16:14:25 -05:00
Eric Banks
e90d77e531
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 15:32:26 -05:00
Eric Banks
35fc2e13c3
Using the new PL cache, fix a bug: when only a subset of the genotyped alleles is used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all.
2011-12-14 15:31:09 -05:00
Eric Banks
1e90d602a4
Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.
2011-12-14 13:38:20 -05:00
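A sketch of such a cache, assuming the standard VCF ordering of diploid genotype likelihoods, where the genotype with alleles (j, k), j <= k, sits at index k*(k+1)/2 + j. The function name and the limit of 3 alternate alleles are illustrative and not taken from the GATK source.

```python
def build_pl_index_cache(max_alt_alleles):
    """Return a list whose entry i is the (j, k) allele pair for PL index i.

    Allele 0 is the reference; alleles 1..max_alt_alleles are alternates.
    """
    num_alleles = max_alt_alleles + 1
    cache = []
    for k in range(num_alleles):
        for j in range(k + 1):
            # Appended in VCF order, so this pair lands at index k*(k+1)//2 + j.
            cache.append((j, k))
    return cache


# Hypothetical upper bound of 3 alternate alleles, chosen for the example.
PL_INDEX_TO_ALLELES = build_pl_index_cache(3)
```

Because the ordering for n alternate alleles is a prefix of the ordering for n + 1, a single table built for the maximum covers every smaller number of alternate alleles.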
Eric Banks
988d60091f
Forgot to add in the new result class
2011-12-14 13:37:15 -05:00
Ryan Poplin
4c077f9155
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 12:15:46 -05:00
Ryan Poplin
08e0889f0a
Adding multi sample haplotype caller integration test. Changing interval list to include multi-allelic event. Fix to force a consistent ordering of the best alleles so that the multi-allelic alleles and GLs come out in a deterministic order.
2011-12-14 12:15:30 -05:00
Eric Banks
106bf13056
Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays.
2011-12-14 12:05:50 -05:00
Eric Banks
7648521718
Add check for mixed genotype so that we don't throw an exception for a valid record
2011-12-14 11:26:43 -05:00
Eric Banks
9497e9492c
Bug fix for complex records: do not ever reverse clip out a complete allele.
2011-12-14 11:21:28 -05:00
Eric Banks
9740ae2090
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 10:43:59 -05:00
Eric Banks
09a5a9eac0
Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even so, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number.
2011-12-14 10:43:52 -05:00
Eric Banks
d3f4a5a901
Fail gracefully when encountering malformed VCFs without enough data columns
2011-12-14 10:37:38 -05:00
Ryan Poplin
e061e236ab
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 10:24:48 -05:00
Ryan Poplin
23f15851c1
Turn off haplotype caller verbose debug output by default.
2011-12-14 10:24:33 -05:00
Eric Banks
079932ba2a
The log10cache needs to be larger if we want to handle 10K samples in the UG.
2011-12-13 23:36:10 -05:00
Mark DePristo
6d6bed1ccc
Linear time ROC calculation
2011-12-13 18:46:16 -05:00
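The commit does not say how the linear-time calculation works. One common approach, sketched here purely as an assumption, is a single pass over calls that are already sorted by decreasing score, keeping running counts of true and false positives. The function name is hypothetical and tied scores are not merged.

```python
def roc_points(labels_sorted_by_score):
    """Compute ROC points in one pass over pre-sorted boolean labels.

    `labels_sorted_by_score` holds True for a true positive and False for a
    false positive, ordered from highest to lowest score.
    Returns a list of (false_positive_rate, true_positive_rate) pairs.
    """
    total_pos = sum(1 for is_true in labels_sorted_by_score if is_true)
    total_neg = len(labels_sorted_by_score) - total_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_true in labels_sorted_by_score:
        if is_true:
            tp += 1
        else:
            fp += 1
        fpr = fp / total_neg if total_neg else 0.0
        tpr = tp / total_pos if total_pos else 0.0
        points.append((fpr, tpr))
    return points
```

The running counts avoid recomputing the rates from scratch at every threshold, which is what makes a naive implementation quadratic.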
Mark DePristo
7dd5c74591
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-13 18:19:41 -05:00
Mark DePristo
ebbdd02569
V1 of random forest analysis script
2011-12-13 18:19:16 -05:00
Ryan Poplin
cd390277d0
Adding temporary read filter to HaplotypeCaller integration test while ReadClipper contracts are being worked out.
2011-12-13 17:41:10 -05:00
Ryan Poplin
7fa1ab1bae
Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test
2011-12-13 17:19:40 -05:00
Ryan Poplin
7a386b45a5
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-13 15:59:52 -05:00
Ryan Poplin
32a1e729ba
Bug fix in HaplotypeCaller for multiallelics with SNP and indel starting at same locus
2011-12-13 15:59:43 -05:00
Eric Banks
e47a113c9f
Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right?
2011-12-12 23:02:45 -05:00
Mauricio Carneiro
5cc1e72fdb
Parallelized SelectVariants
...
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro
663184ee9d
Added test mode to PPP
...
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s updated accordingly
2011-12-12 18:29:06 -05:00
Mauricio Carneiro
a3c3d72313
Added test mode to DPP
...
* in test mode, no @PG tags are output to the final bam file
* updated pipeline test to use -test mode.
* MD5s are now dependent on BWA version
2011-12-12 18:29:06 -05:00
Mauricio Carneiro
a70a0f25fb
Better debug output for SAMDataSource
...
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo
1ba1717ad8
Queue framework to compute the performance of UG, CountLoci and other walkers across GATK versions
...
- Includes // TODO with optimization targets for ExactAFCalculationModel
2011-12-12 17:39:52 -05:00
Mark DePristo
d03425df2f
TODO optimization targets
2011-12-12 17:39:51 -05:00
Mauricio Carneiro
3519a897c4
Merged bug fix from Stable into Unstable
2011-12-12 11:00:47 -05:00
Mauricio Carneiro
c8b1c92a6c
Updating the other half of the PPP
2011-12-12 10:55:41 -05:00
Mauricio Carneiro
2a32ebe104
Bringing Laurent's Mendelian Violation changes to the main repo -- he promised to follow the guidelines next time
2011-12-12 09:52:08 -05:00
Mauricio Carneiro
1008c453ec
Merge remote-tracking branch 'lau/master' into laurent
2011-12-12 09:50:58 -05:00
Mauricio Carneiro
52c64b971f
Updating MD5s -- really don't know why it didn't update before
2011-12-12 09:48:58 -05:00
Laurent Francioli
7cf27bb66e
Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval.
2011-12-12 12:22:43 +01:00
Laurent Francioli
025bdfe2cc
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-12 12:19:44 +01:00
Mauricio Carneiro
ed91461c49
Data Processing Pipeline Test
...
* Added standard pipeline test for the DPP
* Added a full BWA pipeline test for the DPP
* Included the extra files for the reference needed by BWA (to be used by DPP and PPP tests)
2011-12-12 00:24:51 -05:00
Mauricio Carneiro
cca8a18608
PPP pipeline test
...
* added a pipeline test to the Pacbio Processing Pipeline.
* updated exampleBAM with more complete RG information so we can use it in a wider variety of pipeline tests
* added exampleDBSNP.vcf file with only chromosome 1 in the range of the exampleFASTA.fasta reference for pipeline tests
2011-12-11 17:32:21 -05:00
Eric Banks
7b6338c742
Merge branch 'master' into trialleles
2011-12-11 00:28:46 -05:00
Eric Banks
7c4b9338ad
The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now.
2011-12-11 00:23:33 -05:00
Eric Banks
044f211a30
Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly.
2011-12-10 23:57:14 -05:00
Mauricio Carneiro
21ac3b59d7
Merged bug fix from Stable into Unstable
2011-12-09 16:51:46 -05:00
Mauricio Carneiro
13905c00b3
Updating PacbioProcessingPipeline to new Queue standards
2011-12-09 16:51:02 -05:00
Eric Banks
364f1a030b
Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone.
2011-12-09 14:25:28 -05:00