Commit Graph

12686 Commits (dd85646067d4effc7abbb3fc852207474ba695b7)

Author SHA1 Message Date
Eric Banks dd85646067 Various small improvements to the KB assessments.
1) TP reviews with 0/0 genotypes were killing those sites and making them appear as assessed FPs even when correctly called!
Fixed this by changing the logic in the assessor to allow discordant genotypes through as FPs.
Also, isMonomorphic() in the MongoGenotype needs to check whether the genotype is discordant.
Added unit test for this case.

2) Minor code cleanup in the Assessor class.
The most important being the renaming of isUsableCall() to isNotUsableCall() since that's what it is returning.
2013-08-08 23:37:45 -04:00
kshakir 1f86cf13d1 Merge pull request #359 from lbergelson/lb_relax_add_parameter
Trivial update to QScript.scala
2013-08-07 06:45:01 -07:00
Eric Banks 6d67795916 Merge pull request #365 from broadinstitute/md_kb_improvements
Fix multiple critical bugs in NA12878 KB
2013-08-07 06:38:21 -07:00
Mark DePristo 66e1b75118 Critical bugfix for OneChunkIterator
-- The previous version used overlaps on the full GenomeLoc of the variant in the KB, so deletions that didn't start in an interval were still included in that interval. Tribble does not behave this way, which caused a mismatch when assessing variants in the knowledgebase
2013-08-07 08:08:37 -04:00
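
The two inclusion rules described in the OneChunkIterator commit above can be contrasted with a small sketch. This is an illustration only, not the GATK or Tribble code: the function names, the closed 1-based coordinates, and the reading of the fixed behavior as "the variant must start inside the interval" are assumptions made for the example.

```python
def overlaps(variant_start, variant_stop, interval_start, interval_stop):
    """Old rule (assumed): any overlap between the variant's full span and the interval."""
    return variant_start <= interval_stop and variant_stop >= interval_start


def starts_within(variant_start, interval_start, interval_stop):
    """Fixed rule (assumed): the variant's start position must fall inside the interval."""
    return interval_start <= variant_start <= interval_stop


# Hypothetical deletion spanning 95-105, queried against the interval 100-200.
deletion = (95, 105)
interval = (100, 200)

old_rule = overlaps(deletion[0], deletion[1], interval[0], interval[1])
new_rule = starts_within(deletion[0], interval[0], interval[1])
```

Under these assumptions the deletion is included by the overlap rule but excluded by the start-based rule, which is the kind of mismatch the commit message describes.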
Mark DePristo 7aba5a2f9f Several improvements to AssessNA12878 and KB
-- Bugfix for BAMs containing reads without real (M,I,D,N) operators.  Simply needed to set validation stringency to SILENT in the read. Added a BadCigar filter to the SAMRecord stream anyway
-- Add capture all sites mode to AssessNA12878: will write all sites to the badSites VCF, regardless of whether they are bad.  It's useful if you essentially want to annotate a VCF with KB information for later analysis, such as computing ROC curves
-- Add ignore filters mode to AssessNA12878: treats all sites in the input VCF calls as PASS, even if the site has a FILTER field set
-- Add minPNonRef argument to AssessNA12878: this will consider a site not called even if the NA12878 genotype is not 0/0 if the PLs are present and the PL for 0/0 isn't greater than this value.  It allows us to easily differentiate low confidence non-ref sites obtained via multi-sample calling from highly confident non-ref calls that might be real TP or FPs
2013-08-07 08:08:37 -04:00
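
A minimal sketch of the minPNonRef rule as worded in the commit above: a site with a non-0/0 genotype is still treated as not called when PLs are present and the PL for 0/0 is not greater than the threshold. The function name, the dictionary representation of the PLs, and the example values are assumptions for illustration; the actual AssessNA12878 implementation may differ.

```python
def is_called(genotype, pls, min_p_non_ref):
    """Return True if the site counts as a confident non-ref call.

    genotype: genotype string such as "0/1".
    pls: dict mapping genotype string to phred-scaled likelihood, or None if absent.
    min_p_non_ref: hypothetical stand-in for the minPNonRef argument.
    """
    if genotype == "0/0":
        return False
    if pls is not None and pls["0/0"] <= min_p_non_ref:
        # Non-ref genotype, but the PL for 0/0 is not greater than the threshold.
        return False
    return True


low_confidence = is_called("0/1", {"0/0": 5, "0/1": 0, "1/1": 40}, 10)
high_confidence = is_called("0/1", {"0/0": 50, "0/1": 0, "1/1": 40}, 10)
no_pls = is_called("0/1", None, 10)
```

With these made-up values, only the first site is demoted to "not called"; the site without PLs is left alone because the rule applies only when PLs are present.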
Mark DePristo 00f4d767e4 Merge pull request #364 from broadinstitute/md_vqsr_improvements
Separate num Gaussians for + and - GMM in VQSR
2013-08-07 04:37:45 -07:00
Mark DePristo c21402d4af Separate num Gaussians for + and - GMM in VQSR
-- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models.  However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model.
-- Now the maxGaussians argument only applies to the positive model
-- This update builds a GMM for the negative model with a default of 4 max Gaussians (though this can be controlled via a command-line parameter)
-- Removes the percentBadVariants argument.  The only way to control how many variants are included in the negative model is with minNumBad
-- Reduced the minNumBad argument default to 1000 from 2500
-- Update MD5s for VQSR.  md5s changed significantly due to underlying changes in the default GMM model.  Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected.
-- minNumBad is now numBad
-- Plot all negative training points as well, since this significantly changes our view of the GMM PDF
2013-08-07 07:36:50 -04:00
Mark DePristo 44348a3761 Merge pull request #363 from broadinstitute/md_het_docs
Better docs on the meaning of heterozygosity
2013-08-07 04:28:16 -07:00
Mark DePristo 318f7e74e4 Better docs on the meaning of heterozygosity
-- [delivers #53522209]
2013-08-07 07:27:45 -04:00
Eric Banks dd0e6409c6 Merge pull request #367 from broadinstitute/md_hc_ref_fix
Bugfix for ReferenceConfidenceModel
2013-08-06 20:37:08 -07:00
Mark DePristo 40bc7d6a9c Bugfix for ReferenceConfidenceModel
-- In the case where there's some variation to assemble and evaluate but the resulting haplotypes don't result in any called variants, the reference model would throw "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype".  Now we detect this case and emit the standard no variation output.
-- [delivers #54625060]
2013-08-06 16:00:32 -04:00
Ryan Poplin 6dfd17122f Merge pull request #362 from broadinstitute/rp_single_sample_hc_pipeline
Adding single sample HC qscript for Mauricio.
2013-08-06 12:11:50 -07:00
Ryan Poplin ee0aba224c Adding single sample HC qscript for Mauricio. 2013-08-06 15:10:15 -04:00
Mark DePristo 81a74351fd Merge pull request #360 from broadinstitute/rp_vqsr_ordering_of_annotations_bug
Fix for the VQSR visualization script with the new ordering of annotatio...
2013-08-03 09:49:42 -07:00
Ryan Poplin a46f633bd6 Fix for the VQSR visualization script with the new ordering of annotations. 2013-08-02 19:10:45 -04:00
lbergelson af36c7ce9a Update QScript.scala
Relaxing addAll parameter type from Seq to Traversable to make it slightly more flexible.
2013-08-02 14:09:26 -04:00
Eric Banks 08a7ef6620 Merge pull request #358 from broadinstitute/md_tribble_reuse_query_stream
Rev picard to get optimized tribble feature reads
2013-08-02 10:29:39 -07:00
Mark DePristo d5dd3b23db Rev picard to get optimized tribble feature reads
-- The previous version of TribbleIndexedFeatureReader.query() would open a RandomAccessFile each time the GATK crossed a shard boundary.  When running with -L wex.intervals (or any time there were lots of intervals) we'd be opening and closing enormous numbers of files, radically slowing down the GATK.  With these patched versions of Tribble we see something like the following performance improvements:

SelectVariants with -L wex.intervals on my local machine against non-local file

pre-patch => 3 hours
post-patch => 30 seconds
2013-08-02 10:31:36 -04:00
jmthibault79 9316a70d1e Merge pull request #355 from broadinstitute/eb_add_error_handling_to_kb
Added error handling to the newly added sites iterator so that it doesn't NPE when it encounters a bad record
2013-08-02 06:40:41 -07:00
Eric Banks ae5fc4c726 Merge pull request #356 from broadinstitute/mc_refbias
Reference bias walker
2013-08-02 06:30:56 -07:00
Eric Banks 8a1e2d58ef Merge pull request #357 from broadinstitute/mc_resort_haplotypes
Better caching for the HaplotypeCaller (significant speed up on NA12878 chr20!!)
2013-08-02 06:29:41 -07:00
Mauricio Carneiro 3e75262a3e Reference bias walker
Calculates reference bias based on the AD genotype field instead of AB. This is slightly more meaningful for indels and still a good estimator for SNPs.
2013-08-02 01:44:57 -04:00
Mauricio Carneiro 285ab2ac62 Better caching for the HaplotypeCaller
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.

Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sort clusters haplotypes of different lengths together, wasting *tons* of re-compute.

Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
2013-08-02 01:27:29 -04:00
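
The ordering change in the HaplotypeCaller caching commit above can be sketched in a few lines. This is a toy illustration, not the GATK code: the haplotype strings are made up, and `length_changes` is a hypothetical helper that counts how often adjacent haplotypes differ in length, used here as a rough proxy for how often a length-dependent cache would have to be wiped.

```python
# Hypothetical haplotypes of two different lengths.
haplotypes = ["AACGT", "ACC", "AAG", "AAGGT", "AAC"]

# Old ordering: plain lexicographic sort.
lexicographic = sorted(haplotypes)

# New ordering: by length first, then lexicographically within each length.
by_length_then_lex = sorted(haplotypes, key=lambda h: (len(h), h))


def length_changes(ordered):
    """Count adjacent pairs whose lengths differ."""
    return sum(1 for a, b in zip(ordered, ordered[1:]) if len(a) != len(b))


wipes_before = length_changes(lexicographic)
wipes_after = length_changes(by_length_then_lex)
```

In this example the lexicographic order alternates between lengths 3 and 5, while the length-first order changes length only once.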
Eric Banks 0b062e7f22 Merge pull request #354 from broadinstitute/eb_fix_rr_count_encoding
Two reduce reads updates/fixes
2013-08-01 12:34:19 -07:00
Eric Banks e5be038f1a Added error handling to the newly added sites iterator so that it doesn't NPE when it encounters a bad record.
Added a unit test that exactly replicates the behavior.
2013-08-01 15:25:20 -04:00
Eric Banks 1e396af4d0 Two reduce reads updates/fixes:
1. Removing old legacy code that was capping the positional depth for reduced reads to 127.

Unfortunately this cap effectively performs biased down-sampling and throws off e.g. FS numbers.
Added end to end unit test that depth counts in RR can be higher than max byte.

Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127.

2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads.

Refactored it so that it can only remove non-consensus reads.
2013-08-01 14:34:59 -04:00
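
A toy illustration of why capping depth at the maximum signed byte value acts like biased down-sampling, as the reduce reads commit above notes. The depths are made up and this is not the FS calculation itself; it only shows that clamping one count but not the other distorts their ratio.

```python
MAX_BYTE = 127  # largest value of a signed 8-bit integer


def capped(depth):
    """Clamp a positional depth to the signed byte maximum, as the legacy code did."""
    return min(depth, MAX_BYTE)


# Hypothetical depths supporting two alleles at one position.
ref_depth, alt_depth = 300, 100

true_ratio = ref_depth / alt_depth
capped_ratio = capped(ref_depth) / capped(alt_depth)
```

Here only the deeper count is truncated, so the ratio shrinks from 3.0 to about 1.27, and any statistic built on relative counts shifts accordingly.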
Eric Banks ec3c885a25 Merge pull request #353 from broadinstitute/rp_HC_updates_for_1000G_and_WGS_calling
Max number of haplotypes to evaluate no longer grows unbounded with the ...
2013-07-31 08:29:06 -07:00
Ryan Poplin 4f3411f3d4 Max number of haplotypes to evaluate no longer grows unbounded with the number of samples. This is necessary for multi-sample calling projects with over 100 samples. 2013-07-31 10:48:55 -04:00
Yossi Farjoun 00cedd0bd3 Merge pull request #352 from broadinstitute/yf_SNPEFF_Stratifier
moved SnpEffUtilUnitTest to public tree
2013-07-30 14:52:33 -07:00
Yossi Farjoun 284176cd7b moved SnpEffUtilUnitTest to public tree 2013-07-30 17:51:40 -04:00
droazen b8709b1942 Merge pull request #332 from broadinstitute/st_fpga_hmm
FPGA support for PairHMM
2013-07-30 14:21:21 -07:00
Eric Banks ac06829194 Merge pull request #349 from broadinstitute/yf_SNPEFF_Stratifier
Adding a representation of the hierarchy of flags output by snpEff (Yoss...
2013-07-30 12:42:25 -07:00
Joseph Rose d2860a5486 Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose) 2013-07-30 15:38:32 -04:00
Mauricio Carneiro 7b731dd596 Removed native method call
and fixed indentation.
2013-07-30 13:59:58 -04:00
chartl cf46256356 Merge pull request #350 from broadinstitute/chartl_genotypeconcordance_doc_cleanup
Add <pre> tags to the Genotype Concordance docs. Tables were not being d...
2013-07-29 16:17:26 -07:00
Chris Hartl 464a5b229d Add <pre> tags to the Genotype Concordance docs. Tables were not being displayed properly. 2013-07-29 15:48:17 -07:00
Eric Banks 678d038c76 Merge pull request #348 from broadinstitute/gg_gatkdoc_fixes
Gg gatkdoc fixes
2013-07-26 13:17:51 -07:00
Geraldine Van der Auwera 3063d82797 Fixed example in CallableLoci gatkdoc 2013-07-26 15:51:31 -04:00
Geraldine Van der Auwera fc4a8b1dd0 Fixed example in DoC gatkdoc 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 660b075900 Added deprecation notice for SomaticIndelDetector 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 5ad99c362d Added caveat to gatkdocs for MAPQ read transformers & cleaned up AB annotation gatkdocs 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 0ea3f8ca58 Added function to gatkdocs to specify what VCF field an annotation goes in (INFO or FORMAT) 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera edbd17b8e0 Added note of caution to VQSR gatkdocs for option BOTH of recalibration mode 2013-07-26 15:51:29 -04:00
Ryan Poplin f52196496d Merge pull request #347 from broadinstitute/eb_more_dnagling_tail_improvements
More specific fix for the dangling tail edge case with a single leading deletion.
2013-07-26 07:25:47 -07:00
Ryan Poplin 66db412ad0 Merge pull request #345 from broadinstitute/rp_vqsr_sort_annotations
Automatically order the annotation dimensions in the VQSR by their stand...
2013-07-26 07:23:42 -07:00
Ryan Poplin 8c205dda1b Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line. 2013-07-26 10:22:43 -04:00
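
A minimal sketch of ordering annotation dimensions by standard deviation instead of by command-line order, as in the commit above. The annotation names, the values, the use of the population standard deviation, and the ascending direction are all assumptions for illustration; the commit message does not state which direction VQSR uses or how the deviation is computed.

```python
from statistics import pstdev

# Hypothetical annotation values, keyed in the order given on the command line.
annotations = {
    "QD": [2.0, 4.0, 6.0, 8.0],
    "FS": [0.0, 10.0, 20.0, 30.0],
    "MQ": [59.0, 60.0, 60.0, 61.0],
}

command_line_order = list(annotations)

# Assumed direction: smallest standard deviation first.
by_std_dev = sorted(annotations, key=lambda name: pstdev(annotations[name]))
```

The point of the sketch is only that the resulting order depends on the data rather than on how the user happened to list the annotations.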
Eric Banks 924d9b7ef4 Merge pull request #344 from lbergelson/lb_library_read_filter
Adding LibraryReadFilter.
2013-07-26 06:44:53 -07:00
Louis Bergelson 7c43b5f26a Adding LibraryReadFilter.
--Moving LibraryReadFilter which has been part of Mutect into gatk public.
--Added an additional check for null values.
2013-07-26 09:32:14 -04:00
Eric Banks 9372c5ef41 Merge pull request #334 from broadinstitute/mc_generic_input_for_qualify_missing_intervals
QualifyMissingIntervals: support different formats
2013-07-25 12:39:26 -07:00
sathibault 71eb944e62 Adding CnyPairHMMUnitTest 2013-07-25 14:19:50 -05:00