Commit Graph

12250 Commits (eb6308a0e4eeaf2f06bfd3cd9e0d9c219b18140f)

Author SHA1 Message Date
Mauricio Carneiro eb6308a0e4 General DiagnoseTargets documentation cleanup
* remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps.
   * add gatkdocs to all the parameters
   * clean up the logic on callable status a bit (still need to be re-worked into a plugin system)
   * update integration tests
2013-04-22 23:53:09 -04:00
Mauricio Carneiro b3c0abd9e8 Remove REF_N status from DiagnoseTargets
This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply.

[fixes #48442663]
2013-04-22 23:53:09 -04:00
Mauricio Carneiro 2b923f1568 fix for DiagnoseTargets multiple filter output
Problem
-------
Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval

Solution
--------
Only allow low median coverage check if there are reads

[fixes #48442675]
2013-04-22 23:53:09 -04:00
Mauricio Carneiro cf7afc1ad4 Fixed "skipped intervals" bug on DiagnoseTargets
Problem
-------
Diagnose targets was skipping intervals when they were not covered by any reads.

Solution
--------
Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of teh BAM file.

Summarized Changes
------------------
   * Outputs all intervals it iterates over, even if uncovered
   * Outputs leftover intervals in the end of the traversal
   * Updated integration tests

[fixes #47813825]
2013-04-22 23:53:09 -04:00
Ryan Poplin ff430c821e Merge pull request #178 from broadinstitute/md_common_suffix_bugfix
Bugfix for CommonSuffixSplitter
2013-04-22 06:53:01 -07:00
Mark DePristo be66049a6f Bugfix for CommonSuffixSplitter
-- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases.  In this case we'd eliminate the reference source vertex.  Fixed by checking for this condition and aborting the simplification
-- Update MD5s, including minor improvements
2013-04-21 19:37:01 -04:00
Mark DePristo fc29a66a63 Merge pull request #177 from broadinstitute/md_gatklog_cleanup
Delete progress log when script is done for downloadGATKReportsFromS3
2013-04-20 11:08:06 -07:00
Mark DePristo 7e88a12bf1 Delete progress log when script is done for downloadGATKReportsFromS3.csh 2013-04-20 14:06:02 -04:00
Eric Banks 12ac60ac3a Merge pull request #175 from broadinstitute/eb_increase_log10_cache_size
Minor: bump up the amount of cached log10 data in MathUtils so that Monk...
2013-04-19 05:43:07 -07:00
Eric Banks 3477e092ea Minor: bump up the amount of cached log10 data in MathUtils so that Monkol can actually call 50K samples. 2013-04-19 08:39:08 -04:00
Ryan Poplin ef8679c0a0 Merge pull request #174 from broadinstitute/md_hc_parameters
Two sensitivity / specificity improvements to the haplotype caller
2013-04-17 12:33:42 -07:00
Mark DePristo f0e64850da Two sensitivity / specificity improvements to the haplotype caller
-- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC.  Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data
-- Reduce the min usable base qual to 8 by default in the HC.  In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL.  This is a bit too aggressive of a requirement, so I lowered it to 8.
-- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller

 NA12878 MEM chr20:10-11
 Name    VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
 branch  SNPS                  1216               0               2            194                        0
 branch  INDELS                 312               2              13             71                        7
 master  SNPS                  1214               0               4            194                        1
 master  INDELS                 309               2              16             71                       10

-- Update MD5s in the integration tests to reflect these two new changes
2013-04-17 12:32:31 -04:00
MauricioCarneiro b27706859a Merge pull request #172 from broadinstitute/eb_reducereads_het_improvements
Eb reducereads het improvements
2013-04-16 16:48:10 -07:00
Eric Banks 5bce0e086e Refactored binomial probability code in MathUtils.
* Moved redundant code out of UGEngine
  * Added overloaded methods that assume p=0.5 for speed efficiency
  * Added unit test for the binomialCumulativeProbability method
2013-04-16 18:19:07 -04:00
Eric Banks df189293ce Improve compression in Reduce Reads by incorporating probabilistic model and global het compression
The Problem:
  Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
  regions in an exome) causes RR (with default settings) to consider it a variant region.  This
  seriously hurts compression performance.

The Solution:
  1. We now use a probabilistic model for determining whether we can create a consensus (in other
  words, whether we can error correct a site) instead of the old ratio threshold.  We calculate
  the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
  that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
  2. We also allow het compression globally, not just at known sites.  So if we cannot create a
  consensus at a given site then we try to perform het compression; and if we cannot perform het
  compression that we just don't reduce the variant region.  This way very wonky regions stay
  uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
  locus get het compressed.

Details:
  1. -minvar is now deprecated in favor of -min_pvalue.
  2. Added integration test for bad pvalue input.
  3. -known argument still works to force het compression only at known sites; if it's not included
     then we allow het compression anywhere.  Added unit tests for this.
  4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
     Before finalizing het compression, we now check for insertions or other variant regions (usually due
     to multi-allelics) which can render a region incompressible (and we back out if we find one).  We
     were checking for excessive softclips before, but now we add these tests too.
  5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
     consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
     out instead of backing out all of them.
  6. We no longer create a mini read at the stop of the variant window for het compression.  Instead, we
     allow it to be part of the next global consensus.
  7. The coverage test is no longer run systematically on all integration tests because the quals test
     supercedes it.  The systematic quals test is now much stricter in order to catch bugs and edge cases
     (very useful!).
  8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
     for a consensus was affected by good and bad bases/reads).
  9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
     This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
  10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
     with insertions from a header.

Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
2013-04-16 18:19:06 -04:00
Mark DePristo 8e6309d56e Merge pull request #171 from broadinstitute/rp_hc_restore_filter_function
Restore the read filter function in the HaplotypeCaller.
2013-04-16 11:21:32 -07:00
Ryan Poplin e0dfe5ca14 Restore the read filter function in the HaplotypeCaller. 2013-04-16 12:01:30 -04:00
Geraldine Van der Auwera e176fc3af1 Merge pull request #159 from broadinstitute/md_bqsr_ion
Trivial BQSR bug fixes and improvement
2013-04-16 08:54:47 -07:00
Ryan Poplin 936f4da1f6 Merge pull request #166 from broadinstitute/md_hc_persample_haplotypes
Select the haplotypes we move forward for genotyping per sample, not poo...
2013-04-16 08:46:56 -07:00
Mark DePristo bfbc23e99e Merge pull request #173 from broadinstitute/md_fix_vqsr
Update MD5s for VQSR header change
2013-04-16 08:46:26 -07:00
Mark DePristo 17982bcbf8 Update MD5s for VQSR header change 2013-04-16 11:45:45 -04:00
Mark DePristo 067d24957b Select the haplotypes we move forward for genotyping per sample, not pooled
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples.  This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes.  The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects.  After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model.  It's worse than the current version in both single and multiple samples:

1000G EUR samples:

10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                    50               0               5              8                        1
per_sample  INDELS                   6               0               7              2                        1
pooled      SNPS                    49               0               6              8                        1
pooled      INDELS                   5               0               8              2                        1

100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                   144               0              22             28                        1
per_sample  INDELS                  28               1              16              9                       11
pooled      SNPS                   143               0              23             28                        1
pooled      INDELS                  27               1              17              9                       11

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
2013-04-16 09:42:03 -04:00
Ryan Poplin 0ee21e58c3 Merge pull request #165 from broadinstitute/md_vqsr_improvements
Two simple VQSR usability improvements
2013-04-16 06:26:38 -07:00
Mark DePristo 5a74a3190c Improvements to the VariantRecalibrator R plots
-- VariantRecalibrator now emits plots with denormlized values (original values) instead of their normalized (x - mu / sigma) which helps to understand the distribution of values that are good and bad
2013-04-16 09:09:51 -04:00
Mark DePristo 564fe36d22 VariantRecalibrator's VQSR.vcf now contains NEG/POS labels
-- It's useful to know which sites have been used in the training of the model.  The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE
-- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels
2013-04-16 09:09:47 -04:00
Mark DePristo ee51195bf5 Merge pull request #170 from broadinstitute/mc_hmm_mantissa_optimize
Quick optimization to the PairHMM
2013-04-15 10:23:56 -07:00
Mauricio Carneiro 9bfa5eb70f Quick optimization to the PairHMM
Problem
--------
the logless HMM scale factor (to avoid double under-flows) was 10^300. Although this serves the purpose this value results in a complex mantissa that further complicates cpu calculations.

Solution
---------
initialize with 2^1020 (2^1023 is the max value), and adjust the scale factor accordingly.
2013-04-14 23:25:33 -04:00
MauricioCarneiro 55547b68bb Merge pull request #169 from broadinstitute/md_ug_bugfix
UnifiedGenotyper bugfix: don't create haplotypes with 0 bases
2013-04-13 12:05:53 -07:00
Mark DePristo 3144eae51c UnifiedGenotyper bugfix: don't create haplotypes with 0 bases
-- The PairHMM no longer allows us to create haplotypes with 0 bases.  The UG indel caller used to create such haplotypes.  Now we assign -Double.MAX_VALUE likelihoods to such haplotypes.
-- Add integration test to cover this case, along with private/testdata BAM
-- [Fixes #47523579]
2013-04-13 14:57:55 -04:00
delangel 6c9360b020 Merge pull request #167 from broadinstitute/gda_read_adaptor_trimmer_improvements
Several improvements to ReadAdaptorTrimmer so that it can be incorporate...
2013-04-13 10:43:45 -07:00
Guillermo del Angel a971e7ab6d Several improvements to ReadAdaptorTrimmer so that it can be incorporated into ancient DNA processing pipelines (for which it was developed):
-- Add pair cleaning feature. Reads in query-name sorted order are required and pairs need to appear consecutively, but if -cleanPairs option is set, a malformed pair where second read is missing is just skipped instead of erroring out.
-- Add integration tests
-- Move walker to public
2013-04-13 13:41:36 -04:00
Mark DePristo a5301e17a2 Merge pull request #168 from broadinstitute/mc_update_example_grp
Updating the exampleGRP.grp test file
2013-04-13 08:01:07 -07:00
Mauricio Carneiro a063e79597 Updating the exampleGRP.grp test file
It had been generated with an old version of BQSRv2 and wasn't compatible with exampleBAM anymore.
2013-04-13 09:07:13 -04:00
Mauricio Carneiro f11c8d22d4 Updating java 7 md5's to java 6 md5's 2013-04-13 08:21:48 -04:00
Mark DePristo 776f5a2f6f Merge pull request #164 from broadinstitute/mc_clia_scripts
Clia Scripts
2013-04-12 13:08:18 -07:00
Mauricio Carneiro 09d29e5d0d In HaplotypeCallerScript:
* fix downsampling parameter
   * fix the default value of required fields  (_ instead of .)
   * add support to multiple interval files
2013-04-12 15:54:54 -04:00
Mauricio Carneiro 802ae76905 Script for coverage evaluation of exomes and targeted sequencing projects in the Genomics Platform 2013-04-12 15:54:53 -04:00
Mark DePristo b32457be8d Merge pull request #163 from broadinstitute/mc_hmm_caching_again
Fix another caching issue with the PairHMM
2013-04-12 12:34:49 -07:00
Mauricio Carneiro 403f9de122 Fix another caching issue with the PairHMM
The Problem
----------
Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result.

Solution
--------
The HaplotypeCaller only initializes the PairHMM once and then feed it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes.

Summarized Changes
------------------
   * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0)
   * Updated related MD5's

Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9
2013-04-12 14:52:45 -04:00
Ryan Poplin e5b9b6041c Merge pull request #162 from broadinstitute/md_path_sw_update
Slight update to Path SW parameters.
2013-04-12 10:38:26 -07:00
Mark DePristo 0e627bce93 Slight update to Path SW parameters.
-- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.
2013-04-12 12:43:52 -04:00
Ryan Poplin 4ef6e0deb1 Merge pull request #161 from broadinstitute/md_unstable_sw
Slightly improved Smith-Waterman parameter values for HaplotypeCaller Pa...
2013-04-12 07:59:43 -07:00
Mark DePristo 50cdffc61f Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:

ref:               ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

in which the alt and ref haplotypes differ in having indel at both the start and end of the bubble.  The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in the either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble.

ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls.  I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.

-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat sysmatically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all

General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly.  Depreciated old call to inline constants.  This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default.  Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
2013-04-11 18:22:55 -04:00
Mark DePristo 74196ff7db Trivial BQSR bug fixes and improvement
-- Ensure that BQSR works properly for an Ion Torrent BAM.  (Added integration test and bam)
-- Improve the error message when a unknown platform is found (integration test added)
2013-04-11 17:08:35 -04:00
Mark DePristo 35293cde49 Merge pull request #160 from broadinstitute/rp_bqsr_gatherer_works_with_empty_tables
Updating BQSR RecalibrationEngine to work correctly with empty BQSR tabl...
2013-04-11 14:07:24 -07:00
Ryan Poplin a507381a33 Updating BQSR RecalibrationEngine to work correctly with empty BQSR tables.
-- Previously would crash when a scatter/gather interval contained no usable data.
-- Added unit test to cover this case.
2013-04-11 16:27:59 -04:00
delangel 44230b97eb Merge pull request #158 from broadinstitute/hc_fast_kmer_graph_creation
Performance improvements for building DeBruijn graphs
2013-04-11 12:09:01 -07:00
Mark DePristo fb86887bf2 Fast algorithm for determining which kmers are good in a read
-- old algorithm was O(kmerSize * readLen) for each read.  New algorithm is O(readLen)
-- Added real unit tests for the addKmersFromReads to the graph.  Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.
2013-04-11 09:54:22 -04:00
Mark DePristo bf42be44fc Fast DeBruijnGraph creation using the kmer counter
-- The previous creation algorithm used the following algorithm:

for each kmer1 -> kmer2 in each read
  add kmers 1 and 2 to the graph
  add edge kmer1 -> kmer2 in the graph, if it's not present (does check)
  update edge count by 1 if kmer1 -> kmer2 already existed in the graph

-- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)).  This is actually pretty expensive because get and add edges is expensive in jgrapht.
-- The new approach uses the following algorithm:

for each kmer1 -> kmer2 in each read
  add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap

for each kmer pair 1 and 2 in the hash counter
  add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map
  update edge count by count from map if kmer1 -> kmer2 already existed in the graph

-- This algorithm ensures that we add very much fewer edges
-- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself.
-- Overall runtimes are greatly reduced using this algorith
2013-04-10 17:10:59 -04:00
Mark DePristo 7d267639d8 Merge pull request #157 from broadinstitute/rp_smith_waterman_failing_test
Bug fix in SWPairwiseAlignment.
2013-04-10 13:54:25 -07:00