Commit Graph

630 Commits (cdfd07f9eb4e2ca18b2b6b10d00797cd3a156ebd)

Author SHA1 Message Date
Mark DePristo 8caf39cb65 Experimental LikelihoodRankSum annotation
-- Added experimental LikelihoodRankSum, which required slightly more detailed access to the information managed by the base class, so added an overloaded getElementForRead also provides access to the MostLikelyAllele class
-- Added base class default implementation of getElementForPileupElement() which returns null, indicating that the pileup version isn't supported.
-- Added @Override to many of the RankSum classes for safety's sake

-- Updates to GeneralCallingPipeline: annotate sites with dbSNP IDs,
-- R script to assess the value of annotations for VQSR
2013-06-21 13:57:11 -04:00
Mark DePristo f726d8130a VariantRecalibrator bugfix for bad log10sumlog10 values
-- The VR, when the model is bad, may evaluate log10sumlog10 where some of the values in the vector are NaN. This case is now trapped in VR and handled as previously -- indicating that the model has failed and evaluation continues.
2013-06-21 12:28:53 -04:00
Mark DePristo dee51c4189 Error out when NCT and BAMOUT are used with the HaplotypeCaller
-- Currently we don't support writing a BAM file from the haplotype caller when nct is enabled.  Check in initialize if this is the case, and throw a UserException
2013-06-21 09:25:57 -04:00
sathibault 3db8908ae8 Remove debug print statement 2013-06-20 08:28:58 -05:00
Valentin Ruano-Rubio 1f8282633b Removed plots generation from the BaseRecalibration software
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table

Implementation:

* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
2013-06-19 14:47:56 -04:00
Valentin Ruano-Rubio 08f92bb6f9 Added AnalyzeCovariates tool to generate BQSR assessment quality plots.
Implemtation details:

* Added tool class *.AnalyzeCovariates
* Added convenient addAll method to Utils to be able to add elements of an array.
* Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple imput recalibration report are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
2013-06-19 14:38:02 -04:00
Guillermo del Angel f176c854c6 Swapping in logless Pair HMM for default usage with UG:
-- Changed default HMM model.
-- Removed check.
-- Changed md5's: PL's in the high 100s change by a point or two due to new implementation.
-- Resulting performance improvement is about 30 to 50% less runtime when using -glm INDEL.
2013-06-18 10:06:27 -04:00
Ryan Poplin 8511c4385c Adding new pruning parameter to ReadThreadingAssembler
-- numPruningSamples allows one to specify that the minPruning factor must be met by this many samples for a path to be considered good (e.g. seen twice in three samples). By default this is just one sample.
-- adding unit test to test this new functionality
2013-06-17 16:46:40 -04:00
Guillermo del Angel f6025d25ae Feature requested by Reich lab and Paavo lab in Leipzig for ancient DNA processing:
-- When doing cross-species comparisons and studying population history and ancient DNA data, having SOME measure of confidence is needed at every single site that doesn't depend on the reference base, even in a naive per-site SNP mode. Old versions of GATK provided GQ and some wrong PL values at reference sites but these were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, that, if enabled will:
a) Emit all 3 ALT snp alleles in the ALT column.
b) Emit all corresponding 10 PL values.
It's up to the user to process these PL values downstream to make sense of these. Note that, in order to follow VCF spec, the QUAL field in a reference call when there are non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields.
-- Tweaks and fixes to processing pipelines for Reich lab.
2013-06-17 13:21:09 -04:00
Eric Banks e48f754478 Fixes to several of the annotations for reduced reads (and other issues).
1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads.

2. The rank sume tests should not be using reduced reads (because they do not represent distinct observations).

3. Fixed a massive bug in the BaseQualityRankSumTest annotation!  It was not using the base qualities but rather
the read likelihoods?!

Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values.
Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and
makes insignificant distributions look significant (so it can falsely cause the filtering of good sites).

Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.
2013-06-16 01:18:20 -04:00
Mark DePristo dd5674b3b8 Add genotyping accuracy assessment to AssessNA12878
-- Now table looks like:

Name     VariantType  AssessmentType           Count
variant  SNPS         TRUE_POSITIVE              1220
variant  SNPS         FALSE_POSITIVE                0
variant  SNPS         FALSE_NEGATIVE                1
variant  SNPS         TRUE_NEGATIVE               150
variant  SNPS         CALLED_NOT_IN_DB_AT_ALL       0
variant  SNPS         HET_CONCORDANCE          100.00
variant  SNPS         HOMVAR_CONCORDANCE        99.63
variant  INDELS       TRUE_POSITIVE               273
variant  INDELS       FALSE_POSITIVE                0
variant  INDELS       FALSE_NEGATIVE               15
variant  INDELS       TRUE_NEGATIVE                79
variant  INDELS       CALLED_NOT_IN_DB_AT_ALL       2
variant  INDELS       HET_CONCORDANCE           98.67
variant  INDELS       HOMVAR_CONCORDANCE        89.58

-- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does it's best to simply match the genotype after subsetting to a set of alleles.  So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A.  This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB.  Add lots of unit tests for this functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:

20      10769255        .       A       ATGTG   165.73  .       ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE     GT:AD:DP:GQ:PL  0/1:1,9:10:6:360,0,6

Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function.  Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
2013-06-13 15:05:32 -04:00
Mark DePristo 33720b83eb No longer merge overlapping fragments from HaplotypeCaller
-- Merging overlapping fragments turns out to be a bad idea.  In the case where you can safely merge the reads you only gain a small about of overlapping kmers, so the potential gains are relatively small.  That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, and you artificially construct reads that look like the reference but actually may carry a larger true insertion w.r.t. the reference.  Because this problem isn't limited to repetitive sequeuence, but in principle could occur in any sequence, it's just not safe to do this merging.  Best to leave haplotype construction to the assembly graph.
2013-06-13 15:05:32 -04:00
Mark DePristo c837d67b2f Merge pull request #273 from broadinstitute/rp_readIsPoorlyModelled
Relaxing the constraints on the readIsPoorlyModelled function.
2013-06-13 08:40:24 -07:00
Ryan Poplin f44efc27ae Relaxing the constraints on the readIsPoorlyModelled function.
-- Turns out we were aggressively throwing out borderline-good reads.
2013-06-13 11:06:23 -04:00
Ryan Poplin d5f0848bd5 HC bam writer now sets the read to MQ0 if it isn't informative
-- Makes visualization of read evidence easier in IGV.
2013-06-13 10:11:54 -04:00
sathibault 336050ab71 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
2013-06-13 07:28:24 -05:00
Ryan Poplin d1f397c711 Fixing bug with dangling tails in which the tail connects all the way back to the reference source node.
-- List of vertices can't contain a source node.
2013-06-12 12:23:01 -04:00
Ryan Poplin e1fd3dff9a Merge pull request #268 from broadinstitute/eb_calling_accuracy_improvements_to_HC
Eb calling accuracy improvements to hc
2013-06-11 11:18:51 -07:00
Eric Banks dadcfe296d Reworking of the dangling tails merging code.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort.  Also, we put in a check for low-complexity of graphs
and don't let those pass through.

Added tests for this implementation that checks exact SW results and correct edges added.
2013-06-11 12:53:04 -04:00
Guillermo del Angel 55d5f2194c Read Error Corrector for haplotype assembly
Principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer but correct sequence will be supported by many reads to correct sequences will look like common k-mers. So, algorithm has 3 main steps:
1. K-mer graph buildup.
For each read in an active region, a map from k-mers to the number of times they have been seen is built.
2. Building correction map.
All "rare" k-mers that are sparse (by default, seen only once), get mapped to k-mers that are good (by default, seen at least 20 times but this is a CL argument), and that lie within a given Hamming distance (by default, =1). This map can be empty (i.e. k-mers can be uncorrectable).
3. Correction proposal
For each constituent k-mer of each read, if this k-mer is rare and maps to a good k-mer, get differing base positions in k-mer and add these to a list of corrections for each base in each read. Then, correct read at positions where correction proposal is unanimous and non-empty.

The algorithm defaults are chosen to be very stringent and conservative in the correction: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in read if all correction proposals are congruent.

By default, algorithm is disabled but can be enabled in HaplotypeCaller via the -readErrorCorrect CL option. However, at this point it's about 3x-10x more expensive so it needs to be optimized if it's to be used.
2013-06-11 12:26:24 -04:00
Eric Banks c0030f3f2d We no longer subset down to the best N haplotypes for the GL calculation.
I explain in comments within the code that this was causing problems with the marginalization over events.
2013-06-11 11:51:26 -04:00
Eric Banks c0e3874db0 Change the HC's phredScaledGlobalReadMismappingRate from 60 to 45, because Ryan and Mark told me to. 2013-06-11 11:51:26 -04:00
Eric Banks 77868d034f Do not allow the use of Ns in reads for graph construction.
Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones.
This was the cause of at least one FN where there were Ns at a SNP position.
2013-06-11 11:51:26 -04:00
Eric Banks e4e7d39e2c Fix FN problem stemming from sequence graphs that contain cycles.
Problem:
The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers.
Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final
sequence graph.  Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions.

Solution:
We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them.  If (and only if) we
can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate
a graph without cycles or attempt too many iterations.
2013-06-11 11:51:26 -04:00
Ryan Poplin 58e354176e Minor changes to docs in the graph pruning. 2013-06-11 10:33:22 -04:00
Mark DePristo 1c03ebc82d Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
2013-06-10 16:20:31 -04:00
Mauricio Carneiro 1d67d07cf1 better docs for Qualify Missing Intervals
now that it's available to the public, better give'em good docs!
2013-06-10 15:17:40 -04:00
Mauricio Carneiro c84f0deb1d Don't crash if cds file is not provided
CDS file should be optional.
2013-06-10 13:42:00 -04:00
Mauricio Carneiro a95fbd48e5 Moving QualifyMissingIntervals to protected
Making this walker available so we can share it with the CSER group for CLIA analysis.
2013-06-10 13:11:41 -04:00
Michael McCowan 00c06e9e52 Performance improvements:
- Memoized MathUtil's cumulative binomial probability function.
 - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
2013-06-09 11:26:52 -04:00
Mark DePristo 209dd64268 HaplotypeCaller now emits per-sample DP
-- Created a new annotation DepthPerSampleHC that is by default on in the HaplotypeCaller
-- The depth for the HC is the sum of the informative alleles at this site.  It's not perfect (as we cannot differentiate between reads that align over the event but aren't informative vs. those that aren't even close) but it's a pretty good proxy and it matches with the AD field (i.e., sum(AD) = DP).
-- Update MD5s
-- delivers [#48240601]
2013-06-06 09:47:32 -04:00
Mark DePristo 34bdf20132 Bugfix for bad AD values in UG/HC
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
2013-06-05 17:48:41 -04:00
Mark DePristo c9f5b53efa Bugfix for HC can fail to assemble the correct reference sequence in some cases
-- Ultimately this was caused by overly aggressive merging of CommonSuffixMerger.  In the case where you have this graph:

ACT [ref source] -> C
G -> ACT -> C

we would merge into

G -> ACT -> C

which would linearlize into

GACTC

Causing us to add bases to the reference source node that couldn't be recovered.  The solution was to ensure that CommonSuffixMerger only operates when all nodes to be merged aren't source nodes themselves.

-- Added a convenient argument to the haplotype caller (captureAssemblyFailureBAM) that will write out the exact reads to a BAM file that went into a failed assembly run (going to a file called AssemblyFailure.BAM).  This can be used to rerun the haplotype caller to produce the exact error, which can be hard in regions of deep coverage where the downsampler state determines the exact reads going into assembly and therefore makes running with a sub-interval not reproduce the error
-- Did some misc. cleanup of code while debugging
-- [delivers #50917729]
2013-06-03 16:16:39 -04:00
Ryan Poplin ab40f4af43 Break out the GGA kmers and the read kmers into separate functions for the DeBruijn assembler.
-- Added unit test for new function.
2013-06-03 14:00:35 -04:00
sathibault de2a2a4cc7 Added command-line flag to disble FPGA
Completed integration with FPGA driver
2013-06-03 07:30:32 -05:00
Mark DePristo 64b4d80729 Make BQSR calculateIsIndel robust to indel CIGARs are start/end of read
-- The previous implementation attempted to be robust to this, but not all cases were handled properly.  Added a helper function updateInde() that bounds up the update to be in the range of the indel array, and cleaned up logic of how the method works.  The previous behavior was inconsistent across read fwd/rev stand, so that the indel cigars at the end of read were put at the start of reads if the reads were in the forward strand but not if they were in the reverse strand.  Everything is now consistent, as can be seen in the symmetry of the unit tests:

        tests.add(new Object[]{"1D3M",   false, EventType.BASE_DELETION, new int[]{0,0,0}});
        tests.add(new Object[]{"1M1D2M", false, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"2M1D1M", false, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"3M1D",   false, EventType.BASE_DELETION, new int[]{0,0,1}});

        tests.add(new Object[]{"1D3M",   true, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"1M1D2M", true, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"2M1D1M", true, EventType.BASE_DELETION, new int[]{0,0,1}});
        tests.add(new Object[]{"3M1D",   true, EventType.BASE_DELETION, new int[]{0,0,0}});

        tests.add(new Object[]{"4M1I",   false, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"3M1I1M", false, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"2M1I2M", false, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
        tests.add(new Object[]{"1M1I3M", false, EventType.BASE_INSERTION, new int[]{1,0,0,0,0}});
        tests.add(new Object[]{"1I4M",   false, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});

        tests.add(new Object[]{"4M1I",   true, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
        tests.add(new Object[]{"3M1I1M", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,1}});
        tests.add(new Object[]{"2M1I2M", true, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"1M1I3M", true, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"1I4M",   true, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});

-- delivers #50445353
2013-05-31 13:58:37 -04:00
Ryan Poplin b5b9d745a7 New implementation of the GGA mode in the HaplotypeCaller
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
2013-05-31 10:35:36 -04:00
Ryan Poplin 61af37d0d2 Create a new normalDistributionLog10 function that is unit tested for use in the VQSR. 2013-05-30 16:00:08 -04:00
Eric Banks a5a68c09fa Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads.
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not.  So I refactored that method to include a boolean argument as to whether or not
clips should be ignored.  Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
2013-05-29 16:41:01 -04:00
Mauricio Carneiro f1affa9fbb Turn off downsampling for DiagnoseTargets
Diagnose targets should never be downsampled. (and I didn't know there was a default downsampling going on for locus walkers)
2013-05-28 14:58:50 -04:00
Ryan Poplin 85905dba92 Bugfix for GGA mode in UG silently ignoring indels
-- Started by Mark. Finished up by Ryan.
-- GGA mode still respected glm argument for SNP and INDEL models, so that you would silently fail to genotype indels at all if the -glm INDEL wasn't provided, but you'd still emit the sites, so you'd see records in the VCF but all alleles would be no calls.
-- https://www.pivotaltracker.com/story/show/48924339 for more information
-- [resolves #48924339]
2013-05-24 13:47:26 -04:00
Mauricio Carneiro da21924b44 Make the missing targets output never use stdout
Problem
--------
Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided

Solution
--------
Make it NOT default to stdout

[Delivers #50386741]
2013-05-22 14:22:54 -04:00
Mark DePristo d167743852 Archived banded logless PairHMM
BandedHMM
---------
-- An implementation of a linear runtime, linear memory usage banded logless PairHMM.  Thought about 50% faster than current PairHMM, this implementation will be superceded by the GraphHMM when it becomes available.  The implementation is being archived for future reference

Useful infrastructure changes
-----------------------------
-- Split PairHMM into a N2MemoryPairHMM that allows smarter implementation to not allocate the double[][] matrices if they don't want, which was previously occurring in the base class PairHMM
-- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for using in unit or performance testing.  Added example of 100kb of data to private/testdata.  Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
2013-05-22 12:24:00 -04:00
Mark DePristo a1093ad230 Optimization for ActiveRegion.removeAll
-- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion.  This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element.   New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set.  Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
2013-05-21 16:18:57 -04:00
Eric Banks 1f3624d204 Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing 2013-05-21 11:35:05 -04:00
Ryan Poplin 507853c583 Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result.
-- The troule stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.
2013-05-20 14:29:04 -04:00
sathibault 195f0c3e98 Disable CnyPairHMM 2013-05-17 08:30:23 -05:00
Yossi Farjoun 9234a0efcd Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs
Bug fixes and missing interval functionality for Diagnose Targets

While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred that unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule).

It is with these cavets that I accept the pull request.
2013-05-16 09:25:54 -07:00
Mauricio Carneiro 9eceae793a Tool to manipulate intervals outside the GATK
Performs basic set operations on intervals like union, intersect and difference between two or more intervals. Useful for techdev and QC purposes.
2013-05-13 11:56:24 -04:00
Mauricio Carneiro 3dbb86b052 Outputting missing intervals in DiagnoseTargets
Problem
------
Diagnose Targets identifies holes in the coverage of a targetted experiment, but it only reports them doesn't list the actual missing loci

Solution
------
This commit implements an optional intervals file output listing the exact loci that did not pass filters

Itemized changes
--------------
   * Cache callable statuses (to avoid recalculation)
   * Add functionality to output missing intervals
   * Implement new tool to qualify the missing intervals (QualifyMissingIntervals) by gc content, size, type of missing coverage and origin (coding sequence, intron, ...)
2013-05-13 11:51:56 -04:00
Mauricio Carneiro 1466396a31 Diagnose target is outputting intervals out of order
Problem
-------
When the interval had no reads, it was being sent to the VCF before the intervals that just got processed, therefore violating the sort order of the VCF.

Solution
--------
Use a linked hash map, and make the insertion and removal all happen in one place regardless of having reads or not. Since the input is ordered, the output has to be ordered as well.

Itemized changes
--------------
   * Clean up code duplication in LocusStratification and SampleStratification
   * Add number of uncovered sites and number of low covered sites to the VCF output.
   * Add new VCF format fields
   * Fix outputting multiple status when threshold is 0 (ratio must be GREATER THAN not equal to the threshold to get reported)

[fixes #48780333]
[fixes #48787311]
2013-05-13 11:50:22 -04:00
Mark DePristo b4f482a421 NanoScheduled ActiveRegionTraversal and HaplotypeCaller
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
  -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
  -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
2013-05-13 11:09:02 -04:00
Mark DePristo 111e8cef0f Merge pull request #219 from broadinstitute/eb_rr_multisample_fix
Fix bug in Reduce Reads that arises in multi-sample mode.
2013-05-09 15:31:14 -07:00
Eric Banks 8b9c6aae3e Fix bug in Reduce Reads that arises in multi-sample mode.
* bitset could legitimately be in an unfinished state but we were trying to access it without finalizing.
  * added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode.
  * tests were easiest to implement as integration tests (this was a really complicated case).
2013-05-08 23:23:51 -04:00
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
sathibault d79b5f0931 Adding Convey HC-1 HMM acceleration 2013-05-08 11:01:20 -05:00
Eric Banks d242f1bba3 Secondary alignments were not handled correctly in IndelRealigner
* This is emerging now because BWA-MEM produces lots of reads that are not primary alignments
 * The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it
     was getting confused by these secondary alignments
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Eric Banks b53336c2d0 Added hidden mode for BQSR to force all read groups to be the same one.
* Very useful for debugging sample-specific issues
 * This argument got lost in the transition from BQSR v1 to v2
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Mark DePristo 2bcbdd469f leftAlignCigarSequentially now supports haplotypes with insertions and deletions where the deletion allele was previously removed by the leftAlignSingleIndel during it's cleanup phase. 2013-05-03 09:32:05 -04:00
Eric Banks 58424e56be Setting the reduce reads count tag was all wrong in a previous commit; fixing.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly.  Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.

The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly.  Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).

Also:
1. counts are now maintained as ints whenever possible.  Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
2013-04-30 13:45:42 -04:00
Guillermo del Angel 20d3137928 Fix for indel calling with UG in presence of reduced reads: When a read is long enough so that there's no reference context available, the reads gets clipped so that it falls again within the reference context range. However, the clipping is incorrect, as it makes the read end precisely at the end of the reference context coordinates. This might lead to a case where a read might span beyond the haplotype if one of the candidate haplotypes is shorter than the reference context (As in the case e.g. with deletions). In this case, the HMM will not work properly and the likelihood will be bad, since "insertions" at end of reads when haplotype is done will be penalized and likelihood will be much lower than it should.
-- Added check to see if read spans beyond reference window MINUS padding and event length. This guarantees that read will always be contained in haplotype.
-- Changed md5's that happen when long reads from old 454 data have their likelihoods changed because of the extra base clipping.
2013-04-29 19:33:02 -04:00
Mark DePristo 5dd73ba2d1 Merge pull request #198 from broadinstitute/mc_reduce_reads_ds_doc
Updates GATKDocs for ReduceReads downsampling
2013-04-27 05:49:47 -07:00
Mauricio Carneiro 76e997895e Updates GATKDocs for ReduceReads downsampling
[fixes #48258295]
2013-04-26 23:33:44 -04:00
Guillermo del Angel 4168aaf280 Add feature to specify Allele frequency priors by command line when calling variants.
Use case:
The default AF priors used (infinite sites model, neutral variation) is appropriate in the case where the reference allele is ancestral, and the called allele is a derived allele.
Most of the times this is true but in several population studies and in ancient DNA analyses this might introduce reference biases, and in some other cases it's hard to ascertain what the ancestral allele is (normally requiring to look up homologous chimp sequence).
Specifying no prior is one solution, but this may introduce a lot of artifactual het calls in shallower coverage regions.
With this option, users can specify what the prior for each AC should be according to their needs, subject to the restrictions documented in the code and in GATK docs.
-- Updated ancient DNA single sample calling script with filtering options and other cleanups.
-- Added integration test. Removed old -noPrior syntax.
2013-04-26 19:06:39 -04:00
Eric Banks 021adf4220 WTF - I thought we had disabled the randomized dithering of rank sum tests for integration tests?!
Well, it wasn't done so I went ahead and did so.  Lots of MD5 changes accordingly.
2013-04-26 11:24:05 -04:00
Eric Banks ba2c3b57ed Extended the allele-biased down-sampling functionality to handle reduced reads.
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.

Added lots of unit tests for new functionality.
2013-04-26 11:23:17 -04:00
Eric Banks 379a9841ce Various bug fixes for recent Reduce Reads additions plus solution implemented for low MQ reads.
1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly
got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites
use binomial prob and at high coverage sites revert to using the old base proportions.  Then we get the
best of both worlds.  As a note, coverage refers to just the individual base counts and not the entire pileup.

2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had
the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We
now use a PriorityQueue instead of a TreeSet to allow for such cases.

3. Each consensus keeps track of its own number of softclipped bases.  There was no reason that that number
should be shared between them.

4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now.  Don't lose that
information.  Maybe we'll decide to change this in the future, but for now we are conservative.

5. Also implemented various small performance optimizations based on profiling.

Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.
2013-04-24 18:18:50 -04:00
MauricioCarneiro 45fec382e7 Merge pull request #180 from broadinstitute/mc_diagnosetargets_missing_targets
DiagnoseTargets Global Refactor
2013-04-24 14:54:55 -07:00
Mauricio Carneiro 367f0c0ac1 Split class names into stratification and metrics
Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has it's own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric.

 Metrics are generalized by the Metric interface. (for generic access)
 Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)
2013-04-24 14:15:49 -04:00
Ryan Poplin 80131ac996 Adding the 1000G_phase1.snps.high_confidence callset to the GATK resource bundle for use in the April 2013 updated best practices. 2013-04-24 11:41:32 -04:00
Guillermo del Angel 2ab270cf3f Corner case fix to General Ploidy SNP likelihood model.
-- In case there are no informative bases in a pileup but pileup isn't empty (like when all bases have Q < min base quality) the GLs were still computed (but were all zeros) and fed to the exact model. Now, mimic case of diploid Gl computation where GLs are only added if # good bases > 0
-- I believe general case where only non-informative GLs are fed into AF calc model is broken and yields bogus QUAL, will investigate separately.
2013-04-23 21:13:18 -04:00
Mauricio Carneiro 8f8f339e4b Abstract class for the statistics
Addressing the code duplication issue raised by Mark.
2013-04-23 18:02:27 -04:00
Mauricio Carneiro 38662f1d47 Limiting access to the DT classes
* Make most classes final, others package local
    * Move to diagnostics.diagnosetargets package
    * Aggregate statistics and walker classes on the same package for simplified visibility.
    * Make status list a LinkedList instead of a HashSet
2013-04-23 14:01:43 -04:00
Ryan Poplin cb4ec3437a After debate reverting SW parameter changes temporarily while we explore global SW plans. 2013-04-23 13:32:06 -04:00
Mauricio Carneiro fdd16dc6f9 DiagnoseTargets refactor
A plugin enabled implementation of DiagnoseTargets

Summarized Changes:
-------------------
   * move argument collection into Thresholder object
   * make thresholder object private member of all statistics classes
   * rework the logic of the mate pairing thresholds
   * update unit and integration tests to reflect the new behavior
   * Implements Locus Statistic plugins
   * Extend Locus Statistic plugins to determine sample status
   * Export all common plugin functionality into utility class
   * Update tests accordingly

[fixes #48465557]
2013-04-22 23:53:10 -04:00
Mauricio Carneiro eb6308a0e4 General DiagnoseTargets documentation cleanup
* remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps.
   * add gatkdocs to all the parameters
   * clean up the logic on callable status a bit (still need to be re-worked into a plugin system)
   * update integration tests
2013-04-22 23:53:09 -04:00
Mauricio Carneiro b3c0abd9e8 Remove REF_N status from DiagnoseTargets
This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply.

[fixes #48442663]
2013-04-22 23:53:09 -04:00
Mauricio Carneiro 2b923f1568 fix for DiagnoseTargets multiple filter output
Problem
-------
Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval

Solution
--------
Only allow low median coverage check if there are reads

[fixes #48442675]
2013-04-22 23:53:09 -04:00
Mauricio Carneiro cf7afc1ad4 Fixed "skipped intervals" bug on DiagnoseTargets
Problem
-------
Diagnose targets was skipping intervals when they were not covered by any reads.

Solution
--------
Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of teh BAM file.

Summarized Changes
------------------
   * Outputs all intervals it iterates over, even if uncovered
   * Outputs leftover intervals in the end of the traversal
   * Updated integration tests

[fixes #47813825]
2013-04-22 23:53:09 -04:00
Mark DePristo be66049a6f Bugfix for CommonSuffixSplitter
-- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases.  In this case we'd eliminate the reference source vertex.  Fixed by checking for this condition and aborting the simplification
-- Update MD5s, including minor improvements
2013-04-21 19:37:01 -04:00
Mark DePristo f0e64850da Two sensitivity / specificity improvements to the haplotype caller
-- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC.  Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data
-- Reduce the min usable base qual to 8 by default in the HC.  In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL.  This is a bit too aggressive of a requirement, so I lowered it to 8.
-- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller

 NA12878 MEM chr20:10-11
 Name    VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
 branch  SNPS                  1216               0               2            194                        0
 branch  INDELS                 312               2              13             71                        7
 master  SNPS                  1214               0               4            194                        1
 master  INDELS                 309               2              16             71                       10

-- Update MD5s in the integration tests to reflect these two new changes
2013-04-17 12:32:31 -04:00
Eric Banks 5bce0e086e Refactored binomial probability code in MathUtils.
* Moved redundant code out of UGEngine
  * Added overloaded methods that assume p=0.5 for speed efficiency
  * Added unit test for the binomialCumulativeProbability method
2013-04-16 18:19:07 -04:00
Eric Banks df189293ce Improve compression in Reduce Reads by incorporating probabilistic model and global het compression
The Problem:
  Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
  regions in an exome) causes RR (with default settings) to consider it a variant region.  This
  seriously hurts compression performance.

The Solution:
  1. We now use a probabilistic model for determining whether we can create a consensus (in other
  words, whether we can error correct a site) instead of the old ratio threshold.  We calculate
  the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
  that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
  2. We also allow het compression globally, not just at known sites.  So if we cannot create a
  consensus at a given site then we try to perform het compression; and if we cannot perform het
  compression that we just don't reduce the variant region.  This way very wonky regions stay
  uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
  locus get het compressed.

Details:
  1. -minvar is now deprecated in favor of -min_pvalue.
  2. Added integration test for bad pvalue input.
  3. -known argument still works to force het compression only at known sites; if it's not included
     then we allow het compression anywhere.  Added unit tests for this.
  4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
     Before finalizing het compression, we now check for insertions or other variant regions (usually due
     to multi-allelics) which can render a region incompressible (and we back out if we find one).  We
     were checking for excessive softclips before, but now we add these tests too.
  5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
     consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
     out instead of backing out all of them.
  6. We no longer create a mini read at the stop of the variant window for het compression.  Instead, we
     allow it to be part of the next global consensus.
  7. The coverage test is no longer run systematically on all integration tests because the quals test
     supercedes it.  The systematic quals test is now much stricter in order to catch bugs and edge cases
     (very useful!).
  8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
     for a consensus was affected by good and bad bases/reads).
  9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
     This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
  10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
     with insertions from a header.

Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
2013-04-16 18:19:06 -04:00
Ryan Poplin e0dfe5ca14 Restore the read filter function in the HaplotypeCaller. 2013-04-16 12:01:30 -04:00
Geraldine Van der Auwera e176fc3af1 Merge pull request #159 from broadinstitute/md_bqsr_ion
Trivial BQSR bug fixes and improvement
2013-04-16 08:54:47 -07:00
Mark DePristo 067d24957b Select the haplotypes we move forward for genotyping per sample, not pooled
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples.  This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes.  The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects.  After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model.  It's worse than the current version in both single and multiple samples:

1000G EUR samples:

10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                    50               0               5              8                        1
per_sample  INDELS                   6               0               7              2                        1
pooled      SNPS                    49               0               6              8                        1
pooled      INDELS                   5               0               8              2                        1

100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                   144               0              22             28                        1
per_sample  INDELS                  28               1              16              9                       11
pooled      SNPS                   143               0              23             28                        1
pooled      INDELS                  27               1              17              9                       11

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
2013-04-16 09:42:03 -04:00
Mark DePristo 5a74a3190c Improvements to the VariantRecalibrator R plots
-- VariantRecalibrator now emits plots with denormlized values (original values) instead of their normalized (x - mu / sigma) which helps to understand the distribution of values that are good and bad
2013-04-16 09:09:51 -04:00
Mark DePristo 564fe36d22 VariantRecalibrator's VQSR.vcf now contains NEG/POS labels
-- It's useful to know which sites have been used in the training of the model.  The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE
-- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels
2013-04-16 09:09:47 -04:00
Mauricio Carneiro 9bfa5eb70f Quick optimization to the PairHMM
Problem
--------
the logless HMM scale factor (to avoid double under-flows) was 10^300. Although this serves the purpose this value results in a complex mantissa that further complicates cpu calculations.

Solution
---------
initialize with 2^1020 (2^1023 is the max value), and adjust the scale factor accordingly.
2013-04-14 23:25:33 -04:00
Mark DePristo 3144eae51c UnifiedGenotyper bugfix: don't create haplotypes with 0 bases
-- The PairHMM no longer allows us to create haplotypes with 0 bases.  The UG indel caller used to create such haplotypes.  Now we assign -Double.MAX_VALUE likelihoods to such haplotypes.
-- Add integration test to cover this case, along with private/testdata BAM
-- [Fixes #47523579]
2013-04-13 14:57:55 -04:00
Mark DePristo 0e627bce93 Slight update to Path SW parameters.
-- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.
2013-04-12 12:43:52 -04:00
Mark DePristo 50cdffc61f Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:

ref:               ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

in which the alt and ref haplotypes differ in having indel at both the start and end of the bubble.  The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in the either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble.

ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls.  I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.

-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat sysmatically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all

General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly.  Depreciated old call to inline constants.  This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default.  Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
2013-04-11 18:22:55 -04:00
Mark DePristo 74196ff7db Trivial BQSR bug fixes and improvement
-- Ensure that BQSR works properly for an Ion Torrent BAM.  (Added integration test and bam)
-- Improve the error message when a unknown platform is found (integration test added)
2013-04-11 17:08:35 -04:00
Ryan Poplin a507381a33 Updating BQSR RecalibrationEngine to work correctly with empty BQSR tables.
-- Previously would crash when a scatter/gather interval contained no usable data.
-- Added unit test to cover this case.
2013-04-11 16:27:59 -04:00
Mark DePristo fb86887bf2 Fast algorithm for determining which kmers are good in a read
-- old algorithm was O(kmerSize * readLen) for each read.  New algorithm is O(readLen)
-- Added real unit tests for the addKmersFromReads to the graph.  Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.
2013-04-11 09:54:22 -04:00
Mark DePristo bf42be44fc Fast DeBruijnGraph creation using the kmer counter
-- The previous creation algorithm used the following algorithm:

for each kmer1 -> kmer2 in each read
  add kmers 1 and 2 to the graph
  add edge kmer1 -> kmer2 in the graph, if it's not present (does check)
  update edge count by 1 if kmer1 -> kmer2 already existed in the graph

-- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)).  This is actually pretty expensive because get and add edges is expensive in jgrapht.
-- The new approach uses the following algorithm:

for each kmer1 -> kmer2 in each read
  add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap

for each kmer pair 1 and 2 in the hash counter
  add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map
  update edge count by count from map if kmer1 -> kmer2 already existed in the graph

-- This algorithm ensures that we add very much fewer edges
-- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself.
-- Overall runtimes are greatly reduced using this algorith
2013-04-10 17:10:59 -04:00
Mark DePristo b115e5c582 Critical bugfix for CommonSuffixSplitter to avoid infinite loops
-- The previous version would enter into an infinite loop in the case where we have a graph that looks like:

X -> A -> B
Y -> A -> B

So that the incoming vertices of B all have the same sequence.  This would cause us to remodel the graph endless by extracting the common sequence A and rebuilding exactly the same graph.  Fixed and unit tested

-- Additionally add a max to the number of simplification cycles that are run (100), which will throw an error and write out the graph for future debugging.  So the GATK will always error out, rather than just go on forever
-- After 5 rounds of simplification we start keeping a copy of the previous graph, and then check if the current graph is actually different from the previous graph.  Equals here means that all vertices have equivalents in both graphs, as do all edges.  If the two graphs are equal we stop simplifying.  It can be a bit expensive but it only happens when we end up cycling due to the structure of the graph.
-- Added a unittest that goes into an infinite loop (found empirically in running the CEU trio) and confirmed that the new approach aborts out correctly
-- #resolves GSA-924
-- See https://jira.broadinstitute.org/browse/GSA-924 for more details
-- Update MD5s due to change in assembly graph construction
2013-04-09 16:19:26 -04:00
Mark DePristo 51954ae3e5 HaplotypeCaller doesn't support EXACT_GENERAL_PLOIDY model
-- HC now throws a UserException if this model is provided.  Documented this option as not being supported in the HC in the docs for EXACT_GENERAL_PLOIDY
2013-04-09 15:18:42 -04:00
Mark DePristo 33ecec535d Turn off the LD merging code by default
-- It's just too hard to interpret the called variation when we merge variants via LD.
-- Can now be turned on with -mergeVariantsViaLD
-- Update MD5s
2013-04-09 10:08:06 -04:00
Mark DePristo 21410690a2 Address reviewer comments 2013-04-08 12:48:20 -04:00