Commit Graph

435 Commits (aab160372a63bcca13c8e8740f0351a3cf5b164a)

Author SHA1 Message Date
Mark DePristo 85b529cced Updating MD5s in HC and UG that changed due to new LIBS
-- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in input list.  Not ideal)
-- Trivial read count differences in HC
2013-01-11 15:17:19 -05:00
Mark DePristo 8b83f4d6c7 Near final cleanup of PileupElement
-- All functions documented and unit tested
-- New constructor interface
-- Cleanup some uses of old / removed functionality
2013-01-11 15:17:17 -05:00
Mark DePristo fb9eb3d4ee PileupElement and LIBS cleanup
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
2013-01-11 15:17:17 -05:00
Mark DePristo cc1d259cac Implement get Length and Bases of OfImmediatelyFollowingIndel in PileupElement
-- Added unit tests for this behavior.  Updated users of this code
2013-01-11 15:17:17 -05:00
Mark DePristo 2c38310868 Create LIBS using new AlignmentStateMachine infrastructure
-- Optimizations to AlignmentStateMachine
-- Properly count deletions.  Added unit test for counting routines
-- AlignmentStateMachine.java is no longer recursive
-- Traversals now use new LIBS, not the old one
2013-01-11 15:17:17 -05:00
Mark DePristo b53286cc3c HaplotypeCaller mode to skip assembly and genotyping for performance testing
-- Added HCPerformance evaluation Qscript
-- Added some docs about one of the HC integration tests
-- HaplotypeCaller / ART performance evaluation script
2013-01-11 15:17:16 -05:00
Ryan Poplin e952296c10 Adding HC GGA integration test to cover duplicated input alleles. 2013-01-11 15:01:27 -05:00
Ryan Poplin 7f7f40f851 Adding additional HC GGA integration tests to cover more complicated input alleles. 2013-01-11 14:36:21 -05:00
Eric Banks 85baf71b39 Merged bug fix from Stable into Unstable 2013-01-11 11:05:27 -05:00
Eric Banks d78539774f Another RR bug: off by one error led to ArrayIndexOutOfBoundsException when working with multiple samples and the variant region ended 1 base after the end of the last read for a given sample. 2013-01-11 11:05:09 -05:00
Eric Banks 79b93f659c Merged bug fix from Stable into Unstable 2013-01-11 09:20:13 -05:00
Eric Banks 67fafbb625 Forgot an include 2013-01-11 09:19:46 -05:00
Eric Banks 6bf0cc32f9 When reducing multiple samples it is possible to try to close a region that for a given sample has no reads. Currently we'd NPE. Fixed. 2013-01-11 09:16:19 -05:00
Eric Banks e7906713d9 Moving some random walkers back to public as requested by Mark. Mauricio will the licenses get updated automatically? 2013-01-11 02:03:43 -05:00
Eric Banks 3a51823c2a Clean up imports 2013-01-10 23:35:01 -05:00
Eric Banks e4b7b1955c Forgot to add the note about length normalization to the QD docs 2013-01-10 23:34:06 -05:00
Eric Banks ff5ac986d8 Fix docs for QD 2013-01-10 23:31:46 -05:00
Mauricio Carneiro 2a4ccfe6fd Updated all JAVA file licenses accordingly
GSATDG-5
2013-01-10 17:06:41 -05:00
Mauricio Carneiro dd177b1714 Removing fully commented out varianteval evaluators
- Files were completely commmented out, and were screwing up my license script. Dont like them. Removed them.

GSATDG-5
2013-01-10 17:06:12 -05:00
Chris Hartl 80dec72c53 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-10 14:35:59 -05:00
Chris Hartl 31a5f88c4f Expanded unit tests to cover the Concordance Metrics class fairly uniformly. 2013-01-10 14:33:47 -05:00
Ryan Poplin 1a18947abf Adding new command line argument requested on the forum to control the maximum number of haplotypes that are sent forward for genotyping. In the presence of a large degree of heterozygosity the current algorithm breaks down and so this argument would need to be increased. 2013-01-09 15:54:02 -05:00
Ryan Poplin 487fb2afb4 Bug fix for the case of overlapping assembled and partially-assembled events created by the HC. Unfortunately the symbolic allele can't be combined with the indel allele because the reference basis will change. 2013-01-09 15:30:46 -05:00
Chris Hartl 6787f86803 Eliminate the import of DiploidGenotype, which switched public/private underneath me but for some reason didn't stop me from compiling... 2013-01-09 13:23:24 -05:00
Chris Hartl c1de92b511 Add in some todo items 2013-01-09 13:16:06 -05:00
Chris Hartl 8d126161e2 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-09 13:15:04 -05:00
Eric Banks 3a0dd4b175 Oops, I broke the build. NOW we shouldn't have any more public->protected dependancies. 2013-01-09 11:12:28 -05:00
Eric Banks a921b06e02 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-09 11:06:17 -05:00
Eric Banks 4fa439d89e Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependancies now 2013-01-09 11:06:10 -05:00
Ryan Poplin 396bce1f28 Reverting this change until we can figure out the right thing to do here. 2013-01-09 10:51:30 -05:00
Eric Banks 676e79542a Bring CombineVariants back to public since it's used for SG. I needed to break ChromosomeCountConstants out of ChromosomeCounts to make this work. 2013-01-09 10:39:48 -05:00
Ryan Poplin c87ad8c0ef Bug fixes related to HC's GGA mode. Tracking just the artificial allele isn't sufficient when there are multiple GGA records that change the reference basis. Also, duplicated records screw up the tracking of merged alleles. 2013-01-09 10:00:46 -05:00
Chris Hartl ad7c2a08d4 Normalize by the event type counts, not the total genotype counts: more useful normalization. 2013-01-09 09:12:41 -05:00
Chris Hartl b56754606b Initial break-out of GenotypeConcordance as a standalone walker. Some basic functionality testing. Currently performs only a pairwise comparison, but is very careful about proper tabulation through the GenotypeType enum. 2013-01-09 00:34:07 -05:00
Eric Banks 264cc9e78d Resolve protected->public dependencies for BQSR by wrapping the BQSR-specific arguments in a new class.
Instead of the GATK Engine creating a new BaseRecalibrator (not clean), it just keeps track of the arguments (clean).

There are still some dependency issues, but it looks like they are related to Ami's code.  Need to look into it further.
2013-01-08 16:23:29 -05:00
Eric Banks ee7d85c6e6 Move around the DiploidGenotype classes (so it can be used by the GATKPaperGenotyper) 2013-01-08 15:53:11 -05:00
Eric Banks 0e2e672521 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-08 15:46:39 -05:00
Eric Banks f0bd1b5ae5 Okay, all public->protected dependencies are gone except for the BQSR arguments. I'll need to think through this but should be able to make that work too. 2013-01-08 15:46:32 -05:00
Tad Jordan 9cbb2b868f ErrorRatePerCycleIntegrationTest fix
-- sorting by row is required
2013-01-08 14:53:07 -05:00
Eric Banks b099e2b4ae Moving integration tests to protected 2013-01-08 09:34:08 -05:00
Eric Banks dfe4cf1301 When merging the PerReadAlleleLikelihoodMap classes, I forgot to initialize the underlying objects. This was causing the LargeScaleTests to fail. 2013-01-08 09:24:12 -05:00
Eric Banks 9e6c2afb28 Not sure why IntelliJ didn't add this for commit like the other dirs 2013-01-07 18:11:07 -05:00
Ami Levy-Moonshine 3787ee6de7 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-07 17:07:29 -05:00
Eric Banks 47d030a52d Oops, move the covariates over too 2013-01-07 15:47:25 -05:00
Eric Banks 35699a8376 Move bqsr utils to protected 2013-01-07 15:41:21 -05:00
Eric Banks a0219acfaa Collapse the PerReadAlleleLikelihoodMap classes into 1 now that Lite is gone 2013-01-07 14:55:21 -05:00
Eric Banks 35d9bd377c Moved (nearly) all Walkers from public to protected and removed GATKLite utils 2013-01-07 14:42:40 -05:00
Ryan Poplin 4f95f850b3 Bug fix in the HC's allele mapping for multi-allelic events. Using the allele alone as a key isn't sufficient because alleles change when the reference allele changes during VariantContextUtils.simpleMerge for multi-allelic events. 2013-01-07 11:05:44 -05:00
Ami Levy-Moonshine d3c2c97fb2 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-06 23:35:47 -05:00
Ami Levy-Moonshine 81eef3aa37 merge development branchs of log-less HMM and FastGatherer to master 2013-01-06 23:01:58 -05:00
Eric Banks 52067f0549 Handle merge conflicts 2013-01-06 12:29:12 -05:00
Chris Hartl 41bc416b65 Remove AAL and update MD5s. 2013-01-04 16:46:14 -05:00
Eric Banks bce6fce58d Resolving merge conflicts after Mark's latest push 2013-01-04 14:46:39 -05:00
Eric Banks dd7f5e2be7 Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests. 2013-01-04 14:43:11 -05:00
Mark DePristo bbdf9ee91b BQSR cleanup: merge Advanced and Standard recalibration engine into just the RecalibrationEngine
-- As we are no longer maintaining a public/protected system we need only have one RecalibrationEngine.
-- Misc. code cleanup and docs along the way
2013-01-04 11:39:24 -05:00
Mark DePristo 7df47418d8 BQSR optimization: make RecalibrationTables thread-local, and merge results in onTraversalDone
-- With the newer, faster BQSR, scaling was limited by the NestedIntegerArray.  The solution to this is to make the entire table thread-local, so that each nct thread has its own data and doesn't have any collisions.
-- Removed the previous partial solution of having a thread-local quality score table
-- Added a new argument -lowMemory
2013-01-04 11:39:24 -05:00
Chris Hartl 3753209584 One md5sum slipped past in the HC integration test. 2013-01-02 15:09:28 -05:00
Chris Hartl e1d09ab0db QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests *should* all pass. 2013-01-02 14:41:29 -05:00
Eric Banks 275575462f Protect against non-standard ref bases. Ryan, please review. 2012-12-26 15:46:21 -05:00
Mark DePristo 7bf1f67273 BQSR optimization: read group x quality score calibration table is thread-local
-- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table.  Radically reduces the contention for RecalDatum in this very highly used table
-- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables.  Updated combine in RecalibrationReport to use this table combiner function
-- Made several core functions in RecalDatum into final methods for performance
-- Added RecalibrationTestUtils, a home for recalibration testing utilities
2012-12-24 13:35:58 -05:00
Mark DePristo 0f0188ddb1 Optimization of BQSR
-- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine.  This object has a restrictive interface to just get information about specific qual and error values at offset and for event type.  This restrict allows us to avoid creating an vector of byte 45 for each read to represent BI and BD values not in the reads.  Shaves 5% of the runtime off the entire code.
-- Cleaned up code and added lots more docs
-- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR.  95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.
2012-12-24 13:35:09 -05:00
Ami Levy-Moonshine 6590039bc3 add fast gather to UG; change UG to work with log-lessHMM (work in prograss) 2012-12-20 14:58:57 -05:00
Ryan Poplin c8cd6ac465 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-20 14:58:04 -05:00
Ryan Poplin a098888f4d Updating missed UG md5 2012-12-20 14:57:53 -05:00
Tad Jordan b491c177ff Added functionality of outputting sorted GATKReport Tables
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
2012-12-20 14:02:21 -05:00
Eric Banks 4a7e0427a3 Pushing the RR bug fix that I puished into unstable into stable, as requested by Tim 2012-12-19 11:47:16 -05:00
Ryan Poplin 54e5c84018 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-19 11:31:40 -05:00
Ryan Poplin aa39037be8 updating UG integration tests. 2012-12-19 11:31:35 -05:00
Eric Banks 70479cb71d RR bug fix: we were failing when a read started with an insertion just at the edge of the consensus region.
The weird part is that the comments claimed it was doing what it was supposed to, but it didn't actually do it.
Now we maintain the last header element of the consensus (but without bases and quals) if it adjoins an element with an insertion.

Added the user's test file as an integration test.
2012-12-19 10:59:07 -05:00
David Roazen 07b369ca7e Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.

-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)

-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)

-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
2012-12-19 10:25:22 -05:00
Ryan Poplin 92185dd5f4 updating HC integration tests. 2012-12-19 10:12:07 -05:00
Ryan Poplin 902ca7ea70 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-18 15:45:33 -05:00
Ryan Poplin b5d590ba92 Based on NA12878 knowledge base experiments updating HC to allow for a much smaller minimum kmer length in the assembly graph. 2012-12-18 15:43:56 -05:00
Mark DePristo a481d006f0 Optimizations for applying BQSR table with PrintReads
-- Cleaned up code in updateDataForRead so that constant values where not computed in inner loops
-- BaseRecalibrator doesn't create it's own fasta index reader, it just piggy backs on the GATK one
-- ReadCovariates <init> now uses a thread local cache for it's int[][][] keys member variable.  This stops us from recreating an expensive array over and over.  In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values its skipping because of low quality base clipping.  Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.
2012-12-17 16:47:27 -05:00
Mark DePristo 5ec25797b3 Optimizations for BaseRecalibrator
-- No longer computes at each update the overall read group table.  Now computes this derived table only at the end of the computation, using the ByQual table as input.  Reduces BQSR runtime by 1/3 in my test
2012-12-17 16:47:27 -05:00
Ryan Poplin 98f18b5f9e Changing the HC over to using the non-contamination-downsampled read maps for the purposes of annotations. This behavior now matches the UG. There is a new command line option to go back to the older behavior to explore the differences. 2012-12-17 11:27:44 -05:00
Mauricio Carneiro 5f1afb4136 Fixing an off-by-one clipping error in ReduceReads for reads off the contig
Reads that are soft-clipped off the contig (before the beginning of the contig) were being soft-clipped to position 0 instead of 1 because of an off-by-one issue. Fixed and included in the integration test.
2012-12-13 22:10:11 -05:00
Mauricio Carneiro 74344a3871 Bringing in the changes from the CMI repo 2012-12-13 21:59:37 -05:00
Mark DePristo aeab932c63 Actual working version of unflushing VCFWriter
-- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
2012-12-13 16:15:08 -05:00
Ryan Poplin 211a6e78ea Further related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed. 2012-12-12 14:53:02 -05:00
Mauricio Carneiro 33290bfe0c Added integration test to catch the read off contig in ReduceReads.
So upstream changes won't break it again.
2012-12-12 13:49:54 -05:00
Mark DePristo 5632c13bf2 Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush().
-- Now compressed output VCFs are properly blocked compressed (i.e., they are actually smaller than the uncompressed VCF)
2012-12-12 10:27:07 -05:00
Mark DePristo dd52a70d45 Fix AFCalcResult unit test
-- I was simply passing in the wrong values into the function.  Fixed the calls, and expanded the docs on what needs to be passed in.
2012-12-11 10:40:12 -05:00
Mauricio Carneiro 8a115edbaf ReduceReads is now scattered by contig
It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.
2012-12-10 15:25:27 -05:00
Eric Banks bdda63d973 Related bug fixes to GGA mode in the HC: some variants (especially MNPs) were causing problems because they don't have to start at the current location to match the allele being genotyped. Fixed. 2012-12-10 14:47:04 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Eric Banks 406adb8d44 The allele biased downsampling should not abort if there's a reduced read. Rather it should always keep the RR and downsample only original reads in the pileup. 2012-12-05 23:15:36 -05:00
Mark DePristo d0cab795b7 Got caught in the middle of a bad integration test, that was fixed in independent push. Moved test bam into testdata. 2012-12-05 14:49:22 -05:00
Eric Banks ef87b18e09 In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then). 2012-12-05 02:00:35 -05:00
Eric Banks 726332db79 Disabling the testNoCmdLineHeaderStdout test in UG because it keeps crashing when I run it locally 2012-12-05 00:54:00 -05:00
Eric Banks bca860723a Updating tests to handle bad validation data files (that used the wrong qual score encoding); overrides push from stable. 2012-12-03 22:01:07 -05:00
Ryan Poplin d5ed184691 Updating the HC integration test md5s. According to the NA12878 knowledge base this commit cuts down the FP rate by more than 50 percent with no loss in sensitivity. 2012-12-03 15:38:59 -05:00
Ryan Poplin 156d6a5e0b misc minor bug fixes to GenotypingEngine. 2012-12-03 12:47:35 -05:00
Ryan Poplin 18b002c99c Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 10:08:56 -05:00
Ryan Poplin 1bdf17ef53 Reworking of how the likelihood calculation is organized in the HaplotypeCaller to facilitate the inclusion of per allele downsampling. We now use the downsampling for both the GL calculations and the annotation calculations. 2012-12-02 11:58:32 -05:00
Mark DePristo 2849889af5 Updating md5 for UG 2012-12-01 14:24:19 -05:00
Joel Thibault 198923b597 Add ActiveRegionReadState handling 2012-11-28 13:59:57 -05:00
Mark DePristo c676853731 Merged bug fix from Stable into Unstable. Updating md5s
Conflicts:
	protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java
2012-11-28 12:54:36 -05:00
Mark DePristo a1d6461121 Critical bugfix to AFCalcResult affecting UG/HC quality score emission thresholds
As reported by Menachem Fromer: a critical bug in AFCalcResult:

Specifically, the implementation:
    public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) {
        return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef;
    }

seems incorrect and should probably be:

getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef

The issue here is that the 30 represents a Phred-scaled probability of *error* and it's currently being compared to a log probability of *non-error*.

Instead, we need to require that our probability of error be less than the error threshold.
This bug has only a minor impact on the calls -- hardly any sites change -- which is good.  But the inverted logic effects multi-allelic sites significantly.  Basically you only hit this logic with multiple alleles, and in that case it'\s including extra alt alleles incorrectly, and throwing out good ones.

Change was to create a new function that properly handles thresholds that are PhredScaled quality scores:

    /**
     * Same as #isPolymorphic but takes a phred-scaled quality score as input
     */
    public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) {
        if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 ");
        final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual));
        return isPolymorphic(allele, log10Threshold);
    }
2012-11-28 12:08:02 -05:00