Commit Graph

12427 Commits (df488dbd4923f75a88efd5eccab7b902f2a4f72e)

Author SHA1 Message Date
Mark DePristo df488dbd49 Merge pull request #259 from broadinstitute/md_discovar
Converter for Discovar to VCF; bugfix for AssessNA12878
2013-06-05 13:09:18 -07:00
Mark DePristo 95376908e5 Converter script from discovar variants files to VCF 2013-06-05 15:34:40 -04:00
Mark DePristo 5a54eac57d Bugfix for AssessNA12878
-- Add GT field to bad sites VCF by default, as we write out GT even if the input VCF is sites only
2013-06-05 15:34:23 -04:00
MauricioCarneiro eaebba5ba1 Merge pull request #257 from broadinstitute/md_unclip_reads_over_contig
Bugfix for HaplotypeCaller error: Only one of refStart or refStop must b...
2013-06-04 08:01:30 -07:00
Mark DePristo e19c24f3ee Bugfix for HaplotypeCaller error: Only one of refStart or refStop must be < 0, not both
-- This occurred because reverting soft clips could produce reads with negative (or 0) alignment starts.  From such reads we could end up with adaptor starts that were negative, which would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which reverts and adaptor-clips reads).
-- We now hard clip away reverted soft-clipped bases that fall before the 1-based contig start in revertSoftClippedBases.
-- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString)
-- Added unit tests for reverting soft clipped bases that create a read before the contig
-- [delivers #50892431]
2013-06-04 10:33:46 -04:00
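The clipping fix above can be sketched as follows. This is a minimal illustration, not the actual GATK ReadClipper code: the class and method names are invented for the example, and the real logic operates on CIGAR elements rather than bare integers.

```java
// Sketch of the idea in the commit above: after reverting leading soft clips,
// any bases whose positions would fall before the 1-based contig start must be
// hard-clipped away, so the alignment start never becomes zero or negative.
public class RevertSoftClipsSketch {
    /**
     * Returns the alignment start after reverting numSoftClipped leading
     * soft-clipped bases, hard-clipping any that precede position 1.
     */
    public static int revertedAlignmentStart(int alignmentStart, int numSoftClipped) {
        int reverted = alignmentStart - numSoftClipped;
        // Bases before the 1-based contig start cannot be represented; clip them away.
        return Math.max(reverted, 1);
    }
}
```

For example, a read starting at position 3 with 5 leading soft clips would naively start at -2; the two bases that precede the contig are hard-clipped instead.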
Mark DePristo a0817b696b Merge pull request #256 from broadinstitute/md_hc_assembly_bug
Bugfix for HC can fail to assemble the correct reference sequence in som...
2013-06-03 13:58:41 -07:00
Mark DePristo c9f5b53efa Bugfix for HC can fail to assemble the correct reference sequence in some cases
-- Ultimately this was caused by overly aggressive merging of CommonSuffixMerger.  In the case where you have this graph:

ACT [ref source] -> C
G -> ACT -> C

we would merge into

G -> ACT -> C

which would linearize into

GACTC

Causing us to add bases to the reference source node that couldn't be recovered.  The solution was to ensure that CommonSuffixMerger only operates when none of the nodes to be merged is itself a source node.

-- Added a convenient HaplotypeCaller argument (captureAssemblyFailureBAM) that writes the exact reads that went into a failed assembly run out to a BAM file (AssemblyFailure.BAM).  This can be used to rerun the HaplotypeCaller and reproduce the exact error, which is otherwise hard in regions of deep coverage, where the downsampler state determines the exact reads going into assembly, so running on a sub-interval may not reproduce the error
-- Did some misc. cleanup of code while debugging
-- [delivers #50917729]
2013-06-03 16:16:39 -04:00
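The guard described in the commit above can be sketched as a single check. This is an illustrative toy, not the real CommonSuffixMerger: the `Vertex` class and `safeToMerge` name are invented for the example.

```java
import java.util.List;

// Sketch of the fix above: a common-suffix merge is only safe when none of the
// vertices to be merged is itself a source node, so bases belonging to the
// reference source are never absorbed into a merged vertex and lost.
public class CommonSuffixGuardSketch {
    public static class Vertex {
        final String seq;
        final boolean isSource;
        public Vertex(String seq, boolean isSource) {
            this.seq = seq;
            this.isSource = isSource;
        }
    }

    /** The guard: merging proceeds only if no candidate vertex is a source. */
    public static boolean safeToMerge(List<Vertex> toMerge) {
        for (Vertex v : toMerge)
            if (v.isSource)
                return false;
        return true;
    }
}
```

In the graph from the commit message, the ref-source `ACT` node would make `safeToMerge` return false, preventing the bad `G -> ACT -> C` collapse.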
Mark DePristo a05c543728 Merge pull request #255 from broadinstitute/rp_gga_mode_kmer_function
Break out the GGA kmers and the read kmers into separate functions for t...
2013-06-03 11:30:34 -07:00
Ryan Poplin ab40f4af43 Break out the GGA kmers and the read kmers into separate functions for the DeBruijn assembler.
-- Added unit test for new function.
2013-06-03 14:00:35 -04:00
Ryan Poplin 21334e728d Merge pull request #252 from broadinstitute/md_bqsr_index_out_of_bounds
Make BQSR calculateIsIndel robust to indel CIGARs at start/end of read
2013-06-03 07:13:00 -07:00
MauricioCarneiro 1143c12c44 Merge pull request #254 from broadinstitute/md_bad_merge
Fix error in merging code in HC
2013-05-31 13:53:14 -07:00
Mark DePristo 6555361742 Fix error in merging code in HC
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper.  The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H.  This has been fixed and unit tested.
-- Updated 1 ReduceReads MD5, due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S, and the revert-soft-clips code was failing to properly revert the bases.
-- delivers #50655421
2013-05-31 16:29:29 -04:00
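The cigar change described above (10H2S5H => 10H2M5H) amounts to turning soft-clip elements back into matches. A toy string transform makes the before/after concrete; the real ReadClipper works on CIGAR element lists, not strings, so this sketch is purely illustrative.

```java
// Sketch of the revert described in the commit above: reverting soft clips
// converts S (soft clip) operators to M (match), e.g. 10H2S5H -> 10H2M5H.
// This is a toy over cigar strings, not the actual read clipper code.
public class RevertCigarSketch {
    public static String revertSoftClips(String cigar) {
        // In single-letter CIGAR operator notation, 'S' becomes 'M' on revert.
        return cigar.replace('S', 'M');
    }
}
```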
MauricioCarneiro 749f53d60e Merge pull request #253 from broadinstitute/md_bad_compress
Check that -compress arguments are within range 0-9
2013-05-31 12:40:01 -07:00
Mark DePristo 4b206a3540 Check that -compress arguments are within range 0-9
-- Although the original bug report was about SplitSamFile, it was actually an engine-wide error.  The two places in the engine that provide compression to the BAM writer now check the validity of the -compress argument via a static method in ReadUtils
-- delivers #49531009
2013-05-31 15:29:02 -04:00
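The range check above is simple to illustrate. The method name below is invented (the commit says only that a static method was added to ReadUtils); BAM compression levels are zlib deflate levels, which is why the valid range is 0-9.

```java
// Minimal sketch of the engine-wide validation described above: BAM
// compression levels are zlib levels and must lie in [0, 9].
public class CompressionCheckSketch {
    public static int validateCompressionLevel(int compress) {
        if (compress < 0 || compress > 9)
            throw new IllegalArgumentException(
                    "-compress must be within the range 0-9, got " + compress);
        return compress;
    }
}
```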
Mark DePristo 64b4d80729 Make BQSR calculateIsIndel robust to indel CIGARs at start/end of read
-- The previous implementation attempted to be robust to this, but not all cases were handled properly.  Added a helper function updateIndel() that bounds the update to the range of the indel array, and cleaned up the logic of how the method works.  The previous behavior was inconsistent across read fwd/rev strand: indel cigars at the end of a read were put at the start if the read was on the forward strand, but not if it was on the reverse strand.  Everything is now consistent, as can be seen in the symmetry of the unit tests:

        tests.add(new Object[]{"1D3M",   false, EventType.BASE_DELETION, new int[]{0,0,0}});
        tests.add(new Object[]{"1M1D2M", false, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"2M1D1M", false, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"3M1D",   false, EventType.BASE_DELETION, new int[]{0,0,1}});

        tests.add(new Object[]{"1D3M",   true, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"1M1D2M", true, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"2M1D1M", true, EventType.BASE_DELETION, new int[]{0,0,1}});
        tests.add(new Object[]{"3M1D",   true, EventType.BASE_DELETION, new int[]{0,0,0}});

        tests.add(new Object[]{"4M1I",   false, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"3M1I1M", false, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"2M1I2M", false, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
        tests.add(new Object[]{"1M1I3M", false, EventType.BASE_INSERTION, new int[]{1,0,0,0,0}});
        tests.add(new Object[]{"1I4M",   false, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});

        tests.add(new Object[]{"4M1I",   true, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
        tests.add(new Object[]{"3M1I1M", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,1}});
        tests.add(new Object[]{"2M1I2M", true, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"1M1I3M", true, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"1I4M",   true, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});

-- delivers #50445353
2013-05-31 13:58:37 -04:00
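The bounded update described above can be sketched in a few lines. This is an assumption-laden illustration: the commit names a helper (rendered here as updateIndel) but does not show its body, so the signature below is invented.

```java
// Sketch of the bounding helper described in the commit above: clamp writes to
// the indel array so that CIGAR indels at the very start or end of a read can
// never index outside the array.
public class BoundedIndelUpdateSketch {
    public static void updateIndel(int[] indel, int pos) {
        // Silently drop updates that fall outside the read's indel array,
        // which is what makes start/end-of-read indel CIGARs safe.
        if (pos >= 0 && pos < indel.length)
            indel[pos]++;
    }
}
```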
Eric Banks a96f48bc39 Merge pull request #249 from broadinstitute/rp_hc_gga_mode
New implementation of the GGA mode in the HaplotypeCaller
2013-05-31 10:54:50 -07:00
droazen a665d759cd Merge pull request #251 from broadinstitute/md_mapq_reassign
Command-line read filters are now applied before Walker default filters
2013-05-31 09:05:24 -07:00
David Roazen ed4f19d79b Restore scala compilation by default in build.xml
-This was accidentally clobbered in a recent commit.

-If you want to compile Java-only, the easiest thing to
 do is run "ant gatk" rather than modifying build.xml
2013-05-31 11:28:29 -04:00
Ryan Poplin b5b9d745a7 New implementation of the GGA mode in the HaplotypeCaller
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
2013-05-31 10:35:36 -04:00
Eric Banks 50b4c130ca Merge pull request #250 from broadinstitute/chartl_mathutils_checks_for_valid_inputs-GSA-767
MathUtils checks for argument values GSA-767
2013-05-30 21:58:33 -07:00
Chris Hartl 199476eae1 Three squashed commits:
1) Add checks for input parameters in MathUtils methods. I was careful to use the bottom-level methods whenever possible, so that parameters don't needlessly go through multiple checks (so for instance, the parameters n and k for a binomial aren't checked in log10binomial, but rather in the log10binomialcoefficient subroutine).

This addresses JIRA GSA-767

Unit tests pass (we'll let bamboo deal with the integrations)

2) Address reviewer comments (change UserExceptions to IllegalArgumentExceptions).

3) .isWellFormedDouble() tests for infinity and not strictly positive infinity. Allow negative-infinity values for log10sumlog10 (as these just correspond to p=0).

After these commits, unit and integration tests now pass, and GSA-767 is done.

rebase and fix conflict:

public/java/src/org/broadinstitute/sting/utils/MathUtils.java
2013-05-31 00:26:50 -04:00
Mark DePristo b16de45ce4 Command-line read filters are now applied before Walker default filters
-- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 *before* the BQSR filters them out with MappingQualityUnassignedFilter.
-- delivers #50222251
2013-05-30 16:54:18 -04:00
Mark DePristo ac90e6765e Merge pull request #248 from broadinstitute/rp_chartl_mathutils_checks_for_valid_inputs-GSA-767
Create a new normalDistributionLog10 function that is unit tested for us...
2013-05-30 13:39:33 -07:00
Ryan Poplin 61af37d0d2 Create a new normalDistributionLog10 function that is unit tested for use in the VQSR. 2013-05-30 16:00:08 -04:00
Mark DePristo 56b14be4bc Merge pull request #247 from broadinstitute/eb_fix_RR_negative_header_problem
Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads.
2013-05-29 18:10:19 -07:00
Mark DePristo 50a12df68c Merge pull request #246 from broadinstitute/dr_fix_read_shard_balancer_log_output
Fix confusing extraneous log output from the ReadShardBalancer at traversal end
2013-05-29 13:50:26 -07:00
Eric Banks a5a68c09fa Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads.
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/soft clips, but
ReduceReads does not.  So I refactored that method to take a boolean argument controlling whether or not
clips should be ignored.  Also rebased so that the return type is no longer a Pair.
Added unit test to cover this situation.
2013-05-29 16:41:01 -04:00
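The refactored helper above can be sketched with its new boolean flag. This is a simplified illustration: the real method inspects CIGAR elements of a read, not an array of operator strings, and the signature here is invented for the example.

```java
// Sketch of the refactor described above: readStartsWithInsertion() now takes
// a flag controlling whether leading hard (H) and soft (S) clips are skipped
// before testing whether the first real operator is an insertion (I).
public class StartsWithInsertionSketch {
    public static boolean readStartsWithInsertion(String[] cigarOps, boolean ignoreClips) {
        for (String op : cigarOps) {
            // Optionally skip past leading clips, as ReadUtils does by default.
            if (ignoreClips && (op.equals("H") || op.equals("S")))
                continue;
            // The first non-skipped operator decides the answer.
            return op.equals("I");
        }
        return false;
    }
}
```

With `ignoreClips=true` a read beginning H,S,I starts with an insertion; with `ignoreClips=false` (the ReduceReads view) it does not, which is exactly the mismatch the commit fixes.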
David Roazen eb206e9f71 Fix confusing log output from the engine
-ReadShardBalancer was printing out an extra "Loading BAM index data for next contig"
 message at traversal end, which was confusing users and making the GATK look stupid.
 Suppress the extraneous message, and reword the log messages to be less confusing.

-Improve log message output when initializing the shard iterator in GenomeAnalysisEngine.
 Don't mention BAMs when there are none, and say "Preparing for traversal" rather than
 mentioning the meaningless-for-users concept of "shard strategy"

-These log messages are needed because the operations they surround might take a
 while under some circumstances, and the user should know that the GATK is actively
 doing something rather than being hung.
2013-05-29 16:17:04 -04:00
Mark DePristo 684c91c2e7 Merge pull request #245 from broadinstitute/dr_enforce_min_dcov
Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
2013-05-29 09:52:13 -07:00
Mark DePristo 60c3c83b94 Merge pull request #244 from broadinstitute/mc_add_missing_example_vcf_idx
Somehow the index of exampleDBSNP.vcf was missing
2013-05-29 09:48:24 -07:00
Mark DePristo 677e6bd6cf Merge pull request #243 from broadinstitute/mc_turn_off_downsampling_in_diagnose_targets
Turn off downsampling for DiagnoseTargets
2013-05-29 09:48:08 -07:00
David Roazen a7cb599945 Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
 since low dcov values can result in problematic downsampling artifacts for locus-based
 traversals.

-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
 controls the number of reads per alignment start position, and even a dcov value of 1 might
 be safe/desirable in some circumstances.

-Also reorganize the global downsampling defaults so that they are specified as annotations
 to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
 DownsamplingMethod class.

-The default downsampling settings have not been changed: they are still -dcov 1000
 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
2013-05-29 12:07:12 -04:00
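The validation above can be sketched as a single check. This is an illustration under stated assumptions: IllegalArgumentException stands in for the GATK UserException class, and the method and flag names are invented.

```java
// Sketch of the -dcov validation described above: locus-based traversals
// (Locus and ActiveRegion walkers) require dcov >= 200, while read-based
// traversals keep no minimum, since even dcov=1 can be safe there.
public class DcovCheckSketch {
    static final int MIN_LOCUS_DCOV = 200;

    public static void validateDcov(int dcov, boolean isLocusBasedTraversal) {
        if (isLocusBasedTraversal && dcov < MIN_LOCUS_DCOV)
            throw new IllegalArgumentException(
                    "-dcov must be >= " + MIN_LOCUS_DCOV
                    + " for locus-based traversals, got " + dcov);
    }
}
```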
Mauricio Carneiro 38e765f00d Somehow the index of exampleDBSNP.vcf was missing
This was missed when we added all the indices of our testdata
2013-05-28 15:29:43 -04:00
Mauricio Carneiro f1affa9fbb Turn off downsampling for DiagnoseTargets
DiagnoseTargets should never be downsampled (and I didn't know there was default downsampling going on for locus walkers)
2013-05-28 14:58:50 -04:00
Ryan Poplin 3516daff52 Merge pull request #240 from broadinstitute/rp_gga_implies_both_models
Bugfix for GGA mode in UG silently ignoring indels
2013-05-24 11:13:57 -07:00
Ryan Poplin 85905dba92 Bugfix for GGA mode in UG silently ignoring indels
-- Started by Mark. Finished up by Ryan.
-- GGA mode still respected the -glm argument for the SNP and INDEL models, so you would silently fail to genotype indels at all if -glm INDEL wasn't provided; the sites were still emitted, so you'd see records in the VCF but all alleles would be no-calls.
-- https://www.pivotaltracker.com/story/show/48924339 for more information
-- [resolves #48924339]
2013-05-24 13:47:26 -04:00
Mark DePristo b7fab31e25 Merge pull request #239 from broadinstitute/mc_quick_dt_output_fix
Make the missing targets output never use stdout
2013-05-23 07:00:57 -07:00
Mauricio Carneiro da21924b44 Make the missing targets output never use stdout
Problem
--------
Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided

Solution
--------
Make it NOT default to stdout

[Delivers #50386741]
2013-05-22 14:22:54 -04:00
MauricioCarneiro 32c33901d7 Merge pull request #237 from broadinstitute/md_banded_pairhmm
Archived banded logless PairHMM
2013-05-22 10:25:18 -07:00
Mark DePristo d167743852 Archived banded logless PairHMM
BandedHMM
---------
-- An implementation of a linear-runtime, linear-memory banded logless PairHMM.  Though about 50% faster than the current PairHMM, this implementation will be superseded by the GraphHMM when it becomes available.  The implementation is being archived for future reference

Useful infrastructure changes
-----------------------------
-- Split PairHMM into an N2MemoryPairHMM so that smarter implementations can avoid allocating the double[][] matrices if they don't want them, which previously always occurred in the base class PairHMM
-- Added functionality (controlled by a private static boolean) to write out likelihood call information to a file from inside LikelihoodCalculationEngine, for use in unit or performance testing.  Added an example of 100kb of data to private/testdata.  Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
2013-05-22 12:24:00 -04:00
delangel 925232b0fc Merge pull request #236 from broadinstitute/md_simple_hc_performance_improvements
3 simple performance improvements for HaplotypeCaller
2013-05-22 07:58:28 -07:00
Mark DePristo adec22e748 Merge pull request #238 from broadinstitute/eb_optimize_filter_counting
Optimized counting of filtered records by filter.
2013-05-22 04:57:26 -07:00
Eric Banks 881b2b50ab Optimized counting of filtered records by filter.
Don't map class to counts in the ReadMetrics (necessitating 2 HashMap lookups for every increment).
Instead, wrap the ReadFilters with a counting version and then set those counts only when updating global metrics.
2013-05-21 21:54:49 -04:00
Mark DePristo 010034a650 Optimization/bugfix for PerReadAlleleLikelihoodMap
-- The add() call had a misplaced map.put() call: we were always putting the result of get() back into the map, when we really intended to put the value back only when the original get() returned null and we had initialized the result
2013-05-21 16:18:57 -04:00
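The get-or-initialize pattern described above is worth spelling out, since the misplaced put() is an easy bug to write. The types below are simplified stand-ins (String keys and a Double-valued inner map) rather than the real PerReadAlleleLikelihoodMap structures.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix described above: put() back into the outer map only when
// get() missed and we had to initialize a fresh inner map, rather than
// unconditionally re-putting the fetched value on every add() call.
public class GetOrInitSketch {
    public static Map<String, Double> getOrInit(
            Map<String, Map<String, Double>> perRead, String read) {
        Map<String, Double> m = perRead.get(read);
        if (m == null) {
            m = new HashMap<>();
            perRead.put(read, m);  // the put() belongs inside this branch only
        }
        return m;
    }
}
```

(On a modern JDK this is exactly `perRead.computeIfAbsent(read, k -> new HashMap<>())`, but computeIfAbsent did not exist in the Java of this era.)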
Mark DePristo a1093ad230 Optimization for ActiveRegion.removeAll
-- The previous version took a Collection<GATKSAMRecord> to remove and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion.  This can be very slow when there are lots of reads, as ArrayList.removeAll() ultimately calls indexOf(), which searches through the list calling equals() on each element.  The new version takes a set and uses an iterator on the list to remove() any read that is in the set.  Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
2013-05-21 16:18:57 -04:00
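The iterator-plus-set algorithm above can be sketched directly. The element type is simplified from GATKSAMRecord to String for the example; the shape of the loop is the point.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Sketch of the optimization described above: a single pass with an iterator
// and a HashSet gives O(n) removal, versus removeAll() against a list, whose
// per-element contains() is a linear scan.
public class RemoveAllSketch {
    public static void removeAll(List<String> reads, Set<String> toRemove) {
        for (Iterator<String> it = reads.iterator(); it.hasNext(); ) {
            if (toRemove.contains(it.next()))  // O(1) membership test
                it.remove();                    // safe structural removal mid-iteration
        }
    }
}
```

Using Iterator.remove() inside the loop is also the only safe way to delete from an ArrayList while iterating it, so the fix is both faster and cleaner.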
Mark DePristo d9cdc5d006 Optimization: track alleles in the PerReadAlleleLikelihoodMap with a HashSet
-- The previous version of PerReadAlleleLikelihoodMap only stored the alleles in an ArrayList, and used ArrayList.contains() to determine if an allele was already present in the map.  This is very slow with many alleles.  Now keeps both the ArrayList (for get() performance) and a Set of alleles for contains().
2013-05-21 16:18:56 -04:00
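The dual-structure idea above (list for indexed get(), set for contains()) is a classic trade of memory for speed. The sketch below uses String in place of the real Allele type and invented method names; it only illustrates keeping the two structures in sync.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the optimization described above: keep an ArrayList for fast
// indexed get() alongside a HashSet for O(1) contains(), adding to both
// together so they never diverge.
public class AlleleIndexSketch {
    private final List<String> alleles = new ArrayList<>();
    private final Set<String> alleleSet = new HashSet<>();

    /** Adds the allele if not already present; returns true if it was new. */
    public boolean add(String allele) {
        if (alleleSet.contains(allele))  // O(1), not an ArrayList scan
            return false;
        alleleSet.add(allele);
        alleles.add(allele);
        return true;
    }

    public String get(int i) {
        return alleles.get(i);
    }
}
```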
Mark DePristo 3cfe2dcc64 Merge pull request #235 from broadinstitute/eb_several_traversal_printout_fixes
Eb several traversal printout fixes
2013-05-21 13:13:11 -07:00
Eric Banks 20c7a89030 Fixes to get accurate read counts for Read traversals
1. Don't clone the dataSource's metrics object (because then the engine won't continue to get updated counts)
 2. Use the dataSource's metrics object in the CountingFilteringIterator and not the first shard's object!
 3. Synchronize ReadMetrics.incrementMetrics to prevent race conditions.

Also:
 * Make sure users realize that the read counts are approximate in the print outs.
 * Removed a lot of unused cruft from the metrics object while I was in there.
 * Added test to make sure that the ReadMetrics read count does not overflow ints.
 * Added unit tests for traversal metrics (reads, loci, and active region traversals); these test counts of reads and records.
2013-05-21 15:24:07 -04:00
Eric Banks 58f4b81222 Count Reads should use a Long instead of an Integer for counts to prevent overflows. Added unit test. 2013-05-21 15:23:51 -04:00
Eric Banks 1f3624d204 Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing 2013-05-21 11:35:05 -04:00