Commit Graph

3919 Commits (111e8cef0f5d5e34fcc768c8eb8fe2323058ea6c)

Author SHA1 Message Date
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
Mark DePristo 2b86ab02be Improve queue script jobreport visualization script
-- the Queue jobreport PDF script now provides a high-level summary of the de-scattered runtimes of each analysis, so that its easy to see where your script is spending its time across scatters.
2013-05-07 12:11:46 -04:00
chartl 98021db264 Merge pull request #208 from broadinstitute/yf_fix_molten_GenotypeConcordance
- Fixed a small bug in the printout of molten data in GenotypeConcordanc...
2013-05-06 08:42:06 -07:00
Mark DePristo f42bb86bdd e# This is a combination of 2 commits.
Only try to clip adaptors when both reads of the pair are on opposite strands

-- Read pairs that have unusual alignments, such as two reads both oriented like:

  <-----
     <-----

where previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs.  This caused us to clip extra good bases from reads.
-- Update MD5s due change in adaptor clipping, which add some coverage in some places
2013-05-03 11:19:14 -04:00
Mark DePristo 0587a145bf Utils.dupString should allow 0 number of duplicates to produce empty string 2013-05-03 09:32:05 -04:00
Mark DePristo f5a301fb63 Bugfix for AlignmentUtils.trimCigarByBases
-- Previous version would trim down 2M2D2M into 2M if you asked for the first 2 bases, but this can result in incorrect alignment of the bases to the reference as the bases no longer span the full reference interval expected.  Fixed and added unit tests
2013-05-03 09:32:05 -04:00
Mark DePristo 2bcbdd469f leftAlignCigarSequentially now supports haplotypes with insertions and deletions where the deletion allele was previously removed by the leftAlignSingleIndel during it's cleanup phase. 2013-05-03 09:32:05 -04:00
Eric Banks d981fd01b8 Now that we don't generate dict and fai files, the resource script needs to copy them to the bundle. 2013-05-02 15:18:13 -04:00
David Roazen 13bfa963da Revert changes to exampleFASTA.fasta.fai for now to get tests passing again 2013-05-02 12:59:20 -04:00
Eric Banks f88a964e2c Adding .fai file to example fasta since we don't generate it anymore 2013-05-02 10:54:32 -04:00
Eric Banks 6d0e383a60 Fixing the bundle script
1. someone out there busted it when adding high confidence 1000G calls
2. new path to NA12878 bam
3. updated clashing version argument
2013-05-02 09:40:36 -04:00
Yossi Farjoun 4b8b411b92 - Fixed a small bug in the printout of molten data in GenotypeConcordance
Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.

- Updated MD5's after looking at diffs to verify that the change is what I expected.
2013-05-02 09:16:07 -04:00
David Roazen f3c94a3c87 Update expected test output for Java 7
-Changes in Java 7 related to comparators / sorting produce a large number
 of innocuous differences in our test output. Updating expectations now
 that we've moved to using Java 7 internally.

-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
 intermittent failures.
2013-05-01 16:18:01 -04:00
David Roazen f57256b6c2 Delete unused FastaSequenceIndexBuilder class and accompanying test
This class, being unused, was no longer getting packaged into the
GATK release jar by bcel, and so attempting to run its unit test
on the release jar was producing an error.
2013-05-01 01:02:01 -04:00
Eric Banks 58424e56be Setting the reduce reads count tag was all wrong in a previous commit; fixing.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly.  Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.

The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly.  Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).

Also:
1. counts are now maintained as ints whenever possible.  Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
2013-04-30 13:45:42 -04:00
Mark DePristo 73fcacbf1b Change Long to long 2013-04-30 09:21:10 -04:00
Yossi Farjoun 0e7e6d35d8 GATKBAMIndex calls buffer.length() on every read. This is causing much pain.
Optimized by getting the read of the file upon opening the index-file and using that instead.
2013-04-29 12:49:02 -04:00
Mark DePristo 0387ea8df9 Bugfix for ReadClipper with ReducedReads
-- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts.  Updated the hardClip operation to handle reduce reads, and added a unit test to make sure this works properly.  Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read
-- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously
2013-04-29 11:12:09 -04:00
Mark DePristo 759c531d1b Merge pull request #197 from broadinstitute/dr_disable_snpeff_version_check
Add support for snpEff "GATK compatibility mode" (-o gatk)
2013-04-26 13:55:14 -07:00
David Roazen 7d90bbab08 Add support for snpEff "GATK compatibility mode" (-o gatk)
-Do not throw an exception when parsing snpEff output files
 generated by not-officially-supported versions of snpEff,
 PROVIDED that snpEff was run with -o gatk

-Requested by the snpEff author

-Relevant integration tests updated/expanded
2013-04-26 15:47:15 -04:00
Mark DePristo 071fd67d55 Merge pull request #193 from broadinstitute/eb_contamination_fixing_for_reduced_reads
Eb contamination fixing for reduced reads
2013-04-26 09:48:45 -07:00
Mark DePristo 92a6c7b561 Merge pull request #195 from broadinstitute/eb_exclude_sample_file_bug_in_select_variants
Fixed bug reported on the forum where using the --exclude_sample_file ar...
2013-04-26 09:47:38 -07:00
Eric Banks 360e2ba87e Fixed bug reported on the forum where using the --exclude_sample_file argument in SV was giving bad results.
Added integration test.
https://www.pivotaltracker.com/s/projects/793457/stories/47399245
2013-04-26 12:23:11 -04:00
Eric Banks ba2c3b57ed Extended the allele-biased down-sampling functionality to handle reduced reads.
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.

Added lots of unit tests for new functionality.
2013-04-26 11:23:17 -04:00
Mark DePristo 528c3d083a Merge pull request #191 from broadinstitute/dr_fix_rod_system_locking
Detect stuck lock-acquisition calls, and disable file locking for tests
2013-04-25 09:32:54 -07:00
Mark DePristo d20be41fee Bugfix for FragmentUtils.mergeOverlappingPairedFragments
-- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences.  If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference.  Updated the code to first clip the adapter sequences from the incoming fragments
-- Update MD5s
2013-04-25 11:11:15 -04:00
David Roazen 4d56142163 Detect stuck lock-acquisition calls, and disable file locking for tests
-Acquire file locks in a background thread with a timeout of 30 seconds,
 and throw a UserException if a lock acquisition call times out

    * should solve the locking issue for most people provided they
      RETRY failed farm jobs

    * since we use NON-BLOCKING lock acquisition calls, any call that
      takes longer than a second or two indicates a problem with the
      underlying OS file lock support

    * use daemon threads so that stuck lock acquisition tasks don't
      prevent the JVM from exiting

-Disable both auto-index creation and file locking for integration tests
 via a hidden GATK argument --disable_auto_index_creation_and_locking_when_reading_rods

    * argument not safe for general use, since it allows reading from
      an index file without first acquiring a lock

    * this is fine for the test suite, since all index files already
      exist for test files (or if they don't, they should!)

-Added missing indices for files in private/testdata

-Had to delete most of RMDTrackBuilderUnitTest, since it mostly tested auto-index
 creation, which we can't test with locking disabled, but I replaced the deleted
 tests with some tests of my own.

-Unit test for FSLockWithShared to test the timeout feature
2013-04-24 22:49:02 -04:00
Eric Banks 379a9841ce Various bug fixes for recent Reduce Reads additions plus solution implemented for low MQ reads.
1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly
got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites
use binomial prob and at high coverage sites revert to using the old base proportions.  Then we get the
best of both worlds.  As a note, coverage refers to just the individual base counts and not the entire pileup.

2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had
the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We
now use a PriorityQueue instead of a TreeSet to allow for such cases.

3. Each consensus keeps track of its own number of softclipped bases.  There was no reason that that number
should be shared between them.

4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now.  Don't lose that
information.  Maybe we'll decide to change this in the future, but for now we are conservative.

5. Also implemented various small performance optimizations based on profiling.

Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.
2013-04-24 18:18:50 -04:00
Eric Banks 3f52f55c55 Merge pull request #186 from broadinstitute/md_libs_canonical_cigar
Performance optimizations and caliper benchmarks code for consolidateCigar
2013-04-24 12:58:32 -07:00
Ryan Poplin 80131ac996 Adding the 1000G_phase1.snps.high_confidence callset to the GATK resource bundle for use in the April 2013 updated best practices. 2013-04-24 11:41:32 -04:00
Mark DePristo df90597bfc Performance optimizations and caliper benchmarking code for consolidateCigar
-- Now that this function is used in the core of LIBS it needed some basic optimizations, which are now complete, pass all unit tests.
-- Added caliper benchmark for AlignmentUtils to assess performance (showing new version is 3x-10x faster)
-- Remove unused import in ReadStateManager
2013-04-24 11:36:43 -04:00
Eric Banks 3477e092ea Minor: bump up the amount of cached log10 data in MathUtils so that Monkol can actually call 50K samples. 2013-04-19 08:39:08 -04:00
Eric Banks 5bce0e086e Refactored binomial probability code in MathUtils.
* Moved redundant code out of UGEngine
  * Added overloaded methods that assume p=0.5 for speed efficiency
  * Added unit test for the binomialCumulativeProbability method
2013-04-16 18:19:07 -04:00
Eric Banks df189293ce Improve compression in Reduce Reads by incorporating probabilistic model and global het compression
The Problem:
  Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
  regions in an exome) causes RR (with default settings) to consider it a variant region.  This
  seriously hurts compression performance.

The Solution:
  1. We now use a probabilistic model for determining whether we can create a consensus (in other
  words, whether we can error correct a site) instead of the old ratio threshold.  We calculate
  the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
  that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
  2. We also allow het compression globally, not just at known sites.  So if we cannot create a
  consensus at a given site then we try to perform het compression; and if we cannot perform het
  compression that we just don't reduce the variant region.  This way very wonky regions stay
  uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
  locus get het compressed.

Details:
  1. -minvar is now deprecated in favor of -min_pvalue.
  2. Added integration test for bad pvalue input.
  3. -known argument still works to force het compression only at known sites; if it's not included
     then we allow het compression anywhere.  Added unit tests for this.
  4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
     Before finalizing het compression, we now check for insertions or other variant regions (usually due
     to multi-allelics) which can render a region incompressible (and we back out if we find one).  We
     were checking for excessive softclips before, but now we add these tests too.
  5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
     consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
     out instead of backing out all of them.
  6. We no longer create a mini read at the stop of the variant window for het compression.  Instead, we
     allow it to be part of the next global consensus.
  7. The coverage test is no longer run systematically on all integration tests because the quals test
     supercedes it.  The systematic quals test is now much stricter in order to catch bugs and edge cases
     (very useful!).
  8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
     for a consensus was affected by good and bad bases/reads).
  9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
     This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
  10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
     with insertions from a header.

Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
2013-04-16 18:19:06 -04:00
Geraldine Van der Auwera e176fc3af1 Merge pull request #159 from broadinstitute/md_bqsr_ion
Trivial BQSR bug fixes and improvement
2013-04-16 08:54:47 -07:00
Mark DePristo 067d24957b Select the haplotypes we move forward for genotyping per sample, not pooled
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples.  This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes.  The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects.  After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model.  It's worse than the current version in both single and multiple samples:

1000G EUR samples:

10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                    50               0               5              8                        1
per_sample  INDELS                   6               0               7              2                        1
pooled      SNPS                    49               0               6              8                        1
pooled      INDELS                   5               0               8              2                        1

100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                   144               0              22             28                        1
per_sample  INDELS                  28               1              16              9                       11
pooled      SNPS                   143               0              23             28                        1
pooled      INDELS                  27               1              17              9                       11

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
2013-04-16 09:42:03 -04:00
Guillermo del Angel a971e7ab6d Several improvements to ReadAdaptorTrimmer so that it can be incorporated into ancient DNA processing pipelines (for which it was developed):
-- Add pair cleaning feature. Reads in query-name sorted order are required and pairs need to appear consecutively, but if -cleanPairs option is set, a malformed pair where second read is missing is just skipped instead of erroring out.
-- Add integration tests
-- Move walker to public
2013-04-13 13:41:36 -04:00
Mauricio Carneiro a063e79597 Updating the exampleGRP.grp test file
It had been generated with an old version of BQSRv2 and wasn't compatible with exampleBAM anymore.
2013-04-13 09:07:13 -04:00
Mark DePristo b32457be8d Merge pull request #163 from broadinstitute/mc_hmm_caching_again
Fix another caching issue with the PairHMM
2013-04-12 12:34:49 -07:00
Mauricio Carneiro 403f9de122 Fix another caching issue with the PairHMM
The Problem
----------
Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result.

Solution
--------
The HaplotypeCaller only initializes the PairHMM once and then feed it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes.

Summarized Changes
------------------
   * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0)
   * Updated related MD5's

Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9
2013-04-12 14:52:45 -04:00
Mark DePristo 50cdffc61f Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:

ref:               ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

in which the alt and ref haplotypes differ in having indel at both the start and end of the bubble.  The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in the either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble.

ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls.  I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.

-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat sysmatically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all

General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly.  Depreciated old call to inline constants.  This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default.  Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
2013-04-11 18:22:55 -04:00
Mark DePristo 74196ff7db Trivial BQSR bug fixes and improvement
-- Ensure that BQSR works properly for an Ion Torrent BAM.  (Added integration test and bam)
-- Improve the error message when a unknown platform is found (integration test added)
2013-04-11 17:08:35 -04:00
Ryan Poplin 850be5e9da Bug fix in SWPairwiseAlignment.
-- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.
2013-04-10 16:04:37 -04:00
Mauricio Carneiro 3960733c88 Fix PrintReads out of space issue
Problem:
--------
Print Reads was running out of disk space when using the -BQSR option even for small bam files

Solution:
---------
Configure setupWriter to expect pre sorted reads
2013-04-09 08:19:52 -04:00
Mark DePristo 1b36db8940 Make ActiveRegionTraversal robust to excessive coverage
-- Add a maximum per sample and overall maximum number of reads held in memory by the ART at any one time.  Does this in a new TAROrderedReadCache data structure that uses a reservior downsampler to limit the total number of reads to a constant amount.  This constant is set to be by default 3000 reads * nSamples to a global maximum of 1M reads, all controlled via the ActiveRegionTraversalParameters annotation.
-- Added an integration test and associated excessively covered BAM excessiveCoverage.1.121484835.bam (private/testdata) that checks that the system is operating correctly.
-- #resolves GSA-921
2013-04-08 15:48:19 -04:00
Mark DePristo 317dc4c323 Add size() method to Downsampler interface
-- This method provides client with the current number of elements, without having to retreive the underlying list<T>.  Added unit tests for LevelingDownsampler and ReservoirDownsampler as these are the only two complex ones.  All of the others are trivially obviously correct.
2013-04-08 15:48:13 -04:00
Mark DePristo 21410690a2 Address reviewer comments 2013-04-08 12:48:20 -04:00
Mark DePristo 6d22485a4c Critical bugfix to ReduceRead functionality of the GATKSAMRecord
-- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5].  The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding.  Now GATKSAMRecord decodes the tag into the proper depths vector so that getReduceCounts() returns what one reasonably expects it to, and getReduceCount(i) merely looks up the value at i.  Added unit test to ensure this behavior going forward.
-- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.
2013-04-08 12:47:50 -04:00
Mark DePristo 3a19266843 Fix residual merge conflicts 2013-04-08 12:47:50 -04:00
Mark DePristo 15461567d7 HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype
-- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes.  This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype.  All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).
2013-04-08 12:47:49 -04:00