Commit Graph

3728 Commits (4fe4ace232c6f01a9d1066ed4e0219d83b446bf3)

Author SHA1 Message Date
Eric Banks 3477e092ea Minor: bump up the amount of cached log10 data in MathUtils so that Monkol can actually call 50K samples. 2013-04-19 08:39:08 -04:00
Eric Banks 5bce0e086e Refactored binomial probability code in MathUtils.
* Moved redundant code out of UGEngine
  * Added overloaded methods that assume p=0.5 for speed efficiency
  * Added unit test for the binomialCumulativeProbability method
2013-04-16 18:19:07 -04:00
Eric Banks df189293ce Improve compression in Reduce Reads by incorporating probabilistic model and global het compression
The Problem:
  Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
  regions in an exome) causes RR (with default settings) to consider it a variant region.  This
  seriously hurts compression performance.

The Solution:
  1. We now use a probabilistic model for determining whether we can create a consensus (in other
  words, whether we can error correct a site) instead of the old ratio threshold.  We calculate
  the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
  that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
  2. We also allow het compression globally, not just at known sites.  So if we cannot create a
  consensus at a given site then we try to perform het compression; and if we cannot perform het
  compression that we just don't reduce the variant region.  This way very wonky regions stay
  uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
  locus get het compressed.

Details:
  1. -minvar is now deprecated in favor of -min_pvalue.
  2. Added integration test for bad pvalue input.
  3. -known argument still works to force het compression only at known sites; if it's not included
     then we allow het compression anywhere.  Added unit tests for this.
  4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
     Before finalizing het compression, we now check for insertions or other variant regions (usually due
     to multi-allelics) which can render a region incompressible (and we back out if we find one).  We
     were checking for excessive softclips before, but now we add these tests too.
  5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
     consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
     out instead of backing out all of them.
  6. We no longer create a mini read at the stop of the variant window for het compression.  Instead, we
     allow it to be part of the next global consensus.
  7. The coverage test is no longer run systematically on all integration tests because the quals test
     supercedes it.  The systematic quals test is now much stricter in order to catch bugs and edge cases
     (very useful!).
  8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
     for a consensus was affected by good and bad bases/reads).
  9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
     This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
  10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
     with insertions from a header.

Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
2013-04-16 18:19:06 -04:00
Geraldine Van der Auwera e176fc3af1 Merge pull request #159 from broadinstitute/md_bqsr_ion
Trivial BQSR bug fixes and improvement
2013-04-16 08:54:47 -07:00
Mark DePristo 067d24957b Select the haplotypes we move forward for genotyping per sample, not pooled
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples.  This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes.  The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects.  After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model.  It's worse than the current version in both single and multiple samples:

1000G EUR samples:

10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                    50               0               5              8                        1
per_sample  INDELS                   6               0               7              2                        1
pooled      SNPS                    49               0               6              8                        1
pooled      INDELS                   5               0               8              2                        1

100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                   144               0              22             28                        1
per_sample  INDELS                  28               1              16              9                       11
pooled      SNPS                   143               0              23             28                        1
pooled      INDELS                  27               1              17              9                       11

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
2013-04-16 09:42:03 -04:00
Guillermo del Angel a971e7ab6d Several improvements to ReadAdaptorTrimmer so that it can be incorporated into ancient DNA processing pipelines (for which it was developed):
-- Add pair cleaning feature. Reads in query-name sorted order are required and pairs need to appear consecutively, but if -cleanPairs option is set, a malformed pair where second read is missing is just skipped instead of erroring out.
-- Add integration tests
-- Move walker to public
2013-04-13 13:41:36 -04:00
Mark DePristo b32457be8d Merge pull request #163 from broadinstitute/mc_hmm_caching_again
Fix another caching issue with the PairHMM
2013-04-12 12:34:49 -07:00
Mauricio Carneiro 403f9de122 Fix another caching issue with the PairHMM
The Problem
----------
Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result.

Solution
--------
The HaplotypeCaller only initializes the PairHMM once and then feed it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes.

Summarized Changes
------------------
   * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0)
   * Updated related MD5's

Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9
2013-04-12 14:52:45 -04:00
Mark DePristo 50cdffc61f Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:

ref:               ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

in which the alt and ref haplotypes differ in having indel at both the start and end of the bubble.  The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in the either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble.

ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls.  I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.

-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat sysmatically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all

General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly.  Depreciated old call to inline constants.  This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default.  Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
2013-04-11 18:22:55 -04:00
Mark DePristo 74196ff7db Trivial BQSR bug fixes and improvement
-- Ensure that BQSR works properly for an Ion Torrent BAM.  (Added integration test and bam)
-- Improve the error message when a unknown platform is found (integration test added)
2013-04-11 17:08:35 -04:00
Ryan Poplin 850be5e9da Bug fix in SWPairwiseAlignment.
-- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.
2013-04-10 16:04:37 -04:00
Mauricio Carneiro 3960733c88 Fix PrintReads out of space issue
Problem:
--------
Print Reads was running out of disk space when using the -BQSR option even for small bam files

Solution:
---------
Configure setupWriter to expect pre sorted reads
2013-04-09 08:19:52 -04:00
Mark DePristo 1b36db8940 Make ActiveRegionTraversal robust to excessive coverage
-- Add a maximum per sample and overall maximum number of reads held in memory by the ART at any one time.  Does this in a new TAROrderedReadCache data structure that uses a reservior downsampler to limit the total number of reads to a constant amount.  This constant is set to be by default 3000 reads * nSamples to a global maximum of 1M reads, all controlled via the ActiveRegionTraversalParameters annotation.
-- Added an integration test and associated excessively covered BAM excessiveCoverage.1.121484835.bam (private/testdata) that checks that the system is operating correctly.
-- #resolves GSA-921
2013-04-08 15:48:19 -04:00
Mark DePristo 317dc4c323 Add size() method to Downsampler interface
-- This method provides client with the current number of elements, without having to retreive the underlying list<T>.  Added unit tests for LevelingDownsampler and ReservoirDownsampler as these are the only two complex ones.  All of the others are trivially obviously correct.
2013-04-08 15:48:13 -04:00
Mark DePristo 21410690a2 Address reviewer comments 2013-04-08 12:48:20 -04:00
Mark DePristo 6d22485a4c Critical bugfix to ReduceRead functionality of the GATKSAMRecord
-- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5].  The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding.  Now GATKSAMRecord decodes the tag into the proper depths vector so that getReduceCounts() returns what one reasonably expects it to, and getReduceCount(i) merely looks up the value at i.  Added unit test to ensure this behavior going forward.
-- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.
2013-04-08 12:47:50 -04:00
Mark DePristo 3a19266843 Fix residual merge conflicts 2013-04-08 12:47:50 -04:00
Mark DePristo 15461567d7 HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype
-- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes.  This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype.  All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).
2013-04-08 12:47:49 -04:00
Mark DePristo af593094a2 Major improvements to HC that trims down active regions before genotyping
-- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension).  Radically speeds up calculations when using large active region extensions.  The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extend as possible.  The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes
-- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter.  The previous error corrector was just broken (conceptually) and was disabled by default in the engine.  Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java
-- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler
2013-04-08 12:47:49 -04:00
Mark DePristo 7105ad65a6 Remove the capability of EventMap to emit symbolic alleles for unassembled events
-- These events always occur on the very edge of the haplotypes, and are intrinsically dodgy.  So instead of emitting them and then potentially having to deal with merging real basepair events into them we just no longer emit those events.
2013-04-08 12:47:48 -04:00
Mark DePristo f1d772ac25 LD-based merging algorithm for nearby events in the haplotypes
-- Moved R^2 LD haplotype merging system to the utils.haplotype package
-- New LD merging only enabled with HC argument.
-- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enabled it via a static variable
-- A few misc. bug fixes in LDMerger itself
-- Refactoring of Haplotype event splitting and merging code
-- Renamed EventExtractor to EventMap
-- EventMap has a static method that computes the event maps among n haplotypes
-- Refactor Haplotype score and base comparators into their own classes and unit tested them
-- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it.
-- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator.  For those who haven't seen this testing idiom, have a look, and very useful
-- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population.
-- Fixed fundamental bug in the way the previous R^2 implementation worked
-- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are
-- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed.  Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events
-- Bugfixes for EventMap in the HaplotypeCaller as well.  Previous version was overly restrictive, requiring that the first event to make into a block substitution was a snp.  In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M.  The new code supports this.  UnitTested and documented as well.  LDMerger handles case where merging two alleles results in a no-op event.  Merging CA/C + A/AA -> CAA/CAA -> no op.  Handles this case by removing the two events.  UnitTested
-- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled
-- This new version does a much more specific test (that's actually right).  Here's the new algorithm:

     * Compute probability that two variants are in phase with each other and that no
     * compound hets exist in the population.
     *
     * Implemented as a likelihood ratio test of the hypothesis:
     *
     * x11 and x22 are the only haplotypes in the populations
     *
     * vs.
     *
     * all four haplotype combinations (x11, x12, x21, and x22) all exist in the population.
     *
     * Now, since we have to have both variants in the population, we exclude the x11 & x11 state.  So the
     * p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22).
     *
     * Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are:
     *
     * - P(x11 & x12 & x21) -- we have hom-ref and both hets
     * - P(x22 & x12 & x21) -- we have hom-alt and both hets
     * - P(x22 & x12) -- one haplotype is 22 and the other is het 12
     * - P(x22 & x21) -- one haplotype is 22 and the other is het 21
2013-04-08 12:47:48 -04:00
Mark DePristo 167cd49e71 Added -forceActive argument to ActiveRegionWalkers
-- Causes the ART tool to treat all bases as active.   Useful for debugging
2013-04-08 12:47:48 -04:00
Mark DePristo 8656bd5e29 Haplotype now consolidates cigars in setCigar
-- This fixes edge base bugs where non-consolidated cigars are causing problems in users of the Haplotype object.  Input arguments are now checks (let's see if we blow up)
2013-04-08 12:47:47 -04:00
Mark DePristo 0310499b65 System to merge multiple nearby alleles into block substitutions
-- Block substitution algorithm that merges nearby events based on distance.
-- Also does some cleanup of GenotypingEngine
2013-04-08 12:47:47 -04:00
Mark DePristo bff13bb5c5 Move Haplotype class to its own package in utils 2013-04-08 12:47:47 -04:00
Mark DePristo b7d59ea13b LIBS unit test debugging should be false 2013-04-08 12:47:47 -04:00
Mauricio Carneiro ebe2edbef3 Fix caching indices in the PairHMM
Problem:
--------
PairHMM was generating positive likelihoods (even after the re-work of the model)

Solution:
---------
The caching idices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM.

Summarized Changes:
------------------
* Redesign the matrices to have only 1 col/row of padding instead of 2.
* PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start)
* Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version)
* Adjust the prior and probability matrices to be one based (logless)
* Update Log10PairHMM to work with prior and probability matrices as well
* Move prior and probability matrices to parent class
* Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations
* Simple cleanup of PairHMMUnitTest class for a little speedup
* Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base).
* Create static indices for the transition probabilities (for better readability)

[fixes #47399227]
2013-04-08 11:05:12 -04:00
Eric Banks 6253ba164e Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs
* This occurred when one or more alleles were lost from the record after selection
  * Discussed here: http://gatkforums.broadinstitute.org/discussion/comment/4718#Comment_4718
  * Added some integration tests for --keepOriginalAC (there were none before)
2013-04-05 00:53:28 -04:00
Eric Banks 7897d52f32 Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec).
* As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270
  * This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF
  * Added integration test to confirm failure with User Error
  * Removed illegal header line in KB test VCF that was causing related tests to fail.
2013-04-05 00:52:32 -04:00
Eric Banks 14bbba0980 Optimization to method for getting values in ArgumentMatch
* Very trivial, but I happened to see this code and it drove me nuts so I felt compelled to refactor it.
  * Instead of iterating over keys in map to get the values, just iterate over the values...
2013-04-04 23:30:47 -04:00
Ryan Poplin 8a93bb687b Critical bug fix for the case of duplicate map calls in ActiveRegionWalkers with exome interval lists.
-- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create
duplicate active regions.
-- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside
of the current interval.
-- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.
2013-04-03 13:15:30 -04:00
David Roazen 2eac97a76c Remove auto-creation of fai/dict files for fasta references
-A UserException is now thrown if either the fai or dict file for the
 reference does not exist, with pointers to instructions for creating
 these files.

-Gets rid of problematic file locking that was causing intermittent
 errors on our farm.

-Integration tests to verify that correct exceptions are thrown in
 the case of a missing fai / dict file.

GSA-866 #resolve
2013-04-02 18:34:08 -04:00
Mark DePristo e7a8e6e8ee Merge pull request #140 from broadinstitute/dr_interval_intersection_bug_GSA-909
Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists
2013-04-02 11:59:01 -07:00
David Roazen 5baf906c28 Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists
-The algorithm for finding the intersection of two sets of intervals
 relies on the sortedness of the intervals within each set, but the engine
 was not sorting the intervals before attempting to find the intersection.

-The result was that if one or both interval lists was unsorted / lexicographically
 sorted, we would often fail to find the intersection correctly.

-Now the IntervalBinding sorts all sets of intervals before returning them,
 solving the problem.

-Added an integration test for this case.

GSA-909 #resolve
2013-04-02 14:01:52 -04:00
Ryan Poplin a58a3e7e1e Merge pull request #134 from broadinstitute/mc_phmm_experiments
PairHMM rework
2013-04-01 12:10:43 -07:00
Mark DePristo 7c83efc1b9 Merge pull request #135 from broadinstitute/mc_pgtag_fix
Fixing @PG tag uniqueness issue
2013-03-31 11:36:40 -07:00
Guillermo del Angel 9686e91a51 Added small feature to VariantFiltration to filter sites outside of a given mask:
-- Sometimes it's desireable to specify a set of "good" regions and filter out other stuff (like say an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. New argument -filterNotInMask will reverse default logic and filter outside of a given mask.
-- Added integration test, and made sure we also test with a BED rod.
2013-03-31 08:48:16 -04:00
Mauricio Carneiro ec475a46b1 Fixing @PG tag uniqueness issue
The Problem:
------------
the SAM spec does not allow multiple @PG tags with the same id. Our @PG tag writing routines were allowing that to happen with the boolean parameter "keep_all_pg_records".

How this fixes it:
------------------
This commit removes that option from all the utility functions and cleans up the code around the classes that used these methods off-spec.

Summarized changes:
-------------------
* Remove keep_all_pg_records option from setupWriter utility methos in Util
* Update all walkers to now replace the last @PG tag of the same walker (if it already exists)
* Cleanup NWaySamFileWriter now that it doesn't need to keep track of the keep_all_pg_records variable
* Simplify the multiple implementations to setupWriter

Bamboo:
-------
http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL31

Issue Tracker:
--------------
[fixes 47100885]
2013-03-30 20:31:33 -04:00
Mauricio Carneiro 52e67a6973 ReviewedStingException -> IllegalStateException 2013-03-30 20:11:55 -04:00
Guillermo del Angel 6b8bed34d0 Big bad bug fix: feature added to LeftAlignAndTrimVariants to left align multiallelic records didn't work.
-- Corrected logic to pick biallelic vc to left align.
-- Added integration test to make sure this feature is tested and feature to trim bases is also tested.
2013-03-30 19:31:28 -04:00
Mauricio Carneiro 0de6f55660 PairHMM rework
The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because:
   # Initial conditions were not being set properly
   # Emission probabilities in the last row were not adding up to 1

The following commit fixes both by
   # averaging all potential start locations (giving an equal prior to the state machine in it's first iteration -- allowing the read to start it's alignment anywhere in the haplotype with equal probability)
   # discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state)

Summarized changes:
   * Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm
   * Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1.
   * Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now)
   * Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity
   * Rename metric lengths to read and haplotype lengths for clarity
   * Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity
   * Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix
   * Remove unnecessary parameters from updateCell()
   * Fix the expected probabilities coming from the exact model in PairHMMUnitTest
   * Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables)
   * Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s.

[fix 47164949]
2013-03-30 10:50:06 -04:00
Guillermo del Angel 8fbf9c947f Upgrades and changes to LeftAlignVariants, motivated by 1000G consensus indel production:
-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken by complext records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-aligneable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.
2013-03-29 10:02:06 -04:00
Chris Hartl 73d1c319bf Rarely-occurring logic bugfix for GenotypeConcordance, streamlining and testing of MathUtils
Currently, the multi-allelic test is covering the following case:

Eval   A   T,C
Comp   A   C

reciprocate this so that the reverse can be covered.

Eval   A   C
Comp   A   T,C

And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.

This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine:

Eval:   A  G,T      0/0   2/0   2/2   1/1
Comp:   A  C,T      0/0   1/0   0/0   0/0

Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:

Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)

Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.

Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed:
 - dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
 - vectorSum
 - array shuffle, random subset
 - countOccurances (general forms, the char form is used in the codebase)
 - getNMaxElements
 - array permutation
 - sorted array permutation
 - compare floats
 - sum() (for integer arrays and lists).

Final keyword was extensively added to MathUtils.

The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).

The PaperGenotyper still made use of one of these array methods, since it was the only walker it was migrated into the genotyper itself.

In addition, more extensive tests were added for
 - logBinomialCoefficient (Newton's identity should always hold)
 - logFactorial
 - log10sumlog10 and its approximation

All unit tests pass
2013-03-28 23:25:28 -04:00
MauricioCarneiro a2b69790a6 Merge pull request #128 from broadinstitute/eb_rr_polyploid_compression_GSA-639 2013-03-28 06:39:43 -07:00
Mark DePristo 12475cc027 Display the active MappingQualityFilter if mmq > 0 in the HaplotypeCaller 2013-03-26 14:27:18 -04:00
Mark DePristo ad04fdb233 PerReadAlleleLikelihoodMap getMostLikelyAllele returns an MostLikelyAllele objects now
-- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users.  The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not.  That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative.
-- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves.  There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second.  For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads.   All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as they reads only span a single event.  In this case our annotations will all fall apart, returning their default values.  Added a JIRA to address this (should be discussed in group meeting)
2013-03-26 14:27:13 -04:00
Eric Banks 593d3469d4 Refactored the het (polyploid) consensus creation in ReduceReads.
* It is now cleaner and easier to test; added tests for newly implemented methods.
 * Many fixes to the logic to make it work
   * The most important change was that after triggering het compression we actually need to back it out if it
      creates reads that incorporated too many softclips at any one position (because they get unclipped).
   * There was also an off-by-one error in the general code that only manifested itself with het compression.
 * Removed support for creating a het consensus around deletions (which was broken anyways).
   * Mauricio gave his blessing for this.
 * Het compression now works only against known sites (with -known argument).
    * The user can pass in one or more VCFs with known SNPs (other variants are ignored).
    * If no known SNPs are provided het compression will automatically be disabled.
 * Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
   strandedness from normal reduced reads.
    * GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
    * This allows us to update the FisherStrand annotation to count het compressed reduced reads
       towards the FS calculation.
    * [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
       backwards compatible.]
    * Updated integration tests accordingly with new het compressed bams (both for RR and UG).
 * In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
   RR properly, so I fixed it too.
    * Also, the test in the UG engine for determining whether there are too many overlapping
       deletions is updated to handle RR.
 * I added a special hook in the RR integration tests to additionally run the systematic
   coverage checking tool I wrote earlier.
    * AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
       not lost from original to reduced bam.
    * This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
       from all but 1 sample (now fixed).
    * AssessReducedCoverage moved from private to protected for packaging reasons.
 * #resolve GSA-639

At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
2013-03-25 09:34:54 -04:00
Mauricio Carneiro eb33da6820 Added support to reduce reads to Callable Loci
-- added calls to representativeCount() of the pileup instead of using ++
-- renamed CallableLoci integration test
-- added integration test for reduce read support on callable loci
2013-03-21 15:53:04 -04:00
Mark DePristo 7ae15dadbe HC now by default only uses reads with MAPQ >= 20 for assembly and calling
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements.  The new version simply uses a read filter to remove them upfront.  This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
2013-03-21 13:10:50 -04:00
Mark DePristo 3a8f001c27 Misc. fixes upon pull request review
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful
2013-03-20 22:54:37 -04:00
Mark DePristo 98c4cd060d HaplotypeCaller now uses SeqGraph instead of kmer graph to build haplotypes.
-- DeBruijnAssembler functions are no longer static.  This isn't the right way to unit test your code
-- An a HaplotypeCaller command line option to use low-quality bases in the assembly
-- Refactored DeBruijnGraph and associated libraries into base class
-- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents.  These DeBruijn versions now inherit from these base classes.  Added some reasonable unit tests for the base and Debruijn edges and vertex classes.
-- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct
-- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split
-- Moved generic methods in DeBruijnAssembler into BaseGraph
-- Created a simple SeqGraph that contains SeqVertex objects
-- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs
-- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -Y -> a, a -> Z
-- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges
-- Fix all unit tests so they work with the new SeqGraph system.  All tests passed without modification.
-- Debugging makes it easier to tell which kmer graph contributes to a haplotype
-- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector
-- Remove unnecessary printing of cleaning info in BaseGraph
-- Turn off kmer graph creation in DeBruijnAssembler.java
-- Only print SeqGraphs when debugGraphTransformations is set to true
-- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest.  Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality.
-- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs
-- DebruijnVertex now longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now
2013-03-20 22:54:36 -04:00
Mark DePristo ffea6dd95f HaplotypeCaller now has the ability to only consider the best N haplotypes for genotyping
-- Added a -dontGenotype mode for testing assembly efficiency
-- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted
2013-03-20 22:54:36 -04:00
Mark DePristo a8fb26bf01 A generic downsampler that reduces coverage for a bunch of reads
-- Exposed the underlying minElementsPerStack parameter for LevelingDownsampler
2013-03-20 22:54:35 -04:00
Mark DePristo 752440707d AlignmentUtils.calcNumDifferentBases computes the number of bases that differ between a reference and read sequence given a cigar between the two. 2013-03-20 22:54:35 -04:00
Geraldine Van der Auwera d70bf64737 Created new DeprecatedToolChecks class
--Based on existing code in GenomeAnalysisEngine
	--Hashmaps hold mapping of deprecated tool name to version number and recommended replacement (if any)
	--Using FastUtils for maps; specifically Object2ObjectMap but there could be a better type for Strings...
	--Added user exception for deprecated annotations
	--Added deprecation check to AnnotationInterfaceManager.validateAnnotations
	--Run when annotations are initialized
	--Made annotation sets instead of lists
2013-03-20 06:46:02 -04:00
Geraldine Van der Auwera 6b4d88ebe9 Created ListAnnotations utility (extends CommandLineProgram)
--Refactored listAnnotations basic method out of VA into HelpUtils
	--HelpUtils.listAnnotations() is now called by both VA and the new ListAnnotations utility (lives in sting.tools)
	--This way we keep the VA --list option but we also offer a way to list annotations without a full valid VA command-line, which was a pain users continually complained about
	--We could get rid of the VA --list option altogether ...?
2013-03-20 06:15:27 -04:00
Geraldine Van der Auwera 95a9ed853d Made some documentation updates & fixes
--Mostly doc block tweaks
	--Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.
2013-03-20 06:15:20 -04:00
Mark DePristo d7bec9eb6e AssessNA12878 bugfixes
-- @Output isn't required for AssessNA12878
-- Previous version would could non-variant sites in NA12878 that resulted from subsetting a multi-sample VC to NA12878 as CALLED_BUT_NOT_IN_DB sites.  Now they are properly skipped
-- Bugfix for subsetting samples to NA12878.  Previous version wouldn't trim the alleles when subsetting down a multi-sample VCF, so we'd have false FN/FP sites at indels when the multi-sample VCF has alleles that result in the subset for NA12878 having non-trimmed alleles.  Fixed and unit tested now.
2013-03-18 15:48:08 -04:00
Ami Levy-Moonshine 0e9c1913ff fix typos in argument docs and in printed output in CoveredByNSamplesSites and rewrite an unaccurate comment 2013-03-18 13:54:21 -04:00
Mark DePristo 2b80068164 Merged bug fix from Stable into Unstable 2013-03-18 12:36:21 -04:00
Mark DePristo 7ab7c873a1 Temp. to PairHMM to avoid bad likelihoods
-- Simply caps PairHMM likelihoods from rising above 0 by taking the min of the likelihood and 0.  Will be properly fixed in GATK 2.5 with better PairHMM implementation.
2013-03-18 12:34:51 -04:00
David Roazen a67d8c8dd6 Bump timeout for MaxRuntimeIntegrationTest
Looks like returning this timeout to its original value was a
bit too aggressive -- adding 40 seconds to the tolerance limit.
2013-03-17 16:17:29 -04:00
David Roazen 742a7651e9 Further tweaking of test timeouts
Increase one timeout, restore others that were only timing out due to the
Java crypto lib bug to their original values.

-DOUBLE timeout for NanoSchedulerUnitTest.testNanoSchedulerInLoop()

-REDUCE timeout for EngineFeaturesIntegrationTest to its original value

-REDUCE timeout for MaxRuntimeIntegrationTest to its original value

-REDUCE timeout for GATKRunReportUnitTest to its original value
2013-03-15 14:49:21 -04:00
Mark DePristo 8317cc155e Merge pull request #108 from broadinstitute/eb_bqsr_out_of_bounds_fix
Added check in the MalformedReadFilter for reads without stored bases (i...
2013-03-14 17:29:35 -07:00
MauricioCarneiro 6f0269df2c Merge pull request #107 from broadinstitute/eb_fix_bqsr_clip_exception 2013-03-14 14:40:06 -07:00
Eric Banks 232afdcbea Added check in the MalformedReadFilter for reads without stored bases (i.e. that use '*').
* We now throw a User Error for such reads
  * User can override this to filter instead with --filter_bases_not_stored
  * Added appropriate unit test
2013-03-14 17:17:26 -04:00
droazen 0fd9f0e77c Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837
Fixed the logic of the @Output annotation and its interaction with 'required'
2013-03-14 12:52:00 -07:00
Ryan Poplin 38914384d1 Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the simplified stats for AssessNA12878. 2013-03-14 14:44:18 -04:00
Eric Banks 6d6264b108 Merge pull request #105 from broadinstitute/gg_annotations_cleanup_45802765
Cleaned up annotations
2013-03-14 11:35:00 -07:00
Geraldine Van der Auwera 61349ecefa Cleaned up annotations
- Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private
  - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental
  - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message
  - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives
  - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden
  - Removed unnecessary check in AverageAltAlleleLength
2013-03-14 14:26:48 -04:00
Eric Banks 7cab709a88 Fixed the logic of the @Output annotation and its interaction with 'required'.
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:

I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
  * The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided.
  * The logic for @Output is now:
    * if required==true then -o MUST be provided or a User Error is generated.
    * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
      * this is the default behavior (i.e. @Output with no modifiers).
    * if required==false and defaultToStdout==false then the output object is null.
      * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).

  * I have updated walkers so that previous behavior has been maintained (as best I could).
    * In general, all @Outputs with default long/short names have required=false.
    * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
  * I added unit tests for @Output changes with David's help (thanks!).
  * #resolve GSA-837
2013-03-14 11:58:51 -04:00
Eric Banks 573ed07ad0 Fixed reported bug in BQSR for RNA seq alignments with Ns.
* ClippingOp updated to incorporate Ns in the hard clips.
  * ReadUtils.getReadCoordinateForReferenceCoordinate() updated to account for Ns.
  * Added test that covers the BQSR case we saw.
  * Created GSA-856 (for Mauricio) to add lots of tests to ReadUtils.
    * It will require refactoring code and not in the scope of what I was willing to do to fix this.
2013-03-14 11:26:52 -04:00
Eric Banks ff87b62fe3 Fixed bug in SelectVariants where maxIndelSize argument wasn't getting applied to deletions.
Added unit tests and docs.
2013-03-13 15:11:34 -04:00
Mark DePristo b5b63eaac7 New GATKSAMRecord concept of a strandless read, update to FS
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads.  Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands.  This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called.  Added new GATKSAMRecord method setReducedCounts() that does the right thing.  Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation.  Differences are just minor updates to the FS
2013-03-13 11:16:36 -04:00
Mark DePristo 925846c65f Cleanup of FragmentUtils
-- Code was undocumented, big, and not well tested.  All three things fixed.
-- Currently not passing, but the framework works well for testing
-- Added concat(byte[] ... arrays) to utils
2013-03-13 07:36:20 -04:00
David Roazen 8ed78b453f Increase timeout for a test in the EngineFeaturesIntegrationTest
-This test was intermittently failing when run on the farm
2013-03-12 23:53:26 -04:00
Mark DePristo b3f67899b5 Merge pull request #101 from broadinstitute/dr_fix_failing_parallel_tests
Fix more tests that fail when run in parallel on the farm
2013-03-12 14:11:02 -07:00
David Roazen cdb1fa1105 Fix more tests that fail when run in parallel on the farm
-Allow the default S3 put timeout of 30 seconds for GATKRunReports
 to be overridden via a constructor argument, and use a timeout
 of 300 seconds for tests. The timeout remains 30 seconds in all
 other cases.

-Change integration tests that themselves dispatch farm jobs
 into pipeline tests. Necessary because some farm nodes are
 not set up as submit hosts. Pipeline tests are still run
 directly on gsa4.

-Bump up the timeout for the MaxRuntimeIntegrationTest even more
 (was still occasionally failing on the farm!)
2013-03-12 16:53:30 -04:00
Geraldine Van der Auwera f972963918 Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
2013-03-12 10:57:14 -04:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
David Roazen 3ab78543a7 Fix tests that were consistently or intermittently failing when run in parallel on the farm
-Make MaxRuntimeIntegrationTest more lenient by assuming that startup overhead
 might be as long as 120 seconds on a very slow node, rather than the original
 assumption of 20 seconds

-In TraverseActiveRegionsUnitTest, write temp bam file to the temp directory, not
 to the current working directory

-SimpleTimerUnitTest: This test was internally inconsistent. It asserted that
 a particular operation should take no more than 10 milliseconds, and then asserted
 again that this same operation should take no more than 100 microseconds (= 0.1 millisecond).
 On a slow node it could take slightly longer than 100 microseconds, however.
 Changed the test to assert that the operation should require no more than 10000 microseconds
 (= 10 milliseconds)

-change global default test timeout from 20 to 40 minutes (things just take longer
 on the farm!)

-build.xml: allow runtestonly target to work with scala test classes
2013-03-06 13:56:54 -05:00
Eric Banks 3759d9dd67 Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine.
* ReadTransformers can say they must be first, must be last, or don't care.
  * By default, none of the existing ones care about ordering except BQSR (must be first).
    * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
  * The engine now orders the read transformers up front before applying iterators.
  * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
  * Added unit tests.
2013-03-06 12:38:59 -05:00
Mark DePristo 446cd61f7e Merge pull request #84 from broadinstitute/eb_allelic_primitives
Added new walker to split MNPs into their allelic primitives (SNPs).
2013-03-06 09:02:21 -08:00
Eric Banks 78721ee09b Added new walker to split MNPs into their allelic primitives (SNPs).
* Can be extended to complex alleles at some point.
  * Currently only works for bi-allelics (documented).
  * Added unit and integration tests.
2013-03-05 23:16:42 -05:00
Mauricio Carneiro e2d41f0282 Turning @Output required to false
By default all output is assigned to stdout if a -o is not provided. Technically this makes @Output a not required parameter, and the documentation is misleading because it's reading from the annotation.
GSA-820 #resolve
2013-03-05 17:26:16 -05:00
Eric Banks 2be57fbcfb Merged bug fix from Stable into Unstable 2013-03-05 13:28:46 -05:00
Eric Banks 5e89f01e10 Don't allow the use of compressed (.gz) references in the GATK. 2013-03-05 13:28:19 -05:00
Mauricio Carneiro d0c8105387 Cleaning up hilarious exception messages
Too many users (with RNASeq reads) are hitting these exceptions that were never supposed to happen. Let's give them (and us) a better and clearer error message.
2013-03-04 16:52:22 -05:00
Mark DePristo 42d3919ca4 Expanded functionality for writing BAMs from HaplotypeCaller
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment.  Refactored out the big main and supplementary static methods.  Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode.  This test only ensures that the code runs and doesn't exception out.  It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype.  Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made.  Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
2013-03-03 12:07:29 -05:00
depristo 6204e6ccc9 Merge pull request #76 from broadinstitute/md_kb_bugfix_GSA-795
Bug fixes and optimizations for NA12878 KB
2013-03-01 10:52:16 -08:00
Eric Banks ebd5404124 Fixed the add functionality of GenomeLocSortedSet.
* Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown.
 * Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases.
 * Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs.
 * Added more unit tests for add functionality of GLSS.
 * Resolves GSA-775.
2013-02-28 23:31:00 -05:00
Mark DePristo 4095a9ef32 Bugfixes for AssessNA12878
-- Refactor initialization routine into BadSitesWriter.  This now adds the GQ and DP genotype header lines which are necessarily if the input VCF doesn't have proper headers
-- GATKVariantContextUtils subset to biallelics now tolerates samples with bad GL values for multi-allelics, where it just removes the PLs and issues a warning.
2013-02-28 10:35:06 -05:00
depristo 92d6a4f441 Merge pull request #75 from broadinstitute/eb_missing_rg_error_GSA-407
Added better error message for BAMs with bad read groups.
2013-02-28 05:20:39 -08:00
Eric Banks 12fc198b80 Added better error message for BAMs with bad read groups.
* Split the cases into reads that don't have a RG at all vs. those with a RG that's not defined in the header.
  * Added integration tests to make sure that the correct error is thrown.
  * Resolved GSA-407.
2013-02-27 16:02:56 -05:00
Eric Banks 69b8173535 Replace uses of NestedHashMap with NestedIntegerArray.
* Removed from codebase NestedHashMap since it is unused and untested.
 * Integration tests change because the BQSR CSV is now sorted automatically.
 * Resolves GSA-732
2013-02-27 14:03:39 -05:00
David Roazen 752f4335a5 Merged bug fix from Stable into Unstable 2013-02-27 05:20:41 -05:00
David Roazen 2a7af43164 Fix improper dependencies in QScripts used by pipeline tests, and attempt to fix the flawed MisencodedBaseQualityUnitTest
-Some QScripts used by public pipeline tests unnecessarily used the (now protected) UnifiedGenotyper.
 Changed them to use PrintReads instead.

-Moved ExampleUnifiedGenotyperPipelineTest to protected

-Attempt to fix the flawed and sporadically failing MisencodedBaseQualityUnitTest:

   After looking at this class a bit, I think the problem was the use of global arrays for the quals
   shared across all reads in all tests (BAMRecord class definitely does not make a separate copy for
   each read!). One test (testFixBadQuals) modifies the bad quals array, and if this happens to run
   before the testBadQualsThrowsError test the bad quals array will have been "fixed" and no exception
   will be thrown.
2013-02-27 04:45:53 -05:00
David Roazen a53b4a7521 Merged bug fix from Stable into Unstable 2013-02-26 21:41:13 -05:00
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00
depristo 93205154b5 Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776
Eb fix pairhmm unittest gsa 776
2013-02-26 11:56:58 -08:00
Mauricio Carneiro 711cbd3b5a Archiving CoverageBySample
This walker was not updated since 2009, and users were getting wrong answers when running it with ReduceReads. I don't want to deal with this because DiagnoseTargets does everything this walker does.
2013-02-26 13:49:00 -05:00
depristo 51d618de97 Merge pull request #62 from broadinstitute/rp_increase_max_kmer_in_assembly
The maximum kmer length is derived from the reads.
2013-02-26 05:37:02 -08:00
Eric Banks 7519484a38 Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods. 2013-02-25 15:04:23 -05:00
Ryan Poplin 89e2943dd1 The maximum kmer length is derived from the reads.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
2013-02-25 14:40:25 -05:00
David Roazen 3645ea9bb6 Sequence dictionary validation: detect problematic contig indexing differences
The GATK engine does not behave correctly when contigs are indexed
differently in the reads sequence dictionaries vs. the reference
sequence dictionary, and the inconsistently-indexed contigs are included
in the user's intervals. For example, given the dictionaries:

Reference dictionary = { chrM, chr1, chr2, ... }
BAM dictionary       = { chr1, chr2, ... }

and the interval "-L chr1", the engine would fail to correctly retrieve
the reads from chr1, since chr1 has a different index in the two dictionaries.

With this patch, we throw an exception if there are contig index differences
between the dictionaries for reads and reference, AND the user's intervals
include at least one of the mismatching contigs.

The user can disable this exception via -U ALLOW_SEQ_DICT_INCOMPATIBILITY

In all other cases, dictionary validation behaves as before.

I also added comprehensive unit tests for the (previously-untested)
SequenceDictionaryUtils class.

GSA-768 #resolve
2013-02-25 11:14:22 -05:00
Ryan Poplin 6a639c8ffc Replace Smith-Waterman alignment with the bubble traversal.
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
2013-02-22 15:42:16 -05:00
depristo 2ad559cf58 Merge pull request #59 from broadinstitute/mc_reving_testng_GSA-695
Updating TestNG to the latest version
2013-02-22 10:39:04 -08:00
Mauricio Carneiro 4ac50c89ad Updating TestNG to the latest version
-- changed SkipException constructors that are now private in TestNG
-- Updated build.xml to use the latest testng
-- Added guice dependency to ivy
-- Fixed broken SampleDBUnitTest

The SampleDBUnitTest was only passing before because the map comparison in the old TestNG was broken. It was comparing two DIFFERENT samples and testing for "equals"

GSA-695 #resolve
2013-02-22 09:40:23 -05:00
Mark DePristo 182c32a2b7 Relax bounds checking in QualityUtils.boundQual
-- Previous version did runtime checking that qual >= 0 but BQSR was relying on boundQual to restore -1 to 1.  So relax the bound.
2013-02-22 08:46:59 -05:00
Mark DePristo 8ac6d3521f Vast improvements to AssessNA12878 code and functionality
-- AssessNA12878 now breaks out multi-allelics into bi-allelic components.  This means that we can properly assess multi-allelic calls against the bi-allelic KB
-- Refactor AssessNA12878, moving into assess package in KB.  Split out previously private classes in the walker itself into separate classes.  Added real docs for all of the classes.
-- Vastly expand (from 0) unit tests for NA12878 assessments
-- Allow sites only VCs to be evaluated by Assessor
-- Move utility for creating simple VCs from a list of string alleles from GATKVariantContextUtilsUnitTest to GATKVariantContextUtils
-- Assessor bugfix for discordant records at a site.  Previous version didn't handle properly the case where one had a non-matching call in the callset w.r.t. the KB, so that the KB element was eaten during the analysis.  Fixed.  UnitTested
-- See GSA-781 -- Handle multi-allelic variants in KB for more information
-- Bugfix for missing site counting in AssessNA12878.  Previous version would count N misses for every missed value at a site.  Not that this has much impact but it's worth fixing
-- UnitTests for BadSitesWriter
-- UnitTests for filtered and filtering sites in the Assessor
-- Cleanup end report generation code (simply the code).  Note that instead of "indel" the new code will print out "INDELS"
-- Assessor DoC calculations now us LIBS and RBPs for the depth calculation.  The previous version was broken for reduced reads.  Added unit test that reads a complex reduced read example and matches the DoC of this BAM with the output of the GATK DoC tool here.
-- Added convenience constructor for LIBS using just SAMFileReader and an iterator.  It's now easy to create a LIBS from a BAM at a locus.  Added advanceToLocus function that moves the LIBS to a specific position.  UnitTested via the assessor (which isn't ideal, but is a proper test)
2013-02-21 20:43:12 -05:00
Mark DePristo 29319bf222 Improved allele trimming code in GATKVariantContextUtils
-- Now supports trimming the alleles from both the reverse and forward direction.
-- Added lots of unit tests for forwrad allele trimming, as well as creating VC from forward and reverse trimming.
-- Added docs and tests for the code, to bring it up to GATK spec
2013-02-21 12:01:43 -05:00
Eric Banks 6996a953a8 Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample).
These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons):
* Don't unnecessarily overload Allele.getBases() in the Haplotype class.
  * Haplotype.getBases() was calling clone() on the byte array.
* Added a constructor to Allele (and Haplotype) that takes in an Allele as input.
  * It makes a copy of he given allele without having to go through the validation of the bases (since the Allele has already been validated).
  * Rev'ed the variant jar accordingly.

For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.
2013-02-21 10:14:11 -05:00
Geraldine Van der Auwera c3e01fea40 Added several more info types / annotations to GATKDocs
-- top-level walker type (locus, read etc)
-- parallelism options (nt or nct)
-- annotation type (for Variant Annotations)
-- downsampling settings that override engine defaults
-- reference window size
-- active region settings
-- partitionBy info
2013-02-21 03:12:40 -05:00
Geraldine Van der Auwera e674b4a524 Added new ReadFilter that allows users to specifically reassign one single mapping quality to a different value. Useful for TopHat and other RNA-seq software users. 2013-02-20 01:24:45 -05:00
MauricioCarneiro 76810465aa Merge pull request #40 from broadinstitute/gg_retrieve_readfilters_GSATDG-63 2013-02-19 19:42:35 -08:00
Mark DePristo 910d966428 Extend timeout of NanoScheduler deadlock tests
-- The previous timeout of 1 second was just dangerously short.  Increase the timeout to 10 seconds
2013-02-19 20:25:25 -05:00
Eric Banks 0055a6f1cd Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774
Fix to the Indel Realigner bug described in GSA-774
2013-02-19 16:16:48 -08:00
Geraldine Van der Auwera faef85841b Added GATKDocs fct to indicate default Read Filters for each tool
-- Added getClazzAnnotations() as hub to retrieve various annotations values and class properties through reflection
-- Added getReadFilters() method to retrieve Read Filter annotations
-- getReadFilters() uses recursion to walk up the inheritance to also capture superclass annotations
-- getClazzAnnotations() stores collected info in doc handler root, which is unit.forTemplate in Doclet
-- Modified FreeMarker template to use the Readfilters info (displayed after arg table, before additional capabilities)
-- Tadaaa :-) #GSATDG-63 resolve
2013-02-19 16:12:29 -05:00
Mauricio Carneiro 371ea2f24c Fixed IndelRealigner reference length bug (GSA-774)
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads

GSA-774 #resolve
2013-02-19 16:00:36 -05:00
Mauricio Carneiro 815028edd4 Added verbose error message to the PluginManager
-- added a logger.error with a more descriptive message of what the most likely cause of the error is

Typical error happens when a walker's global variable is not initialized properly (usually in test conditions). The old error message was very hard to understand "Could not create module because of an exception of type NullPointerException ocurred caused by exception null"
2013-02-19 16:00:35 -05:00
Ryan Poplin c025e84c8b Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping.
-- Updated a few integration tests for HC, UG, and UG general ploidy
2013-02-19 14:09:24 -05:00
Mark DePristo be45edeff2 ActivityProfile and ActiveRegions respects engine interval boundaries
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet.  Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
2013-02-18 10:40:25 -05:00
Mark DePristo 3b67aa8aee Final edge case bug fixes to QualityUtil routines
-- log10 functions in QualityUtils allow -Infinity to allow log10(0.0) values
-- Fix edge condition of log10OneMinusX failing with Double.MIN_VALUE
-- Fix another edge condition of log10OneMinusX failing with a small but not min_value double
2013-02-16 07:31:38 -08:00
Mark DePristo b393c27f07 QualityUtils now uses runtime argument checks instead of contract
-- There's some runtime cost for these tests, but it's not big enough to outweigh the value of catching errors quickly
2013-02-16 07:31:38 -08:00
Mark DePristo 9a29d6d4be Fix an catastrophic bug (WoW!) in the reference calculation of the UG
-- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and was as a side effect other utils converted this to a meaningless 0.0.  This is all because there wasn't a unit test.
-- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.
2013-02-16 07:31:38 -08:00
Mark DePristo 9e28d1e347 Cleanup and unit tests for QualityUtils
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value.  Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual.  Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils.  Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
2013-02-16 07:31:37 -08:00
Yossi Farjoun aa99a5f47c Added an option to print out the version string
@argument (-)-version
(should this be @hidden?)

Prints out the version to System.out and quit(0)
No tests. (any ideas on how to test this would be happily accepted)
2013-02-15 12:42:59 -05:00
droazen 664960373d Merge pull request #31 from broadinstitute/yf_fast_BAM_index_traversal
-re-enables fast BAM indexing
2013-02-15 09:12:32 -08:00
MauricioCarneiro 1dd284a5bb Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720
PrintReads writes a header when used with -BQSR
2013-02-15 07:18:28 -08:00
MauricioCarneiro b58a0eca6b Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62
Refactored GATKDocs categories some more ( GSATDG-62 )
2013-02-14 22:35:07 -08:00
Tad Jordan 6cb80591e3 PrintReads writes a header when used with -BQSR 2013-02-14 22:19:14 -05:00
Yossi Farjoun 3a7c8c13e2 Re-enabled fastBAMindexing by replacing the FileChannel with a SeekableBufferedStream
This helps a lot since FileChannel is very low-level and traversing the BAMIndex involves lots of short reads.

- Fixed a deterioration in BAMIndex due to rev'ed picard (see below)
- Added unit tests for SeekableBufferedStream
- Added integrationTests for GATKBAMIndex (in PileupWalkerIntegrationTest)
- Added a runtime-test to verify that the amount read equals the amount requested.
- Added failing tests with expectedExceptions
- Used a DataProvider to make code nicer
2013-02-14 17:51:15 -05:00
Mark DePristo f92328a1a1 Extend default timeout to 20 minutes
-- The default of 10 minutes is right on the edge for some tests, and we really want a default not to enforce a max time (test should be short) but to stop testng from failing to terminate ever in the case where some test is truly hung
2013-02-13 17:43:40 -08:00
Geraldine Van der Auwera 6208742f7c Refactored GATKDocs categories some more ( GSATDG-62 )
-- Renamed ValidatePileup to CheckPileup since validation is reserved word
-- Renamed AlignmentValidation to CheckAlignment (same as above)
-- Refactored category definitions to use constants defined in HelpConstants
-- Fixed a couple of minor typos and an example error
-- Reorganized the GATKDocs index template to use supercategories
-- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)
2013-02-13 16:49:18 -05:00
Guillermo del Angel 4308b27f8c Fixed non-determinism in HaplotypeCaller and some UG calls -
-- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads.
-- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast)
-- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller  (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling).
-- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test
-- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).
2013-02-12 15:43:29 -05:00
Geraldine Van der Auwera dff5ef562b Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details)
-- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools
-- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities
-- Reworded some category names and descriptions to be more explicit and user-friendly
2013-02-12 13:36:15 -05:00
Mark DePristo e40d83f00e Final version of PairHMMs with correct edge conditions
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses.  This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length.  I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10.  This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM.  All unit tests from the original HMM implementations pass now
-- Added locs of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes.  Fixed bug.  Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit.  Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum.  This involved moving some initialize() code into the computeLikelihoods function.  That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
2013-02-09 19:19:22 -05:00
Mark DePristo 09595cdeb9 Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments 2013-02-09 13:06:54 -05:00
Mark DePristo 2d802e17a4 Delete the CachingPairHMM 2013-02-09 13:06:54 -05:00
Mark DePristo 7dcafe8b81 Preliminary version of LoglessCachingPairHMM that avoids positive likelihoods
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
2013-02-09 13:06:54 -05:00
Mark DePristo ca76de0619 Move ProcessUtilsUnitTest to private 2013-02-09 12:34:45 -05:00
MauricioCarneiro f5e52b72ea Merge pull request #23 from broadinstitute/md_process_utils_unit_tests
UnitTests for ProcessUtils
2013-02-09 09:27:31 -08:00
MauricioCarneiro 3ff10ab277 Merge pull request #24 from broadinstitute/md_ngsplatform_unittests
Expand NGSPlatform to meet SAM 1.4 spec, with full unit tests
2013-02-09 09:27:03 -08:00
Mark DePristo b127fc6a1a Expand NGSPlatform to meet SAM 1.4 spec, with full unit tests
-- Added CAPILLARY and HELICOS platforms as required by spec 1.4
-- Added extensive unit tests to ensure NGSPlatform functions work as expected.
-- Fixed some NPE bugs for reads that don't have RGs or PLs in their RG fields
2013-02-09 11:16:21 -05:00
Mark DePristo fc3307a97f UnitTests for ProcessUtils 2013-02-09 10:13:01 -05:00
Mark DePristo 7fb620dce7 Generalize and fixup ContigComparator
-- Now uses a SAMSequenceDictionary to do the comparison of contigs (which is the right way to do it)
-- Added unit tests
2013-02-09 09:52:13 -05:00
Mark DePristo a3dc7dc5cb Extend AWS timeout for uploads of the GATK run reports to 30 seconds 2013-02-08 17:37:36 -05:00
Mauricio Carneiro 5f49c95cc1 Added distance across contigs calculation to GenomeLocs
-- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader)
-- unit test added
GSATDG-45
2013-02-07 16:31:41 -05:00
Eric Banks 9826192854 Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class!
* AlignmentUtils.getMismatchCount()
* AlignmentUtils.calcAlignmentByteArrayOffset()
* AlignmentUtils.readToAlignmentByteArray().
* AlignmentUtils.leftAlignIndel()
2013-02-07 13:04:24 -05:00
eitanbanks 584899329c Merge pull request #13 from broadinstitute/dr_variant_migration_GSA-692
Replace org.broadinstitute.variant with jar built from the Picard repo
2013-02-06 07:22:30 -08:00
Eric Banks 562f2406d7 Added check that BaseRecalibrator is not being run on a reduced bam.
- Throws user exception if it is.
 - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
 - Added test to check this is working.
 - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
 - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
 - Added comment about not changing the program record name, as per reviewer comments.
 - Removed unused variable.
2013-02-06 10:14:27 -05:00
Eric Banks 4e5ff3d6f1 Bug fix for NPE in HC with --dbsnp argument.
- I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null.
 - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.
2013-02-05 21:59:19 -05:00
David Roazen e7e76ed76e Replace org.broadinstitute.variant with jar built from the Picard repo
The migration of org.broadinstitute.variant into the Picard repo is
complete. This commit deletes the org.broadinstitute.variant sources
from our repo and replaces it with a jar built from a checkout of the
latest Picard-public svn revision.
2013-02-05 17:24:25 -05:00
Mauricio Carneiro f6bc5be6b4 Fixing license on Yossi's file
Somebody needs to set up the license hook ;-)
2013-02-05 11:14:43 -05:00
MauricioCarneiro 050c4794a5 Merge pull request #11 from yfarjoun/per_sample2
-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @A...
2013-02-05 08:04:29 -08:00
Eric Banks 00c98ff0cf Need to reset the static counter before tests are run or else we won't be deterministic.
Also need to give credit where credit is due: David was right that this was not a non-deterministic Bamboo failure...
2013-02-05 10:41:46 -05:00
Yossi Farjoun de03f17be4 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixure of two broadly concented samples) were added
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is
   _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests to the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or unifiedGenotyper, into a new IntegrationTest class.
2013-02-04 18:24:36 -05:00
Mark DePristo a281fa6548 Resolves Genome Sequence Analysis GSA-750 Don't print an endless series of starting messages from the ProgressMeter
-- The progress meter isn't started until the GATK actually calls execute on the microscheduler.  Now we get a message saying "Creating shard strategy" while this (expensive) operation runs
2013-02-04 15:47:30 -05:00
Chris Hartl 3c99010be4 Part 1 of Variant Annotator Unit tests: PerReadAlleleLikelihoodMap
- Added contract enforcement for public methods
 - Refactored the conversion from read -> (allele -> likelihood) to allele -> list[read] into its own method
 - added method documentation for non getters/setters
 - finals, finals everywhere
 - Add in a unit test for the PerReadAlleleLikelihoodMap. Complete coverage except for .clear() and a method that is a straight call into a separately-tested utility class.
2013-02-04 14:16:06 -05:00
Guillermo del Angel 5521bf3dd7 Fix bad contract implementation 2013-02-03 16:15:14 -05:00
Guillermo del Angel f31bf37a6f First step in better BQSR unit tests for covariates (not done yet): more test coverage in basic covariates, test logging several read groups/read lengths and more combinations simultaneously.
Add basic Javadocs headers for PerReadAlleleLikehoodMap.
2013-02-03 15:31:30 -05:00
Mark DePristo 8d08780582 GATKRunReport now tracks the errorMessage and errorThrown during post for later analysis
-- This is primarily useful in the unit tests, as I now print out additional information on why a test might have failed, if it in fact did.
2013-02-02 19:24:31 -05:00
Mark DePristo 6382d5bdc9 Final cleanup and unit testing for GATKRunReport
-- Bringing code up to document, style, and code coverage specs
-- Move GATKRunReportUnitTest to private
-- Fully expand GATKRunReportUnitTests to coverage writing and reading GATKRunReport to local disk, to standard out, to AWS.
-- Move documentation URL from GATKRunReport to UserException
-- Delete a few unused files from s3GATKReport
-- Added capabilities to GATKRunReport to make testing easier
-- Added capabilities to deserialize GATKRunReports from an InputStream
2013-02-02 15:06:56 -05:00
Mark DePristo eb17230c2f Update AWS access and private keys to the new GATK2LogUploader user
-- Updated EncryptAWSKeys to write the key into the correct resources directory
2013-02-02 15:06:56 -05:00
Eric Banks 03df5e6ee6 - Added more comprehensive tests for consensus creation to RR. Still need to add tests for I/D ops.
- Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure).
- Also added minor cleanup of BaseUtils
2013-02-01 15:37:19 -05:00
David Roazen c6581e4953 Update MD5s to reflect version number change in the BAM header
I've confirmed via a script that all of these differences only
involve the version number bump in the BAM headers and nothing
else:

< @HD   VN:1.0  GO:none SO:coordinate
---
> @HD   VN:1.4  GO:none SO:coordinate
2013-02-01 13:51:31 -05:00
David Roazen c4b0ba4d45 Temporarily back out the Picard team's patches to GATKBAMIndex from December
These patches to GATKBAMIndex are causing massive BAM index reading errors in
combination with the latest version of Picard. The bug is either in the patches
themselves or in the underlying SeekableBufferedStream class they rely on. Until
the cause can be identified, we are temporarily backing out these changes so that
we can continue to run with the latest Picard/Tribble.

This reverts commits:
81483ec21e528790dfa719d18cdee27d577ca98e
68cf0309db490b79eecdabb4034987ff825ffea8
54bb68f28ad5fe1b3df01702e9c5e108106a0176
2013-02-01 13:51:31 -05:00
David Roazen 1fb182d951 Restore Utils.appendArray()
This utility method was used by the PipelineTest class, and deleting it
was causing tests to not compile.
2013-02-01 13:51:31 -05:00
Mark DePristo 6d9816f1a5 Cleanup unused utils functions, and add unit test for one (append) 2013-02-01 13:51:31 -05:00
Mark DePristo 206eab80e3 Expanded unit tests for AlignmentUtils
-- Added JIRA entries for the remaining capabilities to be fixed up and unit tested
2013-02-01 13:51:31 -05:00
David Roazen 292037dfda Rev picard, sam-jdk, and tribble
This is a necessary prerequisite for the org.broadinstitute.variant migration.

-Picard and sam-jdk go from version 1.67.1197 to 1.84.1337

-Picard-private goes from version 2375 to 2662

-Tribble goes from version 119 to 1.84.1337

-RADICALLY trimmed down the list of classes we extract from Picard-private
 (jar goes from 326993 bytes to 6445 bytes!)
2013-02-01 13:51:30 -05:00
Ryan Poplin e07cefb058 Updating AlignmentUtils.consolidateCigar() to the GATK coding standards. 2013-02-01 13:51:30 -05:00
Mark DePristo c3c4e2785b UnitTest for calcNumHighQualityBases in AlignmentUtils 2013-01-31 13:57:23 -05:00
David Roazen 6ec1e613a2 Move AWS keys to a resources subdirectory within the phonehome package
Resources must be in a subdirectory called "resources" in the package
hierarchy to be picked up by the packaging system. Adding each resource
manually to the jars in build.xml does not cause the resource to be
added to the standalone GATK jar when we package the GATK, so it's best
to always use this convention.
2013-01-31 11:56:34 -05:00
Ryan Poplin 496727ac5e Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-31 11:51:08 -05:00
Ryan Poplin ac033ce41a Intermediate commit of new bubble assembly graph traversal algorithm for the HaplotypeCaller. Adding functionality for a path from an assembly graph to calculate its own cigar string from each of the bubbles instead of doing a massive Smith-Waterman alignment between the path's full base composition and the reference. 2013-01-31 11:32:19 -05:00
Eric Banks 9c0207f8ef Fixing BQSR/BAQ bug:
If a read had an existing BAQ tag, was clipped by our engine, and couldn't have the BAQ recalculated (for whatever reason), then we would
fail in the BQSR because we would default to using the old tag (which no longer matched the length of the read bases).
The right thing to do here is to remove the old BAQ tag when RECALCULATE and ADD_TAG are the BAQ modes used but BAQ cannot be recalculated.
Added a unit test to ensure that the tags are removed in such a case.
2013-01-31 11:03:17 -05:00
Mark DePristo 404ee9a6e4 More aggressive checking of AWS key quality upon startup in the GATK 2013-01-31 09:08:38 -05:00
Ryan Poplin 438c98035b Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-30 17:12:28 -05:00
Ryan Poplin bb29bd7df7 Use base List and Map types in the HaplotypeCaller when possible. 2013-01-30 17:09:27 -05:00
Mark DePristo b707331332 Encrypt GATK AWS keys using the GATK private key, and decrypt as needed as a resource when uploading to AWS logs
-- Has the overall effect that the GATK user AWS keys are no longer visible in the gatk source as plain text.  This will stop AWS from emailing me (they crawl the web looking for keys)
-- Added utility EncryptAWSKeys that takes as command line arguments the GATK user AWS access and secret keys, encrypts them with the GATK private key, and writes out the resulting file to resources in phonehome.
-- GATKRunReport now decrypts as needed these keys using the GATK public key as resources in the GATK bundle
-- Refactored the essential function of Resource (reading the resource) from IOUtils into the class itself.  Now how to get the data in the resouce is straightforward
-- Refactored md5 calculation code from a byte[] into Utils.  Added unit tests
-- Committing the encrypted AWS keys
-- #resolves https://jira.broadinstitute.org/browse/GSA-730
2013-01-30 16:42:23 -05:00
David Roazen 591df2be44 Move additional VariantContext utility methods back to the GATK
Thanks to Eric for his feedback
2013-01-30 13:58:17 -05:00
David Roazen 9985f82a7a Move BaseUtils back to the GATK by request, along with associated utility methods 2013-01-30 13:09:44 -05:00
Mark DePristo 1ff78679ca UnitTesting example for copying
-- Example combinatorial unit tests, plus unit tests that create reads and bam files, pileups, variant context (from scratch and from a file), and genome locs
2013-01-30 11:19:08 -05:00
Eric Banks d067c7f136 Resolving merge conflicts 2013-01-30 10:47:59 -05:00
Eric Banks 9025567cb8 Refactoring the SimpleGenomeLoc into the now public utility UnvalidatingGenomeLoc and the RR-specific FinishedGenomeLoc.
Moved the merging utility methods into GenomeLoc and moved the unit tests around accordingly.
2013-01-30 10:45:29 -05:00
Mark DePristo 4852c7404e GenomeLocs are already comparable, so I'm removing the less complete GenomeLocComparator class and updating ReduceReads and CompressionStash to use built-in comparator 2013-01-30 10:12:27 -05:00
Mark DePristo 45603f58cd Refactoring and unit testing GenomeLocParser
-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts.  The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit
2013-01-30 09:47:47 -05:00
Mark DePristo 8562bfaae1 Optimize GenomeLocParser.createGenomeLoc
-- The new version is roughly 2x faster than the previous version.  The key here was to cleanup the workflow for validateGenomeLoc and remove the now unnecessary synchronization blocks from the CachingSequencingDictionary, since these are now thread local variables
-- #resolves https://jira.broadinstitute.org/browse/GSA-724
2013-01-30 09:47:47 -05:00
Mark DePristo 69dd5cc902 AutoFormattingTimeUnitTest should be in utils 2013-01-30 09:47:47 -05:00
Mark DePristo 92c5635e19 Cleanup, document, and unit test ActiveRegion
-- All functions tested.  In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads.  New version should be correct
-- Enforce correct ordering of supporting states in constructor
-- Enforce read ordering when adding reads to an active region in add
-- Fix bug in HaplotypeCaller map with new updating read spans.  Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence
-- Encapsulate isActive field with an accessor function
-- Make sure that all state lists are unmodifiable, and that the docs are clear about this
-- ActiveRegion equalsExceptReads is for testing only, so make it package protected
-- ActiveRegion.hardClipToRegion must resort reads as they can become out of order
-- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region.  The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed.  Has a minor impact on PLs and RankSumTest values
-- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy
2013-01-30 09:47:12 -05:00
David Roazen 6449c320b4 Fix the CachingIndexedFastaSequenceFileUnitTest
BaseUtils.convertIUPACtoN() no longer throws a UserException,
since it's in org.broadinstitute.variant
2013-01-29 21:07:16 -05:00
Mauricio Carneiro 29fd536c28 Updating licenses manually
Please check that your commit hook is properly pointing at ../../private/shell/pre-commit

Conflicts:
	public/java/test/org/broadinstitute/variant/VariantBaseTest.java
2013-01-29 17:27:53 -05:00
David Roazen a536e1da84 Move some VCF/VariantContext methods back to the GATK based on feedback
-Moved some of the more specialized / complex VariantContext and VCF utility
 methods back to the GATK.

-Due to this re-shuffling, was able to return things like the Pair class back
 to the GATK as well.
2013-01-29 16:56:55 -05:00
Ami Levy-Moonshine a1908a0eca Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-29 16:33:20 -05:00
Ami Levy-Moonshine 4aaef495c6 correct the help message 2013-01-29 16:33:12 -05:00
Ryan Poplin bf25196a0b Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-28 22:33:13 -05:00
Ryan Poplin e9c3a0acdf fix typo 2013-01-28 22:18:58 -05:00
Ami Levy-Moonshine a8a68697f1 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-28 20:18:51 -05:00
Guillermo del Angel 5995f01a01 Big intermediate commit (mostly so that I don't have to go again through merge/rebase hell) in expanding BQSR capabilities. Far from done yet:
a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
2013-01-28 19:55:46 -05:00
Ami Levy-Moonshine 3f5c2e4989 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-28 19:04:52 -05:00
Ami Levy-Moonshine c103623cf6 bug fix in my new function at SampleUtils.java 2013-01-28 19:04:39 -05:00
Ryan Poplin d665a8ba0c The Bayesian calculation of Qemp in the BQSR is now hierarchical. This fixes issues in which the covariate bins were very sparse and the prior estimate being used was the original quality score. This resulted in large correction factors for each covariate which breaks the equation. There is also now a new option, qlobalQScorePrior, which can be used to ignore the given (very high) quality scores and instead use this value as the prior. 2013-01-28 15:56:33 -05:00
Tad Jordan 8777e02aa5 R issue in Queue fixed.
GSA-721
2013-01-28 14:42:20 -05:00
David Roazen f63f27aa13 org.broadinstitute.variant refactor, part 2
-removed sting dependencies from test classes
-removed org.apache.log4j dependency
-misc cleanup
2013-01-28 09:03:46 -05:00
David Roazen 3744d1a596 Collapse the downsampling fork in the GATK engine
With LegacyLocusIteratorByState deleted, the legacy downsampling implementation
was already non-functional. This commit removes all remaining code in the
engine belonging to the legacy implementation.
2013-01-28 01:50:30 -05:00
Mark DePristo 63913d516f Add join call to Progress meter unit test so we actually know the daemon thread has finished 2013-01-27 16:52:45 -05:00
Mark DePristo 14d8afe413 Remove startSearchAt state variable from ActivityProfile
-- New algorithm will only try to create an active region if there's at least maxREgionSize + propagation distance states in the list.  When that's true, we are guaranteed to actually find a region.  So this algorithm is not only truly correct but as super fast, as we only ever do the search for the end of the region when we will certainly find one, and actually generate a region.
2013-01-27 14:10:08 -05:00
Mark DePristo c97a361b5d Added realistic BandPassFilterUnitTest that ensures quality results for 1000G phase I VCF and NA12878 VCF
-- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions.  This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined.  This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile
-- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with
2013-01-27 14:10:08 -05:00
Mark DePristo 72b2e77eed Linearize the findEndOfRegion algorithm in ActivityProfile, radically improving its performance
-- Previous algorithm was O(N^2)
-- #resolve GSA-723 https://jira.broadinstitute.org/browse/GSA-723
2013-01-27 14:10:06 -05:00
Mark DePristo 0fb238b61e TraverseActiveRegions Optimizations and Bugfixes: make sure to record position of current locus to discharge active regions when there's no data
-- Now records the position of the current locus, as well as that of the last read.  Necessary when passing through regions with no reads.  The previous version would keep accumulating empty active regions, and never discharge them until end of traversal (if there was no reads in the future) or until a read was finally found
-- Protected a call to logger.debug with if ( logger.isDebugEnabled()) to avoid a lot of overhead in writing unseen debugger logging information
2013-01-27 14:10:06 -05:00
Mark DePristo 93d88cdc68 Optimization: LocusReferenceView now passes along the contig index to createGenomeLoc, speeding up their creation
-- Also cleaned up some unused methods
2013-01-27 14:10:06 -05:00
Mark DePristo 52a28968a9 ART optimization: BandPassActivityProfile only applies the gaussian filter if the state probability > 0 2013-01-27 14:10:06 -05:00
Mauricio Carneiro 705cccaf63 Making SplitReads output FastQ's instead of BAM
- eliminates one step in my pipeline
   - BAM is too finicky and maintaining parameters that wouldn't be useful was becoming a headache, better avoided.
2013-01-27 02:36:31 -05:00
Mauricio Carneiro 6ea7133d95 Updating licenses of latest moved files 2013-01-26 13:46:52 -05:00
Ami Levy-Moonshine 99cb8d68e9 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-25 16:07:38 -05:00
Mark DePristo b8c0b05785 Add contract to ensure that getAdapterBoundary returns the right result
-- Also renamed the function to getAdaptorBoundary for consistency across the codebase
2013-01-25 16:05:17 -05:00
Mark DePristo e445c71161 LIBS optimization for adapter clipping
-- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS
-- Added unittests to cover adapter clipping
2013-01-25 16:05:17 -05:00
Ami Levy-Moonshine b4447cdca2 In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the samples names are unique in VariantContextUtils.simpleMerge for each VCs. It couse to a bug that was reported on the forum (when a VCs had 2 VC from the same sample).
Now we will check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unitTests in CVunitTest.java.
2013-01-25 15:49:51 -05:00
Ami Levy-Moonshine fc22a5c71c Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-25 11:47:38 -05:00
Ami Levy-Moonshine eaf6279d48 adding RBP to the general calling pipeline and few other small changes to it (to make it run with the current bundel file names 2013-01-25 11:47:30 -05:00
Mark DePristo 008b617577 Cleanup the getLIBS function in LocusIterator
-- Now throws an UnsupportedOperationException in the base class.  Only LocusView implements this function and actually returns the LIBS
2013-01-25 11:07:28 -05:00
Eric Banks 6dd0e1ddd6 Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants.
This allows us to move SelectVariants into the public suite of tools now.
2013-01-25 09:42:04 -05:00
Mark DePristo c7a29b1d39 Fixed NPE in ActiveRegionUnitTest by allowing null supporting states in ActiveRegion 2013-01-24 13:48:00 -05:00
Mark DePristo 592f90aaef ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events.  The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
2013-01-24 13:48:00 -05:00
Mark DePristo c96b64973a Soft clip probability propagation is capped by the MAX_PROB_PROPAGATION_DISTANCE, which is 50 bp 2013-01-24 13:48:00 -05:00
Mark DePristo 0c94e3d96e Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp
-- Previously we allowed band pass filter size to be specified along with the sigma.  But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel.  This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
2013-01-24 13:47:59 -05:00
Mark DePristo 9e43a2028d Making band pass filter size, sigma, active region max size and extension all accessible from the command line 2013-01-24 13:47:59 -05:00
Mark DePristo cd91e365f4 Optimize getCurrentContigLength and getLocForOffset in ActivityProfile 2013-01-24 13:47:59 -05:00
Eric Banks 6790e103e0 Moving lots of walkers back from protected to public (along with several of the VA annotations).
Let's see whether Mauricio's automatic git hook really works!
2013-01-24 11:42:49 -05:00
Mark DePristo ee8039bf25 Fix trivial call in unit test 2013-01-23 13:51:58 -05:00
Mark DePristo 09edc6baeb TraverseActiveRegions now writes out very nice active region and activity profile IGV formatted files 2013-01-23 13:46:01 -05:00
Mark DePristo 8e8126506b Renaming IncrementalActivityProfile to ActivityProfile
-- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV
2013-01-23 13:46:01 -05:00
Mark DePristo e917f56df8 Remove old ActivityProfile and old BandPassActivityProfile 2013-01-23 13:46:01 -05:00
Mark DePristo 7fd27a5167 Add band pass filtering activity profile
-- Based on the new incremental activity profile
-- Unit Tested!  Fixed a few bugs with the old band pass filter
-- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties
-- Add new UnitTest for BandPassIncrementalActivityProfile
-- Added normalizeFromRealSpace to MathUtils
-- Cleanup unused code in new activity profiles
2013-01-23 13:46:01 -05:00
Mark DePristo eb60235dcd Working version of incremental active region traversals
-- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version.  This means that ART walkers will now take much less memory than previously.  On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory.  Over the whole chr20 only 5K reads were ever held in ART at one time.
-- Fixed bug in the way active regions worked with shard boundaries.  The new implementation no longer see shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries.  This behavior was actually encoded in the unit tests, so those needed to be updated as well.
-- Changed the way that preset regions work in ART.  The new contract ensures that you get exactly the regions you requested.  the isActive function is still called, but its result has no impact on the regions.  With this functionality is should be possible to use the HC as a generic assembly by forcing it to operate over very large regions
-- Added a few misc. useful functions to IncrementalActivityProfile
2013-01-23 13:46:00 -05:00
Mark DePristo ce160931d5 Optimize creation of reads in ArtificialBAMBuilder
-- Now caches the reads so subsequent calls to makeReads() don't reallocate the reads from scratch each time
2013-01-23 13:46:00 -05:00
Mark DePristo e050f649fd IncrementalActivityProfile, complete with extensive unit tests
-- This is an activity profile compatible with fetching its implied active regions incrementally, as activity profile states are added
2013-01-23 13:45:21 -05:00
Mark DePristo 8d9b0f1bd5 Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile
-- Required before I jump in an redo the entire activity profile so it's can be run imcrementally
-- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class.
-- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality.  Almost all of the misc. walker changes are due to this name update
-- Code cleanup and docs for TraverseActiveRegions
-- Expanded unit tests for ActivityProfile and ActivityProfileState
2013-01-23 13:45:21 -05:00
Mark DePristo 42b807a5fe Unit tests for ActivityProfileResult 2013-01-23 13:45:20 -05:00
Mauricio Carneiro 7b8b064165 Last manual license update (hopefully)
if everyone updates their git hook accordingly, this will be the last time I have to manually run the script.

GSATDG-5
2013-01-18 16:13:07 -05:00
Ami Levy-Moonshine 0fb7b73107 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-18 15:03:42 -05:00
Ami Levy-Moonshine 826c29827b change the default VCFs gatherer of the GATK (not just the UG) 2013-01-18 15:03:12 -05:00
Eric Banks 6a903f2c23 I finally gave up on trying to get the Haplotype/Allele merging to work in the HaplotypeCaller.
I've resigned myself instead to create a mapping from Allele to Haplotype.  It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
2013-01-18 01:21:08 -05:00
Eric Banks ded659232b Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 22:49:56 -05:00
Eric Banks a623cca89a Bug fix for HaplotypeCaller, as reported on the forum: when reduced reads didn't completely overlap a deletion call,
we were incorrectly trying to find the reference position of a base on the read that didn't exist.
Added integration test to cover this case.
2013-01-16 22:47:58 -05:00
Mark DePristo 738c24a3b1 Add tests to ensure that all insertion reads appear in the active region traversal 2013-01-16 16:25:36 -05:00
Eric Banks 79bc818022 Bug fix for VariantsToVCF: old dbSNP files can have '-' as reference base and those records always need to be padded. 2013-01-16 16:15:58 -05:00
Mark DePristo 2a42b47e4a Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
2013-01-16 15:30:00 -05:00