Commit Graph

12145 Commits (2eac97a76c74d1aa7e8a17ea590082ba23621377)

Author SHA1 Message Date
David Roazen 2eac97a76c Remove auto-creation of fai/dict files for fasta references
-A UserException is now thrown if either the fai or dict file for the
 reference does not exist, with pointers to instructions for creating
 these files.

-Gets rid of problematic file locking that was causing intermittent
 errors on our farm.

-Integration tests to verify that correct exceptions are thrown in
 the case of a missing fai / dict file.

GSA-866 #resolve
2013-04-02 18:34:08 -04:00
Mark DePristo e7a8e6e8ee Merge pull request #140 from broadinstitute/dr_interval_intersection_bug_GSA-909
Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists
2013-04-02 11:59:01 -07:00
David Roazen b4b58a3968 Fix unprintable character in a comment from the BaseEdge class
Compiler warnings about this were starting to get to me...
2013-04-02 14:24:23 -04:00
David Roazen 5baf906c28 Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists
-The algorithm for finding the intersection of two sets of intervals
 relies on the sortedness of the intervals within each set, but the engine
 was not sorting the intervals before attempting to find the intersection.

-The result was that if one or both interval lists were unsorted / lexicographically
 sorted, we would often fail to find the intersection correctly.

-Now the IntervalBinding sorts all sets of intervals before returning them,
 solving the problem.

-Added an integration test for this case.

GSA-909 #resolve
2013-04-02 14:01:52 -04:00
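The sort-then-sweep fix described in this commit can be sketched as follows. This is a simplified standalone illustration under assumed names (real GATK intervals are GenomeLocs with contigs, not bare integer pairs, and the sorting happens in IntervalBinding):

```java
import java.util.*;

/** Simplified sketch: intersecting two interval lists requires both to be sorted first. */
public class IntervalIntersection {
    public record Interval(int start, int stop) {}

    /** Sort-then-sweep intersection; assumes intervals within each list do not overlap each other. */
    public static List<Interval> intersect(List<Interval> a, List<Interval> b) {
        List<Interval> sortedA = new ArrayList<>(a);
        List<Interval> sortedB = new ArrayList<>(b);
        // The bug described above: skipping these sorts makes the sweep below miss overlaps
        sortedA.sort(Comparator.comparingInt(Interval::start));
        sortedB.sort(Comparator.comparingInt(Interval::start));

        List<Interval> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < sortedA.size() && j < sortedB.size()) {
            Interval x = sortedA.get(i), y = sortedB.get(j);
            int start = Math.max(x.start(), y.start());
            int stop = Math.min(x.stop(), y.stop());
            if (start <= stop) result.add(new Interval(start, stop));
            // advance whichever interval ends first
            if (x.stop() < y.stop()) i++; else j++;
        }
        return result;
    }
}
```

With an unsorted input like [(10,20), (1,5)], removing the two sort calls makes the sweep terminate early and drop intersections, which is exactly the failure mode the commit fixes.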
Ryan Poplin d412605c9b Merge pull request #139 from broadinstitute/md_bugfix_suffix_splitter
Critical bugfix for CommonSuffixSplitter
2013-04-02 07:11:45 -07:00
Mark DePristo c191d7de8c Critical bugfix for CommonSuffixSplitter
-- Graphs with cycles from the bottom node to one of the middle nodes would send the algorithm into an infinite loop. Created a unit test that reproduced the issue, and then fixed the underlying bug.
2013-04-02 09:22:33 -04:00
Ryan Poplin a58a3e7e1e Merge pull request #134 from broadinstitute/mc_phmm_experiments
PairHMM rework
2013-04-01 12:10:43 -07:00
Mark DePristo 791c386972 Merge pull request #138 from broadinstitute/rp_hc_gga_alleles_rod_GSA-908
Two changes to HC GGA mode to make it more like the UG.
2013-04-01 08:28:23 -07:00
Ryan Poplin f65206e758 Two changes to HC GGA mode to make it more like the UG.
-- Only try to genotype PASSing records in the alleles file
-- Don't attempt to genotype multiple records with the same start location. Instead take the first record and emit a warning message.
2013-04-01 10:20:23 -04:00
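The two changes above can be sketched as a simple preprocessing filter. Record is a hypothetical stand-in for a VariantContext; this is not the walker's actual code:

```java
import java.util.*;

/** Sketch of the GGA alleles-file preprocessing described above: keep only PASSing
 *  records, and only the first record at each start position. */
public class GgaAlleleSelector {
    public record Record(int start, boolean passing) {}

    public static List<Record> select(List<Record> records) {
        List<Record> kept = new ArrayList<>();
        Set<Integer> seenStarts = new HashSet<>();
        for (Record r : records) {
            if (!r.passing()) continue;  // only try to genotype PASSing records
            if (!seenStarts.add(r.start())) {
                // multiple records share this start: take the first, warn about the rest
                System.err.println("Warning: multiple records at start " + r.start() + "; keeping the first");
                continue;
            }
            kept.add(r);
        }
        return kept;
    }
}
```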
Mark DePristo 7c83efc1b9 Merge pull request #135 from broadinstitute/mc_pgtag_fix
Fixing @PG tag uniqueness issue
2013-03-31 11:36:40 -07:00
Eric Banks 7dd58f671f Merge pull request #132 from broadinstitute/gda_filter_unmasked_sites
Added small feature to VariantFiltration to filter sites outside of a gi...
2013-03-31 06:27:26 -07:00
Mark DePristo 4f828ca86f Merge pull request #137 from broadinstitute/eb_more_rr_tests
Updated AssessReducedQuals and applied it systematically to all ReduceRe...
2013-03-31 05:50:27 -07:00
Guillermo del Angel 9686e91a51 Added small feature to VariantFiltration to filter sites outside of a given mask:
-- Sometimes it's desirable to specify a set of "good" regions and filter out everything else (say, an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. The new argument -filterNotInMask reverses the default logic and filters outside of a given mask.
-- Added integration test, and made sure we also test with a BED rod.
2013-03-31 08:48:16 -04:00
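The reversed-mask decision amounts to a one-line conditional. The sketch below uses hypothetical names, not the walker's actual code:

```java
/** Sketch of the mask-filter decision described above. */
public class MaskFilterLogic {
    /**
     * @param siteInMask       whether the site falls inside the -mask intervals
     * @param filterNotInMask  reverse the default logic: filter sites OUTSIDE the mask
     * @return true if the site should receive the mask filter
     */
    public static boolean shouldFilter(boolean siteInMask, boolean filterNotInMask) {
        return filterNotInMask ? !siteInMask : siteInMask;
    }
}
```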
Eric Banks 8e2094d2af Updated AssessReducedQuals and applied it systematically to all ReduceReads integration tests.
* Moved to protected for packaging purposes.
* Cleaned up and removed debugging output.
* Fixed logic for epsilons so that we really only test significant differences between BAMs.
* Other small fixes (e.g. don't include low quality reduced reads in overall qual).
* Most RR integration tests now automatically run the quals test on output.
  * A few are disabled because we expect them to fail in various locations (e.g. due to downsampling).
2013-03-31 00:27:14 -04:00
Eric Banks baf5240e2b Merge pull request #136 from broadinstitute/gda_leftalign_fix
Big bad bug fix: feature added to LeftAlignAndTrimVariants to left align...
2013-03-30 18:23:39 -07:00
Mauricio Carneiro ec475a46b1 Fixing @PG tag uniqueness issue
The Problem:
------------
The SAM spec does not allow multiple @PG tags with the same ID. Our @PG tag writing routines were allowing that to happen with the boolean parameter "keep_all_pg_records".

How this fixes it:
------------------
This commit removes that option from all the utility functions and cleans up the code around the classes that used these methods off-spec.

Summarized changes:
-------------------
* Remove keep_all_pg_records option from setupWriter utility methods in Util
* Update all walkers to now replace the last @PG tag of the same walker (if it already exists)
* Cleanup NWaySamFileWriter now that it doesn't need to keep track of the keep_all_pg_records variable
* Simplify the multiple implementations to setupWriter

Bamboo:
-------
http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL31

Issue Tracker:
--------------
[fixes 47100885]
2013-03-30 20:31:33 -04:00
Mauricio Carneiro 52e67a6973 ReviewedStingException -> IllegalStateException 2013-03-30 20:11:55 -04:00
Mauricio Carneiro 68bf470524 making LoglessPairHMM final 2013-03-30 20:00:45 -04:00
Guillermo del Angel 6b8bed34d0 Big bad bug fix: feature added to LeftAlignAndTrimVariants to left align multiallelic records didn't work.
-- Corrected logic to pick biallelic vc to left align.
-- Added integration test to make sure this feature is tested and feature to trim bases is also tested.
2013-03-30 19:31:28 -04:00
Mauricio Carneiro 0de6f55660 PairHMM rework
The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because:
   # Initial conditions were not being set properly
   # Emission probabilities in the last row were not adding up to 1

The following commit fixes both by
   # averaging all potential start locations (giving an equal prior to the state machine in its first iteration -- allowing the read to start its alignment anywhere in the haplotype with equal probability)
   # discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state)

Summarized changes:
   * Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm
   * Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1.
   * Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now)
   * Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity
   * Rename metric lengths to read and haplotype lengths for clarity
   * Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity
   * Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix
   * Remove unnecessary parameters from updateCell()
   * Fix the expected probabilities coming from the exact model in PairHMMUnitTest
   * Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables)
   * Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s.

[fix 47164949]
2013-03-30 10:50:06 -04:00
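The second fix -- summing over all paths that end in matches or insertions while discounting paths ending in deletions -- can be sketched as follows. This is a simplified real-space illustration with assumed names and matrix layout, not the actual LoglessPairHMM code:

```java
/** Sketch of the new PairHMM finalization described above: the final likelihood sums
 *  the last row of the match and insertion matrices over all haplotype positions,
 *  deliberately excluding the deletion matrix. */
public class PairHmmFinalizer {
    public static double finalLikelihood(double[][] matchMatrix,
                                         double[][] insertionMatrix,
                                         int readLength, int haplotypeLength) {
        double sum = 0.0;
        for (int j = 1; j <= haplotypeLength; j++) {
            // paths ending in a deletion are discounted: deletionMatrix's last row is never added,
            // which saves us from needing a fourth matrix to represent the end state
            sum += matchMatrix[readLength][j] + insertionMatrix[readLength][j];
        }
        return sum;
    }
}
```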
MauricioCarneiro f42bdaee3c Merge pull request #133 from broadinstitute/chartl_fix_validation_site_selector_md5s
MathUtils.randomSubset() now uses Collections.shuffle() (indirectly, thr...
2013-03-29 12:57:10 -07:00
Chris Hartl 74a17359a8 MathUtils.randomSubset() now uses Collections.shuffle() (indirectly, through the other methods
that are tested), resulting in slightly different numbers of calls to the RNG, and ultimately
different sets of selected variants.

This commit updates the md5 values for the validation site selector integration test to reflect
these new random subsets of variants that are selected.
2013-03-29 14:52:10 -04:00
Mark DePristo ed98fc6a1f Merge pull request #131 from broadinstitute/gda_1000g_consensus_GSA-890
Upgrades and changes to LeftAlignVariants, motivated by 1000G consensus ...
2013-03-29 08:03:14 -07:00
Guillermo del Angel 8fbf9c947f Upgrades and changes to LeftAlignVariants, motivated by 1000G consensus indel production:
-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken for complex records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-alignable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Added command line options, more memory, and reduced alt alleles because jobs keep crashing.
2013-03-29 10:02:06 -04:00
MauricioCarneiro b818a4f219 Merge pull request #130 from broadinstitute/chartl_genotype_concordance_and_mathutils 2013-03-29 05:24:59 -07:00
Chris Hartl 73d1c319bf Rarely-occurring logic bugfix for GenotypeConcordance, streamlining and testing of MathUtils
Currently, the multi-allelic test is covering the following case:

Eval   A   T,C
Comp   A   C

Reciprocate this so that the reverse can be covered:

Eval   A   C
Comp   A   T,C

And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele was also present in at least one other comp sample.

This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are a special case is as follows; imagine:

Eval:   A  G,T      0/0   2/0   2/2   1/1
Comp:   A  C,T      0/0   1/0   0/0   0/0

Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:

Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)

Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.
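The matching rule above can be sketched as a small predicate. The names below are hypothetical stand-ins for illustration, not ConcordanceMetrics itself:

```java
import java.util.*;

/** Sketch of the allele-matching rule described above: hom-ref comp genotypes may be
 *  paired with any allele present in the comp record, while non-ref comp genotypes
 *  must match only the alleles present in the comp genotype itself. */
public class ConcordanceRule {
    /**
     * @param evalAlleles      alleles called in the eval sample's genotype
     * @param compAlleles      alleles in the comp sample's genotype
     * @param compSiteAlleles  all alleles present anywhere in the comp record
     * @param compIsHomRef     whether the comp genotype is hom-ref
     */
    public static boolean allelesMatch(Set<String> evalAlleles, Set<String> compAlleles,
                                       Set<String> compSiteAlleles, boolean compIsHomRef) {
        // hom-ref comp: the permissive rule; non-ref comp: the restrictive rule
        Set<String> allowed = compIsHomRef ? compSiteAlleles : compAlleles;
        return allowed.containsAll(evalAlleles);
    }
}
```

Against the worked example above (eval A,G,T vs. comp A,C,T): Sample2's eval T/A no longer matches the comp C/A genotype even though T appears elsewhere in the comp record, while Sample3's hom-var T/T still matches its hom-ref comp.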

Also in this commit, major refactoring and testing of MathUtils. A large number of methods were not used at all in the codebase; these methods were removed:
 - dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
 - vectorSum
 - array shuffle, random subset
 - countOccurances (general forms, the char form is used in the codebase)
 - getNMaxElements
 - array permutation
 - sorted array permutation
 - compare floats
 - sum() (for integer arrays and lists).

Final keyword was extensively added to MathUtils.

The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).
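The revised contract can be sketched like this (a hypothetical simplification, not the actual MathUtils code):

```java
/** Sketch of the revised ratio()/percentage() contract described above: non-positive
 *  denominators are an error, except the special case 0/0, which returns 0. */
public class MathUtilsSketch {
    public static double ratio(int numerator, int denominator) {
        if (denominator > 0) return (double) numerator / denominator;
        if (numerator == 0 && denominator == 0) return 0.0;  // special-cased 0/0
        throw new IllegalArgumentException("Ratio undefined for denominator " + denominator);
    }

    public static double percentage(int numerator, int denominator) {
        return 100.0 * ratio(numerator, denominator);
    }
}
```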

The PaperGenotyper still made use of one of these array methods; since it was the only walker using it, the method was migrated into the genotyper itself.

In addition, more extensive tests were added for
 - logBinomialCoefficient (Newton's identity should always hold)
 - logFactorial
 - log10sumlog10 and its approximation
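The binomial-coefficient property being tested (the recurrence C(n,k) = C(n-1,k-1) + C(n-1,k), checked in log10 space via the sum function) can be sketched with standalone implementations; the real versions live in MathUtils, and these names are illustrative:

```java
/** Standalone sketches of the log10-space utilities whose identities are tested above. */
public class LogMathSketch {
    public static double log10Factorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) sum += Math.log10(i);
        return sum;
    }

    public static double log10BinomialCoefficient(int n, int k) {
        return log10Factorial(n) - log10Factorial(k) - log10Factorial(n - k);
    }

    /** log10(10^a + 10^b), computed stably by factoring out the larger term. */
    public static double log10SumLog10(double a, double b) {
        double max = Math.max(a, b), min = Math.min(a, b);
        return max + Math.log10(1.0 + Math.pow(10.0, min - max));
    }
}
```

The recurrence check then reads: log10C(10,4) should equal log10SumLog10(log10C(9,3), log10C(9,4)), since 210 = 84 + 126.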

All unit tests pass
2013-03-28 23:25:28 -04:00
MauricioCarneiro a2b69790a6 Merge pull request #128 from broadinstitute/eb_rr_polyploid_compression_GSA-639 2013-03-28 06:39:43 -07:00
Ryan Poplin e58d0a71e4 Merge pull request #129 from broadinstitute/md_hc_good_head_tail
Many more improvements and bugfixes to the HaplotypeCaller
2013-03-27 12:36:39 -07:00
Mark DePristo fde7d36926 Updating md5s due to changes in assembly graph creation algorithms and default parameter 2013-03-27 15:31:24 -04:00
Mark DePristo 197d149495 Increase the maxNumHaplotypesInPopulation to 25
-- A somewhat arbitrary increase that will need some evaluation, but necessary to get good results on the AFR integration test.
2013-03-27 15:31:24 -04:00
Mark DePristo 66910b036c Added new and improved suffix and node merging algorithms
-- These new algorithms are more powerful than the restricted diamond merging algorithms, in that they can merge nodes with multiple incoming and outgoing edges. Together the splitter + merger algorithms will correctly merge many more cases than the original headless and tailless diamond merger.
-- Refactored haplotype caller infrastructure into graphs package, code cleanup
-- Cleanup new merging / splitting algorithms, with proper docs and unit tests
-- Fix bug in zipping of linear chains.  Because the multiplicity can be 0, protect ourselves with a max function call
-- Fix BaseEdge.max unit test
-- Add docs and some more unit tests
-- Move error correct from DeBruijnGraph to DeBruijnAssembler
-- Replaced uses of System.out.println with logger.info
-- Don't make multiplicity == 0 nodes look like they should be pruned
-- Fix toString of Path
2013-03-27 15:31:18 -04:00
Mark DePristo 39f2e811e5 Increase max cigar elements from SW before failing path creation to 20 from 6
-- This allows more diversity in paths, which is sometimes necessary when we cannot simplify graphs that have large bubbles
2013-03-26 14:27:18 -04:00
Mark DePristo b1b615b668 BaseGraph shouldn't implement getEdge -- no idea why I added this 2013-03-26 14:27:18 -04:00
Mark DePristo a97576384d Fix bug in the HC not respecting the requested pruning 2013-03-26 14:27:18 -04:00
Mark DePristo 78c672676b Bugfix for pruning and removing non-reference edges in graph
-- Previous algorithms were applying pruneGraph inappropriately on the raw sequence graph (where each vertex is a single base). This resulted in overpruning of the graph, as pruneGraph really relied on the zipping of linear chains (and the sharing of weight this provides) to avoid over-pruning. Probably we should think hard about this. This commit fixes the logic so that we zip the graph between pruning passes
-- In the process, identified a fundamental problem with how we were trimming away vertices that do not occur on a path from the reference source to sink. In fact, we were leaving in any vertex that happened to be accessible from the source, any vertices in cycles, and any vertex that wasn't the absolute end of a chain going to a sink. The new algorithm fixes all of this, using a BaseGraphIterator that's a general approach to walking the base graph. Other routines that use the same traversal idiom were refactored to use this iterator. Added unit tests for all of these capabilities.
-- Created new BaseGraphIterator, which abstracts common access patterns to the graph, and used it where appropriate
2013-03-26 14:27:18 -04:00
Mark DePristo 12475cc027 Display the active MappingQualityFilter if mmq > 0 in the HaplotypeCaller 2013-03-26 14:27:18 -04:00
Mark DePristo ad04fdb233 PerReadAlleleLikelihoodMap getMostLikelyAllele returns a MostLikelyAllele object now
-- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads aligned to their best allele for display, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those reads that are specifically informative.
-- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length. All reads would be non-informative, as no read provides information about the full combination of 4 haplotypes; the reads only span a single event. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting)
2013-03-26 14:27:13 -04:00
Mark DePristo 2472828e1c HC bug fixes: no longer create reference graphs with cycles
-- Though not intended, it was possible to create reference graphs with cycles in the case where you started the graph with a homopolymer of length > the kmer size. The previous test would fail to catch this case. Now it's not possible
-- Lots of code cleanup and refactoring in this push.  Split the monolithic createGraphFromSequences into simple calls to addReferenceKmersToGraph and addReadKmersToGraph which themselves share lower level functions like addKmerPairFromSeqToGraph.
-- Fix performance problem with reduced reads and the HC, where we were calling add kmer pair for each count in the reduced read, instead of just calling it once with a multiplicity of count.
-- Refactor addKmersToGraph() to use things like addOrUpdateEdge, now the code is very clear
2013-03-26 10:12:24 -04:00
Mark DePristo 1917d55dc2 Bugfix for DeBruijnAssembler: don't fail when read length > haplotype length
-- The previous version would generate graphs that had no reference bases at all in the situation where the reference haplotype was shorter than the longest read, which would cause the kmer size to exceed the reference haplotype length. Now we return immediately with a null graph when this occurs, as opposed to continuing and eventually causing an error
2013-03-26 10:12:17 -04:00
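The bail-out above amounts to a simple precondition check before assembly; the sketch below uses hypothetical names, not the actual DeBruijnAssembler code:

```java
/** Sketch of the guard described above: a graph seeded with reference kmers needs at
 *  least one full kmer of reference sequence, so assembly is skipped (null graph)
 *  rather than failing later when the kmer size exceeds the haplotype length. */
public class KmerSizeGuard {
    public static boolean canAssemble(int kmerSize, int refHaplotypeLength) {
        return kmerSize <= refHaplotypeLength;
    }
}
```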
Mark DePristo 464e65ea96 Disable error correcting kmers by default in the HC
-- The error correction algorithm can break the reference graph in some cases by error correcting us into a bad state for the reference sequence.  Because we know that the error correction algorithm isn't ideal, and worse, doesn't actually seem to improve the calling itself on chr20, I've simply disabled error correction by default and allowed it to be turned on with a hidden argument.
-- In the process I've changed the assembly interface a bit, moving some common arguments into the LocalAssemblyEngine, which are turned on/off via setter methods.
-- Went through the updated arguments in the HC, marking them @Hidden and @Advanced as appropriate
-- Don't write out an error-corrected graph when debugging and error correction isn't enabled
2013-03-26 10:05:17 -04:00
Eric Banks 593d3469d4 Refactored the het (polyploid) consensus creation in ReduceReads.
* It is now cleaner and easier to test; added tests for newly implemented methods.
 * Many fixes to the logic to make it work
   * The most important change was that after triggering het compression we actually need to back it out if it
      creates reads that incorporated too many softclips at any one position (because they get unclipped).
   * There was also an off-by-one error in the general code that only manifested itself with het compression.
 * Removed support for creating a het consensus around deletions (which was broken anyways).
   * Mauricio gave his blessing for this.
 * Het compression now works only against known sites (with -known argument).
    * The user can pass in one or more VCFs with known SNPs (other variants are ignored).
    * If no known SNPs are provided het compression will automatically be disabled.
 * Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
   strandedness from normal reduced reads.
    * GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
    * This allows us to update the FisherStrand annotation to count het compressed reduced reads
       towards the FS calculation.
    * [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
       backwards compatible.]
    * Updated integration tests accordingly with new het compressed bams (both for RR and UG).
 * In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
   RR properly, so I fixed it too.
    * Also, the test in the UG engine for determining whether there are too many overlapping
       deletions is updated to handle RR.
 * I added a special hook in the RR integration tests to additionally run the systematic
   coverage checking tool I wrote earlier.
    * AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
       not lost from original to reduced bam.
    * This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
       from all but 1 sample (now fixed).
    * AssessReducedCoverage moved from private to protected for packaging reasons.
 * #resolve GSA-639

At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
2013-03-25 09:34:54 -04:00
Mark DePristo 965043472a Vastly more powerful, cleaner graph simplification approach
-- Generalizes previous node merging and splitting approaches. Can split common prefixes and suffixes among nodes, build a subgraph representing this new structure, and incorporate it into the original graph. Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons). Fully unit tested. These new algorithms can now really simplify diamond configurations as well as ones whose sources and sinks arrive / depart linearly at a common single root node.
-- This new suite of algorithms is fully integrated into the HC, replacing previous approaches
-- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph.  This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.
2013-03-23 17:40:55 -04:00
Mark DePristo 078c63654d Merge pull request #127 from broadinstitute/mc_callable_loci_with_reduce_reads
Adding ReduceReads support to Callable Loci
2013-03-21 17:56:23 -07:00
Mark DePristo de832de03d Merge pull request #125 from broadinstitute/mc_fly_pipeline
Removing mark duplicates from the quick processing pipeline
2013-03-21 13:11:53 -07:00
Mauricio Carneiro eb33da6820 Added support to reduce reads to Callable Loci
-- added calls to representativeCount() of the pileup instead of using ++
-- renamed CallableLoci integration test
-- added integration test for reduce read support on callable loci
2013-03-21 15:53:04 -04:00
Ryan Poplin c15453542e Merge pull request #124 from broadinstitute/md_hc_lowmapq_read_filter
HC now by default only uses reads with MAPQ >= 20 for assembly and calli...
2013-03-21 12:00:28 -07:00
Mark DePristo b30deda7ee Merge pull request #123 from broadinstitute/rp_event_mapper_bug_GSA-877
Bug fix in HC gga mode.
2013-03-21 11:42:20 -07:00
Mark DePristo 7ae15dadbe HC now by default only uses reads with MAPQ >= 20 for assembly and calling
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements.  The new version simply uses a read filter to remove them upfront.  This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
2013-03-21 13:10:50 -04:00
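The upfront filter is conceptually just a MAPQ threshold applied before assembly; this is a hypothetical minimal stand-in, not the actual GATK read filter class:

```java
/** Sketch of the upfront low-MAPQ read filter described above: reads below the
 *  threshold are removed before assembly and calling instead of being carried
 *  through (and effectively discarded) at great computational cost. */
public class LowMappingQualityFilter {
    public static final int MIN_MAPPING_QUALITY = 20;

    /** @return true if the read should be filtered out before assembly and calling */
    public static boolean filterOut(int mappingQuality) {
        return mappingQuality < MIN_MAPPING_QUALITY;
    }
}
```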
Ryan Poplin b9c331c2fa Bug fix in HC gga mode.
-- Don't try to test alleles which haven't had haplotypes assigned to them
2013-03-21 11:02:41 -04:00
Ryan Poplin 1a95ce5dcf Merge pull request #122 from broadinstitute/md_ceu_trio_calls_2x250_GSA-739
Many improvements to HaplotypeCaller for CEU trio best practice variant calling
2013-03-21 06:58:13 -07:00