Commit Graph

754 Commits (2c3c680eb704348cb8e20572f5b9d2fb3a5a986c)

Author SHA1 Message Date
Eric Banks 2c3c680eb7 Misc changes and cleanup from all previous commits in this push.
1. By default, do not include the UG CEU callset for assessment.
2. Updated md5s that are different now with all the HC changes.
2013-06-11 12:53:11 -04:00
Eric Banks dadcfe296d Reworking of the dangling tails merging code.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort.  Also, we put in a check for low-complexity of graphs
and don't let those pass through.

Added tests for this implementation that checks exact SW results and correct edges added.
2013-06-11 12:53:04 -04:00
Eric Banks c0030f3f2d We no longer subset down to the best N haplotypes for the GL calculation.
I explain in comments within the code that this was causing problems with the marginalization over events.
2013-06-11 11:51:26 -04:00
Eric Banks c0e3874db0 Change the HC's phredScaledGlobalReadMismappingRate from 60 to 45, because Ryan and Mark told me to. 2013-06-11 11:51:26 -04:00
Eric Banks 77868d034f Do not allow the use of Ns in reads for graph construction.
Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones.
This was the cause of at least one FN where there were Ns at a SNP position.
2013-06-11 11:51:26 -04:00
Eric Banks e4e7d39e2c Fix FN problem stemming from sequence graphs that contain cycles.
Problem:
The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers.
Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final
sequence graph.  Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions.

Solution:
We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them.  If (and only if) we
can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate
a graph without cycles or attempt too many iterations.
2013-06-11 11:51:26 -04:00
Ryan Poplin 58e354176e Minor changes to docs in the graph pruning. 2013-06-11 10:33:22 -04:00
Mark DePristo 1c03ebc82d Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
2013-06-10 16:20:31 -04:00
Mark DePristo 0d593cff70 Refactor rsID and overlap detection in VariantOverlapAnnotator utility class
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele.  This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
2013-06-10 15:51:13 -04:00
Mauricio Carneiro a95fbd48e5 Moving QualifyMissingIntervals to protected
Making this walker available so we can share it with the CSER group for CLIA analysis.
2013-06-10 13:11:41 -04:00
Valentin Ruano-Rubio 96073c3058 This commit addresses JIRA issue GSA-948: Prevent users from doing the wrong thing with RNA-Seq data and the GATK.
The previous behavior is to process reads with N CIGAR operators as they are despite that many of the tools do not actually support such operator and results become unpredictible.

Now if the there is some read with the N operator, the engine returns a user exception. The error message indicates what is the problem (including the offending read and mapping position) and give a couple of alternatives that the user can take in order to move forward:

a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)

b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)

Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.

Implementation:

* Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class commont parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
2013-06-10 10:44:42 -04:00
Michael McCowan 00c06e9e52 Performance improvements:
- Memoized MathUtil's cumulative binomial probability function.
 - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
2013-06-09 11:26:52 -04:00
Mark DePristo 209dd64268 HaplotypeCaller now emits per-sample DP
-- Created a new annotation DepthPerSampleHC that is by default on in the HaplotypeCaller
-- The depth for the HC is the sum of the informative alleles at this site.  It's not perfect (as we cannot differentiate between reads that align over the event but aren't informative vs. those that aren't even close) but it's a pretty good proxy and it matches with the AD field (i.e., sum(AD) = DP).
-- Update MD5s
-- delivers [#48240601]
2013-06-06 09:47:32 -04:00
Mark DePristo 34bdf20132 Bugfix for bad AD values in UG/HC
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
2013-06-05 17:48:41 -04:00
Mark DePristo c9f5b53efa Bugfix for HC can fail to assemble the correct reference sequence in some cases
-- Ultimately this was caused by overly aggressive merging of CommonSuffixMerger.  In the case where you have this graph:

ACT [ref source] -> C
G -> ACT -> C

we would merge into

G -> ACT -> C

which would linearlize into

GACTC

Causing us to add bases to the reference source node that couldn't be recovered.  The solution was to ensure that CommonSuffixMerger only operates when all nodes to be merged aren't source nodes themselves.

-- Added a convenient argument to the haplotype caller (captureAssemblyFailureBAM) that will write out the exact reads to a BAM file that went into a failed assembly run (going to a file called AssemblyFailure.BAM).  This can be used to rerun the haplotype caller to produce the exact error, which can be hard in regions of deep coverage where the downsampler state determines the exact reads going into assembly and therefore makes running with a sub-interval not reproduce the error
-- Did some misc. cleanup of code while debugging
-- [delivers #50917729]
2013-06-03 16:16:39 -04:00
Ryan Poplin ab40f4af43 Break out the GGA kmers and the read kmers into separate functions for the DeBruijn assembler.
-- Added unit test for new function.
2013-06-03 14:00:35 -04:00
Ryan Poplin 21334e728d Merge pull request #252 from broadinstitute/md_bqsr_index_out_of_bounds
Make BQSR calculateIsIndel robust to indel CIGARs are start/end of read
2013-06-03 07:13:00 -07:00
Mark DePristo 6555361742 Fix error in merging code in HC
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper.  The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H.  This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
2013-05-31 16:29:29 -04:00
Mark DePristo 64b4d80729 Make BQSR calculateIsIndel robust to indel CIGARs are start/end of read
-- The previous implementation attempted to be robust to this, but not all cases were handled properly.  Added a helper function updateInde() that bounds up the update to be in the range of the indel array, and cleaned up logic of how the method works.  The previous behavior was inconsistent across read fwd/rev stand, so that the indel cigars at the end of read were put at the start of reads if the reads were in the forward strand but not if they were in the reverse strand.  Everything is now consistent, as can be seen in the symmetry of the unit tests:

        tests.add(new Object[]{"1D3M",   false, EventType.BASE_DELETION, new int[]{0,0,0}});
        tests.add(new Object[]{"1M1D2M", false, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"2M1D1M", false, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"3M1D",   false, EventType.BASE_DELETION, new int[]{0,0,1}});

        tests.add(new Object[]{"1D3M",   true, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"1M1D2M", true, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"2M1D1M", true, EventType.BASE_DELETION, new int[]{0,0,1}});
        tests.add(new Object[]{"3M1D",   true, EventType.BASE_DELETION, new int[]{0,0,0}});

        tests.add(new Object[]{"4M1I",   false, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"3M1I1M", false, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"2M1I2M", false, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
        tests.add(new Object[]{"1M1I3M", false, EventType.BASE_INSERTION, new int[]{1,0,0,0,0}});
        tests.add(new Object[]{"1I4M",   false, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});

        tests.add(new Object[]{"4M1I",   true, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
        tests.add(new Object[]{"3M1I1M", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,1}});
        tests.add(new Object[]{"2M1I2M", true, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"1M1I3M", true, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"1I4M",   true, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});

-- delivers #50445353
2013-05-31 13:58:37 -04:00
Ryan Poplin b5b9d745a7 New implementation of the GGA mode in the HaplotypeCaller
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
2013-05-31 10:35:36 -04:00
Ryan Poplin 61af37d0d2 Create a new normalDistributionLog10 function that is unit tested for use in the VQSR. 2013-05-30 16:00:08 -04:00
Eric Banks a5a68c09fa Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads.
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not.  So I refactored that method to include a boolean argument as to whether or not
clips should be ignored.  Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
2013-05-29 16:41:01 -04:00
Mark DePristo 684c91c2e7 Merge pull request #245 from broadinstitute/dr_enforce_min_dcov
Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
2013-05-29 09:52:13 -07:00
David Roazen a7cb599945 Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
 since low dcov values can result in problematic downsampling artifacts for locus-based
 traversals.

-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
 controls the number of reads per alignment start position, and even a dcov value of 1 might
 be safe/desirable in some circumstances.

-Also reorganize the global downsampling defaults so that they are specified as annotations
 to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
 DownsamplingMethod class.

-The default downsampling settings have not been changed: they are still -dcov 1000
 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
2013-05-29 12:07:12 -04:00
Mauricio Carneiro f1affa9fbb Turn off downsampling for DiagnoseTargets
Diagnose targets should never be downsampled. (and I didn't know there was a default downsampling going on for locus walkers)
2013-05-28 14:58:50 -04:00
Ryan Poplin 85905dba92 Bugfix for GGA mode in UG silently ignoring indels
-- Started by Mark. Finished up by Ryan.
-- GGA mode still respected glm argument for SNP and INDEL models, so that you would silently fail to genotype indels at all if the -glm INDEL wasn't provided, but you'd still emit the sites, so you'd see records in the VCF but all alleles would be no calls.
-- https://www.pivotaltracker.com/story/show/48924339 for more information
-- [resolves #48924339]
2013-05-24 13:47:26 -04:00
Mauricio Carneiro da21924b44 Make the missing targets output never use stdout
Problem
--------
Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided

Solution
--------
Make it NOT default to stdout

[Delivers #50386741]
2013-05-22 14:22:54 -04:00
Mark DePristo d167743852 Archived banded logless PairHMM
BandedHMM
---------
-- An implementation of a linear runtime, linear memory usage banded logless PairHMM.  Thought about 50% faster than current PairHMM, this implementation will be superceded by the GraphHMM when it becomes available.  The implementation is being archived for future reference

Useful infrastructure changes
-----------------------------
-- Split PairHMM into a N2MemoryPairHMM that allows smarter implementation to not allocate the double[][] matrices if they don't want, which was previously occurring in the base class PairHMM
-- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for using in unit or performance testing.  Added example of 100kb of data to private/testdata.  Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
2013-05-22 12:24:00 -04:00
Mark DePristo a1093ad230 Optimization for ActiveRegion.removeAll
-- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion.  This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element.   New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set.  Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
2013-05-21 16:18:57 -04:00
Eric Banks 1f3624d204 Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing 2013-05-21 11:35:05 -04:00
Valentin Ruano Rubio 71bbb25c9e Merge pull request #231 from broadinstitute/md_combinevariants_bugfix
CombineVariants no longer adds PASS to unfiltered records
2013-05-20 14:28:20 -07:00
Mark DePristo 62fc88f92e CombineVariants no longer adds PASS to unfiltered records
-- [Delivers #49876703]
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!
2013-05-20 16:53:51 -04:00
Ryan Poplin 507853c583 Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result.
-- The troule stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.
2013-05-20 14:29:04 -04:00
Eric Banks 8a442d3c9f @Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating. 2013-05-17 10:04:10 -04:00
Yossi Farjoun 9234a0efcd Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs
Bug fixes and missing interval functionality for Diagnose Targets

While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred that unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule).

It is with these cavets that I accept the pull request.
2013-05-16 09:25:54 -07:00
Chris Hartl 6da0aed30f Update GCIT md5s to account for trivial changes to description strings 2013-05-14 19:45:30 -04:00
Yossi Farjoun 409a202492 Merge pull request #214 from broadinstitute/chartl_genotype_concordance_diploid_and_OGC
Add overall genotype concordance to the genotype concordance tool. In ad...
2013-05-14 14:19:54 -07:00
Mauricio Carneiro adcbf947bf Update MD5s and the Diagnose Target scala script 2013-05-13 12:06:17 -04:00
Mauricio Carneiro 9eceae793a Tool to manipulate intervals outside the GATK
Performs basic set operations on intervals like union, intersect and difference between two or more intervals. Useful for techdev and QC purposes.
2013-05-13 11:56:24 -04:00
Mauricio Carneiro 3dbb86b052 Outputting missing intervals in DiagnoseTargets
Problem
------
Diagnose Targets identifies holes in the coverage of a targetted experiment, but it only reports them doesn't list the actual missing loci

Solution
------
This commit implements an optional intervals file output listing the exact loci that did not pass filters

Itemized changes
--------------
   * Cache callable statuses (to avoid recalculation)
   * Add functionality to output missing intervals
   * Implement new tool to qualify the missing intervals (QualifyMissingIntervals) by gc content, size, type of missing coverage and origin (coding sequence, intron, ...)
2013-05-13 11:51:56 -04:00
Mauricio Carneiro 1466396a31 Diagnose target is outputting intervals out of order
Problem
-------
When the interval had no reads, it was being sent to the VCF before the intervals that just got processed, therefore violating the sort order of the VCF.

Solution
--------
Use a linked hash map, and make the insertion and removal all happen in one place regardless of having reads or not. Since the input is ordered, the output has to be ordered as well.

Itemized changes
--------------
   * Clean up code duplication in LocusStratification and SampleStratification
   * Add number of uncovered sites and number of low covered sites to the VCF output.
   * Add new VCF format fields
   * Fix outputting multiple status when threshold is 0 (ratio must be GREATER THAN not equal to the threshold to get reported)

[fixes #48780333]
[fixes #48787311]
2013-05-13 11:50:22 -04:00
Mark DePristo b4f482a421 NanoScheduled ActiveRegionTraversal and HaplotypeCaller
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
  -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
  -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
2013-05-13 11:09:02 -04:00
Eric Banks 2f5ef6db44 New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points.
* This version inherits from the original SW implementation so it can use the same matrix creation method.
   * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
     right thing for indels at the edges of the alignments.
     * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
   * Lots of systematic testing added for this implementation.
   * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
2013-05-13 09:36:39 -04:00
Mark DePristo 111e8cef0f Merge pull request #219 from broadinstitute/eb_rr_multisample_fix
Fix bug in Reduce Reads that arises in multi-sample mode.
2013-05-09 15:31:14 -07:00
Eric Banks 8b9c6aae3e Fix bug in Reduce Reads that arises in multi-sample mode.
* bitset could legitimately be in an unfinished state but we were trying to access it without finalizing.
  * added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode.
  * tests were easiest to implement as integration tests (this was a really complicated case).
2013-05-08 23:23:51 -04:00
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
Eric Banks d242f1bba3 Secondary alignments were not handled correctly in IndelRealigner
* This is emerging now because BWA-MEM produces lots of reads that are not primary alignments
 * The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it
     was getting confused by these secondary alignments
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Eric Banks b53336c2d0 Added hidden mode for BQSR to force all read groups to be the same one.
* Very useful for debugging sample-specific issues
 * This argument got lost in the transition from BQSR v1 to v2
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Chris Hartl 6ff74deac7 Add overall genotype concordance to the genotype concordance tool. In addition, protect from non-diploid genotypes, which can cause very strange behavior.
Update MD5 sums. As expected, md5 changes are consistent with the genotype concordance field being added to each output.
2013-05-06 13:06:30 -04:00
chartl 98021db264 Merge pull request #208 from broadinstitute/yf_fix_molten_GenotypeConcordance
- Fixed a small bug in the printout of molten data in GenotypeConcordanc...
2013-05-06 08:42:06 -07:00