Commit Graph

12344 Commits (cb49b8cc714528fa7ecccfa8dc038e1c8e0c09c1)

Author SHA1 Message Date
Mark DePristo cb49b8cc71 Merge pull request #222 from broadinstitute/dr_convenient_diff_engine_md5_display_in_bamboo_GSA-915
Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements
2013-05-11 10:58:42 -07:00
David Roazen 639030bd6d Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements
-Diff engine output is now included in the actual exception message thrown as a
 result of an MD5 mismatch, which allows it to be conveniently viewed on the
 main page of a build in Bamboo.

Minor Additional Improvements:

-WalkerTestSpec now auto-detects test class name via new JVMUtils.getCallingClass()
 method, and the test class name is now included as a regular part of integration
 test output for each test.

-Fix race condition in MD5DB.ensureMd5DbDirectory()

-integrationtests dir is now cleaned by "ant clean"

GSA-915 #resolve
2013-05-10 19:00:33 -04:00
Mark DePristo 111e8cef0f Merge pull request #219 from broadinstitute/eb_rr_multisample_fix
Fix bug in Reduce Reads that arises in multi-sample mode.
2013-05-09 15:31:14 -07:00
Ryan Poplin 1078d0e51c Merge pull request #216 from broadinstitute/md_better_assembly
Replace DeBruijnAssembler with ReadThreadingAssembler
2013-05-09 07:36:17 -07:00
Eric Banks 8b9c6aae3e Fix bug in Reduce Reads that arises in multi-sample mode.
* bitset could legitimately be in an unfinished state but we were trying to access it without finalizing.
  * added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode.
  * tests were easiest to implement as integration tests (this was a really complicated case).
2013-05-08 23:23:51 -04:00
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
Mark DePristo fbc683f1e8 Merge pull request #218 from broadinstitute/dr_rev_picard_to_1.91
Rev picard, sam-jdk, tribble, and variant to version 1.91.1453
2013-05-07 14:32:03 -07:00
David Roazen 52c20a8416 Rev picard, sam-jdk, tribble, and variant to version 1.91.1453
-mainly for the seek() performance optimization
2013-05-07 14:21:15 -04:00
Mark DePristo 178f1499ee Merge pull request #217 from broadinstitute/md_queue_jobreport_improvements
Improve queue script jobreport visualization script
2013-05-07 09:14:03 -07:00
Mark DePristo 2b86ab02be Improve queue script jobreport visualization script
-- the Queue jobreport PDF script now provides a high-level summary of the de-scattered runtimes of each analysis, so that its easy to see where your script is spending its time across scatters.
2013-05-07 12:11:46 -04:00
Mark DePristo b153042a9d Merge pull request #215 from broadinstitute/eb_misc_improvements
Eb misc improvements
2013-05-07 06:05:53 -07:00
Eric Banks d242f1bba3 Secondary alignments were not handled correctly in IndelRealigner
* This is emerging now because BWA-MEM produces lots of reads that are not primary alignments
 * The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it
     was getting confused by these secondary alignments
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Eric Banks b53336c2d0 Added hidden mode for BQSR to force all read groups to be the same one.
* Very useful for debugging sample-specific issues
 * This argument got lost in the transition from BQSR v1 to v2
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
chartl 98021db264 Merge pull request #208 from broadinstitute/yf_fix_molten_GenotypeConcordance
- Fixed a small bug in the printout of molten data in GenotypeConcordanc...
2013-05-06 08:42:06 -07:00
Mark DePristo 084a464779 Merge pull request #213 from broadinstitute/gda_md5_fix
Re-fix md5's that changed due to conflicting pushes
2013-05-03 16:21:46 -07:00
Guillermo del Angel 874dc8f9c1 Re-fix md5's that changed due to conflicting pushes 2013-05-03 14:59:16 -04:00
Mark DePristo 8bac61f9fb Merge pull request #212 from broadinstitute/md_misc_improvements
Several important bugfixes / improvements to general GATK infrastructure
2013-05-03 08:20:55 -07:00
Mark DePristo f42bb86bdd e# This is a combination of 2 commits.
Only try to clip adaptors when both reads of the pair are on opposite strands

-- Read pairs that have unusual alignments, such as two reads both oriented like:

  <-----
     <-----

where previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs.  This caused us to clip extra good bases from reads.
-- Update MD5s due change in adaptor clipping, which add some coverage in some places
2013-05-03 11:19:14 -04:00
Mark DePristo 0587a145bf Utils.dupString should allow 0 number of duplicates to produce empty string 2013-05-03 09:32:05 -04:00
Mark DePristo f5a301fb63 Bugfix for AlignmentUtils.trimCigarByBases
-- Previous version would trim down 2M2D2M into 2M if you asked for the first 2 bases, but this can result in incorrect alignment of the bases to the reference as the bases no longer span the full reference interval expected.  Fixed and added unit tests
2013-05-03 09:32:05 -04:00
Mark DePristo 2bcbdd469f leftAlignCigarSequentially now supports haplotypes with insertions and deletions where the deletion allele was previously removed by the leftAlignSingleIndel during it's cleanup phase. 2013-05-03 09:32:05 -04:00
Eric Banks aefddaa219 Merge pull request #210 from broadinstitute/gda_QUAL_fix_GSA-910
Rev'd up Picard to get PL fix: PLs were saturated to 32767 (Short.MAX_VA...
2013-05-03 06:28:10 -07:00
Mark DePristo 4f627e96c0 Merge pull request #211 from broadinstitute/md_kb_script
Starting NA12878 KB requires use Java-1.7
2013-05-03 04:30:23 -07:00
Mark DePristo b89d97cb9c Starting NA12878 KB requires use Java-1.7 2013-05-03 07:29:29 -04:00
Guillermo del Angel 0c30a5ebc6 Rev'd up Picard to get PL fix: PLs were saturated to 32767 (Short.MAX_VALUE) when converting from GL to integers. Increase capping to Integer.MAX_VALUE (2^31-1) which should be enough for reasonable sites now. Integration tests change because some tests have some hyper-deep pileups where this case was hit 2013-05-02 16:31:43 -04:00
Eric Banks c6df20cff5 Merge pull request #209 from broadinstitute/eb_more_fixes_to_bundle_script
Now that we don't generate dict and fai files, the resource script needs...
2013-05-02 12:23:41 -07:00
Eric Banks d981fd01b8 Now that we don't generate dict and fai files, the resource script needs to copy them to the bundle. 2013-05-02 15:18:13 -04:00
David Roazen 13bfa963da Revert changes to exampleFASTA.fasta.fai for now to get tests passing again 2013-05-02 12:59:20 -04:00
Mark DePristo dfdc0df4f1 Merge pull request #207 from broadinstitute/eb_fixing_bundle_script
Fixing the bundle script
2013-05-02 08:18:19 -07:00
Eric Banks f88a964e2c Adding .fai file to example fasta since we don't generate it anymore 2013-05-02 10:54:32 -04:00
Eric Banks 6d0e383a60 Fixing the bundle script
1. someone out there busted it when adding high confidence 1000G calls
2. new path to NA12878 bam
3. updated clashing version argument
2013-05-02 09:40:36 -04:00
Yossi Farjoun 4b8b411b92 - Fixed a small bug in the printout of molten data in GenotypeConcordance
Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.

- Updated MD5's after looking at diffs to verify that the change is what I expected.
2013-05-02 09:16:07 -04:00
Mark DePristo 803f666fd5 Merge pull request #204 from broadinstitute/rp_cgl_force_sample_name
Adding argument to CGL to not stratify by sample name. This is useful wh...
2013-05-01 13:59:20 -07:00
Ryan Poplin ad84f15572 Adding argument to CGL to not stratify by sample name. This is useful when running with 1000G so that you don't get 1000s of lines on the plots. 2013-05-01 16:25:11 -04:00
droazen 19b59f0fb9 Merge pull request #206 from broadinstitute/dr_update_tests_for_java7
Update expected test output for Java 7
2013-05-01 13:23:25 -07:00
David Roazen f3c94a3c87 Update expected test output for Java 7
-Changes in Java 7 related to comparators / sorting produce a large number
 of innocuous differences in our test output. Updating expectations now
 that we've moved to using Java 7 internally.

-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
 intermittent failures.
2013-05-01 16:18:01 -04:00
David Roazen d0980e236a Merged bug fix from Stable into Unstable 2013-05-01 01:08:29 -04:00
David Roazen f57256b6c2 Delete unused FastaSequenceIndexBuilder class and accompanying test
This class, being unused, was no longer getting packaged into the
GATK release jar by bcel, and so attempting to run its unit test
on the release jar was producing an error.
2013-05-01 01:02:01 -04:00
David Roazen 2edb286d1c Merged bug fix from Stable into Unstable 2013-04-30 22:33:36 -04:00
David Roazen 3390fc7d67 Include cofoja jar in classpath when testing release jars
-Even though we're no longer compiling/using contracts in tests,
 we still need the cofoja jar in the classpath when testing the
 release jars due to some bad behavior on the part of TestNG in
 not being able to handle missing annotation classes.

-We don't need to package the cofoja classes in the actual GATK
 jar, however (and we never have).
2013-04-30 22:16:58 -04:00
Eric Banks e29b52b9a5 Merge remote-tracking branch 'unstable/master' 2013-04-30 15:31:33 -04:00
Mark DePristo a0a1a366e3 Merge pull request #201 from broadinstitute/eb_fix_reduced_count_tagging
Setting the reduce reads count tag was all wrong in a previous commit; fixing
2013-04-30 12:14:40 -07:00
Eric Banks 58424e56be Setting the reduce reads count tag was all wrong in a previous commit; fixing.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly.  Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.

The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly.  Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).

Also:
1. counts are now maintained as ints whenever possible.  Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
2013-04-30 13:45:42 -04:00
delangel 15266da51c Merge pull request #203 from broadinstitute/gda_poolcaller_paper
Updates to pool caller scala script due to new paths and cleanup, hopefu...
2013-04-30 07:12:02 -07:00
Guillermo del Angel 95637e03a0 Updates to pool caller scala script due to new paths and cleanup, hopefully with final changes for paper.
Added also R script used to process everything into a couple of ggplot-friendly data frames.
Functionality is basically the same. Enhancements:
-- Add annotation to log axiom and Exome Chip AC along with LOF results for concordance comparisons.
-- General Cleanup.
-- Used base path for files as a variable in case directory structure in gsa-hpprojects changes again.
-- Output also per-pool data by subsetting genotypes per pool and comparing with corresponding genotypes from Axiom, exome chip and omni.
-- Commit R scripts that load all tables and crunch them to analyze them.
2013-04-30 10:05:35 -04:00
Mark DePristo 6ea2bceb55 Merge pull request #202 from broadinstitute/yf_remove_getlength_from_every_GATKBAMIndex_read
GATKBAMIndex calls buffer.length() on every read.
2013-04-30 06:24:34 -07:00
Mark DePristo 73fcacbf1b Change Long to long 2013-04-30 09:21:10 -04:00
Eric Banks a3a2ec5a1c Merge pull request #200 from broadinstitute/gda_ug_rr_bug_48742591
Fix for indel calling with UG in presence of reduced reads: When a read ...
2013-04-29 17:54:43 -07:00
Guillermo del Angel 20d3137928 Fix for indel calling with UG in presence of reduced reads: When a read is long enough so that there's no reference context available, the reads gets clipped so that it falls again within the reference context range. However, the clipping is incorrect, as it makes the read end precisely at the end of the reference context coordinates. This might lead to a case where a read might span beyond the haplotype if one of the candidate haplotypes is shorter than the reference context (As in the case e.g. with deletions). In this case, the HMM will not work properly and the likelihood will be bad, since "insertions" at end of reads when haplotype is done will be penalized and likelihood will be much lower than it should.
-- Added check to see if read spans beyond reference window MINUS padding and event length. This guarantees that read will always be contained in haplotype.
-- Changed md5's that happen when long reads from old 454 data have their likelihoods changed because of the extra base clipping.
2013-04-29 19:33:02 -04:00
Yossi Farjoun 0e7e6d35d8 GATKBAMIndex calls buffer.length() on every read. This is causing much pain.
Optimized by getting the read of the file upon opening the index-file and using that instead.
2013-04-29 12:49:02 -04:00