Commit Graph

12371 Commits (b239cb76d4cc6aa3e1b8c82e0896bebf407c595f)

Author SHA1 Message Date
Mark DePristo b239cb76d4 Merge pull request #230 from broadinstitute/mc_gsalib_toR3
Updating gsalib for R-3.0 compatibility
2013-05-19 04:32:58 -07:00
Mauricio Carneiro c8b1c47764 Updating gsalib for R-3.0 compatibility
* add package namespace that exports all the visible objects
   * list gsalib dependencies in the package requirements

[fixes #49987933]
2013-05-18 12:43:38 -04:00
Eric Banks 665e45f0fc Merge pull request #229 from broadinstitute/eb_liftover_variants_output_required
@Output needs to be required for LiftoverVariants to prevent a NPE and d...
2013-05-17 07:49:55 -07:00
Eric Banks 8a442d3c9f @Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating. 2013-05-17 10:04:10 -04:00
MauricioCarneiro 92072e1815 Merge pull request #224 from broadinstitute/yf_emit_insert_length_with_pileup
added a @hidden option to PileupWalker that causes it to emit insert sizes
2013-05-16 11:30:19 -07:00
Yossi Farjoun 3e2a0b15ed - Added a @Hidden option ( -outputInsertLength ) to PileupWalker that causes it to emit insert sizes together with the pileup (to assist Mark Daly's investigation of the contamination dependance on insert length)
- Converted my old GATKBAMIndexText (within PileupWalkerIntegrationTest) to use a dataProvider
- Added two integration tests to test -outputInsertLength option
2013-05-16 12:47:16 -04:00
Yossi Farjoun 9234a0efcd Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs
Bug fixes and missing interval functionality for Diagnose Targets

While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred that unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule).

It is with these cavets that I accept the pull request.
2013-05-16 09:25:54 -07:00
droazen a733a5e9b7 Merge pull request #228 from broadinstitute/mc_fine_grained_maxruntime
Subshard timeouts in the GATK
2013-05-15 05:25:43 -07:00
Mark DePristo 371f3752c1 Subshard timeouts in the GATK
-- The previous implementation of the maxRuntime would require us to wait until all of the work was completed within a shard, which can be a substantial amount of work in the case of a locus walker with 16kb shards.
-- This implementation ensures that we exit from the traversal very soon after the max runtime is exceeded, without completely all of our work within the shard.  This is done by updating all of the traversal engines to return false for hasNext() in the nano scheduled input provider.  So as soon as the timeout is exceeeded, we stop generating additional data to process, and we only have to wait until the currently executing data processing unit (locus, read, active region) completes.
-- In order to implement this timeout efficiently at this fine scale, the progress meter now lives in the genome analysis engine, and the exceedsTimeout() call in the engine looks at a periodically updated runtime variable in the meter.  This variable contains the elapsed runtime of the engine, but is updated by the progress meter daemon thread so that the engine doesn't call System.nanotime() in each cycle of the engine, which would be very expense.  Instead we basically wait for the daemon to update this variable, and so our precision of timing out is limited by the update frequency of the daemon, which is on the order of every few hundred milliseconds, totally fine for a timeout.
-- Added integration tests to ensure that subshard timeouts are working properly
2013-05-15 07:00:39 -04:00
chartl 6137e5f89b Merge pull request #227 from broadinstitute/chartl_genotypeconcordance_integration_md5_fix
Update GCIT md5s to account for trivial changes to description strings
2013-05-14 17:26:24 -07:00
Mark DePristo 43e78286a0 Merge pull request #226 from broadinstitute/hc_ceu_trio_calling
Trivial update to ceutrio.ped file to make it really the CEU trio samples
2013-05-14 17:02:52 -07:00
Chris Hartl 6da0aed30f Update GCIT md5s to account for trivial changes to description strings 2013-05-14 19:45:30 -04:00
Yossi Farjoun 409a202492 Merge pull request #214 from broadinstitute/chartl_genotype_concordance_diploid_and_OGC
Add overall genotype concordance to the genotype concordance tool. In ad...
2013-05-14 14:19:54 -07:00
Mark DePristo 7d78a77f17 Trivial update to ceutrio.ped file to make it really the CEU trio sample names 2013-05-14 17:08:13 -04:00
Mark DePristo e9427d817f Merge pull request #225 from broadinstitute/eb_change_consensus_logic_in_kb
Fixed the logic of FP and DISCORDANT consensus.
2013-05-14 11:49:43 -07:00
Eric Banks b951ab1337 Fixed the logic of FP and DISCORDANT consensus.
* added a good unit test for this case
  * also, added the Reviewed status to the INFO field of the MVC record
2013-05-14 13:13:30 -04:00
Mark DePristo 349521e4c2 Merge pull request #220 from broadinstitute/md_nano_hc
NanoScheduling for the HaplotypeCaller!
2013-05-13 09:07:13 -07:00
Mauricio Carneiro adcbf947bf Update MD5s and the Diagnose Target scala script 2013-05-13 12:06:17 -04:00
Mauricio Carneiro 9eceae793a Tool to manipulate intervals outside the GATK
Performs basic set operations on intervals like union, intersect and difference between two or more intervals. Useful for techdev and QC purposes.
2013-05-13 11:56:24 -04:00
Mauricio Carneiro 3dbb86b052 Outputting missing intervals in DiagnoseTargets
Problem
------
Diagnose Targets identifies holes in the coverage of a targetted experiment, but it only reports them doesn't list the actual missing loci

Solution
------
This commit implements an optional intervals file output listing the exact loci that did not pass filters

Itemized changes
--------------
   * Cache callable statuses (to avoid recalculation)
   * Add functionality to output missing intervals
   * Implement new tool to qualify the missing intervals (QualifyMissingIntervals) by gc content, size, type of missing coverage and origin (coding sequence, intron, ...)
2013-05-13 11:51:56 -04:00
Mauricio Carneiro 1466396a31 Diagnose target is outputting intervals out of order
Problem
-------
When the interval had no reads, it was being sent to the VCF before the intervals that just got processed, therefore violating the sort order of the VCF.

Solution
--------
Use a linked hash map, and make the insertion and removal all happen in one place regardless of having reads or not. Since the input is ordered, the output has to be ordered as well.

Itemized changes
--------------
   * Clean up code duplication in LocusStratification and SampleStratification
   * Add number of uncovered sites and number of low covered sites to the VCF output.
   * Add new VCF format fields
   * Fix outputting multiple status when threshold is 0 (ratio must be GREATER THAN not equal to the threshold to get reported)

[fixes #48780333]
[fixes #48787311]
2013-05-13 11:50:22 -04:00
Mark DePristo 39e4396de0 New ActiveRegionShardBalancer allows efficient NanoScheduling
-- Previously we used the LocusShardBalancer for the haplotype caller, which meant that TraverseActiveRegions saw its shards grouped in chunks of 16kb bits on the genome.  These locus shards are useful when you want to use the HierarchicalMicroScheduler, as they provide fine-grained accessed to the underlying BAM, but they have two major drawbacks (1) we have to fairly frequently reset our state in TAR to handle moving between shard boundaries and (2) with the nano scheduled TAR we end up blocking at the end of each shard while our threads all finish processing.
-- This commit changes the system over to using an ActiveRegionShardBalancers, that combines all of the shard data for a single contig into a single combined shard.  This ensures that TAR, and by extensions the HaplotypeCaller, gets all of the data on a single contig together so the the NanoSchedule runs efficiently instead of blocking over and over at shard boundaries.  This simple change allows us to scale efficiently to around 8 threads in the nano scheduler:
  -- See https://www.dropbox.com/s/k7f280pd2zt0lyh/hc_nano_linear_scale.pdf
  -- See https://www.dropbox.com/s/fflpnan802m2906/hc_nano_log_scale.pdf
-- Misc. changes throughout the codebase so we Use the ActiveRegionShardBalancer where appropriate.
-- Added unit tests for ActiveRegionShardBalancer to confirm it does the merging as expected.
-- Fix bad toString in FilePointer
2013-05-13 11:09:02 -04:00
Mark DePristo b4f482a421 NanoScheduled ActiveRegionTraversal and HaplotypeCaller
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
  -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
  -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
2013-05-13 11:09:02 -04:00
Mark DePristo 82e6a77385 Merge pull request #221 from broadinstitute/eb_greedy_SW_v2_GSA-936
New faster Smith-Waterman implementation that is edge greedy and assumes...
2013-05-13 07:50:50 -07:00
Eric Banks 2f5ef6db44 New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points.
* This version inherits from the original SW implementation so it can use the same matrix creation method.
   * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
     right thing for indels at the edges of the alignments.
     * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
   * Lots of systematic testing added for this implementation.
   * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
2013-05-13 09:36:39 -04:00
Mark DePristo cb49b8cc71 Merge pull request #222 from broadinstitute/dr_convenient_diff_engine_md5_display_in_bamboo_GSA-915
Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements
2013-05-11 10:58:42 -07:00
David Roazen 639030bd6d Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements
-Diff engine output is now included in the actual exception message thrown as a
 result of an MD5 mismatch, which allows it to be conveniently viewed on the
 main page of a build in Bamboo.

Minor Additional Improvements:

-WalkerTestSpec now auto-detects test class name via new JVMUtils.getCallingClass()
 method, and the test class name is now included as a regular part of integration
 test output for each test.

-Fix race condition in MD5DB.ensureMd5DbDirectory()

-integrationtests dir is now cleaned by "ant clean"

GSA-915 #resolve
2013-05-10 19:00:33 -04:00
Mark DePristo 111e8cef0f Merge pull request #219 from broadinstitute/eb_rr_multisample_fix
Fix bug in Reduce Reads that arises in multi-sample mode.
2013-05-09 15:31:14 -07:00
Ryan Poplin 1078d0e51c Merge pull request #216 from broadinstitute/md_better_assembly
Replace DeBruijnAssembler with ReadThreadingAssembler
2013-05-09 07:36:17 -07:00
Eric Banks 8b9c6aae3e Fix bug in Reduce Reads that arises in multi-sample mode.
* bitset could legitimately be in an unfinished state but we were trying to access it without finalizing.
  * added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode.
  * tests were easiest to implement as integration tests (this was a really complicated case).
2013-05-08 23:23:51 -04:00
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
Chris Hartl d3c9910af6 Cosmetic changes (comments and variable names) to GenotypeConcordance and ConcordanceMetrics to address reviewer comments. 2013-05-08 17:25:14 -04:00
Mark DePristo fbc683f1e8 Merge pull request #218 from broadinstitute/dr_rev_picard_to_1.91
Rev picard, sam-jdk, tribble, and variant to version 1.91.1453
2013-05-07 14:32:03 -07:00
David Roazen 52c20a8416 Rev picard, sam-jdk, tribble, and variant to version 1.91.1453
-mainly for the seek() performance optimization
2013-05-07 14:21:15 -04:00
Mark DePristo 178f1499ee Merge pull request #217 from broadinstitute/md_queue_jobreport_improvements
Improve queue script jobreport visualization script
2013-05-07 09:14:03 -07:00
Mark DePristo 2b86ab02be Improve queue script jobreport visualization script
-- the Queue jobreport PDF script now provides a high-level summary of the de-scattered runtimes of each analysis, so that its easy to see where your script is spending its time across scatters.
2013-05-07 12:11:46 -04:00
Mark DePristo b153042a9d Merge pull request #215 from broadinstitute/eb_misc_improvements
Eb misc improvements
2013-05-07 06:05:53 -07:00
Eric Banks d242f1bba3 Secondary alignments were not handled correctly in IndelRealigner
* This is emerging now because BWA-MEM produces lots of reads that are not primary alignments
 * The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it
     was getting confused by these secondary alignments
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Eric Banks b53336c2d0 Added hidden mode for BQSR to force all read groups to be the same one.
* Very useful for debugging sample-specific issues
 * This argument got lost in the transition from BQSR v1 to v2
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Chris Hartl 6ff74deac7 Add overall genotype concordance to the genotype concordance tool. In addition, protect from non-diploid genotypes, which can cause very strange behavior.
Update MD5 sums. As expected, md5 changes are consistent with the genotype concordance field being added to each output.
2013-05-06 13:06:30 -04:00
chartl 98021db264 Merge pull request #208 from broadinstitute/yf_fix_molten_GenotypeConcordance
- Fixed a small bug in the printout of molten data in GenotypeConcordanc...
2013-05-06 08:42:06 -07:00
Mark DePristo 084a464779 Merge pull request #213 from broadinstitute/gda_md5_fix
Re-fix md5's that changed due to conflicting pushes
2013-05-03 16:21:46 -07:00
Guillermo del Angel 874dc8f9c1 Re-fix md5's that changed due to conflicting pushes 2013-05-03 14:59:16 -04:00
Mark DePristo 8bac61f9fb Merge pull request #212 from broadinstitute/md_misc_improvements
Several important bugfixes / improvements to general GATK infrastructure
2013-05-03 08:20:55 -07:00
Mark DePristo f42bb86bdd e# This is a combination of 2 commits.
Only try to clip adaptors when both reads of the pair are on opposite strands

-- Read pairs that have unusual alignments, such as two reads both oriented like:

  <-----
     <-----

where previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs.  This caused us to clip extra good bases from reads.
-- Update MD5s due change in adaptor clipping, which add some coverage in some places
2013-05-03 11:19:14 -04:00
Mark DePristo 0587a145bf Utils.dupString should allow 0 number of duplicates to produce empty string 2013-05-03 09:32:05 -04:00
Mark DePristo f5a301fb63 Bugfix for AlignmentUtils.trimCigarByBases
-- Previous version would trim down 2M2D2M into 2M if you asked for the first 2 bases, but this can result in incorrect alignment of the bases to the reference as the bases no longer span the full reference interval expected.  Fixed and added unit tests
2013-05-03 09:32:05 -04:00
Mark DePristo 2bcbdd469f leftAlignCigarSequentially now supports haplotypes with insertions and deletions where the deletion allele was previously removed by the leftAlignSingleIndel during it's cleanup phase. 2013-05-03 09:32:05 -04:00
Eric Banks aefddaa219 Merge pull request #210 from broadinstitute/gda_QUAL_fix_GSA-910
Rev'd up Picard to get PL fix: PLs were saturated to 32767 (Short.MAX_VA...
2013-05-03 06:28:10 -07:00
Mark DePristo 4f627e96c0 Merge pull request #211 from broadinstitute/md_kb_script
Starting NA12878 KB requires use Java-1.7
2013-05-03 04:30:23 -07:00