Commit Graph

12096 Commits (b9c331c2fa3299244f87694bf1afd94b16a868f6)

Author SHA1 Message Date
Ryan Poplin b9c331c2fa Bug fix in HC gga mode.
-- Don't try to test alleles which haven't had haplotypes assigned to them
2013-03-21 11:02:41 -04:00
Ryan Poplin 1a95ce5dcf Merge pull request #122 from broadinstitute/md_ceu_trio_calls_2x250_GSA-739
Many improvements to HaplotypeCaller for CEU trio best practice variant calling
2013-03-21 06:58:13 -07:00
Mark DePristo aa7f172b18 Cap the computational cost of the kmer based error correction in the DeBruijnGraph
-- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph.  If the number of ops would exceed this threshold, the original graph is used.
-- Overall the algorithm is just extremely computationally expensive, and actually doesn't implement the correct correction.  So we live with this limitation while we continue to explore better algorithms
-- Updating MD5s to reflect changes in assembly algorithms
2013-03-21 09:21:35 -04:00
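The capping rule described in commit aa7f172b18 can be illustrated with a minimal sketch. Only the constant MAX_CORRECTION_OPS_TO_ALLOW and its value come from the commit message; the class name, the all-pairs cost model in estimateOps, and the generic corrector argument are assumptions made for illustration, not the actual DeBruijnGraph code.

```java
import java.util.function.UnaryOperator;

final class CappedCorrectionSketch {
    // Value taken from the commit message.
    static final long MAX_CORRECTION_OPS_TO_ALLOW = 5000L * 1000L;

    // Assumed cost model: every kmer is compared against every other kmer,
    // and each comparison costs kmerSize base comparisons.
    static long estimateOps(int nKmers, int kmerSize) {
        return (long) nKmers * nKmers * kmerSize;
    }

    // Returns the corrected graph, or the original graph untouched when the
    // estimated cost exceeds the budget.
    static <G> G correctIfAffordable(G graph, int nKmers, int kmerSize, UnaryOperator<G> corrector) {
        if (estimateOps(nKmers, kmerSize) > MAX_CORRECTION_OPS_TO_ALLOW) {
            return graph; // too expensive: fall back to the uncorrected graph
        }
        return corrector.apply(graph);
    }
}
```

Checking the budget before doing any work keeps the fallback cheap: the original graph is returned as-is, so no partially corrected state can leak out.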
Mark DePristo d94b3f85bc Increase NUM_BEST_PATHS_PER_KMER_GRAPH in DeBruijnAssembler to 25
-- The value of 11 was too small to properly return a real low-frequency variant in our 1000G AFR integration test.
2013-03-20 22:54:38 -04:00
Mark DePristo 6d7d21ca47 Bugfix for incorrect branch diamond merging algorithm
-- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices.  Fixed.  In the process simplified the entire algorithm and codebase
-- Resolves https://jira.broadinstitute.org/browse/GSA-884
2013-03-20 22:54:37 -04:00
Mark DePristo 3a8f001c27 Misc. fixes upon pull request review
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful
2013-03-20 22:54:37 -04:00
Mark DePristo d3b756bdc7 BaseVertex optimization: don't clone byte[] unnecessarily
-- Don't clone sequence upon construction or in getSequence(), as these are frequently called, memory allocating routines and cloning will be prohibitively expensive
2013-03-20 22:54:37 -04:00
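A minimal sketch of the no-clone trade-off described in commit d3b756bdc7. The class name is hypothetical and this is not the real BaseVertex; it only shows that storing and returning the same array avoids an allocation per call, at the price of callers having to treat the array as read-only.

```java
final class SharedSequenceVertex {
    private final byte[] sequence;

    // Stores the caller's array directly instead of sequence.clone().
    SharedSequenceVertex(byte[] sequence) {
        this.sequence = sequence;
    }

    // Returns the internal array itself; callers must not modify it.
    byte[] getSequence() {
        return sequence;
    }
}
```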
Mark DePristo 5226b24a11 HaplotypeCaller infrastructure cleanup and unit testing
-- UnitTest for isRootOfDiamond along with key bugfix detected while testing
-- Fix up the equals methods in BaseEdge.  Now called hasSameSourceAndTarget and seqEquals.  A much more meaningful naming
-- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs
-- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence
-- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself
-- Massive expansion of the SeqGraph unit tests.  We now really test out the zipping and branch merging code.
-- Near final cleanup of the current codebase
-- DeBruijnVertex cleanup and optimizations.  Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[].  Optimize the code to make use of this constraint
2013-03-20 22:54:37 -04:00
Mark DePristo 2e36f15861 Update md5s to reflect new downsampling and assembly algorithm output
-- Only minor differences, with improvement in allele discovery where the sites differ.  The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome
2013-03-20 22:54:37 -04:00
Mark DePristo 1fa5050faf Cleanup, unit test, and optimize KBestPaths and Path
-- Split Path from inner class of KBestPaths
-- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation
-- Path now properly typed throughout the code
-- Path maintains an on-demand hashset of BaseEdges so that path.containsEdge is fast
2013-03-20 22:54:36 -04:00
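Commit 1fa5050faf tracks the best k paths with Guava's MinMaxPriorityQueue. To stay free of external dependencies, the sketch below swaps in java.util.PriorityQueue as a bounded min-heap, which is a different data structure that gives the same "keep only the k best" behavior. The class name and the use of plain integer scores in place of Path objects are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class KBestScoresSketch {
    // Returns the k highest scores in descending order.
    static List<Integer> kBest(Iterable<Integer> scores, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap: smallest score on top
        for (int s : scores) {
            heap.add(s);
            if (heap.size() > k) {
                heap.poll(); // evict the current worst so at most k remain
            }
        }
        List<Integer> best = new ArrayList<>(heap);
        best.sort(Collections.reverseOrder());
        return best;
    }
}
```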
Mark DePristo 98c4cd060d HaplotypeCaller now uses SeqGraph instead of kmer graph to build haplotypes.
-- DeBruijnAssembler functions are no longer static.  This isn't the right way to unit test your code
-- Add a HaplotypeCaller command line option to use low-quality bases in the assembly
-- Refactored DeBruijnGraph and associated libraries into base class
-- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents.  These DeBruijn versions now inherit from these base classes.  Added some reasonable unit tests for the base and Debruijn edges and vertex classes.
-- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct
-- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split
-- Moved generic methods in DeBruijnAssembler into BaseGraph
-- Created a simple SeqGraph that contains SeqVertex objects
-- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs
-- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -> Y -> a, a -> Z
-- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges
-- Fix all unit tests so they work with the new SeqGraph system.  All tests passed without modification.
-- Debugging makes it easier to tell which kmer graph contributes to a haplotype
-- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector
-- Remove unnecessary printing of cleaning info in BaseGraph
-- Turn off kmer graph creation in DeBruijnAssembler.java
-- Only print SeqGraphs when debugGraphTransformations is set to true
-- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest.  Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality.
-- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs
-- DebruijnVertex no longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now
2013-03-20 22:54:36 -04:00
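The diamond remodeling in commit 98c4cd060d turns two middle vertices Xa and Ya into X, Y and a shared suffix vertex a. The core step, finding the common suffix and splitting each sequence around it, might look like the sketch below. The class and method names are hypothetical, and the graph rewiring itself is left out.

```java
import java.util.Arrays;

final class CommonSuffixSketch {
    // Length of the longest suffix shared by both sequences.
    static int commonSuffixLength(byte[] a, byte[] b) {
        int n = 0;
        while (n < a.length && n < b.length
                && a[a.length - 1 - n] == b[b.length - 1 - n]) {
            n++;
        }
        return n;
    }

    // Splits seq into { prefix, suffix } where suffix has suffixLen bases.
    static byte[][] split(byte[] seq, int suffixLen) {
        int cut = seq.length - suffixLen;
        return new byte[][] {
            Arrays.copyOfRange(seq, 0, cut),
            Arrays.copyOfRange(seq, cut, seq.length)
        };
    }
}
```

For middle vertices ACGT and TTGT, the shared suffix is GT, leaving AC and TT as the distinct branches that feed a single GT vertex.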
Mark DePristo 0f4328f6fe Basic kmer error correction algorithm for the HaplotypeCaller
-- Error correction algorithm for the assembler.  Only error correct reads to others that are exactly 1 mismatch away
-- The assembler logic is now: build initial graph, error correct*, merge nodes*, prune dead nodes, merge again, make haplotypes.  The * elements are new
-- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing.
-- Easier way to control the testing of the graph assembly algorithms
-- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler
-- Simple protected parsing function for making DeBruijnAssemblyGraph
-- Change the default prune factor for the graph to 1, from 2
-- debugging graph transformations are controllable from command line
2013-03-20 22:54:36 -04:00
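The "exactly 1 mismatch away" rule from commit 0f4328f6fe can be sketched as a Hamming-distance check between equal-length kmers. The class and method names are hypothetical, and how the assembler picks which kmer to correct toward is not covered by the commit message, so it is not modeled here.

```java
final class OneMismatchSketch {
    // Number of positions at which two equal-length kmers differ.
    static int mismatches(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("kmers must have equal length");
        }
        int n = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) n++;
        }
        return n;
    }

    // True only when the two kmers differ at exactly one position.
    static boolean isCorrectionCandidate(byte[] kmer, byte[] target) {
        return mismatches(kmer, target) == 1;
    }
}
```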
Mark DePristo 53a904bcbd Bugfix for HaplotypeCaller: GSA-822 for trimming softclipped reads
-- Previous version would not trim down soft clip bases that extend beyond the active region, causing the assembly graph to go haywire.  The new code explicitly reverts soft clips to M bases with the ever useful ReadClipper, and then trims.  Note this isn't a 100% fix for the issue, as it's possible that the newly unclipped bases might in reality extend beyond the active region, should their true alignment include a deletion in the reference.  Needs to be fixed.  JIRA added

-- See https://jira.broadinstitute.org/browse/GSA-822
-- #resolve #fix GSA-822
2013-03-20 22:54:36 -04:00
Mark DePristo ffea6dd95f HaplotypeCaller now has the ability to only consider the best N haplotypes for genotyping
-- Added a -dontGenotype mode for testing assembly efficiency
-- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted
2013-03-20 22:54:36 -04:00
Mark DePristo a8fb26bf01 A generic downsampler that reduces coverage for a bunch of reads
-- Exposed the underlying minElementsPerStack parameter for LevelingDownsampler
2013-03-20 22:54:35 -04:00
Mark DePristo 752440707d AlignmentUtils.calcNumDifferentBases computes the number of bases that differ between a reference and read sequence given a cigar between the two. 2013-03-20 22:54:35 -04:00
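A rough sketch of counting differing bases along a CIGAR, in the spirit of commit 752440707d. This is not the actual AlignmentUtils.calcNumDifferentBases: the class name, the plain-string CIGAR argument, and the counting policy (each mismatched aligned base counts once, each inserted or deleted base counts once, soft clips are skipped) are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumDifferentBasesSketch {
    private static final Pattern CIGAR_ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    static int numDifferentBases(String cigar, byte[] ref, byte[] read) {
        int refPos = 0, readPos = 0, diff = 0;
        Matcher m = CIGAR_ELEMENT.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            switch (m.group(2).charAt(0)) {
                case 'M': case '=': case 'X': // aligned bases: compare one by one
                    for (int i = 0; i < len; i++) {
                        if (ref[refPos + i] != read[readPos + i]) diff++;
                    }
                    refPos += len;
                    readPos += len;
                    break;
                case 'I': // bases present only in the read
                    diff += len;
                    readPos += len;
                    break;
                case 'D': // bases present only in the reference
                    diff += len;
                    refPos += len;
                    break;
                case 'N': // skipped reference region, not counted
                    refPos += len;
                    break;
                case 'S': // soft clip: consumes read bases, not counted (assumption)
                    readPos += len;
                    break;
                default:  // H and P consume neither sequence
                    break;
            }
        }
        return diff;
    }
}
```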
Mark DePristo a783f19ab1 Fix for potential HaplotypeCaller bug in annotation ordering
-- Annotations were being called on VariantContext that might need to be trimmed.  Simply inverted the order of operations so trimming occurs before the annotations are added.
-- Minor cleanup of call to PairHMM in LikelihoodCalculationEngine
2013-03-20 22:54:35 -04:00
Mark DePristo 559a4bc05d Updating general calling pipeline to work with newer HC and UG arguments and filtering
-- Use default VQSR params of QD, FS, DP and MQ for SNPs, with ReadPosRankSum and HaplotypeScore for UG SNPs
-- Add combine variants to GeneralCallingPipeline
-- Fix incorrect intervals in HaplotypeCaller in GeneralCallingPipeline.scala
-- GCP now emits tables for VCFs by default
-- GCP runs HC first before UG
-- GeneralCallingPipeline now jointly calls input BAMs, not separately processes them.  Ready to handle CEU trio calling
-- Assess NA12878 on the particularly well reviewed 10-11mb in addition to all of 20
-- Use 4G for HC
2013-03-20 22:54:35 -04:00
Eric Banks 1fae750ebe Merge pull request #120 from broadinstitute/aw_reduce_reads_clear_name_cache
Clear ReduceReads name cache after each set of reads produced by ReduceR...
2013-03-20 19:47:42 -07:00
Mark DePristo 7e29beadff Merge pull request #121 from broadinstitute/gda_hc_gls_for_1000g_GSA-878
Fix (rather workaround) encountered when running HaplotypeCaller in GGA ...
2013-03-20 14:08:10 -07:00
Guillermo del Angel ea01dbf130 Fix to issue encountered when running HaplotypeCaller in GGA mode with data from other 1000G callers.
In particular, someone produced a tandem repeat site with 57 alt alleles (sic) which made the caller blow up.
Inelegant fix is to detect if # of alleles is > our max cached capacity, and if so, emit an informative warning and skip site.
-- Added unit test to UG engine to cover this case.
-- Commit to posterity private scala script currently used for 1000G indel consensus (still very much subject to changes).
GSA-878 #resolve
2013-03-20 14:30:37 -04:00
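The workaround in commit ea01dbf130 (warn and skip when a site has more alleles than the cache can hold) might be sketched as follows. The class name, method name, warning text, and the capacity value of 50 are hypothetical; the commit message does not state the real limit.

```java
final class AlleleCapSketch {
    // Hypothetical capacity; the real maximum cached value is not given in the commit.
    static final int MAX_CACHED_ALLELES = 50;

    // Returns true (after printing a warning) when the site should be skipped.
    static boolean shouldSkipSite(int nAlleles, String site) {
        if (nAlleles > MAX_CACHED_ALLELES) {
            System.err.println("WARNING: skipping site " + site + ": " + nAlleles
                    + " alleles exceeds the supported maximum of " + MAX_CACHED_ALLELES);
            return true;
        }
        return false;
    }
}
```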
MauricioCarneiro 470746c907 Merge pull request #117 from broadinstitute/gg_handling_deprecated_tools_45941819
gg handling deprecated tools 45941819
2013-03-20 07:31:33 -07:00
Geraldine Van der Auwera d70bf64737 Created new DeprecatedToolChecks class
--Based on existing code in GenomeAnalysisEngine
	--Hashmaps hold mapping of deprecated tool name to version number and recommended replacement (if any)
	--Using FastUtils for maps; specifically Object2ObjectMap but there could be a better type for Strings...
	--Added user exception for deprecated annotations
	--Added deprecation check to AnnotationInterfaceManager.validateAnnotations
	--Run when annotations are initialized
	--Made annotation sets instead of lists
2013-03-20 06:46:02 -04:00
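The lookup described in commit d70bf64737 maps a deprecated tool name to the version that removed it and an optional replacement. A minimal sketch, using java.util.HashMap in place of the FastUtil maps the commit mentions; the class name, the tool names, the version numbers, and the message wording are placeholders, not real GATK deprecation entries.

```java
import java.util.HashMap;
import java.util.Map;

final class DeprecatedToolChecksSketch {
    // name -> { version in which the tool was removed, replacement or null }
    private static final Map<String, String[]> DEPRECATED = new HashMap<>();
    static {
        DEPRECATED.put("OldWalker", new String[] {"2.0", "NewWalker"}); // placeholder entry
        DEPRECATED.put("RetiredWalker", new String[] {"2.2", null});    // placeholder, no replacement
    }

    static boolean isDeprecated(String tool) {
        return DEPRECATED.containsKey(tool);
    }

    // Returns a user-facing message, or null when the tool is not deprecated.
    static String message(String tool) {
        String[] info = DEPRECATED.get(tool);
        if (info == null) return null;
        String msg = tool + " is no longer available as of version " + info[0];
        return info[1] == null ? msg : msg + "; use " + info[1] + " instead";
    }
}
```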
Geraldine Van der Auwera 6b4d88ebe9 Created ListAnnotations utility (extends CommandLineProgram)
--Refactored listAnnotations basic method out of VA into HelpUtils
	--HelpUtils.listAnnotations() is now called by both VA and the new ListAnnotations utility (lives in sting.tools)
	--This way we keep the VA --list option but we also offer a way to list annotations without a full valid VA command-line, which was a pain users continually complained about
	--We could get rid of the VA --list option altogether ...?
2013-03-20 06:15:27 -04:00
Geraldine Van der Auwera 95a9ed853d Made some documentation updates & fixes
--Mostly doc block tweaks
	--Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.
2013-03-20 06:15:20 -04:00
Alec Wysoker bccc9d79e5 Clear ReduceReads name cache after each set of reads produced by ReduceReadsStash.
Name cache was filling up with names of all reads in entire file, which for large file eventually
consumes all of memory.  Only keep read name cache for the reads that are together in one variant
region, so that a pair of reads within the same variant region will still be joined via read name.
Otherwise the ability to connect a read to its mate is lost.

Update MD5s in integration test to reflect altered output.
Add new integration test that confirms that pair within variant region is joined by read name.
2013-03-19 14:12:33 -04:00
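The memory fix in commit bccc9d79e5 amounts to scoping the read name cache to one variant region. The sketch below assumes the cache maps each original read name to a short generated name, which the commit message does not spell out; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class RegionNameCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private long nextId = 0; // keeps counting across regions so short names stay unique

    // Mates share a read name, so within one region they get the same short name.
    String compress(String readName) {
        return cache.computeIfAbsent(readName, n -> Long.toString(nextId++));
    }

    // Called after each set of reads is emitted, bounding memory to one region.
    void clear() {
        cache.clear();
    }

    int size() {
        return cache.size();
    }
}
```

Clearing per region trades away mate linkage across regions, which matches the limitation the commit message describes.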
Ryan Poplin c813259283 Merge pull request #119 from broadinstitute/md_assessn12878_bugfixes
AssessNA12878 bugfixes
2013-03-19 05:11:50 -07:00
David Roazen d4f873f664 Revert "github webhook handler: convert from daemon to cron job"
Turns out the email script doesn't work correctly from cron.
Converting the webhook script back to a daemon for now until
it can be made to work as a cron job.

This reverts commit 9679accb641537f5c637cce0aeb63f3925521b42.
2013-03-19 03:50:39 -04:00
David Roazen ff79118379 github webhook handler: convert from daemon to cron job
-having this as a daemon was annoying because we had to be sure to
 re-spawn the daemon whenever it got killed

-now it will be run as a cron job once per minute

-delete now-unnecessary spawn script
2013-03-19 02:47:13 -04:00
David Roazen f9ad8d4325 Merged bug fix from Stable into Unstable
Conflicts:
	private/gsa-engineering/pdfgen/trigger_pdfgen.sh
2013-03-19 01:23:58 -04:00
David Roazen 532efad8cd Release scripts: small changes to reduce intermittent failures
-don't check exit status of wget in the trigger_pdfgen script;
 it was exiting with non-0 status even though the pdf generation
 was being triggered correctly

-introduce a delay after filtering the git history to allow HEAD
 to be properly reset

-re-enable sanity checks in filter_stable and source_release scripts
 that had temporarily been disabled while the new protected repository
 was being set up
2013-03-19 01:09:30 -04:00
Mark DePristo d7bec9eb6e AssessNA12878 bugfixes
-- @Output isn't required for AssessNA12878
-- Previous version would count non-variant sites in NA12878 that resulted from subsetting a multi-sample VC to NA12878 as CALLED_BUT_NOT_IN_DB sites.  Now they are properly skipped
-- Bugfix for subsetting samples to NA12878.  Previous version wouldn't trim the alleles when subsetting down a multi-sample VCF, so we'd have false FN/FP sites at indels when the multi-sample VCF has alleles that result in the subset for NA12878 having non-trimmed alleles.  Fixed and unit tested now.
2013-03-18 15:48:08 -04:00
Eric Banks a36e2b8f9d Merge pull request #118 from broadinstitute/ami-typoInCoveredByNSamplesSites
fix typos in argument docs in CoveredByNSamplesSites and rewrite an unac...
2013-03-18 11:10:10 -07:00
Ami Levy-Moonshine 0e9c1913ff fix typos in argument docs and in printed output in CoveredByNSamplesSites and rewrite an inaccurate comment 2013-03-18 13:54:21 -04:00
Mark DePristo 2b80068164 Merged bug fix from Stable into Unstable 2013-03-18 12:36:21 -04:00
Mark DePristo 7ab7c873a1 Temp. fix to PairHMM to avoid bad likelihoods
-- Simply caps PairHMM likelihoods from rising above 0 by taking the min of the likelihood and 0.  Will be properly fixed in GATK 2.5 with better PairHMM implementation.
2013-03-18 12:34:51 -04:00
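The cap in commit 7ab7c873a1 is a one-line clamp. The sketch assumes the values are log10 likelihoods, which is why anything above 0 would be invalid; the class and method names are hypothetical.

```java
final class LikelihoodCapSketch {
    // Clamps a log-scale likelihood so it never rises above 0.
    static double cap(double log10Likelihood) {
        return Math.min(log10Likelihood, 0.0);
    }
}
```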
David Roazen a67d8c8dd6 Bump timeout for MaxRuntimeIntegrationTest
Looks like returning this timeout to its original value was a
bit too aggressive -- adding 40 seconds to the tolerance limit.
2013-03-17 16:17:29 -04:00
droazen a67aae0261 Merge pull request #114 from broadinstitute/dr_tweak_test_timeouts
Further tweaking of test timeouts
2013-03-15 15:43:55 -07:00
Mark DePristo d86a1242d1 Merge pull request #115 from broadinstitute/md_kb_unstable_server_GSA-778
NA12878 KB startup script takes full path to GATK.jar
2013-03-15 13:34:10 -07:00
Mark DePristo 2f27e5682a NA12878 KB startup script takes full path to GATK.jar 2013-03-15 16:33:29 -04:00
David Roazen 236eb54abd Trivial script to publish private unstable jars for group use
-Jars will get updated every time the "Serial Commit Tests" plan in
 Bamboo passes on the master branch

-Differs from the nightly builds in that it includes "private" and
 has actually passed the test suite

-latest jar is always located at:
 /humgen/gsa-hpprojects/GATK/private_unstable_builds/GenomeAnalysisTK_latest_unstable.jar
2013-03-15 16:00:59 -04:00
Mark DePristo 090db06793 Merge pull request #110 from broadinstitute/rp_fix_extending_partial_haplotype_bug_GSA-840
Bug fix in assembly for edge case in which the extendPartialHaplotype fu...
2013-03-15 11:53:31 -07:00
David Roazen 742a7651e9 Further tweaking of test timeouts
Increase one timeout, restore others that were only timing out due to the
Java crypto lib bug to their original values.

-DOUBLE timeout for NanoSchedulerUnitTest.testNanoSchedulerInLoop()

-REDUCE timeout for EngineFeaturesIntegrationTest to its original value

-REDUCE timeout for MaxRuntimeIntegrationTest to its original value

-REDUCE timeout for GATKRunReportUnitTest to its original value
2013-03-15 14:49:21 -04:00
droazen e681df68c9 Merge pull request #113 from broadinstitute/dr_parallel_tests_print_exited_classes
parallel tests: print names of test classes that had an error in real time
2013-03-15 11:41:40 -07:00
David Roazen 68c6ebd93f parallel tests: print names of test classes that had an error in real time 2013-03-15 14:28:20 -04:00
Ryan Poplin 0cf5d30dac Bug fix in assembly for edge case in which the extendPartialHaplotype function was filling in deletions in the middle of haplotypes. 2013-03-15 14:20:25 -04:00
droazen 9d6d1f94b0 Merge pull request #112 from broadinstitute/dr_parallel_tests_print_unfinished_classes
parallel tests: start printing the names of unfinished test classes once...
2013-03-15 10:57:59 -07:00
Mark DePristo 4a042e9bff Merge pull request #111 from broadinstitute/rp_no_ref_padding_bug_GSA-860
Fix for edge case bug of trying to create insertions/deletions on the ed...
2013-03-15 10:34:45 -07:00
David Roazen f42a52c090 parallel tests: start printing the names of unfinished test classes once there are < 10 jobs left
This will let us see in real time in Bamboo which classes are preventing
our runs from finishing
2013-03-15 13:34:30 -04:00
Ryan Poplin b8991f5e98 Fix for edge case bug of trying to create insertions/deletions on the edge of contigs.
-- Added integration test using MT that previously failed
2013-03-15 12:32:13 -04:00