gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	1917d55dc2	Bugfix for DeBruijnAssembler: don't fail when read length > haplotype length -- The previous version would generate graphs that had no reference bases at all in the situation where the reference haplotype was < the longer read length, which would cause the kmer size to exceed the reference haplotype length. Now return immediately with a null graph when this occurs as opposed to continuing and eventually causing an error	2013-03-26 10:12:17 -04:00
Mark DePristo	464e65ea96	Disable error correcting kmers by default in the HC -- The error correction algorithm can break the reference graph in some cases by error correcting us into a bad state for the reference sequence. Because we know that the error correction algorithm isn't ideal, and worse, doesn't actually seem to improve the calling itself on chr20, I've simply disabled error correction by default and allowed it to be turned on with a hidden argument. -- In the process I've changed the assembly interface a bit, moving some common arguments up into the LocalAssemblyEngine, which are turned on/off via setter methods. -- Went through the updated arguments in the HC to be @Hidden and @Advanced as appropriate -- Don't write out an error-corrected graph when debugging and error correction isn't enabled	2013-03-26 10:05:17 -04:00
Mark DePristo	965043472a	Vastly more powerful, cleaner graph simplification approach -- Generalizes previous node merging and splitting approaches. Can split common prefixes and suffixes among nodes, build a subgraph representing this new structure, and incorporate it into the original graph. Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons). Fully UnitTested. These new algorithms can now really simplify diamond configurations as well as ones where sources and sinks arrive / depart linearly at a common single root node. -- This new suite of algorithms is fully integrated into the HC, replacing previous approaches -- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph. This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.	2013-03-23 17:40:55 -04:00
Mark DePristo	078c63654d	Merge pull request #127 from broadinstitute/mc_callable_loci_with_reduce_reads Adding ReduceReads support to Callable Loci	2013-03-21 17:56:23 -07:00
Mark DePristo	de832de03d	Merge pull request #125 from broadinstitute/mc_fly_pipeline Removing mark duplicates from the quick processing pipeline	2013-03-21 13:11:53 -07:00
Mauricio Carneiro	eb33da6820	Added support to reduce reads to Callable Loci -- added calls to representativeCount() of the pileup instead of using ++ -- renamed CallableLoci integration test -- added integration test for reduce read support on callable loci	2013-03-21 15:53:04 -04:00
Ryan Poplin	c15453542e	Merge pull request #124 from broadinstitute/md_hc_lowmapq_read_filter HC now by default only uses reads with MAPQ >= 20 for assembly and calli...	2013-03-21 12:00:28 -07:00
Mark DePristo	b30deda7ee	Merge pull request #123 from broadinstitute/rp_event_mapper_bug_GSA-877 Bug fix in HC gga mode.	2013-03-21 11:42:20 -07:00
Mark DePristo	7ae15dadbe	HC now by default only uses reads with MAPQ >= 20 for assembly and calling -- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements. The new version simply uses a read filter to remove them upfront. This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB -- Update MD5s to reflect changes due to no longer including mmq < 20 by default	2013-03-21 13:10:50 -04:00
Ryan Poplin	b9c331c2fa	Bug fix in HC gga mode. -- Don't try to test alleles which haven't had haplotypes assigned to them	2013-03-21 11:02:41 -04:00
Ryan Poplin	1a95ce5dcf	Merge pull request #122 from broadinstitute/md_ceu_trio_calls_2x250_GSA-739 Many improvements to HaplotypeCaller for CEU trio best practice variant calling	2013-03-21 06:58:13 -07:00
Mark DePristo	aa7f172b18	Cap the computational cost of the kmer based error correction in the DeBruijnGraph -- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph. If the number of ops would exceed this threshold, the original graph is used. -- Overall the algorithm is just extremely computationally expensive, and actually doesn't implement the correct correction. So we live with these limitations while we continue to explore better algorithms -- Updating MD5s to reflect changes in assembly algorithms	2013-03-21 09:21:35 -04:00
Mark DePristo	d94b3f85bc	Increase NUM_BEST_PATHS_PER_KMER_GRAPH in DeBruijnAssembler to 25 -- The value of 11 was too small to properly return a real low-frequency variant in our 1000G AFR integration test.	2013-03-20 22:54:38 -04:00
Mark DePristo	6d7d21ca47	Bugfix for incorrect branch diamond merging algorithm -- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices. Fixed. In the process simplified the entire algorithm and codebase -- Resolves https://jira.broadinstitute.org/browse/GSA-884	2013-03-20 22:54:37 -04:00
Mark DePristo	3a8f001c27	Misc. fixes upon pull request review -- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!) -- Remove the maxHaplotypesToConsider feature of HC as it's not useful	2013-03-20 22:54:37 -04:00
Mark DePristo	d3b756bdc7	BaseVertex optimization: don't clone byte[] unnecessarily -- Don't clone sequence upon construction or in getSequence(), as these are frequently called, memory allocating routines and cloning will be prohibitively expensive	2013-03-20 22:54:37 -04:00
Mark DePristo	5226b24a11	HaplotypeCaller infrastructure cleanup and unit testing -- UnitTest for isRootOfDiamond along with key bugfix detected while testing -- Fix up the equals methods in BaseEdge. Now called hasSameSourceAndTarget and seqEquals. A much more meaningful naming -- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs -- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence -- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself -- Massive expansion of the SeqGraph unit tests. We now really test out the zipping and branch merging code. -- Near final cleanup of the current codebase -- DeBruijnVertex cleanup and optimizations. Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[]. Optimize the code to make use of this constraint	2013-03-20 22:54:37 -04:00
Mark DePristo	2e36f15861	Update md5s to reflect new downsampling and assembly algorithm output -- Only minor differences, with improvement in allele discovery where the sites differ. The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome	2013-03-20 22:54:37 -04:00
Mark DePristo	1fa5050faf	Cleanup, unit test, and optimize KBestPaths and Path -- Split Path from inner class of KBestPaths -- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation -- Path now properly typed throughout the code -- Path maintains a on-demand hashset of BaseEdges so that path.containsEdge is fast	2013-03-20 22:54:36 -04:00
Mark DePristo	98c4cd060d	HaplotypeCaller now uses SeqGraph instead of kmer graph to build haplotypes. -- DeBruijnAssembler functions are no longer static. This isn't the right way to unit test your code -- Add a HaplotypeCaller command line option to use low-quality bases in the assembly -- Refactored DeBruijnGraph and associated libraries into base class -- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents. These DeBruijn versions now inherit from these base classes. Added some reasonable unit tests for the base and Debruijn edges and vertex classes. -- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct -- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split -- Moved generic methods in DeBruijnAssembler into BaseGraph -- Created a simple SeqGraph that contains SeqVertex objects -- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs -- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -> Y -> a, a -> Z -- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges -- Fix all unit tests so they work with the new SeqGraph system. All tests passed without modification. -- Debugging makes it easier to tell which kmer graph contributes to a haplotype -- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector -- Remove unnecessary printing of cleaning info in BaseGraph -- Turn off kmer graph creation in DeBruijnAssembler.java -- Only print SeqGraphs when debugGraphTransformations is set to true -- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest. Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality. -- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs -- DebruijnVertex no longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now	2013-03-20 22:54:36 -04:00
Mark DePristo	0f4328f6fe	Basic kmer error correction algorithm for the HaplotypeCaller -- Error correction algorithm for the assembler. Only error correct reads to others that are exactly 1 mismatch away -- The assembler logic is now: build initial graph, error correct, merge nodes, prune dead nodes, merge again, make haplotypes. The * elements are new -- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing. -- Easier way to control the testing of the graph assembly algorithms -- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler -- Simple protected parsing function for making DeBruijnAssemblyGraph -- Change the default prune factor for the graph to 1, from 2 -- debugging graph transformations are controllable from command line	2013-03-20 22:54:36 -04:00
Mark DePristo	53a904bcbd	Bugfix for HaplotypeCaller: GSA-822 for trimming softclipped reads -- Previous version would not trim down soft clip bases that extend beyond the active region, causing the assembly graph to go haywire. The new code explicitly reverts soft clips to M bases with the ever useful ReadClipper, and then trims. Note this isn't a 100% fix for the issue, as it's possible that the newly unclipped bases might in reality extend beyond the active region, should their true alignment include a deletion in the reference. Needs to be fixed. JIRA added -- See https://jira.broadinstitute.org/browse/GSA-822 -- #resolve #fix GSA-822	2013-03-20 22:54:36 -04:00
Mark DePristo	ffea6dd95f	HaplotypeCaller now has the ability to only consider the best N haplotypes for genotyping -- Added a -dontGenotype mode for testing assembly efficiency -- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted	2013-03-20 22:54:36 -04:00
Mark DePristo	a8fb26bf01	A generic downsampler that reduces coverage for a bunch of reads -- Exposed the underlying minElementsPerStack parameter for LevelingDownsampler	2013-03-20 22:54:35 -04:00
Mark DePristo	752440707d	AlignmentUtils.calcNumDifferentBases computes the number of bases that differ between a reference and read sequence given a cigar between the two.	2013-03-20 22:54:35 -04:00
Mark DePristo	a783f19ab1	Fix for potential HaplotypeCaller bug in annotation ordering -- Annotations were being called on VariantContext that might need to be trimmed. Simply inverted the order of operations so trimming occurs before the annotations are added. -- Minor cleanup of call to PairHMM in LikelihoodCalculationEngine	2013-03-20 22:54:35 -04:00
Mark DePristo	559a4bc05d	Updating general calling pipeline to work with newer HC and UG arguments and filtering -- Use default VQSR params of QD, FS, DP and MQ for SNPs, with ReadPosRankSum and HaplotypeScore for UG SNPs -- Add combine variants to GeneralCallingPipeline -- Fix incorrect intervals in HaplotypeCaller in GeneralCallingPipeline.scala -- GCP now emits tables for VCFs by default -- GCP runs HC first before UG -- GeneralCallingPipeline now jointly calls input BAMs, not separately processes them. Ready to handle CEU trio calling -- Assess NA12878 on the particularly well reviewed 10-11mb in addition to all of 20 -- Use 4G for HC	2013-03-20 22:54:35 -04:00
Eric Banks	1fae750ebe	Merge pull request #120 from broadinstitute/aw_reduce_reads_clear_name_cache Clear ReduceReads name cache after each set of reads produced by ReduceR...	2013-03-20 19:47:42 -07:00
Mark DePristo	7e29beadff	Merge pull request #121 from broadinstitute/gda_hc_gls_for_1000g_GSA-878 Fix (rather workaround) encountered when running HaplotypeCaller in GGA ...	2013-03-20 14:08:10 -07:00
Guillermo del Angel	ea01dbf130	Fix to issue encountered when running HaplotypeCaller in GGA mode with data from other 1000G callers. In particular, someone produced a tandem repeat site with 57 alt alleles (sic) which made the caller blow up. Inelegant fix is to detect if # of alleles is > our max cached capacity, and if so, emit an informative warning and skip site. -- Added unit test to UG engine to cover this case. -- Commit to posterity private scala script currently used for 1000G indel consensus (still very much subject to changes). GSA-878 #resolve	2013-03-20 14:30:37 -04:00
MauricioCarneiro	470746c907	Merge pull request #117 from broadinstitute/gg_handling_deprecated_tools_45941819 gg handling deprecated tools 45941819	2013-03-20 07:31:33 -07:00
Geraldine Van der Auwera	d70bf64737	Created new DeprecatedToolChecks class --Based on existing code in GenomeAnalysisEngine --Hashmaps hold mapping of deprecated tool name to version number and recommended replacement (if any) --Using FastUtils for maps; specifically Object2ObjectMap but there could be a better type for Strings... --Added user exception for deprecated annotations --Added deprecation check to AnnotationInterfaceManager.validateAnnotations --Run when annotations are initialized --Made annotation sets instead of lists	2013-03-20 06:46:02 -04:00
Geraldine Van der Auwera	6b4d88ebe9	Created ListAnnotations utility (extends CommandLineProgram) --Refactored listAnnotations basic method out of VA into HelpUtils --HelpUtils.listAnnotations() is now called by both VA and the new ListAnnotations utility (lives in sting.tools) --This way we keep the VA --list option but we also offer a way to list annotations without a full valid VA command-line, which was a pain users continually complained about --We could get rid of the VA --list option altogether ...?	2013-03-20 06:15:27 -04:00
Geraldine Van der Auwera	95a9ed853d	Made some documentation updates & fixes --Mostly doc block tweaks --Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.	2013-03-20 06:15:20 -04:00
Alec Wysoker	bccc9d79e5	Clear ReduceReads name cache after each set of reads produced by ReduceReadsStash. Name cache was filling up with names of all reads in entire file, which for large file eventually consumes all of memory. Only keep read name cache for the reads that are together in one variant region, so that a pair of reads within the same variant region will still be joined via read name. Otherwise the ability to connect a read to its mate is lost. Update MD5s in integration test to reflect altered output. Add new integration test that confirms that pair within variant region is joined by read name.	2013-03-19 14:12:33 -04:00
Ryan Poplin	c813259283	Merge pull request #119 from broadinstitute/md_assessn12878_bugfixes AssessNA12878 bugfixes	2013-03-19 05:11:50 -07:00
David Roazen	d4f873f664	Revert "github webhook handler: convert from daemon to cron job" Turns out the email script doesn't work correctly from cron. Converting the webhook script back to a daemon for now until it can be made to work as a cron job. This reverts commit 9679accb641537f5c637cce0aeb63f3925521b42.	2013-03-19 03:50:39 -04:00
David Roazen	ff79118379	github webhook handler: convert from daemon to cron job -having this as a daemon was annoying because we had to be sure to re-spawn the daemon whenever it got killed -now it will be run as a cron job once per minute -delete now-unnecessary spawn script	2013-03-19 02:47:13 -04:00
David Roazen	f9ad8d4325	Merged bug fix from Stable into Unstable Conflicts: private/gsa-engineering/pdfgen/trigger_pdfgen.sh	2013-03-19 01:23:58 -04:00
David Roazen	532efad8cd	Release scripts: small changes to reduce intermittent failures -don't check exit status of wget in the trigger_pdfgen script; it was exiting with non-0 status even though the pdf generation was being triggered correctly -introduce a delay after filtering the git history to allow HEAD to be properly reset -re-enable sanity checks in filter_stable and source_release scripts that had temporarily been disabled while the new protected repository was being set up	2013-03-19 01:09:30 -04:00
Mark DePristo	d7bec9eb6e	AssessNA12878 bugfixes -- @Output isn't required for AssessNA12878 -- Previous version would count non-variant sites in NA12878 that resulted from subsetting a multi-sample VC to NA12878 as CALLED_BUT_NOT_IN_DB sites. Now they are properly skipped -- Bugfix for subsetting samples to NA12878. Previous version wouldn't trim the alleles when subsetting down a multi-sample VCF, so we'd have false FN/FP sites at indels when the multi-sample VCF has alleles that result in the subset for NA12878 having non-trimmed alleles. Fixed and unit tested now.	2013-03-18 15:48:08 -04:00
Eric Banks	a36e2b8f9d	Merge pull request #118 from broadinstitute/ami-typoInCoveredByNSamplesSites fix typos in argument docs in CoveredByNSamplesSites and rewrite an unac...	2013-03-18 11:10:10 -07:00
Ami Levy-Moonshine	0e9c1913ff	fix typos in argument docs and in printed output in CoveredByNSamplesSites and rewrite an inaccurate comment	2013-03-18 13:54:21 -04:00
Mark DePristo	2b80068164	Merged bug fix from Stable into Unstable	2013-03-18 12:36:21 -04:00
Mark DePristo	7ab7c873a1	Temp. fix to PairHMM to avoid bad likelihoods -- Simply caps PairHMM likelihoods from rising above 0 by taking the min of the likelihood and 0. Will be properly fixed in GATK 2.5 with better PairHMM implementation.	2013-03-18 12:34:51 -04:00
David Roazen	a67d8c8dd6	Bump timeout for MaxRuntimeIntegrationTest Looks like returning this timeout to its original value was a bit too aggressive -- adding 40 seconds to the tolerance limit.	2013-03-17 16:17:29 -04:00
droazen	a67aae0261	Merge pull request #114 from broadinstitute/dr_tweak_test_timeouts Further tweaking of test timeouts	2013-03-15 15:43:55 -07:00
Mark DePristo	d86a1242d1	Merge pull request #115 from broadinstitute/md_kb_unstable_server_GSA-778 NA12878 KB startup script takes full path to GATK.jar	2013-03-15 13:34:10 -07:00
Mark DePristo	2f27e5682a	NA12878 KB startup script takes full path to GATK.jar	2013-03-15 16:33:29 -04:00
David Roazen	236eb54abd	Trivial script to publish private unstable jars for group use -Jars will get updated every time the "Serial Commit Tests" plan in Bamboo passes on the master branch -Differs from the nightly builds in that it includes "private" and has actually passed the test suite -latest jar is always located at: /humgen/gsa-hpprojects/GATK/private_unstable_builds/GenomeAnalysisTK_latest_unstable.jar	2013-03-15 16:00:59 -04:00


12106 Commits (1917d55dc228f450cabd669b1038e8dce861584f)
