gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	fde7d36926	Updating md5s due to changes in assembly graph creation algorithms and default parameter	2013-03-27 15:31:24 -04:00
Mark DePristo	197d149495	Increase the maxNumHaplotypesInPopulation to 25 -- A somewhat arbitrary increase, and will need some evaluation but necessary to get good results on the AFR integrationtest.	2013-03-27 15:31:24 -04:00
Mark DePristo	66910b036c	Added new and improved suffix and node merging algorithms -- These new algorithms are more powerful than the restricted diamond merging algoriths, in that they can merge nodes with multiple incoming and outgoing edges. Together the splitter + merger algorithms will correctly merge many more cases than the original headless and tailless diamond merger. -- Refactored haplotype caller infrastructure into graphs package, code cleanup -- Cleanup new merging / splitting algorithms, with proper docs and unit tests -- Fix bug in zipping of linear chains. Because the multiplicity can be 0, protect ourselves with a max function call -- Fix BaseEdge.max unit test -- Add docs and some more unit tests -- Move error correct from DeBruijnGraph to DeBruijnAssembler -- Replaced uses of System.out.println with logger.info -- Don't make multiplicity == 0 nodes look like they should be pruned -- Fix toString of Path	2013-03-27 15:31:18 -04:00
Mark DePristo	39f2e811e5	Increase max cigar elements from SW before failing path creation to 20 from 6 -- This allows more diversity in paths, which is sometimes necessary when we cannot simply graphs that have large bubbles	2013-03-26 14:27:18 -04:00
Mark DePristo	b1b615b668	BaseGraph shouldn't implement getEdge -- no idea why I added this	2013-03-26 14:27:18 -04:00
Mark DePristo	a97576384d	Fix bug in the HC not respecting the requested pruning	2013-03-26 14:27:18 -04:00
Mark DePristo	78c672676b	Bugfix for pruning and removing non-reference edges in graph -- Previous algorithms were applying pruneGraph inappropriately on the raw sequence graph (where each vertex is a single base). This results in overpruning of the graph, as prunegraph really relied on the zipping of linear chains (and the sharing of weight this provides) to avoid over-pruning the graph. Probably we should think hard about this. This commit fixes this logic, so we zip the graph between pruning -- In this process ID's a fundamental problem with how we were trimming away vertices that occur on a path from the reference source to sink. In fact, we were leaving in any vertex that happened to be accessible from source, any vertices in cycles, and any vertex that wasn't the absolute end of a chain going to a sink. The new algorithm fixes all of this, using a BaseGraphIterator that's a general approach to walking the base graph. Other routines that use the same traversal idiom refactored to use this iterator. Added unit tests for all of these capabilities. -- Created new BaseGraphIterator, which abstracts common access patterns to graph, and use this where appropriate	2013-03-26 14:27:18 -04:00
Mark DePristo	ad04fdb233	PerReadAlleleLikelihoodMap getMostLikelyAllele returns an MostLikelyAllele objects now -- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative. -- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads. All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as they reads only span a single event. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting)	2013-03-26 14:27:13 -04:00
Mark DePristo	2472828e1c	HC bug fixes: no longer create reference graphs with cycles -- Though not intended, it was possible to create reference graphs with cycles in the case where you started the graph with a homopolymer of length > the kmer. The previous test would fail to catch this case. Now its not possible -- Lots of code cleanup and refactoring in this push. Split the monolithic createGraphFromSequences into simple calls to addReferenceKmersToGraph and addReadKmersToGraph which themselves share lower level functions like addKmerPairFromSeqToGraph. -- Fix performance problem with reduced reads and the HC, where we were calling add kmer pair for each count in the reduced read, instead of just calling it once with a multiplicity of count. -- Refactor addKmersToGraph() to use things like addOrUpdateEdge, now the code is very clear	2013-03-26 10:12:24 -04:00
Mark DePristo	1917d55dc2	Bugfix for DeBruijnAssembler: don't fail when read length > haplotype length -- The previous version would generate graphs that had no reference bases at all in the situation where the reference haplotype was < the longer read length, which would cause the kmer size to exceed the reference haplotype length. Now return immediately with a null graph when this occurs as opposed to continuing and eventually causing an error	2013-03-26 10:12:17 -04:00
Mark DePristo	464e65ea96	Disable error correcting kmers by default in the HC -- The error correction algorithm can break the reference graph in some cases by error correcting us into a bad state for the reference sequence. Because we know that the error correction algorithm isn't ideal, and worse, doesn't actually seem to improve the calling itself on chr20, I've simply disabled error correction by default and allowed it to be turned on with a hidden argument. -- In the process I've changed a bit the assembly interface, moving some common arguments us into the LocalAssemblyEngine, which are turned on/off via setter methods. -- Went through the updated arguments in the HC to be @Hidden and @Advanced as appropriate -- Don't write out an errorcorrected graph when debugging and error correction isn't enabled	2013-03-26 10:05:17 -04:00
Eric Banks	593d3469d4	Refactored the het (polyploid) consensus creation in ReduceReads. * It is now cleaner and easier to test; added tests for newly implemented methods. * Many fixes to the logic to make it work * The most important change was that after triggering het compression we actually need to back it out if it creates reads that incorporated too many softclips at any one position (because they get unclipped). * There was also an off-by-one error in the general code that only manifested itself with het compression. * Removed support for creating a het consensus around deletions (which was broken anyways). * Mauricio gave his blessing for this. * Het compression now works only against known sites (with -known argument). * The user can pass in one or more VCFs with known SNPs (other variants are ignored). * If no known SNPs are provided het compression will automatically be disabled. * Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their strandedness from normal reduced reads. * GATKSAMRecord now checks for this tag when determining whether or not the read is stranded. * This allows us to update the FisherStrand annotation to count het compressed reduced reads towards the FS calculation. * [It would have been nice to mark the normal reads as unstranded but then we wouldn't be backwards compatible.] * Updated integration tests accordingly with new het compressed bams (both for RR and UG). * In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling RR properly, so I fixed it too. * Also, the test in the UG engine for determining whether there are too many overlapping deletions is updated to handle RR. * I added a special hook in the RR integration tests to additionally run the systematic coverage checking tool I wrote earlier. * AssessReducedCoverage is now run against all RR integration tests to ensure coverage is not lost from original to reduced bam. * This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads from all but 1 sample (now fixed). * AssessReducedCoverage moved from private to protected for packaging reasons. * #resolve GSA-639 At this point, this commit encompasses most of what is needed for het compression to go live. There are still a few TODO items that I want to get in before the 2.5 release, but I will save those for a separate branch because as it is I feel bad for the person who needs to review all these changes (sorry, Mauricio).	2013-03-25 09:34:54 -04:00
Mark DePristo	965043472a	Vastly more powerful, cleaner graph simplification approach -- Generalizes previous node merging and splitting approaches. Can split common prefixes and suffices among nodes, build a subgraph representing this new structure, and incorporate it into the original graph. Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons). Fully UnitTested. These new algorithms can now really simplify diamond configurations as well as ones sources and sinks that arrive / depart linearly at a common single root node. -- This new suite of algorithms is fully integrated into the HC, replacing previous approaches -- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph. This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.	2013-03-23 17:40:55 -04:00
Ryan Poplin	c15453542e	Merge pull request #124 from broadinstitute/md_hc_lowmapq_read_filter HC now by default only uses reads with MAPQ >= 20 for assembly and calli...	2013-03-21 12:00:28 -07:00
Mark DePristo	7ae15dadbe	HC now by default only uses reads with MAPQ >= 20 for assembly and calling -- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements. The new version simply uses a read filter to remove them upfront. This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB -- Update MD5s to reflect changes due to no longer including mmq < 20 by default	2013-03-21 13:10:50 -04:00
Ryan Poplin	b9c331c2fa	Bug fix in HC gga mode. -- Don't try to test alleles which haven't had haplotypes assigned to them	2013-03-21 11:02:41 -04:00
Mark DePristo	aa7f172b18	Cap the computational cost of the kmer based error correction in the DeBruijnGraph -- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph. If the number of ops would exceed this threshold, the original graph is used. -- Overall the algorithm is just extremely computational expensive, and actually doesn't implement the correct correction. So we live with this limitations while we continue to explore better algorithms -- Updating MD5s to reflect changes in assembly algorithms	2013-03-21 09:21:35 -04:00
Mark DePristo	d94b3f85bc	Increase NUM_BEST_PATHS_PER_KMER_GRAPH in DeBruijnAssembler to 25 -- The value of 11 was too small to properly return a real low-frequency variant in our the 1000G AFR integration test.	2013-03-20 22:54:38 -04:00
Mark DePristo	6d7d21ca47	Bugfix for incorrect branch diamond merging algorithm -- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices. Fixed. In the process simplified the entire algorithm and codebase -- Resolves https://jira.broadinstitute.org/browse/GSA-884	2013-03-20 22:54:37 -04:00
Mark DePristo	3a8f001c27	Misc. fixes upon pull request review -- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!) -- Remove the maxHaplotypesToConsider feature of HC as it's not useful	2013-03-20 22:54:37 -04:00
Mark DePristo	d3b756bdc7	BaseVertex optimization: don't clone byte[] unnecessarily -- Don't clone sequence upon construction or in getSequence(), as these are frequently called, memory allocating routines and cloning will be prohibitively expensive	2013-03-20 22:54:37 -04:00
Mark DePristo	5226b24a11	HaplotypeCaller instructure cleanup and unit testing -- UnitTest for isRootOfDiamond along with key bugfix detected while testing -- Fix up the equals methods in BaseEdge. Now called hasSameSourceAndTarget and seqEquals. A much more meaningful naming -- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs -- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence -- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself -- Massive expansion of the SeqGraph unit tests. We now really test out the zipping and branch merging code. -- Near final cleanup of the current codebase -- DeBruijnVertex cleanup and optimizations. Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[]. Optimize the code to make use of this constraint	2013-03-20 22:54:37 -04:00
Mark DePristo	2e36f15861	Update md5s to reflect new downsampling and assembly algorithm output -- Only minor differences, with improvement in allele discovery where the sites differ. The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome	2013-03-20 22:54:37 -04:00
Mark DePristo	1fa5050faf	Cleanup, unit test, and optimize KBestPaths and Path -- Split Path from inner class of KBestPaths -- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation -- Path now properly typed throughout the code -- Path maintains a on-demand hashset of BaseEdges so that path.containsEdge is fast	2013-03-20 22:54:36 -04:00
Mark DePristo	98c4cd060d	HaplotypeCaller now uses SeqGraph instead of kmer graph to build haplotypes. -- DeBruijnAssembler functions are no longer static. This isn't the right way to unit test your code -- An a HaplotypeCaller command line option to use low-quality bases in the assembly -- Refactored DeBruijnGraph and associated libraries into base class -- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents. These DeBruijn versions now inherit from these base classes. Added some reasonable unit tests for the base and Debruijn edges and vertex classes. -- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct -- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split -- Moved generic methods in DeBruijnAssembler into BaseGraph -- Created a simple SeqGraph that contains SeqVertex objects -- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs -- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -Y -> a, a -> Z -- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges -- Fix all unit tests so they work with the new SeqGraph system. All tests passed without modification. -- Debugging makes it easier to tell which kmer graph contributes to a haplotype -- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector -- Remove unnecessary printing of cleaning info in BaseGraph -- Turn off kmer graph creation in DeBruijnAssembler.java -- Only print SeqGraphs when debugGraphTransformations is set to true -- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest. Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality. -- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs -- DebruijnVertex now longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now	2013-03-20 22:54:36 -04:00
Mark DePristo	0f4328f6fe	Basic kmer error correction algorithm xfor the HaplotypeCaller -- Error correction algorithm for the assembler. Only error correct reads to others that are exactly 1 mismatch away -- The assembler logic is now: build initial graph, error correct, merge nodes, prune dead nodes, merge again, make haplotypes. The * elements are new -- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing. -- Easier way to control the testing of the graph assembly algorithms -- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler -- Simple protected parsing function for making DeBruijnAssemblyGraph -- Change the default prune factor for the graph to 1, from 2 -- debugging graph transformations are controllable from command line	2013-03-20 22:54:36 -04:00
Mark DePristo	53a904bcbd	Bugfix for HaplotypeCaller: GSA-822 for trimming softclipped reads -- Previous version would not trim down soft clip bases that extend beyond the active region, causing the assembly graph to go haywire. The new code explicitly reverts soft clips to M bases with the ever useful ReadClipper, and then trims. Note this isn't a 100% fix for the issue, as it's possible that the newly unclipped bases might in reality extend beyond the active region, should their true alignment include a deletion in the reference. Needs to be fixed. JIRA added -- See https://jira.broadinstitute.org/browse/GSA-822 -- #resolve #fix GSA-822	2013-03-20 22:54:36 -04:00
Mark DePristo	ffea6dd95f	HaplotypeCaller now has the ability to only consider the best N haplotypes for genotyping -- Added a -dontGenotype mode for testing assembly efficiency -- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted	2013-03-20 22:54:36 -04:00
Mark DePristo	a783f19ab1	Fix for potential HaplotypeCaller bug in annotation ordering -- Annotations were being called on VariantContext that might needed to be trimmed. Simply inverted the order of operations so trimming occurs before the annotations are added. -- Minor cleanup of call to PairHMM in LikelihoodCalculationEngine	2013-03-20 22:54:35 -04:00
Eric Banks	1fae750ebe	Merge pull request #120 from broadinstitute/aw_reduce_reads_clear_name_cache Clear ReduceReads name cache after each set of reads produced by ReduceR...	2013-03-20 19:47:42 -07:00
Guillermo del Angel	ea01dbf130	Fix to issue encountered when running HaplotypeCaller in GGA mode with data from other 1000G callers. In particular, someone produced a tandem repeat site with 57 alt alleles (sic) which made the caller blow up. Inelegant fix is to detect if # of alleles is > our max cached capacity, and if so, emit an informative warning and skip site. -- Added unit test to UG engine to cover this case. -- Commit to posterity private scala script currently used for 1000G indel consensus (still very much subject to changes). GSA-878 #resolve	2013-03-20 14:30:37 -04:00
Geraldine Van der Auwera	95a9ed853d	Made some documentation updates & fixes --Mostly doc block tweaks --Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.	2013-03-20 06:15:20 -04:00
Alec Wysoker	bccc9d79e5	Clear ReduceReads name cache after each set of reads produced by ReduceReadsStash. Name cache was filling up with names of all reads in entire file, which for large file eventually consumes all of memory. Only keep read name cache for the reads that are together in one variant region, so that a pair of reads within the same variant region will still be joined via read name. Otherwise the ability to connect a read to its mate is lost. Update MD5s in integration test to reflect altered output. Add new integration test that confirms that pair within variant region is joined by read name.	2013-03-19 14:12:33 -04:00
Ryan Poplin	0cf5d30dac	Bug fix in assembly for edge case in which the extendPartialHaplotype function was filling in deletions in the middle of haplotypes.	2013-03-15 14:20:25 -04:00
Ryan Poplin	b8991f5e98	Fix for edge case bug of trying to create insertions/deletions on the edge of contigs. -- Added integration test using MT that previously failed	2013-03-15 12:32:13 -04:00
Mark DePristo	2d35065238	QualityByDepth remaps QD values > 40 to a gaussian around 30 -- This is a temporarily fix / hack to deal with the very high QD values that are generated by the haplotype caller when nearby events occur within reads. In that case, the QUAL field can be many fold higher than normal, and results in an inflated QD value. This hack projects such high QD values back into the good range (as these are good variants in general) so they aren't filtered away by VQSR. -- The long-term solution to this problem is to move the HaplotypeCaller to the full bubble calling algorithm -- Update md5s	2013-03-14 16:09:41 -04:00
droazen	0fd9f0e77c	Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837 Fixed the logic of the @Output annotation and its interaction with 'required'	2013-03-14 12:52:00 -07:00
Ryan Poplin	38914384d1	Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the simplified stats for AssessNA12878.	2013-03-14 14:44:18 -04:00
Geraldine Van der Auwera	61349ecefa	Cleaned up annotations - Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden - Removed unnecessary check in AverageAltAlleleLength	2013-03-14 14:26:48 -04:00
Eric Banks	7cab709a88	Fixed the logic of the @Output annotation and its interaction with 'required'. ALL GATK DEVELOPERS PLEASE READ NOTES BELOW: I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag. * The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided. * The logic for @Output is now: * if required==true then -o MUST be provided or a User Error is generated. * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided. * this is the default behavior (i.e. @Output with no modifiers). * if required==false and defaultToStdout==false then the output object is null. * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878). * I have updated walkers so that previous behavior has been maintained (as best I could). * In general, all @Outputs with default long/short names have required=false. * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this) * I added unit tests for @Output changes with David's help (thanks!). * #resolve GSA-837	2013-03-14 11:58:51 -04:00
Mark DePristo	b5b63eaac7	New GATKSAMRecord concept of a strandless read, update to FS -- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads. Added GATKSAMRecord support for such reads, along with unit tests -- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments -- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands. This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together. -- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called. Added new GATKSAMRecord method setReducedCounts() that does the right thing. Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone. -- Update MD5s for change to FS calculation. Differences are just minor updates to the FS	2013-03-13 11:16:36 -04:00
MauricioCarneiro	4403e3572a	Merge pull request #94 from broadinstitute/gg_gatkdoc_docfixes_GSATDG-111	2013-03-12 13:02:35 -07:00
MauricioCarneiro	3a16ba04d4	Merge pull request #97 from broadinstitute/eb_refactor_sliding_window Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug	2013-03-12 12:27:26 -07:00
Geraldine Van der Auwera	f972963918	Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs) GATK-73 updated docs for bqsr args GATK-9 differentiate CountRODs from CountRODsByRef GATK-76 generate GATKDoc for CatVariants GATK-4 made resource arg required GATK-10 added -o, some docs to CountMales; some docs to CountLoci GATK-11 fixed by MC's -o change; straightened out the docs. GATK-77 fixed references to wiki GATK-76 Added Ami's doc block GATK-14 Added note that these annotations can only be used with VariantAnnotator GATK-15 specified required=false for two arguments GATK-23 Added documentation block GATK-33 Added documentation GATK-34 Added documentation GATK-32 Corrected arg name and docstring in DiffObjects GATK-32 Added note to DO doc about reference (required but unused) GATK-29 Added doc block to CountIntervals GATK-31 Added @Output PrintStream to enable -o GATK-35 Touched up docs GATK-36 Touched up docs, specified verbosity is optional GATK-60 Corrected GContent annot module location in gatkdocs GATK-68 touched up docs and arg docstrings GATK-16 Added note of caution about calling RODRequiringAnnotations as a group GATK-61 Added run requirements (num samples, min genotype quality) Tweaked template and generic doc block formatting (h2 to h3 titles) GATK-62 Added a caveat to HR annot Made experimental annotation hidden GATK-75 Added setup info regarding BWA GATK-22 Clarified some argument requirements GATK-48 Clarified -G doc comments GATK-67 Added arg requirement GATK-58 Added annotation and usage docs GSATDG-96 Corrected doc Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)	2013-03-12 10:57:14 -04:00
Eric Banks	05e69b6294	Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug. * Allow RR to write its BAM to stdout by setting required=true for @Output. * Fixed bug in sliding window where a break in coverage after a long stretch without a variant region was causing a doubling of all the reads before the break. * Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods. * Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().	2013-03-12 09:06:55 -04:00
Ryan Poplin	c96fbcb995	Use the indel heterozygosity prior when calling indels with the HC	2013-03-11 14:12:43 -04:00
Guillermo del Angel	695723ba43	Two features useful for ancient DNA processing. Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly. Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert. If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence. -- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence. -- Unit tests added for trimming operation. -- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data. -- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs). Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced. -- Added -noPrior argument to StandardCallerArgumentCollection. -- Added option not to fill priors is such argument is set. -- Added an integration test.	2013-03-09 18:18:13 -05:00
Yossi Farjoun	baad965a57	- Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed. - This was needed since samples with spaces in their names are regularly found in the picard pipeline. - Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly) - Cleaned up the unit tests using a @DataProvider (I'm in love...). - Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)	2013-03-07 13:04:24 -05:00
Eric Banks	3759d9dd67	Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine. * ReadTransformers can say they must be first, must be last, or don't care. * By default, none of the existing ones care about ordering except BQSR (must be first). * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR. * The engine now orders the read transformers up front before applying iterators. * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully). * Added unit tests.	2013-03-06 12:38:59 -05:00
Eric Banks	78721ee09b	Added new walker to split MNPs into their allelic primitives (SNPs). * Can be extended to complex alleles at some point. * Currently only works for bi-allelics (documented). * Added unit and integration tests.	2013-03-05 23:16:42 -05:00
Eric Banks	bbbaf9ad20	Revert push from stable (I forgot that pushing from stable overwrites current unstable changes)	2013-03-05 09:06:02 -05:00
Eric Banks	a037423225	Merged bug fix from Stable into Unstable	2013-03-05 09:03:48 -05:00
Eric Banks	7e1bfd6a7c	Included an accidental change from unstable into the previous push	2013-03-05 09:03:31 -05:00
Eric Banks	bd4e4f4ee3	Merged bug fix from Stable into Unstable	2013-03-04 23:24:44 -05:00
Eric Banks	b715218bfe	Fix for mismatching indel quals erro: need to adjust for softclips just like we do for bases and normal quals.	2013-03-04 23:23:18 -05:00
Ryan Poplin	ce7554e9d6	Merged bug fix from Stable into Unstable	2013-03-04 12:36:04 -05:00
Ryan Poplin	0697594778	Active regions that don't contain any usable reads should just be skipped over instead of throwing an IllegalStateException.	2013-03-04 12:35:40 -05:00
Mark DePristo	42d3919ca4	Expanded functionality for writing BAMs from HaplotypeCaller -- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV. -- Previous functionality maintained, with bug fixes -- Haplotype BAM writing code now lives in utils -- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes. -- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes -- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best -- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars -- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils -- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization. -- Added extensive docs to HaplotypeCaller on how to use this capability -- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC -- Cleaned up SWPairwiseAlignment. Refactored out the big main and supplementary static methods. Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW -- Integration test to make sure we can actually write a BAM for each mode. This test only ensures that the code runs and doesn't exception out. It doesn't actually enforce any MD5s -- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype. Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case -- Writes out haplotypes for both all haplotype and called haplotype mode -- Haplotype writers now get the active region call, regardless of whether an actual call was made. Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter	2013-03-03 12:07:29 -05:00
David Roazen	c5c99c8339	Split long-running integration test classes into multiple classes This is to facilitate the current experiment with class-level test suite parallelism. It's our hope that with these changes, we can get the runtime of the integration test suite down to 20 minutes or so. -UnifiedGenotyper tests: these divided nicely into logical categories that also happened to distribute the runtime fairly evenly -UnifiedGenotyperPloidy: these had to be divided arbitrarily into two classes in order to halve the runtime -HaplotypeCaller: turns out that the tests for complex and symbolic variants make up half the runtime here, so merely moving these into a separate class was sufficient -BiasedDownsampling: most of these tests use excessively large intervals that likely can't be reduced without defeating the goals of the tests. I'm disabling these tests for now until they can either be redesigned to use smaller intervals around the variants of interest, or refactored into unit tests (creating a JIRA for Yossi for this task)	2013-03-01 13:55:23 -05:00
depristo	cac3f80c64	Merge pull request #73 from broadinstitute/eb_remove_nested_hashmap_GSA-732 Replace uses of NestedHashMap with NestedIntegerArray.	2013-02-28 05:19:56 -08:00
Eric Banks	d2904cb636	Update docs for RTC.	2013-02-27 14:56:44 -05:00
Eric Banks	69b8173535	Replace uses of NestedHashMap with NestedIntegerArray. * Removed from codebase NestedHashMap since it is unused and untested. * Integration tests change because the BQSR CSV is now sorted automatically. * Resolves GSA-732	2013-02-27 14:03:39 -05:00
Alec Wysoker	c8368ae2a5	Eliminate 7-element arrays in BaseCounts and BaseAndQualsCount and replace with in-line primitive attributes. This is ugly but reduces heap overhead, and changes are localized. When used in conjunction with Mauricio's FastUtil changes it saves and additional 9% or so of execution time.	2013-02-27 12:49:56 -05:00
David Roazen	6466463d5a	Merged bug fix from Stable into Unstable	2013-02-26 21:54:54 -05:00
David Roazen	12a3d7ecad	Fix licenses on files modified in 2.4-1	2013-02-26 21:53:17 -05:00
David Roazen	a53b4a7521	Merged bug fix from Stable into Unstable	2013-02-26 21:41:13 -05:00
David Roazen	65d31ba4ad	Fix runtime public -> protected dependencies in the test suite -replace unnecessary uses of the UnifiedGenotyper by public integration tests with PrintReads -move NanoSchedulerIntegrationTest to protected, since it's completely dependent on the UnifiedGenotyper	2013-02-26 21:19:12 -05:00
depristo	93205154b5	Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776 Eb fix pairhmm unittest gsa 776	2013-02-26 11:56:58 -08:00
Eric Banks	734353e9df	Merge pull request #60 from broadinstitute/mc_fastutil_GSATDG-83 Brought all of ReduceReads to fastutils	2013-02-26 11:56:41 -08:00
David Roazen	8b29030467	Change default downsampling coverage target for the HaplotypeCaller to 250 -was previously set to 30, which seems far too aggressive given that with ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any pileup returned by LIBS -250 is a more conservative default used by the UG -can adjust down/up later based on further experiments (GSA-699 will remain open) -verified with Ryan that all integration test differences are either innocent or represent an improvement GSA-699	2013-02-26 09:33:25 -05:00
Eric Banks	396b7e0933	Fixed the intermittent PairHMM unit test failure. The issue here is that the OptimizedLikelihoodTestProvider uses the same basic underlying class as the BasicLikelihoodTestProvider and we were using the BasicTestProvider functionality to pull out tests of that class; so if the optimized tests were run first we were unintentionally running those same tests again with the basic ones (but expecting different results).	2013-02-25 15:05:13 -05:00
Eric Banks	7519484a38	Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods.	2013-02-25 15:04:23 -05:00
Ryan Poplin	89e2943dd1	The maximum kmer length is derived from the reads. -- This is done to take advantage of longer reads which can produce less ambiguous haplotypes -- Integration tests change for HC and BiasedDownsampling	2013-02-25 14:40:25 -05:00
Mauricio Carneiro	0ff3343282	Addressing Eric's comments -- added @param docs to the new variables -- made all variables final -- switched to string builder instead of String for performance. GSATDG-83	2013-02-25 13:33:47 -05:00
Mauricio Carneiro	9e5a31b595	Brought all of ReduceReads to fastutils -- Added unit tests to ReduceReads name compression -- Updated reduce reads walker for unit testing GSATDG-83	2013-02-23 22:53:23 -05:00
Ryan Poplin	6a639c8ffc	Replace Smith-Waterman alignment with the bubble traversal. -- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph. -- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime. -- Refactoring graph functions into a new DeBruijnAssemblyGraph class. -- Bug fix in path.getBases(). -- Adding validation code to the assembly engine. -- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class. -- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes. -- Added ability to ignore bubbles that are too divergent from the reference -- Max kmer can't be bigger than the extension size. -- Reverse the order that we create the assembly graphs so that the bigger kmers are used first. -- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment. -- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation. -- Updating HaplotypeCaller and BiasedDownsampling integration tests. -- Rebased everything into one commit as requested by Eric -- improvements to the bubble traversal are coming as a separate push	2013-02-22 15:42:16 -05:00
Mauricio Carneiro	e3f01673e1	Implementation of the find and diagnose Queue script -- Added 'uncovered intervals' output for FindCoveredIntervals -- updated scala script to make use of it.	2013-02-22 10:19:01 -05:00
Ryan Poplin	62e14f5b58	Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores.	2013-02-21 14:34:17 -05:00
Eric Banks	6996a953a8	Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample). These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons): * Don't unnecessarily overload Allele.getBases() in the Haplotype class. * Haplotype.getBases() was calling clone() on the byte array. * Added a constructor to Allele (and Haplotype) that takes in an Allele as input. * It makes a copy of he given allele without having to go through the validation of the bases (since the Allele has already been validated). * Rev'ed the variant jar accordingly. For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.	2013-02-21 10:14:11 -05:00
Eric Banks	551d33686c	Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761 Reduce memory footprint of SyntheticRead by replacing several Lists with...	2013-02-20 04:49:07 -08:00
Eric Banks	9dfdb9528b	Merge pull request #49 from broadinstitute/gda_hidden_ug_args Hide arguments related to reference sample operation in UG - for interna...	2013-02-19 16:18:32 -08:00
Eric Banks	0055a6f1cd	Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774 Fix to the Indel Realigner bug described in GSA-774	2013-02-19 16:16:48 -08:00
Guillermo del Angel	5a0a9bc488	Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished.	2013-02-19 19:06:42 -05:00
Mauricio Carneiro	371ea2f24c	Fixed IndelRealigner reference length bug (GSA-774) -- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases -- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases -- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them) -- pulled out ReadBin class so it can be testable -- added unit tests for ReadBin with soft-clips -- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads GSA-774 #resolve	2013-02-19 16:00:36 -05:00
Alec Wysoker	ab75e053da	Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static class that contains the attributes that were scattered across the several Lists.	2013-02-19 15:33:33 -05:00
Ryan Poplin	c025e84c8b	Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping. -- Updated a few integration tests for HC, UG, and UG general ploidy	2013-02-19 14:09:24 -05:00
Mark DePristo	be45edeff2	ActivityProfile and ActiveRegions respects engine interval boundaries -- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present. -- UnitTests for ActiveRegion.splitAndTrimToIntervals -- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals -- UnitTesting overlap function in GenomeLocSortedSet -- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet. Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775 -- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals -- Added docs for the constructors in GLSS -- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s -- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser	2013-02-18 10:40:25 -05:00
Ryan Poplin	b7e9c342c7	Reducing the size of the reference padding in the HaplotypeCaller.	2013-02-17 11:09:00 -05:00
Mark DePristo	73a363b166	Update MD5s due to new QualityUtils calculations -- Increase the allowed runtime of one UG integration test -- The GGA indels mode runs two UG commands, and was barely under the 10 minute limit before. Some updates can push this right over the edge. Increased limit -- CalibrateGenotypeLikelihoods runs on a small data set now, so it's faster -- Updating MD5s due to more correct quality utils. DuplicatesWalkers quality estimates have changed. One UG test has different FS and rank sum tests because the conversion to phred scores are slightly (second decimal place) different	2013-02-16 07:31:38 -08:00
Mark DePristo	3231031c1a	Bugfix for FisherStrand -- FisherStrand pValues can sum to slightly greater than 1.0, so they need to be capped to convert to a Phred-scaled quality score	2013-02-16 07:31:38 -08:00
Mark DePristo	9a29d6d4be	Fix an catastrophic bug (WoW!) in the reference calculation of the UG -- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and was as a side effect other utils converted this to a meaningless 0.0. This is all because there wasn't a unit test. -- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.	2013-02-16 07:31:38 -08:00
Mark DePristo	9e28d1e347	Cleanup and unit tests for QualityUtils -- Fixed a few conversion bugs with edge case quals (ones that were very high) -- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s -- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK -- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly. -- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate. -- Renamed probToQual to trueProbToQual -- Added goodProbability and log10OneMinusX to MathUtils -- Went through the GATK and cleaned up many uses of QualityUtils -- Cleanup constants in QualityUtils -- Added full docs for all of the constants -- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity -- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature -- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x) -- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils	2013-02-16 07:31:37 -08:00
MauricioCarneiro	d80b99143f	Merge pull request #37 from broadinstitute/rp_left_alignment_hc_contract_GSA-771	2013-02-15 08:32:45 -08:00
MauricioCarneiro	1dd284a5bb	Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720 PrintReads writes a header when used with -BQSR	2013-02-15 07:18:28 -08:00
MauricioCarneiro	b58a0eca6b	Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62 Refactored GATKDocs categories some more ( GSATDG-62 )	2013-02-14 22:35:07 -08:00
Tad Jordan	6cb80591e3	PrintReads writes a header when used with -BQSR	2013-02-14 22:19:14 -05:00
Guillermo del Angel	b18f216033	Updated md5's from BiasedDownsamplerIntegrationTest that changed due to changes in HaplotypeCaller - changing HashMaps to LinkedHashMaps changed ordering of reads presented to BiasedDownSampler which changed reads chosen, thereby marginally changing PL's and some site info.	2013-02-14 20:18:49 -05:00
Ryan Poplin	871c8b3866	No need to consider haplotypes which Smith-Waterman aligns off the end of the large padded reference.	2013-02-14 11:18:10 -05:00
Geraldine Van der Auwera	6208742f7c	Refactored GATKDocs categories some more ( GSATDG-62 ) -- Renamed ValidatePileup to CheckPileup since validation is reserved word -- Renamed AlignmentValidation to CheckAlignment (same as above) -- Refactored category definitions to use constants defined in HelpConstants -- Fixed a couple of minor typos and an example error -- Reorganized the GATKDocs index template to use supercategories -- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)	2013-02-13 16:49:18 -05:00
depristo	357d196dad	Merge pull request #32 from broadinstitute/yf_per-sample-downsampling_GSA_765 Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.	2013-02-13 10:08:11 -08:00

1 2 3 4 5 ...

657 Commits (f11c8d22d47218c7d0dfbdfb5c19cbdd336a5df4)