gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	ebd5404124	Fixed the add functionality of GenomeLocSortedSet. * Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown. * Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases. * Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs. * Added more unit tests for add functionality of GLSS. * Resolves GSA-775.	2013-02-28 23:31:00 -05:00
Eric Banks	12fc198b80	Added better error message for BAMs with bad read groups. * Split the cases into reads that don't have a RG at all vs. those with a RG that's not defined in the header. * Added integration tests to make sure that the correct error is thrown. * Resolved GSA-407.	2013-02-27 16:02:56 -05:00
Mauricio Carneiro	711cbd3b5a	Archiving CoverageBySample This walker was not updated since 2009, and users were getting wrong answers when running it with ReduceReads. I don't want to deal with this because DiagnoseTargets does everything this walker does.	2013-02-26 13:49:00 -05:00
David Roazen	3645ea9bb6	Sequence dictionary validation: detect problematic contig indexing differences The GATK engine does not behave correctly when contigs are indexed differently in the reads sequence dictionaries vs. the reference sequence dictionary, and the inconsistently-indexed contigs are included in the user's intervals. For example, given the dictionaries: Reference dictionary = { chrM, chr1, chr2, ... } BAM dictionary = { chr1, chr2, ... } and the interval "-L chr1", the engine would fail to correctly retrieve the reads from chr1, since chr1 has a different index in the two dictionaries. With this patch, we throw an exception if there are contig index differences between the dictionaries for reads and reference, AND the user's intervals include at least one of the mismatching contigs. The user can disable this exception via -U ALLOW_SEQ_DICT_INCOMPATIBILITY In all other cases, dictionary validation behaves as before. I also added comprehensive unit tests for the (previously-untested) SequenceDictionaryUtils class. GSA-768 #resolve	2013-02-25 11:14:22 -05:00
Geraldine Van der Auwera	e674b4a524	Added new ReadFilter that allows users to specifically reassign one single mapping quality to a different value. Useful for TopHat and other RNA-seq software users.	2013-02-20 01:24:45 -05:00
Mark DePristo	be45edeff2	ActivityProfile and ActiveRegions respects engine interval boundaries -- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present. -- UnitTests for ActiveRegion.splitAndTrimToIntervals -- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals -- UnitTesting overlap function in GenomeLocSortedSet -- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet. Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775 -- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals -- Added docs for the constructors in GLSS -- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s -- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser	2013-02-18 10:40:25 -05:00
Mark DePristo	9e28d1e347	Cleanup and unit tests for QualityUtils -- Fixed a few conversion bugs with edge case quals (ones that were very high) -- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s -- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK -- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly. -- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate. -- Renamed probToQual to trueProbToQual -- Added goodProbability and log10OneMinusX to MathUtils -- Went through the GATK and cleaned up many uses of QualityUtils -- Cleanup constants in QualityUtils -- Added full docs for all of the constants -- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity -- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's a BQSR-specific feature -- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x) -- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils	2013-02-16 07:31:37 -08:00
droazen	664960373d	Merge pull request #31 from broadinstitute/yf_fast_BAM_index_traversal -re-enables fast BAM indexing	2013-02-15 09:12:32 -08:00
MauricioCarneiro	1dd284a5bb	Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720 PrintReads writes a header when used with -BQSR	2013-02-15 07:18:28 -08:00
Tad Jordan	6cb80591e3	PrintReads writes a header when used with -BQSR	2013-02-14 22:19:14 -05:00
Yossi Farjoun	3a7c8c13e2	Re-enabled fastBAMindexing by replacing the FileChannel with a SeekableBufferedStream This helps a lot since FileChannel is very low-level and traversing the BAMIndex involves lots of short reads. - Fixed a deterioration in BAMIndex due to rev'ed picard (see below) - Added unit tests for SeekableBufferedStream - Added integrationTests for GATKBAMIndex (in PileupWalkerIntegrationTest) - Added a runtime-test to verify that the amount read equals the amount requested. - Added failing tests with expectedExceptions - Used a DataProvider to make code nicer	2013-02-14 17:51:15 -05:00
Geraldine Van der Auwera	6208742f7c	Refactored GATKDocs categories some more ( GSATDG-62 ) -- Renamed ValidatePileup to CheckPileup since validation is a reserved word -- Renamed AlignmentValidation to CheckAlignment (same as above) -- Refactored category definitions to use constants defined in HelpConstants -- Fixed a couple of minor typos and an example error -- Reorganized the GATKDocs index template to use supercategories -- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)	2013-02-13 16:49:18 -05:00
Geraldine Van der Auwera	dff5ef562b	Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details) -- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools -- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities -- Reworded some category names and descriptions to be more explicit and user-friendly	2013-02-12 13:36:15 -05:00
Mark DePristo	a3dc7dc5cb	Extend AWS timeout for uploads of the GATK run reports to 30 seconds	2013-02-08 17:37:36 -05:00
Eric Banks	9826192854	Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class! * AlignmentUtils.getMismatchCount() * AlignmentUtils.calcAlignmentByteArrayOffset() * AlignmentUtils.readToAlignmentByteArray(). * AlignmentUtils.leftAlignIndel()	2013-02-07 13:04:24 -05:00
Eric Banks	562f2406d7	Added check that BaseRecalibrator is not being run on a reduced bam. - Throws user exception if it is. - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument. - Added test to check this is working. - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end. - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process. - Added comment about not changing the program record name, as per reviewer comments. - Removed unused variable.	2013-02-06 10:14:27 -05:00
Eric Banks	4e5ff3d6f1	Bug fix for NPE in HC with --dbsnp argument. - I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null. - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.	2013-02-05 21:59:19 -05:00
Yossi Farjoun	de03f17be4	-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling. -Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixture of two broadly consented samples) were added -includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test. -(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.) -New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map). -Unit tests to the new class and function are provided. -Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly. -Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or UnifiedGenotyper, into a new IntegrationTest class.	2013-02-04 18:24:36 -05:00
Mark DePristo	a281fa6548	Resolves Genome Sequence Analysis GSA-750 Don't print an endless series of starting messages from the ProgressMeter -- The progress meter isn't started until the GATK actually calls execute on the microscheduler. Now we get a message saying "Creating shard strategy" while this (expensive) operation runs	2013-02-04 15:47:30 -05:00
Mark DePristo	8d08780582	GATKRunReport now tracks the errorMessage and errorThrown during post for later analysis -- This is primarily useful in the unit tests, as I now print out additional information on why a test might have failed, if it in fact did.	2013-02-02 19:24:31 -05:00
Mark DePristo	6382d5bdc9	Final cleanup and unit testing for GATKRunReport -- Bringing code up to document, style, and code coverage specs -- Move GATKRunReportUnitTest to private -- Fully expand GATKRunReportUnitTests to cover writing and reading GATKRunReport to local disk, to standard out, to AWS. -- Move documentation URL from GATKRunReport to UserException -- Delete a few unused files from s3GATKReport -- Added capabilities to GATKRunReport to make testing easier -- Added capabilities to deserialize GATKRunReports from an InputStream	2013-02-02 15:06:56 -05:00
Mark DePristo	eb17230c2f	Update AWS access and private keys to the new GATK2LogUploader user -- Updated EncryptAWSKeys to write the key into the correct resources directory	2013-02-02 15:06:56 -05:00
David Roazen	c4b0ba4d45	Temporarily back out the Picard team's patches to GATKBAMIndex from December These patches to GATKBAMIndex are causing massive BAM index reading errors in combination with the latest version of Picard. The bug is either in the patches themselves or in the underlying SeekableBufferedStream class they rely on. Until the cause can be identified, we are temporarily backing out these changes so that we can continue to run with the latest Picard/Tribble. This reverts commits: 81483ec21e528790dfa719d18cdee27d577ca98e 68cf0309db490b79eecdabb4034987ff825ffea8 54bb68f28ad5fe1b3df01702e9c5e108106a0176	2013-02-01 13:51:31 -05:00
David Roazen	292037dfda	Rev picard, sam-jdk, and tribble This is a necessary prerequisite for the org.broadinstitute.variant migration. -Picard and sam-jdk go from version 1.67.1197 to 1.84.1337 -Picard-private goes from version 2375 to 2662 -Tribble goes from version 119 to 1.84.1337 -RADICALLY trimmed down the list of classes we extract from Picard-private (jar goes from 326993 bytes to 6445 bytes!)	2013-02-01 13:51:30 -05:00
David Roazen	6ec1e613a2	Move AWS keys to a resources subdirectory within the phonehome package Resources must be in a subdirectory called "resources" in the package hierarchy to be picked up by the packaging system. Adding each resource manually to the jars in build.xml does not cause the resource to be added to the standalone GATK jar when we package the GATK, so it's best to always use this convention.	2013-01-31 11:56:34 -05:00
Mark DePristo	404ee9a6e4	More aggressive checking of AWS key quality upon startup in the GATK	2013-01-31 09:08:38 -05:00
Mark DePristo	b707331332	Encrypt GATK AWS keys using the GATK private key, and decrypt as needed as a resource when uploading to AWS logs -- Has the overall effect that the GATK user AWS keys are no longer visible in the gatk source as plain text. This will stop AWS from emailing me (they crawl the web looking for keys) -- Added utility EncryptAWSKeys that takes as command line arguments the GATK user AWS access and secret keys, encrypts them with the GATK private key, and writes out the resulting file to resources in phonehome. -- GATKRunReport now decrypts as needed these keys using the GATK public key as resources in the GATK bundle -- Refactored the essential function of Resource (reading the resource) from IOUtils into the class itself. Getting the data in the resource is now straightforward -- Refactored md5 calculation code from a byte[] into Utils. Added unit tests -- Committing the encrypted AWS keys -- #resolves https://jira.broadinstitute.org/browse/GSA-730	2013-01-30 16:42:23 -05:00
David Roazen	591df2be44	Move additional VariantContext utility methods back to the GATK Thanks to Eric for his feedback	2013-01-30 13:58:17 -05:00
David Roazen	9985f82a7a	Move BaseUtils back to the GATK by request, along with associated utility methods	2013-01-30 13:09:44 -05:00
Mark DePristo	45603f58cd	Refactoring and unit testing GenomeLocParser -- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage -- Fully document all functions in GenomeLocParser -- Unit tests for things like parsePosition (shocking it wasn't tested!) -- Removed function to specifically create GenomeLocs for VariantContexts. The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly -- Deprecated setStart, setStop, and incPos (and moved their functionality to GenomeLoc) -- Unit test coverage at like 80%, moving to 100% with next commit	2013-01-30 09:47:47 -05:00
Mark DePristo	92c5635e19	Cleanup, document, and unit test ActiveRegion -- All functions tested. In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads. New version should be correct -- Enforce correct ordering of supporting states in constructor -- Enforce read ordering when adding reads to an active region in add -- Fix bug in HaplotypeCaller map with new updating read spans. Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence -- Encapsulate isActive field with an accessor function -- Make sure that all state lists are unmodifiable, and that the docs are clear about this -- ActiveRegion equalsExceptReads is for testing only, so make it package protected -- ActiveRegion.hardClipToRegion must resort reads as they can become out of order -- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region. The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed. Has a minor impact on PLs and RankSumTest values -- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy	2013-01-30 09:47:12 -05:00
David Roazen	a536e1da84	Move some VCF/VariantContext methods back to the GATK based on feedback -Moved some of the more specialized / complex VariantContext and VCF utility methods back to the GATK. -Due to this re-shuffling, was able to return things like the Pair class back to the GATK as well.	2013-01-29 16:56:55 -05:00
Ryan Poplin	e9c3a0acdf	fix typo	2013-01-28 22:18:58 -05:00
Ryan Poplin	d665a8ba0c	The Bayesian calculation of Qemp in the BQSR is now hierarchical. This fixes issues in which the covariate bins were very sparse and the prior estimate being used was the original quality score. This resulted in large correction factors for each covariate which breaks the equation. There is also now a new option, qlobalQScorePrior, which can be used to ignore the given (very high) quality scores and instead use this value as the prior.	2013-01-28 15:56:33 -05:00
David Roazen	f63f27aa13	org.broadinstitute.variant refactor, part 2 -removed sting dependencies from test classes -removed org.apache.log4j dependency -misc cleanup	2013-01-28 09:03:46 -05:00
David Roazen	3744d1a596	Collapse the downsampling fork in the GATK engine With LegacyLocusIteratorByState deleted, the legacy downsampling implementation was already non-functional. This commit removes all remaining code in the engine belonging to the legacy implementation.	2013-01-28 01:50:30 -05:00
Mark DePristo	0fb238b61e	TraverseActiveRegions Optimizations and Bugfixes: make sure to record position of current locus to discharge active regions when there's no data -- Now records the position of the current locus, as well as that of the last read. Necessary when passing through regions with no reads. The previous version would keep accumulating empty active regions, and never discharge them until end of traversal (if there were no reads in the future) or until a read was finally found -- Protected a call to logger.debug with if ( logger.isDebugEnabled()) to avoid a lot of overhead in writing unseen debugger logging information	2013-01-27 14:10:06 -05:00
Mark DePristo	93d88cdc68	Optimization: LocusReferenceView now passes along the contig index to createGenomeLoc, speeding up their creation -- Also cleaned up some unused methods	2013-01-27 14:10:06 -05:00
Mauricio Carneiro	705cccaf63	Making SplitReads output FastQ's instead of BAM - eliminates one step in my pipeline - BAM is too finicky and maintaining parameters that wouldn't be useful was becoming a headache, better avoided.	2013-01-27 02:36:31 -05:00
Mauricio Carneiro	6ea7133d95	Updating licenses of latest moved files	2013-01-26 13:46:52 -05:00
Ami Levy-Moonshine	b4447cdca2	When VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE is used, we used to verify that the sample names are unique in VariantContextUtils.simpleMerge for each set of VCs. This caused a bug that was reported on the forum (when a set of VCs had 2 VCs from the same sample). Now we check this only in CombineVariants.init, using the headers. A new function was added to SamplesUtils, with unit tests in CVunitTest.java.	2013-01-25 15:49:51 -05:00
Mark DePristo	008b617577	Cleanup the getLIBS function in LocusIterator -- Now throws an UnsupportedOperationException in the base class. Only LocusView implements this function and actually returns the LIBS	2013-01-25 11:07:28 -05:00
Eric Banks	6dd0e1ddd6	Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants. This allows us to move SelectVariants into the public suite of tools now.	2013-01-25 09:42:04 -05:00
Mark DePristo	592f90aaef	ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region -- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visual inspection of both running on 1000G and NA12878 -- Misc. commenting of the code -- Updated ActiveRegionExtension to include a min active region size -- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now	2013-01-24 13:48:00 -05:00
Mark DePristo	0c94e3d96e	Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp -- Previously we allowed band pass filter size to be specified along with the sigma. But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel. This means that if you use a smaller kernel you get a small band size and therefore faster ART -- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp	2013-01-24 13:47:59 -05:00
Mark DePristo	9e43a2028d	Making band pass filter size, sigma, active region max size and extension all accessible from the command line	2013-01-24 13:47:59 -05:00
Eric Banks	6790e103e0	Moving lots of walkers back from protected to public (along with several of the VA annotations). Let's see whether Mauricio's automatic git hook really works!	2013-01-24 11:42:49 -05:00
Mark DePristo	09edc6baeb	TraverseActiveRegions now writes out very nice active region and activity profile IGV formatted files	2013-01-23 13:46:01 -05:00
Mark DePristo	8e8126506b	Renaming IncrementalActivityProfile to ActivityProfile -- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV	2013-01-23 13:46:01 -05:00
Mark DePristo	eb60235dcd	Working version of incremental active region traversals -- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version. This means that ART walkers will now take much less memory than previously. On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory. Over the whole chr20 only 5K reads were ever held in ART at one time. -- Fixed bug in the way active regions worked with shard boundaries. The new implementation no longer sees shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries. This behavior was actually encoded in the unit tests, so those needed to be updated as well. -- Changed the way that preset regions work in ART. The new contract ensures that you get exactly the regions you requested. The isActive function is still called, but its result has no impact on the regions. With this functionality it should be possible to use the HC as a generic assembly by forcing it to operate over very large regions -- Added a few misc. useful functions to IncrementalActivityProfile	2013-01-23 13:46:00 -05:00

2157 Commits (ec3bf9f36283e16b00a090bf75161213bb3f63dc)