Commit Graph

2157 Commits (ec3bf9f36283e16b00a090bf75161213bb3f63dc)

Author SHA1 Message Date
Eric Banks ebd5404124 Fixed the add functionality of GenomeLocSortedSet.
* Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown.
 * Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases.
 * Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs.
 * Added more unit tests for add functionality of GLSS.
 * Resolves GSA-775.
2013-02-28 23:31:00 -05:00
Eric Banks 12fc198b80 Added better error message for BAMs with bad read groups.
* Split the cases into reads that don't have a RG at all vs. those with a RG that's not defined in the header.
  * Added integration tests to make sure that the correct error is thrown.
  * Resolved GSA-407.
2013-02-27 16:02:56 -05:00
Mauricio Carneiro 711cbd3b5a Archiving CoverageBySample
This walker was not updated since 2009, and users were getting wrong answers when running it with ReduceReads. I don't want to deal with this because DiagnoseTargets does everything this walker does.
2013-02-26 13:49:00 -05:00
David Roazen 3645ea9bb6 Sequence dictionary validation: detect problematic contig indexing differences
The GATK engine does not behave correctly when contigs are indexed
differently in the reads sequence dictionaries vs. the reference
sequence dictionary, and the inconsistently-indexed contigs are included
in the user's intervals. For example, given the dictionaries:

Reference dictionary = { chrM, chr1, chr2, ... }
BAM dictionary       = { chr1, chr2, ... }

and the interval "-L chr1", the engine would fail to correctly retrieve
the reads from chr1, since chr1 has a different index in the two dictionaries.

With this patch, we throw an exception if there are contig index differences
between the dictionaries for reads and reference, AND the user's intervals
include at least one of the mismatching contigs.

The user can disable this exception via -U ALLOW_SEQ_DICT_INCOMPATIBILITY

In all other cases, dictionary validation behaves as before.

I also added comprehensive unit tests for the (previously-untested)
SequenceDictionaryUtils class.

GSA-768 #resolve
2013-02-25 11:14:22 -05:00
Geraldine Van der Auwera e674b4a524 Added new ReadFilter that allows users to specifically reassign one single mapping quality to a different value. Useful for TopHat and other RNA-seq software users. 2013-02-20 01:24:45 -05:00
Mark DePristo be45edeff2 ActivityProfile and ActiveRegions respects engine interval boundaries
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet.  Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
2013-02-18 10:40:25 -05:00
Mark DePristo 9e28d1e347 Cleanup and unit tests for QualityUtils
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value.  Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual.  Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils.  Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
2013-02-16 07:31:37 -08:00
droazen 664960373d Merge pull request #31 from broadinstitute/yf_fast_BAM_index_traversal
-re-enables fast BAM indexing
2013-02-15 09:12:32 -08:00
MauricioCarneiro 1dd284a5bb Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720
PrintReads writes a header when used with -BQSR
2013-02-15 07:18:28 -08:00
Tad Jordan 6cb80591e3 PrintReads writes a header when used with -BQSR 2013-02-14 22:19:14 -05:00
Yossi Farjoun 3a7c8c13e2 Re-enabled fastBAMindexing by replacing the FileChannel with a SeekableBufferedStream
This helps a lot since FileChannel is very low-level and traversing the BAMIndex involves lots of short reads.

- Fixed a deterioration in BAMIndex due to rev'ed picard (see below)
- Added unit tests for SeekableBufferedStream
- Added integrationTests for GATKBAMIndex (in PileupWalkerIntegrationTest)
- Added a runtime-test to verify that the amount read equals the amount requested.
- Added failing tests with expectedExceptions
- Used a DataProvider to make code nicer
2013-02-14 17:51:15 -05:00
Geraldine Van der Auwera 6208742f7c Refactored GATKDocs categories some more ( GSATDG-62 )
-- Renamed ValidatePileup to CheckPileup since validation is reserved word
-- Renamed AlignmentValidation to CheckAlignment (same as above)
-- Refactored category definitions to use constants defined in HelpConstants
-- Fixed a couple of minor typos and an example error
-- Reorganized the GATKDocs index template to use supercategories
-- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)
2013-02-13 16:49:18 -05:00
Geraldine Van der Auwera dff5ef562b Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details)
-- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools
-- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities
-- Reworded some category names and descriptions to be more explicit and user-friendly
2013-02-12 13:36:15 -05:00
Mark DePristo a3dc7dc5cb Extend AWS timeout for uploads of the GATK run reports to 30 seconds 2013-02-08 17:37:36 -05:00
Eric Banks 9826192854 Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class!
* AlignmentUtils.getMismatchCount()
* AlignmentUtils.calcAlignmentByteArrayOffset()
* AlignmentUtils.readToAlignmentByteArray().
* AlignmentUtils.leftAlignIndel()
2013-02-07 13:04:24 -05:00
Eric Banks 562f2406d7 Added check that BaseRecalibrator is not being run on a reduced bam.
- Throws user exception if it is.
 - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
 - Added test to check this is working.
 - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
 - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
 - Added comment about not changing the program record name, as per reviewer comments.
 - Removed unused variable.
2013-02-06 10:14:27 -05:00
Eric Banks 4e5ff3d6f1 Bug fix for NPE in HC with --dbsnp argument.
- I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null.
 - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.
2013-02-05 21:59:19 -05:00
Yossi Farjoun de03f17be4 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixure of two broadly concented samples) were added
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is
   _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests to the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or unifiedGenotyper, into a new IntegrationTest class.
2013-02-04 18:24:36 -05:00
Mark DePristo a281fa6548 Resolves Genome Sequence Analysis GSA-750 Don't print an endless series of starting messages from the ProgressMeter
-- The progress meter isn't started until the GATK actually calls execute on the microscheduler.  Now we get a message saying "Creating shard strategy" while this (expensive) operation runs
2013-02-04 15:47:30 -05:00
Mark DePristo 8d08780582 GATKRunReport now tracks the errorMessage and errorThrown during post for later analysis
-- This is primarily useful in the unit tests, as I now print out additional information on why a test might have failed, if it in fact did.
2013-02-02 19:24:31 -05:00
Mark DePristo 6382d5bdc9 Final cleanup and unit testing for GATKRunReport
-- Bringing code up to document, style, and code coverage specs
-- Move GATKRunReportUnitTest to private
-- Fully expand GATKRunReportUnitTests to coverage writing and reading GATKRunReport to local disk, to standard out, to AWS.
-- Move documentation URL from GATKRunReport to UserException
-- Delete a few unused files from s3GATKReport
-- Added capabilities to GATKRunReport to make testing easier
-- Added capabilities to deserialize GATKRunReports from an InputStream
2013-02-02 15:06:56 -05:00
Mark DePristo eb17230c2f Update AWS access and private keys to the new GATK2LogUploader user
-- Updated EncryptAWSKeys to write the key into the correct resources directory
2013-02-02 15:06:56 -05:00
David Roazen c4b0ba4d45 Temporarily back out the Picard team's patches to GATKBAMIndex from December
These patches to GATKBAMIndex are causing massive BAM index reading errors in
combination with the latest version of Picard. The bug is either in the patches
themselves or in the underlying SeekableBufferedStream class they rely on. Until
the cause can be identified, we are temporarily backing out these changes so that
we can continue to run with the latest Picard/Tribble.

This reverts commits:
81483ec21e528790dfa719d18cdee27d577ca98e
68cf0309db490b79eecdabb4034987ff825ffea8
54bb68f28ad5fe1b3df01702e9c5e108106a0176
2013-02-01 13:51:31 -05:00
David Roazen 292037dfda Rev picard, sam-jdk, and tribble
This is a necessary prerequisite for the org.broadinstitute.variant migration.

-Picard and sam-jdk go from version 1.67.1197 to 1.84.1337

-Picard-private goes from version 2375 to 2662

-Tribble goes from version 119 to 1.84.1337

-RADICALLY trimmed down the list of classes we extract from Picard-private
 (jar goes from 326993 bytes to 6445 bytes!)
2013-02-01 13:51:30 -05:00
David Roazen 6ec1e613a2 Move AWS keys to a resources subdirectory within the phonehome package
Resources must be in a subdirectory called "resources" in the package
hierarchy to be picked up by the packaging system. Adding each resource
manually to the jars in build.xml does not cause the resource to be
added to the standalone GATK jar when we package the GATK, so it's best
to always use this convention.
2013-01-31 11:56:34 -05:00
Mark DePristo 404ee9a6e4 More aggressive checking of AWS key quality upon startup in the GATK 2013-01-31 09:08:38 -05:00
Mark DePristo b707331332 Encrypt GATK AWS keys using the GATK private key, and decrypt as needed as a resource when uploading to AWS logs
-- Has the overall effect that the GATK user AWS keys are no longer visible in the gatk source as plain text.  This will stop AWS from emailing me (they crawl the web looking for keys)
-- Added utility EncryptAWSKeys that takes as command line arguments the GATK user AWS access and secret keys, encrypts them with the GATK private key, and writes out the resulting file to resources in phonehome.
-- GATKRunReport now decrypts as needed these keys using the GATK public key as resources in the GATK bundle
-- Refactored the essential function of Resource (reading the resource) from IOUtils into the class itself.  Now how to get the data in the resouce is straightforward
-- Refactored md5 calculation code from a byte[] into Utils.  Added unit tests
-- Committing the encrypted AWS keys
-- #resolves https://jira.broadinstitute.org/browse/GSA-730
2013-01-30 16:42:23 -05:00
David Roazen 591df2be44 Move additional VariantContext utility methods back to the GATK
Thanks to Eric for his feedback
2013-01-30 13:58:17 -05:00
David Roazen 9985f82a7a Move BaseUtils back to the GATK by request, along with associated utility methods 2013-01-30 13:09:44 -05:00
Mark DePristo 45603f58cd Refactoring and unit testing GenomeLocParser
-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts.  The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit
2013-01-30 09:47:47 -05:00
Mark DePristo 92c5635e19 Cleanup, document, and unit test ActiveRegion
-- All functions tested.  In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads.  New version should be correct
-- Enforce correct ordering of supporting states in constructor
-- Enforce read ordering when adding reads to an active region in add
-- Fix bug in HaplotypeCaller map with new updating read spans.  Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence
-- Encapsulate isActive field with an accessor function
-- Make sure that all state lists are unmodifiable, and that the docs are clear about this
-- ActiveRegion equalsExceptReads is for testing only, so make it package protected
-- ActiveRegion.hardClipToRegion must resort reads as they can become out of order
-- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region.  The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed.  Has a minor impact on PLs and RankSumTest values
-- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy
2013-01-30 09:47:12 -05:00
David Roazen a536e1da84 Move some VCF/VariantContext methods back to the GATK based on feedback
-Moved some of the more specialized / complex VariantContext and VCF utility
 methods back to the GATK.

-Due to this re-shuffling, was able to return things like the Pair class back
 to the GATK as well.
2013-01-29 16:56:55 -05:00
Ryan Poplin e9c3a0acdf fix typo 2013-01-28 22:18:58 -05:00
Ryan Poplin d665a8ba0c The Bayesian calculation of Qemp in the BQSR is now hierarchical. This fixes issues in which the covariate bins were very sparse and the prior estimate being used was the original quality score. This resulted in large correction factors for each covariate which breaks the equation. There is also now a new option, qlobalQScorePrior, which can be used to ignore the given (very high) quality scores and instead use this value as the prior. 2013-01-28 15:56:33 -05:00
David Roazen f63f27aa13 org.broadinstitute.variant refactor, part 2
-removed sting dependencies from test classes
-removed org.apache.log4j dependency
-misc cleanup
2013-01-28 09:03:46 -05:00
David Roazen 3744d1a596 Collapse the downsampling fork in the GATK engine
With LegacyLocusIteratorByState deleted, the legacy downsampling implementation
was already non-functional. This commit removes all remaining code in the
engine belonging to the legacy implementation.
2013-01-28 01:50:30 -05:00
Mark DePristo 0fb238b61e TraverseActiveRegions Optimizations and Bugfixes: make sure to record position of current locus to discharge active regions when there's no data
-- Now records the position of the current locus, as well as that of the last read.  Necessary when passing through regions with no reads.  The previous version would keep accumulating empty active regions, and never discharge them until end of traversal (if there was no reads in the future) or until a read was finally found
-- Protected a call to logger.debug with if ( logger.isDebugEnabled()) to avoid a lot of overhead in writing unseen debugger logging information
2013-01-27 14:10:06 -05:00
Mark DePristo 93d88cdc68 Optimization: LocusReferenceView now passes along the contig index to createGenomeLoc, speeding up their creation
-- Also cleaned up some unused methods
2013-01-27 14:10:06 -05:00
Mauricio Carneiro 705cccaf63 Making SplitReads output FastQ's instead of BAM
- eliminates one step in my pipeline
   - BAM is too finicky and maintaining parameters that wouldn't be useful was becoming a headache, better avoided.
2013-01-27 02:36:31 -05:00
Mauricio Carneiro 6ea7133d95 Updating licenses of latest moved files 2013-01-26 13:46:52 -05:00
Ami Levy-Moonshine b4447cdca2 In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the samples names are unique in VariantContextUtils.simpleMerge for each VCs. It couse to a bug that was reported on the forum (when a VCs had 2 VC from the same sample).
Now we will check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unitTests in CVunitTest.java.
2013-01-25 15:49:51 -05:00
Mark DePristo 008b617577 Cleanup the getLIBS function in LocusIterator
-- Now throws an UnsupportedOperationException in the base class.  Only LocusView implements this function and actually returns the LIBS
2013-01-25 11:07:28 -05:00
Eric Banks 6dd0e1ddd6 Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants.
This allows us to move SelectVariants into the public suite of tools now.
2013-01-25 09:42:04 -05:00
Mark DePristo 592f90aaef ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events.  The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
2013-01-24 13:48:00 -05:00
Mark DePristo 0c94e3d96e Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp
-- Previously we allowed band pass filter size to be specified along with the sigma.  But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel.  This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
2013-01-24 13:47:59 -05:00
Mark DePristo 9e43a2028d Making band pass filter size, sigma, active region max size and extension all accessible from the command line 2013-01-24 13:47:59 -05:00
Eric Banks 6790e103e0 Moving lots of walkers back from protected to public (along with several of the VA annotations).
Let's see whether Mauricio's automatic git hook really works!
2013-01-24 11:42:49 -05:00
Mark DePristo 09edc6baeb TraverseActiveRegions now writes out very nice active region and activity profile IGV formatted files 2013-01-23 13:46:01 -05:00
Mark DePristo 8e8126506b Renaming IncrementalActivityProfile to ActivityProfile
-- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV
2013-01-23 13:46:01 -05:00
Mark DePristo eb60235dcd Working version of incremental active region traversals
-- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version.  This means that ART walkers will now take much less memory than previously.  On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory.  Over the whole chr20 only 5K reads were ever held in ART at one time.
-- Fixed bug in the way active regions worked with shard boundaries.  The new implementation no longer see shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries.  This behavior was actually encoded in the unit tests, so those needed to be updated as well.
-- Changed the way that preset regions work in ART.  The new contract ensures that you get exactly the regions you requested.  the isActive function is still called, but its result has no impact on the regions.  With this functionality is should be possible to use the HC as a generic assembly by forcing it to operate over very large regions
-- Added a few misc. useful functions to IncrementalActivityProfile
2013-01-23 13:46:00 -05:00