Guillermo del Angel
e6b326c189
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-30 21:32:19 -04:00
Guillermo del Angel
6c9d3ec155
Remerge after changes to allele construction code. More cleanups/fixes to artificial read pileup provider
2012-07-30 21:32:03 -04:00
Ryan Poplin
7ed06ee7b9
Updating FindCoveredIntervals to use the changes to the ActiveRegionWalker.
2012-07-30 12:16:27 -04:00
Ryan Poplin
13591b169f
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-30 12:13:24 -04:00
Eric Banks
0b30588d67
Catch yet another class of User Errors
2012-07-30 11:59:56 -04:00
Eric Banks
5743694196
Merged bug fix from Stable into Unstable
2012-07-30 11:35:28 -04:00
Eric Banks
79195b97a3
Adding categories for the remaining uncategorized walkers
2012-07-30 11:35:08 -04:00
Guillermo del Angel
5b9a1af7fe
Intermediate fix for pool GL unit test: a) Fix up artificial read pileup provider to give consistent data. b) Increase downsampling in pool integration tests with reference sample, and shorten MT tests so they don't last too long
2012-07-30 09:56:10 -04:00
Eric Banks
7630c929a7
Re-enabling the unit tests for reverse allele clipping
2012-07-29 22:24:56 -04:00
Eric Banks
b07bf1950b
Adding an integration test for another feature that I snuck in during a previous commit: we now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them (this had been turned off because the previous version used Strings to do the uppercasing whereas we stick with byte operations now).
2012-07-29 22:19:49 -04:00
Eric Banks
c4ae9c6cfb
With the new Allele representation we can finally handle complex events (because they aren't so complex anymore). One place this manifests itself is with the strict VCF validation (ValidateVariants used to skip these events but doesn't anymore) so I've added a new test with complex events to the VV integration test.
2012-07-29 19:22:02 -04:00
Eric Banks
99b15b2b3a
Final checkpoint: all tests pass. Note that there were bugs in the PoolGenotypeLikelihoodsUnitTest that needed fixing and eventually led to my needing to disable one of the tests (with a note for Guillermo to look into it). Also note that while I have moved over the GATK to use the new non-null representation of Alleles, I didn't remove all of the now-superfluous code throughout to do padding checking on merges; we'll need to do this on a subsequent push.
2012-07-29 01:07:59 -04:00
Eric Banks
2b1b00ade5
All integration tests and VC/Allele unit tests are passing
2012-07-27 17:03:49 -04:00
Eric Banks
beb7610195
Resolving merge conflicts
2012-07-27 15:52:02 -04:00
Eric Banks
27e7e11ec0
Allele refactoring checkpoint #3 : all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this.
2012-07-27 15:48:40 -04:00
Ryan Poplin
22bb4804f0
HaplotypeCaller now uses an excessive number of high quality soft clips as a triggering signal in order to capture both end points of a large deletion in a single active region.
2012-07-27 12:44:02 -04:00
Ryan Poplin
a0890126a8
ActiveRegionWalker's isActive function returns a results object now instead of just a double.
2012-07-27 11:01:39 -04:00
Eric Banks
ef335b6213
Several more walkers have been brought up to use the new Allele representation.
2012-07-27 02:14:25 -04:00
Eric Banks
9e2209694a
Re-enable reverse trimming of alleles in UG engine when sub-selecting alleles after genotyping. UG integration tests now pass.
2012-07-27 00:47:15 -04:00
Eric Banks
baf3e33730
Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass.
2012-07-26 23:27:11 -04:00
Ryan Poplin
35e803e110
Merged bug fix from Stable into Unstable
2012-07-26 14:00:04 -04:00
Ryan Poplin
4f741b4cd7
Smoothing in the BQSR bins should be one error observation and one non-error observation.
2012-07-26 13:59:02 -04:00
Guillermo del Angel
2ae890155c
Improvements to indel calling in pool caller: a) Compute per-read likelihoods in reference sample to determine whether a read is informative or not. b) Fixed bugs in unit tests. c) Fixed padding-related bugs when computing matches/mismatches in ErrorModel. d) Added a couple more integration tests to increase test coverage, including testing odd ploidy
2012-07-26 13:43:00 -04:00
Eric Banks
a694d1b5de
Merge branch 'master' into allelePadding
2012-07-26 01:53:14 -04:00
Eric Banks
32516a2f60
Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels will fail at this point.
2012-07-26 01:50:39 -04:00
Mark DePristo
8c418a15da
Sorting out HMS error handling (fingers crossed)
...
-- Check if a traversal error occurred in the last shard
-- Catch ExecutionException from the TreeReducer and throw as our HMS exception
-- ShardTraverser just throws the exception as formatted by the HMS, rather than wrapping it as a RuntimeException itself
-- EngineFeaturesIntegrationTests now uses public exampleFASTA (faster), and does 1000x iterations (slower)
2012-07-25 23:13:12 -04:00
Mark DePristo
9242f63a4d
On the way to really sorting out HMS error handling
...
-- Better error message when a traversal error occurs (a real bug)
-- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50 times
-- A bit of cleanup in WalkerTest
2012-07-25 22:11:10 -04:00
Mark DePristo
5671992db3
RMDTrackBuilderUnitTest now uses private/testdata file to avoid filesystem race conditions
2012-07-25 22:05:04 -04:00
Eric Banks
7eb3f54750
Added category docs for the remaining public walkers (I think I got them all). I removed a couple of totally unnecessary walkers.
2012-07-25 21:40:28 -04:00
Eric Banks
2982b24c4b
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable
2012-07-25 20:36:53 -04:00
Eric Banks
0a98a6aa8d
Adding extraDocs tag per Mauricio's request
2012-07-25 18:23:18 -04:00
Mauricio Carneiro
fce5cb9f35
Few category changes
2012-07-25 17:23:02 -04:00
Eric Banks
05fa377a8e
Adding GATK categories to standard walkers. Will add to remaining walkers after the next successful release (so that I can see which walkers are public and still need it).
2012-07-25 16:05:47 -04:00
Mauricio Carneiro
d46cf47bd1
Updating Read Filter documentation
2012-07-25 15:05:47 -04:00
Eric Banks
6a3bfa3811
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable
2012-07-25 14:11:11 -04:00
Eric Banks
357e0b35af
Register GATK-full-only walkers and rethrow the missing walker error as a not supported in GATK lite error
2012-07-25 14:11:03 -04:00
Roger Zurawicki
5b74763096
Removed Categories.
...
We will use DocumentedGATKFeatures to create categories in our documentation. Eric I guess will be in charge of this. We need to remove walkers and think how to categorize everything.
Tools can be hidden from GATKdocs with the @Hidden annotation
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-07-25 13:46:24 -04:00
Eric Banks
a5721a8846
Context covariate optimizations were not suited for multiple threads, so I removed them (since that ended up being much, much easier than trying to make the covariates thread local). Added -nt 2 layer to BQSR integration tests to confirm that it now works with multiple threads.
2012-07-25 13:38:07 -04:00
Eric Banks
e0c07f5567
Reverting old commits that made error handling better because ultimately they made things worse.
2012-07-25 12:37:59 -04:00
Mark DePristo
16947e93f2
Integration test to ensure VariantFiltration makes . -> PASS/FAIL like VQSR
...
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:39 -04:00
Mark DePristo
fcefa61bce
Remove reference dependence in BCF2Codec
...
-- Adding BCF2Codec to VCF.jar and associated unit tests
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Mark DePristo
19a257a5c1
Multiple bugfixes
...
-- VariantFiltration now properly sets passFilters in VC
-- BCF2 writer now properly decodes lazy BCF genotype data that it uses. Improper use generated a horrible subtle bug but the good news is that the extra checks I put in (unnecessarily a few days ago) caught the bug!
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Mark DePristo
3066894215
Bugfix for BCF2
...
-- Always decode genotypes block when writing out a BCF file. If the header changes (and we currently don't know this easily) then the dictionary keys used in the genotypes block may be invalid. Temporarily added a private static boolean that turns off writing of the blocks until Eric and his team rewrite the header.
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Guillermo del Angel
eb55061fd0
a) Document BEAGLE codec, b) Bug fix: inbreeding coefficient shouldn't be computed for non-diploid organisms in current implementation
2012-07-24 12:16:15 -04:00
Mauricio Carneiro
348e86159e
Moving doclets to public
2012-07-23 23:52:14 -04:00
Mauricio Carneiro
5cd98a36b9
Making ForumAPIUtils public
2012-07-23 17:44:24 -04:00
Mauricio Carneiro
3d92f041f3
forgot to delete the merging line
2012-07-23 17:35:07 -04:00
Roger Zurawicki
f3c504769b
Added the ability to update the Forum
...
GATKDocs looks for a key on gsa4, and updates the forum with new walker if it exists.
More changes were made to the GATKDocs. Works nicely with bootstrap on and offline.
Cleaned up the code as well
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-07-23 17:17:33 -04:00
Khalid Shakir
46ca49b63d
Removed 'Walker' suffix from packages/GATKEngine.xml that were breaking the packaged release.
...
Archived AnalyzeCovariates scripts and removed references in build packages / GATK extensions.
2012-07-23 16:32:31 -04:00
Ryan Poplin
2a14bbe4f0
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-23 11:28:26 -04:00
Ryan Poplin
10d143c35c
Adding error model header names in the BQSR recal plot. Making the downsampling of points look a little nicer.
2012-07-23 11:28:17 -04:00
Eric Banks
675ccab2fa
Renaming BQSR to BaseRecalibrator
2012-07-23 10:17:17 -04:00
Ryan Poplin
2e486d83e2
Updating HaplotypeCaller docs and expanding integration tests.
2012-07-23 10:05:42 -04:00
Guillermo del Angel
39f45127f3
Fix md5's broken by recent changes to FisherStrand calculation
2012-07-21 14:41:38 -04:00
Mauricio Carneiro
65f4b67b86
Fixing walker unit test with the new naming convention
2012-07-20 17:50:29 -04:00
Mauricio Carneiro
921eaad33f
Generalized the default platform parameter in BQSRv2
...
Parameter wasn't working outside of the BQSR walker. It now takes the information on the recalibration report in other tools (PrintReads for example) and treats all reads as coming from the defined default platform.
2012-07-20 17:29:13 -04:00
Mauricio Carneiro
5dc2143142
Removed support for walkers ending with "Walker" from the engine.
...
If your walker has "Walker" in the name, you will have to use "Walker" on the -T to access it.
2012-07-20 17:27:11 -04:00
Mauricio Carneiro
d446d34227
GATK Error messages now point to the new website instead of GetSatisfaction.
2012-07-20 17:27:11 -04:00
Mauricio Carneiro
116885a450
Removed the "Walker" suffix from all walkers that had it.
...
* Did not touch archived walkers... those can be named whatever.
* Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...)
* Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.
2012-07-20 17:27:11 -04:00
Christopher Hartl
3ee46cced2
Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-19 21:25:40 -04:00
Christopher Hartl
af383c30b5
Ensure that the gene summary has a header line
2012-07-19 21:24:04 -04:00
Mark DePristo
2ca5fc62a2
Support for MISSING BCF2 type
...
-- Heng wants to use 0x0? to represent any missing type value, which in our implementation was invalid. Updated our codebase to support this construct. Heng said he'll update the BCF2 quick reference.
-- Enabled integration test reading Heng's ex2.bcf file
-- GATK now only warns in the case where the END info field isn't the same (or +1 due to padding) as the getEnd() function as determined by the GATK. Turns out there's a single record in the 1000G SV call set that doesn't have the right length
-- VariantContextTestProvider now tests that X = Y where X -> writing -> reading -> writing -> reading = Y for a variety of variant context inputs X
-- Added integration test reading 1000G SV chr1 calls (from Chris)
2012-07-19 16:14:26 -04:00
Guillermo del Angel
c16f9f2f15
a) Use new method to check for GATK Lite. b) Minor improvements to indel pool caller (more to come): brain-dead, quick way to limit number of alt alleles to genotype. We can't process too many alt alleles because of the combinatorial explosion of GL values with high ploidy, and some STR validation targets had up to 12 alt alleles, resulting in GL vectors of > 1e8 elements. Can't use pileup elements since typically not many alleles will be in one pileup, and different alleles will appear in different samples; a nicer solution is TBD. c) Commit to posterity scala script for large scale validation calling, still work in progress
2012-07-19 10:24:08 -04:00
Eric Banks
5f5edeca63
Reverting move of BQSR tests to public, as per DR's email
2012-07-19 10:02:05 -04:00
Eric Banks
e370030e6c
As requested by Mark, I've broken out the code to pull out the protected subclass when available (and otherwise use the public version) into the GATKLiteUtils class. People should use this code instead of reimplementing all of the java reflection on their own.
2012-07-18 22:44:37 -04:00
Eric Banks
d46ccec04e
Adding Unit Tests to cover the exception catching for Picard errors: because we are using String matching, we want to ensure that we know if/when the exception text changes underneath us.
2012-07-18 21:48:58 -04:00
Eric Banks
9c1ab1b0c0
Move BQSR integration test and its dependent files into public; previously there was a protected->private dependency.
2012-07-18 21:11:33 -04:00
Mark DePristo
994c5c31c1
Enabling VariantEval integration tests for ValidationReport
2012-07-18 16:07:47 -04:00
Mark DePristo
74e153ff4a
FisherStrand now uses RankSumTest isUsableBase to decide if a read should be included in testing
...
-- Previously used hardcoded MAPQ > 20 && QUAL > 20 but now uses isUsableBase
-- Updating MD5s as appropriate
2012-07-18 16:07:47 -04:00
Mark DePristo
dede3a30e9
Improvements to the validation report of VariantEval
...
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status. This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF. The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
-- TODO: actually run integration tests when I have an internet connection
2012-07-18 16:07:47 -04:00
Mark DePristo
559a4826be
Improvements to the validation report of VariantEval
...
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status. This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF. The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
2012-07-18 16:07:46 -04:00
Mark DePristo
dc292c0317
FisherStrand now includes all reads and bases, regardless of mapping quality and base quality, just like other annotations
...
-- This actually proved to be a problem with Ion Torrent data where the base quality can be quite low, and so we need to include Q15 bases for calling effectively.
2012-07-18 16:07:46 -04:00
Eric Banks
2c0f073ab1
Make -qq arg hidden for now since it's still very experimental
2012-07-18 15:43:25 -04:00
Eric Banks
b46c85e8b4
More bad BAM file catching
2012-07-18 15:26:31 -04:00
Eric Banks
659eee13a6
Handle NPE generated in UG when non-standard reference bases are present in the fasta
2012-07-18 15:16:27 -04:00
Eric Banks
9af2cfe283
Catch underlying file system problems that get masked as Tribble index errors. There's also a quick patch to the HMS that isn't really the ultimate fix needed; Mark and I will review at a later point.
2012-07-18 15:11:38 -04:00
Eric Banks
4c730542f0
Handle RuntimeExceptions thrown by Picard that are really User Errors. I will add unit tests for these as best I can later.
2012-07-18 13:56:35 -04:00
Eric Banks
ae08d35138
Catch 'too many open files' errors that show up when trying to read the bam index. All that needs to be done is to flesh out the original error message (because it will get caught later and rethrown correctly).
2012-07-18 12:57:34 -04:00
Eric Banks
f2fe59a9d4
Wow, there are a ton of errors captured having to do with being unable to merge the temp Tribble output. I'm expanding the error message a bit to help see if we can do anything going forward.
2012-07-18 12:31:59 -04:00
Eric Banks
e4db8dde91
Enabled a whole other bunch of integration tests for BQSRv2. While I was there I also changed the default context size for indels to 3 (from 8) since that's what works best in the current implementation (as suggested by Ryan). At this point, all of the new core tools (ReduceReads, BQSRv2, HaplotypeCaller, UG extensions) have been moved over to protected and should be stable. Looks like we are pretty much ready for GATK 2.0!
2012-07-17 23:36:43 -04:00
Eric Banks
a8d08ea18d
As a user pointed out, it is not valid for a GenomeLoc to have a start or stop equal to 0.
2012-07-17 22:18:43 -04:00
Guillermo del Angel
29273abab7
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-17 16:58:12 -04:00
Guillermo del Angel
731bbba2e6
Bug fixes for integration test, use correct new UG syntax
2012-07-17 16:57:59 -04:00
Eric Banks
33be41ecf5
Cleaning up integration test
2012-07-17 16:06:04 -04:00
Eric Banks
8dbc9cb29c
Add the ability to emit the original quals in the OQ tag
2012-07-17 15:52:56 -04:00
Guillermo del Angel
40b8c7172c
Pool Caller refactoring in preparation of GATK 2.0: a) PoolCallerUnifiedArgumentCollection disappeared, and arguments moved to UnifiedArgumentCollection. b) PoolCallerWalker is no longer needed and redundant, all functionality subsumed by UG. UG now checks if GATK is lite - if so, don't allow ploidy > 2. c) Moved pool classes from private to protected. d) Changed the way to specify ploidy. Instead of specifying samples per pool and having ploidy = 2*samplesPerPool, have user specify ploidy directly, which is cleaner. Update tests accordingly. We can now call triploid seedless grape genotypes correctly in theory. e) Renamed argument -reference to -reference_sample_calls since the former is ambiguous and it's not clear what it refers to.
2012-07-17 15:27:04 -04:00
Laurent Francioli
68d0e4dd6d
- Multi-allelic sites are now correctly ignored - Reporting of mendelian violations enhanced - Corrected TP overflow by capping it to Byte.MAX_VALUE
...
- Updated integration tests to reflect changes in MVF file output
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-07-17 15:21:10 -04:00
Eric Banks
b0d99fd10d
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-17 15:12:28 -04:00
Eric Banks
305db8c0d1
Total rewrite of the isGATKLite() functionality with help of Khalid/David. PluginManager was not working for us.
2012-07-17 15:11:03 -04:00
Ryan Poplin
6efbcd99f1
HaplotypeCaller is now an AnnotatorCompatibleWalker with all the rights and privileges pertaining thereto. Enabling the ClippingRankSumTest after showing it was useful for 1000 Genomes calling.
2012-07-17 14:38:36 -04:00
Eric Banks
110886e8b9
Oops, got the logic wrong.
2012-07-17 13:37:11 -04:00
Eric Banks
a963b37424
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-17 13:15:37 -04:00
Eric Banks
3a64398d07
Cleaned up the isGATKLite check
2012-07-17 12:46:16 -04:00
Eric Banks
62c5228048
1) Revert previous change - indel recalibration is turned on by default and users of the Lite version will need to turn it off to avoid a User Error. 2) Implemented the engine.isGATKLite() method.
2012-07-17 12:23:40 -04:00
Chris Saunders
1913d1bbd0
Put RunReport S3 upload on timeout thread
...
Move the RunReport S3 upload process onto a separate thread with a timeout allowing the parent to continue.
Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
2012-07-17 12:19:39 -04:00
Eric Banks
40618ac471
A bunch of BQSR changes: 1) by default we do not emit indel quals, but they can be turned on with --enable_indel_quals. 2) We check whether or not we are running in Lite mode (not done yet) and if so and the user is trying to recalibrate indels, we throw a User Error (not supported). 3) Like v1 we now allow the user to set the qual value below which we don't recalibrate (this was the remaining source of differences in the v1 vs. v2 plots).
2012-07-17 10:52:43 -04:00
Eric Banks
d5b3a2eabf
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-17 00:32:53 -04:00
Eric Banks
f657b8bda8
Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow.
2012-07-17 00:32:34 -04:00
Eric Banks
a003148d50
Move AnalyzeCovariates over too.
2012-07-16 16:11:56 -04:00
Eric Banks
0a89adbcdb
Add utility decorators so that classes can tell you which package source they come from if they want to (suggested by Khalid). Using those decorators, we can easily pull out the BQSR updateDataForPileupElement() method into a standard RecalibrationEngine and an AdvancedRecalibrationEngine and use the protected one (AdvancedRE) if available (otherwise, the public one).
2012-07-16 15:34:50 -04:00
Eric Banks
52baac1e16
Move BQSRv2 into public and v1 into the archive.
2012-07-16 14:23:38 -04:00
Khalid Shakir
07822d6c0f
Fixed input annotations for master/test files on DiffObjectsWalker.
2012-07-16 13:33:11 -04:00
Eric Banks
2a830939df
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-14 23:49:59 -04:00
Eric Banks
f29cadd7e2
By default, don't quantize quals in BQSRv2
2012-07-14 23:49:48 -04:00
Eric Banks
75543a3f22
ReadClipper.clipRead's claim that it doesn't modify the original read was false. Ultimately, GATKSAMRecord.clone (as documented) creates a soft copy of the read - so modifying e.g. the bases of the cloned read means that you modify the bases of the original read too. Because of this, when the BQSRv2 Context covariate was writing Ns over the low quality tails of the reads they got propagated out to the output BAM file (very bad). I've updated the ReadClipper docs and cleaned up the code (no reason to use a clone of the read anymore given that we are already modifying the original). For now, the simplest thing is to have the Context covariate store the original bases, overwrite the low quality bases with Ns, compute covariates, and restore the original bases; we can update later if needed.
2012-07-13 18:50:27 -04:00
Ryan Poplin
443f02ffc2
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-13 16:09:24 -04:00
Khalid Shakir
6dfcc486e8
In ApplyRecalibration marking filter as PASS instead of '.' when the site passes by calling .passFilters().
2012-07-13 15:40:56 -04:00
Ami Levy Moonshine
5d0a7335ea
remove unnecessary use in the PRIORITY list
...
remove unneeded imports
2012-07-13 15:27:08 -04:00
Ryan Poplin
d70bb59182
HaplotypeCaller now calls insertion events that aren't fully assembled as symbolic alleles.
2012-07-10 14:22:23 -06:00
Guillermo del Angel
279dff9f81
Bug fix when specifying a JEXL expression for a field that doesn't exist: we should treat the whole expression as false, but we were rethrowing the JEXL exception in this case. Added integration test to cover this in SelectVariants
2012-07-10 13:59:00 -04:00
Mauricio Carneiro
7eb45b4038
Fixed BQSR IntegrationTests
...
* BinaryTag covariate is Experimental, not Standard (this was breaking integration tests)
* New parameter in the Recalibration report requires new MD5 for one of the integration tests.
2012-07-09 13:55:12 -04:00
Eric Banks
dd0c47ab7e
Don't cast to a specific walker type since any walker can use the VA engine
2012-07-09 10:25:58 -04:00
Mark DePristo
5b0ade67c8
Updates to VCF processing for better BCF processing
...
-- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order]. Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy
-- Updating GATK code to use the appropriate header function (this is why so many files have changed)
-- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this)
-- Bugfix for adding contig lines to BCF2 header dictionary
-- VCFHeader metaData no longer sorted internally. The system now maintains the data in header order, and only sorts output as requested in API
-- VCFWriter and BCF2Writer now explicitly sort their header lines
-- Don't allow filters to be added that are PASS in the contract
2012-07-08 15:44:33 -07:00
Mark DePristo
63f5262e45
mergeInfoWithMaxAC is no longer hidden in CombineVariants
2012-07-08 15:44:32 -07:00
Mark DePristo
66aee613e2
Bugfix for set key in mergeInfoWithMaxAC.
...
-- Previous version was always setting set=source of info with highest AC. Should actually have been set to the set annotation value itself.
2012-07-08 15:44:32 -07:00
Mark DePristo
91f0ed8059
Fixed nasty Rscript typo in VariantRecalibrator when compactPDF is available
2012-07-08 15:44:32 -07:00
Mark DePristo
87b090c362
Update VariantRecalibrator error message to use -resource not old -B syntax
2012-07-08 15:44:31 -07:00
Mauricio Carneiro
125e6c1a47
added BinaryTagCovariate for ancient dna analysis
2012-07-06 15:03:20 -04:00
Mauricio Carneiro
e93b025b39
Fixing unit test
...
with the new clipping behavior for weird cigars, we no longer can assert the final number of bases in the unit test, so I'm taking this bit off the unit test.
2012-07-06 12:08:09 -04:00
Mauricio Carneiro
f603d4c48c
Fixing PairHMMIndelErrorModel boundary issue
...
When checking the limits of a read to clip, it wasn't considering reads that may already have been clipped before.
2012-07-06 11:48:04 -04:00
Eric Banks
dd571d9aa0
Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags.
2012-07-04 01:22:20 -04:00
Eric Banks
33306d2e20
Changing the logic of the -standard argument; the way it stands currently one can never turn off the cycle or context covariates. Now they are on by default and users must opt out of them to turn them off.
2012-07-04 00:21:21 -04:00
Eric Banks
7d30558e6f
Only 'pad' the cycle covariate for indels, not substitutions
2012-07-03 23:47:01 -04:00
Mauricio Carneiro
17efbbf8b1
Fixed ReadClipperUnitTest
...
The behavior of the clipping on weird cigar strings such as 1I1S1H and 9S56H has changed, and the test has to change accordingly.
2012-07-03 16:38:51 -04:00
Eric Banks
22f1afddaa
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-03 14:55:59 -04:00
Eric Banks
617eebd204
More misc cleanup
2012-07-03 14:55:37 -04:00
Eric Banks
344c3aeb1d
Cleanup from previous commit
2012-07-03 14:42:44 -04:00
Ryan Poplin
9e8e78de15
Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels.
2012-07-03 14:30:37 -04:00
Eric Banks
0b37d44b0d
Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup.
2012-07-03 13:05:11 -04:00
Eric Banks
031322ff00
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-03 00:12:59 -04:00
Eric Banks
a4670113bd
Refactored/renamed the nested integer array; cleaned up code a bit.
2012-07-03 00:12:33 -04:00
Ryan Poplin
f92139dd82
Ooops, UG VA path for rank sum tests isn't happy with empty lists. Disabling clipping rank sum test for now.
2012-07-02 21:12:42 -04:00
Ryan Poplin
7e7b4cd1b9
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-02 16:37:54 -04:00
Ryan Poplin
b807ff63ef
HaplotypeCaller now creates MNP and complex substitutions by using LD information to decide if events segregate together on haplotypes. Added unit test.
2012-07-02 16:37:39 -04:00
Mauricio Carneiro
3cea080aa8
Cache SoftStart() and SoftEnd() in the GATKSAMRecord
...
these are costly operations when done repeatedly on the same read.
2012-07-02 16:22:00 -04:00
Mauricio Carneiro
88a02fa2cb
Fixing bug for reads with cigars like 9S54H
...
When hard-clipping, predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read, and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.
2012-07-02 16:22:00 -04:00
Mark DePristo
1b0a775773
Disabling bcf2 reading from samtools because it's 1-based; updating select variants integrationtest
2012-07-02 15:55:42 -04:00
Eric Banks
cac72bce91
Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit.
2012-07-02 14:33:33 -04:00
Mark DePristo
602729c09d
Moved parallel tests from SelectVariants to separate SelectVariantsParallelIntegrationTest
...
-- Enabled previous tests -- all now working
-- Added modern test against new VCF as well
2012-07-02 11:40:28 -04:00
Mark DePristo
bcd2e13d8b
Adding duplicate header line keys is a logger.debug not logger.warn message now
2012-07-02 11:39:34 -04:00
Mark DePristo
01e04992f8
Fixed compatibilities in AbstractVCFCodec that resulted in key=; being parsed and written as key; in VCF output
2012-07-02 11:38:59 -04:00
Eric Banks
c94c8a9c09
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-02 08:53:01 -04:00
Mark DePristo
7aff4446d4
Added unit tests for header repairing capabilities in the GATK engine
2012-07-01 15:38:10 -04:00
Mark DePristo
480b32e759
BCF2 is now officially zero-based open-interval, and that's how the GATK does it now
2012-07-01 14:59:27 -04:00
Ryan Poplin
b6093ff02c
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-01 10:32:37 -04:00
Mark DePristo
9b87dcda4f
Fixing remaining integration test errors. Adding missing ex2.bcf
2012-06-30 16:23:11 -04:00
Mark DePristo
5ad9a98a15
Minor bugfixes / consistency fixes to filter strings of Genotypes and AC/AF annotations
...
-- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order
-- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values
2012-06-30 11:22:49 -04:00
Mark DePristo
385a3c630f
Added check in VariantContext.validate to ensure that getEnd() == END value when present
...
-- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward
-- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF
2012-06-30 11:22:48 -04:00
Mark DePristo
893630af53
Enabling symbolic alleles in BCF2
...
-- Bugfix for VCFDiffableReader: don't add null filters to object
-- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles
-- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication. Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where there's a biallelic symbolic allele
-- Brand new home for allele clipping / padding routines in VCFAlleleClipper. Actually documented this code, which results in lots of **** negative comments on the code quality. Eric has promised that he and Ami are going to rethink this code from scratch. Fixed many nasty bugs in here, cleaning up unnecessary branches, etc. Added UnitTests in VCFAlleleClipper that actually test the code fully. In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how these tests need to be fixed. Even introduced some minor optimizations
-- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason. Fixed.
-- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles
2012-06-30 11:22:48 -04:00
Mark DePristo
16276f81a1
BCF2 with support for symbolic alleles
...
-- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys
-- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest
2012-06-30 11:22:48 -04:00