Commit Graph

12949 Commits (e2c2aa7b05765794b3477dc4b1f89d6b60e1682b)

Author SHA1 Message Date
Eric Banks e2c2aa7b05 Merge pull request #475 from broadinstitute/eb_fix_null_alleles_bug_PT63551060
Added in a check for what would be an empty allele after trimming.
2014-01-15 08:05:21 -08:00
Eric Banks 9f1ab0087a Added in a check for what would be an empty allele after trimming. 2014-01-15 11:04:19 -05:00
Ryan Poplin 201ad398ac Merge pull request #473 from broadinstitute/eb_fix_qd_indel_normalization
The QD normalization for indels was busted and is now fixed.
2014-01-14 08:56:19 -08:00
Eric Banks e4fdc5ac44 Merge pull request #474 from broadinstitute/eb_fix_haplotype_resolver_PT63333488
Fixing the Haplotype Resolver so that it doesn't complain about missing header lines
2014-01-14 07:36:53 -08:00
Geraldine Van der Auwera f67c33919b Merge pull request #468 from broadinstitute/gg_fixSAMPileup
Updated SAMPileup codec and pileup-related docs
2014-01-14 06:30:04 -08:00
Geraldine Van der Auwera edf5880022 Updated SAMPileup codec and pileup-related docs
Problem: the codec was written to take in consensus pileups produced with the pileup -c option (which consist of 10 or 13 fields per line depending on the variant type) but errored out on the basic pileup format (which has only 6 fields per line). This was inconsistent and confusing to users.

Solution: I added a switch in the parsing to recognize and handle both cases more appropriately, and updated related docs. While I was at it I also improved error messages in CheckPileup, which now emits User Error: Bad Input exceptions when reporting mismatches. That may not be the best thing to do (ultimately they're not really errors, they're just reporting unwelcome results), but it beats emitting Runtime Exceptions.

Tested by CheckPileupIntegrationTest, which tests both format cases.
2014-01-14 09:14:16 -05:00
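
The field-count switch described in the SAMPileup codec commit above can be illustrated with a minimal sketch. This is not the GATK codec itself: the function name `classify_pileup_line` and the tab delimiter are assumptions made for illustration, and only the field counts (6 for basic, 10 or 13 for consensus) come from the commit message.

```python
def classify_pileup_line(line):
    """Classify a pileup line by its number of fields.

    Assumption: fields are tab-separated. The counts themselves
    (6 = basic, 10 or 13 = consensus) are taken from the commit message.
    """
    fields = line.rstrip("\n").split("\t")
    count = len(fields)
    if count == 6:
        return "basic"
    if count in (10, 13):
        return "consensus"
    # Anything else is treated as malformed input rather than guessed at.
    raise ValueError("unexpected number of pileup fields: %d" % count)
```

Rejecting every other field count, rather than guessing a format, keeps malformed input visible as an input problem.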
Eric Banks 16ecc53749 Merge pull request #469 from broadinstitute/gg_gatkdoc_fixes
Assorted fixes and improvements to gatkdocs
2014-01-14 05:56:07 -08:00
Eric Banks fd511d12a2 Fixing the Haplotype Resolver so that it doesn't complain about missing header lines.
The code comments very clearly state that INFO fields shouldn't be propagated into the output,
but someone must have accidentally changed it afterwards.  This is just a simple one-line fix
to make sure the code adhered to the comments.

Delivers #63333488.
2014-01-13 22:47:43 -05:00
droazen 347fab4717 Merge pull request #471 from broadinstitute/eb_output_log_info_for_tim
Adding more meta information about the user to the GATK logging output, per Tim F's request.
2014-01-13 17:48:40 -08:00
Geraldine Van der Auwera bdb3954eb3 removed maxRuntime minValue 2014-01-13 20:45:43 -05:00
Geraldine Van der Auwera 8fcad6680b Assorted fixes and improvements to gatkdocs
-Added docs for ERC mode in HC
 -Moved RecalibrationPerformance walker to private since it is experimental and unsupported
 -Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them
 -Improved error msg for conflict between per-interval aggregation and -nt
 -Minor clean up in exception docs
 -Added Toy Walkers category for devs and dev supercat (to build out docs for developers)
 -Added more detailed info to GenotypeConcordance doc based on Chris forum post
 -Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples)
 -Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it)
 -Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)
2014-01-13 17:46:22 -05:00
Eric Banks c7e08965d0 The QD normalization for indels was busted and is now fixed.
It is true that indels of length > 1 have higher QUALS than those of length = 1.  But for the HC those
QUALS are not that much higher, and it doesn't continue scaling up as the indels get larger.  So we no
longer normalize by indel length (which massively over-penalizes larger events and effectively drops their
QD to 0).

For the UG the previous normalization also wasn't perfect.  Now we divide the indel length by a factor
of 3 to make sure that QD is consistent over the range of indel lengths.

Integration tests change because QD is different for indels.
Also, got permission from Valentin to archive a failing test that no longer applies.

Thanks to Kurt on the GATK forum for pointing this all out.
2014-01-13 15:23:36 -05:00
Eric Banks 851ec67bdc Adding more meta information about the user to the GATK logging output, per Tim F's request. 2014-01-13 14:36:02 -05:00
droazen 7cd304fb41 Merge pull request #470 from broadinstitute/mf_new_RBP
Mf new rbp
2014-01-13 08:46:27 -08:00
Ryan Poplin 3b8209f3b2 Merge pull request #467 from broadinstitute/rp_fix_names_NA12878ROCCurve
The ROC Curve report lists the name as the name of the vcf file now inst...
2014-01-09 06:56:34 -08:00
MauricioCarneiro 50cd6781b3 Merge pull request #465 from broadinstitute/eb_improvements_to_ref_confidence_merger
Improvements to ref confidence merger
2014-01-08 10:51:01 -08:00
Ryan Poplin 8881926bc6 The ROC Curve report lists the name as the name of the vcf file now instead of project+name. 2014-01-08 09:44:21 -05:00
Ryan Poplin c86e36c909 Merge pull request #466 from broadinstitute/rp_phase3_vqsr_scala
Adding here the Qscript used to perform the VQSR for 1000 Genomes Projec...
2014-01-08 06:39:46 -08:00
Ryan Poplin 7d5a710ea6 Adding here the Qscript used to perform the VQSR for 1000 Genomes Project phase 3 2014-01-08 09:38:13 -05:00
Eric Banks 553b3e56bd Merge pull request #463 from broadinstitute/eb_fix_realigner_bugs_from_pearson
Fixed edge condition in the realigner where a realigned read can sometim...
2014-01-08 05:36:11 -08:00
Eric Banks 0323caefc8 Added some bug fixes to the gVCF merging code after finally getting some real data to play with.
Still under construction, awaiting more test data from Valentin.
2014-01-08 08:34:35 -05:00
Eric Banks f172c349f6 Adding the functionality to enable users to input a file of VCFs for -V.
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs.  Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.

For VariantContext tags with a list, by default the tags for the -V argument are applied
unless overridden by the individual line.  In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path).  For example:
foo.vcf
VCF,name=bar bar.vcf

Note that a VCF list file name must end with '.list'.

Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
2014-01-08 00:45:00 -05:00
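
The list-file convention described in the commit above (one token per line for a bare path, two tokens for tags plus path, with the -V tags as the default) can be sketched roughly as follows. The helper names `is_vcf_list` and `parse_vcf_list` are hypothetical and this is not the RodBindingCollection implementation. Splitting on whitespace and skipping blank lines are both assumptions.

```python
def is_vcf_list(path):
    """A VCF list file name must end with '.list' (per the commit message)."""
    return path.endswith(".list")


def parse_vcf_list(lines, default_tags):
    """Return (tags, path) pairs for each line of a VCF list file.

    Assumptions: tokens are whitespace-separated (so paths contain no
    spaces) and blank lines are skipped.
    """
    entries = []
    for raw in lines:
        tokens = raw.split()
        if not tokens:
            continue
        if len(tokens) == 1:
            # Only a file path: the tags given with -V apply.
            entries.append((default_tags, tokens[0]))
        elif len(tokens) == 2:
            # New tags followed by the file path override the defaults.
            entries.append((tokens[0], tokens[1]))
        else:
            raise ValueError("expected one or two tokens, got: %r" % raw)
    return entries
```

With the two example lines from the commit message and illustrative default tags of `VCF,name=default`, this yields `foo.vcf` under the defaults and `bar.vcf` under `VCF,name=bar`.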
Eric Banks c133909d32 Fixed edge condition in the realigner where a realigned read can sometimes get partially aligned off the end of the contig.
Now we ignore such reads (which is much easier than trying to figure out when to soft-clip).
Added unit test.
2014-01-08 00:37:28 -05:00
Menachem Fromer e33d3dafc6 Add documentation for RBP, and also update the MD5 for the tests now that the output uses HP tags instead of '|', which is now reserved for trio-based phasing 2014-01-03 12:04:47 -05:00
Menachem Fromer d1275651ae Merge remote-tracking branch 'origin/master' into mf_new_RBP 2014-01-03 01:13:40 -05:00
Eric Banks f6a44afa3a Merge pull request #464 from broadinstitute/eb_rev_variant_jar_for_bcf_fixes
Rev'ing the Variant jar to incorporate some patches to the BCF encoder t...
2014-01-02 21:05:13 -08:00
Eric Banks 856c17868b Rev'ing the Variant jar to incorporate some patches to the BCF encoder that Menachem needs. 2014-01-02 23:33:17 -05:00
Ryan Poplin 5c32ad174a Merge pull request #452 from broadinstitute/rp_vqsr_aggregate_model
Allow for additional input data to be used in the VQSR for clustering bu...
2014-01-02 12:54:45 -08:00
Ryan Poplin 856c1f87c1 Allow for additional input data to be used in the VQSR for clustering but don't carry it forward into the output VCF file.
-- New -a argument in the VQSR for specifying additional data to be used in the clustering
-- New NA12878KB walker which creates ROC curves by partitioning the data along VQSLOD and calculating how many KB TP/FP's are called.
2014-01-02 14:46:04 -05:00
Ryan Poplin c82501ac35 Merge pull request #462 from broadinstitute/rp_SingleSampleHC_exome_scala
Adding SingleSampleHC_exome.scala for Valentin to use as a jumping off p...
2014-01-02 08:57:27 -08:00
Ryan Poplin 15372c4873 Adding SingleSampleHC_exome.scala for Valentin to use as a jumping off point. 2014-01-02 11:56:17 -05:00
amilev f81a38f596 Merge pull request #446 from broadinstitute/ami-RNAseq-tools
Write a new tool for splitting reads that have N elements in their cigar string.
2014-01-01 21:06:25 -08:00
MauricioCarneiro 1223345726 Merge pull request #459 from broadinstitute/eb_fix_bad_hmm_clipping
Fixed up edge condition for clipping long reads in the HMM.
2014-01-01 20:00:34 -08:00
Ami Levy-Moonshine 6da53aea09 Write a new tool for splitting reads that have N elements in their cigar string.
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elements is split into k+1 reads. The split is done by hard clipping the rest of the bases.

In order to do it, a few changes were introduced to some other clipping methods:
- make a significant change in ClippingOp.hardClip() that prevents the splitting of a read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns

create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils

create integration test for that class

small change in a comment in FullProcessingPipeline

last commit:

Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficient
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)

- fix ReadUtilsUnitTest.testReadWithNs according to the default behaviour of getReadCoordinateForReferenceCoordinate (in case of a reference index that falls into a deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs

- Allow the user to print the split positions (not working properly currently)
2014-01-01 22:21:36 -05:00
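
The "k N elements give k+1 reads" rule from the commit above can be illustrated on the cigar string alone. This is a simplified sketch, not the walker's logic: the real tool hard clips bases on actual reads, whereas the hypothetical `split_cigar_on_n` below only partitions the cigar elements at each N.

```python
import re

# One cigar element: a length followed by a single operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def split_cigar_on_n(cigar):
    """Partition a cigar string into the segments between N elements.

    A cigar with k N elements yields k+1 segments. Simplification: the
    N elements are dropped and no bases are clipped or re-positioned.
    """
    segments = [""]
    for length, op in CIGAR_ELEMENT.findall(cigar):
        if op == "N":
            segments.append("")  # start a new segment after each N
        else:
            segments[-1] += length + op
    return segments
```

For the cigar mentioned in the commit message, `split_cigar_on_n("1M2I1N1M3I")` gives the two segments `1M2I` and `1M3I`.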
Eric Banks bb4c4b1fcd Fixed up edge condition for clipping long reads in the HMM.
MD5s change because some reads were incorrectly getting clipped before.

[delivers #62584746]
2014-01-01 19:05:09 -05:00
Eric Banks ece346689c Merge pull request #460 from broadinstitute/mc_document_readclippingstats
Better documentation for ReadClippingStats walker
2014-01-01 16:00:40 -08:00
Eric Banks 154bab0849 Merge pull request #461 from broadinstitute/mc_make_comparebams_private
Move CompareBAMs to private
2014-01-01 15:58:05 -08:00
Mauricio Carneiro d52bd44867 Move CompareBAMs to private
This is a tool that we use internally to validate the ReduceReads development. I think it should be
private. There is no need to improve docs.

[delivers #54703398]
2014-01-01 14:33:23 -05:00
Mauricio Carneiro d1febb89c8 Better documentation for ReadClippingStats walker
* add overall walker GATKDocs
* add explanation for skip parameter and make it advanced
* reverse the logic on excluding unmapped reads for clarity
* fix read length calculation to no longer include indels

ps: I am not sure how useful this walker is (I didn't write it) but the skip logic is poor and
calculates the entire statistic for the reads it is eventually going to skip. This would be an easy
fix, but only worth our time if people actually use this.
2014-01-01 14:26:26 -05:00
Eric Banks 9355598129 Merge pull request #458 from broadinstitute/eb_dont_fail_when_using_incompatible_annotation
Don't fail in annotations if the wrong tools are calling them, just silently skip them.
2013-12-31 21:22:26 -08:00
Eric Banks 050ca8ae09 Merge pull request #457 from broadinstitute/eb_rev_variant_for_doc_updates
Updating variant jar.
2013-12-31 20:49:20 -08:00
Eric Banks 9665f75ad4 Don't fail in annotations if the wrong tools are calling them, just silently skip them.
This is important for cases when users want to use annotation groups (like all experimental annotations).
2013-12-31 23:45:21 -05:00
Eric Banks f82a7c3f4c Updating variant jar.
The update contains:
1. documentation changes for VariantContext and Allele (which used to discuss the now obsolete null allele)
2. better error messages for VCFs containing complex rearrangements with breakends
3. instead of failing badly on format field lists with '.'s, just ignore them
Also, there is a trivial change to use a more efficient method to remove a bunch of attributes from a VC.

Delivers PT#s 59675378, 59496612, and 60524016.
2013-12-31 22:48:29 -05:00
Eric Banks 5a1564d1f2 Merge pull request #456 from broadinstitute/eb_unify_hc_combination_steps
Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline.
2013-12-31 18:57:27 -08:00
Eric Banks 83e09b1f64 Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine

In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process.  So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants.  I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.

Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
2013-12-31 12:07:56 -05:00
Eric Banks 9394af1230 Merge pull request #454 from jsilter/master
Make na12878kb functionality more transparent to users
2013-12-19 08:47:24 -08:00
Menachem Fromer 48ef7a1a2f Merge remote-tracking branch 'origin/master' into mf_new_RBP 2013-12-19 10:42:20 -05:00
Eric Banks 26a7082018 Merge pull request #455 from broadinstitute/dr_add_min_max_argument_values
Add ability to specify min/max required/recommended values for numeric arguments in the @Argument annotation
2013-12-18 20:40:06 -08:00
David Roazen 4a79831adc Add ability to specify min/max required/recommended values for numeric arguments in the @Argument annotation
-You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes
 to @Argument annotations for command-line arguments

-"minValue" and "maxValue" specify hard limits that generate an exception if violated

-"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated

-Works only for numeric arguments (int, double, etc.) with @Argument annotations

-Only considers values actually specified by the user on the command line, not default values
 assigned in the code

As requested by Geraldine
2013-12-18 18:09:08 -05:00
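
The hard-limit and soft-limit behaviour described in the commit above can be sketched as a small check. This is not the @Argument annotation or GATK's argument parser: the function `check_numeric_argument` and its parameter names are hypothetical, and treating a value equal to a limit as acceptable is an assumption.

```python
def check_numeric_argument(name, value, min_value=None, max_value=None,
                           min_recommended=None, max_recommended=None,
                           user_specified=True):
    """Return a list of warnings; raise ValueError on a hard-limit violation.

    Assumption: limits are inclusive, so a value equal to a limit passes.
    """
    warnings = []
    if not user_specified:
        # Defaults assigned in the code are not checked, only user input.
        return warnings
    if min_value is not None and value < min_value:
        raise ValueError("%s=%s is below the minimum %s" % (name, value, min_value))
    if max_value is not None and value > max_value:
        raise ValueError("%s=%s is above the maximum %s" % (name, value, max_value))
    if min_recommended is not None and value < min_recommended:
        warnings.append("%s=%s is below the recommended minimum %s"
                        % (name, value, min_recommended))
    if max_recommended is not None and value > max_recommended:
        warnings.append("%s=%s is above the recommended maximum %s"
                        % (name, value, max_recommended))
    return warnings
```

Returning warnings instead of printing them leaves the caller free to route them to whatever logging it uses.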
Jacob Silterra 0c7ea2d823 Add label and specVersion fields to MongoDBManager.Locator
Add "BLANK" option for DBType

Want to get away from adding extensions to dbname
2013-12-18 17:21:53 -05:00