Problem: the codec was written to take in consensus pileups produced with pileup -c option (which consists of 10 or 13 fields per line depending on the variant type) but errored out on the basic pileup format (which only has 6 fields per line). This was inconsistent and confusing to users.
Solution: I added a switch in the parsing to recognize and handle both cases more appropriately, and updated related docs. While I was at it I also improved error messages in CheckPileup, which now emits User Error: Bad Input exceptions when reporting mismatches. Which may not be the best thing to do (ultimately they're not really errors, they're just reporting unwelcome results) but it beats emitting Runtime Exceptions.
Tested by CheckPileupIntegrationTest which tests both format cases.
The code comments very clearly state that INFO fields shouldn't be propagated into the output,
but someone must have accidentally changed it afterwards. This is just a simple one-line fix
to make sure the code adhered to the comments.
Delivers #63333488.
-Added docs for ERC mode in HC
-Move RecalibrationPerformance walker since to private since it is experimental and unsupported
-Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them
-Improved error msg for conflict between per-interval aggregation and -nt
-Minor clean up in exception docs
-Added Toy Walkers category for devs and dev supercat (to build out docs for developers)
-Added more detailed info to GenotypeConcordance doc based on Chris forum post
-Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples)
-Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it)
-Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)
It is true that indels of length > 1 have higher QUALS than those of length = 1. But for the HC those
QUALS are not that much higher, and it doesn't continue scaling up as the indels get larger. So we no
longer normalize by indel length (which massively over-penalizes larger events and effectively drops their
QD to 0).
For the UG the previous normalization also wasn't perfect. Now we divide the indel length by a factor
of 3 to make sure that QD is consistent over the range of indel lengths.
Integration tests change because QD is different for indels.
Also, got permission from Valentin to archive a failing test that no longer applies.
Thanks to Kurt on the GATK forum for pointing this all out.
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.
For VariantContext tags with a list, by default the tags for the -V argument are applied
unless overridden by the individual line. In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path). For example:
foo.vcf
VCF,name=bar bar.vcf
Note that a VCF list file name must end with '.list'.
Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
-- New -a argument in the VQSR for specifying additional data to be used in the clustering
-- New NA12878KB walker which creates ROC curves by partitioning the data along VQSLOD and calculating how many KB TP/FP's are called.
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elemments is plit to k+1 reads. The split is done by hard clipping the bases rest of the bases.
In order to do it, few changes were introduced to some other clipping methods:
- make a segnificant change in ClippingOp.hardClip() that prevent the spliting of read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns
create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils
create integration test for that class
small change in a comment in FullProcessingPipeline
last commit:
Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficiant
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)
- fix ReadUtilsUnitTest.testReadWithNs acoording to the defult behaviour of getReadCoordinateForReferenceCoordinate (in case of refernce index that fall into deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs
- Allow the user to print the split positions (not working proparly currently)
This is a tool that we use internally validate the ReduceReads development. I think it should be
private. There is no need to improve docs.
[delivers #54703398]
* add overall walker GATKDocs
* add explanation for skip parameter and make it advanced
* reverse the logic on exculding unmapped reads for clarity
* fix read length calculation to no longer include indels
ps: I am not sure how useful this walker is (I didn't write it) but the skip logic is poor and
calculates the entire statistic for the reads it is eventually going to skip. This would be an easy
fix, but only worth our time if people actually use this.
The update contains:
1. documentation changes for VariantContext and Allele (which used to discuss the now obsolete null allele)
2. better error messages for VCFs containing complex rearrangements with breakends
3. instead of failing badly on format field lists with '.'s, just ignore them
Also, there is a trivial change to use a more efficient method to remove a bunch of attributes from a VC.
Delivers PT#s 59675378, 59496612, and 60524016.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine
In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants. I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.
Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
-You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes
to @Argument annotations for command-line arguments
-"minValue" and "maxValue" specify hard limits that generate an exception if violated
-"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated
-Works only for numeric arguments (int, double, etc.) with @Argument annotations
-Only considers values actually specified by the user on the command line, not default values
assigned in the code
As requested by Geraldine