Build a ReferenceContext in ActiveRegionWalkers to pass in to annotation engine so we can call the TandemRepeatAnnotator from M2
Make TandemRepeatAnnotator default annotation for M2.
Setup (but don't use yet) HC-style contamination downsampling.
New HC integration test with TandemRepeatAnnotator
- ASEReadCounter (public tool) replce Tuuli's script to produce the input to Manny's tool.
It count the number of reads that support the ref allele and the alt allele, filtereing low qual reads and bases and keep only properPaired reads
- ASECaller (private tool) take both RNA and DNA, and produce ontingencyTables ** still under development **
minor changes in other tools:
- update RNA HC variant calling scala script
- expose FS method pValueForContingencyTable to be able to call it from ASEcaller
In ASEReadCounter:
- allow different option to deal with overlaping read from the same fragment
- add option to ignore or include indels in the pileups
- add option to disabled DuplicateRead
add ASEReadCounterIntegrationTest.java and files for the test
fixed NPE when normal contains no reads
first integration test (micro) and unit tests, also rename of MuTectHC -> M2
adding in standard GATK license terms
incorporated HOSTILE mode to PCR Error Correction
removed tumor and normal name parameters and cleaned up internal name handling
changes to allow for calling without a matched normal (technically, not true 'tumor-only' calling). Used for panel-of-normals creation
additional regression tests, based on DREAM data. Removed accidental addition of TandemRepeatAnnotator to default annotations
updated MD5 based on run from GSA4 to fix bamboo issue
reverted unneeded visibility changes
Now, instead of stripping out the GQs for mono sites, we transfer them to the RGQ.
This is extremely useful for people who want to know how confident the hom ref genotype calls are.
Perhaps this is just what CRSP needs for pertinent negatives.
Note that I also changed the tool to no longer use the GenotypeSummaries annotation by default since
it was adding some seemingly unnecessary annotations (like mean GQ now that we keep the GQ around and
number of no-calls). Let me know if this was a mistake (although Laura gave me a thumbs up).
Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across
genomic positions that are multiples of N. This is especially important in the
case of scatter-gather where you don't want your scatter intervals to start in the
middle of blocks (because of a limitation in the way -L works in the GATK for VCF
records with the END tag).
For example, running with --breakBandsAtMultiplesOf 5 on this record:
1 69491 . G <NON_REF> . . END=69523 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
Will produce the following records:
1 69491 . G <NON_REF> . . END=69494 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
1 69495 . C <NON_REF> . . END=69499 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
1 69500 . T <NON_REF> . . END=69504 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
etc.
Added docs and a new test.
GenotypeGVCFs now has the ability to unique-ify samples so I can genotype together two different datasets containing the same sample
Modify InbreedingCoeff so that it works when genotyping uniquified samples
* The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads
* This patch fixes a runtime error when FindCoveredIntervals was used with Queue
* PT 81777160
* TextCigarCodec.decode() is now static, and the getSingleton() method is gone
* MergingSamRecordIterator now wants a Collection<SamReader> rather than Collection<SAMFileReader> in the constructor
* SeekableBufferedStream now correctly reads the requested number of bytes, removed workaround in GATKBAMIndex
* Removed unused annotations (CCC and HWP)
* Renamed one of the two GC annotations to "IGC" (for Interval GC)
* Revved picard & htsjdk (GATK constants are now removed from htsjdk)
* PT 82046038
-- Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
-- HaplotypeCallerGVCFIntegrationTest PL/annotation changes due to using more reads in those tests
-- Removed a CountReadsInActiveRegionsIntegrationTest test for excessive coverage because the read coverage no longer goes over the limits in ART
Story:
=====
- https://www.pivotaltracker.com/story/show/83803796
Changes:
=======
- From a fix maximum ploidy indel RCM likelihood cache to a
dynamically resizable one.
- Used the occassion to removed an unused and deprecated method from ReferenceConfidenceModel
Testing:
=======
- Added integration test to check on ploidies larger than the previous limit of 20.
Add multi-allele test for info field annotations
Fix to process all types of INFO annotations
roll back to previous version, removes INFO and FORMAT
Correct @return for VariantAnnotatorEngine.getNonReferenceAlleles()
Enhance comments and clean up multi-allelic logic, handle header info number = R
only parse counts of A & R
Add INFO for AC
update MD5
Performance enhancement, only parse multiallelic with a count A or R
Make argument final in getNonReferenceAlleles()
Code cleanup, add exceptions for bad expression/allele size mismatch and missing header info for an expression
Change exception to warning for expression value/number of alleles check
remove adevertised exceptions
-- Ignore SNP matches that lie outside the clipped read window
-- This fixes an issue where GATK would skip the entire read if a SNP is entirely
contained within a sequencing adapter.
Story:
=====
- https://www.pivotaltracker.com/story/show/83259038
Changes:
=======
- Done minimal changes to make the fix after an arduous attempt to understand
CombineGVCFs code.
Test:
====
- Added a integration test to explicitly test for the bug.
- Updated a md5 changes as the bug was actually affecting one of the existing
integration tests.
* PT 84242218
* Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field
Story:
-----
- https://www.pivotaltracker.com/story/show/83800586
Changes:
-------
- In GVCFWriter GQ is now recalculated out of the fianl PL array for the block.
Testing:
-------
- Updated affected integration test md5s
Add more logging to annotators, change loggers from info to warn
Add comments to testStrandBiasBySample()
Clarify comments in testStrandBiasBySample
remove logic for not prcossing an indel if strand bias (SB) was not computed
remove per variant warnings in annotate()
Log warnings if using the wrong annotator or missing a pedgree file
Log test failures once in annotate(), because HaplotypeCaller does not call initialize(). Avoid using exceptions
Fix so only log once in annotate(), Hardey-Weinberg does not require pedigree files, fix test MD5s so pass
Check if founderIds == null
Update MD5s from HaplotypeCaller integrations tests and clean up code
Change logic so SnpEff does not throw excpetions, change engine to utils in imports
Update test MD5s, return immediately if cannot annotate in SnpEff.initialization()
Post peer review, add more logging warnings
Update MD5 for testHaplotypeCallerMultiSampleComplex1, return null if PossibleDeNovo.annotate() is not called by VariantAnnotator
Story:
-----
https://www.pivotaltracker.com/story/show/80684230
Changes:
-------
- Corrected the bug: AlignmentUtils#createReadAlignedToRef was
not realigning against the reference but the best haplotype for
the read.
Test:
----
- Added integration test in HaplotypeCallerIntegrationTest to check
that the bug has been fixed.
- Fixed md5s modified by this change; these are cause due to small
changes in the state of the random-number generator and read vs
variant site overlapping.
CombineGVCFs now outputs ref conf for the duration of deletions so that SNPs occuring in other samples aligned with those deletions will be genotyped correctly
Reading the multiple GATKText files as a single stream, especially with new top level target executable jar files pointing to a lib folder.
Don't dirty the build with a new GATKText.properties if input files are unmodified.
Stop warning on undocumented abstract classes.
Fixed ClassNotFoundException/NoClassDefFoundError by fixing ResourceBundleExtractorDoclet artifact.
Excluding Exceptions from documentation.
Removed custom log4j dependency from ResourceBundleExtractorDoclet.
Stop generating the dependency reduced pom during shade.
Stop regenerating gsalib when the files are already up to date.
Disabled mvn site generation from external-example.
Moved top level target symlinks to package jar files to under target/package.
Executable jar files are placed under target/executable with the new target[/lib] directories.
Under top level target, symlinks to *either* the package *or* the executable jars replace what was a symlink to the package jar path.
Allow disabling of the shade package.
ant-bridge.sh by default only builds executable jars, and doesn't package by default, as did the old ant build.xml.
Added a new package_path.sh utility script for other scripts to use instead of anything in the target folder.
remove final keyword before refMap and altMap, constructHaplotype() changes their values
return ArtificialHaplotype from constructHaplotype instaed of passing as an argument
Add logic so arraycopy does not throw an IndexOutOfBoundsException, add test for a long insert
* This argument is intended to be used in conjunction with -bamout, and disable early-exit optimizations to allow reference regions to be contained in the output bam
* Also forcibly includes the reference haplotype in the set of haplotypes given to the BAMWriter
* Made -dontTrimActiveRegions visible, as it is likely also desirable in this use case
* Addresses PT 77731660
remove TODO comment after activeProbThreshold
recover static ACTIVE_PROB_THRESHOLD for unit tests
Add min/max values for active_probability_threshold parameter
Move activeProbThreshold parameter to GATKArguemtnCollection
define ACTIVE_PROB_THRESHOLD in unit tests
add construction of argCollection in in ctor
Move arguments from GATKArgumentCollection to ActiveRegionWalker
Throw exception if threshold < 0 or > 1 in ActivityProfile ctor
max propogation distance parameter to ActiveRegionWalker for AcrtivityProfile
Use polymorphic getMaxProbPropagationDistance() so BandPassActivityProfile computes the crrect region size cutoff
Get the maxProbPropagationDistance from the super class's method, instead of directly, this is safer
Removed extraneous command line imports and make maxProbPropagationDistance a hidden argument
remove limit check for activeProbThreshold, not necessary because the check is made when imput as a command line arg
Remove extra 'region' in the doxygen param description for maxProbPropagationDistance
Rename parameters using camel case and add to integration test
Correct documentation for maxReadsInRegionPerSample and minReadsPerAlignmentStart
Change the argument--minReadsPerAlignmentStart in the integration test from 50 to 5
'each genomic location' only pertains to minReadsPerAlignmentStart, not maxReadsInRegionPerSample