Intermediate commit for tests
Adding tests
Fixing tests after rebase
Fixing one MD5
Fixing documentation
Removing annotation from standard group
Adding documentation
Renamed M2 to MuTect2
Renamed ContaminationWalker to ContEst
Refactored related tests and usages (including in Queue scripts)
Moved M2 and ContEst + accompanying classes from private to protected
Made QSS a StandardSomaticAnnotation (new annotation group/interface) to prevent it from being sucked in with the rest of the StandardAnnotation group
ReadAdaptorTrimmer (unsound and untested)
BaseCoverageDistribution (redundant with DiagnoseTargets)
CoveredByNSamplesSites (redundant with DiagnoseTargets)
FindCoveredIntervals (redundant with DiagnoseTargets)
VariantValidationAssessor (has a scary TODO -- REWRITE THIS TO WORK WITH VARIANT CONTEXT comment and zero tests)
LiftOverVariants, FilterLiftedVariants and liftOverVCF.pl (in #1106) (use Picard liftover tool)
sortByRef.pl (use Picard SortVCF)
ListAnnotations (useless)
Also deleted the java archive from the private repository (old junk we never use)
Grouped default output annotations to keep them from getting dropped when -A is specified; addresses #918
Also refactored code shared by ExcessHet and InbreedingCoeff
Integration Tests
Updated test
Changed method
Minor changes
Changed whitespace
Fixed uncalled counts and 0 in R
Fixed ReadBackedPileUp
Removed imports and changed MD5
Fixed failing test
Adding vqslod color
Updating script to create KB
Fixing integration test now that the KB is bigger
Adressing comments
The ParallelShell job runner will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag `maximumNumberOfJobsToRunConcurrently`.
Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
Updated other IntelliJ IDEA warnings in GATKBAMIndex.
Updated example .cram files to match versions generated by current GATK/HTSJDK.
Bumped HTSJDK and Picard to 1.139 releases.
Added support for using `-SNAPSHOT` of HTSJDK in the future.
This change doesn't affect the performance of the Indel Realigner at all (as per tests).
This is just a request from the Picard side (where further testing is happening).
Make MQ threshold a parameter (compare to M1 by setting to zero)
Add logic for multiple alternate alleles in tumor
Exclude MQ0 normal reads from normal LOD calculation
Fix path errors in Dream_Evaluations.md
Move M2 eval scripts out of walkers package so they run
Previous version of OverclippedReadFilter would only filter a read if both ends of a read had a soft-clipped block.
This adds a boolean option to relax that requirement, and only require 1 soft-clipped block, while also filtering on read length - softclipped length
CRAM now requires .bai index, just like BAM.
Test updates:
- Updated existing MD5s, as TLEN has changed.
- Tests multiple contigs.
- Tests several intervals per contig.
- Tests when `.cram.bai` is missing, even when `.cram.crai` is present.
Updated gatk docs for CRAM support, including:
- Arguments that work for both BAM and CRAM listed as such.
- Arguments that don't work for CRAM either explicitly say "BAM" or "doesn't work for CRAM".
- Instructions on how to recreate a `.cram.bai` using cramtools.
Cleaned up IntelliJ IDEA warnings regarding `Arrays.asList()` -> `Collections.singletonList()`.
Changed a division by -10.0 to a multiplication by -.1 in QualUtils (typically multiplication is faster than division).
Addresses performance issue #1081.
When using CatVariants, VCF files were being sorted solely on the base
pair position of the first record, ignoring the chromosome. This can
become problematic when merging files from different chromosomes,
espeically if you have multiple VCFs per chromosome.
As an example, assume the following 3 lines are all in separate files:
1 10
1 100
2 20
The merged VCF from CatVariants (without -assumeSorted) would read:
1 10
2 20
1 100
This has the potential to break tools that expect chromosomes to be
contiguous within a VCF file.
This commit changes the comparator from one of Pair<Integer, File> to
one of Pair<VariantContext, File>. We construct a
VariantContextComparator from the provided reference, which will sort
the first record by chromosome and position properly. Additionally, if
-assumeSorted is given, we simply use a null VariantContext as the first
record, which will all be equal (as all will be null)
Add oxoG read count annotation and add as default annotation
Add ##SAMPLE VCF header line in accordance with TCGA VCF spec, specifying "File" line in sample header with BAM file name and "SampleName" with BAM sample name (Don't print sample file path if --no_cmdline_in_header is specified to help with test consistency)
Turn on active region assembly-based physical phasing for M2
Clean up M2-related annotations so UG doesn't crash if M2 annotations are called
added "str_contraction" artifact filter (improves specificity, especially in exomes)
refactored out VCF constants and added descriptions
added "artifact detection mode" for PON creation
added "str_contraction" artifact filter (improves specificity, especially in exomes)
added new dream evaulation markdown
added results for SMC 4
fixed up documentation, moved location to /dsde/working/mutect/dream_smc, and checked in scala script
added "artifact detection mode" for PON creation
added "str_contraction" artifact filter (improves specificity, especially in exomes)
fixed bug which would overwrite germline_risk filter errors
updated "how to" documents and records
fixed license text
thinned down FP regression test from 700 sites to 100. we have better ways (DREAM, NN) to check accuracy of the method and 100 is good enough to catch regressions
why oh why do the MD5-based unit tests produce different results on different machine architectures? I hate that :/
Thanks to GG, LDG and DR -- test should now produce the same results regardless of machine architecture
disabled downsampling... hopefully in the final attempt to make this work cross architecture!
enforced LOGLESS_CACHING... hopefully in the final final attempt to make this work cross architecture!
refactored out VCF constants and added descriptions
-We now pull htsjdk and picard from maven central.
-Updated the GATK codebase as necessary to adapt to changes in the Feature
interface.
-Since VCFHeader now requires that all header lines have unique keys, uniquified
the keys of GVCFBlock header lines by including the min/max GQ in the key.
Updated MD5s accordingly.
-Other MD5s changed as a result of an htsjdk fix to eliminate "-0" in VCF output.
Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B,
sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion).
Now, sample B is genotyped as having a symbolic DEL allele.
Minor cleanup added. Note that I also removed Laura's previous fix for this problem.
Existing integration tests change because I've added a new header line to the VCF being output.
I also added several tests for the new functionality showing:
1. genotyping from separate and already combined gvcfs give the same output
2. genotyping over multiple spanning deletions works
3. combining works too
Existing unit tests also cover this case.
Exclude MQ0BySample
Move SD and TRA to new StandardUGAnnotation interface
There is now annotation interface (StandardUGAnnotation) holding annots that are standard in UG but should't be used as they are now with HC. This allows us to not have to exclude these annotations explicitly in HC, but still be able to use them for development purposes.
Fairly minor if plentiful fixes to various gatkdocs. Merging this without formal review since all tests pass, the gatkdocs build, and no one really wants to review corrections to grammar, typos and layout for 120+ documents. Review will be done by users in production ;-)
When -qsub-broad is specified instead of -qsub, use the "h_vmem" parameter
instead of "h_rss" to specify memory limit requests.
Also cause the GridEngine native arguments to be output by default to the logger,
instead of only when in debug mode.
The GATK command line header keys were being repeated in the VCF and
subsequently lost to a single key value by HTSJDK. This resolves
the issue by appending the name of the walker after the text
"GATKCommandLine" and a number after that if the same walker was
used more than once in the form: GATKCommandLine.(walker name) for
the first occurrence of the walker, and GATKCommandLine.(walker name).#
where # is the number of the occurrence of the walker (e.g.
GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
Integration test added to EngineFeaturesIntegrationTest to verify
two runs of same walker follow expected form.
Resolves#909
See also: HTSJDK #43
Build a ReferenceContext in ActiveRegionWalkers to pass in to annotation engine so we can call the TandemRepeatAnnotator from M2
Make TandemRepeatAnnotator default annotation for M2.
Setup (but don't use yet) HC-style contamination downsampling.
New HC integration test with TandemRepeatAnnotator
- ASEReadCounter (public tool) replce Tuuli's script to produce the input to Manny's tool.
It count the number of reads that support the ref allele and the alt allele, filtereing low qual reads and bases and keep only properPaired reads
- ASECaller (private tool) take both RNA and DNA, and produce ontingencyTables ** still under development **
minor changes in other tools:
- update RNA HC variant calling scala script
- expose FS method pValueForContingencyTable to be able to call it from ASEcaller
In ASEReadCounter:
- allow different option to deal with overlaping read from the same fragment
- add option to ignore or include indels in the pileups
- add option to disabled DuplicateRead
add ASEReadCounterIntegrationTest.java and files for the test
Now, instead of stripping out the GQs for mono sites, we transfer them to the RGQ.
This is extremely useful for people who want to know how confident the hom ref genotype calls are.
Perhaps this is just what CRSP needs for pertinent negatives.
Note that I also changed the tool to no longer use the GenotypeSummaries annotation by default since
it was adding some seemingly unnecessary annotations (like mean GQ now that we keep the GQ around and
number of no-calls). Let me know if this was a mistake (although Laura gave me a thumbs up).