-Only works with single-sample VCFs
-As with BAMs, the user must provide a file mapping the absolute path of
each VCF whose samples are to be renamed to the new sample name for that
VCF. The argument is the same as for BAMs: --sample_rename_mapping_file,
and the mapping file may contain a mix of BAM and VCF files should the
user wish.
-It's an error to attempt to remap the sample names of a multi-sample
or sites-only VCF
-Implemented at the codec level at the instant the vcf header is first
read in to minimize the chances of downstream code examining vcf
headers/records before renaming occurs.
-Integration tests are in Sting, unit tests are in Picard
-Rev Picard et al. to 1.111.1902
Following reviewers' comments, the command-line interface has been simplified.
All extra-strict validations are performed by default (as before), and the
user indicates which ones to skip with --validationTypeToExclude.
Previously the user could indicate the only ones to apply with --validationType, but that option has been scrapped.
Stories:
- https://www.pivotaltracker.com/story/show/68725164
Changes:
- Removed validateType argument.
- Improved documentation.
- Added warning log messages for suspicious argument combinations.
Tests:
- ValidateVariantsIntegrationTest#*
-- This results in much more consistent distribution of LOD scores for SNPs and Indels.
-- Removing genotype summary stats since they are now produced by default.
-- Added functionality to specify certain subsets of the training data to be used in Tranche file generation, -good:tranche=true set.vcf
More concretely, Picard's strict VCF validation does not like alternative alleles that do not participate in any genotype call across samples.
This is an issue for GVCFs in the single-sample pipeline, where this is certainly expected with <NON_REF> and other relatively unlikely alleles.
To solve this issue we allow the user to exclude some of the strict validations using a new argument, --validationTypeToExclude. In order to avoid the validation
issue with GVCFs, the user needs to add the following to the command line: '--validationTypeToExclude ALLELES'
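A minimal invocation sketch (the jar name, reference, and input file names are illustrative; only -T ValidateVariants and the --validationTypeToExclude ALLELES flag come from this change):

```shell
# Validate a single-sample GVCF, skipping the ALLELES check that
# <NON_REF> alleles would otherwise trip (paths are placeholders).
java -jar GenomeAnalysisTK.jar -T ValidateVariants \
    -R reference.fasta \
    -V sample.g.vcf \
    --validationTypeToExclude ALLELES
```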
Story:
https://www.pivotaltracker.com/story/show/68725164
Changes:
- Added validateTypeToExclude argument to ValidateVariants walker.
- Implemented the selective exclusion of validation types.
- Added new info and improved existing documentation of the ValidateVariants walker.
Tests:
- ValidateVariantsIntegrationTest#testUnusedAlleleError
- ValidateVariantsIntegrationTest#testUnusedAlleleFix
In some cases, the program records were being removed from the BAM headers by the GATK engine
before we applied the check for reduced reads (so we did not fail appropriately). Pushed up the
check to happen before the PG tags are modified and added a unit test to ensure it stays that way.
It turns out that some UG tests still used reduced BAMs, so I switched to using different ones.
Based on reviewer feedback, made it more generic so that it's easy to add new unsupported tools.
Previously it required you to create a single-sample VCF and then pass that in to the tool, but
Geraldine convinced me that this was a pain for users (because they usually have multi-sample VCFs).
Instead, you can now pass in a multi-sample VCF and specify which sample's genotypes should be used
for the IUPAC encoding. The argument therefore changed from '--useIUPAC' to '--use_IUPAC_sample NA12878'.
Stories:
https://www.pivotaltracker.com/story/show/66263868
Bug:
The problem was due to the way we were calculating the fixed penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the
deleted/inserted portion of the read or haplotype as the penalty of that deletion/insertion without going through the full pair-HMM process. For large events this resulted in a 0
in linear-scale computations, which is transformed into an infinity in log scale.
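A small Python sketch of the underflow (the per-base probability and event length are illustrative, not the actual pair-HMM parameters):

```python
import math

def deletion_penalty_linear(per_base_prob, length):
    # Naive linear-scale computation: for large events the product
    # falls below the smallest representable double and underflows to 0.0.
    return per_base_prob ** length

def deletion_penalty_log10(per_base_prob, length):
    # Computing directly in log10 space sidesteps the underflow entirely.
    return length * math.log10(per_base_prob)

lin = deletion_penalty_linear(1e-4, 400)  # underflows to 0.0
log = deletion_penalty_log10(1e-4, 400)   # a finite -1600.0
```

Taking log10 of the underflowed linear value would yield negative infinity, which is exactly the failure mode the fix avoids.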
Changes:
- Changed to use log10 scale to calculate those penalties.
- Minor addition to .gitignore to hide ./public/external-example/target, which is generated by the build process.
- SamPairUtils migrated in Picard r1737
- Revert IndelRealigner changes made in commit 4f4b85
-- Those changes were based on Picard revision 1722 to net/sf/picard/sam/SamPairUtil.java
-- Picard revision 1723 reverts these changes, so we also revert to match
This is currently my leading suspect for the cause of the
intermittent NoSuchElementException errors on master, since
the maven surefire plugin seems unable to handle errors in
TestNG DataProviders without blowing up.
This test is not, as I had initially thought, the cause of the
maven errors. Our master branch is failing intermittently
regardless of whether this test is enabled or disabled.
This reverts commit 45fc9ff515eec8d676b64a04fb34fb357492ff84.
This test passes when run individually, as part of the commit tests, or as
part of the package tests. However, when running the unit tests in isolation
it causes maven/surefire to throw a NoSuchElementException.
This is clearly a maven/surefire bug or configuration issue. I will re-enable
this test on a branch as Khalid and I try to work through it.
-Hide AWS downloader credentials in a private properties file
-Remove references to private ActiveRegion walker
Allows phone home functionality to be tested at release time
when we are running tests on the release jar.
1. Enable on-the-fly indexing for vcf.gz.
2. Handle on-the-fly indexing where file to be indexed is not a regular file, thus index should not be created.
3. Add method setProgressLogger to all SAMFileWriter implementations.
4. Revved picard to 1.109.1722
5. IndelRealigner md5s change because the MC tag is added to records now.
Fixed up and signed off by ebanks.
Package tests now hard coding just the gatk-framework tests jar, to include ONLY BaseTest, until the exclusions may be debugged.
Removing cofoja's annotation service from the package jars, to allow javac -cp <package>.jar.
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.
The edge *multiplicity* is equal to the number of reads that expand through that edge, i.e. that have a kmer that uniquely maps to some vertex upstream of the edge and whose following base calls extend across that edge to vertices downstream of it.
Although it is obvious that higher multiplicities correlate with haplotype probability, this criterion falls short in some regards, the most relevant being:
since it is evaluated on the condensed seq-graph (as opposed to the uncompressed read-threading graph), it is biased toward haplotypes that have more short-sequence vertices
( -> ATGC -> CA -> has a worse score than -> A -> T -> G -> C -> C -> A ->). This is partly a result of how we modify the edge multiplicities when we merge vertices from a linear chain.
This pull request addresses the problem by changing to a new scoring schema based on likelihood estimates:
Each haplotype's likelihood can be calculated as the product of the likelihoods of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divided by the sum of the multiplicities of the edges that share the same source vertex.
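A sketch of that computation in Python (function and variable names are mine, not GATK's):

```python
import math
from collections import defaultdict

def haplotype_log10_likelihood(path, multiplicity):
    # multiplicity maps each directed edge (source, target) to its read support.
    # The probability of "taking" an edge is its multiplicity divided by the
    # total multiplicity of all edges leaving the same source vertex; the
    # haplotype likelihood is the product of those probabilities (summed in log10).
    outgoing = defaultdict(int)
    for (src, _tgt), m in multiplicity.items():
        outgoing[src] += m
    return sum(math.log10(multiplicity[edge] / outgoing[edge[0]]) for edge in path)

# Two haplotypes diverge at vertex "A": 8 reads support A->T, 2 support A->G.
edges = {("A", "T"): 8, ("A", "G"): 2, ("T", "C"): 10, ("G", "C"): 10}
best = haplotype_log10_likelihood([("A", "T"), ("T", "C")], edges)   # log10(0.8)
worst = haplotype_log10_likelihood([("A", "G"), ("G", "C")], edges)  # log10(0.2)
```

Note that an unambiguous chain edge (the only edge leaving its source) contributes probability 1, i.e. log10 0, so condensing or expanding a linear chain of vertices no longer changes the score, which removes the short-vertex bias described above.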
This pull request addresses the following stories:
- https://www.pivotaltracker.com/story/show/66691418
- https://www.pivotaltracker.com/story/show/64319760
Change Summary:
1. Changed to the new scoring schema.
2. Added graph DOT-printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformations have been modified so as to generate no 0-multiplicity edges. (Nevertheless, the schema above should work with 0-multiplicity edges, assuming that they are in fact 0.5.)
The maven shade plugin was eliminating a necessary class (IgnoreCookiesSpec)
when packaging the GATK/Queue. Work around this by telling maven to
always package all of commons-httpclient.
Enable it with the new --useIUPAC argument.
Added both unit and integration tests for the new functionality - and fixed up the
existing tests once I was in there.
-These tests are really integration tests for Queue rather than generalized
pipeline tests, so it makes sense to call them QueueTests.
-Rename test classes and maven build targets, and update shell scripts
to reflect new naming.