In some cases, the program records were being removed from the BAM headers by the GATK engine
before we applied the check for reduced reads (so we did not fail appropriately). Pushed up the
check to happen before the PG tags are modified and added a unit test to ensure it stays that way.
It turns out that some UG tests still used reduced bams so I switched to use different ones.
Based on reviewer feedback, made it more generic so that it's easy to add new unsupported tools.
Previously it required you to create a single sample VCF and then to pass that in to the tool, but
Geraldine convinced me that this was a pain for users (because they usually have multi-sample VCFs).
Instead now you can pass in a multi-sample VCF and specify which sample's genotypes should be used
for the IUPAC encoding. Therefore the argument changed from '--useIUPAC' to '--use_IUPAC_sample NA12878'.
Stories:
https://www.pivotaltracker.com/story/show/66263868
Bug:
The problem was due to the way we were calculating the fix penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the portion
or read or haplotype deletion as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0 in
in linear scale computations that ins transformed into an infinity in log scale.
Changes:
- Change to use log10 scale for calculate those penalties.
- Minor addition of .gitignore to hide ./public/external-example/target which is generated by the building process.
- SamPairUtils migrated in Picard r1737
- Revert IndelRealigner changes made in commit 4f4b85
-- Those changes were based on Picard revision 1722 to net/sf/picard/sam/SamPairUtil.java
-- Picard revision 1723 reverts these changes, so we also revert to match
This is currently my leading suspect for the cause of the
intermittent NoSuchElementException errors on master, since
the maven surefire plugin seems unable to handle errors in
TestNG DataProviders without blowing up.
This test is not, as I had initially thought, the cause of the
maven errors. Our master branch is failing intermittently
regardless of whether this test is enabled or disabled.
This reverts commit 45fc9ff515eec8d676b64a04fb34fb357492ff84.
This test passes when run individually, as part of the commit tests, or as
part of the package tests. However, when running the unit tests in isolation
it causes maven/surefire to throw a NoSuchElementException.
This is clearly a maven/surefire bug or configuration issue. I will re-enable
this test on a branch as Khalid and I try to work through it.
-Hide AWS downloader credentials in a private properties file
-Remove references to private ActiveRegion walker
Allows phone home functionality to be tested at release time
when we are running tests on the release jar.
1. Enable on-the-fly indexing for vcf.gz.
2. Handle on-the-fly indexing where file to be indexed is not a regular file, thus index should not be created.
3. Add method setProgressLogger to all SAMFileWriter implementations.
4. Revved picard to 1.109.1722
5. IndelRealigner md5s change because the MC tag is added to records now.
Fixed up and signed off by ebanks.
Package tests now hard coding just the gatk-framework tests jar, to include ONLY BaseTest, until the exclusions may be debugged.
Removing cofoja's annotation service from the package jars, to allow javac -cp <package>.jar.
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.
The edge *mulitplicity* is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely map to some vertex up-stream from the edge and the following base calls extend across that edge to vertices downstream from it.
Despite that it is obvious that higher multiplicties correlated with haplotype probability this criterion fails short in some regards of which the most relevant is:
As it is evaluated in condensed seq-graph (as supposed to uncompressed read-threading-graphs) it is bias to haplotypes that have more short-sequence vetices
( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly result of how we modify the edge multiplicities when we merge vertices from a linear chain.
This pull-request addresses the problem by changing to a new scoring schema based in likelihood estimates:
Each haplotype's likelihood can be calculated as the multiplication of the likelihood of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divide by the sum of multiplicity of edges that share the same source vertex.
This pull-request addresses the following stories:
https://www.pivotaltracker.com/story/show/66691418https://www.pivotaltracker.com/story/show/64319760
Change Summary:
1. Change to the new scoring schema.
2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformation have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0 edges assuming that they are in fact 0.5)
The maven shade plugin was eliminating a necessary class (IgnoreCookiesSpec)
when packaging the GATK/Queue. Work around this by telling maven to
always package all of commons-httpclient.
Enable it with the new --useIUPAC argument.
Added both unit and integration tests for the new functionality - and fixed up the
exising tests once I was in there.
-These tests are really integration tests for Queue rather than generalized
pipeline tests, so it makes sense to call them QueueTests.
-Rename test classes and maven build targets, and update shell scripts
to reflect new naming.
This module was causing failures during the release
packaging tests. After discussing with Khalid, we've
decided to disable it for now until a fix can be
developed.
C++ code has PAPI calls for reading hardware counters
Followed Khalid's suggestion for packing libVectorLoglessCaching into
the jar file with Maven
Native library part of git repo
1. Renamed directory structure from public/c++/VectorPairHMM to
public/VectorPairHMM/src/main/c++ as per Khalid's suggestion
2. Use java.home in public/VectorPairHMM/pom.xml to pass environment
variable JRE_HOME to the make process. This is needed because the
Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among
others). Assuming that the Maven build process uses a JDK (and not just
a JRE), the variable java.home points to the JRE inside maven.
3. Dropped all pretense at cross-platform compatibility. Removed Mac
profile from pom.xml for VectorPairHMM
Moved JNI_README
1. Added the catch UnsatisfiedLinkError exception in
PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING
in case the native library could not be loaded. Made
VECTOR_LOGLESS_CACHING as the default implementation.
2. Updated the README with Mauricio's comments
3. baseline.cc is used within the library - if the machine supports
neither AVX nor SSE4.1, the native library falls back to un-vectorized
C++ in baseline.cc.
4. pairhmm-1-base.cc: This is not part of the library, but is being
heavily used for debugging/profiling. Can I request that we keep it
there for now? In the next release, we can delete it from the
repository.
5. I agree with Mauricio about the ifdefs. I am sure you already know,
but just to reassure you the debug code is not compiled into the library
(because of the ifdefs) and will not affect performance.
1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java
2. Committing the right set of files after rebase
Added public license text to all C++ files
Added license to Makefile
Add package info to Sandbox.java
Conflicts:
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/DebugJNILoglessPairHMM.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/JNILoglessPairHMM.java
protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/VectorLoglessPairHMM.java
public/VectorPairHMM/src/main/c++/.gitignore
public/VectorPairHMM/src/main/c++/LoadTimeInitializer.cc
public/VectorPairHMM/src/main/c++/LoadTimeInitializer.h
public/VectorPairHMM/src/main/c++/Makefile
public/VectorPairHMM/src/main/c++/Sandbox.cc
public/VectorPairHMM/src/main/c++/Sandbox.h
public/VectorPairHMM/src/main/c++/Sandbox.java
public/VectorPairHMM/src/main/c++/Sandbox_JNIHaplotypeDataHolderClass.h
public/VectorPairHMM/src/main/c++/Sandbox_JNIReadDataHolderClass.h
public/VectorPairHMM/src/main/c++/baseline.cc
public/VectorPairHMM/src/main/c++/define-double.h
public/VectorPairHMM/src/main/c++/define-float.h
public/VectorPairHMM/src/main/c++/define-sse-double.h
public/VectorPairHMM/src/main/c++/define-sse-float.h
public/VectorPairHMM/src/main/c++/headers.h
public/VectorPairHMM/src/main/c++/jnidebug.h
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.cc
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.h
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.cc
public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.h
public/VectorPairHMM/src/main/c++/pairhmm-template-kernel.cc
public/VectorPairHMM/src/main/c++/pairhmm-template-main.cc
public/VectorPairHMM/src/main/c++/run.sh
public/VectorPairHMM/src/main/c++/shift_template.c
public/VectorPairHMM/src/main/c++/utils.cc
public/VectorPairHMM/src/main/c++/utils.h
public/VectorPairHMM/src/main/c++/vector_function_prototypes.h
-- throws UserException; added tests in PosteriorLikelihoodsUtilsUnitTests
Add error handling to CalculateGenotypePosteriors for cases where MLEAC>AN; add tests in PosteriorLikelihoodsUtilsUnitTests
Add unit tests to confirm that CalculateGenotypePosteriors has the ability to switch genotypes for four cases
Changes:
1. Addressed review comments on new K-best haplotype assembly graph finder.
2. Generalize KBestHaplotypeFinder to deal with multiple source and sink vertices.
3. Updated test to use KBestHaplotypeFinder instead of KBestPaths
4. Retired KBestPaths to the archive.
5. Small improvements to the code and documentation.
Story:
https://www.pivotaltracker.com/story/show/66238286
Changes:
1. Created a new k-best haplotype search implementation in class KBestHaplotypeFinder.
2. Changed HC code to use the new implementation.
This seems to fix the original problem without causing significant changes in outputs using some empirical data test cases
3. Moved haplotype's cigar calculation code from Path to CigarUtils; need that in order to gain independence from Path in some parts of the code.
In any case that seems like a more natural location for that functionality.
The GATK now fails with a user error if you try to run with a reduced bam.
(I added a unit test for that; everything else here is just the removal of all traces of RR)
PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING
in case the native library could not be loaded. Made
VECTOR_LOGLESS_CACHING as the default implementation.
2. Updated the README with Mauricio's comments
3. baseline.cc is used within the library - if the machine supports
neither AVX nor SSE4.1, the native library falls back to un-vectorized
C++ in baseline.cc.
4. pairhmm-1-base.cc: This is not part of the library, but is being
heavily used for debugging/profiling. Can I request that we keep it
there for now? In the next release, we can delete it from the
repository.
5. I agree with Mauricio about the ifdefs. I am sure you already know,
but just to reassure you the debug code is not compiled into the library
(because of the ifdefs) and will not affect performance.
-- Keep a list of processed files in ArgumentTypeDescriptor.getRodBindingsCollection
-- Throw user exception if a file name duplicates one that was previously parsed
-- Throw user exception if the ROD list is empty
-- Added two unit tests to RodBindingCollectionUnitTest
public/VectorPairHMM/src/main/c++ as per Khalid's suggestion
2. Use java.home in public/VectorPairHMM/pom.xml to pass environment
variable JRE_HOME to the make process. This is needed because the
Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among
others). Assuming that the Maven build process uses a JDK (and not just
a JRE), the variable java.home points to the JRE inside maven.
3. Dropped all pretense at cross-platform compatibility. Removed Mac
profile from pom.xml for VectorPairHMM
Re-added import java.io.File for BamGatherFunction.
Other cleanup to resolve scala syntax warnings from intellij.
Moved Example UG script to from protected to public.
- This change means that BamGatherFunction will now have an @Output field for the BAM index, which will allow the bai to be deleted for intermediate functions
Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
Bug uncovered by some untrimmed alleles in the single sample pipeline output.
Notice however does not fix the untrimmed alleles in general.
Story:
https://www.pivotaltracker.com/story/show/65481104
Changes:
1. Fixed the bug itself.
2. Fixed non-working tests (sliently skipped due to exception in dataProvider).
Note that this tool is still a work in progress and very experimental, so isn't 100% stable. Most of
the features are untested (both by people and by unit/integration tests) because Chris Hartl implemented
it right before he left, and we're going to need to add tests at some point soon. I added a first
integration test in this commit, but it's just a start.
The fixes include:
1. Stop having the genotyping code strip out AD values. It doesn't make sense that it should do this so
I don't know why it was doing that at all.
Updated GenotypeGVCFs so that it doesn't need to manually recover them anymore.
This also helps CalculateGenotypePosteriors which was losing the AD values.
Updated code in LeftAlignAndTrimVariants to strip out PLs and AD, since it wasn't doing that before.
Updated the integration test for that walker to include such data.
2. Chris was calling Math.pow directly on the normalized posteriors which isn't safe.
Instead, the normalization routine itself can revert back to log scale in a safe manner so let's use it.
Also, renamed the variable to posteriorProbabilities (and not likelihoods).
3. Have CGP update the AC/AF/AN counts after fixing GTs.
commit 5e73b94eed3d1fc75c88863c2cf07d5972eb348b
Merge: e12593a d04a585
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Fri Feb 14 09:25:22 2014 +0000
Merge pull request #1 from broadinstitute/checkpoint
SimpleTimer passes tests, with formatting
commit d04a58533f1bf5e39b0b43018c9db3302943d985
Author: kshakir <github@kshakir.org>
Date: Fri Feb 14 14:46:01 2014 +0800
SimpleTimer passes tests, with formatting
Fixed getNanoOffset() to offset nano to nano, instead of nano to seconds.
Updated warning message with comma separated numbers, and exact values of offsets.
commit e12593ae66a5e6f0819316f2a580dbc7ae5896ad
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Wed Feb 12 13:27:07 2014 +0000
Remove instance of 'Timer'.
commit 47a73e0b123d4257b57cfc926a5bdd75d709fcf9
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Wed Feb 12 12:19:00 2014 +0000
Revert a couple of changes that survived somehow.
- CheckpointableTimer,Timer -> SimpleTimer
commit d86d9888ae93400514a8119dc2024e0a101f7170
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Mon Jan 20 14:13:09 2014 +0000
Revised commits following comments.
- All utility merged into `SimpleTimer`.
- All tests merged into `SimpleTimerUnitTest`.
- Behaviour of `getElapsedTime` should now be consistent with `stop`.
- Use 'TimeUnit' class for all unit conversions.
- A bit more tidying.
commit 354ee49b7fc880e944ff9df4343a86e9a5d477c7
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Fri Jan 17 17:04:39 2014 +0000
Add a new CheckpointableTimerUnitTest.
Revert SimpleTimerUnitTest to the version before any changes were made.
commit 2ad1b6c87c158399ededd706525c776372bbaf6e
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Tue Jan 14 16:11:18 2014 +0000
Add test specifically checking behaviour under checkpoint/restart.
Slight alteration to the checkpointable timer based on observations
during the testing - it seems that there's a fair amount of drift
between the sources anyway, so each time we stop we resynchronise the
offset. Hopefully this should avoid gradual drift building up and
presenting as checkpoint/restart drift.
commit 1c98881594dc51e4e2365ac95b31d410326d8b53
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Tue Jan 14 14:11:31 2014 +0000
Should use consistent time units
commit 6f70d42d660b31eee4c2e9d918e74c4129f46036
Author: Nicholas Clarke <nc6@sanger.ac.uk>
Date: Tue Jan 14 14:01:10 2014 +0000
Add a new timer supporting checkpoint mechanisms.
The issue with this is that the current timer is locked to JVM nanoTime. This can be reset after
a checkpoint/restart and result in negative elapsed times, which causes an error.
This patch addresses the issue in two ways:
- Moves the check on timer information in GenomeAnalysisEngine.java to only occur if a time limit has been
set.
- Create a new timer (CheckpointableTimer) which keeps track of the relation between system and nano time. If
this changes drastically, then the assumption is that there has been a JVM restart owing to checkpoint/restart.
Any time straddling a checkpoint/restart event will not be counted towards total running time.
Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
Updated path for output gatkdocs in nightly build script.
Removed patch in plugin manager that contained a workaround for gatkdocs running in the top level directory.
After extensive detective work, Joel determined that these tests were failing
due to changes in the implementation of Math.pow() in newer versions of
Java 1.7.
All GSA members should ensure that they're using a JDK that is at least
as current as the one in the Java-1.7 dotkit on the Broad servers
(build 1.7.0_51-b13).
This change should allow us to test that the GATK jar has been
correctly packaged at release time, by ensuring that only the
packaged jar + a few test-related dependencies are on the classpath
when tests are run.
Note that we still need to actually test that this works as intended
before we can make this live in the Bamboo release plan.
1. AD values now propogate up (they weren't before).
2. MIN_DP gets transferred over to DP and removed.
3. SB gets removed after FS is calculated.
Also, added a bunch of new integration tests for GenotypeGVCFs.
This tool will take any number of gVCFs and create a merged gVCF (as opposed to
GenotypeGVCFs which produces a standard VCF).
Added unit/integration tests and fixed up GATK docs.
New properties to disable regenerating example resources artifact when each parallel test runs under packagetest.
Moved collection of packagetest parameters from shell scripts into maven profiles.
Fixed necessity of test-utils jar by removing incorrect dependenciesToScan element during packagetests.
When building picard libraries, run clean first.
Fixed tools jar dependency in picard pom.
Integration tests properly use the ant-bridge.sh test.debug.port variable, like unit tests.
Story:
https://www.pivotaltracker.com/story/show/65048706https://www.pivotaltracker.com/story/show/65116908
Changes:
ActiveRegionTrimmer in now an argument collection and it returns not only the trimmed down active region but also the non-variant containing flanking regions
HaplotypeCaller code has been simplified significantly pushing some functionality two other classes like ActiveRegion and AssemblyResultSet.
Fixed a problem with the way the trimming was done causing some gVCF non-variant records no have conservative 0,0,0 PLs
JNI. See copied text from email below.
2. This commit contains all the code used in profiling, detecting FP
exceptions, dumping intermediate results. All flagged off using ifdefs,
but it's there.
--------------Text from email
As we discussed before, it's the denormal numbers that are causing the
slowdown - the core executes some microcode uops (called FP assists)
when denormal numbers are detected for FP operations (even un-vectorized
code).
The C++ compiler by default enables flush to zero (FTZ) - when set, the
hardware simply converts denormal numbers to 0. The Java binary
(executable provided by Oracle, not the native library) seems to be
compiled without FTZ (sensible choice, they want to be conservative).
Hence, the JNI invocation sees a large slowdown. Disabling FTZ in C++
slows down the C++ sandbox performance to the JNI version (fortunately,
the reverse also holds :)).
Not sure how to show the overhead for these FP assists easily - measured
a couple of counters.
FP_ASSISTS:ANY - shows number of uops executed as part of the FP
assists. When FTZ is enabled, this is 0 (both C++ and JNI), when FTZ is
disabled this value is around 203540557 (both C++ and JNI)
IDQ:MS_UOPS_CYCLES - shows the number of cycles the decoder was issuing
uops when the microcode sequencing engine was busy. When FTZ is enabled,
this is around 1.77M cycles (both C++ and JNI), when FTZ is disabled
this value is around 4.31B cycles (both C++ and JNI). This number is
still small with respect to total cycles (~40B), but it only reflects
the cycles in the decode stage. The total overhead of the microcode
assist ops could be larger.
As suggested by Mustafa, I compared intermediate values (matrices M,X,Y)
and final output of compute_full_prob. The values produced by C++ and
Java are identical to the last bit (as long as both use FTZ or no-FTZ).
Comparing the outputs of compute_full_prob for the cases no-FTZ and FTZ,
there are differences for very small values (denormal numbers).
Examples:
Diff values 1.952970E-33 1.952967E-33
Diff values 1.135071E-32 1.135070E-32
Diff values 1.135071E-32 1.135070E-32
Diff values 1.135071E-32 1.135070E-32
For this test case (low coverage NA12878), all these values would be
recomputed using the double precision version. Enabling FTZ should be
fine.
-------------------End text from email
These changes happened in Tribble, but Joel clobbered them with his commit.
We can now change the logging priority on failures to validate the sequence dictionary to WARN.
Thanks to Tim F for indirectly pointing this out.
1. Throw a user error when the input data for a given genotype does not contain PLs.
2. Add VCF header line for --dbsnp input
3. Need to check that the UG result is not null
4. Don't error out at positions with no gVCFs (which is possible when using a dbSNP rod)
Joel is working on these failures in a separate branch. Since
maven (currently! we're working on this..) won't run the whole
test suite to completion if there's a failure early on, we need
to temporarily disable these tests in order to allow group members
to run tests on their branches again.
Added pom.xml workarounds for duplicate classpath error, due to gatk-framework dependency containing required BaseTest, and jarred *UnitTest/*IntegrationTest classes that also exist as files under target/test-classes.
Here are the git moved directories in case other files need to be moved during a merge:
git-mv private/java/src/ private/gatk-private/src/main/java/
git-mv private/R/scripts/ private/gatk-private/src/main/resources/
git-mv private/java/test/ private/gatk-private/src/test/java/
git-mv private/testdata/ private/gatk-private/src/test/resources/
git-mv private/scala/qscript/ private/queue-private/src/main/qscripts/
git-mv private/scala/src/ private/queue-private/src/main/scala/
git-mv protected/java/src/ protected/gatk-protected/src/main/java/
git-mv protected/java/test/ protected/gatk-protected/src/test/java/
git-mv public/java/src/ public/gatk-framework/src/main/java/
git-mv public/java/test/ public/gatk-framework/src/test/java/
git-mv public/testdata/ public/gatk-framework/src/test/resources/
git-mv public/scala/qscript/ public/queue-framework/src/main/qscripts/
git-mv public/scala/src/ public/queue-framework/src/main/scala/
git-mv public/scala/test/ public/queue-framework/src/test/scala/
Changes:
-------
<NON_REF> likelihood in variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for reach read is calculated as the second best likelihood amongst the reported alleles.
When –ERC gVCF, stand_conf_emit and stand_conf_call are forcefully set to 0. Also dontGenotype is set to false for consistency sake.
Integration test MD5 have been changed accordingly.
Additional fix:
--------------
Specially after adding the <NON_REF> allele, but also happened without that, QUAL values tend to go to 0 (very large integer number in log 10) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that combineGLs has been substituted by combineGLsPrecise that uses the log-sum-exp trick.
In just a few cases this change results in genotype changes in integration tests but after double-checking using unit-test and difference between combineGLs and combineGLsPrecise in the affected integration test, the previous GT calls were either border-line cases and or due to the underflow.
2. Split into DebugJNILoglessPairHMM and VectorLoglessPairHMM with base
class JNILoglessPairHMM. DebugJNILoglessPairHMM can, in principle,
invoke any other child class of JNILoglessPairHMM.
3. Added more profiling code for Java parts of LoglessPairHMM
Problem:
matchToMatch transition calculation was wrong resulting in transition probabilites coming out of the Match state that added more than 1.
Reports:
https://www.pivotaltracker.com/s/projects/793457/stories/62471780https://www.pivotaltracker.com/s/projects/793457/stories/61082450
Changes:
The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies.
MatchToMatch transtion calculation has been fixed and implemented in PairHMMModel.
Affected integration test md5 have been updated, there were no differences in GT fields and example differences always implied
small changes in likelihoods that is what is expected.
2. Wrapped _mm_empty() with ifdef SIMD_TYPE_SSE
3. OpenMP disabled
4. Added code for initializing PairHMM's data inside initializePairHMM -
not used yet
SSE compilation warning.
2. Added code to dynamically select between AVX, SSE4.2 and normal C++ (in
that order)
3. Created multiple files to compile with different compilation flags:
avx_function_prototypes.cc is compiled with -xAVX while
sse_function_instantiations.cc is compiled with -xSSE4.2 flag.
4. Added jniClose() and support in Java (HaplotypeCaller,
PairHMMLikelihoodCalculationEngine) to call this function at the end of
the program.
5. Removed debug code, kept assertions and profiling in C++
6. Disabled OpenMP for now.
1. Moved computeLikelihoods from PairHMM to native implementation
2. Disabled debug - debug code still left (hopefully, not part of
bytecode)
3. Added directory PairHMM_JNI in the root which holds the C++
library that contains the PairHMM AVX implementation. See
PairHMM_JNI/JNI_README first
It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now.
Since I thought that it might prove useful to others, I moved it to protected and added integration tests.
GERALDINE: NEW TOOL ALERT!
Problem: the codec was written to take in consensus pileups produced with pileup -c option (which consists of 10 or 13 fields per line depending on the variant type) but errored out on the basic pileup format (which only has 6 fields per line). This was inconsistent and confusing to users.
Solution: I added a switch in the parsing to recognize and handle both cases more appropriately, and updated related docs. While I was at it I also improved error messages in CheckPileup, which now emits User Error: Bad Input exceptions when reporting mismatches. Which may not be the best thing to do (ultimately they're not really errors, they're just reporting unwelcome results) but it beats emitting Runtime Exceptions.
Tested by CheckPileupIntegrationTest which tests both format cases.
-Added docs for ERC mode in HC
-Move RecalibrationPerformance walker since to private since it is experimental and unsupported
-Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them
-Improved error msg for conflict between per-interval aggregation and -nt
-Minor clean up in exception docs
-Added Toy Walkers category for devs and dev supercat (to build out docs for developers)
-Added more detailed info to GenotypeConcordance doc based on Chris forum post
-Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples)
-Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it)
-Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.
For VariantContext tags with a list, by default the tags for the -V argument are applied
unless overridden by the individual line. In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path). For example:
foo.vcf
VCF,name=bar bar.vcf
Note that a VCF list file name must end with '.list'.
Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elemments is plit to k+1 reads. The split is done by hard clipping the bases rest of the bases.
In order to do it, few changes were introduced to some other clipping methods:
- make a segnificant change in ClippingOp.hardClip() that prevent the spliting of read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns
create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils
create integration test for that class
small change in a comment in FullProcessingPipeline
last commit:
Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficiant
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)
- fix ReadUtilsUnitTest.testReadWithNs acoording to the defult behaviour of getReadCoordinateForReferenceCoordinate (in case of refernce index that fall into deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs
- Allow the user to print the split positions (not working proparly currently)
* add overall walker GATKDocs
* add explanation for skip parameter and make it advanced
* reverse the logic on exculding unmapped reads for clarity
* fix read length calculation to no longer include indels
ps: I am not sure how useful this walker is (I didn't write it) but the skip logic is poor and
calculates the entire statistic for the reads it is eventually going to skip. This would be an easy
fix, but only worth our time if people actually use this.
The update contains:
1. documentation changes for VariantContext and Allele (which used to discuss the now obsolete null allele)
2. better error messages for VCFs containing complex rearrangements with breakends
3. instead of failing badly on format field lists with '.'s, just ignore them
Also, there is a trivial change to use a more efficient method to remove a bunch of attributes from a VC.
Delivers PT#s 59675378, 59496612, and 60524016.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine
In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants. I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.
Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
-You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes
to @Argument annotations for command-line arguments
-"minValue" and "maxValue" specify hard limits that generate an exception if violated
-"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated
-Works only for numeric arguments (int, double, etc.) with @Argument annotations
-Only considers values actually specified by the user on the command line, not default values
assigned in the code
As requested by Geraldine
In general, test classes cannot use 3rd-party libraries that are not
also dependencies of the GATK proper without causing problems when,
at release time, we test that the GATK jar has been packaged correctly
with all required dependencies.
If a test class needs to use a 3rd-party library that is not a GATK
dependency, write wrapper methods in the GATK utils/* classes, and
invoke those wrapper methods from the test class.
Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.
Now the PLs and AD get correctly selected down.
While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just
broke it out into smaller chunks as per our best practices.
Added unit tests and updated integration tests.
Addressed reviews.
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies what engine should be used to generate read vs haplotype likelihoods.
PairHMM : standard full-PairHMM approach.
GraphBased : using the assembly graph to accelarate the process.
Random : generate random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the larger kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to spport
local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations
* GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
of the graph-based likelihood approach.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
to calcualte the likelihoods using the graph as an scafold.
* HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed probrems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private - protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed probrem with HaplotypeGraph unit test
Adding Graph-based likelihood ratio calculation to HC
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies what engine should be used to generate read vs haplotype likelihoods.
PairHMM : standard full-PairHMM approach.
GraphBased : using the assembly graph to accelarate the process.
Random : generate random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the larger kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to spport
local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations
* GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
of the graph-based likelihood approach.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
to calcualte the likelihoods using the graph as an scafold.
* HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed probrems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private - protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed probrem with HaplotypeGraph unit test
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.
Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
while not converged:
* AC = [external AC] + [sample AC]
* Prior = DirichletMultinomial[AC]
* Posteriors = [sample GL + Prior]
* sample AC = MLEAC(Posteriors)
This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase.
Fully unit tested.
Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
Motivation:
The API was different between the regular PairHMM and the FPGA-implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to use account for this. The goal is to change the API to be the same
for all implementations, and make it easier to access.
PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotpes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.
CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.
LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods
Commented out the (unused) system for finding best N haplotypes for genotyping.
PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
-- Added 'jobRunnerJobName' definition to QFunction, defaults to value of shortDescription
-- Edited Lsf and Drmaa JobRunners to use this string instead of description for naming jobs in the scheduler
Signed-off-by: Joel Thibault <thibault@broadinstitute.org>
We have generalized the processing script to be able to handle multiple scenarios. Originally it was
designed for PCR free data only, we added all the steps necessary to start from fastq and process
RNA-seq as well as non-human data. This is our go to script in TechDev.
* add optional "starting from fastq" path to the pipeline
* add mark duplicates (optionally) to the pipeline
* add an option to run with the mouse data (without dbsnp and with single ended fastq)
* add option to process RNA-seq data from topHat (add RG and reassign mapping quality if necessary)
* add option to filter or include reads with N in the cigar string
* add parameter to allow keeping the intermediate files
-- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
--Previously it gave a cryptic message:
----IO error while decoding blarg.script with UTF-8
----Please try specifying another one using the -encoding option
this script downsamples an exome BAM several times and makes a coverage distribution
analysis (of bases that pass filters) as well as haplotype caller calls with a NA12878
Knowledge Base assessment with comparison against multi-sample calling
with the UG.
This script was used for the "downsampling the exome" presentation
Quick fix the missing column header in the QualifyMissingIntervals
report.
Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.
--specifying exception types in cases where none was already specified
----mostly changed to catch Exception instead of Throwable
----EmailMessage has a point where it should only be expecting a RetryException but was catching everything
--changing build.xml so that it prints scala feature warning details
--added necessary imports needed to remove feature warnings
--updating a newly deprecated enum declaration to match the new syntax
--modified ivy dependencies
--modified scala classpath in build.xml to include scala-reflect
--changed imports to point to the new scala scala.reflect.internal.util
--set the bootclasspath in QScriptManager as well as the classpath variable.
--removing Set[File] <-> Set[String] conversions
----Set is invariant now and the conversions broke
--removing unit tests for Set[File] <-> Set[String] conversions
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block
-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.
Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.
This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.
We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)
Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
-This was a dependency of the test suite, but not the GATK proper,
which caused problems when running the test suite on the packaged
GATK jar at release time
-Use GATKVCFUtils.readVCF() instead
-Switch to using new GSA AWS account for storage of phone home data
-Use DNS-compliant bucket names, as per Amazon's best practices
-Encrypt publicly-distributed version of credentials. Grant only PutObject
permission, and only for the relevant buckets.
-Store non-distributed credentials in private/GATKLogs/newAWSAccountCredentials
for now -- need to integrate with existing python/shell scripts
later to get the log downloading working with the new account
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
Why wasn't it there before, you ask
----------------------------------
Before I was running it separately (by hand), but now it's integrated in
the FullProcessingPipeline.
Integration was a pain because of Queue's limitation of only allowing 1
@Output file. This forced me to write the ugliest piece of code of my
life, but it's working and it's processing the YRI from scratch using
that right now. So I'm happy... somewhat.
Other changes to the pipeline
-----------------------------
* Add --filter_bases_not_stored to the IndelRealigner step -- sometimes BAM files have reads with no bases stored in the unmapped section (no idea why) but this disrupts the pipeline.
* Change adaptor marking parameter to "dual indexed" instead of "pair-ended" -- for PCR Free data.
* add interleaved fastq option to sam2fastq
* add optional adapter trimming path
* add "skip_revert" option to skip reverting the bams (sometimes useful -- hidden parameter)
* add a walker that reads in one bam file and outputs N bam files, one for each read group in the original bam. This is a very important step in any BAM reprocessing pipeline.
I am using this new pipeline to process the CEU and YRI PCR Free WGS
trios.
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Erics comments
-- Bugfix for BAMs containing reads without real (M,I,D,N) operators. Simply needed to set validation stringency to SILENT in the read. Added a BadCigar filter to the SAMRecord stream anyway
-- Add capture all sites mode to AssessNA12878: will write all sites to the badSites VCF, regardless of whether they are bad. It's useful if you essentially want to annotate a VCF with KB information for later analysis, such as computing ROC curves
-- Add ignore filters mode to AssessNA12878: will as expected treat all sites in the input VCF calls as PASS, even if the site has a FILTER field setting
-- Add minPNonRef argument to AssessNA12878: this will consider a site not called even if the NA12878 genotype is not 0/0 if the PLs are present and the PL for 0/0 isn't greater than this value. It allows us to easily differentiate low confidence non-ref sites obtained via multi-sample calling from highly confident non-ref calls that might be real TP or FPs
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.
Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting *tons* of re-compute.
Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
-User must provide a mapping file via new --sample_rename_mapping_file argument.
Mapping file must contain a mapping from absolute bam file path to new sample name
(format is described in the docs for the argument).
-Requires that each bam file listed in the mapping file contain only one sample
in their headers (they may contain multiple read groups for that sample, however).
The engine enforces this, and throws a UserException if on-the-fly renaming is
requested for a multi-sample bam.
-Not all bam files for a traversal need to be listed in the mapping file.
-On-the-fly renaming is done as the VERY first step after creating the SAMFileReaders
in SAMDataSource (before the headers are even merged), to prevent possible consistency
issues.
-Renaming is done ONCE at traversal start for each SAMReaders resource creation in the
SAMResourcePool; this effectively means once per -nt thread
-Comprehensive unit/integration tests
Known issues: -if you specify the absolute path to a bam in the mapping file, and then
provide a path to that same bam to -I using SYMLINKS, the renaming won't
work. The absolute paths will look different to the engine due to the
symlink being present in one path and not in the other path.
GSA-974 #resolve
-Two SAMReaderIDs that pointed at the same underlying bam file through
a relative vs. an absolute path were not being treated as equal, and
had different hash codes. This was causing problems in the engine, since
SAMReaderIDs are often used as the keys of HashMaps.
-Fix: explicitly use the absolute path to the encapsulated bam file in
hashCode() and equals()
-Added tests to ensure this doesn't break again
1. Some minor refactorings and claenup (e.g. removing unused imports) throughout.
2. Updates to the KB assessment functionality:
a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.
3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived
the DeBruijn assembler.
4. Improvements to the likelihoods for the HC
a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need
to divide e by 3 because the observed base could have come from any of the non-observed alleles.
b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is
dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more
than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
turned into Q0s for now.
c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real
sites with off kilter allele balances and we occasionally lose them to contamination removal.
5. Improved the dangling tail merging implementation.
-This test is failing intermittently for unexplained reasons (see GSA-943)
-In the interest of keeping the rest of the pipeline test suite running, it's
best to disable this one test until GSA-943 is resolved
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was succesfully built but didn't have variation, or truly failed in construction. Fixing an annoying bug where you'd prefectly assembly the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because it null was also used to indicate assembly failure
--
-- Output format looks like:
20 10026072 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026073 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026074 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,121
20 10026075 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026076 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026077 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217
20 10026079 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,240
20 10026080 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,268
20 10026081 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,267
We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values. Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty.
-- Can we enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning the each read to its most likley haplotype, and then evaluting the resulting pileups over the active region interval. The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures. Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as its no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line. Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
-- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference. Terrible.
-- Removed the broken logic of determining whether a read adaptor is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
-Explicitly state that -dcov does not produce an unbiased random sampling from all available reads
at each locus, and that instead it tries to maintain an even representation of reads from
all alignment start positions (which, of course, is a form of bias)
-Recommend -dfrac for users who want a true across-the-board unbiased random sampling
-- Because LocusWalkers have multiple filtering streams, each counting filtering independent, and the close() function set calling setFilter on the global result, not on the private counter, which is incorporated into the global (thereby incrementing the counts of each filter).
-- [delivers #52667213]
There are a few pipeline test classes that do not run Queue, but are
classified as pipeline tests because they submit farm jobs. Make these
unconventional pipeline tests respect the pipeline test dry run setting.
Previous fixes and tests only covered trailing soft-clips. Now that up front
hard-clipping is working properly though, we were failing on those in the tool.
Added a patch for this as well as a separate test independent of the soft-clips
to make sure that it's working properly.
-- Previous version emitted command lines that look like:
##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."
the new version provides additional information on when the GATK was run and the GATK version in a nicer format:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">
-- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">
-- Removed the ProtectedEngineFeaturesIntegrationTest
-- Actual unit tests for these features!
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table
Implementation:
* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
Implemtation details:
* Added tool class *.AnalyzeCovariates
* Added convenient addAll method to Utils to be able to add elements of an array.
* Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple imput recalibration report are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
-Collapses zero-length and repeated cigar elements, neither of which
can necessarily be handled correctly by downstream code (like LIBS).
-Consolidation is done before read filters, because not all read filters
behave correctly with non-consoliated cigars.
-Examined other uses of consolidateCigar() throughout the GATK, and
found them to not be redundant with the new engine-level consolidation
(they're all on artificially-created cigars in the HaplotypeCaller
and SmithWaterman classes)
-Improved comments in SAMDataSource.applyDecoratingIterators()
-Updated MD5s; differences were examined and found to be innocuous
-Two tests: -Unit test for ReadFormattingIterator
-Integration test for correct handling of zero-length
cigar elements by the GATK engine as a whole
-This argument was completely redundant with the engine-level -dfrac
argument.
-Could produce unintended consequences if used in conjunction with
engine-level downsampling arguments.
-- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly. Added integration tests to ensure this continues to work.
-- [delivers #49538319]
-- When doing cross-species comparisons and studying population history and ancient DNA data, having SOME measure of confidence is needed at every single site that doesn't depend on the reference base, even in a naive per-site SNP mode. Old versions of GATK provided GQ and some wrong PL values at reference sites but these were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, that, if enabled will:
a) Emit all 3 ALT snp alleles in the ALT column.
b) Emit all corresponding 10 PL values.
It's up to the user to process these PL values downstream to make sense of these. Note that, in order to follow VCF spec, the QUAL field in a reference call when there are non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields.
-- Tweaks and fixes to processing pipelines for Reich lab.
-- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is. Added integrationtest to ensure this behavior works going forward.
-- [delivers #47399279]
-- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE. So if you had a A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field. This obviously would cause problems for indels, as so the tool was blowing up in this case. Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead. This works in general with any type of allele.
-- Here's an example output line from the previous and current versions:
old: 20 64150 rs7274499 C . 3041.68 BGL_RM_WAS_A AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88
new: 20 64062 . G . 100.39 BGL_SET_TO_MONOMORPHIC AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15
-- update MD5s to reflect these changes
-- [delivers #50847721]
-WalkerTest now deletes *.idx files on exit
-ArtificialBAMBuilder now deletes *.bai files on exit
-VariantsToBinaryPed walker now deletes its temp files on exit
Problem:
Classes in com.sun.javadoc.* are non-standard. Since we can't depend on their availability for
all users, the GATK proper should not have any runtime dependencies on this package.
Solution:
-Isolate com.sun.javadoc.* dependencies in a DocletUtils class for use only by doclets. The
only users who need to run our doclets are those who compile from source, and they
should be competent enough to figure out how to resolve a missing com.sun.* dependency.
-HelpUtils now contains no com.sun.javadoc.* dependencies and can be safely used by walkers/other
tools.
-Added comments with instructions on when it is safe to use DocletUtils vs. HelpUtils
[delivers #51450385]
[delivers #50387199]
-- Now table looks like:
Name VariantType AssessmentType Count
variant SNPS TRUE_POSITIVE 1220
variant SNPS FALSE_POSITIVE 0
variant SNPS FALSE_NEGATIVE 1
variant SNPS TRUE_NEGATIVE 150
variant SNPS CALLED_NOT_IN_DB_AT_ALL 0
variant SNPS HET_CONCORDANCE 100.00
variant SNPS HOMVAR_CONCORDANCE 99.63
variant INDELS TRUE_POSITIVE 273
variant INDELS FALSE_POSITIVE 0
variant INDELS FALSE_NEGATIVE 15
variant INDELS TRUE_NEGATIVE 79
variant INDELS CALLED_NOT_IN_DB_AT_ALL 2
variant INDELS HET_CONCORDANCE 98.67
variant INDELS HOMVAR_CONCORDANCE 89.58
-- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does it's best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB. Add lots of unit tests for this functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:
20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6
Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
Problem:
-Downsamplers were treating reduced reads the same as normal reads,
with occasionally catastrophic results on variant calling when an
entire reduced read happened to get eliminated.
Solution:
-Since reduced reads lack the information we need to do position-based
downsampling on them, best available option for now is to simply
exempt all reduced reads from elimination during downsampling.
Details:
-Add generic capability of exempting items from elimination to
the Downsampler interface via new doNotDiscardItem() method.
Default inherited version of this method exempts all reduced reads
(or objects encapsulating reduced reads) from elimination.
-Switch from interfaces to abstract classes to facilitate this change,
and do some minor refactoring of the Downsampler interface (push
implementation of some methods into the abstract classes, improve
names of the confusing clear() and reset() methods).
-Rewrite TAROrderedReadCache. This class was incorrectly relying
on the ReservoirDownsampler to preserve the relative ordering of
items in some circumstances, which was behavior not guaranteed by
the API and only happened to work due to implementation details
which no longer apply. Restructured this class around the assumption
that the ReservoirDownsampler will not preserve relative ordering
at all.
-Add disclaimer to description of -dcov argument explaining that
coverage targets are approximate goals that will not always be
precisely met.
-Unit tests for all individual downsamplers to verify that reduced
reads are exempted from elimination
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs
and don't let those pass through.
Added tests for this implementation that checks exact SW results and correct edges added.
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele. This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
The previous behavior is to process reads with N CIGAR operators as they are despite that many of the tools do not actually support such operator and results become unpredictible.
Now if the there is some read with the N operator, the engine returns a user exception. The error message indicates what is the problem (including the offending read and mapping position) and give a couple of alternatives that the user can take in order to move forward:
a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)
b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)
Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.
Implementation:
* Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class commont parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
- Memoized MathUtil's cumulative binomial probability function.
- Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
-- This occurred because we were reverting reads with soft clips that would produce reads with negative (or 0) alignment starts. From such reads we could end up with adaptor starts that were negative and that would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which would revert and adaptor clip reads).
-- We now hard clip away bases soft clipped reverted bases that fall before the 1-based contig start in revertSoftClippedBases.
-- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString)
-- Added unit tests for reverting soft clipped bases that create a read before the contig
-- [delivers #50892431]
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
-- Although the original bug report was about SplitSamFile it actually was an engine wide error. The two places in the that provide compression to the BAM write now check the validity of the compress argument via a static method in ReadUtils
-- delivers #49531009
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
1) Add in checks for input parameters in MathUtils method. I was careful to use the bottom-level methods whenever possible, so that parameters don't needlessly go through multiple checks (so for instance, the parameters n and k for a binomial aren't checked on log10binomial, but rather in the log10binomialcoefficient subroutine).
This addresses JIRA GSA-767
Unit tests pass (we'll let bamboo deal with the integrations)
2) Address reviewer comments (change UserExceptions to IllegalArgumentExceptions).
3) .isWellFormedDouble() tests for infinity and not strictly positive infinity. Allow negative-infinity values for log10sumlog10 (as these just correspond to p=0).
After these commits, unit and integration tests now pass, and GSA-767 is done.
rebase and fix conflict:
public/java/src/org/broadinstitute/sting/utils/MathUtils.java
-- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 *before* the BQSR filters them out with MappingQualityUnassignedFilter.
-- delivers #50222251
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not
clips should be ignored. Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
-ReadShardBalancer was printing out an extra "Loading BAM index data for next contig"
message at traversal end, which was confusing users and making the GATK look stupid.
Suppress the extraneous message, and reword the log messages to be less confusing.
-Improve log message output when initializing the shard iterator in GenomeAnalysisEngine.
Don't mention BAMs when the are none, and say "Preparing for traversal" rather than
mentioning the meaningless-for-users concept of "shard strategy"
-These log messages are needed because the operations they surround might take a
while under some circumstances, and the user should know that the GATK is actively
doing something rather than being hung.
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
since low dcov values can result in problematic downsampling artifacts for locus-based
traversals.
-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
controls the number of reads per alignment start position, and even a dcov value of 1 might
be safe/desirable in some circumstances.
-Also reorganize the global downsampling defaults so that they are specified as annotations
to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
DownsamplingMethod class.
-The default downsampling settings have not been changed: they are still -dcov 1000
for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
BandedHMM
---------
-- An implementation of a linear runtime, linear memory usage banded logless PairHMM. Thought about 50% faster than current PairHMM, this implementation will be superceded by the GraphHMM when it becomes available. The implementation is being archived for future reference
Useful infrastructure changes
-----------------------------
-- Split PairHMM into a N2MemoryPairHMM that allows smarter implementation to not allocate the double[][] matrices if they don't want, which was previously occurring in the base class PairHMM
-- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for using in unit or performance testing. Added example of 100kb of data to private/testdata. Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
Don't map class to counts in the ReadMetrics (necessitating 2 HashMap lookups for every increment).
Instead, wrap the ReadFilters with a counting version and then set those counts only when updating global metrics.
-- Add() call had a misplaced map.put call, so that we were always putting the result of get() back into the map, when what we really intended was to only put the value back in if the original get() resulted in a null and so initialized the result
-- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element. New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
-- The previous version of PerReadAlleleLikelihoodMap only stored the alleles in an ArrayList, and used ArrayList.contains() to determine if an allele was already present in the map. This is very slow with many alleles. Now keeps both the ArrayList (for get() performance) and a Set of alleles for contains().
1. Don't clone the dataSource's metrics object (because then the engine won't continue to get updated counts)
2. Use the dataSource's metrics object in the CountingFilteringIterator and not the first shard's object!
3. Synchronize ReadMetrics.incrementMetrics to prevent race conditions.
Also:
* Make sure users realize that the read counts are approximate in the print outs.
* Removed a lot of unused cruft from the metrics object while I was in there.
* Added test to make sure that the ReadMetrics read count does not overflow ints.
* Added unit tests for traversal metrics (reads, loci, and active region traversals); these test counts of reads and records.
-- [Delivers #49876703]
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!