Motivation:
The API differed between the regular PairHMM and the FPGA implementation in
CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to account for the difference. The goal is to make the API the same
for all implementations, and to make it easier to use.
PairHMM
PairHMM now accepts a list of reads and a map of alleles to haplotypes, and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops over the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method (see the sketch below).
Did not alter that method or its subcompute method at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.
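A minimal sketch of the new primary method's shape, assuming simplified stand-in
types (only PerReadAlleleLikelihoodMap and computeReadLikelihoodGivenHaplotypeLog10
come from the text above; the real subcompute call also takes indel and
gap-continuation quality arrays, and the way entries are added to the map is assumed):

    // Sketch only: Read, Haplotype, Allele and their accessors are illustrative stand-ins.
    public PerReadAlleleLikelihoodMap computeLikelihoods(final List<Read> reads,
                                                         final Map<Allele, Haplotype> haplotypesByAllele) {
        // (re)initialization is handled here, so callers no longer need to do it
        initialize(reads, haplotypesByAllele.values());
        final PerReadAlleleLikelihoodMap likelihoods = new PerReadAlleleLikelihoodMap();
        for (final Read read : reads) {
            // extract the per-read bases and qualities once
            final byte[] readBases = read.getBases();
            final byte[] readQuals = read.getBaseQualities();
            for (final Map.Entry<Allele, Haplotype> entry : haplotypesByAllele.entrySet()) {
                // the core computation itself is unchanged
                final double log10Likelihood = computeReadLikelihoodGivenHaplotypeLog10(
                        entry.getValue().getBases(), readBases, readQuals);
                likelihoods.add(read, entry.getKey(), log10Likelihood);
            }
        }
        return likelihoods;
    }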
CnyPairHMM
Added that same new primary access method to this FPGA class.
It overrides the default implementation in PairHMM and walks through the list of reads.
Each read's quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to be added and then walking through the reads
a second time to extract results, we get the haplotype-results array for each read as soon as it
is generated and pack it into a PerReadAlleleLikelihoodMap for return (sketched below).
The main access method is now the same whether or not the FPGA CnyPairHMM is used.
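Roughly, with the same stand-in types as the sketch above (getResultsForRead() is a
hypothetical name for fetching one read's FPGA results, and the batchAdd() argument
list is assumed):

    @Override
    public PerReadAlleleLikelihoodMap computeLikelihoods(final List<Read> reads,
                                                         final Map<Allele, Haplotype> haplotypesByAllele) {
        final List<Allele> alleles = new ArrayList<>();
        final List<Haplotype> haplotypes = new ArrayList<>();
        for (final Map.Entry<Allele, Haplotype> entry : haplotypesByAllele.entrySet()) {
            alleles.add(entry.getKey());
            haplotypes.add(entry.getValue());
        }
        final PerReadAlleleLikelihoodMap likelihoods = new PerReadAlleleLikelihoodMap();
        for (final Read read : reads) {
            // feed this read's quals and the full haplotype list to the FPGA, as before
            batchAdd(haplotypes, read.getBases(), read.getBaseQualities());
            // no second pass over the reads: grab this read's haplotype-results array now
            final double[] log10PerHaplotype = getResultsForRead(read);
            for (int h = 0; h < haplotypes.size(); h++) {
                likelihoods.add(read, alleles.get(h), log10PerHaplotype[h]);
            }
        }
        return likelihoods;
    }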
LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and compute individual log10 likelihoods
was moved into PairHMM, and so was removed from here. However, this class does need to retain
the ability to pre-process the reads and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods (see the sketch below).
Commented out the (unused) system for finding the best N haplotypes for genotyping.
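In outline (preprocessReads() and postprocessLikelihoods() are hypothetical names for
the refactored steps; the PairHMM call matches the sketch above):

    // Sketch only: the two helpers stand for the pre- and post-processing that was
    // split out of the old read/haplotype loop and kept in this class.
    public PerReadAlleleLikelihoodMap computeReadLikelihoods(final List<Read> rawReads,
                                                             final Map<Allele, Haplotype> haplotypesByAllele) {
        final List<Read> reads = preprocessReads(rawReads);             // read pre-processing stays here
        final PerReadAlleleLikelihoodMap likelihoods =
                pairHMM.computeLikelihoods(reads, haplotypesByAllele);  // looping now lives in PairHMM
        return postprocessLikelihoods(likelihoods);                     // likelihood-map post-processing stays here
    }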
PairHMMIndelErrorModel
Changes similar to those in the LikelihoodCalculationEngine were made here. However, in this case the haplotypes are modified
based on each individual read, so the read list we feed into the HMM contains only one read (see the sketch below).
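Per read, the call into the HMM then looks roughly like this (buildHaplotypesAround()
is a hypothetical name for producing the read-specific haplotypes):

    // Sketch only: the haplotypes are rebuilt around each individual read, so the HMM
    // is handed a single-element read list.
    for (final Read read : reads) {
        final Map<Allele, Haplotype> readSpecificHaplotypes = buildHaplotypesAround(read);
        final PerReadAlleleLikelihoodMap singleReadLikelihoods =
                pairHMM.computeLikelihoods(Collections.singletonList(read), readSpecificHaplotypes);
        // ... merge singleReadLikelihoods into the overall result ...
    }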
--Previously it gave a cryptic message:
----IO error while decoding blarg.script with UTF-8
----Please try specifying another one using the -encoding option
Pool Caller scripts with last-minute fixes. Also committed the script that plotted the 1000G FDR used at ASHG 2012.
Also added a README.txt file at /humgen/gsa-hpprojects/dev/validationExperiments/largeScaleValidation/finalPaperData/README.txt
in case things need to be run again.
This script downsamples an exome BAM several times and produces a coverage distribution
analysis (of bases that pass filters), as well as HaplotypeCaller calls assessed against the NA12878
Knowledge Base and compared against multi-sample calling
with the UG.
This script was used for the "downsampling the exome" presentation.
* add a length-of-the-overlapping-interval metric, as per CSER request
* standardized the distance metrics to be positive when fully overlapping, and the longest off-target tail (as a negative number) when not overlapping (see the sketch below)
* add gatkdocs to the tool (finally!)
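One possible reading of that sign convention, sketched over simple 1-based closed
coordinates (the tool's actual interval representation and edge cases may differ):

    // Sketch only: positive when the query lies fully inside the target; otherwise
    // the longest off-target tail, reported as a negative number.
    static int standardizedDistance(final int targetStart, final int targetEnd,
                                    final int queryStart, final int queryEnd) {
        if (queryStart >= targetStart && queryEnd <= targetEnd) {
            return queryEnd - queryStart + 1;   // fully overlapping: length of the overlap
        }
        final int leftTail = Math.max(0, targetStart - queryStart);
        final int rightTail = Math.max(0, queryEnd - targetEnd);
        return -Math.max(leftTail, rightTail);  // longest tail hanging off the target, negated
    }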
--specifying exception types in cases where none was specified
----mostly changed to catch Exception instead of Throwable
----EmailMessage has a spot where it should only expect a RetryException but was catching everything (see the sketch after this list)
--changing build.xml so that it prints scala feature warning details
--added the imports needed to remove feature warnings
--updating a newly deprecated enum declaration to match the new syntax
--modified ivy dependencies
--modified scala classpath in build.xml to include scala-reflect
--changed imports to point to the new Scala package scala.reflect.internal.util
--set the bootclasspath in QScriptManager as well as the classpath variable.
--removing Set[File] <-> Set[String] conversions
----Set is invariant now and the conversions broke
--removing unit tests for Set[File] <-> Set[String] conversions
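The EmailMessage fix is the usual catch-narrowing change; in Java terms (the helper
and logger names are illustrative, and the actual code lives in the Scala Queue sources):

    // Sketch only: narrow an overly broad catch to the one exception actually expected.
    try {
        trySendingWithRetries(message);
    } catch (final RetryException e) {   // was effectively catch (Throwable t), swallowing everything
        logger.warn("Failed to send email after retries", e);
    }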
* add a new column to do what I have been doing manually for every project: understand why we got no usable coverage in a given interval
* add unit tests -- this tool is now public, so we need tests.
* slightly better docs -- an effort to produce better documentation for this tool
Most people don't care about excessive coverage (unless they are very
particular about their analysis). Therefore the best possible default
value for this is Integer.MAX_VALUE, so it doesn't get in the way.
Itemized Changes:
* change maximumCoverage threshold to Integer.MAX_VALUE (sketched below)
[delivers #57353620]
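As a sketch of what that default looks like in the argument declaration (the annotation
fields and doc string here are illustrative, not copied from the tool):

    // Sketch only: the exact argument annotation and doc text in the tool may differ.
    @Argument(fullName = "maximumCoverage", doc = "Maximum coverage to consider", required = false)
    public int maximumCoverage = Integer.MAX_VALUE;   // effectively no limit by default, so it stays out of the way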