Motivation:
The API was different between the regular PairHMM and the FPGA-implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to use account for this. The goal is to change the API to be the same
for all implementations, and make it easier to access.
PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotpes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.
CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.
LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods
Commented out the (unused) system for finding best N haplotypes for genotyping.
PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
-- Added 'jobRunnerJobName' definition to QFunction, defaults to value of shortDescription
-- Edited Lsf and Drmaa JobRunners to use this string instead of description for naming jobs in the scheduler
Signed-off-by: Joel Thibault <thibault@broadinstitute.org>
We have generalized the processing script to be able to handle multiple scenarios. Originally it was
designed for PCR free data only, we added all the steps necessary to start from fastq and process
RNA-seq as well as non-human data. This is our go to script in TechDev.
* add optional "starting from fastq" path to the pipeline
* add mark duplicates (optionally) to the pipeline
* add an option to run with the mouse data (without dbsnp and with single ended fastq)
* add option to process RNA-seq data from topHat (add RG and reassign mapping quality if necessary)
* add option to filter or include reads with N in the cigar string
* add parameter to allow keeping the intermediate files
1. Updated NIST path to its proper place on the file system (and updated the NIST calls to the latest, v2.17).
2. Don't assess genotype concordance for multi-allelic sites because we mess up the GTs when we break
them into their component parts (and therefore the GTs look wrong when they really aren't).
3. Add an argument to control the minimum GQ for a GT to be considered called.
This improves genotyping accuracy assessments which were unfairly penalizing low confidence GT calls.
Delivers PT #59846158.
-- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
Making the usage more clear since the parameter is being used over and over to define baited
regions. Updated the headers accordingly and made it more readable.
--Previously it gave a cryptic message:
----IO error while decoding blarg.script with UTF-8
----Please try specifying another one using the -encoding option
Pool Caller scripts with last minute fixes. Also committed script that plotted 1000G FDR that I used in ASHG2012.
Added also a README.txt file in /humgen/gsa-hpprojects/dev/validationExperiments/largeScaleValidation/finalPaperData/README.txt
in case things need to get run again.
this script downsamples an exome BAM several times and makes a coverage distribution
analysis (of bases that pass filters) as well as haplotype caller calls with a NA12878
Knowledge Base assessment with comparison against multi-sample calling
with the UG.
This script was used for the "downsampling the exome" presentation
Quick fix the missing column header in the QualifyMissingIntervals
report.
Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.