-- Refactored some duplicated code (FYI, code duplication = root of all evil) into shared functions
-- Added long-missing integrationtests
-- CHRIS/RYAN -- it would be very good to add an integration test covering external VCF files as I believe we rely on this functionality and it's not tested at all
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.
Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.
Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.
Added a unit test for this case as well.
-- A cleaner table output (molten). For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp. These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicously as incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
Several of the unit tests for the new key authorization feature require
read access to the GATK master private key file. Since this file is only
readable by members of the group gsagit, this makes it hard for people
outside the group to run the test suite.
Now, we skip tests that require the master private key if the private
key exists (since not existing would be a true error) but is not readable
by the user running the test suite
Bamboo, of course, will always be able to run these tests.
-Running the GATK with the -et NO_ET or -et STDOUT options now
requires a key issued by us. Our reasons for doing this, and the
procedure for our users to request keys, are documented here:
http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home
-A GATK user key is an email address plus a cryptographic signature
signed using our private key, all wrapped in a GZIP container.
User keys are validated using the public key we now distribute with
the GATK. Our private key is kept in a secure location.
-Keys are cryptographically secure in that valid keys definitely
came from us and keys cannot be fabricated, however keys are not
"copy-protected" in any way.
-Includes private, standalone utilities to create a new GATK user key
(GenerateGATKUserKey) and to create a new master public/private key
pair (GenerateKeyPair). Usage of these tools will be documented on
the internal wiki shortly.
-Comprehensive unit/integration tests, including tests to ensure the
continued integrity of the GATK master public/private key pair.
-Generation of new user keys and the new unit/integration tests both
require access to the GATK private key, which can only be read by
members of the group "gsagit".
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses and override name more easily
* All contexts with 'N' bases are now collapsed as uninformative
* Context size is now represented internally as a BitSet but output as a dna string
* Temporarily disabled sorted outputs because of null objects
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
* Allows variable context size representation guaranteeing uniqueness.
* Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
* Unit Tests added
-- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensures the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
* The tailSet generated every time we flush the reads stash is still being affected by subsequent clears because it is just a pointer to the parent element in the original TreeSet. This is dangerous, and there is a weird condition where the clear will affects it.
* Fix by creating a new set, given the tailSet instead of trying to do magic with just the pointer.
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.
Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.
Thanks to Khalid, who did at least as much work on this bug as I did.
so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.
* added support to base before deletion in the pileup
* refactored covariates to operate on mismatches, insertions and deletions at the same time
* all code is in private so original BQSR is still working as usual in public
* outputs a molten CSV with mismatches, insertions and deletions, time to play!
* barely tested, passes my very simple tests... haven't tested edge cases.
premature push from my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.
This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns, match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG
Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public
The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer form working
Working GATKReportGatherer
Has only the functional to addLines
The input file parser assumes that the first column is the primary key
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* Adding the context covariate standard in both modes (including old CountCovariates) with parameters
* Updating all covariates and modules to use GATKSAMRecord throughout the code.
* BQSR now processes indels in the pileup (but doesn't do anything with them yet)
* calculates and interprets the coverage of a given interval track
* allows to expand intervals by specified number of bases
* classifies targets as CALLABLE, LOW_COVERAGE, EXCESSIVE_COVERAGE and POOR_QUALITY.
* outputs text file for now (testing purposes only), soon to be VCF.
* filters are overly aggressive for now.
-- This is a partial fix for the problem with uploading S3 logs reported by Mauricio. There the problem is that the java.io.tmpdir is not accessible (network just hangs). Because of that the s3 upload fails because the underlying system uses tmpdir for caching, etc. As far as I can tell there's no way around this bug -- you cannot overload the java.io.tmpdir programmatically and even if I could what value would we use? The only solution seems to me is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.
per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.
* new unit tests for the alignment shift properties of reduce reads
* moved unit tests from ReadUtils that were actually testing GATKSAMRecord, not any of the ReadUtils to it.
* cleaned up ReadUtilsUnitTest
* Added annotations for reads that had been soft clipped prior to being reduced so that we can later recuperate their original alignments (start and end).
* Tags keep the alignment shifts, not real alignment, for better compression
* Tags are defined in the GATKSAMRecord
* GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd()
* Updated ReduceReads MD5s accordingly
AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but i suspect it's a no-call (".") rather than a no call ("./.").
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.
* Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
* Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
* Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
* Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
* Every deletion now has an offset (start of the event)
* Fixed bug when LocusITeratorByState found a read starting with insertion that happened to be a reduced read.
* Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
* Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
* Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
* The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!
phew!
reporting malformed BAMs was actually misreporting any error that happened in the Picard layer
as a BAM ERROR.
Specifically changing PicardException to report as a ReviewedStingException; we might want to
change it in the future. I'll followup with the Picard team to make sure they really, really
want PicardException to inherit from SAMException.
-- Disturbingly, fixing this bug doesn't actually cause an test failures.
-- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly.
-- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt