* Not working yet, still very much a work-in-progress with lots of placeholders
* Needed to check this in to enable possible collaboration, since it's
going slower than anticipated and the conference deadline looms.
a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value.
b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time.
c) Expand unit tests and add an exhaustive test for ErrorModel class.
d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10.
e) Refactored utility functions that created artificial pileups for testing into a separate class, ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp; pileups can only be specified at one position, with a given number of matches and mismatches to ref), but functionality will be expanded in future to cover more test cases.
f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done).
g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model.
h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math
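Items a) and d) above can be illustrated with a minimal sketch. The class and method names here are hypothetical, not the ones in this commit, and the clipping rule (drop leading and trailing entries more than a fixed log10 distance below the peak) is an assumption about what "clip ends that deviate largely from max value" means.

```java
import java.util.Arrays;

/** Hypothetical helpers; the names are illustrative, not the classes from this commit. */
final class LogProbSketch {
    private LogProbSketch() {}

    /** Converts a natural-log value to log10: log10(e^x) = x * log10(e), which is NOT x. */
    static double lnToLog10(double lnValue) {
        return lnValue * Math.log10(Math.E);
    }

    /**
     * Returns the contiguous slice around the peak, dropping leading and trailing
     * entries that lie more than maxDeviation (in log10 units) below the maximum.
     */
    static double[] clipTails(double[] log10Probs, double maxDeviation) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Probs) {
            max = Math.max(max, v);
        }
        int start = 0;
        while (start < log10Probs.length && max - log10Probs[start] > maxDeviation) {
            start++;
        }
        int end = log10Probs.length;
        while (end > start && max - log10Probs[end - 1] > maxDeviation) {
            end--;
        }
        return Arrays.copyOfRange(log10Probs, start, end);
    }
}
```

A real implementation would presumably also record the index of the first retained entry so that clipped vectors can be aligned again; the sketch omits that.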
The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag()
Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.
* fixed queue script plot file names
* updated the ReadGroupCovariate to use the platform unit instead of sample + lane.
* fixed plotting of marginalized reported qualities
* updated BQSR queue script for faster turnaround
* implemented plot generation for scatter/gathered runs
* adjusted output file names to be cooperative with the queue script
* added the recalibration report file to the argument table in the report
* added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read
* added RecalibrationReport unit test -- guarantees the integrity of the delta tables
* fixed context covariate famous "off by one" error
* reduced maximum quality score to Q50 (following Eric/Ryan's suggestion)
* remove context downsampling in BQSR R script
This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
* Refactored the CycleCovariateUnitTest to test the pairing information
* Updated BQSR Integration tests accordingly
* Made quantization levels parameter not hidden anymore
* Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
* Added hidden option not to generate the plots automatically (important for scatter/gathering)
The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration. For 1000G calling this was prohibitive in terms of memory requirements. Now we go through the rod system and pull in just the records we need at a given position.
As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).
* removed low quality bases from the recalibration report.
* refactored the Datum (Recal and Accuracy) class structure
* created a new plotting csv table for optimized performance with the R script
* added a datum object that carries the accuracy information (AccuracyDatum) for plotting
* added mean reported quality score to all covariates
* added QualityScore as a covariate for plotting purposes
* added unit test to the key manager to operate with one required covariate and multiple optional covariates
* integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)
- By porting from Jython to Java, this is now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
-- Now properly includes both bi and multi-allelic variants. These are actually counted as well, and emitted as counts and % of sites with multiple alleles
-- Bug fix for gold standard rate
-- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now.
-- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer
-- Better printing of stack traces in WalkerTest
Adding support for active-region-based annotation for most standard annotations. I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets) like e.g. the ReadPosRankSumTest.
IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller). The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system. When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.
* Fixed output format to get a valid vcf
* Optimized the per-sample pileup routine: O(n^2) => O(n) pileup for samples
* Added support to overlapping intervals
* Removed expand target functionality (for now)
* Removed total depth (pointless metric)
-- SamFileReader.java:525
-- BlockCompressedInputStream:376
These were both instances where we weren't catching and rethrowing Picard exceptions as UserExceptions.
- refactored the statistics classes
- concurrent callable statuses by sample are now available.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* Added parameter -qq to quantize qualities using a recalibration report
* Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
* Updated BQSR scripts to make use of the new parameters
-- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites
-- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly:
// - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a
// downstream frameshift, if we make the simplifying assumptions that 3 bp ins
// and 3bp del (adding/subtracting 1 AA in general) are roughly comparably
// selected against, we should see a consistent 1+2 : 3 bp ratio for insertions
// as for deletions, and certainly would expect consistency between in/dels that
// multiple methods find and in/dels that are unique to one method (since deletions
// are more common and the artifacts differ, it is probably worth looking at the totals,
// overlaps and ratios for insertions and deletions separately in the methods
// comparison and in this case don't even need to make the simplifying in = del functional assumption
-- Added a new VEW argument to bind a gold standard track
-- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do
-- Deleted random unused functions in IndelUtils
Returns true iff VC is a non-complex indel where every allele represents an expansion or
contraction of a series of identical bases in the reference.
The logic of this function is pretty simple. Take all of the non-null alleles in VC. For
each insertion allele of n bases, check if that allele matches the next n reference bases.
For each deletion allele of n bases, check if this matches the reference bases at n - 2 n,
as it must necessarily match the first n bases. If this test returns true for all
alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the
difference in length, in bases, between the ref and alt alleles.
* fixed the loading of the new reduced size reports
* reduced BQSR scala script memory to 2Gb
* removed dcov parameter from BQSR scala script
* fixed estimatedQReported calculation from -log10(pe) to -10*log10(pe).
* updated md5's with the proper PHRED scaled EstimatedQReported
* fixed bug where some keys were using the same recal datum objects
* fixed quantization qual calculations when combining multiple reports
* fixed rounding error with empirical quality reported when combining reports
* fixed combine routine in the gatk reports due to the primary keys being out of order
* added auto-recalibration option to BQSR scala script
* reduced the size of the recalibration report by ~15%
* updated md5's
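The estimatedQReported fix above comes down to the Phred scale, Q = -10 * log10(pe). A minimal sketch follows; the class and method names are hypothetical.

```java
/** Hypothetical helper showing the Phred scaling used for reported qualities. */
final class PhredSketch {
    private PhredSketch() {}

    /** Converts an error probability in (0, 1] to a Phred-scaled quality: Q = -10 * log10(pe). */
    static double errorProbToPhred(double errorProbability) {
        if (!(errorProbability > 0.0 && errorProbability <= 1.0)) {
            throw new IllegalArgumentException("error probability must be in (0, 1]: " + errorProbability);
        }
        return -10.0 * Math.log10(errorProbability);
    }
}
```

With the old -log10(pe) form, an error rate of 1 in 1000 would have been reported as 3 rather than Q30.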
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integrationtests
*** WAY FASTER ***
-- 3x performance for multiple sample analysis with 1000 samples
-- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
-- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2
-- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables at @DataPoints. You get an error if you do.
-- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
-- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable HashMaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation. The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals
-- Renamed and reorganized infrastructure
-- StratificationManager now a Map from List<Object> -> V. All key functions are implemented. Less commonly used TODO
-- Ready for hookup to VE
-- Created a unit-tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
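The idea of a static mapping from every stratification combination to an offset in a vector can be sketched as follows. Note that this sketch uses plain mixed-radix indexing rather than the tree described above, and all names are hypothetical.

```java
import java.util.List;

/** Hypothetical mixed-radix index over stratification states; not the tree used in the commit. */
final class StratificationIndexSketch {
    private final List<List<String>> statesPerStratifier;

    StratificationIndexSketch(List<List<String>> statesPerStratifier) {
        this.statesPerStratifier = statesPerStratifier;
    }

    /** Total number of state combinations, i.e. the length of the context vector. */
    int size() {
        int n = 1;
        for (List<String> states : statesPerStratifier) {
            n *= states.size();
        }
        return n;
    }

    /** Maps one state per stratifier to a unique offset in [0, size()). */
    int offsetOf(List<String> key) {
        if (key.size() != statesPerStratifier.size()) {
            throw new IllegalArgumentException("expected one state per stratifier");
        }
        int offset = 0;
        for (int i = 0; i < key.size(); i++) {
            List<String> states = statesPerStratifier.get(i);
            int index = states.indexOf(key.get(i));
            if (index < 0) {
                throw new IllegalArgumentException("unknown state: " + key.get(i));
            }
            // Treat each stratifier as one digit of a mixed-radix number.
            offset = offset * states.size() + index;
        }
        return offset;
    }
}
```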
VariantEval is overly abusive of the GATKReport (lack of) spec.
1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers.
2. Worse, the stratification modules, instead of sending the actual values to the report table, send a string with the value "unknown" and then abuse the GATKReport spec to replace those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings.
Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport.
No wonder, the printing needed to be O(n^2).
* when gathering, be aware that some keys will be missing from some tables.
* when a gatktable has no elements, it should still output the header so we know it had no records
- The Integer column type now accepts byte and shorts
- Updated Unit Tests and added a new testParse() test
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
* optimized empirical qual calculation when merging recalibration reports
* centralized the quality score quantization functionalities
* unified the creating/loading of all the key manager/hash table structures.
* added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
* added integration tests for BQSR and on-the-fly recalibration
-- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear
-- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the count towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel-containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
-- StateKey no longer extends TreeMap. It's now a final immutable data structure that caches its toString and hashCode values. TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted toString function.
-- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils
-- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects. Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE
-- VariantEvaluator base class has a cached getSimpleName() function
-- VEUtils: general cleanup due to refactoring of StateKey
-- VEWalker: much better iteration of map data structures. If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet. This is far better than iterating over the keys and calling get() on each key.
-- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations.
-- This allows us to avoid calling update0 at every genomic base in 100Ks of evaluations when there are a lot of stratifications.
-- No need to modify the integration tests, this optimization doesn't change the result of the calculation
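The Map.Entry advice from the VEWalker note above can be shown with a small sketch. The class and method names are hypothetical; both methods compute the same total, but the second avoids one lookup per key.

```java
import java.util.Map;

/** Hypothetical comparison of the two iteration styles mentioned above. */
final class EntryIterationSketch {
    private EntryIterationSketch() {}

    /** Iterates over keys and calls get() for each one: an extra lookup per element. */
    static long totalViaKeySet(Map<String, Integer> counts) {
        long total = 0;
        for (String key : counts.keySet()) {
            total += counts.get(key);
        }
        return total;
    }

    /** Iterates over entries directly: key and value arrive together, no extra lookup. */
    static long totalViaEntrySet(Map<String, Integer> counts) {
        long total = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }
}
```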
* added empirical quality counts to allow quantization during on-the-fly recalibration to any level
* added number of observations and errors to all tables to enable plotting of all covariates
* restructured BQSR to report recalibrated tables.
* implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration)
* linked quality score quantization to the BQSR stage, outputting a quantization histogram
* included the arguments used in BQSR to the GATK Report
* included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities
On-the-fly recalibration with GATK Report
* loads all tables from the GATKReport using existing infrastructure (with minor updates)
* implemented initialization of the covariates using BQSR's argument list
* reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key
* applied quality quantization to the base recalibration
* excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions
-- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors. Now HMS simply checks whether the throwable it received on error was a RuntimeException. If so, it is stored and rethrown without wrapping later. If it isn't, only in this case is the exception wrapped in a ReviewedStingException.
-- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line
-- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled.
-- I believe this will finally put to rest all of these annoying HMS captures.
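The store-and-rethrow rule described above might look roughly like the sketch below. Everything here is hypothetical: WrappedException stands in for ReviewedStingException, and keeping only the first reported error is an assumption, not something the commit message states.

```java
/** Hypothetical error tracker mirroring the rule described for HMS. */
final class ErrorTrackerSketch {
    /** Stand-in for ReviewedStingException, so the sketch compiles on its own. */
    static final class WrappedException extends RuntimeException {
        WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private RuntimeException error;

    /** Records a failure from a worker thread; only the first one is kept (an assumption). */
    synchronized void notifyOfError(Throwable t) {
        if (error != null) {
            return;
        }
        if (t instanceof RuntimeException) {
            // Runtime exceptions are stored as-is so callers see the original type.
            error = (RuntimeException) t;
        } else {
            // Anything else is wrapped, since it cannot be rethrown unchecked.
            error = new WrappedException("Error in worker thread", t);
        }
    }

    /** Rethrows the stored error, if any, on the calling thread. */
    synchronized void throwIfError() {
        if (error != null) {
            throw error;
        }
    }
}
```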
-- Use a LinkedHashMap not a TreeMap so iteration is faster.
-- Note that with a lot of stratifications the update0 is taking up a lot of time. For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call
-- Now you always get SNP and indel metrics with VariantEval!
-- Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions
-- Updated VE integration tests as appropriate
-- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvaluator.java so everyone can share. Code updated to use these routines where appropriate
-- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF
-- TableType, which used to be an interface, is now an abstract class, allowing us to implement some general functionality and avoid duplication.
-- This included creating a getRowName() function that used to be hardcoded as "row" but now can be overridden.
-- #### This allows us to implement molten tables, which are vastly easier to use than multi-row data sets. See IndelHistogram class (in later commit) for example of molten VE output
-- No more IndelLengthHistogram (superseded by IndelSummary in subsequent commit)
-- No more SamplePreviousGenotypes or PhaseStats
-- No more MultiallelicAFs
* fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events.
* added Unit tests for BadCigarFilter
* updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar
* added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)
- Updated the documentation on the code
- Made the table.write() method private and updated necessary files.
- Added a constructor to GATKReport that takes GATKReportTables
- Optimized my code
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.
Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's lives easier if they want to include the PG tag in their walkers.
Infrastructure:
* Added static interface to all different clipping algorithms of low quality tail clipping
* Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
* Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
* EventType is now an independent enum with added capabilities. All functionality is now centralized.
BQSR and RecalibrateBases:
* On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
* Refactored the object creation to take advantage of the compact key structure
* Replaced nested hash maps with single hash maps indexed by bitsets
* Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
* Excluded contexts with N's from the output file.
* Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
* Redefined error for indels to look at the previous base in negative strand reads (using new PE functionality)
* Added the covariate ID (for optional covariates) to the output for disambiguation purposes
* Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
* Reduced memory usage of the BQSR script to 4
Tests:
* Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
* Added tests for keys without optional covariates
* Added tests for on-the-fly recalibration (but more tests are necessary)
Infrastructure:
* Generic BitSet implementation with any precision (up to long)
* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
* Memoized implementation of the BitSet utils for better performance.
* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
* Replace log/sqrt with bitwise logic to get rid of numerical issues
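The last two points can be sketched as below. The names are hypothetical. The underlying issue is that a double carries only 53 bits of mantissa, so arithmetic routed through double can silently lose the low-order bits of large long values; for example, Math.pow(2, 60) + 1 rounds back to 2^60.

```java
/** Hypothetical integer-only replacements for pow and log on bit-set sizes. */
final class BitMathSketch {
    private BitMathSketch() {}

    /** Computes 2^exponent exactly with a shift; valid for exponents 0..62. */
    static long powerOfTwo(int exponent) {
        if (exponent < 0 || exponent > 62) {
            throw new IllegalArgumentException("exponent out of range: " + exponent);
        }
        return 1L << exponent;
    }

    /** Number of bits needed to represent values in 0..maxValue, without calling Math.log. */
    static int bitsNeeded(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be non-negative: " + maxValue);
        }
        return maxValue == 0 ? 1 : Long.SIZE - Long.numberOfLeadingZeros(maxValue);
    }
}
```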
BQSR:
* All covariates output BitSets and have the functionality to decode them back into Object values.
* Covariates are responsible for determining the size of the key they will use (number of bits).
* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead
Tests:
* Unit tests added to every method of BitSetUtils
* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
-- Refactored ART into clearer, simpler procedures. Attempted to merge shared code into utility classes.
-- Added some docs
-- Created a new, testable ActivityProfile that represents as a class the probability of a base being active or inactive
-- Separated band-pass filtering from creation of active regions. Now you can band pass filter a profile to make another profile, and then that is explicitly converted to active regions
-- Misc. utility functions in ActiveRegionWalker such as hasPresetActiveRegions()
-- Many TODOs in ActivityProfile.
GATKReport format changes:
- All non-data header lines are preceded with a single pound ( #:)
- Every report now has a report header containing the version number and number of tables
- Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description.
- This new format will allow reports in the future to be gatherable.
- Changed the header format to include an end-of-line string ":;"
Added features:
- Simplified GATK Reports:
The constructor for a simplified GATK Report. Simplified GATK reports are designed for reports that do not need the advanced functionality of a full GATK Report.
A simple GATK Report consists of:
- A single table
- No primary key ( it is hidden )
Optional:
- Only untyped columns. As long as the data is an Object, it will be accepted.
- Default column values being empty strings.
Limitations:
- A simple GATK report cannot contain multiple tables.
- It cannot contain typed columns, which prevents arithmetic gathering.
- Added a constructor to generate simplified GATK reports.
- Added a method to easily add data to simple GATK reports.
- Upgraded the input parser to take advantage of the new file format (v1).
- Added the GATKReportGatherer; more usability is coming in the next version of GATK Report. Currently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports. It is very conservative and will only gather if the table columns, as well as everything else, match. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
- Made some GATKReport methods public, and added more setters and getters.
- Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
- The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.*)
- Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map.
- Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered.
Test changes:
- Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1.
- Updated the MD5 hashes in integration tests throughout the GATK.
Other changes:
- Added the gatherer functions to CoverageByRG
- Also added the scatterCount parameter in the Interval Coverage script
- Dropped support for reading in legacy GATKReport formats ( v0.*)
- Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints.
- Rewrote the read file method for GATK report files.
- Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
Now looks like:
<GATK-run-report>
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
<start-time>2012/03/10 20.21.19</start-time>
<end-time>2012/03/10 20.21.19</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>105</iterations>
</GATK-run-report>
No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.
Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.
Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.
Added a unit test for this case as well.
-- A cleaner table output (molten). For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp. These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL), resulting most conspicuously in incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
-Running the GATK with the -et NO_ET or -et STDOUT options now
requires a key issued by us. Our reasons for doing this, and the
procedure for our users to request keys, are documented here:
http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home
-A GATK user key is an email address plus a cryptographic signature
signed using our private key, all wrapped in a GZIP container.
User keys are validated using the public key we now distribute with
the GATK. Our private key is kept in a secure location.
-Keys are cryptographically secure in that valid keys definitely
came from us and keys cannot be fabricated, however keys are not
"copy-protected" in any way.
-Includes private, standalone utilities to create a new GATK user key
(GenerateGATKUserKey) and to create a new master public/private key
pair (GenerateKeyPair). Usage of these tools will be documented on
the internal wiki shortly.
-Comprehensive unit/integration tests, including tests to ensure the
continued integrity of the GATK master public/private key pair.
-Generation of new user keys and the new unit/integration tests both
require access to the GATK private key, which can only be read by
members of the group "gsagit".
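The sign-with-private-key, verify-with-public-key scheme described above can be sketched with the standard java.security API. This is only an illustration: the algorithm choice (SHA256withRSA, 2048-bit keys) and all class and method names are assumptions, and the GZIP container is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.security.SignatureException;

/** Hypothetical sketch of an email-plus-signature user key. */
final class UserKeySketch {
    // Assumed algorithm; the commit message does not say which one the GATK uses.
    private static final String SIGNATURE_ALGORITHM = "SHA256withRSA";

    private UserKeySketch() {}

    /** Creates a master public/private key pair. */
    static KeyPair newMasterKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not generate key pair", e);
        }
    }

    /** Signs the user's email address with the private key. */
    static byte[] sign(String email, PrivateKey privateKey) {
        try {
            Signature signer = Signature.getInstance(SIGNATURE_ALGORITHM);
            signer.initSign(privateKey);
            signer.update(email.getBytes(StandardCharsets.UTF_8));
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not sign email", e);
        }
    }

    /** Returns true only if the signature was produced for this email by the matching private key. */
    static boolean isValid(String email, byte[] signature, PublicKey publicKey) {
        try {
            Signature verifier = Signature.getInstance(SIGNATURE_ALGORITHM);
            verifier.initVerify(publicKey);
            verifier.update(email.getBytes(StandardCharsets.UTF_8));
            return verifier.verify(signature);
        } catch (SignatureException e) {
            return false; // malformed signature bytes count as invalid
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not verify signature", e);
        }
    }
}
```

As the message notes, a scheme like this proves where a key came from but does nothing to stop a valid key from being copied.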
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size and read length when there's no data, or the data isn't paired end, by emitting NA instead of 0
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses can override the name more easily
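A bounded median collector of the kind described above might look roughly like this. The class name BoundedMedian, its method names, and the choice of the upper median for even counts are assumptions for illustration, not the actual Median tool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a median collector with a cap on stored elements.
final class BoundedMedian<T extends Comparable<T>> {
    private final int maxElements;
    private final List<T> values = new ArrayList<>();

    BoundedMedian(int maxElements) {
        if (maxElements < 1) throw new IllegalArgumentException("maxElements must be >= 1");
        this.maxElements = maxElements;
    }

    /** Keeps the value unless the cap has been reached; returns whether it was kept. */
    boolean add(T value) {
        if (values.size() >= maxElements) return false;
        return values.add(value);
    }

    /** Returns the (upper) median, or defaultValue when nothing was collected. */
    T getMedian(T defaultValue) {
        if (values.isEmpty()) return defaultValue;
        List<T> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }
}
```

The default value gives the caller a place to signal "no data" (for example, something that is later printed as NA) instead of reporting a misleading 0.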
* All contexts with 'N' bases are now collapsed as uninformative
* Context size is now represented internally as a BitSet but output as a DNA string
* Temporarily disabled sorted outputs because of null objects
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
* Allows variable context size representation guaranteeing uniqueness.
* Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
* Unit Tests added
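One way to pack a variable-length DNA context into a long with guaranteed uniqueness is to use 2 bits per base behind a leading sentinel bit. The sketch below assumes that encoding; the real implementation may use a different scheme, and the class name ContextBits and the null return for non-ACGT bases are hypothetical. With 2 bits per base plus the sentinel, 31 bases use 63 bits, which is consistent with the 31-base limit of long precision.

```java
import java.util.BitSet;

// Hypothetical sketch: 2 bits per base, with a leading 1 bit marking the length.
final class ContextBits {
    static final int MAX_CONTEXT = 31; // 1 sentinel bit + 62 data bits fit in a positive long

    /** Encodes a context; returns null if it contains anything other than A, C, G, T. */
    static BitSet encode(String context) {
        if (context.length() > MAX_CONTEXT) {
            throw new IllegalArgumentException("context longer than " + MAX_CONTEXT + " bases");
        }
        long packed = 1L; // sentinel so that "A" and "AA" encode differently
        for (char base : context.toCharArray()) {
            int code = "ACGT".indexOf(Character.toUpperCase(base));
            if (code < 0) return null; // assumption: e.g. 'N' makes the context uninformative
            packed = (packed << 2) | code;
        }
        return BitSet.valueOf(new long[] { packed });
    }

    /** Reverses encode(): strips 2 bits at a time until only the sentinel is left. */
    static String decode(BitSet bits) {
        long packed = bits.toLongArray()[0];
        StringBuilder sb = new StringBuilder();
        while (packed > 1L) {
            sb.append("ACGT".charAt((int) (packed & 3L)));
            packed >>>= 2;
        }
        return sb.reverse().toString();
    }
}
```

The sentinel bit is what makes contexts of different sizes distinguishable: without it, leading A bases (code 00) would vanish and "A" would collide with "AA".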
-- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensure the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
* The tailSet generated every time we flush the reads stash is still affected by subsequent clears, because it is only a view backed by the original TreeSet. This is dangerous: there is a condition where a later clear affects it.
* Fix by creating a new set from the tailSet instead of trying to do magic with just the view.
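The underlying java.util behavior is easy to reproduce: TreeSet.tailSet returns a view backed by the original set, so clearing the parent empties the view too, while a copy is unaffected. The class and method names below are hypothetical and only illustrate the fix.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical demo of view-vs-copy semantics for TreeSet.tailSet.
final class TailSetCopyDemo {
    /** Returns a live view: later changes to the parent set show through. */
    static SortedSet<Integer> tailView(TreeSet<Integer> stash, int from) {
        return stash.tailSet(from);
    }

    /** Returns an independent snapshot that survives a clear() of the parent. */
    static SortedSet<Integer> tailCopy(TreeSet<Integer> stash, int from) {
        return new TreeSet<>(stash.tailSet(from));
    }
}
```

The copy costs one extra pass over the tail, which is a small price for results that no longer depend on when the stash is cleared.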
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.
Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.
Thanks to Khalid, who did at least as much work on this bug as I did.
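For reference, a union of two sets of file-offset chunks that stays correct for any overlap pattern can be sketched as a sort-and-coalesce pass. This is a generic interval union, not the actual GATKBAMFileSpan.union() code; the class name SpanUnion and the {start, end} long-pair representation are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: each chunk is a long[2] of {start, end} file offsets.
final class SpanUnion {
    static List<long[]> union(List<long[]> a, List<long[]> b) {
        List<long[]> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingLong((long[] c) -> c[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] chunk : all) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && chunk[0] <= last[1]) {
                // Overlapping or adjacent: extend, but never shrink, the current chunk.
                last[1] = Math.max(last[1], chunk[1]);
            } else {
                merged.add(new long[] { chunk[0], chunk[1] }); // copy so inputs stay untouched
            }
        }
        return merged;
    }
}
```

The Math.max is the detail that guards against truncation: when one chunk lies entirely inside another, taking the later chunk's end unconditionally would cut the union short.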
so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.
* added support to base before deletion in the pileup
* refactored covariates to operate on mismatches, insertions and deletions at the same time
* all code is in private so original BQSR is still working as usual in public
* outputs a molten CSV with mismatches, insertions and deletions, time to play!
* barely tested, passes my very simple tests... haven't tested edge cases.
Premature push on my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.
This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and will only gather if the table columns match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
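The conservative gathering rule (identical columns required, no overwriting of existing row ids) can be sketched as follows. The class name TableGatherSketch and the map-of-rows representation are assumptions for illustration; they are not the GATKReport API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a table is a column list plus rows keyed by row id.
final class TableGatherSketch {
    static Map<String, List<String>> gather(List<String> columnsA, Map<String, List<String>> rowsA,
                                            List<String> columnsB, Map<String, List<String>> rowsB) {
        // Refuse to gather unless both tables have exactly the same columns.
        if (!columnsA.equals(columnsB)) {
            throw new IllegalArgumentException("table columns do not match");
        }
        Map<String, List<String>> merged = new LinkedHashMap<>(rowsA);
        for (Map.Entry<String, List<String>> row : rowsB.entrySet()) {
            // putIfAbsent returns the existing row when the id is already present.
            if (merged.putIfAbsent(row.getKey(), row.getValue()) != null) {
                throw new IllegalStateException("row id " + row.getKey() + " would be overwritten");
            }
        }
        return merged;
    }
}
```

Failing loudly on a duplicate row id fits scatter-gather runs, where each scattered piece is expected to contribute disjoint rows.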
Added the gatherer functions to CoverageByRG
Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public
The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer from working
Working GATKReportGatherer
Has only the functionality to addLines
The input file parser assumes that the first column is the primary key
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* Adding the context covariate standard in both modes (including old CountCovariates) with parameters
* Updating all covariates and modules to use GATKSAMRecord throughout the code.
* BQSR now processes indels in the pileup (but doesn't do anything with them yet)
* calculates and interprets the coverage of a given interval track
* allows expanding intervals by a specified number of bases
* classifies targets as CALLABLE, LOW_COVERAGE, EXCESSIVE_COVERAGE and POOR_QUALITY.
* outputs text file for now (testing purposes only), soon to be VCF.
* filters are overly aggressive for now.
-- This is a partial fix for the problem with uploading S3 logs reported by Mauricio. The problem there is that java.io.tmpdir is not accessible (the network just hangs), so the S3 upload fails because the underlying system uses tmpdir for caching, etc. As far as I can tell there's no way around this bug: you cannot override java.io.tmpdir programmatically, and even if I could, what value would we use? The only solution, it seems to me, is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.
per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.