-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integrationtests
*** WAY FASTER ***
-- 3x performance for multiple sample analysis with 1000 samples
-- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
-- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2
-- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables at @DataPoints. You get an error if you do.
-- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
-- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable Hashmaps are allocated for final data structures, so they exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation. The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals
-- Renamed and reorganized infrastructure
-- StratificationManager now a Map from List<Object> -> V. All key functions are implemented. Less commonly used TODO
-- Ready for hookup to VE
-- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvalutionContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
VariantEval is overly abusive of the GATKReport (lack of) spec.
1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers.
2. Worse, the stratification modules somehow instead of sending the actual values to the report table, sends a string with the value "unknown" and then abuses the GATKReport spec to convert those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings.
Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport.
No wonder, the printing needed to be O(n^2).
* when gathering, be aware that some keys will be missing from some tables.
* when a gatktable has no elements, it should still output the header so we know it had no records
- The Integer column type now accepts byte and shorts
- Updated Unit Tests and added a new testParse() test
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
* optmized empirical qual calculation when merging recalibration reports
* centralized the quality score quantization functionalities
* unified the creating/loading of all the key manager/hash table structures.
* added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
* added integration tests for BQSR and on-the-fly recalibration
-- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear
-- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be including in the counting towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
-- StateKey no longer extends TreeMap. It's now a final immutable data structure that caches it's toString and hashcode values. TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted tostring function.
-- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils
-- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects. Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE
-- VariantEvaluator base class has a cached getSimpleName() function
-- VEUtils: general cleanup due to refactoring of StateKey
-- VEWalker: much better iteration of map data structures. If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet. This is far better than iterating over the keys and calling get() on each key.
-- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations.
-- This allows us to avoid calling update0 are every genomic base in 100ks of evaluates when there are a lot of stratifications.
-- No need to modify the integration tests, this optimization doesn't change the result of the calculation
* added empirical quality counts to allow quantization during on-the-fly recalibration to any level
* added number of observations and errors to all tables to enable plotting of all covariates
* restructured BQSR to report recalibrated tables.
* implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration)
* linked quality score quantization to the BQSR stage, outputting a quantization histogram
* included the arguments used in BQSR to the GATK Report
* included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities
On-the-fly recalibration with GATK Report
* loads all tables from the GATKReport using existing infrastructure (with minor updates)
* implemented initialiazation of the covariates using BQSR's argument list
* reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key
* applied quality quantization to the base recalibration
* excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions
-- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors. Now HMS simply checks whether the throwable it received on error was a RuntimeException. If so, it is stored and rethrow without wrapping later. If it isn't, only in this case is the exception wrapped in a ReviewedStingException.
-- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line
-- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled.
-- I believe this will finally put to rest all of these annoying HMS captures.
-- Use a LinkedHashMap not a TreeMap so iteration is faster.
-- Note that with a lot of stratifications the update0 is taking up a lot of time. For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call
-- Now you always get SNP and indel metrics with VariantEval!
-- Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions
-- Updated VE integration tests as appropriate
-- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvalator.java so everyone can share. Code updated to use these routines where appropriate
-- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF
-- TableType, which used to be an interface, is now an abstract class, allowing us to implement some generally functionality and avoid duplication.
-- This included creating a getRowName() function that used to be hardcoded as "row" but how can be overridden.
-- #### This allows us implement molten tables, which are vastly easier to use than multi-row data sets. See IndelHistogram class (in later commit) for example of molten VE output
-- No more IndelLengthHistogram (superceded by IndelSummary in subsequent commit)
-- No more SamplePreviousGenotypes or PhaseStats
-- No more MultiallelicAFs
* fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events.
* added Unit tests for BadCigarFilter
* updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar
* added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)
- Updated the documentation on the code
- Made the table.write() method private and updated necessary files.
- Added a constructor to GATKReport that takes GATKReportTables
- Optimized my code
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.
Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's life's easier if they want to include the PG tag in their walkers.
Infrastructure:
* Added static interface to all different clipping algorithms of low quality tail clipping
* Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
* Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
* EventType is now an independent enum with added capabilities. All functionality is now centralized.
BQSR and RecalibrateBases:
* On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
* Refactored the object creation to take advantage of the compact key structure
* Replaced nested hash maps with single hash maps indexed by bitsets
* Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
* Excluded contexts with N's from the output file.
* Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
* Redfined error for indels to look at the previous base in negative strand reads (using new PE functionality)
* Added the covariate ID (for optional covariates) to the output for disambiguation purposes
* Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
* Reduced memory usage of the BQSR script to 4
Tests:
* Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
* Added tests for keys without optional covariates
* Added tests for on-the-fly recalibration (but more tests are necessary)
Infrastructure:
* Generic BitSet implementation with any precision (up to long)
* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
* Memoized implementation of the BitSet utils for better performance.
* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
* Replace log/sqrt with bitwise logic to get rid of numerical issues
BQSR:
* All covariates output BitSets and have the functionality to decode them back into Object values.
* Covariates are responsible for determining the size of the key they will use (number of bits).
* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead
Tests:
* Unit tests added to every method of BitSetUtils
* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
-- Refactored ART into clearer, simpler procedures. Attempted to merge shared code into utility classes.
-- Added some docs
-- Created a new, testable ActivityProfile that represents as a class the probability of a base being active or inactive
-- Separated band-pass filtering from creation of active regions. Now you can band pass filter a profile to make another profile, and then that is explicitly converted to active regions
-- Misc. utility functions in ActiveRegionWalker such as hasPresetActiveRegions()
-- Many TODOs in ActivityProfile.
GATKReport format changes:
- All non-data header lines are preceeded with a single pound ( #:)
- Every report now has a report header containing the version number and number of tables
- Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description.
- This new format will allow reports in the future to be gatherable.
- Changed the header format to include an end-of-line string ":;"
Added features:
- Simplified GATK Reports:
The constructor for a simplified GATK Report. Simplified GATK report are designed for reports that do not need the advanced functionality of a full GATK Report.
A simple GATK Report consists of:
- A single table
- No primary key ( it is hidden )
Optional:
- Only untyped columns. As long as the data is an Object, it will be accepted.
- Default column values being empty strings.
Limitations:
- A simple GATK report cannot contain multiple tables.
- It cannot contain typed columns, which prevents arithmetic gathering.
- Added a constructor to generate simplified GATK reports.
- Added a method to easily add data to simple GATK reports.
- Upgraded the input parser take advantage of the new file format (v1).
- Added the GATKReportGatherer, more usability cmoing in next versionof GATK Report. Curently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports, It is very conservative and will only gather if the table columns, as well as everything else matches. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
- Made some GATKReport methods public, and added more setters and getters.
- Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
- The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.*)
- Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map.
- Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered.
Test changes:
- Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1.
- Updated the MD5 hashes in integration tests throughout the GATK.
Other changes:
- Added the gatherer functions to CoverageByRG
- Also added the scatterCount parameter in the Interval Coverage script
- Dropped support for reading in legacy GATKReport formats ( v0.*)
- Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints.
- Rewrote the read file method for GATK report files.
- Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
Now looks like:
<GATK-run-report>
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
<start-time>2012/03/10 20.21.19</start-time>
<end-time>2012/03/10 20.21.19</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>105</iterations>
</GATK-run-report>
No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.
Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.
Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.
Added a unit test for this case as well.
-- A cleaner table output (molten). For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp. These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicously as incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
-Running the GATK with the -et NO_ET or -et STDOUT options now
requires a key issued by us. Our reasons for doing this, and the
procedure for our users to request keys, are documented here:
http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home
-A GATK user key is an email address plus a cryptographic signature
signed using our private key, all wrapped in a GZIP container.
User keys are validated using the public key we now distribute with
the GATK. Our private key is kept in a secure location.
-Keys are cryptographically secure in that valid keys definitely
came from us and keys cannot be fabricated, however keys are not
"copy-protected" in any way.
-Includes private, standalone utilities to create a new GATK user key
(GenerateGATKUserKey) and to create a new master public/private key
pair (GenerateKeyPair). Usage of these tools will be documented on
the internal wiki shortly.
-Comprehensive unit/integration tests, including tests to ensure the
continued integrity of the GATK master public/private key pair.
-Generation of new user keys and the new unit/integration tests both
require access to the GATK private key, which can only be read by
members of the group "gsagit".
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses and override name more easily
* All contexts with 'N' bases are now collapsed as uninformative
* Context size is now represented internally as a BitSet but output as a dna string
* Temporarily disabled sorted outputs because of null objects
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
* Allows variable context size representation guaranteeing uniqueness.
* Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
* Unit Tests added
-- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensures the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
* The tailSet generated every time we flush the reads stash is still being affected by subsequent clears because it is just a pointer to the parent element in the original TreeSet. This is dangerous, and there is a weird condition where the clear will affects it.
* Fix by creating a new set, given the tailSet instead of trying to do magic with just the pointer.
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.
Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.
Thanks to Khalid, who did at least as much work on this bug as I did.
so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.
* added support to base before deletion in the pileup
* refactored covariates to operate on mismatches, insertions and deletions at the same time
* all code is in private so original BQSR is still working as usual in public
* outputs a molten CSV with mismatches, insertions and deletions, time to play!
* barely tested, passes my very simple tests... haven't tested edge cases.
premature push from my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.
This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns, match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG
Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public
The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer form working
Working GATKReportGatherer
Has only the functional to addLines
The input file parser assumes that the first column is the primary key
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* Adding the context covariate standard in both modes (including old CountCovariates) with parameters
* Updating all covariates and modules to use GATKSAMRecord throughout the code.
* BQSR now processes indels in the pileup (but doesn't do anything with them yet)
* calculates and interprets the coverage of a given interval track
* allows to expand intervals by specified number of bases
* classifies targets as CALLABLE, LOW_COVERAGE, EXCESSIVE_COVERAGE and POOR_QUALITY.
* outputs text file for now (testing purposes only), soon to be VCF.
* filters are overly aggressive for now.
-- This is a partial fix for the problem with uploading S3 logs reported by Mauricio. There the problem is that the java.io.tmpdir is not accessible (network just hangs). Because of that the s3 upload fails because the underlying system uses tmpdir for caching, etc. As far as I can tell there's no way around this bug -- you cannot overload the java.io.tmpdir programmatically and even if I could what value would we use? The only solution seems to me is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.
per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.
* Added annotations for reads that had been soft clipped prior to being reduced so that we can later recuperate their original alignments (start and end).
* Tags keep the alignment shifts, not real alignment, for better compression
* Tags are defined in the GATKSAMRecord
* GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd()
* Updated ReduceReads MD5s accordingly
AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but i suspect it's a no-call (".") rather than a no call ("./.").
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.
* Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
* Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
* Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
* Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
* Every deletion now has an offset (start of the event)
* Fixed bug when LocusITeratorByState found a read starting with insertion that happened to be a reduced read.
* Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
* Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
* Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
* The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!
phew!
reporting malformed BAMs was actually misreporting any error that happened in the Picard layer
as a BAM ERROR.
Specifically changing PicardException to report as a ReviewedStingException; we might want to
change it in the future. I'll followup with the Picard team to make sure they really, really
want PicardException to inherit from SAMException.
-- Disturbingly, fixing this bug doesn't actually cause an test failures.
-- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly.
-- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used.
Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.
* using the filter() instead of map() makes for a cleaner walker.
* renaming the unit tests to make more sense with the other unit and integration tests
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
* updated md5's accordingly
QScript accessor to QSettings to specify a default runName and other default function settings.
Because log files are no longer pseudo-random their presense can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs.
Gathered log files concatenate all log files together into the stdout.
InProcessFunctions now have PrintStreams for stdout and stderr.
Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml.
During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file.
In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope.
Added more detailed output when running with -l DEBUG.
Simplified graphviz visualization for additional debugging.
Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List)
Minor cleanup to build including sending ant gsalib to R's default libloc.
To support this, refactored code that computes consensus alleles. To ease merging of mulitple alt alleles, we create a single vc for each alt alleles and then use VariantContextUtils.simpleMerge to carry out merging, which takes care of handling all corner conditions already. In order to use this, interface to GenotypeLikelihoodsCalculationModel changed to pass in a GenomeLocParser object (why are these objects to hard to handle??).
More testing is required and feature turned off my default.
-- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional. Fixed. Has the annoying side effect that indels > 50bp in size are not counted as indels, and so are substrated from both the novel and known counts for indels. C'est la vie
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels. Using this more recent and representative file probably a good idea for more future tests in VE and other tools. File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data
This error was due to the ReadClipper change of contract. Before the read utils would return null if a read was entirely clipped, now it returns an empty (safe) GATKSAMRecord.
* Downsampling is now a parameter to the walker with default value of 0 (no downsampling)
* Downsampling selects reads at random at the variant region window and strives to achieve uniform coverage if possible around the desired downsampling value.
* Added integration test
* Knuth-shuffle is a simple, yet effective array permutator (hope this is good english).
* added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation.
* added unit tests to both functions
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
* Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
* Updated all ReadClipper unit tests
* ReduceReads does not hard clip leading insertions by default anymore
* SlidingWindow adjusts start location if read starts with insertion
* SlidingWindow creates an empty element with insertions to the right
* Fixed all potential divide by zero with totalCount() (from BaseCounts)
* Updated all Integration tests
* Added new integration test for multiple interval reducing
-- Automatic detection of most recent version of GATK release (just tell the script now to use 1.2, 1.3, and 1.4)
-- Uses 1.4 now
-- By default we do 9 runs of each non-parallel test
-- In PathUtils added convenience utility to find most recent release GATK jar with a specific release number
Turns out that because the RTC is the first walker to 'correctly' tree reduce according to functional programming
standards, the RTC has revealed a few problems with the tree reducer holding on to too much data. This is the first
and smaller of two commits to reduce memory consumption. The second commit will likely be pushed after GATK1.4 is
released.
* Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking.
* All ReadClipper utilities now emit empty reads for fully clipped reads
* Described the ReadClipper contract in the top of the class
* Added contracts where applicable
* Added descriptive information to all tools in the read clipper
* Organized public members and static methods together with the same javadoc
These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips.
* Removed from ReadUtils
* Added to GATKSAMRecord
* Changed name to getSoftStart() and getSoftEnd
* Updated third party code accordingly.
* Removed all clipping functionality from ReadUtils (it should all be done using the ReadClipper now)
* Cleaned up functionality that wasn't being used or had been superseded by other code (in an effort to reduce multiple unsupported implementations)
* Made all meaningful functions public and added better comments/explanation to the headers
For simplified access to the hard clipping utilities. No need to create a ReadClipper object if you are not doing multiple complicated clipping operations, just use the static methods.
examples:
ReadClipper.hardClipLowQualEnds(2);
ReadClipper.hardClipAdaptorSequence();
The algorithm wasn't accounting for the case where the read is the reverse strand and the insert size is negative.
* Fixed and rewrote for more clarity (with Ryan, Mark and Eric).
* Restructured the code to handle GATKSAMRecords only
* Cleaned up the other structures and functions around it to minimize clutter and potential for error.
* Added unit tests for all 4 cases of adaptor boundaries.
-- Support for collecting resources info from DRMAA runners
-- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4
-- NCoresRequest is a testing queue script for this.
-- Added two command line arguments:
-- multiCoreJerk: don't request multiple cores for jobs with nt > 1. This was the old behavior but it's really not the best way to run parallel jobs. Now with queue if you run nt = 4 the system requests 4 cores on your host. If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk
-- job_parallel_env: parallel environment named used with SGE to request multicore jobs. Equivalent to -pe job_parallel_env NT for NT > 1 jobs
* New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches.
* Added functionality to clipping op to revert all soft clip bases in a read into matches
* Added revertSoftClipBases function to the ReadClipper for public use
* Wrote systematic unit tests
Minor changes to ValidationSiteSelector logic (SampleSelectors determine whether a site is valid for output, no actual subset context need be operated on beyond that determination). Implementation of GL-based site selection. Minor changes to EJG.
UGBoundAF has to go in at some point. In the process of rewriting the math for bounding the allele frequency (it was assuming uniform tails, which is silly since i derived the posterior distribution in closed form sometime back, just need to find it)
Creating a single temporary directory per ant test run instead of a putting temp files across all runs in the same directory.
Updated various tests for above items and other small fixes.
fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case.
note: test is turned off until contract changes to allow hanging insertions (left/right).
* fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail.
* fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail.
* fixed AlignmentStart of a clipped read that results in only hard clips and soft clips
note: added tests to all these beautiful cases...
This change and my previous change have dropped runtime when dynamically merging 2k BAM files from 72.6min/1M reads to 46.8sec/1M reads.
Note that many of these changes are stopgaps -- the real problem is the way ReadWalkers interface with Picard, and I'll have to work with
Tim&Co to produce a more maintainable patch.
-- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version
-- Comparison of serial and parallel reader with cached and uncached files:
Initialization time: serial with 500 fully cached BAMs: 8.20 seconds
Initialization time: serial with 500 uncached BAMs : 197.02 seconds
Initialization time: parallel with 500 fully cached BAMs: 30.12 seconds
Initialization time: parallel with 500 uncached BAMs : 75.47 seconds
* expanded the systematic cigar string space test framework Roger wrote to all tests
* moved utility functions into Utils and ReadUtils
* cleaned up unused classes
caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
shifted the contract to functions that operate on reference based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.
* added functions to create synthetic reads for unit testing with reasonable default parameters
* added more functions to create synthetic reads based on cigar string + bases and quals.
bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds.
fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).
-- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount
-- Added logger.info message noting progress and cost of reading low-level BAM data.
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making
each ROD shard larger than the size of chr20. Why didn't we see this in Stable? Because the
ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification.
When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part
of the async I/O project, I inadvertently reenabled this specifier.
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
This has implications for both Qscript authors and CommandLineFunction authors.
Qscript authors:
You no longer need to (and in fact must not) manually escape String values to
avoid interpretation by the shell when setting up Walker parameters. Queue will
safely escape all of your Strings for you so that they'll be interpreted literally. Eg.,
Old way:
filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")
New way:
filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")
CommandLineFunction authors:
If you're writing a one-off CommandLineFunction in a Qscript and don't really
care about quoting issues, just keep doing things the direct, simple way:
def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)
If you're writing a CommandLineFunction that will become part of Queue and
will be used by other QScripts, however, it's advisable to do things the
newer, safer way, ie.:
When you construct your commandLine, you should do so ONLY using the API methods
required(), optional(), conditional(), and repeat(). These will manage quoting
and whitespace separation for you, so you shouldn't insert quotes/extraneous
whitespace in your Strings. By default you get both (quoting and whitespace
separation), but you can disable either of these via parameters. Eg.,
override def commandLine = super.commandLine +
required("eff") +
conditional(verbose, "-v") +
optional("-c", config) +
required("-i", "vcf") +
required("-o", "vcf") +
required(genomeVersion) +
required(inVcf) +
required(">", escape=false) + // This will be shell-interpreted
required(outVcf)
I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new
system, so you'll get free shell escaping when you use those in Qscripts just
like with walkers.
-- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument
-- Genericizes loading for intervals into interval tree by chromosome
-- GenomeLoc methods for reciprocal overlap detection, with unit tests
Also includes convenience parameters for specifying the IO/CPU threading balance outside of a tag. Will be killed when
Queue gets better support for tagged arguments (hopefully soon).
-- Performance optimizations
-- Tables now are cleanly formatted (floats are %.2f printed)
-- VariantSummary is a standard report now
-- Removed CompEvalGenotypes (it didn't do anything)
-- Deleted unused classes in GenotypeConcordance
-- Updates integration tests as appropriate
-- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./. Eric will fix long-standing bug in QD observed from this change
-- VFW MD5s restored to their old correct values. There was a bug in my implementation to caused the genotypes to not be parsed from the lazy output even through the header was incorrect.