-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
-- Restructured code to separate the MISSING value in Java (currently a null everywhere) from the byte representation on disk (an int).
-- Now correctly handles MISSING qual fields
-- Refactored setting of contigs from VCFWriterStub to VCFUtils. Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from IntelliJ
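The contig-ordering change can be pictured with a small sketch. It assumes each contig header line carries its index in the sequence dictionary and sorts on that index rather than on the name; ContigLine and namesInDictionaryOrder are hypothetical names for illustration, not the actual VCFContigHeaderLine API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ContigOrderSketch {
    /** Hypothetical contig header line: a name plus its position in the sequence dictionary. */
    static final class ContigLine implements Comparable<ContigLine> {
        final String name;
        final int index;

        ContigLine(String name, int index) {
            this.name = name;
            this.index = index;
        }

        /** Order by dictionary index, not by name, so chr2 sorts before chr10. */
        @Override
        public int compareTo(ContigLine other) {
            return Integer.compare(index, other.index);
        }
    }

    /** Returns contig names sorted by their dictionary index. */
    static List<String> namesInDictionaryOrder(List<ContigLine> lines) {
        List<ContigLine> sorted = new ArrayList<>(lines);
        Collections.sort(sorted);
        List<String> names = new ArrayList<>();
        for (ContigLine line : sorted) {
            names.add(line.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<ContigLine> lines = new ArrayList<>();
        lines.add(new ContigLine("chr10", 9));
        lines.add(new ContigLine("chr2", 1));
        lines.add(new ContigLine("chr1", 0));
        System.out.println(namesInDictionaryOrder(lines));
    }
}
```

A plain lexicographic sort of the names would put chr10 before chr2, which is the kind of ordering problem an explicit index avoids.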
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.
Ex:
public Wrapper getNewWrapper(File path) {
    FileStream myStream = new FileStream(path); // This stream must eventually be closed.
    return new Wrapper(myStream);
}
public void close(Wrapper wrapper) {
    wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
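A minimal sketch of the non-leaking behaviour, assuming the wrapper owns the resource it is handed: close() simply forwards to the wrapped stream. TrackedStream and Wrapper here are hypothetical stand-ins written for this illustration, not the classes touched by the fix.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ClosingWrapperSketch {
    /** Hypothetical resource that records whether close() was ever called on it. */
    static final class TrackedStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackedStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical adapter that owns the stream it wraps. */
    static final class Wrapper implements Closeable {
        private final InputStream wrapped;

        Wrapper(InputStream wrapped) {
            this.wrapped = wrapped;
        }

        int read() throws IOException {
            return wrapped.read();
        }

        /** Forward the close request instead of ignoring it. */
        @Override
        public void close() throws IOException {
            wrapped.close();
        }
    }

    /** Closes only the wrapper and reports whether the underlying stream was closed too. */
    static boolean closingWrapperClosesStream() {
        TrackedStream stream = new TrackedStream(new byte[] {1, 2, 3});
        try (Wrapper wrapper = new Wrapper(stream)) {
            wrapper.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stream.closed;
    }

    public static void main(String[] args) {
        System.out.println(closingWrapperClosesStream());
    }
}
```

If a wrapper is sometimes handed a resource it must not close, an explicit ownership flag in the constructor is one way to make that choice visible to callers.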
* For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE. Switched over to booleans.
* We'd also generate a NPE if SAMRecord-writing-specific arguments (e.g. --simplifyBAM) were used while writing to stdout.
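The Boolean-versus-boolean problem can be reproduced in a few lines. This is an illustrative sketch only: BoxedArgs and PrimitiveArgs are hypothetical holders, not the real argument classes.

```java
final class BooleanFlagSketch {
    /** Hypothetical argument holder with a boxed flag, which defaults to null. */
    static final class BoxedArgs {
        Boolean simplify;
    }

    /** The same holder with a primitive flag, which defaults to false. */
    static final class PrimitiveArgs {
        boolean simplify;
    }

    /** Returns true if testing the unset boxed flag throws a NullPointerException. */
    static boolean unboxingThrows() {
        BoxedArgs args = new BoxedArgs();
        try {
            if (args.simplify) {
                // Never reached: auto-unboxing the null Boolean throws first.
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Returns the value of the unset primitive flag. */
    static boolean primitiveDefault() {
        return new PrimitiveArgs().simplify;
    }

    public static void main(String[] args) {
        System.out.println(unboxingThrows() + " " + primitiveDefault());
    }
}
```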
The practical differences between version 1.0 and this one (v1.1) are:
* the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables.
* no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table.
* no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables.
Integration tests change because table headers are different.
Old classes are still lying around. Will clean those up in a subsequent commit.
* writer mostly implemented
* walkers to convert BCF2 <-> VCF
* almost working for sites-only files; genotypes still need work
* initial performance tests this afternoon will be on sites-only files
From tribble logs:
Binary feature support in tribble
-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream
as an argument, not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides its subclasses with the same String decode and
readHeader functionality as before. Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream). The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (it's a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfix for a LinearIndexCreator off-by-one error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's not a particularly
clean way of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so it's no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version, to see header lines in the decode functions. Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type. Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files. Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integration tests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and are no longer tolerated by the newer tribble codebase
Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut().
In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124.
- Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs.
- Updated md5s to match new expectations after looking at TLEN diff engine output.
* Not working yet, still very much a work-in-progress with lots of placeholders
* Needed to check this in to enable possible collaboration, since it's
going slower than anticipated and the conference deadline looms.
a) Add ability for ErrorModel to be specified by external log-probability vector for testing.
b) For a given depth and ploidy (= 2 * samples/pool), create artificial high-quality pileups for testing, from AC=0 to AC=ploidy, and test that pool GLs have the expected content.
c) Misc. cleanups and beautification.
a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value.
b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time.
c) Expand unit tests and add an exhaustive test for ErrorModel class.
d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10.
e) Refactored utility functions that created artificial pileups for testing into a separate class, ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp; pileups can only be specified at one position, with a given number of matches and mismatches to ref), but functionality will be expanded in future to cover more test cases.
f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done).
g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model.
h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math
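The math bug in item d) comes down to a change of base: log10(e^x) equals x * log10(e), not x. A minimal sketch of the conversion (LogBaseSketch and lnToLog10 are hypothetical names, not the ErrorModel code):

```java
final class LogBaseSketch {
    /** log10(e), the factor that converts a natural-log value to base 10. */
    static final double LOG10_E = Math.log10(Math.E);

    /** Converts x = ln(p) into log10(p), using log10(e^x) = x * log10(e). */
    static double lnToLog10(double x) {
        return x * LOG10_E;
    }

    public static void main(String[] args) {
        double x = -2.0;
        System.out.println(lnToLog10(x) + " vs " + Math.log10(Math.exp(x)));
    }
}
```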
The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag()
Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.
* fixed queue script plot file names
* updated the ReadGroupCovariate to use the platform unit instead of sample + lane.
* fixed plotting of marginalized reported qualities
* updated BQSR queue script for faster turnaround
* implemented plot generation for scatter/gathered runs
* adjusted output file names to be cooperative with the queue script
* added the recalibration report file to the argument table in the report
* added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read
* added RecalibrationReport unit test -- guarantees the integrity of the delta tables
* fixed context covariate famous "off by one" error
* reduced maximum quality score to Q50 (following Eric/Ryan's suggestion)
* remove context downsampling in BQSR R script
This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
* Refactored the CycleCovariateUnitTest to test the pairing information
* Updated BQSR Integration tests accordingly
* Made quantization levels parameter not hidden anymore
* Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
* Added hidden option not to generate the plots automatically (important for scatter/gathering)
The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration. For 1000G calling this was prohibitive in terms of memory requirements. Now we go through the rod system and pull in just the records we need at a given position.
As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).
* removed low quality bases from the recalibration report.
* refactored the Datum (Recal and Accuracy) class structure
* created a new plotting csv table for optimized performance with the R script
* added a datum object that carries the accuracy information (AccuracyDatum) for plotting
* added mean reported quality score to all covariates
* added QualityScore as a covariate for plotting purposes
* added unit test to the key manager to operate with one required covariate and multiple optional covariates
* integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)
- Ported from Jython to Java, so it is now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
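To illustrate why array keys help GATKReportTable when a sample name contains a period, here is a small sketch. It assumes the old scheme joined key parts with '.' and split them again on lookup; the log does not show the actual implementation, and ReportKeySketch is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;

final class ReportKeySketch {
    /** Assumed old-style key handling: join the parts with '.', then split to recover them. */
    static List<String> roundTripDotted(String... parts) {
        String joined = String.join(".", parts);
        return Arrays.asList(joined.split("\\."));
    }

    /** Array-style key handling: the parts stay separate, so embedded periods are harmless. */
    static List<String> asArrayKey(String... parts) {
        return Arrays.asList(parts.clone());
    }

    public static void main(String[] args) {
        System.out.println(roundTripDotted("NA12878.1", "lane3"));
        System.out.println(asArrayKey("NA12878.1", "lane3"));
    }
}
```

With the dotted form, the two-part key comes back as three parts, so the sample name can no longer be recovered unambiguously.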
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
-- Now properly includes both bi and multi-allelic variants. These are actually counted as well, and emitted as counts and % of sites with multiple alleles
-- Bug fix for gold standard rate
-- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now.
-- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer
-- Better printing of stack traces in WalkerTest
-- Unfortunately the result of the multi-threaded test is non-deterministic, so run the test 10 times to see if the right exception is always thrown
-- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs
Adding support for active-region-based annotation for most standard annotations. I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets), e.g. the ReadPosRankSumTest.
IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller). The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system. When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.
* Fixed output format to get a valid vcf
* Optimized the per-sample pileup routine: O(n^2) => O(n) pileup for samples
* Added support for overlapping intervals
* Removed expand target functionality (for now)
* Removed total depth (pointless metric)
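One way a per-sample pileup can go from O(n^2) to O(n) is to bucket the elements by sample in a single pass rather than rescanning the whole pileup for each sample. The sketch below assumes that is the shape of the optimization, which the log does not confirm; Element and readsBySample are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PerSamplePileupSketch {
    /** Hypothetical pileup element: the sample it belongs to and a read name. */
    static final class Element {
        final String sample;
        final String readName;

        Element(String sample, String readName) {
            this.sample = sample;
            this.readName = readName;
        }
    }

    /** Buckets elements by sample in one pass over the pileup. */
    static Map<String, List<String>> readsBySample(List<Element> pileup) {
        Map<String, List<String>> bySample = new LinkedHashMap<>();
        for (Element e : pileup) {
            bySample.computeIfAbsent(e.sample, k -> new ArrayList<>()).add(e.readName);
        }
        return bySample;
    }

    public static void main(String[] args) {
        List<Element> pileup = new ArrayList<>();
        pileup.add(new Element("s1", "r1"));
        pileup.add(new Element("s2", "r2"));
        pileup.add(new Element("s1", "r3"));
        System.out.println(readsBySample(pileup));
    }
}
```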
-- SamFileReader.java:525
-- BlockCompressedInputStream:376
These were both instances where we weren't catching and rethrowing picard exceptions as UserExceptions.
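The catch-and-rethrow pattern looks roughly like the sketch below. LibraryException and UserFacingException are hypothetical stand-ins for the picard exception and the UserException types, and wrapLibraryErrors is an illustrative helper, not an existing method.

```java
import java.util.function.Supplier;

final class UserExceptionSketch {
    /** Hypothetical stand-in for a low-level library failure. */
    static final class LibraryException extends RuntimeException {
        LibraryException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for an error that should be reported to the user. */
    static final class UserFacingException extends RuntimeException {
        UserFacingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs the action and converts library failures into user-facing errors, keeping the cause. */
    static <T> T wrapLibraryErrors(String source, Supplier<T> action) {
        try {
            return action.get();
        } catch (LibraryException e) {
            throw new UserFacingException("Problem reading " + source + ": " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrapLibraryErrors("reads.bam", () -> "ok"));
    }
}
```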
- refactored the statistics classes
- concurrent callable statuses by sample are now available.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* Added parameter -qq to quantize qualities using a recalibration report
* Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
* Updated BQSR scripts to make use of the new parameters
-- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites
-- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly:
// - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a
// downstream frameshift, if we make the simplifying assumptions that 3 bp ins
// and 3bp del (adding/subtracting 1 AA in general) are roughly comparably
// selected against, we should see a consistent 1+2 : 3 bp ratio for insertions
// as for deletions, and certainly would expect consistency between in/dels that
// multiple methods find and in/dels that are unique to one method (since deletions
// are more common and the artifacts differ, it is probably worth looking at the totals,
// overlaps and ratios for insertions and deletions separately in the methods
// comparison and in this case don't even need to make the simplifying in = del functional assumption
-- Added a new VEW argument to bind a gold standard track
-- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do
-- Deleted random unused functions in IndelUtils
Returns true iff VC is a non-complex indel where every allele represents an expansion or
contraction of a series of identical bases in the reference.
The logic of this function is pretty simple. Take all of the non-null alleles in VC. For
each insertion allele of n bases, check if that allele matches the next n reference bases.
For each deletion allele of n bases, check if this matches the reference bases at n to 2n,
as it must necessarily match the first n bases. If this test returns true for all
alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the
number of bases by which the ref and alt alleles differ.
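The tandem-repeat test described above can be sketched on plain strings. This is a simplified illustration under stated assumptions: refFollowing is taken to be the reference bases starting at the first inserted or deleted position, alleles are reduced to their inserted bases or deleted length, and none of the method names below come from the actual implementation.

```java
import java.util.Arrays;
import java.util.List;

final class TandemRepeatSketch {
    /** An insertion of n bases is a repeat expansion if it matches the next n reference bases. */
    static boolean insertionIsRepeat(String inserted, String refFollowing) {
        return !inserted.isEmpty() && refFollowing.startsWith(inserted);
    }

    /** A deletion of n bases is a repeat contraction if reference bases [n, 2n) equal bases [0, n). */
    static boolean deletionIsRepeat(int n, String refFollowing) {
        if (n <= 0 || refFollowing.length() < 2 * n) {
            return false;
        }
        return refFollowing.regionMatches(n, refFollowing, 0, n);
    }

    /** True only if there is at least one allele and every allele passes its check. */
    static boolean isTandemRepeat(List<String> insertedAlleles, List<Integer> deletedLengths, String refFollowing) {
        if (insertedAlleles.isEmpty() && deletedLengths.isEmpty()) {
            return false;
        }
        for (String inserted : insertedAlleles) {
            if (!insertionIsRepeat(inserted, refFollowing)) {
                return false;
            }
        }
        for (int n : deletedLengths) {
            if (!deletionIsRepeat(n, refFollowing)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isTandemRepeat(Arrays.asList("CA"), Arrays.asList(2), "CACAGT"));
    }
}
```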
* fixed the loading of the new reduced size reports
* reduced BQSR scala script memory to 2Gb
* removed dcov parameter from BQSR scala script
* fixed estimatedQReported calculation from -log10(pe) to -10*log10(pe).
* updated md5's with the proper PHRED scaled EstimatedQReported
* fixed bug where some keys were using the same recal datum objects
* fixed quantization qual calculations when combining multiple reports
* fixed rounding error with empirical quality reported when combining reports
* fixed combine routine in the gatk reports due to the primary keys being out of order
* added auto-recalibration option to BQSR scala script
* reduced the size of the recalibration report by ~15%
* updated md5's
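The estimatedQReported fix in this list is the standard Phred scaling, Q = -10 * log10(pe); dropping the factor of 10 gives values ten times too small. A minimal sketch (PhredSketch is a hypothetical helper, not the BQSR code):

```java
final class PhredSketch {
    /** Phred-scaled quality for an error probability pe: Q = -10 * log10(pe). */
    static double errorProbToPhred(double pe) {
        return -10.0 * Math.log10(pe);
    }

    public static void main(String[] args) {
        // An error probability of 1 in 1000 corresponds to roughly Q30.
        System.out.println(errorProbToPhred(0.001));
    }
}
```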
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integrationtests
*** WAY FASTER ***
-- 3x performance for multiple sample analysis with 1000 samples
-- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
-- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2
-- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables as @DataPoints. You get an error if you do.
-- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
-- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable HashMaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation. The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals
-- Renamed and reorganized infrastructure
-- StratificationManager is now a Map from List<Object> -> V. All key functions are implemented. Less commonly used functions are still TODO
-- Ready for hookup to VE
-- Created a unit-tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
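The idea behind mapping every combination of stratification states to an offset in a vector can be sketched with mixed-radix arithmetic. Note that this swaps in plain index arithmetic for the tree structure mentioned above, and StratificationIndexSketch is a hypothetical class, not StratificationStates itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StratificationIndexSketch {
    private final List<List<String>> statesPerStratifier;

    StratificationIndexSketch(List<List<String>> statesPerStratifier) {
        this.statesPerStratifier = new ArrayList<>(statesPerStratifier);
    }

    /** Total number of state combinations, i.e. the length of the vector of contexts. */
    int size() {
        int total = 1;
        for (List<String> options : statesPerStratifier) {
            total *= options.size();
        }
        return total;
    }

    /** Maps one state per stratifier to a unique offset in [0, size()). */
    int indexOf(List<String> states) {
        if (states.size() != statesPerStratifier.size()) {
            throw new IllegalArgumentException("Expected one state per stratifier");
        }
        int index = 0;
        for (int i = 0; i < statesPerStratifier.size(); i++) {
            List<String> options = statesPerStratifier.get(i);
            int pos = options.indexOf(states.get(i));
            if (pos < 0) {
                throw new IllegalArgumentException("Unknown state: " + states.get(i));
            }
            // Treat each stratifier as one digit of a mixed-radix number.
            index = index * options.size() + pos;
        }
        return index;
    }

    public static void main(String[] args) {
        StratificationIndexSketch sketch = new StratificationIndexSketch(Arrays.asList(
            Arrays.asList("all", "snp"),
            Arrays.asList("known", "novel", "other")));
        System.out.println(sketch.size() + " " + sketch.indexOf(Arrays.asList("snp", "other")));
    }
}
```

Because the mapping is fixed once the stratifiers are known, the contexts can live in a flat array and be updated by offset.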