Commit Graph

2859 Commits (5a4e2a5fa4d7ee7c6d7773d261eebc8a3ff349f1)

Author SHA1 Message Date
Mark DePristo 757e6a0160 Making Pileup thread-safe
-- Old version relied on out printstream magically sorting output, new version puts the print in reduce
2012-09-05 15:45:23 -04:00
Mark DePristo d7105223fe More debugging output for NanoScheduler when debugging is enabled 2012-09-05 15:45:23 -04:00
Mark DePristo 9823102c0c TraverseReadsNano supports walker.filter and walker.done
-- Instead of returning directly the result of map(), returns a MapResult object with the value and a reduceMe flag.
-- Reduce function respects the reduceMe flag
-- Code cleanup and more documentation
2012-09-05 15:45:23 -04:00
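The MapResult idea in the commit above can be sketched roughly as follows. This is an illustrative sketch only, not the GATK implementation: the class name MapResultSketch, the even/odd stand-in for walker.filter, and the integer sum are all assumptions made for the example; only the value-plus-reduceMe pairing comes from the commit message.

```java
import java.util.List;

// Hypothetical sketch; the real MapResult in TraverseReadsNano may look different.
final class MapResultSketch {
    /** Pairs the value returned by map() with a flag saying whether reduce should consume it. */
    static final class MapResult<T> {
        final T value;
        final boolean reduceMe;

        MapResult(T value, boolean reduceMe) {
            this.value = value;
            this.reduceMe = reduceMe;
        }
    }

    /** Stand-in for walker.filter plus walker.map: odd inputs are treated as filtered out. */
    static MapResult<Integer> map(int read) {
        boolean passesFilter = read % 2 == 0;
        if (passesFilter) {
            return new MapResult<Integer>(read * 10, true);
        }
        return new MapResult<Integer>(null, false);
    }

    /** Reduce respects the flag: a filtered result leaves the running sum unchanged. */
    static int reduce(MapResult<Integer> result, int sum) {
        return result.reduceMe ? sum + result.value : sum;
    }

    /** Runs map then reduce over every input, in order. */
    static int traverse(List<Integer> reads) {
        int sum = 0;
        for (int read : reads) {
            sum = reduce(map(read), sum);
        }
        return sum;
    }
}
```

Carrying the flag alongside the value lets the map step stay side-effect free, so filtering decisions can be made on any thread and applied later in reduce.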
Mark DePristo 1a8f5fc374 Trivial cleanup of NanoScheduler 2012-09-05 15:45:23 -04:00
Mark DePristo 6a5a70cdf1 Done GSA-539: SimpleTimer should use System.nanoTime for nanoSecond resolution 2012-09-05 15:45:23 -04:00
Mark DePristo 59109d5eeb NanoScheduler tracks time outside of its execute call 2012-09-05 15:45:23 -04:00
Mark DePristo 800a27c3a7 NanoScheduler tracks time within input, map, and reduce
-- Helpful for understanding where the time goes to each bit of the code.
-- Controlled by a local static boolean, to avoid the potential overhead in general
2012-09-05 15:45:23 -04:00
Mark DePristo 7087b22ea3 No debugging output (even conditional) for ReadTransformers in PrintReads 2012-09-05 15:45:23 -04:00
Mark DePristo e01258b261 NanoScheduler now supports printProgress. Bugfixes to printProgress
-- TraverseReadsNano prints progress at the end of each traversal unit
-- Fix bugs in TraversalEngine printProgress
    -- Synchronize the method so we don't get multiple logged outputs when two or more HMSs call printProgress before initialization at the start!
    -- Fix the logic for mustPrint, which actually had the logic of mustNotPrint.  Now we see the done log line that was always supposed to be there
    -- Fix output formatting, as the done() line was incorrectly shifting over the % complete by 1 char as 100.0% didn't fit in %4.1f
-- Add clearer doc on -PF argument so that people know that the performance log can be generated to standard out if one wants
2012-09-05 15:45:23 -04:00
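The formatting bug described above (100.0% not fitting in %4.1f) is easy to reproduce in isolation. The snippet below is a sketch, not the TraversalEngine code: the class name PercentFormatSketch is hypothetical, and the %5.1f alternative is just one plausible fix, not necessarily the one the commit used.

```java
import java.util.Locale;

// Hypothetical helper showing why a minimum width of 4 cannot hold "100.0".
final class PercentFormatSketch {
    /** Formats a percentage with the given pattern; Locale.ROOT keeps '.' as the decimal separator. */
    static String format(String pattern, double percent) {
        return String.format(Locale.ROOT, pattern, percent);
    }

    public static void main(String[] args) {
        // Width is a minimum, so 100.0 overflows a 4-character field and shifts the column.
        System.out.println("[" + format("%4.1f", 99.9) + "]");
        System.out.println("[" + format("%4.1f", 100.0) + "]");
        // A 5-character field keeps both values aligned.
        System.out.println("[" + format("%5.1f", 99.9) + "]");
        System.out.println("[" + format("%5.1f", 100.0) + "]");
    }
}
```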
Mark DePristo 6055101df8 NanoScheduler no longer groups inputs, each map() call is interlaced now
-- Maximizes the efficiency of the threads
-- Simplifies interface (yea!)
-- Reduces number of combinatorial tests that need to be performed
2012-09-05 15:45:22 -04:00
Mark DePristo e3b4cc02aa Done GSA-282: Unindexed traversals crash if a read goes off the end of a contig
-- Already fixed in the codebase.  Added unindexed bam and integration tests to ensure this is fine going forward.
2012-09-05 15:45:22 -04:00
Yossi Farjoun ad5fa449e7 fixed a typo in the string comment 2012-09-05 14:46:10 -04:00
Ryan Poplin 84a83fd3f3 fixing typo 2012-09-05 10:41:03 -04:00
Eric Banks fc06f39411 Fixed docs for Pileup walker 2012-09-05 09:55:34 -04:00
Christopher Hartl d795437202 - New UserExceptions added for when ReadFilters or Walkers specified on the command line are not found. When -rf xxxx cannot find the class corresponding to xxxx, all read filters are printed in a better formatted way, with links to their gatk docs.
- VariantAnnotatorEngine changed to call genotype annotations even if pileups and allele -> likelihood mappings are not present. Current genotype annotations altered to check for null pileups and null mappings.
2012-09-04 16:41:44 -04:00
Ryan Poplin 9cc1a9931b Resolving merge conflicts. 2012-09-04 10:47:38 -04:00
Ryan Poplin c9944d81ef Skip array needs to also be used in the updateDataForRead function of the delocalized BQSR. 2012-09-04 10:33:37 -04:00
Mark DePristo 0892f2b8b2 Closing GSA-287:LocusReferenceView doesn't do very well in the case where contigs land off the end of the reference
-- Confirmed that reads spanning off the end of the chromosome don't cause an exception by adding integration test for a single read that starts 7 bases from the end of chromosome 1 and spans 90 bases or so off.  Added pileup integration test to ensure this behavior continues to work
2012-09-03 20:18:56 -04:00
Mark DePristo c9ea213c9b Make BaseRecalibration thread-safe
-- In the process uncovered two strange things
    1 -- qualityScoreByFullCovariateKey was created but never used.  Seems like a cache?
    2 -- Discovered nasty bug in BaseRecalibrator: https://jira.broadinstitute.org/browse/GSA-534
2012-08-31 13:42:42 -04:00
Mark DePristo 27ddebee53 Protect PrintReads from strange state from TraverseReadsUnitTests 2012-08-31 13:42:41 -04:00
Mark DePristo e028901d54 Fixed bad contract in ReadTransformer 2012-08-31 13:42:41 -04:00
Mark DePristo cf91d894e4 Fix build problems with tests 2012-08-31 13:42:41 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, or handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR throws exceptions when run in parallel, which a subsequent commit will fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Ryan Poplin ff6ebbf3fd Resolving merge conflicts. 2012-08-31 11:25:55 -04:00
Eric Banks ac0c44720b I started to put together a set of unit tests for the PileupElement creation functionality of LocusIteratorByState and found pretty quickly that it's definitely still busted for indels. The data provider is nowhere near comprehensive yet, but I need to sit back and think about how to really test some of the functionality of LIBS. Committing what I have for now because at the very least it'll be helpful going forward (failing tests are commented out with TODO). 2012-08-30 22:49:13 -04:00
Mark DePristo 39400c56a9 Update md5s for VQSR, as VQSLOD is now a double and gets the standard double precision treatment in VCF 2012-08-30 19:41:49 -04:00
Mark DePristo 2f749b5e52 Added ThreadSafeMapReduce interface, super of TreeReducible
-- A higher level interface to declare parallelism capability of a walker.  This interface means that the walker can be multi-threaded, but doesn't necessarily support TreeReducible interface, which forces you to have a combine ReduceType operation that isn't appropriate for parallel read walkers
-- Updated ReadWalkers to implement ThreadSafeMapReduce not TreeReducible
2012-08-30 19:41:49 -04:00
Mark DePristo 544740d45d Asking for n threads should give you n threads in NanoScheduler, not n - 1 2012-08-30 19:41:49 -04:00
Mark DePristo 1212dfd2ef Reduce the number of test combinations in ReadBasedReferenceOrderedView 2012-08-30 19:41:49 -04:00
Mark DePristo 7a462399ce Fix GSA-529: Fix RODs for parallel read walkers
-- TraverseReadsNano modified to read in all input data before invoking maps, so the input to TraverseReadsNano is a MapData object holding the sam record, the ref context, and the refmetadatatracker.
-- Update ValidateRODForReads to be tree reducible, using synchronized map and explicitly sort the output map from locations -> counts in onTraversalDone
-- Expanded integration tests to test nt 1, 2, 4.
2012-08-30 19:41:49 -04:00
Mark DePristo 7d95176539 Bugfix to compareTo and equals in GenomeLoc
-- Yes, GenomeLoc.compareTo was broken.  The compareTo function only considered the contig and start position, but not the stop, when comparing genome locs.
-- Updated GenomeLoc.compareTo function to account for stop.  Updated GATK code where necessary to fix resulting problems that depended on this.
-- Added unit tests to ensure that hashcode, equals, and compareTo are all correct for GenomeLocs
2012-08-30 19:41:49 -04:00
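A comparison that accounts for the stop position, as the GenomeLoc fix above describes, might look like the following. This is a minimal sketch under assumptions: LocSketch and its three int fields are hypothetical stand-ins, and the real GenomeLoc has more state and more edge cases (unmapped locations, for example).

```java
import java.util.Objects;

// Hypothetical stand-in for a genomic interval; not the real GenomeLoc class.
final class LocSketch implements Comparable<LocSketch> {
    final int contigIndex;
    final int start;
    final int stop;

    LocSketch(int contigIndex, int start, int stop) {
        this.contigIndex = contigIndex;
        this.start = start;
        this.stop = stop;
    }

    /** Orders by contig, then start, then stop, so intervals differing only in stop are not "equal". */
    @Override
    public int compareTo(LocSketch that) {
        int cmp = Integer.compare(contigIndex, that.contigIndex);
        if (cmp != 0) return cmp;
        cmp = Integer.compare(start, that.start);
        if (cmp != 0) return cmp;
        return Integer.compare(stop, that.stop);
    }

    /** Consistent with compareTo: equal exactly when all three fields match. */
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof LocSketch)) return false;
        LocSketch that = (LocSketch) other;
        return contigIndex == that.contigIndex && start == that.start && stop == that.stop;
    }

    @Override
    public int hashCode() {
        return Objects.hash(contigIndex, start, stop);
    }
}
```

Keeping compareTo, equals, and hashCode in agreement matters for sorted collections such as TreeMap, which treat compareTo == 0 as equality.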
Mark DePristo 5a9610d875 ReadShards now default to 10K (up from 1K) reads per samFile up to 250K
-- This should help make the inputs for parallel read walkers a little meatier, and avoid spinning the shard creation infrastructure so often
2012-08-30 19:41:49 -04:00
Christopher Hartl 5a142fe265 After discussion with Ryan/Eric, the Structural_Indel variant type is now gone, and has been entirely replaced with the access pattern .isStructuralIndel(). This makes it a strict subtype of indel. I agree that this method is a bit more sensible.
In addition, fix for GSA-310. If supplied -rf argument does not match a known read filter, the list of read filters will be printed, and users directed to the documentation for more information.
2012-08-30 17:57:31 -04:00
Mark DePristo 82b2845b9f Fix: GSA-531 ApplyRecalibration writing to BCF: java.lang.String cannot be cast to java.lang.Double
-- LOD must be added as a double to attributes, not as a string, so that it can be written out as BCF
2012-08-30 16:59:57 -04:00
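The failure mode behind GSA-531 can be illustrated with a plain attribute map. This is a sketch under assumptions: LodAttributeSketch and readLod are hypothetical names, and the "VQSLOD" key is borrowed from the later md5 commit; the actual BCF writer code is not shown in this log.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader that expects the LOD attribute to be stored as a Double.
final class LodAttributeSketch {
    /** Casts the stored attribute to Double, as a typed binary writer would have to. */
    static double readLod(Map<String, Object> attributes) {
        return (Double) attributes.get("VQSLOD");
    }

    public static void main(String[] args) {
        Map<String, Object> attributes = new HashMap<>();

        attributes.put("VQSLOD", 1.5);          // stored as a Double: the cast succeeds
        System.out.println(readLod(attributes));

        attributes.put("VQSLOD", "1.5");        // stored as a String: the cast fails at runtime
        try {
            readLod(attributes);
        } catch (ClassCastException e) {
            System.out.println("String cannot be cast to Double");
        }
    }
}
```

A text format would tolerate either representation because everything is printed as a string, which is presumably why the problem only surfaced when writing BCF.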
Ryan Poplin 7b366d4049 misc cleanup in active region traversal. 2012-08-30 11:01:01 -04:00
Mark DePristo 21dd70ed36 Test to ensure that ReadBasedReferenceOrderedView produces stateless objects
-- Stateless objects are required for nano-scheduling.  This means you can take the RefMetaDataTracker provided by ReadBasedReferenceOrderedView, store it away, get another from the same view, and the original one behaves the same.
2012-08-30 10:15:11 -04:00
Mark DePristo ce3d1f89ea ReadShard are no longer allowed to span multiple contigs
-- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads.  The old implementation simply failed in this case.  The new code handles this correctly by forcing shards to have all of their data on a single contig.
-- Added a PrintReads integration test to ensure this behavior is correct
-- Adding test BAMs that have < 200 reads and span across contig boundaries
2012-08-30 10:15:11 -04:00
Mark DePristo 53376b9423 Part III of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- shardSpan is only calculated when some ROD is live in the GATK.  No sense in paying the cost per read when you don't need it
-- Update contract to allow null span or unmapped span (good catch unittests!)
2012-08-30 10:15:10 -04:00
Mark DePristo 1200848bbf Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
2012-08-30 10:15:10 -04:00
Mark DePristo 972be8b4a4 Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- ReadMetaDataTracker is dead!  Long live the RefMetaDataTracker.  Read walkers will soon just take RefMetaDataTracker objects.  In this commit they take a class that trivially extends them
-- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class.
    -- This new implementation produces thread-safe objects (i.e., holds no pointers to shared state).  Suitable for use (to be tested) with nano scheduling
    -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely.
-- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView
-- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference.  Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface.  If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself.
    -- This commit breaks the new read walker BQSR, but Ryan knows this is coming
-- Subsequent commit will be retiring / fixing ValidateRODForReads
2012-08-30 10:15:10 -04:00
Mark DePristo 8fc6a0a68b Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker 2012-08-30 10:13:06 -04:00
Ryan Poplin b85ded8389 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-30 10:11:48 -04:00
Ryan Poplin 57d997f06f Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start 2012-08-30 10:10:43 -04:00
Ryan Poplin f9bab37015 Merged bug fix from Stable into Unstable 2012-08-30 09:21:24 -04:00
Ryan Poplin eb63221875 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-08-30 09:19:35 -04:00
Ryan Poplin 81d5eca975 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-30 09:10:56 -04:00
Ryan Poplin 35baf0b155 This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test. 2012-08-30 09:07:36 -04:00
Eric Banks 1acf0f0b2c Fixing bug in fasta .fai generation: trim the contig names to the first whitespace if one appears. We now generate indexes identical to samtools. 2012-08-29 22:36:27 -04:00
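Trimming a contig name to the first whitespace, as the .fai fix above describes, can be sketched like this. The class and method names are hypothetical and the example header line is made up; the real index generation code is not part of this log.

```java
// Hypothetical helper; shows only the name-trimming rule, not full .fai generation.
final class ContigNameSketch {
    /** Returns the contig name from a FASTA header: everything after '>' up to the first whitespace. */
    static String trimToFirstWhitespace(String headerLine) {
        String name = headerLine.startsWith(">") ? headerLine.substring(1) : headerLine;
        for (int i = 0; i < name.length(); i++) {
            if (Character.isWhitespace(name.charAt(i))) {
                return name.substring(0, i);
            }
        }
        return name; // no whitespace: the whole header is the name
    }
}
```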
Eric Banks 4d38befe86 Merged bug fix from Stable into Unstable 2012-08-29 15:13:56 -04:00
Eric Banks 150a969279 Be careful with String manipulation when constructing alleles in SomaticIndelDetector 2012-08-29 15:13:28 -04:00
Eric Banks ce55ba98f4 Don't try to left align indels in unmapped reads (which for some reason can still have CIGARs) because the ref context is null. 2012-08-29 15:01:11 -04:00
Ryan Poplin 4ea38bbfe8 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-29 11:39:30 -04:00
Mauricio Carneiro 69b56e11c8 ReadClipper won't modify the original read
Reverting back to the original implementation, but now including write N's and write Q0's due to walkers that look at the same read multiple times in different reference windows
2012-08-29 11:33:19 -04:00
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 6d6ca090c6 RecalDatums now hold doubles so the test for equality needs an epsilon. 2012-08-28 16:00:52 -04:00
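Why doubles need an epsilon when tested for equality can be shown in a few lines. This is a sketch only: the class name and the tolerance of 1e-6 are assumptions, and the actual epsilon used in the RecalDatum tests is not given in the commit message.

```java
// Hypothetical helper; the tolerance below is an assumed value, not the one used in the GATK tests.
final class EpsilonEqualsSketch {
    static final double EPSILON = 1e-6;

    /** Treats two doubles as equal when they differ by no more than EPSILON. */
    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) <= EPSILON;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        System.out.println(sum == 0.3);             // exact comparison is defeated by rounding error
        System.out.println(nearlyEqual(sum, 0.3));  // tolerant comparison absorbs it
    }
}
```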
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Eric Banks e74c527d47 Register the deprecated walkers as deprecated starting in v2.2 so that users get a helpful error message 2012-08-28 10:19:18 -04:00
Eric Banks 67d348a31d Retiring the alignment walkers and related integration test since we don't want to support them anymore. 2012-08-28 10:16:49 -04:00
Mark DePristo 0f4acaae1b Update MD5s with new FS score 2012-08-28 08:06:47 -04:00
Mark DePristo 2996693c9f FisherStrand now computed with and without filtering low-qual bases, and least significant pvalue is kept
-- Old way (filtering for Q > 17 bases) resulted in biased FS when the site was good but there was a
systematic shift in the QUAL of REF and ALT between strands of the reads (sometimes happens)
-- New way (taking all bases) was consistent with BaseQualRankSum and other tests, but there can be
a lot of low qual reference bases on one strand in some techs (ION/PROTON/PACBIO) because of the
preference for introducing an indel vs. a mismatch.
-- This implementation allows us to have our cake and eat it too by computing both p-values, and
taking the maximum one (i.e., least significant).
-- No integration tests updated yet -- still exploring the consequences of this change
2012-08-28 08:06:47 -04:00
Eric Banks bedcdbdc5f Fixing merge conflict 2012-08-27 12:16:51 -04:00
Eric Banks 3d476487c6 LIBS is totally busted for deletions. Putting a check in AD for bad pileup event bases so that we don't produce busted alleles. We must fix LIBS ASAP. 2012-08-27 12:13:12 -04:00
Mark DePristo 63a9ae817a Ensure thread-safety of CachingIndexedFastaSequenceFile
-- Cosmetic cleanup of ReadReferenceView
-- TraverseReadsNano provides the reference context, since it's thread-safe
-- Cleanup CachingIndexedFastaSequenceFile.  Add docs, remove unnecessary setters
-- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.
2012-08-27 12:11:54 -04:00
Mark DePristo e5b1f1c7f4 Add simple main function to unit test so we can run the nano scheduler test from the command line 2012-08-27 12:11:54 -04:00
Khalid Shakir 2d1ea7124b One less Queue command line requirement: -tempDir now defaults to .queue/tmp.
Also moved queueScatterGather to .queue/scatterGather.
2012-08-27 12:04:50 -04:00
Mark DePristo 68c5142d2d numThreads > 1 any time you have -nt > 1 silly 2012-08-26 14:36:13 -04:00
Mark DePristo faacacd6c0 Increase runtime of nano scheduler tests to 1 min 2012-08-26 08:42:58 -04:00
Mark DePristo 846e0c11bc Add TimeOuts to new threading tests, in case there's an underlying deadlock 2012-08-26 08:18:43 -04:00
Mark DePristo fde9824765 Optimizations for parallel read walkers
-- TraverseReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone
-- Nicer debugging output in NanoScheduler
-- ReadShard has a getBufferSize() method now
2012-08-25 17:21:12 -04:00
Mark DePristo 5066b14335 Parallel FlagStat 2012-08-25 17:21:12 -04:00
Mark DePristo af540888f1 Limited version of parallel read walkers
-- Currently doesn't support accessing reference or ROD data
-- Parallel versions of PrintReads and CountReads
2012-08-25 17:21:12 -04:00
Mark DePristo e060b148e2 Minor cleanup of TraverseReads 2012-08-25 17:21:11 -04:00
Mark DePristo 275a5e5439 More tests for NanoScheduler
-- Add more contracts
-- Test in the UnitTest that the reduce is being called in the correct order
2012-08-25 17:21:11 -04:00
Christopher Hartl 6db0988898 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-25 15:40:32 -04:00
Christopher Hartl db2e88c7cb Fix for badIndelLength() throwing NPE at non-indel sites. Added integration test. 2012-08-25 12:38:23 -07:00
Mark DePristo 59b5913b54 Merged bug fix from Stable into Unstable 2012-08-25 14:53:22 -04:00
Mark DePristo dcc972a557 Usability cleanup for BQSR
-- I'm seeing a lot of people trying to use BinaryTagCovariate in the community.  They really shouldn't do this, so I moved it to private.
-- Throw an exception if its required bintag argument is missing
-- Check explicitly if user is requesting DinucCovariate and tell them that it's been retired in favor of ContextCovariate
-- Show the type (Required, Experimental, Standard) of the covariates when running --list
2012-08-25 14:53:00 -04:00
Christopher Hartl b59948709f Code improvements re: JIRA GSA-510. Trio class migrated into the Samples package - because the trio structure is so ubiquitously used, it makes sense, I think, to have a class which imposes the structure on the samples. Existing functions which slightly duplicated the getTrios() method look like they have bugs. These functions are now deprecated.
A number of functions in the sampleDB looked to be assuming that samples could not share IDs (e.g. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption.

MVLikelihoodRatio now uses the trio methods from SampleDB.
2012-08-25 08:48:27 -07:00
Mark DePristo 0996bbd548 Comments for Chris on cleanup 2012-08-24 16:04:58 -04:00
Mark DePristo 649b82ce85 Merge branch 'nanoScheduler'
Conflicts:
	private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala
2012-08-24 15:59:36 -04:00
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
Christopher Hartl 752f44c332 Code cleanup in MVLR and SelectVariants. Should fix JIRA GSA-509 and GSA-510 2012-08-24 12:25:11 -07:00
Mark DePristo d6e6b30caf Initial implementation of GSA-515: Nanoscheduler
– Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficient at using resources to do sum of 2 * (sum(1 - X)).

Done!

CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of subsequent available work units. May be worth measuring expected cost for each read / map / reduce unit and using it to balance the compute
As input is single threaded just need one thread to populate inputs, which runs as fast as possible in parallel, pushing data to a fixed-size queue. Each push creates map job and links to upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
2012-08-24 14:07:44 -04:00
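The test case named in the design notes above (read integers from an iterator, map is int * 2, reduce is sum) can be sketched with a plain thread pool. This is not the NanoScheduler: the class and method names are hypothetical, and the sketch collects all pending map jobs in an unbounded list rather than the fixed-size queue the notes call for. What it does share with the notes is a single thread feeding inputs, parallel map jobs, and an ordered reduce.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical sketch of a read / map / reduce loop; not the GATK NanoScheduler.
final class NanoMapReduceSketch {
    /** Reads inputs on the calling thread, maps them on a pool, reduces in input order. */
    static <I, M> M execute(Iterator<I> input, Function<I, M> map,
                            BinaryOperator<M> reduce, M initial, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            // Single-threaded input: each item read becomes one map job.
            List<Future<M>> pending = new ArrayList<>();
            while (input.hasNext()) {
                final I item = input.next();
                Callable<M> mapJob = () -> map.apply(item);
                pending.add(pool.submit(mapJob));
            }
            // Reduce walks the futures in submission order, so results are combined in input order.
            M result = initial;
            for (Future<M> job : pending) {
                result = reduce.apply(result, job.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a map job", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("map job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** The test from the notes: sum of 2 * i for i in 1..x, which equals x * (x + 1). */
    static long doubledSum(int x, int nThreads) {
        List<Integer> inputs = new ArrayList<>();
        for (int n = 1; n <= x; n++) {
            inputs.add(n);
        }
        return execute(inputs.iterator(), i -> 2L * i, Long::sum, 0L, nThreads);
    }
}
```

Because the expected answer has a closed form, the same check works for any thread count, which makes it a convenient way to test parallelism independently of the engine.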
Eric Banks 0545664f91 Fix ClassCastException seen in Tableau errors 2012-08-24 13:45:48 -04:00
Eric Banks 740520c23b Fix BQSR docs 2012-08-24 13:20:10 -04:00
Ryan Poplin 5f8574bd15 Fixing typo in error message. 2012-08-24 10:48:41 -04:00
Mark DePristo 1999b95754 Work around for GSA-513: ClassCastException in VariantEval 2012-08-23 18:14:49 -04:00
Christopher Hartl f1166d6d00 Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case. 2012-08-23 11:43:19 -07:00
Mark DePristo 857b11b26f Done with GSA-506: Add nt and efficiency information to GATKRunReport
-- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO
-- See https://jira.broadinstitute.org/browse/GSA-506 for more details
2012-08-23 09:59:53 -04:00
Mark DePristo 0b735884db Cleanup code in VariantContext 2012-08-23 09:59:53 -04:00
Eric Banks e5df91aa23 Looks like the @WalkerName annotation doesn't work with the GATK docs, so I'm renaming the walkers. 2012-08-22 20:17:39 -04:00
Mark DePristo 63af0cbcba Cleanup GATK efficiency monitor classes
-- Invert logic in GATKArgumentCollection to disable monitoring, not enable.  That means monitoring is on by default
-- Fix testing error in unit tests
-- Rename variables in ThreadAllocation to be clearer
2012-08-22 16:48:02 -04:00
Mark DePristo e1293f0ef2 GSA-507: Thread monitoring refactored so it can work without a thread factory
-- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory.
-- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode
-- MicroScheduler now handles management of the efficiency monitor.  Includes master thread in monitor, meaning that reduce is now included for both schedulers
2012-08-22 16:48:01 -04:00
Mark DePristo f876c51277 Separately track time spent doing user and system CPU work
-- Allows us to ID (by proxy) time spent doing IO
-- Refactor StateMonitoringThreadFactory to use its own enum, not Thread.State
-- Reliable unit tests across mac and unix
2012-08-22 16:48:01 -04:00
Mark DePristo 18060f237b Add thread efficiency monitoring to GATK HMS
-- See https://jira.broadinstitute.org/browse/GSA-502
-- New command line argument -mt enables thread monitoring
-- If enabled, HMS uses StateMonitoringThreadFactory to create monitored threads, and prints out an efficiency report when HMS exits, telling the user information like:

for BQSR – known to be inefficient locking
INFO 17:10:33,195 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:10:33,196 StateMonitoringThreadFactory - Total runtime 90.3 m
INFO 17:10:33,196 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.72 ( 64.8 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent running is 0.26 ( 23.7 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.02 ( 112.8 s)
INFO 17:10:33,197 StateMonitoringThreadFactory - Efficiency of multi-threading: 26.19% of time spent doing productive work

for CountLoci
INFO 17:06:12,777 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:06:12,777 StateMonitoringThreadFactory - Total runtime 43.5 m
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.00 ( 4.2 s)
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent running is 1.00 ( 43.3 m)
INFO 17:06:12,779 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.00 ( 6.0 s)
INFO 17:06:12,779 StateMonitoringThreadFactory - Efficiency of multi-threading: 99.61% of time spent doing productive work
2012-08-22 16:48:01 -04:00
Guillermo del Angel 1aa856e0e3 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 15:53:47 -04:00
Guillermo del Angel e29469eeeb Forgot to update 2 integration test md5's (in this case, changes are legit because of the code revamp of AD, it's simpler if AD is not output when a site is not variant, as genotype DP conveys the same information) 2012-08-22 15:53:33 -04:00
Ryan Poplin fe3069b278 Merged bug fix from Stable into Unstable 2012-08-22 14:40:34 -04:00
Ryan Poplin e5cfdb4811 Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt() 2012-08-22 14:39:35 -04:00
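The uppercasing fix above can be illustrated on a raw byte array of bases. This is a sketch, not the GATK or Picard code: the class and method names are hypothetical, and the example only handles ASCII letters, which is what reference bases are assumed to be here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; uppercases ASCII bases so that 'c' and 'C' compare as the same base.
final class UppercaseBasesSketch {
    /** Returns a copy of the bases with ASCII lowercase letters converted to uppercase. */
    static byte[] toUpperCase(byte[] bases) {
        byte[] upper = bases.clone();
        for (int i = 0; i < upper.length; i++) {
            if (upper[i] >= 'a' && upper[i] <= 'z') {
                upper[i] = (byte) (upper[i] - ('a' - 'A'));
            }
        }
        return upper;
    }

    public static void main(String[] args) {
        byte[] ref = "acgtNACGT".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(toUpperCase(ref), StandardCharsets.US_ASCII));
    }
}
```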
Ryan Poplin 63213e8eb5 Expanding the HaplotypeCaller integration tests to cover a wider range of data 2012-08-22 14:18:44 -04:00
Eric Banks 944e1c299d Docs for --keepOriginalAC were wrong in SelectVariants 2012-08-22 13:07:13 -04:00
Eric Banks 2409aa9bfd Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 12:54:43 -04:00
Eric Banks 94540ccc27 Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case. 2012-08-22 12:54:29 -04:00
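The fix above (start from an immutable Collections.emptyMap() instead of null) might be sketched as follows. The builder class here is a hypothetical stand-in, not the real VariantContextBuilder, and the copy-on-first-write step is an assumption about how later modifications are made safe.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical builder; shows the "never null" attributes pattern, not the GATK class.
final class AttributeBuilderSketch {
    // Shared immutable empty map: no allocation for builders that never set an attribute.
    private Map<String, Object> attributes = Collections.emptyMap();
    private boolean attributesAreMutable = false;

    /** Adds an attribute, swapping in a private mutable map on the first write. */
    AttributeBuilderSketch attribute(String key, Object value) {
        if (!attributesAreMutable) {
            attributes = new HashMap<>(attributes);
            attributesAreMutable = true;
        }
        attributes.put(key, value);
        return this;
    }

    /** Never returns null, so callers can read attributes without a null check. */
    Map<String, Object> getAttributes() {
        return Collections.unmodifiableMap(attributes);
    }
}
```

Since Collections.emptyMap() returns a shared instance, the default costs nothing per builder, which matches the commit's claim that there is no performance hit.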
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00
Guillermo del Angel 7df0abf49b Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 11:36:41 -04:00
Eric Banks 9e76e8aa0b Just noticed that the efficient conversion to uppercase method is redundant since it's already implemented efficiently in Picard; let's just have a single implementation. 2012-08-22 11:26:08 -04:00
Christopher Hartl 20601f034e Updating the checkType() function to include the new StructuralIndel variant type. Fixes outstanding broken integration test. 2012-08-22 07:33:10 -07:00
Eric Banks c7ce3e1cf5 Merged bug fix from Stable into Unstable 2012-08-22 00:24:40 -04:00
Eric Banks 03017855e4 WTF - why is support for whole-read insertions all messed up in LIBS? I've pushed a temporary patch for now (the right solution should certainly not be implemented in stable; LIBS needs to be better thought out). Added another unit test. 2012-08-22 00:24:01 -04:00
Mark DePristo 6ce8016ae7 GSA-491: Add hidden tag to GATK that propagates to the GATK logs 2012-08-21 14:44:18 -04:00
Guillermo del Angel 6a8cf1c84a Enable and adapt HaplotypeScore and MappingQualityZero as active region annotations now that we have per-read likelihoods passed in to annotations 2012-08-21 14:35:40 -04:00
Guillermo del Angel d0644b3565 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:35:23 -04:00
Ryan Poplin 94e7f677ad Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:21:47 -04:00
Guillermo del Angel 418ace463a More merge conflict resolution 2012-08-21 10:15:52 -04:00
Ryan Poplin 10961db3ce Another round of FindBugs fixes. Object returns its internal reference to an externally mutable array. Very dangerous. 2012-08-21 09:35:55 -04:00
Ryan Poplin 605acaae9c Another round of FindBugs fixes. Object internally stores a reference to an externally mutable array. Very dangerous. 2012-08-21 09:33:58 -04:00
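The two FindBugs warnings above (returning an internal array, and storing a caller's array) are usually addressed with defensive copies. The sketch below shows that general technique; the class name and the byte[] field are hypothetical, and the commits do not say whether copying was the fix actually applied.

```java
// Hypothetical holder class; demonstrates defensive copying on the way in and on the way out.
final class QualsSketch {
    private final byte[] quals;

    /** Copies the caller's array so later changes to it cannot alter this object. */
    QualsSketch(byte[] quals) {
        this.quals = quals.clone();
    }

    /** Returns a copy so callers cannot modify the internal array through the result. */
    byte[] getQuals() {
        return quals.clone();
    }
}
```

The copies cost an allocation per call, so on hot paths a read-only view or a documented no-modify contract may be preferred; that trade-off is outside what the commit messages cover.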
Ryan Poplin 55b7949d68 Another round of FindBugs fixes. Comparator doesn't implement Serializable. 2012-08-21 09:20:55 -04:00
Christopher Hartl ba8622ff0d A number of stashed changes are lurking in here. In order of importance:
- Fix for M_Trieb's error report on the forum, and addition of integration tests to cover the walker.
 - Addition of StructuralIndel as a class of variation within the VariantContext. These are for variants with a full alt allele that's >150bp in length.
 - Adaptation of the MVLikelihoodRatio to work for a set of trios (takes the max over the trios of the MVLR)
 - InsertSizeDistribution changed to use the new gatk report output (it was previously broken)
 - RetrogeneDiscovery changed to be compatible with the new gatk report
 - A maxIndelSize argument added to SelectVariants
 - ByTranscriptEvaluator rewritten for cleanliness
 - VariantRecalibrator modified to not exclude structural indels from recalibration if the mode is INDEL
 - Documentation added to DepthOfCoverageIntegrationTest (no, don't yell at chartl ;_; )

Also sorry for the long commit history behind this that is the result of fixing merge conflicts. Because this *also* fixes a conflict (from git stash apply), for some reason I can't rebase all of them away. I'm pretty sure some of the commit notes say "this note isn't important because I'm going to rebase it anyway".
2012-08-21 07:08:58 -04:00
Eric Banks 3dfe8df262 Merged bug fix from Stable into Unstable 2012-08-20 23:12:58 -04:00
Eric Banks 40d5efc804 Fix for Adam K's reported bug: we weren't handling reads that were entirely insertions properly in LIBS. Specifically, the event bases were off-by-one (which was disastrous in Adam's case with a 1bp read). Added a unit test to cover this case. 2012-08-20 23:12:41 -04:00
Eric Banks 286b658fab Re-enabling parallelism in the BaseRecalibrator now that the release is out. 2012-08-20 21:25:14 -04:00
Guillermo del Angel 7bbd2a7a20 Fixing merge conflicts 2012-08-20 20:38:25 -04:00
Guillermo del Angel 2041cb853c New implementation of AD - now ignores non-informative reads based on per-read likelihoods 2012-08-20 20:31:34 -04:00
Ryan Poplin 77fbaec044 Another round of FindBugs fixes. Class implements its own compareTo() but uses base Object.equals() which can lead to unpredictable behavior. 2012-08-20 16:55:00 -04:00
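The FindBugs warning named in the commit above (compareTo() without a matching equals()) can be illustrated with a minimal sketch. The classes OrderedOnly and Consistent are hypothetical and are not taken from the GATK codebase; they only show why sorted and hashed collections disagree when the two methods are inconsistent.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; none of these classes exist in the GATK.
final class CompareToDemo {
    // Defines an ordering but inherits identity-based equals()/hashCode() from Object.
    static final class OrderedOnly implements Comparable<OrderedOnly> {
        final int start;
        OrderedOnly(int start) { this.start = start; }
        @Override public int compareTo(OrderedOnly o) { return Integer.compare(start, o.start); }
    }

    // Same ordering, but equals()/hashCode() agree with compareTo().
    static final class Consistent implements Comparable<Consistent> {
        final int start;
        Consistent(int start) { this.start = start; }
        @Override public int compareTo(Consistent o) { return Integer.compare(start, o.start); }
        @Override public boolean equals(Object o) {
            return o instanceof Consistent && ((Consistent) o).start == start;
        }
        @Override public int hashCode() { return Integer.hashCode(start); }
    }

    // Adds two "equal by ordering" objects to a TreeSet and a HashSet; returns {treeSize, hashSize}.
    static int[] sizesOrderedOnly() {
        Set<OrderedOnly> tree = new TreeSet<>();
        Set<OrderedOnly> hash = new HashSet<>();
        for (int i = 0; i < 2; i++) {
            tree.add(new OrderedOnly(5));
            hash.add(new OrderedOnly(5));
        }
        return new int[] { tree.size(), hash.size() };
    }

    static int[] sizesConsistent() {
        Set<Consistent> tree = new TreeSet<>();
        Set<Consistent> hash = new HashSet<>();
        for (int i = 0; i < 2; i++) {
            tree.add(new Consistent(5));
            hash.add(new Consistent(5));
        }
        return new int[] { tree.size(), hash.size() };
    }
}
```

With OrderedOnly the TreeSet collapses the two objects into one while the HashSet keeps both; with Consistent both collections agree.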
Ryan Poplin 5e28bca630 Another round of FindBugs fixes. Should be static inner class. 2012-08-20 16:15:48 -04:00
Ryan Poplin 5db3bd6fd2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 15:28:57 -04:00
Ryan Poplin 464d49509a Pulling out common caller arguments into its own StandardCallerArgumentCollection base class so that every caller isn't exposed to the unused arguments from every other caller. 2012-08-20 15:28:39 -04:00
Eric Banks 4450d66c64 Fixing the docs for DP and AD 2012-08-20 15:10:24 -04:00
Ryan Poplin c67d708c51 Bug fix in HaplotypeCaller for non-regular bases in the reference or reads. Those events don't get created any more. Bug fix for advanced GenotypeFullActiveRegion mode: custom variant annotations created by the HC don't make sense when in this mode so don't try to calculate them. 2012-08-20 13:41:08 -04:00
Guillermo del Angel 5b5fee56cf Next iteration of new VA interface: extend changes to per-genotype annotations as well. This will allow AD to be implemented correctly at last (that change is not done yet) 2012-08-20 12:52:15 -04:00
Eric Banks 154f65e0de Temporarily disabling multi-threaded usage of BaseRecalibrator for performance reasons. 2012-08-20 12:43:17 -04:00
Guillermo del Angel c384677917 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 10:27:25 -04:00
Eric Banks 97b191f578 Thanks to Guillermo I was able to isolate an instance of where the MLEAC > AN. It turns out that this is valid, e.g. when PLs are all 0s for a sample we no-call it but it's allowed to factor into the MLE (since that's the contract with the exact model). Removing the check in UG and instead protecting for it in the AlleleCount stratification. 2012-08-20 01:16:23 -04:00
Guillermo del Angel 963ad03f8b Second step of interface cleanup for variant annotator: several bug fixes, don't hash pileup elements to Maps because the hashCode() for a pileup element is not implemented and strange things can happen. Still several things to do, not done yet 2012-08-19 21:18:18 -04:00
Mark DePristo 7fa76f719b Print "Parsing data stream with BCF version BCFx.y" in BCF2 codec as .debug not .info 2012-08-19 10:32:55 -04:00
Mark DePristo 9121b98167 CombineVariants outputs the first non-MISSING qual, not the maximum
-- When merging multiple VCF records at a site, the combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value.  The previous behavior was to take the max QUAL, which sometimes resulted in strange downstream confusion.
2012-08-19 10:29:38 -04:00
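The difference between the old (maximum) and new (first non-MISSING) QUAL rules in the CombineVariants commit above can be sketched as follows. This is an illustration only: the class QualMerge and the use of OptionalDouble to model a MISSING QUAL are assumptions, not the CombineVariants implementation.

```java
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical sketch; an empty OptionalDouble stands in for a MISSING QUAL.
final class QualMerge {
    // New behavior: take the QUAL of the first record that has one.
    static OptionalDouble firstNonMissing(List<OptionalDouble> quals) {
        for (OptionalDouble q : quals) {
            if (q.isPresent()) return q;
        }
        return OptionalDouble.empty();
    }

    // Previous behavior: take the largest QUAL among the records that have one.
    static OptionalDouble maxQual(List<OptionalDouble> quals) {
        return quals.stream()
                .filter(OptionalDouble::isPresent)
                .mapToDouble(OptionalDouble::getAsDouble)
                .max();
    }
}
```

For records with QUALs (MISSING, 30.0, 99.0) the first rule yields 30.0 and the second yields 99.0.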
Guillermo del Angel d9641e3d57 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-19 09:23:21 -04:00
Mauricio Carneiro d16cb68539 Updated and more thorough version of the BadCigar read filter
* No reads with Hard/Soft clips in the middle of the cigar
   * No reads starting with deletions (with or without preceding clips)
   * No reads ending in deletions (with or without follow-up clips)
   * No reads that are fully hard or soft clipped
   * No reads that have consecutive indels in the cigar (II, DD, ID or DI)

 Also added systematic test for good cigars and iterative test for bad cigars.
2012-08-17 17:05:27 -04:00
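The five BadCigar rules listed in the commit above can be sketched over a plain CIGAR string. This is a minimal sketch, not the filter's actual code: the class BadCigarSketch and its methods are hypothetical, and it treats any leading or trailing run of H/S operators as clipping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the rules in the commit message; not the GATK BadCigar filter itself.
final class BadCigarSketch {
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    // Extracts the operator of each CIGAR element, rejecting malformed strings.
    static List<Character> operators(String cigar) {
        List<Character> ops = new ArrayList<>();
        Matcher m = ELEMENT.matcher(cigar);
        int end = 0;
        while (m.find()) {
            if (m.start() != end) throw new IllegalArgumentException("Malformed CIGAR: " + cigar);
            ops.add(m.group(2).charAt(0));
            end = m.end();
        }
        if (ops.isEmpty() || end != cigar.length()) {
            throw new IllegalArgumentException("Malformed CIGAR: " + cigar);
        }
        return ops;
    }

    private static boolean isClip(char op) { return op == 'H' || op == 'S'; }
    private static boolean isIndel(char op) { return op == 'I' || op == 'D'; }

    static boolean isGood(String cigar) {
        List<Character> ops = operators(cigar);
        int first = 0;
        int last = ops.size() - 1;
        // Skip the leading and trailing runs of clipping operators.
        while (first <= last && isClip(ops.get(first))) first++;
        while (last >= first && isClip(ops.get(last))) last--;
        if (first > last) return false;                               // fully hard or soft clipped
        if (ops.get(first) == 'D' || ops.get(last) == 'D') return false; // starts or ends in a deletion
        for (int i = first; i <= last; i++) {
            char op = ops.get(i);
            if (isClip(op)) return false;                             // clip in the middle of the cigar
            if (i < last && isIndel(op) && isIndel(ops.get(i + 1))) return false; // II, DD, ID or DI
        }
        return true;
    }
}
```

Under these assumptions, isGood("5S10M5S") is true, while isGood("10M5S10M"), isGood("5S2D10M"), isGood("10S") and isGood("5M2I3D5M") are all false.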
Mark DePristo 980685af16 Fix GSA-137: Having both DataSource.REFERENCE and DataSource.REFERENCE_BASES is confusing to end users.
-- Removed REFERENCE_BASES option.  You only have REFERENCE now.  There's no efficiency savings for the REFERENCE_BASES option any longer, since the reference bases are loaded lazily, so if you don't use them there's effectively no cost to making the RefContext that could load them.
2012-08-17 14:55:38 -04:00
Eric Banks 2676b7fc2e Put in a sanity check that MLEAC <= AN 2012-08-17 11:49:53 -04:00
Mark DePristo daa26cc64e Print to logger not to System.out in CachingIndexFastaSequenceFile when profiling cache performance 2012-08-17 11:49:02 -04:00
Mark DePristo be0f8beebb Fixed GSA-434: GATK should generate error when gzipped FASTA is passed in.
-- The GATK sort of handles this now, but only if you have exactly the correct sequence dictionary and FAI files associated with the reference.  If you do, the file can be .gz.  If not, the GATK will fail on creating the FAI and DICT files.  Added an error message that handles this case and clearly says what to do.
2012-08-17 11:49:02 -04:00
Mark DePristo a3d2764d11 Fixed: GSA-392 @arguments with just a short name get the wrong argument bindings
-- Now blows up if an argument begins with -.  Implementation isn't pretty, as it actually blows up during Queue extension creation with a somewhat obscure error message, but at least it's something.
2012-08-17 11:49:01 -04:00
Mark DePristo 4c0f198d48 Potential fix for GSA-484: Incomplete writing of temp BCF when running CombineVariants in parallel
-- Keep reading from BCF2 input stream when read(byte[]) returns < number of needed bytes
-- It's possible (I think) that the failure in GSA-484 is due to multi-threading writing/reading of BCF2 records where the underlying stream is not yet flushed so read(byte[]) returns a partial result.  Now loops until we get all of the needed bytes or EOF is encountered
2012-08-17 11:49:01 -04:00
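The read loop described in the GSA-484 commit above can be sketched as follows. InputStream.read(byte[], int, int) is allowed to return fewer bytes than requested, so a caller that needs a full record has to keep reading. The class ReadFullySketch and the trickle() helper are hypothetical names used for illustration; this is not the BCF2Decoder code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of looping over short reads; not the BCF2Decoder implementation.
final class ReadFullySketch {
    // Fills the buffer, looping over partial reads. Returns the number of bytes read,
    // which is smaller than buffer.length only if EOF was reached first.
    static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n == -1) break; // EOF before the buffer was filled
            total += n;
        }
        return total;
    }

    // Demo helper: a stream that hands out at most one byte per read call,
    // mimicking a source that returns partial results.
    static InputStream trickle(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }
}
```

A single in.read(buffer) on the trickle stream returns 1, while readFully returns the full buffer length, or the number of bytes available if EOF comes first.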
Mark DePristo de3be45806 Proper function call in BCF2Decoder to validateReadBytes 2012-08-17 11:49:01 -04:00
Eric Banks 53383e82ec Hmm, not good. Fixing the math in PBT resulted in changed MD5s for integration tests that look like significant changes. I am reverting and will report this to Laurent. 2012-08-16 21:41:18 -04:00
Eric Banks 65c594afff Better error message for reads that begin/end with a deletion in LIBS 2012-08-16 21:27:07 -04:00
Guillermo del Angel b61ecc7c19 Fix merge conflicts 2012-08-16 20:45:52 -04:00
Guillermo del Angel d26183e0ec First preliminary big refactoring of UG annotation engine. Goals: a) remove the gigantic hack that cached per-read haplotype likelihoods in a static array so that annotations would go back and retrieve them, b) unify interface for annotations between HaplotypeCaller and UnifiedGenotyper, c) as a consequence, remove and clean up duplicated code. As a bonus, annotations now have more relevant info to help them compute values.
The major idea is that per-read haplotype likelihoods are now stored in a single unified object of class PerReadAlleleLikelihoodMap. The class implementation in theory hides internal storage details from the outside (the interface may still need cleanup), and this object (or rather, a Map from Sample->perReadAlleleLikelihoodMap) is produced by UGCalcLikelihoods. The genotype calculation is also able to potentially use this info if needed. All InfoFieldAnnotations now get an extra argument with this map. Currently, this map is only produced for indels in UG, or for all variants within HaplotypeCaller. If this map is absent (SNPs in UG), the old Pileup interface is used, but it's avoided whenever possible. FORMAT annotations are not yet changed but will be the focus of the second step. The major benefit will be that annotations will be able to very easily discard non-informative reads for certain events. HaplotypeCaller also uses this new class, and no longer hard-codes the mapping of allele ->list(reads) but instead uses the same objects and interfaces as the rest of the modules. Code still needs further testing/cleaning/reviewing/debugging
2012-08-16 20:36:53 -04:00
Mark DePristo 6a2862e8bc GSA-483: Bug in GATKdocs for Enums
-- Fixed to no longer show constants in enums as constant values in the gatkdocs
2012-08-16 16:24:17 -04:00
Eric Banks 3253fc216b FindBugs 'Maintainability' fixes 2012-08-16 15:53:06 -04:00
Eric Banks 05cbf1c8c0 FindBugs 'Efficiency' fixes 2012-08-16 15:40:52 -04:00
Mark DePristo d8071c66ed Removing SlowGenotype object from GATK 2012-08-16 15:23:06 -04:00
Eric Banks a22e7a5358 Should've run 'ant clean' instead of just 'ant'. In any event, these are 2 cases where we are setting a class's internal static variable directly. Very dangerous. 2012-08-16 15:07:32 -04:00
Eric Banks 47b4f7b7e5 One final FindBugs related fix. I think it's safe to consider these changes 'fixes' that are allowed to go in during a code freeze. 2012-08-16 14:59:05 -04:00
Eric Banks ded0e11b45 Killing off some FindBugs 'Reliability' issues 2012-08-16 14:00:48 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Eric Banks 611d9b61e2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-16 13:05:36 -04:00
Eric Banks 2df04dc48a Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests. 2012-08-16 13:05:17 -04:00
Mark DePristo 132cdfd9c1 GSA-488: MLEAC > AN error when running variant eval fixed 2012-08-16 13:03:14 -04:00
Mark DePristo 4e42988c66 GSA-485: Remove repairVCFHeader from GATK codebase
-- Removed half-a*ssed attempt to automatically repair VCF files with bad headers, which allowed users to provide a replacement header overwriting the file's actual header on the fly.  Not a good idea, really.  Eric has promised to create a utility that walks through a VCF file and creates a meaningful header field based on the file's contents (if this ever becomes a priority)
2012-08-16 13:03:13 -04:00
Mark DePristo 52bfe8db8a Make sure the storage writer is closed before running mergeInfo in multi-threaded output management
-- It's not clear this is the cause of GSA-484 but it will help confirm that it's not the cause
2012-08-16 13:03:13 -04:00
Mark DePristo 7a247df922 Added -bcf argument to VCFWriter output to force BCF regardless of file extension
-- Now possible to do -o /dev/stdout -bcf -l DEBUG > tmp.bcf and create a valid BCF2 file
-- Cleanup code to make extensions easier by moving to a setX model in VariantContextWriterStub
2012-08-16 13:03:13 -04:00
Mark DePristo 28c8e3e6d7 Cleanup BCF2Codec
-- Remove FORBID_SYMBOLIC global that is no longer necessary
-- all error handling goes via error() function
2012-08-16 13:03:13 -04:00
Mark DePristo 9dc694b2e9 Meaningful error message and keeping tmp file when mergeInfo fails
-- BCF2 is failing for some reason when merging tmp. files with parallel combine variants.  ThreadLocalOutputTracker no longer sets deleteOnExit on the tmp file, as this prevents debugging.  And it's unnecessary because each mergeInto was deleting files as appropriate
-- MergeInfo in VariantContextWriterStorage only deletes the intermediate output if an error occurs
2012-08-16 13:03:13 -04:00
Mark DePristo a9a1c499fd Update md5 in VariantRecalibrationWalkers test for BCF2 -- only encoding differences 2012-08-16 13:03:13 -04:00
Eric Banks f368e568db Implementing support in BaseRecalibrator for SOLiD no call strategies other than throwing an exception. For some reason we never transferred these capabilities into BQSRv2 earlier. 2012-08-15 22:52:56 -04:00
Eric Banks 9d09230c26 Better docs for verbose output of Pileup 2012-08-15 21:55:08 -04:00
Mark DePristo c0a31b2e5b CombineVariants parallel integration tests
-- All tests but one (using old bad VCF3 input) run unmodified with parallel code.
-- Disabled UNSAFE_VCF_PROCESSING for all but that test, which changes md5s because the output files have fixed headers
-- Minor optimizations to simpleMerge
2012-08-15 21:13:16 -04:00
Mark DePristo 669c43031a BCF2 optimizations; parallel CombineVariants
-- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header.  Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently)
-- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils.  Fixed bug in edge case that never occurred in practice
-- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping it's right)
-- More ways to access the data in VCFHeader
-- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output
-- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary.  We just guess that on average we need ~10 elements for the attribute map
-- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet
-- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement
2012-08-15 21:13:16 -04:00
Mark DePristo dafa7e3885 Temporarily disable StateMonitoringThreadTests while I get them reliably working across platforms 2012-08-15 21:13:16 -04:00
Mark DePristo d70fd18900 Minor increase in tolerance to sum of states in UnitTest for StateMonitoringThreadFactory 2012-08-15 21:13:15 -04:00
Mark DePristo ae4d4482ac Parallel combine variants!
-- CombineVariants is now TreeReducible!
-- Integration tests running in parallel all pass except one (will fix) due to incorrect use of db=0 flag on input from old VCF format
2012-08-15 21:13:15 -04:00
Mark DePristo bd7ed0d028 Enable efficient parallel output of BCF2
-- Previous IO stub was hardcoded to write VCF.  So when you ran -nt 2 -o my.bcf you actually created intermediate VCF files that were then encoded single threaded as BCF.  Now we emit natively per thread BCF, and use the fast mergeInfo code to read BCF -> write BCF.  Upcoming optimizations to avoid decoding genotype data unnecessarily will enable us to really quickly process BCF2 in parallel
-- VariantContextWriterStub forces BCF output for intermediate files
-- Nicer debug log message in BCF2Codec
-- Turn off debug logging of BCF2LazyGenotypesDecoder
-- BCF2FieldWriterManager now uses .debug not .info, so you won't see all of that field manager debugging info with BCF2 any longer
-- VariantContextWriterFactory.isBCFOutput now has version that accepts just a file path, not path + options
2012-08-15 21:13:15 -04:00
Mark DePristo 9459e6203a Clean, documented implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Expanded unit tests
-- Support for clean logging of results to logger
-- Refactored MyTime into AutoFormattingTime in Utils, out of TraversalEngine, for cleanliness and reuse
-- Added docs and contracts to StateMonitoringThreadFactory
2012-08-15 21:13:15 -04:00
Mark DePristo be3230a1fd Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Created makeCombinations utility function (very useful!).  Moved template from VariantContextTestProvider
-- UnitTests for basic functionality
2012-08-15 21:13:15 -04:00
Mark DePristo f277d7c09e Removing parallelism bottleneck in the GATK
-- GenomeLocParser cache was a major performance bottleneck in parallel GATK performance.  With 10 threads, > 50% of each thread's time was spent blocking on the MasterSequencingDictionary object.  Made this a thread local variable.
-- Now we can run the GATK with 48 threads efficiently on GSA4!
  -- Running -nt 1 => 75 minutes (didn't let it run all of the way through so likely would take longer)
  -- Running -nt 24 => 3.81 minutes
2012-08-15 21:13:15 -04:00
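The general technique named in the commit above, replacing a shared and contended cache with a thread-local one, can be sketched as follows. This is a minimal illustration under stated assumptions: PerThreadCache, contigIndex and slowLookup are hypothetical names, and the lookup is a stand-in rather than anything from GenomeLocParser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-thread cache; not the GenomeLocParser implementation.
final class PerThreadCache {
    // Each thread lazily gets its own map, so lookups never block on a lock shared between threads.
    private static final ThreadLocal<Map<String, Integer>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    // Returns a cached value for the calling thread, computing it on first use.
    static int contigIndex(String contig) {
        return CACHE.get().computeIfAbsent(contig, PerThreadCache::slowLookup);
    }

    // Stand-in for an expensive or synchronized lookup (an assumption for this demo).
    private static int slowLookup(String contig) {
        return Math.floorMod(contig.hashCode(), 100);
    }

    // Number of entries cached by the calling thread only.
    static int cachedEntries() {
        return CACHE.get().size();
    }
}
```

The trade-off is memory and warm-up cost: every thread fills its own copy of the cache, in exchange for removing the shared lock from the hot path.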
Eric Banks 87e41c83c5 In AlleleCount stratification, check to make sure the AC (or MLEAC) is valid (i.e. not higher than number of chromosomes) and throw a User Error if it isn't. Added a test for bad AC. 2012-08-14 15:02:30 -04:00
Eric Banks 8e3774fb0e Fixing behavior of the --regenotype argument in SelectVariants to properly run in GenotypeGivenAlleles mode. Added integration tests to cover recent SV changes. 2012-08-14 14:21:42 -04:00
Eric Banks 34b62fa092 Two changes to SelectVariants: 1) don't add DP INFO annotation if DP wasn't used in the input VCF (it was adding DP=0 previously). 2) If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the VC. 2012-08-14 12:54:31 -04:00
Eric Banks cfb994abd2 Trivial removal of unused variable (mentioned in resolved JIRA entry) 2012-08-13 22:55:02 -04:00
Khalid Shakir f809f24afb Removed SelectHeader's --include_reference_name option since the reference is always included.
In SelectHeaders instead of including the path to the file, only include the name of the reference since dbGaP does not like paths in headers.
2012-08-13 16:49:27 -04:00
Mark DePristo 6ad75d2f5c Reverting changes to BCF2 ranges
-- The previously expanded ones are actually the missing values in the range.  The previous ranges were correct.  Removed the TODO to confirm them, as they are now officially confirmed
2012-08-13 15:06:28 -04:00
Mark DePristo 4d3fad38e9 Increase allowable range for BCF2 by -1 on low-end 2012-08-13 14:20:26 -04:00
Mark DePristo aab417c94d Fix missing argument in unittest 2012-08-12 13:58:14 -04:00
Mark DePristo f032e0aba4 A bit better output for ContextCovariate context size logging 2012-08-12 13:45:52 -04:00
Mark DePristo 243af0adb1 Expanded the BQSR reporting script
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libraries now, including gplots, grid, and gsalib, to run correctly
2012-08-12 13:45:14 -04:00
Mark DePristo 458bbdee8f Add useful logger.info telling us the mismatch and indel context sizes 2012-08-12 10:27:05 -04:00
Ami Levy Moonshine 6fefdaf428 "update integration tests in CombineVariantsIntegrationTest"
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-10 17:00:35 -04:00
Ami Levy Moonshine 4968daf0a5 update integration tests at CombineVariantsIntegrationTest 2012-08-10 16:58:05 -04:00
Eric Banks 40f0320a1c When adding a unit test to LIBS for X and = CIGAR operators, I uncovered a bug with the implementation of the ReadBackedPileup.depthOfCoverage() method. 2012-08-10 14:58:29 -04:00
Eric Banks eca9613356 Adding support of X and = CIGAR operators to the GATK 2012-08-10 14:54:07 -04:00
Ami Levy Moonshine 68fb04b8f7 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable into testing 2012-08-09 16:48:22 -04:00
Mark DePristo 06258c8a01 BCF2 optimizations
-- Added Write method to BCF2 types that directly converts int value to byte stream.  Deleted writeRawBytes(int)
-- encodeTypeDescriptor semi-inlined into encodeType so that the tests for overflow are done in just one place
-- Faster implementation of determineIntegerType for int[] values
2012-08-09 16:36:18 -04:00
Mark DePristo c6bd9b15ff BCF2 optimizations
-- BCF2Type enum has an overloaded method to read the type as an int from an input stream.  This gets rid of a case statement and replaces it with minimal, tiny methods that should be better optimized.  A side effect of this optimization is an overall cleaner code organization
2012-08-09 16:36:18 -04:00
Mark DePristo 9a0dda71d4 BCF2 optimizations
-- All low-level reads throw IOException instead of catching it directly.  This allows us to not try/catch in readByte, improving performance by 5% or so
-- Optimize encodeTypeDescriptor with final variables.  Avoid using Math.min; do an inline comparison instead
-- Inlined willOverflow directly in its single use
2012-08-09 16:36:18 -04:00
Ryan Poplin 9887bc4410 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 16:31:06 -04:00
Ryan Poplin f4c72a26d5 A few quick, minor findbugs fixes. 2012-08-09 16:30:58 -04:00
Ryan Poplin c7f22e410f A few quick, minor findbugs fixes. 2012-08-09 16:22:08 -04:00
Eric Banks def077c4e5 There's actually a subtle but important difference between foo++ and ++foo 2012-08-09 12:42:50 -04:00
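The difference the commit above alludes to is standard Java semantics: foo++ evaluates to the value before the increment, ++foo to the value after it. A minimal sketch (IncrementDemo is a hypothetical class, unrelated to the code that commit changed):

```java
// Hypothetical demo of post- versus pre-increment.
final class IncrementDemo {
    // Returns {value of foo++, value of ++foo, final foo}, starting from foo = 5.
    static int[] postThenPre() {
        int foo = 5;
        int post = foo++; // post is 5, foo becomes 6
        int pre = ++foo;  // foo becomes 7, pre is 7
        return new int[] { post, pre, foo };
    }
}
```

The distinction matters whenever the expression's value is used, for example as an array index or a method argument.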