Commit Graph

2974 Commits (7dcafe8b8194ce8a9d0b8825812fd11c8f9a0612)

Author SHA1 Message Date
Eric Banks 0206e09a6a Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 15:18:27 -04:00
Eric Banks d94d0d15c2 Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout. 2012-09-12 15:15:40 -04:00
Eric Banks 4bb7a99f08 Given that all classes implementing output stubs already have getters for the underlying OutputStream and File, it makes sense to unify that functionality into the Stub interface. Now it is possible to have an Engine utility method that iterates over all registered stubs to find the one representing a given OutputStream and return the File associated with it. 2012-09-12 11:51:44 -04:00
Eric Banks 994a4ff387 Track all outputs from BQSR (.table, .csv, and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings). 2012-09-12 11:24:53 -04:00
Christopher Hartl 96be1cbea9 My own integration test isn't passing with a clean checkout. This fix to the walker ought to do it. 2012-09-12 10:11:06 -04:00
Christopher Hartl 546586b70e Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 10:09:42 -04:00
Mark DePristo bfbf1686cd Fixed nasty bug with defaulting to diploid no-call genotypes
-- For the pooled caller we were writing diploid no-calls even when other samples were haploid.  Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing.
-- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all.
-- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)
2012-09-12 07:08:03 -04:00
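The defaulting rule described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the GATK code: `PloidySketch`, its `maxPloidy` signature, and the convention that a missing sample has ploidy 0 are all assumptions made for the example.

```java
import java.util.Collections;

// Hypothetical sketch of the ploidy-defaulting rule; not the actual GATK code.
class PloidySketch {
    /** Largest ploidy among the samples, or defaultPloidy when every sample is missing (ploidy 0). */
    static int maxPloidy(int[] samplePloidies, int defaultPloidy) {
        int max = 0;
        for (int p : samplePloidies) {
            max = Math.max(max, p);
        }
        return max == 0 ? defaultPloidy : max;
    }

    /** VCF-style no-call genotype string for the given ploidy, e.g. "./." for ploidy 2. */
    static String noCall(int ploidy) {
        return String.join("/", Collections.nCopies(ploidy, "."));
    }
}
```

Under these assumptions, a site where the other samples are haploid would get a `.` no-call, and a site with no called samples at all would fall back to `noCall(2)`.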
Mark DePristo d1ba17df5d Fixed nasty bug in BCF2 writer for case where all genotypes are missing
-- Previous code was looking for a -1 result from maxPloidy() but the result was actually 0, so instead of writing a diploid no call we were actually writing "unavailable" genotypes, and failing the BCF == VCF test in integration tests.  Fixed.
2012-09-12 06:46:27 -04:00
Mark DePristo 91f3204534 VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself
-- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere.
-- Removed addMissingSamples from VariantContextUtils, and calls to it
-- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples
-- Added unit tests for this behavior in VariantContextWritersUnitTest
2012-09-12 06:46:26 -04:00
Christopher Hartl 5d19fca649 A couple of bug-fixy changes.
1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:" ones) if the user requested a sample that wasn't present in the VCF. The walker now
    checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to
    all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning,
    and does the appropriate subsetting. Added integration tests for this.

 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method
    as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and
    added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).
2012-09-11 23:01:00 -04:00
Mark DePristo e25e617d1a Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency
-- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use.  So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.
2012-09-11 07:38:34 -04:00
Mark DePristo d6e42d839c Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly. 2012-09-10 20:14:14 -04:00
Mark DePristo 641c6a361e Fix nasty memory leak in new data thread x cpu thread parallelism
-- Basically you cannot safely use instance specific ThreadLocal variables, as these cannot be safely cleaned up.  The old implementation kept pointers to old writers, with huge tribble block indexes, and eventually we crashed out of integration tests
-- See http://weblogs.java.net/blog/jjviana/archive/2010/06/10/threadlocal-thread-pool-bad-idea-or-dealing-apparent-glassfish-memor for more information
-- New implementation uses a borrow/return schedule with a list of N TraversalEngines managed by the MicroScheduler directly.
2012-09-10 20:14:14 -04:00
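The borrow/return scheme mentioned in the commit above might look something like the sketch below. It is an assumption-laden illustration: `BorrowReturnPool`, `borrow`, `giveBack`, and `idleCount` are hypothetical names, and the real MicroScheduler manages TraversalEngines with its own bookkeeping.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch of a borrow/return pool; names are illustrative, not GATK's.
class BorrowReturnPool<T> {
    private final BlockingQueue<T> available;

    /** Creates n instances up front; n must be at least 1. */
    BorrowReturnPool(int n, Supplier<T> factory) {
        available = new ArrayBlockingQueue<>(n);
        for (int i = 0; i < n; i++) {
            available.add(factory.get());
        }
    }

    /** Blocks until an instance is free, then hands it to the caller. */
    T borrow() {
        try {
            return available.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an instance", e);
        }
    }

    /** Returns a previously borrowed instance so another thread can use it. */
    void giveBack(T item) {
        available.add(item);
    }

    /** Number of instances currently not borrowed. */
    int idleCount() {
        return available.size();
    }
}
```

Because the pool owner holds ordinary references, dropping the pool makes every instance collectable, whereas values stored in a per-instance ThreadLocal can linger for as long as the pooled threads stay alive.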
Mark DePristo 195cf6df7e Attempting to fix out of memory errors with new traversal engine creator 2012-09-10 20:14:14 -04:00
Mark DePristo f713d400e2 Fixed GSA-515 Nanoscheduler GSA-555 / Make NT and NCT work together
-- Can now say -nt 4 and -nct 4 to get 16 threads running for you!
-- TraversalEngines are now ThreadLocal variables in the MicroScheduler.
-- Misc. code cleanup, final variables, some contracts.
2012-09-10 20:14:14 -04:00
Mark DePristo 233f70f8ba Final cleanup of TraversalProgressMeters, moved to utils.progressmeter
-- TraversalProgressMeter now completely generalized, named ProgressMeter in utils.progressmeter.  Now just takes "nRecordsProcessed" as an argument to print reads.  Completely removes dependence on complex data structures from TraversalProgressMeter.  Can be used to measure progress on any task with processing units in genomic locations.
-- A fairly simple class with no dependency on GATK engine or other features.
-- Currently only used by the TraversalEngine / MicroScheduler but could be used for any purpose now, really.
2012-09-10 20:14:14 -04:00
Mark DePristo 2e94a0a201 Refactor TraversalEngine to extract the progress meter functions
-- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance.  The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves.
-- Because the progress metering code was horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class.  I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time.  It should be possible to write some nice tests for it now.  By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before
-- Cleaned up the start up of the progress meter.  It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer
-- Simplified and made clear the interface for shutting down the traversal engines.  There's now a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversal is over.  Nano traversals now properly shut down (was a subtle bug I uncovered here).  The printing of the on-traversal-done metering is now handled by MicroScheduler
-- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N).
-- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome.  Useful for progress meter but also probably for other infrastructure as well
-- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful.  The generic bean is just a shell interface with nothing in it.
-- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.
2012-09-10 20:14:13 -04:00
David Roazen d2f3d6d22f Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)"
This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.
2012-09-10 15:52:39 -04:00
Menachem Fromer 0b717e2e2e Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:32:41 -04:00
Eric Banks d7499e0642 Updating the rank sum test documentation 2012-09-09 22:17:36 -04:00
Eric Banks 8ca205f1a9 Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-07 14:26:06 -04:00
Eric Banks b1677fc719 Fixed JIRA GSA-520 for Guillermo: when intervals with zero coverage were present, DiagnoseTargets was trying to merge them with the next interval (even if non-overlapping) which would cause problems later on when it checked to make sure that intervals were strictly overlapping. 2012-09-07 14:25:57 -04:00
Geraldine Van der Auwera 3f2a4379af Added forum API version stub to base URL for posting GATKDocs
This will prevent bugs from occurring when Vanilla makes changes to the API
    as described here: http://vanillaforums.com/blog/api#configuration
    Based on the bug that broke the website Guide section on 9/6/12,
    the GATKDocs posting system will probably break in the next release if
    this is not applied as a bug fix.
2012-09-07 11:49:02 -04:00
Eric Banks ed3d9b050f Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-07 11:45:09 -04:00
Eric Banks 3dc248a49d Adding another test 2012-09-07 11:41:38 -04:00
Ryan Poplin 81b27f9db2 auto-merging to latest version 2012-09-07 11:36:47 -04:00
Eric Banks 41a8a304a0 Catch masked OutOfMemory errors as User Errors 2012-09-07 11:27:00 -04:00
Mark DePristo d62eca5d92 Update GATKPerformanceOverTime to measure -nt and -nct 2012-09-07 10:47:29 -04:00
Mark DePristo bf87de8a25 UnitTests for ReducerThread and InputProducer
-- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order
2012-09-07 09:51:32 -04:00
Mark DePristo c503884958 GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper
-- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry.
-- Restructured the NS code for clarity.  Docs everywhere.
-- This is considered version 1.0
2012-09-07 09:15:16 -04:00
Mark DePristo 9d12935986 Intermediate commit for new hyper parallel NanoScheduler
-- There's a logic bug now but I'll go to squash it...
2012-09-07 09:15:16 -04:00
Eric Banks 576c7280d9 Extensions to the ErrorThrowing framework for testing purposes 2012-09-06 22:03:18 -04:00
David Roazen cb84a6473f Downsampling: experimental engine integration
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet

-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.

-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
2012-09-06 15:03:27 -04:00
Eric Banks 6df6c1abd5 Fix for PBT to stop NPE when there are no likelihoods present 2012-09-06 13:14:18 -04:00
Mark DePristo 1b064805ed Renaming -cnt to -nct for consistency 2012-09-05 21:13:19 -04:00
Mark DePristo 574a8f710b Add static boolean controlled output of individual map call timing to nanoSecond resolution 2012-09-05 17:40:02 -04:00
Mark DePristo e11915aa0a GSA-515 Nanoscheduler GSA-550 ThreadSafeMapReduce shouldn't be super interface of TreeReducible 2012-09-05 17:37:56 -04:00
Mark DePristo c5f1ceaa95 All read and loci traversals go through NanoScheduler now
-- The NanoScheduler is doing a good job at tracking important information like time spent in map/reduce/input etc.
-- Can be disabled with static boolean in MicroScheduler if we have problems
-- See GSA-515 Nanoscheduler GSA-549 Retire TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine
2012-09-05 16:38:21 -04:00
Mark DePristo dddf148a59 Fixed bug in ThreadAllocation getTotalNumberOfThreads
-- It isn't data + cpu, it's data * cpu threads.
2012-09-05 16:35:32 -04:00
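The arithmetic behind this fix, as a minimal sketch. `ThreadCountSketch` and `totalThreads` are hypothetical names used only for illustration; the real ThreadAllocation class is not reproduced here.

```java
// Hypothetical sketch; the real ThreadAllocation class is not shown here.
class ThreadCountSketch {
    /** Each data thread (-nt) runs its own set of CPU threads, so the counts multiply rather than add. */
    static int totalThreads(int dataThreads, int cpuThreadsPerDataThread) {
        return dataThreads * cpuThreadsPerDataThread;
    }
}
```

With 4 data threads and 4 CPU threads per data thread this gives 16, which matches the "16 threads" figure quoted in a later commit message; adding the two values would have reported 8.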
Mark DePristo 9bf1d138d9 New GATK argument interface for data and cpu threads
-- Closes GSA-515 Nanoscheduler GSA-542 Good interface to nanoScheduler
-- Old -nt means dataThreads
-- New -cnt (--num_cpu_threads_per_data_thread) gives you n cpu threads for each data thread in the system
-- Cleanup logic for handling data and cpu threading in HMS, LMS, and MS
-- GATKRunReport reports the total number of threads in use by the GATK, not just the nt value
-- Removed the io,cpu tags for nt.  Stupid system if you ask me.  Cleaned up the GenomeAnalysisEngine and ThreadAllocation handling to be totally straightforward now
2012-09-05 15:45:24 -04:00
Mark DePristo 1e55475adc NanoScheduler uses ExecutorService to run input reader thread 2012-09-05 15:45:24 -04:00
Mark DePristo 71d9ebcb0d Fix bug (introduced by me) that didn't include contig in progress meter 2012-09-05 15:45:24 -04:00
Mark DePristo c822b7c760 Fix long-standing NPE in LMS due to inappropriate timing of initialization 2012-09-05 15:45:24 -04:00
Mark DePristo a997c99806 Initial NanoScheduler with input producer thread 2012-09-05 15:45:24 -04:00
Mark DePristo 03dd470ec1 Test for progressFunction in NanoScheduler; bugfix for single threaded fast path 2012-09-05 15:45:23 -04:00
Mark DePristo 8cdeb51b78 Cleanup printProgress in TraversalEngine
-- Separate updating cumulative traversal metrics from printing progress.  There's now an updateCumulativeMetrics function and a printProgress() that only takes a current position
-- printProgress now relies solely on the time since the last progress to decide if it will print or not.  No longer uses the number of cycles, since this isn't reliable in the case of nano scheduling
-- GenomeAnalysisEngine now maintains a pointer to the master cumulative metrics.  getCumulativeMetrics never returns null, which was handled in some parts of the code but not others.
-- Update all of the traversals to use the new updateCumulativeMetrics, printProgress model
-- Added progress callback to nano scheduler.  Every bufferSize elements this callback is invoked, allowing us to smoothly update the progress meter in the NanoScheduler
-- Rename MapFunction to NanoSchedulerMap and the same for reduce.
2012-09-05 15:45:23 -04:00
Mark DePristo d503ed97ab Mark I NanoScheduling TraverseLoci
-- Refactored TraverseLoci into old linear version and nano scheduling version
-- Temp. GATK argument to say how many nano threads to use
-- Can efficiently scale to 3 threads before blocking on input
2012-09-05 15:45:23 -04:00
Mark DePristo 757e6a0160 Making Pileup thread-safe
-- Old version relied on out printstream magically sorting output, new version puts the print in reduce
2012-09-05 15:45:23 -04:00
Mark DePristo d7105223fe More debugging output for NanoScheduler when debugging is enabled 2012-09-05 15:45:23 -04:00
Mark DePristo 9823102c0c TraverseReadsNano supports walker.filter and walker.done
-- Instead of returning directly the result of map(), returns a MapResult object with the value and a reduceMe flag.
-- Reduce function respects the reduceMe flag
-- Code cleanup and more documentation
2012-09-05 15:45:23 -04:00
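The idea of wrapping each map() result with a reduceMe flag can be sketched as below. `MapResultSketch` and `reduceAll` are hypothetical names for this example, which assumes a simple sequential reduce rather than the scheduler's actual threading.

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical sketch of a map result carrying a "reduce me" flag; not the GATK class.
class MapResultSketch<V> {
    final V value;
    final boolean reduceMe;

    MapResultSketch(V value, boolean reduceMe) {
        this.value = value;
        this.reduceMe = reduceMe;
    }

    /** Folds only the flagged results into the running value, in input order. */
    static <V> V reduceAll(List<MapResultSketch<V>> results, V init, BinaryOperator<V> reduce) {
        V sum = init;
        for (MapResultSketch<V> r : results) {
            if (r.reduceMe) {
                sum = reduce.apply(sum, r.value);
            }
        }
        return sum;
    }
}
```

A result produced for a read that the walker filtered out would carry `reduceMe == false`, so it passes through the pipeline in order but never reaches the reduce function.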
Mark DePristo 1a8f5fc374 Trivial cleanup of NanoScheduler 2012-09-05 15:45:23 -04:00
Mark DePristo 6a5a70cdf1 Done GSA-539: SimpleTimer should use System.nanoTime for nanoSecond resolution 2012-09-05 15:45:23 -04:00
Mark DePristo 59109d5eeb NanoScheduler tracks time outside of its execute call 2012-09-05 15:45:23 -04:00
Mark DePristo 800a27c3a7 NanoScheduler tracks time within input, map, and reduce
-- Helpful for understanding where the time goes to each bit of the code.
-- Controlled by a local static boolean, to avoid the potential overhead in general
2012-09-05 15:45:23 -04:00
Mark DePristo 7087b22ea3 No debugging output (even conditional) for ReadTransformers in PrintReads 2012-09-05 15:45:23 -04:00
Mark DePristo e01258b261 NanoScheduler now supports printProgress. Bugfixes to printProgress
-- TraverseReadsNano prints progress at the end of each traversal unit
-- Fix bugs in TraversalEngine printProgress
    -- Synchronize the method so we don't get multiple logged outputs when two or more HMSs call printProgress before initialization at the start!
    -- Fix the logic for mustPrint, which actually had the logic of mustNotPrint.  Now we see the done log line that was always supposed to be there
    -- Fix output formatting, as the done() line was incorrectly shifting over the % complete by 1 char as 100.0% didn't fit in %4.1f
-- Add clearer doc on -PF argument so that people know that the performance log can be generated to standard out if one wants
2012-09-05 15:45:23 -04:00
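The formatting problem noted in the commit above comes from field width: `%4.1f` reserves four characters, but `100.0` needs five. The sketch below only demonstrates that effect; widening the field to `%5.1f` is one possible fix and is an assumption, since the commit message does not say which format was adopted.

```java
import java.util.Locale;

// Illustrative only: shows why a field width of 4 cannot hold "100.0".
class PercentFormatSketch {
    /** Width 4: fits "99.9" exactly, but "100.0" overflows to 5 characters. */
    static String narrow(double pct) {
        return String.format(Locale.ROOT, "%4.1f", pct);
    }

    /** Width 5: every value from 0.0 to 100.0 occupies the same 5 characters. */
    static String wide(double pct) {
        return String.format(Locale.ROOT, "%5.1f", pct);
    }
}
```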
Mark DePristo 6055101df8 NanoScheduler no longer groups inputs, each map() call is interlaced now
-- Maximizes the efficiency of the threads
-- Simplifies interface (yea!)
-- Reduces number of combinatorial tests that need to be performed
2012-09-05 15:45:22 -04:00
Yossi Farjoun ad5fa449e7 fixed a typo in the string comment 2012-09-05 14:46:10 -04:00
Ryan Poplin 84a83fd3f3 fixing typo 2012-09-05 10:41:03 -04:00
Eric Banks fc06f39411 Fixed docs for Pileup walker 2012-09-05 09:55:34 -04:00
Christopher Hartl d795437202 - New UserExceptions added for when ReadFilters or Walkers specified on the command line are not found. When -rf xxxx cannot find the class corresponding to xxxx, all read filters are printed in a better formatted way, with links to their gatk docs.
- VariantAnnotatorEngine changed to call genotype annotations even if pileups and allele -> likelihood mappings are not present. Current genotype annotations altered to check for null pileups and null mappings.
2012-09-04 16:41:44 -04:00
Ryan Poplin 9cc1a9931b Resolving merge conflicts. 2012-09-04 10:47:38 -04:00
Ryan Poplin c9944d81ef Skip array needs to also be used in the updateDataForRead function of the delocalized BQSR. 2012-09-04 10:33:37 -04:00
Mark DePristo c9ea213c9b Make BaseRecalibration thread-safe
-- In the process uncovered two strange things
    1 -- qualityScoreByFullCovariateKey was created but never used.  Seems like a cache?
    2 -- Discovered nasty bug in BaseRecalibrator: https://jira.broadinstitute.org/browse/GSA-534
2012-08-31 13:42:42 -04:00
Mark DePristo 27ddebee53 Protect PrintReads from strange state from TraverseReadsUnitTests 2012-08-31 13:42:41 -04:00
Mark DePristo e028901d54 Fixed bad contract in ReadTransformer 2012-08-31 13:42:41 -04:00
Mark DePristo cf91d894e4 Fix build problems with tests 2012-08-31 13:42:41 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, or handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR throws exceptions in parallel, which a subsequent commit will fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Ryan Poplin ff6ebbf3fd Resolving merge conflicts. 2012-08-31 11:25:55 -04:00
Mark DePristo 2f749b5e52 Added ThreadSafeMapReduce interface, super of TreeReducible
-- A higher level interface to declare parallelism capability of a walker.  This interface means that the walker can be multi-threaded, but doesn't necessarily support TreeReducible interface, which forces you to have a combine ReduceType operation that isn't appropriate for parallel read walkers
-- Updated ReadWalkers to implement ThreadSafeMapReduce not TreeReducible
2012-08-30 19:41:49 -04:00
Mark DePristo 544740d45d tasking for n threads should give you n threads in NanoScheduler, not n - 1 2012-08-30 19:41:49 -04:00
Mark DePristo 7a462399ce Fix GSA-529: Fix RODs for parallel read walkers
-- TraverseReadsNano modified to read in all input data before invoking maps, so the input to TraverseReadsNano is a MapData object holding the sam record, the ref context, and the refmetadatatracker.
-- Update ValidateRODForReads to be tree reducible, using synchronized map and explicitly sort the output map from locations -> counts in onTraversalDone
-- Expanded integration tests to test nt 1, 2, 4.
2012-08-30 19:41:49 -04:00
Mark DePristo 7d95176539 Bugfix to compareTo and equals in GenomeLoc
-- Yes, GenomeLoc.compareTo was broken.  The compareTo function only considered the contig and start position, but not the stop, when comparing genome locs.
-- Updated GenomeLoc.compareTo function to account for stop.  Updated GATK code where necessary to fix resulting problems that depended on this.
-- Added unit tests to ensure that hashcode, equals, and compareTo are all correct for GenomeLocs
2012-08-30 19:41:49 -04:00
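A minimal sketch of an ordering that takes the stop position into account and stays consistent with equals and hashCode. `LocSketch` is a hypothetical, simplified stand-in; the real GenomeLoc carries more state and its exact implementation is not shown in this log.

```java
import java.util.Objects;

// Hypothetical simplified locus; the real GenomeLoc carries more state.
final class LocSketch implements Comparable<LocSketch> {
    final int contigIndex;
    final int start;
    final int stop;

    LocSketch(int contigIndex, int start, int stop) {
        this.contigIndex = contigIndex;
        this.start = start;
        this.stop = stop;
    }

    @Override
    public int compareTo(LocSketch o) {
        int c = Integer.compare(contigIndex, o.contigIndex);
        if (c == 0) c = Integer.compare(start, o.start);
        if (c == 0) c = Integer.compare(stop, o.stop); // tie-breaker on stop, missing in the old comparison
        return c;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof LocSketch && compareTo((LocSketch) other) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(contigIndex, start, stop);
    }
}
```

Without the stop tie-breaker, two intervals that share a contig and start but end at different positions compare as equal, which can silently drop one of them from a sorted set or map.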
Mark DePristo 5a9610d875 ReadShards now default to 10K (up from 1K) reads per samFile up to 250K
-- This should help make the inputs for parallel read walkers a little meatier, and avoid spinning the shard creation infrastructure so often
2012-08-30 19:41:49 -04:00
Christopher Hartl 5a142fe265 After discussion with Ryan/Eric, the Structural_Indel variant type is now gone, and has been entirely replaced with the access pattern .isStructuralIndel(). This makes it a strict subtype of indel. I agree that this method is a bit more sensible.
In addition, fix for GSA-310. If supplied -rf argument does not match a known read filter, the list of read filters will be printed, and users directed to the documentation for more information.
2012-08-30 17:57:31 -04:00
Mark DePristo 82b2845b9f Fix: GSA-531 ApplyRecalibration writing to BCF: java.lang.String cannot be cast to java.lang.Double
-- LOD must be added as a double to attributes, not as a string, so that it can be written out as BCF
2012-08-30 16:59:57 -04:00
Ryan Poplin 7b366d4049 misc cleanup in active region traversal. 2012-08-30 11:01:01 -04:00
Mark DePristo ce3d1f89ea ReadShard are no longer allowed to span multiple contigs
-- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads.  The old implementation simply failed in this case.  The new code handles this correctly by forcing shards to have all of their data on a single contig.
-- Added a PrintReads integration test to ensure this behavior is correct
-- Adding test BAMs that have < 200 reads and span across contig boundaries
2012-08-30 10:15:11 -04:00
Mark DePristo 53376b9423 Part III of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- shardSpan is only calculated when some ROD is live in the GATK.  No sense in paying the cost per read when you don't need it
-- Update contract to allow null span or unmapped span (good catch unittests!)
2012-08-30 10:15:10 -04:00
Mark DePristo 1200848bbf Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
2012-08-30 10:15:10 -04:00
Mark DePristo 972be8b4a4 Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- ReadMetaDataTracker is dead!  Long live the RefMetaDataTracker.  Read walkers will soon just take RefMetaDataTracker objects.  In this commit they take a class that trivially extends them
-- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class.
    -- This new implementation produces thread-safe objects (i.e., holds no pointers to shared state).  Suitable for use (to be tested) with nano scheduling
    -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely.
-- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView
-- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference.  Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface.  If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself.
    -- This commit breaks the new read walker BQSR, but Ryan knows this is coming
-- Subsequent commit will be retiring / fixing ValidateRODForReads
2012-08-30 10:15:10 -04:00
Mark DePristo 8fc6a0a68b Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker 2012-08-30 10:13:06 -04:00
Ryan Poplin b85ded8389 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-30 10:11:48 -04:00
Ryan Poplin 57d997f06f Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start 2012-08-30 10:10:43 -04:00
Ryan Poplin f9bab37015 Merged bug fix from Stable into Unstable 2012-08-30 09:21:24 -04:00
Ryan Poplin eb63221875 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-08-30 09:19:35 -04:00
Ryan Poplin 81d5eca975 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-30 09:10:56 -04:00
Ryan Poplin 35baf0b155 This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test. 2012-08-30 09:07:36 -04:00
Eric Banks 1acf0f0b2c Fixing bug in fasta .fai generation: trim the contig names to the first whitespace if one appears. We now generate indexes identical to samtools. 2012-08-29 22:36:27 -04:00
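The trimming rule from the commit above, sketched as a small helper. `FaiNameSketch.contigName` is a hypothetical function written for illustration; it mirrors the rule stated in the message and is not the actual GATK method.

```java
// Hypothetical helper; mirrors the rule in the commit message, not the actual GATK method.
class FaiNameSketch {
    /** Strips a leading '>' if present and keeps everything before the first whitespace character. */
    static String contigName(String headerLine) {
        String s = headerLine.startsWith(">") ? headerLine.substring(1) : headerLine;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return s.substring(0, i);
            }
        }
        return s;
    }
}
```

For a header such as `>chr1 Homo sapiens chromosome 1`, this keeps only `chr1` as the contig name, and a header with no whitespace is returned unchanged.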
Eric Banks 4d38befe86 Merged bug fix from Stable into Unstable 2012-08-29 15:13:56 -04:00
Eric Banks 150a969279 Be careful with String manipulation when constructing alleles in SomaticIndelDetector 2012-08-29 15:13:28 -04:00
Eric Banks ce55ba98f4 Don't try to left align indels in unmapped reads (which for some reason can still have CIGARs) because the ref context is null. 2012-08-29 15:01:11 -04:00
Ryan Poplin 4ea38bbfe8 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-29 11:39:30 -04:00
Mauricio Carneiro 69b56e11c8 ReadClipper won't modify the original read
Reverting back to the original implementation, but now including write N's and write Q0's due to walkers that look at the same read multiple times in different reference windows
2012-08-29 11:33:19 -04:00
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Eric Banks e74c527d47 Register the deprecated walkers as deprecated starting in v2.2 so that users get a helpful error message 2012-08-28 10:19:18 -04:00
Eric Banks 67d348a31d Retiring the alignment walkers and related integration test since we don't want to support them anymore. 2012-08-28 10:16:49 -04:00
Mark DePristo 2996693c9f FisherStrand now computed with and without filtering low-qual bases, and least significant pvalue is kept
-- Old way (filtering for Q > 17 bases) resulted in biased FS when the site was good but there was a
systematic shift in the QUAL of REF and ALT between strands of the reads (sometimes happens)
-- New way (taking all bases) was consistent with BaseQualRankSum and other tests, but there can be
a lot of low qual reference bases on one strand in some techs (ION/PROTON/PACBIO) because of the
preference for introducing an indel vs. a mismatch.
-- This implementation allows us to have our cake and eat it too by computing both p-values, and
taking the maximum one (i.e., least significant).
-- No integration tests updated yet -- still exploring the consequences of this change
2012-08-28 08:06:47 -04:00
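The selection step described in the commit above reduces to taking the larger of two p-values. The sketch assumes the two p-values have already been computed elsewhere; `LeastSignificantSketch` and its parameter names are hypothetical.

```java
// Hypothetical sketch of the selection step only; computing the p-values themselves is out of scope here.
class LeastSignificantSketch {
    /** A larger p-value is weaker evidence of strand bias, so the maximum is the least significant. */
    static double leastSignificant(double pFilteredBases, double pAllBases) {
        return Math.max(pFilteredBases, pAllBases);
    }
}
```

Keeping the larger value means a site is only flagged for strand bias when both ways of counting bases agree that the bias is significant.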
Eric Banks bedcdbdc5f Fixing merge conflict 2012-08-27 12:16:51 -04:00
Eric Banks 3d476487c6 LIBS is totally busted for deletions. Putting a check in AD for bad pileup event bases so that we don't produce busted alleles. We must fix LIBS ASAP. 2012-08-27 12:13:12 -04:00
Mark DePristo 63a9ae817a Ensure thread-safety of CachingIndexedFastaSequenceFile
-- Cosmetic cleanup of ReadReferenceView
-- TraverseReadsNano provides the reference context, since it's thread-safe
-- Cleanup CachingIndexedFastaSequenceFile.  Add docs, remove unnecessary setters
-- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.
2012-08-27 12:11:54 -04:00
Khalid Shakir 2d1ea7124b One less Queue command line requirement: -tempDir now defaults to .queue/tmp.
Also moved queueScatterGather to .queue/scatterGather.
2012-08-27 12:04:50 -04:00
Mark DePristo 68c5142d2d numThreads > 1 any time you have -nt > 1 silly 2012-08-26 14:36:13 -04:00
Mark DePristo fde9824765 Optimizations for parallel read walkers
-- TraverseReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone
-- Nicer debugging output in NanoScheduler
-- ReadShard has a getBufferSize() method now
2012-08-25 17:21:12 -04:00
Mark DePristo 5066b14335 Parallel FlagStat 2012-08-25 17:21:12 -04:00
Mark DePristo af540888f1 Limited version of parallel read walkers
-- Currently doesn't support accessing reference or ROD data
-- Parallel versions of PrintReads and CountReads
2012-08-25 17:21:12 -04:00
Mark DePristo e060b148e2 Minor cleanup of TraverseReads 2012-08-25 17:21:11 -04:00
Mark DePristo 275a5e5439 More tests for NanoScheduler
-- Add more contracts
-- Test in the UnitTest that the reduce is being called in the correct order
2012-08-25 17:21:11 -04:00
Christopher Hartl 6db0988898 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-25 15:40:32 -04:00
Christopher Hartl db2e88c7cb Fix for badIndelLength() throwing NPE at non-indel sites. Added integration test. 2012-08-25 12:38:23 -07:00
Mark DePristo 59b5913b54 Merged bug fix from Stable into Unstable 2012-08-25 14:53:22 -04:00
Mark DePristo dcc972a557 Usability cleanup for BQSR
-- I'm seeing a lot of people trying to use BinaryTagCovariate in the community.  They really shouldn't do this, so I moved it to private.
-- Throw an exception if its required bintag argument is missing
-- Check explicitly if user is requesting DinucCovariate and tell them that it's been retired in favor of ContextCovariate
-- Show the type (Required, Experimental, Standard) of the covariates when running --list
2012-08-25 14:53:00 -04:00
Christopher Hartl b59948709f Code improvements re: JIRA GSA-510. Trio class migrated into the Samples package - because the trio structure is so ubiquitously used, it makes sense, I think, to have a class which imposes the structure on the samples. Existing functions which slightly duplicated the getTrios() method look like they have bugs. These functions are now deprecated.
A number of functions in the sampleDB looked to be assuming that samples could not share IDs (e.g. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption.

MVLikelihoodRatio now uses the trio methods from SampleDB.
2012-08-25 08:48:27 -07:00
Mark DePristo 0996bbd548 Comments for Chris on cleanup 2012-08-24 16:04:58 -04:00
Mark DePristo 649b82ce85 Merge branch 'nanoScheduler'
Conflicts:
	private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala
2012-08-24 15:59:36 -04:00
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
Christopher Hartl 752f44c332 Code cleanup in MVLR and SelectVariants. Should fix JIRA GSA-509 and GSA-510 2012-08-24 12:25:11 -07:00
Mark DePristo d6e6b30caf Initial implementation of GSA-515: Nanoscheduler
– Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficient at using resources to do sum of 2 * (sum(1 - X)).

Done!

CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of subsequent available work units. May be worth measuring expected cost for read / map / reduce unit and use it to balance the compute
As input is single threaded just need one thread to populate inputs, which runs as fast as possible in parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
2012-08-24 14:07:44 -04:00
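The test scenario described in the NanoScheduler commits (read integers from an iterator, map is int * 2, reduce is sum, inputs grouped per thread, reduce called in input order) can be sketched roughly as follows. This is an illustration in Python, not the GATK implementation: `nano_map_reduce`, `grouped`, `n_threads`, and `group_size` are hypothetical names, and the sketch does not model the fixed-size input queue mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def grouped(iterator, size):
    """Yield successive lists of up to `size` items from `iterator`."""
    it = iter(iterator)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def nano_map_reduce(inputs, map_fn, reduce_fn, initial, n_threads=4, group_size=100):
    """Map groups of inputs on a thread pool, then reduce in input order.

    Hypothetical sketch: grouping means one task per group rather than
    one task per map() call.
    """
    def map_group(group):
        return [map_fn(x) for x in group]

    result = initial
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Executor.map returns results in submission order, so the
        # reduce step below always sees values in the original order.
        for mapped in pool.map(map_group, grouped(inputs, group_size)):
            for value in mapped:
                result = reduce_fn(result, value)
    return result


if __name__ == "__main__":
    total = nano_map_reduce(range(1, 1001), lambda x: 2 * x, lambda a, b: a + b, 0)
    print(total)
```

The reduce runs on the calling thread only, which mirrors the idea that map work is spread across threads while the ordered reduce stays sequential.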
Eric Banks 0545664f91 Fix ClassCastException seen in Tableau errors 2012-08-24 13:45:48 -04:00
Eric Banks 740520c23b Fix BQSR docs 2012-08-24 13:20:10 -04:00
Ryan Poplin 5f8574bd15 Fixing typo in error message. 2012-08-24 10:48:41 -04:00
Mark DePristo 1999b95754 Work around for GSA-513: ClassCastException in VariantEval 2012-08-23 18:14:49 -04:00
Christopher Hartl f1166d6d00 Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case. 2012-08-23 11:43:19 -07:00
Mark DePristo 857b11b26f Done with GSA-506: Add nt and efficiency information to GATKRunReport
-- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO
-- See https://jira.broadinstitute.org/browse/GSA-506 for more details
2012-08-23 09:59:53 -04:00
Mark DePristo 0b735884db Cleanup code in VariantContext 2012-08-23 09:59:53 -04:00
Eric Banks e5df91aa23 Looks like the @WalkerName annotation doesn't work with the GATK docs, so I'm renaming the walkers. 2012-08-22 20:17:39 -04:00
Mark DePristo 63af0cbcba Cleanup GATK efficiency monitor classes
-- Invert logic in GATKArgumentCollection to disable monitoring, not enable.  That means monitoring is on by default
-- Fix testing error in unit tests
-- Rename variables in ThreadAllocation to be clearer
2012-08-22 16:48:02 -04:00
Mark DePristo e1293f0ef2 GSA-507: Thread monitoring refactored so it can work without a thread factory
-- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory.
-- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode
-- MicroScheduler now handles management of the efficiency monitor.  Includes master thread in monitor, meaning that reduce is now included for both schedulers
2012-08-22 16:48:01 -04:00
Mark DePristo f876c51277 Separately track time spent doing user and system CPU work
-- Allows us to ID (by proxy) time spent doing IO
-- Refactor StateMonitoringThreadFactory to use its own enum, not Thread.State
-- Reliable unit tests across mac and unix
2012-08-22 16:48:01 -04:00
Mark DePristo 18060f237b Add thread efficiency monitoring to GATK HMS
-- See https://jira.broadinstitute.org/browse/GSA-502
-- New command line argument -mt enables thread monitoring
-- If enabled, HMS uses StateMonitoringThreadFactory to create monitored threads, and prints out an efficiency report when HMS exits, telling the user information like:

for BQSR – known to be inefficient locking
INFO 17:10:33,195 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:10:33,196 StateMonitoringThreadFactory - Total runtime 90.3 m
INFO 17:10:33,196 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.72 ( 64.8 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent running is 0.26 ( 23.7 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.02 ( 112.8 s)
INFO 17:10:33,197 StateMonitoringThreadFactory - Efficiency of multi-threading: 26.19% of time spent doing productive work

for CountLoci
INFO 17:06:12,777 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:06:12,777 StateMonitoringThreadFactory - Total runtime 43.5 m
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.00 ( 4.2 s)
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent running is 1.00 ( 43.3 m)
INFO 17:06:12,779 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.00 ( 6.0 s)
INFO 17:06:12,779 StateMonitoringThreadFactory - Efficiency of multi-threading: 99.61% of time spent doing productive work
2012-08-22 16:48:01 -04:00
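The efficiency figures in the two reports above appear to be the running time divided by the sum of running, blocked, and waiting time. That definition is an assumption inferred from the log lines, not taken from the StateMonitoringThreadFactory source, and `thread_efficiency` is a hypothetical helper:

```python
def thread_efficiency(running_s, blocked_s, waiting_s):
    """Percent of total thread time spent running (assumed definition)."""
    total = running_s + blocked_s + waiting_s
    if total <= 0:
        return 0.0
    return 100.0 * running_s / total


if __name__ == "__main__":
    # CountLoci report: 43.3 m running, 4.2 s blocked, 6.0 s waiting
    print(round(thread_efficiency(43.3 * 60, 4.2, 6.0), 2))
    # BQSR report: 23.7 m running, 64.8 m blocked, 112.8 s waiting
    print(round(thread_efficiency(23.7 * 60, 64.8 * 60, 112.8), 2))
```

Because the logged times are rounded, recomputing from them reproduces the reported percentages only approximately.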
Ryan Poplin fe3069b278 Merged bug fix from Stable into Unstable 2012-08-22 14:40:34 -04:00
Ryan Poplin e5cfdb4811 Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt() 2012-08-22 14:39:35 -04:00
Ryan Poplin 63213e8eb5 Expanding the HaplotypeCaller integration tests to cover a wider range of data 2012-08-22 14:18:44 -04:00
Eric Banks 944e1c299d Docs for --keepOriginalAC were wrong in SelectVariants 2012-08-22 13:07:13 -04:00
Eric Banks 2409aa9bfd Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 12:54:43 -04:00
Eric Banks 94540ccc27 Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case. 2012-08-22 12:54:29 -04:00
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00
Guillermo del Angel 7df0abf49b Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 11:36:41 -04:00
Eric Banks 9e76e8aa0b Just noticed that the efficient conversion to uppercase method is redundant since it's already implemented efficiently in Picard; let's just have a single implementation. 2012-08-22 11:26:08 -04:00
Christopher Hartl 20601f034e Updating the checkType() function to include the new StructuralIndel variant type. Fixes outstanding broken integration test. 2012-08-22 07:33:10 -07:00
Eric Banks c7ce3e1cf5 Merged bug fix from Stable into Unstable 2012-08-22 00:24:40 -04:00
Eric Banks 03017855e4 WTF - why is support for whole-read insertions all messed up in LIBS? I've pushed a temporary patch for now (the right solution should certainly not be implemented in stable; LIBS needs to be better thought out). Added another unit test. 2012-08-22 00:24:01 -04:00
Mark DePristo 6ce8016ae7 GSA-491: Add hidden tag to GATK that propagates to the GATK logs 2012-08-21 14:44:18 -04:00
Guillermo del Angel 6a8cf1c84a Enable and adapt HaplotypeScore and MappingQualityZero as active region annotations now that we have per-read likelihoods passed in to annotations 2012-08-21 14:35:40 -04:00
Guillermo del Angel d0644b3565 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:35:23 -04:00
Ryan Poplin 94e7f677ad Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:21:47 -04:00
Guillermo del Angel 418ace463a More merge conflict resolution 2012-08-21 10:15:52 -04:00
Ryan Poplin 10961db3ce Another round of FindBugs fixes. Object returns its internal reference to an externally mutable array. Very dangerous. 2012-08-21 09:35:55 -04:00
Ryan Poplin 605acaae9c Another round of FindBugs fixes. Object internally stores a reference to an externally mutable array. Very dangerous. 2012-08-21 09:33:58 -04:00
Ryan Poplin 55b7949d68 Another round of FindBugs fixes. Comparator doesn't implement Serializable. 2012-08-21 09:20:55 -04:00
Christopher Hartl ba8622ff0d A number of stashed changes are lurking in here. In order of importance:
- Fix for M_Trieb's error report on the forum, and addition of integration tests to cover the walker.
 - Addition of StructuralIndel as a class of variation within the VariantContext. These are for variants with a full alt allele that's >150bp in length.
 - Adaptation of the MVLikelihoodRatio to work for a set of trios (takes the max over the trios of the MVLR)
 - InsertSizeDistribution changed to use the new gatk report output (it was previously broken)
 - RetrogeneDiscovery changed to be compatible with the new gatk report
 - A maxIndelSize argument added to SelectVariants
 - ByTranscriptEvaluator rewritten for cleanliness
 - VariantRecalibrator modified to not exclude structural indels from recalibration if the mode is INDEL
 - Documentation added to DepthOfCoverageIntegrationTest (no, don't yell at chartl ;_; )

Also sorry for the long commit history behind this that is the result of fixing merge conflicts. Because this *also* fixes a conflict (from git stash apply), for some reason I can't rebase all of them away. I'm pretty sure some of the commit notes say "this note isn't important because I'm going to rebase it anyway".
2012-08-21 07:08:58 -04:00
Eric Banks 3dfe8df262 Merged bug fix from Stable into Unstable 2012-08-20 23:12:58 -04:00
Eric Banks 40d5efc804 Fix for Adam K's reported bug: we weren't handling reads that were entirely insertions properly in LIBS. Specifically, the event bases were off-by-one (which was disastrous in Adam's case with a 1bp read). Added a unit test to cover this case. 2012-08-20 23:12:41 -04:00
Eric Banks 286b658fab Re-enabling parallelism in the BaseRecalibrator now that the release is out. 2012-08-20 21:25:14 -04:00
Guillermo del Angel 7bbd2a7a20 Fixing merge conflicts 2012-08-20 20:38:25 -04:00
Guillermo del Angel 2041cb853c New implementation of AD - ignore now non-informative reads based on per-read likelihoods 2012-08-20 20:31:34 -04:00
Ryan Poplin 77fbaec044 Another round of FindBugs fixes. Class implements its own compareTo() but uses base Object.equals() which can lead to unpredictable behavior. 2012-08-20 16:55:00 -04:00
Ryan Poplin 5e28bca630 Another round of FindBugs fixes. Should be static inner class. 2012-08-20 16:15:48 -04:00
Ryan Poplin 5db3bd6fd2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 15:28:57 -04:00
Ryan Poplin 464d49509a Pulling out common caller arguments into its own StandardCallerArgumentCollection base class so that every caller isn't exposed to the unused arguments from every other caller. 2012-08-20 15:28:39 -04:00
Eric Banks 4450d66c64 Fixing the docs for DP and AD 2012-08-20 15:10:24 -04:00
Ryan Poplin c67d708c51 Bug fix in HaplotypeCaller for non-regular bases in the reference or reads. Those events don't get created any more. Bug fix for advanced GenotypeFullActiveRegion mode: custom variant annotations created by the HC don't make sense when in this mode so don't try to calculate them. 2012-08-20 13:41:08 -04:00
Guillermo del Angel 5b5fee56cf Next iteration of new VA interface: extend changes to per-genotype annotations as well. Will allow to have AD correctly implemented at last (that change not done yet) 2012-08-20 12:52:15 -04:00
Eric Banks 154f65e0de Temporarily disabling multi-threaded usage of BaseRecalibrator for performance reasons. 2012-08-20 12:43:17 -04:00
Guillermo del Angel c384677917 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 10:27:25 -04:00
Eric Banks 97b191f578 Thanks to Guillermo I was able to isolate an instance of where the MLEAC > AN. It turns out that this is valid, e.g. when PLs are all 0s for a sample we no-call it but it's allowed to factor into the MLE (since that's the contract with the exact model). Removing the check in UG and instead protecting for it in the AlleleCount stratification. 2012-08-20 01:16:23 -04:00
Guillermo del Angel 963ad03f8b Second step of interface cleanup for variant annotator: several bug fixes, don't hash pileup elements to Maps because the hashCode() for a pileup element is not implemented and strange things can happen. Still several things to do, not done yet 2012-08-19 21:18:18 -04:00
Mark DePristo 7fa76f719b Print "Parsing data stream with BCF version BCFx.y" in BCF2 codec as .debug not .info 2012-08-19 10:32:55 -04:00
Mark DePristo 9121b98167 CombineVariants outputs the first non-MISSING qual, not the maximum
-- When merging multiple VCF records at a site, the combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value.  The previous behavior was to take the max QUAL, which resulted in sometimes strange downstream confusion.
2012-08-19 10:29:38 -04:00
Guillermo del Angel d9641e3d57 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-19 09:23:21 -04:00
Mauricio Carneiro d16cb68539 Updated and more thorough version of the BadCigar read filter
* No reads with Hard/Soft clips in the middle of the cigar
* No reads starting with deletions (with or without preceding clips)
* No reads ending in deletions (with or without follow-up clips)
* No reads that are fully hard or soft clipped
* No reads that have consecutive indels in the cigar (II, DD, ID or DI)

Also added systematic test for good cigars and iterative test for bad cigars.
2012-08-17 17:05:27 -04:00
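The five BadCigar rules listed in this commit can be sketched as a small check over a CIGAR string. This is a hedged illustration in Python, not the GATK read filter: `parse_cigar` and `is_bad_cigar` are hypothetical names, and the sketch does not validate other SAM constraints such as hard clips having to be outermost.

```python
import re

# One CIGAR element: a length followed by a SAM operator.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return the list of operators in `cigar`, e.g. '5S10M' -> ['S', 'M']."""
    elements = CIGAR_ELEMENT.findall(cigar)
    if "".join(length + op for length, op in elements) != cigar:
        raise ValueError("malformed CIGAR: %r" % cigar)
    return [op for _, op in elements]


def is_bad_cigar(cigar):
    """Apply the five rules from the commit message (sketch only)."""
    core = parse_cigar(cigar)
    # Strip leading and trailing clips; what remains is the aligned core.
    while core and core[0] in "SH":
        core.pop(0)
    while core and core[-1] in "SH":
        core.pop()
    if not core:
        return True  # fully hard or soft clipped
    if any(op in "SH" for op in core):
        return True  # clip in the middle of the cigar
    if core[0] == "D" or core[-1] == "D":
        return True  # starts or ends with a deletion, ignoring clips
    for first, second in zip(core, core[1:]):
        if first in "ID" and second in "ID":
            return True  # consecutive indels: II, DD, ID or DI
    return False


if __name__ == "__main__":
    for cigar in ["10M", "5S10M5S", "5M2S5M", "3S2D10M", "10S", "5M2I2D5M"]:
        print(cigar, is_bad_cigar(cigar))
```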
Mark DePristo 980685af16 Fix GSA-137: Having both DataSource.REFERENCE and DataSource.REFERENCE_BASES is confusing to end users.
-- Removed REFERENCE_BASES option.  You only have REFERENCE now.  There's no efficiency savings for the REFERENCE_BASES option any longer, since the reference bases are loaded lazily, so if you don't use them there's effectively no cost to making the RefContext that could load them.
2012-08-17 14:55:38 -04:00
Eric Banks 2676b7fc2e Put in a sanity check that MLEAC <= AN 2012-08-17 11:49:53 -04:00
Mark DePristo daa26cc64e Print to logger not to System.out in CachingIndexFastaSequenceFile when profiling cache performance 2012-08-17 11:49:02 -04:00
Mark DePristo be0f8beebb Fixed GSA-434: GATK should generate error when gzipped FASTA is passed in.
-- The GATK sort of handles this now, but only if you have the exactly correct sequence dictionary and FAI files associated with the reference.  If you do, the file can be .gz.  If not, the GATK will fail on creating the FAI and DICT files.  Added an error message that handles this case and clearly says what to do.
2012-08-17 11:49:02 -04:00
Mark DePristo a3d2764d11 Fixed: GSA-392 @arguments with just a short name get the wrong argument bindings
-- Now blows up if an argument begins with -.  Implementation isn't pretty, as it actually blows up during Queue extension creation with a somewhat obscure error message but at least it's something.
2012-08-17 11:49:01 -04:00
Mark DePristo 4c0f198d48 Potential fix for GSA-484: Incomplete writing of temp BCF when running CombineVariants in parallel
-- Keep reading from BCF2 input stream when read(byte[]) returns < number of needed bytes
-- It's possible (I think) that the failure in GSA-484 is due to multi-threading writing/reading of BCF2 records where the underlying stream is not yet flushed so read(byte[]) returns a partial result.  Now loops until we get all of the needed bytes or EOF is encountered
2012-08-17 11:49:01 -04:00
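The pattern this commit describes, continuing to read when a single read call returns fewer bytes than requested, can be sketched as below. It is a Python illustration of the general technique rather than the BCF2Decoder code; `read_fully` and `TrickleStream` are hypothetical names introduced only for the example.

```python
def read_fully(stream, n):
    """Read exactly n bytes from `stream`, or fewer only if EOF is hit.

    A single stream.read(k) may return fewer than k bytes, so keep
    asking for the remainder until the buffer is full or EOF.
    """
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:  # empty result is treated as EOF
            break
        buf.extend(chunk)
    return bytes(buf)


class TrickleStream:
    """Hypothetical stream returning at most `max_chunk` bytes per read()."""

    def __init__(self, data, max_chunk=3):
        self._data = data
        self._pos = 0
        self._max_chunk = max_chunk

    def read(self, size):
        end = min(self._pos + min(size, self._max_chunk), len(self._data))
        chunk = self._data[self._pos:end]
        self._pos = end
        return chunk


if __name__ == "__main__":
    print(read_fully(TrickleStream(b"0123456789"), 8))
```

The caller still has to compare the returned length with `n` to tell a complete record from a truncated one.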
Mark DePristo de3be45806 Proper function call in BCF2Decoder to validateReadBytes 2012-08-17 11:49:01 -04:00
Eric Banks 53383e82ec Hmm, not good. Fixing the math in PBT resulted in changed MD5s for integration tests that look like significant changes. I am reverting and will report this to Laurent. 2012-08-16 21:41:18 -04:00
Eric Banks 65c594afff Better error message for reads that begin/end with a deletion in LIBS 2012-08-16 21:27:07 -04:00
Guillermo del Angel b61ecc7c19 Fix merge conflicts 2012-08-16 20:45:52 -04:00
Guillermo del Angel d26183e0ec First preliminary big refactoring of UG annotation engine. Goals: a) Remove gigantic hack that cached per-read haplotype likelihoods in a static array so that annotations would go back and retrieve them, b) unify interface for annotations between HaplotypeCaller and UnifiedGenotyper, c) as a consequence, removed and cleaned duplicated code. As a bonus, annotations have now more relevant info to help them compute values.
Major idea is that per-read haplotype likelihoods are now stored in a single unified object of class PerReadAlleleLikelihoodMap. Class implementation in theory hides internal storage details from outside work (still may need work cleaning up interface), and this object(or rather, a Map from Sample->perReadAlleleLikelihoodMap) is produced by UGCalcLikelihoods. The genotype calculation is also able to potentially use this info if needed. All InfoFieldAnnotations now get an extra argument with this map. Currently, this map is only produced for indels in UG, or for all variants within HaplotypeCaller. If this map is absent (SNPs in UG), the old Pileup interface is used, but it's avoided whenever possible. FORMAT annotations are not yet changed but will be focus of second step. Major benefit will be that annotations will be able to very easily discard non-informative reads for certain events. HaplotypeCaller also uses this new class, and no longer hard-codes the mapping of allele ->list(reads) but instead uses the same objects and interfaces as the rest of the modules. Code still needs further testing/cleaning/reviewing/debugging
2012-08-16 20:36:53 -04:00
Mark DePristo 6a2862e8bc GSA-483: Bug in GATKdocs for Enums
-- Fixed to no longer show constants in enums as constant values in the gatkdocs
2012-08-16 16:24:17 -04:00
Eric Banks 3253fc216b FindBugs 'Maintainability' fixes 2012-08-16 15:53:06 -04:00
Eric Banks 05cbf1c8c0 FindBugs 'Efficiency' fixes 2012-08-16 15:40:52 -04:00
Mark DePristo d8071c66ed Removing SlowGenotype object from GATK 2012-08-16 15:23:06 -04:00
Eric Banks a22e7a5358 Should've run 'ant clean' instead of just 'ant'. In any event, these are 2 cases where we are setting a class's internal static variable directly. Very dangerous. 2012-08-16 15:07:32 -04:00
Eric Banks 47b4f7b7e5 One final FindBugs related fix. I think it's safe to consider these changes 'fixes' that are allowed to go in during a code freeze. 2012-08-16 14:59:05 -04:00
Eric Banks ded0e11b45 Killing off some FindBugs 'Reliability' issues 2012-08-16 14:00:48 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Eric Banks 611d9b61e2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-16 13:05:36 -04:00
Eric Banks 2df04dc48a Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests. 2012-08-16 13:05:17 -04:00
Mark DePristo 132cdfd9c1 GSA-488: MLEAC > AN error when running variant eval fixed 2012-08-16 13:03:14 -04:00
Mark DePristo 4e42988c66 GSA-485: Remove repairVCFHeader from GATK codebase
-- Removed half-a*ssed attempt to automatically repair VCF files with bad headers, which allowed users to provide a replacement header overwriting the file's actual header on the fly.  Not a good idea, really.  Eric has promised to create a utility that walks through a VCF file and creates a meaningful header field based on the file's contents (if this ever becomes a priority)
2012-08-16 13:03:13 -04:00
Mark DePristo 52bfe8db8a Make sure the storage writer is closed before running mergeInfo in multi-threaded output management
-- It's not clear this is cause of GSA-484 but it will help confirm that it's not the cause
2012-08-16 13:03:13 -04:00
Mark DePristo 7a247df922 Added -bcf argument to VCFWriter output to force BCF regardless of file extension
-- Now possible to do -o /dev/stdout -bcf -l DEBUG > tmp.bcf and create a valid BCF2 file
-- Cleanup code to make extensions easier by moving to a setX model in VariantContextWriterStub
2012-08-16 13:03:13 -04:00
Mark DePristo 28c8e3e6d7 Cleanup BCF2Codec
-- Remove FORBID_SYMBOLIC global that is no longer necessary
-- all error handling goes via error() function
2012-08-16 13:03:13 -04:00
Mark DePristo 9dc694b2e9 Meaningful error message and keeping tmp file when mergeInfo fails
-- BCF2 is failing for some reason when merging tmp. files with parallel combine variants.  ThreadLocalOutputTracker no longer sets deleteOnExit on the tmp file, as this prevents debugging.  And it's unnecessary because each mergeInto was deleting files as appropriate
-- MergeInfo in VariantContextWriterStorage only deletes the intermediate output if an error occurs
2012-08-16 13:03:13 -04:00
Eric Banks f368e568db Implementing support in BaseRecalibrator for SOLiD no call strategies other than throwing an exception. For some reason we never transferred these capabilities into BQSRv2 earlier. 2012-08-15 22:52:56 -04:00
Eric Banks 9d09230c26 Better docs for verbose output of Pileup 2012-08-15 21:55:08 -04:00
Mark DePristo c0a31b2e5b CombineVariants parallel integration tests
-- All tests but one (using old bad VCF3 input) run unmodified with parallel code.
-- Disabled UNSAFE_VCF_PROCESSING for all but that test, which changes md5s because the output files have fixed headers
-- Minor optimizations to simpleMerge
2012-08-15 21:13:16 -04:00
Mark DePristo 669c43031a BCF2 optimizations; parallel CombineVariants
-- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header.  Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently)
-- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils.  Fixed bug in edge case that never occurred in practice
-- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping it's right)
-- More ways to access the data in VCFHeader
-- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output
-- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary.  We just guess that on average we need ~10 elements for the attribute map
-- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet
-- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement
2012-08-15 21:13:16 -04:00
Mark DePristo ae4d4482ac Parallel combine variants!
-- CombineVariants is now TreeReducible!
-- Integration tests running in parallel all pass except one (will fix) due to incorrect use of db=0 flag on input from old VCF format
2012-08-15 21:13:15 -04:00
Mark DePristo bd7ed0d028 Enable efficient parallel output of BCF2
-- Previous IO stub was hardcoded to write VCF.  So when you ran -nt 2 -o my.bcf you actually created intermediate VCF files that were then encoded single threaded as BCF.  Now we emit natively per thread BCF, and use the fast mergeInfo code to read BCF -> write BCF.  Upcoming optimizations to avoid decoding genotype data unnecessarily will enable us to really quickly process BCF2 in parallel
-- VariantContextWriterStub forces BCF output for intermediate files
-- Nicer debug log message in BCF2Codec
-- Turn off debug logging of BCF2LazyGenotypesDecoder
-- BCF2FieldWriterManager now uses .debug not .info, so you won't see all of that field manager debugging info with BCF2 any longer
-- VariantContextWriterFactory.isBCFOutput now has version that accepts just a file path, not path + options
2012-08-15 21:13:15 -04:00
Mark DePristo 9459e6203a Clean, documented implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Expanded unit tests
-- Support for clean logging of results to logger
-- Refactored MyTime into AutoFormattingTime in Utils, out of TraversalEngine, for cleanliness and reuse
-- Added docs and contracts to StateMonitoringThreadFactory
2012-08-15 21:13:15 -04:00
Mark DePristo be3230a1fd Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Created makeCombinations utility function (very useful!).  Moved template from VariantContextTestProvider
-- UnitTests for basic functionality
2012-08-15 21:13:15 -04:00
Mark DePristo f277d7c09e Removing parallelism bottleneck in the GATK
-- GenomeLocParser cache was a major performance bottleneck in parallel GATK performance.  With 10 threads > 50% of each thread's time was spent blocking on the MasterSequencingDictionary object.  Made this a thread local variable.
-- Now we can run the GATK with 48 threads efficiently on GSA4!
  -- Running -nt 1 => 75 minutes (didn't let it run all of the way through so likely would take longer)
  -- Running -nt 24 => 3.81 minutes
2012-08-15 21:13:15 -04:00
Eric Banks 87e41c83c5 In AlleleCount stratification, check to make sure the AC (or MLEAC) is valid (i.e. not higher than number of chromosomes) and throw a User Error if it isn't. Added a test for bad AC. 2012-08-14 15:02:30 -04:00
Eric Banks 8e3774fb0e Fixing behavior of the --regenotype argument in SelectVariants to properly run in GenotypeGivenAlleles mode. Added integration tests to cover recent SV changes. 2012-08-14 14:21:42 -04:00
Eric Banks 34b62fa092 Two changes to SelectVariants: 1) don't add DP INFO annotation if DP wasn't used in the input VCF (it was adding DP=0 previously). 2) If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the VC. 2012-08-14 12:54:31 -04:00
Eric Banks cfb994abd2 Trivial removal of unused variable (mentioned in resolved JIRA entry) 2012-08-13 22:55:02 -04:00
Khalid Shakir f809f24afb Removed SelectHeader's --include_reference_name option since the reference is always included.
In SelectHeaders instead of including the path to the file, only include the name of the reference since dbGaP does not like paths in headers.
2012-08-13 16:49:27 -04:00
Mark DePristo 6ad75d2f5c Reverting changes to BCF2 ranges
-- The previously expanded ones are actually the missing values in the range.  The previous ranges were correct.  Removed the TODO to confirm them, as they are now officially confirmed
2012-08-13 15:06:28 -04:00
Mark DePristo 4d3fad38e9 Increase allowable range for BCF2 by -1 on low-end 2012-08-13 14:20:26 -04:00
Mark DePristo f032e0aba4 A bit better output for ContextCovariate context size logging 2012-08-12 13:45:52 -04:00
Mark DePristo 243af0adb1 Expanded the BQSR reporting script
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libraries now, including gplots, grid, and gsalib to run correctly
2012-08-12 13:45:14 -04:00
Mark DePristo 458bbdee8f Add useful logger.info telling us the mismatch and indel context sizes 2012-08-12 10:27:05 -04:00
Eric Banks 40f0320a1c When adding a unit test to LIBS for X and = CIGAR operators, I uncovered a bug with the implementation of the ReadBackedPileup.depthOfCoverage() method. 2012-08-10 14:58:29 -04:00
Eric Banks eca9613356 Adding support of X and = CIGAR operators to the GATK 2012-08-10 14:54:07 -04:00
Ami Levy Moonshine 68fb04b8f7 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable into testing 2012-08-09 16:48:22 -04:00
Mark DePristo 06258c8a01 BCF2 optimizations
-- Added Write method to BCF2 types that directly converts int value to byte stream.  Deleted writeRawBytes(int)
-- encodeTypeDescriptor semi-inlined into encodeType so that the tests for overflow are done in just one place
-- Faster implementation of determineIntegerType for int[] values
2012-08-09 16:36:18 -04:00
Mark DePristo c6bd9b15ff BCF2 optimizations
-- BCF2Type enum has an overloaded method to read the type as an int from an input stream.  This gets rid of a case statement and replaces it with just minimum tiny methods that should be better optimized.  A side effect of this optimization is an overall cleaner code organization
2012-08-09 16:36:18 -04:00
Mark DePristo 9a0dda71d4 BCF2 optimizations
-- All low-level reads throw IOException instead of catching it directly.  This allows us to not try/catch in readByte, improving performance by 5% or so
-- Optimize encodeTypeDescriptor with final variables.  Avoid using Math.min; do an inline comparison instead
-- Inlined willOverflow directly in its single use
2012-08-09 16:36:18 -04:00
Ryan Poplin 9887bc4410 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 16:31:06 -04:00
Ryan Poplin f4c72a26d5 A few quick, minor findbugs fixes. 2012-08-09 16:30:58 -04:00
Ryan Poplin c7f22e410f A few quick, minor findbugs fixes. 2012-08-09 16:22:08 -04:00
Eric Banks def077c4e5 There's actually a subtle but important difference between foo++ and ++foo 2012-08-09 12:42:50 -04:00
Ryan Poplin e48727dae3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 10:31:10 -04:00
Guillermo del Angel 5be7e0621d Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-09 09:58:34 -04:00
Guillermo del Angel 71ee8d87b3 Rename per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarify wording in VCF header 2012-08-09 09:58:20 -04:00
Eric Banks 35cec8530c Make coverage threshold in FindCoveredIntervals a command-line argument 2012-08-08 21:44:24 -04:00
Ryan Poplin 1223d77546 Removing argument from HaplotypeCaller that was made unnecessary by recent improvements to triggering around large events 2012-08-08 15:13:20 -04:00
Eric Banks 0a2a646a52 Other random FindBugs fixes 2012-08-08 14:56:27 -04:00
Eric Banks 4c84cc9486 Quick pass of FindBugs 'should be static inner class' fixes. 2012-08-08 14:42:06 -04:00
Eric Banks a0196c9f5b Quick pass of FindBugs 'method invokes inefficient Number constructor' fixes. 2012-08-08 14:34:16 -04:00
Eric Banks 4b2e3cec0b Quick pass of FindBugs 'inefficient use of keySet iterator instead of entrySet iterator' fixes for core tools. 2012-08-08 14:29:41 -04:00
Guillermo del Angel 3e2752667c Intermediate checkin for ReducedReads with HaplotypeCaller - change min read count over k-mer to average count over k-mer when doing assembly of a reduced read (not optimal, currently trying max and then will decide on best approach), fix merge conflicts 2012-08-08 12:07:33 -04:00
David Roazen a7811d673f Update URL for phone home / GATK key documentation output by the GATK upon error 2012-08-08 09:29:54 -04:00
Mark DePristo cda8d944b7 Bugfixes for BCF with VQSR
-- Old version converted doubles directly from strings.  New version uses VariantContext getAttributeAsDouble() that looks at the values directly to determine how to convert from Object to Double (via Double.valueOf, (Double), or (Double)(Integer)).
-- getAttributeAsDouble() is now smart in converting integers to doubles as needed
-- Removed unnecessary logging info in BCF2Codec
-- Added integration tests to ensure that VQSR works end-to-end with BCF2 using sites version of the file khalid sent to me
-- Added vqsr.bcf_test.snps.unfiltered.bcf file for this integration test
2012-08-07 17:22:39 -04:00
Mark DePristo 80b94a4f9a AdaptiveContexts implement pruning to a given chi2 p value
-- Added Bonferroni-corrected p-value pruning, so you tell it how significant a difference you are willing to collapse in the tree, and it prunes the tree down to this maximum threshold
-- Penalty is now a phred-scaled p-value not the raw chi2 value
-- Split command line arguments in VisualizeContextTree into separate arguments for each type of pruning
2012-08-07 17:22:39 -04:00
Mark DePristo 982c735c76 VisualizeAdaptiveTree now considers only leaf nodes when computing max/min penalty 2012-08-07 17:22:39 -04:00
Ryan Poplin 15085bf03e The UnifiedGenotyper now makes use of base insertion and base deletion quality scores if they exist in the reads. 2012-08-07 13:58:22 -04:00
Guillermo del Angel 97c5ed4feb Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 20:22:31 -04:00
Guillermo del Angel 238d55cb61 Fixes for running HaplotypeCaller with reduced reads: a) minor refactoring, pulled out code to compute mean representative count to ReadUtils, b) Don't use min representative count over kmer when constructing de Bruijn graph - this creates many paths with multiplicity=1 and makes us lose a lot of SNP's at edge of capture targets. Use mean instead 2012-08-06 20:22:12 -04:00
Ryan Poplin f1c30c3a59 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 12:02:26 -04:00
Mark DePristo 44f160f29f indelGOP and indelGCP are now advanced, not hidden arguments 2012-08-06 11:42:55 -04:00
Mark DePristo 2f004665fb Fixing public -> private dep 2012-08-06 11:42:55 -04:00
Mark DePristo 7bf5ca51ee Major bugfix for adaptive contexts
-- Basically I was treating the context history in the wrong direction, effectively predicting the further bases in the context based on the closer one.  Totally backward.  Updated the code to build the tree in the right direction.
-- Added a few more useful outputs for analysis (minPenalty and maxPenalty)
-- Misc. cleanup of the code
-- Overall I'm not 100% certain this is even the right way to think about the problem.  Clearly this is producing a reasonable output but the sum of chi2 values over the entire tree is just enormous.  Perhaps an MCMC convergence / sampling criterion would be a better way to think about this problem?
2012-08-06 11:42:55 -04:00
Mark DePristo b4841548f1 Bug fixes and misc. improvements to running the adaptive context tools
-- Better output file name defaults
-- Fixed nasty bug where I included non-existent quals in the contexts to process because they showed up in the Cycle covariate
-- Data is processed in qual order now, so it's easier to see progress
-- Logger messages explaining where we are in the process
-- When in UPDATE mode we still write out the information for an equivalent prune by depth for post analysis
2012-08-06 11:42:55 -04:00
Ryan Poplin b8709d8c67 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 11:41:28 -04:00
Eric Banks 210db5ec27 Update -maxAlleles argument to -maxAltAlleles to make it more accurate. The hidden GSA production -capMaxAllelesForIndels argument also gets updated. 2012-08-06 11:31:18 -04:00
Eric Banks 8f95a03bb6 Prevent NumberFormatExceptions when parsing the VCF POS field 2012-08-06 11:19:54 -04:00
Ryan Poplin b7eec2fd0e Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly. 2012-08-05 12:29:10 -04:00
Mark DePristo e1bba91836 Ready for full-scale evaluation adaptive BQSR contexts
-- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees
-- Docs, algorithm descriptions, etc so that it makes sense what's going on
-- VisualizeContextTree should really be simplified into a single tool that just visualizes the trees when / if we decide to make adaptive contexts a standard part of BQSR
-- Misc. cleaning, organization of the code (recalibration tests were in private but corresponding actual files were public)
2012-08-03 16:02:53 -04:00
Guillermo del Angel 6f8e7692d4 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-03 12:24:37 -04:00
Guillermo del Angel 9e25b209e0 First pass of implementation of Reduced Reads with HaplotypeCaller. Main changes: a) Active region: scale PL's by representative count to determine whether region is active. b) Scale per-read, per-haplotype likelihoods by read representative counts. A read representative count is (temporarily) defined as the average representative count over all bases in read, TBD whether this is good enough to avoid biases in GL's. c) DeBruijn assembler inserts kmers N times in graph, where N is min representative count of read over kmer span - TBD again whether this is the best approach. d) Bug fixes in FragmentUtils: logic to merge fragments was wrong in cases where there is discrepancy of overlaps between unclipped/soft clipped bases. Didn't affect things before but RR makes hard-clipped bases in CIGARs more prevalent so this was exposed. e) Cache read representative counts along with read likelihoods associated with a Haplotype. Code can/should be cleaned up and unified with PairHMMIndelErrorModelCode, as well as refactored to support arbitrary ploidy in HaplotypeCaller 2012-08-03 12:24:23 -04:00
Ryan Poplin 8817fc70d1 Merged bug fix from Stable into Unstable 2012-08-03 10:45:01 -04:00
Ryan Poplin f40d0a0a28 Updating VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller. Integration tests change because of the MNPs in dbSNP. 2012-08-03 10:44:36 -04:00
Joel Thibault 51bd03cc36 Add RemoveProgramRecords annotation to ActiveRegionWalker 2012-08-03 09:54:16 -04:00
Joel Thibault addbfd6437 Add a RemoveProgramRecords annotation
* Add the RemoveProgramRecords annotation to LocusWalker
2012-08-03 09:54:16 -04:00
Joel Thibault 524d7ea306 Choose whether to keep program records based on Walker
* Add keepProgramRecords argument
* Make removeProgramRecords / keepProgramRecords override default
2012-08-03 09:54:16 -04:00
Mark DePristo e04989f76d Bugfix for new PASS position in dictionary in BCF2 2012-08-03 09:42:21 -04:00
Mark DePristo fb5dabce18 Update BCF2 to include a minor version number so we can rev (and report errors) with BCF2
-- We are now likely to fail with an error when reading old BCF files, rather than just giving bad results
-- Added new class BCFVersion that consolidates all of the version management of BCF
2012-08-02 17:30:30 -04:00
Eric Banks e3f89fb054 Missing/malformed GATK report files are user errors 2012-08-02 11:33:21 -04:00
Mark DePristo c3c3d18611 Update BCF2 to put PASS as offset 0 not at the end
-- Unfortunately this commit breaks backward compatibility with all existing BCF2 files...
2012-08-01 17:09:22 -04:00
Mark DePristo ccac77d888 Bugfix for incorrect allele counting in IndelSummary
-- Previous version would count all alt alleles as present in a sample, even if only 1 were present, because of the way VariantEval subsetted VCs
-- Updated code for subsetting VCs by sample to be clearer about how it handles rederiving alleles
-- Update a few pieces of code to get previous correct behavior
-- Updated a few MD5s as now ref calls at sites in dbSNP are counted as having a comp site, and therefore show up in known sites when Novelty strat is on (which I think is correct)
-- Walkers that used old subsetting function with true are now using clearer version that does rederive alleles by default
2012-08-01 15:45:12 -04:00
Joel Thibault 2b25df3d53 Add removeProgramRecords argument
* Add unit test for the removeProgramRecords
2012-08-01 15:33:05 -04:00
Ryan Poplin d53105668b Merged bug fix from Stable into Unstable 2012-08-01 14:53:06 -04:00
Ryan Poplin fabca66d09 Another fix to VQSR docs 2012-08-01 14:52:49 -04:00
Ryan Poplin 2be29ebd22 Merged bug fix from Stable into Unstable 2012-08-01 14:35:30 -04:00
Ryan Poplin 4093909a56 Updating VQSR docs. Removing references to old best practices pages. 2012-08-01 14:30:24 -04:00
Eric Banks 52b93cab62 Merged bug fix from Stable into Unstable 2012-08-01 13:17:36 -04:00
Eric Banks 22bf052828 Fixing BQSR GATK docs 2012-08-01 13:17:16 -04:00
Eric Banks 459832ee16 Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions as reported a while back on GS 2012-08-01 10:45:04 -04:00
Eric Banks a4a41458ef Update docs of FastaAlternateReferenceMaker as promised in older GS thread 2012-08-01 10:33:41 -04:00
Eric Banks 38e5419b11 Merged bug fix from Stable into Unstable 2012-08-01 09:50:31 -04:00
Eric Banks 56f8afab97 Requested by Geraldine: adding a utility to register deprecated walkers (and the major version of the first release since they were removed) so that the User Error printed out for e.g. CountCovariates now states: Walker CountCovariates is no longer available in the GATK; it has been deprecated since version 2.0. 2012-08-01 09:50:00 -04:00
Guillermo del Angel 0528337467 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-31 18:17:50 -04:00
Guillermo del Angel 4a23f3cd11 Simple cleanup of pool caller code - since usage is much more general than just calling pools, AF calculation models and GL calculation models are renamed from Pool -> GeneralPloidy. Also, don't have users specify special arguments for -glm and -pnrm. Instead, when running UG with sample ploidy != 2, the correct general ploidy modules are automatically detected and loaded. -glm now reverts to old [SNP|INDEL|BOTH] usage 2012-07-31 16:34:20 -04:00
Eric Banks 6cb10cef96 Fixed older GS reported bug. Actually, the problem really lies in Picard (can't set max records in RAM without it throwing an exception, reported on their JIRA) so I just masked out the problem by removing this never-used argument from this rarely-used tool. 2012-07-31 16:00:36 -04:00
Eric Banks ab53d73459 Quick fix to user error catching 2012-07-31 15:50:32 -04:00
Eric Banks 10111450aa Fixed AlignmentUtils bug for handling Ns in the CIGAR string. Added a UG integration test that calls a BAM with such reads (provided by a user on GetSatisfaction). 2012-07-31 15:37:22 -04:00
Mark DePristo f7133ffc31 Cleanup syntax errors from BQSR reorganization 2012-07-31 08:11:05 -04:00
Mark DePristo dad9bb1192 Changes order of writing BaseRecalibrator results so that if R blows up you still get a meaningful tree 2012-07-31 08:11:04 -04:00
Mark DePristo 0c4e729e13 Working version of adaptive context calculations
-- Uses chi2 test for independence to determine if subcontext is worth representing.   Gives excellent visual results
-- Writes out analysis output file producing excellent results in R
-- Trivial reformatting of MathUtils
2012-07-31 08:11:04 -04:00
Mark DePristo 93640b382e Preliminary version of adaptive context covariate algorithm
-- Works according to visual inspection of output tree
2012-07-31 08:11:04 -04:00
Mark DePristo 315d25409f Improvement to RecalDatum and VisualizeContextTree
-- Reorganize functions in RecalDatum so that error rate can be computed independently.  Added unit tests.  Removed equals() method, which is buggy without its associated implementation for hashcode
-- New class RecalDatumTree based on QualIntervals that inherits from RecalDatum but includes the concept of sub data
-- VisualizeContextTree now uses RecalDatumTree and can trivially compute the penalty function for merging nodes, which it displays in the graph
2012-07-31 08:11:04 -04:00
Mark DePristo 57b45bfb1e Extensive unit tests, contracts, and documentation for RecalDatum 2012-07-31 08:11:03 -04:00
Mark DePristo e00ed8bc5e Cleanup BQSR classes
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration.  It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers.  As code becomes embedded throughout GATK it should be refactored to live in utils
-- Removed unnecessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces.  This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
Mark DePristo 191294eedc Initial cleanup of RecalDatum for move and further refactoring
-- Moved Datum, the now unnecessary superclass, into RecalDatum
-- Fixed some obviously dangerous synchronization errors in RecalDatum, though these may not have caused problems because they may not have been called in parallel mode
2012-07-31 08:11:03 -04:00
Mark DePristo 0670316288 Be clearer that dcov 50 is good for 4x, should use 200 for >30x 2012-07-31 08:11:02 -04:00
Mark DePristo 874dbf5b58 Maximum wait for GATK run report upload reduced to 10 seconds 2012-07-31 08:11:02 -04:00
Ryan Poplin 7ed06ee7b9 Updating FindCoveredIntervals to use the changes to the ActiveRegionWalker. 2012-07-30 12:16:27 -04:00
Ryan Poplin 13591b169f Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 12:13:24 -04:00
Eric Banks 0b30588d67 Catch yet another class of User Errors 2012-07-30 11:59:56 -04:00
Eric Banks 5743694196 Merged bug fix from Stable into Unstable 2012-07-30 11:35:28 -04:00
Eric Banks 79195b97a3 Adding categories for the remaining uncategorized walkers 2012-07-30 11:35:08 -04:00
Eric Banks 2b1b00ade5 All integration tests and VC/Allele unit tests are passing 2012-07-27 17:03:49 -04:00
Eric Banks beb7610195 Resolving merge conflicts 2012-07-27 15:52:02 -04:00
Eric Banks 27e7e11ec0 Allele refactoring checkpoint #3: all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this. 2012-07-27 15:48:40 -04:00
Ryan Poplin 22bb4804f0 HaplotypeCaller now uses an excessive number of high quality soft clips as a triggering signal in order to capture both end points of a large deletion in a single active region. 2012-07-27 12:44:02 -04:00
Ryan Poplin a0890126a8 ActiveRegionWalker's isActive function returns a results object now instead of just a double. 2012-07-27 11:01:39 -04:00
Eric Banks ef335b6213 Several more walkers have been brought up to use the new Allele representation. 2012-07-27 02:14:25 -04:00
Eric Banks 9e2209694a Re-enable reverse trimming of alleles in UG engine when sub-selecting alleles after genotyping. UG integration tests now pass. 2012-07-27 00:47:15 -04:00
Eric Banks baf3e33730 Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass. 2012-07-26 23:27:11 -04:00
Ryan Poplin 35e803e110 Merged bug fix from Stable into Unstable 2012-07-26 14:00:04 -04:00
Ryan Poplin 4f741b4cd7 Smoothing in the BQSR bins should be one error observation and one non-error observation. 2012-07-26 13:59:02 -04:00
Guillermo del Angel 2ae890155c Improvements to indel calling in pool caller: a) Compute per-read likelihoods in reference sample to determine whether a read is informative or not. b) Fixed bugs in unit tests. c) Fixed padding-related bugs when computing matches/mismatches in ErrorModel, d) Added a couple of more integration tests to increase test coverage, including testing odd ploidy 2012-07-26 13:43:00 -04:00
Eric Banks a694d1b5de Merge branch 'master' into allelePadding 2012-07-26 01:53:14 -04:00
Eric Banks 32516a2f60 Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels will fail at this point. 2012-07-26 01:50:39 -04:00
Mark DePristo 8c418a15da Sorting out HMS error handling (fingers crossed)
-- Check if a traversal error occurred in the last shard
-- Catch ExecutionException from the TreeReducer and throw as our HMS exception
-- ShardTraverser just throws the exception as formatted by the HMS, rather than wrapping it as a RuntimeException itself
-- EngineFeaturesIntegrationTests now uses public exampleFASTA (faster), and does 1000x iterations (slower)
2012-07-25 23:13:12 -04:00
Mark DePristo 9242f63a4d On the way to really sorting out HMS error handling
-- Better error message when a traversal error occurs (a real bug)
-- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50x times
-- A bit of cleanup in WalkerTest
2012-07-25 22:11:10 -04:00
Eric Banks 7eb3f54750 Added category docs for the remaining public walkers (I think I got them all). I removed a couple of totally unnecessary walkers. 2012-07-25 21:40:28 -04:00
Eric Banks 2982b24c4b Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-07-25 20:36:53 -04:00
Eric Banks 0a98a6aa8d Adding extraDocs tag per Mauricio's request 2012-07-25 18:23:18 -04:00
Mauricio Carneiro fce5cb9f35 Few category changes 2012-07-25 17:23:02 -04:00
Eric Banks 05fa377a8e Adding GATK categories to standard walkers. Will add to remaining walkers after the next successful release (so that I can see which walkers are public and still need it). 2012-07-25 16:05:47 -04:00
Mauricio Carneiro d46cf47bd1 Updating Read Filter documentation 2012-07-25 15:05:47 -04:00
Eric Banks 6a3bfa3811 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-07-25 14:11:11 -04:00
Eric Banks 357e0b35af Register GATK-full-only walkers and rethrow the missing walker error as a not supported in GATK lite error 2012-07-25 14:11:03 -04:00
Roger Zurawicki 5b74763096 Removed Categories.
We will use DocumentedGATKFeatures to create categories in our documentation. Eric I guess will be in charge of this. We need to remove walkers and think how to categorize everything.

Tools can be hidden from GATKdocs with the @Hidden annotation

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-07-25 13:46:24 -04:00
Eric Banks a5721a8846 Context covariate optimizations were not suited for multiple threads, so I removed them (since that ended up being much, much easier than trying to make the covariates thread local). Added -nt 2 layer to BQSR integration tests to confirm that it now works with multiple threads. 2012-07-25 13:38:07 -04:00
Eric Banks e0c07f5567 Reverting old commits that made error handling better because ultimately they made things worse. 2012-07-25 12:37:59 -04:00
Mark DePristo fcefa61bce Remove reference dependence in BCF2Codec
-- Adding BCF2Codec to VCF.jar and associated unit tests

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Mark DePristo 19a257a5c1 Multiple bugfixes
-- VariantFiltration now properly sets passFilters in VC
-- BCF2 writer now properly decodes lazy BCF genotype data that it uses.  Improper use generated a horrible subtle bug but the good news is that the extra checks I put in (unnecessarily a few days ago) caught the bug!

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Mark DePristo 3066894215 Bugfix for BCF2
-- Always decode genotypes block when writing out a BCF file.  If the header changes (and we currently don't know this easily) then the dictionary keys used in the genotypes block may be invalid.  Temporarily added a private static boolean that turns off writing of the blocks until Eric and his team rewrite the header.

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Guillermo del Angel eb55061fd0 a) Document BEAGLE codec, b) Bug fix: inbreeding coefficient shouldn't be computed for non-diploid organisms in current implementation 2012-07-24 12:16:15 -04:00
Mauricio Carneiro 348e86159e Moving doclets to public 2012-07-23 23:52:14 -04:00
Mauricio Carneiro 5cd98a36b9 Making ForumAPIUtils public 2012-07-23 17:44:24 -04:00
Mauricio Carneiro 3d92f041f3 forgot to delete the merging line 2012-07-23 17:35:07 -04:00
Roger Zurawicki f3c504769b Added the ability to update the Forum
GATKDocs looks for a key on gsa4, and updates the forum with new walker if it exists.
More changes were made to the GATKDocs. Works nicely with bootstrap on and offline.
Cleaned up the code as well

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-07-23 17:17:33 -04:00
Khalid Shakir 46ca49b63d Removed 'Walker' suffix from packages/GATKEngine.xml that were breaking the packaged release.
Archived AnalyzeCovariates scripts and removed references in build packages / GATK extensions.
2012-07-23 16:32:31 -04:00
Ryan Poplin 2a14bbe4f0 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-23 11:28:26 -04:00
Ryan Poplin 10d143c35c Adding error model header names in the BQSR recal plot. Making the downsampling of points look a little nicer. 2012-07-23 11:28:17 -04:00
Eric Banks 675ccab2fa Renaming BQSR to BaseRecalibrator 2012-07-23 10:17:17 -04:00
Ryan Poplin 2e486d83e2 Updating HaplotypeCaller docs and expanding integration tests. 2012-07-23 10:05:42 -04:00
Mauricio Carneiro 921eaad33f Generalized the default platform parameter in BQSRv2
Parameter wasn't working outside of the BQSR walker. It now takes the information on the recalibration report in other tools (PrintReads for example) and treats all reads as coming from the defined default platform.
2012-07-20 17:29:13 -04:00
Mauricio Carneiro 5dc2143142 Removed support for walkers ending with "Walker" from the engine.
If your walker has "Walker" in the name, you will have to use "Walker" on the -T to access it.
2012-07-20 17:27:11 -04:00
Mauricio Carneiro d446d34227 GATK Error messages now point to the new website instead of GetSatisfaction. 2012-07-20 17:27:11 -04:00
Mauricio Carneiro 116885a450 Removed the "Walker" suffix from all walkers that had it.
* Did not touch archived walkers... those can be named whatever.
* Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...)
* Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.
2012-07-20 17:27:11 -04:00
Christopher Hartl 3ee46cced2 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-19 21:25:40 -04:00
Christopher Hartl af383c30b5 Ensure that the gene summary has a header line 2012-07-19 21:24:04 -04:00
Mark DePristo 2ca5fc62a2 Support for MISSING BCF2 type
-- Heng wants to use 0x0? to represent any missing type value, which in our implementation was invalid.  Updated our codebase to support this construct.  Heng said he'll update the BCF2 quick reference.
-- Enabled integration test reading Heng's ex2.bcf file
-- GATK now only warns in the case where the END info field isn't the same (or +1 due to padding) as the getEnd() function as determined by the GATK.  Turns out there's a single record in the 1000G SV call set that doesn't have the right length
-- VariantContextTestProvider now tests that X = Y where X -> writing -> reading -> writing -> reading = Y for a variety of variant context inputs X
-- Added integration test reading 1000G SV chr1 calls (from Chris)
2012-07-19 16:14:26 -04:00
Guillermo del Angel c16f9f2f15 a) Use new method to check for GATK Lite, b) minor improvements to indel pool caller (more to come): brain-dead, quick way to limit number of alt alleles to genotype. We can't process too many alt alleles because of the combinatorial explosion of GL values with high ploidy, and some STR validation targets had up to 12 alt alleles, resulting in GL vectors of > 1e8 elements. Can't use pileup elements since typically not many alleles will be in one pileup, and different alleles will appear in different samples, TBD a nicer solution. c) Commit to posterity scala script for large scale validation calling, still work in progress 2012-07-19 10:24:08 -04:00
Eric Banks e370030e6c As requested by Mark, I've broken out the code to pull out the protected subclass when available (and otherwise use the public version) into the GATKLiteUtils class. People should use this code instead of reimplementing all of the java reflection on their own. 2012-07-18 22:44:37 -04:00
Eric Banks d46ccec04e Adding Unit Tests to cover the exception catching for Picard errors: because we are using String matching, we want to ensure that we know if/when the exception text changes underneath us. 2012-07-18 21:48:58 -04:00
Mark DePristo 74e153ff4a FisherStrand now uses RankSumTest isUsableBase to decide if a read should be included in testing
-- Previously used hardcoded MAPQ > 20 && QUAL > 20 but now uses isUsableBase
-- Updating MD5s as appropriate
2012-07-18 16:07:47 -04:00
Mark DePristo dede3a30e9 Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added test data VCF and integration tests to ensure this behavior continues in the future
-- TODO: actually run integration tests when I have an internet connection
2012-07-18 16:07:47 -04:00
Mark DePristo 559a4826be Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added test data VCF and integration tests to ensure this behavior continues in the future
2012-07-18 16:07:46 -04:00
Mark DePristo dc292c0317 FisherStrand now includes all reads and bases, regardless of mapping quality and base quality, just like other annotations
-- This actually proved to be a problem with Ion Torrent data where the base quality can be quite low, and so we need to include Q15 bases for calling effectively.
2012-07-18 16:07:46 -04:00
Eric Banks 2c0f073ab1 Make -qq arg hidden for now since it's still very experimental 2012-07-18 15:43:25 -04:00
Eric Banks b46c85e8b4 More bad BAM file catching 2012-07-18 15:26:31 -04:00
Eric Banks 659eee13a6 Handle NPE generated in UG when non-standard reference bases are present in the fasta 2012-07-18 15:16:27 -04:00
Eric Banks 9af2cfe283 Catch underlying file system problems that get masked as Tribble index errors. There's also a quick patch to the HMS that isn't really the ultimate fix needed; Mark and I will review at a later point. 2012-07-18 15:11:38 -04:00
Eric Banks 4c730542f0 Handle RuntimeExceptions thrown by Picard that are really User Errors. I will add unit tests for these as best I can later. 2012-07-18 13:56:35 -04:00
Eric Banks ae08d35138 Catch 'too many open files' errors that show up when trying to read the bam index. All that needs to be done is to flesh out the original error message (because it will get caught later and rethrown correctly). 2012-07-18 12:57:34 -04:00
Eric Banks f2fe59a9d4 Wow, there are a ton of errors captured having to do with being unable to merge the temp Tribble output. I'm expanding the error message a bit to help see if we can do anything going forward. 2012-07-18 12:31:59 -04:00
Eric Banks e4db8dde91 Enabled a whole other bunch of integration tests for BQSRv2. While I was there I also changed the default context size for indels to 3 (from 8) since that's what works best in the current implementation (as suggested by Ryan). At this point, all of the new core tools (ReduceReads, BQSRv2, HaplotypeCaller, UG extensions) have been moved over to protected and should be stable. Looks like we are pretty much ready for GATK 2.0! 2012-07-17 23:36:43 -04:00
Eric Banks a8d08ea18d As a user pointed out, it is not valid for a GenomeLoc to have a start or stop equal to 0. 2012-07-17 22:18:43 -04:00
Guillermo del Angel 29273abab7 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 16:58:12 -04:00
Guillermo del Angel 731bbba2e6 Bug fixes for integration test, use correct new UG syntax 2012-07-17 16:57:59 -04:00
Eric Banks 33be41ecf5 Cleaning up integration test 2012-07-17 16:06:04 -04:00
Eric Banks 8dbc9cb29c Add the ability to emit the original quals in the OQ tag 2012-07-17 15:52:56 -04:00
Guillermo del Angel 40b8c7172c Pool Caller refactoring in preparation of GATK 2.0: a) PoolCallerUnifiedArgumentCollection disappeared, and arguments moved to UnifiedArgumentCollection. b) PoolCallerWalker is no longer needed and redundant, all functionality subsumed by UG. UG now checks if GATK is lite - if so, don't allow ploidy > 2. c) Moved pool classes from private to protected. d) Changed the way to specify ploidy. Instead of specifying samples per pool and having ploidy = 2*samplesPerPool, have user specify ploidy directly, which is cleaner. Update tests accordingly. We can now call triploid seedless grape genotypes correctly in theory. e) Renamed argument -reference to -reference_sample_calls since the former is ambiguous and it's not clear what it refers to. 2012-07-17 15:27:04 -04:00
Laurent Francioli 68d0e4dd6d - Multi-allelic sites are now correctly ignored - Reporting of mendelian violations enhanced - Corrected TP overflow by capping it to Byte.MAX_VALUE
- Updated integration tests to reflect changes in MVF file output

Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-07-17 15:21:10 -04:00
Eric Banks b0d99fd10d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 15:12:28 -04:00
Eric Banks 305db8c0d1 Total rewrite of the isGATKLite() functionality with help of Khalid/David. PluginManager was not working for us. 2012-07-17 15:11:03 -04:00
Ryan Poplin 6efbcd99f1 HaplotypeCaller is now an AnnotatorCompatibleWalker with all the rights and privileges pertaining thereto. Enabling the ClippingRankSumTest after showing it was useful for 1000 Genomes calling. 2012-07-17 14:38:36 -04:00
Eric Banks 110886e8b9 Oops, got the logic wrong. 2012-07-17 13:37:11 -04:00
Eric Banks a963b37424 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 13:15:37 -04:00
Eric Banks 3a64398d07 Cleaned up the isGATKLite check 2012-07-17 12:46:16 -04:00
Eric Banks 62c5228048 1) Revert previous change - indel recalibration is turned on by default and users of the Lite version will need to turn it off to avoid a User Error. 2) Implemented the engine.isGATKLite() method. 2012-07-17 12:23:40 -04:00
Chris Saunders 1913d1bbd0 Put RunReport S3 upload on timeout thread
Move the RunReport S3 upload process onto a separate thread with a timeout allowing the parent to continue.

Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
2012-07-17 12:19:39 -04:00
Eric Banks 40618ac471 A bunch of BQSR changes: 1) by default we do not emit indel quals, but they can be turned on with --enable_indel_quals. 2) We check whether or not we are running in Lite mode (not done yet) and if so and the user is trying to recalibrate indels, we throw a User Error (not supported). 3) Like v1 we now allow the user to set the qual value below which we don't recalibrate (this was the remaining source of differences in the v1 vs. v2 plots). 2012-07-17 10:52:43 -04:00
Eric Banks d5b3a2eabf Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 00:32:53 -04:00
Eric Banks f657b8bda8 Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow. 2012-07-17 00:32:34 -04:00
Eric Banks a003148d50 Move AnalyzeCovariates over too. 2012-07-16 16:11:56 -04:00
Eric Banks 0a89adbcdb Add utility decorators so that classes can tell you which package source they come from if they want to (suggested by Khalid). Using those decorators, we can easily pull out the BQSR updateDataForPileupElement() method into a standard RecalibrationEngine and an AdvancedRecalibrationEngine and use the protected one (AdvancedRE) if available (otherwise, the public one). 2012-07-16 15:34:50 -04:00
Eric Banks 52baac1e16 Move BQSRv2 into public and v1 into the archive. 2012-07-16 14:23:38 -04:00
Khalid Shakir 07822d6c0f Fixed input annotations for master/test files on DiffObjectsWalker. 2012-07-16 13:33:11 -04:00
Eric Banks 2a830939df Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-14 23:49:59 -04:00
Eric Banks f29cadd7e2 By default, don't quantize quals in BQSRv2 2012-07-14 23:49:48 -04:00
Eric Banks 75543a3f22 ReadClipper.clipRead's claim that it doesn't modify the original read was false. Ultimately, GATKSAMRecord.clone (as documented) creates a soft copy of the read - so modifying e.g. the bases of the cloned read means that you modify the bases of the original read too. Because of this, when the BQSRv2 Context covariate was writing Ns over the low quality tails of the reads they got propagated out to the output BAM file (very bad). I've updated the ReadClipper docs and cleaned up the code (no reason to use a clone of the read anymore given that we are already modifying the original). For now, the simplest thing is to have the Context covariate store the original bases, overwrite low quality Ns, compute covariates, and rewrite the original bases; we can update later if needed. 2012-07-13 18:50:27 -04:00
Ryan Poplin 443f02ffc2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-13 16:09:24 -04:00
Khalid Shakir 6dfcc486e8 In ApplyRecalibration marking filter as PASS instead of '.' when the site passes by calling .passFilters(). 2012-07-13 15:40:56 -04:00
Ami Levy Moonshine 5d0a7335ea remove unnecessary use in the PRIORITY list
remove unneeded imports
2012-07-13 15:27:08 -04:00
Ryan Poplin d70bb59182 HaplotypeCaller now calls insertion events that aren't fully assembled as symbolic alleles. 2012-07-10 14:22:23 -06:00
Guillermo del Angel 279dff9f81 Bug fix when specifying a JEXL expression for a field that doesn't exist: we should treat the whole expression as false, but we were rethrowing the JEXL exception in this case. Added integration test to cover this in SelectVariants 2012-07-10 13:59:00 -04:00
Mauricio Carneiro 7eb45b4038 Fixed BQSR IntegrationTests
* BinaryTag covariate is Experimental, not Standard (this was breaking integration tests)
* New parameter in the Recalibration report requires new MD5 for one of the integration tests.
2012-07-09 13:55:12 -04:00
Eric Banks dd0c47ab7e Don't cast to a specific walker type since any walker can use the VA engine 2012-07-09 10:25:58 -04:00
Mark DePristo 5b0ade67c8 Updates to VCF processing for better BCF processing
-- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order].  Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy
-- Updating GATK code to use the appropriate header function (this is why so many files have changed)
-- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this)
-- Bugfix for adding contig lines to BCF2 header dictionary
-- VCFHeader metaData no longer sorted internally.  The system now maintains the data in header order, and only sorts output as requested in API
-- VCFWriter and BCF2Writer now explicitly sort their header lines
-- Don't allow filters to be added that are PASS in the contract
2012-07-08 15:44:33 -07:00
Mark DePristo 63f5262e45 mergeInfoWithMaxAC is no longer hidden in CombineVariants 2012-07-08 15:44:32 -07:00
Mark DePristo 66aee613e2 Bugfix for set key in mergeInfoWithMaxAC.
-- Previous version was always setting set=source of info with highest AC.  Should actually have been set to the set annotation value itself.
2012-07-08 15:44:32 -07:00
Mark DePristo 91f0ed8059 Fixed nasty Rscript typo in VariantRecalibrator when compactPDF is available 2012-07-08 15:44:32 -07:00
Mark DePristo 87b090c362 Update VariantRecalibrator error message to use -resource not old -B syntax 2012-07-08 15:44:31 -07:00
Mauricio Carneiro 125e6c1a47 added BinaryTagCovariate for ancient dna analysis 2012-07-06 15:03:20 -04:00
Mauricio Carneiro f603d4c48c Fixing PairHMMIndelErrorModel boundary issue
When checking the limits of a read to clip, it wasn't considering reads that may already have been clipped before.
2012-07-06 11:48:04 -04:00
Eric Banks dd571d9aa0 Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags. 2012-07-04 01:22:20 -04:00
Eric Banks 33306d2e20 Changing the logic of the -standard argument; the way it stands currently one can never turn off the cycle or context covariates. Now they are on by default and users must opt out of them to turn them off. 2012-07-04 00:21:21 -04:00
Eric Banks 7d30558e6f Only 'pad' the cycle covariate for indels, not substitutions 2012-07-03 23:47:01 -04:00
Eric Banks 22f1afddaa Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 14:55:59 -04:00
Eric Banks 617eebd204 More misc cleanup 2012-07-03 14:55:37 -04:00
Eric Banks 344c3aeb1d Cleanup from previous commit 2012-07-03 14:42:44 -04:00
Ryan Poplin 9e8e78de15 Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels. 2012-07-03 14:30:37 -04:00
Eric Banks 0b37d44b0d Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup. 2012-07-03 13:05:11 -04:00
Eric Banks 031322ff00 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 00:12:59 -04:00
Eric Banks a4670113bd Refactored/renamed the nested integer array; cleaned up code a bit. 2012-07-03 00:12:33 -04:00
Ryan Poplin f92139dd82 Ooops, UG VA path for rank sum tests aren't happy with empty lists. Disabling clipping rank sum test for now. 2012-07-02 21:12:42 -04:00
Ryan Poplin 7e7b4cd1b9 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-02 16:37:54 -04:00
Ryan Poplin b807ff63ef HaplotypeCaller now creates MNP and complex substitutions by using LD information to decide if events segregate together on haplotypes. Added unit test. 2012-07-02 16:37:39 -04:00
Mauricio Carneiro 3cea080aa8 Cache SoftStart() and SoftEnd() in the GATKSAMRecord
these are costly operations when done repeatedly on the same read.
2012-07-02 16:22:00 -04:00
Mauricio Carneiro 88a02fa2cb Fixing bug for reads with cigars like 9S54H
When hard-clipping predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.
2012-07-02 16:22:00 -04:00
Eric Banks cac72bce91 Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit. 2012-07-02 14:33:33 -04:00
Mark DePristo bcd2e13d8b Adding duplicate header line keys is a logger.debug not logger.warn message now 2012-07-02 11:39:34 -04:00
Mark DePristo 01e04992f8 Fixed incompatibilities in AbstractVCFCodec that resulted in key=; being parsed and written as key; in VCF output 2012-07-02 11:38:59 -04:00
Eric Banks c94c8a9c09 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-02 08:53:01 -04:00
Mark DePristo 7aff4446d4 Added unit tests for header repairing capabilities in the GATK engine 2012-07-01 15:38:10 -04:00
Mark DePristo 480b32e759 BCF2 is now officially zero-based open-interval, and that's how the GATK does it now 2012-07-01 14:59:27 -04:00
Ryan Poplin b6093ff02c Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-01 10:32:37 -04:00
Mark DePristo 5ad9a98a15 Minor bugfixes / consistency fixes to filter strings of Genotypes and AC/AF annotations
-- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order
-- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values
2012-06-30 11:22:49 -04:00
Mark DePristo 385a3c630f Added check in VariantContext.validate to ensure that getEnd() == END value when present
-- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward
-- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF
2012-06-30 11:22:48 -04:00
Mark DePristo 893630af53 Enabling symbolic alleles in BCF2
-- Bugfix for VCFDiffableReader: don't add null filters to object
-- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles
-- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication.  Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where the there's a biallelic symbolic allele
-- Brand new home for allele clipping / padding routines in VCFAlleleClipper.  Actually documented this code, which results in lots of **** negative comments on the code quality.  Eric has promised that he and Ami are going to rethink this code from scratch.  Fixed many nasty bugs in here, cleaning up unnecessary branches, etc.  Added UnitTests in VCFAlleleClipper that actually test the code fully.  In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how these tests need to be fixed.  Even introduced some minor optimizations
-- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason.  Fixed.
-- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles
2012-06-30 11:22:48 -04:00
Mark DePristo 16276f81a1 BCF2 with support for symbolic alleles
-- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys
-- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest
2012-06-30 11:22:48 -04:00
Mark DePristo 6bea28ae6f Genotype filters are now just Strings, not Set<String> 2012-06-30 11:22:47 -04:00
Guillermo del Angel f631be8d80 UnifiedGenotyperEngine.calculateGenotypes() is not only used in UG but in other walkers - vc attributes shouldn't be inherited by default or it may cause undefined behaviour in those walkers, so only inherit attributes from input vc in case of UG calling this function 2012-06-29 23:51:52 -04:00
Guillermo del Angel 65037b87da Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-29 11:08:44 -04:00
Guillermo del Angel 5a9a37ba01 Pool caller improvements: a) Log ref sample depth at every called site (will add more ref-related annotations later), b) Make -glm POOLBOTH work in case we want to genotype SNPs and indels together, c) indel bug fix (pool and non-pool): prevent a bad GenomeLoc from being formed if we're running GGA and incoming alleles are larger than ref window size (typically 400 bp) 2012-06-29 11:08:16 -04:00
Eric Banks 96ea334bf2 Disable caching in BQSR for now since it significantly slows down computation; will look into this in a bit. 2012-06-28 15:27:44 -04:00
Ryan Poplin 05791ebf80 Adding the Clipping rank sum test: If alternate-supporting reads have more hard clipping than reference-supporting reads this is evidence for error. 2012-06-28 13:22:56 -04:00
Ryan Poplin d12ec92a55 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-28 12:57:59 -04:00
Ryan Poplin 5bb0693888 Bug fix for HC GGA mode. Shouldn't try to add an indel into the haplotype if that haplotype already contains the event of interest. Misc minor assembly param changes. Turning off capping of base qualities by base indel qualities until we can evaluate that change. 2012-06-28 12:57:51 -04:00
Khalid Shakir 1ce0b9d519 Throwing UnknownTribbleType exception instead of CommandLineException when an unknown tribble type is specified. 2012-06-28 11:28:04 -04:00
Mark DePristo 734bb5366b Special case the situation where we have ploidy == 0 (no GT values) to implicitly assume we have diploid samples
-- numLikelihoods no longer allows even ploidy == 0 in requires
-- VCFCompoundHeaderLine handles the case where ploidy == 0 => implicit ploidy == 2
2012-06-28 10:06:07 -04:00
Mark DePristo 64d7e93209 Massive bugfixes
-- Previous version was reading the size of the encoded genotypes vector for each genotype.  This only worked because I never wrote out genotype field values with > 15 elements.  Mauricio's killer DiagnoseTargets VCF uncovered the bug.  Unfortunately since symbolic allele clipping is still busted those tests are still disabled
-- GenotypeContext getMaxPloidy was returning -1 in the case where there are no genotypes, but the answer should be 0.
2012-06-28 10:06:06 -04:00
Mark DePristo 7144154f53 VCFWriter and BCFWriter no longer allow missing samples in the VC compared to their header
-- They now throw an error, as it's really unsafe to write out ./. as a special case in the VCFWriter as occurred previously.
-- Added convenience method in VariantContextUtils.addMissingSamples(vc, allSamples) that returns a complete VC where samples are given ./. Genotype objects
-- This allows us to properly pass tests of creating / writing / reading VCFs and BCFs, which previously differed because the VC from the VCF would actually be different from its original VC
-- Updated UG, UGEngine, GenotypeAndValidateWalker, CombineVariants, and VariantsToVCF to manage the master list of samples they are writing out and addMissingSamples via the VCU function
2012-06-28 10:06:06 -04:00
Mark DePristo 4811a00891 GENOTYPE_FILTER_KEY is now a VCFStandardHeaderLine 2012-06-28 10:06:05 -04:00
Mark DePristo 93426a44b1 Fixes for DiagnoseTargets to be VCF/BCF2 spec compliant
-- Don't use DP for average interval depth but rather AVG_INTERVAL_DP, which is a float now, not an int
-- Don't add PASS filter value to genotypes, as this is actually considered failing filters in the GATK.  Genotype filters should be empty for PASSing sites
2012-06-28 10:06:05 -04:00
Eric Banks dc7636b923 Refactor the ContextCovariate to significantly reduce runtime 2012-06-28 02:29:35 -04:00
Eric Banks 1fafd9f6c8 NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation. 2012-06-27 16:55:49 -04:00
Khalid Shakir 746a5e95f3 Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding.
Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals.
IntervalUtils return unmodifiable lists so that utilities don't mutate the collections.
Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus.
Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values.
JobRunInfo handles the null done times when jobs crash with strange errors.
2012-06-27 01:15:22 -04:00
Mark DePristo 1f45551a15 Bugfixes to G count types in VCF header
-- Previously VCF header lines of count type G assumed that the sample would be diploid.
-- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class
-- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class
-- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations
-- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples
-- Added extensive unit tests that compare A and G type values in genotypes
2012-06-26 15:28:34 -04:00
Mark DePristo 39c849aced Bugfix to ensure the DB=1 old files decode properly 2012-06-26 15:28:33 -04:00
Mark DePristo c1ac0e2760 BCF2 cleanup
-- allowMissingVCFHeaders is now part of -U argument.  If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING.  Updated lots of files to use this
-- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup.  This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.
2012-06-26 15:28:33 -04:00
Mark DePristo 11dbfc92a7 Horrible bugfix to decodeLoc() in BCF2Codec
-- Just completely wrong.
-- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem
-- Added samtools ex2.bcf file for decoding to our integrationtests
2012-06-26 15:28:32 -04:00
Mark DePristo 7dbba465ee Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf 2012-06-26 15:28:32 -04:00
Roger Zurawicki 7eb3e4da41 Added integration Tests for DiagnoseTargets
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-25 17:02:46 -04:00
Joel Thibault f0c54d99ed Account for a null attributes object
* field attributesCanBeModified - a null attributes object can't be modified in its current state
* method makeAttributesModifiable() - initialize a null attributes object to empty
2012-06-25 12:07:36 -04:00
Joel Thibault fd9effbfe2 Fix Exception typo 2012-06-25 12:05:04 -04:00
Ryan Poplin 429ad44421 Bug fix for read pos rank sum test annotation. Shouldn't be using the un-hardclipped start as the alignment start. 2012-06-22 14:53:29 -04:00
Ryan Poplin 735b59d942 Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero. 2012-06-22 12:38:48 -04:00
Ryan Poplin 0650b349d7 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-22 10:42:49 -04:00
Guillermo del Angel eed32df30d a) Sanity check in PoolCaller: if user didn't specify correct -glm or -pnrm models then error out with useful message, b) Have VariantsToTable deal with case where sample names have spaces: technically they're allowed (or at least not explicitly forbidden) but they'll produce R-incompatible tables. TBD which other tools have issues, or whether there's a generic fix for this 2012-06-21 21:19:55 -04:00
Mark DePristo 734756d6b2 Final fixes before BCF2 mark III push
-- Added MLEAC and MLEAF format lines to PoolCallerWalker
-- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation
-- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing
2012-06-21 15:17:22 -04:00
Mark DePristo 31ee8aa01a JEXL update
-- Update to 2.1.1 from 2.0
-- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching.  So "AF < 0.5" works even in the presence of multi-allelics now.
2012-06-21 15:17:21 -04:00
Mark DePristo 549293b6f7 Bugfixes towards final BCF2 implementation
-- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF
-- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing
-- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one
-- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing
-- Keys whose value is null are put into the VariantContext info attributes now
2012-06-21 15:17:21 -04:00
Mark DePristo 567dba0f76 Cleanup of VCF header lines and constants, BCF2 bugfixes
-- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller
-- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place.  Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers
-- VCF parsers now automatically repair standard VCF header lines when reading the header
-- Updating integration tests to reflect header changes
-- Created private and public testdata directories (public/testdata and private/testdata).  Updated tests to use test
-- SelectHeaders now always updates the header to include the contig lines
-- SelectVariants add UG header lines when in regenotype mode
-- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY
-- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs)
-- Throw error when VCF has unbounded non-flag values that don't have = value bindings
-- By default we no longer allow writing of BCF2 files without contig lines in the header
2012-06-21 15:16:31 -04:00
Mark DePristo fba7dafa0e Finalizing BCF2 mark III commit
-- Moved GENOTYPE_KEY vcf header line to VCFConstants.  This general migration and cleanup is on Eric's plate now
-- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header.  Still doesn't work...
-- Updating integration test files.  Moved many more files into public/testdata.  Updated their headers to all work correctly with new strict VCF header checking.
-- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt
-- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine.  DB = 0 is never seen in the output VCFs now
-- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status
-- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be
-- VariantsToVCF now properly writes out the GT FORMAT field
-- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code.  Eric said he and Ami will clean up this whole piece of infrastructure
-- Fixed bug in BCF2Codec that wasn't setting the phase field correctly.  UnitTested now
-- PASS string now added at the end of the BCF2 dictionary after discussion with Heng
-- Fixed bug where I was writing out all field values as BigEndian.  Now everything is LittleEndian.
-- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException
-- Cleaned up unused code
-- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding
-- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples
-- We always write the number of genotype samples into the BCF2 nSamples header.  How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names...
-- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally
-- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele
-- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values.  Genotype objects are all PASS implicitly, or explicitly filtered.  We only write out the FT values if at least one sample is filtered.  Removed interface functions and cleaned up code
-- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works.  In general, **** NEVER COPY CODE **** if you need to share functionality make a function, that's why they were invented!
-- Increased the default number of records to read for DiffObjects to 1M
2012-06-21 15:16:27 -04:00
Mark DePristo 9c81f45c9f Phase I commit to get shadowBCFs passing tests
-- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header.  This helps avoid some of the low-level errors I saw in SelectVariants.  This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument
-- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA
-- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample
-- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original
-- Performance optimizations in BCF2 codec and writer
    -- using arrays not lists for intermediate data structures
    -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over
-- VCFHeader (which needs a complete rewrite, FYI Eric)
    -- Warn and fix on the fly flag values with counts > 0
    -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow.  Duplicates are detected once at header creation.
    -- Explicitly track FILTER fields for efficient lookup in their own hashmap
    -- Automatically add PL field when we see a GL field and no PL field
    -- Added get and has methods for INFO, FILTER, and FORMAT fields
-- No longer add AC and AF values to the INFO field when there's no ALT allele
-- Memory efficient comparison of VCF and BCF files for shadow BCF testing.  Now there's no (memory) constraint on the size of the files we can compare
-- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF
2012-06-21 15:16:26 -04:00
Mauricio Carneiro ab53220635 Refactor on how RR treats soft clips
* Sites with more soft clipped bases than regular will force-trigger a variant region
* No more unclipping/reclipping, RR machinery now handles soft clips natively.
* Implemented support for base insertion and base deletion quality scores in synthetic and regular reads.
* GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present.

note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!
2012-06-21 14:02:03 -04:00
Ryan Poplin 769e190202 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-20 09:59:55 -04:00
Christopher Hartl fe1d6e3953 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-19 08:02:00 -04:00
Christopher Hartl 79ef3325bd Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations. 2012-06-19 07:50:13 -04:00
Eric Banks 62cee2fb5b Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature. 2012-06-18 21:36:27 -04:00
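The padding behaviour described in the commit above can be sketched as follows. This is a minimal illustration, not the GATK implementation: the class and method names are hypothetical, coordinates are assumed to be 1-based and inclusive, and clamping to the contig bounds is an assumption the commit message does not state.

```java
// Hypothetical sketch of interval padding; names are illustrative, not the GATK API.
final class IntervalPaddingSketch {
    /**
     * Widens the 1-based inclusive interval [start, stop] by `padding` bases on
     * both ends. Clamping to [1, contigLength] is an assumption of this sketch.
     */
    static int[] pad(int start, int stop, int padding, int contigLength) {
        if (padding < 0) {
            throw new IllegalArgumentException("padding must be >= 0");
        }
        int paddedStart = Math.max(1, start - padding);
        int paddedStop = Math.min(contigLength, stop + padding);
        return new int[] { paddedStart, paddedStop };
    }
}
```

With a padding of 50, an exome target at 100-200 would become 50-250, which is the "exome plus flanks" case the commit mentions.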
Eric Banks 4393adf9e7 If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it. 2012-06-18 13:36:14 -04:00
Ryan Poplin 707151f0a4 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 12:55:58 -04:00
Eric Banks 82a2c40338 Emit the MLE AC and AF in the INFO field of the UG output 2012-06-18 12:19:36 -04:00
Ryan Poplin 5ec737f008 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 08:51:48 -04:00
Ryan Poplin e3147969d9 Smith Waterman parameters have somehow diverged too far from those used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled. 2012-06-18 08:51:41 -04:00
Eric Banks 677babf546 Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort. 2012-06-15 15:55:03 -04:00
Eric Banks 783b7f6899 Misc cleanup 2012-06-15 10:39:19 -04:00
Eric Banks 0c218e4822 Refactoring mostly for readability (and small performance improvement) 2012-06-15 10:36:41 -04:00
Eric Banks c54e84e739 Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations. 2012-06-15 09:28:56 -04:00
Eric Banks 61fcbcb190 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-15 02:45:57 -04:00
Eric Banks 4895fe2289 No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java) 2012-06-15 02:32:00 -04:00
Mark DePristo 0384ce5d34 Simple optimizations for BCF2Encoder
-- Inline encodeString that doesn't go via List<Byte> intermediate
-- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2
-- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation
-- Final UG integration test update
2012-06-14 16:42:39 -04:00
Mark DePristo 68eed7b313 Optimizations for VCF and BCF2
-- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking.  encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version.  Lots of contracts.  User code updated to use specialized versions where possible
-- Misc code refactoring
-- Updated VCF float formatting to always include 3 sig digits for values < 1, and 2 for > 1.  Updating MD5s accordingly
-- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations
2012-06-14 16:42:39 -04:00
Mark DePristo 09df584788 Fixed nasty bug where we weren't closing the underlying PositionalOutputStream in IndexingVariantContextWriter 2012-06-14 16:42:39 -04:00
Mark DePristo fbc45e14d3 Cleanup formatting of VCF floats
-- Final integrationtest update before commit (and fixing new formatting changes)
2012-06-14 16:42:38 -04:00
Mark DePristo 8b01969762 More code cleanup and optimizations to BCF2 writer
-- Cleanup a few contracts
-- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields
-- A few simple optimizations
    -- VCF header samples stored in String[] in the writer for fast access
    -- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset
    -- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files
-- A few editorial comments in VCFHeader
2012-06-14 16:42:38 -04:00
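The 1MB buffered writer mentioned in the commit above can be illustrated with the standard library alone. This is a sketch under assumptions: the class and method names are hypothetical, and it only shows the buffering pattern, not VariantContextWriterStorage itself or the reported 3x speedup.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch: wrap the underlying stream in a 1 MB buffer so that many
// small writes are batched into few large ones.
final class BufferedWriterSketch {
    static final int BUFFER_SIZE = 1 << 20; // 1 MB

    static OutputStream buffered(OutputStream underlying) {
        return new BufferedOutputStream(underlying, BUFFER_SIZE);
    }

    /** Writes n single-byte "records" through the buffer and returns what reached the sink. */
    static byte[] writeRecords(int n) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = buffered(sink)) {
            for (int i = 0; i < n; i++) {
                out.write(i & 0xFF); // small writes stay in the buffer until flushed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // close() has flushed the buffer by this point
    }
}
```

Closing (or flushing) the wrapper matters: bytes still held in the buffer never reach the sink otherwise.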
Mark DePristo e34ca0acb1 Passing all unittests
-- Final merge conflicts resolved
-- BCF2Writer now supports case where a sample is present in the header but the sample isn't in the VC, in which case we create an empty sample and encode that
2012-06-14 16:42:38 -04:00
Mark DePristo 71da76039e Final support for variable length lists of strings in BCF2
-- Updating many MD5s as well.
2012-06-14 16:42:38 -04:00
Mark DePristo bd9d40fb84 Code cleanup and more documentation for BCFFieldWriters
-- Update integration tests where appropriate
2012-06-14 16:42:37 -04:00
Mark DePristo 856905ee5b Cleanup Genotypes
-- Renamed getAttribute to getExtendedAttribute, as this is really what this function does
-- Added a few more genotype tests
2012-06-14 16:42:36 -04:00
Mark DePristo 31997f8092 Bugfixes on the way to passing integration tests
-- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate
-- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it.  Very expensive but convenient.
-- Fixed nasty subsetting bug in SelectVariants with excluding samples
-- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT
-- Bugfix for dropping old style GL field values
-- Added test to VCFWriter to ensure that we have the same number of samples in the VC as in the header
-- Bugfix for Allele.getBaseString to properly show NO_CALL alleles
-- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes
2012-06-14 16:42:33 -04:00
Mark DePristo ea1b699778 Cleanup the interface for BCF2FieldEncoder
-- Now uses a much clearer approach.  Update all user classes to new interface
2012-06-14 16:42:33 -04:00
Mark DePristo dd6aee347a Genotype encoding uses the BCF2FieldEncoder system 2012-06-14 16:42:33 -04:00
Mark DePristo 9ac4203254 GenotypeAnnotations now accept a GenotypeBuilder and directly update the builder with their value
-- Cleans up interface and avoids significant amounts of gross typing code
2012-06-14 16:42:32 -04:00
Mark DePristo 7506994d09 Nearing final BCF commit
-- Cleanup some (but not all) VCF3 files.  Turns out there are lots so...
-- Refactored genotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec.  Now VCF3 properly handles the new GenotypeBuilder interface
-- Misc. bugfixes in GenotypeBuilder
2012-06-14 16:42:32 -04:00
Mark DePristo 6272612808 Testing utility to perform diffs N times 2012-06-14 16:42:32 -04:00
Mark DePristo 8014178f2f Algorithmically faster version of DiffEngine
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see.  This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly.  Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration test for new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
2012-06-14 16:42:30 -04:00
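The Utils.join(sep, int[]) bug mentioned in the commit above (dropping the first element) is an off-by-one that is easy to reintroduce. A minimal sketch of a join that keeps every element, assuming nothing about the real method beyond its (sep, int[]) shape; the class name is hypothetical:

```java
// Hypothetical sketch of joining an int[] with a separator without losing the first element.
final class JoinSketch {
    static String join(String sep, int[] values) {
        if (values == null || values.length == 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        sb.append(values[0]);                   // first element goes in without a separator
        for (int i = 1; i < values.length; i++) {
            sb.append(sep).append(values[i]);   // every later element is preceded by one
        }
        return sb.toString();
    }
}
```

For an AD or PL style array such as {10, 0, 255} and separator ",", this yields all three values.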
Mark DePristo 2a86b81a3f Initial version of clean, fast formatting routines built dynamically from a VCF header
-- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level.
-- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values
-- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder
-- Allowed us to naturally support encoding of lists of strings
-- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion
-- Fixes for integration test failures
-- Enabling contig updates
-- WalkerTest now prints out relative paths where possible to make cut/paste/run easier
2012-06-14 16:42:30 -04:00
Mark DePristo 51a3b6e25e No more makePrecisionFormatStringFromDenominatorValue
-- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formatting.
-- Created a smart float formatter in VCFWriter, with unit tests
-- Removed makePrecisionFormatStringFromDenominatorValue and its uses
-- Fix broken contract
-- Refactored some code from the encoder to utils in BCF2
-- HaplotypeCaller's GenotypingEngine was using old version of subset to context.  Replaced with a faster call that I think is correct. Ryan, please confirm.
2012-06-14 16:42:30 -04:00
Mark DePristo 43ad890fcc Finalizing BCF2 v2
-- FastGenotypes are the default in the engine.  Use --useSlowGenotypes engine argument to return to old representation
-- Cleanup of BCF2Codec.  Good error handling.  Added contracts and docs.
-- Added a few more contracts and docs to BCF2Decoder
-- Optimized encodePrimitive in BCF2Encoder
-- Removed genotype filter field exceptions
-- Docs and cleanup of BCF2GenotypeFieldDecoders
-- Deleted unused BCF2TestWalker
-- Docs and cleanup of BCF2Types
-- Faster version of decodeInts in VCFCodec
-- BCF2Writer
    -- Support for writing a sites only file
    -- Lots of TODOs for future optimizations
    -- Removed lack of filter field support
    -- No longer uses the alleleMap from VCFWriter, which was an Allele -> String, now uses Allele -> Integer which is faster and more natural
    -- Lots of docs and contracts
-- Docs for GenotypeBuilder.  More filter creation routines (unfiltered, for example)
-- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters.  Better genotype comparisons
2012-06-14 16:42:29 -04:00
Mark DePristo 37e5d32019 Remove logger.info statement 2012-06-14 16:42:29 -04:00
Mark DePristo 01ddf9555a Performance optimizations for Genotype field decoding for GT field
-- Fast path decoder for biallelic diploid GT fields that avoids allocating the same genotypes over and over
-- Contracts
-- final classes
2012-06-14 16:42:28 -04:00
Mark DePristo 7fbca7013e Don't add missing value binding from field to Genotype object in VCF3Codec 2012-06-14 16:42:28 -04:00
Mark DePristo 4a4d3cde3d UnitTests for decodeIntArray method 2012-06-14 16:42:27 -04:00
Mark DePristo 5b8bd81991 An option to not actually write out the results of select variants
-- Useful for performance testing of the SV operations themselves.
2012-06-14 16:42:26 -04:00
Mark DePristo 6f7a01e00d Bugfix for BCF2 reader / writer for > 0x0FFF samples :-)
-- Should be 0x00FFFFFF in the mask
2012-06-14 16:42:26 -04:00
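The mask fix in the commit above can be illustrated with a packed 32-bit field. The layout below (sample count in the low 24 bits, a format-field count in the high 8 bits) follows the packed field described in the BCF2 format, but the class, method names, and the narrower 0x0FFF mask used for contrast are assumptions for illustration, not the original code.

```java
// Hypothetical sketch: extracting a 24-bit sample count from a packed int.
final class SampleMaskSketch {
    static final int SAMPLE_MASK = 0x00FFFFFF; // low 24 bits

    static int pack(int nFormatFields, int nSamples) {
        return (nFormatFields << 24) | (nSamples & SAMPLE_MASK);
    }

    static int nSamples(int packed) {
        return packed & SAMPLE_MASK;
    }

    static int nFormatFields(int packed) {
        return packed >>> 24; // unsigned shift, so counts of 128..255 stay positive
    }
}
```

A mask that is too narrow silently truncates the count: 5000 samples masked with 0x0FFF comes back as 904, while the 24-bit mask returns 5000.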
Mark DePristo 1d4eb46606 Efficient reading of genotype fields v1
-- decodeIntArray in BCF2 decoder allows us to more efficiently read ints and int[] from stream directly into Genotype object
-- Code cleanup / contracts added where appropriate
-- V2 will have a yet more optimized path...
2012-06-14 16:42:26 -04:00
Mark DePristo 37b8d70321 Hidden option to SelectVariants to force the genotypes information to be decoded by computing AC 2012-06-14 16:42:25 -04:00
Mark DePristo 17fbd103d0 Smarter infrastructure to decode genotypes in BCF
-- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder.  The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2
-- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder.  In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form
-- Reduced the amount of data further to be computed in the DiffEngine.  The DiffEngine algorithm needs to be rethought to be efficient...
2012-06-14 16:42:25 -04:00
Mark DePristo 889e3c4583 Code cleanup before major refactor 2012-06-14 16:42:25 -04:00
Mark DePristo cebd37609c Finalizing new Genotype object and associated routines
-- Builder now provides a deprecated log10pError function to make a new GQ value
-- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions
-- Lots of contracts
-- Bugfixes throughout
2012-06-14 16:42:25 -04:00
Mark DePristo 8b0a629a31 Terrible bugfix
-- The way I was handling the contig offset ordering wasn't correct.  Now the contigs are always indexed in the order in which their corresponding populate() functions are called, so that the order of the contigs is given by the order in which they are in the file, or in our refDict.  It has nothing to do with the contig index itself.
-- SelectVariants no longer prints all samples to the screen if you aren't selecting any explicitly
2012-06-14 16:42:24 -04:00
Mark DePristo d37a8a0bc8 Efficient Genotype object Intermediate commit
-- Created a new Genotype interface with a more limited set of operations
-- Old genotype object is now SlowGenotype.  New genotype object is FastGenotype.  They can be used interchangeably
-- There's no way to create Genotypes directly any longer.  You have to use GenotypeBuilder just like VariantContextBuilder
-- Modified lots and lots of code to use GenotypeBuilder
-- Added a temporary hidden argument to engine to use FastGenotype by default.  Current default is SlowGenotype
-- Lots of bug fixes to BCF2 codec and encoder.
-- Feature additions
  -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes
  -- Cleaned up semantics of subContextFromSamples.  There's one function that either rederives or not the alleles from the subsetted genotypes

-- MASSIVE BUGFIX in SelectVariants.  The code has been decoding genotypes always, even if you were not subsetting down samples.  Fixed!
2012-06-14 16:42:24 -04:00
Mark DePristo a648b5e65e First step towards an efficient Genotype object
-- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness.  Tested utility of this approach by rewriting -- and then commenting out -- a path in BCF2Codec that could use this new code.  Much cleaner interface now, but not yet hooked up to anything
-- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code base passes integration tests before switching the code to new Genotype class
-- Code cleanup.  Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS.  Uses in code replaced with constant
2012-06-14 16:42:23 -04:00
Mark DePristo ff9ac4b5f8 BCF2 genotype decoding is now lazy
-- Refactored BCF2Codec into a LazyGenotypesDecoder object that provides on-demand genotype decoding of BCF2 data blocks a la VCFCodec.
-- VCFHeader has getters for sampleNamesInOrder and sampleNameToOffset instead of protected variables directly accessed by vcfcodec
2012-06-14 16:42:23 -04:00
Mark DePristo 9eb83a0771 Enable adding contigs to VariantContextWriters on output 2012-06-14 16:42:23 -04:00
Mark DePristo b0ea14ef0f VCFHeader getMetaData returns 4.1 version not 4.0 2012-06-14 16:42:22 -04:00
Mauricio Carneiro 7d12429917 First step towards indel qualities in RR
Let the BI's and BD's pass through the reduce reads machinery
2012-06-14 15:37:39 -04:00
Mauricio Carneiro e68038c5d8 Refactor post-processing downsampling using David's generic downsampler interface 2012-06-14 15:37:32 -04:00
Eric Banks de5508fcea Bug fixes for cycle and context covariates 2012-06-14 13:01:14 -04:00
Eric Banks 5c3c6cbc40 Long -> long conversions in BQSR 2012-06-14 09:07:02 -04:00
Eric Banks 29a74908bb The next round of BQSR optimizations: no more Long[] array creation 2012-06-14 00:05:42 -04:00
Guillermo del Angel cd2074b1dc Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-13 20:59:30 -04:00
Guillermo del Angel 92669a0468 Second intermediate commit for indel pool caller - now works (more or less) in reference sample-free mode. Still needs a lot of cleanups/add more tests and not done w/refactoring quite yet 2012-06-13 20:59:17 -04:00
David Roazen 0550b27799 Make downsampler classes themselves generic (instead of just the Downsampler interface)
This is in response to a request from Mauricio to make it easier
to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords)
without having to do any cumbersome typecasting. Sadly, Java
language limitations make this sort of solution the best choice.

Thanks to Khalid for his feedback on this issue.

Also:

-added a unit test to verify GATKSAMRecord support with no typecasting required

-added some unit tests for the FractionalDownsampler that Mauricio will/might be using

-moved classes from private to public to better sync up with my local development
branch for engine integration
2012-06-13 16:43:39 -04:00
Guillermo del Angel 67c0569f9c Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-13 11:50:00 -04:00
Eric Banks 81993b08e2 Don't put null entries into the key array 2012-06-13 11:43:44 -04:00
Roger Zurawicki bdf5945dcc Fixed bugs in DiagnoseTargets
DT would not report bad mates!
that has been fixed

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-13 11:15:26 -04:00
Roger Zurawicki 538cdf9210 Created the FindCoveredIntervals
Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class
Minor tweaks
FindCoveredIntervals supports Gathering
FindCoveredIntervals outputs an interval list instead of GATKReport

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-13 11:15:25 -04:00
Guillermo del Angel aee66ab157 Big UG refactoring and intermediate commit to support indels in pool caller (not done yet). Lots of code pulled out of long spaghetti-like functions and modularized to be easily shareable. Add functionality in ErrorModel to count indel matches/mismatches (but left part disabled so as not to change integration tests in this commit), add computation of pool genotype likelihoods for indels (not fully working yet in more realistic cases, only working in artificial nice pools). Lots of TBDs still but existing UG and pool SNP functionality should be intact 2012-06-13 11:14:44 -04:00
Eric Banks 37f56ce8fd A couple of minor updates to BQSR 2012-06-12 16:12:13 -04:00
Eric Banks 277493dd83 Yet more instances of Lists changed over to native arrays 2012-06-12 15:56:09 -04:00
Eric Banks 613badc835 Very minor optimizations for the context covariate 2012-06-12 15:47:32 -04:00
Eric Banks 0f79adb2aa Changing more Java Lists to native arrays in BQSR for performance optimization. 2012-06-12 15:41:01 -04:00
Eric Banks 1da3e43679 Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come. 2012-06-12 13:32:56 -04:00
Eric Banks fec0bd5e11 Fixing UG argument docs 2012-06-12 09:46:16 -04:00
Eric Banks a4defdfb29 Adding a GT header line to SomaticIndelDetector output 2012-06-12 09:39:17 -04:00
Eric Banks 891ce51908 Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements. 2012-06-12 09:19:36 -04:00
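The "longs and standard bit fiddling" approach named in the commit above can be sketched as packing several small integer fields into one primitive long. The 16-bit field width, the three field names, and the class are hypothetical choices for illustration; the actual BQSR key layout is not given in the commit message.

```java
// Hypothetical sketch: packing three small non-negative values into one long
// with shifts and masks, instead of storing them in a java.util.BitSet.
final class LongKeySketch {
    static final int BITS = 16;                 // assumed width of each field
    static final long MASK = (1L << BITS) - 1;  // 0xFFFF

    static long pack(int readGroup, int quality, int covariate) {
        return ((long) readGroup & MASK)
             | (((long) quality & MASK) << BITS)
             | (((long) covariate & MASK) << (2 * BITS));
    }

    static int readGroup(long key) { return (int) (key & MASK); }
    static int quality(long key)   { return (int) ((key >>> BITS) & MASK); }
    static int covariate(long key) { return (int) ((key >>> (2 * BITS)) & MASK); }
}
```

A primitive long needs no allocation and compares with ==, which is the kind of saving the commit attributes to dropping BitSets.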
Eric Banks ff5749599d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 15:46:17 -04:00
Eric Banks fea625632f Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one 2012-06-11 15:45:58 -04:00
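As general Java background for the commit above (not a claim about the exact failure it hit): Arrays.asList returns a fixed-size list that is a view backed by the original array, so later writes to the array show through and add() is unsupported. Copying into a new ArrayList decouples the two. The class name below is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: make an independent, resizable copy instead of keeping
// the array-backed view that Arrays.asList returns.
final class AsListSketch {
    static List<String> independentCopy(String[] values) {
        return new ArrayList<>(Arrays.asList(values));
    }
}
```

After the copy, changing the source array no longer affects the list, and elements can be added to it.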
Ryan Poplin e4d371dc80 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 10:38:50 -04:00
Ryan Poplin 683d4b508e Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller. 2012-06-11 10:38:35 -04:00
Mauricio Carneiro 4aad7e23ef New ReduceReads v2 with unclipped variant regions and soft-clipped bases
* Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it.
   * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute
   * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads
   * Updated all integration tests
2012-06-08 14:58:31 -04:00
Eric Banks afa9b2718a Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-08 13:54:48 -04:00
Eric Banks 92280b4068 BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime. 2012-06-08 13:54:37 -04:00
Eric Banks 898a0e6161 Minor optimizations 2012-06-08 12:07:58 -04:00
Ryan Poplin 0a37e19998 Bug fix in VQSR so that the VCF index will be created for the recalFile. 2012-06-08 11:51:28 -04:00
Eric Banks d463ab2cbf BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible. 2012-06-08 10:42:42 -04:00
Eric Banks 2bd48a7351 Bad comments made it into the previous commit 2012-06-07 23:12:56 -04:00
Eric Banks 31c3a6be48 BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR. 2012-06-07 20:04:10 -04:00
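The cost described in the commit above comes from a getter that allocates a new List each time it is used as a loop bound. A minimal sketch of the two patterns, with hypothetical names and a copy counter added purely for demonstration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a getter that copies on every call versus one that
// returns the same unmodifiable list.
final class CovariateListSketch {
    private final List<String> required;
    int copies = 0; // counts allocations made by the slow getter (demo only)

    CovariateListSketch(List<String> required) {
        this.required = Collections.unmodifiableList(new ArrayList<>(required));
    }

    /** Slow pattern: a fresh list per call. */
    List<String> getRequiredCopy() {
        copies++;
        return new ArrayList<>(required);
    }

    /** Fast pattern: the one list created in the constructor. */
    List<String> getRequired() {
        return required;
    }

    /** The loop condition calls the slow getter once per iteration, plus once to exit. */
    static int loopWithCopy(CovariateListSketch c) {
        int visited = 0;
        for (int i = 0; i < c.getRequiredCopy().size(); i++) {
            visited++;
        }
        return visited;
    }
}
```

With two covariates, the loop above allocates three lists just to evaluate its stop condition; returning the stored list avoids all of them.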
Eric Banks 0fb9179f76 BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array 2012-06-07 19:41:03 -04:00
Ryan Poplin d449f169d3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-07 10:56:55 -04:00
Ryan Poplin 0b4281fdd0 misc minor update to HC debug output for when there are a lot of samples 2012-06-07 10:56:41 -04:00
Eric Banks bad50a1b05 Fix docs 2012-06-06 22:45:38 -04:00
Eric Banks b093ba9dcc Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records) 2012-06-06 15:17:30 -04:00
Eric Banks 54f682a99c Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX. 2012-06-06 11:44:37 -04:00
Eric Banks dd46d843fb IR should skip Ion reads just like it does with 454 reads; Tim has confirmed that official platform name for Ion. 2012-06-06 11:04:55 -04:00
Guillermo del Angel 2cbd6e5f90 Merged bug fix from Stable into Unstable 2012-06-05 15:58:23 -04:00
Guillermo del Angel ce4dc2128d Adding minor clarification to -mbq argument documentation 2012-06-05 15:17:56 -04:00
Eric Banks e02ec8c8b6 Don't update the record ID unless we are actually going to emit the record 2012-06-04 14:58:50 -04:00
Eric Banks 8405156ae1 Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities. 2012-06-04 14:28:32 -04:00
Ryan Poplin f11e7ebc3a Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA. 2012-06-04 12:49:36 -04:00
Ryan Poplin 320956ee4b Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage. 2012-06-04 10:55:36 -04:00
Guillermo del Angel 7a54baf08c Merged bug fix from Stable into Unstable 2012-06-03 08:42:08 -04:00
Guillermo del Angel 47df7bbc14 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-06-03 08:38:54 -04:00
Guillermo del Angel 2ddbdee3bc Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow 2012-06-03 08:38:38 -04:00
Mauricio Carneiro 12a8c54f9a Fixing VCF header for filter elements (thanks Eric) 2012-06-01 15:45:15 -04:00
Eric Banks 3a15ba2102 Malformed VCF headers should be User Errors 2012-05-31 16:05:53 -04:00
Khalid Shakir c4f7df4dce When an underlying exception occurs because of a user error and the exception instance does not include a message, tell the user "because <exception class name>" instead of "because null". 2012-05-30 16:39:06 -04:00
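The fallback described in the commit above can be sketched in a few lines. The class and method names are hypothetical; only Throwable.getMessage() and Class.getName() are real API.

```java
// Hypothetical sketch: pick a human-readable reason for an exception, falling
// back to the class name when the exception carries no message.
final class ErrorReasonSketch {
    static String reason(Throwable t) {
        String message = t.getMessage();
        return message != null ? message : t.getClass().getName();
    }
}
```

An exception constructed without a message, such as new NullPointerException(), then reports its class name rather than the string "null".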
Ryan Poplin 421d0d1435 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-30 15:21:35 -04:00
Ryan Poplin 5dd811f84a Adding genotype given alleles mode to the HaplotypeCaller. 2012-05-30 15:07:01 -04:00
Eric Banks d09b8d5584 Fixing docs 2012-05-30 13:24:08 -04:00
Mauricio Carneiro d6e1205310 Updating default values for DiagnoseTargets 2012-05-30 12:43:07 -04:00
Khalid Shakir c3c7f17d90 Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000.
Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the nonexistent dir, now checking to see if the network directory is available and throwing a SkipException to bypass the test when it cannot be run.
TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors pop up later when the networked fastas aren't found.
2012-05-29 11:18:22 -04:00
Roger Zurawicki b8b139841d DiagnoseTargets with working Q1,Median,Q3
- Merged Roger's metrics with Mauricio's optimizations
 - Added Stats for DiagnoseTargets
     - now has functions to find the median depth, and upper/lower quartile
     - the REF_N callable status is implemented
 - The walker now runs efficiently
 - Diagnose Targets accepts overlapping intervals
 - Diagnose Targets now checks for bad mates
 - The read mates are checked in a memory efficient manner
 - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker.
 - Fixed some bugs
 - Removed rod binding

Added more Unit tests

 - Test callable statuses on the locus level
 - Test bad mates

 - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-05-29 10:16:45 -04:00
Eric Banks 50031b63c5 Fix possible NPE from NBaseCount annotation module 2012-05-29 09:46:00 -04:00
Mark DePristo 454c8e63e6 Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s
-- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec.  As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)
2012-05-28 20:20:05 -04:00
Mark DePristo 7ce24a96f1 PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present 2012-05-28 20:18:16 -04:00
Mark DePristo 1818c29371 Fixed long-standing bug in beagle codec that was passing on the header record for decoding 2012-05-28 20:17:26 -04:00
Mark DePristo 5894d045cb Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests
-- This version of BCF should actually work properly for most files, assuming headers are properly defined.
-- Lots of bug fixes to BCF2 codec
-- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL.  NOTE THIS SEMANTICS change
-- Equals() method for GenotypeLikelihoods, using PLs.
-- VCFCodec no longer adds empty bindings to missing input field values.  NOTE THIS CHANGE
-- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work.  The BCF2 codec now makes VCs marked as fully decoded
-- stringToBytes returns empty list for null or "" string in BCF2Encoder
-- Proper handling of genotype ordering in BCF2 reader / writer
-- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily
-- Many failing MD5s now due to double -> int change in GQ, will update later
2012-05-27 11:17:17 -04:00
Mark DePristo 86e5a066fc Even more conservative limit on number of differences to summarize at 1000 2012-05-27 11:17:13 -04:00
Mark DePristo 31f4e5b52e Stop unlimited runtimes in DiffEngine when you have lots of differences
-- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000
2012-05-27 11:17:13 -04:00
Mauricio Carneiro 4109fcbb08 Merged bug fix from Stable into Unstable 2012-05-25 13:03:05 -04:00
Mauricio Carneiro 2be5704a25 Fixed haplotype boundary bug in PairHMMIndelErrorModel
haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception).

   * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate.
   * removed unnecessary code to remove hard clips after processing.
   * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).
2012-05-25 13:00:45 -04:00
Guillermo del Angel 175bb35e70 Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former) 2012-05-25 12:56:23 -04:00
Mark DePristo 7280cdf937 Bugfixes and testdata cleanup
-- Cut down the size of a few large files in public/testdata that were only used in part
-- Refactor vcf Filename => shadow BCF filename to BCF2Utils.  Fix bug in WalkerTest due to the way this was handled previously
2012-05-24 13:26:05 -04:00
Mark DePristo e9c22b9aad Final updates to integration tests for BCF2
-- Fully working version
-- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf
-- Moved MedianUnitTest to its proper home in Utils
-- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests.  From this website it's easy to see md5 diffs, etc.  This is a vastly better way to manage unit and integration test output
2012-05-24 10:58:59 -04:00
Mark DePristo ade1843818 Bugfix for not setting header in AbstractVCFCodec 2012-05-24 10:58:58 -04:00
Mark DePristo 6ca71fe3b4 GATK tests use public/testdata not /humgen/ as much as possible 2012-05-24 10:58:58 -04:00
Mark DePristo 69ee4d0454 Moved getMetaDataForField to VariantContextUtils 2012-05-24 10:57:09 -04:00
Mark DePristo f77d2e6965 Renamed NO_HEADER to the more accurate no_cmdline_in_header
-- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well
2012-05-24 10:57:08 -04:00
Mark DePristo 4bde24f020 Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header 2012-05-24 10:57:08 -04:00
Mark DePristo 4846bf5c8e @Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf
-- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically
2012-05-24 10:57:07 -04:00
Mark DePristo bb0d87666a Finally just deleted equals() method in GATKArgumentCollection.
-- We never compare these things in the codebase anyway...
2012-05-24 10:57:07 -04:00
Mark DePristo c8ed0bfc4c Edge case fixes for BCF2
-- Handle entirely missing GT in a sample in decodeGenotypeAlleles
-- Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code
-- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples
-- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys
-- Removed restriction on getPloidy requiring ploidy > 1.  It's logically fine to return 0 for a no-called sample
-- getMaxPloidy() in VC that does what it says
-- Support for padding / depadding of generic genotype fields
2012-05-24 10:57:06 -04:00
Mark DePristo 40431890be -- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it
-- BCF2 writer can now work without the contig lines being in the header
-- Made GenomeLocParser a final class
2012-05-24 10:57:06 -04:00
Mark DePristo 6301572009 GenotypeLikelihood PLs are capped at Short.MAX_VALUE now
-- UserExceptions in BCF2 now where appropriate
-- Asserts for code safety
-- Public -> protected encode(Object v) method is for testing only
2012-05-24 10:57:06 -04:00
Mark DePristo d52bc31a47 Bugfix for doNotWriteGenotypes mode
-- Was outputting GT ./. in sites only mode.  Fixed
2012-05-24 10:57:05 -04:00
Mark DePristo 64d4238e2f 99% working version of BCF2 encoder / decoder
-- fixed final bugs with PL encoding / decoding
-- Ready for testing by other members of the group
-- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations
-- Fixed a nasty bug in the filter field
-- Note that some (many?) GATK tools won't work with BCF because they internally assume values are Strings, not their true types

Read 1500 genotypes file in VCF -> VCF : 11 seconds
Read 1500 genotypes file in VCF -> BCF : 9.5 seconds

VariantEval 1500 genotypes file in VCF : 3 seconds
VariantEval 1500 genotypes file in BCF : 3 seconds
2012-05-24 10:57:05 -04:00
Mark DePristo b5bce8d3f9 AD should be UNBOUNDED, actually
-- Pass in # alt alleles as appropriate for getCount in VCF header line
2012-05-24 10:57:05 -04:00
Mark DePristo aaf11f00e3 Near final BCF2 implementation
-- Trivial import changes in some walkers
-- SelectVariants has a new hidden mode to fully decode a VCF file
-- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED but A type, which is actually the right type
-- GenotypeLikelihoods now implements List<Double> for convenience.  The PL duality here is going to be removed in a subsequent commit
-- BugFixes in BCF2Writer.  Proper handling of padding.  Bugfix for nFields for a field
-- padAllele function in VariantContextUtils
-- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.
2012-05-24 10:57:02 -04:00
Mark DePristo dfee17a672 Generalize / unify code for handling strings
-- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder.
-- Unified the type conversion code in BCFWriter to simplify the mapping from VCF type => BCF type and special value recoding
-- Code cleanup and renaming
2012-05-24 10:57:02 -04:00
Mark DePristo b4a5acd6f4 Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't 2012-05-24 10:57:01 -04:00
Mark DePristo 373ae39e86 Testing of BCF codec
-- Rev'd tribble
-- Minor code cleanup
-- BCF2 encoder / decoder use Double not Float internally everywhere
-- Generalized VC testing framework
2012-05-24 10:57:01 -04:00
Mark DePristo fb1911a1b6 -- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder
-- Convenience routine for creating alleles from strings of bases
-- Convenience constructor for VCFFilterHeader line whose description is the same as name
-- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes.  Can be reused throughout code for BCF, VCF, etc.
-- Created basic BCF2WriterCodec tests that consume VariantContextTestProvider contexts, write them to disk with BCF2 writer, and check that they come back equal to the original VariantContexts. Actually worked for some complex tests on the first go
2012-05-24 10:57:01 -04:00
Mark DePristo 4968dcd36a Throw an error when genotype fields with mixed vector lengths are encountered 2012-05-24 10:57:00 -04:00