10630 Commits (0187f04a906f1b4b4b93446d2b68ccf4c8befff7)

Author SHA1 Message Date
Eric Banks 0187f04a90 Proper fix for a previous RR bug fix: only remove reads from the header if they were actually used in the creation of the polyploid consensus. 2012-09-23 00:39:19 -04:00
Eric Banks 74bb4e2739 Fixing the VariantContextUtilsUnitTest 2012-09-22 23:24:55 -04:00
Eric Banks 344083051b Reverting the fix to the generalized ploidy exact model since it cannot handle it computationally. Will file this in the JIRA. 2012-09-22 23:07:28 -04:00
Eric Banks 25e3ea879a Oops, missed this test before when updating md5s 2012-09-22 22:16:35 -04:00
Eric Banks ced652b3dd RR bug: we need to call removeFromHeader() for reads that were used in creating a polyploid consensus or else they are reused later in creating synthetic reads. In the worst case, this bug caused the tool to create 2 copies of the reduced read. 2012-09-22 21:50:10 -04:00
Eric Banks 60b93acf7d RR bug: we need to test that the mapping and base quals are >= the MIN values and not just >. This was causing us to drop Q20 bases. 2012-09-22 21:32:29 -04:00
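The boundary condition described in commit 60b93acf7d above can be shown with a minimal Python sketch. The names `MIN_BASE_QUAL`, `passes_strict`, and `passes_inclusive` are hypothetical and do not come from the GATK source; the threshold of 20 is assumed from the mention of Q20 bases.

```python
# Hypothetical minimum quality; assumed to be 20 because the commit mentions Q20 bases.
MIN_BASE_QUAL = 20


def passes_strict(qual: int, minimum: int = MIN_BASE_QUAL) -> bool:
    """Strict comparison: a value exactly equal to the minimum is rejected."""
    return qual > minimum


def passes_inclusive(qual: int, minimum: int = MIN_BASE_QUAL) -> bool:
    """Inclusive comparison: a value equal to the minimum is accepted."""
    return qual >= minimum


quals = [19, 20, 21]
kept_strict = [q for q in quals if passes_strict(q)]
kept_inclusive = [q for q in quals if passes_inclusive(q)]

print(kept_strict)     # the Q20 base is missing here
print(kept_inclusive)  # the Q20 base is kept here
```

With the strict comparison, a base sitting exactly on the minimum is silently dropped, which matches the symptom reported in the commit message.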
David Roazen f6a22e5f50 ExperimentalReadShardBalancerUnitTest was being skipped; fixed
TestNG skips tests when an exception occurs in a data provider,
which is what was happening here.

This was due to an AWFUL AWFUL use of a non-final static for
ReadShard.MAX_READS. This is fine if you assume only one instance
of SAMDataSource, but with multiple tests creating multiple SAMDataSources,
and each one overwriting ReadShard.MAX_READS, you have a recipe for
problems. As a result of this the test ran fine individually, but not as
part of the unit test suite.

Quick fix for now to get the tests running -- this "mutable static"
interface should really be refactored away though, when I have time.
2012-09-22 01:56:39 -04:00
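The "mutable static" hazard described in commit f6a22e5f50 above is not specific to Java. Here is a minimal Python sketch of the same pattern, using a class attribute in place of a non-final static field. `ReadShardSketch` and `DataSourceSketch` are hypothetical stand-ins, not the real `ReadShard` or `SAMDataSource` classes.

```python
class ReadShardSketch:
    # Class-level setting shared by every user of the class
    # (the analogue of a non-final static field).
    max_reads = 10_000


class DataSourceSketch:
    def __init__(self, max_reads: int) -> None:
        # Side effect: constructing a data source overwrites the shared value.
        ReadShardSketch.max_reads = max_reads


first = DataSourceSketch(100)
seen_before = ReadShardSketch.max_reads   # value the first test expects

second = DataSourceSketch(5)              # a second test builds its own source
seen_after = ReadShardSketch.max_reads    # the first test now sees this instead

print(seen_before, seen_after)
```

Run in isolation, each test sees only its own value. Run in one process, whichever data source was created last wins, which is consistent with a test that passes individually but fails as part of a suite.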
David Roazen e077347cc2 Re-allow running the GATK with experimental downsampling
It's now possible to run with experimental downsampling enabled
using the --enable_experimental_downsampling engine argument.

This is scheduled to become the GATK-wide default next week after
diff engine output for failing tests has been examined.
2012-09-21 23:20:46 -04:00
David Roazen 34eed20aa6 PerSampleDownsamplingReadsIterator: fix for incorrect use of DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL
Notify all downsamplers in our pool of the current global genomic position every
DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL position changes, not every single
positional change after that threshold is first reached.
2012-09-21 22:43:39 -04:00
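The difference between "every N position changes" and "every change once N is reached", as described in commit 34eed20aa6 above, can be sketched in Python. The log does not show the actual code, so this is only one plausible shape of such a bug. `UPDATE_INTERVAL` and both function names are hypothetical.

```python
UPDATE_INTERVAL = 3  # hypothetical value; the real constant is not shown in the log


def notify_points_unreset(num_changes: int, interval: int = UPDATE_INTERVAL) -> list:
    """Counter is never reset, so every change after the threshold notifies."""
    notified = []
    changes = 0
    for step in range(1, num_changes + 1):
        changes += 1
        if changes >= interval:
            notified.append(step)
    return notified


def notify_points_reset(num_changes: int, interval: int = UPDATE_INTERVAL) -> list:
    """Counter is reset after each notification, so updates go out once per interval."""
    notified = []
    changes = 0
    for step in range(1, num_changes + 1):
        changes += 1
        if changes >= interval:
            notified.append(step)
            changes = 0
    return notified


print(notify_points_unreset(7))  # [3, 4, 5, 6, 7]
print(notify_points_reset(7))    # [3, 6]
```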
David Roazen 133085469f Experimental, downsampler-friendly read shard balancer
-Only used when experimental downsampling is enabled

-Persists read iterators across shards, creating a new set only when we've exhausted
the current BAM file region(s). This prevents the engine from revisiting regions discarded
by the downsamplers / filters, as could happen in the old implementation.

-SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip
out all related code when the engine fork is collapsed.

-Defensive implementation that assumes BAM file regions coming out of the BAM Schedule
can overlap; should be able to improve performance if we can prove they cannot possibly
overlap.

-Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back
once confidence in the code is gained
2012-09-21 22:17:58 -04:00
Guillermo del Angel ab8fa8f359 Bug fix: AlleleCount stratification in VariantEval didn't support higher ploidy and was producing bad tables 2012-09-21 20:48:12 -04:00
Eric Banks dcd31e654d Turn off RR tests while I debug 2012-09-21 17:26:00 -04:00
Eric Banks 21251c29c2 Off-by-one error in sliding window manifests itself at end of a coverage region dropping the last covered base. 2012-09-21 17:22:30 -04:00
Mauricio Carneiro 2c3dc291c0 Added positive/negative strand to the synthetic reads 2012-09-21 10:00:48 -04:00
Mauricio Carneiro 51cb5098e4 Fixed the alignment issues with reads that started with empty consensus headers 2012-09-21 10:00:47 -04:00
Mauricio Carneiro aa1d2f3a5b Not every consensus is well aligned. Need to check more, but starting position has been fixed. 2012-09-21 10:00:45 -04:00
Mauricio Carneiro 97874b92d1 Program runs, but the consensus reads are all out of place and need more tags 2012-09-21 10:00:44 -04:00
Mauricio Carneiro 3494a52ddc another intermediate commit to update changes from stable 2012-09-21 10:00:43 -04:00
Mauricio Carneiro a89ff7b5dd Intermediate commit to resolve conflicts coming from stable 2012-09-21 10:00:41 -04:00
Mark DePristo 5d758bf97f Better run a shorter test -- should take 3 minutes total 2012-09-20 18:54:14 -04:00
Mark DePristo d29218825d Fix grouping for display of GATKPerformanceOverTime
-- God I hate R
2012-09-20 18:45:16 -04:00
Mark DePristo b5fa848255 Fix GSA-515 Nanoscheduler GSA-573 -nt and -nct interact badly w.r.t. output
-- See https://jira.broadinstitute.org/browse/GSA-573
-- Uses InheritableThreadLocal storage so that child threads created by the NanoScheduler see the parent stubs in the main thread.
-- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.
2012-09-20 18:45:16 -04:00
Mark DePristo 90b7df46cf Add invocation count and shorter timeout to NanoSchedulerUnitTest 2012-09-20 18:45:16 -04:00
Mark DePristo ba9e95a8fe Revert "Reorganized NanoScheduler so that main thread does the reduces"
Doesn't actually fix the problem, and adds an unnecessary delay in closing down NanoScheduler, so reverting.

This reverts commit 66b820bf94ae755a8a0c71ea16f4cae56fd3e852.
2012-09-20 18:45:15 -04:00
Mark DePristo 7425ab9637 Reorganized NanoScheduler so that main thread does the reduces
-- Enables us to run -nt 2 -nct 2 and get meaningful output
-- Uses a sleep / poll mechanism.  Not ideal -- will look into wait / notify instead.
2012-09-20 18:45:15 -04:00
Eric Banks 747694f7c2 Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-20 14:14:58 -04:00
Eric Banks 1316b579f0 Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table. 2012-09-20 14:14:34 -04:00
Christopher Hartl c492185be6 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2012-09-20 12:56:07 -04:00
Christopher Hartl d25579deeb A couple of minor things.
1) Better documentation on the meta data file for VariantsToBinaryPed with examples of each file type

2) MannWhitneyU can now take an argument on creation to turn off dithering. This pertains to JIRA-GSA-571 but does not fix it,
   as it isn't hooked up to the command line. Next step is to add an argument to the command line where it's accessible to the
   annotation classes (e.g. from either UG or the VariantAnnotator).

3) Added some dumb python scripts to deal with Plink files, and a script to convert plink binaries to VCF to help sanity check. Basically if you want to do an analysis on genotype data stored in plink binary format, your choices are:
  1) Add a new module to Plink [difficulty rating: Impossible -- code obfuscation]
  2) Steal plink parsing code from software (Plink/PlinkSeq/GCTA/Emacks/etc) that reads the files [difficulty rating: Oppressive -- code not modularized at all]
  3) Write your own dumb stuff [difficulty rating: Annoying]
What's been added is the result of 3. It's a library so nobody else has to do this, so long as they're comfortable with python.
2012-09-20 12:48:13 -04:00
Eric Banks 2e6f533996 Adding both unit and integration tests to cover the previous edge case of mismatched PLs 2012-09-20 11:55:28 -04:00
Eric Banks 4b7edc72d1 Fixing edge case bug in the Exact model (both standard and generalized) where we could abort prematurely in the special case of multiple polymorphic alleles and samples with widely different depths of coverage (e.g. exome and low-pass). In these cases it was possible to call the site bi-allelic when in fact it was multi-allelic (but it wouldn't cause it to create a monomorphic call). 2012-09-20 10:59:42 -04:00
Ryan Poplin ccb65a03e8 sorry, non-ASCII characters annoy some computers. 2012-09-20 10:14:48 -04:00
Mauricio Carneiro 1ef6fa7eed QD and FS are doubles and select variants is more picky than variant filtration on that 2012-09-20 08:21:42 -04:00
Mauricio Carneiro 4e160a267d quality control script for ReduceReads
Takes in a full bam and a reduced bam, makes calls over a given interval, selects only the high quality
2012-09-20 00:11:32 -04:00
Mark DePristo 087247f1f0 Allow longs and doubles in recalibration report to allow some backward compatibility 2012-09-19 19:23:44 -04:00
Mark DePristo 2267b722b2 Proper error handling in NanoScheduler
-- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker
-- ErrorTracker is now used throughout the NanoScheduler.  In order to properly handle errors, the work previously done by the main thread (submit jobs, block on reduce) is now handled in a separate thread.  The main thread simply wakes up periodically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately.
-- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers
-- Added NanoSchedulerUnitTest for input errors
-- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option.  This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results.
-- Build.xml no longer copies the unittest results verbosely
2012-09-19 17:03:13 -04:00
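The error-handling arrangement described in commit 2267b722b2 above (work moved to a separate thread, main thread waking periodically to check for a result or an error) can be sketched in Python as follows. This is a simplified illustration under assumed names, not the NanoScheduler implementation. `ErrorTrackerSketch` and `run_with_polling` are hypothetical.

```python
import threading


class ErrorTrackerSketch:
    """Thread-safe holder that remembers the first error reported to it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._error = None

    def notify(self, exc: BaseException) -> None:
        with self._lock:
            if self._error is None:
                self._error = exc

    def get(self):
        with self._lock:
            return self._error


def run_with_polling(work, poll_seconds: float = 0.01):
    """Run work() in a helper thread; the caller polls for a result or an error."""
    tracker = ErrorTrackerSketch()
    result = {}
    done = threading.Event()

    def helper() -> None:
        try:
            result["value"] = work()
        except Exception as exc:  # report instead of dying silently in the thread
            tracker.notify(exc)
        finally:
            done.set()

    threading.Thread(target=helper, daemon=True).start()

    while True:
        finished = done.wait(timeout=poll_seconds)  # wake up periodically
        error = tracker.get()
        if error is not None:
            raise error  # surface the worker's error in the calling thread
        if finished:
            return result["value"]


print(run_with_polling(lambda: sum(range(10))))
```

The point of the design is that an exception raised off the main thread still reaches the caller, rather than leaving the main thread blocked on a result that will never arrive.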
Mark DePristo 773af05980 Intermediate commit for proper error handling in the NanoScheduler
-- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler
-- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled)
-- Added unit tests for failing input iterator in NanoScheduler (fails)
-- Made ErrorThrowing NanoScheduable
2012-09-19 17:03:13 -04:00
Mark DePristo eb24dc920a GATKPerformanceOverTime now includes ideal scaling line by default 2012-09-19 17:03:13 -04:00
Mark DePristo d2046b67b1 Remove problematic @Ensures from InputProducer.
-- We need to figure out why CoFoJa is broken in the NanoScheduler
2012-09-19 17:03:13 -04:00
Mark DePristo 33fabb8180 Final V3 version of NanoScheduler
-- Fixed basic bugs in tracking of input -> map -> reduce jobs
-- Simplified classes
-- Expanded unit tests
2012-09-19 17:03:12 -04:00
Mark DePristo e18bc4e7b1 Adding PrintReads -baq and -bqsr to standard performance testing 2012-09-19 17:03:12 -04:00
Mark DePristo 5734d756b5 Remove problematic @Invariant from EOFMarkedValue 2012-09-19 17:03:12 -04:00
Mark DePristo aa9a1e8122 Warn GATK user if the number of requested threads > available processors on the machine 2012-09-19 17:03:12 -04:00
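A check like the one in commit aa9a1e8122 above might look like the following Python sketch. The function name and message wording are hypothetical; only `os.cpu_count()` is a real standard-library call.

```python
import os
from typing import Optional


def thread_request_warning(requested: int, available: Optional[int] = None) -> Optional[str]:
    """Return a warning message if more threads are requested than processors exist."""
    if available is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        available = os.cpu_count() or 1
    if requested > available:
        return (
            f"requested {requested} threads but only "
            f"{available} processors are available"
        )
    return None


message = thread_request_warning(requested=8, available=4)
if message is not None:
    print("WARNING:", message)
```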
Mark DePristo 76027d17e6 Add a few more UnitTests for InputProducer
-- Cleaned up function calls for clarity
2012-09-19 17:03:12 -04:00
Mark DePristo 7605c6bcc4 Done GSA-515 Nanoscheduler / GSA-557 V3 nanoScheduler algorithm
-- V3 + V4 algorithm for NanoScheduler.  The newer version uses 1 dedicated input thread and n - 1 map/reduce threads.  These MapReduceJobs perform map and a greedy reduce.  The main thread's only job is to shuttle inputs from the input producer thread, enqueueing MapReduce jobs for each one.  We manage the number of map jobs now via a Semaphore instead of a BlockingQueue of fixed size.
-- This new algorithm should consume roughly N x 100% CPU for a given -nct N value.
-- Also a cleaner implementation in general
-- Vastly expanded unit tests
-- Deleted FutureValue and ReduceThread
2012-09-19 17:03:12 -04:00
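Commit 7605c6bcc4 above mentions limiting the number of outstanding map jobs with a Semaphore rather than a fixed-size BlockingQueue. Here is a rough Python sketch of that idea, assuming a thread pool and a reduce function that does not depend on ordering. All names are hypothetical, and the real NanoScheduler may differ in many respects, including how it orders reduces.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def map_reduce_bounded(inputs, map_fn, reduce_fn, initial, n_workers=2, max_in_flight=4):
    """Map each input on a pool, reducing greedily, with at most max_in_flight jobs pending."""
    permits = threading.Semaphore(max_in_flight)
    lock = threading.Lock()
    state = {"result": initial, "in_flight": 0, "peak": 0}

    def job(item):
        try:
            mapped = map_fn(item)
            with lock:
                # Greedy reduce: fold the mapped value in as soon as it is ready.
                state["result"] = reduce_fn(state["result"], mapped)
        finally:
            with lock:
                state["in_flight"] -= 1
            permits.release()  # let the submitting thread enqueue another job

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = []
        for item in inputs:
            permits.acquire()  # blocks while max_in_flight jobs are outstanding
            with lock:
                state["in_flight"] += 1
                state["peak"] = max(state["peak"], state["in_flight"])
            futures.append(pool.submit(job, item))
        for future in futures:
            future.result()  # re-raise any exception from a worker

    return state["result"], state["peak"]


total, peak = map_reduce_bounded(range(10), lambda x: x * x, lambda a, b: a + b, 0)
print(total, peak)
```

Because results are folded in completion order, this sketch is only correct for commutative, associative reductions such as a sum. The semaphore bounds how far the submitting thread can run ahead of the workers without requiring a queue of fixed capacity.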
Mark DePristo 69e418c3f5 Intermediate commit for v3 NanoScheduling algorithm
-- This version works but it blocks much more than I'd expect on input.  Merging v2 and v3 to make v4 now
2012-09-19 17:03:12 -04:00
Joel Thibault c72db70416 Update downsample_to_coverage to 60 2012-09-19 16:23:58 -04:00
Mauricio Carneiro ee31a54a03 Merged bug fix from Stable into Unstable 2012-09-19 16:09:45 -04:00
Mauricio Carneiro 7cf9911924 Fixed ReduceReads bug where variant regions were missing.
This affected variant regions with more than 100 reads and less than 250 reads. Only bams reduced with GATK v2 and 2.1 were affected.
2012-09-19 16:09:08 -04:00
Ryan Poplin 26e35e5ee2 updating BQSR integration tests 2012-09-19 14:10:34 -04:00