Commit Graph

2351 Commits (7d95176539546585bbc76cfde2866fba64ee83c2)

Author SHA1 Message Date
Mark DePristo 7d95176539 Bugfix to compareTo and equals in GenomeLoc
-- Yes, GenomeLoc.compareTo was broken.  The compareTo function only considered the contig and start position, but not the stop, when comparing genome locs.
-- Updated GenomeLoc.compareTo function to account for stop.  Updated GATK code where necessary to fix resulting problems that depended on this.
-- Added unit tests to ensure that hashcode, equals, and compareTo are all correct for GenomeLocs
2012-08-30 19:41:49 -04:00
Mark DePristo 5a9610d875 ReadShards now default to 10K (up from 1K) reads per samFile up to 250K
-- This should help make the inputs for parallel read walkers a little meater, and avoid spinning the shard creation infrastructure so often
2012-08-30 19:41:49 -04:00
Christopher Hartl 5a142fe265 After dicussion with Ryan/Eric, the Structural_Indel variant type is now gone, and has been entirely replaced with the access pattern .isStructuralIndel(). This makes it a strict subtype of indel. I agree that this method is a bit more sensible.
In addition, fix for GSA-310. If supplied -rf argument does not match a known read filter, the list of read filters will be printed, and users directed to the documentation for more information.
2012-08-30 17:57:31 -04:00
Mark DePristo 82b2845b9f Fix: GSA-531 ApplyRecalibration writing to BCF: java.lang.String cannot be cast to java.lang.Double
-- LOD must be added a double to attributes, not as string, so that it can be written out as BCF
2012-08-30 16:59:57 -04:00
Mark DePristo ce3d1f89ea ReadShard are no longer allowed to span multiple contigs
-- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads.  The old implementation simply failed in this case.  The new code handles this correctly by forcing shards to have all of their data on a single contig.
-- Added a PrintReads integration test to ensure this behavior is correct
-- Adding test BAMs that have < 200 reads and span across contig boundaries
2012-08-30 10:15:11 -04:00
Mark DePristo 53376b9423 Part III of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- shardSpan is only calculated when there some ROD is live in the GATK.  No sense in paying the cost per read when you don't need it
-- Update contract to allow null span or unmapped span (good catch unittests!)
2012-08-30 10:15:10 -04:00
Mark DePristo 1200848bbf Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
2012-08-30 10:15:10 -04:00
Mark DePristo 972be8b4a4 Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers
-- ReadMetaDataTracker is dead!  Long live the RefMetaDataTracker.  Read walkers will soon just take RefMetaDataTracker objects.  In this commit they take a class that trivially extends them
-- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class.
    -- This new implementation produces thread-safe objects (i.e., holds no points to shared state).  Suitable for use (to be tested) with nano scheduling
    -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely.
-- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView
-- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference.  Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface.  If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself.
    -- This commit breaks the new read walker BQSR, but Ryan knows this is coming
-- Subsequent commit will be retiring / fixing ValidateRODForReads
2012-08-30 10:15:10 -04:00
Mark DePristo 8fc6a0a68b Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker 2012-08-30 10:13:06 -04:00
Ryan Poplin b85ded8389 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-30 10:11:48 -04:00
Ryan Poplin 57d997f06f Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start 2012-08-30 10:10:43 -04:00
Ryan Poplin f9bab37015 Merged bug fix from Stable into Unstable 2012-08-30 09:21:24 -04:00
Ryan Poplin eb63221875 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-08-30 09:19:35 -04:00
Ryan Poplin 81d5eca975 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-30 09:10:56 -04:00
Ryan Poplin 35baf0b155 This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test. 2012-08-30 09:07:36 -04:00
Eric Banks 1acf0f0b2c Fixing bug in fasta .fai generation: trim the contig names to the first whitespace if one appears. We now generate indexes identical to samtools. 2012-08-29 22:36:27 -04:00
Eric Banks 4d38befe86 Merged bug fix from Stable into Unstable 2012-08-29 15:13:56 -04:00
Eric Banks 150a969279 Be careful with String manipulation when constructing alleles in SomaticIndelDetector 2012-08-29 15:13:28 -04:00
Eric Banks ce55ba98f4 Don't try to left align indels in unmapped reads (which for some reason can still have CIGARs) because the ref context is null. 2012-08-29 15:01:11 -04:00
Ryan Poplin 4ea38bbfe8 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-29 11:39:30 -04:00
Mauricio Carneiro 69b56e11c8 ReadClipper won't modify the original read
Reverting back to the original implementation, but now including write N's and write Q0's due to walkers that look at the same read multiple times in different reference windows
2012-08-29 11:33:19 -04:00
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Eric Banks e74c527d47 Register the depricated walkers as depricated starting in v2.2 so that users get a helpful error message 2012-08-28 10:19:18 -04:00
Eric Banks 67d348a31d Retiring the alignment walkers and related integration test since we don't want to support them anymore. 2012-08-28 10:16:49 -04:00
Mark DePristo 2996693c9f FisherStrand now computed with and without filtering low-qual bases, and least significant pvalue is kept
-- Old way (filtering for Q > 17 bases) resulted in biased FS when the site was good but there was a
systematic shift in the QUAL of REF and ALT between strands of the reads (sometimes happens)
-- New way (taking all bases) was consistent with BaseQualRankSum and other tests, but there can be
a lot of low qual reference bases on one strand in some techs (ION/PROTON/PACBIO) because of the
preference for introducing an indel vs. a mismatch.
-- This implementation allows us to have our cake and eat it to by computing both p-values, and
taking the maximum one (i.e., least significant).
-- No integration tests updated yet -- still exploring the consequences of this change
2012-08-28 08:06:47 -04:00
Eric Banks bedcdbdc5f Fixing merge conflict 2012-08-27 12:16:51 -04:00
Eric Banks 3d476487c6 LIBS is totally busted for deletions. Putting a check in AD for bad pileup event bases so that we don't produce busted alleles. We must fix LIBS ASAP. 2012-08-27 12:13:12 -04:00
Mark DePristo 63a9ae817a Ensure thread-safety of CachingIndexedFastaSequenceFile
-- Cosmetic cleanup of ReadReferenceView
-- TraverseReadsNano provides the reference context, since it's thread-safe
-- Cleanup CachingIndexedFastaSequenceFile.  Add docs, remove unnecessary setters
-- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.
2012-08-27 12:11:54 -04:00
Khalid Shakir 2d1ea7124b One less Queue command line requirement: -tempDir now defaults to .queue/tmp.
Also moved queueScatterGather to .queue/scatterGather.
2012-08-27 12:04:50 -04:00
Mark DePristo 68c5142d2d numThreads > 1 any time you have -nt > 1 silly 2012-08-26 14:36:13 -04:00
Mark DePristo fde9824765 Optimizations for parallel read walkers
-- TraversalReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone
-- Nicer debugging output in NanoScheduler
-- ReadShard has a getBufferSize() method now
2012-08-25 17:21:12 -04:00
Mark DePristo 5066b14335 Parallel FlagStat 2012-08-25 17:21:12 -04:00
Mark DePristo af540888f1 Limited version of parallel read walkers
-- Currently doesn't support accessing reference or ROD data
-- Parallel versions of PrintReads and CountReads
2012-08-25 17:21:12 -04:00
Mark DePristo e060b148e2 Minor cleanup of TraverseReads 2012-08-25 17:21:11 -04:00
Mark DePristo 275a5e5439 More tests for NanoScheduler
-- Add more contracts
-- Test in the UnitTest that the reduce is being called in the correct order
2012-08-25 17:21:11 -04:00
Christopher Hartl 6db0988898 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-25 15:40:32 -04:00
Christopher Hartl db2e88c7cb Fix for badIndelLength() throwing NPE at non-indel sites. Added integration test. 2012-08-25 12:38:23 -07:00
Mark DePristo 59b5913b54 Merged bug fix from Stable into Unstable 2012-08-25 14:53:22 -04:00
Mark DePristo dcc972a557 Usability cleanup for BQSR
-- I'm seeing a lot of people trying to use BinaryTagCovariate in the community.  They really shouldn't do this, so I moved it to private.
-- Throw an exception if its required bintag argument is missing
-- Check explicitly if user is requesting DinucCovariate and tell them that its been retired in favor of ContextCovariate
-- Show the type (Required, Experimental, Standard) of the covariates when running --list
2012-08-25 14:53:00 -04:00
Christopher Hartl b59948709f Code improvements re: JIRA GSA-510. Trio class migrated into the Samples package - because the trio structure is so ubiquitously used, it makes sense, I think, to have a class which imposes the structure on the samples. Existing functions which slightly duplicated the getTrios() method look like they have bugs. These functions are now deprecated.
A number of functions int he sampleDB looked to be assuming that samples could not share IDs (e.g. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption.

MVLikelihoodRatio now uses the trio methods from SampleDB.
2012-08-25 08:48:27 -07:00
Mark DePristo 0996bbd548 Comments for Chris on cleanup 2012-08-24 16:04:58 -04:00
Mark DePristo 649b82ce85 Merge branch 'nanoScheduler'
Conflicts:
	private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala
2012-08-24 15:59:36 -04:00
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
Christopher Hartl 752f44c332 Code cleanup in MVLR and SelectVariants. Should fix JIRA GSA-509 and GSA-510 2012-08-24 12:25:11 -07:00
Mark DePristo d6e6b30caf Initial implementation of GSA-515: Nanoscheduler
– Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficiency using resources to do sum of 2 * (sum(1 - X)).

Done!

CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of sequent available work units. May be worth measuring expected cost for read read / map / reduce unit and use it to balance the compute
As input is single threaded just need one thread to populate inputs, which runs as fast as possible on parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
2012-08-24 14:07:44 -04:00
Eric Banks 0545664f91 Fix ClassCastException seen in Tableau errors 2012-08-24 13:45:48 -04:00
Eric Banks 740520c23b Fix BQSR docs 2012-08-24 13:20:10 -04:00
Ryan Poplin 5f8574bd15 Fixing typo in error message. 2012-08-24 10:48:41 -04:00
Mark DePristo 1999b95754 Work around for GSA-513: ClassCastException in VariantEval 2012-08-23 18:14:49 -04:00