1418 Commits (621ee2b613bc518480599b8e0939ebfd79f7bbe5)

Author SHA1 Message Date
David Roazen 621ee2b613 Merged bug fix from Stable into Unstable 2012-01-03 16:56:49 -05:00
David Roazen ea6e718cb8 SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline.
For now, we recommend only running with the GRCh37.64 database.
2012-01-03 15:18:36 -05:00
Christopher Hartl 93e1417b6e Update to the VSS GATK documentation. 2012-01-03 13:39:31 -05:00
David Roazen 4984ca5e31 Merged bug fix from Stable into Unstable 2012-01-03 11:03:30 -05:00
David Roazen f3f01da1af Enforce serial dependencies in RecalibrationWalkersIntegrationTest
Some tests in this class were intermittently not being executed due
to being randomly scheduled before tests whose results they depend on.
Now the serial dependencies are enforced to avoid problematic orderings.
2012-01-03 10:42:41 -05:00
Eric Banks ab8d47d9a5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-03 09:38:49 -05:00
Mauricio Carneiro 3d4bf273de Added getPileupForReadGroups to ReadBackPileup
* returns a pileup for all the read groups provided.
* saves us from multiple calls to getPileup (which is very inefficient)
2012-01-03 09:35:11 -05:00
Mauricio Carneiro 4a208c7c06 Refactor of the downsampling machinery to accept different strategies
* Implemented Adaptive downsampler
* Added integration test
* Added option to RRead scala script to choose downsampling strategy
2012-01-03 09:29:47 -05:00
Mauricio Carneiro 21ae3ef5f9 Added downsampling support to ReduceReads
* Downsampling is now a parameter to the walker with default value of 0 (no downsampling)
* Downsampling selects reads at random at the variant region window and strives to achieve uniform coverage if possible around the desired downsampling value.
* Added integration test
2012-01-03 09:29:46 -05:00
Mauricio Carneiro cd68cc239b Added knuth-shuffle (KS) and randomSubset using KS to MathUtils
* Knuth-shuffle is a simple, yet effective array permutator (hope this is good english).
* added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation.
* added unit tests to both functions
2012-01-03 09:29:46 -05:00
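For readers unfamiliar with the technique named in this commit, the following is a minimal sketch of a Knuth (Fisher-Yates) shuffle and a subset helper built on it. It is not the code added to MathUtils: the class name `KnuthShuffleSketch`, the method signatures, and the use of `int[]` are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; names and signatures are not taken from MathUtils.
public class KnuthShuffleSketch {

    /** Returns a shuffled copy of the input; the input array is left untouched. */
    static int[] knuthShuffle(int[] array, Random rng) {
        int[] shuffled = Arrays.copyOf(array, array.length);
        // Walk from the last index down, swapping each slot with a uniformly
        // chosen slot at or below it, so every permutation is equally likely.
        for (int i = shuffled.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1); // 0 <= j <= i
            int tmp = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = tmp;
        }
        return shuffled;
    }

    /** Returns n elements drawn without repeating any position of the input. */
    static int[] randomSubset(int[] array, int n, Random rng) {
        if (n < 0 || n > array.length) {
            throw new IllegalArgumentException("n must be between 0 and array.length");
        }
        // The first n slots of a uniform shuffle form a uniform random subset.
        return Arrays.copyOf(knuthShuffle(array, rng), n);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int[] data = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(knuthShuffle(data, rng)));
        System.out.println(Arrays.toString(randomSubset(data, 3, rng)));
    }
}
```

Shuffling a copy keeps the caller's array intact, at the cost of one allocation per call; an in-place variant would avoid that.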
Mauricio Carneiro 94791a2a75 Add support for reads starting with insertion
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
* Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
* Updated all ReadClipper unit tests
* ReduceReads does not hard clip leading insertions by default anymore
* SlidingWindow adjusts start location if read starts with insertion
* SlidingWindow creates an empty element with insertions to the right
* Fixed all potential divide by zero with totalCount() (from BaseCounts)
* Updated all Integration tests
* Added new integration test for multiple interval reducing
2012-01-03 09:29:45 -05:00
Mark DePristo d05f0c2318 GATKPerformanceOverTime script update
-- Automatic detection of most recent version of GATK release (just tell the script now to use 1.2, 1.3, and 1.4)
-- Uses 1.4 now
-- By default we do 9 runs of each non-parallel test
-- In PathUtils added convenience utility to find most recent release GATK jar with a specific release number
2012-01-02 09:58:46 -05:00
Mauricio Carneiro 1b6d52817e fixing adaptor clipping effect on recalibration integration test 2012-01-01 22:20:06 -05:00
Eric Banks 393993e0c7 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-31 20:42:46 -05:00
Mauricio Carneiro 55cfa76cf3 Updated integration tests for the new adaptor clipping fix. 2011-12-30 18:47:14 -05:00
Mauricio Carneiro c7d0a9ebee Forgot to test for inter-chromosomal mates in the adaptor clipping
* Fixing bug caught by Eric (and Kristian)
2011-12-30 00:19:53 -05:00
Matt Hanna a259bfefd4 First commit addressing problems running RTC in parallel.
Turns out that because the RTC is the first walker to 'correctly' tree reduce according to functional programming
standards, the RTC has revealed a few problems with the tree reducer holding on to too much data.  This is the first
and smaller of two commits to reduce memory consumption.  The second commit will likely be pushed after GATK1.4 is
released.
2011-12-29 16:22:14 -05:00
Eric Banks 1a45ea5a05 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-29 11:37:15 -05:00
Mauricio Carneiro f692911903 GATKSAMRecord emptyRead static constructor
* Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking.
* All ReadClipper utilities now emit empty reads for fully clipped reads
2011-12-27 17:01:17 -05:00
Mauricio Carneiro 8259c748f2 No more Filtered Reads tag.
All synthetic reads are marked with the reduced read tag.
2011-12-27 17:01:17 -05:00
Eric Banks d20a25d681 A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently). 2011-12-27 16:50:38 -05:00
Eric Banks adff40ff58 Minor optimizations to avoid extra processing (esp. for reduced reads) 2011-12-27 13:16:25 -05:00
Mauricio Carneiro 17bfe48d5e Made all class methods private in the ReadClipper
* ReadClipperUnitTest now uses static methods
* Haplotype caller now uses static methods
* Exon Junction Genotyper now uses static methods
2011-12-27 02:11:32 -05:00
Eric Banks dd990061f6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-26 14:45:35 -05:00
Eric Banks 2130b39f33 Found the bug in the engine: RodLocusView was using the wrong seek method so that it would only move to the first locus of a shard (and with multi-locus shards, this meant that we never processed RODs from the other positions). In fact, because the seek(Shard) method is extremely misleading and now no longer used, I think it's safer to delete it and make everyone use the much more transparent seek(GenomeLoc). Note that I have not re-enabled my improvements to the intervals accumulation of ReferenceDataSource because that inefficiency is still present downstream in RodLocusView; need to discuss those changes with Matt. 2011-12-26 14:45:19 -05:00
Mauricio Carneiro 35c41409a1 Better contracts and docs for the ReadClipper
* Described the ReadClipper contract in the top of the class
* Added contracts where applicable
* Added descriptive information to all tools in the read clipper
* Organized public members and static methods together with the same javadoc
2011-12-23 19:36:57 -05:00
David Roazen 506c0e9c97 Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline
SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released
and we've verified the quality of its annotations.
2011-12-23 19:12:57 -05:00
Eric Banks 24c84da60d 'Fixing' the changes in ReferenceDataSource so that a shard properly contains a list of GenomeLocs instead of a single merged one. However, that uncovered a probable bug in the engine, so instead of letting this code fester unfixed in the build (affecting everyone in the group) I've decided to revert to the previous (slow, but working) version and fix the engine in my own branch. 2011-12-23 15:39:12 -05:00
Eric Banks 8762313a0d Better TODO message 2011-12-22 20:54:35 -05:00
Eric Banks a815e875a8 Removing debugging output 2011-12-22 15:49:11 -05:00
Eric Banks deef542a38 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-22 15:44:58 -05:00
Eric Banks 6d260ec6ae Start printing traversal stats after 30 seconds. I can't stand waiting 2 minutes. 2011-12-22 15:40:59 -05:00
David Roazen 510c71158c Merged bug fix from Stable into Unstable 2011-12-22 10:49:52 -05:00
David Roazen 32cdef9682 Rename *PerformanceTest test classes to *LargeScaleTest
This is in preparation for the installation of the new performance test suite in Bamboo.

Note that "ant performancetest" is now "ant largescaletest"
2011-12-22 10:38:49 -05:00
Mauricio Carneiro 731a463415 Updated IntegrationTests with new adaptor clipper
phew!
2011-12-20 17:48:52 -05:00
Mauricio Carneiro cadff40247 getRefCoordSoftUnclippedStart and End refactor
These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips.

* Removed from ReadUtils
* Added to GATKSAMRecord
* Changed name to getSoftStart() and getSoftEnd()
* Updated third party code accordingly.
2011-12-20 17:48:51 -05:00
Mauricio Carneiro 07128a2ad2 ReadUtils cleanup
* Removed all clipping functionality from ReadUtils (it should all be done using the ReadClipper now)
* Cleaned up functionality that wasn't being used or had been superseded by other code (in an effort to reduce multiple unsupported implementations)
* Made all meaningful functions public and added better comments/explanation to the headers
2011-12-20 17:48:40 -05:00
Mauricio Carneiro 1c4774c475 Static versions of the hard clipping utilities
For simplified access to the hard clipping utilities. There is no need to create a ReadClipper object if you are not doing multiple complicated clipping operations; just use the static methods.

 examples:
   ReadClipper.hardClipLowQualEnds(2);
   ReadClipper.hardClipAdaptorSequence();
2011-12-20 17:48:39 -05:00
Mauricio Carneiro f73ad1c2e2 Bugfix/Rewrite: Algorithm to determine adaptor boundaries
The algorithm wasn't accounting for the case where the read is the reverse strand and the insert size is negative.

    * Fixed and rewrote for more clarity (with Ryan, Mark and Eric).
    * Restructured the code to handle GATKSAMRecords only
    * Cleaned up the other structures and functions around it to minimize clutter and potential for error.
    * Added unit tests for all 4 cases of adaptor boundaries.
2011-12-20 17:48:39 -05:00
Mark DePristo 0cc5c3d799 General improvements to Queue
-- Support for collecting resources info from DRMAA runners
-- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4
-- NCoresRequest is a testing queue script for this.
-- Added two command line arguments:
  -- multiCoreJerk: don't request multiple cores for jobs with nt > 1.  This was the old behavior but it's really not the best way to run parallel jobs.  Now with queue if you run nt = 4 the system requests 4 cores on your host.  If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk
  -- job_parallel_env: parallel environment name used with SGE to request multicore jobs.  Equivalent to -pe job_parallel_env NT for NT > 1 jobs
2011-12-20 14:05:09 -05:00
Eric Banks 7204fcc2c3 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-20 12:59:11 -05:00
Eric Banks 8ade2d6ac2 max_alternate_alleles also ready to be made public 2011-12-20 12:59:02 -05:00
Eric Banks 6f52bd580b --multiallelic mode is not hidden anymore (but it is annotated as advanced); added docs 2011-12-20 12:47:38 -05:00
Mauricio Carneiro 37e0044c48 Removing unclipSoftClipBases from ReadUtils
* it was buggy and dangerous.
* Updated Chris' code to use the ReadClipper.
2011-12-20 00:11:26 -05:00
Mauricio Carneiro 78d9bf7196 Added REVERT_SOFTCLIPPED_BASES capability to ReadClipper
* New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches.
* Added functionality to clipping op to revert all soft clip bases in a read into matches
* Added revertSoftClipBases function to the ReadClipper for public use
* Wrote systematic unit tests
2011-12-20 00:04:30 -05:00
Christopher Hartl 24585062f8 Merge branch 'incoming' 2011-12-19 23:16:36 -05:00
Christopher Hartl 67298f8a11 AFCR made public (for use in VSS)
Minor changes to ValidationSiteSelector logic (SampleSelectors determine whether a site is valid for output, no actual subset context need be operated on beyond that determination). Implementation of GL-based site selection. Minor changes to EJG.
2011-12-19 23:14:26 -05:00
Eric Banks 06d385e619 Simplifying the interface a bit 2011-12-19 15:29:46 -05:00
Christopher Hartl 339ef92eac Goodbye SW by default. Now aligned reads that overlap intron-exon junctions are scored where they are by default, but the user is warned (and the record is flagged in the VCF) if there's evidence to suggest that there is an indel throwing off the scoring (e.g. if the best score of a realigned unmapped read is >5 log orders better than the best score of a scored mapped read). Unmapped reads are still SW-aligned to the junction-junction sequence. This should result in a rather massive speedup, so far untested.
UGBoundAF has to go in at some point. In the process of rewriting the math for bounding the allele frequency (it was assuming uniform tails, which is silly since I derived the posterior distribution in closed form some time back; just need to find it).
2011-12-19 12:18:18 -05:00
Christopher Hartl 418d22b67e Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable
Conflicts:
	private/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IntronLossGenotyperV2.java
2011-12-19 10:59:18 -05:00