Commit Graph

8504 Commits (d5fce22d78db9638b2ebda858f1ca8a67f928e33)

Author SHA1 Message Date
David Roazen d5fce22d78 Disable HaplotypeCaller integration tests in Stable
These tests use out-of-date files that no longer exist, and only
need to be enabled in Unstable for now.
2012-02-13 16:28:19 -05:00
David Roazen 03e5184741 Fix serious engine bug that could cause reads to be dropped under certain circumstances
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.

Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.

Thanks to Khalid, who did at least as much work on this bug as I did.
2012-02-13 16:25:21 -05:00
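Note on the commit above: the following is a minimal sketch of an overlap-safe union, not the actual GATKBAMFileSpan.union() code. It assumes a file span can be modeled as a list of half-open [start, end) chunks stored as two-element long arrays. The class name SpanUnionSketch and this representation are hypothetical. The key step is to take the larger of the two end offsets when merging, so a chunk that is fully contained in another cannot shorten the result.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a BAM file span: each chunk is {start, end}, half-open.
final class SpanUnionSketch {
    static List<long[]> union(List<long[]> a, List<long[]> b) {
        // Pool the chunks from both spans and order them by start offset.
        List<long[]> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingLong((long[] c) -> c[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] c : all) {
            int last = merged.size() - 1;
            if (last >= 0 && c[0] <= merged.get(last)[1]) {
                // Overlapping or adjacent: extend the previous chunk, keeping the
                // larger end so a contained chunk never truncates the union.
                long[] prev = merged.get(last);
                merged.set(last, new long[] {prev[0], Math.max(prev[1], c[1])});
            } else {
                merged.add(new long[] {c[0], c[1]});
            }
        }
        return merged;
    }
}
```

For example, the union of the chunks {0, 10} and {20, 30} with the chunk {5, 25} collapses to the single chunk {0, 30}.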
Khalid Shakir 23e7f1bed9 When an interval list specifies overlapping intervals, merge them before scattering. 2012-02-08 02:12:16 -05:00
Guillermo del Angel 6ec686b877 Complement to previous commit: make sure we also don't inherit filter from input VCF when genotyping at an empty site 2012-02-06 13:19:26 -05:00
Guillermo del Angel 827be878b4 Bug fix when running UG in GenotypeGivenAlleles mode: if an input site to genotype had no coverage, the output VCF had AC, AF and AN inherited from the input VCF, which could have nothing to do with the given BAM, so the numbers could be nonsensical. Now the new VC has cleared attributes instead of attributes inherited from the input VCF. 2012-02-06 11:58:13 -05:00
Guillermo del Angel 090d87b48b Bug fix in ValidationSiteSelector: when input vcf had genotypes and was multiallelic, the parsing of the AF/AC fields was wrong. Better logic to unify parsing of field 2012-02-06 10:33:12 -05:00
Matt Hanna 30b937d2af Fix bug discovered in FGTP branch in which BlockInputStream returns -1 in cases where some data could be read,
but not all the data requested by the caller.
2012-02-01 16:06:22 -05:00
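Note on the commit above: the general java.io.InputStream.read(byte[], int, int) contract is that a read returns the number of bytes actually delivered and reserves -1 for the case where no bytes are available because the end of the stream was reached. The sketch below illustrates that contract with a hypothetical in-memory source. PartialReadSketch is an illustrative name and is not the BlockInputStream code.

```java
// Hypothetical in-memory source illustrating the short-read contract.
final class PartialReadSketch {
    private final byte[] data;
    private int pos = 0;

    PartialReadSketch(byte[] data) {
        this.data = data;
    }

    int read(byte[] buf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        int remaining = data.length - pos;
        if (remaining <= 0) {
            return -1; // end of stream: nothing at all could be read
        }
        // Fewer bytes than requested is a short read, not an end-of-stream signal.
        int n = Math.min(len, remaining);
        System.arraycopy(data, pos, buf, off, n);
        pos += n;
        return n;
    }
}
```

A caller asking for 8 bytes from a 3-byte source gets 3 back on the first call and -1 only on the next call.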
Menachem Fromer a9671b73ca Fix to permit proper handling of mapping qualities between 128 and 255 (which get converted to byte values of -128 to -1) 2012-01-27 16:01:30 -05:00
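Note on the commit above: Java's byte type is signed, so a mapping quality in the range 128 to 255 wraps to a negative value when narrowed to a byte. The sketch below shows the wrap-around and one common way to recover the unsigned value by masking with 0xFF. MapqSketch and unsignedMapq are hypothetical names, and this is not necessarily how the commit itself addressed the problem.

```java
// Hypothetical helper: recover an unsigned 0..255 value from a signed Java byte.
final class MapqSketch {
    static int unsignedMapq(byte raw) {
        // Masking with 0xFF widens to int and discards the sign extension;
        // Byte.toUnsignedInt(raw) is an equivalent library call (Java 8+).
        return raw & 0xFF;
    }

    public static void main(String[] args) {
        byte stored = (byte) 200; // narrowing wraps 200 around to a negative byte
        System.out.println(stored);
        System.out.println(unsignedMapq(stored));
    }
}
```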
David Roazen b07fdb1089 Rename alltests* targets in build.xml
"ant alltests" is now "ant committests"
"ant alltests.public" is now "ant committests.public"
"ant alltests.gatk.packagejar" is now "ant releasetests.gatk.packagejar"
"ant alltests.queue.packagejar" is now "ant releasetests.queue.packagejar"

This is going into both Stable + Unstable so that all Bamboo
plans can be properly updated at the same time.
2012-01-24 14:58:30 -05:00
Mark DePristo 80a4ce0edf Bugfix for incorrect error messages for missing BAMs and VCFs
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
David Roazen d5199db8ec Be explicit about setting the snpEff -onlyCoding option in the pipeline
When run without an explicit -onlyCoding option, as we've been doing up to
now, snpEff automatically sets -onlyCoding to "true" provided that there is
at least one transcript marked as "protein_coding", which will always be the
case for us in practice (and indeed, all pipeline runs so far with snpEff
2.0.5 have run with -onlyCoding auto-set to "true").

However, given the disastrous effect on annotation quality setting
"-onlyCoding false" has, we wish to be explicit with this option
rather than relying on snpEff's auto-detection logic.
2012-01-17 20:04:27 -05:00
Matt Hanna 3ba918aff1 Error message cleanup in BAM indexing code. 2012-01-17 11:05:42 -05:00
Matt Hanna cd43f016ce Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior. 2012-01-12 13:29:11 -05:00
Mark DePristo 2e47336a81 Only print out error report for most recent release in runGATKReport.py 2012-01-11 08:54:46 -05:00
Khalid Shakir ef50e77ee2 When running Queue jobs locally, merge the stderr to the stdout log if the error file is NOT specified.
Updated VE strats in the HSP for plotting Ka/Ks by AC.
2012-01-10 16:10:25 -05:00
Matt Hanna dc60757b68 Eliminate unnecessary strong references (and therefore memory held) by tree reduce entries that have already been processed.
Thanks to Tim Fennell for the bug report.
2012-01-09 23:04:53 -05:00
Mark DePristo 845c0b1c66 Merge branch 'master' of ssh://depristo@gsa1/humgen/gsa-scr1/gsa-engineering/git/stable 2012-01-09 08:40:59 -05:00
Mark DePristo f5add25c72 Improved formatting of queueStatus 2012-01-09 08:40:53 -05:00
Matt Hanna 1f1233b669 Fix for a rare but insidious bug in position tracking during async BAM file reading.
Thanks to Khalid for spotting and reporting the issue.
2012-01-08 22:03:35 -05:00
Mark DePristo 63b7a70c44 Removing very costly analyses of all GATK versions. Will be replaced by Tableau website 2012-01-06 18:13:19 -05:00
Mark DePristo c96fee477c Bug fix for VariantSummary
-- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional. Fixed. Has the annoying side effect that indels > 50 bp in size are not counted as indels, and so are subtracted from both the novel and known counts for indels. C'est la vie
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels. Using this more recent and representative file is probably a good idea for more future tests in VE and other tools. File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data
2012-01-05 21:51:06 -05:00
Eric Banks 18ed954741 Compute Ti/Tv only if bi-allelic 2012-01-05 15:33:26 -05:00
Khalid Shakir 253a07fdb1 Implicit conversion issue/bug: QScript String<==>File shortcuts at compile time do not make String.equals(File) at runtime. 2012-01-03 18:43:45 -05:00
Mauricio Carneiro 9b55505c03 Fixing PairHMMIndelErrorModel array out of bounds
This error was due to the ReadClipper change of contract. Before the read utils would return null if a read was entirely clipped, now it returns an empty (safe) GATKSAMRecord.
2012-01-03 18:08:46 -05:00
David Roazen ea6e718cb8 SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline.
For now, we recommend only running with the GRCh37.64 database.
2012-01-03 15:18:36 -05:00
David Roazen f3f01da1af Enforce serial dependencies in RecalibrationWalkersIntegrationTest
Some tests in this class were intermittently not being executed due
to being randomly scheduled before tests whose results they depend on.
Now the serial dependencies are enforced to avoid problematic orderings.
2012-01-03 10:42:41 -05:00
Mauricio Carneiro 1b6d52817e fixing adaptor clipping effect on recalibration integration test 2012-01-01 22:20:06 -05:00
Eric Banks b0d68eb0e3 Merge remote-tracking branch 'unstable/master' 2011-12-31 20:26:44 -05:00
Mauricio Carneiro 55cfa76cf3 Updated integration tests for the new adaptor clipping fix. 2011-12-30 18:47:14 -05:00
Mauricio Carneiro c7d0a9ebee Forgot to test for inter-chromosomal mates in the adaptor clipping
* Fixing bug caught by Eric (and Kristian)
2011-12-30 00:19:53 -05:00
Matt Hanna a259bfefd4 First commit addressing problems running RTC in parallel.
Turns out that because the RTC is the first walker to 'correctly' tree reduce according to functional programming
standards, the RTC has revealed a few problems with the tree reducer holding on to too much data.  This is the first
and smaller of two commits to reduce memory consumption.  The second commit will likely be pushed after GATK1.4 is
released.
2011-12-29 16:22:14 -05:00
Matt Hanna e6e80e8d3f Update Picard to fix a bug Mauricio found in Picard where Picard unnecessarily depends on Snappy during some usages of SortingCollection. 2011-12-29 14:35:02 -05:00
Roger Zurawicki efe33a0a1b BUG FIX: Output is correct
The output would report zero coverage because the pileup was filtered using the wrong method

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-28 23:05:43 -05:00
Roger Zurawicki 5672688a73 Optimized CoverageByRG and Added GCContent
- CoverageByRG now uses a hashmap for its value instead of a list. It runs about 4 times faster.
- Cleaned up some of the code
- CoverageByRG now calculates GCContent

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-28 15:25:07 -05:00
Roger Zurawicki 0c05998c4c Added CoverageByRG LocusWalker
Will take any number of input bams and intervals
Returns a ReportTable with Average Coverage of each Read Group per Interval

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-28 15:25:07 -05:00
Mauricio Carneiro f692911903 GATKSAMRecord emptyRead static constructor
* Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking.
* All ReadClipper utilities now emit empty reads for fully clipped reads
2011-12-27 17:01:17 -05:00
Mauricio Carneiro 8259c748f2 No more Filtered Reads tag.
All synthetic reads are marked with the reduced read tag.
2011-12-27 17:01:17 -05:00
Ryan Poplin ef31b2f0a7 fixing merge conflicts. 2011-12-27 14:26:36 -05:00
Ryan Poplin 4f09a95221 Updating HaplotypeCaller for the new contracts in the adapter clipping. 2011-12-27 14:25:03 -05:00
Mauricio Carneiro 17bfe48d5e Made all class methods private in the ReadClipper
* ReadClipperUnitTest now uses static methods
* Haplotype caller now uses static methods
* Exon Junction Genotyper now uses static methods
2011-12-27 02:11:32 -05:00
Mauricio Carneiro ce493bf257 Added adaptor clipping to ReduceReads
* made all clipping steps optional with arguments.
2011-12-27 01:19:06 -05:00
Mauricio Carneiro f7a5752025 Let this one slip through my commits. 2011-12-26 21:55:02 -05:00
Mauricio Carneiro c1eaf7cf81 ReduceReads will allow different context sizes for different events
* Rename contextSize to contextSizeMismatches
* Indel context size is now different from mismatches context size
2011-12-26 21:17:29 -05:00
Mauricio Carneiro 4633637af6 Moved ReduceReads to static ReadClipper
* all clipping done in ReduceReads is done using the static methods of the ReadClipper now.
2011-12-26 21:14:40 -05:00
Mauricio Carneiro 9aa1c0c6e5 Better documentation and contracts for ReduceReads
* added javadoc to all methods
* added GATKDocs style documentation to the ReduceReadsWalker
* revised contracts and made explicit in the documentation
2011-12-26 21:12:23 -05:00
Mauricio Carneiro 3051cdf9c5 fixed reduced reads integration tests 2011-12-26 21:12:22 -05:00
Mauricio Carneiro 256a7d8bd2 fixing the arguments for RRead script 2011-12-26 21:12:22 -05:00
Eric Banks dd990061f6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-26 14:45:35 -05:00
Eric Banks 2130b39f33 Found the bug in the engine: RodLocusView was using the wrong seek method so that it would only move to the first locus of a shard (and with multi-locus shards, this meant that we never processed RODs from the other positions). In fact, because the seek(Shard) method is extremely misleading and now no longer used, I think it's safer to delete it and make everyone use the much more transparent seek(GenomeLoc). Note that I have not re-enabled my improvements to the intervals accumulation of ReferenceDataSource because that inefficiency is still present downstream in RodLocusView; need to discuss those changes with Matt. 2011-12-26 14:45:19 -05:00
Mauricio Carneiro 02495a5fd5 renaming script, once more 2011-12-23 20:01:25 -05:00