1823 Commits (27651d8dc2f10f34933c92bcfaf888cf7427f14f)

Author SHA1 Message Date
chartl 27651d8dc2 Oops. numReads is now called size
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2175 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-29 06:59:17 +00:00
chartl 21744e024b Quick walker that determines % of bases covered at (user-defined depth)x. I've been maintaining it in my directories alone, but now that I've accidentally deleted it twice, into playground it goes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2174 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-29 06:51:19 +00:00
hanna 3300ca906a An iterator for Eric to use when injecting his new wrapping reads -- a stopgap solution for getting additional caching functionality into a SAMRecord.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2173 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 22:25:52 +00:00
rpoplin 26db15be5c Added SingleReadGroupFilter to only use reads from a specific read group, filtering out all others.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2172 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 20:33:59 +00:00
rpoplin 91f5672a32 misc cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2171 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 19:56:20 +00:00
rpoplin d1298dda13 Encapsulated the sections of code that were shared by the two Recalibration walkers. This includes both the shared command line arguments and the section of code in the map methods which pull out data from the SAMRecord and stuff it into the ReadHashDatum. Command line arguments are now passed to the Covariates using a new initialize method that all Covariates must implement. Updated the dbsnp sanity check warning message to be less cryptic.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2170 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 19:54:10 +00:00
depristo 75b61a3663 Updated, optimized ReadBackedPileup. Updated test that was breaking the build -- it created a pileup from reads without bases...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2169 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 23:30:39 +00:00
alecw ac1b289d55 Add tile to ReadHashDatum, and implement TileCovariate
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2166 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 21:41:42 +00:00
depristo db40e28e54 ReadBackedPileup in all its glory. Documented, aligned with the output of LocusIteratorByState, and caching common outputs for performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2165 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 20:54:44 +00:00
rpoplin b44363d20a Removed silly casts from Integer to int.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2164 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 19:59:21 +00:00
ebanks d0f673f0c0 Use Math.abs so we don't get (inconsistent) -0's
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2160 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 19:08:34 +00:00
rpoplin 6ff8526592 Added arguments to the recalibration walkers so the user can specify the default read group id and platform to use when a read has no read group. There are also options to force every read group and every platform to be the specified values. Added integration tests that use a bam file with no read groups. Added comments to all the covariates to explain what each of the methods in the Covariate interface is used for.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2157 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 15:41:12 +00:00
aaron cfbd9332b0 small cleanups for the GATK paper genotyper; switched to the managed output system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2156 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 08:04:13 +00:00
ebanks e1e5b35b19 Don't have the spanning deletions argument be a hard cutoff, but instead be a percentage of the reads in the pileup. Default is now 5% of reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2155 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 04:54:44 +00:00
depristo 03342c1fdd Restructuring and interface change to ReadBackedPileup. We no longer support the Pileup interface, the BasicPileup static methods, and the ReadBackedPileup class. Now everything is a ReadBackedPileup and all methods to manipulate pileups are off of it. Also provides the recommended iterable() interface of pileup elements so you can use the syntax for (PileupElement p : pileup) and access directly from p.getBase() and p.getQual() and p.getSecondBase(). Only a few straggler walkers use the old style interface -- but those walkers will be retired soon. Documentation coming in the AM. Please everyone use the new syntax, it's safer, and will be more efficient as soon as the LocusIteratorByState directly emits the ReadBackedPileup for the Alignment context, as opposed to the current interface. In the process of the changeover, discovered several bugs in the second-best base code due to things getting out of sync, but these changes were resolved manually. All other integration tests passed without modification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2154 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 03:51:41 +00:00
ebanks 2cb3e53b0b Verbose mode shouldn't be printing out 'NaN's and 'Infinity's
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2153 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 22:01:00 +00:00
rpoplin c9ff5f209c Added a CountCovariates integration test that uses a vcf file as the list of variant sites to skip over instead of the usual dbSNP rod.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2152 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:51:38 +00:00
ebanks 3484f652e7 1. Variation is now passed to VariantAnnotator along with the List of Genotypes so non-genotype calls have access to all relevant info.
2. Killed OnOffGenoype
3. SpanningDeletions is now SpanningDeletionFraction
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2151 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:47:20 +00:00
ebanks e05cb346f3 GenotypeLocusData now extends Variation.
Also, Variations should be INSERTIONs or DELETIONs (and not just INDELs).
Technically, VCF records can be indels now.
More changes coming
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2150 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:07:55 +00:00
rpoplin 8b30279edc style update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2149 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 20:56:31 +00:00
rpoplin dffa46b380 BAM files created by TableRecalibration now have the version number and list of covariates used appended to their header with a new 'PG' tag. Eventually the entire list of command line args will be put in there as well. Big thanks to Matt and Aaron. The integration test uses the --no_pg_tag so that the md5 doesn't change every time the version number changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2148 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 20:53:57 +00:00
aaron 8fbc0c8473 fix for bug GSA-234: fasta index files couldn't handle anything but letters, numbers, or spaces in the contig name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2147 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 19:19:47 +00:00
andrewk 3fca23cd16 Added a stub treeReduce function for debugging multi-threaded execution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2146 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:51:19 +00:00
rpoplin 277e6d6b32 Further optimizations of TableRecalibration. This completes my goal of having the only math done in the map function be addition, subtraction and rounding the quality score to an integer. Everything else has been moved to the initialize method and only done once.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2145 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:21:57 +00:00
andrewk e4546f802c Accumulates coverage across hybrid selection bait intervals to assess effect of bait adjacency. Requires input bait intervals that have an overhang beyond the actual bait interval to capture coverage data at these points. Outputs R parseable file that has all data in lists and then does some basic plotting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2144 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:12:34 +00:00
andrewk e5106c9924 Hybrid selection performance statistics now include counts of the number of adjacent baits (0,1,2) using OverlapDetector and optionally include assayed bait quantities input via interval lists.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2143 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:07:23 +00:00
ebanks 87c1860398 I'm not sure I believe it, but JProfiler claims that calling FourBaseProbs.isVerbose() was taking 5% of my runtime...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2142 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 17:00:32 +00:00
ebanks b3f561710f Optimizations:
1. Only do calculations in UG for alternate allele with highest sum of quality scores (note that this also constitutes a bug fix for a precision problem we were having).
2. Avoid using Strings in DiploidGenotype when we can (it was taking 1.5% of my compute according to JProfiler)

UG now runs in half the time for the JOINT_ESTIMATE model.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2141 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 16:27:39 +00:00
rpoplin a59e5b5e1a Added dbSNP sanity check to CountCovariates. If the mismatch rate is too low at dbSNP sites it warns the user that the dbSNP file is suspicious. Added option in CountCovariates and TableRecalibration to ignore read group IDs and collapse them together. Also, if the read group is null, the walkers no longer crash with a NullPointerException but instead warn the user that the read group and platform are defaulting to some values. Default window size in MinimumNQSCovariate is 5 (two bases in either direction) based on rereading of Chris's analysis.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2140 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 16:16:44 +00:00
alecw e5e6d515c3 Fix misunderstanding of GenomeLoc interval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2138 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 15:12:49 +00:00
ebanks cb6d6f2686 Very minor performance improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2137 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 05:21:07 +00:00
ebanks c90bea39a1 read.getReadString().charAt(offset) --> read.getReadBases()[offset]
[As a courtesy I fixed all instances once I was updating GenotypeLikelihoods]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2136 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 04:25:19 +00:00
ebanks ec321abd7b Added ability to filter on the QUAL field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2135 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 04:08:22 +00:00
ebanks 36d493e645 All standard annotations now inherit from StandardVariantAnnotation. Users can specify whether they want all annotations, just the standard annotations, or specific annotations. When calling in from another walker, the default is just the standard ones.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2134 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 03:55:12 +00:00
ebanks ee5093d2c6 -Added VariantFiltration integration tests
-Added integration test for GLFs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2133 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 02:36:27 +00:00
ebanks be6a549e7b Added the capability to allow expressions in an integration test command (e.g. -filter 'foo') by escaping them in the command.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2132 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 02:34:48 +00:00
hanna 903342745d Basic integration test for the aligner.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2131 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 23:08:05 +00:00
hanna 4837fe919c Convenience changes. If no -BWT option is specified, pull the BWT location from the reference.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2130 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 22:46:05 +00:00
rpoplin 9e4eadc37c CountCovariates v2.0.2: Added a --process_nth_locus <int> argument to only use every Nth covered locus when creating the recalibration table.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2129 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 22:07:38 +00:00
chartl 6a52ca3db6 Update to the UG integration test. Why I had to rm -rf my entire sting directory to get it to correctly fail we may never know.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2128 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 21:23:00 +00:00
ebanks ed4cf3de57 Check that we're biallelic before calling isSNP()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2127 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:20:48 +00:00
rpoplin 5744a1d968 The covariates don't care about SAMRecords anymore - Cleaning up the import statements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2126 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:10:12 +00:00
chartl 23983b2fd8 New annotation: ResidualQuality
Computes a metric for how much error is left that isn't explained by ref or snp bases. This is the sum of Q scores, weighted by the proportion of non-ref non-snp bases to non-snp bases. Reported in log space.

Update to the integration test so bamboo doesn't look as though someone murdered it with a spork
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2124 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:04:01 +00:00
ebanks 70059a0fc9 Refactored joint estimation model to allow subclasses to overload PofD calculation over all frequencies. Pooled model now takes only 20% of time that it used to.
Added integration test for pooled model and updated other joint estimation tests to be more comprehensive now that they are faster.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2123 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:03:38 +00:00
rpoplin 7f947f6b60 Updated recalibrator integration tests to use all three platforms as well as a bam with multi-platform reads intermingled. CountCovariates v2.0.1: Once again uses a read filter to filter out zero mapping quality reads. Added --sorted_output option to output the table recalibration file in sorted order
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2122 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 19:51:36 +00:00
ebanks c299ca5f49 It would help if I copied the MD5s from the right integration test...
I hate Mondays.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2121 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:21:36 +00:00
ebanks ff4797acbb Forgot to check in integration test update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2120 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:13:51 +00:00
ebanks 14bf6ce83c 1. Newest version of the joint estimation model. Faster than previous version and now qscores can get to be > 39.8 for hets.
2. More sanity checks in annotations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2119 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:05:50 +00:00
hanna ee2abd30c4 Count the best alignments and emit them to a file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2118 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 16:37:59 +00:00
rpoplin 1d46de6d34 The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 14:58:33 +00:00