Commit Graph

3383 Commits (30a104228aa1a5d44ad8d6e11c9c66b929beb3a6)

Author SHA1 Message Date
ebanks 30a104228a Don't require entropy reduction when cleaning only at known sites; instead we need to trust the known indels. This will improve consistency between lane-level and aggregated cleaning.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4076 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-22 02:44:38 +00:00
depristo b6989289fc Potential bug fix for bad references where some codons may have Ns
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4075 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-21 12:09:33 +00:00
kiran 121b4f23b6 Simple change to allow a list of samples or regular expressions to be provided in a text file (one line per sample).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4074 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-21 00:01:48 +00:00
ebanks 165dc6d3b0 Ryan, what did you decide about supporting this tool? Is it still useful?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4073 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-20 19:16:14 +00:00
ebanks 2ef2f1b24a Fix UG's simple indel calculation model so that deletions are created correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4072 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-20 15:35:47 +00:00
aaron fa36731faf fixes for VariantEval integration tests affected by the spaces to underscores change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4070 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 22:43:20 +00:00
fromer 1c4784999a Updated to work exclusively in log10 space
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4069 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 21:31:07 +00:00
fromer 3af4e618cc Fixed precision issues with PQ (phasing quality)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4068 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 20:34:47 +00:00
kshakir 88ca1fb22c Lazy loading reflections so Queue can hack the classpath before the PluginManager looks for classes.
Removed extra quotes from 'cd' pre-exec command.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4067 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 20:29:52 +00:00
aaron 63ada20da5 allow RefSeq files to optionally contain the header line, which is the default output from the UCSC table browser
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4065 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 20:25:37 +00:00
fromer effeedf1a3 Updated Bayesian phasing method to output per-site phasing statistics (and to not cap PQ at 40)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4064 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 19:55:47 +00:00
aaron 04e5b28f6d updates for VCF; we can no longer cache genotypes or alleles in a static array, this is bad for sharred memory parallel runs. One instance per codec was better for performance than using ThreadLocal code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4063 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 19:34:44 +00:00
corin 8054b6b295 Changing a name of a column for variantevals output for easier reading by R--let me know if this needs to be updated elsewhere; it's just a space to an underscore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4062 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 19:18:16 +00:00
ebanks 4b94f8c21b Silly me, I forgot to check for the contig boundaries. Thank goodness for performance tests!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4061 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 18:40:26 +00:00
aaron f16bb1e830 fix for a bug in package utils.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4060 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 15:01:50 +00:00
fromer 15c5aa6e48 Efficient iteration over all possible combinations of variable assignments, for variables of arbitrary cardinalities
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4059 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 14:14:37 +00:00
ebanks 1ec305cd15 Fix for running the cleaner at the lane-level for known indels only: instead of relying on the reads to get the reference sequence, we now use an IndexedFastaSequenceFile in all cases and pad the reference with bases on either end. This allows us to deal with cases in which we are trying to clean just a single deletion-containing read with tiny LOD (so the read needs to be pushed off the seen reference; @Reference doesn't yet work for Read Walkers) and has the added benefit of allowing us now to get much larger known indels that aren't completely covered with reads.
Thanks to Matt for the advice.

Also, for Guillermo: while I was at it, I changed the .stats debug output to emit the original interval instead of the cleaned region.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4058 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 11:31:13 +00:00
ebanks 98f7679619 Fixed the bug reported on GS regarding a clipped read that got moved several hundred bases away. The code that got triggered here was written back in the original version of the cleaner and it never actually did the right thing.
While I was fixing it, I noticed that we weren't allowing the cleaner to un-clean reads with indels when they're wrong even though we should.  Hypothetically, that should rarely happen: only when we can left-align out an indel or when the original mapper really went haywire.  This situation is rare enough that I'm calling logger.info to let the user know it's happening and suggesting that they double-check that everything looks right with their reads.  Better to be extra-cautious now that the cleaner is moving into the 1kg and Broad production pipelines soon
.
Mark, have no fear: this was truly a rare edge case - one that won't affect the cleaning stats.  There is no need to re-clean the data processing paper bams!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4057 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 01:42:48 +00:00
aaron 3dc4d3c3a9 removing the custom reflections library from the libs, and adding a release version. Hopefully this will fix the problem Menachem has been seeing with random JVM crashes. Also
removed the auto-deletion of the reflections jar, and removed the very old OmniPlan document we had checked-in.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4056 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 00:42:37 +00:00
fromer 1336ea17a3 quality-scored-based Bayesian phasing algorithm implemented
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4055 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-18 21:17:46 +00:00
fromer 553bda4e0e PreciseNonNegativeDouble permits precise arithmetic operations on NON-NEGATIVE double values
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4054 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-18 21:10:58 +00:00
rpoplin 8f15b2ba72 Memory optimization for the VariantRecalibrator. Only add variants to the list if they pass the novelty and qual filters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4051 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-17 21:57:28 +00:00
kshakir b7c60b9729 Queue now uses its own version instead of the gatk version.
Added a Queue release directory.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4050 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-17 19:34:23 +00:00
aaron e632d9b83d remove some dependencies on out of date methods from the tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4047 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-17 00:07:26 +00:00
aaron c1df293feb remove testing code from tribble track builder, set the command line program in walker test to null to reclaim memory in integration tests, and removed some orphaned intergration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4046 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 23:52:01 +00:00
rpoplin 578e7fa36d Don't output -0 as qual value in VariantRecalibrator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4044 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 16:47:58 +00:00
kiran 3d63302b70 Deprecated. Use SelectVariants instead.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4043 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 15:07:50 +00:00
depristo 20db00a3e8 Lazy reference loading; the engine doesn't fetch the reference bases until you actually call ref.getBases(). With the new hidden --dontUpdateUG to table recalibrator this is 2-3x faster than before. Enabled for locus, read, and rod walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4042 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 13:46:22 +00:00
aaron 9ab647b730 adding checks to the RefSeq rod for line's that contain less than the required number of columns (we expect there to be 16 columns)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4041 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 13:34:32 +00:00
aaron cc58a27b00 fix for broken unit test; make sure when we can't get an index off of disk, the internal method returns null
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4040 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 13:12:32 +00:00
aaron b23545fafa re-enable the check for up-to-date versions in the Tribble index.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4039 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 12:47:58 +00:00
ebanks 37586d3a43 Don't exception out when bad aligners emit wonky alignments; instead, just don't clean
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4038 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 02:36:04 +00:00
depristo a36951f11a @output and @input arguments for table recalibration for use with Q
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4037 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-14 18:36:28 +00:00
depristo 61064d7075 GenotypeConcordance log file -- if provided, GC module will write FN/FP information to this file by context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4036 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-14 18:35:57 +00:00
depristo 0d209d5442 Nicer printing out of clustering
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4035 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-14 16:02:13 +00:00
kshakir 4710015c17 Disabled AlignerIntegrationTest while addressing build machine memory issues.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4033 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-14 01:23:21 +00:00
kshakir 307c8ca027 Created a new playground script for cleaning bams in Firehose.
Some refactoring of Queue extensions for reusability in scripts.
Putting the extensions into the Queue.jar after building them.
More updates to GATK walker arguments specifying @Input and @Output for Queue.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4032 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 23:52:24 +00:00
fromer dfe2922b5e First working version of statistical haplotype phaser
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4031 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 21:29:45 +00:00
ebanks f36c0ed613 Stop building obsolete VCFTools and CGUtilities
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4030 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 19:28:36 +00:00
rpoplin 222f61df87 Bug fix for damoskow in TableRecalibration. Shouldn't try to update the reference mismatch rate tag for an unmapped read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4028 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 18:57:07 +00:00
kshakir 80a70ccf03 Repopulating rodsToSamples. Code reviewed by Eric.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4027 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 17:07:18 +00:00
hanna cb144734c0 Getting rid of GenotypeWriter interface. Of note:
- GATKVCFWriter deleted, to be replaced if absolutely necessary when VCF writing goes into Tribble.
- VCFWriter is now an interface, for easier redirection.
- VCFWriterImpl fleshes out the VCFWriter interface.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4026 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 16:33:22 +00:00
kshakir 542d394e09 Cleaning up Queue debugging output.
-l DEBUG with local programs now prints out the stdout/stderr of the programs as they are run.
More documentation in the examples with a new even simpler CountReads example.
Took out unused option to build Queue GATK extensions separately.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4025 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 15:54:08 +00:00
chartl 49a3db9dfe A brief implementation of a QD calculation that is not quite so bimodal for known variants (multiplicatively penalizes QD by (n variant samples)/(n variant alleles) ). Not sure how helpful this will be (which is why it is in oneoffs). Seems nice on MCKD1, but I'm still playing with the optimization.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4024 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 15:42:37 +00:00
chartl c6a8fba922 Occasionally if a JEXL expression results in no variants being captured (like "QD > 20.0" on filtered variants) the per-sample mapping from samples to eval objects can be empty. This semi-hacky fix prevents null pointer exceptions in setting up the resulting empty table (by jumping straight to it in this case)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4023 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 15:37:45 +00:00
ebanks f874e548aa Shame on us. FlagStat used ints instead of longs, so we ended up getting negative read counts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4022 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 03:00:57 +00:00
ebanks 71c4d3f33d Moving pointer to b36 reference from /broad/1KG to /humgen/1kg
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4021 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-13 00:54:34 +00:00
kshakir f39dce1082 Exposed CommandLineFunction defaults to the Queue.jar command line (see -help).
Added ability to skip up-to-date jobs where the outputs are older than the inputs.
Changed -T CountDuplicates --quiet to --quietLocus so that Queue GATK extensions can use both short and full argument names.
Short names can be used to set values on Queue GATK extensions, for example: vf.XL :+= myFile
Moved Hidden from the GATK to StingUtils.
Updated ivy from 2.0.0 to 2.2.0-rc1 to fix sha1 issue: http://bit.ly/aX72w7
Added Queue to javadoc and testing build targets.
Added first Queue unit test.
Another pass at avoiding cycles in the DAG thanks to all function I/O being files.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4017 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 21:58:26 +00:00
chartl 8c08f47923 1) Make sure that the table size is set correctly in finalize()
2) Make sure variants are biallelic before asking for isTransversion()



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4016 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 20:32:22 +00:00
hanna 41d57b7139 Massive cleanup of read filtering.
- Eliminate reduncancy of filter application.
- Track filter metrics per-shard to facitate per merging.
- Flatten counting iterator hierarchy for easier debugging.
- Rename Reads class to ReadProperties and track it outside of the Sting iterators.
Note: because shards are currently tied so closely to reads and not the merged triplet of <reads,ref,RODs>, the metrics
classes are managed by the SAMDataSource when they should be managed by something more general.  For now, we're hacking
the reads data source to manage the metrics; in the future, something more general should manage the metrics classes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4015 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 20:17:11 +00:00