Commit Graph

906 Commits (4c23b1fe9cc2a89ba4872f5122e449af97fd2447)

Author SHA1 Message Date
hanna 4c23b1fe9c Get rid of the static cache of ArgumentTypeDescriptors by making them an integral part of the
parsing engine.  Hugely lowers our memory footprint in integrationtests, but not yet enough to 
run Mark's new parallelized VariantEvalIntegrationTests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4585 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 19:44:55 +00:00
hanna 04e38929f0 Disabling parallelized version of VE integration tests. Still slow, but not
deadlocking any more.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4580 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 02:47:03 +00:00
fromer a7af1a164b Updated MNP merging to merge VC records if any sample has a haplotype of ALT-ALT, since this could possibly change annotations. Note that, besides the "interesting" case of an ALT-ALT MNP in a pair of HET sites, this could even occur if two records are hom-var (irrespective of using phasing). Note also that this procedure may generate more than one ALT allele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4577 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 01:50:36 +00:00
depristo b085648141 Parallelized VariantEval. Refactored output to support parallel output style. Minor improvements to testing framework to enable easy executeTestParallel to run -nt 1 and -nt 4 by default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4574 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 20:21:38 +00:00
fromer c357ec775a Trivially phases any hom site (since it is always correct to continue the previous haplotypes by appending the same allele onto both haplotypes)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4568 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-25 16:58:41 +00:00
fromer 9ba7269728 Fixed Integration Tests to output VCF files with -NO_HEADER
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4548 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:49:44 +00:00
kshakir b88cfd2939 Updated MD5s of VCFs, since the approximate command line arguments injected into the VCF headers now have a little more order to them thanks to changes in the ParsingEngine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4538 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 03:07:40 +00:00
fromer f76865abbc ReadBackedPhasing now uses a SortedVCFWriter to simplify, and has the ability to merge phased SNPs into MNPs on the fly [turned off by default]; MergeSegregatingPolymorphismsWalker can also do this as a post-processing step; Integration tests for MergeSegregatingPolymorphismsWalker were also added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4534 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 20:27:10 +00:00
ebanks 7a291a8ff3 First pass at a VCF validator. Will test more tonight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4524 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 19:55:49 +00:00
chartl 2bc5971ca1 Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it.
Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. 

**IMPORTANT** I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC of being one class, but recalculating it casts to another class, and I'm not sure what to do.

I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 03:18:01 +00:00
ebanks 7aa030a9a4 Hmm. Apparently variants can get lifted over to different chromosomes. Who knew? Reverting changes from a couple of days ago. The only way to do this correctly (without requiring lots of memory) is to turn off on-the-fly indexing for this walker. Integration tests cover this now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4510 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 02:54:12 +00:00
ebanks 954dd84f51 Adding an integration test (against hg18 this time) that requires on-the-fly sorting in order to work properly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4500 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 07:45:21 +00:00
ebanks 9f54170dff Hooking up the liftover tool to the new on-the-fly sorting VCF writer so that records can now get emitted in order.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4499 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 07:27:01 +00:00
chartl 7c9ef59d65 This is simultaneously a minor and major change to VariantEval, so take heed:
The core walker has been modified so that when variant contexts (eval and comp) are subset to command-line-specified sample(s), the chromosome count annotations (AC/AN/AF) are altered to reflect the AC/AN/AF of only those samples involved in the comparison. No more getting AC500 when you're comparing a 10-sample overlap. Interestingly enough, this didn't break any integration tests.

GenotypeConcordance now has two additional tables: Allele Count Statistics, and Allele Count Summary Statistics. These work exactly identically to the Sample Statistics and Sample Summary Statistics tables, except that the partition being used is no longer the sample, but instead the allele count of the variant sites. These tables stratify by both eval and comp ACs, e.g.

evalAC0
evalAC1
evalAC2
compAC0
compAC1
compAC2

Differences with previous integration tests were verified to only be in the Allele Count tables (by grepping them out of the diff); a new test has been added for the simple case of an AC=1 site in the eval becoming an AC=2 site in the comp.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4491 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 22:26:15 +00:00
hanna 83b8676b69 Hack to fix mysterious disappearing read attributes. Ultimately caused
by the fact that the GATKSAMRecord, by design, needs to both inherit from 
SAMRecord and wrap a 'member' SAMRecord, and method calls that aren't
implemented as explicit passthroughs can compromise the content of the
SAMRecord in subtle ways.

Will be automatically fixed when Picard moves to a lightweight SAMRecord
interface rather than the current heavyweight implementation.  But in 
the short-term, there's no obvious fix.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4489 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 19:06:54 +00:00
aaron 272ac2ae4a more fixes for tests broken by indexing-on-the-fly; I think this should do it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4486 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 01:54:32 +00:00
aaron ff0df1a2da A fix for an integration test that was broken by on-the-fly indexing. Also, better reporting of Tribble exceptions in GATK integration tests. Trying to get the tests back up and running...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4483 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-12 18:39:56 +00:00
kiran f348ca2976 Now processes VCF files with repeated loci without crashing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4481 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-12 04:36:07 +00:00
depristo 116309b3c3 More test cases for UG integration test. We currently fail doing multi-threaded gzip output, FYI
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4472 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 20:22:12 +00:00
depristo 38a67fed63 High performance version of standard vcf writer. New general static Tribble class for common constants, including general .idx constant and functions to get standard index name for a given file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4471 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 19:53:21 +00:00
fromer bdd3a9752e Changed min MQ and BQ to 20 (for phasing)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4469 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 19:27:45 +00:00
chartl 21ec44339d Somewhat major update. Changes:
- ProduceBeagleInputWalker
 + Now takes a validation ROD and a prior to give it, will use those genotypes in place of the variant genotypes if both are present
 + Takes a bootstrap argument -- can use some given %age of the validation sites
 + Optionally takes a bootstrap output argument -- re-prints the validation VCF, filtering those sites used as part of the bootstrap
-BeagleOutputToVCFWalker
 + Now filters sites where the genotypes have been reverted to hom ref
 + Now calls in to the new VCUtils to calculate AC/AN

-Queue
 + New pipeline libraries for easy qscript creation, still a work in progress, but this is a considerable prototype
 + full calling pipeline v2 uses the above libraries
 + minor changes to some of my own scripts
 + no more need for contig interval lists, these will be parsed out of your normal interval list when it is provided



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4459 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 13:30:28 +00:00
depristo 0a2e76e9dc 2nd step towards on the fly indexing. Also fixed parsing bug for headers with < symbols
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4454 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 21:38:46 +00:00
rpoplin 7bb9704592 Update the BeagleOutputToVCF integration test because of removing the source header line. Source headers are provided by the engine for all VCF files now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4453 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 19:55:57 +00:00
rpoplin 0de658534d Removed the qScale arguments in VariantRecalibrator. It is smarter about how it tries to find a cut so the arbitrary scale factor hopefully is no longer necessary. Now the recalibrated variant quality score more accurately reflects our believed lod of the call.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4451 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 18:04:57 +00:00
fromer ee00dcb79d 1. Phasing now ignores bases without minimum base quality (BQ) and minimum mapping quality (MQ); 2. The probability of a non-called base is now divided by 3, to evenly split up the error probability over the non-called bases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4450 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 17:40:59 +00:00
ebanks 6205910f9f updating integration test for Sarah Calvo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4449 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 04:03:37 +00:00
fromer 652a3e8de5 Added integration tests for ReadBackedPhasing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4446 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:50:32 +00:00
kshakir ca5db821ce Added the ability to Queue to run scala functions inside the JVM. NOTE: Extend from InProcessFunction instead of CommandLineFunction to use this functionality.
Queue now submits new LSF jobs only after previous functions have completed successfully.
When the Queue process is shutdown (ex: via Control-C) sends a bkill command for any running jobs.
Ported commands like creating directories and scatter/gather interval list to scala functions.
Updates to LSF status tracking by porting the python to internally generated bash scripts.
Temporarily disabled job name submission to LSF.  Plus side is that the full command is now available in "bjobs -w".  TODO: Put back jobName passing to LSF based on an option?
Changed BaseTest to allow scala to access paths to references.
Changed the extension generator to default the analysis name to the walker "name".

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4442 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 18:29:56 +00:00
rpoplin 69485d6a7a Added command line argument for the max value of the allele count prior in VariantRecalibrator (--max_ac_prior). Default value increased to 0.99 from 0.95.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4436 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:00:53 +00:00
ebanks b5e148140b Officially fixed the UG priors; updated the default min MQ/BQs to pipeline values of q20 and min calling threshold to Q50
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4431 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 18:35:36 +00:00
ebanks 6448753cf7 Removed the SequenomValidationConvertor and renamed it VariantValidationAssessor since it no longer handles ped/sequenom files (but instead works on vcfs/variantcontexts). Updated all of the wiki docs, including adding instructions on how to convert ped files to vcf, a la Shaun Purcell. We now officially no longer support ped files everyone. Other misc cleanup in the code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4419 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 02:11:38 +00:00
hanna 4ea73bcfb1 Basic unit tests for WalkerManager.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4394 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 19:27:41 +00:00
hanna 78343be52c At some time in the recent past, we lost our ability to process the '-L all'
argument.  Brought it back, and added an integrationtest to make sure it
stays around.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4390 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 15:58:43 +00:00
delangel e80742e72f Use -o as argument for output file in ProduceBeagleInputWalker, to be consistent with other walkers (you're welcome, chartl :)).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4386 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 22:46:39 +00:00
rpoplin a6c7de95c8 By using the AC info field instead of parsing the genotypes we cut 78% off the runtime of VariantRecalibrator. There is a new argument to force the parsing of genotypes if necessary. Various other optimizations throughout.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4383 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 18:56:50 +00:00
chartl 862c94c8ce Small change for Matt -- output partition types in lexicographic order.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4365 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 20:08:03 +00:00
bthomas 96cccafb0d Adding a few helper methods for accessing sample metadata, and associated unit tests. These are motivated by discussion with Ryan about how he'll use sample metadata in VariantEvalwalker - hopefully will make it easier for him. Methods are:
-- getToolkit().subContextFromSampleProperty(): filters a VariantContext to genotypes that come from samples that have a given property value
-- getToolkit().getSamplesWithProperty(): gets all samples with a given property
-- getToolkit().getSamplesFromVariantContext(): sample objects that are referenced by name in a VariantContext



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4361 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 02:16:25 +00:00
kshakir edaa278edd Removed cases where various toolkit functions were accessing GenomeAnalysisEngine.instance.
This will allow other programs like Queue to reuse the functionality.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-25 02:49:30 +00:00
hanna 497bcbcbb7 Recent changes to the build system make the build system complain loudly about
pieces of core that depend on playground.  Most of these have been eliminated by
(temporarily) promoting Aaron's report system to core in this checkin.  I'll 
follow up with other changes in separately.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 22:09:12 +00:00
depristo 745b8cc6d3 GATK now detects and UserExceptions when human lexicographically sorted data is provided
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4343 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 15:19:48 +00:00
rpoplin 1931b2e1bd Three fixes for VariantFiltrationWalker: Trying to filter an empty VCF file will produce a well-formed VCF file with zero records instead of a blank file, needed for pipelines. The first record's genotype info fields are now in the same order as all the others. The VCF header lines are pulled from just the input variant rod instead of from all rods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4341 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 13:52:56 +00:00
kshakir 4ed9f437e9 Sliced the GAE in half like a gordian knot to avoid the constant merge conflicts.
The GAE half has all the walker specific code.  The new "Abstract" GAE has the rest of the logic.
More refactoring to come, with the end goal of having a tool that other java analysis programs (Queue, etc.) can use to read in genomic data.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4339 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-23 23:28:55 +00:00
hanna 8f75d88519 Fix for GATK run report ids:
mOVsxGfDiiSMxVs2PPTVjzYTVbizlD6e
  f9kUHUADFsZ0LiTGxRL5zPmq9kZcA4cQ
  8eGHWJFAlBVmgxwPi3sMd1RmiN2PwHOf
  iLhvHWveypKb2F8vKS5irHylc3pYvlOb
  HDttXKUMEVoPrvVeWrH7E0htxYyNydMx
plus a bit of cleanup of custom exceptions in the sharding system.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4330 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 19:49:25 +00:00
kshakir 20b38b38f3 Updated from SnakeYAML 1.6 to 1.7.
Added a pipeline java bean and YAML utility to serialize java beans.
Added a getFirehosePipelineYaml.sh that can pull firehose data into the pipeline yaml file format.
Updated the fullCallingPipeline.q to begin using the pipeline yaml file format for bams and reference.
More changes to come as this code gets tested out in the fullCallingPipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4329 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 19:47:49 +00:00
hanna fb5d595ef0 Disable VCF header output in the Beagle integrationtest.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4327 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 16:50:03 +00:00
hanna 0c99c97685 The engine now automatically adds the command-line arguments to the header of every VCF, unless -NO_HEADER is specified.
Changed integration tests, adding the -NO_HEADER argument, for walkers that previously did not include the command-line
arg headers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4326 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 15:27:58 +00:00
aaron 1af9ca6d45 enabling tests that now pass with the conitg length validation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4325 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 22:20:50 +00:00
aaron 3938d53738 one broken build short of the hat trick. Fixing the unix test which expects the sequence dictionary of the Tribble track to equal the reference; we actually return the sequence dictionary of the track iself, with each contig set to the length of the sequence dictionary contig entry.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4322 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 18:47:20 +00:00
aaron b968af5db5 The tribble indexes are now updated with correct sequence lengths for each contig they have in their sequence dictionary. Also clean-up in the RMD track builder.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4321 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 18:21:22 +00:00