Commit Graph

4016 Commits (ee348ac9d4049a4e142779f9a7ac95f7e4c0b596)

Author SHA1 Message Date
hanna e313eeede8 Push command-line expansions, such as BAM list unpacking and -B tag parsing, out
into the CommandLine* classes.  This makes it easier for external functionality
(such as the VCF streamer) to use GenomeAnalysisEngine directly.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4897 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 19:00:17 +00:00
depristo 66cca7de0f renamed genotypesArePhased to isPhased, as the previous name was incorrect for several reasons. Added setPhase() to MutableGenotype. Other classes changed to reflect renaming to isPhased(). CombineVariants now supports an experimental MASTER mode where it consumes -B:master,vcf and -B:xi,vcf for any number i and updates the master with phasing information in xi.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4896 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 17:42:05 +00:00
chartl 2235245af0 PrivatePermutations generalized to compute transition counts and average probabilities (and thus was renamed). Changes in some pipelines to reflect the change. Bugfix in the batch merging pipeline (it would halt because the allele VCF for genotyping batches could become off-spec).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4894 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 15:16:15 +00:00
delangel a1653f0c83 Another major redo for indel genotyper: this time, add ability to do allele and variant discovery, and don't rely necessarily on external vcf's to provide candidate variants and alleles (e.g. by using IndelGenotyperV2). This has two major advantages: speed, and more fine-grained control of discovery process. Code is still under test and analysis but this version should be hopefully stable.
Ability to genotype candidate variants from input vcf is retained and can be turned on by command line argument but is disabled by default. 
Code, by default, will build a consensus of the most common indel event at a pileup. If that consensus allele has a count bigger than N (=5 by default), we proceed to genotype by computing probabilistic realignment, AF distribution etc. and possibly emitting a call.

Needed for this, also added ability to build haplotypes from list of alleles instead of from a variant context.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4893 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 02:38:06 +00:00
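The consensus rule described in the commit above can be illustrated with a minimal sketch. It assumes each read's indel event at the site is encoded as a string such as `+AC` or `-T`; that encoding, the class name `ConsensusIndelSketch`, and the method `consensusEvent` are hypothetical and are not taken from the indel genotyper itself.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; not the actual indel genotyper code. */
final class ConsensusIndelSketch {
    /** Default threshold N mentioned in the commit message. */
    static final int DEFAULT_MIN_COUNT = 5;

    /**
     * Returns the most common indel event at a site if its count is strictly
     * greater than minCount, otherwise an empty Optional.
     * Ties are broken in favor of the event that reached the top count first.
     */
    static Optional<String> consensusEvent(List<String> eventsAtSite, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String event : eventsAtSite) {
            int c = counts.merge(event, 1, Integer::sum); // increment this event's count
            if (c > bestCount) {
                bestCount = c;
                best = event;
            }
        }
        return bestCount > minCount ? Optional.of(best) : Optional.empty();
    }
}
```

Under these assumptions, a site with six `+AC` events would be passed on to genotyping, while a site with exactly five would not, since the commit says the count must be bigger than N.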
hanna 09c7ea879d Merging GenomeAnalysisEngine and AbstractGenomeAnalysisEngine back together.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4889 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 02:09:46 +00:00
depristo b3ac47812c No longer emits records at filtered sites, in sub-sampling mode
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4883 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:43:50 +00:00
depristo 60880b925f VC utils prune method now will keep genotype attributes as well as info keys. RBP now emits far reduced (NO INFO, only GT:GQ:PG) records, further reducing size of phasing output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4882 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:33:14 +00:00
depristo 8604335566 Minor improvements to further reduce debugging output. When running in -samplesToPhase mode, now only including the samples to phase in the output VCF, making it very much smaller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4881 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:19:47 +00:00
depristo ff90c24f28 RBP now supports operating on a subset of samples, outputting a much reduced VCF file appropriate for merging later. Also, general optimization to avoid printing enormous amounts of data to logger.debug by using a global static variable DEBUG that conditionally allows writing to the logger. Passes integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4880 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 16:03:28 +00:00
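The DEBUG guard mentioned in the commit above might look roughly like the following sketch. It uses `java.util.logging` purely to stay self-contained (an assumption; it is not the logging library the project uses), and the class `DebugGuardSketch` and its `debug` method are hypothetical names.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Hypothetical illustration of a static DEBUG guard around debug logging. */
final class DebugGuardSketch {
    /** Static switch; when false, debug messages are never even built. */
    static boolean DEBUG = false;

    private static final Logger LOGGER = Logger.getLogger("DebugGuardSketch");

    /**
     * Builds and logs the message only when DEBUG is on, so that expensive
     * string construction is skipped entirely in normal runs.
     */
    static void debug(Supplier<String> message) {
        if (DEBUG) {
            LOGGER.fine(message.get());
        }
    }
}
```

The point of the guard is that the cost being avoided is usually the construction of the message, not the logging call itself, which is why the sketch takes a `Supplier` rather than a ready-made string.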
depristo a3729bd59c Now I call BeforeMethod correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4872 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:45:45 +00:00
depristo b7e4a015c0 static thread cache reset in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4870 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:53:10 +00:00
depristo 3bbc6a0540 Slightly more thread safe CachingIndexedFastaSequenceFile.java. Likely passes parallel testing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4869 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 21:05:17 +00:00
depristo 5dd0e8388b Fixed a bug in UnitTest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4867 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 19:44:35 +00:00
depristo 4a54f3f230 ThreadLocal version of CachingIndexedFastaSequenceFile. More efficient support for shared memory BAQ calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4865 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:44:48 +00:00
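As a rough illustration of the ThreadLocal approach named in the commit above, the sketch below gives each thread its own lazily created instance of some cache object. The class `PerThreadCacheSketch` is hypothetical and says nothing about how CachingIndexedFastaSequenceFile is actually structured.

```java
import java.util.function.Supplier;

/** Hypothetical illustration: one independently owned cache object per thread. */
final class PerThreadCacheSketch<T> {
    private final ThreadLocal<T> perThread;

    /** The factory is invoked at most once per thread, on that thread's first get(). */
    PerThreadCacheSketch(Supplier<T> factory) {
        this.perThread = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's own instance, so no locking is needed. */
    T get() {
        return perThread.get();
    }
}
```

Because no instance is ever shared between threads, this design trades some memory (one cache per thread) for the absence of synchronization on the hot path.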
depristo 32d5397c01 Experimental support for sided annotations. Currently not more/less valuable than two-tailed testing. Future experiments are needed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4864 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:08:31 +00:00
handsake 21dc05138a Bug fixes for the bwa aligner and changes to support compiling against newer releases of the bwa code base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4863 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 14:49:15 +00:00
chartl 2bd2667516 Another privately-owned class to add before re-checking out repository
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4858 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 18:14:51 +00:00
chartl e406eb0f95 Adding a useful accessor method to TableFeature
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4856 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 18:11:51 +00:00
ebanks 8ab4704b4c Adding a command-line argument to allow missing values to evaluate as false instead of true
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4854 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 05:18:12 +00:00
ebanks 9f3e56e487 VariantAnnotator shouldn't die when multiple records occur at the same position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4853 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 04:05:47 +00:00
hanna acfe83920b '-L unmapped': adding integration tests for explicitly including (-L unmapped)
unmapped reads and explicitly excluding (-XL unmapped) unmapped reads, augmenting
the suite of unit tests already put in place.

'-L unmapped' seems safe to use; go for it, but please validate results against
samtools flagstat when the process finishes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4849 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 23:11:46 +00:00
ebanks dabdeb729e Eric broke the build. Eric broke the build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4847 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 17:01:38 +00:00
ebanks 5c0b66cb7c 3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 16:24:28 +00:00
chartl 5a27d231fa Rename it so that nobody else falls into the trap laid out (the test is VariantToTable, the walker is Variant[s]ToTable)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4844 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 11:43:00 +00:00
chartl 5e27e9162f Huh? I thought we parsed out comma-separated command line arguments into list automatically...just change the syntax of the integration test, no need to update the md5
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4843 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 11:40:27 +00:00
chartl 3e75431bc8 Thanks to Mark: VCFInfoToTable removed in favor of a more flexible walker. Slight change to the argument structure of the walker to make it play more nicely with Queue: the field list parsing is pushed into the command line system (e.g. the variable is exposed as a List<String> and not a String, so Queue doesn't have to join a list into a string only to have it broken out again). This also allows the user to specify -F field1 -F field2 -F field3 if he/she so desires.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4842 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 03:33:36 +00:00
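The repeated `-F` usage described in the commit above can be sketched as a simple collector that turns every occurrence of a flag into one list entry. This is only an illustration under the assumption that each flag is followed by exactly one value; `RepeatedFlagSketch` and `collect` are hypothetical names and do not reflect the project's command-line system.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of exposing a repeated flag as a List<String>. */
final class RepeatedFlagSketch {
    /**
     * Collects the value following each occurrence of the given flag,
     * preserving command-line order. A trailing flag with no value is ignored.
     */
    static List<String> collect(String[] args, String flag) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].equals(flag)) {
                values.add(args[++i]); // consume the value so it is not re-read as a flag
            }
        }
        return values;
    }
}
```

With this shape, a caller that already holds a list never has to join it into a single string only for the tool to split it again.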
kshakir 01323447c6 Removed LibBat.SUB2_BSUB_BLOCK since the use of it exits the JVM.
Fixed integration tests to wait on their own for the job to run instead of using SUB2_BSUB_BLOCK.
Updated VariantRecalibrationIntegrationTests MD5s which were knocked out of sync while SUB2_BSUB_BLOCK was exiting in the middle of integration tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4840 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 19:57:20 +00:00
hanna 67c07d1a6a Fixed recently introduced multiplexer issue where DoC couldn't be written
directly to command-line.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4839 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 19:35:15 +00:00
hanna 526ae92093 Getting back to '-L unmapped':
- basic unit tests for interval sorting and merging with mix of mapped/unmapped.
- validation to ensure that locus walkers (really all non-read walkers) blow up with a user error when -L unmapped is specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4837 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 18:24:18 +00:00
ebanks afd4655674 Use @Output instead of @Argument. As a side note, Chris I'm ready for this nightmare to go away...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4835 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 17:13:15 +00:00
ebanks cf7d932a17 Fix for f***ed up BWA alignments that adhere to SAM specs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4834 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 17:12:25 +00:00
kshakir d550fdfd60 Disabling integration test to see if this restores the full test suite.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4833 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-14 15:27:02 +00:00
delangel a5008faca8 Bug fix: when getting variant contexts at a site, we need to get only variants that start at current location, otherwise we get duplicated records when filtering indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4830 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 19:23:10 +00:00
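The fix described in the commit above amounts to keeping only the records whose start coincides with the current locus, so that a multi-base event is seen once rather than at every position it spans. A minimal sketch, assuming a simplified record with only a name and start/end coordinates (the `Variant` class here is hypothetical and is not the project's variant context type):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of filtering overlapping records by start position. */
final class StartFilterSketch {
    /** Simplified stand-in for a variant record; 1-based inclusive coordinates assumed. */
    static final class Variant {
        final String name;
        final int start;
        final int end;

        Variant(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Keeps only the records that begin exactly at the given position. */
    static List<Variant> startingAt(List<Variant> overlapping, int position) {
        List<Variant> result = new ArrayList<>();
        for (Variant v : overlapping) {
            if (v.start == position) {
                result.add(v);
            }
        }
        return result;
    }
}
```

For example, a deletion spanning 100 to 105 overlaps position 102, but under this filter it is reported only at position 100.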
delangel 17db2e0e24 (forgot I hadn't committed this) - refactored IndelStatistics module and added a new inner class to compute Indel classification along with other statistics. So, we now get an extra table specifying, per sample, counts of whether indels are:
- Repeat Expansions
- Novel sequence
And for indels of size <=2 we get a per-mononuc. or dinuc. breakdown of novels and expansions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4828 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 17:43:43 +00:00
chartl cf75caf653 java changes:
VariantEvalWalker's logger is made public, so that variant eval modules can access it through the parent object.
 DesignFileGenerator comment lists how best to bind things to it, and the feature accessor is better refined to grab the genome loc. (old change)

scala changes:

convenience addAll( List[CommandLineFunction] ) added to QScript class (and thus removed from the fCPV2)
useful command line functions added to a new library package for command line functions (these are fast simple VCF command lines)
bug fixed in ProjectManagement for the case where there's only one batch to be batch-merged (not really part of the use-case, but an edge-condition that came up during pipeline testing)
first draft of a private mutations pipeline which will be elaborated in future



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4823 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-12 05:10:45 +00:00
depristo abd6ce1c77 A TiTv-free approach for cutting variants! Apparently much better than previous approach, and will work for indels and SV with truly minor modifications to the code. Will discuss with methods group on Monday.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4822 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 23:08:13 +00:00
depristo 974aaa134d Trivial fix to broken build
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4820 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 13:56:03 +00:00
kshakir 895cb39f41 Thanks to Platform Computing tech support, found the magical environment variable BSUB_QUIET.
Minor refactoring to add more of the CLibrary including setenv().

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4819 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 21:27:12 +00:00
depristo 5b46a900b3 Final version of BAQ calculation. default gap open is 1e-4, a good sensitive value. Useful timer class SimpleTimer added. BAQ is now live.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4818 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 19:35:12 +00:00
ebanks 491a599b59 Minor optimization
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4817 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 18:56:35 +00:00
kshakir 56433ebf6b Switched from LSF command line wrappers to JNA wrappers around the C API. Side effects:
- bsub command line is no longer fully printed out.
- extraBsubArgs hack is now a callback function updateJobRun.
Updated FullCallingPipelineTest to reflect latest changes to fullCallingPipeline.q.
Added a pipeline that tests the UGv2 runtimes at different bam counts and memory limits.
Updated VE packages that live in oneoffs to compile to oneoffs.
Added a hack to replace the deprecated symbol environ in Mac OS X 10.5+ which is needed by LSF7 on Mac.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4816 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 04:36:06 +00:00
hanna d4d3170436 Support for '-L unmapped' in read walkers. DO NOT USE THIS PATCH YET. It has been
subjected to and passes cursory testing on one dataset (and all integration tests pass).
However, there's a small library of validation checks, and unit and integration tests 
that must be added.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4813 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 19:51:48 +00:00
delangel a2d6cef181 Weird corner condition fix in indel genotyper: if there are 2 consecutive locations on candidate sites to genotype, we can get both when calling getVariantContexts and if we are triggering on an extended event - this leads to confusion and we can end up picking the wrong one. So, we require start of the vc to be the same as the start of the ref locus to be sure.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4812 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 19:34:23 +00:00
depristo 722819688a Minor utility improvements to ValidateBAQ
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4809 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 02:19:32 +00:00
depristo a63bbb2fec Optimized BAQ implementation. No longer does excessive amounts of copying of arrays. At this point I'm not 100% certain where additional performance improvements would come from
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4808 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 21:26:30 +00:00
depristo db55b2b0c6 Better testing of BAQ. Now really handles soft clipped reads properly by doing an expensive copy operation :-( will need to be transformed to a ByteBuffer in the near future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4807 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:37:00 +00:00
ebanks f1f01610f8 Remove the extra trailing tab at the end of the VCF ## header line. Unfortunately, this meant updating every freaking integration test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4806 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:22:29 +00:00
depristo 16e1bbd380 Hidden command line option to control BAQ gap open penalty for testing by me and eric. ValidateBAQWalker has misc. useful improvements. PrintReads now adds BAQ tags on output, if requested.
BAQ has generally useful improvements.  Refactor code to make it easier for BAQUnitTest to run.  minBaseQuality enforced on output, as well as input now.  Added BAQUnitTest that checks that the BAQ calculation is performing as expected.  Still needs to be expanded significantly.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4804 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 01:01:39 +00:00
depristo 1b6bec8e6b Trivial changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4803 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 20:06:54 +00:00
delangel ca7810f11d First major update of indel genotyper:
a) Really fix strand bias computation for indels this time; the previous version was only a partial fix.
b) Change the way in which we deal with bad bases at the edge of reads. Even if a base is soft clipped in CIGAR string, there may still be dangling bases with Q=2 that may throw off QUAL computation in some sites. So, we're stricter and we also trim those bases off read edges even if they are not soft-clipped officially.
c) First feeble-minded attempt at runtime optimization - don't compute log and 10^base_qual every time. Rather, cache 10^-k/10 and log(1-10^-k/10) for all k <=60. This speeds up code about 4x.
d) Further optimization: don't compute log(10^x+10^y) but rather use softMax function recently put into ExactAFCalculationModel.
e) Skip bad reads where all Q=2 (sic)
f) Avoid log to lin and back to log conversions of genotype likelihoods - this was legacy code from back when exact model did stuff in linear domain. This improves precision overall.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4802 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 18:35:22 +00:00
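Items c) and d) of the commit above describe two numerical tricks: precomputing 10^(-k/10) and log10(1 - 10^(-k/10)) for every quality k <= 60, and evaluating log10(10^x + 10^y) without leaving log space. The sketch below shows one way to do both. It is an illustration only: the class `QualCacheSketch`, the clamping of qualities above 60, and the particular max-shift formula are assumptions, and the softMax function in ExactAFCalculationModel may be implemented differently.

```java
/** Hypothetical illustration of per-quality lookup tables and a log-space sum. */
final class QualCacheSketch {
    /** Largest quality with a cached entry, per the commit message. */
    static final int MAX_QUAL = 60;

    private static final double[] ERROR_PROB = new double[MAX_QUAL + 1];
    private static final double[] LOG10_ONE_MINUS_ERROR = new double[MAX_QUAL + 1];

    static {
        // Computed once at class load instead of once per base per read.
        for (int k = 0; k <= MAX_QUAL; k++) {
            ERROR_PROB[k] = Math.pow(10.0, -k / 10.0);
            LOG10_ONE_MINUS_ERROR[k] = Math.log10(1.0 - ERROR_PROB[k]);
        }
    }

    /** Error probability for a non-negative Phred quality (assumption: clamp above MAX_QUAL). */
    static double errorProb(int qual) {
        return ERROR_PROB[Math.min(qual, MAX_QUAL)];
    }

    /** log10 of the probability that the base is correct, from the same cache. */
    static double log10OneMinusError(int qual) {
        return LOG10_ONE_MINUS_ERROR[Math.min(qual, MAX_QUAL)];
    }

    /**
     * Computes log10(10^x + 10^y) by factoring out the larger exponent,
     * which avoids underflow when x and y are very negative.
     */
    static double log10SumLog10(double x, double y) {
        double hi = Math.max(x, y);
        double lo = Math.min(x, y);
        if (hi == Double.NEGATIVE_INFINITY) {
            return Double.NEGATIVE_INFINITY; // both terms are zero probability
        }
        return hi + Math.log10(1.0 + Math.pow(10.0, lo - hi));
    }
}
```

Note that a naive `Math.log10(Math.pow(10, -1000) + Math.pow(10, -1000))` underflows to negative infinity, whereas the shifted form stays finite; this is the kind of precision benefit item f) of the commit alludes to when it keeps likelihoods in log space.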
ebanks e2d45ec2af Make Indel Realigner exceptions related to not enough space on disk or a too low file-handle limit UserExceptions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4801 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 16:37:31 +00:00
depristo 70980b659a CombineVariants no longer requires rod_priority_string
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4800 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 15:39:43 +00:00
depristo bc885b7bd0 Don't print debugging output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4799 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:57:11 +00:00
depristo c91712bd59 BAQ calculation refactoring in the GATK. Single -baq argument can be NONE, CALCULATE_AS_NECESSARY, or RECALCULATE. Walkers can control via the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the quality scores, or by only returning the baq-capped quality scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.
SAMFileWriterStub now supports BAQ writing as an internal feature.  Several walkers have the @BAQMode applied to this, with parameters that I think are reasonable.  Please look if you own these walkers, though

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4798 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:55:52 +00:00
depristo 5d2c2bd280 Just refactoring into utils/baq directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4795 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 17:43:43 +00:00
depristo 80f32712dc Tiny bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4793 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:48:33 +00:00
depristo 44feb4a362 Improved BAQ implementation. Now supports adding BAQ tags to reads on the fly with ADD_TAG_ONLY option. Caching fasta reader implementation, and changes throughout the system to enable this. Many performance improvements throughout the system due to better reference access patterns.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4792 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:29:39 +00:00
ebanks 8901e63879 Cheap optimization: don't keep calculating the log of a constant. (How did I not catch this before?)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4791 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 04:36:21 +00:00
ebanks bef48e7a42 For Chris, to make his life easier: iterate over all VCF records passed in looking for one with an ALT allele defined instead of assuming all records have one.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4789 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 02:23:38 +00:00
depristo 97c94176c0 Immediate, obvious bug fix to avoid blowing up on unmapped reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4788 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:43:39 +00:00
depristo a5b3aac864 Engine-level BAQ calculation now available in the GATK [totally experimental right now]. -baq argument to disable (NONE), to only use the tags in the BAM (USE_TAG_ONLY), use the tag when present but calculate on the fly as necessary (CALCULATE_AS_NECESSARY), and to always recalculate (RECALCULATE_ALWAYS). BAQ.java contains the complete implementation, for those interested. ValidateBAQWalker is a useful QC tool for verifying the BAQ is correct. BAQSamIterator applies BAQ to reads, as needed, in the engine. Let me know if you encounter any problems. Before prime-time, needs a caching implementation of IndexedFastaReader to avoid loading *lots* of reference data all of the time
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4787 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:23:06 +00:00
fromer b12cec4302 Added emitOnlyMNPs flag
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4785 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 20:34:17 +00:00
fromer 6d4ec7f9e7 Remove RefSeq INFO from MNPs since annotating them properly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4784 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 19:03:35 +00:00
fromer 4719bbc772 Changed dontRequireSomeSampleHasDoubleAltAllele parameter to mean that merging should only start at a polymorphic site
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4783 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 17:52:56 +00:00
ebanks ec174dc0ba As per Menachem's last commit, there's a minimally more efficient way of doing the MQ cap.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4782 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 16:37:08 +00:00
fromer 92cf7744a6 Set minMQ = max(minMQ, minBQ) for phasing since anyway we cap BQ by MQ; also, lowered MIN_BASE_QUALITY_SCORE for phasing to 17 (was previously 20)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4781 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 16:31:13 +00:00
ebanks 237ab1d489 1. As discussed in group meeting today, because we cap BQ by MQ, if MQ < minBQ then we filter the read.
2. Update to UGCalcLikelihoods for Chris: require a vcf bound to 'allele' to be provided so that we know exactly which alternate allele we should be calculating GLs for at each site.  The user is warned when the VC is not biallelic or there are multiple records at a site.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4780 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 05:57:06 +00:00
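The reasoning in item 1 of the commit above, and in the related minMQ = max(minMQ, minBQ) change for phasing listed just before it, follows from the cap itself: if every base quality is capped by the read's mapping quality, a read with MQ < minBQ cannot have any base that passes the base-quality threshold. A minimal sketch of that arithmetic, with hypothetical class and method names:

```java
/** Hypothetical illustration of capping base quality by mapping quality. */
final class QualityCapSketch {
    /** A base is never trusted more than the read's placement: BQ is capped by MQ. */
    static int cappedBaseQuality(int baseQual, int mappingQual) {
        return Math.min(baseQual, mappingQual);
    }

    /**
     * After capping, no base can exceed mappingQual, so a read with
     * mappingQual < minBaseQual has no usable bases and can be filtered up front.
     */
    static boolean readCanContribute(int mappingQual, int minBaseQual) {
        return mappingQual >= minBaseQual;
    }

    /** Equivalent formulation as a threshold: raise minMQ to at least minBQ. */
    static int effectiveMinMappingQuality(int minMQ, int minBQ) {
        return Math.max(minMQ, minBQ);
    }
}
```

For instance, with minBQ = 17 and a read at MQ = 10, a base with raw quality 30 is capped to 10 and fails the threshold, so dropping the whole read early loses nothing.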
delangel da6a07ad3b First round of critical fixes to indel genotyper (more to come tomorrow):
a) Avoid complete crash of caller that broke due to a recent refactoring by someone who must not be named <cough>EB<cough>... an integration test to avoid this in the future coming soon.
b) Fixed up strand bias computation for indels





git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4779 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 02:48:09 +00:00
fromer e09d6ee56b write non-MNP VariantContexts records only once (where they start)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4777 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 22:14:26 +00:00
fromer 1515bf6de9 Merged common VCF writing logic into phasing/WriteVCF.java
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4776 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 22:03:02 +00:00
asivache 4e62de4213 Added method getOriginalReadGroupId(): takes merged (in case of collision) read group id as reported by a read coming from the merged stream and returns this read's read group id as it was listed in the original input bam file.
IndelRealigner now uses this functionality to correctly un-mangle read group id's in --nWayOut mode (i.e. when we need to write reads into separate output bams with headers matching the original inputs).

Some hidden changes to IndelRealigner: purely testing and development, transparent to the users (hidden option added)

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4775 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 21:41:52 +00:00
rpoplin e5282742f9 Bug fix in CountCovariates, skip over indel records as well as SNPs in the dbsnp file. CountCovariates is now called CountCovariatesWalker. I've always hated that the name was swapped.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4774 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 18:43:24 +00:00
rpoplin 0adf505b53 We no longer look at by-hapmap validation status in the VQSR because using the HapMap VCF file is higher quality. As a side effect we now support the dbsnp 132 vcf file. ApplyVariantCuts now requires that the input VCF rod bindings begin with input, matching the other VQSR walkers. Wiki updated with information about how to obtain the hapmap and 1kg truth sets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4772 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 15:38:45 +00:00
ebanks 99b942b0b4 Removing duplicated header args
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4770 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 20:16:53 +00:00
fromer 9ac0f98d0d Fixed bug in retaining proper RefSeq records
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4768 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 18:39:02 +00:00
ebanks 7caf666f48 For Sendu: add a hidden option to allow bams to come out unsorted. We've agreed to let him deal with sorting these puppies on his own.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4767 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:56:13 +00:00
ebanks 3afa841a6a Fixing docs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4766 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:36:47 +00:00
ebanks 6a6cdc1925 Adding minor usage docs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4765 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:34:33 +00:00
ebanks 0d1c905df3 Adding UGCalcLikelihoods and UGCallVariants so that GSA members can break up the calling process into separate steps (calculate the GLs and then call off of those) - useful for Chris's new batch merger. As the docs say, these are absolutely not supported or recommended for public use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4764 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:32:26 +00:00
fromer b4ef716aaf As per Eric and Mark's suggestions, separated the segregating MNP merger (MergeMNPs) from the more general merger employed for annotation purposes (MergeSegregatingAlternateAlleles). Both use the same core MergePhasedSegregatingAlternateAllelesVCFWriter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4763 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 16:42:08 +00:00
ebanks 0892daddb0 Improvement for the TGEN folks: when running in the solid recal mode of SET_Q_ZERO_BASE_N, update the NM tag if one was present in the read to reflect the new N's in the read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4761 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 04:36:44 +00:00
asivache a22b1b04e6 SW-turbo. Kind of. This implementation is presumably equivalent to the old one (mathematically), but runs ~10 times faster: inner loops eliminated completely. The author of the original implementation should be sentenced to the galleys. Oh, that would be me...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4760 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 00:08:47 +00:00
delangel 2ac938fe4e 1)
Minor fixes to avoid crashes vs CG indel files:
- Add count for complex events, not just insertions and deletions
- Handle correctly cases of large indels falling out of bounds of histogram array: added a count of indels out of bounds and avoid exceptions.

2) Cosmetic fix for R script assessing UG calling performance: draw red y=x line on top of Simulated vs Estimated AC to get a better view of under/over-estimation of AC.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4758 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 21:08:25 +00:00
rpoplin af84462f3e The dev team has decided to change the filter that is added to records that are set to monomorphic by Beagle. It no longer lists the reference allele. Added those filters to the header of the output VCF file. Finally, we no longer use R2=NaN values coming from Beagle.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4757 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 17:19:54 +00:00
ebanks 21256909bb Not supported. I'm checking this in for Ryan only.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4756 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 16:59:18 +00:00
kshakir e21a66d876 Updated the Queue GATK generator and packaging to include more dependencies for fullCallingPipeline.q.
Set the -bigMemQueue in the FullCallingPipelineTest to GSA to avoid waiting for the week queue when it is busy.
Fixed the package definition of PipelineTest so that scalac won't recompile it every time.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4755 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 15:29:40 +00:00
aaron 7f2ded0706 belated special case fix for Menachem; if the results of a BTI and BTIMR produce an empty interval list, exception out. This would be solved long term with better handling of empty and / or null interval lists. I'll add a JIRA
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4754 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 05:49:20 +00:00
ebanks a181680814 We no longer require dbSNP files to be of the dbsnp rod-type; VCFs will do (provided they are bound to the name 'dbsnp')
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4753 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 03:25:18 +00:00
asivache 8ffea42b75 about 10% improvement in SW alignment (and hence IndelRealigner!) speed by using c-style linearized array representation for matrices instead of java 2D arrays...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4751 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 00:06:50 +00:00
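The linearized array representation mentioned in commit 8ffea42b75 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code from that commit; the class name `FlatMatrix` and its methods are hypothetical. The only idea shown is row-major flattening, where cell (i, j) of a matrix with `cols` columns lives at index `i * cols + j` of a single flat array.

```python
class FlatMatrix:
    """Hypothetical row-major matrix backed by one flat list."""

    def __init__(self, rows, cols, fill=0):
        self.rows = rows
        self.cols = cols
        # One contiguous buffer instead of a list of row lists.
        self.data = [fill] * (rows * cols)

    def index(self, i, j):
        # Row-major offset: skip i full rows, then move j cells.
        return i * self.cols + j

    def get(self, i, j):
        return self.data[self.index(i, j)]

    def set(self, i, j, value):
        self.data[self.index(i, j)] = value


m = FlatMatrix(2, 3)
m.set(1, 2, 7)
print(m.get(1, 2))
```

In Java, a 2D array is an array of separately allocated row arrays, so a flat buffer avoids one level of indirection per access, which is consistent with the kind of speedup the commit message reports.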
aaron b03ac61e9d consolidating the checking of the RMD sequence dictionary against the reference into a single function, and adding an integration test to test that empty VCFs pass (both the indexing and the seq dictionary validation).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4750 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 00:01:56 +00:00
hanna abc13d0a90 Temporary hack: force abort with an intelligent message suggesting that users
specify -B:dbsnp,vcf <filename> if the filename passed as the --DBSNP argument
value contains 'vcf'.  We'll replace this functionality once dbSNP 132 starts
playing nicely with the tagging system.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4749 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 23:37:30 +00:00
ebanks d89e17ec8c Fare thee well, UGv1. Here come the days of UGv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4747 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 21:51:19 +00:00
fromer 727dac7b7a Added MNP annotation of the number of AA changes occurring in the SAME RefSeq entry (numAAchanges), and if this number is > 1 for any of the alt alleles (alleleHasMultAAchanges)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4746 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 21:24:30 +00:00
ebanks 222cd42ceb Have the UG engine take care of the GL to PL conversion. Note that we still use GLs for calling (since we are losing precision in high-pass and, even worse, it can affect QD), but we emit PLs in all cases. This means that calculating the GLs, emitting them to VCF, and then calling off of them (a la samtools) is absolutely, positively not ideal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4745 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 20:28:16 +00:00
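The precision loss that commit 222cd42ceb refers to can be seen in a small sketch of a GL to PL conversion. This is an illustration only, not the UG engine's code: it assumes GLs are log10-scaled genotype likelihoods and that PLs are Phred-scaled (multiplied by -10), rounded to integers, and normalized so the most likely genotype gets 0. The function name `gl_to_pl` is hypothetical.

```python
def gl_to_pl(gls):
    """Convert log10 genotype likelihoods to normalized integer PLs.

    Assumption: PL = round(-10 * (GL - max(GL))), so the best genotype is 0.
    """
    best = max(gls)
    return [int(round(-10.0 * (gl - best))) for gl in gls]


# Two slightly different GL vectors collapse to the same PLs after rounding,
# which is why calling from the emitted PLs is less precise than from GLs.
print(gl_to_pl([-0.1, -1.3, -5.7]))
print(gl_to_pl([-0.1, -1.32, -5.68]))
```

Both calls print the same list, since the differences between the inputs are smaller than the integer rounding step.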
ebanks 102c8b1f59 Large refactoring of the UGv2 engine so that it is now truly separated into 2 distinct phases: GL calculation and AF calculation, where each can be done independently. This is not yet enabled in UGv2 itself though because I need to work out one last issue or two. Tested on 1Mb of 1000G Aug allPops low-pass and results are identical to before. Also, making BQ capping by MQ mandatory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4744 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-28 21:36:33 +00:00
ebanks ce051e4e9a Write to stdout when no -o is provided
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4743 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-28 06:19:26 +00:00
ebanks e3e6d176df Looking over the daily error log email made me realize that there were 2 implementations of vc.modifyLocation() - the correct one in VC that didn't require lazy loading the genotype data and the bad one in VCUtils that did. Removing the implementation in VCUtils and updating the code accordingly. Also, removing createPotentiallyInvalidGenomeLoc() since no one uses it anymore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4736 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-26 18:40:34 +00:00
ebanks 35b90d2295 Don't compute SB for ref calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4735 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-26 03:54:26 +00:00
ebanks 6934f83cc7 Two changes to CombineVariants.
1. Fix: VCs were padded before the merge, but they were never unpadded afterwards.  This leaves us with a VC that doesn't meet our spec.
2. Update: instead of running the merged VC through every standard annotation (which seems really wrong, since this isn't the annotator tool), just update the chromosome count annotations (AC,AF,AN) through VCUtils.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4734 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-25 04:52:12 +00:00
fromer d775192631 Check if MNP annotation of amino acid is dependent on the MNP, or could it be obtained through some single-base variant?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4733 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 22:38:33 +00:00