fromer
2bf4fc94f0
Try to use more sampling to get a "correct" estimate of the multivariate integral
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4815 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 02:58:46 +00:00
fromer
c64bf80b57
Added theoretical model of read-backed phasing (RBP)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4814 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 01:46:09 +00:00
hanna
d4d3170436
Support for '-L unmapped' in read walkers. DO NOT USE THIS PATCH YET. It has been
...
subjected to, and passes, cursory testing on one dataset (and all integration tests pass).
However, a small library of validation checks, plus unit and integration tests,
must still be added.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4813 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 19:51:48 +00:00
delangel
a2d6cef181
Fix for a weird corner case in the indel genotyper: if two consecutive locations are candidate sites to genotype, getVariantContexts can return both when we are triggering on an extended event. This leads to confusion, and we can end up picking the wrong one. So we now require the start of the vc to be the same as the start of the ref locus, to be sure.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4812 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 19:34:23 +00:00
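Note on the commit above: the start-position check it describes can be sketched roughly as follows. This is only an illustration, not the genotyper's code. The dict-based records, the `start` key, and the function name `contexts_starting_at` are all hypothetical stand-ins for the real VariantContext objects.

```python
def contexts_starting_at(contexts, locus_start):
    """Keep only the candidate contexts whose start equals the reference locus start.

    `contexts` is a list of dicts with a "start" key (a hypothetical stand-in
    for the variant contexts returned for an extended event).
    """
    return [vc for vc in contexts if vc["start"] == locus_start]


# Two candidate sites at consecutive positions, both returned for one locus.
candidates = [
    {"name": "indel_A", "start": 100},
    {"name": "indel_B", "start": 101},
]
selected = contexts_starting_at(candidates, 101)
```

With the check in place, only the context that starts at the locus being genotyped is considered, so the neighbouring site cannot be picked by mistake.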
corin
27acede64d
Removing old arguments. We'll now be running with the defaults.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4811 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 18:58:56 +00:00
depristo
fabb42924c
Minor improvements to my crappy old python job management system. Mauricio's first task is to retire all of this code and move the DPP pipeline over to Queue
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4810 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 04:44:16 +00:00
depristo
722819688a
Minor utility improvements to ValidateBAQ
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4809 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 02:19:32 +00:00
depristo
a63bbb2fec
Optimized BAQ implementation. No longer does excessive amounts of copying of arrays. At this point I'm not 100% certain where additional performance improvements would come from
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4808 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 21:26:30 +00:00
depristo
db55b2b0c6
Better testing of BAQ. Now really handles soft clipped reads properly by doing an expensive copy operation :-( will need to be transformed to a ByteBuffer in the near future.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4807 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:37:00 +00:00
ebanks
f1f01610f8
Remove the extra trailing tab at the end of the VCF ## header line. Unfortunately, this meant updating every freaking integration test.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4806 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 17:22:29 +00:00
chartl
f8dd59c1d1
Tightening of the batch merging pipeline. Optimized to run on hour queue, so please: if you run this, crush 'hour' with it. Testing is forthcoming, but it merged 700 samples overnight.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4805 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 14:36:23 +00:00
depristo
16e1bbd380
Hidden command line option to control BAQ gap open penalty for testing by me and eric. ValidateBAQWalker has misc. useful improvements. PrintReads now adds BAQ tags on output, if requested.
...
BAQ has generally useful improvements. Refactor code to make it easier for BAQUnitTest to run. minBaseQuality enforced on output, as well as input now. Added BAQUnitTest that checks that the BAQ calculation is performing as expected. Still needs to be expanded significantly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4804 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 01:01:39 +00:00
depristo
1b6bec8e6b
Trivial changes
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4803 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 20:06:54 +00:00
delangel
ca7810f11d
First major update of indel genotyper:
...
a) Really fix strand bias computation for indels this time; the previous version was only a partial fix.
b) Change the way we deal with bad bases at the edge of reads. Even if a base is soft clipped in the CIGAR string, there may still be dangling bases with Q=2 that can throw off the QUAL computation at some sites. So we're stricter, and we also trim those bases off read edges even if they are not officially soft-clipped.
c) First feeble-minded attempt at runtime optimization - don't compute log and 10^base_qual every time. Rather, cache 10^(-k/10) and log(1-10^(-k/10)) for all k <= 60. This speeds up the code about 4x.
d) Further optimization: don't compute log(10^x+10^y) directly, but rather use the softMax function recently put into ExactAFCalculationModel.
e) Skip bad reads where all Q=2 (sic)
f) Avoid log to lin and back to log conversions of genotype likelihoods - this was legacy code from back when the exact model did stuff in the linear domain. This improves precision overall.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4802 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 18:35:22 +00:00
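Note on items c) and d) of the commit above: a minimal sketch of the two optimizations, under stated assumptions. The table names (`ERR_PROB`, `LOG10_OK`) and the function `log10_sum` are hypothetical. The second part uses the standard max-shifted log-sum trick; whether the softMax function in ExactAFCalculationModel is implemented exactly this way is an assumption.

```python
import math

MAX_QUAL = 60  # the commit caches values for all Phred scores k <= 60

# Cached once: error probability 10^(-k/10) for each base quality k.
ERR_PROB = [10.0 ** (-k / 10.0) for k in range(MAX_QUAL + 1)]

# Cached once: log10 of the probability that the base is correct.
# k = 0 means P(error) = 1, so the log of (1 - 1) is minus infinity.
LOG10_OK = [
    math.log10(1.0 - p) if p < 1.0 else float("-inf")
    for p in ERR_PROB
]


def log10_sum(x, y):
    """Return log10(10**x + 10**y) while staying in log space."""
    hi, lo = max(x, y), min(x, y)
    if hi == float("-inf"):
        return hi  # both terms are zero probability
    # Factor out the larger term so the exponent is never positive.
    return hi + math.log10(1.0 + 10.0 ** (lo - hi))
```

Inside the per-base loop, a table lookup such as `LOG10_OK[q]` then replaces a call to `log10` and a power computation. Shifting by the larger term also keeps `log10_sum` usable for very small likelihoods, for example `log10_sum(-1000, -1001)`, where evaluating `10**x` directly would underflow to zero.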
ebanks
e2d45ec2af
Make Indel Realigner exceptions related to not enough space on disk or a too low file-handle limit UserExceptions.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4801 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 16:37:31 +00:00
depristo
70980b659a
CombineVariants no longer requires rod_priority_string
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4800 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-07 15:39:43 +00:00
depristo
bc885b7bd0
Don't print debugging output.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4799 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:57:11 +00:00
depristo
c91712bd59
BAQ calculation refactoring in the GATK. The single -baq argument can be NONE, CALCULATE_AS_NECESSARY, or RECALCULATE. Walkers can control via the @BAQMode annotation how the BAQ calculation is applied: as a tag, by overwriting the quality scores, or by only returning the BAQ-capped quality scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.
...
SAMFileWriterStub now supports BAQ writing as an internal feature. Several walkers have the @BAQMode applied to this, with parameters that I think are reasonable. Please look if you own these walkers, though
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4798 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:55:52 +00:00
chartl
02de9a9764
With multi-sample genotyping must come scatter+gather. Also Khalid informed me of the .group(size) method, so removing my useless (but pretty) code.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4797 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 20:12:23 +00:00
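Note on the commit above: splitting a list of inputs into fixed-size groups for scatter+gather might look like the sketch below. The helper name `group` mirrors the method mentioned in the message, but this is a hypothetical stand-in written for illustration, not the library method itself.

```python
def group(items, size):
    """Split `items` into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Seven samples scattered into batches of three; the last batch is short.
batches = group(["s1", "s2", "s3", "s4", "s5", "s6", "s7"], 3)
```

Each batch can then be processed as its own job and the results gathered afterwards.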
chartl
f4c43f013f
Due to the overhead for reading VCF files (>32g for 700 5MB VCF files), batched merging has to generate likelihoods in batches.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4796 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 18:23:54 +00:00
depristo
5d2c2bd280
Just refactoring into utils/baq directory
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4795 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 17:43:43 +00:00
depristo
aec6c0a030
BAQ reorg
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4794 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 17:23:51 +00:00
depristo
80f32712dc
Tiny bug fix
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4793 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:48:33 +00:00
depristo
44feb4a362
Improved BAQ implementation. Now supports adding BAQ tags to reads on the fly with ADD_TAG_ONLY option. Caching fasta reader implementation, and changes throughout the system to enable this. Many performance improvements throughout the system due to better reference access patterns.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4792 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 18:29:39 +00:00
ebanks
8901e63879
Cheap optimization: don't keep calculating the log of a constant. (How did I not catch this before?)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4791 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 04:36:21 +00:00
chartl
0944184832
Major refactoring of library and full calling pipeline (v2) structure.
...
Arguments to the full calling qscript (and indeed, any qscript that wants them) are now specified via the PipelineArgumentCollection
Libraries require a Pipeline object for instantiation -- eliminating their previous dependence on yaml files
Functions added to PipelineUtils to build out the proper Pipeline object from the PipelineArgumentCollection, which now contains
additional arguments to specify pipeline properties (name, ref, bams, dbsnp, interval list); which are mutually exclusive with
the yaml file.
Pipeline length reduced to a mere 62 lines.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4790 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 02:33:54 +00:00
ebanks
bef48e7a42
For Chris, to make his life easier: iterate over all VCF records passed in looking for one with an ALT allele defined instead of assuming all records have one.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4789 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 02:23:38 +00:00
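Note on the commit above: the change from "assume every record has an ALT allele" to "scan for the first record that defines one" can be sketched as follows. The dict-based records and the name `first_record_with_alt` are hypothetical; treating "." as a missing ALT follows the usual VCF convention.

```python
def first_record_with_alt(records):
    """Return the first record that defines at least one ALT allele, else None."""
    for rec in records:
        # A missing key, an empty list, or "." placeholders all count as "no ALT".
        alts = [a for a in rec.get("alt", []) if a not in (".", "")]
        if alts:
            return rec
    return None


records = [
    {"pos": 10, "ref": "A", "alt": ["."]},
    {"pos": 10, "ref": "A", "alt": []},
    {"pos": 10, "ref": "A", "alt": ["G"]},
]
chosen = first_record_with_alt(records)
```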
depristo
97c94176c0
Immediate, obvious bug fix to avoid blowing up on unmapped reads
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4788 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:43:39 +00:00
depristo
a5b3aac864
Engine-level BAQ calculation now available in the GATK [totally experimental right now]. -baq argument to disable (NONE), to only use the tags in the BAM (USE_TAG_ONLY), use the tag when present but calculate on the fly as necessary (CALCULATE_AS_NECESSARY), and to always recalculate (RECALCULATE_ALWAYS). BAQ.java contains the complete implementation, for those interested. ValidateBAQWalker is a useful QC tool for verifying the BAQ is correct. BAQSamIterator applies BAQ to reads, as needed, in the engine. Let me know if you encounter any problems. Before prime-time, needs a caching implementation of IndexedFastaReader to avoid loading *lots* of reference data all of the time
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4787 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-04 20:23:06 +00:00
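Note on the commit above: the four -baq modes it lists amount to a small decision per read. The sketch below is an illustration only. The mode names come from the commit message, but the enum `BaqCalculationMode`, the function `baq_action`, and the returned action strings are hypothetical. In particular, skipping a read that has no tag under USE_TAG_ONLY is an assumption, since the message does not say what happens in that case.

```python
from enum import Enum


class BaqCalculationMode(Enum):
    NONE = "NONE"
    USE_TAG_ONLY = "USE_TAG_ONLY"
    CALCULATE_AS_NECESSARY = "CALCULATE_AS_NECESSARY"
    RECALCULATE_ALWAYS = "RECALCULATE_ALWAYS"


def baq_action(mode, read_has_tag):
    """Decide what to do for one read: "skip", "use_tag", or "calculate"."""
    if mode is BaqCalculationMode.NONE:
        return "skip"
    if mode is BaqCalculationMode.RECALCULATE_ALWAYS:
        return "calculate"
    if read_has_tag:
        return "use_tag"
    # No tag on the read: compute on the fly only if the mode allows it.
    # (Skipping under USE_TAG_ONLY is an assumption, see the note above.)
    if mode is BaqCalculationMode.CALCULATE_AS_NECESSARY:
        return "calculate"
    return "skip"
```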
corin
bdc7516168
Taking out recalibrating for now, since having these files is confusing people and we've not gone to dbsnp 132 yet so cluster generation's broken with these command lines.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4786 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 22:12:09 +00:00
fromer
b12cec4302
Added emitOnlyMNPs flag
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4785 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 20:34:17 +00:00
fromer
6d4ec7f9e7
Remove RefSeq INFO from MNPs since annotating them properly
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4784 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 19:03:35 +00:00
fromer
4719bbc772
Changed dontRequireSomeSampleHasDoubleAltAllele parameter to mean that merging should only start at a polymorphic site
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4783 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 17:52:56 +00:00
ebanks
ec174dc0ba
As per Menachem's last commit, there's a minimally more efficient way of doing the MQ cap.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4782 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 16:37:08 +00:00
fromer
92cf7744a6
Set minMQ = max(minMQ, minBQ) for phasing, since we cap BQ by MQ anyway; also lowered MIN_BASE_QUALITY_SCORE for phasing to 17 (was previously 20)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4781 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 16:31:13 +00:00
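Note on the commit above: the reasoning behind minMQ = max(minMQ, minBQ) is that once each base quality is capped by the read's mapping quality, a read with MQ < minBQ cannot contribute any base that passes the BQ threshold, so it can be dropped up front. A minimal sketch, with hypothetical function names and plain integer lists standing in for reads:

```python
def effective_min_mq(min_mq, min_bq):
    """Raise the MQ threshold so that it is never below the BQ threshold."""
    return max(min_mq, min_bq)


def usable_base_quals(base_quals, mapping_qual, min_mq, min_bq):
    """Return the capped base qualities that survive both thresholds."""
    if mapping_qual < effective_min_mq(min_mq, min_bq):
        return []  # the whole read is filtered
    capped = [min(q, mapping_qual) for q in base_quals]  # cap BQ by MQ
    return [q for q in capped if q >= min_bq]


# MQ 20 passes max(10, 17) = 17; base qualities are capped at 20.
kept = usable_base_quals([30, 10, 25], mapping_qual=20, min_mq=10, min_bq=17)
# MQ 15 is below 17, so the read is dropped before any base is examined.
dropped = usable_base_quals([30, 30], mapping_qual=15, min_mq=10, min_bq=17)
```

The early return gives the same result as capping first and filtering afterwards; it only avoids the per-base work for reads that could not pass.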
ebanks
237ab1d489
1. As discussed in group meeting today, because we cap BQ by MQ, if MQ < minBQ then we filter the read.
...
2. Update to UGCalcLikelihoods for Chris: require a vcf bound to 'allele' to be provided so that we know exactly which alternate allele we should be calculating GLs for at each site. The user is warned when the VC is not biallelic or there are multiple records at a site.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4780 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 05:57:06 +00:00
delangel
da6a07ad3b
First round of critical fixes to indel genotyper (more to come tomorrow):
...
a) Avoid complete crash of caller that broke due to a recent refactoring by someone who must not be named <cough>EB<cough>... an integration test to avoid this in the future coming soon.
b) Fixed up strand bias computation for indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4779 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 02:48:09 +00:00
kshakir
c7dbf66d41
Added a javaMemoryLimit option for cases where the java -Xmx memory should be lower than the bsub memory limit.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4778 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 22:38:06 +00:00
fromer
e09d6ee56b
write non-MNP VariantContexts records only once (where they start)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4777 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 22:14:26 +00:00
fromer
1515bf6de9
Merged common VCF writing logic into phasing/WriteVCF.java
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4776 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 22:03:02 +00:00
asivache
4e62de4213
Added method getOriginalReadGroupId(): takes merged (in case of collision) read group id as reported by a read coming from the merged stream and returns this read's read group id as it was listed in the original input bam file.
...
IndelRealigner now uses this functionality to correctly un-mangle read group id's in --nWayOut mode (i.e. when we need to write reads into separate output bams with headers matching the original inputs).
Some hidden changes to IndelRealigner: purely testing and development, transparent to the users (hidden option added)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4775 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 21:41:52 +00:00
rpoplin
e5282742f9
Bug fix in CountCovariates: skip over indel records as well as SNPs in the dbsnp file. CountCovariates is now called CountCovariatesWalker. I've always hated that the name was swapped.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4774 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 18:43:24 +00:00
chartl
670ae814b3
Get rid of files from the grep string
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4773 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 18:39:59 +00:00
rpoplin
0adf505b53
We no longer look at by-hapmap validation status in the VQSR because using the HapMap VCF file is higher quality. As a side effect we now support the dbsnp 132 vcf file. ApplyVariantCuts now requires that the input VCF rod bindings begin with input, matching the other VQSR walkers. Wiki updated with information about how to obtain the hapmap and 1kg truth sets.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4772 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 15:38:45 +00:00
chartl
220fb0c44a
Added a pipeline for merging batches. For now takes a file containing a list of VCFs, and a file containing a list of bams. Does not do anything smart (e.g. if you leave out some .bams or add some extra ones, you will not be warned). Heavy lifting done in (the beginnings of) a library for managing multi-batch or multi-project tasks.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4771 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 07:31:59 +00:00
ebanks
99b942b0b4
Removing duplicated header args
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4770 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 20:16:53 +00:00
chartl
9f03f09cc9
Changes to V2 pipeline and libraries. AB dropped. Cleaning enabled. Project name now properly propagated to intermediate files (instead of the string repr of the object). Indel mask is now expanded prior to filtering at indels.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4769 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 18:55:48 +00:00
fromer
9ac0f98d0d
Fixed bug in retaining proper RefSeq records
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4768 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 18:39:02 +00:00
ebanks
7caf666f48
For Sendu: add a hidden option to allow bams to come out unsorted. We've agreed to let him deal with sorting these puppies on his own.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4767 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:56:13 +00:00
ebanks
3afa841a6a
Fixing docs
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4766 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 17:36:47 +00:00