Commit Graph

3740 Commits (84b6d2926bfea0e95e8de5f94a96267fa1950423)

Author SHA1 Message Date
depristo 84b6d2926b Useful walker that creates a new interval list containing only the intervals that overlap the input sites list. Really a one-off walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4559 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 19:55:04 +00:00
depristo 78b4a1c240 VariantsToTable now supports the virtual TRANSITION field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4558 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 19:53:46 +00:00
hanna e6d61197e6 Disable OTF indexing when writing indices for temporary VCFs when running
with -nt option.  When last I checked in, Ryan was seeing a ~25% speedup 
per shard by not indexing.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4556 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 17:40:37 +00:00
depristo e6b008f87c Fixed >= vs. > test leading to failure to tolerate dynamic indexes that are created at *exactly* the instant the output VCF is closed too
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4555 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 16:11:14 +00:00
ebanks 72c5b75460 Tribble exceptions can be generated outside of the normal codec parsing code because we now lazy load the VCF genotype fields. I'm not sure how else to account for this (to make sure they show up as user errors and not GATK system errors) besides catching them here.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4554 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 15:22:17 +00:00
delangel e24f7fec47 Fixed indel genotyper which broke yet again because we can't just call context.getBasePileup() without checking again for its existence in the first place.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4553 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 15:17:11 +00:00
ebanks c0b4317311 Er, here's the right fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4552 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 15:08:25 +00:00
ebanks 181f901126 Fix for Ryan: don't pull reference sequence for the portions of reads that extend beyond the contig boundaries
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4551 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 14:38:26 +00:00
ebanks 9f76aed515 Fix for IDs 5zP7jJeffK2sdPH1BH4JBVSrQztVEDKP and nX0cuBjoqBW4NQFpM6dE13KpkCuYFpZu
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4550 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 14:05:27 +00:00
hanna d4feb99d9a For parallel ROD traversals, simplified reference sharding. Will replace
with a more sensible strategy for sharding w/o BAMs at some point after
ASHG.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4549 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 05:08:15 +00:00
fromer 9ba7269728 Fixed Integration Tests to output VCF files with -NO_HEADER
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4548 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:49:44 +00:00
fromer 60f88866dd Uses VCFConstants instead of hard-coded constants
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4547 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:49:01 +00:00
fromer 883b8ff80e Removed flush() method from VCFWriter interface; added takeOwnershipOfInner parameter in constructor of wrapper VCFWriters to designate if the Writer should close the inner Writer it receives on construction
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4546 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:48:00 +00:00
fromer 1ea43be976 Removed flush() method from VCFWriter interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4545 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 19:46:42 +00:00
chartl 3566ad2146 Wrong if statement.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4544 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 17:37:45 +00:00
chartl bf17f92b64 Do not look for samples in dbsnp binding
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4543 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 17:36:38 +00:00
ebanks 225cf49128 Implementing reference confidence estimate in UGv2 as per UGv1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4542 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 16:57:59 +00:00
delangel cf9c9ae241 Three important updates for Dindel genotyper:
a) Fix it up because it broke with a recent checkin to annotate vcf with unfiltered depth.
b) Printout of ref/alt alleles in output vcf was incorrect because the start/stop positions of associated GenomeLoc were incorrectly computed in case of a deletion.
c) Redid Beagle input/output walkers so they no longer assume that ref is a single base or that the variant is a VCF, and generalized them to be indel-capable, so now the Beagle walkers can be used for indels as well.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4541 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 16:00:16 +00:00
kshakir b88cfd2939 Updated MD5s of VCFs, since the approximate command line arguments injected into the VCF headers now have a little more order to them thanks to changes in the ParsingEngine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4538 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 03:07:40 +00:00
ebanks 8f38ebf98e Throw a user exception when using the clustered SNP filter in the presence of ref calls. It's unfortunate, but until we get a windowed ROD context this is just too much of a headache to support.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4537 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 02:44:10 +00:00
kshakir 88a0d77433 Changed parsing engine to store the order of the argument bindings based on their definition in the class, moving "-T" to the front of Queue command lines.
Queue GATK generated .intervals is now a List(File) again removing special case handling in the generator.
Instead of using @Scatter annotation, using ScatterFunction instance to determine if a job can be scattered.
Implemented special VcfGatherFunction which only uses the header from the first file, even if the other files differ in their headers.
Added a -deleteIntermediates to Queue to delete the outputs from intermediate commands after a successful run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4536 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 21:43:52 +00:00
ebanks 91049269c2 Optimizations across the board, with help from Guillermo, Matt, and JProfiler. Too tired to give details now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4535 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 20:47:41 +00:00
fromer f76865abbc ReadBackedPhasing now uses a SortedVCFWriter to simplify, and has the ability to merge phased SNPs into MNPs on the fly [turned off by default]; MergeSegregatingPolymorphismsWalker can also do this as a post-processing step; Integration tests for MergeSegregatingPolymorphismsWalker were also added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4534 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 20:27:10 +00:00
fromer e8079399ac Added flush() method to VCFWriters
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4533 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 20:23:22 +00:00
fromer 00726b6c4b Added mergeIntoMNPs to merge successive VCF records into a single MNP VCF [if possible]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4532 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 19:40:26 +00:00
fromer 55230ce5f3 Added startsBefore, startsAfter, and minDistance [calculates distance between any pair of bases in the two GenomeLocs]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4531 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 19:12:34 +00:00
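The three helpers named in the commit above can be illustrated with a minimal sketch. It assumes closed, 1-based intervals on a single contig; the `Loc` type and the snake_case function names are hypothetical stand-ins, not the GATK GenomeLoc API, and the "0 when overlapping" convention is an assumption.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    """Hypothetical stand-in for a genomic interval (closed, 1-based)."""
    contig: str
    start: int
    stop: int


def starts_before(a: Loc, b: Loc) -> bool:
    # Assumption: both intervals are on the same contig, so only starts are compared.
    return a.start < b.start


def starts_after(a: Loc, b: Loc) -> bool:
    # Mirror of starts_before under the same assumption.
    return a.start > b.start


def min_distance(a: Loc, b: Loc) -> int:
    """Smallest distance between any base of a and any base of b.

    Returns 0 when the intervals overlap (an assumed convention).
    """
    if a.contig != b.contig:
        raise ValueError("intervals are on different contigs")
    if a.stop < b.start:      # a lies entirely to the left of b
        return b.start - a.stop
    if b.stop < a.start:      # b lies entirely to the left of a
        return a.start - b.stop
    return 0                  # the intervals share at least one base
```

The closest pair of bases is always the facing endpoints of the two intervals, which is why no loop over positions is needed.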
ebanks 4f77581087 More optimizations for HaplotypeScore: pulling final constants out of loops
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4530 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 17:40:57 +00:00
hanna 20fac43521 Add extra logging to the GATK run report at the start of metrics aggregation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4529 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 17:32:51 +00:00
ebanks a205900eff Naughty use of Strings in HaplotypeScore literally doubled the runtime of Unified Genotyper. Moved over to bytes and no longer allow Strings in the Haplotype util class. New round of profiling on tap for tomorrow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4528 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 03:32:21 +00:00
depristo f9541b78d3 Timing of traversal now starts at the start of the traversal, so the rate is reasonable right off the bat. For example, we now see: INFO 22:45:02,476 TraversalEngine - [TRAVERSAL STARTING]; INFO 22:45:32,484 TraversalEngine - [PROGRESS] Traversed to 2:50850686, processing 18,646 sites in 30.05 secs (1611.50 secs per 1M sites)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4527 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 02:47:34 +00:00
depristo f7ce18553e GenotypeConcordance now prints interesting sites more nicely. RMDTrackBuilder now uses the root class FeatureSource, not BasicFeatureSource.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4525 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 00:29:02 +00:00
ebanks 7a291a8ff3 First pass at a VCF validator. Will test more tonight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4524 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 19:55:49 +00:00
chartl 341e93ee12 The reference fixer seems to have munged the OMNI rather than making it better. Looks like some sites need to only have the ref and alt bases swapped, and others need to have the genotypes swapped as well? E.g.
some subset need
A  C  1/1   -->  C  A  0/0

while another subset need
A  C  1/1   -->  C  A  1/1

it's unclear how big these subsets are (or even if one is empty). What I do know is, doing the first one totally screws up concordance metrics for the 421-sample chip. So either something else needs to be done, or there's a bug in this walker. Until I know for sure, I've added an initialize exception to disable this thing...



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4523 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 12:50:24 +00:00
ebanks 5251f49a90 Including Marian Thieme's BaseCounts class (with some modifications)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4522 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 03:07:30 +00:00
hanna c5f105d050 Fix boneheaded mistake in the new interval filtering code I added on Sunday.
Sorry everyone.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4521 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-19 01:20:12 +00:00
ebanks 524cb8257c Renaming for accuracy
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4519 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 18:11:07 +00:00
ebanks 0fe504b748 Use filtered depth for Exact model (just like grid search)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4518 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 18:08:31 +00:00
ebanks d54d9880d7 Now that G's new genotyping algorithm is live, I've cleaned up the code to completely separate the grid search from the exact model. AlleleFrequencyCalculationModel is now completely abstract.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4517 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 18:04:06 +00:00
ebanks 80e5ac65b4 CAP_BASE_QUALITY needs to be included in the clone() method for it to be usable in UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4516 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 03:11:03 +00:00
hanna 6af9532090 Fix for GATK slowdowns at the ends of intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4514 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 23:21:23 +00:00
chartl 5889138f4a *facepalm*
forgot to add the samples to the header. How could the VCFWriter let me get away with something so boneheaded?!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4513 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 05:36:29 +00:00
chartl 2bc5971ca1 Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by Monday, I'm doing it.
Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. 

**IMPORTANT** I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However, this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC as being one class but recalculating it casts it to another.

I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 03:18:01 +00:00
ebanks 7aa030a9a4 Hmm. Apparently variants can get lifted over to different chromosomes. Who knew? Reverting changes from a couple of days ago. The only way to do this correctly (without requiring lots of memory) is to turn off on-the-fly indexing for this walker. Integration tests cover this now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4510 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 02:54:12 +00:00
chartl 8b2d387643 Added in an eval module that calculates the dispersion histograms between eval and comp (e.g. M_{i,j} = # of times eval observed to have AC i, comp AC j -- for af it's i/100 vs j/100 )
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4507 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 19:07:43 +00:00
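The dispersion matrix described in the commit above, M_{i,j} = number of sites where eval has AC i and comp has AC j, can be sketched as a sparse counter. This is an illustration only: the function names, the dict-of-sites input, and the choice to count only sites present in both tracks are assumptions, not the eval module's actual behavior. For allele frequency, the commit's "i/100 vs j/100" is read here as binning to the nearest hundredth.

```python
from collections import Counter
from typing import Dict


def af_bin(af: float) -> int:
    # Assumption: frequencies are binned to the nearest 1/100, so bin i means i/100.
    return int(round(af * 100))


def dispersion_counts(eval_ac: Dict[str, int], comp_ac: Dict[str, int]) -> Counter:
    """Return M where M[(i, j)] counts sites with eval AC == i and comp AC == j.

    Assumption: only sites present in both inputs contribute to the counts.
    """
    m: Counter = Counter()
    for site in eval_ac.keys() & comp_ac.keys():
        m[(eval_ac[site], comp_ac[site])] += 1
    return m
```

Mass on the diagonal (i == j) indicates agreement between the two call sets; off-diagonal mass shows how far the allele counts disperse.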
ebanks f78ff08e2b This is less correct than my previous change but it's what UGv1 does and now is not the right time to start mucking with things.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4506 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 18:56:45 +00:00
ebanks 471c18054f Fix for SB calculation: the best overall AF might not have any mass when just looking at reads from a single strand. We need to compute the best AF for each stratification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4505 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 17:51:18 +00:00
asivache 42c3d74432 bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4503 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 16:27:40 +00:00
chartl c9d473edee More changes to Variant Eval and Genotype Concordance (passes all integration tests):
1: -sample can now include a file, which will be parsed for sample-name entries
2: If you request a sample to run analysis on, but it is not present in any of your RODs, VEW will exception out
3: Change added to parse Integer, String, and List<Integer> type Allele Count annotations (error otherwise)
4 [slightly problematic]: The count objects now maintain row-keys in order, as the keys were taking an inordinate amount of time in onTraversalDone (multiple calls to getRowKeys(), so many multiple sorts of the same underlying unsorted object, very bad)

There is a legacy comparison object which is unused which I will strip out soon.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4502 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 12:40:36 +00:00
ebanks 954dd84f51 Adding an integration test (against hg18 this time) that requires on-the-fly sorting in order to work properly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4500 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 07:45:21 +00:00
ebanks 9f54170dff Hooking up the liftover tool to the new on-the-fly sorting VCF writer so that records can now get emitted in order.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4499 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 07:27:01 +00:00