Commit Graph

751 Commits (d7f3102c3f31a2550b5a74a18a902af158c7fec2)

Author SHA1 Message Date
ebanks 7a91dbd490 Renamed some of the column names in Ti/Tv and Concordance modules so that they are clearer. Removed ValidationRate module (it was busted).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3564 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 15:53:06 +00:00
asivache 42b8a8f295 slight change in output format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3559 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-15 14:52:04 +00:00
asivache 9666d47d17 ooops, debug print now removed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3550 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 18:07:12 +00:00
asivache 4ab1f440c3 A new argument: --targetIntervalsSorted (boolean flag). If specified, the interval file is assumed to be sorted (duh!) and it is NOT slurped into the memory but instead traversed directly on disk as needed. If the file turns out to be unsorted, an exception will be thrown at the point where inconsistency occurs (can be late into the processing!).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3547 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 16:00:22 +00:00
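The trade-off described for --targetIntervalsSorted (stream a sorted file from disk instead of loading it, and fail only when an out-of-order record is actually reached) can be sketched roughly as follows. This is a minimal illustration, not the GATK implementation: the class name SortedIntervalReader and the one-interval-per-line "start-end" format are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: reads "start-end" intervals one line at a time, so the
// file is never held in memory, and reports a sort violation only at the line
// where it is first observed.
final class SortedIntervalReader {
    private final BufferedReader reader;
    private long previousStart = Long.MIN_VALUE;

    SortedIntervalReader(BufferedReader reader) {
        this.reader = reader;
    }

    /** Returns the next {start, end} pair, or null at end of input. */
    long[] next() {
        final String line;
        try {
            line = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (line == null) {
            return null;
        }
        String[] parts = line.trim().split("-");
        long start = Long.parseLong(parts[0]);
        long end = Long.parseLong(parts[1]);
        // The check is lazy: earlier intervals have already been handed out,
        // which is why the failure can come late in processing.
        if (start < previousStart) {
            throw new IllegalStateException("Interval file is not sorted at: " + line);
        }
        previousStart = start;
        return new long[] {start, end};
    }
}
```

With input lines 10-20, 30-40, 5-8, the first two calls succeed and the third throws, mirroring the "can be late into the processing" caveat in the commit message.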
hanna c3b68cc58d Rethinking DownsamplingLocusIteratorByState with a flattened read structure. Samples are kept
independent while processing, and only merged back in a priority queue if necessary in a special
variant of the ReadBackedPileup.  This code is not live yet except in the case of naive deduping.
Downsampling by sample temporarily disabled, and the ReadBackedPileup variant is sketchy and
not well integrated with StratifiedAlignmentContext or the walkers.  Cleanup to follow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3540 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-13 01:47:02 +00:00
ebanks 8c28be5933 Fixing a VCF bug for Sendu: we weren't emitting flags (booleans) correctly in VCF3.3 (rev'ed tribble for this).
Updated dbsnp/hapmap membership info fields to be flags now instead of ints.
While I was there, I added the change in the Annotator for Jan to force reads to be from a specific sample.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3536 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 16:42:06 +00:00
ebanks 22620ba95c Adding "abi_solid" to the list of known platforms.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3534 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 13:37:19 +00:00
ebanks ca4eab1d23 Now annotations that require reads return null if there's no alignment context, so that running without reads adds annotations only for the appropriate fields.
Added an integration test for the read-less case.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3525 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 20:36:46 +00:00
aaron 4f00e265a8 quick update for a change I implemented for Ryan
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3519 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:23:31 +00:00
aaron ad98512f6c adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and do not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:21:12 +00:00
ebanks 9b2fcc4711 Refactoring of the annotation system:
1. VA is now a ROD walker so it no longer requires reads (needs a little more testing)
2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs)
3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong.  Fixed the headers too.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:05:51 +00:00
chartl 5ed2818ffb Forgot to commit code I relied upon
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3503 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 21:01:35 +00:00
hanna c1ecf75dd5 Update to the latest rev of the picard sharding patch. Includes updates reflecting
the imminent move of IlluminaUtil into picard public.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3493 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-06 20:33:21 +00:00
depristo 3ea506fe52 No more new Allele() -- must use create. All simple alleles are now cached for efficiency reasons. VCF4 codec optimizations -- 4x performance in general. Now working in general but hooked up to the ROD system now as VCF4. WARNING -- does not actually work with indels, genotype filters, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3489 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 23:03:55 +00:00
depristo e2b41082af GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integrationtest data. Note that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 22:26:32 +00:00
asivache f0c379dde8 Inconsequential changes in report formatting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3479 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 17:43:25 +00:00
weisburd 09c3b15af3 Implemented joins
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3476 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 16:28:06 +00:00
rpoplin 290771a8c2 Automatic cutting of recalibrated variant calls using ApplyVariantCuts. VariantRecalibrator produces the tranches plot alongside the optimization curve. Specify the levels using -tranche 1.0 -tranche 5.0 etc
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3472 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 15:03:00 +00:00
ebanks 4a555827aa Removing more toUpperCase sanity checks
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3471 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 14:38:39 +00:00
ebanks 56e504789a trivial change: toUpperCase no longer necessary
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3470 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 14:00:47 +00:00
rpoplin 87fe60fe4f Fix for Sendu. new Process and p.waitFor() don't seem to work on his farm. Throws an IOException. This was a problem way back with AnalyzeCovariates too.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3469 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 11:37:10 +00:00
ebanks 7f0c638653 Fix for the indel cleaner: I forgot to "unclip" the cigar string (even though the clipped bases were removed) before using it as an alternate consensus in a particular instance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3468 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-01 02:07:20 +00:00
depristo 2b02324587 Support for detecting and automatically excluding reads reading into the adaptor sequence and, if desired, also only showing the first pair when two reads overlap in the fragment. Not enabled; this is an intermediate check-in before updating and verifying the impact on locus walkers everywhere.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3465 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-30 18:00:12 +00:00
ebanks eb25e41111 minor update to new tribble name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3462 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 20:23:25 +00:00
ebanks ffeb3fd80d Thanks to Guillermo, I found a bug in the Unified Genotyper output: GL was posteriors instead of likelihoods. Not a huge deal because the
priors were flat, but fixed nonetheless.
Also, needed to update Tribble.
Minor updates to the Beagle input maker.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3461 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 19:28:26 +00:00
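Why flat priors made the GL mix-up low-impact can be seen from Bayes' rule. The notation below is an assumption introduced for illustration (G a genotype, D the read data, c the constant prior), not taken from the Unified Genotyper code:

```latex
P(G \mid D) = \frac{P(D \mid G)\,P(G)}{\sum_{G'} P(D \mid G')\,P(G')}
\qquad\text{and, with a flat prior } P(G) = c:\qquad
P(G \mid D) = \frac{P(D \mid G)}{\sum_{G'} P(D \mid G')}
```

Under a flat prior the posterior is therefore the likelihood rescaled by a factor that is the same for every genotype, so the ranking and the ratios between genotypes are unchanged; only the normalization differs.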
rpoplin 522dd7a5b2 Adding the variantrecalibration classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3459 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:21:27 +00:00
aaron 871cf0f4f6 Call out ROD types by their record type, instead of the codec type (which was clumsy). So instead of:
@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFCodec.class))

you'd say:

@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFRecord.class))

Which is more in-line with what was done before.  All instances in the existing codebase should be switched over.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3457 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:52:44 +00:00
depristo cc2bf549c8 Removing my unnecessary optimization. 10 lines later in the code the same optimization was applied. A monumental waste of time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3455 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:10:48 +00:00
depristo 6485e8383d Trivial change to retrigger broken build that really isn't broken
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3453 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 23:33:46 +00:00
depristo f2e7582cfc Reorganization of SW code for clarity. Total failure at raw optimization. Discovered that ~50% of reads being cleaned were perfect reference matches. New code comes with flag to look at NM field and not clean perfect matches. Can be turned off with command line option (needed for 1KG bams with bad NM fields). Going to rerun cleaning jobs due to accidental rebuilding of stable codebase and loss of 2 days of runtime.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3452 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 23:16:00 +00:00
ebanks e2674671e7 The liftover code needs to *hard filter* records whose reference changes (since they no longer adhere to the VCF spec as they don't match the new reference - and can't be converted to VariantContext).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3448 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 19:22:47 +00:00
depristo dfc36c1e95 Restructuring of the mandatory read filters for traversals. Now everything uses ReadFilters, even for the required filters like being mapped for LocusWalkers. Statistics now tracked for each read filter used during the traversal and info emitted in INFO at the end.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3445 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 22:12:25 +00:00
chartl f9efc1248c VariantEvalWalker now takes indels if you throw the -dels flag. IndelLengthHistogram appears to be working properly, it is turned off by default (as it is experimental) but you can turn it on in your own repository.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3443 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 20:03:14 +00:00
chartl 0265199ce4 First pass at an IndelLengthHistogram module for variant annotator. Off by default. Will be tested shortly (have to commit, so I can check out in another directory, so that compiling won't kill all my jobs running on LSF)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3440 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 15:04:39 +00:00
depristo 5928047d8b Optimization of reference window calculation to use bytes, not chars, and no uppercasing since reference and read bases are always uppercase now. Should remove some ~5% of runtime of UG.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3438 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 14:10:26 +00:00
chartl 88a06ad81f Changes to Depth of Coverage:
- For speedup in large number of samples, base counts are done on a per read group level, then
   merged into counts on larger partitions (samples, libraries, etc)
   + passed all integration tests before next item
- Added additional summary item, a coverage threshold. Set by (possibly multiple) -ct flags,
   the summary outputs will have columns for "%_bases_covered_to_X"; both per sample, and
   per sample per interval summary files are affected (thus md5s changed for these)

NOTE:

This is the last revision that will include the per-gene summary files. Once DesignFileGenerator is sufficiently general, and has integration tests, it will be moved to core and the per-gene summary from Depth of Coverage will be retired.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3437 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 03:39:22 +00:00
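The speedup described in the first bullet above (count bases once per read group, then merge into larger partitions such as samples or libraries) can be sketched as below. This is only an illustration under stated assumptions: the class PartitionCounts, the method mergeBySample, and the fixed-length count arrays are hypothetical and not the Depth of Coverage code, and every read group is assumed to have a sample mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: base counts are accumulated per read group during the
// traversal, and coarser partitions are derived afterwards by summation.
final class PartitionCounts {
    /** Sums per-read-group count arrays into one count array per sample. */
    static Map<String, long[]> mergeBySample(Map<String, long[]> countsByReadGroup,
                                             Map<String, String> sampleOfReadGroup) {
        Map<String, long[]> countsBySample = new HashMap<>();
        for (Map.Entry<String, long[]> entry : countsByReadGroup.entrySet()) {
            String sample = sampleOfReadGroup.get(entry.getKey());
            long[] readGroupCounts = entry.getValue();
            // Create the per-sample accumulator on first use.
            long[] total = countsBySample.computeIfAbsent(
                    sample, s -> new long[readGroupCounts.length]);
            for (int i = 0; i < total.length; i++) {
                total[i] += readGroupCounts[i];
            }
        }
        return countsBySample;
    }
}
```

The same merge step could be reused for any other partition (for example libraries) by passing a different read-group-to-partition map, which is the point of counting at the finest level first.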
ebanks 772f558ae0 Massive change to the indel realigner code. We now properly deal with soft-clipped reads. Also, improved left-alignment code.
Small change for Ryan to get hard-clipped reads working for the recalibrator.

PLEASE DO NOT RELEASE THIS WEEK.  I still have some more testing to do and need Mark to run WG jobs.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3430 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 20:04:33 +00:00
delangel a280a0ff0d a) Made HaplotypeScore default annotation. This changed several integration tests, whose MD5 is now updated.
b) Disabled BaseQualRankSumTest, the returned p-values differ wildly from Matlab/R-provided ones, cause TBD.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3419 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 22:25:17 +00:00
chartl 7fb3f2d3eb Annotator now buffers indel calls (prevents double-output from double-calls to map)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3413 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 16:34:34 +00:00
chartl 4e834b5e35 VFW now uses a ref window and thus is compatible with indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3412 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 15:59:42 +00:00
chartl 88cb93cc3c Changes to Depth of Coverage (added maximum base and mapping quality flags; with new integration tests -- because they use b36, and the other test uses hg18, it's in a different class (integration test system can't change refs on the fly)). Initial change to VariantAnnotator to allow it to see extended event pileups; you currently have to throw the -dels flag; and it's specified as "very experimental". Yet, all the integration tests pass.
Homopolymer Run now does the "right" thing (e.g. single bases are represented as HRun = 0 rather than HRun = 1) for indels. AlleleBalance now does something close enough to correct.

Added a convenience method to VariantContext that will return the indel length (or lengths if a site is not biallelic).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3409 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 13:02:01 +00:00
depristo 6faf101c6c Minor improvements to Callable Loci for public consumption
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3408 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 12:50:11 +00:00
depristo a10fca0d5c Genotyper now is using bytes not chars. Passes all tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3406 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 21:02:44 +00:00
depristo 727822adb4 BaseUtils has a clearer distinction between byte and char routines. All char routines are @Deprecated now. Please use bytes. Better organization of reverse(), now in Utils not BaseUtils.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3400 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 14:05:13 +00:00
depristo 6ce3835622 Removing unused methods in QualityUtils; ReferenceContext now converting all bases to upper case, but can be disabled with static boolean
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3399 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 12:38:06 +00:00
depristo 5abac5c057 A few more char -> byte cleanups
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3398 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 00:02:06 +00:00
depristo 8a725b6c93 Restructuring of ReferenceContext and ReadWalkers to accept a ReferenceContext. Now ReferenceContext is byte[] backed not char[]. Please no more chars for the reference. All of the tests pass now. Coming check-ins are going to clean up the char / byte problems in the GATK
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3397 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 23:27:55 +00:00
hanna 017ab6b690 Experimental versions of downsampler and Ryan's deduper are now available either
as walker attributes or from the command-line.  Not ready yet!  Downsampling/deduping 
works in a general sense, but this approach has not been completely optimized or validated.
Use with caution.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3392 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-19 05:40:05 +00:00
chartl 635f61c22d Clone the other guy too
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3381 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 18:56:01 +00:00
chartl eb200e4cce Hrumph. Don't just add pointers to the same objects, actually clone the underlying arrays.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3379 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-18 17:13:44 +00:00
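The bug class behind the two cloning commits above (copying a container but still sharing the arrays inside it) can be illustrated with a small Java sketch. The class CountsCopy and method deepCopy are hypothetical names for the example; the affected GATK class is not identified in the log.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: copying only the map leaves both maps pointing at the
// same arrays; cloning each array gives the copy its own storage.
final class CountsCopy {
    /** Copies the map and clones every array value, so no mutable state is shared. */
    static Map<String, int[]> deepCopy(Map<String, int[]> source) {
        Map<String, int[]> copy = new HashMap<>();
        for (Map.Entry<String, int[]> entry : source.entrySet()) {
            copy.put(entry.getKey(), entry.getValue().clone());
        }
        return copy;
    }
}
```

By contrast, new HashMap<>(source) copies only the references, so a later write to one of the original arrays is visible through the copy as well.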