Commit Graph

1155 Commits (0e9a6826b0d3e3724d3ce095a5205f4efbc3bbd0)

Author SHA1 Message Date
ebanks c6f6948f9d Haiku:
Eric is a fool.
Matt found his really dumb bug.
Eric is humbled.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2830 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-12 04:51:56 +00:00
ebanks 96fee7cf7a Disabling input of known indels for use as alternate consensuses. When we get rods in a read traversal, it will be trivial to hook it into the cleaner (the code is already there).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2825 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-11 15:52:21 +00:00
ebanks a4a2c9b172 Deal with bad input; also N-way out isn't default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2823 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-11 03:44:56 +00:00
hanna dc885ba386 Fix for some correctness bugs found during early performance testing, phase 1.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2822 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-10 22:32:25 +00:00
depristo c66861746a improvements to VE2, including more meaningful mendelian violation counting. Support for VCF emitted interesting sites, annotated according to the evaluations themselves. Basic integration test for VE2 started
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2819 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-10 16:12:29 +00:00
rpoplin 0b1e243a7b CountCovariates now sorts the list of standard covariate classes coming from PackageUtils.getClassesImplementingInterface(). As a result some of the integration tests now make use of -standard
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2817 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-10 15:52:20 +00:00
ebanks 6652b992f7 The new cleaner can now use known indels to create alternate consensuses for cleaning.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2816 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-10 04:39:15 +00:00
hanna 0250338ce7 Basic use cases for merging BAM files with the new sharding system work.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2815 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-09 22:14:37 +00:00
depristo 934d4b93a2 VariantContext to VCF converter. BeagleROD, and phasing of VCF calls. Integration tests galore :-)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2814 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-09 19:02:25 +00:00
depristo 94f892ad42 VCF->beagle and VCF phasing using beagle input. Appears to work fairly well. VariantContexts now support phased genotypes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2812 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-09 01:22:05 +00:00
depristo 457568485a simple Beagle input ROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2811 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-09 01:21:04 +00:00
chartl 935e76daa1 Minor changes to oneoff walkers. PlinkRod altered but still commented.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2808 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-08 18:49:56 +00:00
ebanks 4fe851a83d Optimization: don't keep scoring an alternate consensus if it's already worse than the best alt seen so far.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2806 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-07 05:06:32 +00:00
ebanks ca1917507f Various improvements and fixes:
In indel cleaner:

1. allow the user to specify that he wants to use Picard’s SAMFileWriter sorting on disk instead of having us sort in memory; this is useful if the input consists of long reads.

2. for N-way-out mode: output bams now use the original headers from the corresponding input bams - as opposed to the merged header.  This entailed some reworking of the datasources code.

3. intermediate check-in of code that allows user to input known indels to be used as alternate consensuses.  Not done yet.

In UG: fix bug in beagle output for Jared.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2805 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-07 04:21:04 +00:00
depristo 3b1ab86d11 Added generic interfaces to RefMetaDataTracker to obtain VariantContext objects. More docs. Integration tests for VariantContexts using dbSNP and VCF. At this stage if you use dbSNP or VCF files only in your walkers, please move them over to the VariantContext, it's just nicer. If you've got RODs that implemented the old variation/genotype interfaces, and you want them to work in new walkers, please add an adaptor to VariantContextAdaptors in refdata package. It should be easy and will reduce burden in the long term when those interfaces are retired.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2803 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-06 16:26:06 +00:00
depristo 33760834d6 commented out inactive (due to string ==) but actually incorrect code. Sometimes two wrongs do make a right
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2801 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-06 16:22:26 +00:00
hanna c7e006a996 Bug fixes for interval batching in sharding system. Sharding system now batches intervals and passes
basic tests for small and large intervals and intervals that cross bin boundaries.  Currently works
only with a single BAM file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2800 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 21:47:54 +00:00
asivache a1d5a384f4 Reverting the last reversal. bestConsensus points to something also kept in a set, so just reassigning it will NOT automatically destroy the underlying data; explicit clearing of unneeded data reinstated. STUPIDO!!!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2796 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 18:08:53 +00:00
asivache cf7e6d0c0b Memory-saving change, same as in old IntervalCleaner (if alt consensus does not beat the best one, destroy its data immediately)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2795 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 18:05:04 +00:00
asivache df0be25afb ooops, no need to destroy old best's data explicitly, it will be done automatically of course
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2794 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 18:03:16 +00:00
asivache 9f44018b7d Reducing memory footprint: if alt consensus does not beat the best alt observed so far, destroy its data immediately, instead of keeping them around. If new alt is better than the old best, then destroy the old best right away instead.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2793 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 17:58:54 +00:00
rpoplin be33d1852c Reverting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2792 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 15:57:09 +00:00
depristo af8c47fc2f Fixing up testVariantContext for integration tests for variant context. Printing of VCs and genotypes now stable using sorting. Cleaned up comments in quality score by strand. RefMetaDataTracker now directly allows walkers to obtain VariantContexts using the simple Collection<VariantContext> getAllVariantContexts(GenomeLoc curLocation, EnumSet<VariantContext.Type> allowedTypes, boolean requireStartHere, boolean takeFirstOnly) function. VCF and dbSNP VariantContexts now officially supported. Other important types can be added to the adaptor system in refdata package. Integration tests later today
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2791 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 15:42:54 +00:00
rpoplin 0d8d6e0a14 Ti/Tv module in VariantEval shows known and novel ratios if possible
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2790 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 15:37:40 +00:00
depristo c6d86da4b8 almost managed to move things around perfectly in one go
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2788 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 14:18:26 +00:00
hanna e53432d54d Checkpoint for combining adjacent intervals into the same shard.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2782 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 02:48:02 +00:00
asivache 0d347d662a More plumbing: if, after the shift, the window contains indel(s) at the first position, do not throw an exception; just print a warning (we cannot deal with this situation!!) and discard those indels without trying to call them. This situation will most probably arise after a forced shift over a messy region anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2781 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-04 21:06:28 +00:00
asivache e7b710791f OK, we finally ran into a messy dataset where we can not find a place to shift the window to: there's an indel at every position. Don't panic, don't throw an exception, just ignore the whole window completely, we do not want to call there.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2779 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-04 19:49:56 +00:00
ebanks 83b9d63d59 1. Added functionality to the data sources to allow engine to get mapping from input files to (merged) read group ids from those files.
2. Used said mapping to implement N-way-in,N-way-out functionality in the new indel cleaner.  Still needs more testing (to be done after vacation but preliminary tests look good).
3. Fixes to VCF validator: ignore case when testing VCF reference base against true reference base and allow quals of -1 (as per spec).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2773 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-04 04:12:49 +00:00
hanna 3f35e181d5 Add an alternate implementation of the BAM file reader that keeps the entire index in memory. Initial revision of BAMFileStat, a tool to inspect BAM file BGZF blocks and index entries.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2769 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-03 19:48:15 +00:00
hanna 9dbdfff786 Moved VariantEval to core. Updated integration test md5s to reflect new Analysis class names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2762 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-02 00:22:15 +00:00
ebanks 506d39f751 The UG calculations are now driven by an independent engine.
This completely separates the genotyper walker from other walkers.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2758 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 20:57:31 +00:00
hanna d8e75cf631 Fix for Kiran's memory issue running UG...turned out to be a particularly bad interaction between @By(Reference) traversals and TreeReduce.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2757 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 20:27:06 +00:00
asivache 990af3f76e Will now work with simplest tabular format - genotype string ("+ACTT") does not have to be followed by ':'
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2755 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 15:40:01 +00:00
ebanks e0808e6c37 Moved old EM model to archive
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2754 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 02:55:32 +00:00
ebanks f6da57dc79 1. For Matt: JIRA GSA-270. Other walkers needing to call into the Unified Genotyper now use static methods (e.g. runGenotyper()) instead of calling initialize and map.
2. Set the default confidence cutoff to 50 (instead of 0).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2752 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-31 21:14:57 +00:00
ebanks ce9d3dcefb Removing deprecated version of indel genotyper (putting it in archive in case we need to reproduce original 1KG indel calls for some reason).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2749 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-31 14:05:36 +00:00
depristo 3d45457595 VariantEval2 test framework implemented; Kiran is experimenting with the system. Not for use by anyone else. VariantContext appears to work well; I'll release it next week for general use following docs of the functions. Removing newvarianteval and other classes to avoid any future confusion. Update to TraverseLoci and RodLocusView to simplify a few functions and to correct some minor errors. All tests pass without modification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2748 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-30 20:51:24 +00:00
jmaguire ea7e737441 Two new annotations:
1. LowMQ: fraction of reads at MQ=0 or MQ<=10.
2. Alignability: annotate SNPs with Heng's (or anyone else's) alignability mask.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2746 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 23:23:00 +00:00
chartl 97f60dbc4b Moving stuff around. ( core;playground ) ----> ( oneoffs ). I've been a bad boy, sullying the core codebase.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2745 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 22:50:03 +00:00
rpoplin c6cc844e55 Added -name argument to AnalyzeAnnotations that allows one to specify the name of the annotation to be used on the plots. Instead of seeing AB and DP, one can add -name AB,AlleleBalance -name DP,Depth
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2742 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:48:53 +00:00
depristo 62a80f2b6f fixed out of date tests. Also, tests uncovered a subtle bug in new implementation that was also fixed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2741 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:03:48 +00:00
rpoplin 4f29a1d4f6 AnalyzeAnnotations now plots true positive rate instead of percentage of variants found in the truth set. Committing GCContentCovariate to help people experiment with correcting the pilot3/Kristian base calling error mode in slx.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2740 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:01:56 +00:00
hanna e7f5c93fe5 Cleaning up the inheritance hierarchy from the previous commit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2738 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 19:13:36 +00:00
depristo 9decd20f46 Fix to priors to allow lower het values for mouse guys; no integration test changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2734 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:36:12 +00:00
chartl d57a86ad41 Not nearly as badass as it looks. The problem I mentioned yesterday with "bleeding in" of samples comes from VCFUtils and SampleUtils looking for all VCF-class RODs in the tracker, and stealing the name from them. I have introduced a new HapmapVCF - type rod for use
when you want to protect your VCF header from being infected by the samples in a bound hapmap VCF. Changes are as follows:

VCFRecord - minor change to adapt isNovel() to the case where the dbsnp ID field is empty, but the info field has DB=1

HapmapVCFRod - introduced for the reason at the top

RODRecordIterator - was: catch ( Exception e ) { throw new StingException("long ass message") }
                 is now: catch ( Exception e ) { throw new StingException("long ass message",e) }
                    to permit full stack ejaculation.

RodVCF - Now with more brackets!

ReferenceOrderedData - registering HapmapVCF as a bindable string

VariantAnnotator - There's an extra space on a line. And some new brackets.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2733 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:19:50 +00:00
depristo 5aaf4e6434 VariantFiltration now accepts any number of --name --filter expressions, and annotates the VCF file with each name that matches. Very useful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2732 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 12:13:08 +00:00
ebanks 01e73fc39e Yuck - Picard's SAMRecord Comparator only deals with mapped reads. Adding an extended version that works for all reads.
After adding some more minor changes to the new realigner it now gets the same exact results as the original version - except that sometimes it doesn't clean when it shouldn't!
More testing coming.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2731 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 07:49:47 +00:00
hanna 3d922a019f Basic support for very simple index-driven locus traversals. Interface has been changed to
support batched intervals in a single shard, but intervals are not yet compressed into a single
shard.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2730 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 03:14:26 +00:00
asivache 4810e9c9cd And now the DOCS!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2729 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 23:21:33 +00:00
asivache 40262e2070 Now calls single-sample indels too, with all the V2 level stats and bells. This officially obsoletes IndelGenotyperWalker (V1). In addition, the alignments spanning beyond the contig end are now completely ignored (with a user warning), this applies to both single-sample and paired (somatic) calls. You just wait, Eric, I'll get you the docs with the next commit!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2728 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 22:28:02 +00:00
rpoplin 79c4cc1db7 AnalyzeAnnotations now breaks out titv by calls in hapmap and also plots true positive rates. Any ROD passed in whose name starts with 'truth' is considered to be the truth set.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2726 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 21:41:23 +00:00
chartl 8de6a8d246 Lots of changes; all to do something relatively minor.
1) Changed VCF/RodVCF to allow for inquiries to whether or not the site is novel; isNovel() looks at the ID field, and those members of the info field that indicate membership in dbsnp, hapmap2, or hapmap3; and if none can be found, returns true.

2) Changed VariantAnnotator to annotate hapmap2 and hapmap3, if you bind rods to it with those names. Works in the same way as DBSNP does -- if you give it a rod named "hapmap2" it'll annotate membership in it. -- Passes integration tests

3) Changed UnifiedGenotyper to do the same thing (since it uses Annotations as a subroutine) -- Passes integration tests

4) Changed MultiSampleConcordanceWalker to take a flag --ignoreKnownSites (or -novels) to examine concordance only on sites that are not marked as in dbSNP or in Hapmap in the variant VCF

5) Changed VCFConcordanceCalculator (the object MultiSampleConcordanceWalker runs on) to output Concordant_Het_Calls and Concordant_Hom_Calls separately, rather than combined as Concordant_Calls

6) AlleleBalanceHistogramWalker -- I don't know what I did to this thing. I've been jerry rigging System.outs to do stuff it was never really intended to do; so there's probably some dumb System.out.print("HI I AM AT LOCUS:"+loc) stuck somewhere. It compiles at any rate.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2724 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 21:06:56 +00:00
ebanks 6f11fe442a Sync with Andrey's changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2723 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 20:49:38 +00:00
asivache db429e1096 Some alt consensuses may have cigar string starting with an insertion. Not a bug, strictly speaking, since the cleaner had been detecting this and crashing deliberately. Now it knows how to deal with this special case though. Also, uppercase the ref before using it in SW aligner!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2722 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 18:53:02 +00:00
depristo 9876645a5d Now drives the walker by reference, not by reads, so we see even loci with no reads. This allows us to accurately calculate the true total callable area
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2720 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 11:12:46 +00:00
ebanks 1dd9996f3a New realigner now completely uses bytes, plus misc fixes. Still not ready for use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2719 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 04:17:20 +00:00
asivache bd11060e72 Ups, I did it again. Fixing the bug introduced in a previous commit: use correct length of the indel event.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2713 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 21:51:54 +00:00
ebanks fddca032bb Initial commit of v2.0 of the cleaner. DO NOT USE. (this means you, Chris)
Cleaned up SW code and started moving over everything to use byte[] instead of String or char[].

Added a wrapper class for SAMFileWriter that allows for adding reads out of order.

Not even close to done, but I need to commit now to sync up with Andrey.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2712 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 21:36:42 +00:00
rpoplin fc4285f9fd AnalyzeAnnotations seems to be popular so I've rewritten the guts to be easier to extend and maintain.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2707 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 19:30:31 +00:00
hanna fa3589e5c5 Update our error messages to point to getsatisfaction.com/gsa.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2706 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 19:16:28 +00:00
asivache 3edcefb7fb add _gI and _gD to the indel probe names according to the spec (in the hope that wiki is not obsolete); added optional cmd line param -project_id to prefix all probe names with.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2704 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 17:06:49 +00:00
depristo 40c242d2b8 Fix for overflow issues
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2702 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 13:37:16 +00:00
aaron 8453676b71 added a method to AlignmentContext called hasExceededMaxPileup, which you can use to determine if the current site exceeded the maximum pileup size (reads were dropped). Added this as a check to unified genotyper according to Eric's instructions, and added the plumbing to the engine.
Also deleted the FixBamSortOrder package that isn't used anymore.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2701 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 05:17:01 +00:00
ebanks 476d6f3076 RealignerTargetCreator is officially live
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2697 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 03:41:52 +00:00
asivache 1f64c5d41a Do not slurp the whole set of snp mask sites into memory (gets pretty heavy on full dbSNP!); instantiate a private ROD iterator instead and drag it across the sites we are designing probes for.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2694 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 22:39:46 +00:00
ebanks 47440bc029 - Removed max_coverage argument from UG; Aaron will set it up so that we don't call when the GATK had to drop reads.
- Reimplemented optimization in UG to not call when there are no non-ref bases.
- Compute reference confidence accurately in UG for ref calls.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2693 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 21:56:33 +00:00
rpoplin a1054efe8a Default platform and default read group are no longer set to values by default. The recalibrator throws an exception if needed values are empty in the bam file and the args weren't set by the user. This is done to make it more obvious to the user when the bam file is malformed. Similarly, the recalibrator now refuses to recalibrate any solid reads in which it can't find the color space information with an exception message explaining this. The recalibrator no longer maintains its own version number and instead uses the new global GATK version number.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2690 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 18:47:40 +00:00
rpoplin 0345d9f6a5 Updating the recalibrator to use non-deprecated getPileup() method. Adding documentation to AnalyzeAnnotations so that the walker isn't marked as unclean at compile time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2688 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 14:15:09 +00:00
depristo c231547204 Refactoring and migration of new allele/variantcontext/genotype code into oneoffprojects. NOT FOR USE. PlinkRod commented out due to dependence on this new, rapidly changing interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2687 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 13:53:29 +00:00
aaron 2e57bc7879 added a better message for the SO flag error in MergingSAMIterator2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2685 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 22:57:18 +00:00
rpoplin 894a2b511b Fixing no platform warning message.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2682 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 19:46:50 +00:00
rpoplin 2b51cf18f0 AnalyzeAnnotations now outputs plots with log x-axis in addition to standard x-axis so things like DP and MQ0 are easier to see. AnalyzeAnnotations now skips over all annotations that aren't floating point values. Recalibrator now warns users if PL tags are missing and that it is therefore reverting to illumina.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2681 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 19:39:18 +00:00
asivache 6cf413e630 Bug: ExpandedSAMRecord did not treat hard-clipped bases ('H') correctly. Fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2680 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 19:23:44 +00:00
ebanks dc170caafc Now, if a dbsnp rod is passed to either the UnifiedGenotyper or VariantAnnotator, a DB=0/1 annotation is added (in addition to filling in the ID field); this is in line with 1KG project calls. If no dbsnp rod is used, the annotation is not added (as opposed to setting every entry to DB=0).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2678 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 17:27:12 +00:00
rpoplin 5d2f8aaa54 Updating recalibrator version number after the several emergency changes last week.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2677 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 14:35:47 +00:00
ebanks 78890c0bee First version of walker that combines the functionality of IndelIntervalWalker, MismatchIntervalWalker, SNPClusterWalker, and IntervalMergerWalker - plus it allows the user to input rods containing known indels (e.g. dbSNP or 1KG calls) for automatic cleaning. Basically, all pre-processing steps for cleaning are now done in a single pass.
More testing needed.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2672 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-24 05:32:38 +00:00
chartl d6b9b788a8 Renamed -- PlinkRodWithGenomeLoc --> PlinkRod
Since binary files do not need encoded locus information in the SNP names there's no need to suggest that it is so in the name of the rod



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2671 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 18:19:28 +00:00
chartl ae22d35212 PlinkRod now correctly parses binary files without indels; unit test added for this behavior.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2669 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 17:34:06 +00:00
chartl 94dc09c865 PlinkRod now successfully instantiates on the binary ped file trio (.bim, .bed, .fam) for non-indel files.
Upcoming: Test that the instantiation is correct, do it for indel-containing files.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2668 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 16:13:24 +00:00
chartl 01db93299c PlinkRodWithGenomeLoc now properly handles indels.
There is now a DELETION_REFERENCE allele type to allow for the storage of multi-base references rather than point-mutation references.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2667 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 07:34:52 +00:00
chartl 42fb85e7f3 PlinkRodWithGenomeLoc now properly parses text plink files. Unit test added to test this functionality. Indels and binary files to come.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2666 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 06:19:26 +00:00
depristo c871a0f221 UG map() now returns a VariantCallContext object. Also has a field for confidentlyCalledBases. UG reduce() emits statistics on the confident called % of bases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2664 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 23:06:43 +00:00
chartl fbf82526cb Minor renaming changes.
PlinkRodWithGenomeLoc now supports .bed file parsing (and doesn't require |c#_p# conventions for SNPs -- still requires _g[I/D] for indels)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2663 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 23:06:32 +00:00
rpoplin fd223e955c Reverting the previous solid change. We now refuse to recalibrate if the solid read doesn't contain proper color space information. The exception message has been updated to say this. Also, Tile has been downgraded to an ExperimentalCovariate due to performance issues.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2662 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 20:55:28 +00:00
rpoplin 7732f98e56 Fix for Solid reads that have '.' in their color space field. The recalibrator will just set them to be illumina reads and won't apply color space correction.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2661 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 20:09:16 +00:00
aaron 2ea768d902 ant clean is your friend....fixed test code dependent on an interface change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2660 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 20:07:46 +00:00
aaron cc3b818268 cleanup of the pile-up limit exceeded warning, and a little code cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2657 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-21 22:17:24 +00:00
ebanks c1e09efb23 - Fixed output for beagle header
- Better description for QualByDepth annotation



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2655 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-21 21:25:56 +00:00
hanna d25a2fe120 Better handling of enums by the command-line argument system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2647 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 21:36:46 +00:00
ebanks 9c7b281b4f Set default value for max_coverage to be 100K (since 10K is too small).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2646 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 20:15:25 +00:00
hanna 908d399670 Bug fix for help text / version number - help text retriever was crashing in the debugger if help text hadn't been built.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2643 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 19:18:19 +00:00
chartl ab289872e4 Changes:
- Annotations return null when given pileups with no second-base information

- SequenomRodWithGenomeLoc -- better handling of indels

Eric; I made two small changes to the new Genotype interface that we should talk about (they basically have to do with allele/genotype representation):

Allele - added a new UNKNOWN_POINT_MUTATION to AlleleType. If I see a sequenom genotype AG; one's got to be ref, one's got to be SNP, but until I have
         an actual reference base in hand, I don't know which is which. That's what this entry is for.

Genotype - added an enum class StandardAttributes for dealing with things like deletion/inversion length. This is probably not the way we want to
         represent indels, so we should talk about this. Plus now that there's a direct link between my ROD and the genotype; when we do decide
         how to deal with indels, we'll be forced to alter the SequenomRodWithGenomeLoc accordingly.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2642 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 16:45:17 +00:00
aaron a1b4cc4baf changes to intelligently log overflowing locus pile-ups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2640 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 08:09:48 +00:00
ebanks 4ac9eb7cb2 - Smarter strand bias calculation
- Better debug/verbose printing



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2639 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 03:01:26 +00:00
asivache 4625261d79 Bug fix: alignments ending with 'I' were not counted into the overall coverage, which resulted in inaccurate stats and, on rare occasions, outright messed-up ones.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2635 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 22:12:16 +00:00
hanna 8dafd26100 Print out the current version number in the application header.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2633 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 21:58:36 +00:00
depristo 9e0ae993c7 -B 1kg_ceu,VCF,CEU.vcf -B 1kg_yri,VCF,YRI.vcf system supported to allow 1KG % (like dbSNP%)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2632 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 21:33:13 +00:00
rpoplin c98df0a862 Updated solid_recal_modes to work with bfast aligned data. Added an integration test that uses the BFAST file provided by TGen.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2630 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 21:18:02 +00:00
chartl 53352e1bb4 First pass at a sequenom ROD. Nothing uses it; currently undergoing testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2629 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 17:09:36 +00:00