Commit Graph

2293 Commits (668c7da33d00bb0eecb0444b85aeccd9bb03e607)

Author SHA1 Message Date
hanna 668c7da33d Bug fix in custom override of queryOverlapping.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2743 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 21:35:59 +00:00
rpoplin c6cc844e55 Added -name argument to AnalyzeAnnotations that allows one to specify the name of the annotation to be used on the plots. Instead of seeing AB and DP, one can add -name AB,AlleleBalance -name DP,Depth
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2742 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:48:53 +00:00
depristo 62a80f2b6f fixed out of date tests. Also, tests uncovered a subtle bug in new implementation that was also fixed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2741 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:03:48 +00:00
rpoplin 4f29a1d4f6 AnalyzeAnnotations now plots true positive rate instead of percentage of variants found in the truth set. Committing GCContentCovariate to help people experiment with correcting the pilot3/Kristian base calling error mode in slx.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2740 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:01:56 +00:00
aaron ac2a207b0b added a wrapper exception for anything that goes wrong in VCF parsing; this way the problematic file line is emitted, no matter what happens. Makes debugging a lot easier, especially in large files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2739 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 19:58:51 +00:00
hanna e7f5c93fe5 Cleaning up the inheritance hierarchy from the previous commit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2738 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 19:13:36 +00:00
depristo 88495a39d4 better formating
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2737 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:38:21 +00:00
depristo 1993472b38 Just like VariantFiltration but lets you match info fields out of the VCF instead of annotating them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2736 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:38:03 +00:00
depristo 0a7426c29c Computes SNP density over the genome. Doesn't work with intervals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2735 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:36:49 +00:00
depristo 9decd20f46 Fix to priors to allow lower het values for mouse guys; no intergration test changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2734 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:36:12 +00:00
chartl d57a86ad41 Not nearly as badass as it looks. The problem I mentioned yesterday with "bleeding in" of samples comes from VCFUtils and SampleUtils looking for all VCF-class RODs in the tracker, and stealing the name from them. I have introduced a new HapmapVCF - type rod for use
when you want to protect your VCF header from being infected by the samples in a bound hapmap VCF. Changes are as follows:

VCFRecord - minor change to adapt isNovel() to the case where the dbsnp ID field is empty, but the info field has DB=1

HapmapVCFRod - introduced for the reason at the top

RODRecordIterator - was: catch ( Exception e ) { throw new StingException("long ass message") }
                 is now: catch ( Exception e ) { throw new StingException("long ass message",e) }
                    to permit full stack ejaculation.

RodVCF - Now with more brackets!

ReferenceOrderedData - registering HapmapVCF as a bindable string

VariantAnnotator - There's an extra space on a line. And some new brackets.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2733 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:19:50 +00:00
depristo 5aaf4e6434 VariantFiltration now accepts any number of --name --filter expressions, and annotates the VCF file with each name that matches. Very useful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2732 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 12:13:08 +00:00
ebanks 01e73fc39e Yuck - Picard's SAMRecord Comparator only deals with mapped reads. Adding an extended version that works for all reads.
After adding some more minor changes to the new realigner it now gets the same exact results as the original version - except that sometimes it doesn't clean when it shouldn't!
More testing coming.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2731 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 07:49:47 +00:00
hanna 3d922a019f Basic support for very simple index-driven locus traversals. Interface has been changed to
support batched intervals in a single shard, but intervals are not yet compressed into a single
shard.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2730 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 03:14:26 +00:00
asivache 4810e9c9cd And now the DOCS!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2729 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 23:21:33 +00:00
asivache 40262e2070 Now calls single-sample indels too, with all the V2 level stats and bells. This officialy obsoletes IndelGenotyperWalker (V1). In addition, the alignments spanning beyond the contig end are now completely ignored (with a user warning), this applies to both single-sample and paired (somatic) calls. You just wait, Eric, I'll get you the docs with the next commit!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2728 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 22:28:02 +00:00
rpoplin 79c4cc1db7 AnalyzeAnnotations now breaks out titv by calls in hapmap and also plots true positive rates. Any RODs passed in whose name starts with 'truth' is considered to be the truth set.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2726 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 21:41:23 +00:00
chartl 7a10c40fb3 Much clearer (and, like, not totally incorrect) implementation of isNovel
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2725 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 21:16:21 +00:00
chartl 8de6a8d246 Lots of changes; all to do something relatively minor.
1) Changed VCF/RodVCF to allow for inquiries to whether or not the site is novel; isNovel() looks at the ID field, and those members of the info field that indicate membership in dbsnp, hapmap2, or hapmap3; and if none can be found, returns true.

2) Changed VariantAnnotator to annotate hapmap2 and hapmap3, if you bind rods to it with those names. Works in the same way as DBSNP does -- if you give it a rod named "hapmap2" it'll annotate membership in it. -- Passes integration tests

3) Changed UnifiedGenotyper to do the same thing (since it uses Annotations as a subroutine) -- Passes integration tests

4) Changed MultiSampleConcordanceWalker to take a flag --ignoreKnownSites (or -novels) to examine concordance only on sites that are not marked as in dbSNP or in Hapmap in the variant VCF

5) Changed VCFConcordanceCalculator (the object MultiSampleConcordanceWalker runs on) to output Concordant_Het_Calls and Concordant_Hom_Calls separately, rather than combined as Concordant_Calls

6) AlleleBalanceHistogramWalker -- I don't know what i did to this thing. I've been jerry rigging System.outs to do stuff it was never really intended to do; so there's probably some dumb System.out.print("HI I AM AT LOCUS:"+loc) stuck somewhere. It compiles at any rate.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2724 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 21:06:56 +00:00
ebanks 6f11fe442a Sync with Andrey's changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2723 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 20:49:38 +00:00
asivache db429e1096 Some alt consenses may have cigar string starting with an insertion. Not a bug, strictly speaking, since the cleaner had been detecting this and crashing deliberately. Now it knows how to deal with this special case though. Also, uppercase the ref before using it in SW aligner!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2722 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 18:53:02 +00:00
depristo 956b570c8e V5 improvements to VariantContext. Now fully supports genotypes. Filtering enabled. Significant tests throughout system. Support for rebuilding variant contexts from subsets of genotypes. Some code cleanup around repository
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2721 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 18:37:17 +00:00
depristo 9876645a5d Now drives the walker by reference, not by reads, so we see even loci with no reads. This allows us to accurately calculate the true total callable area
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2720 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 11:12:46 +00:00
ebanks 1dd9996f3a New realigner now completely uses bytes, plus misc fixes. Still not ready for use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2719 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 04:17:20 +00:00
depristo f6bca7873c V3 of VariantContext. Support for Genotypes and NO_CALL alleles. QUAL fields fully implemented. Can parse VCF records and dbSNP. More complete validation. Detailed testing routines for VariantContext and Allele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2718 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 04:10:16 +00:00
chartl 23fc9737b4 Added the ability to filter out variant (not truth) calls based on read depth. Using -NLD 5 will not update concordant counts for calls with 0, 1, 2, 3, or 4 reads supporting them. Not to be used with VCF files that do not have DP in the format field.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2716 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 23:28:04 +00:00
chartl 1b9184a1c7 Added a multisample concordance walker which takes the place of the VCF python library I've been using. Takes a truth VCF and a variant VCF and outputs A TSV that looks like this:
Sample_ID       Concordant_Refs Concordant_Vars Homs_called_het Het_called_homs False_Positives False_Negatives_Due_To_Ref_Call False_Negatives_Due_To_No_Call
NA19381 491     294     2       0       0       0       1
NA19451 489     298     1       0       0       0       0
NA19463 486     289     2       3       1       4       3
NA19376 488     296     1       0       2       0       1
NA19317 489     284     5       3       3       3       1


This walker will be merged with GenotypeConcordance once it's clear how to do so. 



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2715 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 22:59:17 +00:00
asivache bd11060e72 Ups, I did it again. Fixing the bug introduced in a previous commit: use correct length of the indel event.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2713 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 21:51:54 +00:00
ebanks fddca032bb Initial commit of v2.0 of the cleaner. DO NOT USE. (this means you, Chris)
Cleaned up SW code and started moving over everything to use byte[] instead of String or char[].

Added a wrapper class for SAMFileWriter that allows for adding reads out of order.

Not even close to done, but I need to commit now to sync up with Andrey.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2712 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 21:36:42 +00:00
rpoplin b8ae083d1b AnalyzeAnnotations creates a plot of dbsnp rate as a function of the annotations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2711 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 21:08:33 +00:00
rpoplin 3999a8d2c8 IntelliJ no longer complains that my methods are too complex to analyze.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2708 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 20:12:13 +00:00
rpoplin fc4285f9fd AnalyzeAnnotations seems to be popular so I've rewritten the guts to be easier to extend and maintain.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2707 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 19:30:31 +00:00
hanna fa3589e5c5 Update our error messages to point to getsatisfaction.com/gsa.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2706 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 19:16:28 +00:00
depristo 3399ad9691 Incremental update 2 -- refined allele and VariantContext classes; support for AttributedObject class; extensive testing for Allele class, and partial for VariantContext. Now possible to easily convert dbSNP to VariantContext.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2705 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 17:19:37 +00:00
asivache 3edcefb7fb add _gI and _gD to the indel probe names according to the spec (in the hope that wiki is not obsolete); added optional cmd line param -project_id to prefix all probe names with.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2704 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 17:06:49 +00:00
chartl ed9b7edee3 Changed " to ' to stop the
[javadoc] /humgen/gsa-scr1/chartl/sting/java/src/org/broadinstitute/sting/oneoffprojects/variantcontext/VariantContext.java:99: warning: unmappable character for encoding ASCII
  [javadoc]      *   if one of the alleles is deleted (?-?).

warnings on compile.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2703 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 15:23:55 +00:00
depristo 40c242d2b8 Fix for overflow issues
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2702 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 13:37:16 +00:00
aaron 8453676b71 added a method to AlignmentContext called hasExceededMaxPileup, which you can use to determine if the current site exceeded the maximum pileup size (reads were dropped). Added this as a check to unified genotyper according to Eric's instructions, and added the plumbing to the engine.
Also deleted the FixBamSortOrder package that isn't used anymore.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2701 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 05:17:01 +00:00
rpoplin 4bcdab580c --output_dir has been changed to --output_prefix to give the user more control over the names of the resulting mass of files in AnalyzeAnnotations. The fontsize of the axes is increased. Cumulative filtering plots are removed since the binned filtering plots are much more useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2700 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 04:50:54 +00:00
chartl df112e64b8 Minor tweaks
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2699 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 04:17:47 +00:00
ebanks 476d6f3076 RealignerTargetCreator is officially live
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2697 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 03:41:52 +00:00
asivache 1f64c5d41a Do not slurp the whole set of snp mask sites into memory (gets pretty heavy on full dbSNP!); instantiate a privare ROD iterator instead and drag it across the sites we are designing probes for.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2694 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 22:39:46 +00:00
ebanks 47440bc029 - Removed max_coverage argument from UG; Aaron will set it up so that we don't call when the GATK had to drop reads.
- Reimplemented optimization in UG to not call when there are no non-ref bases.
- Compute reference confidence accurately in UG for ref calls.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2693 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 21:56:33 +00:00
chartl 2c8d7b0c44 Forgot the onTraversalDone. That was dumb.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2692 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 21:02:46 +00:00
chartl 04e1832968 Added - AlleleBalanceHistogramWalker -- hopefully this'll be able to tell us very clearly whether bad genotype concordance is a result of systematic contamination (consistent wonky allele balances)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2691 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 20:57:12 +00:00
rpoplin a1054efe8a Default platform and default read group are no longer set to values by default. The recalibrator throws an exception if needed values are empty in the bam file and the args weren't set by the user. This is done to make it more obvious to the user when the bam file is malformed. Similarly, the recalibrator now refuses to recalibrate any solid reads in which it can't find the color space information with an exception message explaining this. The recalibrator no longer maintains its own version number and instead uses the new global GATK version number.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2690 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 18:47:40 +00:00
rpoplin 0345d9f6a5 Updating the recalibrator to use non-depricated getPileup() method. Adding documentation to AnalyzeAnnotations so that the walker isn't marked as unclean at compile time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2688 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 14:15:09 +00:00
depristo c231547204 Refactoring and migration of new allele/variantcontext/genotype code into oneoffprojects. NOT FOR USE. PlinkRod commented out due to dependence on this new, rapidly changing interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2687 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 13:53:29 +00:00
aaron 2e57bc7879 added a better message for the SO flag error in MergingSAMIterator2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2685 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 22:57:18 +00:00
rpoplin 24d4082925 AnalyzeAnnotations can now process only variants that are found in samples that match the -sampleName argument. X-axis of plots no longer use annoying scientific notation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2684 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 20:52:11 +00:00