Commit Graph

767 Commits (c66e93d86e410b0feb414dfd27e6c72f5f83ff72)

Author SHA1 Message Date
hanna c08936d6f4 Added a reservoir downsampler which can sample elements in an iterator uniformly
from a stream (see Vitter 1985).  Thanks to Eric and Andrey for the pointer.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3197 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 20:48:14 +00:00
ebanks c44f63c846 Fixing the performance tests: we need to catch the RuntimeException (not samtools' RuntimeIOExcpetion). Also, CountCovariates doesn't need the catch.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3196 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 14:28:12 +00:00
ebanks abf48cee05 Moving over to VariantContext from Variation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3195 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 06:56:29 +00:00
ebanks d73c63a99a Redoing the conversion to VariantContext: instead of walkers passing in a ref allele, they pass in the ref context and the adaptors create the allele. This is the right way of doing it.
Also, adding some more useful integration tests.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3194 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 05:47:17 +00:00
aaron be7cbf948b adding a catch for the exception thrown by samtools when it attempts to close /dev/null in the performance tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3186 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 17:41:48 +00:00
ebanks 7adff5b81a Renaming for consistency
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3180 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 20:36:19 +00:00
ebanks e702bea99f Moving VE2 to core; calling it "VariantEval" (one more checkin coming)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3179 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 20:25:47 +00:00
chartl ac6f6363ce Execs() temporarily disabled after removal of bam file. New tests forthcoming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3178 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 20:11:56 +00:00
ebanks ac9dc0b4b4 Removing VariantEval (v1); everyone should be using VE2 now. Docs coming ASAP.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3177 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 19:53:02 +00:00
ebanks 5f7564bf0a Better naming of output columns
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3175 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 18:08:07 +00:00
aaron e682460c1f add a fix so that XL arguments won't cancel out -BTI arguments, fixed a bug for Ben where the ROD -> interval list conversion was throwing an exception, and some old code removal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3174 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-15 16:31:43 +00:00
ebanks 04909fa6ad Removing arbitrary selects
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3169 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 17:46:39 +00:00
weisburd b930dc52a5 Integration test for GenomicAnnotator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3167 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-14 14:43:25 +00:00
ebanks dde092fb61 Added the ability in VE2 to select which eval modules to run, so that you aren't forced to use all of them. You can use --list to list all of the possible modules to run.
Heads up everyone: by default, *no* modules are run.  Please add "-all" to your scripts to maintain the previous behavior.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3161 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 22:15:58 +00:00
hanna 8573b0bc6f Refactoring intervals, separating the process of parsing interval lists,
sorting and merging interval lists, and creating RODs from intervals.  This
gives Doug the ability to keep using our interval list parsing code when
sorting intervals on our behalf.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3159 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-13 15:50:38 +00:00
ebanks e413882302 Generalizing the SequenomValidationConverter to be able to take in any arbitrary rod type (provided it can be converted to VariantContext).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3155 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-12 20:42:18 +00:00
ebanks d06c7835d8 Adding performance tests for the indel realigner; should take ~3 hours.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3151 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-11 04:45:22 +00:00
ebanks 961ca05abc Removed outdated Sequenom rod and renamed HapMapGenotypeROD to HapMapROD.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3149 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-11 01:43:07 +00:00
ebanks fa01876255 UnifiedGenotyper performance tests (WG, WEx); currently takes just over an hour.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3148 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 19:42:29 +00:00
rpoplin c2a37e4b5c Variant Quality Score modules in VariantEval2 no longer create huge lists which hold all of the quality scores encountered and instead cast the quality score to an integer and use hash tables. Bug fix for files in which all the quality scores are set to -1.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3146 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 18:36:06 +00:00
ebanks 71f38a9199 Adding performance tests for the recalibrator (Whole Genome and Whole Exome tests).
Should take ~3 hours to run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3145 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 18:30:59 +00:00
ebanks fba48b515a Heads up everyone:
For consistency, these tools should be writing to the walker's output stream and no longer use the -vcf argument.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3140 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-09 05:37:25 +00:00
chartl 7025f5b51d Added an auxiliary table to DepthOfCoverage, which is the cumulative equivalent of the locus table (got tired of doing the calculation by hand). Also took care of a trailing tab in the per-locus output table.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3138 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 19:37:17 +00:00
aaron 9f6377f7fb added a performance test build option (for the upcoming performance test suite), and added a sample performance test for VariantEval.
IMPORTANT: it was really redundant that we had -Dsingle and -Dsingleintegration to run single unit tests and integration tests, now you can just use -Dsingle to run a single test for performance, unit, and integration tests.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3136 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 15:37:15 +00:00
aaron 4014a8a674 A long overdue correction; all unit tests now end in 'UnitTest'. This was something we wanted to do for a while, and now with the performance tests coming, it was a good time to clean-up. Please label any new test appropriately: *UnitTest and *IntegrationTest are the two valid file name patterns for tests.
Thanks!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3135 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-08 06:14:15 +00:00
aaron 8fd59c8823 Modified the report system based on Ryan's feedback: tables are now created independently to avoid the permutation problem when they were all compressed in rows, and removed our dependency on FreeMarker. The Grep format stays the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3130 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 20:39:55 +00:00
depristo 918b746798 More detailed validation output. Fixes for genotyping overflow -- these are temporary and need to be properly resolved
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3129 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-07 16:38:28 +00:00
rpoplin 60c227d67f Added new VE2 module to create a plot of titv ratio by variant quality score
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3125 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-06 15:19:27 +00:00
chartl d7880ef7ad Forgot to uncomment the AlignerIntegrationTest before committing. And yes, matt, commenting it out is, in fact, easier than just setting my classpath.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3110 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 17:17:16 +00:00
chartl f7d1b8f5de CoverageStatistics has now replaced DepthOfCoverage -- old DoC is in the archive.
Also, I can't be bothered to fix the spelling of "oldepthofcoverage" to contain the necessary number of D's. Be content that it does, however, contain the requisite number of O's.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3109 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 16:27:23 +00:00
aaron 585cc880a2 changed jexl expressions to jexl names in the VariantEval2 output, fixed integration test, and fixed a problem where a line was getting dropped in CSV output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3108 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 16:23:14 +00:00
bthomas b4f6f54502 Reorganizing the way interval arguments are processed
Most of the changes occur in GenomeAnalysisEngine.java and GenomeLocParser.java: 
-- parseIntervalRegion and parseGenomeLocs combined into parseIntervalArguments
-- initializeIntervals modified
-- some helper functions deprecated for cleanliness
Includes new set of unit tests, GenomeAnalysisEngineTest.java

New restrictions: 
-- all interval arguments are now checked to be on the reference contig
-- all interval files must have one of the following extensions: .picard, .bed, .list, .intervals, .interval_list



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3106 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-01 12:47:48 +00:00
aaron 3d3d19a6a7 the last-mile commit for Tribble integration. The system is now ready for Tribble to be turned on, as soon as we've removed any dependencies in the ROD code on interfaces that aren't in the Tribble library (i.e. the Variation or Genotype interface on RODs). All of the walkers should be up to date.
a caveat: for anyone asking for all of the ROD's back from the RefMetaDataTracker (if your not using the facilities to get the track by name), you'll now be getting back a collection of GATKFeature objects.  This object will contain the track name, and a method for getting the underlying object (getUnderlyingObject()), which will be the traditional RodVCF, rodDbSNP, etc.  This layer is needed so we can integrate Tribble tracks (which don't natively have names).  Calls that ask for RODs by name will still get back the traditional reference ordered data objects (RodVCF, rodDbSNP, etc).

Sorry for the inconvenience!  More changes to come, but this is by far the largest (as has the greatest effect on end users).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3104 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-31 22:39:56 +00:00
chartl dc802aa26f Moved CoverageStatistics to core. This will be (soon) renamed DepthOfCoverage; so please use CoverageStatistics
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3090 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 13:32:00 +00:00
depristo 8ea98faf47 Deleting the pooled calcluation model -- no longer supported.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3088 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-29 11:44:27 +00:00
aaron 074ec77dcc First go of the new output system for VE2. There are three different report types supported right now (Table, Grep, CSV), which can be
specified with the reportType command line option in VE2.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3083 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-27 03:59:32 +00:00
kshakir 20e3ba15ca Added an optional argument -rgbl --read_group_black_list to filter read groups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3079 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-26 19:38:57 +00:00
ebanks 73a14a985b Moving VariantsToVCF to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3078 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-26 18:55:12 +00:00
ebanks 14bf6923a8 HapMap-to-VCF now works fine within Variants-to-VCF. Added integration test for it and removed old code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3077 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-26 18:34:59 +00:00
ebanks 4398a8b370 Updated. Now uses VariantContext and is truly "variants" to vcf (i.e. not just GELI to vcf).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3074 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-25 04:53:31 +00:00
aaron 5079f35e40 better method names for read based reference ordered data access.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3069 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-24 16:13:31 +00:00
aaron 7462a0b2d1 cleaned-up of VariantContextAdapter tests, fixed the double comparisons in equals() in RodGeliText (nice MathUtils.compareDoubles Kiran)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3064 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-23 15:18:30 +00:00
aaron a69b8555dd Geli to variant context.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3063 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-23 06:45:29 +00:00
aaron eafdd047f7 GLF to variant context. Added some methods in GLF to aid testing; and added a test that reads GLF, converts to VC, writes GLF and reads back to compare.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3062 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-23 03:43:25 +00:00
asivache ee1dc6092f Test updated. Now we do not throw an exception when locus interval is out of bounds, we just return silently a reference context trimmed to the current shard boundaries. New test checks for trimming.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3058 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-22 17:37:52 +00:00
aaron 439c34ed38 clean-up before annotating VariantEval2 for output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3055 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-22 07:39:20 +00:00
ebanks 4c4d048f14 Moving VariantFiltration over to use VariantContext.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3048 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 18:35:23 +00:00
ebanks c88a2a3027 Fixing/cleaning up the vcf merge util
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3047 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 15:13:32 +00:00
ebanks 03480c955c And now the UnifiedGenotyper can officially annotate genotype (FORMAT) fields too.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3039 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 04:58:37 +00:00
ebanks 0311980668 The VariantAnnotator can now officially annotate genotype (FORMAT) fields.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3037 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 03:30:14 +00:00
aaron 8a5f0b746e some cleanup for the output system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3032 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-18 12:54:39 +00:00
ebanks 0247548400 Fixed one test and (temporarily) punted on another
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3030 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-18 06:22:48 +00:00
ebanks ee0e833616 Some significant changes to the annotator:
1. Annotations can now be "decorated" with any arbitrary interface description - not just standard or experimental.
2. Users can now not only specify specific annotations to use, but also the interface names from #1.  Any number of them can be specified, e.g. -G Standard -G Experimental -A RankSumTest.
3. These same arguments can be used with the Unified Genotyper for when it calls into the Annotator.
4. There are now two types of annotations: those that are applied to the INFO field and those that are applied to specific genotypes (the FORMAT field) in the VCF (however, I haven't implemented any of these latter annotations just yet; coming soon).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3029 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-18 05:38:32 +00:00
ebanks 4340601c26 -Pushed base quals back down into SAMRecord; if -OQ is used, the SAMRecord quals get updated automatically
-Better integration test


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3020 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-17 16:00:10 +00:00
hanna 2525ecaa43 Oops. Commented out some tests to improve performance and then checked in the commented out tests. Reverted.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3012 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 16:34:50 +00:00
hanna 6dd5f192e7 Performance improvements for RODs in conjunction with new sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3010 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 14:54:12 +00:00
aaron 10e76abbbc adding some VE2 report infrastructure; work-in-progress.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3008 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 03:57:42 +00:00
ebanks 202231141c -Push the --use_original_qualities argument into the engine.
-Check that base and qual strings are the same lengths
-Fix one more bug in the clipper.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3006 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 02:06:11 +00:00
ebanks 411d25c8d1 -Integration tests for walkers that use original quals.
-framework for pushing -OQ into GATK (not done)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3004 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-15 18:46:31 +00:00
aaron e365d308d4 add a new JEXLContext that lazy-evaluates JEXL expressions given the VariantContext.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3003 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-15 16:00:55 +00:00
ebanks 73d6167bd6 Fixing broken integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2998 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-14 23:18:49 +00:00
depristo 4dd7c5972c Unit tests for -XL arguments; expt. annotation calculating the GC content within 100 bp of the current SNP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2997 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-14 21:08:14 +00:00
aaron ecb59f5d0d removed old tests and old code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2995 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 22:57:01 +00:00
aaron 88a48821ea removed the dependence on removeRegion() in GenomeLocSortedSet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2993 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 22:35:49 +00:00
depristo b39b5edca8 Bug fix in variant eval 2. Preliminary (slow and buggy) support for -XL exclude lists.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2991 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 19:23:12 +00:00
aaron 1eb5f97255 fixed dropping single base intervals from deleteRegion, moving onto performance fixes.
(stop - start is length-1 on closed intervals, so we need to check greater than OR equals to zero)

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2990 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 19:14:21 +00:00
aaron 661a043cef adding methods to get RODs by name or type in read traversals, performance improvements to RODs for Reads in general, and some more Tribble infrastructure.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2984 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-11 21:13:39 +00:00
hanna a7ba88e649 Rework the way the MicroScheduler handles locus shards to handle intervals that span shards
with less memory consumption.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2981 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-11 18:40:31 +00:00
aaron dde9fd8a15 some rods-for-reads cleaning and performance improvements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2979 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 22:54:58 +00:00
depristo 4f4555c80f PPV and Sensitivity added to validation tool output; support for arbitrary -sample arguments to subset variant contexts by sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2978 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 22:28:31 +00:00
ebanks 40d305bc7e Added test of Nway cleaning for Matt; thanks to Aaron for the help.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2977 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 21:00:41 +00:00
depristo 486bef9318 Support for validationRate calculation in variant eval 2; better error messages for failed genome loc parsing; tolerance to odd whitespace in plinkrod, and fix for monomorphic sites in vcf2variantcontext.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2976 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 16:25:16 +00:00
ebanks 7ddd45d059 Hmm. I thought I removed this already.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2973 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 03:09:13 +00:00
ebanks 1a576525e9 misc improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2972 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 03:00:28 +00:00
ebanks 6e855809e1 Renaming and moving relevant tools into a sequenom directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2971 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 02:31:10 +00:00
chartl 0a49dffa8f Row/Column names are now R-friendly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2966 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-09 19:01:03 +00:00
ebanks e5475a7ba9 re-enabling PlinkToVCF integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2964 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-09 17:35:49 +00:00
ebanks 5a20bf0e64 3 changes to UG which break integration tests:
1. emit AA,AB,BB likelihoods in the FORMAT field for Mark
2. remove constraint that genotype alleles (in the GT field) need to be lexigraphically sorted.
3. Add bam file(s) used by genotyper to header for Kiran


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2963 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-09 17:16:47 +00:00
ebanks 9f3b99c11b Moving UnifiedGenotyper and VariantAnnotator over to VariantContext system.
Removing obsolete genotyping classes.
First stage of removing dependence on old Genotype class.
More changes to come.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2960 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-09 03:41:07 +00:00
chartl bca9bdcc68 Add integration test for quartiles overflowing on interval reduce
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2957 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-08 16:18:45 +00:00
hanna a7fe07c404 A few stopgap fixes to get the GATK to the point where the old sharding
infrastructure can be torn down:
1) New sharding system emulates old MonolithicSharding mechanism.
2) Better awareness of differences between fasta and BAM files when creating
   shards.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2948 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 21:01:25 +00:00
hanna dd6122f682 Fixed another bug in the original sharding system. Updated integration tests
as appropriate.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2947 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 15:32:18 +00:00
hanna ee2ec7ced9 Fix off-by-one error in original implementation of read sharding. Tested by
awking output of BamToFastq vs. samtools until the outputs matched exactly.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2945 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-06 18:52:53 +00:00
depristo ee913eca07 Forgot to check in fix this morning
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2943 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-05 21:07:19 +00:00
chartl 8738c544f1 Minor refactoring of CoverageStatistics to allow simultaneous output of per-sample and per-read group statistics.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2940 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-05 17:06:52 +00:00
hanna 7104a3a96c Fix for accumulator exception when running reduce by interval walkers without
intervals.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2935 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-05 01:04:08 +00:00
aaron 366771d5a6 another test-with-multiple outputs fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2934 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 22:46:15 +00:00
chartl 706d49d84c Commit for Aaron
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2932 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 21:29:07 +00:00
aaron 54f04dc541 forgot to uncomment the auto-deletion of temp files...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2930 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 20:29:42 +00:00
aaron 80cc6bbeb4 add a way to test files generated by a walker that aren't command-line arguments; added some example code in CoverageStatisticsIntegrationTest for Chris.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2929 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 20:20:58 +00:00
ebanks 0dd65461a1 Various improvements to plink, variant context, and VCF code.
We almost completely support indels. Not yet done with plink stuff.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2926 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 17:58:01 +00:00
aaron c8077b7a22 Waypoint check-in: a couple of changes to for Tribble, and adding some options to the integration test for passing in auxillary files that aren’t “%s” command line options.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2925 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 16:02:21 +00:00
aaron ca2cd9d4f5 a little clean-up: move setting the bases of generated reads into Artificial SAM Utils now that the clean read injector test is gone.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2919 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-03 16:31:45 +00:00
aaron 790d2a7776 adding the initial ROD for Reads support; more convenience methods in ReadMetaDataTracker to come.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2918 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-03 15:56:44 +00:00
ebanks 0e9a6826b0 Update to VCF code to get it up to spec.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2917 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-03 06:12:42 +00:00
ebanks 74a5223b11 oops - didn't mean to check this in
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2914 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-02 20:28:22 +00:00
ebanks 5f3c80d9aa 1. To make indel calls, we need to get rid of the SNP-centricity of our code. First step is to have the reference be a String, not a char in the Genotype. Note that this is just a temporary patch until the genotype code is ported over to use VariantContext.
2. Significant refactoring of Plink code to work in the rods and use VariantContext.  More coming.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2913 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-02 20:26:40 +00:00
aaron d8fedd59be docs, cleanup, and some improvements to the iterators.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2901 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 22:36:04 +00:00
chartl 87f8fb7282 Quick commit in advance of Aaron's. Just a bunch of refactoring (private classes separated out, put in proper package). Also support added for coverage by read group rather than sample.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2897 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 16:39:47 +00:00
depristo 9a6b384adb Support for no qual fields in VCF; better support for Mendelian violation calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2893 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 00:29:17 +00:00
aaron 246fa28386 RODs for reads phase 2: modified RODRecordList to implement List<ReferenceOrderedDatum> so I could stub it out for testing, added a FlashBackIterator which is needed to prevent the ResourcePool from opening infinity+1 iterators, and some other interfaces to make unit testing much smoother.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2892 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 22:48:55 +00:00
chartl 3d92e5a737 Initial commit of integration test(s) for CoverageStatistics, currently in progress [midway commit is for Matt]
Modifications to CoverageStatistics - now includes and extends much of the behavior of DepthOfCoverage (per-base output, per-target output).

Additional functionality (coverage without deletions, base counts, by read group instead of by sample) is upcoming.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2888 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 20:25:07 +00:00
hanna 553d39bb00 Clean up the code a bit following the introduction of reduceByInterval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2887 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 01:20:22 +00:00
hanna 199b43fcf2 Reduce by interval alterations to interface with new sharding system. This checkin with be followed by a
simplification of some of the locus traversal code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2886 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 00:16:50 +00:00
aaron fef1154fc8 starting on RODs for Reads: made RODRecordList implement list<RODatum> (so we can sub in fake lists during testing), and removed unnecessary generic-ness. Removed BrokenRODSimulator, which isn't being used.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2884 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-24 22:11:53 +00:00
aaron 5546aa4416 adding code to deal with the off-spec situation where our minimum likelihood is above the GLF max of 255.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2871 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-22 22:27:39 +00:00
ebanks 8b555ff17c Killed the old cleaner code. Bye bye.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2868 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-22 20:49:58 +00:00
rpoplin 3e0e7aad2d Removing debug statement. oops.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2858 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-19 15:26:22 +00:00
rpoplin 7f19ff1fa1 Added a new option in the recalibrator to be used by people who have SOLiD data in which only a few of the reads have no-calls in the color space. These reads will be skipped over and left in the bam file untouched.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2857 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-19 15:25:23 +00:00
aaron b1a4e6d840 removing non-ascii characters from my Copyright and from VariantEval2Walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2856 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-18 18:54:36 +00:00
aaron 33ae256186 a start to some of the infrastructure for Tribble, including dynamic detection of new RMD; not nearly wired in or complete yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2855 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-18 18:43:52 +00:00
ebanks 79ab7affda - Change sortOnDisk option to sortInMemory
- Fix horrible cleaner bug
- Trivial optimizations to cleaner code - more significant ones coming soon.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2850 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-17 20:52:57 +00:00
aaron 653f70efa2 added methods to validate an interval before you try to make a GenomeLoc: boolean validGenomeLoc().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2846 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-16 20:35:35 +00:00
depristo 8072e9aed5 should never commit without running intergration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2838 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-12 23:42:37 +00:00
depristo 5f74fffa02 Massive improvements to VE2 infrastructure. Now supports VCF writing of interesting sites; multiple comp and eval tracks. Eric will be taking it over and expanding functionality over the next few weeks until it's ready to replace VE1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2832 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-12 15:26:52 +00:00
depristo 197dd540b5 added root GATKData variable
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2831 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-12 15:25:34 +00:00
depristo c66861746a improvements to ve2, including more meaningful mendelian violation counting. Support for VCF emitted interesting sites, annotated according to the evaluations themselves. Basic intergration test for VE2 started
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2819 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-10 16:12:29 +00:00
rpoplin 0b1e243a7b CountCovariates now sorts the list of standard covariate classes coming from PackageUtils.getClassesImplementingInterface(). As a result some of the integration tests now make use of -standard
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2817 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-10 15:52:20 +00:00
depristo 934d4b93a2 VariantContext to VCF converter. BeagleROD, and phasing of VCF calls. Integration tests galore :-)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2814 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-09 19:02:25 +00:00
depristo 94f892ad42 VCF->beagle and VCF phasing using beagle input. Appears to work fairly well. VariantContexts now support phased genotypes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2812 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-09 01:22:05 +00:00
kshakir fc810a1800 Updated VCF Reader to parse VCFs according to the VCFv3.3 spec. Column headers are tab separated since sample names might have spaces.
Updated test files in /humgen/gsa-scr1/GATK_Data/Validation_Data/*.vcf to remove spaces except for when they are supposed to be in the sample name.
Added @Test before VCFReaderTest.testHeaderNoRecords()

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2809 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-08 22:55:59 +00:00
chartl 935e76daa1 Minor changes to oneoff walkers. PlinkRod altered but still commented.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2808 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-08 18:49:56 +00:00
ebanks ca1917507f Various improvements and fixes:
In indel cleaner:

1. allow the user to specify that he wants to use Picard’s SAMFileWriter sorting on disk instead of having us sort in memory; this is useful if the input consists of long reads.

2. for N-way-out mode: output bams now use the original headers from the corresponding input bams - as opposed to the merged header.  This entailed some reworking of the datasources code.

3. intermediate check-in of code that allows user to input known indels to be used as alternate consenses.  Not done yet.

In UG: fix bug in beagle output for Jared.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2805 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-07 04:21:04 +00:00
depristo 3b1ab86d11 Added generic interfaces to RefMetaDataTracker to obtain VariantContext objects. More docs. Integration tests for VariantContexts using dbSNP and VCF. At this stage if you use dbSNP or VCF files only in your walkers, please move them over to the VariantContext, it's just nicer. If you've got RODs that implemented the old variation/genotype interfaces, and you want them to work in new walkers, please add an adaptor to VariantContextAdaptors in refdata package. It should be easy and will reduce burden in the long term when those interfaces are retired.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2803 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-06 16:26:06 +00:00
hanna c7e006a996 Bug fixes for interval batching in sharding system. Sharding system now batches intervals and passes
basic tests for small and large intervals and intervals that cross bin boundaries.  Currently works
only with a single BAM file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2800 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 21:47:54 +00:00
rpoplin be33d1852c Reverting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2792 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 15:57:09 +00:00
rpoplin 0d8d6e0a14 Ti/Tv module in VariantEval shows known and novel ratios if possible
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2790 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 15:37:40 +00:00
depristo 1494dc875f fixing up tests. Moves are complete
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2789 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 14:24:00 +00:00
depristo c6d86da4b8 almost managed to move things around perfectly in move go
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2788 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 14:18:26 +00:00
depristo 1d86dd7fd1 Interface changes following Matt's advice. VariantContexts are now immutable, and there are special mutable versions, in case you need to change things. AttributedObject now a InferredGeneticContext and package protected. VariantContexts are now named, which makes them easier to use with the rod system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2780 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-04 20:55:49 +00:00
aaron af7cd9cf58 some very old tests relied on cancer data that got moved. Reset one to use data in the validation directory, the other to the artificial sam utils (the best approach).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2767 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-02 23:13:10 +00:00
hanna 9dbdfff786 Moved VariantEval to core. Updated integration test md5s to reflect new Analysis class names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2762 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-02 00:22:15 +00:00
depristo d9671dffba Documentation for VariantContext. Please read it and start using it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2756 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 17:49:51 +00:00
ebanks e0808e6c37 Moved old EM model to archive
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2754 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 02:55:32 +00:00
ebanks f6da57dc79 1. For Matt: JIRA GSA-270. Other walkers needing to call into the Unified Genotyper now use static methods (e.g. runGenotyper()) instead of calling initialize and map.
2. Set the default confidence cutoff to 50 (instead of 0).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2752 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-31 21:14:57 +00:00
depristo 62a80f2b6f fixed out of date tests. Also, tests uncovered a subtle bug in new implementation that was also fixed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2741 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:03:48 +00:00
hanna e7f5c93fe5 Cleaning up the inheritance hierarchy from the previous commit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2738 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 19:13:36 +00:00
depristo 1993472b38 Just like VariantFiltration but lets you match info fields out of the VCF instead of annotating them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2736 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:38:03 +00:00
depristo 5aaf4e6434 VariantFiltration now accepts any number of --name --filter expressions, and annotates the VCF file with each name that matches. Very useful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2732 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 12:13:08 +00:00
hanna 3d922a019f Basic support for very simple index-driven locus traversals. Interface has been changed to
support batched intervals in a single shard, but intervals are not yet compressed into a single
shard.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2730 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 03:14:26 +00:00
depristo 956b570c8e V5 improvements to VariantContext. Now fully supports genotypes. Filtering enabled. Significant tests throughout system. Support for rebuilding variant contexts from subsets of genotypes. Some code cleanup around repository
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2721 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 18:37:17 +00:00
depristo f6bca7873c V3 of VariantContext. Support for Genotypes and NO_CALL alleles. QUAL fields fully implemented. Can parse VCF records and dbSNP. More complete validation. Detailed testing routines for VariantContext and Allele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2718 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 04:10:16 +00:00
depristo 3399ad9691 Incremental update 2 -- refined allele and VariantContext classes; support for AttributedObject class; extensive testing for Allele class, and partial for VariantContext. Now possible to easily convert dbSNP to VariantContext.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2705 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 17:19:37 +00:00
ebanks 476d6f3076 RealignerTargetCreator is officially live
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2697 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 03:41:52 +00:00
ebanks 47440bc029 - Removed max_coverage argument from UG; Aaron will set it up so that we don't call when the GATK had to drop reads.
- Reimplemented optimization in UG to not call when there are no non-ref bases.
- Compute reference confidence accurately in UG for ref calls.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2693 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 21:56:33 +00:00
rpoplin a1054efe8a Default platform and default read group are no longer set to values by default. The recalibrator throws an exception if needed values are empty in the bam file and the args weren't set by the user. This is done to make it more obvious to the user when the bam file is malformed. Similarly, the recalibrator now refuses to recalibrate any solid reads in which it can't find the color space information with an exception message explaining this. The recalibrator no longer maintains its own version number and instead uses the new global GATK version number.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2690 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 18:47:40 +00:00
depristo c231547204 Refactoring and migration of new allele/variantcontext/genotype code into oneoffprojects. NOT FOR USE. PlinkRod commented out due to dependence on this new, rapidly changing interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2687 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 13:53:29 +00:00
chartl d6b9b788a8 Renamed -- PlinkRodWithGenomeLoc --> PlinkRod
Since binary files do not need encoded locus information in the SNP names there's no need to suggest that it is so in the name of the rod



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2671 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 18:19:28 +00:00
chartl ac983e7a0b Ran the rod on a binary plink file with indels and it just worked. Love it when that happens! Unit test to ensure this behaviour is maintained.
****** PLINK ROD IS NOW READY TO GO ********




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2670 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 18:13:05 +00:00
chartl ae22d35212 PlinkRod now correctly parses binary files without indels; unit test added for this behavior.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2669 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 17:34:06 +00:00
chartl 94dc09c865 PlinkRod now successfully instantiates on the binary ped file trio (.bim, .bam, .fam) for non-indel files.
Upcoming: Test that the instantiation is correct, do it for indel-containing files.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2668 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 16:13:24 +00:00
chartl 01db93299c PlinkRodWithGenomeLoc now properly handels indels.
There is now a DELETION_REFERENCE allele type to allow for the storage of multi-base references rather than point-mutation references.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2667 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 07:34:52 +00:00
chartl 42fb85e7f3 PlinkRodWithGenomeLoc now properly parses text plink files. Unit test added to test this functionality. Indels and binary files to come.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2666 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-23 06:19:26 +00:00
aaron 2ea768d902 ant clean is your friend....fixed test code dependent on an interface change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2660 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 20:07:46 +00:00
ebanks c1e09efb23 - Fixed output for beagle header
- Better description for QualByDepth annotation



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2655 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-21 21:25:56 +00:00
chartl 5b2a1e483e Renamed SequenomToVCF as PlinkToVCF. Wiki will be changed accordingly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2649 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-21 17:35:20 +00:00
ebanks 9c7b281b4f Set default value for max_coverage to be 100K (since 10K is too small).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2646 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 20:15:25 +00:00
aaron 8d1d37302c a quick change to GLF to keep as much precision in our likelihoods as long as possible, before we put it into byte space. Sanger was doing a diff at low coverage and noticed our calls didn't contain as much precision as theirs. Updated the MD5 for unified genotyper output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2644 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 19:36:49 +00:00
aaron a1b4cc4baf changes to intelligently log overflowing locus pile-ups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2640 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 08:09:48 +00:00
ebanks 4ac9eb7cb2 - Smarter strand bias calculation
- Better debug/verbose printing



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2639 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-20 03:01:26 +00:00
depristo 9e0ae993c7 -B 1kg_ceu,VFC,CEU.vcf -B 1kg_yri,VCF,YRI.vcf system supported to allow 1KG % (like dbSNP%)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2632 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 21:33:13 +00:00
rpoplin c98df0a862 Updated solid_recal_modes to work with bfast aligned data. Added an integration test that uses the BFAST file provided by TGen.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2630 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 21:18:02 +00:00
ebanks 12453fa163 Misc cleanup of UG args
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2620 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-17 04:38:52 +00:00
depristo d8e74c5795 Update to MD5s for old tests and added extensive VCF testing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2615 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-16 20:22:58 +00:00
ebanks b911b7df82 Fixing the AC annotation to be in line with the VCF spec
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2593 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 18:28:52 +00:00
rpoplin f2e539c52f As per discussions with Tim we are reverting the previous change regarding PairedReadOrderCovariate. The CycleCovariate now differentiates between first and second of pair by multiplying the cycle by -1. PairedReadOrderCovariate has been removed completely.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2592 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 18:18:59 +00:00
rpoplin df998041a8 Minor change to solid warning message. Added note for a future solid recalibration integration test when we get the required data file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2590 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 16:31:25 +00:00
hanna b19bb19f3d First successful test of new sharding system prototype. Can traverse over reads from a single
BAM file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2587 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 03:35:55 +00:00
aaron db9570ae29 Looks bigger than it is:
* Moved GATKArgumentCollection into gatk.arguments folder to clean up the main folder, also added some associated argument classes (most of the changes).
* Added code the argument parsing system for default enums, we needed this so we could preserve the current unsafe flag, and at the same time allow finer grained control of unsafe operations.  You can now specify:

"-U" (for all unsafe operations), "-U ALLOW_UNINDEXED_BAM" (only allow unindexed BAMs), "-U NO_READ_ORDER_VERIFICATION", etc.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2586 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 00:14:35 +00:00
asivache cff8b705c0 Oh, and the test would not work anymore...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2585 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-14 17:47:09 +00:00
rpoplin 9bf0d7250a Fixing the testOtherOutput UG integration test so it will run.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2580 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-14 13:40:14 +00:00
chartl 424d1b57f7 Sequenom to VCF now allows user to specify filters for QC, and they will appear in the filter field of the output VCF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2577 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 23:22:37 +00:00
rpoplin f96b2b211e My last checkin updating R code broke an unrelated UnifiedGenotyper integration test. Eric says that I should take out the verbose test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2576 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 22:28:10 +00:00
rpoplin 49c44e7b36 PairedReadOrderCovariate is now a standard covariate and because of this CycleCovariate no longer multiplies by negative one for second of pair reads. Added PairedReadOrderCovariate to some of the integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2574 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 20:09:10 +00:00
ebanks 8ca5bba738 We emit genotype data in the VCF record if the format string instructs us to (regardless of whether or not genotypes are provided - this was the wrong test).
SequenomToVCF now correctly has no-calls when probes fail.
Re-enabled SequenomToVCF integration test.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2572 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 15:40:27 +00:00
chartl 6d1107a4ed Update to SequenomToVCF
Output changing slightly so integration test disabled temporarily



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2571 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 15:32:05 +00:00
ebanks f99586f91b Added integration test for beagle and verbose output in UG.
Minor cleanup of VCFRecord code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2570 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 03:55:24 +00:00
rpoplin 189829841b The recalibrator now uses all input RODs when looking for known polymorphic sites not just the one named dbsnp. Added an integration test which uses both dbsnp and an input vcf file and skips over the union of the two.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2564 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:50:39 +00:00
aaron 16777e3875 more fixes for the empty interval list problem; you can now run LocusWindow traversals with an empty interval list, but the GATK will give you a warning (unless you're running in unsafe mode).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2563 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:47:43 +00:00
ebanks 03b7d5f5c7 1. Fixed small but embarrassing bug in weighted Allele Balance annotation calculation.
2. Made RankSumTest abstract; added 2 subclasses: BaseQualityRST and MappingQualityRST (the latter based on a suggestion from Mark Daly).  Untested so they're still experimental.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2561 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:33:53 +00:00
ebanks 040fdfee61 Cleaned up the interface to VCFRecord. It's now possible (and easy) to create records and then write them with a VCFWriter.
I've updated HapMap2VCF to use the new interface; Chris agreed to take care of Sequenom2VCF.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2558 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-11 21:42:12 +00:00
ebanks 42aff1d2c3 Annotator in general should be able to annotate monomorphic or tri-allelic sites.
It's up to the individual annotations to decide whether they want to annotate or not.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2556 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-11 19:52:18 +00:00
chartl dfa3c3b875 Added:
SequenomToVCF - Takes a sequenom ped file and converts it to a VCF file with the proper metrics for QC. It's currently a rough draft,
but is working as expected on a test ped file, which is included as an integration test.

Modified:

VCFGenotypeCall -- added a cloneCall() method that returns a clone of the call

Hapmap2VCF -- removed a VCFGenotypeCall object that gets instantiated and modified but never used
(caused me all kinds of confusion when I was basing SequenomToVCF off of it)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2554 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-11 17:17:21 +00:00
rpoplin 62dd2fa5be Fixing another bug in solid recal regarding negative strand reads. The isInconsistentColorSpace method incorrectly used the inconsistent tag added by parseColorSpace, the inconsistent tag is in the direction of the read like the color space tag, and not in the direction of the reference like everything else. This affects the recalibrated quality scores but the improvment in SNP calling performance is minor when using the default UG settings (min base quality 10).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2553 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-11 14:28:52 +00:00
rpoplin 9cbae53ee1 Bug fixes for both SET_Q_ZERO and REMOVE_REF_BIAS solid recal modes regarding proper handling of negative strand reads. These changes yield a minor improvment in HapMap sensitivity.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2548 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 15:19:22 +00:00
ebanks dfcd5ce25b Fixed broken test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2547 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 06:13:01 +00:00
ebanks d5ab002449 Curiously, it seems I never set the default base quality used by the Genotyper to 10. It's done now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2546 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 06:02:01 +00:00
ebanks b468369dfa -UG's call into VariantAnnotator now uses the full alignment context (as opposed to the filtered one)
-MQ0 annotation is now standard again
-Added AC and AN annotations to VCF output



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2545 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 05:40:42 +00:00
rpoplin f587ff46af Tile is now a standard covariate. By default the TileCovariate returns -1 if tile can't be derived from the read's name. Added a new command line option -throwTileException which will force TileCovariate to throw an exception if tile can't be derived for a read. Singleton covariates, such as any read group without tile info, must be skipped over in TableRecalibration so that the sequential formulation doesn't apply the same correction more than once. TileCovariate class has been added to the Early Access package.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2544 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 22:51:41 +00:00
rpoplin 5f58492401 A rogue QualityUtils.MAX_REASONABLE_Q_SCORE managed to get through my previous bug fix. It should instead check the command line -maxQ argument.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2540 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 21:17:39 +00:00
ebanks c7a8dffa89 Check for division by 0 in annotations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2539 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 19:27:15 +00:00
ebanks b643a513bb Minor interface change for VCFGenotypeRecord.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2537 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 16:48:09 +00:00
depristo 076481f786 Fixes to mergeVCF -- now correctly supports merging of filter fields. Also removed incorrect hasFilteringCodes() function. Updated intergration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2535 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 14:50:13 +00:00
rpoplin cea544871d Fixed an issue with recalibrating original quality scores above Q40. There is a new option -maxQ which sets the maximum quality score possible for when a RecalDatum tries to compute its quality score from the mismatch rate. The same option was added to AnalyzeCovariates to help with plotting q scores above Q40. Added an integration test which makes use of this new -maxQ option.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2534 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 13:50:30 +00:00
ebanks 6c739e30e0 1. Removing an old version of the Genotype interface which is no longer being used. Needed to do this now so that the naming conflicts would cease.
2. Adding a preliminary version of the new Genotype/Allele interface (putting it into refdata/ as the VariantContext really only applies to rods) with updates to VariantContext.  This is by no means complete - further updates coming tomorrow.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2533 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 05:51:10 +00:00
depristo 7215526810 Fix to isReference() in VCFRecord. Change to VariantCounter to correctly counter only non-genotype variants, as well as update to VariantEvalWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2531 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 00:03:29 +00:00
andrewk 6c4ac9e663 Updated HapMap2VCF to use the VCFGenotypeWriterAdapter interface; fixed bug in VCFParameters that affects VariantsToVCF and HapMap2VCF when reference is lower-cased; added integration test for HapMap2VCF that checks for the lower-case issue by testing against Hg18 region that has lower-cased bases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2530 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 21:27:11 +00:00
depristo 8d13597a27 Temporary command-line support to enable rod walkers, if you know what you are doing this is safe.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2505 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 12:15:36 +00:00
rpoplin 0a6bd5a270 CycleCovariate is now one-based so that 0 and -0 don't collide with each other. Solid recal modes now only change the inconsistent base and the previous base (along the direction of the read) instead of both the bases before and after. Removed estimatedNumberOfBins from the Covariate interface because it wasn't being used.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2498 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-04 20:52:15 +00:00
ebanks ed2fff13aa -Misc improvements to VCF code
-Small fix to callset concordance


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2497 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-04 02:28:47 +00:00
ebanks b668d32cf1 Updated the min mapping quality and min base quality defaults to be 10 in both cases (and updated all integration tests) as suggested by Mark.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2494 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-03 21:31:04 +00:00
hanna b6ecc9e151 Support for ad-hoc reference sequences. Also reenabled BWA/Java integration test, which was commented out
and the data backing it up deleted without my knowledge.  Unfortunately, since the data was deleted, I had
to regenerate the data and a new md5.  Hopefully the aligner output is still correct.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2493 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-02 20:19:14 +00:00
asivache ad549eacfd Now that we changed how deletions are represented, got to update MD5...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2491 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-30 22:00:58 +00:00
asivache 9c41ac252f Disable testSingleBPFailure - getReferenceContext() now whould agree to accept length > 1 genome locs as its argument, so there's nothing to test...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2486 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-30 21:12:00 +00:00
asivache 4aeb50c87d Added: integration test for extended pileup (with indels included)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2481 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 23:02:23 +00:00
rpoplin 96c4929b3c Recalibrator now uses NestedHashMap instead of NHashMap. The keys are now nested hash maps instead of Lists of Comparables. These results in a big speed up (thanks Tim!). There is still a little bit of clean up to do, but everything works now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2474 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 21:01:32 +00:00
depristo 7826e144a1 forgot to update md5s
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2473 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 20:31:29 +00:00
depristo 87e863b48d Removed used routines in duputils; duplicatequals to archive; docs for new duplicate traversal code; general code cleanup; bug fixes for combineduplicates; integration tests for combine duplicates walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2468 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 19:46:29 +00:00
ebanks 5fdf17fccb Removed the VCF "NS" annotation (which wasn't working for pooled calls anyways) since it's ambiguous and not useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2465 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 17:30:47 +00:00
aaron a34c2442c0 moved hard-coded file paths to the oneKGLocation, validationDataLocation, and seqLocation variables setup in the BaseTest.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2460 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 07:40:48 +00:00
depristo 9d263b2565 Integration tests for count duplicates walker validated on a TCGA hybrid capture lane.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2459 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 23:57:25 +00:00
depristo fcc80e8632 Completely rewritten duplicate traversal, more free of bugs, with integration tests for count duplicates walker validated on a TCGA hybrid capture lane.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2458 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 23:56:49 +00:00
hanna 4617052b3c For Alec, and others at the Broad who want to run our unit/integration tests off of gsa1/gsa2: put a ceiling on the amount of memory that integration tests can use. Reduce the memory footprint of the fasta reader test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2457 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 23:42:46 +00:00
alecw b5e5e27225 New versions of picard-private, sam and picard jars for TileCovariate and regeneration of NM tag
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2456 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 22:18:55 +00:00
rpoplin 562db45fa5 Sites that were marked NO_DINUC no longer get dinuc-corrected but are still recalibrated using the other available covariates. Solid cycle is now the same as Illumina cycle pending an analysis that looks at the effect of PrimerRoundCovariate. Solid color space methods cleaned up to reduce number of calls to read.getAttribute(). Polished NHashMap sort method in preparation for move to core/utils. Added additional plots in AnalyzeCovariates to look at reported quality as a function of the covariate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2451 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 20:19:37 +00:00
ebanks 12990c5e7a Added qual-by-depth annotation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2445 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-25 02:30:30 +00:00
ebanks 438d21842a The new recalibrator had been mimicking the behavior of the old one in that if there was no dinuc available (following a no-call base or at either end of a read), it didn't try to recalibrate. Now that Ryan has modularized the system, we no longer need to skip the base completely (we just need to skip the dinuc value)... which is good because the Picard people complained after realizing that cycle #1 never got recalibrated.
The major effects of this commit are as follows:
1. We no longer skip any good bases (of course, this change alone breaks every single integration test).
2. The dinuc covariate returns a "no dinuc" value for the first base of a read (but not for the last base anymore, since there is a valid dinuc) or if the previous base is a bad base (e.g. 'N').

I've done a bunch of testing on real data and everything looks right; however, let's wait until the recalibrator guru gets back from vacation next week and can double-check everything before shipping this out in another early access release.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2443 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 20:41:29 +00:00
ebanks 6df40876a3 Un-reverted Matt's previous changes and fixed integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2441 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 02:47:00 +00:00
hanna 2bd0b1bbf7 After further review, it's unclear that my patch in RecalDataManager was the right choice. Reverting.
Also updating other IntervalCleanerIntegrationTest failures that were masked by my first patch.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2440 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 00:32:33 +00:00
hanna 98c268483e Fixed issues with the integration tests:
1) sam-jdk apparently no longer supports custom tags with type int[] values.
2) BAM output for indel cleaner integration test changed in a way that's so subtle it can't be seen after converting the output to .sam.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2439 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 23:12:22 +00:00
aaron b134e0052f added changes to the code to allow different types of interval merging,
1: all overlapping and abutting intervals merged (ALL), 
2: just overlapping, not abutting intervals (OVERLAPPING_ONLY), 
3: no merging (NONE).  This option is not currently allowed, it will throw an exception.  Once we're more certain that unmerged lists are going to work in all cases in the GATK, we'll enable that.  

The command line option is --interval_merging or -im


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2437 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 21:59:14 +00:00
ebanks 770093a40e Oops - forgot to check this one in.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2433 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 19:53:28 +00:00
ebanks dc96879861 2 separate changes which both affect lots of UG integration md5s, so I'm committing them together:
1. allele balance annotation is now weighted by genotype quality (so we don't get misled by borderline het calls)

2. Updates to the Unified Genotyper for parallelization:
   a. verbose writing now works again; arg was moved from UAC to UG
   b. UG checks for command that don't work with parallelization
   c. some cleanup



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2432 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 19:03:56 +00:00
ebanks 872a9d1c7b I'm making this change now (as opposed to waiting until Monday) to honor Tim's request.
The cycle covariate is now first/second of pair aware.  I'm taking it on faith from both Chris Hartl (waiting on slides from him) and Tim that this is the right thing to do.  We'll have Ryan confirm it all next week.
The only change is that if a read is the second of a pair, we multiple the cycle by -1 (a simple way of separating its index from that of its mate).
Of course, this broke all integration tests.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2431 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 16:26:43 +00:00
ebanks cf303810d3 VCF reader now creates the correct type of header line for each header type
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2423 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 20:39:06 +00:00
ebanks 87e5a41964 Fixed a bug that accounted for a bunch of my remaining mis-cleaned indels.
Also, slightly optimized the cleaner by using readBases (instead of readString) and caching cigar element lengths.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2419 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 05:46:16 +00:00
hanna 9e53c06328 First revision of command-line argument support for GenotypeWriter. Also, fixed the damn build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2416 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-20 19:19:23 +00:00
aaron 7e0f69dab5 Changed the GLF record to store it's contig name and position in each record instead of in the Reader. Integration tests all stay the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2410 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 22:54:56 +00:00
hanna 80b3eb85fa Fixed curiously epic failure in read-backed pileup: size() mismatched the numReads-numDeletions at that locus in the case where includeReadsWithDeletionsAtLoci == false, causing failures including bad output from pileup walker. Also fixed up ValidatingPileup to run with the new ReadBackedPileup instead of just compiling successfully.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2409 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 22:52:44 +00:00
rpoplin fdf542c214 The CycleCovariate for 454 data is now the TACG flow cycle. That is, each flow grabs all the T's, A's, C's, and G's in order in a single cycle. This is changed from incrementing the cycle whenever there is a discontinuous nucleotide along the direction of the read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2408 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 22:39:51 +00:00
ebanks 4ea31fd949 Pushed header initialization out of the GenotypeWriter constructors and into a writeHeader method, in preparation for parallelization.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2406 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 19:16:41 +00:00
ebanks 1cde4161b7 Fixed another test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2399 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 05:05:03 +00:00
ebanks 94f5edb68a 1. Fixed VCFGenotypeRecord bug (it needs to emit fields in the order specified by the GenotypeFormatString)
2. isNoCall() added to Genotype interface so that we can distinguish between ref and no calls (all we had before was isVariant())
3. Added Hardy-Weinberg annotation; still experimental - not working yet so don't use it.
4. Move 'output type' argument out of the UnifiedArgumentCollection and into the UnifiedGenotyper, in preparation for parallelization.
5. Improved some of the UG integration tests.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2398 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 04:14:14 +00:00
rpoplin 6fbf77be95 Updating the two solid_recal_mode options to also change the previous base since solid aligner prefers single color mismatch alignments over true SNP alignments. COUNT_AS_MISMATCH mode has been removed completely. The default mode is now SET_Q_ZERO.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2394 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 20:07:26 +00:00
hanna 07f1859290 Added integration test for running the recalibrator with no index.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2393 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 19:10:53 +00:00
ebanks c75ec67f84 When called as a standalone, VariantAnnotator now emits samples in sorted (as opposed to random) order in VCFs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2392 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 19:01:08 +00:00
hanna b863fffdf6 Fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2390 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 17:55:00 +00:00
asivache e6cc7dab26 fixing md5 sum; new version of IndelIntervalWalker does the right thing...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2388 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 01:04:13 +00:00
ebanks b626fc0684 Joint Estimate is now the default calculation model.
Reworked all of the integration tests so that they're now more comprehensive, cover more of what we wan to test, and don't take forever to run.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2376 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 19:41:02 +00:00
ebanks bb312814a2 UG is now officially in the business of making good SNP calls (as opposed to being hyper-aggressive in its calls and expecting the end-user to filter).
Bad/suspicious bases/reads (high mismatch rate, low MQ, low BQ, bad mates) are now filtered out by default (and not used for the annotations either), although this can all be turned off.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2373 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 17:28:09 +00:00
ebanks 874552ff75 Pull the genotype (and genotype quality) calculation out of the VCF code and into the Genotyper.
[Also, enable Mark's new UG arguments]



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2355 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 04:29:28 +00:00
chartl 1389ac6bdf Hurrr -- this uses power as part of its output. Changes to the power calculation broke the md5s RIGHT AFTER I HAD FIXED THEM arghflrg.
Will fix again.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2351 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-14 22:42:50 +00:00
chartl b42fc905e8 Added - new tests (Hapmap was re-added)
Modified - Hapmap now takes a -q command to filter out variants by quality
Modified - MathUtils - cumBinomialProbLog now uses BigDecimal to handle some numerical imprecisions
Modified - PowerBelowFrequency - returns 0.0 if called with a negative number (can't be done from inside the walker itself, but since it's called elsewhere one can't be too careful)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2350 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-14 21:57:20 +00:00
rpoplin 8e44bfd2ef CycleCovariate and PrimerRoundCovariate now correctly handle negative strand 454 and SOLID reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2349 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-14 21:52:30 +00:00
ebanks 97618663ef Refactored and generalized the VCF header info code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2346 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-13 21:02:45 +00:00
ebanks bd2a46ab4c I want to move over to hpprojects tonight, so I'm checking in various changes all in one go:
1. Initial code for annotating calls with the base mismatch rate within a reference window (still needs analysis).
2. Move error checking code from rodVCF to VCFRecord.
3. More improvements to SNP Genotype callset concordance.
4. Fixed some comments in Variation/Genotype



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2341 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-13 02:52:18 +00:00
aaron 09811b9f34 Now that we always output the VCF header, make sure that we correctly handle the situation where there are no records in the file. Added unit tests as well.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2333 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 19:51:05 +00:00
ebanks 2ea7632b76 The SNP genotype concordance module is now more comprehensive.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2330 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 18:34:33 +00:00
ebanks 2de7e1a178 Move VariantAnnotator over to use a StratifiedAlignmentContext split by sample.
The only major difference is that we are now able to get accurate allele balance ratios.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2321 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 05:28:28 +00:00
ebanks e6f541fdca Forgot to update integration test last night
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2308 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 12:57:10 +00:00