3443 Commits (7f1e44b764ccaeaf5eeaab5e885c9aa0a07b3ad9)

Author SHA1 Message Date
rpoplin 1931b2e1bd Three fixes for VariantFiltrationWalker: Trying to filter an empty VCF file will produce a well-formed VCF file with zero records instead of a blank file, needed for pipelines. The first record's genotype info fields are now in the same order as all the others. The VCF header lines are pulled from just the input variant rod instead of from all rods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4341 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 13:52:56 +00:00
kshakir 4ed9f437e9 Sliced the GAE in half like a gordian knot to avoid the constant merge conflicts.
The GAE half has all the walker specific code.  The new "Abstract" GAE has the rest of the logic.
More refactoring to come, with the end goal of having a tool that other java analysis programs (Queue, etc.) can use to read in genomic data.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4339 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-23 23:28:55 +00:00
rpoplin 0c9fabb06f Fix in AnalyzeAnnotations, somebody changed it to look for ID in the vc's info field. This dinosaur desperately needs integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4338 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-23 19:48:44 +00:00
hanna 0c781968fb Tried to do a bit of pre-commit refactoring and screwed it up. Fixed.
Thanks to Ryan for identifying the problem.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4336 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-23 18:17:29 +00:00
depristo d081b9b352 Improvements to error messages about @Requires and @Allows
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4334 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-23 12:08:27 +00:00
hanna 7841b301c4 Added more diagnostics so that I have some idea of what a 'general' exception
is.  Required to fix bug ZjhCJAdwhtFq1x54ZlmlN8pFNcbrRpdJ and similar.  We
might want to change this particular case to a ReviewedStingException after
we gain a bit more experience with it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4333 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 21:32:01 +00:00
fromer 44ccfc3531 Updated Phasing algorithm + evaluation module to properly implement haplotypes [including homozygous genotypes]; Implemented dynamic window phasing model for LARGE increase in efficiency
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4332 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 21:29:58 +00:00
hanna 8f75d88519 Fix for GATK run report ids:
  mOVsxGfDiiSMxVs2PPTVjzYTVbizlD6e
  f9kUHUADFsZ0LiTGxRL5zPmq9kZcA4cQ
  8eGHWJFAlBVmgxwPi3sMd1RmiN2PwHOf
  iLhvHWveypKb2F8vKS5irHylc3pYvlOb
  HDttXKUMEVoPrvVeWrH7E0htxYyNydMx
plus a bit of cleanup of custom exceptions in the sharding system.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4330 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 19:49:25 +00:00
kshakir 20b38b38f3 Updated from SnakeYAML 1.6 to 1.7.
Added a pipeline java bean and YAML utility to serialize java beans.
Added a getFirehosePipelineYaml.sh that can pull firehose data into the pipeline yaml file format.
Updated the fullCallingPipeline.q to begin using the pipeline yaml file format for bams and reference.
More changes to come as this code gets tested out in the fullCallingPipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4329 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 19:47:49 +00:00
hanna 0c99c97685 The engine now automatically adds the command-line arguments to the header of every VCF, unless -NO_HEADER is specified.
Changed integration tests, adding the -NO_HEADER argument, for walkers that previously did not include the command-line
arg headers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4326 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 15:27:58 +00:00
depristo 522830fb01 Support for --assume-single-sample in UG, better malformed bam exceptions, and ignoring out of order contigs in seqdictutils. All for the CG bam file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4323 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 20:33:34 +00:00
aaron b968af5db5 The tribble indexes are now updated with correct sequence lengths for each contig they have in their sequence dictionary. Also clean-up in the RMD track builder.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4321 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 18:21:22 +00:00
rpoplin 547763b230 Better error message for Petr's null pointer exception. Also added an exception integration test because I'm certain this used to work.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4319 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 13:44:40 +00:00
depristo 8719dde59d Now prints out PASS when a variant is unfiltered
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4318 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 13:16:41 +00:00
delangel 205fc0b636 Cleanup: Use Tribble's version of createVariantContextWithPaddedAlleles (no real functional difference) to avoid duplicated code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4315 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-20 19:53:30 +00:00
delangel a10cfe213b Small bug fix in simple indel genotyper: Likelihood of case where best haplotype pair was (REF,REF) was not computed correctly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4314 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-20 17:04:39 +00:00
ebanks f5a30d0248 I just spoke to Andrey & Kiran (the original authors of these tools), and they voted to kill these in favor of Picard
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4313 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-20 13:27:35 +00:00
delangel f64b6fddc1 Major changes/improvements to indel genotyper:
a) Redid way to compute path metrics in indel error model. Paper formulation where we have an anchor point in the alignment between read and haplotype won't work in practice except in nice data sets that are perfectly indel-realigned and that are well mapped by aligner. New formulation doesn't assume this, and it's actually simpler and uses less code. It now resembles more a classic SW dynamic programming formulation but it still preserves the HMM probabilistic formulation.
b) Added a programmable call threshold, set by command line.
c) Now use the sample name from the BAM file; remove the -sampleName argument.
d) Simplify loop to compute read-haplotype likelihoods.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4311 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-19 23:47:31 +00:00
rpoplin c6351a11d6 Clearer logger output when not using by-hapmap
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4308 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-18 16:10:42 +00:00
rpoplin 7e58d8ed61 CombineVariants now outputs the command line in the VCF header. Added a new hidden argument to VR walkers called --NoByHapMapValidationStatus to turn off the by-hapmap dbsnp rod behavior. Very useful for experimenting with which sets to use as training data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4307 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-18 16:06:50 +00:00
kshakir a3f31e5df0 When QScript writers use the RodBind, then the File version of the same argument should be optional, i.e. should not always try to output the file, which when unpopulated will be null.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4305 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-17 18:22:07 +00:00
bthomas c6c6d32b46 Quickly adding a new convenience method for retrieving a group of samples. The method is getSamples(Collection<String>) and returns a set of sample objects. There's also a test there.
Ryan is using this to modify VCF code today...



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4303 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-17 15:55:17 +00:00
kshakir a898908918 The output BAM file optional arguments of compression and whether to write an index are not outputs themselves.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4302 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-17 15:35:54 +00:00
bthomas bc12055fcf Quick patch to fix the sample code. It wasn't actually initializing the sample data source, so I added a call to initializeSampleDataSource() in GenomeAnalysisEngine. I think there was just an error resolving the versions of GenomeAnalysisEngine
Also added a new error message that I thought would be helpful...



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4301 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-17 14:05:26 +00:00
ebanks a10b2a00a5 Moving the util VariantContext 'modifying' routines into VC itself (as opposed to VCUtils) so that we can pass the genotype data directly into it and are no longer forced to decode the genotypes for no reason. This means that any walker that takes in a VCF and modifies the records without touching the genotypes never has to decode them. I've hooked this into the other two Variant Recalibrator walkers for Ryan. One side effect, though, is that we no longer can sort the sample names in the VCF (i.e. if the input VCF doesn't have samples in alphabetical order, then we used to sort them when writing a new VCF but no longer do that), because if we don't decode then we can't re-order the genotypes. I don't think this is a big concern given that the Unified Genotyper does emit sorted samples and that's the main source for most of the VCFs we use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4300 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-17 07:09:58 +00:00
bthomas f66ef4626e Fixing two minor issues: 1) adding a new error message if the user adds a fasta file in a directory that doesn't exist; 2) renaming my sample unit tests so they actually run.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4299 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-16 20:45:51 +00:00
rpoplin 2eb5d9b2d2 CountCovariates makes sure that it sees a rod type that it expects for use as a variant mask (accepted types are dbsnp, vcf, and bed)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4296 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-16 18:53:42 +00:00
aaron 782e0018e4 removal of most of the old GATK ROD system; also a fix for -Dsingle so we can again run just a single unit or integration test (single tests in tribble can be run with the -DsingleTest option now). More to come.
*** Three integration tests had to change: ***

RecalibrationWalkersIntegrationTest:
One of the tests was using the interval as the snp track, and wasn't supplying a DbSNP track (for CountCovariates)

SequenomValidationConverterIntegrationTest:
relies on Plink ROD which we've removed.  

PileupWalkerIntegrationTest: 
we no longer have implicit interval tracks, so there isn't a rod name over the specified region.  Otherwise the same result.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4292 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 22:54:49 +00:00
delangel c604ed9440 Several improvements to new indel genotyper (more to come soon):
a) Turns out previous change of centering haplotype around indel was a bad idea. Context to the left of indel is important but not as important as right one, because by definition all alleles start at the same location, so haplotype is the same to the left of indel regardless of allele. So, go back to having a constant size window to the left of event.
b) Expand reference context so we can test larger haplotypes.
c) Optimize computation of read likelihoods by doing them in linear array instead of in a matrix - no difference in biallelic sites but could be significantly faster in multiallelic sites.
d) Bug fix: read alignment wasn't being computed correctly if, a) we were at an insertion, b) read started right at the insertion, c) read CIGAR didn't include insertion - more of these corner conditions are lurking, so a revamped computation of how reads align to candidate haplotypes is in the works.
e) Add debug option not to use prior haplotype likelihoods.
f) Don't hard-code NA12878 for genotyping, now sample name is a required input argument.
g) Bug fix: if there are no reads covering a candidate indel event, just output NO_CALL (didn't notice this in HiSeq, but in P1 data it happens all the time). I need to add a confidence threshold for calling later on.






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4291 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 21:53:08 +00:00
depristo fb6d7d19f9 Better window size error message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4290 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 20:40:56 +00:00
rpoplin b5d2e299d2 Make it more clear what is going on with the by-hapmap validation status in the dbSNP rod
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4289 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 17:29:31 +00:00
rpoplin 0a06fbdb94 Adding header lines to output of VR walkers to settle validator warnings. Command lines are added to the VCF header. GATK version numbers will be added to the header lines by Matt.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4288 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 16:45:03 +00:00
asivache d7b5baf8e5 Now uses tagging of -I arguments. Multiple -I options (merging) are now allowed. In somatic mode 'tumor' and 'normal' tags are required for each input bam, the order does not matter anymore (since we use tags!)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4286 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 13:58:51 +00:00
bthomas e5f81d25d4 Adding the --sample-metadata (-SM) command line argument and associated functionality. This is something Matt and I have been working on for a while. Basically, it allows you to integrate sample metadata into an analysis, by including a sample file. More detailed documentation is on the wiki: http://www.broadinstitute.org/gsa/wiki/index.php/Adding_Sample_data_to_an_analysis
This commit adds two important classes: Sample, which contains data about one sample; and SampleDataSource, which manages sample data a la ReferenceDataSource and ReadsDataSource. 

This code should be stable, but it has not been integrated with existing walkers yet. That's the next commit. 

In the meantime, feel free to experiment with the code - there are two basic example walkers in the playground.sample package. And PLEASE let me know if you see any errors/inconsistencies.

Note that this also adds a new dependency on SnakeYaml, a YAML parser.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4285 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 11:50:22 +00:00
ebanks dd23f204ab Making the UG args that allow users to proceed with insufficient bam headers (no SM or PL tags) @Hidden; removed them from wiki.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4283 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 01:54:50 +00:00
ebanks 514b28210e Have VF write to stdout when no -o is supplied
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4282 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 01:48:33 +00:00
ebanks 4e83ba411f We now do lazy loading for the genotype data in VCF. Practically, almost all walkers end up loading the genotype data because we need to be smarter about transferring the unparsed genotype string when modifying VariantContexts; however, this does solve the problem for VR's piece to generate clusters (shaved off 75% of runtime for Ryan's large case). That further optimization will happen later.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4279 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 00:18:17 +00:00
depristo 74d4f124b1 Bug fixes to allow us to generate GATKRunReports for very early errors that leave the engine in a corrupt state. Vastly better error handling of common command line problems. Analysis output now notes whether an exception is a UserException or a StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4278 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-14 22:45:15 +00:00
delangel 6d07181dc9 When processing Beagle output and creating new vcf, output the filtered records in the original input vcf as is, so that we don't lose the information on them when we run Beagle.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4276 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-14 19:18:45 +00:00
hanna 7fa6b2135b Added a back door so that integration tests can reset the sequence dictionary
in the reference.  Reset routine is not accessible to any class outside
GenomeLocParser's package.

We'll have to do something more intelligent with this when the GATK goes
distributed.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4275 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-14 18:58:08 +00:00
depristo dbb641280e CycleCovariate now tolerates SOLEXA as machine type. Also, exception handling is now written to stderr.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4274 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-14 12:35:57 +00:00
ebanks 71d2d69b41 Better error message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4273 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-14 05:04:26 +00:00
fromer 248cc308b2 ReadBackedPhasing silently ignores sites with ploidy != 2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4272 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-13 21:14:17 +00:00
fromer 528f6344af Moved ReadBackedPhasingWalker to phasing sub-directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4271 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-13 19:36:41 +00:00
depristo fa3be2209f Improvements to the error display code to print out the SVN number in all messages. Fixes to CallableLoci and tests to check for that case
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4270 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-13 18:36:45 +00:00
depristo 7880863eb7 Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 15:07:38 +00:00
depristo 7ad8fbdd5a Moved GATKException to exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:47:19 +00:00
depristo bccebf8899 Newly placed StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4264 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:38:46 +00:00
depristo 3964e02fb6 Newly placed StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4263 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:38:32 +00:00
depristo 595907e98e Moving StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4262 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:34:15 +00:00
depristo 40e6179911 Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:02:43 +00:00
delangel da2e879bbc Miscellaneous improvements to indel genotyper:
- Add a simple calculation model for Pr(R|H) that doesn't rely on Dindel's HMM model. MUCH faster, at a cost of slightly worse performance since we're more sensitive to bad reads coming from sequencing artifacts (add -simple to command line to activate).
- Add debug option to calculation model so that we can optionally output useful info on current read being evaluated. (add -debugout to commandline).
- Small performance improvement: instead of evaluating haplotype to the right of indel (just with a 5 base addition to the left), it seems better to center the indel and to add context evenly around event.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4257 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 13:50:28 +00:00
ebanks 61d511f601 Small memory performance improvement: remove the mapping from the hash instead of setting the value to null (i.e. remove the key too)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4256 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 05:19:09 +00:00
ebanks a0231f073f Damnit. Enabling the Picard code to recalculate all of the relevant SAMRecord attribute tags means that I need to have reference bases over all read bases even after realignment (and there are some big indels in dbsnp). Fortunately, I have my trusty IndexedFastaSequenceFile reader handy! Re-enabling the previously broken performance test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4255 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 05:06:37 +00:00
hanna 87aca64716 Jumped the gun a bit on bam on-the-fly indexing -- Tim says it's not ready yet.
Turned it off by default and added a property to turn it back on.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4254 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 21:16:03 +00:00
rpoplin 7b113a4886 Truncate the floating point numbers coming out of the variant recalibration walkers. Integration tests now work with both 1.6.0_16-b01 and 1.6.0_21-b06
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4253 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 18:37:49 +00:00
depristo 8f1a32acae All exceptions thrown by the GATK have been reviewed and UserErrors replaced where appropriate. Shazam. Another check-in will remove the GATKException and restore the StingException.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4252 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 15:25:30 +00:00
rpoplin 61e848c4f0 It's clear from Sendu's calling and my own calling that -qScale 100.0 is a much better default value for low pass data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4248 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 01:47:21 +00:00
depristo 1de713f354 Massive review of maybe 50% of the exceptions in the GATK. GATKException is a tmp. tracker so that I can tell which StingExceptions I've reviewed. Please don't use it. If you are working on new code and are considering throwing exceptions, it's either UserError or StingException, please
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4246 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 23:21:17 +00:00
aaron f5c295b6b2 add a little bit of documentation to the RMD track builder and wrap any exceptions thrown in tribble with the file source and line that caused the error.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4243 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 17:56:36 +00:00
rpoplin aeb897db7f VR walkers look at by-hapmap validation status by default. Eric will be updating the syntax to allow for more flexibility here.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4242 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 15:40:56 +00:00
depristo 6a30617a60 Initial implementation of UserError exceptions and error message overhaul. UserErrors and their subclasses UserError.MalFormedBam for example should be used when the GATK detects errors on the part of the user. The output for errors is now much clearer and hopefully will reduce GS posts. Please start using UserError and its subclasses in your code. I've replaced some, but not all, of the StingExceptions in the GATK with UserError where appropriate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4239 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 11:32:20 +00:00
depristo ca9c7389ee Not useful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4238 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 02:33:03 +00:00
depristo 8708753a6a checkin for removal
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4237 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 02:32:46 +00:00
hanna 5119bdb55e - Update DoC to support output to /dev/null.
- Add a release sanity check for DoC.
- Update release sanity checks with new command-line argument system.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4236 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 23:43:18 +00:00
fromer 1b1ec7e52d Changed default phasing window size to 10
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4235 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 21:28:36 +00:00
fromer ce031b2f05 PhasingEvaluator prints out interesting sites (only 1 phased, or phases disagree)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4233 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 18:21:21 +00:00
ebanks 40283f6456 Success! TranscriptToGenomicInfo now works without the delicate hacks that Ben had put in.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4232 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 18:06:00 +00:00
ebanks cd091d7309 This walker can NOT be tree-reducible (in its current state). Given that it's meant to be run just once for any given transcript set, this is not at all a problem.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4231 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 16:47:51 +00:00
ebanks ae9cba1c73 After an epic battle with this code until 3am last night, I have discovered that it is tragically and fatally busted. Ben clearly didn't understand how the ROD system works when writing it and so it is unusable in its current state. I've ripped out all code and it now gracefully exits telling the user that we are actively working on a replacement for this tool. Sigh.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4230 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 16:39:41 +00:00
ebanks 29f7b1e6d6 Trivial update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4229 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 14:02:38 +00:00
ebanks cd2bfb09ef Change for Tim: invalidate the MD tag (temporarily) if it exists in a read that gets realigned
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4228 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 13:59:09 +00:00
ebanks 65edbced36 Addition for Tim: recalculate the NM and UQ tags after realignment. Also, don't fix the insert size calculation, since that's done by fix mate information.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4227 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 04:02:14 +00:00
chartl 71046e650e Added a more robust check for Jishu -- am pretty sure the .bam header is busticated
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4223 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 01:11:22 +00:00
fromer ae3f7026a4 Corrected phasing quality evaluation to correctly account for hom sites that break phase
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4222 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-07 22:43:54 +00:00
hanna 501f6a0e14 Temporary hack to disable index creation when target BAM is /dev/null. Tim
promises me that Picard will put in a real solution next week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4220 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-07 16:57:51 +00:00
fromer 754c2c761e Added minimum phasing quality for phasing evaluation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4219 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-07 14:29:11 +00:00
ebanks 5d0d9c7dce My parallel version of TranscriptToInfo now emits 'chr start end' instead of 'chr:start-end' for records so that 1) they can be easily sorted in coordinate order (allowing me to emit records out of order if I choose) and 2) the file can be tabix indexed (when we stop finding 'critical' bugs in that code).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4218 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-07 05:20:40 +00:00
ebanks 4d4ef5b42c In the end, it's not worth rewriting TranscriptToInfo from scratch. I'm keeping the old one around for a bit so I can play with this new version which 1. doesn't store the records in memory so can be run in under 1Gb of memory, 2. actually emits all of the records (the original fails in some cases), and 3. is refactored to cut out ~20% of the code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4215 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-06 02:37:34 +00:00
kiran 0dd5a0990d Now annotates sites marked as filtered out (this is important if sites are in a lower-quality tranche).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4214 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-04 00:36:55 +00:00
delangel ef7454a241 Minor improvements to indel genotyper:
a) Ability to specify haplotype size from command line
b) Expand reference context  window so we can form haplotypes for longer indel events.
c) small bug fix in temp output writer (to be removed once I can emit vcfs)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4212 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 22:52:08 +00:00
depristo 7eeabe534a QSample walker for 1KG -- measures aggregate quality of sequencing. Includes misc. improvements throughout the code, including using the new Tribble GenotypeLikelihoods class for working with VCF GLs from the UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4211 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 18:21:43 +00:00
rpoplin e3962c0d13 VR integration tests are longer but much more useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4210 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 15:50:19 +00:00
hanna da11efa1a2 Automatically write BAM file indices for coordinate-sorted BAMs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4209 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 14:10:44 +00:00
fromer 529eecd4dc Added phasing sub-directory to keep walkers directory clean
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4208 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 13:38:46 +00:00
fromer c0ce9ca8cc Added phasing sub-directory to keep walkers directory clean
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4207 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 13:32:30 +00:00
rpoplin 60003aeaca Bug fix in VariantRecalibrator. Only add sample names from the input rod bindings, not from all rod bindings.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4206 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 13:31:49 +00:00
fromer c119f64514 Added phasing sub-directory to keep walkers directory clean
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4205 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 13:24:18 +00:00
depristo 3c9597d45a OnTraversalDone writes output to out now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4203 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 12:55:03 +00:00
depristo 73d41bfa24 CountLoci now writes out to a file for Queue status tracking. VariantAnnotatorEngine has a special group None that doesn't add any annotations; useful for those who are testing UG performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4202 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 12:52:33 +00:00
hanna 70bb480939 The battle is over. Picard is revved.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4200 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 05:28:01 +00:00
ebanks fdaac4aa78 As the VCF guru, I'll take this one for Andrey. Someone has actually found a deletion at the beginning of the chromosome. Instead of failing with an ArrayIndexOutOfBoundsException, just don't try to print out the record. Our VCF writer doesn't really support this case (yet).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4199 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 03:27:43 +00:00
ebanks c45ffcdaed Changing documentation (temporarily) to warn people that -U is not supported.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4198 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 03:18:07 +00:00
delangel 8a7f5aba4b First more or less sort of functional framework for statistical Indel error caller. Current implementation computes Pr(read|haplotype) based on Dindel's error model. A simple walker that takes an existing vcf, generates haplotypes around calls and computes genotype likelihoods is used to test this as first example. No attempt yet to use prior information on indel AF, nor to use multi-sample caller abilities.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4197 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 00:25:34 +00:00
fromer a1cf3398a5 Added basic version of phasing evaluation: GenotypePhasingEvaluator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4196 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-02 22:09:50 +00:00
kshakir fd5970fdd4 At chartl's superb suggestion, command line files are now all Files instead of old method of sometimes "has a File". Should be easier when reassigning them.
No longer generating deprecated GATK arguments on the Queue extensions.
Emitting deprecation warnings to Queue compile to help debugging issues.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4195 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-02 21:30:48 +00:00
rpoplin 0bb05fb472 Bug fix in VariantRecalibrator. Only add sample names from the input rod bindings, not from all rod bindings.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4194 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-02 21:12:09 +00:00
chartl 3a4844ebde Additional partition types into DepthOfCoverage:
- Sequencing Center
- Platform
- Sample by Center
- Sample by Platform
- Sample by Platform by Center <---- needed for analysis I'm doing

The fact that the latter three needed their own partition types, rather than being dictatable from the command line, combined with the new hierarchical traversal types and new output formatting engine, suggests that DepthOfCoverageV3 is about ready to be retired in favor of a newer, sleeker version.

For now, this will do.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4193 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-02 19:30:03 +00:00
chartl 590bb50d16 Test for missing read group
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4192 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-02 14:22:13 +00:00
kiran acd6bd2430 Experimental tool to annotate indels that are provided in a VCF file based on RefGene. Specifies gene, transcript, strand, type (Non-frameshift, frameshift, 5'-UTR, 3'-UTR, SpliceSiteDisruption, Intron, or Unknown).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4191 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-01 23:30:28 +00:00