833 Commits (9f76aed5157a3921180ed72381fa53bf07ad201c)

Author SHA1 Message Date
fromer 55230ce5f3 Added startsBefore, startsAfter, and minDistance [calculates distance between any pair of bases in the two GenomeLocs]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4531 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 19:12:34 +00:00
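
The three helpers named in commit 55230ce5f3 can be illustrated with a small Python sketch. This is not the GATK Java code: the Interval class, its field names, the closed-interval convention, and the choice to raise on mismatched contigs are all assumptions made for illustration. The only thing taken from the commit message is that minDistance is the smallest distance between any pair of bases in the two locations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for a GenomeLoc: a closed interval on one contig."""
    contig: str
    start: int  # inclusive
    stop: int   # inclusive

    def _check_same_contig(self, other):
        # Assumption: comparing across contigs is treated as an error here.
        if self.contig != other.contig:
            raise ValueError("intervals are on different contigs")

    def starts_before(self, other):
        self._check_same_contig(other)
        return self.start < other.start

    def starts_after(self, other):
        self._check_same_contig(other)
        return self.start > other.start

    def min_distance(self, other):
        """Smallest distance between any base in self and any base in other."""
        self._check_same_contig(other)
        if self.stop < other.start:      # self lies entirely to the left
            return other.start - self.stop
        if other.stop < self.start:      # self lies entirely to the right
            return self.start - other.stop
        return 0                         # the intervals share at least one base
```

Overlapping intervals give a distance of 0 because some base is common to both; otherwise the distance is the gap between the nearest ends.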
ebanks a205900eff Naughty use of Strings in HaplotypeScore literally doubled the runtime of Unified Genotyper. Moved over to bytes and no longer allow Strings in the Haplotype util class. New round of profiling on tap for tomorrow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4528 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 03:32:21 +00:00
depristo f7ce18553e GenotypeConcordance now prints interesting sites more nicely. RMDTrackBuilder now uses the root class FeatureSource, not BasicFeatureSource.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4525 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 00:29:02 +00:00
asivache 42c3d74432 bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4503 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 16:27:40 +00:00
chartl c9d473edee More changes to Variant Eval and Genotype Concordance (passes all integration tests):
1: -sample can now include a file, which will be parsed for sample-name entries
2: If you request a sample to run analysis on, but it is not present in any of your RODs, VEW will exception out
3: Change added to parse Integer, String, and List<Integer> type Allele Count annotations (error otherwise)
4 [slightly problematic]: The count objects now maintain row-keys in order, as the keys were taking an inordinate amount of time in onTraversalDone (multiple calls to getRowKeys(), so many multiple sorts of the same underlying unsorted object, very bad)

There is a legacy comparison object which is unused which I will strip out soon.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4502 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 12:40:36 +00:00
ebanks 2606e67cf1 Reverting Matt's change from yesterday which I accidentally blew away when trying to cope with the stupid svn update issues we've been plagued with recently.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4495 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 14:40:42 +00:00
ebanks cfb33d8e12 Filtering optimizations are now live for UGv2. Instead of re-computing filtered bases at every locus, they are computed just once per read and stored in the read itself. Eyeballing the results on the ~600 sample set from 1kg, we cut out ~40% of the runtime! QUALs are now sometimes different from UGv1 because I noticed a bug in v1 where samples with spanning deletions only were assigned ref calls instead of no-calls which ever so slightly affects the QUAL. Not a big deal though.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4494 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 05:04:28 +00:00
hanna 83b8676b69 Hack to fix mysterious disappearing read attributes. Ultimately caused
by the fact that the GATKSAMRecord, by design, needs to both inherit from 
SAMRecord and wrap a 'member' SAMRecord, and method calls that aren't
implemented as explicit passthroughs can compromise the content of the
SAMRecord in subtle ways.

Will be automatically fixed when Picard moves to a lightweight SAMRecord
interface rather than the current heavyweight implementation.  But in 
the short-term, there's no obvious fix.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4489 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 19:06:54 +00:00
ebanks 530875817f Experimental code for better filtering of bases in sam records. Not hooked up yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4475 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-11 02:19:51 +00:00
ebanks a0de269c4b Better message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4474 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-10 20:11:51 +00:00
asivache 05500d1a8d An iterator wrapper/adapter: takes GenomeLoc iterators 1 and 2 and traverses intersections of intervals from 1 with intervals from 2. Both 1 and 2 must be SORTED and NON_OVERLAPPING, but this iterator does NOT perform any checks, so if these conditions are not met, the behavior is unspecified.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4468 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 16:34:00 +00:00
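
The intersecting traversal described in commit 05500d1a8d can be sketched as a two-pointer merge. The following is a Python illustration, not the committed Java class: the function name intersect_sorted, the (start, stop) tuple representation with closed ends, and the single-contig simplification are assumptions. As in the commit message, both inputs must be sorted and non-overlapping, and nothing is checked.

```python
def intersect_sorted(intervals_a, intervals_b):
    """Yield intersections of two sorted, non-overlapping interval streams.

    Each interval is a (start, stop) tuple with both ends inclusive.
    The inputs are NOT validated; unsorted or overlapping input gives
    unspecified results.
    """
    it_a, it_b = iter(intervals_a), iter(intervals_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        start = max(a[0], b[0])
        stop = min(a[1], b[1])
        if start <= stop:
            yield (start, stop)
        # Advance the stream whose current interval ends first; the other
        # interval may still intersect what comes next.
        if a[1] < b[1]:
            a = next(it_a, None)
        else:
            b = next(it_b, None)
```

Because each step advances one of the two streams, the traversal is linear in the total number of intervals and never buffers more than one interval per stream.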
asivache 253d528e49 not ready for commit yet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4467 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:30:55 +00:00
asivache 4f2f33b42a fix method invocation to conform to new API; this version of the code will compile but new functionality is still not fully in
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4466 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:30:26 +00:00
asivache cece19d4d2 not ready for commit yet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4465 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:14:54 +00:00
asivache 39e373af6e deleting accidentally committed junk
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4464 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:13:01 +00:00
asivache 77dddd0afa renaming MergingIterator to RODMergingIterator as it is more appropriate for this specialized implementation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4461 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 14:08:28 +00:00
hanna 8d25a5f9f2 A mechanism for supplying attribution text -- mainly useful for external
walkers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4402 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 18:31:19 +00:00
hanna bf7fd08810 Fix newly-introduced bug in the PluginManager/DynamicClassResolutionException
where, when the system can't find a plugin of the correct name, the system
prefers to crap all over itself and throw an unintelligible NullPointerException
rather than displaying an intelligent error.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4393 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 19:07:05 +00:00
fromer 7c909bef82 Moved phasing classes out of playground! The code is still under production, though...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4369 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 23:21:28 +00:00
chartl 5a5c72c80d Accidentally committed some debug output to PackageUtils, reverting change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4367 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 21:58:42 +00:00
chartl 862c94c8ce Small change for Matt -- output partition types in lexicographic order.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4365 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 20:08:03 +00:00
bthomas 96cccafb0d Adding a few helper methods for accessing sample metadata, and associated unit tests. These are motivated by discussion with Ryan about how he'll use sample metadata in VariantEvalwalker - hopefully will make it easier for him. Methods are:
-- getToolkit().subContextFromSampleProperty(): filters a VariantContext to genotypes that come from samples that have a given property value
-- getToolkit().getSamplesWithProperty(): gets all samples with a given property
-- getToolkit().getSamplesFromVariantContext(): sample objects that are referenced by name in a VariantContext



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4361 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-28 02:16:25 +00:00
kshakir edaa278edd Removed cases where various toolkit functions were accessing GenomeAnalysisEngine.instance.
This will allow other programs like Queue to reuse the functionality.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-25 02:49:30 +00:00
hanna 497bcbcbb7 Recent changes to the build system make the build system complain loudly about
pieces of core that depend on playground.  Most of these have been eliminated by
(temporarily) promoting Aaron's report system to core in this checkin.  I'll 
follow up with other changes separately.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 22:09:12 +00:00
depristo 745b8cc6d3 GATK now detects when lexicographically sorted human data is provided and throws a UserException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4343 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 15:19:48 +00:00
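
One way to picture the check in commit 745b8cc6d3 is the heuristic below, written in Python. It is an assumed illustration only and does not reproduce the GATK logic: the function names and the rule itself (numbered contigs appear in string order, for example 1, 10, 11, 2, rather than in numeric order) are hypothetical.

```python
import re


def numeric_contigs(contigs):
    """Return the integer part of names like '1' or 'chr10', in input order."""
    numbers = []
    for name in contigs:
        match = re.fullmatch(r"(?:chr)?(\d+)", name)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


def looks_lexicographically_sorted(contigs):
    """True if numbered contigs follow string order but not numeric order."""
    numbers = numeric_contigs(contigs)
    numeric_order = sorted(numbers)
    string_order = sorted(numbers, key=str)
    return numbers == string_order and numbers != numeric_order
```

Contigs without a number, such as chrX, are ignored by this sketch, and orderings where string and numeric order agree (1, 2, 3) are not flagged.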
hanna 7841b301c4 Added more diagnostics so that I have some idea of what a 'general' exception
is.  Required to fix bug ZjhCJAdwhtFq1x54ZlmlN8pFNcbrRpdJ and similar.  We
might want to change this particular case to a ReviewedStingException after
we gain a bit more experience with it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4333 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 21:32:01 +00:00
kshakir 20b38b38f3 Updated from SnakeYAML 1.6 to 1.7.
Added a pipeline java bean and YAML utility to serialize java beans.
Added a getFirehosePipelineYaml.sh that can pull firehose data into the pipeline yaml file format.
Updated the fullCallingPipeline.q to begin using the pipeline yaml file format for bams and reference.
More changes to come as this code gets tested out in the fullCallingPipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4329 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 19:47:49 +00:00
hanna 0c99c97685 The engine now automatically adds the command-line arguments to the header of every VCF, unless -NO_HEADER is specified.
Changed integration tests, adding the -NO_HEADER argument, for walkers that previously did not include the command-line
arg headers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4326 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-22 15:27:58 +00:00
depristo 522830fb01 Support for --assume-single-sample in UG, better malformed bam exceptions, and ignoring out of order contigs in seqdictutils. All for the CG bam file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4323 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 20:33:34 +00:00
delangel f64b6fddc1 Major changes/improvements to indel genotyper:
a) Redid way to compute path metrics in indel error model. Paper formulation where we have an anchor point in the alignment between read and haplotype won't work in practice except in nice data sets that are perfectly indel-realigned and that are well mapped by aligner. New formulation doesn't assume this, and it's actually simpler and uses less code. It now more closely resembles a classic SW dynamic programming formulation but it still preserves the HMM probabilistic formulation.
b) Added a programmable call threshold, set by command line.
c) Now uses the sample name from the BAM file; removed the -sampleName argument.
d) Simplify loop to compute read-haplotype likelihoods.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4311 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-19 23:47:31 +00:00
ebanks a10b2a00a5 Moving the util VariantContext 'modifying' routines into VC itself (as opposed to VCUtils) so that we can pass the genotype data directly into it and are no longer forced to decode the genotypes for no reason. This means that any walker that takes in a VCF and modifies the records without touching the genotypes never has to decode them. I've hooked this into the other two Variant Recalibrator walkers for Ryan. One side effect, though, is that we no longer can sort the sample names in the VCF (i.e. if the input VCF doesn't have samples in alphabetical order, then we used to sort them when writing a new VCF but no longer do that), because if we don't decode then we can't re-order the genotypes. I don't think this is a big concern given that the Unified Genotyper does emit sorted samples and that's the main source for most of the VCFs we use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4300 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-17 07:09:58 +00:00
delangel c604ed9440 Several improvements to new indel genotyper (more to come soon):
a) Turns out previous change of centering haplotype around indel was a bad idea. Context to the left of indel is important but not as important as right one, because by definition all alleles start at the same location, so haplotype is the same to the left of indel regardless of allele. So, go back to having a constant size window to the left of event.
b) Expand reference context so we can test larger haplotypes.
c) Optimize computation of read likelihoods by doing them in linear array instead of in a matrix - no difference in biallelic sites but could be significantly faster in multiallelic sites.
d) Bug fix: read alignment wasn't being computed correctly if, a) we were at an insertion, b) read started right at the insertion, c) read CIGAR didn't include insertion - more of these corner conditions are lurking, so a revamped computation of how reads align to candidate haplotypes is in the works.
e) Add debug option not to use prior haplotype likelihoods.
f) Don't hard-code NA12878 for genotyping, now sample name is a required input argument.
g) Bug fix: if there are no reads covering a candidate indel event, just output NO_CALL (didn't notice this in HiSeq, but in P1 data it happens all the time). I need to add a confidence threshold for calling later on.






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4291 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 21:53:08 +00:00
hanna 7fa6b2135b Added a back door so that integration tests can reset the sequence dictionary
in the reference.  Reset routine is not accessible to any class outside
GenomeLocParser's package.

We'll have to do something more intelligent with this when the GATK goes
distributed.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4275 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-14 18:58:08 +00:00
depristo 7880863eb7 Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 15:07:38 +00:00
depristo 7ad8fbdd5a Moved GATKException to exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:47:19 +00:00
depristo bccebf8899 Newly placed StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4264 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:38:46 +00:00
depristo 3964e02fb6 Newly placed StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4263 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:38:32 +00:00
depristo 595907e98e Moving StingException
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4262 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:34:15 +00:00
depristo 40e6179911 Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 14:02:43 +00:00
delangel da2e879bbc Miscellaneous improvements to indel genotyper:
- Add a simple calculation model for Pr(R|H) that doesn't rely on Dindel's HMM model. MUCH faster, at a cost of slightly worse performance since we're more sensitive to bad reads coming from sequencing artifacts (add -simple to command line to activate).
- Add debug option to calculation model so that we can optionally output useful info on current read being evaluated. (add -debugout to commandline).
- Small performance improvement: instead of evaluating haplotype to the right of indel (just with a 5 base addition to the left), it seems better to center the indel and to add context evenly around event.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4257 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 13:50:28 +00:00
depristo 8f1a32acae All exceptions thrown by the GATK have been reviewed and UserErrors replaced where appropriate. Shazam. Another check-in will remove the GATKException and restore the StingException.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4252 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 15:25:30 +00:00
depristo 1de713f354 Massive review of maybe 50% of the exceptions in the GATK. GATKException is a tmp. tracker so that I can tell which StingExceptions I've reviewed. Please don't use it. If you are working on new code and are considering throwing exceptions, it's either UserError or StingException, please
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4246 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 23:21:17 +00:00
depristo 6a30617a60 Initial implementation of UserError exceptions and error message overhaul. UserErrors and their subclasses (UserError.MalFormedBam, for example) should be used when the GATK detects errors on the part of the user. The output for errors is now much clearer and hopefully will reduce GS posts. Please start using UserError and its subclasses in your code. I've replaced some, but not all, of the StingExceptions in the GATK with UserError where appropriate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4239 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 11:32:20 +00:00
depristo 7eeabe534a QSample walker for 1KG -- measures aggregate quality of sequencing. Includes misc. improvements throughout the code, including using the new Tribble GenotypeLikelihoods class for working with VCF GLs from the UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4211 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 18:21:43 +00:00
delangel 8a7f5aba4b First more or less sort of functional framework for statistical Indel error caller. Current implementation computes Pr(read|haplotype) based on Dindel's error model. A simple walker that takes an existing vcf, generates haplotypes around calls and computes genotype likelihoods is used to test this as first example. No attempt yet to use prior information on indel AF, nor to use multi-sample caller abilities.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4197 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 00:25:34 +00:00
hanna dc5f858d29 Replaced placeholder support for splitting by read group with real support (sorry everyone), and added relatively comprehensive unit tests to ensure that splitting by read group works.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4190 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-01 22:24:50 +00:00
hanna de5ccfb0b1 Moved hasPileupBeenDownsampled() based on Eric's request. Also eliminated
@Deprecated constructors from AlignmentContext.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4142 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-27 16:12:05 +00:00
delangel f2b138d975 Small refactoring: make Haplotype a public class since it will be soon extended and shared with other callers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4100 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-24 17:52:36 +00:00
aaron 35b9883dd6 vcfwriter is in tribble now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4083 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-23 17:01:04 +00:00
hanna b80cf7d1d9 Modifications to the output system for better interaction with @Output. Multiplexed arguments. More details in the Monday meeting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4077 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-22 14:27:05 +00:00