Commit Graph

1304 Commits (e45b699ac059045f4e762c42b872c42eaf250fb3)

Author SHA1 Message Date
asivache 7a11b4f35d Another change in variant classification values
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5237 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 17:47:58 +00:00
asivache 7f7d7eb2d1 Inconsequential changes, more 'variant classification' values are recognized
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5236 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 17:36:39 +00:00
ebanks 4fe0fcd707 Updates to handle CG data, headers, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5215 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 03:16:05 +00:00
fromer bceb2a9460 Now that Mauricio has updated the PacBio BAM to properly have RG, can use sample name in the walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5212 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 20:26:57 +00:00
asivache 2a04e0d378 Explicitly set logger's level to info - otherwise samtools is too chatty
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5209 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 17:08:50 +00:00
depristo fe4aa58d35 Removing unused class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5197 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 22:22:28 +00:00
hanna 5c3198520c A few minor modifications masquerading as significant changes according to
svn's logs:
- Copied BAM indexing engine from Picard back into the GATK anticipating
  shard merging algorithm.  Tried to leave most of the building blocks in
  Picard.  If this turns into a logistical nightmare, I'll merge the building
  blocks into the GATK as well.
- Reorganized the org.broadinstitute.sting.gatk.datasources package, giving
  better separation of query and management functionality for reads, ref, rmd,
  and samples.  
- Merged Shard building blocks into org.broadinstitute.sting.gatk.datasources.
  reads package, indicating its current strong relationship with the reads,
  rather than the general unifying element I wish this would be.
- Collapsed BAMFormatAwareShard into Shard.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5184 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 17:59:19 +00:00
kiran cab426f86f VariantEval 3.0 is now in core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5139 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 17:42:08 +00:00
kiran e26da9b047 Changed column-key names to not have spaces, as GATKReport gets very upset about this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5131 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 03:31:54 +00:00
asivache 7af0532292 An attempt to have more intelligent sorting of RODs. Tested with maf only so far. Should be able to reference-sort dbsnp, bed and vcf as well, bugs notwithstanding. Very simple, brute-force implementation using SortingCollection. Should I have used tribble indexing machinery instead?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5118 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 22:10:07 +00:00
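A minimal sketch of what "reference-sorting" features could look like: order records by the position of their contig in the reference, then by start coordinate. The tuple layout `(contig, start, payload)` and the function name `reference_sort` are illustrative assumptions, and a plain in-memory `sorted` stands in for the SortingCollection the commit mentions.

```python
from typing import List, Sequence, Tuple

# Hypothetical feature record: (contig name, 1-based start, arbitrary payload).
Feature = Tuple[str, int, str]


def reference_sort(features: Sequence[Feature], contig_order: Sequence[str]) -> List[Feature]:
    """Sort features by reference contig order, then by start position.

    contig_order lists contig names in the order they appear in the reference.
    This is an in-memory sort for illustration only, not a disk-backed one.
    """
    rank = {contig: index for index, contig in enumerate(contig_order)}
    return sorted(features, key=lambda f: (rank[f[0]], f[1]))


if __name__ == "__main__":
    records = [("chr2", 50, "b"), ("chr10", 5, "c"), ("chr2", 10, "a")]
    for record in reference_sort(records, ["chr2", "chr10"]):
        print(record)
```

Ranking contigs by reference order rather than by name avoids the lexicographic trap where "chr10" would sort before "chr2".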
asivache fa8963522b Ignore header line if it happens to be passed to the codec again, instead of crashing on it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5116 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 21:44:33 +00:00
fromer f2de39d661 Calculates phase concordance rates between trio and RBP-phasing tracks, stratified by trio status (Het3, non-Het3)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5114 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 20:50:01 +00:00
kiran 9cb1ae384c Constant precision for floating point numbers. Added integration test - carries over tests from VariantEval with the necessary modifications to command-line arguments and md5s. Disabled use of 'synchronized' keyword because I clearly don't get how that keyword is supposed to work yet...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5107 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 05:19:18 +00:00
asivache f036a178f1 Added support for MAF features. So far works for MAF Lite only, annotated MAF is NOT TESTED yet AT ALL.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5105 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 03:20:46 +00:00
kiran 3e9f185dad Fixed issue with GenotypeConcordance being initialized incorrectly when the first seen comptrack had no samples.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5102 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 01:12:27 +00:00
kiran 58f0ecff89 Fixes to support evaluations with TableType elements - each such object now gets a separate entry in the output table. Added codon degeneracy stratification. Handle null elements in reports (useful for debugging).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5101 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 22:09:59 +00:00
kiran 2901299ff6 Sets the number of samples to all of the samples in the file when it's not specified on the command-line explicitly. GenotypeConcordance no longer a standard evaluation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5094 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 01:38:26 +00:00
asivache 43812a28fc If among all the multiple alignments for the given read we have 'unmapped' ones (can happen with bwa 0.5.7 and maybe later versions), then discard the latter and keep only the mapped ones. Keep 'unmapped' only if it's the only alignment available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5090 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 20:07:08 +00:00
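The rule in this commit can be sketched as a small filter. The `Alignment` class and the function name are hypothetical, introduced only for illustration; the sketch also assumes that when every alignment is unmapped, all of them are returned unchanged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    # Hypothetical minimal record; a real alignment carries many more fields.
    read_name: str
    mapped: bool


def drop_unmapped_if_mapped_exist(alignments: List[Alignment]) -> List[Alignment]:
    """Keep only mapped alignments, unless there are none.

    If no alignment is mapped, the input is returned as a new list so the
    read is not lost entirely.
    """
    mapped = [a for a in alignments if a.mapped]
    return mapped if mapped else list(alignments)
```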
asivache 63b709d992 When remapping the read, set MAPQ, CIGAR etc to 0/null for unmapped reads. This is not required according to spec but current samtools jdk otherwise dies in STRICT validation mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5089 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:49:07 +00:00
kiran a97184fddf Frick! Changed to refer to the *playground* version of VariantEvaluator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5087 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 19:33:03 +00:00
kiran a9d0772516 When evaluating JEXL expressions, don't blow up if the eval VC is null
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5085 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 18:25:03 +00:00
kiran 22e599ec76 Fixed output report to properly handle evaluation modules with TableType objects. Promoted CpG to a standard stratification. Demoted Filter to a non-standard stratification. Now, if the filter stratification is not specified, VariantEval only evaluates PASSing sites.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5084 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 17:38:21 +00:00
ebanks 2bbcc9275a Committing the fragment-based calling code. Results look great in all datasets (will show this at 1000G this week with Ryan). Note that this is an intermediate commit. The code needs to be cleaned up and the fragmentation code needs to be moved up into LocusIteratorByState. This should all happen later this week, but I don't want Ryan to have to keep running from my own personal Sting directory. The current crappy implementation adds ~10% to the runtime, but that should all go away in the next iteration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5058 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-23 05:04:17 +00:00
fromer 4bec93e3e4 Permit retrieval of read names for debugging purposes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5011 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 16:09:34 +00:00
kiran 2f4a436719 Throw an exception if no eval rods are specified. If one or more samples are specified, subset the 'all' VariantContext to just the specified samples. This is useful when you want to see what effect dropping certain samples will have on the metrics and you don't want to go through SelectVariants first.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5009 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-17 06:46:10 +00:00
kiran 73acfa654a Fixed double-counting bug. Fixed issue where evaluation module with an update2() method wasn't getting called if the comp track was null. Added a column to the output report indicating the table name for easy greppability. Fixed an issue where, if sample-level stratification was not required, the sample-level VCs would be generated anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5000 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 14:06:43 +00:00
depristo e3956148ac removing unused fastqtobam
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4985 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 14:29:32 +00:00
fromer c2dd956888 Moved PrintReferenceVariantsWalker to playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4971 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 22:07:41 +00:00
kiran fdc514ded3 Intermediate commit for VariantEval 3.0. Among the changes:
* Stratifications (by comp rod, by eval rod, novelty, filter status, etc.) have been generalized.  They are very symmetric with evaluators now.  Each stratification can have multiple states (e.g. known, novel, all).  New stratifications can be added and optionally applied.  Some new stratifications include:
  - by sample
  - by functional class
  - by CpG status

* Output is to a single file in GATKReport format, rather than having the options of CSV, R, table, etc.

* Rather than needing to state up front that the allowable variant type is a SNP or an indel, each eval record is inspected and the appropriate record type is fetched from the comp track.  (This will require a bit more testing...)

* Evaluation context (basically a single row in a VariantEval report) generation and retrieval has been overhauled.  Now, every possible configuration of stratification state is generated recursively and stored in a HashMap.  The key of the HashMap is a key that represents that exact state configuration.  When examining a comp track and eval track, this key is computed based on the data, providing easy lookup for the appropriate evaluation context.  When there are only a handful of stratification configurations, this isn't a big deal.  But when operating on a file with hundreds of samples, multiplied by 3 states for novelty, 3 states for filtration, 3 states for CpG status, etc., it becomes a very big deal.

There are still some known issues:
* When the per-sample stratification is turned off, things are getting overcounted (too many variants are showing up when compared to the VariantEval 2.0 code).  It's probably because I break out the VariantContext by sample even when not necessary, and those irrelevant contexts are still being counted.  Or my recursion is overaggressively creating evaluation contexts, and they all get added up in a weird way.  But that's why I'm committing now - so I can track down this issue without losing my work so far.

* The Jexl expressions are sometimes throwing an exception that I don't yet understand (they complain of an incorrect specification on the command-line... *after* the program has made it through a few thousand records).

* The request to have evaluations be smart enough to reject certain stratification states is not implemented yet.

There's still some work to do before I can replace VariantEval 2.0 with VariantEval 3.0, but feel free to take a look.  I'd love comments on the new code.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4946 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-06 15:20:24 +00:00
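The recursive generation of evaluation contexts described in this commit can be sketched roughly as below. This is an illustration under assumptions, not the VariantEval code: the function name, the state labels and the `variant_count` field are all hypothetical, and a Python dict keyed by a tuple of states stands in for the HashMap.

```python
from typing import Dict, List, Tuple

StateKey = Tuple[str, ...]


def build_contexts(stratifications: Dict[str, List[str]]) -> Dict[StateKey, Dict[str, int]]:
    """Recursively enumerate every combination of stratification states.

    Each combination becomes a tuple key mapping to a fresh evaluation
    context (here just a counter), so a record's context can be found with
    a single dictionary lookup once its states are known.
    """
    names = list(stratifications)
    contexts: Dict[StateKey, Dict[str, int]] = {}

    def recurse(depth: int, chosen: StateKey) -> None:
        if depth == len(names):
            contexts[chosen] = {"variant_count": 0}
            return
        for state in stratifications[names[depth]]:
            recurse(depth + 1, chosen + (state,))

    recurse(0, ())
    return contexts


if __name__ == "__main__":
    contexts = build_contexts({
        "novelty": ["all", "known", "novel"],
        "filter": ["called", "filtered", "raw"],
        "cpg": ["all", "CpG", "non_CpG"],
    })
    # A record's states, computed from the data, form the lookup key.
    contexts[("known", "called", "CpG")]["variant_count"] += 1
    print(len(contexts))
```

The number of contexts is the product of the state counts, which is why adding a per-sample stratification with hundreds of samples makes the precomputed map so much larger.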
fromer 4b37710bcd Added validator for phasing using read information, e.g., PacBio: ReadBasedPhasingValidationWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4940 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 20:05:56 +00:00
hanna 8d2c14b29c Update Picard / sam-jdk at Tim's request.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-03 02:17:25 +00:00
hanna cba18116e4 A significant refactoring of the ROD system, done largely to simplify the process of
streaming/piping VCFs into the GATK.  Notable changes:
- Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build 
  RMDTracks and lookup codecs.
- RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now
  manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class
  are the walkers that feel the need to access the ROD system directly.  (We need to stamp out
  this access pattern.)
A few minor warts were introduced as part of this process, labeled with TODOs.  These'll be
fixed as part of the VCF streaming project.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-31 04:52:22 +00:00
ebanks dabdeb729e Eric broke the build. Eric broke the build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4847 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 17:01:38 +00:00
ebanks 5c0b66cb7c 3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 16:24:28 +00:00
ebanks d89e17ec8c Fare thee well, UGv1. Here come the days of UGv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4747 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 21:51:19 +00:00
ebanks 222cd42ceb Have the UG engine take care of the GL to PL conversion. Note that we still use GLs for calling (since we are losing precision in high-pass and, even worse, it can affect QD), but we emit PLs in all cases. This means that calculating the GLs, emitting them to VCF, and then calling off of them (a la samtools) is absolutely, positively not ideal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4745 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-29 20:28:16 +00:00
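For readers unfamiliar with the two representations: GLs are log10-scaled genotype likelihoods and PLs are their phred-scaled, integer-rounded counterparts. The sketch below shows one common form of the conversion, normalized so the most likely genotype gets PL 0; the function name is hypothetical and this is not claimed to be the UG engine's exact code.

```python
from typing import List


def gls_to_pls(gls: List[float]) -> List[int]:
    """Convert log10 genotype likelihoods to normalized phred-scaled PLs.

    Each PL is -10 * (GL - best GL), rounded to the nearest integer, so the
    most likely genotype always has PL 0.
    """
    best = max(gls)
    return [int(round(-10.0 * (gl - best))) for gl in gls]


if __name__ == "__main__":
    # Two different GL vectors collapse to the same PLs after rounding,
    # which illustrates the precision loss the commit message refers to.
    print(gls_to_pls([-0.1, -1.5, -4.0]))
    print(gls_to_pls([-0.1, -1.52, -4.0]))
```

Because rounding to integers discards information, calling from the original GLs and only emitting PLs keeps the calculation as precise as possible.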
ebanks 102c8b1f59 Large refactoring of the UGv2 engine so that it is now truly separated into 2 distinct phases: GL calculation and AF calculation, where each can be done independently. This is not yet enabled in UGv2 itself though because I need to work out one last issue or two. Tested on 1Mb of 1000G Aug allPops low-pass and results are identical to before. Also, making BQ capping by MQ mandatory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4744 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-28 21:36:33 +00:00
ebanks 35b90d2295 Don't compute SB for ref calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4735 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-26 03:54:26 +00:00
ebanks ea6e2218c1 1. dbsnp has some massive indels which my left-aligner was barfing on because there isn't enough reference context; fixed. 2. Lower default calling threshold to Q30 for UGv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4722 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 19:28:33 +00:00
bthomas 374c0deba2 Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-19 19:59:05 +00:00
kshakir c723db1f4b Added a -summary jexl argument to VariantEval similar to -validate.
Updated the package of ValidationGenotyper to match the file location.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4710 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-19 04:42:46 +00:00
rpoplin b677080858 Initial checkin of the ValidationGenotyper. Not intended to be used by anybody yet. Only here for archival purposes at this point.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4685 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 22:33:49 +00:00
depristo ef2f6d90d2 VQSR now operates on LOD scores in the INFO field directly, and doesn't adjust the QUAL field. New format for tranches file uses LOD score. Old file format no longer supported. log10sumlog10() function, a very useful utility in MathUtils. No more ExtendedPileupElement! Robust math calculations in GMM so that no infinities are generated! HaplotypeScore refactored to enable use of filtered context. Not yet enabled... InferredContext getDouble and getInteger arguments now parse values from Strings if necessary
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4684 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 22:19:22 +00:00
ebanks 28142408ff Refactoring so that all counting in UGv2 is done on the filtered context. In particular, tests for empty pileups and too many spanning deletions now use the correct counts. Also, -all_bases mode now trumps all; this one is for you, chartl.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4671 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 05:01:12 +00:00
delangel cb1e8ad43a Temp bug fix for indel genotyper: if there are two or more variant contexts at a site, just choose the first one containing an indel and genotype that. There might be cases where IGv2 emits 2 indel variant contexts at the same ref location which made us fail there. A better solution will be to form underlying haplotypes supported by reads and compute likelihoods of that.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4667 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-14 00:21:54 +00:00
ebanks 69de3e51bf Better precision for the calculated AF value. Now looks at the total number of samples to determine how much precision is necessary. Also, changing default min BQ used for calling in UGv2 to Q17.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4655 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 08:31:40 +00:00
ebanks 2f6666a988 Correcting traversal statistics
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4652 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 22:46:58 +00:00
delangel 2f3be24a00 Improvement in exact allele frequency calculation model (still under test, but this is definitely better than what I had before). Instead of approximating log(10^x+10^y) as max(x,y), approximate full Jacobian formula max(x,y)+log(1+10^-abs(x-y)) with static lookup table for the second term.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4647 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-11 01:22:35 +00:00
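The approximation named in this commit can be sketched as follows, reading "log" as log base 10 (the only base for which the identity log(10^x+10^y) = max(x,y)+log(1+10^-abs(x-y)) holds exactly). The table step, the cutoff and the function names are assumptions chosen for illustration and are not taken from the commit.

```python
import math

TABLE_STEP = 0.001  # assumed resolution of the lookup table
MAX_DIFF = 8.0      # assumed cutoff; beyond it the correction is below 1e-8
N_ENTRIES = int(round(MAX_DIFF / TABLE_STEP)) + 1

# Precomputed correction term log10(1 + 10^-d) for d = 0, step, 2*step, ...
CORRECTION = [math.log10(1.0 + 10.0 ** (-i * TABLE_STEP)) for i in range(N_ENTRIES)]


def approx_log10_sum(x: float, y: float) -> float:
    """Approximate log10(10^x + 10^y) as max(x, y) plus a table lookup."""
    diff = abs(x - y)
    if diff >= MAX_DIFF:
        return max(x, y)
    return max(x, y) + CORRECTION[int(round(diff / TABLE_STEP))]


def exact_log10_sum(x: float, y: float) -> float:
    """Reference value, computed stably by factoring out the larger term."""
    hi, lo = max(x, y), min(x, y)
    return hi + math.log10(1.0 + 10.0 ** (lo - hi))


if __name__ == "__main__":
    for x, y in [(-1.0, -1.0), (-3.2, -3.9), (-10.0, -50.0)]:
        print(x, y, approx_log10_sum(x, y), exact_log10_sum(x, y), max(x, y))
```

The plain max(x, y) approximation is off by log10(2), about 0.301, when x equals y, which is where the table-based correction helps most.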
hanna 8e36a07bea Convert GenomeLocParser into an instance variable. This change is required
for anything that needs to be simultaneously aware of multiple references, eg
Queue's interval sharding code, liftover support, distributed GATK etc.  

GenomeLocParser instances must now be used to create/parse GenomeLocs.
GenomeLocParser instances are available in walkers by calling either

-getToolkit().getGenomeLocParser()
or
-refContext.getGenomeLocParser()

This is an intermediate change; GenomeLocParser will eventually be merged
with the reference, but we're not clear exactly how to do that yet.  This
will become clearer when contig aliasing is implemented.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 17:59:50 +00:00
ebanks e05af54f3e Found the cause of 80% of our non-called FNs: an excess of filtered bases were causing us to choose the wrong alternate allele. More details to dev team.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4634 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-07 03:39:57 +00:00