Commit Graph

4377 Commits (e47513f04376d044bbd3f270189e1ad7a423e1bb)

Author SHA1 Message Date
kshakir e47513f043 Minor updates to match the wiki documentation.
Upper cased the PartitionType enum values.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5506 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 20:22:23 +00:00
kshakir f3e94ef2be Walkers can now specify a class extending from Gatherer to merge custom output formats. Add @Gather(MyGatherer.class) to the walker @Output.
JavaCommandLineFunctions can now specify the classpath+mainclass as an alternative to specifying a path to an executable jar.
JCLF by default pass on the current classpath and only require the mainclass be specified by the developer extending the JCLF, relieving the QScript author from having to explicitly specify the jar.
Like the Picard MergeSamFiles, GATK engine by default is now run from the current classpath. The GATK can still be overridden via .jarFile or .javaClasspath.
Walkers from the GATK package are now also embedded into the Queue package.
Updated AnalyzeCovariates to make it easier to guess the main class, AnalyzeCovariates instead of AnalyzeCovariatesCLP.
Removed the GATK jar argument from the example QScripts.
Removed one of the most FAQ when getting started with Scala/Queue, the use of Option[_] in QScripts:
1) Fixed mistaken assumption with java enums. In java enums can be null so they don't need nullable wrappers.
2) Added syntactic sugar for Nullable primitives to the QScript trait. Any variable defined as Option[Int] can just be assigned an Int value or None, ex: myFunc.memoryLimit = 3
Removed other unused code.
Re-fixed dry run function ordering.
Re-ordered the QCommandline companion object so that IntelliJ doesn't complain about missing main methods.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5504 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 14:03:51 +00:00
ebanks 18271aa1f4 It never fails to amaze me that aligners can find so many different ways to place indels off the ends of contigs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5503 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 04:17:23 +00:00
ebanks 48b15d42e0 More fixes and improvements. We no longer use any bases under Q20 because random ~Q5s were cluttering the graphs; instead we grab any contiguous segments of size at least MIN_SEQUENCE_LENGTH where all bases are above Q20. Also, I implemented a quick algorithm to traverse the graph (using DFS) to choose the two best scoring paths (haplotypes). Used it successfully at NA12878 HM3 SNP sites to determine whether they are homozygous (no distiction yet between ref and alt) or heterozygous! Indels are the next target. Still have some issues to work out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5502 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 03:51:19 +00:00
hanna 26e3bea76e Fix for == used to test object equality.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5499 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 18:15:19 +00:00
ebanks 401d1cb97f Bug fixes plus some debugging code added. Broke out DeBruijnVertex into its own class so that the interface is now cleaner. Still very much a work in progress.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5498 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 17:35:34 +00:00
hanna 37fbf17da8 Finally restored code after accidentally removing three days worth of work:
schedule file infrastructure has been restored, and is now a single file.
Only the exact bins required for the traversal are stored in the schedule.
Very close to being able to merge schedule entries.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5497 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 05:52:40 +00:00
ebanks 69646ff840 ... and the corresponding integration test update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5496 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 01:58:07 +00:00
ebanks ded80e0c57 Trivial change to remove space at the end of the description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5495 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 01:47:46 +00:00
carneiro 3414bccb46 documentation changes to agree with the wiki
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5494 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 21:48:49 +00:00
carneiro 28149e5c5e GenotypeAndValidate version 2, ready to be used.
- now it differentiates between confident REF calls and not confident calls.
- you can now use a BAM file as the truth set. 
- output is much clearer now

dataProcessingPipeline version 2, ready to be used.
- All the processing is now done at the sample level
- Reads the input bam file headers to combine all lanes of the same sample.
- Cleaning is now scattered/gathered. Inteligently breaks down in as many intervals as possible, given the dataset.
- Outputs one processed bam file per sample (and a .list file with all processed files listed)
- Much faster, low pass (read Papuans) can run in the hour queue.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 20:18:02 +00:00
chartl 687b2e51b4 Switch from togglable wiggle output to togglable bedgraph format. Can be pulled directly into IGV to show the statistics values. I'll need to bug jim to allow value-toggling in a bedgraph, currently 2nd and 3rd columns are just ignored.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5492 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 17:58:53 +00:00
chartl 5a79f16ea4 Fixed an edge case where an exception was thrown if either of the sets was empty for the MWU test. Also altered the output format so U itself is not printed (which though interesting, isn't so useful for recalibration), but rather a value I call V (really the deviation of U from its expectation).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5490 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 16:28:44 +00:00
ebanks af7f78e8ba Minor debugging output change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5488 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 12:59:26 +00:00
ebanks b463faad92 Fixing typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5487 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 03:57:11 +00:00
ebanks 1a9e65bcd4 Updating other walkers now that VCC extends from VC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5486 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 03:10:40 +00:00
ebanks 0ee687e49d For Mauricio: now, even in GENOTYPE_GIVEN_ALLELES mode, the VariantCallContext (which now inherits directly from VC) will report reference calls as confidently called if they pass the threshold even if the QUAL of the record itself is low because we were forced to have an ALT allele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5485 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 02:42:28 +00:00
ebanks ab6a815184 As per the comments in the commit itself: when reads get mapped to the junction of two chromosomes (e.g. MT since it is actually circular DNA), their unmapped bit is set, but they are given legitimate coordinates. The Picard code will come in and move the read all the way back to its mate - which can be arbitrarily far away and cause records to be written out of order. Very evil.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5484 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 20:30:24 +00:00
ebanks d9202f2764 Don't try to create a GenomeLoc from an unmapped read
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5480 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 13:46:55 +00:00
ebanks 1c95208e26 Finally found the bug that everyone is reporting on GS. Iterators on PriorityQueues aren't guaranteed to return elements in sorted order (a pretty stupid contract) - so we were passing items to the constrained writer out of order. Just do a Collections.sort instead (1 line of code). Happy father's day!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5476 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 21:28:19 +00:00
ebanks 9568c84af9 Don't output these messages in INFO mode because they are scaring people unnecessarily
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5475 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:55:22 +00:00
depristo 22ff2573d5 Removed MAG entirely
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5474 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:43:23 +00:00
kiran 55897631ad Initial attempt at identifying potentially interesting variants in a Mendelian disease context when the called genotypes are uncertain.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5473 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:41:35 +00:00
kshakir b2b8a4f19f Re-un-final'ed BAQ.MAG as it was pre r5469.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5472 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:40:31 +00:00
asivache 1d5326ff0c Minor fixes to the cmd-line help messages
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5470 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 18:18:04 +00:00
depristo 7857cb5a22 Waiting to go to the hospital -- fixed a bug in the BAQ calculation where the BAQ would NPE if a read had no usable bases (all clipped, for example) but didn't fail the PF filter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5469 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 17:45:21 +00:00
fromer e84a27ceea OverlapWithBedInIntervalWalker calculates the average per-input-interval coverage by the BED intervals track
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5468 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 17:44:46 +00:00
depristo abc7d1aef9 BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site.
ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation.  Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field.  This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples). 

Added a new VariantsToBeagleUnphased walker that writes out a marker drive hard-call unphased genotypes file suitable for imputating missing genotypes with a reference panel with beagle.  Can optionally keep back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power.  The bootstrap sites can be written to a separate VCF for assessment as well.

Finally, my general Queue script for creating and evaluating reference panels from VCF files.  Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel.  Lots of options for exploring the impact of the VQS likelihooods, multiple VCFs for constructing the reference panel, as well as fraction of sites left out in assessing the panel's power.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 03:08:38 +00:00
depristo 9b8d41160b GENOTYPE_GIVEN_ALLELES now respects the filter status of the incoming alleles file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5466 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 02:59:28 +00:00
depristo 6281c1db6f A nicer error (UserException now) for malformed genome locs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5465 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 02:58:29 +00:00
delangel b45afe5ba8 Several major fixes and changes to new indel likelihood model:
a) Scrapped the way in which we constructed candidate haplotypes because it wasnt fully correct and yielded corner conditions with incorrect genotyping and likelihood computation. Ideally, a haplotype should "cover" the read and the most likely alignments should be such that the ends of the read are inside the ends of the haplotype. This wasn't happening, and if you have a "dangling read off a haplotype" the probabilistic alignment model may prefer to shift a read instead of scoring it correctly - this is especially bad with tandem repeat insertions. 
So now, we build haplotypes based on the reference context and adaptively change them based on read alignment positions, plus some padding and uncertainty in the alignment.

b) Changed the way soft clipped based are dealt with. Instead of either ignoring them or using them, we only use them if the read start or end position (after soft clipping) are within eventDistance of the current location. This is done because it's very common that BWA's strictly local SW implementation will soft clip every single read at an insertion position because it couldn't place that end of the read without too many mismatches, but the read is legit and the bases are good quality. If we don't take these bases into consideration, reads which are informative of an insertion event are essentially discarded because the informative part is clipped away. 

c) Several cleanups and fixes to the context-dependent gap penalty model based on length of HRun.





git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5464 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 18:39:31 +00:00
depristo cd38dfb4ef Now with a clearer, grammatically correct message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5462 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 18:06:05 +00:00
depristo 10466dc7d1 I finally broke down and added a default documentation string to @Input for use in Queue scripts. It's not ideal, but I couldn't take any more queue scripts with doc="x" all over the place.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5461 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 18:05:25 +00:00
depristo c1798a7dbc Whitespace cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5460 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 18:04:08 +00:00
corin 30237e6824 Updated the walker to specify the build based on the user's input file name if the user does not specify the build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5459 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 17:49:17 +00:00
carneiro 3de300e504 A walker that moves annotations from the filter field to the info field of truth annotated vcfs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5458 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 17:11:28 +00:00
ebanks 481750cbf9 Probable patch to Jerry Glenn's GetSatisfaction report. I'm having him test it out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5456 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 16:00:50 +00:00
ebanks 3eea6e92b7 An extremely basic implementation of a deBruijn-based local assembler, using the jgrapht graph library. This is not at all optimized and has only been tested on my very simple 3-read test bams. I'm sure there are bugs in there - more testing coming soon. Insertions and deletions confirmed to generate identical graphs (except for the multiplicity of edges of course). Not worth using yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5455 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 14:03:07 +00:00
hanna 28a5a177ce Very crude implementation of writing BAM 'schedules' to disk rather that 'meta-
indexes'.  Not yet elegant, but proves that it circumvents the performance
issues associated with the meta-index.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5454 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 21:48:47 +00:00
rpoplin 8d0880d33e Misc cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5453 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 17:33:19 +00:00
rpoplin c6ef6ee8b7 Recal file is in input to ApplyRecalibration not an output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5452 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 12:08:58 +00:00
rpoplin 8e89ff170e Can't check substitution type of tri-allelic SNPs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5451 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 03:06:03 +00:00
carneiro e2e435d52c GenotypeAndValidate: now looks at annotations in the INFO field instead of filter field. Better output and filters repetitive calls to indel extended events.
IndelUtils: added a isInsideExtendedIndel() method to filter the above mentioned.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5449 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:54:40 +00:00
rpoplin d98503ca50 Removing some debug code from VQSRv2. VariantEval can now be stratified by contig with -ST Contig. New hidden option in CombineVariants for overlapping records to take the info fields from the record with the highest AC (while still updating AC/AN/AF correctly) instead of dropping info fields which aren't exactly the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5448 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:28:10 +00:00
carneiro 4b9b767eb1 SelectVariants: now keeps the YAML stuff internal... it's there if you wanna use it, but won't be published anymore. Official parameter is the string for now.
VariantEval: now sports the new MendelianViolation utility class.
MendelianViolationClassifier: I noticed I had broken chartl's walker by changing VariantEval, so I took the liberty to modify it to use the new library too, though I kept modifications to a minimum, could have gone into full integration if this is a useful tool, but since it's in oneoffs, I decided not to go all out.
MendelianViolation: Some getter methods were added for chartl and VariantEval.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5447 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 18:36:55 +00:00
delangel 653fb09bb7 a) Next iteration of context-dependent gap penalty model for new probabilistic alignment indel model. Actual model is now implemented, computes homopolymer run profile for candidate haplotypes and looks up in table gap penalties based on hrun length at each position. Initial penalty model is a very naive affine penalty model with each extra hrun increment decreasing Q2 the gap open penalty, until a minimum is reached. Still needs to be tuned and ideally get data from recalibration.
b) small bug fix when setting debug arguments



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5446 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 16:46:28 +00:00
rpoplin bbcc4ed700 The second pass of the contrastive VQSRv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5444 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:05:02 +00:00
rpoplin 2a2538136d A version of VQSRv2 that does contrastive clustering in two passes. The walkers will be renamed when they are moved to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5443 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:03:56 +00:00
carneiro fcc347bb05 making sure the output is as pretty as I said it would be on the wiki.
wikipage for this walker is up, at : http://www.broadinstitute.org/gsa/wiki/index.php/Genotype_and_Validate#Examples

use it ;)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5442 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 20:32:09 +00:00
ebanks 239dae0985 Absolutely nothing to get excited about. This is just the skeleton for the local assembler. It doesn't do anything at all now except for collect reads over each -L interval and pass them to an assembly engine (which isn't implemented yet). The interface for the AssemblyEngine will change later, but for now this one is the most conducive to debugging.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5441 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 20:31:54 +00:00