1363 Commits (d534241f35505d757f39d482d45d3316d6dd944e)

Author SHA1 Message Date
depristo cd293f145b More stable reduced reads representation. Bug fixes throughout. Differences at <1% of sites in an exome, and the majority of these differences are filtered out or are obvious artifacts. UnitTests for BaseCounts. BaseCounts extended to handle indels, but not yet enabled in the consensus reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5939 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 20:11:31 +00:00
depristo 86df10ec09 UnitTests for ConsensusSpan infrastructure
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5929 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:44:52 +00:00
depristo ad9dca9137 Package updated. Copyrights added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5926 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:29:27 +00:00
depristo 3d628f06f0 moved to playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5925 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:25:26 +00:00
hanna 6cc84c3ce2 Make the set of VariantContextAdaptors dynamic so that Andrey's MafFeature can
continue to exist and live in playground (and thus outside of the normal release
 / git release branch).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5909 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 02:54:55 +00:00
carneiro 5974675b43 Two intermediate commits, to work over the weekend.
ReplicationValidationWalker: Just the skeleton of what will be the implementation of the replication/validation model.
dataProcessingV2: Committing an UNTESTED implementation of BWA alignment. I am running tests on it over the weekend.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5900 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:03:08 +00:00
carneiro b66c6dced1 - No longer prints out non-confident calls (they were leading to tables that don't add up and confusing some PacBio folk).
- Added sensitivity and specificity to the report.
- With the changes in genotype likelihoods, the indel analysis only happens if the BAM file also has an extended event. Not great, but at least it's not broken.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5759 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 19:26:55 +00:00
rpoplin 23cd3a7a5d Moving VQSR v2 to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5740 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 20:20:06 +00:00
rpoplin 5bade81c6d Adding tranche plot generation back to VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5736 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:26:26 +00:00
rpoplin e73720c2db Updating VQSLOD annotation description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5735 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:01:08 +00:00
rpoplin 11052918d9 Better exception text for common error in VQSR.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5734 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:37:25 +00:00
rpoplin 4bbce42861 Renaming ContrastiveRecalibrator --> VariantRecalibrator in preparation for move to core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5733 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:12:47 +00:00
rpoplin 6323fb8673 misc cleanup in VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5732 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:00:22 +00:00
rpoplin 3224bbe750 New visualization output for VQSR. It creates the R script file on the fly and then runs Rscript on it. Adding 1000G Project consensus code. First pass of having VQSR work with missing data by marginalizing over the missing dimension for that data point (thanks Chris and Bob for ideas). Updated math functions to use Apache Commons Math instead of approximations from Wikipedia. New parameters available for the priors based on further reading in Bishop and looking at the new visualizations. Updated integration test to use more modern files. Updated MDCP to use new best practices w.r.t. annotations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5723 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 19:14:42 +00:00
ebanks cbcdfc584d Moving out of core and into playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5671 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 02:30:22 +00:00
corin 2cf6a06503 Throwing an error if INFO fields arguments contain whitespace.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5651 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:52:55 +00:00
corin fce6d25075 Moved the reference ID to a meta data field for validity declaration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5650 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:28:56 +00:00
corin 59215dab48 Now writes results to a minimal vcf with annotations included in the INFO field. Must be run with -NO_HEADER to totally remove header for the most bare bones vcf; otherwise also includes command line meta data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5649 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:14:02 +00:00
fromer 8e0f5bc5a5 Prevent NullPointerException in cases where SNP is filtered
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5641 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 19:59:59 +00:00
ebanks af09170167 As I threatened yesterday, I've moved the various and disparate randomization code out of the walkers. Now they all (except VQSRv1, whose days are numbered anyways) use a static generator available in the engine itself. Please use this from now on. The seed is reset before every individual integration test is run. I think there may still be an issue with the IndelRealigner but I need to confirm with the commit to see what testNG does. Integration tests are already broken anyways, so no big deal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5589 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 17:03:48 +00:00
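The shared-generator idea in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the engine's actual code: the names `get_random_generator` and `reset_random_generator` and the seed value are hypothetical, chosen only to show one generator shared by all callers and reseeded before each test.

```python
import random

# Hypothetical fixed seed; the real value used by the engine is not stated in the log.
DEFAULT_SEED = 12345

# One module-level generator shared by every caller (stand-in for a static generator).
_shared_generator = random.Random(DEFAULT_SEED)


def get_random_generator():
    """Return the single shared generator instead of creating a new one per walker."""
    return _shared_generator


def reset_random_generator(seed=DEFAULT_SEED):
    """Reseed the shared generator; a test harness would call this before each test."""
    _shared_generator.seed(seed)


# Two "tests" that each reset the seed first observe the same random sequence.
reset_random_generator()
first_run = [get_random_generator().random() for _ in range(3)]
reset_random_generator()
second_run = [get_random_generator().random() for _ in range(3)]
```

Resetting before every test keeps results reproducible regardless of which tests ran earlier or in what order.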
carneiro 89bb21d024 typo in the argument description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5587 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:45:32 +00:00
rpoplin 3f3f35dea0 UnifiedGenotyper now BAQs via ADD_TAG to facilitate using BAQed quals for GL calculations but unBAQed quals for annotation calculations. UnifiedGenotyper now produces SNP and indel calls simultaneously. 40 base mismatch intrinsic filter removed from UG to greatly simplify the code. RankSumTests are now standard annotations but the integration tests are commented out pending changes that will allow random annotations to work.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5585 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:06:24 +00:00
rpoplin b2a0331e2d Pushing hard coded arguments into VariantRecalibratorArgumentCollection
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5566 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-03 19:55:09 +00:00
depristo 11822da578 Standalone, GATK-dependent tool that reads a list of BAM files and slices all of them into a single merged BAM file containing reads in an overlapping chr:start-stop interval. Highly efficient when working with thousands of BAM files. Can merge 1MB of sequence of 1600 4x BAMs in 4g in only 2 hours.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5558 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-02 13:41:29 +00:00
rpoplin 5ddc0e464a Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 21:04:09 +00:00
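As a rough illustration of the key-value tag syntax described in the commit above, the following Python sketch splits a `-B:` binding into a name, a type, and a tag dictionary. It is not the GATK parser; `parse_rod_binding` is a hypothetical helper, and the assumption that the first two comma-separated fields are the name and type is taken only from the example in the message.

```python
def parse_rod_binding(arg):
    """Split a '-B:name,type,key=value,...' argument into (name, type, tags)."""
    prefix = "-B:"
    if not arg.startswith(prefix):
        raise ValueError(f"not a ROD binding argument: {arg!r}")
    fields = arg[len(prefix):].split(",")
    if len(fields) < 2:
        raise ValueError("expected at least a name and a type")
    name, rod_type, *tag_fields = fields
    tags = {}
    for field in tag_fields:
        # Each remaining field is assumed to have the form key=value.
        key, sep, value = field.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed tag: {field!r}")
        tags[key] = value
    return name, rod_type, tags


name, rod_type, tags = parse_rod_binding(
    "-B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0"
)
```

Tag values are kept as strings here; a walker would convert `prior` to a number or `truth` to a boolean as needed.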
corin f2d84bf746 Changes the validity declaration from true/false to a five-point scale
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5527 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 18:31:53 +00:00
depristo 6a1d12cf7b Intermediate commit refactoring FragmentPileup to (1) make it more accessible (now in utils.pileup) as well as (2) improve performance. Passes all integration tests now. Upcoming refactoring will change further how the system can be accessed, and further improve performance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5522 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 12:42:22 +00:00
kshakir f3e94ef2be Walkers can now specify a class extending from Gatherer to merge custom output formats. Add @Gather(MyGatherer.class) to the walker @Output.
JavaCommandLineFunctions can now specify the classpath+mainclass as an alternative to specifying a path to an executable jar.
JCLFs by default pass on the current classpath and only require that the mainclass be specified by the developer extending the JCLF, relieving the QScript author from having to explicitly specify the jar.
Like the Picard MergeSamFiles, GATK engine by default is now run from the current classpath. The GATK can still be overridden via .jarFile or .javaClasspath.
Walkers from the GATK package are now also embedded into the Queue package.
Updated AnalyzeCovariates to make it easier to guess the main class, AnalyzeCovariates instead of AnalyzeCovariatesCLP.
Removed the GATK jar argument from the example QScripts.
Removed the cause of one of the most frequently asked questions when getting started with Scala/Queue, the use of Option[_] in QScripts:
1) Fixed mistaken assumption with Java enums. In Java, enums can be null, so they don't need nullable wrappers.
2) Added syntactic sugar for Nullable primitives to the QScript trait. Any variable defined as Option[Int] can just be assigned an Int value or None, ex: myFunc.memoryLimit = 3
Removed other unused code.
Re-fixed dry run function ordering.
Re-ordered the QCommandline companion object so that IntelliJ doesn't complain about missing main methods.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5504 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 14:03:51 +00:00
ebanks 48b15d42e0 More fixes and improvements. We no longer use any bases under Q20 because random ~Q5s were cluttering the graphs; instead we grab any contiguous segments of size at least MIN_SEQUENCE_LENGTH where all bases are above Q20. Also, I implemented a quick algorithm to traverse the graph (using DFS) to choose the two best scoring paths (haplotypes). Used it successfully at NA12878 HM3 SNP sites to determine whether they are homozygous (no distinction yet between ref and alt) or heterozygous! Indels are the next target. Still have some issues to work out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5502 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 03:51:19 +00:00
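The segment selection described in the commit above might look something like the Python sketch below. It is an illustration under stated assumptions, not the walker's code: the value of `MIN_SEQUENCE_LENGTH` is made up, and a base is treated as usable when its quality is at least 20, which is one reading of "no bases under Q20".

```python
MIN_SEQUENCE_LENGTH = 4  # hypothetical value; the real constant is not given in the log
MIN_BASE_QUALITY = 20    # assumption: bases with quality >= 20 are kept


def high_quality_segments(bases, quals,
                          min_len=MIN_SEQUENCE_LENGTH,
                          min_qual=MIN_BASE_QUALITY):
    """Return contiguous runs of bases whose qualities all meet min_qual."""
    segments = []
    start = None  # index where the current high-quality run began
    for i, qual in enumerate(quals):
        if qual >= min_qual:
            if start is None:
                start = i
        else:
            # A low-quality base ends the current run; keep it if long enough.
            if start is not None and i - start >= min_len:
                segments.append(bases[start:i])
            start = None
    # Handle a run that extends to the end of the read.
    if start is not None and len(quals) - start >= min_len:
        segments.append(bases[start:])
    return segments


segments = high_quality_segments(
    "ACGTACGTAC", [30, 30, 30, 30, 5, 30, 30, 20, 20, 20]
)
```

The single Q5 base splits the read into two usable segments rather than discarding the whole read.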
ebanks 401d1cb97f Bug fixes plus some debugging code added. Broke out DeBruijnVertex into its own class so that the interface is now cleaner. Still very much a work in progress.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5498 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 17:35:34 +00:00
carneiro 3414bccb46 documentation changes to agree with the wiki
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5494 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 21:48:49 +00:00
carneiro 28149e5c5e GenotypeAndValidate version 2, ready to be used.
- now it differentiates between confident REF calls and not confident calls.
- you can now use a BAM file as the truth set. 
- output is much clearer now

dataProcessingPipeline version 2, ready to be used.
- All the processing is now done at the sample level
- Reads the input bam file headers to combine all lanes of the same sample.
- Cleaning is now scattered/gathered. Intelligently breaks down into as many intervals as possible, given the dataset.
- Outputs one processed bam file per sample (and a .list file with all processed files listed)
- Much faster, low pass (read Papuans) can run in the hour queue.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 20:18:02 +00:00
ebanks 1a9e65bcd4 Updating other walkers now that VCC extends from VC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5486 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 03:10:40 +00:00
asivache 1d5326ff0c Minor fixes to the cmd-line help messages
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5470 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 18:18:04 +00:00
depristo abc7d1aef9 BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site.
ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation.  Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field.  This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples). 

Added a new VariantsToBeagleUnphased walker that writes out a marker-driven, hard-call, unphased genotypes file suitable for imputing missing genotypes with a reference panel with Beagle.  Can optionally hold back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power.  The bootstrap sites can be written to a separate VCF for assessment as well.

Finally, my general Queue script for creating and evaluating reference panels from VCF files.  Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel.  Lots of options for exploring the impact of the VQS likelihoods, multiple VCFs for constructing the reference panel, as well as the fraction of sites left out in assessing the panel's power.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 03:08:38 +00:00
corin 30237e6824 Updated the walker to specify the build based on the user's input file name if the user does not specify the build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5459 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 17:49:17 +00:00
ebanks 3eea6e92b7 An extremely basic implementation of a deBruijn-based local assembler, using the jgrapht graph library. This is not at all optimized and has only been tested on my very simple 3-read test bams. I'm sure there are bugs in there - more testing coming soon. Insertions and deletions confirmed to generate identical graphs (except for the multiplicity of edges of course). Not worth using yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5455 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 14:03:07 +00:00
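For readers unfamiliar with the structure, a de Bruijn graph with edge multiplicities can be sketched in a few lines of Python. This is a generic textbook construction, not the jgrapht-based implementation from the commit above; `build_de_bruijn` and the choice of k are hypothetical.

```python
from collections import Counter


def build_de_bruijn(reads, k):
    """Map each (kmer, next_kmer) edge to the number of times it was observed."""
    edges = Counter()
    for read in reads:
        # Consecutive k-mers overlap by k - 1 bases and are joined by an edge.
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return edges


edges = build_de_bruijn(["ACGTA", "CGTAC"], k=3)
```

Here the two reads share the edge from CGT to GTA, so its multiplicity is 2 while the other edges have multiplicity 1. Reads that cover the same sequence yield the same set of edges and differ only in these counts.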
rpoplin 8d0880d33e Misc cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5453 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 17:33:19 +00:00
rpoplin c6ef6ee8b7 Recal file is an input to ApplyRecalibration, not an output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5452 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 12:08:58 +00:00
rpoplin 8e89ff170e Can't check substitution type of tri-allelic SNPs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5451 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 03:06:03 +00:00
carneiro e2e435d52c GenotypeAndValidate: now looks at annotations in the INFO field instead of filter field. Better output and filters repetitive calls to indel extended events.
IndelUtils: added an isInsideExtendedIndel() method to filter the above-mentioned calls.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5449 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:54:40 +00:00
rpoplin d98503ca50 Removing some debug code from VQSRv2. VariantEval can now be stratified by contig with -ST Contig. New hidden option in CombineVariants for overlapping records to take the info fields from the record with the highest AC (while still updating AC/AN/AF correctly) instead of dropping info fields which aren't exactly the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5448 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:28:10 +00:00
rpoplin bbcc4ed700 The second pass of the contrastive VQSRv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5444 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:05:02 +00:00
rpoplin 2a2538136d A version of VQSRv2 that does contrastive clustering in two passes. The walkers will be renamed when they are moved to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5443 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:03:56 +00:00
carneiro fcc347bb05 making sure the output is as pretty as I said it would be on the wiki.
Wiki page for this walker is up at: http://www.broadinstitute.org/gsa/wiki/index.php/Genotype_and_Validate#Examples

use it ;)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5442 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 20:32:09 +00:00
ebanks 239dae0985 Absolutely nothing to get excited about. This is just the skeleton for the local assembler. It doesn't do anything at all now except for collect reads over each -L interval and pass them to an assembly engine (which isn't implemented yet). The interface for the AssemblyEngine will change later, but for now this one is the most conducive to debugging.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5441 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 20:31:54 +00:00
corin 6d09cdd4bc This is a walker that lets the user generate the bed file for declaring variants true positives or false positives. For use with the IGV crowd sourcing project.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5440 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 19:56:16 +00:00
depristo f75ad0dee3 Now in Picard, and released to the public
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5439 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 19:36:56 +00:00
carneiro 9dfe4c9cb7 moving GenotypeAndValidate to the playground. It's ready to be used.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5438 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 19:19:18 +00:00
rpoplin ceb08f9ee6 Moving some math around in VQSRv2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5431 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-12 15:15:05 +00:00