depristo
cd293f145b
More stable reduced reads representation. Bug fixes throughout. No diffs by <1% of sites in an exome, and the majority of these differences are filtered out, or are obvious artifacts. UnitTests for BaseCounts. BaseCounts extended to handle indels, but not yet enabled in the consensus reads.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5939 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-03 20:11:31 +00:00
depristo
86df10ec09
UnitTests for ConsensusSpan infrastructure
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5929 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:44:52 +00:00
depristo
ad9dca9137
Package updated. Copyrights added
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5926 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:29:27 +00:00
depristo
3d628f06f0
moved to playground
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5925 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 21:25:26 +00:00
hanna
6cc84c3ce2
Make the set of VariantContextAdaptors dynamic so that Andrey's MafFeature can
...
continue to exist and live in playground (and thus outside of the normal release
/ git release branch).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5909 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-01 02:54:55 +00:00
carneiro
5974675b43
Two intermediate commits, to work over the weekend.
...
ReplicationValidationWalker: Just the skeleton of what will be the implementation of the replication/validation model.
dataProcessingV2: Committing an UNTESTED implementation of BWA alignment. I am running tests on it over the weekend.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5900 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:03:08 +00:00
carneiro
b66c6dced1
- No longer prints out non confident calls (they were leading to tables that don't add up and confusing some Pacbio folk).
...
- Added sensitivity and Specificity to the report.
- With the changes in genotype likelihoods, the indel analysis only happens if the BAM file also has an extended event. Not great, but at least it's not broken.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5759 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 19:26:55 +00:00
rpoplin
23cd3a7a5d
Moving VQSR v2 to core.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5740 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 20:20:06 +00:00
rpoplin
5bade81c6d
Adding tranche plot generation back to VQSR
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5736 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:26:26 +00:00
rpoplin
e73720c2db
Updating VQSLOD annotation description
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5735 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:01:08 +00:00
rpoplin
11052918d9
Better exception text for common error in VQSR.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5734 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:37:25 +00:00
rpoplin
4bbce42861
Renaming ContrastiveRecalibrator --> VariantRecalibrator in preparation for move to core
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5733 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:12:47 +00:00
rpoplin
6323fb8673
misc cleanup in VQSR
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5732 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:00:22 +00:00
rpoplin
3224bbe750
New visualization output for VQSR. It creates the R script file on the fly and then runs Rscript on it. Adding 1000G Project consensus code. First pass of having VQSR work with missing data by marginalizing over the missing dimension for that data point (thanks Chris and Bob for ideas). Updated math functions to use apache math commons instead of approximations from wikipedia. New parameters available for the priors based on further reading in Bishop and looking at the new visualizations. Updated integration test to use more modern files. Updated MDCP to use new best practices w.r.t. annotations.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5723 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 19:14:42 +00:00
ebanks
cbcdfc584d
Moving out of core and into playground
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5671 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 02:30:22 +00:00
corin
2cf6a06503
Throwing an error if INFO fields arguments contain whitespace.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5651 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:52:55 +00:00
corin
fce6d25075
Moved the reference ID to a meta data field for validity declaration.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5650 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:28:56 +00:00
corin
59215dab48
Now writes results to a minimal vcf with annotations included in the INFO field. Must be run with -NO_HEADER to totally remove header for the most bare bones vcf; otherwise also includes command line meta data.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5649 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 20:14:02 +00:00
fromer
8e0f5bc5a5
Prevent NullPointerException in cases where SNP is filtered
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5641 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 19:59:59 +00:00
ebanks
af09170167
As I threatened yesterday, I've moved the various and disparate randomization code out of the walkers. Now they all (except VQSRv1, whose days are numbered anyways) use a static generator available in the engine itself. Please use this from now on. The seed is reset before every individual integration test is run. I think there may still be an issue with the IndelRealigner but I need to confirm with the commit to see what testNG does. Integration tests are already broken anyways, so no big deal.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5589 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 17:03:48 +00:00
carneiro
89bb21d024
typo in the argument description
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5587 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:45:32 +00:00
rpoplin
3f3f35dea0
UnifiedGenotyper now BAQs via ADD_TAG to facilitate using BAQed quals for GL calculations but unBAQed quals for annotation calculations. UnifiedGenotyper now produces SNP and indel calls simultaneously. 40 base mismatch intrinsic filter removed from UG to greatly simplify the code. RankSumTests are now standard annotations but the integration tests are commented out pending changes that will allow random annotations to work.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5585 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:06:24 +00:00
rpoplin
b2a0331e2d
Pushing hard coded arguments into VariantRecalibratorArgumentCollection
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5566 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-03 19:55:09 +00:00
depristo
11822da578
Stand alone, GATK dependent tool that Reads a list of BAM files and slices all of them into a single merged BAM file containing reads in overlapping chr:start-stop interval. Highly efficient when working with thousands of BAM files. Can merge 1MB of sequence of 1600 4x BAMs in 4g in only 2 hours.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5558 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-02 13:41:29 +00:00
rpoplin
5ddc0e464a
Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 21:04:09 +00:00
corin
f2d84bf746
Changes the validity declaration from a true to false to a five point scale
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5527 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 18:31:53 +00:00
depristo
6a1d12cf7b
Intermediate commit refactoring FragmentPileup to (1) make it more accessible (now in utils.pileup) as well as (2) improve performance. Passes all integration tests now. Upcoming refactoring will change further how the system can be accessed, and further improve performance.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5522 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 12:42:22 +00:00
kshakir
f3e94ef2be
Walkers can now specify a class extending from Gatherer to merge custom output formats. Add @Gather(MyGatherer.class) to the walker @Output.
...
JavaCommandLineFunctions can now specify the classpath+mainclass as an alternative to specifying a path to an executable jar.
JCLF by default pass on the current classpath and only require the mainclass be specified by the developer extending the JCLF, relieving the QScript author from having to explicitly specify the jar.
Like the Picard MergeSamFiles, GATK engine by default is now run from the current classpath. The GATK can still be overridden via .jarFile or .javaClasspath.
Walkers from the GATK package are now also embedded into the Queue package.
Updated AnalyzeCovariates to make it easier to guess the main class, AnalyzeCovariates instead of AnalyzeCovariatesCLP.
Removed the GATK jar argument from the example QScripts.
Removed one of the most FAQ when getting started with Scala/Queue, the use of Option[_] in QScripts:
1) Fixed mistaken assumption with java enums. In java enums can be null so they don't need nullable wrappers.
2) Added syntactic sugar for Nullable primitives to the QScript trait. Any variable defined as Option[Int] can just be assigned an Int value or None, ex: myFunc.memoryLimit = 3
Removed other unused code.
Re-fixed dry run function ordering.
Re-ordered the QCommandline companion object so that IntelliJ doesn't complain about missing main methods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5504 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 14:03:51 +00:00
ebanks
48b15d42e0
More fixes and improvements. We no longer use any bases under Q20 because random ~Q5s were cluttering the graphs; instead we grab any contiguous segments of size at least MIN_SEQUENCE_LENGTH where all bases are above Q20. Also, I implemented a quick algorithm to traverse the graph (using DFS) to choose the two best scoring paths (haplotypes). Used it successfully at NA12878 HM3 SNP sites to determine whether they are homozygous (no distiction yet between ref and alt) or heterozygous! Indels are the next target. Still have some issues to work out.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5502 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 03:51:19 +00:00
ebanks
401d1cb97f
Bug fixes plus some debugging code added. Broke out DeBruijnVertex into its own class so that the interface is now cleaner. Still very much a work in progress.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5498 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 17:35:34 +00:00
carneiro
3414bccb46
documentation changes to agree with the wiki
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5494 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 21:48:49 +00:00
carneiro
28149e5c5e
GenotypeAndValidate version 2, ready to be used.
...
- now it differentiates between confident REF calls and not confident calls.
- you can now use a BAM file as the truth set.
- output is much clearer now
dataProcessingPipeline version 2, ready to be used.
- All the processing is now done at the sample level
- Reads the input bam file headers to combine all lanes of the same sample.
- Cleaning is now scattered/gathered. Inteligently breaks down in as many intervals as possible, given the dataset.
- Outputs one processed bam file per sample (and a .list file with all processed files listed)
- Much faster, low pass (read Papuans) can run in the hour queue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 20:18:02 +00:00
ebanks
1a9e65bcd4
Updating other walkers now that VCC extends from VC
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5486 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 03:10:40 +00:00
asivache
1d5326ff0c
Minor fixes to the cmd-line help messages
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5470 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 18:18:04 +00:00
depristo
abc7d1aef9
BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site.
...
ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation. Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field. This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples).
Added a new VariantsToBeagleUnphased walker that writes out a marker drive hard-call unphased genotypes file suitable for imputating missing genotypes with a reference panel with beagle. Can optionally keep back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power. The bootstrap sites can be written to a separate VCF for assessment as well.
Finally, my general Queue script for creating and evaluating reference panels from VCF files. Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel. Lots of options for exploring the impact of the VQS likelihooods, multiple VCFs for constructing the reference panel, as well as fraction of sites left out in assessing the panel's power.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 03:08:38 +00:00
corin
30237e6824
Updated the walker to specify the build based on the user's input file name if the user does not specify the build.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5459 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 17:49:17 +00:00
ebanks
3eea6e92b7
An extremely basic implementation of a deBruijn-based local assembler, using the jgrapht graph library. This is not at all optimized and has only been tested on my very simple 3-read test bams. I'm sure there are bugs in there - more testing coming soon. Insertions and deletions confirmed to generate identical graphs (except for the multiplicity of edges of course). Not worth using yet.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5455 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 14:03:07 +00:00
rpoplin
8d0880d33e
Misc cleanup
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5453 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 17:33:19 +00:00
rpoplin
c6ef6ee8b7
Recal file is in input to ApplyRecalibration not an output.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5452 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 12:08:58 +00:00
rpoplin
8e89ff170e
Can't check substitution type of tri-allelic SNPs.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5451 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-16 03:06:03 +00:00
carneiro
e2e435d52c
GenotypeAndValidate: now looks at annotations in the INFO field instead of filter field. Better output and filters repetitive calls to indel extended events.
...
IndelUtils: added a isInsideExtendedIndel() method to filter the above mentioned.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5449 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:54:40 +00:00
rpoplin
d98503ca50
Removing some debug code from VQSRv2. VariantEval can now be stratified by contig with -ST Contig. New hidden option in CombineVariants for overlapping records to take the info fields from the record with the highest AC (while still updating AC/AN/AF correctly) instead of dropping info fields which aren't exactly the same.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5448 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 21:28:10 +00:00
rpoplin
bbcc4ed700
The second pass of the contrastive VQSRv2.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5444 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:05:02 +00:00
rpoplin
2a2538136d
A version of VQSRv2 that does contrastive clustering in two passes. The walkers will be renamed when they are moved to core.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5443 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 21:03:56 +00:00
carneiro
fcc347bb05
making sure the output is as pretty as I said it would be on the wiki.
...
wikipage for this walker is up, at : http://www.broadinstitute.org/gsa/wiki/index.php/Genotype_and_Validate#Examples
use it ;)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5442 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 20:32:09 +00:00
ebanks
239dae0985
Absolutely nothing to get excited about. This is just the skeleton for the local assembler. It doesn't do anything at all now except for collect reads over each -L interval and pass them to an assembly engine (which isn't implemented yet). The interface for the AssemblyEngine will change later, but for now this one is the most conducive to debugging.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5441 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 20:31:54 +00:00
corin
6d09cdd4bc
This is a walker that lets the user generate the bed file for declaring variants true positives or false positives. For use with the IGV crowd sourcing project.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5440 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 19:56:16 +00:00
depristo
f75ad0dee3
Now in Picard, and released to the public
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5439 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 19:36:56 +00:00
carneiro
9dfe4c9cb7
moving GenotypeAndValidate to the playground. It's ready to be used.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5438 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 19:19:18 +00:00
rpoplin
ceb08f9ee6
Moving some math around in VQSRv2.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5431 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-12 15:15:05 +00:00