a) Fixed bugs in new dynamic programming-based genotyper
b) Fixed up temp hack that handles extended pileups for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4398 348d0f76-0448-11de-a6fe-93d51630548a
- Brought over exact AF estimation from branch (which is now dead). Exact model is default in UnifiedGenotyperV2.
- Implemented completely new genotyping algorithm given best AF estimate using dynamic programming, which in theory should be better than both greedy search and any HWE-based genotyper.
- Integrated and added new Dindel likelihood estimation model.
- Corrected annotators that would call readBasePileup: since we can be annotating extended events, best way is to interrogate context for kind of pileup and either readBasePileup or readExtendedEventPileup.
All changes above except last one are still in playground since they require more testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4396 348d0f76-0448-11de-a6fe-93d51630548a
This will allow other programs like Queue to reuse the functionality.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a
pieces of core that depend on playground. Most of these have been eliminated by
(temporarily) promoting Aaron's report system to core in this checkin. I'll
follow up with other changes in separately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a
Also added a new error message that I thought would be helpful...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4301 348d0f76-0448-11de-a6fe-93d51630548a
*** Three integration tests had to change: ***
RecalibarationWalkersIntegrationTest:
One of the tests was using the interval as the snp track, and wasn't supplying a DbSNP track (for CountCovariates)
SequenomValidationConverterIntegrationTest:
relies on Plink ROD which we've removed.
PileupWalkerIntegrationTest:
we no longer have implicit interval tracks, so there isn't a rod name over the specified region. Otherwise the same result.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4292 348d0f76-0448-11de-a6fe-93d51630548a
This commit adds two important classes: Sample, which contains data about one sample; and SampleDataSource, which manages sample data a la ReferenceDataSource and ReadsDataSource.
This code should be stable, but it has not been integrated with existing walkers yet. That's the next commit.
In the meantime, feel free to experiment with the code - there are two basic example walkers in the playground.sample package. And PLEASE let me know if you see any errors/inconsistencies.
Note that this also adds a new dependency on SnakeYaml, a YAML parser.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4285 348d0f76-0448-11de-a6fe-93d51630548a
This object designed to be both the structure that holds data during the execution of the walker, as well as the object that properly formats and emits the data so that it can be easily loaded into R. In the end, you get a table that looks like this:
##:GATKReport.v0.1 ErrorRatePerCycle : The error rate per sequenced position in the reads
cycle errorrate.61PA8.7 qualavg.61PA8.7
0 0.007451835696110506 25.474613284804366
1 0.002362777171937477 29.844949954504095
2 9.087604507451836E-4 32.87590975254731
3 5.452562704471102E-4 34.498999090081895
4 9.087604507451836E-4 35.14831665150137
5 5.452562704471102E-4 36.07223435225619
6 5.452562704471102E-4 36.1217248908297
7 5.452562704471102E-4 36.1910480349345
8 5.452562704471102E-4 36.00345705967977
...
A GATKReport object can hold multiple tables, and the write() method will emit all tables in succession. Each table starts with its own ##:GATKReport.v0.1 table header, so each table can stand alone. This allows for tables to be mixed and matched in a single file, or for the output from different walkers to be combined into a single file with no ill effect.
The display property of individual columns can be turned off. This is useful when a column is used to store intermediate results, necesary for the computation of some later value, but the contents of the intermediate column itself are not required in the final output file.
Finally, the GATKReportTable allows for some simple, mathematical, element-wise and column-wise operations. For instance, two whole columns can be divided, the results of the operation being stored in a third column. This mimics the most basic of R operations, where whole vectors can be added, subtracted, multiplied or divided without requiring the developer to explicitly write a loop.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4159 348d0f76-0448-11de-a6fe-93d51630548a
1. Rip out all of Ben's code intended to circumvent the stable VCF Writer output system in multi-threaded mode (I threw up a little when
I saw this code). This will improve memory consumption when running with -nt.
2. Don't annotate indels or > bi-allelic sites.
3. Fix bug where not all records were making it into the output VCF.
4. General code clean up.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4118 348d0f76-0448-11de-a6fe-93d51630548a
- GATKVCFWriter deleted, to be replaced if absolutely necessary when VCF writing goes into Tribble.
- VCFWriter is now an interface, for easier redirection.
- VCFWriterImpl fleshes out the VCFWriter interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4026 348d0f76-0448-11de-a6fe-93d51630548a
Added ability to skip up-to-date jobs where the outputs are older than the inputs.
Changed -T CountDuplicates --quiet to --quietLocus so that Queue GATK extensions can use both short and full argument names.
Short names can be used to set values on Queue GATK extensions, for example: vf.XL :+= myFile
Moved Hidden from the GATK to StingUtils.
Updated ivy from 2.0.0 to 2.2.0-rc1 to fix sha1 issue: http://bit.ly/aX72w7
Added Queue to javadoc and testing build targets.
Added first Queue unit test.
Another pass at avoiding cycles in the DAG thanks to all function I/O being files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4017 348d0f76-0448-11de-a6fe-93d51630548a
- Tribble is included directly in the GATK repo; those who have access to commit to Tribble can now directly commit from the GATK directory from Intellij; command line users can commit from
inside the tribble directory.
- Hapmap ROD now in Tribble; all mentions have been switched over.
- VariantContext does not know about GenomeLoc; use VariantContextUtils.getLocation(VariantContext vc) to get a genome loc.
- VariantContext.getSNPSubstitutionType is now in VariantContextUtils.
- This does not include the checked-in project files for Intellij; still running into issues with changes to the iml files being marked as changes by SVN
I'll send out an email to GSAMembers with some more details.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3954 348d0f76-0448-11de-a6fe-93d51630548a
b) Cosmetic change to Beagle annotation description.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3861 348d0f76-0448-11de-a6fe-93d51630548a
2) Keep track of whether vcf records are unfiltered vs. pass filters in the variant context so we can regenerate the records on output.
3) No more "ID" hard-coded all over the code to set the VariantContext ID. Use a static variable instead.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3840 348d0f76-0448-11de-a6fe-93d51630548a
2. Moved Jared's VCFTool code into archive so that everything would compile.
3. Added the vcf reference base (needed for indels) as an attribute to the VariantContext from the reader.
4. TribbleRMDTrackBuilderUnitTest was complaining that a validation file didn'r exist, so I commented it out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3835 348d0f76-0448-11de-a6fe-93d51630548a
- changed to a better method for getting headers from Codecs
- some removal of old commented out code in the GATKAgrumentCollection
- changes for the rename of FeatureReader to FeatureSource
- removed the old Beagle ROD
- cleaned up some of the code in SampleUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3826 348d0f76-0448-11de-a6fe-93d51630548a
a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read header and parse automatically the version.
b) Old VCF codec is deprecated. Reader goes now direct from parsing VCF lines into producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will be eventually deleted.
c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format.
d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved.
e) Several VCF 4.0 writer bugs are now solved.
f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data mostly uses now VCF4.0.
g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension.
Pending issues:
a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes.
b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a
- Remove some old debugging cruft regarding handling of threaded engine exceptions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3796 348d0f76-0448-11de-a6fe-93d51630548a
b) Several improvements to BeagleOutputToVCFWalker:
1. If a Hapmap input track is provided (e.g. -B comp,VCF,file), Hapmap sites will be annotated with Hapmap Allele count and allele frequency (key ACH, AFH).
2. If probability of correct genotype is lower than ncthr (optional argument provided by user, default = 0.0), walker will keep original calls instead of using Beagle calls.
3. Instead of annotating just whether Beagle had modified a site, annotate instead HOW MANY genotypes in a site were actually changed by Beagle.
All three improvements are mostly for debugging and analysis only.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3769 348d0f76-0448-11de-a6fe-93d51630548a
2. Updated liftover code (and scripts) to emit vcf 4.0 and no longer depend on VCFRecord.
3. Beagle walker now also emits vcf 4.0.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3767 348d0f76-0448-11de-a6fe-93d51630548a
b) Add "OG" annotation to genotypes. If Beagle changes genotypes, this annotation gets the original genotype call, to ease performance comparisons. If not, this annotation gets an empty value.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3723 348d0f76-0448-11de-a6fe-93d51630548a