aaron
32f324a009
incremental changes to the VCF4 codec, including allele clipping down to the minimum reference allele; adding unit testing for certain aspects of the parsing. Not ready for prime-time yet.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3604 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 06:31:05 +00:00
bthomas
300a18b85f
Updating the way reference data is processed, so GATK creates the .fasta.fai and .dict files automatically. If either (or both) don't exist, GATK will create them in the same folder as the fasta file. If it can't write the file, GATK will fail with a message to create them manually.
...
Note that this functionality will only work if the directory with the fasta is writeable. GATK will fail if directory is read only and and either the .fasta.fai or .dict files don't exist. In the future, we could have these references be created in memory, but we decided against it this time.
Locking was also added to ReferenceDataSource so no issues come up while running multiple GATKs on the same reference: we don't want one process to be half-finished and another try to read it. So, you could see error messages related to locking. See ReferenceDataSource.java for explanation of the locking strategy.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3601 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 21:42:42 +00:00
hanna
c806ffba5f
Switching over DownsamplingLocusIteratorByState -> LocusIteratorByState. Some operations
...
will not be as fast as they could be because the workflow is currently merge sam records (sharding)
-> split sam records (LocusIteratorByState) -> merge records (LocusIteraotorByState) -> split
records (StratifiedAlignmentContext), but this will be fixed when StratifiedAlignmentContext
is updated to take advantage of the new functionality in ReadBackedPileup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3599 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-21 02:11:42 +00:00
depristo
57a13805da
GATK now uses a optimized indexing scheme in Tribble. 5x or more performance gain on files with many genotypes. Updated integrationtest that was failing and was clearly wrong. DB=; isn't a valid annotation.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3596 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-19 21:36:41 +00:00
kiran
8ff93f77e6
Added evaluation module to count functional classes (missense, nonsense, etc.). At the moment, it only understands Cancer's MAF annotations. Added integration test for the functional class counting. Added better description for VariantEval.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3595 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 21:51:40 +00:00
ebanks
1e06d2bf68
Initial HLA Caller integration tests. Kind of painful, but will improve with code refactoring.
...
This baby is now officially ours.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3593 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 20:35:27 +00:00
rpoplin
724affc3cc
Major bug fixes for the Variant Recalibrator. Covariance matrix values are now allowed to be negative. When probabilities are multiplied together the calculation is done in log space, normalized, then converted back to real valued probabilities. Clustering weights have been changed to only use HapMap and by-1000genomes sites. The -nI argument was removed and now clustering simply runs until convergence. Test cases seem to work best when using just two annotations (QD and SB). More changes are in the works and are being evaluated. Misc fixes to walkers that use RScript due to CentOS changes.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3590 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 17:37:11 +00:00
aaron
c3434493b0
fixed integration test for VCF Header changes.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3589 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 16:31:48 +00:00
aaron
42e7ff4f28
forgot to update a test, the md5sum of the underlying file changed (which is recorded in the ROD tests).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3586 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 13:27:56 +00:00
aaron
b978d5946b
adding changes for VCF 4, mostly in the way we handle VCF headers. The header fields are now aware of the differences between different VCF formats. There was also a bunch of clean-up of out-of-spec VCF used in the tests (mismatched VCF file format fields, etc), and updates to the associated integration tests. Also some logging statements for BTI.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3584 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 08:23:23 +00:00
weisburd
e26a273ef5
Turned the test back on
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3582 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 22:57:42 +00:00
hanna
48cbc5ce37
Merging the sharding-specific inherited classes down into the base.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3581 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 22:36:13 +00:00
hanna
612c3fdd9d
First pass at eliminating the old sharding system. Classes required for the original sharding system
...
are gone where I could identify them, but hierarchies that split to support two sharding systems have
not yet been taken apart.
@Eric: ~4k lines.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3580 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 20:17:31 +00:00
aaron
3d049204ed
some refactoring for the variant eval output system
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3576 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-17 05:34:31 +00:00
hanna
db1383d0b2
Rev the latest version of Picard.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3575 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 23:55:07 +00:00
weisburd
5b370ffc62
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3574 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 20:42:58 +00:00
ebanks
01ffa307c2
When going NWay out in the cleaner, use the new *merged* header (instead of the original one) for each bam file so that it matches the new uniquified read group ids in the reads.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3569 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 19:36:36 +00:00
ebanks
7a91dbd490
Renamed some of the column names in Ti/Tv and Concordance modules so that they are clearer. Removed ValidationRate module (it was busted).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3564 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 15:53:06 +00:00
asivache
671ac00748
A simple utility class that implements a merging Iterator<GenomeLoc> built over an interval or bed file (this is NOT a rod, but rather a direct line-by-line file reader that converts strings to genome locs on the fly and merges overlapping intervals)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3546 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-14 15:54:37 +00:00
ebanks
8c28be5933
Fixing a VCF bug for Sendu: we weren't emitting flags (booleans) correctly in VCF3.3 (rev'ed tribble for this).
...
Updated dbsnp/hapmap membership info fields to be flags now instead of ints.
While I was there, I added the change in the Annotator for Jan to force reads to be from a specific sample.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3536 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 16:42:06 +00:00
bthomas
99b684ea89
Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:10:23 +00:00
ebanks
ca4eab1d23
Now annotations that require reads return null if there's no alignment context, so that running without reads adds annotations only for the appropriate fields.
...
Added an integration test for the read-less case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3525 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 20:36:46 +00:00
ebanks
9b2fcc4711
Refactoring of the annotation system:
...
1. VA is now a ROD walker so it no longer requires reads (needs a little more testing)
2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs)
3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong. Fixed the headers too.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:05:51 +00:00
aaron
6d5556939d
updating Tribble with a couple of important Tabix fixes, and updating the variant eval integration tests to run each test with both plain vcf and gzipped tabix (added the tabix version
...
to the vlidation directory), using the same md5sum.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3509 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 01:47:04 +00:00
depristo
6eeb1693ca
JEXL2 upgrade. Improvements to JEXL processing including dynamically resolving variable -> value bindings instead of up front adding them to a map. Performance improvements and code cleanup throughout.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3494 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 00:33:02 +00:00
depristo
3ea506fe52
No more new Allele() -- must use create. Allelel simple alleles are now cached for efficiency reasons. VCF4 codec optimizations -- 4x performance in general. Now working in general but hooked up to the ROD system now as VCF4. WARNING -- does not actually work with indels, genotype filters, etc.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3489 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 23:03:55 +00:00
aaron
0b03e28b60
updating the tribble library to include the reference dictionary reading / writing. We now check the dictionaries of any tracks that have them against the reference (all new tribble tracks and out-of-date tracks will have this). Also renamed some classes to be more reflective of their function.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3485 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 06:34:26 +00:00
depristo
e2b41082af
GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integrationtest data. Not that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 22:26:32 +00:00
aaron
8ec091d6d2
re-enabling regeneration of the tribble index if it's out of date. Also moved the class that can detect text in the log4j stream (useful in testing to make sure appropriate messages are generated).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3480 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 17:45:51 +00:00
depristo
21427211c0
Personal MD5 database system now live. WalkerTest now maintains a database of result files associated with MD5 results in integrationtest/, and provides command lines for diff-ing expected to current md5 results when encountering failed intergration tests. The suite currently takes 200Mb to store. Update and run intergrationtest to build your very own expectation database for future development work.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3466 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-31 16:06:16 +00:00
depristo
2b02324587
Support for detecting and automatically excluding reads reading into the adaptor sequence and, if desired, also only showing the first pair when two reads overlap in the fragment. Not enabled, an intermediate check in before updating and verifying the impact on locus walkers everywhere.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3465 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-30 18:00:12 +00:00
ebanks
ffeb3fd80d
Thanks to Guillermo, I found a bug in the Unified Genotyper output: GL was posteriors instead of likelihoods. Not a huge deal because the
...
priors were flat, but fixed nonetheless.
Also, needed to update Tribble.
Minor updates to the Beagle input maker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3461 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 19:28:26 +00:00
rpoplin
4e268ef6ac
Removing the Variant Recalibration Performance test because it isn't ready yet.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3460 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:27:25 +00:00
rpoplin
522dd7a5b2
Adding the variantrecalibration classes.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3459 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:21:27 +00:00
rpoplin
2014837f8a
VariantOptimizer package is moved to core, renamed as VariantRecalibration, and added to the binary release package. VariantOptimizer walker is renamed to GenerateVariantClustersWalker and ApplyVariantClustersWalker renamed to VariantRecalibrator. Integration tests added, performance tests still to be done.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3458 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:20:18 +00:00
aaron
871cf0f4f6
Call out ROD types by there record type, instead of the codec type (which was clumsy). So instead of:
...
@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFCodec.class))
you'd say:
@Requires(value={},referenceMetaData=@RMD(name="eval",type= VCFRecord.class))
Which is more in-line with what was done before. All instances in the existing codebase should be switched over.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3457 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:52:44 +00:00
depristo
cc2bf549c8
Removing my unnecessary optimization. 10 lines later in the code the same optimization was applied. A monumental waste of time.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3455 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:10:48 +00:00
aaron
a4d834cc01
fixing the test I broke
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3454 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 02:06:20 +00:00
depristo
f2e7582cfc
Reorganization of SW code for clarity. Totally failure at raw optimization. Discovered that ~50% of reads being cleaned were perfect reference matches. New code comes with flag to look at NM field and not clean perfect matches. Can we turned off with command line option (needed for 1KG bams with bad NM fields). Going to rerun cleaning jobs due to accidentally rebuilding of stable codebase and loss of 2 days of runtime.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3452 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 23:16:00 +00:00
ebanks
058441fa39
Trivial renaming of test
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3441 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 16:56:42 +00:00
aaron
a2fab07258
fixed the build problem: there were two copies of the AnnotatorInputTable Codec and Feature in two different spots.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3439 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 14:47:15 +00:00
chartl
88a06ad81f
Changes to Depth of Coverage:
...
- For speedup in large number of samples, base counts are done on a per read group level, then
merged into counts on larger partitions (samples, libraries, etc)
+ passed all integration tests before next item
- Added additional summary item, a coverage threshold. Set by (possibly multiple) -ct flags,
the summary outputs will have columns for "%_bases_covered_to_X"; both per sample, and
per sample per interval summary files are effected (thus md5s changed for these)
NOTE:
This is the last revision that will include the per-gene summary files. Once DesignFileGenerator is sufficiently general, and has integration tests, it will be moved to core and the per-gene summary from Depth of Coverage will be retired.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3437 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 03:39:22 +00:00
ebanks
0607f76a15
commenting out this test until I can figure out what the hell is going on with the codecs.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3436 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 01:12:10 +00:00
ebanks
ae6c014884
Fixed UG parallelization bug. Better integration test to catch this in the future.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3432 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 21:03:45 +00:00
ebanks
434e920da9
Oops, forgot to update integration tests
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3431 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 20:37:45 +00:00
delangel
a280a0ff0d
a) Made HaplotypeScore default annotation. This changed several integration tests, whose MD5 is now updated.
...
b) Disabled BaseQualRankSumTest, the returned p-values differ wildly from Matlab/R-provided ones, cause TBD.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3419 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 22:25:17 +00:00
chartl
745d7c582f
added integration test for intervals with no coverage due to filtering
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3414 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 16:52:42 +00:00
chartl
88cb93cc3c
Changes to Depth of Coverage (added maximum base and mapping quality flags; with new integration tests -- because they use b36, and the other test uses hg18, it's in a different class (integration test system can't change refs on the fly). Initial change to VariantAnnotator to allow it to see extended event pilups; you currently have to throw the -dels flag; and it's specified as "very experimental". Yet,all the integration tests pass.
...
Homopolymer Run now does the "right" thing (e.g. single bases are represented as HRun = 0 rather than HRun = 1) for indels. AlleleBalance now does something close enough to correct.
Added a convenience method to VariantContext that will return the indel length (or lengths if a site is not biallelic).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3409 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 13:02:01 +00:00
depristo
6faf101c6c
Minor improvements to Callable Loci for public consumption
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3408 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-21 12:50:11 +00:00
depristo
a10fca0d5c
Genotyper now is using bytes not chars. Passes all tests.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3406 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-20 21:02:44 +00:00