Commit Graph

3509 Commits (06fc5eecf80df9360894e971c8513faefdc8086e)

Author SHA1 Message Date
weisburd 06fc5eecf8 Implemented TreeReducible. If num threads > 1, the output is accumulated in memory and written to a VCF file at the end, in onTraversalDone(..). If num threads == 1, things work as before: VCF records are written to disk as soon as they are computed with map(..).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3530 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:57:23 +00:00
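The buffering behaviour described in the commit above can be sketched roughly as follows. This is an illustration only, not the walker's actual code: `RecordOutputSketch`, `emit`, `finish` and the `Consumer<String>` sink are hypothetical names, and the sketch does not implement GATK's TreeReducible interface itself. It only shows the "write immediately when single-threaded, accumulate and flush at the end otherwise" decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: buffer records when multi-threaded, stream them otherwise. */
final class RecordOutputSketch {
    private final boolean multiThreaded;
    private final Consumer<String> sink;                 // stands in for the VCF file writer
    private final List<String> pending = new ArrayList<>();

    RecordOutputSketch(int numThreads, Consumer<String> sink) {
        this.multiThreaded = numThreads > 1;
        this.sink = sink;
    }

    /** Called once per computed record, in the role map(..) plays in the commit message. */
    void emit(String record) {
        if (multiThreaded) {
            synchronized (pending) {
                pending.add(record);                     // accumulate in memory
            }
        } else {
            sink.accept(record);                         // write as soon as it is computed
        }
    }

    /** Called once at the end, in the role onTraversalDone(..) plays in the commit message. */
    void finish() {
        synchronized (pending) {
            for (String record : pending) {
                sink.accept(record);
            }
            pending.clear();
        }
    }
}
```

With a thread count of 1 nothing is ever buffered, so `finish()` is a no-op; with more threads the sink sees no records until `finish()` runs.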
weisburd 3b375cb237 Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..) - attempt 2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3529 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:54:36 +00:00
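As a rough illustration of the regexp-to-indexOf change named in the commit above, a location string can be split with two `String.indexOf` calls instead of a compiled pattern. The class `GenomeLocSketch` and the accepted formats (`contig:start` and `contig:start-stop`) are assumptions made for this sketch; the real parseGenomeLoc(..) may accept other forms.

```java
/** Hypothetical sketch of indexOf-based parsing of "contig:start[-stop]" strings. */
final class GenomeLocSketch {
    final String contig;
    final long start;
    final long stop;

    private GenomeLocSketch(String contig, long start, long stop) {
        this.contig = contig;
        this.start = start;
        this.stop = stop;
    }

    static GenomeLocSketch parse(String str) {
        int colon = str.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected contig:start[-stop], got: " + str);
        }
        String contig = str.substring(0, colon);
        // Look for the range separator only after the colon.
        int dash = str.indexOf('-', colon + 1);
        long start;
        long stop;
        if (dash < 0) {
            // Single position: start and stop coincide.
            start = Long.parseLong(str.substring(colon + 1));
            stop = start;
        } else {
            start = Long.parseLong(str.substring(colon + 1, dash));
            stop = Long.parseLong(str.substring(dash + 1));
        }
        return new GenomeLocSketch(contig, start, stop);
    }
}
```

For example, `GenomeLocSketch.parse("chr1:100-200")` yields contig `chr1`, start 100 and stop 200. Malformed numbers surface as `NumberFormatException` from `Long.parseLong`.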
aaron e27951ab39 re-updating the VCF code to handle spaces in sample names
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3528 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:18:34 +00:00
bthomas 99b684ea89 Adding new support for reference data. ReferenceDataSource is a new class that manages reference data, and allows IndexedFastaSequenceFile to be a simple reader. This checkin also includes FastaSequenceIndexBuilder, which reads a fasta file and creates an index, like samtools faidx. Right now this is not enabled, because we are still working out thread safety. So the only new UI change is that GATK can be run without a fai file. Soon, we will enable 1) GATK to be run without a dict file too, and 2) both dict and fai files will be saved on disk for future program executions. For more info, see ReferenceDataSource.java
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3527 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:10:23 +00:00
hanna f55f32d4ee Bug fix.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3526 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 01:53:26 +00:00
ebanks ca4eab1d23 Now annotations that require reads return null if there's no alignment context, so that running without reads adds annotations only for the appropriate fields.
Added an integration test for the read-less case.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3525 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 20:36:46 +00:00
aaron 6941c81bfa reverting revision 3522 to the old code until we fix the tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3524 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 19:25:02 +00:00
hanna dbee21a50f Bugfixes for the case when no read groups / no samples are available.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3523 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 18:47:05 +00:00
weisburd adc4c4e577 Sped up parseGenomeLoc(..) by replacing regexp with String.indexOf(..)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3522 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 18:11:43 +00:00
chartl 20167fd411 Final changes to MVC -- associates variants with regions of homozygosity in child and parents, corrects for genotype errors, and prints out a separate file with information for each region of homozygosity.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3521 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 18:05:37 +00:00
weisburd fdded73861 Improved error reporting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3520 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:52:48 +00:00
aaron 4f00e265a8 quick update for a change I implemented for Ryan
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3519 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:23:31 +00:00
aaron ad98512f6c adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:21:12 +00:00
weisburd c1b7bcc786 Fixed handling of mitochondrial genes - added special cases such as ATT being a start codon in mitochondria. Added warning if a gene doesn't start with Met or end in a stop codon
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3517 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:15:47 +00:00
weisburd 4f1181974b Added toString() method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3516 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:12:57 +00:00
weisburd 6fd2d39a7d Modified run_locally mode to use os.system(..) instead of popen
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3515 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:10:03 +00:00
weisburd a3ccf49f5b Write error to stderr
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3514 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:09:10 +00:00
ebanks 9b2fcc4711 Refactoring of the annotation system:
1. VA is now a ROD walker so it no longer requires reads (needs a little more testing)
2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs)
3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong.  Fixed the headers too.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:05:51 +00:00
hanna 84563b37e5 Partial flattening of the hanger data structure. Hanger data structure is
not currently as flat as it could / should be, but it's already comparable
to the speed of the reference implementation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3512 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 16:28:49 +00:00
chartl 8f9e3e8ad7 Commit for Kiran; but this is now working, barring little exceptions that I've yet to run across...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3511 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 14:21:19 +00:00
aaron 6febd0291d rev tribble to include some dbsnp clean-up and fixes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3510 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 03:08:31 +00:00
aaron 6d5556939d updating Tribble with a couple of important Tabix fixes, and updating the variant eval integration tests to run each test with both plain vcf and gzipped tabix (added the tabix version
to the validation directory), using the same md5sum.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3509 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 01:47:04 +00:00
weisburd 2b31975cb4 Added more options for coordinate systems - now you can add 1 to either the start coordinates, the end coordinates, or both
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3508 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 22:49:19 +00:00
weisburd 410afcdf2c Added parallelization options - when running locally, multiple processes can be spawned, or a -nt arg can be specified to run each TranscriptToInfo instance multi-threaded
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3507 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 22:48:07 +00:00
weisburd 92c72d3361 Added back lines that update the *big-table-header.txt file before using it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3506 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 22:45:41 +00:00
weisburd 3c24223d02 Script for concatenating 2 AnnotatorInputTables, and writing the result to standard out. Merge-sorts the 2 tables while concatenating them
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3505 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 22:44:16 +00:00
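The merge step the commit above describes, combining two already-sorted tables into one sorted output, can be sketched as below. The original is described as a script and its row format and sort key are not shown here, so `SortedMergeSketch` is a hypothetical, generic stand-in that takes the ordering as a `Comparator`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: merge two lists that are each already sorted by `order`. */
final class SortedMergeSketch {
    static <T> List<T> merge(List<T> a, List<T> b, Comparator<? super T> order) {
        List<T> out = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        // Repeatedly take the smaller head element; ties go to the first list.
        while (i < a.size() && j < b.size()) {
            if (order.compare(a.get(i), b.get(j)) <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        // At most one of the two lists still has rows left.
        while (i < a.size()) {
            out.add(a.get(i++));
        }
        while (j < b.size()) {
            out.add(b.get(j++));
        }
        return out;
    }
}
```

The result is only sorted if both inputs already are; the sketch does not check that.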
hanna c2858c8988 Minor performance enhancement. Checkpoint commit before major performance
overhaul.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3504 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 21:39:39 +00:00
chartl 5ed2818ffb Forgot to commit code I relied upon
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3503 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 21:01:35 +00:00
chartl 736098b58d A quick commit before running home. This is a re-factored version of the OppositeHomozygoteClassifier which will work with deNovo violations as well. Some code still needs to be migrated from OHC, which is why that walker isn't yet deleted. This'll be up and running tonight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3502 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 20:47:01 +00:00
delangel de134c226d Removed ability of users to specify annotations to recompute, cleanups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3501 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 19:17:59 +00:00
ebanks 4d1a6b3d99 quick changes for G
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3500 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 16:33:27 +00:00
delangel 907931c902 a) Update annotations when creating new vcf with Beagle's imputed data. Since genotypes may (will) change based on imputation, several annotations need to be updated. By default, AC, AF, AN and AB will be updated. User can force extra annotations to be updated with -A <annotation> argument.
b) Several cleanups and beautifications.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3499 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 15:12:04 +00:00
chartl 933133ee28 Initial commit of the opposite homozygote classifier. Currently does the following, given a trio vcf:
+ Identifies opposite homozygote sites
 + Identifies the parent from whom it is expected that a null allele was inherited (or whether it was a putative genotype error; e.g. mom=homref, dad=homref, child=homvar)
 + Labels each opposite homozygote with its homozygous region in the child (e.g. region 1, region 2)
 + Labels each opposite homozygote with the size of the homozygous region in which it was found, the number of child homozygotes in the region, and the number of opposite homozygote violations within that region

To come:
 + Classification of sites as likely tri-allelic


Note that this is very experimental



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3498 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 03:56:07 +00:00
hanna 199e4208cd Bug fixes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3497 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-08 00:30:33 +00:00
hanna 52ab9f2417 Feature parity between LocusIteratorByState, DownsamplingLocusIteratorByState, including pushing mrl /
the LocusOverflowTracker into LocusIteratorByState.  Note that the 'Matt Hanna exception', is still enabled
because I haven't yet validated the performance of the DownsamplingLocusIteratorByState when running
without downsampling.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3496 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 22:58:21 +00:00
hanna 5c4d070566 Push Mark's changes in LocusIteratorByState into DownsamplingLocusIteratorByState
in preparation for merging the two into one.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3495 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 17:29:30 +00:00
depristo 6eeb1693ca JEXL2 upgrade. Improvements to JEXL processing including dynamically resolving variable -> value bindings instead of up front adding them to a map. Performance improvements and code cleanup throughout.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3494 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-07 00:33:02 +00:00
hanna c1ecf75dd5 Update to the latest rev of the picard sharding patch. Includes updates reflecting
the imminent move of IlluminaUtil into picard public.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3493 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-06 20:33:21 +00:00
delangel c503f01dcf More cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3492 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-06 17:41:38 +00:00
delangel d4c66d6191 a) Small cleanup
b) Fix major issue with Beagle likelihood converter: if likelihood triplets from UG end up being too low, then Beagle input file will be produced with 0.00,0.00,0.00 triplet. If all samples at a marker have this issue, Beagle will effectively produce junk. To fix, likelihoods are renormalized before converting to linear space.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3491 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-06 17:31:59 +00:00
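The renormalization fix in the commit above can be illustrated with a small sketch. It assumes the likelihoods arrive as log10 values, one triplet per sample; `LikelihoodNormalizer` and its method name are hypothetical. Subtracting the largest log value before exponentiating keeps at least one entry at 1.0 in linear space, so a very low triplet no longer collapses to 0.00,0.00,0.00.

```java
/** Hypothetical sketch: renormalize log10 likelihoods before converting to linear space. */
final class LikelihoodNormalizer {
    static double[] toNormalizedLinear(double[] log10Likelihoods) {
        // Find the largest log10 value so it can be used as the scaling reference.
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Likelihoods) {
            max = Math.max(max, v);
        }
        double[] linear = new double[log10Likelihoods.length];
        double sum = 0.0;
        for (int i = 0; i < linear.length; i++) {
            // 10^(x - max) is at most 1.0, and exactly 1.0 for the best entry.
            linear[i] = Math.pow(10.0, log10Likelihoods[i] - max);
            sum += linear[i];
        }
        // Scale so the entries sum to 1.
        for (int i = 0; i < linear.length; i++) {
            linear[i] /= sum;
        }
        return linear;
    }
}
```

A triplet such as -400, -401, -402 underflows to zero if converted directly with `Math.pow(10.0, x)`, but stays informative after the shift. The sketch does not handle the case where every input is negative infinity.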
depristo cfa18f6743 Fixing missed update with new Allele in it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3490 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 23:56:34 +00:00
depristo 3ea506fe52 No more new Allele() -- must use create. All simple alleles are now cached for efficiency reasons. VCF4 codec optimizations -- 4x performance in general. Now working in general but hooked up to the ROD system now as VCF4. WARNING -- does not actually work with indels, genotype filters, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3489 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 23:03:55 +00:00
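The "no constructor, use create, cache simple alleles" pattern from the commit above looks roughly like the sketch below. `AlleleSketch` is a hypothetical class, not the real Allele; treating "simple" as "a single base from A, C, G, T" is an assumption made for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: static factory that returns shared instances for single-base alleles. */
final class AlleleSketch {
    // Populated once at class initialization and only read afterwards.
    private static final Map<String, AlleleSketch> CACHE = new HashMap<>();

    static {
        for (String base : new String[] {"A", "C", "G", "T"}) {
            CACHE.put(key(base, true), new AlleleSketch(base, true));
            CACHE.put(key(base, false), new AlleleSketch(base, false));
        }
    }

    private final String bases;
    private final boolean isReference;

    // Private constructor: callers must go through create(..).
    private AlleleSketch(String bases, boolean isReference) {
        this.bases = bases;
        this.isReference = isReference;
    }

    private static String key(String bases, boolean isReference) {
        return bases + (isReference ? "*" : "");
    }

    static AlleleSketch create(String bases, boolean isReference) {
        if (bases.length() == 1) {
            AlleleSketch cached = CACHE.get(key(bases, isReference));
            if (cached != null) {
                return cached;              // shared instance, no allocation
            }
        }
        return new AlleleSketch(bases, isReference);
    }

    String getBases() {
        return bases;
    }

    boolean isReference() {
        return isReference;
    }
}
```

Because the constructor is private, every caller gets the cached instance for a single-base allele, while longer alleles are still allocated per call.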
delangel ef47a69c50 a) First fully functional (sort of) version of walker that parses Beagle imputation output files and produces a vcf with imputed genotypes.
More doc/info to follow shortly. Issues still to be solved:
a) Walker changes all genotypes based on Beagle data, but annotations on the original VCF are unchanged. They should in theory be recomputed based on new genotypes.
b) Current implementation is ugly, dirty, unwieldy and will necessitate a refactoring soon so I can keep my pride. Most aesthetically affronting issue right now is that we read the full Beagle files at initialization and keep them in memory, but a more delicate implementation would just read from files on a marker by marker basis. Issue that currently prevents this is that BufferedReader() instances don't seem to play nice when called from the map() function.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3488 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 20:37:25 +00:00
depristo b811e61ae1 Optimized, nearly complete VCF4 reader 2-4x faster than the previous implementation, along with a VCF4 reader performance testing walker that can read 3/4 files, useful for benchmarking
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3487 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 18:11:38 +00:00
aaron 6482b87741 adding the super experimental, half-broken, generally crippled, awkwardly commented, header ignoring vcf4 code. Don't use this, unless you're a developer for VCF4. If so, remove the exception from the constructor so that it won't always exception out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3486 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 07:38:46 +00:00
aaron 0b03e28b60 updating the tribble library to include the reference dictionary reading / writing. We now check the dictionaries of any tracks that have them against the reference (all new tribble tracks and out-of-date tracks will have this). Also renamed some classes to be more reflective of their function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3485 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-04 06:34:26 +00:00
hanna 3d055e3d16 Fail fast if users try to parallelize a read walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3484 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-03 18:14:33 +00:00
hanna 7d79848f40 Better error message when bam file / list file with wrong extension is
supplied.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3483 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-03 17:52:48 +00:00
ebanks 597b3744ab Always use phasing info when converting genotypes to strings
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3482 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-03 17:50:50 +00:00
depristo e2b41082af GATK now does automatic adaptor filtering in locus iterators (but not expt. downsampling iterator). General support for LocusIteratorFilters just like read filters but only applying at particular bases. Updated tools with new MD5 sums due to adaptor bases in their integration test data. Note that as a side effect here reads close to each other with odd orientations are also filtered out. Updated minor argument to VariantRecalibrator to change the qStep value on the command line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3481 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-02 22:26:32 +00:00