Commit Graph

52 Commits (cbce3e3c83c72a8c7dff7b8fec00f00a2b419e83)

Author SHA1 Message Date
hanna 83b8676b69 Hack to fix mysterious disappearing read attributes. Ultimately caused
by the fact that the GATKSAMRecord, by design, needs to both inherit from 
SAMRecord and wrap a 'member' SAMRecord, and method calls that aren't
implemented as explicit passthroughs can compromise the content of the
SAMRecord in subtle ways.

Will be automatically fixed when Picard moves to a lightweight SAMRecord
interface rather than the current heavyweight implementation.  But in 
the short-term, there's no obvious fix.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4489 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 19:06:54 +00:00
ebanks a0231f073f Damnit. Enabling the Picard code to recalculate all of the relevant SAMRecord attribute tags means that I need to have reference bases over all read bases even after realignment (and there are some big indels in dbsnp). Fortunately, I have my trusty IndexedFastaSequenceFile reader handy! Re-enabling the previously broken performance test.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4255 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-12 05:06:37 +00:00
aaron cf33614ddc remove the test that's failing the performance tests, please don't release until this is figured out
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4251 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 06:30:40 +00:00
ebanks 65edbced36 Addition for Tim: recalculate the NM and UQ tags after realignment. Also, don't fix the insert size calculation, since that's done by fix mate information.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4227 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 04:02:14 +00:00
hanna 70bb480939 The battle is over. Picard is revved.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4200 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 05:28:01 +00:00
hanna bf0b6bd486 Update integration tests to use the new ROD syntax.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4112 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-25 18:13:30 +00:00
ebanks c9c6ff49c2 Deprecated 'O' in favor of 'o' in the cleaner
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4085 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-23 18:09:24 +00:00
ebanks 1ec305cd15 Fix for running the cleaner at the lane-level for known indels only: instead of relying on the reads to get the reference sequence, we now use an IndexedFastaSequenceFile in all cases and pad the reference with bases on either end. This allows us to deal with cases in which we are trying to clean just a single deletion-containing read with tiny LOD (so the read needs to be pushed off the seen reference; @Reference doesn't yet work for Read Walkers) and has the added benefit of allowing us now to get much larger known indels that aren't completely covered with reads.
Thanks to Matt for the advice.

Also, for Guillermo: while I was at it, I changed the .stats debug output to emit the original interval instead of the cleaned region.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4058 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 11:31:13 +00:00
ebanks ac4699a650 Re-enabling this test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3962 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 20:20:37 +00:00
ebanks 340bd0e2c1 Removed hard-coded pointers to references
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3934 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 17:59:37 +00:00
delangel 473ec91633 a) Bug fix in VCFHeader parsing - Info fields were not being parsed properly, with the result that the Count field was not being properly displayed in records (e.g. if Count=0 for a particular field, the INFO tag was still being displayed as ...;Field=x;... instead of ...;Field;...
b) Bug fixes and update to how we represent indels and other complex events in a VariantContext object. Convention is now that all events are left aligned, with the first variant context location marking the common base before an event occurs. However, alleles in a VC don't have the common base in all VC's. Two new functions are now part of VariantContextUtils: CreateVariantContextWithPaddedAlleles and CreateVariantContextWithTrimmedAlleles. Both take a VC as an input and create a VC as an output.
Main flow is that a VCF reader would create a VC with trimmed alleles, all walkers would ideally work with these trimmed alleles, and then the VCF writer would pad back the alleles before writing. However, there are special cases where we need to pad alleles like for example when merging/combining VC's.

Pending issues:
- PED and DBSNP RODs have to be updated to create VC's for indels following the convention above. Changes will go in after Tribble location is moved and things are tested.
- Need to verify Indel genotyper and other modules that create VC's with indels.- Wiki page describing convention above and how walkers should interpret indel VC's still needs updating/detailing.
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3850 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-22 02:36:45 +00:00
ebanks 379584f1bf Re-enable (most of) these tests. Guillermo will re-enable the other one when the VCF->VC conversion is done for indels
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3821 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-18 03:24:28 +00:00
delangel 55b756f1cc First step in major cleanup/redo of VCF functionality. Specifically, now:
a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read header and parse automatically the version. 
b) Old VCF codec is deprecated. Reader goes now direct from parsing VCF lines into producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will be eventually deleted.
c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format.
d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved.
e) Several VCF 4.0 writer bugs are now solved.
f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data mostly uses now VCF4.0.
g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension.

Pending issues:
a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes.
b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 22:49:16 +00:00
aaron 36ac73cf9a comment out broken test until it can be fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3810 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 20:04:40 +00:00
ebanks e7e58d7129 The SAM spec has now officially reserved my new tags for original cigar and original alignment start... except that OS has been named OP ('original POS')
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3800 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 00:09:36 +00:00
hanna 9fc05ac2ae eagerDecode is now false.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3733 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 22:51:48 +00:00
ebanks 30714ec8d9 As per quick chat with Richard Durban, don't increase the mapping quality of realigned reads too much; for now, arbitrarily increase the MQ by 10. We need to figure out a better solution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3731 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-07 20:12:59 +00:00
ebanks 12c0de6170 Added ability to clean using only known indels. Added integration test for it. Fixed vcf->vc conversion for indels which was busted.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3678 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 01:20:56 +00:00
ebanks 4a451949ba add parallel option to target creator for masking out reads with bad mates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3663 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 22:13:25 +00:00
ebanks 6a23edd911 Fix performance tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3662 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 21:51:48 +00:00
ebanks 1292c96e29 The cleaner now adds the OC (original cigar) and OS (original alignment start) tags as appropriate to reads that get realigned; this feature can be turned off. Also, improved integration tests (sorry, Kiran!).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3657 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 16:46:47 +00:00
ebanks bf5cbad04c Make the target creator a rod walker (that allows reads) so that we can easily trigger the cleaner on only known indel sites. Adding an integration test to cover this case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3651 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-28 13:28:37 +00:00
ebanks b6bceb39b0 Fixing up output for performance tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3619 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 17:00:17 +00:00
ebanks b5df2705c9 -Remove Nway output option
-Remove in-memory sorting
-Default to name-sorting (although we allow coordinate sorting with the --sortInCoordinateOrderEvenThoughItIsHighlyUnsafe flag).

Cleaner, faster code.  Wiki has been updated (including how to use FixMateInformation.jar from Picard).  More changes coming soon.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3612 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 20:31:55 +00:00
ebanks 01ffa307c2 When going NWay out in the cleaner, use the new *merged* header (instead of the original one) for each bam file so that it matches the new uniquified read group ids in the reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3569 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-16 19:36:36 +00:00
depristo cc2bf549c8 Removing my unnecessary optimization. 10 lines later in the code the same optimization was applied. A monumental waste of time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3455 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 14:10:48 +00:00
depristo f2e7582cfc Reorganization of SW code for clarity. Totally failure at raw optimization. Discovered that ~50% of reads being cleaned were perfect reference matches. New code comes with flag to look at NM field and not clean perfect matches. Can we turned off with command line option (needed for 1KG bams with bad NM fields). Going to rerun cleaning jobs due to accidentally rebuilding of stable codebase and loss of 2 days of runtime.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3452 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-27 23:16:00 +00:00
ebanks 058441fa39 Trivial renaming of test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3441 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-26 16:56:42 +00:00
ebanks 434e920da9 Oops, forgot to update integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3431 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-25 20:37:45 +00:00
ebanks 9dff578706 Added PG tag to bam header to let people know it's been cleaned.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3284 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-30 17:30:30 +00:00
ebanks c44f63c846 Fixing the performance tests: we need to catch the RuntimeException (not samtools' RuntimeIOExcpetion). Also, CountCovariates doesn't need the catch.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3196 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 14:28:12 +00:00
aaron be7cbf948b adding a catch for the exception thrown by samtools when it attempts to close /dev/null in the performance tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3186 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-16 17:41:48 +00:00
ebanks d06c7835d8 Adding performance tests for the indel realigner; should take ~3 hours.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3151 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-11 04:45:22 +00:00
ebanks 40d305bc7e Added test of Nway cleaning for Matt; thanks to Aaron for the help.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2977 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 21:00:41 +00:00
ebanks 0dd65461a1 Various improvements to plink, variant context, and VCF code.
We almost completely support indels. Not yet done with plink stuff.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2926 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 17:58:01 +00:00
ebanks 8b555ff17c Killed the old cleaner code. Bye bye.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2868 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-22 20:49:58 +00:00
ebanks 79ab7affda - Change sortOnDisk option to sortInMemory
- Fix horrible cleaner bug
- Trivial optimizations to cleaner code - more significant ones coming soon.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2850 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-17 20:52:57 +00:00
ebanks 476d6f3076 RealignerTargetCreator is officially live
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2697 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 03:41:52 +00:00
aaron a34c2442c0 moved hard-coded file paths to the oneKGLocation, validationDataLocation, and seqLocation variables setup in the BaseTest.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2460 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 07:40:48 +00:00
hanna 2bd0b1bbf7 After further review, it's unclear that my patch in RecalDataManager was the right choice. Reverting.
Also updating other IntervalCleanerIntegrationTest failures that were masked by my first patch.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2440 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 00:32:33 +00:00
hanna 98c268483e Fixed issues with the integration tests:
1) sam-jdk apparently no longer supports custom tags with type int[] values.
2) BAM output for indel cleaner integration test changed in a way that's so subtle it can't be seen after converting the output to .sam.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2439 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 23:12:22 +00:00
ebanks 87e5a41964 Fixed a bug that accounted for a bunch of my remaining mis-cleaned indels.
Also, slightly optimized the cleaner by using readBases (instead of readString) and caching cigar element lengths.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2419 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 05:46:16 +00:00
asivache e6cc7dab26 fixing md5 sum; new version of IndelIntervalWalker does the right thing...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2388 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 01:04:13 +00:00
hanna 7c386fa428 Another case of reordering of read groups blowing up checksums.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2030 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 00:07:35 +00:00
aaron 2e4949c4d6 Rev'ing Picard, which includes the update to get all the reads in the query region (GSA-173). With it come a bunch of fixes, including retiring the FourBaseRecaller code, and updated md5 for some walker tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1751 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 20:37:59 +00:00
ebanks 1362a56227 Added fasta tests and small fix to cleaner test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1575 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-10 03:13:11 +00:00
ebanks bed646e4f6 Adding cleaner test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1561 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 16:05:56 +00:00
aaron 0df6a9da5c -Seperating out normal (unit) tests and integration tests. From now on if your test are more of an integration test (i.e. you're testing a walker and all the subunits it relies on) please name the test "______IntegrationTest.java" instead of "______Test.java".
-Bamboo will now run the integration tests once a day, and the normal units tests on each check-in.

-Also added a bunch of unit tests for VariantEval walker

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1555 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 15:01:40 +00:00
ebanks 0cc219c0df -Added unit test for walkers dealing with intervals for cleaning
-I also uncovered a corner case in the cleaner that for some reason was commented out but shouldn't have been.  Hooray for unit tests!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1553 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 02:35:17 +00:00
hanna ccdb4a0313 General-purpose management of output streams.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1454 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-23 00:56:02 +00:00