Commit Graph

1823 Commits (fa2d564f2c8b970c97818e28691ec7d8a0d9be58)

Author SHA1 Message Date
asivache e3b4d4cbed Genotyper reimplemented. Does the same thing, at least for now, but internal data structures redesign enables collecting various statistics for indel-containing/reference-matching reads. The statistics are not yet used by the caller itself to make a better judgement w.r.t. the validity of the calls it makes, but they are now printed into the output stream (--verbose). The statistics (for both normal and tumor) include: indel observation count/total coverage, av. number of mismatches per indel-containing and per ref-matching read, av. mapping quality, av. mismatch rate and av. base quality within an NQS windoew around the indel, numbers of indel and ref observations per strand.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1936 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 19:09:16 +00:00
hanna f04b80d7db Fixed epic memory leak.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1934 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 16:32:43 +00:00
ebanks 2b96b2e4e7 better multi-sample integration test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1933 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 13:51:51 +00:00
ebanks 1c4ca9d383 -Mark just reminded me: actually force the ref/loc to be immutable
-VCF writer should be blind to the score/confidence/lod value - just print the thing out as is


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1932 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 13:41:53 +00:00
ebanks 5cdbdd9e5b now that the design is stable, pull the setReference and setLocation methods back out of Genotype and stick them into constructors of implementing classes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1931 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 13:27:37 +00:00
ebanks 3091443dc7 Sweeping changes to the genotype output system, as per several discussions with Matt & Aaron.
Some things still need to be changed, but it will entail some more design decisions first (which means I get to bug M&A again tomorrow!).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1930 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 03:46:41 +00:00
depristo 86573177d1 Reverting rod walkers to use underlying refwalker implementation while we work on ROD2 and reenable the system. Added some serious sparse file parsing to variant eval tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1929 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 01:04:37 +00:00
hanna c9a3707cfd Initial version of BWA/C bindings. Still lots of squirrels roaming the code.
- Some cigar strings aren't right.
- Memory leaks.
- BWA codebase changes aren't committed to BWA tree.
- Aligner interface butchered to support BWA/C-style alignments.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1928 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 21:37:49 +00:00
chartl c4359bc340 Whoops. Forgot the implements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1927 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 19:59:57 +00:00
aaron 5a3bd50537 adding error log reporting to the GATK, and a stream based output method for the argument collection
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1926 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 19:56:05 +00:00
chartl 863d3023d5 IndelCounterWalker -- a new little walker that counts indels over a region (want to see what kind of havoc BWA may be resulting in). Don't know when BasicPileup.indelPileup() was written, but kudos to whoever wrote it.
BTTJ - remove 'N's from previous base analysis -- even if both read and ref are 'N' (which does happen, occasionally)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1925 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 19:50:50 +00:00
aaron 04e9a494e9 removed the GenotypesBacked interface, which is currently unused. Also cleaned up some documentation lines
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1924 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 18:08:14 +00:00
rpoplin 06ff81efe5 Added NeighborhoodQualityWalker.java and ReadQualityScoreWalker.java which are used to calculate a read quality score based on attributes of the read and the reads in the neighborhood.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1922 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 13:24:11 +00:00
depristo 68fa6da788 Initial graph-based reference implementation and alignment assessor. Not suitable for public use
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1921 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 21:54:47 +00:00
depristo 31d143a841 now only needs READS
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1920 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 21:54:14 +00:00
depristo ef2ea79994 code cleanup and containsStartPosition function
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1919 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 21:53:40 +00:00
depristo 186a8dd698 Trivial protection for null value
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1918 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 21:52:52 +00:00
depristo be333da9c0 charSeq2byteSeq -- convert a char[] to a byte[] for convenience
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1917 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 21:52:23 +00:00
chartl 4192b093b8 More robust error handling with parallelization + usePreviousBase. Added forceReadBasesToMatchRef to use in conjunction with nPreviousReadBases as a less stringent approximation of usePreviousBases (requiring previous pileups only had mismatches, and that read mapping quality be high was throwing everything away)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1916 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 17:20:44 +00:00
chartl 31d5df2859 Previous base now checks that the read matches the reference in the previous base window.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1915 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 15:58:20 +00:00
depristo 726378be8b Almost ready to stop doing eagar decoding; waiting on Eric
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1914 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 15:28:05 +00:00
ebanks e96b1791ab Need to check for biallelic snp or exception gets thrown.
Also, update to new tracker calls.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1913 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 02:43:43 +00:00
aaron 3fb3773098 a fix for traverse dupplicates bug: GSA-202. Also removed some debugging output from FastaAltRef walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1912 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-26 20:18:55 +00:00
hanna a1e8a532ad Support for initialize() and onTraversalDone() output from parallelized walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1911 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-26 20:18:31 +00:00
chartl 62c1001790 BTTJ is now correct. What a terrible waste of time, turns out I'd just reversed the header. Because of this the MD5 had to be updated in the tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1910 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-26 19:24:18 +00:00
sjia 24c7f694e6 Handles allele frequencies for any specified population, changed user input for mismatch filter options
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1909 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-25 22:51:56 +00:00
chartl db9419df49 @ Hack to allow output from onTraversalDone()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1908 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-25 15:19:04 +00:00
ebanks 75ad6bbef7 Check that map isn't being called passing in null arguments.
(This seems wrong; see JIRA entry GSA-211)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1907 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-25 02:30:36 +00:00
depristo b4f55df600 Bugfix for Jason F
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1906 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-24 22:09:27 +00:00
hanna 65b98470f3 Temporary fix: have RodLocusView manage and close its RODs. Really the
relationship between these two classes needs to be rethought; see JIRA
GSA-207.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1904 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 16:00:12 +00:00
aaron ad1fc511b1 intermediate commit for some changes in the Variation system, so Eric can go ahead with his changes. Everything is pretty set, but the Variation interface could use a convenience method that joins all the alternate alleles.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1903 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 06:31:15 +00:00
ebanks 6c338eccb8 Joint Estimation model now emits calls in all formats.
The whole GenotypeCall framework needs to be changed, but this will work for the time being.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1902 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 03:07:28 +00:00
chartl a6dc8cd44e BTTC is now Tree Reducible allowing for parallelization.
Integration test comment changed to reflect actual date of last md5 update.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1901 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 23:19:29 +00:00
hanna 2e552eb5a1 Validates intervals against sequence dictionary header bounds.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1900 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 19:31:15 +00:00
ebanks 54c61c663c -Cleanup of the Joint Estimation code
-Don't print verbose/debugging output to logger, but instead specify a file in the argument collection (and then we only need to print conditionally)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1899 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 15:25:29 +00:00
asivache 2cab4c68d4 Added method: isCodingExon(). Returns true if position is simultaneously within an exon AND within coding interval of any single transcript from the list. The old method of detecting coding positions as isExon() && isCoding() is buggy, as the position could be in the UTR part of one transcript (isExon() is true), and within coding region bounds (but not in the exon) of another transcript (isCoding() is true). As a result UTR positions would be erroneously annotated as coding.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1898 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 14:55:07 +00:00
chartl af761fb9bd Base transition table now forces epsilon/3 (three-state) model for the unified genotyper. Verified to be identical with changing the default model to being epsilon/3. This of course changes the observed counts, so the integration test has been updated.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1897 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 21:18:26 +00:00
ebanks 55fa1cfa06 -Renamed new calculation model and worked out some significant xhanges with Mark
-Allow walkers calling the UG to pass in their own argument collections


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1896 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 20:49:36 +00:00
chartl 8e3f72ced9 BTTJ - Code refactoring (major) - passes integration test
VariantEvalWalker - whoops, wrote PooledGenotypeAnalysis rather than PooledAnalysis, now passes tests again

- PooledFrequencyAnalysis - don't bother initializing matrices if this isn't a pool




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1895 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 19:04:51 +00:00
depristo 15a1849758 notes for chartl
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1894 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 18:31:31 +00:00
chartl 77863d4940 @PowerBelowFrequency
+ Changes to doc

@ BasicPoolVariantAnalysis
    + use char rather than ReferenceContext
    + calculate # alleles

@ PooledFrequencyAnalysis
    + breakdown of call metrics by estimated number of alleles in pool

@ VariantEvalWalker
    + add PooledFrequencyAnalysis to analysis set

@ PooledGenotypeConcordance
    + correctly calculate maximal allele frequency for output




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1893 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 15:17:11 +00:00
chartl 967128035e Make command like args default to false.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1892 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 13:59:35 +00:00
ebanks 9b9744109c Mark's new unified calculation model is now officially implemented.
Because it doesn't actually use EM, it's no longer a subclass of the EM model.

Note that you can't use it just yet because it doesn't actually emit calls (just prints to logger).  I need to deal with general UG output tomorrow.  Hold off until then, Mark, and then you can go wild.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1891 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 02:39:23 +00:00
depristo caa3187af8 Enabling correct high-performance ROD walker and moved VariantEval over to it. Performance improvements in variantEval in general. See wiki for full description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1890 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 23:31:13 +00:00
chartl 4a8a6468be Use read group as a condition for confusion tables. With an integration test.
Changed BaseTransitionTable to comparable objects for consistent ordering of output
( e.g. so the integration test doesn't yell so much )




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1889 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 19:39:32 +00:00
chartl b83df5616a Change for lower-case references (always compare upper case bases)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1888 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 17:36:31 +00:00
chartl 3b1fabeff0 Major code refactoring:
@ Pooled utils & power
   - Removed two of the power walkers leaving only PowerBelowFrequency, added some additional
     flags on PowerBelowFrequency to give it some of the behavior that PowerAndCoverage had
   - Removed a number of PoolUtils variables and methods that were used in those walkers or simply
     not used
   - Removed AnalyzePowerWalker (un-necessary)
   - Changed the location of Quad/Squad/ReadOffsetQuad into poolseq

@NQS
   - Deleted all walkers but the minimum NQS walker, refactored not to use LocalMapType

@ BaseTransitionTable
   - Added a slew of new integration tests for different flaggable and integral parameters
   - (Scala) just a System.out that was added and commented out (no actual code change)
   - (Java) changed a < to <= and a boolean formula


Chris



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1887 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 14:58:04 +00:00
aaron 4be6bb8e92 added a check to ensure the eval track variation is bi-allelic. Also changed some string constants over to enums. For some reason my check-ins from home wouldn't work last night, so this is the actual changes for 1884.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1886 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 14:15:33 +00:00
depristo 449a6ba75a Deleting lots of code as part of my cleanup. More classes tagged for removal. Many more walkers have their days numbered.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1885 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 12:23:36 +00:00
aaron d749a5eb5f added a check to ensure the eval track variation is bi-allelic. Also changed some string constants over to enums
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1884 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 04:56:51 +00:00
ebanks b8ab77c91c Don't filter out reads without proper read groups. Instead, allow the user (or another walker calling UG) to specify an assumed sample to use (but then we assume single-sample mode).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1883 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 01:30:53 +00:00
depristo a8a2c1a2a1 Replaced SSG with UG in packaging utils. Minor performance and formatting improvements for ClipReads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1882 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 01:19:58 +00:00
ebanks c29924e7cf Reverting previous change.
Aaron, it's all yours...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1881 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:55:24 +00:00
aaron d21b582b18 memory leak, where the Resource Pool was releasing based on the value and not the key, resulting in the resourceAssignments map growing with each additional shard
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1880 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:39:42 +00:00
ebanks 761a730758 assertBiAllelic -> assertMultiAllelic.
Chris, if this breaks an integration test, you get it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1879 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:09:46 +00:00
depristo 2a26bb42dd Softclipping support in clip reads walker. Minor improvement to WalkerTest -- now can specify file extensions for tmp files. Matt -- I couldn't easily create non-presorted SAM file. The softclipper has an impact on this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1878 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 21:54:53 +00:00
chartl 055a99fb05 Change in ordering for a disjunctions. Walker will no longer try to calculate number of simple mismatches in the pileup if the pileup includes 'N's.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1877 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 18:24:14 +00:00
chartl 10bde9e77b Integration test for BTT calculator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1876 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 18:21:55 +00:00
aaron cfa86d52c2 ensure that in the indel case we don't allow identification as both an insertion and deletion at the same location in the VCF ROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1875 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 18:21:00 +00:00
chartl 3d50c72d74 Forgot a dumb little System.out.println. You will be flooded with "This read will not be used." statements until, overwhelmed, you give in to my demands.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1874 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 16:13:48 +00:00
chartl 225ef52973 Now produces same output as the Scala walker for unconditioned tables (no 2bb, no previous base, etc.)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1873 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 16:10:44 +00:00
ebanks bb180a23ef Updated MD5
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1871 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-18 05:30:38 +00:00
ebanks 51f9ec0a5c subtract largest posterior value from all values; this hopefully solves any precision issues
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1870 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-18 05:20:15 +00:00
ebanks b9e8867287 -push allele frequency and genotype likelihood variable definitions down into the subclasses so that they can use different data structures
-use slightly more stringent stability metric
-better integration test



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1869 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-18 04:22:17 +00:00
depristo d6385e0d88 simpleComplement function() in BaseUtils. Generic framework for clipping reads along with tests. Support for Q score based clipping, sequence-specific clipping (not1), and clipping of ranges of bases (cycles 1-5, 10-15 for example). Can write out clipped bases as Ns, quality scores as 0s, or in the future will support softclipping the bases themselves.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1868 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 22:29:35 +00:00
chartl ad777a9c14 @BasicPileup - made the counts public so they can be used
@PoolUtils - split reads by indel/simple base

@BaseTransitionTable - complete refactoring, nicer now

@UnifiedArgumentCollection - added PoolSize as an argument

@UnifiedGenotyper - checks to ensure pooled sequencing uses the appropriate model

@GenotypeCalculationModel - instantiates with the new PoolSize argument




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1867 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 21:56:56 +00:00
andrewk bdb34fcf38 Updated integration tests for VariantEval. Hooray for IT!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1866 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 20:00:29 +00:00
aaron 41a95cb3f0 fixing unified genotyper test for change: VCF output now emits no calls as ./.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1865 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 19:38:58 +00:00
hanna 85a4fbc256 Bumping version of Picard for firehose compatibility.
Integration tests were validated against svn rev 1861, before the wonder
twins committed their changes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1864 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 19:38:56 +00:00
aaron 8aacc43203 VCF output now emits no calls as ./.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1863 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 18:51:31 +00:00
andrewk d1a4cd2f73 Added ValidationData analysis type to VariantEvalWalker; this eval takes a GFF file with validated truth data positions (bound to "validation")and calculates the accuracy of the genotype calls bound to "eval".
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1862 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 15:39:08 +00:00
ebanks 07b134a124 Added some integration tests for multiple samples
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1861 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 15:22:10 +00:00
ebanks 418e007ca6 A cleaner interface: now everyone can use UG's initialize method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1860 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 14:09:16 +00:00
aaron 96972c3a5c a fix for a bug Eric found: if your first call contains fewer samples than calls at other loci, your VCFHeader got setup incorrectly.
Also moved a buch of Lists over to Sets for consistancy.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1859 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 04:57:50 +00:00
aaron a69ea9b57c Cleaning up the VCF code, adding lots of tests for a variety of edge cases. Two issues are still outstanding: updating the no call string with the standard 1000g decided on today, and fixing Eric's issue where not all the VCF sample names are present initially.
also: their, I hope your happy Eric, from now on I'll try not to flout my awesomest grammer in the future accept when I need to illicit a strong response :-)

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1858 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 04:11:34 +00:00
ebanks b82c3b6040 Better error output (and fixed spelling mistakes)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1857 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 01:01:45 +00:00
ebanks 993c567bd8 I had to remove some of my more agressive optimizations, as they were causing us to get slightly different results as MSG. Results in only small cost to running time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1856 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 00:59:32 +00:00
asivache 7d7ff09f54 throw an exception if read has no associated read group
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1855 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 18:11:32 +00:00
chartl b9544d3f89 Output formatting change (very slight)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1854 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 16:47:29 +00:00
hanna 839c5d66bc Read uints directly into longs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1853 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 16:15:11 +00:00
hanna ce38fa7c81 Breaking the signed int glass ceiling; stage 1: convert critical ints to longs. Code cleanup and documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1852 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 15:28:56 +00:00
kcibul 79993be46c changed blank gene name to UNKNOWN
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1851 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 13:47:00 +00:00
depristo 0c2016c19a Improved error messages -- now easier to read, points to the GATK Error Messages wiki, and avoids double printing of stack traces
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1850 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 12:07:44 +00:00
aaron a9094c835c clean-up and fixes to the VCF input
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1849 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 04:53:59 +00:00
ebanks a32470cea1 Deal with the fact that walkers can call UG's init/map functions directly.
We need to filter contexts in that case since the calling walkers don't get UG's traversal-level filters.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1848 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 02:31:45 +00:00
hanna 8dca236958 Base-packed reader cleanup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1847 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 01:26:23 +00:00
hanna 316b30ee56 On the road to human: make sure the suffix array will fit in a Java array.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1846 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 21:45:35 +00:00
ebanks e740e7a7ce Because walkers call UG's map function, we need to move the actual writing out
to UG's reduce function.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1845 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 20:49:26 +00:00
kcibul 825e6c7a4d added calculation for bases over 2x,10x,20x,30x plus gene name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1844 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 20:32:26 +00:00
aaron 727b69fce0 catch null output destinations earlier
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1843 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 20:07:15 +00:00
chartl 1f66738c8e Fix a hashing function bug. Ignore reads with non-reference bases in the pileup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1842 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 19:41:26 +00:00
hanna 72c34f11dd Bug fixing for BWA output formats.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1841 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 19:32:22 +00:00
aaron 60183229ab the oldest java mistake in the book...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1840 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 19:32:13 +00:00
ebanks 52d2e0ca07 All walkers now use read.getReadGroup()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1839 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 19:27:40 +00:00
chartl 0a09fa4d5c Rename to distinguish this transition table calculator from the scala version.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1838 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:52:21 +00:00
chartl 1d055011bd Getting rid of this so I can rename it without the world blowing up.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1837 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:45:11 +00:00
aaron eb90e5c4d7 changes to VCF output, and updated MD5's in the integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1836 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:42:48 +00:00
ebanks 89771fef05 -Use read.getReadGroup()
-Add another filter for read groups for Chris


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1835 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:08:32 +00:00
ebanks 311ab8da5a A helper class to create the masks for the sequenom design maker.
This project is now officially done.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1834 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:28:51 +00:00
hanna 3553fc9ec0 Preparing for human -- support bwa output files directly rather than relying on a custom fixed sa interval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1833 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:17:46 +00:00
ebanks d89bc2c796 This class no longer outputs in sequenom format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1832 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:16:36 +00:00
ebanks 0c95d6906f Merge both versions of the Sequenom assay design maker: use Jared's base code and add in indels. [Jared, this still emits the same output for SNPs as your original version)
Remove all sequenom stuff from the FastaAlternateReferenceMaker so it can just concentrate on making alternate references...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1831 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:11:45 +00:00
ebanks 49af5269e5 Jared: feel free to change or revert, but until we move over to UG version...
Only print out positions with at least one non-ref call


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1830 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:08:57 +00:00
chartl f5a2e6dd50 Fix!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1829 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 16:15:20 +00:00
ebanks f2886d88e0 We now emit genotype calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1828 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 02:49:56 +00:00
ebanks 1b214c0de5 Fixed logic: throw exception if contigs are NOT equal
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1827 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 02:48:44 +00:00
ebanks aeca14d052 On our side of 5CC, we spell multi M-U-L-T-I.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1826 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 01:41:25 +00:00
ebanks c9c8fd1fef Added the discovery LOD score to the meta data
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1825 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 01:24:06 +00:00
ebanks 0c06bf9dbc Explicitly set output to GELI now that default is VCF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1824 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 22:12:03 +00:00
hanna a76fac4687 Cleanup existing speedups. Minor performance improvements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1823 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 21:51:18 +00:00
hanna 837ae1d33a Optimization: from 22k reads/min - 30k reads/min.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1822 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 20:59:29 +00:00
ebanks 96b8499a31 Remodeled version of the UnifiedGenotyper.
We currently get identical lods and slods as MultiSampleCaller (except slods for ref calls, as I discussed with Jared) and are a bit faster in my few test cases.  Single-sample mode still emulates SSG.
The remaining to do items:
1. more testing still needed
2. we currently only output lods/slods, but I need to emit actual calls
3. stubs are in place for Mark's proposed version of the EM calculation and now I need to add the actual code.
More check-ins coming soon...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1821 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 20:27:01 +00:00
ebanks b28446acac Multi-sample calls now have associated meta-data (SLOD, allele freq), which wil
l soon actually be used...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1820 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 20:08:43 +00:00
hanna db642fd08b Optimization: from 10k reads/sec - 22k reads/sec..
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1819 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 18:07:15 +00:00
aaron 77499e35ac fixes for GSA-199: Need easier way to write binary outputs to standard output. GLF and VCF now have stream constructors, and can get dumped to standard out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1818 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 15:50:20 +00:00
hanna f37564e63a Our BWA is now looking at roughly the same number of candidate alignments as BWA/C. Performance is now at 11k reads / min, still a long way from BWA/C.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1817 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 15:50:04 +00:00
chartl 8d0e057d83 I got bored today and decided to write the confusion matrix calculator. At present it is untested. I'm submitting it to subversion to make sure
I have  previous revision to revert back to.


This is a calculator that will calculate:

P[ True base is X | read base mismatches, secondary base is Y, previous K bases are Z1,Z2,...ZK ]

where the number of pervious reference bases to take into account is user-defined. The secondary base is optional as well.

--usePreviousBases k

tells the walker to use the k previous reference bases in the transition table

--useSecondaryBase

tells the walker to use the secondary base at a locus in the transition table

these can be used together.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1816 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 02:55:29 +00:00
ebanks be92a1e603 Don't try to close if the lazy initialize hasn't triggered
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1815 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 01:20:25 +00:00
chartl ec83bc6ec5 This somehow didn't make it into subversion the last time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1814 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 21:11:13 +00:00
chartl ecbb11e017 Modified PowerBelowFrequency to ignore reads below a user-defined mapping quality. Request from Jason Flannick.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1813 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 20:59:24 +00:00
chartl ec68ae3bc5 Added a filter that will split the read set by a threshold of mapping quality (Request from Jason Flannick)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1812 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 20:58:37 +00:00
chartl 0d73fe69e7 Recalibrator by NQS. Had this puppy running all afternoon. Thing had got through 100,000,000 reads before I decided to delete my sting tree. *sigh*, a little more delay.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1811 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 20:55:02 +00:00
chartl ee0afba0af Recalibration stuff...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1810 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 20:51:39 +00:00
ebanks caf689821f added method to get normalized posteriors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1809 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 02:33:22 +00:00
ebanks cf7a26759d -use the getReadGroup() function that was added to picard for us
-clean up some include lines


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1808 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 01:39:32 +00:00
hanna d844d1c496 SAMFileWriters specified as command-line arguments were sometimes incorrectly altering the default short name. Make sure short name is not specified if shortName is not specified but fullName is.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1807 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 19:16:46 +00:00
hanna da084357db Fixed minor typo in output message.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1806 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 18:56:54 +00:00
aaron 62c484b57a Fixes for GSA-201, where enumerated types in command line arguments had to be defined as all uppercase for the system to work.
Also a little playground walker that changes the sort order flag of a BAM file.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1805 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 18:11:32 +00:00
hanna 32d55eb2ff Fix issue Eric was seeing with java.lang.Error in unmap0.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1804 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 17:46:56 +00:00
ebanks 9f3482ef11 VCF is both a multi- and single- sample format, so we shouldn't be throwing an exception when used for SS
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1803 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 17:43:26 +00:00
jmaguire d9f5a314ac avoid an out of memory error by no putting more than 5000 reads in the cache. on pilot1 at least those are crazy loci anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1802 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 14:56:55 +00:00
hanna f4b6afb42c JVM issue id 5092131 (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5092131)
was causing OOM issues with the new mmapping fasta file reader during large jobs.
Temporarily reverting the reader until a workaround can be found.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1801 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 04:45:46 +00:00
chartl 6d7f4481e4 Changed traversal type slightly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1800 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 04:11:48 +00:00
ebanks a9f3d46fa8 Your time has come, SSG.
Fare thee well.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1799 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 20:27:56 +00:00
jmaguire 8fdb8922b8 now output in the exact format that works with sequenom software.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1798 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 20:06:27 +00:00
aaron 98e3a0bf1a VCF can now be emitted from SSG. The basic's are there (the genotype, read depth, our error estimate), but more fields need to be added for each record as nessasary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1797 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 19:50:04 +00:00
hanna 95f24d671d Fixed 'visualization' of reads that didn't match bwa's alignments exactly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1796 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 19:45:30 +00:00
kiran 29ad6cd876 Made redundant by BCMMarkDupes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1795 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 18:47:20 +00:00
kiran 94d82d1915 Matthew Bainbridge's duplicate removal utility for 454 data. This code should eventually be moved into a read walker. For now, it's being introduced into the repository as-is (well, with one minor change to make the handling of command-line arguments a little more straightforward).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1794 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 18:32:37 +00:00
ebanks 15bf014e0b logger.info -> logger.debug (don't want to risk filling up my log on genome-wide calls)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1792 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 17:53:11 +00:00
ebanks df8ea8f437 UG integration test. This was the old SSG test with MD5s updated.
I'll need to add some multi-sample tests in a bit...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1791 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 17:43:58 +00:00
ebanks 008455915a One way of making the integration test stop failing is to remove it...
[waiting for Matt to cringe...]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1789 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 17:08:41 +00:00
chartl f89a89ffe3 Use of AlleleFrequency as an input to PowerAndCoverage is deprecated by the new walker. Reverting to the standard "power at 1 allele" calculation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1788 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 16:07:45 +00:00
chartl ae05f5c7ad Fixin the header.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1787 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 15:49:28 +00:00
chartl 11ff1e09b8 A new power walker for the user to feed in a number of alleles. Call that number k. Output is:
Locus Power_for_k_alleles  Power_for_k-2_alleles  Power_for_k-2_alleles ... Power_for_1_allele

This was a request from Jason Flannick & the T2DB group.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1786 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 15:35:35 +00:00
ebanks 04fe50cadd *** We no longer have a separate model for the single-sample case. ***
For now, a single sample input will be special-cased in the EM model - but that will change when the EM model degenerates to the single sample output with a single sample as input.  For now, the EM code for multi-samples isn't finished; I'm planning on checking that in soon.

The SingleSampleIntegrationTest now uses the UnifiedCaller instead of SSG, and so should all of you.  More on that in a separate email.
Other minor cleanups added too.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1785 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 14:08:57 +00:00
jmaguire 32128e093a misc. changes to get the numbers back to the baseline while keeping the speedup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1784 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 12:27:07 +00:00
jmaguire d38a0d04b9 fix a snp mask offset error.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1783 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 12:25:40 +00:00
kiran 829e99413b Rescores a variant after removing duplicates (defined very strictly as reads with the same start points).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1782 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 03:07:36 +00:00
hanna fcb6a992c8 Switched IndexedFastaSequenceFile over to use memory mapping to load data rather than
the loop-with-small block size.  Performance improvements in loading refs are extreme;
segments can be loaded in <1ms.  chr1 in its entirety can be loaded in 1.5sec (down
from 30sec).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1781 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 00:07:15 +00:00
jmaguire 02d2492d68 Simple tool for picking sequenom probes for SNPs. Can be extended to indels if necessary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1780 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 23:46:41 +00:00
ebanks 1905b5defa Hash by chromosome for now to reduce memory. This is a temporary solution until we decide how to reture the Injector for good.
Also, with Picard's latest changes, we need to make sure we don't double-close the sam writer.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1779 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 20:06:25 +00:00
ebanks f9a1598d75 Reformatting
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1778 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 20:03:34 +00:00
ebanks 203c626fc2 A wrapper around the GenotypeLikelihoods class for the UnifiedGenotyper. This wrapper incorporates both strand-based likelihoods and a combined likelihoods over both strands.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1777 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 19:57:37 +00:00
sjia 5bdcc2b4dc Included HLA class 2 genes in CreatePedFileWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1776 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 18:46:51 +00:00
sjia 8f896b734f Included HLA class 2 genes in CreatePedFileWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1775 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 18:28:01 +00:00
aaron f9a0eefe4b GELI_BINARY is now functional, and can be used as a variant type in SSG (-vf=GELI_BINARY). Also fixed the max mapping quality column in both GELI output formats, we haven't been correctly outputing up until now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1774 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 18:20:34 +00:00
chartl 225b9bccc1 Modifications to NQSClusteredZScoreWalker to output empirical mismatch rates on bins by both Z-score and reported Q-score, rather than averaging over all Q-score bins for each Z-score.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1773 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 13:45:12 +00:00
depristo 8dd0924b37 Minor performance improvements to VariantEval -- now all of the CPU time is spent dealing with the ROD system...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1772 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 23:40:30 +00:00
aaron 4554ca1b28 more cleanup, depecaited the old genotype, corrected SNPCallsFromGenotypes' imports and two other classes that depend on it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1771 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 19:09:27 +00:00
aaron 3aec76136f Removing the AllelicVariant interface, which is replaced by the Variation interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1770 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 17:44:24 +00:00
depristo 1bd0c3c145 variant eval allows non Variation rod objects
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1768 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 13:04:26 +00:00
aaron 66fc8ea444 GSA-182: Adding support for BED interval files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1767 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 02:45:31 +00:00
hanna aec83b401d SSG multithreading doesn't play well with some I/O changes made since I last svn up'd. Reverting until I can find the reason.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1766 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 19:48:57 +00:00
hanna 8a503c86b6 Code supporting SSG proof-of-concept shared memory parallelism.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1765 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 18:56:16 +00:00
ebanks fb619bd593 -Refactoring: make GenotypeCalculationModel constructors empty so that they don't have to be updated every time we add a new parameter; instead put that logic in the super class's initialize method (making everything protected so that only the factory can access them)
-Adding initial version of Multi-sample calculation model.  This still needs much work: it needs to be cleaned up and finished.  Right now, it (purposely) throws a RuntimeException after completing the EM loop.

Also:
-Fix logic in GenotypeLikelihoods.setPriors
-Add logger to the models for output






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1764 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 18:10:36 +00:00
sjia 98076db6b4 Modified CreatePedFileWalker to output PED file given HLA allele names
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1763 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 03:06:42 +00:00
hanna 56bc4fa21a Fixed bug where not all alignments were returned if read aligned to multiple locations. Enhanced test suite to validate all alignments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1762 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-04 18:20:20 +00:00
hanna 05aa928e3e Fix off-by-number-of-deletions issue with negative strand reads. Improved performance by factor of 2.5x.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1761 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-03 21:55:18 +00:00
chartl 7605ee500c Idiocy! All tests were being disabled because I forgot the instanceof
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1760 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 20:04:56 +00:00
chartl 88d0890cc3 Made PooledGenotypeConcordance a standard test in VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1759 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 20:03:31 +00:00
aaron 7fc4472e6d A big fix for MergingSamRecordIterator, where we weren't correctly handling the comparisons of SAMRecords correctly (we weren't applying the new reference index first, so sometimes the MT contig would be ID 23, sometimes 24 in different records).
Also a fix to the GLF tests, and a correction to PrintReadsWalker to remove the close() on the output source, the source handles that itself (and you get a double close).

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1758 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 19:35:35 +00:00
chartl 68cb2ee54b Tweaks to parameters for NQS analysis walkers; change to PowerAndCoverage for Jason Flannick (can input the number of alleles to compute power for - i.e. doubletons, tripletons; rather than statically checking singletons.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1757 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 19:11:27 +00:00
ebanks 7249fade05 updated
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1756 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 18:10:34 +00:00
ebanks 53a4bd7f51 A better understanding of what's going on means no need for clearing the cache
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1755 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 18:07:46 +00:00
aaron e885cc4b21 changes for corrected GLF likelihood output, along with better tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1754 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-01 20:45:05 +00:00
hanna 2309d19f6f Bug fix from Michael Ross: mark second read in sequence as second of pair.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1753 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-01 14:34:36 +00:00
aaron 2e4949c4d6 Rev'ing Picard, which includes the update to get all the reads in the query region (GSA-173). With it come a bunch of fixes, including retiring the FourBaseRecaller code, and updated md5 for some walker tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1751 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 20:37:59 +00:00
ebanks 303972aa4b Yup, I broke the build...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1750 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 20:20:43 +00:00
ebanks 841d25cc44 Added ability to set the priors after construction (and requiring a flushing of the likelihoods cache)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1749 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 19:55:49 +00:00
hanna 665951f9f0 Support negative strand alignments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1748 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 18:10:26 +00:00
hanna d3b1732cca Start of refactoring effort. Make construction of alignment object simpler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1747 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 15:19:31 +00:00
hanna 70e1aef550 Better integrate the @ArgumentCollection into the command-line argument parser. Walkers can now specify their own @ArgumentCollections. Also cleaned up a bit of the CommandLineProgram template method pattern to minimize duplicate code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1746 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 22:23:19 +00:00
aaron b1c321f161 Adjusted Genotype concordance to more accurately use the new Genotyping code, fixed the VCF rod, and temp. fix the build by reintroducing Shermans ReadCigarFormatter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1745 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 21:28:21 +00:00
sjia 9b78a789e2 HLA Caller 2.0 Walkers:
CalculateBaseLikelihoodsWalker.java walks through reads calculates likelihoods using SSG at each base position
CalculateAlleleLikelihoodsWalker.java walks through HLA dictionary and calculates likelihoods for allele pairs given output of CalculateBaseLikelihoodsWalker.java
CalculatePhaseLikelihoodsWalker.java walks through reads and calculates likelihoods score for allele pairs given phase information

File Readers:
BaseLikelihoodsFileReader.java reads text file of likelihoods outputted by SSG
FrequencyFileReader.java reads text file of HLA allele frequencies
PolymorphicSitesFileReader.java reads text file of polymorphic sites in the HLA dictionary
SAMFileReader.java reads a sam file (used to read HLA dictionary when in another walker)
SimilarityFileReader.java reads a text file of how similar each read is to the closest HLA allele (used to filter misaligned reads)

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1744 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 20:45:55 +00:00
chartl 281a77c981 Bugfix. isMismatch() was actually computing isMatch().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1743 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 20:04:59 +00:00
chartl e28b45688c More NQS Related Walkers to play with
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1742 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 20:01:04 +00:00
ebanks 9ef80e3c3c One minor addition: to incorporate Pooled calling (and to be as general as possible), we allow the genotype calculation model to use rods if it wants.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1741 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 17:05:59 +00:00
ebanks 19bfe43173 First pass at a unified caller, being checked in now so Mark can give feedback if he chooses and so Matt can debug issues with the ArgumentCollection class.
Some notes:
1. This design should be flexible enough to include pooled calling (for now) after discussions with Chris.
2. Using the unified caller with the SingleSampleCalculationModel emits the exact same output as SSG over all of chr20 for NA12878.  Additionally, when we include the "max deletions allowed at a locus" argument (so we don't try to call SNPs at deletion sites), it removed 233 SNP calls in chr20 that were clearly indel artficts.
3. The MultiSampleEMCalculationModel is still a work in progress and will be checked in later this week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1740 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 16:48:15 +00:00
ebanks 8bd345ba00 Generalized deletions in pileup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1739 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 15:58:43 +00:00
andrewk 6134f49e3c Convert de novo SNP caller to run using parent1 and parent2 BAM files (by splitting contexts by reader using getMergedReadGroupsByReaders) instead of geli files providing a large speed-up and obviating the need for large whole-genome geli files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1738 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 06:42:21 +00:00
andrewk 5dab95aa5a Fix getMergedReadGroupsByReaders so that it provides read groups in the same way Picard does so that it works correctly when input read files have no clashes in their read groups and retain their original read group names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1737 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 06:35:50 +00:00
andrewk 5662a88ee1 Cosmetic change to list sampling functions: the typical usage of n and k were reversed. No change in functionality of the classes has been made and unit tests still pass.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1736 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-28 18:12:32 +00:00
aaron 39598f1f0a switching the concordance walker over to the new Variation system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1735 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-28 15:46:36 +00:00
asivache bce2f0d7cf Now instantiates the list of alternative consenses to evaluate as LinkedHashSet to guarantee iterator traversal order. Old implementation used HashSet and exhibited unstable behavior when two alt consenses turned out to be equally good: depending on the run conditions (including size of the interval set being cleaned??), either one could be seen first as selected as the 'best' one
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1734 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-28 06:15:46 +00:00
asivache 663175e868 Bug fix: when jumping onto next contig (chromosome), the walker was erasing last mismatch interval from the previous chr it was still holding without printing it; now it gets printed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1733 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 22:24:34 +00:00
asivache 92c6efabb7 moving IndelGenotyper out of playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1732 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 19:44:49 +00:00
asivache aec61c558b moving IndelGenotyper out from playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1731 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 19:43:53 +00:00
chartl fe6d810515 Some basic commits that I've been sitting on for a while now:
@ PooledGenotypeConcordance - changes to output, now also reports false-negatives and false-positives as interesting sites. It's been like this in my directory for ages, just never committed.

@NQSExtendedGroupsCovariantWalker - change for formatting.

@NQSTabularDistributionWalker - breaks out the full (window_size)-dimensional empirical error rate distribution by the window. So if you've got a window of size 3; the quality score sequences 22 25 23 and 22 25 24 have their own bins (each of the 40^3 sequences get one) for match and mismatch counts.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1730 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 19:35:50 +00:00
sjia f7684d9e1b ImputeAllelesWalker fills missing portions of HLA dictionary based on best allele matches
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1729 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 18:51:46 +00:00
sjia 235de38c2e Updates to FindClosestAlleleWalker and CreateHaplotypesWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1728 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 16:41:58 +00:00
aaron 130a01a40a delete the integration test temp files when the test is over
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1727 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 16:34:08 +00:00
aaron 2b7d39035a switched over the FastaAlternateReferenceWalker to the Variation system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1726 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 16:09:43 +00:00
aaron 7ffc1d97ef Cut DeNovoSNPWalker over to the new Variation system, some renaming of methods on the Variation interface, and some corrections on the interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1724 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 04:35:52 +00:00
depristo 392152f149 1000x performance improvements to MSG for crisis control
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1723 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 23:44:33 +00:00
hanna 44879c81b0 Add in weights. Massive performance improvements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1722 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 23:19:15 +00:00
hanna 3b79f9eddc Support 'N's and other mismatch characters in the reference.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1721 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 21:41:30 +00:00
hanna 08e8d2183a Indels supported. Variable gap penalties are not yet taken into account.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1720 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 21:03:02 +00:00
aaron d2af26e81f Pooled EM SNP Rod converted over to the Variation interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1719 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:33:11 +00:00
ebanks 97105ac001 We need to return a null RODRecordList when the default value is null (as opposed to a list with a single null value), because that's what everyone is expecting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1718 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:23:12 +00:00
ebanks d4b40bc06f Filter for reads with missing read groups so we can safely assume all reads have valid read groups
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1717 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:10:26 +00:00
ebanks 90de2e0cde Added ability to specify whether you want to use a point estimate or fair coin test calculation; for now you can use either but fair coin test is still experimental as it needs to be parametrized correctly. This job will hopefully be done by the future Bioinformatic Analyst...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1716 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 15:29:50 +00:00
aaron d262cbd41c changes to add VCF to the rod system, fix VCF output in VariantsToVCF, and some other minor changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1715 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 15:16:11 +00:00
sjia 1ee8ba590c Reads cigar files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1713 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 03:14:10 +00:00
sjia 9422156e09 Finds closest allele for each read in bam file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1712 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 03:12:20 +00:00
sjia 5c5151c4e7 Creates ped file from reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1711 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 02:48:29 +00:00
hanna b0ec7fc144 More comprehensive testing of BWT (mismatches only) module, and lots of bug fixes.
Limitations:
1) Can't handle RC alignments.
2) Can't handle indels.
3) Can't handle N's in reference bases.
4) Stops at first hit.

Ran BWT over a test suite of 800k Ecoli reads.  After removing alignments with indels / reads with Ns, the remaining reads were aligned with quality 'equal to' that of the alignment stored in the BAM file.  In this case 'equal' quality is <= mismatches to the reference as the existing alignment stored in the BAM file.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1710 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 23:44:59 +00:00
sjia b446b3f1b6 CreateHaplotypeWalker now gives correct output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1709 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 21:13:52 +00:00
aaron eeb14ec717 a couple of light changes to GenomeLocSortedSet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1708 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 20:38:53 +00:00
sjia 3916e165fb New walker to output haplotypes for each read (for SNP analysis or imputation, etc)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1707 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 20:26:43 +00:00
ebanks 423a3ee894 Added a sequenom rod to empower Carrie to convert 1KG validation SNPs to sequenom format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1706 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 20:22:09 +00:00
chartl 63f3d45ca4 fixing the build
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1705 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 20:04:09 +00:00
chartl 540e1b971f And we fix one boneheaded mistake, which was actually causing the problem; though the last change was still correct.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1704 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 19:26:45 +00:00
chartl 124ca68fa8 And an IMMEDIATE minor fix (want neighborhood quality > base quality to be represented correctly)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1703 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 19:21:09 +00:00
chartl 8cdb78ebee More sophisticated version of the NQSCovariantWalker - modified to be more explicit about how much higher the
quality score of a particular base is than the quality score of its neighbors. The granularity of the binning
jumps from 32 groups to 860 groups.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1702 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 19:18:24 +00:00
hanna 856bbd0320 Let Picard specify the default compression level.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1701 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 19:01:48 +00:00
aaron f783cb30e0 adding an interface so that the current @Requires with ROD annotations work in walkers like VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1700 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:24:05 +00:00
hanna ebfbe56b43 Make sure compression level always gets pushed into SAMFileWriterFactory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1699 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:20:26 +00:00
asivache fa87dd386d Now uses rodRefSeq in its new reincarnation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1698 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:19:36 +00:00
asivache bf7cd66d53 New, simpler rodRefSeq. Fully relies on the ROD system standard mechanisms. Multiple transcripts over a given location will be now returned by the ROD system itself as RodRecordList<rodRefSeq>; and yes, rodRefSeq does represent a single transcript record now and implements Transcript interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1697 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:18:25 +00:00
asivache 8fa4c93f5a Transcript is now simply an interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1696 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:13:31 +00:00
asivache fe36289e44 Noone needs this, probably... Old experimental code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1695 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:11:50 +00:00
asivache 1bd4c0077c Now that ROD system supports overlapping RODs, we do not need rodRefSeq to be too smart and read in all the overlapping records (transcripts) on its own; leave it to the generic ROD mechanism.
PARTIAL commit; new, simpler rodRefSeq will reappear in a seq.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1694 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:11:16 +00:00
sjia aa66074a0e Compares each read to the HLA dictionary and outputs closest allele, as well as other stats
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1693 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 16:17:23 +00:00
aaron 11c32b588f fixing VariantEvalWalkerIntegrationTest md5 sums, a couple comment changes, and a little bit of cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1690 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 20:54:47 +00:00
ebanks b0fa19a0b2 Fixed recal integration test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1689 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 20:22:32 +00:00
ebanks 0748d80baa Added a convenience method in rodDbSNP to deal with Andrey's changes to the rod. Now you can just ask for the first real SNP rod from the list and not have to think about how it works.
CountCovariates uses it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1688 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 20:15:40 +00:00
ebanks 6780476fb5 updated to deal with new dbSNP rod
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1687 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 19:46:32 +00:00
hanna 14477bb48e Unidirectional alignments with mismatches now working. Significant refactoring will be required.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1686 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 19:05:10 +00:00
sjia 22932042ea Combined Scores, bug fixed for printing HLA-C
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1685 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 18:28:16 +00:00
ebanks 682b765536 bug: need to upper case chars so that == works throughout
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1684 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 18:20:43 +00:00
asivache d7d0b270d1 now supports blacklisting lanes (with -BL option will ignore reads from any of the specified lanes)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1682 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 16:46:57 +00:00
asivache 57d31b8e9b Filter that discards reads from specific lanes; and also its friend that helps blacklisting a set of lanes from GATK command line a one-liner.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1681 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 16:46:06 +00:00
aaron 83a9eebcc4 fixed a bug I checked in that Eric found, for intervals with no start or stop coordinate. Now I owe Eric a cookie, and Milk Street is so far away. Damn.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1679 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 04:34:18 +00:00
ebanks 5ce42cbab3 After thinking about this a bit more, it makes sense to pull this functionality out of my walker and into the GenomeLocParser where everyone else can benefit from it...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1677 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 01:32:35 +00:00
aaron 7bfb5fad27 fixing the dbSNP test. Also removing unnessasary comments from the GenomeLocParser, added some tests, and commented out the performance test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1676 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 23:32:24 +00:00
aaron 39a47491a9 changes to make GenomeLoc string parsing 25% faster
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1675 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 22:37:47 +00:00
ebanks b1dc6d65e4 interval merging is now blazingly fast
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1674 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 21:15:04 +00:00
asivache 15135788ca OK, let's bite the bullet. Now rodDbSNP objects are 'isSNP()' only when they are annotated as 'exact', not a 'range'.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1673 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 19:25:16 +00:00
asivache 8ad181f46f Note to myself: do 'ant clean' now and then or old versions of the code that suddenly became invalid will stick around. The world is not perfect, and neither is automatic dependency resolution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1672 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 17:40:52 +00:00