Commit Graph

9520 Commits (7b81559a9b2acd902ffde05db8aed31df4b2a557)

Author SHA1 Message Date
Guillermo del Angel 7b81559a9b One more pool caller bug fix: don't create output file for noisy simulation in unit test, or else previous results will be deleted 2012-05-14 16:27:38 -04:00
Guillermo del Angel 617ac0b88f More pool caller bug fixes 2012-05-14 16:15:54 -04:00
Guillermo del Angel 578092b120 Pool caller bug fixes: avoid NPE in null tracker positions, fix so that we can (in theory) use Pool AF with non-pool GL model for testing 2012-05-14 15:03:53 -04:00
Guillermo del Angel 5fc3adbb04 One more VariantsToTable bug fix 2012-05-14 14:10:07 -04:00
Guillermo del Angel 04d691f04a Forgot to update MD5's due to new Exact AF model in pool caller (all changes legit, minor QUAL/QD/SB differences). Fixed bug in VariantsToTable from previous commit 2012-05-14 14:01:29 -04:00
Guillermo del Angel ae26f0fe14 a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing 2012-05-14 10:55:35 -04:00
Guillermo del Angel 67e5c3ff9f Solved major scalability problem in pool caller - exact model may have been linear but computing pool GL's was O(n^p) where p was max # of alleles (4 in SNP discovery mode). Linearized approach follows exact AF model with queue of AC conformations to add - may refactor code to eliminate duplication later, as linear multiallelic pool AF model will use same approach. TBD: how to print PL's with -Infinity value, right now since we never cap PL printing we end up with big nonsense numbers in those positions and vcf's look ugly. Calling MT in CEU trio with pool size = 100 goes from 2 days to 55 minutes (sic) 2012-05-11 10:05:09 -04:00
Guillermo del Angel 9acef4b206 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-10 16:00:58 -04:00
Guillermo del Angel da6f16986e Preparatory refactorings for pool indel calling and for optimizations: restructure code in PoolSNPGenotypeLikelihoods that will be shared with indels, and make it easier to rewrite when optimized version that's linear in pool size is ready (current version is linear in #of pools but not yet on pool size). 2012-05-10 16:00:37 -04:00
Ryan Poplin c9dd0f3173 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-10 13:09:10 -04:00
Ryan Poplin 0cdadffe14 Committing the best of the frantic pre-CSHL experiments: Better algorithm for partitioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes. 2012-05-10 13:09:03 -04:00
Guillermo del Angel 89f8a6b2e6 Revert bad part of last commit that shouldn't have been pushed 2012-05-10 10:41:08 -04:00
Guillermo del Angel 27b1aa5dd3 Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's 2012-05-10 10:29:19 -04:00
Eric Banks 4f37d6d399 Fixing docs 2012-05-10 00:56:00 -04:00
Joel Thibault 51936dcef3 Update indices to better match queries
Only query for the requested samples
2012-05-09 17:14:50 -04:00
Joel Thibault f4ae4a0a70 Initial versions of Queue scripts for Mongo testing 2012-05-09 14:22:34 -04:00
Joel Thibault 5427f8dffa Mongo Long/Integer confusion 2012-05-09 14:22:34 -04:00
David Roazen c56370a503 Update GATKPerformanceOverTime script for GATK 1.6 2012-05-09 13:47:41 -04:00
Mark DePristo 398dceec56 Basic test script to cut up a BAM and run it through cramtools 2012-05-08 19:46:51 -04:00
Mark DePristo c81acfc15d Working implementation of BCF2
-- Nearly complete on spec implementation.  Slow but clean
-- Some refactoring of VariantContext to support common functions for BCF and VCF
2012-05-08 19:46:51 -04:00
Mark DePristo a5193c2399 Mostly complete reference implementation of BCF2
-- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF
2012-05-08 19:46:51 -04:00
Mark DePristo 237a41d3d3 Phase II of BCF2 reader / writer
-- New encoder decoder implementation with cleaner interface that supports newer spec versions
-- Checkpoint to read / write sites files
2012-05-08 19:46:50 -04:00
Mark DePristo eb6721bd44 Initial test simple BCF2 encoder / decoder 2012-05-08 19:46:50 -04:00
Eric Banks 473d07b0c5 fixing up docs from previous Pool Caller commit 2012-05-08 11:02:55 -04:00
Eric Banks b4999d14c1 updating docs 2012-05-08 10:58:46 -04:00
Guillermo del Angel 33a1dd2048 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 10:42:12 -04:00
Guillermo del Angel 7584b1ea17 Back off optimization of pool vcf that didn't print genotypes. Many site attributes require GT in genotypes to be computed correctly. Better to change string representation of polyploid genotypes, TBD better solution 2012-05-08 10:41:39 -04:00
Eric Banks 5cf4fd63c2 Catch malformed base qualities and throw as a User Error 2012-05-08 09:34:57 -04:00
Guillermo del Angel a4f4b5007b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 09:34:33 -04:00
Guillermo del Angel 605984353f Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model. 2012-05-08 09:33:38 -04:00
Eric Banks c40cda7e3c Nope, loads of integration tests had to be changed. 2012-05-07 14:30:42 -04:00
Eric Banks 66838a073e Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed... 2012-05-07 12:20:11 -04:00
Mark DePristo a90482c772 Rev. tribble to v101 with another putative open file leak fix
Scalability bugfixes; can issue tens of thousands of queries to a reader
without opening too many files

-- Fixed missing close() statement in TribbleIndexedFeatureReader
-- Fixed NPE in TabixIteratorLineReader
-- Added scalability test that confirms .query() failure and subsequent fix

Note this actually fixes a tested and reproducible scalability issue.  Might not be the only one but I believe it should do the trick.  Sorry everyone for the inconvenience.  Note that we now have a test in Tribble to ensure this doesn't happen again.
2012-05-04 15:40:41 -04:00
David Roazen 9424acb3c8 BCF2: Fix issue with parsing of filters 2012-05-04 15:08:53 -04:00
David Roazen e506de47b3 BCF2: Use the reference's sequence dictionary in BCF2Writer, don't require the VCF header to have contig declarations 2012-05-04 14:54:50 -04:00
David Roazen b28de6674d BCF2: set VC stop position to allow BCF2ToVCF walker to work correctly
Stop position is not yet correct for multi-nucleotide events, but that can
be fixed later
2012-05-04 13:24:49 -04:00
David Roazen 6b769e91d8 BCF2: third checkpoint
* writer mostly implemented
* walkers to convert BCF2 <-> VCF
* almost working for sites-only files; genotypes still need work
* initial performance tests this afternoon will be on sites-only files
2012-05-04 13:00:15 -04:00
Mark DePristo fa84d50a2b Rev. tribble for putative bugfixes for not closing streams 2012-05-04 10:20:46 -04:00
Khalid Shakir 23e3668e2c Added JUST_BCF2 to PRS walker based on GVCF tests.
Example: -T ProfileRodSystem -mode JUST_BCF2 -R <fasta> -vcf <input> -o out.txt [-performanceTest]
2012-05-03 22:08:18 -04:00
Khalid Shakir a9da9598f5 Implemented getSamplesFromVCF. 2012-05-03 21:57:57 -04:00
Khalid Shakir 7c11dde328 Updated DPP test MD5's due to template length (TLEN) changes when Picard was revved. 2012-05-03 14:47:58 -04:00
David Roazen fbb40c3c42 BCF2: checkpoint for Mark 2012-05-03 14:31:25 -04:00
Eric Banks c9829374d3 Oops, was using the wrong variables to print in the HaplotypeResolver. Fixing for Ryan. 2012-05-03 13:39:49 -04:00
Eric Banks f3433201b1 Merged bug fix from Stable into Unstable 2012-05-03 11:11:00 -04:00
Eric Banks 557da77a1a Don't compute QD if there is no QUAL; added integration test for this 2012-05-03 11:02:37 -04:00
Eric Banks 1fc7b5d58b Merged bug fix from Stable into Unstable 2012-05-03 10:37:58 -04:00
Laurent Francioli 567d01cee8 - Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:49 -04:00
Laurent Francioli 96e5a26223 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:20 -04:00
Mark DePristo 0f4cc1884d Rev to tribble 99, optimized AsciiFeatureCodec
-- Removed tmp. GeneralizedFeatureCodec
-- BCF2 Reader update to use new style, but this entire class can be deleted now
-- Rev. tribble to r99
2012-05-03 07:31:48 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream
as an argument not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides to its subclass the same String decode,
readHeader functionality as before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (it's a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's not a particularly
clean way of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so it's no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00