Commit Graph

2049 Commits (ae26f0fe14df0b555d8b45fd4ce3c08eb12ee8e4)

Author SHA1 Message Date
Guillermo del Angel ae26f0fe14 a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing 2012-05-14 10:55:35 -04:00
Ryan Poplin c9dd0f3173 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-10 13:09:10 -04:00
Ryan Poplin 0cdadffe14 Committing the best of the frantic pre-CSHL experiments: Better algorithm for partioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes. 2012-05-10 13:09:03 -04:00
Guillermo del Angel 89f8a6b2e6 Revert bad part of last commit that shouldn't have been pushed 2012-05-10 10:41:08 -04:00
Guillermo del Angel 27b1aa5dd3 Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's 2012-05-10 10:29:19 -04:00
Eric Banks 4f37d6d399 Fixing docs 2012-05-10 00:56:00 -04:00
Mark DePristo c81acfc15d Working implementation of BCF2
-- Nearly complete on spec implementation.  Slow but clean
-- Some refactoring of VariantContext to support common functions for BCF and VCF
2012-05-08 19:46:51 -04:00
Mark DePristo a5193c2399 Mostly complete reference implementation of BCF2
-- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF
2012-05-08 19:46:51 -04:00
Eric Banks 473d07b0c5 fixing up docs from previous Pool Caller commit 2012-05-08 11:02:55 -04:00
Eric Banks b4999d14c1 updating docs 2012-05-08 10:58:46 -04:00
Guillermo del Angel 33a1dd2048 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 10:42:12 -04:00
Eric Banks 5cf4fd63c2 Catch malformed base qualities and throw as a User Error 2012-05-08 09:34:57 -04:00
Guillermo del Angel a4f4b5007b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 09:34:33 -04:00
Guillermo del Angel 605984353f Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model. 2012-05-08 09:33:38 -04:00
Eric Banks c40cda7e3c Nope, loads of integration tests had to be changed. 2012-05-07 14:30:42 -04:00
Eric Banks 66838a073e Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed... 2012-05-07 12:20:11 -04:00
David Roazen 6b769e91d8 BCF2: third checkpoint
* writer mostly implemented
* walkers to convert BCF2 <-> VCF
* almost working for sites-only files; genotypes still need work
* initial performance tests this afternoon will be on sites-only files
2012-05-04 13:00:15 -04:00
Eric Banks f3433201b1 Merged bug fix from Stable into Unstable 2012-05-03 11:11:00 -04:00
Eric Banks 557da77a1a Don't compute QD if there is no QUAL; added integration test for this 2012-05-03 11:02:37 -04:00
Eric Banks 1fc7b5d58b Merged bug fix from Stable into Unstable 2012-05-03 10:37:58 -04:00
Laurent Francioli 567d01cee8 - Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:49 -04:00
Laurent Francioli 96e5a26223 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:20 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionBufferedStream
as an argument not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides to its subclass the same String decode,
readHeader functionality before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (its a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's a particularly
clean why of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader obtain that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so its no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Eric Banks e448cfcc59 Forgot to update these md5s 2012-05-02 21:09:50 -04:00
Khalid Shakir b8b7f28aa9 Revving Picard to pick up new SamFileHeaderMerger.
Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut().
In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124.
- Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs.
- Updated md5s to match new expectations after looking at TLEN diff engine output.
2012-05-02 16:47:28 -04:00
Mauricio Carneiro f51a1d0d61 Better error message to the BAMScheduler
In the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.
2012-05-02 16:10:00 -04:00
Mauricio Carneiro 940029fa5d Fixing on-the-fly recalibration (caught by Ryan)
low quality bases in the tails were being turned to N's in the final read.
2012-05-02 16:06:04 -04:00
Eric Banks 623b36fbc4 Add header lines for AC,AF, and AN tags 2012-05-02 15:33:34 -04:00
Guillermo del Angel 429800a192 Fix corner case rounding issue in MathUtils unit test: 10^logFactorial(4)) was 23.999999... which if cast directly yielded 23 - so, do pre-rounding to ensure correct integer result if caller will cast value. 2012-05-02 09:57:06 -04:00
Guillermo del Angel 76a95fdedf Full implementation of multiallelic exact model for pools. Still super-linear so not useable at scale but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation which is more expensive (doesn't seem to save much time either in PoolCaller nor in UG though). 2012-05-02 09:24:28 -04:00
Joel Thibault 4d732fa586 Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb 2012-05-01 18:23:51 -04:00
Eric Banks 619a69a5f1 As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?). 2012-05-01 16:18:24 -04:00
Joel Thibault c255dd5917 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-01 16:10:38 -04:00
Ryan Poplin 51af61b5d7 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-01 16:07:23 -04:00
Ryan Poplin fc55dcec3c Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now. 2012-05-01 16:02:36 -04:00
Ryan Poplin 20a0078f23 Merging active regions across shard boundries if they are contiguous, have the same active status and don't grow too big. 2012-05-01 15:51:36 -04:00
Eric Banks 0f3af9555b Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case. 2012-05-01 14:58:06 -04:00
Joel Thibault aa4d41cce0 Minor cleanup before push 2012-05-01 14:16:44 -04:00
Joel Thibault b101b9c30b Add Mongo switch 2012-05-01 14:00:48 -04:00
Joel Thibault 1b609e9075 Move Mongo to server couchdb 2012-05-01 13:59:47 -04:00
Joel Thibault fd57d27f45 Move MongoDB connection handling to a separate class 2012-05-01 13:59:37 -04:00
Joel Thibault db3cd1abd5 Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes. 2012-05-01 13:57:23 -04:00
Joel Thibault 04e1be9106 Better handling of Mongo errors + exceptions 2012-05-01 13:57:23 -04:00
Joel Thibault ca737479cf Query for stop locations because we don't have that information in the reference 2012-05-01 13:57:23 -04:00
Joel Thibault 1cda87a4ad Set ROD priority list to input 2012-05-01 13:57:23 -04:00
Joel Thibault a7fe847faf Set the priority list and don't bother combining if not needed 2012-05-01 13:57:23 -04:00
Joel Thibault f739305f43 Combine the variants found at a location 2012-05-01 13:57:23 -04:00
Joel Thibault 020f884d5a Use new key of source ROD plus alleles 2012-05-01 13:57:23 -04:00
Joel Thibault 221ce9c3d6 Add alleles to the primary key 2012-05-01 13:57:23 -04:00