Guillermo del Angel
429800a192
Fix corner case rounding issue in MathUtils unit test: 10^logFactorial(4) was 23.999999..., which yielded 23 if cast directly. So, pre-round to ensure the correct integer result if the caller casts the value.
2012-05-02 09:57:06 -04:00
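The rounding pitfall in the commit above can be illustrated with a minimal sketch. It is written in Python rather than the project's Java, and the names log10_factorial and factorial_from_log10 are hypothetical, not the MathUtils API.

```python
import math

def log10_factorial(n):
    """Return log10(n!) by summing log10(k) for k = 1..n."""
    return sum(math.log10(k) for k in range(1, n + 1))

def factorial_from_log10(n):
    """Recover n! from its log10, rounding before the integer conversion."""
    approx = 10 ** log10_factorial(n)  # may land a hair above or below the true integer
    return int(round(approx))          # pre-round so truncation cannot drop to n! - 1

# Truncation alone discards the fractional part, so a value just under 24 becomes 23.
truncated = int(23.999999999)
rounded = int(round(23.999999999))
```

Rounding inside the helper means a caller that casts the result still gets the intended integer.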
Guillermo del Angel
76a95fdedf
Full implementation of multiallelic exact model for pools. Still super-linear so not usable at scale, but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation, which is more expensive (doesn't seem to save much time in either PoolCaller or UG, though).
2012-05-02 09:24:28 -04:00
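The idea of reusing cached log10 factorials for binomial and multinomial coefficients can be sketched as follows. This is an illustration in Python under assumed names (LOG10_FACTORIAL, MAX_CACHED, log10_binomial, log10_multinomial) and an assumed cache size; it is not the MathUtils code.

```python
import math

MAX_CACHED = 1000  # hypothetical cache size

# LOG10_FACTORIAL[n] holds log10(n!), built once with a running sum.
LOG10_FACTORIAL = [0.0] * (MAX_CACHED + 1)
for n in range(1, MAX_CACHED + 1):
    LOG10_FACTORIAL[n] = LOG10_FACTORIAL[n - 1] + math.log10(n)

def log10_binomial(n, k):
    """log10 of n choose k, using three cache lookups."""
    return LOG10_FACTORIAL[n] - LOG10_FACTORIAL[k] - LOG10_FACTORIAL[n - k]

def log10_multinomial(n, counts):
    """log10 of n! / (c1! * c2! * ...), where the counts must sum to n."""
    if sum(counts) != n:
        raise ValueError("counts must sum to n")
    return LOG10_FACTORIAL[n] - sum(LOG10_FACTORIAL[c] for c in counts)
```

Each coefficient then costs a few array reads and subtractions instead of log-gamma evaluations.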
Joel Thibault
4d732fa586
Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb
2012-05-01 18:23:51 -04:00
Eric Banks
619a69a5f1
As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?).
2012-05-01 16:18:24 -04:00
Joel Thibault
c255dd5917
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-05-01 16:10:38 -04:00
Ryan Poplin
51af61b5d7
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-05-01 16:07:23 -04:00
Ryan Poplin
fc55dcec3c
Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now.
2012-05-01 16:02:36 -04:00
Ryan Poplin
20a0078f23
Merging active regions across shard boundaries if they are contiguous, have the same active status and don't grow too big.
2012-05-01 15:51:36 -04:00
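The three merge conditions named in the commit above can be sketched like this. The region representation (inclusive start, inclusive stop, active flag), the function name merge_regions, and the max_size parameter are all assumptions made for illustration, in Python rather than the project's Java.

```python
def merge_regions(regions, max_size):
    """Merge neighbouring (start, stop, is_active) regions, sorted by start.

    Two regions are merged only if they are contiguous, share the same
    active status, and the merged span does not exceed max_size bases.
    """
    merged = []
    for start, stop, active in regions:
        if merged:
            prev_start, prev_stop, prev_active = merged[-1]
            contiguous = prev_stop + 1 == start
            small_enough = stop - prev_start + 1 <= max_size
            if contiguous and prev_active == active and small_enough:
                merged[-1] = (prev_start, stop, active)
                continue
        merged.append((start, stop, active))
    return merged
```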
Eric Banks
0f3af9555b
Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case.
2012-05-01 14:58:06 -04:00
Joel Thibault
aa4d41cce0
Minor cleanup before push
2012-05-01 14:16:44 -04:00
Joel Thibault
b101b9c30b
Add Mongo switch
2012-05-01 14:00:48 -04:00
Joel Thibault
1b609e9075
Move Mongo to server couchdb
2012-05-01 13:59:47 -04:00
Joel Thibault
fd57d27f45
Move MongoDB connection handling to a separate class
2012-05-01 13:59:37 -04:00
Joel Thibault
db3cd1abd5
Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes.
2012-05-01 13:57:23 -04:00
Joel Thibault
04e1be9106
Better handling of Mongo errors + exceptions
2012-05-01 13:57:23 -04:00
Joel Thibault
ca737479cf
Query for stop locations because we don't have that information in the reference
2012-05-01 13:57:23 -04:00
Joel Thibault
1cda87a4ad
Set ROD priority list to input
2012-05-01 13:57:23 -04:00
Joel Thibault
a7fe847faf
Set the priority list and don't bother combining if not needed
2012-05-01 13:57:23 -04:00
Joel Thibault
f739305f43
Combine the variants found at a location
2012-05-01 13:57:23 -04:00
Joel Thibault
020f884d5a
Use new key of source ROD plus alleles
2012-05-01 13:57:23 -04:00
Joel Thibault
221ce9c3d6
Add alleles to the primary key
2012-05-01 13:57:23 -04:00
Joel Thibault
3198ce5471
Can have multiple variants at a location
2012-05-01 13:57:22 -04:00
Joel Thibault
11ed8e61c9
Add referenceBaseForIndel to the Mongo VariantContext objects
2012-05-01 13:53:44 -04:00
Joel Thibault
7ed0ee7ed0
Skip locations with no genotypes instead of throwing a NPE
2012-05-01 13:53:44 -04:00
Joel Thibault
4bdfeacdaa
Handle multiple samples/genotypes per location
...
TODO: sample selection
2012-05-01 13:53:43 -04:00
Joel Thibault
1f7c628796
Insert the ROD filename into MongoDB as part of the primary key
2012-05-01 13:53:43 -04:00
Joel Thibault
bb8a6e9b0a
Initial test of write and read from MongoDB
2012-05-01 13:53:43 -04:00
David Roazen
c0084c741b
Pilot BCF2 Implementation: Checkpointing the code
...
* Not working yet, still very much a work-in-progress with lots of placeholders
* Needed to check this in to enable possible collaboration, since it's
going slower than anticipated and the conference deadline looms.
2012-05-01 12:23:10 -04:00
Eric Banks
0c8e801021
Removing public to private dependency
2012-05-01 11:04:11 -04:00
Eric Banks
e964d17518
Removing public to private dependency
2012-05-01 11:02:28 -04:00
Mauricio Carneiro
462450c3e3
disabling all BQSR unit tests
...
with the changes to the cycle covariate, some tests need updates, others need to be completely re-written.
2012-04-30 14:39:55 -04:00
Guillermo del Angel
e185632013
Exhaustive unit tests for Pool SNP genotype likelihoods:
...
a) Add ability for ErrorModel to be specified by external log-probability vector for testing.
b) For a given depth and ploidy (= 2 * samples/pool), create artificial high quality pileup testing from AC=0 to AC=ploidy, and test that pool GL's have expected content.
c) Misc. cleanups and beautification.
2012-04-30 14:29:46 -04:00
Christopher Hartl
7d029b9a28
Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-30 12:16:30 -04:00
Christopher Hartl
944a7d815e
Bringing VQSRV3 up to date. Lots of new features (un-classifying the worst-performing training sites, treating the x% best/worst sites as positive/negative points, ability to pass in a monomorphic track to see ROC curves output). Minor changes to AlleleBalance: weighted average was incorrectly specified (using logscale actually biased the average towards the AB of low-quality genotypes), and breaking out AB by het, hom, and diploid to bring it in line with some (private) changes to the indel likelihood model that (correctly) computes these values for indels.
2012-04-28 11:31:03 -04:00
Ryan Poplin
54a9bc2da2
Bug fix in reverse trim alleles for the case of mixed records that become non-mixed after subsetting the alleles.
2012-04-28 09:12:26 -04:00
Ryan Poplin
e332aeaf70
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-27 16:21:21 -04:00
Ryan Poplin
2b5dd28550
Bug fix in reverse trim alleles for the case of mixed records.
2012-04-27 16:21:02 -04:00
Mauricio Carneiro
1db2d1ba82
Do not add the first and last 4 cycles to the recalibration tables.
2012-04-27 15:18:07 -04:00
Mauricio Carneiro
08dbd756f3
Quick QC walkers to look at the error profile of indels in the read
2012-04-27 15:18:07 -04:00
Guillermo del Angel
730208133b
Several fixes and improvements to Pool caller with ancillary test functions (not done yet):
...
a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value.
b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time.
c) Expand unit tests and add an exhaustive test for ErrorModel class.
d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10.
e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp; can only specify pileups in one position with a given number of matches and mismatches to ref), but functionality will be expanded in future to cover more test cases.
f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done).
g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model.
h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math
2012-04-27 14:41:17 -04:00
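The math bug in item d) of the commit above comes down to a change of logarithm base: log10(e^x) equals x * log10(e), not x. A minimal sketch in Python (the names LOG10_E and ln_to_log10 are hypothetical, not taken from ErrorModel):

```python
import math

LOG10_E = math.log10(math.e)  # about 0.4343

def ln_to_log10(ln_value):
    """Convert a natural-log value to the equivalent log10 value."""
    return ln_value * LOG10_E

x = -3.0
direct = math.log10(math.exp(x))  # log10(e^x), computed the long way
converted = ln_to_log10(x)        # same quantity via the base-change factor
```

Here direct and converted agree, while both differ from x itself by a factor of roughly 2.3.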
Eric Banks
0439047269
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-27 10:49:45 -04:00
Eric Banks
05b44dd017
The genotypeCounts array wasn't always being initialized before it was accessed, leading to a NPE (which got caught and thrown as a JEXL expression when used in selection). Added unit test to cover all genotype count methods.
2012-04-27 10:49:36 -04:00
Khalid Shakir
9801dd114f
Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped
...
The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag()
Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.
2012-04-27 09:58:38 -04:00
Guillermo del Angel
2f86ccb086
Correct md5's for previous code change
2012-04-26 16:20:41 -04:00
Guillermo del Angel
972d6531b6
Corner case fix for indel GL computation: sometimes (depending on surrounding context) reads which are not informative of two candidate haplotypes end up having marginally higher likelihoods with one haplotype as opposed to another, depending on uncertainty on alignments in surrounding regions. So, a sample whose GL is -0.0001,-0.0005,-0.001 may have its genotype set to 1/1 due to this statistical noise. We already have a tolerance comparing max(gl)-min(gl) to avoid genotyping, so this tolerance is now increased from 0.001 to 0.1 (equivalent to 1 PL unit) to avoid genotyping a sample if all PLs are within this threshold. Changed 2 integration test md5s that hit this case.
2012-04-26 10:15:26 -04:00
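The tolerance check described in the commit above can be sketched as follows, assuming GLs are log10 likelihoods and PLs are Phred-scaled (PL = -10 * log10 likelihood), so that 0.1 log10 units correspond to 1 PL unit. The function names and the strict comparison at the boundary are assumptions; this is Python, not the project's Java.

```python
def pl_units(log10_diff):
    """Convert a log10-likelihood difference into Phred-scaled PL units."""
    return 10.0 * log10_diff

def should_genotype(gls, tolerance=0.1):
    """Return False when all GLs lie within `tolerance` of each other.

    A spread this small is treated as statistical noise, so the sample
    is left uncalled rather than assigned a genotype.
    """
    return max(gls) - min(gls) > tolerance
```

With the example from the commit message, should_genotype([-0.0001, -0.0005, -0.001]) is False, so the sample would not be set to 1/1.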
Laurent Francioli
ab2a952ad1
PED support for Inbreeding Coefficient annotation
...
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:56:47 -04:00
Laurent Francioli
219b0a128b
PED support for ChromosomeCounts annotation
...
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:50:04 -04:00
Laurent Francioli
19d5213d5a
Added function to get founders IDs in SampleDB
...
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:49:36 -04:00
Mauricio Carneiro
902277856e
fix for RBP getPileupsForSamples()
...
do not differentiate per sample pileups from generic pileups. Do the same for both -- it's O(n) either way.
2012-04-24 17:20:30 -04:00
Mauricio Carneiro
82b4798913
CountBasesWalker -- a quick QC walker.
2012-04-24 17:20:30 -04:00
Mauricio Carneiro
e440d0ce69
BQSR triage #4
...
* fixed queue script plot file names
* updated the ReadGroupCovariate to use the platform unit instead of sample + lane.
* fixed plotting of marginalized reported qualities
2012-04-24 17:19:54 -04:00
Eric Banks
d6277b70d8
Forgot to consider the optimized case in hasAllele
2012-04-24 11:32:28 -04:00
Eric Banks
91bad244d5
Using a VCF whose ALT is the reference in GGA mode is a User Error
2012-04-24 11:08:37 -04:00
Eric Banks
74ad008163
Adding VariantContext.hasAlternateAllele functionality
2012-04-24 11:07:46 -04:00
Eric Banks
66f3315548
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-24 09:39:55 -04:00
Eric Banks
bcb93dda5f
Fixing docs (rank sum test values are not phred-scaled)
2012-04-24 09:39:42 -04:00
Mauricio Carneiro
e39a59594a
BQSR triage and test routines
...
* updated BQSR queue script for faster turnaround
* implemented plot generation for scatter/gathered runs
* adjusted output file names to be cooperative with the queue script
* added the recalibration report file to the argument table in the report
* added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read
* added RecalibrationReport unit test -- guarantees the integrity of the delta tables
2012-04-23 11:23:00 -04:00
Eric Banks
a733723439
Merged bug fix from Stable into Unstable
2012-04-23 10:30:30 -04:00
Eric Banks
2761da975e
Handle null VCs (which can arise when indels are present in the file)
2012-04-23 10:30:00 -04:00
Eric Banks
cd63bcb1b8
Fixing unit tests to register the user exception being thrown (instead of the NumberFormatException)
2012-04-23 10:06:51 -04:00
Eric Banks
63aa79df82
Slightly better error message
2012-04-23 09:37:28 -04:00
Eric Banks
7b5fbf9567
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-23 09:34:08 -04:00
Eric Banks
4edb005411
Catch poorly formatted PL/GL fields
2012-04-23 09:33:50 -04:00
Ryan Poplin
35bb55f562
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-22 13:23:36 -04:00
Ryan Poplin
18e4532d10
Turning down the amount of assembly graph pruning slightly in the case of low coverage.
2012-04-22 13:23:24 -04:00
Eric Banks
1f23d99dfa
If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did.
2012-04-20 17:00:05 -04:00
Eric Banks
4b81c75642
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-20 14:30:19 -04:00
Eric Banks
f1c5510ec0
When running SelectVariants with the excludeNonVariants option, remove alleles from the ALT field that are no longer polymorphic.
2012-04-20 14:30:04 -04:00
Ryan Poplin
a1596791af
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-20 14:03:04 -04:00
Ryan Poplin
a57295eb75
Fixing a bug when breaking up active regions where the resulting regions would overlap by one base. Adding quality score manipulation from the UG into the haplotype caller (qual capped by mapping quality, min qual threshold).
2012-04-20 14:02:55 -04:00
Guillermo del Angel
de68363c23
Removed experimental feature (aka hack) that was meant for 1000G consensus but remained in VQSR data manager - QD was being scaled by indel length. There's no evidence any more that QD is length-dependent, neither in CEU trio data nor in latest 1000G P2 calls
2012-04-20 10:58:34 -04:00
Guillermo del Angel
d2488dfb81
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-19 19:40:03 -04:00
Guillermo del Angel
c44c7b9a97
Restored optimization in Pair HMM only to compute HMM matrices starting in index where haplotypes start to diverge - saves about 15-20% of runtime which is what we lost by disabling banding in latest version, so runtime should be now about the same as what it was before refactoring. Output is bit-true to previous commit
2012-04-19 19:39:43 -04:00
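The optimization in the commit above depends on knowing where two haplotypes first differ, since the HMM matrix cells before that index need not be recomputed. A minimal sketch of finding that index, in Python with a hypothetical function name (the PairHMM class itself is not shown here):

```python
def first_divergence(hap_a, hap_b):
    """Return the first index at which two haplotypes differ.

    If one haplotype is a prefix of the other, the length of the
    shorter one is returned.
    """
    limit = min(len(hap_a), len(hap_b))
    for i in range(limit):
        if hap_a[i] != hap_b[i]:
            return i
    return limit
```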
Mauricio Carneiro
0f8c77391d
BQSR bug triage #3
...
* fixed context covariate famous "off by one" error
* reduced maximum quality score to Q50 (following Eric/Ryan's suggestion)
* remove context downsampling in BQSR R script
2012-04-19 17:31:04 -04:00
Khalid Shakir
df5dd841af
AC strat now checks if evals will be merged before throwing an error on multiple eval files.
...
Minor tweaks to WGP script based on new recal VCF format.
2012-04-19 16:08:55 -04:00
Guillermo del Angel
1ae2ab5b63
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-19 12:50:29 -04:00
Guillermo del Angel
0e6e0cb907
Merging bug fixes
2012-04-19 12:49:30 -04:00
Eric Banks
79272c5e15
Thanks to Menachem for pointing out that the docs for genotyping_mode and output_mode were the same (and unclear). Fixed.
2012-04-19 12:48:09 -04:00
Guillermo del Angel
02ff930f6a
My changes
2012-04-19 12:45:18 -04:00
Eric Banks
2485cef5b8
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-19 11:46:06 -04:00
Eric Banks
76a6e37f4f
Don't output callability metrics by default anymore; one can still have them output to the 'metrics' file (which is now @Hidden because they are really for GSA use). Added a TODO to move UG from @By reference to reads and rods once LIBS is cleaned up.
2012-04-19 11:45:56 -04:00
Ryan Poplin
1ea4e48a27
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-19 11:32:32 -04:00
Ryan Poplin
11001ab9a2
Adding option to HaplotypeCaller to genotype the events on the chosen haplotypes as independent events. The filtered reads are now kept around so they can be passed to the variant annotations. Unfortunately the filtered reads aren't assigned a likelihood yet so they are all thrown in the Allele.NO_CALL bin.
2012-04-19 11:32:10 -04:00
Mauricio Carneiro
eb22cd7222
Unit test to guarantee BQSR sequential calculation accuracy
...
This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.
2012-04-19 09:33:40 -04:00
Mauricio Carneiro
68d0211fa1
Improved BQSR plotting and some new parameters
...
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
* Refactored the CycleCovariateUnitTest to test the pairing information
* Updated BQSR Integration tests accordingly
* Made quantization levels parameter not hidden anymore
* Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
* Added hidden option not to generate the plots automatically (important for scatter/gathering)
2012-04-19 09:31:41 -04:00
Guillermo del Angel
143e92b797
Rebasing
2012-04-18 20:05:43 -04:00
Guillermo del Angel
960e7e6aaf
Changes to integration tests
2012-04-18 19:53:42 -04:00
Guillermo del Angel
82efd4457e
Revert some bad merge changes
2012-04-18 16:35:09 -04:00
Guillermo del Angel
31c394d588
Resolve merge conflicts
2012-04-18 16:25:03 -04:00
Ryan Poplin
4999ae87ad
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-18 15:02:42 -04:00
Ryan Poplin
dcc4871468
minor misc optimizations to PairHMM
2012-04-18 15:02:26 -04:00
Eric Banks
d3c84e7b1f
This should be a User Error since it's provided from the DoC command-line arguments
2012-04-18 13:09:23 -04:00
Eric Banks
392f1903f7
Handling some of the NumberFormatExceptions seen via Tableau that are really user errors.
2012-04-18 12:57:37 -04:00
Ryan Poplin
8a84456626
Following Eric's awesome update to change the VQSR recal file into a VCF file, the ApplyRecalibration step is now scatter/gather-able and tree reducible.
2012-04-18 11:24:04 -04:00
Eric Banks
4448a3ea76
Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position.
2012-04-17 23:54:10 -04:00
Eric Banks
c1f52b773a
Minor tweaks and updated integration tests MD5s
2012-04-17 23:17:28 -04:00
Eric Banks
6d03bce0d3
Important refactoring of the VQSR recal file format: we now use a VCF instead of a CSV file.
...
The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration. For 1000G calling this was prohibitive in terms of memory requirements. Now we go through the rod system and pull in just the records we need at a given position.
As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).
2012-04-17 22:38:18 -04:00
Eric Banks
ea793d8e27
Khalid pressured me into adding an integration test that makes sure we don't fail on reads with adjacent I and D events.
2012-04-17 21:21:29 -04:00
Mauricio Carneiro
46a212d8e9
Added "simplify reads" option to PrintReads.
2012-04-17 19:32:34 -04:00
Mauricio Carneiro
f0c81b59b0
Implementation of the new BQSR plotting infrastructure
...
* removed low quality bases from the recalibration report.
* refactored the Datum (Recal and Accuracy) class structure
* created a new plotting csv table for optimized performance with the R script
* added a datum object that carries the accuracy information (AccuracyDatum) for plotting
* added mean reported quality score to all covariates
* added QualityScore as a covariate for plotting purposes
* added unit test to the key manager to operate with one required covariate and multiple optional covariates
* integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)
2012-04-17 19:23:55 -04:00
Ryan Poplin
952280bef1
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-17 17:00:14 -04:00
Ryan Poplin
cf705f6c62
Adding read position rank sum test to the list of annotations that get produced with the HaplotypeCaller
2012-04-17 17:00:00 -04:00
Eric Banks
13c800417e
Handle NPE in UG indel code: deletions immediately preceding insertions were not handled well in the code.
2012-04-17 15:51:23 -04:00
Guillermo del Angel
c78b0eee3a
Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller.
2012-04-17 14:22:48 -04:00
Khalid Shakir
91cb654791
AggregateMetrics:
...
- By porting from jython to java now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
2012-04-17 11:45:32 -04:00
Ryan Poplin
1a2e92f8db
Merged bug fix from Stable into Unstable
2012-04-17 10:23:05 -04:00
Ryan Poplin
adad76b36f
Fixing NPE in VQSR for the case of very small callsets.
2012-04-17 10:20:43 -04:00
Mark DePristo
3f6b2423d8
Update VE IT to reflect new fields and bugfixes
2012-04-13 17:00:37 -04:00
Mark DePristo
f9190b6fcd
VariantEvalUnitTest is better named VariantEvalWalkerUnitTest
2012-04-13 17:00:37 -04:00
Mark DePristo
23ccf772d4
IndelSummary now emits all of the underlying counts for ratios, percentages, etc it computes
2012-04-13 17:00:36 -04:00
Mark DePristo
84d1e8713a
Infrastructure for combining VariantEvaluations
...
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
2012-04-13 17:00:36 -04:00
Mark DePristo
38986e4240
Documentation for StratificationManager
2012-04-13 17:00:36 -04:00
Mark DePristo
ab06d53867
Useful test constructor for unit tests in RefMetaDataTracker
2012-04-13 17:00:36 -04:00
Mark DePristo
285e61a227
Bugfix for IndelSummary
...
-- multi allelic count should be % not ratio
2012-04-13 17:00:35 -04:00
Mark DePristo
e6d5cb46d2
Improvements and bugfixes to IndelSummary
...
-- Now properly includes both bi and multi-allelic variants. These are actually counted as well, and emitted as counts and % of sites with multiple alleles
-- Bug fix for gold standard rate
2012-04-13 17:00:35 -04:00
Mark DePristo
bfa966a4e9
Bugfix for OneBPIndel
...
-- Previously was only including 1 bp insertions in stratification
2012-04-13 17:00:35 -04:00
Mark DePristo
2aa2d9aec0
Merged bug fix from Stable into Unstable
2012-04-13 09:25:43 -04:00
Mark DePristo
27e7e17dc7
New way to handle exceptions in multi-threaded GATK
...
-- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now.
-- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer
-- Better printing of stack traces in WalkerTest
2012-04-13 09:23:33 -04:00
Mark DePristo
e85e9a8cf5
More extensive testing of type of error thrown in multi-threaded walker test
...
-- Unfortunately the result of the multi-threaded test is non-deterministic so run the test 10 times to see if the right exception is always thrown
-- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs
2012-04-13 09:23:33 -04:00
Eric Banks
297afc7911
Added unit test to ensure that we genotype correctly cases with really large GLs
2012-04-12 15:43:14 -04:00
Eric Banks
818e8c2fb9
Resolving merge conflicts
2012-04-12 15:19:44 -04:00
Eric Banks
0dd571928d
Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones).
2012-04-12 15:16:29 -04:00
Eric Banks
f77a6d18b8
Bad conflict merge before
2012-04-12 09:56:49 -04:00
Eric Banks
33a8bdd75f
Resolving merge conflicts
2012-04-12 09:51:55 -04:00
Eric Banks
b659b16b31
Generate User Error for bad POS value
2012-04-12 09:49:35 -04:00
Eric Banks
cc71baf691
Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing.
2012-04-12 09:18:44 -04:00
Eric Banks
5bf9dd2def
A framework to get annotations working in the HaplotypeCaller (and ART walkers in general).
...
Adding support for active-region-based annotation for most standard annotations. I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets) like e.g. the ReadPosRankSumTest.
IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller). The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system. When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.
2012-04-11 16:22:12 -04:00
Guillermo del Angel
f9f8589692
Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller.
2012-04-11 13:56:51 -04:00
Eric Banks
5b7da3831f
Not sure why this didn't make it into the last push, but here's a working MD5 for the NDA annotation in UG
2012-04-11 13:49:50 -04:00
Eric Banks
7aa654d13f
New interface for some dev work that Ryan and I are doing; only accessible from private walkers right now
2012-04-11 13:49:09 -04:00
Eric Banks
dc90508104
Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful.
2012-04-11 13:47:10 -04:00
Eric Banks
d2142c3aa7
Adding integration test for Flag Stat
2012-04-10 22:40:38 -04:00
Eric Banks
f560611fe8
Merged bug fix from Stable into Unstable
2012-04-10 22:26:53 -04:00
Eric Banks
f46f7d0590
Fix the stats coming out of FlagStat. I will add an integration test in unstable
2012-04-10 22:26:10 -04:00
Mauricio Carneiro
cd842b650e
Optimizing DiagnoseTargets
...
* Fixed output format to get a valid vcf
* Optimized the per sample pileup routine O(n^2) => O(n) pileup for samples
* Added support for overlapping intervals
* Removed expand target functionality (for now)
* Removed total depth (pointless metric)
2012-04-10 17:43:59 -04:00
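One common way to get from O(n^2) to O(n) for per-sample pileups is to bucket every element in a single pass instead of rescanning the pileup once per sample. The sketch below shows that general technique in Python; the (sample, base) tuple layout and the function name are assumptions, and the commit does not confirm that DiagnoseTargets uses exactly this approach.

```python
from collections import defaultdict

def pileups_by_sample(pileup):
    """Group (sample, base) pileup elements by sample in one pass."""
    by_sample = defaultdict(list)
    for sample, base in pileup:
        by_sample[sample].append(base)  # constant-time bucket lookup per element
    return dict(by_sample)
```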
Ryan Poplin
1df0adf862
Fixing ActivityProfile unit test.
2012-04-10 15:28:27 -04:00
Ryan Poplin
e3cc7cc59c
Resolving merge conflict.
2012-04-10 14:50:27 -04:00
Ryan Poplin
a4634624b7
There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function.
2012-04-10 14:48:23 -04:00
Eric Banks
10e74a71eb
We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior.
2012-04-10 12:30:35 -04:00
Mark DePristo
b43d21056b
Merged bug fix from Stable into Unstable
2012-04-10 09:42:09 -04:00
Mark DePristo
6885e2d065
UserException fixes for GATK_logs recent errors
...
-- SamFileReader.java:525
-- BlockCompressedInputStream:376
These were both instances where we weren't catching and rethrowing picard exceptions as UserExceptions.
2012-04-10 07:37:42 -04:00
Mark DePristo
8507cd7440
Throw UserException for bad dict / chain files
2012-04-10 07:22:43 -04:00
Ryan Poplin
cd9bf1bfc3
Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs.
2012-04-10 00:22:40 -04:00
Roger Zurawicki
9ece93ae9c
DiagnoseTargets now outputs a VCF file
...
- refactored the statistics classes
- concurrent callable statuses by sample are now available.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-04-09 16:40:20 -04:00
Guillermo del Angel
719ec9144a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-09 14:53:19 -04:00
Guillermo del Angel
550179a1f7
Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors.
2012-04-09 14:53:05 -04:00
Eric Banks
f82986ee62
Adding unit tests for the very important log10sumLog10 util method.
2012-04-09 14:28:25 -04:00
Eric Banks
ea4300d583
Refactoring so that Unified Argument Collection doesn't use deprecated classes.
2012-04-09 13:45:17 -04:00
Eric Banks
6ddf2170b6
More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice.
2012-04-09 11:46:16 -04:00
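The cache-and-collapse scheme described in the commit above can be sketched roughly as follows. This is a Python illustration under assumed names (log10_sum_log10, Log10SumCache, capacity), not the project's Java implementation: log10 values accumulate in a fixed-size buffer, and when the buffer is full they are summed in real space and the single log10 result goes back to the front.

```python
import math

def log10_sum_log10(values):
    """Return log10(sum(10**v for v in values)) without underflow."""
    m = max(values)
    if m == float("-inf"):
        return m
    # Re-center on the maximum so the largest term is exactly 10**0.
    return m + math.log10(sum(10 ** (v - m) for v in values))

class Log10SumCache:
    """Accumulate log10 values, collapsing to one entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = []

    def add(self, log10_value):
        if len(self.values) >= self.capacity:
            # Collapse everything seen so far into a single log10 value.
            self.values = [log10_sum_log10(self.values)]
        self.values.append(log10_value)

    def total(self):
        return log10_sum_log10(self.values)
```

The buffer never holds more than capacity entries, so memory stays bounded regardless of how many posteriors are added.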
Mauricio Carneiro
87e6bea6c1
Adding engine capability to quantize qualities.
...
* Added parameter -qq to quantize qualities using a recalibration report
* Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
* Updated BQSR scripts to make use of the new parameters
2012-04-08 21:07:51 -04:00