Eric Banks
0920a1921e
Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics.
2012-02-13 15:09:53 -05:00
Eric Banks
14981bed10
Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records.
2012-02-13 14:32:03 -05:00
Eric Banks
f52f1f659f
Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1.
2012-02-10 14:15:59 -05:00
Eric Banks
f53cd3de1b
Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent.
2012-02-10 11:07:32 -05:00
Eric Banks
7a937dd1eb
Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly.
2012-02-09 16:14:22 -05:00
Eric Banks
2f800b078c
Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again.
2012-02-08 15:27:16 -05:00
Mauricio Carneiro
d5d4fa8a88
Fixed discordance bug reported by Brad Chapman
...
discordance now reports discordance between genotypes as well (just like concordance)
2012-01-30 09:50:45 -05:00
Eric Banks
ddaf51a50f
Updated one integration test for indels
2012-01-25 19:18:51 -05:00
Eric Banks
e349b4b14b
Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod.
2012-01-25 11:35:54 -05:00
Mauricio Carneiro
ffd61f4c1c
Refactor the Pileup Element with regard to indels
...
Eric reported this bug: reduced reads were failing with an index out of bounds on what we thought was a deletion but turned out to be a read starting with an insertion.
* Refactored PileupElement to distinguish clearly between deletions and reads starting with an insertion
* Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
* Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
* Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions at the beginning of the read and deletions.
* Every deletion now has an offset (start of the event)
* Fixed bug when LocusIteratorByState found a read starting with an insertion that happened to be a reduced read.
* Separated the definitions of deletion/insertion (at the beginning of the read) in all UG annotations (and the annotator engine).
* Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
* Indel ReadPositionRankSum test now uses the deletion's true offset from the read; changed all appropriate md5s
* The extra pileup elements now properly read by the Indel mode of the UG cause every subsequent call to draw a different random number, so all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!
phew!
2012-01-24 16:07:21 -05:00
Christopher Hartl
4a08e8ca6e
Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down to changes in how genotypes are accounted for (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./.).
...
This is somewhat of an arbitrary decision, and is negotiable. I could see treating
GT:PL ./.:.
differently from
GT:PL .:0,3,6
but I am not sure it is worth doing so.
2012-01-23 08:25:34 -05:00
Eric Banks
ab8f499bc3
Annotate with FS even for filtered sites
2012-01-18 22:04:51 -05:00
Ryan Poplin
60024e0d7b
updating TDT integration test
2012-01-18 09:52:50 -05:00
Mauricio Carneiro
cec7107762
Better location for the downsampling of reads in PrintReads
...
* using filter() instead of map() makes for a cleaner walker.
* renaming the unit tests to make more sense with the other unit and integration tests
2012-01-14 14:06:09 -05:00
Mauricio Carneiro
28aa353501
Added "unbiased" downsampling parameter to PrintReads
...
* also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.
2012-01-12 16:33:55 -05:00
Mauricio Carneiro
77a03c9709
Patching special case in the adaptor clipping
...
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
* updated md5's accordingly
2012-01-11 17:47:44 -05:00
Eric Banks
c5320ef1af
Resolving changes in integration test during merge
2012-01-10 12:14:16 -05:00
Eric Banks
0f36f6947e
Resolving merge conflicts
2012-01-10 11:44:16 -05:00
Eric Banks
f2cecce10f
Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously).
2012-01-10 11:34:23 -05:00
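The approximate summing of log10 values mentioned in the commit above can be illustrated with a minimal sketch. This is not the GATK implementation: the function name, the `MAX_TOLERANCE` cutoff, and its value are assumptions chosen for illustration. The idea is to factor out the largest term and ignore terms too small to affect the result.

```python
import math

# Hypothetical cutoff: terms more than this many log10 units below the
# maximum contribute less than 1e-8 of the largest term and are skipped.
MAX_TOLERANCE = 8.0


def approximate_log10_sum_log10(values):
    """Approximate log10(sum(10**v for v in values)) without overflow."""
    if not values:
        return float("-inf")
    max_value = max(values)
    if max_value == float("-inf"):
        return float("-inf")
    total = 0.0
    for v in values:
        diff = max_value - v
        if diff < MAX_TOLERANCE:
            # 10**(v - max_value) is always in (0, 1], so this cannot overflow.
            total += 10.0 ** (-diff)
    return max_value + math.log10(total)
```

Skipping negligible terms trades a bounded loss of precision for fewer exponentiations, which is the kind of saving that matters when the sum is evaluated at every site.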
Mark DePristo
dd80ffbbbe
Merged bug fix from Stable into Unstable
2012-01-05 21:51:48 -05:00
Mark DePristo
c96fee477c
Bug fix for VariantSummary
...
-- Indels > 50 bp in length are tagged as CNVs (following the 1000 Genomes convention), and the code was unconditionally checking whether the CNV is already known by looking at the known CNVs file, which is optional. Fixed. Has the annoying side effect that indels > 50 bp in size are not counted as indels, and so are subtracted from both the novel and known counts for indels. C'est la vie
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878, which has many large indels. Using this more recent and representative file is probably a good idea for future tests in VE and other tools. The file is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data
2012-01-05 21:51:06 -05:00
Guillermo del Angel
58d4539304
Enabled banded indel computation by default. Reversed logic in input UG argument so that we can still disable it if required. Minor changes to integration tests due to minor differences in GL's and in annotations
2012-01-04 15:28:26 -05:00
David Roazen
621ee2b613
Merged bug fix from Stable into Unstable
2012-01-03 16:56:49 -05:00
David Roazen
ea6e718cb8
SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline.
...
For now, we recommend only running with the GRCh37.64 database.
2012-01-03 15:18:36 -05:00
David Roazen
4984ca5e31
Merged bug fix from Stable into Unstable
2012-01-03 11:03:30 -05:00
David Roazen
f3f01da1af
Enforce serial dependencies in RecalibrationWalkersIntegrationTest
...
Some tests in this class were intermittently not being executed due
to being randomly scheduled before tests whose results they depend on.
Now the serial dependencies are enforced to avoid problematic orderings.
2012-01-03 10:42:41 -05:00
Eric Banks
ab8d47d9a5
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-03 09:38:49 -05:00
Mauricio Carneiro
1b6d52817e
fixing adaptor clipping effect on recalibration integration test
2012-01-01 22:20:06 -05:00
Eric Banks
393993e0c7
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-31 20:42:46 -05:00
Mauricio Carneiro
55cfa76cf3
Updated integration tests for the new adaptor clipping fix.
2011-12-30 18:47:14 -05:00
Eric Banks
d20a25d681
A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently).
2011-12-27 16:50:38 -05:00
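The likelihood-based allele selection described in the commit above might look roughly like the sketch below. It assumes per-sample log10 likelihoods for all 10 diploid SNP genotypes. The function name, the scoring rule (crediting each alternate allele in a sample's best genotype with its gain over ref/ref), and the data layout are assumptions for illustration, not the actual UG code.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = [a + b for a, b in combinations_with_replacement(BASES, 2)]


def choose_alt_alleles(ref, sample_likelihoods, max_alt_alleles=3):
    """Return alternate alleles worth genotyping, best first.

    sample_likelihoods: one dict per sample, genotype -> log10 likelihood.
    An allele is kept only if some sample's best genotype contains it and
    is more likely than ref/ref (hypothetical scoring rule).
    """
    ref_ref = ref + ref
    score = {b: 0.0 for b in BASES if b != ref}
    for gls in sample_likelihoods:
        best = max(GENOTYPES, key=lambda g: gls[g])
        gain = gls[best] - gls[ref_ref]
        if gain > 0:
            for allele in set(best):
                if allele != ref:
                    score[allele] += gain
    supported = [a for a in score if score[a] > 0]
    supported.sort(key=lambda a: score[a], reverse=True)
    return supported[:max_alt_alleles]
```

With this rule, a site where every sample's most likely genotype is ref/ref yields no alternate alleles at all, regardless of base qualities.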
David Roazen
506c0e9c97
Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline
...
SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released
and we've verified the quality of its annotations.
2011-12-23 19:12:57 -05:00
David Roazen
510c71158c
Merged bug fix from Stable into Unstable
2011-12-22 10:49:52 -05:00
David Roazen
32cdef9682
Rename *PerformanceTest test classes to *LargeScaleTest
...
This is in preparation for the installation of the new performance test suite in Bamboo.
Note that "ant performancetest" is now "ant largescaletest"
2011-12-22 10:38:49 -05:00
Mauricio Carneiro
731a463415
Updated IntegrationTests with new adaptor clipper
...
phew!
2011-12-20 17:48:52 -05:00
Laurent Francioli
16cc2b864e
- Corrected bug causing cases where both parents are HET to be counted twice in the TDT calculation - Adapted TDT integration test to the corrected version of TDT
...
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2011-12-19 10:30:59 -05:00
Eric Banks
3069a689fe
Bug fix: if there are multiple records at a given position, it turns out that SelectVariants would drop all variants that follow after one that fails filters (instead of dropping just the failing one). Added an integration test to cover this case.
2011-12-19 10:04:33 -05:00
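The intended behaviour described in the commit above, dropping only the failing record among several at one position, can be sketched as follows. The function and parameter names are hypothetical, and the sketch makes no claim about what the original faulty code looked like.

```python
def select_passing(records_at_position, passes):
    """Keep every record at a position that passes the filter.

    `passes` is a caller-supplied predicate (hypothetical name).
    """
    selected = []
    for record in records_at_position:
        if not passes(record):
            # Skip just this record and keep looking at the ones after it;
            # stopping the loop here would drop the later records too.
            continue
        selected.append(record)
    return selected
```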
Eric Banks
76bd13a1ed
Forgot to update the unit test
2011-12-18 01:13:49 -05:00
Eric Banks
c5ffe0ab04
No reason to sum the normalized posteriors array to get Pr(AF>0) given that we can just compute 1.0 - array[0]. Integration tests change only because of trivial precision artifacts for reference calls using EMIT_ALL_SITES.
2011-12-18 00:31:47 -05:00
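The shortcut in the commit above follows from the posteriors being normalized: they sum to 1, so everything except the AF=0 entry is the complement of that entry. A small sketch with hypothetical function names and made-up numbers:

```python
import math


def prob_af_gt_zero_by_sum(normalized_posteriors):
    """Older style: add up the posterior mass of every AF > 0 entry."""
    return math.fsum(normalized_posteriors[1:])


def prob_af_gt_zero(normalized_posteriors):
    """Shortcut: take the complement of the AF = 0 entry."""
    return 1.0 - normalized_posteriors[0]


# Illustrative values only; index 0 is the posterior for AF = 0.
posteriors = [0.90, 0.07, 0.03]
```

The two functions agree up to floating-point rounding, which is consistent with the commit's note that tests changed only by trivial precision artifacts.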
Eric Banks
6dc52d42bf
Implemented the proper QUAL calculation for multi-allelic calls. Integration tests pass except for the ones making multi-allelic calls (duh) and one of the SLOD tests (which used to print 0 when one of the LODs was NaN but now we just don't print the SB annotation for that record).
2011-12-18 00:01:42 -05:00
Eric Banks
4fddac9f22
Updating busted integration tests
2011-12-14 16:24:43 -05:00
Eric Banks
1e90d602a4
Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.
2011-12-14 13:38:20 -05:00
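A cache like the one in the commit above can be sketched using the genotype ordering from the VCF specification, where the PL entry for genotype j/k (with j <= k) sits at index k*(k+1)/2 + j. The function and variable names are hypothetical, and the limit of 3 alternate alleles is just an example.

```python
def build_pl_index_cache(max_alt_alleles):
    """cache[n][i] is the allele pair (j, k) for PL index i with n alt alleles.

    Allele 0 is the reference; alleles 1..n are the alternates.
    """
    cache = []
    for num_alt in range(max_alt_alleles + 1):
        pairs = []
        # Enumerating k in the outer loop and j <= k in the inner loop
        # yields pairs in VCF order, so list position equals the PL index.
        for k in range(num_alt + 1):
            for j in range(k + 1):
                pairs.append((j, k))
        cache.append(pairs)
    return cache


# Built once up front, then consulted by plain list lookup.
PL_INDEX_TO_ALLELES = build_pl_index_cache(3)
```

For example, `PL_INDEX_TO_ALLELES[2][3]` looks up the allele pair for PL index 3 at a site with two alternate alleles, with no arithmetic at call time.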
Mauricio Carneiro
5cc1e72fdb
Parallelized SelectVariants
...
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Laurent Francioli
7cf27bb66e
Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval.
2011-12-12 12:22:43 +01:00
Laurent Francioli
025bdfe2cc
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-12 12:19:44 +01:00
Eric Banks
7b6338c742
Merge branch 'master' into trialleles
2011-12-11 00:28:46 -05:00
Eric Banks
7c4b9338ad
The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now.
2011-12-11 00:23:33 -05:00
Eric Banks
044f211a30
Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly.
2011-12-10 23:57:14 -05:00
Eric Banks
442ceb6ad9
The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors.
2011-12-09 10:16:44 -05:00
Laurent Francioli
a79144f7db
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-09 15:57:24 +01:00