droazen
a50c40ed05
Temporary commit to aid in investigation of recent intermittent
...
IndelRealignerIntegrationTest failures -- yes, it's the classic printf()
debugging technique. Will revert in a day or two once I get the data I need :)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5896 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 20:01:57 +00:00
rpoplin
2227f49220
misc cleanup
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5893 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 16:49:20 +00:00
rpoplin
9e834391fe
We now skip over all covering RODs in the BQSR as intended instead of just those which can be converted into a VariantContext. All the integration tests change because of subtleties in how certain dbsnp rod records are being converted into VCs. Added integration test which uses a bed file as the list of known polymorphic sites.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5892 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 16:32:17 +00:00
depristo
8ed82e5a08
The previous version of the UG was always creating BAQ'd pileups for the underlying site QUAL calculation. This resulted in some slowdown in the code. But as far as I can tell, the code actually didn't apply the BAQ'd base quality anywhere when the BAQ field wasn't in the read, so this just saves us 20% of the runtime when BAQ isn't enabled from heading into the BAQ subsystem when we don't actually want to get the BAQ'd base qualities.
...
Fixed minor problem with WalkerTest for "" (for parameterization) md5s.
Added an explicit integrationtest for BAQ NONE
Now only creates the BAQ'd pileup if the useBAQPileup parameter is provided in initializeAlternateAllele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5891 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 14:00:52 +00:00
depristo
136c8c7900
ClipReads now supports HARDCLIP_BASES, though in fact this turned out to be not necessary for my desired tests. In the process of developing the HARDCLIP mode, I added some proper ReadUtils unit tests, which would ideally be expanded to include other ReadUtil functions, as added
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5890 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 11:42:22 +00:00
hanna
a77ca2d36a
Incorporating Guillermo's patch to eliminate compile-time dependency of (core) UG indel model
...
on oneoffs. Thanks Guillermo! We'll polish the patch when you free up a bit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5888 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 02:22:19 +00:00
delangel
6ecbfa9013
OK, this time REALLY fix cut and paste error
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5880 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 19:47:12 +00:00
delangel
efe6602827
Fix copy-paste error from previous commit
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5878 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 16:02:08 +00:00
delangel
7a43673599
Bug fix: also enclose fetching FS or HRun in a try/catch block or else code will blow up if an annotation is absent (e.g. when there is no evidence for a variant in a vc)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5877 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 15:00:36 +00:00
delangel
f7298f4a7f
First of many baby steps to redo the way in which we trigger events for indel calling and to eliminate extended events: get rid of SpanningDeletions annotation for indels. It's completely useless, and even more so once we no longer trigger at extended events (because we'll trigger by definition a base before a deletion starts, so deletions present in the current pileup are not informative).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5876 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-26 00:49:23 +00:00
ebanks
bafdd4f8f7
Ask for existence of extended pileup before grabbing it
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5874 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-25 17:39:03 +00:00
ebanks
6ed71cf683
Annotation that adds a list of samples who are polymorphic at a site based on the GTs. Very useful if you are looking at rare variants among many samples, esp. in Evoker
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5868 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 20:12:27 +00:00
depristo
1bd1404aa9
Sometimes md5s can be null
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5867 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 19:17:18 +00:00
depristo
e582a92af6
WalkerTest now checks for valid md5s in the integrationtests themselves, so no more stray whitespace errors. Added a WalkerTestTest to ensure that bad MD5s are detected and an error thrown
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5865 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 14:34:55 +00:00
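The md5 check described in commit e582a92af6 above can be sketched roughly as follows. This is only an illustration, not the actual WalkerTest code: the class name `Md5Check`, the method `isValidMd5`, and the choice to accept a null or empty string as "nothing to compare" are all assumptions made for the example.

```java
import java.util.regex.Pattern;

final class Md5Check {
    // An MD5 digest printed in hex is exactly 32 hexadecimal characters.
    private static final Pattern MD5_HEX = Pattern.compile("[0-9a-fA-F]{32}");

    /**
     * Returns true for a well-formed hex MD5.
     * Assumption: null or "" means "no expected md5" and is accepted.
     */
    static boolean isValidMd5(String md5) {
        if (md5 == null || md5.isEmpty()) {
            return true;
        }
        // matches() must consume the whole string, so stray whitespace fails.
        return MD5_HEX.matcher(md5).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidMd5("d41d8cd98f00b204e9800998ecf8427e"));
        System.out.println(isValidMd5("d41d8cd98f00b204e9800998ecf8427e "));
    }
}
```

Anchoring the pattern to the whole string is what catches the stray-whitespace case the commit mentions.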
hanna
06486c134a
Kill extra space in the md5.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5863 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 12:00:31 +00:00
depristo
57e4693e4c
Slightly better error message when failing to create the index on the fly
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5861 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 11:04:08 +00:00
depristo
cf3dbfee97
Renamed variantMergeOptions to filteredRecordsMergeType, as this is really what it does. Cleaned up the wiki so that it's clear what this does, as well as included an example of how to create an intersection with CombineVariants and SelectVariants. Added integrationtests of CombineVariants with OMNI and HapMap that deal with the two ways to merge filtered/unfiltered records at the same site.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5860 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 01:54:29 +00:00
kiran
653475ce12
Now finds the most likely configuration of genotypes given the genotype likelihoods and inheritance constraints. The parental genotypes are now phased as well (the alleles are ordered as A_transmitted|A_untransmitted). Rewrote the way the transmission probability is calculated. This will probably move into core soon.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5859 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-24 01:35:40 +00:00
hanna
4bfec4c55b
Reenabling E.coli ValidatingPileup with MV1994 realigned using the BWA/C bindings.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5856 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 21:32:53 +00:00
chartl
c7f4674fe2
Great! Contracts is working. Fixing some misspecified ones.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5854 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 21:00:52 +00:00
hanna
5dca1e4d2e
Make IntervalIntegrationTest aware of the new alignments in the MV1994.bam
...
testset.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5852 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 19:59:47 +00:00
chartl
7ff5375493
Removing build-killing dependency on a private package.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5851 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 18:13:15 +00:00
chartl
0b07373909
Incorporating old feedback from eric: @deprecated methods should not be @deprecated, but rather protected, and the test's package moved to where it can access those test methods.
...
Also allows for the slightly more awesome name "MWUnitTest"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5850 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 18:06:05 +00:00
kiran
f8f37a786d
Now emits much more informative filter names and includes all of the other proper VCF header details (filter description line, tag definitions, etc.). Currently rewriting the way the transmission probability is calculated. This is shaping up to be a lovely little piece of code...
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5849 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 17:50:59 +00:00
chartl
15dc632570
The U-value can be zero (edge case)
...
z-value can not be NaN (and can't possibly be null)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5847 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 14:15:36 +00:00
chartl
3c31007da4
Stupid brackets. How did this even compile?
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5846 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 14:00:53 +00:00
chartl
480859db50
Contractified version of MannWhitneyU. Some behavior has been changed:
...
- Running a test when there are no observations of at least one of the sets now breaks the MWU contract
+ MWU returns Pair(Double.NaN,Double.NaN) in these instances to maintain the contract of never returning null
+ No more Double.Infinity values will appear
- RankSumTests now probe the return values for NaNs, and don't annotate if they appear
- For small sets where the probability is calculated recursively, the z-value is now the inversion of the error function
and not the approximate z-value
- UG and Annotator integration tests updated to reflect changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5845 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 13:57:15 +00:00
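The contract changes listed for commit 480859db50 above (never return null, return a NaN pair when one set has no observations, and have callers probe for NaN) can be sketched as below. This is not the MannWhitneyU class itself: `MwuSketch`, `Result`, and `isDefined` are hypothetical names, and the z-value here always uses the plain normal approximation without tie correction, whereas the commit says small sets are handled by a recursive calculation that is not shown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MwuSketch {
    /** Result holder; both fields are NaN when the test is undefined. */
    static final class Result {
        final double u;
        final double z;
        Result(double u, double z) { this.u = u; this.z = z; }
        boolean isDefined() { return !Double.isNaN(u) && !Double.isNaN(z); }
    }

    /** U counts pairs where an x value ranks below a y value (ties count half). */
    static Result test(List<Double> x, List<Double> y) {
        int n = x.size();
        int m = y.size();
        if (n == 0 || m == 0) {
            // Undefined test: signal with NaNs instead of returning null.
            return new Result(Double.NaN, Double.NaN);
        }
        double u = 0.0;
        for (double a : x) {
            for (double b : y) {
                if (a < b) u += 1.0;
                else if (a == b) u += 0.5;
            }
        }
        double mean = n * m / 2.0;
        double sd = Math.sqrt((double) n * m * (n + m + 1) / 12.0);
        return new Result(u, (u - mean) / sd);
    }

    public static void main(String[] args) {
        Result r = test(Arrays.asList(1.0, 2.0), Arrays.asList(3.0, 4.0));
        System.out.println(r.u + " " + r.z);
        Result empty = test(Collections.<Double>emptyList(), Arrays.asList(1.0));
        // A caller such as a rank sum annotation would skip annotating here.
        System.out.println(empty.isDefined());
    }
}
```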
depristo
b814f4bbd6
Contracts for HasGenomeLocation. BAQ iterator variables are all final. Contracts added
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5844 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 02:21:59 +00:00
depristo
43057bd15c
Remove Param annotation and associated broken processing code, as this was never used in the codebase
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5843 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 02:21:15 +00:00
depristo
d005c4bf09
GenomeLocProcessingTracker was using SimpleTimer in a non-thread safe way. No longer providing an interface to time parallel operations. Now issues warning if someone enables distributed GATK, as this is considered an unstable, experimental engine feature.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5842 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 02:10:27 +00:00
depristo
a18b0152df
Contracts for SimpleTimer, as well as UnitTests
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5841 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 19:45:31 +00:00
depristo
0dc0d586f1
Phasing-specific utilities are now in the Phasing walker
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5839 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:51:35 +00:00
depristo
f608ed6d5a
Removed old (and unused) reporting system, now that Kiran's VE reporting system is working. Refactors dictionary creation error messages into UserExceptions
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5836 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:42:52 +00:00
rpoplin
4e7ecbdcb2
FS values need to be jittered just like HRun
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5835 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 16:44:12 +00:00
depristo
9cc049f80f
Contracted ReferenceContext. Removed deprecated accessors that aren't used in the GATK at all
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5834 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 02:41:15 +00:00
depristo
d77f4ebe31
CalibrateGenotypeLikelihoods now emits a molten data set with REF and ALT alleles, so that GL calibration can be evaluated as a function of the REF/ALT bases. DigestTable is a stand-alone Rscript that digests the multi-GB molten data table into a tiny table that shows reported vs. empirical GLs, as a function of a variety of features of the data, like REF/ALT, comp GT, eval GT, and GL itself.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5833 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 14:02:30 +00:00
depristo
6a49e8df34
Significant change to the way subsetting by sample works with monomorphic sites. Now keeps the alt allele, even if a record is AC=0 after the subset. Previously, the system dropped the alt allele, which I don't think is the right behavior. If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting. See detailed information below.
...
Right now, if you select a multi-sample VCF file (or one with filters, I see) down to a smaller set of samples, and the site isn't polymorphic in that subgroup, then the alt allele is lost. For example, when selecting down NA12878 from the OMNI, I previously received the following VCF:
1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
Where the first two records lost the ALT allele, because NA12878 is hom-ref at this site. My change results in a VCF that looks like:
1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
The genotype remains unchanged, but the ALT allele is now preserved. I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().
isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?
In part this is complicated by the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth adopting a single approach to handling this at some point. Wiki docs updated.
Does anyone have critical infrastructure that depends on the previous convention? Let me know so we can coordinate the change.
There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 13:59:16 +00:00
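The isVariant / isPolymorphic distinction from commit 6a49e8df34 above can be illustrated with a toy site record. This is a sketch only: `SiteSketch` and its fields are hypothetical and are not the VariantContext API; genotypes are stored as allele indices with 0 meaning REF.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SiteSketch {
    final String ref;
    final List<String> alts;      // ALT alleles; an empty list stands for ALT = "."
    final List<int[]> genotypes;  // per-sample allele indices, 0 = REF

    SiteSketch(String ref, List<String> alts, List<int[]> genotypes) {
        this.ref = ref;
        this.alts = alts;
        this.genotypes = genotypes;
    }

    /** isVariant => is there an ALT allele? */
    boolean isVariant() {
        return !alts.isEmpty();
    }

    /** isPolymorphic => is some sample non-ref in the samples? */
    boolean isPolymorphic() {
        for (int[] gt : genotypes) {
            for (int allele : gt) {
                if (allele > 0) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // rs4477212 after subsetting to one hom-ref sample, ALT retained: A -> G, GT 0/0
        SiteSketch site = new SiteSketch("A", Arrays.asList("G"),
                Collections.singletonList(new int[] {0, 0}));
        System.out.println(site.isVariant() + " " + site.isPolymorphic());
    }
}
```

Under the new behavior the subsetted hom-ref record stays variant (it still has an ALT) while being non-polymorphic in the selected samples, which is the case the old behavior collapsed.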
depristo
8377424089
Basic error checking to ensure incoming arguments are provided correctly.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5831 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 13:43:48 +00:00
depristo
e234589240
Contracts for GenomeLocParser and GenomeLoc are now fully implemented.
...
GenomeLocs can officially have any start/stop values from -Inf - +Inf. Bounds w.r.t. the reference are enforced, optionally, by GenomeLocParser. General code cleanup throughout the subsystem.
All validation code for GLs is now centralized, and all I/O systems now validate their inputs. Because of this, the Picard interval processing code has been changed to examine whether an interval is valid, and only keep the valid intervals. Note that the scatter/gather test was changed, because the original hg18 chr20 interval files were actually malformed (all records for some reason were on chr20).
Many interval processing routines were moved to IntervalUtils, as this is their natural home.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5830 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 02:01:59 +00:00
kiran
3aa56037af
If asked, filters out triple-het situations too (which cannot be simply phased by transmission).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5829 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-20 18:48:19 +00:00
depristo
e16bc2cbd9
Contracts for Java now written for GenomeLoc and GenomeLocParser. The semantics of GenomeLoc are now much clearer. It is no longer allowed to create invalid GenomeLocs -- you can only create them with well formed start, end, and contigs, with respect to the master dictionary. Where one previously created an invalid GenomeLoc, and asked is this valid, you must now provide the raw arguments to helper functions to assess this. Providing bad arguments to GenomeLoc generates UserExceptions now. Added utility functions contigIsInDictionary and indexIsInDictionary to help with this.
...
Refactored several Interval utilities from GenomeLocParser to IntervalUtils, where one might expect them to go
Removed GenomeLoc.clone() method, as this was not correctly implemented, and actually unnecessary, as GenomeLocs are immutable. Several iterator classes have changed to remove their use of clone()
Removed misc. unnecessary imports
Disabled, temporarily, the validating pileup integration test, as it uses reads mapped to a different reference sequence for ecoli, and this now does not satisfy the contracts for GenomeLoc
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5827 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-20 15:43:27 +00:00
kiran
d896a4a9d3
Given genotypes for a trio, phases child by transmission. Computes probability that the determined phase is correct given that the genotypes for mom and dad are correct (useful if you want to use this to compare phasing accuracy, but want to break that comparison down by phasing confidence in the truth set). Optionally filters out sites where the phasing is indeterminate.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5824 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 21:27:37 +00:00
rpoplin
fe4b40ac2c
Adding new InbreedingCoeff and PercentNBases annotations for Guillermo to use.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5823 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 19:50:39 +00:00
ebanks
bc98ac1e74
Adding a TODO for future consideration
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5821 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 15:02:23 +00:00
hanna
0bb6b9a91a
Locus iterators were implemented in a peekable style, which meant that a locus
...
and its three or four nearest neighbors could be in memory at once. Tweaking
the iterators to ensure that previous AlignmentContexts don't have strong
references which means that the garbage collector can work effectively to
help us trundle through these regions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5820 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 21:40:40 +00:00
hanna
a38b2be329
Fix for old, broken invariant where unmapped reads are represented by null rather than an empty BAMFileSpan.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5819 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 20:57:38 +00:00
carneiro
ebcd333ed8
Quick small updates:
...
SelectVariants: typo
MethodsDevelopmentPipeline: Added CEU Trio WGS dataset
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5818 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 20:08:39 +00:00
rpoplin
4b00fd2688
Adding User Exception to VQSR for the case of trying to cluster with an annotation that doesn't exist in the input VCF
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5816 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 19:47:51 +00:00
rpoplin
d698c87bbf
More UserExceptions and warnings in VQSR.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5813 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 19:03:21 +00:00
kshakir
541b5f7a80
Somehow checked in a version that was building extensions for everything ("") instead of selected packages. Fixed.
...
Also added more logging when extension generation fails.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5812 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 16:58:37 +00:00
delangel
a27e8b1dc6
Bug fix - use correct variable to retrieve from map.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5811 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 15:32:58 +00:00
rpoplin
d925f76edc
Cutting down on the number of info lines in VQSR so that I can read the warning messages
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5810 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 13:35:51 +00:00
delangel
5a7444e186
First step in refactoring UG way of storing indel likelihoods - main motive is that rank sum annotations require per-read quality or likelihood information, and even the question "what allele of a variant is present in a read" which is trivial for SNPs may not be that straightforward for indels.
...
This step just changes storage of likelihoods so now we have, instead of an internal matrix, a class member which stores, as a hash table, a mapping from pileup element to an (allele, likelihood) pair. There's no functional change aside from internal data storage.
As a bonus, we get for free a 2-3x improvement in speed in calling because redundant likelihood computations are removed.
Next step will hook this up to, and redefine annotation engine interaction with UG for indel case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5809 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-15 23:04:11 +00:00
depristo
3ccc08ace4
Now emits siteType = {SNP,INDEL}. Doesn't work (and may never actually work) for indels under current extended event system.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5808 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-15 19:16:09 +00:00
depristo
75db4705ab
Added splitContextByReadGroup() and fixed bug in getPileupForReadGroup() that resulted in a NPE when no reads were present for a read group.
...
Added doc string for getNBoundRodTracks()
Intermediate commit for CalibrateGenotypeLikelihoods and GenotypeConcordanceTable, so I have a record of my work. Not ready for public consumption. Really looking forward to making local commits so I can track my progress without needing to push incomplete functionality up to the server.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5807 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-15 17:36:07 +00:00
delangel
fa75efb6ac
Backing off - need to change pileup interface for rank sum tests before indels can be annotated with them
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5804 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 21:54:54 +00:00
asivache
befbcd274b
Computes additional stats we want to use later for filtering: median and mad for indel position with respect to starts and ends of all the reads that support it
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5803 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 21:19:58 +00:00
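The two statistics named in commit befbcd274b above can be sketched as below, assuming "mad" means median absolute deviation (the commit does not spell it out). `RobustStats` and its method names are hypothetical, and the input is taken to be an array of indel offsets from read starts or ends.

```java
import java.util.Arrays;

final class RobustStats {
    /** Median of the values; the input array is not modified. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Median absolute deviation: the median of |x - median(x)|. */
    static double mad(double[] values) {
        double med = median(values);
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - med);
        }
        return median(deviations);
    }

    public static void main(String[] args) {
        double[] offsets = {1, 2, 3, 4, 100};
        System.out.println(median(offsets) + " " + mad(offsets));
    }
}
```

Both statistics are insensitive to a single extreme read, which is presumably why they are attractive as filtering inputs.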
asivache
5c889580c4
Change of logic: if "read" (sequence 2) sticks out beyond the boundary of the ref (sequence 1) it is aligned to, the extra bases on the left or on the right will be softclipped in the cigar generated for such an alignment, rather than added to the first/last M block. This also affects alignment offset: if read starts before the ref (used to be represented by a negative offset), the cigar now will start with S, and the returned offset (alignment start) will be 0.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5802 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 21:12:54 +00:00
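The soft-clipping rule from commit 5c889580c4 above can be sketched for the simplest case. This assumes a gapless alignment (no I or D operations), which the real aligner does not; `SoftClipSketch`, `cigar`, and `alignmentStart` are hypothetical names, and `offset` is the position of the read's first base relative to the start of the ref (negative when the read starts before it).

```java
final class SoftClipSketch {
    /** Builds a cigar where bases overhanging either end of the ref become S. */
    static String cigar(int readLength, int refLength, int offset) {
        int leftClip = Math.max(0, -offset);   // bases before the ref start
        int start = Math.max(0, offset);       // reported alignment start
        int matched = Math.min(readLength - leftClip, refLength - start);
        if (matched <= 0) {
            throw new IllegalArgumentException("read does not overlap the reference");
        }
        int rightClip = readLength - leftClip - matched; // bases past the ref end
        StringBuilder sb = new StringBuilder();
        if (leftClip > 0) sb.append(leftClip).append('S');
        sb.append(matched).append('M');
        if (rightClip > 0) sb.append(rightClip).append('S');
        return sb.toString();
    }

    /** A read starting before the ref now reports offset 0 instead of a negative value. */
    static int alignmentStart(int offset) {
        return Math.max(0, offset);
    }

    public static void main(String[] args) {
        System.out.println(cigar(10, 20, -3) + " at " + alignmentStart(-3));
        System.out.println(cigar(10, 20, 15) + " at " + alignmentStart(15));
    }
}
```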
delangel
d4ca8d94fa
Trivial change to allow indels to be annotated by rank sum tests
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5801 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 20:24:08 +00:00
hanna
03452c15c0
Cleanup GATKBAMIndex unit test to allow a more efficient access pattern for
...
FindLargeShards. Runtime of FindLargeShards on papuan dataset is now 75min.
GATK proper should benefit as well, although the benefits might be so small
as to not be measurable.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5798 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 21:50:33 +00:00
depristo
db1f9af679
Now supports multiple records in allele at sites that genotype as reference
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5796 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 17:36:27 +00:00
rpoplin
a22e98a2c4
Yikes. Fixing the build
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5794 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 01:52:35 +00:00
rpoplin
40797f9d45
Ensuring a minimum number of variants when clustering with bad variants. Better error message when Matrix library fails to calculate inverse.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5793 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 01:48:37 +00:00
kshakir
a20d257773
Generating extensions for org.broadinstitute.sting.gatk.datasources.reads.utilities, including FindLargeShards.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5792 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 00:49:31 +00:00
carneiro
fb1be2653c
A succinct walker that reports GC content by interval. Taking down two old implementations of the same thing from oneoffs. Documentation added to the wiki.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5790 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-11 18:53:11 +00:00
depristo
9a1d0d7076
Simple bug fix to allow multiple records at same site when genotyping given alleles. Takes only the first record (respecting filters, SNP type, etc), and issues a warning if there is more than one valid record at a site
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5789 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-11 14:17:14 +00:00
ebanks
dfdef2d29b
PLEASE READ ME! In order to prepare for the upcoming changes to VCF4, we felt it was best to split up the vcf3 and vcf4 codecs (vcf4 is not backwards compatible to vcf3 and certain changes are too complex to handle in both codecs). Using the 'VCF' rod type in the GATK will now throw a UserException for vcf3.2 or vcf3.3 files telling you to use the 'VCF3' type instead (and vice versa). Integration/unit tests have been updated. For programmers: note that there is currently a lot of code duplication in the two codecs (although I pulled out the easy stuff to a VCFCodecUtils class); however WE ARE FREEZING THE VCF3 CODEC AND WILL NO LONGER MAKE CHANGES TO IT. All updates/improvements will be targeted to the vcf4 codec only as vcf3 is there only to be able to read legacy files. People should really be using vcf4 files only.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5787 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-11 12:07:44 +00:00
delangel
852e555c00
Fix broken functionality from previous commit.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5786 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-10 18:38:25 +00:00
ebanks
8d47d2e813
Fix for Tim. It was possible for the constrained mate fixer to dump its cache in the middle of a given realignment (so the IndelRealigner was playing by the rules). No longer possible.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5785 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-10 16:48:24 +00:00
delangel
3c364279f4
Add simple ability to create "X out of N" combined files: if a site is present in at least X input rods, it gets output, otherwise it's skipped, controlled with argument -minN.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5783 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-10 15:27:18 +00:00
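The "X out of N" rule from commit 3c364279f4 above amounts to counting how many inputs contain each site and keeping those at or above the threshold. The sketch below shows that counting step only; `MinNSketch` and `sitesInAtLeast` are hypothetical names, and sites are reduced to plain strings rather than real rod records.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class MinNSketch {
    /** Keeps each site that appears in at least minN of the input site sets. */
    static SortedSet<String> sitesInAtLeast(List<Set<String>> inputs, int minN) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Set<String> input : inputs) {
            for (String site : input) {
                counts.merge(site, 1, Integer::sum); // one vote per input containing the site
            }
        }
        SortedSet<String> kept = new TreeSet<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minN) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Set<String>> inputs = Arrays.<Set<String>>asList(
                new HashSet<String>(Arrays.asList("1:100", "1:200")),
                new HashSet<String>(Arrays.asList("1:200", "1:300")),
                new HashSet<String>(Arrays.asList("1:200", "1:300", "1:400")));
        System.out.println(sitesInAtLeast(inputs, 2));
    }
}
```

With a threshold of 1 this degenerates to a union and with a threshold equal to the number of inputs to an intersection, so the one parameter covers both extremes.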
hanna
f275be6968
A 'fat shard' finder. Cranks through the indices of a BAM file or list of
...
BAM files looking for outliers (outliers right now are defined naively as
shards whose sizes are more than 5 stddevs away from the mean). Runs in
13 minutes per chromosome on 707 low pass whole genome BAMs -- not great, but
much faster than running UG on the same region to discover anomalies.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5782 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-10 12:56:47 +00:00
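The naive outlier definition in commit f275be6968 above (shards whose sizes are more than 5 stddevs away from the mean) can be sketched as follows. This is not the FindLargeShards code: `ShardOutliers` and `outlierIndices` are hypothetical, shard sizes are passed in as a plain array, and the population standard deviation is an assumption since the commit does not say which estimator is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ShardOutliers {
    /** Returns indices of sizes lying more than maxStdDevs standard deviations from the mean. */
    static List<Integer> outlierIndices(long[] sizes, double maxStdDevs) {
        List<Integer> outliers = new ArrayList<Integer>();
        int n = sizes.length;
        if (n == 0) {
            return outliers;
        }
        double sum = 0.0;
        for (long s : sizes) sum += s;
        double mean = sum / n;

        double squares = 0.0;
        for (long s : sizes) squares += (s - mean) * (s - mean);
        double stdDev = Math.sqrt(squares / n); // population standard deviation (assumed)

        for (int i = 0; i < n; i++) {
            if (Math.abs(sizes[i] - mean) > maxStdDevs * stdDev) {
                outliers.add(i);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        long[] sizes = new long[100];
        Arrays.fill(sizes, 100L);
        sizes[99] = 10000L; // one "fat" shard
        System.out.println(outlierIndices(sizes, 5.0));
    }
}
```

One caveat of this definition: with n values, a single point can sit at most sqrt(n - 1) population standard deviations from the mean, so a 5-stddev cutoff can only fire when there are well over 25 shards.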
kshakir
7d21350a17
Fixed import.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5780 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-09 18:07:40 +00:00
asivache
0861451726
Print on multiple rows in standalone command line mode when the sequences are too long
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5779 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-09 13:51:00 +00:00
ebanks
bf40351094
Minor update
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5778 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-08 03:48:37 +00:00
ebanks
15c7bd82a5
Fix for IndelRealigner memory problem. Now the Constrained mate fixing writer is told whether a read has been modified and, if it wasn't, can dump it when the cache needs to get flushed at places with tons of coverage.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5777 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-06 19:34:41 +00:00
rpoplin
d8a761bbbd
Warn the user if trying to train with too few variants
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5776 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-06 17:47:58 +00:00
hanna
c2e8c460cb
Factor out all testing dependencies into a separate test configuration and
...
only download that test configuration when running unit/integration tests.
This means that the build will (hopefully) never break because it can't
fetch a file that isn't required for the GATK to run.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5775 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-05 22:42:11 +00:00
rpoplin
b94d8dae17
Removing requirement of providing known track in VQSR for the non-humans. Updating placement of legend on tranche plot.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5773 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-05 20:24:06 +00:00
delangel
7d7ce6cf00
Two embarrassing bug fixes:
...
a) Forgot to convert from phred to log-prob when computing gap penalties from recal table.
b) Forgot to uncomment code to correctly deal with hard-clipped bases in a read. But because of this, had to do a short term workaround to at least temporarily return class from hardClipAdaptorSequence to GATKSAMRecord. Otherwise, I get exceptions when casting because somehow some reads in HiSeq get to be SAMRecord (which GATKSAMRecord inherits from) but some reads get to be BAMRecords (which can't be cast into GATKSAMRecord), not sure why.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5771 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-05 17:08:34 +00:00
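Fix (a) in commit 7d7ce6cf00 above is a unit conversion. A Phred score is defined as Q = -10 * log10(p), so the log-probability is -Q / 10. The sketch below assumes base-10 logs, which the commit does not state explicitly; `PhredSketch` and its method names are hypothetical.

```java
final class PhredSketch {
    /** Phred Q = -10 * log10(p), hence log10(p) = -Q / 10. */
    static double phredToLog10Prob(double phred) {
        return -phred / 10.0;
    }

    /** The same conversion taken all the way back to a probability. */
    static double phredToProb(double phred) {
        return Math.pow(10.0, phredToLog10Prob(phred));
    }

    public static void main(String[] args) {
        // Using a raw Phred value (e.g. 30) as if it were a log-probability (-3.0)
        // would give penalties of the wrong sign and scale.
        System.out.println(phredToLog10Prob(30.0));
        System.out.println(phredToProb(20.0));
    }
}
```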
kshakir
28b897d5de
Fixed O(N^2) operation when scattering interval files.
...
Cleaned up intervals contig count function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5768 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-05 03:32:35 +00:00
carneiro
3882d1b9c0
fixing the build \o/
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5767 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-05 00:57:49 +00:00
kshakir
8ad547e6c2
Fixed another interval bug where dividing up N intervals into N parts wasn't working.
...
Minor updates to the FCPTest to match the changes due to using the old indel caller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5766 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 20:49:35 +00:00
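The edge case in 8ad547e6c2 (dividing N intervals into N parts) can be illustrated with a generic even split. This is a minimal sketch under the assumption that parts are contiguous and sized by item count; `EvenSplitSketch` and `splitEvenly` are hypothetical names, not the Queue or IntervalUtils API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not the GATK/Queue interval scattering code.
final class EvenSplitSketch {

    /**
     * Splits items into exactly `parts` contiguous, non-empty sublists whose
     * sizes differ by at most one. Requires 1 <= parts <= items.size().
     */
    static <T> List<List<T>> splitEvenly(List<T> items, int parts) {
        int n = items.size();
        if (parts < 1 || parts > n) {
            throw new IllegalArgumentException("parts must be between 1 and " + n);
        }
        List<List<T>> result = new ArrayList<>(parts);
        for (int i = 0; i < parts; i++) {
            // Integer boundaries i*n/parts; long arithmetic avoids overflow.
            int start = (int) ((long) i * n / parts);
            int end = (int) ((long) (i + 1) * n / parts);
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }
}
```

With `parts == items.size()` the boundaries fall at 0, 1, 2, ..., so every part holds exactly one item.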
hanna
5c6965575e
Some refactoring that Mauricio and I worked through together. Changed filters
to extend from org.broadinstitute.sting.gatk.filters.ReadFilter rather than
directly from net.sf.picard.filter.SamRecordFilter, which allows us to add
an initialize(GATKEngine) method so that filters can do any initialization
they'd like based on CL arguments, SAM headers, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5760 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 19:29:08 +00:00
carneiro
b66c6dced1
- No longer prints out non-confident calls (they were leading to tables that don't add up and confusing some PacBio folk).
...
- Added sensitivity and specificity to the report.
- With the changes in genotype likelihoods, the indel analysis only happens if the BAM file also has an extended event. Not great, but at least it's not broken.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5759 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 19:26:55 +00:00
carneiro
7ed8b4ddb0
Making sure CalculateLikelihoodsAndGenotypes returns an empty variant context when 'EMIT_ALL_SITES' and 'GENOTYPE_GIVEN_ALLELES' are being used, now for indels too!
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5756 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 18:04:56 +00:00
rpoplin
6c7a0adc76
Updating VariantGaussianMixtureModelUnitTest to use truth sensitivity cutting
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5750 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 13:56:01 +00:00
delangel
a19389528d
Bring back from the dead the old likelihoods model for indels, which has worse performance but is about 4x faster. Enabled with argument -GSA_PRODUCTION_ONLY in UG
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5748 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 22:38:33 +00:00
carneiro
e5cc0f4eec
Added 'specificity' to variant eval's Validation Report evaluator.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5742 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 20:48:30 +00:00
rpoplin
b88dec387c
clean up from VQSR movement
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5741 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 20:35:30 +00:00
rpoplin
23cd3a7a5d
Moving VQSR v2 to core.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5740 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 20:20:06 +00:00
rpoplin
44a717f63a
Good bye VQSR v1. This commit will break the build.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5739 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 20:09:52 +00:00
hanna
2dacf1b2b2
Better header support when running R's read.table(...,header=T).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5738 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:56:20 +00:00
hanna
ad8c786b2d
Now more easily R-parseable.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5737 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:30:50 +00:00
rpoplin
5bade81c6d
Adding tranche plot generation back to VQSR
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5736 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:26:26 +00:00
rpoplin
e73720c2db
Updating VQSLOD annotation description
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5735 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 19:01:08 +00:00
rpoplin
11052918d9
Better exception text for common error in VQSR.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5734 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:37:25 +00:00
rpoplin
4bbce42861
Renaming ContrastiveRecalibrator --> VariantRecalibrator in preparation for move to core
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5733 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:12:47 +00:00
rpoplin
6323fb8673
misc cleanup in VQSR
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5732 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:00:22 +00:00
hanna
f3bd11a02e
Dress up some formatting issues.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5731 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 17:35:18 +00:00
hanna
9c809ed68e
A walker to analyze the memory consumption of reference, reads, and RODs at
each base both in bytes and as a percentage of the used heap size.
May be a bit buggy at this point; there are a lot of metrics around the Java
heap and I'm not completely sure that the metrics I'm outputting are exactly
the ones that I'm looking for.
Also fixed a documentation bug in my Sizeof class.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5730 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 17:08:15 +00:00
ebanks
d4cbd8691c
Make the default that we only output SNPs (so that when I make another release we don't get flooded with questions about why the UG is all of a sudden so slow)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5729 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 16:38:55 +00:00
rpoplin
70f8ab6f89
Adding AF bin stratification for VariantEval.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5728 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 15:22:50 +00:00
hanna
870e65a685
Fixing a build failure because I want to be completely sure that the code I
checked in immediately following the build breaking code passes integration
tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5727 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 02:09:53 +00:00
hanna
411980a50a
Performance enhancements in GATKBAMIndex. Not sure these will assist in a
normal use case, but they cut startup times and memory allocation noise in
the profiler, making my profiling time more productive.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5726 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 20:48:16 +00:00
delangel
422d4ceeea
removed useless file - no need for tableRecalibration, right now everything is done in PairHMMIndelErrorModel
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5725 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 20:35:44 +00:00
delangel
2a80ffa2ee
Totally experimental, barely usable, not-to-be-used-yet implementation of an "Indel Quality Recalibrator". The idea is that any indel that's not in the input dbsnp is treated as an artifact, and then a csv is built with # of indels and # of observations as a function of each input covariate (initially, only cycle, read group and homopolymer run are useful). Then, when computing likelihoods of indels based on input haplotypes, we compute gap penalties based on the value of the covariates at the read. Feature is disabled by default with hidden arguments. TBD if usefulness of feature is worth the extra time and pain.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5724 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 20:31:43 +00:00
rpoplin
3224bbe750
New visualization output for VQSR. It creates the R script file on the fly and then runs Rscript on it. Adding 1000G Project consensus code. First pass of having VQSR work with missing data by marginalizing over the missing dimension for that data point (thanks Chris and Bob for ideas). Updated math functions to use apache math commons instead of approximations from wikipedia. New parameters available for the priors based on further reading in Bishop and looking at the new visualizations. Updated integration test to use more modern files. Updated MDCP to use new best practices w.r.t. annotations.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5723 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 19:14:42 +00:00
ebanks
fcf8cff64a
We didn't actually support all of these extensions. Updated to be accurate.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5722 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 19:03:46 +00:00
carneiro
34092fd32f
minor update...
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5716 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 21:29:01 +00:00
carneiro
36ac8beee1
Making the GATK unpredictably random...
through an option!
set -ndrs if you want the GATK to be really random (non-deterministic). Engine option, available to every walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5715 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 19:29:08 +00:00
carneiro
f97e7d2fb4
Walker that calculates the percentage of bases that are covered to at least 20x. Very useful! In oneoffs until someone else thinks it's as useful as I think it is ;)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5714 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 19:19:39 +00:00
ebanks
deed7c47a1
Continuing the epic fail, some of our existing integration tests were wrong because of the lazy loading failure.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5712 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 17:54:41 +00:00
ebanks
ab9ffb1a74
Epic failure on the lazy loading of genotypes: if the input VCF had its samples unsorted and we used a walker that didn't require genotypes, then we would sort the samples but not load genotypes (and therefore the genotypes wouldn't match the samples anymore). Added simple integration test to cover this case.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5711 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 16:03:45 +00:00
hanna
96571b55be
Disable caching of ReadShards by the GenomeLocProcessingTracker (at least
temporarily). Unfortunately this does not completely fix the IndelRealigner
exception that Ryan is seeing, but it helps things quite a bit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5710 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 13:59:34 +00:00
carneiro
a5b96e0e04
I have to remember that this is Java, not C.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5709 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 17:40:14 +00:00
rpoplin
b7334dcc1e
Rank sum test annotations are the Z-scores from the test instead of the p-value.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5707 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 14:35:00 +00:00
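As background for b7334dcc1e, a rank sum (Mann-Whitney U) statistic is commonly turned into a Z-score with the normal approximation z = (U - n1*n2/2) / sqrt(n1*n2*(n1+n2+1)/12). The sketch below shows only that textbook formula, with no tie or continuity correction; whether the annotation code applies such corrections is not stated in the commit, and the class and method names are hypothetical.

```java
// Hypothetical helper showing the textbook normal approximation; not the GATK annotation code.
final class RankSumZScoreSketch {

    /**
     * Z-score of a Mann-Whitney U statistic for samples of size n1 and n2,
     * using the normal approximation without tie or continuity correction.
     */
    static double zScore(double u, int n1, int n2) {
        if (n1 < 1 || n2 < 1) {
            throw new IllegalArgumentException("both sample sizes must be positive");
        }
        double product = (double) n1 * n2;          // double arithmetic avoids int overflow
        double mean = product / 2.0;                // expected U under the null hypothesis
        double sd = Math.sqrt(product * (n1 + n2 + 1) / 12.0);
        return (u - mean) / sd;
    }
}
```

Unlike a p-value, the Z-score keeps the direction of the shift: it is negative when U falls below its expected value and positive when it falls above.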
ebanks
45081c32d7
continuing from last night, the integration tests weren't covering the right behavior either
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5706 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 13:30:57 +00:00
ebanks
f34e6d5b8c
Somewhere along the way someone broke this tool and failed to update the documentation to boot. Fixing.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5705 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 03:16:20 +00:00
ebanks
ae8f3f2cde
Check for bad reference bases before creating simple/'empty' VCs. Updated the code in the indel GL model to be consistent and to use the existing utility in the Allele class.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5704 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 23:55:20 +00:00
depristo
6cce3e00f3
A test walker that does consensus compression of deep read data sets.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5702 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 22:00:48 +00:00
rpoplin
3907377f37
When genotyping given alleles, for multiallelic sites we go back to the reads and use the alternate base with the highest sum of quality scores instead of taking the first alternate allele from the vcf file
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5701 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 21:31:09 +00:00
droazen
6e9e766a71
The tighter interval validation wasn't interacting well with unmapped
intervals -- altered the validation methods to not throw an error for
unmapped intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5700 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 20:56:46 +00:00
hanna
6d5e45b5c6
Revbump Picard dependencies at Tim/Kathleen's request. Exclude anonymous
classes from PluginManager.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5699 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 20:38:05 +00:00
droazen
d650efd40a
Fix for bug GSA-449: Intervals that are not in GATK format are not validated
to the same standard as GATK format intervals. Full validation against contig
bounds is now performed for all intervals, regardless of their source. Also
fixed a few tests for validation exclusions that were backwards.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5698 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 18:12:10 +00:00
kshakir
df35a143b2
Removed -debug/--debug_mode.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5697 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 10:56:39 +00:00
hanna
27495a0c64
Killed quiet mode. Should probably kill debugMode as well, but Queue's using
it. Will check with Khalid tomorrow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5695 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 04:17:36 +00:00
hanna
f3dacd3c40
Use ByteBuffer.allocateDirect() instead of ByteBuffer.allocate().
...
ByteBuffer.allocateDirect() behaves like Java NIO MappedByteBuffers in that
it consumes address space, which counts against our virtual memory allocation;
but cannot be destroyed or otherwise freed. This was definitely contributing
to the LSF failures that I was seeing, but I'm not yet convinced that it's the
sole source of these virtual memory 'leaks'. More tomorrow as the results of
my whole exome tests start to roll in.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5693 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 02:01:11 +00:00
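The two allocation calls discussed in f3dacd3c40 are both part of the standard `java.nio.ByteBuffer` API: `allocate()` returns a buffer backed by the Java heap, while `allocateDirect()` returns a direct buffer whose memory lives outside the heap. The sketch below only shows how to tell the two apart; the class and method names are hypothetical and it does not reproduce the code changed in the commit.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; only the ByteBuffer calls are real API.
final class BufferAllocationSketch {

    /** Heap buffer: backed by a byte[] on the Java heap and reclaimed by ordinary GC. */
    static ByteBuffer heapBuffer(int capacity) {
        return ByteBuffer.allocate(capacity);
    }

    /** Direct buffer: memory is allocated outside the Java heap. */
    static ByteBuffer directBuffer(int capacity) {
        return ByteBuffer.allocateDirect(capacity);
    }

    public static void main(String[] args) {
        ByteBuffer heap = heapBuffer(64 * 1024);     // buffer size is arbitrary
        ByteBuffer direct = directBuffer(64 * 1024);
        System.out.println("heap.isDirect()   = " + heap.isDirect());
        System.out.println("direct.isDirect() = " + direct.isDirect());
    }
}
```

Direct buffers count against the process's address space rather than the `-Xmx` heap limit, which is why they can matter under a scheduler that enforces a virtual memory limit.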
chartl
7afeb1ab17
Removing broken imports (boo)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5692 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:55:25 +00:00
rpoplin
379f837e82
RankSum z-scores are looking quite good, so RIP Wilcoxon.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5691 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:34:39 +00:00
chartl
bc3fd70b0a
Removing the old association walker, switching test to just validate that MannWhitneyU is doing the right thing. Unit tests still pass.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5690 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 18:05:19 +00:00
kshakir
f619dd3ca7
Refactored IntervalUtils used to parse and scatter intervals for Queue.
...
Scattering non-contig interval lists by number of loci in the intervals instead of just number of intervals.
Queue caches the list of locs and how to split them up instead of reloading them from disk repeatedly.
TODO: general purpose function to divide data evenly.
Skip over comments when parsing picard analysis files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5687 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-26 00:06:00 +00:00
hanna
57a4700299
Ported small BAM performance test suite to the Google Caliper microbenchmarking suite. Looks promising,
but I'm still not sure that GC is a good long-term solution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5683 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 22:09:17 +00:00
chartl
a56a2dfdb7
Nothing to see here. Move along.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5681 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 15:01:02 +00:00
delangel
600617a63c
Enabled code to deal with hard-clipping adaptor sequence when processing reads in pileup in indel caller. Proven now that changes are minimal (4 fewer calls in NA12878 chr20, quals slightly different), minor changes in vcf fields in integration tests.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5679 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 14:10:33 +00:00
chartl
88735a8c9b
Adding in a delta to try and better measure effect size -- equivalent to looking at the lower end of the N^th percentile confidence interval. Kind of a hacky way to add it in, the infrastructure is about due for a streamlining rewrite.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5676 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 03:53:33 +00:00
hanna
7428ae338a
A fix for Marian Thieme's NPE in the new sharding system.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5675 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 19:47:14 +00:00
chartl
5b9a8555cd
Queue graph time is currently O(n^m), where n = num jobs and m = num unique base files. This script was therefore running in order 1200^16, which I don't think would finish before the heat death of the universe. For now, push down the number of files to 1 and gather them outside of Queue; once I've fixed up scatter-gather in core, outputs can be uncommented.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5674 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 12:56:25 +00:00
ebanks
cbcdfc584d
Moving out of core and into playground
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5671 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 02:30:22 +00:00
depristo
cc78027bd3
Two optimizations. First, an even more aggressive printProgress meter optimization that only considers doing work once every 1000 cycles through the engine. Second, GenomeLocParser now uses a single indirection around the contigInfo variable. This class uses a last-used cache to efficiently retrieve contig information instead of always returning to the underlying SAMSequenceDictionary hashmap to make genome locs.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5670 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 01:31:26 +00:00
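The last-used cache described in cc78027bd3 can be pictured as a one-entry memo in front of a map lookup. The sketch below is a minimal illustration under the assumption that consecutive queries usually hit the same contig; `LastUsedCacheSketch`, its fields, and the use of contig lengths as the cached value are all hypothetical, and this is not the GenomeLocParser implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical one-entry cache; not the GATK GenomeLocParser code.
final class LastUsedCacheSketch {
    private final Map<String, Integer> lengthsByContig;
    private String lastContig = null;   // key of the most recent lookup
    private Integer lastLength = null;  // value of the most recent lookup (null if unknown contig)
    private int mapLookups = 0;         // counts how often the underlying map was consulted

    LastUsedCacheSketch(Map<String, Integer> lengthsByContig) {
        this.lengthsByContig = new HashMap<>(lengthsByContig);
    }

    /** Returns the length for a non-null contig name, or null if the contig is unknown. */
    Integer contigLength(String contig) {
        if (contig.equals(lastContig)) {
            return lastLength;          // fast path: same contig as the previous call
        }
        mapLookups++;
        Integer length = lengthsByContig.get(contig);
        lastContig = contig;
        lastLength = length;
        return length;
    }

    int mapLookups() {
        return mapLookups;
    }
}
```

Because reads and loci are typically processed in coordinate order, long runs of queries share a contig, so most calls take the fast path and skip the hash lookup.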
depristo
29857f5ba6
Fix for instability in output of fasta alternative reference maker when snpmask and snp files are provided and have overlapping records. The order of the records changed due to optimization of the refmetadatatracker, and uncovered this non-determinism. Now preferentially includes sites from snps before considering masking out sites in snpmask
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5669 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 21:54:09 +00:00
kshakir
8619f49d20
Added a utility method to retrieve the contig lengths for WG chunking.
...
Added a rudimentary GATKReportParser for parsing VE3 results.
Re-enabled the FCPTest using VE3, the GATKRP, and the PicardAggregationUtils.
The tag type for .rod files is DBSNP, not ROD.
More explicit return types on implicit methods.
Added null checks for implicit string to/from file conversions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5668 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 19:22:21 +00:00
delangel
59dd79faab
One more optimization: don't use Math.round(), but do my own rounding/casting. UG now about 40% faster calling indels, 30-35% faster calling SNPs+indels simultaneously.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5667 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 19:15:58 +00:00
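A common way to replace `Math.round()` with plain arithmetic and a cast, as 59dd79faab describes doing, is to add 0.5 and truncate. The sketch below assumes non-negative inputs that fit in an `int`; it is an illustration of the general trick, not the code from the commit, and the names are hypothetical.

```java
// Hypothetical helper showing add-and-truncate rounding; not the committed GATK code.
final class FastRoundSketch {

    /**
     * Rounds a non-negative value to the nearest int by adding 0.5 and truncating.
     * The cast truncates toward zero, so results for negative inputs can differ
     * from Math.round(); callers are assumed to pass x >= 0.
     */
    static int roundNonNegative(double x) {
        return (int) (x + 0.5);
    }

    public static void main(String[] args) {
        // Compare against Math.round() on a few non-negative samples.
        double[] samples = {0.0, 2.4, 2.5, 7.9};
        for (double s : samples) {
            System.out.println(s + " -> " + roundNonNegative(s) + " (Math.round: " + Math.round(s) + ")");
        }
    }
}
```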
delangel
246d8190b5
Round one of "easy" zero-effort optimizations to UG's indel caller. Mostly inline functions, avoid repeated computation and try to optimize SoftMaxPair() which is by far the biggest runtime hog. More to come...
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5666 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 18:57:34 +00:00
depristo
a8f8077d7a
Simple optimizations for cases where there is no data or RODs at sites, such as with the FastaStats walker. private static immutable Lists and Maps in underlying data structures that have no associated data. Also, avoiding a double map.get() in the low-level genome loc parser. RefMetaDataTracker is now
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5664 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 10:52:16 +00:00
hanna
54660a8c25
Fix requested by Lee Lichtenstein: first check to see whether it's time for
a progress message, then aggregate metrics. Makes the overhead of
printProgress in RealignerTargetCreator go from >20% to ~3%.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5663 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 03:22:48 +00:00
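The reordering described in 54660a8c25 (check whether a progress message is due before aggregating metrics) follows a general pattern: do the cheap test first and pay for the expensive work only when its result will be used. The sketch below illustrates that pattern with a hypothetical `ProgressMeterSketch` class; the time source is passed in as a parameter for clarity, and nothing here is taken from the GATK engine.

```java
// Hypothetical progress meter; not the GATK printProgress implementation.
final class ProgressMeterSketch {
    private final long intervalMillis;  // minimum time between two messages
    private long lastPrintMillis;       // time of the last message (or start time)
    private long aggregations = 0;      // how often the expensive step actually ran

    ProgressMeterSketch(long intervalMillis, long startMillis) {
        this.intervalMillis = intervalMillis;
        this.lastPrintMillis = startMillis;
    }

    /** Returns true if a progress message was due (and the metrics were aggregated). */
    boolean maybePrint(long nowMillis) {
        // Cheap check first: bail out before touching any metrics.
        if (nowMillis - lastPrintMillis < intervalMillis) {
            return false;
        }
        // Expensive step, reached only when a message will really be printed.
        aggregations++;
        lastPrintMillis = nowMillis;
        return true;
    }

    long aggregationCount() {
        return aggregations;
    }
}
```

Called once per processed locus with `System.currentTimeMillis()`, this pays only for a subtraction and a comparison on the vast majority of calls.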
hanna
49550e257f
Fix for JamesP's issue. This issue appeared because of a design flaw in the
interface between SAMDataSource and IntervalSharder that needs to stay around
until the original BAM sharder is retired. Will add a JIRA to fix design
flaw.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5661 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-19 00:52:13 +00:00
depristo
541c9109b3
V1 of GATK Resource Bundling system
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5659 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 19:23:45 +00:00
ebanks
673772a522
Catch samtools exceptions and make them 'BAM Exceptions' asking the user to run Picard's validator and re-index the file before posting anything to the forum. Let's see whether this helps or not.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5658 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 03:52:43 +00:00
ebanks
e97a5ca161
Rename 'verbose' argument to 'debug_file'.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5657 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 03:17:13 +00:00
chartl
e28fc21642
Spurious associations can develop from including ambiguous reads in these tests. Perhaps MQ0 reads shouldn't be used for anything except MQ0, but the best way to do that is to restructure the code, so for now I'll put it off.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5656 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-17 23:17:03 +00:00