Commit Graph

5330 Commits (8ae42b70ac9100ef0d02cc75a0e231768fc76e02)

Author SHA1 Message Date
carneiro 8ae42b70ac give it an annotated VCF and a BAM and it creates a truth table on the validation of the VCF calls. This is just the first version, not ready for primetime.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5371 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 21:45:11 +00:00
carneiro 0daa65b9ef quick and dirty 'close your eyes' solution to run the papuans over the weekend. Will be properly fixed soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5370 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 21:42:22 +00:00
rpoplin f7ef35b8f5 Removing untrue comments in the GaussianMixtureModel
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5369 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 18:18:47 +00:00
chartl 9e12cd1312 Gotta include the changes i made to get an init function into the contexts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5368 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 15:45:51 +00:00
chartl 835a26d145 A pass at a sample-normalized test. I think maybe all of them will simply do their own normalizing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5367 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 15:42:25 +00:00
chartl ef38fd1e0e Major refactoring of association testing framework. New modules are now beyond trivial to implement. One hurdle remains which is how to deal with statistics that ought to be sample-normalized (e.g. depth, insert-size [when multiple libraries are used], and possibly others).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5366 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 14:27:16 +00:00
hanna 5d4bbf41fb Behave intelligently in the deepest levels of GATK record filtration when
we find a read flagged as 'mapped' in the unmapped region at the end of the 
file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5365 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 04:52:55 +00:00
hanna 7a22f19366 More descriptive error when VerifyingSamIterator hits an inconsistent alignment. Also updated
case UserException.MalformedBAM to match case of UserExceptio.MissortedBAM for consistency and
ease-of-use.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5364 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 03:55:24 +00:00
kshakir a0309e7fb0 Want to get this into Ryan's hands asap: First working version of a distributed scatter function.
More refactoring to do so that other new scatter functions can be implemented very easily and annotated on walkers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5363 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 01:40:08 +00:00
depristo 0181d95fe4 Intermediate optimization checkin. LinearExact model now about 10-20% faster than previous commit, by reorganizing and optimizing the if statements and genotype likelihood calculations. Next commit will include a banded implementation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5362 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 22:01:35 +00:00
ebanks f0f4bc3363 This was busted because it assumed 1 (and only 1) record at each position. However it's possible to have 0 (which generated a NullPointer) or 2+ records (which dropped records).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5361 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 21:35:50 +00:00
kshakir 1e68259f5c Stage one of refactoring GATK scatter functions. This intermediate stage should only be used by "rhymes with Shmoplin" until the next refactoring.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5360 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 18:10:19 +00:00
depristo c152ef4339 Better error message for unknown reference file extension.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5359 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 17:52:16 +00:00
hanna bef83b8b09 Bug fix: was tracking state across BAMs that should've been tracked per-BAM.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5358 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 17:32:06 +00:00
depristo bafa61c1fe LINEAR_EXACT now the default model. Passes all integration tests. 2-3x faster in low-pass data. Tests on exome data ongoing, but potentially vastly faster there.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5357 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 17:14:36 +00:00
rpoplin 8e1aa6059a New mode for CombineVariants to assume the incoming VCFs have the same samples and disjoint calls. Drastically reduces the runtime for routine combining operations. Very useful with Queue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5356 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 15:52:17 +00:00
hanna 5e4b321f86 Add hidden command-line argument for low-memory sharding.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5355 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 15:13:16 +00:00
ebanks ae42c0c7da Bug fix based on GATK run report
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5354 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 14:18:12 +00:00
ebanks 660998065b 'Okay, now I'm absolutely certain that there are no more bugs in the constrained writer.'
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5353 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 03:48:40 +00:00
hanna 880c607d79 Disable validation of linear index against original linear index process.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5352 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 01:51:26 +00:00
hanna dc62685a2f For Ryan: force creation of BAM index when no reads are present in the BAM
file.  Temporary fix until Picard changes the behavior of indexing.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5351 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 01:50:42 +00:00
asivache 570186fa42 Added (deep) clone() and merge() to the RunningAverage utility class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5350 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 00:35:23 +00:00
hanna 43567b7fe3 Load the linear index without forcing the index for the entire contig to be
loaded into memory.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5349 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 00:08:39 +00:00
ebanks a20ce1436d A temporary @hidden hack to get indel calling done for Phase I: don't try to call if there's too much coverage. Do not use this unless your last name rhymes with Shmoplin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5348 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 19:22:27 +00:00
carneiro 8ab6eee1cf gold standard creates its own tranches and vcf files now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5347 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 17:48:40 +00:00
hanna 3c7ae0d1a6 Special case handling of unmapped region in low memory sharder.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5346 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 17:38:30 +00:00
hanna dd30ad751a Fix bug in low memory sharder's interval accumulator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5345 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 17:11:22 +00:00
hanna d6145de970 More comprehensive tracking of position when bin trees are sparse.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5344 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 15:53:43 +00:00
ebanks bb969cd3a2 EMIT_ALL_SITES now does exactly that - even when there's no coverage or too many deletions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5343 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 05:05:00 +00:00
kshakir 4b08baabe7 Let Queue resubmit the job instead of LSF.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5342 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 03:19:50 +00:00
chartl 0723b0f44c Generalized association is now working. Output is in a horrific format. Implementation of T-testing. Improvements are to look for classes dynamically (a la VariantEval/VariantAnnotator), beautify output, and do optimizations where they exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5341 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 01:23:37 +00:00
rpoplin ce34a8a918 New hidden option in VQSR to not parse the genotypes of the incoming training data. Updated VQSR training in methods development pipeline to be more in line with best practices.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5340 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 23:19:51 +00:00
hanna e7089f9870 Fix for particularly small, isolated intervals: make sure the bounds of the
bin tree are dictated by the lowest bin level, whether it exists or not.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5339 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 22:35:53 +00:00
hanna c869d1c9cf Fix misc issues in new protosharder regarding proper iterator termination when
an unexpectedly small amount of data is present.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5338 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 21:14:18 +00:00
chartl abab23350f Fix for expanded intervals in the case where the input file has overlapping intervals in it (seriously?)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5337 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:56:33 +00:00
ebanks 5ac9af472c Adding performance test for case with very high coverage (> 600,000x) over an interval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5336 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:48:56 +00:00
carneiro c7a51f0de7 fixed 1kg pilot dindel calls vcf file and combined all vcfs into one master dindel file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5335 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:04:58 +00:00
hanna e75366f738 Fixed performance issue in protosharding code -- turns out that the index
optimizer was mutating the data stored in the indices.  Protosharding still
disabled by default.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5334 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 17:32:12 +00:00
ebanks 8de83725f9 Simple walker to randomly break VCF files into (potentially unequal) subsets. Useful for e.g. cutting hapmap into training and evaluation sets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5333 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 16:51:46 +00:00
delangel d059d89a9d Fixes and cleanups for indel eval module. Also outputs AT/CG ratio in dedicated column in IndelStatistics.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5332 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 12:07:50 +00:00
ebanks 05fac8583d Following up Mark's recent commit: hooking up the --maxPositionalMoveAllowed argument into the indel realigner and through to the SAM writer. We now ensure that no read is realigned more than N bases (200 by default, which is nowhere close to realistically possible). If anyone ever sees a warning message about this with the default value then please let me know because I need to see it for myself.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5331 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 04:40:54 +00:00
depristo 874406352c Accidentally commited the N2 comparing test as well...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5330 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 04:15:30 +00:00
depristo 1dedfdb11b Fixes for constrained movement Indel Realigner. Now sorts all of the reads in the interval before handing them to ConstrainedMateFixingSAMFileWriter to maintain correct contract between the two pieces of software
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5329 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 03:52:18 +00:00
depristo d216830b92 Experimental linear version of the exact model. In testing, but gives identical results to N2 gold standard version, and passes integration tests. Performance optimizations still ongoing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5328 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 03:48:11 +00:00
ebanks 54facb2c51 Small change for Mauricio so that the correct metrics get output when running in GENOTYPE_GIVEN_ALLELES mode.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5327 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-27 06:08:32 +00:00
depristo 146756de79 Class name to reflect actual file name. manySampleUGPerformance now operates on 1000 samples!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5326 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 23:36:04 +00:00
depristo b1b9c14c98 Minor utility improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5325 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 15:36:26 +00:00
depristo aaecc271e8 Misc changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5324 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 15:35:49 +00:00
chartl b089d35b21 Fix expand intervals to do the right thing:
- No more duplicate intervals
 - Truncation at intervals that already exist, e.g.

exists:      |--------|           |-------|
new:               |---------|
fixed:                 |-----|

note that weird instances like:

exists:           |-|        |-|                  |-|
new:           |---------------------|
fixed:                          |----|

e.g. you're truncated to the nearest interval on whatever side. In general many behaviors could happen in this instance, this is the one currently implemented.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5323 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 04:19:01 +00:00
carneiro fd5d1f9cfc minor cosmetic changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5322 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 21:56:35 +00:00