Eric Banks
64dad13e2d
Don't carry around an extra copy of the code for the Haplotype Caller
2011-12-09 11:09:40 -05:00
Eric Banks
442ceb6ad9
The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors.
2011-12-09 10:16:44 -05:00
Laurent Francioli
a79144f7db
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-09 15:57:24 +01:00
Laurent Francioli
72fbfba97d
Added UnitTests for getFamilies() and getChildrenWithParents()
2011-12-09 15:57:07 +01:00
Laurent Francioli
5a06170804
Corrected bug causing getChildrenWithParents() to not take the last family member into consideration.
2011-12-09 14:51:34 +01:00
Eric Banks
aa4a8c5303
No dynamic programming solution for assigning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes).
2011-12-09 02:25:06 -05:00
Eric Banks
2fe50c64da
Updating md5s
2011-12-09 00:47:01 -05:00
Eric Banks
8777288a9f
Don't throw a UserException if too many alt alleles are trying to be genotyped. Instead, I've added an argument that allows the user to set the max number of alt alleles to genotype and the UG warns and skips any sites with more than that number.
2011-12-09 00:00:20 -05:00
Eric Banks
3e7714629f
Scrapped the whole idea of an int/long as an index into the ACset: with lots of alternate alleles we run into overflow issues. Instead, simply use the ACcounts array as the hash key since it is unique for each AC conformation. To do this, it needed to be wrapped inside an object so hashcode() would work.
2011-12-08 23:50:54 -05:00
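The approach described in 3e7714629f, using the allele-count array itself as the hash key, can be sketched roughly as below. Java arrays inherit identity-based `hashCode()` and `equals()` from `Object`, which is why a wrapper is needed. The class name `AlleleCountsKey` and its layout are hypothetical and are not taken from the GATK source.

```java
import java.util.Arrays;

// Hypothetical wrapper (illustrative name): makes an int[] usable as a HashMap key.
final class AlleleCountsKey {
    private final int[] counts;   // one entry per alternate allele

    AlleleCountsKey(int[] counts) {
        this.counts = counts.clone();   // defensive copy so the key cannot change after insertion
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof AlleleCountsKey
                && Arrays.equals(counts, ((AlleleCountsKey) other).counts);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(counts);   // content-based, unlike int[].hashCode()
    }
}
```

Two keys built from equal arrays then land in the same map bucket and compare equal, so no integer index has to be computed and nothing can overflow.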
Eric Banks
4aebe99445
Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation.
2011-12-08 15:31:02 -05:00
Eric Banks
7750bafb12
Fixed bug where last dependent set index wasn't properly being transferred for sites with many alleles. Adding debugging output.
2011-12-08 13:50:50 -05:00
Guillermo del Angel
252e0f3d0a
Merged bug fix from Stable into Unstable
2011-12-08 13:11:39 -05:00
Guillermo del Angel
1bfe28067f
Don't try to genotype an indel even bigger than the reference window size, or else we'll be out of bounds. Necessary to handle Phase 1 integrated callset with large deletions. Better error indication when validating a GenomeLoc.
2011-12-08 12:54:08 -05:00
Mark DePristo
9def841275
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 13:36:16 -05:00
Mark DePristo
4055877708
Prints 0.0 TiTv not NaN when there are no variants
...
-- Updated md5
2011-12-07 12:07:54 -05:00
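A minimal sketch of the behaviour described in 4055877708, assuming the ratio is computed from integer transition and transversion counts. The class and method names are hypothetical; the actual VariantEval code may differ.

```java
// Hypothetical helper (not the GATK implementation).
final class TiTvRatio {
    static double ratio(int nTi, int nTv) {
        if (nTi + nTv == 0) {
            return 0.0;               // no variants: report 0.0 rather than 0/0 = NaN
        }
        return (double) nTi / nTv;    // unrounded ratio
    }
}
```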
Matt Hanna
15533e08df
Fixed issue with RODWalker parallelization.
...
Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making
each ROD shard larger than the size of chr20. Why didn't we see this in Stable? Because the
ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification.
When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part
of the async I/O project, I inadvertently reenabled this specifier.
2011-12-07 11:55:42 -05:00
Mark DePristo
5d2212bc8e
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 09:03:17 -05:00
Mark DePristo
6bf18899df
Fix for variant summary -- now treats all 50 bp deletions or insertions as CNVs
2011-12-07 09:02:49 -05:00
Matt Hanna
c9b2cd8ba5
Fix for chartl's stale null representation issue.
2011-12-06 18:05:17 -05:00
Eric Banks
79d18dc078
Fixing indexing bug on the ACsets. Added unit tests for the Exact model code.
2011-12-06 16:17:18 -05:00
Matt Hanna
f5b977fc88
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-06 10:11:35 -05:00
Matt Hanna
4001c22a11
Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup.
2011-12-06 10:10:38 -05:00
Khalid Shakir
677bea0abd
Right aligning GATKReport numeric columns and updated MD5s in tests.
...
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
2011-12-05 23:22:15 -05:00
Eric Banks
7a0f6feda4
Make sure that too many alternate alleles aren't being passed to the genotyper (10 for now) and exit with a UserError if there are.
2011-12-05 16:18:52 -05:00
Eric Banks
7fac4afab3
Fixed priors (now initialized upon engine startup in a multi-dimensional array) and cell coefficients (properly handles the generalized closed form representation for multiple alleles).
2011-12-05 15:57:25 -05:00
Eric Banks
a7cb941417
The posteriors vector is now 2 dimensional so that it supports multiple alleles (although the UG is still hard-coded to use only array[0] for now); the exact model now collapses probabilities for all conformations over a given AC into the posteriors array (in the appropriate dimension). Fixed a bug where the priors and posteriors were being passed in swapped.
2011-12-04 13:02:53 -05:00
Eric Banks
eab2b76c9b
Added loads of comments for future reference
2011-12-03 23:54:42 -05:00
Eric Banks
29662be3d7
Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them).
2011-12-03 23:12:04 -05:00
Eric Banks
71f793b71b
First partially working version of the multi-allelic version of the Exact AF calculation
2011-12-02 14:13:14 -05:00
David Roazen
d014c7faf9
Queue now properly escapes all shell arguments in generated shell scripts
...
This has implications for both Qscript authors and CommandLineFunction authors.
Qscript authors:
You no longer need to (and in fact must not) manually escape String values to
avoid interpretation by the shell when setting up Walker parameters. Queue will
safely escape all of your Strings for you so that they'll be interpreted literally. Eg.,
Old way:
filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")
New way:
filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")
CommandLineFunction authors:
If you're writing a one-off CommandLineFunction in a Qscript and don't really
care about quoting issues, just keep doing things the direct, simple way:
def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)
If you're writing a CommandLineFunction that will become part of Queue and
will be used by other QScripts, however, it's advisable to do things the
newer, safer way, ie.:
When you construct your commandLine, you should do so ONLY using the API methods
required(), optional(), conditional(), and repeat(). These will manage quoting
and whitespace separation for you, so you shouldn't insert quotes/extraneous
whitespace in your Strings. By default you get both (quoting and whitespace
separation), but you can disable either of these via parameters. Eg.,
override def commandLine = super.commandLine +
required("eff") +
conditional(verbose, "-v") +
optional("-c", config) +
required("-i", "vcf") +
required("-o", "vcf") +
required(genomeVersion) +
required(inVcf) +
required(">", escape=false) + // This will be shell-interpreted
required(outVcf)
I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new
system, so you'll get free shell escaping when you use those in Qscripts just
like with walkers.
2011-12-01 18:13:44 -05:00
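For readers unfamiliar with why escaping matters here, the following is a sketch of one common way to make a string shell-literal: wrap it in single quotes and rewrite embedded single quotes. It is an assumption that Queue does something similar; the class and method names are hypothetical, and this is not Queue's implementation.

```java
// Hypothetical helper: POSIX-shell single-quote escaping.
final class ShellEscape {
    static String escape(String arg) {
        // Inside single quotes the shell interprets nothing, so the only character
        // needing care is the single quote itself: close the quote, emit an escaped
        // quote, and reopen the quote.
        return "'" + arg.replace("'", "'\\''") + "'";
    }
}
```

With this scheme a filter expression such as `QD<2.0` becomes `'QD<2.0'`, so the shell no longer treats `<` as a redirection and the script author does not have to add quotes by hand.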
Mark DePristo
3060a4a15e
Support for list of known CNVs in VariantEval
...
-- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument
-- Genericizes loading for intervals into interval tree by chromosome
-- GenomeLoc methods for reciprocal overlap detection, with unit tests
2011-11-30 17:05:16 -05:00
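Reciprocal overlap, as used in 3060a4a15e for CNV novelty, is commonly defined as: the shared region must cover at least a given fraction of each of the two intervals. The sketch below assumes 1-based closed coordinates on a single contig. The names and the threshold parameter are hypothetical and do not reflect the actual GenomeLoc API.

```java
// Hypothetical helper (illustrative only).
final class ReciprocalOverlap {
    /** Bases shared by closed intervals [s1,e1] and [s2,e2]; 0 if they are disjoint. */
    static int overlapSize(int s1, int e1, int s2, int e2) {
        return Math.max(0, Math.min(e1, e2) - Math.max(s1, s2) + 1);
    }

    /** True if the shared region covers at least minFraction of both intervals. */
    static boolean reciprocalOverlap(int s1, int e1, int s2, int e2, double minFraction) {
        int shared = overlapSize(s1, e1, s2, e2);
        double frac1 = shared / (double) (e1 - s1 + 1);
        double frac2 = shared / (double) (e2 - s2 + 1);
        return frac1 >= minFraction && frac2 >= minFraction;
    }
}
```

Requiring the fraction on both sides stops a small known CNV nested inside a much larger call from counting as a match.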
Matt Hanna
b65db6a854
First draft of a test script for I/O performance with the new asynchronous I/O processing.
...
Also includes convenience parameters for specifying the IO/CPU threading balance outside of a tag. Will be killed when
Queue gets better support for tagged arguments (hopefully soon).
2011-11-30 13:13:16 -05:00
Laurent Francioli
1d5d200790
Cleaned up unused import statements
2011-11-30 15:30:30 +01:00
Mark DePristo
28b286ad39
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-30 09:11:53 -05:00
Laurent Francioli
20bffe0430
Adapted for the new version of MendelianViolation
2011-11-30 14:46:38 +01:00
Laurent Francioli
1cb5e9e149
Removed outdated (and unused) -familyStr commandline argument
2011-11-30 14:45:04 +01:00
Laurent Francioli
9574be0394
Updated MendelianViolationEvaluator integration test
2011-11-30 14:44:15 +01:00
Laurent Francioli
f49dc5c067
Added functionality to get all children that have both parents (useful when trios are needed)
2011-11-30 14:43:37 +01:00
Laurent Francioli
a4606f9cfe
Merge branch 'MendelianViolation'
...
Conflicts:
public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java
2011-11-30 11:13:15 +01:00
Laurent Francioli
b279ae4ead
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-30 10:10:21 +01:00
Laurent Francioli
7d58db626e
Added MendelianViolationEvaluator integration test
2011-11-30 10:09:20 +01:00
Ryan Poplin
91413cf0d9
Merged bug fix from Stable into Unstable
2011-11-29 14:01:23 -05:00
Ryan Poplin
cb284eebde
Further updating VQSR tutorial wiki docs to reflect the bundle
2011-11-29 14:00:57 -05:00
Ryan Poplin
dcb889665d
Merged bug fix from Stable into Unstable
2011-11-29 09:58:49 -05:00
Ryan Poplin
447e9bff9e
Updating VQSR tutorial wiki docs to reflect the bundle
2011-11-29 09:57:45 -05:00
Ryan Poplin
110298322c
Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it.
2011-11-29 09:29:18 -05:00
Laurent Francioli
ab67011791
Corrected bug introduced in the last update and causing no families to be returned by getFamilies in case the samples were not specified
2011-11-29 11:18:15 +01:00
Eric Banks
d7d8b8e380
Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now.
2011-11-28 14:18:28 -05:00
Laurent Francioli
a09c01fcec
Removed walker argument FamilyStructure as this is now supported by the engine (ped file)
2011-11-28 17:18:11 +01:00
Laurent Francioli
795c99d693
Adapted MendelianViolation to the new ped family representation. Adapted all classes using MendelianViolation too.
...
A number of useful metrics on allele transmission and MVs were added to MendelianViolationEvaluator
2011-11-28 17:13:14 +01:00
Laurent Francioli
e877db8f42
Changed visibility of getSampleDB from protected to public as the sampleDB needs to be accessible from Annotators and Evaluators too.
2011-11-28 17:11:30 +01:00
Laurent Francioli
5c2595701c
Added a function to get families only for a given list of samples.
2011-11-28 17:10:33 +01:00
Mark DePristo
3c36428a20
Bug fix for TiTv calculation -- shouldn't be rounding
2011-11-28 10:20:34 -05:00
Eric Banks
436b4dc855
Updated docs
2011-11-28 08:59:48 -05:00
Laurent Francioli
b1dd632d5d
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
...
Conflicts:
public/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
2011-11-25 16:16:44 +01:00
Mark DePristo
e60272975a
Fix for changed MD5 in streaming VCF test
2011-11-23 19:01:33 -05:00
Mark DePristo
12f09d88f9
Removing references to SimpleMetricsByAC
2011-11-23 16:08:18 -05:00
Mark DePristo
e319079c32
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-23 13:02:11 -05:00
Mark DePristo
4107636144
VariantEval updates
...
-- Performance optimizations
-- Tables now are cleanly formatted (floats are %.2f printed)
-- VariantSummary is a standard report now
-- Removed CompEvalGenotypes (it didn't do anything)
-- Deleted unused classes in GenotypeConcordance
-- Updates integration tests as appropriate
2011-11-23 13:02:07 -05:00
David Roazen
e5b85f0a78
A toString() method for IntervalBindings
...
Necessary since we're currently writing things like this to our VCF headers:
intervals=[org.broadinstitute.sting.commandline.IntervalBinding@4ce66f56]
2011-11-23 11:56:12 -05:00
Mark DePristo
5a4856b82e
GATKReports now support a format field per column
...
-- You can tell the table to format your object with "%.2f" for example.
2011-11-23 11:31:04 -05:00
Mark DePristo
c8bf7d2099
Check for null comment
2011-11-23 10:47:21 -05:00
Mark DePristo
6c2555885c
Caching getSimpleName() in VariantEval is a big performance improvement
...
-- Removed the SimpleMetricsByAC table, as one should just use the AlleleCount Stratification and the upcoming VariantSummary table
2011-11-23 08:34:05 -05:00
Guillermo del Angel
32adbd614f
Solve merge conflict
2011-11-22 22:48:46 -05:00
Guillermo del Angel
941f3784dc
Solve merge conflict
2011-11-22 22:48:03 -05:00
Guillermo del Angel
75d93e6335
Another corner condition fix: skip likelihood computation in case we cut so many bases there's no haplotype or read left
2011-11-22 22:46:12 -05:00
Mark DePristo
a3aef8fa53
Final performance optimization for GenotypesContext
2011-11-22 17:19:30 -05:00
Mark DePristo
990c02e4de
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-22 17:19:11 -05:00
Guillermo del Angel
38a90da92c
Fixed merge conflict to Unstable
2011-11-22 14:39:45 -05:00
Guillermo del Angel
32a77a8a56
Prevent out of bound error in case read span > reference context + indel length. Can happen in RNAseq reads with long N CIGAR operators in the middle.
2011-11-22 13:57:24 -05:00
Eric Banks
5821c11fad
For BAM and Reviewed errors we now check the error message to see if it's actually a 'too many open files' problem and, if so, we generate a User Error instead.
2011-11-22 10:50:22 -05:00
Mark DePristo
7087310373
Embarrassing bug fixed
2011-11-22 10:16:36 -05:00
Mark DePristo
e484625594
GenotypesContext now updates cached data for add, set, replace operations when possible
...
-- Involved separately managing the sample -> offset and sample sorted list operations. This should improve performance throughout the system
2011-11-22 08:40:48 -05:00
Mark DePristo
29ca24694a
UG now encoding NO_CALLs as ./. not ./.:.:4:0,0,0
...
A few updated UGs integration tests
2011-11-22 08:22:32 -05:00
Mark DePristo
2b51c01df4
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-21 19:16:06 -05:00
Mark DePristo
5443d3634a
Again, fixing the add call when we really mean replace
...
-- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./. Eric will fix long-standing bug in QD observed from this change
-- VFW MD5s restored to their old correct values. There was a bug in my implementation that caused the genotypes to not be parsed from the lazy output even though the header was incorrect.
2011-11-21 19:15:56 -05:00
Mauricio Carneiro
5ad3dfcd62
BugFix: byte overflow in SyntheticRead compressed base counts
...
* fixed and added unit test
2011-11-21 17:11:50 -05:00
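The class of bug named in 5ad3dfcd62 can be illustrated as follows. Java's `byte` is signed with a maximum of 127, so a narrowing cast of a larger count wraps to a negative value. Saturating at `Byte.MAX_VALUE` is one possible remedy; whether the actual fix does this is an assumption, and the names are hypothetical.

```java
// Hypothetical helper (not the SyntheticRead code).
final class ByteCounts {
    /** Narrow a count to a byte, saturating instead of wrapping around. */
    static byte toCount(int n) {
        return (byte) Math.min(n, Byte.MAX_VALUE);
    }
}
```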
Mark DePristo
9ea7b70a02
Added decode method to LazyGenotypesContext
...
-- AbstractVCFCodec calls this if the samples are not sorted. Previously called getGenotypes() which didn't actually trigger the decode
2011-11-21 16:21:23 -05:00
Mark DePristo
ab2efe3bd3
Reverting bad exact model changes
2011-11-21 16:14:40 -05:00
Eric Banks
44554b2bfd
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-21 15:01:45 -05:00
Eric Banks
022832bd74
A very bad use of the == operator with Strings was making GenomeLoc validation very inefficient. This fix resulted in a significant speedup for a simple RodWalker.
2011-11-21 14:49:47 -05:00
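The pitfall behind 022832bd74 is general Java behaviour: `==` on Strings compares object identity, so two equal contig names held in different String objects do not match. It is an assumption that the slow path was a fallback taken whenever such a check failed. The names below are hypothetical.

```java
// Hypothetical illustration of identity versus value comparison.
final class ContigNames {
    static boolean sameByIdentity(String a, String b) {
        return a == b;          // true only if both refer to the same object
    }

    static boolean sameByValue(String a, String b) {
        return a.equals(b);     // true if the characters match
    }
}
```

A name parsed from a file is typically a fresh String object, so an identity check against a cached name fails even when the text is identical.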
Mark DePristo
1561af22af
Exact model code cleanup
...
-- Fixed up code when fixing a bug detected by aggressive contracts in GenotypesContext.
2011-11-21 14:35:15 -05:00
Mark DePristo
2c501364b8
GenotypesContext no longer have immutability in constructor
...
-- additional bug fixes throughout VariantContext and GenotypesContext objects
2011-11-21 14:34:31 -05:00
David Roazen
1296dd41be
Removing the legacy -L "interval1;interval2" syntax
...
This syntax predates the ability to have multiple -L arguments, is
inconsistent with the syntax of all other GATK arguments, requires
quoting to avoid interpretation by the shell, and was causing
problems in Queue.
A UserException is now thrown if someone tries to use this syntax.
2011-11-21 13:18:53 -05:00
Mark DePristo
e467b8e1ae
More contracts on LazyGenotypesContext
2011-11-21 09:34:57 -05:00
Mark DePristo
2e9ecf639e
Generalized interface to LazyGenotypesContext
...
-- Now you provide a LazyParsing object
-- LazyGenotypesContext now knows nothing about the VCF parser itself. The parser holds all of the necessary data to parse the VCF genotypes when necessary, and the LGC only has a pointer to this object
-- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version
-- Deleted VCFParser interface, as it was no longer necessary
2011-11-21 09:30:40 -05:00
Mark DePristo
f0ac588d32
Extensive unit test for GenotypeContextUnitTest
...
-- Currently only tests base class. Adding subclass testing in a bit
2011-11-20 18:28:01 -05:00
Mark DePristo
bc44f6fd9e
Utility function Collection<Genotype> -> Collection<String>
2011-11-20 18:26:56 -05:00
Mark DePristo
9445326c6c
Genotype is Comparable via sampleName
2011-11-20 18:26:27 -05:00
Mark DePristo
f9e25081ab
Completely documented LazyGenotypesContext
2011-11-20 08:35:52 -05:00
Mark DePristo
9cb3fe3a59
Vastly better way of doing on-demand genotyping loading
...
-- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading.
-- This new class replaced all of the old, complex functionality
-- Better still, there were many cases where the genotypes were being loaded unnecessarily, resulting in inefficiency. This was detected because some of the integration tests changed as the genotypes were no longer being parsed unnecessarily
-- Misc. bug fixes throughout the system
-- Bug fixes for PhaseByTransmission with new GenotypesContext
2011-11-20 08:23:09 -05:00
Mark DePristo
f392d330c3
Proper use of builder. Previous conversion attempt was flawed
2011-11-19 22:09:56 -05:00
Mark DePristo
7d09c0064b
Bug fixes and code cleanup throughout
...
-- chromosomeCounts now takes builder as well, cleaning up a lot of code throughout the codebase.
2011-11-19 18:40:15 -05:00
Mark DePristo
707bd30b3f
Should have been @BeforeMethod
2011-11-19 16:10:09 -05:00
Mark DePristo
8f7eebbaaf
Bugfix for pError not being checked correctly in CommonInfo
...
-- UnitTests to ensure correct behavior
-- UnitTests to ensure correct behavior for pass filters vs. failed filters vs. unfiltered
2011-11-19 15:58:59 -05:00
Mark DePristo
b7b57ef39a
Updating MD5 to reflect canonical ordering of calculation
...
-- We should no longer have md5s changing because of hashmaps changing their sort order on us
-- Added GenotypeLikelihoodsUnitTests
-- Refactored ExactAFCalculation to put the PL -> QUAL calculation in the GenotypeLikelihoods class to avoid the code copy.
2011-11-19 15:57:33 -05:00
Mark DePristo
73119c8e3c
Merge with master
...
-- A few bug fixes
2011-11-19 09:56:06 -05:00
Mark DePristo
f685fff79b
Killing the final versions of old new VariantContext interface
2011-11-18 21:32:43 -05:00
Mark DePristo
6cf315e17b
Change interface to getNegLog10PError to getLog10PError
2011-11-18 21:07:30 -05:00
Mark DePristo
c7f2d5c7c7
Final minor fix to contract
2011-11-18 19:40:05 -05:00
Mauricio Carneiro
b5de182014
isEmpty now checks if mReadBases is null
...
Since newly created reads have mReadBases == null. This is an effort to centralize the place to check for empty GATKSAMRecords.
2011-11-18 18:34:05 -05:00
Mauricio Carneiro
8ab3ee9c65
Merge remote-tracking branch 'unstable/master' into rr
2011-11-18 16:50:25 -05:00
Mauricio Carneiro
333e5de812
returning read instead of GATKSAMRecord
...
Do not create new GATKSAMRecord when read has been fully clipped, because it is essentially the same as returning the currently fully clipped read.
2011-11-18 16:49:59 -05:00
Matt Hanna
8bb4d4dca3
First pass of the asynchronous block loader.
...
Block loads are only triggered on queue empty at this point. Disabled by
default (enable with nt:io=?).
2011-11-18 15:02:59 -05:00
Mark DePristo
a2e79fbe8a
Fixes to contracts
2011-11-18 14:18:53 -05:00
Mark DePristo
660d6009a2
Documentation and contracts for GenotypesContext and VariantContextBuilder
2011-11-18 13:59:30 -05:00
Mark DePristo
f54afc19b4
VariantContextBuilder
...
-- New approach to making VariantContexts modeled on StringBuilder
-- No more modify routines -- use VariantContextBuilder
-- Renamed isPolymorphic to isPolymorphicInSamples. Same for mono
-- getChromosomeCount -> getCalledChrCount
-- Walkers changed to use new VariantContext. Some deprecated new VariantContext calls remain
-- VCFCodec now uses optimized cached information to create GenotypesContext.
2011-11-18 12:39:10 -05:00
Eric Banks
6459784351
Merged bug fix from Stable into Unstable
2011-11-18 12:34:57 -05:00
Eric Banks
c62082ba1b
Making this class public again as per request from Cancer folks
2011-11-18 12:34:27 -05:00
Eric Banks
8710673a97
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-18 12:29:33 -05:00
Eric Banks
768b27322b
I figured out why we were getting tons of hom var genotype calls with Mauricio's low quality (synthetic) reduced reads: the RR implementation in the UG was not capping the base quality by the mapping quality, so all the low quality reads were used to generate GLs. Fixed.
2011-11-18 12:29:15 -05:00
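Capping a base quality by the read's mapping quality, as described in 768b27322b, amounts to taking the minimum of the two. The sketch below uses hypothetical names and is not the UnifiedGenotyper code.

```java
// Hypothetical helper (illustrative only).
final class QualityCap {
    /** A base cannot be trusted more than the alignment of the read that carries it. */
    static byte cappedBaseQuality(byte baseQual, int mappingQual) {
        return (byte) Math.min(baseQual, mappingQual);
    }
}
```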
Mark DePristo
7490dbb6eb
First version of VariantContextBuilder
2011-11-18 11:06:15 -05:00
Roger Zurawicki
f48d4cfa79
Bug fix: fully clipping GATKSAMRecords and flushing ops
...
Reads that are emptied after clipping become new GATKSAMRecords.
When applying ClippingOps, the ops are cleared after the clipping
2011-11-18 00:24:39 -05:00
Mark DePristo
fa454c88bb
UnitTests for VariantContext for chrCount, getSampleNames, Order function
...
-- Major change to how chromosomeCounts is computed. Now NO_CALL alleles are always excluded. So ChromosomeCounts(A/.) is 1, the previous result would have been 2.
-- Naming changes for getSamplesNameInOrder()
2011-11-17 20:37:22 -05:00
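The new counting rule in fa454c88bb (NO_CALL alleles are excluded, so `A/.` counts as 1) can be sketched on VCF-style genotype strings. Working on strings is a simplification chosen for this example; the real code operates on Genotype and Allele objects, and the names here are hypothetical.

```java
// Hypothetical helper (illustrative only).
final class CalledAlleleCount {
    /** Count alleles in a genotype such as "A/." or "A|T", skipping no-calls ("."). */
    static int count(String genotype) {
        int n = 0;
        for (String allele : genotype.split("[/|]")) {
            if (!allele.equals(".")) {
                n++;
            }
        }
        return n;
    }
}
```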
Mark DePristo
02f22cc9f8
No more VC integration tests. All tests are now unit tests
2011-11-17 15:33:09 -05:00
Mark DePristo
23359d1c6c
Bugfix for pruneVariantContext, which was dropping the ref base for padding
2011-11-17 15:32:52 -05:00
Mark DePristo
473b860312
Major determinism fix for UG and RankSumTest
...
-- Now these routines all iterate in sample name order (genotypes.iterateInSampleNameOrder) so that the results of UG and the annotator do not depend on the particular order of samples we see for the exact model and the RankSumTest
2011-11-17 15:31:45 -05:00
Khalid Shakir
c50274e02e
Merge overlapping flanks during flanking interval creation so that, on scatter, the list doesn't accidentally genotype the same site twice.
...
Moved flanking interval utilities to IntervalUtils with UnitTests.
2011-11-17 13:56:42 -05:00
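Merging overlapping flanks, as in c50274e02e, is an instance of standard interval merging: sort by start, then extend the previous interval whenever the next one begins inside it. The sketch assumes closed intervals on a single contig stored as `{start, stop}` pairs. The names are hypothetical and this is not the IntervalUtils implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (illustrative only).
final class FlankMerger {
    static List<int[]> merge(List<int[]> intervals) {
        List<int[]> sorted = new ArrayList<>(intervals);
        sorted.sort(Comparator.comparingInt((int[] iv) -> iv[0]));   // order by start

        List<int[]> merged = new ArrayList<>();
        for (int[] iv : sorted) {
            if (!merged.isEmpty() && iv[0] <= merged.get(merged.size() - 1)[1]) {
                // Starts inside the previous interval: extend it instead of adding a new one.
                int[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], iv[1]);
            } else {
                merged.add(new int[]{iv[0], iv[1]});   // copy so the input is not modified
            }
        }
        return merged;
    }
}
```

After merging, every position belongs to at most one interval, so scattering the list cannot hand the same site to two jobs.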
Eric Banks
bad19779b9
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-17 13:29:43 -05:00
Eric Banks
16a021992b
Updated header description for the INFO and FORMAT DP fields to be more accurate.
2011-11-17 13:17:53 -05:00
Eric Banks
e7d41d8d33
Minor cleanup
2011-11-17 12:00:28 -05:00
Mark DePristo
7e66677769
Expanded UnitTests for VariantContext
...
Tests for
-- getGenotype and getGenotypes
-- subContextBySample
-- modify routines
2011-11-16 20:45:15 -05:00
Mauricio Carneiro
72f00e2883
Merging Roger's Unit tests for Reduce Reads from RR repository
2011-11-16 17:26:49 -05:00
Mark DePristo
aa0610ea92
GenotypeCollection renamed to GenotypesContext
2011-11-16 16:24:05 -05:00
Mark DePristo
974daaca4d
V13 version in archive. Can be pulled out wholesale for performance testing
2011-11-16 16:08:46 -05:00
Mark DePristo
caf6080402
Better algorithm for merging genotypes in CombineVariants
2011-11-16 15:17:33 -05:00
Mark DePristo
101ffc4dfd
Expanded, contrastive VariantContextBenchmark
...
-- Compares performance across a bunch of common operations with GATK 1.3 version of VariantContext and GATK 1.4
-- 1.3 VC and associated utilities copied wholesale into test directory under v13
2011-11-16 13:35:16 -05:00
Mark DePristo
e56d52006a
Continuing bugfixes to get new VC working
2011-11-16 10:39:17 -05:00
Matt Hanna
eb8e031f75
Merged bug fix from Stable into Unstable
2011-11-16 09:57:37 -05:00
Matt Hanna
6a5d5e7ac9
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/stable
2011-11-16 09:57:13 -05:00
Matt Hanna
7ac5cf8430
Getting rid of unsupported CountReadPairs walker in stable. Removal of
...
remainder of pairs processing framework to follow in unstable.
2011-11-16 09:53:59 -05:00
Eric Banks
c2ebe58712
Merge remote-tracking branch 'Laurent/master'
2011-11-16 09:34:47 -05:00
Laurent Francioli
0dc3d20d58
Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type
2011-11-16 09:33:13 +01:00
Laurent Francioli
7d77fc51f5
Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type
2011-11-16 03:32:43 -05:00
David Roazen
0d163e3f52
SnpEff 2.0.4 support
...
-Modified the SnpEff parser to work with the SnpEff 2.0.4 VCF output format
-Assigning functional classes and effect impacts now handled directly
by SnpEff rather than the GATK
-Removed support for SnpEff 2.0.2, as we no longer trust the output of that
version since it doesn't exclude effects associated with certain nonsensical
transcripts. These effects are excluded as of 2.0.4.
-Updated unit and integration tests
This support is based on a *release-candidate* of SnpEff 2.0.4, and so is subject
to change between now and the next GATK release.
2011-11-15 18:36:22 -05:00
Mark DePristo
df415da4ab
More bug fixes on the way to passing all tests
2011-11-15 17:38:12 -05:00
Mark DePristo
0be23aae4e
Bugfixes on way to a working refactored VariantContext
2011-11-15 17:20:14 -05:00
Mark DePristo
231c47c039
Bugfixes on way to a working refactored VariantContext
2011-11-15 16:42:50 -05:00
Laurent Francioli
fb685f88ec
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-15 16:23:53 -05:00
Mark DePristo
2b2514dad2
Moved many unused phasing walkers and utilities to archive
2011-11-15 16:14:50 -05:00
Mark DePristo
460a51f473
ID field now stored in the VariantContext itself, not the attributes
2011-11-15 14:56:33 -05:00
Eric Banks
7fada320a9
The right fix for this test is just to delete it.
2011-11-15 14:53:27 -05:00
Eric Banks
b45d10e6f1
The DP in the FORMAT field (per sample) must also use the representative count or else it's always 1 for reduced reads.
2011-11-15 10:23:59 -05:00
Mark DePristo
233e581828
Merging in Master
2011-11-15 09:28:24 -05:00
Eric Banks
b66556f4a0
Update error message so that it's clear ReadPair Walkers are exceptions
2011-11-15 09:22:57 -05:00
Mark DePristo
6e1a86bc3e
Bug fixes to VariantContext and GenotypeCollection
2011-11-15 09:21:30 -05:00
Roger Zurawicki
284430d61d
Added more basic UnitTests for ReadClipper
...
hardClipByReadCoordinatesWorks
hardClipLowQualTailsWorks
2011-11-15 00:13:52 -05:00
Roger Zurawicki
8e91e19229
Merge branch 'master' of ssh://nickel/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-15 00:13:37 -05:00
Mauricio Carneiro
cde829899d
compress Reduce Read counts bytes by offset
...
Compressing the representation of the reduce reads counts by offset results in 17% average compression in the final BAM file size.
Example compression -->
from: 10, 10, 11, 11, 12, 12, 12, 11, 10
to: 10, 0, 1, 1, 2, 2, 2, 1, 0
2011-11-14 18:30:24 -05:00
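Read off the example in cde829899d, the encoding appears to keep the first count and store every later count as its difference from that first value. The sketch below implements that reading with `int[]` for clarity, whereas the commit stores bytes. The names are hypothetical.

```java
// Hypothetical helper (illustrative only).
final class OffsetCounts {
    /** Keep the first count; store the rest as offsets from it. */
    static int[] encode(int[] counts) {
        int[] out = new int[counts.length];
        if (counts.length == 0) {
            return out;
        }
        out[0] = counts[0];
        for (int i = 1; i < counts.length; i++) {
            out[i] = counts[i] - counts[0];
        }
        return out;
    }

    /** Inverse of encode: add the first value back to every offset. */
    static int[] decode(int[] encoded) {
        int[] out = new int[encoded.length];
        if (encoded.length == 0) {
            return out;
        }
        out[0] = encoded[0];
        for (int i = 1; i < encoded.length; i++) {
            out[i] = encoded[i] + encoded[0];
        }
        return out;
    }
}
```

Neighbouring counts in a reduced read tend to be close to one another, so the offsets are small and repetitive, which presumably is what lets the BAM's own compression do better.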
Mark DePristo
4ff8225d78
GenotypeMap -> GenotypeCollection part 3
...
-- Test code actually builds
2011-11-14 17:51:41 -05:00
Mark DePristo
f0234ab67f
GenotypeMap -> GenotypeCollection part 2
...
-- Code actually builds
2011-11-14 17:42:55 -05:00
David Roazen
ab0ee9b847
Perform only necessary validation in VariantContext modify methods
2011-11-14 16:49:59 -05:00
Mark DePristo
2e9d5363e7
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-14 15:32:06 -05:00
Mark DePristo
1fbdcb4f43
GenotypeMap -> GenotypeCollection
2011-11-14 15:32:03 -05:00
Eric Banks
4dc9dbe890
One quick fix to previous commit
2011-11-14 14:42:12 -05:00
Eric Banks
7b2a7cfbe7
Transfer headers from the resource VCF when possible when using expressions. While there, VA was modified so that it didn't assume that the ID field was present in the VC's info map in preparation for Mark's upcoming changes.
2011-11-14 14:31:27 -05:00
Mark DePristo
9b5c79b49d
Renamed InferredGeneticContext to CommonInfo
...
-- I have no idea why I named this InferredGeneticContext, a totally meaningless term
-- Renamed to CommonInfo.
-- Made package protected, as no one should use this outside of VariantContext and Genotype
-- UGEngine was using IGC constant, but it's now using the public one in VariantContext.
2011-11-14 14:28:52 -05:00
Mark DePristo
077397cb4b
Deleted MutableVariantContext
...
-- All methods that used this capability now use VariantContext directly instead
2011-11-14 14:19:06 -05:00
Mark DePristo
b11c535527
Deleted MutableGenotype
...
-- This class wasn't really used anywhere, and so removed to control code bloat.
2011-11-14 13:16:36 -05:00
Mark DePristo
79987d685c
GenotypeMap contains a Map, not extends it
...
-- On path to replacing it with GenotypeCollection
2011-11-14 12:55:03 -05:00
Eric Banks
7aee80cd3b
Fix to deal with reduced reads containing a deletion
2011-11-14 12:23:46 -05:00
Eric Banks
3d2970453b
Misc minor cleanup
2011-11-14 09:41:54 -05:00
Laurent Francioli
1347beef40
Merge branch 'PhaseByTransmission'
2011-11-14 11:31:28 +01:00
Laurent Francioli
6881d4800c
Added Integration tests for Phasing by Transmission
2011-11-14 10:47:51 +01:00
Laurent Francioli
34acf8b978
Added Unit tests for new methods in GenotypeLikelihoods
2011-11-14 10:47:02 +01:00
Roger Zurawicki
1202a809cb
Added Basic Unit Tests for ReadClipper
...
Tests some but not all functions
Some tests have been disabled because they are not working
2011-11-13 22:27:49 -05:00
Eric Banks
b7c33116af
Minor docs update
2011-11-12 23:21:07 -05:00
Eric Banks
76d357be40
Updating docs example to use -L since that's best practice
2011-11-12 23:20:05 -05:00
Mark DePristo
fee9b367e4
VariantContext genotypes are now stored as GenotypeMap objects
...
-- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples
-- All instances in the gatk that used Map<String, Genotype> now use GenotypeMap type.
-- Amazingly, there were many places where HashMap<String, Genotype> was used, so the order of the genotypes was technically undefined and could be dangerous. Now everything uses GenotypeMap with a specific ordering of samples (by name)
-- Integration tests updated and all pass
2011-11-11 15:00:35 -05:00
Guillermo del Angel
cd3146f4cf
Add hidden option to ValidationAmplicons to output slightly modified format to make file work with downstream SQNM tools more seamlessly at request of GAP: one line per record, keep probe identifier to 20 characters, no * in ref allele.
2011-11-11 14:07:07 -05:00
Ryan Poplin
40fbeafa37
VQSR will now detect if the negative model failed to converge properly because of having too few data points and automatically retry with more appropriate clustering parameters.
2011-11-11 11:52:30 -05:00
Mark DePristo
4938569b3a
More general handling of parameters for VariantContextBenchmark
2011-11-11 10:22:19 -05:00
Mark DePristo
ef9f8b5d46
Added subContextOfSamples to VariantContext
...
-- This is a more convenient accessor than subContextOfGenotypes, represents nearly all of the use cases of the former function, and potentially can be implemented more efficiently.
2011-11-11 10:07:11 -05:00
Mark DePristo
e216e85465
First working version of VariantContextBenchmark
2011-11-11 09:56:00 -05:00
Mark DePristo
ee40791776
Attributes are now Map<String,Object> not Map<String,?>
...
-- Allows us to avoid an unnecessary copy when creating InferredGeneticContext (whose name really needs to change).
2011-11-11 09:55:42 -05:00
Mark DePristo
dc9b351b5e
Meaningful error message when an IntervalArg file fails to parse correctly
2011-11-10 17:10:26 -05:00
Mark DePristo
bb7bf74aa8
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-10 16:05:43 -05:00
Mark DePristo
153e52ffed
VariantEvalIntegrationTest for IntervalStratification
2011-11-10 14:10:39 -05:00
Mauricio Carneiro
060c7ce8ae
It wouldn't harm integration tests if we had our logic right... :-)
2011-11-10 14:03:22 -05:00
Eric Banks
39678b6a20
Check for reads with missing read groups and throw a UserException when encountered. Mauricio said this wouldn't break integration tests.
2011-11-10 13:34:45 -05:00
Mark DePristo
dd1810140f
-stratIntervals is optional
2011-11-10 13:27:32 -05:00
Mark DePristo
67b022c34b
Cleanup for new SampleUtils function
...
-- getVCFHeadersFromRods(rods) is now available so that you don't have getVCFHeadersFromRods(rods, null) throughout the codebase
2011-11-10 13:27:13 -05:00
Mark DePristo
35fe9c8a06
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-10 11:11:33 -05:00
Mark DePristo
dc4932f93d
VariantEval module to stratify the variants by whether they overlap an interval set
...
The primary use of this stratification is to provide a mechanism to divide assessment of a call set up by whether a variant overlaps an interval or not. I use this to differentiate between variants occurring in CCDS exons vs. those in non-coding regions, in the 1000G call set, using a command line that looks like:
-T VariantEval -R human_g1k_v37.fasta -eval 1000G.vcf -stratIntervals:BED ccds.bed -ST IntervalStratification
Note that the overlap algorithm properly handles symbolic alleles with an INFO field END value. In order to safely use this module you should provide entire contigs' worth of variants, and let the interval strat decide overlap, as opposed to using -L which will not properly work with symbolic variants.
Minor improvements to create() interval in GenomeLocParser.
2011-11-10 10:58:40 -05:00
Mauricio Carneiro
0d8983feee
outputting the RG information
...
setReadGroup now sets the read group attribute for the GATKSAMRecord
2011-11-09 23:35:00 -05:00
Eric Banks
315ac68b0b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-09 22:37:36 -05:00
Eric Banks
6313aae2c4
Adding checks for hasBasePileup() before calling getBasePileup() as per GS thread
2011-11-09 22:37:26 -05:00
Ryan Poplin
74a18d3de8
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-09 22:29:40 -05:00
Ryan Poplin
24712c0221
Merged bug fix from Stable into Unstable
2011-11-09 22:28:27 -05:00
Ryan Poplin
8942406aa2
Use MathUtils to compare doubles instead of testing for equality
2011-11-09 22:05:21 -05:00
Ryan Poplin
348f2db7fd
Fix for HMM optimization. If the two penalty arrays match exactly the function should return the end of the array instead of 0.
2011-11-09 22:00:52 -05:00
Eric Banks
82bf09edf3
Mark Standard Annotations with an asterisk
2011-11-09 20:42:31 -05:00
Eric Banks
04b122be29
Fix for bug reported on GetSatisfaction
2011-11-09 20:33:36 -05:00
Mauricio Carneiro
d00b2c6599
Adding a synthetic read for filtered data
...
* Generalized the concept of a synthetic read to create both a running consensus and synthetic reads of filtered data.
* Synthetic reads can now have deletions (but not insertions)
* New reduced read tag for filtered data synthetic reads *(RF)*
* Sliding window header now keeps information of consensus and filtered data
* Synthetic reads are created simultaneously, new functionality is controlled internally by addToSyntheticReads
2011-11-09 20:16:22 -05:00
Eric Banks
21bf43f3bb
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-09 15:34:40 -05:00
Eric Banks
02d5e3025e
Added integration test for intervals from bed file
2011-11-09 15:34:19 -05:00
Christopher Hartl
85bffe1dca
Merged bug fix from Stable into Unstable
2011-11-09 15:29:14 -05:00
Christopher Hartl
d828eba7f4
Allow comments in a table-formatted file to precede the header line.
2011-11-09 15:27:38 -05:00
Eric Banks
8205efbb29
Merge branch 'master' into intervals
2011-11-09 15:27:15 -05:00
Eric Banks
d64f8a89a9
Instead of the SelfScopingFeatureCodec interface, pushed this functionality into Tribble itself. Now we can e.g. determine that a file can be parsed by the BedCodec on the fly.
2011-11-09 15:24:29 -05:00
Mauricio Carneiro
f080f64f99
Preserve RG information on new GATKSAMRecord from SAMRecord
2011-11-09 14:39:20 -05:00
Mauricio Carneiro
f9530e0768
Clean unnecessary attributes from the read
...
this gives on average 40% file size reduction.
2011-11-09 14:39:20 -05:00
Mauricio Carneiro
9427ada498
Fixing no cigar bug
...
empty GATKSAMRecords will have a null cigar. Treat them accordingly.
2011-11-09 14:39:20 -05:00
Mark DePristo
e639f0798e
mergeEvals allows you to treat -eval 1.vcf -eval 2.vcf as a single call set
...
-- A bit of code cleanup in VCFUtils
-- VariantEval table to create 1000G Phase I variant summary table
-- First version of 1000G Phase I summary table Qscript
2011-11-09 14:35:50 -05:00
Christopher Hartl
149b79eaad
Merged bug fix from Stable into Unstable
2011-11-09 11:26:30 -05:00
Christopher Hartl
11abb4f9d1
Better error message.
2011-11-09 11:25:28 -05:00
Christopher Hartl
d3a533b82e
Revert "a"
...
This reverts commit 1175f50ddbf389f5da74d27dc725596582ae15af.
2011-11-09 11:22:26 -05:00
Christopher Hartl
5eaf800281
a
2011-11-09 11:22:20 -05:00
Christopher Hartl
5451fbc2b2
Merged bug fix from Stable into Unstable
2011-11-09 11:06:15 -05:00
Christopher Hartl
091229e4db
MVLikelihoodRatio now checks if the family string is provided before attempting to instantiate. Also check that variant contexts have both genotypes and genotype likelihoods.
...
Table codec now yells at users for not providing a HEADER with the table - parsing tables without a header line was causing the first line of the file to be eaten.
Table feature now has a toString method.
These are minor bug fixes.
2011-11-09 11:03:29 -05:00
Mauricio Carneiro
e1b4c3968f
Fixing GATKSAMRecord bug
...
when constructing a GATKSAMRecord from scratch, we should set "mRestOfBinaryData" to null so the BAMRecord doesn't try to retrieve missing information from the non-existent bam file.
2011-11-08 16:50:36 -05:00
Ryan Poplin
e973ca2010
fixing merge conflict.
2011-11-08 14:55:05 -05:00
Ryan Poplin
b0e6afec48
Bug fix for HMM optimization. Need to also check the gap continuation penalty array for the index with the first discrepancy.
2011-11-08 14:51:25 -05:00
Laurent Francioli
571c724cfd
Added reporting of the number of genotypes updated.
2011-11-08 15:15:51 +01:00
Ryan Poplin
94dc447a70
Merged bug fix from Stable into Unstable
2011-11-07 15:26:35 -05:00
Ryan Poplin
0b181be61f
Bug fix in SelectVariants when using a discordance track but no sample specifications. Added integration test to test this.
2011-11-07 15:25:16 -05:00
Ryan Poplin
0534149708
Merged bug fix from Stable into Unstable
2011-11-07 14:07:08 -05:00
Ryan Poplin
2d1e385ca4
Adding note to VQSR docs about Rscript being needed in the environment PATH.
2011-11-07 14:04:13 -05:00
Eric Banks
759f4fe6b8
Moving unclaimed walker with bad integration test to archive
2011-11-07 13:16:38 -05:00
Eric Banks
c1986b6335
Add notes to the GATKdocs as to when a particular annotation can/cannot be calculated.
2011-11-07 11:06:19 -05:00
Eric Banks
724e3f3b0d
Merged bug fix from Stable into Unstable
2011-11-06 22:23:22 -05:00
Eric Banks
cdd40d1222
Removing contracts for the SimpleTimer
2011-11-06 22:22:49 -05:00
Ryan Poplin
5c565d28b9
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-06 10:26:19 -05:00
Eric Banks
3517489a22
Better --sample selection integration test for VE. The previous one would return true even if --sample was not working at all.
2011-11-06 01:07:49 -04:00
Eric Banks
1c4e429a1c
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-06 00:05:56 -04:00
Eric Banks
a12bc63e5c
Get rid of support for bams without sample information in the read groups. This hidden option wasn't being used anyways because it wasn't hooked up properly in the AlignmentContext.
2011-11-05 23:54:28 -04:00
Eric Banks
ad57bcd693
Adding integration test to cover using expressions with IDs (-E foo.ID)
2011-11-05 23:53:15 -04:00
Eric Banks
90a053ea93
Don't change the mapping quality of MQ=255 reads in IR
2011-11-05 22:40:45 -04:00
Ryan Poplin
611a395783
Now properly extending candidate haplotypes with bases from the reference context instead of filling with padding bases. Functionality in the private Haplotype class is no longer necessary so removing it. No need to have four different Haplotype classes in the GATK.
2011-11-05 12:18:56 -04:00
Mark DePristo
e99871f587
Bug fix for decode loc
...
-- decodeLoc() wasn't skipping input header lines, so the system blew up when there was an = line being split.
2011-11-04 13:20:54 -04:00
Mark DePristo
a340a1aeac
Bug fix. decodeLoc() should update lineNo so you get meaningful line no when indexing
...
due to malformed VCF files.
2011-11-04 11:44:24 -04:00
Mark DePristo
9f260c0dc1
Zero byte index bug fix for RandomlySplitVariants + cleanup
...
-- vcfWriter2 was never being closed in onTraversalDone(), so the on-the-fly index file was being created but never actually properly written to the file.
-- This bug is ultimately due to the inability of the GATK to allow multiple VCF output writers as @Output arguments, though
-- Removed the unnecessary local variable iFraction (= 1000 * the input fraction argument). Now the system just draws a random double and compares it to the input fraction directly. Is there some subtle reason I don't appreciate for this programming construct?
2011-11-04 09:45:20 -04:00
Mauricio Carneiro
e89ff063fc
GATKSAMRecord refactor
...
The GATK engine will now provide a GATKSAMRecord to all tools which incorporates the functionality used by the GATK to the bam file (ReadGroups, Reduced Reads, ...).
* No tools should create SAMRecord anymore, use GATKSAMRecord instead *
2011-11-03 15:43:26 -04:00
Laurent Francioli
385a6abec1
Fixed a bug that wrongly swapped the mother and father genotypes when the child genotype is missing.
2011-11-03 13:04:53 +01:00
Laurent Francioli
893787de53
Functions getAsMap and getNegLog10GQ now handle the missing genotype case.
2011-11-03 13:04:11 +01:00
Eric Banks
e8bceb1eaa
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-02 21:13:54 -04:00
Eric Banks
78a00d2ddc
Updating UG integration tests (needed updating only because the -mbq default is different from the old -mmq one).
2011-11-02 21:13:44 -04:00
Eric Banks
52b16bf739
Must check whether there's a normal vs. extended pileup before asking for it.
2011-11-02 20:45:24 -04:00
Eric Banks
e1edd6bd12
Removing the min mapping quality argument since it wasn't being used in the normal processing of the pileups in UG - only for indel pileups. Instead, we apply the min base quality to the reads in the pileup for indels and define it to be the min 'confidence' of the base. Docs are updated but I didn't rename the argument as I don't want people to complain.
2011-11-02 20:32:58 -04:00
Ryan Poplin
e94fcf537b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-02 16:29:19 -04:00
Ryan Poplin
4d35272916
Bug fixes with Mauricio to functions in ReadUtils used by reduced reads and the haplotype caller.
2011-11-02 16:29:10 -04:00
Mark DePristo
8a2929c1dd
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-02 16:21:00 -04:00
Laurent Francioli
19ad5b635a
- Calculation of parent/child pairs corrected
...
- Separated the reporting of single and double mendelian violations in trios
2011-11-02 18:35:31 +01:00
Eric Banks
967ff647b8
Reduced reads shouldn't contribute to Fisher Strand calculations
2011-11-02 13:07:20 -04:00
Eric Banks
cf0e699226
QualByDepth was inefficiently iterating over the pileup 2 times for some reason. Removed non-useful annotation classes.
2011-11-02 12:58:38 -04:00
Eric Banks
4501dce58d
Fixing merge conflict
2011-11-02 12:50:32 -04:00
Eric Banks
54331b44e9
New way of looking at the size of a pileup: there's a physical number of elements in the data structure and there's a representative depth of coverage (since a reduced read represents depth >= 1). The size() method has been removed because its meaning is ambiguous. Updated several annotations and the UG engine to make use of the representative depths.
2011-11-02 12:47:30 -04:00
Mark DePristo
392e0aeace
Moved unit tests into master IntervalUtilsUnitTest
2011-11-02 10:52:00 -04:00
Mark DePristo
c2b97030a4
IntervalUtils for completely balanced locus-based scatter/gather
...
-- scatterLocusIntervals master utility
-- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc
-- Util function for reversing a list (List<T> -> List<T>, unlike Collections version)
-- DoC is PartitionType.INTERVAL
-- Significant unit tests on new functionality (all passing)
-- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work
2011-11-02 10:49:40 -04:00
Laurent Francioli
119ca7d742
Fixed a bug in parent/child pairs reporting that caused a crash when the -mvf option was used and the mother was not provided
2011-11-02 08:22:33 +01:00
Laurent Francioli
b91a9c4711
- Fixed parent/child pairs handling (was crashing before)
...
- Added parent/child pair reporting
2011-11-02 08:04:01 +01:00
Mark DePristo
5fc613f972
Better default partition types for walkers
...
-- Added PartitionType.READ, and associated ReadScatterFunction. ReadScatterFunction is literally just ContigScatterFunction until someone wants to implement something better
-- LocusWalkers (and subclasses RodWalkers and RefWalkers) are by default PartitionType.LOCUS.
2011-11-01 19:47:10 -04:00
Mauricio Carneiro
36600fd8e9
added MQ of low MQ/BQ to consensus RMS
...
Bases that were excluded for MQ and BQ filters are now contributing to the MQ RMS (but not to consensus base counts and variant/not variant region triggers).
2011-11-01 17:46:12 -04:00
Mauricio Carneiro
b004489c6d
Moving ReduceRead TAG to GATKSAMRecord
...
ReduceReads are now a feature of a GATKSAMRecord, so the tag and the special methods needed to use it will now be housed by the GATKSAMRecord.
2011-11-01 17:12:09 -04:00
Mauricio Carneiro
17cc484dbd
Revert "ReduceReads ref bases are now output as '='
...
Reducing the reference bases to '=' results in an extra compression of 13% on average. The GATK is not ready to handle files with '=' bases, and the decision was to implement this as engine support, not as a part of ReduceReads.
2011-11-01 16:35:07 -04:00
Eric Banks
0839c75c8d
More minor fixes to docs
2011-10-31 21:49:27 -04:00
Eric Banks
74b018a1f3
Minor fixes to docs
2011-10-31 21:41:43 -04:00
Eric Banks
31ee5432c5
Merged bug fix from Stable into Unstable
2011-10-31 14:56:59 -04:00
David Roazen
cdde32acbd
Merged bug fix from Stable into Unstable
2011-10-31 14:21:15 -04:00
Eric Banks
f62af0291b
Check for invalid VCF records (not enough tokens) instead of assuming they are there.
2011-10-31 14:09:51 -04:00
Andrey Sivachenko
bed0acaed4
nWayOut now adds PG tag to the header as it should. Also, additional hidden option added: keepPGTags. If invoked, IndelRealigner PG tags from previous runs (if any) are kept in the header and the new PG tag is simply added, instead of overriding them
2011-10-31 12:28:28 -04:00
Mauricio Carneiro
389380a590
ReduceReads ref bases are now output as '=' to save space
...
Restructured the sliding window framework to manipulate a wrapped version of the SAMRecord that contains information about the reference.
2011-10-30 12:04:39 -04:00
Eric Banks
0ca7428e76
Allow processing of empty intervals, but warn user when this case is encountered.
2011-10-28 12:12:14 -04:00
Eric Banks
649dfe98f0
Add VCF header for any expressions that are requested
2011-10-28 10:22:19 -04:00
Eric Banks
8b1a62da27
Adding unit test to cover overlapping intervals from the same source with the intersection rule.
2011-10-28 09:59:43 -04:00
Eric Banks
057a79f598
This argument should be annotated as @Input
2011-10-28 09:44:49 -04:00
Eric Banks
4ba7c0cecd
Moving to private
2011-10-28 09:29:28 -04:00
Eric Banks
1bdd76c2f2
These tools now use the IntervalBinding system to handle intervals instead of doing it all manually
2011-10-28 09:28:12 -04:00
Eric Banks
6ba08a103d
Empty ROD files should generate an exception when used for creating intervals. Moved some now obsolete files to the archive as the realigner will now read all target intervals into memory.
2011-10-28 09:23:25 -04:00
Eric Banks
3d04bb5608
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-27 23:55:18 -04:00
Eric Banks
19e27d4568
Removing all instances of -BTI (in tests and in GATKdocs) and replacing them with the appropriate alternative.
2011-10-27 23:55:11 -04:00
Eric Banks
cafc245a43
For some reason, a class of Codecs (including TableCodec) require that a GenomeLocParser be passed in to do the position processing. Why can't they just return a Feature with chr, start, stop? Isn't that the right thing?
2011-10-27 23:54:28 -04:00
Guillermo del Angel
cbc43683ee
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-27 20:54:18 -04:00
Guillermo del Angel
8907e42007
First fully functional implementation of ValidationSiteSelectorWalker. User gives a) a set of input variants, b) a desired number of output variants, c) optionally, a set of samples which will restrict sites to be polymorphic in those samples, d) a frequency selection mode: either uniform (no AF matching), or matching AF so that output sites mirror the input AF spectrum as closely as possible.
...
More testing is needed and docs need improving but so far all functionality seems up and running
2011-10-27 20:53:48 -04:00
Eric Banks
ccfd853b34
Added further integration tests for rod-based intervals that deal with more complex cases. Good call by Mark to test the empty VCF example because we were failing on it; fixed.
2011-10-27 20:43:50 -04:00
Eric Banks
c2f343773e
Oops, working too quickly last time. This is the proper fix for the potential NPE in the equals() test.
2011-10-27 15:32:08 -04:00
Khalid Shakir
b80d407dc7
No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path.
...
Other minor cleanup.
2011-10-27 14:17:07 -04:00
Eric Banks
8c4dbce6d8
Don't serialize the GATKArgumentCollection for the GATKRunReports (which would have meant dealing with the new IntervalBindings). Also, forgot to remove a test that's no longer relevant to BED parsing.
2011-10-27 13:58:19 -04:00
Eric Banks
4a7e6fee3f
Remove support for BED file interval parsing in the GATK; it should all go through Tribble now. IndelRealigner no longer supports unordered interval input (which shouldn't have been used anyways). Temporarily commenting out serialization of arguments so that tests pass; this whole piece will be deleted soon anyways.
2011-10-27 13:38:08 -04:00
Matt Hanna
f7df8bdecc
Merged bug fix from Stable into Unstable
2011-10-27 11:31:17 -04:00
Matt Hanna
41ddc7bce7
Make sure we output a full stack trace when we encounter Tribble error messages on VCF header merge.
2011-10-27 11:30:04 -04:00
Eric Banks
44f905b5e5
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-26 23:31:11 -04:00
Eric Banks
68283b1651
Fixing docs and adding GATKdocs for the new interval functionality
2011-10-26 22:14:43 -04:00
Mark DePristo
c9978316a3
Merge branch 'FragmentUtils'
2011-10-26 19:51:49 -04:00
Mauricio Carneiro
add9ad97ec
No scatter gather for VQSR or ApplyVQSR.
...
These walkers should not be scatter gatherable. Annotating them accordingly so that Queue doesn't allow a less than knowledgeable user to try and scatter/gather VQSR.
2011-10-26 16:35:44 -04:00
Ryan Poplin
74aeb22eeb
Merged bug fix from Stable into Unstable
2011-10-26 15:57:30 -04:00
Ryan Poplin
86871bd1e3
Throw a UserException in the BQSR when there is no data instead of creating an empty csv file
2011-10-26 15:56:41 -04:00
Mark DePristo
034a997d07
Generalized Reads -> Fragment calculation
...
-- Supports ReadBackedPileup -> FragmentCollection as before
-- Added support for List<SAMRecord> -> FragmentCollection for Ryan's haplotype caller
-- General cleanup, renaming, move to separate package, more extensive unit tests, etc.
-- Added toFragment() function to ReadBackedPileup interface
2011-10-26 15:54:38 -04:00
Eric Banks
2f21b6ecfb
Removed debugging output
2011-10-26 15:50:20 -04:00
Eric Banks
b39fcb1bea
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-26 15:44:25 -04:00
Eric Banks
b6ce6ed3f8
Go around the ROD system for now so that we can just call decodeLoc() for efficiency. Noted that we should go through the ROD system once it gets cleaned up. This means that currently gzipped files are not supported with -L.
2011-10-26 15:42:53 -04:00
Eric Banks
3273c20c98
Added integration tests for Tribble-based intervals and fixed up some of the other tests based on some method changes.
2011-10-26 15:29:18 -04:00
Eric Banks
9424e8b2ca
Initial working version of new interval system in which the argument for -L (and -XL) is allowed to be a rod file (e.g. VCF). Old samtools-style intervals still behave as before. BTI is no longer supported. The merging (union or intersection) of intervals is now consistently applied to all -L (or -XL) intervals, which is nice. More testing needed.
2011-10-26 14:11:49 -04:00
Mark DePristo
7fa943aef1
Renamed FragmentPileup to FragmentUtils
2011-10-26 14:01:45 -04:00
Laurent Francioli
1f044faedd
- Genotype assignment in case of equally likely combinations is now random
...
- Genotype combinations with 0 confidence are now left unphased
2011-10-26 19:57:09 +02:00
Laurent Francioli
81b163ff4d
Indentation
2011-10-26 14:49:12 +02:00
Laurent Francioli
62cff266d4
GQ calculation corrected for most likely genotype
2011-10-26 14:40:04 +02:00
Mark DePristo
af3613cc5f
GATKSAMRecord commit branch summary
...
First, I'm sure there's a better way to do this, but I wanted to create a single commit summarizing the changes from my branch SamRecordFactory. What's the best way to do this? Rebase?
Now, on to the changes here:
-- Picard added a SamRecordFactory that is used to create instances of the subclass SamRecord or BAMRecord. This factory allows us to have low-level picard readers (SamFileReader) create objects of type GATKSamRecord. The abomination of the extends and contains GATKSamRecord is now gone. GATKSamRecords are now produced by this factory, the GATK provides this factory to our SamFileReaders, and everything works with GATKSamRecord just extending BAMRecord. This results in up to a 2x performance improvement in writing BAMs and a ~10% improvement when reading BAM files.
-- As a consequence of this, we no longer officially support SAM records. Attempting to create SAMRecord objects with the factory will throw a user exception.
-- Created a standard NGSPlatform enum, and GATKSamRecords support efficiently obtaining this value. The real BQSR (not the copy indel version) got the efficient code to use this. Please add all future platforms to this enum.
-- GATKSamRecord no longer supports using the OQ or defaultBaseQuality. This is performed in a wrapper iterator that's only added when these command line options are used.
-- ReducedRead code has been moved from ReadUtils into efficient caching accessors in GATKSamRecord.
-- ArtificialSamUtils creates GATKSamRecords now, not just SAMRecords. Added code here to create artificial pairs and used that code to create artificial ReadBackedPileups with specific properties
-- New smarter algorithm for FragmentPileup. This new code is up to 3x faster than the previous version, and is lazy so is more efficient when no overlapping pairs are actually in the pileup. Created extensive DataProvider driven UnitTest. Added Caliper-based benchmarking system to characterize the performance differences between the old and new algorithms. TODO still remains to make an efficient version that works for non-pileups for the HaplotypeCaller
2011-10-25 20:52:56 -04:00
Mark DePristo
2822f0dc27
Merge branch 'SamRecordFactory'
2011-10-25 20:34:47 -04:00
Mark DePristo
1b722c21cf
merge master
2011-10-25 16:08:39 -04:00