Eric Banks
b4749757f8
Fixes for SLOD: 1) didn't work properly for multi-allelics (randomly chose an allele, possibly one that wasn't genotyped in the full context); 2) in cases when there were more alt alleles than the max allowed and the user is calculating SB, we would recompute the best alt allele(s); 3) for some reason, we were recomputing the LOD for the full context when we'd already done that. Given that this passes integration tests on my end, this should be the last commit before the release.
2012-03-12 01:07:07 -04:00
Ryan Poplin
2836c161ee
Moving trimToVariableRegion out of reduced reads and into a public static ReadClipper function. HaplotypeCaller clips reads to the active region boundaries before passing to the HMM. The philosophy of the HC is moving towards genotyping the entire haplotype sequence contained within the active region as a single allele.
2012-03-11 14:45:59 -04:00
Ryan Poplin
8db11eb781
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-10 21:00:55 -05:00
Mark DePristo
1ee46e5c06
Collect only the bare essentials in the GATKRunReport
...
Now looks like:
<GATK-run-report>
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
<start-time>2012/03/10 20.21.19</start-time>
<end-time>2012/03/10 20.21.19</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>105</iterations>
</GATK-run-report>
No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy
2012-03-10 20:27:14 -05:00
Ryan Poplin
92bbb9bbdd
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-10 10:09:57 -05:00
Mark DePristo
3ba2e5667c
CalibrateGenotypeLikelihoods now includes pOfDGivenD
2012-03-09 16:00:07 -05:00
Mark DePristo
1011f3862b
CalibrateGenotypeLikelihoods now emits the position of the variant for debugging
...
-- Refactored some duplicated code (FYI, code duplication = root of all evil) into shared functions
-- Added long-missing integrationtests
-- CHRIS/RYAN -- it would be very good to add an integration test covering external VCF files as I believe we rely on this functionality and it's not tested at all
2012-03-09 16:00:07 -05:00
David Roazen
91d10431d3
BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary
...
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.
Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
2012-03-09 15:20:16 -05:00
David Roazen
bc65f6326f
Detect incomplete reads from BAM schedule file in BAMSchedule before they become buffer underflows
...
This fix is similar to, but distinct from, the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.
Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
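The check described above can be sketched roughly as follows. This is a minimal illustration, not the BAMSchedule code: the class and method names (BinHeaderReader, readBinHeader) are hypothetical, and a plain IllegalStateException stands in for ReviewedStingException.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class BinHeaderReader {
    // The message above mentions a 3-integer bin header.
    static final int HEADER_INTS = 3;

    /** Reads one header, failing with a descriptive error instead of a buffer underflow. */
    static int[] readBinHeader(ByteBuffer buffer) {
        int needed = HEADER_INTS * Integer.BYTES;
        if (buffer.remaining() < needed) {
            // Stand-in for ReviewedStingException: report what was expected and what was found.
            throw new IllegalStateException("Incomplete bin header: needed " + needed
                    + " bytes but only " + buffer.remaining() + " remain");
        }
        int[] header = new int[HEADER_INTS];
        for (int i = 0; i < HEADER_INTS; i++) {
            header[i] = buffer.getInt();
        }
        return header;
    }

    public static void main(String[] args) {
        ByteBuffer complete = ByteBuffer.allocate(12);
        complete.putInt(7).putInt(8).putInt(9);
        complete.flip();
        System.out.println(Arrays.toString(readBinHeader(complete)));

        ByteBuffer truncated = ByteBuffer.allocate(8);
        truncated.putInt(7).putInt(8);
        truncated.flip();
        try {
            readBinHeader(truncated);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking remaining() before the first getInt() means the failure is reported once, with context, rather than surfacing as a BufferUnderflowException partway through the header.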
2012-03-09 12:33:48 -05:00
David Roazen
32dee7ed9b
Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices
...
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.
Added a unit test for this case as well.
2012-03-08 15:30:44 -05:00
Guillermo del Angel
c04853eae6
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-08 12:30:04 -05:00
Guillermo del Angel
858acf8616
Hidden mode in ValidationAmplicons to support ILMN output format (same as Sequenom, with just shuffled columns)
2012-03-08 12:29:44 -05:00
Andrey Sivachenko
56f074b520
docs updated
2012-03-07 18:47:15 -05:00
Andrey Sivachenko
117ea605ac
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-07 18:35:07 -05:00
Andrey Sivachenko
497a1b059e
transition to JEXL completed, old parameters setting individual cutoffs now deprecated
2012-03-07 18:34:11 -05:00
Andrey Sivachenko
fbd2f04a04
JEXL support added; intermediate commit, not yet functional
2012-03-07 17:29:42 -05:00
Mark DePristo
0376d73ece
Improved, public version of ErrorRateByCycle
...
-- A cleaner table output (molten). For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
2012-03-07 13:10:08 -05:00
Christopher Hartl
a6a8fc0521
Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable
2012-03-07 10:05:43 -05:00
Mark DePristo
569be953b9
Bugfix for VariantEval
...
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp. These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL), resulting most conspicuously in incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
2012-03-06 16:56:59 -05:00
David Roazen
811f871f78
Do not fail tests that require the GATK private key if the user does not have permission to read it
...
Several of the unit tests for the new key authorization feature require
read access to the GATK master private key file. Since this file is only
readable by members of the group gsagit, this makes it hard for people
outside the group to run the test suite.
Now, we skip tests that require the master private key if the private
key exists (since not existing would be a true error) but is not readable
by the user running the test suite.
Bamboo, of course, will always be able to run these tests.
2012-03-06 15:57:02 -05:00
Christopher Hartl
67def6acc8
Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable
2012-03-06 14:23:14 -05:00
Christopher Hartl
20c1fbaf0f
Fixing a merge (turning off downsampling on DoC)
2012-03-06 14:22:45 -05:00
Ryan Poplin
46b470cc69
Minor misc updates
2012-03-06 10:14:45 -05:00
David Roazen
0702ee1587
Public-key authorization scheme to restrict use of NO_ET
...
-Running the GATK with the -et NO_ET or -et STDOUT options now
requires a key issued by us. Our reasons for doing this, and the
procedure for our users to request keys, are documented here:
http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home
-A GATK user key is an email address plus a cryptographic signature
signed using our private key, all wrapped in a GZIP container.
User keys are validated using the public key we now distribute with
the GATK. Our private key is kept in a secure location.
-Keys are cryptographically secure in that valid keys definitely
came from us and keys cannot be fabricated, however keys are not
"copy-protected" in any way.
-Includes private, standalone utilities to create a new GATK user key
(GenerateGATKUserKey) and to create a new master public/private key
pair (GenerateKeyPair). Usage of these tools will be documented on
the internal wiki shortly.
-Comprehensive unit/integration tests, including tests to ensure the
continued integrity of the GATK master public/private key pair.
-Generation of new user keys and the new unit/integration tests both
require access to the GATK private key, which can only be read by
members of the group "gsagit".
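For readers unfamiliar with the sign/verify idea behind user keys, here is a minimal sketch using the standard java.security API. Everything specific in it is an assumption: the class name UserKeyDemo, the RSA key size, and the SHA256withRSA algorithm are illustrative choices, and the GZIP container and key file layout are omitted. It is not the GATK key format.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

final class UserKeyDemo {
    // Assumed algorithm; the commit does not say which one is used.
    private static final String ALGORITHM = "SHA256withRSA";

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Issuer side: sign the email address with the private key. */
    static byte[] sign(String email, PrivateKey privateKey) {
        try {
            Signature signer = Signature.getInstance(ALGORITHM);
            signer.initSign(privateKey);
            signer.update(email.getBytes(StandardCharsets.UTF_8));
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** User side: check the signature against the distributed public key. */
    static boolean verify(String email, byte[] signature, PublicKey publicKey) {
        try {
            Signature verifier = Signature.getInstance(ALGORITHM);
            verifier.initVerify(publicKey);
            verifier.update(email.getBytes(StandardCharsets.UTF_8));
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair master = newKeyPair();
        byte[] signature = sign("user@example.org", master.getPrivate());
        System.out.println(verify("user@example.org", signature, master.getPublic()));
        System.out.println(verify("other@example.org", signature, master.getPublic()));
    }
}
```

This also illustrates the "not copy-protected" point: anyone holding the email and signature pair can present it, but nobody without the private key can produce a valid pair for a new email.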
2012-03-06 00:09:43 -05:00
Lechu
027843d791
I've simply added a "library(grid)" call at the beginning of the R script generation since R 2.14.2 doesn't seem to load the "grid" package as default. I haven't tested it on previous R versions (you may edit the R version comment to be more precise if desired), but I'm almost certain that this library call shouldn't do any harm on them.
...
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2012-03-05 21:27:03 -05:00
Ryan Poplin
f6905630bb
Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode.
2012-03-05 21:08:07 -05:00
Ryan Poplin
9b53250bef
Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode.
2012-03-05 21:07:36 -05:00
Ryan Poplin
b37461587d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-05 17:54:59 -05:00
Ryan Poplin
c6ded4d23c
Bug fix for hard clipping reads when base insertion and base deletion qualities are present in the read. Updating HaplotypeCaller integration tests to reflect all the recent changes.
2012-03-05 17:54:42 -05:00
Ryan Poplin
14a77b1e71
Getting rid of redundant methods in MathUtils. Adding unit tests for approximateLog10SumLog10 and normalizeFromLog10. Increasing the precision of the Jacobian approximation used by approximateLog10SumLog10, which changes the UG+HC integration tests ever so slightly.
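As background for the two functions named above, the following is a sketch of the exact (non-approximate) log10 sum and the corresponding normalization. The class Log10Sum and its method bodies are illustrative and written from the definitions only; they are not the MathUtils implementations, and the faster approximation mentioned in the message is not shown.

```java
import java.util.Arrays;

final class Log10Sum {
    /** Returns log10(sum_i 10^x_i), factoring out the maximum to avoid underflow. */
    static double log10SumLog10(double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return max; // empty input or all-zero probabilities
        }
        double sum = 0.0;
        for (double v : log10Values) {
            sum += Math.pow(10.0, v - max);
        }
        return max + Math.log10(sum);
    }

    /** Converts log10 values to real-space values that sum to 1 (assumes a finite total). */
    static double[] normalizeFromLog10(double[] log10Values) {
        double total = log10SumLog10(log10Values);
        double[] normalized = new double[log10Values.length];
        for (int i = 0; i < log10Values.length; i++) {
            normalized[i] = Math.pow(10.0, log10Values[i] - total);
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Values this small would underflow to 0 if converted to real space directly.
        double[] likelihoods = {-1000.0, -1000.0, -1001.0};
        System.out.println(log10SumLog10(likelihoods));
        System.out.println(Arrays.toString(normalizeFromLog10(likelihoods)));
    }
}
```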
2012-03-05 12:28:32 -05:00
Mauricio Carneiro
e9ad382e74
unifying the BQSR argument collection
2012-03-05 10:48:26 -05:00
Ryan Poplin
f879daa7d0
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-05 08:29:08 -05:00
Ryan Poplin
d6871967ae
Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype there is no reason to ensure all haplotypes have the same length, which simplifies the code considerably.
2012-03-05 08:28:42 -05:00
Guillermo del Angel
3b5a7c34d7
Added argument to ValidationAmplicons to only output valid sequences - useful for not having to post-filter or grep resulting files before delivering downstream
2012-03-04 10:24:29 -05:00
Mark DePristo
69611af7d3
Workaround for bug in Picard in ReadGroupProperties
...
-- NPE caused when you call getRunDate on a read group without a date.
2012-03-02 18:53:45 -05:00
Mark DePristo
ba71b0aee4
ReadGroupProperties mk3
...
-- Includes sequencing date
2012-03-02 16:12:42 -05:00
Eric Banks
1e07e97b58
Optimization: create allele list just once, not for each genotype
2012-03-02 13:30:17 -05:00
Ryan Poplin
0ad7d5fbc1
Standalone common Pair HMM utility class with associated unit tests.
2012-03-01 22:41:13 -05:00
Mark DePristo
2f334a57c2
ReadGroupProperties mk2
...
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
2012-03-01 18:43:53 -05:00
Mauricio Carneiro
486712bfc2
ugly RG encoding
2012-03-01 17:56:45 -05:00
Mauricio Carneiro
29f74b658b
Unit tests for the context covariate
...
this is simple, but it's the infrastructure to start messing around with the context.
2012-03-01 17:56:45 -05:00
Mark DePristo
aff508e091
ReadGroupProperties walker and associated infrastructure
...
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses can override name more easily
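The Median tool described in the list above could look something like this sketch. The class name BoundedMedian, the choice to drop values past the cap, and the use of the upper-middle element for even counts are all assumptions made for illustration, not details taken from the commit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.OptionalInt;

final class BoundedMedian {
    private final int maxValues;
    private final List<Integer> values = new ArrayList<>();

    BoundedMedian(int maxValues) {
        this.maxValues = maxValues;
    }

    /** Records a value unless the cap has been reached; returns whether it was kept. */
    boolean add(int value) {
        if (values.size() >= maxValues) {
            return false;
        }
        values.add(value);
        return true;
    }

    /** Median of the collected values, or empty when nothing was collected. */
    OptionalInt median() {
        if (values.isEmpty()) {
            return OptionalInt.empty();
        }
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        // Assumption: for an even count, take the upper of the two middle elements.
        return OptionalInt.of(sorted.get(sorted.size() / 2));
    }

    public static void main(String[] args) {
        BoundedMedian insertSizes = new BoundedMedian(1000);
        for (int size : new int[]{310, 295, 402, 288, 305}) {
            insertSizes.add(size);
        }
        System.out.println(insertSizes.median());
        System.out.println(new BoundedMedian(10).median());
    }
}
```

Returning an empty result rather than 0 when no values were seen lets a caller print a placeholder such as NA for read groups without data.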
2012-03-01 15:01:11 -05:00
Mauricio Carneiro
9e95b10789
Context covariate now operates as a highly compressed bitset
...
* All contexts with 'N' bases are now collapsed as uninformative
* Context size is now represented internally as a BitSet but output as a dna string
* Temporarily disabled sorted outputs because of null objects
2012-02-29 19:25:21 -05:00
Mauricio Carneiro
d379c3763a
DNA Sequence to BitSet and vice-versa conversion tools
...
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
* Allows variable context size representation guaranteeing uniqueness.
* Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
* Unit Tests added
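One way to get the properties listed above (compact, unique for any length, capped at 31 bases in a long) is sketched below. The encoding shown, 2 bits per base behind a leading sentinel bit, is an assumption chosen because it reproduces the 31-base limit; the commit does not describe the actual scheme. The class name DnaBits and its methods are hypothetical.

```java
import java.util.BitSet;

final class DnaBits {
    static final int MAX_BASES = 31; // 1 sentinel bit + 2 * 31 bits fits in a positive long

    /** Packs a DNA string into a long; the leading 1 bit makes "A" and "AA" distinct. */
    static long encode(String dna) {
        if (dna.length() > MAX_BASES) {
            throw new IllegalArgumentException("Context longer than " + MAX_BASES + " bases");
        }
        long bits = 1L;
        for (int i = 0; i < dna.length(); i++) {
            bits = (bits << 2) | baseCode(dna.charAt(i));
        }
        return bits;
    }

    /** Inverse of encode; assumes the value was produced by encode. */
    static String decode(long bits) {
        StringBuilder dna = new StringBuilder();
        while (bits > 1L) {
            dna.append("ACGT".charAt((int) (bits & 3L)));
            bits >>>= 2;
        }
        return dna.reverse().toString();
    }

    static BitSet toBitSet(String dna) {
        return BitSet.valueOf(new long[]{encode(dna)});
    }

    static String fromBitSet(BitSet bitSet) {
        return decode(bitSet.toLongArray()[0]);
    }

    private static int baseCode(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("Unsupported base: " + base);
        }
    }

    public static void main(String[] args) {
        BitSet packed = toBitSet("GATTACA");
        System.out.println(packed + " -> " + fromBitSet(packed));
    }
}
```

In this sketch a base outside A, C, G, T is rejected outright; how such contexts are handled in the real code is a separate decision.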
2012-02-29 19:25:20 -05:00
Eric Banks
129b5e7f6b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-28 10:09:34 -05:00
Eric Banks
a4a279ce80
Damn you, Mark
2012-02-28 10:09:09 -05:00
Khalid Shakir
0681bea5a5
Changed DoC from PartitionType.INTERVAL to PartitionType.NONE since it doesn't have a way to gather scattered outputs.
...
Added MultiallelicSummary to HSP eval.
2012-02-28 09:27:27 -05:00
Eric Banks
bd398e30fd
Another quick optimization
2012-02-28 09:25:35 -05:00
Eric Banks
40bdadbda5
Minor optimization as per Mark
2012-02-28 09:24:07 -05:00
Eric Banks
d7928ad669
Drat, missed one: handle null alleles being passed in.
2012-02-27 21:31:54 -05:00
Mark DePristo
24356f11b7
Merged bug fix from Stable into Unstable
...
-- Resolved conflict
Conflicts:
public/java/src/org/broadinstitute/sting/gatk/datasources/reads/SAMDataSource.java
2012-02-27 17:13:17 -05:00
Mark DePristo
0b29d54937
Changed most BAMSchedule ReviewedStingExceptions to UserExceptions
...
-- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.
2012-02-27 17:08:41 -05:00
Mark DePristo
f9e8e82e33
Removed unused class variable from VCFHeaderLineTranslator
2012-02-27 17:07:19 -05:00
Mark DePristo
100ddef930
Fix typo in VariantContextBuilder
2012-02-27 17:06:45 -05:00
Mark DePristo
ca0931c01f
Adding test for reading samtools VCF file
2012-02-27 17:05:50 -05:00
Eric Banks
bd944ab04f
Another test where we no longer print out 'NaN' for the AF.
2012-02-27 15:19:08 -05:00
Mark DePristo
5f7ccdcc01
Avoid calling getBasePileup when there's no pileup in NBaseCount annotation
2012-02-27 15:12:25 -05:00
Eric Banks
52871187d7
Adding integration test for file with no GTs. Also updated md5 for one other test (since we no longer print out 'NaN' for the AF).
2012-02-27 15:09:56 -05:00
Mark DePristo
729bb954e2
Throws ReviewedStingException for a bug when parent VariantContext argument is null
2012-02-27 15:09:00 -05:00
Eric Banks
998ed8fff3
Bug fix to deal with VCF records that don't have GTs. While in there, optimized a bunch of related functions (including removing a copy of the method calculateChromosomeCounts(); why did we have 2 copies? very dangerous).
2012-02-27 14:56:10 -05:00
Mark DePristo
4d9582de77
More general catching of Exceptions in interval reading to throw MalformedFile exception in all cases
...
-- Now throws UserException no matter what happens during the reading of the intervals file.
2012-02-27 14:02:26 -05:00
Mark DePristo
9712fed7a5
Trap SAMFormatException and rethrow as MalformatedBAM exception
...
-- Trap errors in header and rethrow
-- Wrap underlying iterator in MalformatedBAMErrorReformattingIterator
2012-02-27 13:52:50 -05:00
Eric Banks
1ea34058c2
Updating integration tests now that standard annotations support multiple alleles
2012-02-27 11:32:26 -05:00
Eric Banks
64754e7870
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-27 11:31:41 -05:00
Eric Banks
850c5d0db2
Enabling Rank Sum Tests for multi-allelics: use ref vs any alt allele.
2012-02-27 09:59:36 -05:00
Eric Banks
dfdf4f989b
Enabling Fisher Strand for multi-allelics: use the alt allele with max AC. Added minor optimization to the method in the VC.
2012-02-27 09:50:09 -05:00
Guillermo del Angel
16122bea8d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-25 13:57:54 -05:00
Guillermo del Angel
dea35943d1
a) Bug fix in calling new functions that give indel bases and length from regular pileup in LocusIteratorByState, b) Added unit test to cover these.
2012-02-25 13:57:28 -05:00
Mark DePristo
c8a06e53c1
DoC now properly handles reference N bases + misc. additional cleanups
...
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensure the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
2012-02-25 11:32:50 -05:00
Mark DePristo
50de1a3eab
Fixing bad VCFIntegration tests
...
-- Left disabled a test that should have been enabled
-- Didn't add the md5 to the test I actually added
-- Now VCFIntegrationTests should be working!
2012-02-25 11:26:36 -05:00
Guillermo del Angel
c9a4c74f7a
a) Bug fixes for last commit related to PileupElements (unit tests are forthcoming). b) Changes needed to make pool caller work in GENOTYPE_GIVEN_ALLELES mode c) Bug fix (yet again) for UG when GENOTYPE_GIVEN_ALLELES and EMIT_ALL_SITES are on, when there's no coverage at site and when input vcf has genotypes: output vcf would still inherit genotypes from input vcf. Now, we just build vc from scratch instead of initializing from input vc. We just take location and alleles from vc
2012-02-24 10:27:59 -05:00
Mauricio Carneiro
ee9a56ad27
Fix subtle bug in the ReduceReads stash reported by Adam
...
* The tailSet generated every time we flush the reads stash is still being affected by subsequent clears because it is just a pointer to the parent element in the original TreeSet. This is dangerous, and there is a weird condition where the clear will affect it.
* Fix by creating a new set, given the tailSet instead of trying to do magic with just the pointer.
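The behaviour behind this bug is standard java.util.TreeSet semantics: tailSet returns a view backed by the original set. The sketch below shows the difference between the view and a copy; the class name TailSetCopyDemo and the integer "stash" are illustrative stand-ins, not the ReduceReads code.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

final class TailSetCopyDemo {
    /** Live view: later changes to the parent set show through. */
    static SortedSet<Integer> tailView(TreeSet<Integer> stash, int from) {
        return stash.tailSet(from);
    }

    /** Independent copy: unaffected by later changes to the parent set. */
    static SortedSet<Integer> tailCopy(TreeSet<Integer> stash, int from) {
        return new TreeSet<>(stash.tailSet(from));
    }

    public static void main(String[] args) {
        TreeSet<Integer> stash = new TreeSet<>(Arrays.asList(10, 20, 30, 40));
        SortedSet<Integer> view = tailView(stash, 30);
        SortedSet<Integer> copy = tailCopy(stash, 30);

        stash.clear(); // flushing the stash empties the view as well

        System.out.println("view after clear: " + view); // []
        System.out.println("copy after clear: " + copy); // [30, 40]
    }
}
```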
2012-02-23 18:35:25 -05:00
Mark DePristo
e0c189909f
Added support for breakpoint alleles
...
-- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Added integrationtest to ensure that we can parse and write out breakpoint example
2012-02-23 12:14:48 -05:00
Guillermo del Angel
6866a41914
Added functionality in pileups to not only determine whether there's an insertion or deletion following the current position, but to also get the indel length and involved bases - definitely needed for extended event removal, and needed for pool caller indel functionality.
2012-02-23 09:45:47 -05:00
Eric Banks
d34f07dba0
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-22 20:41:03 -05:00
Ryan Poplin
2b6c0939ab
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-22 19:00:38 -05:00
Ryan Poplin
8695738400
Bug fix in HaplotypeCaller's GENOTYPE_GIVEN_ALLELES mode for insertions greater than length 1. The allele being genotyped was off by one base pair.
2012-02-22 19:00:04 -05:00
Christopher Hartl
2c1b14d35e
Mostly small changes to my own scala scripts: .vcf.gz compatibility for output files, smarter beagle generation, simple script to scatter-gather combine variants. Whole genome indel calling now uses the gold standard indel set.
2012-02-22 17:20:04 -05:00
Mauricio Carneiro
75783af6fc
int <-> BitSet conversion utils for MathUtils
...
* added unit tests.
2012-02-21 14:10:36 -05:00
Guillermo del Angel
0f5674b95e
Redid fix for corner case when forming consensus with reads that start/end with insertions and that don't agree with each other in inserted bases: since I can't iterate over the elements of a HashMap because keys might change during iteration, and since I can't use ConcurrentHashMaps, the code now copies structure of (bases, number of times seen) into ArrayList, which can be addressed by element index in order to iterate on it.
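The snapshot approach described above can be sketched as follows: copy the keys into an ArrayList, iterate over that copy, and mutate the map freely. The class ConsensusCounts and the matching rule in addObservation are hypothetical simplifications for illustration, not the consensus code itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConsensusCounts {
    /**
     * Records one observed insertion. Hypothetical rule for illustration: candidates
     * that start with the observed bases are incremented, and candidates that are a
     * strict prefix of the observed bases are re-keyed to the longer sequence.
     */
    static void addObservation(Map<String, Integer> counts, String bases) {
        // Snapshot of the keys, so the map can be modified inside the loop.
        List<String> keys = new ArrayList<>(counts.keySet());
        boolean matched = false;
        for (String key : keys) {
            if (key.startsWith(bases)) {
                counts.put(key, counts.get(key) + 1);
                matched = true;
            } else if (bases.startsWith(key)) {
                int old = counts.remove(key); // key changes: shorter candidate is replaced
                counts.merge(bases, old + 1, Integer::sum);
                matched = true;
            }
        }
        if (!matched) {
            counts.put(bases, 1);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("A", 1);
        counts.put("ACGT", 2);
        counts.put("T", 5);
        addObservation(counts, "ACG");
        System.out.println(counts);
    }
}
```

Iterating over counts.keySet() directly while calling remove or merge on the map would risk a ConcurrentModificationException, which is the problem the snapshot avoids. When only removal is needed, Iterator.remove() is the other standard option.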
2012-02-20 09:12:51 -05:00
Ryan Poplin
3d9eee4942
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-18 10:55:29 -05:00
Ryan Poplin
a8be96f63d
This caching in the BQSR seems to be too slow now that there are so many keys
2012-02-18 10:54:39 -05:00
Ryan Poplin
78718b8d6a
Adding Genotype Given Alleles mode to the HaplotypeCaller. It constructs the possible haplotypes via assembly and then injects the desired allele to be genotyped.
2012-02-18 10:31:26 -05:00
Guillermo del Angel
e724c63f2b
Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a key from a HashMap that's being iterated on
2012-02-17 17:18:43 -05:00
Guillermo del Angel
f2ef8d1d23
Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a key from a HashMap that's being iterated on
2012-02-17 17:15:53 -05:00
Guillermo del Angel
3e031a540f
Solve merge conflict
2012-02-17 10:56:03 -05:00
Guillermo del Angel
cd352f502d
Corner case bug fix: if a read starts with an insertion, when computing the consensus allele for calling, the insertion was only added to the last element in the consensus key hash map. Now, an insertion that partially overlaps with several candidate alleles will have the count increased for all of them.
2012-02-17 10:21:37 -05:00
Eric Banks
2f33c57060
No reason to restrict HaplotypeScore to bi-allelic SNPs when the plumbing for multi-allelic events is already present.
2012-02-16 13:58:00 -05:00
Guillermo del Angel
2f08846d82
Merged bug fix from Stable into Unstable
2012-02-14 21:26:25 -05:00
Guillermo del Angel
7dc6f73399
Bug fix for validation site selector: records with AC=0 in them were always being thrown out if input vcf was sites-only, even when -ignorePolymorphicStatus flag was set
2012-02-14 21:11:24 -05:00
Ryan Poplin
30085781cf
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-14 14:01:20 -05:00
Ryan Poplin
ae5b42c884
Put base insertion and base deletions in the SAMRecord as a string of quality scores instead of an array of bytes. Start of a proper genotype given alleles mode in HaplotypeCaller
2012-02-14 14:01:04 -05:00
David Roazen
85d31f80a2
Merged bug fix from Stable into Unstable
2012-02-13 16:37:11 -05:00
David Roazen
03e5184741
Fix serious engine bug that could cause reads to be dropped under certain circumstances
...
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.
Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.
Thanks to Khalid, who did at least as much work on this bug as I did.
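To illustrate the kind of overlap that a union has to handle, here is a generic interval-union sketch. It models a file span as a list of [start, end) chunks over plain long offsets, which is an assumption; the class SpanUnion is hypothetical and this is not the GATKBAMFileSpan.union() implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanUnion {
    static final class Chunk {
        final long start;
        final long end; // exclusive

        Chunk(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Union of two chunk lists: sort by start, then merge overlapping or touching chunks. */
    static List<Chunk> union(List<Chunk> a, List<Chunk> b) {
        List<Chunk> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort((x, y) -> Long.compare(x.start, y.start));

        List<Chunk> merged = new ArrayList<>();
        for (Chunk chunk : all) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0 && chunk.start <= merged.get(lastIndex).end) {
                Chunk last = merged.get(lastIndex);
                // Math.max matters when the new chunk is fully contained in the last one:
                // taking chunk.end blindly would truncate the union.
                merged.set(lastIndex, new Chunk(last.start, Math.max(last.end, chunk.end)));
            } else {
                merged.add(chunk);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Chunk> a = Arrays.asList(new Chunk(0, 100), new Chunk(300, 400));
        List<Chunk> b = Arrays.asList(new Chunk(10, 20), new Chunk(350, 500));
        System.out.println(union(a, b));
    }
}
```

The contained-chunk case, (10, 20) inside (0, 100), is the sort of overlap where a naive merge shortens the result, so it is worth covering explicitly in unit tests.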
2012-02-13 16:25:21 -05:00
Eric Banks
ad90af94ed
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-13 15:10:10 -05:00
Eric Banks
0920a1921e
Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics.
2012-02-13 15:09:53 -05:00
Eric Banks
14981bed10
Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records.
2012-02-13 14:32:03 -05:00
Ryan Poplin
e9338e2c20
Context covariate needs to look in the reverse direction for negative stranded reads.
2012-02-13 13:40:41 -05:00
Ryan Poplin
41ffd08d53
On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live.
2012-02-13 12:35:09 -05:00
Ryan Poplin
3caa1b83bb
Updating HC integration tests
2012-02-11 11:48:32 -05:00
Ryan Poplin
9b8fd4c2ff
Updating the half of the code that makes use of the recalibration information to work with the new refactoring of the bqsr. Reverting the covariate interface change in the original bqsr because the error model enum was moved to a different class and didn't make sense any more.
2012-02-11 10:57:20 -05:00
Eric Banks
f52f1f659f
Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1.
2012-02-10 14:15:59 -05:00
Mauricio Carneiro
1fb19a0f98
Moving the covariates and shared functionality to public
...
so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.
2012-02-10 11:44:01 -05:00
Eric Banks
5e18020a5f
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-10 11:08:33 -05:00
Eric Banks
f53cd3de1b
Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent.
2012-02-10 11:07:32 -05:00
Mauricio Carneiro
5af373a3a1
BQSR with indels integrated!
...
* added support to base before deletion in the pileup
* refactored covariates to operate on mismatches, insertions and deletions at the same time
* all code is in private so original BQSR is still working as usual in public
* outputs a molten CSV with mismatches, insertions and deletions, time to play!
* barely tested, passes my very simple tests... haven't tested edge cases.
2012-02-09 18:46:45 -05:00
Eric Banks
7a937dd1eb
Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly.
2012-02-09 16:14:22 -05:00
Eric Banks
0f728a0604
The Exact model now subsets the VC to the first N alleles when the VC contains more than the maximum number of alleles (instead of throwing it out completely as it did previously). [Perhaps the culling should be done by the UG engine? But theoretically the Exact model can be called outside of the UG and we'd still want the context subsetted.]
2012-02-09 14:02:34 -05:00
Matt Hanna
aa097a83d5
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-09 11:26:48 -05:00
Matt Hanna
b57d4250bf
Documentation request by Eric. At each stage of the GATK where filtering occurs, added documentation suggesting the goal of the filtering along with examples of suggested inputs and outputs.
2012-02-09 11:24:52 -05:00
Mauricio Carneiro
d561914d4f
Revert "First implementation of GATKReportGatherer"
...
premature push on my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.
This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
2012-02-08 23:28:55 -05:00
Eric Banks
2f800b078c
Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again.
2012-02-08 15:27:16 -05:00
Matt Hanna
51ac87b28c
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-08 08:43:55 -05:00
Matt Hanna
5b58fe741a
Retiring Picard customizations for async I/O and cleaning up parts of the code to use common Picard utilities I recently discovered.
...
Also embedded bug fix for issues reading sparse shards and did some cleanup based on comments during BAM reading code transition meetings.
2012-02-08 08:34:37 -05:00
Mauricio Carneiro
337819e791
disabling the test while we fix it
2012-02-07 19:22:32 -05:00
Roger Zurawicki
c0c676590b
First implementation of GATKReportGatherer
...
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG
Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public
The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer from working
Working GATKReportGatherer
Has only the functionality to addLines
The input file parser assumes that the first column is the primary key
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-02-07 18:14:47 -05:00
Mauricio Carneiro
e89887cd8e
laying groundwork to have insertions and deletions going through the system.
2012-02-07 18:11:53 -05:00
Mauricio Carneiro
0d3ea0401c
BQSR Parameter cleanup
...
* get rid of 320C argument that nobody uses.
* get rid of DEFAULT_READ_GROUP parameter and functionality (later to become an engine argument).
2012-02-07 14:42:11 -05:00
Eric Banks
717cd4b912
Document -L unmapped
2012-02-07 13:30:54 -05:00
Eric Banks
718da7757e
Fixes to ValidateVariants as per GS post: ref base of mixed alleles were sometimes wrong, error print out of bad ACs was throwing a RuntimeException, don't validate ACs if there are no genotypes.
2012-02-07 13:15:58 -05:00
Eric Banks
9d1a19bbaa
Multi-allelic indels were not being printed out correctly in VariantsToTable; fixed.
2012-02-06 22:49:29 -05:00
Mauricio Carneiro
5961868a7f
fixup for BQSR (HC integration tests)
...
In the new BQSR implementation, covariates do depend on the RecalibrationArgumentCollection.
2012-02-06 22:47:27 -05:00
Mauricio Carneiro
6e6f0f10e1
BaseQualityScoreRecalibration walker (bqsr v2) first commit includes
...
* Adding the context covariate standard in both modes (including old CountCovariates) with parameters
* Updating all covariates and modules to use GATKSAMRecord throughout the code.
* BQSR now processes indels in the pileup (but doesn't do anything with them yet)
2012-02-06 17:38:29 -05:00
Eric Banks
0717c79901
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-06 16:23:36 -05:00
Eric Banks
91897f5fe7
Transpose rows/cols in AF table to make it molten (so I can plot easily in R)
2012-02-06 16:23:32 -05:00
Guillermo del Angel
fb5786385c
Merged bug fix from Stable into Unstable
2012-02-06 13:22:56 -05:00
Guillermo del Angel
6ec686b877
Complement to previous commit: make sure we also don't inherit filter from input VCF when genotyping at an empty site
2012-02-06 13:19:26 -05:00
Guillermo del Angel
93ffca1e3a
Merged bug fix from Stable into Unstable
2012-02-06 11:58:58 -05:00
Guillermo del Angel
827be878b4
Bug fix when running UG in GenotypeGivenAlleles mode: if an input site to genotype had no coverage, the output VCF had AC, AF and AN inherited from input VCF, which could have nothing to do with given BAM so numbers could be nonsensical. Now new vc has clear attributes instead of attributes inherited from input VCF.
2012-02-06 11:58:13 -05:00
Eric Banks
fbbd04621d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-06 11:53:31 -05:00
Eric Banks
edb4edc08f
Commented out unused metrics for now
2012-02-06 11:53:15 -05:00
Ryan Poplin
096c23a473
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-06 11:10:38 -05:00
Ryan Poplin
dc05b71e39
Updating Covariate interface with Mauricio to include an errorModel parameter. On the fly recalibration of base insertion and base deletion quals is live for the HaplotypeCaller
2012-02-06 11:10:24 -05:00
Guillermo del Angel
1e11408f8b
Merged bug fix from Stable into Unstable
2012-02-06 10:34:26 -05:00
Guillermo del Angel
090d87b48b
Bug fix in ValidationSiteSelector: when the input VCF had genotypes and was multiallelic, the parsing of the AF/AC fields was wrong. Better logic to unify parsing of these fields.
2012-02-06 10:33:12 -05:00
Eric Banks
9d94f310f1
Break AF histogram into max and min AFs
2012-02-06 09:01:19 -05:00
Ryan Poplin
b7ffd144e8
Cleaning up the covariate classes and removing unused code from the bqsr optimizations in 2009.
2012-02-06 08:54:42 -05:00
Eric Banks
cef550903e
Minor optimization
2012-02-06 00:48:00 -05:00
Ryan Poplin
5343f8ba67
Initial version of on-the-fly, lazy loading base quality score recalibration. It isn't completely hooked up yet but I'm committing so Mauricio and Mark can see how I envision it will fit together. Look it over and give any feedback. With the exception of the SOLiD-specific code we are very close to being able to remove TableRecalibrationWalker from the code base and just replace it with PrintReads -BQSR recal.csv
2012-02-05 13:09:03 -05:00
Ryan Poplin
f94d547e97
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-03 17:14:20 -05:00
Ryan Poplin
894d3340be
Active Region Traversal should use GATKSAMRecords everywhere instead of SAMRecords. misc cleanup.
2012-02-03 17:13:52 -05:00
Mauricio Carneiro
4a57add6d0
First implementation of DiagnoseTargets
...
* calculates and interprets the coverage of a given interval track
* allows expanding intervals by a specified number of bases
* classifies targets as CALLABLE, LOW_COVERAGE, EXCESSIVE_COVERAGE and POOR_QUALITY.
* outputs text file for now (testing purposes only), soon to be VCF.
* filters are overly aggressive for now.
2012-02-03 17:12:43 -05:00
Mauricio Carneiro
3dd6a1f962
Adding some generic sum and average functions to MathUtils
2012-02-03 17:12:43 -05:00
Mauricio Carneiro
e1d69e4060
make the size of a GenomeLoc int instead of long
...
it will never be bigger than an int and it's actually useful to be an int so we can use it as parameters to array/list/hash size creation.
2012-02-03 17:12:42 -05:00
Ryan Poplin
0e44430e47
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-03 13:45:11 -05:00
Christopher Hartl
aa3638ecb3
Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-03 13:42:09 -05:00
Eric Banks
3abfbcbcf2
Generalized the TDT for multi-allelic events
2012-02-03 12:23:21 -05:00
Ryan Poplin
601e53d633
Fix when specifying preset active regions with -AR argument
2012-02-02 16:34:26 -05:00
Christopher Hartl
0111505ea9
Terrible. Swapping the paternal and sample ids.
2012-02-02 11:41:16 -05:00
Ryan Poplin
1f50f6970b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-02 10:17:13 -05:00
Ryan Poplin
4ed06801a7
Updating HaplotypeCaller's HMM calc to use GOP as a function of the read instead of a function of the haplotype in preparation for IQSR
2012-02-02 10:17:04 -05:00
Matt Hanna
8adfc79123
Merged bug fix from Stable into Unstable
2012-02-01 16:07:41 -05:00
Matt Hanna
30b937d2af
Fix bug discovered in FGTP branch in which BlockInputStream returns -1 in cases where some data could be read,
...
but not all the data requested by the caller.
2012-02-01 16:06:22 -05:00
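For illustration only (this is not the actual patch): the read contract described in the commit above can be sketched in Python. The function name `read_into` and the `EOF` sentinel are hypothetical and simply mirror the `java.io.InputStream` convention of returning -1 only when no data at all could be read.

```python
import io

EOF = -1  # hypothetical sentinel mirroring the java.io.InputStream convention


def read_into(stream, buffer: bytearray, length: int) -> int:
    """Copy up to `length` bytes from `stream` into `buffer`.

    Returns the number of bytes actually read. EOF (-1) is returned only
    when nothing could be read, never when the read was merely partial.
    """
    if length == 0:
        return 0
    total = 0
    while total < length:
        chunk = stream.read(length - total)
        if not chunk:  # underlying stream is exhausted
            break
        buffer[total:total + len(chunk)] = chunk
        total += len(chunk)
    return total if total > 0 else EOF


stream = io.BytesIO(b"abc")
buf = bytearray(8)
first = read_into(stream, buf, 8)   # partial read: fewer bytes than requested
second = read_into(stream, buf, 8)  # nothing left to read
```

Here the first call reports the three bytes that were available, and only the second call, which reads nothing, reports end of stream.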
Mauricio Carneiro
45da892ecc
Better exceptions to catch malformed reads
...
* throw exceptions in LocusIteratorByState when hitting reads starting or ending with deletions
2012-02-01 11:56:19 -05:00
Christopher Hartl
810996cfca
Introducing: VariantsToPed, the world's most annoying walker! And also a busted QScript to run it that I need Khalid's help debugging ( frownie face ). Note that VariantsToPed and PlinkSeq generate the same binary file (up to strand flips...thanks PlinkSeq), so I know it's working properly. Hooray!
2012-02-01 10:39:03 -05:00
Christopher Hartl
25d943f706
Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-01 10:32:11 -05:00
Ryan Poplin
056b24ccd6
Resolving merge conflicts with LocusIteratorByState
2012-01-31 16:13:32 -05:00
Ryan Poplin
febc634557
Changing PileupElement's isSoftClipped to isNextToSoftClip since soft clipped bases aren't actually added to pileups, oops. Removing the intrinsic clustered variants filter from the HaplotypeCaller
2012-01-31 16:06:14 -05:00
Matt Hanna
7f70612beb
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-31 11:59:25 -05:00
Matt Hanna
a630db1703
Oops...HierarchicalMicroScheduler was transforming any exception from the walker level into a ReviewedStingException.
...
Thanks to Ryan for pointing this out.
2012-01-31 11:58:21 -05:00
Christopher Hartl
faba3dd530
Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-31 10:25:29 -05:00
Mauricio Carneiro
17dbe9a95d
A few cleanups in the LocusIteratorByState
...
* No more N's in the extended event pileups
* Only add to the pileup MQ0 counter if the read actually goes into the pileup
2012-01-31 09:40:51 -05:00
Ryan Poplin
f9162ea705
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-30 19:45:19 -05:00
Ryan Poplin
abb91cf26b
Increasing the size of the active regions that are produced by the active probability integrator, more context is needed to call more complex events
2012-01-30 15:36:12 -05:00
Mauricio Carneiro
d5d4fa8a88
Fixed discordance bug reported by Brad Chapman
...
discordance now reports discordance between genotypes as well (just like concordance)
2012-01-30 09:50:45 -05:00
Mark DePristo
3164c8dee5
S3 upload now directly creates the XML report in memory and puts that in S3
...
-- This is a partial fix for the problem with uploading S3 logs reported by Mauricio. There the problem is that the java.io.tmpdir is not accessible (network just hangs). Because of that the s3 upload fails because the underlying system uses tmpdir for caching, etc. As far as I can tell there's no way around this bug -- you cannot overload the java.io.tmpdir programmatically and even if I could what value would we use? The only solution, it seems to me, is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.
2012-01-29 15:14:58 -05:00
Menachem Fromer
0e17cbbce9
Merged bug fix from Stable into Unstable
2012-01-27 16:03:16 -05:00
Menachem Fromer
a9671b73ca
Fix to permit proper handling of mapping qualities between 128 and 255 (which get converted to byte values of -128 to -1)
2012-01-27 16:01:30 -05:00
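As a sketch of the sign issue in the commit above (not the actual patch): a value from 128 to 255 stored in a signed 8-bit byte reads back as -128 to -1, and masking with 0xFF recovers the original value. The Python below only mimics the signed-byte storage used by Java's `byte` type; both function names are hypothetical.

```python
def to_signed_byte(value: int) -> int:
    """Mimic storing an int in the range 0..255 in a signed 8-bit byte."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one unsigned byte")
    return int.from_bytes(bytes([value]), byteorder="big", signed=True)


def mapping_quality_from_byte(stored: int) -> int:
    """Recover the unsigned mapping quality from its signed-byte form."""
    return stored & 0xFF


# Every legal mapping quality should survive the round trip.
round_trip_ok = all(
    mapping_quality_from_byte(to_signed_byte(mq)) == mq for mq in range(256)
)
```

Values below 128 pass through unchanged, so the mask is safe to apply to every stored mapping quality.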
Ryan Poplin
f7ac1f4a69
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-27 15:12:55 -05:00
Ryan Poplin
fc08235ff3
Bug fix in active region traversal: locusView.getNext() skips over pileups with zero coverage, but we still need to count them in the active probability integrator
2012-01-27 15:12:37 -05:00
Mark DePristo
0f2e8400b5
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-27 10:12:50 -05:00
Mauricio Carneiro
ec9920b04f
Updating the SAM TAG for Original Alignment Start to "OP"
...
per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.
2012-01-27 08:51:39 -05:00
Mark DePristo
13d1626f51
Minor improvements in ref QC walker. Unfortunately this doesn't actually catch Chris's error
2012-01-27 08:24:22 -05:00
Mauricio Carneiro
2a565ebf90
embarrassing fix-up, thanks Khalid.
2012-01-26 19:58:42 -05:00
Mauricio Carneiro
246e085ec9
Unit tests for GATKSAMRecord class
...
* new unit tests for the alignment shift properties of reduce reads
* moved into it the unit tests from ReadUtils that were actually testing GATKSAMRecord rather than ReadUtils.
* cleaned up ReadUtilsUnitTest
2012-01-26 17:06:36 -05:00
Mauricio Carneiro
0d4027104f
Reduced reads are now aware of their original alignments
...
* Added annotations for reads that had been soft clipped prior to being reduced so that we can later recover their original alignments (start and end).
* Tags keep the alignment shifts, not real alignment, for better compression
* Tags are defined in the GATKSAMRecord
* GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd()
* Updated ReduceReads MD5s accordingly
2012-01-26 17:06:36 -05:00
Eric Banks
07f72516ae
Unsupported platform should be a user error
2012-01-26 16:14:25 -05:00
Ryan Poplin
cdff23269d
HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close.
2012-01-26 15:56:33 -05:00
Christopher Hartl
673ceadd11
While this fix worked for the evaluator module, it could potentially have bad effects in the phasing walkers. Special-case nocalls in the PhasingEvaluator and return AllelePair to previous state.
2012-01-26 13:06:36 -05:00
Christopher Hartl
9c6fda7e15
Yup. I was right.
2012-01-26 12:54:11 -05:00
Christopher Hartl
7d059540a4
Allow segments of genome to be excluded in generating a reference panel. Occasionally targets would contain no variation (typically, in the middle of the centromere), which beagle doesn't particularly like, and errors out rather than producing empty output files. The best way to deal with these is to just exclude the regions on a second-pass, and the remaining bits will be gathered with no additional work.
...
AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but I suspect it's a no-call (".") rather than a no call ("./.").
2012-01-26 12:43:52 -05:00
Ryan Poplin
25532bdc37
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-26 11:43:32 -05:00
Ryan Poplin
390d493049
Updating ActiveRegionWalker interface to output a probability of active status instead of a boolean. Integrator runs a band-pass filter over this probability to produce actual active regions. First version of HaplotypeCaller which decides for itself where to trigger and assembles those regions.
2012-01-26 11:37:08 -05:00
Eric Banks
859dd882c9
Don't make it standard for now
2012-01-26 00:38:16 -05:00
Eric Banks
c5e81be978
Adding pairwise AF table. Not polished at all, but usable nonetheless.
2012-01-26 00:37:06 -05:00
Eric Banks
702a2d768f
Initial version of multi-allelic summary module in VariantEval
2012-01-25 19:42:55 -05:00
Eric Banks
9a60887567
Lost an import in the merge
2012-01-25 19:41:41 -05:00
Eric Banks
cba5f1a8b1
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-25 19:19:03 -05:00
Eric Banks
ddaf51a50f
Updated one integration test for indels
2012-01-25 19:18:51 -05:00
Eric Banks
add6918f32
Cleaner, more efficient way of determining the last dependent set in the queue.
2012-01-25 16:21:10 -05:00
Menachem Fromer
db645a94ca
Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES
2012-01-25 16:10:59 -05:00
Eric Banks
ef335a5812
Better implementation of the fix; PL index is now traversed in order.
2012-01-25 15:15:42 -05:00
Eric Banks
8e2d372ab0
Use remove instead of setting the value to null
2012-01-25 14:41:34 -05:00
Eric Banks
05816955aa
It was possible that we'd clean up a matrix column too early when a dependent column aborted early (with not enough probability mass) because we weren't being smart about the order in which we created dependencies. Fixed.
2012-01-25 14:28:21 -05:00
Eric Banks
2799a1b686
Catch exception for bad type and throw as a TribbleException
2012-01-25 12:15:51 -05:00
Eric Banks
96b62daff3
Minor tweak to the warning message.
2012-01-25 11:55:33 -05:00
Eric Banks
fb863dc6a7
Warn user when trying to run with EMIT_ALL_SITES with indels; better docs for that option.
2012-01-25 11:50:12 -05:00
Eric Banks
e349b4b14b
Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod.
2012-01-25 11:35:54 -05:00
Eric Banks
ea3d4d60f2
This annotation requires rods and should be annotated as such
2012-01-25 11:35:13 -05:00
Ryan Poplin
bbefe4a272
Added option to be able to write out the active regions to an interval list file
2012-01-25 09:47:06 -05:00