Mauricio Carneiro
e9ad382e74
unifying the BQSR argument collection
2012-03-05 10:48:26 -05:00
Ryan Poplin
f879daa7d0
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-05 08:29:08 -05:00
Ryan Poplin
d6871967ae
Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype there is no reason to ensure all haplotypes have the same length, which simplifies the code considerably.
2012-03-05 08:28:42 -05:00
Guillermo del Angel
3b5a7c34d7
Added argument to ValidationAmplicons to only output valid sequences - useful for not having to post-filter or grep resulting files before delivering downstream
2012-03-04 10:24:29 -05:00
Mark DePristo
69611af7d3
Workaround for bug in Picard in ReadGroupProperties
...
-- NPE caused when you call getRunDate on a read group without a date.
2012-03-02 18:53:45 -05:00
Mark DePristo
ba71b0aee4
ReadGroupProperties mk3
...
-- Includes sequencing date
2012-03-02 16:12:42 -05:00
Eric Banks
1e07e97b58
Optimization: create allele list just once, not for each genotype
2012-03-02 13:30:17 -05:00
Ryan Poplin
0ad7d5fbc1
Standalone common Pair HMM utility class with associated unit tests.
2012-03-01 22:41:13 -05:00
Mark DePristo
2f334a57c2
ReadGroupProperties mk2
...
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
2012-03-01 18:43:53 -05:00
Mauricio Carneiro
486712bfc2
ugly RG encoding
2012-03-01 17:56:45 -05:00
Mauricio Carneiro
29f74b658b
Unit tests for the context covariate
...
this is simple, but it's the infrastructure to start messing around with the context.
2012-03-01 17:56:45 -05:00
Mark DePristo
aff508e091
ReadGroupProperties walker and associated infrastructure
...
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses can override name more easily
2012-03-01 15:01:11 -05:00
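The bounded median collector described in the commit above could look roughly like the following. This is a minimal sketch, not the GATK class: the name `BoundedMedian`, its method signatures, and the choice of the upper-middle element for even counts are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the actual GATK Median utility.
class BoundedMedian<T extends Comparable<T>> {
    private final int maxValues;
    private final List<T> values = new ArrayList<>();

    BoundedMedian(int maxValues) {
        if (maxValues <= 0) throw new IllegalArgumentException("maxValues must be positive");
        this.maxValues = maxValues;
    }

    /** Stores the value unless the cap has been reached; returns true if it was stored. */
    boolean add(T value) {
        if (values.size() >= maxValues) return false;
        values.add(value);
        return true;
    }

    /** Returns the middle element of the sorted values, or defaultValue when nothing was added. */
    T getMedian(T defaultValue) {
        if (values.isEmpty()) return defaultValue;
        List<T> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2); // upper middle for even counts (assumption)
    }

    public static void main(String[] args) {
        BoundedMedian<Integer> readLengths = new BoundedMedian<>(3);
        for (int length : new int[] {76, 101, 36, 250}) readLengths.add(length); // 250 is ignored
        System.out.println(readLengths.getMedian(-1));
    }
}
```

Capping the number of stored elements keeps memory bounded per read group, at the cost of the median reflecting only the first elements seen.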
Mauricio Carneiro
9e95b10789
Context covariate now operates as a highly compressed bitset
...
* All contexts with 'N' bases are now collapsed as uninformative
* Context size is now represented internally as a BitSet but output as a dna string
* Temporarily disabled sorted outputs because of null objects
2012-02-29 19:25:21 -05:00
Mauricio Carneiro
d379c3763a
DNA Sequence to BitSet and vice-versa conversion tools
...
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
* Allows variable context size representation guaranteeing uniqueness.
* Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
* Unit Tests added
2012-02-29 19:25:20 -05:00
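One way to get a unique, length-aware encoding that tops out at 31 bases in a signed 64-bit long is to number all sequences in order of length, then in base-4 order within each length. The sketch below assumes that scheme; the class `DnaBitSetSketch` and its methods are hypothetical and are not taken from the commit.

```java
import java.util.BitSet;

// Hypothetical sketch; class and method names are illustrative, not GATK's.
class DnaBitSetSketch {
    private static final String BASES = "ACGT";
    static final int MAX_LENGTH = 31; // longest context whose index still fits in a signed long

    /** Maps a DNA string to a unique non-negative index; all shorter strings get smaller indices. */
    static long toIndex(String dna) {
        if (dna.length() > MAX_LENGTH) throw new IllegalArgumentException("context too long: " + dna.length());
        long shorter = 0; // number of sequences strictly shorter than dna
        long power = 1;   // 4^i
        long value = 0;   // dna read as a base-4 number
        for (int i = 0; i < dna.length(); i++) {
            int code = BASES.indexOf(dna.charAt(i));
            if (code < 0) throw new IllegalArgumentException("not A/C/G/T: " + dna.charAt(i));
            shorter += power;
            power *= 4;
            value = value * 4 + code;
        }
        return shorter + value;
    }

    /** Inverse of toIndex: recovers the length first, then the bases. */
    static String toDna(long index) {
        if (index < 0) throw new IllegalArgumentException("negative index");
        int length = 0;
        long power = 1;
        while (length < MAX_LENGTH && index >= power) {
            index -= power;
            power *= 4;
            length++;
        }
        if (index >= power) throw new IllegalArgumentException("index out of range");
        char[] bases = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            bases[i] = BASES.charAt((int) (index % 4));
            index /= 4;
        }
        return new String(bases);
    }

    static BitSet toBitSet(String dna) {
        return BitSet.valueOf(new long[] {toIndex(dna)});
    }

    static String toDna(BitSet bits) {
        long[] words = bits.toLongArray(); // empty when no bit is set
        return toDna(words.length == 0 ? 0L : words[0]);
    }

    public static void main(String[] args) {
        System.out.println(toDna(toBitSet("GATTACA")));
    }
}
```

Because the length is folded into the index, "A" and "AA" map to different values even though both are all zeros in a plain 2-bit-per-base packing.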
Eric Banks
129b5e7f6b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-28 10:09:34 -05:00
Eric Banks
a4a279ce80
Damn you, Mark
2012-02-28 10:09:09 -05:00
Khalid Shakir
0681bea5a5
Changed DoC from PartitionType.INTERVAL to PartitionType.NONE since it doesn't have a way to gather scattered outputs.
...
Added MultiallelicSummary to HSP eval.
2012-02-28 09:27:27 -05:00
Eric Banks
bd398e30fd
Another quick optimization
2012-02-28 09:25:35 -05:00
Eric Banks
40bdadbda5
Minor optimization as per Mark
2012-02-28 09:24:07 -05:00
Eric Banks
d7928ad669
Drat, missed one: handle null alleles being passed in.
2012-02-27 21:31:54 -05:00
Mark DePristo
24356f11b7
Merged bug fix from Stable into Unstable
...
-- Resolved conflict
Conflicts:
public/java/src/org/broadinstitute/sting/gatk/datasources/reads/SAMDataSource.java
2012-02-27 17:13:17 -05:00
Mark DePristo
0b29d54937
Changed most BAMSchedule ReviewedStingExceptions to UserExceptions
...
-- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.
2012-02-27 17:08:41 -05:00
Mark DePristo
f9e8e82e33
Removed unused class variable from VCFHeaderLineTranslator
2012-02-27 17:07:19 -05:00
Mark DePristo
100ddef930
Fix typo in VariantContextBuilder
2012-02-27 17:06:45 -05:00
Mark DePristo
ca0931c01f
Adding test for reading samtools VCF file
2012-02-27 17:05:50 -05:00
Eric Banks
bd944ab04f
Another test where we no longer print out 'NaN' for the AF.
2012-02-27 15:19:08 -05:00
Mark DePristo
5f7ccdcc01
Avoid calling getBasePileup when there's no pileup in NBaseCount annotation
2012-02-27 15:12:25 -05:00
Eric Banks
52871187d7
Adding integration test for file with no GTs. Also updated md5 for one other test (since we no longer print out 'NaN' for the AF).
2012-02-27 15:09:56 -05:00
Mark DePristo
729bb954e2
Throws ReviewedStingException for a bug when parent VariantContext argument is null
2012-02-27 15:09:00 -05:00
Eric Banks
998ed8fff3
Bug fix to deal with VCF records that don't have GTs. While in there, optimized a bunch of related functions (including removing a copy of the method calculateChromosomeCounts(); why did we have 2 copies? very dangerous).
2012-02-27 14:56:10 -05:00
Mark DePristo
4d9582de77
More general catching of Exceptions in interval reading to throw MalformedFile exception in all cases
...
-- Now throws UserException no matter what happens during the reading of the intervals file.
2012-02-27 14:02:26 -05:00
Mark DePristo
9712fed7a5
Trap SAMFormatException and rethrow as MalformatedBAM exception
...
-- Trap errors in header and rethrow
-- Wrap underlying iterator in MalformatedBAMErrorReformattingIterator
2012-02-27 13:52:50 -05:00
Eric Banks
1ea34058c2
Updating integration tests now that standard annotations support multiple alleles
2012-02-27 11:32:26 -05:00
Eric Banks
64754e7870
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-27 11:31:41 -05:00
Eric Banks
850c5d0db2
Enabling Rank Sum Tests for multi-allelics: use ref vs any alt allele.
2012-02-27 09:59:36 -05:00
Eric Banks
dfdf4f989b
Enabling Fisher Strand for multi-allelics: use the alt allele with max AC. Added minor optimization to the method in the VC.
2012-02-27 09:50:09 -05:00
Guillermo del Angel
16122bea8d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-25 13:57:54 -05:00
Guillermo del Angel
dea35943d1
a) Bug fix in calling new functions that give indel bases and length from regular pileup in LocusIteratorByState, b) Added unit test to cover these.
2012-02-25 13:57:28 -05:00
Mark DePristo
c8a06e53c1
DoC now properly handles reference N bases + misc. additional cleanups
...
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensure the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
2012-02-25 11:32:50 -05:00
Mark DePristo
50de1a3eab
Fixing bad VCFIntegration tests
...
-- Left disabled a test that should have been enabled
-- Didn't add the md5 to the test I actually added
-- Now VCFIntegrationTests should be working!
2012-02-25 11:26:36 -05:00
Guillermo del Angel
c9a4c74f7a
a) Bug fixes for last commit related to PileupElements (unit tests are forthcoming). b) Changes needed to make pool caller work in GENOTYPE_GIVEN_ALLELES mode c) Bug fix (yet again) for UG when GENOTYPE_GIVEN_ALLELES and EMIT_ALL_SITES are on, when there's no coverage at site and when input vcf has genotypes: output vcf would still inherit genotypes from input vcf. Now, we just build vc from scratch instead of initializing from input vc. We just take location and alleles from vc
2012-02-24 10:27:59 -05:00
Mauricio Carneiro
ee9a56ad27
Fix subtle bug in the ReduceReads stash reported by Adam
...
* The tailSet generated every time we flush the reads stash is still being affected by subsequent clears because it is just a pointer to the parent element in the original TreeSet. This is dangerous, and there is a weird condition where the clear will affect it.
* Fix by creating a new set, given the tailSet instead of trying to do magic with just the pointer.
2012-02-23 18:35:25 -05:00
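The behavior behind this fix is standard `java.util.TreeSet` semantics: `tailSet` returns a live view backed by the original set, so clearing the parent empties the view too. A minimal sketch of the pitfall and the copy-based fix follows; the class and method names are hypothetical and the element type is simplified to `Integer`.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of the view-versus-copy distinction; not the ReduceReads code.
class TailSetSketch {
    /** Returns an independent copy of all elements >= from, unaffected by later changes to stash. */
    static SortedSet<Integer> detachedTail(TreeSet<Integer> stash, int from) {
        return new TreeSet<>(stash.tailSet(from));
    }

    public static void main(String[] args) {
        TreeSet<Integer> stash = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5));
        SortedSet<Integer> view = stash.tailSet(3);       // live view onto stash
        SortedSet<Integer> copy = detachedTail(stash, 3); // independent set
        stash.clear();
        System.out.println(view.size() + " " + copy.size());
    }
}
```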
Mark DePristo
e0c189909f
Added support for breakpoint alleles
...
-- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Added integrationtest to ensure that we can parse and write out breakpoint example
2012-02-23 12:14:48 -05:00
Guillermo del Angel
6866a41914
Added functionality in pileups to not only determine whether there's an insertion or deletion following the current position, but to also get the indel length and involved bases - definitely needed for extended event removal, and needed for pool caller indel functionality.
2012-02-23 09:45:47 -05:00
Eric Banks
d34f07dba0
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-22 20:41:03 -05:00
Ryan Poplin
2b6c0939ab
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-22 19:00:38 -05:00
Ryan Poplin
8695738400
Bug fix in HaplotypeCaller's GENOTYPE_GIVEN_ALLELES mode for insertions greater than length 1. The allele being genotyped was off by one base pair.
2012-02-22 19:00:04 -05:00
Christopher Hartl
2c1b14d35e
Mostly small changes to my own scala scripts: .vcf.gz compatibility for output files, smarter beagle generation, simple script to scatter-gather combine variants. Whole genome indel calling now uses the gold standard indel set.
2012-02-22 17:20:04 -05:00
Mauricio Carneiro
75783af6fc
int <-> BitSet conversion utils for MathUtils
...
* added unit tests.
2012-02-21 14:10:36 -05:00
Guillermo del Angel
0f5674b95e
Redid fix for corner case when forming consensus with reads that start/end with insertions and that don't agree with each other in inserted bases: since I can't iterate over the elements of a HashMap because keys might change during iteration, and since I can't use ConcurrentHashMaps, the code now copies structure of (bases, number of times seen) into ArrayList, which can be addressed by element index in order to iterate on it.
2012-02-20 09:12:51 -05:00
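The snapshot approach described above can be illustrated as follows: copy the (bases, count) pairs into an `ArrayList`, walk the list by index, and mutate the map freely while doing so. This is a sketch only; the class name and the upper-casing rule used to trigger key changes are invented for the example and are not the consensus logic from the commit.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the key-rewriting rule is a stand-in for the real consensus logic.
class SnapshotIterationSketch {
    /** Upper-cases every key, summing the counts of keys that collide afterwards. */
    static void normalizeKeys(Map<String, Integer> counts) {
        // Snapshot the entries so the loop never touches the map's own iterator.
        List<Map.Entry<String, Integer>> snapshot = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            snapshot.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        }
        for (int i = 0; i < snapshot.size(); i++) {
            String key = snapshot.get(i).getKey();
            String upper = key.toUpperCase();
            if (!upper.equals(key)) {
                counts.remove(key); // safe: we are iterating the list, not the map
                counts.merge(upper, snapshot.get(i).getValue(), Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("acg", 2);
        counts.put("ACG", 3);
        counts.put("T", 1);
        normalizeKeys(counts);
        System.out.println(counts);
    }
}
```

When only removals are needed, `Iterator.remove()` on the map's entry set is a lighter alternative; the snapshot is useful when keys are also added or renamed during the pass.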
Ryan Poplin
3d9eee4942
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-18 10:55:29 -05:00
Ryan Poplin
a8be96f63d
This caching in the BQSR seems to be too slow now that there are so many keys
2012-02-18 10:54:39 -05:00
Ryan Poplin
78718b8d6a
Adding Genotype Given Alleles mode to the HaplotypeCaller. It constructs the possible haplotypes via assembly and then injects the desired allele to be genotyped.
2012-02-18 10:31:26 -05:00
Guillermo del Angel
e724c63f2b
Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a key from a HashMap that's being iterated on
2012-02-17 17:18:43 -05:00
Guillermo del Angel
f2ef8d1d23
Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a key from a HashMap that's being iterated on
2012-02-17 17:15:53 -05:00
Guillermo del Angel
3e031a540f
Solve merge conflict
2012-02-17 10:56:03 -05:00
Guillermo del Angel
cd352f502d
Corner case bug fix: if a read starts with an insertion, then when computing the consensus allele for calling, the insertion was only added to the last element in the consensus key hash map. Now, an insertion that partially overlaps several candidate alleles increases the count for all of them
2012-02-17 10:21:37 -05:00
Eric Banks
2f33c57060
No reason to restrict HaplotypeScore to bi-allelic SNPs when the plumbing for multi-allelic events is already present.
2012-02-16 13:58:00 -05:00
Guillermo del Angel
2f08846d82
Merged bug fix from Stable into Unstable
2012-02-14 21:26:25 -05:00
Guillermo del Angel
7dc6f73399
Bug fix for validation site selector: records with AC=0 in them were always being thrown out if input vcf was sites-only, even when -ignorePolymorphicStatus flag was set
2012-02-14 21:11:24 -05:00
Ryan Poplin
30085781cf
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-14 14:01:20 -05:00
Ryan Poplin
ae5b42c884
Put base insertion and base deletions in the SAMRecord as a string of quality scores instead of an array of bytes. Start of a proper genotype given alleles mode in HaplotypeCaller
2012-02-14 14:01:04 -05:00
David Roazen
85d31f80a2
Merged bug fix from Stable into Unstable
2012-02-13 16:37:11 -05:00
David Roazen
03e5184741
Fix serious engine bug that could cause reads to be dropped under certain circumstances
...
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.
Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.
Thanks to Khalid, who did at least as much work on this bug as I did.
2012-02-13 16:25:21 -05:00
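The property the fix restores is that a union must cover everything either input covers, however the spans overlap. The sketch below shows a generic interval union over half-open `[start, end)` ranges of `long` offsets. It is not `GATKBAMFileSpan.union()` itself; the class name and the use of plain `long[]` pairs in place of BAM chunks are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a generic interval union; not the GATK implementation.
class SpanUnionSketch {
    /** Merges half-open [start, end) spans from both lists into a sorted, non-overlapping union. */
    static List<long[]> union(List<long[]> a, List<long[]> b) {
        List<long[]> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingLong((long[] span) -> span[0]));
        List<long[]> merged = new ArrayList<>();
        for (long[] span : all) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] <= last[1]) {
                // Overlapping or adjacent: extend the previous span, never shrink it.
                last[1] = Math.max(last[1], span[1]);
            } else {
                merged.add(new long[] {span[0], span[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<long[]> result = union(
                Arrays.asList(new long[] {0, 10}, new long[] {20, 30}),
                Arrays.asList(new long[] {5, 25}, new long[] {40, 50}));
        for (long[] span : result) System.out.println(span[0] + "-" + span[1]);
    }
}
```

Taking the maximum of the two end points is the step that prevents truncation when one span is fully contained in another.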
Eric Banks
ad90af94ed
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-13 15:10:10 -05:00
Eric Banks
0920a1921e
Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics.
2012-02-13 15:09:53 -05:00
Eric Banks
14981bed10
Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records.
2012-02-13 14:32:03 -05:00
Ryan Poplin
e9338e2c20
Context covariate needs to look in the reverse direction for negative stranded reads.
2012-02-13 13:40:41 -05:00
Ryan Poplin
41ffd08d53
On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live.
2012-02-13 12:35:09 -05:00
Ryan Poplin
3caa1b83bb
Updating HC integration tests
2012-02-11 11:48:32 -05:00
Ryan Poplin
9b8fd4c2ff
Updating the half of the code that makes use of the recalibration information to work with the new refactoring of the bqsr. Reverting the covariate interface change in the original bqsr because the error model enum was moved to a different class and didn't make sense any more.
2012-02-11 10:57:20 -05:00
Eric Banks
f52f1f659f
Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1.
2012-02-10 14:15:59 -05:00
Mauricio Carneiro
1fb19a0f98
Moving the covariates and shared functionality to public
...
so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.
2012-02-10 11:44:01 -05:00
Eric Banks
5e18020a5f
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-10 11:08:33 -05:00
Eric Banks
f53cd3de1b
Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent.
2012-02-10 11:07:32 -05:00
Mauricio Carneiro
5af373a3a1
BQSR with indels integrated!
...
* added support to base before deletion in the pileup
* refactored covariates to operate on mismatches, insertions and deletions at the same time
* all code is in private so original BQSR is still working as usual in public
* outputs a molten CSV with mismatches, insertions and deletions, time to play!
* barely tested, passes my very simple tests... haven't tested edge cases.
2012-02-09 18:46:45 -05:00
Eric Banks
7a937dd1eb
Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly.
2012-02-09 16:14:22 -05:00
Eric Banks
0f728a0604
The Exact model now subsets the VC to the first N alleles when the VC contains more than the maximum number of alleles (instead of throwing it out completely as it did previously). [Perhaps the culling should be done by the UG engine? But theoretically the Exact model can be called outside of the UG and we'd still want the context subsetted.]
2012-02-09 14:02:34 -05:00
Matt Hanna
aa097a83d5
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-09 11:26:48 -05:00
Matt Hanna
b57d4250bf
Documentation request by Eric. At each stage of the GATK where filtering occurs, added documentation suggesting the goal of the filtering along with examples of suggested inputs and outputs.
2012-02-09 11:24:52 -05:00
Mauricio Carneiro
d561914d4f
Revert "First implementation of GATKReportGatherer"
...
premature push on my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.
This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
2012-02-08 23:28:55 -05:00
Eric Banks
2f800b078c
Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again.
2012-02-08 15:27:16 -05:00
Matt Hanna
51ac87b28c
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-08 08:43:55 -05:00
Matt Hanna
5b58fe741a
Retiring Picard customizations for async I/O and cleaning up parts of the code to use common Picard utilities I recently discovered.
...
Also embedded bug fix for issues reading sparse shards and did some cleanup based on comments during BAM reading code transition meetings.
2012-02-08 08:34:37 -05:00
Mauricio Carneiro
337819e791
disabling the test while we fix it
2012-02-07 19:22:32 -05:00
Roger Zurawicki
c0c676590b
First implementation of GATKReportGatherer
...
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG
Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public
The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer from working
Working GATKReportGatherer
Has only the functionality to addLines
The input file parser assumes that the first column is the primary key
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-02-07 18:14:47 -05:00
Mauricio Carneiro
e89887cd8e
laying groundwork to have insertions and deletions going through the system.
2012-02-07 18:11:53 -05:00
Mauricio Carneiro
0d3ea0401c
BQSR Parameter cleanup
...
* get rid of 320C argument that nobody uses.
* get rid of DEFAULT_READ_GROUP parameter and functionality (later to become an engine argument).
2012-02-07 14:42:11 -05:00
Eric Banks
717cd4b912
Document -L unmapped
2012-02-07 13:30:54 -05:00
Eric Banks
718da7757e
Fixes to ValidateVariants as per GS post: ref base of mixed alleles were sometimes wrong, error print out of bad ACs was throwing a RuntimeException, don't validate ACs if there are no genotypes.
2012-02-07 13:15:58 -05:00
Eric Banks
9d1a19bbaa
Multi-allelic indels were not being printed out correctly in VariantsToTable; fixed.
2012-02-06 22:49:29 -05:00
Mauricio Carneiro
5961868a7f
fixup for BQSR (HC integration tests)
...
In the new BQSR implementation, covariates do depend on the RecalibrationArgumentCollection.
2012-02-06 22:47:27 -05:00
Mauricio Carneiro
6e6f0f10e1
BaseQualityScoreRecalibration walker (bqsr v2) first commit includes
...
* Adding the context covariate standard in both modes (including old CountCovariates) with parameters
* Updating all covariates and modules to use GATKSAMRecord throughout the code.
* BQSR now processes indels in the pileup (but doesn't do anything with them yet)
2012-02-06 17:38:29 -05:00
Eric Banks
0717c79901
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-06 16:23:36 -05:00
Eric Banks
91897f5fe7
Transpose rows/cols in AF table to make it molten (so I can plot easily in R)
2012-02-06 16:23:32 -05:00
Guillermo del Angel
fb5786385c
Merged bug fix from Stable into Unstable
2012-02-06 13:22:56 -05:00
Guillermo del Angel
6ec686b877
Complement to previous commit: make sure we also don't inherit filter from input VCF when genotyping at an empty site
2012-02-06 13:19:26 -05:00
Guillermo del Angel
93ffca1e3a
Merged bug fix from Stable into Unstable
2012-02-06 11:58:58 -05:00
Guillermo del Angel
827be878b4
Bug fix when running UG in GenotypeGivenAlleles mode: if an input site to genotype had no coverage, the output VCF had AC, AF and AN inherited from input VCF, which could have nothing to do with given BAM so numbers could be nonsensical. Now new vc has clear attributes instead of attributes inherited from input VCF.
2012-02-06 11:58:13 -05:00
Eric Banks
fbbd04621d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-06 11:53:31 -05:00
Eric Banks
edb4edc08f
Commented out unused metrics for now
2012-02-06 11:53:15 -05:00
Ryan Poplin
096c23a473
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-06 11:10:38 -05:00
Ryan Poplin
dc05b71e39
Updating Covariate interface with Mauricio to include an errorModel parameter. On the fly recalibration of base insertion and base deletion quals is live for the HaplotypeCaller
2012-02-06 11:10:24 -05:00
Guillermo del Angel
1e11408f8b
Merged bug fix from Stable into Unstable
2012-02-06 10:34:26 -05:00
Guillermo del Angel
090d87b48b
Bug fix in ValidationSiteSelector: when input vcf had genotypes and was multiallelic, the parsing of the AF/AC fields was wrong. Better logic to unify parsing of field
2012-02-06 10:33:12 -05:00
Eric Banks
9d94f310f1
Break AF histogram into max and min AFs
2012-02-06 09:01:19 -05:00
Ryan Poplin
b7ffd144e8
Cleaning up the covariate classes and removing unused code from the bqsr optimizations in 2009.
2012-02-06 08:54:42 -05:00
Eric Banks
cef550903e
Minor optimization
2012-02-06 00:48:00 -05:00
Ryan Poplin
5343f8ba67
Initial version of on-the-fly, lazy loading base quality score recalibration. It isn't completely hooked up yet but I'm committing so Mauricio and Mark can see how I envision it will fit together. Look it over and give any feedback. With the exception of the Solid specific code we are very very close to being able to remove TableRecalibrationWalker from the code base and just replace it with PrintReads -BQSR recal.csv
2012-02-05 13:09:03 -05:00
Ryan Poplin
f94d547e97
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-03 17:14:20 -05:00
Ryan Poplin
894d3340be
Active Region Traversal should use GATKSAMRecords everywhere instead of SAMRecords. misc cleanup.
2012-02-03 17:13:52 -05:00
Mauricio Carneiro
4a57add6d0
First implementation of DiagnoseTargets
...
* calculates and interprets the coverage of a given interval track
* allows expanding intervals by a specified number of bases
* classifies targets as CALLABLE, LOW_COVERAGE, EXCESSIVE_COVERAGE and POOR_QUALITY.
* outputs text file for now (testing purposes only), soon to be VCF.
* filters are overly aggressive for now.
2012-02-03 17:12:43 -05:00
Mauricio Carneiro
3dd6a1f962
Adding some generic sum and average functions to MathUtils
2012-02-03 17:12:43 -05:00
Mauricio Carneiro
e1d69e4060
make the size of a GenomeLoc int instead of long
...
it will never be bigger than an int and it's actually useful to be an int so we can use it as parameters to array/list/hash size creation.
2012-02-03 17:12:42 -05:00
Ryan Poplin
0e44430e47
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-03 13:45:11 -05:00
Christopher Hartl
aa3638ecb3
Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-03 13:42:09 -05:00
Eric Banks
3abfbcbcf2
Generalized the TDT for multi-allelic events
2012-02-03 12:23:21 -05:00
Ryan Poplin
601e53d633
Fix when specifying preset active regions with -AR argument
2012-02-02 16:34:26 -05:00
Christopher Hartl
0111505ea9
Terrible. Swapping the paternal and sample ids.
2012-02-02 11:41:16 -05:00
Ryan Poplin
1f50f6970b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-02 10:17:13 -05:00
Ryan Poplin
4ed06801a7
Updating HaplotypeCaller's HMM calc to use GOP as a function of the read instead of a function of the haplotype in preparation for IQSR
2012-02-02 10:17:04 -05:00
Matt Hanna
8adfc79123
Merged bug fix from Stable into Unstable
2012-02-01 16:07:41 -05:00
Matt Hanna
30b937d2af
Fix bug discovered in FGTP branch in which BlockInputStream returns -1 in cases where some data could be read,
...
but not all the data requested by the caller.
2012-02-01 16:06:22 -05:00
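The read contract at issue is the usual `java.io.InputStream` one: return the number of bytes actually read when some data is available, and -1 only when nothing at all could be read because the stream is exhausted. A minimal sketch of that contract follows; `PartialReadSketch` and its methods are hypothetical and are not the BlockInputStream code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch of the partial-read contract; not the GATK BlockInputStream.
class PartialReadSketch {
    /** Reads up to length bytes; returns the count read, or -1 only if no byte could be read. */
    static int read(InputStream in, byte[] buffer, int offset, int length) throws IOException {
        int total = 0;
        while (total < length) {
            int n = in.read(buffer, offset + total, length - total);
            if (n < 0) break; // underlying stream is exhausted
            total += n;
        }
        return (total == 0 && length > 0) ? -1 : total;
    }

    /** Requests 5 bytes twice from a 3-byte stream and returns both results. */
    static int[] demo() {
        try {
            InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3});
            byte[] buffer = new byte[5];
            return new int[] {read(in, buffer, 0, 5), read(in, buffer, 0, 5)};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] results = demo();
        System.out.println(results[0] + " " + results[1]);
    }
}
```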
Mauricio Carneiro
45da892ecc
Better exceptions to catch malformed reads
...
* throw exceptions in LocusIteratorByState when hitting reads starting or ending with deletions
2012-02-01 11:56:19 -05:00
Christopher Hartl
810996cfca
Introducing: VariantsToPed, the world's most annoying walker! And also a busted QScript to run it that I need Khalid's help debugging ( frownie face ). Note that VariantsToPed and PlinkSeq generate the same binary file (up to strand flips...thanks PlinkSeq), so I know it's working properly. Hooray!
2012-02-01 10:39:03 -05:00
Christopher Hartl
25d943f706
Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-01 10:32:11 -05:00
Ryan Poplin
056b24ccd6
Resolving merge conflicts with LocusIteratorByState
2012-01-31 16:13:32 -05:00
Ryan Poplin
febc634557
Changing PileupElement's isSoftClipped to isNextToSoftClip since soft clipped bases aren't actually added to pileups, oops. Removing the intrinsic clustered variants filter from the HaplotypeCaller
2012-01-31 16:06:14 -05:00
Matt Hanna
7f70612beb
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-31 11:59:25 -05:00
Matt Hanna
a630db1703
Oops...HierarchicalMicroScheduler was transforming any exception from the walker level into a ReviewedStingException.
...
Thanks to Ryan for pointing this out.
2012-01-31 11:58:21 -05:00
Christopher Hartl
faba3dd530
Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-31 10:25:29 -05:00
Mauricio Carneiro
17dbe9a95d
A few cleanups in the LocusIteratorByState
...
* No more N's in the extended event pileups
* Only add to the pileup MQ0 counter if the read actually goes into the pileup
2012-01-31 09:40:51 -05:00
Ryan Poplin
f9162ea705
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-30 19:45:19 -05:00
Ryan Poplin
abb91cf26b
Increasing the size of the active regions that are produced by the active probability integrator, more context is needed to call more complex events
2012-01-30 15:36:12 -05:00
Mauricio Carneiro
d5d4fa8a88
Fixed discordance bug reported by Brad Chapman
...
discordance now reports discordance between genotypes as well (just like concordance)
2012-01-30 09:50:45 -05:00
Mark DePristo
3164c8dee5
S3 upload now directly creates the XML report in memory and puts that in S3
...
-- This is a partial fix for the problem with uploading S3 logs reported by Mauricio. There the problem is that the java.io.tmpdir is not accessible (network just hangs). Because of that the s3 upload fails because the underlying system uses tmpdir for caching, etc. As far as I can tell there's no way around this bug -- you cannot overload the java.io.tmpdir programmatically and even if I could what value would we use? The only solution, it seems to me, is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.
2012-01-29 15:14:58 -05:00
Menachem Fromer
0e17cbbce9
Merged bug fix from Stable into Unstable
2012-01-27 16:03:16 -05:00
Menachem Fromer
a9671b73ca
Fix to permit proper handling of mapping qualities between 128 and 255 (which get converted to byte values of -128 to -1)
2012-01-27 16:01:30 -05:00
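Java's `byte` is signed, so a quality of 200 stored in a byte reads back as -56 unless it is masked. The usual remedy is `value & 0xFF`. The sketch below shows that conversion; the class and method names are hypothetical, and whether the commit used this exact expression is an assumption.

```java
// Hypothetical sketch of signed-byte handling; not the code from the commit.
class MappingQualitySketch {
    /** Interprets a stored byte as an unsigned quality in the range 0..255. */
    static int toUnsignedQuality(byte raw) {
        return raw & 0xFF;
    }

    public static void main(String[] args) {
        byte stored = (byte) 200;                      // wraps to a negative byte value
        System.out.println(stored);                    // signed view
        System.out.println(toUnsignedQuality(stored)); // unsigned view
    }
}
```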
Ryan Poplin
f7ac1f4a69
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-27 15:12:55 -05:00
Ryan Poplin
fc08235ff3
Bug fix in active region traversal, locusView.getNext() skips over pileups with zero coverage but still need to count them in the active probability integrator
2012-01-27 15:12:37 -05:00
Mark DePristo
0f2e8400b5
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-27 10:12:50 -05:00
Mauricio Carneiro
ec9920b04f
Updating the SAM TAG for Original Alignment Start to "OP"
...
per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.
2012-01-27 08:51:39 -05:00
Mark DePristo
13d1626f51
Minor improvements in ref QC walker. Unfortunately this doesn't actually catch Chris's error
2012-01-27 08:24:22 -05:00
Mauricio Carneiro
2a565ebf90
embarrassing fix-up, thanks Khalid.
2012-01-26 19:58:42 -05:00
Mauricio Carneiro
246e085ec9
Unit tests for GATKSAMRecord class
...
* new unit tests for the alignment shift properties of reduce reads
* moved unit tests out of ReadUtils that were actually testing GATKSAMRecord rather than any of the ReadUtils.
* cleaned up ReadUtilsUnitTest
2012-01-26 17:06:36 -05:00
Mauricio Carneiro
0d4027104f
Reduced reads are now aware of their original alignments
...
* Added annotations for reads that had been soft clipped prior to being reduced so that we can later recuperate their original alignments (start and end).
* Tags keep the alignment shifts, not real alignment, for better compression
* Tags are defined in the GATKSAMRecord
* GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd()
* Updated ReduceReads MD5s accordingly
2012-01-26 17:06:36 -05:00
Eric Banks
07f72516ae
Unsupported platform should be a user error
2012-01-26 16:14:25 -05:00
Ryan Poplin
cdff23269d
HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close.
2012-01-26 15:56:33 -05:00
Christopher Hartl
673ceadd11
While this fix worked for the evaluator module, it could potentially have bad effects in the phasing walkers. Special-case nocalls in the PhasingEvaluator and return AllelePair to previous state.
2012-01-26 13:06:36 -05:00
Christopher Hartl
9c6fda7e15
Yup. I was right.
2012-01-26 12:54:11 -05:00
Christopher Hartl
7d059540a4
Allow segments of genome to be excluded in generating a reference panel. Occasionally targets would contain no variation (typically, in the middle of the centromere), which beagle doesn't particularly like, and errors out rather than producing empty output files. The best way to deal with these is to just exclude the regions on a second-pass, and the remaining bits will be gathered with no additional work.
...
AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but I suspect it's a no-call (".") rather than a no call ("./.").
2012-01-26 12:43:52 -05:00
Ryan Poplin
25532bdc37
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-26 11:43:32 -05:00
Ryan Poplin
390d493049
Updating ActiveRegionWalker interface to output a probability of active status instead of a boolean. Integrator runs a band-pass filter over this probability to produce actual active regions. First version of HaplotypeCaller which decides for itself where to trigger and assembles those regions.
2012-01-26 11:37:08 -05:00
Eric Banks
859dd882c9
Don't make it standard for now
2012-01-26 00:38:16 -05:00
Eric Banks
c5e81be978
Adding pairwise AF table. Not polished at all, but usable nonetheless.
2012-01-26 00:37:06 -05:00
Eric Banks
702a2d768f
Initial version of multi-allelic summary module in VariantEval
2012-01-25 19:42:55 -05:00
Eric Banks
9a60887567
Lost an import in the merge
2012-01-25 19:41:41 -05:00
Eric Banks
cba5f1a8b1
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-25 19:19:03 -05:00
Eric Banks
ddaf51a50f
Updated one integration test for indels
2012-01-25 19:18:51 -05:00
Eric Banks
add6918f32
Cleaner, more efficient way of determining the last dependent set in the queue.
2012-01-25 16:21:10 -05:00
Menachem Fromer
db645a94ca
Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES
2012-01-25 16:10:59 -05:00
Eric Banks
ef335a5812
Better implementation of the fix; PL index is now traversed in order.
2012-01-25 15:15:42 -05:00
Eric Banks
8e2d372ab0
Use remove instead of setting the value to null
2012-01-25 14:41:34 -05:00
Eric Banks
05816955aa
It was possible that we'd clean up a matrix column too early when a dependent column aborted early (with not enough probability mass) because we weren't being smart about the order in which we created dependencies. Fixed.
2012-01-25 14:28:21 -05:00
Eric Banks
2799a1b686
Catch exception for bad type and throw as a TribbleException
2012-01-25 12:15:51 -05:00
Eric Banks
96b62daff3
Minor tweak to the warning message.
2012-01-25 11:55:33 -05:00
Eric Banks
fb863dc6a7
Warn user when trying to run with EMIT_ALL_SITES with indels; better docs for that option.
2012-01-25 11:50:12 -05:00
Eric Banks
e349b4b14b
Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod.
2012-01-25 11:35:54 -05:00
Eric Banks
ea3d4d60f2
This annotation requires rods and should be annotated as such
2012-01-25 11:35:13 -05:00
Ryan Poplin
bbefe4a272
Added option to be able to write out the active regions to an interval list file
2012-01-25 09:47:06 -05:00
Ryan Poplin
9818c69df6
Can now specify active regions to process at the command line, mainly for debugging purposes
2012-01-25 09:32:52 -05:00
Mauricio Carneiro
ffd61f4c1c
Refactor the Pileup Element with regards to indels
...
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.
* Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
* Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
* Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
* Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
* Every deletion now has an offset (start of the event)
* Fixed bug when LocusIteratorByState found a read starting with insertion that happened to be a reduced read.
* Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
* Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
* Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
* The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!
phew!
2012-01-24 16:07:21 -05:00
Matt Hanna
c312bd5960
Weirdly, PicardException inherits from SAMException, which means that our specialty code for
...
reporting malformed BAMs was actually misreporting any error that happened in the Picard layer
as a BAM ERROR.
Specifically changing PicardException to report as a ReviewedStingException; we might want to
change it in the future. I'll followup with the Picard team to make sure they really, really
want PicardException to inherit from SAMException.
2012-01-24 15:30:04 -05:00
Mark DePristo
0a3172a9f1
Fix for ref 0 bases for Chris
...
-- Disturbingly, fixing this bug doesn't actually cause any test failures.
-- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly.
-- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt
2012-01-24 10:55:09 -05:00
Khalid Shakir
c18beadbdb
Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
...
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Mark DePristo
02450e4b12
Merged bug fix from Stable into Unstable
2012-01-23 12:08:39 -05:00
Christopher Hartl
798596257b
Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module.
2012-01-23 10:50:16 -05:00
Mark DePristo
80a4ce0edf
Bugfix for incorrect error messages for missing BAMs and VCFs
...
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
Guillermo del Angel
31d2f04368
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-23 09:23:03 -05:00
Guillermo del Angel
966387ca0b
Next intermediate commit in the pool caller. Lots of bug fixes and now we can emit true vcf's with calls in discovery mode (still of unknown quality) - old validation mode is temporarily broken, will be fixed in next refactoring.
2012-01-23 09:22:31 -05:00
Christopher Hartl
4a08e8ca6e
Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down either to the accounting of genotypes changed (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./. .)
...
This is somewhat of an arbitrary decision, and is negotiable. I could see treating
GT:PL ./.:.
differently from
GT:PL .:0,3,6
but am not sure the worth of doing so.
2012-01-23 08:25:34 -05:00
Ryan Poplin
4d6312d4ea
HaplotypeCaller is now an ActiveRegionWalker.
2012-01-22 14:31:01 -05:00
Christopher Hartl
3b1aad4f17
After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call.
2012-01-20 23:43:51 -05:00
Christopher Hartl
9b4f6afa21
Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles appear to be set to null, so this should not affect the actual phasing in the output VCF.
2012-01-20 23:07:59 -05:00
Ryan Poplin
4b18786b5d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-19 22:05:20 -05:00
Ryan Poplin
ace9333068
Active region walkers can now see the reads in a buffer around their active regions. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal.
2012-01-19 22:05:08 -05:00
Menachem Fromer
066da80a3d
Added KEEP_UNCONDITIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output
2012-01-19 18:19:58 -05:00
Christopher Hartl
7f3ad25b01
Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF.
...
T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.
2012-01-19 10:54:48 -05:00
Ryan Poplin
7e082c7750
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-19 09:11:23 -05:00
Eric Banks
ab8f499bc3
Annotate with FS even for filtered sites
2012-01-18 22:04:51 -05:00
Guillermo del Angel
b123416c4c
Resolve stale merge changes
2012-01-18 20:56:36 -05:00
Guillermo del Angel
2eb45340e1
Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine optimal alt allele at each computable site and will compute AC distribution on it. Current implementation is not working yet if there's more than one pool and it will only output biallelic sites, no functionality for true multi-allelics yet
2012-01-18 20:54:10 -05:00
Ryan Poplin
0268da7560
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-18 09:53:00 -05:00
Ryan Poplin
60024e0d7b
updating TDT integration test
2012-01-18 09:52:50 -05:00
Ryan Poplin
11982b5a34
We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN.
2012-01-18 09:42:41 -05:00
Mark DePristo
763c81d520
No longer enforce MAX_ALLELE_SIZE in VCF codec
...
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()
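The allocation point above can be illustrated with a minimal sketch. The class and method names below are hypothetical and this is not the GATK's computeReverseClipping(); it only shows a backwards scan that reads each reference base with charAt(i) instead of calling getBytes() (which copies the whole string) on every iteration.

```java
final class ReverseClipSketch {
    // Hypothetical helper: counts how many trailing bases ref and alt share.
    // Each base is read with charAt, so no byte[] copy of the string is made
    // inside the loop (ref.getBytes()[i] would allocate a new array per call).
    static int sharedSuffixLength(String ref, String alt) {
        int clipped = 0;
        int i = ref.length() - 1;
        int j = alt.length() - 1;
        while (i >= 0 && j >= 0) {
            byte refBase = (byte) ref.charAt(i);
            byte altBase = (byte) alt.charAt(j);
            if (refBase != altBase) {
                break;
            }
            clipped++;
            i--;
            j--;
        }
        return clipped;
    }
}
```

The cast to byte assumes the allele strings contain only single-byte characters such as A, C, G, T and N.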
2012-01-18 07:35:11 -05:00
Mark DePristo
0c7865fdb5
UnitTest for reverseAlleleClipping
...
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mark DePristo
62801e430a
Bugfix for unnecessary optimization
...
-- don't cache the ref bytes
2012-01-17 16:40:26 -05:00
Mark DePristo
f2b0575dee
Detect unreasonably large allele strings (>2^16) and throw an error
...
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
2012-01-17 16:40:26 -05:00
Ryan Poplin
8b0ddf0aaf
Adding notes to CountCovariates docs about using interval lists as database of known variation
2012-01-17 16:13:13 -05:00
Matt Hanna
40ebc17437
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-17 14:49:17 -05:00
Matt Hanna
41d70abe4e
At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings.
2012-01-17 14:47:53 -05:00
Ryan Poplin
ae259f81cc
Bug fixing for merging of read fragments when one fragment contained an indel
2012-01-17 14:39:27 -05:00
Christopher Hartl
cde224746f
Bait Redesign supports baits that overlap, by picking only the start of intervals.
...
CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used.
Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.
2012-01-17 13:51:05 -05:00
Ryan Poplin
8e23c98dd9
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-17 13:46:28 -05:00
Matt Hanna
32ccde374b
Merged bug fix from Stable into Unstable
2012-01-17 11:08:35 -05:00
Matt Hanna
3ba918aff1
Error message cleanup in BAM indexing code.
2012-01-17 11:05:42 -05:00
Mauricio Carneiro
cec7107762
Better location for the downsampling of reads in PrintReads
...
* using the filter() instead of map() makes for a cleaner walker.
* renaming the unit tests to make more sense with the other unit and integration tests
2012-01-14 14:06:09 -05:00
Mark DePristo
b06074d6e7
Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe
...
-- Uses PriorityBlockingQueue instead of PriorityQueue
-- synchronized keywords added to all key functions that modify internal state
Note that this hasn't been tested extensively. Based on report:
http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic
2012-01-13 09:33:16 -05:00
Mauricio Carneiro
28aa353501
Added "unbiased" downsampling parameter to PrintReads
...
* also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.
2012-01-12 16:33:55 -05:00
Matt Hanna
2c3176eb80
Merged bug fix from Stable into Unstable
2012-01-12 13:31:10 -05:00
Matt Hanna
cd43f016ce
Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior.
2012-01-12 13:29:11 -05:00
Eric Banks
ed34b4f088
Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-12 10:27:26 -05:00
Eric Banks
e7fe9910f7
Create the temp storage for calculating cell values just once as per Mark's TODO
2012-01-12 10:27:10 -05:00
Eric Banks
f5f5ed5dcd
Don't initialize the cell conformation values (use an else in the loop instead) as per Mark's TODO
2012-01-12 08:50:03 -05:00
Eric Banks
410a340ef5
Swapping the iteration order to run over AF conformations and then samples instead of the reverse minimizes calls to HashMap.get; instead of it being O(n) since we called it for each sample it's now O(1). Runtime on T2D GENES test set is reduced by 5-10%. More optimizations to follow.
2012-01-12 02:04:03 -05:00
Mauricio Carneiro
77a03c9709
Patching special case in the adaptor clipping
...
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
* updated md5's accordingly
2012-01-11 17:47:44 -05:00
Eric Banks
25d0d53d88
Moving the approximate summing of log10 vals to MathUtils; keeping the more efficient implementation of fast rounding.
2012-01-10 12:38:47 -05:00
Eric Banks
589397d611
Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-10 12:36:48 -05:00
Eric Banks
c5320ef1af
Resolving changes in integration test during merge
2012-01-10 12:14:16 -05:00
Matt Hanna
e923a2e512
Revving Picard to incorporate final version of ReadWalker performance improvements.
2012-01-10 12:12:33 -05:00
Eric Banks
0f36f6947e
Resolving merge conflicts
2012-01-10 11:44:16 -05:00
Eric Banks
f2cecce10f
Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously).
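For readers unfamiliar with the operation, here is a minimal sketch of approximately summing an array of log10 values. It is an illustration only, not the implementation from this commit: the class name, method name, and the MAX_TOLERANCE cutoff are assumptions, and the rounding trick the commit mentions is not reproduced. The idea is to factor out the maximum and skip terms too small to change the result.

```java
final class Log10SumSketch {
    // Assumed cutoff: terms more than 8 orders of magnitude below the
    // maximum are ignored, which is where the approximation comes from.
    static final double MAX_TOLERANCE = 8.0;

    // Returns approximately log10(sum_i 10^vals[i]).
    static double approximateLog10SumLog10(double[] vals) {
        if (vals.length == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        double max = vals[0];
        for (double v : vals) {
            if (v > max) {
                max = v;
            }
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return max; // every term is log10(0)
        }
        double sum = 0.0;
        for (double v : vals) {
            double diff = v - max;
            if (diff < -MAX_TOLERANCE) {
                continue; // negligible contribution, skip the pow() call
            }
            sum += Math.pow(10.0, diff);
        }
        return max + Math.log10(sum);
    }
}
```

Factoring out the maximum keeps every exponent at or below zero, so the intermediate sum cannot overflow even when the inputs are very negative log-likelihoods.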
2012-01-10 11:34:23 -05:00
Matt Hanna
509c3d87b0
Merged bug fix from Stable into Unstable
2012-01-09 23:08:46 -05:00
Matt Hanna
dc60757b68
Eliminate unnecessary strong references (and therefore memory held) by tree reduce entries that have already been processed.
...
Thanks to Tim Fennell for the bug report.
2012-01-09 23:04:53 -05:00
Matt Hanna
fda1795791
Merged bug fix from Stable into Unstable
2012-01-08 22:04:44 -05:00
Matt Hanna
1f1233b669
Fix for a rare but insidious bug in position tracking during async BAM file reading.
...
Thanks to Khalid for spotting and reporting the issue.
2012-01-08 22:03:35 -05:00
Khalid Shakir
5793625592
No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice.
...
QScript accessor to QSettings to specify a default runName and other default function settings.
Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs.
Gathered log files concatenate all log files together into the stdout.
InProcessFunctions now have PrintStreams for stdout and stderr.
Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml.
During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file.
In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope.
Added more detailed output when running with -l DEBUG.
Simplified graphviz visualization for additional debugging.
Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List)
Minor cleanup to build including sending ant gsalib to R's default libloc.
2012-01-08 12:11:55 -05:00
Guillermo del Angel
d4e7655d14
Added ability to call multiallelic indels, if -multiallelic is included in UG arguments. Simple idea: we genotype all alleles with count >= minIndelCnt.
...
To support this, refactored code that computes consensus alleles. To ease merging of multiple alt alleles, we create a single vc for each alt allele and then use VariantContextUtils.simpleMerge to carry out merging, which takes care of handling all corner conditions already. In order to use this, interface to GenotypeLikelihoodsCalculationModel changed to pass in a GenomeLocParser object (why are these objects so hard to handle??).
More testing is required and feature turned off by default.
2012-01-06 11:24:38 -05:00
Ryan Poplin
616ff8ea01
fixed typo in help text
2012-01-06 10:36:11 -05:00
Mark DePristo
dd80ffbbbe
Merged bug fix from Stable into Unstable
2012-01-05 21:51:48 -05:00
Mark DePristo
c96fee477c
Bug fix for VariantSummary
...
-- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional. Fixed. Has the annoying side effect that indels > 50bp in size are not counted as indels, and so are subtracted from both the novel and known counts for indels. C'est la vie
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels. Using this more recent and representative file is probably a good idea for more future tests in VE and other tools. File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data
2012-01-05 21:51:06 -05:00
Eric Banks
f5e10e9879
Merged bug fix from Stable into Unstable
2012-01-05 15:35:09 -05:00
Eric Banks
18ed954741
Compute Ti/Tv only if bi-allelic
2012-01-05 15:33:26 -05:00
Ryan Poplin
a6886a4cc0
Initial commit of the Active Region Traversal. Not ready to be used by anyone yet.
2012-01-04 17:03:21 -05:00
Guillermo del Angel
58d4539304
Enabled banded indel computation by default. Reversed logic in input UG argument so that we can still disable it if required. Minor changes to integration tests due to minor differences in GL's and in annotations
2012-01-04 15:28:26 -05:00
Mauricio Carneiro
9ff8a01da2
Merged bug fix from Stable into Unstable
2012-01-03 18:10:39 -05:00
Mauricio Carneiro
9b55505c03
Fixing PairHMMIndelErrorModel array out of bounds
...
This error was due to the ReadClipper change of contract. Before the read utils would return null if a read was entirely clipped, now it returns an empty (safe) GATKSAMRecord.
2012-01-03 18:08:46 -05:00
Christopher Hartl
2c3a9ce02f
Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable
2012-01-03 17:25:56 -05:00
David Roazen
621ee2b613
Merged bug fix from Stable into Unstable
2012-01-03 16:56:49 -05:00
Christopher Hartl
9093de1132
Cleanup: remove code to calculate the MLE AC in the UGE.
2012-01-03 15:58:51 -05:00
Christopher Hartl
2d093828a4
Final changes to Junky (been frozen for a while, but uncommitted) and the qscript for it. A first cursory implementation of the trellis-based Exact AC-constrained genotyping algorithm in UGE. Nothing calls into it, so this should be entirely safe (and, no surprise, it passes UG integration tests).
2012-01-03 15:33:04 -05:00
David Roazen
ea6e718cb8
SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline.
...
For now, we recommend only running with the GRCh37.64 database.
2012-01-03 15:18:36 -05:00
Christopher Hartl
93e1417b6e
Update to the VSS GATK documentation.
2012-01-03 13:39:31 -05:00
David Roazen
4984ca5e31
Merged bug fix from Stable into Unstable
2012-01-03 11:03:30 -05:00
David Roazen
f3f01da1af
Enforce serial dependencies in RecalibrationWalkersIntegrationTest
...
Some tests in this class were intermittently not being executed due
to being randomly scheduled before tests whose results they depend on.
Now the serial dependencies are enforced to avoid problematic orderings.
2012-01-03 10:42:41 -05:00
Eric Banks
ab8d47d9a5
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-03 09:38:49 -05:00
Mauricio Carneiro
3d4bf273de
Added getPileupForReadGroups to ReadBackPileup
...
* returns a pileup for all the read groups provided.
* saves us from multiple calls to getPileup (which is very inefficient)
2012-01-03 09:35:11 -05:00
Mauricio Carneiro
4a208c7c06
Refactor of the downsampling machinery to accept different strategies
...
* Implemented Adaptive downsampler
* Added integration test
* Added option to RRead scala script to choose downsampling strategy
2012-01-03 09:29:47 -05:00
Mauricio Carneiro
21ae3ef5f9
Added downsampling support to ReduceReads
...
* Downsampling is now a parameter to the walker with default value of 0 (no downsampling)
* Downsampling selects reads at random at the variant region window and strives to achieve uniform coverage if possible around the desired downsampling value.
* Added integration test
2012-01-03 09:29:46 -05:00
Mauricio Carneiro
cd68cc239b
Added knuth-shuffle (KS) and randomSubset using KS to MathUtils
...
* Knuth-shuffle is a simple, yet effective array permutator (hope this is good english).
* added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation.
* added unit tests to both functions
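The two utilities described above can be sketched as follows. This is a minimal illustration, not the MathUtils code from this commit: the class name, signatures, and the choice to work on int arrays with an explicit Random are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

final class ShuffleSketch {
    // Knuth (Fisher-Yates) shuffle. Returns a shuffled copy and leaves the
    // input untouched; every permutation is equally likely for a uniform rng.
    static int[] knuthShuffle(int[] input, Random rng) {
        int[] a = Arrays.copyOf(input, input.length);
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1); // uniform index in [0, i]
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
        return a;
    }

    // Random subset without repeats: shuffle, then keep the first k positions.
    static int[] randomSubset(int[] input, int k, Random rng) {
        if (k < 0 || k > input.length) {
            throw new IllegalArgumentException("k must be between 0 and input.length");
        }
        return Arrays.copyOf(knuthShuffle(input, rng), k);
    }
}
```

Taking a prefix of a full shuffle is the simplest way to get a subset without repeats; a partial shuffle that stops after k swaps would do less work for small k.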
2012-01-03 09:29:46 -05:00
Mauricio Carneiro
94791a2a75
Add support for reads starting with insertion
...
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
* Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
* Updated all ReadClipper unit tests
* ReduceReads does not hard clip leading insertions by default anymore
* SlidingWindow adjusts start location if read starts with insertion
* SlidingWindow creates an empty element with insertions to the right
* Fixed all potential divide by zero with totalCount() (from BaseCounts)
* Updated all Integration tests
* Added new integration test for multiple interval reducing
2012-01-03 09:29:45 -05:00
Mark DePristo
d05f0c2318
GATKPerformanceOverTime script update
...
-- Automatic detection of most recent version of GATK release (just tell the script now to use 1.2, 1.3, and 1.4)
-- Uses 1.4 now
-- By default we do 9 runs of each non-parallel test
-- In PathUtils added convenience utility to find most recent release GATK jar with a specific release number
2012-01-02 09:58:46 -05:00
Mauricio Carneiro
1b6d52817e
fixing adaptor clipping effect on recalibration integration test
2012-01-01 22:20:06 -05:00
Eric Banks
393993e0c7
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-31 20:42:46 -05:00
Mauricio Carneiro
55cfa76cf3
Updated integration tests for the new adaptor clipping fix.
2011-12-30 18:47:14 -05:00
Mauricio Carneiro
c7d0a9ebee
Forgot to test for inter-chromosomal mates in the adaptor clipping
...
* Fixing bug caught by Eric (and Kristian)
2011-12-30 00:19:53 -05:00
Matt Hanna
a259bfefd4
First commit addressing problems running RTC in parallel.
...
Turns out that because the RTC is the first walker to 'correctly' tree reduce according to functional programming
standards, the RTC has revealed a few problems with the tree reducer holding on to too much data. This is the first
and smaller of two commits to reduce memory consumption. The second commit will likely be pushed after GATK1.4 is
released.
2011-12-29 16:22:14 -05:00
Eric Banks
1a45ea5a05
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-29 11:37:15 -05:00
Mauricio Carneiro
f692911903
GATKSAMRecord emptyRead static constructor
...
* Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking.
* All ReadClipper utilities now emit empty reads for fully clipped reads
2011-12-27 17:01:17 -05:00
Mauricio Carneiro
8259c748f2
No more Filtered Reads tag.
...
All synthetic reads are marked with the reduced read tag.
2011-12-27 17:01:17 -05:00
Eric Banks
d20a25d681
A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently).
2011-12-27 16:50:38 -05:00
Eric Banks
adff40ff58
Minor optimizations to avoid extra processing (esp. for reduced reads)
2011-12-27 13:16:25 -05:00
Mauricio Carneiro
17bfe48d5e
Made all class methods private in the ReadClipper
...
* ReadClipperUnitTest now uses static methods
* Haplotype caller now uses static methods
* Exon Junction Genotyper now uses static methods
2011-12-27 02:11:32 -05:00
Eric Banks
dd990061f6
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-26 14:45:35 -05:00
Eric Banks
2130b39f33
Found the bug in the engine: RodLocusView was using the wrong seek method so that it would only move to the first locus of a shard (and with multi-locus shards, this meant that we never processed RODs from the other positions). In fact, because the seek(Shard) method is extremely misleading and now no longer used, I think it's safer to delete it and make everyone use the much more transparent seek(GenomeLoc). Note that I have not re-enabled my improvements to the intervals accumulation of ReferenceDataSource because that inefficiency is still present downstream in RodLocusView; need to discuss those changes with Matt.
2011-12-26 14:45:19 -05:00
Mauricio Carneiro
35c41409a1
Better contracts and docs for the ReadClipper
...
* Described the ReadClipper contract in the top of the class
* Added contracts where applicable
* Added descriptive information to all tools in the read clipper
* Organized public members and static methods together with the same javadoc
2011-12-23 19:36:57 -05:00
David Roazen
506c0e9c97
Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline
...
SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released
and we've verified the quality of its annotations.
2011-12-23 19:12:57 -05:00
Eric Banks
24c84da60d
'Fixing' the changes in ReferenceDataSource so that a shard properly contains a list of GenomeLocs instead of a single merged one. However, that uncovered a probable bug in the engine, so instead of letting this code fester unfixed in the build (affecting everyone in the group) I've decided to revert the previous (slow, but working) version and fix the engine in my own branch.
2011-12-23 15:39:12 -05:00
Eric Banks
8762313a0d
Better TODO message
2011-12-22 20:54:35 -05:00
Eric Banks
a815e875a8
Removing debugging output
2011-12-22 15:49:11 -05:00
Eric Banks
deef542a38
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-22 15:44:58 -05:00
Eric Banks
6d260ec6ae
Start printing traversal stats after 30 seconds. I can't stand waiting 2 minutes.
2011-12-22 15:40:59 -05:00
David Roazen
510c71158c
Merged bug fix from Stable into Unstable
2011-12-22 10:49:52 -05:00
David Roazen
32cdef9682
Rename *PerformanceTest test classes to *LargeScaleTest
...
This is in preparation for the installation of the new performance test suite in Bamboo.
Note that "ant performancetest" is now "ant largescaletest"
2011-12-22 10:38:49 -05:00
Mauricio Carneiro
731a463415
Updated IntegrationTests with new adaptor clipper
...
phew!
2011-12-20 17:48:52 -05:00
Mauricio Carneiro
cadff40247
getRefCoordSoftUnclippedStart and End refactor
...
These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips.
* Removed from ReadUtils
* Added to GATKSAMRecord
* Changed names to getSoftStart() and getSoftEnd()
* Updated third party code accordingly.
2011-12-20 17:48:51 -05:00
Mauricio Carneiro
07128a2ad2
ReadUtils cleanup
...
* Removed all clipping functionality from ReadUtils (it should all be done using the ReadClipper now)
* Cleaned up functionality that wasn't being used or had been superseded by other code (in an effort to reduce multiple unsupported implementations)
* Made all meaningful functions public and added better comments/explanation to the headers
2011-12-20 17:48:40 -05:00
Mauricio Carneiro
1c4774c475
Static versions of the hard clipping utilities
...
For simplified access to the hard clipping utilities. No need to create a ReadClipper object if you are not doing multiple complicated clipping operations, just use the static methods.
examples:
ReadClipper.hardClipLowQualEnds(2);
ReadClipper.hardClipAdaptorSequence();
2011-12-20 17:48:39 -05:00
Mauricio Carneiro
f73ad1c2e2
Bugfix/Rewrite: Algorithm to determine adaptor boundaries
...
The algorithm wasn't accounting for the case where the read is the reverse strand and the insert size is negative.
* Fixed and rewrote for more clarity (with Ryan, Mark and Eric).
* Restructured the code to handle GATKSAMRecords only
* Cleaned up the other structures and functions around it to minimize clutter and potential for error.
* Added unit tests for all 4 cases of adaptor boundaries.
2011-12-20 17:48:39 -05:00
Mark DePristo
0cc5c3d799
General improvements to Queue
...
-- Support for collecting resources info from DRMAA runners
-- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4
-- NCoresRequest is a testing queue script for this.
-- Added two command line arguments:
-- multiCoreJerk: don't request multiple cores for jobs with nt > 1. This was the old behavior but it's really not the best way to run parallel jobs. Now with queue if you run nt = 4 the system requests 4 cores on your host. If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk
-- job_parallel_env: parallel environment name used with SGE to request multicore jobs. Equivalent to -pe job_parallel_env NT for NT > 1 jobs
2011-12-20 14:05:09 -05:00
Eric Banks
7204fcc2c3
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-20 12:59:11 -05:00
Eric Banks
8ade2d6ac2
max_alternate_alleles also ready to be made public
2011-12-20 12:59:02 -05:00
Eric Banks
6f52bd580b
--multiallelic mode is not hidden anymore (but it is annotated as advanced); added docs
2011-12-20 12:47:38 -05:00
Mauricio Carneiro
37e0044c48
Removing unclipSoftClipBases from ReadUtils
...
* it was buggy and dangerous.
* Updated Chris' code to use the ReadClipper.
2011-12-20 00:11:26 -05:00
Mauricio Carneiro
78d9bf7196
Added REVERT_SOFTCLIPPED_BASES capability to ReadClipper
...
* New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches.
* Added functionality to clipping op to revert all soft clip bases in a read into matches
* Added revertSoftClipBases function to the ReadClipper for public use
* Wrote systematic unit tests
2011-12-20 00:04:30 -05:00
Christopher Hartl
24585062f8
Merge branch 'incoming'
2011-12-19 23:16:36 -05:00
Christopher Hartl
67298f8a11
AFCR made public (for use in VSS)
...
Minor changes to ValidationSiteSelector logic (SampleSelectors determine whether a site is valid for output, no actual subset context need be operated on beyond that determination). Implementation of GL-based site selection. Minor changes to EJG.
2011-12-19 23:14:26 -05:00
Eric Banks
06d385e619
Simplifying the interface a bit
2011-12-19 15:29:46 -05:00
Christopher Hartl
339ef92eac
Goodbye SW by default. Now aligned reads that overlap intron-exon junctions are scored where they are by default, but warns the user (and flags the record in the VCF) if there's evidence to suggest that there is an indel throwing off the scoring (e.g. if the best score of a realigned unmapped read is >5 log orders better than the best score of a scored mapped read). Unmapped reads are still SW-aligned to the junction-junction sequence. This should result in a rather massive speedup, so far untested.
...
UGBoundAF has to go in at some point. In the process of rewriting the math for bounding the allele frequency (it was assuming uniform tails, which is silly since i derived the posterior distribution in closed form sometime back, just need to find it)
2011-12-19 12:18:18 -05:00
Christopher Hartl
418d22b67e
Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable
...
Conflicts:
private/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IntronLossGenotyperV2.java
2011-12-19 10:59:18 -05:00
Christopher Hartl
69661da37d
Moving ValidationSiteSelector to validation package in public under my ownership. JunctionGenotyper added and modified several times; this commit is due to merge conflict fixes.
2011-12-19 10:57:28 -05:00
Laurent Francioli
16cc2b864e
- Corrected bug causing cases where both parents are HET to be accounted twice in the TDT calculation - Adapted TDT Integration test to corrected version of TDT
...
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2011-12-19 10:30:59 -05:00
Eric Banks
5fd19ae734
Commented exactly how the results are represented from the exact model so developers can know how to use them.
2011-12-19 10:19:00 -05:00
Eric Banks
3069a689fe
Bug fix: if there are multiple records at a given position, it turns out that SelectVariants would drop all variants that follow after one that fails filters (instead of dropping just the failing one). Added an integration test to cover this case.
2011-12-19 10:04:33 -05:00
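One way a bug of this shape can arise is an early exit from the per-position loop where a skip was intended. The sketch below is only an illustration under that assumption; it is not the SelectVariants code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration; not the actual SelectVariants implementation.
final class FilterLoopSketch {
    /** Buggy shape: stops at the first failing record, dropping everything after it. */
    static <T> List<T> dropAfterFirstFailure(List<T> records, Predicate<T> passes) {
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            if (!passes.test(record)) break; // exits the loop entirely
            kept.add(record);
        }
        return kept;
    }

    /** Intended shape: skips only the failing record and keeps going. */
    static <T> List<T> dropOnlyFailures(List<T> records, Predicate<T> passes) {
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            if (!passes.test(record)) continue; // skips just this record
            kept.add(record);
        }
        return kept;
    }
}
```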
Mauricio Carneiro
5b678e3b94
Remove ClippingOp UnitTests
...
* all testing functionality is in the ReadClipperUnitTest, no need to double test.
* class and package naming cleanup
2011-12-19 07:49:26 -05:00
Matt Hanna
1ead00cac5
New fork of SamFileHeaderMerger should be cached at the thread level to enable fast (and valid) thread lookups.
2011-12-18 19:04:26 -05:00
Ryan Poplin
bc842ab3a5
Adding option to VariantAnnotator to do strict allele matching when annotating with comp track concordance.
2011-12-18 15:27:23 -05:00
Ryan Poplin
953998dcd0
Now that getSampleDB is public in the walker base class this override in VariantAnnotator isn't necessary.
2011-12-18 14:38:59 -05:00
Eric Banks
76bd13a1ed
Forgot to update the unit test
2011-12-18 01:13:49 -05:00
Eric Banks
07f9d14d9f
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-18 00:43:15 -05:00
Eric Banks
c5ffe0ab04
No reason to sum the normalized posteriors array to get Pr(AF>0) given that we can just compute 1.0 - array[0]. Integration tests change only because of trivial precision artifacts for reference calls using EMIT_ALL_SITES.
2011-12-18 00:31:47 -05:00
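The identity behind this change can be sketched as follows, assuming the posteriors array is normalized in real (not log10) space and that index 0 holds AF = 0. The class and method names are hypothetical. The two results can differ in the last few floating-point digits, which matches the "trivial precision artifacts" noted above.

```java
// Hypothetical helper; names are illustrative and not taken from the GATK.
final class PosteriorSketch {
    /** Pr(AF > 0) by summing every entry except index 0. */
    static double probNonRefBySum(double[] normalizedPosteriors) {
        double sum = 0.0;
        for (int i = 1; i < normalizedPosteriors.length; i++) {
            sum += normalizedPosteriors[i];
        }
        return sum;
    }

    /** Same quantity via the complement; valid only if the array sums to 1. */
    static double probNonRefByComplement(double[] normalizedPosteriors) {
        return 1.0 - normalizedPosteriors[0];
    }
}
```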
Eric Banks
6dc52d42bf
Implemented the proper QUAL calculation for multi-allelic calls. Integration tests pass except for the ones making multi-allelic calls (duh) and one of the SLOD tests (which used to print 0 when one of the LODs was NaN but now we just don't print the SB annotation for that record).
2011-12-18 00:01:42 -05:00
Khalid Shakir
6059ca76e8
Removing cruft that snuck in last commit.
2011-12-16 23:00:16 -05:00
Khalid Shakir
7486696c07
When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter.
...
Creating a single temporary directory per ant test run instead of a putting temp files across all runs in the same directory.
Updated various tests for above items and other small fixes.
2011-12-16 18:09:25 -05:00
Mauricio Carneiro
e5df9e0684
cleaner test output
...
cleaned up the debug "pass" messages in the unit tests
2011-12-16 18:04:00 -05:00
Mauricio Carneiro
fcc21180e8
Added hardClipLeadingInsertions UnitTest for the ReadClipper
...
fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case.
note: test is turned off until contract changes to allow hanging insertions (left/right).
2011-12-16 18:02:47 -05:00
Mauricio Carneiro
075be52adc
Added hardClipByReferenceCoordinates (left and right tails) UnitTest for the ReadClipper
2011-12-16 18:01:33 -05:00
Mauricio Carneiro
5bba44d693
Added hardClipByReferenceCoordinates UnitTest for the ReadClipper
...
* fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail.
* fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail.
* fixed AlignmentStart of a clipped read that results in only hard clips and soft clips
note: added tests to all these beautiful cases...
2011-12-16 18:01:33 -05:00
Mauricio Carneiro
5838ba529d
Added hardClipByReadCoordinates UnitTest for the ReadClipper
2011-12-16 18:01:33 -05:00
Mauricio Carneiro
c26295919e
Added hardClipBothEndsByReferenceCoordinates UnitTest for the ReadClipper
2011-12-16 18:01:33 -05:00
Mark DePristo
1994c3e3bc
Only print warning about allele incompatibility when there are genotypes in the file in CombineVariants
2011-12-16 16:50:51 -05:00
Mark DePristo
b6067be952
Support for selecting only variants with specific IDs from a file in SelectVariants
...
-- Cleaned up unused variables as well
2011-12-16 16:50:39 -05:00
Mark DePristo
d6d2f49c88
Don't print log if there are no BAMs
2011-12-16 16:50:36 -05:00
Mark DePristo
78e0950a77
Minor bug fix for printing in SAMDataSource
2011-12-16 11:45:40 -05:00
Mark DePristo
7bc0d18418
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-16 11:42:42 -05:00
Ryan Poplin
5aa79dacfc
Changing hidden optimization argument to advanced.
2011-12-16 10:29:20 -05:00
Matt Hanna
3642a73c07
Performance improvements for dynamically merging BAMs in read walkers.
...
This change and my previous change have dropped runtime when dynamically merging 2k BAM files from 72.6min/1M reads to 46.8sec/1M reads.
Note that many of these changes are stopgaps -- the real problem is the way ReadWalkers interface with Picard, and I'll have to work with
Tim&Co to produce a more maintainable patch.
2011-12-16 09:37:44 -05:00
Mark DePristo
3414ecfe2e
Restored serial version of reader initialization. Serial mode is default, as the performance gains aren't so huge.
...
-- Parallel version can be re-enabled with a static boolean, if we decide to return to the parallel version
-- Comparison of serial and parallel reader with cached and uncached files:
Initialization time: serial with 500 fully cached BAMs: 8.20 seconds
Initialization time: serial with 500 uncached BAMs : 197.02 seconds
Initialization time: parallel with 500 fully cached BAMs: 30.12 seconds
Initialization time: parallel with 500 uncached BAMs : 75.47 seconds
2011-12-16 09:22:10 -05:00
Mark DePristo
fb1c9d2abc
Restored serial version of reader initialization. Parallel mode is default.
...
-- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version
2011-12-16 09:05:28 -05:00
Mauricio Carneiro
e61e5c7589
Refactor of ReadClipper unit tests
...
* expanded the systematic cigar string space test framework Roger wrote to all tests
* moved utility functions into Utils and ReadUtils
* cleaned up unused classes
2011-12-15 19:05:43 -05:00
Mauricio Carneiro
4748ae0a14
Bugfix: Softclips before Hardclips weren't being accounted for
...
caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
2011-12-15 12:17:25 -05:00
Mauricio Carneiro
62a2e335bc
Changing HardClipper contract to allow UNMAPPED reads
...
shifted the contract to functions that operate on reference based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.
2011-12-15 11:08:19 -05:00
Matt Hanna
9333b678b5
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 18:05:44 -05:00
Matt Hanna
6fb4be1a09
Cache header merger.
2011-12-14 18:05:31 -05:00
Mauricio Carneiro
50dee86d7f
Added unit test to catch Ryan's exception
...
Unit test to catch the special case that broke the clipping op, fixed in the previous commit.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro
128bdf9c09
Create artificial reads with "default" parameters
...
* added functions to create synthetic reads for unit testing with reasonable default parameters
* added more functions to create synthetic reads based on cigar string + bases and quals.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro
c85100ce9c
Fix ClippingOp bug when performing multiple hardclip ops
...
bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds.
fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).
2011-12-14 16:57:47 -05:00
Eric Banks
de5928ac5a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-14 16:24:56 -05:00
Eric Banks
4fddac9f22
Updating busted integration tests
2011-12-14 16:24:43 -05:00
Mark DePristo
01e547eed3
Parallel SAMDataSource initialization
...
-- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount
-- Added logger.info message noting progress and cost of reading low-level BAM data.
2011-12-14 16:14:26 -05:00
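A minimal sketch of loading many inputs on a fixed pool of threads, using only the standard java.util.concurrent API. The loader function and all names are hypothetical stand-ins; this is not the SAMDataSource code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; "loader" stands in for opening one BAM file and its index.
final class ParallelInitSketch {
    /** Applies loader to every input on nThreads threads; results keep input order. */
    static <I, O> List<O> loadAll(List<I> inputs, Function<I, O> loader, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> loader.apply(input);
                futures.add(pool.submit(task));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```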
Mark DePristo
71b4bb12b7
Bug fix for incorrect logic in subsetSamples
...
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Eric Banks
35fc2e13c3
Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all.
2011-12-14 15:31:09 -05:00
Eric Banks
1e90d602a4
Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.
2011-12-14 13:38:20 -05:00
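A sketch of such a cache, assuming diploid genotypes and the PL ordering of the VCF specification, where the genotype with allele indices j <= k is stored at index k*(k+1)/2 + j. The class name is hypothetical. Because this ordering is prefix-consistent, a table built for the largest number of alternate alleles also covers every smaller number.

```java
// Hypothetical sketch; assumes diploid genotypes in VCF-specification PL order.
final class PLIndexCacheSketch {
    /** Returns cache where cache[i] = {j, k}, the allele pair (j <= k) at PL index i. */
    static int[][] buildCache(int numAltAlleles) {
        int numAlleles = numAltAlleles + 1; // reference allele is index 0
        int numGenotypes = numAlleles * (numAlleles + 1) / 2;
        int[][] cache = new int[numGenotypes][];
        for (int k = 0; k < numAlleles; k++) {
            for (int j = 0; j <= k; j++) {
                cache[k * (k + 1) / 2 + j] = new int[] {j, k};
            }
        }
        return cache;
    }
}
```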
Eric Banks
988d60091f
Forgot to add in the new result class
2011-12-14 13:37:15 -05:00
Eric Banks
106bf13056
Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays.
2011-12-14 12:05:50 -05:00
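The pattern can be sketched with the standard java.lang.ThreadLocal: each thread lazily gets one reusable result object instead of callers passing pre-allocated arrays around. The class name, field names, and MAX_AC value are hypothetical and not taken from the GATK.

```java
import java.util.Arrays;

// Hypothetical result holder; field names and sizes are illustrative only.
final class ExactResultSketch {
    private static final int MAX_AC = 8; // assumed upper bound for the example

    final double[] log10Likelihoods;
    final double[] log10Posteriors;

    ExactResultSketch(int size) {
        log10Likelihoods = new double[size];
        log10Posteriors = new double[size];
        reset();
    }

    /** Clears both arrays so the object can be reused for the next site. */
    void reset() {
        Arrays.fill(log10Likelihoods, Double.NEGATIVE_INFINITY);
        Arrays.fill(log10Posteriors, Double.NEGATIVE_INFINITY);
    }

    /** One instance per thread, created on first use by that thread. */
    static final ThreadLocal<ExactResultSketch> RESULT =
            ThreadLocal.withInitial(() -> new ExactResultSketch(MAX_AC + 1));
}
```

A caller would fetch the object with `ExactResultSketch.RESULT.get()`, call `reset()`, and fill it in place.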
Eric Banks
7648521718
Add check for mixed genotype so that we don't exception out for a valid record
2011-12-14 11:26:43 -05:00
Eric Banks
9497e9492c
Bug fix for complex records: do not ever reverse clip out a complete allele.
2011-12-14 11:21:28 -05:00
Eric Banks
09a5a9eac0
Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number.
2011-12-14 10:43:52 -05:00
Eric Banks
d3f4a5a901
Fail gracefully when encountering malformed VCFs without enough data columns
2011-12-14 10:37:38 -05:00
Eric Banks
079932ba2a
The log10cache needs to be larger if we want to handle 10K samples in the UG.
2011-12-13 23:36:10 -05:00
Ryan Poplin
7fa1ab1bae
Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test
2011-12-13 17:19:40 -05:00
Eric Banks
e47a113c9f
Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right?
2011-12-12 23:02:45 -05:00
Mauricio Carneiro
5cc1e72fdb
Parallelized SelectVariants
...
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro
a70a0f25fb
Better debug output for SAMDataSource
...
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo
d03425df2f
TODO optimization targets
2011-12-12 17:39:51 -05:00
Laurent Francioli
7cf27bb66e
Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval.
2011-12-12 12:22:43 +01:00
Laurent Francioli
025bdfe2cc
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-12 12:19:44 +01:00
Eric Banks
7b6338c742
Merge branch 'master' into trialleles
2011-12-11 00:28:46 -05:00
Eric Banks
7c4b9338ad
The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now.
2011-12-11 00:23:33 -05:00
Eric Banks
044f211a30
Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly.
2011-12-10 23:57:14 -05:00
Eric Banks
364f1a030b
Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone.
2011-12-09 14:25:28 -05:00
Mauricio Carneiro
8475328b2c
Turning off test that breaks read clipper
...
until we define what is the desired behavior for clipping this particular case.
2011-12-09 11:53:12 -05:00
Roger Zurawicki
4cbd1f0dec
Reorganized the testing code and created ClipReadsTestUtils
...
Tests are more rigorous and include many more test cases.
We can test custom cigars and the generated cigars.
*Still needs debugging because code is not working.
Created test classes to be used across several tests.
Some cases are still commented out.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:52:34 -05:00
Roger Zurawicki
0e9c2cefa2
testHardClipSoftClippedBases works with Matches and Deletions
...
Insertions are a problem so cigar cases with "I" are commented out.
The test works with multiple deletions and matches.
This is still not a complete test. A lot of cigar test cases are commented out.
Added insertions to ReadClipperUnitTest
ReadClipper now tests for all indels.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:43:37 -05:00
Eric Banks
64dad13e2d
Don't carry around an extra copy of the code for the Haplotype Caller
2011-12-09 11:09:40 -05:00
Eric Banks
442ceb6ad9
The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors.
2011-12-09 10:16:44 -05:00
Laurent Francioli
a79144f7db
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-09 15:57:24 +01:00
Laurent Francioli
72fbfba97d
Added UnitTests for getFamilies() and getChildrenWithParents()
2011-12-09 15:57:07 +01:00
Laurent Francioli
5a06170804
Corrected bug causing getChildrenWithParents() to not take the last family member into consideration.
2011-12-09 14:51:34 +01:00
Eric Banks
aa4a8c5303
No dynamic programming solution for assigning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes).
2011-12-09 02:25:06 -05:00
Eric Banks
2fe50c64da
Updating md5s
2011-12-09 00:47:01 -05:00
Eric Banks
8777288a9f
Don't throw a UserException if too many alt alleles are trying to be genotyped. Instead, I've added an argument that allows the user to set the max number of alt alleles to genotype and the UG warns and skips any sites with more than that number.
2011-12-09 00:00:20 -05:00
Eric Banks
3e7714629f
Scrapped the whole idea of an int/long as an index into the ACset: with lots of alternate alleles we run into overflow issues. Instead, simply use the ACcounts array as the hash key since it is unique for each AC conformation. To do this, it needed to be wrapped inside an object so hashcode() would work.
2011-12-08 23:50:54 -05:00
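The wrapping is needed because a raw Java int[] uses identity-based hashCode() and equals(), so two arrays with equal contents would land in different map entries. A minimal sketch of a content-based key follows; the class name is hypothetical and this is not the GATK's own wrapper.

```java
import java.util.Arrays;

// Hypothetical wrapper; gives an int[] content-based equality for use as a map key.
final class ACCountsKeySketch {
    private final int[] counts;
    private final int hash;

    ACCountsKeySketch(int[] counts) {
        this.counts = counts.clone();            // defensive copy keeps the key immutable
        this.hash = Arrays.hashCode(this.counts); // computed once, since the key never changes
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof ACCountsKeySketch
                && Arrays.equals(counts, ((ACCountsKeySketch) other).counts);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```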
Eric Banks
4aebe99445
Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation.
2011-12-08 15:31:02 -05:00
Eric Banks
7750bafb12
Fixed bug where last dependent set index wasn't properly being transferred for sites with many alleles. Adding debugging output.
2011-12-08 13:50:50 -05:00
Guillermo del Angel
252e0f3d0a
Merged bug fix from Stable into Unstable
2011-12-08 13:11:39 -05:00
Guillermo del Angel
1bfe28067f
Don't try to genotype an indel even bigger than the reference window size, or else we'll be out of bounds. Necessary to handle Phase 1 integrated callset with large deletions. Better error indication when validating a GenomeLoc.
2011-12-08 12:54:08 -05:00
Mark DePristo
9def841275
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 13:36:16 -05:00
Mark DePristo
4055877708
Prints 0.0 TiTv not NaN when there are no variants
...
-- Updated md5
2011-12-07 12:07:54 -05:00
Matt Hanna
15533e08df
Fixed issue with RODWalker parallelization.
...
Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making
each ROD shard larger than the size of chr20. Why didn't we see this in Stable? Because the
ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification.
When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part
of the async I/O project, I inadvertently reenabled this specifier.
2011-12-07 11:55:42 -05:00
Mark DePristo
5d2212bc8e
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 09:03:17 -05:00
Mark DePristo
6bf18899df
Fix for variant summary -- now treats all 50 bp deletions or insertions as CNVs
2011-12-07 09:02:49 -05:00
Matt Hanna
c9b2cd8ba5
Fix for chartl's stale null representation issue.
2011-12-06 18:05:17 -05:00
Eric Banks
79d18dc078
Fixing indexing bug on the ACsets. Added unit tests for the Exact model code.
2011-12-06 16:17:18 -05:00
Matt Hanna
f5b977fc88
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-06 10:11:35 -05:00
Matt Hanna
4001c22a11
Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup.
2011-12-06 10:10:38 -05:00
Khalid Shakir
677bea0abd
Right aligning GATKReport numeric columns and updated MD5s in tests.
...
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
2011-12-05 23:22:15 -05:00
Eric Banks
7a0f6feda4
Make sure that too many alternate alleles aren't being passed to the genotyper (10 for now) and exit with a UserError if there are.
2011-12-05 16:18:52 -05:00
Eric Banks
7fac4afab3
Fixed priors (now initialized upon engine startup in a multi-dimensional array) and cell coefficients (properly handles the generalized closed form representation for multiple alleles).
2011-12-05 15:57:25 -05:00
Eric Banks
a7cb941417
The posteriors vector is now 2 dimensional so that it supports multiple alleles (although the UG is still hard-coded to use only array[0] for now); the exact model now collapses probabilities for all conformations over a given AC into the posteriors array (in the appropriate dimension). Fixed a bug where the priors and posteriors were being passed in swapped.
2011-12-04 13:02:53 -05:00
Eric Banks
eab2b76c9b
Added loads of comments for future reference
2011-12-03 23:54:42 -05:00
Eric Banks
29662be3d7
Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them).
2011-12-03 23:12:04 -05:00
Eric Banks
71f793b71b
First partially working version of the multi-allelic version of the Exact AF calculation
2011-12-02 14:13:14 -05:00
David Roazen
d014c7faf9
Queue now properly escapes all shell arguments in generated shell scripts
...
This has implications for both Qscript authors and CommandLineFunction authors.
Qscript authors:
You no longer need to (and in fact must not) manually escape String values to
avoid interpretation by the shell when setting up Walker parameters. Queue will
safely escape all of your Strings for you so that they'll be interpreted literally. Eg.,
Old way:
filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")
New way:
filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")
CommandLineFunction authors:
If you're writing a one-off CommandLineFunction in a Qscript and don't really
care about quoting issues, just keep doing things the direct, simple way:
def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)
If you're writing a CommandLineFunction that will become part of Queue and
will be used by other QScripts, however, it's advisable to do things the
newer, safer way, ie.:
When you construct your commandLine, you should do so ONLY using the API methods
required(), optional(), conditional(), and repeat(). These will manage quoting
and whitespace separation for you, so you shouldn't insert quotes/extraneous
whitespace in your Strings. By default you get both (quoting and whitespace
separation), but you can disable either of these via parameters. Eg.,
override def commandLine = super.commandLine +
required("eff") +
conditional(verbose, "-v") +
optional("-c", config) +
required("-i", "vcf") +
required("-o", "vcf") +
required(genomeVersion) +
required(inVcf) +
required(">", escape=false) + // This will be shell-interpreted
required(outVcf)
I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new
system, so you'll get free shell escaping when you use those in Qscripts just
like with walkers.
2011-12-01 18:13:44 -05:00
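For readers unfamiliar with shell escaping, one common POSIX technique is to wrap each argument in single quotes and rewrite any embedded single quote as `'\''`. The sketch below shows only that general technique; whether Queue escapes arguments this way is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical sketch of POSIX single-quote escaping; not Queue's implementation.
final class ShellEscapeSketch {
    /** Returns arg quoted so that a POSIX shell passes it through literally. */
    static String escape(String arg) {
        // Close the quote, emit an escaped quote, reopen the quote: ' -> '\''
        return "'" + arg.replace("'", "'\\''") + "'";
    }
}
```

With this, `QD<2.0` becomes `'QD<2.0'`, so the `<` is no longer treated as a redirection.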
Mark DePristo
3060a4a15e
Support for list of known CNVs in VariantEval
...
-- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument
-- Genericizes loading for intervals into interval tree by chromosome
-- GenomeLoc methods for reciprocal overlap detection, with unit tests
2011-11-30 17:05:16 -05:00
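Reciprocal overlap is commonly defined as: the shared span must cover at least a given fraction of each interval. A sketch under that definition, assuming closed 1-based intervals on the same contig; the names and signature are hypothetical and not the GenomeLoc API.

```java
// Hypothetical sketch; assumes closed, 1-based intervals on the same contig.
final class ReciprocalOverlapSketch {
    /** True if the overlap covers at least minFraction of both intervals. */
    static boolean reciprocalOverlap(int start1, int stop1, int start2, int stop2,
                                     double minFraction) {
        int overlap = Math.min(stop1, stop2) - Math.max(start1, start2) + 1;
        if (overlap <= 0) {
            return false; // intervals do not touch
        }
        double fraction1 = overlap / (double) (stop1 - start1 + 1);
        double fraction2 = overlap / (double) (stop2 - start2 + 1);
        return fraction1 >= minFraction && fraction2 >= minFraction;
    }
}
```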
Matt Hanna
b65db6a854
First draft of a test script for I/O performance with the new asynchronous I/O processing.
...
Also includes convenience parameters for specifying the IO/CPU threading balance outside of a tag. Will be killed when
Queue gets better support for tagged arguments (hopefully soon).
2011-11-30 13:13:16 -05:00
Laurent Francioli
1d5d200790
Cleaned up unused import statements
2011-11-30 15:30:30 +01:00
Mark DePristo
28b286ad39
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-30 09:11:53 -05:00
Laurent Francioli
20bffe0430
Adapted for the new version of MendelianViolation
2011-11-30 14:46:38 +01:00
Laurent Francioli
1cb5e9e149
Removed outdated (and unused) -familyStr commandline argument
2011-11-30 14:45:04 +01:00
Laurent Francioli
9574be0394
Updated MendelianViolationEvaluator integration test
2011-11-30 14:44:15 +01:00
Laurent Francioli
f49dc5c067
Added functionality to get all children that have both parents (useful when trios are needed)
2011-11-30 14:43:37 +01:00
Laurent Francioli
a4606f9cfe
Merge branch 'MendelianViolation'
...
Conflicts:
public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java
2011-11-30 11:13:15 +01:00
Laurent Francioli
b279ae4ead
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-30 10:10:21 +01:00
Laurent Francioli
7d58db626e
Added MendelianViolationEvaluator integration test
2011-11-30 10:09:20 +01:00
Ryan Poplin
91413cf0d9
Merged bug fix from Stable into Unstable
2011-11-29 14:01:23 -05:00
Ryan Poplin
cb284eebde
Further updating VQSR tutorial wiki docs to reflect the bundle
2011-11-29 14:00:57 -05:00
Ryan Poplin
dcb889665d
Merged bug fix from Stable into Unstable
2011-11-29 09:58:49 -05:00
Ryan Poplin
447e9bff9e
Updating VQSR tutorial wiki docs to reflect the bundle
2011-11-29 09:57:45 -05:00
Ryan Poplin
110298322c
Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it.
2011-11-29 09:29:18 -05:00
Laurent Francioli
ab67011791
Corrected bug introduced in the last update and causing no families to be returned by getFamilies in case the samples were not specified
2011-11-29 11:18:15 +01:00
Eric Banks
d7d8b8e380
Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now.
2011-11-28 14:18:28 -05:00
Laurent Francioli
a09c01fcec
Removed walker argument FamilyStructure as this is now supported by the engine (ped file)
2011-11-28 17:18:11 +01:00
Laurent Francioli
795c99d693
Adapted MendelianViolation to the new ped family representation. Adapted all classes using MendelianViolation too.
...
Added a number of useful metrics on allele transmission and MVs to MendelianViolationEvaluator
2011-11-28 17:13:14 +01:00
Laurent Francioli
e877db8f42
Changed visibility of getSampleDB from protected to public as the sampleDB needs to be accessible from Annotators and Evaluators too.
2011-11-28 17:11:30 +01:00
Laurent Francioli
5c2595701c
Added a function to get families only for a given list of samples.
2011-11-28 17:10:33 +01:00
Mark DePristo
3c36428a20
Bug fix for TiTv calculation -- shouldn't be rounding
2011-11-28 10:20:34 -05:00
Eric Banks
436b4dc855
Updated docs
2011-11-28 08:59:48 -05:00
Laurent Francioli
b1dd632d5d
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
...
Conflicts:
public/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
2011-11-25 16:16:44 +01:00
Mark DePristo
e60272975a
Fix for changed MD5 in streaming VCF test
2011-11-23 19:01:33 -05:00
Mark DePristo
12f09d88f9
Removing references to SimpleMetricsByAC
2011-11-23 16:08:18 -05:00
Mark DePristo
e319079c32
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-23 13:02:11 -05:00
Mark DePristo
4107636144
VariantEval updates
...
-- Performance optimizations
-- Tables now are cleanly formatted (floats are %.2f printed)
-- VariantSummary is a standard report now
-- Removed CompEvalGenotypes (it didn't do anything)
-- Deleted unused classes in GenotypeConcordance
-- Updates integration tests as appropriate
2011-11-23 13:02:07 -05:00
David Roazen
e5b85f0a78
A toString() method for IntervalBindings
...
Necessary since we're currently writing things like this to our VCF headers:
intervals=[org.broadinstitute.sting.commandline.IntervalBinding@4ce66f56]
2011-11-23 11:56:12 -05:00
Mark DePristo
5a4856b82e
GATKReports now support a format field per column
...
-- You can tell the table to format your object with "%.2f" for example.
2011-11-23 11:31:04 -05:00
Mark DePristo
c8bf7d2099
Check for null comment
2011-11-23 10:47:21 -05:00
Mark DePristo
6c2555885c
Caching getSimpleName() in VariantEval is a big performance improvement
...
-- Removed the SimpleMetricsByAC table, as one should just use the AlleleCount Stratification and the upcoming VariantSummary table
2011-11-23 08:34:05 -05:00
Guillermo del Angel
32adbd614f
Solve merge conflict
2011-11-22 22:48:46 -05:00
Guillermo del Angel
941f3784dc
Solve merge conflict
2011-11-22 22:48:03 -05:00
Guillermo del Angel
75d93e6335
Another corner condition fix: skip likelihood computation in case we cut so many bases there's no haplotype or read left
2011-11-22 22:46:12 -05:00
Mark DePristo
a3aef8fa53
Final performance optimization for GenotypesContext
2011-11-22 17:19:30 -05:00
Mark DePristo
990c02e4de
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-22 17:19:11 -05:00
Guillermo del Angel
38a90da92c
Fixed merge conflict to Unstable
2011-11-22 14:39:45 -05:00
Guillermo del Angel
32a77a8a56
Prevent out of bound error in case read span > reference context + indel length. Can happen in RNAseq reads with long N CIGAR operators in the middle.
2011-11-22 13:57:24 -05:00
Eric Banks
5821c11fad
For BAM and Reviewed errors we now check the error message to see if it's actually a 'too many open files' problem and, if so, we generate a User Error instead.
2011-11-22 10:50:22 -05:00
Mark DePristo
7087310373
Embarrassing bug fixed
2011-11-22 10:16:36 -05:00
Mark DePristo
e484625594
GenotypesContext now updates cached data for add, set, replace operations when possible
...
-- Involved separately managing the sample -> offset and sample sorted list operations. This should improve performance throughout the system
2011-11-22 08:40:48 -05:00
Mark DePristo
29ca24694a
UG now encoding NO_CALLs as ./. not ./.:.:4:0,0,0
...
A few updated UG integration tests
2011-11-22 08:22:32 -05:00
Mark DePristo
2b51c01df4
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-21 19:16:06 -05:00
Mark DePristo
5443d3634a
Again, fixing the add call when we really mean replace
...
-- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./. Eric will fix long-standing bug in QD observed from this change
-- VFW MD5s restored to their old correct values. There was a bug in my implementation that caused the genotypes to not be parsed from the lazy output even though the header was incorrect.
2011-11-21 19:15:56 -05:00
Mauricio Carneiro
5ad3dfcd62
BugFix: byte overflow in SyntheticRead compressed base counts
...
* fixed and added unit test
2011-11-21 17:11:50 -05:00
Mark DePristo
9ea7b70a02
Added decode method to LazyGenotypesContext
...
-- AbstractVCFCodec calls this if the samples are not sorted. Previously called getGenotypes() which didn't actually trigger the decode
2011-11-21 16:21:23 -05:00
Mark DePristo
ab2efe3bd3
Reverting bad exact model changes
2011-11-21 16:14:40 -05:00
Eric Banks
44554b2bfd
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-21 15:01:45 -05:00
Eric Banks
022832bd74
Very bad use of the == operator with Strings was ensuring that validating GenomeLocs was very inefficient. This fix resulted in a significant speedup for a simple RodWalker.
2011-11-21 14:49:47 -05:00
Mark DePristo
1561af22af
Exact model code cleanup
...
-- Fixed up code when fixing a bug detected by aggressive contracts in GenotypesContext.
2011-11-21 14:35:15 -05:00
Mark DePristo
2c501364b8
GenotypesContext no longer has immutability in constructor
...
-- additional bug fixes throughout VariantContext and GenotypesContext objects
2011-11-21 14:34:31 -05:00
David Roazen
1296dd41be
Removing the legacy -L "interval1;interval2" syntax
...
This syntax predates the ability to have multiple -L arguments, is
inconsistent with the syntax of all other GATK arguments, requires
quoting to avoid interpretation by the shell, and was causing
problems in Queue.
A UserException is now thrown if someone tries to use this syntax.
2011-11-21 13:18:53 -05:00
Mark DePristo
e467b8e1ae
More contracts on LazyGenotypesContext
2011-11-21 09:34:57 -05:00
Mark DePristo
2e9ecf639e
Generalized interface to LazyGenotypesContext
...
-- Now you provide a LazyParsing object
-- LazyGenotypesContext now knows nothing about the VCF parser itself. The parser holds all of the necessary data to parse the VCF genotypes when necessary, and the LGC only has a pointer to this object
-- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version
-- Deleted VCFParser interface, as it was no longer necessary
2011-11-21 09:30:40 -05:00
Mark DePristo
f0ac588d32
Extensive unit test for GenotypeContextUnitTest
...
-- Currently only tests base class. Adding subclass testing in a bit
2011-11-20 18:28:01 -05:00
Mark DePristo
bc44f6fd9e
Utility function Collection<Genotype> -> Collection<String>
2011-11-20 18:26:56 -05:00
Mark DePristo
9445326c6c
Genotype is Comparable via sampleName
2011-11-20 18:26:27 -05:00
Mark DePristo
f9e25081ab
Completely documented LazyGenotypesContext
2011-11-20 08:35:52 -05:00
Mark DePristo
9cb3fe3a59
Vastly better way of doing on-demand genotype loading
...
-- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading.
-- This new class replaced all of the old, complex functionality
-- Better still, there were many cases where the genotypes were being loaded unnecessarily, resulting in inefficiency. This was detected because some of the integration tests changed as the genotypes were no longer being parsed unnecessarily
-- Misc. bug fixes throughout the system
-- Bug fixes for PhaseByTransmission with new GenotypesContext
2011-11-20 08:23:09 -05:00
Mark DePristo
f392d330c3
Proper use of builder. Previous conversion attempt was flawed
2011-11-19 22:09:56 -05:00
Mark DePristo
7d09c0064b
Bug fixes and code cleanup throughout
...
-- chromosomeCounts now takes builder as well, cleaning up a lot of code throughout the codebase.
2011-11-19 18:40:15 -05:00
Mark DePristo
707bd30b3f
Should have been @BeforeMethod
2011-11-19 16:10:09 -05:00
Mark DePristo
8f7eebbaaf
Bugfix for pError not being checked correctly in CommonInfo
...
-- UnitTests to ensure correct behavior
-- UnitTests to ensure correct behavior for pass filters vs. failed filters vs. unfiltered
2011-11-19 15:58:59 -05:00
Mark DePristo
b7b57ef39a
Updating MD5 to reflect canonical ordering of calculation
...
-- We should no longer have md5s changing because of hashmaps changing their sort order on us
-- Added GenotypeLikelihoodsUnitTests
-- Refactored ExactAFCalculation to put the PL -> QUAL calculation in the GenotypeLikelihoods class to avoid the code copy.
2011-11-19 15:57:33 -05:00
Mark DePristo
73119c8e3c
Merge with master
...
-- A few bug fixes
2011-11-19 09:56:06 -05:00
Mark DePristo
f685fff79b
Killing the final versions of old new VariantContext interface
2011-11-18 21:32:43 -05:00
Mark DePristo
6cf315e17b
Change interface to getNegLog10PError to getLog10PError
2011-11-18 21:07:30 -05:00
Mark DePristo
c7f2d5c7c7
Final minor fix to contract
2011-11-18 19:40:05 -05:00
Mauricio Carneiro
b5de182014
isEmpty now checks if mReadBases is null
...
Newly created reads have mReadBases == null. This is an effort to centralize the place to check for empty GATKSAMRecords.
2011-11-18 18:34:05 -05:00
Mauricio Carneiro
8ab3ee9c65
Merge remote-tracking branch 'unstable/master' into rr
2011-11-18 16:50:25 -05:00
Mauricio Carneiro
333e5de812
returning read instead of GATKSAMRecord
...
Do not create new GATKSAMRecord when read has been fully clipped, because it is essentially the same as returning the currently fully clipped read.
2011-11-18 16:49:59 -05:00
Matt Hanna
8bb4d4dca3
First pass of the asynchronous block loader.
...
Block loads are only triggered on queue empty at this point. Disabled by
default (enable with nt:io=?).
2011-11-18 15:02:59 -05:00
Mark DePristo
a2e79fbe8a
Fixes to contracts
2011-11-18 14:18:53 -05:00
Mark DePristo
660d6009a2
Documentation and contracts for GenotypesContext and VariantContextBuilder
2011-11-18 13:59:30 -05:00
Mark DePristo
f54afc19b4
VariantContextBuilder
...
-- New approach to making VariantContexts modeled on StringBuilder
-- No more modify routines -- use VariantContextBuilder
-- Renamed isPolymorphic to isPolymorphicInSamples. Same for mono
-- getChromosomeCount -> getCalledChrCount
-- Walkers changed to use new VariantContext. Some deprecated new VariantContext calls remain
-- VCFCodec now uses optimized cached information to create GenotypesContext.
2011-11-18 12:39:10 -05:00
Eric Banks
6459784351
Merged bug fix from Stable into Unstable
2011-11-18 12:34:57 -05:00
Eric Banks
c62082ba1b
Making this class public again as per request from Cancer folks
2011-11-18 12:34:27 -05:00
Eric Banks
8710673a97
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-18 12:29:33 -05:00
Eric Banks
768b27322b
I figured out why we were getting tons of hom var genotype calls with Mauricio's low quality (synthetic) reduced reads: the RR implementation in the UG was not capping the base quality by the mapping quality, so all the low quality reads were used to generate GLs. Fixed.
2011-11-18 12:29:15 -05:00
Mark DePristo
7490dbb6eb
First version of VariantContextBuilder
2011-11-18 11:06:15 -05:00
Roger Zurawicki
f48d4cfa79
Bug fix: fully clipping GATKSAMRecords and flushing ops
...
Reads that are emptied after clipping become new GATKSAMRecords.
When applying ClippingOps, the ops are cleared after the clipping
2011-11-18 00:24:39 -05:00
Mark DePristo
fa454c88bb
UnitTests for VariantContext for chrCount, getSampleNames, Order function
...
-- Major change to how chromosomeCounts is computed. Now NO_CALL alleles are always excluded. So ChromosomeCounts(A/.) is 1; the previous result would have been 2.
-- Naming changes for getSamplesNameInOrder()
2011-11-17 20:37:22 -05:00
Mark DePristo
02f22cc9f8
No more VC integration tests. All tests are now unit tests
2011-11-17 15:33:09 -05:00
Mark DePristo
23359d1c6c
Bugfix for pruneVariantContext, which was dropping the ref base for padding
2011-11-17 15:32:52 -05:00
Mark DePristo
473b860312
Major determinism fix for UG and RankSumTest
...
-- Now these routines all iterate in sample name order (genotypes.iterateInSampleNameOrder) so that the results of UG and the annotator do not depend on the particular order of samples we see for the exact model and the RankSumTest
2011-11-17 15:31:45 -05:00
Khalid Shakir
c50274e02e
During flanking interval creation, merge overlapping flanks so that on scatter the list doesn't accidentally genotype the same site twice.
...
Moved flanking interval utilities to IntervalUtils with UnitTests.
2011-11-17 13:56:42 -05:00
Eric Banks
bad19779b9
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-17 13:29:43 -05:00
Eric Banks
16a021992b
Updated header description for the INFO and FORMAT DP fields to be more accurate.
2011-11-17 13:17:53 -05:00
Eric Banks
e7d41d8d33
Minor cleanup
2011-11-17 12:00:28 -05:00
Mark DePristo
7e66677769
Expanded UnitTests for VariantContext
...
Tests for
-- getGenotype and getGenotypes
-- subContextBySample
-- modify routines
2011-11-16 20:45:15 -05:00
Mauricio Carneiro
72f00e2883
Merging Roger's Unit tests for Reduce Reads from RR repository
2011-11-16 17:26:49 -05:00
Mark DePristo
aa0610ea92
GenotypeCollection renamed to GenotypesContext
2011-11-16 16:24:05 -05:00
Mark DePristo
974daaca4d
V13 version in archive. Can be pulled out wholesale for performance testing
2011-11-16 16:08:46 -05:00
Mark DePristo
caf6080402
Better algorithm for merging genotypes in CombineVariants
2011-11-16 15:17:33 -05:00
Mark DePristo
101ffc4dfd
Expanded, contrastive VariantContextBenchmark
...
-- Compares performance across a bunch of common operations with GATK 1.3 version of VariantContext and GATK 1.4
-- 1.3 VC and associated utilities copied wholesale into test directory under v13
2011-11-16 13:35:16 -05:00
Mark DePristo
e56d52006a
Continuing bugfixes to get new VC working
2011-11-16 10:39:17 -05:00
Matt Hanna
eb8e031f75
Merged bug fix from Stable into Unstable
2011-11-16 09:57:37 -05:00
Matt Hanna
6a5d5e7ac9
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/stable
2011-11-16 09:57:13 -05:00
Matt Hanna
7ac5cf8430
Getting rid of unsupported CountReadPairs walker in stable. Removal of
...
remainder of pairs processing framework to follow in unstable.
2011-11-16 09:53:59 -05:00
Eric Banks
c2ebe58712
Merge remote-tracking branch 'Laurent/master'
2011-11-16 09:34:47 -05:00
Laurent Francioli
0dc3d20d58
Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type
2011-11-16 09:33:13 +01:00
Laurent Francioli
7d77fc51f5
Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type
2011-11-16 03:32:43 -05:00
David Roazen
0d163e3f52
SnpEff 2.0.4 support
...
-Modified the SnpEff parser to work with the SnpEff 2.0.4 VCF output format
-Assigning functional classes and effect impacts now handled directly
by SnpEff rather than the GATK
-Removed support for SnpEff 2.0.2, as we no longer trust the output of that
version since it doesn't exclude effects associated with certain nonsensical
transcripts. These effects are excluded as of 2.0.4.
-Updated unit and integration tests
This support is based on a *release-candidate* of SnpEff 2.0.4, and so is subject
to change between now and the next GATK release.
2011-11-15 18:36:22 -05:00
Mark DePristo
df415da4ab
More bug fixes on the way to passing all tests
2011-11-15 17:38:12 -05:00
Mark DePristo
0be23aae4e
Bugfixes on way to a working refactored VariantContext
2011-11-15 17:20:14 -05:00
Mark DePristo
231c47c039
Bugfixes on way to a working refactored VariantContext
2011-11-15 16:42:50 -05:00
Laurent Francioli
fb685f88ec
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-15 16:23:53 -05:00
Mark DePristo
2b2514dad2
Moved many unused phasing walkers and utilities to archive
2011-11-15 16:14:50 -05:00
Mark DePristo
460a51f473
ID field now stored in the VariantContext itself, not the attributes
2011-11-15 14:56:33 -05:00
Eric Banks
7fada320a9
The right fix for this test is just to delete it.
2011-11-15 14:53:27 -05:00
Eric Banks
b45d10e6f1
The DP in the FORMAT field (per sample) must also use the representative count or else it's always 1 for reduced reads.
2011-11-15 10:23:59 -05:00
Mark DePristo
233e581828
Merging in Master
2011-11-15 09:28:24 -05:00
Eric Banks
b66556f4a0
Update error message so that it's clear ReadPair Walkers are exceptions
2011-11-15 09:22:57 -05:00
Mark DePristo
6e1a86bc3e
Bug fixes to VariantContext and GenotypeCollection
2011-11-15 09:21:30 -05:00
Roger Zurawicki
284430d61d
Added more basic UnitTests for ReadClipper
...
hardClipByReadCoordinatesWorks
hardClipLowQualTailsWorks
2011-11-15 00:13:52 -05:00
Roger Zurawicki
8e91e19229
Merge branch 'master' of ssh://nickel/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-15 00:13:37 -05:00
Mauricio Carneiro
cde829899d
compress Reduce Read counts bytes by offset
...
Compressing the representation of the reduced read counts by offset results in a 17% average compression of the final BAM file size.
Example compression -->
from: 10, 10, 11, 11, 12, 12, 12, 11, 10
to: 10, 0, 1, 1, 2, 2, 2, 1, 0
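The example above is consistent with storing the first count as-is and every later count as its offset from that first value. A minimal Java sketch of that reading follows; the class and method names are hypothetical and are not taken from the GATK source, and the actual implementation may differ (for instance in how it guards against byte overflow).

```java
import java.util.Arrays;

// Hypothetical illustration only; not the GATK implementation.
class OffsetCountCompression {

    // Keep the first count unchanged; store each later count as (count - first).
    static byte[] compress(byte[] counts) {
        byte[] encoded = new byte[counts.length];
        if (counts.length == 0) {
            return encoded;
        }
        encoded[0] = counts[0];
        for (int i = 1; i < counts.length; i++) {
            encoded[i] = (byte) (counts[i] - counts[0]);
        }
        return encoded;
    }

    // Inverse operation: add the first value back to every stored offset.
    static byte[] decompress(byte[] encoded) {
        byte[] counts = new byte[encoded.length];
        if (encoded.length == 0) {
            return counts;
        }
        counts[0] = encoded[0];
        for (int i = 1; i < encoded.length; i++) {
            counts[i] = (byte) (encoded[i] + encoded[0]);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] counts = {10, 10, 11, 11, 12, 12, 12, 11, 10};
        byte[] encoded = compress(counts);
        System.out.println(Arrays.toString(encoded));                   // [10, 0, 1, 1, 2, 2, 2, 1, 0]
        System.out.println(Arrays.equals(counts, decompress(encoded))); // true
    }
}
```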
2011-11-14 18:30:24 -05:00
Mark DePristo
4ff8225d78
GenotypeMap -> GenotypeCollection part 3
...
-- Test code actually builds
2011-11-14 17:51:41 -05:00
Mark DePristo
f0234ab67f
GenotypeMap -> GenotypeCollection part 2
...
-- Code actually builds
2011-11-14 17:42:55 -05:00
David Roazen
ab0ee9b847
Perform only necessary validation in VariantContext modify methods
2011-11-14 16:49:59 -05:00
Mark DePristo
2e9d5363e7
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-14 15:32:06 -05:00
Mark DePristo
1fbdcb4f43
GenotypeMap -> GenotypeCollection
2011-11-14 15:32:03 -05:00
Eric Banks
4dc9dbe890
One quick fix to previous commit
2011-11-14 14:42:12 -05:00
Eric Banks
7b2a7cfbe7
Transfer headers from the resource VCF when possible when using expressions. While there, VA was modified so that it didn't assume that the ID field was present in the VC's info map in preparation for Mark's upcoming changes.
2011-11-14 14:31:27 -05:00
Mark DePristo
9b5c79b49d
Renamed InferredGeneticContext to CommonInfo
...
-- I have no idea why I named this InferredGeneticContext, a totally meaningless term
-- Renamed to CommonInfo.
-- Made package protected, as no one should use this outside of VariantContext and Genotype
-- UGEngine was using IGC constant, but it's now using the public one in VariantContext.
2011-11-14 14:28:52 -05:00
Mark DePristo
077397cb4b
Deleted MutableVariantContext
...
-- All methods that used this capability now use VariantContext directly instead
2011-11-14 14:19:06 -05:00
Mark DePristo
b11c535527
Deleted MutableGenotype
...
-- This class wasn't really used anywhere, and so removed to control code bloat.
2011-11-14 13:16:36 -05:00
Mark DePristo
79987d685c
GenotypeMap contains a Map, not extends it
...
-- On path to replacing it with GenotypeCollection
2011-11-14 12:55:03 -05:00
Eric Banks
7aee80cd3b
Fix to deal with reduced reads containing a deletion
2011-11-14 12:23:46 -05:00
Eric Banks
3d2970453b
Misc minor cleanup
2011-11-14 09:41:54 -05:00
Laurent Francioli
1347beef40
Merge branch 'PhaseByTransmission'
2011-11-14 11:31:28 +01:00
Laurent Francioli
6881d4800c
Added Integration tests for Phasing by Transmission
2011-11-14 10:47:51 +01:00
Laurent Francioli
34acf8b978
Added Unit tests for new methods in GenotypeLikelihoods
2011-11-14 10:47:02 +01:00
Roger Zurawicki
1202a809cb
Added Basic Unit Tests for ReadClipper
...
Tests some but not all functions
Some tests have been disabled because they are not working
2011-11-13 22:27:49 -05:00
Eric Banks
b7c33116af
Minor docs update
2011-11-12 23:21:07 -05:00
Eric Banks
76d357be40
Updating docs example to use -L since that's best practice
2011-11-12 23:20:05 -05:00
Mark DePristo
fee9b367e4
VariantContext genotypes are now stored as GenotypeMap objects
...
-- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples
-- All instances in the gatk that used Map<String, Genotype> now use GenotypeMap type.
-- Amazingly, there were many places where HashMap<String, Genotype> is used, so that the order of the genotypes is technically undefined and could be dangerous. Now everything uses GenotypeMap with a specific ordering of samples (by name)
-- Integration tests updated and all pass
2011-11-11 15:00:35 -05:00
Guillermo del Angel
cd3146f4cf
Add hidden option to ValidationAmplicons to output slightly modified format to make file work with downstream SQNM tools more seamlessly at request of GAP: one line per record, keep probe identifier to 20 characters, no * in ref allele.
2011-11-11 14:07:07 -05:00
Ryan Poplin
40fbeafa37
VQSR will now detect if the negative model failed to converge properly because of having too few data points and automatically retry with more appropriate clustering parameters.
2011-11-11 11:52:30 -05:00
Mark DePristo
4938569b3a
More general handling of parameters for VariantContextBenchmark
2011-11-11 10:22:19 -05:00
Mark DePristo
ef9f8b5d46
Added subContextOfSamples to VariantContext
...
-- This is a more convenient accessor than subContextOfGenotypes, represents nearly all of the use cases of the former function, and potentially can be implemented more efficiently.
2011-11-11 10:07:11 -05:00
Mark DePristo
e216e85465
First working version of VariantContextBenchmark
2011-11-11 09:56:00 -05:00
Mark DePristo
ee40791776
Attributes are now Map<String,Object> not Map<String,?>
...
-- Allows us to avoid an unnecessary copy when creating InferredGeneticContext (whose name really needs to change).
2011-11-11 09:55:42 -05:00
Mark DePristo
dc9b351b5e
Meaningful error message when an IntervalArg file fails to parse correctly
2011-11-10 17:10:26 -05:00
Mark DePristo
bb7bf74aa8
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-10 16:05:43 -05:00
Mark DePristo
153e52ffed
VariantEvalIntegrationTest for IntervalStratification
2011-11-10 14:10:39 -05:00
Mauricio Carneiro
060c7ce8ae
It wouldn't harm integrationtests if we had our logic right... :-)
2011-11-10 14:03:22 -05:00
Eric Banks
39678b6a20
Check for reads with missing read groups and throw a UserException when encountered. Mauricio said this wouldn't break integration tests.
2011-11-10 13:34:45 -05:00
Mark DePristo
dd1810140f
-stratIntervals is optional
2011-11-10 13:27:32 -05:00
Mark DePristo
67b022c34b
Cleanup for new SampleUtils function
...
-- getVCFHeadersFromRods(rods) is now available so that you don't have getVCFHeadersFromRods(rods, null) throughout the codebase
2011-11-10 13:27:13 -05:00
Mark DePristo
35fe9c8a06
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-10 11:11:33 -05:00
Mark DePristo
dc4932f93d
VariantEval module to stratify the variants by whether they overlap an interval set
...
The primary use of this stratification is to provide a mechanism to divide assessment of a call set up by whether a variant overlaps an interval or not. I use this to differentiate between variants occurring in CCDS exons vs. those in non-coding regions, in the 1000G call set, using a command line that looks like:
-T VariantEval -R human_g1k_v37.fasta -eval 1000G.vcf -stratIntervals:BED ccds.bed -ST IntervalStratification
Note that the overlap algorithm properly handles symbolic alleles with an INFO field END value. In order to safely use this module you should provide entire contigs worth of variants, and let the interval strat decide overlap, as opposed to using -L which will not properly work with symbolic variants.
Minor improvements to create() interval in GenomeLocParser.
2011-11-10 10:58:40 -05:00
Mauricio Carneiro
0d8983feee
outputting the RG information
...
setReadGroup now sets the read group attribute for the GATKSAMRecord
2011-11-09 23:35:00 -05:00
Eric Banks
315ac68b0b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-09 22:37:36 -05:00
Eric Banks
6313aae2c4
Adding checks for hasBasePileup() before calling getBasePileup() as per GS thread
2011-11-09 22:37:26 -05:00
Ryan Poplin
74a18d3de8
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-09 22:29:40 -05:00
Ryan Poplin
24712c0221
Merged bug fix from Stable into Unstable
2011-11-09 22:28:27 -05:00
Ryan Poplin
8942406aa2
Use MathUtils to compare doubles instead of testing for equality
2011-11-09 22:05:21 -05:00
Ryan Poplin
348f2db7fd
Fix for HMM optimization. If the two penalty arrays match exactly the function should return the end of the array instead of 0.
2011-11-09 22:00:52 -05:00
Eric Banks
82bf09edf3
Mark Standard Annotations with an asterisk
2011-11-09 20:42:31 -05:00
Eric Banks
04b122be29
Fix for bug reported on GetSatisfaction
2011-11-09 20:33:36 -05:00
Mauricio Carneiro
d00b2c6599
Adding a synthetic read for filtered data
...
* Generalized the concept of a synthetic read to create both a running consensus and a synthetic read of filtered data.
* Synthetic reads can now have deletions (but not insertions)
* New reduced read tag for filtered data synthetic reads *(RF)*
* Sliding window header now keeps information of consensus and filtered data
* Synthetic reads are created simultaneously, new functionality is controlled internally by addToSyntheticReads
2011-11-09 20:16:22 -05:00
Eric Banks
21bf43f3bb
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-09 15:34:40 -05:00
Eric Banks
02d5e3025e
Added integration test for intervals from bed file
2011-11-09 15:34:19 -05:00
Christopher Hartl
85bffe1dca
Merged bug fix from Stable into Unstable
2011-11-09 15:29:14 -05:00
Christopher Hartl
d828eba7f4
Allow comments in a table-formatted file to precede the header line.
2011-11-09 15:27:38 -05:00
Eric Banks
8205efbb29
Merge branch 'master' into intervals
2011-11-09 15:27:15 -05:00
Eric Banks
d64f8a89a9
Instead of the SelfScopingFeatureCodec interface, pushed this functionality into Tribble itself. Now we can e.g. determine that a file can be parsed by the BedCodec on the fly.
2011-11-09 15:24:29 -05:00
Mauricio Carneiro
f080f64f99
Preserve RG information on new GATKSAMRecord from SAMRecord
2011-11-09 14:39:20 -05:00
Mauricio Carneiro
f9530e0768
Clean unnecessary attributes from the read
...
this gives on average 40% file size reduction.
2011-11-09 14:39:20 -05:00
Mauricio Carneiro
9427ada498
Fixing no cigar bug
...
empty GATKSAMRecords will have a null cigar. Treat them accordingly.
2011-11-09 14:39:20 -05:00
Mark DePristo
e639f0798e
mergeEvals allows you to treat -eval 1.vcf -eval 2.vcf as a single call set
...
-- A bit of code cleanup in VCFUtils
-- VariantEval table to create 1000G Phase I variant summary table
-- First version of 1000G Phase I summary table Qscript
2011-11-09 14:35:50 -05:00
Christopher Hartl
149b79eaad
Merged bug fix from Stable into Unstable
2011-11-09 11:26:30 -05:00
Christopher Hartl
11abb4f9d1
Better error message.
2011-11-09 11:25:28 -05:00
Christopher Hartl
d3a533b82e
Revert "a"
...
This reverts commit 1175f50ddbf389f5da74d27dc725596582ae15af.
2011-11-09 11:22:26 -05:00
Christopher Hartl
5eaf800281
a
2011-11-09 11:22:20 -05:00
Christopher Hartl
5451fbc2b2
Merged bug fix from Stable into Unstable
2011-11-09 11:06:15 -05:00
Christopher Hartl
091229e4db
MVLikelihoodRatio now checks if the family string is provided before attempting to instantiate. Also check that variant contexts have both genotypes and genotype likelihoods.
...
Table codec now yells at users for not providing a HEADER with the table - parsing tables without a header line was causing the first line of the file to be eaten.
Table feature now has a toString method.
These are minor bug fixes.
2011-11-09 11:03:29 -05:00
Mauricio Carneiro
e1b4c3968f
Fixing GATKSAMRecord bug
...
when constructing a GATKSAMRecord from scratch, we should set "mRestOfBinaryData" to null so the BAMRecord doesn't try to retrieve missing information from the non-existent bam file.
2011-11-08 16:50:36 -05:00
Ryan Poplin
e973ca2010
fixing merge conflict.
2011-11-08 14:55:05 -05:00
Ryan Poplin
b0e6afec48
Bug fix for HMM optimization. Need to also check the gap continuation penalty array for the index with the first discrepancy.
2011-11-08 14:51:25 -05:00
Laurent Francioli
571c724cfd
Added reporting of the number of genotypes updated.
2011-11-08 15:15:51 +01:00
Ryan Poplin
94dc447a70
Merged bug fix from Stable into Unstable
2011-11-07 15:26:35 -05:00
Ryan Poplin
0b181be61f
Bug fix in SelectVariants when using a discordance track but no sample specifications. Added integration test to test this.
2011-11-07 15:25:16 -05:00
Ryan Poplin
0534149708
Merged bug fix from Stable into Unstable
2011-11-07 14:07:08 -05:00
Ryan Poplin
2d1e385ca4
Adding note to VQSR docs about Rscript being needed in the environment PATH.
2011-11-07 14:04:13 -05:00
Eric Banks
759f4fe6b8
Moving unclaimed walker with bad integration test to archive
2011-11-07 13:16:38 -05:00
Eric Banks
c1986b6335
Add notes to the GATKdocs as to when a particular annotation can/cannot be calculated.
2011-11-07 11:06:19 -05:00
Eric Banks
724e3f3b0d
Merged bug fix from Stable into Unstable
2011-11-06 22:23:22 -05:00
Eric Banks
cdd40d1222
Removing contracts for the SimpleTimer
2011-11-06 22:22:49 -05:00
Ryan Poplin
5c565d28b9
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-06 10:26:19 -05:00
Eric Banks
3517489a22
Better --sample selection integration test for VE. The previous one would return true even if --sample was not working at all.
2011-11-06 01:07:49 -04:00
Eric Banks
1c4e429a1c
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-06 00:05:56 -04:00
Eric Banks
a12bc63e5c
Get rid of support for bams without sample information in the read groups. This hidden option wasn't being used anyways because it wasn't hooked up properly in the AlignmentContext.
2011-11-05 23:54:28 -04:00
Eric Banks
ad57bcd693
Adding integration test to cover using expressions with IDs (-E foo.ID)
2011-11-05 23:53:15 -04:00
Eric Banks
90a053ea93
Don't change the mapping quality of MQ=255 reads in IR
2011-11-05 22:40:45 -04:00
Ryan Poplin
611a395783
Now properly extending candidate haplotypes with bases from the reference context instead of filling with padding bases. Functionality in the private Haplotype class is no longer necessary so removing it. No need to have four different Haplotype classes in the GATK.
2011-11-05 12:18:56 -04:00
Mark DePristo
e99871f587
Bug fix for decode loc
...
-- decodeLoc() wasn't skipping input header lines, so the system blew up when there was an = line being split.
2011-11-04 13:20:54 -04:00
Mark DePristo
a340a1aeac
Bug fix. decodeLoc() should update lineNo so you get meaningful line no when indexing
...
due to malformed VCF files.
2011-11-04 11:44:24 -04:00
Mark DePristo
9f260c0dc1
Zero byte index bug fix for RandomlySplitVariants + cleanup
...
-- vcfWriter2 was never being closed in onTraversalDone(), so the on-the-fly index file was being created but never actually properly written to the file.
-- This bug is ultimately due to the inability of the GATK to allow multiple VCF output writers as @Output arguments, though
-- Removed the unnecessary local variable iFraction, = 1000 * the input fraction argument. Now the system just uses a double random number and compares it to the input fraction directly. Is there some subtle reason I don't appreciate for this programming construct?
2011-11-04 09:45:20 -04:00
Mauricio Carneiro
e89ff063fc
GATKSAMRecord refactor
...
The GATK engine will now provide a GATKSAMRecord to all tools which incorporates the functionality used by the GATK to the bam file (ReadGroups, Reduced Reads, ...).
* No tools should create SAMRecord anymore, use GATKSAMRecord instead *
2011-11-03 15:43:26 -04:00
Laurent Francioli
385a6abec1
Fixed a bug that wrongly swapped the mother and father genotypes in case the child genotype was missing.
2011-11-03 13:04:53 +01:00
Laurent Francioli
893787de53
Functions getAsMap and getNegLog10GQ now handle the missing genotype case.
2011-11-03 13:04:11 +01:00
Eric Banks
e8bceb1eaa
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-02 21:13:54 -04:00
Eric Banks
78a00d2ddc
Updating UG integration tests (needed updating only because the -mbq default is different from the old -mmq one).
2011-11-02 21:13:44 -04:00
Eric Banks
52b16bf739
Must check whether there's a normal vs. extended pileup before asking for it.
2011-11-02 20:45:24 -04:00
Eric Banks
e1edd6bd12
Removing the min mapping quality argument since it wasn't being used in the normal processing of the pileups in UG - only for indel pileups. Instead, we apply the min base quality to the reads in the pileup for indels and define it to be the min 'confidence' of the base. Docs are updated but I didn't rename the argument as I don't want people to complain.
2011-11-02 20:32:58 -04:00
Ryan Poplin
e94fcf537b
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-02 16:29:19 -04:00
Ryan Poplin
4d35272916
Bug fixes with Mauricio to functions in ReadUtils used by reduced reads and the haplotype caller.
2011-11-02 16:29:10 -04:00
Mark DePristo
8a2929c1dd
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-11-02 16:21:00 -04:00
Laurent Francioli
19ad5b635a
- Calculation of parent/child pairs corrected
...
- Separated the reporting of single and double mendelian violations in trios
2011-11-02 18:35:31 +01:00
Eric Banks
967ff647b8
Reduced reads shouldn't contribute to Fisher Strand calculations
2011-11-02 13:07:20 -04:00
Eric Banks
cf0e699226
QualByDepth was inefficiently iterating over the pileup 2 times for some reason. Removed non-useful annotation classes.
2011-11-02 12:58:38 -04:00
Eric Banks
4501dce58d
Fixing merge conflict
2011-11-02 12:50:32 -04:00
Eric Banks
54331b44e9
New way of looking at the size of a pileup: there's a physical number of elements in the data structure and there's a representative depth of coverage (since a reduced read represents depth >= 1). The size() method has been removed because its meaning is ambiguous. Updated several annotations and the UG engine to make use of the representative depths.
2011-11-02 12:47:30 -04:00
Mark DePristo
392e0aeace
Moved unit tests into master IntervalUtilsUnitTest
2011-11-02 10:52:00 -04:00
Mark DePristo
c2b97030a4
IntervalUtils for completely balanced locus-based scatter/gather
...
-- scatterLocusIntervals master utility
-- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc
-- Util function for reversing a list (List<T> -> List<T>, unlike Collections version)
-- DoC is PartitionType.INTERVAL
-- Significant unit tests on new functionality (all passing)
-- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work
2011-11-02 10:49:40 -04:00
Laurent Francioli
119ca7d742
Fixed a bug in parent/child pairs reporting causing a crash in case the -mvf option was used and mother was not provided
2011-11-02 08:22:33 +01:00
Laurent Francioli
b91a9c4711
- Fixed parent/child pairs handling (was crashing before)
...
- Added parent/child pair reporting
2011-11-02 08:04:01 +01:00
Mark DePristo
5fc613f972
Better default partition types for walkers
...
-- Added PartitionType.READ, and associated ReadScatterFunction. ReadScatterFunction is literally just ContigScatterFunction until someone wants to implement something better
-- LocusWalkers (and subclasses RodWalkers and RefWalkers) are by default PartitionType.LOCUS.
2011-11-01 19:47:10 -04:00
Mauricio Carneiro
36600fd8e9
added MQ of low MQ/BQ to consensus RMS
...
Bases that were excluded for MQ and BQ filters are now contributing to the MQ RMS (but not to consensus base counts and variant/not variant region triggers).
2011-11-01 17:46:12 -04:00
Mauricio Carneiro
b004489c6d
Moving ReduceRead TAG to GATKSAMRecord
...
ReduceReads are now a feature of a GATKSAMRecord, so the tag and the special methods needed to use it will now be housed by the GATKSAMRecord.
2011-11-01 17:12:09 -04:00
Mauricio Carneiro
17cc484dbd
Revert "ReduceReads ref bases are now output as '='
...
Reducing the reference bases to '=' results in an extra compression of 13% on average. The GATK is not ready to handle files with '=' bases, and the decision was to implement this as engine support, not as a part of ReduceReads.
2011-11-01 16:35:07 -04:00
Eric Banks
0839c75c8d
More minor fixes to docs
2011-10-31 21:49:27 -04:00
Eric Banks
74b018a1f3
Minor fixes to docs
2011-10-31 21:41:43 -04:00
Eric Banks
31ee5432c5
Merged bug fix from Stable into Unstable
2011-10-31 14:56:59 -04:00
David Roazen
cdde32acbd
Merged bug fix from Stable into Unstable
2011-10-31 14:21:15 -04:00
Eric Banks
f62af0291b
Check for invalid VCF records (not enough tokens) instead of assuming they are there.
2011-10-31 14:09:51 -04:00
Andrey Sivachenko
bed0acaed4
nWayOut now adds PG tag to the header as it should. Also, additional hidden option added: keepPGTags. If invoked, IndelRealigner PG tags from previous runs (if any) are kept in the header and the new PG tag is simply added, instead of overriding them
2011-10-31 12:28:28 -04:00
Mauricio Carneiro
389380a590
ReduceReads ref bases are now output as '=' to save space
...
Restructured the sliding window framework to manipulate a wrapped version of the SAMRecord that contains information about the reference.
2011-10-30 12:04:39 -04:00
Eric Banks
0ca7428e76
Allow processing of empty intervals, but warn user when this case is encountered.
2011-10-28 12:12:14 -04:00
Eric Banks
649dfe98f0
Add VCF header for any expressions that are requested
2011-10-28 10:22:19 -04:00
Eric Banks
8b1a62da27
Adding unit test to cover overlapping intervals from the same source with the intersection rule.
2011-10-28 09:59:43 -04:00
Eric Banks
057a79f598
This argument should be annotated as @Input
2011-10-28 09:44:49 -04:00
Eric Banks
4ba7c0cecd
Moving to private
2011-10-28 09:29:28 -04:00
Eric Banks
1bdd76c2f2
These tools now use the IntervalBinding system to handle intervals instead of doing it all manually
2011-10-28 09:28:12 -04:00
Eric Banks
6ba08a103d
Empty ROD files should generate an exception when used for creating intervals. Moved some now obsolete files to the archive as the realigner will now read all target intervals into memory.
2011-10-28 09:23:25 -04:00
Eric Banks
3d04bb5608
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-27 23:55:18 -04:00
Eric Banks
19e27d4568
Removing all instances of -BTI (in tests and in GATKdocs) and replacing them with the appropriate alternative.
2011-10-27 23:55:11 -04:00
Eric Banks
cafc245a43
For some reason, a class of Codecs (including TableCodec) requires that a GenomeLocParser be passed in to do the position processing. Why can't they just return a Feature with chr, start, stop? Isn't that the right thing?
2011-10-27 23:54:28 -04:00
Guillermo del Angel
cbc43683ee
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-27 20:54:18 -04:00
Guillermo del Angel
8907e42007
First fully functional implementation of ValidationSiteSelectorWalker. User gives a) a set of input variants, b) a desired number of output variants, c) optionally, a set of samples which will restrict sites to be polymorphic in those samples, d) a frequency selection mode: either uniform (no AF matching), or matching AF so that output sites mirror the input AF spectrum as closely as possible.
...
More testing is needed and docs need improving but so far all functionality seems up and running
2011-10-27 20:53:48 -04:00
Eric Banks
ccfd853b34
Added further integration tests for rod-based intervals that deal with more complex cases. Good call by Mark to test the empty VCF example because we were failing on it; fixed.
2011-10-27 20:43:50 -04:00
Eric Banks
c2f343773e
Oops, working too quickly last time. This is the proper fix for the potential NPE in the equals() test.
2011-10-27 15:32:08 -04:00
Khalid Shakir
b80d407dc7
No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path.
...
Other minor cleanup.
2011-10-27 14:17:07 -04:00
Eric Banks
8c4dbce6d8
Don't serialize the GATKArgumentCollection for the GATKRunReports (which would have meant dealing with the new IntervalBindings). Also, forgot to remove a test that's no longer relevant to BED parsing.
2011-10-27 13:58:19 -04:00
Eric Banks
4a7e6fee3f
Remove support for BED file interval parsing in the GATK; it should all go through Tribble now. IndelRealigner no longer supports unordered interval input (which shouldn't have been used anyways). Temporarily commenting out serialization of arguments so that tests pass; this whole piece will be deleted soon anyways.
2011-10-27 13:38:08 -04:00
Matt Hanna
f7df8bdecc
Merged bug fix from Stable into Unstable
2011-10-27 11:31:17 -04:00
Matt Hanna
41ddc7bce7
Make sure we output a full stack trace when we encounter Tribble error messages on VCF header merge.
2011-10-27 11:30:04 -04:00
Eric Banks
44f905b5e5
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-26 23:31:11 -04:00
Eric Banks
68283b1651
Fixing docs and adding GATKdocs for the new interval functionality
2011-10-26 22:14:43 -04:00
Mark DePristo
c9978316a3
Merge branch 'FragmentUtils'
2011-10-26 19:51:49 -04:00
Mauricio Carneiro
add9ad97ec
No scatter gather for VQSR or ApplyVQSR.
...
These walkers should not be scatter gatherable. Annotating them accordingly so that Queue doesn't allow a less than knowledgeable user to try and scatter/gather VQSR.
2011-10-26 16:35:44 -04:00
Ryan Poplin
74aeb22eeb
Merged bug fix from Stable into Unstable
2011-10-26 15:57:30 -04:00
Ryan Poplin
86871bd1e3
Throw a UserException in the BQSR when there is no data instead of creating an empty csv file
2011-10-26 15:56:41 -04:00
Mark DePristo
034a997d07
Generalized Reads -> Fragment calculation
...
-- Supports ReadBackedPileup -> FragmentCollection as before
-- Added support for List<SAMRecord> -> FragmentCollection for Ryan's haplotype caller
-- General cleanup, renaming, move to separate package, more extensive unit tests, etc.
-- Added toFragment() function to ReadBackedPileup interface
2011-10-26 15:54:38 -04:00
Eric Banks
2f21b6ecfb
Removed debugging output
2011-10-26 15:50:20 -04:00
Eric Banks
b39fcb1bea
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-10-26 15:44:25 -04:00
Eric Banks
b6ce6ed3f8
Go around the ROD system for now so that we can just call decodeLoc() for efficiency. Noted that we should go through the ROD system once it gets cleaned up. This means that currently gzipped files are not supported with -L.
2011-10-26 15:42:53 -04:00
Eric Banks
3273c20c98
Added integration tests for Tribble-based intervals and fixed up some of the other tests based on some method changes.
2011-10-26 15:29:18 -04:00
Eric Banks
9424e8b2ca
Initial working version of new interval system in which the argument for -L (and -XL) is allowed to be a rod file (e.g. VCF). Old samtools-style intervals still behave as before. BTI is no longer supported. The merging (union or intersection) of intervals is now consistently applied to all -L (or -XL) intervals, which is nice. More testing needed.
2011-10-26 14:11:49 -04:00
Mark DePristo
7fa943aef1
Renamed FragmentPileup to FragmentUtils
2011-10-26 14:01:45 -04:00
Laurent Francioli
1f044faedd
- Genotype assignment in case of equally likely combinations is now random
...
- Genotype combinations with 0 confidence are now left unphased
2011-10-26 19:57:09 +02:00