Commit Graph

1355 Commits (7486696c076f73c02a61baaf2262f6f1cdc91de7)

Author SHA1 Message Date
Khalid Shakir 7486696c07 When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter.
Creating a single temporary directory per ant test run instead of a putting temp files across all runs in the same directory.
Updated various tests for above items and other small fixes.
2011-12-16 18:09:25 -05:00
Mauricio Carneiro e5df9e0684 cleaner test output
cleaned up the debug "pass" messages in the unit tests
2011-12-16 18:04:00 -05:00
Mauricio Carneiro fcc21180e8 Added hardClipLeadingInsertions UnitTest for the ReadClipper
fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case.

note: test is turned off until contract changes to allow hanging insertions (left/right).
2011-12-16 18:02:47 -05:00
Mauricio Carneiro 075be52adc Added hardClipByReferenceCoordinates (left and right tails) UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro 5bba44d693 Added hardClipByReferenceCoordinates UnitTest for the ReadClipper
* fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail.
* fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail.
* fixed AlignmentStart of a clipped read that results in only hard clips and soft clips

note: added tests to all these beautiful cases...
2011-12-16 18:01:33 -05:00
Mauricio Carneiro 5838ba529d Added hardClipByReadCoordinates UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro c26295919e Added hardClipBothEndsByReferenceCoordinates UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mark DePristo 1994c3e3bc Only print warning about allele incompatibility when running there are genotypes in the file in CombineVariants 2011-12-16 16:50:51 -05:00
Mark DePristo b6067be952 Support for selecting only variants with specific IDs from a file in SelectVariants
-- Cleaned up unused variables as well
2011-12-16 16:50:39 -05:00
Mark DePristo d6d2f49c88 Don't print log if there are no BAMs 2011-12-16 16:50:36 -05:00
Mark DePristo 78e0950a77 Minor bug fix for printing in SAMDataSource 2011-12-16 11:45:40 -05:00
Mark DePristo 7bc0d18418 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-16 11:42:42 -05:00
Ryan Poplin 5aa79dacfc Changing hidden optimization argument to advanced. 2011-12-16 10:29:20 -05:00
Matt Hanna 3642a73c07 Performance improvements for dynamically merging BAMs in read walkers.
This change and my previous change have dropped runtime when dynamically merging 2k BAM files from 72.6min/1M reads to 46.8sec/1M reads.
Note that many of these changes are stopgaps -- the real problem is the way ReadWalkers interface with Picard, and I'll have to work with
Tim&Co to produce a more maintainable patch.
2011-12-16 09:37:44 -05:00
Mark DePristo 3414ecfe2e Restored serial version of reader initialization. Serial mode is default, as the performance gains aren't so huge.
-- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version

-- Comparison of serial and parallel reader with cached and uncached files:

Initialization time: serial   with 500 fully cached BAMs: 8.20 seconds
Initialization time: serial   with 500 uncached BAMs    : 197.02 seconds
Initialization time: parallel with 500 fully cached BAMs: 30.12 seconds
Initialization time: parallel with 500 uncached BAMs    : 75.47 seconds
2011-12-16 09:22:10 -05:00
Mark DePristo fb1c9d2abc Restored serial version of reader initialization. Parallel mode is default.
-- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version
2011-12-16 09:05:28 -05:00
Mauricio Carneiro e61e5c7589 Refactor of ReadClipper unit tests
* expanded the systematic cigar string space test framework Roger wrote to all tests
* moved utility functions into Utils and ReadUtils
* cleaned up unused classes
2011-12-15 19:05:43 -05:00
Mauricio Carneiro 4748ae0a14 Bugfix: Softclips before Hardclips weren't being accounted for
caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
2011-12-15 12:17:25 -05:00
Mauricio Carneiro 62a2e335bc Changing HardClipper contract to allow UNMAPPED reads
shifted the contract to functions that operate on reference based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.
2011-12-15 11:08:19 -05:00
Matt Hanna 9333b678b5 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 18:05:44 -05:00
Matt Hanna 6fb4be1a09 Cache header merger. 2011-12-14 18:05:31 -05:00
Mauricio Carneiro 50dee86d7f Added unit test to catch Ryan's exception
Unit test to catch the special case that broke the clipping op, fixed in the previous commit.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro 128bdf9c09 Create artificial reads with "default" parameters
* added functions to create synthetic reads for unit testing with reasonable default parameters
* added more functions to create synthetic reads based on cigar string + bases and quals.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro c85100ce9c Fix ClippingOp bug when performing multiple hardclip ops
bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds.

fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).
2011-12-14 16:57:47 -05:00
Eric Banks de5928ac5a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 16:24:56 -05:00
Eric Banks 4fddac9f22 Updating busted integration tests 2011-12-14 16:24:43 -05:00
Mark DePristo 01e547eed3 Parallel SAMDataSource initialization
-- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount
-- Added logger.info message noting progress and cost of reading low-level BAM data.
2011-12-14 16:14:26 -05:00
Mark DePristo 71b4bb12b7 Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Eric Banks 35fc2e13c3 Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all. 2011-12-14 15:31:09 -05:00
Eric Banks 1e90d602a4 Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles. 2011-12-14 13:38:20 -05:00
Eric Banks 988d60091f Forgot to add in the new result class 2011-12-14 13:37:15 -05:00
Eric Banks 106bf13056 Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays. 2011-12-14 12:05:50 -05:00
Eric Banks 7648521718 Add check for mixed genotype so that we don't exception out for a valid record 2011-12-14 11:26:43 -05:00
Eric Banks 9497e9492c Bug fix for complex records: do not ever reverse clip out a complete allele. 2011-12-14 11:21:28 -05:00
Eric Banks 09a5a9eac0 Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number. 2011-12-14 10:43:52 -05:00
Eric Banks d3f4a5a901 Fail gracefully when encountering malformed VCFs without enough data columns 2011-12-14 10:37:38 -05:00
Eric Banks 079932ba2a The log10cache needs to be larger if we want to handle 10K samples in the UG. 2011-12-13 23:36:10 -05:00
Ryan Poplin 7fa1ab1bae Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test 2011-12-13 17:19:40 -05:00
Eric Banks e47a113c9f Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right? 2011-12-12 23:02:45 -05:00
Mauricio Carneiro 5cc1e72fdb Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro a70a0f25fb Better debug output for SAMDataSource
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo d03425df2f TODO optimization targets 2011-12-12 17:39:51 -05:00
Laurent Francioli 7cf27bb66e Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval. 2011-12-12 12:22:43 +01:00
Laurent Francioli 025bdfe2cc Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-12 12:19:44 +01:00
Eric Banks 7b6338c742 Merge branch 'master' into trialleles 2011-12-11 00:28:46 -05:00
Eric Banks 7c4b9338ad The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now. 2011-12-11 00:23:33 -05:00
Eric Banks 044f211a30 Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly. 2011-12-10 23:57:14 -05:00
Eric Banks 364f1a030b Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone. 2011-12-09 14:25:28 -05:00
Mauricio Carneiro 8475328b2c Turning off test that breaks read clipper
until we define what is the desired behavior for clipping this particular case.
2011-12-09 11:53:12 -05:00
Roger Zurawicki 4cbd1f0dec Reorganized the testing code and created ClipReadsTestUtils
Tests are more rigorous and includes many more test cases.
We can tests custom cigars and the generated cigars.
     *Still needs debugging because code is not working.
Created test classes to be used across several tests.

Some cases are still commented out.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:52:34 -05:00