Commit Graph

658 Commits (f9f8589692fece0185a7e8e059b75ee4672d1c8d)

Author SHA1 Message Date
Mark DePristo c8a06e53c1 DoC now properly handles reference N bases + misc. additional cleanups
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensures the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
2012-02-25 11:32:50 -05:00
Mark DePristo 50de1a3eab Fixing bad VCFIntegration tests
-- Left disabled a test that should have been enabled
-- Didn't add the md5 to the test I actually added
-- Now VCFIntegrationTests should be working!
2012-02-25 11:26:36 -05:00
Mark DePristo e0c189909f Added support for breakpoint alleles
-- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Added integrationtest to ensure that we can parse and write out breakpoint example
2012-02-23 12:14:48 -05:00
Mauricio Carneiro 75783af6fc int <-> BitSet conversion utils for MathUtils
* added unit tests.
2012-02-21 14:10:36 -05:00
Eric Banks ad90af94ed Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-13 15:10:10 -05:00
Eric Banks 0920a1921e Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics. 2012-02-13 15:09:53 -05:00
Eric Banks 14981bed10 Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records. 2012-02-13 14:32:03 -05:00
Ryan Poplin 41ffd08d53 On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live. 2012-02-13 12:35:09 -05:00
Eric Banks f52f1f659f Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1. 2012-02-10 14:15:59 -05:00
Eric Banks 5e18020a5f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-10 11:08:33 -05:00
Eric Banks f53cd3de1b Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent. 2012-02-10 11:07:32 -05:00
Mauricio Carneiro 5af373a3a1 BQSR with indels integrated!
* added support to base before deletion in the pileup
   * refactored covariates to operate on mismatches, insertions and deletions at the same time
   * all code is in private so original BQSR is still working as usual in public
   * outputs a molten CSV with mismatches, insertions and deletions, time to play!
   * barely tested, passes my very simple tests... haven't tested edge cases.
2012-02-09 18:46:45 -05:00
Eric Banks 7a937dd1eb Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly. 2012-02-09 16:14:22 -05:00
Mauricio Carneiro d561914d4f Revert "First implementation of GATKReportGatherer"
premature push from my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.

This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
2012-02-08 23:28:55 -05:00
Eric Banks 2f800b078c Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again. 2012-02-08 15:27:16 -05:00
Mauricio Carneiro 337819e791 disabling the test while we fix it 2012-02-07 19:22:32 -05:00
Roger Zurawicki c0c676590b First implementation of GATKReportGatherer
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns, match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG

Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public

The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer form working

Working GATKReportGatherer
Has only the functional to addLines
The input file parser assumes that the first column is the primary key

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-02-07 18:14:47 -05:00
Mauricio Carneiro e1d69e4060 make the size of a GenomeLoc int instead of long
it will never be bigger than an int and it's actually useful to be an int so we can use it as parameters to array/list/hash size creation.
2012-02-03 17:12:42 -05:00
Mauricio Carneiro d5d4fa8a88 Fixed discordance bug reported by Brad Chapman
discordance now reports discordance between genotypes as well (just like concordance)
2012-01-30 09:50:45 -05:00
Mauricio Carneiro 2a565ebf90 embarrassing fix-up, thanks Khalid. 2012-01-26 19:58:42 -05:00
Mauricio Carneiro 246e085ec9 Unit tests for GATKSAMRecord class
* new unit tests for the alignment shift properties of reduce reads
   * moved unit tests from ReadUtils that were actually testing GATKSAMRecord, not any of the ReadUtils to it.
   * cleaned up ReadUtilsUnitTest
2012-01-26 17:06:36 -05:00
Ryan Poplin cdff23269d HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close. 2012-01-26 15:56:33 -05:00
Eric Banks ddaf51a50f Updated one integration test for indels 2012-01-25 19:18:51 -05:00
Eric Banks e349b4b14b Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod. 2012-01-25 11:35:54 -05:00
Mauricio Carneiro ffd61f4c1c Refactor the Pileup Element with regards to indels
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.

   * Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
   * Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
   * Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
   * Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
   * Every deletion now has an offset (start of the event)
   * Fixed bug when LocusITeratorByState found a read starting with insertion that happened to be a reduced read.
   * Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
   * Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
   * Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
   * The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!

 phew!
2012-01-24 16:07:21 -05:00
Khalid Shakir c18beadbdb Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Mark DePristo 02450e4b12 Merged bug fix from Stable into Unstable 2012-01-23 12:08:39 -05:00
Mark DePristo 80a4ce0edf Bugfix for incorrect error messages for missing BAMs and VCFs
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
Christopher Hartl 4a08e8ca6e Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down either to the accounting of genotypes changed (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./. .)
This is somewhat of an arbitrary decision, and is negotiable. I could see treating

GT:PL   ./.:.

differently from

GT:PL   .:0,3,6

but am not sure the worth of doing so.
2012-01-23 08:25:34 -05:00
Eric Banks ab8f499bc3 Annotate with FS even for filtered sites 2012-01-18 22:04:51 -05:00
Ryan Poplin 0268da7560 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-18 09:53:00 -05:00
Ryan Poplin 60024e0d7b updating TDT integration test 2012-01-18 09:52:50 -05:00
Mark DePristo 0c7865fdb5 UnitTest for reverseAlleleClipping
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mauricio Carneiro cec7107762 Better location for the downsampling of reads in PrintReads
* using the filter() instead of map() makes for a cleaner walker.
   * renaming the unit tests to make more sense with the other unit and integration tests
2012-01-14 14:06:09 -05:00
Mauricio Carneiro 28aa353501 Added "unbiased" downsampling parameter to PrintReads
* also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.
2012-01-12 16:33:55 -05:00
Matt Hanna 2c3176eb80 Merged bug fix from Stable into Unstable 2012-01-12 13:31:10 -05:00
Matt Hanna cd43f016ce Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior. 2012-01-12 13:29:11 -05:00
Mauricio Carneiro 77a03c9709 Patching special case in the adaptor clipping
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
   * updated md5's accordingly
2012-01-11 17:47:44 -05:00
Eric Banks c5320ef1af Resolving changes in integration test during merge 2012-01-10 12:14:16 -05:00
Eric Banks 0f36f6947e Resolving merge conflicts 2012-01-10 11:44:16 -05:00
Eric Banks f2cecce10f Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously). 2012-01-10 11:34:23 -05:00
Mark DePristo dd80ffbbbe Merged bug fix from Stable into Unstable 2012-01-05 21:51:48 -05:00
Mark DePristo c96fee477c Bug fix for VariantSummary
-- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional.  Fixed.  Has the annoying side effect that indels > 50bp in size are not counted as indels, and so are substrated from both the novel and known counts for indels.  C'est la vie
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels.  Using this more recent and representative file probably a good idea for more future tests in VE and other tools.  File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data
2012-01-05 21:51:06 -05:00
Guillermo del Angel 58d4539304 Enabled banded indel computation by default. Reversed logic in input UG argument so that we can still disable it if required. Minor changes to integration tests due to minor differences in GL's and in annotations 2012-01-04 15:28:26 -05:00
David Roazen 621ee2b613 Merged bug fix from Stable into Unstable 2012-01-03 16:56:49 -05:00
David Roazen ea6e718cb8 SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline.
For now, we recommend only running with the GRCh37.64 database.
2012-01-03 15:18:36 -05:00
David Roazen 4984ca5e31 Merged bug fix from Stable into Unstable 2012-01-03 11:03:30 -05:00
David Roazen f3f01da1af Enforce serial dependencies in RecalibrationWalkersIntegrationTest
Some tests in this class were intermittently not being executed due
to being randomly scheduled before tests whose results they depend on.
Now the serial dependencies are enforced to avoid problematic orderings.
2012-01-03 10:42:41 -05:00
Eric Banks ab8d47d9a5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-03 09:38:49 -05:00
Mauricio Carneiro cd68cc239b Added knuth-shuffle (KS) and randomSubset using KS to MathUtils
* Knuth-shuffle is a simple, yet effective array permutator (hope this is good english).
         * added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation.
         * added unit tests to both functions
2012-01-03 09:29:46 -05:00
Mauricio Carneiro 94791a2a75 Add support for reads starting with insertion
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
      * Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
      * Updated all ReadClipper unit tests
      * ReduceReads does not hard clip leading insertions by default anymore
      * SlidingWindow adjusts start location if read starts with insertion
      * SlidingWindow creates an empty element with insertions to the right
      * Fixed all potential divide by zero with totalCount() (from BaseCounts)
      * Updated all Integration tests
      * Added new integration test for multiple interval reducing
2012-01-03 09:29:45 -05:00
Mauricio Carneiro 1b6d52817e fixing adaptor clipping effect on recalibration integration test 2012-01-01 22:20:06 -05:00
Eric Banks 393993e0c7 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-31 20:42:46 -05:00
Mauricio Carneiro 55cfa76cf3 Updated integration tests for the new adaptor clipping fix. 2011-12-30 18:47:14 -05:00
Mauricio Carneiro c7d0a9ebee Forgot to test for inter-chromosomal mates in the adaptor clipping
* Fixing bug caught by Eric (and Kristian)
2011-12-30 00:19:53 -05:00
Eric Banks 1a45ea5a05 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-29 11:37:15 -05:00
Eric Banks d20a25d681 A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently). 2011-12-27 16:50:38 -05:00
Mauricio Carneiro 17bfe48d5e Made all class methods private in the ReadClipper
* ReadClipperUnitTest now uses static methods
 * Haplotype caller now uses static methods
 * Exon Junction Genotyper now uses static methods
2011-12-27 02:11:32 -05:00
David Roazen 506c0e9c97 Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline
SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released
and we've verified the quality of its annotations.
2011-12-23 19:12:57 -05:00
David Roazen 510c71158c Merged bug fix from Stable into Unstable 2011-12-22 10:49:52 -05:00
David Roazen 32cdef9682 Rename *PerformanceTest test classes to *LargeScaleTest
This is in preparation for the installation of the new performance test suite in Bamboo.

Note that "ant performancetest" is now "ant largescaletest"
2011-12-22 10:38:49 -05:00
Mauricio Carneiro 731a463415 Updated IntegrationTests with new adaptor clipper
phew!
2011-12-20 17:48:52 -05:00
Mauricio Carneiro cadff40247 getRefCoordSoftUnclippedStart and End refactor
These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips.

* Removed from ReadUtils
* Added to GATKSAMRecord
* Changed name to getSoftStart() and getSoftEnd
* Updated third party code accordingly.
2011-12-20 17:48:51 -05:00
Mauricio Carneiro f73ad1c2e2 Bugfix/Rewrite: Algorithm to determine adaptor boundaries
The algorithm wasn't accounting for the case where the read is the reverse strand and the insert size is negative.

    * Fixed and rewrote for more clarity (with Ryan, Mark and Eric).
    * Restructured the code to handle GATKSAMRecords only
    * Cleaned up the other structures and functions around it to minimize clutter and potential for error.
    * Added unit tests for all 4 cases of adaptor boundaries.
2011-12-20 17:48:39 -05:00
Mauricio Carneiro 78d9bf7196 Added REVERT_SOFTCLIPPED_BASES capability to ReadClipper
* New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches.
    * Added functionality to clipping op to revert all soft clip bases in a read into matches
    * Added revertSoftClipBases function to the ReadClipper for public use
    * Wrote systematic unit tests
2011-12-20 00:04:30 -05:00
Laurent Francioli 16cc2b864e - Corrected bug causing cases where both parents are HET to be accounted twice in the TDT calculation - Adapted TDT Integration test to corrected version of TDT
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2011-12-19 10:30:59 -05:00
Eric Banks 3069a689fe Bug fix: if there are multiple records at a given position, it turns out that SelectVariants would drop all variants that follow after one that fails filters (instead of dropping just the failing one). Added an integration test to cover this case. 2011-12-19 10:04:33 -05:00
Mauricio Carneiro 5b678e3b94 Remove ClippingOp UnitTests
* all testing functionality is in the ReadClipperUnitTest, no need to double test.
* class and package naming cleanup
2011-12-19 07:49:26 -05:00
Eric Banks 76bd13a1ed Forgot to update the unit test 2011-12-18 01:13:49 -05:00
Eric Banks 07f9d14d9f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-18 00:43:15 -05:00
Eric Banks c5ffe0ab04 No reason to sum the normalized posteriors array to get Pr(AF>0) given that we can just compute 1.0 - array[0]. Integration tests change only because of trivial precision artifacts for reference calls using EMIT_ALL_SITES. 2011-12-18 00:31:47 -05:00
Eric Banks 6dc52d42bf Implemented the proper QUAL calculation for multi-allelic calls. Integration tests pass except for the ones making multi-allelic calls (duh) and one of the SLOD tests (which used to print 0 when one of the LODs was NaN but now we just don't print the SB annotation for that record). 2011-12-18 00:01:42 -05:00
Khalid Shakir 6059ca76e8 Removing cruft that snuck in last commit. 2011-12-16 23:00:16 -05:00
Khalid Shakir 7486696c07 When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter.
Creating a single temporary directory per ant test run instead of a putting temp files across all runs in the same directory.
Updated various tests for above items and other small fixes.
2011-12-16 18:09:25 -05:00
Mauricio Carneiro e5df9e0684 cleaner test output
cleaned up the debug "pass" messages in the unit tests
2011-12-16 18:04:00 -05:00
Mauricio Carneiro fcc21180e8 Added hardClipLeadingInsertions UnitTest for the ReadClipper
fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case.

note: test is turned off until contract changes to allow hanging insertions (left/right).
2011-12-16 18:02:47 -05:00
Mauricio Carneiro 075be52adc Added hardClipByReferenceCoordinates (left and right tails) UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro 5bba44d693 Added hardClipByReferenceCoordinates UnitTest for the ReadClipper
* fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail.
* fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail.
* fixed AlignmentStart of a clipped read that results in only hard clips and soft clips

note: added tests to all these beautiful cases...
2011-12-16 18:01:33 -05:00
Mauricio Carneiro 5838ba529d Added hardClipByReadCoordinates UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro c26295919e Added hardClipBothEndsByReferenceCoordinates UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro e61e5c7589 Refactor of ReadClipper unit tests
* expanded the systematic cigar string space test framework Roger wrote to all tests
* moved utility functions into Utils and ReadUtils
* cleaned up unused classes
2011-12-15 19:05:43 -05:00
Mauricio Carneiro 4748ae0a14 Bugfix: Softclips before Hardclips weren't being accounted for
caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
2011-12-15 12:17:25 -05:00
Mauricio Carneiro 50dee86d7f Added unit test to catch Ryan's exception
Unit test to catch the special case that broke the clipping op, fixed in the previous commit.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro c85100ce9c Fix ClippingOp bug when performing multiple hardclip ops
bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds.

fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).
2011-12-14 16:57:47 -05:00
Eric Banks de5928ac5a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 16:24:56 -05:00
Eric Banks 4fddac9f22 Updating busted integration tests 2011-12-14 16:24:43 -05:00
Mark DePristo 71b4bb12b7 Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Eric Banks 1e90d602a4 Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles. 2011-12-14 13:38:20 -05:00
Mauricio Carneiro 5cc1e72fdb Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Laurent Francioli 7cf27bb66e Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval. 2011-12-12 12:22:43 +01:00
Laurent Francioli 025bdfe2cc Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-12 12:19:44 +01:00
Eric Banks 7b6338c742 Merge branch 'master' into trialleles 2011-12-11 00:28:46 -05:00
Eric Banks 7c4b9338ad The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now. 2011-12-11 00:23:33 -05:00
Eric Banks 044f211a30 Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly. 2011-12-10 23:57:14 -05:00
Mauricio Carneiro 8475328b2c Turning off test that breaks read clipper
until we define what is the desired behavior for clipping this particular case.
2011-12-09 11:53:12 -05:00
Roger Zurawicki 4cbd1f0dec Reorganized the testing code and created ClipReadsTestUtils
Tests are more rigorous and includes many more test cases.
We can tests custom cigars and the generated cigars.
     *Still needs debugging because code is not working.
Created test classes to be used across several tests.

Some cases are still commented out.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:52:34 -05:00
Roger Zurawicki 0e9c2cefa2 testHardClipSoftClippedBases works with Matches and Deletions
Insertions are a problem so cigar cases with "I" are commented out.
The test works with multiple deletions and matches.

This is still not a complete test. A lot of cigar test cases are commented out.

Added insertions to ReadClipperUnitTest

ReadClipper now tests for all indels.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:43:37 -05:00
Eric Banks 442ceb6ad9 The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors. 2011-12-09 10:16:44 -05:00
Laurent Francioli a79144f7db Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-09 15:57:24 +01:00
Laurent Francioli 72fbfba97d Added UnitTests for getFamilies() and getChildrenWithParents() 2011-12-09 15:57:07 +01:00
Eric Banks aa4a8c5303 No dynamic programming solution for assignning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes). 2011-12-09 02:25:06 -05:00
Eric Banks 2fe50c64da Updating md5s 2011-12-09 00:47:01 -05:00
Eric Banks 4aebe99445 Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation. 2011-12-08 15:31:02 -05:00
Mark DePristo 4055877708 Prints 0.0 TiTv not NaN when there are no variants
-- Updated md5
2011-12-07 12:07:54 -05:00
Mark DePristo 5d2212bc8e Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-07 09:03:17 -05:00
Eric Banks 79d18dc078 Fixing indexing bug on the ACsets. Added unit tests for the Exact model code. 2011-12-06 16:17:18 -05:00
Matt Hanna f5b977fc88 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-06 10:11:35 -05:00
Matt Hanna 4001c22a11 Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup. 2011-12-06 10:10:38 -05:00
Khalid Shakir 677bea0abd Right aligning GATKReport numeric columns and updated MD5s in tests.
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
2011-12-05 23:22:15 -05:00
Eric Banks 29662be3d7 Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them). 2011-12-03 23:12:04 -05:00
Mark DePristo 3060a4a15e Support for list of known CNVs in VariantEval
-- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument
-- Genericizes loading for intervals into interval tree by chromosome
-- GenomeLoc methods for reciprocal overlap detection, with unit tests
2011-11-30 17:05:16 -05:00
Laurent Francioli 9574be0394 Updated MendelianViolationEvaluator integration test 2011-11-30 14:44:15 +01:00
Laurent Francioli a4606f9cfe Merge branch 'MendelianViolation'
Conflicts:
	public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java
2011-11-30 11:13:15 +01:00
Laurent Francioli 7d58db626e Added MendelianViolationEvaluator integration test 2011-11-30 10:09:20 +01:00
Ryan Poplin 110298322c Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it. 2011-11-29 09:29:18 -05:00
Mark DePristo e60272975a Fix for changed MD5 in streaming VCF test 2011-11-23 19:01:33 -05:00
Mark DePristo 12f09d88f9 Removing references to SimpleMetricsByAC 2011-11-23 16:08:18 -05:00
Mark DePristo 4107636144 VariantEval updates
-- Performance optimizations
-- Tables now are cleanly formatted (floats are %.2f printed)
-- VariantSummary is a standard report now
-- Removed CompEvalGenotypes (it didn't do anything)
-- Deleted unused classes in GenotypeConcordance
-- Updates integration tests as appropriate
2011-11-23 13:02:07 -05:00
Mark DePristo e484625594 GenotypesContext now updates cached data for add, set, replace operations when possible
-- Involved separately managing the sample -> offset and sample sorted list operations.  This should improve performance throughout the system
2011-11-22 08:40:48 -05:00
Mark DePristo 29ca24694a UG now encoding NO_CALLs as ./. not ./.:.:4:0,0,0
A few updated UGs integration tests
2011-11-22 08:22:32 -05:00
Mark DePristo 2b51c01df4 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-21 19:16:06 -05:00
Mark DePristo 5443d3634a Again, fixing the add call when we really mean replace
-- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./.  Eric will fix long-standing bug in QD observed from this change
-- VFW MD5s restored to their old correct values.  There was a bug in my implementation to caused the genotypes to not be parsed from the lazy output even through the header was incorrect.
2011-11-21 19:15:56 -05:00
Mauricio Carneiro 5ad3dfcd62 BugFix: byte overflow in SyntheticRead compressed base counts
* fixed and added unit test
2011-11-21 17:11:50 -05:00
Mark DePristo 2c501364b8 GenotypesContext no longer have immutability in constructor
-- additional bug fixes throughout VariantContext and GenotypesContext objects
2011-11-21 14:34:31 -05:00
David Roazen 1296dd41be Removing the legacy -L "interval1;interval2" syntax
This syntax predates the ability to have multiple -L arguments, is
inconsistent with the syntax of all other GATK arguments, requires
quoting to avoid interpretation by the shell, and was causing
problems in Queue.

A UserException is now thrown if someone tries to use this syntax.
2011-11-21 13:18:53 -05:00
Mark DePristo 2e9ecf639e Generalized interface to LazyGenotypesContext
-- Now you provide a LazyParsing object
-- LazyGenotypesContext now knows nothing about the VCF parser itself.  The parser holds all of the necessary data to parse the VCF genotypes when necessarily, and the LGC only has a pointer to this object
-- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version
-- Deleted VCFParser interface, as it was no longer necessary
2011-11-21 09:30:40 -05:00
Mark DePristo f0ac588d32 Extensive unit test for GenotypeContextUnitTest
-- Currently only tests base class.  Adding subclass testing in a bit
2011-11-20 18:28:01 -05:00
Mark DePristo 9cb3fe3a59 Vastly better way of doing on-demand genotyping loading
-- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading.
-- This new class was replaced all of the old, complex functionality
-- Better still, there were many cases were the genotypes were being loaded unnecessarily, resulting in efficiency.  This was detected because some of the integration tests changed as the genotypes were no longer being parsing unnecessarily
-- Misc. bug fixes throughout the system
-- Bug fixes for PhaseByTransmission with new GenotypesContext
2011-11-20 08:23:09 -05:00
Mark DePristo 7d09c0064b Bug fixes and code cleanup throughout
-- chromosomeCounts now takes builder as well, cleaning up a lot of code throughout the codebase.
2011-11-19 18:40:15 -05:00
Mark DePristo 707bd30b3f Should have been @BeforeMethod 2011-11-19 16:10:09 -05:00
Mark DePristo 8f7eebbaaf Bugfix for pError not being checked correctly in CommonInfo
-- UnitTests to ensure correct behavior
-- UnitTests to ensure correct behavior for pass filters vs. failed filters vs. unfiltered
2011-11-19 15:58:59 -05:00
Mark DePristo b7b57ef39a Updating MD5 to reflect canonical ordering of calculation
-- We should no longer have md5s changing because of hashmaps changing their sort order on us
-- Added GenotypeLikelihoodsUnitTests
-- Refactored ExactAFCaclculation to put the PL -> QUAL calculation in the GenotypeLikelihoods class to avoid the code copy.
2011-11-19 15:57:33 -05:00
Mark DePristo 73119c8e3c Merge with master
-- A few bug fixes
2011-11-19 09:56:06 -05:00
Mark DePristo f685fff79b Killing the final versions of old new VariantContext interface 2011-11-18 21:32:43 -05:00
Mark DePristo 6cf315e17b Change interface to getNegLog10PError to getLog10PError 2011-11-18 21:07:30 -05:00
Matt Hanna 8bb4d4dca3 First pass of the asynchronous block loader.
Block loads are only triggered on queue empty at this point.  Disabled by
default (enable with nt:io=?).
2011-11-18 15:02:59 -05:00
Mark DePristo f54afc19b4 VariantContextBuilder
-- New approach to making VariantContexts modeled on StringBuilder
-- No more modify routines -- use VariantContextBuilder
-- Renamed isPolymorphic to isPolymorphicInSamples.   Same for mono
-- getChromosomeCount -> getCalledChrCount
-- Walkers changed to use new VariantContext.  Some deprecated new VariantContext calls remain
-- VCFCodec now uses optimized cached information to create GenotypesContext.
2011-11-18 12:39:10 -05:00
Mark DePristo 7490dbb6eb First version of VariantContextBuilder 2011-11-18 11:06:15 -05:00
Mark DePristo fa454c88bb UnitTests for VariantContext for chrCount, getSampleNames, Order function
-- Major change to how chromosomeCounts is computed.  Now NO_CALL alleles are always excluded.  So ChromosomeCounts(A/.) is 1, the previous result would have been 2.
-- Naming changes for getSamplesNameInOrder()
2011-11-17 20:37:22 -05:00
Mark DePristo 02f22cc9f8 No more VC integration tests. All tests are now unit tests 2011-11-17 15:33:09 -05:00
Khalid Shakir c50274e02e During flanking interval creation merging overlapping flanks so that on scatter the list doesn't accidentally genotype the same site twice.
Moved flanking interval utilies to IntervalUtils with UnitTests.
2011-11-17 13:56:42 -05:00
Eric Banks bad19779b9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-17 13:29:43 -05:00
Eric Banks 16a021992b Updated header description for the INFO and FORMAT DP fields to be more accurate. 2011-11-17 13:17:53 -05:00
Mark DePristo 7e66677769 Expanded UnitTests for VariantContext
Tests for
-- getGenotype and getGenotypes
-- subContextBySample
-- modify routines
2011-11-16 20:45:15 -05:00
Mauricio Carneiro 72f00e2883 Merging Roger's Unit tests for Reduce Reads from RR repository 2011-11-16 17:26:49 -05:00
Mark DePristo aa0610ea92 GenotypeCollection renamed to GenotypesContext 2011-11-16 16:24:05 -05:00
Mark DePristo 974daaca4d V13 version in archive. Can you pulled out wholesale for performance testing 2011-11-16 16:08:46 -05:00
Mark DePristo 101ffc4dfd Expanded, contrastive VariantContextBenchmark
-- Compares performance across a bunch of common operations with GATK 1.3 version of VariantContext and GATK 1.4
-- 1.3 VC and associated utilities copied wholesale into test directory under v13
2011-11-16 13:35:16 -05:00
Mark DePristo e56d52006a Continuing bugfixes to get new VC working 2011-11-16 10:39:17 -05:00
Eric Banks c2ebe58712 Merge remote-tracking branch 'Laurent/master' 2011-11-16 09:34:47 -05:00