Commit Graph

1321 Commits (8a442d3c9f33dcda2e5ffececd06f0c5359b0df2)

Author SHA1 Message Date
David Roazen cdb1fa1105 Fix more tests that fail when run in parallel on the farm
-Allow the default S3 put timeout of 30 seconds for GATKRunReports
 to be overridden via a constructor argument, and use a timeout
 of 300 seconds for tests. The timeout remains 30 seconds in all
 other cases.

-Change integration tests that themselves dispatch farm jobs
 into pipeline tests. Necessary because some farm nodes are
 not set up as submit hosts. Pipeline tests are still run
 directly on gsa4.

-Bump up the timeout for the MaxRuntimeIntegrationTest even more
 (was still occasionally failing on the farm!)
2013-03-12 16:53:30 -04:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
David Roazen 3ab78543a7 Fix tests that were consistently or intermittently failing when run in parallel on the farm
-Make MaxRuntimeIntegrationTest more lenient by assuming that startup overhead
 might be as long as 120 seconds on a very slow node, rather than the original
 assumption of 20 seconds

-In TraverseActiveRegionsUnitTest, write temp bam file to the temp directory, not
 to the current working directory

-SimpleTimerUnitTest: This test was internally inconsistent. It asserted that
 a particular operation should take no more than 10 milliseconds, and then asserted
 again that this same operation should take no more than 100 microseconds (= 0.1 millisecond).
 On a slow node it could take slightly longer than 100 microseconds, however.
 Changed the test to assert that the operation should require no more than 10000 microseconds
 (= 10 milliseconds)

-change global default test timeout from 20 to 40 minutes (things just take longer
 on the farm!)

-build.xml: allow runtestonly target to work with scala test classes
2013-03-06 13:56:54 -05:00
Eric Banks 3759d9dd67 Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine.
* ReadTransformers can say they must be first, must be last, or don't care.
  * By default, none of the existing ones care about ordering except BQSR (must be first).
    * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
  * The engine now orders the read transformers up front before applying iterators.
  * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
  * Added unit tests.
2013-03-06 12:38:59 -05:00
Eric Banks 78721ee09b Added new walker to split MNPs into their allelic primitives (SNPs).
* Can be extended to complex alleles at some point.
  * Currently only works for bi-allelics (documented).
  * Added unit and integration tests.
2013-03-05 23:16:42 -05:00
Mark DePristo 42d3919ca4 Expanded functionality for writing BAMs from HaplotypeCaller
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment.  Refactored out the big main and supplementary static methods.  Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode.  This test only ensures that the code runs and doesn't exception out.  It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype.  Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made.  Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
2013-03-03 12:07:29 -05:00
Eric Banks ebd5404124 Fixed the add functionality of GenomeLocSortedSet.
* Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown.
 * Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases.
 * Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs.
 * Added more unit tests for add functionality of GLSS.
 * Resolves GSA-775.
2013-02-28 23:31:00 -05:00
Eric Banks 12fc198b80 Added better error message for BAMs with bad read groups.
* Split the cases into reads that don't have a RG at all vs. those with a RG that's not defined in the header.
  * Added integration tests to make sure that the correct error is thrown.
  * Resolved GSA-407.
2013-02-27 16:02:56 -05:00
David Roazen 2a7af43164 Fix improper dependencies in QScripts used by pipeline tests, and attempt to fix the flawed MisencodedBaseQualityUnitTest
-Some QScripts used by public pipeline tests unnecessarily used the (now protected) UnifiedGenotyper.
 Changed them to use PrintReads instead.

-Moved ExampleUnifiedGenotyperPipelineTest to protected

-Attempt to fix the flawed and sporadically failing MisencodedBaseQualityUnitTest:

   After looking at this class a bit, I think the problem was the use of global arrays for the quals
   shared across all reads in all tests (BAMRecord class definitely does not make a separate copy for
   each read!). One test (testFixBadQuals) modifies the bad quals array, and if this happens to run
   before the testBadQualsThrowsError test the bad quals array will have been "fixed" and no exception
   will be thrown.
2013-02-27 04:45:53 -05:00
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00
depristo 51d618de97 Merge pull request #62 from broadinstitute/rp_increase_max_kmer_in_assembly
The maximum kmer length is derived from the reads.
2013-02-26 05:37:02 -08:00
Ryan Poplin 89e2943dd1 The maximum kmer length is derived from the reads.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
2013-02-25 14:40:25 -05:00
David Roazen 3645ea9bb6 Sequence dictionary validation: detect problematic contig indexing differences
The GATK engine does not behave correctly when contigs are indexed
differently in the reads sequence dictionaries vs. the reference
sequence dictionary, and the inconsistently-indexed contigs are included
in the user's intervals. For example, given the dictionaries:

Reference dictionary = { chrM, chr1, chr2, ... }
BAM dictionary       = { chr1, chr2, ... }

and the interval "-L chr1", the engine would fail to correctly retrieve
the reads from chr1, since chr1 has a different index in the two dictionaries.

With this patch, we throw an exception if there are contig index differences
between the dictionaries for reads and reference, AND the user's intervals
include at least one of the mismatching contigs.

The user can disable this exception via -U ALLOW_SEQ_DICT_INCOMPATIBILITY

In all other cases, dictionary validation behaves as before.

I also added comprehensive unit tests for the (previously-untested)
SequenceDictionaryUtils class.

GSA-768 #resolve
2013-02-25 11:14:22 -05:00
Ryan Poplin 6a639c8ffc Replace Smith-Waterman alignment with the bubble traversal.
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
2013-02-22 15:42:16 -05:00
Mauricio Carneiro 4ac50c89ad Updating TestNG to the latest version
-- changed SkipException constructors that are now private in TestNG
-- Updated build.xml to use the latest testng
-- Added guice dependency to ivy
-- Fixed broken SampleDBUnitTest

The SampleDBUnitTest was only passing before because the map comparison in the old TestNG was broken. It was comparing two DIFFERENT samples and testing for "equals"

GSA-695 #resolve
2013-02-22 09:40:23 -05:00
Mark DePristo 8ac6d3521f Vast improvements to AssessNA12878 code and functionality
-- AssessNA12878 now breaks out multi-allelics into bi-allelic components.  This means that we can properly assess multi-allelic calls against the bi-allelic KB
-- Refactor AssessNA12878, moving into assess package in KB.  Split out previously private classes in the walker itself into separate classes.  Added real docs for all of the classes.
-- Vastly expand (from 0) unit tests for NA12878 assessments
-- Allow sites only VCs to be evaluated by Assessor
-- Move utility for creating simple VCs from a list of string alleles from GATKVariantContextUtilsUnitTest to GATKVariantContextUtils
-- Assessor bugfix for discordant records at a site.  Previous version didn't handle properly the case where one had a non-matching call in the callset w.r.t. the KB, so that the KB element was eaten during the analysis.  Fixed.  UnitTested
-- See GSA-781 -- Handle multi-allelic variants in KB for more information
-- Bugfix for missing site counting in AssessNA12878.  Previous version would count N misses for every missed value at a site.  Not that this has much impact but it's worth fixing
-- UnitTests for BadSitesWriter
-- UnitTests for filtered and filtering sites in the Assessor
-- Cleanup end report generation code (simply the code).  Note that instead of "indel" the new code will print out "INDELS"
-- Assessor DoC calculations now us LIBS and RBPs for the depth calculation.  The previous version was broken for reduced reads.  Added unit test that reads a complex reduced read example and matches the DoC of this BAM with the output of the GATK DoC tool here.
-- Added convenience constructor for LIBS using just SAMFileReader and an iterator.  It's now easy to create a LIBS from a BAM at a locus.  Added advanceToLocus function that moves the LIBS to a specific position.  UnitTested via the assessor (which isn't ideal, but is a proper test)
2013-02-21 20:43:12 -05:00
Mark DePristo 29319bf222 Improved allele trimming code in GATKVariantContextUtils
-- Now supports trimming the alleles from both the reverse and forward direction.
-- Added lots of unit tests for forwrad allele trimming, as well as creating VC from forward and reverse trimming.
-- Added docs and tests for the code, to bring it up to GATK spec
2013-02-21 12:01:43 -05:00
Mark DePristo 910d966428 Extend timeout of NanoScheduler deadlock tests
-- The previous timeout of 1 second was just dangerously short.  Increase the timeout to 10 seconds
2013-02-19 20:25:25 -05:00
Mauricio Carneiro 371ea2f24c Fixed IndelRealigner reference length bug (GSA-774)
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads

GSA-774 #resolve
2013-02-19 16:00:36 -05:00
Mark DePristo be45edeff2 ActivityProfile and ActiveRegions respects engine interval boundaries
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet.  Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
2013-02-18 10:40:25 -05:00
Mark DePristo 3b67aa8aee Final edge case bug fixes to QualityUtil routines
-- log10 functions in QualityUtils allow -Infinity to allow log10(0.0) values
-- Fix edge condition of log10OneMinusX failing with Double.MIN_VALUE
-- Fix another edge condition of log10OneMinusX failing with a small but not min_value double
2013-02-16 07:31:38 -08:00
Mark DePristo 9e28d1e347 Cleanup and unit tests for QualityUtils
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value.  Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual.  Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils.  Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
2013-02-16 07:31:37 -08:00
droazen 664960373d Merge pull request #31 from broadinstitute/yf_fast_BAM_index_traversal
-re-enables fast BAM indexing
2013-02-15 09:12:32 -08:00
MauricioCarneiro 1dd284a5bb Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720
PrintReads writes a header when used with -BQSR
2013-02-15 07:18:28 -08:00
MauricioCarneiro b58a0eca6b Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62
Refactored GATKDocs categories some more ( GSATDG-62 )
2013-02-14 22:35:07 -08:00
Tad Jordan 6cb80591e3 PrintReads writes a header when used with -BQSR 2013-02-14 22:19:14 -05:00
Yossi Farjoun 3a7c8c13e2 Re-enabled fastBAMindexing by replacing the FileChannel with a SeekableBufferedStream
This helps a lot since FileChannel is very low-level and traversing the BAMIndex involves lots of short reads.

- Fixed a deterioration in BAMIndex due to rev'ed picard (see below)
- Added unit tests for SeekableBufferedStream
- Added integrationTests for GATKBAMIndex (in PileupWalkerIntegrationTest)
- Added a runtime-test to verify that the amount read equals the amount requested.
- Added failing tests with expectedExceptions
- Used a DataProvider to make code nicer
2013-02-14 17:51:15 -05:00
Mark DePristo f92328a1a1 Extend default timeout to 20 minutes
-- The default of 10 minutes is right on the edge for some tests, and we really want a default not to enforce a max time (test should be short) but to stop testng from failing to terminate ever in the case where some test is truly hung
2013-02-13 17:43:40 -08:00
Geraldine Van der Auwera 6208742f7c Refactored GATKDocs categories some more ( GSATDG-62 )
-- Renamed ValidatePileup to CheckPileup since validation is reserved word
-- Renamed AlignmentValidation to CheckAlignment (same as above)
-- Refactored category definitions to use constants defined in HelpConstants
-- Fixed a couple of minor typos and an example error
-- Reorganized the GATKDocs index template to use supercategories
-- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)
2013-02-13 16:49:18 -05:00
Mark DePristo ca76de0619 Move ProcessUtilsUnitTest to private 2013-02-09 12:34:45 -05:00
MauricioCarneiro f5e52b72ea Merge pull request #23 from broadinstitute/md_process_utils_unit_tests
UnitTests for ProcessUtils
2013-02-09 09:27:31 -08:00
Mark DePristo b127fc6a1a Expand NGSPlatform to meet SAM 1.4 spec, with full unit tests
-- Added CAPILLARY and HELICOS platforms as required by spec 1.4
-- Added extensive unit tests to ensure NGSPlatform functions work as expected.
-- Fixed some NPE bugs for reads that don't have RGs or PLs in their RG fields
2013-02-09 11:16:21 -05:00
Mark DePristo fc3307a97f UnitTests for ProcessUtils 2013-02-09 10:13:01 -05:00
Mauricio Carneiro 5f49c95cc1 Added distance across contigs calculation to GenomeLocs
-- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader)
-- unit test added
GSATDG-45
2013-02-07 16:31:41 -05:00
Eric Banks 9826192854 Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class!
* AlignmentUtils.getMismatchCount()
* AlignmentUtils.calcAlignmentByteArrayOffset()
* AlignmentUtils.readToAlignmentByteArray().
* AlignmentUtils.leftAlignIndel()
2013-02-07 13:04:24 -05:00
David Roazen e7e76ed76e Replace org.broadinstitute.variant with jar built from the Picard repo
The migration of org.broadinstitute.variant into the Picard repo is
complete. This commit deletes the org.broadinstitute.variant sources
from our repo and replaces it with a jar built from a checkout of the
latest Picard-public svn revision.
2013-02-05 17:24:25 -05:00
Mauricio Carneiro f6bc5be6b4 Fixing license on Yossi's file
Somebody needs to set up the license hook ;-)
2013-02-05 11:14:43 -05:00
MauricioCarneiro 050c4794a5 Merge pull request #11 from yfarjoun/per_sample2
-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @A...
2013-02-05 08:04:29 -08:00
Eric Banks 00c98ff0cf Need to reset the static counter before tests are run or else we won't be deterministic.
Also need to give credit where credit is due: David was right that this was not a non-deterministic Bamboo failure...
2013-02-05 10:41:46 -05:00
Yossi Farjoun de03f17be4 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixure of two broadly concented samples) were added
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is
   _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests to the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or unifiedGenotyper, into a new IntegrationTest class.
2013-02-04 18:24:36 -05:00
Mark DePristo a281fa6548 Resolves Genome Sequence Analysis GSA-750 Don't print an endless series of starting messages from the ProgressMeter
-- The progress meter isn't started until the GATK actually calls execute on the microscheduler.  Now we get a message saying "Creating shard strategy" while this (expensive) operation runs
2013-02-04 15:47:30 -05:00
Mark DePristo 6382d5bdc9 Final cleanup and unit testing for GATKRunReport
-- Bringing code up to document, style, and code coverage specs
-- Move GATKRunReportUnitTest to private
-- Fully expand GATKRunReportUnitTests to coverage writing and reading GATKRunReport to local disk, to standard out, to AWS.
-- Move documentation URL from GATKRunReport to UserException
-- Delete a few unused files from s3GATKReport
-- Added capabilities to GATKRunReport to make testing easier
-- Added capabilities to deserialize GATKRunReports from an InputStream
2013-02-02 15:06:56 -05:00
David Roazen c6581e4953 Update MD5s to reflect version number change in the BAM header
I've confirmed via a script that all of these differences only
involve the version number bump in the BAM headers and nothing
else:

< @HD   VN:1.0  GO:none SO:coordinate
---
> @HD   VN:1.4  GO:none SO:coordinate
2013-02-01 13:51:31 -05:00
David Roazen c4b0ba4d45 Temporarily back out the Picard team's patches to GATKBAMIndex from December
These patches to GATKBAMIndex are causing massive BAM index reading errors in
combination with the latest version of Picard. The bug is either in the patches
themselves or in the underlying SeekableBufferedStream class they rely on. Until
the cause can be identified, we are temporarily backing out these changes so that
we can continue to run with the latest Picard/Tribble.

This reverts commits:
81483ec21e528790dfa719d18cdee27d577ca98e
68cf0309db490b79eecdabb4034987ff825ffea8
54bb68f28ad5fe1b3df01702e9c5e108106a0176
2013-02-01 13:51:31 -05:00
Mark DePristo 6d9816f1a5 Cleanup unused utils functions, and add unit test for one (append) 2013-02-01 13:51:31 -05:00
Mark DePristo 206eab80e3 Expanded unit tests for AlignmentUtils
-- Added JIRA entries for the remaining capabilities to be fixed up and unit tested
2013-02-01 13:51:31 -05:00
Mark DePristo c3c4e2785b UnitTest for calcNumHighQualityBases in AlignmentUtils 2013-01-31 13:57:23 -05:00
Ryan Poplin 496727ac5e Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-31 11:51:08 -05:00
Ryan Poplin ac033ce41a Intermediate commit of new bubble assembly graph traversal algorithm for the HaplotypeCaller. Adding functionality for a path from an assembly graph to calculate its own cigar string from each of the bubbles instead of doing a massive Smith-Waterman alignment between the path's full base composition and the reference. 2013-01-31 11:32:19 -05:00
Eric Banks 9c0207f8ef Fixing BQSR/BAQ bug:
If a read had an existing BAQ tag, was clipped by our engine, and couldn't have the BAQ recalculated (for whatever reason), then we would
fail in the BQSR because we would default to using the old tag (which no longer matched the length of the read bases).
The right thing to do here is to remove the old BAQ tag when RECALCULATE and ADD_TAG are the BAQ modes used but BAQ cannot be recalculated.
Added a unit test to ensure that the tags are removed in such a case.
2013-01-31 11:03:17 -05:00
Mark DePristo 404ee9a6e4 More aggressive checking of AWS key quality upon startup in the GATK 2013-01-31 09:08:38 -05:00
Mark DePristo b707331332 Encrypt GATK AWS keys using the GATK private key, and decrypt as needed as a resource when uploading to AWS logs
-- Has the overall effect that the GATK user AWS keys are no longer visible in the gatk source as plain text.  This will stop AWS from emailing me (they crawl the web looking for keys)
-- Added utility EncryptAWSKeys that takes as command line arguments the GATK user AWS access and secret keys, encrypts them with the GATK private key, and writes out the resulting file to resources in phonehome.
-- GATKRunReport now decrypts as needed these keys using the GATK public key as resources in the GATK bundle
-- Refactored the essential function of Resource (reading the resource) from IOUtils into the class itself.  Now how to get the data in the resouce is straightforward
-- Refactored md5 calculation code from a byte[] into Utils.  Added unit tests
-- Committing the encrypted AWS keys
-- #resolves https://jira.broadinstitute.org/browse/GSA-730
2013-01-30 16:42:23 -05:00
David Roazen 9985f82a7a Move BaseUtils back to the GATK by request, along with associated utility methods 2013-01-30 13:09:44 -05:00
Mark DePristo 1ff78679ca UnitTesting example for copying
-- Example combinatorial unit tests, plus unit tests that create reads and bam files, pileups, variant context (from scratch and from a file), and genome locs
2013-01-30 11:19:08 -05:00
Eric Banks 9025567cb8 Refactoring the SimpleGenomeLoc into the now public utility UnvalidatingGenomeLoc and the RR-specific FinishedGenomeLoc.
Moved the merging utility methods into GenomeLoc and moved the unit tests around accordingly.
2013-01-30 10:45:29 -05:00
Mark DePristo 45603f58cd Refactoring and unit testing GenomeLocParser
-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts.  The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit
2013-01-30 09:47:47 -05:00
Mark DePristo 8562bfaae1 Optimize GenomeLocParser.createGenomeLoc
-- The new version is roughly 2x faster than the previous version.  The key here was to cleanup the workflow for validateGenomeLoc and remove the now unnecessary synchronization blocks from the CachingSequencingDictionary, since these are now thread local variables
-- #resolves https://jira.broadinstitute.org/browse/GSA-724
2013-01-30 09:47:47 -05:00
Mark DePristo 69dd5cc902 AutoFormattingTimeUnitTest should be in utils 2013-01-30 09:47:47 -05:00
Mark DePristo 92c5635e19 Cleanup, document, and unit test ActiveRegion
-- All functions tested.  In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads.  New version should be correct
-- Enforce correct ordering of supporting states in constructor
-- Enforce read ordering when adding reads to an active region in add
-- Fix bug in HaplotypeCaller map with new updating read spans.  Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence
-- Encapsulate isActive field with an accessor function
-- Make sure that all state lists are unmodifiable, and that the docs are clear about this
-- ActiveRegion equalsExceptReads is for testing only, so make it package protected
-- ActiveRegion.hardClipToRegion must resort reads as they can become out of order
-- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region.  The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed.  Has a minor impact on PLs and RankSumTest values
-- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy
2013-01-30 09:47:12 -05:00
David Roazen 6449c320b4 Fix the CachingIndexedFastaSequenceFileUnitTest
BaseUtils.convertIUPACtoN() no longer throws a UserException,
since it's in org.broadinstitute.variant
2013-01-29 21:07:16 -05:00
Mauricio Carneiro 29fd536c28 Updating licenses manually
Please check that your commit hook is properly pointing at ../../private/shell/pre-commit

Conflicts:
	public/java/test/org/broadinstitute/variant/VariantBaseTest.java
2013-01-29 17:27:53 -05:00
David Roazen a536e1da84 Move some VCF/VariantContext methods back to the GATK based on feedback
-Moved some of the more specialized / complex VariantContext and VCF utility
 methods back to the GATK.

-Due to this re-shuffling, was able to return things like the Pair class back
 to the GATK as well.
2013-01-29 16:56:55 -05:00
Guillermo del Angel 5995f01a01 Big intermediate commit (mostly so that I don't have to go again through merge/rebase hell) in expanding BQSR capabilities. Far from done yet:
a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
2013-01-28 19:55:46 -05:00
David Roazen f63f27aa13 org.broadinstitute.variant refactor, part 2
-removed sting dependencies from test classes
-removed org.apache.log4j dependency
-misc cleanup
2013-01-28 09:03:46 -05:00
David Roazen 3744d1a596 Collapse the downsampling fork in the GATK engine
With LegacyLocusIteratorByState deleted, the legacy downsampling implementation
was already non-functional. This commit removes all remaining code in the
engine belonging to the legacy implementation.
2013-01-28 01:50:30 -05:00
Mark DePristo 63913d516f Add join call to Progress meter unit test so we actually know the daemon thread has finished 2013-01-27 16:52:45 -05:00
Mark DePristo c97a361b5d Added realistic BandPassFilterUnitTest that ensures quality results for 1000G phase I VCF and NA12878 VCF
-- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions.  This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined.  This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile
-- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with
2013-01-27 14:10:08 -05:00
Mauricio Carneiro 705cccaf63 Making SplitReads output FastQ's instead of BAM
- eliminates one step in my pipeline
   - BAM is too finicky and maintaining parameters that wouldn't be useful was becoming a headache, better avoided.
2013-01-27 02:36:31 -05:00
Mauricio Carneiro 6ea7133d95 Updating licenses of latest moved files 2013-01-26 13:46:52 -05:00
Ami Levy-Moonshine 99cb8d68e9 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-25 16:07:38 -05:00
Mark DePristo b8c0b05785 Add contract to ensure that getAdapterBoundary returns the right result
-- Also renamed the function to getAdaptorBoundary for consistency across the codebase
2013-01-25 16:05:17 -05:00
Mark DePristo e445c71161 LIBS optimization for adapter clipping
-- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS
-- Added unittests to cover adapter clipping
2013-01-25 16:05:17 -05:00
Ami Levy-Moonshine b4447cdca2 In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the samples names are unique in VariantContextUtils.simpleMerge for each VCs. It couse to a bug that was reported on the forum (when a VCs had 2 VC from the same sample).
Now we will check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unitTests in CVunitTest.java.
2013-01-25 15:49:51 -05:00
Mark DePristo 592f90aaef ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events.  The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
2013-01-24 13:48:00 -05:00
Mark DePristo 0c94e3d96e Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp
-- Previously we allowed band pass filter size to be specified along with the sigma.  But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel.  This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
2013-01-24 13:47:59 -05:00
Mark DePristo 9e43a2028d Making band pass filter size, sigma, active region max size and extension all accessible from the command line 2013-01-24 13:47:59 -05:00
Mark DePristo ee8039bf25 Fix trivial call in unit test 2013-01-23 13:51:58 -05:00
Mark DePristo 8e8126506b Renaming IncrementalActivityProfile to ActivityProfile
-- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV
2013-01-23 13:46:01 -05:00
Mark DePristo e917f56df8 Remove old ActivityProfile and old BandPassActivityProfile 2013-01-23 13:46:01 -05:00
Mark DePristo 7fd27a5167 Add band pass filtering activity profile
-- Based on the new incremental activity profile
-- Unit Tested!  Fixed a few bugs with the old band pass filter
-- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties
-- Add new UnitTest for BandPassIncrementalActivityProfile
-- Added normalizeFromRealSpace to MathUtils
-- Cleanup unused code in new activity profiles
2013-01-23 13:46:01 -05:00
Mark DePristo eb60235dcd Working version of incremental active region traversals
-- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version.  This means that ART walkers will now take much less memory than previously.  On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory.  Over the whole chr20 only 5K reads were ever held in ART at one time.
-- Fixed bug in the way active regions worked with shard boundaries.  The new implementation no longer see shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries.  This behavior was actually encoded in the unit tests, so those needed to be updated as well.
-- Changed the way that preset regions work in ART.  The new contract ensures that you get exactly the regions you requested.  the isActive function is still called, but its result has no impact on the regions.  With this functionality is should be possible to use the HC as a generic assembly by forcing it to operate over very large regions
-- Added a few misc. useful functions to IncrementalActivityProfile
2013-01-23 13:46:00 -05:00
Mark DePristo e050f649fd IncrementalActivityProfile, complete with extensive unit tests
-- This is an activity profile compatible with fetching its implied active regions incrementally, as activity profile states are added
2013-01-23 13:45:21 -05:00
Mark DePristo 8d9b0f1bd5 Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile
-- Required before I jump in an redo the entire activity profile so it's can be run imcrementally
-- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class.
-- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality.  Almost all of the misc. walker changes are due to this name update
-- Code cleanup and docs for TraverseActiveRegions
-- Expanded unit tests for ActivityProfile and ActivityProfileState
2013-01-23 13:45:21 -05:00
Mark DePristo 42b807a5fe Unit tests for ActivityProfileResult 2013-01-23 13:45:20 -05:00
Mauricio Carneiro 7b8b064165 Last manual license update (hopefully)
if everyone updates their git hook accordingly, this will be the last time I have to manually run the script.

GSATDG-5
2013-01-18 16:13:07 -05:00
Mark DePristo 738c24a3b1 Add tests to ensure that all insertion reads appear in the active region traversal 2013-01-16 16:25:36 -05:00
Mark DePristo 2a42b47e4a Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
2013-01-16 15:30:00 -05:00
Mark DePristo 4d0e7b50ec ArtificialBAMBuilder utility class for creating streams of GATKSAMRecords with a variety of properties
--  Allows us to make a stream of reads or an index BAM file with read having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start)
-- This stream can be handed back to the caller immediately, or written to an indexed BAM file
-- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)
2013-01-16 15:29:59 -05:00
Eric Banks ec1cfe6732 Oops, forgot to add 1 of my files 2013-01-16 15:05:49 -05:00
Eric Banks e47a389b26 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 14:59:11 -05:00
Eric Banks d18dbcbac1 Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base.
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0?  When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
2013-01-16 14:55:33 -05:00
Khalid Shakir 4ffb43079f Re-committing the following changes from Dec 18:
Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms.
Updated SelectHeaders to print out full interval arguments.
Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.
2013-01-16 12:43:15 -05:00
Eric Banks 392b5cbcdf The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).

As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases.  This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests.  Fortunately that walker is being deprecated soon...
2013-01-16 10:22:43 -05:00
Mark DePristo 3c37ea014b Retire original TraverseActiveRegion, leaving only the new optimized version
-- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests
2013-01-15 10:24:45 -05:00
Mark DePristo 39bc9e999d Add a test to LocusIteratorByState to ensure that we aren't holding reads anywhere
-- Run an iterator with 100Ks of reads, each carrying MBs of byte[] data, through LIBS, all starting at the same position.  Will crash with an out-of-memory error if we're holding reads anywhere in the system.
-- Is there a better way to test this behavior?
2013-01-14 16:30:16 -05:00
Mark DePristo 5a5422e4f8 Refactor PerSampleReadStates into a separate class
-- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager
-- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager
-- Make LocusIteratorByState final
2013-01-14 16:30:16 -05:00
Mark DePristo 19288b007d LIBS bugfix: kept reads now only (correctly) includes reads that at least passed the reservoir
-- Added unit tests to ensure this behavior is correct
2013-01-14 16:30:16 -05:00
Mark DePristo 83fcc06e28 LIBS optimizations and performance tools
-- Made LIBSPerformance a full featured CommandLineProgram, and it can be used to assess the LIBS performance by reading a provided BAM
-- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls
-- Moved updateReadStates to ReadStateManager
-- Removed the unnecessary wrapping of an iterator in ReadStateManager
-- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get iterating over samples.  Now those are just map native iterations
-- Restructured collectPendingReads for simplicity, removing redundant and consolidating common range checks.  The new piece is code is much clearer and avoids several unnecessary function calls
2013-01-14 16:30:15 -05:00
Mark DePristo ec05ecef60 getAdaptorBoundary returns an int, not an Integer, as this was taking 30% of the allocation effort for LIBS 2013-01-14 16:30:15 -05:00
Mark DePristo e88dae2758 LocusIteratorByState operates natively on GATKSAMRecords now
-- Updated code to reflect this new typing
2013-01-11 15:17:18 -05:00