Commit Graph

2215 Commits (7a8e2a8adabbd4e65f48aa85b0214da6f53f3fc4)

Author SHA1 Message Date
Eric Banks 37f56ce8fd A couple of minor updates to BQSR 2012-06-12 16:12:13 -04:00
Eric Banks 277493dd83 Yet more instances of Lists changed over to native arrays 2012-06-12 15:56:09 -04:00
Eric Banks 613badc835 Very minor optimizations for the context covariate 2012-06-12 15:47:32 -04:00
Eric Banks 0f79adb2aa Changing more Java Lists to native arrays in BQSR for performance optimization. 2012-06-12 15:41:01 -04:00
Eric Banks 1da3e43679 Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come. 2012-06-12 13:32:56 -04:00
Eric Banks a96c5da884 Oops, forgot to push the unit tests 2012-06-12 11:38:30 -04:00
Eric Banks fec0bd5e11 Fixing UG argument docs 2012-06-12 09:46:16 -04:00
Eric Banks a4defdfb29 Adding a GT header line to SomaticIndelDetector output 2012-06-12 09:39:17 -04:00
Eric Banks 891ce51908 Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements. 2012-06-12 09:19:36 -04:00
Eric Banks ff5749599d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 15:46:17 -04:00
Eric Banks fea625632f Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one 2012-06-11 15:45:58 -04:00
Ryan Poplin e4d371dc80 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-11 10:38:50 -04:00
Ryan Poplin 683d4b508e Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller. 2012-06-11 10:38:35 -04:00
Mauricio Carneiro 4aad7e23ef New ReduceReads v2 with unclipped variant regions and soft-clipped bases
* Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it.
   * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute
   * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads
   * Updated all integration tests
2012-06-08 14:58:31 -04:00
Eric Banks afa9b2718a Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-08 13:54:48 -04:00
Eric Banks 92280b4068 BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime. 2012-06-08 13:54:37 -04:00
Eric Banks 898a0e6161 Minor optimizations 2012-06-08 12:07:58 -04:00
Ryan Poplin 0a37e19998 Bug fix in VQSR so that the VCF index will be created for the recalFile. 2012-06-08 11:51:28 -04:00
Eric Banks d463ab2cbf BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible. 2012-06-08 10:42:42 -04:00
Eric Banks 2bd48a7351 Bad comments made it into the previous commit 2012-06-07 23:12:56 -04:00
Eric Banks 31c3a6be48 BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR. 2012-06-07 20:04:10 -04:00
Eric Banks 0fb9179f76 BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array 2012-06-07 19:41:03 -04:00
Ryan Poplin d449f169d3 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-07 10:56:55 -04:00
Ryan Poplin 0b4281fdd0 misc minor update to HC debug output for when there are a lot of samples 2012-06-07 10:56:41 -04:00
Eric Banks bad50a1b05 Fix docs 2012-06-06 22:45:38 -04:00
Eric Banks b093ba9dcc Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records) 2012-06-06 15:17:30 -04:00
Eric Banks 54f682a99c Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX. 2012-06-06 11:44:37 -04:00
Eric Banks dd46d843fb IR should skip Ion reads just like it does with 454 reads; Tim has confirmed that official platform name for Ion. 2012-06-06 11:04:55 -04:00
Guillermo del Angel 2cbd6e5f90 Merged bug fix from Stable into Unstable 2012-06-05 15:58:23 -04:00
Guillermo del Angel ce4dc2128d Adding minor clarification to -mbq argument documentation 2012-06-05 15:17:56 -04:00
Eric Banks e02ec8c8b6 Don't update the record ID unless we are actually going to emit the record 2012-06-04 14:58:50 -04:00
Eric Banks 8405156ae1 Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities. 2012-06-04 14:28:32 -04:00
Ryan Poplin f11e7ebc3a Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA. 2012-06-04 12:49:36 -04:00
Ryan Poplin 320956ee4b Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage. 2012-06-04 10:55:36 -04:00
Guillermo del Angel 7a54baf08c Merged bug fix from Stable into Unstable 2012-06-03 08:42:08 -04:00
Guillermo del Angel 47df7bbc14 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-06-03 08:38:54 -04:00
Guillermo del Angel 2ddbdee3bc Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow 2012-06-03 08:38:38 -04:00
Mauricio Carneiro 12a8c54f9a Fixing VCF header for filter elements (thanks Eric) 2012-06-01 15:45:15 -04:00
Eric Banks 3a15ba2102 Malformed VCF headers should be User Errors 2012-05-31 16:05:53 -04:00
Khalid Shakir c4f7df4dce When an underlying exception occurs because of the user error, if the exception instance does not include a message instead of telling the user "because null", tell them "because <exception class name>". 2012-05-30 16:39:06 -04:00
Ryan Poplin 421d0d1435 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-30 15:21:35 -04:00
Ryan Poplin 5dd811f84a Adding genotype given alleles mode to the HaplotypeCaller. 2012-05-30 15:07:01 -04:00
Eric Banks d09b8d5584 Fixing docs 2012-05-30 13:24:08 -04:00
Mauricio Carneiro d6e1205310 Updating default values for DiagnoseTargets 2012-05-30 12:43:07 -04:00
Khalid Shakir c3c7f17d90 Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000.
Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the non-existant dir, now checking to see if they network directory is available and throwing a SkipException to bypass the test when it cannot be run.
TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors popup later when the networked fastas aren't found.
2012-05-29 11:18:22 -04:00
Roger Zurawicki b8b139841d DiagnoseTargets with working Q1,Median,Q3
- Merged Roger's metrics with Mauricio's optimizations
 - Added Stats for DiagnoseTargets
     - now has functions to find the median depth, and upper/lower quartile
     - the REF_N callable status is implemented
 - The walker now runs efficiently
 - Diagnose Targets accepts overlapping intervals
 - Diagnose Targets now checks for bad mates
 - The read mates are checked in a memory efficient manner
 - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker.
 - Fixed some bugs
 - Removed rod binding

Added more Unit tests

 - Test callable statuses on the locus level
 - Test bad mates

 - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-05-29 10:16:45 -04:00
Eric Banks 50031b63c5 Fix possible NPE from NBaseCount annotation module 2012-05-29 09:46:00 -04:00
Mark DePristo 08de4dfd96 Missed one integration test 2012-05-29 07:23:24 -04:00
Mark DePristo 454c8e63e6 Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s
-- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec.  As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)
2012-05-28 20:20:05 -04:00
Mark DePristo 7ce24a96f1 PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present 2012-05-28 20:18:16 -04:00
Mark DePristo 1818c29371 Fixed long-standing bug in beagle codec that was passing on the header record for decoding 2012-05-28 20:17:26 -04:00
Mark DePristo 06b02e1b9b Update MD5s to reflect new limited output of DiffObjectsWalkers
-- Also updated GQ change in VCFIntegrationTest
2012-05-27 11:20:47 -04:00
Mark DePristo 5894d045cb Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests
-- This version of BCF should actually work properly for most files, assuming headers are properly defined.
-- Lots of bug fixes to BCF2 codec
-- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL.  NOTE THIS SEMANTICS change
-- Equals() method for GenotypeLikelihoods, using PLs.
-- VCFCodec now longer adds empty bindings to missing input field values.  NOTE THIS CHANGE
-- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work.  The BCF2 codec now makes VCs marked as fully decoded
-- stringToBytes returns empty list for null or "" string in BCF2Encoder
-- Proper handling of genotype ordering in BCF2 reader / writer
-- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily
-- Many failing MD5s now due to double -> int change in GQ, will update later
2012-05-27 11:17:17 -04:00
Mark DePristo 86e5a066fc Even more conservative limit on number of differences to summarize at 1000 2012-05-27 11:17:13 -04:00
Mark DePristo 31f4e5b52e Stop unlimited runtimes in DiffEngine when you have lots of differences
-- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000
2012-05-27 11:17:13 -04:00
Guillermo del Angel a6ee4f98b5 Yet More missing md5's 2012-05-25 17:21:47 -04:00
Mauricio Carneiro 4109fcbb08 Merged bug fix from Stable into Unstable 2012-05-25 13:03:05 -04:00
Mauricio Carneiro 2be5704a25 Fixed haplotype boundary bug in PairHMMIndelErrorModel
haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception).

   * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate.
   * removed unnecessary code to remove hard clips after processing.
   * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).
2012-05-25 13:00:45 -04:00
Guillermo del Angel 175bb35e70 Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superceded by former) 2012-05-25 12:56:23 -04:00
Mark DePristo d6df817174 Oops, don't enable shadow BCF tests 2012-05-24 13:31:13 -04:00
Mark DePristo 0a86564669 Updated test files didn't make it into last push 2012-05-24 13:29:44 -04:00
Mark DePristo 7280cdf937 Bugfixes and testdata cleanup
-- Cut down the size of a few large files in public/testdata that were only used in part
-- Refactor vcf Filename => shadow BCF filename to BCF2Utils.  Fix bug in WalkerTest due to the way this was handled previously
2012-05-24 13:26:05 -04:00
Mark DePristo e9c22b9aad Final updates to integration tests for BCF2
-- Fully working version
-- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf
-- Moved MedianUnitTest to its proper home in Utils
-- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests.  From this website it's easy to see md5 diffs, etc.  This is a vastly better way to manage unit and integration test output
2012-05-24 10:58:59 -04:00
Mark DePristo ade1843818 Bugfix for not setting header in AbstractVCFCodec 2012-05-24 10:58:58 -04:00
Mark DePristo 6ca71fe3b4 GATK tests use public/testdata not /humgen/ as much as possible 2012-05-24 10:58:58 -04:00
Mark DePristo 69ee4d0454 Moved getMetaDataForField to VariantContextUtils 2012-05-24 10:57:09 -04:00
Mark DePristo cb13f16e90 WalkerTest infrastructure to generate and test shadowBCF file for every generated VCF file
-- Currently disabled
2012-05-24 10:57:09 -04:00
Mark DePristo f77d2e6965 Renamed NO_HEADER to the more accurate no_cmdline_in_header
-- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well
2012-05-24 10:57:08 -04:00
Mark DePristo 4bde24f020 Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header 2012-05-24 10:57:08 -04:00
Mark DePristo 4846bf5c8e @Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf
-- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically
2012-05-24 10:57:07 -04:00
Mark DePristo bb0d87666a Finally just deleted equals() method in GATKArgumentCollection.
-- We never compare these things in the codebase anyway...
2012-05-24 10:57:07 -04:00
Mark DePristo 6f469305ab Don't try to share BCF2 yet 2012-05-24 10:57:06 -04:00
Mark DePristo c8ed0bfc4c Edge case fixes for BCF2
--handle entirely missing GT in a sample in decodeGenotypeAlleles
--Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code
-- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples
-- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys
-- Removed restriction on getPloidy requiring ploidy > 1.  It's logically find to return 0 for a no called sample
-- getMaxPloidy() in VC that does what it says
-- Support for padding / depadding of generic genotype fields
2012-05-24 10:57:06 -04:00
Mark DePristo 40431890be -- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it
-- BCF2 writer can now work without the contig lines being in the header
-- Made GenomeLocParser a final class
2012-05-24 10:57:06 -04:00
Mark DePristo 6301572009 GenotypeLikelihood PLs are capped at Short.MAX_INT now
-- UserExceptions in BCF2 now where appropriate
-- Asserts for code safety
-- Public -> protected encode(Object v) method is for testing only
2012-05-24 10:57:06 -04:00
Mark DePristo d52bc31a47 Bugfix for doNotWriteGenotypes mode
-- Was outputing GT ./. in sites only mode.  Fixed
2012-05-24 10:57:05 -04:00
Mark DePristo 64d4238e2f 99% working version of BCF2 encoder / decoder
-- fixed final bugs with PL encoding / decoding
-- Ready for testing by other members of the group
-- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations
-- Fixed a nasty bug in the filter field
-- Not that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types

Read 1500 genotypes file in VCF -> VCF : 11 seconds
Read 1500 genotypes file in VCF -> BCF : 9.5 seconds

VariantEval 1500 genotypes file in VCF : 3 seconds
VariantEval 1500 genotypes file in BCF : 3 seconds
2012-05-24 10:57:05 -04:00
Mark DePristo b5bce8d3f9 AD should be UNBOUNDED, actually
-- Pass in # alt alleles as appropriate for getCount in VCF header line
2012-05-24 10:57:05 -04:00
Mark DePristo aaf11f00e3 Near final BCF2 implementation
-- Trivial import changes in some walkers
-- SelectVariants has a new hidden mode to fully decode a VCF file
-- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED by A type, which is actually the right type
-- GenotypeLikelihoods now implements List<Double> for convenience.  The PL duality here is going to be removed in a subsequent commit
-- BugFixes in BCF2Writer.  Proper handling of padding.  Bugfix for nFields for a field
-- padAllele function in VariantContextUtils
-- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.
2012-05-24 10:57:02 -04:00
Mark DePristo dfee17a672 Generalize / unify code for handling strings
-- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder.
-- Unified the type conversion code in BCFWriter to simply the mapping from VCF type => BCF type and special value recoding
-- Code cleanup and renaming
2012-05-24 10:57:02 -04:00
Mark DePristo b4a5acd6f4 Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't 2012-05-24 10:57:01 -04:00
Mark DePristo 373ae39e86 Testing of BCF codec
-- Rev.d tribble
-- Minor code cleanup
-- BCF2 encoder / decoder use Double not Float internally everywhere
-- Generalized VC testing framework
2012-05-24 10:57:01 -04:00
Mark DePristo fb1911a1b6 -- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder
-- Convenience routine for creating alleles from strings of bases
-- Convenience constructor for VCFFilterHeader line whose description is the same as name
-- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes.  Can be reused throughtout code for BCF, VCF, etc.
-- Created basic BCF2WriterCodec tests that consumes VariantContextTestProvider contexts, writes them to disk with BCF2 writer, and checks that they come back equals to the original VariantContexts. Actually worked for some complex tests in the first go
2012-05-24 10:57:01 -04:00
Mark DePristo 4968dcd36a Throw an error when genotype fields with mixed vector lengths are encountered 2012-05-24 10:57:00 -04:00
Mark DePristo afd2f1a3f9 Individual VariantContextWriters are now package protected
-- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it
-- Update build.xml to build vcf.jar with updated paths and bcf2 support.
2012-05-24 10:57:00 -04:00
Mark DePristo 24864fd5b0 GATK now writes BCF output to any file with .bcf extension
-- Moved VCF and BCF writers to variantcontext.writers
-- Updated vcf.jar build path
-- Refactored VCFWriter and other code.  Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory.  Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.
2012-05-24 10:57:00 -04:00
Mark DePristo e2311294c0 Removed unused ManualSortingVCFWriter 2012-05-24 10:56:59 -04:00
Mark DePristo 93cef82637 BCF2 header encoding decoding at final spec 2012-05-24 10:56:58 -04:00
Mark DePristo ce9e9eebb1 No dictionary in header. Now built dynamically from the header in the writer and codec
-- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there
2012-05-24 10:56:58 -04:00
Mark DePristo f0b081a85f Update VCF.jar loading test
-- to reflect new path to VCFWriter
2012-05-24 10:56:58 -04:00
Mark DePristo c3b8048e2e Moving around classes in VCF and BCF2
-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
2012-05-24 10:56:58 -04:00
Mark DePristo 679ffdd333 Move BCF2 from private utils to public codecs 2012-05-24 10:56:56 -04:00
Mark DePristo 450f098a61 BCF2 encoder / decoder implement new site / genotype block organization
-- Supports final organization of data blocks into sites data and genotypes data
2012-05-24 10:56:55 -04:00
Mark DePristo 27b51d4dea Enable on the fly indexing of BCF2 2012-05-24 10:56:54 -04:00
Mark DePristo 81bd7646d6 Fix for MISSING floats
-- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int).
-- Now handles correctly MISSING qual fields
2012-05-24 10:56:53 -04:00
Mark DePristo 3afbc50511 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from Intellij
2012-05-24 10:56:52 -04:00
Mark DePristo 0799855479 Archiving GCF
-- Rider update to CramByPiece.scala
2012-05-24 10:56:51 -04:00
Guillermo del Angel 43919078cd Merged bug fix from Stable into Unstable 2012-05-23 21:21:01 -04:00
Guillermo del Angel 4bc04e2a9e Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler. 2012-05-23 21:19:30 -04:00
Ryan Poplin 08dfd6cab6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 16:47:07 -04:00