gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Joel Thibault	319d651e4a	Initial updates for ActiveRegionShard	2013-01-03 17:00:13 -05:00
Joel Thibault	e7553545ef	Initial updates for ReadShard	2013-01-03 17:00:13 -05:00
Joel Thibault	14a3ac0e3c	Enable the use of alternate shards	2013-01-03 17:00:13 -05:00
Joel Thibault	4cc372f53b	LocusShardDataProvider doesn't need its own GenomeLocParser	2013-01-03 17:00:13 -05:00
Joel Thibault	ffbd4d85f2	No need to pass fields as parameters	2013-01-03 17:00:12 -05:00
Joel Thibault	47e620dfbc	Create BAM index to test shard boundaries	2013-01-03 17:00:12 -05:00
Tad Jordan	c1ba12d71a	Added unit test for outputting sorted GATKReport Tables - Made a few small modifications to code - Replaced the two arguments in GATKReportTable constructor with an enum used to specify the way of sorting the table	2013-01-03 16:53:59 -05:00
Ami Levy-Moonshine	10a705b27f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-03 13:42:31 -05:00
Ami Levy-Moonshine	2018285a39	better error message	2013-01-03 13:41:03 -05:00
Eric Banks	c7039a9b71	Pushing in implementation of the Bayesian estimate of Qemp for the BQSR. This isn't hooked up yet with BQSR; it's just a static method used in my testing walker. I'll hook this into BQSR after more testing and the addition of unit tests. Most of the changes in this commit are actually documentation-related.	2013-01-02 15:21:44 -05:00
Joel Thibault	c515175313	Ensure that active region extensions stay on contig	2013-01-02 14:46:24 -05:00
Joel Thibault	dcb7735d3c	Active Region extensions must stay on contig	2013-01-02 14:46:24 -05:00
Chris Hartl	09199366b7	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-02 14:44:49 -05:00
Chris Hartl	e1d09ab0db	QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests should all pass.	2013-01-02 14:41:29 -05:00
Joel Thibault	a15f368bdc	Re-enable testIsActiveRangeLow/High	2013-01-02 11:57:50 -05:00
Mark DePristo	12f4c6307e	AutoFormattingTime cleanup and complete unittests -- Underlying system now uses long nano times to be more consistent with standard java practice -- Updated a few places in the code that were converting from nanoseconds to double seconds to use the new nanoseconds interface directly -- Bringing us to 100% test coverage with clover with AutoFormattingTimeUnitTest	2013-01-02 11:29:25 -05:00
Joel Thibault	429567cd3f	Rename to TraverseActiveRegionsUnitTest	2013-01-01 19:20:30 -05:00
Joel Thibault	57d38aac8a	Temporarily disable due to unknown contracts problem	2013-01-01 19:20:04 -05:00
Joel Thibault	7748b3816f	Delete the test BAI file as well as the BAM	2013-01-01 19:20:02 -05:00
Joel Thibault	5afeb465aa	TODOs	2013-01-01 19:19:17 -05:00
Mark DePristo	5558a6b8f7	Deleting / archiving no longer used classes -- AminoAcidTable and AminoAcid go to the archive -- Removing two unused SAMRecord classes	2012-12-29 14:34:17 -05:00
Mark DePristo	38cc496de8	Move SomaticIndelDetector and associated tools and libraries into private/andrey package -- Intermediate commit on the way to archiving SomaticIndelDetector and other tools. -- SomaticIndelDetector, PairMaker and RemapAlignments tools have been refactored into the private andrey package. All utility classes refactored into here as well. At this point, the SomaticIndelDetector builds in this version of the GATK. -- Subsequent commit will put this code into the archive so it no longer builds in the GATK	2012-12-29 14:34:08 -05:00
Ami Levy-Moonshine	f450cbc1a3	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-27 21:23:59 -05:00
Eric Banks	75d5b88a3d	Enabling the Recal Report unit test (which looks like it was never ever enabled)	2012-12-26 15:35:50 -05:00
Eric Banks	efceb0d48c	Check for well-encoded reads while fixing mis-encoded ones	2012-12-26 14:30:51 -05:00
Mark DePristo	af9746af52	Fix merge failure	2012-12-24 13:43:04 -05:00
Mark DePristo	04cc75aaec	Minor cleanup and expansion of the RecalDatum unit tests	2012-12-24 13:35:58 -05:00
Mark DePristo	7bf1f67273	BQSR optimization: read group x quality score calibration table is thread-local -- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table. Radically reduces the contention for RecalDatum in this very highly used table -- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables. Updated combine in RecalibrationReport to use this table combiner function -- Made several core functions in RecalDatum into final methods for performance -- Added RecalibrationTestUtils, a home for recalibration testing utilities	2012-12-24 13:35:58 -05:00
Mark DePristo	7d250a789a	ArtificialReadPileupTestProvider now creates GATKSamRecords with good header values	2012-12-24 13:35:57 -05:00
Mark DePristo	295455eee2	NanoScheduler optimizations and simplification -- The previous model was to enqueue individual map jobs (with a resolution of 1 map job per map call), to track the number of map calls submitted via a counter and a semaphore, and to use this information in each map job and reduce to control the number of map jobs, when reduce was complete, etc. All hideously complex. -- This new model is vastly simpler. The reducer basically knows nothing about the control mechanisms in the NanoScheduler. It just supports multi-threaded reduce. The NanoScheduler enqueues exactly nThread jobs to be run, which continually loop reading, mapping, and reducing until they run out of material to read, when they shut down. The master thread of the NS just holds a CountDownLatch, initialized to nThreads, and when each thread exits it reduces the latch by 1. The master thread gets the final reduce result when it's freed by the latch reaching 0. It's all super super simple. -- Because this model uses vastly fewer synchronization primitives within the NS itself, it's naturally much faster at getting things done, without any of the overhead obvious in profiles of BQSR -nct 2.	2012-12-24 13:35:57 -05:00
Mark DePristo	aa3ee29929	Handle case where the ReadGroup is null in GATKSAMRecord	2012-12-24 13:35:57 -05:00
Mark DePristo	bf81db40f7	NanoScheduler reducer optimizations -- reduceAsMuchAsPossible no longer blocks threads via synchronization, but instead uses an explicit lock to manage access. If the lock is already held (because some thread is doing reduce) then the thread attempting to reduce immediately exits the call and continues doing productive work. This removes one major source of blocking contention in the NanoScheduler	2012-12-24 13:35:57 -05:00
Mark DePristo	161487b4a4	MapResult compareTo() is now unit tested -- Thanks clover!	2012-12-24 13:35:57 -05:00
Mark DePristo	940816f16a	GATKSamRecord now checks that the read group is a GATKReadGroupRecord, and if not makes one	2012-12-24 13:35:57 -05:00
Mark DePristo	14944b5d73	Incorporating clover into build.xml -- See http://gatkforums.broadinstitute.org/discussion/2002/clover-coverage-analysis-with-ant for use docs -- Fix for artificial reads not having proper read groups, causing NPE in some tests -- Added clover itself to private/resources	2012-12-24 13:35:57 -05:00
Mark DePristo	7796ba7601	Minor optimizations for NanoScheduler -- Reducer.maybeReleaseLatch is no longer synchronized -- NanoScheduler only prints progress every 100 or so map calls	2012-12-24 13:35:56 -05:00
Mark DePristo	0f04485c24	NanoScheduler optimization: don't use a PriorityBlockingQueue for the MapResultsQueue -- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue. -- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N. This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step -- Had to update ReducerUnitTest because the test itself was brittle, and broken when I changed the underlying code. -- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc) -- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs	2012-12-24 13:35:56 -05:00
Mark DePristo	b92f563d06	NanoScheduler optimization for TraverseReadsNano -- Pre-read MapData into a list, which is actually faster than dealing with future lock contention issues with lots of map threads -- Increase the ReadShard default size to 100K reads by default	2012-12-24 13:35:56 -05:00
Mark DePristo	f849910c4e	BQSR optimization: only compute BAQ when there's at least one error to delocalize -- Saves something like 2/3 of the compute cost of BQSR	2012-12-24 13:35:56 -05:00
Mark DePristo	0f0188ddb1	Optimization of BQSR -- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine. This object has a restrictive interface to just get information about specific qual and error values at offset and for event type. This restriction allows us to avoid creating a vector of byte 45 for each read to represent BI and BD values not in the reads. Shaves 5% of the runtime off the entire code. -- Cleaned up code and added lots more docs -- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR. 95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.	2012-12-24 13:35:09 -05:00
Mark DePristo	f6d5499582	The GATK engine now ensures that incoming GATKSAMRecords have GATKSAMReadGroupRecord objects in their header -- Update SAMDataSource so that the merged header contains GATKSAMReadGroupRecord -- Now getting the NGSPlatform for a GATKSAMRecord is actually efficient, instead of computing the NGS platform over and over from the PL string -- Updated a few places in the code where the input argument is actually a GATKSAMRecord, not a SAMRecord for type safety	2012-12-24 13:35:09 -05:00
Ami Levy-Moonshine	8be01af145	add the new gather tool to GATKExtensionsGenerator	2012-12-21 15:09:00 -05:00
Ami Levy-Moonshine	3ca3fd4b3e	keep working on loglessHMM in UG	2012-12-21 11:06:12 -05:00
Ami Levy-Moonshine	6590039bc3	add fast gather to UG; change UG to work with log-lessHMM (work in progress)	2012-12-20 14:58:57 -05:00
Tad Jordan	b491c177ff	Added functionality of outputting sorted GATKReport Tables - Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables - Modified BQSR Integration Tests to include the optional argument. Tests now produce sorted tables	2012-12-20 14:02:21 -05:00
Eric Banks	6c3f5eefe9	Merged bug fix from Stable into Unstable	2012-12-19 22:29:21 -05:00
xingwei2012	22d13ccdab	Bug fix for Queue LSF v8.3 the function ls_getLicenseUsage() is not supported by LSF v8.x, comment the line: public static native lsfLicUsage.ByReference ls_getLicenseUsage() Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-12-19 22:28:53 -05:00
Ryan Poplin	54e5c84018	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-19 11:31:40 -05:00
David Roazen	07b369ca7e	Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package This is an intermediate commit so that there is a record of these changes in our commit history. Next step is to isolate the test classes as well, and then move the entire package to the Picard repository and replace it with a jar in our repo. -Removed all dependencies on org.broadinstitute.sting (still need to do the test classes, though) -Had to split some of the utility classes into "GATK-specific" vs generic methods (eg., GATKVCFUtils vs. VCFUtils) -Placement of some methods and choice of exception classes to replace the StingExceptions and UserExceptions may need to be tweaked until everyone is happy, but this can be done after the move.	2012-12-19 10:25:22 -05:00
Ryan Poplin	cda0c48570	auto-merge	2012-12-19 10:12:49 -05:00
Mark DePristo	1ca13f9581	Fundamentally better model for the NanoScheduler -- Now each map job reads a value, performs map, and does as much reducing as possible. This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc. All of this is accomplished using exactly NCT% of the CPU of the machine. -- Has the additional value of actually simplifying the code -- Resolves a long-standing annoyance with the nano scheduler.	2012-12-19 09:31:31 -05:00
David Roazen	d0cd29cb36	Merged bug fix from Stable into Unstable	2012-12-19 02:20:28 -05:00
David Roazen	0d93330ab9	Fix bug in the PerSampleDownsamplingReadsIterator that could lead to excessive memory usage at traversal startup This is a MUST-HAVE update for GATK 2.3 users who want to try out the new ability to use -dcov with ReadWalkers.	2012-12-19 02:05:36 -05:00
Joel Thibault	a29df3e094	oops	2012-12-18 19:03:12 -05:00
Joel Thibault	ee22c1bf44	More TODOs	2012-12-18 18:47:43 -05:00
Joel Thibault	2b1db519d7	Add reads which overstep a boundary by a single base	2012-12-18 18:47:43 -05:00
Joel Thibault	9828b2990f	Reads off the end of a contig fail SAM validation when using actual BAMs	2012-12-18 18:47:43 -05:00
Joel Thibault	72e2394b26	Create actual BAM	2012-12-18 18:47:43 -05:00
Joel Thibault	d69d1f8988	Fun with varargs	2012-12-18 18:47:42 -05:00
Joel Thibault	1158c1529f	Refactor region/read comparisons	2012-12-18 18:47:42 -05:00
Yossi Farjoun	6ed9eb3da9	GATKBAMIndex now passes unit test! Problem was that SeekableBufferedStream seems to have a bug: it will read beyond the end of a file if asked to.	2012-12-18 17:32:26 -05:00
Ryan Poplin	902ca7ea70	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-18 15:45:33 -05:00
Ryan Poplin	3950f7b3e3	Increasing the INFORMATIVE_LIKELIHOOD_THRESHOLD value to 0.2	2012-12-18 15:45:12 -05:00
Ryan Poplin	b5d590ba92	Based on NA12878 knowledge base experiments updating HC to allow for a much smaller minimum kmer length in the assembly graph.	2012-12-18 15:43:56 -05:00
eitanbanks	002ce9c1d5	Merge pull request #8 from yfarjoun/master Huge speedup in initial traversal of BAM index files (x20 speed!)	2012-12-18 10:16:53 -08:00
Mark DePristo	16eb1c5436	Optimization to TraverseReadsNano -- Don't just read all inputs into a list, and then provide an iterator to that list, actually make a real iterator so NanoScheduler input thread can contribute meaningfully to the work load -- Use NanoScheduler progress function, instead of home-grown updater	2012-12-18 10:14:47 -05:00
Mark DePristo	b33f804cdc	Inline increment function in RecalDatum to avoid minor duplication of work and multiple synchronized method calls	2012-12-17 16:47:27 -05:00
Mark DePristo	66d32f646b	Minor cleanup of BAQ calculation (final variables, etc)	2012-12-17 16:47:27 -05:00
Mark DePristo	67fe81391c	ProgressMeter optimization: don't do genome loc formatting, but instead create an object that only formats when printing is actually needed	2012-12-17 16:47:27 -05:00
Mark DePristo	1de2f527b9	Optimization of recalibrateRead -- Refactor calculation so that upfront constant values are pre-computed, and cached, and their values just looked up during application -- Trivial comment on how we might use BAQ better in BaseRecalibrator	2012-12-17 16:47:27 -05:00
Mark DePristo	bd6cda7542	Trivial optimization of TraverseReadsNano -- don't format the shard toString if logger isn't debug enabled	2012-12-17 16:47:27 -05:00
Mark DePristo	a481d006f0	Optimizations for applying BQSR table with PrintReads -- Cleaned up code in updateDataForRead so that constant values were not computed in inner loops -- BaseRecalibrator doesn't create its own fasta index reader, it just piggybacks on the GATK one -- ReadCovariates <init> now uses a thread local cache for its int[][][] keys member variable. This stops us from recreating an expensive array over and over. In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values it's skipping because of low quality base clipping. Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.	2012-12-17 16:47:27 -05:00
Mark DePristo	5ec25797b3	Optimizations for BaseRecalibrator -- No longer computes at each update the overall read group table. Now computes this derived table only at the end of the computation, using the ByQual table as input. Reduces BQSR runtime by 1/3 in my test	2012-12-17 16:47:27 -05:00
Eric Banks	e6f468b647	Refactored the quasi-useful IndelType annotation into the more useful VariantType. The indels are still annotated as before, but now all other variant types are annotated too. I'm doing this because of requests on the forum but am not making it standard. If we find it to be useful we can turn it on by default later.	2012-12-17 11:54:47 -05:00
Eric Banks	762f184262	Bug fix for strict validation: rsID checking wasn't working if there were multiple IDs	2012-12-17 10:32:41 -05:00
Yossi Farjoun	ea704d688f	chose smaller buffer size for the bufferedStream	2012-12-15 13:01:38 -05:00
Yossi Farjoun	6da2338ea7	removed comments and unneeded imports	2012-12-15 12:31:37 -05:00
Yossi Farjoun	19dd2d628a	some changes.	2012-12-14 17:21:32 -05:00
Mauricio Carneiro	74344a3871	Bringing in the changes from the CMI repo	2012-12-13 21:59:37 -05:00
Eric Banks	696bf95fba	Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer).	2012-12-13 23:28:30 +00:00
Mark DePristo	aeab932c63	Actual working version of unflushing VCFWriter -- Uses high-performance local writer backed by byte array that writes the entire VCF line in one write operation to the underlying output stream. -- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping -- Same (or better) IO performance as previous implementation -- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed -- Updated compressed VCF output file	2012-12-13 16:15:08 -05:00
Yossi Farjoun	5e66109268	Replaced a useless getInt with a skipInt to remove 1/4 of the initial seek time in the BAM Index.	2012-12-12 17:08:11 -05:00
Eric Banks	62eaffdf0a	Fix docs for ReadBackedPhasing	2012-12-12 20:28:04 +00:00
Eric Banks	bba63a3b0e	Fix for GSA-615: UnifiedGenotyperEngine.getGLModelsToUse takes 5% of the runtime of UG, should be optimized away.	2012-12-12 20:25:45 +00:00
Mauricio Carneiro	a52e3c7e15	Revert "Bug fix for RR: don't let the softclip start position be less than 1" this introduced a bug in reduce reads by de-activating its hard clipping of the out of bounds soft-clips (especially in the MT). DEV-322 #resolve #time 4m This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.	2012-12-12 13:09:39 -05:00
Mark DePristo	5632c13bf2	Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush(). -- Now compressed output VCFs are properly blocked compressed (i.e., they are actually smaller than the uncompressed VCF)	2012-12-12 10:27:07 -05:00
Mark DePristo	dd52a70d45	Fix AFCalcResult unit test -- I was simply passing in the wrong values into the function. Fixed the calls, and expanded the docs on what needs to be passed in.	2012-12-11 10:40:12 -05:00
Ami Levy-Moonshine	6bf31065e3	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-11 10:34:50 -05:00
Ami Levy-Moonshine	2f99569dda	change the md5 in one of the CV integration tests, since it wasn't using the priority list when printing the origin of the annotation (the setValue field)	2012-12-10 22:48:15 -05:00
Ami Levy-Moonshine	2e3284f306	Continue to fix the case where PRIORITIZE is used but no priority list is given. While fixing that case I also removed unnecessary sorting, when the priority list is not provided. When the priority list is not provided, it will continue to be null. Thus, the number of original Variant Contexts should be given as a new parameter to simpleMerge (since priority might be null). This new parameter is used for checking if there are filtered VC, when annotationOrigin is true.	2012-12-10 22:23:58 -05:00
Mauricio Carneiro	8a115edbaf	ReduceReads is now scattered by contig It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.	2012-12-10 15:25:27 -05:00
Ami Levy-Moonshine	573ace4403	restore the right version of VariantContextUtils.java in my unstable dir	2012-12-10 10:28:56 -05:00
David Roazen	46edab6d6a	Use the new downsampling implementation by default -Switch back to the old implementation, if needed, with --use_legacy_downsampler -LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and the original LocusIteratorByState becomes LegacyLocusIteratorByState -Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer, with the old one renamed to LegacyReadShardBalancer -Performance improvements: locus traversals used to be 20% slower in the new downsampling implementation, now they are roughly the same speed. -Tests show a very high level of concordance with UG calls from the previous implementation, with some new calls and edge cases that still require more examination. -With the new implementation, can now use -dcov with ReadWalkers to set a limit on the max # of reads per alignment start position per sample. Appropriate value for ReadWalker dcov may be in the single digits for some tools, but this too requires more investigation.	2012-12-10 09:44:50 -05:00
Ami Levy-Moonshine	5460c96137	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-09 23:43:57 -05:00
Ami Levy-Moonshine	3a420d163e	(1) changes in catVariants (work still under development) (2) changes to CV to throw an error when GenotypeMergeType is PRIORITIZE but no priority (rod_priority_list) is given. Reported by TechnicalVault on the forum on Nov 14 2012	2012-12-09 23:40:03 -05:00
Eric Banks	574d5b467f	Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype.	2012-12-09 02:09:34 -05:00
Mark DePristo	dbf721968d	PrintReads large-scale test to protect against another major low-level performance issue	2012-12-05 21:36:27 -05:00
Ami Levy-Moonshine	5d78a61f7a	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-05 15:07:12 -05:00
Mark DePristo	465694078e	Major performance improvement to the GATK engine -- The NanoSchedule timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers. Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests. -- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x. For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement). -- Took this opportunity to improve the GATK ProgressMeter. NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress. This removes all inner loop calls to the GATK timers. -- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402	2012-12-05 14:49:22 -05:00
Mark DePristo	2b601571e7	Better error handling in NanoScheduler -- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown. Errors, like out of memory, would cause the whole system to die. This bugfix resolves that issue	2012-12-05 14:49:22 -05:00
Eric Banks	0c925856cb	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-05 02:00:39 -05:00
Eric Banks	ef87b18e09	In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then).	2012-12-05 02:00:35 -05:00
Mauricio Carneiro	30f013aeb0	Added a copy() method for ReadBackedPileups necessary to create new alignment contexts with hard-copies of the pileup.	2012-12-05 01:32:18 -05:00
Mauricio Carneiro	6feda540a4	Better error message for SimpleGATKReports	2012-12-05 01:32:18 -05:00
Randal Moore	8d2d0253a2	introduce a level of indirection for the forum URLs - this new function will allow me a place to morph the URL into something that is supported by Confluence Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-12-03 22:33:02 -05:00
Eric Banks	67932b357d	Bug fix for RR: don't let the softclip start position be less than 1	2012-12-03 15:59:14 -05:00
Ryan Poplin	a47da9bb2f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-03 14:30:14 -05:00
Eric Banks	5fed9df295	Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh).	2012-12-03 12:18:20 -05:00
Eric Banks	b6839b3049	Added checking in the GATK for mis-encoded quality scores. The check is performed by a Read Transformer that samples (currently set to once every 1000 reads so that we don't hurt overall GATK performance) from the input reads and checks to make sure that none of the base quals is too high (> Q60). If we encounter such a base then we fail with a User Error. * Can be over-ridden with --allow_potentially_misencoded_quality_scores. * Also, the user can choose to fix his quals on the fly (presumably using PrintReads to write out a fixed bam) with the --fix_misencoded_quality_scores argument. Added unit tests.	2012-12-03 11:18:41 -05:00
Ryan Poplin	18b002c99c	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-03 10:08:56 -05:00
Ryan Poplin	1bdf17ef53	Reworking of how the likelihood calculation is organized in the HaplotypeCaller to facilitate the inclusion of per allele downsampling. We now use the downsampling for both the GL calculations and the annotation calculations.	2012-12-02 11:58:32 -05:00
Ami Levy-Moonshine	d0b8cc7773	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-01 00:08:25 -05:00
Ami Levy-Moonshine	969c995298	work under development - catVariants. Changes to AssessRRQuals based on Eric todo comments. bug fix in CombineVariants	2012-12-01 00:08:19 -05:00
Mark DePristo	8020ba14db	Minor cleanup of SAMDataSource as part of my system review -- Changed a few functions from public to protected, as they are only used by the package contents, to simplify the SAMDataSource interface	2012-11-30 15:04:41 -05:00
Mauricio Carneiro	fc7fab5f3b	Fixed ReadBackedPileup downsampling Downsampling in the PerSampleReadBackedPileup was broken, it didn't downsample anything, always returning a copy of the original pileup.	2012-11-30 00:42:05 -05:00
Joel Thibault	97d29f203e	Add walltime changes to LSF - Check whether the specified attribute is available - Add pipeline test (disabled due to missing attribute)	2012-11-29 15:23:37 -05:00
Joel Thibault	c76c808268	Reads are required to be sorted - Remove the extended_only case because it's outside intervals	2012-11-28 13:59:58 -05:00
Joel Thibault	198923b597	Add ActiveRegionReadState handling	2012-11-28 13:59:57 -05:00
Ryan Poplin	f0395b457a	Adding the work-in-progress, experimental RepeatLengthCovariate to the BQSR so Chris can continue the development.	2012-11-28 13:56:32 -05:00
Eric Banks	3463774f2a	Merged bug fix from Stable into Unstable	2012-11-28 13:26:52 -05:00
Eric Banks	6030605242	Added quick check for creation of bad BAQ values associated with badly encoded base qualities; hopefully this can help us debug the non-reproducible issue seen by many users.	2012-11-28 13:26:31 -05:00
Mark DePristo	c676853731	Merged bug fix from Stable into Unstable. Updating md5s Conflicts: protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java	2012-11-28 12:54:36 -05:00
Mark DePristo	a1d6461121	Critical bugfix to AFCalcResult affecting UG/HC quality score emission thresholds As reported by Menachem Fromer: a critical bug in AFCalcResult: Specifically, the implementation: public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) { return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef; } seems incorrect and should probably be: getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef The issue here is that the 30 represents a Phred-scaled probability of error and it's currently being compared to a log probability of non-error. Instead, we need to require that our probability of error be less than the error threshold. This bug has only a minor impact on the calls -- hardly any sites change -- which is good. But the inverted logic affects multi-allelic sites significantly. Basically you only hit this logic with multiple alleles, and in that case it's including extra alt alleles incorrectly, and throwing out good ones. Change was to create a new function that properly handles thresholds that are PhredScaled quality scores: /** * Same as #isPolymorphic but takes a phred-scaled quality score as input */ public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) { if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 "); final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual)); return isPolymorphic(allele, log10Threshold); }	2012-11-28 12:08:02 -05:00
Menachem Fromer	79bc878e6a	Allow debugging to be set from the command line	2012-11-27 22:37:41 -05:00
Eric Banks	b40d3eb8aa	Merged bug fix from Stable into Unstable	2012-11-27 14:41:07 -05:00
Eric Banks	01abcc3e0f	Tests didn't like my note to Geraldine in the output logs; apparently it's tested in integration tests	2012-11-27 14:40:49 -05:00
Mark DePristo	7e4b9c9e6e	Fix failing unit tests for VariantContextUtilsUnitTest -- Previous version was adding multiple samples with the same name to the variant context	2012-11-27 14:26:23 -05:00
Joel Thibault	9bfe39411e	Equal overlap should match right/later region	2012-11-27 13:03:13 -05:00
Joel Thibault	d83ad906ef	Add profile range contract	2012-11-27 13:03:13 -05:00
Joel Thibault	cc550b4145	Add a read and interval on a different contig	2012-11-27 13:03:13 -05:00
Eric Banks	9531e58445	Merged bug fix from Stable into Unstable	2012-11-27 11:00:50 -05:00
Eric Banks	4543ece088	Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case.	2012-11-27 11:00:33 -05:00
Eric Banks	a82ec7ad80	Merged bug fix from Stable into Unstable	2012-11-27 10:27:08 -05:00
Eric Banks	e199562c25	I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever).	2012-11-27 10:26:17 -05:00
Mauricio Carneiro	97fd5de260	Merging latest CMI updates with UNSTABLE	2012-11-27 09:08:00 -05:00
Eric Banks	b1969a66bd	Update docs	2012-11-27 08:24:41 -05:00
Eric Banks	cc72aaefeb	Minor efficiency: use >= instead of > in test	2012-11-27 01:11:23 -05:00
Eric Banks	405f3c675d	Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc.	2012-11-27 01:07:00 -05:00
Ryan Poplin	e27d677c13	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-26 12:20:32 -05:00
Ryan Poplin	c3b7dd1374	Misc cleanup in the HaplotypeCaller. Cleaning up unused arguments after recent changes to HC-GenotypingEngine	2012-11-26 12:19:11 -05:00
Eric Banks	4f7fa3009a	I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA.	2012-11-26 11:34:59 -05:00
Mauricio Carneiro	a3f5932501	Fixed null pointer exception in Integration Tests When running Utils.setupWriter with NO_PG_TAG set, the writer was attempting to create a program record with the null pointer. Fixed.	2012-11-26 11:12:27 -05:00
Ryan Poplin	fedc4fde6c	Merged bug fix from Stable into Unstable	2012-11-25 21:55:55 -05:00
Ryan Poplin	d978cfe835	Soft clipped bases shouldn't be counted in the delocalized BQSR.	2012-11-25 21:55:29 -05:00
Eric Banks	9719ba7adc	Remove -number example from the docs since it's no longer supported.	2012-11-22 21:53:42 -05:00
Menachem Fromer	2306518ab6	Fix to deal with 'proper' options of casting	2012-11-22 01:45:18 -05:00
Menachem Fromer	d33a412b5f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-22 01:42:29 -05:00
Mark DePristo	48f271c5bd	Adding 80% support for multi-allelic variants -- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here. It's not perfect and needs some discussion on how to handle het/alt variants -- Adding splitInBiallelic function to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)	2012-11-21 17:24:59 -05:00
Joel Thibault	c68bc95db6	Initial read mapping tests - Failing tests are commented out	2012-11-21 17:16:46 -05:00
Joel Thibault	3ad9128800	Add some reads - Move intervals and reads to init - Update intervals and reads	2012-11-21 17:16:46 -05:00
Joel Thibault	3fa3b00f4a	Add ActiveRegion tests and refactor	2012-11-21 17:16:45 -05:00
Joel Thibault	e8defcb20d	Test multiple bases and intervals	2012-11-21 17:16:45 -05:00
Joel Thibault	c08b782743	Count isActive calls directly	2012-11-21 17:16:45 -05:00
Eric Banks	4f2229d399	As per the TODO message, I removed a check that was no longer necessary. Now ID is an allowable INFO field key.	2012-11-21 16:01:26 -05:00
Menachem Fromer	06261b58c2	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-21 15:57:08 -05:00
Eric Banks	ed50814ccb	Finally found a case where user errors were being masked behind other errors and could debug. It turns out that the checkForMaskedUserErrors() method needs to run recursively over all levels (calling exception.getCause()) to check for the original cause.	2012-11-21 15:57:05 -05:00
Menachem Fromer	c8be7c3102	Keep SNPs and indels separately for batch merging; Add options to DepthOfCoverage to count fragments (to not double-count overlapping reads of same fragment); DepthOfCoverage should now support ReducedReads; Replace recursion with loop in DoC/package.scala (for lists longer than 5000 elements)	2012-11-21 15:56:53 -05:00
Ami Levy-Moonshine	4714ccc284	change the way CombineVariants check the priority arguments in order to throw error when the genotypeMergeOption argument is set to PRIORITIZE but PRIORITY_STRING is not provided	2012-11-21 10:47:35 -05:00
Eric Banks	2e1a055aca	Merged bug fix from Stable into Unstable	2012-11-20 23:20:33 -05:00
Eric Banks	c54fc94505	Protect against features that start off the end of the read (otherwise, Arrays.fill fails)	2012-11-20 23:19:59 -05:00
Eric Banks	c2efb04657	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-20 22:43:15 -05:00
Eric Banks	72e2d569c5	The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality.	2012-11-20 22:41:57 -05:00
Eric Banks	ff87642a91	Enable cycle covariate unit tests	2012-11-20 22:29:56 -05:00
Mark DePristo	cc7680e601	NA12878 knowledge base backed by MongoDB -- Idea is simply to create a persistent database of all TP/FP sites on chr20 in NA12878. Individual callsets can be imported, and a consensus algorithm is run over all callsets in the database to create a consensus collection, which can be used to assess NA12878 callsets for GATK and methods development -- Framework for representing simple VariantContexts and Genotypes in MongoDB, querying for records, and iterating over them in the GATK -- Not hooked up to Tribble, but could be done reasonably easily now (future TODO) -- Tools to import callsets, create consensus callsets, import and export reviews -- Scripts to reset the knowledge base and repopulate it with the standard data files (Eric will expand) -- Actually scales to all of chr20, includes AssessNA12878 that reads a VCF and itemizes it against the truth data set -- ImportCallset can load OMNI, HM3, CEU best practices, mills/devine sites and genotypes, properly marking sites as poly/mono/unk as well as TP/FP/UNK based on command line parameters -- Added shell scripts that start up a local mongo db, that connect to a local or BI hosted mongo for NA12878.db for debugging, and a setupNA12878db script that can load OMNI, HM3, CEU best practices, Mills/Devine into the db and then update the consensus. -- Reviewed sites can be exported to a VCF, and imported again, as a mechanism to safely store the only non-recoverable data from the Mongo DB. -- Created a NA12878DBWalker that manages the outer DB interaction, and that all MongoDB interacting walkers inherit from. 
Added a NA12878DBArgumentCollection.java consolidating all of the common command line arguments (though strictly not necessary as all of this occurs in the root walker) UnitTests -- Can connect to a test knowledge base for development and unit testing -- PolymorphicStatus, TruthStatus, SiteIterator -- NA12878KBUnitTestBase provides simple utilities for connecting to the test mongo db, getting calls, etc -- MongoVariantContext tests creation, matching, and encoding -> writing -> read -> decoding from the mongodb AssessNA12878 -- Generic tool for comparing a NA12878 callset against the knowledge base. See http://gatkforums.broadinstitute.org/discussion/1848/using-the-na12878-knowledge-base for detailed documentation -- Performs trivial filtering on FS, MQ, QD for SNPs and non-SNPs to separate out variants likely to be filtered from those that are honest-to-goodness FPs Misc -- Ability to provide Description for Simplified GATK report	2012-11-20 18:50:52 -05:00
Eric Banks	937ac7290f	Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results.	2012-11-20 16:13:29 -05:00
Eric Banks	4f243acaa6	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-19 10:34:44 -05:00
Eric Banks	f0b8a0228f	Quick fix for HC refactoring: when copying over Haplotype objects, make sure to copy over the artificial allele used to create it too.	2012-11-19 09:57:55 -05:00
Eric Banks	ff180a8e02	Significant refactoring of the Haplotype Caller to handle problems with GGA. The main fix is that we now maintain a mapping from 'original' allele to 'Smith-Waterman-based' allele so that we no longer need to do a (buggy) matching throughout the calling process.	2012-11-19 09:09:57 -05:00
Eric Banks	78ce822b6f	Protect against NPE when using non-GATK reports for inputs expecting valid GATK reports	2012-11-19 09:07:04 -05:00
Joel Thibault	b70fd4a242	Initial testing of the Active Region Traversal contract - TODO: many more tests and test cases	2012-11-15 10:08:00 -05:00
Guillermo del Angel	a68e6810c9	Back off experimental code that escaped last commit, not for general use yet	2012-11-14 14:45:15 -05:00
Guillermo del Angel	89bbe73a43	Commenting out CMI pipeline test that wasn't meant to be in GATK repository (why was this merged??)	2012-11-14 14:39:04 -05:00
Guillermo del Angel	3771d074dc	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-11-14 14:37:43 -05:00
Mauricio Carneiro	e35fd1c717	Merging CMI-0.5.0 and GATK-2.2 together.	2012-11-14 10:42:03 -05:00
Mauricio Carneiro	a079d8d0d1	Breaking the utility to write @PG tags for SAMFileWriters and StingSAMFileWriters	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	dba31018f4	Implementation of BySampleSAMFileWriter ReduceReads now works with the n-way-out capability, splitting by sample. DEV-27 #resolve #time 3m	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	a17cd54b68	Co-Reduction implementation in ReduceReads ReduceReads now co-reduces bams if they're passed in together with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples. Also: * Added integrationtest for co-reduction * Fixed bug with new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5's accordingly) DEV-200 #resolve #time 8m	2012-11-14 10:33:21 -05:00
kshakir	6d59dd3455	Scala classes were only returning direct subclasses (confirmed when inspected in debugger) so changed PluginManager to allow specifying the explicit subclass. Removed some generics from PluginManager for now until able to figure out syntax for requesting explicit subclass. QStatusMessenger uses a slightly more primitive Map[String, Seq[RemoteFile]] instead of Map[ArgumentSource, Seq[RemoteFile]]. Added an QCommandPlugin.initScript utility method for handling specialized script types.	2012-11-14 10:33:20 -05:00
Eric Banks	42ddf51156	Merged bug fix from Stable into Unstable	2012-11-14 10:29:09 -05:00
Eric Banks	ba41f65759	Protect against NPEs in SelectVariants by checking for missing Genotypes	2012-11-13 11:53:39 -05:00
Eric Banks	c7335c9902	Having a malformed GATK report is a User Error	2012-11-13 11:53:12 -05:00
Eric Banks	525cf331f4	Don't catch a User Error and re-throw as a Reviewed Exception. That makes Eric unhappy.	2012-11-13 11:52:47 -05:00
Eric Banks	ee776e996a	Merged bug fix from Stable into Unstable	2012-11-09 08:35:51 -05:00
Eric Banks	66cbaaee31	Fixed nasty bug in BQSR csv file creation: numbers larger than 999 in the Errors column were printed out with commas (which looks like a separate column). This wasn't caught earlier because there are no integration tests covering the csv. I'll add one into unstable in a sec.	2012-11-09 08:33:55 -05:00
Eric Banks	e9183d9fe0	Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec.	2012-11-08 15:07:47 -05:00
Eric Banks	17ab3a39d5	Make the --intermediate_csv_file argument un-hidden.	2012-11-08 14:35:23 -05:00
Eric Banks	f4d4846435	Merged bug fix from Stable into Unstable	2012-11-06 20:53:54 -08:00
Eric Banks	15b8c08132	Apparently CIGAR elements can have 0 length according to the spec, but 0Ms were causing left alignment of indels to fail. Fixed.	2012-11-06 20:53:33 -08:00
Mark DePristo	f8a0a947e3	Critical bugfix for GSA-652 / Multi-threaded VCF -> BCF writing produces invalid intermediate file that fails on merging -- New tribble library now uses 64 bit sizes. The 26K VCF has so much data that low-level tribble block indices were overflowing their int size values. This includes a to-be-committed tribble jar that fixes this problem -- See https://jira.broadinstitute.org/browse/GSA-652 -- Minor cleanup of error messages that were useful on the way to solving this monster problem	2012-11-02 09:09:59 -04:00
David Roazen	6185e8c432	Allow large-scale tests 5 hours each to run	2012-11-01 17:48:58 -04:00
Ryan Poplin	386b45e94d	This VE eval module isn't useful anymore.	2012-11-01 15:44:41 -04:00
Mark DePristo	872abddfce	Add custom TestNGTestTransformer that adds a maximum test runtime of 10 minutes to all testng tests -- Closes GSA-494 / Add maximum runtime for integration tests, running them in timeout thread -- Needed to debug locking issues -- Needed to debug excessively long running integrationtests -- Added build.xml maximum runtime for all testng tests of 10 hours. We will ultimately fail the build if it goes on for more than 10 hours	2012-11-01 15:34:12 -04:00
Mark DePristo	1444cd753b	Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome -- The logic for determining active regions was a bit broken in the HC when intervals were used in the system -- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR -- Non-overlapping intervals are always treated as separate objects for determining active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive -- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future. -- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC -- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions -- Remove excessively expensive calls to check bases are upper cased in ReferenceContext -- Update md5s after a lot of manual review and discussion with Ryan	2012-11-01 15:34:04 -04:00
Mark DePristo	9cd04c335c	Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data -- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases. This is now done by default, unless requested explicitly to preserve the original bases. -- This is really the correct place to do this for a variety of reasons. First, you don't need to worry about upper casing bases throughout the code. Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves. Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase() -- Added unit tests to ensure this functionality works correctly. -- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarantees that the reference bases are all upper case. -- Added contracts to ensure this is the case. -- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code	2012-11-01 15:34:03 -04:00
Eric Banks	94a13c05ed	Merged bug fix from Stable into Unstable	2012-10-31 22:57:26 -04:00
Eric Banks	47a0f5859e	Don't run these tests if not GATK lite	2012-10-31 22:56:38 -04:00
Eric Banks	881c843307	Merged bug fix from Stable into Unstable	2012-10-31 21:28:27 -04:00
Eric Banks	f8af8a2355	Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite.	2012-10-31 21:28:07 -04:00
Guillermo del Angel	24e6da25cc	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-31 14:17:41 -04:00
Eric Banks	96344c6b62	Add note to realigner docs	2012-10-31 12:35:45 -04:00
Guillermo del Angel	4580e99c0c	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-31 10:50:54 -04:00
Guillermo del Angel	02b790c8db	Merge fix	2012-10-31 10:50:36 -04:00
Guillermo del Angel	51a9ce28e1	Merge remote-tracking branch 'unstable/master' into develop	2012-10-31 10:29:48 -04:00
Eric Banks	e1e480a0b9	Bug fix: don't add no-call alleles to the list of ALT alleles being validated.	2012-10-30 14:54:29 -04:00
Eric Banks	2aa28abe0a	Fixing md5s to reflect the new HapMap file	2012-10-30 14:27:10 -04:00
Guillermo del Angel	c8e17a7adf	totally experimental UG feature, to be removed	2012-10-30 13:57:54 -04:00
Eric Banks	c95e893920	Better error message for unused ALT alleles	2012-10-29 21:51:35 -04:00
Eric Banks	b6a1967f12	Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it.	2012-10-29 21:47:09 -04:00
Eric Banks	be902375ac	'Bug' fix: fix the error message from the vcf validator so people realize that the file fails strict validation but still adheres to the spec.	2012-10-29 16:29:27 -04:00
Ryan Poplin	4e661847b2	DelocalizedBaseRecalibrator becomes the BaseRecalibrator.	2012-10-29 12:53:39 -04:00
Eric Banks	ac99437eec	Bug fixes to hapmap conversion in VariantsToVCF	2012-10-29 01:45:33 -04:00
Eric Banks	43625f652e	Shoot, mixed up the md5s last time.	2012-10-27 19:43:46 -04:00
Andrey Sivachenko	f3ac5d404d	updating vcf header attribute descriptions in order to reflect correctly what's actually being written...	2012-10-26 23:52:21 -04:00
Andrey Sivachenko	b4fbf6280a	fixing missing sample genotype bug, missing AD/DP bug, and putting annotations in more natural order (Ref/Alt)	2012-10-26 23:48:40 -04:00
Mark DePristo	ac5e58a265	Bugfix for GSA-540 / Update metadata maps when adding lines to VCFHeader -- https://jira.broadinstitute.org/browse/GSA-540 -- http://gatkforums.broadinstitute.org/discussion/1433/possible-bug-and-fix-in-java-code-of-vcfheader-org-broadinstitute-sting-utils-codecs-vcf-vcfheader	2012-10-26 16:34:16 -04:00
Mark DePristo	fa9b2a91d0	Bugfix for GSA-552 -- https://jira.broadinstitute.org/browse/GSA-552 -- User reports a null exception while using VariantsToVCF: http://gatkforums.broadinstitute.org/discussion/1461/nullpointerexception-converting-vcf3-to-vcf-using-variantstovcf The problem is that he left out an input VCF file for the --variant argument and the command-line argument parsing code didn't catch this, so we NPE out later on.	2012-10-26 16:34:16 -04:00
Eric Banks	682a72faf7	Hmm, thought I got all the md5s last time. Apparently not.	2012-10-26 16:10:12 -04:00
Eric Banks	f66d812778	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 13:20:41 -04:00
Eric Banks	a8704ca73f	Adding TODO notes for Ami	2012-10-26 13:20:27 -04:00
Mark DePristo	251983b8fb	Add GATK-wide command line argument to control the maximum runtime allowed for the GATK -- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker as though the end of processing had been reached. This means that all output files are closed properly, etc. -- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update -- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling? -- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.	2012-10-26 13:18:34 -04:00
Eric Banks	ed11b7dab2	Fix UG parallelization test	2012-10-26 12:10:44 -04:00
Eric Banks	7a706ed345	Fix some of the broken integration tests	2012-10-26 11:23:44 -04:00
Eric Banks	ebebec7fdb	Accidentally left one test disabled	2012-10-26 02:15:32 -04:00
Eric Banks	b06f689d4b	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 02:13:26 -04:00
Eric Banks	a53e03d525	Do not let reduced reads get removed in the contamination down-sampling	2012-10-26 02:13:04 -04:00
Eric Banks	bf3d61ce82	The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it.	2012-10-26 01:04:51 -04:00
Eric Banks	91f2c847a3	Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test.	2012-10-26 00:57:40 -04:00
Mark DePristo	6b8b7df651	Queue now understands -nct and requests the appropriate number of cores from LSF, SGE, etc -- NCT wasn't previously recognized by Queue as needing more processors per machine. This commit fixes this. Also a potential cause of poor GATKPerformanceOverTime, in that runs with -nct could flood a node and cause it to have hundreds of cores in contention.	2012-10-25 17:26:58 -04:00
David Roazen	422e16c62e	BaseRecalibration: don't cache instances of ReadCovariates across reads Caching and reusing ReadCovariates instances across reads sounds good in theory, but: -it doesn't work unless you zero out the internal arrays before each read -the internal arrays must be sized proportionally to the maximum POSSIBLE recalibrated read length (5000!!!), instead of the ACTUAL read lengths By contrast, creating a new instance per read is basically equivalent to doing an efficient low-level memset-style clear on a much smaller array (since we use the actual rather than the maximum read length to create it). So this should be faster than caching instances and calling clear() but slower than caching instances and not calling clear(). Credit to Ryan for proposing this approach.	2012-10-25 17:02:55 -04:00
Guillermo del Angel	92fa7e953a	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 16:33:14 -04:00
Ami Levy Moonshine	dde3060bb8	add the CEUtrio best practices results (UG + PBT) to the bundle	2012-10-25 15:36:17 -04:00
Ami Levy Moonshine	90b9971033	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 15:32:29 -04:00
David Roazen	884d031e72	NestedIntegerArray: Pre-allocate only the first two dimensions It turns out that pre-allocating the entire tree was too expensive in terms of memory when using large values for the -mcs and -ics parameters. Pre-allocating the first two dimensions prevents us from ever locking the root node during a put(). Contention between threads over lower levels of the tree should be minimal given that puts are rare compared to gets. Also output dimensions and pre-allocation info at startup. If pre-allocation takes longer than usual this gives the user a sense of what is causing the delay.	2012-10-25 15:17:42 -04:00
Mark DePristo	cc8c12b954	Committing a broken version of BaseRecalibration -- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA. -- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails. But that's because of this incorrect calculation -- Updating GATKPerformanceOverTime with the new @ClassType annotation	2012-10-25 14:46:35 -04:00
Eric Banks	e93ff3ea6e	Let's go back to having the SB/SLOD NOT computed by default. If you recall, it was only enabled by default because we thought we were going to use it when we made VQSR use random forests. But since we decided not to change VQSR, there's no reason to triple the computation for every variant site anymore.	2012-10-25 12:45:23 -04:00
Guillermo del Angel	a838653822	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 10:35:58 -04:00
Guillermo del Angel	596c1723ae	Hidden, unsupported ability of VariantEval to run AlleleCount stratification on sites-only VCFs. I'll expose it/add tests on it if people think this is generally useful. User needs to specify total # of samples as command line argument since genotypes are not available. Also, fixes to large-scale validation script: lower -minIndelFrac threshold or else we'll kill most indels since default 0.25 is too high for pools, fix also VE stratifications and add one VE run where eval=1KG, comp=pool data and AC stratification based on 1KG annotation	2012-10-25 10:35:43 -04:00
Eric Banks	6dc7d872ec	Fix GenotypeAndValidate to handle SNPs and indels as reported on the forum. Recent changes to the UnifiedArgumentCollection made this stop working. Adding in JIRA to create integration tests for this tool.	2012-10-25 10:06:13 -04:00
Eric Banks	c53c55da12	Re-enable tests	2012-10-25 09:37:08 -04:00
Eric Banks	e6652f7777	Added integration test for contamination down-sampling	2012-10-25 09:36:05 -04:00
Eric Banks	df9e0b7045	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 02:49:54 -04:00
Eric Banks	72714ee43e	Minor patches to get the contamination down-sampling working for indels. Adding @Hidden logging output for easy debugging.	2012-10-25 02:47:42 -04:00
Eric Banks	c6b57fffda	Added allele biased down-sampling capabilities to the PerReadAlleleLikelihoodMap object, which means that both the UG and HC can use this functionality. Note that it's only available in protected, so GATK-lite users won't be allowed to enable it. Needs more testing.	2012-10-24 22:52:25 -04:00
Ami Levy Moonshine	bcf3582095	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-24 21:50:41 -04:00
Eric Banks	9da7bbf689	Refactoring the PerReadAlleleLikelihoodMap in preparation for adding contamination downsampling into protected only.	2012-10-24 15:49:07 -04:00
David Roazen	d9aa9855f8	Better comments in NestedIntegerArray	2012-10-24 15:29:13 -04:00
David Roazen	02018ca764	Legacy BaseRecalibrator walker is neither TreeReducible nor NanoSchedulable The old BaseRecalibrator walker is and never will be thread-safe, since it's a LocusWalker that uses read attributes to track state. ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1 for testing purposes, but wait for further testing to be done before using it for production purposes in multithreaded mode.	2012-10-24 15:22:50 -04:00
David Roazen	32a6d7000a	Thread-safe ReadGroupCovariate The ReadGroupCovariate class was not thread-safe. This led to horrible race conditions in multithreaded runs of the BQSR where (for example) the same read group could get inserted into the reverse lookup table twice with different IDs. Should fix the intermittent crash reported in GSA-492.	2012-10-24 15:22:50 -04:00
David Roazen	991658acf4	BQSR: use more granular locking for concurrency control -With this change, BQSR performance scales properly by thread rather than gaining nothing from additional threads. -Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct (NanoScheduler) -Removes high-level locks in the recalibration engines and NestedIntegerArray in favor of maximally-granular locks on and around manipulation of the leaf nodes of the NestedIntegerArray. -NestedIntegerArray now creates all interior nodes upfront rather than on the fly to avoid the need for locking during tree traversals. This uses more memory in the initial part of BQSR runs, but the BQSR would eventually converge to use this memory anyway over the course of a typical run. IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator walker with multiple threads. The BaseRecalibrator walker is and will never be thread-safe, as it's a LocusWalker that uses read attributes to track state information. ONLY the newer DelocalizedBaseRecalibrator can be made thread-safe (and will hopefully be made so in my subsequent commits). This commit addresses performance, not correctness.	2012-10-24 15:22:50 -04:00
Eric Banks	5b7b42356b	Fix bug in GenotypeAndValidate where it doesn't check vc.hasAttribute() before using vc.getAttribute().	2012-10-24 14:02:50 -04:00

3380 Commits (5f49c95cc1f167ea9f4b5f4909848cffa54173b5)