gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Joel Thibault	ffbd4d85f2	No need to pass fields as parameters	2013-01-03 17:00:12 -05:00
Tad Jordan	c1ba12d71a	Added unit test for outputting sorted GATKReport Tables - Made few small modifications to code - Replaced the two arguments in GATKReportTable constructor with an enum used to specify way of sorting the table	2013-01-03 16:53:59 -05:00
Ami Levy-Moonshine	10a705b27f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-03 13:42:31 -05:00
Ami Levy-Moonshine	2018285a39	better error message	2013-01-03 13:41:03 -05:00
Eric Banks	c7039a9b71	Pushing in implementation of the Bayesian estimate of Qemp for the BQSR. This isn't hooked up yet with BQSR; it's just a static method used in my testing walker. I'll hook this into BQSR after more testing and the addition of unit tests. Most of the changes in this commit are actually documentation-related.	2013-01-02 15:21:44 -05:00
Joel Thibault	c515175313	Ensure that active region extensions stay on contig	2013-01-02 14:46:24 -05:00
Chris Hartl	e1d09ab0db	QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests should all pass.	2013-01-02 14:41:29 -05:00
Mark DePristo	12f4c6307e	AutoFormattingTime cleanup and complete unittests -- Underlying system now uses long nano times to be more consistent with standard java practice -- Updated a few places in the code that were converting from nanoseconds to double seconds to use the new nanoseconds interface directly -- Bringing us to 100% test coverage with clover with AutoFormattingTimeUnitTest	2013-01-02 11:29:25 -05:00
Mark DePristo	5558a6b8f7	Deleting / archiving no longer classes -- AminoAcidTable and AminoAcid goes to the archive -- Removing two unused SAMRecord classes	2012-12-29 14:34:17 -05:00
Mark DePristo	38cc496de8	Move SomaticIndelDetector and associated tools and libraries into private/andrey package -- Intermediate commit on the way to archiving SomaticIndelDetector and other tools. -- SomaticIndelDetector, PairMaker and RemapAlignments tools have been refactored into the private andrey package. All utility classes refactored into here as well. At this point, the SomaticIndelDetector builds in this version of the GATK. -- Subsequent commit will put this code into the archive so it no longer builds in the GATK	2012-12-29 14:34:08 -05:00
Ami Levy-Moonshine	f450cbc1a3	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-27 21:23:59 -05:00
Eric Banks	75d5b88a3d	Enabling the Recal Report unit test (which looks like it was never ever enabled)	2012-12-26 15:35:50 -05:00
Eric Banks	efceb0d48c	Check for well-encoded reads while fixing mis-encoded ones	2012-12-26 14:30:51 -05:00
Mark DePristo	af9746af52	Fix merge failure	2012-12-24 13:43:04 -05:00
Mark DePristo	04cc75aaec	Minor cleanup and expansion of the RecalDatum unit tests	2012-12-24 13:35:58 -05:00
Mark DePristo	7bf1f67273	BQSR optimization: read group x quality score calibration table is thread-local -- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table. Radically reduces the contention for RecalDatum in this very highly used table -- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables. Updated combine in RecalibrationReport to use this table combiner function -- Made several core functions in RecalDatum into final methods for performance -- Added RecalibrationTestUtils, a home for recalibration testing utilities	2012-12-24 13:35:58 -05:00
Mark DePristo	295455eee2	NanoScheduler optimizations and simplification -- The previous model was to enqueue individual map jobs (with a resolution of 1 map job per map call), to track the number of map calls submitted via a counter and a semaphore, and to use this information in each map job and reduce to control the number of map jobs, when reduce was complete, etc. All hideously complex. -- This new model is vastly simply. The reducer basically knows nothing about the control mechanisms in the NanoScheduler. It just supports multi-threaded reduce. The NanoScheduler enqueues exactly nThread jobs to be run, which continually loop reading, mapping, and reducing until they run out of material to read, when they shut down. The master thread of the NS just holds a CountDownLatch, initialized to nThreads, and when each thread exits it reduces the latch by 1. The master thread gets the final reduce result when its free by the latch reaching 0. It's all super super simple. -- Because this model uses vastly fewer synchronization primitives within the NS itself, it's naturally much faster at getting things done, without any of the overhead obvious in profiles of BQSR -nct 2.	2012-12-24 13:35:57 -05:00
Mark DePristo	aa3ee29929	Handle case where the ReadGroup is null in GATKSAMRecord	2012-12-24 13:35:57 -05:00
Mark DePristo	bf81db40f7	NanoScheduler reducer optimizations -- reduceAsMuchAsPossible no longer blocks threads via synchronization, but instead uses an explicit lock to manage access. If the lock is already held (because some thread is doing reduce) then the thread attempting to reduce immediately exits the call and continues doing productive work. They removes one major source of blocking contention in the NanoScheduler	2012-12-24 13:35:57 -05:00
Mark DePristo	940816f16a	GATKSamRecord now checks that the read group is a GATKReadGroupRecord, and if not makes one	2012-12-24 13:35:57 -05:00
Mark DePristo	14944b5d73	Incorporating clover into build.xml -- See http://gatkforums.broadinstitute.org/discussion/2002/clover-coverage-analysis-with-ant for use docs -- Fix for artificial reads not having proper read groups, causing NPE in some tests -- Added clover itself to private/resources	2012-12-24 13:35:57 -05:00
Mark DePristo	7796ba7601	Minor optimizations for NanoScheduler -- Reducer.maybeReleaseLatch is no longer synchronized -- NanoScheduler only prints progress every 100 or so map calls	2012-12-24 13:35:56 -05:00
Mark DePristo	0f04485c24	NanoScheduler optimization: don't use a PriorityBlockingQueue for the MapResultsQueue -- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue. -- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N. This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step -- Had to update ReducerUnitTest because the test itself was brittle, and broken when I changed the underlying code. -- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc) -- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs	2012-12-24 13:35:56 -05:00
Mark DePristo	b92f563d06	NanoScheduler optimization for TraverseReadsNano -- Pre-read MapData into a list, which is actually faster than dealing with future lock contention issues with lots of map threads -- Increase the ReadShard default size to 100K reads by default	2012-12-24 13:35:56 -05:00
Mark DePristo	f849910c4e	BQSR optimization: only compute BAQ when there's at least one error to delocalize -- Saves something like 2/3 of the compute cost of BQSR	2012-12-24 13:35:56 -05:00
Mark DePristo	0f0188ddb1	Optimization of BQSR -- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine. This object has a restrictive interface to just get information about specific qual and error values at offset and for event type. This restrict allows us to avoid creating an vector of byte 45 for each read to represent BI and BD values not in the reads. Shaves 5% of the runtime off the entire code. -- Cleaned up code and added lots more docs -- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR. 95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.	2012-12-24 13:35:09 -05:00
Mark DePristo	f6d5499582	The GATK engine now ensures that incoming GATKSAMRecords have GATKSAMReadGroupRecord objects in their header -- Update SAMDataSource so that the merged header contains GATKSAMReadGroupRecord -- Now getting the NGSPlatform for a GATKSAMRecord is actually efficient, instead of computing the NGS platform over and over from the PL string -- Updated a few places in the code where the input argument is actually a GATKSAMRecord, not a SAMRecord for type safety	2012-12-24 13:35:09 -05:00
Ami Levy-Moonshine	8be01af145	add the new gather tool to GATKExtensionsGenerator	2012-12-21 15:09:00 -05:00
Ami Levy-Moonshine	3ca3fd4b3e	keep working on loglessHMM in UG	2012-12-21 11:06:12 -05:00
Ami Levy-Moonshine	6590039bc3	add fast gather to UG; change UG to work with log-lessHMM (work in prograss)	2012-12-20 14:58:57 -05:00
Tad Jordan	b491c177ff	Added functionality of outputting sorted GATKReport Tables - Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables - Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables	2012-12-20 14:02:21 -05:00
Eric Banks	6c3f5eefe9	Merged bug fix from Stable into Unstable	2012-12-19 22:29:21 -05:00
xingwei2012	22d13ccdab	Bug fix for Queue LSF v8.3 the function ls_getLicenseUsage() is not supported by LSF v8.x, comment the line: public static native lsfLicUsage.ByReference ls_getLicenseUsage() Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-12-19 22:28:53 -05:00
Ryan Poplin	54e5c84018	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-19 11:31:40 -05:00
David Roazen	07b369ca7e	Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package This is an intermediate commit so that there is a record of these changes in our commit history. Next step is to isolate the test classes as well, and then move the entire package to the Picard repository and replace it with a jar in our repo. -Removed all dependencies on org.broadinstitute.sting (still need to do the test classes, though) -Had to split some of the utility classes into "GATK-specific" vs generic methods (eg., GATKVCFUtils vs. VCFUtils) -Placement of some methods and choice of exception classes to replace the StingExceptions and UserExceptions may need to be tweaked until everyone is happy, but this can be done after the move.	2012-12-19 10:25:22 -05:00
Ryan Poplin	cda0c48570	auto-merge	2012-12-19 10:12:49 -05:00
Mark DePristo	1ca13f9581	Fundamentally better model for the NanoScheduler -- Now each map job reads a value, performs map, and does as much reducing as possible. This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc. All of this is accomplished using exactly NCT% of the CPU of the machine. -- Has the additional value of actually simplifying the code -- Resolves a long-standing annoyance with the nano scheduler.	2012-12-19 09:31:31 -05:00
David Roazen	d0cd29cb36	Merged bug fix from Stable into Unstable	2012-12-19 02:20:28 -05:00
David Roazen	0d93330ab9	Fix bug in the PerSampleDownsamplingReadsIterator that could lead to excessive memory usage at traversal startup This is a MUST-HAVE update for GATK 2.3 users who want to try out the new ability to use -dcov with ReadWalkers.	2012-12-19 02:05:36 -05:00
Yossi Farjoun	6ed9eb3da9	GATKBAMIndex now passes unit test! Problem was that SeekableBufferedStream seems to have a bug: it will read beyond the end of a file if asked to.	2012-12-18 17:32:26 -05:00
Ryan Poplin	902ca7ea70	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-18 15:45:33 -05:00
Ryan Poplin	3950f7b3e3	Increasing the INFORMATIVE_LIKELIHOOD_THRESHOLD value to 0.2	2012-12-18 15:45:12 -05:00
Ryan Poplin	b5d590ba92	Based on NA12878 knowledge base experiments updating HC to allow for a much smaller minimum kmer length in the assembly graph.	2012-12-18 15:43:56 -05:00
eitanbanks	002ce9c1d5	Merge pull request #8 from yfarjoun/master Huge speedup in initial traversal of BAM index files (x20 speed!)	2012-12-18 10:16:53 -08:00
Mark DePristo	16eb1c5436	Optimization to TraverseReadsNano -- Don't just read all inputs into a list, and then provide an iterator to that list, actually make a real iterator so NanoScheduler input thread can contribute meaningfully to the work load -- Use NanoScheduler progress function, instead of home-grown updater	2012-12-18 10:14:47 -05:00
Mark DePristo	b33f804cdc	Inline increment function in RecalDatum to avoid minor duplication of work and multiple synchronized method calls	2012-12-17 16:47:27 -05:00
Mark DePristo	66d32f646b	Minor cleanup of BAQ calculation (final variables, etc)	2012-12-17 16:47:27 -05:00
Mark DePristo	67fe81391c	ProgressMeter optimization: don't do genome loc formatting, but instead create an object that only formats when printing is actually needed	2012-12-17 16:47:27 -05:00
Mark DePristo	1de2f527b9	Optimization of recalibrateRead -- Refactor calculation so that upfront constant values are pre-computed, and cached, and their values just looked up during application -- Trivial comment on how we might use BAQ better in BaseRecalibrator	2012-12-17 16:47:27 -05:00
Mark DePristo	bd6cda7542	Trivial optimization of TraverseReadsNano -- don't format the shard toString if logger isn't debug enabled	2012-12-17 16:47:27 -05:00
Mark DePristo	a481d006f0	Optimizations for applying BQSR table with PrintReads -- Cleaned up code in updateDataForRead so that constant values where not computed in inner loops -- BaseRecalibrator doesn't create it's own fasta index reader, it just piggy backs on the GATK one -- ReadCovariates <init> now uses a thread local cache for it's int[][][] keys member variable. This stops us from recreating an expensive array over and over. In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values its skipping because of low quality base clipping. Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.	2012-12-17 16:47:27 -05:00
Mark DePristo	5ec25797b3	Optimizations for BaseRecalibrator -- No longer computes at each update the overall read group table. Now computes this derived table only at the end of the computation, using the ByQual table as input. Reduces BQSR runtime by 1/3 in my test	2012-12-17 16:47:27 -05:00
Eric Banks	e6f468b647	Refactored the quasi-useful IndelType annotation into the more useful VariantType. The indels are still annotated as before, but now all other variant types are annotated too. I'm doing this because of requests on the forum but am not making it standard. If we find it to be useful we can turn it on by default later.	2012-12-17 11:54:47 -05:00
Eric Banks	762f184262	Bug fix for strict validation: rsID checking wasn't working if there were multiple IDs	2012-12-17 10:32:41 -05:00
Yossi Farjoun	ea704d688f	chose smaller buffer size for the bufferedStream	2012-12-15 13:01:38 -05:00
Yossi Farjoun	6da2338ea7	removed comments and uneeded imports	2012-12-15 12:31:37 -05:00
Yossi Farjoun	19dd2d628a	some changes. some changes.	2012-12-14 17:21:32 -05:00
Mauricio Carneiro	74344a3871	Bringing in the changes from the CMI repo	2012-12-13 21:59:37 -05:00
Eric Banks	696bf95fba	Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer).	2012-12-13 23:28:30 +00:00
Mark DePristo	aeab932c63	Actual working version of unflushing VCFWriter -- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream. -- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping -- Same (or better) IO performance as previous implementation -- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed -- Updated compressed VCF output file	2012-12-13 16:15:08 -05:00
Yossi Farjoun	5e66109268	Replaced a useless getInt with a skipInt to remove 1/4 of the initial seek time in the BAM Index.	2012-12-12 17:08:11 -05:00
Eric Banks	62eaffdf0a	Fix docs for ReadBackedPhasing	2012-12-12 20:28:04 +00:00
Eric Banks	bba63a3b0e	Fix for GSA-615: UnifiedGenotyperEngine.getGLModelsToUse takes 5% of the runtime of UG, should be optimized away.	2012-12-12 20:25:45 +00:00
Mauricio Carneiro	a52e3c7e15	Revert "Bug fix for RR: don't let the softclip start position be less than 1" this introduced a bug in reduce reads by de-activating it's hard clipping of the out of bounds soft-clips (specially in the MT). DEV-322 #resolve #time 4m This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.	2012-12-12 13:09:39 -05:00
Mark DePristo	5632c13bf2	Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush(). -- Now compressed output VCFs are properly blocked compressed (i.e., they are actually smaller than the uncompressed VCF)	2012-12-12 10:27:07 -05:00
Mark DePristo	dd52a70d45	Fix AFCalcResult unit test -- I was simply passing in the wrong values into the function. Fixed the calls, and expanded the docs on what needs to be passed in.	2012-12-11 10:40:12 -05:00
Ami Levy-Moonshine	6bf31065e3	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-11 10:34:50 -05:00
Ami Levy-Moonshine	2e3284f306	Continue to fix the case where PRIORITIZE is used but no priority list is given. While fixing that case I also removed unnecessary sorting, when the prioeity list is not provied. When the priority list is not provided, it will continue to be null. Thus, the number of original Variant Contexts should be given as a new parameter to simpleMerge (since priority might be null). This new parameter is used for checking if there are filtered VC, when annotationOrigin is true.	2012-12-10 22:23:58 -05:00
Mauricio Carneiro	8a115edbaf	ReduceReads is now scattered by contig It's no longer safe to scatter/gather by interval because now we don't hard-clip to the intervals anymore.	2012-12-10 15:25:27 -05:00
Ami Levy-Moonshine	573ace4403	restore the right version of VariantContextUtils.java in my unstable dir	2012-12-10 10:28:56 -05:00
David Roazen	46edab6d6a	Use the new downsampling implementation by default -Switch back to the old implementation, if needed, with --use_legacy_downsampler -LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and the original LocusIteratorByState becomes LegacyLocusIteratorByState -Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer, with the old one renamed to LegacyReadShardBalancer -Performance improvements: locus traversals used to be 20% slower in the new downsampling implementation, now they are roughly the same speed. -Tests show a very high level of concordance with UG calls from the previous implementation, with some new calls and edge cases that still require more examination. -With the new implementation, can now use -dcov with ReadWalkers to set a limit on the max # of reads per alignment start position per sample. Appropriate value for ReadWalker dcov may be in the single digits for some tools, but this too requires more investigation.	2012-12-10 09:44:50 -05:00
Ami Levy-Moonshine	5460c96137	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-09 23:43:57 -05:00
Ami Levy-Moonshine	3a420d163e	(1) changes in catVariants (work still under development) (2) changes to CV to throw an error when GenotypeMergeType is PRIORITIZE but no priority (rod_priority_list) is not given. Reported by TechnicalVault on the forum on Nov 14 2012	2012-12-09 23:40:03 -05:00
Eric Banks	574d5b467f	Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype.	2012-12-09 02:09:34 -05:00
Ami Levy-Moonshine	5d78a61f7a	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-05 15:07:12 -05:00
Mark DePristo	465694078e	Major performance improvement to the GATK engine -- The NanoSchedule timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers. Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests. -- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x. For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement). -- Took this opportunity to improve the GATK ProgressMeter. NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress. This removes all inner loop calls to the GATK timers. -- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402	2012-12-05 14:49:22 -05:00
Mark DePristo	2b601571e7	Better error handling in NanoScheduler -- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown. Errors, like out of memory, would cause the whole system to die. This bugfix resolves that issue	2012-12-05 14:49:22 -05:00
Eric Banks	0c925856cb	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-05 02:00:39 -05:00
Eric Banks	ef87b18e09	In retrospect, it wasn't a good idea to have FisherStrand handle reduced reads since they are always on the forward strand. For now, FS ignores reduced reads but I've added a note (and JIRA) to make this work once the RR het compression is enabled (since we will have directionality in reads then).	2012-12-05 02:00:35 -05:00
Mauricio Carneiro	30f013aeb0	Added a copy() method for ReadBackedPileups necessary to create new alignment contexts with hard-copies of the pileup.	2012-12-05 01:32:18 -05:00
Mauricio Carneiro	6feda540a4	Better error message for SimpleGATKReports	2012-12-05 01:32:18 -05:00
Randal Moore	8d2d0253a2	introduce a level of indirection for the forum URLs - this new function will allow me a place to morph the URL into something that is supported by Confluence Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-12-03 22:33:02 -05:00
Eric Banks	67932b357d	Bug fix for RR: don't let the softclip start position be less than 1	2012-12-03 15:59:14 -05:00
Ryan Poplin	a47da9bb2f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-03 14:30:14 -05:00
Eric Banks	5fed9df295	Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh).	2012-12-03 12:18:20 -05:00
Eric Banks	b6839b3049	Added checking in the GATK for mis-encoded quality scores. The check is performed by a Read Transformer that samples (currently set to once every 1000 reads so that we don't hurt overall GATK performance) from the input reads and checks to make sure that none of the base quals is too high (> Q60). If we encounter such a base then we fail with a User Error. * Can be over-ridden with --allow_potentially_misencoded_quality_scores. * Also, the user can choose to fix his quals on the fly (presumably using PrintReads to write out a fixed bam) with the --fix_misencoded_quality_scores argument. Added unit tests.	2012-12-03 11:18:41 -05:00
Ryan Poplin	18b002c99c	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-03 10:08:56 -05:00
Ryan Poplin	1bdf17ef53	Reworking of how the likelihood calculation is organized in the HaplotypeCaller to facilitate the inclusion of per allele downsampling. We now use the downsampling for both the GL calculations and the annotation calculations.	2012-12-02 11:58:32 -05:00
Ami Levy-Moonshine	d0b8cc7773	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-12-01 00:08:25 -05:00
Ami Levy-Moonshine	969c995298	work under development - catVariants. Changes to AssessRRQuals based on Eric todo comments. bug fix in CombineVariants	2012-12-01 00:08:19 -05:00
Mark DePristo	8020ba14db	Minor cleanup of SAMDataSource as part of my system review -- Changed a few function from public to protected, as they are only used by the package contents, to simplify the SAMDataSource interface	2012-11-30 15:04:41 -05:00
Mauricio Carneiro	fc7fab5f3b	Fixed ReadBackedPileup downsampling Downsampling in the PerSampleReadBackedPileup was broken, it didn't downsample anything, always returning a copy the original pileup.	2012-11-30 00:42:05 -05:00
Joel Thibault	97d29f203e	Add walltime changes to LSF - Check whether the specified attribute is available - Add pipeline test (disabled due to missing attribute)	2012-11-29 15:23:37 -05:00
Joel Thibault	198923b597	Add ActiveRegionReadState handling	2012-11-28 13:59:57 -05:00
Ryan Poplin	f0395b457a	Adding the work-in-progress, experimental RepeatLengthCovariate to the BQSR so Chris can continue the development.	2012-11-28 13:56:32 -05:00
Eric Banks	3463774f2a	Merged bug fix from Stable into Unstable	2012-11-28 13:26:52 -05:00
Eric Banks	6030605242	Added quick check for creation of bad BAQ values associated with badly encoded base qualities; hopefully this can help us debug the non-reproducible issue seen by many users.	2012-11-28 13:26:31 -05:00
Mark DePristo	c676853731	Merged bug fix from Stable into Unstable. Updating md5s Conflicts: protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java	2012-11-28 12:54:36 -05:00
Mark DePristo	a1d6461121	Critical bugfix to AFCalcResult affecting UG/HC quality score emission thresholds As reported by Menachem Fromer: a critical bug in AFCalcResult: Specifically, the implementation: public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) { return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef; } seems incorrect and should probably be: getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef The issue here is that the 30 represents a Phred-scaled probability of error and it's currently being compared to a log probability of non-error. Instead, we need to require that our probability of error be less than the error threshold. This bug has only a minor impact on the calls -- hardly any sites change -- which is good. But the inverted logic effects multi-allelic sites significantly. Basically you only hit this logic with multiple alleles, and in that case it'\s including extra alt alleles incorrectly, and throwing out good ones. Change was to create a new function that properly handles thresholds that are PhredScaled quality scores: /** * Same as #isPolymorphic but takes a phred-scaled quality score as input */ public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) { if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 "); final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual)); return isPolymorphic(allele, log10Threshold); }	2012-11-28 12:08:02 -05:00
Menachem Fromer	79bc878e6a	Allow debugging to be set from the command line	2012-11-27 22:37:41 -05:00
Eric Banks	b40d3eb8aa	Merged bug fix from Stable into Unstable	2012-11-27 14:41:07 -05:00
Eric Banks	01abcc3e0f	Tests didn't like my note to Geraldine in the output logs; apparently it's tested in integration tests	2012-11-27 14:40:49 -05:00
Joel Thibault	d83ad906ef	Add profile range contract	2012-11-27 13:03:13 -05:00
Eric Banks	9531e58445	Merged bug fix from Stable into Unstable	2012-11-27 11:00:50 -05:00
Eric Banks	4543ece088	Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case.	2012-11-27 11:00:33 -05:00
Eric Banks	a82ec7ad80	Merged bug fix from Stable into Unstable	2012-11-27 10:27:08 -05:00
Eric Banks	e199562c25	I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever).	2012-11-27 10:26:17 -05:00
Mauricio Carneiro	97fd5de260	Merging latest CMI updates with UNSTABLE	2012-11-27 09:08:00 -05:00
Eric Banks	b1969a66bd	Update docs	2012-11-27 08:24:41 -05:00
Eric Banks	cc72aaefeb	Minor efficiency: use >= instead of > in test	2012-11-27 01:11:23 -05:00
Eric Banks	405f3c675d	Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc.	2012-11-27 01:07:00 -05:00
Ryan Poplin	e27d677c13	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-26 12:20:32 -05:00
Ryan Poplin	c3b7dd1374	Misc cleanup in the HaplotypeCaller. Cleaning up unused arguments after recent changes to HC-GenotypingEngine	2012-11-26 12:19:11 -05:00
Eric Banks	4f7fa3009a	I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA.	2012-11-26 11:34:59 -05:00
Mauricio Carneiro	a3f5932501	Fixed null pointer exception in Integration Tests When running Utils.setupWriter with NO_PG_TAG set, the writer was attempting to create a program record with the null pointer. Fixed.	2012-11-26 11:12:27 -05:00
Ryan Poplin	fedc4fde6c	Merged bug fix from Stable into Unstable	2012-11-25 21:55:55 -05:00
Ryan Poplin	d978cfe835	Soft clipped bases shouldn't be counted in the delocalized BQSR.	2012-11-25 21:55:29 -05:00
Eric Banks	9719ba7adc	Remove -number example from the docs since it's no longer supported.	2012-11-22 21:53:42 -05:00
Menachem Fromer	2306518ab6	Fix to deal with 'proper' options of casting	2012-11-22 01:45:18 -05:00
Menachem Fromer	d33a412b5f	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-22 01:42:29 -05:00
Mark DePristo	48f271c5bd	Adding 80% support for multi-allelic variants -- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here. It's not perfect and needs some discussion on how to handle het/alt variants -- Adding splitInBiallelic funtion to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)	2012-11-21 17:24:59 -05:00
Joel Thibault	c08b782743	Count isActive calls directly	2012-11-21 17:16:45 -05:00
Eric Banks	4f2229d399	As per the TODO message, I removed a check that was no longer necessary. Now ID is an allowable INFO field key.	2012-11-21 16:01:26 -05:00
Menachem Fromer	06261b58c2	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-21 15:57:08 -05:00
Eric Banks	ed50814ccb	Finally found a case where user errors were being masked behind other errors and could debug. It turns out that the checkForMaskedUserErrors() method needs to run recursively over all levels (calling exception.getCause()) to check for the original cause.	2012-11-21 15:57:05 -05:00
Menachem Fromer	c8be7c3102	Keep SNPs and indels separately for batch merging; Add options to DepthOfCoverage to count fragments (to not double-count overlapping reads of same fragment); DepthOfCoverage should now support ReducedReads; Replace recusrion with loop in DoC/package.scala (for lists longer than 5000 elements)	2012-11-21 15:56:53 -05:00
Ami Levy-Moonshine	4714ccc284	change the way CombineVariants check the priority arguments in order to throw error when the genotypeMergeOption argument is set to PRIORITIZE but PRIORITY_STRING is not provided	2012-11-21 10:47:35 -05:00
Eric Banks	2e1a055aca	Merged bug fix from Stable into Unstable	2012-11-20 23:20:33 -05:00
Eric Banks	c54fc94505	Protect against features that start off the end of the read (otherwise, Arrays.fill fails)	2012-11-20 23:19:59 -05:00
Eric Banks	c2efb04657	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-20 22:43:15 -05:00
Eric Banks	72e2d569c5	The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality.	2012-11-20 22:41:57 -05:00
Mark DePristo	cc7680e601	NA12878 knowledge base backed by MongoDB -- Idea is simply to create a persistent database of all TP/FP sites on chr20 in NA12878. Individual callsets can be imported, and a consensus algorithm is run over all callsets in the database to create a consensus collection, which can be used to assess NA12878 callsets for GATK and methods development -- Framework for representing simple VariantContexts and Genotypes in MongoDB, querying for records, and iterating over them in the GATK -- Not hooked up to Tribble, but could be done reasonably easily now (future TODO) -- Tools to import callsets, create consensus callsets, import and export reviews -- Scripts to reset the knowledge base and repopulate it with the standard data files (Eric will expand) -- Actually scales to all of chr20, includes AssessNA12878 that reads a VCF and itemizes it against the truth data set -- ImportCallset can load OMNI, HM3, CEU best practices, mills/devine sites and genotypes, properly marking sites as poly/mono/unk as well as TP/FP/UNK based on command line parameters -- Added shell scripts that start up a local mongo db, that connect to a local or BI hosted mongo for NA12878.db for debugging, and a setupNA12878db script that can load OMNI, HM3, CEU best practices, Mills/Devine into the db and then update the consensus. -- Reviewed sites can be exported to a VCF, and imported again, as a mechanism to safely store the only non-recoverable data from the Mongo DB. -- Created a NA12878DBWalker that manages the outer DB interaction, and that all MongoDB interacting walkers inherit from. Added a NA12878DBArgumentCollection.java consolating all of the common command line arguments (though strictly not necessary as all of this occurs in the root walker) UnitTests -- Can connect to a test knowledge base for development and unit testing -- PolymorphicStatus, TruthStatus, SiteIterator -- NA12878KBUnitTestBase provides simple utilities for connecting to the test mongo db, getting calls, etc -- MongoVariantContext tests creation, matching, and encoding -> writing -> read -> decoding from the mongodb AssessNA12878 -- Generic tool for comparing a NA12878 callset against the knowledge base. See http://gatkforums.broadinstitute.org/discussion/1848/using-the-na12878-knowledge-base for detailed documentation -- Performs trivial filtering on FS, MQ, QD for SNPs and non-SNPs to separate out variants likely to be filtered from those that are honest-to-goodness FPs Misc -- Ability to provide Description for Simplified GATK report	2012-11-20 18:50:52 -05:00
Eric Banks	937ac7290f	Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results.	2012-11-20 16:13:29 -05:00
Eric Banks	4f243acaa6	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2012-11-19 10:34:44 -05:00
Eric Banks	f0b8a0228f	Quick fix for HC refactoring: when copying over Haplotype objects, make sure to copy over the artificial allele used to create it too.	2012-11-19 09:57:55 -05:00
Eric Banks	ff180a8e02	Significant refactoring of the Haplotype Caller to handle problems with GGA. The main fix is that we now maintain a mapping from 'original' allele to 'Smith-Waterman-based' allele so that we no longer need to do a (buggy) matching throughout the calling process.	2012-11-19 09:09:57 -05:00
Eric Banks	78ce822b6f	Protect against NPE when using non-GATK reports for inputs expecting valid GATK reports	2012-11-19 09:07:04 -05:00
Joel Thibault	b70fd4a242	Initial testing of the Active Region Traversal contract - TODO: many more tests and test cases	2012-11-15 10:08:00 -05:00
Guillermo del Angel	a68e6810c9	Back off experimental code that escaped last commit, not for general use yet	2012-11-14 14:45:15 -05:00
Guillermo del Angel	89bbe73a43	Commenting out CMI pipeline test that wasn't meant to be in GATK repository (why was this merged??)	2012-11-14 14:39:04 -05:00
Guillermo del Angel	3771d074dc	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-11-14 14:37:43 -05:00
Mauricio Carneiro	e35fd1c717	Merging CMI-0.5.0 and GATK-2.2 together.	2012-11-14 10:42:03 -05:00
Mauricio Carneiro	a079d8d0d1	Breaking the utility to write @PG tags for SAMFileWriters and StingSAMFileWriters	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	dba31018f4	Implementation of BySampleSAMFileWriter ReduceReads now works with the n-way-out capability, splitting by sample. DEV-27 #resolve #time 3m	2012-11-14 10:33:22 -05:00
Mauricio Carneiro	a17cd54b68	Co-Reduction implementation in ReduceReads ReduceReads now co-reduces bams if they're passed in toghether with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples. Also: * Added integrationtest for co-reduction * Fixed bug with new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5's accordingly) DEV-200 #resolve #time 8m	2012-11-14 10:33:21 -05:00
kshakir	6d59dd3455	Scala classes were only returning direct subclasses (confirmed when inspected in debugger) so changed PluginManager to allow specifying the explicit subclass. Removed some generics from PluginManager for now until able to figure out syntax for requesting explicit subclass. QStatusMessenger uses a slightly more primitive Map[String, Seq[RemoteFile]] instead of Map[ArgumentSource, Seq[RemoteFile]]. Added an QCommandPlugin.initScript utility method for handling specialized script types.	2012-11-14 10:33:20 -05:00
Eric Banks	42ddf51156	Merged bug fix from Stable into Unstable	2012-11-14 10:29:09 -05:00
Eric Banks	ba41f65759	Protect against NPEs in SelectVariants by checking for missing Genotypes	2012-11-13 11:53:39 -05:00
Eric Banks	c7335c9902	Having a malformed GATK report is a User Error	2012-11-13 11:53:12 -05:00
Eric Banks	525cf331f4	Don't catch a User Error and re-throw as a Reviewed Exception. That makes Eric unhappy.	2012-11-13 11:52:47 -05:00
Eric Banks	ee776e996a	Merged bug fix from Stable into Unstable	2012-11-09 08:35:51 -05:00
Eric Banks	66cbaaee31	Fixed nasty bug in BQSR csv file creation: numbers larger than 999 in the Errors column were printed out with commas (which looks like a separate column). This wasn't caught earlier because there are no integration tests covering the csv. I'll add one into unstable in a sec.	2012-11-09 08:33:55 -05:00
Eric Banks	e9183d9fe0	Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec.	2012-11-08 15:07:47 -05:00
Eric Banks	17ab3a39d5	Make the --intermediate_csv_file argument un-hidden.	2012-11-08 14:35:23 -05:00
Eric Banks	f4d4846435	Merged bug fix from Stable into Unstable	2012-11-06 20:53:54 -08:00
Eric Banks	15b8c08132	Apparently CIGAR elements can have 0 length according to the spec, but 0Ms were causing left alignment of indels to fail. Fixed.	2012-11-06 20:53:33 -08:00
Mark DePristo	f8a0a947e3	Critical bugfix for GSA-652 / Multi-threaded VCF -> BCF writing produces invalid intermediate file that fails on merging -- New tribble library now uses 64 bit sizes. The 26K VCF has so much data that low-level tribble block indices where overflowing their int size values. This includes a to-be-committed tribble jar that fixes this problem -- See https://jira.broadinstitute.org/browse/GSA-652 -- Minor cleanup of error messages that were useful on the way to solving this monster problem	2012-11-02 09:09:59 -04:00
Ryan Poplin	386b45e94d	This VE eval module isn't useful anymore.	2012-11-01 15:44:41 -04:00
Mark DePristo	1444cd753b	Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome -- The logic for determining active regions was a bit broken in the HC when intervals were used in the system -- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR -- Non-overlapping intervals are always treated as separate objects for determing active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive -- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future. -- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC -- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions -- Remove excessively expensive calls to check bases are upper cased in ReferenceContext -- Update md5s after a lot of manual review and discussion with Ryan	2012-11-01 15:34:04 -04:00
Mark DePristo	9cd04c335c	Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data -- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases. This is now done by default, unless requested explicitly to preserve the original bases. -- This is really the correct place to do this for a variety of reasons. First, you don't need to work about upper casing bases throughout the code. Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves. Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase() -- Added unit tests to ensure this functionality works correct. -- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarentees that the reference bases are all upper case. -- Added contracts to ensure this is the case. -- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code	2012-11-01 15:34:03 -04:00
Guillermo del Angel	24e6da25cc	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-31 14:17:41 -04:00
Eric Banks	96344c6b62	Add note to realigner docs	2012-10-31 12:35:45 -04:00
Guillermo del Angel	4580e99c0c	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-31 10:50:54 -04:00
Guillermo del Angel	02b790c8db	Merge fix	2012-10-31 10:50:36 -04:00
Guillermo del Angel	51a9ce28e1	Merge remote-tracking branch 'unstable/master' into develop	2012-10-31 10:29:48 -04:00
Eric Banks	e1e480a0b9	Bug fix: don't add no-call alleles to the list of ALT alleles being validated.	2012-10-30 14:54:29 -04:00
Guillermo del Angel	c8e17a7adf	totally experimental UG feature, to be removed	2012-10-30 13:57:54 -04:00
Eric Banks	c95e893920	Better error message for unused ALT alleles	2012-10-29 21:51:35 -04:00
Eric Banks	b6a1967f12	Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it.	2012-10-29 21:47:09 -04:00
Eric Banks	be902375ac	'Bug' fix: fix the error message from the vcf validator so people realize that the file fails strict validation but still adheres to the spec.	2012-10-29 16:29:27 -04:00
Ryan Poplin	4e661847b2	DelocalizedBaseRecalibrator becomes the BaseRecalibrator.	2012-10-29 12:53:39 -04:00
Eric Banks	ac99437eec	Bug fixes to hapmap conversion in VariantsToVCF	2012-10-29 01:45:33 -04:00
Andrey Sivachenko	f3ac5d404d	updating vcf header attribute descriptions in order to reflect correctly what's actually being written...	2012-10-26 23:52:21 -04:00
Andrey Sivachenko	b4fbf6280a	fixing missing sample genotype bug, missing AD/DP bug, and putting annotations in more natural order (Ref/Alt)	2012-10-26 23:48:40 -04:00
Mark DePristo	ac5e58a265	Bugfix for GSA-540 / Update metadata maps when adding lines to VCFHeader -- https://jira.broadinstitute.org/browse/GSA-540 -- http://gatkforums.broadinstitute.org/discussion/1433/possible-bug-and-fix-in-java-code-of-vcfheader-org-broadinstitute-sting-utils-codecs-vcf-vcfheader	2012-10-26 16:34:16 -04:00
Mark DePristo	fa9b2a91d0	Bugfix for GSA-552 -- https://jira.broadinstitute.org/browse/GSA-552 -- User reports a null exception while using VariantsToVCF: http://gatkforums.broadinstitute.org/discussion/1461/nullpointerexception-converting-vcf3-to-vcf-using-variantstovcf The problem is that he left out an input VCF file for the --variant argument and the command-line argument parsing code didn't catch this, so we NPE out later on.	2012-10-26 16:34:16 -04:00
Eric Banks	f66d812778	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 13:20:41 -04:00
Eric Banks	a8704ca73f	Adding TODO notes for Ami	2012-10-26 13:20:27 -04:00
Mark DePristo	251983b8fb	Add GATK-wide command line argument to control the maximum runtime allowed for the GATK -- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker as through the end of processing had been reached. This means that all output files are closed properly, etc. -- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update -- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling? -- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.	2012-10-26 13:18:34 -04:00
Eric Banks	b06f689d4b	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 02:13:26 -04:00
Eric Banks	bf3d61ce82	The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it.	2012-10-26 01:04:51 -04:00
Eric Banks	91f2c847a3	Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test.	2012-10-26 00:57:40 -04:00
Mark DePristo	6b8b7df651	Queue now understands -nct and requests the appropriate number of cores from LSF, SGE, etc -- NCT wasn't previously recognized by Queue as needing more processors per machine. This commit fixes this. Also a potential cause of poor GATKPerformanceOverTime, in that runs with -nct could flood a node and cause it to have hundreds of cores in contention.	2012-10-25 17:26:58 -04:00
David Roazen	422e16c62e	BaseRecalibration: don't cache instances of ReadCovariates across reads Caching and reusing ReadCovariates instances across reads sounds good in theory, but: -it doesn't work unless you zero out the internal arrays before each read -the internal arrays must be sized proportionally to the maximum POSSIBLE recalibrated read length (5000!!!), instead of the ACTUAL read lengths By contrast, creating a new instance per read is basically equivalent to doing an efficient low-level memset-style clear on a much smaller array (since we use the actual rather than the maximum read length to create it). So this should be faster than caching instances and calling clear() but slower than caching instances and not calling clear(). Credit to Ryan to proposing this approach.	2012-10-25 17:02:55 -04:00
Guillermo del Angel	92fa7e953a	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 16:33:14 -04:00
Ami Levy Moonshine	dde3060bb8	add the CEUtrio best practices results (UG + PBT) to the bundle	2012-10-25 15:36:17 -04:00
Ami Levy Moonshine	90b9971033	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 15:32:29 -04:00
David Roazen	884d031e72	NestedIntegerArray: Pre-allocate only the first two dimensions It turns out that pre-allocating the entire tree was too expensive in terms of memory when using large values for the -mcs and -ics parameters. Pre-allocating the first two dimensions prevents us from ever locking the root node during a put(). Contention between threads over lower levels of the tree should be minimal given that puts are rare compared to gets. Also output dimensions and pre-allocation info at startup. If pre-allocation takes longer than usual this gives the user a sense of what is causing the delay.	2012-10-25 15:17:42 -04:00
Mark DePristo	cc8c12b954	Committing a broken version of BaseRecalibration -- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA. -- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails. But that's because of this incorrect calculation -- Updating GATKPerformanceOverTime with the new @ClassType annotation	2012-10-25 14:46:35 -04:00
Eric Banks	e93ff3ea6e	Let's go back to having the SB/SLOD NOT computed by default. If you recall, it was only enabled by default because we thought we were going to use it when we made VQSR use random forests. But since we decided not to change VQSR, there's no reason to triple the computation for every variant site anymore.	2012-10-25 12:45:23 -04:00
Guillermo del Angel	a838653822	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 10:35:58 -04:00
Guillermo del Angel	596c1723ae	Hidden, unsupported ability of VariantEval to run AlleleCount stratification on sites-only VCFs. I'll expose it/add tests on it if people think this is generaly useful. User needs to specify total # of samples as command line argument since genotypes are not available. Also, fixes to large-scale validation script: lower -minIndelFrac threshold or else we'll kill most indels since default 0.25 is too high for pools, fix also VE stratifications and add one VE run where eval=1KG, comp=pool data and AC stratification based on 1KG annotation	2012-10-25 10:35:43 -04:00
Eric Banks	6dc7d872ec	Fix GenotypeAndValidate to handle SNPs and indels as reported on the forum. Recent changes to the UnifiedArgumentCollection made this stop working. Adding in JIRA to create integration tests for this tool.	2012-10-25 10:06:13 -04:00
Eric Banks	df9e0b7045	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 02:49:54 -04:00
Eric Banks	72714ee43e	Minor patches to get the contamination down-sampling working for indels. Adding @Hidden logging output for easy debugging.	2012-10-25 02:47:42 -04:00
Eric Banks	c6b57fffda	Added allele biased down-sampling capabilities to the PerReadAlleleLikelihoodMap object, which means that both the UG and HC can use this functionality. Note that it's only available in protected, so GATK-lite users won't be allowed to enable it. Needs more testing.	2012-10-24 22:52:25 -04:00
Ami Levy Moonshine	bcf3582095	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-24 21:50:41 -04:00
Eric Banks	9da7bbf689	Refactoring the PerReadAlleleLikelihoodMap in preparation for adding contntamination downsampling into protected only.	2012-10-24 15:49:07 -04:00
David Roazen	d9aa9855f8	Better comments in NestedIntegerArray	2012-10-24 15:29:13 -04:00
David Roazen	02018ca764	Legacy BaseRecalibrator walker is neither TreeReducible nor NanoSchedulable The old BaseRecalibrator walker is and never will be thread-safe, since it's a LocusWalker that uses read attributes to track state. ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1 for testing purposes, but wait for further testing to be done before using it for production purposes in multithreaded mode.	2012-10-24 15:22:50 -04:00
David Roazen	32a6d7000a	Thread-safe ReadGroupCovariate The ReadGroupCovariate class was not thread-safe. This led to horrible race conditions in multithreaded runs of the BQSR where (for example) the same read group could get inserted into the reverse lookup table twice with different IDs. Should fix the intermittent crash reported in GSA-492.	2012-10-24 15:22:50 -04:00
David Roazen	991658acf4	BQSR: use more granular locking for concurrency control -With this change, BQSR performance scales properly by thread rather than gaining nothing from additional threads. -Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct (NanoScheduler) -Removes high-level locks in the recalibration engines and NestedIntegerArray in favor of maximally-granular locks on and around manipulation of the leaf nodes of the NestedIntegerArray. -NestedIntegerArray now creates all interior nodes upfront rather than on the fly to avoid the need for locking during tree traversals. This uses more memory in the initial part of BQSR runs, but the BQSR would eventually converge to use this memory anyway over the course of a typical run. IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator walker with multiple threads. The BaseRecalibrator walker is and will never be thread-safe, as it's a LocusWalker that uses read attributes to track state information. ONLY the newer DelocalizedBaseRecalibrator can be made thread-safe (and will hopefully be made so in my subsequent commits). This commit addresses performance, not correctness.	2012-10-24 15:22:50 -04:00
Eric Banks	5b7b42356b	Fix bug in GenotypeAndValidate where it doesn't check vc.hasAttribute() before using vc.getAttribute().	2012-10-24 14:02:50 -04:00
Mauricio Carneiro	c210b7cde4	Merge GATK repo into CMI-GATK Bringing in the following relevant changes: * Fixes the indel realigner N-Way out null pointer exception DEV-10 * Optimizations to ReduceReads that bring the run time to 1/3rd. Conflicts: protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java DEV-10 #resolve #time 2m	2012-10-23 10:59:11 -04:00
Mark DePristo	f838815343	Updating MD5s for confidence ref site estimation in IndependentAllelesDiploidExactAFCalc -- Included logic to only add priors for alleles with sufficient evidence to be called polymorphic. If no alleles are poly make sure to add priors of first allele	2012-10-23 06:47:53 -04:00
Mark DePristo	15b28e61cd	Retiring TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine -- There's been no report of problems with the nano scheduled version of TraverseLoci and TraverseReads, so I'm removing the old versions since they are no longer needed -- Removing unnecessary intermediate base classes -- GSA-515 / Nanoscheduler GSA-549 / https://jira.broadinstitute.org/browse/GSA-549	2012-10-22 16:55:06 -04:00
Khalid Shakir	fd59e7d5f6	Better error message when generic types are erased from scala collections.	2012-10-22 16:27:31 -04:00
Ryan Poplin	008df54575	Bug fix in GATKSAMRecord.getSoftEnd() for reads that are entirely clipped.	2012-10-22 14:21:52 -04:00
Mark DePristo	90f59803fd	MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels -- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles -- All constructores, factory methods, and test builders and their users updated to provide just a single argument -- Updating MD5s for integration tests that change due to genotyping more alleles -- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic. That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele. The effect can be large (new qual = old qual / 2 in one case here). -- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating. This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime. Might be worth it through	2012-10-22 13:47:56 -04:00
Khalid Shakir	97dc3664c9	Fixed yet another NPE related to the ArgumentTypeDescriptor vs. ArgumentMatchValue. Added integration test based on GSA-621.	2012-10-22 12:05:32 -04:00
Ami Levy Moonshine	1da4ad9607	new class to test the quals of reduced bam vs non-reduced bam	2012-10-22 10:45:53 -04:00
Mark DePristo	5296de8251	Fix UnifiedArgumentCollection constructor logic error -- The old way of overloading constructors and calling super didn't work (might have been a consequence of merge). This is the right way to do the copy constructor with the call to super()	2012-10-21 12:43:46 -04:00
Mark DePristo	6b6caf8e3a	Bugfix for indel DP calculations using reduced reads -- Adding tests for SNP and indel calling on reduced BAM	2012-10-21 12:42:32 -04:00
Mark DePristo	0fcd358ace	Original EXACT model implementation lives, providing another reference (bi-allelic only) EXACT model -- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case -- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations -- Added systematic unit tests covering this implementation, and comparing it to others -- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0, even after summing up multiple likelihoods, which is just not safe to do and was causing us to lose likelihood in some cases -- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right	2012-10-21 12:42:31 -04:00
Mark DePristo	0fb8274507	Fix contact on sorting of AFCalcResults -- The thing that must be sorted is the pre-theta^N list, which is not checked in the routine that applies the theta^N prior.	2012-10-21 12:42:31 -04:00
Mark DePristo	eaffb814d3	IndependentExactAFCalc is now the default EXACT model implementation -- Changed UG / HC to use this one via the StandardCallerArgumentCollection -- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums	2012-10-21 12:42:31 -04:00
Mark DePristo	326f429270	Bugfixes to make new AFCalc system pass integrationtests -- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests -- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel, return a null VC when the only allele we get from our final alleles to use method is the reference base -- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF == 0) = X for some meaningful value of X. Added unit test to ensure this behavior is correct -- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order. Add contract to ensure this doesn't ever happen again -- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model --	2012-10-21 12:42:31 -04:00
Mark DePristo	695cf83675	More docs and contracts for classes in genotyper.afcalc -- Future protection of the output of GeneralPloidyExactAFCalc, which produces in some cases bad likelihoods (positive values)	2012-10-21 12:42:31 -04:00
Mark DePristo	99c9031cb4	Merge AFCalcResultTracker into StateTracker, cleanup -- These two classes were really the same, and now they are actually the same! -- Cleanuped the interfaces, removed duplicate data -- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed) -- Moved goodProbability and goodProbabilityVector utilities to MathUtils. Very useful for contracts!	2012-10-21 12:42:31 -04:00
Mark DePristo	9c63cee9fc	Moving pnrm to UnifiedArgumentCollection so it's available with the HaplotypeCaller	2012-10-21 12:42:31 -04:00
Eric Banks	d44d5b8275	Fix RawHapMapCodec so that it can build indexes. Minor fixes to VCF codec.	2012-10-21 01:29:59 -04:00
Eric Banks	841a906f21	Adding a hidden (for now) argument to UG (and HC) that tells the caller that the incoming samples are contaminated by N% and to fix it by aggressively down-sampling all alleles. This actually works. Yes, you read that right: given that we know what N is, we can make good calls on bams that have N% contamination. Only hooked up for SNPS right now. No tests added yet.	2012-10-20 23:31:56 -04:00
Eric Banks	2c624f76c8	Refactoring the Unified (and Standard) Argument Collections because it was really ugly that the subclass had to do all the cloning for the super class. The clone() method is really not recommended best practice in Java anyways, so I changed it so that we use standard overloaded constructors. Confirmed that the Haplotype Caller --help docs do not include UG-specific arguments.	2012-10-20 20:35:54 -04:00
Ryan Poplin	a647f1e076	Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore.	2012-10-20 16:38:18 -04:00
Joel Thibault	45f64425a3	Update read metrics per shard rather than locus	2012-10-19 15:29:01 -04:00
Joel Thibault	637e0cf151	CountReads does not permit the use of output files	2012-10-19 15:29:01 -04:00
Khalid Shakir	2ef456d51a	Added explicit @ClassType annotations to @Argument for Option[Int] or Option[Double] since scala seems to change the reflected type to Option[Object] on some systems. Changed ReflectionUtils.getGenericTypes' order of looking for @ClassType since the primitive generic wasn't completely erased, only changed to Object which is incorrect. More fixes to @Arguments labeled as java.io.File via incorrect @Input annotation. Put in a default undocumented implementation of @Argument doc() to match the one added to @Input.	2012-10-19 13:20:29 -04:00
Eric Banks	9c088fe3fe	Actually a better implementation of GATKSAMRecord.getSoftStart(). Last commit was all wrong. Oops.	2012-10-19 12:41:24 -04:00
Eric Banks	f08e5a44da	Better implementation of GATKSAMRecord.getSoftStart()	2012-10-19 12:11:18 -04:00
Eric Banks	deca564aef	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-19 12:01:49 -04:00
Khalid Shakir	403654d40a	Fixed null checkes in ArgumentTypeDescriptor due to ArgumentMatchValue updates. Fixed @Arguments such as scatter count that were labeled as java.io.File via incorrect @Input annotation.	2012-10-18 16:57:15 -04:00
Guillermo del Angel	a4184716d8	Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-18 15:42:31 -04:00
Guillermo del Angel	3db38c5a93	Bug fix: inbreeding coeff shouldn't be computed in ref-only sites	2012-10-18 15:42:14 -04:00
Ami Levy Moonshine	a0381f15af	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-18 14:49:34 -04:00
Ryan Poplin	b4e69239dd	In order to be considered an informative read in the PerReadAlleleLikelihoodMap it has to be informative compared to all other alleles not just the worst allele. Also, fixing a bug when there is only one allele in the map.	2012-10-18 14:31:15 -04:00
Mauricio Carneiro	3504f71b6b	Fixing a null pointer exception bug for DEV-10	2012-10-18 13:58:38 -04:00
Mark DePristo	d3fc797cfe	SelectVariants is actually NOT NanoSchedulable	2012-10-18 10:42:20 -04:00
Mark DePristo	f20fa9d082	SelectVariants is actually NanoSchedulable	2012-10-18 10:27:05 -04:00
Mark DePristo	97abb98c0b	Bugfix for bad nt / nct argument detection in MicroScheduler	2012-10-18 10:27:05 -04:00
Eric Banks	54f698422c	Better implementation for getSoftEnd() in GATKSAMRecord	2012-10-18 09:01:51 -04:00
Ami Levy Moonshine	acc0fb2f7a	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-17 22:16:02 -04:00
Mauricio Carneiro	32ee2c7dff	Refactored the compression interface per sample in ReduceReadsa The CompressionStash is now responsible for keeping track of all intervals that must be kept uncompressed by all samples. In general this is a list generated by a tumor sample that will enforce all normal samples to abide. - Updated ReduceReads integration tests - Sliding Window is now using the CompressionStash (single sample). DEV-104 #resolve #time 3m	2012-10-17 16:40:40 -04:00
Mauricio Carneiro	b57df6cac8	Bringing CMI changes into the main GATK repo. Merge remote-tracking branch 'cmi/master'	2012-10-17 15:23:19 -04:00
Mark DePristo	8288c30e36	Use buffered output for ExactCallLogger	2012-10-17 14:15:11 -04:00
Mark DePristo	c9e7a947c2	Improve interface of ExactCallLogger, use it to have a more informative AFCalcPerformanceTest	2012-10-17 14:15:11 -04:00
David Roazen	b30e2a5b7d	BQSR: tool to profile the effects of more-granular locking on scalability by # of threads	2012-10-16 14:43:16 -04:00
Mark DePristo	9bcefadd4e	Refactor ExactCallLogger into a separate class -- Update minor integration tests with NanoSchedule due to qual accuracy update	2012-10-16 13:30:09 -04:00
Mark DePristo	c74d7061fe	Added AFCalcResultUnitTest -- Ensures that the posteriors remain within reasonable ranges. Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good. Now tests ensure that the normalization process preserves log10 precision where possible -- Updated MathUtils to make this possible	2012-10-16 08:11:06 -04:00
Mark DePristo	9b0ab4e941	Cleanup IndependentAllelesDiploidExactAFCalc -- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all -- Added lots of contracts and docs, still more to come. -- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't just do the default thing -- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.	2012-10-16 08:11:06 -04:00
Mark DePristo	6bd0ec8de4	Proper likelihoods and posterior probability of the joint allele frequency in IndependentAllelesDiploidExactAFCalc -- Fixed minor numerical stability issue in AFCalcResult -- posterior of joint A/B/C is 1 - (1 - P(D \| AF_b == 0)) x (1 - P(D \| AF_c == 0)), for any number of alleles, obviously. Now computes the joint posterior like this, and then back-calculates likelihoods that generate these posteriors given the priors. It's not pretty but it's the best thing to do	2012-10-16 08:11:06 -04:00
Mark DePristo	d1511e38ad	Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest -- Superceded by IndependentAFCalc -- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it. -- A few misc. bug fixes discovered during running the performance test	2012-10-16 08:11:06 -04:00
kshakir	9fcf71c031	Updated google reflections due to stale slf4j version conflicting with other projects also trying to use Queue as a component. Added targets to build.xml to effectively 'mvn install' packaged GATK/Queue from ant. TODO: Versions during 'mvn install' are hardcoded at 0.0.1 until a better versioning scheme that works with maven dependencies has been identified.	2012-10-16 02:22:30 -04:00
kshakir	213cc00abe	Refactored argument matching to support other plugins in addition to file lists. Added plugin support for sending Queue status messages. Argument parsing can store subclasses of java.io.File, for example RemoteFile.	2012-10-15 15:10:45 -04:00
Mauricio Carneiro	80d92e0c63	Allowing the GATK to have non-required outputs Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation). This change enables the nWayOut parameter for the IndeRealigner and ReduceReads to operate optionally while maintaining the original single way out. [#DEV-10 transition:31 resolution:1]	2012-10-15 13:49:08 -04:00
Ryan Poplin	25be94fbb8	Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10.	2012-10-15 13:24:32 -04:00
Ami Levy Moonshine	0d93effa4d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-15 11:19:12 -04:00
Mark DePristo	57e231610b	New framework for EXACT calculations, with new 3 new implementations -- Before this branch, the EXACT calculation implementation was largely based on historical choices in the UnifiedGenotyper. The code was badly organized, there were no unit tests, and the Diploid EXACT calculation was super slow O(n.samples ^ n.alt.alleles) -- Reorganized code into a single class AFCalc superclass that carries out the calculation and an AFCalcResult object that contains only the information we should expose to code users, and is well-validated. -- Implement a new model for the multi-allelic exact calculation that sweeps for each alt allele B all likelihoods into a bi-allelic model XB where X is all alleles != B, and calls these all separately using the reference bi-allelic model. It produces identical quals for the bi-allelic case but slightly different results for multi-allelics due to a genuine model difference in that this Independent model doesn't penalize fully all genotype configurations as occurs in the Reference multi-allelic implementation. However, it seems after much debate that the reference model is doing the wrong thing, so in fact the Independent model seems correct. This code isn't the default implementation yet, simply because I want to do some cleanup and discuss with the methods group before enabling. -- Constrained search model implemented, but will be deleted in a subsequent code cleanup -- Massive (40K) suite of unit tests the exact models, which are passing for the reference and the independent alleles exact model. -- Restored -- but isn't 100% hooked up -- the original clean bi-allelic model for Ryan to pass his optimized logless version on. -- The only way to create these AFCalc objects is through an AFCalcFactory, which again validates its arguments. The AFCalcFactory.Calculation enum exposes calculations to the UG / HC as the AFModel. -- Separated AFCalc from UG, into its own package that could in principle be pushed into utils now -- Created a simple main[] function to run performance tests of the EXACT model.	2012-10-15 08:32:32 -04:00
Mark DePristo	dcf8af42a8	Finalizing IndependentAllelesDiploidExactAFCalc -- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors -- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef. Not ideal. Finally caught by some tests, but good god it almost made it into the code -- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0 -- Massive new suite of unit tests to ensure that bi-allelic and tri-allele events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly. ID'd some of the bugs below -- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests -- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when no there we no informative GLs.	2012-10-15 08:21:03 -04:00
Mark DePristo	1ac09ca81e	More bugfixes on the way to a final push with new Exact model framework -- UnifiedGenotyperEngine uses only the alleles used in genotyping, not the original alleles, when considering which alleles to include in output -- AFCalcFactory has a more informative info message when looking for and selecting an exact model to use in genotyping	2012-10-15 07:53:57 -04:00
Mark DePristo	6b639f51f0	Finalizing new exact model and tests -- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior. -- Tests that theta^n.alt.alleles is being applied correctly -- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java -- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine	2012-10-15 07:53:57 -04:00
Mark DePristo	cb857d1640	AFCalcs must be made by factory method now -- AFCalcFactory is the only way to make AFCalcs now. There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions. The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles. -- AFCalc no longer has UAC constructor -- only AFCalcFactory does. Code cleanup throughout -- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next) -- It's now possible to run the UG / HC with any of the exact models currently in the system. -- Code cleanup throughout the system, reorganizing the unit tests in particular	2012-10-15 07:53:56 -04:00
Mark DePristo	6bbe750e03	Continuing work on IndependentAllelesDiploidExactAFCalc -- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly. A long way towards the right answer now, but still not there -- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan -- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there -- New convenience method in VariantContext that looks up the allele index in the alleles	2012-10-15 07:53:56 -04:00
Mark DePristo	176b74095d	Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation -- Still not work, but I know what's wrong -- Many tests disabled, that need to be reanabled	2012-10-15 07:53:56 -04:00
Mark DePristo	91aeddeb5a	Steps on the way to a fully described and semantically meaningful AFCalcResult -- AFCalcResult now sports a isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold -- Lots of contracts for AFCalcResult -- Slowly killing off AFCalcResultsTracker -- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele -- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system -- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.	2012-10-15 07:53:56 -04:00
Mark DePristo	4f1b1c4228	Intermediate commit II on simplifying AFCalcResult -- All of the code now uses the AFCalc object, not the not package protected AFCalcResultTracker. Nearly all unit tests pass (expect for a contract failing one that will be dealt with in subsequent commit), due to -Infinity values from normalizeLog10. -- Changed the way that UnifiedGenotyper decides if the best model is non-ref. Previously looked at the MAP AC, but the MAP AC values are no longer provided by AFCalcResult. This is on purpose, because the MAP isn't a meaningful quantity for the exact model (i.e., everything is going to go to MLE AC in some upcoming commit). If you want to understand why come talk to me. Now uses the isPolymorphic function and the EMIT confidence, so that if pNonRef > EMIT then the site is poly, otherwise it's mono.	2012-10-15 07:53:56 -04:00
Mark DePristo	06687bfaf6	Intermediate commit on simplifying AFCalcResult -- Renamed old class AFCalcResultTracker. This object is now allocated by the AFCalc itself, since it is heavy-weight and was badly optimized in the UG with a thread-local variable. Now, since there's already a AFCalc thread-local there, we get that optimization for free. -- Removed the interface to provide the AFCalcResultTracker to getlog10PNonRef. -- Wrote new, clean but unused AFCalcResult object that will soon replace the tracker as the external interface to the AFCalc model results, leaving the tracker as an internal tracker structure. This will allow me to (1) finally test things exhaustively, as the contracts on this class are clear (2) finalize the IndependentAllelesDiploidExactAFCalc class as it can work with a meaningfully defined result across each object	2012-10-15 07:53:56 -04:00
Mark DePristo	c82aa01e0e	Generalize testing infrastructure to allow us to run specific n.samples calculation	2012-10-15 07:53:55 -04:00
Mark DePristo	ec935f76f6	Initial implementation and tests for IndependentAllelesDiploidExactAFCalc -- This model separates each of N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic case, and runs the exact model on each independently to handle the multi-allelic case. This is very fast, scaling at O(n.alt.alleles x n.samples) -- Many outstanding TODOs in order to truly pass unit tests -- Added proper unit tests for the pNonRef calculation, which all of the models pass	2012-10-15 07:53:55 -04:00
Mark DePristo	ee2f12e2ac	Simpler naming convention for AlleleFrequencyCalculation => AFCalc	2012-10-15 07:53:55 -04:00
Mark DePristo	cf3f9d6ee8	Reorganize and cleanup AFCalculations -- Now contained in a package called afcalc -- Extracted standard alone classes from private static classes in ExactAF -- Most fields are now private, with accessors -- Overall cleaner organization now	2012-10-15 07:53:55 -04:00
Mark DePristo	13211231c7	Restructure and cleanup ExactAFCalculations -- Now there's no duplication between exact old and constrained models. The behavior is controlled by an overloaded abstract function -- No more static function to access the linear exact model -- you have to create the surrounding class. Updated code in the system -- Everything passes unit tests	2012-10-15 07:53:54 -04:00
Mark DePristo	f800f3fb88	Optimized diploid exact AF calculation uses maxACs to stop the calculation by maxAC by allele -- Added unit tests to ensure the approximation isn't so far from our reference implementation (DiploidExactAFCalculation)	2012-10-15 07:53:54 -04:00
Mark DePristo	efad215edb	Greedy version of function to compute the max achievable AC for each alt allele -- walks over the genotypes in VC, and computes for each alt allele the maximum AC we need to consider in that alt allele dimension. Does the calculation based on the PLs in each genotype g, choosing to update the max AC for the alt alleles corresponding to that PL. Only takes the first lowest PL, if there are multiple genotype configurations with the same PL value. It takes values in the order of the alt alleles.	2012-10-15 07:53:54 -04:00
Mark DePristo	7666a58773	Function to compute the max achievable AC for each alt allele -- Additional minor cleanup of ExactAFCalculation	2012-10-15 07:53:53 -04:00
Eric Banks	a8efa5451a	Protect against bad bases users have screwy data (or try to use zipped references)	2012-10-12 15:05:03 -04:00
Eric Banks	81532a0529	Missing file are user errors.	2012-10-12 09:48:12 -04:00
Eric Banks	fa77a83783	Update the out of space error to include another permutation	2012-10-12 09:38:12 -04:00
Eric Banks	85525d9e6e	Make Geraldine's life easier: from now on we treat problems where a temp file cannot be found when running the GATK with multiple threads as User Errors (since they are 99.9% of the time). This is an extremely large class of errors in Tableau and on the forums. Helpful error message tells users exactly what we tell them on the forums anyways (Geraldine: feel free to edit).	2012-10-12 09:19:50 -04:00
Eric Banks	ad60300bee	Catch malformed BAM files at the source since this is the largest class of errors in Tableau.	2012-10-12 09:07:57 -04:00
Eric Banks	593c8065d9	Fix docs for BadMateFilter	2012-10-12 08:35:45 -04:00
Christopher Hartl	6b9987cf1b	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-10-12 00:48:42 -04:00
David Roazen	3861212dab	Fix inefficiency in FilePointer GenomeLoc validation Validation of GenomeLocs in the FilePointer class was extremely inefficient when the GenomeLocs were added one at a time rather than all at once. Appears to mostly fix GSA-604	2012-10-11 19:55:14 -04:00
Ami Levy Moonshine	ef3882f439	PhaseByTransmission: small typo /n. variantCallQC_summaryTablesOnly.R: small changes (more to come) /n GeneralCallingPipeline.scala: the new pipeline script. It is not as clean as I want it to be, but it works. I still going to work on it a little bit more. Also, it does not include yet: (1) the RR step (2) need better eval step (3) need to include other targets (currently it eork on the CEU Trio)	2012-10-11 14:51:41 -04:00
Ryan Poplin	08b8ce6903	Fixing merge conflicts related to the comment formatting in the BQSR.	2012-10-10 16:03:58 -04:00
Ryan Poplin	45717349dc	Fixing BQSR bug reported on the forum for reads that begin with insertions.	2012-10-10 16:01:37 -04:00
Ryan Poplin	2a9ee89c19	Turning on allele trimming for the haplotype caller.	2012-10-10 10:47:26 -04:00
Ryan Poplin	b543bddbb7	Fixing merge conflicts related to the comment formatting in the BQSR.	2012-10-08 10:23:08 -04:00
Ryan Poplin	b3cc04976f	Fixing BQSR bug reported on the forum for reads that being with insertions.	2012-10-08 10:18:29 -04:00
Eric Banks	a5aaa14aaa	Fix for GSA-601: Indels dropped during liftover. This was a true bug that was an effect of the switch over to the non-null representation of alleles in the VariantContext. Unfortunately, this tool didn't have integration tests - but it does now.	2012-10-07 01:19:52 -04:00
Eric Banks	82e40340c0	Use StringBuilder over StringBuffer	2012-10-07 00:02:15 -04:00
Eric Banks	5d6aad67e2	Fix for bug reported on forums: VariantsToTable does not handle lists and nested arrays correctly. Added an integration test to cover printing of PLs.	2012-10-07 00:01:27 -04:00
Eric Banks	e7798ddd2a	Fix for JIRA GSA-598: AD field not handled properly by CombineVariants. It was also not handled by SelectVariants either. We now strip the AD field out whenever combining/selecting makes it invalid due to a changing of the number of ALT alleles.	2012-10-06 23:02:36 -04:00
Eric Banks	bfc551f612	Fix for GSA-589: SelectVariants with -number gives biased results. The implementation was not good and it's not worth keeping this busted code around given that we have a working implementation of a fractional random sampling already in place, so I removed it.	2012-10-06 22:39:49 -04:00
Eric Banks	e8a6460a33	After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too.	2012-10-05 16:37:42 -04:00
Eric Banks	52326942cf	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-05 16:15:07 -04:00
Eric Banks	04853252a0	Possible fix for reduced reads coming from the HaplotypeCaller in the AD	2012-10-05 16:15:04 -04:00
Yossi Farjoun	d419a33ed1	* Added an integration test for AD annotation in the Haplotype caller. * Corrected FS Anotation for UG as for AD. * HC still does not annotate ReducedReads correctly (for FS nor AD)	2012-10-05 15:23:59 -04:00
Yossi Farjoun	dc4dcb4140	fixed AD annotation for a ReducedReads BAM file. Added an integration test for this case with a new reduced BAM in private/testdata	2012-10-05 14:20:07 -04:00
Eric Banks	c66ef17cd0	Add a separate max alt alleles argument for indels that defaults to 2 instead of 3. PLEASE TAKE NOTE.	2012-10-04 13:52:14 -04:00
Mark DePristo	b6e20e083a	Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact -- Will be removed. Only commiting now to fix public -> private dependency	2012-10-03 20:16:38 -07:00
Mark DePristo	51cafa73e6	Removing public -> private dependency	2012-10-03 20:05:03 -07:00
Mark DePristo	f6a2ca6e7f	Fixes / TODOs for meaningful results with AFCalculationResult -- Right now the state of the AFCaclulationResult can be corrupt (ie, log10 likelihoods can be -Infinity). Forced me to disable reasonable contracts. Needs to be thought through -- exactCallsLog should be optional -- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.	2012-10-03 19:55:12 -07:00
Mark DePristo	50e4a832ea	Generalize framework for evaluating the performance and scaling of the ExactAF models to tri-allelic variants -- Wow, big performance problems with multi-allelic exact model!	2012-10-03 19:55:11 -07:00
Mark DePristo	3663fe1555	Framework for evaluating the performance and scaling of the ExactAF models	2012-10-03 19:55:11 -07:00
Mark DePristo	17ca543937	More ExactModel cleanup -- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time. -- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live -- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs	2012-10-03 19:55:11 -07:00
Mark DePristo	f8ef4332de	Count the number of evaluations in AFResult; expand unit tests -- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples -- Added unittests for priors (flat and human) -- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)	2012-10-03 19:55:11 -07:00
Mark DePristo	de941ddbbe	Cleanup Exact model, better unit tests -- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic). More assert statements to ensure quality of the result. -- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts. Made mutation functions all protected -- No longer need to call reset on your AlleleFrequencyCalculationResult -- it'd done for you in the calculation function. reset is a protected method now, so it's all cleaner and nicer this way -- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef	2012-10-03 19:55:11 -07:00
Mark DePristo	3e01a76590	Clean up AlleleFrequencyCalculation classes -- Added a true base class that only does truly common tasks (like manage call logging) -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation -- All unit tests pass	2012-10-03 19:55:11 -07:00
Mark DePristo	1c52db4cdd	Add exactCallsLog output file to ExactModel and StandardCallerArgumentCollection -- This allows us to log all of the information about the exact model call (alleles, priors, PLs, result, and runtime) to a file for later debugging / optimization	2012-10-03 19:55:11 -07:00
Christopher Hartl	ca31ddf2a5	Allow VCFs without PLs to be converted to a bed file with genotypes other than no-call (by setting the minimum GQ to <=0). Performance enhancements to GRM suite.	2012-10-03 21:36:35 -04:00
Christopher Hartl	1be8a88909	Changes: 1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest. 2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested. 3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test. 4) Updates to PlinkBed-handling python scripts and utilities. 5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.	2012-10-03 16:02:42 -04:00
David Roazen	118e974731	GATK Engine: special-case "monolithic" FilePointers, and allow them to represent multiple contigs Sometimes the GATK engine creates a single monolithic FilePointer representing all regions in all BAM files. In such cases, the monolithic FilePointer is the only FilePointer emitted by the BAMScheduler, and it's safe to allow it to contain regions and intervals from multiple contigs. This fixes support for reading unindexed BAM files (since an unindexed BAM is one case in which the engine creates a monolithic FilePointer).	2012-10-02 15:30:03 -04:00
David Roazen	a96ed385df	ReadShard.getReadsSpan(): handle case where shard contains only unmapped mates Nasty, nasty bug -- if we were extremely unlucky with shard boundaries, we might end up with a shard containing only unmapped mates of mapped reads. In this case, ReadShard.getReadsSpan() would not behave correctly, since the shard as a whole would be marked "mapped" (since it refers to mapped intervals) yet consist only of unmapped mates of mapped reads located within those intervals.	2012-10-02 13:50:00 -04:00
David Roazen	ac87ed47bb	BQSR: allow logging recal table updates to a file For testing/debugging purposes only	2012-10-01 14:18:34 -04:00
Christopher Hartl	2508b0f5a7	Merged bug fix from Stable into Unstable	2012-09-29 00:57:43 -04:00
Christopher Hartl	365f1d2429	hmk123's error on the forum came from the reference context occasionally lacking bases needed for validating the reference bases in the variant context. (no @Window for VariantsToBinaryPed). This bugfix adresses this and other minor items: 1) ValidateVariants removed in favor of direct validation VariantContexts. Integration test added to test broken contexts. 2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases. 3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.	2012-09-29 00:55:31 -04:00
David Roazen	e740977994	GATK Engine: do not merge FilePointers that span multiple contigs This affects both the non-experimental and experimental engine paths, and so may break tests, but this is a necessary change.	2012-09-27 18:02:25 -04:00
David Roazen	e82946e5c9	ExperimentalReadShardBalancer: create one monolithic FilePointer per contig Merge all FilePointers for each contig into a single, merged, optimized FilePointer representing all regions to visit in all BAM files for a given contig. This helps us in several ways: -It allows us to create a single, persistent set of iterators for each contig, finally and definitively eliminating all Shard/FilePointer boundary issues for the new experimental ReadWalker downsampling -We no longer need to track low-level file positions in the sharding system (which was no longer possible anyway given the new experimental downsampling system) -We no longer revisit BAM file chunks that we've visited in the past -- all BAM file access is purely sequential -We no longer need to constantly recreate our full chain of read iterators There are also potential dangers: -We hold more BAM index data in memory at once. Given that we merge and optimize the index data during the merge, and only hold one contig's worth of data at a time, this does not appear to be a major issue. TODO: confirm this! -With a huge number of samples and intervals, the FilePointer merge operation might become expensive. With the latest implementation, this does not appear to be an issue even with a huge number of intervals (for one sample, at least), but if it turns out to be a problem for > 1 sample there are things we can do. Still TODO: unit tests for the new FilePointer.union() method	2012-09-27 14:47:54 -04:00
Christopher Hartl	55cdf4f9b7	Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release.	2012-09-27 00:13:32 -04:00
Eric Banks	caa431c367	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-24 21:46:36 -04:00
David Roazen	0b488cce66	ExperimentalReadShardBalancer: close() exhausted iterators Fixes a truly awful SAMReaders resource leak reported by Eric -- thanks Eric!	2012-09-24 14:52:59 -04:00
Mark DePristo	9fd30d6f1c	When writing the initial commit for nt + nct I realized this class was really just a ThreadGroupOutputTracker -- The code is cleaner and the logical more obvious now.	2012-09-24 14:15:36 -04:00
Mark DePristo	3e8d992828	Remove bad error test from MicroScheduler, as it's no longer applicable.	2012-09-24 14:15:36 -04:00
Mark DePristo	a6b3497eac	Fixes GSA-515 Nanoscheduler GSA-577 -nt and -nct together appear to not close resources properly -- Fixes monster bug in the way that traversal engines interacted with the NanoScheduler via the output tracker. -- ThreadLocalOutputTracker is now a ThreadBasedOutputTracker that associates via a map from a master thread -> the storage map. Lookups occur by walking through threads in the same thread group, not just the thread itself (TBD -- should have a map from ThreadGroup instead) -- Removed unnecessary debug statement in GenomeLocParser -- nt and nct officially work together now	2012-09-24 14:15:35 -04:00
Mark DePristo	4749fc114f	Temp. disable -nt > 1 and -nct > 1 while bugs are worked out	2012-09-24 14:15:35 -04:00
Mark DePristo	09bbd2c4c3	Include exception in VCFWriter when one is found when rethrowing as ReviewedStingException	2012-09-24 14:15:35 -04:00
Mark DePristo	10a6b57be6	Fix thread name: should be master executor not input	2012-09-24 14:15:35 -04:00
Eric Banks	9464dfdbf2	Don't penalize the reduced reads for spanning deletions (when surrounding base quals are Q2s)	2012-09-24 14:06:07 -04:00
Eric Banks	1509153b4b	Adding my little walker to assess reduced bam coverage against the original bam because it's turning out to be very useful.	2012-09-23 00:47:40 -04:00
David Roazen	f6a22e5f50	ExperimentalReadShardBalancerUnitTest was being skipped; fixed TestNG skips tests when an exception occurs in a data provider, which is what was happening here. This was due to an AWFUL AWFUL use of a non-final static for ReadShard.MAX_READS. This is fine if you assume only one instance of SAMDataSource, but with multiple tests creating multiple SAMDataSources, and each one overwriting ReadShard.MAX_READS, you have a recipe for problems. As a result of this the test ran fine individually, but not as part of the unit test suite. Quick fix for now to get the tests running -- this "mutable static" interface should really be refactored away though, when I have time.	2012-09-22 01:56:39 -04:00
David Roazen	e077347cc2	Re-allow running the GATK with experimental downsampling It's now possible to run with experimental downsampling enabled using the --enable_experimental_downsampling engine argument. This is scheduled to become the GATK-wide default next week after diff engine output for failing tests has been examined.	2012-09-21 23:20:46 -04:00
David Roazen	34eed20aa6	PerSampleDownsamplingReadsIterator: fix for incorrect use of DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL Notify all downsamplers in our pool of the current global genomic position every DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL position changes, not every single positional change after that threshold is first reached.	2012-09-21 22:43:39 -04:00
David Roazen	133085469f	Experimental, downsampler-friendly read shard balancer -Only used when experimental downsampling is enabled -Persists read iterators across shards, creating a new set only when we've exhausted the current BAM file region(s). This prevents the engine from revisiting regions discarded by the downsamplers / filters, as could happen in the old implementation. -SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip out all related code when the engine fork is collapsed. -Defensive implementation that assumes BAM file regions coming out of the BAM Schedule can overlap; should be able to improve performance if we can prove they cannot possibly overlap. -Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back once confidence in the code is gained	2012-09-21 22:17:58 -04:00
Guillermo del Angel	ab8fa8f359	Bug fix: AlleleCount stratification in VariantEval didn't support higher ploidy and was producing bad tables	2012-09-21 20:48:12 -04:00
Mark DePristo	b5fa848255	Fix GSA-515 Nanoscheduler GSA-573 -nt and -nct interact badly w.r.t. output -- See https://jira.broadinstitute.org/browse/GSA-573 -- Uses InheritedThreadLocal storage so that children threads created by the NanoScheduler see the parent stubs in the main thread. -- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.	2012-09-20 18:45:16 -04:00
Mark DePristo	ba9e95a8fe	Revert "Reorganized NanoScheduler so that main thread does the reduces" Doesn't actually fix the problem, and adds an unnecessary delay in closing down NanoScheduler, so reverting. This reverts commit 66b820bf94ae755a8a0c71ea16f4cae56fd3e852.	2012-09-20 18:45:15 -04:00
Mark DePristo	7425ab9637	Reorganized NanoScheduler so that main thread does the reduces -- Enables us to run -nt 2 -nct 2 and get meaningful output -- Uses a sleep / poll mechanism. Not ideal -- will look into wait / notify instead.	2012-09-20 18:45:15 -04:00
Eric Banks	747694f7c2	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-20 14:14:58 -04:00
Eric Banks	1316b579f0	Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table.	2012-09-20 14:14:34 -04:00
Christopher Hartl	c492185be6	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-09-20 12:56:07 -04:00
Christopher Hartl	d25579deeb	A couple of minor things. 1) Better documentation on the meta data file for VariantsToBinaryPed with examples of each file type 2) MannWhitneyU can now take an argument on creation to turn off dithering. This pertains to JIRA-GSA-571 but does not fix it, as it isn't hooked up to the command line. Next step is to add an argument to the command line where it's accessible to the annotation classes (e.g. from either UG or the VariantAnnotator). 3) Added some dumb python scripts to deal with Plink files, and a script to convert plink binaries to VCF to help sanity check. Basically if you want to do an analysis on genotype data stored in plink binary format, your choices are: 1) Add a new module to Plink [difficulty rating: Impossible -- code obfuscation] 2) Steal plink parsing code from software (Plink/PlinkSeq/GCTA/Emacks/etc) that readds the files [difficulty rating: Oppressive -- code not modularized at all) 3) Write your own dumb stuff [difficutly rating: Annoying] What's been added is the result of 3. It's a library so nobody else has to do this, so long as they're comfortable with python.	2012-09-20 12:48:13 -04:00
Eric Banks	4b7edc72d1	Fixing edge case bug in the Exact model (both standard and generalized) where we could abort prematurely in the special case of multiple polymorphic alleles and samples with widely different depths of coverage (e.g. exome and low-pass). In these cases it was possible to call the site bi-allelic when in fact it was multi-allelic (but it wouldn't cause it to create a monomorphic call).	2012-09-20 10:59:42 -04:00
Ryan Poplin	ccb65a03e8	sorry, non-ASCII characters annoy some computers.	2012-09-20 10:14:48 -04:00
Mark DePristo	087247f1f0	Allow longs and doubles in recalibration report to allow some backward compatibility	2012-09-19 19:23:44 -04:00
Mark DePristo	2267b722b2	Proper error handling in NanoScheduler -- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker -- ErrorTracker is now used throughout the NanoScheduler. In order to properly handle errors, the work previously done by main thread (submit jobs, block on reduce) is now handled in a separate thread. The main thread simply wakes up peroidically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately. -- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers -- Added NanoSchedulerUnitTest for input errors -- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option. This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results. -- Build.xml no longer copies the unittest results verbosely	2012-09-19 17:03:13 -04:00
Mark DePristo	773af05980	Intermediate commit for proper error handling in the NanoScheduler -- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler -- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled) -- Added unit tests for failing input iterator in NanoScheduler (fails) -- Made ErrorThrowing NanoScheduable	2012-09-19 17:03:13 -04:00
Mark DePristo	d2046b67b1	Remove problematic @Ensures from InputProducer. -- We need to figure out why CoFoJa is broken in the NanoScheduler	2012-09-19 17:03:13 -04:00
Mark DePristo	33fabb8180	Final V3 version of NanoScheduler -- Fixed basic bugs in tracking of input -> map -> reduce jobs -- Simplified classes -- Expanded unit tests	2012-09-19 17:03:12 -04:00
Mark DePristo	5734d756b5	Remove problematic @Invariant from EOFMarkedValue	2012-09-19 17:03:12 -04:00
Mark DePristo	aa9a1e8122	Warn GATK user if the number of requested threads > available processors on the machine	2012-09-19 17:03:12 -04:00
Mark DePristo	76027d17e6	Add a few more UnitTests for InputProducer -- Cleaned up function calls for clarity	2012-09-19 17:03:12 -04:00
Mark DePristo	7605c6bcc4	Done GSA-515 Nanoscheduler / GSA-557 V3 nanoScheduler algorithm -- V3 + V4 algorithm for NanoScheduler. The newer version uses 1 dedicated input thread and n - 1 map/reduce threads. These MapReduceJobs perform map and a greedy reduce. The main thread's only job is to shuttle inputs from the input producer thread, enqueueing MapReduce jobs for each one. We manage the number of map jobs now via a Semaphore instead of a BlockingQueue of fixed size. -- This new algorithm should consume N00% CPU power for -nct N value. -- Also a cleaner implementation in general -- Vastly expanded unit tests -- Deleted FutureValue and ReduceThread	2012-09-19 17:03:12 -04:00
Mark DePristo	69e418c3f5	Intermediate commit for v3 NanoScheduling algorithm -- This version works but it blocks much more than I'd expect on input. Merging v2 and v3 to make v4 now	2012-09-19 17:03:12 -04:00
Ryan Poplin	7a7103a757	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-19 10:39:18 -04:00
Ryan Poplin	0ea543e1fd	Removing testing scaffolding from delocalized BQSR. The output recal table reports the data as doubles instead of integers. This changes the mapping-based BQSR integration tests. Final intermediate push before delocalized BQSR replaces previous BQSR.	2012-09-19 10:39:06 -04:00
Ami Levy Moonshine	ccc3f4ff8d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-17 09:58:27 -04:00
Ami Levy Moonshine	ee0b17d98f	typo in VE	2012-09-17 09:51:51 -04:00
Eric Banks	86be50f18d	Add note to docs that the --list argument requires full command-line	2012-09-14 10:58:44 -04:00
Eric Banks	0206e09a6a	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-12 15:18:27 -04:00
Eric Banks	d94d0d15c2	Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout.	2012-09-12 15:15:40 -04:00
Eric Banks	4bb7a99f08	Given that all classes implementing output stubs already have getters for the underlying OutputStream and File, it makes sense to unify that functionality into the Stub interface. Now it is possible to have an Engine utility method that iterates over all registered stubs to find the one representing a given OutputStream and return the File associated with it.	2012-09-12 11:51:44 -04:00
Eric Banks	994a4ff387	Track all outputs from BQSR (.table, .csv., and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings).	2012-09-12 11:24:53 -04:00
Christopher Hartl	96be1cbea9	My own integration test isn't passing with a clean checkout. This fix to the walker ought to do it.	2012-09-12 10:11:06 -04:00
Christopher Hartl	546586b70e	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-12 10:09:42 -04:00
Mark DePristo	bfbf1686cd	Fixed nasty bug with defaulting to diploid no-call genotypes -- For the pooled caller we were writing diploid no-calls even when other samples were haploid. Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing. -- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all. -- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)	2012-09-12 07:08:03 -04:00
Mark DePristo	d1ba17df5d	Fixed nasty bug in BCF2 writer for case where all genotypes are missing -- Previous code was looking for a -1 result from maxPloidy() but the result as actually 0, so instead of writing a diploid no call we were actually writing "unavailable" genotypes, and failing the BCF == VCF test in integration tests. Fixed.	2012-09-12 06:46:27 -04:00
Mark DePristo	91f3204534	VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself -- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere. -- Removed addMissingSamples from VariantcontextUtils, and calls to it -- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples -- Added unit tests for this behavior in VariantContextWritersUnitTest	2012-09-12 06:46:26 -04:00
Christopher Hartl	5d19fca649	A couple of bug-fixy changes. 1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:") ones if the user requested a sample that wasn't present in the VCF. The walker now checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning, and does the appropriate subsetting. Added integration tests for this. 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).	2012-09-11 23:01:00 -04:00
Mark DePristo	e25e617d1a	Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency -- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use. So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.	2012-09-11 07:38:34 -04:00
Mark DePristo	d6e42d839c	Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly.	2012-09-10 20:14:14 -04:00
Mark DePristo	641c6a361e	Fix nasty memory leak in new data thread x cpu thread parallelism -- Basically you cannot safely use instance specific ThreadLocal variables, as these cannot be safely cleaned up. The old implementation kept pointers to old writers, with huge tribble block indexes, and eventually we crashed out of integration tests -- See http://weblogs.java.net/blog/jjviana/archive/2010/06/10/threadlocal-thread-pool-bad-idea-or-dealing-apparent-glassfish-memor for more information -- New implementation uses a borrow/return schedule with a list of N TraversalEngines managed by the MicroScheduler directly.	2012-09-10 20:14:14 -04:00
Mark DePristo	195cf6df7e	Attempting to fix out of memory errors with new traversal engine creator	2012-09-10 20:14:14 -04:00
Mark DePristo	f713d400e2	Fixed GSA-515 Nanoscheduler GSA-555 / Make NT and NCT work together -- Can now say -nt 4 and -nct 4 to get 16 threads running for you! -- TraversalEngines are now ThreadLocal variables in the MicroScheduler. -- Misc. code cleanup, final variables, some contracts.	2012-09-10 20:14:14 -04:00
Mark DePristo	233f70f8ba	Final cleanup of TraversalProgressMeters, moved to utils.progressmeter -- TraversalProgressMeter now completely generalized, named ProgressMeter in utils.progressmeter. Now just takes "nRecordsProcessed" as an argument to print reads. Completely removes dependence on complex data structures from TraversalProgressMeter. Can be used to measure progress on any task with processing units in genomic locations. -- a fairly simple, class with no dependency on GATK engine or other features. -- Currently only used by the TraversalEngine / MicroScheduler but could be used for any purpose now, really.	2012-09-10 20:14:14 -04:00
Mark DePristo	2e94a0a201	Refactor TraversalEngine to extract the progress meter functions -- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance. The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves. -- Because the progress metering code has horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class. I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time. It should be possible to write some nice tests for it now. By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before -- Cleaned up the start up of the progress meter. It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer -- Simplified and made clear the interface for shutting down the traversal engines. There's no a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversing in over. Nano traversals now properly shut down (was subtle bug I undercovered here). The printing of on traversal done metering is now handled by MicroScheduler -- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N). -- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome. Useful for progress meter but also probably for other infrastructure as well -- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful. The generic bean is just a shell interface with nothing in it. -- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.	2012-09-10 20:14:13 -04:00
David Roazen	d2f3d6d22f	Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)" This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.	2012-09-10 15:52:39 -04:00
Menachem Fromer	0b717e2e2e	Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)	2012-09-10 15:32:41 -04:00
Eric Banks	d7499e0642	Updating the rank sum test documentation	2012-09-09 22:17:36 -04:00
Eric Banks	8ca205f1a9	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-07 14:26:06 -04:00
Eric Banks	b1677fc719	Fixed JIRA GSA-520 for Guillermo: when intervals with zero coverage were present, DiagnoseTargets was trying to merge them with the next interval (even if non-overlapping) which would cause problems later on when it checked to make sure that intervals were strictly overlapping.	2012-09-07 14:25:57 -04:00
Geraldine Van der Auwera	3f2a4379af	Added forum API version stub to base URL for posting GATKDocs This will prevent bugs from occurring when Vanilla make changes to the API as described here: http://vanillaforums.com/blog/api#configuration Based on the bug that broke the website Guide section on 9/6/12, the GATKDocs posting system will probably break in the next release if this is not applied as a bug fix.	2012-09-07 11:49:02 -04:00
Eric Banks	ed3d9b050f	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-07 11:45:09 -04:00
Eric Banks	3dc248a49d	Adding another test	2012-09-07 11:41:38 -04:00
Ryan Poplin	81b27f9db2	auto-merging to latest version	2012-09-07 11:36:47 -04:00
Eric Banks	41a8a304a0	Catch masked OutOfMemory errors as User Errors	2012-09-07 11:27:00 -04:00
Mark DePristo	d62eca5d92	Update GATKPerformanceOverTime to measure -nt and -nct	2012-09-07 10:47:29 -04:00
Mark DePristo	bf87de8a25	UnitTests for ReducerThread and InputProducer -- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order	2012-09-07 09:51:32 -04:00
Mark DePristo	c503884958	GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper -- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry. -- Restructured the NS code for clarity. Docs everywhere. -- This is considered version 1.0	2012-09-07 09:15:16 -04:00
Mark DePristo	9d12935986	Intermediate commit for new hyper parallel NanoScheduler -- There's a logic bug now but I'll go to squash it...	2012-09-07 09:15:16 -04:00
Eric Banks	576c7280d9	Extensions to the ErrorThrowing framework for testing purposes	2012-09-06 22:03:18 -04:00
David Roazen	cb84a6473f	Downsampling: experimental engine integration -Off by default; engine fork isolates new code paths from old code paths, so no integration tests change yet -Experimental implementation is currently BROKEN due to a serious issue involving file spans. No one can/should use the experimental features until I've patched this issue. -There are temporarily two independent versions of LocusIteratorByState. Anyone changing one version should port the change to the other (if possible), and anyone adding unit tests for one version should add the same unit tests for the other (again, if possible). This situation will hopefully be extremely temporary, and last only until the experimental implementation is proven.	2012-09-06 15:03:27 -04:00
Eric Banks	6df6c1abd5	Fix for PBT to stop NPE when there are no likelihoods present	2012-09-06 13:14:18 -04:00
Mark DePristo	1b064805ed	Renaming -cnt to -nct for consistency	2012-09-05 21:13:19 -04:00
Mark DePristo	574a8f710b	Add static boolean controlled output of individual map call timing to nanoSecond resolution	2012-09-05 17:40:02 -04:00
Mark DePristo	e11915aa0a	GSA-515 Nanoscheduler GSA-550 ThreadSafeMapReduce shouldn't be super interface of TreeReducible	2012-09-05 17:37:56 -04:00
Mark DePristo	c5f1ceaa95	All read and loci traversals go through NanoScheduler now -- The NanoScheduler is doing a good job at tracking important information like time spent in map/reduce/input etc. -- Can be disabled with static boolean in MicroScheduler if we have problems -- See GSA-515 Nanoscheduler GSA-549 Retire TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine	2012-09-05 16:38:21 -04:00
Mark DePristo	dddf148a59	Fixed bug in ThreadAllocation getTotalNumberOfThreads -- It isnt data + cpu its data * cpu threads.	2012-09-05 16:35:32 -04:00
Mark DePristo	9bf1d138d9	New GATK argument interface for data and cpu threads -- Closes GSA-515 Nanoscheduler GSA-542 Good interface to nanoScheduler -- Old -nt means dataThreads -- New -cnt (--num_cpu_threads_per_data_thread) gives you n cpu threads for each data thread in the system -- Cleanup logic for handling data and cpu threading in HMS, LMS, and MS -- GATKRunReport reports the total number of threads in use by the GATK, not just the nt value -- Removed the io,cpu tags for nt. Stupid system if you ask me. Cleaned up the GenomeAnalysisEngine and ThreadAllocation handling to be totally straightforward now	2012-09-05 15:45:24 -04:00
Mark DePristo	1e55475adc	NanoScheduler uses ExecutorService to run input reader thread	2012-09-05 15:45:24 -04:00
Mark DePristo	71d9ebcb0d	Fix bug (introduced by me) that didn't include contig in progress meter	2012-09-05 15:45:24 -04:00
Mark DePristo	c822b7c760	Fix long-standing NPE in LMS due to inappropriate timing of initialization	2012-09-05 15:45:24 -04:00
Mark DePristo	a997c99806	Initial NanoScheduler with input producer thread	2012-09-05 15:45:24 -04:00
Mark DePristo	03dd470ec1	Test for progressFunction in NanoScheduler; bugfix for single threaded fast path	2012-09-05 15:45:23 -04:00
Mark DePristo	8cdeb51b78	Cleanup printProgress in TraversalEngine -- Separate updating cumulative traversal metrics from printing progress. There's now an updateCumulativeMetrics function and a printProgress() that only takes a current position -- printProgress now soles relies on the time since the last progress to decide if it will print or not. No longer uses the number of cycles, since this isn't reliable in the case of nano scheduling -- GenomeAnalysisEngine now maintains a pointer to the master cumulative metrics. getCumulativeMetrics never returns null, which was handled in some parts of the code but not others. -- Update all of the traversals to use the new updateCumulativeMetrics, printProgress model -- Added progress callback to nano scheduler. Every bufferSize elements this callback is invoked, allowing us to smoothly update the progress meter in the NanoScheduler -- Rename MapFunction to NanoSchedulerMap and the same for reduce.	2012-09-05 15:45:23 -04:00
Mark DePristo	d503ed97ab	Mark I NanoScheduling TraverseLoci -- Refactored TraverseLoci into old linear version and nano scheduling version -- Temp. GATK argument to say how many nano threads to use -- Can efficiently scale to 3 threads before blocking on input	2012-09-05 15:45:23 -04:00
Mark DePristo	757e6a0160	Making Pileup thread-safe -- Old version relied on out printstream magically sorting output, new version puts the print in reduce	2012-09-05 15:45:23 -04:00
Mark DePristo	d7105223fe	More debugging output for NanoScheduler when debugging is enabled	2012-09-05 15:45:23 -04:00
Mark DePristo	9823102c0c	TraverseReadsNano supports walker.filter and walker.done -- Instead of returning directly the result of map(), returns a MapResult object with the value and a reduceMe flag. -- Reduce function respects the reduceMe flag -- Code cleanup and more documentation	2012-09-05 15:45:23 -04:00
Mark DePristo	1a8f5fc374	Trivial cleanup of NanoScheduler	2012-09-05 15:45:23 -04:00
Mark DePristo	6a5a70cdf1	Done GSA-539: SimpleTimer should use System.nanoTime for nanoSecond resolution	2012-09-05 15:45:23 -04:00
Mark DePristo	59109d5eeb	NanoScheduler tracks time outside of its execute call	2012-09-05 15:45:23 -04:00
Mark DePristo	800a27c3a7	NanoScheduler tracks time within input, map, and reduce -- Helpful for understanding where the time goes to each bit of the code. -- Controlled by a local static boolean, to avoid the potential overhead in general	2012-09-05 15:45:23 -04:00
Mark DePristo	7087b22ea3	No debugging output (even conditional) for ReadTransformers in PrintReads	2012-09-05 15:45:23 -04:00
Mark DePristo	e01258b261	NanoScheduler now supports printProgress. Bugfixes to printProgress -- TraverseReadsNano prints progress at the end of each traversal unit -- Fix bugs in TraversalEngine printProgress -- Synchronize the method so we don't get multiple logged outputs when two or more HMSs call printProgress before initialization at the start! -- Fix the logic for mustPrint, which actually had the logic of mustNotPrint. Now we see the done log line that was always supposed to be there -- Fix output formatting, as the done() line was incorrectly shifting over the % complete by 1 char as 100.0% didn't fit in %4.1f -- Add clearer doc on -PF argument so that people know that the performance log can be generated to standard out if one wants	2012-09-05 15:45:23 -04:00
Mark DePristo	6055101df8	NanoScheduler no longer groups inputs, each map() call is interlaced now -- Maximizes the efficiency of the threads -- Simplifies interface (yea!) -- Reduces number of combinatorial tests that need to be performed	2012-09-05 15:45:22 -04:00
Yossi Farjoun	ad5fa449e7	fixed a typo in the string comment	2012-09-05 14:46:10 -04:00
Ryan Poplin	84a83fd3f3	fixing typo	2012-09-05 10:41:03 -04:00
Eric Banks	fc06f39411	Fixed docs for Pileup walker	2012-09-05 09:55:34 -04:00
Christopher Hartl	d795437202	- New UserExceptions added for when ReadFilters or Walkers specified on the command line are not found. When -rf xxxx cannot find the class corresponding to xxxx, all read filters are printed in a better formatted way, with links to their gatk docs. - VariantAnnotatorEngine changed to call genotype annotations even if pilups and allele -> likelihood mappings are not present. Current genotype annotations altered to check for null pilupes and null mappings.	2012-09-04 16:41:44 -04:00
Ryan Poplin	9cc1a9931b	Resolving merge conflicts.	2012-09-04 10:47:38 -04:00
Ryan Poplin	c9944d81ef	Skip array needs to also be used in the updateDataForRead function of the delocalized BQSR.	2012-09-04 10:33:37 -04:00
Mark DePristo	c9ea213c9b	Make BaseRecalibration thread-safe -- In the process uncovered two strange things 1 -- qualityScoreByFullCovariateKey was created but never used. Seems like a cache? 2 -- Discovered nasty bug in BaseRecalibrator: https://jira.broadinstitute.org/browse/GSA-534	2012-08-31 13:42:42 -04:00
Mark DePristo	27ddebee53	Protect PrintReads from strange state from TraverseReadsUnitTests	2012-08-31 13:42:41 -04:00
Mark DePristo	e028901d54	Fixed bad contract in ReadTransformer	2012-08-31 13:42:41 -04:00
Mark DePristo	cf91d894e4	Fix build problems with tests	2012-08-31 13:42:41 -04:00
Mark DePristo	817ece37a2	General infrastructure for ReadTransformers -- These are like read filters but can be applied either on input, on output, of handled by the walker -- Previous example of BAQ now uses the general framework -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties! Yeah! -- BQSR now uses this framework. We can now do BQSR on input, on output, or within a walker -- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR -- Currently BQSR is excepting in parallel, which subsequent commit with fix -- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure -- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry -- Many files touched simply due to the refactoring and renaming of classes	2012-08-31 13:42:41 -04:00
Ryan Poplin	ff6ebbf3fd	Resolving merge conflicts.	2012-08-31 11:25:55 -04:00
Mark DePristo	2f749b5e52	Added ThreadSafeMapReduce interface, super of TreeReducible -- A higher level interface to declare parallelism capability of a walker. This interface means that the walker can be multi-threaded, but doesn't necessarily support TreeReducible interface, which forces you to have a combine ReduceType operation that isn't appropriate for parallel read walkers -- Updated ReadWalkers to implement ThreadSafeMapReduce not TreeReducible	2012-08-30 19:41:49 -04:00
Mark DePristo	544740d45d	tasking for n threads should give you n threads in NanoScheduler, not n - 1	2012-08-30 19:41:49 -04:00
Mark DePristo	7a462399ce	Fix GSA-529: Fix RODs for parallel read walkers -- TraverseReadsNano modified to read in all input data before invoking maps, so the input to TraverseReadsNano is a MapData object holding the sam record, the ref context, and the refmetadatatracker. -- Update ValidateRODForReads to be tree reducible, using synchronized map and explicitly sort the output map from locations -> counts in onTraversalDone -- Expanded integration tests to test nt 1, 2, 4.	2012-08-30 19:41:49 -04:00
Mark DePristo	7d95176539	Bugfix to compareTo and equals in GenomeLoc -- Yes, GenomeLoc.compareTo was broken. The compareTo function only considered the contig and start position, but not the stop, when comparing genome locs. -- Updated GenomeLoc.compareTo function to account for stop. Updated GATK code where necessary to fix resulting problems that depended on this. -- Added unit tests to ensure that hashcode, equals, and compareTo are all correct for GenomeLocs	2012-08-30 19:41:49 -04:00
Mark DePristo	5a9610d875	ReadShards now default to 10K (up from 1K) reads per samFile up to 250K -- This should help make the inputs for parallel read walkers a little meater, and avoid spinning the shard creation infrastructure so often	2012-08-30 19:41:49 -04:00
Christopher Hartl	5a142fe265	After dicussion with Ryan/Eric, the Structural_Indel variant type is now gone, and has been entirely replaced with the access pattern .isStructuralIndel(). This makes it a strict subtype of indel. I agree that this method is a bit more sensible. In addition, fix for GSA-310. If supplied -rf argument does not match a known read filter, the list of read filters will be printed, and users directed to the documentation for more information.	2012-08-30 17:57:31 -04:00
Mark DePristo	82b2845b9f	Fix: GSA-531 ApplyRecalibration writing to BCF: java.lang.String cannot be cast to java.lang.Double -- LOD must be added a double to attributes, not as string, so that it can be written out as BCF	2012-08-30 16:59:57 -04:00
Ryan Poplin	7b366d4049	misc cleanup in active region traversal.	2012-08-30 11:01:01 -04:00
Mark DePristo	ce3d1f89ea	ReadShard are no longer allowed to span multiple contigs -- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads. The old implementation simply failed in this case. The new code handles this correctly by forcing shards to have all of their data on a single contig. -- Added a PrintReads integration test to ensure this behavior is correct -- Adding test BAMs that have < 200 reads and span across contig boundaries	2012-08-30 10:15:11 -04:00
Mark DePristo	53376b9423	Part III of GSA-462: Consistent RODBinding access across Ref and Read trackers -- shardSpan is only calculated when there some ROD is live in the GATK. No sense in paying the cost per read when you don't need it -- Update contract to allow null span or unmapped span (good catch unittests!)	2012-08-30 10:15:10 -04:00
Mark DePristo	1200848bbf	Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers -- Deleted ReadMetaDataTracker -- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view	2012-08-30 10:15:10 -04:00
Mark DePristo	972be8b4a4	Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers -- ReadMetaDataTracker is dead! Long live the RefMetaDataTracker. Read walkers will soon just take RefMetaDataTracker objects. In this commit they take a class that trivially extends them -- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class. -- This new implementation produces thread-safe objects (i.e., holds no points to shared state). Suitable for use (to be tested) with nano scheduling -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely. -- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView -- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference. Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface. If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself. -- This commit breaks the new read walker BQSR, but Ryan knows this is coming -- Subsequent commit will be retiring / fixing ValidateRODForReads	2012-08-30 10:15:10 -04:00
Mark DePristo	8fc6a0a68b	Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker	2012-08-30 10:13:06 -04:00
Ryan Poplin	b85ded8389	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-30 10:11:48 -04:00
Ryan Poplin	57d997f06f	Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start	2012-08-30 10:10:43 -04:00
Ryan Poplin	f9bab37015	Merged bug fix from Stable into Unstable	2012-08-30 09:21:24 -04:00
Ryan Poplin	eb63221875	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-08-30 09:19:35 -04:00
Ryan Poplin	81d5eca975	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-30 09:10:56 -04:00
Ryan Poplin	35baf0b155	This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test.	2012-08-30 09:07:36 -04:00
Eric Banks	1acf0f0b2c	Fixing bug in fasta .fai generation: trim the contig names to the first whitespace if one appears. We now generate indexes identical to samtools.	2012-08-29 22:36:27 -04:00
Eric Banks	4d38befe86	Merged bug fix from Stable into Unstable	2012-08-29 15:13:56 -04:00
Eric Banks	150a969279	Be careful with String manipulation when constructing alleles in SomaticIndelDetector	2012-08-29 15:13:28 -04:00
Eric Banks	ce55ba98f4	Don't try to left align indels in unmapped reads (which for some reason can still have CIGARs) because the ref context is null.	2012-08-29 15:01:11 -04:00
Ryan Poplin	4ea38bbfe8	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-29 11:39:30 -04:00
Mauricio Carneiro	69b56e11c8	ReadClipper won't modify the original read Reverting back to the original implementation, but now including write N's and write Q0's due to walkers that look at the same read multiple times in different reference windows	2012-08-29 11:33:19 -04:00
Ryan Poplin	e12ae65d33	Changing the commenting style in the BQSR	2012-08-29 11:27:45 -04:00
Ryan Poplin	18eca3544e	Initial commit of the delocalized BQSR written as a read walker.	2012-08-28 15:24:20 -04:00
Eric Banks	e74c527d47	Register the depricated walkers as depricated starting in v2.2 so that users get a helpful error message	2012-08-28 10:19:18 -04:00
Eric Banks	67d348a31d	Retiring the alignment walkers and related integration test since we don't want to support them anymore.	2012-08-28 10:16:49 -04:00
Mark DePristo	2996693c9f	FisherStrand now computed with and without filtering low-qual bases, and least significant pvalue is kept -- Old way (filtering for Q > 17 bases) resulted in biased FS when the site was good but there was a systematic shift in the QUAL of REF and ALT between strands of the reads (sometimes happens) -- New way (taking all bases) was consistent with BaseQualRankSum and other tests, but there can be a lot of low qual reference bases on one strand in some techs (ION/PROTON/PACBIO) because of the preference for introducing an indel vs. a mismatch. -- This implementation allows us to have our cake and eat it to by computing both p-values, and taking the maximum one (i.e., least significant). -- No integration tests updated yet -- still exploring the consequences of this change	2012-08-28 08:06:47 -04:00
Eric Banks	bedcdbdc5f	Fixing merge conflict	2012-08-27 12:16:51 -04:00
Eric Banks	3d476487c6	LIBS is totally busted for deletions. Putting a check in AD for bad pileup event bases so that we don't produce busted alleles. We must fix LIBS ASAP.	2012-08-27 12:13:12 -04:00
Mark DePristo	63a9ae817a	Ensure thread-safety of CachingIndexedFastaSequenceFile -- Cosmetic cleanup of ReadReferenceView -- TraverseReadsNano provides the reference context, since it's thread-safe -- Cleanup CachingIndexedFastaSequenceFile. Add docs, remove unnecessary setters -- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.	2012-08-27 12:11:54 -04:00
Khalid Shakir	2d1ea7124b	One less Queue command line requirement: -tempDir now defaults to .queue/tmp. Also moved queueScatterGather to .queue/scatterGather.	2012-08-27 12:04:50 -04:00
Mark DePristo	68c5142d2d	numThreads > 1 any time you have -nt > 1 silly	2012-08-26 14:36:13 -04:00
Mark DePristo	fde9824765	Optimizations for parallel read walkers -- TraversalReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone -- Nicer debugging output in NanoScheduler -- ReadShard has a getBufferSize() method now	2012-08-25 17:21:12 -04:00
Mark DePristo	5066b14335	Parallel FlagStat	2012-08-25 17:21:12 -04:00
Mark DePristo	af540888f1	Limited version of parallel read walkers -- Currently doesn't support accessing reference or ROD data -- Parallel versions of PrintReads and CountReads	2012-08-25 17:21:12 -04:00
Mark DePristo	e060b148e2	Minor cleanup of TraverseReads	2012-08-25 17:21:11 -04:00
Mark DePristo	275a5e5439	More tests for NanoScheduler -- Add more contracts -- Test in the UnitTest that the reduce is being called in the correct order	2012-08-25 17:21:11 -04:00
Christopher Hartl	6db0988898	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-25 15:40:32 -04:00
Christopher Hartl	db2e88c7cb	Fix for badIndelLength() throwing NPE at non-indel sites. Added integration test.	2012-08-25 12:38:23 -07:00
Mark DePristo	59b5913b54	Merged bug fix from Stable into Unstable	2012-08-25 14:53:22 -04:00
Mark DePristo	dcc972a557	Usability cleanup for BQSR -- I'm seeing a lot of people trying to use BinaryTagCovariate in the community. They really shouldn't do this, so I moved it to private. -- Throw an exception if its required bintag argument is missing -- Check explicitly if user is requesting DinucCovariate and tell them that its been retired in favor of ContextCovariate -- Show the type (Required, Experimental, Standard) of the covariates when running --list	2012-08-25 14:53:00 -04:00
Christopher Hartl	b59948709f	Code improvements re: JIRA GSA-510. Trio class migrated into the Samples package - because the trio structure is so ubiquitously used, it makes sense, I think, to have a class which imposes the structure on the samples. Existing functions which slightly duplicated the getTrios() method look like they have bugs. These functions are now deprecated. A number of functions int he sampleDB looked to be assuming that samples could not share IDs (e.g. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption. MVLikelihoodRatio now uses the trio methods from SampleDB.	2012-08-25 08:48:27 -07:00
Mark DePristo	0996bbd548	Comments for Chris on cleanup	2012-08-24 16:04:58 -04:00
Mark DePristo	649b82ce85	Merge branch 'nanoScheduler' Conflicts: private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala	2012-08-24 15:59:36 -04:00
Mark DePristo	9de8077eeb	Working (efficient?) implementation of NanoScheduler -- Groups inputs for each thread so that we don't have one thread execution per map() call -- Added shutdown function -- Documentation everywhere -- Code cleanup -- Extensive unittests -- At this point I'm ready to integrate it into the engine for CPU parallel read walkers	2012-08-24 15:34:23 -04:00
Christopher Hartl	752f44c332	Code cleanup in MVLR and SelectVariants. Should fix JIRA GSA-509 and GSA-510	2012-08-24 12:25:11 -07:00
Mark DePristo	d6e6b30caf	Initial implementation of GSA-515: Nanoscheduler – Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficiency using resources to do sum of 2 * (sum(1 - X)). Done! CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement. Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of sequent available work units. May be worth measuring expected cost for read read / map / reduce unit and use it to balance the compute As input is single threaded just need one thread to populate inputs, which runs as fast as possible on parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job. Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks	2012-08-24 14:07:44 -04:00
Eric Banks	0545664f91	Fix ClassCastException seen in Tableau errors	2012-08-24 13:45:48 -04:00
Eric Banks	740520c23b	Fix BQSR docs	2012-08-24 13:20:10 -04:00
Ryan Poplin	5f8574bd15	Fixing typo in error message.	2012-08-24 10:48:41 -04:00
Mark DePristo	1999b95754	Work around for GSA-513: ClassCastException in VariantEval	2012-08-23 18:14:49 -04:00
Christopher Hartl	f1166d6d00	Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case.	2012-08-23 11:43:19 -07:00
Mark DePristo	857b11b26f	Done with GSA-506: Add nt and efficiency information to GATKRunReport -- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO -- See https://jira.broadinstitute.org/browse/GSA-506 for more details	2012-08-23 09:59:53 -04:00
Mark DePristo	0b735884db	Cleanup code in VariantContext	2012-08-23 09:59:53 -04:00
Eric Banks	e5df91aa23	Looks like the @WalkerName annotation doesn't work with the GATK docs, so I'm renaming the walkers.	2012-08-22 20:17:39 -04:00
Mark DePristo	63af0cbcba	Cleanup GATK efficiency monitor classes -- Invert logic in GATKArgumentCollection to disable monitoring, not enable. That means monitoring is on by default -- Fix testing error in unit tests -- Rename variables in ThreadAllocation to be clearer	2012-08-22 16:48:02 -04:00
Mark DePristo	e1293f0ef2	GSA-507: Thread monitoring refactored so it can work without a thread factory -- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory. -- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode -- MicroScheduler now handles management of the efficiency monitor. Includes master thread in monitor, meaning that reduce is now included for both schedulers	2012-08-22 16:48:01 -04:00
Mark DePristo	f876c51277	Separately track time spent doing user and system CPU work -- Allows us to ID (by proxy) time spent doing IO -- Refactor StateMonitoryingThreadFactory to use it's own enum, not Thread.State -- Reliable unit tests across mac and unix	2012-08-22 16:48:01 -04:00
Mark DePristo	18060f237b	Add thread efficiency monitoring to GATK HMS -- See https://jira.broadinstitute.org/browse/GSA-502 -- New command line argument -mt enables thread monitoring -- If enabled, HMS uses StateMonitoringThreadFactory to create monitored threads, and prints out an efficiency report when HMS exits, telling the user information like: for BQSR – known to be inefficient locking INFO 17:10:33,195 StateMonitoringThreadFactory - Number of activeThreads used: 8 INFO 17:10:33,196 StateMonitoringThreadFactory - Total runtime 90.3 m INFO 17:10:33,196 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.72 ( 64.8 m) INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent running is 0.26 ( 23.7 m) INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.02 ( 112.8 s) INFO 17:10:33,197 StateMonitoringThreadFactory - Efficiency of multi-threading: 26.19% of time spent doing productive work for CountLoci INFO 17:06:12,777 StateMonitoringThreadFactory - Number of activeThreads used: 8 INFO 17:06:12,777 StateMonitoringThreadFactory - Total runtime 43.5 m INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.00 ( 4.2 s) INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent running is 1.00 ( 43.3 m) INFO 17:06:12,779 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.00 ( 6.0 s) INFO 17:06:12,779 StateMonitoringThreadFactory - Efficiency of multi-threading: 99.61% of time spent doing productive work	2012-08-22 16:48:01 -04:00
Ryan Poplin	fe3069b278	Merged bug fix from Stable into Unstable	2012-08-22 14:40:34 -04:00
Ryan Poplin	e5cfdb4811	Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt()	2012-08-22 14:39:35 -04:00
Ryan Poplin	63213e8eb5	Expanding the HaplotypeCaller integration tests to cover a wider range of data	2012-08-22 14:18:44 -04:00
Eric Banks	944e1c299d	Docs for --keepOriginalAC were wrong in SelectVariants	2012-08-22 13:07:13 -04:00
Eric Banks	2409aa9bfd	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-22 12:54:43 -04:00
Eric Banks	94540ccc27	Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case.	2012-08-22 12:54:29 -04:00
Guillermo del Angel	901f47d8af	Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still	2012-08-22 11:38:51 -04:00
Guillermo del Angel	7df0abf49b	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-22 11:36:41 -04:00
Eric Banks	9e76e8aa0b	Just noticed that the efficient conversion to uppercase method is redundant since it's already implemented efficiently in Picard; let's just have a single implementation.	2012-08-22 11:26:08 -04:00
Christopher Hartl	20601f034e	Updating the checkType() function to include the new StructuralIndel variant type. Fixes outstanding broken integration test.	2012-08-22 07:33:10 -07:00
Eric Banks	c7ce3e1cf5	Merged bug fix from Stable into Unstable	2012-08-22 00:24:40 -04:00

... 8 9 10 11 12 ...

3232 Commits (0f5bb706ffd3e02a2fec45f32317f29e4e45cb6d)