gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Ryan Poplin	febc634557	Changing PileupElement's isSoftClipped to isNextToSoftClip since soft clipped bases aren't actually added to pileups, oops. Removing the intrinsic clustered variants filter from the HaplotypeCaller	2012-01-31 16:06:14 -05:00
Matt Hanna	7f70612beb	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-31 11:59:25 -05:00
Matt Hanna	a630db1703	Oops...HierarchicalMicroScheduler was transforming any exception from the walker level into a ReviewedStingException. Thanks to Ryan for pointing this out.	2012-01-31 11:58:21 -05:00
Christopher Hartl	faba3dd530	Merge branch 'master' of ssh://chartl@ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-31 10:25:29 -05:00
Mauricio Carneiro	17dbe9a95d	A few cleanups in the LocusIteratorByState * No more N's in the extended event pileups * Only add to the pileup MQ0 counter if the read actually goes into the pileup	2012-01-31 09:40:51 -05:00
Ryan Poplin	f9162ea705	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-30 19:45:19 -05:00
Ryan Poplin	abb91cf26b	Increasing the size of the active regions that are produced by the active probability integrator, more context is needed to call more complex events	2012-01-30 15:36:12 -05:00
Mauricio Carneiro	d5d4fa8a88	Fixed discordance bug reported by Brad Chapman discordance now reports discordance between genotypes as well (just like concordance)	2012-01-30 09:50:45 -05:00
Mark DePristo	3164c8dee5	S3 upload now directly creates the XML report in memory and puts that in S3 -- This is a partial fix for the problem with uploading S3 logs reported by Mauricio. There the problem is that the java.io.tmpdir is not accessible (network just hangs). Because of that the s3 upload fails because the underlying system uses tmpdir for caching, etc. As far as I can tell there's no way around this bug -- you cannot overload the java.io.tmpdir programmatically and even if I could what value would we use? The only solution seems to me is to detect that tmpdir is hanging (how?!) and fail with a meaningful error.	2012-01-29 15:14:58 -05:00
Menachem Fromer	0e17cbbce9	Merged bug fix from Stable into Unstable	2012-01-27 16:03:16 -05:00
Menachem Fromer	a9671b73ca	Fix to permit proper handling of mapping qualities between 128 to 255 (which get converted to byte values of -128 to -1)	2012-01-27 16:01:30 -05:00
Ryan Poplin	f7ac1f4a69	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-27 15:12:55 -05:00
Ryan Poplin	fc08235ff3	Bug fix in active region traversal, locusView.getNext() skips over pileups with zero coverage but still need to count them in the active probability integrator	2012-01-27 15:12:37 -05:00
Mark DePristo	0f2e8400b5	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-27 10:12:50 -05:00
Mauricio Carneiro	ec9920b04f	Updating the SAM TAG for Original Alignment Start to "OP" per Mark's recommendation to reuse the Indel Realigner tag that made it to the SAM spec. The Alignment end tag is still "OE" as there is no official tag to reuse.	2012-01-27 08:51:39 -05:00
Mark DePristo	13d1626f51	Minor improvements in ref QC walker. Unfortunately this doesn't actually catch Chris's error	2012-01-27 08:24:22 -05:00
Mauricio Carneiro	2a565ebf90	embarrassing fix-up, thanks Khalid.	2012-01-26 19:58:42 -05:00
Mauricio Carneiro	246e085ec9	Unit tests for GATKSAMRecord class * new unit tests for the alignment shift properties of reduce reads * moved unit tests from ReadUtils that were actually testing GATKSAMRecord, not any of the ReadUtils to it. * cleaned up ReadUtilsUnitTest	2012-01-26 17:06:36 -05:00
Mauricio Carneiro	0d4027104f	Reduced reads are now aware of their original alignments * Added annotations for reads that had been soft clipped prior to being reduced so that we can later recuperate their original alignments (start and end). * Tags keep the alignment shifts, not real alignment, for better compression * Tags are defined in the GATKSAMRecord * GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd() * Updated ReduceReads MD5s accordingly	2012-01-26 17:06:36 -05:00
Eric Banks	07f72516ae	Unsupported platform should be a user error	2012-01-26 16:14:25 -05:00
Ryan Poplin	cdff23269d	HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close.	2012-01-26 15:56:33 -05:00
Christopher Hartl	673ceadd11	While this fix worked for the evaluator module, it could potentially have bad effects in the phasing walkers. Special-case nocalls in the PhasingEvaluator and return AllelePair to previous state.	2012-01-26 13:06:36 -05:00
Christopher Hartl	9c6fda7e15	Yup. I was right.	2012-01-26 12:54:11 -05:00
Christopher Hartl	7d059540a4	Allow segments of genome to be excluded in generating a reference panel. Occasionally targets would contain no variation (typically, in the middle of the centromere), which beagle doesn't particularly like, and errors out rather than producing empty output files. The best way to deal with these is to just exclude the regions on a second-pass, and the remaining bits will be gathered with no additional work. AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but i suspect it's a no-call (".") rather than a no call ("./.").	2012-01-26 12:43:52 -05:00
Ryan Poplin	25532bdc37	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-26 11:43:32 -05:00
Ryan Poplin	390d493049	Updating ActiveRegionWalker interface to output a probability of active status instead of a boolean. Integrator runs a band-pass filter over this probability to produce actual active regions. First version of HaplotypeCaller which decides for itself where to trigger and assembles those regions.	2012-01-26 11:37:08 -05:00
Eric Banks	859dd882c9	Don't make it standard for now	2012-01-26 00:38:16 -05:00
Eric Banks	c5e81be978	Adding pairwise AF table. Not polished at all, but usable nonetheless.	2012-01-26 00:37:06 -05:00
Eric Banks	702a2d768f	Initial version of multi-allelic summary module in VariantEval	2012-01-25 19:42:55 -05:00
Eric Banks	9a60887567	Lost an import in the merge	2012-01-25 19:41:41 -05:00
Eric Banks	cba5f1a8b1	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-25 19:19:03 -05:00
Eric Banks	ddaf51a50f	Updated one integration test for indels	2012-01-25 19:18:51 -05:00
Eric Banks	add6918f32	Cleaner, more efficient way of determining the last dependent set in the queue.	2012-01-25 16:21:10 -05:00
Menachem Fromer	db645a94ca	Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES	2012-01-25 16:10:59 -05:00
Eric Banks	ef335a5812	Better implementation of the fix; PL index is now traversed in order.	2012-01-25 15:15:42 -05:00
Eric Banks	8e2d372ab0	Use remove instead of setting the value to null	2012-01-25 14:41:34 -05:00
Eric Banks	05816955aa	It was possible that we'd clean up a matrix column too early when a dependent column aborted early (with not enough probability mass) because we weren't being smart about the order in which we created dependencies. Fixed.	2012-01-25 14:28:21 -05:00
Eric Banks	2799a1b686	Catch exception for bad type and throw as a TribbleException	2012-01-25 12:15:51 -05:00
Eric Banks	96b62daff3	Minor tweak to the warning message.	2012-01-25 11:55:33 -05:00
Eric Banks	fb863dc6a7	Warn user when trying to run with EMIT_ALL_SITES with indels; better docs for that option.	2012-01-25 11:50:12 -05:00
Eric Banks	e349b4b14b	Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod.	2012-01-25 11:35:54 -05:00
Eric Banks	ea3d4d60f2	This annotation requires rods and should be annotated as such	2012-01-25 11:35:13 -05:00
Ryan Poplin	bbefe4a272	Added option to be able to write out the active regions to an interval list file	2012-01-25 09:47:06 -05:00
Ryan Poplin	9818c69df6	Can now specify active regions to process at the command line, mainly for debugging purposes	2012-01-25 09:32:52 -05:00
Mauricio Carneiro	ffd61f4c1c	Refactor the Pileup Element with regards to indels Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion. * Refactored PileupElement to distinguish clearly between deletions and read starting with insertion * Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups * Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements. * Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions. * Every deletion now has an offset (start of the event) * Fixed bug when LocusIteratorByState found a read starting with insertion that happened to be a reduced read. * Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine). * Pileup depth of coverage for a deleted base will now return the average coverage around the deletion. * Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's * The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan! phew!	2012-01-24 16:07:21 -05:00
Matt Hanna	c312bd5960	Weirdly, PicardException inherits from SAMException, which means that our specialty code for reporting malformed BAMs was actually misreporting any error that happened in the Picard layer as a BAM ERROR. Specifically changing PicardException to report as a ReviewedStingException; we might want to change it in the future. I'll followup with the Picard team to make sure they really, really want PicardException to inherit from SAMException.	2012-01-24 15:30:04 -05:00
Mark DePristo	0a3172a9f1	Fix for ref 0 bases for Chris -- Disturbingly, fixing this bug doesn't actually cause any test failures. -- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly. -- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt	2012-01-24 10:55:09 -05:00
Khalid Shakir	c18beadbdb	Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc. Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.	2012-01-23 16:17:04 -05:00
Mark DePristo	02450e4b12	Merged bug fix from Stable into Unstable	2012-01-23 12:08:39 -05:00
Christopher Hartl	798596257b	Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module.	2012-01-23 10:50:16 -05:00
Mark DePristo	80a4ce0edf	Bugfix for incorrect error messages for missing BAMs and VCFs -- Missing BAMs were appearing as StingExceptions -- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions -- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions -- Added path to standard b37 BAM to BaseTest -- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.	2012-01-23 09:52:07 -05:00
Guillermo del Angel	31d2f04368	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-23 09:23:03 -05:00
Guillermo del Angel	966387ca0b	Next intermediate commit in the pool caller. Lots of bug fixes and now we can emit true vcf's with calls in discovery mode (still of unknown quality) - old validation mode is temporarily broken, will be fixed in next refactoring.	2012-01-23 09:22:31 -05:00
Christopher Hartl	4a08e8ca6e	Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down either to the accounting of genotypes changed (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./. .) This is somewhat of an arbitrary decision, and is negotiable. I could see treating GT:PL ./.:. differently from GT:PL .:0,3,6 but am not sure the worth of doing so.	2012-01-23 08:25:34 -05:00
Ryan Poplin	4d6312d4ea	HaplotypeCaller is now an ActiveRegionWalker.	2012-01-22 14:31:01 -05:00
Christopher Hartl	3b1aad4f17	After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call.	2012-01-20 23:43:51 -05:00
Christopher Hartl	9b4f6afa21	Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles appear to be set to null, so this should not affect the actual phasing in the output VCF.	2012-01-20 23:07:59 -05:00
Ryan Poplin	4b18786b5d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-19 22:05:20 -05:00
Ryan Poplin	ace9333068	Active region walkers can now see the reads in a buffer around their active regions. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal.	2012-01-19 22:05:08 -05:00
Menachem Fromer	066da80a3d	Added KEEP_UNCONDTIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output	2012-01-19 18:19:58 -05:00
Christopher Hartl	7f3ad25b01	Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF. T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.	2012-01-19 10:54:48 -05:00
Ryan Poplin	7e082c7750	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-19 09:11:23 -05:00
Eric Banks	ab8f499bc3	Annotate with FS even for filtered sites	2012-01-18 22:04:51 -05:00
Guillermo del Angel	b123416c4c	Resolve stale merge changes	2012-01-18 20:56:36 -05:00
Guillermo del Angel	2eb45340e1	Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine optimal alt allele at each computable site and will compute AC distribution on it. Current implementation is not working yet if there's more than one pool and it will only output biallelic sites, no functionality for true multi-allelics yet	2012-01-18 20:54:10 -05:00
Ryan Poplin	0268da7560	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-18 09:53:00 -05:00
Ryan Poplin	60024e0d7b	updating TDT integration test	2012-01-18 09:52:50 -05:00
Ryan Poplin	11982b5a34	We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN.	2012-01-18 09:42:41 -05:00
Mark DePristo	763c81d520	No longer enforce MAX_ALLELE_SIZE in VCF codec -- Instead issue a warning when a large (>1MB) record is encountered -- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()	2012-01-18 07:35:11 -05:00
Mark DePristo	0c7865fdb5	UnitTest for reverseAlleleClipping -- No code modified yet, just implementing a unit test to ensure correctness of the existing code	2012-01-18 07:35:11 -05:00
Mark DePristo	62801e430a	Bugfix for unnecessary optimization -- don't cache the ref bytes	2012-01-17 16:40:26 -05:00
Mark DePristo	f2b0575dee	Detect unreasonably large allele strings (>2^16) and throw an error -- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places. -- Tribble was updated so we actually could read the line properly (rev. to 51 here). -- Still the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.	2012-01-17 16:40:26 -05:00
Ryan Poplin	8b0ddf0aaf	Adding notes to CountCovariates docs about using interval lists as database of known variation	2012-01-17 16:13:13 -05:00
Matt Hanna	40ebc17437	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-17 14:49:17 -05:00
Matt Hanna	41d70abe4e	At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings.	2012-01-17 14:47:53 -05:00
Ryan Poplin	ae259f81cc	Bug fixing for merging of read fragments when one fragment contained an indel	2012-01-17 14:39:27 -05:00
Christopher Hartl	cde224746f	Bait Redesign supports baits that overlap, by picking only the start of intervals. CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used. Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.	2012-01-17 13:51:05 -05:00
Ryan Poplin	8e23c98dd9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-17 13:46:28 -05:00
Matt Hanna	32ccde374b	Merged bug fix from Stable into Unstable	2012-01-17 11:08:35 -05:00
Matt Hanna	3ba918aff1	Error message cleanup in BAM indexing code.	2012-01-17 11:05:42 -05:00
Mauricio Carneiro	cec7107762	Better location for the downsampling of reads in PrintReads * using the filter() instead of map() makes for a cleaner walker. * renaming the unit tests to make more sense with the other unit and integration tests	2012-01-14 14:06:09 -05:00
Mark DePristo	b06074d6e7	Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe -- Uses PriorityBlockingQueue instead of PriorityQueue -- synchronized keywords added to all key functions that modify internal state Note that this hasn't been tested extensively. Based on report: http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic	2012-01-13 09:33:16 -05:00
Mauricio Carneiro	28aa353501	Added "unbiased" downsampling parameter to PrintReads * also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.	2012-01-12 16:33:55 -05:00
Matt Hanna	2c3176eb80	Merged bug fix from Stable into Unstable	2012-01-12 13:31:10 -05:00
Matt Hanna	cd43f016ce	Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior.	2012-01-12 13:29:11 -05:00
Eric Banks	ed34b4f088	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-12 10:27:26 -05:00
Eric Banks	e7fe9910f7	Create the temp storage for calculating cell values just once as per Mark's TODO	2012-01-12 10:27:10 -05:00
Eric Banks	f5f5ed5dcd	Don't initialize the cell conformation values (use an else in the loop instead) as per Mark's TODO	2012-01-12 08:50:03 -05:00
Eric Banks	410a340ef5	Swapping the iteration order to run over AF conformations and then samples instead of the reverse minimizes calls to HashMap.get; instead of it being O(n) since we called it for each sample it's now O(1). Runtime on T2D GENES test set is reduced by 5-10%. More optimizations to follow.	2012-01-12 02:04:03 -05:00
Mauricio Carneiro	77a03c9709	Patching special case in the adaptor clipping * if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair. * updated md5's accordingly	2012-01-11 17:47:44 -05:00
Eric Banks	25d0d53d88	Moving the approximate summing of log10 vals to MathUtils; keeping the more efficient implementation of fast rounding.	2012-01-10 12:38:47 -05:00
Eric Banks	589397d611	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-10 12:36:48 -05:00
Eric Banks	c5320ef1af	Resolving changes in integration test during merge	2012-01-10 12:14:16 -05:00
Matt Hanna	e923a2e512	Revving Picard to incorporate final version of ReadWalker performance improvements.	2012-01-10 12:12:33 -05:00
Eric Banks	0f36f6947e	Resolving merge conflicts	2012-01-10 11:44:16 -05:00
Eric Banks	f2cecce10f	Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously).	2012-01-10 11:34:23 -05:00
Matt Hanna	509c3d87b0	Merged bug fix from Stable into Unstable	2012-01-09 23:08:46 -05:00
Matt Hanna	dc60757b68	Eliminate unnecessary strong references (and therefore memory held) by tree reduce entries that have already been processed. Thanks to Tim Fennell for the bug report.	2012-01-09 23:04:53 -05:00
Matt Hanna	fda1795791	Merged bug fix from Stable into Unstable	2012-01-08 22:04:44 -05:00
Matt Hanna	1f1233b669	Fix for a rare but insidious bug in position tracking during async BAM file reading. Thanks to Khalid for spotting and reporting the issue.	2012-01-08 22:03:35 -05:00
Khalid Shakir	5793625592	No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice. QScript accessor to QSettings to specify a default runName and other default function settings. Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs. Gathered log files concatenate all log files together into the stdout. InProcessFunctions now have PrintStreams for stdout and stderr. Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml. During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file. In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope. Added more detailed output when running with -l DEBUG. Simplified graphviz visualization for additional debugging. Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List) Minor cleanup to build including sending ant gsalib to R's default libloc.	2012-01-08 12:11:55 -05:00
Guillermo del Angel	d4e7655d14	Added ability to call multiallelic indels, if -multiallelic is included in UG arguments. Simple idea: we genotype all alleles with count >= minIndelCnt. To support this, refactored code that computes consensus alleles. To ease merging of multiple alt alleles, we create a single vc for each alt allele and then use VariantContextUtils.simpleMerge to carry out merging, which takes care of handling all corner conditions already. In order to use this, interface to GenotypeLikelihoodsCalculationModel changed to pass in a GenomeLocParser object (why are these objects so hard to handle??). More testing is required and feature turned off by default.	2012-01-06 11:24:38 -05:00
Ryan Poplin	616ff8ea01	fixed typo in help text	2012-01-06 10:36:11 -05:00
Mark DePristo	dd80ffbbbe	Merged bug fix from Stable into Unstable	2012-01-05 21:51:48 -05:00
Mark DePristo	c96fee477c	Bug fix for VariantSummary -- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional. Fixed. Has the annoying side effect that indels > 50bp in size are not counted as indels, and so are subtracted from both the novel and known counts for indels. C'est la vie -- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels. Using this more recent and representative file probably a good idea for more future tests in VE and other tools. File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data	2012-01-05 21:51:06 -05:00
Eric Banks	f5e10e9879	Merged bug fix from Stable into Unstable	2012-01-05 15:35:09 -05:00
Eric Banks	18ed954741	Compute Ti/Tv only if bi-allelic	2012-01-05 15:33:26 -05:00
Ryan Poplin	a6886a4cc0	Initial commit of the Active Region Traversal. Not ready to be used by anyone yet.	2012-01-04 17:03:21 -05:00
Guillermo del Angel	58d4539304	Enabled banded indel computation by default. Reversed logic in input UG argument so that we can still disable it if required. Minor changes to integration tests due to minor differences in GL's and in annotations	2012-01-04 15:28:26 -05:00
Mauricio Carneiro	9ff8a01da2	Merged bug fix from Stable into Unstable	2012-01-03 18:10:39 -05:00
Mauricio Carneiro	9b55505c03	Fixing PairHMMIndelErrorModel array out of bounds This error was due to the ReadClipper change of contract. Before the read utils would return null if a read was entirely clipped, now it returns an empty (safe) GATKSAMRecord.	2012-01-03 18:08:46 -05:00
Christopher Hartl	2c3a9ce02f	Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable	2012-01-03 17:25:56 -05:00
David Roazen	621ee2b613	Merged bug fix from Stable into Unstable	2012-01-03 16:56:49 -05:00
Christopher Hartl	9093de1132	Cleanup: remove code to calculate the MLE AC in the UGE.	2012-01-03 15:58:51 -05:00
Christopher Hartl	2d093828a4	Final changes to Junky (been frozen for a while, but uncommitted) and the qscript for it. A first cursory implementation of the trellis-based Exact AC-constrained genotyping algorithm in UGE. Nothing calls into it, so this should be entirely safe (and, no surprise, it passes UG integration tests).	2012-01-03 15:33:04 -05:00
David Roazen	ea6e718cb8	SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline. For now, we recommend only running with the GRCh37.64 database.	2012-01-03 15:18:36 -05:00
Christopher Hartl	93e1417b6e	Update to the VSS GATK documentation.	2012-01-03 13:39:31 -05:00
David Roazen	4984ca5e31	Merged bug fix from Stable into Unstable	2012-01-03 11:03:30 -05:00
David Roazen	f3f01da1af	Enforce serial dependencies in RecalibrationWalkersIntegrationTest Some tests in this class were intermittently not being executed due to being randomly scheduled before tests whose results they depend on. Now the serial dependencies are enforced to avoid problematic orderings.	2012-01-03 10:42:41 -05:00
Eric Banks	ab8d47d9a5	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-03 09:38:49 -05:00
Mauricio Carneiro	3d4bf273de	Added getPileupForReadGroups to ReadBackPileup * returns a pileup for all the read groups provided. * saves us from multiple calls to getPileup (which is very inefficient)	2012-01-03 09:35:11 -05:00
Mauricio Carneiro	4a208c7c06	Refactor of the downsampling machinery to accept different strategies * Implemented Adaptive downsampler * Added integration test * Added option to RRead scala script to choose downsampling strategy	2012-01-03 09:29:47 -05:00
Mauricio Carneiro	21ae3ef5f9	Added downsampling support to ReduceReads * Downsampling is now a parameter to the walker with default value of 0 (no downsampling) * Downsampling selects reads at random at the variant region window and strives to achieve uniform coverage if possible around the desired downsampling value. * Added integration test	2012-01-03 09:29:46 -05:00
Mauricio Carneiro	cd68cc239b	Added knuth-shuffle (KS) and randomSubset using KS to MathUtils * Knuth-shuffle is a simple, yet effective array permutator (hope this is good english). * added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation. * added unit tests to both functions	2012-01-03 09:29:46 -05:00
Mauricio Carneiro	94791a2a75	Add support for reads starting with insertion * Modified cleanCigarShift to allow insertions in the beginning and end of the read * Allowed cigars starting/ending in insertions in the systematic ReadClipper tests * Updated all ReadClipper unit tests * ReduceReads does not hard clip leading insertions by default anymore * SlidingWindow adjusts start location if read starts with insertion * SlidingWindow creates an empty element with insertions to the right * Fixed all potential divide by zero with totalCount() (from BaseCounts) * Updated all Integration tests * Added new integration test for multiple interval reducing	2012-01-03 09:29:45 -05:00
Mark DePristo	d05f0c2318	GATKPerformanceOverTime script update -- Automatic detection of most recent version of GATK release (just tell the script now to use 1.2, 1.3, and 1.4) -- Uses 1.4 now -- By default we do 9 runs of each non-parallel test -- In PathUtils added convenience utility to find most recent release GATK jar with a specific release number	2012-01-02 09:58:46 -05:00
Mauricio Carneiro	1b6d52817e	fixing adaptor clipping effect on recalibration integration test	2012-01-01 22:20:06 -05:00
Eric Banks	393993e0c7	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-31 20:42:46 -05:00
Mauricio Carneiro	55cfa76cf3	Updated integration tests for the new adaptor clipping fix.	2011-12-30 18:47:14 -05:00
Mauricio Carneiro	c7d0a9ebee	Forgot to test for inter-chromosomal mates in the adaptor clipping * Fixing bug caught by Eric (and Kristian)	2011-12-30 00:19:53 -05:00
Matt Hanna	a259bfefd4	First commit addressing problems running RTC in parallel. Turns out that because the RTC is the first walker to 'correctly' tree reduce according to functional programming standards, the RTC has revealed a few problems with the tree reducer holding on to too much data. This is the first and smaller of two commits to reduce memory consumption. The second commit will likely be pushed after GATK1.4 is released.	2011-12-29 16:22:14 -05:00
Eric Banks	1a45ea5a05	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-29 11:37:15 -05:00
Mauricio Carneiro	f692911903	GATKSAMRecord emptyRead static constructor * Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking. * All ReadClipper utilities now emit empty reads for fully clipped reads	2011-12-27 17:01:17 -05:00
Mauricio Carneiro	8259c748f2	No more Filtered Reads tag. All synthetic reads are marked with the reduced read tag.	2011-12-27 17:01:17 -05:00
Eric Banks	d20a25d681	A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently).	2011-12-27 16:50:38 -05:00
Eric Banks	adff40ff58	Minor optimizations to avoid extra processing (esp. for reduced reads)	2011-12-27 13:16:25 -05:00
Mauricio Carneiro	17bfe48d5e	Made all class methods private in the ReadClipper * ReadClipperUnitTest now uses static methods * Haplotype caller now uses static methods * Exon Junction Genotyper now uses static methods	2011-12-27 02:11:32 -05:00
Eric Banks	dd990061f6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-26 14:45:35 -05:00
Eric Banks	2130b39f33	Found the bug in the engine: RodLocusView was using the wrong seek method so that it would only move to the first locus of a shard (and with multi-locus shards, this meant that we never processed RODs from the other positions). In fact, because the seek(Shard) method is extremely misleading and now no longer used, I think it's safer to delete it and make everyone use the much more transparent seek(GenomeLoc). Note that I have not re-enabled my improvements to the intervals accumulation of ReferenceDataSource because that inefficiency is still present downstream in RodLocusView; need to discuss those changes with Matt.	2011-12-26 14:45:19 -05:00
Mauricio Carneiro	35c41409a1	Better contracts and docs for the ReadClipper * Described the ReadClipper contract in the top of the class * Added contracts where applicable * Added descriptive information to all tools in the read clipper * Organized public members and static methods together with the same javadoc	2011-12-23 19:36:57 -05:00
David Roazen	506c0e9c97	Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released and we've verified the quality of its annotations.	2011-12-23 19:12:57 -05:00
Eric Banks	24c84da60d	'Fixing' the changes in ReferenceDataSource so that a shard properly contains a list of GenomeLocs instead of a single merged one. However, that uncovered a probable bug in the engine, so instead of letting this code fester unfixed in the build (affecting everyone in the group) I've decided to revert the previous (slow, but working) version and fix the engine in my own branch.	2011-12-23 15:39:12 -05:00
Eric Banks	8762313a0d	Better TODO message	2011-12-22 20:54:35 -05:00
Eric Banks	a815e875a8	Removing debugging output	2011-12-22 15:49:11 -05:00
Eric Banks	deef542a38	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-22 15:44:58 -05:00
Eric Banks	6d260ec6ae	Start printing traversal stats after 30 seconds. I can't stand waiting 2 minutes.	2011-12-22 15:40:59 -05:00
David Roazen	510c71158c	Merged bug fix from Stable into Unstable	2011-12-22 10:49:52 -05:00
David Roazen	32cdef9682	Rename PerformanceTest test classes to LargeScaleTest This is in preparation for the installation of the new performance test suite in Bamboo. Note that "ant performancetest" is now "ant largescaletest"	2011-12-22 10:38:49 -05:00
Mauricio Carneiro	731a463415	Updated IntegrationTests with new adaptor clipper phew!	2011-12-20 17:48:52 -05:00
Mauricio Carneiro	cadff40247	getRefCoordSoftUnclippedStart and End refactor These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips. * Removed from ReadUtils * Added to GATKSAMRecord * Changed name to getSoftStart() and getSoftEnd * Updated third party code accordingly.	2011-12-20 17:48:51 -05:00
Mauricio Carneiro	07128a2ad2	ReadUtils cleanup * Removed all clipping functionality from ReadUtils (it should all be done using the ReadClipper now) * Cleaned up functionality that wasn't being used or had been superseded by other code (in an effort to reduce multiple unsupported implementations) * Made all meaningful functions public and added better comments/explanation to the headers	2011-12-20 17:48:40 -05:00
Mauricio Carneiro	1c4774c475	Static versions of the hard clipping utilities For simplified access to the hard clipping utilities. No need to create a ReadClipper object if you are not doing multiple complicated clipping operations, just use the static methods. examples: ReadClipper.hardClipLowQualEnds(2); ReadClipper.hardClipAdaptorSequence();	2011-12-20 17:48:39 -05:00
Mauricio Carneiro	f73ad1c2e2	Bugfix/Rewrite: Algorithm to determine adaptor boundaries The algorithm wasn't accounting for the case where the read is the reverse strand and the insert size is negative. * Fixed and rewrote for more clarity (with Ryan, Mark and Eric). * Restructured the code to handle GATKSAMRecords only * Cleaned up the other structures and functions around it to minimize clutter and potential for error. * Added unit tests for all 4 cases of adaptor boundaries.	2011-12-20 17:48:39 -05:00
Mark DePristo	0cc5c3d799	General improvements to Queue -- Support for collecting resources info from DRMAA runners -- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4 -- NCoresRequest is a testing queue script for this. -- Added two command line arguments: -- multiCoreJerk: don't request multiple cores for jobs with nt > 1. This was the old behavior but it's really not the best way to run parallel jobs. Now with queue if you run nt = 4 the system requests 4 cores on your host. If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk -- job_parallel_env: parallel environment name used with SGE to request multicore jobs. Equivalent to -pe job_parallel_env NT for NT > 1 jobs	2011-12-20 14:05:09 -05:00
Eric Banks	7204fcc2c3	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-20 12:59:11 -05:00
Eric Banks	8ade2d6ac2	max_alternate_alleles also ready to be made public	2011-12-20 12:59:02 -05:00
Eric Banks	6f52bd580b	--multiallelic mode is not hidden anymore (but it is annotated as advanced); added docs	2011-12-20 12:47:38 -05:00
Mauricio Carneiro	37e0044c48	Removing unclipSoftClipBases from ReadUtils * it was buggy and dangerous. * Updated Chris' code to use the ReadClipper.	2011-12-20 00:11:26 -05:00
Mauricio Carneiro	78d9bf7196	Added REVERT_SOFTCLIPPED_BASES capability to ReadClipper * New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches. * Added functionality to clipping op to revert all soft clip bases in a read into matches * Added revertSoftClipBases function to the ReadClipper for public use * Wrote systematic unit tests	2011-12-20 00:04:30 -05:00
Christopher Hartl	24585062f8	Merge branch 'incoming'	2011-12-19 23:16:36 -05:00
Christopher Hartl	67298f8a11	AFCR made public (for use in VSS) Minor changes to ValidationSiteSelector logic (SampleSelectors determine whether a site is valid for output, no actual subset context need be operated on beyond that determination). Implementation of GL-based site selection. Minor changes to EJG.	2011-12-19 23:14:26 -05:00
Eric Banks	06d385e619	Simplifying the interface a bit	2011-12-19 15:29:46 -05:00
Christopher Hartl	339ef92eac	Goodbye SW by default. Now aligned reads that overlap intron-exon junctions are scored where they are by default, but warns the user (and flags the record in the VCF) if there's evidence to suggest that there is an indel throwing off the scoring (e.g. if the best score of a realigned unmapped read is >5 log orders better than the best score of a scored mapped read). Unmapped reads are still SW-aligned to the junction-junction sequence. This should result in a rather massive speedup, so far untested. UGBoundAF has to go in at some point. In the process of rewriting the math for bounding the allele frequency (it was assuming uniform tails, which is silly since I derived the posterior distribution in closed form sometime back, just need to find it)	2011-12-19 12:18:18 -05:00
Christopher Hartl	418d22b67e	Merge branch 'master' of ssh://tin.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable Conflicts: private/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IntronLossGenotyperV2.java	2011-12-19 10:59:18 -05:00
Christopher Hartl	69661da37d	Moving ValidationSiteSelector to validation package in public under my ownership. JunctionGenotyper added and modified several times, this commit is due to merge conflict fixes.	2011-12-19 10:57:28 -05:00
Laurent Francioli	16cc2b864e	- Corrected bug causing cases where both parents are HET to be accounted twice in the TDT calculation - Adapted TDT Integration test to corrected version of TDT Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>	2011-12-19 10:30:59 -05:00
Eric Banks	5fd19ae734	Commented exactly how the results are represented from the exact model so developers can know how to use them.	2011-12-19 10:19:00 -05:00
Eric Banks	3069a689fe	Bug fix: if there are multiple records at a given position, it turns out that SelectVariants would drop all variants that follow after one that fails filters (instead of dropping just the failing one). Added an integration test to cover this case.	2011-12-19 10:04:33 -05:00
Mauricio Carneiro	5b678e3b94	Remove ClippingOp UnitTests * all testing functionality is in the ReadClipperUnitTest, no need to double test. * class and package naming cleanup	2011-12-19 07:49:26 -05:00
Matt Hanna	1ead00cac5	New fork of SamFileHeaderMerger should be cached at the thread level to enable fast (and valid) thread lookups.	2011-12-18 19:04:26 -05:00
Ryan Poplin	bc842ab3a5	Adding option to VariantAnnotator to do strict allele matching when annotating with comp track concordance.	2011-12-18 15:27:23 -05:00
Ryan Poplin	953998dcd0	Now that getSampleDB is public in the walker base class this override in VariantAnnotator isn't necessary.	2011-12-18 14:38:59 -05:00
Eric Banks	76bd13a1ed	Forgot to update the unit test	2011-12-18 01:13:49 -05:00
Eric Banks	07f9d14d9f	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-18 00:43:15 -05:00
Eric Banks	c5ffe0ab04	No reason to sum the normalized posteriors array to get Pr(AF>0) given that we can just compute 1.0 - array[0]. Integration tests change only because of trivial precision artifacts for reference calls using EMIT_ALL_SITES.	2011-12-18 00:31:47 -05:00
Eric Banks	6dc52d42bf	Implemented the proper QUAL calculation for multi-allelic calls. Integration tests pass except for the ones making multi-allelic calls (duh) and one of the SLOD tests (which used to print 0 when one of the LODs was NaN but now we just don't print the SB annotation for that record).	2011-12-18 00:01:42 -05:00
Khalid Shakir	6059ca76e8	Removing cruft that snuck in last commit.	2011-12-16 23:00:16 -05:00
Khalid Shakir	7486696c07	When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter. Creating a single temporary directory per ant test run instead of putting temp files across all runs in the same directory. Updated various tests for above items and other small fixes.	2011-12-16 18:09:25 -05:00
Mauricio Carneiro	e5df9e0684	cleaner test output cleaned up the debug "pass" messages in the unit tests	2011-12-16 18:04:00 -05:00
Mauricio Carneiro	fcc21180e8	Added hardClipLeadingInsertions UnitTest for the ReadClipper fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case. note: test is turned off until contract changes to allow hanging insertions (left/right).	2011-12-16 18:02:47 -05:00
Mauricio Carneiro	075be52adc	Added hardClipByReferenceCoordinates (left and right tails) UnitTest for the ReadClipper	2011-12-16 18:01:33 -05:00
Mauricio Carneiro	5bba44d693	Added hardClipByReferenceCoordinates UnitTest for the ReadClipper * fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail. * fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail. * fixed AlignmentStart of a clipped read that results in only hard clips and soft clips note: added tests to all these beautiful cases...	2011-12-16 18:01:33 -05:00
Mauricio Carneiro	5838ba529d	Added hardClipByReadCoordinates UnitTest for the ReadClipper	2011-12-16 18:01:33 -05:00
Mauricio Carneiro	c26295919e	Added hardClipBothEndsByReferenceCoordinates UnitTest for the ReadClipper	2011-12-16 18:01:33 -05:00
Mark DePristo	1994c3e3bc	Only print warning about allele incompatibility when there are genotypes in the file in CombineVariants	2011-12-16 16:50:51 -05:00
Mark DePristo	b6067be952	Support for selecting only variants with specific IDs from a file in SelectVariants -- Cleaned up unused variables as well	2011-12-16 16:50:39 -05:00
Mark DePristo	d6d2f49c88	Don't print log if there are no BAMs	2011-12-16 16:50:36 -05:00
Mark DePristo	78e0950a77	Minor bug fix for printing in SAMDataSource	2011-12-16 11:45:40 -05:00
Mark DePristo	7bc0d18418	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-16 11:42:42 -05:00
Ryan Poplin	5aa79dacfc	Changing hidden optimization argument to advanced.	2011-12-16 10:29:20 -05:00
Matt Hanna	3642a73c07	Performance improvements for dynamically merging BAMs in read walkers. This change and my previous change have dropped runtime when dynamically merging 2k BAM files from 72.6min/1M reads to 46.8sec/1M reads. Note that many of these changes are stopgaps -- the real problem is the way ReadWalkers interface with Picard, and I'll have to work with Tim&Co to produce a more maintainable patch.	2011-12-16 09:37:44 -05:00
Mark DePristo	3414ecfe2e	Restored serial version of reader initialization. Serial mode is default, as the performance gains aren't so huge. -- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version -- Comparison of serial and parallel reader with cached and uncached files: Initialization time: serial with 500 fully cached BAMs: 8.20 seconds Initialization time: serial with 500 uncached BAMs : 197.02 seconds Initialization time: parallel with 500 fully cached BAMs: 30.12 seconds Initialization time: parallel with 500 uncached BAMs : 75.47 seconds	2011-12-16 09:22:10 -05:00
Mark DePristo	fb1c9d2abc	Restored serial version of reader initialization. Parallel mode is default. -- Serial version can be re-enabled with a static boolean, if we decide to return to the serial version	2011-12-16 09:05:28 -05:00
Mauricio Carneiro	e61e5c7589	Refactor of ReadClipper unit tests * expanded the systematic cigar string space test framework Roger wrote to all tests * moved utility functions into Utils and ReadUtils * cleaned up unused classes	2011-12-15 19:05:43 -05:00
Mauricio Carneiro	4748ae0a14	Bugfix: Softclips before Hardclips weren't being accounted for caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it. * updated unit test for hardClipSoftClippedBases with corresponding test-case.	2011-12-15 12:17:25 -05:00
Mauricio Carneiro	62a2e335bc	Changing HardClipper contract to allow UNMAPPED reads shifted the contract to functions that operate on reference based coordinates. The clipper should do the right thing with unmapped reads, but it needs more testing (Ryan is using it at the moment and says it works). Will write some unit tests.	2011-12-15 11:08:19 -05:00
Matt Hanna	9333b678b5	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-14 18:05:44 -05:00
Matt Hanna	6fb4be1a09	Cache header merger.	2011-12-14 18:05:31 -05:00
Mauricio Carneiro	50dee86d7f	Added unit test to catch Ryan's exception Unit test to catch the special case that broke the clipping op, fixed in the previous commit.	2011-12-14 16:58:14 -05:00
Mauricio Carneiro	128bdf9c09	Create artificial reads with "default" parameters * added functions to create synthetic reads for unit testing with reasonable default parameters * added more functions to create synthetic reads based on cigar string + bases and quals.	2011-12-14 16:58:14 -05:00
Mauricio Carneiro	c85100ce9c	Fix ClippingOp bug when performing multiple hardclip ops bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds. fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).	2011-12-14 16:57:47 -05:00
Eric Banks	de5928ac5a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-14 16:24:56 -05:00
Eric Banks	4fddac9f22	Updating busted integration tests	2011-12-14 16:24:43 -05:00
Mark DePristo	01e547eed3	Parallel SAMDataSource initialization -- Uses 8 threads to load BAM files and indices in parallel, decreasing costs to read thousands of BAM files by a significant amount -- Added logger.info message noting progress and cost of reading low-level BAM data.	2011-12-14 16:14:26 -05:00
Mark DePristo	71b4bb12b7	Bug fix for incorrect logic in subsetSamples -- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list) -- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples. -- Unit tests added to handle these cases	2011-12-14 16:14:26 -05:00
Eric Banks	35fc2e13c3	Using the new PL cache, fix a bug: when only a subset of the genotyped alleles are used for assigning genotypes (because the exact model determined that they weren't all real) the PLs need to be adjusted to reflect this. While fixing this I discovered that the integration tests are busted because ref calls (ALT=.) were getting annotated with PLs, which makes no sense at all.	2011-12-14 15:31:09 -05:00
Eric Banks	1e90d602a4	Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.	2011-12-14 13:38:20 -05:00
Eric Banks	988d60091f	Forgot to add in the new result class	2011-12-14 13:37:15 -05:00
Eric Banks	106bf13056	Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays.	2011-12-14 12:05:50 -05:00
Eric Banks	7648521718	Add check for mixed genotype so that we don't exception out for a valid record	2011-12-14 11:26:43 -05:00
Eric Banks	9497e9492c	Bug fix for complex records: do not ever reverse clip out a complete allele.	2011-12-14 11:21:28 -05:00
Eric Banks	09a5a9eac0	Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number.	2011-12-14 10:43:52 -05:00
Eric Banks	d3f4a5a901	Fail gracefully when encountering malformed VCFs without enough data columns	2011-12-14 10:37:38 -05:00
Eric Banks	079932ba2a	The log10cache needs to be larger if we want to handle 10K samples in the UG.	2011-12-13 23:36:10 -05:00
Ryan Poplin	7fa1ab1bae	Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test	2011-12-13 17:19:40 -05:00
Eric Banks	e47a113c9f	Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right?	2011-12-12 23:02:45 -05:00
Mauricio Carneiro	5cc1e72fdb	Parallelized SelectVariants * can now use -nt with SelectVariants for significant speedup in large files * added parallelization integration tests for SelectVariants	2011-12-12 18:41:14 -05:00
Mauricio Carneiro	a70a0f25fb	Better debug output for SAMDataSource output the name and number of the files being loaded by the GATK instead of "coordinate sorted".	2011-12-12 17:57:29 -05:00
Mark DePristo	d03425df2f	TODO optimization targets	2011-12-12 17:39:51 -05:00
Laurent Francioli	7cf27bb66e	Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval.	2011-12-12 12:22:43 +01:00
Laurent Francioli	025bdfe2cc	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-12 12:19:44 +01:00
Eric Banks	7b6338c742	Merge branch 'master' into trialleles	2011-12-11 00:28:46 -05:00
Eric Banks	7c4b9338ad	The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now.	2011-12-11 00:23:33 -05:00
Eric Banks	044f211a30	Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly.	2011-12-10 23:57:14 -05:00
Eric Banks	364f1a030b	Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone.	2011-12-09 14:25:28 -05:00
Mauricio Carneiro	8475328b2c	Turning off test that breaks read clipper until we define what is the desired behavior for clipping this particular case.	2011-12-09 11:53:12 -05:00
Roger Zurawicki	4cbd1f0dec	Reorganized the testing code and created ClipReadsTestUtils Tests are more rigorous and include many more test cases. We can test custom cigars and the generated cigars. *Still needs debugging because code is not working. Created test classes to be used across several tests. Some cases are still commented out. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2011-12-09 11:52:34 -05:00
Roger Zurawicki	0e9c2cefa2	testHardClipSoftClippedBases works with Matches and Deletions Insertions are a problem so cigar cases with "I" are commented out. The test works with multiple deletions and matches. This is still not a complete test. A lot of cigar test cases are commented out. Added insertions to ReadClipperUnitTest ReadClipper now tests for all indels. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2011-12-09 11:43:37 -05:00
Eric Banks	64dad13e2d	Don't carry around an extra copy of the code for the Haplotype Caller	2011-12-09 11:09:40 -05:00
Eric Banks	442ceb6ad9	The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors.	2011-12-09 10:16:44 -05:00
Laurent Francioli	a79144f7db	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-09 15:57:24 +01:00
Laurent Francioli	72fbfba97d	Added UnitTests for getFamilies() and getChildrenWithParents()	2011-12-09 15:57:07 +01:00
Laurent Francioli	5a06170804	Corrected bug causing getChildrenWithParents() to not take the last family member into consideration.	2011-12-09 14:51:34 +01:00
Eric Banks	aa4a8c5303	No dynamic programming solution for assigning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes).	2011-12-09 02:25:06 -05:00
Eric Banks	2fe50c64da	Updating md5s	2011-12-09 00:47:01 -05:00
Eric Banks	8777288a9f	Don't throw a UserException if too many alt alleles are trying to be genotyped. Instead, I've added an argument that allows the user to set the max number of alt alleles to genotype and the UG warns and skips any sites with more than that number.	2011-12-09 00:00:20 -05:00
Eric Banks	3e7714629f	Scrapped the whole idea of an int/long as an index into the ACset: with lots of alternate alleles we run into overflow issues. Instead, simply use the ACcounts array as the hash key since it is unique for each AC conformation. To do this, it needed to be wrapped inside an object so hashcode() would work.	2011-12-08 23:50:54 -05:00
Eric Banks	4aebe99445	Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation.	2011-12-08 15:31:02 -05:00
Eric Banks	7750bafb12	Fixed bug where last dependent set index wasn't properly being transferred for sites with many alleles. Adding debugging output.	2011-12-08 13:50:50 -05:00
Guillermo del Angel	252e0f3d0a	Merged bug fix from Stable into Unstable	2011-12-08 13:11:39 -05:00
Guillermo del Angel	1bfe28067f	Don't try to genotype an indel even bigger than the reference window size, or else we'll be out of bounds. Necessary to handle Phase 1 integrated callset with large deletions. Better error indication when validating a GenomeLoc.	2011-12-08 12:54:08 -05:00
Mark DePristo	9def841275	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-07 13:36:16 -05:00
Mark DePristo	4055877708	Prints 0.0 TiTv not NaN when there are no variants -- Updated md5	2011-12-07 12:07:54 -05:00
Matt Hanna	15533e08df	Fixed issue with RODWalker parallelization. Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making each ROD shard larger than the size of chr20. Why didn't we see this in Stable? Because the ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification. When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part of the async I/O project, I inadvertently reenabled this specifier.	2011-12-07 11:55:42 -05:00
Mark DePristo	5d2212bc8e	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-07 09:03:17 -05:00
Mark DePristo	6bf18899df	Fix for variant summary -- now treats all 50 bp deletions or insertions as CNVs	2011-12-07 09:02:49 -05:00
Matt Hanna	c9b2cd8ba5	Fix for chartl's stale null representation issue.	2011-12-06 18:05:17 -05:00
Eric Banks	79d18dc078	Fixing indexing bug on the ACsets. Added unit tests for the Exact model code.	2011-12-06 16:17:18 -05:00
Matt Hanna	f5b977fc88	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-12-06 10:11:35 -05:00
Matt Hanna	4001c22a11	Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup.	2011-12-06 10:10:38 -05:00
Khalid Shakir	677bea0abd	Right aligning GATKReport numeric columns and updated MD5s in tests. PreQC parses file with spaces in sample names by using tabs only. PostQC allows passing the file names for the evals so that flanks can be evaled. BaseTest's network temp dir now adds the user name to the path so files aren't created in the root. HybridSelectionPipeline: - Updated to latest versions of reference data. - Refactored Picard parsing code replacing YAML.	2011-12-05 23:22:15 -05:00
Eric Banks	7a0f6feda4	Make sure that too many alternate alleles aren't being passed to the genotyper (10 for now) and exit with a UserError if there are.	2011-12-05 16:18:52 -05:00
Eric Banks	7fac4afab3	Fixed priors (now initialized upon engine startup in a multi-dimensional array) and cell coefficients (properly handles the generalized closed form representation for multiple alleles).	2011-12-05 15:57:25 -05:00
Eric Banks	a7cb941417	The posteriors vector is now 2 dimensional so that it supports multiple alleles (although the UG is still hard-coded to use only array[0] for now); the exact model now collapses probabilities for all conformations over a given AC into the posteriors array (in the appropriate dimension). Fixed a bug where the priors and posteriors were being passed in swapped.	2011-12-04 13:02:53 -05:00
Eric Banks	eab2b76c9b	Added loads of comments for future reference	2011-12-03 23:54:42 -05:00
Eric Banks	29662be3d7	Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them).	2011-12-03 23:12:04 -05:00
Eric Banks	71f793b71b	First partially working version of the multi-allelic version of the Exact AF calculation	2011-12-02 14:13:14 -05:00
David Roazen	d014c7faf9	Queue now properly escapes all shell arguments in generated shell scripts This has implications for both Qscript authors and CommandLineFunction authors. Qscript authors: You no longer need to (and in fact must not) manually escape String values to avoid interpretation by the shell when setting up Walker parameters. Queue will safely escape all of your Strings for you so that they'll be interpreted literally. Eg., Old way: filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"") New way: filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0") CommandLineFunction authors: If you're writing a one-off CommandLineFunction in a Qscript and don't really care about quoting issues, just keep doing things the direct, simple way: def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out) If you're writing a CommandLineFunction that will become part of Queue and will be used by other QScripts, however, it's advisable to do things the newer, safer way, ie.: When you construct your commandLine, you should do so ONLY using the API methods required(), optional(), conditional(), and repeat(). These will manage quoting and whitespace separation for you, so you shouldn't insert quotes/extraneous whitespace in your Strings. By default you get both (quoting and whitespace separation), but you can disable either of these via parameters. Eg., override def commandLine = super.commandLine + required("eff") + conditional(verbose, "-v") + optional("-c", config) + required("-i", "vcf") + required("-o", "vcf") + required(genomeVersion) + required(inVcf) + required(">", escape=false) + // This will be shell-interpreted required(outVcf) I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new system, so you'll get free shell escaping when you use those in Qscripts just like with walkers.	2011-12-01 18:13:44 -05:00
Mark DePristo	3060a4a15e	Support for list of known CNVs in VariantEval -- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument -- Genericizes loading for intervals into interval tree by chromosome -- GenomeLoc methods for reciprocal overlap detection, with unit tests	2011-11-30 17:05:16 -05:00
Matt Hanna	b65db6a854	First draft of a test script for I/O performance with the new asynchronous I/O processing. Also includes convenience parameters for specifying the IO/CPU threading balance outside of a tag. Will be killed when Queue gets better support for tagged arguments (hopefully soon).	2011-11-30 13:13:16 -05:00
Laurent Francioli	1d5d200790	Cleaned up unused import statements	2011-11-30 15:30:30 +01:00
Mark DePristo	28b286ad39	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-30 09:11:53 -05:00
Laurent Francioli	20bffe0430	Adapted for the new version of MendelianViolation	2011-11-30 14:46:38 +01:00
Laurent Francioli	1cb5e9e149	Removed outdated (and unused) -familyStr commandline argument	2011-11-30 14:45:04 +01:00
Laurent Francioli	9574be0394	Updated MendelianViolationEvaluator integration test	2011-11-30 14:44:15 +01:00
Laurent Francioli	f49dc5c067	Added functionality to get all children that have both parents (useful when trios are needed)	2011-11-30 14:43:37 +01:00
Laurent Francioli	a4606f9cfe	Merge branch 'MendelianViolation' Conflicts: public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java	2011-11-30 11:13:15 +01:00
Laurent Francioli	b279ae4ead	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-30 10:10:21 +01:00
Laurent Francioli	7d58db626e	Added MendelianViolationEvaluator integration test	2011-11-30 10:09:20 +01:00
Ryan Poplin	91413cf0d9	Merged bug fix from Stable into Unstable	2011-11-29 14:01:23 -05:00
Ryan Poplin	cb284eebde	Further updating VQSR tutorial wiki docs to reflect the bundle	2011-11-29 14:00:57 -05:00
Ryan Poplin	dcb889665d	Merged bug fix from Stable into Unstable	2011-11-29 09:58:49 -05:00
Ryan Poplin	447e9bff9e	Updating VQSR tutorial wiki docs to reflect the bundle	2011-11-29 09:57:45 -05:00
Ryan Poplin	110298322c	Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it.	2011-11-29 09:29:18 -05:00
Laurent Francioli	ab67011791	Corrected bug introduced in the last update and causing no families to be returned by getFamilies in case the samples were not specified	2011-11-29 11:18:15 +01:00
Eric Banks	d7d8b8e380	Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now.	2011-11-28 14:18:28 -05:00
Laurent Francioli	a09c01fcec	Removed walker argument FamilyStructure as this is now supported by the engine (ped file)	2011-11-28 17:18:11 +01:00
Laurent Francioli	795c99d693	Adapted MendelianViolation to the new ped family representation. Adapted all classes using MendelianViolation too. MendelianViolationEvaluator was added a number of useful metrics on allele transmission and MVs	2011-11-28 17:13:14 +01:00
Laurent Francioli	e877db8f42	Changed visibility of getSampleDB from protected to public as the sampleDB needs to be accessible from Annotators and Evaluators too.	2011-11-28 17:11:30 +01:00
Laurent Francioli	5c2595701c	Added a function to get families only for a given list of samples.	2011-11-28 17:10:33 +01:00
Mark DePristo	3c36428a20	Bug fix for TiTv calculation -- shouldn't be rounding	2011-11-28 10:20:34 -05:00
Eric Banks	436b4dc855	Updated docs	2011-11-28 08:59:48 -05:00
Laurent Francioli	b1dd632d5d	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable Conflicts: public/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java	2011-11-25 16:16:44 +01:00
Mark DePristo	e60272975a	Fix for changed MD5 in streaming VCF test	2011-11-23 19:01:33 -05:00
Mark DePristo	12f09d88f9	Removing references to SimpleMetricsByAC	2011-11-23 16:08:18 -05:00
Mark DePristo	e319079c32	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-23 13:02:11 -05:00
Mark DePristo	4107636144	VariantEval updates -- Performance optimizations -- Tables now are cleanly formatted (floats are %.2f printed) -- VariantSummary is a standard report now -- Removed CompEvalGenotypes (it didn't do anything) -- Deleted unused classes in GenotypeConcordance -- Updates integration tests as appropriate	2011-11-23 13:02:07 -05:00
David Roazen	e5b85f0a78	A toString() method for IntervalBindings Necessary since we're currently writing things like this to our VCF headers: intervals=[org.broadinstitute.sting.commandline.IntervalBinding@4ce66f56]	2011-11-23 11:56:12 -05:00
Mark DePristo	5a4856b82e	GATKReports now support a format field per column -- You can tell the table to format your object with "%.2f" for example.	2011-11-23 11:31:04 -05:00
Mark DePristo	c8bf7d2099	Check for null comment	2011-11-23 10:47:21 -05:00
Mark DePristo	6c2555885c	Caching getSimpleName() in VariantEval is a big performance improvement -- Removed the SimpleMetricsByAC table, as one should just use the AlleleCount Stratefication and the upcoming VariantSummary table	2011-11-23 08:34:05 -05:00
Guillermo del Angel	32adbd614f	Solve merge conflict	2011-11-22 22:48:46 -05:00
Guillermo del Angel	941f3784dc	Solve merge conflict	2011-11-22 22:48:03 -05:00
Guillermo del Angel	75d93e6335	Another corner condition fix: skip likelihood computation in case we cut so many bases there's no haplotype or read left	2011-11-22 22:46:12 -05:00
Mark DePristo	a3aef8fa53	Final performance optimization for GenotypesContext	2011-11-22 17:19:30 -05:00
Mark DePristo	990c02e4de	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-22 17:19:11 -05:00
Guillermo del Angel	38a90da92c	Fixed merge conflict to Unstable	2011-11-22 14:39:45 -05:00
Guillermo del Angel	32a77a8a56	Prevent out of bound error in case read span > reference context + indel length. Can happen in RNAseq reads with long N CIGAR operators in the middle.	2011-11-22 13:57:24 -05:00
Eric Banks	5821c11fad	For BAM and Reviewed errors we now check the error message to see if it's actually a 'too many open files' problem and, if so, we generate a User Error instead.	2011-11-22 10:50:22 -05:00
Mark DePristo	7087310373	Embarassing bug fixed	2011-11-22 10:16:36 -05:00
Mark DePristo	e484625594	GenotypesContext now updates cached data for add, set, replace operations when possible -- Involved separately managing the sample -> offset and sample sorted list operations. This should improve performance throughout the system	2011-11-22 08:40:48 -05:00
Mark DePristo	29ca24694a	UG now encoding NO_CALLs as ./. not ./.:.:4:0,0,0 A few updated UGs integration tests	2011-11-22 08:22:32 -05:00
Mark DePristo	2b51c01df4	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-21 19:16:06 -05:00
Mark DePristo	5443d3634a	Again, fixing the add call when we really mean replace -- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./. Eric will fix long-standing bug in QD observed from this change -- VFW MD5s restored to their old correct values. There was a bug in my implementation to caused the genotypes to not be parsed from the lazy output even through the header was incorrect.	2011-11-21 19:15:56 -05:00
Mauricio Carneiro	5ad3dfcd62	BugFix: byte overflow in SyntheticRead compressed base counts * fixed and added unit test	2011-11-21 17:11:50 -05:00
Mark DePristo	9ea7b70a02	Added decode method to LazyGenotypesContext -- AbstractVCFCodec calls this if the samples are not sorted. Previously called getGenotypes() which didn't actually trigger the decode	2011-11-21 16:21:23 -05:00
Mark DePristo	ab2efe3bd3	Reverting bad exact model changes	2011-11-21 16:14:40 -05:00
Eric Banks	44554b2bfd	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-21 15:01:45 -05:00
Eric Banks	022832bd74	Very bad use of the == operator with Strings was ensuring that validating GenomeLocs was very inefficient. This fix resulted in a significant speedup for a simple RodWalker.	2011-11-21 14:49:47 -05:00
Mark DePristo	1561af22af	Exact model code cleanup -- Fixed up code when fixing a bug detected by aggressive contracts in GenotypesContext.	2011-11-21 14:35:15 -05:00
Mark DePristo	2c501364b8	GenotypesContext no longer have immutability in constructor -- additional bug fixes throughout VariantContext and GenotypesContext objects	2011-11-21 14:34:31 -05:00
David Roazen	1296dd41be	Removing the legacy -L "interval1;interval2" syntax This syntax predates the ability to have multiple -L arguments, is inconsistent with the syntax of all other GATK arguments, requires quoting to avoid interpretation by the shell, and was causing problems in Queue. A UserException is now thrown if someone tries to use this syntax.	2011-11-21 13:18:53 -05:00
Mark DePristo	e467b8e1ae	More contracts on LazyGenotypesContext	2011-11-21 09:34:57 -05:00
Mark DePristo	2e9ecf639e	Generalized interface to LazyGenotypesContext -- Now you provide a LazyParsing object -- LazyGenotypesContext now knows nothing about the VCF parser itself. The parser holds all of the necessary data to parse the VCF genotypes when necessarily, and the LGC only has a pointer to this object -- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version -- Deleted VCFParser interface, as it was no longer necessary	2011-11-21 09:30:40 -05:00
Mark DePristo	f0ac588d32	Extensive unit test for GenotypeContextUnitTest -- Currently only tests base class. Adding subclass testing in a bit	2011-11-20 18:28:01 -05:00
Mark DePristo	bc44f6fd9e	Utility function Collection<Genotype> -> Collection<String>	2011-11-20 18:26:56 -05:00
Mark DePristo	9445326c6c	Genotype is Comparable via sampleName	2011-11-20 18:26:27 -05:00
Mark DePristo	f9e25081ab	Completed documented LazyGenotypesContext	2011-11-20 08:35:52 -05:00
Mark DePristo	9cb3fe3a59	Vastly better way of doing on-demand genotyping loading -- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading. -- This new class was replaced all of the old, complex functionality -- Better still, there were many cases were the genotypes were being loaded unnecessarily, resulting in efficiency. This was detected because some of the integration tests changed as the genotypes were no longer being parsing unnecessarily -- Misc. bug fixes throughout the system -- Bug fixes for PhaseByTransmission with new GenotypesContext	2011-11-20 08:23:09 -05:00
Mark DePristo	f392d330c3	Proper use of builder. Previous conversion attempt was flawed	2011-11-19 22:09:56 -05:00
Mark DePristo	7d09c0064b	Bug fixes and code cleanup throughout -- chromosomeCounts now takes builder as well, cleaning up a lot of code throughout the codebase.	2011-11-19 18:40:15 -05:00
Mark DePristo	707bd30b3f	Should have been @BeforeMethod	2011-11-19 16:10:09 -05:00
Mark DePristo	8f7eebbaaf	Bugfix for pError not being checked correctly in CommonInfo -- UnitTests to ensure correct behavior -- UnitTests to ensure correct behavior for pass filters vs. failed filters vs. unfiltered	2011-11-19 15:58:59 -05:00
Mark DePristo	b7b57ef39a	Updating MD5 to reflect canonical ordering of calculation -- We should no longer have md5s changing because of hashmaps changing their sort order on us -- Added GenotypeLikelihoodsUnitTests -- Refactored ExactAFCaclculation to put the PL -> QUAL calculation in the GenotypeLikelihoods class to avoid the code copy.	2011-11-19 15:57:33 -05:00
Mark DePristo	73119c8e3c	Merge with master -- A few bug fixes	2011-11-19 09:56:06 -05:00
Mark DePristo	f685fff79b	Killing the final versions of old new VariantContext interface	2011-11-18 21:32:43 -05:00
Mark DePristo	6cf315e17b	Change interface to getNegLog10PError to getLog10PError	2011-11-18 21:07:30 -05:00
Mark DePristo	c7f2d5c7c7	Final minor fix to contract	2011-11-18 19:40:05 -05:00
Mauricio Carneiro	b5de182014	isEmpty now checks if mReadBases is null Since newly created reads have mReadBases == null. This is an effort to centralize the place to check for empty GATKSAMRecords.	2011-11-18 18:34:05 -05:00
Mauricio Carneiro	8ab3ee9c65	Merge remote-tracking branch 'unstable/master' into rr	2011-11-18 16:50:25 -05:00
Mauricio Carneiro	333e5de812	returning read instead of GATKSAMRecord Do not create new GATKSAMRecord when read has been fully clipped, because it is essentially the same as returning the currently fully clipped read.	2011-11-18 16:49:59 -05:00
Matt Hanna	8bb4d4dca3	First pass of the asynchronous block loader. Block loads are only triggered on queue empty at this point. Disabled by default (enable with nt:io=?).	2011-11-18 15:02:59 -05:00
Mark DePristo	a2e79fbe8a	Fixes to contracts	2011-11-18 14:18:53 -05:00
Mark DePristo	660d6009a2	Documentation and contracts for GenotypesContext and VariantContextBuilder	2011-11-18 13:59:30 -05:00
Mark DePristo	f54afc19b4	VariantContextBuilder -- New approach to making VariantContexts modeled on StringBuilder -- No more modify routines -- use VariantContextBuilder -- Renamed isPolymorphic to isPolymorphicInSamples. Same for mono -- getChromosomeCount -> getCalledChrCount -- Walkers changed to use new VariantContext. Some deprecated new VariantContext calls remain -- VCFCodec now uses optimized cached information to create GenotypesContext.	2011-11-18 12:39:10 -05:00
Eric Banks	6459784351	Merged bug fix from Stable into Unstable	2011-11-18 12:34:57 -05:00
Eric Banks	c62082ba1b	Making this class public again as per request from Cancer folks	2011-11-18 12:34:27 -05:00
Eric Banks	8710673a97	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-18 12:29:33 -05:00
Eric Banks	768b27322b	I figured out why we were getting tons of hom var genotype calls with Mauricio's low quality (synthetic) reduced reads: the RR implementation in the UG was not capping the base quality by the mapping quality, so all the low quality reads were used to generate GLs. Fixed.	2011-11-18 12:29:15 -05:00
Mark DePristo	7490dbb6eb	First version of VariantContextBuilder	2011-11-18 11:06:15 -05:00
Roger Zurawicki	f48d4cfa79	Bug fix: fully clipping GATKSAMRecords and flushing ops Reads that are emptied after clipping become new GATKSAMRecords. When applying ClippingOps, the ops are cleared after the clipping	2011-11-18 00:24:39 -05:00
Mark DePristo	fa454c88bb	UnitTests for VariantContext for chrCount, getSampleNames, Order function -- Major change to how chromosomeCounts is computed. Now NO_CALL alleles are always excluded. So ChromosomeCounts(A/.) is 1, the previous result would have been 2. -- Naming changes for getSamplesNameInOrder()	2011-11-17 20:37:22 -05:00
Mark DePristo	02f22cc9f8	No more VC integration tests. All tests are now unit tests	2011-11-17 15:33:09 -05:00
Mark DePristo	23359d1c6c	Bugfix for pruneVariantContext, which was dropping the ref base for padding	2011-11-17 15:32:52 -05:00
Mark DePristo	473b860312	Major determinism fix for UG and RankSumTest -- Now these routines all iterate in sample name order (genotypes.iterateInSampleNameOrder) so that the results of UG and the annotator do not depend on the particular order of samples we see for the exact model and the RankSumTest	2011-11-17 15:31:45 -05:00
Khalid Shakir	c50274e02e	During flanking interval creation merging overlapping flanks so that on scatter the list doesn't accidentally genotype the same site twice. Moved flanking interval utilies to IntervalUtils with UnitTests.	2011-11-17 13:56:42 -05:00
Eric Banks	bad19779b9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-17 13:29:43 -05:00
Eric Banks	16a021992b	Updated header description for the INFO and FORMAT DP fields to be more accurate.	2011-11-17 13:17:53 -05:00
Eric Banks	e7d41d8d33	Minor cleanup	2011-11-17 12:00:28 -05:00
Mark DePristo	7e66677769	Expanded UnitTests for VariantContext Tests for -- getGenotype and getGenotypes -- subContextBySample -- modify routines	2011-11-16 20:45:15 -05:00
Mauricio Carneiro	72f00e2883	Merging Roger's Unit tests for Reduce Reads from RR repository	2011-11-16 17:26:49 -05:00
Mark DePristo	aa0610ea92	GenotypeCollection renamed to GenotypesContext	2011-11-16 16:24:05 -05:00
Mark DePristo	974daaca4d	V13 version in archive. Can you pulled out wholesale for performance testing	2011-11-16 16:08:46 -05:00
Mark DePristo	caf6080402	Better algorithm for merging genotypes in CombineVariants	2011-11-16 15:17:33 -05:00
Mark DePristo	101ffc4dfd	Expanded, contrastive VariantContextBenchmark -- Compares performance across a bunch of common operations with GATK 1.3 version of VariantContext and GATK 1.4 -- 1.3 VC and associated utilities copied wholesale into test directory under v13	2011-11-16 13:35:16 -05:00
Mark DePristo	e56d52006a	Continuing bugfixes to get new VC working	2011-11-16 10:39:17 -05:00
Matt Hanna	eb8e031f75	Merged bug fix from Stable into Unstable	2011-11-16 09:57:37 -05:00
Matt Hanna	6a5d5e7ac9	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/stable	2011-11-16 09:57:13 -05:00
Matt Hanna	7ac5cf8430	Getting rid of unsupported CountReadPairs walker in stable. Removal of remainder of pairs processing framework to follow in unstable.	2011-11-16 09:53:59 -05:00
Eric Banks	c2ebe58712	Merge remote-tracking branch 'Laurent/master'	2011-11-16 09:34:47 -05:00
Laurent Francioli	0dc3d20d58	Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type	2011-11-16 09:33:13 +01:00
Laurent Francioli	7d77fc51f5	Corrected bug causing PhaseByTransmission to crash in case of new Genotype.Type	2011-11-16 03:32:43 -05:00
David Roazen	0d163e3f52	SnpEff 2.0.4 support -Modified the SnpEff parser to work with the SnpEff 2.0.4 VCF output format -Assigning functional classes and effect impacts now handled directly by SnpEff rather than the GATK -Removed support for SnpEff 2.0.2, as we no longer trust the output of that version since it doesn't exclude effects associated with certain nonsensical transcripts. These effects are excluded as of 2.0.4. -Updated unit and integration tests This support is based on a release-candidate of SnpEff 2.0.4, and so is subject to change between now and the next GATK release.	2011-11-15 18:36:22 -05:00
Mark DePristo	df415da4ab	More bug fixes on the way to passing all tests	2011-11-15 17:38:12 -05:00
Mark DePristo	0be23aae4e	Bugfixes on way to a working refactored VariantContext	2011-11-15 17:20:14 -05:00
Mark DePristo	231c47c039	Bugfixes on way to a working refactored VariantContext	2011-11-15 16:42:50 -05:00
Laurent Francioli	fb685f88ec	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-15 16:23:53 -05:00
Mark DePristo	2b2514dad2	Moved many unused phasing walkers and utilities to archive	2011-11-15 16:14:50 -05:00
Mark DePristo	460a51f473	ID field now stored in the VariantContext itself, not the attributes	2011-11-15 14:56:33 -05:00
Eric Banks	7fada320a9	The right fix for this test is just to delete it.	2011-11-15 14:53:27 -05:00
Eric Banks	b45d10e6f1	The DP in the FORMAT field (per sample) must also use the representative count or else it's always 1 for reduced reads.	2011-11-15 10:23:59 -05:00
Mark DePristo	233e581828	Merging in Master	2011-11-15 09:28:24 -05:00
Eric Banks	b66556f4a0	Update error message so that it's clear ReadPair Walkers are exceptions	2011-11-15 09:22:57 -05:00
Mark DePristo	6e1a86bc3e	Bug fixes to VariantContext and GenotypeCollection	2011-11-15 09:21:30 -05:00
Roger Zurawicki	284430d61d	Added more basic UnitTests for ReadClipper hardClipByReadCoordinatesWorks hardClipLowQualTailsWorks	2011-11-15 00:13:52 -05:00
Roger Zurawicki	8e91e19229	Merge branch 'master' of ssh://nickel/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-15 00:13:37 -05:00
Mauricio Carneiro	cde829899d	compress Reduce Read counts bytes by offset compressed the representation of the reduce reads counts by offset results in 17% average compression in final BAM file size. Example compression --> from : 10, 10, 11, 11, 12, 12, 12, 11, 10 to: 10, 0, 1, 1,2, 2, 2, 1, 0	2011-11-14 18:30:24 -05:00
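The example in commit cde829899d (10, 10, 11, 11, 12, 12, 12, 11, 10 becoming 10, 0, 1, 1, 2, 2, 2, 1, 0) is consistent with keeping the first count unchanged and storing every later count as its offset from that first value. The sketch below illustrates that reading only; the class and method names (`OffsetCounts`, `encode`, `decode`) are hypothetical and are not taken from the GATK source.

```java
import java.util.Arrays;

public class OffsetCounts {
    // Hypothetical helper: keep the first count as-is and store each later
    // count as its difference from the first count.
    static byte[] encode(byte[] counts) {
        byte[] out = new byte[counts.length];
        if (counts.length == 0) {
            return out;
        }
        out[0] = counts[0];
        for (int i = 1; i < counts.length; i++) {
            out[i] = (byte) (counts[i] - counts[0]);
        }
        return out;
    }

    // Inverse of encode: add the first value back to every later offset.
    static byte[] decode(byte[] encoded) {
        byte[] out = new byte[encoded.length];
        if (encoded.length == 0) {
            return out;
        }
        out[0] = encoded[0];
        for (int i = 1; i < encoded.length; i++) {
            out[i] = (byte) (encoded[i] + encoded[0]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] counts = {10, 10, 11, 11, 12, 12, 12, 11, 10};
        byte[] encoded = encode(counts);
        System.out.println(Arrays.toString(encoded)); // [10, 0, 1, 1, 2, 2, 2, 1, 0]
        System.out.println(Arrays.equals(decode(encoded), counts)); // true
    }
}
```

Under this reading, neighbouring counts that stay close to the first value turn into small, repetitive numbers, which is the kind of data a general-purpose compressor handles well.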
Mark DePristo	4ff8225d78	GenotypeMap -> GenotypeCollection part 3 -- Test code actually builds	2011-11-14 17:51:41 -05:00
Mark DePristo	f0234ab67f	GenotypeMap -> GenotypeCollection part 2 -- Code actually builds	2011-11-14 17:42:55 -05:00
David Roazen	ab0ee9b847	Perform only necessary validation in VariantContext modify methods	2011-11-14 16:49:59 -05:00
Mark DePristo	2e9d5363e7	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-14 15:32:06 -05:00
Mark DePristo	1fbdcb4f43	GenotypeMap -> GenotypeCollection	2011-11-14 15:32:03 -05:00
Eric Banks	4dc9dbe890	One quick fix to previous commit	2011-11-14 14:42:12 -05:00
Eric Banks	7b2a7cfbe7	Transfer headers from the resource VCF when possible when using expressions. While there, VA was modified so that it didn't assume that the ID field was present in the VC's info map in preparation for Mark's upcoming changes.	2011-11-14 14:31:27 -05:00
Mark DePristo	9b5c79b49d	Renamed InferredGeneticContext to CommonInfo -- I have no idea why I named this InferredGeneticContext, a totally meaningless term -- Renamed to CommonInfo. -- Made package protected, as no one should use this outside of VariantContext and Genotype -- UGEngine was using IGC constant, but it's now using the public one in VariantContext.	2011-11-14 14:28:52 -05:00
Mark DePristo	077397cb4b	Deleted MutableVariantContext -- All methods that used this capable now use VariantContext directly instead	2011-11-14 14:19:06 -05:00
Mark DePristo	b11c535527	Deleted MutableGenotype -- This class wasn't really used anywhere, and so removed to control code bloat.	2011-11-14 13:16:36 -05:00
Mark DePristo	79987d685c	GenotypeMap contains a Map, not extends it -- On path to replacing it with GenotypeCollection	2011-11-14 12:55:03 -05:00
Eric Banks	7aee80cd3b	Fix to deal with reduced reads containing a deletion	2011-11-14 12:23:46 -05:00
Eric Banks	3d2970453b	Misc minor cleanup	2011-11-14 09:41:54 -05:00
Laurent Francioli	1347beef40	Merge branch 'PhaseByTransmission'	2011-11-14 11:31:28 +01:00
Laurent Francioli	6881d4800c	Added Integration tests for Phasing by Transmission	2011-11-14 10:47:51 +01:00
Laurent Francioli	34acf8b978	Added Unit tests for new methods in GenotypeLikelihoods	2011-11-14 10:47:02 +01:00
Roger Zurawicki	1202a809cb	Added Basic Unit Tests for ReadClipper Tests some but not all functions Some tests have been disabled because they are not working	2011-11-13 22:27:49 -05:00
Eric Banks	b7c33116af	Minor docs update	2011-11-12 23:21:07 -05:00
Eric Banks	76d357be40	Updating docs example to use -L since that's best practice	2011-11-12 23:20:05 -05:00
Mark DePristo	fee9b367e4	VariantContext genotypes are now stored as GenotypeMap objects -- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples -- All instances in the gatk that used Map<String, Genotype> now use GenotypeMap type. -- Amazingly, there were many places where HashMap<String, Genotype> is used, so that the order of the genotypes is technically undefined and could be dangerous. Now everything uses GenotypeMap with a specific ordering of samples (by name) -- Integrationtests updated and all pass	2011-11-11 15:00:35 -05:00
Guillermo del Angel	cd3146f4cf	Add hidden option to ValidationAmplicons to output slightly modified format to make file work with downstream SQNM tools more seamlessly at request of GAP: one line per record, keep probe identifier to 20 characters, no * in ref allele.	2011-11-11 14:07:07 -05:00
Ryan Poplin	40fbeafa37	VQSR will now detect if the negative model failed to converge properly because of having too few data points and automatically retry with more appropriate clustering parameters.	2011-11-11 11:52:30 -05:00
Mark DePristo	4938569b3a	More general handling of parameters for VariantContextBenchmark	2011-11-11 10:22:19 -05:00

1882 Commits (f9f8589692fece0185a7e8e059b75ee4672d1c8d)