gatk-3.8

Commit Graph

Author	SHA1	Message	Date
aaron	7916ab0ed5	remove the index each run git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4976 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-12 17:38:22 +00:00
carneiro	5e9a8f9cb3	Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base. Adding the first version of the techdev pipeline (tdPipeline) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 22:25:08 +00:00
aaron	cba436fa2f	small fix for the table codec; if you see a header line, you know you've finished parsing the header. Also also some changes to return the ref ordered data pool test to using MappedStreamSegment instead of EntireStream git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4942 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 21:20:26 +00:00
hanna	0982d35f5b	Bug fixes in streaming in Tribble data via /dev/stdin. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4935 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 02:43:04 +00:00
rpoplin	23dbc5ccf3	HaplotypeScore is revamped. It now uses reads' Cigar strings when building the haplotype blocks to skip over soft-clipped bases and factor in insertions and deletions. The statistic now uses only the reads from the filtered context to build the haplotypes but it scores all reads against the two best haplotypes. The score is now computed individually for each sample's reads and then averaged together. Bug fixes throughout. The math for the base quality and mapping quality rank sum tests is fixed. The annotations remain as ExperimentalAnnotations pending more investigation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4934 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 00:28:05 +00:00
hanna	8d2c14b29c	Update Picard / sam-jdk at Tim's request. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-03 02:17:25 +00:00
hanna	3fc9862964	Unit test fixed - Tribble codecs aren't designed to be stateless, but I was using one as though it was. Fixed, and debug code reverted. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4917 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-31 17:47:52 +00:00
hanna	b9cb57f4b9	A unit test is failing on bamboo in a way I can't reproduce (or even explain). Checking in some debugging info. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4916 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-31 16:35:04 +00:00
hanna	cba18116e4	A significant refactoring of the ROD system, done largely to simplify the process of streaming/piping VCFs into the GATK. Notable changes: - Public interface to RMDTrackBuilder is greatly simplified; users can use it only to build RMDTracks and lookup codecs. - RODDataSource and RMDTrack are no longer functionally at the same level; RODDataSources now manage RMDTracks on behalf of the GATK, and the only direct consumers of the RMDTrack class are the walkers that feel the need to access the ROD system directly. (We need to stamp out this access pattern. A few minor warts were introduced as part of this process, labeled with TODOs. These'll be fixed as part of the VCF streaming project. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4915 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-31 04:52:22 +00:00
ebanks	848977678d	No reason to convert the GLs to a String for formatting when they're just going to be converted to PLs later. That was 5% of the UG runtime... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4913 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-29 22:06:19 +00:00
ebanks	8a0c07b865	Support for indels in hapmap. This was non-trivial because not only does hapmap not tell you whether the allele is an insertion or deletion, but it also has a completely different positioning strategy (rightmost base). I'll send out an email tomorrow when the new HapMap3.3 VCF is ready. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4908 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-27 07:37:46 +00:00
hanna	e313eeede8	Push command-line expansions, such as BAM list unpacking and -B tag parsing, out into the CommandLine* classes. This makes it easier for external functionality (such as the VCF streamer) to use GenomeAnalysisEngine directly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4897 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-22 19:00:17 +00:00
depristo	5dd0e8388b	Fixed a bug in UnitTest git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4867 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-17 19:44:35 +00:00
ebanks	5c0b66cb7c	3 big changes that all kill the integration tests: 1. Don't cap the PLs by 255 anymore. 2. Move over to the 3state model as the only available base model for UG (no more base transition tables). 3. New QD implementation when GLs/PLs are available. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4846 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-15 16:24:28 +00:00
chartl	5a27d231fa	Rename it so that nobody else falls into the trap laid out (the test is VariantToTable, the walker is Variant[s]ToTable) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4844 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-15 11:43:00 +00:00
chartl	5e27e9162f	Huh? I thought we parsed out comma-separated command line arguments into list automatically...just change the syntax of the integration test, no need to update the md5 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4843 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-15 11:40:27 +00:00
kshakir	01323447c6	Removed LibBat.SUB2_BSUB_BLOCK since the use of it exits the JVM. Fixed integration tests to wait on their own for the job to run instead of using SUB2_BSUB_BLOCK. Updated VariantRecalibrationIntegrationTests MD5s which were knocked out of sync whele SUB2_BSUB_BLOCK was exiting in the middle of integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4840 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-14 19:57:20 +00:00
depristo	abd6ce1c77	A TiTv-free approach for cutting variants! Apparently much better than previous approach, and will work for indels and SV will truly minor modifications to the code. Will discuss with methods group on Monday. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4822 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-11 23:08:13 +00:00
depristo	5b46a900b3	Final version of BAQ calculation. default gap open is 1e-4, a good sensitive value. Useful timer class SimpleTimer added. BAQ is now live. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4818 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-10 19:35:12 +00:00
ebanks	f1f01610f8	Remove the extra trailing tab at the end of the VCF ## header line. Unfortunately, this meant updating every freaking integration test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4806 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-08 17:22:29 +00:00
depristo	c91712bd59	BAQ calculation refactoring in the GATK. Single -baq argument can be NONE, CALCULATE_AS_NECESSARY, and RECALCULATE. Walkers can control bia the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the qualities scores, or by only returning the baq-capped qualities scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker. SAMFileWriterStub now supports BAQ writing as an internal feature. Several walkers have the @BAQMode applied to this, with parameters that I think are reasonable. Please look if you own these walkers, though git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4798 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-06 20:55:52 +00:00
depristo	5d2c2bd280	Just refactoring into utils/baq directory git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4795 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-06 17:43:43 +00:00
depristo	44feb4a362	Improved BAQ implementation. Now supports adding BAQ tags to reads on the fly with ADD_TAG_ONLY option. Caching fasta reader implementation, and changes throughout the system to enable this. Many performance improvements throughout the system due to better reference access patterns. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4792 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-05 18:29:39 +00:00
depristo	a5b3aac864	Engine-level BAQ calculation now available in the GATK [totally experimental right now]. -baq argument to disable (NONE), to only use the tags in the BAM (USE_TAG_ONLY), use the tag when present but calculate on the fly as necessary (CALCULATE_AS_NECESSARY), and to always recalculate (RECALCULATE_ALWAYS). BAQ.java contains the complete implementation, for those interested. ValidateBAQWalker is a useful QC tool for verifying the BAQ is correct. BAQSamIterator applies BAQ to reads, as needed, in the engine. Let me know if you encounter any problems. Before prime-time, needs a caching implementation of IndexedFastaReader to avoid loading lots of reference data all of the time git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4787 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-04 20:23:06 +00:00
fromer	92cf7744a6	Set minMQ = max(minMQ, minBQ) for phasing since anyway we cap BQ by MQ; also, lowered MIN_BASE_QUALITY_SCORE for phasing to 17 (was previously 20) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4781 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-03 16:31:13 +00:00
rpoplin	e5282742f9	Bug fix in CountCovariates, skip over indel records as well as SNPs in the dbsnp file. CountCovariates is now called CountCovariatesWalker. I've always hated that the name was swapped. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4774 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-02 18:43:24 +00:00
rpoplin	0adf505b53	We no longer look at by-hapmap validation status in the VQSR because using the HapMap VCF file is higher quality. As a side effect we now support the dbsnp 132 vcf file. ApplyVariantCuts now requires that the input VCF rod bindings begin with input, matching the other VQSR walkers. Wiki updated with information about how to obtain the hapmap and 1kg truth sets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4772 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-02 15:38:45 +00:00
fromer	b4ef716aaf	As per Eric and Mark's suggestions, separated the segregating MNP merger (MergeMNPs) from the more general merger employed for annotation purposes (MergeSegregatingAlternateAlleles). Both use the same core MergePhasedSegregatingAlternateAllelesVCFWriter git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4763 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-01 16:42:08 +00:00
rpoplin	af84462f3e	The dev team has decided to change the filter that is added to records that are set to monomorphic by Beagle. It no longer lists the reference allele. Added those filters to the header of the output VCF file. Finally, we no longer use R2=NaN values coming from Beagle. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4757 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-30 17:19:54 +00:00
ebanks	d89e17ec8c	Fare thee well, UGv1. Here come the days UGv2. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4747 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-29 21:51:19 +00:00
ebanks	e3e6d176df	Looking over the daily error log email made me realize that there were 2 implementations of vc.modifyLocation() - the correct one in VC that didn't require lazy loading the genotype data and the bad one in VCUtils that did. Removing the implementation in VCUtils and updating the code accordingly. Also, removing createPotentiallyInvalidGenomeLoc() since no one uses it anymore. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4736 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-26 18:40:34 +00:00
ebanks	6934f83cc7	Two changes to CombineVariants. 1. Fix: VCs were padded before the merge, but they were never unpadded afterwards. This leaves us with a VC that doesn't meet our spec. 2. Update: instead of running the merged VC through every standard annotation (which seems really wrong, since this isn't the annotator tool), just update the chromosome count annotations (AC,AF,AN) through VCUtils. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4734 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-25 04:52:12 +00:00
rpoplin	ed08899abc	Overwhelming evidence that maxQ = 50 is now a better default than maxQ = 40 in the base quality score recalibrator, especially when combined with dbsnp build 132. Also, added option in ProduceBeagleInputWalker for Beagle-ing chromosome X calls with male samples which sets the genotype likelihood for the AB allele to zero for those samples. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4731 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-24 21:32:26 +00:00
bthomas	374c0deba2	Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-19 19:59:05 +00:00
depristo	721e8cb679	VariantsToTable now supports wildcard captures. -F PREFIX* now captures all fields that begin with PREFIX, output as a comma-separated list of unique values. Added integration test for VariantsToTable since I find it so useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4706 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-18 18:54:59 +00:00
hanna	90711d445c	Change the interface for RMDTrackBuilder, therefore always mandating the specification of a sequence dictionary and related info. This will hopefully eliminate the cases in which the refseq track depends a sequence dictionary / contig parser that hasn't been specified. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4700 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-17 19:00:17 +00:00
depristo	d86ab2becb	JEXL expressions now generate exceptions, not warnings. Tools should catch the runtime exception to handle correctly. Removed unncessary complexity from the JEXL contexts git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4695 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-17 16:08:16 +00:00
kshakir	01b721ab61	Passing ReviewedStingExceptions through the HMS. Added a @Hidden experimental argument -validate to VariantEval that allows external JEXL assertions that must evaluate to true will throw an exception. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4692 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-16 21:50:42 +00:00
hanna	24ec35deaf	- Reintroduce test dependency so that the tests passing / failing is not dependent on the contents of the integrationtest directory. Will figure out how to better manage the integrationtest directory at some point in the future. - Up the max heap size for tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4691 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-16 19:55:20 +00:00
depristo	ef2f6d90d2	VQSR now operates on LOD scores in the INFO field directly, and doesn't adjust the QUAL field. New format for tranches file uses LOD score. Old file format no longer supported. log10sumlog10() function, a very useful utility in MathUtils. No more ExtendedPileupElement! Robust math calculations in GMM so that no infinities are generated! HaplotypeScore refactored to enable use of filtered context. Not yet enabled... InferredContext getDouble and getInteger arguments now parse values from Strings if necessary git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4684 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-15 22:19:22 +00:00
hanna	5b83942cee	- Fix DepthOfCoverage so that, when it abuses the ROD system by instantiating a track in onTraversalDone, it also supplies the correct sequence dictionary and parser. - Changed RMDTrackBuilder to use SequenceDictionaryUtils.validateDictionaries for ref <-> ROD sequence dictionary validation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4683 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-15 20:34:04 +00:00
kshakir	2fd816ac5f	Updated ordering of integration tests. GVC > VR > AVC git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4669 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-14 06:33:28 +00:00
depristo	44d0cb6cde	New version of cutting routines for VQSR. Old code removed. Working unit tests. Best practice with testng integration test (everyone look at it). Walker test now allows you to not specify no. input files, if it can infer input counts from MD5s git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4664 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-13 16:19:56 +00:00
kshakir	62a106ca5a	Disabled VariantGaussianMixtureModelUnitTest git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4663 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-13 03:53:33 +00:00
depristo	42acc968b1	Unit tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4660 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 20:09:39 +00:00
ebanks	b51762c279	When you commit code late at night you tend to make careless mistakes... like forgetting to update integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4658 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 14:41:10 +00:00
depristo	988da428ae	Bug fix for old style tranches file. ApplyVariantCuts moved over, and passes integration tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4657 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 14:38:26 +00:00
depristo	c5f8c4dd0d	VariantEval test for tranches file, plus cutting over VE to use the generic Tranches framework git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4656 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 13:52:40 +00:00
ebanks	69de3e51bf	Better precision for the calculated AF value. Now looks at the total number of samples to determine how much precision is necessary. Also, changing default min BQ used for calling in UGv2 to Q17. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4655 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 08:31:40 +00:00
hanna	8e36a07bea	Convert GenomeLocParser into an instance variable. This change is required for anything that needs to be simultaneously aware of multiple references, eg Queue's interval sharding code, liftover support, distributed GATK etc. GenomeLocParser instances must now be used to create/parse GenomeLocs. GenomeLocParser instances are available in walkers by calling either -getToolkit().getGenomeLocParser() or -refContext.getGenomeLocParser() This is an intermediate change; GenomeLocParser will eventually be merged with the reference, but we're not clear exactly how to do that yet. This will become clearer when contig aliasing is implemented. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-10 17:59:50 +00:00
depristo	5ef4b234d8	Updates for broken integration tests. Counting annotations (AC, AF) now work correctly for AC = 0 sites git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4640 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-09 19:43:43 +00:00
chartl	42e9987e69	Bug fix to GenotypeConcordance. AC metrics get instantiated based on number of eval samples; if Comp has more samples, we can see AC indeces outside the bounds of the array. Bug fix to LiftoverVariants - no barfing at reference sites. AlleleFrequencyComparison - local changes added to make sure parsing works properly Added HammingDistance annotation. Mostly useless. But only mostly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4622 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-03 19:23:03 +00:00
hanna	861ee3e37a	Changing testing framework from junit -> testng, for its enhanced configurability. Initial test to see how Bamboo will respond. More detailed email to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4609 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-01 21:31:44 +00:00
ebanks	1c056ea791	Users can now use VariantAnnotator to add annotations from one VCF to another. For example, if you want to annotate your target VCF with the AC field value from the rod bound to CEU1kg, you can specify -E CEU1kg.AC and records will be annotated with CEU1kg.AC=N when a record exists in that rod at the given position. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4598 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-29 16:38:31 +00:00
hanna	2f8057bf24	Cleanup for multithreading memory leak during integration tests...unregister MXBean at end of traversal to avoid holding a reference to the microscheduler, which holds a reference to the engine, which in turn holds a reference to the walker, which itself holds a reference to all the data aggregated during the course of the traversal. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4594 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-28 18:37:42 +00:00
hanna	4c23b1fe9c	Get rid of the static cache of ArgumentTypeDescriptors by making them an integral part of the parsing engine. Hugely lowers our memory footprint in integrationtests, but not yet enough to run Mark's new parallelized VariantEvalIntegrationTests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4585 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-27 19:44:55 +00:00
hanna	04e38929f0	Disabling parallelized version of VE integration tests. Still slow, but not deadlocking any more. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4580 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-27 02:47:03 +00:00
fromer	a7af1a164b	Updated MNP merging to merge VC records if any sample has a haplotype of ALT-ALT, since this could possibly change annotations. Note that, besides the "interesting" case of an ALT-ALT MNP in a pair of HET sites, this could even occur if two records are hom-var (irrespective of using phasing). Note also that this procedure may generate more than one ALT allele. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4577 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-27 01:50:36 +00:00
depristo	b085648141	Parallelized VariantEval. Refactored output to support parallel output style. Minor improvements to testing framework to enable easy executeTestParallel to run -nt 1 and -nt 4 by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4574 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-26 20:21:38 +00:00
fromer	c357ec775a	Trivially phases any hom site (since it is always correct to continue the previous haplotypes by appending the same allele onto both haplotypes) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4568 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-25 16:58:41 +00:00
fromer	9ba7269728	Fixed Integration Tests to output VCF files with -NO_HEADER git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4548 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-21 19:49:44 +00:00
kshakir	b88cfd2939	Updated MD5s of VCFs, since the approximate command line arguments injected into the VCF headers now have a little more order to them thanks to changes in the ParsingEngine. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4538 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-21 03:07:40 +00:00
fromer	f76865abbc	ReadBackedPhasing now uses a SortedVCFWriter to simplify, and has the ability to merge phased SNPs into MNPs on the fly [turned off by default]; MergeSegregatingPolymorphismsWalker can also do this as a post-processing step; Integration tests for MergeSegregatingPolymorphismsWalker were also added git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4534 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-20 20:27:10 +00:00
ebanks	7a291a8ff3	First pass at a VCF validator. Will test more tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4524 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-19 19:55:49 +00:00
chartl	2bc5971ca1	Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it. Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. IMPORTANT I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC of being one class, but recalculating it casts to another class, and I'm not sure what to do. I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 03:18:01 +00:00
ebanks	7aa030a9a4	Hmm. Apparently variants can get lifted over to different chromosomes. Who knew? Reverting changes from a couple of days ago. The only way to do this correctly (without requiring lots of memory) is to turn off on-the-fly indexing for this walker. Integration tests cover this now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4510 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 02:54:12 +00:00
ebanks	954dd84f51	Adding an integration test (against hg18 this time) that requires on-the-fly sorting in order to work properly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4500 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 07:45:21 +00:00
ebanks	9f54170dff	Hooking up the liftover tool to the new on-the-fly sorting VCF writer so that records can now get emitted in order. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4499 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 07:27:01 +00:00
chartl	7c9ef59d65	This is simultaneously a minor and major change to VariantEval, so take heed: The core walker has been modified so that when variant contexts (eval and comp) are subset to command-line-specified sample(s), the chromosome count annotations (AC/AN/AF) are altered to reflect the AC/AN/AF of only those samples involved in the comparison. No more getting AC500 when you're comparing a 10-sample overlap. Interestingly enough, this didn't break any integration tests. GenotypeConcordance now has two additional tables: Allele Count Statistics, and Allele Count Summary Statistics. These work exactly identically to the Sample Statistics and Sample Summary Statistics tables, except that the partition being used is no longer the sample, but instead the allele count of the variant sites. These tables stratify by both eval and comp ACs, e.g. evalAC0 evalAC1 evalAC2 compAC0 compAC1 compAC2 Differences with previous integration tests were verified to only be in the Allele Count tables (by grepping them out of the diff); a new test has been added for the simple case of an AC=1 site in the eval becoming an AC=2 site in the comp. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4491 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-13 22:26:15 +00:00
hanna	83b8676b69	Hack to fix mysterious disappearing read attributes. Ultimately caused by the fact that the GATKSAMRecord, by design, needs to both inherit from SAMRecord and wrap a 'member' SAMRecord, and method calls that aren't implemented as explicit passthroughs can compromise the content of the SAMRecord in subtle ways. Will be automatically fixed when Picard moves to a lightweight SAMRecord interface rather than the current heavyweight implementation. But in the short-term, there's no obvious fix. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4489 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-13 19:06:54 +00:00
aaron	272ac2ae4a	more fixes for tests broken by indexing-on-the-fly; I think this should do it. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4486 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-13 01:54:32 +00:00
aaron	ff0df1a2da	A fix for an integration test that was broken by on-the-fly indexing. Also, better reporting of Tribble exceptions in GATK integration tests. Trying to get the tests back up and running... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4483 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-12 18:39:56 +00:00
kiran	f348ca2976	Now processes VCF files with repeated loci without crashing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4481 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-12 04:36:07 +00:00
depristo	116309b3c3	More test cases for UG integration test. We currently fail doing multi-threaded gzip output, FYI git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4472 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-08 20:22:12 +00:00
depristo	38a67fed63	High performance version of standard vcf writer. New general static Tribble class for common constants, including general .idx constant and functions to get standard index name for a given file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4471 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-08 19:53:21 +00:00
fromer	bdd3a9752e	Changed min MQ and BQ to 20 (for phasing) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4469 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-08 19:27:45 +00:00
chartl	21ec44339d	Somewhat major update. Changes: - ProduceBeagleInputWalker + Now takes a validation ROD and a prior to give it, will use those genotypes in place of the variant genotypes if both are present + Takes a bootstrap argument -- can use some given %age of the validation sites + Optionally takes a bootstrap output argument -- re-prints the validation VCF, filtering those sites used as part of the bootstrap -BeagleOutputToVCFWalker + Now filters sites where the genotypes have been reverted to hom ref + Now calls in to the new VCUtils to calculate AC/AN -Queue + New pipeline libraries for easy qscript creation, still a work in progress, but this is a considerable prototype + full calling pipeline v2 uses the above libraries + minor changes to some of my own scripts + no more need for contig interval lists, these will be parsed out of your normal interval list when it is provided git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4459 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-08 13:30:28 +00:00
depristo	0a2e76e9dc	2nd step towards on the fly indexing. Also fixed parsing bug for headers with < symbols git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4454 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-07 21:38:46 +00:00
rpoplin	7bb9704592	Update the BeagleOutputToVCF integration test because of removing the source header line. Source headers are provided by the engine for all VCF files now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4453 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-07 19:55:57 +00:00
rpoplin	0de658534d	Removed the qScale arguments in VariantRecalibrator. It is smarter about how it tries to find a cut so the arbitrary scale factor hopefully is no longer necessary. Now the recalibrated variant quality score more accurately reflects our believed lod of the call. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4451 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-07 18:04:57 +00:00
fromer	ee00dcb79d	1. Phasing now ignores bases without minimum base quality (BQ) and minimum mapping quality (MQ); 2. The probability of a non-called base is now divided by 3, to evenly split up the error probability over the non-called bases git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4450 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-07 17:40:59 +00:00
ebanks	6205910f9f	updating integration test for Sarah Calvo git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4449 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-07 04:03:37 +00:00
fromer	652a3e8de5	Added integration tests for ReadBackedPhasing git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4446 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-06 20:50:32 +00:00
rpoplin	69485d6a7a	Added command line argument for the max value of the allele count prior in VariantRecalibrator (--max_ac_prior). Default value increased to 0.99 from 0.95. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4436 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-06 14:00:53 +00:00
ebanks	b5e148140b	Officially fixed the UG priors; updated the default min MQ/BQs to pipeline values of q20 and min calling threshold to Q50 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4431 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-05 18:35:36 +00:00
ebanks	6448753cf7	Removed the SequenomValidationConvertor and renamed it VariantValidationAssessor since it no longer handles ped/sequenom files (but instead works on vcfs/variantcontexts). Updated all of the wiki docs, including adding instructions on how to convert ped files to vcf, a la Shaun Purcell. We now officially no longer support ped files everyone. Other misc cleanup in the code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4419 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-04 02:11:38 +00:00
hanna	4ea73bcfb1	Basic unit tests for WalkerManager. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4394 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-30 19:27:41 +00:00
delangel	e80742e72f	Use -o as argument for output file in ProduceBeagleInputWalker, to be consistent with other walkers (you're welcome, chartl :)). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4386 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-29 22:46:39 +00:00
rpoplin	a6c7de95c8	By using the AC info field instead of parsing the genotypes we cut 78% off the runtime of VariantRecalibrator. There is a new argument to force the parsing of genotypes if necessary. Various other optimizations throughout. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4383 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-29 18:56:50 +00:00
chartl	862c94c8ce	Small change for Matt -- output partition types in lexicographic order. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4365 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-28 20:08:03 +00:00
bthomas	96cccafb0d	Adding a few helper methods for accessing sample metadata, and associated unit tests. These are motivated by discussion with Ryan about how he'll use sample metadata in VariantEvalwalker - hopefully will make it easier for him. Methods are: -- getToolkit().subContextFromSampleProperty(): filters a VariantContext to genotypes that come from samples that have a given property value -- getToolkit().getSamplesWithProperty(): gets all samples with a given property -- getToolkit().getSamplesFromVariantContext(): sample objects that are referenced by name in a VariantContext git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4361 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-28 02:16:25 +00:00
kshakir	edaa278edd	Removed cases where various toolkit functions were accessing GenomeAnalysisEngine.instance. This will allow other programs like Queue to reuse the functionality. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-25 02:49:30 +00:00
depristo	745b8cc6d3	GATK now detects and UserExceptions when human lexicographically sorted data is provided git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4343 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-24 15:19:48 +00:00
rpoplin	1931b2e1bd	Three fixes for VariantFiltrationWalker: Trying to filter an empty VCF file will produce a well-formed VCF file with zero records instead of a blank file, needed for pipelines. The first record's genotype info fields are now in the same order as all the others. The VCF header lines are pulled from just the input variant rod instead of from all rods. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4341 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-24 13:52:56 +00:00
kshakir	4ed9f437e9	Sliced the GAE in half like a gordian knot to avoid the constant merge conflicts. The GAE half has all the walker specific code. The new "Abstract" GAE has the rest of the logic. More refactoring to come, with the end goal of having a tool that other java analysis programs (Queue, etc.) can use to read in genomic data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4339 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-23 23:28:55 +00:00
hanna	8f75d88519	Fix for GATK run report ids: mOVsxGfDiiSMxVs2PPTVjzYTVbizlD6e f9kUHUADFsZ0LiTGxRL5zPmq9kZcA4cQ 8eGHWJFAlBVmgxwPi3sMd1RmiN2PwHOf iLhvHWveypKb2F8vKS5irHylc3pYvlOb HDttXKUMEVoPrvVeWrH7E0htxYyNydMx plus a bit of cleanup of custom exceptions in the sharding system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4330 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-22 19:49:25 +00:00
hanna	fb5d595ef0	Disable VCF header output in the Beagle integrationtest. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4327 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-22 16:50:03 +00:00
hanna	0c99c97685	The engine now automatically adds the command-line arguments to the header of every VCF, unless -NO_HEADER is specified. Changed integration tests, adding the -NO_HEADER argument, for walkers that previously did not include the command-line arg headers. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4326 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-22 15:27:58 +00:00
aaron	1af9ca6d45	enabling tests that now pass with the conitg length validation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4325 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-21 22:20:50 +00:00
aaron	3938d53738	one broken build short of the hat trick. Fixing the unix test which expects the sequence dictionary of the Tribble track to equal the reference; we actually return the sequence dictionary of the track iself, with each contig set to the length of the sequence dictionary contig entry. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4322 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-21 18:47:20 +00:00

1 2 3 4 5 ...

747 Commits (dc62685a2f3aa5e40f42e4f8a9a67a6007c811f8)