gatk-3.8

Author	SHA1	Message	Date
Christopher Hartl	673ceadd11	While this fix worked for the evaluator module, it could potentially have bad effects in the phasing walkers. Special-case nocalls in the PhasingEvaluator and return AllelePair to previous state.	2012-01-26 13:06:36 -05:00
Christopher Hartl	9c6fda7e15	Yup. I was right.	2012-01-26 12:54:11 -05:00
Christopher Hartl	7d059540a4	Allow segments of genome to be excluded in generating a reference panel. Occasionally targets would contain no variation (typically, in the middle of the centromere), which beagle doesn't particularly like, and errors out rather than producing empty output files. The best way to deal with these is to just exclude the regions on a second-pass, and the remaining bits will be gathered with no additional work. AllelePair is being mean and not telling me what genotype it sees when it finds a non-diploid genotype, but I suspect it's a no-call (".") rather than a no call ("./.").	2012-01-26 12:43:52 -05:00
Ryan Poplin	25532bdc37	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-26 11:43:32 -05:00
Ryan Poplin	390d493049	Updating ActiveRegionWalker interface to output a probability of active status instead of a boolean. Integrator runs a band-pass filter over this probability to produce actual active regions. First version of HaplotypeCaller which decides for itself where to trigger and assembles those regions.	2012-01-26 11:37:08 -05:00
Eric Banks	859dd882c9	Don't make it standard for now	2012-01-26 00:38:16 -05:00
Eric Banks	c5e81be978	Adding pairwise AF table. Not polished at all, but usable nonetheless.	2012-01-26 00:37:06 -05:00
Eric Banks	702a2d768f	Initial version of multi-allelic summary module in VariantEval	2012-01-25 19:42:55 -05:00
Eric Banks	9a60887567	Lost an import in the merge	2012-01-25 19:41:41 -05:00
Eric Banks	cba5f1a8b1	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-25 19:19:03 -05:00
Eric Banks	ddaf51a50f	Updated one integration test for indels	2012-01-25 19:18:51 -05:00
Eric Banks	add6918f32	Cleaner, more efficient way of determining the last dependent set in the queue.	2012-01-25 16:21:10 -05:00
Menachem Fromer	db645a94ca	Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES	2012-01-25 16:10:59 -05:00
Eric Banks	ef335a5812	Better implementation of the fix; PL index is now traversed in order.	2012-01-25 15:15:42 -05:00
Eric Banks	8e2d372ab0	Use remove instead of setting the value to null	2012-01-25 14:41:34 -05:00
Eric Banks	05816955aa	It was possible that we'd clean up a matrix column too early when a dependent column aborted early (with not enough probability mass) because we weren't being smart about the order in which we created dependencies. Fixed.	2012-01-25 14:28:21 -05:00
Eric Banks	2799a1b686	Catch exception for bad type and throw as a TribbleException	2012-01-25 12:15:51 -05:00
Eric Banks	96b62daff3	Minor tweak to the warning message.	2012-01-25 11:55:33 -05:00
Eric Banks	fb863dc6a7	Warn user when trying to run with EMIT_ALL_SITES with indels; better docs for that option.	2012-01-25 11:50:12 -05:00
Eric Banks	e349b4b14b	Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod.	2012-01-25 11:35:54 -05:00
Eric Banks	ea3d4d60f2	This annotation requires rods and should be annotated as such	2012-01-25 11:35:13 -05:00
Ryan Poplin	bbefe4a272	Added option to be able to write out the active regions to an interval list file	2012-01-25 09:47:06 -05:00
Ryan Poplin	9818c69df6	Can now specify active regions to process at the command line, mainly for debugging purposes	2012-01-25 09:32:52 -05:00
Mauricio Carneiro	97499529c7	another small bug with the file extension.	2012-01-24 16:14:35 -05:00
Mauricio Carneiro	ffd61f4c1c	Refactor the Pileup Element with regards to indels Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion. * Refactored PileupElement to distinguish clearly between deletions and read starting with insertion * Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups * Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements. * Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions. * Every deletion now has an offset (start of the event) * Fixed bug when LocusIteratorByState found a read starting with insertion that happened to be a reduced read. * Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine). * Pileup depth of coverage for a deleted base will now return the average coverage around the deletion. * Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's * The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan! phew!	2012-01-24 16:07:21 -05:00
Matt Hanna	c312bd5960	Weirdly, PicardException inherits from SAMException, which means that our specialty code for reporting malformed BAMs was actually misreporting any error that happened in the Picard layer as a BAM ERROR. Specifically changing PicardException to report as a ReviewedStingException; we might want to change it in the future. I'll followup with the Picard team to make sure they really, really want PicardException to inherit from SAMException.	2012-01-24 15:30:04 -05:00
Mauricio Carneiro	7c7ca0d799	fixing bug with fastq extension * PPP only recognized .fasta and .fq, failing when the user provided a .fastq file. Fixed.	2012-01-24 11:02:15 -05:00
Mark DePristo	0a3172a9f1	Fix for ref 0 bases for Chris -- Disturbingly, fixing this bug doesn't actually cause any test failures. -- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly. -- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt	2012-01-24 10:55:09 -05:00
Mauricio Carneiro	945cf03889	IntelliJ ate my import!	2012-01-23 21:46:45 -05:00
Mauricio Carneiro	2bb9525e7f	Don't set base qualities if fastQ is provided * Pacbio Processing pipeline now works with the new fastQ files outputted by the Pacbio instrument	2012-01-23 17:57:29 -05:00
Khalid Shakir	c18beadbdb	Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc. Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.	2012-01-23 16:17:04 -05:00
Mark DePristo	02450e4b12	Merged bug fix from Stable into Unstable	2012-01-23 12:08:39 -05:00
Christopher Hartl	798596257b	Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module.	2012-01-23 10:50:16 -05:00
Mark DePristo	80a4ce0edf	Bugfix for incorrect error messages for missing BAMs and VCFs -- Missing BAMs were appearing as StingExceptions -- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions -- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions -- Added path to standard b37 BAM to BaseTest -- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.	2012-01-23 09:52:07 -05:00
Guillermo del Angel	31d2f04368	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-23 09:23:03 -05:00
Guillermo del Angel	966387ca0b	Next intermediate commit in the pool caller. Lots of bug fixes and now we can emit true vcf's with calls in discovery mode (still of unknown quality) - old validation mode is temporarily broken, will be fixed in next refactoring.	2012-01-23 09:22:31 -05:00
Christopher Hartl	4a08e8ca6e	Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down either to the accounting of genotypes changed (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./. .) This is somewhat of an arbitrary decision, and is negotiable. I could see treating GT:PL ./.:. differently from GT:PL .:0,3,6 but am not sure the worth of doing so.	2012-01-23 08:25:34 -05:00
Ryan Poplin	4d6312d4ea	HaplotypeCaller is now an ActiveRegionWalker.	2012-01-22 14:31:01 -05:00
Christopher Hartl	3b1aad4f17	After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call.	2012-01-20 23:43:51 -05:00
Christopher Hartl	9b4f6afa21	Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles appear to be set to null, so this should not affect the actual phasing in the output VCF.	2012-01-20 23:07:59 -05:00
Ryan Poplin	4b18786b5d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-19 22:05:20 -05:00
Ryan Poplin	ace9333068	Active region walkers can now see the reads in a buffer around their active regions. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal.	2012-01-19 22:05:08 -05:00
Menachem Fromer	066da80a3d	Added KEEP_UNCONDTIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output	2012-01-19 18:19:58 -05:00
Christopher Hartl	7f3ad25b01	Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF. T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.	2012-01-19 10:54:48 -05:00
Ryan Poplin	7e082c7750	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-19 09:11:23 -05:00
Christopher Hartl	39e6df5aa9	Fix edge case for very small VCFs	2012-01-19 00:51:28 -05:00
Christopher Hartl	1e037a0ecf	Ensure second-to-last line printed	2012-01-19 00:33:08 -05:00
Christopher Hartl	9946853039	Remove duplicated line	2012-01-19 00:25:22 -05:00
Christopher Hartl	cf9b1d350a	Some minor changes to in-process functions that nobody else uses. CGL now properly ignores no-calls for external VCFs.	2012-01-19 00:20:49 -05:00
Eric Banks	ab8f499bc3	Annotate with FS even for filtered sites	2012-01-18 22:04:51 -05:00
Guillermo del Angel	b123416c4c	Resolve stale merge changes	2012-01-18 20:56:36 -05:00
Guillermo del Angel	2eb45340e1	Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine optimal alt allele at each computable site and will compute AC distribution on it. Current implementation is not working yet if there's more than one pool and it will only output biallelic sites, no functionality for true multi-allelics yet	2012-01-18 20:54:10 -05:00
Ryan Poplin	0133d1a901	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-18 09:53:42 -05:00
Ryan Poplin	0268da7560	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-18 09:53:00 -05:00
Ryan Poplin	60024e0d7b	updating TDT integration test	2012-01-18 09:52:50 -05:00
David Roazen	b7c65cb089	Merged bug fix from Stable into Unstable	2012-01-18 09:52:47 -05:00
Ryan Poplin	11982b5a34	We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN.	2012-01-18 09:42:41 -05:00
Mark DePristo	763c81d520	No longer enforce MAX_ALLELE_SIZE in VCF codec -- Instead issue a warning when a large (>1MB) record is encountered -- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()	2012-01-18 07:35:11 -05:00
Mark DePristo	0c7865fdb5	UnitTest for reverseAlleleClipping -- No code modified yet, just implementing a unit test to ensure correctness of the existing code	2012-01-18 07:35:11 -05:00
David Roazen	d5199db8ec	Be explicit about setting the snpEff -onlyCoding option in the pipeline When run without an explicit -onlyCoding option, as we've been doing up to now, snpEff automatically sets -onlyCoding to "true" provided that there is at least one transcript marked as "protein_coding", which will always be the case for us in practice (and indeed, all pipeline runs so far with snpEff 2.0.5 have run with -onlyCoding auto-set to "true"). However, given the disastrous effect on annotation quality setting "-onlyCoding false" has, we wish to be explicit with this option rather than relying on snpEff's auto-detection logic.	2012-01-17 20:04:27 -05:00
Mark DePristo	62801e430a	Bugfix for unnecessary optimization -- don't cache the ref bytes	2012-01-17 16:40:26 -05:00
Mark DePristo	f2b0575dee	Detect unreasonably large allele strings (>2^16) and throw an error -- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places. -- Tribble was updated so we actually could read the line properly (rev. to 51 here). -- Still the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.	2012-01-17 16:40:26 -05:00
Ryan Poplin	8b0ddf0aaf	Adding notes to CountCovariates docs about using interval lists as database of known variation	2012-01-17 16:13:13 -05:00
Ryan Poplin	56761297dd	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-17 15:03:32 -05:00
Ryan Poplin	75f87db468	Replacing Mills file with new gold standard indel set in the resource bundle for release with v1.5	2012-01-17 15:02:45 -05:00
Matt Hanna	40ebc17437	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-17 14:49:17 -05:00
Matt Hanna	41d70abe4e	At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings.	2012-01-17 14:47:53 -05:00
Ryan Poplin	ae259f81cc	Bug fixing for merging of read fragments when one fragment contained an indel	2012-01-17 14:39:27 -05:00
Christopher Hartl	cde224746f	Bait Redesign supports baits that overlap, by picking only the start of intervals. CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used. Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.	2012-01-17 13:51:05 -05:00
Ryan Poplin	8e23c98dd9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-17 13:46:28 -05:00
Matt Hanna	32ccde374b	Merged bug fix from Stable into Unstable	2012-01-17 11:08:35 -05:00
Matt Hanna	3ba918aff1	Error message cleanup in BAM indexing code.	2012-01-17 11:05:42 -05:00
Mauricio Carneiro	cec7107762	Better location for the downsampling of reads in PrintReads * using the filter() instead of map() makes for a cleaner walker. * renaming the unit tests to make more sense with the other unit and integration tests	2012-01-14 14:06:09 -05:00
Mark DePristo	b06074d6e7	Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe -- Uses PriorityBlockingQueue instead of PriorityQueue -- synchronized keywords added to all key functions that modify internal state Note that this hasn't been tested extensively. Based on report: http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic	2012-01-13 09:33:16 -05:00
Mauricio Carneiro	28aa353501	Added "unbiased" downsampling parameter to PrintReads * also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.	2012-01-12 16:33:55 -05:00
Matt Hanna	2c3176eb80	Merged bug fix from Stable into Unstable	2012-01-12 13:31:10 -05:00
Matt Hanna	cd43f016ce	Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior.	2012-01-12 13:29:11 -05:00
Eric Banks	ed34b4f088	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-12 10:27:26 -05:00
Eric Banks	e7fe9910f7	Create the temp storage for calculating cell values just once as per Mark's TODO	2012-01-12 10:27:10 -05:00
Eric Banks	f5f5ed5dcd	Don't initialize the cell conformation values (use an else in the loop instead) as per Mark's TODO	2012-01-12 08:50:03 -05:00
Eric Banks	410a340ef5	Swapping the iteration order to run over AF conformations and then samples instead of the reverse minimizes calls to HashMap.get; instead of it being O(n) since we called it for each sample it's now O(1). Runtime on T2D GENES test set is reduced by 5-10%. More optimizations to follow.	2012-01-12 02:04:03 -05:00
Mauricio Carneiro	77a03c9709	Patching special case in the adaptor clipping * if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair. * updated md5's accordingly	2012-01-11 17:47:44 -05:00
Khalid Shakir	aae61767c6	queueJobReport now compresses PDF when running R 2.13+. Updated PostCallingQC.scala's VE and R to include missense to silent ratio and plot.	2012-01-10 17:32:30 -05:00
Khalid Shakir	a9a6516527	Merged bug fix from Stable into Unstable	2012-01-10 16:16:10 -05:00
Khalid Shakir	ef50e77ee2	When running Queue jobs locally, merge the stderr to the stdout log if the error file is NOT specified. Updated VE strats in the HSP for plotting Ka/Ks by AC.	2012-01-10 16:10:25 -05:00
Eric Banks	3475bfafd3	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-10 12:39:15 -05:00
Mauricio Carneiro	5bf960deb8	adding dbsnp to indel VQSR	2012-01-10 12:38:49 -05:00
Eric Banks	25d0d53d88	Moving the approximate summing of log10 vals to MathUtils; keeping the more efficient implementation of fast rounding.	2012-01-10 12:38:47 -05:00
Eric Banks	589397d611	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-01-10 12:36:48 -05:00
Eric Banks	c5320ef1af	Resolving changes in integration test during merge	2012-01-10 12:14:16 -05:00
Matt Hanna	e923a2e512	Revving Picard to incorporate final version of ReadWalker performance improvements.	2012-01-10 12:12:33 -05:00
Eric Banks	0f36f6947e	Resolving merge conflicts	2012-01-10 11:44:16 -05:00
Eric Banks	f2cecce10f	Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously).	2012-01-10 11:34:23 -05:00
Matt Hanna	509c3d87b0	Merged bug fix from Stable into Unstable	2012-01-09 23:08:46 -05:00
Matt Hanna	dc60757b68	Eliminate unnecessary strong references (and therefore memory held) by tree reduce entries that have already been processed. Thanks to Tim Fennell for the bug report.	2012-01-09 23:04:53 -05:00
Mauricio Carneiro	6f2abd76df	Updating the MDCP with the new indel gold standard from Ryan.	2012-01-09 15:31:18 -05:00
Matt Hanna	fda1795791	Merged bug fix from Stable into Unstable	2012-01-08 22:04:44 -05:00
Matt Hanna	1f1233b669	Fix for a rare but insidious bug in position tracking during async BAM file reading. Thanks to Khalid for spotting and reporting the issue.	2012-01-08 22:03:35 -05:00
Khalid Shakir	5793625592	No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice. QScript accessor to QSettings to specify a default runName and other default function settings. Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs. Gathered log files concatenate all log files together into the stdout. InProcessFunctions now have PrintStreams for stdout and stderr. Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml. During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file. In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope. Added more detailed output when running with -l DEBUG. Simplified graphviz visualization for additional debugging. Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List) Minor cleanup to build including sending ant gsalib to R's default libloc.	2012-01-08 12:11:55 -05:00
Mauricio Carneiro	f6a18aea63	Updated MDCP with INDEL best practices * chose 90.0 indel cut target for most datasets (this is arbitrary).	2012-01-06 17:21:59 -05:00

1785 Commits (edb4edc08fb0dea2aeea61afcfe4fd39faa7ada1)