Commit Graph

1486 Commits (7c7ca0d799256c484fa65ded9e1f2c04585c761f)

Author SHA1 Message Date
Mark DePristo 0a3172a9f1 Fix for ref 0 bases for Chris
-- Disturbingly, fixing this bug doesn't actually cause an test failures.
-- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly.
-- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt
2012-01-24 10:55:09 -05:00
Khalid Shakir c18beadbdb Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Mark DePristo 02450e4b12 Merged bug fix from Stable into Unstable 2012-01-23 12:08:39 -05:00
Christopher Hartl 798596257b Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module. 2012-01-23 10:50:16 -05:00
Mark DePristo 80a4ce0edf Bugfix for incorrect error messages for missing BAMs and VCFs
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
Guillermo del Angel 31d2f04368 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-23 09:23:03 -05:00
Guillermo del Angel 966387ca0b Next intermediate commit in the pool caller. Lots of bug fixes and now we can emit true vcf's with calls in discovery mode (still of unknown quality) - old validation mode is temporarily broken,will be fixed in next refactoring. 2012-01-23 09:22:31 -05:00
Christopher Hartl 4a08e8ca6e Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down either to the accounting of genotypes changed (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./. .)
This is somewhat of an arbitrary decision, and is negotiable. I could see treating

GT:PL   ./.:.

differently from

GT:PL   .:0,3,6

but am not sure the worth of doing so.
2012-01-23 08:25:34 -05:00
Ryan Poplin 4d6312d4ea HaplotypeCaller is now an ActiveRegionWalker. 2012-01-22 14:31:01 -05:00
Christopher Hartl 3b1aad4f17 After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call. 2012-01-20 23:43:51 -05:00
Christopher Hartl 9b4f6afa21 Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles apear to be set to null, so this should not affect the actual phasing in the output VCF. 2012-01-20 23:07:59 -05:00
Ryan Poplin 4b18786b5d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-19 22:05:20 -05:00
Ryan Poplin ace9333068 Active region walkers can now see the reads in a buffer around thier active reigons. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal. 2012-01-19 22:05:08 -05:00
Menachem Fromer 066da80a3d Added KEEP_UNCONDTIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output 2012-01-19 18:19:58 -05:00
Christopher Hartl 7f3ad25b01 Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF.
T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.
2012-01-19 10:54:48 -05:00
Ryan Poplin 7e082c7750 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-19 09:11:23 -05:00
Eric Banks ab8f499bc3 Annotate with FS even for filtered sites 2012-01-18 22:04:51 -05:00
Guillermo del Angel b123416c4c Resolve stale merge changes 2012-01-18 20:56:36 -05:00
Guillermo del Angel 2eb45340e1 Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine optimal alt allele at each computable site and will compute AC distribution on it. Current implementation is not working yet if there's more than one pool and it will only output biallelic sites, no functionality for true multi-allelics yet 2012-01-18 20:54:10 -05:00
Ryan Poplin 0268da7560 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-18 09:53:00 -05:00
Ryan Poplin 60024e0d7b updating TDT integration test 2012-01-18 09:52:50 -05:00
Ryan Poplin 11982b5a34 We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN. 2012-01-18 09:42:41 -05:00
Mark DePristo 763c81d520 No longer enforce MAX_ALLELE_SIZE in VCF codec
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()
2012-01-18 07:35:11 -05:00
Mark DePristo 0c7865fdb5 UnitTest for reverseAlleleClipping
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mark DePristo 62801e430a Bugfix for unnecessary optimization
-- don't cache the ref bytes
2012-01-17 16:40:26 -05:00
Mark DePristo f2b0575dee Detect unreasonably large allele strings (>2^16) and throw an error
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele.  Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
2012-01-17 16:40:26 -05:00
Ryan Poplin 8b0ddf0aaf Adding notes to CountCovariates docs about using interval lists as database of known variation 2012-01-17 16:13:13 -05:00
Matt Hanna 40ebc17437 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-17 14:49:17 -05:00
Matt Hanna 41d70abe4e At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings. 2012-01-17 14:47:53 -05:00
Ryan Poplin ae259f81cc Bug fixing for merging of read fragments when one fragment contained an indel 2012-01-17 14:39:27 -05:00
Christopher Hartl cde224746f Bait Redesign supports baits that overlap, by picking only the start of intervals.
CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used.

Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.
2012-01-17 13:51:05 -05:00
Ryan Poplin 8e23c98dd9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-17 13:46:28 -05:00
Matt Hanna 32ccde374b Merged bug fix from Stable into Unstable 2012-01-17 11:08:35 -05:00
Matt Hanna 3ba918aff1 Error message cleanup in BAM indexing code. 2012-01-17 11:05:42 -05:00
Mauricio Carneiro cec7107762 Better location for the downsampling of reads in PrintReads
* using the filter() instead of map() makes for a cleaner walker.
   * renaming the unit tests to make more sense with the other unit and integration tests
2012-01-14 14:06:09 -05:00
Mark DePristo b06074d6e7 Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe
-- Uses PriorityBlockingQueue instead of PriorityQueue
-- synchronized keywords added to all key functions that modify internal state

Note that this hasn't been tested extensivesly.  Based on report:

http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic
2012-01-13 09:33:16 -05:00
Mauricio Carneiro 28aa353501 Added "unbiased" downsampling parameter to PrintReads
* also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.
2012-01-12 16:33:55 -05:00
Matt Hanna 2c3176eb80 Merged bug fix from Stable into Unstable 2012-01-12 13:31:10 -05:00
Matt Hanna cd43f016ce Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior. 2012-01-12 13:29:11 -05:00
Eric Banks ed34b4f088 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-12 10:27:26 -05:00
Eric Banks e7fe9910f7 Create the temp storage for calculating cell values just once as per Mark's TODO 2012-01-12 10:27:10 -05:00
Eric Banks f5f5ed5dcd Don't initialize the cell conformation values (use an else in the loop instead) as per Mark's TODO 2012-01-12 08:50:03 -05:00
Eric Banks 410a340ef5 Swapping the iteration order to run over AF conformations and then samples instead of the reverse minimizes calls to HashMap.get; instead of it being O(n) since we called it for each sample it's now O(1). Runtime on T2D GENES test set is reduced by 5-10%. More optimizations to follow. 2012-01-12 02:04:03 -05:00
Mauricio Carneiro 77a03c9709 Patching special case in the adaptor clipping
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
   * updated md5's accordingly
2012-01-11 17:47:44 -05:00
Eric Banks 25d0d53d88 Moving the approximate summing of log10 vals to MathUtils; keeping the more efficient implementation of fast rounding. 2012-01-10 12:38:47 -05:00
Eric Banks 589397d611 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-10 12:36:48 -05:00
Eric Banks c5320ef1af Resolving changes in integration test during merge 2012-01-10 12:14:16 -05:00
Matt Hanna e923a2e512 Revving Picard to incorporate final version of ReadWalker performance improvements. 2012-01-10 12:12:33 -05:00
Eric Banks 0f36f6947e Resolving merge conflicts 2012-01-10 11:44:16 -05:00
Eric Banks f2cecce10f Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously). 2012-01-10 11:34:23 -05:00