Mark DePristo
0a3172a9f1
Fix for ref 0 bases for Chris
...
-- Disturbingly, fixing this bug doesn't actually cause any test failures.
-- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly.
-- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt
2012-01-24 10:55:09 -05:00
Khalid Shakir
c18beadbdb
Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
...
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Mark DePristo
02450e4b12
Merged bug fix from Stable into Unstable
2012-01-23 12:08:39 -05:00
Christopher Hartl
798596257b
Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module.
2012-01-23 10:50:16 -05:00
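The bug fixed in 798596257b is the usual Java overload-versus-override trap. A minimal sketch follows; the class names and parameter lists are hypothetical stand-ins, not the actual VariantEvaluator signatures.

```java
// Hypothetical stand-ins: the real VariantEvaluator signatures are not shown in this log.
class BaseEvaluator {
    String update2(Object eval, Object comp) {
        return "base";
    }
}

class BrokenPhasingEvaluator extends BaseEvaluator {
    // Different parameter list: this is an overload, not an override,
    // so calls made through the base signature never reach it.
    String update2(Object eval, Object comp, Object extra) {
        return "module";
    }
}

class FixedPhasingEvaluator extends BaseEvaluator {
    // Same parameter list plus @Override: the compiler now checks the match.
    @Override
    String update2(Object eval, Object comp) {
        return "module";
    }
}
```

Marking the module's method with @Override would have turned the silent mismatch into a compile error.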
Mark DePristo
80a4ce0edf
Bugfix for incorrect error messages for missing BAMs and VCFs
...
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
Guillermo del Angel
31d2f04368
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-23 09:23:03 -05:00
Guillermo del Angel
966387ca0b
Next intermediate commit in the pool caller. Lots of bug fixes, and now we can emit true VCFs with calls in discovery mode (still of unknown quality). Old validation mode is temporarily broken and will be fixed in the next refactoring.
2012-01-23 09:22:31 -05:00
Christopher Hartl
4a08e8ca6e
Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down to the accounting of changed genotypes (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./.).
...
This is somewhat of an arbitrary decision, and is negotiable. I could see treating
GT:PL ./.:.
differently from
GT:PL .:0,3,6
but am not sure it is worth doing so.
2012-01-23 08:25:34 -05:00
Ryan Poplin
4d6312d4ea
HaplotypeCaller is now an ActiveRegionWalker.
2012-01-22 14:31:01 -05:00
Christopher Hartl
3b1aad4f17
After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call.
2012-01-20 23:43:51 -05:00
Christopher Hartl
9b4f6afa21
Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles appear to be set to null, so this should not affect the actual phasing in the output VCF.
2012-01-20 23:07:59 -05:00
Ryan Poplin
4b18786b5d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-19 22:05:20 -05:00
Ryan Poplin
ace9333068
Active region walkers can now see the reads in a buffer around their active regions. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal.
2012-01-19 22:05:08 -05:00
Menachem Fromer
066da80a3d
Added KEEP_UNCONDITIONAL option, which permits even sites with only filtered records to be included as unfiltered sites in the output
2012-01-19 18:19:58 -05:00
Christopher Hartl
7f3ad25b01
Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF.
...
T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.
2012-01-19 10:54:48 -05:00
Ryan Poplin
7e082c7750
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-19 09:11:23 -05:00
Eric Banks
ab8f499bc3
Annotate with FS even for filtered sites
2012-01-18 22:04:51 -05:00
Guillermo del Angel
b123416c4c
Resolve stale merge changes
2012-01-18 20:56:36 -05:00
Guillermo del Angel
2eb45340e1
Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine the optimal alt allele at each computable site and will compute the AC distribution on it. Current implementation does not work yet if there's more than one pool, and it will only output biallelic sites; no functionality for true multi-allelics yet.
2012-01-18 20:54:10 -05:00
Ryan Poplin
0268da7560
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-18 09:53:00 -05:00
Ryan Poplin
60024e0d7b
updating TDT integration test
2012-01-18 09:52:50 -05:00
Ryan Poplin
11982b5a34
We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN.
2012-01-18 09:42:41 -05:00
Mark DePristo
763c81d520
No longer enforce MAX_ALLELE_SIZE in VCF codec
...
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()
2012-01-18 07:35:11 -05:00
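The ref.getBytes()[i] => (byte)ref.charAt(i) change in 763c81d520 can be illustrated with a simplified suffix-matching loop. This is a sketch only: sharedSuffixLength is a hypothetical stand-in for computeReverseClipping(), whose real logic is not reproduced in this log.

```java
final class ReverseClipSketch {
    // Counts how many trailing bases are shared by ref and every allele.
    // Simplified, hypothetical stand-in for computeReverseClipping().
    static int sharedSuffixLength(String ref, String[] alleles) {
        int clipping = 0;
        while (true) {
            int refPos = ref.length() - clipping - 1;
            if (refPos < 0) {
                return clipping;
            }
            // charAt is O(1); ref.getBytes()[refPos] would allocate and
            // fill a fresh byte[] for the whole string on every pass.
            byte refBase = (byte) ref.charAt(refPos);
            for (String allele : alleles) {
                int pos = allele.length() - clipping - 1;
                if (pos < 0 || (byte) allele.charAt(pos) != refBase) {
                    return clipping;
                }
            }
            clipping++;
        }
    }
}
```

With a multi-megabase reference allele, the getBytes() form makes the loop quadratic in the allele length, which is why the cast matters here.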
Mark DePristo
0c7865fdb5
UnitTest for reverseAlleleClipping
...
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mark DePristo
62801e430a
Bugfix for unnecessary optimization
...
-- don't cache the ref bytes
2012-01-17 16:40:26 -05:00
Mark DePristo
f2b0575dee
Detect unreasonably large allele strings (>2^16) and throw an error
...
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
2012-01-17 16:40:26 -05:00
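The size guard described in f2b0575dee amounts to a length check before any further parsing. A minimal sketch, assuming an IllegalArgumentException and a hypothetical validate helper (the exception type actually thrown by the codec is not stated in this log):

```java
final class AlleleSizeCheck {
    // 2^16 bp, the limit named in the commit message.
    static final int MAX_ALLELE_SIZE = 1 << 16;

    // Hypothetical helper: rejects alleles longer than the limit with a
    // message that says what was wrong, instead of hanging downstream.
    static void validate(String allele) {
        if (allele.length() > MAX_ALLELE_SIZE) {
            throw new IllegalArgumentException(
                "Allele of length " + allele.length()
                + " exceeds the maximum of " + MAX_ALLELE_SIZE + " bp");
        }
    }
}
```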
Ryan Poplin
8b0ddf0aaf
Adding notes to CountCovariates docs about using interval lists as database of known variation
2012-01-17 16:13:13 -05:00
Matt Hanna
40ebc17437
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-17 14:49:17 -05:00
Matt Hanna
41d70abe4e
At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings.
2012-01-17 14:47:53 -05:00
Ryan Poplin
ae259f81cc
Bug fixing for merging of read fragments when one fragment contained an indel
2012-01-17 14:39:27 -05:00
Christopher Hartl
cde224746f
Bait Redesign supports baits that overlap, by picking only the start of intervals.
...
CalibrateGenotypeLikelihoods supports using an external VCF as input for genotype likelihoods. Currently can be a per-sample VCF, but has un-implemented methods for allowing a read-group VCF to be used.
Removed the old constrained genotyping code from UGE -- the trellis calculated is exactly the same as that done in the MLE AC estimate; so we should just re-use that one.
2012-01-17 13:51:05 -05:00
Ryan Poplin
8e23c98dd9
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-17 13:46:28 -05:00
Matt Hanna
32ccde374b
Merged bug fix from Stable into Unstable
2012-01-17 11:08:35 -05:00
Matt Hanna
3ba918aff1
Error message cleanup in BAM indexing code.
2012-01-17 11:05:42 -05:00
Mauricio Carneiro
cec7107762
Better location for the downsampling of reads in PrintReads
...
* using the filter() instead of map() makes for a cleaner walker.
* renaming the unit tests to make more sense with the other unit and integration tests
2012-01-14 14:06:09 -05:00
Mark DePristo
b06074d6e7
Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe
...
-- Uses PriorityBlockingQueue instead of PriorityQueue
-- synchronized keywords added to all key functions that modify internal state
Note that this hasn't been tested extensively. Based on report:
http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic
2012-01-13 09:33:16 -05:00
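The approach in b06074d6e7 (a PriorityBlockingQueue plus synchronized mutators) can be sketched as below. The class, its integer "positions", and the fixed-size buffer policy are assumptions for illustration; SortingVCFWriterBase itself decides when to emit records by its own rules, which this log does not describe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

final class SortingWriterSketch {
    // Thread-safe priority queue holding records not yet emitted.
    private final PriorityBlockingQueue<Integer> queue = new PriorityBlockingQueue<>();
    private final List<Integer> emitted = new ArrayList<>();
    private final int maxCached;

    SortingWriterSketch(int maxCached) {
        this.maxCached = maxCached;
    }

    // synchronized so that the add-then-drain sequence is atomic:
    // the queue alone only makes each individual call thread-safe.
    synchronized void add(int position) {
        queue.add(position);
        while (queue.size() > maxCached) {
            emitted.add(queue.poll());
        }
    }

    // Flushes the remaining records in sorted order and returns the output.
    synchronized List<Integer> close() {
        Integer next;
        while ((next = queue.poll()) != null) {
            emitted.add(next);
        }
        return emitted;
    }
}
```

Output is fully sorted only when no record arrives more than maxCached positions out of order, which is the usual trade-off of a bounded sorting buffer.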
Mauricio Carneiro
28aa353501
Added "unbiased" downsampling parameter to PrintReads
...
* also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.
2012-01-12 16:33:55 -05:00
Matt Hanna
2c3176eb80
Merged bug fix from Stable into Unstable
2012-01-12 13:31:10 -05:00
Matt Hanna
cd43f016ce
Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integration test to verify behavior.
2012-01-12 13:29:11 -05:00
Eric Banks
ed34b4f088
Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-12 10:27:26 -05:00
Eric Banks
e7fe9910f7
Create the temp storage for calculating cell values just once as per Mark's TODO
2012-01-12 10:27:10 -05:00
Eric Banks
f5f5ed5dcd
Don't initialize the cell conformation values (use an else in the loop instead) as per Mark's TODO
2012-01-12 08:50:03 -05:00
Eric Banks
410a340ef5
Swapping the iteration order to run over AF conformations and then samples, instead of the reverse, minimizes calls to HashMap.get: instead of O(n) lookups per conformation (one for each sample), it's now O(1). Runtime on T2D GENES test set is reduced by 5-10%. More optimizations to follow.
2012-01-12 02:04:03 -05:00
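The loop swap in 410a340ef5 can be sketched as follows. The map layout (conformation key to per-sample array) and all names are assumptions for illustration, not the actual exact-model data structures.

```java
import java.util.List;
import java.util.Map;

final class IterationOrderSketch {
    // Before: samples in the outer loop, so the same key is looked up
    // once per sample.
    static double samplesOuter(Map<String, double[]> byConformation,
                               List<String> conformations, int nSamples) {
        double total = 0.0;
        for (int s = 0; s < nSamples; s++) {
            for (String c : conformations) {
                total += byConformation.get(c)[s];
            }
        }
        return total;
    }

    // After: conformations in the outer loop, so each key is looked up
    // exactly once and the result is reused for every sample.
    static double conformationsOuter(Map<String, double[]> byConformation,
                                     List<String> conformations, int nSamples) {
        double total = 0.0;
        for (String c : conformations) {
            double[] perSample = byConformation.get(c);
            for (int s = 0; s < nSamples; s++) {
                total += perSample[s];
            }
        }
        return total;
    }
}
```

Both methods visit the same cells; only the number of HashMap.get calls differs (samples times conformations versus conformations).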
Mauricio Carneiro
77a03c9709
Patching special case in the adaptor clipping
...
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
* updated md5's accordingly
2012-01-11 17:47:44 -05:00
Eric Banks
25d0d53d88
Moving the approximate summing of log10 vals to MathUtils; keeping the more efficient implementation of fast rounding.
2012-01-10 12:38:47 -05:00
Eric Banks
589397d611
Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-01-10 12:36:48 -05:00
Eric Banks
c5320ef1af
Resolving changes in integration test during merge
2012-01-10 12:14:16 -05:00
Matt Hanna
e923a2e512
Revving Picard to incorporate final version of ReadWalker performance improvements.
2012-01-10 12:12:33 -05:00
Eric Banks
0f36f6947e
Resolving merge conflicts
2012-01-10 11:44:16 -05:00
Eric Banks
f2cecce10f
Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously).
2012-01-10 11:34:23 -05:00
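One common way to approximate the sum of an array of log10 values, as in f2cecce10f, is a precomputed table of log10(1 + 10^-d) indexed by the rounded difference d from the running sum. The sketch below shows that technique; the table step, the cutoff, and the class name are assumptions, and this log does not confirm that MathUtils uses these exact constants.

```java
final class ApproxLog10Sum {
    private static final double STEP = 0.001;   // table resolution (assumed)
    private static final double MAX_DIFF = 8.0; // smaller terms are ignored (assumed)
    private static final double[] TABLE =
        new double[(int) Math.round(MAX_DIFF / STEP) + 1];

    static {
        // TABLE[k] = log10(1 + 10^(-k * STEP))
        for (int k = 0; k < TABLE.length; k++) {
            TABLE[k] = Math.log10(1.0 + Math.pow(10.0, -k * STEP));
        }
    }

    // Approximates log10(sum_i 10^vals[i]) without calling log10 or pow.
    static double approximateLog10SumLog10(double[] vals) {
        if (vals.length == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        int maxIdx = 0;
        for (int i = 1; i < vals.length; i++) {
            if (vals[i] > vals[maxIdx]) {
                maxIdx = i;
            }
        }
        double sum = vals[maxIdx];
        if (sum == Double.NEGATIVE_INFINITY) {
            return sum;
        }
        for (int i = 0; i < vals.length; i++) {
            if (i == maxIdx) {
                continue;
            }
            double diff = sum - vals[i]; // always >= 0
            if (diff < MAX_DIFF) {
                // Fast rounding: cast after adding 0.5 (valid for diff >= 0).
                sum += TABLE[(int) (diff / STEP + 0.5)];
            }
        }
        return sum;
    }
}
```

Starting from the maximum keeps every difference non-negative, so the cast-based rounding is safe and the table only needs to cover one direction.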