Commit Graph

1315 Commits (2a565ebf90ea53e0a252fe097c778365aff3714a)

Author SHA1 Message Date
Mauricio Carneiro 0d4027104f Reduced reads are now aware of their original alignments
* Added annotations for reads that had been soft clipped prior to being reduced so that we can later recover their original alignments (start and end).
   * Tags keep the alignment shifts, not real alignment, for better compression
   * Tags are defined in the GATKSAMRecord
   * GATKSAMRecord has new functionality to retrieve original alignment start of all reads (trimmed or not) -- getOriginalAlignmentStart() and getOriginalAligmentEnd()
   * Updated ReduceReads MD5s accordingly
2012-01-26 17:06:36 -05:00
Eric Banks 07f72516ae Unsupported platform should be a user error 2012-01-26 16:14:25 -05:00
Ryan Poplin cdff23269d HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close. 2012-01-26 15:56:33 -05:00
Ryan Poplin 25532bdc37 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-26 11:43:32 -05:00
Ryan Poplin 390d493049 Updating ActiveRegionWalker interface to output a probability of active status instead of a boolean. Integrator runs a band-pass filter over this probability to produce actual active regions. First version of HaplotypeCaller which decides for itself where to trigger and assembles those regions. 2012-01-26 11:37:08 -05:00
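
The commit above describes turning a per-locus probability of being active into discrete active regions. The sketch below is one hedged illustration of that idea, not the Integrator's actual code. The commit does not describe the band-pass filter, so a plain centered moving average (a smoothing filter) stands in for it here. The class name, method names, window size, and threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ActiveRegionSketch {
    /** Centered moving average; the window shrinks at the array edges. (Stand-in for the real filter.) */
    static double[] smooth(double[] prob, int halfWindow) {
        double[] out = new double[prob.length];
        for (int i = 0; i < prob.length; i++) {
            int lo = Math.max(0, i - halfWindow);
            int hi = Math.min(prob.length - 1, i + halfWindow);
            double sum = 0.0;
            for (int j = lo; j <= hi; j++) {
                sum += prob[j];
            }
            out[i] = sum / (hi - lo + 1);
        }
        return out;
    }

    /** Returns inclusive [start, end] index pairs where the smoothed probability exceeds the threshold. */
    static List<int[]> activeRegions(double[] prob, int halfWindow, double threshold) {
        double[] smoothed = smooth(prob, halfWindow);
        List<int[]> regions = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a region"
        for (int i = 0; i < smoothed.length; i++) {
            boolean active = smoothed[i] > threshold;
            if (active && start < 0) {
                start = i;
            } else if (!active && start >= 0) {
                regions.add(new int[]{start, i - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            regions.add(new int[]{start, smoothed.length - 1}); // region runs to the end
        }
        return regions;
    }

    public static void main(String[] args) {
        double[] prob = {0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0};
        for (int[] r : activeRegions(prob, 1, 0.5)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

Returning a probability rather than a boolean lets the integration step, not each walker, decide where region boundaries fall.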
Eric Banks 859dd882c9 Don't make it standard for now 2012-01-26 00:38:16 -05:00
Eric Banks c5e81be978 Adding pairwise AF table. Not polished at all, but usable nonetheless. 2012-01-26 00:37:06 -05:00
Eric Banks 702a2d768f Initial version of multi-allelic summary module in VariantEval 2012-01-25 19:42:55 -05:00
Eric Banks 9a60887567 Lost an import in the merge 2012-01-25 19:41:41 -05:00
Eric Banks cba5f1a8b1 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-25 19:19:03 -05:00
Eric Banks add6918f32 Cleaner, more efficient way of determining the last dependent set in the queue. 2012-01-25 16:21:10 -05:00
Menachem Fromer db645a94ca Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES 2012-01-25 16:10:59 -05:00
Eric Banks ef335a5812 Better implementation of the fix; PL index is now traversed in order. 2012-01-25 15:15:42 -05:00
Eric Banks 8e2d372ab0 Use remove instead of setting the value to null 2012-01-25 14:41:34 -05:00
Eric Banks 05816955aa It was possible that we'd clean up a matrix column too early when a dependent column aborted early (with not enough probability mass) because we weren't being smart about the order in which we created dependencies. Fixed. 2012-01-25 14:28:21 -05:00
Eric Banks 2799a1b686 Catch exception for bad type and throw as a TribbleException 2012-01-25 12:15:51 -05:00
Eric Banks 96b62daff3 Minor tweak to the warning message. 2012-01-25 11:55:33 -05:00
Eric Banks fb863dc6a7 Warn user when trying to run with EMIT_ALL_SITES with indels; better docs for that option. 2012-01-25 11:50:12 -05:00
Eric Banks e349b4b14b Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod. 2012-01-25 11:35:54 -05:00
Eric Banks ea3d4d60f2 This annotation requires rods and should be annotated as such 2012-01-25 11:35:13 -05:00
Ryan Poplin bbefe4a272 Added option to be able to write out the active regions to an interval list file 2012-01-25 09:47:06 -05:00
Ryan Poplin 9818c69df6 Can now specify active regions to process at the command line, mainly for debugging purposes 2012-01-25 09:32:52 -05:00
Mauricio Carneiro ffd61f4c1c Refactor the Pileup Element with regards to indels
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.

   * Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
   * Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
   * Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
   * Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
   * Every deletion now has an offset (start of the event)
   * Fixed bug when LocusIteratorByState found a read starting with insertion that happened to be a reduced read.
   * Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
   * Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
   * Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
   * The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!

 phew!
2012-01-24 16:07:21 -05:00
Matt Hanna c312bd5960 Weirdly, PicardException inherits from SAMException, which means that our specialty code for
reporting malformed BAMs was actually misreporting any error that happened in the Picard layer
as a BAM ERROR.

Specifically changing PicardException to report as a ReviewedStingException; we might want to
change it in the future.  I'll follow up with the Picard team to make sure they really, really
want PicardException to inherit from SAMException.
2012-01-24 15:30:04 -05:00
Mark DePristo 0a3172a9f1 Fix for ref 0 bases for Chris
-- Disturbingly, fixing this bug doesn't actually cause any test failures.
-- Wrote a new QCRefWalker to actually check in detail that the reference bases coming into the RefWalker are all correct when comparing against a clean uncached load of the contig bases directly.
-- However, I cannot run this tool due to some kind of weird BAM error -- sending this on to Matt
2012-01-24 10:55:09 -05:00
Khalid Shakir c18beadbdb Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Mark DePristo 02450e4b12 Merged bug fix from Stable into Unstable 2012-01-23 12:08:39 -05:00
Christopher Hartl 798596257b Enable the Genotype Phasing Evaluator. Because it didn't have the same argument structure as the base class, update2 of VariantEvaluator was being called, rather than update2 of the actual module. 2012-01-23 10:50:16 -05:00
Mark DePristo 80a4ce0edf Bugfix for incorrect error messages for missing BAMs and VCFs
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
Guillermo del Angel 31d2f04368 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-23 09:23:03 -05:00
Guillermo del Angel 966387ca0b Next intermediate commit in the pool caller. Lots of bug fixes and now we can emit true VCFs with calls in discovery mode (still of unknown quality) - old validation mode is temporarily broken, will be fixed in next refactoring. 2012-01-23 09:22:31 -05:00
Ryan Poplin 4d6312d4ea HaplotypeCaller is now an ActiveRegionWalker. 2012-01-22 14:31:01 -05:00
Christopher Hartl 3b1aad4f17 After a minor and abject freakout, alter the T2D script to seek out truth sensitivities between 80 and 100, rather than between 0.8 and 1. Also, don't consider a genotype "changed by beagle" if the initial genotype is a no-call. 2012-01-20 23:43:51 -05:00
Christopher Hartl 9b4f6afa21 Alterations to scripts for better performance. Grid search now expands the sens/spec tradeoff (90 was far too aggressive against hapmap chr20), and 20 max gaussians was too many, and caused errors. For consensus genotypes: remember to gunzip the beagle outputs before converting to VCF. Also, beagle can in fact create 'null' alleles in certain circumstances. I'm not sure what exactly those circumstances are, but those sites should be ignored. When it does, all alleles appear to be set to null, so this should not affect the actual phasing in the output VCF. 2012-01-20 23:07:59 -05:00
Ryan Poplin 4b18786b5d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-19 22:05:20 -05:00
Ryan Poplin ace9333068 Active region walkers can now see the reads in a buffer around their active regions. This buffer size is specified as a walker annotation. Intervals are internally extended by this buffer size so that the extra reads make their way through the traversal engine but the walker author only needs to see the original interval. Also, several corner case bug fixes in active region traversal. 2012-01-19 22:05:08 -05:00
Menachem Fromer 066da80a3d Added KEEP_UNCONDTIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output 2012-01-19 18:19:58 -05:00
Christopher Hartl 7f3ad25b01 Adding a mode to VariantFiltration to invalidate previously-applied filters to allow complete re-filtering of a VCF.
T2D VQSR: re-calling now done with appropriate quality settings and using BAQ.
2012-01-19 10:54:48 -05:00
Ryan Poplin 7e082c7750 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-19 09:11:23 -05:00
Eric Banks ab8f499bc3 Annotate with FS even for filtered sites 2012-01-18 22:04:51 -05:00
Guillermo del Angel b123416c4c Resolve stale merge changes 2012-01-18 20:56:36 -05:00
Guillermo del Angel 2eb45340e1 Initial, raw, mostly untested version of new pool caller that also does allele discovery. Still needs debugging/refining. Main modification is that there is a new operation mode, set by argument -ALLELE_DISCOVERY_MODE, which if true will determine optimal alt allele at each computable site and will compute AC distribution on it. Current implementation is not working yet if there's more than one pool and it will only output biallelic sites, no functionality for true multi-allelics yet 2012-01-18 20:54:10 -05:00
Ryan Poplin 0268da7560 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-18 09:53:00 -05:00
Ryan Poplin 11982b5a34 We no longer calculate the population-level TDT statistic if there are fewer than 5 trios with full genotype likelihood information. When there is a high degree of missingness the results are skewed or in the worst case come out as NaN. 2012-01-18 09:42:41 -05:00
Mark DePristo 763c81d520 No longer enforce MAX_ALLELE_SIZE in VCF codec
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()
2012-01-18 07:35:11 -05:00
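
The `ref.getBytes()[i]` to `(byte)ref.charAt(i)` change in the commit above can be illustrated with a small sketch. `String.getBytes` encodes the whole string into a new array on every call, so calling it inside a loop costs O(n) per iteration, while `charAt` reads one character in place. The class and the suffix-matching helpers below are hypothetical stand-ins, not the real `computeReverseClipping()`, and the cast is assumed to be safe only because allele strings are plain ASCII.

```java
import java.nio.charset.StandardCharsets;

public class RefByteAccess {
    /** Slow variant: re-encodes the whole reference string on every loop iteration. */
    static int countTrailingMatchesSlow(String ref, byte[] allele) {
        int matches = 0;
        int limit = Math.min(ref.length(), allele.length);
        for (int i = 1; i <= limit; i++) {
            // getBytes allocates and fills a fresh byte[] of length n each time
            byte refBase = ref.getBytes(StandardCharsets.US_ASCII)[ref.length() - i];
            if (refBase != allele[allele.length - i]) {
                break;
            }
            matches++;
        }
        return matches;
    }

    /** Fast variant: reads a single character in place; no allocation in the loop. */
    static int countTrailingMatchesFast(String ref, byte[] allele) {
        int matches = 0;
        int limit = Math.min(ref.length(), allele.length);
        for (int i = 1; i <= limit; i++) {
            byte refBase = (byte) ref.charAt(ref.length() - i); // ASCII assumed
            if (refBase != allele[allele.length - i]) {
                break;
            }
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        byte[] allele = "GTT".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countTrailingMatchesSlow("ACGTTT", allele));
        System.out.println(countTrailingMatchesFast("ACGTTT", allele));
    }
}
```

Both variants return the same count for ASCII input; only the per-iteration cost differs, which matters when the reference allele is very long.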
Mark DePristo 0c7865fdb5 UnitTest for reverseAlleleClipping
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mark DePristo 62801e430a Bugfix for unnecessary optimization
-- don't cache the ref bytes
2012-01-17 16:40:26 -05:00
Mark DePristo f2b0575dee Detect unreasonably large allele strings (>2^16) and throw an error
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele.  Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
2012-01-17 16:40:26 -05:00
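
A minimal sketch of the kind of guard the commit above describes: reject any allele longer than 2^16 bases with a descriptive error instead of letting the parser hang. The class name, method names, and exception type are hypothetical. Whether the real check treats exactly 2^16 bases as valid is not stated in the commit, so the strict comparison below is an assumption.

```java
public class AlleleLengthCheck {
    // Limit taken from the commit text: 2^16 bases.
    static final int MAX_ALLELE_SIZE = 1 << 16;

    /** True when the allele exceeds the limit (exactly 2^16 is assumed to be allowed). */
    static boolean isTooLong(String allele) {
        return allele.length() > MAX_ALLELE_SIZE;
    }

    /** Throws a descriptive error for oversized alleles; returns normally otherwise. */
    static void check(String allele) {
        if (isTooLong(allele)) {
            throw new IllegalArgumentException(
                "Allele of length " + allele.length()
                + " exceeds the limit of " + MAX_ALLELE_SIZE + " bp");
        }
    }

    public static void main(String[] args) {
        check("ACGT"); // passes silently
        String huge = new String(new char[MAX_ALLELE_SIZE + 1]).replace('\0', 'N');
        try {
            check(huge);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```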
Ryan Poplin 8b0ddf0aaf Adding notes to CountCovariates docs about using interval lists as database of known variation 2012-01-17 16:13:13 -05:00
Matt Hanna 40ebc17437 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-17 14:49:17 -05:00