Commit Graph

3659 Commits (e7152e10f7252bac06c700b0d83377e67585d4fb)

Author SHA1 Message Date
Geraldine Van der Auwera 19a4bf9ff0 made AR an Advanced argument to discourage basic users from fiddling with it 2013-08-14 14:46:56 -04:00
Geraldine Van der Auwera a09831489b Disabled emission of doc URLs for external codecs to avoid broken links 2013-08-10 10:04:04 -07:00
Geraldine Van der Auwera 4d20c71e09 Improvements to various gatkdocs
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Eric's comments
2013-08-10 10:01:31 -07:00
Mark DePristo 7aba5a2f9f Several improvements to AssessNA12878 and KB
-- Bugfix for BAMs containing reads without real (M,I,D,N) operators.  Simply needed to set validation stringency to SILENT in the read. Added a BadCigar filter to the SAMRecord stream anyway
-- Add capture all sites mode to AssessNA12878: will write all sites to the badSites VCF, regardless of whether they are bad.  It's useful if you essentially want to annotate a VCF with KB information for later analysis, such as computing ROC curves
-- Add ignore filters mode to AssessNA12878: will as expected treat all sites in the input VCF calls as PASS, even if the site has a FILTER field setting
-- Add minPNonRef argument to AssessNA12878: this will consider a site not called even if the NA12878 genotype is not 0/0 if the PLs are present and the PL for 0/0 isn't greater than this value.  It allows us to easily differentiate low confidence non-ref sites obtained via multi-sample calling from highly confident non-ref calls that might be real TP or FPs
2013-08-07 08:08:37 -04:00
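
The minPNonRef rule described in the commit above can be sketched as a small predicate. This is a reading of the commit message only, not the AssessNA12878 code: the function name and argument layout are hypothetical, and the PL list is assumed to follow the usual VCF ordering with the 0/0 likelihood first.

```python
def is_confident_non_ref(genotype, pls, min_p_non_ref):
    """Decide whether a non-ref genotype counts as 'called' (hypothetical helper).

    genotype      -- genotype string such as "0/1"
    pls           -- list of phred-scaled likelihoods, 0/0 first, or None if absent
    min_p_non_ref -- threshold that the 0/0 PL must exceed
    """
    if genotype == "0/0":
        return False  # hom-ref is never a non-ref call
    if pls is None:
        return True   # no PLs present, so the threshold cannot be applied
    # A low PL for 0/0 means hom-ref is still plausible: treat as not called.
    return pls[0] > min_p_non_ref
```

With a threshold of 10, a 0/1 genotype with PLs `[5, 0, 40]` would be treated as not called, while `[60, 0, 40]` would count as a confident non-ref call.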
Mauricio Carneiro 285ab2ac62 Better caching for the HaplotypeCaller
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.

Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sort places haplotypes of different lengths next to each other, therefore wasting *tons* of re-compute.

Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
2013-08-02 01:27:29 -04:00
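
The ordering proposed in the commit above (length first, then lexicographic) can be sketched as follows. Haplotypes are represented here as plain base strings, and the number of adjacent length changes is used as a rough proxy for cache wipes; both are assumptions for illustration, not the HaplotypeCaller data structures.

```python
def sort_haplotypes(haplotypes):
    """Order haplotype base strings by length first, then lexicographically."""
    return sorted(haplotypes, key=lambda bases: (len(bases), bases))


def count_length_changes(ordered):
    """Count adjacent pairs whose lengths differ (a proxy for cache wipes)."""
    return sum(1 for a, b in zip(ordered, ordered[1:]) if len(a) != len(b))


haplotypes = ["ACGT", "ACG", "ACGTA", "ACT", "ACGA", "AC"]
lexicographic = sorted(haplotypes)        # plain lexicographic order
by_length = sort_haplotypes(haplotypes)   # length, then lexicographic
```

On this toy input the length-first order groups equal-length haplotypes together, so it has fewer adjacent length changes than the purely lexicographic order.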
Yossi Farjoun 284176cd7b moved SnpEffUtilUnitTest to public tree 2013-07-30 17:51:40 -04:00
droazen b8709b1942 Merge pull request #332 from broadinstitute/st_fpga_hmm
FPGA support for PairHMM
2013-07-30 14:21:21 -07:00
Joseph Rose d2860a5486 Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose) 2013-07-30 15:38:32 -04:00
Chris Hartl 464a5b229d Add <pre> tags to the Genotype Concordance docs. Tables were not being displayed properly. 2013-07-29 15:48:17 -07:00
Geraldine Van der Auwera 3063d82797 Fixed example in CallableLoci gatkdoc 2013-07-26 15:51:31 -04:00
Geraldine Van der Auwera fc4a8b1dd0 Fixed example in DoC gatkdoc 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 660b075900 Added deprecation notice for SomaticIndelDetector 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 5ad99c362d Added caveat to gatkdocs for MAPQ read transformers & cleaned up AB annotation gatkdocs 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 0ea3f8ca58 Added function to gatkdocs to specify what VCF field an annotation goes in (INFO or FORMAT) 2013-07-26 15:51:30 -04:00
Ryan Poplin 8c205dda1b Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line. 2013-07-26 10:22:43 -04:00
Louis Bergelson 7c43b5f26a Adding LibraryReadFilter.
--Moving LibraryReadFilter which has been part of Mutect into gatk public.
--Added an additional check for null values.
2013-07-26 09:32:14 -04:00
Mauricio Carneiro 31ab0824b1 quick indentation fixes to FPGA code 2013-07-24 14:09:49 -04:00
Eric Banks 6df43f730a Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes.
Having them as bytes caused problems for downstream programmers who had data with high MQs.
2013-07-23 23:47:15 -04:00
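
The problem with storing high mapping qualities in signed bytes can be illustrated with a small sketch. It only mimics the range of an 8-bit two's-complement value (-128 to 127); the helper name is hypothetical and this is not the ReadBackedPileup code.

```python
def as_signed_byte(value):
    """Interpret the low 8 bits of value as a two's-complement signed byte."""
    low = value & 0xFF
    return low - 256 if low >= 128 else low


# A typical mapping quality survives the narrowing unchanged...
typical = as_signed_byte(60)
# ...but a value above 127 wraps around to a negative number.
wrapped = as_signed_byte(200)
```

Keeping the value as an int avoids the wrap-around for any mapping quality above 127.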
David Roazen 605a5ac2e3 GATK engine: add ability to do on-the-fly BAM file sample renaming at runtime
-User must provide a mapping file via new --sample_rename_mapping_file argument.
 Mapping file must contain a mapping from absolute bam file path to new sample name
 (format is described in the docs for the argument).

-Requires that each bam file listed in the mapping file contain only one sample
 in their headers (they may contain multiple read groups for that sample, however).
 The engine enforces this, and throws a UserException if on-the-fly renaming is
 requested for a multi-sample bam.

-Not all bam files for a traversal need to be listed in the mapping file.

-On-the-fly renaming is done as the VERY first step after creating the SAMFileReaders
 in SAMDataSource (before the headers are even merged), to prevent possible consistency
 issues.

-Renaming is done ONCE at traversal start for each SAMReaders resource creation in the
 SAMResourcePool; this effectively means once per -nt thread

-Comprehensive unit/integration tests

Known issues: -if you specify the absolute path to a bam in the mapping file, and then
               provide a path to that same bam to -I using SYMLINKS, the renaming won't
               work. The absolute paths will look different to the engine due to the
               symlink being present in one path and not in the other path.

GSA-974 #resolve
2013-07-18 15:48:42 -04:00
David Roazen c15751e41e SAMReaderID: fix bug with hash code and equals() method
-Two SAMReaderIDs that pointed at the same underlying bam file through
 a relative vs. an absolute path were not being treated as equal, and
 had different hash codes. This was causing problems in the engine, since
 SAMReaderIDs are often used as the keys of HashMaps.

-Fix: explicitly use the absolute path to the encapsulated bam file in
 hashCode() and equals()

-Added tests to ensure this doesn't break again
2013-07-15 13:57:00 -04:00
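
The fix described in the commit above, basing both equality and the hash code on the absolute path, can be sketched like this. `ReaderId` is a hypothetical stand-in, not the SAMReaderID class, and the file names are made up for the example.

```python
import os


class ReaderId:
    """Hypothetical ID object keyed on the path to a BAM file."""

    def __init__(self, path):
        self.path = path
        # Normalize once so that equality and hashing always agree.
        self._absolute = os.path.abspath(path)

    def __eq__(self, other):
        if not isinstance(other, ReaderId):
            return NotImplemented
        return self._absolute == other._absolute

    def __hash__(self):
        return hash(self._absolute)


relative = ReaderId("data/sample.bam")
absolute = ReaderId(os.path.join(os.getcwd(), "data", "sample.bam"))
index = {relative: "reader-0"}  # IDs used as dictionary keys
```

Because both objects normalize to the same absolute path, a lookup with `absolute` finds the entry stored under `relative`. Note that `os.path.abspath` does not resolve symlinks, so two paths that differ only by a symlink would still compare as different in this sketch.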
sathibault 0a8f75b953 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
2013-07-15 08:17:32 -05:00
Eric Banks b16c7ce050 A whole slew of improvements to the Haplotype Caller and related code.
1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout.

2. Updates to the KB assessment functionality:
   a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
   b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.

3. Make the HC consistent in how it treats the pruning factor.  As part of this I removed and archived
   the DeBruijn assembler.

4. Improvements to the likelihoods for the HC
   a. We now include a "tristate" correction in the PairHMM (just like we do with UG).  Basically, we need
      to divide e by 3 because the observed base could have come from any of the non-observed alleles.
   b. We now correct overlapping read pairs.  Note that the fragments are not merged (which we know is
      dangerous).  Rather, the overlapping bases are just down-weighted so that their quals are not more
      than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
      turned into Q0s for now.
   c. We no longer run contamination removal by default in the UG or HC.  The exome tends to have real
      sites with off kilter allele balances and we occasionally lose them to contamination removal.

5. Improved the dangling tail merging implementation.
2013-07-12 10:09:10 -04:00
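
Items 4a and 4b of the commit above can be sketched numerically. This is a minimal illustration based only on the commit text: the function names are hypothetical, the cap of 20 is taken from the "not more than Q20" wording, and the actual PairHMM and read-pair handling are not shown.

```python
def phred_to_error(qual):
    """Convert a phred-scaled quality to an error probability e."""
    return 10.0 ** (-qual / 10.0)


def tristate_mismatch_prob(qual):
    """Probability of one specific wrong base: e / 3 (the tristate correction)."""
    return phred_to_error(qual) / 3.0


def adjust_overlapping_quals(base1, qual1, base2, qual2, cap=20):
    """Adjust qualities at one position where the two reads of a pair overlap.

    Agreeing bases are capped at `cap`; disagreeing bases are set to Q0.
    """
    if base1 != base2:
        return 0, 0
    return min(qual1, cap), min(qual2, cap)
```

For example, agreeing bases with qualities 35 and 12 would become 20 and 12, while a mismatching pair would become 0 and 0.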
Valentin Ruano Rubio ac77a4c699 Merge pull request #316 from broadinstitute/md_filter_counting
Bugfix for counting of applied filters
2013-07-08 10:58:47 -07:00
Eric Banks 921f551426 AnalyzeCovariates is no longer a deprecated tool. 2013-07-08 09:48:12 -04:00
Eric Banks 5f5c90e65c Fix bug introduced recently in the VariantAnnotator where only the last -comp was being annotated at a site.
Trivial fix, added integration test to cover it.
2013-07-05 00:04:52 -04:00
Mark DePristo 3db02e5ef1 Merge pull request #315 from broadinstitute/md_ref_conf_hc
Reference confidence model for the haplotype caller
2013-07-02 13:04:33 -07:00
Mark DePristo 7be01777f6 Bugfix for incPos in GenomeLoc
-- Shouldn't have taken a GenomeLoc as an argument, as it's an instance method, not a public static
2013-07-02 15:46:49 -04:00
Mark DePristo e3e8631ff5 Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction.  Fixing an annoying bug where you'd perfectly assemble the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because null was also used to indicate assembly failure
--
-- Output format looks like:
20      10026072        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026073        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026074        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,121
20      10026075        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026076        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026077        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026078        .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217
20      10026079        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,240
20      10026080        .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,268
20      10026081        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,267

We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values.  Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but don't yet account properly for alignment uncertainty.
-- Can be enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval.  The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures.  Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line.  Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
2013-07-02 15:46:38 -04:00
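
The reference-confidence records shown in the commit above can be read with a few lines of code. This is a minimal sketch that assumes whitespace-separated columns with a single sample, as in the example output; `parse_record` is a hypothetical helper, not part of the GATK.

```python
def parse_record(line):
    """Parse one single-sample reference-confidence record into a dict."""
    fields = line.split()
    chrom, pos, _id, ref, alt = fields[:5]
    # Column 9 names the per-sample keys, column 10 holds their values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "GT": sample["GT"],
        "AD": [int(x) for x in sample["AD"].split(",")],
        "DP": int(sample["DP"]),
        "GQ": int(sample["GQ"]),
        "PL": [int(x) for x in sample["PL"].split(",")],
    }


record = parse_record(
    "20      10026078        .       C       <NON_REF>       .       .       ."
    "       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217"
)
```

The parsed record keeps the symbolic `<NON_REF>` allele as the ALT value and exposes AD and PL as integer lists, which makes it easy to inspect the hom-ref confidence per site.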
Mark DePristo 41aba491c0 Critical bugfix for adapter clipping in HaplotypeCaller
-- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference.  Terrible.
-- Removed the broken logic of determining whether a read adaptor is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
2013-07-02 15:46:36 -04:00
David Roazen cdea744b95 Improve -dcov documentation to address recent user confusion
-Explicitly state that -dcov does not produce an unbiased random sampling from all available reads
 at each locus, and that instead it tries to maintain an even representation of reads from
 all alignment start positions (which, of course, is a form of bias)

-Recommend -dfrac for users who want a true across-the-board unbiased random sampling
2013-07-02 15:33:28 -04:00
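
The distinction drawn in the commit above can be illustrated with a toy model. This is not the GATK downsampling algorithm: both functions are hypothetical, reads are modeled as `(start, name)` tuples, and the round-robin scheme is only one simple way to keep alignment start positions evenly represented.

```python
import random
from collections import defaultdict


def sample_fraction(reads, fraction, rng):
    """Keep each read independently with probability `fraction` (-dfrac-like)."""
    return [read for read in reads if rng.random() < fraction]


def sample_evenly_by_start(reads, target, rng):
    """Keep about `target` reads, drawing round-robin over start positions."""
    by_start = defaultdict(list)
    for read in reads:
        by_start[read[0]].append(read)
    for group in by_start.values():
        rng.shuffle(group)
    kept = []
    while len(kept) < target and any(by_start.values()):
        for start in sorted(by_start):
            if by_start[start] and len(kept) < target:
                kept.append(by_start[start].pop())
    return kept


rng = random.Random(0)
# 90 reads start at position 100, only 10 start at position 101.
reads = [(100, f"a{i}") for i in range(90)] + [(101, f"b{i}") for i in range(10)]
even = sample_evenly_by_start(reads, 20, rng)
```

In this toy input the even scheme keeps 10 reads from each start position even though one position has nine times as many reads, which is the kind of bias the documentation change warns about. Per-read random sampling would instead tend to preserve the original 9:1 ratio.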
Mark DePristo 9df58314ab Bugfix for counting of applied filters
-- LocusWalkers have multiple filtering streams, each counting filters independently, and the close() function was calling setFilter on the global result rather than on the private counter that is incorporated into the global result (thereby incrementing the counts of each filter).
-- [delivers #52667213]
2013-07-01 21:09:48 -04:00
David Roazen 31827022db Fix pipeline tests that were not respecting the pipeline test dry run setting
There are a few pipeline test classes that do not run Queue, but are
classified as pipeline tests because they submit farm jobs. Make these
unconventional pipeline tests respect the pipeline test dry run setting.
2013-06-28 15:27:17 -04:00
Scott Thibault 82dcdc01c0 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
2013-06-28 10:13:05 -05:00
David Roazen 94294ed6c4 Move DownsampleReadsQC walker to private 2013-06-25 15:48:44 -04:00
Eric Banks 165b936fcd Fixing the 'header is negative' problem in Reduce Reads... again.
Previous fixes and tests only covered trailing soft-clips.  Now that up front
hard-clipping is working properly though, we were failing on those in the tool.

Added a patch for this as well as a separate test independent of the soft-clips
to make sure that it's working properly.
2013-06-24 14:06:21 -04:00
Mark DePristo fdfe4e41d5 Better GATK version and command line output
-- Previous version emitted command lines that look like:

##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."

the new version provides additional information on when the GATK was run and the GATK version in a nicer format:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">

 -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
 ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">

 -- Removed the ProtectedEngineFeaturesIntegrationTest
 -- Actual unit tests for these features!
2013-06-20 11:19:13 -04:00
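
The new header format shown in the commit above is easy to read back. The following is a minimal sketch that assumes the layout in the examples (comma-separated `key=value` pairs, with quoted values containing no embedded double quotes); `parse_gatk_command_line` is a hypothetical helper.

```python
import re

HEADER_PREFIX = "##GATKCommandLine=<"
# A key, then either a double-quoted value or a bare value up to the next comma.
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_gatk_command_line(line):
    """Split one ##GATKCommandLine header line into a dict of strings."""
    line = line.strip()
    if not (line.startswith(HEADER_PREFIX) and line.endswith(">")):
        raise ValueError("not a ##GATKCommandLine header line")
    body = line[len(HEADER_PREFIX):-1]
    return {key: value.strip('"') for key, value in PAIR.findall(body)}


parsed = parse_gatk_command_line(
    '##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,'
    'Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,'
    'CommandLineOptions="lots of stuff">'
)
```

Applying this to every `##GATKCommandLine` line in a file, in order, would give the running record of tools that produced the VCF.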
Mark DePristo 0672ac5032 Fix public / protected dependency 2013-06-19 19:42:09 -04:00
Valentin Ruano-Rubio 1f8282633b Removed plots generation from the BaseRecalibration software
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table

Implementation:

* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
2013-06-19 14:47:56 -04:00
Valentin Ruano-Rubio 08f92bb6f9 Added AnalyzeCovariates tool to generate BQSR assessment quality plots.
Implementation details:

* Added tool class *.AnalyzeCovariates
* Added convenient addAll method to Utils to be able to add elements of an array.
* Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple input recalibration reports are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
2013-06-19 14:38:02 -04:00
Mark DePristo fb114e34fe Merge pull request #295 from broadinstitute/dr_remove_PrintReads_ds_argument
PrintReads: remove -ds argument
2013-06-19 10:55:10 -07:00
droazen 573ecadecc Merge pull request #294 from broadinstitute/dr_handle_zero_length_cigar_elements
SAMDataSource: always consolidate cigar strings into canonical form
2013-06-19 10:32:22 -07:00
David Roazen 51ec5404d4 SAMDataSource: always consolidate cigar strings into canonical form
-Collapses zero-length and repeated cigar elements, neither of which
 can necessarily be handled correctly by downstream code (like LIBS).

-Consolidation is done before read filters, because not all read filters
 behave correctly with non-consolidated cigars.

-Examined other uses of consolidateCigar() throughout the GATK, and
 found them to not be redundant with the new engine-level consolidation
 (they're all on artificially-created cigars in the HaplotypeCaller
 and SmithWaterman classes)

-Improved comments in SAMDataSource.applyDecoratingIterators()

-Updated MD5s; differences were examined and found to be innocuous

-Two tests: -Unit test for ReadFormattingIterator
            -Integration test for correct handling of zero-length
             cigar elements by the GATK engine as a whole
2013-06-19 13:29:01 -04:00
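
The consolidation described in the commit above (dropping zero-length elements and merging repeated ones) can be sketched on CIGAR strings. This is an illustration only: the GATK works on Cigar objects rather than strings, and `consolidate_cigar` here is a hypothetical string-based helper.

```python
import re

# One CIGAR element: a length followed by a standard SAM operator.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def consolidate_cigar(cigar):
    """Drop zero-length elements and merge adjacent elements with the same operator."""
    consolidated = []
    for length_text, op in CIGAR_ELEMENT.findall(cigar):
        length = int(length_text)
        if length == 0:
            continue  # zero-length elements carry no information
        if consolidated and consolidated[-1][1] == op:
            consolidated[-1][0] += length  # merge with the previous element
        else:
            consolidated.append([length, op])
    return "".join(f"{length}{op}" for length, op in consolidated)
```

Note that removing a zero-length element can make its neighbors adjacent, so `10M0I5M` collapses all the way to `15M`. The sketch does not validate its input; malformed text is simply skipped by the regular expression.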
David Roazen 23ee192d5e PrintReads: remove -ds argument
-This argument was completely redundant with the engine-level -dfrac
 argument.

-Could produce unintended consequences if used in conjunction with
 engine-level downsampling arguments.
2013-06-19 13:22:44 -04:00
David Roazen 0be788f0f9 Fix typo in snpEff documentation 2013-06-19 13:15:24 -04:00
Chris Hartl af275fdf10 Extend the documentation of GenotypeConcordance to include notes about Monomorphic and Filtered VCF records.
Address Geraldine's comments - information on moltenization and explanation of fields

Fix paren
2013-06-19 12:01:58 -04:00
Mark DePristo 15171c07a8 CatVariants accepts reference files ending in any standard extension
-- [resolves #49339235] Make CatVariants accept reference files ending in .fa (not only .fasta)
2013-06-19 11:10:36 -04:00
Mark DePristo 7b22467148 Bugfix: defaultBaseQualities actually works now
-- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly.  Added integration tests to ensure this continues to work.
-- [delivers #49538319]
2013-06-17 14:37:27 -04:00
Mark DePristo b69d210255 Bugfix: allow gzip VCF output in multi-threaded GATK output
-- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is.  Added integration test to ensure this behavior works going forward.
-- [delivers #47399279]
2013-06-17 12:39:18 -04:00
delangel 485ceb1e12 Merge pull request #283 from broadinstitute/md_beagleoutput
Simpler FILTER and info field encoding for BeagleOutputToVCF
2013-06-17 09:31:03 -07:00
James Warren f46f7d9b23 deducing dictionary path should not use global find and replace
Signed-off-by: David Roazen <droazen@broadinstitute.org>
2013-06-14 19:15:27 -04:00
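
The pitfall named in the commit above can be shown with a short sketch. It assumes the dictionary sits next to the reference with a `.dict` extension and that references end in `.fasta` or `.fa`; the function names and example paths are hypothetical and this is not the GATK code.

```python
FASTA_EXTENSIONS = (".fasta", ".fa")


def dictionary_path_global_replace(reference_path):
    """Buggy approach: replaces every occurrence, including in directory names."""
    return reference_path.replace(".fasta", ".dict")


def dictionary_path(reference_path):
    """Safer approach: swap only the trailing extension."""
    for extension in FASTA_EXTENSIONS:
        if reference_path.endswith(extension):
            return reference_path[: -len(extension)] + ".dict"
    raise ValueError(f"unrecognized reference extension: {reference_path}")


tricky = "/data/hg19.fasta.d/hg19.fasta"
```

For the `tricky` path, the global replace also rewrites the directory name, while the suffix-only version changes just the file extension.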