12478 Commits (f46f7d9b23d22ac249fddbfacc4e748b61940ac9)

Author SHA1 Message Date
James Warren f46f7d9b23 deducing dictionary path should not use global find and replace
Signed-off-by: David Roazen <droazen@broadinstitute.org>
2013-06-14 19:15:27 -04:00
Mark DePristo 52677429a0 Merge pull request #284 from broadinstitute/dr_fewer_stranded_temp_files
Reduce number of leftover temp files in GATK runs
2013-06-14 13:06:28 -07:00
David Roazen d167292688 Reduce number of leftover temp files in GATK runs
-WalkerTest now deletes *.idx files on exit

-ArtificialBAMBuilder now deletes *.bai files on exit

-VariantsToBinaryPed walker now deletes its temp files on exit
2013-06-14 15:56:03 -04:00
Mark DePristo b72880cc94 Merge pull request #282 from broadinstitute/md_gatklogs_gitversions
Use git hash to lookup versions when necessary in analyzeRunReports.py
2013-06-14 12:39:54 -07:00
Mark DePristo 20bb4902a3 Use git hash to lookup versions when necessary in analyzeRunReports.py
2013-06-14 15:31:25 -04:00
Mark DePristo 50ea098c11 Merge pull request #281 from broadinstitute/md_gatklogs
Update utilities to get GATKRunReports
2013-06-14 10:00:16 -07:00
Ryan Poplin c4e508a71f Merge pull request #275 from broadinstitute/md_fragment_with_pcr
Improvements to HaplotypeCaller and NA12878 KB
2013-06-14 09:32:26 -07:00
Mark DePristo a057f37331 Update utilities to get GATKRunReports
-- Critical bugfix: the GATK run reports magically changed names from something like GATK-run-report to GATKRunReport in GATK 2.4.  All GATK logs from 2.4 onwards were being eaten by the scripts that download logs, so the GATK usage is actually much much higher than our logs have suggested.  Looking forward to seeing some real numbers.  Unfortunately the error occurred so early in the downloading process that we actually deleted away these logs, so they cannot be recovered
-- Added a step in the downloader that archives the raw, unprocessed files so we can recover from such problems in the future
-- The s3 download scripts now download to /local/dev/GATKLogs so will only work on gsa4, but this is ok as this is better than taking forever to get the logs to the isilon.
-- Turn off some crazy debugging output from the downloader that was actually preventing me from seeing the issue each night
-- Make analyzeRunReports.py robust to svn version abominations
-- Use python-2.6 in runGATKReport.csh
2013-06-14 10:17:32 -04:00
droazen ac346a93ba Merge pull request #278 from broadinstitute/md_gatk_version_in_vcf
Emit the GATK version number in the VCF header
2013-06-13 13:22:20 -07:00
Mark DePristo 908183aba7 Merge pull request #277 from broadinstitute/dr_fix_com_sun_dependency
Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only
2013-06-13 13:12:45 -07:00
David Roazen f9c986be74 Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only
Problem:
Classes in com.sun.javadoc.* are non-standard. Since we can't depend on their availability for
all users, the GATK proper should not have any runtime dependencies on this package.

Solution:
-Isolate com.sun.javadoc.* dependencies in a DocletUtils class for use only by doclets. The
 only users who need to run our doclets are those who compile from source, and they
 should be competent enough to figure out how to resolve a missing com.sun.* dependency.

-HelpUtils now contains no com.sun.javadoc.* dependencies and can be safely used by walkers/other
 tools.

-Added comments with instructions on when it is safe to use DocletUtils vs. HelpUtils

[delivers #51450385]
[delivers #50387199]
2013-06-13 15:52:41 -04:00
Mark DePristo 74f311c973 Emit the GATK version number in the VCF header
-- Looks like ##GATKVersion=2.5-159-g3f91d93 in the VCF header line
-- delivers [#51595305]
2013-06-13 15:46:16 -04:00
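The header value quoted in commit 74f311c973 looks like `git describe` output (tag, commits since the tag, abbreviated hash). A minimal sketch of splitting it apart, assuming that layout always holds; `parse_gatk_version` and the returned key names are hypothetical and not part of the GATK:

```python
import re

# Assumed layout: ##GATKVersion=<tag>-<commits since tag>-g<abbreviated git hash>
HEADER_RE = re.compile(
    r"^##GATKVersion=(?P<tag>[^-]+)-(?P<commits>\d+)-g(?P<hash>[0-9a-f]+)$"
)

def parse_gatk_version(line):
    """Return the parts of a GATKVersion header line, or None if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return {
        "tag": match["tag"],
        "commits_since_tag": int(match["commits"]),
        "git_hash": match["hash"],
    }

parsed = parse_gatk_version("##GATKVersion=2.5-159-g3f91d93")
```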
Mark DePristo d93bed5d61 Merge pull request #276 from broadinstitute/md_gatkreport_cleanup
Remove STANDARD option from GATKRunReport
2013-06-13 12:40:57 -07:00
Mark DePristo 6232db3157 Remove STANDARD option from GATKRunReport
-- AWS is now the default.  Removed old code that referred to the STANDARD type.  Deleted unused variables and functions.
2013-06-13 15:18:28 -04:00
Mark DePristo dd5674b3b8 Add genotyping accuracy assessment to AssessNA12878
-- Now table looks like:

Name     VariantType  AssessmentType           Count
variant  SNPS         TRUE_POSITIVE              1220
variant  SNPS         FALSE_POSITIVE                0
variant  SNPS         FALSE_NEGATIVE                1
variant  SNPS         TRUE_NEGATIVE               150
variant  SNPS         CALLED_NOT_IN_DB_AT_ALL       0
variant  SNPS         HET_CONCORDANCE          100.00
variant  SNPS         HOMVAR_CONCORDANCE        99.63
variant  INDELS       TRUE_POSITIVE               273
variant  INDELS       FALSE_POSITIVE                0
variant  INDELS       FALSE_NEGATIVE               15
variant  INDELS       TRUE_NEGATIVE                79
variant  INDELS       CALLED_NOT_IN_DB_AT_ALL       2
variant  INDELS       HET_CONCORDANCE           98.67
variant  INDELS       HOMVAR_CONCORDANCE        89.58

-- Rewrote / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles.  So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A.  This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB.  Added lots of unit tests for these functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:

20      10769255        .       A       ATGTG   165.73  .       ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE     GT:AD:DP:GQ:PL  0/1:1,9:10:6:360,0,6

Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function.  Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
2013-06-13 15:05:32 -04:00
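The BEST_MATCH behaviour described in commit dd5674b3b8 can be illustrated roughly as follows. This is a sketch of the rule as stated in the message (alleles that survive the subset are kept, the rest fall back to the reference), not the actual subsetDiploidAlleles code; the function name and argument layout are hypothetical:

```python
def best_match_subset(genotype, ref, kept_alleles):
    """Map a diploid genotype onto kept_alleles; dropped alleles become ref."""
    if len(kept_alleles) < 2:
        # The commit forbids subsetting diploid genotypes to a single allele.
        raise ValueError("cannot subset a diploid genotype to a single allele")
    if ref not in kept_alleles:
        raise ValueError("the reference allele must be among the kept alleles")
    mapped = [allele if allele in kept_alleles else ref for allele in genotype]
    # Order the result the same way as kept_alleles so the reference comes first.
    return tuple(sorted(mapped, key=kept_alleles.index))

same = best_match_subset(("A", "B"), "A", ["A", "B"])       # stays A/B
dropped = best_match_subset(("A", "B"), "A", ["A", "C"])    # B is gone, so A/A
het_alt_b = best_match_subset(("B", "C"), "A", ["A", "B"])  # het-alt becomes A/B
het_alt_c = best_match_subset(("B", "C"), "A", ["A", "C"])  # het-alt becomes A/C
```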
Mark DePristo 33720b83eb No longer merge overlapping fragments from HaplotypeCaller
-- Merging overlapping fragments turns out to be a bad idea.  In the case where you can safely merge the reads you only gain a small amount of overlapping kmers, so the potential gains are relatively small.  That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, and you artificially construct reads that look like the reference but actually may carry a larger true insertion w.r.t. the reference.  Because this problem isn't limited to repetitive sequence, but in principle could occur in any sequence, it's just not safe to do this merging.  Best to leave haplotype construction to the assembly graph.
2013-06-13 15:05:32 -04:00
droazen fb5143a590 Merge pull request #274 from broadinstitute/md_s3_only
GATKRunReport no longer tries to use the Broad filesystem destination, r...
2013-06-13 11:32:31 -07:00
Mark DePristo dd6e252373 GATKRunReport no longer tries to use the Broad filesystem destination, rather it goes unconditionally to S3
2013-06-13 13:33:10 -04:00
Mark DePristo c837d67b2f Merge pull request #273 from broadinstitute/rp_readIsPoorlyModelled
Relaxing the constraints on the readIsPoorlyModelled function.
2013-06-13 08:40:24 -07:00
Mark DePristo 2833325d31 Merge pull request #272 from broadinstitute/rp_hc_bam_writer_uninformative_reads
HC bam writer now sets the read to MQ0 if it isn't informative
2013-06-13 08:08:45 -07:00
Ryan Poplin f44efc27ae Relaxing the constraints on the readIsPoorlyModelled function.
-- Turns out we were aggressively throwing out borderline-good reads.
2013-06-13 11:06:23 -04:00
Ryan Poplin d5f0848bd5 HC bam writer now sets the read to MQ0 if it isn't informative
-- Makes visualization of read evidence easier in IGV.
2013-06-13 10:11:54 -04:00
Eric Banks 17d3ccb03b Merge pull request #270 from broadinstitute/rp_reference_haplotype_mismatch_bug
Fixing bug with dangling tails in which the tail connects all the way ba...
2013-06-12 11:03:48 -07:00
Ryan Poplin d1f397c711 Fixing bug with dangling tails in which the tail connects all the way back to the reference source node.
-- List of vertices can't contain a source node.
2013-06-12 12:23:01 -04:00
Mark DePristo b2dc7095ab Merge pull request #267 from broadinstitute/dr_reducereads_downsampling_fix
Exclude reduced reads from elimination during downsampling
2013-06-11 13:52:28 -07:00
David Roazen 95b5f99feb Exclude reduced reads from elimination during downsampling
Problem:
-Downsamplers were treating reduced reads the same as normal reads,
 with occasionally catastrophic results on variant calling when an
 entire reduced read happened to get eliminated.

Solution:
-Since reduced reads lack the information we need to do position-based
 downsampling on them, best available option for now is to simply
 exempt all reduced reads from elimination during downsampling.

Details:
-Add generic capability of exempting items from elimination to
 the Downsampler interface via new doNotDiscardItem() method.
 Default inherited version of this method exempts all reduced reads
 (or objects encapsulating reduced reads) from elimination.

-Switch from interfaces to abstract classes to facilitate this change,
 and do some minor refactoring of the Downsampler interface (push
 implementation of some methods into the abstract classes, improve
 names of the confusing clear() and reset() methods).

-Rewrite TAROrderedReadCache. This class was incorrectly relying
 on the ReservoirDownsampler to preserve the relative ordering of
 items in some circumstances, which was behavior not guaranteed by
 the API and only happened to work due to implementation details
 which no longer apply. Restructured this class around the assumption
 that the ReservoirDownsampler will not preserve relative ordering
 at all.

-Add disclaimer to description of -dcov argument explaining that
 coverage targets are approximate goals that will not always be
 precisely met.

-Unit tests for all individual downsamplers to verify that reduced
 reads are exempted from elimination
2013-06-11 16:16:26 -04:00
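The exemption mechanism described in commit 95b5f99feb can be sketched with a small reservoir sampler. This is an illustration under assumptions, not the GATK's ReservoirDownsampler: the class layout, the `do_not_discard` callback (standing in for doNotDiscardItem()) and the `reduced` flag on reads are all hypothetical.

```python
import random

class ReservoirDownsampler:
    """Reservoir sampling over submitted items, with exempt items always kept."""

    def __init__(self, target_size, do_not_discard=lambda item: False, rng=None):
        self.target_size = target_size
        self.do_not_discard = do_not_discard
        self.rng = rng or random.Random()
        self.reservoir = []   # items competing for target_size slots
        self.exempt = []      # items that may never be eliminated
        self.seen = 0         # number of non-exempt items submitted so far

    def submit(self, item):
        if self.do_not_discard(item):
            self.exempt.append(item)
            return
        self.seen += 1
        if len(self.reservoir) < self.target_size:
            self.reservoir.append(item)
            return
        # Standard reservoir step: keep the new item with probability target/seen.
        slot = self.rng.randrange(self.seen)
        if slot < self.target_size:
            self.reservoir[slot] = item

    def consume_finalized_items(self):
        # No relative ordering is promised between the returned items.
        return self.exempt + self.reservoir

reads = [{"name": f"r{i}", "reduced": i % 40 == 0} for i in range(120)]
downsampler = ReservoirDownsampler(
    10, do_not_discard=lambda read: read["reduced"], rng=random.Random(0)
)
for read in reads:
    downsampler.submit(read)
kept = downsampler.consume_finalized_items()
```

Because exempt items are kept on top of the reservoir, the result can exceed the target size, which is in line with the disclaimer that coverage targets are approximate.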
Ryan Poplin e1fd3dff9a Merge pull request #268 from broadinstitute/eb_calling_accuracy_improvements_to_HC
Eb calling accuracy improvements to hc
2013-06-11 11:18:51 -07:00
Eric Banks b63cbd8cc9 Merge pull request #266 from broadinstitute/gda_read_error_correction_new
Gda read error correction new
2013-06-11 10:42:06 -07:00
Eric Banks 2c3c680eb7 Misc changes and cleanup from all previous commits in this push.
1. By default, do not include the UG CEU callset for assessment.
2. Updated md5s that are different now with all the HC changes.
2013-06-11 12:53:11 -04:00
Eric Banks dadcfe296d Reworking of the dangling tails merging code.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort.  Also, we put in a check for low-complexity of graphs
and don't let those pass through.

Added tests for this implementation that check exact SW results and correct edges added.
2013-06-11 12:53:04 -04:00
Guillermo del Angel 55d5f2194c Read Error Corrector for haplotype assembly
Principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer but correct sequence will be supported by many reads, so correct sequences will look like common k-mers. So, algorithm has 3 main steps:
1. K-mer graph buildup.
For each read in an active region, a map from k-mers to the number of times they have been seen is built.
2. Building correction map.
All "rare" k-mers that are sparse (by default, seen only once), get mapped to k-mers that are good (by default, seen at least 20 times but this is a CL argument), and that lie within a given Hamming distance (by default, =1). This map can be empty (i.e. k-mers can be uncorrectable).
3. Correction proposal
For each constituent k-mer of each read, if this k-mer is rare and maps to a good k-mer, get differing base positions in k-mer and add these to a list of corrections for each base in each read. Then, correct read at positions where correction proposal is unanimous and non-empty.

The algorithm defaults are chosen to be very stringent and conservative in the correction: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in read if all correction proposals are congruent.

By default, algorithm is disabled but can be enabled in HaplotypeCaller via the -readErrorCorrect CL option. However, at this point it's about 3x-10x more expensive so it needs to be optimized if it's to be used.
2013-06-11 12:26:24 -04:00
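The three steps described in commit 55d5f2194c can be sketched as below. This is a simplified illustration, not the HaplotypeCaller implementation: the function names are hypothetical, the alphabet is assumed to be ACGT, the Hamming distance is fixed at 1, and a rare k-mer is treated as uncorrectable when more than one good k-mer is within reach (the commit message does not say how ties are handled).

```python
from collections import Counter, defaultdict

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_correction_map(counts, max_rare=1, min_good=20):
    """Step 2: map each rare k-mer to a good k-mer at Hamming distance 1."""
    good = {kmer for kmer, n in counts.items() if n >= min_good}
    correction = {}
    for kmer, n in counts.items():
        if n > max_rare:
            continue
        neighbours = {
            kmer[:i] + base + kmer[i + 1:]
            for i in range(len(kmer))
            for base in "ACGT"
            if base != kmer[i]
        }
        candidates = neighbours & good
        if len(candidates) == 1:  # assumption: ambiguous k-mers stay uncorrected
            correction[kmer] = candidates.pop()
    return correction

def correct_read(read, k, correction):
    """Step 3: collect per-base proposals and apply the unanimous ones."""
    proposals = defaultdict(set)
    for start, kmer in enumerate(kmers(read, k)):
        target = correction.get(kmer)
        if target is None:
            continue
        for offset, (old, new) in enumerate(zip(kmer, target)):
            if old != new:
                proposals[start + offset].add(new)
    bases = list(read)
    for position, options in proposals.items():
        if len(options) == 1:
            bases[position] = next(iter(options))
    return "".join(bases)

def correct_reads(reads, k, max_rare=1, min_good=20):
    # Step 1: count every k-mer across all reads of the region.
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    correction = build_correction_map(counts, max_rare, min_good)
    return [correct_read(read, k, correction) for read in reads]

reads = ["ACGTACGGTCA"] * 20 + ["ACGTACTGTCA"]  # last read has one base error
corrected = correct_reads(reads, k=5)
```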
Eric Banks c0030f3f2d We no longer subset down to the best N haplotypes for the GL calculation.
I explain in comments within the code that this was causing problems with the marginalization over events.
2013-06-11 11:51:26 -04:00
Eric Banks c0e3874db0 Change the HC's phredScaledGlobalReadMismappingRate from 60 to 45, because Ryan and Mark told me to.
2013-06-11 11:51:26 -04:00
Eric Banks 77868d034f Do not allow the use of Ns in reads for graph construction.
Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones.
This was the cause of at least one FN where there were Ns at a SNP position.
2013-06-11 11:51:26 -04:00
Eric Banks e4e7d39e2c Fix FN problem stemming from sequence graphs that contain cycles.
Problem:
The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers.
Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final
sequence graph.  Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions.

Solution:
We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them.  If (and only if) we
can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate
a graph without cycles or attempt too many iterations.
2013-06-11 11:51:26 -04:00
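The fallback strategy in commit e4e7d39e2c can be sketched on a toy k-mer graph. This is an illustration only: the graph here is a plain dictionary of k-mer to successor k-mers, and the step size and attempt limit are assumptions, since the commit message does not state them.

```python
def build_kmer_graph(sequences, k):
    """Nodes are k-mers; an edge links each k-mer to the next one in a sequence."""
    graph = {}
    for sequence in sequences:
        previous = None
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            graph.setdefault(kmer, set())
            if previous is not None:
                graph[previous].add(kmer)
            previous = kmer
    return graph

def has_cycle(graph):
    """Iterative depth-first search; reaching an in-progress node means a cycle."""
    unvisited, in_progress, done = 0, 1, 2
    state = dict.fromkeys(graph, unvisited)
    for root in graph:
        if state[root] != unvisited:
            continue
        state[root] = in_progress
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for successor in successors:
                if state[successor] == in_progress:
                    return True
                if state[successor] == unvisited:
                    state[successor] = in_progress
                    stack.append((successor, iter(graph[successor])))
                    break
            else:
                state[node] = done
                stack.pop()
    return False

def acyclic_graphs(sequences, initial_kmers=(10, 25), step=10, max_attempts=6):
    """Use the cycle-free initial k-mer sizes; only if none work, try larger ones."""
    usable = {}
    for k in initial_kmers:
        graph = build_kmer_graph(sequences, k)
        if graph and not has_cycle(graph):
            usable[k] = graph
    if usable:
        return usable
    k = max(initial_kmers)
    for _ in range(max_attempts):
        k += step
        graph = build_kmer_graph(sequences, k)
        if graph and not has_cycle(graph):
            return {k: graph}
    return {}  # gave up after too many iterations

# Small k-mer sizes keep the example readable; ACGT repeats, so k=3 and k=4 cycle.
sequence = "AAACGTCCACGTGG"
result = acyclic_graphs([sequence], initial_kmers=(3, 4), step=1, max_attempts=3)
```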
Ryan Poplin 210007cd09 Merge pull request #269 from broadinstitute/rp_minor_pruning_function_name
Minor changes to docs in the graph pruning.
2013-06-11 08:24:56 -07:00
Ryan Poplin 58e354176e Minor changes to docs in the graph pruning.
2013-06-11 10:33:22 -04:00
Mark DePristo c7836ec746 Merge pull request #264 from broadinstitute/md_dbsnp
Make HaplotypeCaller annotate dbSNP rsIDs
2013-06-10 13:38:20 -07:00
Mark DePristo 1c03ebc82d Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
2013-06-10 16:20:31 -04:00
Mark DePristo 0d593cff70 Refactor rsID and overlap detection in VariantOverlapAnnotator utility class
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele.  This matching algorithm determines how rsIDs are added back into the VariantContext we want to annotate, as well as determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as it is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
2013-06-10 15:51:13 -04:00
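The matching rule stated in commit 0d593cff70 (same reference allele, at least one shared alternative allele) can be written down in a few lines. This is a sketch, not the VariantOverlapAnnotator code: records are plain dictionaries, all of them are assumed to start at the same position, and the rsIDs are made up for the example. Joining several IDs with a semicolon follows the VCF convention for the ID column.

```python
def variants_match(call, record):
    """Same reference allele and at least one common alternative allele."""
    return call["ref"] == record["ref"] and bool(set(call["alts"]) & set(record["alts"]))

def annotate_rsid(call, dbsnp_records):
    """Return the ID string for a call, or '.' when no dbSNP record matches."""
    ids = [record["id"] for record in dbsnp_records if variants_match(call, record)]
    return ";".join(ids) if ids else "."

# Hypothetical dbSNP records overlapping one site.
dbsnp = [
    {"id": "rs1", "ref": "A", "alts": ["T"]},
    {"id": "rs2", "ref": "A", "alts": ["G", "C"]},
    {"id": "rs3", "ref": "AT", "alts": ["A"]},
]
both = annotate_rsid({"ref": "A", "alts": ["C", "T"]}, dbsnp)
single = annotate_rsid({"ref": "A", "alts": ["G"]}, dbsnp)
none = annotate_rsid({"ref": "A", "alts": ["AT"]}, dbsnp)
```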
Mark DePristo 3e979f30a9 Merge pull request #265 from broadinstitute/mc_move_qualify_intervals_to_protected
Moving QualifyMissingIntervals to protected
2013-06-10 10:14:42 -07:00
Mauricio Carneiro a95fbd48e5 Moving QualifyMissingIntervals to protected
Making this walker available so we can share it with the CSER group for CLIA analysis.
2013-06-10 13:11:41 -04:00
Eric Banks 2a935374f3 Merge pull request #242 from broadinstitute/vrr_N_cigar_error_and_override_option
Reads with N operator in CIGAR now result in a User exception. Option to filter them out is provided.
2013-06-10 08:46:15 -07:00
Valentin Ruano-Rubio 96073c3058 This commit addresses JIRA issue GSA-948: Prevent users from doing the wrong thing with RNA-Seq data and the GATK.
The previous behavior was to process reads with N CIGAR operators as they are, even though many of the tools do not actually support this operator and results become unpredictable.

Now if there is a read with the N operator, the engine returns a user exception. The error message indicates what the problem is (including the offending read and mapping position) and gives a couple of alternatives that the user can take in order to move forward:

a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)

b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)

Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.

Implementation:

* Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class as a common parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
2013-06-10 10:44:42 -04:00
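The decision logic described in commit 96073c3058, including the rule that filtering overrides allowing, can be sketched as follows. The function and exception names are hypothetical; the two boolean parameters stand in for --filter_reads_with_N_cigar and -U ALLOW_N_CIGAR_READS (or -U ALL).

```python
class UserException(Exception):
    """Stand-in for the engine's user-facing error."""

def keep_read(read_name, cigar, filter_n_cigar=False, allow_n_cigar=False):
    """Return True to process the read, False to filter it out; raise otherwise."""
    if "N" not in cigar:
        return True
    if filter_n_cigar:
        return False  # option (a): filtering takes precedence over (b)
    if allow_n_cigar:
        return True   # option (b): process the read as before
    raise UserException(
        f"read {read_name} has an N operator in its CIGAR ({cigar}); "
        "either filter such reads out or explicitly allow them"
    )

plain = keep_read("r1", "76M")
filtered = keep_read("r2", "30M500N46M", filter_n_cigar=True)
filter_wins = keep_read("r2", "30M500N46M", filter_n_cigar=True, allow_n_cigar=True)
allowed = keep_read("r2", "30M500N46M", allow_n_cigar=True)
```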
Eric Banks cbb6c7ae92 Merge pull request #263 from broadinstitute/mccowan_reduce_reads_performance
Reduce reads performance improvements
2013-06-10 06:19:38 -07:00
Michael McCowan 00c06e9e52 Performance improvements:
- Memoized MathUtil's cumulative binomial probability function.
- Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
2013-06-09 11:26:52 -04:00
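Memoizing a cumulative binomial probability, as commit 00c06e9e52 does for MathUtils, can be illustrated like this. The signature below is an assumption made for the example and is not the GATK's; the cache is the standard-library `functools.lru_cache`.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def cumulative_binomial(n, k_start, k_end, p=0.5):
    """P(k_start <= X <= k_end) for X ~ Binomial(n, p); results are cached by argument."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(k_start, k_end + 1)
    )

first = cumulative_binomial(10, 0, 3)
second = cumulative_binomial(10, 0, 3)  # served from the cache, not recomputed
```

The cache pays off when the same small set of arguments recurs many times; an unbounded cache is only reasonable if that set really is small.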
Eric Banks e7c69cb304 Merge pull request #261 from broadinstitute/md_ad_bugfix
Fixes for genotype-level annotations in HaplotypeCaller / UnifiedGenotyper
2013-06-06 07:21:22 -07:00
Mark DePristo 209dd64268 HaplotypeCaller now emits per-sample DP
-- Created a new annotation DepthPerSampleHC that is by default on in the HaplotypeCaller
-- The depth for the HC is the sum of the informative alleles at this site.  It's not perfect (as we cannot differentiate between reads that align over the event but aren't informative vs. those that aren't even close) but it's a pretty good proxy and it matches with the AD field (i.e., sum(AD) = DP).
-- Update MD5s
-- delivers [#48240601]
2013-06-06 09:47:32 -04:00
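The sum(AD) = DP relationship from commit 209dd64268 can be checked on a single sample column. A minimal sketch, using the FORMAT and sample strings from the record quoted in commit dd5674b3b8; the helper name is hypothetical.

```python
def depth_fields(format_field, sample_field):
    """Return (AD as a list of ints, DP as an int) for one VCF sample column."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    allele_depths = [int(depth) for depth in values["AD"].split(",")]
    return allele_depths, int(values["DP"])

allele_depths, depth = depth_fields("GT:AD:DP:GQ:PL", "0/1:1,9:10:6:360,0,6")
consistent = sum(allele_depths) == depth
```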
Mark DePristo 34bdf20132 Bugfix for bad AD values in UG/HC
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
2013-06-05 17:48:41 -04:00
Mark DePristo c8845a2b63 Merge pull request #260 from broadinstitute/md_nist_kb
Add NIST Genomes in a Bottle to NA12878 KB
2013-06-05 13:09:41 -07:00