Commit Graph

12488 Commits (cb5b1c3c343bc3a02764746f66d91ff7bb2e9975)

Author SHA1 Message Date
Mark DePristo cb5b1c3c34 Create README.md 2013-06-17 16:03:45 -03:00
Mark DePristo fce448cc9e Merge pull request #287 from broadinstitute/md_gzip_vcf_nt
Bugfix: allow gzip VCF output in multi-threaded GATK output
2013-06-17 09:39:37 -07:00
Mark DePristo b69d210255 Bugfix: allow gzip VCF output in multi-threaded GATK output
-- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is.  Added integrationtest to ensure this behavior works going forward.
-- [delivers #47399279]
2013-06-17 12:39:18 -04:00
delangel 485ceb1e12 Merge pull request #283 from broadinstitute/md_beagleoutput
Simpler FILTER and info field encoding for BeagleOutputToVCF
2013-06-17 09:31:03 -07:00
Mark DePristo 5b1a472d2c Merge pull request #286 from broadinstitute/eb_add_tiers_to_KBconsensus
Added 2 new fields to the MongoVariantContext: confidence and isComplex.
2013-06-17 08:38:57 -07:00
Mark DePristo ee78927bdb Merge pull request #279 from broadinstitute/eb_make_rms_mq_work_with_rr
Fixes to several of the annotations for reduced reads (and other issues)...
2013-06-16 09:48:19 -07:00
Eric Banks e48f754478 Fixes to several of the annotations for reduced reads (and other issues).
1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads.

2. The rank sume tests should not be using reduced reads (because they do not represent distinct observations).

3. Fixed a massive bug in the BaseQualityRankSumTest annotation!  It was not using the base qualities but rather
the read likelihoods?!

Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values.
Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and
makes insignificant distributions look significant (so it can falsely cause the filtering of good sites).

Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.
2013-06-16 01:18:20 -04:00
Eric Banks 9ec71bba26 Added 2 new fields to the MongoVariantContext: confidence and isComplex.
IsComplex will be used to designate calls as representing complex events which have multiple
correct allele representations.  Then call sets can get points for including them but will
not get penalized for missing them (because they may have used a different representation).
This is currently the biggest bane when trying to characterize FNs these days.

The confidence will be used to refactor the consensus making algorithm for the truth status
of the NA12878 KB.  The previous version allowed for 2 tiers: reviews and everything else.
But that is problematic when some of the input sets are of higher quality than others
because when they disagree the calls become discordant and we lose that information.
The new framework will allow each call to have its own associated confidence.  Then when
determining the consensus truth status we probabilistically calculate it from the
various confidences, so that nothing is hard coded in anymore.

Note that I added some unit tests to ensure the outcome that I expect for various scenarios
and then implemented a very rough version of the estimator that successfully produced those
outcomes.

HOWEVER, THIS IS NOT COMPLETE AND NEITHER FUNCTIONALITY IS HOOKED UP AT ALL.
Rather, this is an interim commit.  The only goal here is to get these fields added to the MVC
for the upcoming release so that Jacob (who prefers to work with stable) can add the
necessary functionality to IGV for us.
2013-06-16 00:31:16 -04:00
droazen 4151753718 Merge pull request #285 from broadinstitute/dr_james_warren_fasta_suffix_bugfix
deducing dictionary path should not use global find and replace
2013-06-14 16:57:10 -07:00
James Warren f46f7d9b23 deducing dictionary path should not use global find and replace
Signed-off-by: David Roazen <droazen@broadinstitute.org>
2013-06-14 19:15:27 -04:00
Mark DePristo 52677429a0 Merge pull request #284 from broadinstitute/dr_fewer_stranded_temp_files
Reduce number of leftover temp files in GATK runs
2013-06-14 13:06:28 -07:00
Mark DePristo 1677a0a458 Simpler FILTER and info field encoding for BeagleOutputToVCF
-- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE.  So if you had a A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field.  This obviously would cause problems for indels, as so the tool was blowing up in this case.  Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead.  This works in general with any type of allele.
 -- Here's an example output line from the previous and current versions:
 old: 20    64150   rs7274499       C       .       3041.68 BGL_RM_WAS_A    AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88
 new: 20    64062   .       G       .       100.39  BGL_SET_TO_MONOMORPHIC  AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15
-- update MD5s to reflect these changes
-- [delivers #50847721]
2013-06-14 15:56:13 -04:00
David Roazen d167292688 Reduce number of leftover temp files in GATK runs
-WalkerTest now deletes *.idx files on exit

-ArtificialBAMBuilder now deletes *.bai files on exit

-VariantsToBinaryPed walker now deletes its temp files on exit
2013-06-14 15:56:03 -04:00
Mark DePristo b72880cc94 Merge pull request #282 from broadinstitute/md_gatklogs_gitversions
Use git hash to lookup versions when necessary in analyzeRunReports.py
2013-06-14 12:39:54 -07:00
Mark DePristo 20bb4902a3 Use git hash to lookup versions when necessary in analyzeRunReports.py 2013-06-14 15:31:25 -04:00
Mark DePristo 50ea098c11 Merge pull request #281 from broadinstitute/md_gatklogs
Update utilities to get GATKRunReports
2013-06-14 10:00:16 -07:00
Ryan Poplin c4e508a71f Merge pull request #275 from broadinstitute/md_fragment_with_pcr
Improvements to HaplotypeCaller and NA12878 KB
2013-06-14 09:32:26 -07:00
Mark DePristo a057f37331 Update utilities to get GATKRunReports
-- Critical bugfix: the GATK run reports magically changed names from something like GATK-run-report to GATKRunReport in GATK 2.4.  All GATK logs from 2.4 onwards were being eaten by the scripts that download logs, so the GATK usage is actually much much higher than our logs have suggested.  Looking forward to seeing some real numbers.  Unfortunately the error occurred so early in the downloading process that we actually deleted away these logs, so they cannot be recovered
-- Added a step in the downloader that archives the raw, unprocessed files so we can recover from such problems in the future
-- The s3 download scripts now download to /local/dev/GATKLogs so will only work on gsa4, but this is ok as this is better than taking forever to get the logs to the isilon.
-- Turn off some crazy debugging output from the downloader that was actually masking me from seeing the issue each night
-- Make analyzeRunReports.py robust to svn version abominations
-- Use python-2.6 in runGATKReport.csh
2013-06-14 10:17:32 -04:00
droazen ac346a93ba Merge pull request #278 from broadinstitute/md_gatk_version_in_vcf
Emit the GATK version number in the VCF header
2013-06-13 13:22:20 -07:00
Mark DePristo 908183aba7 Merge pull request #277 from broadinstitute/dr_fix_com_sun_dependency
Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only
2013-06-13 13:12:45 -07:00
David Roazen f9c986be74 Remove com.sun.javadoc.* dependencies from the GATK proper, and isolate them for doclet use only
Problem:
Classes in com.sun.javadoc.* are non-standard. Since we can't depend on their availability for
all users, the GATK proper should not have any runtime dependencies on this package.

Solution:
-Isolate com.sun.javadoc.* dependencies in a DocletUtils class for use only by doclets. The
 only users who need to run our doclets are those who compile from source, and they
 should be competent enough to figure out how to resolve a missing com.sun.* dependency.

-HelpUtils now contains no com.sun.javadoc.* dependencies and can be safely used by walkers/other
 tools.

-Added comments with instructions on when it is safe to use DocletUtils vs. HelpUtils

[delivers #51450385]
[delivers #50387199]
2013-06-13 15:52:41 -04:00
Mark DePristo 74f311c973 Emit the GATK version number in the VCF header
-- Looks like ##GATKVersion=2.5-159-g3f91d93 in the VCF header line
-- delivers [#51595305]
2013-06-13 15:46:16 -04:00
Mark DePristo d93bed5d61 Merge pull request #276 from broadinstitute/md_gatkreport_cleanup
Remove STANDARD option from GATKRunReport
2013-06-13 12:40:57 -07:00
Mark DePristo 6232db3157 Remove STANDARD option from GATKRunReport
-- AWS is now the default.  Removed old code the referred to the STANDARD type.  Deleted unused variables and functions.
2013-06-13 15:18:28 -04:00
Mark DePristo dd5674b3b8 Add genotyping accuracy assessment to AssessNA12878
-- Now table looks like:

Name     VariantType  AssessmentType           Count
variant  SNPS         TRUE_POSITIVE              1220
variant  SNPS         FALSE_POSITIVE                0
variant  SNPS         FALSE_NEGATIVE                1
variant  SNPS         TRUE_NEGATIVE               150
variant  SNPS         CALLED_NOT_IN_DB_AT_ALL       0
variant  SNPS         HET_CONCORDANCE          100.00
variant  SNPS         HOMVAR_CONCORDANCE        99.63
variant  INDELS       TRUE_POSITIVE               273
variant  INDELS       FALSE_POSITIVE                0
variant  INDELS       FALSE_NEGATIVE               15
variant  INDELS       TRUE_NEGATIVE                79
variant  INDELS       CALLED_NOT_IN_DB_AT_ALL       2
variant  INDELS       HET_CONCORDANCE           98.67
variant  INDELS       HOMVAR_CONCORDANCE        89.58

-- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does it's best to simply match the genotype after subsetting to a set of alleles.  So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A.  This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB.  Add lots of unit tests for this functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:

20      10769255        .       A       ATGTG   165.73  .       ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE     GT:AD:DP:GQ:PL  0/1:1,9:10:6:360,0,6

Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function.  Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
2013-06-13 15:05:32 -04:00
Mark DePristo 33720b83eb No longer merge overlapping fragments from HaplotypeCaller
-- Merging overlapping fragments turns out to be a bad idea.  In the case where you can safely merge the reads you only gain a small about of overlapping kmers, so the potential gains are relatively small.  That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, and you artificially construct reads that look like the reference but actually may carry a larger true insertion w.r.t. the reference.  Because this problem isn't limited to repetitive sequeuence, but in principle could occur in any sequence, it's just not safe to do this merging.  Best to leave haplotype construction to the assembly graph.
2013-06-13 15:05:32 -04:00
droazen fb5143a590 Merge pull request #274 from broadinstitute/md_s3_only
GATKRunReport no longer tries to use the Broad filesystem destination, r...
2013-06-13 11:32:31 -07:00
Mark DePristo dd6e252373 GATKRunReport no longer tries to use the Broad filesystem destination, rather it goes unconditionally to S3 2013-06-13 13:33:10 -04:00
Mark DePristo c837d67b2f Merge pull request #273 from broadinstitute/rp_readIsPoorlyModelled
Relaxing the constraints on the readIsPoorlyModelled function.
2013-06-13 08:40:24 -07:00
Mark DePristo 2833325d31 Merge pull request #272 from broadinstitute/rp_hc_bam_writer_uninformative_reads
HC bam writer now sets the read to MQ0 if it isn't informative
2013-06-13 08:08:45 -07:00
Ryan Poplin f44efc27ae Relaxing the constraints on the readIsPoorlyModelled function.
-- Turns out we were aggressively throwing out borderline-good reads.
2013-06-13 11:06:23 -04:00
Ryan Poplin d5f0848bd5 HC bam writer now sets the read to MQ0 if it isn't informative
-- Makes visualization of read evidence easier in IGV.
2013-06-13 10:11:54 -04:00
Eric Banks 17d3ccb03b Merge pull request #270 from broadinstitute/rp_reference_haplotype_mismatch_bug
Fixing bug with dangling tails in which the tail connects all the way ba...
2013-06-12 11:03:48 -07:00
Ryan Poplin d1f397c711 Fixing bug with dangling tails in which the tail connects all the way back to the reference source node.
-- List of vertices can't contain a source node.
2013-06-12 12:23:01 -04:00
Mark DePristo b2dc7095ab Merge pull request #267 from broadinstitute/dr_reducereads_downsampling_fix
Exclude reduced reads from elimination during downsampling
2013-06-11 13:52:28 -07:00
David Roazen 95b5f99feb Exclude reduced reads from elimination during downsampling
Problem:
-Downsamplers were treating reduced reads the same as normal reads,
 with occasionally catastrophic results on variant calling when an
 entire reduced read happened to get eliminated.

Solution:
-Since reduced reads lack the information we need to do position-based
 downsampling on them, best available option for now is to simply
 exempt all reduced reads from elimination during downsampling.

Details:
-Add generic capability of exempting items from elimination to
 the Downsampler interface via new doNotDiscardItem() method.
 Default inherited version of this method exempts all reduced reads
 (or objects encapsulating reduced reads) from elimination.

-Switch from interfaces to abstract classes to facilitate this change,
 and do some minor refactoring of the Downsampler interface (push
 implementation of some methods into the abstract classes, improve
 names of the confusing clear() and reset() methods).

-Rewrite TAROrderedReadCache. This class was incorrectly relying
 on the ReservoirDownsampler to preserve the relative ordering of
 items in some circumstances, which was behavior not guaranteed by
 the API and only happened to work due to implementation details
 which no longer apply. Restructured this class around the assumption
 that the ReservoirDownsampler will not preserve relative ordering
 at all.

-Add disclaimer to description of -dcov argument explaining that
 coverage targets are approximate goals that will not always be
 precisely met.

-Unit tests for all individual downsamplers to verify that reduced
 reads are exempted from elimination
2013-06-11 16:16:26 -04:00
Ryan Poplin e1fd3dff9a Merge pull request #268 from broadinstitute/eb_calling_accuracy_improvements_to_HC
Eb calling accuracy improvements to hc
2013-06-11 11:18:51 -07:00
Eric Banks b63cbd8cc9 Merge pull request #266 from broadinstitute/gda_read_error_correction_new
Gda read error correction new
2013-06-11 10:42:06 -07:00
Eric Banks 2c3c680eb7 Misc changes and cleanup from all previous commits in this push.
1. By default, do not include the UG CEU callset for assessment.
2. Updated md5s that are different now with all the HC changes.
2013-06-11 12:53:11 -04:00
Eric Banks dadcfe296d Reworking of the dangling tails merging code.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort.  Also, we put in a check for low-complexity of graphs
and don't let those pass through.

Added tests for this implementation that checks exact SW results and correct edges added.
2013-06-11 12:53:04 -04:00
Guillermo del Angel 55d5f2194c Read Error Corrector for haplotype assembly
Principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer but correct sequence will be supported by many reads to correct sequences will look like common k-mers. So, algorithm has 3 main steps:
1. K-mer graph buildup.
For each read in an active region, a map from k-mers to the number of times they have been seen is built.
2. Building correction map.
All "rare" k-mers that are sparse (by default, seen only once), get mapped to k-mers that are good (by default, seen at least 20 times but this is a CL argument), and that lie within a given Hamming distance (by default, =1). This map can be empty (i.e. k-mers can be uncorrectable).
3. Correction proposal
For each constituent k-mer of each read, if this k-mer is rare and maps to a good k-mer, get differing base positions in k-mer and add these to a list of corrections for each base in each read. Then, correct read at positions where correction proposal is unanimous and non-empty.

The algorithm defaults are chosen to be very stringent and conservative in the correction: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in read if all correction proposals are congruent.

By default, algorithm is disabled but can be enabled in HaplotypeCaller via the -readErrorCorrect CL option. However, at this point it's about 3x-10x more expensive so it needs to be optimized if it's to be used.
2013-06-11 12:26:24 -04:00
Eric Banks c0030f3f2d We no longer subset down to the best N haplotypes for the GL calculation.
I explain in comments within the code that this was causing problems with the marginalization over events.
2013-06-11 11:51:26 -04:00
Eric Banks c0e3874db0 Change the HC's phredScaledGlobalReadMismappingRate from 60 to 45, because Ryan and Mark told me to. 2013-06-11 11:51:26 -04:00
Eric Banks 77868d034f Do not allow the use of Ns in reads for graph construction.
Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones.
This was the cause of at least one FN where there were Ns at a SNP position.
2013-06-11 11:51:26 -04:00
Eric Banks e4e7d39e2c Fix FN problem stemming from sequence graphs that contain cycles.
Problem:
The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers.
Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final
sequence graph.  Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions.

Solution:
We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them.  If (and only if) we
can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate
a graph without cycles or attempt too many iterations.
2013-06-11 11:51:26 -04:00
Ryan Poplin 210007cd09 Merge pull request #269 from broadinstitute/rp_minor_pruning_function_name
Minor changes to docs in the graph pruning.
2013-06-11 08:24:56 -07:00
Ryan Poplin 58e354176e Minor changes to docs in the graph pruning. 2013-06-11 10:33:22 -04:00
Mark DePristo c7836ec746 Merge pull request #264 from broadinstitute/md_dbsnp
Make HaplotypeCaller annotate dbSNP rsIDs
2013-06-10 13:38:20 -07:00
Mark DePristo 1c03ebc82d Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
2013-06-10 16:20:31 -04:00
Mark DePristo 0d593cff70 Refactor rsID and overlap detection in VariantOverlapAnnotator utility class
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele.  This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
2013-06-10 15:51:13 -04:00