Commit Graph

12521 Commits (191e4ca251f1b655e5dcc5e00ab38c308e3c83f9)

Author SHA1 Message Date
Mark DePristo 191e4ca251 Merge pull request #300 from broadinstitute/mc_move_qualify_intervals_to_protected
A few bug fixes to this tool now that it is in protected
2013-06-24 09:35:45 -07:00
Yossi Farjoun d8ca4d3e6d Merge pull request #299 from broadinstitute/eb_mate_fixer_confused_by_nonprimary_alignment
Another fix for the Indel Realigner that arises because of secondary alignments.
2013-06-24 06:58:27 -07:00
Eric Banks d976aae2b1 Another fix for the Indel Realigner that arises because of secondary alignments.
This time we don't accidentally drop reads (phew), but this bug does cause us not to
update the alignment start of the mate.  Fixed and added unit test to cover it.
2013-06-21 16:59:22 -04:00
Mark DePristo dee51c4189 Error out when NCT and BAMOUT are used with the HaplotypeCaller
-- Currently we don't support writing a BAM file from the HaplotypeCaller when -nct is enabled.  Check in initialize whether this is the case, and throw a UserException
2013-06-21 09:25:57 -04:00
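The guard described in the commit above — fail fast at initialization when incompatible arguments are combined — can be sketched as follows. This is a hypothetical illustration, not the actual GATK API; the class, method, and argument names are made up for the example, and `UserException` here is a stand-in for GATK's exception of the same name.

```java
// Hypothetical sketch of an initialize-time argument check: writing a bamout
// is unsupported when -nct > 1, so reject the combination with a clear
// user-facing error before any data is processed. Names are illustrative.
public class ArgumentCompatibilityCheck {
    /** Stand-in for GATK's UserException (user-correctable configuration errors). */
    public static class UserException extends RuntimeException {
        public UserException(String msg) { super(msg); }
    }

    /** Validate during initialization, before any reads are traversed. */
    public static void validate(int numCpuThreadsPerDataThread, String bamOutputPath) {
        if (numCpuThreadsPerDataThread > 1 && bamOutputPath != null) {
            throw new UserException(
                "Cannot write a BAM file from the HaplotypeCaller when -nct > 1; " +
                "rerun with a single thread or remove the BAM output argument.");
        }
    }

    public static void main(String[] args) {
        validate(1, "debug.bam");   // ok: single thread with bamout
        validate(4, null);          // ok: multi-threaded, no bamout
        try {
            validate(4, "debug.bam");
            throw new AssertionError("expected UserException");
        } catch (UserException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```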
David Roazen e03a5e9486 Update source release script in attempt to work around intermittent github issues
Github was intermittently rejecting large pushes that were in fact
fast-forward updates as being non-fast-forward. Try to prevent this
by ensuring that all refs are up-to-date and properly checked out
after branch filtering and before doing a source release.
2013-06-20 16:58:01 -04:00
David Roazen 0018af0c0a Update README file for the 2.6 release 2013-06-20 13:08:29 -04:00
Eric Banks 6977d6e2a7 Merge remote-tracking branch 'unstable/master' 2013-06-20 12:14:33 -04:00
Eric Banks 9f979fdc81 Merge pull request #297 from broadinstitute/md_vcfversion2
Better GATK version and command line output
2013-06-20 09:11:36 -07:00
Mark DePristo fdfe4e41d5 Better GATK version and command line output
-- Previous version emitted command lines that look like:

##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."

the new version additionally records when the GATK was run and the GATK version, in a nicer format:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">

 -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
 ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">

 -- Removed the ProtectedEngineFeaturesIntegrationTest
 -- Actual unit tests for these features!
2013-06-20 11:19:13 -04:00
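The `##GATKCommandLine=<...>` header lines shown in the commit above can be composed with a simple formatter. This sketch only mimics the example output's layout; it is not the GATK's actual header-writing code, and the class and method names are invented for illustration.

```java
// Illustrative formatter for the new-style VCF header line,
// ##GATKCommandLine=<ID=...,Version=...,Date="...",Epoch=...,CommandLineOptions="...">
// Mirrors the examples in the commit message; not the real GATK implementation.
public class CommandLineHeader {
    public static String format(String id, String version, String date,
                                long epoch, String options) {
        return String.format(
            "##GATKCommandLine=<ID=%s,Version=%s,Date=\"%s\",Epoch=%d,CommandLineOptions=\"%s\">",
            id, version, date, epoch, options);
    }

    public static void main(String[] args) {
        System.out.println(format("HaplotypeCaller", "2.5-206-gbc7be2b",
            "Thu Jun 20 11:09:01 EDT 2013", 1371740941197L, "lots of stuff"));
    }
}
```

Because each tool appends its own line, a VCF processed by several walkers carries one `##GATKCommandLine` entry per step, giving the running record described above.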
Mark DePristo 701d70401f Merge pull request #296 from broadinstitute/md_pubprotfix
Fix public / protected dependency
2013-06-19 17:17:21 -07:00
Mark DePristo 0672ac5032 Fix public / protected dependency 2013-06-19 19:42:09 -04:00
Eric Banks 74415a6a2a Merge pull request #292 from broadinstitute/vrr_analyzeCovariates
Added the AnalyzeCovariates tool to generate BQSR quality assessment plots.
2013-06-19 13:26:59 -07:00
Valentin Ruano-Rubio 1f8282633b Removed plot generation from the BaseRecalibration software
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table

Implementation:

* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
2013-06-19 14:47:56 -04:00
Valentin Ruano-Rubio 08f92bb6f9 Added AnalyzeCovariates tool to generate BQSR quality assessment plots.
Implementation details:

* Added tool class *.AnalyzeCovariates
* Added a convenient addAll method to Utils to add the elements of an array.
* Added parameter comparison methods to the RecalibrationArgumentCollection class in order to verify that multiple input recalibration reports are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
2013-06-19 14:38:02 -04:00
Mark DePristo fb114e34fe Merge pull request #295 from broadinstitute/dr_remove_PrintReads_ds_argument
PrintReads: remove -ds argument
2013-06-19 10:55:10 -07:00
droazen 573ecadecc Merge pull request #294 from broadinstitute/dr_handle_zero_length_cigar_elements
SAMDataSource: always consolidate cigar strings into canonical form
2013-06-19 10:32:22 -07:00
David Roazen 51ec5404d4 SAMDataSource: always consolidate cigar strings into canonical form
-Collapses zero-length and repeated cigar elements, neither of which
 can necessarily be handled correctly by downstream code (like LIBS).

-Consolidation is done before read filters, because not all read filters
 behave correctly with non-consolidated cigars.

-Examined other uses of consolidateCigar() throughout the GATK, and
 found them to not be redundant with the new engine-level consolidation
 (they're all on artificially-created cigars in the HaplotypeCaller
 and SmithWaterman classes)

-Improved comments in SAMDataSource.applyDecoratingIterators()

-Updated MD5s; differences were examined and found to be innocuous

-Two tests: -Unit test for ReadFormattingIterator
            -Integration test for correct handling of zero-length
             cigar elements by the GATK engine as a whole
2013-06-19 13:29:01 -04:00
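The consolidation described in the commit above — collapsing zero-length elements and merging repeated operators into canonical form — can be sketched on a simplified cigar representation. This stands in for the real `consolidateCigar()` in the SAM libraries; the `Element` type here is an invented simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of cigar consolidation: drop zero-length elements and merge
// adjacent elements sharing an operator, producing the canonical form that
// downstream code (like LIBS) expects. Simplified stand-in, not the SAM API.
public class CigarConsolidator {
    /** One cigar element: a length and an operator character (M, I, D, ...). */
    public static final class Element {
        public final int length;
        public final char op;
        public Element(int length, char op) { this.length = length; this.op = op; }
        @Override public String toString() { return length + "" + op; }
    }

    public static List<Element> consolidate(List<Element> in) {
        List<Element> out = new ArrayList<>();
        for (Element e : in) {
            if (e.length == 0) continue;  // drop zero-length elements
            if (!out.isEmpty() && out.get(out.size() - 1).op == e.op) {
                // merge repeated operators into a single element
                Element last = out.remove(out.size() - 1);
                out.add(new Element(last.length + e.length, e.op));
            } else {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Element> raw = new ArrayList<>();
        raw.add(new Element(10, 'M'));
        raw.add(new Element(0, 'I'));  // zero-length: removed
        raw.add(new Element(5, 'M'));  // now adjacent to 10M: merged to 15M
        raw.add(new Element(2, 'D'));
        System.out.println(consolidate(raw));  // [15M, 2D]
    }
}
```

Note how removing the zero-length `0I` makes the two `M` elements adjacent, which is why both rules must be applied in one pass.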
David Roazen 23ee192d5e PrintReads: remove -ds argument
-This argument was completely redundant with the engine-level -dfrac
 argument.

-Could produce unintended consequences if used in conjunction with
 engine-level downsampling arguments.
2013-06-19 13:22:44 -04:00
David Roazen 0be788f0f9 Fix typo in snpEff documentation 2013-06-19 13:15:24 -04:00
chartl a3d6ad55f9 Merge pull request #271 from broadinstitute/chartl_extend_genotypeconcordance_documentation
Extend Genotype Concordance Documentation
2013-06-19 09:03:05 -07:00
Chris Hartl af275fdf10 Extend the documentation of GenotypeConcordance to include notes about Monomorphic and Filtered VCF records.
Address Geraldine's comments - information on moltenization and explanation of fields

Fix paren
2013-06-19 12:01:58 -04:00
amilev 28a8d74290 Merge pull request #293 from broadinstitute/md_catvariants
CatVariants accepts reference files ending in any standard extension
2013-06-19 08:36:58 -07:00
Mark DePristo 15171c07a8 CatVariants accepts reference files ending in any standard extension
-- [resolves #49339235] Make CatVariants accept reference files ending in .fa (not only .fasta)
2013-06-19 11:10:36 -04:00
MauricioCarneiro 6a5502c94a Merge pull request #289 from broadinstitute/md_fix_bq
Bugfix: defaultBaseQualities actually works now
2013-06-18 11:58:39 -07:00
delangel 1c400e8f8e Merge pull request #291 from broadinstitute/gda_new_hmm_in_ug
Swapping in logless Pair HMM for default usage with UG:
2013-06-18 07:07:57 -07:00
Guillermo del Angel f176c854c6 Swapping in logless Pair HMM for default usage with UG:
-- Changed default HMM model.
-- Removed check.
-- Changed md5's: PL's in the high 100s change by a point or two due to new implementation.
-- Resulting performance improvement is about 30 to 50% less runtime when using -glm INDEL.
2013-06-18 10:06:27 -04:00
Mark DePristo 4c482eb0f0 Merge pull request #290 from broadinstitute/rp_pruning_priority_queue
Adding new pruning parameter to ReadThreadingAssembler
2013-06-17 17:16:00 -07:00
Ryan Poplin 8511c4385c Adding new pruning parameter to ReadThreadingAssembler
-- numPruningSamples allows one to specify that the minPruning factor must be met by this many samples for a path to be considered good (e.g. seen twice in three samples). By default this is just one sample.
-- adding unit test to test this new functionality
2013-06-17 16:46:40 -04:00
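The pruning rule described above ("the minPruning factor must be met by this many samples", e.g. seen twice in three samples) amounts to a small counting criterion. A minimal sketch of that criterion, not the ReadThreadingAssembler implementation; names are illustrative.

```java
// Sketch of the numPruningSamples criterion: a path is kept only if at least
// numPruningSamples samples each support it with multiplicity >= minPruning.
// With numPruningSamples = 1 (the default) this reduces to the old behavior.
public class PruningCriterion {
    /**
     * @param perSampleSupport  number of supporting observations per sample
     * @param minPruning        minimum multiplicity a sample must contribute
     * @param numPruningSamples how many samples must meet minPruning
     */
    public static boolean keepPath(int[] perSampleSupport, int minPruning, int numPruningSamples) {
        int samplesMeetingThreshold = 0;
        for (int support : perSampleSupport)
            if (support >= minPruning) samplesMeetingThreshold++;
        return samplesMeetingThreshold >= numPruningSamples;
    }

    public static void main(String[] args) {
        // require support >= 2 in at least 2 of the samples
        System.out.println(keepPath(new int[]{2, 2, 0}, 2, 2)); // true
        System.out.println(keepPath(new int[]{2, 1, 0}, 2, 2)); // false
    }
}
```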
delangel a6a58cbc78 Merge pull request #288 from broadinstitute/gda_more_ancient_dna_fixes
Feature requested by Reich lab and Paavo lab in Leipzig for ancient DNA ...
2013-06-17 13:04:21 -07:00
Mark DePristo cb5b1c3c34 Create README.md 2013-06-17 16:03:45 -03:00
Mark DePristo 7b22467148 Bugfix: defaultBaseQualities actually works now
-- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly.  Added integration tests to ensure this continues to work.
-- [delivers #49538319]
2013-06-17 14:37:27 -04:00
Guillermo del Angel f6025d25ae Feature requested by Reich lab and Paavo lab in Leipzig for ancient DNA processing:
-- When doing cross-species comparisons and studying population history with ancient DNA data, some measure of confidence that doesn't depend on the reference base is needed at every single site, even in a naive per-site SNP mode. Old versions of GATK emitted GQ and PL values at reference sites, but those values were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, which, if enabled, will:
a) Emit all 3 ALT snp alleles in the ALT column.
b) Emit all corresponding 10 PL values.
It's up to the user to process these PL values downstream to make sense of them. Note that, in order to follow the VCF spec, the QUAL field in a reference call with non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields.
-- Tweaks and fixes to processing pipelines for Reich lab.
2013-06-17 13:21:09 -04:00
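Why "all 3 ALT snp alleles" in point (a) implies exactly 10 PL values in point (b): with the reference base plus three alternates there are 4 alleles, and the VCF spec's GL/PL convention enumerates all unordered diploid genotypes, of which there are n(n+1)/2. A one-line illustration of that count:

```java
// For a diploid site with n alleles, the number of unordered genotypes
// (and hence PL values, per the VCF spec's GL ordering) is n*(n+1)/2.
public class PlCount {
    public static int diploidGenotypeCount(int numAlleles) {
        return numAlleles * (numAlleles + 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(diploidGenotypeCount(4)); // REF + 3 ALTs -> 10 PLs
        System.out.println(diploidGenotypeCount(2)); // biallelic site -> 3 PLs
    }
}
```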
Mark DePristo fce448cc9e Merge pull request #287 from broadinstitute/md_gzip_vcf_nt
Bugfix: allow gzip VCF output in multi-threaded GATK output
2013-06-17 09:39:37 -07:00
Mark DePristo b69d210255 Bugfix: allow gzip VCF output in multi-threaded GATK output
-- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now temporary VCFs aren't compressed, even if the final VCF is.  Added an integration test to ensure this behavior works going forward.
-- [delivers #47399279]
2013-06-17 12:39:18 -04:00
delangel 485ceb1e12 Merge pull request #283 from broadinstitute/md_beagleoutput
Simpler FILTER and info field encoding for BeagleOutputToVCF
2013-06-17 09:31:03 -07:00
Mark DePristo 5b1a472d2c Merge pull request #286 from broadinstitute/eb_add_tiers_to_KBconsensus
Added 2 new fields to the MongoVariantContext: confidence and isComplex.
2013-06-17 08:38:57 -07:00
Mark DePristo ee78927bdb Merge pull request #279 from broadinstitute/eb_make_rms_mq_work_with_rr
Fixes to several of the annotations for reduced reads (and other issues)...
2013-06-16 09:48:19 -07:00
Eric Banks e48f754478 Fixes to several of the annotations for reduced reads (and other issues).
1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads.

2. The rank sum tests should not be using reduced reads (because they do not represent distinct observations).

3. Fixed a massive bug in the BaseQualityRankSumTest annotation!  It was not using the base qualities but rather
the read likelihoods?!

Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values.
Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and
makes insignificant distributions look significant (so it can falsely cause the filtering of good sites).

Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.
2013-06-16 01:18:20 -04:00
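Point 1 above — making RMSMappingQuality account for the fact that a reduced read stands in for multiple original reads — amounts to weighting each squared mapping quality by the read's representative count. A minimal sketch under that interpretation; the method and parameter names are invented for illustration, not the annotation's actual API.

```java
// Sketch of a reduced-read-aware RMS mapping quality: each read's squared MQ
// is weighted by how many original reads it represents (1 for a normal read),
// so a reduced read representing 9 originals counts 9 times in the RMS.
public class WeightedRmsMappingQuality {
    public static double rms(int[] mappingQualities, int[] representativeCounts) {
        long sumOfSquares = 0;
        long totalReads = 0;
        for (int i = 0; i < mappingQualities.length; i++) {
            sumOfSquares += (long) representativeCounts[i]
                          * mappingQualities[i] * mappingQualities[i];
            totalReads += representativeCounts[i];
        }
        return Math.sqrt((double) sumOfSquares / totalReads);
    }

    public static void main(String[] args) {
        // one reduced read representing 9 originals at MQ 60, one normal read at MQ 20
        System.out.println(rms(new int[]{60, 20}, new int[]{9, 1}));
    }
}
```

Ignoring the counts (treating both reads equally) would pull the RMS far lower, which is exactly the kind of distortion the commit describes for the rank sum tests as well.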
Eric Banks 9ec71bba26 Added 2 new fields to the MongoVariantContext: confidence and isComplex.
IsComplex will be used to designate calls as representing complex events which have multiple
correct allele representations.  Then call sets can get points for including them but will
not get penalized for missing them (because they may have used a different representation).
This is currently the biggest bane when trying to characterize FNs.

The confidence will be used to refactor the consensus making algorithm for the truth status
of the NA12878 KB.  The previous version allowed for 2 tiers: reviews and everything else.
But that is problematic when some of the input sets are of higher quality than others
because when they disagree the calls become discordant and we lose that information.
The new framework will allow each call to have its own associated confidence.  Then when
determining the consensus truth status we probabilistically calculate it from the
various confidences, so that nothing is hard coded in anymore.

Note that I added some unit tests to ensure the outcome that I expect for various scenarios
and then implemented a very rough version of the estimator that successfully produced those
outcomes.

HOWEVER, THIS IS NOT COMPLETE AND NEITHER FUNCTIONALITY IS HOOKED UP AT ALL.
Rather, this is an interim commit.  The only goal here is to get these fields added to the MVC
for the upcoming release so that Jacob (who prefers to work with stable) can add the
necessary functionality to IGV for us.
2013-06-16 00:31:16 -04:00
droazen 4151753718 Merge pull request #285 from broadinstitute/dr_james_warren_fasta_suffix_bugfix
deducing dictionary path should not use global find and replace
2013-06-14 16:57:10 -07:00
James Warren f46f7d9b23 deducing dictionary path should not use global find and replace
Signed-off-by: David Roazen <droazen@broadinstitute.org>
2013-06-14 19:15:27 -04:00
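The bug fixed above is easy to reproduce: deriving the `.dict` path from a fasta path with a global find-and-replace corrupts any path whose directory names also contain the extension. Replacing only the trailing suffix is safe. A minimal illustration of the failure mode, not the actual GATK code; the path is made up.

```java
// Deriving a sequence-dictionary path from a fasta path. A global replace of
// ".fasta" mangles paths like "/data/hg19.fasta.dir/hg19.fasta"; replacing
// only the final extension does not. Illustrative example, not the real fix.
public class DictPath {
    /** Buggy: replaces every occurrence of ".fasta" in the path. */
    public static String buggy(String fastaPath) {
        return fastaPath.replace(".fasta", ".dict");
    }

    /** Fixed: swap only the final extension. */
    public static String fixed(String fastaPath) {
        return fastaPath.substring(0, fastaPath.lastIndexOf('.')) + ".dict";
    }

    public static void main(String[] args) {
        String p = "/data/hg19.fasta.dir/hg19.fasta";
        System.out.println(buggy(p)); // /data/hg19.dict.dir/hg19.dict   (wrong directory!)
        System.out.println(fixed(p)); // /data/hg19.fasta.dir/hg19.dict
    }
}
```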
Mark DePristo 52677429a0 Merge pull request #284 from broadinstitute/dr_fewer_stranded_temp_files
Reduce number of leftover temp files in GATK runs
2013-06-14 13:06:28 -07:00
Mark DePristo 1677a0a458 Simpler FILTER and info field encoding for BeagleOutputToVCF
-- The previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE.  So if you had an A/C SNP in the original file and BEAGLE thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field.  This obviously would cause problems for indels, and so the tool was blowing up in this case.  Now the tool sets the FILTER field to BGL_SET_TO_MONOMORPHIC and sets the INFO field annotation OriginalAltAllele to A instead.  This works in general with any type of allele.
 -- Here's an example output line from the previous and current versions:
 old: 20    64150   rs7274499       C       .       3041.68 BGL_RM_WAS_A    AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88
 new: 20    64062   .       G       .       100.39  BGL_SET_TO_MONOMORPHIC  AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15
-- update MD5s to reflect these changes
-- [delivers #50847721]
2013-06-14 15:56:13 -04:00
David Roazen d167292688 Reduce number of leftover temp files in GATK runs
-WalkerTest now deletes *.idx files on exit

-ArtificialBAMBuilder now deletes *.bai files on exit

-VariantsToBinaryPed walker now deletes its temp files on exit
2013-06-14 15:56:03 -04:00
Mark DePristo b72880cc94 Merge pull request #282 from broadinstitute/md_gatklogs_gitversions
Use git hash to lookup versions when necessary in analyzeRunReports.py
2013-06-14 12:39:54 -07:00
Mark DePristo 20bb4902a3 Use git hash to lookup versions when necessary in analyzeRunReports.py 2013-06-14 15:31:25 -04:00
Mark DePristo 50ea098c11 Merge pull request #281 from broadinstitute/md_gatklogs
Update utilities to get GATKRunReports
2013-06-14 10:00:16 -07:00
Ryan Poplin c4e508a71f Merge pull request #275 from broadinstitute/md_fragment_with_pcr
Improvements to HaplotypeCaller and NA12878 KB
2013-06-14 09:32:26 -07:00
Mark DePristo a057f37331 Update utilities to get GATKRunReports
-- Critical bugfix: the GATK run reports magically changed names from something like GATK-run-report to GATKRunReport in GATK 2.4.  All GATK logs from 2.4 onwards were being eaten by the scripts that download logs, so actual GATK usage is much, much higher than our logs have suggested.  Looking forward to seeing some real numbers.  Unfortunately the error occurred so early in the downloading process that we actually deleted these logs, so they cannot be recovered
-- Added a step in the downloader that archives the raw, unprocessed files so we can recover from such problems in the future
-- The s3 download scripts now download to /local/dev/GATKLogs, so they will only work on gsa4, but that's fine as it's better than taking forever to get the logs to the isilon.
-- Turn off some crazy debugging output from the downloader that was actually preventing me from seeing the issue each night
-- Make analyzeRunReports.py robust to svn version abominations
-- Use python-2.6 in runGATKReport.csh
2013-06-14 10:17:32 -04:00
droazen ac346a93ba Merge pull request #278 from broadinstitute/md_gatk_version_in_vcf
Emit the GATK version number in the VCF header
2013-06-13 13:22:20 -07:00