11928 Commits (65d31ba4adfecb5cfa7efbb4e30e60c7a7975c71)

Author SHA1 Message Date
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00
Eric Banks 3ce0a32da7 Merge remote-tracking branch 'unstable/master' 2013-02-26 14:48:39 -05:00
Eric Banks 7a7adb79f1 Merge pull request #67 from broadinstitute/dr_release_script_disable_validation
Temporarily disable paranoid validation in the release scripts
2013-02-26 11:25:01 -08:00
Eric Banks 2cf0dc9939 Merge pull request #66 from broadinstitute/mc_retire_coveragebysample_walker_GSATDG-90
Archiving CoverageBySample
2013-02-26 11:19:09 -08:00
David Roazen 2b13af042d Temporarily disable paranoid validation in the release scripts
These validation steps are not strictly necessary, and would fail
with the protected repo right now, as it currently lacks a master
branch.
2013-02-26 14:17:39 -05:00
Mauricio Carneiro 711cbd3b5a Archiving CoverageBySample
This walker was not updated since 2009, and users were getting wrong answers when running it with ReduceReads. I don't want to deal with this because DiagnoseTargets does everything this walker does.
2013-02-26 13:49:00 -05:00
Ryan Poplin 357a05683d Merge pull request #65 from broadinstitute/dr_change_haplotypecaller_downsampling_settings_GSA-699
Change default downsampling coverage target for the HaplotypeCaller to 2...
2013-02-26 10:33:19 -08:00
David Roazen 8b29030467 Change default downsampling coverage target for the HaplotypeCaller to 250
-was previously set to 30, which seems far too aggressive given that with
 ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any
 pileup returned by LIBS

-250 is a more conservative default used by the UG

-can adjust down/up later based on further experiments (GSA-699 will
 remain open)

-verified with Ryan that all integration test differences are either
 innocent or represent an improvement

GSA-699
2013-02-26 09:33:25 -05:00
depristo 51d618de97 Merge pull request #62 from broadinstitute/rp_increase_max_kmer_in_assembly
The maximum kmer length is derived from the reads.
2013-02-26 05:37:02 -08:00
depristo ed5aff3702 Merge pull request #55 from broadinstitute/dr_fix_sequence_dictionary_validation_GSA-768
Sequence dictionary validation: detect problematic contig indexing differences
2013-02-25 12:39:56 -08:00
Ryan Poplin 89e2943dd1 The maximum kmer length is derived from the reads.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
2013-02-25 14:40:25 -05:00
MauricioCarneiro bd9875aff5 Merge pull request #61 from broadinstitute/dr_update_release_scripts
1. removed all directives related to gatklite (we're getting rid of this distribution)
2. adapting scripts to the new gsa-protected repository
2013-02-25 10:37:59 -08:00
David Roazen 3645ea9bb6 Sequence dictionary validation: detect problematic contig indexing differences
The GATK engine does not behave correctly when contigs are indexed
differently in the reads sequence dictionaries vs. the reference
sequence dictionary, and the inconsistently-indexed contigs are included
in the user's intervals. For example, given the dictionaries:

Reference dictionary = { chrM, chr1, chr2, ... }
BAM dictionary       = { chr1, chr2, ... }

and the interval "-L chr1", the engine would fail to correctly retrieve
the reads from chr1, since chr1 has a different index in the two dictionaries.

With this patch, we throw an exception if there are contig index differences
between the dictionaries for reads and reference, AND the user's intervals
include at least one of the mismatching contigs.

The user can disable this exception via -U ALLOW_SEQ_DICT_INCOMPATIBILITY

In all other cases, dictionary validation behaves as before.

I also added comprehensive unit tests for the (previously-untested)
SequenceDictionaryUtils class.

GSA-768 #resolve
2013-02-25 11:14:22 -05:00
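Note on 3645ea9bb6: the check described in this message can be sketched roughly as follows. This is a minimal illustration, not the actual SequenceDictionaryUtils code; the class name, method names, and the use of plain contig-name lists in place of real sequence dictionaries are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; class and method names are hypothetical,
// not the real SequenceDictionaryUtils API.
class ContigIndexCheck {

    /** Contigs present in both dictionaries but at different positions. */
    static List<String> mismatchedContigs(List<String> reference, List<String> reads) {
        List<String> mismatched = new ArrayList<>();
        for (int readsIndex = 0; readsIndex < reads.size(); readsIndex++) {
            String contig = reads.get(readsIndex);
            int referenceIndex = reference.indexOf(contig);
            if (referenceIndex != -1 && referenceIndex != readsIndex) {
                mismatched.add(contig);
            }
        }
        return mismatched;
    }

    /** Throws only when the intervals touch a mismatching contig and the override is off. */
    static void validate(List<String> reference, List<String> reads,
                         Set<String> intervalContigs, boolean allowIncompatibility) {
        if (allowIncompatibility) {
            return; // stands in for -U ALLOW_SEQ_DICT_INCOMPATIBILITY
        }
        for (String contig : mismatchedContigs(reference, reads)) {
            if (intervalContigs.contains(contig)) {
                throw new IllegalStateException(
                        "Contig " + contig + " is indexed differently in the reads and reference dictionaries");
            }
        }
    }

    public static void main(String[] args) {
        List<String> reference = Arrays.asList("chrM", "chr1", "chr2");
        List<String> bam = Arrays.asList("chr1", "chr2");
        System.out.println("Mismatched: " + mismatchedContigs(reference, bam));
        try {
            validate(reference, bam, Collections.singleton("chr1"), false);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With the dictionaries from the message, chr1 sits at index 1 in the reference and index 0 in the BAM, so an interval on chr1 is rejected unless the override is set.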
David Roazen baa3b15207 Update release scripts in preparation for open-sourcing protected 2013-02-25 10:17:16 -05:00
Eric Banks f62dd84869 Merge pull request #57 from broadinstitute/rp_bubble_traversal_merge_GSA-680
Rp bubble traversal merge gsa 680
2013-02-24 05:08:05 -08:00
Ryan Poplin 6a639c8ffc Replace Smith-Waterman alignment with the bubble traversal.
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
2013-02-22 15:42:16 -05:00
depristo 2ad559cf58 Merge pull request #59 from broadinstitute/mc_reving_testng_GSA-695
Updating TestNG to the latest version
2013-02-22 10:39:04 -08:00
depristo 50612ac981 Merge pull request #58 from broadinstitute/mc_callset_assesment_GSATDG-52
AGBT scripts, tool updates and misc
2013-02-22 07:23:59 -08:00
Mauricio Carneiro 3f901ff0e7 R scripts for coverage analysis of the genome (AGBT13)
-- script that generates a scatterplot of the poorly covered regions versus PCR+
-- script that calculates the uncovered portion of the genome
2013-02-22 10:19:01 -05:00
Mauricio Carneiro e3f01673e1 Implementation of the find and diagnose Queue script
-- Added 'uncovered intervals' output for FindCoveredIntervals
-- updated scala script to make use of it.
2013-02-22 10:19:01 -05:00
Mauricio Carneiro 15a8f6d82e Coverage analysis by variant type R script (for AGBT13) 2013-02-22 10:19:01 -05:00
Mauricio Carneiro a0b1e15dd0 Coverage distribution R script
-- plot routines for the coverage distribution analysis presented at AGBT13
2013-02-22 10:19:01 -05:00
Mauricio Carneiro 8ae2e7be4f Queue scripts to process, call and assess callsets (for AGBT13)
a quick data-processing pipeline:
-- only adding basic steps from aligned bam to recalibrated bam.
-- pared down from the multitude of options of the data processing pipeline
-- one bam in, one bam out
-- Implemented the "fly" mode to the quick processing pipeline -- because "quick" wasn't quick enough...

calling and assess pipeline:
-- takes in a list of bams
-- calls using both UG and HC with and without BQSR
-- can optionally skip bqsr if recalibration report is not provided (assumes bam already is recalibrated)

GSATDG-76
GSATDG-75
GSATDG-58
GSATDG-57
GSATDG-56
GSATDG-52
GSATDG-51
2013-02-22 10:19:01 -05:00
Mauricio Carneiro 1690be0866 Adding functionality to NA12878KB walkers
-- Added "excludeUniqueToCallsets" stratification to ExtractConsensusSites to ignore sites that are only supported by a given set of callsets.

GSATDG-53
2013-02-22 10:19:01 -05:00
Eric Banks 34ee953798 Merge pull request #56 from broadinstitute/md_relax_boundquals_checks
Relax bounds checking in QualityUtils.boundQual
2013-02-22 07:08:41 -08:00
Mauricio Carneiro 4ac50c89ad Updating TestNG to the latest version
-- changed SkipException constructors that are now private in TestNG
-- Updated build.xml to use the latest testng
-- Added guice dependency to ivy
-- Fixed broken SampleDBUnitTest

The SampleDBUnitTest was only passing before because the map comparison in the old TestNG was broken. It was comparing two DIFFERENT samples and testing for "equals"

GSA-695 #resolve
2013-02-22 09:40:23 -05:00
depristo c8a01e6569 Merge pull request #53 from broadinstitute/rp_pcrfree_bad_mappingquality_GSA-780
Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast t...
2013-02-22 05:52:50 -08:00
Mark DePristo 182c32a2b7 Relax bounds checking in QualityUtils.boundQual
-- Previous version did runtime checking that qual >= 0 but BQSR was relying on boundQual to restore -1 to 1.  So relax the bound.
2013-02-22 08:46:59 -05:00
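Note on 182c32a2b7: a rough sketch of the difference between a strict and a relaxed bound, assuming boundQual clamps a quality into the range [1, maxQual]. The method names and the clamping range are assumptions for illustration; this is not the QualityUtils source.

```java
// Hypothetical sketch of strict vs. relaxed quality bounding.
class BoundQualSketch {

    /** Strict variant: rejects negative input before clamping. */
    static byte strictBoundQual(int qual, byte maxQual) {
        if (qual < 0) {
            throw new IllegalArgumentException("qual must be >= 0 but got " + qual);
        }
        return (byte) Math.max(Math.min(qual, maxQual), 1);
    }

    /** Relaxed variant: negative input is simply clamped up to 1. */
    static byte relaxedBoundQual(int qual, byte maxQual) {
        return (byte) Math.max(Math.min(qual, maxQual), 1);
    }

    public static void main(String[] args) {
        byte maxQual = 93; // assumed upper bound for the example
        System.out.println("relaxed(-1)  = " + relaxedBoundQual(-1, maxQual));
        System.out.println("relaxed(100) = " + relaxedBoundQual(100, maxQual));
        try {
            strictBoundQual(-1, maxQual);
        } catch (IllegalArgumentException e) {
            System.out.println("strict(-1) rejected: " + e.getMessage());
        }
    }
}
```

A caller that passes -1 as a sentinel and expects 1 back works with the relaxed variant and fails with the strict one, which matches the BQSR dependency the message describes.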
Eric Banks 48c699eec6 Merge pull request #54 from broadinstitute/md_improve_kb
Many improvements to NA12878 KB
2013-02-22 05:20:06 -08:00
Mark DePristo 8ac6d3521f Vast improvements to AssessNA12878 code and functionality
-- AssessNA12878 now breaks out multi-allelics into bi-allelic components.  This means that we can properly assess multi-allelic calls against the bi-allelic KB
-- Refactor AssessNA12878, moving into assess package in KB.  Split out previously private classes in the walker itself into separate classes.  Added real docs for all of the classes.
-- Vastly expand (from 0) unit tests for NA12878 assessments
-- Allow sites only VCs to be evaluated by Assessor
-- Move utility for creating simple VCs from a list of string alleles from GATKVariantContextUtilsUnitTest to GATKVariantContextUtils
-- Assessor bugfix for discordant records at a site.  Previous version didn't handle properly the case where one had a non-matching call in the callset w.r.t. the KB, so that the KB element was eaten during the analysis.  Fixed.  UnitTested
-- See GSA-781 -- Handle multi-allelic variants in KB for more information
-- Bugfix for missing site counting in AssessNA12878.  Previous version would count N misses for every missed value at a site.  Not that this has much impact but it's worth fixing
-- UnitTests for BadSitesWriter
-- UnitTests for filtered and filtering sites in the Assessor
-- Cleanup end report generation code (simplify the code).  Note that instead of "indel" the new code will print out "INDELS"
-- Assessor DoC calculations now use LIBS and RBPs for the depth calculation.  The previous version was broken for reduced reads.  Added unit test that reads a complex reduced read example and matches the DoC of this BAM with the output of the GATK DoC tool here.
-- Added convenience constructor for LIBS using just SAMFileReader and an iterator.  It's now easy to create a LIBS from a BAM at a locus.  Added advanceToLocus function that moves the LIBS to a specific position.  UnitTested via the assessor (which isn't ideal, but is a proper test)
2013-02-21 20:43:12 -05:00
Ryan Poplin 62e14f5b58 Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores. 2013-02-21 14:34:17 -05:00
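Note on 62e14f5b58: the overflow comes from Java's byte being signed (range -128 to 127), while a SAM mapping quality can be as large as 255. The sketch below only demonstrates the narrowing behaviour; the class and method names are made up for the example, and it does not reproduce the actual fix in LikelihoodCalculationEngine.

```java
// Demonstrates how narrowing an int mapping quality to byte can overflow.
class MappingQualityOverflow {

    /** The problematic narrowing: values above 127 wrap to negative numbers. */
    static byte narrowToByte(int mappingQuality) {
        return (byte) mappingQuality;
    }

    /** Recovers the 0..255 value from a byte by masking off the sign extension. */
    static int asUnsigned(byte value) {
        return value & 0xFF;
    }

    public static void main(String[] args) {
        int[] qualities = {60, 127, 128, 255};
        for (int mq : qualities) {
            byte narrowed = narrowToByte(mq);
            System.out.println(mq + " -> (byte) " + narrowed + " -> unsigned " + asUnsigned(narrowed));
        }
    }
}
```

Keeping the value as an int, or masking with 0xFF when a byte must be read back, avoids the negative values.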
Mark DePristo 29319bf222 Improved allele trimming code in GATKVariantContextUtils
-- Now supports trimming the alleles from both the reverse and forward direction.
-- Added lots of unit tests for forward allele trimming, as well as creating VC from forward and reverse trimming.
-- Added docs and tests for the code, to bring it up to GATK spec
2013-02-21 12:01:43 -05:00
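Note on 29319bf222: one way to picture trimming from both directions is to drop bases shared by every allele, first from the end (reverse) and then from the start (forward), while keeping at least one base per allele. The sketch below works on plain strings and is an assumption about the general idea, not the GATKVariantContextUtils implementation; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of reverse and forward allele trimming on strings.
class AlleleTrimSketch {

    /** True if every allele has at least two bases and all share the same first/last base. */
    static boolean sharesBase(List<String> alleles, boolean fromEnd) {
        if (alleles.isEmpty()) {
            return false;
        }
        String first = alleles.get(0);
        if (first.length() < 2) {
            return false;
        }
        char expected = fromEnd ? first.charAt(first.length() - 1) : first.charAt(0);
        for (String allele : alleles) {
            if (allele.length() < 2) {
                return false; // never trim an allele down to nothing
            }
            char base = fromEnd ? allele.charAt(allele.length() - 1) : allele.charAt(0);
            if (base != expected) {
                return false;
            }
        }
        return true;
    }

    /** Removes one base from the end or the start of every allele. */
    static List<String> dropBase(List<String> alleles, boolean fromEnd) {
        List<String> trimmed = new ArrayList<>();
        for (String allele : alleles) {
            trimmed.add(fromEnd ? allele.substring(0, allele.length() - 1) : allele.substring(1));
        }
        return trimmed;
    }

    /** Reverse trimming first, then forward trimming. */
    static List<String> trim(List<String> alleles) {
        List<String> current = new ArrayList<>(alleles);
        while (sharesBase(current, true)) {
            current = dropBase(current, true);
        }
        while (sharesBase(current, false)) {
            current = dropBase(current, false);
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(trim(Arrays.asList("ACGTT", "ACT")));
    }
}
```

A real implementation would also have to shift the variant's start position by the number of bases removed from the front, which this sketch leaves out.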
Mark DePristo f714ecc0ae Cleanup and improvements to NA12878 KB consensus creation
-- Extract consensus creation routines into ConsensusMaker, and added unit tests
-- Consensus algorithm now only takes the last added call for any call set in the consensus.  So if you re-review a site, the consensus result is always your most recent view
-- If you have a TP and a FP, now the site is considered DISCORDANT, not a FP.  This is better.
-- Move consensus GT tests into ConsensusMakerUnitTest
-- NA12878 KB updates consensus, removing old entries when present.  The previous version of the KB update function would add duplicates when reviewing, so when you reviewed a site, there would actually be two consensus records, the previous one and the new one including your review.  The new algorithm removes old entries before adding the new consensus, so that the consensus track always reflects the most recent results
-- Don't include duplicate call set names in the consensus supporting call set name
2013-02-21 12:01:38 -05:00
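Note on f714ecc0ae: the two consensus rules in this message (only the last call per call set counts, and a TP together with an FP yields DISCORDANT) can be sketched as below. The types, enum values, and method names are hypothetical stand-ins, not the ConsensusMaker code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the consensus rules described in the commit message.
class ConsensusSketch {

    enum Truth { TRUE_POSITIVE, FALSE_POSITIVE, DISCORDANT, UNKNOWN }

    static final class Call {
        final String callSet;
        final Truth truth;

        Call(String callSet, Truth truth) {
            this.callSet = callSet;
            this.truth = truth;
        }
    }

    /** Calls must be given in the order they were added; later calls win per call set. */
    static Truth consensus(List<Call> callsInAddedOrder) {
        Map<String, Truth> latestPerCallSet = new LinkedHashMap<>();
        for (Call call : callsInAddedOrder) {
            latestPerCallSet.put(call.callSet, call.truth); // overwrites any earlier review
        }
        boolean anyTruePositive = latestPerCallSet.containsValue(Truth.TRUE_POSITIVE);
        boolean anyFalsePositive = latestPerCallSet.containsValue(Truth.FALSE_POSITIVE);
        if (anyTruePositive && anyFalsePositive) {
            return Truth.DISCORDANT;
        }
        if (anyTruePositive) {
            return Truth.TRUE_POSITIVE;
        }
        if (anyFalsePositive) {
            return Truth.FALSE_POSITIVE;
        }
        return Truth.UNKNOWN;
    }

    public static void main(String[] args) {
        List<Call> reReviewed = Arrays.asList(
                new Call("reviewerA", Truth.FALSE_POSITIVE),
                new Call("reviewerA", Truth.TRUE_POSITIVE));
        List<Call> conflicting = Arrays.asList(
                new Call("reviewerA", Truth.FALSE_POSITIVE),
                new Call("reviewerB", Truth.TRUE_POSITIVE));
        System.out.println("re-reviewed: " + consensus(reReviewed));
        System.out.println("conflicting: " + consensus(conflicting));
    }
}
```

Because the map is keyed by call set name, each name also appears only once among the supporting call sets.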
depristo 09b444de26 Merge pull request #51 from broadinstitute/eb_optimize_hc_haplotype_comparisons
Haplotype/Allele based optimizations for the HaplotypeCaller that knock ...
2013-02-21 07:16:28 -08:00
Eric Banks 6996a953a8 Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample).
These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons):
* Don't unnecessarily overload Allele.getBases() in the Haplotype class.
  * Haplotype.getBases() was calling clone() on the byte array.
* Added a constructor to Allele (and Haplotype) that takes in an Allele as input.
  * It makes a copy of the given allele without having to go through the validation of the bases (since the Allele has already been validated).
  * Rev'ed the variant jar accordingly.

For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.
2013-02-21 10:14:11 -05:00
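Note on 6996a953a8: the two optimizations can be illustrated with a small stand-in class. This is a sketch under assumptions, not the Allele or Haplotype source: SimpleAllele, the validation rule, and the choice to share the byte array between copies (which assumes the bases are never mutated) are all invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a copy constructor that skips re-validation,
// and a getter that does not clone the array on every call.
class AlleleCopySketch {

    /** Counts how often validation runs, to make the saving visible. */
    static int validations = 0;

    static class SimpleAllele {
        private final byte[] bases;

        /** Normal path: validate the bases, then keep a private copy. */
        SimpleAllele(byte[] bases) {
            validate(bases);
            this.bases = bases.clone();
        }

        /** Copy path: the source was validated already, so skip validation. */
        SimpleAllele(SimpleAllele other) {
            this.bases = other.bases; // shared; assumes bases are treated as immutable
        }

        /** Returns the internal array without a defensive clone. */
        byte[] getBases() {
            return bases;
        }

        private static void validate(byte[] bases) {
            validations++;
            for (byte b : bases) {
                if ("ACGTN".indexOf((char) b) == -1) {
                    throw new IllegalArgumentException("Unexpected base: " + (char) b);
                }
            }
        }
    }

    public static void main(String[] args) {
        SimpleAllele original = new SimpleAllele("ACGT".getBytes(StandardCharsets.US_ASCII));
        SimpleAllele copy = new SimpleAllele(original);
        System.out.println("validations run: " + validations);
        System.out.println("same backing array: " + (original.getBases() == copy.getBases()));
    }
}
```

The trade-off is that callers of getBases() must not modify the returned array, since no defensive copy protects the internal state.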
MauricioCarneiro a954cf3c01 Merge pull request #52 from broadinstitute/gg_more_gatkdocs_improvements_GSATDG-66-67 2013-02-21 06:48:35 -08:00
Geraldine Van der Auwera c3e01fea40 Added several more info types / annotations to GATKDocs
-- top-level walker type (locus, read etc)
-- parallelism options (nt or nct)
-- annotation type (for Variant Annotations)
-- downsampling settings that override engine defaults
-- reference window size
-- active region settings
-- partitionBy info
2013-02-21 03:12:40 -05:00
Eric Banks 0c34e47a87 Merge pull request #50 from broadinstitute/gg_new_ReassignOneMappingQualityFilter_GSATDG-77
New ReadFilter allows users to reassign a specific mapping quality...
2013-02-20 04:51:45 -08:00
Eric Banks 551d33686c Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761
Reduce memory footprint of SyntheticRead by replacing several Lists with...
2013-02-20 04:49:07 -08:00
Geraldine Van der Auwera e674b4a524 Added new ReadFilter that allows users to specifically reassign one single mapping quality to a different value. Useful for TopHat and other RNA-seq software users. 2013-02-20 01:24:45 -05:00
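Note on e674b4a524: the core of such a filter is a single conditional remap. The sketch below is illustrative only; the method name is hypothetical, and the values 255 and 60 are assumed example settings rather than anything stated in the commit message.

```java
// Hypothetical sketch of reassigning one specific mapping quality value.
class ReassignOneMappingQualitySketch {

    /** Returns 'to' when the quality equals 'from'; every other value is left alone. */
    static int reassign(int mappingQuality, int from, int to) {
        return mappingQuality == from ? to : mappingQuality;
    }

    public static void main(String[] args) {
        int from = 255; // assumed example: value emitted by an RNA-seq aligner
        int to = 60;    // assumed example: value the downstream tools expect
        int[] qualities = {0, 30, 255};
        for (int mq : qualities) {
            System.out.println(mq + " -> " + reassign(mq, from, to));
        }
    }
}
```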
MauricioCarneiro 76810465aa Merge pull request #40 from broadinstitute/gg_retrieve_readfilters_GSATDG-63 2013-02-19 19:42:35 -08:00
Mark DePristo 910d966428 Extend timeout of NanoScheduler deadlock tests
-- The previous timeout of 1 second was just dangerously short.  Increase the timeout to 10 seconds
2013-02-19 20:25:25 -05:00
Eric Banks 9dfdb9528b Merge pull request #49 from broadinstitute/gda_hidden_ug_args
Hide arguments related to reference sample operation in UG - for interna...
2013-02-19 16:18:32 -08:00
Eric Banks 0055a6f1cd Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774
Fix to the Indel Realigner bug described in GSA-774
2013-02-19 16:16:48 -08:00
Guillermo del Angel 5a0a9bc488 Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished. 2013-02-19 19:06:42 -05:00
depristo 334d124145 Merge pull request #48 from broadinstitute/rp_calcAlignmentByteArrayOffset_contract_GSA-772
Fix for calculating read pos rank sum test with reads that are informati...
2013-02-19 15:09:58 -08:00
Geraldine Van der Auwera faef85841b Added GATKDocs fct to indicate default Read Filters for each tool
-- Added getClazzAnnotations() as hub to retrieve various annotations values and class properties through reflection
-- Added getReadFilters() method to retrieve Read Filter annotations
-- getReadFilters() uses recursion to walk up the inheritance to also capture superclass annotations
-- getClazzAnnotations() stores collected info in doc handler root, which is unit.forTemplate in Doclet
-- Modified FreeMarker template to use the Readfilters info (displayed after arg table, before additional capabilities)
-- Tadaaa :-) #GSATDG-63 resolve
2013-02-19 16:12:29 -05:00
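Note on faef85841b: collecting annotations from a class and all of its superclasses by recursion might look like the sketch below. The @Filters annotation and the walker classes are invented for the example and are not the GATK annotation types; only the reflection calls (getSuperclass, getDeclaredAnnotation) are standard Java.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of walking up the inheritance chain to gather annotations.
class ReadFilterLookup {

    /** Stand-in annotation listing filter names; not a GATK type. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Filters {
        String[] value();
    }

    @Filters({"UnmappedReadFilter"})
    static class BaseWalker {}

    @Filters({"DuplicateReadFilter"})
    static class LocusWalker extends BaseWalker {}

    static class MyWalker extends LocusWalker {}

    /** Superclass filters come first, followed by the filters declared on the class itself. */
    static List<String> getReadFilters(Class<?> clazz) {
        if (clazz == null || clazz == Object.class) {
            return new ArrayList<>(); // top of the hierarchy: nothing to add
        }
        List<String> filters = getReadFilters(clazz.getSuperclass());
        Filters declared = clazz.getDeclaredAnnotation(Filters.class);
        if (declared != null) {
            for (String name : declared.value()) {
                if (!filters.contains(name)) {
                    filters.add(name);
                }
            }
        }
        return filters;
    }

    public static void main(String[] args) {
        System.out.println(getReadFilters(MyWalker.class));
    }
}
```

Using getDeclaredAnnotation rather than getAnnotation keeps each level's contribution separate, so the recursion, not annotation inheritance, decides what is collected.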
Mauricio Carneiro 371ea2f24c Fixed IndelRealigner reference length bug (GSA-774)
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads

GSA-774 #resolve
2013-02-19 16:00:36 -05:00
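Note on 371ea2f24c: the TreeSet -> PriorityQueue switch matters because a TreeSet treats two elements as duplicates whenever the comparator returns 0, so only one of them is kept, whereas a PriorityQueue keeps both. The sketch below uses a made-up Read class and a start-position comparator as a stand-in for picard's SAMRecordCoordinateComparator.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical sketch: distinct elements that compare as equal.
class EqualByComparator {

    static final class Read {
        final String name;
        final int start;

        Read(String name, int start) {
            this.name = name;
            this.start = start;
        }
    }

    /** Compares by start position only, so different reads can compare as equal. */
    static final Comparator<Read> BY_START = Comparator.comparingInt(r -> r.start);

    static int keptByTreeSet(List<Read> reads) {
        TreeSet<Read> set = new TreeSet<>(BY_START);
        set.addAll(reads); // reads that compare equal to an existing element are not added
        return set.size();
    }

    static int keptByPriorityQueue(List<Read> reads) {
        PriorityQueue<Read> queue = new PriorityQueue<>(BY_START);
        queue.addAll(reads); // every read is retained, still ordered by start
        return queue.size();
    }

    public static void main(String[] args) {
        List<Read> reads = Arrays.asList(
                new Read("readA", 100),
                new Read("readB", 100),
                new Read("readC", 250));
        System.out.println("TreeSet keeps:       " + keptByTreeSet(reads));
        System.out.println("PriorityQueue keeps: " + keptByPriorityQueue(reads));
    }
}
```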
Mauricio Carneiro 815028edd4 Added verbose error message to the PluginManager
-- added a logger.error with a more descriptive message of what the most likely cause of the error is

Typical error happens when a walker's global variable is not initialized properly (usually in test conditions). The old error message was very hard to understand "Could not create module because of an exception of type NullPointerException ocurred caused by exception null"
2013-02-19 16:00:35 -05:00
Alec Wysoker ab75e053da Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static
class that contains the attributes that were scattered across the several Lists.
2013-02-19 15:33:33 -05:00
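Note on ab75e053da: the refactoring pattern described here, several parallel Lists folded into one List of a small private static class, can be sketched as follows. The field names (base, count, quality) are guesses made for illustration; the message does not say which attributes SyntheticRead actually stores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one list of small value objects instead of parallel lists.
class SyntheticReadSketch {

    /** Holds the per-position attributes that would otherwise live in separate lists. */
    private static final class BaseEntry {
        final byte base;
        final byte count;
        final byte quality;

        BaseEntry(byte base, byte count, byte quality) {
            this.base = base;
            this.count = count;
            this.quality = quality;
        }
    }

    private final List<BaseEntry> entries = new ArrayList<>();

    void add(byte base, byte count, byte quality) {
        entries.add(new BaseEntry(base, count, quality));
    }

    int size() {
        return entries.size();
    }

    byte baseAt(int index) {
        return entries.get(index).base;
    }

    byte countAt(int index) {
        return entries.get(index).count;
    }

    byte qualityAt(int index) {
        return entries.get(index).quality;
    }

    public static void main(String[] args) {
        SyntheticReadSketch read = new SyntheticReadSketch();
        read.add((byte) 'A', (byte) 3, (byte) 30);
        read.add((byte) 'C', (byte) 5, (byte) 25);
        System.out.println("positions: " + read.size() + ", first base: " + (char) read.baseAt(0));
    }
}
```

Compared with parallel List<Byte> fields, this keeps one backing array and one object per position instead of one backing array per attribute, and the attributes of a position cannot drift out of step.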