Commit Graph

4331 Commits (2c854e554a8757c3b07090d1e0b556b914a680d8)

Author SHA1 Message Date
Khalid Shakir 2c854e554a Refactored maven directories and java packages replacing "sting" with "gatk".
To reduce merge conflicts, this commit modifies contents of files, while file renamings are in previous commit.
See previous commit message for list of changes.
2014-05-19 17:36:39 -04:00
Khalid Shakir 4e6d43d003 Refactored maven directories and java packages replacing "sting" with "gatk".
To reduce merge conflicts, this commit only renames files, while file modifications are in next commit.
Some updates/fixes here are actually included in the next commit.
= Maven updates
Moved artifacts to new package names:
* private/queue-private -> private/gatk-queue-private
* private/gatk-private -> private/gatk-tools-private
* public/gatk-package -> protected/gatk-package-distribution
* public/queue-package -> protected/gatk-queue-package-distribution
* protected/gatk-protected -> protected/gatk-tools-protected
* public/queue-framework -> public/gatk-queue
* public/gatk-framework -> public/gatk-tools-public
New poms for new artifacts and packages:
* private/gatk-package-internal
* private/gatk-queue-package-internal
* private/gatk-queue-extensions-internal
* protected/gatk-queue-extensions-distribution
* public/gatk-engine
Updated references to StingText.properties to GATKText.properties.
Updated ant-bridge.sh to use gatk.* properties instead of sting.*.
= Engine updates
Renaming files containing engine parts from o.b.gatk.tools to o.b.gatk.engine.
Changed package references from tools to engine for CommandLineGATK, GenomeAnalysisEngine, ReadMetrics, ReadProperties, and WalkerManager.
Changed package reference tools.phonehome to engine.phonehome.
Renamed classes *Sting* to *GATK*, such as ReviewedGATKException.
= Test updates
Moved gatk example resources.
Moved test engine files from tools to engine packages.
Moved resources for phonehome to proper package.
Moved test classes under o.b.gatk into packages:
* o.b.g.utils.{BaseTest,ExampleToCopyUnitTest,GATKTextReporter,MD5DB,MD5Mismatch,TestNGTestTransformer}
* o.b.g.engine.walkers.WalkerTest
Updated package names in DependencyAnalyzerOutputLoaderUnitTest's data.
= Queue updates
Moving queue scripts to location where generated extensions can be used.
Renamed *.q to *.scala, updating licenses previously missed by git hooks.
Moved queue extensions to new artifact gatk-queue-extensions.
Fixed import statments frequently merge-conflicting on FullProcessingPipeline.scala.
= BWA
Added README on how to obtain and include bwa as a library.
Updated libbwa build.
Fixed packaged names under bwa/java implementation.
Updated contents of BWCAligner native implementation.
= Other fixes
Don't duplicate the resource bundle entries by both unpacking *and* appending.
(partial fix) Staged engine and utils poms to build GATKText.properties, once Utils random generator dependency on GATK engine is fixed.
Re-enabled custom testng listeners/reporters and moved testng dependencies to the gatk-root.
Updated comments referencing Sting with GATK.
Moved a couple untangled classes from gatk-tools-public to gatk-utils and gatk-engine.
2014-05-19 16:43:47 -04:00
Phillip Dexheimer a5abc079dc Revised final Queue status line to display number of jobs in each state when the script fails
* Addresses PT 61552466
* Included a simple scala script in private/testdata that will always fail
2014-05-15 21:30:44 -04:00
jmthibault79 78560212d0 Merge pull request #630 from broadinstitute/pd_blank_lines_in_listfile
Allow blank lines in a (non-BAM) list file
2014-05-14 11:32:44 -04:00
droazen 8297cd1a1a Merge pull request #619 from broadinstitute/pd_intervalmerge_doc
Made IntervalSharder respect the IntervalMergingRule specified on the co...
2014-05-14 11:22:18 -04:00
Phillip Dexheimer 77449961ab Allow blank lines in a (non-BAM) list file
* Addresses PT Bug 67841052
 * Added Unit Test
2014-05-13 23:14:15 -04:00
Khalid Shakir 67e44985b1 Java/Scala imports updated for new package names.
Fourth of four commits for picard/htsjdk package rename.
2014-05-08 19:13:31 +08:00
Khalid Shakir cc3f1f2b96 Revved picard libraries.
Third of four commits for picard/htsjdk package rename.
2014-05-08 19:13:27 +08:00
Khalid Shakir a894a2dddb Updates to GATK classes and POMs that need updating, plus RodSystemValidation md5 updates.
GATK classes accessing package protected htsjdk classes changed to new package names.
POMs updated to support merging of sam/tribble/variant -> htsjdk and changes to picard artifact.
RodSystemValidation outputs changed due to variant codec packages changes, requiring test md5 updates.
Second of four commits for picard/htsjdk package rename.
2014-05-08 19:13:27 +08:00
Khalid Shakir 3ce3e27aa1 Moved GATK classes and POMs that will need updating.
GATK classes accessing package protected htsjdk classes will need new package names.
POMs will merge sam/tribble/variant into htsjdk.
Move only, contents updated in next commit.
First of four commits for picard/htsjdk package rename.
2014-05-08 19:13:27 +08:00
Laura Gauthier bf7b97393e Add ability to output to a file discordant loci and their respective genotypes for each sample 2014-05-07 10:12:45 -04:00
Karthik Gururaj d9c489f928 Removed scary warning messages for VectorPairHMM 2014-05-06 10:59:24 -07:00
Karthik Gururaj f6ea25b4d1 Parallel version of the JNI for the PairHMM
The JNI treats shared memory as critical memory and doesn't allow any
parallel reads or writes to it until the native code finishes. This is
not a problem *per se* it is the right thing to do, but we need to
enable **-nct** when running the haplotype caller and with it have
multiple native PairHMM running for each map call.

Move to a copy based memory sharing where the JNI simply copies the
memory over to C++ and then has no blocked critical memory when running,
allowing -nct to work.

This version is slightly (almost unnoticeably) slower with -nct 1, but
scales better with -nct 2-4 (we haven't tested anything beyond that
because we know the GATK falls apart with higher levels of parallelism

* Make VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
* Changed version number in pom.xml under public/VectorPairHMM
* VectorPairHMM can now be compiled using gcc 4.8.x
* Modified define-* to get rid of gcc warnings for extra tokens after #undefs
* Added a Linux kernel version check for AVX - gcc's __builtin_cpu_supports function does not check whether the kernel supports AVX or not.
* Updated PairHMM profiling code to update and print numbers only in single-thread mode
* Edited README.md, pom.xml and Makefile for users to pass path to gcc 4.8.x if necessary
* Moved all cpuid inline assembly to single function Changed info message to clog from cinfo
* Modified version in pom.xml in VectorPairHMM from 3.1 to 3.2
* Deleted some unnecessary code
* Modified C++ sandbox to print per interval timing
2014-05-02 19:12:48 -04:00
Valentin Ruano-Rubio d563072282 Fix for CombineGVCFs and GenotypeGVCFs recurrent exception about missing PLs
Story:

  https://www.pivotaltracker.com/story/show/68220438

Changes:

   - PL-less input genotypes are now uncalled and so non-variant sites when combining GVCFs.
   - HC GVCF/BP_RESOLUTION Mode now outputs non-variant sites in sites covered by deletions.
   - Fixed existing tests

Test:

   - HaplotypeCallerGVCFIntegrationTest
   - ReferenceConfidenceModelUnitTest
   - CombineGVCFsIntegrationTest
2014-05-02 09:21:06 -04:00
Phillip Dexheimer 7a2b70a10f Made IntervalSharder respect the IntervalMergingRule specified on the command line
* This addresses PT Bug 69741902
* Added a required IMR argument to FilePointer, BAMScheduler, IntervalSharder, and SAMDataSource
* This rule is used by FilePointer.combine and FilePointer.union
* Added unit and integration tests
2014-04-30 22:07:22 -04:00
Michael McCowan fe3c68cb2d Java 8 compatability fix: `Reflections` NPE bugfix. 2014-04-29 13:34:03 -04:00
Ryan Poplin 41d3069213 When we subset PLs because Alleles are removed during genotyping we also need to subset AD. 2014-04-28 15:52:26 -04:00
kshakir 10ee35eafa Merge pull request #616 from broadinstitute/ks_cjav_pbsengine_no_default_queue
Removed setting of a default queue in PbsEngineJobRunner.
2014-04-28 14:24:51 -04:00
Ryan Poplin 06dbe74a23 Merge pull request #609 from kcibul/kc_cancersimreads
extended SimulateReadsForVariants to optionally use the AF field to indi...
2014-04-28 13:31:56 -04:00
Carlos Borroto b7a59e01aa Removed setting of a default queue in PbsEngineJobRunner. Discussed here: http://gatkforums.broadinstitute.org/discussion/3959/would-it-be-possible-for-pbsengine-jobrunner-not-to-set-a-default-queue
Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
2014-04-29 00:44:12 +08:00
Ami Levy-Moonshine 13dd755468 create a new read transformer that refactor NDN cigar elements to one N element.
story:
https://www.pivotaltracker.com/story/show/69648104

description:
This read transformer will refactor cigar strings that contain N-D-N elements to one N element (with total length of the three refactored elements).
This is intended primarily for users of RNA-Seq data handling programs such as TopHat2.
Currently we consider that the internal N-D-N motif is illegal and we error out when we encounter it. By refactoring the cigar string of
those specific reads, users of TopHat and other tools can circumvent this problem without affecting the rest of their dataset.

edit: address review comments - change the tool's name and change the tool to be a readTransformer instead of read filter
2014-04-28 11:29:00 -04:00
Michael McCowan 8290d3c8ac Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming. 2014-04-22 11:07:13 -04:00
MauricioCarneiro f03e5ffeb1 Merge pull request #604 from broadinstitute/vrr_hc_omniploidy_general_api
Disentangle UG and HC Genotyper engines.
2014-04-20 07:43:23 -04:00
Valentin Ruano-Rubio 7455ac9796 Addressed revisions 2014-04-19 16:48:48 -04:00
Ryan Poplin a9a48f2459 Merge pull request #607 from broadinstitute/mm_bugfix_raise_mathutils_n_ceiling
Support more samples in math utilities.
2014-04-17 13:32:34 -04:00
Joel Thibault 1ab50f4ba8 CatVariants now handles BCF and Block-Compressed VCF
[Delivers #67461500]
2014-04-17 12:31:38 -04:00
Kristian Cibulskis 7115cadbd8 extended SimulateReadsForVariants to optionally use the AF field to indicate allele fraction of the simulated event, useful in cancer and other variable ploidy use cases 2014-04-16 16:20:02 -04:00
Joel Thibault 4c74319578 Update for Picard refactoring which improves block-compressed VCF reading
[Delivers #69215404]
2014-04-16 14:39:23 -04:00
Joel Thibault fd09cb7143 Rev Picard 1.111.1920 2014-04-16 14:39:19 -04:00
Joel Thibault f98df5c071 Integration test for the file extensions CatVariants should handle 2014-04-16 13:25:47 -04:00
Joel Thibault bdd7024d00 Integration test for block-compressed VCF reading 2014-04-16 13:09:40 -04:00
Joel Thibault ce770b032a Move execAndCheck() to ProcessController 2014-04-16 13:09:40 -04:00
Joel Thibault b197618d13 This comment is no longer true 2014-04-15 15:42:39 -04:00
Mike f0732d386c Support more samples in math utilities.
- Amend `MathUtils`' constants such that they support callings in excess of 70,000 samples (instead, 100,000).
2014-04-14 12:05:38 -04:00
Valentin Ruano-Rubio 08203b516e Disentangle UG and HC Genotyper engines.
Description:

  Transforms a delegation dependency from HC to UG genotyping engine into a reusage by inhertance where HC and UG engines inherit from a common superclass GenotyperEngine
  that implements the common parts. A side-effect some of the code is now more clear and redundant code has been removed.

  Changes have a few consequence for the end user. HC has now a few more user arguments, those that control the functionality that HC was borrowing directly from UGE.

     Added -ploidy argument although it is contraint to be 2 for now.
     Added -out_mode EMIT_ALL_SITES|EMIT_VARIANTS_ONLY ...
     Added -allSitePLs flag.

Stories:

   https://www.pivotaltracker.com/story/show/68017394

Changes:

   - Moved (HC's) GenotyperEngine to HaplotypeCallerGenotyperEngine (HCGE). Then created a engine superclass class GenotypingEngine (GE) that contains common parts between HCGE and the UG counterpart 'UnifiedGenotypingEngine' (UGE). Simplified the code and applied the template pattern to accomodate for small diferences in behaviour between both caller
   engines. (There is still room for improvement though).

   - Moved inner classes and enums to top-level components for various reasons including making them shorter and simpler names to refer to them.

   - Create a HomoSpiens class for Human specific constants; even if they are good default for most users we need to clearly identify the human assumption across the code if we want to make
   GATK work with any species in general; i.e. any reference to HomoSapiens, except as a default value for a user argument, should smell.

   - Fixed a bug deep in the genotyping calculation we were taking on fixed values for snp and indel heterozygisity to be the default for Human ignoring user arguments.

   - GenotypingLikehooldCalculationCModel.Model to Gen.*Like.*Calc.*Model.Name; not a definitive solution though as names are used often in conditionals that perhaps should be member methods of the
     GenLikeCalc classes.

   - Renamed LikelihoodCalculationEngine to ReadLikelihoodCalculationEngine to distinguish them clearly from Genotype likelihood calculation engines.

   - Changed copy by explicity argument listing to a clone/reflexion solution for casting between genotypers argument collection classes.

   - Created GenotypeGivenAllelesUtils to collect methods needed nearly exclusively by the GGA mode.

Tests :

    - StandardCallerArgumentCollectionUnitTest (check copy by cloning/reflexion).
    - All existing integration and unit tests for modified classes.
2014-04-13 03:09:55 -04:00
Joel Thibault c84126205b Test that stdout redirects and log files do not affect output 2014-04-09 13:52:42 -04:00
Joel Thibault 1103fd231a Better exception message 2014-04-09 10:51:45 -04:00
Eric Banks b07c0a6b4c Merge pull request #594 from broadinstitute/dr_vcf_sample_renaming
Extend on-the-fly sample renaming feature to vcfs
2014-04-08 11:47:45 -04:00
David Roazen af6a897479 Extend on-the-fly sample renaming feature to vcfs
-Only works with single-sample vcfs

-As with bams, the user must provide a file mapping the absolute path to
 each vcf whose samples are to be renamed to the new sample name for that
 vcf. The argument is the same as for bams: --sample_rename_mapping_file,
 and the mapping file may contain a mix of bam and vcf files should the
 user wish.

-It's an error to attempt to remap the sample names of a multi-sample
 or sites-only vcf

-Implemented at the codec level at the instant the vcf header is first
 read in to minimize the chances of downstream code examining vcf
 headers/records before renaming occurs.

-Integration tests are in sting, unit tests are in picard

-Rev picard et. al. to 1.111.1902
2014-04-08 11:07:00 -04:00
Eric Banks e690ed1a67 The contig is named MT not M in b36. Delivers PT68890442. 2014-04-08 10:03:47 -04:00
Eric Banks ad336375dc Merge pull request #590 from broadinstitute/vrr_validate_variants_unused_alleles_fix
Addresses issue with strict validation on GVCF files.
2014-04-07 22:10:49 -04:00
Valentin Ruano-Rubio 5afcc8e05f Change in the command line interface of ValidateVariants.
Following reviewers comments the command line interface has been simplified.
All extra strict validations are performed by default (as before) and the
user has to indicate which one he/she does not want to use with --validationTypeToExclude.

Before he/she was able to indicate the only ones to apply with --validationType but that has been scrapped out.

Stories:

    - https://www.pivotaltracker.com/story/show/68725164

Changes:

    - Removed validateType argument.
    - Improved documentation.
    - Added some warnning log message on suspicious argument combinations.

Tests:

    - ValidateVariantsIntegrationTest#*
2014-04-07 16:27:11 -04:00
Ryan Poplin 7d11b4d5f1 Balancing training classes between SNP/Indel and TP/FP.
-- This results in much more consistent distribution of LOD scores for SNPs and Indels.
-- Removing genotype summary stats since they are now produced by default.
-- Added functionality to specify certain subsets of the training data to be used in Tranche file generation, -good:tranche=true set.vcf
2014-04-07 15:23:53 -04:00
MauricioCarneiro 84861fa10a Merge pull request #587 from broadinstitute/eb_actually_fail_on_reduced_bams
Make sure to fail in all cases where the BAM being used was created by ReduceReads.
2014-04-04 17:27:57 -04:00
Laura Gauthier ff25b656e1 Added check to make sure file passed in with sample IDs is valid (used in SelectVariants) -- throws UserException. Corresponding test checks for UserException. 2014-04-04 15:38:50 -04:00
Valentin Ruano-Rubio 18deeec6b0 Addresses issue with strict validation on GVCF files.
More concretelly Picard's strict VCF validation does not like that there is alternative alleles that are not participating in any genotype call across samples.

This is an issue with GVCF in the single-sample pipeline where this is certainly expected with <NON_REF> and other relative unlikely alleles.

To solve this issue we allow the user to exclude some of the strict validations using a new argument --validationTypeToExclude. In order to avoid the validation
issue with GVCF the user needs to add the following to the command line: '--validationTypeToExclude ALLELES'

Story:

    https://www.pivotaltracker.com/story/show/68725164

Changes:

    - Added validateTypeToExclude argument to ValidateVariants walker.
    - Implemented the selective exclusion of validation types.
    - Added new info and improved existing documentation of the ValidateVariants walker.

Tests:

    - ValidateVariantsIntegrationTest#testUnusedAlleleError
    - ValidateVariantsIntegrationTest#testUnusedAlleleFix
2014-04-04 14:37:10 -04:00
Laura Gauthier 06d78ba068 Expanded documentation to include description of which callsets are being compared in what order and more definitions 2014-04-04 10:35:53 -04:00
Eric Banks a3d55b3341 Make sure to fail in all cases where the BAM being used was created by ReduceReads.
In some cases, the program records were being removed from the BAM headers by the GATK engine
before we applied the check for reduced reads (so we did not fail appropriately).  Pushed up the
check to happen before the PG tags are modified and added a unit test to ensure it stays that way.
It turns out that some UG tests still used reduced bams so I switched to use different ones.

Based on reviewer feedback, made it more generic so that it's easy to add new unsupported tools.
2014-04-03 16:52:41 -04:00
Eric Banks 0b73573abc Slightly modifying the way to use the IUPAC ambiguity codes in the FastaAlternateReferenceMaker.
Previously it required you to create a single sample VCF and then to pass that in to the tool, but
Geraldine convinced me that this was a pain for users (because they usually have multi-sample VCFs).
Instead now you can pass in a multi-sample VCF and specify which sample's genotypes should be used
for the IUPAC encoding.  Therefore the argument changed from '--useIUPAC' to '--use_IUPAC_sample NA12878'.
2014-04-02 21:34:25 -04:00
Valentin Ruano-Rubio 84711b8e90 Fixed bug using GraphBased due to infinite likelihoods resulting from the calculation of alignment cost of very long insertion or deletions (done in linear scale)
Stories:

  https://www.pivotaltracker.com/story/show/66263868

Bug:

  The problem was due to the way we were calculating the fix penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the portion
  or read or haplotype deletion as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0 in
  in linear scale computations that ins transformed into an infinity in log scale.

Changes:

  - Change to use log10 scale for calculate those penalties.
  - Minor addition of .gitignore to hide ./public/external-example/target which is generated by the building process.
2014-04-01 16:14:52 -04:00