Commit Graph

88 Commits (f0732d386cbaa04e70cd35cfcf36d8a27b6f5dc8)

Author SHA1 Message Date
Mike f0732d386c Support more samples in math utilities.
- Amend `MathUtils`' constants such that they support callings in excess of 70,000 samples (instead, 100,000).
2014-04-14 12:05:38 -04:00
Joel Thibault c84126205b Test that stdout redirects and log files do not affect output 2014-04-09 13:52:42 -04:00
Joel Thibault 1103fd231a Better exception message 2014-04-09 10:51:45 -04:00
Eric Banks b07c0a6b4c Merge pull request #594 from broadinstitute/dr_vcf_sample_renaming
Extend on-the-fly sample renaming feature to vcfs
2014-04-08 11:47:45 -04:00
David Roazen af6a897479 Extend on-the-fly sample renaming feature to vcfs
-Only works with single-sample vcfs

-As with bams, the user must provide a file mapping the absolute path to
 each vcf whose samples are to be renamed to the new sample name for that
 vcf. The argument is the same as for bams: --sample_rename_mapping_file,
 and the mapping file may contain a mix of bam and vcf files should the
 user wish.

-It's an error to attempt to remap the sample names of a multi-sample
 or sites-only vcf

-Implemented at the codec level at the instant the vcf header is first
 read in to minimize the chances of downstream code examining vcf
 headers/records before renaming occurs.

-Integration tests are in sting, unit tests are in picard

-Rev picard et. al. to 1.111.1902
2014-04-08 11:07:00 -04:00
Eric Banks ad336375dc Merge pull request #590 from broadinstitute/vrr_validate_variants_unused_alleles_fix
Addresses issue with strict validation on GVCF files.
2014-04-07 22:10:49 -04:00
Valentin Ruano-Rubio 5afcc8e05f Change in the command line interface of ValidateVariants.
Following reviewers comments the command line interface has been simplified.
All extra strict validations are performed by default (as before) and the
user has to indicate which one he/she does not want to use with --validationTypeToExclude.

Before he/she was able to indicate the only ones to apply with --validationType but that has been scrapped out.

Stories:

    - https://www.pivotaltracker.com/story/show/68725164

Changes:

    - Removed validateType argument.
    - Improved documentation.
    - Added some warnning log message on suspicious argument combinations.

Tests:

    - ValidateVariantsIntegrationTest#*
2014-04-07 16:27:11 -04:00
Ryan Poplin 7d11b4d5f1 Balancing training classes between SNP/Indel and TP/FP.
-- This results in much more consistent distribution of LOD scores for SNPs and Indels.
-- Removing genotype summary stats since they are now produced by default.
-- Added functionality to specify certain subsets of the training data to be used in Tranche file generation, -good:tranche=true set.vcf
2014-04-07 15:23:53 -04:00
MauricioCarneiro 84861fa10a Merge pull request #587 from broadinstitute/eb_actually_fail_on_reduced_bams
Make sure to fail in all cases where the BAM being used was created by ReduceReads.
2014-04-04 17:27:57 -04:00
Laura Gauthier ff25b656e1 Added check to make sure file passed in with sample IDs is valid (used in SelectVariants) -- throws UserException. Corresponding test checks for UserException. 2014-04-04 15:38:50 -04:00
Valentin Ruano-Rubio 18deeec6b0 Addresses issue with strict validation on GVCF files.
More concretelly Picard's strict VCF validation does not like that there is alternative alleles that are not participating in any genotype call across samples.

This is an issue with GVCF in the single-sample pipeline where this is certainly expected with <NON_REF> and other relative unlikely alleles.

To solve this issue we allow the user to exclude some of the strict validations using a new argument --validationTypeToExclude. In order to avoid the validation
issue with GVCF the user needs to add the following to the command line: '--validationTypeToExclude ALLELES'

Story:

    https://www.pivotaltracker.com/story/show/68725164

Changes:

    - Added validateTypeToExclude argument to ValidateVariants walker.
    - Implemented the selective exclusion of validation types.
    - Added new info and improved existing documentation of the ValidateVariants walker.

Tests:

    - ValidateVariantsIntegrationTest#testUnusedAlleleError
    - ValidateVariantsIntegrationTest#testUnusedAlleleFix
2014-04-04 14:37:10 -04:00
Laura Gauthier 06d78ba068 Expanded documentation to include description of which callsets are being compared in what order and more definitions 2014-04-04 10:35:53 -04:00
Eric Banks a3d55b3341 Make sure to fail in all cases where the BAM being used was created by ReduceReads.
In some cases, the program records were being removed from the BAM headers by the GATK engine
before we applied the check for reduced reads (so we did not fail appropriately).  Pushed up the
check to happen before the PG tags are modified and added a unit test to ensure it stays that way.
It turns out that some UG tests still used reduced bams so I switched to use different ones.

Based on reviewer feedback, made it more generic so that it's easy to add new unsupported tools.
2014-04-03 16:52:41 -04:00
Eric Banks 0b73573abc Slightly modifying the way to use the IUPAC ambiguity codes in the FastaAlternateReferenceMaker.
Previously it required you to create a single sample VCF and then to pass that in to the tool, but
Geraldine convinced me that this was a pain for users (because they usually have multi-sample VCFs).
Instead now you can pass in a multi-sample VCF and specify which sample's genotypes should be used
for the IUPAC encoding.  Therefore the argument changed from '--useIUPAC' to '--use_IUPAC_sample NA12878'.
2014-04-02 21:34:25 -04:00
Valentin Ruano-Rubio 84711b8e90 Fixed bug using GraphBased due to infinite likelihoods resulting from the calculation of alignment cost of very long insertion or deletions (done in linear scale)
Stories:

  https://www.pivotaltracker.com/story/show/66263868

Bug:

  The problem was due to the way we were calculating the fix penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the portion
  or read or haplotype deletion as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0 in
  in linear scale computations that ins transformed into an infinity in log scale.

Changes:

  - Change to use log10 scale for calculate those penalties.
  - Minor addition of .gitignore to hide ./public/external-example/target which is generated by the building process.
2014-04-01 16:14:52 -04:00
Joel Thibault 70fe7f72f1 Return a TabixIndexCreator for appropriate file types
[Fixes #68291082]
2014-03-31 16:15:34 -04:00
Joel Thibault ab5634cbac Test that a Tabix index is created for block-compressed output formats
- Replace .idx and .tbi with appropriate constants
2014-03-31 14:36:48 -04:00
Joel Thibault a2d40c84ba Keep the list of zipped suffixes in sync with Variant 2014-03-31 14:36:41 -04:00
Ryan Poplin 6566dd6ca9 Fix for dropping of reference sample depth in the DP annotation.
-- In the case of hierarchical merge we can't assume that we have only one genotype.
-- Removed use of deprecated VC annotation access functions.
2014-03-24 14:01:50 -04:00
Ryan Poplin 69eaf7c82d Merge pull request #577 from broadinstitute/eb_minor_fixes_for_fragment_utils
Fixed docs for method and fixed the edge case optimization to properly u...
2014-03-21 14:01:44 -04:00
Eric Banks 0d82a70633 Fixed docs for method and fixed the edge case optimization to properly use equals() on Integers.
Shouldn't affect actual results at all.
2014-03-20 15:55:09 -04:00
Eric Banks 3b1c337401 Have CombineVariants throw a UserError when trying to combine GVCFs from the HaplotypeCaller.
Was previously throwing an IllegalArgumentException (in the wrong place in the code).
Error message tells users to use CombineGVCFs.
2014-03-19 19:11:40 -04:00
David Roazen e549f4a9d2 Fix typo in UtilsUnitTest data provider name
This is currently my leading suspect for the cause of the
intermittent NoSuchElementException errors on master, since
the maven surefire plugin seems unable to handle errors in
TestNG DataProviders without blowing up.
2014-03-18 11:52:29 -04:00
David Roazen 4ba72d43cf Re-enable GATKRunReportUnitTest
This test is not, as I had initially thought, the cause of the
maven errors. Our master branch is failing intermittently
regardless of whether this test is enabled or disabled.

This reverts commit 45fc9ff515eec8d676b64a04fb34fb357492ff84.
2014-03-18 09:53:41 -04:00
David Roazen afa6abe554 Temporarily disable GATKRunReportUnitTest in unstable while maven issues are worked out
This test passes when run individually, as part of the commit tests, or as
part of the package tests. However, when running the unit tests in isolation
it causes maven/surefire to throw a NoSuchElementException.

This is clearly a maven/surefire bug or configuration issue. I will re-enable
this test on a branch as Khalid and I try to work through it.
2014-03-18 01:28:28 -04:00
David Roazen 2d8653f493 Update pom versions to mark the start of GATK 3.2 development 2014-03-18 01:18:59 -04:00
David Roazen a6a41c777c Update pom versions for 3.1 2014-03-18 01:09:29 -04:00
David Roazen d5e38ec39b Move GATKRunReport tests from private to public
-Hide AWS downloader credentials in a private properties file
-Remove references to private ActiveRegion walker

Allows phone home functionality to be tested at release time
when we are running tests on the release jar.
2014-03-17 18:29:40 -04:00
Eric Banks 2e34ff7692 Merge pull request #563 from broadinstitute/aw_refactor_tribble
GATK changes to conform to Tribble refactoring as part improving Tabix s...
2014-03-17 13:35:46 -04:00
Eric Banks dabdd0a0fd Remove unused and unnecessary argument 2014-03-17 12:28:27 -04:00
Alec Wysoker 0369f93b24 GATK changes to conform to Tribble refactoring as part improving Tabix support in Tribble (among other things).
1. Enable on-the-fly indexing for vcf.gz.
2. Handle on-the-fly indexing where file to be indexed is not a regular file, thus index should not be created.
3. Add method setProgressLogger to all SAMFileWriter implementations.
4. Revved picard to 1.109.1722
5. IndelRealigner md5s change because the MC tag is added to records now.

Fixed up and signed off by ebanks.
2014-03-17 11:56:22 -04:00
Valentin Ruano-Rubio 2e964c59b4 Improved criteria to select best haplotypes out from the assembly graph.
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.

The edge *mulitplicity* is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely map to some vertex up-stream from the edge and the following base calls extend across that edge to vertices downstream from it.

Despite that it is obvious that higher multiplicties correlated with haplotype probability this criterion fails short in some regards of which the most relevant is:

As it is evaluated in condensed seq-graph (as supposed to uncompressed read-threading-graphs) it is bias to haplotypes that have more short-sequence vetices
  ( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly result of how we modify the edge multiplicities when we merge vertices from a linear chain.

This pull-request addresses the problem by changing to a new scoring schema based in likelihood estimates:

Each haplotype's likelihood can be calculated as the multiplication of the likelihood of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divide by the sum of multiplicity of edges that share the same source vertex.

This pull-request addresses the following stories:

https://www.pivotaltracker.com/story/show/66691418
https://www.pivotaltracker.com/story/show/64319760

Change Summary:

1. Change to the new scoring schema.
2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformation have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0 edges assuming that they are in fact 0.5)
2014-03-14 18:37:01 -04:00
Eric Banks ffaf92f871 Added new functionality to the FastaAlternateReferenceMaker to have it output IUPAC codes for het sites.
Enable it with the new --useIUPAC argument.
Added both unit and integration tests for the new functionality - and fixed up the
exising tests once I was in there.
2014-03-12 14:31:57 -04:00
droazen 8b53567dc7 Merge pull request #553 from broadinstitute/dr_rename_pipeline_tests
Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests
2014-03-10 21:36:45 -04:00
David Roazen 78562c14bb Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests
-These tests are really integration tests for Queue rather than generalized
 pipeline tests, so it makes sense to call them QueueTests.

-Rename test classes and maven build targets, and update shell scripts
 to reflect new naming.
2014-03-10 21:24:03 -04:00
David Roazen 7c34f05082 Merge remote-tracking branch 'origin/master' into intel 2014-03-10 14:07:36 -04:00
Ami Levy-Moonshine 2a6f05a8a1 add an option to randomly (uniformly) split a vcf file/s to more than 2 files.
The old code that allow split to two files (given in the input) is kept to allow uneven splitting between files.
2014-03-10 10:58:44 -04:00
David Roazen 9df59bd8cc Update pom versions to mark the start of GATK 3.1 development 2014-03-06 00:05:58 -05:00
David Roazen 34edcb8ddf Update pom versions for the 3.0 release 2014-03-05 23:37:21 -05:00
Karthik Gururaj 8fcbf9272c Merge branch 'intel_pairhmm' of /data/broad/gsa-unstable into intel_pairhmm
Conflicts:
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java
	public/VectorPairHMM/src/main/c++/Sandbox.java
2014-03-05 09:35:50 -08:00
Laura Gauthier 43fdd38342 Add error handling to CalculateGenotypePosteriors to catch multiallelic variants with wrong number of ACs
-- throws UserException; added tests in PosteriorLikelihoodsUtilsUnitTests
Add error handling to CalculateGenotypePosteriors for cases where MLEAC>AN; add tests in PosteriorLikelihoodsUtilsUnitTests
Add unit tests to confirm that CalculateGenotypePosteriors has the ability to switch genotypes for four cases
2014-03-05 12:03:18 -05:00
Laura Gauthier 7f9f58dbd1 Added hidden flag to GenotypeConcordance to output sites of discordant genotypes (to System.out)
Revised ConcondanceMetrics tests to adapt to change
Added comments to PosteriorLikelihoodsUtils
2014-03-05 12:03:18 -05:00
Joel Thibault 57747ad35e Logger output should go to STDERR instead of STDOUT 2014-03-05 10:01:06 -05:00
Joel Thibault b4dde6a78c Add WARN to the valid log types error message
- order if statements and error message in increasing severity
2014-03-05 10:01:06 -05:00
Valentin Ruano Rubio 243d1bc07a Merge pull request #542 from broadinstitute/vrr_efficient_find_best_haplotypes
Added a more efficient implementation of the KBest haplotype finder code...
2014-03-05 09:44:50 -05:00
David Roazen 58905e8fe0 Disable the intermittently-failing and flawed ProgressMeterDaemonUnitTest
-created a Pivotal ticket to eventually redesign this test
2014-03-05 09:15:26 -05:00
Valentin Ruano-Rubio 69bf2b3247 Added a more efficient implementation of the KBest haplotype finder code (CONT.)
Changes:

  1. Addressed review comments on new K-best haplotype assembly graph finder.
  2. Generalize KBestHaplotypeFinder to deal with multiple source and sink vertices.
  3. Updated test to use KBestHaplotypeFinder instead of KBestPaths
  4. Retired KBestPaths to the archive.
  5. Small improvements to the code and documentation.
2014-03-04 23:22:27 -05:00
Valentin Ruano-Rubio 7acf2eb0e7 Added a more efficient implementation of the KBest haplotype finder code.
Story:

  https://www.pivotaltracker.com/story/show/66238286

Changes:

  1. Created a new k-best haplotype search implementation in class KBestHaplotypeFinder.
  2. Changed HC code to use the new implementation.
  This seems to fix the original problem without causing significant changes in outputs using some empirical data test cases
  3. Moved haplotype's cigar calculation code from Path to CigarUtils; need that in order to gain independence from Path in some parts of the code.
     In any case that seems like a more natural location for that functionality.
2014-03-04 12:22:14 -05:00
Eric Banks 22ad18b919 Moving Reduce Reads to the archive.
The GATK now fails with a user error if you try to run with a reduced bam.
(I added a unit test for that; everything else here is just the removal of all traces of RR)
2014-03-02 02:03:14 -05:00
Chris Whelan e61ba8b340 Added command line checks for duplicate files in ROD lists
-- Keep a list of processed files in ArgumentTypeDescriptor.getRodBindingsCollection
  -- Throw user exception if a file name duplicates one that was previously parsed
  -- Throw user exception if the ROD list is empty
  -- Added two unit tests to RodBindingCollectionUnitTest
2014-02-27 13:32:18 -05:00