Commit Graph

11340 Commits (aa39037be8c00a0dfdbe92d4f15951b1cfb81fba)

Author SHA1 Message Date
Guillermo del Angel 62d9de084f Changes to specify outputs from inputs arguments per Khalid's request 2012-10-16 13:57:35 -04:00
Mark DePristo 9bcefadd4e Refactor ExactCallLogger into a separate class
-- Update minor integration tests with NanoSchedule due to qual accuracy update
2012-10-16 13:30:09 -04:00
Kristian Cibulskis b26b7bd8e5 fixed problem with isIntermediate flag being inherited from FQ2BAM
added support for tumor flag in metadata
2012-10-16 12:20:41 -04:00
Mark DePristo c74d7061fe Added AFCalcResultUnitTest
-- Ensures that the posteriors remain within reasonable ranges.  Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good.  Now tests ensure that the normalization process preserves log10 precision where possible
-- Updated MathUtils to make this possible
2012-10-16 08:11:06 -04:00
Mark DePristo 9b0ab4e941 Cleanup IndependentAllelesDiploidExactAFCalc
-- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all
-- Added lots of contracts and docs, still more to come.
-- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't just do the default thing
-- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.
2012-10-16 08:11:06 -04:00
Mark DePristo 6bd0ec8de4 Proper likelihoods and posterior probability of the joint allele frequency in IndependentAllelesDiploidExactAFCalc
-- Fixed minor numerical stability issue in AFCalcResult
-- posterior of joint A/B/C is 1 - (1 - P(D | AF_b == 0)) x (1 - P(D | AF_c == 0)), for any number of alleles, obviously.  Now computes the joint posterior like this, and then back-calculates likelihoods that generate these posteriors given the priors.  It's not pretty but it's the best thing to do
2012-10-16 08:11:06 -04:00
Mark DePristo d1511e38ad Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest
-- Superseded by IndependentAFCalc
-- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it.
-- A few misc. bug fixes discovered during running the performance test
2012-10-16 08:11:06 -04:00
kshakir 9fcf71c031 Updated google reflections due to stale slf4j version conflicting with other projects also trying to use Queue as a component.
Added targets to build.xml to effectively 'mvn install' packaged GATK/Queue from ant.
TODO: Versions during 'mvn install' are hardcoded at 0.0.1 until a better versioning scheme that works with maven dependencies has been identified.
2012-10-16 02:22:30 -04:00
Ryan Poplin 31be807664 Updating missed integration test. 2012-10-15 22:31:52 -04:00
Ryan Poplin d27ae67bb6 Updating the multi-step UG integration test. 2012-10-15 22:30:01 -04:00
Kristian Cibulskis 6c0e4895f0 added intervals to MuTect in BAM-PP
moved intervals from trait to MuTect class
2012-10-15 22:00:27 -04:00
Kristian Cibulskis 9bb241f06f Merge branch 'develop' of github.com:broadinstitute/cmi-gatk into develop 2012-10-15 21:59:11 -04:00
David Roazen cb33f25bfc Update expected values for HybridSelectionPipelineTest
Mark has confirmed that these differences were to be expected
given his recent changes.
2012-10-15 18:32:15 -04:00
kshakir c4ee31075c Fixed package error and a few deprecated scala warnings. 2012-10-15 15:29:40 -04:00
kshakir 213cc00abe Refactored argument matching to support other plugins in addition to file lists.
Added plugin support for sending Queue status messages.
Argument parsing can store subclasses of java.io.File, for example RemoteFile.
2012-10-15 15:10:45 -04:00
Mauricio Carneiro 4642e4eb66 Merge branch 'unstable' of https://github.com/broadinstitute/cmi-gatk into unstable 2012-10-15 13:50:03 -04:00
Mauricio Carneiro 69194e5032 Adding intellij example files to the repo 2012-10-15 13:49:09 -04:00
Guillermo del Angel ff2307031a Set default parameters for several command line inputs based on refdata content on cloud instances 2012-10-15 13:49:09 -04:00
Guillermo del Angel b7318f1c96 Bug fixes for temp mutect integration 2012-10-15 13:49:09 -04:00
Guillermo del Angel c66a4d79ba Further bug fixes to merge cancer/germline fastq-bam pipelines 2012-10-15 13:49:09 -04:00
Guillermo del Angel 7580548b5f Temp fixes 2012-10-15 13:49:09 -04:00
Mauricio Carneiro 80d92e0c63 Allowing the GATK to have non-required outputs
Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation).

This change enables the nWayOut parameter for the IndelRealigner and ReduceReads to operate optionally while maintaining the original single way out.

[#DEV-10 transition:31 resolution:1]
2012-10-15 13:49:08 -04:00
Mauricio Carneiro a234bacb02 Making nContigs parameter hidden in ReduceReads
For now, the het reduction should only be performed for diploids (n=2). We haven't really tested it for other ploidies so it should remain hidden until someone braves it out.
2012-10-15 13:49:08 -04:00
Guillermo del Angel d7308646e9 Fix bugs so that we can pass in 2 simultaneous samples in metadata (no co-cleaning yet but at least we don't need to run pipeline twice) to produce 2 bams. Pasted temp mutect so it's also run at the end of the run 2012-10-15 13:49:08 -04:00
Kristian Cibulskis dad7ca281e upgraded mutation caller with VCF output
raw indel calls (non-filtered, non-VCF)
2012-10-15 13:49:08 -04:00
Guillermo del Angel 31d6c3538b Some fixes to QC commands in pipeline, and workaround for critical engine bug in GATK that makes it hang when doing small targeted BAMs with a whole exome interval list 2012-10-15 13:49:08 -04:00
Guillermo del Angel 22b79fb4dd Resolve [DEV-7]: add single-sample VCF calling at end of FASTQ-BAM pipeline. Initial steps of [DEV-4]: queue extensions for Picard QC metrics 2012-10-15 13:49:08 -04:00
Guillermo del Angel d07df384e7 a) Initial raw version of CMI BAM->VCF pipeline (most likely not working yet, but at least compiles and produces reasonable command lines), b) rename FASTQ->BAM script so name is more descriptive 2012-10-15 13:49:07 -04:00
Kristian Cibulskis 658f355171 initial cancer pipeline with mutations and partial indel support 2012-10-15 13:49:07 -04:00
Guillermo del Angel 3e71b238b0 BAM pipeline fixes: a) temp workaround for DEV-9: -nWayOut argument in IndelRealigner is broken, for now things will only really work in single sample mode, b) correct extension of RealignerTargetCreator output, previous extension caused an error 2012-10-15 13:49:07 -04:00
Guillermo del Angel 91ce0243b0 Minor tweaks to CMIProcessing Pipeline: a) don't hard-code job mem limit to 4 G since it's too much for most AWS instances, leave it instead as input argument, b) minor doc cleanups 2012-10-15 13:49:07 -04:00
Mauricio Carneiro ccd5b22646 Reimplementation of the BAM processing pipeline using the metadata information file.
Pipeline runs end-to-end using example metadata and has been tested only for cases where everything is ideal.
Next step is to bring this to the cloud, test all different scenarios (multiple tumors, single ended, missing parameters etc).
Parallel next step is to add QC metrics.
2012-10-15 13:49:07 -04:00
Mauricio Carneiro 6eedd69248 New version of the pipeline starting from an ALIGNED bam going all the way to reducing using n-way out cleaning 2012-10-15 13:49:07 -04:00
Mauricio Carneiro 6174aa801b Revised implementation of the RAWBAM => BAM pipeline
stripped out all the FQ pipeline and tumor/normal information.
2012-10-15 13:49:06 -04:00
Mauricio Carneiro 59bcd0f0d2 First implementation of the CMI data processing pipeline, handling both germline and cancer BAM/FQ => BAM.
Not ready for prime time yet, needs more work!
2012-10-15 13:49:06 -04:00
Mauricio Carneiro 322ea1262c First implementation of a generic 'bundled' Data Processing Pipeline for germline and cancer.
not ready for prime time yet!
2012-10-15 13:49:06 -04:00
Mauricio Carneiro f1fb51b222 Reverting the DPP to the original version, going to create a new simplified version for CMI in private. 2012-10-15 13:49:06 -04:00
Mauricio Carneiro 429c96e723 Generic input file name recognition (still need to implement support for FastQ, but it now can at least accept it) 2012-10-15 13:49:06 -04:00
Ryan Poplin 25be94fbb8 Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10. 2012-10-15 13:24:32 -04:00
Guillermo del Angel 1977763b08 Set default parameters for several command line inputs based on refdata content on cloud instances 2012-10-15 12:20:08 -04:00
Ami Levy Moonshine 0d93effa4d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-15 11:19:12 -04:00
Ami Levy Moonshine d63eaf5b52 add full postQC report option and clean some of the code (it still needs more work) 2012-10-15 11:17:22 -04:00
Mark DePristo 57e231610b New framework for EXACT calculations, with 3 new implementations
-- Before this branch, the EXACT calculation implementation was largely based on historical choices in the UnifiedGenotyper.  The code was badly organized, there were no unit tests, and the Diploid EXACT calculation was super slow O(n.samples ^ n.alt.alleles)
-- Reorganized code into a single AFCalc superclass that carries out the calculation and an AFCalcResult object that contains only the information we should expose to code users, and is well-validated.
-- Implement a new model for the multi-allelic exact calculation that sweeps for each alt allele B all likelihoods into a bi-allelic model XB where X is all alleles != B, and calls these all separately using the reference bi-allelic model.  It produces identical quals for the bi-allelic case but slightly different results for multi-allelics due to a genuine model difference in that this Independent model doesn't penalize fully all genotype configurations as occurs in the Reference multi-allelic implementation.  However, it seems after much debate that the reference model is doing the wrong thing, so in fact the Independent model seems correct.  This code isn't the default implementation yet, simply because I want to do some cleanup and discuss with the methods group before enabling.
-- Constrained search model implemented, but will be deleted in a subsequent code cleanup
-- Massive (40K) suite of unit tests for the exact models, which are passing for the reference and the independent alleles exact model.
-- Restored -- but isn't 100% hooked up -- the original clean bi-allelic model for Ryan to pass his optimized logless version on.
-- The only way to create these AFCalc objects is through an AFCalcFactory, which again validates its arguments.  The AFCalcFactory.Calculation enum exposes calculations to the UG / HC as the AFModel.
-- Separated AFCalc from UG, into its own package that could in principle be pushed into utils now
-- Created a simple main[] function to run performance tests of the EXACT model.
2012-10-15 08:32:32 -04:00
Mark DePristo dcf8af42a8 Finalizing IndependentAllelesDiploidExactAFCalc
-- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors
-- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotyperEngine, where isNonRef really meant isRef.  Not ideal.  Finally caught by some tests, but good god it almost made it into the code
-- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0
-- Massive new suite of unit tests to ensure that bi-allelic and tri-allelic events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly.  ID'd some of the bugs below
-- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests
-- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when there were no informative GLs.
2012-10-15 08:21:03 -04:00
Mark DePristo 1ac09ca81e More bugfixes on the way to a final push with new Exact model framework
-- UnifiedGenotyperEngine uses only the alleles used in genotyping, not the original alleles, when considering which alleles to include in output
-- AFCalcFactory has a more informative info message when looking for and selecting an exact model to use in genotyping
2012-10-15 07:53:57 -04:00
Mark DePristo 6b639f51f0 Finalizing new exact model and tests
-- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior.
-- Tests that theta^n.alt.alleles is being applied correctly
-- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java
-- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine
2012-10-15 07:53:57 -04:00
Mark DePristo 2d72265f7d AFCalcUnit test a more appropriate name 2012-10-15 07:53:57 -04:00
Mark DePristo cb857d1640 AFCalcs must be made by factory method now
-- AFCalcFactory is the only way to make AFCalcs now.  There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions.  The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles.
-- AFCalc no longer has UAC constructor -- only AFCalcFactory does.  Code cleanup throughout
-- Enabling more unit tests, almost all of which pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next)
-- It's now possible to run the UG / HC with any of the exact models currently in the system.
-- Code cleanup throughout the system, reorganizing the unit tests in particular
2012-10-15 07:53:56 -04:00
Mark DePristo 6bbe750e03 Continuing work on IndependentAllelesDiploidExactAFCalc
-- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly.  A long way towards the right answer now, but still not there
-- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan
-- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there
-- New convenience method in VariantContext that looks up the allele index in the alleles
2012-10-15 07:53:56 -04:00
Mark DePristo 176b74095d Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation
-- Still not working, but I know what's wrong
-- Many tests disabled, that need to be re-enabled
2012-10-15 07:53:56 -04:00