Commit Graph

12751 Commits (2f5064dd1d1e1d8e5b8071bfca3c29b4cc174df1)

Author SHA1 Message Date
chapmanb 2f5064dd1d Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions. Allows utilization of CoveredLocusView via the API
Signed-off-by: David Roazen <droazen@broadinstitute.org>
2013-09-10 15:32:54 -04:00
Ryan Poplin 5e539bb11b Merge pull request #394 from broadinstitute/rp_single_sample_calling_pipeline_update
Updates to the single sample calling pipeline to reflect latest experime...
2013-09-10 07:12:05 -07:00
Geraldine Van der Auwera 292426b504 Merge pull request #390 from broadinstitute/mc_update_clipreads
Added REVERT SOFTCLIPPED bases to ClipReads
2013-09-09 16:43:03 -07:00
Geraldine Van der Auwera 8b829255e7 Clarified docs on using clipping options 2013-09-09 19:40:03 -04:00
Ryan Poplin 8971861fd0 Updates to the single sample calling pipeline to reflect latest experiments with full whole genome callsets. 2013-09-09 15:59:50 -04:00
MauricioCarneiro 014bc4269e Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation
Add Array Logless PairHMM
2013-09-08 20:16:53 -07:00
Ryan Poplin 08474a39fb Merge pull request #389 from broadinstitute/rp_single_sample_calling_pipline_HC
Created a single sample calling pipeline which leverages the reference m...
2013-09-06 14:35:51 -07:00
Ryan Poplin 3503050a39 Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode. Not perfect yet; turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block

-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
2013-09-06 16:56:34 -04:00
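The two merging rules named in the commit above (median of INFO annotations in -combineAnnotations mode, and minimum PLs over a GVCF block) can be sketched roughly as follows. This is an illustration only, not the GATK code: the function names and the dict/list data layout are hypothetical, and skipping records that lack a given annotation is an assumption rather than documented behaviour.

```python
from statistics import median


def combine_annotations(records):
    """Merge numeric annotations from several records by taking the median.

    `records` is a list of dicts mapping annotation key -> numeric value
    (hypothetical layout). Records that lack a key are skipped for that key
    (an assumption made for this sketch).
    """
    keys = sorted(set().union(*records))
    combined = {}
    for key in keys:
        values = [record[key] for record in records if key in record]
        combined[key] = median(values)
    return combined


def block_min_pls(pls_per_site):
    """Element-wise minimum of PL lists over all sites in one block.

    `pls_per_site` is a list of equal-length PL lists, one per site.
    """
    return [min(column) for column in zip(*pls_per_site)]


merged = combine_annotations([
    {"DP": 10, "MQ": 60.0},
    {"DP": 30},
    {"DP": 20, "MQ": 58.0},
])
block = block_min_pls([[0, 30, 300], [0, 12, 150], [0, 45, 90]])
```

Here `merged` holds one median per annotation key and `block` holds one minimum per genotype position across the block.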
Mauricio Carneiro b6c3ed0295 Added REVERT SOFTCLIPPED bases to ClipReads 2013-09-06 09:30:01 -04:00
Ryan Poplin add17dc463 Merge pull request #388 from broadinstitute/eb_change_record_size_mismatch_to_user_error
Changed the error for the record size mismatch in the genotyping engine ...
2013-08-30 10:29:54 -07:00
Eric Banks ea0deb1bb2 Changed the error for the record size mismatch in the genotyping engine to be a user error since it is possible
to reach this state with input VCFs that contain the same event multiple times (and it's not something we want to
handle in the code).
2013-08-30 12:18:19 -04:00
Eric Banks 5d79a6cbe0 Merge pull request #387 from lbergelson/lb_add_ungenotyped_case_countvariants
adding a check for the UNAVAILABLE case of GenotypeType in CountVariants
2013-08-30 06:14:08 -07:00
Louis Bergelson 4473b0065e adding a check for the UNAVAILABLE case of GenotypeType in CountVariants 2013-08-29 17:27:00 -04:00
bradtaylor 0435bbe38f Retrieved PairHMM benchmarks from archive and made improvements
PairHMMSyntheticBenchmark and PairHMMEmpiricalBenchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely.

Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result of that bug, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords".

I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data.

It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to:
PairHMMBandedEmpiricalBenchmark
PairHMMTestData
BandedLoglessPairHMMUnitTest

I decided against this as the banded benchmark and unit test are archived.
2013-08-28 17:23:35 -04:00
bradtaylor 86fe9fae76 Changes to Array PairHMM to address review comments
Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking

Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.
2013-08-28 17:23:29 -04:00
bradtaylor 3671e41b0c Add Array Logless PairHMM
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.

Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.

This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.

We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)

Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
2013-08-28 17:21:23 -04:00
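The diagonal traversal described in the Array Logless PairHMM commit above can be illustrated with a much simpler recurrence. The sketch below is not the GATK implementation: it substitutes plain edit distance for the HMM state recurrences, omits haplotype caching, and uses hypothetical function names. It shows only that a (read x haplotype) dynamic program can be evaluated diagonal by diagonal with three reusable 1D arrays, and that this gives the same answer as filling the full matrix.

```python
def edit_distance_full(read, hap):
    """Reference version: fills the whole (read x haplotype) matrix."""
    rows, cols = len(read) + 1, len(hap) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if read[i - 1] == hap[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1, m[i][j - 1] + 1, m[i - 1][j - 1] + cost)
    return m[rows - 1][cols - 1]


def edit_distance_diagonal(read, hap):
    """Same recurrence, walked over anti-diagonals with three 1D arrays.

    Cell (i, j) lies on diagonal d = i + j and is stored at index i.
    Its neighbours (i-1, j) and (i, j-1) lie on diagonal d-1, and
    (i-1, j-1) lies on diagonal d-2, so two earlier diagonals suffice.
    """
    rows, cols = len(read) + 1, len(hap) + 1
    cur, prev1, prev2 = [0] * rows, [0] * rows, [0] * rows
    for d in range(rows + cols - 1):
        i_lo = max(0, d - (cols - 1))
        i_hi = min(rows - 1, d)
        for i in range(i_lo, i_hi + 1):
            j = d - i
            if i == 0:
                cur[i] = j
            elif j == 0:
                cur[i] = i
            else:
                cost = 0 if read[i - 1] == hap[j - 1] else 1
                cur[i] = min(prev1[i - 1] + 1, prev1[i] + 1, prev2[i - 1] + cost)
        # Rotate buffers: the oldest diagonal is overwritten on the next pass.
        prev2, prev1, cur = prev1, cur, prev2
    return prev1[rows - 1]
```

Because the buffers are overwritten on every pass, any intermediate state needed later (such as the mid-process information the commit stores for haplotype caching) has to be copied out explicitly at the right diagonal.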
Ryan Poplin 7479152977 Merged bug fix from Stable into Unstable 2013-08-28 12:40:25 -04:00
Ryan Poplin 6bda569666 One of the log10sumLog10s in the VQSR was missed in a previous bug fix. Thanks to Mike McCowan for spotting this one. 2013-08-28 12:40:08 -04:00
Eric Banks 983097cff2 Merge pull request #385 from broadinstitute/gg_vqsr_docfixes
Fixed a few typos and clarified some doc points.
2013-08-26 17:42:47 -07:00
Geraldine Van der Auwera ed465cd2a5 Fixed a few typos and clarified some doc points. 2013-08-26 17:33:17 -04:00
David Roazen 42d771f748 Remove org.apache.commons.collections.IteratorUtils dependency from the test suite
-This was a dependency of the test suite, but not the GATK proper,
 which caused problems when running the test suite on the packaged
 GATK jar at release time

-Use GATKVCFUtils.readVCF() instead
2013-08-21 19:44:02 -04:00
Eric Banks 4b00c81181 Merge remote-tracking branch 'unstable/master' 2013-08-21 17:12:26 -04:00
Eric Banks 38b80a5916 Merge pull request #384 from broadinstitute/eb_pbt_dropping_multiallelics
Fixed bug in PhaseByTransmission where it was completely dropping multi-allelic records.
2013-08-21 14:09:32 -07:00
Eric Banks 9424008055 Merge pull request #383 from broadinstitute/dr_change_phone_home_aws_settings
Update GATK AWS phone-home configuration
2013-08-21 14:08:21 -07:00
Eric Banks d4dc5ba04a Fixed bug in PhaseByTransmission where it was completely dropping multi-allelic records.
Added test to make sure this is no longer happening.
2013-08-21 15:46:57 -04:00
David Roazen 9fbb4920d0 Update GATK AWS phone-home configuration
-Switch to using new GSA AWS account for storage of phone home data

-Use DNS-compliant bucket names, as per Amazon's best practices

-Encrypt publicly-distributed version of credentials. Grant only PutObject
 permission, and only for the relevant buckets.

-Store non-distributed credentials in private/GATKLogs/newAWSAccountCredentials
 for now -- need to integrate with existing python/shell scripts
 later to get the log downloading working with the new account
2013-08-21 14:31:46 -04:00
Eric Banks cdfd07f9eb Merge pull request #382 from broadinstitute/ami-FixBcfCoder-52571227
Ami fix bcf coder 52571227
2013-08-21 11:29:19 -07:00
Ami Levy-Moonshine 0f5bb706ff - update picard, sam, variants and tribble after fixing bug in BCF2Utils.makeDictionary as reported in ticket 52571227
- update call for VCFSimpleHeaderLine constructor in GATKVCFUtils
2013-08-21 12:06:42 -04:00
Ami Levy-Moonshine ec0c33890a change the dbsnp version from 137 to 129 for variantEval 2013-08-19 23:54:43 -04:00
Eric Banks e1174a582d Merge pull request #379 from broadinstitute/mc_dpp_updates_part2
Including SplitByRG in the FullProcessingPipeline
2013-08-19 18:42:12 -07:00
Eric Banks 6663d48ffe Merge pull request #381 from broadinstitute/mm_rev_picard_to_get_tribble_updates
Adaptations to accommodate Tribble API changes.
2013-08-19 18:31:02 -07:00
Michael McCowan c3a933ce84 Adaptations to accommodate Tribble API changes, consisting mostly of the following.
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing for generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
2013-08-19 15:52:47 -04:00
Mauricio Carneiro e991307eb5 Including SplitByRG in the FullProcessingPipeline
Why wasn't it there before, you ask?
------------------------------------

Before I was running it separately (by hand), but now it's integrated in
the FullProcessingPipeline.

Integration was a pain because of Queue's limitation of only allowing 1
@Output file. This forced me to write the ugliest piece of code of my
life, but it's working and it's processing the YRI from scratch using
that right now. So I'm happy... somewhat.

Other changes to the pipeline
-----------------------------

   * Add --filter_bases_not_stored to the IndelRealigner step -- sometimes BAM files have reads with no bases stored in the unmapped section (no idea why) but this disrupts the pipeline.
   * Change adaptor marking parameter to "dual indexed" instead of "pair-ended" -- for PCR Free data.
2013-08-18 00:51:32 -04:00
droazen ee5de8510d Merge pull request #380 from broadinstitute/gg_gatkdocs_arglabels
More detailed labels for arguments in the gatkdocs
2013-08-16 15:34:56 -07:00
Geraldine Van der Auwera 80ed186971 More detailed labels for arguments in the gatkdocs (requested by David) 2013-08-16 14:25:53 -04:00
Geraldine Van der Auwera 6cda283115 Merge pull request #378 from broadinstitute/gg_mutehelp
Disabled the help system's printout of cmdline options when GATK errors ...
2013-08-16 10:11:28 -07:00
Geraldine Van der Auwera 9bb0aac7bf Disabled the help system's printout of cmdline options when GATK errors out. Now the user has to explicitly ask for it using -h. 2013-08-16 13:09:52 -04:00
Geraldine Van der Auwera 3841635fcb Changed 'depreciated' to the more correct 'deprecated' 2013-08-16 13:06:41 -04:00
Ryan Poplin a21d5252c8 Merge pull request #377 from broadinstitute/eb_experiment_with_pcr_error_model
Eb experiment with pcr error model
2013-08-16 08:38:55 -07:00
Eric Banks 08be871309 Removing unused code in VariantsToTable: GQ is not an INFO field and is taken care of by -GF and not -F. 2013-08-16 01:57:24 -04:00
Eric Banks 17eb7b49fe Adding ability to use Ryan's PCR error modeling to the Haplotype Caller.
There is now a command-line option to set the model to use in the HC.
Incorporated Ryan's current (unmerged) branch in for most of these changes.

Because small changes to the math can have drastic effects, I decided not to let users tweak
the calculations themselves.  Instead they can select either NONE, CONSERVATIVE (the default),
or AGGRESSIVE.

Note that any base insertion/deletion qualities from BQSR are still used.

Also, note that the repeat unit x repeat length approach gave very poor results against the KB,
so it is not included as an option here.
2013-08-16 01:53:04 -04:00
Eric Banks e7152e10f7 Rev'ing picard, tribble, and variant jars. 2013-08-16 00:16:31 -04:00
Eric Banks 07f3f6f69d Some minor updates to the KB:
1. PASSing records shouldn't make ImportReviews error out
2. Fix KB setup script so that non-NA12878 sites-only files are not assumed to be TPs
2013-08-15 10:02:55 -04:00
Eric Banks 1a5e4cc4e7 Merge pull request #375 from broadinstitute/rp_queue_jobreport_rscript
Something changed with the ggtitle syntax in the latest version of ggplo...
2013-08-14 12:48:42 -07:00
Eric Banks 7f48f991cc Merge pull request #376 from broadinstitute/gg_AR_gatkdocfix
Made AR an Advanced argument...
2013-08-14 12:48:05 -07:00
Geraldine Van der Auwera 19a4bf9ff0 made AR an Advanced argument to discourage basic users from fiddling with it 2013-08-14 14:46:56 -04:00
Ryan Poplin d4ac183580 Something changed with the ggtitle syntax in the latest version of ggplot2. 2013-08-14 14:40:03 -04:00
Eric Banks 928a9779db Merge pull request #374 from broadinstitute/mc_dpp_updates
Updated Full Processing Pipeline
2013-08-14 04:50:25 -07:00
Mauricio Carneiro 765f5450ac Updated Full Processing Pipeline
    * add interleaved fastq option to sam2fastq
    * add optional adapter trimming path
    * add "skip_revert" option to skip reverting the bams (sometimes useful -- hidden parameter)
    * add a walker that reads in one bam file and outputs N bam files, one for each read group in the original bam. This is a very important step in any BAM reprocessing pipeline.

I am using this new pipeline to process the CEU and YRI PCR Free WGS
trios.
2013-08-13 23:35:32 -04:00
Eric Banks 69e78efeae Merge pull request #366 from broadinstitute/gg_gatkdocfixes
More gatkdoc fixes
2013-08-13 04:52:03 -07:00