Commit Graph

54 Commits (a8faacda4e57d4cfea2b6f6539a7f924c5d3fab1)

Author SHA1 Message Date
carneiro a4ffae880d Subversion crashed my intellij BADLY, so now moving the data processing pipeline to core in 2 steps.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5932 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:31:24 +00:00
carneiro 36db9bdcd5 Implemented and tested BWA alignment in the data processing pipeline.
caveat: Right now bwa only supports one read group, so if the original file had multiple @RG lines, only the first one will be kept. (working on a solution to this)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5931 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 23:03:07 +00:00
carneiro c85a1d9210 Implemented and tested BWA alignment in the data processing pipeline.
Renamed it and moved to core. Happy to support it.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5930 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 22:58:55 +00:00
carneiro 355be57539 fixing the pipeline so that it still works while I'm adding support for BWA.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5921 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-02 19:32:28 +00:00
carneiro 5974675b43 Two intermediate commits, to work over the weekend.
ReplicationValidationWalker: Just the skeleton of what will be the implementation of the replication/validation model.
dataProcessingV2: Committing an UNTESTED implementation of BWA alignment. I am running tests on it over the weekend.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5900 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-27 22:03:08 +00:00
carneiro 2efd807952 No more default callsets, they're now mandatory arguments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5858 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-23 21:56:43 +00:00
carneiro b5b8cb959a Added VQSR to the downsampling script and changed memory limits for the clean script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5817 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 20:07:42 +00:00
carneiro f35d955490 recalibrates a dataset splitting between good and bad regions for comparison (used to be named justRecalibrate)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5747 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:43:09 +00:00
carneiro 9f2a8033ff just recalibrates now recalibrates one sample, fully, not splitting intervals (naming makes more sense)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5746 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:42:23 +00:00
carneiro c2f8536e02 removing old GATK options
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5745 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:40:39 +00:00
carneiro 8bb92160b5 Script to identify mendelian violations in the CEU Trio and follow up with supposedly incorrect SNP calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5744 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:19:42 +00:00
carneiro e2b9227d8d script to test BQSR on good/bad regions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5743 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:16:37 +00:00
rpoplin 4bbce42861 Renaming ContrastiveRecalibrator --> VariantRecalibrator in preparation for move to core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5733 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:12:47 +00:00
carneiro a93a9ac663 adding gold standard (full coverage) to the variant eval analysis output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5721 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 16:29:11 +00:00
carneiro 2384e23274 Added the capability of running count covariates only on a given interval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5717 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 21:30:14 +00:00
carneiro 3868a7e778 Oneoff project to downsample, bootstrap and call snps to test sensitivity/specificity of downsampled coverage in WEX projects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5713 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 19:17:30 +00:00
carneiro f04cc4321f fixed a bug when the pipeline was used on a single bam.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5708 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 17:19:22 +00:00
carneiro d35c7d1029 - minor changes to the 'justclean' script to handle the Trio Cleaning.
- fixing a bug on single ended BWA option of the data processing pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5662 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-19 16:35:24 +00:00
carneiro b722ebf244 quick help/comments updates to match the wikipage.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5569 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-04 12:55:55 +00:00
carneiro 0a772688fe implementation of the Gatherer class for CountCovariates, which makes it now scatter/gatherable. Kudos to the @Gather annotation Khalid just introduced!
QuickCCTest is my test script for the gatherer.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5547 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 21:15:21 +00:00
carneiro 20344a27b4 Quick updates to the data processing pipeline after successfully cleaning the papuans. It now scatter gathers everything and runs in the hour queue for low pass data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5546 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 21:13:33 +00:00
carneiro 5d26c66769 Count Covariates is almost scatter-gatherable now!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5537 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 22:25:33 +00:00
carneiro c3f70cc5cb DPP: Updated after some tests with BWA. Still needs more testing.
MDP: Removed ApplyVariantCut as it's no longer necessary with VQSR2.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5534 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 18:22:09 +00:00
carneiro ccdc021207 Added BWA (option) to the data processing pipeline. Lots of testing still happening...
little fix to the calling pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5528 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 20:17:57 +00:00
carneiro 1281c842ad quick updates to conform with the new picard bam function structure
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5505 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 16:58:37 +00:00
kshakir f3e94ef2be Walkers can now specify a class extending from Gatherer to merge custom output formats. Add @Gather(MyGatherer.class) to the walker @Output.
JavaCommandLineFunctions can now specify the classpath+mainclass as an alternative to specifying a path to an executable jar.
JCLF by default pass on the current classpath and only require the mainclass be specified by the developer extending the JCLF, relieving the QScript author from having to explicitly specify the jar.
Like the Picard MergeSamFiles, GATK engine by default is now run from the current classpath. The GATK can still be overridden via .jarFile or .javaClasspath.
Walkers from the GATK package are now also embedded into the Queue package.
Updated AnalyzeCovariates to make it easier to guess the main class, AnalyzeCovariates instead of AnalyzeCovariatesCLP.
Removed the GATK jar argument from the example QScripts.
Removed one of the most FAQ when getting started with Scala/Queue, the use of Option[_] in QScripts:
1) Fixed mistaken assumption with java enums. In java enums can be null so they don't need nullable wrappers.
2) Added syntactic sugar for Nullable primitives to the QScript trait. Any variable defined as Option[Int] can just be assigned an Int value or None, ex: myFunc.memoryLimit = 3
Removed other unused code.
Re-fixed dry run function ordering.
Re-ordered the QCommandline companion object so that IntelliJ doesn't complain about missing main methods.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5504 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 14:03:51 +00:00
carneiro 3414bccb46 documentation changes to agree with the wiki
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5494 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 21:48:49 +00:00
carneiro 28149e5c5e GenotypeAndValidate version 2, ready to be used.
- now it differentiates between confident REF calls and not confident calls.
- you can now use a BAM file as the truth set. 
- output is much clearer now

dataProcessingPipeline version 2, ready to be used.
- All the processing is now done at the sample level
- Reads the input bam file headers to combine all lanes of the same sample.
- Cleaning is now scattered/gathered. Inteligently breaks down in as many intervals as possible, given the dataset.
- Outputs one processed bam file per sample (and a .list file with all processed files listed)
- Much faster, low pass (read Papuans) can run in the hour queue.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 20:18:02 +00:00
carneiro 748787c509 helper script to the papuan processing... minor updates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5489 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 14:11:02 +00:00
kshakir f6d4b0aaf5 Using an embedded version of Picard for merging un-indexed bam files after scatter/gather instead of requiring the QScripts to specify the picard JAR. May do this for the GATK jar too.
Fixed initialization of pending counts when using -startFromScratch so the count doesn't start at zero and end at -<#njobs>.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5483 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 18:20:01 +00:00
carneiro 96628457cb pacbio calling pipeline also using VQSR2 now, minor updates on the other pipelines to get the papuans through.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5479 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 22:06:52 +00:00
carneiro c9442e4b21 now merging bam files per sample and processing according to cleaning options.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5477 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 21:31:29 +00:00
carneiro 18fac5112c first step towards the new sample based processing pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5471 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:25:15 +00:00
carneiro 55e5971b3b this is a oneoff script to clean the papuans and test TargetCreator and IndelRealigner with scatter gathering.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5457 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 17:09:53 +00:00
carneiro 42f70d9e07 join all per-lane Bams before doing target realigning and indel cleaning.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5435 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 16:11:03 +00:00
carneiro 0daa65b9ef quick and dirty 'close your eyes' solution to run the papuans over the weekend. Will be properly fixed soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5370 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 21:42:22 +00:00
carneiro c7a51f0de7 fixed 1kg pilot dindel calls vcf file and combined all vcfs into one master dindel file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5335 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:04:58 +00:00
carneiro fd5d1f9cfc minor cosmetic changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5322 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 21:56:35 +00:00
carneiro 81414a21dd dpp: back to using 4gb memory assuming all is right with IndelRealigner now.
mdcp: Some class structural changes due to the inclusion of indel calls. ApplyCut now chooses the tranche differently for each dataset.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5319 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 19:21:02 +00:00
carneiro 6db3210387 the data processing pipeline needs more memory...
directory updates in the methods pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5305 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 17:22:58 +00:00
carneiro 897a333aba Methods Development Pipeline now has the option of calling indels with the -indels parameter. Also updated some databases and the new NA12878 HiSeq hg19 that Tim just funneled to us, is updated and called.
Small fixes on the data processing pipeline


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5304 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 17:12:55 +00:00
carneiro 2a48ec1307 now only accepts intervals files if the user specifically requests to report bams at interval only.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5291 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 16:49:58 +00:00
carneiro c61dd2f09f data processing pipeline now has on the fly bam indexing (powered by Matt) some new parameters, Indel Cleaning with constrain movement and fixMates is gone.
setting up methods development pipeline for some cosmetic changes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5277 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 23:13:54 +00:00
depristo d97ed3e080 Comments for Mauricio
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5275 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 16:58:34 +00:00
carneiro acad3ada06 changed baq to calculate_as_necessary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5270 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 23:50:46 +00:00
carneiro 7f9ca6b28a full data processing pipeline, now deleting intermediate files and performing both phases (per lane and combined) of the processing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5269 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 23:34:00 +00:00
carneiro 497e9ab83b too hasty... cleaning up debug messages ;)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5257 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 02:11:03 +00:00
carneiro b4da843c49 now processes either a single bam file or a list of bam files in parallel.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5256 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 02:07:22 +00:00
carneiro 50c870cfce Data Processing Pipeline: local indel realignment, mark duplicates and BQSR. Done.
Pacbio pipeline: now all pacbio bams have baq annotated in so running UG is uber fast.

Methods pipeline: minor cosmetic changes.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5253 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-16 17:22:30 +00:00
carneiro 6d3b878dde data processing pipeline script already does:
. Local Indel Realignment 
. Mark Duplicates

will do:
. Base Quality Score Recalibration (soon)

it's working with a single BAM for testing, but will work with a list of bam files.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5250 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 21:49:05 +00:00