Commit Graph

5221 Commits (188c4f67b07fffae8e2c471ef1dd8de2fa40375b)

Author SHA1 Message Date
kshakir 188c4f67b0 Ignore missing external directory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5262 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 19:22:21 +00:00
kshakir a189454343 FCP only adds the expand intervals QFunction once per script instead of once per QFunction using the ExpandTargets scala trait.
Eval dbSNP's type now based on eval dbSNP instead of genotype dbSNP.
Using an external treemap instead of the JGraphT internal node set to speed up larger graph generation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5261 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 19:09:03 +00:00
delangel f1d708f4d4 Fixes for HRun annotation in case of indels:
a) For deletions the value was completely broken: we'd report 0 or -1.
b) For indels, we now report the maximum of the forward and backward values. Empirically I've seen many sites that are not strand biased but still look like artifacts, and there the homopolymer run is always on the right only (because we left-align by convention).




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5260 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 18:57:21 +00:00
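
A minimal sketch of point (b) in the HRun commit above: take the homopolymer run on each side of the indel and report the larger one. This is not the GATK code. The function names and the convention that the indel sits between index `pos` and `pos + 1` of the reference string are assumptions made for illustration.

```python
def run_length(bases: str, start: int, step: int) -> int:
    """Length of the homopolymer run that begins at `start` and extends in direction `step` (+1 or -1)."""
    if not 0 <= start < len(bases):
        return 0
    base = bases[start]
    count = 0
    i = start
    while 0 <= i < len(bases) and bases[i] == base:
        count += 1
        i += step
    return count


def indel_hrun(ref: str, pos: int) -> int:
    """Maximum of the forward and backward runs around an indel.

    ASSUMPTION (hypothetical convention): the indel lies between ref[pos] and ref[pos + 1].
    """
    forward = run_length(ref, pos + 1, +1)   # run to the right of the indel
    backward = run_length(ref, pos, -1)      # run to the left of the indel
    return max(forward, backward)
```

With left-aligned indels the run usually lies to the right, so a forward-only or backward-only measure can miss it; taking the maximum covers both cases.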
hanna fb9f92d09c For Kristian...bug fixes for mechanism allowing external source
directory to live anywhere on the filesystem.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5259 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 18:35:27 +00:00
asivache 0e04e95245 Bug fix: when extracting reference sequence for the event from the reference genome, the tool was treating Deletions and MNPs of length N in exactly the same way: ref_bases[current_pos+1,...,current_pos+N]. This is correct for Deletions but not for MNPs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5258 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 16:15:42 +00:00
carneiro 497e9ab83b too hasty... cleaning up debug messages ;)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5257 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 02:11:03 +00:00
carneiro b4da843c49 now processes either a single bam file or a list of bam files in parallel.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5256 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 02:07:22 +00:00
asivache 52eedaf22d Subtle but very annoying bug due to an incorrect exit condition on backward traversal. Example of the incorrect old behavior (found by Martha Borkan; this normally would NOT happen with the combination of match/mismatch/open/extend parameters we have been using. To reproduce in older builds, use match=10.0, mismatch=-9.0, open=-15.0, extend=-6.66):
Let's align two sequences (shown below, good alignment):

AAATTTGGTAAAA-GT
AAATTTGGTAAAAGGT

Now let's reverse the very same sequences and align again:

 TGAAAATGGTTTAAA
TGGAAAATGGTTTAAA

Note how we lost the deletion and got a mismatch instead at the very first letter of the upper sequence. The overall score of any particular alignment does not depend on the direction of the traversal, so the best alignment (with the highest score) should stay the same too.

New version fixes this issue and produces correct alignment of reverse sequences (up to the different choice of redundant position for the deletion):

T-GAAAATGGTTTAAA
TGGAAAATGGTTTAAA

This version also has the main() method reinstated, so the aligner can be run on its own as a little app.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5255 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 00:02:32 +00:00
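
The claim in the aligner commit above, that the best score does not depend on traversal direction, can be checked with a small affine-gap scorer. The sketch below is a plain global alignment (Gotoh-style dynamic programming), not the aligner from this repository. The function name and the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend` are assumptions; the parameter values are the ones quoted in the commit message.

```python
NEG_INF = float("-inf")


def affine_gap_score(a: str, b: str, match: float = 10.0, mismatch: float = -9.0,
                     gap_open: float = -15.0, gap_extend: float = -6.66) -> float:
    """Best global alignment score of a and b under affine gap penalties.

    ASSUMPTION: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # M: last column pairs a[i-1] with b[j-1]
    # X: last column pairs a[i-1] with a gap
    # Y: last column pairs a gap with b[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend, Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend, X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


upper = "AAATTTGGTAAAAGT"
lower = "AAATTTGGTAAAAGGT"
forward_score = affine_gap_score(upper, lower)
reverse_score = affine_gap_score(upper[::-1], lower[::-1])
```

For these two sequences the best global alignment has 15 matches and one single-base gap, so both directions should score 15 * 10.0 - 15.0 = 135.0. An aligner whose backtrace returns different alignments for the forward and reversed inputs, as in the old behavior described above, is therefore dropping part of an optimal path rather than finding a genuinely better one.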
fromer 6e291820d3 GeneNamesIntervalWalker outputs all genes in each interval; walkers now require a ROD named "intervals"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5254 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-16 19:58:09 +00:00
carneiro 50c870cfce Data Processing Pipeline: local indel realignment, mark duplicates and BQSR. Done.
Pacbio pipeline: now all pacbio bams have baq annotated in so running UG is uber fast.

Methods pipeline: minor cosmetic changes.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5253 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-16 17:22:30 +00:00
fromer b304ced801 Updated haplotype calculator to correctly terminate haplotypes RIGHT BEFORE an unphased het
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5252 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-16 17:10:01 +00:00
kiran c0a4af3809 Expands targets by 50-bp on both sides when the expandIntervals argument is greater than 0.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5251 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-16 14:47:52 +00:00
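
A rough sketch of the behavior described in the expandIntervals commit above: targets are padded by 50 bp on each side only when the argument is greater than 0. The function name, the tuple layout `(contig, start, end)`, and the clamp at position 1 (1-based coordinates) are assumptions for illustration. The real pipeline would also need to clamp at the contig end, which is omitted here.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive (assumed)


def expand_targets(targets: List[Interval], expand_intervals: int, flank: int = 50) -> List[Interval]:
    """Pad every target by `flank` bp on both sides when expand_intervals > 0."""
    if expand_intervals <= 0:
        return list(targets)
    # ASSUMPTION: coordinates are 1-based, so the start is clamped at 1.
    return [(contig, max(1, start - flank), end + flank) for contig, start, end in targets]
```

For example, `expand_targets([("chr1", 1000, 2000)], 1)` gives `[("chr1", 950, 2050)]`, while passing `0` returns the targets unchanged.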
carneiro 6d3b878dde data processing pipeline script already does:
. Local Indel Realignment 
. Mark Duplicates

will do:
. Base Quality Score Recalibration (soon)

it's working with a single BAM for testing, but will work with a list of bam files.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5250 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 21:49:05 +00:00
depristo 5a51c9a815 AWS_S3 logging is now enabled by default. It first tries to log internally at the Broad and, if it can't, falls back to AWS_S3. The DEV option is removed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5249 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 20:20:14 +00:00
corin d2efea6003 This is a draft of the improved and prettified pipeline. It may not yet compile, but Kiran is taking over adding a few more things as I finish up other tasks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5248 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 19:35:00 +00:00
kshakir d185c2961f Added pipeline for calling FCP in batches called MultiFullCallingPipeline.
Bug smashes for the MFCP:
  Synchronized access to LSF library and modifications to the QGraph.
  If values are missing from the graph with -run, make sure to exit with a non-zero status.
  Refactored QGraph to pre-generate a unique Int for each QNode speeding up getHashCode/equals inside the graph.
  Added jobPriority and removed jobLimitSeconds from QFunction.
  All scatter gather is by default in a single sub directory queueScatterGather.
  Moved some FCPTest into BaseTest/PipelineTest for use by MFCPTest.
  Rev'ed the 1000G bams used for validation from v1 to v2 and added code to look for the bams before running other tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5247 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 18:26:14 +00:00
carneiro 7598f5f6a7 forgot to remove a debug line.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5246 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 16:25:48 +00:00
carneiro e45b699ac0 standardizing the name of the scripts and fixing some bugs with the remapping.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5245 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 16:19:58 +00:00
carneiro 87e19a17ae small updates to the variant eval part of the pipeline, some updates to the pacbio specific pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5244 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 16:19:07 +00:00
depristo 356eb264ab Now says FNR, not FDR. We really need to clean up VQSR
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5243 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 12:28:09 +00:00
chartl aeee41a755 Fix broken pipeline test (replacing PASS with .)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5242 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 02:19:18 +00:00
chartl 851b3e71f9 Major revision of the batch merge script. All sites are now used, hooks for some UG settings, no longer reliant on the pipeline management library (pipeline libs are probably going to go away -- nobody uses them)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5241 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 23:52:05 +00:00
fromer d6e3f2eba6 Added GC content calculator for CNV data
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5240 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 22:29:55 +00:00
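
The commit above does not say how the GC content calculator works. As a generic illustration only, GC content is usually computed as the fraction of G and C bases among the unambiguous bases of a sequence. The function name and the choice to ignore `N` and other ambiguity codes are assumptions.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G/C among A/C/G/T bases; returns 0.0 if there are none.

    ASSUMPTION: ambiguous bases such as N are excluded from the denominator.
    """
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    at = sum(seq.count(base) for base in "AT")
    total = gc + at
    return gc / total if total else 0.0
```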
chartl a081f3b94f Modifications, bugfix to theoretical posteriors. (Bug fix: eliminated discontinuity in prior distribution)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5239 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 19:47:34 +00:00
hanna b8c3c3ae6e Added commons math, for Kristian.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5238 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 18:57:21 +00:00
asivache 7a11b4f35d Another change in variant classification values
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5237 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 17:47:58 +00:00
asivache 7f7d7eb2d1 Inconsequential changes, more 'variant classification' values are recognized
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5236 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 17:36:39 +00:00
kiran d3660aa00e Very basic functionality for annotating indels (specifies whether the indel is frameshift, inframe, or non-coding). Does not attempt to recalculate the variant codon, variant amino acid, or whether the site falls within a splice region. Added a convenience method to WalkerTest for building command-line arguments with the proper spacing (so that I stop getting annoyed when I've gotten it wrong and the test system yells at me).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5235 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-13 17:58:20 +00:00
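
The three indel categories named in the commit above follow from the length change and from whether the site is coding. The sketch below shows that rule in isolation. The function name and its inputs are hypothetical, and deciding whether a site is in a coding region is left to the caller.

```python
def classify_indel(ref_allele: str, alt_allele: str, in_coding_region: bool) -> str:
    """Return 'non-coding', 'inframe', or 'frameshift' for an indel."""
    if not in_coding_region:
        return "non-coding"
    delta = abs(len(ref_allele) - len(alt_allele))
    if delta == 0:
        raise ValueError("alleles have equal length: not an indel")
    # A length change that is a multiple of 3 keeps the reading frame intact.
    return "inframe" if delta % 3 == 0 else "frameshift"
```

Like the commit it illustrates, this does not recompute the variant codon or amino acid and does not look at splice regions.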
hanna 8d6db5d188 Additional logging of the temp file creation, management, and merging process
for VCF files.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5234 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 22:07:25 +00:00
carneiro 5f10fffa47 merge intervals now prints a sorted list in the end.
added the ccs datasets to the pbCalling pipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5233 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 20:57:59 +00:00
carneiro 50c2fa3c3a this -1 made ALL the difference in the world. Minor bug fix.
Regular updates to the pbCalling pipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5232 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 19:25:09 +00:00
fromer cdf53188d6 Updated DoC to work with scatter-gather; and, also manually implemented scatter-gather by sample above the scatter-gather by interval. Thanks to Khalid for his support!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5231 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 19:14:42 +00:00
carneiro e7d38247bb chunkIntervals.lua creates 1Mb interval chunks out of any .intervals file. Useful for methods development pipeline datasets.
remapAmplicons.lua takes a sam file with reads aligned to amplicon references, a reference genome, and an amplicon reference mapping table, and rewrites the sam file with mappings to the reference sequence.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5230 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 18:21:31 +00:00
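
The chunking that chunkIntervals.lua is described as doing can be sketched as follows. The original script is Lua and is not shown in this log; this Python version is only an illustration, and the function name, the 1-based inclusive coordinates, and the tuple layout are assumptions.

```python
from typing import List, Tuple


def chunk_interval(contig: str, start: int, end: int,
                   chunk_size: int = 1_000_000) -> List[Tuple[str, int, int]]:
    """Split one interval into consecutive chunks of at most chunk_size bases.

    ASSUMPTION: coordinates are 1-based and inclusive on both ends.
    """
    chunks = []
    pos = start
    while pos <= end:
        stop = min(pos + chunk_size - 1, end)
        chunks.append((contig, pos, stop))
        pos = stop + 1
    return chunks
```

An interval `chr1:1-2500000` would come out as two full 1 Mb chunks followed by one 500 kb remainder.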
asivache 03482bf7c4 Number of MQ0 reads in each sample (format field)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5229 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 17:16:26 +00:00
asivache 8560bb290b Allelic fractions are now computed on MQ>0 reads only; total depth in each sample still includes MQ0 as per usual convention. Also renamed for clarity.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5228 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 17:13:15 +00:00
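
The two commits above separate three per-sample numbers: total depth (MQ0 reads included), the MQ0 read count, and the allelic fraction (MQ>0 reads only). A minimal sketch of that bookkeeping is below. The function name and the representation of a read as a `(base, mapping_quality)` pair are assumptions, not the tool's data model.

```python
from typing import List, Tuple


def allele_stats(reads: List[Tuple[str, int]], alt_base: str) -> Tuple[int, int, float]:
    """Return (total_depth, mq0_count, alt_allelic_fraction) for one sample at one site."""
    total_depth = len(reads)                                # includes MQ0 reads
    mq0_count = sum(1 for _, mq in reads if mq == 0)
    informative = [base for base, mq in reads if mq > 0]    # MQ>0 reads only
    alt_count = sum(1 for base in informative if base == alt_base)
    fraction = alt_count / len(informative) if informative else 0.0
    return total_depth, mq0_count, fraction
```

Keeping the MQ0 reads out of the fraction stops ambiguously mapped reads from diluting or inflating it, while the depth stays comparable with other tools that count every read.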
ebanks 9554df1a7c Adding integration test for indels in VF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5227 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 16:58:57 +00:00
carneiro c630701a76 Following Ryan's suggestion, I am moving the Methods Development Calling pipeline to the Core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5226 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-10 17:36:05 +00:00
carneiro 9c2c5efe35 a modified version of the Methods Development calling pipeline made to work with pacbio data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5225 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-10 16:06:50 +00:00
depristo b1e4e1afb6 Slightly better output now -- no longer emitting pdfs by default. Emails will go to gsamembers now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5224 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-10 13:02:24 +00:00
fromer 947cc44854 Thanks to Matt for walking me through a proper version of VCF_BAM_utilities! Feel free to add to it, or use it to get the samples in a VCF file, a BAM file, or a collection of BAM files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5223 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-09 18:08:27 +00:00
hanna b992abb6eb A few more unit tests plus some extra
functionality for BAM index visualization.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5222 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-09 01:51:34 +00:00
kshakir 4d1cca95bb Removed deprecated getDbsnpFile.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5221 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 21:12:15 +00:00
kshakir a8ab5a5fb9 After code review with APSG, trying a patch for SIGSEGV errors which checks the LSF result codes from lsb_openjobinfo instead of checking for a null return value from lsb_readjobinfo.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5220 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 21:08:22 +00:00
delangel f3de9ee3e0 Refactoring of indel evaluation code to make it easier for external functions to get access to indel classification, in preparation for IndelMetricsByAC to stratify indel classes by AC (not done yet).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5219 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 17:35:16 +00:00
delangel 3635606cd8 Temp checkin just for experimentation: exposed probabilistic alignment parameters to command line interface to make it easier to experiment on their effects, although a full scrap/rewrite of this should be coming soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5218 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 17:33:29 +00:00
carneiro e5cfc6ae74 NA12878 hg19 dataset was added to the methods pipeline. (and I am running it)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5217 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 16:17:46 +00:00
ebanks 196eb77699 CG var format is screwed up and doesn't quite fit into the VariantsToVCF mold (we need to see multiple records before we can assign genotypes to a given position), so it's safer to keep this separate from the other well-behaved formats. Hopefully, it's temporary anyways.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5216 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 03:18:38 +00:00
ebanks 4fe0fcd707 Updates to handle CG data, headers, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5215 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 03:16:05 +00:00
fromer 8d0f1b75d5 Added queue/util/BAMutilities Object [with BAM and VCF parsing utilities], which is now used by my qscripts that robustly split runs by sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5214 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 22:17:29 +00:00
kshakir 8040998c15 Renamed the pipeline yaml dbsnpFile to genotypeDbsnp, and added an evalDbsnp.
Added a genotypeDbsnpType and evalDbsnpType to check the extensions for .vcf or .rod.
Moved renaming of "recalibrated" bams to "cleaned" from sed to yaml generation template (see diff for more info).
Renamed fCP.q to FCP.q.
Though it's still disabled until VariantEval is updated, added changes above to the FCPTest.
Removed refseq table from the queue.sh wrapper script. Only specified in the yaml.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5213 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 22:01:09 +00:00