chartl
fe7f45ee2e
First pass at recalibrating associations, with optional data whitening. Modification to the TableCodec so it can natively read bedgraph files (just needed to add an extra header marker: "track").
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5515 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 19:35:39 +00:00
hanna
ac39f5532e
Turn off index caching.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5514 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 18:48:23 +00:00
kshakir
8e67c5567c
When host name lookup fails just use the whole internet address instead of truncating to the last two octets of the IP address.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5513 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 18:18:22 +00:00
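A minimal sketch of the fallback described in 8e67c5567c, assuming the lookup result is passed in as a possibly-null name. The class and method names here are hypothetical and are not taken from the Queue source.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class HostLabel {
    /**
     * Picks a label for a host: the resolved name when the lookup succeeded,
     * otherwise the complete textual address (no truncation to the last octets).
     */
    static String label(String resolvedName, String fullAddress) {
        if (resolvedName == null || resolvedName.isEmpty()) {
            return fullAddress;
        }
        return resolvedName;
    }

    /** Applies label() to the local machine; falls back to "unknown" if even the address is unavailable. */
    static String describeLocalHost() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            String address = local.getHostAddress();
            // getCanonicalHostName() returns the textual address when the name cannot be resolved.
            String name = local.getCanonicalHostName();
            return label(name.equals(address) ? null : name, address);
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeLocalHost());
    }
}
```

Keeping the whole address avoids two machines that share the last two octets ending up with the same label.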
hanna
8d8aed6a67
Fix correctness issue when dynamically merging many files.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5512 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 16:35:43 +00:00
delangel
c9283e6bc5
Refinement to previous commit: no need to duplicate code to annotate rsID since variantAnnotatorEngine is called from UG anyways.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5511 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 15:00:32 +00:00
delangel
3383733379
Same commit as previous one for VariantAnnotator.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5510 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 12:07:18 +00:00
delangel
8701dfe8d3
Hideous, horrible, hairy mutant bug: when we annotate ID field in indels, we were looking for SNP records matching the position, instead of indel records.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5509 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 12:04:08 +00:00
kshakir
3e3ff4a9e7
Bam gathering passes on the compression_level and the create_index flag to MergeSamFiles.
...
VCF gathering passes on the no_header and sites_only flags to CombineVariants.
Fixed deletion of gathered log files. Although they are intermediate and do not need to be re-run if not present, they should not be deleted.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5508 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 03:58:38 +00:00
carneiro
47279ee56e
Added --concordance option that outputs the intersection between two VCF files. Useful to see what calls were made in both technologies/algorithms.
...
Wiki has been updated accordingly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5507 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 21:27:16 +00:00
kshakir
e47513f043
Minor updates to match the wiki documentation.
...
Upper cased the PartitionType enum values.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5506 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 20:22:23 +00:00
carneiro
1281c842ad
quick updates to conform with the new picard bam function structure
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5505 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 16:58:37 +00:00
kshakir
f3e94ef2be
Walkers can now specify a class extending from Gatherer to merge custom output formats. Add @Gather(MyGatherer.class) to the walker @Output.
...
JavaCommandLineFunctions can now specify the classpath+mainclass as an alternative to specifying a path to an executable jar.
JCLFs by default pass on the current classpath and only require that the mainclass be specified by the developer extending the JCLF, relieving the QScript author from having to explicitly specify the jar.
Like the Picard MergeSamFiles, GATK engine by default is now run from the current classpath. The GATK can still be overridden via .jarFile or .javaClasspath.
Walkers from the GATK package are now also embedded into the Queue package.
Updated AnalyzeCovariates to make it easier to guess the main class, AnalyzeCovariates instead of AnalyzeCovariatesCLP.
Removed the GATK jar argument from the example QScripts.
Removed the cause of one of the most frequently asked questions when getting started with Scala/Queue, the use of Option[_] in QScripts:
1) Fixed mistaken assumption with Java enums. In Java, enums can be null so they don't need nullable wrappers.
2) Added syntactic sugar for Nullable primitives to the QScript trait. Any variable defined as Option[Int] can just be assigned an Int value or None, ex: myFunc.memoryLimit = 3
Removed other unused code.
Re-fixed dry run function ordering.
Re-ordered the QCommandline companion object so that IntelliJ doesn't complain about missing main methods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5504 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 14:03:51 +00:00
ebanks
18271aa1f4
It never fails to amaze me that aligners can find so many different ways to place indels off the ends of contigs
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5503 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 04:17:23 +00:00
ebanks
48b15d42e0
More fixes and improvements. We no longer use any bases under Q20 because random ~Q5s were cluttering the graphs; instead we grab any contiguous segments of size at least MIN_SEQUENCE_LENGTH where all bases are above Q20. Also, I implemented a quick algorithm to traverse the graph (using DFS) to choose the two best scoring paths (haplotypes). Used it successfully at NA12878 HM3 SNP sites to determine whether they are homozygous (no distinction yet between ref and alt) or heterozygous! Indels are the next target. Still have some issues to work out.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5502 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 03:51:19 +00:00
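The segment selection described in 48b15d42e0 can be sketched roughly as below. This is an illustration only: the class name, the method signature, and the choice to treat Q20 itself as usable (>=) are assumptions, not the committed code.

```java
import java.util.ArrayList;
import java.util.List;

final class HighQualitySegments {
    /**
     * Returns [start, end) index pairs for every maximal run of bases whose
     * quality is at least minQual and whose length is at least minLength.
     */
    static List<int[]> find(byte[] quals, int minQual, int minLength) {
        List<int[]> segments = new ArrayList<>();
        int start = -1; // start of the current high-quality run, or -1 if none is open
        for (int i = 0; i <= quals.length; i++) {
            boolean good = i < quals.length && quals[i] >= minQual;
            if (good) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                // The run ended at i (exclusive); keep it only if it is long enough.
                if (i - start >= minLength) {
                    segments.add(new int[] {start, i});
                }
                start = -1;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        byte[] quals = {30, 30, 5, 25, 25, 25, 25, 10, 40};
        for (int[] seg : find(quals, 20, 3)) {
            System.out.println(seg[0] + "-" + seg[1]); // prints "3-7"
        }
    }
}
```

A single low-quality base splits a read into separate segments, so short fragments on either side of it are dropped rather than joined.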
chartl
cd90fdeca1
Right. The issue was not setting the scatter/gather classes appropriately.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5501 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 20:08:53 +00:00
chartl
3c1bf40a45
QScript for scatter-gathering regional association (not quite as easy as using the built-in extension, due to the multiplexer). Currently does not work due to something I'm missing re: scatter gather class, this commit is an interim one.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5500 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 19:42:29 +00:00
hanna
26e3bea76e
Fix for == used to test object equality.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5499 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 18:15:19 +00:00
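The commit above (26e3bea76e) does not say which types were involved; as a generic illustration of the bug class, using String purely as an assumption, reference comparison and value comparison differ like this:

```java
import java.util.Objects;

final class EqualityCheck {
    /** Reference comparison: true only when both variables point at the same object. */
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    /** Value comparison: true when the contents are equal; also safe when either side is null. */
    static boolean sameValue(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String a = new String("chr1");
        String b = new String("chr1");
        System.out.println(sameReference(a, b)); // prints "false": two distinct objects
        System.out.println(sameValue(a, b));     // prints "true": same contents
    }
}
```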
ebanks
401d1cb97f
Bug fixes plus some debugging code added. Broke out DeBruijnVertex into its own class so that the interface is now cleaner. Still very much a work in progress.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5498 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 17:35:34 +00:00
hanna
37fbf17da8
Finally restored code after accidentally removing three days' worth of work:
...
schedule file infrastructure has been restored, and is now a single file.
Only the exact bins required for the traversal are stored in the schedule.
Very close to being able to merge schedule entries.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5497 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 05:52:40 +00:00
ebanks
69646ff840
... and the corresponding integration test update
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5496 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 01:58:07 +00:00
ebanks
ded80e0c57
Trivial change to remove space at the end of the description
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5495 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 01:47:46 +00:00
carneiro
3414bccb46
documentation changes to agree with the wiki
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5494 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 21:48:49 +00:00
carneiro
28149e5c5e
GenotypeAndValidate version 2, ready to be used.
...
- now it differentiates between confident REF calls and not confident calls.
- you can now use a BAM file as the truth set.
- output is much clearer now
dataProcessingPipeline version 2, ready to be used.
- All the processing is now done at the sample level
- Reads the input bam file headers to combine all lanes of the same sample.
- Cleaning is now scattered/gathered. Intelligently breaks down into as many intervals as possible, given the dataset.
- Outputs one processed bam file per sample (and a .list file with all processed files listed)
- Much faster, low pass (read Papuans) can run in the hour queue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 20:18:02 +00:00
chartl
687b2e51b4
Switch from togglable wiggle output to togglable bedgraph format. Can be pulled directly into IGV to show the statistics values. I'll need to bug jim to allow value-toggling in a bedgraph, currently 2nd and 3rd columns are just ignored.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5492 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 17:58:53 +00:00
corin
a6d873b268
Fixing this so it gets the right 129 dbsnp for b37 samples
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5491 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 17:43:20 +00:00
chartl
5a79f16ea4
Fixed an edge case where an exception was thrown if either of the sets was empty for the MWU test. Also altered the output format so U itself is not printed (which, though interesting, isn't so useful for recalibration), but rather a value I call V (really the deviation of U from its expectation).
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5490 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 16:28:44 +00:00
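A rough sketch of the quantity described in 5a79f16ea4, under stated assumptions: U is computed by direct pair counting with ties counted as one half, its expectation under the null is n1*n2/2, and V is taken as the plain difference U - E[U]. Whether the committed code also scales by the standard deviation is not stated in the message, and the names below are hypothetical.

```java
import java.util.OptionalDouble;

final class MannWhitneyDeviation {
    /** Mann-Whitney U for sample x against y: pairs with x > y count 1, ties count 0.5. */
    static double u(double[] x, double[] y) {
        double u = 0.0;
        for (double a : x) {
            for (double b : y) {
                if (a > b) {
                    u += 1.0;
                } else if (a == b) {
                    u += 0.5;
                }
            }
        }
        return u;
    }

    /** Deviation of U from its null expectation n1*n2/2; empty when either sample is empty. */
    static OptionalDouble deviation(double[] x, double[] y) {
        if (x.length == 0 || y.length == 0) {
            return OptionalDouble.empty(); // nothing to compare, so no exception and no value
        }
        double expected = x.length * y.length / 2.0;
        return OptionalDouble.of(u(x, y) - expected);
    }

    public static void main(String[] args) {
        double[] cases = {4, 5, 6};
        double[] controls = {1, 2, 3};
        System.out.println(deviation(cases, controls).getAsDouble()); // prints "4.5"
        System.out.println(deviation(new double[0], controls).isPresent()); // prints "false"
    }
}
```

The pair-counting loop is quadratic; a rank-based computation would be the usual choice for large samples.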
carneiro
748787c509
helper script to the papuan processing... minor updates
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5489 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 14:11:02 +00:00
ebanks
af7f78e8ba
Minor debugging output change.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5488 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 12:59:26 +00:00
ebanks
b463faad92
Fixing typo
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5487 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 03:57:11 +00:00
ebanks
1a9e65bcd4
Updating other walkers now that VCC extends from VC
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5486 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 03:10:40 +00:00
ebanks
0ee687e49d
For Mauricio: now, even in GENOTYPE_GIVEN_ALLELES mode, the VariantCallContext (which now inherits directly from VC) will report reference calls as confidently called if they pass the threshold even if the QUAL of the record itself is low because we were forced to have an ALT allele.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5485 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 02:42:28 +00:00
ebanks
ab6a815184
As per the comments in the commit itself: when reads get mapped to the junction of two chromosomes (e.g. MT since it is actually circular DNA), their unmapped bit is set, but they are given legitimate coordinates. The Picard code will come in and move the read all the way back to its mate - which can be arbitrarily far away and cause records to be written out of order. Very evil.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5484 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 20:30:24 +00:00
kshakir
f6d4b0aaf5
Using an embedded version of Picard for merging un-indexed bam files after scatter/gather instead of requiring the QScripts to specify the picard JAR. May do this for the GATK jar too.
...
Fixed initialization of pending counts when using -startFromScratch so the count doesn't start at zero and end at -<#njobs>.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5483 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 18:20:01 +00:00
chartl
8a0e813b04
A helper script that will take a list of bams, a list of case sample IDs, and a list of control sample IDs, and generate a sample meta data yaml (which includes the bamfiles)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5482 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 16:11:55 +00:00
carneiro
1198a90ac7
cosmetic change.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5481 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 15:46:04 +00:00
ebanks
d9202f2764
Don't try to create a GenomeLoc from an unmapped read
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5480 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 13:46:55 +00:00
carneiro
96628457cb
pacbio calling pipeline also using VQSR2 now, minor updates on the other pipelines to get the papuans through.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5479 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 22:06:52 +00:00
carneiro
4e449905d1
methods development pipeline now sports VQSR2.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5478 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 22:00:46 +00:00
carneiro
c9442e4b21
now merging bam files per sample and processing according to cleaning options.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5477 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 21:31:29 +00:00
ebanks
1c95208e26
Finally found the bug that everyone is reporting on GS. Iterators on PriorityQueues aren't guaranteed to return elements in sorted order (a pretty stupid contract) - so we were passing items to the constrained writer out of order. Just do a Collections.sort instead (1 line of code). Happy father's day!
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5476 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 21:28:19 +00:00
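The PriorityQueue behaviour behind 1c95208e26 is documented in the JDK: its iterator makes no ordering guarantee, and only poll()/remove() return elements in priority order. A minimal sketch of the fix pattern the message mentions (copy, then Collections.sort), with a hypothetical class name and Integer elements standing in for the real records:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class PriorityQueueOrder {
    /** Copies the queue contents and sorts the copy; the queue itself is left untouched. */
    static List<Integer> sortedSnapshot(PriorityQueue<Integer> queue) {
        List<Integer> items = new ArrayList<>(queue); // copy order follows the iterator, so it may be unsorted
        Collections.sort(items);
        return items;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        Collections.addAll(queue, 5, 1, 4, 2, 3);
        System.out.println(sortedSnapshot(queue)); // prints "[1, 2, 3, 4, 5]"
    }
}
```

Draining the queue with poll() would also yield sorted output, but it empties the queue; sorting a copy keeps the queue intact.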
ebanks
9568c84af9
Don't output these messages in INFO mode because they are scaring people unnecessarily
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5475 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:55:22 +00:00
depristo
22ff2573d5
Removed MAG entirely
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5474 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:43:23 +00:00
kiran
55897631ad
Initial attempt at identifying potentially interesting variants in a Mendelian disease context when the called genotypes are uncertain.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5473 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:41:35 +00:00
kshakir
b2b8a4f19f
Re-un-final'ed BAQ.MAG as it was pre r5469.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5472 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:40:31 +00:00
carneiro
18fac5112c
first step towards the new sample based processing pipeline.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5471 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:25:15 +00:00
asivache
1d5326ff0c
Minor fixes to the cmd-line help messages
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5470 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 18:18:04 +00:00
depristo
7857cb5a22
Waiting to go to the hospital -- fixed a bug in the BAQ calculation where the BAQ would NPE if a read had no usable bases (all clipped, for example) but didn't fail the PF filter
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5469 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 17:45:21 +00:00
fromer
e84a27ceea
OverlapWithBedInIntervalWalker calculates the average per-input-interval coverage by the BED intervals track
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5468 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 17:44:46 +00:00
depristo
abc7d1aef9
BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site.
...
ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation. Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field. This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples).
Added a new VariantsToBeagleUnphased walker that writes out a marker-driven hard-call unphased genotypes file suitable for imputing missing genotypes with a reference panel with beagle. Can optionally keep back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power. The bootstrap sites can be written to a separate VCF for assessment as well.
Finally, my general Queue script for creating and evaluating reference panels from VCF files. Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel. Lots of options for exploring the impact of the VQS likelihoods, multiple VCFs for constructing the reference panel, as well as fraction of sites left out in assessing the panel's power.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 03:08:38 +00:00
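The "keep back a fraction of sites" idea from abc7d1aef9 can be illustrated with a seeded random split. This is only a sketch: the class name, the use of plain strings for sites, and the per-site independent draw are assumptions, not the walker's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class HoldOutSites {
    /**
     * Splits sites into two lists: index 0 holds the sites that are kept,
     * index 1 the sites held back for assessing imputation accuracy.
     * Each site is held back independently with probability `fraction`.
     */
    static List<List<String>> split(List<String> sites, double fraction, long seed) {
        Random rng = new Random(seed); // fixed seed makes the split reproducible
        List<String> kept = new ArrayList<>();
        List<String> heldBack = new ArrayList<>();
        for (String site : sites) {
            if (rng.nextDouble() < fraction) {
                heldBack.add(site);
            } else {
                kept.add(site);
            }
        }
        return Arrays.asList(kept, heldBack);
    }

    public static void main(String[] args) {
        List<String> sites = Arrays.asList("1:100", "1:200", "1:300", "1:400");
        List<List<String>> parts = split(sites, 0.25, 42L);
        System.out.println("kept=" + parts.get(0) + " heldBack=" + parts.get(1));
    }
}
```

Because the draw is per site, the held-back fraction is only approximately `fraction` for small inputs.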
depristo
9b8d41160b
GENOTYPE_GIVEN_ALLELES now respects the filter status of the incoming alleles file.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5466 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 02:59:28 +00:00