depristo
8fdad20f33
Useful utility for looking at the file size of GSA file systems
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5556 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-02 03:47:27 +00:00
depristo
f59862dc44
A bit better echos
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5555 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-02 03:47:03 +00:00
fromer
27bfec785e
Some walkers for printing FASTA of reference for bed ROD, and "inverting" a bed file (finding regions not covered in bed)
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5554 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 21:13:51 +00:00
kshakir
73f0610abf
When getCanonicalHostName fails use getHostName instead of getHostAddress as it's more compatible with our mail servers.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5553 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 20:26:26 +00:00
kshakir
abf4b5afbb
Fixed inclusion of GATKEngine into the Queue package.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5552 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 17:45:46 +00:00
depristo
f2c4356a40
Minor usability improvements to the standard eval script.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5551 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 17:36:50 +00:00
droazen
0927b7c297
Fix for bug GSA-441: BAM file list with blank lines gives a confusing error message. Lines containing only whitespace in .list files are now ignored.
Also added support for comments in .list files: lines whose first
non-whitespace character is '#' are now also ignored.
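A minimal sketch of the filtering rule described in this commit, assuming the .list file has already been read into a list of strings. The class and method names (BamListParser, parseListLines) are hypothetical and are not taken from the GATK source.

```java
import java.util.ArrayList;
import java.util.List;

final class BamListParser {
    /**
     * Returns the usable entries of a .list file: lines that are empty or
     * whitespace-only are skipped, as are lines whose first non-whitespace
     * character is '#'. (Hypothetical helper, for illustration only.)
     */
    static List<String> parseListLines(List<String> lines) {
        List<String> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            entries.add(line);
        }
        return entries;
    }
}
```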
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5550 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 15:04:35 +00:00
kshakir
4f8411f4b5
Revved Picard to access new flag to disable mmap for bam indices. Only added a 3% speed boost but the mmap was added to the heap count, making it harder to specify/restrict the total resident memory size in LSF. Specifying -Xmx4g will now stay much closer to 4g resident memory usage versus bumping up to 9g when accessing 900 x ~8Mb bai's.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5549 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 01:40:41 +00:00
asivache
df53351b0f
Get rid of score cutoff at 0 in the alignment matrix (i.e. score[cell] = max(0, score[from_parent_cells])). Use the computed score as is. Technically, it's pretty much NW now, not SW.
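The difference described here can be sketched as a single cell update. This is an illustration only: the names (AlignmentCell, cellScore, floorAtZero) are hypothetical, and the three inputs are assumed to already include the match/mismatch or gap cost for the move into the cell.

```java
final class AlignmentCell {
    /**
     * Score of one matrix cell given the three candidate scores arriving from
     * its parent cells. With floorAtZero the score is clamped at 0, which is
     * the Smith-Waterman (local) rule; without it the best candidate is used
     * as is, which is the Needleman-Wunsch-style behavior the commit describes.
     */
    static int cellScore(int fromDiagonal, int fromUp, int fromLeft, boolean floorAtZero) {
        int best = Math.max(fromDiagonal, Math.max(fromUp, fromLeft));
        return floorAtZero ? Math.max(0, best) : best;
    }
}
```

With candidates -3, -5 and -4, the clamped rule restarts the alignment at 0, while the unclamped rule carries -3 forward.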
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5548 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 00:11:04 +00:00
carneiro
0a772688fe
implementation of the Gatherer class for CountCovariates, which makes it now scatter/gatherable. Kudos to the @Gather annotation Khalid just introduced!
...
QuickCCTest is my test script for the gatherer.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5547 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 21:15:21 +00:00
carneiro
20344a27b4
Quick updates to the data processing pipeline after successfully cleaning the papuans. It now scatter gathers everything and runs in the hour queue for low pass data.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5546 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 21:13:33 +00:00
carneiro
dac1309dbd
Added two modes for selecting variants at random (random sampling).
...
-number N -- generates a VCF with exactly N randomly chosen variants with equal probability.
-fraction F -- generates a VCF with approximately F (between 0-1) randomly chosen variants with equal probability. (Similar behavior to RandomlySplitVariants walker).
The reason for two modes is that the first one may need a lot of memory if your sample size is too large. The wiki is being updated with this information now.
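A sketch of why the two modes differ in memory use. The commit does not say which algorithm is used; reservoir sampling is assumed here for the exact-N mode purely as an illustration, and all names (RandomSelection, selectNumber, selectFraction) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RandomSelection {
    /**
     * Exactly-N mode (assumed: reservoir sampling). All N chosen items must be
     * held in memory until the input is exhausted, so memory grows with N.
     */
    static <T> List<T> selectNumber(Iterable<T> items, int n, Random rng) {
        List<T> reservoir = new ArrayList<>(n);
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < n) {
                reservoir.add(item);
            } else {
                long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (j < n) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }

    /**
     * Fraction mode: each item is kept independently with probability
     * 'fraction'. A streaming tool could write each kept item immediately;
     * the list here only makes the sketch easy to inspect.
     */
    static <T> List<T> selectFraction(Iterable<T> items, double fraction, Random rng) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            if (rng.nextDouble() < fraction) {
                kept.add(item);
            }
        }
        return kept;
    }
}
```

The fraction mode yields approximately F of the input rather than an exact count, which matches the "approximately" in the option description.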
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5545 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 21:12:40 +00:00
carneiro
8a3b7d88aa
It was returning 1 when it should return 0
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5544 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 20:50:38 +00:00
depristo
c7445a6fbd
Now that logging is so standard, messages about logging are only printed at DEBUG. Also, found a way to silence the mime.types warning, which doesn't matter at all to us.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5543 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 16:49:39 +00:00
droazen
7b452ea2b9
Fix for bug GSA-430: Can't specify same BAM file twice on the command line. An ArgumentException with an appropriate error message and a list of the duplicate BAMs is now thrown in this case.
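A minimal sketch of the duplicate check this commit describes, assuming the BAM inputs are available as a list of path strings. The names are hypothetical, and IllegalArgumentException stands in for the GATK ArgumentException mentioned above.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateBamCheck {
    /** Returns every path that appears more than once, in first-repeat order. */
    static Set<String> findDuplicates(List<String> bamPaths) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String path : bamPaths) {
            if (!seen.add(path)) { // add() returns false if already present
                duplicates.add(path);
            }
        }
        return duplicates;
    }

    /** Fails with a message listing the duplicated BAMs, if there are any. */
    static void requireNoDuplicates(List<String> bamPaths) {
        Set<String> duplicates = findDuplicates(bamPaths);
        if (!duplicates.isEmpty()) {
            // Stand-in for the ArgumentException named in the commit message.
            throw new IllegalArgumentException(
                    "BAM files specified more than once: " + duplicates);
        }
    }
}
```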
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5542 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-30 22:23:24 +00:00
hanna
deab9f0aa5
Initial work on proto-shard merger:
...
- create size() method that returns an approximation of the uncompressed size in bytes of BAM span.
I'll use this method as a protoshard weighting function until we determine how to normalize the
weights across the different data access mechanisms (reads, reference, RODs).
- Implementations of basic union/intersection/subtraction mechanisms for BAM spans; should be enough
to get an accurate weight for two proto-shards put together.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5541 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-30 22:03:43 +00:00
kshakir
d5ac822e97
When @Gather annotation is missing (probably due to an unclean build) printing out the full field+class name for debugging purposes.
...
Custom gatherer prints out the class name in the logs.
Try to retrieve mail domain from /etc/mailname before falling back to the hostname.
Building oneoff jars during ant oneoffs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5540 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-30 21:43:37 +00:00
chartl
328f89f66a
Minor changes to MannWhitneyU:
...
- Comment fixes to better explain why two-sided test wants to use the LOWER (not higher) value for U
- Much more direct testing of MWU functions
- Uniform approximation was always using the < cumulant (sometimes the > cumulant should be used instead)
- Uniform approximation currently not used (regime in which it was being used was not the right one -- not necessarily bad, but not an improvement over normal)
+ this particular approximation is for major imbalances of the form m >> n. Code may be altered in the future to use this method for this particular regime, if the method's not too slow.
- Hook into one-sided test.
RegionalAssociationRecalibrator: NaNs were being caused by presence of Infinity and -Infinity values out of the walker. Currently I'm just re-setting them to arbitrary post-whitened values, but the walker will be changed to prevent output of these values, and the "fix" will be undone.
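To illustrate the MannWhitneyU note above about the two-sided test using the lower U: the two U statistics always sum to n*m, and the usual convention is to compare the smaller one against the lower tail of the null distribution. The sketch below uses a direct O(n*m) pair count with half credit for ties; it is not the MannWhitneyU class from the codebase, and its names are hypothetical.

```java
final class MannWhitneySketch {
    /** U for sample a: pairs (x in a, y in b) with x > y, ties counted as 0.5. */
    static double uStatistic(double[] a, double[] b) {
        double u = 0.0;
        for (double x : a) {
            for (double y : b) {
                if (x > y) {
                    u += 1.0;
                } else if (x == y) {
                    u += 0.5;
                }
            }
        }
        return u;
    }

    /** U used for a two-sided test: the smaller of the two U values. */
    static double twoSidedU(double[] a, double[] b) {
        double uA = uStatistic(a, b);
        double uB = (double) a.length * b.length - uA; // uA + uB == n * m
        return Math.min(uA, uB);
    }
}
```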
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5539 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-30 17:03:02 +00:00
chartl
fff11a3279
No more pesky NaNs for norms (HINT: ((double) x) == Double.NaN is NOT (somehow) the same as Double.compare(x, Double.NaN) == 0). Effectively reverse sorting by changing (rank/size) to ((size-rank)/size).
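The behavior in the hint follows from IEEE 754 semantics: NaN compares unequal to every value, including itself, so the == test can never succeed, whereas Double.compare treats NaN as equal to NaN. A small sketch (the class name NaNCheck is hypothetical); Double.isNaN is the conventional check.

```java
final class NaNCheck {
    /** Always false: under IEEE 754, NaN == NaN evaluates to false. */
    static boolean equalsOperator(double x) {
        return x == Double.NaN;
    }

    /** True for NaN: Double.compare defines a total order in which NaN equals NaN. */
    static boolean viaCompare(double x) {
        return Double.compare(x, Double.NaN) == 0;
    }

    /** The idiomatic test. */
    static boolean viaIsNaN(double x) {
        return Double.isNaN(x);
    }
}
```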
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5538 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 22:43:24 +00:00
carneiro
5d26c66769
Count Covariates is almost scatter-gatherable now!
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5537 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 22:25:33 +00:00
rpoplin
5ddc0e464a
Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately.
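A sketch of how a binding specification like the one in this message could be split into a name, a type and key-value tags. This is not the GATK parser; the class RodBindingTags and its fields are hypothetical, and the sketch assumes the first two comma-separated fields are the name and type and every later field has the form key=value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RodBindingTags {
    final String name;
    final String type;
    final Map<String, String> tags;

    private RodBindingTags(String name, String type, Map<String, String> tags) {
        this.name = name;
        this.type = type;
        this.tags = tags;
    }

    /** Parses the text after "-B:", e.g. "hapmap,VCF,known=false,prior=12.0". */
    static RodBindingTags parse(String spec) {
        String[] fields = spec.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected name,type[,key=value...]: " + spec);
        }
        Map<String, String> tags = new LinkedHashMap<>();
        for (int i = 2; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value but got: " + fields[i]);
            }
            tags.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
        }
        return new RodBindingTags(fields[0], fields[1], tags);
    }
}
```

A caller would then convert values as needed, for example Double.parseDouble(binding.tags.get("prior")).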
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 21:04:09 +00:00
carneiro
0f4ace0902
fixed a bug when the concordance track doesn't have the sample in the variant track.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5535 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 18:24:19 +00:00
carneiro
c3f70cc5cb
DPP: Updated after some tests with BWA. Still needs more testing.
...
MDP: Removed ApplyVariantCut as it's no longer necessary with VQSR2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5534 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 18:22:09 +00:00
chartl
f6dfdc7f3b
Single-tailed hypothesis testing in MWU
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5533 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 15:53:40 +00:00
kshakir
f443137dda
Fixed RodBind with tag order.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5532 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 14:47:26 +00:00
hanna
8ae14793f2
Small standalone utility to aggregate BGZF block statistics in a BAM file.
...
Works in the same coordinate space as BAM chunks, so this will be used to
calibrate chunk weighting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5531 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 22:25:45 +00:00
chartl
f3e4c24f63
Framework works properly now, but whitening still has a kink which is that the covariance matrix gets re-sorted automatically by the eigendecomposition, so somehow the association between eigenvalue and dimension (e.g. association track) needs to be maintained throughout.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5530 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 22:22:37 +00:00
chartl
4c04c5a47a
Addition of a BedTableCodec to allow for parsing of Bed-formatted tables (e.g. bedGraphs). Fixes for the recalibrator. Implementation of the data whitening input. Some TODOs in the RAW.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5529 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 21:35:09 +00:00
carneiro
ccdc021207
Added BWA (option) to the data processing pipeline. Lots of testing still happening...
...
little fix to the calling pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5528 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 20:17:57 +00:00
corin
f2d84bf746
Changes the validity declaration from a true/false flag to a five-point scale
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5527 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 18:31:53 +00:00
depristo
cdb0bde952
Bringing script up to date
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5526 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 20:49:07 +00:00
depristo
cd8321cdc9
Removed the completely unused generic but extremely expensive infrastructure for dynamic LocusIteratorFilters. Now the one, and probably only useful one, is called directly in the LocusIteratorByState itself to filter adaptor bases from reads. This shaves 10% off the runtime of all walkers, apparently. Has the additional benefit of eliminating a lot of complex infrastructure that resulted ultimately in only a single function call.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5525 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 20:48:24 +00:00
depristo
231d095316
A clean, fast way to compute fragment pileups. Now consumes no CPU time at all. Ready for general use.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5524 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 14:26:29 +00:00
depristo
bae0b6cba8
A script for playing with BEAGLE refinement parameters. Supports construction of reference panels from NGS data sets with varying niteration and calibration curve parameters, as well as imputing missing genotypes in a VCF with this reference panel, and comparison to a deeply sequenced individual.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5523 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 12:44:25 +00:00
depristo
6a1d12cf7b
Intermediate commit refactoring FragmentPileup to (1) make it more accessible (now in utils.pileup) as well as (2) improve performance. Passes all integration tests now. Upcoming refactoring will change further how the system can be accessed, and further improve performance.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5522 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 12:42:22 +00:00
depristo
3bcd4c5d75
--simplifyBAM is now in the SAMFileWriterArgumentTypeDescriptor, as suggested by map. PrintReads has an integrationtest now that writes out a 1 MB bit of HiSeq normally, with compress 0, and with simplifyBAM on.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5521 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 14:57:18 +00:00
hanna
28ae53d796
Merging the best parts of Mark's fix for the O(n^2) algorithm and my concurrently-written fix for the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5520 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 13:32:23 +00:00
depristo
d8fbda17ab
O(N^2) bug found and removed -- very subtle and hard to find. ArrayLists underlying read backed pileups were being initialized with size() from the entire pileup across all samples, not the sample-specific sizes. So in 1000 samples at 4x, we were creating 1000 x 4000 element array lists, instead of 1000 x 4 element array lists. This fix results in a 2-3x speedup for 900 sample calling, and moves UG.map() back into the main CPU cost of UG with many samples.
...
900 samples in a single BAM:
Release: 64.29
With sample-specific size: 24s - 35s
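The arithmetic behind the quadratic cost can be sketched as follows. The class and method names are hypothetical and String stands in for a pileup element; the point is only the difference between sizing each per-sample list from the whole pileup and sizing it from that sample's own read count.

```java
import java.util.ArrayList;
import java.util.List;

final class PileupAllocation {
    /**
     * Total list slots reserved across all samples. Sizing every per-sample
     * list from the whole pileup reserves samples * totalReads slots, which is
     * quadratic in the sample count at fixed coverage; sizing each list from
     * its own sample reserves only totalReads slots.
     */
    static long reservedSlots(int[] readsPerSample, boolean sizeFromWholePileup) {
        long totalReads = 0;
        for (int reads : readsPerSample) {
            totalReads += reads;
        }
        long reserved = 0;
        for (int reads : readsPerSample) {
            reserved += sizeFromWholePileup ? totalReads : reads;
        }
        return reserved;
    }

    /** The corrected allocation: one list per sample, sized for that sample only. */
    static List<List<String>> allocatePerSample(int[] readsPerSample) {
        List<List<String>> lists = new ArrayList<>(readsPerSample.length);
        for (int reads : readsPerSample) {
            lists.add(new ArrayList<String>(reads));
        }
        return lists;
    }
}
```

For 1000 samples with 4 reads each, the whole-pileup sizing reserves 1000 x 4000 slots, against 1000 x 4 with per-sample sizing.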
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5519 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 12:38:19 +00:00
depristo
7272fcf539
Now uses the NO_HEADER option to avoid breaking MD5s due to changes in GATKArgumentCollection
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5518 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 12:00:37 +00:00
depristo
27c8fb1e4d
Added support for a general GATK option --simplifyBAM to automatically drop unneeded reads and simplify the retained reads in an output BAM file. Specifically, duplicate, non-PF, and unmapped reads are removed, and all extended tags in the retained SAM records are removed except the RG:Z tag. This option is very useful when creating temporary BAM files (merged per-population or multi-sample cleaned) for future calling (as in the 1000G processing pipeline). Results in a significant reduction in space of the resulting BAM, faster reading of the BAM, and surprisingly even faster UG performance:
...
1-10mb of chromosome one, from NA12878 HiSeq 64x data set on hg18:
Full BAM
Write time: 8.6 m
Size: 866M
CountReads time: 2.9 m
UG time: 11.3 m
Simplified BAM:
Write time: 6.2 m
Size: 458M
CountReads time: 85.7 s
UG time: 10.1 m
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5517 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 01:21:35 +00:00
kshakir
fc8acd503e
Enabled the parameterize option for debugging PipelineTest MD5s.
...
Fixed escaping expressions that have more than one space between arguments.
Updated example to match the wiki.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5516 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-26 00:41:47 +00:00
chartl
fe7f45ee2e
First pass at recalibrating associations, with optional data whitening. Modification to the TableCodec so it can natively read bedgraph files (just needed to add an extra header marker: "track").
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5515 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 19:35:39 +00:00
hanna
ac39f5532e
Turn off index caching.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5514 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 18:48:23 +00:00
kshakir
8e67c5567c
When host name lookup fails just use the whole internet address instead of truncating to the last two octets of the IP address.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5513 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 18:18:22 +00:00
hanna
8d8aed6a67
Fix correctness issue when dynamically merging many files.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5512 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 16:35:43 +00:00
delangel
c9283e6bc5
Refinement to previous commit: no need to duplicate code to annotate rsID since variantAnnotatorEngine is called from UG anyways.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5511 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 15:00:32 +00:00
delangel
3383733379
Same commit as previous one for VariantAnnotator.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5510 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 12:07:18 +00:00
delangel
8701dfe8d3
Hideous, horrible, hairy mutant bug: when we annotate ID field in indels, we were looking for SNP records matching the position, instead of indel records.
...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5509 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 12:04:08 +00:00
kshakir
3e3ff4a9e7
Bam gathering passes on the compression_level and the create_index flag to MergeSamFiles.
...
VCF gathering passes on the no_header and sites_only flags to CombineVariants.
Fixed deletion of gathered log files. Although they are intermediate and do not need to be re-run if not present, they should not be deleted.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5508 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 03:58:38 +00:00
carneiro
47279ee56e
Added --concordance option that outputs the intersection between two VCF files. Useful to see what calls were made in both technologies/algorithms.
...
Wiki has been updated accordingly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5507 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 21:27:16 +00:00