GenomeLocs can officially have any start/stop values from -Inf to +Inf. Bounds with respect to the reference are optionally enforced by GenomeLocParser. General code cleanup throughout the subsystem.
All validation code for GLs is now centralized, and all I/O systems now validate their inputs. Because of this, the Picard interval processing code now checks whether each interval is valid and keeps only the valid ones. Note that the scatter/gather test was changed, because the original hg18 chr20 interval files were actually malformed (all records, for some reason, were on chr20).
Many interval processing routines were moved to IntervalUtils, as this is their natural home.
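A minimal sketch of the centralized validation idea described above, written in Python for illustration only. The names `Interval`, `is_valid`, and `keep_valid` are hypothetical and are not the GenomeLocParser or IntervalUtils API; the rules shown (known contig, stop not before start, optional reference bounds check) are assumptions about what "valid" means here.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for a GenomeLoc: 1-based, inclusive coordinates."""
    contig: str
    start: int
    stop: int


def is_valid(interval: Interval, contig_lengths: Dict[str, int],
             enforce_bounds: bool = True) -> bool:
    """Single place where interval validity is decided (assumed rules)."""
    # Assumption: an interval on an unknown contig is never valid.
    if interval.contig not in contig_lengths:
        return False
    # Assumption: stop must not precede start.
    if interval.stop < interval.start:
        return False
    # Bounds against the reference are checked only when requested.
    if enforce_bounds:
        length = contig_lengths[interval.contig]
        return interval.start >= 1 and interval.stop <= length
    return True


def keep_valid(intervals: List[Interval], contig_lengths: Dict[str, int],
               enforce_bounds: bool = True) -> List[Interval]:
    """Keep only the valid intervals, preserving input order."""
    return [iv for iv in intervals
            if is_valid(iv, contig_lengths, enforce_bounds)]
```

With a toy reference dictionary, an I/O layer would call `keep_valid` on everything it reads, so malformed records are dropped at load time rather than failing later.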
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5830 348d0f76-0448-11de-a6fe-93d51630548a
Reviewed pipelines with dev team.
HSP updates:
- Calling SNPs and Indels at the same time then using SelectVariants to separate them for filtering
- Moved logs next to the files like in WGP
- Flattened outputs into one directory
- The file names for the final outputs are now <projectName>.vcf and <projectName>.eval
- Updated test to pass the chr20 intervals instead of a boolean
- Removed MultiFCP
WGP updates:
- Only cleaning and calling chromosomes 1-22, X, Y, MT
- Splitting SNPs from indels, filtering indels, then merging the selected SNPs and selected Indels back together to make sure there are no collisions in CombineVariants
- Still running VQSR on the recombined SNPs plus hard filtered indels
- Using hard indel filters from delangel
- Reduced number of tranches with rpoplin
- Changed prior for dbsnp from 10 to 8 with rpoplin
- Assuming identical samples on both CombineVariants
- Explicitly using variant merge option UNION even though it's the default
- Not setting the default genotype merge option PRIORITIZE
- Generating a vcf and eval for each tranche
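The split, filter, and merge flow in the WGP updates above can be sketched roughly as follows. This is an illustrative Python sketch, not the Queue script: `Variant`, `split_by_type`, `hard_filter_indels`, and `merge_union` are hypothetical names, and the single `min_qual` threshold is only a placeholder for the real hard indel filters, which are not given in this log.

```python
from typing import Dict, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record."""
    contig: str
    pos: int
    ref: str
    alt: str
    qual: float


def is_snp(v: Variant) -> bool:
    # Single-base ref and alt is treated as a SNP; anything else as an indel.
    return len(v.ref) == 1 and len(v.alt) == 1


def split_by_type(variants: List[Variant]) -> Tuple[List[Variant], List[Variant]]:
    """Separate SNPs from indels so each can be filtered on its own."""
    snps = [v for v in variants if is_snp(v)]
    indels = [v for v in variants if not is_snp(v)]
    return snps, indels


def hard_filter_indels(indels: List[Variant], min_qual: float) -> List[Variant]:
    # Placeholder filter: the actual hard filter expressions are not in this log.
    return [v for v in indels if v.qual >= min_qual]


def merge_union(snps: List[Variant], indels: List[Variant]) -> List[Variant]:
    """Union of both sets; a record seen twice is kept once (no collisions)."""
    merged: Dict[Tuple[str, int, str, str], Variant] = {}
    for v in snps + indels:
        merged.setdefault((v.contig, v.pos, v.ref, v.alt), v)
    return sorted(merged.values(), key=lambda v: (v.contig, v.pos, v.ref, v.alt))
```

Keying the merge on contig, position, ref, and alt is one way to guarantee that the recombined set has no duplicate records before it goes on to recalibration.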
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5825 348d0f76-0448-11de-a6fe-93d51630548a
Removed job priority, as temp space isn't as tight at the moment and the priority interface is going to change.
Updated chunk calling with ebanks:
- Using "the bundle" of resources.
- Using dbsnp 132 and 1000G indel RODs for both RTC & IR.
- Using the default maxIntervalSize in RTC.
- Removed use of UG.exactCalculation argument.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5814 348d0f76-0448-11de-a6fe-93d51630548a
Using hapmap training and truth based on wiki.
Explicitly setting the ts_filter_level even though 99.0 is the default.
Recal file path now ends with .recal.
Added ar's vcf input.
Omni rod name now omni instead of 1kg.
The VR RodBind tags had spaces in them.
Was passing both the full intervals and the chunk intervals to chunk jobs.
Switched back to chr20 for default since the VR crashes on small interval sets with "MESSAGE: Matrix is singular."
Log file names are now based on the file paths + .out.
Added eval stratifications by sample based on the Hybrid Selection / Whole Exome pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5800 348d0f76-0448-11de-a6fe-93d51630548a
Hardcoded the reference and dbsnp since the training rods are also hardcoded, for now.
Changed freeze/chr20 to wg/chr20/cent1 to also test the heaviest known shard.
Other cleanup.
TODO: Memory command line options or have the script figure it out using FLS or similar.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5799 348d0f76-0448-11de-a6fe-93d51630548a
Minor updates to the FCPTest to match the changes due to using the old indel caller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5766 348d0f76-0448-11de-a6fe-93d51630548a
Also added the old model of indel calling to the FCP.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5749 348d0f76-0448-11de-a6fe-93d51630548a
Feeding FCP UG the bam list instead of individual bams to cut scatter gather time from O(m^100) as measured by Chris to O(m^1).
Fixed NPE when eval values aren't found in PipelineTests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5694 348d0f76-0448-11de-a6fe-93d51630548a
Scattering non-contig interval lists by number of loci in the intervals instead of just number of intervals.
Queue caches the list of locs and how to split them up instead of reloading them from disk repeatedly.
TODO: general purpose function to divide data evenly.
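A rough sketch of scattering by number of loci rather than number of intervals, in Python for illustration. The function name `scatter_by_loci` and the tuple layout `(contig, start, stop)` are assumptions, and this greedy cut is only one way to balance the parts; it is not the Queue implementation.

```python
from typing import List, Tuple

# Hypothetical interval layout: (contig, start, stop), 1-based inclusive.
Interval = Tuple[str, int, int]


def scatter_by_loci(intervals: List[Interval], parts: int) -> List[List[Interval]]:
    """Split an ordered interval list into up to `parts` groups of similar loci count.

    Intervals are never split and their order is preserved. Fewer than
    `parts` groups may be returned when the loci are very unevenly spread.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    sizes = [stop - start + 1 for _, start, stop in intervals]
    total = sum(sizes)
    groups: List[List[Interval]] = [[]]
    seen = 0
    for i, (interval, size) in enumerate(zip(intervals, sizes)):
        groups[-1].append(interval)
        seen += size
        has_more = i + 1 < len(intervals)
        # Cut once the running loci count reaches this group's share of the total.
        if len(groups) < parts and has_more and seen * parts >= total * len(groups):
            groups.append([])
    return groups
```

For five intervals of 100, 10, 10, 10, and 70 loci split two ways, this yields one group holding the first interval and one holding the other four (100 loci each), where a split by interval count would have been lopsided.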
Skip over comments when parsing picard analysis files.
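For the comment skipping, a minimal Python sketch, assuming comment lines in the analysis files start with `#` and that blank lines can also be ignored. The helper name `read_data_lines` is hypothetical and is not taken from the PipelineTest code.

```python
from typing import Iterable, Iterator


def read_data_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only data lines, dropping blank lines and '#' comment lines."""
    for line in lines:
        text = line.rstrip("\r\n")
        # Assumption: a comment is any line whose first non-space character is '#'.
        if not text.strip() or text.lstrip().startswith("#"):
            continue
        yield text
```

The first line yielded would then be the column header row, followed by the metric values.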
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5687 348d0f76-0448-11de-a6fe-93d51630548a