We can now emit bzipped VCFs from the GATK.
Details: any walker that defines a VCFWriter for its @Output (i.e. pretty much every core walker from UG and on), also has associated with it the -bzip (--bzip_compression) boolean argument. When set, it will emit a VCF that is compressed with bzip2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4093 348d0f76-0448-11de-a6fe-93d51630548a
is an output stream, we'll automatically create it and point it to stdout. Otherwise,
we'll leave it empty.
I think about it like this: marking a field 'required' indicates to the GATK that the
walker author requires a value for this field, and if the GATK can provide one without
end user intervention, it will. Maybe this is hackish. We'll try it and see.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4091 348d0f76-0448-11de-a6fe-93d51630548a
most appropriate based on feature density, and the longest feature in the file. Also:
- Converted Tribble to TestNG; it has better features and is about 6x faster.
- As much code clean-up as I could get done. More to do, especially in the example code.
- Moved asserts in the code to throw exceptions.
- Added getBinSize to the index interface; both indexes already implemented this.
- Removed the abstract parts of the indexCreator interface; this is now more simple.
- Added an IndexType enumeration; might be overkill but it is at least a single point of entry for index information.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4082 348d0f76-0448-11de-a6fe-93d51630548a
-o/--out instead of -varout. Let's watch as our intrepid support engineer
gracefully responds to all the incoming questions of the form: "the GATK told
me to use -o instead of -varout. What do I do?"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4078 348d0f76-0448-11de-a6fe-93d51630548a
Removed extra quotes from 'cd' pre-exec command.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4067 348d0f76-0448-11de-a6fe-93d51630548a
Thanks to Matt for the advice.
Also, for Guillermo: while I was at it, I changed the .stats debug output to emit the original interval instead of the cleaned region.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4058 348d0f76-0448-11de-a6fe-93d51630548a
While I was fixing it, I noticed that we weren't allowing the cleaner to un-clean reads with indels when they're wrong even though we should. Hypothetically, that should rarely happen: only when we can left-align out an indel or when the original mapper really went haywire. This situation is rare enough that I'm calling logger.info to let the user know it's happening and suggesting that they double-check that everything looks right with their reads. Better to be extra-cautious now that the cleaner is moving into the 1kg and Broad production pipelines soon
.
Mark, have no fear: this was truly a rare edge case - one that won't affect the cleaning stats. There is no need to re-clean the data processing paper bams!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4057 348d0f76-0448-11de-a6fe-93d51630548a
removed the auto-deletion of the reflections jar, and removed the very old OmniPlan document we had checked-in.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4056 348d0f76-0448-11de-a6fe-93d51630548a
Some refactoring of Queue extensions for reusability in scripts.
Putting the extensions into the Queue.jar after building them.
More updates to GATK walker arguments specifying @Input and @Output for Queue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4032 348d0f76-0448-11de-a6fe-93d51630548a
- GATKVCFWriter deleted, to be replaced if absolutely necessary when VCF writing goes into Tribble.
- VCFWriter is now an interface, for easier redirection.
- VCFWriterImpl fleshes out the VCFWriter interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4026 348d0f76-0448-11de-a6fe-93d51630548a
-l DEBUG with local programs now prints out the stdout/stderr of the programs as they are run.
More documentation in the examples with a new even simpler CountReads example.
Took out unused option to build Queue GATK extensions separately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4025 348d0f76-0448-11de-a6fe-93d51630548a
Added ability to skip up-to-date jobs where the outputs are older than the inputs.
Changed -T CountDuplicates --quiet to --quietLocus so that Queue GATK extensions can use both short and full argument names.
Short names can be used to set values on Queue GATK extensions, for example: vf.XL :+= myFile
Moved Hidden from the GATK to StingUtils.
Updated ivy from 2.0.0 to 2.2.0-rc1 to fix sha1 issue: http://bit.ly/aX72w7
Added Queue to javadoc and testing build targets.
Added first Queue unit test.
Another pass at avoiding cycles in the DAG thanks to all function I/O being files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4017 348d0f76-0448-11de-a6fe-93d51630548a
2) Make sure variants are biallelic before asking for isTransversion()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4016 348d0f76-0448-11de-a6fe-93d51630548a
- Eliminate reduncancy of filter application.
- Track filter metrics per-shard to facitate per merging.
- Flatten counting iterator hierarchy for easier debugging.
- Rename Reads class to ReadProperties and track it outside of the Sting iterators.
Note: because shards are currently tied so closely to reads and not the merged triplet of <reads,ref,RODs>, the metrics
classes are managed by the SAMDataSource when they should be managed by something more general. For now, we're hacking
the reads data source to manage the metrics; in the future, something more general should manage the metrics classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4015 348d0f76-0448-11de-a6fe-93d51630548a
Major cleanup of the genotype writer code from the calling end. UG no longer supports making calls in anything but VCF, and that allows us to use the VCFWriter more generically now. Putting the ball in Matt's court to finish collapsing everything.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3996 348d0f76-0448-11de-a6fe-93d51630548a
Changes to StingUtils command line for use in Queue, replacing Queue's use of property files.
Updates to walkers used in existing QScripts to add @Input/@Output.
RMD used in @Required/@Allows now has a new default equal to "any" type.
New QueueGATKExtensions.jar generator for auto wrapping walkers as Queue CommandLineFunctions.
Added hooks to modify the functions that perform the Scattering and Gathering (setting their jar files, other arguments, etc.)
Removed dependency on BroadCore by porting LSF job submitter to scala.
Ivy now pulls down module dependencies from maven.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3984 348d0f76-0448-11de-a6fe-93d51630548a
When subsetting by sample, the -SN argument is treated as a literal sample name and, if no match is found, as a regular expression. This allows for a large number of samples to be selected at once (useful when, for instance, cases are given one sample name prefix and controls are given another).
After the subsetting procedure, the INFO-field annotations AC, AN, AF, and DP are all recalculated to properly reflect the new contents of the VCF.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3968 348d0f76-0448-11de-a6fe-93d51630548a
2) Small fixes in the VCFWriter:
a) Trailing missing values weren't being removed if their count was > 1 (e.g. ".,.")
b) We were handling key values that were Lists, but not Arrays. We now handle both.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3956 348d0f76-0448-11de-a6fe-93d51630548a
- Tribble is included directly in the GATK repo; those who have access to commit to Tribble can now directly commit from the GATK directory from Intellij; command line users can commit from
inside the tribble directory.
- Hapmap ROD now in Tribble; all mentions have been switched over.
- VariantContext does not know about GenomeLoc; use VariantContextUtils.getLocation(VariantContext vc) to get a genome loc.
- VariantContext.getSNPSubstitutionType is now in VariantContextUtils.
- This does not include the checked-in project files for Intellij; still running into issues with changes to the iml files being marked as changes by SVN
I'll send out an email to GSAMembers with some more details.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3954 348d0f76-0448-11de-a6fe-93d51630548a
disable on-the-fly dict and fai generation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3938 348d0f76-0448-11de-a6fe-93d51630548a
2) New annotation SByDepth = log10(-StrandBias/Depth) (non-standard annotation, key name = "SBD"). If StrandBias/Depth happens to be positive (very rare but can happen), annotation gets value=-1000.
3) Abstracted out new class AnnotationByDepth so that QD and SBD can share code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3930 348d0f76-0448-11de-a6fe-93d51630548a
Generalized some of the packaging code from VariantAnnotator. Matt might want to take a look to make this nicer...?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3925 348d0f76-0448-11de-a6fe-93d51630548a
locale. Method to force JVM into proper locale exists in CommandLineProgram
and is disabled by default, but implementers of CommandLineProgram can opt in
to the forced US locale by calling a static method.
Question for the VCF developers: I removed the code to explicitly output doubles
in US locale. Do you / how do you want to handle this in applications that use
Tribble outside the GATK?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3917 348d0f76-0448-11de-a6fe-93d51630548a
BTW, Eric, thanks for forwarding the DepthOfCoverage thread to gsamembers. I'd forgotten about reduce by interval. Mighty helpful in this case!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3906 348d0f76-0448-11de-a6fe-93d51630548a
object or by a single object and all the references it contains. Requires a command-line change to add a Java agent to
the command-line; see the Sizeof.java javadoc for details.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3889 348d0f76-0448-11de-a6fe-93d51630548a
this reduces to the old method in typical cases, but it builds haplotypes correctly if there are two variants close by within a context window.
Annotation is temporarily named MyHaplotypeScore so it can be run in parallel with old one, soon it will be renamed after some more testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3885 348d0f76-0448-11de-a6fe-93d51630548a
features in that contig. Thanks to Mark for finding this. I'm not rev'ing the index version (which would cause all indexes to be rebuilt), since this seems like a pretty rare edge case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3865 348d0f76-0448-11de-a6fe-93d51630548a
b) Cosmetic change to Beagle annotation description.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3861 348d0f76-0448-11de-a6fe-93d51630548a
2. Added some useful logging messages.
3. Added a oneoffs walker to calculate the number of realigned reads and intervals containing them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3860 348d0f76-0448-11de-a6fe-93d51630548a
- makes writing to disk optional for indexes using the indexCreator classes (allow the user to specify the index file, if null don't write it)
- removed some system.out debugging code
- fixed version checking in interval tree
- made indexes store and return a LinkedHashSet for sequence names (to ensure they've preserved the ordering in the file)
- index creators now read the file before creating the index
- changed the Index.write() method to take a LEDataStream instead of a file
- removed the sequence dictionary code on the header
- added utils for getting LEDataStreams
- added a base Tribble exception
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3857 348d0f76-0448-11de-a6fe-93d51630548a
b) Bug fixes and update to how we represent indels and other complex events in a VariantContext object. Convention is now that all events are left aligned, with the first variant context location marking the common base before an event occurs. However, alleles in a VC don't have the common base in all VC's. Two new functions are now part of VariantContextUtils: CreateVariantContextWithPaddedAlleles and CreateVariantContextWithTrimmedAlleles. Both take a VC as an input and create a VC as an output.
Main flow is that a VCF reader would create a VC with trimmed alleles, all walkers would ideally work with these trimmed alleles, and then the VCF writer would pad back the alleles before writing. However, there are special cases where we need to pad alleles like for example when merging/combining VC's.
Pending issues:
- PED and DBSNP RODs have to be updated to create VC's for indels following the convention above. Changes will go in after Tribble location is moved and things are tested.
- Need to verify Indel genotyper and other modules that create VC's with indels.- Wiki page describing convention above and how walkers should interpret indel VC's still needs updating/detailing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3850 348d0f76-0448-11de-a6fe-93d51630548a
2) Keep track of whether vcf records are unfiltered vs. pass filters in the variant context so we can regenerate the records on output.
3) No more "ID" hard-coded all over the code to set the VariantContext ID. Use a static variable instead.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3840 348d0f76-0448-11de-a6fe-93d51630548a
2. Moved Jared's VCFTool code into archive so that everything would compile.
3. Added the vcf reference base (needed for indels) as an attribute to the VariantContext from the reader.
4. TribbleRMDTrackBuilderUnitTest was complaining that a validation file didn'r exist, so I commented it out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3835 348d0f76-0448-11de-a6fe-93d51630548a
- changed to a better method for getting headers from Codecs
- some removal of old commented out code in the GATKAgrumentCollection
- changes for the rename of FeatureReader to FeatureSource
- removed the old Beagle ROD
- cleaned up some of the code in SampleUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3826 348d0f76-0448-11de-a6fe-93d51630548a
ObjectName. In the Queue-enabled future, we might want to come up with GUIDs
(or at least semi-unique IDs) so that we could use JMX to track runtime
attributes for multiple jobs running simultaneously.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3825 348d0f76-0448-11de-a6fe-93d51630548a
some workflows using AlignmentContext were working with it in a way I didn't
expect and wound up treating extended pileups as base pileups. I'll work to
make sure the AlignmentContext interface is crystal clear.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3815 348d0f76-0448-11de-a6fe-93d51630548a
a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read header and parse automatically the version.
b) Old VCF codec is deprecated. Reader goes now direct from parsing VCF lines into producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will be eventually deleted.
c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format.
d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved.
e) Several VCF 4.0 writer bugs are now solved.
f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data mostly uses now VCF4.0.
g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension.
Pending issues:
a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes.
b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a
that the semantics for which reads are in an extended event pileup are not
clear at this point. Eric and I have planned a future clarification for this
and the two of us will discuss who will implement this clarification and when
it'll happen.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3809 348d0f76-0448-11de-a6fe-93d51630548a
- Remove some old debugging cruft regarding handling of threaded engine exceptions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3796 348d0f76-0448-11de-a6fe-93d51630548a
FYI: creating an exclusive lock on a file that does not exist will create that file as an empty file, and will NOT delete that file after the program terminates. So watch out if it's possible that the file you're locking does not exist - could end up leaving extra files that confuse users.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3795 348d0f76-0448-11de-a6fe-93d51630548a
the walker itself, but didn't do a great job of catching framework exceptions. This became extremely
unfortunate in the case where walkers caused exceptions that manifested themselves in the framework,
such as when the walker opens more files than file handles are available.
Reworked the exception handling so that framework errors are treated like walker errors and the resulting
exception bubbles out of the walker. Stack traces for threaded walkers are still convoluted and nasty.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3794 348d0f76-0448-11de-a6fe-93d51630548a