1. Fix: VCs were padded before the merge, but they were never unpadded afterwards. This leaves us with a VC that doesn't meet our spec.
2. Update: instead of running the merged VC through every standard annotation (which seems really wrong, since this isn't the annotator tool), just update the chromosome count annotations (AC,AF,AN) through VCUtils.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4734 348d0f76-0448-11de-a6fe-93d51630548a
Added an initial test for genotyping chr20 on ten 1000G bams.
Since tribble needs logging support too, for now setting the logging level and appending the console logger to the root logger, not just to "org.broadinstitute.sting".
Updated IntervalUtilsUnitTest to output to a temp directory and not the SVN controlled testdata directory.
Added refseq tables and dbsnps to validation data in BaseTest.
Now waiting up to two minutes for gather parts to propagate over NFS before attempting to merge the files.
Setting scatter/gather directories relative to the -run directory instead of the current directory that queue is running.
Fixed a bug where escaping test expressions didn't handle delimiters at the beginning or end of the String.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4717 348d0f76-0448-11de-a6fe-93d51630548a
When a QGraph is empty displaying a warning instead of crashing with an JGraph internal assertion error.
Cleaned up code using the Log4J root logger and explicitly talking to a logger for Sting.
When integration tests are run detecting that the logger has already been setup so that messages aren't logged twice.
Updated from Ivy 2.2.0-rc1 to 2.2.0.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4707 348d0f76-0448-11de-a6fe-93d51630548a
of a sequence dictionary and related info. This will hopefully eliminate the cases in
which the refseq track depends a sequence dictionary / contig parser that hasn't been
specified.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4700 348d0f76-0448-11de-a6fe-93d51630548a
Added a @Hidden experimental argument -validate to VariantEval that allows external JEXL assertions that must evaluate to true will throw an exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4692 348d0f76-0448-11de-a6fe-93d51630548a
dependent on the contents of the integrationtest directory. Will figure
out how to better manage the integrationtest directory at some point in
the future.
- Up the max heap size for tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4691 348d0f76-0448-11de-a6fe-93d51630548a
- Add StingTextReporter, which provides a text dump of the errors to the
console. Had to create our own reporter (inheriting from the standard
TestNG TextReporter) to work around a configuration issue with the
TextReporter. In an ideal world, I'd report this on the TestNG mailing
list and help them resolve the issue, but this solution is relatively
robust at the moment and life is too short.
- Added back the failed test listener, which generates the testng-failed.xml
file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4686 348d0f76-0448-11de-a6fe-93d51630548a
- Changed RMDTrackBuilder to use SequenceDictionaryUtils.validateDictionaries for ref <-> ROD sequence dictionary validation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4683 348d0f76-0448-11de-a6fe-93d51630548a
Removed obsolete usages of PackageUtils with updated PluginManager.
Ported Queue interval utilities written in scala over to Sting's java IntervalUtils.
Added a very basic intergration test to ensure that the fullCallingPipeline.q compiles.
Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test).
While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1".
Upgraded to scala 2.8.1 and updated calls to deprecated functions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a
for anything that needs to be simultaneously aware of multiple references, eg
Queue's interval sharding code, liftover support, distributed GATK etc.
GenomeLocParser instances must now be used to create/parse GenomeLocs.
GenomeLocParser instances are available in walkers by calling either
-getToolkit().getGenomeLocParser()
or
-refContext.getGenomeLocParser()
This is an intermediate change; GenomeLocParser will eventually be merged
with the reference, but we're not clear exactly how to do that yet. This
will become clearer when contig aliasing is implemented.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a
Bug fix to LiftoverVariants - no barfing at reference sites.
AlleleFrequencyComparison - local changes added to make sure parsing works properly
Added HammingDistance annotation. Mostly useless. But only mostly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4622 348d0f76-0448-11de-a6fe-93d51630548a
Initial test to see how Bamboo will respond. More detailed email to follow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4609 348d0f76-0448-11de-a6fe-93d51630548a
Unrelated change: in the sorted-target mode (when we read sorted target intervals one by on from a file), one can now specify multiple semicolon-separated interval files (all must be sorted). Not hugely useful probably, but makes --targetIntervals always process its values in exactly the same way, so we are consistent (it has been already taking ;-separated args in unsorted mode)
NwayIntervalMergingIterator: reads in multiple sorted GenomeLoc input streams (iterators) and presents them as a single sorted and merged stream
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4602 348d0f76-0448-11de-a6fe-93d51630548a
of traversal to avoid holding a reference to the microscheduler, which holds a reference to
the engine, which in turn holds a reference to the walker, which itself holds a reference to
all the data aggregated during the course of the traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4594 348d0f76-0448-11de-a6fe-93d51630548a
parsing engine. Hugely lowers our memory footprint in integrationtests, but not yet enough to
run Mark's new parallelized VariantEvalIntegrationTests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4585 348d0f76-0448-11de-a6fe-93d51630548a
Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC.
**IMPORTANT** I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC of being one class, but recalculating it casts to another class, and I'm not sure what to do.
I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a
The core walker has been modified so that when variant contexts (eval and comp) are subset to command-line-specified sample(s), the chromosome count annotations (AC/AN/AF) are altered to reflect the AC/AN/AF of only those samples involved in the comparison. No more getting AC500 when you're comparing a 10-sample overlap. Interestingly enough, this didn't break any integration tests.
GenotypeConcordance now has two additional tables: Allele Count Statistics, and Allele Count Summary Statistics. These work exactly identically to the Sample Statistics and Sample Summary Statistics tables, except that the partition being used is no longer the sample, but instead the allele count of the variant sites. These tables stratify by both eval and comp ACs, e.g.
evalAC0
evalAC1
evalAC2
compAC0
compAC1
compAC2
Differences with previous integration tests were verified to only be in the Allele Count tables (by grepping them out of the diff); a new test has been added for the simple case of an AC=1 site in the eval becoming an AC=2 site in the comp.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4491 348d0f76-0448-11de-a6fe-93d51630548a
by the fact that the GATKSAMRecord, by design, needs to both inherit from
SAMRecord and wrap a 'member' SAMRecord, and method calls that aren't
implemented as explicit passthroughs can compromise the content of the
SAMRecord in subtle ways.
Will be automatically fixed when Picard moves to a lightweight SAMRecord
interface rather than the current heavyweight implementation. But in
the short-term, there's no obvious fix.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4489 348d0f76-0448-11de-a6fe-93d51630548a
- ProduceBeagleInputWalker
+ Now takes a validation ROD and a prior to give it, will use those genotypes in place of the variant genotypes if both are present
+ Takes a bootstrap argument -- can use some given %age of the validation sites
+ Optionally takes a bootstrap output argument -- re-prints the validation VCF, filtering those sites used as part of the bootstrap
-BeagleOutputToVCFWalker
+ Now filters sites where the genotypes have been reverted to hom ref
+ Now calls in to the new VCUtils to calculate AC/AN
-Queue
+ New pipeline libraries for easy qscript creation, still a work in progress, but this is a considerable prototype
+ full calling pipeline v2 uses the above libraries
+ minor changes to some of my own scripts
+ no more need for contig interval lists, these will be parsed out of your normal interval list when it is provided
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4459 348d0f76-0448-11de-a6fe-93d51630548a
Queue now submits new LSF jobs only after previous functions have completed successfully.
When the Queue process is shutdown (ex: via Control-C) sends a bkill command for any running jobs.
Ported commands like creating directories and scatter/gather interval list to scala functions.
Updates to LSF status tracking by porting the python to internally generated bash scripts.
Temporarily disabled job name submission to LSF. Plus side is that the full command is now available in "bjobs -w". TODO: Put back jobName passing to LSF based on an option?
Changed BaseTest to allow scala to access paths to references.
Changed the extension generator to default the analysis name to the walker "name".
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4442 348d0f76-0448-11de-a6fe-93d51630548a
argument. Brought it back, and added an integrationtest to make sure it
stays around.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4390 348d0f76-0448-11de-a6fe-93d51630548a
-- getToolkit().subContextFromSampleProperty(): filters a VariantContext to genotypes that come from samples that have a given property value
-- getToolkit().getSamplesWithProperty(): gets all samples with a given property
-- getToolkit().getSamplesFromVariantContext(): sample objects that are referenced by name in a VariantContext
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4361 348d0f76-0448-11de-a6fe-93d51630548a
This will allow other programs like Queue to reuse the functionality.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a
pieces of core that depend on playground. Most of these have been eliminated by
(temporarily) promoting Aaron's report system to core in this checkin. I'll
follow up with other changes in separately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a
The GAE half has all the walker specific code. The new "Abstract" GAE has the rest of the logic.
More refactoring to come, with the end goal of having a tool that other java analysis programs (Queue, etc.) can use to read in genomic data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4339 348d0f76-0448-11de-a6fe-93d51630548a
mOVsxGfDiiSMxVs2PPTVjzYTVbizlD6e
f9kUHUADFsZ0LiTGxRL5zPmq9kZcA4cQ
8eGHWJFAlBVmgxwPi3sMd1RmiN2PwHOf
iLhvHWveypKb2F8vKS5irHylc3pYvlOb
HDttXKUMEVoPrvVeWrH7E0htxYyNydMx
plus a bit of cleanup of custom exceptions in the sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4330 348d0f76-0448-11de-a6fe-93d51630548a
Added a pipeline java bean and YAML utility to serialize java beans.
Added a getFirehosePipelineYaml.sh that can pull firehose data into the pipeline yaml file format.
Updated the fullCallingPipeline.q to begin using the pipeline yaml file format for bams and reference.
More changes to come as this code gets tested out in the fullCallingPipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4329 348d0f76-0448-11de-a6fe-93d51630548a
Changed integration tests, adding the -NO_HEADER argument, for walkers that previously did not include the command-line
arg headers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4326 348d0f76-0448-11de-a6fe-93d51630548a
Ryan is using this to modify VCF code today...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4303 348d0f76-0448-11de-a6fe-93d51630548a
The exception: "org.broadinstitute.sting.utils.exceptions.UserException$CommandLineException: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a dbSNP ROD or a VCF file containing known sites of genetic variation."
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4293 348d0f76-0448-11de-a6fe-93d51630548a
*** Three integration tests had to change: ***
RecalibarationWalkersIntegrationTest:
One of the tests was using the interval as the snp track, and wasn't supplying a DbSNP track (for CountCovariates)
SequenomValidationConverterIntegrationTest:
relies on Plink ROD which we've removed.
PileupWalkerIntegrationTest:
we no longer have implicit interval tracks, so there isn't a rod name over the specified region. Otherwise the same result.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4292 348d0f76-0448-11de-a6fe-93d51630548a
This commit adds two important classes: Sample, which contains data about one sample; and SampleDataSource, which manages sample data a la ReferenceDataSource and ReadsDataSource.
This code should be stable, but it has not been integrated with existing walkers yet. That's the next commit.
In the meantime, feel free to experiment with the code - there are two basic example walkers in the playground.sample package. And PLEASE let me know if you see any errors/inconsistencies.
Note that this also adds a new dependency on SnakeYaml, a YAML parser.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4285 348d0f76-0448-11de-a6fe-93d51630548a
in the reference. Reset routine is not accessible to any class outside
GenomeLocParser's package.
We'll have to do something more intelligent with this when the GATK goes
distributed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4275 348d0f76-0448-11de-a6fe-93d51630548a
- Update the test to point to our repository.
- Update the md5 to reflect new Picard tag ordering.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4247 348d0f76-0448-11de-a6fe-93d51630548a
1. Rip out all of Ben's code intended to circumvent the stable VCF Writer output system in multi-threaded mode (I threw up a little when
I saw this code). This will improve memory consumption when running with -nt.
2. Don't annotate indels or > bi-allelic sites.
3. Fix bug where not all records were making it into the output VCF.
4. General code clean up.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4118 348d0f76-0448-11de-a6fe-93d51630548a
(-B:name,type file) as well as the old syntax. Also, a bonus feature: BAMs can now be tagged at the
command-line, which should allow us to get rid of some of the hackier calls in GenomeAnalysisEngine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4105 348d0f76-0448-11de-a6fe-93d51630548a
most appropriate based on feature density, and the longest feature in the file. Also:
- Converted Tribble to TestNG; it has better features and is about 6x faster.
- As much code clean-up as I could get done. More to do, especially in the example code.
- Moved asserts in the code to throw exceptions.
- Added getBinSize to the index interface; both indexes already implemented this.
- Removed the abstract parts of the indexCreator interface; this is now more simple.
- Added an IndexType enumeration; might be overkill but it is at least a single point of entry for index information.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4082 348d0f76-0448-11de-a6fe-93d51630548a
-o/--out instead of -varout. Let's watch as our intrepid support engineer
gracefully responds to all the incoming questions of the form: "the GATK told
me to use -o instead of -varout. What do I do?"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4078 348d0f76-0448-11de-a6fe-93d51630548a
Thanks to Matt for the advice.
Also, for Guillermo: while I was at it, I changed the .stats debug output to emit the original interval instead of the cleaned region.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4058 348d0f76-0448-11de-a6fe-93d51630548a
- GATKVCFWriter deleted, to be replaced if absolutely necessary when VCF writing goes into Tribble.
- VCFWriter is now an interface, for easier redirection.
- VCFWriterImpl fleshes out the VCFWriter interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4026 348d0f76-0448-11de-a6fe-93d51630548a
Added ability to skip up-to-date jobs where the outputs are older than the inputs.
Changed -T CountDuplicates --quiet to --quietLocus so that Queue GATK extensions can use both short and full argument names.
Short names can be used to set values on Queue GATK extensions, for example: vf.XL :+= myFile
Moved Hidden from the GATK to StingUtils.
Updated ivy from 2.0.0 to 2.2.0-rc1 to fix sha1 issue: http://bit.ly/aX72w7
Added Queue to javadoc and testing build targets.
Added first Queue unit test.
Another pass at avoiding cycles in the DAG thanks to all function I/O being files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4017 348d0f76-0448-11de-a6fe-93d51630548a
- Eliminate reduncancy of filter application.
- Track filter metrics per-shard to facitate per merging.
- Flatten counting iterator hierarchy for easier debugging.
- Rename Reads class to ReadProperties and track it outside of the Sting iterators.
Note: because shards are currently tied so closely to reads and not the merged triplet of <reads,ref,RODs>, the metrics
classes are managed by the SAMDataSource when they should be managed by something more general. For now, we're hacking
the reads data source to manage the metrics; in the future, something more general should manage the metrics classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4015 348d0f76-0448-11de-a6fe-93d51630548a
Major cleanup of the genotype writer code from the calling end. UG no longer supports making calls in anything but VCF, and that allows us to use the VCFWriter more generically now. Putting the ball in Matt's court to finish collapsing everything.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3996 348d0f76-0448-11de-a6fe-93d51630548a
2) Small fixes in the VCFWriter:
a) Trailing missing values weren't being removed if their count was > 1 (e.g. ".,.")
b) We were handling key values that were Lists, but not Arrays. We now handle both.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3956 348d0f76-0448-11de-a6fe-93d51630548a
- Tribble is included directly in the GATK repo; those who have access to commit to Tribble can now directly commit from the GATK directory from Intellij; command line users can commit from
inside the tribble directory.
- Hapmap ROD now in Tribble; all mentions have been switched over.
- VariantContext does not know about GenomeLoc; use VariantContextUtils.getLocation(VariantContext vc) to get a genome loc.
- VariantContext.getSNPSubstitutionType is now in VariantContextUtils.
- This does not include the checked-in project files for Intellij; still running into issues with changes to the iml files being marked as changes by SVN
I'll send out an email to GSAMembers with some more details.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3954 348d0f76-0448-11de-a6fe-93d51630548a
Generalized some of the packaging code from VariantAnnotator. Matt might want to take a look to make this nicer...?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3925 348d0f76-0448-11de-a6fe-93d51630548a
b) Cosmetic change to Beagle annotation description.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3861 348d0f76-0448-11de-a6fe-93d51630548a
- makes writing to disk optional for indexes using the indexCreator classes (allow the user to specify the index file, if null don't write it)
- removed some system.out debugging code
- fixed version checking in interval tree
- made indexes store and return a LinkedHashSet for sequence names (to ensure they've preserved the ordering in the file)
- index creators now read the file before creating the index
- changed the Index.write() method to take a LEDataStream instead of a file
- removed the sequence dictionary code on the header
- added utils for getting LEDataStreams
- added a base Tribble exception
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3857 348d0f76-0448-11de-a6fe-93d51630548a