226 Commits (7fbca7013e7b9da220396d9bcaec209efdf67cb7)

Author SHA1 Message Date
Mark DePristo 982192e2e4 MD5DB for integrationtest management now writes out an md5mismatches file for clean analysis
-- This file is in integrationtests/md5mismatches.txt, and looks like:

expected        observed        test
7fd0d0c2d1af3b16378339c181e40611        2339d841d3c3c7233ebba9a6ace895fd        test BeagleOutputToVCF
43865f3f0d975ee2c5912b31393842f8        1b9c4734274edd3142a05033e520beac        testBeagleChangesSitesToRef
daead9bfab1a5df72c5e3a239366118e        27be14f9fc951c4e714b4540b045c2df        testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e

-- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods
2012-06-14 16:42:27 -04:00
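Illustrative sketch (not part of the commit): the md5mismatches.txt layout shown above could be read back with something like the following. It assumes the three columns are separated by whitespace and that the test name is everything after the second column. The names `Mismatch` and `parse_md5_mismatches` are hypothetical and do not come from the GATK code base.

```python
from typing import List, NamedTuple


class Mismatch(NamedTuple):
    expected: str
    observed: str
    test: str


def parse_md5_mismatches(text: str) -> List[Mismatch]:
    """Parse expected/observed/test rows, skipping the header and blank lines."""
    rows = []
    for line in text.splitlines():
        # Split on the first two whitespace runs only: the test column may
        # itself contain spaces or commas (assumption about the file layout).
        fields = line.split(None, 2)
        if len(fields) != 3 or fields[0] == "expected":
            continue
        rows.append(Mismatch(fields[0], fields[1], fields[2].strip()))
    return rows


# One data row copied from the commit message above.
sample = (
    "expected        observed        test\n"
    "43865f3f0d975ee2c5912b31393842f8        "
    "1b9c4734274edd3142a05033e520beac        testBeagleChangesSitesToRef\n"
)
mismatches = parse_md5_mismatches(sample)
```

Each parsed row can then be grouped or printed per test to see which integration tests changed.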
Mark DePristo 96dbd8df63 Fix a nasty script bug in Queue
-- If you were using user-defined configurations (configureJobFeatures), didn't override the analysisName of your jobs, and other jobs used the same name, then you got very strange errors at the end of your script.  For example, in my script I was using SelectVariants to prepare VCF files, and SelectVariants to generate a useful performance table.  Since I forgot to set a special analysisName for my table commands, the generic SV commands were being included in the analysis group, and these threw an error because the special features added for the table weren't added to those SV commands
2012-06-14 16:42:26 -04:00
Mark DePristo f77d2e6965 Renamed NO_HEADER to the more accurate no_cmdline_in_header
-- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well
2012-05-24 10:57:08 -04:00
Ryan Poplin c3fb321014 Minor updates to pacbio data processing script to make it work with the latest bwa version/settings. 2012-05-22 10:24:45 -04:00
Eric Banks 03d40272c8 Removed old GATKReport code and moved the new stuff in its place. 2012-05-18 01:44:31 -04:00
Eric Banks a26b04ba17 Extensive refactoring of the GATKReports. This was a beast.
The practical differences between version 1.0 and this one (v1.1) are:

* the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables.
* no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table.
* no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables.

Integration tests change because table headers are different.
Old classes are still lying around.  Will clean those up in a subsequent commit.
2012-05-18 01:11:26 -04:00
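Illustrative sketch (not part of the commit): the v1.1 design described above (array-backed rows, arbitrary row IDs instead of a primary-key column, and only increment and concatenate as table operations) might look roughly like this. `ArrayTable` and its methods are hypothetical names, not the GATKReport API, and the behaviour on duplicate row IDs (the later row wins) is an assumption.

```python
class ArrayTable:
    """Rows are plain lists; an optional ID -> row-position map indexes them."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.col_index = {name: i for i, name in enumerate(self.columns)}
        self.rows = []        # one list per row, instead of one hash per row
        self.row_index = {}   # arbitrary row ID -> position in self.rows

    def add_row(self, row_id, values):
        if len(values) != len(self.columns):
            raise ValueError("row width does not match column count")
        self.row_index[row_id] = len(self.rows)
        self.rows.append(list(values))

    def get(self, row_id, column):
        return self.rows[self.row_index[row_id]][self.col_index[column]]

    def increment(self, row_id, column, by=1):
        """Increment a cell, which is only allowed for integer cells."""
        r, c = self.row_index[row_id], self.col_index[column]
        if not isinstance(self.rows[r][c], int):
            raise TypeError("only integer cells can be incremented")
        self.rows[r][c] += by

    def concat(self, other):
        """Append the rows of another table that has the same columns."""
        if other.columns != self.columns:
            raise ValueError("tables must share the same columns")
        offset = len(self.rows)
        for row_id, pos in other.row_index.items():
            self.row_index[row_id] = offset + pos
        self.rows.extend(list(row) for row in other.rows)
```

Keeping the ID map separate from the row storage is what lets the table drop the special-cased primary-key column while still supporting lookups by ID.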
Khalid Shakir a9da9598f5 Implemented getSamplesFromVCF. 2012-05-03 21:57:57 -04:00
Khalid Shakir 7c11dde328 Updated DPP test MD5's due to template length (TLEN) changes when Picard was revved. 2012-05-03 14:47:58 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream
as an argument, not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides to its subclasses the same String decode,
readHeader functionality as before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (it's a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it isn't a particularly
clean way of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so it's no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version, to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00
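Illustrative sketch (not part of the commit): the readHeader contract described above, where the header comes back together with the (exclusive) position at which it ends so that readers can seek past it, could be modelled like this. This is not Tribble's Java API; `CodecHeader`, `read_header` and `read_records` are hypothetical names, and treating lines that start with `#` as the header is an assumption made only for the example.

```python
import io
from typing import List, NamedTuple


class CodecHeader(NamedTuple):
    value: List[str]  # decoded header content
    end: int          # byte offset where the header ends (not inclusive)


def read_header(stream) -> CodecHeader:
    """Consume leading '#' lines and report where the first record starts."""
    lines = []
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line.startswith(b"#"):
            stream.seek(pos)  # un-read the first non-header line (or EOF)
            return CodecHeader(lines, pos)
        lines.append(line.decode("ascii").rstrip("\n"))


def read_records(stream, header: CodecHeader) -> List[str]:
    """Skip straight to header.end, so decoding never sees header lines."""
    stream.seek(header.end)
    return [line.decode("ascii").rstrip("\n") for line in stream]


data = io.BytesIO(b"#h1\n#h2\nchr1\t10\t20\n")
header = read_header(data)
records = read_records(data, header)
```

Because the reader seeks to `header.end` before decoding, the per-record decode step does not need its own "is this a header line?" check.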
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Khalid Shakir 91cb654791 AggregateMetrics:
- Ported from jython to java, so it is now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
2012-04-17 11:45:32 -04:00
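Illustrative sketch (not part of the commit): why array keys help when a sample name contains a period. A dot-joined key cannot be split back into its original parts, while a tuple keeps each part intact. The sample name and the helper names below are hypothetical.

```python
def dotted_key(parts):
    """Old-style key: parts joined with '.', which is ambiguous to split."""
    return ".".join(parts)


def array_key(parts):
    """Array-style key: a tuple keeps every part exactly as given."""
    return tuple(parts)


# Hypothetical sample name that contains a period.
parts = ["NA12878.rep1", "coverage"]

split_again = dotted_key(parts).split(".")   # three pieces, not two
lookup = {array_key(parts): 30}              # tuple keys work as dict keys
```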
Roger Zurawicki 63cf7ec7ec Added more primitives to GATK Report Column Type
- The Integer column type now accepts byte and shorts
- Updated Unit Tests and added a new testParse() test

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00
Eric Banks ed69f4ff7c Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-13 09:28:16 -04:00
Eric Banks 9b9856ead5 quick todo for next time we make a bundle 2012-03-13 09:28:11 -04:00
Eric Banks 6e9b8559d8 Unfortunately need to bump up memory needed for liftover to get Omni file sorted 2012-03-12 23:20:00 -04:00
Eric Banks 359090c4b7 Updating dbsnp to v135 2012-03-12 13:17:58 -04:00
Eric Banks 7e9a535c4d Updated the bundle to use the official filtered (final) indel calls 2012-03-12 12:12:24 -04:00
Christopher Hartl 2c1b14d35e Mostly small changes to my own scala scripts: .vcf.gz compatibility for output files, smarter beagle generation, simple script to scatter-gather combine variants. Whole genome indel calling now uses the gold standard indel set. 2012-02-22 17:20:04 -05:00
Christopher Hartl 685bcaced2 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-02-21 13:53:37 -05:00
Khalid Shakir cda1e1b207 Minor manual merge update for List class to Seq interface usage. 2012-02-08 02:24:54 -05:00
Khalid Shakir ef74363b1b Merged bug fix from Stable into Unstable 2012-02-08 02:14:26 -05:00
Khalid Shakir 23e7f1bed9 When an interval list specifies overlapping intervals merge them before scattering. 2012-02-08 02:12:16 -05:00
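Illustrative sketch (not part of the commit): merging overlapping intervals before scattering can be done with a sort followed by a single pass. This is a generic version of the technique, not Queue's implementation. It assumes inclusive coordinates, sorts contigs lexicographically rather than in reference order, and leaves intervals that merely abut unmerged. `merge_overlapping` is a hypothetical name.

```python
def merge_overlapping(intervals):
    """Merge overlapping (contig, start, end) intervals with inclusive ends."""
    merged = []
    for contig, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == contig and start <= last[2]:
            # Overlaps the previous interval on the same contig: extend it.
            merged[-1] = (contig, last[1], max(last[2], end))
        else:
            merged.append((contig, start, end))
    return merged


example = merge_overlapping([
    ("chr1", 50, 150),
    ("chr1", 1, 100),
    ("chr1", 200, 300),
    ("chr2", 1, 10),
])
```

Merging first means no base is assigned to two scatter pieces, so each piece covers a disjoint region.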
Christopher Hartl 974c2499cc Bugfix to script. 2012-02-02 12:55:54 -05:00
Christopher Hartl 27ea6426a4 Small script to chunk up a VCF into equal-sized chunks 2012-02-02 12:29:03 -05:00
Christopher Hartl 0c562756eb Add a memory limit so this thing doesn't get killed on the farm 2012-02-02 10:30:09 -05:00
Christopher Hartl 45bf2562cc . 2012-02-02 09:11:17 -05:00
Christopher Hartl f8c5406084 Add the ability to extract samples 2012-02-02 09:06:39 -05:00
Christopher Hartl b567ed8793 Use the right reference path :( 2012-02-01 12:35:18 -05:00
Christopher Hartl 87a63d54d6 fix the script! 2012-02-01 12:05:29 -05:00
Christopher Hartl 810996cfca Introducing: VariantsToPed, the world's most annoying walker! And also a busted QScript to run it that I need Khalid's help debugging ( frownie face ). Note that VariantsToPed and PlinkSeq generate the same binary file (up to strand flips...thanks PlinkSeq), so I know it's working properly. Hooray! 2012-02-01 10:39:03 -05:00
Mauricio Carneiro 052a4bdb9c Turning off PHONE HOME option in the MDCP
* MDCP is for internal use and there is no need to report to the Amazon cloud.
   * Reporting to AWS_S3 is not allowing jobs to finish; this is probably a bug.
2012-01-27 11:13:30 -05:00
Mauricio Carneiro 97499529c7 another small bug with the file extension. 2012-01-24 16:14:35 -05:00
Mauricio Carneiro 7c7ca0d799 fixing bug with fastq extension
* PPP only recognized .fasta and .fq, failing when the user provided a .fastq file. Fixed.
2012-01-24 11:02:15 -05:00
Mauricio Carneiro 945cf03889 IntelliJ ate my import! 2012-01-23 21:46:45 -05:00
Mauricio Carneiro 2bb9525e7f Don't set base qualities if fastQ is provided
* Pacbio Processing pipeline now works with the new fastQ files outputted by the Pacbio instrument
2012-01-23 17:57:29 -05:00
Khalid Shakir c18beadbdb Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Christopher Hartl 39e6df5aa9 Fix edge case for very small VCFs 2012-01-19 00:51:28 -05:00
Christopher Hartl 1e037a0ecf Ensure second-to-last line printed 2012-01-19 00:33:08 -05:00
Christopher Hartl 9946853039 Remove duplicated line 2012-01-19 00:25:22 -05:00
Christopher Hartl cf9b1d350a Some minor changes to in-process functions that nobody else uses. CGL now properly ignores no-calls for external VCFs. 2012-01-19 00:20:49 -05:00
David Roazen b7c65cb089 Merged bug fix from Stable into Unstable 2012-01-18 09:52:47 -05:00
David Roazen d5199db8ec Be explicit about setting the snpEff -onlyCoding option in the pipeline
When run without an explicit -onlyCoding option, as we've been doing up to
now, snpEff automatically sets -onlyCoding to "true" provided that there is
at least one transcript marked as "protein_coding", which will always be the
case for us in practice (and indeed, all pipeline runs so far with snpEff
2.0.5 have run with -onlyCoding auto-set to "true").

However, given the disastrous effect on annotation quality setting
"-onlyCoding false" has, we wish to be explicit with this option
rather than relying on snpEff's auto-detection logic.
2012-01-17 20:04:27 -05:00
Ryan Poplin 75f87db468 Replacing Mills file with new gold standard indel set in the resource bundle for release with v1.5 2012-01-17 15:02:45 -05:00
Khalid Shakir a9a6516527 Merged bug fix from Stable into Unstable 2012-01-10 16:16:10 -05:00
Khalid Shakir ef50e77ee2 When running Queue jobs locally, merge the stderr to the stdout log if the error file is NOT specified.
Updated VE strats in the HSP for plotting Ka/Ks by AC.
2012-01-10 16:10:25 -05:00
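Illustrative sketch (not part of the commit): the local-run behaviour described above, where stderr is folded into the stdout log unless a separate error file is specified, can be expressed with Python's `subprocess` module. This shows the general idea, not Queue's Scala code; `run_local` is a hypothetical helper.

```python
import os
import subprocess
import sys
import tempfile


def run_local(command, out_path, err_path=None):
    """Run a command, sending stderr to out_path unless err_path is given."""
    with open(out_path, "wb") as out:
        if err_path is None:
            # No error file specified: merge stderr into the stdout log.
            return subprocess.call(command, stdout=out, stderr=subprocess.STDOUT)
        with open(err_path, "wb") as err:
            return subprocess.call(command, stdout=out, stderr=err)


# Usage: a child process that writes one line to each stream.
child = [sys.executable, "-c",
         "import sys; print('to stdout'); sys.stderr.write('to stderr\\n')"]
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "job.out")
    exit_code = run_local(child, log_path)
    with open(log_path) as fh:
        merged_log = fh.read()
```

With no error file given, both lines end up in the single log; their relative order depends on buffering in the child process.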
Mauricio Carneiro 5bf960deb8 adding dbsnp to indel VQSR 2012-01-10 12:38:49 -05:00
Mauricio Carneiro 6f2abd76df Updating the MDCP with the new indel gold standard from Ryan. 2012-01-09 15:31:18 -05:00
Khalid Shakir 5793625592 No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice.
QScript accessor to QSettings to specify a default runName and other default function settings.
Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs.
Gathered log files concatenate all log files together into the stdout.
InProcessFunctions now have PrintStreams for stdout and stderr.
Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml.
During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file.
In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope.
Added more detailed output when running with -l DEBUG.
Simplified graphviz visualization for additional debugging.
Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List)
Minor cleanup to build including sending ant gsalib to R's default libloc.
2012-01-08 12:11:55 -05:00
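Illustrative sketch (not part of the commit): the log naming rule described above (first output plus ".out", otherwise the QScript name plus the order in which the function was added) can be written as a small function. `default_log_name` is a hypothetical name; the two example values come from the commit message.

```python
def default_log_name(outputs, script_name, order):
    """Return '<first output>.out', or '<script>-<order>.out' with no outputs."""
    if outputs:
        return outputs[0] + ".out"
    return "{}-{}.out".format(script_name, order)


with_output = default_log_name(["my.vcf"], "MyScript", 1)
without_output = default_log_name([], "MyScript", 1)
```

Because the name depends only on the outputs (or the script and order), two identical functions map to the same log, which is what makes the name usable for detecting duplicates and finished jobs.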
Mauricio Carneiro f6a18aea63 Updated MDCP with INDEL best practices
* chose 90.0 indel cut target for most datasets (this is arbitrary).
2012-01-06 17:21:59 -05:00
Mauricio Carneiro 3358c132a8 Updating the MD5s
Clipping adaptor boundaries changed the results of CountCovariates, which affected the PPP output.
A few more loci were visible to locus walkers.
2011-12-21 15:14:05 -05:00