1131 Commits (ffbd4d85f2e0112b32df0bbba00330b00a0806cf)

Author SHA1 Message Date
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
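A minimal sketch of the fix pattern described in the commit above, assuming hypothetical class names (`ClosingWrapper` and `TrackingStream` are illustrative, not GATK classes). The wrapper forwards close() to the wrapped stream, and a tracking stream lets a unit test confirm that the request arrived:

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustrative only: these class names are hypothetical, not GATK classes.
final class WrapperCloseDemo {

    // Adapter that owns the stream it is given and closes it on request.
    static final class ClosingWrapper implements Closeable {
        private final InputStream wrapped;

        ClosingWrapper(InputStream wrapped) {
            this.wrapped = wrapped;
        }

        @Override
        public void close() throws IOException {
            // Delegate instead of ignoring the request, so the resource is released
            // even when this wrapper holds the only remaining reference.
            wrapped.close();
        }
    }

    // In-memory stream that records whether close() reached it.
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Returns true if closing the wrapper also closed the wrapped stream.
    static boolean closingWrapperClosesStream() {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
        try (ClosingWrapper wrapper = new ClosingWrapper(stream)) {
            // The wrapper would be used here.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stream.closed;
    }

    public static void main(String[] args) {
        System.out.println("wrapped stream closed: " + closingWrapperClosesStream());
    }
}
```

A wrapper whose close() is a no-op would make `closingWrapperClosesStream()` return false, which is the leak the commit describes.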
Eric Banks 26968ae8eb Forgot that the VCFStreamingIntegrationTest uses VE 2012-05-18 02:51:53 -04:00
Eric Banks 52c206d5db Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports. 2012-05-18 02:32:20 -04:00
Eric Banks 03d40272c8 Removed old GATKReport code and moved the new stuff in its place. 2012-05-18 01:44:31 -04:00
Eric Banks a26b04ba17 Extensive refactoring of the GATKReports. This was a beast.
The practical differences between version 1.0 and this one (v1.1) are:

* the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables.
* no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table.
* no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables.

Integration tests change because table headers are different.
Old classes are still lying around.  Will clean those up in a subsequent commit.
2012-05-18 01:11:26 -04:00
Guillermo del Angel 5189b06468 New annotation for indels that describes whether they're STRs and their characteristics. If an indel is an STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality. 2012-05-17 15:28:19 -04:00
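The RU/RPA example in the commit above can be reproduced with a short sketch. It assumes the repeat unit is already known and that each allele is given as its full repeat context; the class and method names are hypothetical and this is not the GATK annotation code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the GATK annotation implementation.
final class RepeatCounts {

    // Counts how many consecutive copies of unit appear at the start of seq.
    static int countLeadingRepeats(String unit, String seq) {
        if (unit.isEmpty()) {
            throw new IllegalArgumentException("repeat unit must not be empty");
        }
        int count = 0;
        int pos = 0;
        while (seq.startsWith(unit, pos)) {
            count++;
            pos += unit.length();
        }
        return count;
    }

    // Computes the per-allele repeat counts (the RPA values) for one repeat unit.
    static List<Integer> repeatsPerAllele(String unit, List<String> alleleContexts) {
        List<Integer> counts = new ArrayList<>();
        for (String context : alleleContexts) {
            counts.add(countLeadingRepeats(unit, context));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Reference context first, then the two alternate contexts from the commit message.
        List<String> contexts = Arrays.asList("ATATAT", "ATAT", "ATATATAT");
        System.out.println("RU=AT RPA=" + repeatsPerAllele("AT", contexts));
    }
}
```

For the commit's example this yields the counts 3, 2 and 4, matching RPA=3,2,4.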
Guillermo del Angel ae26f0fe14 a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing 2012-05-14 10:55:35 -04:00
Guillermo del Angel 89f8a6b2e6 Revert bad part of last commit that shouldn't have been pushed 2012-05-10 10:41:08 -04:00
Guillermo del Angel 27b1aa5dd3 Don't allow N's in insertions when discovering indels. Maybe a better solution will be to use them as wildcards and merge them with compatible regular insertion alleles, but for now it's easier to ignore them. Minor refactoring of Allele.acceptableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's 2012-05-10 10:29:19 -04:00
Eric Banks c40cda7e3c Nope, loads of integration tests had to be changed. 2012-05-07 14:30:42 -04:00
Eric Banks f3433201b1 Merged bug fix from Stable into Unstable 2012-05-03 11:11:00 -04:00
Eric Banks 557da77a1a Don't compute QD if there is no QUAL; added integration test for this 2012-05-03 11:02:37 -04:00
Eric Banks 1fc7b5d58b Merged bug fix from Stable into Unstable 2012-05-03 10:37:58 -04:00
Laurent Francioli 567d01cee8 - Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:49 -04:00
Laurent Francioli 96e5a26223 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:20 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream
as an argument, not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides its subclasses with the same String decode,
readHeader functionality as before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (it's a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's not a particularly
clean way of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so it's no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version, to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integration tests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Eric Banks e448cfcc59 Forgot to update these md5s 2012-05-02 21:09:50 -04:00
Khalid Shakir b8b7f28aa9 Revving Picard to pick up new SamFileHeaderMerger.
Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut().
In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124.
- Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs.
- Updated md5s to match new expectations after looking at TLEN diff engine output.
2012-05-02 16:47:28 -04:00
Eric Banks 623b36fbc4 Add header lines for AC,AF, and AN tags 2012-05-02 15:33:34 -04:00
Eric Banks 619a69a5f1 As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?). 2012-05-01 16:18:24 -04:00
Eric Banks 0f3af9555b Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case. 2012-05-01 14:58:06 -04:00
Eric Banks 0c8e801021 Removing public to private dependency 2012-05-01 11:04:11 -04:00
Eric Banks e964d17518 Removing public to private dependency 2012-05-01 11:02:28 -04:00
Mauricio Carneiro 462450c3e3 disabling all BQSR unit tests
with the changes to the cycle covariate, some tests need updates, others  need to be completely re-written.
2012-04-30 14:39:55 -04:00
Guillermo del Angel e185632013 Exhaustive unit tests for Pool SNP genotype likelihoods:
a) Add ability for ErrorModel to be specified by external log-probability vector for testing.
b) For a given depth and ploidy(=2*samples/pool), create artificial high quality pileup testing from AC=0 to AC=ploidy, and test that pool GL's have expected content. Misc. refactorings and cleanups
c) Misc. cleanups and beautification.
2012-04-30 14:29:46 -04:00
Guillermo del Angel 730208133b Several fixes and improvements to Pool caller with ancillary test functions (not done yet):
a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value.
b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time.
c) Expand unit tests and add an exhaustive test for ErrorModel class.
d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10.
e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp; can only specify pileups in one position with a given number of matches and mismatches to ref) but functionality will be expanded in future to cover more test cases.
f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done).
g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model.
h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math
2012-04-27 14:41:17 -04:00
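The math bug in item d) of the commit above comes down to the identity log10(e^x) = x * log10(e), which is roughly 0.4343 * x rather than x. A small sketch of the conversion follows; the class and method names are hypothetical and this is not the ErrorModel code:

```java
// Hypothetical illustration of the base-10 versus natural-log mix-up.
final class LogBaseDemo {

    // Conversion factor between natural-log and base-10 exponents.
    static final double LOG10_E = Math.log10(Math.E);

    // Returns log10(e^x) without computing e^x, which could overflow for large x.
    static double log10OfExp(double x) {
        return x * LOG10_E;
    }

    public static void main(String[] args) {
        double x = 2.0;
        System.out.println("x                  = " + x);
        System.out.println("log10(exp(x))      = " + Math.log10(Math.exp(x)));
        System.out.println("x * log10(e)       = " + log10OfExp(x));
        System.out.println("ln(exp(x)), base e = " + Math.log(Math.exp(x)));
    }
}
```

Only the natural logarithm returns x unchanged; code that keeps probabilities in log10 space has to apply the `LOG10_E` factor whenever it exponentiates with base e.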
Eric Banks 0439047269 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-27 10:49:45 -04:00
Eric Banks 05b44dd017 The genotypeCounts array wasn't always being initialized before it was accessed, leading to a NPE (which got caught and thrown as a JEXL expression when used in selection). Added unit test to cover all genotype count methods. 2012-04-27 10:49:36 -04:00
Khalid Shakir 9801dd114f Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped
The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag()
Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.
2012-04-27 09:58:38 -04:00
Guillermo del Angel 2f86ccb086 Correct md5's for previous code change 2012-04-26 16:20:41 -04:00
Guillermo del Angel 972d6531b6 Corner case fix for indel GL computation: sometimes (depending on surrounding context) reads which are not informative of two candidate haplotypes end up having marginally higher likelihoods with one haplotype as opposed to another, depending on uncertainty on alignments in surrounding regions. So, a sample whose GL is -0.0001,-0.0005,-0.001 may have its genotype set to 1/1 due to this statistical noise. We already have a tolerance comparing max(gl)-min(gl) to avoid genotyping, so this tolerance is now increased from 0.001 to 0.1 (equivalent to 1 PL unit) to avoid genotyping a sample if all PLs are within this threshold. Changed 2 integration test md5s that hit this case. 2012-04-26 10:15:26 -04:00
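The tolerance check described in the commit above can be sketched as follows. The names are hypothetical and the treatment of a spread exactly equal to the tolerance is an assumption; the only values taken from the commit are the 0.1 threshold in log10 units and the example likelihoods:

```java
// Hypothetical sketch of the "do not genotype uninformative samples" rule.
final class GenotypeTolerance {

    // 0.1 in log10-likelihood units, equivalent to 1 PL unit per the commit message.
    static final double TOLERANCE = 0.1;

    // Returns true if the spread of likelihoods is large enough to assign a genotype.
    static boolean isInformative(double[] log10Likelihoods) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double gl : log10Likelihoods) {
            max = Math.max(max, gl);
            min = Math.min(min, gl);
        }
        // Assumption: a spread exactly equal to the tolerance counts as uninformative.
        return max - min > TOLERANCE;
    }

    public static void main(String[] args) {
        double[] noisy = {-0.0001, -0.0005, -0.001}; // example from the commit message
        double[] clear = {-12.0, -1.5, 0.0};         // made-up values for contrast
        System.out.println("noisy sample informative: " + isInformative(noisy));
        System.out.println("clear sample informative: " + isInformative(clear));
    }
}
```

With the 0.1 threshold the noisy sample is left ungenotyped instead of being called from statistical noise.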
Laurent Francioli ab2a952ad1 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:56:47 -04:00
Laurent Francioli 219b0a128b PED support for ChromosomeCounts annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:50:04 -04:00
Laurent Francioli 19d5213d5a Added function to get founders IDs in SampleDB
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:49:36 -04:00
Mauricio Carneiro e440d0ce69 BQSR triage #4
* fixed queue script plot file names
   * updated the ReadGroupCovariate to use the platform unit instead of sample + lane.
   * fixed plotting of marginalized reported qualities
2012-04-24 17:19:54 -04:00
Mauricio Carneiro e39a59594a BQSR triage and test routines
* updated BQSR queue script for faster turnaround
   * implemented plot generation for scatter/gathered runs
   * adjusted output file names to be cooperative with the queue script
   * added the recalibration report file to the argument table in the report
   * added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read
   * added RecalibrationReport unit test -- guarantees the integrity of the delta tables
2012-04-23 11:23:00 -04:00
Eric Banks cd63bcb1b8 Fixing unit tests to register the user exception being thrown (instead of the NumberFormatException) 2012-04-23 10:06:51 -04:00
Eric Banks 1f23d99dfa If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did. 2012-04-20 17:00:05 -04:00
Eric Banks 4b81c75642 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-20 14:30:19 -04:00
Eric Banks f1c5510ec0 When running SelectVariants with the excludeNonVariants option, remove alleles from the ALT field that are no longer polymorphic. 2012-04-20 14:30:04 -04:00
Ryan Poplin a1596791af Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-20 14:03:04 -04:00
Ryan Poplin a57295eb75 Fixing a bug when breaking up active regions where the resulting regions would overlap by one base. Adding quality score manipulation from the UG into the haplotype caller (qual capped by mapping quality, min qual threshold). 2012-04-20 14:02:55 -04:00
Guillermo del Angel de68363c23 Removed experimental feature (aka hack) that was meant for 1000G consensus but remained in VQSR data manager - QD was being scaled by indel length. There's no evidence any more that QD is length-dependent, neither in CEU trio data nor in latest 1000G P2 calls 2012-04-20 10:58:34 -04:00
Mauricio Carneiro 0f8c77391d BQSR bug triage #3
* fixed context covariate famous "off by one" error
   * reduced maximum quality score to Q50 (following Eric/Ryan's suggestion)
   * remove context downsampling in BQSR R script
2012-04-19 17:31:04 -04:00
Khalid Shakir df5dd841af AC strat now checks if evals will be merged before throwing an error on multiple eval files.
Minor tweaks to WGP script based on new recal VCF format.
2012-04-19 16:08:55 -04:00
Guillermo del Angel 1ae2ab5b63 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 12:50:29 -04:00
Guillermo del Angel 02ff930f6a My changes 2012-04-19 12:45:18 -04:00
Mauricio Carneiro eb22cd7222 Unit test to guarantee BQSR sequential calculation accuracy
This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.
2012-04-19 09:33:40 -04:00
Mauricio Carneiro 68d0211fa1 Improved BQSR plotting and some new parameters
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
   * Refactored the CycleCovariateUnitTest to test the pairing information
   * Updated BQSR Integration tests accordingly
   * Made quantization levels parameter not hidden anymore
   * Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
   * Added hidden option not to generate the plots automatically (important for scatter/gathering)
2012-04-19 09:31:41 -04:00
Guillermo del Angel 143e92b797 Rebasing 2012-04-18 20:05:43 -04:00
Ryan Poplin dcc4871468 minor misc optimizations to PairHMM 2012-04-18 15:02:26 -04:00
Eric Banks 4448a3ea76 Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position. 2012-04-17 23:54:10 -04:00
Eric Banks c1f52b773a Minor tweaks and updated integration tests MD5s 2012-04-17 23:17:28 -04:00
Eric Banks ea793d8e27 Khalid pressured me into adding an integration test that makes sure we don't fail on reads with adjacent I and D events. 2012-04-17 21:21:29 -04:00
Mauricio Carneiro f0c81b59b0 Implementation of the new BQSR plotting infrastructure
* removed low quality bases from the recalibration report.
   * refactored the Datum (Recal and Accuracy) class structure
   * created a new plotting csv table for optimized performance with the R script
   * added a datum object that carries the accuracy information (AccuracyDatum) for plotting
   * added mean reported quality score to all covariates
   * added QualityScore as a covariate for plotting purposes
   * added unit test to the key manager to operate with one required covariate and multiple optional covariates
   * integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)
2012-04-17 19:23:55 -04:00
Khalid Shakir 91cb654791 AggregateMetrics:
- Ported from Jython to Java, so it is now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
2012-04-17 11:45:32 -04:00
Mark DePristo 3f6b2423d8 Update VE IT to reflect new fields and bugfixes 2012-04-13 17:00:37 -04:00
Mark DePristo f9190b6fcd VariantEvalUnitTest is better named VariantEvalWalkerUnitTest 2012-04-13 17:00:37 -04:00
Mark DePristo 84d1e8713a Infrastructure for combining VariantEvaluations
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
2012-04-13 17:00:36 -04:00
Mark DePristo 2aa2d9aec0 Merged bug fix from Stable into Unstable 2012-04-13 09:25:43 -04:00
Mark DePristo 27e7e17dc7 New way to handle exceptions in multi-threaded GATK
-- HMS no longer tries to grab and throw all exceptions.  Exceptions are just thrown directly now.
-- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer
-- Better printing of stack traces in WalkerTest
2012-04-13 09:23:33 -04:00
Mark DePristo e85e9a8cf5 More extensive testing of type of error thrown in multi-threaded walker test
-- Unfortunately the result of the multi-threaded test is non-deterministic, so run the test 10 times to see if the right exception is always thrown
-- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs
2012-04-13 09:23:33 -04:00
Eric Banks 297afc7911 Added unit test to ensure that we genotype correctly cases with really large GLs 2012-04-12 15:43:14 -04:00
Eric Banks 5b7da3831f Not sure why this didn't make it into the last push, but here's a working MD5 for the NDA annotation in UG 2012-04-11 13:49:50 -04:00
Eric Banks dc90508104 Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful. 2012-04-11 13:47:10 -04:00
Eric Banks d2142c3aa7 Adding integration test for Flag Stat 2012-04-10 22:40:38 -04:00
Ryan Poplin 1df0adf862 Fixing ActivityProfile unit test. 2012-04-10 15:28:27 -04:00
Ryan Poplin e3cc7cc59c Resolving merge conflict. 2012-04-10 14:50:27 -04:00
Ryan Poplin a4634624b7 There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function. 2012-04-10 14:48:23 -04:00
Eric Banks 10e74a71eb We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior. 2012-04-10 12:30:35 -04:00
Eric Banks f82986ee62 Adding unit tests for the very important log10sumLog10 util method. 2012-04-09 14:28:25 -04:00
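For context, summing probabilities that are stored as log10 values is usually done by factoring out the maximum so that the exponentiation cannot underflow. The sketch below shows that general technique with a hypothetical method name; it is an assumption about the approach and not the body of the GATK utility method:

```java
// Hypothetical sketch of summing log10-scaled probabilities stably.
final class Log10Sum {

    // Returns log10(sum over i of 10^values[i]).
    static double log10SumOfLog10(double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return max; // empty input or all-zero probabilities
        }
        double sum = 0.0;
        for (double v : values) {
            // Every exponent is <= 0 after subtracting the max, so no overflow occurs.
            sum += Math.pow(10.0, v - max);
        }
        return max + Math.log10(sum);
    }

    public static void main(String[] args) {
        // 0.1 + 0.1 = 0.2 in probability space.
        System.out.println(log10SumOfLog10(new double[] {-1.0, -1.0}));
        // Values this small would underflow to zero if exponentiated directly.
        System.out.println(log10SumOfLog10(new double[] {-1000.0, -1000.0}));
    }
}
```

Unit tests for such a method typically compare small inputs against a direct computation and check that very negative inputs still give a finite result.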
Mauricio Carneiro 87e6bea6c1 Adding engine capability to quantize qualities.
* Added parameter -qq to quantize qualities using a recalibration report
   * Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization.
   * Updated BQSR scripts to make use of the new parameters
2012-04-08 21:07:51 -04:00
Mark DePristo c22a66870c Modified UnitTests to respect reference padding 2012-04-06 16:27:20 -04:00
Mark DePristo 45fc0ea98d Improvements to indel analysis capabilities of VariantEval
-- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites
-- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately.  This is based on an old email from Mark Daly:

    // - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a
    // downstream frameshift, if we make the simplifying assumptions that 3 bp ins
    // and 3bp del (adding/subtracting 1 AA in general) are roughly comparably
    // selected against, we should see a consistent 1+2 : 3 bp ratio for insertions
    // as for deletions, and certainly would expect consistency between in/dels that
    // multiple methods find and in/dels that are unique to one method  (since deletions
    // are more common and the artifacts differ, it is probably worth looking at the totals,
    // overlaps and ratios for insertions and deletions separately in the methods
    // comparison and in this case don't even need to make the simplifying in = del functional assumption

-- Added a new VEW argument to bind a gold standard track
-- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do
-- Deleted random unused functions in IndelUtils
2012-04-06 16:07:46 -04:00
Mark DePristo 52ef4a3e26 Function to compute whether a VariantContext indel is part of a TandemRepeat
Returns true iff VC is a non-complex indel where every allele represents an expansion or
 contraction of a series of identical bases in the reference.

 The logic of this function is pretty simple.  Take all of the non-null alleles in VC.  For
 each insertion allele of n bases, check if that allele matches the next n reference bases.
 For each deletion allele of n bases, check if this matches the reference bases at positions n to 2n,
 as it must necessarily match the first n bases.  If this test returns true for all
 alleles you are a tandem repeat, otherwise you are not.  Note that in this context n is the
 base differences between the ref and alt alleles
2012-04-06 16:07:46 -04:00
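The per-allele test described in the commit above can be sketched on plain strings. This assumes the inserted or deleted bases have already been extracted and that `refAfter` holds the reference bases starting right after the indel's anchor position; all names are hypothetical and this is not the GATK function:

```java
// Hypothetical string-based sketch of the tandem repeat test.
final class TandemRepeatCheck {

    // An insertion of n bases is a repeat expansion if it matches the next n reference bases.
    static boolean insertionIsRepeat(String inserted, String refAfter) {
        return !inserted.isEmpty() && refAfter.startsWith(inserted);
    }

    // A deletion of n bases removes reference bases [0, n), so it is a repeat
    // contraction if the same bases occur again at reference positions [n, 2n).
    static boolean deletionIsRepeat(String deleted, String refAfter) {
        int n = deleted.length();
        return n > 0 && refAfter.regionMatches(n, deleted, 0, n);
    }

    public static void main(String[] args) {
        String refAfter = "ATATGG";
        System.out.println("insert AT: " + insertionIsRepeat("AT", refAfter));
        System.out.println("insert AC: " + insertionIsRepeat("AC", refAfter));
        System.out.println("delete AT: " + deletionIsRepeat("AT", refAfter));
        System.out.println("delete AT from ATGGGG: " + deletionIsRepeat("AT", "ATGGGG"));
    }
}
```

A site would count as a tandem repeat only if every alternate allele passes the matching check.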
Mauricio Carneiro 7c3b3650bb BQSR bug triage
* fixed bug where some keys were using the same recal datum objects
    * fixed quantization qual calculations when combining multiple reports
    * fixed rounding error with empirical quality reported when combining reports
    * fixed combine routine in the gatk reports due to the primary keys being out of order
    * added auto-recalibration option to BQSR scala script
    * reduced the size of the recalibration report by ~15%
    * updated md5's
2012-04-05 09:32:18 -04:00
Mark DePristo 76e4100d89 By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
Ryan Poplin bfad26353a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-04 16:04:50 -04:00
Ryan Poplin dda2173c66 Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned. 2012-04-04 16:04:29 -04:00
Mark DePristo 1ccea866d8 VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
-- Updated EvalModules to work with new parameter
-- adding test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Eric Banks 337ff7887a When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals. 2012-04-04 10:57:05 -04:00
Guillermo del Angel 05d8400468 Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet) 2012-04-03 20:51:24 -04:00
Guillermo del Angel 5a10f173ea Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow) 2012-04-03 18:55:52 -04:00
Guillermo del Angel 63b1e737c6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-03 15:43:50 -04:00
Guillermo del Angel 9e11b4f9a7 Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced. 2012-04-03 15:43:32 -04:00
Eric Banks 326220c91c Removing extended event related unit tests 2012-04-02 14:40:36 -04:00
Eric Banks 99d27ddcc4 Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now. 2012-04-02 14:27:36 -04:00
Mark DePristo 4f73ea902f Final update for VE. VCFStreaming wasn't yet updated 2012-03-30 21:52:01 -04:00
Mark DePristo fbbb8509ad Final commits to VariantEval
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integration tests
2012-03-30 20:11:06 -04:00
Mark DePristo 4b45a2c99d Final version of new VariantEval infrastructure.
*** WAY FASTER ***
 -- 3x performance for multiple sample analysis with 1000 samples
 -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
 -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2

-- Remove the TableType system, as this was way too complex.  No longer possible to embed what were effectively multiple tables in a single Evaluator.  You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis.  IndelLengthHistogram is now a @Molten data type.  GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables as @DataPoints.  You get an error if you do.
-- Simplified entire IO system of VE.  Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
2012-03-30 15:31:56 -04:00
Mark DePristo 976bac0452 BaseTest now has a global variable to turn off network connection requirement 2012-03-30 15:31:55 -04:00
Mark DePristo 097ed4ecc4 Memory usage optimizations and safety improvements to StratNode and StratificationManager
-- Added memory and safety optimizations to StratNode and StratificationManager.  Fresh, immutable Hashmaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation.  The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement.  Added integration test to cover incompatible strats and evals
2012-03-30 15:31:55 -04:00
Mark DePristo c8086a79e3 New StratificationManager based VariantEval passes unmodified integration tests
-- Now needs cleanup and optimizations
2012-03-30 15:31:55 -04:00
Mark DePristo 8971b54b21 Phase II of Stratification manager
-- Renamed and reorganized infrastructure
-- StratificationManager now a Map from List<Object> -> V.  All key functions are implemented.  Less commonly used ones are still TODO
-- Ready for hookup to VE
2012-03-30 15:31:54 -04:00
Mark DePristo 9f1cd0ff66 Lots of new functionality for StratificationStates manager
-- Really working according to unit tests
-- A nCombination utils
2012-03-30 15:31:54 -04:00
Mark DePristo a3d896d80e Part I of creating a fast state space lookup for VE
-- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates).  This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
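Illustration of the List<String> -> integer mapping described above. This is only a sketch and not the tree-based StratificationStates code from this commit: it assumes a simpler mixed-radix scheme, and the class name StateIndexSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps one state per stratifier to a unique offset in
// [0, size()) using mixed-radix arithmetic (not the tree used in the commit).
final class StateIndexSketch {
    private final List<Map<String, Integer>> positions = new ArrayList<>();
    private final int[] radix;

    StateIndexSketch(List<List<String>> statesPerStratifier) {
        radix = new int[statesPerStratifier.size()];
        for (int i = 0; i < radix.length; i++) {
            Map<String, Integer> pos = new HashMap<>();
            for (String state : statesPerStratifier.get(i)) {
                pos.put(state, pos.size()); // position of the state within its stratifier
            }
            positions.add(pos);
            radix[i] = pos.size();
        }
    }

    // Total number of stratification combinations.
    int size() {
        int n = 1;
        for (int r : radix) n *= r;
        return n;
    }

    // Offset of one combination, e.g. into a vector of per-combination contexts.
    int offset(List<String> key) {
        if (key.size() != radix.length) {
            throw new IllegalArgumentException("Expected " + radix.length + " states");
        }
        int off = 0;
        for (int i = 0; i < radix.length; i++) {
            Integer p = positions.get(i).get(key.get(i));
            if (p == null) throw new IllegalArgumentException("Unknown state: " + key.get(i));
            off = off * radix[i] + p;
        }
        return off;
    }
}
```

Because every combination gets a dense offset, the per-combination data can live in a plain array or list of length size().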
2012-03-30 15:31:53 -04:00
Eric Banks 6b49af253b Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there. 2012-03-30 10:33:30 -04:00
Eric Banks 16bef191c6 UG integration tests updated. A handful of sites are lost because there are only 5 indels and one starts at the beginning of the read so it no longer passes our min threshold (now consistent with GGA), but mostly the depth changes ever so slightly once in a while between extended and normal pileups (I think the normal pileups are correct). I have looked thoroughly in IGV at ALL differences and am happy with the new results. As an aside, the AD is now calculated more accurately for indels. 2012-03-30 01:35:49 -04:00
Mauricio Carneiro f80bd4276a fixed estimated Q reported calculation in the gatherer 2012-03-29 12:28:43 -04:00
Guillermo del Angel a0843f125e Forgot to add file itself for new unit test 2012-03-28 21:08:18 -04:00
Roger Zurawicki 63cf7ec7ec Added more primitives to GATK Report Column Type
- The Integer column type now accepts byte and shorts
 - Updated Unit Tests and added a new testParse() test

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00
Guillermo del Angel d2586911a4 Forgot to add tolerance to new MathUtils unit tests 2012-03-28 08:18:36 -04:00
Guillermo del Angel b4a7c0d98d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-27 15:01:03 -04:00
Guillermo del Angel 343a061b1c Fix merge issues when incorporating new AF calculations changes 2012-03-27 15:00:44 -04:00
Mauricio Carneiro 1b75663178 BQSR Gatherer implementation and integration tests
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
   * optimized empirical qual calculation when merging recalibration reports
   * centralized the quality score quantization functionalities
   * unified the creating/loading of all the key manager/hash table structures.
   * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
   * added integration tests for BQSR and on-the-fly recalibration
2012-03-27 13:50:22 -05:00
Eric Banks c07a577ba3 Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously). 2012-03-27 00:27:44 -05:00
Guillermo del Angel e8bb8ade1a Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-26 16:42:03 -04:00
Guillermo del Angel 1a2a4848e8 Added integration test for ValidationSiteSelector, correct MD5's 2012-03-26 16:39:55 -04:00
Mark DePristo 34ea443cdb Better algorithm for choosing which indel alleles are present in samples
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the counting towards 5.  This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc.  If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general.  We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
2012-03-26 16:28:49 -04:00
Guillermo del Angel db54c2625f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-25 09:53:35 -04:00
Guillermo del Angel deb4586559 Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGenotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1 2012-03-24 21:49:43 -04:00
Mauricio Carneiro 9f74969e3a BQSR with GATKReport implementation
* restructured BQSR to report recalibrated tables.
   * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration)
   * linked quality score quantization to the BQSR stage, outputting a quantization histogram
   * included the arguments used in BQSR to the GATK Report
   * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities

On-the-fly recalibration with GATK Report

   * loads all tables from the GATKReport using existing infrastructure (with minor updates)
   * implemented initialization of the covariates using BQSR's argument list
   * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key
   * applied quality quantization to the base recalibration
   * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions
2012-03-23 15:42:32 -04:00
Mauricio Carneiro f421062b55 Updated read group covariate to use sample.lane instead of the id
Added Unit test.
2012-03-23 15:24:07 -04:00
Mark DePristo e4ec90cfce Merged bug fix from Stable into Unstable 2012-03-23 11:27:34 -04:00
Mark DePristo ff26f2bf68 HierarchicalMicroScheduler no longer attempts to wrap exceptions
-- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors.  Now HMS simply checks whether the throwable it received on error was a RuntimeException.  If so, it is stored and rethrown without wrapping later.  If it isn't, only in this case is the exception wrapped in a ReviewedStingException.
-- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line
-- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled.
-- I believe this will finally put to rest all of these annoying HMS captures.
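Illustration of the rethrow policy described above, as a minimal sketch only. ErrorPolicySketch and WrappedException are hypothetical names; WrappedException stands in for ReviewedStingException, and nothing here is taken from the HierarchicalMicroScheduler source.

```java
// Hypothetical sketch: RuntimeExceptions pass through untouched, anything
// else is wrapped exactly once so the caller can rethrow it unchecked.
final class ErrorPolicySketch {
    // Stand-in for ReviewedStingException (assumption for this sketch).
    static final class WrappedException extends RuntimeException {
        WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static RuntimeException toRuntimeException(Throwable error) {
        if (error instanceof RuntimeException) {
            return (RuntimeException) error; // keep the original type and stack trace
        }
        return new WrappedException("Unexpected error during traversal", error);
    }
}
```

With this policy a NullPointerException thrown in a worker surfaces to the caller as the same NullPointerException, which is what the integration test in this commit checks for.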
2012-03-23 11:27:21 -04:00
Mark DePristo 6df96644d9 Unified, standard IndelSummary metrics for VariantEval
-- Now you always get SNP and indel metrics with VariantEval!
--   Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions
-- Updated VE integration tests as appropriate
2012-03-22 21:24:37 -04:00
Menachem Fromer b9b9219ac7 Added respectPhaseInInput flag to RBP and integration tests 2012-03-22 17:40:21 -04:00
Eric Banks 8c09ff9459 Merged bug fix from Stable into Unstable 2012-03-21 12:44:43 -04:00
Eric Banks 07c3bd32b3 Bug fix: merge NO_VARIATION records with those of another type. The sad part is that this WAS covered by integration tests but someone updated the MD5s without actually paying attention... 2012-03-21 12:42:13 -04:00
Ryan Poplin 9e10779fa7 Caching log calculations cut the non-Map runtime of HaplotypeCaller in half. Moved the qual log cache used in HC and PairHMM into a common place and added unit tests. 2012-03-21 08:45:42 -04:00
Mauricio Carneiro 0e93cf5297 Taking care of bad cigars in the GATK
* fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events.
   * added Unit tests for BadCigarFilter
   * updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar
   * added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)
2012-03-20 14:32:57 -04:00
Mauricio Carneiro 633b5c687d Fixing MD5's (new GATKReport header was missing from old md5's) 2012-03-19 15:28:45 -04:00
Roger Zurawicki 7afb333811 GATK Report code cleanup
- Updated the documentation on the code
 - Made the table.write() method private and updated necessary files.
 - Added a constructor to GATKReport that takes GATKReportTables
 - Optimized my code

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-19 11:53:57 -04:00
Mauricio Carneiro 0d4ea30d6d Updating the BQSR Gatherer to the new file format
This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.
2012-03-19 09:02:27 -04:00
Ryan Poplin 943b1d34f8 intermediate commit to aid in debugging HC / exact model changes. HC integration tests will still fail 2012-03-18 15:50:27 -04:00
Eric Banks 9223e451a3 Merged bug fix from Stable into Unstable 2012-03-18 00:54:19 -04:00
Eric Banks 5c5d8e7cd3 Minor: cleaner way of turning off index-on-the-fly checking in case we want to turn it back on. 2012-03-18 00:53:29 -04:00
Guillermo del Angel a27a9ccba2 Merged bug fix from Stable into Unstable 2012-03-16 21:15:30 -04:00
Guillermo del Angel a05a7f287d TMP: disable checking of whether on the fly index is equal to index after run completed 2012-03-16 21:14:45 -04:00
Eric Banks 539d51f324 Resolving conflicts 2012-03-16 14:36:07 -04:00
Eric Banks be9e48ba29 Merged bug fix from Stable into Unstable 2012-03-16 14:33:53 -04:00
Eric Banks a7578e85e8 Rewriting a few of the indel integration tests for multi-allelics. The old tests were running b37 calls against a b36 reference, so the calls were all ref. The new tests are run against the pilot1 data and then those calls are fed back into the same bam to test genotype given alleles, with a sprinkling of bi- and tri-allelics. 2012-03-16 14:21:27 -04:00
Mauricio Carneiro 3bfca0ccfd BitSet implementation of the on-the-fly recalibration using the CSV format file.
Infrastructure:
   * Added static interface to all different clipping algorithms of low quality tail clipping
   * Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
   * Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
   * EventType is now an independent enum with added capabilities. All functionality is now centralized.

 BQSR and RecalibrateBases:
   * On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
   * Refactored the object creation to take advantage of the compact key structure
   * Replaced nested hash maps with single hash maps indexed by bitsets
   * Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
   * Excluded contexts with N's from the output file.
   * Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
   * Redefined error for indels to look at the previous base in negative strand reads (using new PE functionality)
   * Added the covariate ID (for optional covariates) to the output for disambiguation purposes
   * Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
   * Reduced memory usage of the BQSR script to 4

 Tests:
   * Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
   * Added tests for keys without optional covariates
   * Added tests for on-the-fly recalibration (but more tests are necessary)
2012-03-16 13:02:15 -04:00
Mauricio Carneiro ca11ab39e7 BitSets keys to lower BQSR's memory footprint
Infrastructure:
	* Generic BitSet implementation with any precision (up to long)
	* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
	* Memoized implementation of the BitSet utils for better performance.
	* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
	* Replace log/sqrt with bitwise logic to get rid of numerical issues

 BQSR:
	* All covariates output BitSets and have the functionality to decode them back into Object values.
	* Covariates are responsible for determining the size of the key they will use (number of bits).
	* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
	* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead

 Tests:
	* Unit tests added to every method of BitSetUtils
	* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
	* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
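Illustration of the bit-shift point made under Infrastructure above. This is a sketch under stated assumptions: BitShiftSketch, pow2 and floorLog2 are hypothetical names, not the BitSetUtils methods from this commit.

```java
// Hypothetical sketch: integer powers of two and base-2 logarithms computed
// with bit operations instead of floating-point Math.pow / Math.log.
final class BitShiftSketch {
    // 2^exponent as an exact long, for 0 <= exponent <= 62.
    static long pow2(int exponent) {
        if (exponent < 0 || exponent > 62) {
            throw new IllegalArgumentException("exponent out of range: " + exponent);
        }
        return 1L << exponent;
    }

    // floor(log2(value)) for value > 0, without any floating-point math.
    static int floorLog2(long value) {
        if (value <= 0) throw new IllegalArgumentException("value must be positive");
        return 63 - Long.numberOfLeadingZeros(value);
    }
}
```

The precision problem shows up once values exceed 53 bits: a double cannot represent 2^53 + 1, so arithmetic that goes through Math.pow silently loses the low bit, while pow2(53) + 1 stays exact.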
2012-03-16 13:01:48 -04:00
Eric Banks 7424041a17 Updating integration tests to deal with the new GL framework. Now multi-allelic indel calls are correct. 2012-03-16 12:50:39 -04:00
Eric Banks dce6b91f7d Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appropriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing. 2012-03-16 11:14:37 -04:00
Eric Banks 41068b6985 The commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code. 2012-03-15 16:08:58 -04:00
Ryan Poplin 0c6b34e9df Fixing a bug identified by the ActivityProfile unit tests 2012-03-15 14:24:30 -04:00
Ryan Poplin 252b830aa8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-15 11:56:04 -04:00
Ryan Poplin 0fa5a7af05 Adding contracts and unit tests for HaplotypeCaller GenotypingEngine 2012-03-15 11:55:48 -04:00
Mark DePristo 7c5cdb51c2 UnitTests for ActivityProfile and minor ART cleanup
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
2012-03-14 17:26:37 -04:00
Mark DePristo e73406b9b5 CountReadsInActiveRegions now emits a detailed GATK report
-- This report details which intervals are coming in and how many reads they contain
-- Added integration test to verify that the intervals aren't changing, before heading into the ART refactor
2012-03-14 17:26:37 -04:00
Eric Banks f76da1efd2 Updating md5s because MultiallelicSummary is now standard 2012-03-13 16:31:13 -04:00
Eric Banks 6e18ecfc9a Adding integration test to cover errors from my previous commit (GENOTYPE_GIVEN_ALLELE bugs reported by Sara Pulit and Chris Hartl) 2012-03-13 12:43:40 -04:00
David Roazen 5d6a686474 Restoring key-related unit/integration tests
The recent GATKReport commit accidentally clobbered a few tests -- this
restores them.
2012-03-13 00:58:24 -04:00
Roger Zurawicki 7887a06703 GATKReport v1.0
GATKReport format changes:

 - All non-data header lines are preceded with a single pound ( #:)
 - Every report now has a report header containing the version number and number of tables
 - Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description.
 - This new format will allow reports in the future to be gatherable.
 - Changed the header format to include an end-of-line string ":;"

Added features:

 - Simplified GATK Reports:

	The constructor for a simplified GATK Report. Simplified GATK reports are designed for reports that do not need the advanced functionality of a full GATK Report.

	A simple GATK Report consists of:
		- A single table
		- No primary key ( it is hidden )
	    Optional:
		- Only untyped columns. As long as the data is an Object, it will be accepted.
		- Default column values being empty strings.
	Limitations:
		- A simple GATK report cannot contain multiple tables.
		- It cannot contain typed columns, which prevents arithmetic gathering.

       - Added a constructor to generate simplified GATK reports.
       - Added a method to easily add data to simple GATK reports.

 - Upgraded the input parser to take advantage of the new file format (v1).
 - Added the GATKReportGatherer, more usability coming in next version of GATK Report. Currently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports. It is very conservative and will only gather if the table columns, as well as everything else, match. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
 - Made some GATKReport methods public, and added more setters and getters.
 - Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
 - The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.*)
 - Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map.
 - Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered.

Test changes:

 - Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1.
 - Updated the MD5 hashes in integration tests throughout the GATK.

Other changes:

 - Added the gatherer functions to CoverageByRG
 - Also added the scatterCount parameter in the Interval Coverage script
 - Dropped support for reading in legacy GATKReport formats ( v0.*)
 - Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints.
 - Rewrote the read file method for GATK report files.
 - Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-12 23:09:19 -04:00
Ryan Poplin 92bbb9bbdd Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-10 10:09:57 -05:00
Mark DePristo 1011f3862b CalibrateGenotypeLikelihoods now emits the position of the variant for debugging
-- Refactored some duplicated code (FYI, code duplication = root of all evil) into shared functions
-- Added long-missing integrationtests
-- CHRIS/RYAN -- it would be very good to add an integration test covering external VCF files as I believe we rely on this functionality and it's not tested at all
2012-03-09 16:00:07 -05:00
David Roazen 32dee7ed9b Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.

Added a unit test for this case as well.
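
Illustration of the premature-EOF check described above, as a minimal sketch. IndexReadSketch, readInt and TruncatedIndexException are hypothetical names (TruncatedIndexException stands in for the UserException mentioned in the message); only java.nio.ByteBuffer is a real API here.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: check how many bytes remain before each read so a
// truncated index produces a clear error instead of BufferUnderflowException.
final class IndexReadSketch {
    // Stand-in for the UserException thrown by the GATK (assumption).
    static final class TruncatedIndexException extends RuntimeException {
        TruncatedIndexException(String message) {
            super(message);
        }
    }

    static int readInt(ByteBuffer buffer, String indexName) {
        if (buffer.remaining() < 4) { // an int needs 4 bytes
            throw new TruncatedIndexException(
                    "Premature end of file in " + indexName + "; please re-index the BAM");
        }
        return buffer.getInt();
    }
}
```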
2012-03-08 15:30:44 -05:00
Mark DePristo 0376d73ece Improved, public version of ErrorRateByCycle
-- A cleaner table output (molten).  For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
2012-03-07 13:10:08 -05:00
Mark DePristo 569be953b9 Bugfix for VariantEval
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp.  These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicuously as incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
2012-03-06 16:56:59 -05:00
David Roazen 811f871f78 Do not fail tests that require the GATK private key if the user does not have permission to read it
Several of the unit tests for the new key authorization feature require
read access to the GATK master private key file. Since this file is only
readable by members of the group gsagit, this makes it hard for people
outside the group to run the test suite.

Now, we skip tests that require the master private key if the private
key exists (since not existing would be a true error) but is not readable
by the user running the test suite

Bamboo, of course, will always be able to run these tests.
2012-03-06 15:57:02 -05:00
Ryan Poplin 46b470cc69 Minor misc updates 2012-03-06 10:14:45 -05:00
David Roazen 0702ee1587 Public-key authorization scheme to restrict use of NO_ET
-Running the GATK with the -et NO_ET or -et STDOUT options now
 requires a key issued by us. Our reasons for doing this, and the
 procedure for our users to request keys, are documented here:
 http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home

-A GATK user key is an email address plus a cryptographic signature
 signed using our private key, all wrapped in a GZIP container.
 User keys are validated using the public key we now distribute with
 the GATK. Our private key is kept in a secure location.

-Keys are cryptographically secure in that valid keys definitely
 came from us and keys cannot be fabricated, however keys are not
 "copy-protected" in any way.

-Includes private, standalone utilities to create a new GATK user key
 (GenerateGATKUserKey) and to create a new master public/private key
 pair (GenerateKeyPair). Usage of these tools will be documented on
 the internal wiki shortly.

-Comprehensive unit/integration tests, including tests to ensure the
 continued integrity of the GATK master public/private key pair.

-Generation of new user keys and the new unit/integration tests both
 require access to the GATK private key, which can only be read by
 members of the group "gsagit".
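
Illustration of the sign/verify idea behind user keys, as a rough sketch only. The actual key file layout, algorithm and GZIP container used by the GATK are not shown in this message, so everything below is an assumption: UserKeySketch is a hypothetical class, and RSA with SHA256withRSA is chosen purely for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical sketch: a user key is modeled as (email, signature of email).
// Only the holder of the private key can produce a signature that verifies.
final class UserKeySketch {
    private static final String ALGORITHM = "SHA256withRSA"; // assumption for this sketch

    static KeyPair newKeyPair() throws GeneralSecurityException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        return generator.generateKeyPair();
    }

    // Done once by the key issuer, using the private key.
    static byte[] sign(String email, PrivateKey privateKey) throws GeneralSecurityException {
        Signature signer = Signature.getInstance(ALGORITHM);
        signer.initSign(privateKey);
        signer.update(email.getBytes(StandardCharsets.UTF_8));
        return signer.sign();
    }

    // Done on every run, using only the distributed public key.
    static boolean verify(String email, byte[] signature, PublicKey publicKey)
            throws GeneralSecurityException {
        Signature verifier = Signature.getInstance(ALGORITHM);
        verifier.initVerify(publicKey);
        verifier.update(email.getBytes(StandardCharsets.UTF_8));
        return verifier.verify(signature);
    }
}
```

Verification proves the key was issued by the private-key holder, but nothing stops a valid key file from being copied, which matches the "not copy-protected" caveat above.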
2012-03-06 00:09:43 -05:00
Ryan Poplin f6905630bb Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode. 2012-03-05 21:08:07 -05:00
Ryan Poplin 14a77b1e71 Getting rid of redundant methods in MathUtils. Adding unit tests for approximateLog10SumLog10 and normalizeFromLog10. Increasing the precision of the Jacobian approximation used by approximateLog10SumLog which changes the UG+HC integration tests ever so slightly. 2012-03-05 12:28:32 -05:00
Ryan Poplin f879daa7d0 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-05 08:29:08 -05:00
Ryan Poplin d6871967ae Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype there is no reason to ensure all haplotypes have the same length which simplifies the code considerably. 2012-03-05 08:28:42 -05:00
Mark DePristo 69611af7d3 Workaround for bug in Picard in ReadGroupProperties
-- NPE caused when you call getRunDate on a read group without a date.
2012-03-02 18:53:45 -05:00
Mark DePristo 2f334a57c2 ReadGroupProperties mk2
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
2012-03-01 18:43:53 -05:00
Mauricio Carneiro 29f74b658b Unit tests for the context covariate
this is simple, but it's the infra-structure to start messing around with the context.
2012-03-01 17:56:45 -05:00
Mark DePristo aff508e091 ReadGroupProperties walker and associated infrastructure
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses can override name more easily
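Illustration of the bounded median tool described above, as a minimal sketch. MedianSketch and its methods are hypothetical and not the Median class added in this commit; in particular, returning the upper middle element for even counts is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: collects at most maxValues elements, ignores the rest,
// and reports the median of what was collected.
final class MedianSketch<T extends Comparable<T>> {
    private final List<T> values = new ArrayList<>();
    private final int maxValues;

    MedianSketch(int maxValues) {
        if (maxValues <= 0) throw new IllegalArgumentException("maxValues must be positive");
        this.maxValues = maxValues;
    }

    // Returns false once the cap is reached and the value was not stored.
    boolean add(T value) {
        if (values.size() >= maxValues) return false;
        values.add(value);
        return true;
    }

    // Median of the stored values, or null if nothing was added.
    T getMedian() {
        if (values.isEmpty()) return null;
        List<T> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }
}
```

Capping the number of stored elements keeps memory bounded per read group, at the cost of computing the median over only the first maxValues observations.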
2012-03-01 15:01:11 -05:00
Mauricio Carneiro d379c3763a DNA Sequence to BitSet and vice-versa conversion tools
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
  * Allows variable context size representation guaranteeing uniqueness.
  * Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
  * Unit Tests added
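Illustration of a variable-length DNA context packed into a long, as a sketch only. DnaContextKeySketch and the exact encoding are assumptions, not the code in this commit: the sketch numbers all shorter sequences first and then adds the base-4 value, which keeps keys unique across lengths and fits a signed long up to 31 bases.

```java
// Hypothetical sketch: key = (number of sequences shorter than dna) + base-4 value.
// For 31 bases the largest key is (4^31 - 1)/3 + 4^31 - 1, which is below 2^63.
final class DnaContextKeySketch {
    static final int MAX_LENGTH = 31;
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    static long encode(String dna) {
        if (dna.length() > MAX_LENGTH) throw new IllegalArgumentException("context too long");
        long offset = 0; // count of all sequences with fewer bases
        long power = 1;  // 4^i
        long value = 0;  // base-4 value of dna
        for (int i = 0; i < dna.length(); i++) {
            offset += power;
            power <<= 2;
            value = (value << 2) | baseToInt(dna.charAt(i));
        }
        return offset + value;
    }

    static String decode(long key) {
        if (key < 0) throw new IllegalArgumentException("negative key");
        int length = 0;
        long power = 1;
        // Peel off the per-length offsets to recover the sequence length.
        while (length < MAX_LENGTH && key >= power) {
            key -= power;
            power <<= 2;
            length++;
        }
        char[] chars = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            chars[i] = BASES[(int) (key & 3)];
            key >>= 2;
        }
        return new String(chars);
    }

    private static int baseToInt(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a DNA base: " + base);
        }
    }
}
```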
2012-02-29 19:25:20 -05:00
Mark DePristo ca0931c01f Adding test for reading samtools VCF file 2012-02-27 17:05:50 -05:00
Eric Banks bd944ab04f Another test where we no longer print out 'NaN' for the AF. 2012-02-27 15:19:08 -05:00
Eric Banks 52871187d7 Adding integration test for file with no GTs. Also updated md5 for one other test (since we no longer print out 'NaN' for the AF). 2012-02-27 15:09:56 -05:00
Eric Banks 1ea34058c2 Updating integration tests now that standard annotations support multiple alleles 2012-02-27 11:32:26 -05:00
Guillermo del Angel 16122bea8d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-25 13:57:54 -05:00
Guillermo del Angel dea35943d1 a) Bug fix in calling new functions that give indel bases and length from regular pileup in LocusIteratorByState, b) Added unit test to cover these. 2012-02-25 13:57:28 -05:00
Mark DePristo c8a06e53c1 DoC now properly handles reference N bases + misc. additional cleanups
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensures the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
2012-02-25 11:32:50 -05:00
Mark DePristo 50de1a3eab Fixing bad VCFIntegration tests
-- Left disabled a test that should have been enabled
-- Didn't add the md5 to the test I actually added
-- Now VCFIntegrationTests should be working!
2012-02-25 11:26:36 -05:00
Mark DePristo e0c189909f Added support for breakpoint alleles
-- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Added integrationtest to ensure that we can parse and write out breakpoint example
2012-02-23 12:14:48 -05:00
Mauricio Carneiro 75783af6fc int <-> BitSet conversion utils for MathUtils
* added unit tests.
2012-02-21 14:10:36 -05:00
David Roazen 85d31f80a2 Merged bug fix from Stable into Unstable 2012-02-13 16:37:11 -05:00
David Roazen 03e5184741 Fix serious engine bug that could cause reads to be dropped under certain circumstances
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.

Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.

Thanks to Khalid, who did at least as much work on this bug as I did.
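
Illustration of a union that is correct for any overlap pattern, as a sketch only. SpanUnionSketch is hypothetical and models each span as a plain {start, end} pair of longs; the real GATKBAMFileSpan works on BAM chunks, and its code is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: union of two lists of {start, end} spans.
final class SpanUnionSketch {
    static List<long[]> union(List<long[]> a, List<long[]> b) {
        List<long[]> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingLong((long[] span) -> span[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] span : all) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] <= last[1]) {
                // Overlapping or touching: extend, but never shrink. Taking the
                // max is what prevents a contained span from truncating the union.
                last[1] = Math.max(last[1], span[1]);
            } else {
                merged.add(new long[]{span[0], span[1]});
            }
        }
        return merged;
    }
}
```

The case worth testing is a span fully contained in another, since assigning the later span's end without the max would cut the union short.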
2012-02-13 16:25:21 -05:00
Eric Banks ad90af94ed Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-13 15:10:10 -05:00
Eric Banks 0920a1921e Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics. 2012-02-13 15:09:53 -05:00
Eric Banks 14981bed10 Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records. 2012-02-13 14:32:03 -05:00
Ryan Poplin 41ffd08d53 On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live. 2012-02-13 12:35:09 -05:00
Eric Banks f52f1f659f Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1. 2012-02-10 14:15:59 -05:00
Eric Banks 5e18020a5f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-10 11:08:33 -05:00
Eric Banks f53cd3de1b Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent. 2012-02-10 11:07:32 -05:00
Mauricio Carneiro 5af373a3a1 BQSR with indels integrated!
* added support to base before deletion in the pileup
* refactored covariates to operate on mismatches, insertions and deletions at the same time
* all code is in private so original BQSR is still working as usual in public
* outputs a molten CSV with mismatches, insertions and deletions, time to play!
* barely tested, passes my very simple tests... haven't tested edge cases.
2012-02-09 18:46:45 -05:00
Eric Banks 7a937dd1eb Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly. 2012-02-09 16:14:22 -05:00
Mauricio Carneiro d561914d4f Revert "First implementation of GATKReportGatherer"
Premature push on my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.

This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
2012-02-08 23:28:55 -05:00
Eric Banks 2f800b078c Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again. 2012-02-08 15:27:16 -05:00
Mauricio Carneiro 337819e791 disabling the test while we fix it 2012-02-07 19:22:32 -05:00
Roger Zurawicki c0c676590b First implementation of GATKReportGatherer
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG

Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public

The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer from working

Working GATKReportGatherer
Has only the functionality to addLines
The input file parser assumes that the first column is the primary key

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-02-07 18:14:47 -05:00
Mauricio Carneiro e1d69e4060 make the size of a GenomeLoc int instead of long
it will never be bigger than an int and it's actually useful to be an int so we can use it as parameters to array/list/hash size creation.
2012-02-03 17:12:42 -05:00
Mauricio Carneiro d5d4fa8a88 Fixed discordance bug reported by Brad Chapman
discordance now reports discordance between genotypes as well (just like concordance)
2012-01-30 09:50:45 -05:00
Mauricio Carneiro 2a565ebf90 embarrassing fix-up, thanks Khalid. 2012-01-26 19:58:42 -05:00
Mauricio Carneiro 246e085ec9 Unit tests for GATKSAMRecord class
* new unit tests for the alignment shift properties of reduce reads
* moved unit tests from ReadUtils that were actually testing GATKSAMRecord, not any of the ReadUtils to it.
* cleaned up ReadUtilsUnitTest
2012-01-26 17:06:36 -05:00
Ryan Poplin cdff23269d HaplotypeCaller now uses insertions and softclipped bases as possible triggers. LocusIteratorByState tags pileup elements with the required info to make this calculation efficient. The days of the extended event pileup are coming to a close. 2012-01-26 15:56:33 -05:00
Eric Banks ddaf51a50f Updated one integration test for indels 2012-01-25 19:18:51 -05:00
Eric Banks e349b4b14b Allow appending with the dbSNP ID even if a (different) ID is already present for the variant rod. 2012-01-25 11:35:54 -05:00
Mauricio Carneiro ffd61f4c1c Refactor the Pileup Element with regards to indels
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.

   * Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
   * Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
   * Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
   * Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
   * Every deletion now has an offset (start of the event)
   * Fixed bug when LocusIteratorByState found a read starting with insertion that happened to be a reduced read.
   * Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
   * Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
   * Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
   * The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!

 phew!
2012-01-24 16:07:21 -05:00
Khalid Shakir c18beadbdb Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc.
Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.
2012-01-23 16:17:04 -05:00
Mark DePristo 02450e4b12 Merged bug fix from Stable into Unstable 2012-01-23 12:08:39 -05:00
Mark DePristo 80a4ce0edf Bugfix for incorrect error messages for missing BAMs and VCFs
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
2012-01-23 09:52:07 -05:00
Christopher Hartl 4a08e8ca6e Minor tweaks to T2D-related qscripts. Replacing old md5s from the BeagleIntegrationTest. All differences boiled down to changes in the accounting of genotypes (./. --> 0/0 is no longer a "changed" genotype, and original genotypes that were ./. are represented as OG=. rather than OG=./.).
This is somewhat of an arbitrary decision, and is negotiable. I could see treating

GT:PL   ./.:.

differently from

GT:PL   .:0,3,6

but am not sure it is worth doing so.
2012-01-23 08:25:34 -05:00
Eric Banks ab8f499bc3 Annotate with FS even for filtered sites 2012-01-18 22:04:51 -05:00
Ryan Poplin 0268da7560 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-18 09:53:00 -05:00
Ryan Poplin 60024e0d7b updating TDT integration test 2012-01-18 09:52:50 -05:00
Mark DePristo 0c7865fdb5 UnitTest for reverseAlleleClipping
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mauricio Carneiro cec7107762 Better location for the downsampling of reads in PrintReads
* using the filter() instead of map() makes for a cleaner walker.
* renaming the unit tests to make more sense with the other unit and integration tests
2012-01-14 14:06:09 -05:00
Mauricio Carneiro 28aa353501 Added "unbiased" downsampling parameter to PrintReads
* also cleaned up and updated part of the unit tests for print reads. Needs a more thorough cleaning.
2012-01-12 16:33:55 -05:00
Matt Hanna 2c3176eb80 Merged bug fix from Stable into Unstable 2012-01-12 13:31:10 -05:00
Matt Hanna cd43f016ce Fixed NPE in getNextOverlappingBAMScheduleEntry() when mixed mapped/unmapped interval lists are used. Added integrationtest to verify behavior. 2012-01-12 13:29:11 -05:00
Mauricio Carneiro 77a03c9709 Patching special case in the adaptor clipping
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
* updated md5's accordingly
2012-01-11 17:47:44 -05:00
Eric Banks c5320ef1af Resolving changes in integration test during merge 2012-01-10 12:14:16 -05:00
Eric Banks 0f36f6947e Resolving merge conflicts 2012-01-10 11:44:16 -05:00
Eric Banks f2cecce10f Much better implementation of the approximate summing of an array of log10 values (including more efficient rounding). Now effectively takes 0% of UG runtime on T2D GENES (as opposed to 11% previously). 2012-01-10 11:34:23 -05:00
Mark DePristo dd80ffbbbe Merged bug fix from Stable into Unstable 2012-01-05 21:51:48 -05:00
Mark DePristo c96fee477c Bug fix for VariantSummary
-- Call sets with indels > 50 bp in length are tagged as CNVs in the tag (following the 1000 Genomes convention) and were unconditionally checking whether the CNV is already known, by looking at the known cnvs file, which is optional.  Fixed.  Has the annoying side effect that indels > 50bp in size are not counted as indels, and so are subtracted from both the novel and known counts for indels.  C'est la vie
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878 which has many large indels.  Using this more recent and representative file is probably a good idea for future tests in VE and other tools.  File is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data
2012-01-05 21:51:06 -05:00
Guillermo del Angel 58d4539304 Enabled banded indel computation by default. Reversed logic in input UG argument so that we can still disable it if required. Minor changes to integration tests due to minor differences in GL's and in annotations 2012-01-04 15:28:26 -05:00
David Roazen 621ee2b613 Merged bug fix from Stable into Unstable 2012-01-03 16:56:49 -05:00
David Roazen ea6e718cb8 SnpEff 2.0.5 support. Re-enabled SnpEff in the HybridSelectionPipeline.
For now, we recommend only running with the GRCh37.64 database.
2012-01-03 15:18:36 -05:00
David Roazen 4984ca5e31 Merged bug fix from Stable into Unstable 2012-01-03 11:03:30 -05:00
David Roazen f3f01da1af Enforce serial dependencies in RecalibrationWalkersIntegrationTest
Some tests in this class were intermittently not being executed due
to being randomly scheduled before tests whose results they depend on.
Now the serial dependencies are enforced to avoid problematic orderings.
2012-01-03 10:42:41 -05:00
Eric Banks ab8d47d9a5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-01-03 09:38:49 -05:00
Mauricio Carneiro cd68cc239b Added knuth-shuffle (KS) and randomSubset using KS to MathUtils
* Knuth-shuffle is a simple, yet effective array permutator (hope this is good english).
* added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation.
* added unit tests to both functions
2012-01-03 09:29:46 -05:00
Mauricio Carneiro 94791a2a75 Add support for reads starting with insertion
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
* Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
* Updated all ReadClipper unit tests
* ReduceReads does not hard clip leading insertions by default anymore
* SlidingWindow adjusts start location if read starts with insertion
* SlidingWindow creates an empty element with insertions to the right
* Fixed all potential divide by zero with totalCount() (from BaseCounts)
* Updated all Integration tests
* Added new integration test for multiple interval reducing
2012-01-03 09:29:45 -05:00
Mauricio Carneiro 1b6d52817e fixing adaptor clipping effect on recalibration integration test 2012-01-01 22:20:06 -05:00
Eric Banks 393993e0c7 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-31 20:42:46 -05:00
Mauricio Carneiro 55cfa76cf3 Updated integration tests for the new adaptor clipping fix. 2011-12-30 18:47:14 -05:00
Mauricio Carneiro c7d0a9ebee Forgot to test for inter-chromosomal mates in the adaptor clipping
* Fixing bug caught by Eric (and Kristian)
2011-12-30 00:19:53 -05:00
Eric Banks 1a45ea5a05 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-29 11:37:15 -05:00
Eric Banks d20a25d681 A much better way of choosing the alternate allele(s) to genotype in the SNP model of UG: instead of looking at the sum of base qualities (which can and did lead to us over-genotyping esp. when allowing multiple alternate alleles), we look at the likelihoods themselves (free since we are already calculating likelihoods for all 10 genotypes). Now, even if the base quals exceed some arbitrary threshold, we only bother genotyping an alternate allele when there's a sample for which it is more likely than ref/ref (I can generate weird edge cases where this falls apart, but none that model truly variable sites that we actually want to call). This leads to a huge efficiency improvement esp. for exomes (and esp. for many samples) where we almost always were trying to genotype all 3 alternate alleles. Integration tests change only because ref calls have slight QUAL differences (because the best alt allele is still chosen arbitrarily, but differently). 2011-12-27 16:50:38 -05:00
Mauricio Carneiro 17bfe48d5e Made all class methods private in the ReadClipper
* ReadClipperUnitTest now uses static methods
* Haplotype caller now uses static methods
* Exon Junction Genotyper now uses static methods
2011-12-27 02:11:32 -05:00
David Roazen 506c0e9c97 Disabling SnpEff support in the GATK and SnpEff annotation in the HybridSelectionPipeline
SnpEff support will remain disabled until SnpEff 2.0.4 has been officially released
and we've verified the quality of its annotations.
2011-12-23 19:12:57 -05:00
David Roazen 510c71158c Merged bug fix from Stable into Unstable 2011-12-22 10:49:52 -05:00
David Roazen 32cdef9682 Rename *PerformanceTest test classes to *LargeScaleTest
This is in preparation for the installation of the new performance test suite in Bamboo.

Note that "ant performancetest" is now "ant largescaletest"
2011-12-22 10:38:49 -05:00
Mauricio Carneiro 731a463415 Updated IntegrationTests with new adaptor clipper
phew!
2011-12-20 17:48:52 -05:00
Mauricio Carneiro cadff40247 getRefCoordSoftUnclippedStart and End refactor
These functions are methods of the read, and supplement getAlignmentStart() and getUnclippedStart() by calculating the unclipped start counting only soft clips.

* Removed from ReadUtils
* Added to GATKSAMRecord
* Changed name to getSoftStart() and getSoftEnd()
* Updated third party code accordingly.
2011-12-20 17:48:51 -05:00
Mauricio Carneiro f73ad1c2e2 Bugfix/Rewrite: Algorithm to determine adaptor boundaries
The algorithm wasn't accounting for the case where the read is on the reverse strand and the insert size is negative.

    * Fixed and rewrote for more clarity (with Ryan, Mark and Eric).
    * Restructured the code to handle GATKSAMRecords only
    * Cleaned up the other structures and functions around it to minimize clutter and potential for error.
    * Added unit tests for all 4 cases of adaptor boundaries.
2011-12-20 17:48:39 -05:00
Mauricio Carneiro 78d9bf7196 Added REVERT_SOFTCLIPPED_BASES capability to ReadClipper
* New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches.
* Added functionality to clipping op to revert all soft clip bases in a read into matches
* Added revertSoftClipBases function to the ReadClipper for public use
* Wrote systematic unit tests
2011-12-20 00:04:30 -05:00
Laurent Francioli 16cc2b864e - Corrected bug causing cases where both parents are HET to be accounted twice in the TDT calculation - Adapted TDT Integration test to corrected version of TDT
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2011-12-19 10:30:59 -05:00
Eric Banks 3069a689fe Bug fix: if there are multiple records at a given position, it turns out that SelectVariants would drop all variants that follow after one that fails filters (instead of dropping just the failing one). Added an integration test to cover this case. 2011-12-19 10:04:33 -05:00
Mauricio Carneiro 5b678e3b94 Remove ClippingOp UnitTests
* all testing functionality is in the ReadClipperUnitTest, no need to double test.
* class and package naming cleanup
2011-12-19 07:49:26 -05:00
Eric Banks 76bd13a1ed Forgot to update the unit test 2011-12-18 01:13:49 -05:00
Eric Banks 07f9d14d9f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-18 00:43:15 -05:00
Eric Banks c5ffe0ab04 No reason to sum the normalized posteriors array to get Pr(AF>0) given that we can just compute 1.0 - array[0]. Integration tests change only because of trivial precision artifacts for reference calls using EMIT_ALL_SITES. 2011-12-18 00:31:47 -05:00
Eric Banks 6dc52d42bf Implemented the proper QUAL calculation for multi-allelic calls. Integration tests pass except for the ones making multi-allelic calls (duh) and one of the SLOD tests (which used to print 0 when one of the LODs was NaN but now we just don't print the SB annotation for that record). 2011-12-18 00:01:42 -05:00
Khalid Shakir 6059ca76e8 Removing cruft that snuck in last commit. 2011-12-16 23:00:16 -05:00
Khalid Shakir 7486696c07 When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter.
Creating a single temporary directory per ant test run instead of putting temp files across all runs in the same directory.
Updated various tests for above items and other small fixes.
2011-12-16 18:09:25 -05:00
Mauricio Carneiro e5df9e0684 cleaner test output
cleaned up the debug "pass" messages in the unit tests
2011-12-16 18:04:00 -05:00
Mauricio Carneiro fcc21180e8 Added hardClipLeadingInsertions UnitTest for the ReadClipper
fixed issue where a read starting with an insertion followed by a deletion would break, clipper can now safely clip the insertion and the deletion if that's the case.

note: test is turned off until contract changes to allow hanging insertions (left/right).
2011-12-16 18:02:47 -05:00
Mauricio Carneiro 075be52adc Added hardClipByReferenceCoordinates (left and right tails) UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro 5bba44d693 Added hardClipByReferenceCoordinates UnitTest for the ReadClipper
* fixed edge case when requested to hard clip beginning of a read that had hanging soft clipped bases on the left tail.
* fixed edge case when requested to hard clip end of a read that had hanging soft clipped bases on the right tail.
* fixed AlignmentStart of a clipped read that results in only hard clips and soft clips

note: added tests to all these beautiful cases...
2011-12-16 18:01:33 -05:00
Mauricio Carneiro 5838ba529d Added hardClipByReadCoordinates UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro c26295919e Added hardClipBothEndsByReferenceCoordinates UnitTest for the ReadClipper 2011-12-16 18:01:33 -05:00
Mauricio Carneiro e61e5c7589 Refactor of ReadClipper unit tests
* expanded the systematic cigar string space test framework Roger wrote to all tests
* moved utility functions into Utils and ReadUtils
* cleaned up unused classes
2011-12-15 19:05:43 -05:00
Mauricio Carneiro 4748ae0a14 Bugfix: Softclips before Hardclips weren't being accounted for
caught a bug in the hard clipper where it does not account for hard clipping softclipped bases in the resulting cigar string, if there is already a hard clipped base immediately after it.
* updated unit test for hardClipSoftClippedBases with corresponding test-case.
2011-12-15 12:17:25 -05:00
Mauricio Carneiro 50dee86d7f Added unit test to catch Ryan's exception
Unit test to catch the special case that broke the clipping op, fixed in the previous commit.
2011-12-14 16:58:14 -05:00
Mauricio Carneiro c85100ce9c Fix ClippingOp bug when performing multiple hardclip ops
bug: When performing multiple hard clip operations in a read that has indels, if the N+1 hardclip requests to clip inside an indel that has been removed by one of the (1..N) previous hardclips, the hard clipper would go out of bounds.

fix: dynamically adjust the boundaries according to the new hardclipped read length. (this maintains the current contract that hardclipping will never return a read starting or ending in indels).
2011-12-14 16:57:47 -05:00
Eric Banks de5928ac5a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-14 16:24:56 -05:00
Eric Banks 4fddac9f22 Updating busted integration tests 2011-12-14 16:24:43 -05:00
Mark DePristo 71b4bb12b7 Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Eric Banks 1e90d602a4 Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles. 2011-12-14 13:38:20 -05:00
Mauricio Carneiro 5cc1e72fdb Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Laurent Francioli 7cf27bb66e Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval. 2011-12-12 12:22:43 +01:00
Laurent Francioli 025bdfe2cc Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-12 12:19:44 +01:00
Eric Banks 7b6338c742 Merge branch 'master' into trialleles 2011-12-11 00:28:46 -05:00
Eric Banks 7c4b9338ad The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now. 2011-12-11 00:23:33 -05:00
Eric Banks 044f211a30 Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly. 2011-12-10 23:57:14 -05:00
Mauricio Carneiro 8475328b2c Turning off test that breaks read clipper
until we define what is the desired behavior for clipping this particular case.
2011-12-09 11:53:12 -05:00
Roger Zurawicki 4cbd1f0dec Reorganized the testing code and created ClipReadsTestUtils
Tests are more rigorous and include many more test cases.
We can test custom cigars and the generated cigars.
     *Still needs debugging because code is not working.
Created test classes to be used across several tests.

Some cases are still commented out.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:52:34 -05:00
Roger Zurawicki 0e9c2cefa2 testHardClipSoftClippedBases works with Matches and Deletions
Insertions are a problem so cigar cases with "I" are commented out.
The test works with multiple deletions and matches.

This is still not a complete test. A lot of cigar test cases are commented out.

Added insertions to ReadClipperUnitTest

ReadClipper now tests for all indels.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:43:37 -05:00
Eric Banks 442ceb6ad9 The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors. 2011-12-09 10:16:44 -05:00
Laurent Francioli a79144f7db Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-09 15:57:24 +01:00
Laurent Francioli 72fbfba97d Added UnitTests for getFamilies() and getChildrenWithParents() 2011-12-09 15:57:07 +01:00
Eric Banks aa4a8c5303 No dynamic programming solution for assigning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes). 2011-12-09 02:25:06 -05:00
Eric Banks 2fe50c64da Updating md5s 2011-12-09 00:47:01 -05:00
Eric Banks 4aebe99445 Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation. 2011-12-08 15:31:02 -05:00
Mark DePristo 4055877708 Prints 0.0 TiTv not NaN when there are no variants
-- Updated md5
2011-12-07 12:07:54 -05:00
Mark DePristo 5d2212bc8e Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-07 09:03:17 -05:00
Eric Banks 79d18dc078 Fixing indexing bug on the ACsets. Added unit tests for the Exact model code. 2011-12-06 16:17:18 -05:00
Matt Hanna f5b977fc88 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-12-06 10:11:35 -05:00
Matt Hanna 4001c22a11 Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup. 2011-12-06 10:10:38 -05:00
Khalid Shakir 677bea0abd Right aligning GATKReport numeric columns and updated MD5s in tests.
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
2011-12-05 23:22:15 -05:00
Eric Banks 29662be3d7 Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them). 2011-12-03 23:12:04 -05:00
Mark DePristo 3060a4a15e Support for list of known CNVs in VariantEval
-- VariantSummary now includes novelty of CNVs by reciprocal overlap detection using the standard variant eval -knownCNVs argument
-- Genericizes loading for intervals into interval tree by chromosome
-- GenomeLoc methods for reciprocal overlap detection, with unit tests
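
Ex (a sketch of what reciprocal overlap usually means, not the GenomeLoc methods themselves; the class, the method names and the minFraction parameter are assumptions, and both intervals are assumed to lie on the same contig with 1-based inclusive coordinates):

```java
class ReciprocalOverlapSketch {
    /** Closed interval [start, stop] on a single contig. */
    static final class Interval {
        final int start;
        final int stop;

        Interval(int start, int stop) {
            if (stop < start) throw new IllegalArgumentException("stop < start");
            this.start = start;
            this.stop = stop;
        }

        int size() {
            return stop - start + 1;
        }
    }

    /** Number of positions shared by a and b, or 0 if they are disjoint. */
    static int overlapSize(Interval a, Interval b) {
        int sharedStart = Math.max(a.start, b.start);
        int sharedStop = Math.min(a.stop, b.stop);
        return Math.max(0, sharedStop - sharedStart + 1);
    }

    /**
     * True when the shared region covers at least minFraction of a
     * AND at least minFraction of b.
     */
    static boolean reciprocalOverlap(Interval a, Interval b, double minFraction) {
        int shared = overlapSize(a, b);
        return shared >= minFraction * a.size() && shared >= minFraction * b.size();
    }
}
```

Requiring the fraction on both sides keeps a small call that sits inside a very large known CNV from counting as a match.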
2011-11-30 17:05:16 -05:00
Laurent Francioli 9574be0394 Updated MendelianViolationEvaluator integration test 2011-11-30 14:44:15 +01:00
Laurent Francioli a4606f9cfe Merge branch 'MendelianViolation'
Conflicts:
	public/java/src/org/broadinstitute/sting/utils/MendelianViolation.java
2011-11-30 11:13:15 +01:00
Laurent Francioli 7d58db626e Added MendelianViolationEvaluator integration test 2011-11-30 10:09:20 +01:00
Ryan Poplin 110298322c Adding Transmission Disequilibrium Test annotation to VariantAnnotator and integration test to test it. 2011-11-29 09:29:18 -05:00
Mark DePristo e60272975a Fix for changed MD5 in streaming VCF test 2011-11-23 19:01:33 -05:00
Mark DePristo 12f09d88f9 Removing references to SimpleMetricsByAC 2011-11-23 16:08:18 -05:00
Mark DePristo 4107636144 VariantEval updates
-- Performance optimizations
-- Tables now are cleanly formatted (floats are %.2f printed)
-- VariantSummary is a standard report now
-- Removed CompEvalGenotypes (it didn't do anything)
-- Deleted unused classes in GenotypeConcordance
-- Updates integration tests as appropriate
2011-11-23 13:02:07 -05:00
Mark DePristo e484625594 GenotypesContext now updates cached data for add, set, replace operations when possible
-- Involved separately managing the sample -> offset and sample sorted list operations.  This should improve performance throughout the system
2011-11-22 08:40:48 -05:00
Mark DePristo 29ca24694a UG now encoding NO_CALLs as ./. not ./.:.:4:0,0,0
A few updated UGs integration tests
2011-11-22 08:22:32 -05:00
Mark DePristo 2b51c01df4 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-21 19:16:06 -05:00
Mark DePristo 5443d3634a Again, fixing the add call when we really mean replace
-- Updating MD5s for UG to reflect that what was previously called ./.:.:10:0,0,0 is now just ./.  Eric will fix long-standing bug in QD observed from this change
-- VFW MD5s restored to their old correct values.  There was a bug in my implementation that caused the genotypes to not be parsed from the lazy output even though the header was incorrect.
2011-11-21 19:15:56 -05:00
Mauricio Carneiro 5ad3dfcd62 BugFix: byte overflow in SyntheticRead compressed base counts
* fixed and added unit test
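
Ex (illustration of the failure mode only; the names are hypothetical and clamping is just one common remedy, not necessarily what this fix did; counts are assumed non-negative):

```java
class ByteCountSketch {
    /** Narrowing cast: values above Byte.MAX_VALUE (127) wrap around. */
    static byte naiveCount(int count) {
        return (byte) count;
    }

    /** Clamps to Byte.MAX_VALUE before narrowing, so large counts saturate. */
    static byte saturatedCount(int count) {
        return (byte) Math.min(count, Byte.MAX_VALUE);
    }
}
```

With these definitions naiveCount(128) yields -128, while saturatedCount(128) yields 127.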
2011-11-21 17:11:50 -05:00
Mark DePristo 2c501364b8 GenotypesContext no longer have immutability in constructor
-- additional bug fixes throughout VariantContext and GenotypesContext objects
2011-11-21 14:34:31 -05:00
David Roazen 1296dd41be Removing the legacy -L "interval1;interval2" syntax
This syntax predates the ability to have multiple -L arguments, is
inconsistent with the syntax of all other GATK arguments, requires
quoting to avoid interpretation by the shell, and was causing
problems in Queue.

A UserException is now thrown if someone tries to use this syntax.
2011-11-21 13:18:53 -05:00
Mark DePristo 2e9ecf639e Generalized interface to LazyGenotypesContext
-- Now you provide a LazyParsing object
-- LazyGenotypesContext now knows nothing about the VCF parser itself.  The parser holds all of the necessary data to parse the VCF genotypes when necessary, and the LGC only has a pointer to this object
-- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version
-- Deleted VCFParser interface, as it was no longer necessary
2011-11-21 09:30:40 -05:00
Mark DePristo f0ac588d32 Extensive unit test for GenotypeContextUnitTest
-- Currently only tests base class.  Adding subclass testing in a bit
2011-11-20 18:28:01 -05:00
Mark DePristo 9cb3fe3a59 Vastly better way of doing on-demand genotyping loading
-- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading.
-- This new class replaced all of the old, complex functionality
-- Better still, there were many cases where the genotypes were being loaded unnecessarily, resulting in inefficiency.  This was detected because some of the integration tests changed as the genotypes were no longer being parsed unnecessarily
-- Misc. bug fixes throughout the system
-- Bug fixes for PhaseByTransmission with new GenotypesContext
2011-11-20 08:23:09 -05:00
Mark DePristo 7d09c0064b Bug fixes and code cleanup throughout
-- chromosomeCounts now takes builder as well, cleaning up a lot of code throughout the codebase.
2011-11-19 18:40:15 -05:00
Mark DePristo 707bd30b3f Should have been @BeforeMethod 2011-11-19 16:10:09 -05:00
Mark DePristo 8f7eebbaaf Bugfix for pError not being checked correctly in CommonInfo
-- UnitTests to ensure correct behavior
-- UnitTests to ensure correct behavior for pass filters vs. failed filters vs. unfiltered
2011-11-19 15:58:59 -05:00
Mark DePristo b7b57ef39a Updating MD5 to reflect canonical ordering of calculation
-- We should no longer have md5s changing because of hashmaps changing their sort order on us
-- Added GenotypeLikelihoodsUnitTests
-- Refactored ExactAFCaclculation to put the PL -> QUAL calculation in the GenotypeLikelihoods class to avoid the code copy.
2011-11-19 15:57:33 -05:00
Mark DePristo 73119c8e3c Merge with master
-- A few bug fixes
2011-11-19 09:56:06 -05:00
Mark DePristo f685fff79b Killing the final versions of old new VariantContext interface 2011-11-18 21:32:43 -05:00
Mark DePristo 6cf315e17b Change interface to getNegLog10PError to getLog10PError 2011-11-18 21:07:30 -05:00
Matt Hanna 8bb4d4dca3 First pass of the asynchronous block loader.
Block loads are only triggered on queue empty at this point.  Disabled by
default (enable with nt:io=?).
2011-11-18 15:02:59 -05:00
Mark DePristo f54afc19b4 VariantContextBuilder
-- New approach to making VariantContexts modeled on StringBuilder
-- No more modify routines -- use VariantContextBuilder
-- Renamed isPolymorphic to isPolymorphicInSamples.   Same for mono
-- getChromosomeCount -> getCalledChrCount
-- Walkers changed to use new VariantContext.  Some deprecated new VariantContext calls remain
-- VCFCodec now uses optimized cached information to create GenotypesContext.
2011-11-18 12:39:10 -05:00
Mark DePristo 7490dbb6eb First version of VariantContextBuilder 2011-11-18 11:06:15 -05:00
Mark DePristo fa454c88bb UnitTests for VariantContext for chrCount, getSampleNames, Order function
-- Major change to how chromosomeCounts is computed.  Now NO_CALL alleles are always excluded.  So ChromosomeCounts(A/.) is 1; the previous result would have been 2.
-- Naming changes for getSamplesNameInOrder()
2011-11-17 20:37:22 -05:00
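The NO_CALL rule from fa454c88bb can be illustrated with a small sketch. The `countCalledChromosomes` helper and the representation of genotypes as lists of allele strings (with "." for a no-call) are assumptions made for illustration, not the VariantContext API.

```java
import java.util.List;

final class CalledChromosomeCount {
    // Hypothetical marker for a no-call allele, as written in VCF genotypes.
    static final String NO_CALL = ".";

    /** Counts alleles across all genotypes, skipping no-call alleles. */
    static int countCalledChromosomes(List<List<String>> genotypes) {
        int count = 0;
        for (List<String> genotype : genotypes) {
            for (String allele : genotype) {
                if (!NO_CALL.equals(allele)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A/. contributes one called chromosome instead of two.
        System.out.println(countCalledChromosomes(List.of(List.of("A", "."))));
    }
}
```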
Mark DePristo 02f22cc9f8 No more VC integration tests. All tests are now unit tests 2011-11-17 15:33:09 -05:00
Khalid Shakir c50274e02e During flanking interval creation, overlapping flanks are now merged so that on scatter the list doesn't accidentally genotype the same site twice.
Moved flanking interval utilities to IntervalUtils with UnitTests.
2011-11-17 13:56:42 -05:00
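Why overlapping flanks matter in c50274e02e: the right flank of one interval and the left flank of a nearby interval can cover the same sites. A minimal sketch of merging such flanks, assuming closed `[start, stop]` intervals on a single contig stored as `int[]` pairs (the `mergeOverlapping` name and the representation are hypothetical, not the IntervalUtils code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FlankMerge {
    /** Merges overlapping closed intervals [start, stop] on one contig. */
    static List<int[]> mergeOverlapping(List<int[]> flanks) {
        List<int[]> sorted = new ArrayList<>(flanks);
        sorted.sort(Comparator.comparingInt((int[] f) -> f[0]));

        List<int[]> merged = new ArrayList<>();
        for (int[] flank : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && flank[0] <= last[1]) {
                // Overlaps the previous flank: extend it rather than add a second copy.
                last[1] = Math.max(last[1], flank[1]);
            } else {
                merged.add(new int[] {flank[0], flank[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Right flank 111-120 of one interval and left flank 115-124 of the next.
        List<int[]> merged = mergeOverlapping(List.of(new int[] {111, 120}, new int[] {115, 124}));
        System.out.println(merged.get(0)[0] + "-" + merged.get(0)[1]);
    }
}
```

With the flanks merged, sites 115-120 appear in the scattered list once rather than twice.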
Eric Banks bad19779b9 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-17 13:29:43 -05:00
Eric Banks 16a021992b Updated header description for the INFO and FORMAT DP fields to be more accurate. 2011-11-17 13:17:53 -05:00
Mark DePristo 7e66677769 Expanded UnitTests for VariantContext
Tests for
-- getGenotype and getGenotypes
-- subContextBySample
-- modify routines
2011-11-16 20:45:15 -05:00
Mauricio Carneiro 72f00e2883 Merging Roger's Unit tests for Reduce Reads from RR repository 2011-11-16 17:26:49 -05:00
Mark DePristo aa0610ea92 GenotypeCollection renamed to GenotypesContext 2011-11-16 16:24:05 -05:00
Mark DePristo 974daaca4d V13 version in archive. Can be pulled out wholesale for performance testing 2011-11-16 16:08:46 -05:00
Mark DePristo 101ffc4dfd Expanded, contrastive VariantContextBenchmark
-- Compares performance across a bunch of common operations with GATK 1.3 version of VariantContext and GATK 1.4
-- 1.3 VC and associated utilities copied wholesale into test directory under v13
2011-11-16 13:35:16 -05:00
Mark DePristo e56d52006a Continuing bugfixes to get new VC working 2011-11-16 10:39:17 -05:00
Eric Banks c2ebe58712 Merge remote-tracking branch 'Laurent/master' 2011-11-16 09:34:47 -05:00
David Roazen 0d163e3f52 SnpEff 2.0.4 support
-Modified the SnpEff parser to work with the SnpEff 2.0.4 VCF output format
-Assigning functional classes and effect impacts now handled directly
 by SnpEff rather than the GATK
-Removed support for SnpEff 2.0.2, as we no longer trust the output of that
 version since it doesn't exclude effects associated with certain nonsensical
 transcripts. These effects are excluded as of 2.0.4.
-Updated unit and integration tests

This support is based on a *release-candidate* of SnpEff 2.0.4, and so is subject
to change between now and the next GATK release.
2011-11-15 18:36:22 -05:00
Mark DePristo df415da4ab More bug fixes on the way to passing all tests 2011-11-15 17:38:12 -05:00
Laurent Francioli fb685f88ec Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-15 16:23:53 -05:00
Mark DePristo 460a51f473 ID field now stored in the VariantContext itself, not the attributes 2011-11-15 14:56:33 -05:00
Eric Banks 7fada320a9 The right fix for this test is just to delete it. 2011-11-15 14:53:27 -05:00
Mark DePristo 233e581828 Merging in Master 2011-11-15 09:28:24 -05:00
Mark DePristo 6e1a86bc3e Bug fixes to VariantContext and GenotypeCollection 2011-11-15 09:21:30 -05:00
Roger Zurawicki 284430d61d Added more basic UnitTests for ReadClipper
hardClipByReadCoordinatesWorks
hardClipLowQualTailsWorks
2011-11-15 00:13:52 -05:00
Roger Zurawicki 8e91e19229 Merge branch 'master' of ssh://nickel/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-15 00:13:37 -05:00
Mauricio Carneiro cde829899d compress Reduce Read counts bytes by offset
Compressing the representation of the reduce reads counts by offset results in 17% average compression in final BAM file size.

Example compression -->

from: 10, 10, 11, 11, 12, 12, 12, 11, 10
to:   10, 0, 1, 1, 2, 2, 2, 1, 0
2011-11-14 18:30:24 -05:00
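The from/to example in cde829899d is consistent with keeping the first count and storing every later count as its offset from that first value. A minimal sketch under that reading (the `CountOffsets` class and its method names are hypothetical, not the ReduceReads implementation):

```java
import java.util.Arrays;

final class CountOffsets {
    /** Keeps the first count and stores each later count as an offset from it. */
    static byte[] encode(byte[] counts) {
        byte[] out = new byte[counts.length];
        if (counts.length == 0) {
            return out;
        }
        out[0] = counts[0];
        for (int i = 1; i < counts.length; i++) {
            out[i] = (byte) (counts[i] - counts[0]);
        }
        return out;
    }

    /** Reverses encode() by adding the first value back to every offset. */
    static byte[] decode(byte[] encoded) {
        byte[] out = new byte[encoded.length];
        if (encoded.length == 0) {
            return out;
        }
        out[0] = encoded[0];
        for (int i = 1; i < encoded.length; i++) {
            out[i] = (byte) (encoded[i] + encoded[0]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] counts = {10, 10, 11, 11, 12, 12, 12, 11, 10};
        System.out.println(Arrays.toString(encode(counts)));
    }
}
```

The offsets cluster around zero, which a general-purpose compressor can presumably exploit better than the raw counts.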
Mark DePristo 4ff8225d78 GenotypeMap -> GenotypeCollection part 3
-- Test code actually builds
2011-11-14 17:51:41 -05:00
Mark DePristo f0234ab67f GenotypeMap -> GenotypeCollection part 2
-- Code actually builds
2011-11-14 17:42:55 -05:00
Mark DePristo 2e9d5363e7 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-14 15:32:06 -05:00
Mark DePristo 1fbdcb4f43 GenotypeMap -> GenotypeCollection 2011-11-14 15:32:03 -05:00
Eric Banks 7b2a7cfbe7 Transfer headers from the resource VCF when possible when using expressions. While there, VA was modified so that it didn't assume that the ID field was present in the VC's info map in preparation for Mark's upcoming changes. 2011-11-14 14:31:27 -05:00
Mark DePristo 9b5c79b49d Renamed InferredGeneticContext to CommonInfo
-- I have no idea why I named this InferredGeneticContext, a totally meaningless term
-- Renamed to CommonInfo.
-- Made package protected, as no one should use this outside of VariantContext and Genotype
-- UGEngine was using IGC constant, but it's now using the public one in VariantContext.
2011-11-14 14:28:52 -05:00
Mark DePristo 077397cb4b Deleted MutableVariantContext
-- All methods that used this capability now use VariantContext directly instead
2011-11-14 14:19:06 -05:00
Mark DePristo 79987d685c GenotypeMap contains a Map, not extends it
-- On path to replacing it with GenotypeCollection
2011-11-14 12:55:03 -05:00
Laurent Francioli 1347beef40 Merge branch 'PhaseByTransmission' 2011-11-14 11:31:28 +01:00
Laurent Francioli 6881d4800c Added Integration tests for Phasing by Transmission 2011-11-14 10:47:51 +01:00
Laurent Francioli 34acf8b978 Added Unit tests for new methods in GenotypeLikelihoods 2011-11-14 10:47:02 +01:00
Roger Zurawicki 1202a809cb Added Basic Unit Tests for ReadClipper
Tests some but not all functions
Some tests have been disabled because they are not working
2011-11-13 22:27:49 -05:00
Mark DePristo fee9b367e4 VariantContext genotypes are now stored as GenotypeMap objects
-- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples
-- All instances in the gatk that used Map<String, Genotype> now use GenotypeMap type.
-- Amazingly, there were many places where HashMap<String, Genotype> is used, so that the order of the genotypes is technically undefined and could be dangerous.  Now everything uses GenotypeMap with a specific ordering of samples (by name)
-- Integrationtests updated and all pass
2011-11-11 15:00:35 -05:00
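The ordering hazard described in fee9b367e4 comes from `HashMap` making no guarantee about iteration order. A minimal sketch of sample-name ordering, using a plain `TreeMap` as a stand-in (the `SampleOrder` class is hypothetical and says nothing about how GenotypeMap is implemented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SampleOrder {
    /** Returns sample names sorted by name, whatever order the input map iterates in. */
    static List<String> sampleNamesInOrder(Map<String, String> genotypesBySample) {
        return new ArrayList<>(new TreeMap<>(genotypesBySample).keySet());
    }

    public static void main(String[] args) {
        Map<String, String> genotypes = Map.of("NA12891", "0/1", "NA12878", "1/1", "NA12892", "0/0");
        System.out.println(sampleNamesInOrder(genotypes));
    }
}
```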
Mark DePristo 4938569b3a More general handling of parameters for VariantContextBenchmark 2011-11-11 10:22:19 -05:00
Mark DePristo e216e85465 First working version of VariantContextBenchmark 2011-11-11 09:56:00 -05:00
Mark DePristo ee40791776 Attributes are now Map<String,Object> not Map<String,?>
-- Allows us to avoid an unnecessary copy when creating InferredGeneticContext (whose name really needs to change).
2011-11-11 09:55:42 -05:00
Mark DePristo 153e52ffed VariantEvalIntegrationTest for IntervalStratification 2011-11-10 14:10:39 -05:00
Mauricio Carneiro d00b2c6599 Adding a synthetic read for filtered data
* Generalized the concept of a synthetic read to create both the running consensus and synthetic reads of filtered data.
* Synthetic reads can now have deletions (but not insertions)
* New reduced read tag for filtered data synthetic reads *(RF)*
* Sliding window header now keeps information of consensus and filtered data
* Synthetic reads are created simultaneously, new functionality is controlled internally by addToSyntheticReads
2011-11-09 20:16:22 -05:00
Eric Banks 02d5e3025e Added integration test for intervals from bed file 2011-11-09 15:34:19 -05:00
Ryan Poplin 94dc447a70 Merged bug fix from Stable into Unstable 2011-11-07 15:26:35 -05:00
Ryan Poplin 0b181be61f Bug fix in SelectVariants when using a discordance track but no sample specifications. Added integration test to test this. 2011-11-07 15:25:16 -05:00
Eric Banks 759f4fe6b8 Moving unclaimed walker with bad integration test to archive 2011-11-07 13:16:38 -05:00
Eric Banks 3517489a22 Better --sample selection integration test for VE. The previous one would return true even if --sample was not working at all. 2011-11-06 01:07:49 -04:00
Eric Banks ad57bcd693 Adding integration test to cover using expressions with IDs (-E foo.ID) 2011-11-05 23:53:15 -04:00
Mauricio Carneiro e89ff063fc GATKSAMRecord refactor
The GATK engine will now provide a GATKSAMRecord to all tools which incorporates the functionality used by the GATK to the bam file (ReadGroups, Reduced Reads, ...).

* No tools should create SAMRecord anymore, use GATKSAMRecord instead *
2011-11-03 15:43:26 -04:00
Eric Banks e8bceb1eaa Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-02 21:13:54 -04:00
Eric Banks 78a00d2ddc Updating UG integration tests (needed updating only because the -mbq default is different from the old -mmq one). 2011-11-02 21:13:44 -04:00
Eric Banks e1edd6bd12 Removing the min mapping quality argument since it wasn't being used in the normal processing of the pileups in UG - only for indel pileups. Instead, we apply the min base quality to the reads in the pileup for indels and define it to be the min 'confidence' of the base. Docs are updated but I didn't rename the argument as I don't want people to complain. 2011-11-02 20:32:58 -04:00
Mark DePristo 8a2929c1dd Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-02 16:21:00 -04:00
Eric Banks 4501dce58d Fixing merge conflict 2011-11-02 12:50:32 -04:00
Eric Banks 54331b44e9 New way of looking at the size of a pileup: there's a physical number of elements in the data structure and there's a representative depth of coverage (since a reduced read represents depth >= 1). The size() method has been removed because its meaning is ambiguous. Updated several annotations and the UG engine to make use of the representative depths. 2011-11-02 12:47:30 -04:00
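The two notions of pileup size in 54331b44e9 can be sketched as follows, assuming each pileup element carries a representative count (1 for an ordinary read, possibly more for a reduced read). The `PileupDepth` class and the `int[]` representation are hypothetical, not the ReadBackedPileup interface:

```java
final class PileupDepth {
    /** Physical size: how many elements the data structure holds. */
    static int numberOfElements(int[] representativeCounts) {
        return representativeCounts.length;
    }

    /** Representative depth: the sum of the depth each element stands for. */
    static int depthOfCoverage(int[] representativeCounts) {
        int depth = 0;
        for (int count : representativeCounts) {
            depth += count;
        }
        return depth;
    }

    public static void main(String[] args) {
        // Two ordinary reads plus one reduced read that represents five reads.
        int[] counts = {1, 1, 5};
        System.out.println(numberOfElements(counts) + " elements, depth " + depthOfCoverage(counts));
    }
}
```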
Mark DePristo 392e0aeace Moved unit tests into master IntervalUtilsUnitTest 2011-11-02 10:52:00 -04:00
Mark DePristo c2b97030a4 IntervalUtils for completely balanced locus-based scatter/gather
-- scatterLocusIntervals master utility
-- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc
-- Util function for reversing a list (List<T> -> List<T>, unlike Collections version)
-- DoC is PartitionType.INTERVAL
-- Significant unit tests on new functionality (all passing)
-- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work
2011-11-02 10:49:40 -04:00
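On the list-reversal utility in c2b97030a4: `java.util.Collections.reverse` reverses its argument in place and returns `void`, so a `List<T> -> List<T>` helper has to copy first. A minimal sketch of such a helper (the `ListReverse` class and method name are hypothetical, not the actual utility):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ListReverse {
    /** Returns a new reversed list and leaves the input untouched. */
    static <T> List<T> reverse(List<T> list) {
        List<T> copy = new ArrayList<>(list);
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(reverse(List.of(1, 2, 3)));
    }
}
```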
Mauricio Carneiro b004489c6d Moving ReduceRead TAG to GATKSAMRecord
ReduceReads are now a feature of a GATKSAMRecord, so the tag and the special methods needed to use it will now be housed by the GATKSAMRecord.
2011-11-01 17:12:09 -04:00
Eric Banks 0ca7428e76 Allow processing of empty intervals, but warn user when this case is encountered. 2011-10-28 12:12:14 -04:00
Eric Banks 649dfe98f0 Add VCF header for any expressions that are requested 2011-10-28 10:22:19 -04:00
Eric Banks 8b1a62da27 Adding unit test to cover overlapping intervals from the same source with the intersection rule. 2011-10-28 09:59:43 -04:00
Eric Banks 6ba08a103d Empty ROD files should generate an exception when used for creating intervals. Moved some now obsolete files to the archive as the realigner will now read all target intervals into memory. 2011-10-28 09:23:25 -04:00
Eric Banks 19e27d4568 Removing all instances of -BTI (in tests and in GATKdocs) and replacing them with the appropriate alternative. 2011-10-27 23:55:11 -04:00
Eric Banks ccfd853b34 Added further integration tests for rod-based intervals that deal with more complex cases. Good call by Mark to test the empty VCF example because we were failing on it; fixed. 2011-10-27 20:43:50 -04:00
Khalid Shakir b80d407dc7 No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path.
Other minor cleanup.
2011-10-27 14:17:07 -04:00
Eric Banks 8c4dbce6d8 Don't serialize the GATKArgumentCollection for the GATKRunReports (which would have meant dealing with the new IntervalBindings). Also, forgot to remove a test that's no longer relevant to BED parsing. 2011-10-27 13:58:19 -04:00
Eric Banks 4a7e6fee3f Remove support for BED file interval parsing in the GATK; it should all go through Tribble now. IndelRealigner no longer supports unordered interval input (which shouldn't have been used anyways). Temporarily commenting out serialization of arguments so that tests pass; this whole piece will be deleted soon anyways. 2011-10-27 13:38:08 -04:00
Eric Banks 44f905b5e5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-26 23:31:11 -04:00
Mark DePristo 034a997d07 Generalized Reads -> Fragment calculation
-- Supports ReadBackedPileup -> FragmentCollection as before
-- Added support for List<SAMRecord> -> FragmentCollection for Ryan's haplotype caller
-- General cleanup, renaming, move to separate package, more extensive unit tests, etc.
-- Added toFragment() function to ReadBackedPileup interface
2011-10-26 15:54:38 -04:00
Eric Banks b39fcb1bea Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-26 15:44:25 -04:00
Eric Banks 3273c20c98 Added integration tests for Tribble-based intervals and fixed up some of the other tests based on some method changes. 2011-10-26 15:29:18 -04:00
Mark DePristo 7fa943aef1 Renamed FragmentPileup to FragmentUtils 2011-10-26 14:01:45 -04:00
Mark DePristo 1b722c21cf merge master 2011-10-25 16:08:39 -04:00
David Roazen 2794e5c1d4 Modified the VCFJarClassLoadingUnitTest to play nice with the packaged-jar test targets. 2011-10-25 14:47:15 -04:00
Khalid Shakir fac9932938 Embedding gsalib source and queueJobReport R scripts in the dist and package jars.
Moved gsalib and queueJobReport.R to embeddable namespaced locations.
Updated packager dependencies/dir to add an @includes which filters the embedded fileset.
RScriptExecutor can now JIT compile the gsalib.
RScriptExecutor uses ProcessController and sends the Rscript output to java's stdout when run under -l DEBUG.
Refactored ProcessController and IOUtils from Queue to Sting Utils.
Added more unit tests to ProcessController along with a utility class to hard stop OutputStreams at a specified byte count.
Replaced uses of some IOUtils with Apache Commons IO.
ShellJobRunner refactored to use direct ProcessController and now kills jobs on shutdown.
Better QGraph responsiveness on shutdown by using Object.wait() instead of Thread.sleep().
2011-10-24 15:58:34 -04:00
Khalid Shakir 89a581a66f Added ability to specify arguments in files via -args/--arg_file
Pushing back downsample and read filter args so they show up in getApproximateCommandLineArgs()
2011-10-24 15:58:34 -04:00
Mark DePristo 502592671d Cleanup FragmentPileup before main repo commit
-- removed intermediate functions.  Now only original version and best optimized new version remain
-- Moved general artificial read backed pileup creation code into ArtificialSamUtils
2011-10-24 14:40:05 -04:00
Mark DePristo 166174a551 Google caliper example execution script
-- FragmentPileup with final performance testing
2011-10-24 14:04:53 -04:00
Mark DePristo 42bf9adede Initial version of "fast" FragmentPileup code
-- Uses mayOverlapRoutine in ReadUtils
-- Attempts to be smart when doing overlap calculation, to avoid unnecessary allocations
-- PileupElement now comparable (sorts on offset, then on start)
-- Caliper microbenchmark to assess performance
2011-10-22 21:36:37 -04:00
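The PileupElement ordering mentioned in 42bf9adede, read as offset first with alignment start as the tie-breaker, can be sketched with a two-field stand-in class. `SketchPileupElement` and its fields are hypothetical and do not mirror the real PileupElement:

```java
final class SketchPileupElement implements Comparable<SketchPileupElement> {
    final int offset;     // position of the base within the read
    final int readStart;  // alignment start of the read

    SketchPileupElement(int offset, int readStart) {
        this.offset = offset;
        this.readStart = readStart;
    }

    /** Orders by offset first, then by alignment start to break ties. */
    @Override
    public int compareTo(SketchPileupElement other) {
        int byOffset = Integer.compare(offset, other.offset);
        return byOffset != 0 ? byOffset : Integer.compare(readStart, other.readStart);
    }

    public static void main(String[] args) {
        SketchPileupElement a = new SketchPileupElement(1, 200);
        SketchPileupElement b = new SketchPileupElement(2, 100);
        System.out.println(a.compareTo(b) < 0);
    }
}
```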
Guillermo del Angel f4b409fa0d CombineVariants bug fix: when merging records with disparate alleles we were leaving AC,AF fields intact. This had as a consequence that we could end up with a record with 3 alt alleles but only 2 values in AC,AF fields. Now, if alleles in combined vc are different from original, and if AC,AF fields can't be recomputed from genotypes, we remove attributes from vc map since they'll be invalid anyway. Integration test md5 changed since there were several badly merged records in result 2011-10-21 14:07:20 -04:00
Mark DePristo b863390cb1 Moving reduced read functionality into GATKSAMRecord
-- More functions take / produce GATKSAMRecords instead of SAMRecord
2011-10-21 13:28:05 -04:00
Mark DePristo 110e13bc1e Merge branch 'master' into SamRecordFactory 2011-10-21 09:43:52 -04:00
Mark DePristo 3227143a1c Systematic test code for FragmentPileup
-- Creates all combinations of overlapping and non-overlapping read pair pileups in all orientations and first/second pairings to validate fragment detection.
2011-10-19 17:50:27 -04:00
Eric Banks d8d73fe4f2 Treat ./X genotypes as MIXED so that isHet, isHom, etc. still return the expected and correct values. Added docs to these accessors with contracts explicitly mentioned. Fixed case where NPE could be thrown. 2011-10-19 15:11:13 -04:00
Eric Banks 5a6468c11e Allowing ./X genotypes and adding a unit test to ensure that this case is covered from now on (especially given that we may want to revert in the future). Reverting this change is really easy and entails uncommenting a few lines of code. But for now, despite Mark's objections, this case is allowed in the VCF spec and we are wrong not to allow it. 2011-10-19 11:52:05 -04:00
David Roazen 88d6b8bc1f Merged bug fix from Stable into Unstable 2011-10-14 20:13:38 -04:00
David Roazen bd8bb93811 Split RScriptExecutorUnitTest into public and private test classes.
We can't have a public test that depends on both public and private
code/data -- the new release system needs to do public-only tests,
and will catch this sort of thing.
2011-10-14 20:04:42 -04:00
David Roazen 4f01a742cb Merged bug fix from Stable into Unstable 2011-10-13 21:39:52 -04:00
David Roazen edfd6f8a06 Removing a public -> private dependency from the test suite.
The public integration test VariantContextIntegrationTest was dependent on the
private walker TestVariantContextWalker. Moved this walker to public/java/test
(NOT public/java/src, since this walker is only used by the test suite) to avoid
errors during public-only tests.
2011-10-13 21:32:52 -04:00
Mark DePristo 404ef741f1 Merged bug fix from Stable into Unstable 2011-10-13 18:02:06 -04:00
Mark DePristo 2ebdff074c Update MD5s for SOLiD recalibration
-- MD5 db had spelling error; fixed
-- Bug in AlignmentUtils resulted in some bases not being color space corrected.  The integration test caught the change, and it's clear that the new version is correct, as the prev. version was not considering the last N qualities for reads with a ND operation.
2011-10-13 18:01:51 -04:00
Eric Banks 9aecd50473 Adding ability to exclude annotations from the VA and UG lists. As described in the docs, this argument trumps all others (including -all) so that we can get around the SnpEff issue brought up by Menachem. Added integration test for it. 2011-10-12 15:44:54 -04:00
David Roazen cfd0ac8410 Merged bug fix from Stable into Unstable
Conflicts:
	public/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperIntegrationTest.java
2011-10-11 12:03:51 -04:00
David Roazen 24b72334b3 UnifiedGenotyper now correctly initializes the VariantAnnotator engine.
This allows the annotation classes to perform any necessary initialization/validation.
For example, it allows the SnpEff annotator to (among other things) validate its rod binding.
This will prevent a NullPointerException when SnpEff annotation is requested but no rod binding
is present.

Added an integration test to cover this case so that it doesn't break again.
2011-10-11 12:02:05 -04:00
Mark DePristo fb72bcf732 DiffObjects no longer prints out the file name in the status so MD5 are stable 2011-10-10 15:10:57 -04:00
Mark DePristo e3ff4f4266 Failing MD5 because output now contains absolute path 2011-10-10 11:05:02 -04:00
Mark DePristo 3e6c16d961 CombineVariants preserves allele order 2011-10-10 11:04:38 -04:00
Mark DePristo a4bb842958 RankSum tests have slightly different MD5 results based on allele order
-- UG GENOTYPE_GIVEN_ALLELES now uses the order of alleles in the VCF, so this changes the MD5
2011-10-10 11:04:07 -04:00
Mark DePristo 46e7370128 this.allele, getAlleles(), and getAltAlleles() now return List not set
-- Changes associated code throughout the codebase
-- Updated necessary (but minimal) UnitTests to reflect new behavior
-- Much better makealleles() function in VC.java that enforces a lot of key constraints in VC
2011-10-09 11:45:55 -07:00
Mark DePristo 822654b119 UnitTests for allele getting functions in VC in prep for move from set to list 2011-10-09 10:36:14 -07:00
Mark DePristo c67f6c076b simpleMerge now preserves allele order
-- UnitTests for dangerous PL merging cases in the multi-allelic case.  The new behavior is correct
2011-10-08 17:39:53 -07:00
Mark DePristo e94e6ba101 A UnitTest to ensure that the order of alleles is maintained
-> A, C, T and A, T, C are different and must be maintained.  The constructors were doing this appropriately, so nothing needed to be changed
2011-10-08 08:47:58 -07:00
Matt Hanna 6fbd41724a Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-07 11:20:00 -04:00
Matt Hanna 4514bc350f More reliable way of finding the Tribble jar. 2011-10-07 11:19:29 -04:00
Eric Banks 181c76750e Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-06 22:38:55 -04:00
Eric Banks ca9cd9b688 Minor fix for merging intervals which hadn't been necessary when only merging from the left to right. Added integration tests to cover the parallelization of RTC. 2011-10-06 22:38:44 -04:00
Khalid Shakir f91b015e0e Made the BaseTest.testDir absolute 2011-10-06 22:33:21 -04:00
Eric Banks 61a3dfae24 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-10-06 15:58:04 -04:00
Eric Banks 6eb87bf58a RTC now caches all intervals as GenomeLocs (which is expected to take < 1Gb whole genome based on back of the envelope calculations with Matt) so that 1) we don't have to worry about emitting outside of the leaves in the hierarchical reductions and 2) we can emit the intervals in sorted order which is a big performance plus for the realigner. Integration tests change only because intervals whose start=stop are now printed as chr:start instead of chr:start-stop. 2011-10-06 15:57:49 -04:00
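The output change noted in 6eb87bf58a, where single-base intervals print as chr:start rather than chr:start-stop, can be sketched as a formatting rule. The `IntervalFormat` class is hypothetical and is not the GenomeLoc code:

```java
final class IntervalFormat {
    /** Formats an interval, collapsing start == stop to the short chr:start form. */
    static String format(String contig, int start, int stop) {
        return start == stop
                ? contig + ":" + start
                : contig + ":" + start + "-" + stop;
    }

    public static void main(String[] args) {
        System.out.println(format("chr1", 100, 100));
        System.out.println(format("chr1", 100, 200));
    }
}
```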
Mark DePristo 6d9c210460 Updating MD5s for updated BAM with read groups 2011-10-06 12:15:48 -07:00
Matt Hanna 3961733590 Merged bug fix from Stable into Unstable 2011-10-06 12:54:52 -04:00
Matt Hanna 4fa5045e84 Abandoning classfileset/rootfileset approach due to difficulty managing
classloading of bcel*.jar/ant-apache-bcel*.jar.  Switching instead to manually
specifying a minimal set of packages/classes to include in the vcf.jar via
build.xml, and adding a unit test which creates a limited classloader
only aware of vcf.jar and tribble.jar and tries to use it to load the core
classes in the vcf jar.

Hopefully third time's the charm.
2011-10-06 12:49:51 -04:00
Mark DePristo 4b5b9155a9 Fixed bad expected value in PedReaderUnitTest 2011-10-06 08:16:47 -07:00
Mark DePristo 3226d5dc0d Merge branch 'master' into ped 2011-10-05 15:03:09 -07:00
Mark DePristo e7c80f7c45 Renaming quantitative trait to OtherPhenotype which is now a String not a double
-- we can now use PED file to represent population data or other arbitrary phenotype data, not just doubles
2011-10-05 12:26:33 -07:00
Mark DePristo 51ecc20867 getFamily() and associated methods implemented and tested
-- Sample no longer serializable
-- Sample now implements Comparable
2011-10-05 09:55:05 -07:00
Mark DePristo f4bac58f14 Merged bug fix from Stable into Unstable 2011-10-04 21:00:34 -07:00
Mark DePristo d1d39943d0 Updating MD5 for BAMs that I added a read group to, part 2 2011-10-04 21:00:15 -07:00
Mark DePristo 9bd3ba4c7e Missed one MD5 2011-10-04 16:04:52 -07:00
Mark DePristo ffdfdcde3f Updating MD5s
-- Interval test now uses RG containing BAM
-- DoC sample name ordering has changed.
2011-10-04 15:54:45 -07:00
Mark DePristo 463eab7604 All MD5 mismatches for test are shown
-- Now for tests like DoC, with 20 output md5s, you see all of the differences before failing.
2011-10-04 15:53:52 -07:00
Mark DePristo c642a080d4 Merged bug fix from Stable into Unstable 2011-10-04 14:08:41 -07:00
Mark DePristo 941317167e Updating MD5 for BAMs that I added a read group to 2011-10-04 14:08:00 -07:00
Mark DePristo e1d6c7a50a Updating MD5 that have changed due to sample ordering differences 2011-10-04 09:33:23 -07:00
Mark DePristo 343a7b6b2f Updating UG integration tests for arbitrary impact of sample order changes on downsampling 2011-10-04 08:14:00 -07:00
Mark DePristo a27641e1fc Cleaned up imports 2011-10-04 06:28:36 -07:00
Mark DePristo b20689ff55 No longer supports extraProperties
-- the underlying data structure is still present, but until I decide what to do for the extensible system I've completely disabled the subsystem
-- Added code to merge Samples, so that a mostly full record can be merged with a consistent empty record.  If the two records are inconsistent, an error is thrown
-- addSample() in Sample.class now invokes mergeSample() when appropriate
-- Validation types are now only STRICT or SILENT
-- Validation code implemented in SampleDBBuilder
-- Extensive unit tests for SampleDBBuilder
2011-10-03 19:20:33 -07:00
Mark DePristo 867a7476c1 Systematic unit tests for the sample object 2011-10-03 19:09:02 -07:00
Mauricio Carneiro 3837aa45b4 Fixing conflicts
Conflicts:
	public/java/test/org/broadinstitute/sting/utils/clipreads/ReadClipperUnitTest.java
2011-10-03 19:07:59 -07:00
Mark DePristo 2e3dc52088 Minor function renaming 2011-10-03 14:41:13 -07:00
Mark DePristo dd71884b0c On path to SampleDB engine integration
-- PedReader tag parser
-- Separation of SampleDBBuilder from SampleDB (now immutable)
-- Removed old sample engine arguments
2011-10-03 12:08:07 -07:00
Mark DePristo 89ac50e86e SampleDataSource -> SampleDB 2011-10-03 09:33:30 -07:00
Mark DePristo 93fba06cb5 Support for whitespace only lines 2011-10-03 09:30:10 -07:00
Mark DePristo 0604ce55d1 PedReader support for ; separated lines, not only newline 2011-10-03 09:19:58 -07:00
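The two PedReader input rules from 0604ce55d1 and 93fba06cb5 (records separated by ";" as well as by newline, and whitespace-only lines ignored) can be sketched as a splitting step. The `PedLineSplitter` class is hypothetical and the sample records are made up for illustration; this is not the PedReader code:

```java
import java.util.ArrayList;
import java.util.List;

final class PedLineSplitter {
    /** Splits input on newlines or semicolons and drops whitespace-only records. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        for (String line : text.split("[;\\n]")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "fam1 kid dad mom 1 2;fam1 dad 0 0 1 1\n   \nfam1 mom 0 0 2 1";
        System.out.println(splitRecords(input));
    }
}
```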
Mark DePristo 52f670c8b8 100% version of PedReader
-- Passes all unit tests
-- Added unit tests for missing fields
2011-10-03 06:12:58 -07:00
Roger Zurawicki bf6a3a6532 Added framework to do batch CigarClip Testing
*NOTE: This commit has not been compiled!
2011-10-02 22:33:46 -04:00
Mark DePristo dd75ad9f49 95% PedReader
-- Passes significant unit tests
-- Implicit sample creation for mom / dad when you create single samples
-- Continuing cleanup of Sample and SampleDataSource
2011-09-30 18:03:34 -04:00
Mark DePristo 84160bd83f Reorganization of Sample
-- Moved Gender and Afflication to separate public enums
-- PedReader 90% implemented
-- Improve interface cleanup to XReadLines and UserException
2011-09-30 15:50:54 -04:00
Mark DePristo 56f10b40a8 Fixing test bugs for WindowMaker that required empty sample list 2011-09-30 14:18:27 -04:00
Mark DePristo 30d23942b1 Renamed ReadBackedPileup getXSampleName() functions to getXSample
-- now that we don't have Sample objects floating around we don't have to have all of the Name extensions on our functions
2011-09-30 10:02:57 -04:00
Mark DePristo e055a78f6e LIBS now requires at least one sample be present
-- UnitTest provides a "null" sample for matching the reads without read groups
2011-09-30 09:49:35 -04:00
Mark DePristo b71b51751e Bug fix for UnitTest
-- Provide the null sample to the LIBS, as this seems to be required for correctly passing this unit test
-- Will be fixed in a future update
2011-09-29 17:30:01 -04:00