Commit Graph

731 Commits (085588cb043fd1befdec32e5e42d4257dc8d48ae)

Author SHA1 Message Date
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
Eric Banks 26968ae8eb Forgot that the VCFStreamingOntegrationTest uses VE 2012-05-18 02:51:53 -04:00
Eric Banks 52c206d5db Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports. 2012-05-18 02:32:20 -04:00
Eric Banks 03d40272c8 Removed old GATKReport code and moved the new stuff in its place. 2012-05-18 01:44:31 -04:00
Eric Banks a26b04ba17 Extensive refactoring of the GATKReports. This was a beast.
The practical differences between version 1.0 and this one (v1.1) are:

* the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables.
* no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table.
* no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables.

Integration tests change because table headers are different.
Old classes are still lying around.  Will clean those up in a subsequent commit.
2012-05-18 01:11:26 -04:00
Guillermo del Angel 5189b06468 New annotation for indels that describe if they're STR's and their characteristics. If an indel is a STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality. 2012-05-17 15:28:19 -04:00
Guillermo del Angel ae26f0fe14 a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing 2012-05-14 10:55:35 -04:00
Guillermo del Angel 89f8a6b2e6 Revert bad part of last commit that shouldn't have been pushed 2012-05-10 10:41:08 -04:00
Guillermo del Angel 27b1aa5dd3 Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's 2012-05-10 10:29:19 -04:00
Eric Banks c40cda7e3c Nope, loads of integration tests had to be changed. 2012-05-07 14:30:42 -04:00
Eric Banks f3433201b1 Merged bug fix from Stable into Unstable 2012-05-03 11:11:00 -04:00
Eric Banks 557da77a1a Don't compute QD if there is no QUAL; added integration test for this 2012-05-03 11:02:37 -04:00
Eric Banks 1fc7b5d58b Merged bug fix from Stable into Unstable 2012-05-03 10:37:58 -04:00
Laurent Francioli 567d01cee8 - Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:49 -04:00
Laurent Francioli 96e5a26223 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:20 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionBufferedStream
as an argument not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides to its subclass the same String decode,
readHeader functionality before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (its a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's a particularly
clean why of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader obtain that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so its no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Eric Banks e448cfcc59 Forgot to update these md5s 2012-05-02 21:09:50 -04:00
Khalid Shakir b8b7f28aa9 Revving Picard to pick up new SamFileHeaderMerger.
Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut().
In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124.
- Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs.
- Updated md5s to match new expectations after looking at TLEN diff engine output.
2012-05-02 16:47:28 -04:00
Eric Banks 623b36fbc4 Add header lines for AC,AF, and AN tags 2012-05-02 15:33:34 -04:00
Eric Banks 619a69a5f1 As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?). 2012-05-01 16:18:24 -04:00
Eric Banks 0f3af9555b Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case. 2012-05-01 14:58:06 -04:00
Eric Banks 0c8e801021 Removing public to private dependency 2012-05-01 11:04:11 -04:00
Eric Banks e964d17518 Removing public to private dependency 2012-05-01 11:02:28 -04:00
Mauricio Carneiro 462450c3e3 disabling all BQSR unit tests
with the changes to the cycle covariate, some tests need updates, others  need to be completely re-written.
2012-04-30 14:39:55 -04:00
Guillermo del Angel e185632013 Exhaustive unit tests for Pool SNP genotype likelihoods:
a) Add ability for ErrorModel to be specified by external log-probability vector for testing.
b) For a given depth and ploidy(=2*samples/pool), create artificial high quality pileup testing from AC=0 to AC=ploidy, and test that pool GL's have expected content.Misc. refactorings and cleanups
c) Misc. cleanups and beautification.
2012-04-30 14:29:46 -04:00
Guillermo del Angel 730208133b Several fixes and improvements to Pool caller with ancillary test functions (not done yet):
a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value.
b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time.
c) Expand unit tests and add an exhaustive test for ErrorModel class.
d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10.
e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp), can only specify pileups in one position with a given number of matches and mismatches to ref) but functionality will be expanded in future to cover more test cases.
f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done).
g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model.
h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math
2012-04-27 14:41:17 -04:00
Eric Banks 0439047269 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-27 10:49:45 -04:00
Eric Banks 05b44dd017 The genotypeCounts array wasn't always being initialized before it was accessed, leading to a NPE (which got caught and thrown as a JEXL expression when used in selection). Added unit test to cover all genotype count methods. 2012-04-27 10:49:36 -04:00
Khalid Shakir 9801dd114f Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped
The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag()
Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.
2012-04-27 09:58:38 -04:00
Guillermo del Angel 2f86ccb086 Correct md5's for previous code change 2012-04-26 16:20:41 -04:00
Guillermo del Angel 972d6531b6 Corner case fix for indel GL computation: sometimes (depending on surrounding context) reads which are not informative of two candidate haplotypes end up having marginally higher likelihoods with one haplotype as opposed to another, depending on uncertainty on alignments in surrounding regions. So, a sample whose GL is -0.0001,-0.0005,-0.001 may have its genotype set to 1/1 due to this statistical noise. We already have a tolerance comparing max(gl)-min(gl) to avoid genotyping, so this tolerance is now increased from 0.001 to 0.1 (equivalent to 1 PL unit) to avoid genotyping a sample if all PLs are within this threshold. Changed 2 integration test md5s that hit this case. 2012-04-26 10:15:26 -04:00
Laurent Francioli ab2a952ad1 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:56:47 -04:00
Laurent Francioli 219b0a128b PED support for ChromosomeCounts annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:50:04 -04:00
Laurent Francioli 19d5213d5a Added function to get founders IDs in SampleDB
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-04-25 12:49:36 -04:00
Mauricio Carneiro e440d0ce69 BQSR triage #4
* fixed queue script plot file names
   * updated the ReadGroupCovariate to use the platform unit instead of sample + lane.
   * fixed plotting of marginalized reported qualities
2012-04-24 17:19:54 -04:00
Mauricio Carneiro e39a59594a BQSR triage and test routines
* updated BQSR queue script for faster turnaround
   * implemented plot generation for scatter/gatherered runs
   * adjusted output file names to be cooperative with the queue script
   * added the recalibration report file to the argument table in the report
   * added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read
   * added RecalibrationReport unit test -- guarantees the integrity of the delta tables
2012-04-23 11:23:00 -04:00
Eric Banks cd63bcb1b8 Fixing unit tests to register the user exception being thrown (instead of the NumberFormatException) 2012-04-23 10:06:51 -04:00
Eric Banks 1f23d99dfa If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did. 2012-04-20 17:00:05 -04:00
Eric Banks 4b81c75642 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-20 14:30:19 -04:00
Eric Banks f1c5510ec0 When running SelectVariants with the excludeNonVariants option, remove alleles from the ALT field that are no longer polymorphic. 2012-04-20 14:30:04 -04:00
Ryan Poplin a1596791af Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-20 14:03:04 -04:00
Ryan Poplin a57295eb75 Fixing a bug when breaking up active regions where the resulting regions would overlap by one base. Adding quality score manipulation from the UG into the haplotype caller (qual capped by mapping quality, min qual threshold). 2012-04-20 14:02:55 -04:00
Guillermo del Angel de68363c23 Removed experimental feature (aka hack) that was meant for 1000G consensus but remained in VQSR data manager - QD was being scaled by indel length. There's no evidence any more that QD is length-dependent, neither in CEU trio data nor in latest 1000G P2 calls 2012-04-20 10:58:34 -04:00
Mauricio Carneiro 0f8c77391d BQSR bug triage #3
* fixed context covariate famous "off by one" error
   * reduced maximum quality score to Q50 (following Eric/Ryan's suggestion)
   * remove context downsampling in BQSR R script
2012-04-19 17:31:04 -04:00
Khalid Shakir df5dd841af AC strat now checks if evals will be merged before throwing an error on multiple eval files.
Minor tweaks to WGP script based on new recal VCF format.
2012-04-19 16:08:55 -04:00
Guillermo del Angel 1ae2ab5b63 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-19 12:50:29 -04:00
Guillermo del Angel 02ff930f6a My changes 2012-04-19 12:45:18 -04:00
Mauricio Carneiro eb22cd7222 Unit test to guarantee BQSR sequential calculation accuracy
This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.
2012-04-19 09:33:40 -04:00
Mauricio Carneiro 68d0211fa1 Improved BQSR plotting and some new parameters
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
   * Refactored the CycleCovariateUnitTest to test the pairing information
   * Updated BQSR Integration tests accordingly
   * Made quantization levels parameter not hidden anymore
   * Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
   * Added hidden option not to generate the plots automatically (important for scatter/gathering)
2012-04-19 09:31:41 -04:00