2235 Commits (5ec737f008b8906c5cf7f5e7ddf7fd75e2ae4fbb)

Author SHA1 Message Date
Eric Banks e02ec8c8b6 Don't update the record ID unless we are actually going to emit the record 2012-06-04 14:58:50 -04:00
Eric Banks 8405156ae1 Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities. 2012-06-04 14:28:32 -04:00
Ryan Poplin f11e7ebc3a Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA. 2012-06-04 12:49:36 -04:00
Ryan Poplin 320956ee4b Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage. 2012-06-04 10:55:36 -04:00
Guillermo del Angel 7a54baf08c Merged bug fix from Stable into Unstable 2012-06-03 08:42:08 -04:00
Guillermo del Angel 47df7bbc14 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable 2012-06-03 08:38:54 -04:00
Guillermo del Angel 2ddbdee3bc Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow 2012-06-03 08:38:38 -04:00
Mauricio Carneiro 12a8c54f9a Fixing VCF header for filter elements (thanks Eric) 2012-06-01 15:45:15 -04:00
Eric Banks 3a15ba2102 Malformed VCF headers should be User Errors 2012-05-31 16:05:53 -04:00
Khalid Shakir c4f7df4dce When an underlying exception occurs because of a user error and the exception instance does not include a message, tell the user "because <exception class name>" instead of "because null". 2012-05-30 16:39:06 -04:00
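The fallback described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the GATK code: the class `ExceptionReason` and the method `reasonFor` are hypothetical names chosen for this example.

```java
// Hypothetical helper for illustration only; not taken from the GATK source.
final class ExceptionReason {
    /** Returns the exception's message, or its class name when no message was set. */
    static String reasonFor(Throwable t) {
        String message = t.getMessage();
        return (message != null) ? message : t.getClass().getName();
    }

    public static void main(String[] args) {
        // No message: falls back to the class name instead of printing "because null".
        System.out.println("because " + reasonFor(new UnsupportedOperationException()));
        // Message present: used as-is.
        System.out.println("because " + reasonFor(new IllegalStateException("bad header")));
    }
}
```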
Ryan Poplin 421d0d1435 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-30 15:21:35 -04:00
Ryan Poplin 5dd811f84a Adding genotype given alleles mode to the HaplotypeCaller. 2012-05-30 15:07:01 -04:00
Eric Banks d09b8d5584 Fixing docs 2012-05-30 13:24:08 -04:00
Mauricio Carneiro d6e1205310 Updating default values for DiagnoseTargets 2012-05-30 12:43:07 -04:00
Khalid Shakir c3c7f17d90 Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000.
Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the nonexistent dir, now checking to see if the network directory is available and throwing a SkipException to bypass the test when it cannot be run.
TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors pop up later when the networked fastas aren't found.
2012-05-29 11:18:22 -04:00
Roger Zurawicki b8b139841d DiagnoseTargets with working Q1,Median,Q3
- Merged Roger's metrics with Mauricio's optimizations
 - Added Stats for DiagnoseTargets
     - now has functions to find the median depth, and upper/lower quartile
     - the REF_N callable status is implemented
 - The walker now runs efficiently
 - Diagnose Targets accepts overlapping intervals
 - Diagnose Targets now checks for bad mates
 - The read mates are checked in a memory efficient manner
 - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker.
 - Fixed some bugs
 - Removed rod binding

Added more Unit tests

 - Test callable statuses on the locus level
 - Test bad mates

 - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-05-29 10:16:45 -04:00
Eric Banks 50031b63c5 Fix possible NPE from NBaseCount annotation module 2012-05-29 09:46:00 -04:00
Mark DePristo 08de4dfd96 Missed one integration test 2012-05-29 07:23:24 -04:00
Mark DePristo 454c8e63e6 Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s
-- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec.  As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)
2012-05-28 20:20:05 -04:00
Mark DePristo 7ce24a96f1 PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present 2012-05-28 20:18:16 -04:00
Mark DePristo 1818c29371 Fixed long-standing bug in beagle codec that was passing on the header record for decoding 2012-05-28 20:17:26 -04:00
Mark DePristo 06b02e1b9b Update MD5s to reflect new limited output of DiffObjectsWalkers
-- Also updated GQ change in VCFIntegrationTest
2012-05-27 11:20:47 -04:00
Mark DePristo 5894d045cb Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests
-- This version of BCF should actually work properly for most files, assuming headers are properly defined.
-- Lots of bug fixes to BCF2 codec
-- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL.  NOTE THIS SEMANTICS change
-- Equals() method for GenotypeLikelihoods, using PLs.
-- VCFCodec no longer adds empty bindings to missing input field values.  NOTE THIS CHANGE
-- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work.  The BCF2 codec now makes VCs marked as fully decoded
-- stringToBytes returns empty list for null or "" string in BCF2Encoder
-- Proper handling of genotype ordering in BCF2 reader / writer
-- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily
-- Many failing MD5s now due to double -> int change in GQ, will update later
2012-05-27 11:17:17 -04:00
Mark DePristo 86e5a066fc Even more conservative limit on number of differences to summarize at 1000 2012-05-27 11:17:13 -04:00
Mark DePristo 31f4e5b52e Stop unlimited runtimes in DiffEngine when you have lots of differences
-- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000
2012-05-27 11:17:13 -04:00
Guillermo del Angel a6ee4f98b5 Yet More missing md5's 2012-05-25 17:21:47 -04:00
Mauricio Carneiro 4109fcbb08 Merged bug fix from Stable into Unstable 2012-05-25 13:03:05 -04:00
Mauricio Carneiro 2be5704a25 Fixed haplotype boundary bug in PairHMMIndelErrorModel
haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception).

   * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate.
   * removed unnecessary code to remove hard clips after processing.
   * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).
2012-05-25 13:00:45 -04:00
Guillermo del Angel 175bb35e70 Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former) 2012-05-25 12:56:23 -04:00
Mark DePristo d6df817174 Oops, don't enable shadow BCF tests 2012-05-24 13:31:13 -04:00
Mark DePristo 0a86564669 Updated test files didn't make it into last push 2012-05-24 13:29:44 -04:00
Mark DePristo 7280cdf937 Bugfixes and testdata cleanup
-- Cut down the size of a few large files in public/testdata that were only used in part
-- Refactor vcf Filename => shadow BCF filename to BCF2Utils.  Fix bug in WalkerTest due to the way this was handled previously
2012-05-24 13:26:05 -04:00
Mark DePristo e9c22b9aad Final updates to integration tests for BCF2
-- Fully working version
-- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf
-- Moved MedianUnitTest to its proper home in Utils
-- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests.  From this website it's easy to see md5 diffs, etc.  This is a vastly better way to manage unit and integration test output
2012-05-24 10:58:59 -04:00
Mark DePristo ade1843818 Bugfix for not setting header in AbstractVCFCodec 2012-05-24 10:58:58 -04:00
Mark DePristo 6ca71fe3b4 GATK tests use public/testdata not /humgen/ as much as possible 2012-05-24 10:58:58 -04:00
Mark DePristo 69ee4d0454 Moved getMetaDataForField to VariantContextUtils 2012-05-24 10:57:09 -04:00
Mark DePristo cb13f16e90 WalkerTest infrastructure to generate and test shadowBCF file for every generated VCF file
-- Currently disabled
2012-05-24 10:57:09 -04:00
Mark DePristo f77d2e6965 Renamed NO_HEADER to the more accurate no_cmdline_in_header
-- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well
2012-05-24 10:57:08 -04:00
Mark DePristo 4bde24f020 Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header 2012-05-24 10:57:08 -04:00
Mark DePristo 4846bf5c8e @Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf
-- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically
2012-05-24 10:57:07 -04:00
Mark DePristo bb0d87666a Finally just deleted equals() method in GATKArgumentCollection.
-- We never compare these things in the codebase anyway...
2012-05-24 10:57:07 -04:00
Mark DePristo 6f469305ab Don't try to share BCF2 yet 2012-05-24 10:57:06 -04:00
Mark DePristo c8ed0bfc4c Edge case fixes for BCF2
--handle entirely missing GT in a sample in decodeGenotypeAlleles
--Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code
-- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples
-- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys
-- Removed restriction on getPloidy requiring ploidy > 1.  It's logically fine to return 0 for a no called sample
-- getMaxPloidy() in VC that does what it says
-- Support for padding / depadding of generic genotype fields
2012-05-24 10:57:06 -04:00
Mark DePristo 40431890be -- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it
-- BCF2 writer can now work without the contig lines being in the header
-- Made GenomeLocParser a final class
2012-05-24 10:57:06 -04:00
Mark DePristo 6301572009 GenotypeLikelihood PLs are capped at Short.MAX_INT now
-- UserExceptions in BCF2 now where appropriate
-- Asserts for code safety
-- Public -> protected encode(Object v) method is for testing only
2012-05-24 10:57:06 -04:00
Mark DePristo d52bc31a47 Bugfix for doNotWriteGenotypes mode
-- Was outputting GT ./. in sites only mode.  Fixed
2012-05-24 10:57:05 -04:00
Mark DePristo 64d4238e2f 99% working version of BCF2 encoder / decoder
-- fixed final bugs with PL encoding / decoding
-- Ready for testing by other members of the group
-- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations
-- Fixed a nasty bug in the filter field
-- Note that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types

Read 1500 genotypes file in VCF -> VCF : 11 seconds
Read 1500 genotypes file in VCF -> BCF : 9.5 seconds

VariantEval 1500 genotypes file in VCF : 3 seconds
VariantEval 1500 genotypes file in BCF : 3 seconds
2012-05-24 10:57:05 -04:00
Mark DePristo b5bce8d3f9 AD should be UNBOUNDED, actually
-- Pass in # alt alleles as appropriate for getCount in VCF header line
2012-05-24 10:57:05 -04:00
Mark DePristo aaf11f00e3 Near final BCF2 implementation
-- Trivial import changes in some walkers
-- SelectVariants has a new hidden mode to fully decode a VCF file
-- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED but A type, which is actually the right type
-- GenotypeLikelihoods now implements List<Double> for convenience.  The PL duality here is going to be removed in a subsequent commit
-- BugFixes in BCF2Writer.  Proper handling of padding.  Bugfix for nFields for a field
-- padAllele function in VariantContextUtils
-- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.
2012-05-24 10:57:02 -04:00
Mark DePristo dfee17a672 Generalize / unify code for handling strings
-- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder.
-- Unified the type conversion code in BCFWriter to simplify the mapping from VCF type => BCF type and special value recoding
-- Code cleanup and renaming
2012-05-24 10:57:02 -04:00
Mark DePristo b4a5acd6f4 Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't 2012-05-24 10:57:01 -04:00
Mark DePristo 373ae39e86 Testing of BCF codec
-- Rev.d tribble
-- Minor code cleanup
-- BCF2 encoder / decoder use Double not Float internally everywhere
-- Generalized VC testing framework
2012-05-24 10:57:01 -04:00
Mark DePristo fb1911a1b6 -- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder
-- Convenience routine for creating alleles from strings of bases
-- Convenience constructor for VCFFilterHeader line whose description is the same as name
-- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes.  Can be reused throughout code for BCF, VCF, etc.
-- Created basic BCF2WriterCodec tests that consume VariantContextTestProvider contexts, write them to disk with BCF2 writer, and check that they come back equal to the original VariantContexts. Actually worked for some complex tests in the first go
2012-05-24 10:57:01 -04:00
Mark DePristo 4968dcd36a Throw an error when genotype fields with mixed vector lengths are encountered 2012-05-24 10:57:00 -04:00
Mark DePristo afd2f1a3f9 Individual VariantContextWriters are now package protected
-- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it
-- Update build.xml to build vcf.jar with updated paths and bcf2 support.
2012-05-24 10:57:00 -04:00
Mark DePristo 24864fd5b0 GATK now writes BCF output to any file with .bcf extension
-- Moved VCF and BCF writers to variantcontext.writers
-- Updated vcf.jar build path
-- Refactored VCFWriter and other code.  Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory.  Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.
2012-05-24 10:57:00 -04:00
Mark DePristo e2311294c0 Removed unused ManualSortingVCFWriter 2012-05-24 10:56:59 -04:00
Mark DePristo 93cef82637 BCF2 header encoding decoding at final spec 2012-05-24 10:56:58 -04:00
Mark DePristo ce9e9eebb1 No dictionary in header. Now built dynamically from the header in the writer and codec
-- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there
2012-05-24 10:56:58 -04:00
Mark DePristo f0b081a85f Update VCF.jar loading test
-- to reflect new path to VCFWriter
2012-05-24 10:56:58 -04:00
Mark DePristo c3b8048e2e Moving around classes in VCF and BCF2
-- Refactored VCF writers into vcf.writers package
-- Moved BCF2Writer to bcf2.writer
-- Updates to all of the walkers using VCFWriter to reflect new packages
-- A large number of files had their headers cleaned up because of this as well
2012-05-24 10:56:58 -04:00
Mark DePristo 679ffdd333 Move BCF2 from private utils to public codecs 2012-05-24 10:56:56 -04:00
Mark DePristo 450f098a61 BCF2 encoder / decoder implement new site / genotype block organization
-- Supports final organization of data blocks into sites data and genotypes data
2012-05-24 10:56:55 -04:00
Mark DePristo 27b51d4dea Enable on the fly indexing of BCF2 2012-05-24 10:56:54 -04:00
Mark DePristo 81bd7646d6 Fix for MISSING floats
-- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int).
-- Now handles correctly MISSING qual fields
2012-05-24 10:56:53 -04:00
Mark DePristo 3afbc50511 More BCF2 improvements
-- Refactored setting of contigs from VCFWriterStub to VCFUtils.  Necessary for proper BCF working
-- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order.
-- Cleaned up VCFHeader operations
-- BCF now uses the right header files correctly when encoding / decoding contigs
-- Clean up unused tools
-- Refactored header parsing routines to make them more accessible
-- More minor header changes from Intellij
2012-05-24 10:56:52 -04:00
Mark DePristo 0799855479 Archiving GCF
-- Rider update to CramByPiece.scala
2012-05-24 10:56:51 -04:00
Guillermo del Angel 43919078cd Merged bug fix from Stable into Unstable 2012-05-23 21:21:01 -04:00
Guillermo del Angel 4bc04e2a9e Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler. 2012-05-23 21:19:30 -04:00
Ryan Poplin 08dfd6cab6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-21 16:47:07 -04:00
Ryan Poplin 04000d920c Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads. 2012-05-21 16:46:59 -04:00
Eric Banks 666862af19 Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs 2012-05-21 16:03:29 -04:00
Khalid Shakir e57cd78bba Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each.
This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource.

Ex:

public Wrapper getNewWrapper(File path) {
  FileStream myStream = new FileStream(path); // This stream must be eventually closed.
  return new Wrapper(myStream);
}

public void close(Wrapper wrapper) {
  wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream.
}
2012-05-21 15:41:56 -04:00
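One way to avoid the leak described in the commit above is for the adapter to forward close() to the resource it wraps. The sketch below is a minimal illustration under stated assumptions: `ClosingWrapper` and `TrackedResource` are hypothetical names, not GATK classes, and wrapping an IOException in an UncheckedIOException is a choice made only to keep the example short.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical adapter: it owns the wrapped resource and closes it on request.
final class ClosingWrapper implements Closeable {
    private final Closeable wrapped;

    ClosingWrapper(Closeable wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public void close() {
        try {
            wrapped.close(); // forward the request instead of ignoring it
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical stand-in for a file stream; it only records whether close() was called.
final class TrackedResource implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true;
    }
}

final class ClosingWrapperDemo {
    public static void main(String[] args) {
        TrackedResource resource = new TrackedResource();
        ClosingWrapper wrapper = new ClosingWrapper(resource);
        wrapper.close();
        System.out.println("wrapped resource closed: " + resource.closed);
    }
}
```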
Eric Banks 7f5ec17d22 Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table. 2012-05-21 14:16:13 -04:00
Eric Banks 92d8aa3d4c Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels 2012-05-21 09:38:52 -04:00
Eric Banks 3af3834d50 Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian):
* For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE.  Switched over to booleans.
* We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to stdout.
2012-05-18 11:55:41 -04:00
Eric Banks 26968ae8eb Forgot that the VCFStreamingIntegrationTest uses VE 2012-05-18 02:51:53 -04:00
Eric Banks 52c206d5db Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports. 2012-05-18 02:32:20 -04:00
Eric Banks 03d40272c8 Removed old GATKReport code and moved the new stuff in its place. 2012-05-18 01:44:31 -04:00
Eric Banks a26b04ba17 Extensive refactoring of the GATKReports. This was a beast.
The practical differences between version 1.0 and this one (v1.1) are:

* the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables.
* no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table.
* no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables.

Integration tests change because table headers are different.
Old classes are still lying around.  Will clean those up in a subsequent commit.
2012-05-18 01:11:26 -04:00
Guillermo del Angel 5189b06468 New annotation for indels that describes whether they are STRs and their characteristics. If an indel is an STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality. 2012-05-17 15:28:19 -04:00
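The RU/RPA example in the commit above can be reproduced with a small sketch. It assumes the repeat unit is already known and only counts consecutive copies of it at the start of each allele's sequence; `RepeatCounter` and `countLeadingRepeats` are hypothetical names, and this is not the annotation's actual implementation.

```java
// Hypothetical illustration of the RPA values from the commit message.
final class RepeatCounter {
    /** Counts how many consecutive copies of unit appear at the start of bases. */
    static int countLeadingRepeats(String bases, String unit) {
        if (unit.isEmpty()) {
            throw new IllegalArgumentException("unit must be non-empty");
        }
        int count = 0;
        int pos = 0;
        // startsWith(prefix, offset) returns false once fewer than unit.length() bases remain.
        while (bases.startsWith(unit, pos)) {
            count++;
            pos += unit.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String unit = "AT";
        // Reference context first, then the two alternate contexts from the example.
        String[] contexts = {"ATATAT", "ATAT", "ATATATAT"};
        StringBuilder rpa = new StringBuilder();
        for (String context : contexts) {
            if (rpa.length() > 0) {
                rpa.append(',');
            }
            rpa.append(countLeadingRepeats(context, unit));
        }
        System.out.println("RU=" + unit + ";RPA=" + rpa); // prints RU=AT;RPA=3,2,4
    }
}
```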
Eric Banks 0f7c917e7a Better error checking and messages for bad alleles 2012-05-17 13:36:42 -04:00
Eric Banks d44886d9e8 Very naughty bug: VE output is not at all gatherable but no one told this to Queue. Fixed. 2012-05-15 10:29:04 -04:00
Eric Banks 819c3d0c15 Adding to the Hrun docs 2012-05-15 10:27:52 -04:00
Guillermo del Angel 5fc3adbb04 One more VariantsToTable bug fix 2012-05-14 14:10:07 -04:00
Guillermo del Angel 04d691f04a Forgot to update MD5's due to new Exact AF model in pool caller (all changes legit, minor QUAL/QD/SB differences). Fixed bug in VariantsToTable from previous commit 2012-05-14 14:01:29 -04:00
Guillermo del Angel ae26f0fe14 a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing 2012-05-14 10:55:35 -04:00
Ryan Poplin c9dd0f3173 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-10 13:09:10 -04:00
Ryan Poplin 0cdadffe14 Committing the best of the frantic pre-CSHL experiments: Better algorithm for partitioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes. 2012-05-10 13:09:03 -04:00
Guillermo del Angel 89f8a6b2e6 Revert bad part of last commit that shouldn't have been pushed 2012-05-10 10:41:08 -04:00
Guillermo del Angel 27b1aa5dd3 Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.acceptableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's 2012-05-10 10:29:19 -04:00
Eric Banks 4f37d6d399 Fixing docs 2012-05-10 00:56:00 -04:00
Mark DePristo c81acfc15d Working implementation of BCF2
-- Nearly complete on spec implementation.  Slow but clean
-- Some refactoring of VariantContext to support common functions for BCF and VCF
2012-05-08 19:46:51 -04:00
Mark DePristo a5193c2399 Mostly complete reference implementation of BCF2
-- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF
2012-05-08 19:46:51 -04:00
Eric Banks 473d07b0c5 fixing up docs from previous Pool Caller commit 2012-05-08 11:02:55 -04:00
Eric Banks b4999d14c1 updating docs 2012-05-08 10:58:46 -04:00
Guillermo del Angel 33a1dd2048 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 10:42:12 -04:00
Eric Banks 5cf4fd63c2 Catch malformed base qualities and throw as a User Error 2012-05-08 09:34:57 -04:00
Guillermo del Angel a4f4b5007b Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-08 09:34:33 -04:00
Guillermo del Angel 605984353f Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model. 2012-05-08 09:33:38 -04:00
Eric Banks c40cda7e3c Nope, loads of integration tests had to be changed. 2012-05-07 14:30:42 -04:00
Eric Banks 66838a073e Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed... 2012-05-07 12:20:11 -04:00
David Roazen 6b769e91d8 BCF2: third checkpoint
* writer mostly implemented
* walkers to convert BCF2 <-> VCF
* almost working for sites-only files; genotypes still need work
* initial performance tests this afternoon will be on sites-only files
2012-05-04 13:00:15 -04:00
Eric Banks f3433201b1 Merged bug fix from Stable into Unstable 2012-05-03 11:11:00 -04:00
Eric Banks 557da77a1a Don't compute QD if there is no QUAL; added integration test for this 2012-05-03 11:02:37 -04:00
Eric Banks 1fc7b5d58b Merged bug fix from Stable into Unstable 2012-05-03 10:37:58 -04:00
Laurent Francioli 567d01cee8 - Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:49 -04:00
Laurent Francioli 96e5a26223 PED support for Inbreeding Coefficient annotation
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-05-03 10:36:20 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream
as an argument not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides to its subclasses the same String decode,
readHeader functionality as before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (it's a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's a particularly
clean way of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so it's no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integration tests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Eric Banks e448cfcc59 Forgot to update these md5s 2012-05-02 21:09:50 -04:00
Khalid Shakir b8b7f28aa9 Revving Picard to pick up new SamFileHeaderMerger.
Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut().
In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124.
- Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs.
- Updated md5s to match new expectations after looking at TLEN diff engine output.
2012-05-02 16:47:28 -04:00
Mauricio Carneiro f51a1d0d61 Better error message to the BAMScheduler
In the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.
2012-05-02 16:10:00 -04:00
Mauricio Carneiro 940029fa5d Fixing on-the-fly recalibration (caught by Ryan)
low quality bases in the tails were being turned to N's in the final read.
2012-05-02 16:06:04 -04:00
Eric Banks 623b36fbc4 Add header lines for AC,AF, and AN tags 2012-05-02 15:33:34 -04:00
Guillermo del Angel 429800a192 Fix corner case rounding issue in MathUtils unit test: 10^logFactorial(4) was 23.999999..., which yielded 23 if cast directly, so do pre-rounding to ensure correct integer result if caller will cast value. 2012-05-02 09:57:06 -04:00
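The rounding issue in the commit above comes from Java's double-to-int cast truncating toward zero. The sketch below illustrates pre-rounding with Math.round; `RoundBeforeCast`, `log10Factorial`, and `roundedFactorial` are hypothetical names, not the MathUtils methods, and the log-factorial is computed by a plain summation chosen only for this example.

```java
// Hypothetical illustration of rounding before casting; not the MathUtils code.
final class RoundBeforeCast {
    /** Computes log10(n!) by summing log10(i) for i = 2..n. */
    static double log10Factorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log10(i);
        }
        return sum;
    }

    /** Recovers n! from its log10 value, rounding before the narrowing cast. */
    static long roundedFactorial(int n) {
        return Math.round(Math.pow(10.0, log10Factorial(n)));
    }

    public static void main(String[] args) {
        double value = Math.pow(10.0, log10Factorial(4));
        // A plain cast truncates, so a value just below 24.0 would become 23.
        System.out.println("raw value: " + value);
        System.out.println("cast only: " + (int) value);
        System.out.println("rounded first: " + roundedFactorial(4));
    }
}
```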
Guillermo del Angel 76a95fdedf Full implementation of multiallelic exact model for pools. Still super-linear so not useable at scale but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation which is more expensive (doesn't seem to save much time either in PoolCaller nor in UG though). 2012-05-02 09:24:28 -04:00
Joel Thibault 4d732fa586 Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb 2012-05-01 18:23:51 -04:00
Eric Banks 619a69a5f1 As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?). 2012-05-01 16:18:24 -04:00
Joel Thibault c255dd5917 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-01 16:10:38 -04:00
Ryan Poplin 51af61b5d7 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-05-01 16:07:23 -04:00
Ryan Poplin fc55dcec3c Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now. 2012-05-01 16:02:36 -04:00
Ryan Poplin 20a0078f23 Merging active regions across shard boundaries if they are contiguous, have the same active status and don't grow too big. 2012-05-01 15:51:36 -04:00
Eric Banks 0f3af9555b Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case. 2012-05-01 14:58:06 -04:00
Joel Thibault aa4d41cce0 Minor cleanup before push 2012-05-01 14:16:44 -04:00
Joel Thibault b101b9c30b Add Mongo switch 2012-05-01 14:00:48 -04:00
Joel Thibault 1b609e9075 Move Mongo to server couchdb 2012-05-01 13:59:47 -04:00
Joel Thibault fd57d27f45 Move MongoDB connection handling to a separate class 2012-05-01 13:59:37 -04:00
Joel Thibault db3cd1abd5 Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes. 2012-05-01 13:57:23 -04:00
Joel Thibault 04e1be9106 Better handling of Mongo errors + exceptions 2012-05-01 13:57:23 -04:00
Joel Thibault ca737479cf Query for stop locations because we don't have that information in the reference 2012-05-01 13:57:23 -04:00
Joel Thibault 1cda87a4ad Set ROD priority list to input 2012-05-01 13:57:23 -04:00
Joel Thibault a7fe847faf Set the priority list and don't bother combining if not needed 2012-05-01 13:57:23 -04:00
Joel Thibault f739305f43 Combine the variants found at a location 2012-05-01 13:57:23 -04:00
Joel Thibault 020f884d5a Use new key of source ROD plus alleles 2012-05-01 13:57:23 -04:00
Joel Thibault 221ce9c3d6 Add alleles to the primary key 2012-05-01 13:57:23 -04:00
Joel Thibault 3198ce5471 Can have multiple variants at a location 2012-05-01 13:57:22 -04:00
Joel Thibault 11ed8e61c9 Add referenceBaseForIndel to the Mongo VariantContext objects 2012-05-01 13:53:44 -04:00
Joel Thibault 7ed0ee7ed0 Skip locations with no genotypes instead of throwing an NPE 2012-05-01 13:53:44 -04:00
Joel Thibault 4bdfeacdaa Handle multiple samples/genotypes per location
TODO: sample selection
2012-05-01 13:53:43 -04:00
Joel Thibault 1f7c628796 Insert the ROD filename into MongoDB as part of the primary key 2012-05-01 13:53:43 -04:00
Joel Thibault bb8a6e9b0a Initial test of write and read from MongoDB 2012-05-01 13:53:43 -04:00
David Roazen c0084c741b Pilot BCF2 Implementation: Checkpointing the code
* Not working yet, still very much a work-in-progress with lots of placeholders
* Needed to check this in to enable possible collaboration, since it's
  going slower than anticipated and the conference deadline looms.
2012-05-01 12:23:10 -04:00
Eric Banks 0c8e801021 Removing public to private dependency 2012-05-01 11:04:11 -04:00
Eric Banks e964d17518 Removing public to private dependency 2012-05-01 11:02:28 -04:00
Mauricio Carneiro 462450c3e3 disabling all BQSR unit tests
With the changes to the cycle covariate, some tests need updates; others need to be completely rewritten.
2012-04-30 14:39:55 -04:00
Guillermo del Angel e185632013 Exhaustive unit tests for Pool SNP genotype likelihoods:
a) Add ability for ErrorModel to be specified by external log-probability vector for testing.
b) For a given depth and ploidy (= 2*samples/pool), create artificial high-quality pileup testing from AC=0 to AC=ploidy, and test that pool GLs have expected content.
c) Misc. refactorings, cleanups, and beautification.
2012-04-30 14:29:46 -04:00
Christopher Hartl 7d029b9a28 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-30 12:16:30 -04:00
Christopher Hartl 944a7d815e Bringing VQSRV3 up to date. Lots of new features (un-classifying the worst-performing training sites, treating the x% best/worst sites as positive/negative points, ability to pass in a monomorphic track to see ROC curves output). Minor changes to AlleleBalance: the weighted average was incorrectly specified (using logscale actually biased the average towards the AB of low-quality genotypes), and breaking out AB by het, hom, and diploid to bring it in line with some (private) changes to the indel likelihood model that (correctly) computes these values for indels. 2012-04-28 11:31:03 -04:00
Ryan Poplin 54a9bc2da2 Bug fix in reverse trim alleles for the case of mixed records that become non-mixed after subsetting the alleles. 2012-04-28 09:12:26 -04:00