Commit Graph

8857 Commits (0f5674b95e9dd18df55cf09dbdd79f8bc9ec46fa)

Author SHA1 Message Date
Guillermo del Angel 0f5674b95e Redid fix for corner case when forming consensus with reads that start/end with insertions and that don't agree with each other in inserted bases: since I can't iterate over the elements of a HashMap because keys might change during iteration, and since I can't use ConcurrentHashMaps, the code now copies structure of (bases, number of times seen) into ArrayList, which can be addressed by element index in order to iterate on it. 2012-02-20 09:12:51 -05:00
Ryan Poplin fe102a5d47 Fix for my renaming of the BQSR walker 2012-02-18 11:13:20 -05:00
Ryan Poplin 3d9eee4942 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-18 10:55:29 -05:00
Ryan Poplin a8be96f63d This caching in the BQSR seems to be too slow now that there are so many keys 2012-02-18 10:54:39 -05:00
Ryan Poplin 78718b8d6a Adding Genotype Given Alleles mode to the HaplotypeCaller. It constructs the possible haplotypes via assembly and then injects the desired allele to be genotyped. 2012-02-18 10:31:26 -05:00
Guillermo del Angel e724c63f2b Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a kep from a HashMap that's being iterated on 2012-02-17 17:18:43 -05:00
Guillermo del Angel f2ef8d1d23 Reverting last commit until I learn how to effectively replicate and debug pipeline test failures, and until I also learn how to effectively remove a kep from a HashMap that's being iterated on 2012-02-17 17:15:53 -05:00
Guillermo del Angel 3e031a540f Solve merge conflict 2012-02-17 10:56:03 -05:00
Guillermo del Angel cd352f502d Corner case bug fix: if a read starts with an insertion, when computing the consensus allele for calling the insertion was only added to the last element in the consensus key hash map. Now, an insertion that partially overlaps with several candidate alleles will have their respective count increased for all of them 2012-02-17 10:21:37 -05:00
Guillermo del Angel 2f08846d82 Merged bug fix from Stable into Unstable 2012-02-14 21:26:25 -05:00
Guillermo del Angel 7dc6f73399 Bug fix for validation site selector: records with AC=0 in them were always being thrown out if input vcf was sites-only, even when -ignorePolymorphicStatus flag was set 2012-02-14 21:11:24 -05:00
Ryan Poplin 30085781cf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-14 14:01:20 -05:00
Ryan Poplin ae5b42c884 Put base insertion and base deletions in the SAMRecord as a string of quality scores instead of an array of bytes. Start of a proper genotype given alleles mode in HaplotypeCaller 2012-02-14 14:01:04 -05:00
David Roazen 8f7587048c Update the expected novel TiTv in the HybridSelectionPipelineTest
The expected novel TiTv has changed for this set of variants now that
multi-allelic mode is on by default.
2012-02-13 20:25:52 -05:00
David Roazen dfcdf92afa Revert "Disable HaplotypeCaller integration tests in Stable"
These tests should remain enabled in Unstable.

This reverts commit 15c5b7aee1327f9dc012d2168f127a4700fe5064.
2012-02-13 16:37:31 -05:00
David Roazen 85d31f80a2 Merged bug fix from Stable into Unstable 2012-02-13 16:37:11 -05:00
David Roazen d5fce22d78 Disable HaplotypeCaller integration tests in Stable
These tests use out-of-date files that no longer exist, and only
need to be enabled in Unstable for now.
2012-02-13 16:28:19 -05:00
David Roazen 03e5184741 Fix serious engine bug that could cause reads to be dropped under certain circumstances
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.

Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.

Thanks to Khalid, who did at least as much work on this bug as I did.
2012-02-13 16:25:21 -05:00
Ryan Poplin 8742f5e36c Updating BQSR scala script to take any number of known sites files and to use the scatter count input argument. 2012-02-13 15:44:30 -05:00
Eric Banks ad90af94ed Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-13 15:10:10 -05:00
Eric Banks 0920a1921e Minor fixes to splitting multi-allelic records (as regards printing indel alleles correctly); minor code refactoring; adding integration tests to cover +/- splitting multi-allelics. 2012-02-13 15:09:53 -05:00
Eric Banks 14981bed10 Cleaning up VariantsToTable: added docs for supported fields; removed one-off hidden arguments for multi-allelics; default behavior is now to include multi-allelics in one record; added option to split multi-allelics into separate records. 2012-02-13 14:32:03 -05:00
Ryan Poplin e9338e2c20 Context covariate needs to look in the reverse direction for negative stranded reads. 2012-02-13 13:40:41 -05:00
Ryan Poplin 41ffd08d53 On the fly base quality score recalibration now happens up front in a SAMIterator on input instead of in a lazy-loading fashion if the BQSR table is provided as an engine argument. On the fly recalibration is now completely hooked up and live. 2012-02-13 12:35:09 -05:00
Eric Banks c8c06c7753 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-11 23:02:19 -05:00
Eric Banks ac9250b12b Don't assume chrom20, just pull from the file list 2012-02-11 23:02:05 -05:00
Ryan Poplin 3caa1b83bb Updating HC integration tests 2012-02-11 11:48:32 -05:00
Ryan Poplin 9b8fd4c2ff Updating the half of the code that makes use of the recalibration information to work with the new refactoring of the bqsr. Reverting the covariate interface change in the original bqsr because the error model enum was moved to a different class and didn't make sense any more. 2012-02-11 10:57:20 -05:00
Eric Banks f52f1f659f Multiallelic implementation of the TDT should be a pairwise list of values as per Mark Daly. Integration tests change because the count in the header is now A instead of 1. 2012-02-10 14:15:59 -05:00
Mauricio Carneiro f1990981fc A little BQSR scala script to use with scatter/gather 2012-02-10 14:00:53 -05:00
Mauricio Carneiro a7c6f255e9 Adding the old gatherer to BQSR
for now, the old gatherer will still work for us to scatter/gather our tests.
2012-02-10 13:33:57 -05:00
Mauricio Carneiro 1fb19a0f98 Moving the covariates and shared functionality to public
so Ryan can work on the recalibration on the fly without breaking the build. Supposedly all the secret sauce is in the BQSR walker, which sits in private.
2012-02-10 11:44:01 -05:00
Mark DePristo 48cc4b913a bugfix for incremental refresh in gsafolkLSFlogs 2012-02-10 11:30:51 -05:00
Eric Banks 8ad4046642 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-10 11:18:02 -05:00
Mark DePristo 0722df46db gsafolkLSFLogs creates a subset of LSF MySQL db related to gsafolk only
Creates a SQL table in the MySQL server calcium at the Broad that contains only
key information about the LSF usage of members of the gsafolk fairshare group

Does this by first building a list of gsafolk uids, selecting lsf info from matter's
table, and inserts this information into the gsafolk_lsf queue as part of the
GATK schema.  The standard way to run this is with incremental refreshes enabled,
so that the program only fetches new raw lsf records with timestamps beyond the
max timestamp present in the GATK LSF table.

The default way to run this program is via cron with 'python private/python/gsafolkLSFLogs.py'
2012-02-10 11:12:20 -05:00
Eric Banks 5e18020a5f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-10 11:08:33 -05:00
Eric Banks f53cd3de1b Based on Ryan's suggestion, there's a new contract for genotyping multiple alleles. Now the requester submits alleles in any arbitrary order - rankings aren't needed. If the Exact model decides that it needs to subset the alleles because too many were requested, it does so based on PL mass (in other words, I moved this code from the SNPGenotypeLikelihoodsCalculationModel to the Exact model). Now subsetting alleles is consistent. 2012-02-10 11:07:32 -05:00
Mauricio Carneiro 5af373a3a1 BQSR with indels integrated!
* added support to base before deletion in the pileup
   * refactored covariates to operate on mismatches, insertions and deletions at the same time
   * all code is in private so original BQSR is still working as usual in public
   * outputs a molten CSV with mismatches, insertions and deletions, time to play!
   * barely tested, passes my very simple tests... haven't tested edge cases.
2012-02-09 18:46:45 -05:00
Eric Banks 7a937dd1eb Several bug fixes to new genotyping strategy. Update integration tests for multi-allelic indels accordingly. 2012-02-09 16:14:22 -05:00
Eric Banks 0f728a0604 The Exact model now subsets the VC to the first N alleles when the VC contains more than the maximum number of alleles (instead of throwing it out completely as it did previously). [Perhaps the culling should be done by the UG engine? But theoretically the Exact model can be called outside of the UG and we'd still want the context subsetted.] 2012-02-09 14:02:34 -05:00
Matt Hanna aa097a83d5 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-09 11:26:48 -05:00
Matt Hanna b57d4250bf Documentation request by Eric. At each stage of the GATK where filtering occurs, added documentation suggesting the goal of the filtering along with examples of suggested inputs and outputs. 2012-02-09 11:24:52 -05:00
Ryan Poplin 5b3d875833 Incorporating Mark's suggestions on the AnalyzeCovariates plots. 2012-02-09 09:05:08 -05:00
Mauricio Carneiro d561914d4f Revert "First implementation of GATKReportGatherer"
premature push from my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.

This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
2012-02-08 23:28:55 -05:00
Ryan Poplin 270b160d87 Incorporating feedback from Mauricio on the plots 2012-02-08 16:26:32 -05:00
Ryan Poplin 4316437a62 Initial version of R script that will be called by new BQSR to replace the entire AnalyzeCovariates program 2012-02-08 16:00:12 -05:00
Eric Banks 2f800b078c Changes to default behavior of UG: multi-allelic mode is always on; max number of alternate alleles to genotype is 3; alleles in the SNP model are ranked by their likelihood sum (Guillermo will do this for indels); SB is computed again. 2012-02-08 15:27:16 -05:00
Matt Hanna 51ac87b28c Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-08 08:43:55 -05:00
Matt Hanna 5b58fe741a Retiring Picard customizations for async I/O and cleaning up parts of the code to use common Picard utilities I recently discovered.
Also embedded bug fix for issues reading sparse shards and did some cleanup based on comments during BAM reading code transition meetings.
2012-02-08 08:34:37 -05:00
Khalid Shakir cda1e1b207 Minor manual merge update for List class to Seq interface usage. 2012-02-08 02:24:54 -05:00