Commit Graph

590 Commits (ada4c5a13cee8abba445fc8cdd99b66add8bfc3e)

Author SHA1 Message Date
sjia ada4c5a13c Small change to debug printing code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1521 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 18:31:21 +00:00
kiran c3aaca1262 Improvements to make this work with uncompressed fastq files. Pulled the fastq parser out into its own SAMFileReader-like entity.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1520 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 17:20:16 +00:00
asivache 499b3536a4 Changed to use AlignmentUtils.isReadUnmapped() for better consistency with SAM spec; also, it is now explicitly enforced that unmapped reads have <NO_...> values set for ref contig and start upon "remapping"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1519 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 16:45:07 +00:00
ebanks 5bd99fc1c4 VariantFiltration moved to core.
Another win for the team.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1517 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 15:41:41 +00:00
chartl 5130ca9b94
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1516 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 15:17:02 +00:00
jmaguire e2780c17af Checkin of the Multi-Sample SNP caller.
Doesn't work yet; same command I used to use now causes GATK to throw an exception.

Will check with Matt & Aaron tomorrow, then do a regression test.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1509 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 00:23:28 +00:00
ebanks 55013eff78 Re-revert back to point estimation for now. We need to do this right, just not yet.
Also, it's safer to let colt do the log factorial calculations for us. 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1503 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 15:33:18 +00:00
ebanks 24d809133d Oops - comment out the printouts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1500 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 01:45:56 +00:00
ebanks 91ccb0f8c5 Revert to having these filters use integration over binomial probs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1499 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 01:40:22 +00:00
aaron 4a1d79cd7b added a flag, maximum_reads_at_locus, shortName "mrl", which limits the number of reads we add to the locusByHanger. In some bam files misalignment produces pile-ups of 750K or more reads. We now limit this to the default of 100K reads.
The user is warned if a locus exceeds this threshold, and no more reads are added.

Also CombineDup walker had an incorrect package name.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1496 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 04:21:58 +00:00
ebanks 0addae967a IndelArtifact filter can now handle filtering false SNPs that occur within the span of an indel but after the first position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1495 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 03:34:39 +00:00
asivache 591f8eedbb Added setName() and getName() (however, not used anywhere yet). Now can set the name of the fasta record manually to whatever, however it will work only if done early enough. If the fasta record already started printing itself (i.e. the header line is already done), setName() will throw an exception. Could be too entangled, may reverse this back...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1493 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 18:09:55 +00:00
asivache c9eb193c7f Now recognizes a special name for a bound rod track: snpmask. If a rod with this name is bound, then ONLY snps from that track will be used (to set alt reference bases to N's), but indels will be ignored. This helps when an alt. ref has to be created for a set of indel calls, and another rod (e.g. dbSNP) is used to put N's in (for sequenom). If dbSNP rod is not marked as "snpmask", the indels reported there will make their way into the alt. reference output and mess it up.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1492 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 18:05:57 +00:00
ebanks 8e3c3324fa Added filter for SNPs cleaned out by the realigner.
It uses the realigner output for filtering; in addition, dbsnp indels partially work; IndelGenotyper calls don't yet work.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1489 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 04:32:32 +00:00
ebanks 463f80c03e Require each filter or feature to declare whether or not they want mapping quality zero reads in the alignment context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1487 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:37:24 +00:00
ebanks 1a299dd459 Require each filter or feature to declare whether or not they want mapping quality zero reads in the alignment context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1486 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:31:37 +00:00
ebanks e70101febc Add a VEC filter for clustered SNP calls that takes advantage of the new windowed approach; delete the old standalone walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1485 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:14:42 +00:00
ebanks 215e908a11 Reworking of the VariantFiltration system to allow for a windowed view of variants and inclusion of more data to the various filters.
This now allows us to incorporate both the clustered SNP filter and a SNP-near-indels filter, which otherwise wasn't possible.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1484 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 02:16:39 +00:00
depristo 49a7babb2c Better organization of Genotype likelihood calculations. NewHotness is now just GenotypeLikelihoods. There are 1, 3, and empirical base error models available as subclasses, along with a simple way to make this (see the factory).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1481 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 19:16:30 +00:00
depristo 5af4bb628b Intermediate checking before code reorganization. Full blown support for empirical transition probs in SSG for all platforms. Support for defaultPlatform arg in SSG. Renaming classes for final cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1479 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 17:34:43 +00:00
depristo 6ab9ddf9f5 Significant output formatting improvements. SNPs as indels analysis. heterozygosity rate calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1478 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-29 21:49:09 +00:00
depristo f0179109fa Removing min confidence for on/off genotype
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1473 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 01:04:13 +00:00
depristo dc9d40eb9a Now requires a minimum genotype LOD before applying tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1471 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:19:23 +00:00
depristo a639459112 Trivial consistency change from char in to char out, not char in to byte out
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1466 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 23:37:37 +00:00
chartl 6012f7602b @ minor fixes to CoverageAndPowerWalker and AnalyzePowerWalker (switching to By Reference traversal, spitting out Syzygy position for sanity check)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1465 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 21:44:18 +00:00
chartl bd1e679bc5 @ Fixed issues with AnalyzePowerWalker which depended on CoverageAndPowerWalker. The latter was changed but not the former. Now fixed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1464 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 20:23:41 +00:00
kiran a17dad5fa9 Converts from fastq.gz to unaligned BAM format. Accepts a single fastq (for single-end run) or two fastqs (for paired-end run). Also allows you to set certain BAM metadata (read groups, etc.).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1463 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 20:20:09 +00:00
chartl 8740124cda @ListUtils - Bugfix in getQScoreOrderStatistic: method would attempt to access an empty list fed into it. Now it checks for null pointers and returns 0.
@MathUtils - added a new method: cumBinomialProbLog which calculates a cumulant from any start point to any end point using the BinomProbabilityLog calculation.

@PoolUtils - added a new utility class specifically for items related to pooled sequencing. A major part of the power calculation is now to calculate powers
             independently by read direction. The only method in this class (currently) takes your reads and offsets, and splits them into two groups
             by read direction.

@CoverageAndPowerWalker - completely rewritten to split coverage, median qualities, and power by read direction. Makes use of cumBinomialProbLog rather than
                          doing that calculation within the object itself.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1462 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 19:31:53 +00:00
chartl 1da45cffb3 New:
Minor changes to CoverageAndPowerWalker bootstrapping (faster selection of indices).

Entirely new Artificial Pool Walker (ArtificialPoolWalkerMk2), will likely replace ArtificialPoolWalker on the next commit. Adapted the method of sampling, and added a helper context class: ArtificialPoolContext which carries much of the burden of calculation and data handling for the walker. The walker itself maps and reduces ArtificialPoolContexts.

Cheers!






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1461 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-26 21:42:35 +00:00
chartl 92ea947c33 Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
       binomial pmf to avoid the coefficient blowing up and thus
       returning Infinity or NaN (or in some very strange cases
       -Infinity). The log calculation compares very well, it seems
       with our current method. It's in MathUtils but could stand
       testing against rigorous truth data before becoming standard.

Added median calculator functions to ListUtils

getQScoreMedian is a new utility I wrote that given reads and
       offsets will find the median Q score. While I was at it, I wrote
       a similar method, getMedian, which will return the median of any
       list of Comparables, independent of initial order. These are in
       ListUtils.

Added a new poolseq directory and three walkers

CoverageAndPowerWalker is built on top of the PrintCoverage walker
       and prints out the power to detect a mutant allele in a pool of
       2*(number of individuals in the pool) alleles. It can be flagged
       either to do this by bootstrapping, or by pure math with a
       probability of error based on the median Q-score. This walker
       compiles, runs, and gives quite reasonable outputs that compare
       visually well to the power calculation computed by Syzygy.

ArtificialPoolWalker is designed to take multiple single-sample
       .bam files and create a (random) artificial pool. The coverage of
       that pool is a user-defined proportion of the total coverage over
       all of the input files. The output is not only a new .bam file,
       but also an auxiliary file that has for each locus, the genotype
       of the individuals, the confidence of that call, and that person's
       representation in the artificial pool .bam at that locus. This
       walker compiles and, uhh, looks pretty. Needs some testing.

AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calculations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:27:50 +00:00
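The log-space binomial calculation described in commit 92ea947c33 can be illustrated with a minimal Python sketch. This is not the MathUtils code from that commit: the function name `binom_probability_log` and the use of `math.lgamma` for the log-factorials are assumptions made for illustration. The idea is to sum logarithms of the coefficient and the probability terms, and exponentiate only at the end, so the binomial coefficient never overflows to Infinity or NaN.

```python
import math


def binom_probability_log(k: int, n: int, p: float) -> float:
    """Binomial pmf P(X = k) for X ~ Binomial(n, p), evaluated in log space.

    Hypothetical illustration only; not the GATK MathUtils implementation.
    """
    if not 0 <= k <= n:
        return 0.0
    # Handle the degenerate probabilities explicitly to avoid log(0).
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    # log C(n, k) from log-gamma, so the coefficient itself is never formed.
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_pmf = log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)
    return math.exp(log_pmf)
```

For a depth such as n = 5000, a floating-point binomial coefficient overflows, while the log-space form stays finite; for example, `binom_probability_log(2500, 5000, 0.5)` is roughly 0.0113.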
kiran 478f426727 Fixed a missing method implementation in these two files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1459 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:21:58 +00:00
kiran f12ea3a27e Added ability for all filters to return a probability for a given variant - interpreted as the probability that the given variant should be included in the final set. The joint probability of all the filters is computed to determine whether a variant should stay or go. At the moment, this is only visible in verbose mode (specify -V). Also removed 'learning mode'; now, filters emit important stats no matter what. Various code cleanups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1458 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:17:56 +00:00
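Commit f12ea3a27e says each filter returns an inclusion probability and that the joint probability decides whether a variant stays. The commit message does not say how the joint probability is formed or what cutoff is used, so the sketch below assumes the filters are independent (a plain product) and uses a hypothetical `min_joint_prob` threshold; both function names are also invented for illustration.

```python
import math
from typing import Iterable


def joint_inclusion_probability(filter_probs: Iterable[float]) -> float:
    """Product of per-filter inclusion probabilities (independence is assumed)."""
    probs = list(filter_probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.prod(probs)


def keep_variant(filter_probs: Iterable[float], min_joint_prob: float = 0.5) -> bool:
    """Keep the variant when the joint probability reaches the (hypothetical) cutoff."""
    return joint_inclusion_probability(filter_probs) >= min_joint_prob
```

Under these assumptions, a variant scored 0.9 and 0.8 by two filters has joint probability 0.72 and is kept, while one scored 0.9 and 0.5 falls to 0.45 and is dropped.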
asivache 0bdecd8651 A most stupid bug. In cases when more than one indel variant was present in cleaned bam file, the "consensus" (max. # of occurrences) call was computed incorrectly, and most of the time the call itself was not made at all. Fortunately, the locations where we see multiple indels are a minority, and many of them are suspicious anyway (manifestation of alignment problems?). Could change results of POOLED calls though.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1448 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 22:31:44 +00:00
kcibul 6c0adc9145 reuse fasta file reader
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1446 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 16:01:58 +00:00
ebanks 10c98c418b Walker to determine the concordance of 2 genotype call sets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1443 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 01:32:44 +00:00
ebanks 1d74143ef4 A convenience argument - for Mark - so that you don't have to specify all the output file names
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1442 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 00:49:12 +00:00
ebanks 82e2b7017e Prevent array bounds errors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1435 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 16:54:31 +00:00
ebanks 26a6f816c9 set default value for output format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1434 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 16:17:09 +00:00
ebanks 9b1d7921e8 added filter based on concordance to another call set
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1432 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 15:16:30 +00:00
ebanks b2a18a9d61 - first pass at a basic indel filter (for now, based on size and homopolymer runs)
- fix simple indel rod printout


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1431 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 03:04:12 +00:00
ebanks 78439f7305 Modify Sequenom input format based on official documentation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1430 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 01:42:57 +00:00
ebanks d4808433a1 Added option to output the locations of indels in the alternate reference
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1424 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-16 03:46:36 +00:00
ebanks 4b6ddc55bd Merge our 2 fastq writers into 1: incorporate Kiran's secondary-base file writer into the fasta/fastq writers
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1423 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-14 20:55:23 +00:00
ebanks 0ec581080c Refactoring the code; also, now it prints continuously instead of potentially storing one long string.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1421 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-13 01:32:46 +00:00
asivache 2a01e71277 A very simple standalone filter for fooling around with the data: can extract only mapped or only unmapped reads, only reads with mapping quals > X, reads with average base qual > Y, reads with min base qual > Z, reads with edit distance from the ref > MIN and/or < MAX
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1420 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-12 20:28:51 +00:00
asivache ebec0ec171 A standalone companion to BamToFastqWalker: does the same thing but without calling in gatk's heavy artillery (does not "require" a reference either). Extracts seqs and quals and places them into fastq; along the way it also reverse complements reads that align to the negative strand (so that fastq contains reads as they come from the machine).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1419 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-12 20:24:37 +00:00
asivache 112a283f54 be nice, don't forget to close the reader when done
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1418 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-12 20:19:56 +00:00
asivache ba2a3d8a58 Reverse qualities when read seq. is reverse complemented
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1417 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-12 20:17:35 +00:00
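The fix in commit ba2a3d8a58 (and the strand handling described in ebec0ec171) comes down to one rule: when a negative-strand read is reverse complemented to restore machine orientation, its quality string must be reversed as well so each quality still lines up with its base. A minimal Python sketch follows; the function name and the string-based representation of bases and qualities are assumptions for illustration, not the code from the commit.

```python
# Complement table for the bases handled here; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def to_machine_orientation(bases: str, quals: str, negative_strand: bool):
    """Return (bases, quals) as the sequencer produced them.

    Reads aligned to the negative strand are stored reverse complemented,
    so both the bases and the per-base qualities are flipped back.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and qualities must have the same length")
    if not negative_strand:
        return bases, quals
    return bases.translate(_COMPLEMENT)[::-1], quals[::-1]
```

For example, a negative-strand read stored as `AACG` with qualities `IIH#` comes back as `CGTT` with qualities `#HII`.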
ebanks 143f8eea4e option to output in sequenom input format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1415 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-12 16:50:37 +00:00
ebanks 7f1159b6a9 Added option to mask out SNP sites with "N"s in the new reference.
This is useful when producing Sequenom input files for validating indels...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1414 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-12 15:17:45 +00:00