Commit Graph

  • 6313c465fb we want the RMS of the read qualities, not the RMS of the RMS of the read qualities. aaron 2009-08-20 21:56:29 +0000
  • 6c0adc9145 reuse fasta file reader kcibul 2009-08-20 16:01:58 +0000
  • 0386e110cf some documentation changes, add a couple of simple checks aaron 2009-08-20 05:20:27 +0000
  • 026e09ec07 adding the package description for the VCF validator aaron 2009-08-20 04:46:27 +0000
  • 10c98c418b Walker to determine the concordance of 2 genotype call sets. ebanks 2009-08-20 01:32:44 +0000
  • 1d74143ef4 A convenience argument - for Mark - so that you don't have to specify all the output file names ebanks 2009-08-20 00:49:12 +0000
  • 5725de56dc fixes in VCF, some changes to get it ready to move out of the GATK aaron 2009-08-19 23:31:03 +0000
  • 0b927f44fa created a better separation between instantiation of a VCF object and the object itself aaron 2009-08-19 20:32:50 +0000
  • ed8c92a12a make isReference do the right thing ebanks 2009-08-19 20:32:29 +0000
  • 21091b9839 Fix for invalid format error when outputting BAM files. hanna 2009-08-19 19:42:39 +0000
  • 4cf9110468 Adding a lot of changes to the VCF code, plus a new basic validator. Also removing an extra copy of the Artificial SAM generator that got checked in at some point. aaron 2009-08-19 05:08:28 +0000
  • b3fe566c0c Fix descriptions of walker args ebanks 2009-08-18 19:46:48 +0000
  • 82e2b7017e Prevent array bounds errors ebanks 2009-08-18 16:54:31 +0000
  • 26a6f816c9 set default value for output format ebanks 2009-08-18 16:17:09 +0000
  • 53153fcd79 Allow RODs to specify that incomplete records are okay (i.e. that they allow optional fields) ebanks 2009-08-18 15:26:10 +0000
  • 9b1d7921e8 added filter based on concordance to another call set ebanks 2009-08-18 15:16:30 +0000
  • b2a18a9d61 - first pass at a basic indel filter (for now, based on size and homopolymer runs) - fix simple indel rod printout ebanks 2009-08-18 03:04:12 +0000
  • 78439f7305 Modify Sequenom input format based on official documentation ebanks 2009-08-18 01:42:57 +0000
  • 63d90702d6 another iteration of the VCFReader and VCFRecord, introducing the VCFWriter aaron 2009-08-17 22:17:34 +0000
  • 1e8b97b560 quietly skip empty intervals files rather than crash. jmaguire 2009-08-17 20:19:14 +0000
  • 92c63fb530 It's just "lod" not discovery_lod now. jmaguire 2009-08-17 18:44:09 +0000
  • df5744bcd3 update this walker so any variants can be passed in ebanks 2009-08-17 16:30:39 +0000
  • 8403618846 the start to the VCF implementation aaron 2009-08-17 04:34:15 +0000
  • d4808433a1 Added option to output the locations of indels in the alternate reference ebanks 2009-08-16 03:46:36 +0000
  • 4b6ddc55bd Merge our 2 fastq writers into 1: incorporate Kiran's secondary-base file writer into the fasta/fastq writers ebanks 2009-08-14 20:55:23 +0000
  • 843d7e6c8f Now you can specify '-' instead of input file name, and the script will read from stdin asivache 2009-08-14 20:30:56 +0000
  • 0ec581080c Refactoring the code; also, now it prints continuously instead of potentially storing one long string. ebanks 2009-08-13 01:32:46 +0000
  • 2a01e71277 A very simple standalone filter for fooling around with the data: can extract only mapped or only unmapped reads, only reads with mapping quals > X, reads with average base qual > Y, reads with min base qual > Z, reads with edit distance from the ref > MIN and/or < MAX asivache 2009-08-12 20:28:51 +0000
  • ebec0ec171 A standalone companion to BamToFastqWalker: does the same thing but without calling in gatk's heavy artillery (does not "require" a reference either). Extracts seqs and quals and places them into fastq; along the way it also reverse complements reads that align to the negative strand (so that fastq contains reads as they come from the machine). asivache 2009-08-12 20:24:37 +0000
  • 112a283f54 be nice, don't forget to close the reader when done asivache 2009-08-12 20:19:56 +0000
  • ba2a3d8a58 Reverse qualities when read seq. is reverse complemented asivache 2009-08-12 20:17:35 +0000
  • 144b424933 Added : String reverse(String s) - reverses a string asivache 2009-08-12 20:16:22 +0000
  • 143f8eea4e option to output in sequenom input format ebanks 2009-08-12 16:50:37 +0000
  • 7f1159b6a9 Added option to mask out SNP sites with "N"s in the new reference. This is useful when producing Sequenom input files for validating indels... ebanks 2009-08-12 15:17:45 +0000
  • 43f63b7530 Added a walker to convert a bam file to fastq format (including the option to re-reverse the negative strand reads). Picard has such a tool but it is geared towards their pipeline and requires intimate knowledge of the lanes/flowcells, etc. This is just easy. ebanks 2009-08-12 15:10:40 +0000
  • d101c20b30 added the ability to pass in a csv file of ROD triplets (one triplet per line) to the -B option aaron 2009-08-11 22:10:20 +0000
  • e4acd14675 Now GenomicMap maps (and RemapAlignment outputs) regions between intervals on the master reference as 'N' cigar elements, not 'D'. 'D' is now used only for bona fide deletions. asivache 2009-08-11 21:10:17 +0000
  • 2c3f56cb8d fix length calculation (it was including +/- char when it shouldn't) ebanks 2009-08-11 20:28:24 +0000
  • 5fab934f4e - moved the reference maker to its own directory - added first version of a more complicated reference maker which takes in RODs and creates an alternative reference based on the variants (indels and/or SNPs) ebanks 2009-08-11 18:01:06 +0000
  • d69ae60b69 fixed two tests affected by my previous commit aaron 2009-08-11 17:57:50 +0000
  • fc1c76f1d2 fixing a bug where reads in overlapping interval based locus traversals could get assigned to only one of the two regions aaron 2009-08-11 17:50:16 +0000
  • bb1d31914c 2009_02 release is no longer with us. Update the bam list. hanna 2009-08-11 12:49:23 +0000
  • 1851613de4 Now using larger database of HLA alleles sjia 2009-08-11 03:11:14 +0000
  • 0e7c158949 I've pulled out the functionality of the analyzer into a single python file which doesn't require all of the irrelevant config parameters (which would cause problems for external users). I'll release this and the simple config file to 1KG for use in analyzing recalibration efforts. ebanks 2009-08-11 02:56:43 +0000
  • dd228880ed Partially implemented NewHotnessGenotypeLikelihoodsTest caused the tests to fail. Ouch! So hot it burned me. hanna 2009-08-10 20:45:44 +0000
  • 3208eaabcc A standalone picard-level tool for breaking individual reads into "pairs" of first/last N bases. Supports: * splitting off only start or end of the read, or both; the output will contain chopped sequences AND corresponding base qualities * splitting an arbitrary number of bases off each end (different numbers for left and right segments can be specified; segments can overlap) * splitting only unmapped reads, ignoring mapped ones * writing split ends into separate sam/bam files, or into a single output file * decorating original read names with user-specified suffixes for each end (e.g. _1 and _2 for left and right parts of the read); default: no decoration, original read names are used * when mapped reads are split, the alignment cigars are chopped appropriately and the alignment start positions are adjusted (for the right end) to correctly specify the alignment of the selected part of the read asivache 2009-08-10 20:42:49 +0000
  • 36312ae4b2 tiny cleanup asivache 2009-08-10 20:26:52 +0000
  • ecae619a1b warn user when dbSNP rod looks suspicious ebanks 2009-08-10 20:20:20 +0000
  • 2841e151d0 javadoc comments only asivache 2009-08-10 18:44:35 +0000
  • 921d4f4e95 RemapAlignments is a standalone picard-level tool that does not use gatk engine; moved to 'tools' asivache 2009-08-10 15:41:07 +0000
  • b7768830c5 Tiny reorganization in the playground: a place for 'picard-level' standalone tools that are not based on gatk asivache 2009-08-10 15:07:35 +0000
  • 02f1af0743 Don't die when a readgroup is absent from the covariates table - it could happen when all reads are unmapped (or have MQ0); instead, just don't alter the quals. ebanks 2009-08-10 03:10:33 +0000
  • 089dab00e2 Was discordance rate, now concordance rate depristo 2009-08-07 19:37:52 +0000
  • 6d3ef73868 Now includes statistics on the allele agreement with dbSNP -- counts concordant calls as dbSNP = A/C and we say A/C, vs. we say A/T depristo 2009-08-07 19:37:07 +0000
  • 20baa80751 Updated polarized reference priors, need DiploidGenotypePriors class that is directly used by the NewHotness genotypelikelihoods, more bug fixes and refactoring, etc. depristo 2009-08-07 19:01:04 +0000
  • a864c2f025 Updated polarized reference priors, need DiploidGenotypePriors class that is directly used by the NewHotness genotypelikelihoods, more bug fixes and refactoring, etc. depristo 2009-08-07 19:00:06 +0000
  • db250f8d3e Don't print if not in learning mode ebanks 2009-08-07 06:08:02 +0000
  • 4c1fa52ddf -Added mapping quality zero filter -Set some reasonable defaults (based on pilot2) ebanks 2009-08-07 03:18:02 +0000
  • bbd7bec5db Continuing cleanup of SSG. GenotypeLikelihoods now have extensive testing routines. DiploidGenotype supports het, homref, etc calculations. SSG has been cleaned up to remove old garbage functionality. Also now supports output to standard output by simply omitting varout depristo 2009-08-05 22:25:30 +0000
  • d60d5aa516 Fixed bug: previously reset likelihoods after each region/exon. Better comments/documentation added sjia 2009-08-05 18:44:46 +0000
  • 0d47798721 made booster distance a parameter kcibul 2009-08-05 18:29:21 +0000
  • 3b74b3ba74 print out ref/alt ratio, not major/minor ebanks 2009-08-05 16:36:25 +0000
  • 48713e154c Windowed access to the reference. hanna 2009-08-05 16:29:15 +0000
  • 65e9dcf5b7 Fully operational version of the new genotype likelihoods class. (1) Much cleaner interface. Now explicitly stores likelihoods, priors, and posteriors in separate arrays indexed by an enum, (2) no longer can be used to make calls, it relies on SSGGenotypeCall to order the likelihoods, calculate best to ref, etc, this is just for calculating genotype likelihoods now; (3) Now performs extensive error checking with validate() to ensure the system is behaving properly. (4) fixed incorrect treatment of N bases, which were being counted against everyone (5) likely found a stats bug in which heterozygosity was being applied incorrectly to the genotype priors depristo 2009-08-05 01:00:55 +0000
  • 4dc23f2763 Trivial formatting changes as I moved more legacy code into this system depristo 2009-08-05 00:54:26 +0000
  • 34af669dbb Explicit ENUM representation of the diploid genotypes. Please use this from now on to represent strings like AA or AT depristo 2009-08-05 00:53:43 +0000
  • 5487ab0ee6 Added several useful routines to MathUtils for summing and bounds checking of doubles depristo 2009-08-05 00:41:31 +0000
  • 68309408e4 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1378 348d0f76-0448-11de-a6fe-93d51630548a sjia 2009-08-04 21:23:01 +0000
  • 45ab212f22 Post-presentation update sjia 2009-08-04 21:21:12 +0000
  • 21d1eba502 Cleaned division of responsibilities between arguments to map function. Reference has been changed from an array of bases to an object (ReferenceContext), and LocusContext has been renamed to reflect the fact that it contains contextual information only about the alignments, not the locus in general. hanna 2009-08-04 21:01:37 +0000
  • a5a7d7dab8 added "booster" metrics kcibul 2009-08-04 20:53:45 +0000
  • 3a8d923785 minor output changes ebanks 2009-08-04 20:12:16 +0000
  • 939b19e715 Committing the first version of the homopolymer filter. Removes SNPs that occur at the edges of homopolymer runs and whose nonref allele matches the repeated base in the homopolymer. mmelgar 2009-08-04 14:35:51 +0000
  • 20ff603339 New hotness and old and Busted genotype likelihood objects are now in the code base as I work towards a bug-free SSG along with a cleaner interface to the genotype likelihood object depristo 2009-08-03 23:07:53 +0000
  • 4986b2abd6 Fixing bug in SSG -- genotyping and discovery were mixed up by name depristo 2009-08-03 22:13:35 +0000
  • 3485397483 Reorganization of the genotyping system depristo 2009-08-03 20:55:31 +0000
  • 9f1d3aed26 -Output single filtration stats file with input from all filters -move out isHet test to GenotypeUtils so all can use it ebanks 2009-08-03 20:44:21 +0000
  • d88ea91939 Slight reorganization of genotype interface depristo 2009-08-03 19:19:11 +0000
  • 880a01cb5d Slight reorganization of genotype interface depristo 2009-08-03 19:18:41 +0000
  • d840a47b11 Slight reorganization of genotype interface depristo 2009-08-03 19:17:15 +0000
  • 20986a03de cleanup before moving files depristo 2009-08-03 19:08:24 +0000
  • e3b08f245f Pull out RMS calculation into MathUtils for all to use ebanks 2009-08-03 17:00:20 +0000
  • e495b836d3 - added mapping quality filter - make the filters brainless in that they strictly have thresholds and filter based on them; require user to calculate and input these thresholds. - update filters in preparation for migration to new output format ebanks 2009-08-03 16:46:51 +0000
  • ba07f057ac finish the math for RMS ebanks 2009-08-03 16:18:09 +0000
  • 8bc925a216 Commit on the behalf of Mark: cleaning up some old and busted code in GenotypeLikelihood and associated objects. kiran 2009-07-31 21:18:30 +0000
  • 8d06bb21ed A little gadget to select random samples from input stream(s) of unknown length. By default, selects a single line (with probability 1/TOTAL_NUMBER_OF_LINES_READ), with -N option randomly selects specified number of lines. Can read from STDIN or from arbitrary number of input streams (all streams will be merged). Examples: "cat file1 file2 file3 | randomSampleFromStream.pl -N 5" or "randomSampleFromStream.pl file1 file2 file3" asivache 2009-07-31 18:55:14 +0000
  • 9dfee7a75c the "-genotype" option now acts correctly as a discovery mode caller in SSG aaron 2009-07-31 18:31:45 +0000
  • c2c80dd946 cleanup and moving some things around to more logical locations aaron 2009-07-31 16:28:39 +0000
  • 9dada95ec3 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1357 348d0f76-0448-11de-a6fe-93d51630548a sjia 2009-07-31 16:21:16 +0000
  • 9a0761cd8f accidentally committed some debug code aaron 2009-07-31 15:25:22 +0000
  • 2f2c8576a5 GLF output is now well validated, and some changes for new Genotypes interface code aaron 2009-07-31 15:21:28 +0000
  • afccbc44ec Script that performs all the processing steps from raw Illumina reads through to analysis of barcoding and hybrid selection efficiency as documented in the GATK tutorial; can automatically run all steps in series on the farm. andrewk 2009-07-31 00:22:53 +0000
  • eb4b9a743a Script that runs most of the steps involved in validating the CoverageEval system that predicts performance for given depth of sequencing coverage across a genome. andrewk 2009-07-31 00:18:45 +0000
  • 8eeb87af2a Tests for downsampling related utilities in ListUtils class that didn't get checked in earlier andrewk 2009-07-31 00:09:35 +0000
  • efd0fd1f0a Short python script that takes paired-end BAMs and aligns them with BWA. Referenced in GSA wiki tutorial andrewk 2009-07-31 00:04:10 +0000
  • 678c2533ca Removed custom output stream for file and replaced with the standard out PrintStream andrewk 2009-07-30 22:36:42 +0000
  • 2a7dfce9ae fix the header string mismatch that Andrew found aaron 2009-07-30 22:26:34 +0000
  • 44673b2dce Removed a debugging println that was accidentally checked in andrewk 2009-07-30 22:23:27 +0000
  • 845488ff94 VariantEval now decides whether a variant is not confidently called using BestVsNetxBest if genotypes are being evaluated and BestVsRef if not (variant discovery only). Also, the absolute value of the BestVsRef LOD (getVariantionConfidence) is used so that confident reference calls (if the GELI has output them) will show up in the final table as reference calls rather than no calls. andrewk 2009-07-30 21:54:06 +0000
  • 1c648a2d5f Skip compiled python files (*.pyc) in svn status output andrewk 2009-07-30 21:45:23 +0000
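
Note on commit 6313c465fb (RMS of the read qualities): the distinction in that message can be illustrated with a small sketch. This is not the code from the commit, which is not shown in this log. The function name rms and the quality values below are hypothetical, chosen only to show that pooling all values and taking one RMS gives a different answer from taking the RMS of per-group RMS values when the groups differ in size.

```python
import math


def rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("rms() requires at least one value")
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical quality values, split into two groups of unequal size.
group_a = [60, 60, 60, 60]
group_b = [0]

# RMS over all five values pooled together.
pooled = rms(group_a + group_b)

# RMS of the two per-group RMS values: each group counts equally,
# regardless of how many values it holds.
nested = rms([rms(group_a), rms(group_b)])

print(f"pooled RMS: {pooled:.2f}")
print(f"RMS of RMS: {nested:.2f}")
```

When every group has the same number of values the two quantities coincide; they diverge only when group sizes differ, because the nested form weights each group equally rather than each value.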