Commit Graph

1850 Commits (da2e879bbc07033e7b5bd90e9f941fcaec937914)

Author SHA1 Message Date
ebanks b8cdf64c20 Better descriptions for max reads/downsampling args
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2618 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-17 02:30:27 +00:00
depristo 64225b28fd Convenience methods for getting the VCFReader and VCFRecord
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2614 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-16 20:22:31 +00:00
hanna 420cef4094 Added version numbers to the help doclet extractor. Since the help system is behaving
more like a resource bundle at this point, changed it over to use the Java ResourceBundle
support classes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2606 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 23:31:29 +00:00
hanna 930082314a Put a major.minor version into the GATK Javadoc for reading. Also,
update some straggler packages to the new package-info.java format introduced in 1.5.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2604 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 21:48:30 +00:00
asivache 7a991421f7 -erw argument, begone! Rod traversals are now enabled. current tests pass, more tests for RODWalkers are welcome ;)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2601 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 21:11:14 +00:00
asivache c8c5c176cd -erw argument, begone! Rod traversals are now enabled. current tests pass, more tests for RODWalkers are welcome ;)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2600 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 21:07:49 +00:00
asivache a12933a26d Bug fixed: now the length of an insertion is determined correctly. Thought I committed this...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2599 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 20:58:48 +00:00
asivache 404b95183f This is a LocusWalker, not a RodWalker (thanks Mark!!). RodWalkers currently are not capable of attaching alignment contexts (reads) to the ROD-annotated loci they traverse over...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2596 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 20:33:41 +00:00
rpoplin 7078219b89 Updating outdated comments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2595 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 19:17:52 +00:00
rpoplin ba2acda406 Clarifying the comment regarding differentiating between first and second of pair in CycleCovariate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2594 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 18:36:14 +00:00
ebanks b911b7df82 Fixing the AC annotation to be in line with the VCF spec
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2593 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 18:28:52 +00:00
rpoplin f2e539c52f As per discussions with Tim we are reverting the previous change regarding PairedReadOrderCovariate. The CycleCovariate now differentiates between first and second of pair by multiplying the cycle by -1. PairedReadOrderCovariate has been removed completely.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2592 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 18:18:59 +00:00
asivache eae1b73945 Fixed a bug in left-adjusting the indels introduced in previous commit :-/
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2591 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 17:41:23 +00:00
rpoplin df998041a8 Minor change to solid warning message. Added note for a future solid recalibration integration test when we get the required data file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2590 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 16:31:25 +00:00
rpoplin 70df30fc1b Added method to AlignmentUtils which takes a read's cigar and the refBases char array given to a ReadWalker and returns the aligned reference char array. Bug fix in solid_recal_modes to use this aligned reference array. Recalibrator version number is no longer separate for each of the two walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2589 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 15:36:59 +00:00
ebanks 2a116bb5d6 Made the VCF validator a simple rod walker instead of having it be in a separate package.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2588 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 06:39:06 +00:00
hanna b19bb19f3d First successful test of new sharding system prototype. Can traverse over reads from a single
BAM file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2587 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 03:35:55 +00:00
aaron db9570ae29 Looks bigger than it is:
* Moved GATKArgumentCollection into gatk.arguments folder to clean up the main folder, also added some associated argument classes (most of the changes).
* Added code the argument parsing system for default enums, we needed this so we could preserve the current unsafe flag, and at the same time allow finer grained control of unsafe operations.  You can now specify:

"-U" (for all unsafe operations), "-U ALLOW_UNINDEXED_BAM" (only allow unindexed BAMs), "-U NO_READ_ORDER_VERIFICATION", etc.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2586 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-15 00:14:35 +00:00
asivache df63f51253 No changes, just sync-ing; only some commented out debugging prints are added...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2583 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-14 17:45:15 +00:00
asivache d85461c463 MergingIterator completely re-done. Now it is not a generic class (sorry guys), but rather it is tailored for merging ROD tracks. This implementation peeks the locations of next ROD annotations in each track, but does not actually read these RODs from underlying streams until the location is reached and it is time to actually return the object. Now underlying ROD track iterators (registered in the resource pool!) are not advanced prematurely past the current position and all the way to the next ROD record wherever it is, so that the sharding system can reuse them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2582 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-14 17:43:36 +00:00
asivache c0891d512f added: peekNextLocation(); it's quite hard (and probably unnecessary, ever) to make seekable iterator a peekable one, but it is quite easy and useful to be able to peek just the next location the iterator will jump to after next call to next()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2581 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-14 17:38:19 +00:00
rpoplin 49c44e7b36 PairedReadOrderCovariate is now a standard covariate and because of this CycleCovariate no longer multiplies by negative one for second of pair reads. Added PairedReadOrderCovariate to some of the integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2574 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 20:09:10 +00:00
hanna 05575e2e56 Better bounding for the locus window. Don't make the locus window calculation blow up if the GenomeLoc ends
up being outside the reference.  Force the blowup elsewhere.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2573 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 17:03:54 +00:00
hanna 02e23e2d9c Threading support for beagle output files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2569 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 02:42:16 +00:00
aaron 0513690416 two fixes in the new cached DbSNP code:
-isBiallelic would incorrectly say triallelic sites are biallelic.
-getAlternateAlleleList was broken, since the new cached list is immutable, we couldn’t remove list items.

Also added a dbSNP validating walker to the one-offs, for testing the new b37 130 dbSNP rod.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2568 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-13 00:27:34 +00:00
asivache a138bad95a A rare but not-so-subtle bug fixed: a funky alignment (a kind that should not have been generated in the first place) could make the indel left-adjusting method to overshoot read start and build a cigar like -3M6I...
also, few minor fix-ups.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2567 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 21:29:50 +00:00
rpoplin b51f4aae11 Updating the recalibrator to make use of StingSAMFileWriter.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2566 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 20:58:27 +00:00
rpoplin c8ad025ad0 cleaning up unused import statements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2565 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:52:37 +00:00
rpoplin 189829841b The recalibrator now uses all input RODs when looking for known polymorphic sites not just the one named dbsnp. Added an integration test which uses both dbsnp and an input vcf file and skips over the union of the two.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2564 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:50:39 +00:00
aaron 16777e3875 more fixes for the empty interval list problem; you can now run LocusWindow traversals with an empty interval list, but the GATK will give you a warning (unless you're running in unsafe mode).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2563 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:47:43 +00:00
hanna 35a4fcc481 Additional sanity checking: make sure the user can't alter the header / compression level / presorted state of a file to which SAMRecords have already been written.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2562 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:39:41 +00:00
ebanks 03b7d5f5c7 1. Fixed small but embarrassing bug in weighted Allele Balance annotation calculation.
2. Made RankSumTest abstract; added 2 subclasses: BaseQualityRST and MappingQualityRST (the latter based on a suggestion from Mark Daly).  Untested so they're still experimental.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2561 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:33:53 +00:00
hanna 58999a8e9d Enhance the I/O management system to support custom headers and set the presorted flag
from the initialize() method (or at any time before the first SAM record is written).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2560 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 18:21:42 +00:00
aaron 3c5f5177b1 check to see if the parsed interval list is empty, since we now allow interval files that are empty. If so, make sure we default to a non-interval based traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2559 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-12 17:52:27 +00:00
ebanks 42aff1d2c3 Annotator in general should be able to annotate monomorphic or tri-allelic sites.
It's up to the individual annotations to decide whether they want to annotate or not.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2556 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-11 19:52:18 +00:00
rpoplin 11f91b3c95 Reverting Eric's previous change because it killed the PG tag in the output bam file header. Added a new -compress command line argument to set the compression level of the output bam file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2555 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-11 19:02:56 +00:00
rpoplin 62dd2fa5be Fixing another bug in solid recal regarding negative strand reads. The isInconsistentColorSpace method incorrectly used the inconsistent tag added by parseColorSpace, the inconsistent tag is in the direction of the read like the color space tag, and not in the direction of the reference like everything else. This affects the recalibrated quality scores but the improvment in SNP calling performance is minor when using the default UG settings (min base quality 10).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2553 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-11 14:28:52 +00:00
ebanks 971834ca90 Added a walker to the vcf tools compilation: one that combines vcf records. Both merges and unions are supported (see documentation... when it gets written this week).
Also, moved some code that pulls samples out of rods from VCFUtils into SampleUtils.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2552 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-10 06:45:11 +00:00
ebanks 80af0f2f54 Changed the OUTPUT_BAM_FILE argument from String to SAMFileWriter and removed the call to close().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2551 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-10 03:45:54 +00:00
ebanks fcce77c245 Added -beagle option to emit likelihoods file for use with the BEAGLE imputation engine; still experimental.
(Also converted getPileup -> getBasePileup)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2549 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 18:41:04 +00:00
rpoplin 9cbae53ee1 Bug fixes for both SET_Q_ZERO and REMOVE_REF_BIAS solid recal modes regarding proper handling of negative strand reads. These changes yield a minor improvment in HapMap sensitivity.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2548 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 15:19:22 +00:00
ebanks d5ab002449 Curiously, it seems I never set the default base quality used by the Genotyper to 10. It's done now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2546 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 06:02:01 +00:00
ebanks b468369dfa -UG's call into VariantAnnotator now uses the full alignment context (as opposed to the filtered one)
-MQ0 annotation is now standard again
-Added AC and AN annotations to VCF output



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2545 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-08 05:40:42 +00:00
rpoplin f587ff46af Tile is now a standard covariate. By default the TileCovariate returns -1 if tile can't be derived from the read's name. Added a new command line option -throwTileException which will force TileCovariate to throw an exception if tile can't be derived for a read. Singleton covariates, such as any read group without tile info, must be skipped over in TableRecalibration so that the sequential formulation doesn't apply the same correction more than once. TileCovariate class has been added to the Early Access package.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2544 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 22:51:41 +00:00
asivache d01bde36a4 Make sure that reference view holds enough bases to pass full-length deleted sequence to the walker's map() function in extended event mode (this addresses the problem of a deletion crossing the shard's boundary, so that an attempt to extract deleted bases results in a crash)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2543 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 22:37:22 +00:00
asivache e9bc85c188 Now has methods that allow to 1) check if a location is within the bounds of the reference view; 2) expand reference view (i.e. expand the bounds and reload the reference sequence) in order to accomodate specified location. The second method can be called directly since it performs a check and if the location is already within the bounds, then returns immediately. The costly ref sequence reloading occurs only when the location is not fully contained within the current bounds.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2542 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 22:35:17 +00:00
asivache 7f91b4d824 Bug fix. It would be nice if we could extract ROD annotations for the whole length of an extended event (indel), and we tried... But alas, it does not work with the current ROD system (after extracting length on ref > 1 ROD data for a deletion, rod iterator crashes on the attempt to re-load annotations for next reference base)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2541 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 21:30:55 +00:00
rpoplin 5f58492401 A rogue QualityUtils.MAX_REASONABLE_Q_SCORE managed to get through my previous bug fix. It should instead check the command line -maxQ argument.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2540 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 21:17:39 +00:00
ebanks c7a8dffa89 Check for division by 0 in annotations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2539 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 19:27:15 +00:00
depristo 076481f786 Fixes to mergeVCF -- now correctly supports merging of filter fields. Also removed incorrect hasFilteringCodes() function. Updated intergration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2535 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 14:50:13 +00:00
rpoplin cea544871d Fixed an issue with recalibrating original quality scores above Q40. There is a new option -maxQ which sets the maximum quality score possible for when a RecalDatum tries to compute its quality score from the mismatch rate. The same option was added to AnalyzeCovariates to help with plotting q scores above Q40. Added an integration test which makes use of this new -maxQ option.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2534 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 13:50:30 +00:00
ebanks 6c739e30e0 1. Removing an old version of the Genotype interface which is no longer being used. Needed to do this now so that the naming conflicts would cease.
2. Adding a preliminary version of the new Genotype/Allele interface (putting it into refdata/ as the VariantContext really only applies to rods) with updates to VariantContext.  This is by no means complete - further updates coming tomorrow.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2533 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-07 05:51:10 +00:00
chartl 7e3e714d3c Moving experimental annotations from core to oneoffs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2528 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 19:34:10 +00:00
chartl a32245f7d2 Modifications:
QualityUtils - Stole the BaseUtils code for flipping reads around and applied it to quality scores
SecondBaseSkew - Nothing's really different, just a commented line

Additions (experimental annotations for future development of second-base annotation)
** I DO NOT INTEND FOR ANYONE TO USE THESE **
- ProportionOfNonrefBasesSupportingSNP
- ProportionOfSNPSecondBasesSupportingRef
- ProportionOfRefSecondBasesSupportingSNP
  + I hope these are self-explanatory
- QualityAdjustedSecondBaseLod
  + Adjust lod-score by 10*log10[P[second bases are as observed]]

Added walker:

QualityScoreByStrand - oneoff project that's being saved if i ever need it



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2527 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 19:18:07 +00:00
rpoplin 75809100c6 Use inheritance so that shared code isn't duplicated between the RecalDatums
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2523 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 16:45:16 +00:00
ebanks fdd14e1a01 Proposed interface for VariantContext. It's currently an interface so it doesn't break the build...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2521 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 16:31:39 +00:00
rpoplin e011a1b6f8 Cut the memory footprint of the RecalDatum in half to improve performance of CountCovariates when run with many covariates.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2520 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 16:12:27 +00:00
rpoplin 370a365147 Small runtime improvement in TableRecalibration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2519 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 15:51:12 +00:00
ebanks b745c2f8d7 Fix for Jared: don't blow up if there are no samples in the input (since that's allowed) - but warn the user just in case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2518 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 15:37:06 +00:00
depristo f857159343 useful convenience function to get a genotype associated with a particular sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2515 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 15:03:07 +00:00
depristo 8d13597a27 Temporary command-line support to enable rod walkers, if you know what you are doing this is safe.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2505 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-06 12:15:36 +00:00
ebanks d8351cb9fc Give Annotations access to rod data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2504 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-05 18:53:01 +00:00
ebanks 8b087305f3 Added back the MQ0 annotation - however, it's not yet standard (since mq0 reads are filtered out by default in the genotyper). But it'll work when using the Annotator as a standalone.
While I'm at it, change getPileup to getBasePileup to remove all of the deprecation warnings.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2502 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-05 17:07:19 +00:00
depristo c209ba55aa More informative error message
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2499 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-05 13:55:20 +00:00
rpoplin 0a6bd5a270 CycleCovariate is now one-based so that 0 and -0 don't collide with each other. Solid recal modes now only change the inconsistent base and the previous base (along the direction of the read) instead of both the bases before and after. Removed estimatedNumberOfBins from the Covariate interface because it wasn't being used.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2498 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-04 20:52:15 +00:00
ebanks ed2fff13aa -Misc improvements to VCF code
-Small fix to callset concordance


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2497 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-04 02:28:47 +00:00
ebanks b668d32cf1 Updated the min mapping quality and min base quality defaults to be 10 in both cases (and updated all integration tests) as suggested by Mark.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2494 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-03 21:31:04 +00:00
asivache 46362ce532 In extended event lines, now prints deletions in verbose format as well (e.g. "-AAT")
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2490 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-30 21:57:20 +00:00
asivache a18e31f5b8 If alignment context at the locus holds extended event, get rod metadata and (importantly) reference bases for the whole span of the event (if it is a deletion that is, insertions still have length 0 on the ref!)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2489 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-30 21:56:25 +00:00
asivache 89791d730e Compute and cache the length of the longest deletion observed at the site; ReadBackedExtendedEventPileup now has a getter to access that value.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2487 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-30 21:19:39 +00:00
asivache 8932e67325 Removed sanity check that required GenomeLoc argument to be strictly 1-base long. We need to relax this in order to be able to pass around a reference context containing full-length chunk of deleted reference bases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2485 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-30 20:14:08 +00:00
rpoplin 9b2733a54a Misc clean up in the recalibrator related to the nested hash map implementation. CountCovariates no longer creates the full flattened set of keys and iterates over them. The output csv file is in sorted order by default now but there is a new option -unsorted which can be used to save a little bit of run time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2482 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-30 16:58:04 +00:00
asivache c928347c0c Extended event pileups are more verbose now: following a sequence of 'D','I', and '.' symbols, actual distinct events are listed along with their counts (example: +AAA:3,+AAC:1 for the total of 4 indel observations with 3 reads showing +AAA and one read showing +AAC)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2480 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 22:44:18 +00:00
asivache e286313b67 Fix for reads that have insertion as their last (mapped) cigar elements (i.e. not followed by M)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2476 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 21:13:16 +00:00
hanna 05deb8796b Simplify handling of reference sequence for unmapped reads. Improvement made based on a suggestion from Alec.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2475 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 21:06:20 +00:00
rpoplin 96c4929b3c Recalibrator now uses NestedHashMap instead of NHashMap. The keys are now nested hash maps instead of Lists of Comparables. These results in a big speed up (thanks Tim!). There is still a little bit of clean up to do, but everything works now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2474 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 21:01:32 +00:00
asivache bfd6bf9ec5 PileupWalker just got a new option: --showIndelPileups. When this option is used, two lines are printed for every genomic location that has indels associated with it: first line is a conventional base pileup, the second line is an "extended event" (indel) pileup. The refence base in that second line is always set to "E" (for Extended), and the pileup string contains I,D,. symbols for insertion, deletion, noevent, respectively. Only this simple short format for indel pileups is implemented so far.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2472 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 20:16:34 +00:00
asivache 9652692019 Modified to enable locus traversals firing additional calls to walker's map() with alignment context filled with extended events (indels). Walker should override generateExtendedEvents() to return true, and it should make sure that it catches those additional indel pileups and processes them differently, as needed. If there are indels associated with a specific reference base, TWO map() calls will be issued in locus traversal at that location: first one will have a context filled with a regular base pileup, the second call will provide the context filled with indel pileup (pileup elements will have insertion, deletion, or noevent type associated with them and will also carry information about the full length of the event and inserted bases).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2471 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 20:13:25 +00:00
asivache 06eb576924 Can now be constructed with either base pileup or extended event (indel) pileup; has query methods checking what kind of pileup is served by the context, and getter methods return the appropriate pileup. TODO: while it is impossible right now to create a context that contains both types of pileups simultaneously, this restriction is only weakly enforced through the lack of appropriate constructor. Either we keep it this way, or some getters may become ambiguous and have to be fixed!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2470 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 20:07:29 +00:00
depristo 87e863b48d Removed used routines in duputils; duplicatequals to archive; docs for new duplicate traversal code; general code cleanup; bug fixes for combineduplicates; integration tests for combine duplicates walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2468 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 19:46:29 +00:00
ebanks 5fdf17fccb Removed the VCF "NS" annotation (which wasn't working for pooled calls anyways) since it's ambiguous and not useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2465 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 17:30:47 +00:00
hanna e32174fbc4 UnifiedGenotyper now works without -varout or -vf set.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2464 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 16:46:24 +00:00
hanna b125571a98 Intermediate check in: transfer responsibility of wrapping the GenotypeWriter around the output stream to the output
management code.  Currently, will not work when neither -varout nor -vf are specified, but should work in all other
cases.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2463 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 16:11:11 +00:00
ebanks aeb34758e6 Adding a validation stringency to the VCF writers (which defaults to STRICT). If set to SILENT, it will not throw an exception for (reasonable) off-spec requests but will instead ignore such requests and silently move on.
This change allows the pooled calculation model to work correctly with multiple threads.  Boys, the Genotyper is now officially parallelized.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2462 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-29 15:33:53 +00:00
depristo fcc80e8632 Completely rewritten duplicate traversal, more free of bugs, with integration tests for count duplicates walker validated on a TCGA hybrid capture lane.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2458 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 23:56:49 +00:00
ebanks 893c9c85fa Added previous optimization to diploid (non-pool) model and shaved off 20% of runtime from it. Moved out some common functionality to joint estimate parent class.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2453 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 21:20:48 +00:00
rpoplin 92e3682991 Moved NHashMap to sting/utils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2452 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 20:57:32 +00:00
rpoplin 562db45fa5 Sites that were marked NO_DINUC no longer get dinuc-corrected but are still recalibrated using the other available covariates. Solid cycle is now the same as Illumina cycle pending an analysis that looks at the effect of PrimerRoundCovariate. Solid color space methods cleaned up to reduce number of calls to read.getAttribute(). Polished NHashMap sort method in preparation for move to core/utils. Added additional plots in AnalyzeCovariates to look at reported quality as a function of the covariate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2451 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 20:19:37 +00:00
asivache 2a704e83df Reads now have new traversal flag: generateExtendedEvents(). Support added to GenomeAnalysisEngine and Walker. This is a silent and transparent framework change that no existing code is going to see. The actual code that makes use of the new flag (which is false by default) will be committed separately...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2450 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 19:52:44 +00:00
ebanks c8d0e6e004 Optimization to pooled calculation model: stop calculating P(D|AF) if we are beyond the max likelihood such that subsequent likelihoods won't factor into the confidence score. Also, use new Pileup interface.
Pooled calling now takes less than half the time it used to.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2449 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 18:39:55 +00:00
ebanks b1ac4b81d5 Optimization: look up diploid genotypes from a static matrix instead of creating them on the fly (with String.format); bases no longer need to be ordered appropriately
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2448 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 17:28:51 +00:00
andrewk 57516582c2 Converter from HapMap chip genotype data to VCF added; HapMapGenotypeROD adjusted to not convert from Hg18 to b36 formatting of contigs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2447 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-28 01:36:08 +00:00
ebanks d2770f380c Writing calls to standard out now works again (it got broken when we introduced parallelization)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2446 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-27 04:36:45 +00:00
ebanks 12990c5e7a Added qual-by-depth annotation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2445 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-25 02:30:30 +00:00
ebanks 0571d9dcb9 Point MAX_QUAL_SCORE to SAMUtils.MAX_PHRED_SCORE.
Also, array size for caches should be max score + 1.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2444 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 20:47:32 +00:00
ebanks 438d21842a The new recalibrator had been mimicking the behavior of the old one in that if there was no dinuc available (following a no-call base or at either end of a read), it didn't try to recalibrate. Now that Ryan has modularized the system, we no longer need to skip the base completely (we just need to skip the dinuc value)... which is good because the Picard people complained after realizing that cycle #1 never got recalibrated.
The major effects of this commit are as follows:
1. We no longer skip any good bases (of course, this change alone breaks every single integration test).
2. The dinuc covariate returns a "no dinuc" value for the first base of a read (but not for the last base anymore, since there is a valid dinuc) or if the previous base is a bad base (e.g. 'N').

I've done a bunch of testing on real data and everything looks right; however, let's wait until the recalibrator guru gets back from vacation next week and can double-check everything before shipping this out in another early access release.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2443 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 20:41:29 +00:00
ebanks aaf674d9db Cleaned up this annotation.
Still experimental.  As of now, it's not useful.  More analysis is needed to determine how to handle cases where UG is unsure whether a sample is het or hom.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2442 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 03:06:46 +00:00
ebanks 6df40876a3 Un-reverted Matt's previous changes and fixed integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2441 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 02:47:00 +00:00
hanna 2bd0b1bbf7 After further review, it's unclear that my patch in RecalDataManager was the right choice. Reverting.
Also updating other IntervalCleanerIntegrationTest failures that were masked by my first patch.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2440 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-24 00:32:33 +00:00
hanna 98c268483e Fixed issues with the integration tests:
1) sam-jdk apparently no longer supports custom tags with type int[] values.
2) BAM output for indel cleaner integration test changed in a way that's so subtle it can't be seen after converting the output to .sam.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2439 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 23:12:22 +00:00
aaron b134e0052f added changes to the code to allow different types of interval merging,
1: all overlapping and abutting intervals merged (ALL), 
2: just overlapping, not abutting intervals (OVERLAPPING_ONLY), 
3: no merging (NONE).  This option is not currently allowed, it will throw an exception.  Once we're more certain that unmerged lists are going to work in all cases in the GATK, we'll enable that.  

The command line option is --interval_merging or -im


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2437 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 21:59:14 +00:00
alecw 159778416c In TableRecalibrationWalker, update UQ tag if it was present in the original SAMRecord. This required a new sam.jar, which caused some other files to need to be changed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2435 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 21:42:36 +00:00
ebanks dc96879861 2 separate changes which both affect lots of UG integration md5s, so I'm committing them together:
1. allele balance annotation is now weighted by genotype quality (so we don't get misled by borderline het calls)

2. Updates to the Unified Genotyper for parallelization:
   a. verbose writing now works again; arg was moved from UAC to UG
   b. UG checks for command that don't work with parallelization
   c. some cleanup



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2432 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 19:03:56 +00:00
ebanks 872a9d1c7b I'm making this change now (as opposed to waiting until Monday) to honor Tim's request.
The cycle covariate is now first/second of pair aware.  I'm taking it on faith from both Chris Hartl (waiting on slides from him) and Tim that this is the right thing to do.  We'll have Ryan confirm it all next week.
The only change is that if a read is the second of a pair, we multiple the cycle by -1 (a simple way of separating its index from that of its mate).
Of course, this broke all integration tests.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2431 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 16:26:43 +00:00
hanna e29e8e52b9 Multithreading support for the unified genotyper. Tests on a 10Mbase region on pilot 1 show a 6.8x improvement
when running 8 ways parallel.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2430 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-23 00:48:06 +00:00
ebanks 03bf75e335 Now implements TreeReducible
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2427 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-22 17:52:51 +00:00
hanna 0d890e1bf0 Rework Eric's output management code given that the behavior of the UG changes drastically
depending on its output format.  Current implementation is probably a bit overkill-ish and
we can whittle this down to what's absolutely necessary.
Writing VCFs to the 'out' protected printstream may not work at this moment.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2425 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-22 00:33:43 +00:00
ebanks f448a263e9 The cleaner now cleans duplicate reads (instead of ignoring them) - although it doesn't include them for scoring ref or alt consenses
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2424 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 21:01:55 +00:00
ebanks e06dfe44c4 Check for null platform (even when the read group isn't null) and assign it the default platform if it is
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2420 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 07:01:41 +00:00
ebanks 87e5a41964 Fixed a bug that accounted for a bunch of my remaining mis-cleaned indels.
Also, slightly optimized the cleaner by using readBases (instead of readString) and caching cigar element lengths.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2419 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 05:46:16 +00:00
hanna b780ffb34a Add a getFormat() method to get the output format from the writer. The need for
this call suggests that I may be thinking about the typing of the GenotypeWriter object the wrong way.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2418 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 01:46:26 +00:00
hanna 11cbfcec9c Get rid of backlink from ArgumentDefinitions to ArgumentSources. This will help in the future with multiple
source -> single definition mapping sets.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2417 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-21 00:39:36 +00:00
hanna 9e53c06328 First revision of command-line argument support for GenotypeWriter. Also, fixed the damn build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2416 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-20 19:19:23 +00:00
ebanks 4ff61097cf Trivial change: < -> <=
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2415 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-20 03:35:27 +00:00
ebanks 566b556b50 Give user ability to turn off max allowed interval size
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2414 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-20 03:20:22 +00:00
depristo ee8bcdc61d PooledConcordance calculations have been reformatted and bugs fixed. Now properly handles monomorphic sites. Also works with -G option now, correctly
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2412 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-19 23:22:36 +00:00
aaron 7e0f69dab5 Changed the GLF record to store it's contig name and position in each record instead of in the Reader. Integration tests all stay the same.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2410 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 22:54:56 +00:00
hanna 80b3eb85fa Fixed curiously epic failure in read-backed pileup: size() mismatched the numReads-numDeletions at that locus in the case where includeReadsWithDeletionsAtLoci == false, causing failures including bad output from pileup walker. Also fixed up ValidatingPileup to run with the new ReadBackedPileup instead of just compiling successfully.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2409 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 22:52:44 +00:00
rpoplin fdf542c214 The CycleCovariate for 454 data is now the TACG flow cycle. That is, each flow grabs all the T's, A's, C's, and G's in order in a single cycle. This is changed from incrementing the cycle whenever there is a discontinuous nucleotide along the direction of the read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2408 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 22:39:51 +00:00
ebanks 4ea31fd949 Pushed header initialization out of the GenotypeWriter constructors and into a writeHeader method, in preparation for parallelization.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2406 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 19:16:41 +00:00
chartl 79b997f43d Minor fix to getValue (thanks Ryan!)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2404 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 15:45:51 +00:00
aaron 9971a8da9a adding a check to the RodVCF to ensure that records are in-order in the underlying VCF file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2403 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 15:24:45 +00:00
chartl 38563bbc2d The values used to be integers (-1 for unpaired, 0 for unmapped, 1 for first, 2 for second); but i switched to strings before commit so it was more clear. Forgot to update the OTHER getValue method.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2402 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 15:05:14 +00:00
chartl 7b5e332ff3 Added - PairedQualityScoreCountsWalker: counts quality scores (e.g. as a histogram) on first reads of a pair and second reads of a pair. Turns out there's a consistent difference in quality scores; even after recalibrating without the pair ordering as a covariate (there's a bit of averaging -- but not as much as I initially thought).
Added - A paired read order covariate to use with recalibration. Currently experimental: for instance, what's a proper pair versus just a pair? Nobody should use this one...



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2401 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 15:01:01 +00:00
ebanks 4f59bfd513 Updates to the various GenotypeWriters to make them do simple things like write records (plus allow GLFReader to close).
Adding first pass of stub and storage classes for the GenotypeWriters so that UG can be parallelizable.  Not hooked up yet, so UG is unchanged.
The mergeInto() code in the storage class is ugly, but it's all Tribble's fault.  We can clean it up later if this whole thing works.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2400 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 07:20:23 +00:00
ebanks 94f5edb68a 1. Fixed VCFGenotypeRecord bug (it needs to emit fields in the order specified by the GenotypeFormatString)
2. isNoCall() added to Genotype interface so that we can distinguish between ref and no calls (all we had before was isVariant())
3. Added Hardy-Weinberg annotation; still experimental - not working yet so don't use it.
4. Move 'output type' argument out of the UnifiedArgumentCollection and into the UnifiedGenotyper, in preparation for parallelization.
5. Improved some of the UG integration tests.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2398 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-18 04:14:14 +00:00
rpoplin 6fbf77be95 Updating the two solid_recal_mode options to also change the previous base since solid aligner prefers single color mismatch alignments over true SNP alignments. COUNT_AS_MISMATCH mode has been removed completely. The default mode is now SET_Q_ZERO.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2394 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 20:07:26 +00:00
ebanks c75ec67f84 When called as a standalone, VariantAnnotator now emits samples in sorted (as opposed to random) order in VCFs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2392 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 19:01:08 +00:00
rpoplin aa86f3710d Updating HomopolymerCovariate to only count the consecutive previous bases. I left in the code but commented out for if somebody wants to worry about carry forward homopolymer problems.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2391 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 18:25:09 +00:00
hanna 9143822822 Fix half-hearted attempt to try to move classes from package to package.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2389 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 17:41:42 +00:00
asivache acb4d477da sync...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2387 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 01:03:01 +00:00
asivache ba86508854 remove debug print command
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2386 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-17 00:00:01 +00:00
asivache d72d332239 1) changed to search specifically for D and I cigar elements (and to process properly/ignore H,S,P elements) and print out only intervals that encompass actual indels. There's still one interval per read (at most) generated, which is the smallest intervals that covers ALL indels (D or I elements) present in the read; 2) if an interval (thus the original read itself and indels in it) sticks beyond the end of the chromosome, the read is ignored and this interval is NOT printed into the output; instead, a warning is printed to STDOUT (should we send it to logger.warn() instead?
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2385 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 23:29:07 +00:00
hanna 5b78354efd Fixed NPE in index check with RefWalkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2384 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 22:37:45 +00:00
hanna e6127cd6c5 Temporary hack for Tim Fennell: introduce a sharding strategy that stuffs all data into a single
shard for cases when the index file isn't available.  Works for the case in question, but is not
guaranteed to work in general.  Will be replaced once the new sharding system comes online.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2383 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 21:55:42 +00:00
ebanks bef1c50b3b Some cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2382 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 21:41:06 +00:00
ebanks bb92e31118 Optimizations:
1. push the ReadBackedPileup filtering up into the ReadFilters for read-based filters
2. stop querying the cigar for its length (just do it once)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2381 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 21:39:58 +00:00
hanna ee47eb4367 Make filters used available to the walker via getToolkit().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2379 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 21:26:04 +00:00
ebanks b626fc0684 Joint Estimate is now the default calculation model.
Reworked all of the integration tests so that they're now more comprehensive, cover more of what we wan to test, and don't take forever to run.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2376 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 19:41:02 +00:00
ebanks e051311e8c Added convenience methods in RodVCF to pull out all of the VCF data from the VCFRecord (e.g. getID(), getSamples(), getInfoValues())
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2374 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 17:58:41 +00:00
ebanks bb312814a2 UG is now officially in the business of making good SNP calls (as opposed to being hyper-aggressive in its calls and expecting the end-user to filter).
Bad/suspicious bases/reads (high mismatch rate, low MQ, low BQ, bad mates) are now filtered out by default (and not used for the annotations either), although this can all be turned off.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2373 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-16 17:28:09 +00:00
aaron af440943a4 Fixing a bug that Steven uncovered; we had an abigous contract for peek() in PushbackIterator, and SeekableRODIterator wasn't checking to see if it's PushbackIterator hasNext() was true before calling peek().
Changed peek() to element() to be consistant with the Java standards of the Queue and Stack classes (element() throws an exception if a record isn't available).  

Also updated some of the ROD iterator next() methods to throw NoSuchElementException if next() is called when a record isn't available.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2372 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 23:04:40 +00:00
andrewk 1035abc85f Add minimum base quality thresholding to depth of coverage via getBaseAndMappingFilteredPileup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2371 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 22:58:30 +00:00
ebanks 9b0bdbbf29 Fix for homopolymer bug: ref was lowercase, alt allele was uppercase, so alt != ref. Yuck.
This is a temporary fix - pushed more elegant solution over to Matt.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2360 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 19:02:23 +00:00
ebanks b234019cf5 Readded locus printing suppression to DoC walker
(and removed unused import from UG)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2358 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 14:50:56 +00:00
ebanks 874552ff75 Pull the genotype (and genotype quality) calculation out of the VCF code and into the Genotyper.
[Also, enable Mark's new UG arguments]



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2355 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 04:29:28 +00:00
depristo 2cbc85cc7a min mapping quality and min base quality arguments for UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2354 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 03:57:27 +00:00
depristo faa638532a Correct location
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2353 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 02:42:21 +00:00
depristo 1da97ebb85 Walker for calculating non-independent base errors, v1. Will be moved to somewhere not in core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2352 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-15 02:40:15 +00:00
rpoplin 8e44bfd2ef CycleCovariate and PrimerRoundCovariate now correctly handle negative strand 454 and SOLID reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2349 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-14 21:52:30 +00:00
ebanks 97618663ef Refactored and generalized the VCF header info code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2346 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-13 21:02:45 +00:00
depristo 05b8782d5f Documentation updates. Moved CountX.java walkers to QC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2345 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-13 18:40:22 +00:00
depristo 92307361a4 In preparation for move
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2344 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-13 18:28:06 +00:00
ebanks 45199136f0 Completed my documentation responsibilities - based on Mark's reasonable assignment and not the one Matt made up while on Meth.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2342 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-13 04:13:30 +00:00
ebanks bd2a46ab4c I want to move over to hpprojects tonight, so I'm checking in various changes all in one go:
1. Initial code for annotating calls with the base mismatch rate within a reference window (still needs analysis).
2. Move error checking code from rodVCF to VCFRecord.
3. More improvements to SNP Genotype callset concordance.
4. Fixed some comments in Variation/Genotype



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2341 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-13 02:52:18 +00:00
kiran 2748eb60e1 Added short documentation for each class so that it appears in the walker command-line documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2340 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-12 21:41:07 +00:00
rpoplin 78e94b5a84 TableRecalibration now puts the full list of walker arguments into the PG tag of the bam file it creates. Thanks Matt and Eric. Also, the default nback for the HomopolymerCovariate is 8, down from 10.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2339 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-12 17:29:41 +00:00
rpoplin 014013630f Added hieracrchy to the covariate classes: Required, Standard, and Experimental. Required covariates (rg and reported quality) are added for the user whether or not they are specified in the -cov list. There is now a -standard option in CountCovariates which will add in all of the standard covariates so the user doesn't have to type them all out or even know which ones are the standard. There is logger output to say which covariates are being used of course. The list of covariates used is also added to the PG tag in the bam file produced by TableRecalibration.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2338 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-12 16:34:05 +00:00
hanna 6955b5bf53 Cleanup of the doc system, and introduce Kiran's concept of a detailed summary
below the specific command-line arguments for the walker.  Also introduced
@help.summary to override summary descriptions if required.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2337 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-12 04:04:37 +00:00
hanna cdfe204d19 Incorporated feedback from Kiran. Use the Javadoc first sentence extraction capability to just show the first sentence from each line of Javadoc. @help.description can still be used to produce exceptionally verbose descriptions.
Also increased the line width as much as I could tolerate (100 characters -> 120 characters).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2336 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 21:59:55 +00:00
kiran 38d9f7b903 Renamed ReferenceContext's getSimpleBase() method to getBaseIndex()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2334 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 20:14:39 +00:00
rpoplin 60c3eb4b60 Added help.description to the recalibration walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2331 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 19:02:29 +00:00
ebanks 2ea7632b76 The SNP genotype concordance module is now more comprehensive.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2330 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 18:34:33 +00:00
hanna 590aeee7d2 Documentation for more basic walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2329 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 18:15:40 +00:00
hanna d1815f3559 More documentation for walkers that I'm familiar with in the collection of core walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2328 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 18:02:33 +00:00
hanna 956c36a2c8 Help for the qc package.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2327 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 17:32:47 +00:00
hanna 450ea233a5 Docs for the basic walkers: CountLoci, CountReads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2326 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 17:17:34 +00:00
aaron 86dc98bfb5 update the documentation for CombineDuplicates for the new help system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2324 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 17:01:42 +00:00
aaron 420725441a documentation updates for the new help system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2323 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 16:15:44 +00:00
ebanks 2de7e1a178 Move VariantAnnotator over to use a StratifiedAlignmentContext split by sample.
The only major difference is that we are now able to get accurate allele balance ratios.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2321 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 05:28:28 +00:00
aaron f64a4c66ac some tweaks for the GATK paper genotyper to better work with shared memory parallelization, added documentation changes for Matt's new help system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2319 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 22:33:51 +00:00
ebanks 2869270c11 Fixed deletion depth calculation plus mis-spelling in ReadBackedPileup method.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2315 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 21:11:42 +00:00
ebanks 31b1d60d28 Generalized the StratifiedAlignmentContext code so that it's easy to add new ways to stratify. Then added an MQ0-free stratification so we don't need to be carrying around 2 different alignment contexts (full vs. mq0-free) anymore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2314 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 19:50:06 +00:00
hanna 0c396f04a2 Fix obvious cut/paste error in output stream management code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2313 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 19:23:13 +00:00
ebanks 11ac7885b0 Pull out StratifiedAlignmentContext code so other walkers can use it.
This is basically a wrapper class around AlignmentContext which allows you to stratify a context by e.g. reads on forward vs. reverse strands.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2312 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 19:21:16 +00:00
hanna adb2fdbee7 Before, we were only checking that the reference was present if @Requires required that a reference was present. Now we always check that a reference is present, so that we get an intelligent error message.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2311 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 19:15:48 +00:00
hanna 5eac510b2f Refactor the code I gave Eric yesterday to output command line arguments.
Convert it from a completely wonky solution to a slightly less wonky solution
that will work in more cases.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2310 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 18:57:54 +00:00
hanna 74b8055b6a Only show extra walker help if the user didn't specify a walker or specified
an invalid walker.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2309 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 16:43:06 +00:00
ebanks f7c44ad019 - Read in arguments for the header based on reflection
- Hook up Variation and Genotype in SSG



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2300 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 21:35:33 +00:00
hanna 408f6f3dee Refactoring of prior commit: better handling of unnamed package within the help system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2297 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 20:12:35 +00:00
hanna 1d2151adcf Better handling of nulls output by
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2296 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 19:34:56 +00:00
ebanks 40c2d7a4bc Fix all-bases-mode and genotype-mode in the UG and add integration tests for them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2295 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 17:41:30 +00:00
ebanks 4e54b91ce4 UG now outputs the FORMAT header fields when there's genotype data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2294 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 16:31:07 +00:00
rpoplin 12c49ea485 Added DuplicateReadFilter to filter out reads that are marked as duplicates.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2293 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 15:42:53 +00:00
ebanks fb900b12e1 VariantFiltration now details the filters it has used in the header of the VCF it produces.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2292 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 15:36:15 +00:00
ebanks 8d67d9ade3 -Minor fix in UG for all-bases mode
-Make minConfidenceScore in VariantEval a double so non-integer values can be used (requested by Steve H).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2290 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 03:49:10 +00:00
ebanks 717eb1de96 - Depth annotation now includes MQ0 reads
- Removed MQ0 annotation
- Updated RMS MQ annotation to use new pileup
- UG now outputs all of its arguments as key/value pairs in the header (for VCF)
- Cleaned up VCFGenotypeWriterAdapter interface a bit



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2288 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 02:53:00 +00:00
ebanks e8822a3fb4 Stage 3 of Variation refactoring:
We are now VCF3.3 compliant.
(Only a few more stages left.  Sigh.)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2287 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-08 21:43:28 +00:00
hanna d75d3a361a Clean up some of the walker help output based on additional experience and
feedback received.  Also, add a flag to build.xml to disable generation of
docs on demand (use ant -Ddisable.doc=true to disable docs).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2284 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-07 21:33:11 +00:00
hanna 10be5a5de9 Move some files around to reflect our growing help infrastructure.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2280 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-07 19:23:12 +00:00
rpoplin 1d5b9883db Added --solid_recal_mode argument to experiment with different ways of dealing with solid reference bias. Currently the default option is DO_NOTHING which means use the same behavior as the old recalibrator. Eventually the new methods in RecalDataManager will be moved over to a SolidUtils class. Added transition and transversion methods to BaseUtils that work like simpleComplement, used with the color space in my solid methods. Also, initial check-in of HomopolymerCovariate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2276 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-07 14:26:27 +00:00
ebanks c0528cd88e Updated the CallsetConcordance classes to use new VCF Variation code... and uncovered a whole bunch of VCF bugs in the process. I'm not convinced that I got them all, so I'll unit test like crazy when the refactoring is done.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2272 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-06 11:43:40 +00:00
ebanks b6f8e33f4c Stage 2 of Variation refactoring:
VCFRecord now implements Variation, VCFGenotypeRecord now implements Genotype.

Because of this change, RodVCF is now just a wrapper around the VCFRecord and does nothing else.  Also, one can call toVariation on the VCFGenotypeRecord and it returns the VCFRecord.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2271 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-06 06:48:03 +00:00
hanna 3b440e0dbc Add a taglet to allow users to override the display name in command-line help.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2270 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-06 04:12:10 +00:00
ebanks 08f2214f14 Stage 1 of massive Variation/Genotype refactoring.
This stage consists only of the code originating in the Genotyper and flowing through to the genotype writers.  I haven't finished refactoring the writers and haven't even touched the readers at all.

The major changes here are that
1. Variations which are BackedByGenotypes are now correctly associated with those Genotypes
2. Genotypes which have an associated Variation can actually be associated with it (and then return it when toVariation() is called).

The only integration tests which need to be updated are MSG-related (because the refactoring now made it easy for me to prevent MSG from emitting tri-allelic sites).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2269 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-06 03:12:41 +00:00
hanna b04de77952 First pass at a reorganized walker info display. Groups walkers by package
and displays walker data extracted from the JavaDoc.  Needs a bit of help,
both in content and flexibility of package naming.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2267 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 23:24:29 +00:00
depristo 07b88621c5 Improved RankSum calculations and RankSum annotation. Much more meaningful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2266 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 22:16:40 +00:00
hanna 4c147329a9 Turn javadoc comments for packages and classes into key/value pairs in a properties file. Embed the properties file
in GenomeAnalysisTK.jar.  Still no support for actually displaying the archived javadoc.  Also change the approach 
to providing package javadocs: retired the deprecated package.html file in favor of Java1.5-style package-info.java.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2263 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 20:08:41 +00:00
ebanks 1e8dcc30da -dbSNP rod should not implement VariantBackedByGenotype since dbsnp records have no genotype data
-added code to cache the allele list so it didn't need to get recomputed each time it was requested.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2260 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 14:56:48 +00:00
ebanks 58937bf9ba You can now use the -exp flag to tell the Genotyper to include experimental annotations when it calls out to VariantAnnotator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2256 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 04:45:05 +00:00
ebanks b05e73a914 Finished implementation of the Wilcoxon Rank Sum Test thanks to Tim Fennell (calculating the normal approximation) and Nick Patterson (dithering to break tie bands).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2255 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 04:04:39 +00:00
ebanks 861221d046 - Moved various header line printing into a single method
- Fixed output for coverage above min depth



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2254 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 02:15:43 +00:00
ebanks aef4be5610 Moved CoarseCoverageWalker to core and packaged both coverage walkers in coverage/
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2249 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:53:36 +00:00
ebanks c2017cc91b PrintCoverageWalker functionality moved to DepthOfCoverageWalker. Added integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2247 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:23:59 +00:00
ebanks 01cf5cc741 1. Merged CoverageHistogram into DepthOfCoverageWalker
2. Fixed bug in histogram calculation for small intervals
3. Better output in DoCWalker
4. Comments added to code



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2245 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:01:53 +00:00
ebanks 44b9f60735 PercentOfBasesCovered functionality moved to DepthOfCoverageWalker. Added integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2244 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 16:11:09 +00:00
ebanks 126d1eca35 Move to core (qc/)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2243 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 15:45:58 +00:00
ebanks 9da5cc25ad More archiving (with permission from Andrey) plus a move to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2242 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 15:40:27 +00:00
ebanks a88202c3f6 Refactored DoCWalker to output in a more helpful and usable style. It now outputs in tabular format with 2 different sections: per locus and then per interval.
I am now at a point where I can merge the functionality from other coverage walkers into this one.
Thanks to Andrew for input.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2239 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 05:28:21 +00:00
ebanks d7e4cd4c82 Moving some useful and stable walkers to core:
- ClipReads
- PrintRODs (generalized to print all RODs that are Variations)
- FixBAMSortOrderTag (added documentation to walker so that people know what it does and why)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2238 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 03:00:45 +00:00
rpoplin 46f3d3e39b Added comments to AnalyzeCovariates and R scripts. R script prevents residuals from going off the edge of the plot. Added skeleton code to the recalibration walkers showing how we plan to handle SOLID reference inserting behavior.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2233 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 23:15:52 +00:00
depristo dec0a781c2 Un-reinventing the wheel. --sleep argument removed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2227 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 20:19:28 +00:00
chartl 6a9e7bea05 Removing experimental annotations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2220 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 19:03:55 +00:00
ebanks 0a2304eff8 - Rename minConfidenceScore in VariantEval to minPhredConfidenceScore
- Moved validation walkers to new qc dir
- Killed unused test



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2218 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 17:59:19 +00:00
ebanks a5dfc9107d - Cleaned up annotation code some more
- Use QualityUtils when phred-scaling now



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2217 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 17:45:29 +00:00
ebanks 7055a3ea2d - All annotations are now required to return their VCF INFO keys and descriptions
- Renamed keys to fit with the standard naming
- FisherStrand is no longer standard
- Integration tests no longer test experimental annotations since they're not stable



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2216 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 17:24:06 +00:00
rpoplin 67179e2412 Initial checkin of AnalyzeCovariates.java which replaces analyzeRecalQuals_1KG.py and is updated to use the new Covariates system. It creates similar plots of residual error for each covariate that was used in the calculation. There is also an option to filter out base qualities below a given threshold.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2215 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 16:47:35 +00:00
ebanks 2838629724 -VCF writer now checks whether the allele frequency has been set before trying to write it out.
-Renamed methods to be more consistent.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2214 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 16:25:32 +00:00
depristo 6231637615 fixes for VariantAnnotations and second bases. Misc. removal of failing (and unstable) integration tests that require rereview
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2213 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 15:41:35 +00:00
ebanks b979bd2ced - Optimized implementation of -byReadGroup in DoCWalker
- Added implementation of -bySample in DoCWalker
- Removed CoverageBySample and added a watered down version to the examples directory



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2209 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 03:39:24 +00:00
ebanks 7c73496e72 Moved DoC walker over to new pileup system so it no longer moves like it's stuck in molasses.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2208 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 02:46:39 +00:00
ebanks 05923f7fba Started transition to oneoffprojects.
Moved/killed a few other walkers (with permission).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2204 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 21:19:02 +00:00
ebanks c36069355e Trivial change to verbose
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2203 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 20:48:10 +00:00
rpoplin 3180fffd43 Eliminated unnecessary boxing of longs in RecalDatum. Changes to RecalDatum in preparation for new AnalyzeCovariates script. Updated TableRecalibrationWalker to make use of these changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2199 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 16:49:05 +00:00
chartl 21a9a717e4 Some minor changes and test:
- DepthOfCoverage is now by reference (so locus-by-locus output correctly reports zero-coverage bases)
  - VariantsToVCF now lets you bind variants with any string except intervals and dbsnp (not just NA######)
  - A PileupWalker integration test on a particularly nasty FHS site
  - Two second-base annotation related integration tests on that same site
       + outputs were all hand-validated in matlab; within a certain tolerance for the annotations




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2197 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 15:15:54 +00:00
ebanks 7c6c490652 An unfinished implementation of the Wilcoxon rank sum test and a variant annotation that uses it. I need to merge and update this code with Tim's implementation somehow - but that won't happen until later this week, so I'm committing this before I accidentally blow it away.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2193 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 04:56:17 +00:00
ebanks 00f15ea909 Improved performance of deletion-free pileup and added mapping-quality-zero-free pileup convenience method.
Finished converting genotyper and annotator code to new ReadBackedPileup system.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2192 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 04:50:47 +00:00
rpoplin 6bb864da2a More misc cleanup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2191 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 22:29:07 +00:00
rpoplin b89b9adb2c misc code cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2190 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 21:16:00 +00:00
rpoplin 4969cb1957 CountCovariates uses new optimized ReadBackedPileup. It also smarter about re-doing calculations for the dnsnp variation rate sanity check.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2188 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 20:35:40 +00:00
ebanks add2fa7ab4 more use of new ReadBackedPileup optimizations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2187 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 20:04:01 +00:00
rpoplin 817e2cb8c5 Recalibrator makes use of the new GATKSAMRecord wrapper and now no longer has to hash the SAMRecord. Covariate's getValue method signature has changed to take the SAMRecord instead of the ReadHashDatum. ReadHashDatum removed completely.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2185 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 19:59:17 +00:00
ebanks e9a8156cfb Use new optimized ReadBackedPileup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2184 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 18:17:18 +00:00
rpoplin d8146ab23d Changed the format of the recalibration csv file slightly so that it is easier to load the file into something like R and look at the values of the covariates.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2183 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 17:55:23 +00:00
ebanks a184d28ce9 Completing the optimization started by Matt: we now wrap SAMRecords and SAMReadGroupRecords with our own versions which cache oft-used variables (e.g. platform, readString, strand flag). All walkers automagically get this speedup since the wrapping occurs in the engine.
I note that all integration/unit tests pass except for BaseTransitionTableCalculatorJava, which is already broken.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2182 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 17:39:29 +00:00
hanna 3300ca906a An iterator for Eric to use when injecting his new wrapping reads -- a stopgap solution for getting additional caching
functionality into a SAMRecord.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2173 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 22:25:52 +00:00
rpoplin 26db15be5c Added SingleReadGroupFilter to only use reads from a specific read group, filtering out all others.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2172 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 20:33:59 +00:00
rpoplin 91f5672a32 misc cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2171 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 19:56:20 +00:00
rpoplin d1298dda13 Encapsulated the sections of code that were shared by the two Recalibration walkers. This includes both the shared command line arguments and the section of code in the map methods which pull out data from the SAMRecord and stuff it into the ReadHashDatum. Command line arguments are now passed to the Covariates using a new initialize method that all Covariates must implement. Updated the dbsnp sanity check warning message to be less cryptic.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2170 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 19:54:10 +00:00
depristo 75b61a3663 Updated, optimized REadBackedPileup. Updated test that was breaking the build -- it created a pileup from reads without bases...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2169 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 23:30:39 +00:00
alecw ac1b289d55 Add tile to ReadHashDatum, and implement TileCovariate
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2166 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 21:41:42 +00:00
depristo db40e28e54 ReadBackedPileup in all its glory. Documented, aligned with the output of LocusIteratorByState, and caching common outputs for performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2165 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 20:54:44 +00:00
rpoplin b44363d20a Removed silly casts from Integer to int.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2164 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 19:59:21 +00:00
ebanks d0f673f0c0 Use Math.abs so we don't get (inconsistent) -0's
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2160 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 19:08:34 +00:00
rpoplin 6ff8526592 Added arguments to the recalibration walkers so the user can specify the default read group id and platform to use when a read has no read group. There are also options to force every read group and every platform to be the specified values. Added integration tests that use a bam file with no read groups. Added comments to all the covariates to explain what each of the methods in the Covariate interface are used for.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2157 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 15:41:12 +00:00
ebanks e1e5b35b19 Don't have the spanning deletions argument be a hard cutoff, but instead be a percentage of the reads in the pileup. Default is now 5% of reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2155 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 04:54:44 +00:00
depristo 03342c1fdd Restructuring and interface change to ReadBackedPileup. We now lower support the Pileup interface, the BasicPileup static methods, and the ReadBackedPileup class. Now everything is a ReadBackedPileup and all methods to manipulate pileups are off of it. Also provides the recommended iterable() interface of pileup elements so you can use the syntax for (PileupElement p : pileup) and access directly from p.getBase() and p.getQual() and p.getSecondBase(). Only a few straggler walkers use the old style interface -- but those walkers will be retired soon. Documentation coming in the AM. Please everyone use the new syntax, it's safer, and will be more efficient as soon as the LocusIteratorByState directly emits the ReadBackedPileup for the Alignment context, as opposed to the current interface. In the process of the change over, discovered several bugs in the second-best base code due to things getting out of sync, but these changes were resolved manually. All other integrationtests passed without modification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2154 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 03:51:41 +00:00
ebanks 2cb3e53b0b Verbose mode shouldn't be printing out 'NaN's and 'Infinity's
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2153 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 22:01:00 +00:00
rpoplin c9ff5f209c Added a CountCovariates integration test that uses a vcf file as the list of variant sites to skip over instead of the usual dbSNP rod.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2152 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:51:38 +00:00
ebanks 3484f652e7 1. Variation is now passed to VariantAnnotator along with the List of Genotypes so non-genotype calls has access to all relevant info.
2. Killed OnOffGenoype
3. SpanningDeletions is now SpanningDeletionFraction



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2151 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:47:20 +00:00
ebanks e05cb346f3 GenotypeLocusData now extends Variation.
Also, Variations should be INSERTIONs or DELETIONs (and not just INDELs).
Technically, VCF records can be indels now.
More changes coming


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2150 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:07:55 +00:00
rpoplin 8b30279edc style update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2149 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 20:56:31 +00:00
rpoplin dffa46b380 BAM files created by TableRecalibration now have the version number and list of covariates used appended to their header with a new 'PG' tag. Eventually the entire list of command line args will be put in there as well. Big thanks to Matt and Aaron. The integration test uses the --no_pg_tag so that the md5 doesn't change every time the version number changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2148 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 20:53:57 +00:00
rpoplin 277e6d6b32 Further optimizations of TableRecalibration. This completes my goal of having the only math done in the map function be addition, subtraction and rounding the quality score to an integer. Everything else has been moved to the initialize method and only done once.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2145 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:21:57 +00:00
ebanks 87c1860398 I'm not sure I believe it, but JProfiler claims that calling FourBaseProbs.isVerbose() was taking 5% of my runtime...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2142 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 17:00:32 +00:00
ebanks b3f561710f Optimizations:
1. Only do calculations in UG for alternate allele with highest sum of quality scores (note that this also constitutes a bug fix for a precision problem we were having).
2. Avoid using Strings in DiploidGenotype when we can (it was taking 1.5% of my compute according to JProfiler)

UG now runs in half the time for JOINT_ESTIMATE model.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2141 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 16:27:39 +00:00
rpoplin a59e5b5e1a Added dbSNP sanity check to CountCovariates. If the mismatch rate is too low at dbSNP sites it warns the user that the dbSNP file is suspicious. Added option in CountCovariates and TableRecalibration to ignore read group id's and collapse them together. Also, If the read group is null the walkers no long crash with NullPointerException but instead warn the user the read group and platform are defaulting to some values. Default window size in MinimumNQSCovariate is 5 (two bases in either direction) based on rereading of Chris's analysis.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2140 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 16:16:44 +00:00
alecw e5e6d515c3 Fix misunderstanding of GenomeLoc interval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2138 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 15:12:49 +00:00
ebanks cb6d6f2686 Very minor performance improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2137 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 05:21:07 +00:00
ebanks c90bea39a1 read.getReadString().charAt(offset) --> read.getReadBases()[offset]
[As a courtesy I fixed all instances once I was updating GenotypeLikelihoods]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2136 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 04:25:19 +00:00
ebanks ec321abd7b Added ability to filter on the QUAL field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2135 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 04:08:22 +00:00
ebanks 36d493e645 All standard annotations now inherit from StandardVariantAnnotation. Users can specify whether they want all annotations, just the standard annotations, or specific annotations. When calling in from another walker, the default is just the standard ones.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2134 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 03:55:12 +00:00
ebanks ee5093d2c6 -Added VariantFiltration integration tests
-Added integration test for GLFs



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2133 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 02:36:27 +00:00
rpoplin 9e4eadc37c CountCovariates v2.0.2: Added a --process_nth_locus <int> argument to only use every Nth covered locus when creating the recalibration table.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2129 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 22:07:38 +00:00
ebanks ed4cf3de57 Check that we're biallelic before calling isSNP()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2127 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:20:48 +00:00
rpoplin 5744a1d968 The covariates don't care about SAMRecord's anymore - Cleaning up the import statements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2126 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:10:12 +00:00
chartl 23983b2fd8 New annotation: ResidualQuality
Computes a metric for how much error is left that isn't explained by ref or snp bases. This is the sum of Q scores, weighted by the proportion of non-ref non-snp bases to non-snp bases. Reported in Log space.

Update to the integration test so bamboo doesn't look as though someone murdered it with a spork




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2124 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:04:01 +00:00
ebanks 70059a0fc9 Refactored joint estimation model to allow subclasses to overload PofD calculation over all frequencies. Pooled model now takes only 20% of time that it used to.
Added integration test for pooled model and updated other joint estimation tests to be more comprehensive now that they are faster.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2123 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:03:38 +00:00
rpoplin 7f947f6b60 Updated recalibrator integration tests to use all three platforms as well as a bam with multi-platform reads intermingled. CountCovariates v2.0.1: Once again uses a read filter to filter out zero mapping quality reads. Added --sorted_output option to output the table recalibration file in sorted order
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2122 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 19:51:36 +00:00
ebanks 14bf6ce83c 1. Newest version of the joint estimation model. Faster than previous version and now qscores can get to be > 39.8 for hets.
2. More sanity checks in annotations


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2119 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:05:50 +00:00
rpoplin 1d46de6d34 The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 14:58:33 +00:00
ebanks dfe7d69471 1. VCF: don't print slod if it's never set
2. UG: don't print slod if lods are infinite (todo: figure out a good guess instead)
3. UG: if probF=0 for 2 alt alleles are both 0 (because of precision), use log values to discriminate



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2116 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 02:55:43 +00:00
ebanks 753cb100a3 Add checks for weird situations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2115 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 02:14:25 +00:00
ebanks bf935a6ab1 1. Fixed bug in PrimaryBaseSecondaryBaseSymmetry code (not checking for null before trying to access object's methods) which was causing Integration Tests to fail.
2. Retired allele frequency range from UG, which wasn't very useful.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2113 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 01:31:48 +00:00
depristo 9c206abb97 removing unnecessary printing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2110 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 12:41:48 +00:00
chartl 59416ae06a This is an annotation adapted from one that Mark Daly suggested some time ago. Right now it calculates:
- For all reference bases, the proportion of their second best bases that support the SNP

- the proportion of non-reference bases that support the SNP

and reports the difference between the two. Initially I was taking depth into account as well, but that did not appear to work as nicely as I'd like (even at 20,000x depth, if 95% of the non-reference bases are C, and 98% of the reference second-best-bases are C, then we would want to be suspicious of it; but perhaps slightly less so than if the depth were only 20...)

Anyway it's now available. I'm not sure how useful it will be, but I spawned the FHS annotation jobs again, so we'll see.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2109 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 00:47:49 +00:00
depristo 27122f7f97 Performance improvements for pooled caller. Now possible to actually run on real data in a finite amount of time. Minor changes to GL interface (making strandIndex public) to support cached calculations in pooled caller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2107 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-21 15:07:40 +00:00
ebanks 797bb83209 New VariantFiltration.
Wiki docs are updated.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2105 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 19:50:26 +00:00
ebanks d84444200b The Unified Genotyper now sorts the sample names in the vcf that it outputs.
[There was no reason to enforce that every VCF being output from the GATK should have the samples sorted, since someone might want them ordered non-alphabetically]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2102 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 16:13:18 +00:00
ebanks 2a5349d886 VariantAnnotator now adds dbsnp id if a dbsnp rod is supplied and it's not already set for a record
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2100 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 03:26:09 +00:00
depristo 82fd824c4d Continuing improvements to unified genotyper
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2098 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 01:39:29 +00:00
rpoplin 22aaf8c5e0 Added the old recalibrator integration tests to the refactored recalibrator sitting in playground.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2096 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 22:43:28 +00:00
aaron 6ba1f3321d Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord.
Updated the integration tests that were failing to due to different ordering of genotyping entries in VCF, I'll check in the VCF diff tool I wrote when I get a cycle or two.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 18:17:47 +00:00
alecw 7623b39927 Add rodPicardDbSNP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2088 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 17:27:46 +00:00
ebanks 7b957d3e2e Make the whining from Khalid's office stop already
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2079 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 03:04:48 +00:00
hanna 85bc9d3e91 (Hopefully) temporary hack: load contig information by contig name rather than contig id to avoid
off-by-one errors.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2078 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 23:33:27 +00:00
ebanks f667bed7fc -Don't annotate allele balance or on-off genotype if there's no genotype data
-If qscore is infinity (because of precision) make a best guess instead


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2076 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 22:01:32 +00:00
ebanks 087e01a439 minor changes for --noSLOD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2074 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 18:48:01 +00:00
ebanks a70cf2b763 A bunch of changes needed to make outputting pooled calls possible
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2073 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 18:42:57 +00:00
ebanks 0a35c8e0ba 1. The joint estimation model now constrains genotypes to be AA,AB,or BB only (i.e. to use a single alternate allele). Note that this doesn't work for the old models (point estimate or SSG) because calculations aren't divided by alternate allele.
2. Allele frequency spectrum is not emitted for single samples (since it doesn't make sense).
3. If in pooled mode, throw an exception of pool size isn't set appropriately.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2072 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:43:15 +00:00
depristo 6fe1c337ff Pileup cleanup; pooled caller v1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:03:48 +00:00
chartl b68d6e06b7 Rollback of the previous "fix" and implementation of the real fix.
We totally *do* want to annotate the call if called by another walker. Totally boneheaded misenterpretation of what the code was doing -- Eric, please forgive me for being an idiot.

Instead, change the StingException to what it really should be -- an IllegalStateException, which is not coincidentally already handled by the calling function. 



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2067 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 06:09:24 +00:00
chartl 95f1be94c0 Fix for the broken build:
do **not** attempt to annotate if UnifiedGenotyper is called from another walker! Why this didn't break the build earlier I have no idea.

Ultimately, there should be a better way of interfacing UG with another walker -- what if some other walker wants the annotations from UG? But since we're calling map directly -- and the annotations don't get returned directly from map -- this needs to be handled differently, while the map function should ultimately return the LOD score or quality under the GCM alone.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2066 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 05:56:31 +00:00
ebanks 9fb50e9bd9 Further refactoring so that pooled calling will work.
Okay, Mark, you should be all set.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2065 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 00:18:13 +00:00
chartl 539f6f15e5 Added --
Second base skew annotations and integration tests. Nothing need be given except -A SecondBaseSkew; the statistic it annotates calls with is a chi-square statistic given by the deviation of the observed proportion of reference second-best-bases from the expected 1/3. Future additions may be to ask that the deviation be instead from a given transition table.

A big note for all users: All IllegalStateExceptions from the variation ROD (e.g. the RodGeliText) are dealt with SILENTLY. I understand this isn't optimal, but I'd rather simply not annotate a non-bi-allelic site than fail completely (there are quite a few such sites even on the regions over which the integration test has been written).




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2064 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 00:11:13 +00:00
depristo 42a0bbaf46 Minor reformating for pooled calling
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2063 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-17 22:06:11 +00:00
ebanks 4d9c826766 Integration tests actually run on real data now.
<tries to hide sheepish grin>


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2061 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-17 21:04:14 +00:00
ebanks a048f5cdf1 -Refactored JointEstimation code so that pooled calling will work
-Use phred-scale for fisher strand test
-Use only 2N allele frequency estimation points



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2059 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-17 20:21:15 +00:00
asivache 21729d9311 Do not print debug message when debug mode is not requested!!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2056 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 20:28:41 +00:00
rpoplin 967215066d The old CountCovariates now warns the user if they didn't supply a dbSNP rod file. Thanks Kiran for the use case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2055 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 19:16:46 +00:00
ebanks 4558375575 Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated.
VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper.  UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it).

This is a fairly all-encompassing check in.  It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout.  All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing.

Stage 2 of the process will happen later this week.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 02:41:20 +00:00
kiran 103763fc84 An accessor for the VCF header
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2051 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-15 09:28:25 +00:00
ebanks bf451873ff 1. Bug fix: check that AF=0 doesn't contain more probability than 1-fraction
2. Fix for Kiran: allow UG to call SNPs at deletion sites; we'll add an annotation to the VariantAnotator for deletions at the locus (next week).
3. Added integration tests for joint estimation model



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2038 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 18:02:18 +00:00
asivache 1be36ca959 Bug fix: when cleanedReadIterator is initialized, it gets immediately set to the contig of the first cleaned read; when the first uncleaned read coming in is on the lower contig, this would trigger 'readNextContig' with that lower contig as an argument. As the result, the whole cleaned reads file would be read through the end and no cleaned reads would be ever seen by the code afterwards. Now we do not call readNextContig if the (uncleaned) read's contig is lower than the current contig already loaded into cleanedReadIterator. the 'readNextContig' method now also throws an exception if requested contig is less than the currently loaded one
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2037 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 15:41:26 +00:00
depristo cff31f2d06 comments for eric
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2035 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 14:19:31 +00:00
aaron 234bb71747 changed the toVariation() method to take a reference base, instead of using the reference base loaded from the underlying data source (if it was reference aware). Also changed some isVariant() methods which weren't using the passed in ref base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2034 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 06:54:38 +00:00
ebanks 902cf84448 Bug fix: if the most likely allele frequency is 0, don't make a variant call (even if the Qscore for AF=1/n > threshold)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2033 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 04:10:32 +00:00
ebanks 555fb975de 1. Print out allele frequency range (from joint estimation model only).
2. Don't print verbose output from SLOD calculation (it's just a repeat of previous output).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2032 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 03:59:13 +00:00
hanna 8145ed4672 Take 2, updating picard with bug fix for bam files containing no reads.
Just stomped on the existing md5s because that's what Eric told me to do.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2029 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 22:52:08 +00:00
ebanks 61b5fb82ce 2 major changes:
1. Add dbsnp RS ID to VCF output from genotyper; to do this I needed to fix the dbsnp rod which did not correctly return this value.

2. Remove AlleleBalanceBacked and instead generalize the arbitrary info fields backing VCFs (and potentially others) in preparation for refactoring VariantFiltration next week.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2028 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 22:51:49 +00:00
aaron c3c001e02e cleanup of the traversal output code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2026 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 06:18:10 +00:00
ebanks 0922400ca9 Don't try to calculate ratios when DoC is zero (which happens when calls are made by an LD-aware genotyper)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2025 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 02:51:44 +00:00
hanna 2ea85fb62b Fix some problematic command-line argument naming and descriptions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2023 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 02:12:26 +00:00
depristo 6c9f86bb4d Removed unnecessary output and added debugging print() routine
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2020 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 18:37:36 +00:00
hanna 8406325247 New Picard is breaking one of the integration tests.
Revert until we find out whether the cause is legit.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2017 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 03:59:32 +00:00
hanna 499e7d1d75 Push forward some more delicate merging routines.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2016 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 03:07:34 +00:00
hanna bae4d3f7ea Updated Picard with fix for Doug Voet. Thanks Alec.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2015 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 02:01:08 +00:00
hanna 2e4782f202 Command-line arguments for SamReadFilters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2014 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 23:36:17 +00:00
hanna 2cf9670d1e Allow users to directly specify filters from the command-line, applicable to
any walker.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2012 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 18:40:16 +00:00
ebanks 6a37090529 Output changes for VCF and UG:
1. Don't cap q-scores at 99
2. Scale SLOD to allow more resolution in the output
3. UG outputs weighted allele balance (AB) and on-off genotype (OO) info fields for het genotype calls (works for joint estimation model and SSG)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2011 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 16:31:31 +00:00
depristo d316cbad4c VariantFilteration now accepts a VCF rod in addition to an input geli. It will then annotate this VCF file with filtering information in the INFO field too. --OnlyAnnotate will not write in filtering output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2008 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 13:24:58 +00:00
aaron f9819d5f13 a little clean-up
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2007 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 06:18:34 +00:00
aaron 2ed423ed56 print the current location in read walkers (in addition to the number of reads processed), along with some refactoring to support the change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2006 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 05:57:01 +00:00
ebanks 2fa2ae43ec Enough people have found this useful, so...
Moving Callset Concordance tool to core and adding integration test.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2003 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 20:59:18 +00:00
ebanks 3793519bd4 -Added convenience method to VCF record to tell if it's a no call and have rodVCF use it before querying for info fields
-Don't restrict info fields to 2-letter keys
[about to move these to core]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2002 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 20:52:51 +00:00
ebanks 74751a8ed3 -Some minor fixes to get accurate vcf record merging done
-Improvement to snp genotype concordance test

And with that, it looks like I get revision #2000.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2000 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 06:40:55 +00:00
ebanks 7ce0df76f8 Added accessors to the rod data sources so that walkers can access the name/file/type triplets for input rods. This is necessary if e.g. you want to create a vcf writer based on all of the samples being input.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1994 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 04:25:39 +00:00
ebanks d07f3bb6f6 Added methods to get strand bias and to test if record has allele freq or bias fields set.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1993 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 04:20:35 +00:00
kiran 3313b0ddb4 Fixed a minor bug where the lodThreshold wasn't being printed in the header.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1992 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 16:51:36 +00:00
kiran 567f5758d2 Optionally lists read depths by read group.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1990 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 16:39:19 +00:00
hanna 21c5f543fa Fix sharding bug -- loci to which >100,000 (= 1 shard) reads are assigned an
alignment start will confuse the sharding system and cause it to return duplicate reads.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1987 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 14:27:26 +00:00
ebanks d549347f25 Refactored GenotypeLikelihoods to use an underlying 4-base model.
It needs to be modified a bit and then hooked up to a pooled model, but that is now possible.
At this point, there is no difference to the Unified Genotyper.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1978 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-05 21:59:25 +00:00
aaron aacd72854f a fix for a bug Andrey discovered: in read-based interval traversals we're dupplicating reads in rare cases. The problem was that to accomidate a bug in SAM JDK indexing, we were forced to add one to the stop of our QueryOverlapping() calls to ensure we always got all of the overlapping reads.
Added a PlusOneFixIterator that wraps other iterators, and eliminates reads that start outside of our intended interval (interval stop - 1).  Updated and checked BamToFastqIntegrationTest MD5 sums.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1976 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-05 05:26:33 +00:00
ebanks a545859c62 Joint Estimation model now emits a reasonable slod
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1969 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 21:12:42 +00:00
ebanks 11d950abe0 No longer allow the lod_threshold argument - use confidence instead.
Have UG output qscores in all cases.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1968 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 16:18:51 +00:00
asivache 2fb45dbd73 Make window size a command line argument
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1967 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 16:13:35 +00:00
asivache 55f61b1f88 Bug fix in adjustment of the shift position.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1966 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 16:08:11 +00:00
ebanks 3a33401822 2nd stage of the genotyper output refactoring is complete.
Now, all output is generalized and all of the intelligence lies where it is supposed to.
Next stage is syncing up old and new models and making sure we're outputting exactly what we should.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1960 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-02 22:43:08 +00:00
aaron ba67c7f02b added a warning for those using bed files; we properly convert bed to the internal representation but the user needs to be aware that any output will be one-based closed intervals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1959 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-02 21:09:18 +00:00
aaron b71b66bd88 the underlying parameter is a float so we need to use Float.valueOf() instead; Noticed by external user Hou Huabin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1958 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-02 20:22:25 +00:00
ebanks af6d0003f8 -Generalized the GenotypeConcordance module to deal with any number of individuals (although it will default to its old behavior if the -samples argument is left out).
-Make rods return the appropriate type of Genotype calls from getGenotype().



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1954 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-01 05:35:47 +00:00
asivache 4b0796ba58 After fixing a few glitches and bugs, this version finally works as intended
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1952 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-31 04:59:58 +00:00
asivache ea8d5c7077 Some internal refactoring. Now "safely" ignores duplicate records (NOT duplicate reads but rather malformed bam files!) resulting from the bug/feature in CleanedReadInjector.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1949 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-30 17:50:51 +00:00
ebanks 4ee1d6f733 -Have the calculation models determine whether a call passes the lod/confidence thresholds (as opposed to returning everything and letting the UG decide); this way, walkers which call map() will get only the good calls.
-Do the right thing in all models for all-base-mode (for Kiran).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1940 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-30 02:35:51 +00:00
asivache e3b4d4cbed Genotyper reimplemented. Does the same thing, at least for now, but internal data structures redesign enables collecting various statistics for indel-containing/reference-matching reads. The statistics are not yet used by the caller itself to make a better judgement w.r.t. the validity of the calls it makes, but they are now printed into the output stream (--verbose). The statistics (for both normal and tumor) include: indel observation count/total coverage, av. number of mismatches per indel-containing and per ref-matching read, av. mapping quality, av. mismatch rate and av. base quality within an NQS windoew around the indel, numbers of indel and ref observations per strand.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1936 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 19:09:16 +00:00
ebanks 5cdbdd9e5b now that the design is stable, pull the setReference and setLocation methods back out of Genotype and stick them into constructors of implementing classes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1931 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 13:27:37 +00:00
ebanks 3091443dc7 Sweeping changes to the genotype output system, as per several discussions with Matt & Aaron.
Some things still need to be changed, but it will entail some more design decisions first (which means I get to bug M&A again tomorrow!).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1930 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 03:46:41 +00:00
depristo 86573177d1 Reverting rod walkers to use underlying refwalker implementation while we work on ROD2 and reenable the system. Added some serious sparse file parsing to variant eval tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1929 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 01:04:37 +00:00
aaron 5a3bd50537 adding error log reporting to the GATK, and a stream based output method for the argument collection
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1926 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 19:56:05 +00:00
aaron 04e9a494e9 removed the GenotypesBacked interface, which is currently unused. Also cleaned up some documentation lines
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1924 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 18:08:14 +00:00
depristo 186a8dd698 Trivial protection for null value
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1918 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 21:52:52 +00:00
depristo 726378be8b Almost ready to stop doing eagar decoding; waiting on Eric
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1914 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 15:28:05 +00:00
aaron 3fb3773098 a fix for traverse dupplicates bug: GSA-202. Also removed some debugging output from FastaAltRef walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1912 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-26 20:18:55 +00:00
hanna a1e8a532ad Support for initialize() and onTraversalDone() output from parallelized walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1911 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-26 20:18:31 +00:00
ebanks 75ad6bbef7 Check that map isn't being called passing in null arguments.
(This seems wrong; see JIRA entry GSA-211)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1907 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-25 02:30:36 +00:00
hanna 65b98470f3 Temporary fix: have RodLocusView manage and close its RODs. Really the
relationship between these two classes needs to be rethought; see JIRA
GSA-207.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1904 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 16:00:12 +00:00
aaron ad1fc511b1 intermediate commit for some changes in the Variation system, so Eric can go ahead with his changes. Everything is pretty set, but the Variation interface could use a convenience method that joins all the alternate alleles.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1903 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 06:31:15 +00:00
ebanks 6c338eccb8 Joint Estimation model now emits calls in all formats.
The whole GenotypeCall framework needs to be changed, but this will work for the time being.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1902 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 03:07:28 +00:00
ebanks 54c61c663c -Cleanup of the Joint Estimation code
-Don't print verbose/debugging output to logger, but instead specify a file in the argument collection (and then we only need to print conditionally)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1899 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 15:25:29 +00:00
asivache 2cab4c68d4 Added method: isCodingExon(). Returns true if position is simultaneously within an exon AND within coding interval of any single transcript from the list. The old method of detecting coding positions as isExon() && isCoding() is buggy, as the position could be in the UTR part of one transcript (isExon() is true), and within coding region bounds (but not in the exon) of another transcript (isCoding() is true). As a result UTR positions would be erroneously annotated as coding.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1898 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 14:55:07 +00:00
ebanks 55fa1cfa06 -Renamed new calculation model and worked out some significant xhanges with Mark
-Allow walkers calling the UG to pass in their own argument collections


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1896 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 20:49:36 +00:00
ebanks 9b9744109c Mark's new unified calculation model is now officially implemented.
Because it doesn't actually use EM, it's no longer a subclass of the EM model.

Note that you can't use it just yet because it doesn't actually emit calls (just prints to logger).  I need to deal with general UG output tomorrow.  Hold off until then, Mark, and then you can go wild.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1891 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 02:39:23 +00:00
depristo caa3187af8 Enabling correct high-performance ROD walker and moved VariantEval over to it. Performance improvements in variantEval in general. See wiki for full description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1890 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 23:31:13 +00:00
depristo 449a6ba75a Deleting lots of code as part of my cleanup. More classes tagged for removal. Many more walkers have their days numbered.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1885 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 12:23:36 +00:00
ebanks b8ab77c91c Don't filter out reads without proper read groups. Instead, allow the user (or another walker calling UG) to specify an assumed sample to use (but then we assume single-sample mode).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1883 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 01:30:53 +00:00
ebanks c29924e7cf Reverting previous change.
Aaron, it's all yours...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1881 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:55:24 +00:00
aaron d21b582b18 memory leak, where the Resource Pool was releasing based on the value and not the key, resulting in the resourceAssignments map growing with each additional shard
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1880 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:39:42 +00:00
ebanks 761a730758 assertBiAllelic -> assertMultiAllelic.
Chris, if this breaks an integration test, you get it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1879 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:09:46 +00:00
aaron cfa86d52c2 ensure that in the indel case we don't allow identification as both an insertion and deletion at the same location in the VCF ROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1875 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 18:21:00 +00:00
ebanks 51f9ec0a5c subtract largest posterior value from all values; this hopefully solves any precision issues
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1870 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-18 05:20:15 +00:00
ebanks b9e8867287 -push allele frequency and genotype likelihood variable definitions down into the subclasses so that they can use different data structures
-use slightly more stringent stability metric
-better integration test



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1869 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-18 04:22:17 +00:00
chartl ad777a9c14 @BasicPileup - made the counts public so they can be used
@PoolUtils - split reads by indel/simple base

@BaseTransitionTable - complete refactoring, nicer now

@UnifiedArgumentCollection - added PoolSize as an argument

@UnifiedGenotyper - checks to ensure pooled sequencing uses the appropriate model

@GenotypeCalculationModel - instantiates with the new PoolSize argument




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1867 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 21:56:56 +00:00
andrewk d1a4cd2f73 Added ValidationData analysis type to VariantEvalWalker; this eval takes a GFF file with validated truth data positions (bound to "validation")and calculates the accuracy of the genotype calls bound to "eval".
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1862 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 15:39:08 +00:00
ebanks 418e007ca6 A cleaner interface: now everyone can use UG's initialize method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1860 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 14:09:16 +00:00
aaron 96972c3a5c a fix for a bug Eric found: if your first call contains fewer samples than calls at other loci, your VCFHeader got setup incorrectly.
Also moved a buch of Lists over to Sets for consistancy.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1859 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 04:57:50 +00:00
aaron a69ea9b57c Cleaning up the VCF code, adding lots of tests for a variety of edge cases. Two issues are still outstanding: updating the no call string with the standard 1000g decided on today, and fixing Eric's issue where not all the VCF sample names are present initially.
also: their, I hope your happy Eric, from now on I'll try not to flout my awesomest grammer in the future accept when I need to illicit a strong response :-)

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1858 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 04:11:34 +00:00
ebanks 993c567bd8 I had to remove some of my more agressive optimizations, as they were causing us to get slightly different results as MSG. Results in only small cost to running time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1856 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 00:59:32 +00:00
asivache 7d7ff09f54 throw an exception if read has no associated read group
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1855 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 18:11:32 +00:00
depristo 0c2016c19a Improved error messages -- now easier to read, points to the GATK Error Messages wiki, and avoids double printing of stack traces
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1850 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 12:07:44 +00:00
ebanks a32470cea1 Deal with the fact that walkers can call UG's init/map functions directly.
We need to filter contexts in that case since the calling walkers don't get UG's traversal-level filters.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1848 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 02:31:45 +00:00
ebanks e740e7a7ce Because walkers call UG's map function, we need to move the actual writing out
to UG's reduce function.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1845 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 20:49:26 +00:00
ebanks 52d2e0ca07 All walkers now use read.getReadGroup()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1839 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 19:27:40 +00:00
aaron eb90e5c4d7 changes to VCF output, and updated MD5's in the integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1836 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:42:48 +00:00
ebanks 89771fef05 -Use read.getReadGroup()
-Add another filter for read groups for Chris


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1835 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:08:32 +00:00
ebanks 311ab8da5a A helper class to create the masks for the sequenom design maker.
This project is now officially done.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1834 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:28:51 +00:00
ebanks 0c95d6906f Merge both versions of the Sequenom assay design maker: use Jared's base code and add in indels. [Jared, this still emits the same output for SNPs as your original version)
Remove all sequenom stuff from the FastaAlternateReferenceMaker so it can just concentrate on making alternate references...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1831 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:11:45 +00:00
ebanks f2886d88e0 We now emit genotype calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1828 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 02:49:56 +00:00
ebanks 96b8499a31 Remodeled version of the UnifiedGenotyper.
We currently get identical lods and slods as MultiSampleCaller (except slods for ref calls, as I discussed with Jared) and are a bit faster in my few test cases.  Single-sample mode still emulates SSG.
The remaining to do items:
1. more testing still needed
2. we currently only output lods/slods, but I need to emit actual calls
3. stubs are in place for Mark's proposed version of the EM calculation and now I need to add the actual code.
More check-ins coming soon...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1821 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 20:27:01 +00:00
aaron 77499e35ac fixes for GSA-199: Need easier way to write binary outputs to standard output. GLF and VCF now have stream constructors, and can get dumped to standard out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1818 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 15:50:20 +00:00
ebanks caf689821f added method to get normalized posteriors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1809 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 02:33:22 +00:00
ebanks cf7a26759d -use the getReadGroup() function that was added to picard for us
-clean up some include lines


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1808 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 01:39:32 +00:00
hanna d844d1c496 SAMFileWriters specified as command-line arguments were sometimes incorrectly altering the default short name. Make sure short name is not specified if shortName is not specified but fullName is.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1807 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 19:16:46 +00:00
hanna da084357db Fixed minor typo in output message.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1806 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 18:56:54 +00:00
ebanks a9f3d46fa8 Your time has come, SSG.
Fare thee well.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1799 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 20:27:56 +00:00
aaron 98e3a0bf1a VCF can now be emitted from SSG. The basic's are there (the genotype, read depth, our error estimate), but more fields need to be added for each record as nessasary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1797 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 19:50:04 +00:00
kiran 29ad6cd876 Made redundant by BCMMarkDupes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1795 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 18:47:20 +00:00
ebanks 15bf014e0b logger.info -> logger.debug (don't want to risk filling up my log on genome-wide calls)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1792 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 17:53:11 +00:00
ebanks 04fe50cadd *** We no longer have a separate model for the single-sample case. ***
For now, a single sample input will be special-cased in the EM model - but that will change when the EM model degenerates to the single sample output with a single sample as input.  For now, the EM code for multi-samples isn't finished; I'm planning on checking that in soon.

The SingleSampleIntegrationTest now uses the UnifiedCaller instead of SSG, and so should all of you.  More on that in a separate email.
Other minor cleanups added too.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1785 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 14:08:57 +00:00
kiran 829e99413b Rescores a variant after removing duplicates (defined very strictly as reads with the same start points).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1782 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 03:07:36 +00:00
ebanks 1905b5defa Hash by chromosome for now to reduce memory. This is a temporary solution until we decide how to reture the Injector for good.
Also, with Picard's latest changes, we need to make sure we don't double-close the sam writer.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1779 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 20:06:25 +00:00
ebanks 203c626fc2 A wrapper around the GenotypeLikelihoods class for the UnifiedGenotyper. This wrapper incorporates both strand-based likelihoods and a combined likelihoods over both strands.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1777 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 19:57:37 +00:00
depristo 8dd0924b37 Minor performance improvements to VariantEval -- now all of the CPU time is spent dealing with the ROD system...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1772 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 23:40:30 +00:00
aaron 4554ca1b28 more cleanup, depecaited the old genotype, corrected SNPCallsFromGenotypes' imports and two other classes that depend on it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1771 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 19:09:27 +00:00
aaron 3aec76136f Removing the AllelicVariant interface, which is replaced by the Variation interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1770 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 17:44:24 +00:00
aaron 66fc8ea444 GSA-182: Adding support for BED interval files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1767 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 02:45:31 +00:00
hanna aec83b401d SSG multithreading doesn't play well with some I/O changes made since I last svn up'd. Reverting until I can find the reason.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1766 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 19:48:57 +00:00
hanna 8a503c86b6 Code supporting SSG proof-of-concept shared memory parallelism.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1765 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 18:56:16 +00:00
ebanks fb619bd593 -Refactoring: make GenotypeCalculationModel constructors empty so that they don't have to be updated every time we add a new parameter; instead put that logic in the super class's initialize method (making everything protected so that only the factory can access them)
-Adding initial version of Multi-sample calculation model.  This still needs much work: it needs to be cleaned up and finished.  Right now, it (purposely) throws a RuntimeException after completing the EM loop.

Also:
-Fix logic in GenotypeLikelihoods.setPriors
-Add logger to the models for output






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1764 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 18:10:36 +00:00
aaron 7fc4472e6d A big fix for MergingSamRecordIterator, where we weren't correctly handling the comparisons of SAMRecords correctly (we weren't applying the new reference index first, so sometimes the MT contig would be ID 23, sometimes 24 in different records).
Also a fix to the GLF tests, and a correction to PrintReadsWalker to remove the close() on the output source, the source handles that itself (and you get a double close).

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1758 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 19:35:35 +00:00
ebanks 53a4bd7f51 A better understanding of what's going on means no need for clearing the cache
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1755 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 18:07:46 +00:00
aaron e885cc4b21 changes for corrected GLF likelihood output, along with better tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1754 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-01 20:45:05 +00:00
aaron 2e4949c4d6 Rev'ing Picard, which includes the update to get all the reads in the query region (GSA-173). With it come a bunch of fixes, including retiring the FourBaseRecaller code, and updated md5 for some walker tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1751 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 20:37:59 +00:00
ebanks 303972aa4b Yup, I broke the build...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1750 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 20:20:43 +00:00
ebanks 841d25cc44 Added ability to set the priors after construction (and requiring a flushing of the likelihoods cache)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1749 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 19:55:49 +00:00
hanna 70e1aef550 Better integrate the @ArgumentCollection into the command-line argument parser. Walkers can now specify their own @ArgumentCollections. Also cleaned up a bit of the CommandLineProgram template method pattern to minimize duplicate code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1746 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 22:23:19 +00:00
aaron b1c321f161 Adjusted Genotype concordance to more accurately use the new Genotyping code, fixed the VCF rod, and temp. fix the build by reintroducing Shermans ReadCigarFormatter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1745 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 21:28:21 +00:00
ebanks 9ef80e3c3c One minor addition: to incorporate Pooled calling (and to be as general as possible), we allow the genotype calculation model to use rods if it wants.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1741 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 17:05:59 +00:00
ebanks 19bfe43173 First pass at a unified caller, being checked in now so Mark can give feedback if he chooses and so Matt can debug issues with the ArgumentCollection class.
Some notes:
1. This design should be flexible enough to include pooled calling (for now) after discussions with Chris.
2. Using the unified caller with the SingleSampleCalculationModel emits the exact same output as SSG over all of chr20 for NA12878.  Additionally, when we include the "max deletions allowed at a locus" argument (so we don't try to call SNPs at deletion sites), it removed 233 SNP calls in chr20 that were clearly indel artficts.
3. The MultiSampleEMCalculationModel is still a work in progress and will be checked in later this week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1740 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 16:48:15 +00:00
andrewk 5dab95aa5a Fix getMergedReadGroupsByReaders so that it provides read groups in the same way Picard does so that it works correctly when input read files have no clashes in their read groups and retain their original read group names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1737 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 06:35:50 +00:00
asivache bce2f0d7cf Now instantiates the list of alternative consenses to evaluate as LinkedHashSet to guarantee iterator traversal order. Old implementation used HashSet and exhibited unstable behavior when two alt consenses turned out to be equally good: depending on the run conditions (including size of the interval set being cleaned??), either one could be seen first as selected as the 'best' one
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1734 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-28 06:15:46 +00:00
asivache 663175e868 Bug fix: when jumping onto next contig (chromosome), the walker was erasing last mismatch interval from the previous chr it was still holding without printing it; now it gets printed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1733 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 22:24:34 +00:00
asivache aec61c558b moving IndelGenotyper out from playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1731 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 19:43:53 +00:00
aaron 2b7d39035a switched over the FastaAlternateReferenceWalker to the Variation system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1726 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 16:09:43 +00:00
aaron 7ffc1d97ef Cut DeNovoSNPWalker over to the new Variation system, some renaming of methods on the Variation interface, and some corrections on the interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1724 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 04:35:52 +00:00
aaron d2af26e81f Pooled EM SNP Rod converted over to the Variation interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1719 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:33:11 +00:00
ebanks 97105ac001 We need to return a null RODRecordList when the default value is null (as opposed to a list with a single null value), because that's what everyone is expecting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1718 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:23:12 +00:00
ebanks d4b40bc06f Filter for reads with missing read groups so we can safely assume all reads have valid read groups
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1717 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:10:26 +00:00
ebanks 90de2e0cde Added ability to specify whether you want to use a point estimate or fair coin test calculation; for now you can use either but fair coin test is still experimental as it needs to be parametrized correctly. This job will hopefully be done by the future Bioinformatic Analyst...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1716 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 15:29:50 +00:00
aaron d262cbd41c changes to add VCF to the rod system, fix VCF output in VariantsToVCF, and some other minor changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1715 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 15:16:11 +00:00
ebanks 423a3ee894 Added a sequenom rod to empower Carrie to convert 1KG validation SNPs to sequenom format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1706 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 20:22:09 +00:00
hanna 856bbd0320 Let Picard specify the default compression level.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1701 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 19:01:48 +00:00
aaron f783cb30e0 adding an interface so that the current @Requires with ROD annotations work in walkers like VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1700 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:24:05 +00:00
hanna ebfbe56b43 Make sure compression level always gets pushed into SAMFileWriterFactory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1699 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:20:26 +00:00
asivache bf7cd66d53 New, simpler rodRefSeq. Fully relies on the ROD system standard mechanisms. Multiple transcripts over a given location will be now returned by the ROD system itself as RodRecordList<rodRefSeq>; and yes, rodRefSeq does represent a single transcript record now and implements Transcript interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1697 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:18:25 +00:00
asivache 8fa4c93f5a Transcript is now simply an interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1696 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:13:31 +00:00
asivache 1bd4c0077c Now that ROD system supports overlapping RODs, we do not need rodRefSeq to be too smart and read in all the overlapping records (transcripts) on its own; leave it to the generic ROD mechanism.
PARTIAL commit; new, simpler rodRefSeq will reappear in a seq.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1694 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:11:16 +00:00
aaron 11c32b588f fixing VariantEvalWalkerIntegrationTest md5 sums, a couple comment changes, and a little bit of cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1690 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 20:54:47 +00:00
ebanks 0748d80baa Added a convenience method in rodDbSNP to deal with Andrey's changes to the rod. Now you can just ask for the first real SNP rod from the list and not have to think about how it works.
CountCovariates uses it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1688 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 20:15:40 +00:00
ebanks 682b765536 bug: need to upper case chars so that == works throughout
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1684 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 18:20:43 +00:00
asivache 57d31b8e9b Filter that discards reads from specific lanes; and also its friend that helps blacklisting a set of lanes from GATK command line a one-liner.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1681 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 16:46:06 +00:00
ebanks 5ce42cbab3 After thinking about this a bit more, it makes sense to pull this functionality out of my walker and into the GenomeLocParser where everyone else can benefit from it...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1677 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 01:32:35 +00:00
ebanks b1dc6d65e4 interval merging is now blazingly fast
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1674 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 21:15:04 +00:00
asivache 15135788ca OK, let's bite the bullet. Now rodDbSNP objects are 'isSNP()' only when they are annotated as 'exact', not a 'range'.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1673 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 19:25:16 +00:00
asivache 8ad181f46f Note to myself: do 'ant clean' now and then or old versions of the code that suddenly became invalid will stick around. The world is not perfect, and neither is automatic dependency resolution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1672 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 17:40:52 +00:00
asivache d2d1354199 Now uses BrokenRODSimulator class to pass the test. CHANGE the code to use new ROD system directly and MODIFY MD5 in corresponding tests, since a few snps are seen differently now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1670 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 17:03:49 +00:00
asivache 29adc0ca1c Little class that can be used to simulate the results returned by the old ROD system. This is needed to keep couple of tests from breaking. All the code that uses this class must be changed urgently to accomodate the data as returned by new ROD system, and the corresponding tests (MD5 sums) have to be modified as well since some data as seen through the new ROD system is indeed different.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1668 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 16:58:56 +00:00
asivache a6bd509593 Changing the carpet under your feet!! New incremental update to th eROD system has arrived.
all the updated classes now make use of new SeekableRodIterator instead of RODIterator. RODIterator class deleted. This batch makes only trivial updates to tests dictated by the change in the ROD system interface. Few less trivial updates to follow. This is a partial commit; a few walkers also still need to be updated, hold on...

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1667 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 16:55:22 +00:00
asivache 4c67a49ccb Removed unused imports
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1666 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 16:45:22 +00:00
hanna e7f44ada98 Make unpackList public static so that Doug can use it in the scatter/gather framework.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1665 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 15:32:49 +00:00
ebanks 7b627fd622 Check for empty interval lists to merge
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1664 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 04:34:26 +00:00
hanna 7f5778c966 Update gsadevelopers -> gsahelp.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1663 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-20 23:36:54 +00:00
aaron 3a487dd64e little fixes; also fixed a tyPo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1662 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 22:38:51 +00:00
aaron b6d7d6acc6 fix for the eval tests, and a change to the backedbygenotypes interface, more changes to come
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1661 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 22:25:16 +00:00
depristo 4318f75910 tiny cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1660 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 21:04:25 +00:00
depristo 3a341b2f06 Fixes for VariantEval for genotyping mode
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1659 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 21:01:43 +00:00
aaron 7b39aa4966 Adding the VCF ROD. Also changed the VCF objects to much more user friendly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1658 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 20:19:34 +00:00
ebanks b19fd4d45c Damn unit tests have a null Toolkit()...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1654 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 17:10:49 +00:00
ebanks 90626c843d oops - we don't need reference bases, but we still need reference
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1653 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 16:24:45 +00:00
ebanks 2b2df4e1ba - Fix the CleanedReadInjector to deal with -L intervals correctly.
- Some walkers don't use the ref base, so speed up traversals by not requiring it


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1652 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 16:17:58 +00:00
asivache 94618044e8 Starting an update of ROD system. These basic classes will completely replace old ones, but with this update they are not linked to anything, so this checkpoint should be safe.
The main reason for the change is that there can be (and are!) multiple RODs overlapping with a single reference base position in a single track. There can be two "trivial" RODs at the same location (e.g. samtools pileup will have two point-like records at putative indel sites: one for the reference, the other one for the indel itself). Or there can be one or more "extended" RODs (length >1), eg. dbSNP can report an indel at Z:510-525 AND a SNP at Z:515.

The ReferenceOrderedDatum object (and children) will not be changed, but it is now explicitly interpreted as a single data *record*, possibly out of many available from a given track for the current site. As long as single data record occupies one line in a data file, the new ROD system will take care of loading and keeping multiple records, including extended (length > 1) ones, and will automatically drop the records when they finally go out of scope. For one-line-per-record, multiple-records-per-site RODs, there is no need anymore for the hack used so far that involved passing ROD's own implementation of iterator through reflection mechanism (though it will still work)

* RODRecordList: 
the ROD system (its iterators) will now always return a LIST of all RODs available at current position or at current query interval (see below). This class is a trivial wrapper for a list of ROD objects, with added location argument for the whole collection. The location of the RODRecordList is where the ROD system is currently sitting at: a single, current base on the reference (if next() traversal is performed), or the location of the query interval when returned by seekForward() (see below). The ROD objects themselves will have their locations set according to the original data in the file. Hence, perusing the above example of a dbSNP indel at Z:510-525 and SNP at Z:515, when moving to the position Z:515 the ROD system will return a RODRecorList with location Z:515, and with two ROD objects packaged inside, one with location Z:510-525, the other with Z:515.

*RODRecodIterator:
Almost identical to old SimpleRODIterator used by ReferenceOrderedData; this is a low-level iterator that walks over records in the data file (with a callback to ROD's ::parseLine() to parse real data)

*SeekableRODIterator:
a decorator class that wraps around Iterator<ROD> (such as RODRecordIterator) and makes the data traversable by reference position, rather than record by record. This is reimplementation of the old RODIterator.  SeekableRODIterator's ::next() moves to the next position on the ref and returns all RODs overlapping with that position (as a RODRecordList). This iterator also adds a seekForward(loc) operation, that allows fast forwarding to a specified position or interval. Length > 1 query arguments (extended intervals) are fully supported by seekForward(), the returned RODRecordList wil contain all RODs overlapping with the specified interval, and the location of the returned RODRecordList object will be set to that query interval. NOTE: it is ILLEGAL to perform next() after a seekForward() query with length > 1 interval. seekForward() with point-like (length=1) interval reenables next().


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1650 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 15:58:37 +00:00
hanna 355136928e Play nice with other jobs in this VM -- don't close stdout / stderr.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1646 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-17 18:55:08 +00:00
ebanks 5d85bd9671 By default, VF should ask for deleted bases so that they show up in coverage.
The Strand filter then needs to ignore those bases when determining bias.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1636 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-16 16:46:09 +00:00
hanna 01a9b1c63b Fix for problem where err stream remapped to output stream in certain cases, (hopefully) completing Matt's hat trick of fail. Thanks, unit tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1634 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-16 08:33:56 +00:00
hanna 9f7cf73411 Output stream management fixes. I completely screwed up the output stream management system, but cleverly masked this fact by breaking some other stream management functionality that masked the problem.
Sigh.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1630 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 21:06:45 +00:00
hanna 17758b381c Properly initialize redirected output streams in case of out and err.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1629 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 19:47:43 +00:00
andrewk 00dfe014b7 Added option to FastaReferenceWalker to change output FASTA file format's line width and to remove header lines; allows dumping raw sequence using intervals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1628 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 18:00:30 +00:00
hanna b69eb208a6 Always create output files, even if no output was written to them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1627 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 17:58:14 +00:00
aaron b401929e41 incremental clean-up and changes for VariantEval, moved DiploidGenotype to a better home, and fixed a spelling error.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1624 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 04:48:42 +00:00
ebanks 01e7b39c8d 1. Don't print out values in filter field of the VCF.
2. Fix ratio printouts (for params file)
3. Rename ratio filter's get counts method to avoid confusion; more changes on the way this week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1616 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-14 21:03:39 +00:00
ebanks 436f543b3b I owe Doug a beer for finding this:
don't print out intervals to be merged if they're not within the global -L intervals


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1615 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-14 20:22:30 +00:00
aaron e03fccb223 Changes to switch Variant Eval over to the new Variation system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1611 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-14 05:34:33 +00:00
aaron 5b41ef5f70 rod DBSNP had a bug where the reference wasn't calculated correctly under certain conditions. Fixed getRefBasesFWD and getRefSnpFWD so that they were more in line with getAltBasesFWD and getAltSnpFWD. Also updated Variant Eval tests to reflect this change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1609 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-13 23:48:58 +00:00
ebanks c669e8d5ad Use constant seed in the random generator so we can be stable (and thus unit tests will work)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1607 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-13 17:40:56 +00:00
depristo 6c7a300664 Missing file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1601 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:17:09 +00:00
depristo 6e13a36059 Framework for ROD walkers -- totally experiment and not working right now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1600 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:13:15 +00:00
depristo e8d544869d Alignment context now supports the idea of skipped bases -- not currently in use
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1598 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:11:38 +00:00
depristo 3949b4ac72 commented out version of next() and hasNext() that appear to be correct but are causing testing problems
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1596 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:09:21 +00:00
depristo 58105636c8 getBoundRods() convenience method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1595 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:07:57 +00:00
depristo 4e1eded389 Fixed bad compareTo operator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1594 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:07:10 +00:00
depristo 7c8b17b456 fix for SSG with pl name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1591 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-11 20:39:34 +00:00
chartl d6a0b65ac9 Changes:
Rollback of Variant-related changes of r1585, additional PGC code




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1586 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-11 16:23:01 +00:00
chartl 0c54aba92a Changes:
@VariantEvalWalker - added a command line option to input a file path to a pooled call file for pooled genotype concordance checking. This string is to be passed to the PooledGenotypeConcordance object.

@AllelicVariant - added a method isPooled() to distinguish pooled AllelicVariants from unpooled ones.

@ all the rest - implemented isPooled(); for everything other than PooledEMSNProd it simply returns false, for PooledEMSNProd it returns true.

Added:

@PooledGenotypeConcordance - takes in a filepath to a pool file with the names of hapmap individuals for concordance checking with pooled calls
 and does said concordance checking over all pools. Commented out as all the methods are as yet unwritten.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1585 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-11 15:01:50 +00:00
aaron 5a64a80ab5 changes to the variation class, updates to SSG, updated tests based on changes to the SSGenotypeCall, and added the ability to run a single integration test from using the build script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1577 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-10 04:31:33 +00:00
depristo c988205884 Notes for Aaron in SSG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1576 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-10 03:18:51 +00:00
ebanks 1362a56227 Added fasta tests and small fix to cleaner test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1575 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-10 03:13:11 +00:00
depristo 0093482c62 N reference base fix for SSG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1572 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 21:19:36 +00:00
ebanks cb31d5a0ab VariantFiltration now outputs VCF. Important changes:
1. VariantsToVCF can now be called statically to output VCF for a single ROD instance; this is temporary until we have a VCF ROD.
2. VariantFiltration now outputs only 2 files, both mandatory: all variants that pass filters in geli text, and all variants in VCF.
If there are any problems, go find Aaron.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1569 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 20:04:32 +00:00
asivache dd0085c428 1) now is tolerant to sloppy cigar strings with 0-length elements (at the price of extra recursive call)
2) when reads with deletions are requested, adds to the pile just those: reads with 'D' over the current reference base, but not 'N'
3) next() now implements a loop: recursive forward iteration calls to next() until ref. position with non-zero coverage is encountered were OK for (short) deletions, but with long stretches of N's they end up with stack overflow

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1568 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 20:04:04 +00:00
ebanks 542af6402e output correct format for Sequenom SNPs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1567 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 19:21:53 +00:00
kiran 3b1e966b4c Lowercases the sequencing platform so that a difference in case doesn't lead to the failure to look up an entry in the hash.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1565 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:35:45 +00:00
kiran d82d6c0665 Excludes variants that fall below a certain LOD that changes as a function of depth.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1564 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:34:16 +00:00
kiran 06eae52292 Throws an exception if you attempt to use a filter that doesn't exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1563 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:33:27 +00:00
asivache 1060b36288 Bug fix: 'N' cigar elements now treated properly; for all practical intents and purposes, N is the same as D and should be treated as such, the difference is only in logical interpretation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1562 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:08:35 +00:00
chartl 9d69bd2c84 Modifications:
@CoverageAndPowerWalker - removed a hanging colon that was being printed after the reference position

@VariantEvalWalker - added a command line argument for pool size for eventual use in doing pooled caller evaluations. As now, the variable is unused.

@AlignmentContext - altered the scope of class variables from private to protected in order that child objects might have access to them


New Additions:

Filtered Contexts

Sometimes we want to filter or partition reads by some aspect (quality score, read direction, current base, whatever) and use only those reads as
part of the alignment context. Prior to this I've been doing the split externally and creating a new AlignmentContext object. This new approach makes
it a bit easier, as each of these objects are children of AlignmentContext, and can be instantiated from a "raw" AlignmentContext.

@FilteredAlignmentContext is an abstract class that defines the behavior. The abstract method 'filter' is called on the input AlignmentContext, filtering
those reads and offsets by whatever you can think of. The filtered reads/offsets are then maintained in the reads and offsets fields. These classes can
be passed around as AlignmentContexts themselves. Writing a new kind of read-filtered alignment context boils down to implementing the filter method.

@ReverseReadsContext - a FilteredAlignmentContext that takes only reads in the reverse direction

@ForwardReadsContext - a FilteredAlignmentContext that takes only reads in the forward direction

@QualityScoreThresholdContext - a FilteredAlignmentContext that takes only reads above a given quality score threshold (defaults to 22 if none provided).

A unit test bamfile and associated unit tests for these are in the works.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1559 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 15:49:52 +00:00
depristo d9588e6083 bug fixes to LIBS and LIBH following ultra-aggressive regression testing across 454, solid, and solexa
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1558 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 15:36:12 +00:00
asivache df11618092 Set default value of useLocusIteratorByHanger to FALSE. Otherwise the -LIBH flag is useless and there'd be no wayto "unset" the 'true' value. Old version was (always) using LocusIteratorByHanger. Now default iterator is indeed LocusIteratorByState, and -LIBH will switch back to the old one
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1556 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 15:09:09 +00:00
depristo eeb9b6eb13 GenotypeLikelhoods now support a cache per subclass, avoiding genotyping clashes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1554 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 10:39:14 +00:00
ebanks 0cc219c0df -Added unit test for walkers dealing with intervals for cleaning
-I also uncovered a corner case in the cleaner that for some reason was commented out but shouldn't have been.  Hooray for unit tests!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1553 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 02:35:17 +00:00
depristo ec0f6f23c7 LocusIterationByState is now the system deafult. Fixed Aaron's build problem
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1552 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 01:28:05 +00:00
ebanks 5dbba6711c Lots of changes: (I'll send email out in a sec)
1) Moved various disparate concordance / set splitting functionalities to a new parent tool which works like VariantFiltration (i.e. people can write various modules that fit inside and can be run though it).
2) Fixed up argument parsing in VariantFiltration to use key=value format so we don't accidentally mox up values (like I had been doing).
3) Have indel rod print samples


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1540 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-07 01:12:09 +00:00
depristo 1c3d67f0f3 Improvements to the CountCovariates and TableRecablirator, as well as regression tests for SLX and 454 data
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1539 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 22:26:57 +00:00
depristo 2b0d1c52b2 General WalkerTest framework. Includes some minor changes to GATK core to enable creation of true command-line like GATK modules in the code. Extensive first-pass tests for SSG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1538 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 19:13:37 +00:00
aaron 0cc634ed5d -Renamed rodVariants to RodGeliText
-Remove KGenomesSNPROD
-Remove rodFLT
-Renamed rodGFF to RodGenotypeChipAsGFF
-Fixed a problem in SSGenotypeCall
-Added basic SSGenotype Test class
-Make VCFHeader constructors public

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1536 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 18:40:43 +00:00
ebanks fd1c72c151 Fixed package name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1535 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 15:40:06 +00:00
ebanks 6c476514f8 Moved to core. Wiki pages are going up; unit tests will be written soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1533 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 15:09:11 +00:00
ebanks 849dce799d This rod was all wrong for generating the alternate snp alleles (it returned null or even the wrong value); fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1531 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 14:21:46 +00:00
depristo a08c68362e Renaming error to getNegLog10PError(); added Cached clearing method to GL; SSG now has a CallResult that counts calls; No more Adding class to System.out, now to logger.info; First major testing piece (and general approach too) to unit testing of a walker -- SingleSampleGenotyper now knows how many calls to make on a particular 1mb region on NA12878 for each call type and counts the number of calls *AND* the compares the geli MD5 sum to the expected one!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1530 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 12:39:06 +00:00
aaron 3c2ae55859 changes for the genotype overhaul. Lots of changes focusing on the output side, from single sample genotyper to the output file formats like GLF and geli. Of note the genotype formats are still emitting posteriors as likelihoods; this is the way we've been doing it but it may change soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1529 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-04 05:31:15 +00:00
ebanks 5bd99fc1c4 VariantFiltration moved to core.
Another win for the team.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1517 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 15:41:41 +00:00
depristo bdd0a6f9fa change to make build work
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1511 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 13:43:10 +00:00
depristo b01ac9de0c High performance LocusIterator implementation. Now with greatly reduced memory impact and 2x (and more potentially) speed ups of raw locus iteration. General performance improvements to SSG with empirical probs. You can enable high-performance locus iteration with the -LIBS arg. It's still testing but passes validing pileup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1510 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 03:06:25 +00:00
ebanks 3dfc77dc89 Add an indel rod which represents the initial point of the indel only
(useful for alternate reference making)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1507 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 19:32:29 +00:00
aaron 0e6feff8f2 fixed locus pile-up limiting problem
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1505 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 16:56:44 +00:00
aaron 05c164ec69 changing the default behavior to allow any sized read pile-up (which may exceed the memory limit); the user can then select their own read limit. The default of 100K was arbitrary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1498 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 14:46:00 +00:00
ebanks 54c0b6c430 Allow this ROD to consist of just the positions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1497 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 12:43:18 +00:00
aaron 4a1d79cd7b added a flag, maximum_reads_at_locus, shortName "mrl", which limits the number of reads we add to the locusByHanger. In some bam files misalignment produces pile-ups of 750K or more reads. We now limit this to the default of 100K reads.
The user is warned if a locus exceeds this threshold, and no more reads are added.

Also CombineDup walker had an incorrect package name.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1496 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 04:21:58 +00:00
ebanks 0addae967a IndelArtifact filter can now handle filtering false SNPs that occur within the span of an indel but after the first position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1495 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 03:34:39 +00:00
ebanks 8e3c3324fa Added filter for SNPs cleaned out by the realigner.
It uses the realigner output for filtering; in addition, dbsnp indels partially work; IndelGenotyper calls don't yet work.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1489 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 04:32:32 +00:00
ebanks 8bc7afe781 Smarter SW penalties
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1488 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 04:29:19 +00:00
ebanks 1a299dd459 Require each filter or feature to declare whether or not they want mapping quality zero reads in the alignment context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1486 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:31:37 +00:00
ebanks 215e908a11 Reworking of the VariantFiltration system to allow for a windowed view of variants and inclusion of more data to the various filters.
This now allows us to incorporate both the clustered SNP filter and a SNP-near-indels filter, which otherwise wasn't possible.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1484 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 02:16:39 +00:00
depristo 813a4e838f Removing old code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1482 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 19:27:11 +00:00
depristo 49a7babb2c Better organization of Genotype likelihood calculations. NewHotness is now just GenotypeLikelihoods. There are 1, 3, and empirical base error models available as subclasses, along with a simple way to make this (see the factory).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1481 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 19:16:30 +00:00
depristo 522e4a77ae Caching support across multiple technologies
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1480 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 18:10:14 +00:00
depristo 5af4bb628b Intermediate checking before code reorganization. Full blown support for empirical transition probs in SSG for all platforms. Support for defaultPlatform arg in SSG. Renaming classes for final cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1479 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 17:34:43 +00:00
depristo bde67428fd Better formatting of the code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1477 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-29 21:46:47 +00:00
aaron 8331c195fb changed the full name of maximum_reads to maximum_iterations for consistancy
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1475 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 16:03:46 +00:00
depristo 8e129d76fd Support for original quality scores OQ flag. pQ flag in TableRecalibation to preserve quality scores below a threshold (defaulting to 5)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1474 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 14:14:21 +00:00
depristo bf60980653 Experitmental support for empirical P(B_true | B_miscall). --useEmpiricalTransitions flag to SSG enables this support. Much better implementation of Genotype likelihoods -- the system should scream along now. Continuing progress towards deleting old model
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1469 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:17:24 +00:00
depristo 7cf9a54b64 change for new char/byte in BaseUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1467 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 23:47:56 +00:00
hanna e5115409fa Force columnSpacing to be at least one. We need a general-purpose, working tool for outputting columnar data to a PrintStream; will add JIRA.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1457 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 19:54:54 +00:00
hanna ccdb4a0313 General-purpose management of output streams.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1454 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-23 00:56:02 +00:00
aaron cd711d7697 Added detection of interval files with zero length to the GATK, and removed it from the interval merger walker: this was a critical blocking emergency issue for Eric.
also fixed some verbage in the GAEngine.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1449 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-21 05:35:49 +00:00
aaron 6313c465fb we want the RMS of the reads qualities not the RMS of the RMS of the read qualities.
Also the VCF version tag seems to be standardized as VCR.  Updated the VCF code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1447 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 21:56:29 +00:00
ebanks ed8c92a12a make isReference do the right thing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1439 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-19 20:32:29 +00:00
ebanks b3fe566c0c Fix descriptions of walker args
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1436 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 19:46:48 +00:00
ebanks 53153fcd79 Allow RODs to specify that incomplete records are okay (i.e. that they allow optional fields)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1433 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 15:26:10 +00:00
ebanks b2a18a9d61 - first pass at a basic indel filter (for now, based on size and homopolymer runs)
- fix simple indel rod printout


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1431 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 03:04:12 +00:00
jmaguire 1e8b97b560 quietly skip empty intervals files rather than crash.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1428 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-17 20:19:14 +00:00
jmaguire 92c63fb530 It's just "lod" not discovery_lod now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1427 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-17 18:44:09 +00:00
ebanks df5744bcd3 update this walker so any variants can be passed in
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1426 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-17 16:30:39 +00:00
aaron d101c20b30 added the ability to pass in a csv file of ROD triplets (one triplet per line) to the -B option
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1412 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-11 22:10:20 +00:00
ebanks 2c3f56cb8d fix length calculation (it was including +/- char when it shouldn't)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1410 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-11 20:28:24 +00:00
aaron fc1c76f1d2 fixing a bug where reads in overlapping interval based locus traversals could get assigned to only one of two the regions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1407 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-11 17:50:16 +00:00
ebanks ecae619a1b warn user when dbSNP rod looks suspicious
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1400 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-10 20:20:20 +00:00
asivache 2841e151d0 javadoc comments only
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1399 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-10 18:44:35 +00:00
ebanks 02f1af0743 Don't die when a readgroup is absent from the covariates table - it could
happen when all reads are unmapped (or have MQ0); instead, just don't alter
the quals.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1394 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-10 03:10:33 +00:00
depristo 6d3ef73868 Now includes statistics on the allele agreement with dbSNP -- counts concordant calls as dbSNP = A/C and we say A/C, vs. we say A/T
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1392 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-07 19:37:07 +00:00
depristo 20baa80751 Updated polarized reference priors, need DiploidGenotypePriors class that is directly used by the NewHotness genotypelikelihoods, more bug fixes and refactoring, etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1391 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-07 19:01:04 +00:00
depristo bbd7bec5db Continuing cleanup of SSG. GenotypeLikelihoods now have extensive testing routines. DiploidGenotype supports het, homref, etc calculations. SSG has been cleaned up to remove old garbage functionality. Also now supports output to standard output by simply omitting varout
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1387 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-05 22:25:30 +00:00
hanna 48713e154c Windowed access to the reference.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1383 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-05 16:29:15 +00:00
depristo 65e9dcf5b7 Fully operational version of the new genotype likelihoods class. (1) Much cleaner interface. Now explicitly stores likelihoods, priors, and posteriors in separate arrays indexed by an enum, (2) no longer can be used to make calls, it relies on SSGGenotypeCall to order the likelihoods, calculate best to ref, etc, this is just for calculating genotype likelihoods now; (3) Now performs extensive error checking with validate() to ensure the system is behaving properly. (4) fixed incorrect treatment of N bases, which we being counted against everyone (5) likely found a stats bug in which heterozyosity was being applied incorrectly to the genotype priors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1382 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-05 01:00:55 +00:00
depristo 4dc23f2763 Trivial formatting changes as I moved more legacy code into this system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1381 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-05 00:54:26 +00:00
depristo 34af669dbb Explicit ENUM representation of the diploid genotypes. Please use this from now on to represent strings like AA or AT
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1380 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-05 00:53:43 +00:00
hanna 21d1eba502 Cleaned division of responsibilities between arguments to map function. Reference has been changed
from an array of bases to an object (ReferenceContext), and LocusContext has been renamed to reflect
the fact that it contains contextual information only about the alignments, not the locus in general.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1376 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-04 21:01:37 +00:00
depristo 20ff603339 New hotness and old and Busted genotype likelihood objects are now in the code base as I work towards a bug-free SSG along with a cleaner interface to the genotype likelihood object
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1372 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-03 23:07:53 +00:00
depristo 4986b2abd6 Fixing bug in SSG -- genotyping and discovery were mixed up by name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1371 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-03 22:13:35 +00:00
depristo 3485397483 Reorganization of the genotyping system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1370 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-03 20:55:31 +00:00
depristo d840a47b11 Slight reorganization of genotype interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1366 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-03 19:17:15 +00:00
ebanks 4366ce16e0 Made sure all RODs have a (good) toString() method - and use it in the Venn walker. (thanks, Mark)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1339 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 14:53:27 +00:00
ebanks feb7238f10 Wasn't always returning the correct alt base
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1337 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-30 03:08:04 +00:00
hanna 5429b4d4a8 A bit of reorganization to help with more flexible output streams. Pushed construction of data
sources and post-construction validation back into the GATKEngine, leaving the MicroScheduler
to just microschedule.  


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1336 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 23:00:15 +00:00
hanna 7a13647c35 Support for specifying SAMFileReaders and SAMFileWriters as @Arguments directly. *Very*
rough initial implementation, but should provide enough support so that people can stop
creating SAMFileWriters in reduceInit.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1332 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-29 16:11:45 +00:00
ebanks 3c4410f104 -add basic indel metrics to variant eval
-variants need a length method (can't assume it's a SNP)!


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1324 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-28 03:25:03 +00:00
aaron f1109e9070 Added the interator to SAMDataSource to prevent seeing dupplicate reads, only in a byReads traversal. The iterator discards any reads in the current interval that would have been seen in the previous interval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1317 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-25 22:36:29 +00:00
asivache a361e7b342 SAMDataSource is now exposed by GATK engine; SamFileHeaderMerger is exposed from Resources all the way up to SAMDataSource, so now we can see underlying individual readers should we need them; GATK engine has new methods getSamplesByReaders(), getLibrariesByReaders(), and getMergedReadGroupsByReaders(): each of these methods returns a list of sets, with each element (set) holding, respectively, samples, libraries, or (merged) read groups coming from an individual input bam file (so now when using multiple -I options we can still find out which of the input bams each read comes from)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1315 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 22:59:49 +00:00
hanna 2024fb3e32 Better division of responsibilities between sources and type descriptors.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1314 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 22:15:57 +00:00
ebanks 59f0c00d77 -set indel cleaning walkers to be in core package
-move Andrey's alignment utility classes to core


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1307 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-24 05:23:29 +00:00
aaron 0b16253db3 an iterator to fix the problem where read-based interval traversals are getting duplicate reads because reads span the two intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1305 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 23:59:48 +00:00
ebanks 477502338f moved major indel cleaning pieces to core (yippee!)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1301 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 19:59:51 +00:00
ebanks 4efe26c59a Major: allow genotyper to optionally output in 1KG format, including outputting the samples in which indels are found.
Minor: refactor 454 filtering


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1300 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 19:53:51 +00:00
ebanks ee8ed534e0 print full genotype for alt allele
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1297 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-23 01:35:23 +00:00
depristo 9c12c02768 AlleleBalance and on/off primary base filters -- version 0.0.1 -- for experimental use only
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1294 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-22 17:54:44 +00:00
hanna 6e4fd8db4a Better formatting of available walkers, and only output them along with help. Cleanup JVMUtils.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1290 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 22:23:28 +00:00
depristo 761d70faa1 Better printing of multiple rods -- now produces a comma-separated set of values
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1289 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 21:58:27 +00:00
depristo 8588f75eb6 Better printing with toSimpleString() -- now prints out chip-genotype string
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1288 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 21:57:59 +00:00
hanna 1843684cd2 Cleanup: GATKEngine no longer needs to be lazy loaded, b/c the plugin directory no longer exists.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1287 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 18:50:51 +00:00
hanna b43925c01e Switched to Reflections (http://code.google.com/p/reflections/) project for
inspecting the source tree and loading walkers, rather than trying to roll
our own by hand.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1286 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 18:32:22 +00:00
kiran 436a196e2b Bug fixes to support hapmap genotyping concordance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1285 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 16:20:10 +00:00
aaron f13a1e8591 adding a couple of small changes to support contract with VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1283 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 03:49:15 +00:00
aaron b4adb5133a GLF rod as a AllelicVariant object.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1282 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-21 00:55:52 +00:00
ebanks 54fce98056 duh, don't print newline
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1280 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-20 03:04:27 +00:00
ebanks 1d2b545608 add FLT toString method (to be used in PrintRODs) and add it to ROD list
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1279 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-20 02:47:50 +00:00
ebanks 387316ebe1 added indel rod
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1276 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-17 16:05:51 +00:00
ebanks da4af3b620 print indels in the format required for 1KG submissions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1275 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-17 15:59:18 +00:00
ebanks d45c90b166 ROD to represent simple output from IndelGenotyper
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1274 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-17 14:36:12 +00:00
hanna df1c61e049 Re-add the plugin path.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1271 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-16 22:48:44 +00:00
hanna 7c30c30d26 Cleaned up some duplicate code in preparation for making plugin dir configurable.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1270 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-16 22:02:21 +00:00
depristo 107f42a01e Hacks for getting GLFs support in the Rod system working
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1268 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-16 21:03:47 +00:00
ebanks 88ffb08af4 Need to return real values for some of the AllelicVariant methods
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1264 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-16 02:31:10 +00:00
ebanks ba349e8d52 add FLT ROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1257 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 19:40:50 +00:00
ebanks 800f7e6360 make AllelicVariant extend ReferenceOrderedDatum (not Comparable) since ROD itself is Comparable. Then we can generalize RMD tags.
Blame Matt if this doesn't work - he said it wouldn't break anything.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1256 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 19:25:06 +00:00
ebanks 5be5e1d45f added conversion from iupac format and new rod to deal with FLT file format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1254 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 18:34:41 +00:00
aaron d36e232ed3 adding GLF rods to the module list
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1252 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 15:42:34 +00:00
aaron 9ecb3e0015 adding GLFRods with tests and some other code changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1251 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 15:30:19 +00:00
hanna c25f84a01c Regression: we lost our hack to work around BAM files with index problems (affects BAM files created before 23 Apr 2009 and traversed by interval). Added the hack back in, along with a much more explicit comment about why its there.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1248 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 14:41:37 +00:00
ebanks 513d43b5f3 now implements AllelicVariant
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1246 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 14:06:25 +00:00
ebanks d369136bda depricate this ROD yet again
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1245 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-15 13:33:03 +00:00
ebanks efcbb16688 un-deprecate this ROD and make it implement Genotype
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1240 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 19:45:41 +00:00
depristo 84d407ff3f Fixing odd merge problem with VariantEval -- better cluster analysis (no cumsum), rodVariant is now an AllelicVariant
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1239 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 18:53:27 +00:00
hanna 76b09a879b Display a more intelligent error message if the user runs a locus traversal across an unmapped reads file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1238 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 18:36:09 +00:00
hanna 99f9cd84ed Warning for possibly mismatched reads / reference was very aggressive. Relax
the criteria a bit.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1234 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 16:21:22 +00:00
hanna 12b5d9c70c The number of loci can easily overflow an int. Change reduce type to a Long.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1233 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 16:07:00 +00:00
depristo 5bf7647498 0.2.3 -- now preserves Q0 bases throughout the reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1232 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 12:27:31 +00:00
hanna 0f6bfaaf73 Skip validation in case of no reads aligning.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1230 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 02:03:36 +00:00
hanna bfe90af5e2 Some quick and dirty fixes to support querying unmapped BAM files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1228 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-14 01:25:20 +00:00
hanna 9f0fb9f3aa Fix for GSA-90: GATK banner and error messages should point to the wiki website.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1226 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-13 21:56:41 +00:00
hanna b18caa2052 Fix for GSA-90: System isn't failing with an error when you use the wrong reference.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1225 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-13 20:42:12 +00:00
hanna 5c321f9630 Oops! Accidentally deactivated the ArgumentFactory, needed by the CleanedReadInjector, while refactoring last night.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1223 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-13 16:41:55 +00:00
ebanks 0070b8ea6a Until 454 goes far, far away, at least we can completely ignore it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1219 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-10 18:31:53 +00:00
asivache b08b121756 synchronyzing; debug statements commented out, so nothing changed really
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1215 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-10 16:38:33 +00:00
asivache a1eb128377 few more detailed debug printouts conditioned on if (DEBUG), so no real changes...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1214 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-10 16:36:57 +00:00
hanna 03e1713988 Better support for specifying read filters to apply directly from the walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1212 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 23:59:53 +00:00
aaron ce08f5f0c3 Removed some unused variables, fixed some javadoc. The usual.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1211 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 22:10:22 +00:00
aaron 9cfd89c54f a small refactoring, and some documentation cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1210 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 22:03:45 +00:00
aaron d86717db93 Refactoring of the traversal engine base class, I removed a lot of old code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1209 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 21:57:00 +00:00
ebanks 3519323156 Output the correct geli text format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1208 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 19:45:18 +00:00
ebanks 99631cdaa1 fix and then deprecate the rodGELI class (GELIs suck)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1207 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 19:18:13 +00:00
hanna 5e26770634 Hack the MicroScheduler to be tolerant of RefWalkers. We need to implement a longer-term solution to make it easier for datasources to report problems they've encountered along the way (GSA-103).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1205 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 17:26:59 +00:00
ebanks 3fe7104963 Added walker to filter out clustered SNPs from a call set
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1203 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-09 03:16:27 +00:00
hanna 3f0304de5a Get rid of unused iterator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1200 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-08 20:39:16 +00:00
hanna da4d26b1ea Enum support for command-line argument system, and some cleanup for hacks to the CleanedReadInjector that were required because Enum support was missing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1199 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-08 20:26:16 +00:00
ebanks aacec3aeb0 rod for binary GELI files (still needs to be tested)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1198 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-08 20:25:56 +00:00
hanna 433ad1f060 Cleanup...deprecate FastaSequenceFile2 in favor of IndexedFastaSequenceFile or ReferenceSequenceFile from Picard, depending on the application.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1196 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-08 18:49:08 +00:00
hanna d8fbb2b62c Refactoring; make a better home for the MalformedReadFilteringIterator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1194 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-08 16:54:20 +00:00
hanna 4ba2194b5e Filter reads whose alignment starts past the end of the contig to which it allegedly aligns.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1188 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-07 22:27:44 +00:00
hanna 5d7393d7cb Temporary fix for Eric's problems with SOLiD reads: make sure the command-line argument system takes the --validation-strictness command-line argument into account when creating SAMFileReaders.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1183 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-07 15:18:05 +00:00
hanna 5735c87581 Basic infrastructure for filtering malformed reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1178 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-06 22:50:22 +00:00
hanna 31313481f6 Temporary patch to filter out bad alignments that aren't quite fully reported as bad.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1176 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-06 18:41:55 +00:00
hanna d19366eaad Cleanup emergency fixes for out-of-bounds issues in reference retrieval. Fix spelling mistakes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1173 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-06 15:41:30 +00:00
jmaguire 4019cd2bd7 Added ROD for parsing hapmap3 genotype files.
Tweak to TabularROD to allow HapMapGenotypeROD to work.
Added HapMapGenotypeROD to list of RODs in ReferenceOrderedData.java.
Modified MultiSampleCaller to return a single object with most of the relvant information.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1169 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-05 16:28:24 +00:00
ebanks e5e249d4ac temporary fix to deal with screwy SOLiD reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1168 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-05 03:25:57 +00:00
depristo cf1854b339 Fix for monsterous problems with solid data -- now can dynamically expand recalibration tables on the fly as reads declare additional read groups -- use assumeFaultyHeader flag
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1167 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-03 17:15:49 +00:00
depristo bcda66d2db Simple performance improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1166 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-03 16:45:23 +00:00
hanna 0d00823332 Fix for performance bug in extending the read with X's in cases where the read is aligned off the end of the contig.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1165 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-03 16:17:38 +00:00
hanna 62807139fc Cleanup pileup and depth of coverage in preparation for release. Add pileup, depth of coverage, and print reads to package for distribution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1159 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-02 14:54:01 +00:00
aaron 1c83b4d949 forgot to take out some test code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1157 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-02 14:18:37 +00:00
aaron bc17ff567a When you get the reference string for a read that is mapped partially off the end of a contig, the string is masked with X's for base positions without corresponding reference positions. Now with a test case!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1156 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-02 14:15:50 +00:00
depristo 47cb9f169e Stable tool that's the reverse of merging -- splits a file into individual BAM files, one for each sample ID in the SAM header
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1155 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-02 12:56:46 +00:00
aaron bb92eb8b1c added a fix for overlapping reads in the locus context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1153 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-02 02:08:59 +00:00
hanna 9b182e3063 Prep for documenting command-line arguments: delete some arguments that don't make sense any more given
the state of the traversals and GATK input requirements: all_loci (replaced by walker annotation), max
OTF sorts (bam files must be sorted and indexed), threaded io (replaced by data sharding framework).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1144 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-01 18:23:35 +00:00
aaron d58eeb7539 Don't cry wolf: only one warning is now emitted, instead of tons of warnings.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1139 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-01 13:50:37 +00:00
hanna a3e0ec20c4 Kill the TraverseByLocusWindows traversal. TraverseLocusWindows will take its place.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1138 348d0f76-0448-11de-a6fe-93d51630548a
2009-07-01 13:46:35 +00:00
hanna e93f751bd7 First step in replacing the Hello, World! document. Revamped the HelloWalker and checked it into the source tree, created a special build file for it, and added it to the packaging tool.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1135 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-30 21:59:54 +00:00
aaron f5cba5a6bb Fixed genome loc to be immutable, the only way to now change it's values is through the GenomeLocParser.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1132 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-30 19:17:24 +00:00
depristo 9fca79ed62 Read groups are now sorted in the output data, for convenience
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1129 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-30 16:50:44 +00:00
aaron 03f8177a53 When you get the reference string for a read that is mapped partially off the end of a contig, the string is masked with X's for base positions without corresponding reference positions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1121 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-29 20:51:55 +00:00
depristo 7ecc43e9a7 Fixed subtle null ptr exception discovered by Kiran. Now deals with the rare situation where you have only say Q28 bases at dbSNP sites, so you fail in the Table recalibration step with a null pointer error into the data structure indexed by quality score. If you are Q score above those seen before you aren't modified in any way.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1118 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-29 18:57:42 +00:00
ebanks 95e2ae0171 Deal with reads whose ends are aligned off the end of a chromosome.
Includes update to ignore non-ATCG bases (not just 'N')
(Also, create a BWA dir for future work)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1117 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-29 16:50:05 +00:00
jmaguire 65a788f18a Added a ROD (SangerSNP) for parsing the Sanger's chr20 pilot1 SNP calls.
Some doodling around with indel calling in an EM context.
 



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1116 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-29 16:32:12 +00:00
aaron d7d4298917 Some files to support generic genotype outputing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1112 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-26 15:43:41 +00:00
hanna 491ed70b44 TraverseByLocusWindow -- asstd bug fixes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1109 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 22:51:38 +00:00
depristo 5289230eb8 Version 0.2.1 (released) of the TableRecalibrator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1108 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 22:50:55 +00:00
hanna ad3a3aa350 First pass at passing lists of files / lists of interval arguments work. Note that the interval
ROD system will throw up its hands and not deal with intervals at all if multiple interval files 
are passed in (see JIRA GSA-95). 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1105 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 20:44:23 +00:00
ebanks 83816fb801 Stop using the annoying refIterator (temp change until new traversal is green lighted)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1103 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 20:05:39 +00:00
ebanks 0d9041380d remove printouts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1100 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 18:54:14 +00:00
hanna 102b38c055 Sketch of new version of TraverseByLocusWindow, and a flag to conditionally turn it on.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1097 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 18:20:56 +00:00
aaron 5b1c23a7f2 changes to fix and test the interval based traversals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1095 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 17:54:15 +00:00
ebanks 347608cfe0 remove hacked traversal in preparation for move to Matt's new one
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1091 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-25 14:32:05 +00:00
depristo 0a50f2e160 Updated and near final version of tabular recalibration system. Uses 'yates' correction for low-occupancy quality bins. Faster and more robust handling of input and output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1082 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-24 03:52:12 +00:00
hanna ef546868bf Pooling of unmapped reads -- improves runtime of files with tons of unmapped reads by an order of magnitude.
Desperately needs cleanup.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1080 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-23 23:48:06 +00:00
aaron 8b4d0412ca Changed the duplicate traversal over to the new style of traversal and plumbed into the genome analysis engine. Also added a CountDuplicates walker, to validate the engine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1072 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-22 21:11:18 +00:00
aaron bcb64d92e9 Aaron: 1, GenomeLoc: 0. I changed our GenomeLoc class, seperating the creation of a genome loc (with the reference setup) to a parser class. GenomeLoc now just represents the actual genomic postion. The constructors are now package-protected (to enforce using the parser), but we may want to expose some constructors in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1069 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-22 14:39:41 +00:00
depristo 26eb362f52 Added novel / known split to variant eval. That is, emits all of the standard analyses on SNP partitioned into those known in the provided known db and those novel. Also fixed problem with counting bases within subsets
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1068 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-21 21:27:40 +00:00
depristo d3f0c51944 longer update times so we don't overwhelm when running genome-wide
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1067 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-21 14:10:02 +00:00
depristo 9e26550b0d Apprach v2. Added python analysis script, so java no longer must be used to analyses quality score data. About to refactor out lots of unneeded code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1063 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-20 16:00:23 +00:00
hanna dde52e33eb Cleanup of the cleaned read injector based on Eric's feedback.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1062 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-19 22:04:47 +00:00
depristo 8ac40e8e2d Updated version of the recalibration tool
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1060 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-19 17:45:47 +00:00
depristo d748c85dc4 Cleaned code and reorganized -- moving in the right direction for v2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1052 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-18 22:28:34 +00:00
depristo 1bca144119 Moving things around
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1049 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-18 21:06:46 +00:00
depristo 3c40db260d Added REFERENCE_BASES required annotation for performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1047 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-18 21:03:57 +00:00
kiran 7a921c908c Can now adjust the genotype likelihoods of a variant returned from the rod. This automatically causes the lodBtr, lodBtnb, and genotype to be recomputed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1041 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-18 07:26:37 +00:00
kiran c4d9058f32 Added module rodVariants.class to the list of allowable RODs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1037 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-17 21:33:13 +00:00
kiran ab2a80f3ea A new ROD type that allows one to input a geli.calls file back into a walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1036 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-17 21:32:21 +00:00
hanna cba9025983 More package-level documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1030 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-17 16:28:45 +00:00
hanna 43a28750e0 Package level documentation -- helps new users get acclimated to the codebase more quickly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1029 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-17 16:27:48 +00:00
aaron 6ee64c7e43 added changes to support alec toUnmappedRead seek. Huge improvements (orders of magnitude) in unmapped read performance.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1021 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-16 22:15:56 +00:00
aaron 7db4497013 fixing the readTraversal output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1019 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-16 19:44:38 +00:00
ebanks 647b8a1ab0 Fix TabularROD printing and testing so Aaron stops nagging me.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1016 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-16 15:49:26 +00:00
aaron a0a549557f added a check of the sort ordering to the query methods, so that we detect if a file is unsorted much earlier. Also added some verbosity to the exception; it now contains an information about the raw attribute we saw for 'SO', the sort order of the bam file.
Also fixed a bunch of documentation

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1015 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-15 22:15:03 +00:00
ebanks 11aa715630 added capability for filtering by platform
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1011 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-15 19:19:50 +00:00
ebanks 8f4bc8cb6e Move filtering functionality into the PrintReadsWalker. More to come.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1010 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-15 16:38:08 +00:00
kiran 0583459839 Another formatting change to make Hapmap sites more clearly visible.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1004 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-12 19:53:21 +00:00
kiran e9be2a9c60 Changed a formatting issue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1002 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-12 19:40:32 +00:00
ebanks 032d0436e6 Added ROD for 1KG SNP calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@988 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-11 19:53:51 +00:00
ebanks ffffe3b2f6 -Support for 1KG SNP calls in RODs
-Minor bug fix


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@987 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-11 18:56:37 +00:00
aaron 63b5c12cbd Changed dataSources to datasources, to be consistant with the rest of our package names. Also, this makes me champion in the largest check-in contest.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@985 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-11 18:13:22 +00:00
hanna 678ddd914f Stopgap fixes GFF, DbSNP being half-open rather than half-closed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@980 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 21:38:57 +00:00
aaron 94b0e46d12 checked in a sample xml file used to store the defaults for the SomaticCoverage tool, and added it to the SomaticCoverage.jar in build.sml. Also added a inputStream marshalling method to the GATKArgumentCollection.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@979 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 20:46:16 +00:00
aaron 72a81f8f25 removed the requirement that a bam file list be present in the XML version of the command line arguments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@975 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 20:01:13 +00:00
aaron f304803811 initial check-in of an easy way to create command line tools based on the GATK
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@970 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 17:34:02 +00:00
kiran b0cc763eb5 Added some methods to format bases such that read bases on the forward strand are in uppercase, while those on the negative strand are lowercase. This does *not* affect the default functionality of the standard PileupWalker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@969 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 17:31:00 +00:00
hanna ad80894afa Bumped picard to latest svn version.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@965 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 14:36:34 +00:00
aaron ec2f015447 fixed a bunch of comments and license headers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@964 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 14:10:46 +00:00
hanna dc6a9ca196 Pooling resources to lower memory consumption.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@962 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 13:39:32 +00:00
kiran 3adb4239e4 Same as regular Pileup, but also allows you to see flanking region around locus. This will be useful in determining that some SNPs are spurious due to being at the ends of homopolymer regions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@959 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-10 08:19:31 +00:00
depristo 7fa84ea157 10x speedup of recalibration walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@954 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-09 15:39:40 +00:00
aaron a62bc6b05d fixed some documentation and attached a correct license
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@953 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-09 14:44:27 +00:00
aaron bf6190b471 cleaned up the PrintReadsWalker, and added a lot of documentation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@952 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-09 14:28:32 +00:00
kiran fecba2cae5 Disabled option to show secondary quals as the definition has changed to conform to the spec and thus this printout is non-sensical.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@950 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-09 03:21:14 +00:00
aaron a8a2d0eab9 added support for the -M option in traversals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@935 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-08 15:12:24 +00:00
asivache 9f35a5aa32 Insidious bug: clipped sequences (S cigar elements) where a) processed incorrectly; b) sometimes caused IntervalCleaner to crash, if such sequence occured at the boundary of the interval. The following inconsistency occurs: LocusWindow traversal instantiates interval reference stretch up to rightmost read.getAlignmentEnd(), but this does not include clipped bases; then IntervalCleaner takes all read bases (as a string) and does not check if some of them were clipped. Inside the interval this would cause counting mismatches on clipped bases, at the boundary of the interval the clipped bases would stick outside the passed reference stretch and index-out-of-bound exception would be thrown. THIS IS A PARTIAL, TEMPORARY FIX of the problem: mismatchQualitySum() is fixed, in that it does not count mismatches on clipped bases anymore; however, we do not attempt yet to realign only meaningful, unclipped part of the read; instead all reads that have clipped bases are assigned to the original reference and we do not attempt to realign them at all (we'd need to be careful to preserve the cigar if we wanted to do this)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@933 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-08 05:20:29 +00:00
depristo 98396732ba Bug fixes for Andrey
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@930 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-07 18:19:51 +00:00
depristo 819862e04e major restructuring of generalized variant analysis framework. Now trivally easy to add additional analyses. Easy partitioning of all analyses by features, such as singleton status. Now has transition/transversional bias, counting, dbSNP coverage, HWE violation, selecting of variants by presence/absense in dbs. Also restructured the ROD system to make it easier to add tracks. Also, added the interval track -- if you provide an interval list, then the system autoatmically makese this available to you as a bound rod -- you can always find out where you are in the interval at every site. Python scripts improved to handle more merging, etc, into population snps.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@918 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-05 23:34:37 +00:00
aaron 199be46c36 changed the warning that is outputted when the GenomeLoc constructor can't find the given contig in the reference.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@913 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-05 15:49:03 +00:00
aaron b323c58ef2 add a place to store the walker return value, along with a method to retrieve it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@910 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-05 14:41:42 +00:00
ebanks 36fb6ca3c5 Allow user to specify the compression to be used when writing out BAM files.
Updated most of the walkers to reflect this change.
Now it won't take forever to write BAMs!



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@909 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-05 08:48:34 +00:00
ebanks 45eeefbb80 Deal with randomly occurring unmapped reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@906 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-05 02:55:53 +00:00
aaron 109bef6c08 We're no longer in the read-dropping business.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@901 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-04 22:37:51 +00:00
depristo 13be846c2a qualsAsInt argument for Pileup -- fixing stupid bug [again]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@898 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-04 18:52:12 +00:00
depristo 97c8ff75dd qualsAsInt argument for Pileup -- fixing stupid bug
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@897 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-04 18:51:17 +00:00
depristo 9de3e58aa8 qualsAsInt argument for Pileup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@896 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-04 18:37:39 +00:00
asivache bcc7bacba1 added List<Transcript> getTranscripts(); also more comments added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@894 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-04 16:25:14 +00:00
depristo b492192838 Pairwise SNP distance metrics now enabled
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@892 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-04 00:11:29 +00:00
hanna 6e60cddfed A fix for the 'rod blows up when it hits a GenomeLoc outside the reference' issu
e.  Really a stopgap; error handling in the RODs needs to be addressed in a more comprehensive way.  Right now, hasNext() isn't guaranteed to be correct.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@878 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-02 18:14:46 +00:00
aaron fc91e3e30e equals signs can be important
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@870 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-01 16:56:21 +00:00
aaron 4edb33788b added a fix for a bug Andrew found
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@869 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-01 16:53:56 +00:00
hanna b7defeae83 Fix bug in unit tests created by new filter in TraversalEngine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@868 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-01 15:50:44 +00:00
hanna fc7320133c Cleaned up error when fasta index is missing. Code still throws an exception, but the message is more direct (no more 'error while micromanaging') and tells the user to run 'samtools faidx' to fix the issue.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@867 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-01 15:34:38 +00:00
hanna c04b67c969 Basic instrumentation support for the hierarchical microscheduler.x
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@862 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-29 22:19:27 +00:00
asivache c252fec1bc synchronizing, no real changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@859 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-29 21:56:14 +00:00
asivache eafdba7300 more efficient implementation of line parsing, runs at least 1.5 times faster
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@858 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-29 21:09:06 +00:00
hanna 8761ab3aff Oops. IteratorPool was occasionally creating too many RODIterators in cases where some reference-ordered data was missing. Fixed by better tracking position of RODIterator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@857 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-29 21:00:31 +00:00
hanna a1edb898ef Make criteria for determining whether to stop and merge inputs more sane.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@855 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-29 18:08:18 +00:00
depristo e0803eabd9 enabled underlying filtering of zero mapping quality reads, vastly improves system performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@853 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-29 14:51:08 +00:00
hanna 1f93545c70 Always opt to merge dictionaries when creating a SAMFileHeaderMerger.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@852 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-28 22:38:16 +00:00
hanna 0cf90b6f8a Tie into sequence merging code in the latest version of picard.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@851 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-28 21:48:35 +00:00
hanna 5e8c08ee63 Update to latest version of picard. Change imports in all classes dependent on picard public from import edu.mit.broad.picard... to import net.sf.picard...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@849 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-28 20:13:01 +00:00
hanna aa17c4a468 Farewell, functionalj. You promised much, but you could not deliver.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@847 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-28 01:35:49 +00:00
depristo ce6a0f522b First incarnation of the population-based SNP analysis tool. Also bug fixes throughout the GATK
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@845 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 22:02:24 +00:00
hanna a11bf0f43e Basic unit tests for ReferenceOrderedView, ShardDataProvider. Addressing GSA-25.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@844 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 21:15:01 +00:00
aaron 5c6163ecbf Removing the old reads traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@842 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 18:36:11 +00:00
aaron c7b032cc88 missed a file in the add.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@841 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 18:27:38 +00:00
aaron 3c3cd5bb64 Moving some of the data sharding around. A new shard catagory now exits, INTERVAL. This saved a lot of code that was mirroring the same approach in both the read and locus shard strategies.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@840 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 18:24:31 +00:00
asivache 99524ab6d0 package name corrected
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@839 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 18:20:43 +00:00
asivache b76f8c4eb5 moved from playground to gatk
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@838 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 18:18:33 +00:00
asivache ae0bac5696 'made public' implies the 'public' keyword, actually...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@835 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 17:57:01 +00:00
asivache 41c1a62ac4 formerly private class, factored out and made public. Represents a transcript annotation (transcript id, genomic location, genomic intervals for all exons present in this transcript, etc)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@834 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 17:52:38 +00:00
hanna 864a1e81e3 Delete stale class from previous rethink of the traversal engine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@828 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-27 13:52:03 +00:00
hanna a488d2dbb2 Lazy creation of output streams. Only create output streams when absolutely necessary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@824 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-26 21:56:57 +00:00
asivache d73f2e95cc refseq added to the list of known rod types
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@820 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-26 21:06:44 +00:00
aaron d994544c47 Added back end code support for Sharding based on genomic location for reads. Changed the sharding
code to take GenomeLocSortedSet instead of a list<GenomeLoc>, and added a bunch of much simplier 
and cleaner test cases.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@816 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-26 20:57:46 +00:00
hanna 54bb643d19 Validated Mark's assertion that GSA-27 is fixed. Also did some cleanup on the pileup walker so that it doesn't output to System.out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@812 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-26 15:58:21 +00:00
hanna 008d677bea Fixed ValidatingPileup to work with Andrey's new rodSAMPileup -> GenotypeList type hierarchy.
Fixed reference-ordered data validation system to validate class hierarchies instead of specific class types.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@811 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-23 20:50:28 +00:00
hanna 34413362fd Bugfix: handle case where queue is empty.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@808 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 21:45:22 +00:00
hanna ec2e8d5726 Fixes for getting ValidatingPileup running in parallel.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@807 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 21:20:24 +00:00
hanna 2a5be1debe Cleanup in datasources.providers namespace. Make it easier for others writing traversal engines to use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@803 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 19:12:00 +00:00
asivache d5bb4d9ba9 Auxiliary class that can read one line from samtools pileup file. Used by rodSAMPileup to read pairs of lines as needed. NOTE: this class implements Genotype and (a trivial) GenotypeList, but it is NOT a rod!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@798 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 17:20:01 +00:00
asivache 732fed9aad ALERT, ALERT! rodSAMPileup is now a GenotypeList, not a Genotype! Now it can intelligently read full samtools pileup files (containing, in general, both point and indel genotypes at the same position). No need to split/synchronize pileups from different individuals anymore, hooray!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@797 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 17:17:59 +00:00
asivache 26633957d9 Genotype interface is extended: now it requires implementing object to be able to tell whether it isPointGenotype() or isIndelGenotype() (and the contract requires, e.g. alleles to be represented differently)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@796 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 17:14:46 +00:00
asivache 8773b3a430 a trivial wrapper interface for the objects capable of holding 'full' genotype, i.e. both point (as in ref/snp) and indel variants at the same reference position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@794 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 17:12:01 +00:00
depristo 7a979859a9 Intermediate checking for evaluation -- now supports transition / transversion evaluation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@793 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 17:05:06 +00:00
ebanks f2ea193149 For some reason the apostraphes in the comments were throwing annoying
compile-time warnings: "unmappable character for encoding UTF8"


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@792 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-22 14:07:07 +00:00
depristo 30c63daf89 More improvements to the duplicate quality combiner, making progress towards a clean system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@788 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-21 22:26:57 +00:00
depristo 65995887fc Releasable version of the Pileup walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@786 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-21 22:25:37 +00:00
hanna d61a5261c1 Better integration of reference-ordered data into the data sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@779 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-21 20:09:32 +00:00
andrewk 0219d33e10 QualityUtils: added reverse function to reverse an array of bytes (and not complement it), BaseUtils: split qualToProb into itself and qualToErrProb, CovariateCounterWalker and LogisticRecalibrationWalker: several changes including a properly acocunting (only partly complete) for reversing AND complementing bases that are negative strand, PrintReadsWalker: created option to output reads to a BAM file rather than just to the sceern (useful for creating a downsampled BAM file)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@770 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-21 18:30:45 +00:00
asivache 7e5e422591 ReferenceOrdereData now inspects the ROD class using reflection. If the ROD declares a static Iterator<ROD> createIterator(String rodName, File rodFile) factory method, it is wrapped and used by the ReferenceOrderedData to read records from rodFile. If the ROD does not provide such factory method, the old behavior is the default: ReferenceOrderedData uses its own simple default iterator to read the file line by line (assuming there is only one line per record/position).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@768 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-21 15:23:22 +00:00
hanna 26dd3cd50e Cleanup. Move filtering functions closer to where they're used.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@767 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-20 21:42:48 +00:00
hanna e7a6f8cdc4 Removed evidence of a previous incarnation of data sharding.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@766 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-20 20:48:33 +00:00
hanna 3cad580655 Catch and rethrow the walker's required argument, so that command-line arguments will be displayed when the GATK throws an argument exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@765 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-20 19:17:16 +00:00
hanna dc748d9c9c Integrate more feedback on command-line argument system. Focus on help
formatter: separate required from optional but otherwise keep ordering
the same, reorder GATK arguments by usage.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@764 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-20 19:01:25 +00:00
ebanks 57918de753 add the @Requires for this walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@762 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-20 17:03:12 +00:00
hanna 96e73e496a Delete deprecated old-school traversals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@758 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-20 14:57:17 +00:00
hanna 01a3cb27c7 @Required / @Allows flags for main arguments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@751 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-19 23:26:17 +00:00
hanna ff798fe483 Reintroduce support for interval-based traversals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@749 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-18 22:54:18 +00:00
hanna c10741e9f5 Rename TraverseLociByReference to TraverseLoci to represent its new function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@743 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-18 01:31:57 +00:00
hanna 2c4de7b5c5 Switch TraverseByLoci over to new sharding system, and cleanup some code in passing read files along
the pathway from command line to traversal engine.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@727 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-15 21:02:12 +00:00
aaron 99d4ebc26d Added functionality to return the final accumulator of a traversal, so external tools can get the result of a walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@724 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-15 20:20:27 +00:00
depristo 7834b969b4 Better interface to the tabular ROD, now makes writing files easier. Also has corresponding test files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@719 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 23:20:11 +00:00
aaron 50f32b7f61 Added a shard strategy for the reduce-by-interval traversals. Also fixed bugs that I found along the way.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@718 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 21:20:18 +00:00
depristo 0f8e6061b6 Simple interface improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@717 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 21:08:09 +00:00
depristo 8e9e2f4502 Revised ROD system. Split the system in Basic type and interface. Enabled more control over rod accessing, including an initialize() function to fetch headers and other options from the file. Added general tabular rod, which has a named columns and supports a map<String,String> interface. Comes with shiny new Junit system for RODs. Also, added simple python script for accessing picard data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@716 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 21:06:28 +00:00
aaron d8c1b010f1 Fixing the naming of the function I checked in earlier.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@713 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 19:27:10 +00:00
ebanks b62bddee42 The header was never being set.
Added this hack for now and will alert the authorities ASAP... 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@708 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 17:18:51 +00:00
aaron 7aa90757ac Moved the iterators over to the StingSAMIterator interface. This will help us ensure that iterators that need to be closed get closed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@702 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 16:52:18 +00:00
aaron c3b2c66911 The GATK doesn't need the rest
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@698 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 16:20:45 +00:00
aaron 0215905bb6 Added an adapter class, that will adapt plain iterators and closeable iterators of SAMRecords into STingSAMIterators. Also unit tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@697 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 15:17:32 +00:00
hanna 80c13f7127 Added a getter for command-line arguments.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@695 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 13:55:52 +00:00
hanna 307c6e4ecf Oops. Forgot to add new file to svn.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@694 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-14 00:52:30 +00:00
hanna d14cab0be7 Added IterableLocusContextQueue and test. Cleaned up tests, adding BaseTest where it didn't exist. Enhanced test runner to run only classes ending in ...Test.java, so that utility classes can sit alongside the tests but won't be run by JUnit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@693 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-13 21:32:05 +00:00
hanna 12ae3a22b6 Break locus context data access providers into modular components in preparation for traverse by loci.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@689 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-13 18:51:16 +00:00
aaron 6e69193e3c Deprecated calls to getSamReader on both the GenomeAnalysisEngine and the TraversalEngine. This call fails in the new style traversals, but it won't disapear until the cut-over to the new traversals is complete.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@671 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-12 18:52:42 +00:00
ebanks 630066cc0a 1. Merge LocusWindows whose reads overlap.
2. Fix bug (we weren't clearing the "to emit" list)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@670 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-12 17:33:23 +00:00
aaron 9f942fdfa0 Added code to correct the violation of the parsing interface. Now the analysis type resides in the command line arg, but is stored into the argument collection before it's passed to the genomeAnalysisEngine.
Also fixed a bug where we'd exception-out if we didn't provide a interval region.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@669 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-12 15:33:55 +00:00
hanna ee9077fc69 LocusIterator iterated through LocusContexts, which was fine until now when we need something
that iterates through loci (GenomeLocs).  Rename LocusIterator to LocusContextIterator.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@662 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-12 13:54:57 +00:00
hanna 608948210c Check for a reference before extraction.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@661 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-12 13:29:44 +00:00
hanna 32696b13f5 Fixed method override issue with old-style traversals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@660 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-12 01:22:18 +00:00
hanna 862b8a6787 intervals_file + genome_loc => intervals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@659 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-12 01:04:18 +00:00
hanna 23e9e29964 Changed reads traversals from providing a LocusContext from which the reference sequence
could be extracted to a char[] containing the reference bases.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@657 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 22:45:11 +00:00
hanna 052819bed5 Switched dependencies of GenomeAnalysisTK to depend on GenomeAnalysisEngine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@656 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 22:33:00 +00:00
aaron ff1b92acc4 Switch over to the GenomeAnalysisEngine/CommandLineGATK system from the GenomeAnalysisTK code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@655 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 22:05:58 +00:00
ebanks 009e71fcd9 We need to sort cleaned reads ourselves (instead of letting SAMFileWriter
do it) because the SAM headers are often screwed up and claim to be
"unsorted".  While here, I broke off the module from the SortSamIterator
in case someone else wants to use it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@654 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 15:43:42 +00:00
aaron c735e1f627 small javadoc cleanup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@653 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 03:44:21 +00:00
aaron e8b8ab5985 Added code to extend Matt's getReferenceBases out to the read walkers, so they can see the corresponding reference for each read.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@652 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 03:42:38 +00:00
aaron 898f65547e Added code to split GenomeAnalysisTK.java into an object concerned with loading command line args, and one that runs the engines. This will allow us to run the GATK from other tools (like Matlab). Also some cleanup to seperate out the legacy traversals and the new style traversals. This is not live yet, and any modifications you need should be made to GenomeAnalysisTK.java for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@650 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 02:07:20 +00:00
aaron 8d43ec3d7e a fix for a situation where a chromosome on the reference file contains no reads, and doesn't align to the bam file. This came up using reference 18, which has chomosomes like chr1_random that aren't in all BAM files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@649 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-11 01:39:25 +00:00
hanna 55c1b688bd Fix mediocre javadoc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@646 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 22:31:16 +00:00
hanna 522f8b58be Added second method for getting large sequences of the reference for use in reads traversals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@645 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 22:18:04 +00:00
aaron 517f27f331 Added sharding strat. code that picks the right kind of shard, based on the traversal engine
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@644 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 21:55:10 +00:00
hanna 6e394490cb Cleanup in preparation for ByLoci traversal. Also did some work minimizing unit tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@643 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 21:27:54 +00:00
hanna ee777c89de Change the default mechanism for adding ROD bindings to the new system. TODO: create a new object type for these triplets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@642 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 18:43:00 +00:00
ebanks 3aabc144c6 Added functionality to allow for a contract between LocusWindowTraversalEngine and LocusWindowWalker which allows the Walker to act upon reads outside of the provided intervals.
(Really, all we want to do is spit out all reads, but this allows the Walker to do other things with the reads if it wants)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@641 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 17:28:16 +00:00
hanna de1c282e62 Reference-ordered data relies on bugs in the old command-line argument system to work. Update the ROD system to from -B track1 type1 file1 track2 type2 file2 to -B track1,type1,file1 -B track2,type2,file2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@640 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 15:28:19 +00:00
hanna 483a58627b More cleanup -- pushing shared functions down into the traversal engine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@639 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 14:12:45 +00:00
hanna 7a9cfe1f75 Push reduceInit down a level so that the walker can call into it without weird casts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@638 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 13:46:28 +00:00
hanna a5154d99a3 Haven't heard any complaints, so I'm deleting the original implementation of TraverseByLociByReference. All TbyLbyR's will now go through the new sharding code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@637 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 13:37:00 +00:00
hanna 4c269b8496 Cleanup LinearMicroScheduler in preparation for TraverseByLoci inclusion.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@634 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-08 00:58:37 +00:00
hanna 7f8850a8a2 Argument validation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@631 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-07 20:28:56 +00:00
depristo 5a6892900e fixing oddities in duplicates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@628 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-07 18:55:45 +00:00
depristo 2204be43eb System for traversing duplicate reads, along with a walker to compute quality scores among duplicates and a smarter method to combine quality scores across duplicates -- v1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@624 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-07 18:06:02 +00:00
hanna 752928df94 Switch to better mechanism for supplying a default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@615 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-07 01:22:01 +00:00
hanna dc944ec69b First stage of ROD plumbing for MicroScheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@614 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-06 23:26:21 +00:00
aaron 5136724884 Added code to the schedulers, one step closer to turning on the new reads traversals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@613 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-06 22:36:25 +00:00
aaron 0aba688e6f Added a interface that all our SAMRecord iterators should try to code to. This is in the effort to keep our code generic
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@609 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-06 21:40:41 +00:00
hanna 98716138e9 Cleanup: add support for non-public fields. Track matches as state of parsing engine as well as definitions.
Made fields of command-line argument system non-public by default.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@606 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-06 19:38:05 +00:00
aaron f5eae98af2 Fixed a bug where we could ask for a read when there were none in the pool (that's a bad thing).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@605 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-06 18:40:55 +00:00
hanna ef211f96b1 Remove old Apache CLI-based arg system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@604 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-06 18:37:51 +00:00
hanna 521aa40baa Bring new command-line argument parsing system live.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@603 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-06 18:16:11 +00:00
hanna 4ac9e72739 Migrate default and GATK arguments over to new attribute system in preparation for conversion.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@600 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-05 23:57:48 +00:00
hanna b0cdba8bb3 Acting on Kiran's suggestion to make the doc tag in the @Argument annotation required.x
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@598 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-05 22:43:40 +00:00
aaron f5880109a7 Added TraverseReads test, some bug fixes discovered in the traversal test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@594 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-05 20:36:00 +00:00
aaron daa2163ee8 Made the MergingSamIterator2 peekable. This iterator is being a ducktaped together swiss army knife, the iterators could use a redo soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@593 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-05 19:15:07 +00:00
aaron 09b0b6b57d Fixes to try and speed up unmapped read traversals. Still not nearly as fast as they should be, but the next step would be to modify samtools code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@592 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-05 18:17:07 +00:00
hanna 6e38966349 Rename some key classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@587 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-01 22:01:04 +00:00
hanna 5bdf653919 Cleanup: prepare for better output handling.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@586 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-01 21:40:46 +00:00
hanna 9f5f6f9bc7 N-way parallelism. Works for small test cases. Untested for large test cases.
-Needs more comprehensive unit testing.
-Needs some basic refactoring.
-Needs rethink of interface boundaries.
-Needs to play more nicely in the /tmp sandbox.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@583 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-01 19:34:09 +00:00
depristo 84dae06d5a Initial version of ByDuplicates traversal, as well as a duplicate quality score estimator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@576 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-30 22:16:21 +00:00
depristo ff420f5f6f Enabled iterator() function
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@575 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-30 22:15:14 +00:00
aaron 63403d32cd Changes to the interface to the simple data source rippled out to a bunch of files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@572 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-30 20:35:56 +00:00
hanna 7f173af2ea Encapsulate output tracking a bit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@570 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-30 15:12:13 +00:00
hanna ba9a0b5da8 Break out some of the weird inner classes out of the HierachicalMicroScheduler.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@566 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-29 21:07:07 +00:00
hanna 95d10ba314 Sketch of hierarchical reduce process, with unit tests for some core classes. Requires breakout of inner classes, testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@565 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-29 20:26:16 +00:00
ebanks 7de5da7065 Start getting the cleaner working in Walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@561 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-29 14:59:53 +00:00
hanna 6ecc43f385 Provide a default logger, some config settings, and some doc updates.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@557 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-29 02:06:05 +00:00
aaron b836761104 removed the test cases from the bottom of this file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@556 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-28 21:50:22 +00:00
aaron d4de68e260 added changes for the readsTraversal to accomidate design changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@553 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-28 19:49:58 +00:00
aaron b6874f30cb Added changes to bounded read iterator, it now explicitly takes a MSRI2 instead of the interfaces ClosableIterator<SAMRecord>. It would be good to fix this in the future with an interface that lets you get the (possibly merged) header.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@552 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-28 17:57:54 +00:00
aaron 395aaf48b0 Added the new by reads traversal, still needs to be sewn into the micromanager code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@551 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-28 17:55:08 +00:00
aaron a343f3eab7 Fixed bug where we weren't setting the reads group correctly. Also added code to set the printMetrics field of the singleSampleGenotyper from the Pool caller, it was null excepting out for me without that set.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@548 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-27 15:17:20 +00:00
hanna 9a8902571c Placeholder for parallel MicroManager.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@542 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-26 23:08:12 +00:00
hanna 1daa011387 Interval-based traversals were bleeding file handles. Fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@541 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-26 18:35:54 +00:00
hanna 1e2e78265d Inadvertently removed interval file support in new TbLbR. Fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@540 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-26 18:15:42 +00:00
hanna c9e9731495 More cleanup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@539 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-26 17:46:52 +00:00
hanna 4036f24909 Documentation and cleanup work in preparation for parallelism.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@538 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-26 17:42:00 +00:00
ebanks 0c76a70313 Renamed traversal by "interval" to "locusWindow"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@537 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-26 02:26:08 +00:00
depristo 9a299c11d3 Oops, typo and build problems. FYI, fixing typos is better than packing...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@536 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-25 01:37:17 +00:00
depristo ce470702fc consistency with java naming conventions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@535 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 21:44:48 +00:00
depristo bfce0c93ab removing bad file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@534 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 21:40:04 +00:00
depristo 05c6679321 Enabled ReduceByInterval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@533 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 21:39:44 +00:00
hanna ee2f022c71 Make new TraverseByLociByReference the default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@532 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 19:50:11 +00:00
hanna e50ae97fe1 Introduce new index-based fasta reader. Clean up MicroManager code, pushing necessary code back into TraversalEngine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@531 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 19:40:21 +00:00
jmaguire dd408a2a9a First draft of actual pooled EM caller.
Produces sane looking output on region of 1kG pilot1:

    CALL NA12813.SRP000031.2009_02.bam CC 0.609084 0.609084
    CALL NA12003.SRP000031.2009_02.bam CC 2.114234 2.114234 CCCCC
    CALL NA06994.SRP000031.2009_02.bam CC 0.910114 0.910114 C
    CALL NA18940.SRP000031.2009_02.bam CT 2.589749 0.910114 T
    CALL NA18555.SRP000031.2009_02.bam CC 0.609084 0.609084

Next up, eval vs. Baseline pilot1 calls and pilot3 deep-coverage truth.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@525 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 13:42:15 +00:00
ebanks 13d4692d2e 1. Added a by-interval traversal.
2. Added a shell for the indel cleaner walker (it's currently being used to test the interval traversal).
3. Fixed small bug in downsampling (make sure to downsample the offsets too)
4. GenomeAnalysisTK.execute => anyone object to my change to "instanceof" instead of trying to catch a ClassCastException (yuck)?



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@524 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 04:33:35 +00:00
aaron bd4cacb832 Added code to make a read group and sample name for BAM files that don't annotate them on reads. The defaults for both are now the filename, but this may be shortened in the future.
The sample name for a read can be retrieved with the command:

read.getAttribute(SAMTag.RG.toString());




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@518 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 00:31:00 +00:00
aaron 635bfd8604 Added a little bit of hack to get the header back to the walker by initialization time, which was before sharding in the last version.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@516 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-23 21:07:11 +00:00
aaron 0208d201c7 Forgot this in the last commit...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@515 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-23 20:47:22 +00:00
aaron 3dc2afd7ab Added the ability to get a merged header in a LociByReference traversal
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@514 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-23 20:34:52 +00:00
hanna 282f1d88b8 Make the operation 'read from the iterator and place on the queue' atomic with respect to hasNext(), next().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@513 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-23 20:16:26 +00:00
aaron 8c13940c5a A lot of changes to support by-read sharding and some from debugging of the by loci traversals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@511 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-23 19:03:14 +00:00
hanna 3d7575bbb8 Oops...omitted walker.initialize().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@504 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-23 17:35:28 +00:00
hanna 1bf4d040d8 Increase default shard size from 5 to 100000.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@494 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-22 18:29:44 +00:00
hanna 3af66a462e Make PrintLocusContextWalker less verbose.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@493 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-22 18:28:02 +00:00
hanna 4cafb95be8 TraverseByLoci / TraverseByLociByReference suffered from the same sam-triggered off-by-one (?) bug as TraverseByReference; it was just less obvious here because these versions don't shard.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@491 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-22 15:48:20 +00:00
kcibul cb2f621d01 reverting accidental commit of change to shard size
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@490 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-22 00:33:28 +00:00
kcibul b820130dce * added ability to load multiple BAM files from command line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@489 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-22 00:28:08 +00:00
kiran 5abfc7d079 Added an argument ('extended' or 'ext') that outputs the four-base probs in a long format.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@485 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-21 22:27:26 +00:00
asivache 521e202a10 updated interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@482 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-21 21:07:20 +00:00
asivache 55ca272919 reimplemented; now implements Genotype interface instead of AllelicVariant
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@481 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-21 21:06:42 +00:00
hanna eafb4633ba Temporary workaround for samtools index bug: there seems to be an off-by-one error. Will file bug report.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@470 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-17 23:14:41 +00:00
asivache f2f9fa3ed4 doc added
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@464 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-17 16:43:25 +00:00
hanna d639ec3776 Remove some copied code to make sure the traversal engine stays in sync with the locus context provider.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@463 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-17 16:41:56 +00:00
depristo 50ae1763f7 Support for -continue_after_errors flag in the validating pileup walker in case you want to see errors as they arise, rather than aborting greedily
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@461 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-17 03:13:11 +00:00
depristo ee5ab9536f trivial checking / flagging issues to enable testing of merging iterator performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@460 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-17 03:11:59 +00:00
depristo dbf2344cef Fixes for including duplicate reads in the locus traversal; now checks that the ref arg is provided when needed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@459 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-17 01:27:36 +00:00
hanna 01be8f09e3 Exception cleanup. All our non-runtime exceptions should extend from StingException, StingException needs to be lower in the tree to build.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@457 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-16 22:17:25 +00:00
aaron e5c80e59dc fixed the case when you're not seeking, it didn't initalize
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@456 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-16 22:16:03 +00:00
hanna 165e504d1c Turn on new TraverseLociByReference is now only dependent on the -et flag. REGION_STR does not matter.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@454 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-16 19:45:47 +00:00
aaron 12e1f192c4 Fixed a bug in this code where it would eat reads that didn't start at the beginning of the provided interval. This should fix / help fix Kristian problem
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@453 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-16 18:42:00 +00:00
asivache 835f1067d8 added isHom() and isHet() queries to the Genotype interface (with the obvious meaning)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@452 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-16 18:41:39 +00:00
kcibul d35a542bb9 * fixed bug where the merged header was not being set on the read (although the read group was)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@445 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-16 12:53:07 +00:00
asivache 0d324354ae separate interface for genotypes as opposed to (population) allelic variants
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@443 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-16 03:55:16 +00:00
depristo 7261787b71 Fixed potential bug with next() operation returning empty contexts when a read contains a large deletion. We can now use the look ahead safely...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@438 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 21:38:28 +00:00
aaron e70aecf518 bug fix, but important
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@437 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 21:07:20 +00:00
aaron 67ea66c866 Bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@434 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 19:12:18 +00:00
depristo 1edfe48194 Better debugging output with .debug
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@433 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 19:09:18 +00:00
depristo 9cc808104e Fixed subtle bug in permitting EXPAND_WINDOW to be > 1. We now use the right window size so we avoid including empty hangers. There's still a rare bug to sort out, which occurs in the case where a read with an indel can generate empty hangers.
Also cleaned up the debugging output.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@432 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 19:08:26 +00:00
aaron 180ff13290 Added a bunch of changes to support the new MicroManager code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@431 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 18:29:38 +00:00
aaron 12407b5b1a Deleted the old file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@427 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 13:55:01 +00:00
aaron 6db9127f90 Added changes to shattering, refactored SAMBAM into SAM
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@426 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-15 13:52:56 +00:00
depristo 24722a442e Slight code cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@421 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 22:21:36 +00:00
aaron 13b0995d54 Adding an iterator that bounds the number of reads
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@419 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 22:18:31 +00:00
depristo 72a3d84ed2 General purpose pileup code -- you can use these features to obtain detailed pileup data from reads and offsets. Useful for all pileup based walkers. Expanded support for rodSAMPileup to enable the new ValidatingPileupWalker, which takes a samtools pileup output and checks that GATK gives identical output as samtools on a per base and per qual pileup. It's going to be a very useful validation tool.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@418 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 22:13:10 +00:00
hanna 0629f79049 Moved fasta support files into their own package.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@408 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 18:13:23 +00:00
aaron eb4b4a053b A bunch of updates to the SAM/BAM data source, along with test cases for the merging of multiple files (it works!).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@399 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 14:19:20 +00:00
kiran 30121534ed Outputs the secondary bases and quals (if available) in verbose mode. Prefixed with the tag 'SQ='.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@398 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 13:58:28 +00:00
depristo 8b2c2e677b Uses the cleaner new GenomeLoc(read) syntax
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@396 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 00:55:43 +00:00
depristo 1cee7948ab Added lots of assertions to check for problems.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@395 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 00:55:19 +00:00
depristo 794360c410 Added verbose option to show mapping qualities and base qualities as ints!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@394 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 00:54:48 +00:00
depristo cc75e8f712 Uses the cleaner new GenomeLoc(read) syntax
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@393 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-14 00:53:58 +00:00