Commit Graph

11632 Commits (e939e0d9b3ddd224728209d38e32a7768dc886d3)

Author SHA1 Message Date
Mauricio Carneiro e939e0d9b3 Small improvement to the license update scripts
- launch it once per file type, not license type (was unnecessary).
   - renamed ParseLicense to UpdateLicense for clarity

GSATDG-5
2013-01-21 16:24:33 -05:00
Mauricio Carneiro 35e3939dca Adding bamboo script to check all licenses
- CheckLicense.py script checks license of sources against a provided license file
   - checkAllLicenses.csh script runs CheckLicense for all files in the repo with the appropriate license for each

GSATDG-9
2013-01-21 16:09:38 -05:00
Mauricio Carneiro 35f8dc7426 Accept spaces before comments or package line
this allows us to be a bit more lienent before erroring out in the license script. Feature suggested by Yossi.

GSATDG-5
2013-01-18 16:51:01 -05:00
Mauricio Carneiro 7b8b064165 Last manual license update (hopefully)
if everyone updates their git hook accordingly, this will be the last time I have to manually run the script.

GSATDG-5
2013-01-18 16:13:07 -05:00
Mauricio Carneiro 02d1b87326 Better error handling for the license scripts
- Your commit will now fail gracefully with an error message if you mess up the license system
   - Your file will be preserved (unmodified) if you fail the commit process
   - Error message should be indicative of the error you need to fix (usually missing package information)

Set your pre-commit hook as a symlink to be automatically updated by new pushes with :

	ln -s private/shell/pre-commit .git/hooks/

GSATDG-18 #resolve
2013-01-18 16:13:07 -05:00
Ami Levy-Moonshine 0fb7b73107 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-18 15:03:42 -05:00
Ami Levy-Moonshine 826c29827b change the default VCFs gatherer of the GATK (not just the UG) 2013-01-18 15:03:12 -05:00
Mauricio Carneiro 63e1a377cc Automatic license information git-hook
This hook will automatically add / fix the license information in all files you commit to the repo.
To activate it, copy it to your hooks directory :

	cp private/shell/pre-commit .git/hook/

Now everytime you commit, you will have all your java and scala files automatically updated.

GSATDG-5 GSATDG-7 GSATDG-8 #resolve
2013-01-18 12:47:32 -05:00
Eric Banks cac439bc5e Optimized the Allele Biased Downsampling: now it doesn't re-sort the pileup but just removes reads from the original one.
Added a small fix that slightly changed md5s.
2013-01-18 11:17:31 -05:00
Chris Hartl 08d2da9057 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-18 10:28:45 -05:00
Chris Hartl bf5748a538 Forgot to actually put in the md5. Also with the new change to record pairing and filtering, the multiple-records integration test changed: the indel records (T/TG | T/TGACA) are matched up (rather than left separate) resulting in properly identifying mismatching alleles, rather than HET-UNAVAILABLE and UNAVAILABLE-HET. Very nice. 2013-01-18 10:25:36 -05:00
Mauricio Carneiro f99bd5be6a Small fix to make the script more generic (thanks Yossi)
GSA-710 #resolve
2013-01-18 10:06:50 -05:00
Chris Hartl 91030e9afa Bugfix: records that get paired up during the resolution of multiple-records-per-site were not going into genotype-level filtering. Caught via testing.
Testing for moltenized output, and for genotype-level filtering. This tool is now fully functional. There are three todo items:

1) Docs
2) An additional output table that gives concordance proportions normalized by records in both eval and comp (not just total in eval or total in comp)
3) Code cleanup for table creation (putting a table together the way I do takes -way- too many lines of code)
2013-01-18 09:49:48 -05:00
Eric Banks 39c73a6cf5 1. Ryan and I noticed that the FisherStrand annotation was completely busted for indels with reduced reads; fixed.
2. While making the previous fix and unifying FS for SNPs and indels, I noticed that FS was slightly broken in the general case for indels too; fixed.
3. I also fixed a minor bug in the Allele Biased Downsampling code for reduced reads.
2013-01-18 03:35:48 -05:00
Eric Banks 6a903f2c23 I finally gave up on trying to get the Haplotype/Allele merging to work in the HaplotypeCaller.
I've resigned myself instead to create a mapping from Allele to Haplotype.  It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
2013-01-18 01:21:08 -05:00
Eric Banks 6db3e473af Better error message for bad qual 2013-01-17 10:30:04 -05:00
Eric Banks 953592421b I think we got out of sync with the HC tests as we were clobbering each other's changes. Only differences here are to some RankSumTest values. 2013-01-17 09:19:21 -05:00
Eric Banks ded659232b Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 22:49:56 -05:00
Eric Banks a623cca89a Bug fix for HaplotypeCaller, as reported on the forum: when reduced reads didn't completely overlap a deletion call,
we were incorrectly trying to find the reference position of a base on the read that didn't exist.
Added integration test to cover this case.
2013-01-16 22:47:58 -05:00
Ami Levy-Moonshine fcb3c6dc2a fix small bugs in scala files 2013-01-16 22:42:20 -05:00
Eric Banks dbb69a1e10 Need to use ints for quals in HaplotypeScore instead of bytes because of overflow (they are summed when haplotypes are combined) 2013-01-16 22:33:16 -05:00
Chris Hartl e15d4ad278 Addition of moltenize argument for moltenized tabular output. NRD/NRS not moltenized because there are only two columns. 2013-01-16 18:00:23 -05:00
Mark DePristo 738c24a3b1 Add tests to ensure that all insertion reads appear in the active region traversal 2013-01-16 16:25:36 -05:00
Mark DePristo 3c476a92a2 Add dummy functionality (currently throws an error) to allow HC to include unmapped reads during assembly and calling 2013-01-16 16:25:36 -05:00
Eric Banks 79bc818022 Bug fix for VariantsToVCF: old dbSNP files can have '-' as reference base and those records always need to be padded. 2013-01-16 16:15:58 -05:00
Eric Banks 4cf34ee9da Bug fix to FisherStrand: do not let it output INFINITY. This all needs to be unit tested, but that's coming on the horizon. 2013-01-16 15:35:04 -05:00
Mark DePristo 2a42b47e4a Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
2013-01-16 15:30:00 -05:00
Mark DePristo ddcb33fcf8 Cache result of getLocation() in Shard so we don't performance expensive calculation over and over 2013-01-16 15:30:00 -05:00
Mark DePristo 4d0e7b50ec ArtificialBAMBuilder utility class for creating streams of GATKSAMRecords with a variety of properties
--  Allows us to make a stream of reads or an index BAM file with read having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start)
-- This stream can be handed back to the caller immediately, or written to an indexed BAM file
-- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)
2013-01-16 15:29:59 -05:00
Eric Banks ec1cfe6732 Oops, forgot to add 1 of my files 2013-01-16 15:05:49 -05:00
Eric Banks e47a389b26 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 14:59:11 -05:00
Eric Banks d18dbcbac1 Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base.
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0?  When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
2013-01-16 14:55:33 -05:00
Khalid Shakir 4ffb43079f Re-committing the following changes from Dec 18:
Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms.
Updated SelectHeaders to print out full interval arguments.
Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.
2013-01-16 12:43:15 -05:00
Eric Banks 445735a4a5 There was no reason to be sharing the Haplotype infrastructure between the HaplotypeCaller and the HaplotypeScore annotation since they were really looking for different things.
Separated them out, adding efficiencies for the HaplotypeScore version.
2013-01-16 11:10:13 -05:00
Eric Banks 392b5cbcdf The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).

As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases.  This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests.  Fortunately that walker is being deprecated soon...
2013-01-16 10:22:43 -05:00
Eric Banks 4fb3e48099 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 00:13:38 -05:00
Eric Banks 0d282a7750 Bam writing from HaplotypeCaller seems to be working on all my test cases. Note that it's a hidden debugging option for now.
Please let me know if you notice any bad behavior with it.
2013-01-16 00:12:02 -05:00
Mark DePristo c8de9b92d0 Updating CountReadsInActiveRegionsIntegrationTest integration tests due to new ART 2013-01-15 15:41:33 -05:00
Chris Hartl 327169b283 Refactor the method that identifies the site overlap type into the type enum class (so it can be used elsewhere potentially).
Completed todo item: for sites like

(eval)
20   12345   A    C
20   12345   A    AC

(comp)
20   12345   A    C
20   12345   A    ACCC

the records will be matched by the presence of a non-empty intersection of alleles. Any leftover records are then paired with an empty variant context (as though the call was unique). This has one somewhat counterintuitive feature, which is that normally

(eval)
20  12345  A   AC
(comp)
20  12345  A   ACCC

would be classified as 'ALLELES_DO_NOT_MATCH' (and not counted in genotype tables), in the presence of the SNP, they're counted as EVAL_ONLY and TRUTH_ONLY respectively.

+ integration test
2013-01-15 12:13:45 -05:00
Eric Banks d3baa4b8ca Have Haplotype extend the Allele class.
This way, we don't need to create a new Allele for every read/Haplotype pair to be placed in the PerReadAlleleLikelihoodMap (very inefficient).  Also, now we can easily get the Haplotype associated with the best allele for a given read.
2013-01-15 11:36:20 -05:00
Mark DePristo 3c37ea014b Retire original TraverseActiveRegion, leaving only the new optimized version
-- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests
2013-01-15 10:24:45 -05:00
Eric Banks 94800771e3 1. Initial implementation of bam writing for the HaplotypeCaller with -bam argument; currently only assembled haplotypes are emitted.
2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs.  Until the HC uses meta data though, this won't work.
2013-01-15 10:19:18 -05:00
Mark DePristo 39bc9e999d Add a test to LocusIteratorByState to ensure that we aren't holding reads anywhere
-- Run an iterator with 100Ks of reads, each carrying MBs of byte[] data, through LIBS, all starting at the same position.  Will crash with an out-of-memory error if we're holding reads anywhere in the system.
-- Is there a better way to test this behavior?
2013-01-14 16:30:16 -05:00
Mark DePristo b8b2b9b2de ManagingReferenceOrderedView optimization: don't allow a fresh RefMetaDataTracker in the frequent case where there's no reference meta data 2013-01-14 16:30:16 -05:00
Mark DePristo 7eea6b8f92 ReservoirDownsampler optimizations
-- Add an option to not allocate always ArrayLists of targetSampleSize, but rather the previous size + MARGIN.  This helps for LIBS as most of the time we don't need nearly so much space as we allow
-- consumeFinalizedItems returns an empty list if the reservior is empty, which it often true for our BAM files with low coverage
-- Allow empty sample lists for SamplePartitioner as these are used by the RefTraversals and other non-read based traversals

Make the reservoir downsampler use a linked list, rather than a fixed sized array list, in the expectFewOverflows case
2013-01-14 16:30:16 -05:00
Mark DePristo c7f0ca8ac5 Optimization for LIBS: PerSampleReadStateManager now uses a simple LinkedList of AlignmentStateMachine
-- Instead of storing a list of list of alignment starts, which is expensive to manipulate, we instead store a linear list of alignment starts.  Not grouped as previously.  This enables us to simplify iteration and update operations, making them much faster
-- Critically, the downsampler still requires this list of list.  We convert back and forth between these two representations as required, which is very rarely for normal data sets (WGS NA12878 on chr20 is 0.2%, 4x WGS is even less).
2013-01-14 16:30:16 -05:00
Mark DePristo 5a5422e4f8 Refactor PerSampleReadStates into a separate class
-- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager
-- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager
-- Make LocusIteratorByState final
2013-01-14 16:30:16 -05:00
Mark DePristo 5c2799554a Refactor updateReadStates into PerSampleReadStateManager, add tracking of downsampling rate 2013-01-14 16:30:16 -05:00
Mark DePristo a4334a67e0 SamplePartitioner optimizations and bugfixes
-- Use a linked hash map instead of a hash map since we want to iterate through the map fairly often
-- Ensure that we call doneSubmittingReads before getting reads for samples.  This function call fell out before and since it wasn't enforced I only noticed the problem while writing comments
-- Don't make unnecessary calls to contains for map.  Just use get() and check that the result is null
-- Use a LinkedList in PassThroughDownsampler, since this is faster for add() than the existing ArrayList, and we were's using random access to any resulting
2013-01-14 16:30:16 -05:00
Mark DePristo 19288b007d LIBS bugfix: kept reads now only (correctly) includes reads that at least passed the reservoir
-- Added unit tests to ensure this behavior is correct
2013-01-14 16:30:16 -05:00