Commit Graph

5798 Commits (0dc0d586f197099980ef67d7ec8565d64dc84fee)

Author SHA1 Message Date
depristo 0dc0d586f1 Phasing-specific utilities are now in the Phasing walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5839 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:51:35 +00:00
depristo a1349f3520 report packages are no more
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5838 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:45:08 +00:00
depristo 72ad8ded19 Removed unused imports, but some of these scripts are now out of date (they have been for a long time) so they don't compile anyway
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5837 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:43:48 +00:00
depristo f608ed6d5a Removed old (and unused) reporting system, now that Kiran's VE reporting system is working. Refactors dictionary creation error messages into UserExceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5836 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 18:42:52 +00:00
rpoplin 4e7ecbdcb2 FS values need to be jittered just like HRun
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5835 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 16:44:12 +00:00
depristo 9cc049f80f Contracted ReferenceContext. Removed deprecated accessors that aren't used in the GATK at all
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5834 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-22 02:41:15 +00:00
depristo d77f4ebe31 CalibrateGenotypeLikelihoods now emits a molten data set with REF and ALT alleles, so that GL calibration can be evaluated as a function of the REF/ALT bases. DigestTable is a stand-alone Rscript that digests the multi-GB molten data table into a tiny table that shows reported vs. empirical GLs, as a function of a variety of features of the data, like REF/ALT, comp GT, eval GT, and GL itself.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5833 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 14:02:30 +00:00
depristo 6a49e8df34 Significant change to the way subsetting by sample works with monomorphic sites. Now keeps the alt allele, even if a record is AC=0 after the subset. Previously, the system dropped the alt allele, which I don't think is the right behavior. If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting. See detailed information below.
Right now, if you select a multi-sample VCF file (or one with filters, I see) down to a smaller set of samples, and the site isn't polymorphic in that subgroup, then the alt allele is lost.  For example, when selecting down NA12878 from the OMNI, I previously received the following VCF:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Where the first two records lost the ALT allele, because NA12878 is hom-ref at these sites.  My change results in a VCF that looks like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

The genotype remains unchanged, but the ALT allele is now preserved.  I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation.  This is related to the tricky issue of isPolymorphic() vs. isVariant().  

isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?

In part this is complicated by the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic.  Unfortunately, I just don't think there's a consistent convention right now, but it might be worth adopting a single approach to handling this at some point.  Wiki docs updated.

Does anyone have critical infrastructure that depends on the previous convention?  Let me know so we can coordinate the change.

There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior.
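
The distinction between isVariant and isPolymorphic, and the effect of keeping the ALT allele when subsetting, can be illustrated with a minimal sketch. It is written in Python for brevity and is not the GATK code: the Site class, the subset() function, the keep_alts flag, and the sample name other_sample are all hypothetical, and the sketch ignores missing genotypes and unused ALT alleles at multi-allelic sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical stand-in for one VCF record."""
    ref: str
    alts: tuple       # ALT alleles on the record; () stands for ALT = "."
    genotypes: dict   # sample name -> tuple of allele indices (0 = REF)

    def is_variant(self):
        # isVariant => is there an ALT allele?
        return len(self.alts) > 0

    def is_polymorphic(self):
        # isPolymorphic => is some sample non-ref in the samples?
        return any(idx > 0 for gt in self.genotypes.values() for idx in gt)


def subset(site, samples, keep_alts=True):
    """Select samples; keep_alts=False mimics the previous convention."""
    genotypes = {name: site.genotypes[name] for name in samples}
    kept = Site(site.ref, site.alts, genotypes)
    if keep_alts or kept.is_polymorphic():
        return kept
    # Previous convention: a site monomorphic in the subset loses its ALT.
    return Site(site.ref, (), genotypes)


# rs4477212 from the example above: REF A, ALT G, NA12878 is hom-ref.
site = Site("A", ("G",), {"NA12878": (0, 0), "other_sample": (0, 1)})
new = subset(site, ["NA12878"])
old = subset(site, ["NA12878"], keep_alts=False)
print(new.is_variant(), new.is_polymorphic())  # True False
print(old.is_variant(), old.is_polymorphic())  # False False
```

Under the new behavior the subset record is still a variant site but is not polymorphic in the selected samples, which is the case the two predicates are meant to tell apart.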

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 13:59:16 +00:00
depristo 8377424089 Basic error checking to ensure incoming arguments are provided correctly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5831 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 13:43:48 +00:00
depristo e234589240 Contracts for GenomeLocParser and GenomeLoc are now fully implemented.
GenomeLocs can officially have any start/stop values from -Inf - +Inf.  Bounds w.r.t. the reference are enforced, optionally, by GenomeLocParser.  General code cleanup throughout the subsystem.

All validation code for GenomeLocs is now centralized, and all I/O systems now validate their inputs.  Because of this, the Picard interval processing code has been changed to examine whether an interval is valid, and only keep the valid intervals.  Note that the scatter/gather test was changed, because the original hg18 chr20 interval files were actually malformed (all records for some reason were on chr20).

Many interval processing routines were moved to IntervalUtils, as this is their natural home.
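
As a rough illustration of "examine whether an interval is valid, and only keep the valid intervals", with bounds against the reference enforced optionally, here is a small Python sketch. It is not the GATK or Picard code. The function names, the 1-based inclusive coordinates, the start <= stop rule, and the contig length used in the example are assumptions.

```python
def is_valid_interval(contig, start, stop, contig_lengths, enforce_bounds=True):
    """Check one interval against a contig -> length mapping (hypothetical)."""
    if contig not in contig_lengths:
        return False
    if start > stop:                      # assumed well-formedness rule
        return False
    if enforce_bounds and (start < 1 or stop > contig_lengths[contig]):
        return False
    return True


def keep_valid(intervals, contig_lengths, enforce_bounds=True):
    """Return only the (contig, start, stop) tuples that validate."""
    return [iv for iv in intervals
            if is_valid_interval(*iv, contig_lengths, enforce_bounds)]


lengths = {"chr20": 1000}                 # made-up length for illustration
intervals = [("chr20", 1, 100), ("chr20", 900, 1100), ("chr21", 1, 10)]
print(keep_valid(intervals, lengths))
# [('chr20', 1, 100)]
print(keep_valid(intervals, lengths, enforce_bounds=False))
# [('chr20', 1, 100), ('chr20', 900, 1100)]
```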

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5830 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 02:01:59 +00:00
kiran 3aa56037af If asked, filters out triple-het situations too (which cannot be simply phased by transmission).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5829 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-20 18:48:19 +00:00
carneiro 3a2e32eef3 wex is wex, wgs is wgs.... i think i got it right this time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5828 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-20 16:44:25 +00:00
depristo e16bc2cbd9 Contracts for Java now written for GenomeLoc and GenomeLocParser. The semantics of GenomeLoc are now much clearer. It is no longer allowed to create invalid GenomeLocs -- you can only create them with well formed start, end, and contigs, with respect to the master dictionary. Where one previously created an invalid GenomeLoc, and asked is this valid, you must now provide the raw arguments to helper functions to assess this. Providing bad arguments to GenomeLoc generates UserExceptions now. Added utility functions contigIsInDictionary and indexIsInDictionary to help with this.
Refactored several interval utilities from GenomeLocParser to IntervalUtils, where one might expect them to live

Removed GenomeLoc.clone() method, as this was not correctly implemented, and actually unnecessary, as GenomeLocs are immutable.  Several iterator classes have changed to remove their use of clone()

Removed misc. unnecessary imports

Disabled, temporarily, the validating pileup integration test, as it uses reads mapped to a different reference sequence for ecoli, and this now does not satisfy the contracts for GenomeLoc
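
The pattern described in this commit (validate the raw arguments first, only ever construct well-formed locations, and rely on immutability instead of clone()) might look roughly like the following Python sketch. The real classes are Java and their signatures are not shown in this log, so every name here is a hypothetical analogue, and the exact validity rules are an assumption.

```python
from dataclasses import dataclass


class UserException(Exception):
    """Hypothetical stand-in for an error caused by bad user input."""


@dataclass(frozen=True)
class GenomeLoc:
    """Immutable location; instances can be shared, so no clone() is needed."""
    contig: str
    start: int
    stop: int


def contig_is_in_dictionary(dictionary, contig):
    return contig in dictionary


def index_is_in_dictionary(dictionary, index):
    return 0 <= index < len(dictionary)


def create_genome_loc(dictionary, contig, start, stop):
    """Validate raw arguments, then build the location (assumed rules)."""
    if not contig_is_in_dictionary(dictionary, contig):
        raise UserException(f"Contig {contig} is not in the dictionary")
    if start > stop:
        raise UserException(f"Start {start} is after stop {stop}")
    return GenomeLoc(contig, start, stop)


dictionary = {"chr20": 1000}   # contig -> length, made up for illustration
loc = create_genome_loc(dictionary, "chr20", 10, 20)
print(loc)  # GenomeLoc(contig='chr20', start=10, stop=20)
```

Because the dataclass is frozen, assigning to loc.start raises an error, so iterators can hand out the same object repeatedly without copying it.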


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5827 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-20 15:43:27 +00:00
depristo 0095aa2627 Contracts for Java now enabled by default in GATK build. The contract checking is automatically enabled when running tests and integrationtests. If you want to run the GATK with contract checking enabled, add -javaagent:lib/cofoja.jar to your JVM args
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5826 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-20 02:53:42 +00:00
kshakir 6c6e52def9 Renamed FCP to HybridSelectionPipeline.
Reviewed pipelines with dev team.
HSP updates:
- Calling SNPs and Indels at the same time then using SelectVariants to separate them for filtering
- Moved logs next to the files like in WGP
- Flattened outputs into one directory
- The file names for the final outputs are now <projectName>.vcf and <projectName>.eval
- Updated test to pass the chr20 intervals instead of a boolean
- Removed MultiFCP
WGP updates:
- Only cleaning and calling chromosomes 1-22, X, Y, MT
- Splitting SNPs from indels, filtering indels, then merging the selected SNPs and selected Indels back together to make sure there are no collisions in CombineVariants
- Still running VQSR on the recombined SNPs plus hard filtered indels
- Using hard indel filters from delangel
- Reduced number of tranches with rpoplin
- Changed prior for dbsnp from 10 to 8 with rpoplin
- Assuming identical samples on both CombineVariants
- Explicitly using variant merge option UNION even though it's the default
- Not setting the default genotype merge option PRIORITIZE
- Generating a vcf and eval for each tranche


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5825 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 22:47:02 +00:00
kiran d896a4a9d3 Given genotypes for a trio, phases child by transmission. Computes probability that the determined phase is correct given that the genotypes for mom and dad are correct (useful if you want to use this to compare phasing accuracy, but want to break that comparison down by phasing confidence in the truth set). Optionally filters out sites where the phasing is indeterminate.
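
Phasing a child by transmission can be sketched as follows, assuming each genotype is an unordered pair of alleles. This Python sketch is not the walker's code: the function name and the (maternal, paternal) return convention are assumptions, it returns None both for indeterminate sites and for Mendelian violations, and it does not attempt the phase-probability calculation mentioned above.

```python
def phase_by_transmission(mom, dad, child):
    """Return the child's alleles as (maternal, paternal), or None.

    Each argument is a pair of alleles, e.g. ("A", "G"). The phase is
    determined only when exactly one assignment of the child's alleles
    to the two parents is consistent with the parental genotypes.
    """
    a, b = child
    candidates = {
        (maternal, paternal)
        for maternal, paternal in ((a, b), (b, a))
        if maternal in mom and paternal in dad
    }
    if len(candidates) == 1:
        return candidates.pop()
    return None  # indeterminate (e.g. triple-het) or Mendelian violation


print(phase_by_transmission(("A", "A"), ("A", "G"), ("A", "G")))  # ('A', 'G')
print(phase_by_transmission(("A", "G"), ("A", "G"), ("A", "G")))  # None
```

The second call is the triple-het situation: both assignments are consistent with the parents, so transmission alone cannot decide the phase.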
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5824 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 21:27:37 +00:00
rpoplin fe4b40ac2c Adding new InbreedingCoeff and PercentNBases annotations for Guillermo to use.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5823 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 19:50:39 +00:00
carneiro 76c87c9f1d trio WGS was creating trio WEX filenames.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5822 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 17:45:45 +00:00
ebanks bc98ac1e74 Adding a TODO for future consideration
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5821 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-19 15:02:23 +00:00
hanna 0bb6b9a91a Locus iterators were implemented in a peekable style, which meant that a locus
and its three or four nearest neighbors could be in memory at once.  Tweaking
the iterators to ensure that previous AlignmentContexts don't have strong
references, which means that the garbage collector can work effectively to
help us trundle through these regions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5820 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 21:40:40 +00:00
hanna a38b2be329 Fix for old, broken invariant where unmapped reads are represented by null rather than an empty BAMFileSpan.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5819 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 20:57:38 +00:00
carneiro ebcd333ed8 Quick small updates:
SelectVariants: typo
MethodsDevelopmentPipeline: Added CEU Trio WGS dataset


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5818 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 20:08:39 +00:00
carneiro b5b8cb959a Added VQSR to the downsampling script and changed memory limits for the clean script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5817 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 20:07:42 +00:00
rpoplin 4b00fd2688 Adding User Exception to VQSR for the case of trying to cluster with an annotation that doesn't exist in the input VCF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5816 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 19:47:51 +00:00
depristo 218354e338 Contracts for Java (http://code.google.com/p/cofoja/) infrastructure enabled. No piece of code actually uses this, so it's possible to remove easily. Does not build by default (you must modify build.xml). Really an intermediate commit so I can play around with the system in my java classes and revert safely. Very much looking forward to DVCS
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5815 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 18:05:59 +00:00
kshakir 83e207d9dd Added option to exclude intervals during chunk calling.
Removed job priority as temp space isn't as tight at the moment and planning on changing the priority interface.
Updated chunk calling with ebanks:
- Using "the bundle" of resources.
- Using dbsnp 132 and 1000G indel RODs for both RTC & IR.
- Using the default maxIntervalSize in RTC.
- Removed use of UG.exactCalculation argument.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5814 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-18 03:48:02 +00:00
rpoplin d698c87bbf More UserExceptions and warnings in VQSR.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5813 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 19:03:21 +00:00
kshakir 541b5f7a80 Somehow checked in a version that was building extensions for everything ("") instead of selected packages. Fixed.
Also added more logging when extension generation fails.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5812 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 16:58:37 +00:00
delangel a27e8b1dc6 Bug fix - use correct variable to retrieve from map.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5811 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 15:32:58 +00:00
rpoplin d925f76edc Cutting down on the number of info lines in VQSR so that I can read the warning messages
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5810 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-16 13:35:51 +00:00
delangel 5a7444e186 First step in refactoring UG way of storing indel likelihoods - main motive is that rank sum annotations require per-read quality or likelihood information, and even the question "what allele of a variant is present in a read" which is trivial for SNPs may not be that straightforward for indels.
This step just changes storage of likelihoods so now we have, instead of an internal matrix, a class member which stores, as a hash table, a mapping from pileup element to an (allele, likelihood) pair. There's no functional change aside from internal data storage.
As a bonus, we get for free a 2-3x improvement in speed in calling because redundant likelihood computations are removed.
Next step will hook this up to, and redefine annotation engine interaction with UG for indel case.
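
The storage change can be pictured with a small Python sketch: a hash table keyed by pileup element, filled at most once per (element, allele), which an annotation could later query for the allele a read supports best. This is not the UG code. The class and method names are hypothetical, the compute function is a placeholder, and storing one likelihood per allele under each element is an assumption about how the (allele, likelihood) pairs are organized.

```python
class IndelLikelihoodCache:
    """Hypothetical cache: pileup element -> {allele: likelihood}."""

    def __init__(self, compute):
        self._compute = compute   # placeholder (element, allele) -> likelihood
        self._table = {}
        self.calls = 0            # counts how often compute really runs

    def likelihood(self, element, allele):
        per_allele = self._table.setdefault(element, {})
        if allele not in per_allele:
            self.calls += 1
            per_allele[allele] = self._compute(element, allele)
        return per_allele[allele]

    def best_allele(self, element):
        """Allele with the highest stored likelihood, or None if unseen."""
        per_allele = self._table.get(element)
        if not per_allele:
            return None
        return max(per_allele, key=per_allele.get)


# Toy likelihood function, purely for illustration.
cache = IndelLikelihoodCache(lambda element, allele: -1.0 if allele == "A" else -5.0)
cache.likelihood("read1", "A")
cache.likelihood("read1", "A")    # served from the table, no recomputation
cache.likelihood("read1", "AT")
print(cache.calls)                # 2
print(cache.best_allele("read1")) # A
```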




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5809 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-15 23:04:11 +00:00
depristo 3ccc08ace4 Now emits siteType = {SNP,INDEL}. Doesn't work (and may never actually work) for indels under current extended event system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5808 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-15 19:16:09 +00:00
depristo 75db4705ab Added splitContextByReadGroup() and fixed bug in getPileupForReadGroup() that resulted in a NPE when no reads were present for a read group.
Added doc string for getNBoundRodTracks()

Intermediate commit for CalibrateGenotypeLikelihoods and GenotypeConcordanceTable, so I have a record of my work.  Not ready for public consumption.  Really looking forward to making local commits so I can track my progress without needing to push incomplete functionality up to the server.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5807 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-15 17:36:07 +00:00
depristo 9423652ad8 Computes how well a genotype chip covers a reference panel
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5806 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-14 15:07:28 +00:00
depristo 5e9c0d00c6 Simple R script to visualize genotype likelihood accuracy
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5805 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-14 15:05:55 +00:00
delangel fa75efb6ac Backing off - need to change pileup interface for rank sum tests before indels can be annotated with them
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5804 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 21:54:54 +00:00
asivache befbcd274b Computes additional stats we want to use later for filtering: median and mad for indel position with respect to starts and ends of all the reads that support it
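
For reference, these statistics could be computed along the following lines. This is a Python sketch, not the committed code: it assumes "mad" means median absolute deviation, that each supporting read is given as a (start, end) pair in reference coordinates, and the function names are hypothetical.

```python
from statistics import median


def median_and_mad(values):
    """Median and median absolute deviation of a non-empty list."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, mad


def indel_position_stats(indel_pos, reads):
    """Stats of the indel's distance from read starts and from read ends."""
    from_start = [indel_pos - start for start, end in reads]
    from_end = [end - indel_pos for start, end in reads]
    return median_and_mad(from_start), median_and_mad(from_end)


reads = [(100, 150), (110, 160), (120, 170)]   # made-up supporting reads
print(indel_position_stats(130, reads))
# ((20, 10), (30, 10))
```

A small median with a small MAD would indicate that the indel sits consistently close to read starts or ends, which is the kind of signal a later filter could use.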
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5803 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 21:19:58 +00:00
asivache 5c889580c4 Change of logic: if "read" (sequence 2) sticks out beyond the boundary of the ref (sequence 1) it is aligned to, the extra bases on the left or on the right will be softclipped in the cigar generated for such an alignment, rather than added to the first/last M block. This also affects alignment offset: if read starts before the ref (used to be represented by a negative offset), the cigar now will start with S, and the returned offset (alignment start) will be 0.
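
The new clipping rule can be illustrated for the simplest case, a gapless placement of the read on the ref. This Python sketch is not the aligner's code: the function name is hypothetical, the offset is assumed to be 0-based, and real alignments with insertions or deletions would need the full cigar rather than a single M block.

```python
def clip_overhang(ref_len, read_len, offset):
    """Cigar and alignment start for a gapless read placed at `offset`.

    Bases hanging over either end of the ref become soft clips (S)
    instead of being folded into the first or last M block.
    """
    left_clip = max(0, -offset)
    right_clip = max(0, offset + read_len - ref_len)
    matched = read_len - left_clip - right_clip
    if matched <= 0:
        raise ValueError("read does not overlap the reference")
    cigar = ""
    if left_clip:
        cigar += f"{left_clip}S"
    cigar += f"{matched}M"
    if right_clip:
        cigar += f"{right_clip}S"
    return cigar, max(0, offset)


print(clip_overhang(10, 6, -2))  # ('2S4M', 0)
print(clip_overhang(10, 6, 7))   # ('3M3S', 7)
print(clip_overhang(10, 6, 2))   # ('6M', 2)
```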
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5802 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 21:12:54 +00:00
delangel d4ca8d94fa Trivial change to allow indels to be annotated by rank sum tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5801 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 20:24:08 +00:00
kshakir 95fc6c0a83 Changed VR tranches from old 0.1-10 to new 100 to 90.
Using hapmap training and truth based on wiki.
Explicitly setting the ts_filter_level even though 99.0 is the default.
Recal file path now ends with .recal.
Added ar's vcf input.
Omni rod name now omni instead of 1kg.
The VR RodBind tags had spaces in them.
Was passing both the full intervals and the chunk intervals to chunk jobs.
Switched back to chr20 for default since the VR crashes on small intervals sets with "MESSAGE: Matrix is singular."
Log files names based on the file paths + .out.
Added eval stratifications by sample based on the Hybrid Selection / Whole Exome pipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5800 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-13 14:38:56 +00:00
kshakir 08c13f3944 Using embedded GATK.
Hardcoded the reference and dbsnp since the training rods are also hardcoded, for now.
Changed freeze/chr20 to wg/chr20/cent1 to also test the heaviest known shard.
Other cleanup.
TODO: Memory command line options or have the script figure it out using FLS or similar.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5799 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 23:19:49 +00:00
hanna 03452c15c0 Cleanup GATKBAMIndex unit test to allow a more efficient access pattern for
FindLargeShards.  Runtime of FindLargeShards on papuan dataset is now 75min.
GATK proper should benefit as well, although the benefits might be so small
as to not be measurable.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5798 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 21:50:33 +00:00
dheiman 9e08a699c6 Corrected memory handling and jobName formatting issues
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5797 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 17:47:56 +00:00
depristo db1f9af679 Now supports multiple records in allele at sites that genotype as reference
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5796 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 17:36:27 +00:00
chartl 66c8fa5c48 James P says this change worked for him, so I'm committing it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5795 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 16:55:18 +00:00
rpoplin a22e98a2c4 Yikes. Fixing the build
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5794 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 01:52:35 +00:00
rpoplin 40797f9d45 Ensuring a minimum number of variants when clustering with bad variants. Better error message when Matrix library fails to calculate inverse.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5793 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 01:48:37 +00:00
kshakir a20d257773 Generating extensions for org.broadinstitute.sting.gatk.datasources.reads.utilities, including FindLargeShards.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5792 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-12 00:49:31 +00:00
kshakir ec443e89cf Added pass-throughs for -Djava.io.tmpdir to javac and testng.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5791 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-11 20:56:35 +00:00
carneiro fb1be2653c A succinct walker that reports GC content by interval. Taking down two old implementations of the same thing from oneoffs. Documentation added to the wiki.
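
Reporting GC content by interval amounts to something like the Python sketch below. It is not the walker itself: the function names are hypothetical, intervals are assumed to be 1-based and inclusive, and excluding N bases from the denominator is an assumption about how uncalled bases are treated.

```python
def gc_content(sequence):
    """Fraction of G/C among called bases, or None if there are none."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


def gc_by_interval(reference, intervals):
    """Map each (contig, start, stop) interval to its GC content."""
    return {
        (contig, start, stop): gc_content(reference[contig][start - 1:stop])
        for contig, start, stop in intervals
    }


reference = {"chr1": "GGCCAATTNN"}   # toy reference for illustration
print(gc_by_interval(reference, [("chr1", 1, 4), ("chr1", 1, 10)]))
# {('chr1', 1, 4): 1.0, ('chr1', 1, 10): 0.5}
```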
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5790 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-11 18:53:11 +00:00