Significant change to the way subsetting by sample works with monomorphic sites. Now keeps the alt allele, even if a record is AC=0 after the subset. Previously, the system dropped the alt allele, which I don't think is the right behavior. If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting. See detailed information below.

Right now, if you select a multi-sample VCF file down (or one with filters I see) down to a smaller set of samples, and the site isn't polymorphic in that subgroup, then the alt allele is lost. For example, when selecting down NA12878 from the OMNI, I previously received the following VCF: 1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942 Where the first two records lost the ALT allele, because NA12878 is hom-ref at this site. My change results in a VCF that looks like: 1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942 The genotype remains unchanged, but the ALT allele is now preserved. I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant(). isVariant => is there an ALT allele? isPolymorphic => is some sample non-ref in the samples? In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this. Wiki docs updated. Does anyone have critical infrastructure that depends on the previous convention? Let me know so we can coordinate the change. There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-21 13:59:16 +00:00 · 2011-05-21 13:59:16 +00:00 · 6a49e8df34
parent 8377424089
commit 6a49e8df34
3 changed files with 14 additions and 10 deletions
--- a/java/src/org/broadinstitute/sting/gatk/contexts/AlignmentContextUtils.java
+++ b/java/src/org/broadinstitute/sting/gatk/contexts/AlignmentContextUtils.java
@ -121,19 +121,23 @@ public class AlignmentContextUtils {
     * Splits the AlignmentContext into one context per read group
     *
     * @param context the original pileup
-     * @return a Map of ReadGroup to AlignmentContext
+     * @return a Map of ReadGroup to AlignmentContext, or an empty map if context has no base pileup
     *
     **/
    public static Map<SAMReadGroupRecord, AlignmentContext> splitContextByReadGroup(AlignmentContext context, Collection<SAMReadGroupRecord> readGroups) {
-        HashMap<SAMReadGroupRecord, AlignmentContext> contexts = new HashMap<SAMReadGroupRecord, AlignmentContext>();
+        if ( ! context.hasBasePileup() ) {
+            return Collections.emptyMap();
+        } else {
+            HashMap<SAMReadGroupRecord, AlignmentContext> contexts = new HashMap<SAMReadGroupRecord, AlignmentContext>();

-        for (SAMReadGroupRecord rg : readGroups) {
-            ReadBackedPileup rgPileup = context.getBasePileup().getPileupForReadGroup(rg.getReadGroupId());
-            if ( rgPileup != null ) // there we some reads for RG
-                contexts.put(rg, new AlignmentContext(context.getLocation(), rgPileup));
+            for (SAMReadGroupRecord rg : readGroups) {
+                ReadBackedPileup rgPileup = context.getBasePileup().getPileupForReadGroup(rg.getReadGroupId());
+                if ( rgPileup != null ) // there we some reads for RG
+                    contexts.put(rg, new AlignmentContext(context.getLocation(), rgPileup));
+            }
+
+            return contexts;
        }
-
-        return contexts;
    }

    public static Map<String, AlignmentContext> splitContextBySampleName(ReadBackedPileup pileup, String assumedSingleSample) {
--- a/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
+++ b/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
@ -314,7 +314,7 @@ public class SelectVariants extends RodWalker<Integer, Integer> {
                genotypes.add(genotypePair.getValue());
        }
        
-        VariantContext sub = vc.subContextFromGenotypes(genotypes);
+        VariantContext sub = vc.subContextFromGenotypes(genotypes, vc.getAlleles());

        HashMap<String, Object> attributes = new HashMap<String, Object>(sub.getAttributes());

--- a/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariantsIntegrationTest.java
+++ b/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariantsIntegrationTest.java
@ -31,7 +31,7 @@ public class SelectVariantsIntegrationTest extends WalkerTest {
        WalkerTestSpec spec = new WalkerTestSpec(
                baseTestString(" -sn A -sn B -sn C -B:variant,VCF3 " + testfile + " -NO_HEADER"),
                1,
-                Arrays.asList("fe6963e5fea1d3854634dcf3defd7b92")
+                Arrays.asList("5ba7536a0819421b330350a160e4261a")
        );

        executeTest("testRepeatedLineSelection--" + testfile, spec);