Significant change to the way subsetting by sample works with monomorphic sites. Now keeps the alt allele, even if a record is AC=0 after the subset. Previously, the system dropped the alt allele, which I don't think is the right behavior. If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting. See detailed information below.

Right now, if you select a multi-sample VCF file down (or one with filters I see) down to a smaller set of samples, and the site isn't polymorphic in that subgroup, then the alt allele is lost.  For example, when selecting down NA12878 from the OMNI, I previously received the following VCF:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Where the first two records lost the ALT allele, because NA12878 is hom-ref at this site.  My change results in a VCF that looks like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

The genotype remains unchanged, but the ALT allele is now preserved.  I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation.  This is related to the tricky issue of isPolymorphic() vs. isVariant().  

isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?

In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic.  Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.  Wiki docs updated.

Does anyone have critical infrastructure that depends on the previous convention?  Let me know so we can coordinate the change.

There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a
This commit is contained in:
depristo 2011-05-21 13:59:16 +00:00
parent 8377424089
commit 6a49e8df34
3 changed files with 14 additions and 10 deletions

View File

@ -121,19 +121,23 @@ public class AlignmentContextUtils {
* Splits the AlignmentContext into one context per read group
*
* @param context the original pileup
* @return a Map of ReadGroup to AlignmentContext
* @return a Map of ReadGroup to AlignmentContext, or an empty map if context has no base pileup
*
**/
public static Map<SAMReadGroupRecord, AlignmentContext> splitContextByReadGroup(AlignmentContext context, Collection<SAMReadGroupRecord> readGroups) {
HashMap<SAMReadGroupRecord, AlignmentContext> contexts = new HashMap<SAMReadGroupRecord, AlignmentContext>();
if ( ! context.hasBasePileup() ) {
return Collections.emptyMap();
} else {
HashMap<SAMReadGroupRecord, AlignmentContext> contexts = new HashMap<SAMReadGroupRecord, AlignmentContext>();
for (SAMReadGroupRecord rg : readGroups) {
ReadBackedPileup rgPileup = context.getBasePileup().getPileupForReadGroup(rg.getReadGroupId());
if ( rgPileup != null ) // there we some reads for RG
contexts.put(rg, new AlignmentContext(context.getLocation(), rgPileup));
for (SAMReadGroupRecord rg : readGroups) {
ReadBackedPileup rgPileup = context.getBasePileup().getPileupForReadGroup(rg.getReadGroupId());
if ( rgPileup != null ) // there we some reads for RG
contexts.put(rg, new AlignmentContext(context.getLocation(), rgPileup));
}
return contexts;
}
return contexts;
}
public static Map<String, AlignmentContext> splitContextBySampleName(ReadBackedPileup pileup, String assumedSingleSample) {

View File

@ -314,7 +314,7 @@ public class SelectVariants extends RodWalker<Integer, Integer> {
genotypes.add(genotypePair.getValue());
}
VariantContext sub = vc.subContextFromGenotypes(genotypes);
VariantContext sub = vc.subContextFromGenotypes(genotypes, vc.getAlleles());
HashMap<String, Object> attributes = new HashMap<String, Object>(sub.getAttributes());

View File

@ -31,7 +31,7 @@ public class SelectVariantsIntegrationTest extends WalkerTest {
WalkerTestSpec spec = new WalkerTestSpec(
baseTestString(" -sn A -sn B -sn C -B:variant,VCF3 " + testfile + " -NO_HEADER"),
1,
Arrays.asList("fe6963e5fea1d3854634dcf3defd7b92")
Arrays.asList("5ba7536a0819421b330350a160e4261a")
);
executeTest("testRepeatedLineSelection--" + testfile, spec);