gatk-3.8/protected/java/test/org/broadinstitute/sting/gatk/walkers
Guillermo del Angel 55d5f2194c Read Error Corrector for haplotype assembly
The principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer, while the correct sequence will be supported by many reads and so will look like common k-mers. The algorithm has 3 main steps:
1. K-mer graph buildup.
For each read in an active region, a map from k-mers to the number of times they have been seen is built.
2. Building correction map.
All "rare" k-mers (by default, those seen only once) get mapped to "good" k-mers (by default, those seen at least 20 times; this threshold is a command-line argument) that lie within a given Hamming distance (by default, 1). The mapping for a rare k-mer can be empty (i.e. a k-mer can be uncorrectable).
3. Correction proposal.
For each constituent k-mer of each read, if the k-mer is rare and maps to a good k-mer, get the base positions where the two k-mers differ and add these to a list of corrections for each base in each read. Then, correct the read at positions where the correction proposals are unanimous and non-empty.
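
The three steps above can be illustrated with a minimal Python sketch. This is not the GATK implementation: the function names (count_kmers, build_correction_map, correct_read), the parameter names, and the tie-breaking rule when several good k-mers are in range are all assumptions made for illustration. Only the default thresholds (rare = seen once, good = seen at least 20 times, Hamming distance 1) come from the description. The sketch also assumes that "unanimous" means all explicit proposals for a base agree; k-mers that propose no change are not counted as dissent.

```python
from collections import Counter, defaultdict

def kmers(read, k):
    """Yield (offset, k-mer) for every k-mer in the read."""
    for i in range(len(read) - k + 1):
        yield i, read[i:i + k]

def count_kmers(reads, k):
    """Step 1: map each k-mer to the number of times it was seen."""
    counts = Counter()
    for read in reads:
        for _, kmer in kmers(read, k):
            counts[kmer] += 1
    return counts

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_correction_map(counts, max_rare=1, min_good=20, max_dist=1):
    """Step 2: map each rare k-mer to a good k-mer within max_dist, if any."""
    good = [km for km, n in counts.items() if n >= min_good]
    correction = {}
    for km, n in counts.items():
        if n > max_rare:
            continue  # not rare, nothing to correct
        candidates = [g for g in good if hamming(km, g) <= max_dist]
        if candidates:
            # Assumption: prefer the closest candidate, then the most frequent.
            correction[km] = min(
                candidates, key=lambda g: (hamming(km, g), -counts[g]))
    return correction  # rare k-mers with no candidate stay unmapped

def correct_read(read, k, correction):
    """Step 3: collect per-base proposals and apply only unanimous ones."""
    proposals = defaultdict(set)
    for offset, km in kmers(read, k):
        target = correction.get(km)
        if target is None:
            continue
        for j, (old, new) in enumerate(zip(km, target)):
            if old != new:
                proposals[offset + j].add(new)
    bases = list(read)
    for pos, proposed in proposals.items():
        if len(proposed) == 1:  # all proposals for this base agree
            bases[pos] = next(iter(proposed))
    return "".join(bases)

# Toy active region: 20 error-free reads plus one read with a G->C error.
truth = "ACGTTGCAAGCT"
noisy = "ACGTTCCAAGCT"
reads = [truth] * 20 + [noisy]
counts = count_kmers(reads, k=5)
correction = build_correction_map(counts)
print(correct_read(noisy, 5, correction))  # ACGTTGCAAGCT
```

In this toy case the single error produces five singleton k-mers, each one mismatch away from a k-mer seen 20 times, and all five propose the same base at the same position, so the read is corrected. If two rare k-mers proposed different bases for one position, that position would be left untouched.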

The algorithm defaults are chosen to be very stringent and conservative: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in a read if all correction proposals are congruent.

By default, the algorithm is disabled, but it can be enabled in HaplotypeCaller via the -readErrorCorrect command-line option. However, at this point it is about 3x-10x more expensive, so it needs to be optimized if it is to be used.
2013-06-11 12:26:24 -04:00
..
annotator Refactor rsID and overlap detection in VariantOverlapAnnotator utility class 2013-06-10 15:51:13 -04:00
beagle Updated all JAVA file licenses accordingly 2013-01-10 17:06:41 -05:00
bqsr Make BQSR calculateIsIndel robust to indel CIGARs at start/end of read 2013-05-31 13:58:37 -04:00
compression/reducereads Fix error in merging code in HC 2013-05-31 16:29:29 -04:00
diagnostics Update MD5s and the Diagnose Target scala script 2013-05-13 12:06:17 -04:00
diffengine Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs) 2013-03-12 10:57:14 -04:00
fasta Updated all JAVA file licenses accordingly 2013-01-10 17:06:41 -05:00
filters Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec). 2013-04-05 00:52:32 -04:00
genotyper Refactor rsID and overlap detection in VariantOverlapAnnotator utility class 2013-06-10 15:51:13 -04:00
haplotypecaller Read Error Corrector for haplotype assembly 2013-06-11 12:26:24 -04:00
indels Secondary alignments were not handled correctly in IndelRealigner 2013-05-06 19:09:10 -04:00
phasing Updated all JAVA file licenses accordingly 2013-01-10 17:06:41 -05:00
validation MathUtils.randomSubset() now uses Collections.shuffle() (indirectly, through the other methods 2013-03-29 14:52:10 -04:00
varianteval Move some VCF/VariantContext methods back to the GATK based on feedback 2013-01-29 16:56:55 -05:00
variantrecalibration Update MD5s for VQSR header change 2013-04-16 11:45:45 -04:00
variantutils CombineVariants no longer adds PASS to unfiltered records 2013-05-20 16:53:51 -04:00