gatk-3.8/doc/MSA_Realigner_README.txt

Overview:
The Multiple Sequence Alignment (MSA) Realigner is designed to consume one or more BAM files
and to fix reads which are misaligned due to the presence of insertions/deletions (indels)
within or near them.  While this isn't the forum for a detailed explanation of why indel-
containing reads get misaligned, it is important to note that an artifact of these misalignments
is what look to be SNPs near the site of the indel (however, since they aren't really SNPs, I
often refer to them as columns of mismatches at a position in a pileup).  These particular false
positive SNPs usually occur in clusters (generally defined as 2 or more mismatch columns within
n base pairs, where n is usually less than or equal to 10).  It is often the case that an aligner
will detect the indel in some of the reads and will fail to detect it in others; that is because
the aligners don't use knowledge about the other reads mapping to the same location when placing an
individual read.  It is the realigner's job to use all of the reads mapping to a given location to
find a consensus indel which best explains the data and which minimizes entropy within the reads.

There are 2 major steps to the realignment process:
Step 1: Determining (small) suspicious intervals which are likely in need of realignment
Step 2: Running the realigner over those intervals

A more detailed explanation follows.

-----

Step 1: Determining (small) suspicious intervals which are likely in need of realignment

There are several methods for finding these intervals, which can be used in conjunction with one
another or separately.

A. In the case that aligners do find some reads with indels in them, one would want to make sure
that the other indel-containing reads in the pileup are aligned correctly.  Note that when using
aligners which don't allow for gapped alignments (e.g. MAQ with single-end reads) this method is
not useful.

Usage:
java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T IndelIntervals
-L <regionsToCheck.txt> -S SILENT -o <intervalsOutput1.txt>

Optional Arguments:
--minIndelsPerInterval N [the minimum number of indels at a given position necessary for emission; default=1]

--maxReadLength N [the max read length above which we completely ignore the read; default=-1, no max]
This argument was put in place specifically to filter out 454 reads which inherently have false indels.


B. Occasionally it is the case that you have a SNP call set for your file that you'd like to use
in searching for clustered SNPS (which are suspicious).  Note that the realigner works best with
an unfiltered SNP list if at all possible.  The following method outputs clustered SNP intervals.

Usage:
java -jar dist/GenomeAnalysisTK.jar -R <ref.fasta> -T SNPClusters
-B dbsnp,dbsnp,<input.rod>,eval,1KGSNPs,<SNPlist.txt> -o <intervalsOutput2.txt>

Optional Arguments:
--windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]


C. When you do not have (or do not want to use) an available SNP call set, the following method
outputs intervals of clustered mismatching intervals.  Generally, one would use method B or
method C, but not both.

Usage:
java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T MismatchIntervals
-L <regionsToCheck.txt> -S SILENT -o <intervalsOutput3.txt>

Optional Arguments:
--windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]

--mismatchFraction f [fraction of reads that need to mismatch for the position to be considered mismatching; default=0.15]
Note that this fraction should be adjusted based on your particular data set.  For DEEP coverage and/or
when looking for indels with low allele frequency, this number should be smaller.


At this point, you'd need to combine any intervals files you had into a master list (no need to
worry about sorting and/or merging).  The following is an example command you might use:

cat intervalsOutput1.txt intervalsOutput2.txt intervalsOutput3.txt > allIntervals.txt


Step 2: Running the realigner over your intervals
Usage:
java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T IntervalCleaner
-L allIntervals.txt -S SILENT

Optional Arguments:
--maxReadLength N [the max read length above which we completely ignore the read; default=-1, no max]
This argument was put in place specifically to filter out 454 reads which inherently have false indels.

--OutputCleaned <output.bam> [the output BAM file to emit the reads; by default it writes all reads - whether or
not they were realigned - which at all overlap the input intervals (but not those outside the intervals)]
--OutputAllReads [when used with OutputCleaned it instructs the realigner to emit ALL reads from the original BAM file]
--OutputCleanedReadsOnly [when used with OutputCleaned it instructs the realigner to emit ONLY realigned reads]
--bam_compression N [when used with OutputCleaned it determines the BAM compression; default=5, recommended=1]

--OutputIndels <indels.txt> [the output file (text) for the indels found]

--LODThresholdForCleaning d [LOD threshold above which the realigner will proceed to realign; default=5.0]
This term is equivalent to "significance" - i.e. is the improvement significant enough to merit realignment?
Note that this number should be adjusted based on your particular data set.  For LOW coverage and/or
when looking for indels with low allele frequency, this number should be smaller.

--EntropyThreshold f [percentage of mismatches at a position to be considered having high entropy; default=0.15]
This is similar to the argument in the MismatchIntervals method.  The point here is that the realigner
will only proceed with the realignment (even above the given threshold) if it minimizes entropy among
the reads (and doesn't simply push the mismatch column to another position).  This parameter is just
a heuristic and should be adjusted based on your particular data set.

--maxConsensuses N [max alternate consensuses to try (necessary to improve performance in deep coverage); default=30]
If you need to find the optimal solution regardless of running time, use a higher number.

--maxReadsForConsensuses N [max reads (chosen randomly) used for finding the potential alternate consensuses
(necessary to improve performance in deep coverage); default=120]
If you need to find the optimal solution regardless of running time, use a higher number.

Questions or comments:
Email Eric Banks - ebanks@broadinstitute.org