113 lines
13 KiB
Plaintext
Executable File
113 lines
13 KiB
Plaintext
Executable File
Overview:
|
|
The Multiple Sequence Alignment (MSA) Realigner is designed to consume one or more BAM files
|
|
and to fix reads which are misaligned due to the presence of insertions/deletions (indels)
|
|
within or near them. While this isn't the forum for a detailed explanation of why indel-
|
|
containing reads get misaligned, it is important to note that an artifact of these misalignments
|
|
is what look to be SNPs near the site of the indel (however, since they aren't really SNPs, I
|
|
often refer to them as columns of mismatches at a position in a pileup). These particular false
|
|
positive SNPs usually occur in clusters (generally defined as 2 or more mismatch columns within
|
|
n base pairs, where n is usually less than or equal to 10). It is often the case that an aligner
|
|
will detect the indel in some of the reads and will fail to detect it in others; that is because
|
|
the aligners don't use knowledge about the other reads mapping to the same location when placing an
|
|
individual read. It is the realigner's job to use all of the reads mapping to a given location to
|
|
find a consensus indel which best explains the data and which minimizes entropy within the reads.
|
|
|
|
There are 2 major steps to the realignment process:
|
|
Step 1: Determining (small) suspicious intervals which are likely in need of realignment
|
|
Step 2: Running the realigner over those intervals
|
|
|
|
A more detailed explanation follows.
|
|
|
|
-----
|
|
|
|
Step 1: Determining (small) suspicious intervals which are likely in need of realignment
|
|
|
|
There are several methods for finding these intervals, which can be used in conjunction with one
|
|
another or separately.
|
|
|
|
A. In the case that aligners do find some reads with indels in them, one would want to make sure
|
|
that the other indel-containing reads in the pileup are aligned correctly. Note that when using
|
|
aligners which don't allow for gapped alignments (e.g. MAQ with single-end reads) this method is
|
|
not useful.
|
|
|
|
Usage:
|
|
java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T IndelIntervals
|
|
-L <regionsToCheck.txt> -S SILENT -o <intervalsOutput1.txt>
|
|
|
|
Optional Arguments:
|
|
--minIndelsPerInterval N [the minimum number of indels at a given position necessary for emission; default=1]
|
|
|
|
--maxReadLength N [the max read length above which we completely ignore the read; default=-1, no max]
|
|
This argument was put in place specifically to filter out 454 reads which inherently have false indels.
|
|
|
|
|
|
B. Occasionally it is the case that you have a SNP call set for your file that you'd like to use
|
|
in searching for clustered SNPS (which are suspicious). Note that the realigner works best with
|
|
an unfiltered SNP list if at all possible. The following method outputs clustered SNP intervals.
|
|
|
|
Usage:
|
|
java -jar dist/GenomeAnalysisTK.jar -R <ref.fasta> -T SNPClusters
|
|
-B dbsnp,dbsnp,<input.rod>,eval,1KGSNPs,<SNPlist.txt> -o <intervalsOutput2.txt>
|
|
|
|
Optional Arguments:
|
|
--windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]
|
|
|
|
|
|
C. When you do not have (or do not want to use) an available SNP call set, the following method
|
|
outputs intervals of clustered mismatching intervals. Generally, one would use method B or
|
|
method C, but not both.
|
|
|
|
Usage:
|
|
java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T MismatchIntervals
|
|
-L <regionsToCheck.txt> -S SILENT -o <intervalsOutput3.txt>
|
|
|
|
Optional Arguments:
|
|
--windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]
|
|
|
|
--mismatchFraction f [fraction of reads that need to mismatch for the position to be considered mismatching; default=0.15]
|
|
Note that this fraction should be adjusted based on your particular data set. For DEEP coverage and/or
|
|
when looking for indels with low allele frequency, this number should be smaller.
|
|
|
|
|
|
At this point, you'd need to combine any intervals files you had into a master list (no need to
|
|
worry about sorting and/or merging). The following is an example command you might use:
|
|
|
|
cat intervalsOutput1.txt intervalsOutput2.txt intervalsOutput3.txt > allIntervals.txt
|
|
|
|
|
|
Step 2: Running the realigner over your intervals
|
|
Usage:
|
|
java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T IntervalCleaner
|
|
-L allIntervals.txt -S SILENT
|
|
|
|
Optional Arguments:
|
|
--maxReadLength N [the max read length above which we completely ignore the read; default=-1, no max]
|
|
This argument was put in place specifically to filter out 454 reads which inherently have false indels.
|
|
|
|
--OutputCleaned <output.bam> [the output BAM file to emit the reads; by default it writes all reads - whether or
|
|
not they were realigned - which at all overlap the input intervals (but not those outside the intervals)]
|
|
--OutputAllReads [when used with OutputCleaned it instructs the realigner to emit ALL reads from the original BAM file]
|
|
--OutputCleanedReadsOnly [when used with OutputCleaned it instructs the realigner to emit ONLY realigned reads]
|
|
--bam_compression N [when used with OutputCleaned it determines the BAM compression; default=5, recommended=1]
|
|
|
|
--OutputIndels <indels.txt> [the output file (text) for the indels found]
|
|
|
|
--LODThresholdForCleaning d [LOD threshold above which the realigner will proceed to realign; default=5.0]
|
|
This term is equivalent to "significance" - i.e. is the improvement significant enough to merit realignment?
|
|
Note that this number should be adjusted based on your particular data set. For LOW coverage and/or
|
|
when looking for indels with low allele frequency, this number should be smaller.
|
|
|
|
--EntropyThreshold f [percentage of mismatches at a position to be considered having high entropy; default=0.15]
|
|
This is similar to the argument in the MismatchIntervals method. The point here is that the realigner
|
|
will only proceed with the realignment (even above the given threshold) if it minimizes entropy among
|
|
the reads (and doesn't simply push the mismatch column to another position). This parameter is just
|
|
a heuristic and should be adjusted based on your particular data set.
|
|
|
|
--maxConsensuses N [max alternate consensuses to try (necessary to improve performance in deep coverage); default=30]
|
|
If you need to find the optimal solution regardless of running time, use a higher number.
|
|
|
|
--maxReadsForConsensuses N [max reads (chosen randomly) used for finding the potential alternate consensuses
|
|
(necessary to improve performance in deep coverage); default=120]
|
|
If you need to find the optimal solution regardless of running time, use a higher number.
|
|
|
|
Questions or comments:
|
|
Email Eric Banks - ebanks@broadinstitute.org
|