From 24e81e3e7b04dd4e949e382b5f3e8cdce7579a81 Mon Sep 17 00:00:00 2001
From: ebanks <ebanks@348d0f76-0448-11de-a6fe-93d51630548a>
Date: Wed, 22 Jul 2009 16:35:23 +0000
Subject: [PATCH] moved to wiki

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1292 348d0f76-0448-11de-a6fe-93d51630548a
---
 doc/MSA_Realigner_README.txt | 143 -----------------------------------
 1 file changed, 143 deletions(-)
 delete mode 100755 doc/MSA_Realigner_README.txt

diff --git a/doc/MSA_Realigner_README.txt b/doc/MSA_Realigner_README.txt
deleted file mode 100755
index b035267ab..000000000
--- a/doc/MSA_Realigner_README.txt
+++ /dev/null
@@ -1,143 +0,0 @@
-Overview:
-The Multiple Sequence Alignment (MSA) Realigner is designed to consume one or more BAM files
-and to fix reads which are misaligned due to the presence of insertions/deletions (indels)
-within or near them.  While this isn't the forum for a detailed explanation of why indel-
-containing reads get misaligned, it is important to note that an artifact of these misalignments
-is what look to be SNPs near the site of the indel (however, since they aren't really SNPs, I
-often refer to them as columns of mismatches at a position in a pileup).  These particular false
-positive SNPs usually occur in clusters (generally defined as 2 or more mismatch columns within
-n base pairs, where n is usually less than or equal to 10).  It is often the case that an aligner
-will detect the indel in some of the reads and will fail to detect it in others; that is because
-the aligners don't use knowledge about the other reads mapping to the same location when placing an
-individual read.  It is the realigner's job to use all of the reads mapping to a given location to
-find a consensus indel which best explains the data and which minimizes entropy within the reads.
-
-There are 3-4 major steps to the realignment process:
-Step 1: Determining (small) suspicious intervals which are likely in need of realignment
-Step 2: Merging the intervals
-Step 3: Running the realigner over those intervals
-Optional Step 4: Rebuild your original BAM with cleaned reads
-
-A more detailed explanation follows.
-
------
-
-Step 1: Determining (small) suspicious intervals which are likely in need of realignment
-
-There are several methods for finding these intervals, which can be used in conjunction with one
-another or separately.
-
-A. In the case that aligners do find some reads with indels in them, one would want to make sure
-that the other indel-containing reads in the pileup are aligned correctly.  Note that when using
-aligners which don't allow for gapped alignments (e.g. MAQ with single-end reads) this method is
-not useful.
-
-Usage:
-java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T IndelIntervals
--L <regionsToCheck.txt> -S SILENT -o <intervalsOutput1.txt>
-
-Optional Arguments:
---minIndelsPerInterval N [the minimum number of indels at a given position necessary for emission; default=1]
-
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
-
-B. Occasionally it is the case that you have a SNP call set for your file that you'd like to use
-in searching for clustered SNPS (which are suspicious).  Note that the realigner works best with
-an unfiltered SNP list if at all possible.  The following method outputs clustered SNP intervals. 
-
-Usage:
-java -jar dist/GenomeAnalysisTK.jar -R <ref.fasta> -T SNPClusters
--B dbsnp,dbsnp,<input.rod>,eval,1KGSNPs,<SNPlist.txt> -o <intervalsOutput2.txt>
-
-Optional Arguments:
---windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]
-
-
-C. When you do not have (or do not want to use) an available SNP call set, the following method
-outputs intervals of clustered mismatching intervals.  Generally, one would use method B or
-method C, but not both.
-
-Usage:
-java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T MismatchIntervals
--L <regionsToCheck.txt> -S SILENT -o <intervalsOutput3.txt>
-
-Optional Arguments:
---windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]
-
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
---mismatchFraction f [fraction of reads that need to mismatch for the position to be considered mismatching; default=0.15]
-Note that this fraction should be adjusted based on your particular data set.  For DEEP coverage and/or
-when looking for indels with low allele frequency, this number should be smaller.
-
-
-Step 2: Merging the intervals
-
-At this point, you need to combine any intervals files you have into a
-master list; this is done by running the interval merger.
-*** Note that this step needs to be done even if you have only a single intervals file.
-
-Usage: java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T IntervalMerger
---intervalsToMerge intervalsOutput1.txt [--intervalsToMerge intervalsOutput2.txt]
-[--intervalsToMerge intervalsOutput3.txt] -o <mergedIntervalList.txt>
-
-Optional Arguments:
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
---maxIntervalSize [max size in bp of merged intervals that we'll pass to the realigner; default=500]
-
-
-Step 3: Running the realigner over your intervals
-Usage: java -jar dist/GenomeAnalysisTK.jar -I <input.bam> -R <ref.fasta> -T IntervalCleaner
--L mergedIntervalList.txt -S SILENT
-
-Optional Arguments:
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
---OutputCleaned <output.bam> [the output BAM file to emit the reads; by default it writes all reads - whether or 
-not they were realigned - which at all overlap the input intervals (but not those outside the intervals)]
---OutputCleanedReadsOnly [when used with OutputCleaned it instructs the realigner to emit ONLY realigned reads]
---bam_compression N [when used with OutputCleaned it determines the BAM compression; default=5, recommended=1]
-
---OutputIndels <indels.txt> [the output file (text) for the indels found]
-
---LODThresholdForCleaning d [LOD threshold above which the realigner will proceed to realign; default=5.0]
-This term is equivalent to "significance" - i.e. is the improvement significant enough to merit realignment?
-Note that this number should be adjusted based on your particular data set.  For LOW coverage and/or
-when looking for indels with low allele frequency, this number should be smaller.
-
---EntropyThreshold f [percentage of mismatching base quality scores at a position to be considered having high entropy; default=0.15]
-This is similar to the argument in the MismatchIntervals method.  The point here is that the realigner
-will only proceed with the realignment (even above the given threshold) if it minimizes entropy among
-the reads (and doesn't simply push the mismatch column to another position).  This parameter is just
-a heuristic and should be adjusted based on your particular data set.
-
---maxConsensuses N [max alternate consensuses to try (necessary to improve performance in deep coverage); default=30]
-If you need to find the optimal solution regardless of running time, use a higher number.
-
---maxReadsForConsensuses N [max reads (chosen randomly) used for finding the potential alternate consensuses
-(necessary to improve performance in deep coverage); default=120]
-If you need to find the optimal solution regardless of running time, use a higher number.
-
-
-Optional Step 4: Rebuild your original BAM with cleaned reads
-If you want your cleaned read BAM to contain ALL of the original reads
-too (regardless of whether they were cleaned or fell within one of the
-target intervals), you can do so in this last optional step.
-Important note: this option works best with the
-"-OutputCleanedReadsOnly" option in Step 3.
-
-First, be sure to index your cleaned output BAM from Step 3:
-samtools index <output.bam>
-
-Usage: java -jar dist/GenomeAnalysisTK.jar -I <originalInput.bam> -R <ref.fasta> -T CleanedReadInjector --cleaned_reads <previousOutput.bam> -S SILENT --output_bam <fullOutput.bam>
-
-Optional Arguments:
---bam_compression N [when used with OutputCleaned it determines the BAM compression; default=5, recommended=1]
-
-
-Questions or comments:
-Email Eric Banks - ebanks@broadinstitute.org
-
-