From 24e81e3e7b04dd4e949e382b5f3e8cdce7579a81 Mon Sep 17 00:00:00 2001
From: ebanks
Date: Wed, 22 Jul 2009 16:35:23 +0000
Subject: [PATCH] moved to wiki

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1292 348d0f76-0448-11de-a6fe-93d51630548a
---
 doc/MSA_Realigner_README.txt | 143 -----------------------------------
 1 file changed, 143 deletions(-)
 delete mode 100755 doc/MSA_Realigner_README.txt

diff --git a/doc/MSA_Realigner_README.txt b/doc/MSA_Realigner_README.txt
deleted file mode 100755
index b035267ab..000000000
--- a/doc/MSA_Realigner_README.txt
+++ /dev/null
@@ -1,143 +0,0 @@
-Overview:
-The Multiple Sequence Alignment (MSA) Realigner is designed to consume one or more BAM files
-and to fix reads which are misaligned due to the presence of insertions/deletions (indels)
-within or near them. While this isn't the forum for a detailed explanation of why indel-
-containing reads get misaligned, it is important to note that an artifact of these misalignments
-is what look to be SNPs near the site of the indel (however, since they aren't really SNPs, I
-often refer to them as columns of mismatches at a position in a pileup). These particular false
-positive SNPs usually occur in clusters (generally defined as 2 or more mismatch columns within
-n base pairs, where n is usually less than or equal to 10). It is often the case that an aligner
-will detect the indel in some of the reads and will fail to detect it in others; that is because
-the aligners don't use knowledge about the other reads mapping to the same location when placing an
-individual read. It is the realigner's job to use all of the reads mapping to a given location to
-find a consensus indel which best explains the data and which minimizes entropy within the reads.
-
-There are 3-4 major steps to the realignment process:
-Step 1: Determining (small) suspicious intervals which are likely in need of realignment
-Step 2: Merging the intervals
-Step 3: Running the realigner over those intervals
-Optional Step 4: Rebuild your original BAM with cleaned reads
-
-A more detailed explanation follows.
-
------
-
-Step 1: Determining (small) suspicious intervals which are likely in need of realignment
-
-There are several methods for finding these intervals, which can be used in conjunction with one
-another or separately.
-
-A. In the case that aligners do find some reads with indels in them, one would want to make sure
-that the other indel-containing reads in the pileup are aligned correctly. Note that when using
-aligners which don't allow for gapped alignments (e.g. MAQ with single-end reads) this method is
-not useful.
-
-Usage:
-java -jar dist/GenomeAnalysisTK.jar -I -R -T IndelIntervals
--L -S SILENT -o
-
-Optional Arguments:
---minIndelsPerInterval N [the minimum number of indels at a given position necessary for emission; default=1]
-
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
-
-B. Occasionally it is the case that you have a SNP call set for your file that you'd like to use
-in searching for clustered SNPS (which are suspicious). Note that the realigner works best with
-an unfiltered SNP list if at all possible. The following method outputs clustered SNP intervals.
-
-Usage:
-java -jar dist/GenomeAnalysisTK.jar -R -T SNPClusters
--B dbsnp,dbsnp,,eval,1KGSNPs, -o
-
-Optional Arguments:
---windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]
-
-
-C. When you do not have (or do not want to use) an available SNP call set, the following method
-outputs intervals of clustered mismatching intervals. Generally, one would use method B or
-method C, but not both.
-
-Usage:
-java -jar dist/GenomeAnalysisTK.jar -I -R -T MismatchIntervals
--L -S SILENT -o
-
-Optional Arguments:
---windowSize N [mismatch columns are considered clustered when they occur no more than N bp apart; default=10]
-
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
---mismatchFraction f [fraction of reads that need to mismatch for the position to be considered mismatching; default=0.15]
-Note that this fraction should be adjusted based on your particular data set. For DEEP coverage and/or
-when looking for indels with low allele frequency, this number should be smaller.
-
-
-Step 2: Merging the intervals
-
-At this point, you need to combine any intervals files you have into a
-master list; this is done by running the interval merger.
-*** Note that this step needs to be done even if you have only a single intervals file.
-
-Usage: java -jar dist/GenomeAnalysisTK.jar -I -R -T IntervalMerger
---intervalsToMerge intervalsOutput1.txt [--intervalsToMerge intervalsOutput2.txt]
-[--intervalsToMerge intervalsOutput3.txt] -o
-
-Optional Arguments:
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
---maxIntervalSize [max size in bp of merged intervals that we'll pass to the realigner; default=500]
-
-
-Step 3: Running the realigner over your intervals
-Usage: java -jar dist/GenomeAnalysisTK.jar -I -R -T IntervalCleaner
--L mergedIntervalList.txt -S SILENT
-
-Optional Arguments:
---allow454Reads [don't filter out 454 reads (which inherently have false indels); default=false]
-
---OutputCleaned [the output BAM file to emit the reads; by default it writes all reads - whether or
-not they were realigned - which at all overlap the input intervals (but not those outside the intervals)]
---OutputCleanedReadsOnly [when used with OutputCleaned it instructs the realigner to emit ONLY realigned reads]
---bam_compression N [when used with OutputCleaned it determines the BAM compression; default=5, recommended=1]
-
---OutputIndels [the output file (text) for the indels found]
-
---LODThresholdForCleaning d [LOD threshold above which the realigner will proceed to realign; default=5.0]
-This term is equivalent to "significance" - i.e. is the improvement significant enough to merit realignment?
-Note that this number should be adjusted based on your particular data set. For LOW coverage and/or
-when looking for indels with low allele frequency, this number should be smaller.
-
---EntropyThreshold f [percentage of mismatching base quality scores at a position to be considered having high entropy; default=0.15]
-This is similar to the argument in the MismatchIntervals method. The point here is that the realigner
-will only proceed with the realignment (even above the given threshold) if it minimizes entropy among
-the reads (and doesn't simply push the mismatch column to another position). This parameter is just
-a heuristic and should be adjusted based on your particular data set.
-
---maxConsensuses N [max alternate consensuses to try (necessary to improve performance in deep coverage); default=30]
-If you need to find the optimal solution regardless of running time, use a higher number.
-
---maxReadsForConsensuses N [max reads (chosen randomly) used for finding the potential alternate consensuses
-(necessary to improve performance in deep coverage); default=120]
-If you need to find the optimal solution regardless of running time, use a higher number.
-
-
-Optional Step 4: Rebuild your original BAM with cleaned reads
-If you want your cleaned read BAM to contain ALL of the original reads
-too (regardless of whether they were cleaned or fell within one of the
-target intervals), you can do so in this last optional step.
-Important note: this option works best with the
-"-OutputCleanedReadsOnly" option in Step 3.
-
-First, be sure to index your cleaned output BAM from Step 3:
-samtools index
-
-Usage: java -jar dist/GenomeAnalysisTK.jar -I -R -T CleanedReadInjector
---cleaned_reads -S SILENT --output_bam
-
-Optional Arguments:
---bam_compression N [when used with OutputCleaned it determines the BAM compression; default=5, recommended=1]
-
-
-Questions or comments:
-Email Eric Banks - ebanks@broadinstitute.org
-