diff --git a/doc/ReadQualityRecalibrator/README b/doc/ReadQualityRecalibrator/README
new file mode 100644
index 000000000..97c7f3b12
--- /dev/null
+++ b/doc/ReadQualityRecalibrator/README
@@ -0,0 +1,90 @@
+Read Quality Recalibrator
+-------------------------
+The tools in this package recalibrate quality scores
+assigned to nucleic acids in an aligned BAM file by
+analyzing the covariation between machine-reported
+quality scores and:
+
+1) the position within the read, and
+2) the preceding nucleotide (sequencing chemistry effect).
+
+The aligned reads have their dbSNP sites masked out, and
+the mismatched bases are used as a metric for the true error
+rate of the system. The error rate at different dinucleotides
+and positions in the read is then fed into a logistic regression
+system, which outputs a correction factor for each of those
+combinations. These factors are then used to output a
+recalibrated BAM file.
+
+Software Dependencies
+---------------------
+The recalibrator currently depends on the following
+applications. Please install these before proceeding.
+
+- Java Runtime Environment 1.6.0_12 or later
+- Python 2.4.2 or later
+- R 2.6.0 or later
+- samtools 0.1.4 or later
+
+Please update the file paths at the top of
+RecalQual.py to point to the local installations of the
+software listed above.
+
+Running
+-------
+Before running the tool, please update the file paths
+listed at the top of RecalQual.py to point to the
+local installations of the tools listed above.
+
+The recalibrator has two modes: recalibration and
+evaluation, which can be run separately or jointly.
+By default, the recalibrator will recalibrate only.
+
+To recalibrate a given BAM file, run the following
+command:
+
+python RecalQual.py
+
+After recalibration, performing evaluation will walk
+through the source BAM file again, remeasuring the
+differences between empirical vs. reported quality.
+The results of evaluation are comma-delimited text
+files (.csv) and graphs (.png) which can be found
+in the output directory (see below) for the given run.
+A successful recalibration should produce a plot of
+empirical vs. reported qualities that shows little
+difference between the two values.
+
+To both recalibrate and evaluate, execute:
+
+python RecalQual.py --recalibrate --evaluate
+
+To only evaluate a given BAM file after recalibrating:
+
+python RecalQual.py --evaluate
+
+Output
+------
+The recalibration process keeps many intermediate
+files around for future analysis. By default, all
+of these files are grouped into the directory
+'output./', which is created within the current
+working directory. This location can be changed
+by editing the output_root variable at the top of
+RecalQual.py.
+
+It is safe to delete this supplemental output
+directory at any time.
+
+Known Issues
+------------
+- The recalibrator has severe memory demands when processing
+  files with large numbers of read groups (> 1000).
+- If running in 'evaluation' mode (see the 'Running'
+  section above), X11 is required to generate the
+  graphs. If running on a remote machine via ssh, be
+  certain to enable X11 forwarding (e.g. ssh -X).
+
+Support
+-------
+For support, please email gsadevelopers@broad.mit.edu.
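
Illustrative note (not part of the patch): the README's overview describes
comparing machine-reported qualities against an empirical error rate measured
from mismatches, grouped by read position and preceding nucleotide. The sketch
below shows one way such an empirical table could be computed. It is not the
code in RecalQual.py: the input tuple format, the function names, and the
add-one smoothing are assumptions made for this example, the observations are
assumed to already exclude dbSNP sites, and the logistic regression step is
not shown.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def empirical_qualities(observations):
    """Tabulate empirical quality per covariate combination.

    `observations` is an iterable of
    (reported_q, position, preceding_base, is_mismatch) tuples.
    This input format is hypothetical and chosen only for this sketch.
    Returns {key: (n_bases, n_mismatches, empirical_q)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for reported_q, position, preceding_base, is_mismatch in observations:
        bucket = counts[(reported_q, position, preceding_base)]
        bucket[0] += 1
        if is_mismatch:
            bucket[1] += 1

    table = {}
    for key, (n_bases, n_mismatches) in counts.items():
        # Add-one smoothing (an assumption of this sketch) keeps the
        # rate above zero so that log10 is always defined.
        rate = (n_mismatches + 1.0) / (n_bases + 2.0)
        table[key] = (n_bases, n_mismatches, phred(rate))
    return table
```

Comparing the third element of each entry with the reported quality in its key
gives the kind of empirical vs. reported difference that evaluation mode is
described as measuring.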