hanna f6e985d97f Documentation for read quality recalibrator. We have to spend some time rethinking how to organize these mini-releases.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@938 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-08 16:54:39 +00:00

Read Quality Recalibrator
-------------------------
The tools in this package recalibrate the quality scores
assigned to bases in an aligned BAM file by analyzing
the covariation between machine-reported quality scores
and:

1) the position within the read, and 
2) the preceding nucleotide (sequencing chemistry effect).  

Known dbSNP sites are masked out of the aligned reads, and
the remaining mismatched bases are used as a measure of the
true error rate of the system.  The error rate for each
combination of dinucleotide and read position is then fed
into a logistic regression, which outputs a correction
factor for each combination.  These correction factors are
then used to write a recalibrated BAM file.
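
As a rough illustration of the first step, the mismatch
counts for each covariate could be tabulated along the lines
below.  This is a minimal sketch, not the code in this
package: the function names, the input format (read and
reference strings of equal length plus a set of masked read
offsets), and the decision to skip the first base are all
assumptions made for the example.  The logistic regression
step is not shown.

```python
from collections import defaultdict

def tabulate_mismatches(alignments):
    """Count observations and mismatches per (read position, dinucleotide).

    `alignments` is a hypothetical input format: an iterable of
    (read_bases, ref_bases, masked_positions) tuples, where the two
    strings have equal length and masked_positions holds the read
    offsets that overlap known dbSNP sites.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [observations, mismatches]
    for read, ref, masked in alignments:
        # Start at offset 1 so every base has a preceding nucleotide.
        for pos in range(1, len(read)):
            if pos in masked:
                continue  # known polymorphic site, not evidence of error
            key = (pos, read[pos - 1] + read[pos])
            counts[key][0] += 1
            if read[pos] != ref[pos]:
                counts[key][1] += 1
    return dict(counts)

def error_rates(counts):
    """Convert [observations, mismatches] pairs into mismatch fractions."""
    return {key: mism / float(obs) for key, (obs, mism) in counts.items()}
```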

Software Dependencies
---------------------
The recalibrator currently depends on the following
applications.  Please install these before proceeding.

- Java Runtime Environment 1.6.0_12 or later
- Python 2.4.2 or later
- R 2.6.0 or later
- samtools 0.1.4 or later

Please update the file paths at the top of 
RecalQual.py to point to the local installations of the
software listed above.
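
Before editing the paths, it can help to see which of the
tools are already visible on the PATH.  The sketch below is
one way to do that; it is independent of RecalQual.py, the
executable names are assumptions, it does not check version
numbers, and it uses shutil.which, which requires Python 3.3
or later.

```python
import shutil

# Executable names are assumptions; adjust to match the local installation.
REQUIRED_TOOLS = ["java", "python", "R", "samtools"]

def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if not found on PATH."""
    return {name: shutil.which(name) for name in names}

def missing_tools(found):
    """Return the sorted names of tools that could not be located."""
    return sorted(name for name, path in found.items() if path is None)
```

Calling missing_tools(find_tools()) returns the names that
still need to be installed or pointed to by hand.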

Running
-------
The recalibrator has two modes: recalibration and
evaluation, which can be run separately or jointly.
By default, the recalibrator will recalibrate only.

To recalibrate a given BAM file, run the following
command:

python RecalQual.py <source bam> <recalibrated bam>

After recalibration, evaluation walks through the source
BAM file again, remeasuring the difference between
empirical and reported quality.  Evaluation produces
comma-delimited text files (.csv) and graphs (.png), which
are written to the output directory for the given run (see
'Output' below).  A successful recalibration should produce
a plot of empirical vs. reported qualities that shows
little difference between the two values.
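
The comparison behind such a plot can be sketched as
follows, using the standard Phred relation
Q = -10 * log10(error rate).  The function names, the table
layout, and the add-one smoothing are assumptions of this
example; they do not describe the format of the .csv files
or the calculation the evaluator actually performs.

```python
import math

def empirical_quality(mismatches, observations):
    """Phred-scaled quality, Q = -10 * log10(error rate)."""
    # Add-one smoothing (an assumption of this sketch) avoids log10(0)
    # when no mismatches were observed.
    rate = (mismatches + 1.0) / (observations + 2.0)
    return -10.0 * math.log10(rate)

def quality_differences(table):
    """Return empirical minus reported quality for each reported score.

    `table` maps a reported quality to (mismatches, observations);
    this layout is hypothetical.
    """
    return {q: empirical_quality(m, n) - q for q, (m, n) in table.items()}
```

Differences close to zero across all reported scores
correspond to the well-calibrated case described above.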

To both recalibrate and evaluate, execute:

python RecalQual.py --recalibrate --evaluate <source bam> <recalibrated bam>

To evaluate only, after the BAM file has been recalibrated:

python RecalQual.py --evaluate <source bam> <recalibrated bam>
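
When the recalibrator is driven from another script, the
three invocations above can be assembled programmatically.
The wrapper below is a hypothetical convenience, not part of
this package; it uses only the --recalibrate and --evaluate
flags shown above and assumes RecalQual.py is in the current
directory.

```python
import subprocess

def build_command(source_bam, recalibrated_bam, recalibrate=True, evaluate=False):
    """Assemble the RecalQual.py command line for the requested modes."""
    command = ["python", "RecalQual.py"]
    if evaluate:
        # Flags are spelled out whenever evaluation is requested, because
        # only recalibration runs when no flags are given.
        if recalibrate:
            command.append("--recalibrate")
        command.append("--evaluate")
    elif not recalibrate:
        raise ValueError("select at least one of recalibrate or evaluate")
    return command + [source_bam, recalibrated_bam]

def run(source_bam, recalibrated_bam, **modes):
    """Run the recalibrator and raise CalledProcessError on failure."""
    subprocess.check_call(build_command(source_bam, recalibrated_bam, **modes))
```

For example, run("in.bam", "out.bam", evaluate=True) would
recalibrate and then evaluate.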

Output
------
The recalibration process keeps many intermediate files
around for future analysis.  By default, these files are
grouped into the directory 'output.<source bam>/', which
is created within the working directory.  This location
can be changed by editing the output_root variable at the
top of RecalQual.py.

It is safe to delete this supplemental output 
directory at any time.

Known Issues
------------
- The recalibrator requires a very large amount of memory
  when processing files with large numbers of read groups
  (> 1000).
- Evaluation mode (see the 'Running' section above)
  requires X11 to generate the graphs.  When running on a
  remote machine via ssh, be certain to enable X11
  forwarding (for example, with ssh -X).
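
A quick way to check for X11 before starting an evaluation
is to look at the DISPLAY environment variable, as in the
sketch below.  The function name is hypothetical, and a set
DISPLAY does not guarantee that the X server can actually be
reached; it only catches the common case where forwarding
was not enabled at all.

```python
import os

def x11_available(environ=None):
    """Return True if the DISPLAY variable is set and non-empty."""
    if environ is None:
        environ = os.environ
    return bool(environ.get("DISPLAY"))
```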

Support
-------
For support, please email gsadevelopers@broad.mit.edu.