Read Quality Recalibrator
-------------------------
The tools in this package recalibrate the quality scores of
Illumina reads in an aligned BAM file. After recalibration,
the quality scores in the QUAL field of each Illumina read
in the output BAM are accurate, in that each reported quality
score reflects the empirical probability of a mismatch at
that base. This process is accomplished by analyzing the
covariation between machine-reported quality scores and

1) the position within the read, and
2) the preceding nucleotide (sequencing chemistry effect).

The aligned reads have their dbSNP sites masked out, and the
remaining mismatched bases are used as a measure of the true
error rate of the system. The error rates at different
dinucleotides and positions in the read are then fed into a
logistic regression system, which outputs a correction factor
for each of those combinations. These correction factors are
then used to write a recalibrated BAM file.

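The relationship between a mismatch rate and a quality score can be
illustrated with the standard Phred relation Q = -10 * log10(p). The
following is a minimal sketch, not the recalibrator's own code (which
uses logistic regression as described above). The function names and
the example counts are hypothetical, and a Python 3 interpreter is
assumed.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10.0)


def empirical_quality(mismatches, total_bases):
    """Phred-scaled quality implied by an observed mismatch rate.

    Hypothetical helper: returns infinity when no mismatches were
    observed, since the Phred relation is undefined for a zero rate.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if mismatches == 0:
        return math.inf
    return -10.0 * math.log10(mismatches / total_bases)


if __name__ == "__main__":
    # Hypothetical bin: bases reported as Q30, with 10,000 mismatches
    # observed among 1,000,000 bases at non-dbSNP sites.
    print("reported error probability:", phred_to_error_prob(30))
    print("empirical quality:", empirical_quality(10000, 1000000))
```

In this made-up bin the machine reports Q30, but the observed mismatch
rate of 1 in 100 corresponds to roughly Q20, which is the kind of gap
recalibration is meant to close.
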
Software Dependencies
---------------------
The recalibrator currently depends on the following
applications. Please install these before proceeding.

- Java Runtime Environment 1.6.0_12 or later
- Python 2.4.2 or later
- R 2.6.0 or later
- samtools 0.1.4 or later

Please update the file paths at the top of
RecalQual.py to point to the local installations of the
software listed above.

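To help locate the installations before editing RecalQual.py, a small
helper along the following lines may be useful. This is a sketch under
stated assumptions: it is not part of the package, the executable
names in REQUIRED_TOOLS are guesses that may differ on a given system,
it uses shutil.which and therefore needs Python 3.3 or later, and it
does not check version numbers.

```python
import shutil

# Executable names are assumptions; adjust them to the local installation.
REQUIRED_TOOLS = ["java", "python", "R", "samtools"]


def find_missing(tools, which=shutil.which):
    """Return the subset of tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        # Print each resolved path so it can be copied into RecalQual.py.
        for tool in REQUIRED_TOOLS:
            print(tool + ": " + shutil.which(tool))
```
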
Running
-------
Before running the tool, please update the file paths
listed at the top of RecalQual.py to point to the
local installations of the tools listed above.

The recalibrator has two modes: recalibration and
evaluation, which can be run separately or jointly.
By default, the recalibrator will recalibrate only.

To recalibrate a given BAM file, run the following
command:

python RecalQual.py <source bam> <recalibrated bam>

After recalibration, evaluation walks through the source
BAM file again, remeasuring the differences between
empirical and reported quality. The results of evaluation
are comma-delimited text files (.csv) and graphs (.png),
which can be found in the output directory (see below) for
the given run. A successful recalibration should produce a
plot of empirical vs. reported qualities that shows little
difference between the two values.

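As an illustration of what "little difference" can mean numerically,
the sketch below computes a base-weighted mean absolute gap between
reported and empirical quality. It is not part of the package and
makes no assumption about the layout of the .csv files: the input is a
hypothetical in-memory list of (reported quality, empirical quality,
number of bases) tuples, and the function name is made up for this
example. Python 3 is assumed.

```python
def mean_quality_gap(bins):
    """Base-weighted mean absolute difference between reported and
    empirical quality.

    bins: iterable of (reported_q, empirical_q, n_bases) tuples
    (a hypothetical layout chosen for this sketch).
    """
    bins = list(bins)  # allow generators, since the data is read twice
    total_bases = sum(n for _, _, n in bins)
    if total_bases == 0:
        raise ValueError("no bases to evaluate")
    weighted = sum(abs(reported - empirical) * n
                   for reported, empirical, n in bins)
    return weighted / total_bases


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example = [(10, 12, 100), (20, 19, 300)]
    print(mean_quality_gap(example))
```

A value near zero indicates that reported and empirical qualities
agree closely across the bins supplied.
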
To both recalibrate and evaluate, execute:

python RecalQual.py --recalibrate --evaluate <source bam> <recalibrated bam>

To evaluate only, after a given BAM file has already been recalibrated:

python RecalQual.py --evaluate <source bam> <recalibrated bam>

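When several BAM files need processing, the three invocations above
can be driven from a small wrapper. The following is a sketch only: the
wrapper functions and the example file names are hypothetical, it
assumes RecalQual.py is in the working directory and that "python"
resolves to a suitable interpreter, and it uses no flags other than
the two shown above. The wrapper itself assumes Python 3.

```python
import subprocess


def build_command(source_bam, recalibrated_bam,
                  recalibrate=True, evaluate=False):
    """Build the RecalQual.py command line for the selected modes.

    With no flags the tool recalibrates only (its documented default),
    so flags are added only when evaluation is requested.
    """
    if not (recalibrate or evaluate):
        raise ValueError("select at least one of recalibrate or evaluate")
    cmd = ["python", "RecalQual.py"]
    if evaluate:
        if recalibrate:
            cmd.append("--recalibrate")
        cmd.append("--evaluate")
    cmd.extend([source_bam, recalibrated_bam])
    return cmd


def run(source_bam, recalibrated_bam, recalibrate=True, evaluate=False):
    """Run RecalQual.py and raise CalledProcessError if it fails."""
    subprocess.check_call(
        build_command(source_bam, recalibrated_bam, recalibrate, evaluate))


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    for name in ["sample1", "sample2"]:
        print(" ".join(build_command(name + ".bam", name + ".recal.bam",
                                     recalibrate=True, evaluate=True)))
```
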
Output
------
The recalibration process keeps many intermediate
files around for future analysis. By default, all
of these files are grouped into the directory
'output.<source bam>/', which is created within the
working directory. This location can be changed by
editing the output_root variable at the top of
RecalQual.py.

It is safe to delete this supplemental output
directory at any time.

Known Issues
------------
- The recalibrator requires a large amount of memory
  when processing files with large numbers of read groups.
- If running in 'evaluation' mode (see the 'Running'
  section above), X11 is required to generate the
  graphs. If running on a remote machine via ssh, be
  certain to enable X forwarding (for example, ssh -X).

Support
-------
For support, please email gsadevelopers@broad.mit.edu.