Read Quality Recalibrator
-------------------------
The tools in this package recalibrate quality scores of Illumina reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field of each Illumina read in the output BAM are accurate, in that each reported quality score corresponds to the actual probability that the base is a mismatch.

This process is accomplished by analyzing the covariation between machine-reported quality scores and
1) the position within the read, and
2) the preceding nucleotide (sequencing chemistry effect).

Known dbSNP sites are masked out of the aligned reads, and the remaining mismatched bases are used as a measure of the true error rate of the system. The error rates at different dinucleotides and read positions are then fed into a logistic regression, which outputs a correction factor for each of those combinations. These correction factors are then used to write a recalibrated BAM file.

Software Dependencies
---------------------
The recalibrator currently depends on the following applications. Please install these before proceeding.

- Java Runtime Environment 1.6.0_12 or later
- Python 2.4.2 or later
- R 2.6.0 or later
- samtools 0.1.4 or later

Please update the file paths at the top of RecalQual.py to point to the local installations of the software listed above.

Running
-------
Before running the tool, make sure the file paths listed at the top of RecalQual.py point to the local installations of the tools listed above.

The recalibrator has two modes, recalibration and evaluation, which can be run separately or jointly. By default, the recalibrator will recalibrate only. To calibrate a given BAM file, run the following command:

    python RecalQual.py

After recalibration, performing evaluation will walk through the source BAM file again, remeasuring the differences between empirical and reported quality.
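
For intuition about what evaluation compares, the following is a minimal sketch, not code taken from RecalQual.py. It assumes the standard Phred convention used by the BAM QUAL field, Q = -10 * log10(p), and derives an empirical quality from a mismatch count. The function names and the max_q cap are hypothetical and chosen only for illustration.

```python
import math


def empirical_quality(mismatches, bases, max_q=40):
    """Phred-scaled quality implied by an observed mismatch rate.

    max_q is an assumed cap (not from RecalQual.py) that keeps the
    result finite when no mismatches were observed.
    """
    if bases <= 0:
        raise ValueError("bases must be positive")
    if mismatches == 0:
        return float(max_q)
    error_rate = mismatches / float(bases)
    return min(float(max_q), -10.0 * math.log10(error_rate))


def reported_error_probability(quality):
    """Error probability that a reported Phred quality score claims."""
    return 10.0 ** (-quality / 10.0)


# 10 mismatches in 10,000 non-dbSNP bases is an error rate of 0.001.
observed_q = empirical_quality(10, 10000)
# A base reported at Q20 claims an error probability of 0.01.
claimed_p = reported_error_probability(20)
```

In a well-calibrated file, bases reported at a given quality should show an empirical quality close to that same value.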
The results of evaluation are comma-delimited text files (.csv) and graphs (.png), which can be found in the output directory (see below) for the given run. A successful recalibration should produce a plot of empirical vs. reported qualities that shows little difference between the two values.

To both recalibrate and evaluate, execute:

    python RecalQual.py --recalibrate --evaluate

To (only) evaluate a given BAM file after calibrating:

    python RecalQual.py --evaluate

Platforms
---------
By default, the recalibrator processes only read groups originating from Illumina sequencers. To enable calibration for other platforms, edit the 'platforms' array at the top of RecalQual.py. Each platform specified here should match, case-insensitively, the "PL" attribute of the read group in the BAM file.

Output
------
The recalibration process keeps many incremental files around for future analysis. By default, all of these files are grouped into the directory 'output./', which is created within the working directory. This location can be changed by editing the output_root variable at the top of RecalQual.py. It is safe to delete this supplemental output directory at any time.

Known Issues
------------
- The recalibrator has severe memory demands when processing files with large numbers of read groups.
- If running in 'evaluation' mode (see the 'Running' section above), X11 is required to generate the graphs. If running on a machine via ssh, be certain to enable X tunnelling (for example, with ssh -X).

Troubleshooting
---------------
- The memory requirements of the recalibrator vary with the type of JVM running the application and the number of read groups in the input BAM file. If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory, in megabytes, on the processing computer.
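
As an illustration of the heap-size change described above, here is a minimal sketch. It assumes jvm_args is a plain string of space-separated JVM options; the actual definition and default value in RecalQual.py may differ. The heap_flag helper and the 4096 MB figure are hypothetical.

```python
def heap_flag(max_heap_mb):
    """Build a JVM max-heap option such as ' -Xmx4096m'."""
    max_heap_mb = int(max_heap_mb)
    if max_heap_mb <= 0:
        raise ValueError("heap size must be a positive number of megabytes")
    # The leading space keeps the option separate from any existing arguments.
    return ' -Xmx%dm' % max_heap_mb


# Assumed starting value; check the real one at the top of RecalQual.py.
jvm_args = ''
# Allow the JVM to use up to 4096 MB of heap (an example figure only).
jvm_args += heap_flag(4096)
```

Choose a value that fits within the physical memory of the machine running the recalibrator.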

Support
-------
For support, please email gsadevelopers@broad.mit.edu.