From 127c321d0a7cb942174a876a29fa3f0b744d68da Mon Sep 17 00:00:00 2001
From: hanna
Date: Mon, 8 Jun 2009 21:11:44 +0000
Subject: [PATCH] Cut over to 1kG version of fasta / reference. Updated doc with latest version of tool summary.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@940 348d0f76-0448-11de-a6fe-93d51630548a
---
 doc/ReadQualityRecalibrator/README | 21 ++++++++++++---------
 python/RecalQual.py                |  4 ++--
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/doc/ReadQualityRecalibrator/README b/doc/ReadQualityRecalibrator/README
index 97c7f3b12..3a1620e46 100644
--- a/doc/ReadQualityRecalibrator/README
+++ b/doc/ReadQualityRecalibrator/README
@@ -1,17 +1,20 @@
 Read Quality Recalibrator
 -------------------------
 
-The tools in this package recalibrate quality scores
-assigned to nucleic acids in an aligned BAM file by
-analyzing the covariation between machine reported
-quality scores and:
+The tools in this package recalibrate quality scores of
+Illumina reads in an aligned BAM file. After recalibration,
+the quality scores in the QUAL field in each Illumina read
+in the output BAM are accurate in that the reported quality
+score is equal to its actual probability of mismatching.
+This is process is accomplished by analyzing the covariation
+between machine reported quality scores and
 1) the position within the read, and
 2) the preceding nucleotide (sequencing chemistry effect).
 
-The aligned reads have their dbSNP sites masked out, and
-the mismatched bases are used as a metric for the true error
-rate of the system. The error rate at different dinucleotides
-and positions in the read is then fed into a logistic regression
+The aligned reads have their dbSNP sites masked out, and the
+mismatched bases are used as a metric for the true error rate
+of the system. The error rate at different dinucleotides and
+positions in the read is then fed into a logistic regression
 system which outputs a correction factor for each of those
 combinations which are then use to output a recalibrated BAM file.
 
@@ -79,7 +82,7 @@ directory at any time.
 Known Issues
 ------------
 - The recalibrator places severe memory demands on
- files with large numbers of read groups (> 1000).
+ files with large numbers of read groups.
 - If running in 'evaluation' mode (see the 'Running' section
  above), X11 is required to generate the graphs. If
  running on a machine via ssh, be certain
diff --git a/python/RecalQual.py b/python/RecalQual.py
index 728bf22ef..eb70d22ab 100755
--- a/python/RecalQual.py
+++ b/python/RecalQual.py
@@ -17,13 +17,13 @@ output_root = './'
 resources='resources/'
 
 # Where does the reference live?
-reference_base = resources + 'Homo_sapiens_assembly18'
+reference_base = resources + 'human_b36_both'
 reference = reference_base + '.fasta'
 reference_dict = reference_base + '.dict'
 reference_fai = reference_base + '.fasta.fai'
 
 # Where does DBSNP live?
-dbsnp = resources + 'dbsnp.rod.out'
+dbsnp = resources + 'dbsnp.1kg.rod.out'
 
 # Where are the application files required to run the recalibration?
 gatk = resources + 'gatk/GenomeAnalysisTK.jar'
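
Note on the README text in this patch: it describes using mismatches at
non-dbSNP sites as a measure of the true error rate, broken down by
position in the read and by dinucleotide. The sketch below illustrates
only that tallying idea and the Phred conversion. It is not the
recalibrator's actual code, it does not perform the logistic regression
step, and the function names, the input tuple layout, and the add-one
smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p)


def empirical_qualities(observations):
    """Tally mismatch rates per (reported quality, position, dinucleotide).

    `observations` is a hypothetical iterable of tuples
    (reported_q, position, dinucleotide, is_mismatch), one per base,
    with bases at known dbSNP sites already excluded by the caller.
    Returns a dict mapping each key to its empirical Phred quality.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total bases]
    for reported_q, position, dinuc, is_mismatch in observations:
        tally = counts[(reported_q, position, dinuc)]
        tally[0] += 1 if is_mismatch else 0
        tally[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption of this sketch) avoids log10(0)
        # when a bin contains no mismatches.
        p = (mismatches + 1.0) / (total + 2.0)
        table[key] = error_prob_to_phred(p)
    return table
```

A bin whose reported quality is 30 but whose empirical quality comes out
near 10 is the kind of discrepancy the correction factors are meant to
remove: after recalibration the reported score should match the observed
mismatch probability.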