gatk-3.8/python/FlatFileTable.py

#!/usr/bin/env python

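import re
import sys

def record_generator(filename, sep="\t", skip_n_lines=0, skip_until_regex_line=""):
    """Given a file with field headers on the first line and records on subsequent lines,
    generates a dictionary for each line keyed by the header fields.

    skip_n_lines: number of leading lines to discard before looking for the header.
    skip_until_regex_line: if non-empty, lines are discarded until one matches this
    regular expression, and the matching line is used as the header. If no line
    matches, a warning is written to stderr and no records are generated.

    Lines are stripped of all trailing whitespace before splitting, so with a
    whitespace separator such as tab, trailing empty fields are dropped."""
    with open(filename) as fin: # The file is closed once the generator finishes or is closed
        for _ in range(skip_n_lines): # Skip a number of lines
            fin.readline()

        header_line = None
        if skip_until_regex_line != "":
            regex_line = re.compile(skip_until_regex_line)
            for line in fin:
                if regex_line.search(line):
                    header_line = line # The matching line doubles as the header
                    break
            if header_line is None:
                # sys.stderr.write works under both Python 2 and Python 3
                sys.stderr.write("Warning: Regex "+skip_until_regex_line+" not found in FlatFileTable:record_generator\n")

        if header_line is None:
            header_line = fin.readline() # Pull off header
        header = header_line.rstrip().split(sep) # Parse header

        for line in fin: # Every remaining line is a record
            fields = line.rstrip().split(sep)
            record = dict(zip(header, fields)) # zip replaces itertools.izip, which Python 3 lacks
            yield record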
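
# A minimal sketch of how a single data line becomes a record, added for
# illustration only. row_to_record is a hypothetical helper, not part of the
# original module; it assumes the same rstrip/split/zip steps that
# record_generator applies to each line.
def row_to_record(header, line, sep="\t"):
    fields = line.rstrip().split(sep)
    # zip stops at the shorter sequence, so a short row simply lacks the
    # trailing keys, and surplus fields on a long row are dropped.
    return dict(zip(header, fields))

# Usage (field names are made up for the example):
# row_to_record(["chrom", "pos", "depth"], "chr1\t100\n") gives a record with
# "chrom" and "pos" keys but no "depth" key.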

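def record_matches_values(record, match_field_values):
    """Returns True if, for every (field, allowed values) pair in match_field_values,
    record[field] is one of the allowed values. Raises KeyError if record lacks a field."""
    for match_field, match_values in match_field_values:
        if record[match_field] not in match_values:
            return False
    return True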
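
# A sketch of the argument shape record_matches_values expects, added for
# illustration only. select_records is a hypothetical helper, not part of the
# original module. match_field_values is assumed to be an iterable of
# (field, allowed values) pairs, for example [("platform", ["ILLUMINA", "SOLID"])]
# (made-up values); a plain dict would be iterated by key and would need
# .items() first.
def select_records(records, match_field_values):
    for record in records:
        # Same test as record_matches_values, written inline so the sketch stands alone.
        if all(record[field] in allowed for field, allowed in match_field_values):
            yield record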

# Revision history (git-svn, file:///humgen/gsa-scr1/gsa-engineering/svn_contents):
#   trunk@1201, 2009-07-09 06:04:26 +08:00: Added ParseDCCSequenceData.py to repository
#     and made changes that allow an analysis of quantity of sequence data by platform
#     and project, moved table / record system to a new module called FlatFileTable.py
#     and built that into ParseDCCSequenceData and CoverageEval.py; changed lod
#     threshold in CoverageEvalWalker.
#   trunk@1346, 2009-07-31 05:45:23 +08:00: Skip compiled python files (*.pyc) in svn
#     status output.
#   trunk@2377, 2009-12-17 04:38:50 +08:00: Add ability for flat file table parsing
#     module to skip ahead to first occurrence of a regular expression (use case:
#     consistently parsing DepthOfCoverage output for histogram section of file across
#     file format changes).