datconv.readers package

This package contains datconv-compatible Reader classes for commonly used file formats.

Reader interface

This module contains the Datconv Reader skeleton class, suitable as a starting point for new readers.

class datconv.readers._skeleton.DCReader[source]

Bases: object

This class must be named exactly DCReader. It is responsible for:

  • reading input data (i.e. every reader class assumes a certain input file format)
  • driving the entire data conversion process (i.e. the main processing loop is implemented in this class)
  • determining the internal representation of header, records and footer (this strongly depends on the reader and the kind of input format).

Additional parameters may be added to the constructor, but they all have to be named parameters. Parameters are usually passed from the YAML file as subkeys of the Reader:CArg key.

setWriter(writer)[source]

Obligatory method that must be defined in the Reader class. It is called by the main datconv.py script after it has read the configuration file and created the Writer class.

Parameters: writer – an instance of the Writer class.

setFilter(flt)[source]

Obligatory method that must be defined in the Reader class. It may be called by the main datconv.py script after it has read the configuration file and created the Filter class. If no Filter is configured, this method is not called.

Parameters: flt – an instance of the Filter class.

Process(inpath, outpath=None, rfrom=1, rto=0)[source]

Main method that drives the entire data conversion process. Parameters are usually passed from the YAML file as subkeys of the Reader:PArg key. The parameters given in this method are typical ones; however, they may be customized. Usually some kind of input and output path should be passed here. Also, if the structure of the input data format allows it, it is recommended to support reading a range of records, from one record number to another.

Iterate(inpath, outpath=None, rfrom=1, rto=0)[source]

Clone of the Process method that yields the values returned from Writer. Parameters are usually passed from the YAML file as subkeys of the Reader:PArg key. The parameters given in this method are typical ones; however, they may be customized. Usually some kind of input and output path should be passed here. Also, if the structure of the input data format allows it, it is recommended to support reading a range of records, from one record number to another.
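A minimal sketch of a reader built on this interface, assuming a text input with one record per non-empty line. Only the names DCReader, setWriter, setFilter, Process and Iterate come from the skeleton above. The ListWriter stub, its writeRecord method, and the treatment of the filter as a plain callable are hypothetical, because the Writer and Filter interfaces are not described in this section.

```python
import io


class ListWriter:
    """Hypothetical Writer stub: collects records in a list."""

    def __init__(self):
        self.records = []

    def writeRecord(self, record):  # hypothetical method name
        self.records.append(record)
        return record


class DCReader:
    """Toy reader: every non-empty input line is one record."""

    def __init__(self, strip=True):  # extra constructor arguments are named
        self._strip = strip
        self._wri = None
        self._flt = None

    def setWriter(self, writer):
        self._wri = writer

    def setFilter(self, flt):
        self._flt = flt

    def Iterate(self, inpath, outpath=None, rfrom=1, rto=0):
        # Accepting an open stream as well as a path is a convenience
        # of this sketch only.
        stream = open(inpath) if isinstance(inpath, str) else inpath
        with stream:
            recno = 0
            for line in stream:
                rec = line.strip() if self._strip else line
                if not rec.strip():
                    continue  # blank lines are not records
                recno += 1
                if recno < rfrom:
                    continue  # before the requested range
                if rto and recno > rto:
                    break  # past the requested range (assumed inclusive)
                # Hypothetical: the filter is a callable returning True to keep.
                if self._flt is not None and not self._flt(rec):
                    continue
                yield self._wri.writeRecord(rec)

    def Process(self, inpath, outpath=None, rfrom=1, rto=0):
        # Same loop as Iterate, with the yielded values discarded.
        for _ in self.Iterate(inpath, outpath, rfrom, rto):
            pass


reader = DCReader()
writer = ListWriter()
reader.setWriter(writer)
reader.Process(io.StringIO("a\nb\n\nc\nd\n"), rfrom=2, rto=3)
```

Implementing Process on top of Iterate keeps the two methods from drifting apart; whether the real readers are organised this way is not stated in the source.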

datconv.readers.dcxml module

This module implements Datconv Reader which reads data from XML file.

exception datconv.readers.dcxml.FilterBreak[source]

Bases: Exception

Exception class to support Reader.process break issued from Filter class.

exception datconv.readers.dcxml.ToLimitBreak[source]

Bases: Exception

Exception class to support Reader.process break caused by reaching configured record limit.
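How such an exception can stop processing is sketched below, assuming the break is raised from inside a SAX handler and caught around the parse call. The CountingHandler class and the count_records function are hypothetical names used for illustration; the actual Reader code may be organised differently.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ToLimitBreak(Exception):
    """Raised to stop parsing once the record limit is reached."""


class CountingHandler(ContentHandler):
    """Counts closed record tags and aborts the parse at the limit."""

    def __init__(self, rectag, rto=0):
        super().__init__()
        self.rectag = rectag
        self.rto = rto
        self.count = 0

    def endElement(self, name):
        if name == self.rectag:
            self.count += 1
            # rto == 0 means "no limit", as in the Process() signature.
            if self.rto and self.count >= self.rto:
                raise ToLimitBreak()


def count_records(xml_text, rectag, rto=0):
    handler = CountingHandler(rectag, rto)
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), handler)
    except ToLimitBreak:
        pass  # expected: the limit was reached before the end of input
    return handler.count


limited = count_records("<d><r/><r/><r/></d>", "r", rto=2)
unlimited = count_records("<d><r/><r/><r/></d>", "r")
```

An exception is a convenient way out here because the SAX parser owns the main loop; the handler has no other means of asking it to stop early.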

class datconv.readers.dcxml.ContentGenerator(bratags, headtags, rectags, foottags, wri, flt=None, lp_step=0, rfrom=1, rto=0)[source]

Bases: xml.sax.handler.ContentHandler

This class handles XML events generated by parser created by xml.sax.make_parser(). It implements most of the functionality of this XML Reader. See documentation of its base class for description of methods meaning.

See description of DCReader constructor and Process() method for meaning of most parameters.

startDocument()[source]

Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).

endDocument()[source]

Receive notification of the end of a document.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

startElement(name, attrs)[source]

Signals the start of an element in non-namespace mode.

The name parameter contains the raw XML 1.0 name of the element type as a string and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.

endElement(name)[source]

Signals the end of an element in non-namespace mode.

The name parameter contains the name of the element type, just as with the startElement event.

characters(content)[source]

Receive notification of character data.

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
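Because character data may arrive in several chunks, a handler typically accumulates the chunks and joins them when the element ends. A minimal sketch using only the standard library follows; the TextCollector class is hypothetical and is not part of datconv.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collects the text of leaf elements, joining split chunks."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.texts = {}

    def startElement(self, name, attrs):
        # A new element starts: forget text seen in the parent so far.
        self._chunks = []

    def characters(self, content):
        # May be called several times for one run of text.
        self._chunks.append(content)

    def endElement(self, name):
        text = "".join(self._chunks).strip()
        if text:
            self.texts[name] = text
        self._chunks = []


handler = TextCollector()
xml.sax.parseString(
    b"<rec><cdc>5019</cdc><product>ad&amp;dn</product></rec>", handler
)
```

After parsing, handler.texts maps each leaf tag name to its complete text, regardless of how the parser chose to split the character data.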

class datconv.readers.dcxml.DCReader(bratags=[], headtags=[], rectags=[], foottags=[], log_prog_step=0)[source]

Bases: object

This Datconv XML Reader class uses the xml.sax parser to read and interpret an XML file. The parser uses the ContentGenerator class from this module to handle XML events. See the documentation of the standard Python xml.sax library for more information on how it works. This Reader assumes that the structure of the input XML file is as follows:

  • there is/are some (one or more) BRACE tag(s); the entire document content is enclosed in this/those brace tag(s); a well-formed XML document should have at least one such tag;
  • then there is/are some optional HEAD tag(s); head tags begin and end completely before record tags begin;
  • then there are RECORD tags; everything inside record tags is treated as record data and is passed to Filter and Writer; record tags cannot be nested - every record tag must end before another record tag begins; there may be several kinds (names) of record tags - in such a case we say that we have multiple record types. If the list of record tags is empty, then every tag that is one level under a brace tag and that is neither a head nor a foot tag is treated as a record tag;
  • then there is/are some optional FOOTER tag(s); footer tags begin and end completely after record tags.

Constructor parameters explicitly list which tags are of what kind.
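The classification rules above can be sketched as a small function. This is an illustration under assumptions, not the Reader's actual code: the classify_tag name and its parent argument are hypothetical, and "one level under a brace tag" is interpreted as "the direct parent is a brace tag". The tag names are taken from the sample configuration in this document.

```python
def classify_tag(name, parent, bratags, headtags, rectags, foottags):
    """Return 'brace', 'head', 'record', 'foot', or None for other tags."""
    if name in bratags:
        return "brace"
    if name in headtags:
        return "head"
    if name in foottags:
        return "foot"
    if rectags:
        # Record tags are listed explicitly.
        return "record" if name in rectags else None
    # Empty rectags: any remaining direct child of a brace tag is a record.
    return "record" if parent in bratags else None


bratags = ["PadnDrawNbrs"]
headtags = ["SiteData", "rec0Control"]
rectags = ["Gampdf_winNbrs"]
foottags = []

kinds = [
    classify_tag("PadnDrawNbrs", None, bratags, headtags, rectags, foottags),
    classify_tag("SiteData", "PadnDrawNbrs", bratags, headtags, rectags, foottags),
    classify_tag("Gampdf_winNbrs", "PadnDrawNbrs", bratags, headtags, rectags, foottags),
]
```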

TODO: The text inside brace, header and footer tags is discarded (only attributes are passed to Writer).

TODO: The header tags between record tags are discarded (only the ones before the first record tag are passed to Writer).

TODO: This class does not support CDATA inside XML.

Parameters are usually passed from YAML file as subkeys of Reader:CArg key.

Parameters:
  • bratags – list of tag names that will be treated as brace tags (see above).
  • headtags – list of tag names that will be treated as header tags (see above).
  • rectags – list of tag names that will be treated as record tags (see above).
  • foottags – list of tag names that will be treated as footer tags (see above).
  • log_prog_step – log an info message each time this number of records has been processed; no progress messages are logged if this key is 0 or the logging level is set higher than INFO.

For more detailed descriptions see Configuration keys.
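The behaviour described for log_prog_step can be sketched with the standard logging module. The log_progress helper, the logger name and the message text are hypothetical; only the rule itself (log every log_prog_step records, never when the value is 0 or the level is above INFO) comes from the parameter description.

```python
import logging


def log_progress(logger, recno, log_prog_step):
    """Log a progress message every log_prog_step records.

    Returns True when a message was logged, which makes the rule easy to check.
    """
    if log_prog_step <= 0:
        return False  # progress logging switched off
    if recno % log_prog_step != 0:
        return False  # not on a step boundary
    if not logger.isEnabledFor(logging.INFO):
        return False  # logging level is higher than INFO
    logger.info("Processed %d records", recno)
    return True


logger = logging.getLogger("datconv.sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logged = sum(log_progress(logger, n, 10) for n in range(1, 26))
```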

Process(inpath, outpath=None, rfrom=1, rto=0)[source]

Parameters are usually passed from YAML file as subkeys of Reader:PArg key.

Parameters:
  • inpath – Path to input file.
  • outpath – Path to output file passed to Writer (fall-back if output connector is not defined).
  • rfrom-rto – specifies scope of records to be processed.

For more detailed descriptions see Configuration keys.

datconv.readers.dcijson_events module

This module implements Datconv Reader which reads data from JSON file.

exception datconv.readers.dcijson_events.FilterBreak[source]

Bases: Exception

Exception class to support Reader.process break issued from Filter class.

exception datconv.readers.dcijson_events.ToLimitBreak[source]

Bases: Exception

Exception class to support Reader.process break caused by reaching configured record limit.

class datconv.readers.dcijson_events.DCReader(mode=3, rec_tag='rec', log_prog_step=0, backend=None)[source]

Bases: object

This Datconv Reader class is a utility class that helps discover the structure of a JSON data file. It returns the events generated by the Python ijson JSON parser.

Example records returned by this reader (mode == 3):

Input:

{
  "PadnDrawNbrs": {
    "cdc": 5019,
    "product": "addn"
  }
}

Output (datconv.writers.dccsv module):

prefix , event , value 
item , start_map , None 
item , map_key , PadnDrawNbrs 
item.PadnDrawNbrs , start_map , None 
item.PadnDrawNbrs , map_key , cdc 
item.PadnDrawNbrs.cdc , number , 5019 
item.PadnDrawNbrs , map_key , product 
item.PadnDrawNbrs.product , string , addn 
item.PadnDrawNbrs , end_map , None 
item , end_map , None 

Usage instructions of ijson package:

Parameters are usually passed from YAML file as subkeys of Reader:CArg key.

Parameters:
  • mode – returns: 1-only unique prefixes; 2-unique (prefix,event) pairs; 3-all events (including data).
  • rec_tag – name or tag to be placed as record marker.
  • log_prog_step – log an info message each time this number of records has been processed; no progress messages are logged if this key is 0 or the logging level is set higher than INFO.
  • backend

    backend used by ijson package to parse json file, possible values:

    yajl2_cffi - requires yajl2 C library and cffi Python package to be installed in the system;

    yajl2 - requires yajl2 C library to be installed in the system;

    None - uses default, Python only backend.

For more detailed descriptions see Configuration keys.
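The three values of mode can be illustrated on the event list from the example output above. This is a sketch only: the filter_events function is hypothetical, it works on an already collected list of (prefix, event, value) tuples rather than on ijson directly, and the exact shape of the records produced by the real Reader in modes 1 and 2 is an assumption.

```python
def filter_events(events, mode=3):
    """Yield events according to mode.

    mode 1: each distinct prefix once
    mode 2: each distinct (prefix, event) pair once
    mode 3: every event, including its value
    """
    seen = set()
    for prefix, event, value in events:
        if mode == 3:
            yield (prefix, event, value)
            continue
        key = prefix if mode == 1 else (prefix, event)
        if key not in seen:
            seen.add(key)
            yield key


# Events copied from the example output above.
events = [
    ("item", "start_map", None),
    ("item", "map_key", "PadnDrawNbrs"),
    ("item.PadnDrawNbrs", "start_map", None),
    ("item.PadnDrawNbrs", "map_key", "cdc"),
    ("item.PadnDrawNbrs.cdc", "number", 5019),
    ("item.PadnDrawNbrs", "map_key", "product"),
    ("item.PadnDrawNbrs.product", "string", "addn"),
    ("item.PadnDrawNbrs", "end_map", None),
    ("item", "end_map", None),
]

prefixes = list(filter_events(events, mode=1))
pairs = list(filter_events(events, mode=2))
```

Modes 1 and 2 are useful on large files, where the full event stream is long but the set of distinct prefixes stays small.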

Process(inpath, outpath=None, rfrom=1, rto=0)[source]

Parameters are usually passed from YAML file as subkeys of Reader:PArg key.

Parameters:
  • inpath – Path to input file.
  • outpath – Path to output file passed to Writer (fall-back if output connector is not defined).
  • rfrom-rto – specifies scope of records to be processed.

For more detailed descriptions see Configuration keys.

datconv.readers.dcijson_keys module

This module implements Datconv Reader which reads data from JSON file.

exception datconv.readers.dcijson_keys.FilterBreak[source]

Bases: Exception

Exception class to support Reader.process break issued from Filter class.

exception datconv.readers.dcijson_keys.ToLimitBreak[source]

Bases: Exception

Exception class to support Reader.process break caused by reaching configured record limit.

class datconv.readers.dcijson_keys.DCReader(headkeys=[], reckeys=[], footkeys=[], log_prog_step=0, backend=None)[source]

Bases: object

This Datconv JSON Reader class uses the ijson SAX-type parser to read and interpret a JSON file. It assumes that the input file contains an array of JSON objects and that data records are values of some key(s) inside those objects.

Example (headkeys = [], reckeys = [], footkeys = []):

Input:

[
{
  "PadnDrawNbrs": {
    "cdc": 5019,
    "product": "addn"
  }
},
{
  "SiteData": {
    "siteId": 38
  },
  "rec0Control": {
    "curDraw": 5
  }
}
]

Output (datconv.writers.dcxml module):

<Datconv>
<PadnDrawNbrs>
    <cdc>5019</cdc>
    <product>addn</product>
</PadnDrawNbrs>
<SiteData>
    <siteId>38</siteId>
</SiteData>
<rec0Control>
    <curDraw>5</curDraw>
</rec0Control>
</Datconv>

Parameters are usually passed from YAML file as subkeys of Reader:CArg key.

Parameters:
  • headkeys – list of key names that will be passed to Writer as header.
  • reckeys – list of key names that will be treated as records. If empty, all highest-level keys that are not headers or footers are passed to Writer as records.
  • footkeys – list of key names that will be passed to Writer as footer.
  • log_prog_step – log an info message each time this number of records has been processed; no progress messages are logged if this key is 0 or the logging level is set higher than INFO.
  • backend

    backend used by ijson package to parse json file, possible values:

    yajl2_cffi - requires yajl2 C library and cffi Python package to be installed in the system;

    yajl2 - requires yajl2 C library to be installed in the system;

    None - uses default, Python only backend.

For more detailed descriptions see Configuration keys.
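How headkeys, reckeys and footkeys partition the input can be sketched as follows. For brevity this sketch loads the whole document with the standard json module, whereas the Reader itself streams the file with ijson; the split_by_keys function is a hypothetical name.

```python
import json


def split_by_keys(json_text, headkeys=(), reckeys=(), footkeys=()):
    """Split the top-level keys of each array element into three groups."""
    header, records, footer = [], [], []
    for obj in json.loads(json_text):
        for key, value in obj.items():
            if key in headkeys:
                header.append((key, value))
            elif key in footkeys:
                footer.append((key, value))
            elif not reckeys or key in reckeys:
                # Empty reckeys: everything that is not header or footer.
                records.append((key, value))
    return header, records, footer


# Input copied from the example above.
doc = """
[
{"PadnDrawNbrs": {"cdc": 5019, "product": "addn"}},
{"SiteData": {"siteId": 38}, "rec0Control": {"curDraw": 5}}
]
"""

header, records, footer = split_by_keys(doc)
record_names = [name for name, _ in records]
```

With all three lists empty, every top-level key becomes a record, which matches the three elements under Datconv in the example output.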

Process(inpath, outpath=None, rfrom=1, rto=0)[source]

Parameters are usually passed from YAML file as subkeys of Reader:PArg key.

Parameters:
  • inpath – Path to input file.
  • outpath – Path to output file passed to Writer (fall-back if output connector is not defined).
  • rfrom-rto – specifies scope of records to be processed.

For more detailed descriptions see Configuration keys.

datconv.readers.dcijson module

This module implements Datconv Reader which reads data from JSON file.

exception datconv.readers.dcijson.FilterBreak[source]

Bases: Exception

Exception class to support Reader.process break issued from Filter class.

exception datconv.readers.dcijson.ToLimitBreak[source]

Bases: Exception

Exception class to support Reader.process break caused by reaching configured record limit.

class datconv.readers.dcijson.DCReader(rec_tag='rec', log_prog_step=0, backend=None)[source]

Bases: object

This Datconv JSON Reader class uses the ijson SAX-type parser to read and interpret a JSON file. It assumes that the input file contains an array of JSON objects, and every such object is passed as a record to Writer. This Reader always passes an empty header and footer to Writer.

Example:

Input:

[
{
  "PadnDrawNbrs": {
    "cdc": 5019,
    "product": "addn"
  }
},
{
  "SiteData": {
    "siteId": 38
  },
  "rec0Control": {
    "curDraw": 5
  }
}
]

Output (datconv.writers.dcxml module):

<Datconv>
<rec>
    <PadnDrawNbrs>
        <cdc>5019</cdc>
        <product>addn</product>
    </PadnDrawNbrs>
</rec>
<rec>
    <SiteData>
        <siteId>38</siteId>
    </SiteData>
    <rec0Control>
        <curDraw>5</curDraw>
    </rec0Control>
</rec>
</Datconv>

Parameters are usually passed from YAML file as subkeys of Reader:CArg key.

Parameters:
  • rec_tag – name or tag to be placed as record marker.
  • log_prog_step – log an info message each time this number of records has been processed; no progress messages are logged if this key is 0 or the logging level is set higher than INFO.
  • backend

    backend used by ijson package to parse json file, possible values:

    yajl2_cffi - requires yajl2 C library and cffi Python package to be installed in the system;

    yajl2 - requires yajl2 C library to be installed in the system;

    None - uses default, Python only backend.

For more detailed descriptions see Configuration keys.
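The mapping from array elements to records wrapped in rec_tag can be sketched with the standard library. This is not the Reader's implementation: it loads the whole file with json instead of streaming with ijson, it builds the XML directly instead of handing records to a Writer, and it handles only nested objects and scalar values. The objects_to_xml function is a hypothetical name, and the Datconv root tag is taken from the example output above.

```python
import json
import xml.etree.ElementTree as ET


def _append(parent, key, value):
    """Add value under parent as an element named key."""
    node = ET.SubElement(parent, key)
    if isinstance(value, dict):
        for child_key, child_value in value.items():
            _append(node, child_key, child_value)
    else:
        node.text = str(value)


def objects_to_xml(json_text, rec_tag="rec"):
    """Wrap every object of a JSON array in a rec_tag element."""
    root = ET.Element("Datconv")
    for obj in json.loads(json_text):
        rec = ET.SubElement(root, rec_tag)
        for key, value in obj.items():
            _append(rec, key, value)
    return ET.tostring(root, encoding="unicode")


xml_text = objects_to_xml('[{"PadnDrawNbrs": {"cdc": 5019, "product": "addn"}}]')
```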

Process(inpath, outpath=None, rfrom=1, rto=0)[source]

Parameters are usually passed from YAML file as subkeys of Reader:PArg key.

Parameters:
  • inpath – Path to input file.
  • outpath – Path to output file passed to Writer (fall-back if output connector is not defined).
  • rfrom-rto – specifies scope of records to be processed.

For more detailed descriptions see Configuration keys.

datconv.readers.dccsv module

This module implements Datconv Reader which reads data from CSV file.

class datconv.readers.dccsv.DCReader(columns='item', strip=False, csv_opt=None)[source]

Bases: object

This Datconv Reader class reads data from a CSV file.

Parameters are usually passed from YAML file as subkeys of Reader:CArg key.

Parameters:
  • columns

    this parameter may be one of 3 possible types:

    if it is positive number, it specifies line number in input file that stores column names.

    if it is a list, it directly specifies column names in input file.

    if it is string it stands for column name prefix, i.e. columns will have names <prefix>1, <prefix>2, …

  • strip – if True, strips white spaces from values
  • csv_opt – dictionary with csv writer options. See documentation of csv standard Python library. If None, Reader tries to recognize format using csv.Sniffer class.

For more detailed descriptions see conf_template.yaml file in this module folder.
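The three accepted types of the columns parameter can be sketched as follows. The resolve_columns function is hypothetical, and two details are assumptions of this sketch: when columns is a line number, the data is taken to start on the following line, and generated names cover the widest row.

```python
import csv
import io


def resolve_columns(columns, rows):
    """Return (column_names, data_rows) for parsed CSV rows."""
    if isinstance(columns, int):
        # 1-based number of the line that stores the column names.
        return rows[columns - 1], rows[columns:]
    if isinstance(columns, list):
        # Column names are given directly; every line is data.
        return columns, rows
    # A string is a name prefix: <prefix>1, <prefix>2, ...
    width = max((len(row) for row in rows), default=0)
    return [f"{columns}{i}" for i in range(1, width + 1)], rows


rows = list(csv.reader(io.StringIO("name,qty\napple,3\npear,5\n")))

names_from_line, data = resolve_columns(1, rows)
names_from_prefix, _ = resolve_columns("item", rows)
```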

Process(inpath, outpath=None, rfrom=1, rto=0)[source]

Parameters are usually passed from YAML file as subkeys of Reader:PArg key.

Parameters:
  • inpath – Path to input file.
  • outpath – Path to output file passed to Writer (fall-back if output connector is not defined).
  • rfrom-rto – specifies scope of records to be processed.

For more detailed descriptions see Configuration keys.

Configuration keys

Listing of all possible configuration keys to be used with readers contained in this package.

Sample values are given below; if a key is not specified in the configuration file, then its default value is assumed.

Reader: 
    Module: datconv.readers.dcxml
    CArg:
        # List of tag names that will be treated as brace tags (see class description in source or pydoc).
        # default: []
        bratags:  [PadnDrawNbrs]

        # List of tag names that will be treated as header tags (see class description in source or pydoc).
        # Note: tags listed here will be placed in header passed to Writer; header tags not listed here will be silently skipped.
        # default: []
        headtags: [SiteData, rec0Control]

        # List of tag names that will be treated as record tags (see class description in source or pydoc).
        # If list of record tags is empty then every tag which is one level under brace tag and which is not head nor foot tag is treated as record tag.
        # default: []
        rectags:  [Gampdf_winNbrs]

        # List of tag names that will be treated as footer tags (see class description in source or pydoc).
        # Note: tags listed here will be placed in footer passed to Writer; footer tags not listed here will be silently skipped.
        # default: []
        foottags: []

        # Log info message after this number of records
        # If this value is zero no progress logging is done.
        # default: 0
        log_prog_step: 10000

    PArg:
        # Path to input file
        # Obligatory parameter
        inpath:  ../GET-Data/cdc_5019/AddnDrawNbrs_c5019_s38.xml
        
        # Path to output file passed to Writer (fall-back if output connector is not defined)
        # default: none (use defined output connector)
        outpath: out/AddnDrawNbrs_c5019_s38.xml
        
        # Start passing records to Filter and Writer from this record
        # default: 1
        rfrom:    1
        
        # Stop process on this record; if zero, process up to last record.
        # default: 0
        rto:      20

Reader: 
    Module: datconv.readers.dcijson_events
    CArg:
        # Returns: 1-only unique prefixes; 2-unique (prefix,event) pairs; 3-all events (including data).
        # default: 3
        mode: 3
        
        # Name or tag to be placed as record marker.
        # default: rec
        rec_tag: rec

        # Log info message after this number of records
        # If this value is zero no progress logging is done.
        # default: 0
        log_prog_step: 10000

        # Backend used by ijson package to parse json file, possible values:
        # - yajl2_cffi: requires yajl2 C library and cffi Python package to be installed in the system;
        # - yajl2 - requires yajl2 C library to be installed in the system;
        # - null - uses default, Python only backend.
        # default: null
        backend: yajl2_cffi
        
    PArg:
        # Path to input file
        # Obligatory parameter
        inpath:  ../GET-Data/cdc_5019/AddnDrawNbrs_c5019_s38.json
        
        # Path to output file passed to Writer (fall-back if output connector is not defined)
        # default: none (use defined output connector)
        outpath: out/AddnDrawNbrs_c5019_s38.json
        
        # Start passing records to Filter and Writer from this record
        # default: 1
        rfrom:    1
        
        # Stop process on this record; if zero, process up to last record.
        # default: 0
        rto:      20

Reader: 
    Module: datconv.readers.dcijson_keys
    CArg:
        # List of key names that will be passed to Writer as header.
        # default: []
        headkeys: [SiteData, rec0Control]

        # List of key names that will be treated as records. If empty, all highest-level keys that are not headers or footers are passed to Writer as records.
        # default: []
        reckeys:  [Gampdf_winNbrs]

        # List of  key names that will be passed to Writer as footer.
        # default: []
        footkeys: []

        # Log info message after this number of records
        # If this value is zero no progress logging is done.
        # default: 0
        log_prog_step: 10000
        
        # Backend used by ijson package to parse json file (see above):
        # default: null
        backend: yajl2_cffi

    PArg:
        # Path to input file
        # Obligatory parameter
        inpath:  ../GET-Data/cdc_5019/AddnDrawNbrs_c5019_s38.json
        
        # Path to output file passed to Writer (fall-back if output connector is not defined)
        # default: none (use defined output connector)
        outpath: out/AddnDrawNbrs_c5019_s38.json
        
        # Start passing records to Filter and Writer from this record
        # default: 1
        rfrom:    1
        
        # Stop process on this record; if zero, process up to last record.
        # default: 0
        rto:      20

Reader: 
    Module: datconv.readers.dcijson
    CArg:
        # Name or tag to be placed as record marker.
        # default: rec
        rec_tag: rec

        # Log info message after this number of records
        # If this value is zero no progress logging is done.
        # default: 0
        log_prog_step: 10000
        
        # Backend used by ijson package to parse json file (see above):
        # default: null
        backend: yajl2_cffi

    PArg:
        # Path to input file
        # Obligatory parameter
        inpath:  ../GET-Data/cdc_5019/AddnDrawNbrs_c5019_s38.json
        
        # Path to output file passed to Writer (fall-back if output connector is not defined)
        # default: none (use defined output connector)
        outpath: out/AddnDrawNbrs_c5019_s38.json
        
        # Start passing records to Filter and Writer from this record
        # default: 1
        rfrom:    1
        
        # Stop process on this record; if zero, process up to last record.
        # default: 0
        rto:      20

Reader: 
    Module:  datconv.readers.dccsv
    CArg:
        # this parameter may be one of 3 possible types:
        #    if it is positive number, it specifies line number in input file that stores column names.
        #    if it is a list, it directly specifies column names in input file.
        #       Specified names must be possible to use as XML tag names.
        #    if it is string it stands for column name prefix, i.e. columns will have names <prefix>1, <prefix>2, ...
        # default: 'item'
        columns: 1
        
        # if True, strips white spaces from values
        # default: false
        strip: true

        # Python csv writer class constructor options. See documentation of csv standard Python library.
        # Caution: Escape characters must be contained in double quotes ('\n' will not work).
        # If null, Reader tries to recognize format using csv.Sniffer class.
        # default: null
        csv_opt:
            lineterminator: "\n"

    PArg:
        # Path to input file
        # Obligatory parameter
        inpath:  ../GET-Data/cdc_5019/AddnDrawNbrs_c5019_s38.csv
        
        # Path to output file passed to Writer (fall-back if output connector is not defined)
        # default: none (use defined output connector)
        outpath: out/AddnDrawNbrs_c5019_s38.json
        
        # Start passing records to Filter and Writer from this record
        # default: 1
        rfrom:    1
        
        # Stop process on this record; if zero, process up to last record.
        # default: 0
        rto:      20
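How the CArg and PArg subkeys reach a reader can be sketched as keyword-argument unpacking. This is an assumption about the mechanism rather than a copy of the datconv.py script: the run_reader function is hypothetical, the configuration is written as a plain dictionary instead of being loaded from YAML, and the DCReader class below is a stand-in that only records what it receives.

```python
class DCReader:
    """Stand-in for datconv.readers.dccsv.DCReader (same signatures)."""

    def __init__(self, columns="item", strip=False, csv_opt=None):
        self.cargs = {"columns": columns, "strip": strip, "csv_opt": csv_opt}
        self.pargs = None

    def Process(self, inpath, outpath=None, rfrom=1, rto=0):
        self.pargs = {"inpath": inpath, "outpath": outpath,
                      "rfrom": rfrom, "rto": rto}


def run_reader(conf, reader_class):
    """Build the reader from CArg and run it with PArg."""
    section = conf["Reader"]
    # Missing or empty sections fall back to the defaults in the signatures.
    reader = reader_class(**(section.get("CArg") or {}))
    reader.Process(**(section.get("PArg") or {}))
    return reader


# Equivalent of a YAML Reader section, written as a dictionary.
conf = {
    "Reader": {
        "Module": "datconv.readers.dccsv",
        "CArg": {"columns": 1, "strip": True},
        "PArg": {"inpath": "in.csv", "rto": 20},
    }
}

reader = run_reader(conf, DCReader)
```

Keys left out of the configuration simply keep the defaults declared in the method signatures, which is why every extra constructor parameter has to be a named one.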