datconv.writers package

This package contains datconv-compatible Writer classes for commonly used file formats.

Writer interface

This module contains a Datconv Writer skeleton class suitable as a starting point for new writers.

class datconv.writers._skeleton.DCWriter[source]

Bases: object

This class must be named exactly DCWriter. It is responsible for:

  • writing data to output file.

Additional constructor parameters may be added, but they all have to be named (keyword) parameters. Parameters are usually passed from the YAML file as subkeys of the Writer:CArg key.

setOutput(out)[source]

Obligatory method that must be defined in the Writer class. It is called by the main Datconv.Run() function before conversion begins and before any write* function is called.

Parameters:out – an instance of the datconv Output Connector class, set up according to the configuration file. If no Output Connector is defined in the configuration file, two fallbacks are checked: a) if Reader:PArg:outpath is defined, a file connector with the specified path is used; b) otherwise the standard output stream is used as output.

In some rare cases this method may be called multiple times (e.g. when converting a set of files). Initialization of variables related to the output file (like the output record counter) should be done here.

writeHeader(header)[source]

Obligatory method that must be defined in the Writer class. Writes the header to the output file (if it makes sense).

Parameters:header – the header as passed by the Reader (always a list, but the type of its elements is up to the Reader).
writeFooter(footer)[source]

Obligatory method that must be defined in the Writer class. Writes the footer to the output file (if it makes sense).

Parameters:footer – the footer as passed by the Reader (always a list, but the type of its elements is up to the Reader).
getHeader()[source]

Obligatory method that returns the header passed to writeHeader.

getFooter()[source]

Obligatory method that returns the footer passed to writeFooter.

writeRecord(record)[source]

Obligatory method that must be defined in the Writer class. Transforms the passed record to this Writer's specific format and passes it to the output connector, either as an object or as a string.

Parameters:record – an instance of the lxml.etree.ElementTree class, as passed by the Reader.
Returns:The transformed record that this Writer passed to an object-type connector.

See the Filter interface and the lxml package documentation for information on how to obtain structure and data from the record.
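Put together, the interface above can be sketched as a minimal, purely illustrative writer. The method names and responsibilities come from this documentation, while the bodies are placeholder logic, not datconv's actual code:

```python
class DCWriter:
    """Minimal sketch of a Datconv Writer following the interface above.

    Illustrative only: a real writer would serialize lxml records and
    push the result through the output connector.
    """

    def __init__(self):
        self._out = None
        self._header = None
        self._footer = None
        self._rec_no = 0

    def setOutput(self, out):
        # Called by Datconv.Run() before conversion begins; per-file
        # state such as the record counter is reset here.
        self._out = out
        self._rec_no = 0

    def writeHeader(self, header):
        self._header = header

    def writeFooter(self, footer):
        self._footer = footer

    def getHeader(self):
        return self._header

    def getFooter(self):
        return self._footer

    def writeRecord(self, record):
        # A real writer transforms `record` (an lxml element tree) to its
        # target format; here we just stringify it and count it.
        text = str(record)
        self._rec_no += 1
        return text
```

A real writer would serialize each lxml record in writeRecord and hand the result to the connector received in setOutput.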

datconv.writers.dccsv module

This module implements a Datconv Writer which saves data in the form of a CSV file. Supports connectors of type: STRING, LIST, ITERABLE.

class datconv.writers.dccsv.DCWriter(columns=None, simple_xpath=False, add_header=False, col_names=True, csv_opt=None)[source]

Bases: object

Please see constructor description for more details.

Parameters are usually passed from YAML file as subkeys of Writer:CArg key.

Parameters:
  • columns

    this parameter may be one of 4 possible types or None: if it is a string, it should be the path to a file that contains the specification of columns in the output file.

    if it is a list, it directly specifies the columns in the output file.

    if it is an integer, columns are added based on the first record.

    if it is None or a dictionary, columns in the output CSV file are generated automatically based on the contents of the input file. When this option is used, the number of columns in different records of the CSV file may vary, because new columns are added as they are discovered.

  • simple_xpath – determines whether simple xpaths are used in the column specification. See the dcxpaths Writer for a more detailed description.
  • add_header – if True, the generic header (as initialized by the Reader) is added as the first line of the output file.
  • col_names – if True, a line with column names (fields) is added before the data, or after the data (in case of the auto option).
  • csv_opt – dictionary with csv writer options. See documentation of csv standard Python library.

For more detailed descriptions see conf_template.yaml file in this module folder.
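The csv_opt dictionary maps directly onto the constructor options of Python's standard csv.writer. A minimal sketch of how such options behave (the option values here are examples, not datconv defaults):

```python
import csv
import io

# Hypothetical csv_opt dictionary, as it might appear under
# Writer:CArg:csv_opt; the keys are standard csv.writer options.
csv_opt = {'delimiter': ';', 'lineterminator': '\n'}

buf = io.StringIO()
writer = csv.writer(buf, **csv_opt)  # options unpacked into csv.writer
writer.writerow(['ISN', 'TIME'])
writer.writerow([1, '12:00'])
# buf.getvalue() == 'ISN;TIME\n1;12:00\n'
```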

datconv.writers.dcxml module

This module implements a Datconv Writer which saves data in the form of an XML file.

class datconv.writers.dcxml.DCWriter(pretty=True, encoding='unicode', cnt_tag=None, cnt_attr=None, add_header=True, add_footer=True)[source]

Bases: object

Please see constructor description for more details.

Constructor parameters are usually passed from YAML file as subkeys of Writer:CArg key.

Parameters:
  • pretty – this parameter is passed to the lxml.etree.tostring function. If True, the XML is formatted in a readable way (one tag per line); otherwise the full record is placed on one line (more compact, suitable for computers).
  • encoding – this parameter is passed to the lxml.etree.tostring function. It determines the encoding used in the output XML file. See the documentation of the codecs standard Python library for possible encodings. This parameter is ignored in Python 3, where unicode encoding is always used.
  • cnt_tag – tag name to store the record count; if not set, the record count will not be printed in the output footer.
  • cnt_attr – attribute of the cnt_tag tag to store the record count; if not set, the record count will be printed as tag text.
  • add_header – if True, generic header (as initialized by Reader) is added as first tag of output file.
  • add_footer – if True, generic footer (as initialized by Reader) is added as last tag of output file.

For more detailed descriptions see conf_template.yaml file in this module folder.
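To illustrate the encoding parameter: passing encoding='unicode' to the serializer makes it return a str instead of bytes. The sketch below uses the stdlib xml.etree.ElementTree for portability; datconv itself calls lxml.etree.tostring, which accepts the same arguments for this purpose:

```python
import xml.etree.ElementTree as ET

# Build a tiny record; datconv passes lxml trees, but the stdlib
# ElementTree API behaves analogously for this illustration.
rec = ET.Element('rec')
ET.SubElement(rec, 'ISN').text = '1'

# encoding='unicode' makes tostring() return a str rather than bytes,
# mirroring the Writer's default encoding setting.
compact = ET.tostring(rec, encoding='unicode')
# compact == '<rec><ISN>1</ISN></rec>'
```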

datconv.writers.dcxpaths module

This module implements a Datconv Writer which generates a list of fields (tags that have text) found in the scanned document. This is a helper Writer: it generates a text file that can be used as a configuration file (list of columns) for the CSV Writer. It may also be helpful if you only want to extract (e.g. to compare) the structure of an XML file.

The format of the output file is the following:
Field Name, Record Name, XPath, Default Value[, Type]
where:
Field Name - the name of a tag with text
Record Name - the name of the record (root tag) in which this field is contained
XPath - the path to the tag inside the XML structure, starting from the record root (but not containing the record name), in the form of an XPath expression
Default Value - a placeholder for a default value; this Writer leaves it empty
Type - the type of the data, guessed from the data (present only if the add_type option is set)

Generated entries are unique and sorted by Record Name and XPath. Supports connectors of type: STRING, LIST, ITERABLE.

class datconv.writers.dcxpaths.DCWriter(simple_xpath=False, ignore_rectyp=False, ignore_xpath=False, ignore_attr=False, add_header=True, add_type=False, rectyp_separator='_', colno=0)[source]

Bases: object

Please see constructor description for more details.

Constructor parameters are usually passed from YAML file as subkeys of Writer:CArg key.

Parameters:
  • simple_xpath – if True, the Writer generates xpaths relative to the record tag, and will not generate separate fields for replicated data (repeated tags; arrays) nor fields for tags’ attributes. The same setting must be applied in the CSV Writer if it uses a configuration file generated by this Writer.
  • ignore_rectyp – if True, the Writer joins fields with the same name contained in different records. The generated Field Name does not contain the record name prefix, and ‘*’ is placed in place of the record name.
  • ignore_xpath – if True, the Writer joins fields with the same name contained in different paths of the XML structure. The generated XPath is in the form ‘.//FieldName’.
  • ignore_attr – if True, the Writer will not generate fields for XML attributes. If simple_xpath is True, this option is automatically set to True.
  • add_header – add a header as the first line of the output.
  • add_type – add data type information guessed from the data.
  • rectyp_separator – separator in the generated column name between record type and column name (has effect if ignore_rectyp = false).
  • colno – this parameter exists for interface compatibility reasons; it has no meaning in this class.

For more detailed descriptions see conf_template.yaml file in this module folder.

checkXPath(record, ret_new=False)[source]

Helper function - it scans the record and finds new (not yet known) xpaths to add to the output.

Depending on the constructor's simple_xpath parameter, it calls either _checkXPathSimple or _checkXPath.

resetXPaths()[source]

Resets the class's internal structures (the list of found xpaths).

Typically called in Writer.setOutput when we are about to read a new file.

datconv.writers.dcjson module

This module implements a Datconv Writer which saves data in the form of a JSON file. Supports connectors of type: STRING, OBJECT (dict()), ITERABLE.

class datconv.writers.dcjson.DCWriter(add_header=True, add_footer=True, add_newline=True, convert_values=2, null_text='None', preserve_order=False, text_key='text', text_eliminate=True, with_prop=False, ignore_rectyp=False, json_opt=None)[source]

Bases: object

Please see constructor description for more details.

Parameters are usually passed from YAML file as subkeys of Writer:CArg key.

Parameters:
  • add_header – if True, the generic header (as initialized by the Reader) is added as the first object of the output file or stream - only in non-iteration mode.
  • add_footer – if True, the generic footer (as initialized by the Reader) is added as the last object of the output file or stream - only in non-iteration mode.
  • add_newline – if True, adds a newline character after each record.
  • convert_values – 0 - does not convert (all values are text); 1 - tries to convert values to int, bool or float (not quoted in the JSON file) - a little slower; 2 - like 1, but in addition checks whether int values can be stored in 64 bits; if not, they are stored as string values.
  • null_text – text that is converted to the JSON null value (applies if convert_values > 0).
  • preserve_order – if True, the order of keys in the JSON output matches the order in the source.
  • text_key – name of the key used to store XML text.
  • text_eliminate – if True, the XML text key is eliminated if the tag has no other components.
  • with_prop – if True, XML properties are saved in the JSON file.
  • ignore_rectyp – if True, the XML root tag of records (aka the record type) is not saved in the JSON file (simplifies the output layout when there is one record type).
  • json_opt – dictionary with json.dump() options. See the documentation of the json standard Python library.

For more detailed descriptions see conf_template.yaml file in this module folder.
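The convert_values/null_text behaviour described above can be sketched as a small standalone function. This is a hypothetical re-creation, not datconv's actual code; in particular, the exact bool spellings it accepts ('True'/'False') are an assumption:

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def convert_value(text, convert_values=2, null_text='None'):
    """Hypothetical sketch of the convert_values logic (not datconv code)."""
    if convert_values == 0:
        return text                    # mode 0: everything stays text
    if text == null_text:
        return None                    # becomes JSON null
    if text in ('True', 'False'):      # assumed bool spellings
        return text == 'True'
    try:
        n = int(text)
        if convert_values == 2 and not (INT64_MIN <= n <= INT64_MAX):
            return text                # too big for 64 bits: keep as string
        return n
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text                    # not numeric: keep as string
```

For example, '42' would become the int 42, '3.5' the float 3.5, 'None' a JSON null, while an integer larger than 64 bits would stay a quoted string under mode 2.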

Configuration keys

Listing of all possible configuration keys to be used with writers contained in this package.

Sample values are given; if a key is not specified in the configuration file, then the default value is assumed.

Writer: 
    Module: datconv.writers.dcxml
    CArg: 
        # If True, XML is formatted in a readable way (one tag per line),
        # otherwise the full record is placed on one line (more compact, suitable for computers).
        # default: true
        pretty:   true
        
        # Determines the encoding used in the output XML file.
        # Note that if encoding is set to ascii, some characters may be converted to XML/HTML-compatible special codes.
        # The most reliable way is to set unicode here (the default option).
        # See the documentation of the codecs standard Python library for possible encodings.
        # Note: This option is ignored in Python 3, where unicode encoding is always used, which produces a UTF-8 XML file.
        # default: utf8 (in Python 2), unicode (in Python 3 and above)
        encoding: unicode
        
        # Tag name to store the record count in the output footer; if not set, the record count will not be printed
        # default: null
        cnt_tag: Footer
        
        # Attribute of the cnt_tag tag to store the record count; if not set, the record count will be printed as tag text
        # default: null
        cnt_attr: tranCount
        
        # if True, generic header (as initialized by Reader) is added as first tag of output file
        # default: true
        add_header: true
        
        # if True, generic footer (as initialized by Reader) is added as last tag of output file
        # default: true
        add_footer: true

Writer: 
    Module: datconv.writers.dcxpaths
    CArg:
        # If true, the Writer generates xpaths relative to the record tag, and will not generate
        # separate fields for replicated data (repeated tags; arrays) nor fields for tags' attributes.
        # The same setting must be applied in writers.dccsv if it uses a configuration file generated by this Writer.
        # default: false
        simple_xpath: false
        
        # If true, the Writer joins fields with the same name contained in different records.
        # The generated Field Name does not contain the record name prefix; '*' is placed in place of the record name.
        # default: false
        ignore_rectyp: false
        
        # If true, the Writer joins fields with the same name contained in different paths of the XML structure.
        # The generated XPath is in the form './/FieldName' or '//FieldName' (depends on the simple_xpath property).
        # default: false
        ignore_xpath: false
        
        # If true, the Writer will not generate fields for XML attributes.
        # If simple_xpath is true, this option is automatically set to true.
        # default: false
        ignore_attr: false
        
        # if True, generic header (as initialized by Reader) is added as first line of output file
        # default: true
        add_header: true
        
        # If true, Writer will add data type information guessed from data.
        # default: false
        add_type: false
        
        # Separator in the generated column name between record type and column name (has effect if ignore_rectyp = false).
        # default: "_"
        rectyp_separator: "."
        
Writer: 
    Module: datconv.writers.dccsv
    CArg:
        # This parameter specifies the columns to be placed in the output CSV file.
        # It may be one of 4 possible types or null:
        # - string: path to a file that contains the specification of columns in the output file.
        #   This specification may be generated by (or based on file generated by) writers.dcxpaths. 
        #   See this module description for more details.
        #   Lines that begin with # sign in specification file are ignored.
        # - list: direct specification of columns in output file.
        #   It should be list of 4 element lists.
        #   Those 4 element lists are similar to lines in file specification described above.
        #   This option is suitable if we want very few columns.
        # - integer: assuming that all records have the same fields, add columns based on first record
        # - dictionary or null: this runs the writer in so-called auto-mode.
        #   In this mode columns in the output file are added automatically as they are found in the input file.
        #   Columns found in previous records are also placed, so the number of columns increases with consecutive records.
        #   It is possible to enforce a certain number of columns from the beginning to ensure an equal number of columns (see below).
        #   With this option column names are added (if configured - see below) at the end of the file.
        # default: null
        columns: out/CZEC8173.TMF.xpaths        # string: path to file that contain specification of columns in output file
        # or
        columns:                                # list: direct specification of columns in output file
            - ['ISN','*','ISN',null]
            - ['TIME','*','TIME',null]
        # or
        columns: 1                              # integer: assuming that all records have the same fields, adds columns based on first record
                                                # the number currently has no meaning (we advise placing 1 here).
        # or
        columns:                                # dictionary (auto) case
            ignore_rectyp: false                # like in writers.dcxpaths (see above)
            ignore_xpath:  false                # like in writers.dcxpaths (see above)
            ignore_attr:   false                # like in writers.dcxpaths (see above)
            colno:         160                  # enforce this number of columns from the beginning; they are filled with empty values (default: 0 - i.e. option not active)

        # Determines whether simple xpaths are used in the column specification (see the option description in writers.dcxpaths above).
        # This option actually determines whether the function lxml.etree.Element.find or .xpath is used (see the lxml documentation).
        # find is less capable but about 25% faster than xpath - hence this option.
        # default: false
        simple_xpath: false
       
        # If True, generic header (as initialized by Reader) is added as first line of output file.
        # default: false
        add_header: false

        # If True, line with column names (fields) is added before data or after data (in case of auto option).
        # default: true
        col_names: true

        # Python csv writer class constructor options. See the documentation of the csv standard Python library.
        # Caution: Escape characters must be contained in double quotes ('\n' will not work).
        # default: null
        csv_opt:
            lineterminator: "\n"
            
Writer:
    Module: datconv.writers.dcjson
    CArg:
        # If True, generic header (as initialized by Reader) is added as first object of output file.
        # default: true
        add_header: true
        
        # If True, generic footer (as initialized by Reader) is added as last object of output file.
        # default: true
        add_footer: true
        
        # If True, adds newline character after each record.
        # default: true
        add_newline: true
        
        # 0 - does not convert (all values are text)
        # 1 - tries to convert values to int, bool or float (not quoted in the JSON file) - a little slower
        # 2 - like 1, but in addition checks whether int values fit in 64 bits; if not, stores them as string values
        # default: 2
        convert_values: 2
        
        # Text that is converted to the JSON null value (applies if convert_values > 0)
        # default: 'None'
        null_text: ''
        
        # If True, the order of keys in the JSON output matches the order in the source
        # default: false
        preserve_order: false
        
        # Name of key to store XML text
        # default: 'text'
        text_key: '_text_'
        
        # If True, XML text key will be eliminated if there are no other tag components
        # default: true
        text_eliminate: true
        
        # If True, XML properties are being saved in JSON file
        # default: false
        with_prop: true
        
        # If True, XML root tag for records (aka record type) will not be saved in JSON file 
        # (simplifies output layout in case there is one record type)
        # default: false
        ignore_rectyp: true
        
        # Dictionary with json.dump() options. See documentation of json standard Python library.
        # default: null
        json_opt: 
            indent: 2