Qore CsvUtil Module Reference
1.6.2
The CsvUtil module provides functionality for parsing CSV-like files.
To use this module, use "%requires CsvUtil" in your code.
All the public symbols in the module are defined in the CsvUtil namespace.
Currently the module provides the following classes:
- CsvUtil::AbstractCsvIterator: the abstract base class for CSV-like data iterators
- CsvUtil::AbstractCsvWriter: the abstract base class for CSV-like data writers
- CsvUtil::CsvIterator: an iterator class for CSV-like data
- CsvUtil::CsvWriter: a writer class for CSV-like data
Furthermore, the following specialized classes are implemented based on the above and are provided for convenience and backwards-compatibility:
- CsvUtil::CsvDataIterator: an iterator class for CSV-like data in strings
- CsvUtil::CsvFileIterator: an iterator class for CSV-like data in files
- CsvUtil::CsvFileWriter: a writer class for CSV-like data in files
- CsvUtil::CsvStringWriter: a writer class for CSV-like data written to a string
Note that the CsvIterator class can be used to parse arbitrary text data; the field separator character can be specified in the constructor, as well as the quote character and end of line sequence. See the constructor documentation for more information.
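As a short illustration of those constructor options, the following sketch parses semicolon-separated data with single-quoted fields (the file name and data layout are made up for illustration; the "separator", "quote", and "eol" options shown here also apply to the other iterator classes):

%new-style
%requires CsvUtil

# parse data using ";" as the field separator and "'" as the quote character;
# an "eol" option can be given as well if the end-of-line sequence is fixed
const Opts = (
    "separator": ";",
    "quote"    : "'",
);

CsvFileIterator i("example-semicolon-file.csv", Opts);
while (i.next())
    printf("%y\n", i.getValue());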
Examples:
If "example-file.csv"
is:
UK,1234567890,"Sony, Xperia S",31052012
UK,1234567891,"Sony, Xperia S",31052012
UK,1234567892,"Sony, Xperia S",31052012
UK,1234567893,"Sony, Xperia S",31052012
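A minimal sketch that reads this file and writes the copy mentioned below could look as follows (this is illustrative code, not the original example; it assumes the default parsing options, and the "write_headers" writer option is used only to keep a generated header line out of the copy):

%new-style
%requires CsvUtil

const DataFile = "example-file.csv";

# no options: every value is returned as a string and header names are numeric
CsvFileIterator i(DataFile);
int line = 0;
while (i.next())
    printf("%d: %y\n", ++line, i.getValue());

# write the same records back out as CSV; assumes writeLine() accepts the
# record hash returned by the iterator
CsvFileWriter w("example-file-copy.csv", ("write_headers": False));
CsvFileIterator src(DataFile);
while (src.next())
    w.writeLine(src.getValue());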
The data is read verbatim, each value is returned as a string, header names are generated numerically; the output is:
1: {0: "UK", 1: "1234567890", 2: "Sony, Xperia S", 3: "31052012"} 2: {0: "UK", 1: "1234567891", 2: "Sony, Xperia S", 3: "31052012"} 3: {0: "UK", 1: "1234567892", 2: "Sony, Xperia S", 3: "31052012"} 4: {0: "UK", 1: "1234567893", 2: "Sony, Xperia S", 3: "31052012"}
Also the "example-file-copy.csv"
will contain data from the original file formatted as CSV.
If header names are provided and field types are specified, the output looks different:
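For example, a sketch with field definitions inferred from the output shown below (the field names match that output; the "DDMMYYYY" date format is an assumption based on the input data):

%new-style
%requires CsvUtil

const Opts = (
    "fields": (
        "cc"      : "string",
        "serno"   : "int",
        "desc"    : "string",
        # parse the last column as a date using the assumed input mask
        "received": ("type": "date", "format": "DDMMYYYY"),
    ),
);

CsvFileIterator i("example-file.csv", Opts);
int line = 0;
while (i.next())
    printf("%d: %y\n", ++line, i.getValue());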
Now the hash keys in each record returned are those given in the constructor, and the fields "serno" and "received" are given other data types; this produces:
1: {cc: "UK", serno: 1234567890, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)} 2: {cc: "UK", serno: 1234567891, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)} 3: {cc: "UK", serno: 1234567892, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)} 4: {cc: "UK", serno: 1234567893, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}
Use the "header_lines"
and "header_names"
options to automatically read the header names from the file if present. Use the "fields"
option to describe the fields and perform transformations on the data read. For more information, see the CsvFileIterator class.
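For a file whose first line holds the column names, a sketch could look like this (the file name is hypothetical):

%new-style
%requires CsvUtil

const Opts = (
    "header_lines": 1,    # the file starts with one header line
    "header_names": True, # use that line for the field names
);

CsvFileIterator i("example-file-with-header.csv", Opts);
while (i.next())
    printf("%y\n", i.getValue());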
The CsvUtil module supports multi-type records, meaning that different input lines may have different structures; input record type resolution is performed at run-time using predefined rules. CsvUtil classes have two constructor variants: a backwards-compatible variant, which accepts the field definitions in the options and supports only a single record type, and a new variant, which accepts the field specifications and the options as separate parameters.
When used with multi-type definitions, the CsvUtil::AbstractCsvIterator::getValue() method returns records in an extended form. In this case, the return value of CsvUtil::AbstractCsvIterator::getValue() is a hash with the following keys:
"type"
: the resolved record type name"record"
: a hash of record valuesAs this is not backwards compatible, this mode is enabled using the "extended_record"
option.
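A small sketch of consuming extended records (the file name and the single-record-type specification here are made up only to show the shape of the returned hash; a full multi-type example follows below):

%new-style
%requires CsvUtil

const Spec = ("rec": ("id": "int", "name": "string"));

CsvFileIterator i("some-file.csv", Spec, ("extended_record": True));
while (i.next()) {
    hash rec = i.getValue();
    # "type" holds the resolved record type name, "record" the field values
    printf("%s: %y\n", rec.type, rec.record);
}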
If no record type resolution rules or logic is defined, then record types are resolved automatically based on their unique number of fields. If the number of fields is not unique (i.e. two or more records have the same number of fields), then a rule must exist to resolve the record type.
Typically the value of a particular field determines the record type; however, multiple fields can also be used. Record type detection configuration is supplied with the "value" (field value equality test) or "regex" (regular expression test) keys in the field specification hash for the record in question. If multiple fields in a record definition have "value" or "regex" keys, then all of those fields must match the input data in order for the input line to match the record.
The above record type resolution logic is executed in AbstractCsvIterator::identifyTypeImpl(), which executes any "regex" or "value" tests on the input line in the order of the field definitions in the record description hash.
Record type resolution is performed as follows:
- "value": matches the full value of the field; if an integer "value" is given, then integer comparisons are performed, otherwise string comparisons are performed
- "regex": matches the full value of the field against a regular expression

When there are no record-matching keys in the field hashes for any record and the input records' field counts are not unique, then AbstractCsvIterator::identifyTypeImpl() must be overridden in a subclass to provide custom record-matching logic.
"regex"
and "value"
keys in a field specification hash"regex"
and "value"
keys), then all fields with this configuration must match for the record to be matchedIf "example-multi-file.csv"
is:
1,2016-01,20160401
2,123,"Sony, Xperia S",1,100
2,124,"Nokia, Lumia",2,150
3,250
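A sketch that produces the output shown below (the field specifications are inferred from that output; the record names, field names, field types, and the "YYYYMMDD" date mask are assumptions):

%new-style
%requires CsvUtil

# each record type carries a "value" test on its first field, so the type of
# every input line is resolved from that field (a "regex" test could be used
# instead); the field counts are also unique here, so automatic resolution
# would work as well
const Spec = (
    "header": (
        "type"      : ("type": "int", "value": 1),
        "invoice_no": "string",
        "date"      : ("type": "date", "format": "YYYYMMDD"),
    ),
    "item": (
        "type"   : ("type": "int", "value": 2),
        "item_no": "string",
        "item"   : "string",
        "pcs"    : "int",
        "price"  : "int",
    ),
    "footer": (
        "type" : ("type": "int", "value": 3),
        "total": "int",
    ),
);

# return each record together with its resolved type name
const Opts = ("extended_record": True);

CsvFileIterator i("example-multi-file.csv", Spec, Opts);
int line = 0;
while (i.next())
    printf("%d: %y\n", ++line, i.getValue());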
With multi-type record definitions and the "extended_record" option set, each input line is resolved to a record type and returned with named, converted field values; the output is:
1: {type: "header", record: {type: 1, invoice_no: "2016-01", date: 2016-04-01 00:00:00.000000 Fri +02:00 (CEST)}} 2: {type: "item", record: {type: 2, item_no: "123", item: "Sony, Xperia S", pcs: 1, price: 100}} 3: {type: "item", record: {type: 2, item_no: "124", item: "Nokia, Lumia", pcs: 2, price: 150}} 4: {type: "footer", record: {type: 3, total: 250}}
Also the "example-multi-file-copy.csv"
will contain data from the original file formatted as CSV.
Release Notes:
- "number_format" option to allow numbers with alternative decimal separators to be parsed and generated (issue 2806)
- "compat_force_empty_string" CSV parsing option to force "*string" fields with no value to return an empty string when parsing rather than NOTHING, for backwards compatibility with very early versions of CsvUtil (issue 2476)
- "eol", "datamap", and "info_log" options for CSV generation
- renamed options:
  - CsvUtil::AbstractCsvWriter: "date-format" is now "date_format", "optimal-quotes" is now "optimal_quotes", "verify-columns" is now "verify_columns", "write-headers" is now "write_headers"
  - CsvUtil::AbstractCsvIterator: "ignore-empty" is now "ignore_empty", "ignore-whitespace" is now "ignore_whitespace", "header-names" is now "header_names", "header-lines" is now "header_lines", "verify-columns" is now "verify_columns"
- new options: "header_reorder"; CsvUtil::AbstractCsvIterator: "date_format", "extended_record"
- new field specification hash keys: "index", "default", "header", "value", "regex"
- "format" field option when used with "*date" field types
- "tolwr" parser option
- "string" to "*string"
- "write-headers" option to CsvUtil::AbstractCsvWriter and subclasses to enable headers to be suppressed
- "optimal-quotes" option to CsvUtil::AbstractCsvWriter and subclasses to enable more efficient CSV output (now the default output option); to revert to the previous behavior (where all fields are quoted regardless of data type or content), set this option to False in the constructor
- "*int", "*float", "*number", and "*date" types
- "date" type handling with empty input; now returns 1970-01-01Z (use "*date" to map empty input to NOTHING)