Qore Programming Language Reference Manual
0.9.4.7
The Qore language is character-encoding aware. All strings are assumed to have the default character encoding unless the program explicitly specifies another encoding for certain objects and operations. Every Qore string has a character encoding ID attached to it, so when another encoding is required, the Qore language will attempt an encoding translation.
Qore uses the operating system's iconv library functions to perform any encoding conversions.
Qore supports character encodings that are backwards compatible with 7-bit ASCII. This includes all ISO-8859-* character encodings, UTF-8, KOI8-R, KOI8-U, and KOI7, among others (see the table below: Known Character Encodings).
However, multibyte character encodings are currently only properly supported for UTF-8. For UTF-8 strings, length(), index(), rindex(), substr(), reverse(), the splice operator, print-formatting functions and methods that take format strings (with regard to field lengths), and the regular expression operators and functions all work with character offsets, which may differ from byte offsets. For all character encodings other than UTF-8, a relationship of 1 byte = 1 character is assumed.
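The difference between character offsets and byte offsets can be seen in any language that exposes both views of a string. The following is a minimal sketch in Python, not Qore; it only illustrates the concept and does not call any Qore function. The sample text is an arbitrary choice.

```python
# Illustration in Python (not Qore): character offsets vs. byte offsets.
text = "na\u00efve caf\u00e9"   # "naïve café": 10 characters, two of them non-ASCII

utf8_bytes = text.encode("utf-8")

char_count = len(text)         # length in characters
byte_count = len(utf8_bytes)   # length in bytes; each non-ASCII character here takes 2 bytes

char_offset = text.index("c")          # offset of "c" counted in characters
byte_offset = utf8_bytes.index(b"c")   # offset of "c" counted in bytes

# In a single-byte encoding such as ISO-8859-1, one byte is one character,
# so the byte length equals the character length.
latin1_bytes = text.encode("iso-8859-1")

print(char_count, byte_count, char_offset, byte_offset, len(latin1_bytes))
```

The character-based and byte-based results diverge as soon as a multibyte character appears before the position being measured, which is why the operations listed above work with character offsets for UTF-8 strings.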
Qore will accept any encoding name given to it, even if it is not a known encoding name or alias. In this case, Qore will tag the strings with this encoding and pass the user-defined encoding name to the iconv library when encodings must be converted. This allows programmers to use encodings known to the system's iconv library but unknown to Qore. In this case, Qore will assume that the strings are backwards compatible with ASCII, meaning that one character is represented by one byte and that the strings are null-terminated.
Note that when Qore matches an encoding name to a code or alias in the following table, the comparison is not case-sensitive.
Code | Aliases | Description |
ISO-8859-1 | ISO88591 , ISO8859-1 , ISO-88591 , ISO8859P1 , ISO81 , LATIN1 , LATIN-1 | latin-1, Western European character set |
ISO-8859-2 | ISO88592 , ISO8859-2 , ISO-88592 , ISO8859P2 , ISO82 , LATIN2 , LATIN-2 | latin-2, Central European character set |
ISO-8859-3 | ISO88593 , ISO8859-3 , ISO-88593 , ISO8859P3 , ISO83 , LATIN3 , LATIN-3 | latin-3, Southern European character set |
ISO-8859-4 | ISO88594 , ISO8859-4 , ISO-88594 , ISO8859P4 , ISO84 , LATIN4 , LATIN-4 | latin-4, Northern European character set |
ISO-8859-5 | ISO88595 , ISO8859-5 , ISO-88595 , ISO8859P5 , ISO85 | Cyrillic character set |
ISO-8859-6 | ISO88596 , ISO8859-6 , ISO-88596 , ISO8859P6 , ISO86 | Arabic character set |
ISO-8859-7 | ISO88597 , ISO8859-7 , ISO-88597 , ISO8859P7 , ISO87 | Greek character set |
ISO-8859-8 | ISO88598 , ISO8859-8 , ISO-88598 , ISO8859P8 , ISO88 | Hebrew character set |
ISO-8859-9 | ISO88599 , ISO8859-9 , ISO-88599 , ISO8859P9 , ISO89 , LATIN5 , LATIN-5 | latin-5, Turkish character set |
ISO-8859-10 | ISO885910 , ISO8859-10 , ISO-885910 , ISO8859P10 , ISO810 , LATIN6 , LATIN-6 | latin-6, Nordic character set |
ISO-8859-11 | ISO885911 , ISO8859-11 , ISO-885911 , ISO8859P11 , ISO811 | Thai character set |
ISO-8859-13 | ISO885913 , ISO8859-13 , ISO-885913 , ISO8859P13 , ISO813 , LATIN7 , LATIN-7 | latin-7, Baltic rim character set |
ISO-8859-14 | ISO885914 , ISO8859-14 , ISO-885914 , ISO8859P14 , ISO814 , LATIN8 , LATIN-8 | latin-8, Celtic character set |
ISO-8859-15 | ISO885915 , ISO8859-15 , ISO-885915 , ISO8859P15 , ISO815 , LATIN9 , LATIN-9 | latin-9, Western European with euro symbol |
ISO-8859-16 | ISO885916 , ISO8859-16 , ISO-885916 , ISO8859P16 , ISO816 , LATIN10 , LATIN-10 | latin-10, Southeast European character set |
KOI7 | n/a | Russian: Kod Obmena Informatsiey, 7 bit characters |
KOI8-R | KOI8R | Russian: Kod Obmena Informatsiey, 8 bit |
KOI8-U | KOI8U | Ukrainian: Kod Obmena Informatsiey, 8 bit |
US-ASCII | ASCII , USASCII | 7-bit ASCII character set |
UTF-8 | UTF8 | variable-width universal character set |
UTF-16 | UTF16 | variable-width universal character set based on a fundamental 2-byte character encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16 as the default character encoding in Qore |
UTF-16BE | UTF16BE | variable-width universal character set based on a fundamental 2-byte character encoding with big-endian encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16BE as the default character encoding in Qore |
UTF-16LE | UTF16LE | variable-width universal character set based on a fundamental 2-byte character encoding with little-endian encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16LE as the default character encoding in Qore |
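The name-matching rules described above (case-insensitive comparison against codes and aliases, with unknown names accepted as given) can be sketched as follows. This is a Python illustration, not Qore's implementation: `KNOWN_ENCODINGS` is a hypothetical subset of the table, and `resolve_encoding` is a name invented for this example.

```python
# Hypothetical subset of the table above; not Qore's actual data structure.
KNOWN_ENCODINGS = {
    "ISO-8859-1": ["ISO88591", "ISO8859-1", "ISO-88591", "ISO8859P1",
                   "ISO81", "LATIN1", "LATIN-1"],
    "KOI8-R": ["KOI8R"],
    "US-ASCII": ["ASCII", "USASCII"],
    "UTF-8": ["UTF8"],
}

# Map every code and alias, upper-cased, to its canonical code.
_LOOKUP = {}
for code, aliases in KNOWN_ENCODINGS.items():
    for name in [code, *aliases]:
        _LOOKUP[name.upper()] = code


def resolve_encoding(name):
    """Return (encoding_name, is_known) for a user-supplied encoding name."""
    canonical = _LOOKUP.get(name.upper())
    if canonical is not None:
        return canonical, True
    # Unknown names are kept exactly as given; per the text above, such a
    # name would be passed on to the iconv library when a conversion is needed.
    return name, False


print(resolve_encoding("latin1"))   # matches an alias regardless of case
print(resolve_encoding("CP1252"))   # not in this subset, returned unchanged
```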
UTF-16 is currently not well supported in Qore, because Qore's string support is based on the assumption that all strings are backwards-compatible with ASCII. UTF-16 is not, because of its minimum 2-byte character width and the possibility of embedded null bytes.
It's possible to generate string data in UTF-16 encoding (using Qore::convert_encoding()); note, however, that all strings generated this way will be tagged with a BOM (byte order mark) at the beginning of the string data (this is performed by libiconv).
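The embedded null bytes and the BOM mentioned above are properties of UTF-16 itself and can be observed outside Qore. The sketch below uses Python's own codecs rather than libiconv, so it shows analogous behavior and is not a description of Qore's output; the sample text is arbitrary.

```python
# Illustration in Python (not Qore): why UTF-16 breaks ASCII-compatible
# assumptions such as "no embedded null bytes".
text = "Qore"

utf16 = text.encode("utf-16")        # Python's generic UTF-16 codec prepends a BOM
utf16le = text.encode("utf-16-le")   # explicit byte order, no BOM
utf8 = text.encode("utf-8")

# The BOM is FF FE (little-endian) or FE FF (big-endian), depending on the platform.
has_bom = utf16[:2] in (b"\xff\xfe", b"\xfe\xff")

# Every ASCII character in UTF-16 carries a zero byte; UTF-8 has none here.
nul_in_utf16le = b"\x00" in utf16le
nul_in_utf8 = b"\x00" in utf8

# A routine that treats the first zero byte as a terminator stops after one byte,
# although the data is 8 bytes long.
c_style_length = utf16le.index(b"\x00")

print(has_bom, nul_in_utf16le, nul_in_utf8, c_style_length, len(utf16le))
```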
The following classes support parsing UTF-16 data by converting it to UTF-8 and processing the UTF-8 data:
The following classes support processing UTF-16 data natively:
Many string operations on UTF-16 data will provide invalid results due to the embedded nulls.
UTF-16LE encoding specifically)
The default character encoding for Qore is determined by environment variables.
First, the QORE_CHARSET environment variable is checked. If it is set, then this character encoding will be the default character encoding for the process. If not, then the LANG environment variable is checked. If a character encoding is specified in the LANG environment variable, then it will be used as the default character encoding. Otherwise, if no character encoding can be derived from the environment, UTF-8 is assumed.
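The lookup order described above can be sketched as a small function. This is a Python illustration, not Qore's implementation: the function name `default_encoding` is hypothetical, and the way the encoding is extracted from LANG (the part after the dot, with any `@modifier` removed) is an assumption based on the usual POSIX locale layout, not something this manual specifies.

```python
import os


def default_encoding(env=None):
    """Model the described order: QORE_CHARSET, then LANG, then UTF-8."""
    env = os.environ if env is None else env

    # 1. QORE_CHARSET takes precedence when it is set.
    charset = env.get("QORE_CHARSET")
    if charset:
        return charset

    # 2. Otherwise try to derive an encoding from LANG.
    #    Assumed layout: language_TERRITORY.codeset@modifier
    lang = env.get("LANG", "")
    if "." in lang:
        codeset = lang.split(".", 1)[1].split("@", 1)[0]
        if codeset:
            return codeset

    # 3. Nothing usable in the environment: fall back to UTF-8.
    return "UTF-8"


print(default_encoding({"QORE_CHARSET": "ISO-8859-2", "LANG": "en_US.UTF-8"}))
print(default_encoding({"LANG": "de_DE.ISO-8859-15@euro"}))
print(default_encoding({"LANG": "C"}))
```

Passing the environment as a dictionary keeps the sketch easy to try with different settings without modifying the real process environment.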
Character encodings are automatically converted by the Qore language when necessary. Encoding conversion errors will cause a Qore exception to be thrown. The character encoding conversions supported by Qore depend on the operating system's iconv library function.
The following is a non-exhaustive list of examples in Qore where character encoding processing is performed.
Character encoding conversions can be performed explicitly with the convert_encoding() function, and the encoding attached to a string can be checked with the get_encoding() function. If you have a string with an incorrect encoding tag and want to change the tag (without changing the actual bytes of the string), use the force_encoding() function.
get_default_encoding() returns the default encoding for the Qore process.
The Qore::SQL::Datasource, Qore::SQL::DatasourcePool, and Qore::SQL::SQLStatement classes will also translate character encodings to the encoding required by the database if necessary (this is actually the responsibility of the DBI driver for the database in question).
The Qore::File and Qore::Socket classes translate character encodings to the encoding specified for the object if necessary, as well as tagging strings received or read with the object's encoding.
The Qore::HTTPClient class will translate character encodings to the encoding specified for the object if necessary, as well as tag strings received with the object's encoding. Additionally, if an HTTP server response specifies a specific encoding to use, the encoding of strings read from the server will be automatically set to this encoding as well.
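The difference between converting a string's encoding and merely changing its encoding tag, as described for convert_encoding() and force_encoding() above, can be modeled with a string represented as bytes plus a tag. The following is a Python sketch under that assumption; `TaggedString`, `convert`, and `force` are hypothetical names for this illustration and are not Qore APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical model of a string: raw bytes plus an encoding tag."""
    data: bytes
    encoding: str


def convert(s, target):
    """Re-encode the text: the bytes may change, the characters stay the same."""
    return TaggedString(s.data.decode(s.encoding).encode(target), target)


def force(s, target):
    """Change only the tag: the bytes stay the same, the characters may change."""
    return TaggedString(s.data, target)


original = TaggedString("\u00e9".encode("utf-8"), "utf-8")   # "é" as two UTF-8 bytes
converted = convert(original, "iso-8859-1")
forced = force(original, "iso-8859-1")

print(converted.data, converted.data.decode(converted.encoding))
print(forced.data, forced.data.decode(forced.encoding))
```

In this model, converting keeps the character and changes the bytes, while forcing keeps the bytes and reinterprets them, which only gives a sensible result when the original tag was wrong.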