Pack logtalk -- logtalk-3.85.0/tests/prolog/unicode/NOTES.md

This file is part of Logtalk https://logtalk.org/
SPDX-FileCopyrightText: 1998-2023 Paulo Moura <pmoura@logtalk.org>
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This directory contains work-in-progress test sets for Prolog Unicode support. Currently, three test sets are provided: builtins (for flags, built-in predicates, and stream properties), encodings (for UTF-8, UTF-16, and UTF-32 encodings, with and without a BOM), and syntax (for the \uXXXX and \UXXXXXXXX escape sequences). The encodings test set is only enabled for backends supporting all the above encodings (currently, CxProlog, XVM, SICStus Prolog, SWI-Prolog, and Trealla Prolog).

The tests are based on an extended version of the October 5, 2009 WG17 ISO Prolog Core revision standardization proposal, which specifies the following minimal language features:

  1. An encoding Prolog flag, allowing applications to query the default encoding for opening streams. When the Prolog system supports multiple encodings, the default encoding can be changed by setting this flag to a supported encoding.
  2. Encodings are represented by atoms named after the character set names registered by the Internet Assigned Numbers Authority (IANA), using the alias marked as the "(preferred MIME name)" when available:
    http://www.iana.org/assignments/character-sets

    For example, 'UTF-8', 'UTF-16LE', or 'UTF-32'.

  3. Two new open/4 predicate options, encoding(Atom) and bom(Boolean). The handling of these options depends on the mode argument, only applies to text files, and follows from the Unicode standard guidelines and current practice:
  • write mode: If an encoding/1 option is present, use the specified encoding, otherwise use the default encoding (which can be queried using the encoding flag). If a bom(true) option is present, write a BOM if the encoding is a Unicode encoding. If no bom/1 option is used, write a BOM if the encoding is UTF-16 or UTF-32 but not if the encoding is UTF-8, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, or `UTF-32BE`. If the encoding is UTF-16 or UTF-32, write the data big-endian.
  • append mode: If an encoding/1 option is present, use that encoding, otherwise use the default encoding (which can be queried using the encoding flag). Ignore bom/1 option if present and never write a BOM.
  • read mode: the default is bom(true), i.e. perform BOM detection and use the corresponding encoding if a BOM is found. If no BOM is detected, use the encoding/1 option if present and the default encoding otherwise. When a bom(false) option is present, no BOM detection is performed, an encoding/1 option is required if the file encoding is different from the default encoding, and a BOM at the beginning of the stream is to be interpreted as a ZERO WIDTH NO-BREAK SPACE (ZWNBSP).

    The bom/1 option is ignored when not using a Unicode encoding. The bom/1 and encoding/1 options are ignored when a type(binary) option is present.
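The read and write mode rules above can be modeled outside Prolog as two small decision functions. The following is a minimal sketch in Python, not an implementation from any Prolog system: the function names `read_mode_encoding` and `write_mode_bom` are hypothetical, `None` stands for an absent option, and reporting an endianness-specific name such as `UTF-16BE` after BOM detection is an assumption (the text above only says "the corresponding encoding").

```python
import codecs

# BOM byte sequences, checked longest first so that the UTF-32LE BOM
# (FF FE 00 00) is not mistaken for the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

UNICODE_ENCODINGS = {
    "UTF-8", "UTF-16", "UTF-16LE", "UTF-16BE",
    "UTF-32", "UTF-32LE", "UTF-32BE",
}


def read_mode_encoding(head, default, encoding=None, bom=True):
    """Return (encoding, bom_length) for a text file opened in read mode.

    `head` holds the first bytes of the file; `default` plays the role
    of the encoding flag; `encoding` and `bom` model the open/4 options.
    """
    if bom:
        for mark, name in BOMS:
            if head.startswith(mark):
                return name, len(mark)
    # No BOM found, or detection disabled: nothing is skipped, so any
    # leading U+FEFF stays in the data as an ordinary character.
    return (encoding or default), 0


def write_mode_bom(default, encoding=None, bom=None):
    """Return (encoding, write_bom) for a text file opened in write mode."""
    chosen = encoding or default
    if chosen not in UNICODE_ENCODINGS:
        return chosen, False  # bom/1 is ignored for non-Unicode encodings
    if bom is None:
        # Only the encodings with unspecified byte order get a BOM by default.
        return chosen, chosen in ("UTF-16", "UTF-32")
    return chosen, bom


print(read_mode_encoding(b"\xff\xfea\x00", "UTF-8"))
print(write_mode_bom("UTF-8", encoding="UTF-16"))
```

Append mode needs no function of its own in this model: the encoding is chosen as in write mode and no BOM is ever written.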

  4. The open/3 predicate (for text files) always performs BOM detection in read mode and uses the corresponding encoding if a BOM is found. Otherwise the default encoding is used (which can be queried using the encoding flag). In write mode, a BOM is written if the default encoding is UTF-16 or UTF-32 but not if the encoding is UTF-8, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, or `UTF-32BE`. If the encoding is UTF-16 or UTF-32, the data is written big-endian. In append mode, no BOM is written and the default encoding is used.
  5. Two new stream properties, encoding(Atom) and bom(Boolean), set from the open/3-4 calls and the default values as described above, that can be queried using the standard stream_property/2 predicate.
  6. The standard built-in predicates that must be Unicode aware include:
  7. Unicode code points can be specified in quoted atoms and double-quoted terms using the \uXXXX and \UXXXXXXXX escape sequences. The \uXXXX escape sequence, using four hexadecimal digits, covers the Basic Multilingual Plane (BMP). The \UXXXXXXXX escape sequence, using eight hexadecimal digits, covers the full Unicode code point space. The use of code points makes these escape sequences independent of both the chosen Unicode text encoding and the Prolog system internal character set (thus providing better portability than the ISO Prolog Core standard octal and hexadecimal escape sequences).
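The encoding independence of code point escapes can be illustrated outside Prolog. Python string literals happen to use the same \uXXXX and \UXXXXXXXX notation, so the sketch below is only an analogy, not Prolog syntax or behavior; the variable names `bmp` and `astral` are chosen for illustration. It shows one character inside the BMP and one outside it, and that the same code point survives a round trip through three encodings whose byte sequences all differ.

```python
bmp = "\u00e9"         # U+00E9, inside the BMP: four hex digits suffice
astral = "\U0001F600"  # U+1F600, outside the BMP: needs the eight-digit form

print(hex(ord(bmp)), hex(ord(astral)))  # prints: 0xe9 0x1f600

# The escape names a code point; the bytes depend only on the encoding
# chosen when the text is written out.
for name in ("utf-8", "utf-16-be", "utf-32-be"):
    data = astral.encode(name)
    print(name, data.hex(), data.decode(name) == astral)
# prints:
# utf-8 f09f9880 True
# utf-16-be d83dde00 True
# utf-32-be 0001f600 True
```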