Negotiating Unicode and UTF-8

February 3, 2000

Question from: ZIG

Question:

How may character set negotiation be used to negotiate Unicode with UTF-8 encoding?

Background:

Those familiar with Unicode and UTF-8 should skip this background and proceed to the "response".
ISO 10646 defines the Universal Character Set (UCS) and assigns to each character an integer value, its "code". Four bytes are used to represent codes, but the most significant bit is always zero, so the code range is potentially 0000 0000 to 7FFF FFFF hex. The Unicode Standard contains all the same characters and code points as ISO 10646.
The first 64K characters of the UCS are referred to as the "Basic Multilingual Plane", BMP. (More precisely, the highest code point for a character in the BMP is FFFD hex. The last two code points of the BMP are explicitly not characters.) Thus, although four bytes are allocated for code assignments, two bytes are sufficient to represent BMP characters.
A number of encodings are defined for the UCS; the most basic are UCS-4 and UCS-2, four and two bytes per character respectively, into which the actual character code value is encoded without transformation. UCS-2 is applicable only to the BMP.
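As a rough illustration of "encoded without transformation", the sketch below writes each character's code directly as a fixed-width integer. The function name encode_ucs and the choice of big-endian byte order are assumptions made for this example only; they are not taken from the clarification.

```python
def encode_ucs(text: str, width: int) -> bytes:
    """Write each character's code as a fixed-width integer (hypothetical helper).

    width=2 models UCS-2, width=4 models UCS-4.
    Big-endian byte order is an assumption of this sketch.
    """
    if width not in (2, 4):
        raise ValueError("width must be 2 (UCS-2) or 4 (UCS-4)")
    out = bytearray()
    for ch in text:
        code = ord(ch)
        # UCS-2 is applicable only to the BMP (codes up to FFFF hex).
        if width == 2 and code > 0xFFFF:
            raise ValueError(f"U+{code:04X} is outside the BMP")
        out += code.to_bytes(width, "big")
    return bytes(out)


ucs2 = encode_ucs("A\u00e9", 2)  # two bytes per character
ucs4 = encode_ucs("A\u00e9", 4)  # four bytes per character
print(ucs2.hex(), ucs4.hex())
```

Note that even the plain ASCII letter "A" occupies two or four bytes here, which is part of the motivation for the transformation formats described next.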
Even the two-byte encoding, though, is inappropriate for some applications, and thus several "UCS transformation formats" (UTFs) have been developed. The most notable for purposes of this clarification is UTF-8, though there is also UTF-16, a 16-bit form, along with its byte-order-specific variants UTF-16BE and UTF-16LE. Each transformation format specifies an algorithm mapping each character's UCS code to an encoded value for transmission.
UTF-8 encodes a UCS character as a sequence of one to six bytes; a BMP character requires at most three. Most significantly, when transformed via UTF-8, characters from the range 0000 0000 to 0000 007F (the US-ASCII repertoire) correspond to single 8-bit values (with the high-order bit 0) 00 to 7F. In other words, the 7-bit ASCII range is expressed as single bytes, for which UTF-8 is a null transformation.
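The ASCII property can be checked with a short sketch using Python's built-in UTF-8 codec. The helper name utf8_hex and the sample characters are chosen for illustration only.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as a hex string."""
    return ch.encode("utf-8").hex()


# Every US-ASCII code 00..7F encodes to the same single byte value.
ascii_is_unchanged = all(
    chr(code).encode("utf-8") == bytes([code]) for code in range(0x80)
)

# BMP characters above 7F take more than one byte.
print(utf8_hex("A"))       # US-ASCII letter, one byte
print(utf8_hex("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE, two bytes
print(utf8_hex("\u20ac"))  # EURO SIGN, three bytes
print(ascii_is_unchanged)
```

Because ASCII bytes pass through untouched, a record containing only ASCII characters is byte-for-byte identical whether it is labelled as ASCII or as UTF-8.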

Response:

The Character Set and Language Negotiation (3) definition should be used.

The first of the two ISO 10646 Object Identifiers, with ASN.1 identifier 'collections', is optional in this definition, and thus may be omitted for negotiation of Unicode and UTF-8, when negotiation of specific collections is not intended.
Notes:
'collections' is mandatory in the earlier definitions (Character Set and Language Negotiation (1) and Character Set and Language Negotiation (2)) and so these two definitions should not be used for negotiation of Unicode and UTF-8, unless negotiation of specific collections is intended.
When 'collections' is omitted, the value of 'implementationLevel', which is a component of the (omitted) object identifier, defaults to 3, which is the level required for Unicode. This default is specified in the Character Set and Language Negotiation (3) definition.
If 'collections' is supplied (because negotiation of specific collections is intended) then the value of 'implementationLevel' should be 3.

The value of the second of the two ISO 10646 Object Identifiers (with ASN.1 identifier 'encodingLevel') should be:
"1.0.10646.1.0.8"
The final component indicates the actual encoding level, where a value of 8 indicates UTF-8.
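The response can be summarised in a small sketch. It models the ISO 10646 part of a negotiation proposal as a plain Python dictionary, which is purely an assumption for illustration: a real implementation would build and BER-encode the ASN.1 structure from the Character Set and Language Negotiation (3) definition. The function names are hypothetical; only the object identifier value, the meaning of its final component, and the default implementation level of 3 come from the text above.

```python
UTF8_ENCODING_LEVEL_OID = "1.0.10646.1.0.8"  # value given in the response
ENCODING_LEVEL_PREFIX = "1.0.10646.1.0."
DEFAULT_IMPLEMENTATION_LEVEL = 3             # applies when 'collections' is omitted


def build_iso10646_proposal(collections=None):
    """Hypothetical dict form of the two ISO 10646 object identifiers."""
    proposal = {"encodingLevel": UTF8_ENCODING_LEVEL_OID}
    # 'collections' is optional in definition (3); include it only when
    # negotiation of specific collections is intended.
    if collections is not None:
        proposal["collections"] = collections
    return proposal


def proposes_utf8(proposal) -> bool:
    """True if the final component of 'encodingLevel' is 8 (UTF-8)."""
    oid = proposal["encodingLevel"]
    if not oid.startswith(ENCODING_LEVEL_PREFIX):
        return False
    return oid[len(ENCODING_LEVEL_PREFIX):] == "8"


proposal = build_iso10646_proposal()
print("collections" in proposal)  # omitted, so implementationLevel defaults to 3
print(proposes_utf8(proposal))
```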

Status: Approved (1/00)

Library of Congress