Negotiating Unicode and UTF-8

February 3, 2000
Question from: ZIG
Question:
How may character set negotiation be used to negotiate Unicode with UTF-8 encoding?

Background:
Those familiar with Unicode and UTF-8 should skip this background and proceed to the "response".

ISO 10646 defines the Universal Character Set (UCS) and assigns to each character an integer value, its "code". Four bytes are used to represent codes; thus the code range is potentially 0000 0000 to FFFF FFFF hex. The Unicode Standard contains all the same characters and code points as ISO 10646.

The first 64K characters of the UCS are referred to as the "Basic Multilingual Plane", BMP. (More precisely, the highest code point for a character in the BMP is FFFD hex. The last two code points of the BMP are explicitly not characters.) Thus, although four bytes are allocated for code assignments, two bytes are sufficient to represent BMP characters.

A number of encodings are defined for the UCS; the most basic are UCS-4 and UCS-2, four and two bytes per character respectively, into which the actual character code value is encoded without transformation. UCS-2 is applicable only to the BMP.
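As an illustration of "encoded without transformation", here is a minimal Python sketch. The function names are hypothetical, and big-endian byte order (most significant byte first) is an assumption of this sketch, not something stated above.

```python
def ucs4_bytes(code: int) -> bytes:
    """Write a UCS code as four bytes, untransformed (big-endian assumed)."""
    return code.to_bytes(4, "big")


def ucs2_bytes(code: int) -> bytes:
    """Write a BMP code as two bytes, untransformed (big-endian assumed)."""
    if code > 0xFFFF:
        # UCS-2 has no way to represent codes outside the BMP.
        raise ValueError("UCS-2 can represent only BMP characters")
    return code.to_bytes(2, "big")


code = 0x00E9  # LATIN SMALL LETTER E WITH ACUTE
print(ucs4_bytes(code).hex())  # 000000e9
print(ucs2_bytes(code).hex())  # 00e9
```

In both cases the bytes are simply the code value itself, padded to the fixed width.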

Even the two-byte encoding, though, is inappropriate for some applications, and thus several "UCS transformation formats" (UTFs) have been developed. The most notable for purposes of this clarification is UTF-8, though there is also UTF-16, a 16-bit form, as well as its endian-designated forms, UTF-16BE and UTF-16LE. Each transformation format specifies an algorithm mapping each character's UCS code to an encoded value for transmission.

UTF-8 encodes a UCS character as a sequence of one to six bytes; a BMP character requires at most three. Most significantly, when transformed via UTF-8, characters from the range 0000 0000 to 0000 007F (the US-ASCII repertoire) correspond to single 8-bit values (with the high-order bit 0) 00 to 7F. In other words, the 7-bit ASCII range is expressed as single bytes, for which UTF-8 is a null transformation.
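The UTF-8 mapping can be sketched in Python as follows. This is an illustrative encoder for the one-to-six-byte scheme described above, not a reference implementation; the function name is hypothetical, and the 31-bit upper limit (7FFF FFFF) is an assumption taken from the six-byte maximum. Later revisions of UTF-8 limit sequences to four bytes.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one UCS code using the one-to-six-byte UTF-8 scheme."""
    if code < 0 or code > 0x7FFFFFFF:
        raise ValueError("code outside the range this sketch supports")
    if code < 0x80:
        return bytes([code])  # ASCII range: one byte, unchanged
    # (exclusive upper limit, lead-byte marker) for 2- to 6-byte sequences
    limits = [(0x800, 0xC0), (0x10000, 0xE0), (0x200000, 0xF0),
              (0x4000000, 0xF8), (0x80000000, 0xFC)]
    for extra, (limit, marker) in enumerate(limits, start=1):
        if code < limit:
            break
    tail = []
    for _ in range(extra):
        tail.append(0x80 | (code & 0x3F))  # continuation byte: 10xxxxxx
        code >>= 6
    # The remaining high bits go into the lead byte, after the marker bits.
    return bytes([marker | code] + tail[::-1])


print(utf8_encode(0x0041).hex())  # 41      (ASCII "A", null transformation)
print(utf8_encode(0x00E9).hex())  # c3a9    (two bytes)
print(utf8_encode(0x4E2D).hex())  # e4b8ad  (three bytes, still in the BMP)
```

Every code in 0000 to 007F comes back as the identical single byte, which is why existing ASCII data is already valid UTF-8.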


Response:
The Character Set and Language Negotiation (3) definition should be used.


Status: Approved (1/00)
Library of Congress