The foregoing parts have described the two encoding options, MARC-8 and Unicode, specified for use in MARC 21 records. There are many situations in which it may be necessary or desirable to convert records from one of these schemes to the other. This section identifies a number of factors a successful conversion must take into account and specifies techniques for converting Unicode records that contain characters absent from the MARC-8 repertoire.
On this site, Part 5: MARC-8 Code Tables contains tables showing the MARC-8 repertoire along with the code values of each character in MARC-8 and Unicode schemes. It also contains links to an XML version of the table for all MARC-8 characters, and a comma-delimited file of MARC-8/Unicode correspondences for the EACC (CJK) character set only.
The following points need to be considered when converting from either specified encoding to the other.
When converting from MARC-8 to Unicode, Leader position 9 (Character encoding scheme) must be set to "a" to indicate that the converted record uses Unicode encoding. When converting from Unicode to MARC-8, Leader position 9 must be set to "blank" (20(hex)).
Neither field 066 (Character Sets Present) nor any escape sequence is allowed in a Unicode MARC 21 environment. Escape sequences and the 066 field in a MARC-8-encoded record must be removed during conversion to Unicode.
When converting to MARC-8, escape sequences and a 066 field must be constructed where appropriate. Field 066 is required in a MARC-8-encoded record whenever it contains a type 2 escape sequence, as described above in Part 2. If there are no such escapes, field 066 is not used.
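The record-level bookkeeping described above can be sketched in Python as follows. The record representation (a 24-character leader string plus a list of `(tag, value)` pairs) and both function names are assumptions made for illustration, not part of MARC 21. The sketch does not decode or strip escape sequences inside field data, and it leaves construction of field 066 to the caller.

```python
def mark_as_unicode(leader, fields):
    """Bookkeeping after a MARC-8 -> Unicode conversion (hypothetical helper).

    leader: 24-character leader string.
    fields: list of (tag, value) pairs; this layout is an assumption.
    """
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    # Leader position 9 (0-based index 9) becomes "a" for Unicode.
    new_leader = leader[:9] + "a" + leader[10:]
    # Field 066 is not allowed in a Unicode record, so drop it.
    new_fields = [(tag, value) for tag, value in fields if tag != "066"]
    return new_leader, new_fields


def mark_as_marc8(leader):
    """Bookkeeping after a Unicode -> MARC-8 conversion (hypothetical helper)."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    # Leader position 9 becomes blank (20 hex) for MARC-8.
    return leader[:9] + " " + leader[10:]
```

Returning new values rather than mutating the input keeps the original record available for comparison or rollback.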
Subfield $6 (Linkage) is used in MARC 21 records to link alternate graphic representations of the same data, to identify the presence of specific scripts in a field, and to flag fields in which the display/print directionality of data is right-to-left (e.g., for Arabic script). The subfield $6 script identification code in MARC-8-encoded MARC 21 records identifies MARC-8 character sets, rather than scripts per se; hence the code is irrelevant in the Unicode environment because the character set is always UCS. The script identification code should be dropped from subfield $6 when converting from MARC-8 to Unicode encoding. The field orientation code, which flags a field as having right-to-left display directionality, should be used in Unicode-encoded MARC 21 records. When present, the field orientation code is separated from the subfield $6 linking tag and occurrence number by two solidus (slash) characters (002F(hex)), since the script identification code that would otherwise sit between them has been dropped. In conversion from Unicode to MARC-8, the script identification code should be restored, typically to a code recorded in subfield $c of the 066 field.
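As an illustration of the MARC-8 to Unicode direction, the Python sketch below rewrites a subfield $6 value. It assumes the layout linking tag-occurrence number/script identification code/field orientation code, which this section does not spell out. The function name and the sample values are illustrative, and removing the trailing solidus when no orientation code is present is an assumption of the sketch.

```python
def drop_script_code(subfield6):
    """Remove the script identification code from a subfield $6 value.

    Assumed layout: "<tag>-<occurrence>/<script code>/<orientation code>".
    The orientation code, if any, is kept after two solidus characters.
    """
    parts = subfield6.split("/")
    link = parts[0]                                    # e.g. "880-01"
    orientation = parts[2] if len(parts) > 2 else ""   # e.g. "r"
    if orientation:
        return link + "//" + orientation
    return link


print(drop_script_code("880-01/(2/r"))  # right-to-left field keeps its flag
print(drop_script_code("880-01/(N"))    # no orientation code to keep
```

Restoring the script identification code in the opposite direction needs knowledge of the character sets used in the field, so it is not attempted here.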
In moving from MARC-8 to Unicode it is necessary to re-order combining characters and base characters so that the base character precedes the combining character(s). When converting from Unicode to MARC-8, combining marks must be moved to precede the base characters. The differing rules for proper sequencing of combining marks when a base letter has more than one (i.e., top down vs. inside out) are specified in Part 3. Best practice during conversion is to reorder the multiple marks according to the rule for the output encoding, but this is not considered mandatory.
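The basic reordering can be sketched in Python as shown here. The sketch assumes the data has already been mapped to Unicode code points and is held in a Python string, so it deals only with the relative position of marks and base characters, not with MARC-8 bytes. The function names are illustrative. Multiple marks on one base keep their relative order; the Part 3 sequencing rules are not applied.

```python
import unicodedata


def is_mark(ch):
    """True for characters in a Unicode Mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def marks_after_base(text):
    """MARC-8 order (marks, then base) -> Unicode order (base, then marks)."""
    out, pending = [], []
    for ch in text:
        if is_mark(ch):
            pending.append(ch)      # hold marks until their base arrives
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)             # stray trailing marks are kept as-is
    return "".join(out)


def marks_before_base(text):
    """Unicode order (base, then marks) -> MARC-8 order (marks, then base)."""
    out, base, marks = [], "", []
    for ch in text:
        if is_mark(ch):
            marks.append(ch)
        else:
            out.extend(marks)       # flush the previous cluster
            out.append(base)
            base, marks = ch, []
    out.extend(marks)
    out.append(base)
    return "".join(out)
```

For example, `marks_after_base("\u0301e")` yields `"e\u0301"`, and `marks_before_base` reverses it.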
When converting from MARC-8 to Unicode, the conversion should determine whether multi-digit numbers used in bidirectional scripts have been entered in logical or visual order. If visual order has been used, best practice requires that the digits be re-ordered from visual order to logical order. If logical order has been used, no re-ordering is necessary.
The numbers, punctuation marks, and symbols found in ASCII 21-3F, 5B, 5D (hex) are also, in full or in part, allocated code points in the MARC-8 sets for Hebrew, Cyrillic, and Arabic. These are mapped (folded) into a single, identical set of Unicode code points, as specified in the mappings in the code tables in Part 5; hence mappings are not perfectly reversible because these characters will always be mapped to the ASCII set during reconversion to MARC-8. The resultant record may contain more escape sequences than the same record originally encoded in MARC-8 required. This is an acceptable result that should not interfere with processing or display of the record.
The characters (alpha, beta, gamma) of the custom MARC-8 Greek Symbols set are mapped to the regular Greek letters in Unicode and consequently are not reversible when reconverting to MARC-8. Use of the Greek Symbols set is strongly discouraged for that reason. In MARC-8 contexts where escapes to the standard Basic Greek set are not feasible, textual equivalents of the symbols (i.e., [alpha], [beta], [gamma]) should be preferred over use of the Greek Symbols set.
The space character (20 (hex)) is defined only in the Unicode and ASCII sets but is recognized by all the standard graphic character sets in MARC-8 even though not included in those sets. When converting from Unicode to MARC-8, the space character can be converted (unchanged) without being preceded by an escape sequence no matter which of the standard sets is the current working set. Optionally, the escape sequence to ASCII may be included before the space character. However, when the output working set is a custom set accessed by escape technique 1, the technique 1 escape sequence to ASCII is required before the space character.
The MARC-8 repertoire contains over 16,000 characters; the Unicode repertoire contains over 100,000 characters. Direct mappings using the tables in Part 5 are sufficient for Unicode to MARC-8 conversion only for a record that contains no characters that are outside the MARC-8 repertoire. Additional techniques are needed for the more general case in which non-MARC-8 characters may be present in a Unicode record that is to be converted.
Two generally applicable methods for conversion from Unicode to MARC-8 encoding have been approved to aid conversion of MARC 21 records containing characters outside the MARC-8 repertoire. The two methods must not be used in the same record.
The lossy conversion method is intended for use in situations in which the loss of data beyond the large MARC-8 repertoire is not a concern. Each character that is not in the MARC-8 repertoire is replaced with an ASCII vertical bar (7C(hex)) during conversion. This method is called lossy because the substitution of a generic placeholder for every unconvertible Unicode code point loses data that cannot be recovered in a reconversion to Unicode.
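A minimal Python sketch of the lossy substitution follows. The `in_marc8` predicate is hypothetical: a real converter would test membership against the Part 5 tables, and the ASCII-only stand-in exists only to make the example runnable. The sketch works on Python strings and ignores escape sequences and byte-level encoding.

```python
def is_ascii_only(ch):
    """Toy stand-in for a real MARC-8 repertoire test (assumption)."""
    return ord(ch) < 0x80


def to_marc8_lossy(text, in_marc8=is_ascii_only):
    """Replace each code point outside the repertoire with a vertical bar."""
    return "".join(ch if in_marc8(ch) else "|" for ch in text)


print(to_marc8_lossy("snow \u2603 man"))
```

Because Python strings are sequences of code points, a supplementary-plane character also yields exactly one vertical bar.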
In the lossless conversion method, a Unicode character that is not in the MARC-8 repertoire is replaced by a hexadecimal Numeric Character Reference (NCR) identifying the specific unconvertible Unicode code point. This method preserves precisely the information content of the Unicode record, although the result may be a cryptic display, and additional conversion techniques will be required to reconstruct the record exactly in Unicode. The Numeric Character Reference consists only of ASCII characters and thus can be carried into the MARC-8 target record.
The structure of the NCR is &#xXXXX; where &#x and ; are literal ASCII characters and XXXX is the hexadecimal value of the Unicode code point.
It is not correct to represent a non-ASCII character in an NCR by its UTF-8 octets; only the scalar value of the code point is allowed.
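The lossless substitution and a matching reconversion step might look like the following Python sketch. Several details are assumptions rather than requirements stated here: the `in_marc8` predicate and its ASCII-only stand-in, the use of uppercase hexadecimal padded to at least four digits, and the acceptance of four to six hexadecimal digits when decoding. The NCR is built from the code point's scalar value (`ord`), never from its UTF-8 octets.

```python
import re


def is_ascii_only(ch):
    """Toy stand-in for a real MARC-8 repertoire test (assumption)."""
    return ord(ch) < 0x80


def ncr(ch):
    """Hexadecimal NCR for one code point, e.g. U+2603 -> '&#x2603;'."""
    return "&#x{:04X};".format(ord(ch))


def to_marc8_lossless(text, in_marc8=is_ascii_only):
    """Replace each code point outside the repertoire with its NCR."""
    return "".join(ch if in_marc8(ch) else ncr(ch) for ch in text)


_NCR_PATTERN = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def _decode(match):
    value = int(match.group(1), 16)
    # Leave anything beyond the Unicode range untouched.
    return chr(value) if value <= 0x10FFFF else match.group(0)


def restore_ncrs(text):
    """Turn NCRs back into characters when reconverting to Unicode."""
    return _NCR_PATTERN.sub(_decode, text)
```

Note that `restore_ncrs` cannot distinguish an NCR produced by conversion from one that was literally present in the source text, which is one reason exact reconstruction needs additional care.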
Either the lossy or the lossless conversion method can be applied directly to a Unicode record, but better results will be obtained if characters outside the MARC-8 repertoire are first converted, as far as possible, into approximately equivalent MARC-8 characters or character sequences. This will minimize the number of vertical bars or NCRs in the output and a more readily usable output record will result. Techniques of this sort are frequently referred to as normalization. Unicode defines four normalization forms for use within the Unicode environment. The optimal normalization for conversion to MARC-8 is a variant of the one called Compatibility Decomposition, or KD.
The code charts on the Unicode web site list valid decomposition sequences for all decomposable characters. These sequences are of two kinds: canonical and compatibility. A common example of the canonical type is the decomposition of a letter with a diacritical mark: E with acute accent (00C9(hex)) decomposes to E (0045(hex)) + acute (0301(hex)). Compatibility decompositions differ from the canonical in that they "do not attempt to retain or emulate the formatting of the original character." (Unicode Standard 5.0, Section 17.1). Some examples of characters with compatibility equivalents are the ellipsis character (2026(hex)) that decomposes to a sequence of three periods (002E(hex)); the circled digit four (2463(hex)) that becomes simply 4 (0034(hex)); the Roman numeral IV (2163(hex)) that decomposes to I (0049(hex)) + V (0056(hex)); and any of the spaces of different width (2000-2008(hex)) that can decompose in one or two steps to the ASCII space (0020(hex)).
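The two kinds of decomposition can be inspected with Python's standard `unicodedata` module, as in the sketch below. `unicodedata.decomposition` returns the single-step mapping from the Unicode Character Database, with a tag in angle brackets marking compatibility mappings. The `describe` helper is illustrative only.

```python
import unicodedata


def describe(ch):
    """Return (kind, code points) for the one-step decomposition of ch."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ("none", [])
    parts = mapping.split()
    kind = "canonical"
    if parts[0].startswith("<"):        # e.g. <compat>, <circle>
        kind = "compatibility"
        parts = parts[1:]
    return (kind, [int(p, 16) for p in parts])


for cp in (0x00C9, 0x2026, 0x2463, 0x2163, 0x2000, 0x2002):
    kind, targets = describe(chr(cp))
    shown = " ".join("{:04X}".format(t) for t in targets)
    print("{:04X}: {} -> {}".format(cp, kind, shown))
```

The last two lines of output illustrate the two-step case: 2000 maps canonically to 2002, which in turn has a compatibility mapping to the ASCII space.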
Unicode normalization form D specifies only canonical decompositions. The MARC-8 repertoire includes several precomposed characters that can be decomposed in Unicode, but should not be decomposed during conversion to MARC-8. These characters are specified in Table 4.1 below.
All code points are shown in hexadecimal notation.
MARC-8 code points are shown in the G0 range for all sets except Extended Latin.
Character name | Unicode code points (u.c., l.c.) | MARC-8 G0 code points (u.c., l.c.) | MARC-8 character set |
---|---|---|---|
Cyrillic Short I | (0419, 0439) | (4A, 6A) | Basic Cyrillic |
Cyrillic Io | (0401, 0451) | (44, 64) | Extended Cyrillic |
Cyrillic Gje | (0403, 0453) | (42, 62) | Extended Cyrillic |
Cyrillic Yi | (0407, 0457) | (47, 67) | Extended Cyrillic |
Cyrillic Kje | (040C, 045C) | (4C, 6C) | Extended Cyrillic |
Cyrillic Short U | (040E, 045E) | (4D, 6D) | Extended Cyrillic |
Arabic Alef, Madda above | 0622 | 42 | Basic Arabic |
Arabic Alef, Hamza above | 0623 | 43 | Basic Arabic |
Arabic Waw, Hamza above | 0624 | 44 | Basic Arabic |
Arabic Alef, Hamza below | 0625 | 45 | Basic Arabic |
Arabic Yeh, Hamza above | 0626 | 46 | Basic Arabic |
Latin O with horn | (01A0, 01A1) | (AC, BC) | Extended Latin (ANSEL) |
Latin U with horn | (01AF, 01B0) | (AD, BD) | Extended Latin (ANSEL) |
Unicode normalization form KD, the optimal starting point for conversion from the Unicode repertoire to the MARC-8 repertoire, specifies that decompositions of both types, canonical and compatibility, should be done until no further decomposition is possible. The full KD normalization, however, may not be desired because of the canonical decompositions in the above table and other issues with the compatibility decompositions, such as loss of formatting with superscripts and subscripts.
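One possible way to approximate such a KD variant in Python is sketched here. It decomposes recursively using the one-step mappings from `unicodedata`, but stops at the precomposed characters listed in Table 4.1. This is a sketch under stated assumptions, not a prescribed algorithm: the function and constant names are illustrative, combining marks are not canonically reordered, Hangul syllables (which Unicode decomposes algorithmically) are left untouched, and superscripts and subscripts still lose their formatting.

```python
import unicodedata

# Precomposed characters from Table 4.1 that should stay precomposed.
KEEP_PRECOMPOSED = frozenset(chr(cp) for cp in (
    0x0419, 0x0439, 0x0401, 0x0451, 0x0403, 0x0453,   # Cyrillic
    0x0407, 0x0457, 0x040C, 0x045C, 0x040E, 0x045E,
    0x0622, 0x0623, 0x0624, 0x0625, 0x0626,           # Arabic
    0x01A0, 0x01A1, 0x01AF, 0x01B0,                   # Latin with horn
))


def decompose_char(ch):
    """Fully decompose one character, stopping at protected characters."""
    if ch in KEEP_PRECOMPOSED:
        return ch
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch
    parts = mapping.split()
    if parts[0].startswith("<"):        # drop a compatibility tag
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16))) for p in parts)


def marc8_friendly_decompose(text):
    """Apply decompose_char to every character of text."""
    return "".join(decompose_char(ch) for ch in text)
```

Stopping the recursion at a protected character means that, for instance, o with horn and hook above (1EDF) becomes o with horn (01A1) plus a combining hook above, rather than three separate code points.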
A further complication arises when a character with a combining mark cannot be normalized into components belonging to the MARC-8 repertoire. Proper treatment depends on whether it is the base character or the combining mark that is absent from the repertoire. If the base character can be converted, then the combining mark should be replaced by an NCR or placeholder properly repositioned before the base character in the output record. If the base character cannot be converted, either the lossless or the lossy technique can be applied directly, preferably to the character in its precomposed form, so that it will generate a single NCR or placeholder rather than two. This treatment is preferred whether or not the combining character is also missing from the MARC-8 repertoire.
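The treatment of a single base character with its combining marks could be sketched in Python as below. The `in_marc8` predicate and the tiny `TOY_REPERTOIRE` are hypothetical stand-ins for the Part 5 tables, and the four-digit uppercase NCR format is an assumption. When the base character cannot be converted, the sketch recomposes the cluster with NFC and substitutes every resulting code point, so a cluster that has a precomposed form yields a single NCR or placeholder.

```python
import unicodedata

# Toy repertoire (assumption): ASCII plus combining acute and caron.
TOY_REPERTOIRE = {chr(cp) for cp in range(0x80)} | {"\u0301", "\u030C"}


def in_toy_repertoire(ch):
    return ch in TOY_REPERTOIRE


def ncr(ch):
    return "&#x{:04X};".format(ord(ch))


def convert_cluster(base, marks, in_marc8=in_toy_repertoire, lossless=True):
    """Convert one base character plus its marks to MARC-8 order."""
    substitute = ncr if lossless else (lambda ch: "|")
    if in_marc8(base):
        # Marks (or their substitutes) are placed before the base.
        converted = [m if in_marc8(m) else substitute(m) for m in marks]
        return "".join(converted) + base
    # Base is unconvertible: prefer the precomposed form if one exists.
    composed = unicodedata.normalize("NFC", base + marks)
    return "".join(substitute(ch) for ch in composed)


print(convert_cluster("a", "\u030F"))        # mark missing, base present
print(convert_cluster("\u0292", "\u030C"))   # base missing, precomposed form exists
```

In the second call, ezh plus caron composes to the single code point 01EF, so only one NCR is emitted even though the caron itself is in the toy repertoire.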
The Library of Congress (12/04/2007)