Frequently Asked Questions (FAQ)
- What are the ISO 639 standards?
ISO 639 provides sets of language codes for the representation of
names of languages: a two-character code set (639-1) and
three-character code sets (639-2 and 639-3).
ISO 639-1, Codes for the representation of names of languages--Part
1: Alpha-2 code, was devised primarily for use in terminology,
and includes identifiers for major languages of the world for which
specialized terminologies have been developed. The maintenance agency for ISO 639-1 is the International Information Centre for Terminology (Infoterm).
ISO 639-2, Codes for the representation of names of languages--Part
2: Alpha-3 code, was devised primarily for use in bibliographic
documentation and terminology. It includes identifiers for all of
the languages represented in part 1, as well as for many other languages
that have significant bodies of literature. It also provides identifiers
for groups of languages, such as language families, that together
indirectly cover most or all languages of the world. The maintenance agency for ISO 639-2 is the Library of Congress.
ISO 639-3, Codes for the representation of names of languages - Part 3: Alpha-3 code for comprehensive coverage of languages, is a code list that aims to define three-letter identifiers for all known human languages. At the core of ISO 639-3 are the individual languages already accounted for in ISO 639-2. The large number of living languages in the initial inventory of ISO 639-3 beyond those already included in ISO 639-2 was derived primarily from Ethnologue (15th edition). Additional extinct, ancient, historic, and constructed languages have been obtained from Linguist List.
There are other ISO 639 standards in development:
- ISO/DIS 639-4: Codes for the representation of names of languages – Part 4: Implementation guidelines and general principles for language coding
- ISO/DIS 639-5: Codes for the representation of names of languages – Part 5: Alpha-3 code for language families and groups
- ISO/CD 639-6: Codes for the representation of names of languages – Part 6: Alpha-4 Code for the comprehensive coverage of language variants
- What are the differences between the ISO 639-1 and 639-2
code lists?
ISO 639-1, the two-character code, was devised primarily
for use in terminology and includes identifiers for most of the
major languages of the world that are not only most frequently represented
in the total body of the world's literature, but that are also among
the most developed languages of the world, having specialized vocabulary
and terminology. ISO 639-1 includes identifiers for a subset of
the languages covered by ISO 639-2.
ISO 639-2, the three-character code, was devised primarily
for use in bibliography, as well as in terminology. It has a less
restrictive scope than ISO 639-1, being devised to include identifiers
for languages that are most frequently represented in the total
body of the world's literature, regardless of whether specialized
terminologies exist in those languages or not. Because three characters
allow for a much larger set of distinct identifiers, an alpha-3
code can accommodate a much larger set of languages. Indeed, ISO
639-2 does include significantly more entries than ISO 639-1, yet
the scope is not so broad as to result in a separate identifier
for every individual language that has been documented. ISO 639-2
limits coverage of individual languages to those for which at least
modest bodies of literature have been developed. Other languages
are still accommodated, however, by means of identifiers for collections
of languages, such as language families.
In summary, the basic difference between ISO 639-1 and ISO 639-2
has to do with scope: the scope of ISO 639-1 is more restrictive,
focusing on languages for which specialized terminologies have been
developed. In practical terms, ISO 639-2 covers a larger number
of individual languages (due to its less-restrictive scope). It
also includes identifiers for collections of languages.
Both code lists are considered open lists (i.e., it is possible
for new entries to be added to the lists).
- What are the differences between the terminology and bibliographic
codes in the ISO 639-2 standard?
In the ISO 639-2 standard, two code sets are provided in
which the language codes are the same except for 22 of the 450+
languages that have alternative codes. One set is for bibliographic
applications, often referred to as ISO 639-2/B, and the other
for terminology applications, referred to as ISO 639-2/T.
The choice of the set used must be made clear by exchanging partners
prior to information interchange.
These alternative codes in ISO 639-2 exist for historical
reasons. At the time that ISO 639-2 was developed, there already
was a well-known and widely used language code list that had been
used for over 30 years in bibliographic systems which was largely
adapted for the 3-character code set. At the same time there was
the 2-character code list (now called ISO 639-1, previously
ISO 639), which covered far fewer languages than those for bibliographic
applications. There was a desire by some participants for the 3-character
codes for languages that were already in the 2-character list to
generally share the same 2 characters. In 22 cases the existing
bibliographic code was very different from the 2-character code
(because it was based on a different form of the language name),
but the impact on existing bibliographic systems with millions of
records using those well-established codes would have been enormous
if a new 3-character code were adopted. Thus, these alternative
codes were used for those languages. The alternative codes should
be considered as synonyms; there is no overlap in codes between
the B and the T list.
For more information, please see: www.loc.gov/standards/iso639-2/normtext.html
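The synonym relationship between the B and T sets can be sketched as a small lookup table. The Python below is illustrative only: it lists just a handful of the alternative pairs, and the names BIB_TO_TERM, TERM_TO_BIB, to_terminology and to_bibliographic are hypothetical, not part of any standard or library.

```python
# A few of the ISO 639-2 languages that have both a bibliographic (B)
# and a terminology (T) code. This is a partial, illustrative list.
BIB_TO_TERM = {
    "chi": "zho",  # Chinese
    "cze": "ces",  # Czech
    "dut": "nld",  # Dutch
    "fre": "fra",  # French
    "ger": "deu",  # German
    "gre": "ell",  # Greek, Modern
}

# Reverse table, built from the one above.
TERM_TO_BIB = {term: bib for bib, term in BIB_TO_TERM.items()}


def to_terminology(code: str) -> str:
    """Return the T code for a B code; any other code is returned unchanged."""
    code = code.lower()
    return BIB_TO_TERM.get(code, code)


def to_bibliographic(code: str) -> str:
    """Return the B code for a T code; any other code is returned unchanged."""
    code = code.lower()
    return TERM_TO_BIB.get(code, code)
```

Because no code appears in both sets with different meanings, either set can be converted to the other without ambiguity, which is why exchanging partners only need to agree on which set they use.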
- How were the ISO 639 code lists developed?
ISO 639-1: Codes for the representation of names of
languages: alpha-2 codes was developed by the ISO TC37/SC2 in
1988 for use in terminology, lexicography and linguistics.
ISO 639-2: Codes for the representation of names of languages:
alpha-3 codes was developed by the ISO TC37/SC2-TC46/SC4 Joint
Working Group. Work on the standard was initiated in 1989 because
of the inadequacy of the ISO 639-1 two-character code list to represent
a sufficient number of languages for bibliographic and terminology
needs. The list was largely based on the MARC
Code List for Languages, which has been in wide use since
1968.
ISO 639-3: In 2002, ISO TC37/SC2 invited SIL International (www.sil.org)
to participate in the development of a new standard based on the language
identifiers in the Ethnologue that would be a superset of ISO 639-2 and
would provide identifiers for all known languages. In 2004 the proposed
new standard, ISO/DIS 639-3 was released, incorporating identifiers for
living languages from the Ethnologue 15th ed. (www.ethnologue.com) and
for historical, ancient and constructed languages from the languages
database of LinguistList (linguistlist.org), accounting for more than
7000 individual languages. In February 2007, ISO 639-3 was adopted.
Elements other than collections listed in ISO 639-2 are a subset of
those listed in ISO 639-3; every non-collective element in ISO 639-2 is
included in ISO 639-3. The denotation represented by alpha-3 identifiers
included in both ISO 639-2 and ISO 639-3 is the same in each standard,
and the denotation represented by alpha-2 identifiers in ISO 639-1 is the
same as that represented by the corresponding alpha-3 identifiers in ISO
639-2 and ISO 639-3.
For more information about the development of the ISO 639-2 codes,
please see:
www.loc.gov/standards/iso639-2/develop.html
- Who uses the ISO 639 codes and why?
There are a wide variety of processes for which it is necessary to
identify the specific language beforehand. Language-based indexing and
searching are fairly obvious examples from the realm of bibliographic
applications, as is semantic interpretation. But there are a number of
others: spell-checking, sorting, syllabification and hyphenation,
morphological and syntactic parsing, fuzzy string searches and
comparisons, speech recognition, speech synthesis, semantic associations,
thesaurus lookups, and potentially many others.
Using "the" name of a language as the means of language identification in
machine applications poses two distinct problems of ambiguity. Firstly,
different languages can have identical or very similar names. For
example, there are four languages called Lele: Lele [lle] of Papua New
Guinea (Austronesian); Lele [lel] of the Democratic Republic of Congo
(Niger-Congo, Bantoid); Lele [lln] of Chad (Afro-Asiatic); Lele [llc] of
Guinea (Niger-Congo, Mande). Conversely, the same language may be called
by multiple different names, for example, one name used by native
speakers, another used by speakers of the neighboring language, and yet
another used by the national government.
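The first kind of ambiguity can be shown with a short sketch. The data below is taken from the Lele example above; the names LANGUAGES and codes_by_name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# (identifier, name, country) triples from the Lele example above.
LANGUAGES = [
    ("lle", "Lele", "Papua New Guinea"),
    ("lel", "Lele", "Democratic Republic of Congo"),
    ("lln", "Lele", "Chad"),
    ("llc", "Lele", "Guinea"),
]


def codes_by_name(entries):
    """Group identifiers by language name to expose ambiguous names."""
    index = defaultdict(list)
    for code, name, _country in entries:
        index[name].append(code)
    return dict(index)


# The name alone maps to four identifiers, so it cannot identify a language.
ambiguous = {name: codes for name, codes in codes_by_name(LANGUAGES).items()
             if len(codes) > 1}
```

Looking up by identifier instead of by name returns exactly one language, which is the motivation for using coded identifiers in machine applications.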
The ISO 639-1 code set was devised for use in terminology, lexicography
and linguistics.
The ISO 639-2 code set was devised for use by libraries, information
services, and publishers to indicate language in the exchange of
information, especially in computerized systems. The codes have
been widely used in the library community and may also be adopted
for any application requiring the expression of language in coded
form by terminologists and lexicographers.
The ISO 639-3 code set was devised for broad use in a variety of applications where more specific language coding was necessary than the other two standards provided.
- What is the relationship between the Internet RFC 4646
(and its predecessors RFC 3066 and 1766) and the ISO 639 standards?
Internet RFC 4646 (Tags for the Identification of Languages) describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user-defined extensions for private interchange. The language tag consists of a primary subtag and a series of subsequent subtags, each of which narrows or refines the range of languages identified by the overall tag. It enables the user to specify, in addition to the primary language, other characteristics such as script, country, or variant. It is considered an Internet Best Current Practice and gives guidance for the use of ISO 639 codes.
RFC 4646 specifies use of a 2-character code from ISO 639-1 when it exists; when a language does not have a 2-character code assigned, the 3-character code is used. Although it states that the 3-character terminology code is used in these cases where no 2-character code exists, this situation will not occur, since the only alternative codes in ISO 639-2 are for languages that already have a 2-character code.
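The selection rule described above (prefer the 2-character code, otherwise fall back to the 3-character code) might be sketched as follows. The mapping is a small, partial sample, and ALPHA3_TO_ALPHA2 and primary_subtag are hypothetical names, not part of RFC 4646 or any library.

```python
# Partial, illustrative mapping from ISO 639-2 codes (both B and T forms)
# to ISO 639-1 codes. A real implementation would load the full code list.
ALPHA3_TO_ALPHA2 = {
    "eng": "en",
    "fre": "fr", "fra": "fr",
    "ger": "de", "deu": "de",
    "chi": "zh", "zho": "zh",
}


def primary_subtag(alpha3: str) -> str:
    """Return the 2-character code when one exists, else the 3-character code."""
    code = alpha3.lower()
    return ALPHA3_TO_ALPHA2.get(code, code)
```

For example, German resolves to "de" whether the B code "ger" or the T code "deu" is supplied, while Hawaiian ("haw"), which has no 2-character code, keeps its 3-character code.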
- Are the language codes intended to be used as abbreviations
for the language?
The language codes in ISO 639-2 were developed to serve as a device
to identify a language or group of languages. They were NOT intended
to serve as abbreviations or short forms for languages, but rather
as a code that serves as a device to identify a language name. Some
codes in the list consist of letters that are used in some form
of the language name. This has not been possible in all situations,
however, and often one would need to know the English form of the
language name to recognize a relationship. There are situations
where codes have been selected that diverge from the language name.
In using the language codes, systems generally display the language
name represented by the code and not the code itself to users. Therefore
it becomes irrelevant whether the code is "123", "xyz",
"eng" or whatever.
See section
4.1 of ISO 639-2 for criteria for the selection of the language
code.
- Who are the registration authorities for the ISO 639 standards?
The registration authority for the ISO 639-1 codes
is:
International Information Centre
for Terminology (Infoterm)
Simmeringer Hauptstrasse 24, A-1110
Vienna
Austria
E-mail: [email protected]
The registration authority for the ISO 639-2 codes
is:
Library of Congress
Network Development and MARC Standards Office
101 Independence Ave. SE,
Washington, DC 20540-4402
USA
E-mail: [email protected]
The registration authority for ISO 639-3 is:
SIL International
ISO 639-3 Registrar
7500 W. Camp Wisdom Rd.
Dallas, TX 75236
USA
E-mail: [email protected]
- What are the functions of the registration authorities
for the ISO 639-1 and 639-2 standards?
The registration authorities for the ISO 639 standards receive
and review request applications for both new language codes
and for changing existing ones according to criteria
indicated in the standards.
The registration authorities maintain accurate lists of
information associated with registered language codes.
They also process and distribute updates of the codes on
a regular basis to subscribers and other parties.
For more information about the registration authorities' duties,
please see: www.loc.gov/standards/iso639-2/annexa.html.
- What is the Joint Advisory Committee (JAC) for the ISO
639 standards?
The Joint Advisory Committee ISO 639/RA-JAC was established to
advise the ISO 639-1 and 639-2 registration authorities and
guide coding rule applications (as laid down in the ISO 639
documentation). It consists of six individuals representing ISO
member bodies, plus the rotating chairs of the registration authorities
as well as up to six observers. The JAC considers applications for
new language codes and votes on whether they will be included.
More information about the Joint Advisory Committee and its activities
can be found at: www.loc.gov/standards/iso639-2/annexa.html.
- Are there any electronic discussion lists for the ISO
639 language codes?
Yes, for general discussion about the ISO 639 language codes, please
write to: [email protected].
There is also a discussion list on the IETF RFCs on language coding
at: [email protected].
Information about this discussion list is found online at: www.alvestrand.no/mailman/listinfo/ietf-languages.
- How does one request new ISO 639 language codes?
To request new codes in the ISO 639-1 and 639-2 standards, please fill out
the online form at: www.loc.gov/standards/iso639-2/iso639-2form.html.
Before submitting your requests, please review the
criteria used to define new codes. Appropriate
documentation must be provided with the request.
Information on the process for submitting a proposal for a new language
or other change to the ISO 639-3 code set may be found at
http://www.sil.org/iso639-3/submit_changes.asp.
- What are the criteria used to define new ISO 639 language
codes?
The criteria used to define new codes in the ISO 639-1 standard
are:
Relation to ISO 639-2. Since ISO 639-1 is to remain a subset
of ISO 639-2, it must first satisfy the requirements for ISO 639-2.
In addition it must satisfy the following.
Documentation
- a significant body of existing documents (specialized
texts, such as college or university textbooks, technical documentation
manuals, specialized journals, subject-field related books, etc.)
written in specialized languages
- a number of existing terminologies in various
subject fields (e.g. technical dictionaries, specialized glossaries,
vocabularies, etc. in printed or electronic form)
Recommendation. A recommendation and support of a specialized
authority (such as a standards organization, governmental body,
linguistic institution, or cultural organization)
Other considerations
- the number of speakers of the language community
- the recognized status of the language in one
or more countries
- the support of the request by one or more official
bodies
Collective codes. ISO 639-1 does not use collective codes.
If these are necessary the alpha-3 code will be used.
The criteria used when defining new codes in the ISO 639-2 standard
are:
- What is the timeline used for approving new ISO 639 language
codes?
After a request for a new, deleted, or changed code is submitted
to the appropriate registration authority (Infoterm for 639-1 and
Library of Congress for 639-2), the appropriate registration authority
determines whether or not the request meets the relevant criteria.
The registration authority then informs the requester of the process
generally within two weeks of the submission. If the request
meets the criteria, the registration authority determines an
appropriate code and consults the ISO 639/JAC. If the first vote
is not unanimous, a second round of voting is conducted.
The original requester will be informed of the JAC decision in six weeks to two months from submission of the original request.
Results of the JAC decisions will be publicized in a change
notice available on the Web.
In general, changes to the ISO 639-3 code set that do not affect the Part
2 code set are processed according to an annual review calendar. However,
a change that affects both ISO 639-2 and ISO 639-3 is reflected in the
639-3 code set as soon as the change is finalized.
- Are separate language codes used for languages in different
scripts?
A single language code is normally provided for a language even
though the language is written in more than one script. ISO
15924 Codes for the representation of names of scripts is available for coding scripts; these may be included as subtags after the primary language tag according to RFC 4646.
- Are separate language codes defined for dialects of languages?
A dialect of a language is usually represented by the same language
code as that used for the language. If the language is assigned
to a collective language code, the dialect is assigned to
the same collective language code. Generally, dialects are not given
different codes, but determining the difference between dialects
and languages will be decided on a case-by-case basis. In the future, ISO 639-6, currently under development, may be used to identify language variants and dialects.
- Are separate language codes defined for different orthographies?
A language using more than one orthography is not given multiple
language codes. According to RFC 4646, orthographic variations may be registered in the IANA language tag registry as variant subtags.
- What are collective language and macrolanguage codes?
Collective language codes are language groups that are used if
the criteria for assigning a separate language code are not met.
The word "languages" after the language group name indicates that a language code
is a collective one.
ISO 639-1 does not use collective codes, but ISO 639-2 does. References
from separate language names to the collective code used for that
language are not included in the ISO 639-2 standard, but may be
found in the MARC
Code List for Languages.
Some language identifiers in ISO 639-1 and 639-2 may be considered "macrolanguages". These are designated as individual language identifiers that correspond in a one-to-many manner with individual language identifiers in ISO 639-3. For instance, ISO 639-3 contains over 30 identifiers designated as individual language identifiers for distinct varieties of Arabic, while ISO 639-1 and ISO 639-2 each contain only one identifier for Arabic, "ar" and "ara" respectively, which are designated as individual language identifiers in those parts of ISO 639. Macrolanguages are distinguished from language collections in that the individual languages that correspond to a macrolanguage must be very closely related, and there must be some domain in which only a single language identity is recognized.
- Can ISO 639 language codes be changed after they have
been created?
ISO 639 language codes are usually not changed in order
to ensure continuity and stability of online retrieval from large
databases built over many years. However, when language names associated
with codes have been changed, variant forms of a language
name may be included in the entry, separated by a semicolon in the
code lists.
Obsolete codes are generally not reassigned when they have
been changed or discontinued.
A list of codes that have been changed or added to the lists is
located at: www.loc.gov/standards/iso639-2/codechanges.html.
To request a change to the name of an already defined
language, please see: www.loc.gov/standards/iso639-2/iso639-2chform.html.
- Why do some languages have both ISO 639-1 and 639-2 codes
associated with them while others have only ISO 639-2 codes?
For languages to be assigned the 2-character or 3-character codes,
they must meet the criteria of the respective
lists.
However, because of the inadequacy of the alpha-2 codes to represent
all of the languages in the world (they can only accommodate 676 codes)
and to assure backwards compatibility with existing usage compliant
with RFC 4646 (and its predecessors), new language codes may be considered for inclusion
in both parts or in ISO 639-2 only.
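The figure of 676 follows from the number of two-letter combinations of the 26 letters of the basic Latin alphabet. A short calculation, with illustrative variable names, compares this with the capacity of a three-letter code:

```python
import string

# Number of letters available for each position of a code.
LETTERS = len(string.ascii_lowercase)  # 26

alpha2_capacity = LETTERS ** 2  # 26 * 26 = 676 possible two-letter codes
alpha3_capacity = LETTERS ** 3  # 26 * 26 * 26 = 17576 possible three-letter codes
```

The three-letter space is 26 times larger, which is why only the alpha-3 parts can aim at broad coverage of the world's languages.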
- Are the ISO 639 codes case sensitive?
ISO 639-2 recommends use of the language codes in lower case, but
they should be considered case-insensitive and are unique codes
regardless of case.
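In practice, an application can follow this recommendation by normalizing codes to lower case before comparing or storing them. The helper names below are hypothetical; this is a minimal sketch rather than a prescribed method.

```python
def normalize_code(code: str) -> str:
    """Return the recommended lower-case form of a language code."""
    return code.lower()


def unique_codes(codes):
    """Remove duplicates that differ only by case, keeping first-seen order."""
    seen = []
    for code in codes:
        normalized = normalize_code(code)
        if normalized not in seen:
            seen.append(normalized)
    return seen
```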
- How does one indicate the language variation spoken in
a particular country?
The ISO 639 standards and RFC 4646 allow for combining the language
code with a country code from ISO 3166 to denote the area in which
a term, phrase, or language is used. For instance, using RFC 4646, English as spoken
in the United States may be indicated with the following:
en-US
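A tag of this form can be assembled from its two parts. The sketch below assumes the customary casing for such tags (lower-case language code, upper-case country code); the function name language_region_tag is hypothetical.

```python
def language_region_tag(language: str, region: str) -> str:
    """Join an ISO 639 language code and an ISO 3166 country code with a hyphen."""
    return f"{language.lower()}-{region.upper()}"
```

The function only formats its input; it does not check that either code is actually assigned in ISO 639 or ISO 3166.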
- How does one make distinctions between traditional and
simplified Chinese characters using the ISO 639 language codes?
The differences between traditional and simplified Chinese characters
cannot be represented using the ISO 639 codes because these are
distinctions in script. The character sets can be coded using
ISO 15924
(Code for the Representation of Names of Scripts) script
codes as subtags appended to the primary subtag for Chinese.
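For Chinese, the ISO 15924 codes Hant (traditional) and Hans (simplified) give the tags zh-Hant and zh-Hans. The following sketch splits a simple tag into its parts. It is deliberately simplified: it assumes tags of the form language[-script][-region] and ignores variants, extensions, and private-use subtags; split_tag is a hypothetical name.

```python
def split_tag(tag: str) -> dict:
    """Split a simple language[-script][-region] tag into named parts."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and result["script"] is None:
            # Four letters: an ISO 15924 script code, written in title case.
            result["script"] = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            # Two letters or three digits: a region code, written in upper case.
            result["region"] = part.upper()
    return result
```

With this sketch, "zh-Hant" yields the language "zh" with script "Hant", and "zh-hans-cn" yields language "zh", script "Hans", and region "CN".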
- How does one distinguish between Cantonese and Mandarin
variations of Chinese?
ISO 639-2 was intended primarily for written languages, and
since Chinese is the same in its written form for Cantonese and
Mandarin, no distinction was made in the code list. Individual Chinese languages included under the macrolanguage Chinese (coded as "zh" in 639-1; "zho" in 639-2/T and "chi" in 639-2/B) are listed at: http://www.sil.org/iso639-3/documentation.asp?id=zho. The ISO 639-3 code set defines cmn as Mandarin Chinese and yue as Yue Chinese (of which Cantonese is a dialect). Before the standardization of ISO 639-3, these could be coded by using the code for Chinese with the country code (e.g. zh-CN for Chinese as spoken in China and zh-TW for Chinese as spoken in Taiwan) or by using a subtag registered with the Internet Assigned Numbers Authority (IANA) (e.g. zh-guoyu).
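The one-to-many relationship between the macrolanguage Chinese and its individual languages can be sketched as a lookup. Only the two members named above are listed, so the table is incomplete by design; MACROLANGUAGE_MEMBERS and macrolanguage_of are hypothetical names.

```python
# Partial mapping from a macrolanguage identifier to some of the ISO 639-3
# individual languages it covers. Only members named in the text are listed.
MACROLANGUAGE_MEMBERS = {
    "zho": {"cmn", "yue"},  # Mandarin Chinese, Yue Chinese
}


def macrolanguage_of(code: str):
    """Return the macrolanguage covering an ISO 639-3 code, or None."""
    code = code.lower()
    for macro, members in MACROLANGUAGE_MEMBERS.items():
        if code in members:
            return macro
    return None
```

A system that only understands ISO 639-2 could use such a table to fall back from "yue" or "cmn" to "zho" when exchanging data.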
- How does one indicate undetermined languages using the
ISO 639 language codes?
In some situations, it may be necessary to indicate that the identity
of the language used in an information object has not been determined.
If the language is undetermined because there is no language content, the following identifier is provided by ISO 639-2:
zxx (No linguistic content; Not applicable)
If there is language content, but the specific language cannot be determined, a special identifier is provided by ISO 639-2:
und (Undetermined)
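The choice between the two special identifiers can be expressed as a small decision function. This is a sketch under the assumption that the caller already knows whether the object has language content and, if so, whether a language was identified; choose_code and its parameters are hypothetical.

```python
from typing import Optional


def choose_code(has_language_content: bool, detected: Optional[str]) -> str:
    """Pick a language code, falling back to the special ISO 639-2 identifiers."""
    if not has_language_content:
        return "zxx"  # No linguistic content; Not applicable
    if detected:
        return detected.lower()
    return "und"  # Undetermined
```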
- Is there a mechanism for using locally defined codes?
If a user wishes to use locally defined codes for languages not
covered by ISO 639-2, codes qaa through qtz are reserved
for local use, including for local treatment of dialects. These
codes may only be used locally, and may not be exchanged internationally.
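A system that exchanges data with others may want to detect codes from the local-use range before export. The check below is a minimal sketch; the function name is_local_use is hypothetical.

```python
def is_local_use(code: str) -> bool:
    """Return True if the code lies in the reserved local-use range qaa-qtz."""
    code = code.lower()
    return (
        len(code) == 3
        and code.isascii()
        and code.isalpha()
        and "qaa" <= code <= "qtz"
    )
```

Because all candidates are three lower-case ASCII letters, ordinary string comparison matches the alphabetical range: "qtz" is accepted, while "qua" falls outside it.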
- What is the difference between a language code and a
country code?
ISO 639 provides two- and three-character codes for representing
names of languages. ISO
3166 provides two- and three-character codes for representing
names of countries. These two standards were developed independently,
and there was no attempt to use the same code for a language as
that for the country in which it is spoken. One should use codes
from each list independently.
The language code and country code may be used together to indicate
a language variation spoken in a particular country (see the question
"How does one indicate the language variation spoken in a particular country?").
Comments on this document: [email protected]