The Library of Congress >> Especially for Librarians and Archivists >> Standards
HOME >> MARC Development >> Discussion Paper List
DATE: December 19, 2008
REVISED:
NAME: Encoding URIs for controlled values in MARC records
SOURCE: RDA/MARC Working Group
SUMMARY: This paper explores the use of URIs for controlled values in MARC records to accommodate RDA descriptions. It considers using a URI in place of or in addition to the value in a number of fields in the format where controlled vocabularies are used. It suggests either encoding the data in a new subfield $1 that would be defined in all applicable fields or reusing the existing subfield that would be used for the data in another form, and allowing the URI to be self-defining as such.
KEYWORDS: Subfield $1 (All formats); URIs; Controlled values; RDA
RELATED: 2008-05/2; 2008-05/3; 2008-DP05/1; 2008-DP05/3
STATUS/COMMENTS:
12/19/2008 - Made available to the MARC 21 community for discussion.
01/25/2009 - Results of the MARC Advisory Committee discussion - The committee decided to move this discussion paper forward as a proposal using subfield 1 (the number one). A mechanism will be established if it applies to more than one subfield in the field.
The document RDA—Resource Description and Access: Scope and Structure defines the framework for the development of RDA and introduces key concepts. The RDA guidelines and instructions for a particular element specify how it is expressed, using either a literal value surrogate or a non-literal value surrogate, concepts adopted from the DCMI Abstract Model (DCAM). They also specify that values may be a typed value string or a plain value string. A non-literal value surrogate is defined by the DCAM as: “a value surrogate for a non-literal value, made up of a property URI (a URI that identifies a property), zero or one value URI (a URI that identifies the non-literal value associated with the property), zero or one vocabulary encoding scheme URI (a URI that identifies the vocabulary encoding scheme of which the value is a member), zero or more value strings (literals that represent the value)”. Different elements may use different types of value strings. The use of a URI instead of a plain value string is particularly applicable to situations where the value of the particular element comes from a controlled vocabulary, which could be an authority list or formal thesaurus (e.g. a record from the LC Name Authority File or a record for an LCSH heading) or any other list of controlled terms (e.g. the MARC Code List for Languages).
Although URIs have not been made available for values in the aforementioned controlled vocabularies, work is underway to provide them. LC’s Network Development and MARC Standards Office is developing a registry for controlled lists and in so doing is establishing URIs both for the list itself and for each value on the list. Currently the following code lists are available as prototypes from LC’s registry at http://www.loc.gov:8081/standards/registry/lists.html: MARC language codes, MARC country codes, MARC relator codes, MARC geographic area codes, ISO 639-1 and ISO 639-2 language codes.
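The parts of a non-literal value surrogate named in the DCAM definition quoted above can be pictured as a small data structure. The following is only an illustrative sketch: the class name, the field names, and every example.org URI are assumptions made for this example and are not taken from the DCAM, RDA, or the LC registry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NonLiteralValueSurrogate:
    """Illustrative container for the components listed in the DCAM definition."""

    property_uri: str                                       # the property URI
    value_uri: Optional[str] = None                         # zero or one value URI
    vocabulary_encoding_scheme_uri: Optional[str] = None    # zero or one scheme URI
    value_strings: List[str] = field(default_factory=list)  # zero or more literals


# Hypothetical example: all URIs below are placeholders, not real identifiers.
surrogate = NonLiteralValueSurrogate(
    property_uri="http://example.org/elements/language",
    value_uri="http://example.org/vocabulary/languages/fre",
    vocabulary_encoding_scheme_uri="http://example.org/vocabulary/languages",
    value_strings=["French", "fre"],
)
```

A surrogate that carries only value strings and no value URI is also permitted by the definition, which is why the URI fields default to `None`.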
The DCMI/RDA Task Group is establishing controlled vocabularies with URIs that identify each value or concept. It will also allow for other controlled vocabularies to be used with RDA elements (as specified) and assumes that either URIs or literal values can be recorded. Note that the DCMI/RDA Task Group is also establishing URIs for elements. This paper discusses only URIs for controlled vocabularies; URIs for elements are outside its scope.
The MARC 21 formats allow for recording URIs in some fields, usually in fields that define a subfield $u. However, what is recorded in these subfields is a link to another bibliographic or related resource. Field 856 (Electronic Location and Access) uses a subfield $u for a URI, which represents the resource described in the record or some related resource. Other note fields include a subfield $u where the link goes to some text which substitutes for or amplifies what would otherwise be recorded in the field. For instance, subfield $u in field 505 (Formatted Contents Note) links to a table of contents for the resource described that is external to the record; subfield $u in field 583 (Action Note) links to a description of an action performed on the resource described in the record (for instance a lengthy description that is available in a source outside of the record). Fields/subfields for URIs are available in all MARC 21 formats.
URIs for controlled values are identifiers for a concept or term that is not a bibliographic entity, as that identified in 856, or a supplemental resource, as that identified in $u in other MARC fields. In some sense, a URI for a controlled value serves a similar purpose to a code that identifies a concept in a code list, but a URI is (probably) network accessible. That is, a code is intended to be language neutral, a persistent token that can be used to reference a concept that generally has a label, a definition, and whatever else the maintainer of that list provides. For instance an ISO 639-2 language code is an identifier of an entity which is a language that may be known by several names. That language code could also be represented as a URI. Whether the URI is resolvable is another issue. URIs may be pure identifiers or resolvable ones.
URIs might also be used to identify bibliographic or authority records, for instance as a value in the linking entry fields to point to another resource by referencing its bibliographic record. In addition, a URI could be used in a bibliographic field to link to an authority record which itself may be describing a concept. LC made available its Permalink system, which uses a URI for a bibliographic record in LC’s online catalog, consisting of a domain name for Permalink itself (lccn.loc.gov) plus the Library of Congress Control Number (LCCN) in normalized form (e.g. http://lccn.loc.gov/2008273747). Currently the MARC 21 bibliographic and authority formats include subfield $0 and subfield $w in various fields to allow for linking to another authority or bibliographic record through a control number. It is assumed that a “raw” control number that has meaning within the originating system is recorded in subfields $0 and $w. Alternatively, a URI that includes the control number could theoretically be used to link to other MARC records. Whether it could use these same subfields defined for the “raw” control number and rely on its nature of being self-defining (by virtue of the URI syntax used) needs to be discussed. Authority records themselves may also be considered controlled values and be assigned URIs.
URIs often implicitly include information about the maintainer of the scheme or controlled vocabulary. They consist of the URI scheme (e.g. “http”; schemes are registered by IANA, the Internet Assigned Numbers Authority) plus the scheme-specific part, which is defined by the naming authority. The scheme-specific part generally includes a domain name and any other portions as determined by the host. (For more information on URI syntax see: URI Generic Syntax http://www.ietf.org/rfc/rfc2396.txt). In establishing URIs for controlled vocabularies, maintainers should be careful to choose robust and persistent URIs that include an identification of the domain in which the vocabulary is established.
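As a minimal sketch of the decomposition described above, the Python standard library can split a URI into its scheme and the parts of the scheme-specific portion. The function name `describe_uri` and the shape of the returned dictionary are assumptions made for this illustration; the URI used is the Permalink example given earlier in this paper.

```python
from urllib.parse import urlsplit


def describe_uri(uri: str) -> dict:
    """Split a URI into its scheme, domain name, and remaining path."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,    # registered URI scheme, e.g. "http"
        "domain": parts.hostname,  # domain name identifying the maintainer
        "path": parts.path,        # remainder, as determined by the host
    }


info = describe_uri("http://lccn.loc.gov/2008273747")
```

Here the domain portion (`lccn.loc.gov`) is what would let a receiving system infer who maintains the identifier.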
In the set of proposals and discussion papers about implementing RDA in MARC 21 that were presented at the Annual 2008 MARC Advisory Committee meetings, the issue of recording URIs for controlled values came up several times, in particular in the discussion of 2008-05/3 (New content designation for RDA elements: Content type, Media type, Carrier type), 2008-DP05/1 (Using RDA relators between names and resources with MARC 21 records), and 2008-DP05/3 (Treatment of controlled lists of terms and coded data in RDA and MARC 21). All of these papers dealt with RDA elements that will contain controlled vocabulary values, either from RDA itself or from an external list. In 2008-DP05/1, which discussed using URIs for relator codes, the paper suggested recording a URI in subfield $4 (Relator code), considering that it would be obvious that it is a URI rather than a raw code because of its syntax. Participants agreed that the issue of recording URIs needed to be explored more fully to be able to accommodate the implementation of RDA.
Whether URIs that represent values in controlled vocabulary lists need a dedicated subfield in all applicable fields needs to be determined. Alternatively, whether a subfield could be used that is defined for a controlled term or code should be considered, since a URI is essentially self-defining: the first portion of the URI is a known Internet protocol (the most common of which would be http://) and the scheme-specific portion would indicate the source of the vocabulary. This question depends on how a system might process such data and whether a URI could be recognized as such.
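To make the "self-defining" idea concrete, the sketch below shows one way a system might decide whether a subfield value is a URI or a plain code by looking only at its leading scheme. The function name `looks_like_uri` and the particular set of accepted schemes are assumptions chosen for illustration; an implementation would need to agree on which schemes it recognizes.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "-" or ".", then ":".
_SCHEME_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

# Assumption: the schemes a system chooses to recognize as URIs.
KNOWN_SCHEMES = {"http", "https", "info", "urn"}


def looks_like_uri(value: str) -> bool:
    """Return True if the value starts with one of the recognized URI schemes."""
    match = _SCHEME_PATTERN.match(value.strip())
    return bool(match) and match.group(1).lower() in KNOWN_SCHEMES


# A raw relator code versus the prototype registry URI for the same code.
raw_is_uri = looks_like_uri("art")
uri_is_uri = looks_like_uri(
    "http://www.loc.gov:8081/standards/registry/vocabulary/relators/art"
)
```

A check of this kind is cheap, but it only works if no legitimate code or term in the subfield could itself begin with a recognized scheme followed by a colon.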
Whether URIs for controlled values should be encoded instead of or in addition to the code or textual value will also need to be determined. It is important to keep in mind that in the current record sharing environment records need to be able to stand alone, so that a system does not have to fetch a value from another source in order to understand them. There also may be implications for indexing if the value itself is not included in the data. This decision may need to be made by exchange partners or general guidance may be needed.
Some decisions need to be made concerning what library systems might do with the URIs that are provided for values in MARC 21 fields. Whether the URIs will resolve to something and what they will resolve to is an open question. Would one expect to find information about that value when clicking on the URI, or is it intended to be a pure identifier? How will the URIs be translated for the end user in the record and how might displays be generated for the user? Consideration also needs to be given to how the use of URIs will affect the exchange record and whether it will continue to be self-contained. The answer to that may affect whether the record creator needs to encode both the URI and the value that the URI represents. An analysis of the use cases for system functionality of these URIs would be desirable.
Fields that could require a URI for a controlled value include:
A new subfield could be defined to be used for a URI for a controlled value. The URI may take the place of the value or be used as a supplement to the term or code. The only subfield that has not been used in all fields in the MARC 21 formats is subfield $1 (i.e. the number “1”). Whether the community wants to use up this last undefined subfield needs to be discussed. Note that recording a URI for the identification of a concept in a controlled vocabulary is a different type of URI than the URI for a resource or related resource, as in the defined subfields $u in field 856 and other fields. If the community does not want to use subfield $1, then specific subfields in the applicable fields would need to be determined and the choice would have to vary among fields.
Since many fields where URIs are needed have multiple subfields it needs to be clear which subfield the URI relates to. That is, in 1XX and 7XX fields with relator term and code subfields, subfield $1 would be a URI for the value otherwise recorded in subfields $e and $4. Thus it could be defined as “Relator URI”, rather than just “URI”. The only other situation where a URI might be needed in these fields is one that identifies a bibliographic or authority record. Subfields $0 and $w are available for Control number, but whether these could be used both for a “raw” control number and a control number in the form of a URI also needs further discussion.
In field 041 (Language code) there are multiple subfields that contain language codes. Thus there needs to be a way to tell what type of language code the URI represents (currently determined by the particular subfield code used). In this case the code value itself would be recorded in the appropriate subfield to indicate the language usage, with the URI in subfield $1 immediately following it, so that the URI gives an alternative form of the code that is in the appropriate subfield.
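The positional convention described for field 041 can be sketched as follows: each subfield $1 is attached to the subfield that immediately precedes it. This is only an illustration of the idea under discussion, not a defined MARC practice; the representation of a field as a list of (code, value) pairs, the function name, and the example.org URI are all assumptions.

```python
from typing import List, Optional, Tuple

Subfield = Tuple[str, str]                # (subfield code, value)
Paired = Tuple[str, str, Optional[str]]   # (subfield code, value, URI or None)


def pair_codes_with_uris(subfields: List[Subfield]) -> List[Paired]:
    """Attach each $1 URI to the subfield immediately before it."""
    pairs: List[Paired] = []
    for code, value in subfields:
        if code == "1":
            # Attach only if a preceding subfield exists and has no URI yet;
            # any other $1 is silently skipped in this sketch.
            if pairs and pairs[-1][2] is None:
                prev_code, prev_value, _ = pairs[-1]
                pairs[-1] = (prev_code, prev_value, value)
        else:
            pairs.append((code, value, None))
    return pairs


# Hypothetical 041 field: $a eng $1 <placeholder URI> $h fre
field_041 = [
    ("a", "eng"),
    ("1", "http://example.org/vocabulary/languages/eng"),
    ("h", "fre"),
]
paired = pair_codes_with_uris(field_041)
```

Because the pairing depends entirely on subfield order, any system that reorders subfields on input or export would break the association.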
Whether the same subfield that is used for a code (or text as appropriate) can also be used for the URI should be considered. For instance for relator codes, subfield $4 in the access point fields might contain either the MARC code or a URI, and the distinction would be self-defining by the syntax of the value. In subfield $4 in the MARC 21 formats, the use of the MARC Relator list is assumed as the source of the code and this could continue to be the case. If the data came from a source other than the MARC list, a URI could be used, which would imply the maintenance agency for the controlled value.
Whether the system could interpret the URI because of its syntax needs to be considered. Often systems validate against a set of values, for instance in the subfields of field 041 against a list of MARC language codes or in field 043 against a list of MARC geographic area codes. It may require more processing to first examine the string to determine if it is a URI. On the other hand, it may not be difficult for a system to drop all data up to the last slash. In terms of the MARC codes, their equivalent URIs will include the code at the end of the string, e.g. http://www.loc.gov:8081/standards/registry/vocabulary/relators/art, so everything up to the last slash would be dropped to find the code. (NOTE: this is a temporary URI used as an example before the prototype registry is brought into a production service.) However, other URIs may have different encoding rules and will need to be accommodated, although it is likely that URIs established by other metadata efforts will include a term that indicates an understandable value.
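The "drop everything up to the last slash" approach mentioned above could look like the following sketch. The function names are hypothetical, the three-code set is only a small illustrative subset of the MARC relator list, and the `"://"` test is a simplifying assumption about how a URI would be recognized.

```python
# Assumption: a tiny illustrative subset of MARC relator codes.
RELATOR_CODES = {"art", "aut", "edt"}


def code_from_value(value: str) -> str:
    """Return the bare code, dropping everything up to the last slash of a URI."""
    value = value.strip()
    if "://" in value:
        # Ignore any trailing slash, then keep only the final path segment.
        return value.rstrip("/").rsplit("/", 1)[-1]
    return value


def is_valid_relator(value: str) -> bool:
    """Validate a raw code or a code-bearing URI against the code list."""
    return code_from_value(value) in RELATOR_CODES


extracted = code_from_value(
    "http://www.loc.gov:8081/standards/registry/vocabulary/relators/art"
)
```

This works only for URIs that end in the code itself; URIs from other vocabularies with different construction rules would need their own handling, as noted above.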
If it is decided to use the same subfield for a URI as well as a code (and one can argue that both are identifiers), the subfield(s) could be renamed “Language identifier” instead of “Language code”, “Relator identifier” instead of “Relator code”, etc.
4.1 Are there additional data elements that could contain URIs representing controlled values in addition to those listed in 3.1?
4.2 Is it desirable to define a new subfield used for URIs in all fields where needed? If yes, will it be unambiguous which value the URI is an alternate form of? Should both the code/term and URI be recorded?
4.3 Could existing subfields be reused to contain a URI? If so, will systems be able to do validation from code lists? Is it an option to suggest that users input the data in both forms (i.e. code and URI)? This may require making some subfields repeatable that aren’t currently.
4.4 What will users do with URIs encountered in MARC records in library systems? What are the use cases? How might they display to the end user? How will this affect the exchange record? Are there indexing considerations?
4.5 Should this mechanism be available in all MARC 21 formats wherever there are controlled vocabularies even if there isn’t currently a perceived need?
The Library of Congress >> Especially for Librarians and Archivists >> Standards (11/24/2015)