Z39.50 Attribute Architecture
Draft
August 11, 1998
Review/comment period extended to September 15.
1. Introduction and Preliminary Notes
1.1 Historical Background
The initial attributes for the bib-1 attribute set were developed by
a team of representatives of the Library of Congress, RLG, OCLC and
WLN in the mid-1980s. This set was merged with a similar set that
had been developed by European library system developers to become
bib-1. Bib-1 was the only attribute set contained in the published
version (version 2) of Z39.50 in 1992.
It was at this time that some problems with the bib-1 attribute set surfaced. Within the bibliographic community, implementors had no published definitions of the bib-1 attribute semantics, so each vendor implemented the bib-1 attribute set according to its own interpretation of attribute usage. A document was produced to clarify this (see ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt), although it was never formally included as part of the standard.
As the Internet grew, more communities wanted to implement Z39.50 and, in turn, needed additional attributes (beyond those already in bib-1) to reflect the types of data they wanted to exchange. This proved difficult, as Z39.50-1992 did not allow a query to include attributes from more than a single attribute set. Since bib-1 was the only publicly visible set, it was expanded to accommodate the needs of these communities. Thus, bib-1 grew without plan or rigor, evolving away from the bibliographic community where it had started, and "bib-1" became somewhat of a misnomer as it grew into a global set of attributes.
In 1994 and 1995, as Z39.50 version 3 was being finalized and as Z39.50
began to be widely implemented, additional concerns arose over the relationships among attribute sets that other groups were developing, notably the STAS and GILS attribute sets. The Z39.50 Implementors Group (ZIG) had many questions about the development and implementation of multiple attribute sets, including duplication of attributes across sets. At the February 1996 ZIG meeting, Clifford Lynch presented a discussion paper (Defining and Maintaining Attribute Sets for Use with the Z39.50 Protocol: A Discussion Paper) that detailed the issues:
- Duplication of common attributes in specialized attribute sets,
due to the limits of the version 2 query.
- Interoperability problems due to attribute set proliferation; for
example, how to know which basic attributes were embedded in
specialized sets.
- Ambiguities in the semantics of attributes.
- Lack of rigorous semantics in the bib-1 attribute set; lack of a
scope statement for the bib-1 attribute set; lack of consultation
with the broad community concerned with bibliographic records.
- Lack of guidance about the semantics of mixing attributes
from different attribute sets in a single Z39.50 query (and in particular, in a single query operand).
Following the discussion of these issues at the ZIG meeting, Lynch
volunteered to bring together a group of interested people to
recommend resolutions of the issues. The group met three times.
Lynch prepared interim reports that were discussed at subsequent ZIG
meetings. The final report of the group was presented at the
January 1998 ZIG meeting. The current text of the new architecture
includes revisions based on discussions then and at the June 1998
ZIG meeting.
The major conclusion of the group was that a new architecture for
attribute sets should be developed; they went on to recommend an
architecture based on classes of attribute sets, with expanded
attribute types. Another major conclusion was that expert
communities, rather than the ZIG, should be responsible for
developing and maintaining attribute sets (as was the case with GILS
and STAS). Notably, they recommended that the bibliographic
community, rather than the ZIG, develop the next generation of
bibliographic attributes. The ZIG should continue to be responsible
for attributes that are general to Z39.50, that is, not specific to
a given community.
1.2 Acknowledgements
The following people attended one or more of the attribute architecture meetings (the three meetings discussed above as well as a NISO sponsored meeting in March 1998):
- Joel Baron
- Priscilla Caplan
- Eliot Christian
- Ray Denenberg
- Larry Dixson
- Yonsook Enloe
- Eric Ferrin
- Michael Fox
- Patricia Harris
- Janet Hylton
- Manette Lazear
- Ralph LeVan
- Clifford Lynch
- Bill Moen
- Nassib Nassar
- Doug Nebert
- Mark H. Needleman
- Paul Over
- Mark Piekenbrock
- Cecilia Preston
- Sara Randall
- Ananth Rao
- Lou Reich
- Mackenzie Smith
- Lennie Stovel
- Margaret St. Pierre
- Fay Turner
- Les Wibberley
- Joe Zeeman
1.3 Brief Technical Background
This document addresses the Z39.50 type-1 query only. (Z39.50 defines a number of query types, but only requires support for the type-1 query.)
The type-1 query consists of one or more search terms, each with a set of
attributes, specifying, for example, the type of term (author, title,
subject, etc.), whether the term is truncated, its structure, etc. The
server is responsible for mapping attributes to the logical design of the
database.
A term in a type-1 query, together with its accompanying collection of
attributes, is called an operand. Operands may be
combined in a type-1 query, linked by boolean operators (And, Or, And-not,
and Proximity).
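As a rough illustration of the structure just described, the following Python sketch models a type-1 query as a tree of operands joined by boolean operators. The class names, field names, and the attribute pairs in the sample query are illustrative assumptions; the standard defines the query in ASN.1, and the Proximity operator is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Operand:
    """A search term together with its collection of attributes."""
    term: str
    # Each attribute is an (attribute type, value) pair.
    attributes: List[Tuple[int, Union[int, str]]] = field(default_factory=list)


@dataclass
class Operation:
    """Two sub-queries linked by a boolean operator."""
    operator: str          # "and", "or", or "and-not" in this sketch
    left: "Query"
    right: "Query"


Query = Union[Operand, Operation]


def count_operands(q: Query) -> int:
    """Count the operands (leaf nodes) in a query tree."""
    if isinstance(q, Operand):
        return 1
    return count_operands(q.left) + count_operands(q.right)


# Two operands combined with "and"; the attribute pairs are placeholders.
query = Operation("and",
                  Operand("lynch", [(1, 1)]),
                  Operand("attribute sets", [(1, 4)]))
```

Calling count_operands(query) on the sample tree visits both leaves; a server would walk the same tree when mapping each operand's attributes onto its database design.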
Each attribute is a pair: an attribute type and a value
of that type. An attribute set defines a set of attribute types,
and for each, a list of possible values.
An attribute set definition is assigned an object identifier, referred to
as its attribute set identifier.
Example: The bib-1 attribute set defines a
number of attribute types; one of which is Use. For bib-1 Use attributes,
many attribute values are defined, one of which
is personal name. Each type is assigned a numeric value, and each value is assigned
either a numeric value or a string. In bib-1, type Use is assigned the value 1, and Personal Name is
assigned the value 1. Thus bib-1 Use attribute Personal Name is represented as the pair
(1,1). This pair is further qualified by the bib-1 attribute set identifier (1.2.840.10003.3.1) to distinguish it from the pair (1,1) that may be defined by other attribute sets.
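The qualification of a pair by its attribute set identifier can be sketched in Python as follows. The Attribute tuple and its field names are assumptions made for illustration, and the second OID is made up; only the bib-1 identifier and the pair (1,1) come from the example above.

```python
from typing import NamedTuple, Union

BIB1_OID = "1.2.840.10003.3.1"   # bib-1 attribute set identifier, as given above
OTHER_OID = "1.2.3.4.5"          # made-up OID standing in for some other set


class Attribute(NamedTuple):
    attribute_set: str            # object identifier of the defining set
    attr_type: int                # numeric attribute type
    value: Union[int, str]        # numeric or string attribute value


personal_name = Attribute(BIB1_OID, 1, 1)   # bib-1 Use = Personal Name
other = Attribute(OTHER_OID, 1, 1)          # the same pair in a different set

# The (type, value) pairs are identical, but the attributes are distinct
# because they are qualified by different attribute set identifiers.
same_pair = personal_name[1:] == other[1:]
distinct = personal_name != other
```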
In version 2 of Z39.50, all attributes within a query must belong to the same attribute set (the query accommodates only a single, global attribute set id). In version 3, attributes may be combined from different attribute sets, within a single query, even within a single
operand (an attribute set id may accompany every attribute). This is a significant enhancement, providing support for multiple database searching, and allowing attribute sets to be defined with less
replication.
Also in version 3, new data types for terms are defined (in version 2 only
binary values are allowed).
1.4 Version 3 Assumption
There are several enhancements in version 3 pertaining to attribute sets
and query construction; the two enhancements described at the end of 1.3
are certainly the most important, and are seen to be functional
prerequisites for the development of an attribute architecture. For this reason, version 3 is
assumed by this architecture, and version 2 is not addressed.
1.5 Limitations
The Z39.50 type-1 query has known limitations, and the architecture
specified in this document is restricted by these limitations. As the
standard evolves and new versions are approved, the architecture may be
expanded.
1.5.1 Semantic Indicator
In order to compensate for some of the type-1 limitations, it may be
necessary to utilize the semantic indicator (provided within version 3)
for purposes that would otherwise be accomplished by more coherent
mechanisms if these limitations were not present. It should be thus noted
that in future versions of Z39.50 it is intended that these limitations
will be addressed, obviating the need for extensive use of the semantic
indicator at the attribute level.
1.5.2 Nesting and Occurrence
Occurrence is not permitted in conjunction with nesting. Thus, for example, "field 1
within field 2" (nesting) or "second occurrence of field 3" (occurrence) may be specified, but not
"second occurrence of field 1, within field 2". This is a limitation posed by the type-1 query. In
the future (either as an amendment to the type-1 query, or as part of a new query definition) nesting should
be cast as an operator; thus, in the query symbolically expressed as "A within B", 'within' would be analogous to 'and' in the query expressed as "A and B".
2. Attribute Set Class Definitions
The attribute architecture allows definition of multiple attribute set classes. An attribute set class definition provides an umbrella context for the definition of an attribute set belonging to a particular class. It defines attribute types that may be included in an attribute set for that class. Attribute set Class 1 is defined as part of this architecture document (section 3).
This architecture strongly recommends that an attribute set definition conforming to
a particular class not include types that are not defined for that class.
The architecture provides the attribute set class approach to allow flexibility and future expansion within the existing architecture. It is anticipated that attribute set Class 1 meets all known needs for an attribute class at this time. Thus it is not known whether additional classes will be necessary.
2.1 Mutual Exclusivity
An attribute class may declare that specific attribute types are mutually exclusive
within a query operand (for example, Abstract and Field Name attributes of Class 1). Mutual exclusivity rules are to be defined at the level of the attribute class rather than specific attribute sets.
2.2 Attribute Values
Although many attribute values are (and perhaps will continue to be) enumerated, an attribute value may take any of the following forms:
- Enumerated.
- Numeric.
For example, the value of an Occurrence attribute may simply be the actual occurrence; that is, to indicate "second occurrence of field N" the value of the Occurrence attribute would be 2.
- Character string.
- A sequence of values.
For example, a list of language codes. Some indication is provided, either via the semantic indicator or by a rule specified in the attribute set definition, of the purpose of the list: languages might be supplied in order of preference, or alternative codes might be supplied for the same language so that if the server does not recognize the first, it may recognize the second.
3. Attribute Set Class 1
This class is intended to cover all known, existing needs, at the time that this document was finalized. (Existing attribute sets may need to be re-specified within this framework.)
There may be other approaches developed which partition the set of attributes into fundamentally different types. This might result in the definition of a new attribute class inconsistent with this class. However, no need for such a separate class has been identified.
The importance of enumerating all of the possible attribute types within
this "universal" attribute class is to provide a template for developers
of attribute sets, and to set up a framework for interoperability among
independently defined attribute sets which are intended to serve various
communities. In particular, it should be possible for groups of content
experts to develop new Abstract attributes, ASN.1
datatypes, comparison operators, and perhaps structure/format attributes which fit comfortably
within this framework. Server developers can, based on the template
defined here, recognize various attribute types that are omitted in a
given query, as well as illegal repetitions or combinations of attributes
of given types that would indicate a malformed query.
3.1 General Rules for Class 1
3.1.1 Semantic Precedence and Interaction among Sets
This attribute class is in effect for a query when the OID of an
attribute set conformant with this class is specified as the global OID for the query.
The "global" OID refers to the object identifier within
the type-1 query that does not accompany a specific attribute. For class
1, this is referred to as the dominant OID for the query. When
attributes from different attribute sets are mixed within a query, and
when the respective attribute set definitions conflict such that the
resulting semantics are ambiguous, the semantics of the dominant set
prevail.
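A minimal Python sketch of how an implementation might determine which set defines each attribute in a query. It assumes that an attribute supplied without its own OID belongs to the set named by the global (dominant) OID; the names and OIDs are hypothetical, and the sketch does not attempt to resolve semantic conflicts, only to find the attributes for which the dominant set's rules would have to be consulted.

```python
from typing import List, NamedTuple, Optional, Union


class Attribute(NamedTuple):
    attr_type: int
    value: Union[int, str]
    attribute_set: Optional[str] = None   # per-attribute OID, if supplied


def effective_set(attr: Attribute, dominant_oid: str) -> str:
    """Return the OID of the set that defines this attribute.

    Assumption: an attribute without its own OID is defined by the
    set named by the query's global (dominant) OID.
    """
    return attr.attribute_set or dominant_oid


def foreign_attributes(attrs: List[Attribute], dominant_oid: str) -> List[Attribute]:
    """List the attributes drawn from a set other than the dominant one."""
    return [a for a in attrs if effective_set(a, dominant_oid) != dominant_oid]


DOMINANT = "1.2.3.1"   # made-up OIDs, for illustration only
OTHER = "1.2.3.2"

operand_attrs = [
    Attribute(1, 7),           # no OID: defined by the dominant set
    Attribute(9, 3, OTHER),    # explicitly taken from another set
]
```

Here foreign_attributes(operand_attrs, DOMINANT) returns only the second attribute, which is the one whose interpretation could conflict with the dominant set.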
For an attribute set intended to conform to this class, its definition should:
- indicate that it conforms to Class 1;
- indicate whether that attribute set may be used as the dominant set
in a query; and if so:
- describe the rules that apply, when that set is used as the dominant
set, for intermixing of attributes from different sets within an
operand or query.
Interaction between attribute sets conformant to this attribute set class
and historical attribute sets not conformant to this class within a query
operand is undefined.
3.1.2 Inheritance and Population
An attribute set consistent with this attribute class will define
attributes of one or more of the types specified in 3.2.
Any class 1 attribute set inherits the rules, prescribed for the class,
that apply to attribute types defined for that set. However, a class 1
attribute set need not define nor populate every attribute type defined
for class 1. A class 1 attribute set may define as few as one attribute
type, or as many as all of the attribute types defined for class 1.
3.1.3 Omitted Attributes
An attribute set definition should not specify a default value for an attribute type to
be applied when that attribute type is omitted from an operand. Each individual server may
determine the semantics of omitted attributes. Thus when a client omits an attribute of a given
type from an operand (unless that type is not applicable for the given attribute combination, or unless the attribute type is mandatory) the client is, in effect, leaving it to the server to select a value. (When attribute types are omitted when a list of field names is provided via multiple Field Name values, the server will choose values for the omitted types based on the most specific field name in the list.)
3.1.5 Repeatability
In general if any attribute is allowed to be repeatable, the semantics of
repeating the attribute must be well-defined (implicitly or explicitly).
While repeatability may be permissible for a given attribute type, as a
general principle, an attribute type should not be repeated as a
substitute for Boolean operations. To amplify this point, an attribute
definition might prescribe how to interpret, for example, multiple
Abstract attributes in a single operand. For example, the definition might
prescribe:
- Multiple Abstract attributes may be supplied in
order of preference, so if a server does not support the first supplied, then use the second, etc.; or
- if multiple Abstract attributes are supplied, the server
is to choose the "best" among the set.
The definition may include a semantic indicator, allowing a client to
select among several semantic alternatives. However, none of those
alternatives should be to construct separate operands (linked by boolean
'and' or 'or') for each Abstract attribute -- the type-1
query supports boolean operations, so allowing another means of specifying boolean operations
would add unnecessary complexity (in contrast to potential semantic interpretations of multiple
Abstract attributes which cannot be otherwise
represented via the type-1 query, as in the examples above).
3.1.5.1 Mechanism for Repeating Attributes
There are two mechanisms for providing multiple attributes of the same
type within an operand:
- Via 'list' within 'complex' CHOICE of 'attributeValue' within
AttributeElement.
- Via separate instances of AttributeElement.
The first mechanism (provided by version 3, and not supported in version
2) is the mechanism prescribed for this class.
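The difference between the two mechanisms can be sketched in Python using plain dictionaries that loosely mirror the ASN.1 names ('attributeValue', 'complex', 'list'). This is a simplified model rather than an encoding of the actual ASN.1: single values are stored directly for brevity, and the function name and sample values are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Value = Union[int, str]


def to_complex_list(pairs: List[Tuple[int, Value]]) -> List[dict]:
    """Group repeated attribute types into one element with a value list.

    `pairs` holds (attribute type, value) tuples supplied as separate
    instances (the second mechanism). Types that repeat are rewritten as
    a single element whose value is a 'complex' 'list' (the first
    mechanism); the order of values within each type is preserved.
    """
    grouped: Dict[int, List[Value]] = {}
    for attr_type, value in pairs:
        grouped.setdefault(attr_type, []).append(value)

    elements = []
    for attr_type, values in grouped.items():
        if len(values) == 1:
            attr_value = values[0]
        else:
            attr_value = {"complex": {"list": values}}
        elements.append({"attributeType": attr_type,
                         "attributeValue": attr_value})
    return elements


# Two values of one type supplied separately, plus one value of another type.
separate = [(2, "Contact"), (2, "Name"), (9, 1)]
elements = to_complex_list(separate)
```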
3.2 Attribute Types Defined within the Attribute Class
3.2.1 Access Point Attribute Types
This attribute class definition recognizes that some applications of
Z39.50 make a strong link to database schemes, while others continue to
work with abstract definitions of databases. Thus there are two distinct
attribute types to accommodate these very different approaches to the use
of Z39.50. These two types should not be mixed within an operand.
- Abstract Attribute Type
Defines an intellectual access point for a group of relatively homogeneous databases,
independent of database schema. Nesting (e.g. place
names within subject headings) is not valid for Abstract attributes. Values of this attribute should be enumerated.
- Field Name Attribute Type
Defined in conjunction with a specific database schema. It can be
qualified via repetition of the attribute. An example of such
qualification (nesting) might be a field path within an SGML
database. The Field Name attribute should not be used in conjunction
with the Abstract attribute within the same query operand. This is generally a string-valued attribute, though enumerated values corresponding to numeric tags used for schema elements may also be used.
3.2.1.1 Nesting, Occurrence, and Anchoring of Access Point
Attributes
See the definitions of nesting, occurrence, anchored, and not anchored below.
- Nesting of Abstract attributes (for example: Place Name within Subject Heading) is
not permitted.
- Nesting of Field Name attributes may be supported, and if so, nesting should be indicated by
repetition of the Field Name attribute type, where the order of nesting is as in the following
example: field 1, field 2, and field 3, supplied in that order, means "field 3 within field 2 within
field 1". (This rule is supplied in order to avoid conflicting definitions, and reduce complexity of
implementations supporting multiple attribute sets where nesting is prescribed.)
- Occurrence of Abstract attributes (for example: second Author) may be
supported. Note, however, that care should be taken when casting an access point with occurrences as an
abstract access point. It may be reasonable to consider Author, for example, as abstract with occurrences (for "first" and "second" author); alternatively, there could be multiple access
points (e.g. 'first author' and 'second author'), or, as another alternative, Author could be cast as a Field Name rather than an Abstract access point.
- Occurrence of a Field Name attribute may be supported, but not in conjunction with nesting.
So, for example, "second occurrence of field N" may be supported, but not "second occurrence of
field M within field N".
- When an Abstract attribute is supplied, it is considered anchored.
- When one or more Field Name attributes are supplied, these may be indicated as
not anchored by defining a wildcard attribute as a value of the attribute type Field
Name. In the absence of a wildcard attribute, they are
considered anchored.
Definitions:
- Nesting is the ability to specify that the field that contains the term must be within
another specified field.
- Occurrence is the ability to specify the occurrence of the field that contains the
term.
- Anchored means that matching must occur from the root of the
element tree.
- Not anchored means that matching may occur beginning at any
node within the element tree.
Example of Anchored vs. Not anchored:
Suppose a schema includes elements Description (unstructured) and Contact,
where Contact is structured into sub-elements Name, eMail, and Affiliation:
When Field Name attribute Description is specified as anchored, then it is
intended to match the first level Description; if multiple Field Name
attributes Contact and Description are specified, as anchored, then it is
intended to match Description within Contact. If the single Field Name
attribute Description is specified as not anchored, then it is intended to
match either Description, or Description within Contact.
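The distinction between anchored and not anchored matching can be sketched in Python as path matching against an element tree. The schema below is hypothetical and only loosely modeled on the example, the "*" wildcard value is an assumed convention (an attribute set would define its own wildcard), and field names are given outermost first, following the nesting order described above.

```python
from typing import List, Sequence, Tuple

WILDCARD = "*"   # hypothetical Field Name value meaning "not anchored"

# Hypothetical schema, written as paths from the root of the element tree.
SCHEMA_PATHS: List[Tuple[str, ...]] = [
    ("Description",),
    ("Contact",),
    ("Contact", "Name"),
    ("Contact", "eMail"),
    ("Contact", "Affiliation"),
]


def match_paths(field_names: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return the schema paths selected by a list of Field Name values.

    Without the wildcard, the names must equal a complete path from the
    root (anchored). With a leading wildcard, the remaining names may
    match the tail of any path, so matching can begin at any node.
    """
    names = tuple(field_names)
    if names and names[0] == WILDCARD:
        tail = names[1:]
        if not tail:
            return []
        return [p for p in SCHEMA_PATHS if p[-len(tail):] == tail]
    return [p for p in SCHEMA_PATHS if p == names]
```

In this sketch, match_paths(["Name"]) finds nothing because there is no first-level Name, whereas match_paths(["Contact", "Name"]) and match_paths(["*", "Name"]) both select Name within Contact.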
3.2.1.2 Mixing Field Name Attributes from Multiple Attribute Sets
Mixing Field Name attributes from multiple attribute sets is permissible,
and no attribute set conforming to this class should preclude mixing of
its Field Name attributes with Field Name attributes from other sets.
This is a cross-attribute-set rule for any attribute set conforming to class 1.
Attribute sets might be defined that correspond directly to tagSets (which define Z39.50 retrieval elements). It may be desired to search on a field that corresponds to an element defined by a retrieval schema. A type-1 query operand might correspondingly be constructed with nested Field Name attributes corresponding to the elements in the tagPath for the desired field. It may be that those elements are from different tagSets. Correspondingly, the Field Name attributes would belong to different attribute sets.
3.2.2 Query Management Attribute Types
These attributes have the property that they can be rewritten by the
server as part of a revised query that the server returns to the client.
- Weight Attribute Type
The weight of an operand in a weighted boolean query. An attribute set definition that includes
this type should specify a normalized value
(for example zero to 1000). This
is a non-repeating numeric attribute.
- Hit Count Attribute Type
The number of records satisfying the operand. This attribute is intended to
convey information from server to client, but it may be passed back from client to server when
the client simply wants to turn around a reformulated search -- in that case, it is to be ignored by the server.
This is a non-repeating, numeric attribute.
- Stopwording Attribute Type
For a query sent from client to server, may be used to request that the server not treat any word within the term as a stopword. For a query returned from the server, may be used to indicate that one or more words were treated as stopwords.
3.2.3 Qualifying Attribute Types
- Language Attribute Type
The language of the term supplied within the operand. Values from
some standard source should be defined and registered so that they are generally available. Not
repeatable.
- Content Authority Attribute Type
The source of the term. This is a string-valued attribute. In the
interests of simplicity it is recommended that it be non-repeatable, though
there may be situations where repeatable content authority could be
meaningfully interpreted.
- Expansion/Interpretation Attribute Type
Indicates that thesaural expansion, singular/plural matching, part of
speech qualification, phonetic matching, case sensitivity, or
stemming should be used in the query evaluation. Word by word
truncation is also viewed as a form of stemming and is to be included
within this attribute type, as would various loose forms of phrase
matching. Repeatable; may be string-valued or enumerated.
3.2.4 Comparison Attribute Type
Defines the relationship between the term in the operand and the term in the term list at the server.
Comparison attributes are strongly typed. There are different comparison attributes for each of the term-value
datatypes discussed in 3.3 (numerics, character strings, and language strings).
Comparison attributes are mandatory, non-repeatable and enumerated.
Comparison attributes are a generalization
of the relation attributes of bib-1, but named differently to avoid confusion. Note that
equality is used only for cases of true equality testing (e.g. to test that two numbers are mathematically equal, or that two character strings are lexically equivalent); equality would not, in general, be used for language strings.
Various "matching" comparison operators are used for string matching using
various kinds of regular expressions, for example. Sample values might
include:
- complete match
- doesn't match
- left and right anchored match
- contains
- contained in bounding-polygon
- match via regular expression
- relevance feedback
- equality as strings
- numeric greater than
- between (range operations in conjunction with the range datatype)
The bib-1 Completeness attribute and most of the Truncation attribute have been folded into the Comparison attribute as forms of anchored matching.
3.2.5 Format/Structure Attribute Type
Used primarily to help with the interpretation of a character string term in cases where the comparison operator normally does not assume an ASN.1 datatype; it provides guidance for the datatype
conversion process. Examples of attributes of this type are "character string", "language string", "date". This is an enumerated or string-valued attribute, non-repeatable.
3.2.6 Occurrence Attribute Type
Indicates the desired occurrence of a field. For example "second occurrence of field 1".
3.2.7 Indirection Attribute Type
Indicates that the actual content of the term is not supplied; instead, a pointer (e.g. a URL) to the term
is supplied in lieu of the actual term. This attribute has enumerated values, e.g. URL, URN, DOI,
etc. Non-repeatable.
3.3 Datatyping
It is recommended that term values have strong datatyping, carrying over
into the definition of the comparison attributes (operators); for example,
there should be separate comparison attributes for strings, numerics, etc.
Groups defining specific Abstract attributes should
consider defining ASN.1
datatypes to support their applications -- for example, personal names or
dates, or geospatial information (points and polygons). There will of
course be cases where the ASN.1 approach to datatyping will be too
heavy-weight; in those cases the Format/Structure attribute type can be
used in conjunction with strings to indicate that the content of a string
represents data in a specific format.
The basic datatypes defined as part of the general attribute class should
include:
- Numerics
Integers and intUnits. These should be supported with the usual comparison operators equal,
greater than, less than or equal, etc.
- Character Strings
These are handled lexically. They differ from language strings (below) which are
word-oriented. Character strings are not assumed to contain words.
- Language Strings
These are strings that contain one or more words. They are treated
as sets of words or phrases. This approach is felt to be better than tagging strings as "word lists".
For example, distinguishing between lexical and linguistic (or at least token-based) operations should clarify
queries considerably. Different comparison operators are used for language strings than for
(lexical) character strings.
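The contrast between lexical and word-oriented comparison can be sketched in Python as follows. Both function names are hypothetical, and the tokenization used for language strings (lower-casing and splitting on whitespace) is a simplifying assumption; the architecture does not define how words are identified.

```python
def lexical_equal(term: str, stored: str) -> bool:
    """Character-string comparison: the two strings must be identical."""
    return term == stored


def words_contained(term: str, stored: str) -> bool:
    """Language-string comparison: every word of the term occurs in the
    stored string, ignoring case and word order (assumed behavior)."""
    stored_words = set(stored.lower().split())
    return all(word in stored_words for word in term.lower().split())
```

For instance, the term "sets attribute" is not lexically equal to the stored string "Attribute Sets for Z39.50", but under the assumed word-oriented comparison it matches, since both of its words occur in the stored string.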
3.3.1 Additional Types
Attribute set developers may define additional ASN.1 types, for example, for dates,
points and polygons.
There is a Z39.50 ASN.1 Date/time
definition that may be specified when the term is a date and/or time.
Personal names are an interesting boundary case where one might argue either for an ASN.1 based definition or a Format/Structure attribute indicating a normalized name according to some
rules; the choice of the appropriate approach is best left to a bibliographic attribute definition working group.
3.4 Enumeration and Summary of Class 1 Attribute Types
An attribute set definition conformant to class 1 should follow the guidelines and use the numeric values in the summary table below to represent the class 1 types. If any of these types is omitted in an attribute set definition, the definition should skip the value for that type rather than renumber.
Attribute Type | Number | Value | Repeatable | Occurrence | Roughly-corresponding Bib-1 Type |
Abstract | 1 | enumerated | no | Must occur in an operand if Field Name does not occur, and must not occur if Field Name occurs. | Use |
Field Name | 2 | generally, string | yes | Must occur in an operand if Abstract does not occur, and must not occur if Abstract occurs. | Use |
Weight | 3 | numeric: e.g. 0 to 1000 | no | optional | (new) |
Hit Count | 4 | numeric | no | optional | (new) |
Stopwording | 5 | 0 or 1 | no | optional | (new) |
Language | 6 | string or enumerated | generally, no | optional | (new) |
Content Authority | 7 | string | generally, no | optional | (new) |
Expansion/interpretation | 8 | string or enumerated | yes | optional | includes part of Truncation and Relation |
Comparison | 9 | enumerated | no | mandatory | includes Relation and part of Truncation and Completeness |
Format/Structure | 10 | string or enumerated | no | optional | Structure |
Occurrence | 11 | numeric | no | optional | (loosely) Completeness |
Indirection | 12 | enumerated | no | optional | (new) |
3.5 Attribute List Construction
Within a properly constructed operand, the attribute list should include attributes in ascending order by attribute type.
Certain combinations of attribute types are not allowed within an attribute list:
- Field Name and Abstract within an operand are not allowed.
- Format/Structure may be used only in conjunction with a comparison attribute type, when the comparison attribute does not have an ASN.1 defined datatype.
- An Occurrence attribute may not be combined with an Abstract attribute.
An attribute set definition should describe any further restrictions on allowable combinations of attribute types.
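The construction rules above, together with the occurrence requirements in the summary table of 3.4, could be checked along the following lines. This Python sketch is an assumption about how a server might validate an operand, not a prescribed algorithm; the function name and message texts are hypothetical, and the Format/Structure rule is not checked because it depends on the datatype of the comparison attribute.

```python
from typing import List

# Class 1 type numbers, taken from the summary table in 3.4.
ABSTRACT, FIELD_NAME = 1, 2
COMPARISON = 9
OCCURRENCE = 11


def check_attribute_list(attr_types: List[int]) -> List[str]:
    """Return a list of problems found in an operand's attribute list.

    `attr_types` holds the attribute type numbers in the order in which
    they appear in the operand. An empty result means no problem was
    found by the checks implemented here.
    """
    problems = []
    if attr_types != sorted(attr_types):
        problems.append("attribute types are not in ascending order")

    present = set(attr_types)
    if ABSTRACT in present and FIELD_NAME in present:
        problems.append("Abstract and Field Name are mutually exclusive")
    if ABSTRACT not in present and FIELD_NAME not in present:
        problems.append("one of Abstract or Field Name must occur")
    if COMPARISON not in present:
        problems.append("Comparison is mandatory")
    if OCCURRENCE in present and ABSTRACT in present:
        problems.append("Occurrence may not be combined with Abstract")
    return problems
```

For example, check_attribute_list([2, 9]) reports no problems, while check_attribute_list([1, 2, 9]) reports the Abstract and Field Name conflict.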
3.6 Utility Attribute Set
A Utility attribute set will be developed and maintained, consistent with Class 1, that will include commonly used (non-domain-specific) values for all of the Class 1 types.
4. Lessons Learned: Recommendations for Future Enhancements to the Z39.50 Query
As a result of the deliberations over this architecture, limitations posed by the type-1 query have
resulted in identification of recommended enhancements that should be considered for a future
version of Z39.50. These are documented here (additional contributions to this list are welcome):
- The term in an operand should be replaced by a sequence of Terms. In the interim, ASN.1 definitions such as MultipleSearchTerms-1 may be
used.
- Explicit range operators will be useful and should be added in favor of boolean combinations
of operators that result in range definitions.
- Attributes on operators should be supported.
- Nesting should be handled at the operator level rather than by repeating attributes. That is,
for "field 1 within field 2", 'within' should be an operator.
Library of Congress