Z39.50 Draft Attribute Architecture

This draft attribute architecture is for discussion at the January 1998 ZIG meeting.

Attribute set Class definitions

An Attribute set class definition provides an umbrella context for the definition of an attribute set belonging to a particular class. A class definition defines all of the attribute types that may be included in an attribute set for that class.

(At least one attribute set class definition will be developed, but it is not clear that more than one will be necessary.)

Attribute Set Class 1

This class is intended to cover all known, existing needs (existing attribute sets will need to be re-specified within this framework). The intent is not to preclude new types of attributes beyond those specified here; it should be possible to add new attribute types to this broad attribute class, if they are relatively orthogonal to the attribute types defined here.

Note: There may be other approaches developed which partition the set of attributes into fundamentally different types; this might result in the definition of a new attribute class that is inconsistent with this class. However, no need for such a separate class has been identified.

The importance of enumerating all of the possible attribute types within this "universal" attribute class is to provide a template for developers of attribute sets, and to set up a framework for interoperability among independently defined attribute sets which are intended to serve various communities. In particular, it should be possible for groups of content experts to develop new use attribute sets, ASN.1 datatypes, comparison operators, and perhaps structure/format attributes which fit comfortably within this framework. Server developers can, based on the template defined here, recognize various attribute types that are omitted in a given query, as well as illegal repetitions or combinations of attributes of given types that would indicate a malformed query.

The context of the attribute class defined here would be identified as being in effect for a query by specifying the OID of an attribute set conformant with the class in the overall OID for a Z39.50 query -- most likely one of the utility attribute sets which it is proposed (below) that the ZIG develop.

Interaction between attribute sets conformant to this attribute set class and historical attribute sets not conformant to this class within a query operand is undefined.

The attribute types that are defined within this attribute class are enumerated below. An attribute set consistent with this attribute class will define attributes of one or more of the types specified here. If attributes of a given type are omitted in a query they should be treated as omitted in establishing the semantics of a given query (in other words, there are no defaults for omitted attributes). Some types of attributes (for example, use and field attributes) are mutually exclusive in a given query operand; these rules are defined at the level of the attribute class rather than specific attribute sets.
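The class-level rules described here (mandatory types, non-repeatable types, mutually exclusive types) lend themselves to a mechanical check on each operand. The following is a minimal sketch only: the type names, the rule tables, and the function `validate_operand` are hypothetical, and the rules are gathered from statements made elsewhere in this draft rather than from any registered definition.

```python
from collections import Counter

# Hypothetical identifiers for attribute types; the rule tables below
# restate rules from this draft and are not a registered definition.
MANDATORY = {"comparison"}
NON_REPEATABLE = {"comparison", "weight", "hit_count", "language", "format_structure"}
MUTUALLY_EXCLUSIVE = [("use", "fieldname")]


def validate_operand(attributes):
    """Return a list of problems found in one operand's (type, value) pairs."""
    counts = Counter(attr_type for attr_type, _ in attributes)
    problems = []
    for attr_type in sorted(MANDATORY):
        if counts[attr_type] == 0:
            problems.append(f"missing mandatory attribute type: {attr_type}")
    for attr_type in sorted(NON_REPEATABLE):
        if counts[attr_type] > 1:
            problems.append(f"illegal repetition of attribute type: {attr_type}")
    for first, second in MUTUALLY_EXCLUSIVE:
        if counts[first] and counts[second]:
            problems.append(f"mutually exclusive attribute types: {first}, {second}")
    return problems
```

An empty result means the operand passes these class-level checks; omitted optional types are simply absent, since the class defines no defaults.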

While repeatability may be permissible for a given attribute type, as a general principle, an attribute type should not be repeated as a substitute for Boolean operations. To amplify this point, an attribute definition might prescribe how to interpret, for example, multiple Use attributes in a single operand. For example, the definition might prescribe:

multiple Use attributes may be supplied in order of preference, so if a server does not support the first supplied, then it uses the second, etc.; or
if multiple Use attributes are supplied, the server is to choose the "best" among the set; or
multiple Use attributes imply nesting, thus if Use attributes use-1, use-2, and use-3 are specified in a single operand, it means search for use-1 within use-2 within use-3 (see "Nesting of Use-type Attributes" below).

The definition may include a semantic operator, allowing a client to select among several semantic alternatives. However, none of those alternatives should be to construct separate operands (linked by boolean 'and' or 'or') for each Use attribute. The reason is that the type-1 query supports boolean operations, so allowing another means of specifying boolean operations would add unnecessary complexity. This is in contrast to potential semantic interpretations of multiple Use attributes which cannot be otherwise represented via the type-1 query, as in the examples above.
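As an illustration of the first alternative above (Use attributes supplied in order of preference), a server might resolve the list as sketched here. The function name and the attribute identifiers are hypothetical; the draft does not prescribe any particular implementation.

```python
def pick_use_attribute(requested, supported):
    """Return the first requested Use attribute that the server supports.

    `requested` is the client's list in order of preference; `supported`
    is the set of Use attributes the server can search. Returns None when
    none of the requested attributes is supported.
    """
    for use in requested:
        if use in supported:
            return use
    return None
```

For example, `pick_use_attribute(["use-1", "use-2", "use-3"], {"use-2", "use-3"})` selects `"use-2"`. Note that exactly one attribute is chosen; the list is never expanded into separate operands joined by 'and' or 'or'.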

Attribute Types defined within the Attribute Class

Use-type attributes

This attribute class definition recognizes that some applications of Z39.50 make a strong link to database schemes, while others continue to work with abstract definitions of databases. Thus there are two distinct attribute types to accommodate these very different approaches to the use of Z39.50. These two types should be mutually exclusive, unless care is taken to describe in detail how they may be used in combination.

Database Fieldname
Defined in conjunction with a specific database schema. It can be qualified via repetition of the attribute. An example of such qualification (nesting) might be a field path within an SGML database. The Fieldname attribute should not be used in conjunction with the Use attribute unless the attribute definition describes how they may be used in combination. This is generally a character-valued attribute, though it may also be useful to permit numeric values to facilitate mapping of fieldnames from schema definitions that use numeric assignments for fields.
Use attribute
Defines an intellectual access point for a group of relatively homogeneous databases, independent of database schema. Nesting, established through repetition, is also valid for the use attribute and establishes a context for the use attribute (for example, place names within abstracts). There is not a clean line between a qualified use attribute and the expansion/interpretation attributes discussed below. A good example of this is the creation context used by CIMI, which is an attribute on an SGML field. This should be a numerically valued attribute.

Nesting of Use-type Attributes

Whenever multiple attributes of a given type are used for nesting, the order of nesting should be as in the following example: if field-1, field-2, and field-3 are supplied, in that order, it means field-1 within field-2 within field-3. This rule, though arbitrary and perhaps beyond the scope of architecture, is supplied in order to avoid conflicting definitions, and reduce complexity of implementations supporting multiple attribute sets where nesting is prescribed.
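The ordering rule can be made concrete with a small sketch. The attributes arrive innermost first, so a server that addresses data by an outermost-first path (as in an SGML-style field path) would reverse the list. Both function names are hypothetical.

```python
def describe_nesting(attributes):
    """Render repeated attributes, in the order supplied, as a nesting phrase."""
    return " within ".join(attributes)


def outermost_first(attributes):
    """Return the supplied attributes as a path running from outermost to innermost."""
    return list(reversed(attributes))
```

With `["field-1", "field-2", "field-3"]` as input, `describe_nesting` yields `"field-1 within field-2 within field-3"` and `outermost_first` yields `["field-3", "field-2", "field-1"]`.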

Query Management Type Attributes

These are attributes which the server can rewrite when it returns a revised query to the client as additional search information.

Weight
The weight of an operand in a weighted boolean query. This should be registered (along with a normalized value range) as part of a basic attribute set within the attribute class. This is a non-repeating numeric attribute.
Hit Count
The number of records satisfying the operand. This should again be registered as part of the basic attribute set. This attribute is intended for purposes of conveying information from server to client, but it may be passed back from client to server (when the client simply wants to turn around a reformulated search -- in that case, it is to be ignored by the server). This is a non-repeating numeric attribute.
Stopwording
Used to indicate whether or not a given word was used as a stopword, or whether it should or should not be considered as a stopword. A numeric (actually enumerated) valued attribute.
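The round trip described for the Hit Count attribute can be sketched as follows: the server ignores whatever hit count the client sends and returns the operand rewritten with the count it actually found. The `Operand` class, the `evaluate` function, and the dictionary standing in for a database index are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Operand:
    term: str
    weight: Optional[int] = None      # non-repeating numeric attribute
    hit_count: Optional[int] = None   # set by the server, ignored on input


def evaluate(operand, index):
    """Search for the operand's term and return the operand with a fresh hit count.

    Any hit count supplied by the client is discarded; `index` maps each
    term to the list of record identifiers that satisfy it.
    """
    matches = index.get(operand.term, [])
    return replace(operand, hit_count=len(matches))
```

A client can therefore turn around a query it received from the server without first stripping the hit counts from it.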

Qualifying Attribute Types

Language Attribute
The language of a given query operand. A general-use version of the language attribute, using values drawn from some standard source, should be defined and registered so that it is generally available. It is not clear whether a character to numeric mapping is needed for this attribute type or whether it should simply be character valued. Not repeatable.
Note: A Character set Attribute is not proposed; the current thinking is that it is unnecessary since it is handled by general Z39.50 character set support.
Content Authority
The source of the term. This is a character valued attribute. In the interests of simplicity, it should probably be non-repeatable, although there may be situations where repeatable content authority could be meaningfully interpreted.
Expansion/interpretation
Indicates that thesaural expansion, singular/plural matching, part of speech qualification, phonetic matching, case sensitivity, or stemming should be used in the query evaluation. Word by word truncation is also viewed as a form of stemming and is to be included within this attribute type, as would various loose forms of phrase matching. Repeatable; may be character or numeric valued.

Comparison Operators

There are different comparison attributes for each of the term-value datatypes discussed below. (See also the discussion of datatyping for operand values below.)

Comparison attributes are strongly typed. They are mandatory, non-repeatable and numeric valued.

Comparison attributes are somewhat similar to the relation attributes of bib-1, but named differently to avoid confusion. Note that equality is used only for cases of true equality testing (i.e., strings and integers). Various "matching" comparison operators are used for string matching using various kinds of regular expressions, for example. Sample values might include:

complete match
doesn't match
left and right anchored match
contains
contained in bounding-polygon
match via grep
relevance feedback
equality as strings
numeric greater than
between (range operations in conjunction with the range datatype)

The bib-1 Completeness attribute, as well as much of the Truncation attribute, have been folded into the comparison attribute; they are replaced by anchored matching.
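One possible reading of anchored matching is sketched below. It is an assumption, not something this draft defines: a left anchor requires the match to begin at the start of the field value, a right anchor requires it to end at the end of the value, and with neither anchor the operator behaves like "contains". The function name and parameters are hypothetical.

```python
def anchored_match(term, value, left=False, right=False):
    """Test a string term against a field value with optional anchoring.

    Both anchors together amount to a complete match of the whole value;
    a left anchor alone treats the term as a prefix of the value; a right
    anchor alone treats it as a suffix; no anchor means simple containment.
    """
    if left and right:
        return value == term
    if left:
        return value.startswith(term)
    if right:
        return value.endswith(term)
    return term in value
```

Under this reading, the left-anchored case gives roughly the effect of a right-truncated term, and the doubly anchored case that of a complete-field match, without a separate attribute type for either.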

Format/Structure

This attribute type is used primarily to help with the interpretation of a character string operand value in cases where the comparison operator normally assumes an ASN.1 datatype; it provides guidance for the datatype conversion process. In addition, the format/structure attribute can be used for indirection, for example indicating that the operand value is a URL or URN that points to a value rather than the operand value specified inline. This is a non-repeatable, character valued attribute.

Datatyping, Comparison attributes, and Format/Structure Attributes

It is recommended that term values have strong datatyping, carrying over into the definition of the comparison attributes (operators); for example, there should be separate comparison attributes for strings, numerics, etc. Groups defining specific Use attributes should consider defining ASN.1 datatypes to support their applications -- for example, personal names or dates, or geospatial information (points and polygons). There will of course be cases where the ASN.1 approach to datatyping will be too heavy-weight; in those cases the format/structure attribute type can be used in conjunction with strings to indicate that the content of a string represents some data in a specific format.

The basic datatypes defined as part of the general attribute class should include:

numerics
Integers and intUnits. These will need to be supported with the usual comparison operators equal, greater than, less than or equal, etc.
character strings
These are handled lexically. They are not assumed to contain words.
language-bearing strings
These are character strings that contain one or more words. They are treated as sets of words or phrases. This approach is felt to be better than, for example, tagging strings as "word lists"; distinguishing between lexical and linguistic (or at least token-based) operations should clarify queries considerably. Different comparison operators are used for language strings as opposed to lexical character strings.
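Strong typing of comparison operators can be pictured as a lookup keyed by both datatype and operator, so that an operator defined for one datatype cannot be applied to another. The sketch below is illustrative only: the datatype and operator names are hypothetical, and the whitespace tokenization and case folding used for language-bearing strings are assumptions.

```python
# Each comparison is registered for exactly one datatype.
COMPARISONS = {
    ("numeric", "greater_than"): lambda term, value: value > term,
    ("numeric", "equal"): lambda term, value: value == term,
    # Lexical strings are compared character by character, not as words.
    ("string", "equal"): lambda term, value: value == term,
    # Language-bearing strings are treated as sets of words.
    ("language_string", "contains_word"):
        lambda term, value: term.lower() in value.lower().split(),
}


def compare(datatype, operator, term, value):
    """Apply a comparison operator, rejecting operators not defined for the datatype."""
    try:
        comparison = COMPARISONS[(datatype, operator)]
    except KeyError:
        raise ValueError(f"{operator} is not defined for datatype {datatype}") from None
    return comparison(term, value)
```

Here `compare("language_string", "contains_word", "war", "War and Peace")` is true because "war" is one of the words, while `compare("string", "equal", "war", "War and Peace")` is false because the strings differ lexically.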

Occurrence Attribute Types

Occurrence Attribute.
Indicates the desired occurrence. For example "second subject heading".

Dates

There is now a Z39.50 ASN.1 Date/time definition, which should be specified when the term is a date and/or time.

Additional Types

Attribute set developers may define additional ASN.1 types, for example points and polygons. Personal names are an interesting "boundary" case where one might argue either for an ASN.1 based definition or a format/structure attribute indicating a normalized name according to some rules; the choice of the appropriate approach is best left to a bibliographic attribute definition working group.

Attribute Values

Although many attribute values are (and perhaps will continue to be) enumerated integers, this architecture recognizes that an attribute value may take any of the following forms:

Enumerated integer.
Integer value. For example, the value of an "occurrence" attribute may simply be the "occurrence", for example, to indicate "second subject heading" the value of the Occurrence attribute would be 2.
Character string.
A sequence of values.
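The four forms listed above can be told apart mechanically, as in the following sketch. The `Stopwording` enumeration and its values are hypothetical, as is the `value_form` function; they serve only to show an enumerated integer alongside a plain integer such as an occurrence number.

```python
from enum import IntEnum


class Stopwording(IntEnum):
    """Hypothetical enumerated values for the Stopwording attribute."""
    NOT_STOPWORD = 0
    STOPWORD = 1


def value_form(value):
    """Classify an attribute value into one of the four forms listed above."""
    # IntEnum must be tested before int, because IntEnum members are also ints.
    if isinstance(value, IntEnum):
        return "enumerated integer"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, str):
        return "character string"
    if isinstance(value, (list, tuple)):
        return "sequence of values"
    raise TypeError(f"unsupported attribute value: {value!r}")
```

For instance, an Occurrence value of `2` is classified as "integer", while `Stopwording.STOPWORD` is classified as "enumerated integer".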

Follow-on Actions

The ZIG should define at least two attribute sets within the new attribute set architecture (perhaps more than two; this is a packaging and granularity question). The ZIG should move away from naming conventions such as "bib-1" which imply some special legitimacy or precedence hierarchy for various attribute sets, and not use names for groups of attribute sets like "CORE". This may help avoid political debates.

One of the attribute sets (to be defined by the ZIG) within this attribute class should cover widely used basic functions, including comparison operator values, language codes, and basic expansion/interpretation values, plus query management types -- call this attribute set, for a working name, "PURPLE".

In addition, the ZIG should define a basic set of use attributes, called, for a working name, "ORANGE". In addition, a committee of bibliographic experts should be established, under auspices such as NISO, to define a new bibliographic attribute set within this general framework.

Other note

In version 4 the term in an operand should be replaced by a sequence of Terms. In the interim, an ASN.1 definition of a range, which is a pair of term values, might be reserved for version 3 range comparison types. Explicit range operators will be useful and should be added in preference to boolean combinations of operators that result in range definitions. Also in version 4, attributes on operators should be allowed.
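The difference between an explicit range operator and a boolean combination can be sketched as follows: the range is carried as a single term made of a pair of values, and one "between" comparison is applied to it. The `Range` class and `between` function are hypothetical, and treating both bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A range term: a pair of term values."""
    low: int
    high: int

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("low bound exceeds high bound")


def between(term, value):
    """Explicit range comparison; both bounds are assumed to be inclusive."""
    return term.low <= value <= term.high
```

A single operand such as `between(Range(1990, 1998), year)` then stands in for two operands (greater than or equal, less than or equal) joined by boolean 'and'.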

Library of Congress
Comments January 9, 1998