Report of the Working Group on Z39.50 Attribute Architecture

Draft for distribution at the April Z39.50 Implementor's Group meeting

Clifford Lynch

April 6, 1997

Note: due to time constraints, this document has not been reviewed by the
Attribute Architecture Working Group prior to its distribution to the ZIG.
I am responsible for all inaccuracies here; my apologies to my colleagues
from the Architecture group.  This report has not been posted
electronically. Following the ZIG meeting, I'll incorporate comments from
both the ZIG and the Working Group and then post the revised version out to
the ZIG list for additional comment, after which the report of the group
will move to final status and the group's work will be complete.

Introduction

The Attribute Architecture Working Group has been asked to make
recommendations to the ZIG on how to structure future attribute sets for
Z39.50. A number of background documents define the problems and provide
some discussion of the Working Group's deliberations to date; these have
already been distributed to the ZIG and posted on the Z39.50 Implementor's
Group list. (A list of these documents will be included in the final
version of this document, along with lists of the participants at each of
the Working Group's meetings.) This document attempts to summarize the
conclusions of the Working Group, reflecting thinking from recent meetings
of the Working Group in November 1996 and March 1997.  It is intended as a
summary of conclusions rather than a report of deliberations and discussion.

Classes of Attribute Sets

Perhaps the most important conclusion of the working group was that it was
important to establish general structures (which we called attribute set
"classes") which could provide an umbrella context for the definition of
future attribute sets. Such a class would define all of the attribute types
which could appear in attribute sets conformant to that class. Without such
a framework, meaningful interoperability among the work of autonomous
attribute set developers would require each developer group to specify how
its attribute types interact with the attribute types developed by every
other group.  Initially, we thought to define multiple classes, or at least
the first of what might well be multiple classes, but it appears that a
single class may meet at least present needs.

The class defined here covers all existing needs of Z39.50 attribute sets,
as far as the working group could determine -- although existing attribute
sets would need to be respecified within this framework. The intent is not
to preclude new types of attributes beyond those specified here, however;
our thought is that the class defined here should be maintained on an
ongoing basis, and that it should be possible to add new attribute types to
this broad attribute class, if someone comes up with new attribute types
that are relatively orthogonal to the attribute types defined here. There
may also be other approaches developed which partition the set of
attributes into fundamentally different types; this might result in the
definition of a new attribute class that is inconsistent with the class
under discussion here. However, the working group could not identify a need
for such a separate class (which in effect constitutes a radically
different structural view) of attribute sets.

The importance of enumerating all of the possible attribute types within
this "universal" attribute class is to provide a template for developers of
attribute sets, and to set up a framework for interoperability among
independently defined attribute sets which are intended to serve various
communities. In particular, it should be possible for groups of content
experts to develop new use attribute sets, ASN.1 datatypes, comparison
operators, and perhaps structure/format attributes which fit comfortably
within this framework.   Server developers can, based on the template
defined here, recognize various attribute types that are omitted in a given
query, as well as illegal repetitions or combinations of attributes of
given types that would indicate a malformed query.

The context of the attribute class defined here would be identified as
being in effect for a query by specifying the OID of an attribute set
conformant with the class in the overall OID for a Z39.50 query -- most
likely one of the utility attribute sets which we are proposing that the
ZIG should develop (see discussion below). Interaction between attribute
sets that are conformant to this attribute set class and historical
attribute sets that are not conformant to it, within a single query
operand, is undefined.

The attribute types that are defined within this attribute class are
enumerated below. There are a few general rules. While repeatability is
discussed with regard to each attribute type, a general principle is that
repeating an attribute type should not be used as a substitute for Boolean
operations. An attribute set which is consistent with this attribute class
will define attributes of one or more of the types specified here. If
attributes of a given type are omitted in a query they should be treated as
omitted in establishing the semantics of a given query (in other words,
there are no defaults for omitted attributes). Some types of attributes
(for example, use and field attributes) are mutually exclusive in a given
query operand; these rules are defined at the level of the attribute class
rather than specific attribute sets.
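
As a rough illustration of how a server might apply these rules, the
following Python sketch checks one query operand against a table of
attribute types. The type names, the table layout, and the function
check_operand are assumptions made for this example and are not part of any
registered definition. The repeatability entries follow the per-type
descriptions given below; stopwording is left unchecked because its
repeatability is not stated.

```python
from collections import Counter

# (repeatable, mandatory) for each attribute type, as described in this report.
# repeatable=None means the report does not say, so no check is made.
ATTRIBUTE_CLASS = {
    "fieldname": (True, False),
    "use": (True, False),
    "weight": (False, False),
    "hit_count": (False, False),
    "stopwording": (None, False),
    "language": (False, False),
    "content_authority": (False, False),
    "expansion": (True, False),
    "comparison": (False, True),
    "format_structure": (False, False),
}

# Pairs of types that may not both appear in one operand.
MUTUALLY_EXCLUSIVE = [("fieldname", "use")]


def check_operand(attributes):
    """Return a list of problems found in one operand's (type, value) pairs."""
    problems = []
    counts = Counter(attr_type for attr_type, _ in attributes)
    for attr_type in counts:
        if attr_type not in ATTRIBUTE_CLASS:
            problems.append(f"unknown attribute type: {attr_type}")
    for attr_type, (repeatable, mandatory) in ATTRIBUTE_CLASS.items():
        if mandatory and counts[attr_type] == 0:
            problems.append(f"missing mandatory attribute: {attr_type}")
        if repeatable is False and counts[attr_type] > 1:
            problems.append(f"illegal repetition: {attr_type}")
    for first, second in MUTUALLY_EXCLUSIVE:
        if counts[first] and counts[second]:
            problems.append(f"mutually exclusive: {first} and {second}")
    return problems
```

An omitted optional type produces no problem and no default value, which
matches the rule that omitted attributes are simply treated as omitted.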

Attribute Types defined within the Attribute Class

Use-type attributes

The Working Group felt that it was important to recognize that some
applications of Z39.50 now make a strong link to database schemas, while
others continue to work with abstract definitions of databases. Thus, we
propose two distinct and mutually exclusive attribute types to accommodate
these very different approaches to the use of Z39.50.

Database Fieldname -- this is defined in conjunction with a specific
database schema. It can be qualified  via repetition of the attribute.  An
example of such qualification (nesting) might be a field path within an
SGML database. The fieldname attribute is mutually exclusive with the Use
Attribute. This is a character valued attribute, although it may also be
useful to permit numeric values to facilitate mapping of fieldnames from
schema definitions that use numeric assignments for fields.

Use attribute -- defines an intellectual access point for a group of
relatively homogeneous databases, independent of database schema. This is
mutually exclusive with the database fieldname. Nesting, established
through repetition, is also valid for the use attribute and establishes a
context for the use attribute (for example, place names within abstracts).
There is not a clean line between a qualified use attribute and the
expansion/interpretation attributes discussed below -- a good example of
this is the creation context used by CIMI, which is an attribute on an SGML
field. This should be a numerically valued attribute.
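
One possible reading of nesting through repetition is sketched below in
Python. The ordering convention (outermost context first, access point
last), the numeric values, and the function name split_use_context are all
assumptions for illustration; this report does not fix an ordering.

```python
# Hypothetical numeric use attribute values, not registered anywhere.
ABSTRACT = 9001
PLACE_NAME = 9002


def split_use_context(attributes):
    """Split repeated use attributes into (context, access_point).

    Assumes the outermost context comes first and the last value is the
    access point actually searched.
    """
    uses = [value for attr_type, value in attributes if attr_type == "use"]
    if not uses:
        return None
    return uses[:-1], uses[-1]


# "Place names within abstracts" as two repetitions of the use attribute.
operand = [("use", ABSTRACT), ("use", PLACE_NAME), ("comparison", 1)]
context, access_point = split_use_context(operand)
```

The same approach would apply to a repeated database fieldname expressing a
field path.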

Query Management Type Attributes.

The working group had originally thought to assign a single type to query
management attributes, which have the particular property that they can be
rewritten by the server when it returns a revised query to the client as
additional search information. Upon closer examination we realized that,
since values need to be associated with these attributes, each has to be
defined as a specific type, but also that there really aren't very many of
them.

Weight -- this would be the weight of an operand in a weighted boolean
query. We believe that this should be registered (along with a normalized
value range) as part of a basic attribute set within the attribute class.
This is a non-repeating numeric attribute.

Hit Count -- the number of records satisfying the operand. This should
again be registered as part of the basic attribute set. Note that this
attribute can be passed back from client to server. This is a non-repeating
numeric attribute.

Stopwording -- this is used to indicate whether or not a given word was
used as a stopword, or whether it should or should not be considered as a
stopword.  This would be a numeric (actually enumerated) valued attribute.
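
To make the rewriting behaviour concrete, here is a minimal Python sketch of
a server revising the query management attributes of one operand before
returning it. The stopword list, the enumerated value USED_AS_STOPWORD, the
postings dictionary, and the function revise_operand are hypothetical; no
such values are registered by this report.

```python
# Hypothetical stopword list and enumerated code for illustration only.
STOPWORDS = {"the", "of", "and"}
USED_AS_STOPWORD = 1


def revise_operand(term, attributes, postings):
    """Return the operand's attributes as a server might rewrite them.

    attributes: list of (type, value) pairs sent by the client.
    postings:   dict mapping an indexed term to the set of matching records.
    """
    # Drop any stale hit count or stopword flag before rewriting them.
    revised = [(attr_type, value) for attr_type, value in attributes
               if attr_type not in ("hit_count", "stopwording")]
    key = term.lower()
    if key in STOPWORDS:
        revised.append(("stopwording", USED_AS_STOPWORD))
    else:
        revised.append(("hit_count", len(postings.get(key, ()))))
    return revised
```

Other attributes, including weight, pass through unchanged in this sketch.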

Qualifying Attribute Types

Language Attribute -- the language of a given query operand. The Working
Group believes that a general use version of the language attribute, using
values defined from some standard source, should be defined and registered
so that it is generally available.  It is not clear whether we need a
character to numeric mapping for this attribute type or whether it should
be character valued. This is not repeatable.

Earlier versions of the Working Group's efforts suggested a character set
attribute  might be needed. The current position of the working group is
that the character set attribute is unnecessary since it is handled by
general Z39.50 character set support.

Content Authority -- this defines the source of the term. It is a character
valued attribute. In the interests of reducing complexity, the working
group believes that this should be non-repeatable, although we can envision
some esoteric situations where a repeatable content authority could be
meaningfully interpreted.

Expansion/Interpretation -- this is used to indicate that thesaural
expansion, singular/plural matching, part of speech qualification, phonetic
matching, case sensitivity, or stemming should be used in the query
evaluation. Word by word truncation is also viewed as a form of stemming
and is indicated within this attribute type, as would various loose forms
of phrase matching.  This is repeatable, and may be character or numeric
valued.

Comparison Operators

These are mandatory and non-repeatable; they are numeric. Comparison
attributes are strongly typed; there are different comparison attributes
for each of the term value datatypes discussed below. See also the
discussion of datatypes for operand values later in this document.

Comparison attributes are somewhat similar to the relation operators of
BIB-1, but are named differently to avoid confusion. Note that equality is
now used only for cases of true equality testing (i.e., strings and integers);
various "matching" comparison operators are used for string matching using
various kinds of regular expressions, for example. Sample values might
include:

-- complete match
-- doesn't match
-- left and right anchored match
-- contains
-- contained in bounding-polygon
-- match via grep
-- relevance feedback
-- equality as strings
-- numeric greater than
-- between (range operations, in conjunction with the range datatype)

The BIB-1 completeness attribute and much of the truncation attribute have
been folded into the comparison attribute; they are replaced by anchored
matching.
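
The idea of strongly typed comparison attributes can be sketched in Python
as a table that binds each comparison to one datatype and rejects operands
of any other type. The comparison names and the function compare are
hypothetical; real comparison attributes would be numeric values registered
in an attribute set.

```python
# Hypothetical comparison names, each bound to exactly one term datatype.
COMPARISONS = {
    "string-equal": (str, lambda field, term: field == term),
    "string-contains": (str, lambda field, term: term in field),
    "string-left-anchored": (str, lambda field, term: field.startswith(term)),
    "integer-equal": (int, lambda field, term: field == term),
    "integer-greater-than": (int, lambda field, term: field > term),
}


def compare(comparison, field_value, term):
    """Apply a typed comparison, refusing values of the wrong datatype."""
    datatype, test = COMPARISONS[comparison]
    for value in (field_value, term):
        if not isinstance(value, datatype):
            raise TypeError(f"{comparison} requires {datatype.__name__} values")
    return test(field_value, term)
```

With this arrangement, applying integer-equal to the string "1997" is an
error rather than a silent string comparison.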

Format/Structure

This type of attribute is used primarily to help with the interpretation of
a character string operand value in cases where the comparison operator
normally assumes an ASN.1 datatype; it provides guidance for the datatype
conversion process. In addition, the format/structure attribute can be used
for indirection, for example indicating that the operand value is a URL or
URN that points to a value rather than the operand value specified inline.
This is a non-repeatable, character valued attribute.
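
A minimal Python sketch of the conversion role of this attribute follows.
The format/structure values "iso8601-date" and "integer" and the function
convert_operand are assumptions for illustration only, and indirection
through a URL or URN is not shown.

```python
from datetime import date

# Hypothetical format/structure values mapped to conversion functions.
CONVERTERS = {
    "iso8601-date": date.fromisoformat,
    "integer": int,
}


def convert_operand(value, format_structure=None):
    """Convert a character string operand as directed by format/structure."""
    if format_structure is None:
        # Omitted attribute: there is no default, so keep the string as sent.
        return value
    return CONVERTERS[format_structure](value)
```

The converted value can then be handed to a comparison operator that
expects the corresponding datatype.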

Datatyping, Comparison attributes and Format/Structure Attributes

The working group recommends that term values should have strong
datatyping, and that this datatyping should carry over into the definition
of the comparison attributes (operators); for example, we should have
separate comparison attributes for strings, numerics, and the like. Those
groups defining specific use attributes should also consider defining ASN.1
datatypes to support their applications -- for example, personal names or
dates, or geospatial information (points and polygons). We recognize that
there are cases where the ASN.1 approach to datatyping will be considered
too "heavy-weight"; in those cases the format/structure attribute type can
be used in conjunction with strings to indicate that the contents of a
string represent some data in a specific format.

The basic datatypes that are defined as part of the general attribute class
should include:

numerics -- integers and intunits. These will need to be supported with the
usual comparison operators like equal, greater than, less than or equal,
etc.

character strings. These are handled lexically. They are not assumed to
contain words.

Language-bearing strings. These are character strings that contain one or
more words. They are treated as sets of words or phrases.  The working
group believes that this approach is better than tagging strings as "word
lists", for example; distinguishing between lexical and linguistic (or at
least token-based) operations should clarify queries considerably.
Different comparison operators are used for language strings as opposed to
lexical character strings.

dates (a new ASN.1 type)

We expect that other attribute set developers will define additional ASN.1
types, for example points and polygons.  Personal names are an interesting
"boundary" case where one might argue either for an ASN.1 based definition
or a format/structure attribute indicating a normalized name according to
some rules; the choice of the appropriate approach is, we believe, best
left to a bibliographic attribute definition working group.
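
The distinction drawn above between lexical character strings and
language-bearing strings can be illustrated with two small Python
functions. The function names, the whitespace tokenization, and the case
folding are simplifying assumptions, not behaviour specified by the Working
Group.

```python
def lexical_equal(field, term):
    """Lexical comparison: the strings are compared character by character."""
    return field == term


def words_match(field, term):
    """Token-based comparison: every word of the term occurs in the field.

    Splitting on whitespace and folding case are simplifying assumptions.
    """
    field_words = set(field.lower().split())
    return all(word in field_words for word in term.lower().split())
```

A call number such as "QA76.9" suits the lexical comparison, while a title
such as "The White Whale" suits the word-based one, where word order and
case do not affect the match.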

Follow-on Actions

The ZIG should define at least two attribute sets within the new attribute
set architecture (to some extent, there is a packaging question about how
many separate OIDs should be assigned). The Working Group on Attribute Set
Architecture strongly urges the ZIG to move away from naming conventions
such as BIB-1 which imply some sort of special legitimacy or precedence
hierarchy for various attribute sets, and not to use names for groups of
attribute sets like "CORE".  This will avoid needless political debates.

We suggest that the ZIG define one attribute set within this attribute
class which covers widely used basic functions, including comparison
operator values, language codes, and basic expansion/interpretation values,
plus query management types -- call this attribute set, for a working name,
"PURPLE".

In addition, we suggest that the ZIG define a basic set of use attributes,
called, for a working name, "ORANGE". We also believe that a committee of
bibliographic experts should be established, under auspices such as NISO,
to define a new bibliographic attribute set within this general framework.

Other notes:

In version 4 the term in an operand should be replaced by a sequence of
terms. In the interim, we might register a range ASN.1 definition (a pair
of term values) for use with version 3 range comparison types.  The
working group believes that explicit range operators would be very useful,
and that they should be added in preference to relying on boolean
combinations of operators that result in range definitions.
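
As an illustration of a range expressed as a pair of term values, the
Python sketch below compares an explicit range operator with the equivalent
Boolean combination. The Range type, the function names, and the choice of
inclusive endpoints are assumptions; the report does not say whether
endpoints are included.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """A pair of term values; inclusive endpoints are an assumption."""
    low: int
    high: int


def between(field_value, term_range):
    """Explicit range operator: one operand carrying one range value."""
    return term_range.low <= field_value <= term_range.high


def between_as_boolean(field_value, low, high):
    """The same test written as a Boolean AND of two comparison operands."""
    return (field_value >= low) and (field_value <= high)
```

Both forms select the same records, but the explicit operator states the
intent in a single operand that a server can recognize directly.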

Also, in version 4, we should allow attributes on operators.