Z39.50 Draft Attribute Architecture
This draft attribute architecture is for discussion at the January
1998 ZIG meeting.
Attribute set Class definitions
An Attribute set class definition provides an umbrella context for the
definition of an attribute set belonging to a particular class. A class
definition defines all of the attribute types that may be included in an
attribute set for that class.
(At least one attribute set class definition will be developed, but it is
not clear that more than one will be necessary.)
Attribute Set Class 1
This class is intended to cover all known, existing needs (existing
attribute sets will need to be re-specified within this framework). The
intent is not to preclude new types of attributes beyond those specified
here; it should be possible to add new attribute types to this broad
attribute class, if they are relatively orthogonal to the attribute types
defined here.
Note: There may be other approaches developed which
partition the set of attributes into fundamentally different types;
this might result in the definition of a new attribute class that is
inconsistent with this class. However, no need for such a separate
class has been identified.
The importance of enumerating all of the possible attribute types within
this "universal" attribute class is to provide a template for developers
of attribute sets, and to set up a framework for interoperability among
independently defined attribute sets which are intended to serve various
communities. In particular, it should be possible for groups of content
experts to develop new use attribute sets, ASN.1 datatypes, comparison
operators, and perhaps structure/format attributes which fit comfortably
within this framework. Server developers can, based on the template
defined here, recognize various attribute types that are omitted in a
given query, as well as illegal repetitions or combinations of attributes
of given types that would indicate a malformed query.
The attribute class defined here would be identified as being in effect
for a query by specifying, as the overall OID for the Z39.50 query, the
OID of an attribute set conformant with the class -- most likely one of
the utility attribute sets which this draft proposes (below) that the ZIG
develop.
Interaction between attribute sets conformant to this attribute set class
and historical attribute sets not conformant to this class within a query
operand is undefined.
The attribute types that are defined within this attribute class are
enumerated below. An attribute set consistent with this attribute class
will define attributes of one or more of the types specified here. If
attributes of a given type are omitted in a query they should be treated
as omitted in establishing the semantics of a given query (in other words,
there are no defaults for omitted attributes). Some types of attributes
(for example, use and field attributes) are mutually exclusive in a given
query operand; these rules are defined at the level of the attribute class
rather than specific attribute sets.
While repeatability may be permissible for a given attribute type, as a
general principle, an attribute type should not be repeated as a
substitute for Boolean operations. To amplify this point, an attribute
definition might prescribe how to interpret, for example, multiple Use
attributes in a single operand. For example, the definition might
prescribe:
- Multiple Use attributes may be supplied in order of preference, so if
a server does not support the first supplied, then use the second,
etc.; or
- if multiple Use attributes are supplied, the server is to choose the
"best" among the set; or
- multiple Use attributes imply nesting; thus, if Use attributes
use-1, use-2, and use-3 are specified in a single operand, it means
search for use-1 within use-2 within use-3 (see "Nesting of Use-type
Attributes" below).
The definition may include a semantic operator, allowing a client to
select among several semantic alternatives. However, none of those
alternatives should be to construct separate operands (linked by boolean
'and' or 'or') for each Use attribute. The reason is that the type-1
query already supports boolean operations, so allowing another means of
specifying them would add unnecessary complexity. This is in contrast
to potential semantic interpretations of multiple Use attributes which
cannot be otherwise represented via the type-1 query, as in the examples
above.
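As a minimal sketch of the first interpretation listed above (order of
preference), assume a hypothetical server that knows which Use attribute
values it supports. The function name and the numeric values are
illustrative only; they are not taken from any registered attribute set.

```python
from typing import Iterable, Optional, Set


def choose_use_attribute(supplied: Iterable[int], supported: Set[int]) -> Optional[int]:
    """Return the first supplied Use attribute value that the server supports.

    `supplied` is in the client's order of preference. Returns None when
    the server supports none of the supplied values.
    """
    for value in supplied:
        if value in supported:
            return value
    return None
```

Note that the result is a single Use attribute applied to a single operand;
the sketch never expands the operand into several operands joined by 'and'
or 'or'.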
Attribute Types defined within the Attribute Class
Use-type attributes
This attribute class definition recognizes that some applications of
Z39.50 make a strong link to database schemes, while others continue to
work with abstract definitions of databases. Thus there are two distinct
attribute types to accommodate these very different approaches to the use
of Z39.50. These two types should be mutually exclusive, unless care is
taken to describe in detail how they may be used in combination.
- Database Fieldname
Defined in conjunction with a specific database schema. It can be
qualified via repetition of the attribute. An example of such
qualification (nesting) might be a field path within an SGML
database. The Fieldname attribute should not be used in conjunction
with the Use attribute unless the attribute definition describes how
they may be used in combination. This is generally a character-valued
attribute, though it may also be useful to permit numeric values to
facilitate mapping of fieldnames from schema definitions that use
numeric assignments for fields.
- Use attribute
Defines an intellectual access point for a group of relatively
homogeneous databases, independent of database schema. Nesting,
established through repetition, is also valid for the use attribute
and establishes a context for the use attribute (for example, place
names within abstracts). There is not a clean line between a
qualified use attribute and the expansion/interpretation attributes
discussed below. A good example of this is the creation context used
by CIMI, which is an attribute on an SGML field. This should be a
numerically valued attribute.
Nesting of Use-type Attributes
Whenever multiple attributes of a given type are used for nesting, the
order of nesting should be as in the following example: if field-1,
field-2, and field-3 are supplied, in that order, it means field-1 within
field-2 within field-3. This rule, though arbitrary and perhaps beyond the
scope of architecture, is supplied in order to avoid conflicting
definitions, and reduce complexity of implementations supporting multiple
attribute sets where nesting is prescribed.
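The ordering rule can be sketched as follows, assuming the repeated
attributes of one operand are held in a list in the order supplied. Both
function names are hypothetical.

```python
from typing import List, Sequence


def describe_nesting(attributes: Sequence[str]) -> str:
    """Render the nesting implied by the supplied order (innermost first)."""
    return " within ".join(attributes)


def outermost_first(attributes: Sequence[str]) -> List[str]:
    """Reverse the supplied order so the outermost container comes first."""
    return list(reversed(attributes))
```

For example, describe_nesting(["field-1", "field-2", "field-3"]) yields
"field-1 within field-2 within field-3", while outermost_first gives the
order in which a server walking a field path from the top would visit the
same fields.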
Query Management Type Attributes
These attributes can be rewritten by the server when it returns a revised
query to the client as additional search information.
- Weight
The weight of an operand in a weighted boolean query. This should be
registered (along with a normalized value range) as part of a basic
attribute set within the attribute class. This is a non-repeating
numeric attribute.
- Hit Count
The number of records satisfying the operand. This should again be
registered as part of the basic attribute set. This attribute is
intended for purposes of conveying information from server to client,
but it may be passed back from client to server (when the client simply
wants to turn around a reformulated search -- in that case, it is to
be ignored by the server). This is a non-repeating numeric attribute.
- Stopwording
Used to indicate whether or not a given word was used as a stopword,
or whether it should or should not be considered as a stopword. A
numeric (actually enumerated) valued attribute.
Qualifying Attribute Types
- Language Attribute
The language of a given query operand. A general-use version of the
language attribute, using values drawn from some standard source,
should be defined and registered so that it is generally available.
It is not clear whether a character to numeric mapping is needed for
this attribute type or whether it should simply be character valued.
Not repeatable.
Note: A Character set Attribute is not proposed; the
current thinking is that it is unnecessary since it is handled by
general Z39.50 character set support.
- Content Authority
The source of the term. This is a character valued attribute. In the
interests of simplicity, it should probably be non-repeatable, although
there may be situations where repeatable content authority could be
meaningfully interpreted.
- Expansion/interpretation
Indicates that thesaural expansion, singular/plural matching, part of
speech qualification, phonetic matching, case sensitivity, or
stemming should be used in the query evaluation. Word by word
truncation is also viewed as a form of stemming and is to be included
within this attribute type, as would various loose forms of phrase
matching. Repeatable; may be character or numeric valued.
Comparison Operators
There are different comparison attributes for each of the term-value
datatypes discussed below. (See also the discussion of datatyping for
operand values below.)
Comparison attributes are strongly typed. They are mandatory,
non-repeatable and numeric valued.
Comparison attributes are somewhat similar to the relation attributes of
bib-1, but named differently to avoid confusion. Note that equality is
used only for cases of true equality testing (i.e., strings and integers).
Various "matching" comparison operators are used for string matching using
various kinds of regular expressions, for example. Sample values might
include:
- complete match
- doesn't match
- left and right anchored match
- contains
- contained in bounding-polygon
- match via grep
- relevance feedback
- equality as strings
- numeric greater than
- between (range operations in conjunction with the range datatype)
The bib-1 Completeness attribute, as well as much of the Truncation
attribute, have been folded into the comparison attribute; they are
replaced by anchored matching.
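A minimal sketch of anchored matching on plain strings follows. The
enumeration names are hypothetical, and treating a match anchored on both
sides as one that spans the whole field is an assumption; the draft lists
only sample comparison values and does not define their semantics.

```python
from enum import Enum


class Anchoring(Enum):
    # Hypothetical names; the draft assigns no identifiers or values.
    LEFT = "left"            # term must begin the field
    RIGHT = "right"          # term must end the field
    LEFT_AND_RIGHT = "both"  # assumed here to mean the term spans the field
    NONE = "contains"        # term may appear anywhere in the field


def anchored_match(term: str, field: str, anchoring: Anchoring) -> bool:
    """Test `term` against `field` under the given anchoring."""
    if anchoring is Anchoring.LEFT:
        return field.startswith(term)
    if anchoring is Anchoring.RIGHT:
        return field.endswith(term)
    if anchoring is Anchoring.LEFT_AND_RIGHT:
        return field == term
    return term in field
```

In this sketch a single comparison value carries what would otherwise need
a separate completeness or truncation attribute alongside the relation.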
Format/Structure
This attribute type is used primarily to help with the interpretation of
a character string operand value in cases where the comparison operator
normally assumes an ASN.1 datatype; it provides guidance for the datatype
conversion process. In addition, the format/structure attribute can be
used for indirection, for example indicating that the operand value is a
URL or URN that points to the value, rather than the value itself
specified inline. This is a non-repeatable, character valued attribute.
Datatyping, Comparison Attributes, and Format/Structure Attributes
It is recommended that term values have strong datatyping, carrying over
into the definition of the comparison attributes (operators); for example,
there should be separate comparison attributes for strings, numerics, etc.
Groups defining specific Use attributes should consider defining ASN.1
datatypes to support their applications -- for example, personal names or
dates, or geospatial information (points and polygons). There will of
course be cases where the ASN.1 approach to datatyping will be too
heavy-weight; in those cases the format/structure attribute type can be
used in conjunction with strings to indicate that the content of a string
represents some data in a specific format.
The basic datatypes defined as part of the general attribute class should
include:
- numerics
Integers and intUnits. These will need to be supported with the usual
comparison operators equal, greater than, less than or equal, etc.
- character strings
These are handled lexically. They are not assumed to contain words.
- language-bearing strings
These are character strings that contain one or more words. They are
treated as sets of words or phrases. This approach is felt to be
better than tagging strings as "word lists", for example;
distinguishing between lexical and linguistic (or at least
token-based) operations should clarify queries considerably.
Different comparison operators are used for language strings as
opposed to lexical character strings.
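The difference between lexical and token-based comparison can be
illustrated with a small sketch. The function names are hypothetical, and
the tokenizer (lowercase, split on whitespace) is a deliberately naive
assumption; real linguistic processing would be considerably richer.

```python
from typing import Set


def lexical_equal(term: str, value: str) -> bool:
    """Lexical comparison: the two strings must be identical."""
    return term == value


def words(text: str) -> Set[str]:
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return set(text.lower().split())


def contains_all_words(term: str, value: str) -> bool:
    """Token-based comparison: every word of the term occurs in the value."""
    return words(term) <= words(value)
```

The same pair of strings can fail the lexical test and pass the token-based
one, which is why the two datatypes call for separate comparison operators.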
Occurrence Attribute Types
- Occurrence Attribute.
Indicates the desired occurrence. For example "second subject
heading".
Dates
There is now a Z39.50 ASN.1 Date/time definition that should be specified
when the term is a date and/or time.
Additional Types
Attribute set developers may define additional ASN.1 types, for example
points and polygons. Personal names are an interesting "boundary" case
where one might argue either for an ASN.1 based definition or a
format/structure attribute indicating a normalized name according to some
rules; the choice of the appropriate approach is best left to a
bibliographic attribute definition working group.
Attribute Values
Although many attribute values are (and perhaps will continue to be)
enumerated integers, this architecture recognizes that an attribute value
may take any of the following forms:
- Enumerated integer.
- Integer value. For example, the value of an "occurrence" attribute
may simply be the "occurrence", for example, to indicate "second
subject heading" the value of the Occurrence attribute would be 2.
- Character string.
- A sequence of values.
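The four forms can be sketched as a tagged classification. The
enumeration and its values are hypothetical (the draft assigns none), and
representing a sequence of values as a Python list or tuple is an
assumption.

```python
from enum import IntEnum
from typing import Sequence, Union


class Stopwording(IntEnum):
    # Hypothetical enumeration, for illustration only.
    TREAT_AS_STOPWORD = 1
    DO_NOT_TREAT_AS_STOPWORD = 2


AttributeValue = Union[int, str, Sequence[Union[int, str]]]


def value_form(value: AttributeValue) -> str:
    """Name which of the four forms an attribute value takes."""
    # IntEnum members are also ints, so the enumerated case is tested first.
    if isinstance(value, IntEnum):
        return "enumerated integer"
    if isinstance(value, int):
        return "integer value"
    if isinstance(value, str):
        return "character string"
    if isinstance(value, (list, tuple)):
        return "sequence of values"
    raise TypeError(f"unsupported attribute value: {value!r}")
```

Here an Occurrence attribute requesting the second subject heading would
carry the plain integer 2, which value_form reports as "integer value".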
Follow-on Actions
The ZIG should define at least two attribute sets within the new attribute
set architecture (perhaps more than two; this is a packaging and
granularity question). The ZIG should move away from naming conventions
such as "bib-1" which imply some special legitimacy or precedence
hierarchy for various attribute sets, and not use names for groups of
attribute sets like "CORE". This may help avoid political debates.
One of the attribute sets (to be defined by the ZIG) within this attribute
class should cover widely used basic functions, including comparison
operator values, language codes, and basic expansion/interpretation
values, plus query management types -- call this attribute set, for a
working name, "PURPLE".
In addition, the ZIG should define a basic set of use attributes, called,
for a working name, "ORANGE". In addition, a committee of bibliographic
experts should be established, under auspices such as NISO, to define a
new bibliographic attribute set within this general framework.
Other note
In version 4 the term in an operand should be replaced by a sequence of
Terms. In the interim, an ASN.1 range definition, which is a pair of term
values, might be reserved for version 3 range comparison types. Explicit
range operators will be useful and should be added in preference to boolean
combinations of operators that result in range definitions. Also in
version 4, attributes on operators should be allowed.
Library of Congress
January 9, 1998