Z39.50 Draft Attribute Architecture
February 18, 1998
An earlier version of this draft attribute architecture was discussed
at the January 1998 ZIG meeting. Notes from that meeting were applied to
the February 6 version of this document. Comments were solicited until
February 17. Those comments have been applied to this (February 18)
version.
This document will remain stable until after the NISO attribute meeting in
March. The document may subsequently be revised as a result of that
meeting.
Following are changes applied between February 6 and February 18.
- At the beginning of 3.2.1.1,
Change:
When multiple Use-type attributes occur within an
operand, nesting is implied, unless a semantic indicator is
present, indicating a different interpretation of the multiple
occurrence.
To:
Multiple Use-type attributes may occur within an
operand, and this may imply nesting (either implicitly or via
semantic indicator; multiple occurrences may have a different
interpretation, also either implicitly or via a semantic
indicator).
- Section 3.1.5.1 has been added.
- The latter portion of 3.2.1.2 has been re-written to lift the
restriction for a nested search that the path is assumed to be
anchored.
- Added a reference in 3.2.3 to "character set and language
negotiation".
- Globally changed "character-valued" to "string-valued".
1. Introduction and Preliminary Notes
1.1 Historical Background
Ed. Note: Historical background to be supplied.
1.2 Acknowledgements
Ed. Note: List of meeting participants and other
significant contributors to be listed here.
1.3 Brief Technical Background
Z39.50 specifies a number of query types, and requires support for
one of those types, the type-1 query. This document is concerned with the
Z39.50 type-1 query only.
The type-1 query consists of one or more search terms, each with a set of
attributes, specifying, for example, the type of term (author, title,
subject, etc.), whether the term is truncated, and its structure. The
server is responsible for mapping attributes to the logical design of the
database.
A term in a type-1 query, together with its accompanying collection of
attributes, is referred to as an operand. Operands may be
combined in a type-1 query, linked by boolean operators (And, Or, And-not,
and Proximity).
Each attribute is a pair: an attribute type and a value
of that type. An Attribute set defines a set of attribute types,
and for each, a list of possible values.
Example: The bib-1 attribute set defines a number of attribute types,
one of which is Use. For bib-1 Use attributes, many attribute values
are defined, one of which is personal name.
An attribute set definition is assigned an object identifier, referred to
as its attribute set identifier.
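The terms above can be illustrated with a minimal Python sketch. It is not part of the standard: the class and field names are hypothetical, chosen only to mirror the vocabulary of this section. The OID and the numeric codes shown are those of the existing bib-1 definition (type 1 for Use, value 1 for personal name).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(frozen=True)
class Attribute:
    """An attribute is a pair: an attribute type and a value of that type."""
    attr_type: int
    value: Union[int, str]
    attribute_set: Optional[str] = None  # attribute set identifier (OID), if given


@dataclass
class Operand:
    """A term together with its accompanying collection of attributes."""
    term: str
    attributes: List[Attribute] = field(default_factory=list)


BIB1 = "1.2.840.10003.3.1"  # attribute set identifier of bib-1
USE = 1                     # bib-1 attribute type "Use"
PERSONAL_NAME = 1           # bib-1 Use value "personal name"

# One operand: the term plus a single Use attribute drawn from bib-1.
operand = Operand("Twain, Mark", [Attribute(USE, PERSONAL_NAME, BIB1)])
```

Because each attribute may carry its own attribute set identifier, a version 3 operand can combine attributes from different sets in the one list.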
In version 2 of Z39.50, all attributes within a query must belong to a
single attribute set. In version 3, attributes may be combined from
different attribute sets, within a single query, even within a single
operand. This is a significant enhancement, providing support for multiple
database searching, and allowing attribute sets to be defined with less
replication.
Also in version 3, new data types for terms are defined (in version 2 only
binary values are allowed).
1.4 Version 3 Assumption
There are several enhancements in version 3 pertaining to attribute sets
and query construction; the two enhancements described at the end of 1.3
are clearly the most important, and are seen to be functional
prerequisites for attribute architecture. For this reason, version 3 is
assumed by this architecture, and version 2 is not addressed.
1.5 Limitations
The Z39.50 type-1 query has known limitations, and the architecture
specified in this document is restricted by these limitations. As the
standard evolves and new versions are approved, the architecture may be
expanded.
1.5.1 Semantic Indicator
In order to compensate for some of the type-1 limitations, it may be
necessary to utilize the semantic indicator (provided within version 3)
for purposes that would otherwise be accomplished by more coherent
mechanisms if these limitations were not present. It should be thus noted
that in future versions of Z39.50 it is intended that these limitations
will be addressed, obviating the need for extensive use of the semantic
indicator at the attribute level.
1.5.2 Nesting and Occurrence
When specifying, for example, "field-1 within field-2" it will be possible
to specify an occurrence, for example, "second occurrence of field-1,
within field-2" but not, for example, "second occurrence of field-1,
within second occurrence of field-2". That is, only a single node may
carry occurrence, likely either the root or the leaf, and this should be
statically specified within the attribute set definition. (This is a
limitation of class 1, not a limitation of the type-1 query, and may be
overcome by classes other than 1).
1.6 Some Unresolved Issues
- Though there is clear consensus within the ZIG that nesting of
fieldname attributes should be supported, there is division of
opinion on whether nesting should be permitted for Use attributes.
- Similarly, there is division of opinion on whether specification of
occurrence should be permitted for Use attributes.
- Some feel that the limitation described in 1.5.2, for occurrence, can
be overcome, for example by use of the 'complex' form of
attributeValue.
- It is not clear whether anchoring is sufficiently specified. See
3.2.1.2.
- 3.1.2 states that a class 1 attribute set may not define any
attribute types not defined for class 1 (i.e. not defined in this
document). Some feel this is overly restrictive.
2. Attribute set Class definitions
An Attribute set class definition provides an umbrella context for the
definition of an attribute set belonging to a particular class. A class
definition defines all of the attribute types that may be included in an
attribute set for that class.
(At least one attribute set class definition will be developed, but it is
not clear that more than one will be necessary.)
3. Attribute Set Class 1
This class is intended to cover all known, existing needs (existing
attribute sets will need to be re-specified within this framework). The
intent is not to preclude new types of attributes beyond those specified
here; it should be possible to add new attribute types to this broad
attribute class, if they are relatively orthogonal to the attribute types
defined here.
Note: There may be other approaches developed which
partition the set of attributes into fundamentally different types;
this might result in the definition of a new attribute class that is
inconsistent with this class. However, no need for such a separate
class has been identified.
The importance of enumerating all of the possible attribute types within
this "universal" attribute class is to provide a template for developers
of attribute sets, and to set up a framework for interoperability among
independently defined attribute sets which are intended to serve various
communities. In particular, it should be possible for groups of content
experts to develop new use attribute sets, ASN.1 datatypes, comparison
operators, and perhaps structure/format attributes which fit comfortably
within this framework. Server developers can, based on the template
defined here, recognize various attribute types that are omitted in a
given query, as well as illegal repetitions or combinations of attributes
of given types that would indicate a malformed query.
3.1 General Rules for Class 1
3.1.1 Semantic Precedence and Interaction among Sets
A query indicates that this attribute class is in effect by supplying, as
the global OID of the Z39.50 query, the OID of an attribute set conformant
with the class (most likely one of the utility attribute sets that section
4 proposes the ZIG develop). The "global" OID refers to the object
identifier within the type-1 query that does not accompany a specific
attribute. For class 1, this is referred to as the dominant OID for the
query. When
attributes from different attribute sets are mixed within a query, and
when the respective attribute set definitions conflict such that the
resulting semantics are ambiguous, the semantics of the dominant set
prevail.
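The precedence rule can be sketched as follows. This is an illustration only, under the assumption that each attribute set involved yields one candidate interpretation of the operand. The function name and the set identifiers are hypothetical placeholders, not registered OIDs.

```python
from typing import Dict


def prevailing_semantics(interpretations: Dict[str, str], dominant_oid: str) -> str:
    """Choose the interpretation of an operand whose attribute sets may conflict.

    `interpretations` maps an attribute set identifier to the semantics that
    set's definition would assign. If all sets agree there is no ambiguity;
    otherwise the semantics of the dominant set prevail.
    """
    distinct = set(interpretations.values())
    if len(distinct) == 1:
        return next(iter(distinct))
    if dominant_oid not in interpretations:
        raise ValueError("dominant set does not define semantics for this case")
    return interpretations[dominant_oid]


# Placeholder identifiers (assumptions for this sketch, not real OIDs).
DOMINANT = "oid-of-dominant-set"
OTHER = "oid-of-other-set"

chosen = prevailing_semantics(
    {DOMINANT: "repeated Use attributes imply nesting",
     OTHER: "repeated Use attributes are ordered by preference"},
    DOMINANT,
)
```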
Any attribute set definition should:
- indicate whether that attribute set may be used as the dominant set
in a query; and if so:
- describe the rules that apply, when that set is used as the dominant
set, for intermixing of attributes from different sets within an
operand or query.
Interaction, within a query operand, between attribute sets conformant to
this attribute set class and historical attribute sets not conformant to
this class is undefined.
3.1.2 Inheritance and Population
An attribute set consistent with this attribute class will define
attributes of one or more of the types specified in 3.2.
Any class 1 attribute set inherits the rules prescribed for the class,
that apply to attribute types defined for that set. However, a class 1
attribute set need not define nor populate every attribute type defined
for class 1. A class 1 attribute set may define as few as one attribute
type, or as many as all of the attribute types defined for class 1. It may
not define attribute types not defined for class 1.
3.1.3 Omitted Attributes
If attributes of a given type are omitted in a query they should be
treated as omitted in establishing the semantics of a given query (in
other words, there are no defaults for omitted attributes).
3.1.4 Mutual Exclusivity
Some types of attributes (for example, use and field attributes) are
mutually exclusive in a given query operand; these rules are defined at
the level of the attribute class rather than specific attribute sets.
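Because such rules are fixed at the class level, a server can check an operand before evaluating it. The following is a minimal sketch of such a check. The type names are hypothetical labels for the attribute types of 3.2 (this document assigns them no identifiers), and the non-repeatable set lists only the types that 3.2 explicitly describes as non-repeatable.

```python
from collections import Counter
from typing import List

# Groups of attribute types that may not be mixed within one operand.
MUTUALLY_EXCLUSIVE = [{"use", "fieldname"}]

# Types that 3.2 describes as non-repeatable.
NON_REPEATABLE = {
    "weight", "hit-count", "language",
    "comparison", "format-structure", "indirection",
}


def check_operand(attr_types: List[str]) -> List[str]:
    """Return a list of problems found in an operand's attribute types."""
    problems = []
    present = set(attr_types)
    for group in MUTUALLY_EXCLUSIVE:
        clash = present & group
        if len(clash) > 1:
            problems.append("mutually exclusive: " + ", ".join(sorted(clash)))
    for name, count in sorted(Counter(attr_types).items()):
        if name in NON_REPEATABLE and count > 1:
            problems.append("repeated non-repeatable type: " + name)
    return problems
```

An empty result means the operand passes these two checks; it does not mean the query is otherwise well formed.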
3.1.5 Repeatability
In general if any attribute is allowed to be repeatable, the semantics of
repeating the attribute must be well defined (implicitly or explicitly).
When an attribute set definition is being developed and the need is
foreseen for an attribute to repeat, for example when values are
orthogonal, it is recommended that the developers consider separating the
values into different attribute types, if possible.
While repeatability may be permissible for a given attribute type, as a
general principle, an attribute type should not be repeated as a
substitute for Boolean operations. To amplify this point, an attribute
definition might prescribe how to interpret, for example, multiple Use
attributes in a single operand. For example, the definition might
prescribe:
- Multiple Use attributes may be supplied in order of preference, so if
a server does not support the first supplied, then use the second,
etc.; or
- if multiple Use attributes are supplied, the server is to choose the
"best" among the set; or
- multiple Use attributes imply nesting, thus if Use attributes
use-1, use-2, and use-3 are specified in a single operand, in that
order, it means search for use-3 within use-2 within use-1 (see
"Nesting of Use-type Attributes" 3.2.1.1).
The definition may include a semantic indicator, allowing a client to
select among several semantic alternatives. However, none of those
alternatives should be to construct separate operands (linked by boolean
'and' or 'or') for each Use attribute. The reason is that the type-1
query supports boolean operations, so allowing another means of specifying
boolean operations would add unnecessary complexity. This is in contrast
to potential semantic interpretations of multiple Use attributes which
cannot be otherwise represented via the type-1 query, as in the examples
above.
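As an illustration of the first interpretation listed above (order of preference), a server might select a Use attribute as in the sketch below. The function name is hypothetical, and the numeric values in the usage line are arbitrary examples rather than values prescribed by this architecture.

```python
from typing import Iterable, Optional, Set


def first_supported(use_values: Iterable[int], supported: Set[int]) -> Optional[int]:
    """Return the first Use value, in the client's order of preference,
    that the server supports, or None if it supports none of them."""
    for value in use_values:
        if value in supported:
            return value
    return None


# The client prefers 1003, then 1, then 4; this server supports only 1 and 4.
selected = first_supported([1003, 1, 4], {1, 4})
```

This interpretation has no equivalent among the type-1 boolean operators, which is why it is a legitimate use of repetition.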
3.1.5.1 Mechanism for Repeating Attributes
There are two mechanisms for providing multiple attributes of the same
type within an operand:
- Via 'list' within 'complex' CHOICE of 'attributeValue' within
AttributeElement.
- Via separate instances of AttributeElement.
The first mechanism (provided by version 3, and not supported in version
2) is the mechanism prescribed for this class.
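The prescribed mechanism can be pictured with the sketch below, which models an AttributeElement whose value is the 'complex' alternative carrying a list. The Python names are hypothetical. The semantic_action field is an assumption of this sketch: it stands for the semantic indicator discussed in this document, taken here to be carried alongside the list.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Union

StringOrNumeric = Union[str, int]


@dataclass
class ComplexValue:
    """The 'complex' alternative of attributeValue: a list of values."""
    values: List[StringOrNumeric]
    semantic_action: List[int] = field(default_factory=list)  # assumed placement


@dataclass
class AttributeElement:
    attribute_type: int
    value: Union[int, ComplexValue]      # numeric or complex
    attribute_set: Optional[str] = None  # OID; omitted if not supplied


def repeated(attribute_type: int,
             values: Iterable[StringOrNumeric],
             attribute_set: Optional[str] = None) -> AttributeElement:
    """Build ONE AttributeElement carrying several values of the same type,
    rather than several AttributeElements of that type."""
    return AttributeElement(attribute_type, ComplexValue(list(values)), attribute_set)


# Attribute type 1 is an arbitrary example number for this sketch.
element = repeated(1, ["Creator", "Name"])
```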
3.2 Attribute Types defined within the Attribute Class
3.2.1 Use-type attributes
This attribute class definition recognizes that some applications of
Z39.50 make a strong link to database schemes, while others continue to
work with abstract definitions of databases. Thus there are two distinct
attribute types to accommodate these very different approaches to the use
of Z39.50. These two types should not be mixed within an operand.
- Database Fieldname
Defined in conjunction with a specific database schema. It can be
qualified via repetition of the attribute. An example of such
qualification (nesting) might be a field path within an SGML
database. The Fieldname attribute should not be used in conjunction
with the Use attribute unless the attribute definition describes how
they may be used in combination. This is generally a string-valued
attribute, though it may also be useful to permit numeric values to
facilitate mapping of fieldnames from schema definitions that use
numeric assignments for fields.
- Use attribute
Defines an intellectual access point for a group of relatively
homogeneous databases, independent of database schema. Nesting,
established through repetition, is also valid for the use attribute
and establishes a context for the use attribute (for example, place
names within abstracts). There is not a clean line between a
qualified use attribute and the expansion/interpretation attributes
discussed below. A good example of this is the creation context used
by CIMI, which is an attribute on an SGML field. This should be a
numerically valued attribute.
3.2.1.1 Nesting of Use-type Attributes
Multiple Use-type attributes may occur within an operand, and this may
imply nesting (either implicitly or via semantic indicator; multiple
occurrences may have a different interpretation, also either implicitly or
via a semantic indicator).
The order of nesting should be as in the following example: if field-1,
field-2, and field-3 are supplied, in that order, it means field-3 within
field-2 within field-1. This rule, though arbitrary and perhaps beyond
the scope of architecture, is supplied in order to avoid conflicting
definitions, and reduce complexity of implementations supporting multiple
attribute sets where nesting is prescribed.
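The ordering rule amounts to reading the supplied list from the outermost field to the innermost one. A minimal sketch, with a hypothetical function name:

```python
from typing import Sequence


def describe_nesting(fields: Sequence[str]) -> str:
    """Render attributes supplied outermost-first as an 'X within Y' phrase."""
    return " within ".join(reversed(fields))


phrase = describe_nesting(["field-1", "field-2", "field-3"])
# phrase == "field-3 within field-2 within field-1"
```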
3.2.1.2 Anchored vs. Non-anchored Searching
Whether a search is flat or nested (structured), it should be either
implicitly clear, or there should be an explicit indicator (see 3.2.8)
designating whether the access point path is anchored or unanchored.
Definitions:
- Anchored means that matching must occur from the root of the
element tree;
- unanchored means that matching may occur beginning at any
node within the element tree.
Example:
Suppose a schema has elements Name (unstructured) and Creator,
structured into sub-elements Name, eMail, and Affiliation.
When the fieldName attribute Name is specified as anchored, it is
intended to match the first-level Name; if multiple fieldname
attributes Creator and Name are specified as anchored, they are
intended to match Name within Creator. If the single fieldname
attribute Name is specified as unanchored, it is intended to
match either Name or Name within Creator.
A single wildcard, for example, "Any", may be used in a fieldName path.
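The example schema can be used to sketch one possible reading of these definitions. The assumptions are: an element is identified by its path from the root; an anchored query path must match that whole path; an unanchored query path need only match its trailing portion; and the wildcard "Any" matches exactly one path step. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

WILDCARD = "Any"

# Element paths of the example schema, each written from the root.
SCHEMA: List[Tuple[str, ...]] = [
    ("Name",),
    ("Creator", "Name"),
    ("Creator", "eMail"),
    ("Creator", "Affiliation"),
]


def path_matches(query: Sequence[str], element_path: Sequence[str], anchored: bool) -> bool:
    """Test one element path against a fieldname path."""
    if len(query) > len(element_path):
        return False
    if anchored and len(query) != len(element_path):
        return False  # anchored: matching must start at the root
    # Unanchored: matching may begin at any node, so compare the tail only.
    candidate = element_path[len(element_path) - len(query):]
    return all(q == WILDCARD or q == e for q, e in zip(query, candidate))


def search(query: Sequence[str], anchored: bool) -> List[Tuple[str, ...]]:
    """Return every schema element that the fieldname path selects."""
    return [p for p in SCHEMA if path_matches(query, p, anchored)]
```

Under these assumptions, search(("Name",), True) selects only the first-level Name, while search(("Name",), False) selects both Name and Name within Creator.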
3.2.1.3 Mixing Fieldname Attributes from Multiple Attribute Sets
Mixing Fieldname attributes from multiple attribute sets is permissible,
and no attribute set conforming to this class should preclude mixing of
its fieldname attributes with fieldname attributes from other sets. This
is a cross-attribute-set rule for any attribute set conforming to class 1.
3.2.2 Query Management Type Attributes
These attributes have the property that the server can rewrite them when
it returns a revised query to the client as additional search
information.
- Weight
The weight of an operand in a weighted boolean query. This should be
registered (along with a normalized value range) as part of a basic
attribute set within the attribute class. This is a non-repeating
numeric attribute.
- Hit Count
The number of records satisfying the operand. This should again be
registered as part of the basic attribute set. This attribute is
intended for purposes of conveying information from server to client,
but it may be passed back from client to server (when the client simply
wants to turn around a reformulated search -- in that case, it is to
be ignored by the server). This is a non-repeating numeric attribute.
- Stopwording
For a query sent from client to server, may be used to request that
the server not treat a given word as a stopword. For a query returned
from the server, may be used to indicate that a given word was
treated as a stopword.
3.2.3 Qualifying Attribute Types
- Language Attribute
The language of a given query operand. A general-use version of the
language attribute, using values defined from some standard source
should be defined and registered so that it is generally available.
It is not clear whether a string to numeric mapping is needed for
this attribute type or whether it should simply be string-valued. Not
repeatable.
Note: A Character set Attribute is not proposed; the
current thinking is that it is unnecessary since it is handled by
general Z39.50 character set support. See "Character Set and Language
Negotiation".
- Content Authority
The source of the term. This is a string-valued attribute. In the
interests of simplicity, probably should be non-repeatable, although
there may be situations where repeatable content authority could be
meaningfully interpreted.
- Expansion/interpretation
Indicates that thesaural expansion, singular/plural matching, part of
speech qualification, phonetic matching, case sensitivity, or
stemming should be used in the query evaluation. Word by word
truncation is also viewed as a form of stemming and is to be included
within this attribute type, as would various loose forms of phrase
matching. Repeatable; may be string or numeric valued.
3.2.4 Comparison Attribute Type
There are different comparison attributes for each of the term-value
datatypes discussed below. (See also the discussion of datatyping for
operand values below.)
Comparison attributes are strongly typed. They are mandatory,
non-repeatable and numeric valued.
Comparison attributes are somewhat similar to the relation attributes of
bib-1, but named differently to avoid confusion. Note that equality is
used only for cases of true equality testing (i.e. strings and integers).
Various "matching" comparison operators are used for string matching using
various kinds of regular expressions, for example. Sample values might
include:
- complete match
- doesn't match
- left and right anchored match
- contains
- contained in bounding-polygon
- match via grep
- relevance feedback
- equality as strings
- numeric greater than
- between (range operations in conjunction with the range datatype)
The bib-1 Completeness attribute, as well as much of the Truncation
attribute, have been folded into the comparison attribute; they are
replaced by anchored matching.
3.2.5 Format/Structure Attribute Type
This attribute type is used primarily to help with the interpretation of
a character string operand value in cases where the comparison operator
normally assumes an ASN.1 datatype; it provides guidance for the datatype
conversion process. This is an enumerated or string-valued attribute.
Non-repeatable.
3.2.6 Occurrence Attribute Type
Indicates the desired occurrence. For example "second subject
heading".
3.2.7 Indirection Attribute Type
Indicates that the actual content of the term is not supplied, but
that a pointer (e.g. url) is supplied in lieu of the actual term.
This attribute will have enumerated values, e.g. URL, URN, DOI, etc.
Non-repeatable.
3.2.8 Anchor Attribute Type
Indicates whether a search is anchored or unanchored (see 3.2.1.2),
that is, whether matching is to occur at the root of the element
tree, or may begin at any node of the element tree.
3.3 Datatyping
It is recommended that term values have strong datatyping, carrying over
into the definition of the comparison attributes (operators); for example,
there should be separate comparison attributes for strings, numerics, etc.
Groups defining specific Use attributes should consider defining ASN.1
datatypes to support their applications -- for example, personal names or
dates, or geospatial information (points and polygons). There will of
course be cases where the ASN.1 approach to datatyping will be too
heavy-weight; in those cases the format/structure attribute type can be
used in conjunction with strings to indicate that the content of a string
represents some data in a specific format.
The basic datatypes defined as part of the general attribute class should
include:
- numerics
Integers and intUnits. These will need to be supported with the usual
comparison operators equal, greater than, less than or equal, etc.
- character strings
These are handled lexically. They are not assumed to contain words.
- language-bearing strings
These are character strings that contain one or more words. They are
treated as sets of words or phrases. This approach is felt to be
better than, for example, tagging strings as "word lists":
distinguishing between lexical and linguistic (or at least
token-based) operations should clarify queries considerably.
Different comparison operators are used for language strings as
opposed to lexical character strings.
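The effect of strong datatyping can be sketched as a table of comparison operators keyed by datatype, so that an operator defined for one datatype cannot be applied to another. The datatype and operator names below are hypothetical labels, and whitespace tokenization stands in for whatever word analysis a server actually performs.

```python
from typing import Callable, Dict, Tuple

# (datatype, operator) -> function(term, value) -> bool
COMPARISONS: Dict[Tuple[str, str], Callable[[object, object], bool]] = {
    # Numerics: the usual ordering operators.
    ("numeric", "greater-than"): lambda term, value: value > term,
    # Character strings: handled lexically, not assumed to contain words.
    ("string", "equal"): lambda term, value: value == term,
    # Language-bearing strings: treated as sets of words.
    ("language-string", "contains-word"):
        lambda term, value: term.lower() in value.lower().split(),
}


def compare(datatype: str, operator: str, term, value) -> bool:
    """Apply a comparison, rejecting operators not defined for the datatype."""
    try:
        op = COMPARISONS[(datatype, operator)]
    except KeyError:
        raise ValueError(f"{operator!r} is not defined for {datatype!r}") from None
    return op(term, value)
```

With this table, "twain" matches "Mark Twain" as a word within a language-bearing string but not under lexical string equality.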
3.3.1 Dates
There is now a Z39.50 ASN.1 Date/time definition that should be specified
when the term is a date and/or time.
3.3.2 Additional Types
Attribute set developers may define additional ASN.1 types, for example
points and polygons. Personal names are an interesting "boundary" case
where one might argue either for an ASN.1 based definition or a
format/structure attribute indicating a normalized name according to some
rules; the choice of the appropriate approach is best left to a
bibliographic attribute definition working group.
3.4 Attribute Values
Although many attribute values are (and perhaps will continue to be)
enumerated integers, this architecture recognizes that an attribute value
may take any of the following forms:
- Enumerated integer.
- Integer value. For example, the value of an "occurrence" attribute
may simply be the "occurrence", for example, to indicate "second
subject heading" the value of the Occurrence attribute would be 2.
- Character string.
- A sequence of values.
4. Follow-on Actions
The ZIG should define at least two attribute sets within the new attribute
set architecture (perhaps more than two; this is a packaging and
granularity question). The ZIG should move away from naming conventions
such as "bib-1" which imply some special legitimacy or precedence
hierarchy for various attribute sets, and not use names for groups of
attribute sets like "CORE". This may help avoid political debates.
One of the attribute sets (to be defined by the ZIG) within this attribute
class should cover widely used basic functions, including comparison
operator values, language codes, and basic expansion/interpretation
values, plus query management types -- call this attribute set, for a
working name, "PURPLE".
In addition, the ZIG should define a basic set of use attributes, called,
for a working name, "ORANGE". In addition, a committee of bibliographic
experts should be established, under auspices such as NISO, to define a
new bibliographic attribute set within this general framework.
5. Other note
In version 4 the term in an operand should be replaced by a sequence of
Terms. In the interim, a range ASN.1 definition (a pair of term values)
might be reserved for version 3 range comparison types. Explicit
range operators will be useful and should be added in favor of boolean
combinations of operators that result in range definitions. Also in
version 4, attributes on operators should be allowed.
Library of Congress