Report of the Working Group on Z39.50 Attribute Architecture

Draft for distribution at the April Z39.50 Implementor's Group meeting

Clifford Lynch
April 6, 1997

Note: due to time constraints, this document has not been reviewed by the Attribute Architecture Working Group prior to its distribution to the ZIG. I am responsible for all inaccuracies here; my apologies to my colleagues from the Architecture group. This report has not been posted electronically. Following the ZIG meeting, I'll incorporate comments from both the ZIG and the Working Group and then post the revised version to the ZIG list for additional comment, after which the report of the group will move to final status and the group's work will be complete.

Introduction

The Attribute Architecture Working Group has been asked to make recommendations to the ZIG on how to structure future attribute sets for Z39.50. A number of background documents define the problems and provide some discussion of the Working Group's deliberations to date; these have already been distributed to the ZIG and posted on the Z39.50 Implementor's Group list. (A list of these documents will be included in the final version of this document, along with lists of the participants at each of the Working Group's meetings.) This document attempts to summarize the conclusions of the Working Group, reflecting thinking from recent meetings of the Working Group in November 1996 and March 1997. It is intended as a summary of conclusions rather than a report of deliberations and discussion.

Classes of Attribute Sets

Perhaps the most important conclusion of the working group was the need to establish general structures (which we called attribute set "classes") that could provide an umbrella context for the definition of future attribute sets. Such a class would define all of the attribute types which could appear in attribute sets conformant to that class.
Without such a framework, work done by autonomous attribute set developers cannot interoperate meaningfully unless each developer group specifies how its attribute types interact with the attribute types developed by other groups.

Initially, we thought to define multiple classes, or at least the first of what might well be multiple classes, but it appears that a single class may meet at least present needs. The class defined here covers all existing needs of Z39.50 attribute sets, as far as the working group could determine -- although existing attribute sets would need to be respecified within this framework. The intent is not to preclude new types of attributes beyond those specified here, however; our thought is that the class defined here should be maintained on an ongoing basis, and that it should be possible to add new attribute types to this broad attribute class, if someone comes up with new attribute types that are relatively orthogonal to the attribute types defined here. There may also be other approaches developed which partition the set of attributes into fundamentally different types; this might result in the definition of a new attribute class that is inconsistent with the class under discussion here. However, the working group could not identify a need for such a separate class (which in effect constitutes a radically different structural view) of attribute sets.

The importance of enumerating all of the possible attribute types within this "universal" attribute class is to provide a template for developers of attribute sets, and to set up a framework for interoperability among independently defined attribute sets which are intended to serve various communities.
In particular, it should be possible for groups of content experts to develop new use attribute sets, ASN.1 datatypes, comparison operators, and perhaps structure/format attributes which fit comfortably within this framework. Server developers can, based on the template defined here, recognize various attribute types that are omitted in a given query, as well as illegal repetitions or combinations of attributes of given types that would indicate a malformed query.

The context of the attribute class defined here would be identified as being in effect for a query by specifying the OID of an attribute set conformant with the class in the overall OID for a Z39.50 query -- most likely one of the utility attribute sets which we are proposing that the ZIG should develop (see discussion below). Interaction within a query operand between attribute sets that are conformant to this attribute set class and historical attribute sets that are not conformant to it is undefined.

The attribute types that are defined within this attribute class are enumerated below. There are a few general rules. While repeatability is discussed with regard to each attribute type, a general principle is that repeating an attribute type should not be used as a substitute for Boolean operations. An attribute set which is consistent with this attribute class will define attributes of one or more of the types specified here. If attributes of a given type are omitted in a query, they should be treated as omitted in establishing the semantics of that query (in other words, there are no defaults for omitted attributes). Some types of attributes (for example, use and field attributes) are mutually exclusive in a given query operand; these rules are defined at the level of the attribute class rather than specific attribute sets.
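As a rough illustration of the kind of check a server developer might build from this template, the following Python sketch flags illegal repetitions, mutually exclusive combinations, and missing mandatory attributes in a single query operand. It is a sketch only: the type names, the TypeRule structure, and the validate_operand function are hypothetical and not part of any Z39.50 definition. The repeatability values are taken from the per-type descriptions later in this report; stopwording is left out because the report does not state whether it is repeatable.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeRule:
    repeatable: bool
    mandatory: bool = False


# Hypothetical type names; the rule values follow the per-type
# descriptions in this report. Stopwording is omitted because its
# repeatability is not stated.
RULES = {
    "fieldname": TypeRule(repeatable=True),
    "use": TypeRule(repeatable=True),
    "weight": TypeRule(repeatable=False),
    "hit_count": TypeRule(repeatable=False),
    "language": TypeRule(repeatable=False),
    "content_authority": TypeRule(repeatable=False),
    "expansion": TypeRule(repeatable=True),
    "comparison": TypeRule(repeatable=False, mandatory=True),
    "format_structure": TypeRule(repeatable=False),
}

# Pairs of types that may not both appear in one operand.
MUTUALLY_EXCLUSIVE = [("use", "fieldname")]


def validate_operand(attributes):
    """Return a list of problems for one operand.

    attributes is a list of (type_name, value) pairs. Omitted types are
    simply absent; no defaults are filled in.
    """
    problems = []
    counts = Counter(type_name for type_name, _ in attributes)
    for type_name, n in counts.items():
        rule = RULES.get(type_name)
        if rule is None:
            problems.append(f"unknown attribute type: {type_name}")
        elif n > 1 and not rule.repeatable:
            problems.append(f"illegal repetition: {type_name}")
    for a, b in MUTUALLY_EXCLUSIVE:
        if counts[a] and counts[b]:
            problems.append(f"mutually exclusive: {a} and {b}")
    for type_name, rule in RULES.items():
        if rule.mandatory and counts[type_name] == 0:
            problems.append(f"missing mandatory attribute: {type_name}")
    return problems
```

Keeping the rules in a table at the level of the class, rather than in each attribute set, mirrors the point made above: the rules belong to the attribute class, so independently developed attribute sets can be checked by the same code.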
Attribute Types defined within the Attribute Class

Use-type attributes

The Working Group felt that it was important to recognize that some applications of Z39.50 now make a strong link to database schemas, while others continue to work with abstract definitions of databases. Thus, we propose two distinct and mutually exclusive attribute types to accommodate these very different approaches to the use of Z39.50.

Database Fieldname -- this is defined in conjunction with a specific database schema. It can be qualified via repetition of the attribute. An example of such qualification (nesting) might be a field path within an SGML database. The fieldname attribute is mutually exclusive with the Use Attribute. This is a character valued attribute, although it may also be useful to permit numeric values to facilitate mapping of fieldnames from schema definitions that use numeric assignments for fields.

Use attribute -- defines an intellectual access point for a group of relatively homogeneous databases, independent of database schema. This is mutually exclusive with the database fieldname. Nesting, established through repetition, is also valid for the use attribute and establishes a context for the use attribute (for example, place names within abstracts). There is not a clean line between a qualified use attribute and the expansion/interpretation attributes discussed below -- a good example of this is the creation context used by CIMI, which is an attribute on an SGML field. This should be a numerically valued attribute.

Query Management Type Attributes
The working group had originally thought to assign a single type to query management attributes, which have the particular property that they can be rewritten by the server as part of a revised query returned to the client as additional search information. Upon closer examination we realized that since values needed to be associated with these attribute types, each had to be defined as a specific type, but also that there really weren't very many of them.

Weight -- this would be the weight of an operand in a weighted boolean query. We believe that this should be registered (along with a normalized value range) as part of a basic attribute set within the attribute class. This is a non-repeating numeric attribute.

Hit Count -- the number of records satisfying the operand. This should again be registered as part of the basic attribute set. Note that this attribute can be passed back from client to server. This is a non-repeating numeric attribute.

Stopwording -- this is used to indicate whether or not a given word was used as a stopword, or whether it should or should not be considered as a stopword. This would be a numeric (actually enumerated) valued attribute.

Qualifying Attribute Types

Language Attribute -- the language of a given query operand. The Working Group believes that a general use version of the language attribute, using values defined from some standard source, should be defined and registered so that it is generally available. It is not clear whether we need a character to numeric mapping for this attribute type or whether it should be character valued. This is not repeatable. Earlier versions of the Working Group's efforts suggested a character set attribute might be needed. The current position of the working group is that the character set attribute is unnecessary since it is handled by general Z39.50 character set support.

Content Authority -- this defines the source of the term. It is a character valued attribute.
In the interests of reducing complexity, the working group believes that this should be non-repeatable, although we can envision some esoteric situations where a repeatable content authority could be meaningfully interpreted.

Expansion/Interpretation -- this is used to indicate that thesaural expansion, singular/plural matching, part of speech qualification, phonetic matching, case sensitivity, or stemming should be used in the query evaluation. Word by word truncation is also viewed as a form of stemming and is indicated within this attribute type, as would various loose forms of phrase matching. This is repeatable, and may be character or numeric valued.

Comparison Operators

These are mandatory and non-repeatable; they are numeric. Comparison attributes are strongly typed; there are different comparison attributes for each of the term value datatypes discussed below. See also the discussion of datatypes for operand values later in this document. Comparison attributes are somewhat similar to the relation operators of BIB-1, but are named differently to avoid confusion. Note that equality is now used only for cases of true equality testing (i.e., strings and integers); various "matching" comparison operators are used for string matching using various kinds of regular expressions, for example. Sample values might include:

-- complete match
-- doesn't match
-- left and right anchored match
-- contains
-- contained in bounding-polygon
-- match via grep
-- relevance feedback
-- equality as strings
-- numeric greater than
-- between (range operations, in conjunction with the range datatype)

Completeness and much of the truncation attribute have been folded into the comparison attribute; they are replaced by anchored matching.
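To make the idea of strongly typed comparisons and anchored matching more concrete, here is a small Python sketch. It is one possible reading, not a definition: the function names are hypothetical, anchoring is modeled as two independent flags, and "numeric greater than" is assumed to mean that the field value exceeds the term. Under this reading, a match anchored at both ends behaves like a complete match, a left-anchored match behaves like right truncation, and an unanchored match behaves like "contains".

```python
def anchored_match(term, field, left, right):
    """String comparison with optional anchoring at either end.

    Strongly typed: both values must be character strings.
    """
    if not isinstance(term, str) or not isinstance(field, str):
        raise TypeError("anchored_match is defined only for strings")
    if left and right:
        return field == term           # anchored at both ends
    if left:
        return field.startswith(term)  # anchored at the start only
    if right:
        return field.endswith(term)    # anchored at the end only
    return term in field               # unanchored: contains


def numeric_greater_than(term, field):
    """Numeric comparison; assumed to test whether field > term.

    Strings are rejected rather than converted, so a string operand
    cannot silently be compared as a number.
    """
    for value in (term, field):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("numeric_greater_than is defined only for numbers")
    return field > term
```

For example, anchored_match("lynch", "lynch, clifford", left=True, right=False) is true, while the same call with right=True is false because the field contains more than the term.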
Format/Structure

This type of attribute is used primarily to help with the interpretation of a character string operand value in cases where the comparison operator normally assumes an ASN.1 datatype; it provides guidance for the datatype conversion process. In addition, the format/structure attribute can be used for indirection, for example indicating that the operand value is a URL or URN that points to a value rather than the operand value specified inline. This is a non-repeatable, character valued attribute.

Datatyping, Comparison attributes and Format/Structure Attributes

The working group recommends that term values should have strong datatyping, and that this datatyping should carry over into the definition of the comparison attributes (operators); for example, we should have separate comparison attributes for strings, numerics, and the like. Those groups defining specific use attributes should also consider defining ASN.1 datatypes to support their applications -- for example, personal names or dates, or geospatial information (points and polygons). We recognize that there are cases where the ASN.1 approach to datatyping will be considered too "heavy-weight"; in those cases the format/structure attribute type can be used in conjunction with strings to indicate that the contents of a string represent some data in a specific format.

The basic datatypes that are defined as part of the general attribute class should include:

numerics -- integers and intunits. These will need to be supported with the usual comparison operators like equal, greater than, less than or equal, etc.

character strings. These are handled lexically. They are not assumed to contain words.

Language-bearing strings. These are character strings that contain one or more words. They are treated as sets of words or phrases.
The working group believes that this approach is better than tagging strings as "word lists", for example; distinguishing between lexical and linguistic (or at least token-based) operations should clarify queries considerably. Different comparison operators are used for language strings as opposed to lexical character strings.

dates (a new ASN.1 type)

We expect that other attribute set developers will define additional ASN.1 types, for example points and polygons. Personal names are an interesting "boundary" case where one might argue either for an ASN.1 based definition or a format/structure attribute indicating a normalized name according to some rules; the choice of the appropriate approach is, we believe, best left to a bibliographic attribute definition working group.

Follow-on Actions

The ZIG should define at least two attribute sets within the new attribute set architecture (to some extent, there is a packaging question about how many separate OIDs should be assigned). The Working Group on Attribute Set Architecture strongly urges the ZIG to move away from naming conventions such as BIB-1, which imply some sort of special legitimacy or precedence hierarchy for various attribute sets, and not to use names such as "CORE" for groups of attribute sets. This will avoid needless political debates.

We suggest that the ZIG define one attribute set within this attribute class which covers widely used basic functions, including comparison operator values, language codes, and basic expansion/interpretation values, plus query management types -- call this attribute set, for a working name, "PURPLE". In addition, we suggest that the ZIG define a basic set of use attributes, called, for a working name, "ORANGE". Finally, we believe that a committee of bibliographic experts should be established, under auspices such as NISO, to define a new bibliographic attribute set within this general framework.
Other notes:

In version 4 the term in an operand should be replaced by a sequence of terms. In the interim, we might register a range ASN.1 definition for version 3 range comparison types, which is a pair of term values. The working group believes that explicit range operators would be very useful, and that they should be added in preference to boolean combinations of operators that result in range definitions. Also, in version 4, we should allow attributes on operators.
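The difference between an explicit range operator and the equivalent boolean combination can be sketched in Python as follows. This is an illustration only: the Range class and both functions are hypothetical names, integer terms are used for simplicity, and inclusive bounds are an assumption, since the report does not say whether the endpoints of a range are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A range expressed as a pair of term values (integers here)."""
    lower: int
    upper: int

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")


def between(value, rng):
    """Explicit range comparison: one operand, one comparison operator.

    Inclusive bounds are an assumption of this sketch.
    """
    return rng.lower <= value <= rng.upper


def between_as_boolean(value, lower, upper):
    """The same test written as a boolean AND of two separate comparisons."""
    return (value >= lower) and (value <= upper)
```

Both forms give the same answer, but the explicit form carries the pair of values in a single operand, so a server can recognize the request as a range without reassembling it from two operands.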