| Example | Explanation |
|---|---|
| title all "complete dinosaur" | Title contains all of the words: "complete" and "dinosaur" |
| title any "dinosaur bird reptile" | Title contains any of the words: "dinosaur", "bird", or "reptile" |
| (caudal or dorsal) prox vertebra | A proximity query: either "caudal" or "dorsal" near "vertebra" |
| ribs prox/distance<=5 chevrons | A more specific proximity query: "ribs" within 5 words of "chevrons" |
| ribs prox/unit=sentence chevrons | "ribs" in the same sentence as "chevrons" |
| ribs prox/distance>0/unit=paragraph chevrons | "ribs" and "chevrons" occurring in the same document in different paragraphs |
| subject any/relevant "fish frog" | Find documents that would seem relevant either to "fish" or "frog" |
| subject any/rel.lr "fish frog" | Same as previous, but use a specific relevance algorithm (linear regression) |
Following is the Backus Naur Form (BNF) definition for CQL. ["::=" represents "is defined as"]

cqlQuery ::= prefixAssignment cqlQuery | scopedClause
prefixAssignment ::= '>' prefix '=' uri | '>' uri
scopedClause ::= scopedClause booleanGroup searchClause | searchClause
booleanGroup ::= boolean [modifierList]
boolean ::= 'and' | 'or' | 'not' | 'prox'
searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
relation ::= comparitor [modifierList]
comparitor ::= comparitorSymbol | namedComparitor
comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>'
namedComparitor ::= identifier
modifierList ::= modifierList modifier | modifier
modifier ::= '/' modifierName [comparitorSymbol modifierValue]
prefix, uri, modifierName, modifierValue, searchTerm, index ::= term
term ::= identifier | 'and' | 'or' | 'not' | 'prox'
identifier ::= charString1 | charString2
charString1 := Any sequence of characters that does not include any of the following: whitespace, ( (open parenthesis), ) (close parenthesis), =, <, >, " (double quote), /. If the final sequence is a reserved word, that token is returned instead. Note that '.' (period) may be included, and a sequence of digits is also permitted. Reserved words are 'and', 'or', 'not', and 'prox' (case insensitive). When a reserved word is used in a search term, case is preserved.
charString2 := Double quotes enclosing a sequence of any characters except double quote (unless preceded by backslash (\)). Backslash escapes the character following it. The resultant value includes all backslash characters except those releasing a double quote (this allows other systems to interpret the backslash character). The surrounding double quotes are not included.
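To illustrate the lexical rules for charString1 and charString2, here is a minimal Python sketch of a tokenizer. The name `tokenize` and the regular expression are assumptions made for this example and are not part of CQL. The sketch assumes that whitespace, parentheses, '=', '<', '>', '/' and the double quote end an unquoted string, and it keeps the surrounding quotes on quoted tokens so that a later parsing step can tell them apart from reserved words.

```python
import re

# Hypothetical token pattern for this sketch. Alternatives are tried in order.
TOKEN_RE = re.compile(r'''
    "(?:[^"\\]|\\.)*"     # charString2: quoted string, backslash escapes next char
  | <=|>=|<>|[=<>]        # comparitorSymbol (two-character forms first)
  | [()/]                 # parentheses and the modifier separator
  | [^\s()=<>"/]+         # charString1: run of non-delimiter characters
''', re.VERBOSE)


def tokenize(query):
    """Split a CQL query string into a list of raw tokens."""
    tokens = []
    pos = 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1          # whitespace only separates tokens
            continue
        m = TOKEN_RE.match(query, pos)
        if m is None:
            # For example, an opening double quote that is never closed.
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(m.group(0))
        pos = m.end()
    return tokens


print(tokenize('dc.title any/relevant "cat fish"'))
```

A full parser would then apply the grammar productions to this token list; that step is not shown here.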
CQL Query
A CQL query is essentially a search clause, or multiple
search clauses connected by boolean operators. (In addition
it may include prefix assignments which assign short names to known
contexts. See context sets.)
Search Clause
A search clause consists of an index, relation, and search term,
or a search term alone. Thus every search clause has a search term,
but both the index and relation may be omitted - the clause
must include either both or neither of the index and relation.
(Note that the use of the "index" concept in CQL is not
intended to have any implementation implications; it does not imply
the presence of a physical index.)
Examples:
Index/relation/search term: title = cat
Search term only: cat
Search Term
Search terms may be enclosed in double quotes. Search terms must be
enclosed in double quotes if they contain any of the following characters: < > =
/ ( ) and whitespace. The search term may be empty, but must be present
in a search clause. An empty search term is expressed as "" and has
no defined semantics.
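The quoting rule can be sketched as a small helper that a client might use when building queries. The function name `quote_term` is hypothetical. Treating the double quote itself as a character that forces quoting, and escaping it with a backslash, are assumptions of this sketch based on the escaping convention used elsewhere in this document.

```python
import re

# Characters that force quoting: < > = / ( ) and whitespace.
# The double quote is added here as an assumption of this sketch.
NEEDS_QUOTES = re.compile(r'[<>=/()\s"]')


def quote_term(term):
    """Return the term in a form usable in a search clause."""
    if term == "" or NEEDS_QUOTES.search(term):
        # Escape embedded double quotes with a backslash, then wrap.
        escaped = term.replace('"', '\\"')
        return f'"{escaped}"'
    return term


print(quote_term("cat"))
print(quote_term("cat in the hat"))
```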
Index Name
An index name always includes a base name and may also include a
prefix. The prefix provides a context for the index name: it names
the context set of which the index is a
part. If the context is not supplied, it is determined by the server. If
the index is not supplied, it is determined by the server. (Note
that the index may be omitted only when the relation is also omitted.
Either both must be supplied, or both omitted.)
Examples:
title = cat (context determined by the server)
dc.title = cat (index context is dc)
cat (context and index determined by the server)
Relation
The relation in a search clause specifies the relationship
between the index and search term. It also always includes
a base name and may also include a prefix providing a context for
the relation. If a relation is supplied with no accompanying context,
the context is 'cql' (the cql
context set). If no relation is supplied, then cql.scr
(server choice) is assumed, which means that the relation is determined
by the server. (Note that the relation may be omitted only when the
index is also omitted. Either both must be supplied, or both omitted.)
Examples:
title = cat (context for relation is 'cql'; fully qualified relation is cql.=)
title cql.any cat (relation is 'any'; relation context is 'cql'. Equivalent to: title any cat)
cat (index and relation are determined by the server; formally the relation is 'cql.scr')
Examples with relation modifiers:
dc.title any/relevant/rel.CORI "cat fish" (the relation 'any' is modified by (1) 'relevant', whose context is 'cql', and (2) 'CORI', whose context is 'rel')
dc.author exact/stem "smith, j." (the relation 'exact' is modified by 'stem', whose context is 'cql')
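The way a context prefix is resolved for relations and their '/'-separated modifiers can be sketched as follows. The function names `split_context` and `parse_relation` are hypothetical. The sketch handles only modifiers without values, as in the examples above, and uses `None` to stand for "determined by the server".

```python
def split_context(name, default=None):
    """Split 'prefix.base' into (prefix, base); use default if no prefix."""
    prefix, dot, base = name.partition(".")
    if not dot:
        return default, name
    return prefix, base


def parse_relation(text):
    """Parse e.g. 'any/relevant/rel.CORI' into a relation and its modifiers."""
    parts = [p.strip() for p in text.split("/")]
    # Relations and relation modifiers default to the 'cql' context set.
    relation = split_context(parts[0], default="cql")
    modifiers = [split_context(p, default="cql") for p in parts[1:]]
    return relation, modifiers


print(split_context("dc.title"))   # index with an explicit context
print(split_context("title"))      # index context left to the server
print(parse_relation("any/relevant/rel.CORI"))
```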
Boolean Operators
Search clauses may be linked by boolean operators. These are: and, or, not and prox.
(Note that not is really and-not, that is,
it may not be used as a unary operator.) Boolean operators all have
the same precedence; they are evaluated left-to-right. Parentheses
may be used to override left-to-right evaluation.
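Because all boolean operators share one precedence, an unparenthesized query is combined strictly from left to right. The following sketch shows the effect using Python sets as stand-ins for result sets. The function name and the sample record numbers are invented for illustration, and 'prox' is omitted because it needs positional information that plain sets do not carry.

```python
def evaluate_left_to_right(operands, operators):
    """Combine result sets left to right, with no operator precedence."""
    result = operands[0]
    for op, operand in zip(operators, operands[1:]):
        op = op.lower()              # operators are case insensitive
        if op == "and":
            result = result & operand
        elif op == "or":
            result = result | operand
        elif op == "not":            # 'not' means and-not
            result = result - operand
        else:
            raise ValueError(f"unsupported operator: {op}")
    return result


# Hypothetical record identifiers matching each term.
cat, dog, fish = {1, 2}, {2, 3}, {3, 4}

# 'cat or dog and fish' is read as '(cat or dog) and fish'.
print(evaluate_left_to_right([cat, dog, fish], ["or", "and"]))
```

With these sample sets the result is {3}. Under the conventional rule that 'and' binds tighter than 'or', the same query would instead yield {1, 2, 3}, which is why parentheses matter when porting queries from other languages.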
Boolean Modifiers
Just as a relation may have modifiers, a boolean operator
may have modifiers, separated by '/' characters. Boolean modifiers
may come from any context set. If not supplied, the context is
the CQL context set. (Note that
Boolean operators themselves are limited to the built-in set of
four.)
Example: dc.title=cat and/rel.sum dc.title=dog
Case Insensitive
All parts of CQL are case insensitive apart from user supplied search
terms, which may or may not be case sensitive. 'OR', 'or', 'Or'
and 'oR' are all the same boolean operator, just as 'dc.title',
'DC.Title' and 'dC.TiTLe' are all the same context set plus index
name.
The following are all formally defined by the CQL context set but described here for convenience.
For ordered (e.g. numeric) terms:
<, >, <=, >=,
and <> mean "less than", "greater than", "less
or equal", "greater or equal", and "not equal".
When the term is a list of words:
'=' is used for word adjacency: the words appear in that order with no others intervening. (Note the dual use of '=': it is also used for numeric equality.)
'any' means "any of these words"
'all' means "all of these words"
When the term is a character string:
'exact' is used for exact
string matching.
When the term has multiple dimensions:
'within' may be used to search for values that
fall within the range, area or volume described by the search term.
When the index's data has multiple dimensions:
'encloses' may be used to search for values
where the database's term fully encloses the search term.
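The word-list and string relations described above can be sketched with a simple matcher. This is an illustration only: the function names are hypothetical, words are split on whitespace, comparison is case insensitive (the specification leaves case sensitivity of terms open), and 'within' is shown for a numeric range with inclusive bounds, which is an assumption.

```python
def matches(relation, field, term):
    """Test a field value against a term for '=', 'any', 'all' and 'exact'."""
    field_words = field.lower().split()
    term_words = term.lower().split()
    if relation == "any":
        return any(word in field_words for word in term_words)
    if relation == "all":
        return all(word in field_words for word in term_words)
    if relation == "exact":
        return field.lower() == term.lower()
    if relation == "=":
        # Word adjacency: the term's words appear consecutively, in order.
        n = len(term_words)
        return any(field_words[i:i + n] == term_words
                   for i in range(len(field_words) - n + 1))
    raise ValueError(f"unsupported relation: {relation}")


def within(value, term):
    """Test whether a number falls in the range given as 'low high'."""
    low, high = (float(x) for x in term.split())
    return low <= value <= high


print(matches("=", "a day in the life of the cat in the hat", "cat in the hat"))
print(matches("all", "hat in the cat", "cat hat"))
print(within(2004, "2002 2005"))
```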
| This query | Would match this | but not this |
|---|---|---|
| title = "cat in the hat" | "a day in the life of the cat in the hat" | "hat in the cat" or "cat in the green hat" |
| title all "cat hat" | "hat in the cat" | "cat in the grass" |
| title any "cat hat" | "cat in the grass" | "dog in the grass" |
| title exact "cat in the hat" | "cat in the hat" | "a day in the life of the cat in the hat" |
| date within "2002 2005" | 2004 | 2006 |
| dateRange encloses 2003 | "2002 2005" | "2004 2005" |
These relation modifiers request that the server perform some algorithm on the term before processing.
stem
The server should apply a stemming algorithm to the words within the
term. For example, walked, walking, walker, etc. would all
be represented by the stem word walk. This allows a search like
title =/stem "these completed dinosaurs" to match The Complete
Dinosaur.
relevant
The server should use a relevancy algorithm for determining matches
and the order of the result set.
Example: subject any/relevant "fish frog"
would find records relevant to "fish" or "frog" and
order the result set by relevance to fish or frog.
These modifiers qualify the relation to more precisely determine its semantics.
word
The term consists of words (rather than being an opaque string).
string
The term is a single item, and should not be broken up.
isoDate
Each item within the term conforms to the ISO 8601 specification for
expressing dates.
number
Each item within the term is a number.
uri
Each item within the term is a URI.
masked
This means that the masking rules (see next) apply. Masking is assumed
even if not specified, unless 'unmasked' is specified (so there
is never any reason to include 'masked').
A single asterisk (*) is used to mask zero or more characters.
A single question mark (?) is used to mask a single character, thus N consecutive question-marks means mask N characters.
Caret/hat (^) is used as an anchor character for terms that are word lists, that is, where the relation is 'all' or 'any', or '=' when used for word adjacency. It may not be used to anchor a string, that is, when the relation is 'exact' (string matches are, by definition, anchored). It may occur at the beginning or end of a word (with no intervening space) to mean left or right anchored, respectively. "^" has no special meaning when it occurs within a word (not at the beginning or end) or string, but must be escaped nevertheless.
Backslash (\) is used to escape '*', '?', quote (") and '^', as well as itself. The use of a backslash not followed immediately by one of these characters is reserved for future definition.
Masking examples:
dc.title = c*t (matches cat and coast etc.)
dc.title = c?t (matches cat and cot, not coast or ct)
" ?" (matches any single character)
dc.title = "^cat in the hat" (matches 'cat in the hat' where it is at the beginning of the field)
dc.title any "^cat eats rat" (matches 'cat eats rat', 'cat eats dog', 'cat', but not 'rat eats cat')
dc.title any "^cat ^dog eats rat" (matches 'cat eats rat', 'dog eats cat', 'cat loves bat', but not 'bat loves cat')
dc.title = "\"Of Course\" she said"
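The masking rules for '*', '?' and backslash can be sketched by translating a single masked word into a regular expression. The function name `mask_to_regex` is hypothetical. The sketch assumes case-insensitive matching and assumes that any '^' anchors have already been handled separately at the word-list level, so it does not interpret them. A backslash followed by any other character is rejected, since that use is reserved.

```python
import re


def mask_to_regex(word):
    """Translate one masked CQL word into a compiled regular expression."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "\\":
            # Backslash may only escape * ? " ^ and itself.
            if i + 1 >= len(word) or word[i + 1] not in '*?"^\\':
                raise ValueError("unsupported backslash escape")
            out.append(re.escape(word[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")      # zero or more characters
        elif ch == "?":
            out.append(".")       # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.IGNORECASE | re.DOTALL)


pattern = mask_to_regex("c?t")
print([w for w in ["cat", "cot", "coast", "ct"] if pattern.fullmatch(w)])
```

The pattern is meant to be applied with `fullmatch`, so that the mask must account for the whole word.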
A search clause may be a result set name. This is a special case, employing the context set 'cql'. The index and relation are expressed as "cql.resultSetId =" and the term is a result set name that has been returned by the server in the 'resultSetName' parameter of the response. It may be used by itself in a query to refer to an existing result set from which records are desired. It may also be used in conjunction with other resultSetName clauses or other indexes, combined by boolean operators. The semantics of resultSetId with relations other than "=" is undefined.
Example: cql.resultSetId = "resultA" and cql.resultSetId = "resultB"
The proximity boolean operator is expressed in terms of distance, unit, and ordering.
Examples:
distance takes the form:
distance [relation] [value]
where relation is one of: "<", ">", "<=", ">=", "=", "<>"; default "<="
and value is a non-negative integer; default: 1 for word, zero otherwise
unit takes the form
unit=[value]
where value is one of "word", "sentence", "paragraph", or "element" (default "word").
ordering is "ordered" or "unordered"; default "unordered".
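The defaults above can be made concrete with a small sketch that parses a proximity operator such as prox/distance<=5 and fills in whatever is missing. The function name `parse_prox` and the returned dictionary layout are hypothetical. Writing the ordering as a bare modifier (prox/ordered) is an assumption of this sketch.

```python
import re

# Hypothetical pattern: a modifier name, optionally followed by a
# comparison symbol and a value.
MODIFIER_RE = re.compile(r'(\w+)\s*(?:(<=|>=|<>|=|<|>)\s*(\w+))?')
UNITS = {"word", "sentence", "paragraph", "element"}


def parse_prox(operator):
    """Parse e.g. 'prox/distance<=5/unit=sentence' into a dict with defaults."""
    parts = [p.strip().lower() for p in operator.split("/")]
    if parts[0] != "prox":
        raise ValueError("not a proximity operator")
    relation, distance, unit, ordering = "<=", None, "word", "unordered"
    for part in parts[1:]:
        m = MODIFIER_RE.fullmatch(part)
        if m is None:
            raise ValueError(f"bad modifier: {part}")
        name, rel, value = m.groups()
        if name == "distance" and rel is not None:
            relation, distance = rel, int(value)
        elif name == "unit" and rel == "=" and value in UNITS:
            unit = value
        elif name in ("ordered", "unordered") and rel is None:
            ordering = name
        else:
            raise ValueError(f"bad modifier: {part}")
    if distance is None:
        # Default distance depends on the unit: 1 for word, 0 otherwise.
        distance = 1 if unit == "word" else 0
    return {"relation": relation, "distance": distance,
            "unit": unit, "ordering": ordering}


print(parse_prox("prox/unit=sentence"))
```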
Context sets permit CQL users to create their own indexes, relations, relation modifiers and boolean modifiers without fear of choosing the same name as someone else and thereby having an ambiguous query. All four of these aspects of CQL must come from a context set; however, there are rules for determining the prevailing default if one is not supplied. Context sets allow CQL to be used by communities in ways which the designers could not have foreseen, while still maintaining the same rules for parsing which allow interoperability.
When defining a new context set, it is necessary to provide a description of the semantics of each item within it. While context sets may contain indexes, relations, relation modifiers and boolean modifiers, there is no requirement that all should be present; in fact it is expected that most context sets will only define indexes.
Each context set has a unique identifier, a URI. When sending the context set in a query, a short form is used. These short names may be sent as a mapping within the query itself (see next), or be published by the recipient of the query in some protocol dependent fashion. The prefix 'cql' is reserved for the CQL context set, but authors may wish to recommend a short name for use with their set.
An index, relation, or modifier qualified by a context is represented in the form prefix.value, where prefix is a short name for a unique context set identifier.
Binding Short Name to URI
The binding of short name to URI is defined either within the
query or by the server. A prefix map may occur at any place in the query
and applies to anything which follows. Example:
>dc="http://www.dublincore.org/" dc.title = "cat"
In the following query:
>a="http:/x.com/y" a.title=cat and (>a="http:/f.com/g" a.title=hat) and a.title=rat
both the "a" in "a.title=cat" and in "a.title=rat" refer to http:/x.com/y, while the "a" in "a.title=hat" refers to http:/f.com/g.
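The scoping behaviour in this example can be sketched with a stack of prefix maps: a binding applies to what follows it, and a binding made inside parentheses ends at the closing parenthesis. The function name `resolve_prefixes` and the simplified scanning pattern are assumptions of this sketch. It recognises only bindings of the form >name="uri" and names of the form prefix.base, and it does not skip over quoted search terms as a real parser would.

```python
import re

# Simplified scanner: a prefix binding, a parenthesis, or a prefixed name.
SCAN_RE = re.compile(r'>\s*(\w+)\s*=\s*"([^"]*)"|([()])|(\w+)\.\w+')


def resolve_prefixes(query):
    """Return (prefixed name, URI) pairs in order of appearance."""
    scopes = [{}]                 # innermost scope is last
    resolved = []
    for m in SCAN_RE.finditer(query):
        prefix, uri, paren, used = m.groups()
        if prefix is not None:
            scopes[-1][prefix] = uri              # applies to what follows
        elif paren == "(":
            scopes.append(dict(scopes[-1]))       # inherit outer bindings
        elif paren == ")":
            scopes.pop()                          # inner bindings end here
        else:
            # None means the binding is left to the server.
            resolved.append((m.group(0), scopes[-1].get(used)))
    return resolved


query = ('>a="http:/x.com/y" a.title=cat and '
         '(>a="http:/f.com/g" a.title=hat) and a.title=rat')
for name, uri in resolve_prefixes(query):
    print(name, uri)
```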
Default Context
When no context is attached to a relation, relation modifier, or boolean
modifier, the context is the cql context set. When no context
is attached to an index the context is determined by the server.
In order to claim conformance to CQL a server must support one of the following three levels:
Level 0
Must be able to process a term-only query.
(The term is either a single word or, if it consists of multiple words separated by
spaces, the entire search term is quoted. If the term includes
quote marks, they must be escaped by preceding them with a backslash,
e.g. "raising the \"titanic\"".)
If an unsupported query is supplied, must be able to respond with a diagnostic to say that the query is not supported.
Level 1
Support for Level 0.
Ability to parse both:
(a) search clauses consisting of 'index relation searchTerm'; and
(b) queries where search terms are combined with booleans, e.g. "term1
AND term2"
Support for at least one of (a) and (b).
Note that (b) does not necessarily include queries such as:
index relation term1 AND index relation term2
but rather queries where the search clauses are terms-only (do not include index or relation).
Level 2
Support for Level 1.
Ability to parse all of CQL and respond with appropriate diagnostics.
Note that Level 2 does not require support for all of CQL; it requires that the server be able to parse all of CQL (and respond with proper diagnostics for the parts not supported).
July 25, 2007