CQL: Contextual Query Language
The CQL Context Set
The CQL context set defines a set of indexes, relations and relation
modifiers. The indexes supplied are 'utility' indexes which are generally
useful across all applications of the language. These utility indexes
are for instances when CQL is required to express a concept not directly
related to the records, or for indexes applicable in practically every
context.
- The reserved name for this context set is: cql
- The identifier for this context set is: info:srw/cql-context-set/1/cql-v1.2
INDEXES
- resultSetId
A search clause may be a result set id. This is a special case, where
the index and relation are expressed as "cql.resultSetId =" and the
term is the result set id returned by the server in the 'resultSetId'
parameter of the SRU response. It may be used by itself in a query
to refer to an existing result set from which records are desired.
It may also be used in conjunction with other resultSetId clauses
or other indexes, combined by boolean operators. The semantics of
resultSetId with relations other than "=" is undefined. The semantics
of resultSetId with scan is also undefined.
Examples:
- cql.resultSetId = "5940824f-a2ae-41d0-99af-9a20bc4047b1"
Match the result set with the given identifier.
- allRecords
A special index which matches every record available. Every record
is matched no matter what values are provided for the relation and
term, but the recommended syntax is: cql.allRecords = 1. The semantics
for scanning allRecords is not defined.
Examples:
- cql.allRecords = 1 NOT dc.title = fish
Search for all records that do not match 'fish' as a word in
title.
- allIndexes
Alias: anywhere
The 'allIndexes' index will result in a search equivalent to searching
all of the indexes (in all of the context sets) that the server has
access to. The semantics for scanning allIndexes is not defined.
Examples:
- cql.allIndexes = fish
If the server had three indexes title, creator and date, then
this would be the same as:
title = fish or creator = fish or date = fish
- anyIndexes
Alias: serverChoice
The 'anyIndexes' index allows the server to determine how to search
for the given term. The server may choose one or more indexes in which
to search, which may or may not be generally available via CQL. It
may choose a different index to search every time, based on the term
for example, and hence may not produce consistent results via scan.
This is the default when the index and relation are omitted from
a search clause. The relation used when the index is omitted is '='.
Examples:
- cql.anyIndexes = fish
Search in any one or more indexes for the term fish
- keywords
The keywords index is an index of terms from the record, determined
by the server as being generally descriptive or meaningful to search
on. It might include the full text of a document, descriptive metadata
fields, or anything else generally useful to search as an initial
entry point to the data. Exactly which fields make up this index is
determined by the server; however, the choice must be consistent, unlike
anyIndexes above, where the choice can differ between searches.
Examples:
- cql.keywords any/relevant "code computer calculator programming"
Search in descriptive locations for the given terms
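The allIndexes entry above describes a search that is equivalent to OR-ing the same clause across every index the server has. A minimal sketch of that expansion follows. The function name `expand_all_indexes` and the list of index names are hypothetical, and the sketch assumes a single-word term that needs no quoting.

```python
from typing import List


def expand_all_indexes(term: str, indexes: List[str]) -> str:
    """Rewrite 'cql.allIndexes = term' as an OR over each known index.

    Hypothetical helper: a real server would work on parsed queries
    rather than strings, and would quote terms where needed.
    """
    if not indexes:
        raise ValueError("the server must expose at least one index")
    return " or ".join(f"{name} = {term}" for name in indexes)


# The three-index server from the allIndexes example.
print(expand_all_indexes("fish", ["title", "creator", "date"]))
```

For the three indexes in the example this reproduces the expanded query given in the text.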
RELATIONS
Implicit Relations
These relations are defined as such in the grammar of CQL. The cql context
set only defines their meaning, rather than their existence.
- =
alias: scr
This is the default relation, and the server can choose any appropriate
relation or means of comparing the query term with the terms from
the data being searched. If the term is numeric, the most commonly
chosen relation is '=='. For a string term, it is either 'adj' or '==', as
appropriate for the index and term.
Examples:
- animal.numberOfLegs = 4
Recommended to use '=='
- dc.identifier = "gb 141 staff a-m"
Recommended to use '=='
- dc.title = "lord of the rings"
Recommended to use 'adj'
- dc.date = "2004 2006"
Recommended to use 'within'
- ==
alias: exact
This relation is used for exact equality matching. The term in the
data is exactly equal to the term in the search.
Examples:
- dc.identifier == "gb 141 staff a-m"
Search for the string 'gb 141 staff a-m' in the identifier index.
- dc.date == "2006-09-01 12:00:00"
Search for the given datestamp.
- animal.numberOfLegs == 4
Search for animals with exactly 4 legs.
- <>
This relation means 'not equal to' and matches anything which is not
exactly equal to the search term.
Examples:
- dc.date <> 2004-01-01
Search for any date except the first of January, 2004
- dc.identifier <> ""
Search for any identifier which is not the empty string.
- <, >, <=, >=
These relations retain their regular meanings as pertaining to ordered
terms (less than, greater than, less than or equal to, greater than
or equal to).
Examples:
- dc.date > 2006-09-01
Search for dates after the 1st of September, 2006
- animal.numberOfLegs < 4
Search for animals with fewer than 4 legs.
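The four examples given for the '=' relation suggest how a server might pick a concrete relation. The sketch below is one possible reading, not a rule from the context set: the function name `resolve_default_relation` and the `index_kind` labels ("number", "date", "identifier", "text") are assumptions introduced here for illustration.

```python
def resolve_default_relation(term: str, index_kind: str) -> str:
    """Pick a concrete relation for '=' (hypothetical server policy).

    index_kind is an assumed label describing the index; the context
    set leaves the actual choice entirely to the server.
    """
    items = term.split()
    if index_kind in ("number", "date"):
        # Two ordered items read naturally as a range.
        return "within" if len(items) == 2 else "=="
    if index_kind == "identifier":
        # Identifiers are compared as whole strings.
        return "=="
    # Free text: treat a multi-word term as a phrase.
    return "adj"


print(resolve_default_relation("4", "number"))
print(resolve_default_relation("lord of the rings", "text"))
```

Under these assumptions the sketch returns the relation recommended in each of the four '=' examples above.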
Defined Relations
These relations are defined as being widely useful as part of a default
context set.
- adj
This relation is used for phrase searches. All of the words in the
search term must appear, and must be adjacent to each other in the
record in the order of the search term. The query could also be expressed
using the PROX boolean operator.
Examples:
- dc.title adj "lord of the rings"
Search for the phrase 'lord of the rings' somewhere in the title.
- dc.description adj "blue shirt"
Search for 'blue' followed by 'shirt' in the description.
- all, any
These relations may be used when the term contains multiple items
to indicate "all of these items" or "any of these items". These queries
could be expressed using boolean AND and OR respectively. These relations
have an implicit relation modifier of 'cql.word', which may be changed
by use of alternative relation modifiers.
Examples:
- dc.title all "lord rings"
Search for both lord and rings in the title.
- dc.description any "computer calculator"
Search for either computer or calculator in the description.
- within
Within may be used with a search term that has multiple dimensions.
It matches if the database's term falls completely within the range,
area or volume described by the search term, inclusive of the extents
given.
Examples:
- dc.date within "2002 2003"
Search for dates between 2002 and 2003 inclusive.
- animal.numberOfLegs within "2 5"
Search for animals that have 2, 3, 4 or 5 legs.
- encloses
Conversely, encloses is used when the index's data has multiple dimensions.
It matches if the database's term fully encloses the search term.
Examples:
- foo.dateRange encloses 2002
Search for ranges of dates that include the year 2002.
- geo.area encloses "45.3, 19.0"
Search for any area that encloses the point 45.3, 19.0
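The all and any relations above are described as expressible with boolean AND and OR. A minimal sketch of that rewrite follows. The helper name `expand_word_relation` is hypothetical, and splitting on whitespace is an assumption standing in for the server's own definition of a word.

```python
def expand_word_relation(index: str, relation: str, term: str) -> str:
    """Rewrite 'index all/any "w1 w2"' as clauses joined by and/or.

    Hypothetical helper; whitespace splitting approximates the
    server's notion of a word.
    """
    booleans = {"all": "and", "any": "or"}
    if relation not in booleans:
        raise ValueError("relation must be 'all' or 'any'")
    words = term.split()
    if not words:
        raise ValueError("term must contain at least one word")
    joiner = f" {booleans[relation]} "
    return joiner.join(f"{index} = {word}" for word in words)


print(expand_word_relation("dc.title", "all", "lord rings"))
print(expand_word_relation("dc.description", "any", "computer calculator"))
```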
RELATION MODIFIERS
Functional Modifiers
- stem
The server should apply a stemming algorithm to the words within the
term. For example, computing and computer would both match the
stem 'compute'.
- relevant
The server should use a relevancy algorithm for determining matches
and the order of the result set.
- phonetic
The server should use a phonetic algorithm for determining words which
sound like the term.
- fuzzy
The server should be liberal in what it counts as a match. The exact
details of this are left up to the server, but might include permutations
of character order, off-by-one for numerical terms and so forth.
- partial
When used with within or encloses, there may be some section which
extends outside of the term. This permits the database term to
be partially enclosed, or fall partially within the search term.
Note: all of the following functional relation-modifiers
are new in version 1.2.
- ignoreCase, respectCase
The server is instructed to either ignore or respect the case of the
search term, rather than its default behaviour (which is unspecified).
This modifier may be used in sort keys to ensure that terms with the
same letters in different cases are sorted together or separately,
respectively.
- ignoreAccents, respectAccents
The server is instructed to either ignore or respect diacritics in
terms, rather than its default behaviour (which is unspecified, but
respectAccents is recommended). This modifier may be used in sort
keys, to ensure that characters with diacritics are sorted together
or separately from those without them.
- locale=value
The term should be treated as being from the specified locale. Locales
will in general include specifications for whether sort order is case-sensitive
or insensitive, how it treats accents, and so forth. The default locale
is determined by the server. The value is usually of the form C, french,
fr_CH, fr_CH.iso88591 or similar. This modifier may be used in sort
keys.
Examples:
- dc.title any/stem "computing disestablishmentarianism"
Find the local stemmed form of 'computing' and 'disestablishmentarianism',
and search for those stems in the stemmed forms of the terms in titles.
- person.phoneNumber =/fuzzy "0151 795-4252"
Search for a phone number which is something similar to '0151 795-4252'
but not necessarily exactly that
- "fish" sortBy dc.title/ignoreCase
Search for 'fish', and then sort the results by title, case insensitively.
- dc.title within/locale=fr "l m"
Find all titles between l and m, ensure that the locale is 'fr' for
determining the order for what is between l and m.
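The partial modifier above changes how within and encloses treat a database term that only partly overlaps the search term. The sketch below models both relations on one-dimensional inclusive ranges. The names `within`, `encloses` and `overlaps` are hypothetical, and reading "partial" as "any overlap" is an assumption, since the exact behaviour is left to the server.

```python
from typing import Tuple

Range = Tuple[float, float]  # (low, high), both ends inclusive


def overlaps(a: Range, b: Range) -> bool:
    """True when the two inclusive ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def within(db: Range, search: Range, partial: bool = False) -> bool:
    """Database term falls inside the search range (or overlaps it)."""
    if partial:
        return overlaps(db, search)
    return search[0] <= db[0] and db[1] <= search[1]


def encloses(db: Range, search: Range, partial: bool = False) -> bool:
    """Database term contains the search range (or overlaps it)."""
    if partial:
        return overlaps(db, search)
    return db[0] <= search[0] and search[1] <= db[1]


# A single value such as the year 2002 is modelled as the range (2002, 2002).
print(within((2002, 2002), (2002, 2003)))
print(encloses((2003, 2005), (2002, 2004), partial=True))
```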
Term-format Modifiers
These modifiers specify the format of the search term to ensure that
the correct comparison is performed by the server. These modifiers may
all be used in sort keys.
- word
The term should be broken into words, according to the server's definition
of a 'word'
- string
The term is a single item, and should not be broken up.
- isoDate
Each item within the term conforms to the ISO 8601 specification for
expressing dates.
- number
Each item within the term is a number.
- uri
Each item within the term is a URI.
- oid
Each item within the term is an ISO object identifier, dot-separated
format.
Examples:
- dc.title =/string Jaws
Search in title for the string 'Jaws', rather than Jaws as a word.
(Equivalent to the use of == as the relation)
- zeerex.set ==/oid "1.2.840.10003.3.1"
Search for the given OID as an attribute set.
- squirrel sortBy numberOfLegs/number
Search for squirrel, and sort by the numberOfLegs index ensuring
that it is treated as a number, not a string. (e.g. '2' would sort after
'10' as a string, but before it as a number)
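The difference the number modifier makes to sort order can be seen directly. The sketch below uses made-up leg counts stored as strings and sorts them both ways.

```python
# Illustrative values only; a server would read these from its index.
leg_counts = ["10", "2", "4"]

# Without the number modifier: character-by-character comparison.
as_strings = sorted(leg_counts)

# With the number modifier: compare the numeric values.
as_numbers = sorted(leg_counts, key=int)

print(as_strings)  # ['10', '2', '4']
print(as_numbers)  # ['2', '4', '10']
```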
Masking
- masked (default modifier)
The following masking rules and special characters apply for search
terms, unless overridden in a profile via a relation modifier. To explicitly
request this functionality, add 'cql.masked' as a relation modifier.
- A single asterisk (*) is used to mask zero or more characters.
- A single question mark (?) is used to mask a single character,
thus N consecutive question-marks means mask N characters.
- Caret/hat (^) is used as an anchor character for terms that are
word lists, that is, where the relation is 'all', 'any' or
'adj'. It may not be used to anchor a string, that is, when the
relation is '==' (string matches are, by default, anchored). It
may occur at the beginning or end of a word (with no intervening
space) to mean right or left anchored. "^" has no special meaning
when it occurs within a word (not at the beginning or end) or
string but must be escaped nevertheless.
- Backslash (\) is used to escape '*', '?', quote (") and '^',
as well as itself. Backslash not followed immediately by one of
these characters is an error.
Examples:
- dc.title = c*t
Matches words that start with c and end in t
- dc.title adj "*fish food*"
Matches a word that ends in fish, followed by a word that starts
with food
- dc.title = c?t
Matches a three letter word that starts with c and ends in t.
- dc.title adj "^cat in the hat"
Matches 'cat in the hat' where it is at the beginning of the
field
- dc.title any "^cat ^dog rat^"
Matches cat at the beginning, dog at the beginning or rat at
the end
- dc.title == "\"Of Course\", she said"
Escape internal double quotes within the term.
- unmasked
Do not apply masking rules, all characters are literal.
- substring
The 'substring' modifier may be used to specify a range of
characters (first and last character) indicating the desired substring
within the field to be searched. The modifier takes a value, of the
form "start:end" where start and end obey the following rules:
- Positive integers count forwards through the string, starting
at 1. The first character is 1, the tenth character is 10.
- Negative integers count backwards through the string, with -1
being the last character.
- Both start and end are inclusive of that character.
- If omitted, start defaults to 1 and end defaults to -1.
Examples:
- marc.008 =/substring="1:6" 920102
- dc.title =/substring=":" "The entire title"
- dc.title =/substring="2:2" h
- dc.title =/substring="-5:" title
- regexp
The term should be treated as a regular expression. Any features beyond
those found in modern POSIX regular expressions are considered to be
server dependent. This modifier overrides the default 'masked' modifier,
above. It may be used in either a string or word context.
Examples:
- dc.title adj/regexp "(lord|king|ruler) of th[ea] r.*s"
Match lord or king or ruler, followed by of, followed by the
or tha, followed by r plus zero or more characters plus s
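Two of the masking rules above can be sketched in a few lines: translating a masked word into a regular expression, and resolving a substring "start:end" value. The names `masked_word_to_regex`, `word_matches` and `cql_substring` are hypothetical. The sketch handles '*', '?' and backslash escapes only; caret anchoring is deliberately left out because it depends on the word's position in the field.

```python
import re

ESCAPABLE = set('*?"^\\')  # characters a backslash may precede


def masked_word_to_regex(word: str) -> str:
    """Translate one masked CQL word into a Python regex pattern."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "\\":
            # A backslash must be followed by an escapable character.
            if i + 1 >= len(word) or word[i + 1] not in ESCAPABLE:
                raise ValueError("invalid backslash escape")
            out.append(re.escape(word[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")  # zero or more characters
        elif ch == "?":
            out.append(".")   # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def word_matches(mask: str, word: str) -> bool:
    """True when the whole word matches the masked pattern."""
    return re.fullmatch(masked_word_to_regex(mask), word) is not None


def cql_substring(value: str, spec: str) -> str:
    """Apply a 'start:end' value: 1-based, inclusive, negatives from the end."""
    start_text, _, end_text = spec.partition(":")
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else -1
    n = len(value)
    low = start - 1 if start > 0 else n + start
    high = end if end > 0 else n + end + 1
    return value[max(low, 0):max(high, 0)]


print(word_matches("c*t", "carrot"))
print(cql_substring("The entire title", "-5:"))
```

A start or end of 0 is not covered by the rules above, so the sketch makes no promise about it.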
BOOLEANS
The CQL context set does not define booleans, as these can only be defined
by the CQL grammar. It gives the semantics of the booleans defined.
- AND
The combination of two sets of records with AND will result in the
set of records that appear in both of the sets.
- OR
The combination of two sets of records with OR will result in the
set of records that appear in either or both of the sets. It is therefore
inclusive OR, not exclusive OR.
- NOT
The combination of two sets of records with NOT will result in the
set of records that appear in the left set, but not in the right hand
set. It cannot be used as a unary operator.
- PROX
The prox (short for proximity) boolean operator allows for the relative
locations of the terms to be used in order to determine the resulting
set of records. The semantics of when a match occurs is defined by
the modifiers or defaults for those modifiers, as described below.
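The AND, OR and NOT semantics above map directly onto set operations. The sketch below uses Python sets of made-up record identifiers. The function name `combine` is hypothetical, and PROX is omitted because it depends on term positions rather than on record sets alone.

```python
from typing import Set


def combine(operator: str, left: Set[str], right: Set[str]) -> Set[str]:
    """Combine two result sets using the boolean semantics described above."""
    op = operator.lower()
    if op == "and":
        return left & right   # records in both sets
    if op == "or":
        return left | right   # records in either or both sets (inclusive)
    if op == "not":
        return left - right   # records in the left set but not the right
    raise ValueError(f"unsupported boolean: {operator}")


# Illustrative record identifiers only.
fish = {"rec1", "rec2", "rec3"}
food = {"rec2", "rec3", "rec4"}

print(sorted(combine("and", fish, food)))  # ['rec2', 'rec3']
print(sorted(combine("or", fish, food)))   # ['rec1', 'rec2', 'rec3', 'rec4']
print(sorted(combine("not", fish, food)))  # ['rec1']
```

Because NOT takes a left and a right set, the sketch also reflects that it cannot be used as a unary operator.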
BOOLEAN MODIFIERS
The CQL context set defines four boolean modifiers, which are only used
with the prox boolean operator.
- distance symbol value
The distance that the two terms should be separated by.
- Symbol is one of: < > <= >= = <>
If the modifier is not supplied, it defaults to <=.
- Value is a non-negative integer.
If the modifier is not supplied, it defaults to 1 when unit=word,
or 0 for all other units.
- unit=value
The type of unit for the distance.
Value is one of: 'paragraph', 'sentence', 'word' and 'element', and
defaults to 'word'.
These values are explicitly undefined. They are subject to interpretation
by the server. See Proximity Units.
- unordered
The order of the two terms is unimportant. This is the default.
- ordered
The order of the two terms must be as per the query.
Examples:
- cat prox/unit=word/distance>2/ordered hat
Find 'cat' where it appears more than two words before 'hat'.
("ordered" means 'cat' and 'hat' in that order. "distance >2" means that the proximity between 'cat' and 'hat' is greater than two words.
Would exclude "Cat in the Hat" but would find "The Big Red Cat in the Big Red Hat".)
- cat prox/unit=paragraph hat
Find cat and hat appearing in the same paragraph (distance defaulting
to 0) in either order (unordered default)
- zeerex.set = cql prox/unit=element/distance=0 zeerex.index = resultSetId
Find the cql context set in the same element as the index name resultSetId.
E.g. search for cql.resultSetId
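The defaults listed for the prox modifiers can be made explicit with a small parser. The sketch below is an illustration under stated assumptions: the function name `parse_prox` and the returned dictionary keys are hypothetical, and modifiers carrying a context set prefix (such as cql.unit) are not handled.

```python
import re

# Longer symbols are listed first so '<=' is not read as '<'.
_DISTANCE = re.compile(r"distance\s*(<=|>=|<>|<|>|=)\s*(\d+)$")


def parse_prox(operator: str) -> dict:
    """Parse e.g. 'prox/unit=word/distance>2/ordered' and fill in defaults."""
    parts = operator.split("/")
    if parts[0].strip().lower() != "prox":
        raise ValueError("not a prox operator")
    unit, symbol, value, ordered = "word", "<=", None, False
    for raw in parts[1:]:
        modifier = raw.strip()
        if modifier == "ordered":
            ordered = True
        elif modifier == "unordered":
            ordered = False
        elif modifier.startswith("unit="):
            unit = modifier[len("unit="):].strip()
        else:
            match = _DISTANCE.match(modifier)
            if match is None:
                raise ValueError(f"unknown prox modifier: {modifier}")
            symbol, value = match.group(1), int(match.group(2))
    if value is None:
        # Default distance: 1 for words, 0 for every other unit.
        value = 1 if unit == "word" else 0
    return {"unit": unit, "symbol": symbol, "value": value, "ordered": ordered}


print(parse_prox("prox/unit=paragraph"))
print(parse_prox("prox/unit=word/distance>2/ordered"))
```

The sketch only resolves the modifiers. How a distance is actually measured for each unit remains subject to interpretation by the server.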
Proximity Units
As noted above proximity units 'paragraph', 'sentence', 'word' and 'element'
are explicitly undefined, that is, they are undefined when used by the
CQL context set. Other context sets may assign them specific values.
Thus compare "prox/unit=word" with "prox/xyz.unit=word".
In the first, 'unit' is a prox modifier from the CQL set, and as such
its values are undefined, so 'word' is subject to interpretation by the
server. In the second, 'unit' is a prox modifier defined by the xyz context
set, which may assign the unit 'word' a specific meaning.
Other context sets may define additional units, for example, 'street':
prox/xyz.unit="street"
Note that this approach, 'prox/xyz.unit="street"', is preferable to
'prox/unit=xyz.street'. In the first case, 'unit' is a modifier defined
in the xyz context set, and 'street' is a value defined for that modifier.
In the second, 'unit' is a modifier from the cql context set, with a
value defined in a different set, so its value would have to be one that
is defined in the cql context set. Pairing a modifier from one set with
a value from another is not a good practice.