Seek and ye shall find: full-text search in PostgreSQL
======================================================
:author: Dan Scott (denials)
:institution: Laurentian University
:license: CC-BY-SA 3.0 Unported
:duration: 50
:Date: September 18, 2012
:Email: dan@coffeecode.net
:Copyright: {date={localdate}}: {author}, {institution} * {license}
:icons:
:iconsdir: ./images/icons/
:slidecalloutwidth: 2%
:data-uri:
It happened to me
-----------------
* Summer project created large amounts of text from scanned newspapers
* Wanted an independent & useful search engine
* Lucene-based search engines like Solr and ElasticSearch are not integrated
with the database
* I'm a library geek _and_ a database geek and, since 2006, have been hacking
on the http://evergreen-ils.org[Evergreen library system], which uses
PostgreSQL full-text search
* Housekeeping: slides are at http://stuff.coffeecode.net/2012/pgopen_fulltext
Simplistic (straw-man) approach
-------------------------------
[source,sql]
.I'll just use LIKE / ILIKE / regex matching...
------------------------------------------------------------------------------
CREATE SCHEMA lunews;
CREATE TABLE lunews.issues (
id TEXT, text_uri TEXT, volume TEXT, issue TEXT, content TEXT
);
COPY lunews.issues (id, text_uri, volume, issue, content)
FROM 'raw_lambda_text.tsv'; -- COPY resolves the path on the server
SELECT id FROM lunews.issues WHERE content LIKE '%picket%';
SELECT id FROM lunews.issues WHERE content ILIKE '%picket%';
SELECT id FROM lunews.issues WHERE content ~ E'p[iao]cket';
------------------------------------------------------------------------------
... with simple problems
------------------------
:incremental:
* Fast for test data (50 rows with ~10,000 words per row) but does not scale
to real-world data
** Table scans are _not_ our friends (see the +EXPLAIN+ sketch below)
** Index expressions are wonderful but can't anticipate the wide variety of
human questions
* We could build something based on the +pg_trgm+ extension - but that's work!
* Full-text search is built into PostgreSQL, so let's use it
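A quick +EXPLAIN+ makes the first point concrete - no index can satisfy an
unanchored +'%...%'+ pattern:
[source,sql]
------------------------------------------------------------------------------
EXPLAIN SELECT id FROM lunews.issues WHERE content LIKE '%picket%';
-- expect a Seq Scan node: every row's content must be tested
------------------------------------------------------------------------------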
image:oleg_bartunov.jpg["Oleg Bartunov",align="left"] image:teodor_sigaev.jpg["Teodor Sigaev",align="right"]
_(Thank you, http://www.sai.msu.su/~megera/[Oleg Bartunov] and http://www.sigaev.ru/[Teodor Sigaev]!)_
2-minute demo - 50 rows
-----------------------
. As a proof of concept, create and populate our table containing the full
text of 50 newspaper issues.
. Create a GIN index on the +tsvector+ form of the text column of interest
(_content_). The two-argument +to_tsvector()+ is required here: the
one-argument form depends on +default_text_search_config+ and is not
+IMMUTABLE+, so it cannot appear in an index expression:
+
[source,sql]
------------------------------------------------------------------------------
CREATE INDEX ON lunews.issues
USING GIN (to_tsvector('english', content));
------------------------------------------------------------------------------
+
. Run some search queries:
+
[source,sql]
------------------------------------------------------------------------------
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content)
@@ to_tsquery('strike & picket');
SELECT id, ts_rank_cd(to_tsvector('english', content),
to_tsquery('picket | pocket | packet'))
FROM lunews.issues
WHERE to_tsvector('english', content)
@@ to_tsquery('picket | pocket | packet')
ORDER BY 2 DESC;
------------------------------------------------------------------------------
5-minute demo - 1 million rows
------------------------------
. Create and populate a table containing the titles and subjects of
1 *_MILLION_* books from OpenLibrary.
. Create weighted and unweighted +tsvector+ keyword columns.
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE openlib (id INT, title TEXT, subjects TEXT,
unweighted_kw TSVECTOR, weighted_kw TSVECTOR);
COPY openlib (id, title, subjects) FROM STDIN; -- <1> <2>
UPDATE openlib SET unweighted_kw =
to_tsvector('english', title) ||
to_tsvector('english', COALESCE(subjects, '')); -- <3>
UPDATE openlib SET weighted_kw =
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english',
COALESCE(subjects, '')), 'D'); -- <4>
------------------------------------------------------------------------------
<1> 2012-09-15 12:33:05-04 - started loading
<2> 2012-09-15 12:33:10-04 - finished loading
<3> 2012-09-15 12:33:51-04 - populated unweighted_kw
<4> 2012-09-15 12:34:40-04 - populated weighted_kw
5-minute demo - 1 million rows (cont.)
--------------------------------------
. Create indexes on the weighted and unweighted keyword columns.
. Create separate expression indexes on the title and subjects columns.
[source,sql]
------------------------------------------------------------------------------
CREATE INDEX openlib_title ON openlib
USING GIN (to_tsvector('english', title)); -- <1>
CREATE INDEX openlib_subjects ON openlib
USING GIN (to_tsvector('english', subjects)); -- <2>
CREATE INDEX openlib_unweighted ON openlib
USING GIN (unweighted_kw); -- <3>
CREATE INDEX openlib_weighted ON openlib
USING GIN (weighted_kw); -- <4>
------------------------------------------------------------------------------
<1> 2012-09-15 12:35:13-04 - created openlib_title index
<2> 2012-09-15 12:35:39-04 - created openlib_subjects index
<3> 2012-09-15 12:36:04-04 - created openlib_unweighted index
<4> 2012-09-15 12:36:30-04 - created openlib_weighted index
5-minute demo - 1 million rows - searching
------------------------------------------
With an unweighted search:
[source,sql]
------------------------------------------------------------------------------
SELECT id, SUBSTRING(title FROM 0 FOR 20),
ts_rank_cd(unweighted_kw,
to_tsquery('government & history'))
FROM openlib
WHERE unweighted_kw @@ to_tsquery('government & history')
ORDER BY 3 DESC LIMIT 5;
id | substring | rank
----------+---------------------+----------
6836768 | Pietre e potere | 0.35
17995790 | Reform and insurrec | 0.236111
8669133 | American Political | 0.229545
866182 | Lives in the public | 0.225
16640636 | Hē katastatikē no   | 0.214286
(5 rows) Total runtime: 84.942 ms
------------------------------------------------------------------------------
5-minute demo - 1 million rows - searching
------------------------------------------
With a weighted search:
[source,sql]
------------------------------------------------------------------------------
SELECT id, SUBSTRING(title FROM 0 FOR 20),
ts_rank_cd(weighted_kw,
to_tsquery('government & history'))
FROM openlib
WHERE weighted_kw @@ to_tsquery('government & history')
ORDER BY 3 DESC LIMIT 5;
id | substring | rank
----------+---------------------+----------
11331254 | Global Issues: Hist | 2.09091
5284994 | Michigan: a history | 0.615909
5857361 | California history | 0.590909
23324185 | Outlines of the his | 0.525974
14832714 | History of federal | 0.508333
(5 rows) Total runtime: 86.443 ms
------------------------------------------------------------------------------
5-minute demo - 1 million rows - searching
------------------------------------------
With a weighted search and relevancy algorithm tweak:
[source,sql]
------------------------------------------------------------------------------
SELECT id, SUBSTRING(title FROM 0 FOR 20),
ts_rank_cd(weighted_kw,
to_tsquery('government & history'), 4)
FROM openlib
WHERE weighted_kw @@ to_tsquery('government & history')
ORDER BY 3 DESC LIMIT 5;
id | substring | ts_rank_cd
----------+---------------------+------------
20696993 | A study of general | 0.5
11830422 | 21st Century Comple | 0.5
16606467 | History of the gove | 0.333333
9930696 | Judiciary and Respo | 0.333333
13681611 | history of local go | 0.333333
(5 rows) Total runtime: 92.518 ms
------------------------------------------------------------------------------
Parsing process
---------------
. Start with a _document_ - the text to be searched; can be a single word or an
entire book
. Parse the document into _tokens_ - with types like words, numbers, paths,
email addresses, white space
. Normalize the tokens to create _lexemes_ - including case-folding, stemming,
and using dictionaries and thesauruses to unify different forms of a token
ifdef::backend-slidy2[>>>]
Text search data types
~~~~~~~~~~~~~~~~~~~~~~
* _tsvector_ - text search vector (array-ish thing) containing lexemes with
positional information to inform relevance ranking
* _tsquery_ - a preprocessed text search query
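A minimal taste of both types - the bare casts skip normalization entirely,
which is why the lexemes below are unstemmed:
[source,sql]
------------------------------------------------------------------------------
SELECT 'fat cats ate fat rats'::tsvector;
-- 'ate' 'cats' 'fat' 'rats'     (sorted, deduplicated, no positions)
SELECT 'fat & rat'::tsquery;
-- 'fat' & 'rat'
------------------------------------------------------------------------------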
Show me the tokens!
-------------------
[source,sql]
------------------------------------------------------------------------------
SELECT alias, description FROM ts_token_type('default');
      alias      |               description
-----------------+------------------------------------------
asciiword | Word, all ASCII
word | Word, all letters
numword | Word, letters and digits
email | Email address
url | URL
host | Host
sfloat | Scientific notation
version | Version number
hword_numpart | Hyphenated word part, letters and digits
hword_part | Hyphenated word part, all letters
hword_asciipart | Hyphenated word part, all ASCII
blank | Space symbols
tag | XML tag
protocol | Protocol head
numhword | Hyphenated word, letters and digits
asciihword | Hyphenated word, all ASCII
hword | Hyphenated word, all letters
url_path | URL path
file | File or path name
float | Decimal notation
int | Signed integer
uint | Unsigned integer
entity | XML entity
(23 rows)
------------------------------------------------------------------------------
Let's see some lexemes
----------------------
* Normalized tokens, remember?
[source,sql]
------------------------------------------------------------------------------
SELECT to_tsvector('english',
'Kurt Vonnegut, Breakfast of Champions, 1973'
);
'1973':6 'breakfast':3 'champion':5 'kurt':1 'vonnegut':2
SELECT to_tsvector('english', 'Les nuits en Paris');
'en':3 'les':1 'nuit':2 'pari':4
SELECT to_tsvector('french', 'Les nuits en Paris'); -- <1>
'le':1 'nuit':2 'paris':4
------------------------------------------------------------------------------
<1> Dig that multilingual support, eh?
Operators and functions
-----------------------
* +to_tsvector()+ - turn a document into a vector
* +to_tsquery()+ - normalize the terms in a query to return a TSQUERY
* +plainto_tsquery()+ - normalize the terms in a query to return a TSQUERY,
automatically ANDing the terms
* +@@+ - match operator for a tsvector and a tsquery
* +ts_debug()+ - display how a string gets normalized by a given text search
configuration
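Two quick illustrations (expected output shown as comments):
[source,sql]
------------------------------------------------------------------------------
SELECT plainto_tsquery('english', 'student strikes');
-- 'student' & 'strike'
SELECT alias, token, lexemes FROM ts_debug('english', 'The pickets');
-- asciiword | The     | {}          (stopword: empty lexeme list)
-- blank     |         |
-- asciiword | pickets | {picket}
------------------------------------------------------------------------------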
Let's search!
-------------
* Boolean operators
** +&+ - AND
** +|+ - OR
** +!+ - NOT (tricksy: negation only combines via +&+ or +|+,
e.g. +strike & !student+)
* Wildcards for matching prefixes: +prefix:*+
[source,sql]
------------------------------------------------------------------------------
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ plainto_tsquery('student strike');
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ to_tsquery('strike & !student');
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ to_tsquery('prof:* | fac:*');
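-- 9.2 has no phrase-search operator: approximate a phrase
-- by pairing the FTS match with ILIKE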
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ to_tsquery('faculty & strike')
AND content ILIKE '%faculty strike%';
------------------------------------------------------------------------------
What is this "normalization"?
-----------------------------
* *Parsers*: tokenize and classify the text of a document (word, number,
path...) - in practice, there is only one
* *Dictionaries*: convert tokens into normalized form, strip out stop words,
unify synonyms
* *Templates*: provide the dictionaries' underlying functions
* *Configurations*: which dictionaries to use for each token supplied by the
parser
[source,sql]
------------------------------------------------------------------------------
SELECT to_tsvector('english', 'stochastically'); -- <1>
'stochast':1
SELECT to_tsvector('english', 'To be or not to be'); -- <2>
SELECT to_tsvector('english', 'laptopping'); -- <3>
'laptop':1
------------------------------------------------------------------------------
<1> The English Porter stemming algorithm is thorough :)
<2> The result is empty - every token is a stopword. Stopwords can eat up a
lot of space if you decide to index them.
<3> The stemmer makes no guarantees that the original words were real!
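Each of those four concepts is visible in the system catalogs (psql's +\dF+,
+\dFd+, +\dFp+, and +\dFt+ give friendlier views):
[source,sql]
------------------------------------------------------------------------------
SELECT cfgname FROM pg_ts_config;     -- configurations
SELECT dictname FROM pg_ts_dict;      -- dictionaries
SELECT prsname FROM pg_ts_parser;     -- parsers
SELECT tmplname FROM pg_ts_template;  -- templates
------------------------------------------------------------------------------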
What are we matching?
---------------------
+ts_headline()+ shows where our +tsquery+ matched, in context:
[source,sql]
------------------------------------------------------------------------------
SELECT ts_headline(title,
to_tsquery('english', 'chicago & theater'),
'MaxWords = 10, MinWords = 5')
FROM openlib
WHERE to_tsvector('english', title)
@@ to_tsquery('english', 'chicago & theater')
LIMIT 5;
ts_headline
---------------------------------------------------
Rice's Theater, Chicago, 1851-1857
(1 row) Total runtime: 0.616 ms
------------------------------------------------------------------------------
Relevancy
---------
We don't just want matches; we want the best matches. PostgreSQL offers
two built-in ranking functions:
* +ts_rank+ ([ 'weights' +float4[]+, ] 'vector' +tsvector+,
'query' +tsquery+ [, 'normalization' +integer+ ]) - frequency of matching
lexemes
* +ts_rank_cd+ ([ 'weights' +float4[]+, ] 'vector' +tsvector+,
'query' +tsquery+ [, 'normalization' +integer+ ]) - cover density
algorithm that incorporates term proximity
Relevancy scoring can be adjusted by passing a bit mask of normalization
options that discount long documents or large unique-word counts, as
sketched below.
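For example, combining options 2 (divide by document length) and 16 (divide
by 1 + log of the unique word count) discounts long, wordy documents - a
sketch against the earlier +openlib+ table:
[source,sql]
------------------------------------------------------------------------------
SELECT id, ts_rank_cd(weighted_kw,
to_tsquery('government & history'), 2|16) -- combine flags with bitwise OR
FROM openlib
WHERE weighted_kw @@ to_tsquery('government & history')
ORDER BY 2 DESC LIMIT 5;
------------------------------------------------------------------------------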
.Cover sets
[quote, Clarke et al, Relevance Ranking for One to Three Term Queries (1999)]
______________________________________________________________________________
(1) the shorter the cover, the more likely the corresponding text is relevant;
and (2) the more covers contained in a document, the more likely the document
is relevant.
______________________________________________________________________________
Beyond searching a single column
--------------------------------
* An entire newspaper in a column doesn't allow for much precision
* A story in a column is better, but still not optimal
* Separate columns for title, author, story? Now we're talking
** Concatenate the +tsvector+ documents for maximum flexibility
Beyond searching a single column (cont.)
----------------------------------------
[source, sql]
------------------------------------------------------------------------------
CREATE TABLE lunews.stories (
id SERIAL, title TEXT, author TEXT, story TEXT
);
WITH ft AS (
SELECT id, title || ' ' || author || ' ' || story AS keywords,
to_tsvector(title) || to_tsvector(author) ||
to_tsvector(story) AS ts_kw
FROM lunews.stories
),
q AS (
SELECT id, to_tsquery('exotic & horizons') AS tsq
FROM lunews.stories
)
SELECT ft.id, ts_rank_cd(ft.ts_kw, q.tsq) AS score,
ts_headline(ft.keywords, q.tsq) AS kic
FROM ft INNER JOIN q ON q.id = ft.id
WHERE q.tsq @@ ft.ts_kw
;
------------------------------------------------------------------------------
Weighting
---------
* Terms can be weighted using four scores: 'A', 'B', 'C', and 'D'
* Weight sets of terms at index time using the +setweight()+ function
* Adjust weight values at query time by passing an array of four
+float+ values as the first argument to +ts_rank()+/+ts_rank_cd()+
** Example: You can set the title to 'A' and subject to 'D' for a default
boost to title occurrences, but adjust the weights for a subject-focused
search
[source, sql]
------------------------------------------------------------------------------
WITH ft AS (
SELECT id, title || ' ' || author || ' ' || story AS keywords,
setweight(to_tsvector(COALESCE(title, '')), 'A')
|| setweight(to_tsvector(COALESCE(author, '')), 'B')
|| setweight(to_tsvector(COALESCE(story, '')), 'D')
AS ts_kw
FROM lunews.stories
),
q AS (SELECT to_tsquery('exotic & horizons') AS tsq)
SELECT ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', ft.ts_kw, q.tsq) -- <1>
FROM ft, q WHERE q.tsq @@ ft.ts_kw;
------------------------------------------------------------------------------
<1> Small gotcha: floats in the score array represent 'D', 'C', 'B', 'A'
Custom search configuration
---------------------------
Perhaps you actually want to index stopwords (to find 'To be or not to be' or
'It')
[quote,#postgresql on Freenode]
_____________________________________________________________________________
denials, Isn't there an easier way to allow stop word, like just set a flag in
some config file to True or False?
_____________________________________________________________________________
[source,sql]
------------------------------------------------------------------------------
CREATE TEXT SEARCH DICTIONARY tsd_english_full(
template = snowball, language = english
);
CREATE TEXT SEARCH CONFIGURATION tsc_english_full(
COPY = pg_catalog.english
);
ALTER TEXT SEARCH CONFIGURATION tsc_english_full
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
hword, hword_part, word
WITH tsd_english_full;
SELECT to_tsvector('tsc_english_full', 'to be or not to be');
'be':2,6 'not':4 'or':3 'to':1,5
------------------------------------------------------------------------------
Accelerate
----------
* GIN and GiST indexes support full-text search queries
** Usual trade-off of better performance / slower build for GIN
* One index per column of interest, plus one general index for weighted
aggregate searches, per text search configuration
* Create +tsvector+ columns with an associated trigger to maintain their
contents
** +tsvector_update_trigger()+ function automatically updates the
+tsvector+ column with the contents of one or more columns, for a
single text search configuration
** +tsvector_update_trigger_column()+ function uses whatever text search
configuration appears in a specified column on a per-row basis, so that
you can better support multilingual data in a single table (see the
sketch below)
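A minimal sketch of the per-row variant, assuming a hypothetical
+stories_i18n+ table that carries its text search configuration in a
+regconfig+ column:
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE stories_i18n (
title TEXT,
ts_config REGCONFIG, -- e.g. 'english' or 'french', set per row
tsv TSVECTOR
);
CREATE TRIGGER tsv_i18n_update
BEFORE INSERT OR UPDATE ON stories_i18n FOR EACH ROW
EXECUTE PROCEDURE
tsvector_update_trigger_column(tsv, ts_config, title);
------------------------------------------------------------------------------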
Accelerate (example)
--------------------
Use the built-in +tsvector_update_trigger()+ function to maintain a
+tsvector+ column, with a corresponding GIN index:
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE openlib_subset(title TEXT, tsv TSVECTOR);
CREATE TRIGGER tsv_update
BEFORE INSERT OR UPDATE ON openlib_subset FOR EACH ROW
EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title);
CREATE INDEX idx_title ON openlib_subset USING GIN (tsv);
INSERT INTO openlib_subset (title)
SELECT title FROM openlib ORDER BY id DESC LIMIT 10000;
SELECT title, tsv FROM openlib_subset
WHERE to_tsquery('Chicago') @@ tsv
AND LENGTH(title) < 30 LIMIT 5;
title | tsv
------------------------------+--------------------------------------
The Chicago address | 'address':3 'chicago':2
You were never in Chicago | 'chicago':5 'never':3
African Americans in Chicago | 'african':1 'american':2 'chicago':4
Chicago | 'chicago':1
------------------------------------------------------------------------------
Preparsing
----------
* Changing the parser to emit more / different tokens requires source code
changes
* Hack: change the text before you index & query it
Desired goal: index +"term1/term2"+ as a single term, _and_ as separate terms
[source,sql]
------------------------------------------------------------------------------
SELECT alias, token, lexemes FROM ts_debug('english','term1/term2');
alias | token | lexemes
-------+-------------+---------------
file | term1/term2 | {term1/term2}
SELECT alias, token, lexemes FROM ts_debug('english',
regexp_replace('term1/term2',
E'(\\S+)/(\\S+)', E'\\1/\\2 \\1 \\2')
);
alias | token | lexemes
---------+-------------+---------------
file | term1/term2 | {term1/term2}
blank | |
numword | term1 | {term1}
blank | |
numword | term2 | {term2}
------------------------------------------------------------------------------
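To keep index time and query time in sync, one option is to wrap the
substitution in an +IMMUTABLE+ function (a hypothetical +preparse()+ helper)
and build the expression index over it:
[source,sql]
------------------------------------------------------------------------------
CREATE FUNCTION preparse(doc TEXT) RETURNS TEXT AS $$
SELECT regexp_replace($1, E'(\\S+)/(\\S+)', E'\\1/\\2 \\1 \\2', 'g');
$$ LANGUAGE SQL IMMUTABLE;
CREATE INDEX ON lunews.issues
USING GIN (to_tsvector('english', preparse(content)));
------------------------------------------------------------------------------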
Synonyms via a dictionary
-------------------------
*Problem*: default 'english' configuration does not recognize American
variations as the equivalent of Canadian words
* Examples: 'color' vs. 'colour', 'favorite' vs. 'favourite'
*Solution*: The 'synonym' template lets you define a list of words and their
synonyms, including prefix matches if applicable.
.+canuck.syn+ synonym file
------------------------------------------------------------------------------
colour color*
colours color*
------------------------------------------------------------------------------
[source,sql]
------------------------------------------------------------------------------
CREATE TEXT SEARCH DICTIONARY canuck (
TEMPLATE = synonym, SYNONYMS = canuck -- <1>
);
ALTER TEXT SEARCH CONFIGURATION tsc_english_full
ALTER MAPPING FOR asciiword WITH canuck, tsd_english_full;
SELECT ts_lexize('canuck', 'colours');
{color}
------------------------------------------------------------------------------
<1> Do not specify the suffix; PostgreSQL adds +.syn+ for you and looks for
the file in the installation's +tsearch_data+ directory.
ifdef::ts_write_bug_fixed[]
Synonyms via +ts_rewrite()+
---------------------------
*Problem*: any update to a synonym dictionary requires reindexing for any
text search configurations that use that dictionary
*Solution*: the +ts_rewrite()+ function substitutes terms, which can be drawn
from a table, at query-time
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE synonyms (term TSQUERY, sub TSQUERY);
INSERT INTO synonyms (term, sub) VALUES
(to_tsquery('colour'), to_tsquery('color:*')),
(to_tsquery('colours'), to_tsquery('color:*'));
------------------------------------------------------------------------------
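Query time would then look something like this, given the table above:
[source,sql]
------------------------------------------------------------------------------
SELECT ts_rewrite(to_tsquery('colour & crayon'),
'SELECT term, sub FROM synonyms');
-- yields the query with 'color':* substituted for 'colour'
------------------------------------------------------------------------------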
endif::ts_write_bug_fixed[]
Spell-checking suggestions
--------------------------
* You can use the *pg_trgm* extension to calculate similarity with existing
words in your corpus.
[source, sql]
------------------------------------------------------------------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE lunews.issues_words AS (
SELECT word FROM ts_stat( -- <1>
'SELECT to_tsvector(''simple'', content) -- <2>
FROM lunews.issues'
)
);
CREATE INDEX lunews_issues_content_idx
ON lunews.issues_words
USING GiST (word gist_trgm_ops); -- <3>
------------------------------------------------------------------------------
<1> +ts_stat()+ returns statistics about each distinct lexeme in the
+tsvector+, such as frequency. Here we just want the distinct lexemes.
<2> We use the +simple+ configuration because we do not want stemming to
kick in when we process the lexemes.
<3> Use the +gin_trgm_ops+ or +gist_trgm_ops+ operator class for a GIN or
GiST index, respectively; only GiST supports ordering by the +<->+
distance operator used on the next slide.
Spell-checking suggestions: query
---------------------------------
The +<->+ operator returns the "distance" between its arguments: 1 minus
their trigram similarity, so 0 means an exact match.
[source, sql]
------------------------------------------------------------------------------
SELECT word, word <-> 'picket' AS dist
FROM lunews.issues_words
ORDER BY dist
LIMIT 5;
word | dist
-------------+----------
picket | 0
pickets | 0.333333
picketer | 0.4
picketed | 0.4
picket-line | 0.416667
(5 rows)
------------------------------------------------------------------------------
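pg_trgm's +%+ operator offers a related approach: it filters to words above
the current similarity threshold (see +show_limit()+ / +set_limit()+) and can
also use the trigram index:
[source,sql]
------------------------------------------------------------------------------
SELECT word, similarity(word, 'pickit') AS sml
FROM lunews.issues_words
WHERE word % 'pickit' -- true only above the similarity threshold
ORDER BY sml DESC LIMIT 5;
------------------------------------------------------------------------------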
Resources
---------
* PostgreSQL Global Development Group. http://www.postgresql.org/docs/9.2/interactive/textsearch-intro.html[Full Text Search: Introduction]. 'PostgreSQL 9.2.0 Documentation'. 2012.
* Clarke et al. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.1615&rep=rep1&type=pdf[Relevance Ranking for One to Three Term Queries]. 'Information Processing and Management', 36(2):291-311, 2000.
* Porter, M.F. http://snowball.tartarus.org/texts/introduction.html[Snowball: A language for stemming algorithms]. 2001.