Seek and ye shall find: full-text search in PostgreSQL
======================================================
:author: Dan Scott (denials)
:institution: Laurentian University
:license: CC-BY-SA 3.0 Unported
:duration: 50
:Date: September 18, 2012
:Email: dan@coffeecode.net
:Copyright: {date={localdate}}: {author}, {institution} * {license}
:icons:
:iconsdir: ./images/icons/
:slidecalloutwidth: 2%
:data-uri:
It happened to me
-----------------
* Summer project created large amounts of text from scanned newspapers
* Wanted an independent & useful search engine
* Lucene-based search engines like Solr and ElasticSearch are not integrated
with the database
* I'm a library geek _and_ a database geek and, since 2006, have been hacking
on the http://evergreen-ils.org[Evergreen library system], which uses
PostgreSQL full-text search
* Housekeeping: slides are at http://stuff.coffeecode.net/2012/pgopen_fulltext
Simplistic (straw-man) approach
-------------------------------
[source,sql]
.I'll just use LIKE / ILIKE / regex matching...
------------------------------------------------------------------------------
CREATE SCHEMA lunews;
CREATE TABLE lunews.issues (
id TEXT, text_uri TEXT, volume TEXT, issue TEXT, content TEXT
);
COPY lunews.issues (id, text_uri, volume, issue, content)
FROM 'raw_lambda_text.tsv'; -- COPY resolves the path on the server
SELECT id FROM lunews.issues WHERE content LIKE '%picket%';
SELECT id FROM lunews.issues WHERE content ILIKE '%picket%';
SELECT id FROM lunews.issues WHERE content ~ E'p[iao]cket';
------------------------------------------------------------------------------
... with simple problems
------------------------
:incremental:
* Fast for test data (50 rows with ~10,000 words per row) but does not scale
to real-world data
** Table scans are _not_ our friends (see the +EXPLAIN+ sketch below)
** Index expressions are wonderful but can't anticipate the wide variety of
human questions
* We could build something based on the +pg_trgm+ extension - but that's work!
* Full-text search is built into PostgreSQL, so let's use it
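A quick +EXPLAIN+ makes the first point concrete - no index can satisfy an
unanchored +'%...%'+ pattern:
[source,sql]
------------------------------------------------------------------------------
EXPLAIN SELECT id FROM lunews.issues WHERE content LIKE '%picket%';
-- expect a Seq Scan node: every row's content must be tested
------------------------------------------------------------------------------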
image:oleg_bartunov.jpg["Oleg Bartunov",align="left"] image:teodor_sigaev.jpg["Teodor Sigaev",align="right"]
_(Thank you, http://www.sai.msu.su/~megera/[Oleg Bartunov] and http://www.sigaev.ru/[Teodor Sigaev]!)_
2-minute demo - 50 rows
-----------------------
. As a proof of concept, create and populate our table containing the full
text of 50 newspaper issues.
. Create a GIN index on the +tsvector+ form of the text column of interest
(_content_). The two-argument +to_tsvector()+ is required here: the
one-argument form depends on +default_text_search_config+ and is not
+IMMUTABLE+, so it cannot appear in an index expression:
+
[source,sql]
------------------------------------------------------------------------------
CREATE INDEX ON lunews.issues
USING GIN (to_tsvector('english', content));
------------------------------------------------------------------------------
+
. Run some search queries:
+
[source,sql]
------------------------------------------------------------------------------
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content)
@@ to_tsquery('strike & picket');
SELECT id, ts_rank_cd(to_tsvector('english', content),
to_tsquery('picket | pocket | packet'))
FROM lunews.issues
WHERE to_tsvector('english', content)
@@ to_tsquery('picket | pocket | packet')
ORDER BY 2 DESC;
------------------------------------------------------------------------------
5-minute demo - 1 million rows
------------------------------
. Create and populate a table containing the titles and subjects of
1 *_MILLION_* books from OpenLibrary.
. Create weighted and unweighted +tsvector+ keyword columns.
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE openlib (id INT, title TEXT, subjects TEXT,
unweighted_kw TSVECTOR, weighted_kw TSVECTOR);
COPY openlib (id, title, subjects) FROM STDIN; -- <1> <2>
UPDATE openlib SET unweighted_kw =
to_tsvector('english', title) ||
to_tsvector('english', COALESCE(subjects, '')); -- <3>
UPDATE openlib SET weighted_kw =
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english',
COALESCE(subjects, '')), 'D'); -- <4>
------------------------------------------------------------------------------
<1> 2012-09-15 12:33:05-04 - started loading
<2> 2012-09-15 12:33:10-04 - finished loading
<3> 2012-09-15 12:33:51-04 - populated unweighted_kw
<4> 2012-09-15 12:34:40-04 - populated weighted_kw
5-minute demo - 1 million rows (cont.)
--------------------------------------
. Create indexes on the weighted and unweighted keyword columns.
. Create separate expression indexes on the title and subjects columns.
[source,sql]
------------------------------------------------------------------------------
CREATE INDEX openlib_title ON openlib
USING GIN (to_tsvector('english', title)); -- <1>
CREATE INDEX openlib_subjects ON openlib
USING GIN (to_tsvector('english', subjects)); -- <2>
CREATE INDEX openlib_unweighted ON openlib
USING GIN (unweighted_kw); -- <3>
CREATE INDEX openlib_weighted ON openlib
USING GIN (weighted_kw); -- <4>
------------------------------------------------------------------------------
<1> 2012-09-15 12:35:13-04 - created openlib_title index
<2> 2012-09-15 12:35:39-04 - created openlib_subjects index
<3> 2012-09-15 12:36:04-04 - created openlib_unweighted index
<4> 2012-09-15 12:36:30-04 - created openlib_weighted index
5-minute demo - 1 million rows - searching
------------------------------------------
With an unweighted search:
[source,sql]
------------------------------------------------------------------------------
SELECT id, SUBSTRING(title FROM 0 FOR 20),
ts_rank_cd(unweighted_kw,
to_tsquery('government & history'))
FROM openlib
WHERE unweighted_kw @@ to_tsquery('government & history')
ORDER BY 3 DESC LIMIT 5;
id | substring | rank
----------+---------------------+----------
6836768 | Pietre e potere | 0.35
17995790 | Reform and insurrec | 0.236111
8669133 | American Political | 0.229545
866182 | Lives in the public | 0.225
16640636 | Hē katastatikē no   | 0.214286
(5 rows) Total runtime: 84.942 ms
------------------------------------------------------------------------------
5-minute demo - 1 million rows - searching
------------------------------------------
With a weighted search:
[source,sql]
------------------------------------------------------------------------------
SELECT id, SUBSTRING(title FROM 0 FOR 20),
ts_rank_cd(weighted_kw,
to_tsquery('government & history'))
FROM openlib
WHERE weighted_kw @@ to_tsquery('government & history')
ORDER BY 3 DESC LIMIT 5;
id | substring | rank
----------+---------------------+----------
11331254 | Global Issues: Hist | 2.09091
5284994 | Michigan: a history | 0.615909
5857361 | California history | 0.590909
23324185 | Outlines of the his | 0.525974
14832714 | History of federal | 0.508333
(5 rows) Total runtime: 86.443 ms
------------------------------------------------------------------------------
5-minute demo - 1 million rows - searching
------------------------------------------
With a weighted search and relevancy algorithm tweak:
[source,sql]
------------------------------------------------------------------------------
SELECT id, SUBSTRING(title FROM 0 FOR 20),
ts_rank_cd(weighted_kw,
to_tsquery('government & history'), 4)
FROM openlib
WHERE weighted_kw @@ to_tsquery('government & history')
ORDER BY 3 DESC LIMIT 5;
id | substring | ts_rank_cd
----------+---------------------+------------
20696993 | A study of general | 0.5
11830422 | 21st Century Comple | 0.5
16606467 | History of the gove | 0.333333
9930696 | Judiciary and Respo | 0.333333
13681611 | history of local go | 0.333333
(5 rows) Total runtime: 92.518 ms
------------------------------------------------------------------------------
Parsing process
---------------
. Start with a _document_ - the text to be searched; can be a single word or an
entire book
. Parse the document into _tokens_ - with types like words, numbers, paths,
email addresses, white space
. Normalize the tokens to create _lexemes_ - including case-folding, stemming,
and using dictionaries and thesauruses to unify different forms of a token
ifdef::backend-slidy2[>>>]
Text search data types
~~~~~~~~~~~~~~~~~~~~~~
* _tsvector_ - text search vector (array-ish thing) containing lexemes with
positional information to inform relevance ranking
* _tsquery_ - a preprocessed text search query
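A minimal taste of both types - the bare casts skip normalization entirely,
which is why the lexemes below are unstemmed:
[source,sql]
------------------------------------------------------------------------------
SELECT 'fat cats ate fat rats'::tsvector;
-- 'ate' 'cats' 'fat' 'rats'     (sorted, deduplicated, no positions)
SELECT 'fat & rat'::tsquery;
-- 'fat' & 'rat'
------------------------------------------------------------------------------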
Show me the tokens!
-------------------
[source,sql]
------------------------------------------------------------------------------
SELECT alias, description FROM ts_token_type('default');
      alias      |               description
-----------------+------------------------------------------
asciiword | Word, all ASCII
word | Word, all letters
numword | Word, letters and digits
email | Email address
url | URL
host | Host
sfloat | Scientific notation
version | Version number
hword_numpart | Hyphenated word part, letters and digits
hword_part | Hyphenated word part, all letters
hword_asciipart | Hyphenated word part, all ASCII
blank | Space symbols
tag | XML tag
protocol | Protocol head
numhword | Hyphenated word, letters and digits
asciihword | Hyphenated word, all ASCII
hword | Hyphenated word, all letters
url_path | URL path
file | File or path name
float | Decimal notation
int | Signed integer
uint | Unsigned integer
entity | XML entity
(23 rows)
------------------------------------------------------------------------------
Let's see some lexemes
----------------------
* Normalized tokens, remember?
[source,sql]
------------------------------------------------------------------------------
SELECT to_tsvector('english',
'Kurt Vonnegut, Breakfast of Champions, 1973'
);
'1973':6 'breakfast':3 'champion':5 'kurt':1 'vonnegut':2
SELECT to_tsvector('english', 'Les nuits en Paris');
'en':3 'les':1 'nuit':2 'pari':4
SELECT to_tsvector('french', 'Les nuits en Paris'); -- <1>
'le':1 'nuit':2 'paris':4
------------------------------------------------------------------------------
<1> Dig that multilingual support, eh?
Operators and functions
-----------------------
* +to_tsvector()+ - turn a document into a vector
* +to_tsquery()+ - normalize the terms in a query to return a TSQUERY
* +plainto_tsquery()+ - normalize the terms in a query to return a TSQUERY,
automatically ANDing the terms
* +@@+ - match operator for a tsvector and a tsquery
* +ts_debug()+ - display how a string gets normalized by a given text search
configuration
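Two quick illustrations (expected output shown as comments):
[source,sql]
------------------------------------------------------------------------------
SELECT plainto_tsquery('english', 'student strikes');
-- 'student' & 'strike'
SELECT alias, token, lexemes FROM ts_debug('english', 'The pickets');
-- asciiword | The     | {}          (stopword: empty lexeme list)
-- blank     |         |
-- asciiword | pickets | {picket}
------------------------------------------------------------------------------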
Let's search!
-------------
* Boolean operators
** +&+ - AND
** +|+ - OR
** +!+ - NOT (tricksy: negation only combines via +&+ or +|+,
e.g. +strike & !student+)
* Wildcards for matching prefixes: +prefix:*+
[source,sql]
------------------------------------------------------------------------------
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ plainto_tsquery('student strike');
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ to_tsquery('strike & !student');
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ to_tsquery('prof:* | fac:*');
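-- 9.2 has no phrase-search operator: approximate a phrase
-- by pairing the FTS match with ILIKE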
SELECT id FROM lunews.issues
WHERE to_tsvector('english', content) @@ to_tsquery('faculty & strike')
AND content ILIKE '%faculty strike%';
------------------------------------------------------------------------------
What is this "normalization"?
-----------------------------
* *Parsers*: tokenize and classify the text of a document (word, number,
path...) - in practice, there is only one
* *Dictionaries*: convert tokens into normalized form, strip out stop words,
unify synonyms
* *Templates*: provide the dictionaries' underlying functions
* *Configurations*: which dictionaries to use for each token supplied by the
parser
[source,sql]
------------------------------------------------------------------------------
SELECT to_tsvector('english', 'stochastically'); -- <1>
'stochast':1
SELECT to_tsvector('english', 'To be or not to be'); -- <2>
SELECT to_tsvector('english', 'laptopping'); -- <3>
'laptop':1
------------------------------------------------------------------------------
<1> The English Porter stemming algorithm is thorough :)
<2> The result is empty - every token is a stopword. Stopwords can eat up a
lot of space if you decide to index them.
<3> The stemmer makes no guarantees that the original words were real!
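Each of those four concepts is visible in the system catalogs (psql's +\dF+,
+\dFd+, +\dFp+, and +\dFt+ give friendlier views):
[source,sql]
------------------------------------------------------------------------------
SELECT cfgname FROM pg_ts_config;     -- configurations
SELECT dictname FROM pg_ts_dict;      -- dictionaries
SELECT prsname FROM pg_ts_parser;     -- parsers
SELECT tmplname FROM pg_ts_template;  -- templates
------------------------------------------------------------------------------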
What are we matching?
---------------------
+ts_headline()+ shows where our +tsquery+ matched, in context:
[source,sql]
------------------------------------------------------------------------------
SELECT ts_headline(title,
to_tsquery('english', 'chicago & theater'),
'MaxWords = 10, MinWords = 5')
FROM openlib
WHERE to_tsvector('english', title)
@@ to_tsquery('english', 'chicago & theater')
LIMIT 5;
ts_headline
---------------------------------------------------
Rice's Theater, Chicago, 1851-1857
(1 row) Total runtime: 0.616 ms
------------------------------------------------------------------------------
Relevancy
---------
We don't just want matches; we want the best matches. PostgreSQL offers
two built-in ranking functions:
* +ts_rank+ ([ 'weights' +float4[]+, ] 'vector' +tsvector+,
'query' +tsquery+ [, 'normalization' +integer+ ]) - frequency of matching
lexemes
* +ts_rank_cd+ ([ 'weights' +float4[]+, ] 'vector' +tsvector+,
'query' +tsquery+ [, 'normalization' +integer+ ]) - cover density
algorithm that incorporates term proximity
Relevancy scoring can be adjusted by passing a bit mask of normalization
options that discount long documents or large unique-word counts, as
sketched below.
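For example, combining options 2 (divide by document length) and 16 (divide
by 1 + log of the unique word count) discounts long, wordy documents - a
sketch against the earlier +openlib+ table:
[source,sql]
------------------------------------------------------------------------------
SELECT id, ts_rank_cd(weighted_kw,
to_tsquery('government & history'), 2|16) -- combine flags with bitwise OR
FROM openlib
WHERE weighted_kw @@ to_tsquery('government & history')
ORDER BY 2 DESC LIMIT 5;
------------------------------------------------------------------------------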
.Cover sets
[quote, Clarke et al, Relevance Ranking for One to Three Term Queries (1999)]
______________________________________________________________________________
(1) the shorter the cover, the more likely the corresponding text is relevant;
and (2) the more covers contained in a document, the more likely the document
is relevant.
______________________________________________________________________________
Beyond searching a single column
--------------------------------
* An entire newspaper in a column doesn't allow for much precision
* A story in a column is better, but still not optimal
* Separate columns for title, author, story? Now we're talking
** Concatenate the +tsvector+ documents for maximum flexibility
Beyond searching a single column (cont.)
----------------------------------------
[source, sql]
------------------------------------------------------------------------------
CREATE TABLE lunews.stories (
id SERIAL, title TEXT, author TEXT, story TEXT
);
WITH ft AS (
SELECT id, title || ' ' || author || ' ' || story AS keywords,
to_tsvector(title) || to_tsvector(author) ||
to_tsvector(story) AS ts_kw
FROM lunews.stories
),
q AS (
SELECT id, to_tsquery('exotic & horizons') AS tsq
FROM lunews.stories
)
SELECT ft.id, ts_rank_cd(ft.ts_kw, q.tsq) AS score,
ts_headline(ft.keywords, q.tsq) AS kic
FROM ft INNER JOIN q ON q.id = ft.id
WHERE q.tsq @@ ft.ts_kw
;
------------------------------------------------------------------------------
Weighting
---------
* Terms can be weighted using four scores: 'A', 'B', 'C', and 'D'
* Weight sets of terms at index time using the +setweight()+ function
* Adjust weight values at query time by passing an array of four
+float+ values as the first argument to +ts_rank()+/+ts_rank_cd()+
** Example: You can set the title to 'A' and subject to 'D' for a default
boost to title occurrences, but adjust the weights for a subject-focused
search
[source, sql]
------------------------------------------------------------------------------
WITH ft AS (
SELECT id, title || ' ' || author || ' ' || story AS keywords,
setweight(to_tsvector(COALESCE(title, '')), 'A')
|| setweight(to_tsvector(COALESCE(author, '')), 'B')
|| setweight(to_tsvector(COALESCE(story, '')), 'D')
AS ts_kw
FROM lunews.stories
),
q AS (SELECT to_tsquery('exotic & horizons') AS tsq)
SELECT ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', ft.ts_kw, q.tsq) -- <1>
FROM ft, q WHERE q.tsq @@ ft.ts_kw;
------------------------------------------------------------------------------
<1> Small gotcha: floats in the score array represent 'D', 'C', 'B', 'A'
Custom search configuration
---------------------------
Perhaps you actually want to index stopwords (to find 'To be or not to be' or
'It')
[quote,#postgresql on Freenode]
_____________________________________________________________________________
denials, Isn't there an easier way to allow stop word, like just set a flag in
some config file to True or False?
_____________________________________________________________________________
[source,sql]
------------------------------------------------------------------------------
CREATE TEXT SEARCH DICTIONARY tsd_english_full(
template = snowball, language = english
);
CREATE TEXT SEARCH CONFIGURATION tsc_english_full(
COPY = pg_catalog.english
);
ALTER TEXT SEARCH CONFIGURATION tsc_english_full
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
hword, hword_part, word
WITH tsd_english_full;
SELECT to_tsvector('tsc_english_full', 'to be or not to be');
'be':2,6 'not':4 'or':3 'to':1,5
------------------------------------------------------------------------------
Accelerate
----------
* GIN and GiST indexes support full-text search queries
** Usual trade-off of better performance / slower build for GIN
* One index per column of interest, plus one general index for weighted
aggregate searches, per text search configuration
* Create +tsvector+ columns with an associated trigger to maintain their
contents
** +tsvector_update_trigger()+ function automatically updates the
+tsvector+ column with the contents of one or more columns, for a
single text search configuration
** +tsvector_update_trigger_column()+ function uses whatever text search
configuration appears in a specified column on a per-row basis, so that
you can better support multilingual data in a single table (see the
sketch below)
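A minimal sketch of the per-row variant, assuming a hypothetical
+stories_i18n+ table that carries its text search configuration in a
+regconfig+ column:
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE stories_i18n (
title TEXT,
ts_config REGCONFIG, -- e.g. 'english' or 'french', set per row
tsv TSVECTOR
);
CREATE TRIGGER tsv_i18n_update
BEFORE INSERT OR UPDATE ON stories_i18n FOR EACH ROW
EXECUTE PROCEDURE
tsvector_update_trigger_column(tsv, ts_config, title);
------------------------------------------------------------------------------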
Accelerate (example)
--------------------
Use the built-in +tsvector_update_trigger()+ function to maintain a
+tsvector+ column, with a corresponding GIN index:
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE openlib_subset(title TEXT, tsv TSVECTOR);
CREATE TRIGGER tsv_update
BEFORE INSERT OR UPDATE ON openlib_subset FOR EACH ROW
EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title);
CREATE INDEX idx_title ON openlib_subset USING GIN (tsv);
INSERT INTO openlib_subset (title)
SELECT title FROM openlib ORDER BY id DESC LIMIT 10000;
SELECT title, tsv FROM openlib_subset
WHERE to_tsquery('Chicago') @@ tsv
AND LENGTH(title) < 30 LIMIT 5;
title | tsv
------------------------------+--------------------------------------
The Chicago address | 'address':3 'chicago':2
You were never in Chicago | 'chicago':5 'never':3
African Americans in Chicago | 'african':1 'american':2 'chicago':4
Chicago | 'chicago':1
------------------------------------------------------------------------------
Preparsing
----------
* Changing the parser to emit more / different tokens requires source code
changes
* Hack: change the text before you index & query it
Desired goal: index +"term1/term2"+ as a single term, _and_ as separate terms
[source,sql]
------------------------------------------------------------------------------
SELECT alias, token, lexemes FROM ts_debug('english','term1/term2');
alias | token | lexemes
-------+-------------+---------------
file | term1/term2 | {term1/term2}
SELECT alias, token, lexemes FROM ts_debug('english',
regexp_replace('term1/term2',
E'(\\S+)/(\\S+)', E'\\1/\\2 \\1 \\2')
);
alias | token | lexemes
---------+-------------+---------------
file | term1/term2 | {term1/term2}
blank | |
numword | term1 | {term1}
blank | |
numword | term2 | {term2}
------------------------------------------------------------------------------
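To keep index time and query time in sync, one option is to wrap the
substitution in an +IMMUTABLE+ function (a hypothetical +preparse()+ helper)
and build the expression index over it:
[source,sql]
------------------------------------------------------------------------------
CREATE FUNCTION preparse(doc TEXT) RETURNS TEXT AS $$
SELECT regexp_replace($1, E'(\\S+)/(\\S+)', E'\\1/\\2 \\1 \\2', 'g');
$$ LANGUAGE SQL IMMUTABLE;
CREATE INDEX ON lunews.issues
USING GIN (to_tsvector('english', preparse(content)));
------------------------------------------------------------------------------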
Synonyms via a dictionary
-------------------------
*Problem*: default 'english' configuration does not recognize American
variations as the equivalent of Canadian words
* Examples: 'color' vs. 'colour', 'favorite' vs. 'favourite'
*Solution*: The 'synonym' template lets you define a list of words and their
synonyms, including prefix matches if applicable.
.+canuck.syn+ synonym file
------------------------------------------------------------------------------
colour color*
colours color*
------------------------------------------------------------------------------
[source,sql]
------------------------------------------------------------------------------
CREATE TEXT SEARCH DICTIONARY canuck (
TEMPLATE = synonym, SYNONYMS = canuck -- <1>
);
ALTER TEXT SEARCH CONFIGURATION tsc_english_full
ALTER MAPPING FOR asciiword WITH canuck, tsd_english_full;
SELECT ts_lexize('canuck', 'colours');
{color}
------------------------------------------------------------------------------
<1> Do not specify the suffix; PostgreSQL adds +.syn+ for you and looks for
the file in the installation's +tsearch_data+ directory.
ifdef::ts_write_bug_fixed[]
Synonyms via +ts_rewrite()+
---------------------------
*Problem*: any update to a synonym dictionary requires reindexing for any
text search configurations that use that dictionary
*Solution*: the +ts_rewrite()+ function substitutes terms, which can be drawn
from a table, at query-time
[source,sql]
------------------------------------------------------------------------------
CREATE TABLE synonyms (term TSQUERY, sub TSQUERY);
INSERT INTO synonyms (term, sub) VALUES
(to_tsquery('colour'), to_tsquery('color:*')),
(to_tsquery('colours'), to_tsquery('color:*'));
------------------------------------------------------------------------------
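Query time would then look something like this, given the table above:
[source,sql]
------------------------------------------------------------------------------
SELECT ts_rewrite(to_tsquery('colour & crayon'),
'SELECT term, sub FROM synonyms');
-- yields the query with 'color':* substituted for 'colour'
------------------------------------------------------------------------------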
endif::ts_write_bug_fixed[]
Spell-checking suggestions
--------------------------
* You can use the *pg_trgm* extension to calculate similarity with existing
words in your corpus.
[source, sql]
------------------------------------------------------------------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE lunews.issues_words AS (
SELECT word FROM ts_stat( -- <1>
'SELECT to_tsvector(''simple'', content) -- <2>
FROM lunews.issues'
)
);
CREATE INDEX lunews_issues_content_idx
ON lunews.issues_words
USING GiST (word gist_trgm_ops); -- <3>
------------------------------------------------------------------------------
<1> +ts_stat()+ returns statistics about each distinct lexeme in the
+tsvector+, such as frequency. Here we just want the distinct lexemes.
<2> We use the +simple+ configuration because we do not want stemming to
kick in when we process the lexemes.
<3> Use the +gin_trgm_ops+ or +gist_trgm_ops+ operator class for a GIN or
GiST index, respectively; only GiST supports ordering by the +<->+
distance operator used on the next slide.
Spell-checking suggestions: query
---------------------------------
The +<->+ operator returns the "distance" between its arguments: 1 minus
their trigram similarity, so 0 means an exact match.
[source, sql]
------------------------------------------------------------------------------
SELECT word, word <-> 'picket' AS dist
FROM lunews.issues_words
ORDER BY dist
LIMIT 5;
word | dist
-------------+----------
picket | 0
pickets | 0.333333
picketer | 0.4
picketed | 0.4
picket-line | 0.416667
(5 rows)
------------------------------------------------------------------------------
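pg_trgm's +%+ operator offers a related approach: it filters to words above
the current similarity threshold (see +show_limit()+ / +set_limit()+) and can
also use the trigram index:
[source,sql]
------------------------------------------------------------------------------
SELECT word, similarity(word, 'pickit') AS sml
FROM lunews.issues_words
WHERE word % 'pickit' -- true only above the similarity threshold
ORDER BY sml DESC LIMIT 5;
------------------------------------------------------------------------------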
Resources
---------
* PostgreSQL Global Development Group. http://www.postgresql.org/docs/9.2/interactive/textsearch-intro.html[Full Text Search: Introduction]. 'PostgreSQL 9.2.0 Documentation'. 2012.
* Clarke et al. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.1615&rep=rep1&type=pdf[Relevance Ranking for One to Three Term Queries]. 'Information Processing and Management', 36(2):291-311, 2000.
* Porter, M.F. http://snowball.tartarus.org/texts/introduction.html[Snowball: A language for stemming algorithms]. 2001.