Full Text Search

This whitepaper explores advantages such powerful PostgreSQL feature as full-text search focusing on search approaches and indexing.

Regular textual search features provided by any popular DBMS has the following weaknesses preventing from using it in complex information systems:

it is slow since there is no index support and so the entire contents must be processed for every search
no linguistic support to process all derived forms of search words
no ranking of search results to get best matches for the searching query

Full-text search is generally used to identify natural-language documents that are relevant to search words or phrases. In other words, the goal is to find all documents containing the specified query terms and determine their similarity to the query. Syntax of those queries and similarity metrics are very depending on the specific application.

PostgreSQL offers two options of full-text search: tsvector indexes and trigram indexes. Tsvector indexes can be efficiently used for complex linguistic searches. Trigram indexes option is designed for substring searches and fuzzy matching.

Let us illustrate the concept of PostgreSQL full-text search on examples. First, we need to create a sample table 'journal' like this:

CREATE TABLE journal (
	id SERIAL PRIMARY KEY,
	title VARCHAR(255) NOT NULL,
	comments TEXT
);

Now we can do full text search without any extra steps like building an index. Here is the query to print the title of each journal's record that contains the word "friend" in its "comments" field:

SELECT title FROM journal
WHERE to_tsvector('english', comments) @@ to_tsquery('english', 'friend');

The query converts the "comments" column into a tsvector using to_tsvector function, then checks if it matches the tsquery item generated from the search term "friend" through to_tsquery function.

This query will also return rows containing derived forms of the searched word such as "friends" and "friendly", since all these are reduced to the same normalized lexeme. However, the full-text search is not optimized by performance since the related index has not been created.

Indexing for Full-text Search in PostgreSQL.

In order to improve performance of full-text search, the tsvector GIN (generalized inverted index) index must be created around lexemes:

CREATE INDEX idx_journal_comments ON journal USING gin(to_tsvector('english', comments));

According to the PostgreSQL documentation:

GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows lacking additional words.

Therefore, GIN indexes are the must-have option for implementing full-text search in PostgreSQL. Tsvector indexes can also concatenate multiple columns, for example:

CREATE INDEX idx2_journal ON journal USING GIN (to_tsvector('english', title || ' ' || comments));

Trigram Approach for PostgreSQL Full Text Search

Trigram is defined as a group of three consecutive characters taken from a string. The similarity of two strings can be measured by counting the number of shared trigrams. This approach is very effective for measuring the similarity of words in many natural languages.

To use Trigram capabilities, the pg_trgm extension of PostgreSQL must be enabled:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

This extension includes functions and operators necessary for trigram support, most important of them are:

real similarity (text, text) returns coefficient of similarity for the two arguments ranged from zero (the two strings are completely dissimilar) to one (the two strings are identical)
text[] show_trgm (text) returns an array of all the trigrams for the specified text that could be useful for debugging purpose
real word_similarity (text, text) returns the greatest value of similarity between the set of trigrams in the first argument and continuous ordered set of trigrams in the second argument
boolean text % text returns true when similarity of the arguments is greater than the current threshold stored in pg_trgm.similarity_threshold
real text <-> text returns the "distance" between the arguments, that is negative value to "similarity" or in other words one minus similarity(...)

Let us explore those functions in examples. The following query discovers all the trigrams generated by PostgreSQL for the "word":

SELECT show_trgm('word');

The result of this query is:

show_trgm
---------------------------------
{"  w"," wo","wor","ord","rd "}

Now, let's consider how the function word_similarity works:

SELECT word_similarity('word', 'two words');

The result of this query is 0.8 since the most similar extent of an ordered set of trigrams in the second string is {" w"," wo","wor","ord"}.

Building GIN Indexes

GIN index can be created using the trigram operator class gin_trgm_ops on column "comments" of table "journal" as follows:

CREATE INDEX IDX_journal_comments ON journal USING GIN (comments gin_trgm_ops);

This statement will force the PostgreSQL to split the row values into trigrams and build the related index, so you will have an index to use for similarity searching like this:

SELECT comments, similarity(comments, 'friend') AS sml
  FROM journal
  WHERE comments % 'friend'
  ORDER BY sml DESC, comments;

The Trigram approach empowers text similarity searches with more flexible and effective pattern matching for the specified column.

Tsvector vs. Trigram

When choosing between tsvector and trigram approaches it is important to understand cons and pros of these techniques.

Trigram is suitable for text search, especially for matching similar patterns and substring searches. This method splits compared words into consecutive triplets of characters, making them ready for similarity matching. Trigram approach does not deal with text semantics and so cannot be used for complex linguistic searches and ranking large documents.

Tsvector approach has been specially designed for full-text search operations, that makes it ideal choice for complex linguistic search tasks and natural language queries.

	TSVECTOR	TRIGRAM INDEXES
Functionality	Compares text in a natural language way including derived forms of the searched words	Splits the text into set of trigrams (three consecutive characters)
Usage	Ideal for complex linguistic search scenarios, supporting natural language queries	Effective for substring searches and matching similar patterns/td>
Index Type	GIN or GiST (Generalized Search Tree) indexes are common	GIN or GiST indexes with the gin_trgm_ops extension

Have questions? Contact us