This whitepaper explores advantages such powerful PostgreSQL feature as full-text search focusing on search approaches and indexing.
Regular textual search features provided by any popular DBMS has the following weaknesses preventing from using it in complex information systems:
Full-text search is generally used to identify natural-language documents that are relevant to search words or phrases. In other words, the goal is to find all documents containing the specified query terms and determine their similarity to the query. Syntax of those queries and similarity metrics are very depending on the specific application.
PostgreSQL offers two options of full-text search: tsvector indexes and trigram indexes. Tsvector indexes can be efficiently used for complex linguistic searches. Trigram indexes option is designed for substring searches and fuzzy matching.
Let us illustrate the concept of PostgreSQL full-text search on examples. First, we need to create a sample table 'journal' like this:
CREATE TABLE journal ( id SERIAL PRIMARY KEY, title VARCHAR(255) NOT NULL, comments TEXT );
Now we can do full text search without any extra steps like building an index. Here is the query to print the title of each journal's record that contains the word "friend" in its "comments" field:
SELECT title FROM journal WHERE to_tsvector('english', comments) @@ to_tsquery('english', 'friend');
The query converts the "comments" column into a tsvector using to_tsvector
function, then checks if it matches the tsquery item generated from the search term
"friend" through to_tsquery
function.
This query will also return rows containing derived forms of the searched word such as "friends" and "friendly", since all these are reduced to the same normalized lexeme. However, the full-text search is not optimized by performance since the related index has not been created.
In order to improve performance of full-text search, the tsvector GIN (generalized inverted index) index must be created around lexemes:
CREATE INDEX idx_journal_comments ON journal
USING gin(to_tsvector('english', comments));
According to the PostgreSQL documentation:
GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows lacking additional words.
Therefore, GIN indexes are the must-have option for implementing full-text search in PostgreSQL. Tsvector indexes can also concatenate multiple columns, for example:
CREATE INDEX idx2_journal ON journal
USING GIN (to_tsvector('english', title || ' ' || comments));
Trigram is defined as a group of three consecutive characters taken from a string. The similarity of two strings can be measured by counting the number of shared trigrams. This approach is very effective for measuring the similarity of words in many natural languages.
To use Trigram capabilities, the pg_trgm extension of PostgreSQL must be enabled:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
This extension includes functions and operators necessary for trigram support, most important of them are:
real similarity (text, text)
returns coefficient of similarity for the two
arguments ranged from zero (the two strings are completely dissimilar) to one
(the two strings are identical)text[] show_trgm (text)
returns an array of all the trigrams for the specified
text that could be useful for debugging purposereal word_similarity (text, text)
returns the greatest value of similarity
between the set of trigrams in the first argument and continuous ordered
set of trigrams in the second argumentboolean text % text
returns true when similarity of the arguments is greater than
the current threshold stored in pg_trgm.similarity_thresholdreal text <-> text
returns the "distance" between the arguments,
that is negative value to "similarity" or in other words one minus similarity(...)Let us explore those functions in examples. The following query discovers all the trigrams generated by PostgreSQL for the "word":
SELECT show_trgm('word');
The result of this query is:
show_trgm --------------------------------- {" w"," wo","wor","ord","rd "}
Now, let's consider how the function word_similarity
works:
SELECT word_similarity('word', 'two words');
The result of this query is 0.8 since the most similar extent of an ordered set of trigrams in the second string is {" w"," wo","wor","ord"}.
GIN index can be created using the trigram operator class gin_trgm_ops
on column
"comments" of table "journal" as follows:
CREATE INDEX IDX_journal_comments ON journal
USING GIN (comments gin_trgm_ops);
This statement will force the PostgreSQL to split the row values into trigrams and build the related index, so you will have an index to use for similarity searching like this:
SELECT comments, similarity(comments, 'friend') AS sml FROM journal WHERE comments % 'friend' ORDER BY sml DESC, comments;
The Trigram approach empowers text similarity searches with more flexible and effective pattern matching for the specified column.
When choosing between tsvector and trigram approaches it is important to understand cons and pros of these techniques.
Trigram is suitable for text search, especially for matching similar patterns and substring searches. This method splits compared words into consecutive triplets of characters, making them ready for similarity matching. Trigram approach does not deal with text semantics and so cannot be used for complex linguistic searches and ranking large documents.
Tsvector approach has been specially designed for full-text search operations, that makes it ideal choice for complex linguistic search tasks and natural language queries.
TSVECTOR | TRIGRAM INDEXES | |
---|---|---|
Functionality | Compares text in a natural language way including derived forms of the searched words | Splits the text into set of trigrams (three consecutive characters) |
Usage | Ideal for complex linguistic search scenarios, supporting natural language queries | Effective for substring searches and matching similar patterns/td> |
Index Type | GIN or GiST (Generalized Search Tree) indexes are common | GIN or GiST indexes with the gin_trgm_ops extension |
Have questions? Contact us