Feb 25

DB: Full-Text Search in Databases

Mindli Team

AI-Generated Content

When you need to find a specific phrase in a database table, a simple LIKE query might come to mind. However, as your data grows, this approach grinds to a halt, offering poor performance and even poorer relevance. Full-text search is the engineered solution to this problem, transforming unstructured text into searchable data by indexing individual words and applying sophisticated ranking algorithms. This capability powers everything from e-commerce product finders to document repositories, making it an essential skill for building responsive, user-friendly applications.

From LIKE to Full-Text: The Need for Speed

The fundamental limitation of a SQL LIKE '%keyword%' query is that it performs a sequential scan. The database must check every row, examining the target column character by character for a match. This is an O(n) operation: its time grows linearly with the size of your data, and the leading wildcard prevents an ordinary B-tree index from helping. Furthermore, it lacks semantic understanding; searching for "running" will not find documents containing "ran" or "runs."

Full-text search solves this by pre-processing text and building specialized indexes. Instead of treating a document as a mere string, it breaks it down into tokens (typically words), normalizes them, and creates a map from each token back to the documents that contain it. This allows for sub-second responses over millions of documents. The leap from pattern matching to token-based retrieval is the core of performant search.

Building the Engine: The Inverted Index

The heart of any full-text search system is the inverted index. Think of it as the index at the back of a textbook, but far more powerful. You don't look up page numbers; you look up document IDs.

The construction process involves several key steps:

  1. Tokenization: The raw text is split into individual tokens. For "The quick-brown fox!", this might yield ["The", "quick-brown", "fox!"].
  2. Normalization: Tokens are standardized to a common form. This usually includes:
  • Lowercasing: "The" becomes "the".
  • Removing diacritics: "café" becomes "cafe".
  • Handling hyphenation: Deciding if "quick-brown" becomes ["quick", "brown"] or remains a single token.
  3. Applying Linguistic Filters:
  • Stop-word removal: Common, low-meaning words like "the," "and," "is" are filtered out to reduce index size and noise.
  • Stemming and Lemmatization: Stemming crudely chops word endings ("running" -> "run", "flies" -> "fli"), while lemmatization uses a vocabulary to find the dictionary base form ("ran" -> "run", "better" -> "good").
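The pipeline above can be sketched in a few lines of Python. The stemmer and stop-word list here are toy stand-ins for what a real analyzer (e.g., a Porter stemmer) would do:

```python
import re

# Illustrative stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"the", "and", "is", "a"}

def stem(token):
    # Crude suffix-stripping, in the spirit of the examples above:
    # "running" -> "run", "flies" -> "fli"
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run"
            if len(token) >= 2 and token[-1] == token[-2]:
                token = token[:-1]
            return token
    return token

def analyze(text):
    # 1. Tokenization: split on non-letters ("quick-brown" -> two tokens)
    tokens = re.findall(r"[a-zA-Z]+", text)
    # 2. Normalization: lowercase everything
    tokens = [t.lower() for t in tokens]
    # 3. Linguistic filters: drop stop words, then stem
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The quick-brown fox is running!"))
# -> ['quick', 'brown', 'fox', 'run']
```

Note the design choice hidden in step 1: splitting on non-letters silently resolves the hyphenation question by always breaking "quick-brown" into two tokens.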

After processing, the index is built. For documents:

  • Doc1: "The cat sat."
  • Doc2: "The dog ran."

The inverted index would look like:

"cat" -> [Doc1]
"dog" -> [Doc2]
"sat" -> [Doc1]
"ran" -> [Doc2]

When you search for "cat," the database instantly retrieves Doc1's ID without scanning Doc2.
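This lookup can be sketched as a dictionary from token to document IDs. The one-line tokenizer here is a stand-in for the full analysis pipeline:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each token to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().replace(".", "").split():
            index[token].add(doc_id)
    return index

docs = {"Doc1": "The cat sat.", "Doc2": "The dog ran."}
index = build_inverted_index(docs)

# Searching is a single dictionary lookup, not a scan over every document.
print(index["cat"])  # -> {'Doc1'}
```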

Ranking by Relevance: TF-IDF Scoring

Returning matching documents is only half the battle; presenting them in order of relevance is what makes search useful. The classic algorithm for this is TF-IDF (Term Frequency-Inverse Document Frequency).

TF-IDF calculates a weight for each term in a document within a collection. The weight is a product of two statistics:

  • Term Frequency (TF): How often a term appears in a specific document. A higher count suggests the document is more relevant to that term. It's often normalized (e.g., TF(t, d) = count of t in d / total terms in d).
  • Inverse Document Frequency (IDF): How common or rare a term is across all documents. Common terms (like "the") are less informative. It is calculated as:

IDF(t) = log(N / df(t))

where N is the total number of documents, and the denominator df(t) is the number of documents containing term t.

The TF-IDF score is: TF-IDF(t, d) = TF(t, d) × IDF(t).

Interpretation: A high TF-IDF score occurs when a term appears frequently in a given document (high TF) but is rare across the entire corpus (high IDF). This identifies terms that are strongly characteristic of that particular document. For a query with multiple terms, a document's relevance score is often the sum of the TF-IDF scores for each query term it contains.
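The definitions above translate directly into Python. This toy scorer assumes every queried term occurs in at least one document, so the IDF denominator is never zero:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    # TF: term count normalized by document length
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # IDF: log of (total docs / docs containing the term)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# "cat" appears in 1 of 2 docs: tf = 1/3, idf = ln 2, score ~ 0.231
print(tf_idf("cat", corpus[0], corpus))
# "the" appears in both docs: idf = ln 1 = 0, so the score is 0
print(tf_idf("the", corpus[0], corpus))
```

This is exactly the intuition from the text: the stop-word-like "the" scores zero because it discriminates between nothing, while "cat" is characteristic of Doc1.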

Implementation: PostgreSQL vs. Elasticsearch

You can implement full-text search in traditional relational databases or dedicated search engines, each with trade-offs.

In PostgreSQL, you use built-in full-text search capabilities. You create a tsvector column (which stores the processed tokens) and a GIN index on it for speed.

-- Create a searchable column
ALTER TABLE products ADD COLUMN search_vector tsvector;
UPDATE products SET search_vector = to_tsvector('english', coalesce(description, '') || ' ' || coalesce(name, ''));  -- coalesce: a NULL column would otherwise NULL the whole vector

-- Create the index
CREATE INDEX idx_search ON products USING GIN(search_vector);

-- Perform a ranked search
SELECT name, ts_rank(search_vector, query) AS rank
FROM products, to_tsquery('english', 'wireless & headphone') query
WHERE search_vector @@ query
ORDER BY rank DESC;

PostgreSQL handles stemming, stop-words, and ranking out of the box (ts_rank scores by term frequency, with optional document-length normalization), making it a robust choice for search tightly integrated with transactional data.

Elasticsearch (or OpenSearch) is a dedicated search and analytics engine built on Apache Lucene. It treats the inverted index as the primary data structure. It is distributed by nature, excels at scalability, and offers more advanced features out-of-the-box.

Configuration in Elasticsearch involves defining an index with a mapping that specifies analyzers for each text field. An analyzer is a pipeline composed of a character filter, tokenizer, and token filters (for lowercase, stop-words, stemming). Its default ranking algorithm, Okapi BM25, is a probabilistic model that improves upon TF-IDF, particularly in handling term saturation and document length normalization.
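To make that contrast concrete, here is a minimal sketch of BM25's per-term score, using the conventional parameter defaults k1 = 1.2 and b = 0.75:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    # One term's contribution to one document's BM25 score.
    # k1 controls term-frequency saturation; b controls length normalization.
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# Unlike raw TF, the score saturates: ten occurrences of a term score
# well under ten times what a single occurrence scores.
print(bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                      n_docs=1000, docs_with_term=10))
print(bm25_term_score(tf=10, doc_len=100, avg_doc_len=100,
                      n_docs=1000, docs_with_term=10))
```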

Designing Search Systems: Facets and Autocomplete

Beyond basic keyword search, production systems require supportive features.

Faceted filtering (or faceted search) allows users to narrow results by categories. For a product search, facets could be brand, price_range, and color. Technically, this is implemented using aggregations. While searching for "laptop," the system also queries the index to count how many matching documents fall into each bucket (e.g., Dell: 45, HP: 32, price < 500: 12). This provides immediate, drill-down feedback.
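Conceptually, a facet is just a group-by count over the matching result set. A toy sketch (the field names and documents are hypothetical):

```python
from collections import Counter

# Hypothetical documents matching the query "laptop"
results = [
    {"name": "XPS 13",   "brand": "Dell", "color": "silver"},
    {"name": "Latitude", "brand": "Dell", "color": "black"},
    {"name": "Spectre",  "brand": "HP",   "color": "silver"},
]

def facet_counts(docs, field):
    # Count how many matching documents fall into each bucket for a field.
    return Counter(d[field] for d in docs)

print(facet_counts(results, "brand"))  # -> Counter({'Dell': 2, 'HP': 1})
print(facet_counts(results, "color"))  # -> Counter({'silver': 2, 'black': 1})
```

A real engine computes these buckets inside the same index pass as the search itself, so the counts stay consistent with the result list.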

Autocomplete (or type-ahead) suggests queries as the user types. This is often powered by a separate data structure optimized for prefixes, like a Trie or a specialized n-gram index. In Elasticsearch, this is typically implemented using the completion suggester or custom edge-ngram token filters. The goal is to predict intent and reduce user effort, responding in milliseconds to partial inputs like "wirel" -> ["wireless mouse", "wireless charger"].
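A minimal edge-ngram suggester can be sketched as follows; for simplicity it indexes prefixes of only the first word of each phrase, which a real implementation would generalize:

```python
def edge_ngrams(term, min_len=2):
    # Every prefix of the term: "wireless" -> ["wi", "wir", ..., "wireless"]
    return [term[:i] for i in range(min_len, len(term) + 1)]

def build_suggester(phrases):
    # Map each prefix of a phrase's first word to the full suggestions.
    index = {}
    for phrase in phrases:
        for gram in edge_ngrams(phrase.split()[0]):
            index.setdefault(gram, []).append(phrase)
    return index

suggester = build_suggester(
    ["wireless mouse", "wireless charger", "wired keyboard"])

# A partial input is again a single dictionary lookup.
print(suggester["wirel"])  # -> ['wireless mouse', 'wireless charger']
```

The key idea is that the expensive work (expanding every prefix) happens at index time, so the per-keystroke lookup stays O(1).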

Common Pitfalls

  1. Ignoring Analyzer Configuration: Using the default analyzer for all text is a common mistake. Product SKUs, email addresses, and log messages often need different tokenization rules (e.g., no stemming, case-sensitive). Always define a custom analyzer suited to your data's domain.
  2. Over-Stemming: Aggressive stemming can hurt precision. For example, stemming "university" and "universe" to the same root "univers" will return irrelevant matches. Lemmatization is often preferable, or you can use a controlled synonym list to map related terms (e.g., "TV" -> "television") without over-broadening.
  3. Forgetting Re-indexing: When you update a source document in your primary database, the search index becomes stale. You must have a reliable process to re-index updated data, either in real-time (using change data capture or application triggers) or in scheduled batches. An out-of-sync index leads to frustrated users.
  4. Treating Search as an Afterthought: Bolting on full-text search after an application is built leads to poor architecture. Consider search requirements early—data modeling, query patterns, and scalability needs—to choose between an integrated (PostgreSQL) or specialized (Elasticsearch) solution appropriately.

Summary

  • Full-text search uses inverted indexes to enable fast, token-based retrieval, moving far beyond the slow pattern-matching of LIKE queries.
  • Relevance ranking is critical; the TF-IDF algorithm and its successors like BM25 score documents based on term prominence within a document and rarity across the collection.
  • Implementation choices range from integrated solutions like PostgreSQL (using tsvector and GIN indexes) to dedicated engines like Elasticsearch, which offer superior scalability and advanced features.
  • Effective search requires linguistic processing (tokenization, stemming, stop-word removal) and supporting features like faceted filtering for navigation and autocomplete for user experience.
  • Successful implementation requires careful analyzer design, a plan for keeping the index synchronized with source data, and treating search as a core architectural component from the start.
