Mar 11

Question Answering with Extractive Models

Mindli Team

AI-Generated Content


In today's information-saturated world, the ability to instantly locate precise answers within vast texts is invaluable. Extractive Question Answering (QA) models provide this capability by identifying the exact text span from a given context that answers a question, powering everything from search engine snippets to intelligent document assistants. This guide explores how to build robust extractive QA systems by fine-tuning modern transformer models, moving from single-document analysis to scalable open-domain search.

Core Concepts of Extractive QA

At its heart, an extractive QA model treats answering as a span prediction task. Given a question and a relevant context passage, the model's objective is to predict the start and end token positions of the answer within that context. This differs from generative QA, where a model might formulate an answer in its own words. Transformer-based architectures like BERT revolutionized this task by using their deep, bidirectional understanding of language to score every possible span.

The standard approach is to add a lightweight QA head on top of a pre-trained transformer model like BERT, RoBERTa, or ELECTRA. This head typically consists of two separate linear layers that take the final hidden states of the context tokens as input. One layer outputs a score for each token being the start of the answer span, and the other does the same for the end token. During training, the model learns to maximize the scores for the correct start and end positions. The final answer for a given question-context pair is the sequence of tokens between the predicted start and end indices.
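The span-scoring step described above can be sketched in plain Python. The function below (`best_span` is an illustrative name, not a library API) assumes you already have the start and end logits the QA head produces for each context token, and searches every valid (start, end) pair:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid if start <= end and its length does not exceed
    max_answer_len. The score is the sum of the start and end logits,
    the usual confidence measure for extractive QA.
    """
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the cap
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best
```

In practice, libraries restrict the search to the top-scoring start and end positions rather than scoring every pair, but the selection logic is the same.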

Working with SQuAD and Handling Impossible Questions

Most models are initially fine-tuned on datasets like SQuAD (Stanford Question Answering Dataset). Understanding its format is crucial. In SQuAD, each data point includes a context paragraph, a question, and the answer text along with its character-based start position in the context. During preprocessing, this character position is converted to precise token indices compatible with the model's tokenizer.
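The character-to-token conversion can be sketched using a tokenizer's offset mapping, i.e. the list of (char_start, char_end) pairs per token that Hugging Face fast tokenizers return with `return_offsets_mapping=True`. The helper name and the hand-written offsets below are illustrative:

```python
def char_to_token_span(offsets, answer_start, answer_end):
    """Map a character-level answer span to token indices.

    offsets: list of (char_start, char_end) per token, as produced by a
    tokenizer's offset mapping. Returns (token_start, token_end), or
    (None, None) if the span does not align with any tokens.
    """
    token_start = token_end = None
    for i, (cs, ce) in enumerate(offsets):
        # Token containing the first answer character
        if cs <= answer_start < ce:
            token_start = i
        # Token containing the last answer character
        if cs < answer_end <= ce:
            token_end = i
    return token_start, token_end
```

During preprocessing of SQuAD, this mapping is applied to the character-based `answer_start` field (with `answer_end` derived from the answer text's length) to produce the start/end token labels the model is trained on.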

A critical advancement introduced with SQuAD 2.0 is the inclusion of impossible questions—questions for which no answer exists in the provided context. To handle these, the model is trained to treat the [CLS] token's position as the designated "no-answer" span: the QA head learns to predict both the start and the end as position 0 (the index of the [CLS] token) when no answer is present. At inference time, a no-answer threshold is often applied: if the score of the best predicted answer span does not exceed the no-answer score by a chosen margin, the system outputs "no answer found" instead.
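The threshold logic can be sketched as follows. The function name and arguments are illustrative; `null_score` is assumed to be the sum of the start and end logits at position 0 (the [CLS] span), and `threshold` is a value calibrated on a validation set:

```python
def choose_answer(best_text, best_score, null_score, threshold=0.0):
    """Decide between the best extracted span and 'no answer'.

    best_score: start + end logits of the best non-null span.
    null_score: start + end logits at position 0 (the [CLS] span).
    Returns the answer text, or None to signal 'no answer found'.
    """
    if best_score - null_score > threshold:
        return best_text
    return None  # caller reports "no answer found"
```

Raising the threshold makes the system more conservative (more "no answer" responses); lowering it makes it more willing to commit to low-confidence spans.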

Processing Long Documents with Chunking and Stride

Transformer models have a maximum sequence length limit (e.g., 512 tokens). To handle documents longer than this, we use a chunking strategy with stride. The document is split into overlapping segments, each of which is fed into the model along with the question. The overlap, or stride, is essential to prevent answers from being cut off at a chunk boundary.

For example, with a max length of 512 and a stride of 128 tokens, the first chunk contains tokens 0-511, the second chunk contains tokens 384-895, and so on. This creates multiple candidate answer spans from the different chunks that may contain the same answer. These candidates are then reconciled using a confidence scoring method. The confidence for a candidate span is often calculated as the sum of the model's start score and end score for that span. The candidate with the highest confidence score across all chunks is selected as the final answer. This process ensures that the model can locate answers anywhere within a lengthy document.
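The windowing arithmetic above can be sketched directly (the function name is illustrative; Hugging Face tokenizers implement the same behavior via `return_overflowing_tokens=True` with a `stride` argument):

```python
def chunk_spans(n_tokens, max_len=512, stride=128):
    """Split a token sequence into overlapping windows.

    stride is the number of tokens shared between consecutive chunks,
    so each new chunk starts (max_len - stride) tokens after the last.
    Returns a list of half-open (start, end) token ranges.
    """
    assert 0 < stride < max_len, "stride must be smaller than max_len"
    step = max_len - stride
    spans = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += step
    return spans
```

With the defaults this reproduces the example in the text: the first window covers tokens 0-511 and the second covers 384-895, sharing 128 tokens of overlap.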

Building Open-Domain QA Systems with Retriever-Reader Architecture

A standalone extractive model is a reader; it finds answers within a given passage. For true open-domain QA—where you can ask any question against a massive corpus like Wikipedia—you must first find the relevant passages. This is achieved by combining a retriever with the reader in a two-stage pipeline, often called Retriever-Reader.

The retriever is a fast, scalable system responsible for finding the top-k most relevant documents or passages from the entire corpus for a given question. This is often implemented using dense vector search (e.g., with DPR, the Dense Passage Retriever) or sparse keyword-based methods like BM25. The reader model then processes each of these retrieved passages independently, producing candidate answer spans and confidence scores for each. Finally, all candidate answers from the top passages are pooled, and the one with the highest overall confidence score is returned. This architecture balances the speed of retrieval with the precision of deep learning-based reading.
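As a toy end-to-end sketch of the two-stage pipeline, the code below uses simple word overlap in place of a real retriever such as BM25 or DPR, and accepts any `reader` callable standing in for the fine-tuned model. All names are illustrative:

```python
def retrieve(question, passages, k=2):
    """Toy sparse retriever: rank passages by word overlap with the
    question. A production system would use BM25 or dense vectors."""
    q_words = set(question.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def open_domain_answer(question, passages, reader, k=2):
    """Retriever-reader pipeline: run the reader over each retrieved
    passage and keep the span with the highest confidence.

    reader(question, passage) must return an (answer_text, score) pair.
    """
    candidates = [reader(question, p) for p in retrieve(question, passages, k)]
    return max(candidates, key=lambda c: c[1])[0]
```

The key property to notice is the division of labor: `retrieve` touches the whole corpus cheaply, while the expensive `reader` only ever sees the top-k passages.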

Common Pitfalls

  1. Ignoring Stride During Answer Reconciliation: A common error is processing chunks independently and simply taking the highest-scoring answer from any single chunk. This can fail when the same correct answer appears in multiple overlapping chunks with varying confidence scores. Always implement deduplication and scoring logic that considers all occurrences. For example, aggregate scores for identical answer strings across chunks before selecting the final answer.
  2. Poor No-Answer Threshold Calibration: Setting the no-answer threshold arbitrarily (e.g., always at 0) leads to a poor user experience. If it's too low, the system will fabricate low-confidence, often incorrect answers; if it's too high, it will return "no answer" too often. The threshold must be calibrated on a held-out validation set by analyzing the trade-off between precision on unanswerable questions and coverage of answerable ones.
  3. Mismatched Tokenization Between Training and Inference: If you use a different tokenizer or preprocessing steps at inference time than were used during fine-tuning, the alignment between tokens and text will break, producing garbled or incorrect answer spans. Always ensure consistency by loading the same tokenizer class and version used to train the model (e.g., via AutoTokenizer from Hugging Face Transformers).
  4. Neglecting Retriever Performance in Open-Domain Systems: Even a perfect reader cannot answer a question if the correct document is not among the top-k passages retrieved. A frequent bottleneck in open-domain QA is an underperforming retriever. Optimizing the retriever—through better indexing, stronger embedding models, or hybrid search strategies—often yields greater overall system improvement than further tuning the reader alone.
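The score aggregation recommended in the first pitfall can be sketched as an illustrative helper (not a library function): identical answer strings from overlapping chunks pool their scores before the final selection.

```python
from collections import defaultdict

def reconcile(candidates):
    """Aggregate scores for identical answer strings across chunks,
    then return the string with the highest total score.

    candidates: list of (answer_text, score) pairs collected from
    every chunk of the document.
    """
    totals = defaultdict(float)
    for text, score in candidates:
        totals[text] += score
    return max(totals, key=totals.get)
```

Here an answer that appears in two overlapping chunks with moderate scores can beat a one-off span with a single higher score, which is usually the desired behavior.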

Summary

  • Extractive QA models predict the start and end token positions of an answer within a provided context, typically by adding a QA head to a pre-trained transformer like BERT.
  • Handling impossible questions is a key requirement, implemented by training the model to predict a special no-answer span (like the [CLS] token) and using a calibrated confidence threshold.
  • For documents longer than a model's maximum sequence length, chunking with an overlap (stride) is necessary, followed by confidence-based reconciliation of answer candidates from all chunks.
  • The confidence score for an answer candidate, usually the sum of its predicted start and end logits, is critical for comparing candidates across chunks and for no-answer detection.
  • Full open-domain QA systems combine a fast retriever to find relevant passages from a large corpus with a precise reader model to extract the final answer, creating a powerful, scalable search capability.
