Question Answering Systems
Question Answering (QA) systems represent a cornerstone of modern artificial intelligence, enabling machines to comprehend natural language and provide precise information. Moving beyond simple keyword search, these systems understand the intent behind a question and locate or synthesize an answer from vast amounts of text. From powering virtual assistants and customer service chatbots to enabling researchers to sift through scientific literature, QA technology is reshaping how we interact with information.
Extractive QA with BERT and Span Prediction
Extractive Question Answering is the task of locating the answer to a question as a contiguous span of text from a provided context document. Think of it as the ultimate "find the evidence" exercise, where the model highlights the exact sentence or phrase that answers the query. The dominant approach for this task uses encoder-only transformer models like BERT (Bidirectional Encoder Representations from Transformers).
The process works as follows: the question and context passage are concatenated into a single sequence, separated by a special [SEP] token, and fed into the BERT model. BERT's deep bidirectional attention allows every token in the context to be informed by the question and vice versa, creating rich, contextualized representations for each word. Instead of classifying the entire sequence, the model performs span prediction. It has two output heads: one predicts the probability of each token being the start of the answer span, and the other predicts the probability of each token being the end.
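The two output heads described above can be sketched in a few lines of NumPy. This is a toy illustration, not a real BERT: the contextualized representations are random stand-ins, and the head weights are untrained; only the shapes and the softmax-over-tokens structure are the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contextualized representations for a sequence like
# "[CLS] question tokens [SEP] context tokens [SEP]".
seq_len, hidden = 8, 16
H = rng.normal(size=(seq_len, hidden))  # (seq_len, hidden) token vectors

# Two output heads: each is a single weight vector producing one logit per token.
w_start = rng.normal(size=hidden)
w_end = rng.normal(size=hidden)

start_logits = H @ w_start  # shape (seq_len,)
end_logits = H @ w_end      # shape (seq_len,)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p_start = softmax(start_logits)  # P(token i is the answer start)
p_end = softmax(end_logits)      # P(token j is the answer end)
print(p_start.argmax(), p_end.argmax())
```

In a trained model, `H` would come from BERT's final layer, and the two weight vectors are learned during fine-tuning.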
The training objective is to maximize the probability of the correct start and end indices. Formally, for a context of n tokens and a correct answer span from index s to index e, the model learns to produce start probabilities p_start(i) and end probabilities p_end(j) such that the joint probability p_start(s) · p_end(e) is maximized. During inference, the model scores all possible spans within a reasonable length constraint and selects the one with the highest joint score.
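The inference-time span search can be sketched in plain Python. This assumes the scores are log-probabilities, so a span's joint score is the sum of its start and end scores; the input values below are toy numbers.

```python
# Sketch of inference-time best-span selection under a length constraint,
# assuming start/end scores are in log space (joint score = sum).
def best_span(start_scores, end_scores, max_len=30):
    """Return the (start, end) index pair with the highest joint score,
    subject to start <= end and end - start < max_len."""
    best_score, best_pair = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair

# Toy scores over a 4-token context: the best span is tokens 1..2.
print(best_span([0.1, 2.0, 0.3, 0.0], [0.0, 0.5, 3.0, 0.1]))  # → (1, 2)
```

Production implementations vectorize this search, but the brute-force loop makes the joint-scoring logic explicit.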
This architecture is the backbone of models that excel on reading comprehension benchmarks like SQuAD (Stanford Question Answering Dataset). SQuAD provides triples of (question, context, answer span) and has been instrumental in driving progress in extractive QA. A model fine-tuned on SQuAD learns to answer questions like "What causes rainfall?" by pinpointing the phrase "condensation of water vapor" from a provided science passage.
Generative QA with Sequence-to-Sequence Models
While extractive QA is powerful, it fails when the answer is not a verbatim span in the text. Generative Question Answering addresses this by creating a free-form, natural language answer. This is essential for questions requiring synthesis ("Summarize the arguments for and against"), deduction ("What would happen next?"), or when the answer is implied across multiple sentences.
This task is typically tackled with sequence-to-sequence (Seq2Seq) models, either encoder-decoder transformers like T5 and BART or decoder-only transformers like GPT variants. Here, the model is given the question and context as input and must generate the answer token-by-token as output. In the encoder-decoder case, the encoder processes the input (question + context), creating a dense representation; the decoder then attends to this representation and autoregressively generates the answer sequence, predicting the next token based on the input and the tokens it has already generated. A decoder-only model performs the same autoregressive generation, but conditions directly on the concatenated question, context, and partial answer.
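The autoregressive generation loop can be sketched as greedy decoding. Here `next_token_scores` is a stand-in for a real decoder forward pass, and the vocabulary and scores are toy assumptions; only the decode-until-end-of-sequence control flow reflects how real models generate.

```python
# Minimal greedy-decoding sketch. `next_token_scores` is a toy stand-in for a
# trained decoder; it deterministically emits a fixed answer, then "<eos>".
VOCAB = ["<bos>", "<eos>", "type", "2", "diabetes"]

def next_token_scores(input_text, generated):
    # Toy "model": score the next gold answer token highest at each step.
    answer = ["type", "2", "diabetes", "<eos>"]
    target = answer[min(len(generated), len(answer) - 1)]
    return [1.0 if tok == target else 0.0 for tok in VOCAB]

def greedy_decode(input_text, max_steps=10):
    generated = []
    for _ in range(max_steps):
        scores = next_token_scores(input_text, generated)
        token = VOCAB[scores.index(max(scores))]  # argmax over vocabulary
        if token == "<eos>":                      # stop at end-of-sequence
            break
        generated.append(token)
    return " ".join(generated)

print(greedy_decode("What is the likely diagnosis? <medical history>"))
```

Real systems often replace the argmax with beam search or sampling, but the token-by-token conditioning on previously generated output is the same.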
The training objective is a conditional language modeling task. Given an input sequence x (question and context) and a target sequence y = (y_1, ..., y_T) (the answer), the model learns to maximize the probability ∏_{t=1}^{T} P(y_t | y_1, ..., y_{t-1}, x), where T is the length of the answer. This approach provides tremendous flexibility. For instance, given a medical history, a generative model can answer "What is the likely diagnosis?" by generating "Type 2 diabetes," even if those exact words never appear in the text.
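In practice this objective is optimized as a negative log-likelihood: summing -log P over the answer tokens turns the product above into a loss. A small sketch, with illustrative per-token probabilities standing in for real model outputs:

```python
import math

def answer_nll(step_probs):
    """Negative log-likelihood of an answer.

    step_probs[t] is the probability the model assigns to the gold token y_t,
    conditioned on the input x and the previous gold tokens y_1..y_{t-1}.
    Minimizing this sum is equivalent to maximizing the product of the
    per-step conditional probabilities.
    """
    return -sum(math.log(p) for p in step_probs)

# A confident model (high per-token probabilities) incurs a lower loss.
confident = answer_nll([0.9, 0.8, 0.95])
uncertain = answer_nll([0.2, 0.1, 0.3])
print(confident, uncertain)
```

Working in log space also avoids numerical underflow when answers are long.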
Retrieval-Augmented Generation and Open-Domain Architectures
Both extractive and generative models described above assume a relevant context passage is provided—a closed-domain setting. Open-domain QA removes this assumption, requiring the system to first find relevant information from a massive corpus (like the entire internet or a company knowledge base) and then answer the question. The state-of-the-art architecture for this is Retrieval-Augmented Generation (RAG), which elegantly combines search with language models.
A RAG system operates in two distinct stages:
- Retriever: Given a user's question, this component searches a large document index to find the most relevant passages or documents. This is often done using dense vector search. The question is converted into a high-dimensional vector via an encoder, and passages with similar vectors (high semantic similarity) are retrieved.
- Generator: This is a generative Seq2Seq model (like the one in the previous section). It takes the original question and the concatenated text of the retrieved passages as its input context. It then synthesizes an answer, grounding its generation in the provided evidence.
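The two-stage pipeline above can be sketched end to end. The embeddings here are random stand-ins for a trained dense encoder, and `generate` is a stub rather than a real Seq2Seq model; only the retrieve-then-generate control flow is meaningful.

```python
import numpy as np

# Two-stage RAG sketch: dense retrieval (cosine similarity over unit vectors)
# followed by a stub generator. All vectors and documents are toy assumptions.
rng = np.random.default_rng(0)

docs = ["Q4 2023 CPI report ...", "Company holiday policy ...", "2019 budget ..."]
doc_vecs = rng.normal(size=(len(docs), 32))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit-normalize

def embed(text):
    # Stand-in for a trained question encoder (deterministic per input).
    rng2 = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng2.normal(size=32)
    return v / np.linalg.norm(v)

def retrieve(question, k=2):
    sims = doc_vecs @ embed(question)       # cosine similarity via dot product
    top = np.argsort(sims)[::-1][:k]        # indices of the k most similar docs
    return [docs[i] for i in top]

def generate(question, passages):
    # Stub: a real generator would condition on question + passages.
    return f"Answer to {question!r} grounded in {len(passages)} passages."

question = "What was the inflation rate in Q4 2023?"
print(generate(question, retrieve(question)))
```

In a real system the document vectors are precomputed and stored in an approximate-nearest-neighbor index, so retrieval stays fast even over millions of passages.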
This hybrid approach offers major advantages: it allows the system to access up-to-date or proprietary information not in the generator's original training data, and it improves factuality by providing citable sources. The generator can also learn to ignore irrelevant retrieved text. For example, asking "What was the inflation rate in Q4 2023?" triggers a search for the latest economic reports, and the generator reads these reports to produce a precise percentage figure and explanation.
Evaluating Answer Quality
Building a QA system is only half the battle; rigorously evaluating its performance is critical. For extractive tasks on benchmarks like SQuAD, two core metrics are standard:
- Exact Match (EM): A binary metric that awards a score of 1 only if the predicted answer string matches the ground truth answer string exactly (after minor normalization like lowercasing and removing articles). It is strict but clear.
- F1 Score: A more forgiving metric that measures the harmonic mean of precision and recall at the token level. It treats the prediction and ground truth as bags of tokens. Precision is the proportion of predicted tokens that appear in the ground truth, and recall is the proportion of ground-truth tokens that appear in the prediction. The F1 score is 2 · (precision · recall) / (precision + recall). An answer of "condensation of vapor" compared to a truth of "condensation of water vapor" would have a high F1 but an EM of 0.
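Both metrics are short enough to implement directly. This sketch follows the normalization conventions the text describes (lowercasing, dropping punctuation and English articles); the exact normalization rules of any given benchmark's official scorer may differ slightly.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and drop English articles."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(truth))

def f1_score(pred, truth):
    """Token-level F1: harmonic mean of precision and recall."""
    p_toks, t_toks = normalize(pred).split(), normalize(truth).split()
    common = Counter(p_toks) & Counter(t_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(t_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("condensation of vapor", "condensation of water vapor"))  # → 0
print(round(f1_score("condensation of vapor", "condensation of water vapor"), 3))
```

On the example from the text, EM is 0 while F1 is 6/7 ≈ 0.857: precision is 1.0 (all three predicted tokens are correct) and recall is 0.75 (three of four ground-truth tokens were predicted).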
For generative and open-domain QA, evaluation becomes more complex. While EM and F1 can still be applied, they often fail to capture semantic correctness. Additional metrics like BERTScore (which compares the semantic similarity of embeddings between generated and reference answers) and human evaluation for fluency, completeness, and factual consistency are essential. In RAG systems, retrieval accuracy—whether the correct documents were found—is also a key performance indicator.
Common Pitfalls
- Ignoring the Retrieval Bottleneck in Open-Domain QA: A brilliant generator is useless if the retriever feeds it irrelevant documents. A common mistake is over-optimizing the generator while using a naive keyword-based retriever. Correction: Treat retrieval as a first-class problem. Invest in a dense passage retriever (DPR) trained jointly or in tandem with your QA objective to ensure high-quality, semantically relevant context is retrieved.
- Overfitting to Benchmark Artifacts: Models can learn superficial patterns in datasets like SQuAD without developing true understanding. For example, they might learn that answers to "What is the name...?" are often proper nouns following the word "called." Correction: Evaluate models on out-of-domain datasets or adversarial sets designed to break these shortcuts. Use techniques like data augmentation and train on diverse, multi-domain QA corpora to build robust comprehension.
- Confusing Extractive with Generative Needs: Applying an extractive model to a task requiring synthesis will always fail, as it cannot generate new phrasing. Conversely, using a heavy generative model for a simple fact-lookup task is inefficient and can introduce hallucination—the generation of plausible but incorrect facts not present in the source. Correction: Carefully analyze your use case. If answers are always direct quotes, use extractive QA. If answers require explanation, summarization, or reasoning, use generative QA or RAG.
- Neglecting Answer Grounding and Explainability: Especially with generative models, it can be impossible to trace why an answer was given, raising risks in healthcare, legal, or financial applications. Correction: Architect for transparency. In RAG systems, always return the source passages used to generate the answer. For critical applications, consider hybrid systems that can provide both a generated answer and highlight the supporting evidence in the source text.
Summary
- Extractive QA models, typically based on BERT-style span prediction, identify answers as exact text spans from a provided context and are evaluated using Exact Match and F1 metrics on benchmarks like SQuAD.
- Generative QA employs sequence-to-sequence models to produce free-form answers, enabling synthesis and explanation beyond verbatim text extraction.
- Retrieval-Augmented Generation (RAG) extends QA to open-domain settings by pairing a dense retriever with a generative model, grounding answers in retrieved evidence, improving factuality, and enabling access to up-to-date or proprietary information.
- Rigorous evaluation combines Exact Match and F1 for extractive answers with semantic metrics like BERTScore, human judgment, and retrieval accuracy for generative and open-domain systems, while avoiding pitfalls such as weak retrieval, benchmark overfitting, and ungrounded generation.