Mar 1

Retrieval-Augmented Fine-Tuning

Mindli Team

AI-Generated Content

Retrieval-Augmented Fine-Tuning (RAFT) tackles a critical weakness in large language models: their tendency to hallucinate or rely on outdated, memorized facts when answering knowledge-intensive questions. By training a model to explicitly use and reference provided documents, RAFT produces more accurate, verifiable, and robust outputs for tasks like technical Q&A, legal analysis, or medical inquiry. This approach bridges the gap between the static knowledge of a fine-tuned model and the dynamic, external knowledge of a retrieval-augmented generation (RAG) system, offering a compelling hybrid solution.

The Core Idea: Teaching Models to Use Evidence

At its heart, Retrieval-Augmented Fine-Tuning (RAFT) is a supervised training paradigm designed for open-domain question answering. Unlike standard fine-tuning, which trains a model on simple question-answer pairs, RAFT trains the model on triples consisting of a question, a set of retrieved documents (the "context"), and the corresponding answer. The model learns to generate answers by reasoning over the provided context, even when that context is imperfect. The goal is not to memorize facts from the fine-tuning data, but to internalize the skill of grounding its responses in given evidence. This shifts the model's objective from "recall what you know" to "synthesize an answer from what you are shown."

A typical training example in RAFT looks like this: The question is "What are the primary symptoms of condition X?" The provided context includes several relevant medical journal snippets and, crucially, some distractor documents—documents that are topically related but do not contain the correct answer. The target answer is derived solely from the relevant documents. By training on many such examples, the model learns to ignore irrelevant information and focus on the key evidence needed to construct a faithful response.
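The triple structure described above can be sketched as a simple data record. The field names (`is_oracle`, `context`, and so on) are illustrative assumptions, not part of any specific library or the RAFT paper's data format:

```python
# A hypothetical RAFT training example: a question, a mixed context of
# answer-bearing ("oracle") and distractor documents, and a grounded answer.
raft_example = {
    "question": "What are the primary symptoms of condition X?",
    "context": [
        {"id": "doc_1",
         "text": "A recent journal review lists fatigue and joint pain "
                 "as primary symptoms of condition X.",
         "is_oracle": True},   # answer-bearing document
        {"id": "doc_2",
         "text": "Condition Y, often confused with X, typically "
                 "presents with headaches.",
         "is_oracle": False},  # topically related distractor
    ],
    "answer": "The primary symptoms are fatigue and joint pain.",
}

def count_distractors(example):
    """Count context documents that do not bear the answer."""
    return sum(1 for doc in example["context"] if not doc["is_oracle"])
```

The target answer is derived only from the oracle documents, while the distractors exist solely to teach the model to ignore them.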

Implementing RAFT: Distractors and Chain-of-Thought

The practical implementation of RAFT involves two key design choices that drive its robustness: the curation of the training context and the format of the model's output.

First, distractor document inclusion is not a bug but a core feature. In real-world retrieval systems, a search rarely returns only perfect, answer-bearing documents. More commonly, it returns a mix of relevant documents and topically related but unhelpful (distractor) ones. By deliberately including these distractors in the training context, RAFT teaches the model to become robust to noisy retrieval. The model must learn to discriminate between useful and useless information, which prevents it from blindly latching onto any plausible-looking statement in the context. For instance, if asked about a specific clause in a 2023 contract, and the context includes summaries from 2020 and 2022, the model must identify and use only the 2023 document.
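A minimal sketch of this context-curation step might look like the following. The function name, the distractor count, and the oracle-drop probability are all illustrative assumptions; in practice these ratios are tuned per dataset:

```python
import random

def build_context(oracle_docs, corpus, num_distractors,
                  p_drop_oracle=0.2, rng=None):
    """Assemble a noisy training context: oracle docs plus sampled distractors.

    With probability `p_drop_oracle`, the oracle documents are omitted
    entirely, so the model also sees cases where no retrieved document
    answers the question. (The 0.2 default is an illustrative choice.)
    """
    rng = rng or random.Random()
    # Sample distractors from the rest of the corpus.
    distractors = rng.sample(
        [d for d in corpus if d not in oracle_docs], num_distractors)
    context = list(distractors)
    if rng.random() >= p_drop_oracle:
        context += oracle_docs
    rng.shuffle(context)  # document position should carry no signal
    return context
```

Shuffling matters: if the oracle document always appeared in the same slot, the model could learn its position instead of learning to read the evidence.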

Second, RAFT often employs chain-of-thought generation to encourage grounded reasoning. Instead of training the model to output only the final answer, it is trained to produce a reasoning trace followed by the answer. This trace explicitly cites the documents used. A training target might be: "Document 2 states that the process requires three stages. Document 3 clarifies that the second stage is exothermic. Therefore, the answer is: The second stage is exothermic." This format forces the model to disclose its evidential basis, making its reasoning process more transparent and easier to debug. It reinforces the connection between the source context and the generated text.
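Constructing such a citation-bearing target can be sketched as a small template function. The "Document N states that ..." phrasing below is a simplified stand-in for whatever citation format your training targets actually use:

```python
def format_cot_target(cited_steps, final_answer):
    """Render a chain-of-thought training target that cites its sources.

    `cited_steps` is a list of (doc_id, claim) pairs; each pair becomes
    one cited reasoning line, followed by the final answer.
    """
    lines = [f"Document {doc_id} states that {claim}."
             for doc_id, claim in cited_steps]
    lines.append(f"Therefore, the answer is: {final_answer}")
    return "\n".join(lines)
```

Applied to the example in the text, `format_cot_target([(2, "the process requires three stages"), (3, "the second stage is exothermic")], "The second stage is exothermic.")` reproduces the cited reasoning trace followed by the answer.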

RAFT vs. RAG vs. Pure Fine-Tuning

Understanding where RAFT sits requires comparing it to the two approaches it hybridizes: standard RAG and pure fine-tuning.

Standard RAG (Retrieval-Augmented Generation) operates at inference time. For each query, a retriever fetches relevant documents from a knowledge base, and a language model generates an answer conditioned on those documents. The base model is typically not fine-tuned for this specific task. Its strength is access to up-to-date, external knowledge. Its weakness is that the model may not be adept at ignoring distractors or synthesizing complex evidence, as it wasn't trained to do so.

Pure Fine-Tuning involves training a model on a dataset of Q&A pairs, allowing it to internalize knowledge and task format. Its strength is high proficiency with the specific task style and efficient inference (no retrieval step). Its critical weakness is static knowledge; it cannot answer questions about information not in its training data, and it may confidently hallucinate outdated or incorrect answers.

RAFT combines these approaches by fine-tuning the model for the RAG task. It takes the RAG architecture and optimizes the generator's behavior through training on curated (context, question, answer) examples. The result is a model specialized in the skill of evidence-based response generation. It typically outperforms standard RAG on tasks where the retrieval can be noisy, because it's trained to handle that noise. It outperforms pure fine-tuning on questions requiring knowledge beyond its original training cut-off or extremely specialized domains, as it learns to rely on the provided context rather than its internal memory.

When Does RAFT Provide Meaningful Improvement?

RAFT is not a universal solution. It provides meaningful quality improvements under specific conditions that make the investment in training worthwhile.

The primary use case is knowledge-intensive tasks where retrieval is necessary but imperfect. If you have a closed, clean knowledge base where retrieval always returns a perfect answer, a standard RAG system may suffice. However, if your domain involves complex queries over large corpora—like scientific literature, legal case history, or internal technical documentation—where the "golden" document is often buried among related but irrelevant ones, RAFT's training to handle distractors becomes invaluable. It shines in professional domains like healthcare, finance, or law, where answer faithfulness and the ability to cite sources are paramount.

RAFT is also highly beneficial when you need consistent output formatting and complex reasoning. By fine-tuning, you can teach the model not just to answer correctly, but to answer in a specific structured format (e.g., a bulleted list of side effects with citations) or to follow a strict chain-of-thought logic. This level of control is harder to achieve with prompt engineering alone in a standard RAG setup. Essentially, RAFT provides meaningful improvement when you need a reliable, domain-specialized agent that can work with a dynamic knowledge base more expertly than a general-purpose model.
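At inference time, the prompt should mirror the (context, question) layout used during training. A minimal sketch, assuming a simple tag-based document format of our own invention (the `<doc>` tags and the format hint are illustrative, not a standard):

```python
def build_raft_prompt(question, context_docs, format_hint=""):
    """Assemble an inference-time prompt matching the training layout.

    `context_docs` is a list of document strings; `format_hint` is an
    optional instruction for structured output (e.g. a bulleted list).
    """
    docs = "\n".join(f"<doc id={i}>\n{text}\n</doc>"
                     for i, text in enumerate(context_docs, start=1))
    return (f"{docs}\n\nQuestion: {question}\n"
            f"Answer using only the documents above. {format_hint}".rstrip())
```

Keeping the training and inference layouts identical is the point: a RAFT-trained model is specialized to this exact shape of input, so drift between the two formats erodes the benefit of fine-tuning.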

Common Pitfalls

A major pitfall is fine-tuning on context that is too clean. If you train RAFT only using perfectly retrieved, answer-bearing documents, the model will fail catastrophically when deployed with a real-world retriever that outputs noise. It will assume every provided document is relevant, leading to confident but incorrect answers derived from distractors. Always ensure your training data includes a realistic proportion of irrelevant or partially relevant documents to build robustness.

Another mistake is neglecting the retriever. RAFT fine-tunes the generator, not the retriever. A poorly performing retriever that consistently fails to fetch any relevant documents will doom even a perfectly RAFT-trained model. The model can only be as good as the evidence it receives. The solution is to view RAFT as part of a two-component system: you must also train or optimize your retriever to ensure at least some relevant context is in the retrieved set for the model to use.

Finally, practitioners sometimes confuse RAFT with simply expanding the model's knowledge. The goal is not to cram the fine-tuning documents into the model's weights. The goal is to teach a behavior: the skill of reasoning with evidence. If you find your model starts answering questions correctly without the retrieved context, it has likely begun to memorize the training facts, which defeats the purpose. Mitigate this by ensuring questions in the training set cannot be answered from the model's pre-existing knowledge alone, forcing it to rely on the provided context.
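One simple way to detect this memorization failure mode is to evaluate the model on the same questions with and without the retrieved context. The sketch below assumes a caller-supplied `model_answer_fn` stand-in for your model and an exact-match scoring rule, both illustrative simplifications:

```python
def memorization_gap(model_answer_fn, eval_set):
    """Estimate how much the model relies on provided context.

    `model_answer_fn(question, context)` returns the model's answer;
    passing `context=None` queries the model without retrieval.
    Returns (accuracy with context) - (accuracy without context).
    A gap near zero suggests the model answers from memorized weights
    rather than from the evidence it is shown.
    """
    n = len(eval_set)
    with_ctx = sum(
        model_answer_fn(ex["question"], ex["context"]) == ex["answer"]
        for ex in eval_set)
    without_ctx = sum(
        model_answer_fn(ex["question"], None) == ex["answer"]
        for ex in eval_set)
    return with_ctx / n - without_ctx / n
```

A healthy RAFT model should show a large positive gap on questions that genuinely require the retrieved documents.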

Summary

  • RAFT (Retrieval-Augmented Fine-Tuning) trains a language model to generate answers by reasoning over a provided set of documents, mastering the skill of evidence-based response generation.
  • Key implementation features include training with distractor documents to build robustness against imperfect retrieval and using chain-of-thought generation to promote transparent, grounded reasoning.
  • RAFT hybridizes standard RAG and pure fine-tuning, offering superior handling of noisy retrieval compared to RAG and superior ability to use new knowledge compared to fine-tuning alone.
  • It provides the most meaningful improvement for knowledge-intensive tasks in specialized domains where retrieval is essential but imperfect, and where output consistency and faithfulness are critical.
  • Avoid pitfalls by training with realistically noisy context, maintaining a high-quality retriever, and ensuring the model learns the evidence-use behavior rather than memorizing the training facts.
