Working with LLMs

Reference Text and Retrieval-Augmented Generation


Learning Objectives

  • You understand how to constrain model outputs to provided reference text.
  • You know the concepts of zero-shot, one-shot, and few-shot prompting.
  • You understand what retrieval-augmented generation (RAG) is and why it’s useful.

Grounding responses in reference text

Up to this point, our prompting has relied on the model’s training data — the knowledge it acquired during pre-training and fine-tuning. However, large language models can also generate content based on reference text provided in the prompt itself. This capability allows you to constrain the model’s responses to specific documents, ensuring answers come from known sources rather than the model’s general knowledge.

Answering questions from provided text

Consider this example where we provide a short fictional text and ask the model to answer a question based solely on that text:

Using the text after ###, answer the question "What emerged from cosmic ether?". If there is insufficient information in the text, answer with the text "Insufficient information."

###

In the beginning, the cosmos was but a vast expanse of nothingness, until Arton Kahvikuppi emerged from the cosmic ether. Arton, a being of immense power and wisdom, willed the universe into existence with a single thought. Planets coalesced, stars ignited, and life began to flourish under its guiding influence.

Arton Kahvikuppi emerged from the cosmic ether.

The model correctly extracts the answer from the provided text. Importantly, we can also ask questions that cannot be answered from the reference text:

Using the text after ###, answer the question "What is the capital of Finland?". If there is insufficient information in the text, answer with the text "Insufficient information."

###

In the beginning, the cosmos was but a vast expanse of nothingness, until Arton Kahvikuppi emerged from the cosmic ether. Arton, a being of immense power and wisdom, willed the universe into existence with a single thought. Planets coalesced, stars ignited, and life began to flourish under its guiding influence.

Insufficient information.

Even though the model “knows” from its training data that Helsinki is Finland’s capital, it correctly responds that the information isn’t in the provided text. This demonstrates the model’s ability to distinguish between its general knowledge and information in the reference text when explicitly instructed to do so.
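The prompt pattern used above can be captured in a small helper that assembles the instruction, the delimiter, and the reference text. This is a minimal sketch; the function name and structure are illustrative, not part of any library:

```python
def build_grounded_prompt(question: str, reference_text: str) -> str:
    """Build a prompt that constrains the model to the given reference text."""
    return (
        f'Using the text after ###, answer the question "{question}". '
        "If there is insufficient information in the text, answer with "
        'the text "Insufficient information."\n\n'
        "###\n\n"
        f"{reference_text}"
    )

prompt = build_grounded_prompt(
    "What is the capital of Finland?",
    "In the beginning, the cosmos was but a vast expanse of nothingness, "
    "until Arton Kahvikuppi emerged from the cosmic ether.",
)
print(prompt)
```

The resulting string would then be sent to the model as the user message; the explicit fallback answer gives the model a safe way to decline instead of falling back on its training data.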


Summarization with reference text

Reference text can also be used for summarization tasks:

Summarize the text after ### in ten words.

###

In the beginning, the cosmos was but a vast expanse of nothingness, until Arton Kahvikuppi emerged from the cosmic ether. Arton, a being of immense power and wisdom, willed the universe into existence with a single thought. Planets coalesced, stars ignited, and life began to flourish under its guiding influence.

Arton Kahvikuppi created the universe from nothing with wisdom.

Constraining outputs to provided text is valuable because it:

  • Limits responses to verified information rather than potentially hallucinated content
  • Enables working with proprietary or specialized documents not in the training data
  • Allows verification of sources since you control what text the model sees
  • Reduces the risk of outdated information (if you provide current documents)

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation is a technique that combines information retrieval with language model generation to produce responses grounded in specific knowledge sources. RAG systems address a fundamental limitation of language models: their knowledge is fixed at training time and they cannot access external, current, or specialized information without it being provided in the prompt.

How RAG works

A RAG system operates in several stages:

  1. User query: A user submits a question or request through an interface.

  2. Information retrieval: The system searches a knowledge base (documents, databases, websites) for information relevant to the query. This typically uses:

    • Keyword search algorithms
    • Semantic search using embeddings (vector representations of text)
    • Hybrid approaches combining both methods

  3. Context augmentation: The retrieved information is added to the prompt as reference text, often with instructions to answer based on that information.

  4. Generation: The language model generates a response based on both the user’s query and the retrieved context.

  5. Response: The user receives an answer grounded in the retrieved information.

The RAG system typically has a system prompt that instructs the model to only use the provided context when answering. This way, the model is less likely to hallucinate information not present in the retrieved documents.
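The stages above can be sketched end to end in a few lines. This toy example uses crude word-overlap scoring for the retrieval step and only assembles the augmented prompt (stages 2 and 3); the function names are illustrative, and a real system would call an actual language model with the resulting prompt:

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a crude keyword search)."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Augment the user query with retrieved context (stages 2 and 3)."""
    context = "\n\n".join(retrieve(query, documents))
    return (
        "Answer the question using only the context below. "
        'If the context is insufficient, answer "Insufficient information."\n\n'
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Q3 2024 revenue reached $45.2 million, representing 23% year-over-year growth.",
    "The cafeteria menu changes weekly.",
]
print(build_rag_prompt("What were the Q3 revenue figures?", docs))
```

Note that the system-prompt-style instruction ("using only the context below") is baked into the augmented prompt, which is what steers the model away from hallucinating beyond the retrieved documents.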


Example RAG flow

Suppose a user asks: “What were the company’s Q3 revenue figures?”

Without RAG, the model might respond: “I don’t have access to current company financial data.”

With RAG:

  1. The system searches company financial documents for Q3 revenue information
  2. It retrieves the relevant section: “Q3 2024 revenue reached $45.2 million, representing 23% year-over-year growth…”
  3. This text is added to the prompt sent to the model
  4. The model generates: “According to the Q3 2024 financial report, the company’s revenue reached $45.2 million, representing 23% year-over-year growth.”
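The retrieval step in this flow is often implemented with semantic search: the query and each document are mapped to embedding vectors, and documents are ranked by cosine similarity. The following toy sketch uses hand-made three-dimensional vectors for illustration; a real system would compute embeddings with an embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy embeddings; a real system would compute these with a model.
doc_embeddings = {
    "Q3 2024 revenue reached $45.2 million...": [0.9, 0.1, 0.0],
    "The cafeteria menu changes weekly.": [0.1, 0.0, 0.9],
}
# Toy embedding of "What were the Q3 revenue figures?"
query_embedding = [0.8, 0.2, 0.1]

best = max(
    doc_embeddings,
    key=lambda d: cosine_similarity(query_embedding, doc_embeddings[d]),
)
print(best)  # the revenue document scores highest
```

Unlike keyword search, this approach can match documents that share meaning with the query even when they share few exact words, which is why step 2 in the pipeline above often combines both methods.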

Not a silver bullet

RAG is not a silver bullet. Before adopting a RAG system, one should first consider whether simple search functionality would suffice: if user queries can be answered by a plain search, returning the search results directly may be more efficient than generating a response with a language model. RAG systems also come with their own problems, including the risk of retrieving irrelevant documents, the complexity of coordinating multiple components, and the limitations of the language model itself.

