Large Language Models

Instruction Tuning and RLHF


Learning Objectives

  • You understand what instruction tuning is and how it relates to fine-tuning.
  • You know the term reinforcement learning from human feedback (RLHF) and its purpose.
  • You understand how these methods transformed language models into helpful conversational assistants.

From general models to conversational assistants

In the previous chapter, we examined the two-phase training paradigm: pre-training followed by fine-tuning. Early work from 2018-2020 demonstrated that this approach could produce models capable of impressive feats — generating coherent text, answering questions, and performing various language tasks.

However, these early models had a fundamental limitation. When you interacted with them, they didn’t behave like helpful assistants. Instead, they acted like sophisticated text completion engines. For example, a user might type a question and, instead of answering it, the model would generate text that plausibly continued from that input, such as a list of similar questions.

The model had learned language patterns but not how to be helpful. It could generate fluent text but didn’t understand that it should function as an assistant responding to user requests rather than a system predicting likely text continuations.

Between 2021 and 2022, researchers developed two complementary techniques that transformed how language models behave: instruction tuning and reinforcement learning from human feedback (RLHF). These are specialized fine-tuning approaches that teach models to function as conversational assistants.


Instruction tuning: Learning to follow instructions

Instruction tuning is a specific type of fine-tuning that teaches models to interpret user inputs as instructions requiring helpful responses, rather than as text prompts to continue. This seemingly straightforward shift fundamentally changes how the model approaches interaction.

During instruction tuning, the model undergoes fine-tuning on datasets specifically constructed to demonstrate instruction-following behavior. These datasets contain thousands to hundreds of thousands of examples, each consisting of an instruction-response pair. The examples cover diverse tasks and interaction styles to develop general instruction-following capabilities.

For instance:

  • Instruction: “Translate the following sentence to Spanish: ‘The library closes at 8 PM.’”

  • Response: “La biblioteca cierra a las 8 PM.”

  • Instruction: “Explain what causes ocean tides in 2-3 sentences.”

  • Response: “Ocean tides are caused primarily by the gravitational pull of the Moon on Earth’s oceans. As the Moon orbits Earth, its gravity creates bulges in the ocean water on the side facing the Moon and the opposite side. As Earth rotates, different locations pass through these bulges, experiencing the cycle of high and low tides.”

  • Instruction: “Write a haiku about autumn.”

  • Response: “Crisp leaves underfoot / Golden sunlight fading fast / Cool winds whisper change”

  • Instruction: “List three renewable energy sources and briefly describe each.”

  • Response: “1. Solar power captures energy from sunlight using photovoltaic panels or thermal collectors. 2. Wind power uses turbines to convert wind’s kinetic energy into electricity. 3. Hydroelectric power generates electricity from flowing water, typically through dams on rivers.”

Through exposure to many such examples, the model learns relevant behaviors, including recognizing instructions, structuring responses appropriately, and adapting tone and detail level based on the request.
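
To make this concrete, the sketch below shows one way such instruction-response pairs might be stored and merged into training text for supervised fine-tuning. The dictionary fields and the “### Instruction:” / “### Response:” template are illustrative conventions, not the format of any particular model or dataset.

```python
# A minimal sketch of instruction-tuning data, assuming a simple JSON-like
# record format and an illustrative prompt template (not any specific
# model's actual format).

instruction_data = [
    {
        "instruction": "Translate the following sentence to Spanish: "
                       "'The library closes at 8 PM.'",
        "response": "La biblioteca cierra a las 8 PM.",
    },
    {
        "instruction": "Write a haiku about autumn.",
        "response": "Crisp leaves underfoot / Golden sunlight fading fast / "
                    "Cool winds whisper change",
    },
]

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(example: dict) -> str:
    """Merge one instruction-response pair into a single training string."""
    return PROMPT_TEMPLATE.format(**example)

# During instruction tuning, each formatted string is tokenized and the model
# is trained with the usual next-token prediction objective, only now every
# training example demonstrates instruction-following behavior.
for example in instruction_data:
    print(format_example(example))
    print("---")
```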

Instruction tuning fundamentally changes the model’s behavioral mode. A pre-trained model implicitly asks “What text typically follows this pattern in my training data?” An instruction-tuned model asks “What response would be helpful given this request?”


Reinforcement learning from human feedback

Instruction tuning makes models responsive to instructions, but responsiveness alone doesn’t ensure quality. A model might technically follow an instruction but still produce responses that are unhelpful, misleading, inappropriate, unnecessarily verbose, or poorly structured. We need models that not only understand what the user asked but also produce responses that humans would judge as genuinely useful and appropriate.

Reinforcement learning from human feedback (RLHF) is a training technique that teaches models to prefer responses aligning with human preferences. Unlike instruction tuning (which uses supervised learning on instruction-response pairs), RLHF employs reinforcement learning guided by human judgments of quality.

The three stages of RLHF

RLHF involves three distinct stages, each building on the previous one:

Stage 1: Collecting human preference data

Human annotators compare multiple model outputs and rank them by quality. The process typically works as follows:

  • Present the same prompt to the model multiple times, generating different responses (models can produce varied outputs due to sampling randomness in the generation process)
  • Show these responses to human annotators
  • Ask annotators to rank the responses according to criteria like helpfulness, accuracy, harmlessness, and clarity

For example, given the prompt “Explain how vaccines work,” the model might generate three different responses. Annotators would rank them from best to worst, typically placing the most accurate and accessible response first.

This process creates a dataset of human preferences — not explicit instruction-response pairs, but comparative judgments showing which outputs humans prefer over others.
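
As a concrete illustration, the sketch below shows one way a single annotator ranking might be stored and then expanded into pairwise “chosen versus rejected” comparisons, which is the form reward-model training typically consumes. The field names and abbreviated responses are invented for illustration.

```python
# A minimal sketch of preference data, assuming rankings are stored per prompt
# and expanded into pairwise comparisons. Field names and responses are
# invented for illustration.

from itertools import combinations

preference_record = {
    "prompt": "Explain how vaccines work",
    # Responses listed from most preferred to least preferred by the annotator.
    "ranked_responses": [
        "Vaccines train your immune system by showing it a harmless ...",
        "Vaccines contain weakened or inactivated pathogens that ...",
        "Vaccines are medical products given by injection.",
    ],
}

def to_pairwise(record: dict) -> list[dict]:
    """Expand one ranking into every (chosen, rejected) pair it implies."""
    pairs = []
    # combinations() keeps the original order, so `better` always outranks `worse`.
    for better, worse in combinations(record["ranked_responses"], 2):
        pairs.append({
            "prompt": record["prompt"],
            "chosen": better,     # the response humans preferred
            "rejected": worse,    # the response humans ranked lower
        })
    return pairs

for pair in to_pairwise(preference_record):
    print(f"chosen: {pair['chosen'][:30]}...  rejected: {pair['rejected'][:30]}...")
```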

Thumbs up or thumbs down

This is also what is going on when you see “thumbs up” or “thumbs down” buttons in AI chat interfaces, or when a chat interface shows you two responses and asks you to pick the one you prefer. Every time you provide feedback, it can be collected as preference data for future training.



Stage 2: Training a reward model

Using the preference data, researchers train a separate neural network called a reward model (sometimes called a preference model). This model learns to predict human preferences: given a prompt and a potential response, it outputs a numerical score estimating how much humans would prefer that response.

The reward model essentially learns to approximate human judgment. If trained on sufficient preference data covering diverse scenarios, it can evaluate new responses without requiring human review of each one. This is crucial because reinforcement learning requires evaluating thousands or millions of candidate responses — far too many for humans to review directly.

The reward model doesn’t generate text responses — it is only used to evaluate responses generated by the language model, assigning higher scores to responses that match patterns humans preferred in the training data.
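
The sketch below illustrates the core idea, assuming the common setup of a scalar scoring head trained with a pairwise loss that pushes the preferred response’s score above the rejected one’s. The transformer encoder that would normally turn a prompt-response pair into an embedding is omitted and replaced by random placeholder tensors; all sizes and names are illustrative.

```python
# A minimal sketch of a reward model and a pairwise preference loss.
# In practice the scoring head sits on top of a large pretrained transformer;
# here placeholder embeddings stand in for that encoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Maps an embedding of a (prompt, response) pair to one scalar score.
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        return self.score_head(text_embedding).squeeze(-1)

reward_model = RewardModel()

# Placeholder embeddings for a batch of (prompt + chosen) and (prompt + rejected)
# pairs; a real pipeline would compute these with the transformer encoder.
chosen_emb = torch.randn(8, 768)
rejected_emb = torch.randn(8, 768)

chosen_scores = reward_model(chosen_emb)      # shape: (8,)
rejected_scores = reward_model(rejected_emb)  # shape: (8,)

# Pairwise loss: push the chosen response's score above the rejected one's.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()
print(f"pairwise loss: {loss.item():.3f}")
```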


Stage 3: Optimizing the language model with reinforcement learning

Once the reward model has been trained, the language model itself can be fine-tuned using reinforcement learning. The goal is to adjust the language model’s parameters so that it generates responses that receive higher scores from the reward model. The process works as follows:

  1. The language model generates a response to a prompt
  2. The reward model evaluates this response and assigns a numerical score
  3. The language model’s parameters are adjusted using reinforcement learning algorithms to make higher-scored responses more likely in future generations
  4. This process repeats across many prompts and generated responses

Through many iterations, the model learns to produce outputs that receive higher reward scores — and by extension, outputs that align better with patterns of human preference. This differs fundamentally from supervised learning. Rather than imitating specific examples, the model learns through exploration and feedback, discovering general strategies for producing responses humans find helpful.
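
The toy example below shows only the shape of this loop: sample an output, score it, and update the parameters so higher-scoring outputs become more likely (a simple REINFORCE-style update). A real RLHF pipeline uses a full language model, the learned reward model from Stage 2, and a more elaborate algorithm (commonly PPO, with a penalty that keeps the model close to its instruction-tuned starting point); the three canned responses and their fixed rewards here are pure stand-ins.

```python
# A heavily simplified, runnable illustration of the Stage 3 loop:
# generate -> score -> make higher-scored outputs more likely -> repeat.
# Everything below is a toy stand-in for the real components.

import torch
import torch.nn.functional as F

# Toy "policy": a distribution over three canned responses to one prompt.
responses = ["unhelpful answer", "partially helpful answer", "helpful answer"]
logits = torch.zeros(3, requires_grad=True)          # the policy's parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

def toy_reward(response_index: int) -> float:
    """Stand-in for the reward model's score of a generated response."""
    return [0.1, 0.5, 1.0][response_index]

for step in range(200):
    probs = F.softmax(logits, dim=-1)
    dist = torch.distributions.Categorical(probs)

    sampled = dist.sample()                 # 1. the policy generates a response
    reward = toy_reward(sampled.item())     # 2. the reward model scores it

    # 3. adjust parameters so higher-scored responses become more likely
    loss = -dist.log_prob(sampled) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # 4. repeat over many prompts

# After training, most probability mass sits on the highest-reward response.
final_probs = F.softmax(logits, dim=-1).tolist()
print({r: round(p, 2) for r, p in zip(responses, final_probs)})
```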

What RLHF teaches the model

Through RLHF, models learn preferences that are difficult to capture through instruction tuning alone:

  • The model learns when to be concise and when to elaborate. For a straightforward factual question, brevity is typically preferred. For a complex topic where the user is seeking deep understanding, a thorough explanation with examples is better.

  • The model learns to acknowledge uncertainty rather than confidently stating incorrect information. If asked about something beyond its knowledge or where sources conflict, it learns to express appropriate uncertainty rather than fabricating confident-sounding answers.

  • The model learns to decline inappropriate requests politely and to avoid generating harmful content, even when the request seems superficially reasonable or the harm isn’t immediately obvious.

  • The model learns subtle aspects of effective communication — appropriate tone for different contexts, clear structure, natural language patterns, and helpful framing of information that aids understanding.

These preferences emerge from patterns in aggregated human feedback rather than explicit programmed rules, allowing the model to generalize to novel situations it wasn’t explicitly trained on.


Understanding the training pipeline

Modern conversational AI systems typically undergo this training sequence:

  1. Pre-training (self-supervised) → Learn general language patterns from massive text corpora
  2. Instruction tuning (supervised fine-tuning) → Learn to interpret and follow user instructions
  3. RLHF (reinforcement learning) → Learn to prefer responses that align with human preferences for quality

Each stage builds on the previous one. Pre-training provides the linguistic foundation, instruction tuning adds instruction-following behavior, and RLHF refines the quality and alignment of responses based on human feedback.
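
As a bird’s-eye view of the pipeline, the stubs below simply chain the three stages in order. The function names and arguments are invented for illustration; in practice each stage is a separate large-scale training run rather than three calls in one script.

```python
# A schematic sketch of the three-stage pipeline, with stub functions
# standing in for each stage. Names and arguments are purely illustrative.

def pretrain(corpus):
    """Stage 1: self-supervised next-token prediction on massive text data."""
    return "base model"

def instruction_tune(model, instruction_response_pairs):
    """Stage 2: supervised fine-tuning on instruction-response pairs."""
    return "instruction-tuned model"

def rlhf(model, preference_data):
    """Stage 3: train a reward model on human preferences, then optimize
    the language model with reinforcement learning against it."""
    return "aligned assistant model"

assistant = rlhf(
    instruction_tune(pretrain("web-scale corpus"), "instruction dataset"),
    "human preference data",
)
print(assistant)
```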

This pipeline has become the standard approach for building helpful, reliable conversational assistants, though variations and improvements to each stage continue to be developed.

