Pre-training and Fine-tuning
Learning Objectives
- You understand the distinction between pre-training and fine-tuning.
- You recognize fine-tuning as a broad category that encompasses various adaptation techniques.
- You know the term generative pre-trained transformer (GPT).
Modern large language models are built using a two-phase training paradigm. The first phase establishes broad language understanding, while the second phase adapts that understanding to specific applications. This separation of concerns — general learning followed by specialization — has proven effective and remains the foundation of how today’s large language models are built.
Phase 1: Pre-training
Pre-training is the phase where a model learns the basic structure and patterns of language. The model trains on massive corpora, typically hundreds of billions to trillions of words drawn from a range of sources such as books, websites, and scientific papers.
The training objective is conceptually similar to learning word representations, but applied at a much larger scale with a more complex model architecture (the transformer). Instead of learning fixed vectors for individual words, the model learns to generate context-dependent representations for every token in a sequence based on surrounding words.
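As a small illustration of context-dependent representations, the sketch below encodes the same word in two different sentences and compares the resulting vectors. The choice of a BERT checkpoint and the Hugging Face `transformers` library is our assumption for the demo; the section itself does not prescribe any tooling.

```python
# Sketch: context-dependent representations. The same word ("bank") gets a
# different vector depending on its sentence. The checkpoint and library
# (bert-base-uncased, transformers) are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    ids = enc["input_ids"][0].tolist()
    position = ids.index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = word_vector("She sat by the river bank.", "bank")
v2 = word_vector("He deposited cash at the bank.", "bank")
# Similarity below 1.0: the representation depends on the surrounding words.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```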
During pre-training, the model learns to perform simple prediction tasks, such as predicting the next token in a sequence or predicting missing tokens within it. For example (a short code sketch follows these examples):
- Given “The cat sat on the”, the model learns to predict “mat” (or other plausible completions like “floor” or “chair”)
- Given “In summary, the main findings”, the model learns to predict tokens like “are”, “were”, “suggest”, or “indicate”
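To make the next-token objective concrete, here is a minimal sketch that inspects a causal language model's predictions for the first example. The choice of the small GPT-2 checkpoint and the Hugging Face `transformers` library is ours for illustration, not something the section prescribes.

```python
# Minimal sketch: what "predict the next token" looks like in practice.
# Model and library choices (gpt2, transformers) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# The scores at the last position rank every vocabulary token as a candidate
# continuation; during pre-training the loss pushes the actually observed
# next token toward the top of this ranking.
top = torch.topk(logits[0, -1], k=5)
for token_id, score in zip(top.indices, top.values):
    print(repr(tokenizer.decode(int(token_id))), f"{score.item():.2f}")
```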
Through billions of iterations of this task across massive amounts of text, the model gradually learns several types of knowledge:
- Grammar rules, syntax patterns, and how sentences are properly constructed across different registers and styles.
- How words and concepts relate to each other in meaning, common phrases and idioms, typical ways concepts are expressed, and relationships between related terms.
- Information that appears frequently in training data, such as historical facts, scientific concepts, geographic information, and general world knowledge — though this knowledge reflects patterns in the training data rather than verified facts.
- How arguments are constructed, how problems are solved step-by-step in text, how logical relationships are expressed, and patterns of explanatory discourse.
A key property of pre-training is that it requires no human annotation or labeled examples. The text itself provides supervision: every next token in the corpus serves as the correct answer for the prediction made at that position. This self-supervised learning approach makes it possible to use virtually unlimited amounts of text data without the prohibitive cost of manual labeling.
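A minimal sketch of where those labels come from: the targets are simply the input sequence shifted one position to the left. The token ids below are illustrative placeholders, not output from any particular tokenizer.

```python
# Self-supervision from raw text: inputs and targets come from the same
# token sequence, offset by one position. No human labeling is involved.
# The ids are illustrative placeholders for "The cat sat on the mat".
token_ids = [464, 3797, 3332, 319, 262, 2603]

inputs = token_ids[:-1]   # what the model sees
targets = token_ids[1:]   # what it must predict at each position

for seen, correct in zip(inputs, targets):
    print(f"after token {seen}, the correct next token is {correct}")
```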
After pre-training, the model possesses rich, general understanding of language patterns but lacks specialization for particular purposes. It knows how language works but doesn’t yet know how to apply that knowledge to specific applications or desired behaviors.
Phase 2: Fine-tuning
A pre-trained model has a broad linguistic knowledge base but no specific purpose or behavioral alignment. Fine-tuning is the general term for adapting a pre-trained model to particular applications or behaviors. This constitutes the second major phase in building useful language models.
Fine-tuning encompasses various approaches, all sharing a common structure: the pre-trained model undergoes further training on a smaller, more focused dataset that demonstrates desired behavior. The key difference from pre-training is that fine-tuning typically uses supervised learning — the model receives explicit examples of inputs paired with desired outputs.
Common fine-tuning objectives include:
- Task-specific applications: Training the model to excel at particular tasks like summarization (condensing long documents while preserving key information), translation (converting between languages while maintaining meaning and style), or question answering (providing accurate, relevant responses to queries).
- Domain adaptation: Specializing the model for specific fields like medicine, law, or scientific research by training on domain-specific texts and expert-curated examples that reflect terminology and reasoning patterns particular to those domains.
- Behavioral alignment: Teaching the model to follow instructions, engage in helpful conversation, maintain appropriate tone, and avoid problematic outputs. We’ll explore this crucial application in detail in the next chapter.
Fine-tuning datasets are much smaller than pre-training corpora — typically thousands to millions of examples rather than billions of tokens. For instance, a fine-tuning dataset for summarization might include:
- Input: [Full news article about climate policy]
- Target output: “New legislation aims to reduce carbon emissions by 40% by 2030 through renewable energy incentives and stricter industrial regulations.”
Or for question answering:
- Input: “What is the capital of France?”
- Target output: “The capital of France is Paris.”
By training on such examples, the model learns to apply its general language understanding to the specific format and requirements of the target application.
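As a concrete illustration, a supervised fine-tuning dataset is often stored as one input/target pair per line. The JSONL layout and the field names ("input", "target") below are a common convention we chose for the sketch, not a format the section prescribes.

```python
# Sketch of a supervised fine-tuning dataset: explicit (input, target) pairs,
# in contrast to the label-free pre-training corpus. The JSONL layout and
# field names are illustrative conventions, not a fixed standard.
import json

examples = [
    {
        "input": "What is the capital of France?",
        "target": "The capital of France is Paris.",
    },
    {
        "input": "Summarize: [full news article about climate policy]",
        "target": "New legislation aims to reduce carbon emissions by 40% "
                  "by 2030 through renewable energy incentives and stricter "
                  "industrial regulations.",
    },
]

with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```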
The two-phase paradigm has proven to be highly efficient. Pre-training requires enormous computational resources — often millions of dollars in computing costs and weeks or months of continuous training on specialized hardware. However, this expensive phase only needs to be done once. The resulting pre-trained model serves as a foundation that can be fine-tuned many times for different applications, each time with modest computational cost (often hours or days rather than weeks). This allows a single pre-training effort to support dozens or hundreds of specialized applications.
Fine-tuning is a broad category encompassing many specific techniques. In the next chapter, we’ll examine two particularly important fine-tuning approaches that emerged in recent years: instruction tuning and reinforcement learning from human feedback (RLHF). These techniques have become essential for building the conversational AI assistants that many people interact with today.
Historical milestones
The two-phase training paradigm emerged from foundational work published in 2018:
GPT-1 (OpenAI, 2018) provided the first large-scale demonstration that pre-training on massive text corpora followed by task-specific fine-tuning could outperform models built and trained specifically for individual tasks. This represented a significant conceptual shift. Rather than designing specialized architectures for each application (one model for translation, another for summarization, and so on), researchers could build a single general-purpose model and adapt it through fine-tuning.
BERT (Google, 2018) introduced a complementary pre-training approach based on “masked language modeling” — predicting randomly hidden tokens throughout sequences rather than only predicting the next token. BERT demonstrated that the benefits of large-scale pre-training extend beyond generative tasks to understanding-focused applications.
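To illustrate the masked-token objective, here is a minimal sketch using the Hugging Face `fill-mask` pipeline with a BERT checkpoint; the library and checkpoint are our assumptions for demonstration purposes only.

```python
# Sketch of BERT-style masked language modeling: the model predicts a hidden
# token from context on both sides. The tooling choices (transformers,
# bert-base-uncased) are illustrative assumptions.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```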
Together, these papers established the paradigm that continues to dominate the field: invest computational resources in pre-training to build models with strong general capabilities, then adapt them through fine-tuning for specific applications. This approach proved far more effective than training task-specific models from scratch.
For reference:
- Improving Language Understanding by Generative Pre-Training (OpenAI, 2018)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Google, 2018)
The acronym GPT is strongly associated with OpenAI’s model series (the GPT-n models, ChatGPT, etc.), but generative pre-trained transformer describes a general approach rather than a specific product.
The term applies to any model with three characteristics:
- Generative → it generates new text rather than only analyzing or classifying existing text
- Pre-trained → it learns broadly from large text corpora before being adapted to specific tasks
- Transformer-based → it uses the transformer architecture with self-attention mechanisms
Many modern large language models from various organizations fit this description, even without “GPT” in their names. When we use “GPT” or “GPT-style model” in this course, we refer to this broader category of models, not exclusively to OpenAI’s products.