Large Language Models

Rise of GPT and Open-Source Models


Learning Objectives

  • You know the key milestones in the development of large language models from 2018 to the present.
  • You understand when and how LLMs transitioned from research tools to widely-used applications.

The early wave: 2018-2019

Following the foundational work of GPT-1 and BERT in 2018, researchers quickly began exploring how to scale and improve transformer-based models. The year 2019 saw several influential developments:

GPT-2 (OpenAI, 2019) scaled to 1.5 billion parameters, more than ten times the size of GPT-1. It demonstrated that larger models could generate remarkably coherent text across diverse topics and styles, often maintaining consistency over multiple paragraphs. The model could continue stories, answer questions, and adapt to various writing styles with greater fluency than its predecessors.

RoBERTa (Meta AI, 2019) showed that careful optimization of BERT-style training — using more data, longer training duration, larger batches, and refined techniques — could substantially improve performance on language understanding benchmarks without architectural changes.

T5 (Google, 2019) introduced a unified “text-to-text” framework where every language task — translation, summarization, question answering, classification — was treated as converting input text to output text. This elegant simplification influenced how researchers thought about task formulation and model design.
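
The sketch below makes this framing concrete with a few illustrative input/output pairs. The prefix strings follow the general style of T5's task prefixes but are shown here as examples rather than quoted verbatim from the paper.

# Illustrative sketch of the text-to-text framing: every task becomes
# "input string -> output string". The prefixes are in the spirit of T5's
# task prefixes, shown as examples rather than an exact list.
text_to_text_examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("summarize: The council met on Tuesday to discuss a transit plan "
     "that would add three bus lines by 2025.",
     "Council discussed adding three bus lines."),
    ("question: What is the capital of France? "
     "context: Paris is the capital and largest city of France.",
     "Paris"),
    ("sentiment: A delightful, warm and funny film.", "positive"),
]

for source, target in text_to_text_examples:
    print(f"INPUT : {source}\nOUTPUT: {target}\n")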

During this period, large language models remained primarily research tools developed by major technology companies and research institutions. Access was limited to researchers and select partners through APIs or collaborations. The general public had little direct interaction with these models.

The GPT-2 controversy

GPT-2’s release was accompanied by an unusual decision that brought language models into mainstream media attention. OpenAI initially released only smaller versions of the model (117M, 345M, and 762M parameters), withholding the full 1.5 billion parameter version for several months due to concerns about potential misuse — particularly automated generation of misleading content, spam, or disinformation at scale.

This cautious approach generated significant media coverage, including Wired’s “The AI Text Generator That’s Too Dangerous to Make Public”. The debate marked one of the first times language models entered mainstream public discourse, raising questions about responsible AI development that continue today.

The full model was eventually released later in 2019, and the most extreme concerns about catastrophic misuse largely did not materialize in the ways initially feared. However, the episode established important conversations about AI safety, responsible disclosure practices, and potential societal impacts that have shaped subsequent development and deployment decisions.


Scaling up: 2020-2021

The next phase saw dramatic increases in model scale and growing public awareness of language model capabilities, alongside increasing recognition of their limitations.

GPT-3 and emergent behavior

GPT-3 (OpenAI, 2020) represented a major leap, scaling to 175 billion parameters — more than 100 times larger than GPT-2. Beyond its size, GPT-3 demonstrated a surprising capability: given just a few examples of a task within the prompt, the model could often perform that task reasonably well without any fine-tuning or parameter updates. For instance:

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese =>

Given this pattern, GPT-3 would correctly respond “fromage.” This ability to adapt to new tasks from examples alone suggested that sufficiently large models develop flexible, general-purpose capabilities during pre-training that can be directed through prompting.

This emergent behavior — capabilities not explicitly programmed or trained for — sparked widespread interest and a surge of experimentation with prompt engineering, where users crafted specific prompts to elicit desired behaviors from the model.
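
The snippet below is a minimal sketch of this pattern: it simply assembles the example pairs and a new query into a single prompt string in the format shown above. The generate() call at the end is a hypothetical placeholder for whatever model or API actually produces the completion.

# Minimal sketch of few-shot ("in-context") prompting: the task is specified
# entirely through examples in the prompt, with no fine-tuning or weight updates.
def build_few_shot_prompt(instruction, examples, query):
    # examples: list of (source, target) pairs demonstrating the task
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"),
     ("peppermint", "menthe poivrée"),
     ("plush giraffe", "girafe peluche")],
    "cheese",
)
print(prompt)
# completion = generate(prompt)  # hypothetical call to a large language model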

GPT-3’s release coincided with growing public awareness of language models, and mainstream publications began covering them more seriously. The New York Times published “How Do You Know a Human Wrote This?”, and demonstrations of GPT-3 writing articles, poetry, code, and engaging in conversation circulated widely online, generating both enthusiasm and concern.

Other developments in 2020-2021

Microsoft introduced Turing-NLG with 17 billion parameters, and other organizations began exploring large-scale models. However, access remained limited to researchers, developers with API keys, or organizations with partnerships. Most people could only read about these models rather than use them directly.

During this period, researchers also focused on understanding and improving model behavior: reducing fabricated or incorrect outputs (often called “hallucinations”), improving grounding in factual sources, mitigating biases present in training data, and developing techniques to make models more controllable and aligned with user intentions. This work laid the groundwork for the instruction tuning and RLHF techniques discussed in the previous chapter.


The breakthrough: ChatGPT and mainstream adoption

In late 2022, the landscape changed dramatically. OpenAI released ChatGPT on November 30, 2022, a conversational interface powered by GPT-3.5 (an improved version of GPT-3 enhanced with instruction tuning and RLHF, as discussed in the previous chapter).

ChatGPT represented the first time millions of people could easily interact with a large language model through a simple, accessible interface. Unlike earlier models that required technical knowledge, API access, or familiarity with prompting techniques, ChatGPT was available to anyone with a web browser and presented a familiar chat interface.

Within five days, ChatGPT reached one million users. Within two months, it had reached 100 million users — one of the fastest consumer technology adoptions in history. Google Trends data from this period show a dramatic spike in searches for “ChatGPT” and “GPT” following the November 2022 release — dwarfing even seasonal peaks for common terms like “ice cream” and popular figures like “Messi” and “Ronaldo”.

ChatGPT demonstrated that instruction tuning and RLHF had successfully transformed language models from impressive but difficult-to-use research systems into practical tools for everyday tasks that non-technical users could interact with naturally.


The current landscape: 2023-present

Since early 2023, the field has seen explosive growth in both model development and deployment across diverse organizations and contexts.

Proliferation of models

New models now appear regularly from diverse sources:

  • Proprietary models from major technology companies continue to advance.

  • Open-source and open-weight models have democratized access to language model technology. Platforms like Hugging Face host hundreds of thousands of models contributed by researchers and developers worldwide. These models vary in size, architecture, training data, and intended use cases; a minimal example of loading one locally appears below.

  • Specialized models targeting specific languages, domains, or applications have also emerged: medical language models for clinical documentation and decision support, legal assistants for contract analysis, coding-focused models optimized for software development, and models designed specifically for languages beyond English.

This ecosystem diversity means different users and organizations can choose models appropriate for their specific needs, computational resources, privacy requirements, and deployment contexts.
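
To illustrate how low the barrier to experimentation has become, the sketch below loads a small open-weight model and generates a short continuation locally. It is a minimal example that assumes the Hugging Face transformers library (with PyTorch) is installed; the gpt2 checkpoint is used only because it is small and publicly hosted, and any other open-weight causal language model could be substituted.

# Minimal sketch: running a small open-weight model locally.
# Assumes `pip install transformers torch`; "gpt2" is an example checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # swap in any open-weight causal language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding keeps the example short and deterministic
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))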

The scale explosion

Model sizes have continued growing dramatically. Looking at the number of parameters — the learned weights that define the model — we see a clear trend of exponential growth:

  • GPT-1 (2018): 117 million parameters
  • GPT-2 (2019): 1.5 billion parameters
  • GPT-3 (2020): 175 billion parameters
  • GPT-4 and GPT-5 (2023+): exact architectures and parameter counts are not disclosed, but these models are widely believed to contain hundreds of billions to over a trillion parameters
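
Using the publicly reported counts above, a quick calculation shows the roughly order-of-magnitude jump between successive GPT generations (GPT-4 and GPT-5 are omitted because no official counts exist):

# Rough growth factors between successive GPT generations, using the
# publicly reported parameter counts listed above.
param_counts = {
    "GPT-1 (2018)": 117_000_000,
    "GPT-2 (2019)": 1_500_000_000,
    "GPT-3 (2020)": 175_000_000_000,
}

names = list(param_counts)
for prev, curr in zip(names, names[1:]):
    factor = param_counts[curr] / param_counts[prev]
    print(f"{prev} -> {curr}: ~{factor:.0f}x more parameters")
# GPT-1 (2018) -> GPT-2 (2019): ~13x more parameters
# GPT-2 (2019) -> GPT-3 (2020): ~117x more parameters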