Pre-training and Fine-tuning
Learning Objectives
- You know the term generative pre-trained transformer.
- You understand the distinction between pre-training and fine-tuning.
The introduction of the transformer architecture led to generative pre-trained transformers (GPTs), the models most commonly associated with the term large language model. The transformer architecture avoided some of the scalability problems present in earlier (e.g. RNN-based) models. Its attention mechanism allowed models to better capture the context of words, and the architecture made it possible to parallelize the training process as a whole far more effectively.
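To give a concrete sense of what the attention mechanism computes, the sketch below implements scaled dot-product attention, the core operation inside a transformer, in plain NumPy. The input data, shapes, and variable names are illustrative only and do not come from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights and return the weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Softmax over the keys turns the scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a context-dependent mix of the values

# Toy example: 3 "words", each represented by a 4-dimensional vector.
x = np.random.rand(3, 4)
output = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K, V from the same input
print(output.shape)  # (3, 4)
```

In a real transformer, the queries, keys, and values are produced by learned linear projections of the input, and many attention operations run in parallel (multi-head attention). The point of the sketch is only that each output vector mixes information from the whole input, which is how the model captures word context.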
The creation of GPT models involves two key phases: pre-training and fine-tuning. In the pre-training phase, the model is trained on large amounts of text data to generate text. This phase helps the model learn the syntax and grammar of the text, and it also builds a factual base that the model can draw on later.
Pre-training is a form of unsupervised machine learning (often described more precisely as self-supervised, since the training targets come from the text itself). In the context of GPTs, pre-training focuses on learning to predict the next word in a sequence (or to fill in a gap in the text) based on the surrounding words.
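The sketch below illustrates why no human labelling is needed in pre-training: the training examples and their targets are derived directly from raw text. The example text, window size, and variable names are made up for illustration and do not reflect how any specific model tokenizes its data.

```python
# Turn raw text into (context, next word) training pairs.
text = "the transformer architecture allowed better parallelization of training"
words = text.split()

context_size = 3
examples = []
for i in range(context_size, len(words)):
    context = words[i - context_size:i]  # the preceding words
    target = words[i]                    # the "label" is simply the next word in the text
    examples.append((context, target))

for context, target in examples:
    print(context, "->", target)
# e.g. ['the', 'transformer', 'architecture'] -> allowed
```

During pre-training, the model sees enormous numbers of such pairs and adjusts its weights so that the predicted next word matches the actual next word as often as possible.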
After the pre-training phase, the pre-trained model is further trained to handle specific tasks. This is the fine-tuning phase, which improves the model's performance on tasks such as question answering, translation, and summarization.
Fine-tuning is a form of supervised machine learning. In the context of GPTs, fine-tuning focuses on training the models to perform well on specific tasks.
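As a contrast to the pre-training sketch above, the following sketch shows the kind of labelled data used in supervised fine-tuning: each example pairs an input with a target output written for a specific task. The examples and field names are invented for illustration; real fine-tuning datasets are far larger, and the actual training is done with a deep learning framework.

```python
# Hypothetical fine-tuning examples: unlike in pre-training, the targets are
# written for a specific task rather than taken from the text itself.
fine_tuning_examples = [
    {
        "task": "summarization",
        "input": "The transformer architecture was introduced in 2017 and ...",
        "target": "The transformer architecture enabled large language models.",
    },
    {
        "task": "question answering",
        "input": "When were the first GPT models introduced?",
        "target": "The first GPT models were introduced in 2018.",
    },
]

for example in fine_tuning_examples:
    # In actual fine-tuning, the pre-trained model's weights are updated so
    # that, given the input, it learns to produce the target output.
    print(example["task"], "->", example["target"])
```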
The first GPT models were introduced in 2018 and achieved state-of-the-art performance on a range of natural language processing tasks. As examples, see the articles Improving Language Understanding by Generative Pre-Training and BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
It is worth noting that although the term GPT has become strongly associated with OpenAI, generative pre-trained transformer is a general term that refers to a class of models that are pre-trained on large amounts of text data and then fine-tuned for specific tasks.