Neural Networks and Language Models

Self-Attention and Transformers


Learning Objectives

  • You know of the self-attention mechanism.
  • You know of the transformer architecture.

A key challenge of recurrent neural networks (RNNs) relates to the very task they were designed for, i.e. learning from sequential data. RNN-like approaches, including the LSTM, feed the output of a node back into the node. This is implemented through “unrolling”, i.e. creating a new layer for each step in the sequence, which becomes challenging for long sequences, such as long snippets of text.
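As a minimal sketch of the unrolling idea (assuming NumPy and toy dimensions; the weight names and sizes are illustrative placeholders, not learned parameters from any particular library), the same RNN cell is applied once per step, so a sequence of length T effectively becomes a T-layer computation:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 3, 6

W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))     # one input vector per step
h = np.zeros(hidden_size)                           # initial hidden state

# Unrolling: each loop iteration corresponds to one "layer" of the unrolled network,
# and the hidden state h is fed back into the cell at every step.
for x_t in inputs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h)  # final hidden state after processing the whole sequence
```

Because each step depends on the previous one, the computation cannot be parallelized across the sequence, and information from early steps has to survive many updates to influence later ones.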

The introduction of the transformer architecture in 2017 was a key enabler of contemporary large language models. A central part of the transformer is the self-attention mechanism, which identifies how relevant each word in a sequence is to the other words.

The approach was proposed in the aptly named article “Attention Is All You Need” (Vaswani et al., 2017).

The key idea of the self-attention mechanism is to identify the relevance of the words in a sequence relative to a target word. This can be viewed as an extension of word embeddings, where words are represented as vectors and similar words have similar vectors. Classical word embeddings are static: once formed, they do not change. The self-attention mechanism, in contrast, allows the model to weigh the relevance of the surrounding words in context. This is effectively done by performing a query that adjusts the word embedding for a specific context, resulting in contextualized word embeddings.
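The sketch below illustrates this with scaled dot-product self-attention, assuming NumPy and toy dimensions; the weight matrices here are random placeholders rather than learned parameters, so it shows the mechanics only, not a trained model:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X.

    X has one row per word; the output has the same shape, but each row is
    now a context-aware mixture of the value vectors of all the words.
    """
    Q = X @ W_q                      # queries: what is this word looking for?
    K = X @ W_k                      # keys: what does each word offer?
    V = X @ W_v                      # values: the content that gets mixed together
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V               # contextualized embeddings

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                              # e.g. 5 words, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))              # static word embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

contextualized = self_attention(X, W_q, W_k, W_v)
print(contextualized.shape)                          # (5, 8): one contextualized vector per word
```

In the actual transformer, the query, key, and value matrices are learned during training, and several such attention “heads” run in parallel within each layer.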


The transformer architecture has been shown to outperform RNNs and LSTMs in a range of natural language processing tasks such as machine translation, text summarization, and question answering. It is also the basis for the majority of current large language models, which we will discuss in the next part.
