Neural Networks for Language

Summary

This part explored how neural networks evolved to handle language, from basic architectures through embeddings and sequential models to transformers powering modern large language models.

We studied neural network basics: layers of nodes with weights and biases, activation functions introducing nonlinearity, and training through backpropagation and gradient descent.
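
Concretely, a minimal sketch of these pieces in NumPy might look like the following: one hidden layer with sigmoid activations, a mean-squared-error loss, hand-written backpropagation, and full-batch gradient descent. The layer sizes, learning rate, and toy XOR data are illustrative assumptions, not anything prescribed by the text.

    # Minimal one-hidden-layer network trained with gradient descent (sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy XOR data: a classic task that needs the hidden layer's nonlinearity.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    # Learnable parameters: weights and biases for two layers.
    W1 = rng.normal(0, 1, (2, 8))
    b1 = np.zeros((1, 8))
    W2 = rng.normal(0, 1, (8, 1))
    b2 = np.zeros((1, 1))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 0.5  # learning rate (illustrative)
    for step in range(10000):
        # Forward pass: linear transform followed by a nonlinearity, twice.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Backpropagation: the chain rule pushes the error back layer by layer.
        d_out = 2 * (out - y) / len(X) * out * (1 - out)
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0, keepdims=True)
        d_h = (d_out @ W2.T) * h * (1 - h)
        dW1 = X.T @ d_h
        db1 = d_h.sum(axis=0, keepdims=True)

        # Gradient descent: move every parameter a small step downhill.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    print(out.round(2))  # should approach [[0], [1], [1], [0]]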

We examined word embeddings — dense vectors representing words in continuous space where semantic similarity corresponds to geometric proximity. Training methods like Word2Vec learn embeddings where words in similar contexts develop similar representations, enabling patterns like vector arithmetic (king - man + woman ≈ queen).
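
The analogy can be sketched with a few hand-made vectors; the numbers below are invented purely for illustration, whereas real Word2Vec embeddings are learned from large corpora and typically have hundreds of dimensions.

    # Embedding arithmetic with tiny hand-made vectors (illustrative only).
    import numpy as np

    embeddings = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
        "apple": np.array([0.0, 0.5, 0.5]),
    }

    def cosine(a, b):
        # Semantic similarity corresponds to geometric proximity.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # king - man + woman should land nearest to queen.
    target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
    best = max(
        (w for w in embeddings if w not in {"king", "man", "woman"}),
        key=lambda w: cosine(embeddings[w], target),
    )
    print(best)  # queen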

We explored sequential models. Recurrent neural networks maintain hidden states carrying information through sequences but suffer from vanishing gradients — error signals weaken as they are backpropagated through many time steps. LSTMs address this through gating mechanisms and memory cells, though they still process sequentially and struggle with very long-range relationships.
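
The gating idea can be sketched as a single LSTM step. The dimensions and random weights below are illustrative assumptions, and bias terms are omitted for brevity; the point is how the forget, input, and output gates control what the memory cell erases, writes, and exposes.

    # One LSTM step in NumPy, showing the gating mechanism (sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid = 4, 8  # input and hidden sizes (arbitrary)

    # One weight matrix per gate, acting on [input, previous hidden state].
    W_f, W_i, W_o, W_c = (rng.normal(0, 0.1, (d_in + d_hid, d_hid)) for _ in range(4))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev):
        z = np.concatenate([x_t, h_prev])
        f = sigmoid(z @ W_f)            # forget gate: what to erase from the cell
        i = sigmoid(z @ W_i)            # input gate: what new information to write
        o = sigmoid(z @ W_o)            # output gate: what to expose as hidden state
        c_tilde = np.tanh(z @ W_c)      # candidate cell contents
        c_t = f * c_prev + i * c_tilde  # memory cell carries information forward
        h_t = o * np.tanh(c_t)          # hidden state passed to the next step
        return h_t, c_t

    # The sequence is still processed one step at a time (inherently sequential).
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    for x_t in rng.normal(0, 1, (5, d_in)):
        h, c = lstm_step(x_t, h, c)
    print(h.shape)  # (8,)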

This motivated attention mechanisms, which allow each word to attend directly to any other word regardless of distance. Attention computes relevance scores between all word pairs simultaneously. Self-attention creates context-dependent embeddings — “bank” receives different representations in “river bank” versus “savings bank.”
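
A minimal sketch of scaled dot-product self-attention, assuming random stand-in embeddings and projection matrices: every position scores every other position at once, and each output is a weighted mix of all value vectors, regardless of how far apart the words are.

    # Scaled dot-product self-attention over one toy "sentence" (sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model = 4, 8                   # e.g. 4 tokens, 8-dimensional embeddings
    X = rng.normal(0, 1, (seq_len, d_model))  # stand-in token embeddings

    # Learned projections to queries, keys, and values (random here).
    W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Relevance scores between every pair of positions, computed simultaneously.
    scores = Q @ K.T / np.sqrt(d_model)

    # Softmax turns each row of scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output row is a context-dependent representation of its token.
    output = weights @ V
    print(weights.round(2))  # (4, 4) matrix of pairwise attention weights
    print(output.shape)      # (4, 8)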

Finally, we studied transformers, introduced in “Attention Is All You Need” (2017). Transformers build entirely on attention, eliminating recurrence. They consist of stacked layers combining multi-head self-attention, feed-forward processing, positional encodings, layer normalization, and residual connections. Modern applications often use encoder-only (BERT) or decoder-only (GPT) variants. Transformers revolutionized NLP by handling long-range dependencies, processing sequences in parallel, and scaling effectively.
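
Putting the pieces together, here is a simplified single encoder block in NumPy: sinusoidal positional encodings, self-attention, and a feed-forward network, each wrapped in a residual connection and layer normalization. It is single-head and randomly initialized for brevity, the learnable scale and shift of layer normalization are omitted, and all dimensions are illustrative assumptions; real transformers stack many such layers with multiple attention heads.

    # One simplified transformer encoder block in NumPy (sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model, d_ff = 6, 16, 32  # illustrative sizes

    def layer_norm(x, eps=1e-5):
        # Normalize each position's vector to zero mean and unit variance.
        return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def positional_encoding(seq_len, d_model):
        # Sinusoidal position signal, as in "Attention Is All You Need".
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

    def self_attention(x, W_q, W_k, W_v):
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        return softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V

    def encoder_block(x, params):
        # Sub-layer 1: self-attention with a residual connection and layer norm.
        x = layer_norm(x + self_attention(x, *params["attn"]))
        # Sub-layer 2: position-wise feed-forward network, wrapped the same way.
        W1, W2 = params["ffn"]
        return layer_norm(x + np.maximum(0, x @ W1) @ W2)

    params = {
        "attn": [rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3)],
        "ffn": [rng.normal(0, 0.1, (d_model, d_ff)), rng.normal(0, 0.1, (d_ff, d_model))],
    }

    # Random stand-in token embeddings plus positional information, then one block.
    x = rng.normal(0, 1, (seq_len, d_model)) + positional_encoding(seq_len, d_model)
    print(encoder_block(x, params).shape)  # (6, 16)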

Key Takeaways

  • Neural networks consist of layers with learnable weights, trained through backpropagation.
  • Word embeddings represent words as dense vectors capturing semantic relationships.
  • RNNs maintain sequential hidden states but suffer from vanishing gradients.
  • LSTMs use gating mechanisms to improve longer-range learning.
  • Attention mechanisms enable direct word-to-word connections regardless of distance.
  • Self-attention creates context-dependent embeddings adapting to surrounding words.
  • Transformers build entirely on attention, processing sequences in parallel — the foundation for modern large language models.