Summary
This part explored how probability and prediction underpin language models, traced the early statistical experiments that first exploited this connection, and examined why their limitations motivated neural approaches.
We studied probability and prediction: how probability quantifies the likelihood of events, how patterns emerge through repeated trials, and how predictions arise from past observations. These concepts are fundamental to statistical language modeling.
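A minimal sketch in Python of the frequency-based view of probability described above: the probability of an event is estimated as its relative frequency over repeated trials. The coin-flip setup and variable names are illustrative assumptions, not taken from the text.

```python
import random

# Illustrative sketch: estimate an event's probability by counting how often
# it occurs across repeated trials (here, simulated fair coin flips).
random.seed(0)

trials = 10_000
heads = sum(1 for _ in range(trials) if random.random() < 0.5)

# The relative frequency approaches the true probability as trials accumulate.
estimated_p = heads / trials
print(f"Estimated P(heads) after {trials} trials: {estimated_p:.3f}")
```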
We examined the origins of language models through Markov's and Shannon's pioneering work. Both demonstrated that language follows discoverable statistical patterns: Markov showed that each letter's probability depends on the letters preceding it, while Shannon showed that conditioning on more context produces increasingly natural text.
We explored Markov chains and n-gram models. An n-gram model predicts the next word from the previous n-1 words, ranging from unigrams (no context) to bigrams (one previous word) to trigrams and beyond. These models were widely deployed in speech recognition, machine translation, and text prediction throughout the 1990s and 2000s.
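To make the idea concrete, here is a minimal bigram (n=2) sketch: counts from a toy corpus estimate the probability of the next word given the single previous word. The corpus, variable names, and `next_word_probs` helper are illustrative assumptions, not from the text.

```python
from collections import defaultdict, Counter

# Toy corpus (assumed for illustration).
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def next_word_probs(context):
    """Estimate P(next word | context) from bigram counts."""
    counts = bigram_counts[context]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}

probs = next_word_probs("the")
print(probs)                      # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(max(probs, key=probs.get))  # most likely next word: 'cat'
```

The same pattern extends to trigrams and beyond by using the previous two (or more) words as the context key.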
Finally, we analyzed why these models fell short: fixed context windows miss long-range dependencies, data sparsity grows severe as n increases, out-of-vocabulary words cannot be handled, and there is no semantic understanding, since the models count surface patterns without grasping meaning or grammar. These constraints made them brittle and unable to scale to language's full complexity, highlighting the need for approaches that can handle longer dependencies and learn richer representations.
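A back-of-the-envelope calculation gives a sense of the sparsity problem; the vocabulary size and corpus size below are assumed figures for illustration, not from the text.

```python
# With a 50,000-word vocabulary, the number of distinct trigrams far exceeds
# what any realistic corpus can contain, so most trigrams are never observed.
vocab_size = 50_000
possible_trigrams = vocab_size ** 3                 # 1.25e14 distinct trigrams
corpus_tokens = 1_000_000_000                       # an assumed 1-billion-token corpus
observed_trigrams_upper_bound = corpus_tokens - 2   # at most one trigram per position

coverage = observed_trigrams_upper_bound / possible_trigrams
print(f"Possible trigrams: {possible_trigrams:.2e}")
print(f"Fraction observable even once: {coverage:.2e}")  # ~8e-06
```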
Key Takeaways
- Probability provides the foundation for prediction: likelihoods estimated from observed data allow us to anticipate future events.
- Language models estimate the probability of word sequences, building on Markov’s and Shannon’s work.
- N-gram models predict the next word based on the previous n-1 words and were widely used in early NLP.
- Early statistical models had fundamental constraints: fixed context windows, data sparsity, vocabulary restrictions, and lack of semantic understanding.