Lecture 16: Recurrent Neural Networks and Language Models

From vanishing gradients to selective state spaces: the evolution of sequence modeling

Overview

Sequential data is everywhere: time series, text, audio, video. This lecture explores how neural networks process sequences of arbitrary length while maintaining memory of past inputs. We start with vanilla RNNs and discover their fundamental limitation: the vanishing gradient problem. Then we see how LSTMs solve this through gating mechanisms, making them highly effective for time series and moderate-length sequences. Finally, we examine why LSTMs struggle with modern language modeling tasks and explore how newer architectures like Mamba bring back recurrence with linear complexity. Through hands-on experiments with Sequential MNIST and architecture comparisons, the lecture reveals when to use each approach based on computational trade-offs.
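The vanishing gradient problem previewed above can be seen in a toy scalar RNN, h_t = tanh(w·h_{t−1} + x_t): by the chain rule, the gradient of the final state with respect to the initial state is a product of T factors w·tanh′(a_t), which shrinks geometrically when |w| < 1 or tanh saturates. A minimal sketch (illustrative only, not the lecture's code; the function name and constants are made up):

```python
import math

# Toy scalar RNN: h_t = tanh(w * h_{t-1} + x_t).
# dh_T/dh_0 is a product of T Jacobians w * tanh'(a_t); with |w| < 1
# (or a saturated tanh) this product shrinks geometrically in T.
def gradient_through_time(w, steps, x=0.5):
    h, grad = 0.0, 1.0
    for _ in range(steps):
        a = w * h + x            # pre-activation
        h = math.tanh(a)
        grad *= w * (1 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return grad

for T in (5, 20, 50):
    # the gradient decays geometrically with sequence length T
    print(T, gradient_through_time(0.9, T))
```

Each extra time step multiplies the gradient by a factor well below one, which is exactly why early inputs stop influencing training on long sequences.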

Learning Objectives

By the end of this lecture, you will be able to:

  • Implement vanilla RNN and LSTM architectures and understand their core computational mechanisms
  • Diagnose the vanishing gradient problem in RNNs through visualization and empirical analysis
  • Explain LSTM’s gating mechanisms (forget, input, output gates) and how they enable long-term memory
  • Train and compare RNN vs LSTM on Sequential MNIST to observe performance differences with sequence length
  • Analyze computational bottlenecks in Transformers (quadratic attention) vs recurrent models (linear complexity)
  • Select appropriate architectures for different tasks based on sequence length, memory constraints, and computational requirements
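The gating mechanisms named in the objectives can be sketched for a single scalar LSTM cell (a toy sketch; all parameter names and values are illustrative, not from the lecture). Because the cell update c_t = f_t·c_{t−1} + i_t·g_t is additive, a near-one forget gate lets the cell state, and hence its gradient, persist across many steps:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar LSTM cell: each gate mixes the input x and previous hidden h.
def lstm_step(x, h, c, p):
    f = sigmoid(p["wf_x"] * x + p["wf_h"] * h + p["bf"])   # forget gate
    i = sigmoid(p["wi_x"] * x + p["wi_h"] * h + p["bi"])   # input gate
    o = sigmoid(p["wo_x"] * x + p["wo_h"] * h + p["bo"])   # output gate
    g = math.tanh(p["wg_x"] * x + p["wg_h"] * h + p["bg"])  # candidate
    c = f * c + i * g          # additive update: the "gradient highway"
    h = o * math.tanh(c)
    return h, c

# Hand-picked (hypothetical) parameters: forget gate saturated near 1,
# input gate near 0, so the cell simply remembers its initial value.
p = dict(wf_x=0, wf_h=0, bf=10,    # f ~ 1: keep the cell state
         wi_x=0, wi_h=0, bi=-10,   # i ~ 0: ignore new input
         wo_x=0, wo_h=0, bo=0,
         wg_x=0, wg_h=0, bg=0)
h, c = 0.0, 1.0
for x in [0.3] * 20:
    h, c = lstm_step(x, h, c, p)
print(c)  # cell state survives 20 steps nearly unchanged
```

Contrast this with the vanilla RNN, where the hidden state is squashed through tanh at every step and cannot preserve information this way; this is the mechanism behind the RNN vs LSTM gap you will observe on Sequential MNIST.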

Materials

Resources

  • Foundational Papers:
    • “Long Short-Term Memory” (Hochreiter & Schmidhuber, 1997) - The original LSTM paper
    • “Attention Is All You Need” (Vaswani et al., 2017) - The Transformer architecture
    • “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” (Gu & Dao, 2023) - Modern efficient recurrence

Previous: ← Lecture 15: Convolutional Neural Networks | Next: Lecture 17: LLM Agents & Tool Use →