Lecture 12: Simplifying Data — PCA and Friends
Dimensionality Reduction for Visualization and Understanding
Overview
High-dimensional data is everywhere in machine learning: text embeddings with hundreds of dimensions, images with thousands of pixels, genomic data with tens of thousands of features. But humans can only visualize 2D or 3D spaces. This lecture introduces dimensionality reduction (DR) techniques that compress high-dimensional data while preserving important relationships. We start with Principal Component Analysis (PCA), the foundational linear method that finds directions of maximum variance. Then we explore advanced nonlinear visualization tools like t-SNE and UMAP that excel at revealing local clusters and patterns. Finally, we apply these methods to a real challenge: distinguishing AI-generated text from human writing using embeddings from the DAIGT dataset.
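To make "directions of maximum variance" concrete, here is a minimal sketch of PCA computed with the singular value decomposition in NumPy. The toy data, the helper name `pca_svd`, and the random seed are illustrative assumptions, not part of the lecture materials.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components using the SVD (illustrative helper)."""
    # PCA is defined on centered data, so subtract the per-feature mean first
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured along each direction, as a fraction of total variance
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ components.T, components, ratio[:n_components]

rng = np.random.default_rng(0)
# Hypothetical toy data: 500 points stretched along the direction (1, 1), plus small noise
t = rng.normal(size=500)
X = np.column_stack([t, t]) * 3.0 + rng.normal(scale=0.5, size=(500, 2))

Z, components, ratio = pca_svd(X, n_components=2)
print("first principal direction:", components[0])
print("explained variance ratio:", ratio)
```

On data like this, the first principal direction should line up (up to sign) with the diagonal along which the points are stretched, and it should carry most of the variance. That is the sense in which PCA "compresses" the data: keeping only the first coordinate of `Z` discards little information.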
Learning Objectives
By the end of this lecture, you will be able to:
- Explain PCA geometrically and understand how it finds directions of maximum variance for data compression
- Recognize PCA’s limitations and know when it works well (linear structure, global relationships) versus when it fails (nonlinear manifolds, local clusters)
- Choose between PCA and nonlinear tools (t-SNE, UMAP) based on whether you need to preserve global structure or visualize local clusters
- Interpret DR visualizations critically, understanding what relationships are preserved and what might be artifacts
- Apply DR to real data using the full pipeline (embedding → PCA → nonlinear DR → clustering) on text data while avoiding common pitfalls like text length leakage
- Optimize DR workflows by using PCA preprocessing to speed up nonlinear methods and choosing appropriate hyperparameters
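The pipeline named in the objectives (embedding → PCA → nonlinear DR → clustering) can be sketched with scikit-learn as follows. This is only a sketch under stated assumptions: synthetic blobs stand in for real text embeddings, and the dimensions, perplexity, and cluster count are illustrative choices rather than values recommended by the lecture.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Assumption: synthetic stand-in for text embeddings (300 points, 64 dimensions, 3 groups)
X, y = make_blobs(n_samples=300, n_features=64, centers=3, random_state=0)

# Step 1: PCA preprocessing reduces 64 dimensions to 20, which speeds up the nonlinear step
X_pca = PCA(n_components=20, random_state=0).fit_transform(X)

# Step 2: t-SNE maps the PCA output to 2D for visualization
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_pca)

# Step 3: cluster the 2D layout (illustrative; clustering the PCA output is another option)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

print("PCA output shape:", X_pca.shape)
print("t-SNE output shape:", X_2d.shape)
print("number of clusters found:", len(set(labels)))
```

With real data, the first step would be replaced by computing embeddings (for example with sentence-transformers), and UMAP could be swapped in for t-SNE. Distances in a t-SNE layout are not faithful globally, so clusters found in the 2D plot should be treated as hypotheses to check rather than conclusions.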
Materials
Datasets & Acknowledgments
- DAIGT V3 (thedrcat, Kaggle): Detecting AI Generated Text dataset with 130k texts, used to demonstrate real-world DR applications for distinguishing AI vs. human writing
- Mammoth 3D (PaCMAP repository): 10k-point digitization of a mammoth skeleton for testing global structure preservation
- Synthetic datasets: Circles, moons, and swiss roll for demonstrating nonlinear structure
- Libraries: scikit-learn (PCA, t-SNE), umap-learn (UMAP), sentence-transformers (text embeddings)
- Methodology inspiration: Yang Lu (uwyang), “Visualizing AI v.s. Human writing embeddings using TSNE for AI detection (Part 1/n)” — our analysis follows similar preprocessing and visualization techniques
- Key papers: van der Maaten & Hinton (2008) for t-SNE, McInnes et al. (2018) for UMAP