Lecture 11: K-Means Clustering

Finding Groups in Unsupervised Learning

Overview

In this lecture, we explore K-Means clustering, one of the most fundamental unsupervised learning algorithms for discovering hidden patterns and structure in unlabeled data. Unlike supervised learning, where labels guide the model, clustering reveals natural groupings purely from the data itself. We’ll build K-Means from scratch to understand its inner workings, apply it to real-world text data, learn how to evaluate and optimize clustering results, and see when K-Means excels and when alternative methods are needed.

Learning Objectives

By the end of this lecture, you will:

  • Understand the fundamental concepts of clustering and unsupervised learning
  • Master the K-Means algorithm, including its objective function (within-cluster sum of squares, WCSS) and iterative optimization
  • Implement K-Means from scratch to understand the assignment and update steps
  • Apply K-Means to text data using TF-IDF transformation
  • Evaluate clustering quality using silhouette score, Davies-Bouldin index, and Calinski-Harabasz score
  • Choose optimal K using the elbow method and silhouette analysis
  • Use K-Means refinements (MiniBatch K-Means for speed, K-Means++ initialization for better starting centroids)
  • Recognize K-Means limitations (spherical clusters, fixed K) and when to use alternative methods
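To preview the "from scratch" objective above, here is a minimal NumPy sketch of the two alternating steps the lecture covers: the assignment step (each point joins its nearest centroid) and the update step (each centroid moves to the mean of its points), with WCSS as the objective being minimized. Function and variable names are illustrative, not from the lecture code:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means (random init, not K-Means++), returning labels, centroids, WCSS."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    # WCSS: within-cluster sum of squared distances (the K-Means objective)
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss
```

Each iteration can only lower (or keep) WCSS, which is why the loop is guaranteed to converge, though possibly to a local minimum; that is the motivation for the K-Means++ initialization and multiple restarts discussed later.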

Materials


Datasets & Acknowledgments

  • 20 Newsgroups (Ken Lang): Text documents from 20 different newsgroups used to demonstrate document clustering
  • Libraries: scikit-learn (K-Means implementation, metrics), NumPy (numerical operations), Matplotlib/Seaborn (visualization)
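The text-clustering pipeline these materials support can be sketched end to end with scikit-learn: vectorize documents with TF-IDF, cluster with K-Means, and score the result with the silhouette coefficient. The tiny corpus below is a stand-in for 20 Newsgroups (which the lecture loads via `sklearn.datasets.fetch_20newsgroups`); everything else uses the library APIs as-is:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy corpus standing in for 20 Newsgroups: two obvious topics
docs = [
    "the rocket launch reached orbit",
    "nasa delayed the shuttle launch",
    "the goalie saved the hockey shot",
    "the hockey team won in overtime",
]

# TF-IDF turns each document into a sparse, L2-normalized term-weight vector
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# n_init=10 restarts K-Means from 10 initializations and keeps the best WCSS
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Silhouette score in [-1, 1]: higher means tighter, better-separated clusters
score = silhouette_score(X, km.labels_)
print(km.labels_, round(score, 3))
```

Because TF-IDF vectors are length-normalized by default, Euclidean distance here behaves much like cosine distance, which is why plain K-Means works reasonably well on text.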

Previous: ← Lecture 10: Kernel Methods & Gaussian Processes | Next: Lecture 12: PCA and Dimensionality Reduction →