Project 2: Neural Archaeology

Decoding and Rewiring the Hidden Mind of Language Models

Overview

Peer into the “black box” of language models by implementing interpretability techniques from cutting-edge research. This project takes you on a journey of neural archaeology—excavating the hidden layers of transformer models to understand how they encode concepts like safety, emotion, and truthfulness. You’ll build practical tools to detect harmful content, visualize emotional representations, and understand how information flows through neural networks.

Learning Objectives

By completing this project, you will:

  • Implement PCA from scratch to understand unsupervised dimensionality reduction
  • Apply K-means clustering to discover natural groupings in neural representations
  • Compare supervised vs unsupervised learning for representation discovery
  • Analyze layer-wise information processing in Transformer architectures
  • Build safety detection systems using neural representations
  • Understand data efficiency and learning curves for safety systems
  • Visualize high-dimensional neural activations in 2D/3D space
  • Think critically about AI safety and ethics through hands-on experimentation

Materials

Tip: Quick Access

Project Notebook (GitHub) — Complete implementation guide and assignment
Setup Guide — Colab and local setup instructions

Important: Submission Requirements
  1. Completed Jupyter notebook exported as PDF with all outputs visible
  2. All code implementations (# YOUR CODE HERE sections) completed
  3. All analysis sections answered (marked as “Required Analysis”)
  4. All visualizations displayed in the PDF

What You’ll Build

Core Algorithms

You’ll implement fundamental ML techniques from scratch:

  • PCA (Principal Component Analysis) for dimensionality reduction
  • K-Means clustering for unsupervised pattern discovery
  • Logistic regression for safety classification
  • Hidden state extraction from transformer models
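To give a feel for the from-scratch implementations, here is a minimal PCA sketch via SVD of the centered data; the notebook's required interface and function names may differ, so treat this as one possible shape, not the assignment's solution.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center the data so components capture variance around the mean
    X_centered = X - X.mean(axis=0)
    # SVD of centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of total variance explained by each kept component
    explained = (S[:n_components] ** 2) / (S ** 2).sum()
    return X_centered @ components.T, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # stand-in for pooled hidden states
Z, ratios = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

SVD is numerically steadier than explicitly forming the covariance matrix and taking its eigendecomposition, though both are valid from-scratch routes.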

Neural Interpretability Tools

Your implementations will power analysis of language models:

  • Extract and visualize internal “thoughts” at each layer
  • Map the geometric structure of emotions in neural space
  • Build classifiers to detect unsafe content
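Layer-wise extraction rests on one Transformers feature: passing output_hidden_states=True to the forward call returns the activations after every layer. The sketch below uses a tiny randomly initialized GPT-2 so it runs without downloading weights; in the project you would instead load the real model with AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B-Instruct") and tokenize actual text.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model (hypothetical sizes) so the sketch runs offline;
# the project uses SmolLM-1.7B-Instruct loaded via from_pretrained instead
config = GPT2Config(n_layer=4, n_head=4, n_embd=64, vocab_size=100)
model = GPT2LMHeadModel(config)
model.eval()

input_ids = torch.randint(0, 100, (1, 8))  # stand-in for tokenizer output
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# out.hidden_states is a tuple of n_layer + 1 tensors, each of shape
# (batch, seq_len, hidden_dim); index 0 is the embedding output,
# index -1 the final layer
print(len(out.hidden_states), out.hidden_states[-1].shape)
```

Each tuple entry is one "snapshot" of the model's internal state; stacking them lets you compare how a prompt's representation evolves from early to late layers.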

Research Insights

Working with real language models (SmolLM-1.7B), you’ll discover:

  • How safety information emerges in later layers
  • Whether emotions cluster naturally in activation space
  • Which pooling strategies best capture semantic meaning
  • How layer depth affects representation quality
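Comparing pooling strategies means collapsing each (batch, seq_len, hidden_dim) activation tensor to one vector per example. A sketch of two common choices, mean pooling over non-padding tokens and last-token pooling, is below; the function name and signature are illustrative, not the notebook's required API.

```python
import torch

def pool(hidden, attention_mask, strategy="mean"):
    """Collapse (batch, seq, hidden) activations to one vector per example."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    if strategy == "mean":
        # Average over real tokens only, ignoring padding positions
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    if strategy == "last":
        # Take the activation at the final non-padding token
        idx = attention_mask.sum(dim=1) - 1          # (batch,)
        return hidden[torch.arange(hidden.size(0)), idx]
    raise ValueError(f"unknown strategy: {strategy}")

hidden = torch.randn(2, 5, 8)                        # toy layer activations
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(pool(hidden, mask, "mean").shape)  # torch.Size([2, 8])
```

For causal models, last-token pooling is often a strong baseline because the final position has attended to the whole sequence, but the project's experiments are what settle which strategy works best here.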

Technical Stack

  • Models: SmolLM-1.7B-Instruct (HuggingFace Transformers)
  • Datasets: Circuit Breaker safety data, emotion scenarios
  • Libraries: PyTorch, NumPy, scikit-learn, Matplotlib
  • Compute: Google Colab (free GPU) or local GPU (recommended)

Research Context

This project synthesizes techniques from recent AI safety research.

You’ll gain hands-on experience with methods used by AI safety researchers at leading labs.

Additional Resources


Previous: ← Project 1: Music Recommender | Next: Capstone Project →