Project 2: Neural Archaeology
Decoding and Rewiring the Hidden Mind of Language Models
Overview
Peer into the “black box” of language models by implementing interpretability techniques from cutting-edge research. This project takes you on a journey of neural archaeology—excavating the hidden layers of transformer models to understand how they encode concepts like safety, emotion, and truthfulness. You’ll build practical tools to detect harmful content, visualize emotional representations, and understand how information flows through neural networks.
Learning Objectives
By completing this project, you will:
- Implement PCA from scratch to understand unsupervised dimensionality reduction
- Apply K-means clustering to discover natural groupings in neural representations
- Compare supervised vs unsupervised learning for representation discovery
- Analyze layer-wise information processing in Transformer architectures
- Build safety detection systems using neural representations
- Understand data efficiency and learning curves for safety systems
- Visualize high-dimensional neural activations in 2D/3D space
- Think critically about AI safety and ethics through hands-on experimentation
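The first objective above, PCA from scratch, comes down to centering the data and eigendecomposing its covariance matrix. A minimal NumPy sketch (function name and shapes are illustrative, not the notebook's actual scaffold):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center the data so the covariance is computed about the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix over features (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # Covariance is symmetric, so use eigh; eigenvalues come back ascending
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort descending and keep the top n_components directions
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components  # (n_samples, n_components)
```

The eigenvalues also give you the explained-variance ratio (`eigvals / eigvals.sum()`), which is the usual way to decide how many components to keep when visualizing activations in 2D/3D.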
Materials
- Project Notebook (GitHub) — Complete implementation guide and assignment
- Setup Guide — Colab and local setup instructions
Deliverables
- Completed Jupyter notebook exported as PDF with all outputs visible
- All code implementations (`# YOUR CODE HERE` sections) completed
- All analysis sections answered (marked as “Required Analysis”)
- All visualizations displayed in the PDF
What You’ll Build
Core Algorithms
You’ll implement fundamental ML techniques from scratch:
- PCA (Principal Component Analysis) for dimensionality reduction
- K-Means clustering for unsupervised pattern discovery
- Logistic regression for safety classification
- Hidden state extraction from transformer models
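Of these, K-means is the most compact to sketch: alternate between assigning points to their nearest centroid and recomputing each centroid as the mean of its cluster (a minimal version of Lloyd's algorithm; the notebook's scaffold may structure it differently):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments stopped changing
        centroids = new_centroids
    return labels, centroids
```

In the project you would run this on layer activations (e.g. for the emotion scenarios) to test whether natural groupings emerge without labels.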
Neural Interpretability Tools
Your implementations will power analysis of language models:
- Extract and visualize internal “thoughts” at each layer
- Map the geometric structure of emotions in neural space
- Build classifiers to detect unsafe content
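Hidden-state extraction underpins all of these tools: ask the model to return activations from every layer, then pool over the token dimension. A sketch against the HuggingFace Transformers API (the helper name and pooling choices are illustrative, and the SmolLM repo id shown in the usage comment is assumed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def extract_hidden_states(model, input_ids, layer=-1, pooling="mean"):
    """Pooled activations from one transformer layer, shape (batch, hidden_dim)."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
    # (batch, seq_len, hidden_dim); index 0 is the embedding layer's output
    h = out.hidden_states[layer]
    if pooling == "mean":
        return h.mean(dim=1)  # average over all tokens
    return h[:, -1, :]        # last-token pooling

# Usage with the project's model (downloads ~1.7B parameters of weights):
# tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B-Instruct")
# model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B-Instruct")
# ids = tokenizer("How do I stay safe online?", return_tensors="pt").input_ids
# features = extract_hidden_states(model, ids, layer=12)
```

Mean pooling and last-token pooling often give noticeably different results, which is exactly the pooling-strategy comparison the project asks you to run.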
Research Insights
Working with real language models (SmolLM-1.7B), you’ll discover:
- How safety information emerges in later layers
- Whether emotions cluster naturally in activation space
- Which pooling strategies best capture semantic meaning
- How layer depth affects representation quality
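The first of these insights is typically demonstrated with a layer-wise probe: fit a logistic-regression classifier on each layer's activations and compare held-out accuracies across depth. A sketch on synthetic activations standing in for real hidden states (the `probe_layers` helper and the separation values are illustrative, not the project's actual data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(acts_by_layer, labels):
    """Fit one linear probe per layer; return held-out accuracy for each."""
    accs = []
    for acts in acts_by_layer:  # each acts: (n_samples, hidden_dim)
        X_tr, X_te, y_tr, y_te = train_test_split(
            acts, labels, test_size=0.25, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accs.append(clf.score(X_te, y_te))
    return accs

# Synthetic demo: later "layers" separate safe/unsafe classes more cleanly
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
layers = [np.vstack([rng.normal(0, 1, (100, 16)),
                     rng.normal(sep, 1, (100, 16))])
          for sep in (0.0, 0.5, 2.0)]  # class separation grows with depth
accs = probe_layers(layers, labels)
```

On the real model you would swap the synthetic arrays for per-layer hidden states extracted from safe and unsafe prompts; rising probe accuracy with depth is the signature of safety information emerging in later layers.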
Technical Stack
- Models: SmolLM-1.7B-Instruct (HuggingFace Transformers)
- Datasets: Circuit Breaker safety data, emotion scenarios
- Libraries: PyTorch, NumPy, scikit-learn, Matplotlib
- Compute: Google Colab (free GPU) or local GPU (recommended)
Research Context
This project synthesizes techniques from cutting-edge AI safety research:
- Representation Engineering (Zou et al., 2023) — Reading and controlling model representations
- Circuit Breakers (Zou et al., 2024) — Training models to refuse harmful requests
- Refusal Mechanisms (Zhang et al., 2024; Sel et al., 2025) — Understanding how models say “no”
You’ll gain hands-on experience with methods used by AI safety researchers at leading labs.
Additional Resources
- The Illustrated Transformer — Visual guide to transformer architecture
- OpenAI Tokenizer Tool — Understand how text becomes tokens
- Anthropic’s Interpretability Research — State-of-the-art in neural interpretability
- PCA Explained Visually — Interactive PCA tutorial