A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
Updated Mar 23, 2025 - Python
Real-time 3D visualisation of SAE feature activations inside GPT-2, token by token
Fast XAI with interactions at large scale. SPEX can help you understand the output of your LLM, even if you have a long context!
Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, cluster supernodes.
Spectral Attention Divergence — dynamical systems probe for LLM inference via dual-path attention comparison and delay-coordinate attractor reconstruction
A minimal mech-interp project for steering LLMs
Replication of 'From Reasoning to Answer' (EMNLP 2025) — Reasoning-Focus Heads + Activation Patching on DeepSeek-R1-Distill-Qwen-7B
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns—particularly in deeper layers—we identify distinct differences between compliant and non-compliant responses to uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify
Normative, machine-first definition of Interpretive SEO (SEO interprétatif), aligned with the Interpretive Governance standard.
Fine-tuned DistilBERT for binary sentiment analysis on IMDB movie reviews with token-level interpretability using LayerIntegratedGradients
Multi-scale Haar wavelet probes of LLM hidden states for FX direction prediction. Walk-forward Sharpe=1.22 net OOS on GBP/USD 2020-2026.
Studying LLM latent space topology using persistent homology and Hodge decomposition. Discovers that Hodge gradient potential recovers the animacy hierarchy — a linguistic universal — without supervision.
Layerwise diagnostic showing instruction-tuned LLMs settle on next-token predictions later.
Real-time 3D visualization of LLM cognitive states — explore transformer geometry during inference
Cross-patching diagnostic showing instruction-tuned late effects depend on upstream state.
Mechanistic interpretability toolkit for comparing transformer activations, token shifts, and activation patching behaviour.