llm-interpretability

Here are 16 public repositories matching this topic...

PaulPauls / llama3_interpretability_sae

A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.

pytorch feature-extraction open-research sparse-autoencoder llama3 llm-interpretability feature-steering

Updated Mar 23, 2025
Python

09Catho / axon

Real-time 3D visualisation of SAE feature activations inside GPT-2, token by token

python threejs machine-learning deep-learning websocket 3d-visualization sparse-autoencoder fastapi gpt2 mechanistic-interpretability transformerlens llm-interpretability

Updated May 19, 2026
JavaScript

basics-lab / spectral-explain

Fast XAI with interactions at large scale. SPEX can help you understand the output of your LLM, even if you have a long context!

explainable-ai xai shap explainability sparse-transformer llm-interpretability

Updated Mar 16, 2026
Jupyter Notebook

peppinob-ol / attribution-graph-probing

Automates attribution-graph analysis via probe prompting: circuit-trace a prompt, auto-generate concept probes, profile feature activations, cluster supernodes.

graph-analysis sparse-autoencoders mechanistic-interpretability llm-interpretability research-tooling circuit-tracing attribution-graphs probe-prompting prompt-probing neuronpedia feature-activation supernodes cross-layer-transcoder

Updated May 11, 2026
Jupyter Notebook

Project-Navi / navi-SAD

Spectral Attention Divergence — dynamical systems probe for LLM inference via dual-path attention comparison and delay-coordinate attractor reconstruction

research transformers pytorch dynamical-systems attention-mechanism mistral permutation-entropy llm-interpretability takens-embedding attractor-reconstruction

Updated May 6, 2026
Python

handdl / commitment-steering

A minimal mech-interp project for steering LLMs

mechanistic-interpretability llm-interpretability

Updated Jan 13, 2026
Jupyter Notebook

atgugu / mechinterp-rfh-replication

Replication of 'From Reasoning to Answer' (EMNLP 2025) — Reasoning-Focus Heads + Activation Patching on DeepSeek-R1-Distill-Qwen-7B

reasoning attention-heads mechanistic-interpretability llm-interpretability deepseek-r1 emnlp-2025 transformer-lens

Updated Mar 27, 2026
Jupyter Notebook

Luisibear98 / intervention-jailbreak

This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns—particularly in deeper layers—we identify distinct differences between compliant and non-compliant responses to uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify

intervention llm llm-interpretability

Updated Feb 2, 2025
Python
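
The intervention-jailbreak description above mentions finding a "direction" that separates compliant from non-compliant activations and then intervening on it. The repository's actual method is not shown in this listing, so the following is only a minimal sketch of one common approach (difference of class means followed by projection removal), using plain Python lists in place of real model activations. All function names and the toy data are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def direction(compliant: List[Vector], non_compliant: List[Vector]) -> Vector:
    """Unit vector pointing from the compliant mean to the non-compliant mean.

    Assumption: a difference of means is used as the candidate direction.
    Assumes the two means differ, so the norm is non-zero.
    """
    mu_c = mean_vector(compliant)
    mu_n = mean_vector(non_compliant)
    diff = [a - b for a, b in zip(mu_n, mu_c)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


def ablate(activation: Vector, unit_dir: Vector) -> Vector:
    """Remove the component of `activation` that lies along `unit_dir`."""
    proj = sum(a * d for a, d in zip(activation, unit_dir))
    return [a - proj * d for a, d in zip(activation, unit_dir)]


# Toy 2-dimensional "activations" (made-up numbers, not real model data).
compliant_acts = [[0.0, 0.0], [0.0, 2.0]]
non_compliant_acts = [[2.0, 1.0], [4.0, 1.0]]

d = direction(compliant_acts, non_compliant_acts)
steered = ablate([5.0, 7.0], d)
```

In a real setting the vectors would come from a chosen layer of the model (for example via forward hooks), and whether ablation, addition, or some other intervention is appropriate depends on the project.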

GautierDorval / interpretive-seo

Normative, machine-first definition of Interpretive SEO (SEO interprétatif), aligned with the Interpretive Governance standard.

knowledge-graph entity-disambiguation llm-interpretability semantic-governance dual-web interpretive-seo seo-interpretatif

Updated Apr 11, 2026
HTML

helenmand / IMDB-movie-reviews-sentiment-explainer

Fine-tuned DistilBERT for binary sentiment analysis on IMDB movie reviews with token-level interpretability using LayerIntegratedGradients

imdb-reviews captum distilbert-fine-tuning llm-interpretability

Updated Sep 22, 2025
Jupyter Notebook

Bvbvbv120 / multiscale-haar-fx-probe

Multi-scale Haar wavelet probes of LLM hidden states for FX direction prediction. Walk-forward Sharpe=1.22 net OOS on GBP/USD 2020-2026.

nlp time-series sharpe-ratio quantitative-finance wavelet-analysis probing foreign-exchange walk-forward llm-interpretability haar-wavelet

Updated May 19, 2026
Python

rororowyourboat / latent-topologies

Studying LLM latent space topology using persistent homology and Hodge decomposition. Discovers that Hodge gradient potential recovers the animacy hierarchy — a linguistic universal — without supervision.

computational-linguistics persistent-homology topological-data-analysis hodge-decomposition llm-interpretability