Working paper · Independent research

Beyond Behavioral Benchmarks

Mechanistic evidence of demographic encoding dissociation in language models.

Carlos Rodriguez · Institute of Design at Illinois Tech

The question

Researchers sometimes use language models to simulate how a Hispanic respondent might answer a survey question. The practice is called silicon sampling, and it rests on an assumption: that the model is reaching for an internal representation of "Hispanic" and reasoning from there. That what the model says about a demographic group reflects what the model knows about that group.

That assumption had never been tested mechanistically. So I tested it.

The short answer: behavioral output and internal representation can be substantially decoupled. A model can pass every fairness benchmark we throw at it while internally computing demographic information in ways those benchmarks cannot detect. In some cases it can do so while that internal demographic information drives nothing at all.

What I did

I built a four-phase mechanistic evaluation pipeline and ran it across eight open-weights language models, ranging from GPT-2 (117M parameters) to Pythia-12B.

  1. Behavioral baseline. Run the standard fairness benchmark (BBQ) and record what the model says.
  2. Activation probing. Look inside the model. Train a linear classifier on its internal activations to test whether demographic identity is even encoded.
  3. Causal patching. Surgically swap one demographic representation for another inside the model's residual stream and measure whether the output changes. This is the test for whether the encoding actually drives anything.
  4. Semantic characterization. Project the demographic encoding directions onto the model's vocabulary to surface representational artifacts invisible to behavioral evaluation.

Phases 1 through 3 produce statistically validated findings, permutation-tested with 1,000 iterations. Phase 4 is exploratory and model-specific.
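The probe-and-permutation logic behind Phase 2 and its statistical validation can be sketched as follows. Synthetic activations stand in for real residual-stream vectors, and the dimensions, signal strength, and least-squares probe are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 200

# Synthetic "activations": each class is offset along one hidden direction,
# mimicking a demographic encoding direction in the residual stream.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)                       # 0/1 demographic label
acts = rng.normal(size=(n, d_model)) + 2 * np.outer(2 * labels - 1, direction)

def probe_accuracy(X, y):
    """Least-squares linear probe, scored on a held-out half."""
    half = len(y) // 2
    Xtr, ytr, Xte, yte = X[:half], y[:half], X[half:], y[half:]
    w, *_ = np.linalg.lstsq(Xtr, 2 * ytr - 1, rcond=None)
    return np.mean((Xte @ w > 0) == yte)

observed = probe_accuracy(acts, labels)

# Permutation test: shuffle labels, retrain, and build a null distribution,
# as in the 1,000-iteration validation of Phases 1 through 3.
null = np.array([probe_accuracy(acts, rng.permutation(labels))
                 for _ in range(1000)])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"probe accuracy={observed:.2f}, p={p_value:.4f}")
```

The add-one smoothing in the p-value keeps the estimate conservative when no shuffled run matches the observed accuracy.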

What I found

1. Demographic identity is encoded everywhere, all the time.

A simple linear probe recovers race/ethnicity and gender at 100% accuracy from internal activations at every layer in every model tested. This holds even when demographic labels are paraphrased ("Latino" for "Hispanic," "African American" for "Black"). The encoding is concept-level, not token-specific. Models don't need to be asked about demographics to be representing them.
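The paraphrase check can be sketched the same way: train a probe on activations elicited by one surface form and score it on another. In this toy version, both surface forms share a single concept direction by construction; that shared direction is an assumption standing in for real model internals:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n = 64, 200

# One concept-level direction shared across paraphrases.
concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)

def make_acts(label_sign, seed):
    """Concept-level signal plus surface-form-specific noise."""
    r = np.random.default_rng(seed)
    return label_sign * 2 * concept + r.normal(size=(n, d_model))

train_X = np.vstack([make_acts(+1, 10), make_acts(-1, 11)])  # e.g. "Hispanic"/"White"
test_X  = np.vstack([make_acts(+1, 20), make_acts(-1, 21)])  # e.g. "Latino"/"Caucasian"
y = np.concatenate([np.ones(n), -np.ones(n)])

# Probe trained on one surface form transfers to the paraphrase because the
# signal lives in the concept direction, not in token-specific noise.
w, *_ = np.linalg.lstsq(train_X, y, rcond=None)
acc = np.mean(np.sign(test_X @ w) == y)
print(f"cross-paraphrase probe accuracy: {acc:.2f}")
```

High transfer accuracy under this construction is what "concept-level, not token-specific" predicts; token-specific encoding would collapse toward chance.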

2. Whether the encoding actually drives outputs depends on how the model was trained.

In base models (GPT-2, Pythia-1.4B), the demographic encoding direction is causally active at intermediate layers: patching it changes the output. But at the final layer it goes inert, and behavioral evaluation, which only sees the final layer, misses all the action.

In Mistral-7B-Instruct-v0.1, an instruction-tuned model, the encoding is geometrically perfect (100% probe accuracy) but causally disconnected from every layer of computation. Instruction tuning restored behavioral sensitivity through a mechanism that operates entirely independently of the demographic encoding direction. The encoding exists. It drives nothing.
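The dissociation between geometric presence and causal activity can be illustrated with a toy readout. Both hidden states below give a probe perfect label recovery, since the demographic component is plainly present, but patching that component only moves the output when the readout actually uses the direction. All vectors here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 64

# A demographic direction and an orthogonal task direction.
demo_dir = rng.normal(size=d_model)
demo_dir /= np.linalg.norm(demo_dir)
task_dir = rng.normal(size=d_model)
task_dir -= (task_dir @ demo_dir) * demo_dir
task_dir /= np.linalg.norm(task_dir)

# Hidden state for group A; patching reflects its demographic component,
# i.e. swaps the representation toward the other group.
h = rng.normal(size=d_model) + 3 * demo_dir
h_patched = h - 2 * (h @ demo_dir) * demo_dir

readout_active = demo_dir + task_dir    # base-model-like: uses the encoding
readout_inert = task_dir                # Mistral-like: routes around it

shift_active = abs(h @ readout_active - h_patched @ readout_active)
shift_inert = abs(h @ readout_inert - h_patched @ readout_inert)
print(f"active readout shift: {shift_active:.3f}")
print(f"inert readout shift:  {shift_inert:.2e}")
```

The inert readout leaves the patch invisible at the output even though the encoding is geometrically perfect, which is the signature the Mistral-7B-Instruct result exhibits.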

3. The encodings carry pretraining artifacts that no behavioral benchmark can see.

In GPT-2, the Hispanic encoding direction loads on criminalization-adjacent vocabulary ("detention," "joints," "inals") when contrasted with White, and on Spanish cultural surnames (Gomez, Cruz, Gutierrez) when contrasted with Black. The same model represents the same demographic differently depending on which other group is in the picture.

The Black encoding direction is dominated by color-compound subword tokens at early layers (Berry, jack, smith, as in Blackberry, Blackjack, Blacksmith) before shifting to civic vocabulary (Lives, protester, Panther) at later layers.
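Phase 4's vocabulary projection can be sketched as scoring each row of an unembedding matrix against an encoding direction and reading off the top tokens (a logit-lens-style projection). The miniature vocabulary and random embeddings below are invented for illustration, with the direction constructed to align with the first three entries:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model = 64
vocab = ["detention", "joints", "Gomez", "Cruz", "the", "and", "of", "run"]
W_U = rng.normal(size=(len(vocab), d_model))   # stand-in unembedding matrix

# Construct a direction aligned with the first three token embeddings, as a
# stand-in for a demographic encoding direction extracted from a real model.
direction = W_U[0] + W_U[1] + W_U[2]
direction /= np.linalg.norm(direction)

# Score every vocabulary row against the direction; the top-scoring tokens
# characterize what the direction "means" in vocabulary space.
scores = W_U @ direction
top = [vocab[i] for i in np.argsort(scores)[::-1][:3]]
print("top tokens for direction:", top)
```

On a real model the same projection is what surfaces the criminalization-adjacent and surname clusters described above; no behavioral benchmark queries this space.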

These are pretraining-corpus distributional artifacts, not model intent. But they sit in directions that, in base models, are causally active at intermediate layers, and they are completely invisible to the bias benchmarks the field uses to evaluate models for deployment.

Why this matters

If you publish silicon-sampling results (claims about what demographic communities think, want, or need based on simulated LLM responses) and you've validated your model on behavioral fairness benchmarks, that validation is not enough. The model can show clean behavioral output while drawing on representations shaped by criminalization associations, color-compound polysemy, or majority-context dependence.

If you build alignment systems and your evaluation stops at output-level fairness metrics, you may be treating only the symptom. Mistral-7B-Instruct's results suggest current behavioral interventions can leave the underlying representation intact and simply route around it.

For evaluation generally: geometric presence (what a probe measures), causal activity (what patching measures), and behavioral expression (what benchmarks measure) are three distinct properties. They can diverge. Treating any one as a proxy for the others produces conclusions the evidence does not support.

What's next

  • Modern frontier models. This audit covered open-weights models up to Pythia-12B because mechanistic intervention requires direct access to activations. Replicating the findings on frontier models (GPT-4, Claude, Llama-3) is methodologically harder, especially where weights are closed, but increasingly important.
  • Sparse autoencoder decomposition of demographic encoding directions, to test whether they are monosemantic features or superpositions of correlated social signals.
  • Causal connection between intermediate-layer activity and final-layer outputs. Whether the criminalization-adjacent vocabulary detected at intermediate layers propagates to influence what the model actually generates.
  • Open-ended generation benchmarks (BOLD, HolisticBias) to test whether the behavioral collapse documented here generalizes beyond multiple-choice formats.
  • Application to silicon-sampling pipelines in the wild, running Phases 1 through 3 on models being used in published demographic simulation studies.

The pipeline is open-source and works on any open-weights model supported by TransformerLens.

Cite

Rodriguez, C. (2026). Beyond Behavioral Benchmarks: Mechanistic Evidence of Demographic Encoding Dissociation in Language Models. Working paper.
