You are analyzing a pre-trained transformer that has learned modular addition. The model takes two integers a, b as input and outputs (a + b) mod p. Your task is to reverse-engineer the internal circuit that implements this computation. Use the Circuit Lab API to:

1. **Understand the model**: GET /model-info for architecture details, the modular base p, and baseline accuracy
2. **Analyze activations**: POST /run with analysis code to capture activations, attention patterns, and probe internal representations
3. **Verify your circuit**: POST /verify-circuit with your claimed heads and neurons — the service ablates them and reports the accuracy drop
4. **Iterate**: Refine your circuit hypothesis based on ablation results

You have a maximum of **30 analysis runs** and a **3-hour time limit**. Focus on finding the minimal circuit — the smallest set of attention heads and MLP neurons whose ablation causes the largest accuracy drop. Submit: { circuit, analysis, methodology }
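A minimal sketch of driving the workflow above from Python, assuming the service is a plain HTTP/JSON API; the base URL, response field names, and payload shapes below are assumptions, not documented behavior:

```python
import requests

BASE = "http://localhost:8000"  # assumed base URL for the Circuit Lab service

# 1. Architecture details, the modular base p, and baseline accuracy.
info = requests.get(f"{BASE}/model-info").json()
p = info["p"]  # field name is an assumption

# 2. Run analysis code server-side (counts against the 30-run budget).
run = requests.post(
    f"{BASE}/run",
    json={"code": "print(model.cfg)"},  # payload shape is an assumption
).json()

# 3. Ablate a candidate circuit and read back the accuracy drop.
report = requests.post(
    f"{BASE}/verify-circuit",
    json={
        "heads": [[1, 1], [1, 2]],       # (layer, head) pairs
        "neurons": [[1, 73], [1, 280]],  # (layer, neuron) pairs
    },
).json()
print(report.get("accuracy_drop"))  # field name is an assumption
```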
{
  "circuit": {
    "heads": [[1, 1], [1, 2]],
    "neurons": [
      [1, 73], [1, 280], [1, 155], [1, 275], [1, 5], [1, 81], [1, 366], [1, 379], [1, 146], [1, 324],
      [1, 209], [1, 509], [1, 207], [1, 31], [1, 233], [1, 254], [1, 104], [1, 462], [1, 357], [1, 405],
      [1, 473], [1, 116], [1, 67], [1, 352], [1, 118], [1, 145], [1, 121], [1, 109], [1, 498], [1, 388],
      [1, 360], [1, 340], [1, 320], [1, 331], [1, 443], [1, 255], [1, 265], [1, 334], [1, 232], [1, 309],
      [1, 511], [1, 308], [1, 241], [1, 247], [1, 291], [1, 258], [1, 507], [1, 292], [1, 336], [1, 394]
    ]
  },
"analysis": "The transformer computes (a+b) mod 89 using a Fourier-based circuit. The token embeddings encode each input value as a superposition of Fourier components at key frequencies k = 9, 11, 16, 18, 22, 23, 24 (and their conjugates mod 89). Layer 0 attention heads attend roughly equally from the = position to both input positions (a and b), copying their Fourier representations to position 2. After Layer 0, the residual stream at the = position contains Fourier components of both inputs (probing confirms a is linearly decodable at 81% accuracy, while the sum is not yet computed at 0%). The Layer 1 MLP is the critical computational component. Its neurons implement the trigonometric product identity: cos(2πk(a+b)/p) = cos(2πka/p)cos(2πkb/p) - sin(2πka/p)sin(2πkb/p). Each neuron is specialized for specific frequencies: neuron 155 responds primarily to k=16, neuron 81 to k=11, neuron 5 to k=16, neuron 280 to k=16 and k=22, neuron 366 to k=16 and k=9, neuron 324 to k=9 and k=18. The top 50 L1 MLP neurons (by weight norm importance) capture the core circuit, with individual ablation of the top neurons causing 9-11% accuracy drops each. Layer 1 attention heads (especially L1H1 and L1H2, which cause 61% and 70% drops when individually ablated) route the computed Fourier products to the output. The unembedding matrix reads out Fourier components at the same frequencies (k=23, 24, 16, 22, 11 are dominant) to produce logits for each output class. Ablating all Layer 1 MLP neurons drops accuracy to 13.2%, while ablating all Layer 0 MLP neurons only drops to 84.7%, confirming Layer 1 MLP as the computational core.",
"methodology": "I used systematic ablation, Fourier analysis, and linear probing to identify the circuit. Phase 1 - Head importance: I ablated each of the 8 attention heads individually, finding L1H2 (70% drop), L1H1 (61% drop), L0H0 (44% drop), L1H0 (43% drop), and L0H3 (43% drop) as most critical. All layer heads are important (ablating all L0 heads: 92% drop; all L1 heads: 97% drop). Phase 2 - MLP importance: I ablated MLP layers entirely, finding Layer 1 MLP critical (drops to 13.2%) but Layer 0 MLP redundant (drops only to 84.7%). I then ablated Layer 1 in blocks of 128 neurons, finding neurons 0-128 most important (39.2% drop), then refined to blocks of 32 and blocks of 8 to localize critical neurons. Phase 3 - Neuron ranking: I ranked all 512 Layer 1 neurons by weight norm (||W_in_row|| * ||W_out_col||) and confirmed the top 10 individually via ablation (each causing 2-11% accuracy drops). L1 neurons have 17x higher weight importance than L0 neurons (max 0.302 vs 0.018). Phase 4 - Fourier analysis: DFT of token embeddings revealed dominant frequencies at k=9,11,16,18,22,23,24. Each top neuron input weights show clear frequency selectivity (e.g., neuron 155: k=16; neuron 81: k=11; neuron 280: k=16,22). Unembedding matrix has matching frequencies (k=23,24,16,22,11). Phase 5 - Information flow probing: Linear probes on the residual stream confirmed that (1) after Layer 0, position 0 encodes a perfectly and position 1 encodes b perfectly, (2) the sum (a+b)%p is NOT computable at the = position after Layer 0 (0% probe accuracy) but IS computable after Layer 1 (99% probe accuracy), proving the computation happens in the Layer 1 circuit. Phase 6 - Circuit verification: The neuron-only circuit achieved quality 426.5 (my 50 L1 neurons cause 59.3% accuracy drop vs 0.14% for random neurons of same count), confirming highly targeted identification."
}
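As a sanity check on the analysis above, a self-contained NumPy sketch of (a) the trigonometric product identity attributed to the Layer 1 MLP, (b) a Phase 4-style DFT over an embedding matrix, and (c) the Phase 3 weight-norm importance score; the matrices here are random stand-ins, not the model's actual parameters, and the dimensions are assumptions:

```python
import numpy as np

p, k = 89, 16  # modular base and one of the claimed key frequencies
a = np.arange(p)[:, None]
b = np.arange(p)[None, :]
w = 2 * np.pi * k / p

# (a) Identity the Layer 1 MLP is claimed to implement, checked over all (a, b):
#     cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb)
lhs = np.cos(w * (a + b))
rhs = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
assert np.allclose(lhs, rhs)

d_model, d_mlp = 128, 512  # assumed dimensions

# (b) Per-frequency power of an embedding matrix via DFT, skipping the DC term.
#     On the real embeddings, peaks at k = 9, 11, 16, 18, 22, 23, 24 are claimed.
E = np.random.randn(p, d_model)  # random stand-in for the token embeddings
power = np.abs(np.fft.fft(E, axis=0))[1 : p // 2 + 1].sum(axis=1)
dominant_k = 1 + np.argsort(power)[::-1][:7]

# (c) Weight-norm importance ||W_in row|| * ||W_out col|| per MLP neuron.
W_in = np.random.randn(d_mlp, d_model)   # rows: neuron input weights (stand-in)
W_out = np.random.randn(d_model, d_mlp)  # cols: neuron output weights (stand-in)
importance = np.linalg.norm(W_in, axis=1) * np.linalg.norm(W_out, axis=0)
top_50 = np.argsort(importance)[::-1][:50]
```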