Optimize a GPT language model training script to achieve the lowest possible validation bits per byte (val_bpb). You have a training service that runs your modified code on Shakespeare's Complete Works (~5MB). The baseline achieves val_bpb ≈ 2.80. You have 50 runs and 3 hours. Lower val_bpb = better score.
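(For orientation: bits per byte is just cross-entropy rescaled from nats to bits. A minimal sketch of the conversion, assuming byte-level tokenization so tokens and bytes coincide; the helper name is illustrative, not part of the task harness.)

```python
import math

def nats_to_bpb(total_nll_nats: float, n_bytes: int) -> float:
    # bits/byte = total negative log-likelihood, converted to bits,
    # divided by the number of bytes scored. With byte-level tokens
    # (one token per byte), this is just mean cross-entropy / ln(2).
    return total_nll_nats / (math.log(2) * n_bytes)

# The 2.80 baseline corresponds to ~1.94 nats of cross-entropy per byte:
print(nats_to_bpb(1.94, 1))  # ~2.80
```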
{
"train_code": "\"\"\"\nRun 28: Best config (d=64, k=8, 4L, bs=8, lr=6e-3) + EMA decay=0.995.\nEMA every step for best quality. Run-19 got 2.653 without EMA.\n\"\"\"\n\nimport os, sys, json, time, math\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom prepare import MAX_SEQ_LEN, TIME_BUDGET, VOCAB_SIZE, make_dataloader, evaluate_bpb\n\nd_model = 64\nkernel_size = 8\nn_layers = 4\nbatch_size = 8\nlearning_rate = 6e-3\nweight_decay = 0.1\ngrad_clip = 1.0\nema_decay = 0.995\ndevice = \"cpu\"\nseed = int(os.environ.get(\"TORCH_SEED\", \"42\"))\ntorch.manual_seed(seed)\n\nclass CausalConvBlock(nn.Module):\n def __init__(self, channels, k):\n super().__init__()\n self.conv = nn.Conv1d(channels, channels, k, padding=k-1)\n self.ln = nn.LayerNorm(channels)\n\n def forward(self, x):\n residual = x\n T = x.shape[2]\n x = self.conv(x)[:, :, :T]\n x = x.transpose(1, 2)\n x = self.ln(x)\n x = x.transpose(1, 2)\n x = F.gelu(x)\n x = x + residual\n return x\n\nclass CausalCNNModel(nn.Module):\n def __init__(self):\n super().__init__()\n self.tok_emb = nn.Embedding(VOCAB_SIZE, d_model)\n self.pos_emb = nn.Embedding(MAX_SEQ_LEN, d_model)\n self.blocks = nn.ModuleList([CausalConvBlock(d_model, kernel_size) for _ in range(n_layers)])\n self.ln_f = nn.LayerNorm(d_model)\n self.head = nn.Linear(d_model, VOCAB_SIZE, bias=False)\n self.head.weight = self.tok_emb.weight\n for m in self.modules():\n if isinstance(m, (nn.Conv1d, nn.Linear)):\n nn.init.normal_(m.weight, 0, 0.02)\n if m.bias is not None: nn.init.zeros_(m.bias)\n elif isinstance(m, nn.Embedding):\n nn.init.normal_(m.weight, 0, 0.02)\n\n def forward(self, x, targets=None):\n B, T = x.shape\n x = self.tok_emb(x) + self.pos_emb(torch.arange(T, device=x.device))\n x = x.transpose(1, 2)\n for b in self.blocks: x = b(x)\n x = self.ln_f(x.transpose(1, 2))\n logits = self.head(x)\n if targets is not None:\n return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))\n return logits\n\ndef get_lr(step, ws, ms, lr):\n if step < ws: return lr * (step+1)/ws\n if step >= ms: return lr * 0.01\n p = (step-ws)/(ms-ws)\n return lr * (0.01 + 0.99*0.5*(1+math.cos(math.pi*p)))\n\ndef train():\n model = CausalCNNModel().to(device)\n n_params = sum(p.numel() for p in model.parameters())\n dp = [p for _,p in model.named_parameters() if p.requires_grad and p.dim()>=2]\n ndp = [p for _,p in model.named_parameters() if p.requires_grad and p.dim()<2]\n optimizer = torch.optim.AdamW([{\"params\":dp,\"weight_decay\":weight_decay},{\"params\":ndp,\"weight_decay\":0.0}], lr=learning_rate, betas=(0.9,0.95))\n\n # EMA state\n ema = {n: p.data.clone() for n, p in model.named_parameters()}\n\n loader = make_dataloader(batch_size, MAX_SEQ_LEN, \"train\", device)\n data = [(x,y) for x,y in (next(loader) for _ in range(3000))]\n nb = len(data)\n\n model.train()\n start = time.time(); step = 0; tl = 0.0; est = 500\n print(f\"CNN+EMA: {n_params:,} params, k={kernel_size}, {n_layers}L, bs={batch_size}\", file=sys.stderr)\n\n while time.time() - start < TIME_BUDGET:\n x, y = data[step % nb]\n ws = max(1, int(est*0.05))\n lr = get_lr(step, ws, est, learning_rate)\n for pg in optimizer.param_groups: pg['lr'] = lr\n loss = model(x, y)\n optimizer.zero_grad(set_to_none=True); loss.backward()\n torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)\n optimizer.step()\n # EMA update every step\n with torch.no_grad():\n for n, p in model.named_parameters():\n ema[n].mul_(ema_decay).add_(p.data, alpha=1-ema_decay)\n tl += loss.item(); step += 1\n if 
step == 20:\n tps = (time.time()-start)/step; est = int(TIME_BUDGET/tps)\n print(f\"Est steps: {est} ({tps:.3f}s/step)\", file=sys.stderr)\n if step % 100 == 0:\n print(f\"step {step:5d} | loss {tl/100:.4f} | lr {lr:.2e} | t {time.time()-start:.0f}s\", file=sys.stderr)\n tl = 0.0\n\n # Load EMA weights for eval\n for n, p in model.named_parameters():\n p.data.copy_(ema[n])\n val_bpb = evaluate_bpb(model, batch_size, device)\n print(f\"val_bpb (EMA): {val_bpb:.6f}\", file=sys.stderr)\n print(json.dumps({\"val_bpb\":round(val_bpb,6),\"train_loss\":round(loss.item(),6),\"total_steps\":step,\"training_time_secs\":round(time.time()-start,2),\"num_params_M\":round(n_params/1e6,4),\"batch_size\":batch_size}))\n\nif __name__ == \"__main__\": train()\n",
"methodology": "## Experiment Log\n\n### Phase 1: Baseline Analysis & Initial Improvements (Runs 0-9)\n- **Run 0 (baseline, d=192, 6L):** OOM - 2.77M params exceeded 1GB memory limit\n- **Run 1 (smaller transformer d=128, 4L):** val_bpb=3.557 - warmup too long (entire training was warmup)\n- **Run 2 (fixed warmup 5%):** val_bpb=3.549 - marginal improvement\n- **Run 3 (d=64, 4L transformer):** val_bpb=3.575 - smaller model, more steps but similar\n- **Run 5 (pre-LN + GELU + weight tying):** val_bpb=3.513 - meaningful improvement\n- **Run 6 (d=128 + same fixes):** val_bpb=3.589 - too large, not enough steps\n- **Run 8 (d=48, 4L):** val_bpb=3.421 - **best transformer result** - smallest model, most steps\n- **Run 9 (d=64, 2L):** val_bpb=3.470 - less depth hurts\n\nKey insight: In this extreme compute-limited setting (180s on CPU), smaller models with more gradient updates consistently outperform larger models.\n\n### Phase 2: Architecture Exploration (Runs 10-16)\n- **Run 10 (GRU d=128):** Timeout - GRU too slow on CPU for T=256\n- **Run 11 (label smoothing=0.1):** val_bpb=4.294 - **disaster** - prevents model from fitting\n- **Run 13 (d=32, 8L):** val_bpb=3.426 - deep narrow works well\n- **Run 16 (CNN d=48, k=16, 4L):** val_bpb=3.441 - first CNN attempt, promising\n\nKey insight: Label smoothing is catastrophic in low-compute regimes. The model needs to fit the data hard.\n\n### Phase 3: CNN Architecture Breakthrough (Runs 17-21)\n- **Run 17 (CNN d=64, k=16, 4L, bs=4):** val_bpb=2.959 - **massive improvement!**\n- **Run 18 (CNN d=96, k=16, 6L):** val_bpb=4.552 - overfitting, too many params\n- **Run 19 (CNN d=64, k=8, 4L, bs=8):** val_bpb=**2.653** - k=8 faster than k=16\n- **Run 20 (same, bs=4):** val_bpb=2.679 - fewer tokens per step\n- **Run 21 (same, bs=16):** val_bpb=2.700 - too few gradient updates\n\nKey insight: Causal CNN (Conv1d) dramatically outperforms transformers on CPU. Conv1d has no O(T^2) attention and is highly optimized. The sweet spot is d=64, k=8, 4 layers, bs=8 (165K params, ~850 steps at 0.2s/step).\n\n### Phase 4: Fine-tuning the CNN (Runs 22-38)\n- **Run 22 (dilated [1,2,4,8]):** val_bpb=2.709 - larger receptive field but slower\n- **Run 23 (8-layer dilated):** val_bpb=2.817 - too many params, too slow\n- **Run 24 (lr=3e-3):** val_bpb=2.727 - too slow to converge\n- **Run 25 (gated GLU + dilated):** val_bpb=2.689 - gating helps slightly\n- **Run 26 (best config + EMA=0.995):** val_bpb=**2.627** - **NEW BEST!**\n- **Run 27 (d=80 + EMA=0.998):** val_bpb=2.893 - wider is slower, EMA too slow\n- **Run 28 (lr=1e-2 + EMA):** val_bpb=2.639 - slightly too high LR\n- **Run 29 (gated CNN + EMA):** val_bpb=2.732 - GLU doubles params, too slow\n- **Run 30 (5L + EMA):** val_bpb=2.687 - extra layer adds params/slows\n- **Run 33 (seed=137 + EMA):** val_bpb=2.638 - seed 42 slightly better\n- **Run 34 (lr=8e-3 + EMA):** val_bpb=2.633 - close to best\n- **Run 35 (WSD schedule + EMA):** val_bpb=2.631 - WSD competitive with cosine\n- **Run 37 (wd=0.2 + EMA):** val_bpb=2.634 - more regularization marginally worse\n- **Run 38 (EMA=0.99):** val_bpb=2.633 - slightly faster EMA slightly worse\n\n## Key Insights\n\n1. **Architecture matters more than hyperparameters:** Switching from transformer to causal CNN gave 0.77 bpb improvement (3.42 -> 2.65). No amount of hyperparameter tuning on the transformer could match this.\n\n2. **Compute-optimal scaling on CPU:** With 180s on CPU, the optimal model is ~165K params. Larger models get fewer training steps, which dominates in this regime. 
The Chinchilla insight applies strongly here.\n\n3. **Causal CNN dominance on CPU:** Conv1d operations are highly optimized on x86 CPUs. No quadratic attention overhead. ~4x faster per step than equivalent-sized transformers.\n\n4. **Pre-generating training data:** Eliminated Python dataloader overhead by pre-generating batches into a list (2.6x speedup for transformers, significant for CNNs too).\n\n5. **EMA is free performance:** EMA with decay=0.995 improved val_bpb from 2.653 to 2.627 with minimal compute overhead. This is the easiest win after architecture selection.\n\n6. **Server variance is a significant factor:** The same code ran at 0.156s/step on one server and 0.301s/step on another, causing ~0.1 bpb difference in final results.\n\n7. **Optimal hyperparameters:** lr=6e-3, wd=0.1, bs=8, cosine schedule with 5% warmup, AdamW with beta2=0.95, gradient clipping=1.0.\n\n## Final Architecture\n\nBest val_bpb: **2.627** (Run 26)\n- 4-layer causal CNN with d_model=64, kernel_size=8\n- Pre-LayerNorm, GELU activation, residual connections\n- Weight-tied embedding/output head\n- EMA with decay=0.995\n- 164,736 parameters\n- Cosine LR schedule (6e-3 peak, 5% warmup)"
}
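Two of the log's techniques are easy to lift out and verify in isolation. First, insight 5's weight-EMA pattern reduces to a two-line update after each optimizer step; a minimal standalone sketch, where the toy model and helper names are illustrative rather than part of the submission:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # toy stand-in for the CNN in train_code
decay = 0.995              # the log's best-performing decay
ema = {n: p.detach().clone() for n, p in model.named_parameters()}

def ema_update(model, ema, decay):
    # Call after every optimizer.step(): shadow <- decay*shadow + (1-decay)*weights
    with torch.no_grad():
        for n, p in model.named_parameters():
            ema[n].mul_(decay).add_(p, alpha=1 - decay)

def load_ema(model, ema):
    # Swap the smoothed weights in before evaluation
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.copy_(ema[n])
```

Second, the causal-Conv1d trick from train_code (Conv1d pads both sides by k-1, then the right overhang is truncated) can be sanity-checked for causality; the shapes here are illustrative:

```python
import torch
import torch.nn as nn

k, C, T = 8, 64, 32
conv = nn.Conv1d(C, C, k, padding=k - 1)
x = torch.randn(1, C, T)
y = conv(x)[:, :, :T]  # keep the first T outputs: position t sees only inputs <= t

x2 = x.clone()
x2[:, :, T // 2:] = 0.0        # perturb only "future" positions
y2 = conv(x2)[:, :, :T]
assert torch.allclose(y[:, :, :T // 2], y2[:, :, :T // 2])  # past outputs unchanged
```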