# A Learnability Gap, Not a Capacity Gap: 353 Parameters vs a 3-Parameter Heuristic

DEV Community
Kit Good

*What 208 benchmark runs and four experiments in a single file showed me about online learning for browser frame scheduling.*

Haksung Lee | 2026-04

**TL;DR.** A 353-parameter MLP with online SGD fails to match a 3-parameter EMA heuristic at browser frame scheduling, by ~10 pp jank rate on ramping workloads. Capacity is sufficient: offline distillation reaches 98% imitation with a sharp decision boundary at the right threshold. The failure is geometric: online SGD descends in a direction only 0.105 cosine-similar to offline distillation, against a 0.9997 same-seed baseline (a separation the baseline experiment puts at ~6400σ). Full code: github.com/Kit4Some/Tempo-js. `node tempo.js` reproduces the four experiments in ten seconds.

Browser animations run on a ~16.67 ms budget per frame. Miss it and the page stutters. A decent frontend framework has to decide, on every frame, how much work to commit: full paint, skip decorative frames, or drop all the way to a CSS-transition fallback. The decision rule has to be fast (it runs inside the render loop), cheap (it can't compete with the content for the budget it's trying to protect), and it has to see the future well enough to pre-empt misses rather than react to them.

A common tactic in animation libraries is an exponential-moving-average heuristic: watch the last few frame deltas, smooth them, cut work when the smoothed value crosses a threshold. Three constants (an EMA alpha, a "reduce" threshold, a "degrade" threshold) do the whole job. It's a nothing-clever approach that has been deployed for years.

The question I wanted answered, cleanly: can a small neural network learn a better scheduler online? Not a big one. Something a browser can afford to run every frame. I settled on Tempo's 353-parameter multilayer perceptron: a 12 → 16 → 8 → 1 fully-connected stack with ReLU hidden layers and a sigmoid output, predicting p_miss from twelve rolling features. Online SGD with momentum, a small ring buffer, a modest batch. Zero external dependencies.
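To make the EMA baseline concrete, here is a minimal sketch of the threshold-scheduler idea. This is my own condensation, not the actual `src/core/schedulers.js` code; the defaults are the B1 constants quoted elsewhere in the post (alpha 0.3, reduce 0.8, degrade 1.2).

```javascript
// Sketch of the EMA-threshold scheduler family (my condensation, not the
// repo's implementation). Defaults are B1's constants from the post.
function makeEmaScheduler({ alpha = 0.3, reduceAt = 0.8, degradeAt = 1.2 } = {}) {
  let ema = 0; // smoothed frame load, normalized so 1.0 == exactly on budget
  return function decide(dtMs) {
    ema = alpha * (dtMs / 16.67) + (1 - alpha) * ema;
    if (ema > degradeAt) return "degrade"; // CSS-transition fallback
    if (ema > reduceAt) return "reduce";   // skip decorative work
    return "full";                         // full paint
  };
}

const b1 = makeEmaScheduler();
b1(10); // cheap frame: "full"
b1(40); // expensive frame pushes the EMA up toward "reduce"
```

Three constants and two comparisons per frame: that is the entire bar the 353-parameter network has to clear.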
It's a production-style build: a live demo page, a Puppeteer headless benchmark, and a vitest suite that gradchecks the MLP's backward pass against numerical gradients bit-for-bit.

Two baselines define the scoring:

- **B0**: always pick "full". The no-scheduler reference.
- **B1**: EMA threshold (alpha 0.3, reduce 0.8, degrade 1.2). The "well-tuned heuristic" bar I had to clear.

Then the contender: **Predictor**, the 353-parameter MLP with online learning, thresholded at p_miss > 0.1 → reduce, p_miss > 0.3 → degrade.

Four workloads, chosen to be orthogonal in their statistical structure: sawtooth (predictable ramps), burst (rare spikes on a flat baseline), scroll (smooth sinusoidal, roughly what an IntersectionObserver-driven page feels like), and constant (flat, a sanity check). Ten runs per cell, 60 seconds per run, deterministic seeds, headless Chromium with background throttling and renderer backgrounding disabled. The harness logs every frame, computes a jank rate after a 30-frame warmup drop, feeds it to a pure-JS Mann–Whitney U (asymptotic + continuity-corrected, bit-for-bit matching scipy.stats.mannwhitneyu), reports Cohen's d and bootstrap-percentile 95% CIs with a seeded resampler, and decides the Go/No-Go gate (p < 0.05 AND |d| ≥ 0.5). The full protocol is in METHODOLOGY.md. 120 runs in Phase 5 Part 1, another 88 in Part 2 (pretrained variants, plus a drift-check gate).

Here are the headline numbers: mean jank rate per (workload, scheduler) cell, ten 60-second runs each.

| Workload | B0 | B1 | Pred (scratch) | Pred (pretrained + online) | Pred (pretrained, frozen) |
|---|---|---|---|---|---|
| constant | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| sawtooth | 11.69% | 1.66% | 6.62% | 4.59% | 11.68% |
| burst | 5.55% | 5.56% | 5.45% | 5.51% | 5.52% |
| scroll | 14.68% | 3.38% | 7.08% | 6.78% | 14.61% |

Raw results: PHASE5_PART2_COMPARE.md. Full protocol: METHODOLOGY.md. `npm run bench:part2` to reproduce (88 runs, ~1h40m).

On predictable ramps (sawtooth, scroll), a well-tuned EMA heuristic beats the 353-parameter MLP by roughly ten percentage points in jank rate.
Even when the MLP was pretrained on over 330,000 samples of in-distribution data and given sixty seconds of online adaptation per run. Pretrained + frozen matches B0, not B1: the moment you stop updating the weights, the MLP reverts to a naive "always full" posture. On unpredictable bursts, the MLP held its own, within noise. The EMA's smoothing has nothing to smooth over in burst; the signal is too short to be visible in a moving average before the spike is already over.

My first instinct, looking at sawtooth and scroll, was to reach for the usual levers. Scale up. Add regularization. Train longer. Pretrain more. The Part 2 ablations were designed to kill those guesses head-on.

- **Cold start?** The MLP was pretrained on data sampled from the same workloads and the same regime. It entered each run with a θ that already knew the distribution.
- **Data quantity?** 334,510 samples offline, sixty more seconds online per run. Phase 5's own overhead measurements showed the MLP had ample compute headroom.
- **Learning cadence?** Pretrained + online was strictly better than pretrained + frozen (the online signal did help), but the gap to B1 persisted. More online updates weren't closing it.

None of the three explanations fit. The gap was not cold-start, not data quantity, not learning cadence. That is already an interesting negative result, and if I had stopped there, Phase 5 would have been a complete study. But it wouldn't have said why.

So I wrote a second file. tempo.js is a single-file companion to Tempo's modular codebase: zero dependencies, one seed, four experiments, end-to-end. The Phase 5 production pipeline is the empirical apparatus; tempo.js is the mechanistic one. It exists to answer one question the headline numbers don't: given that the Predictor has the same parameters, the same data, and the same update rule as a setup that should work, what is actually different?

Four experiments, each roughly fifty lines of code, each deterministic under `node tempo.js`.

**Experiment 1: the simplest check.**
Run B0, B1, and Predictor ten times on sawtooth in an ideal-clock simulation. Get the same ordering the browser benchmark saw: B1 lowest, Predictor in the middle, B0 highest. Confirm that the simulation reproduces the qualitative pattern before I ask it to explain the pattern.

| Scheduler | Sim | Paper (browser) |
|---|---|---|
| B0 | 10.06% | 11.69% |
| B1 | 0.00% | 1.66% |
| Predictor | 4.08% | 6.62% |

The absolute numbers diverge (the sim has no wall-clock jitter, no vsync quantization, no Chromium paint cost), but the ordering is intact. Good enough to trust the same harness for the three analytical experiments that follow.

**Experiment 2: distillation.** This is the capacity question, posed as behavioral cloning. If the 353-parameter MLP cannot represent a function indistinguishable from B1, we're done: architecture limit, buy more parameters. If it can, we have to keep looking.

Let π_B1: ℝ¹² → {full, ¬full} denote B1's policy, binarized on sawtooth (where B1 never crosses the degrade threshold, making the reduction lossless). We collect the B1-driven trajectory

𝒟 = {(x_t, π_B1(x_t))}_{t=0}^{3599}

and minimize binary cross-entropy over the MLP parameters θ:

θ_distill = argmin_θ 𝔼_{(x,y)~𝒟} BCE(f_θ(x), y)

with SGD + momentum, batch 16, 100 epochs, a 5× learning rate relative to the online trainer, and a relaxed grad-clip (offline mini-batches have lower per-step variance). Train-set agreement is computed against the full 3600-sample trajectory.

Result: 98.36% train agreement after 100 epochs. The distilled MLP, on B1's own feature trajectory, imitates B1 to within rounding. The 353-parameter architecture unambiguously has the capacity to represent a function indistinguishable from a three-parameter heuristic on the distribution the heuristic walks through.

Two sanity checks, because a single number is cheap to fake. First, I re-ran the same experiment with the online hyperparameters (learning rate 1e-3, gradient clip 1.0). At 100 epochs: 93.33% agreement. At 300 epochs: 98.36%, identical to the paper-hparam result.
The 5× learning rate was a compute shortcut, not a capacity claim. The MLP fits at either rate, given the budget to converge.

Second, I inspected the decision function directly. I trained the MLP, froze its weights, and asked it for the p_miss it produces as ema_fast ranges from 0 to 1. Bucketed into ten bins:

| Bucket | ema_fast | n | Mean p_miss |
|---|---|---|---|
| 0–6 | 0.00 – 0.70 | 2281 | 0.0000 |
| 7 | 0.70 – 0.80 | 600 | 0.0341 |
| 8 | 0.80 – 0.90 | 719 | 0.9936 |

A 29× jump in mean probability between the bucket immediately below B1's threshold and the one immediately above. The slope of the learned decision function at the boundary is approximately

dp_miss / d(ema_fast) |_{ema_fast=0.80} ≈ (0.994 − 0.034) / 0.10 ≈ 9.6 / unit

For reference, a single logistic sigmoid has a maximum slope of 0.25/unit. The MLP has learned a composition that is ~38× sharper than a single sigmoid allows; specifically, it has learned a threshold approximation aligned with B1's exact cutoff at ema_fast = 0.80. This is not a smooth regression that happens to score 98% on average. It is a sharp boundary at the right place.

So capacity is not the issue. The 353-parameter MLP has enough parameters, not just in a "universal-approximation" abstract sense, but in the concrete "here is the exact function you asked for, fit sharply, on the exact distribution you run on" sense. Architecture is off the table.

But here is the twist, and it turns out to be load-bearing. When you deploy that distilled MLP as the scheduler (exactly the same weights, now making its own decisions frame by frame), its jank rate is not 0% like B1's. It's 6.67%. Ten reps, 3600 frames each. Consistent.

The MLP, which matches B1 to 98% on the features it was trained on, fails to match B1 by seven percentage points once its own decisions influence the features it subsequently sees. Covariate shift: when the scheduler's policy changes, the dt distribution changes, and the training trajectory stops being a faithful sample of the deployment trajectory.
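The trap is easy to reproduce in miniature. Here is a toy closed-loop rollout with made-up constants (this is not Tempo's simulator): the policy's decision sets the frame cost, the frame cost feeds the EMA feature, and the feature drives the next decision, so two different policies walking the same simulator see different feature distributions.

```javascript
// Toy illustration of policy-dependent feature distributions (hypothetical
// constants; not Tempo's simulator). The decision sets the frame cost, the
// cost feeds the EMA feature, the feature drives the next decision.
function rollout(decide, frames = 1000) {
  const seen = [];
  let load = 0.5; // EMA'd load feature the policy reads
  for (let i = 0; i < frames; i++) {
    seen.push(load);
    const work = decide(load) === "full" ? 1.0 : 0.4; // shedding work is cheaper
    const dt = 0.6 + 0.8 * work;                      // frame cost from committed work
    load = 0.3 * dt + 0.7 * load;                     // EMA update
  }
  return seen;
}

const avg = (xs) => xs.reduce((s, v) => s + v, 0) / xs.length;
const teacher = (load) => (load > 0.9 ? "reduce" : "full"); // B1-like threshold
const always = () => "full";                                // B0-like

// Under "always full" the feature settles near 1.4; under the teacher it
// hovers near a 0.92 fixed point. A student distilled on the teacher's
// trajectory never sees the region its own (different) policy will visit.
const teacherMean = avg(rollout(teacher));
const alwaysMean = avg(rollout(always));
```

Same simulator, different policies, different feature distributions: that is the whole covariate-shift story in twenty lines.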
This is also part of why experiment 2 is worth staring at longer than it takes to run. The capacity side says yes. The deployment side says the "yes" only holds for the feature distribution that produced the training labels.

**Experiment 3: the geometry.** Now the real question. Distillation gives us a θ that fits B1 well. Online SGD (the thing the Phase 5 benchmark actually does) gives us a different θ. Both started from the same initialization. What is the geometric relationship between the two?

Given a fixed initialization θ₀, we compute two displacement vectors:

- Δ_online = θ_online − θ₀ (from the online SGD trainer)
- Δ_distill = θ_distill − θ₀ (from 100-epoch BCE on the B1 trajectory)

and measure three quantities on each:

- L(θ) = BCE loss on the B1-labeled dataset 𝒟
- ‖Δ‖₂ = L2 norm of the parameter shift
- cos(A, B) = ⟨A, B⟩ / (‖A‖ · ‖B‖)

**Online branch.** Run the scheduler for 3600 frames, collecting features and wasMiss labels from the MLP's own rollout. SGD with momentum, grad-clip, the ring-buffer sampler: the full online trainer from the Phase 5 harness. Get θ_online.

**Distilled branch.** Offline BCE on the B1-labeled trajectory (as in Experiment 2). Get θ_distill.

| Quantity | Value |
|---|---|
| BCE on B1 dataset: random-init MLP | 0.6931 |
| BCE on B1 dataset: online-trained MLP | 0.6461 |
| BCE on B1 dataset: distilled MLP | 0.0104 |
| Loss-gap fraction recovered by online | 8% |
| ‖Δ_online‖₂ | 2.247 |
| ‖Δ_distill‖₂ | 12.59 |
| cos(Δ_online, Δ_distill) | 0.105 |

Online SGD closed 8% of the random→distilled loss gap after 3600 steps. It moved only about a fifth of the weight-space distance distillation moved. And, the part that took me a while to accept, it moved in a direction only 0.105 cosine-similar to distillation's direction. Near-orthogonal, in 353 dimensions.

I almost stopped there. But a colleague pointed out, correctly, that cosines in high-dimensional spaces can be misleading. Two random 353-dimensional unit vectors aren't exactly orthogonal (they're slightly correlated), and a cosine of 0.1 between them is perfectly normal noise.
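That baseline intuition is checkable in a few lines. A sketch using a standalone mulberry32 (the same seeding scheme the repo uses, but this version is my own, not imported from src/): sample pairs of random 353-dimensional directions and measure their cosines.

```javascript
// Sketch: cosines between independent random 353-dim directions concentrate
// near 0 with std ~ 1/sqrt(353) ~ 0.053. Seeded RNG keeps it deterministic.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
const rand = mulberry32(42);
// Box-Muller: i.i.d. normals give a uniformly random direction after normalization.
const gauss = () => Math.sqrt(-2 * Math.log(1 - rand())) * Math.cos(2 * Math.PI * rand());
const randVec = (d) => Array.from({ length: d }, gauss);
const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0);
const cosSim = (a, b) => dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));

const cosines = Array.from({ length: 500 }, () => cosSim(randVec(353), randVec(353)));
const mean = cosines.reduce((s, v) => s + v, 0) / cosines.length;
const std = Math.sqrt(cosines.reduce((s, v) => s + (v - mean) ** 2, 0) / cosines.length);
// mean ~ 0, std ~ 0.053: a cosine of 0.105 is only ~2 sigma for *independent*
// directions, which is why the 0.9997 same-seed baseline is load-bearing.
```

In isolation, 0.105 really is compatible with "two unrelated directions"; the argument only bites once the same-seed baseline is on the table.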
"Near-orthogonal" doesn't mean anything until you know what baseline noise looks like for this setup. So I ran the baseline. Same θ₀. Five online runs with different trainer RNGs (same workload, same hyperparameters; only the minibatch sampling order differs). Measure the pairwise cosine between the five resulting Δ's.

| Pair | cos |
|---|---|
| 42 ↔ 43 | 0.9997 |
| 42 ↔ 44 | 0.9998 |
| 42 ↔ 45 | 0.9998 |
| 42 ↔ 46 | 0.9996 |
| 43 ↔ 44 | 0.9999 |
| 43 ↔ 45 | 0.9999 |
| 43 ↔ 46 | 0.9995 |
| 44 ↔ 45 | 0.9999 |
| 44 ↔ 46 | 0.9995 |
| 45 ↔ 46 | 0.9996 |
| **mean** | **0.9997** |

Online SGD, varying only in minibatch order, takes essentially the same direction across runs. Not a random direction that happens to land near 0.1 cosine with distillation. A direction. A specific, reproducible, trainer-seed-invariant direction: one that agrees with other online runs at three nines, and agrees with distillation at one-tenth.

The numbers admit a clean quantitative reading. The expected pairwise cosine between two random 353-dimensional unit vectors is 0, with standard deviation 1/√353 ≈ 0.053; two independently drawn random vectors typically land in [−0.16, +0.16] at 3σ. The online-online baseline at 0.9997 is therefore ~19σ above the random mean: SGD trajectories under this setup are essentially deterministic up to minibatch order. The online-distill cosine of 0.105 sits within the random 3σ band; in isolation, it is indistinguishable from noise. Against the online-online baseline, the gap is

Δcos = 0.9997 − 0.105 ≈ 0.895

Measured against the online-online standard deviation of σ ≈ 0.00014 across the ten pairs above, that is ~6400σ. Online SGD and distillation are not "slightly different directions with noise". They are two deterministic basins, orthogonally separated, under a same-initialization comparison.

That is the central finding of tempo.js. It reframes what Phase 5 measured. Phase 5 measured a jank-rate gap.
tempo.js locates where in the parameter space that gap lives: the online Predictor and a hypothetical "online-that-matches-B1" version are not a learning-rate adjustment away. They are not a "train longer" away. They are in different basins of a different-shaped loss surface, with the two surfaces disagreeing on where the minimum is.

**Experiment 4: the decision maps.** For sanity, render each scheduler's decision as a 2D ASCII slice over (ema_fast, ema_slow), with the other features pinned at their defaults. Grid agreement with B1 on a 48×12 probe:

| Scheduler | Agreement with B1 |
|---|---|
| MLP random-init (untrained) | 20.8% |
| MLP online (1 sawtooth rep) | 54.2% |
| MLP distilled (100 epochs) | 75.0% |
| B1 itself | 100.0% |

A clean ordering. Each step up is one additional supervisory signal: random → noisy online labels from self-play → explicit B1 labels on a B1-driven trajectory → the exact policy. The distilled MLP reaches 75% grid agreement despite 98% train-set agreement; the 25% gap is precisely the region of feature space the training trajectory never visited, which is exactly where the deployed distilled MLP fails its 6.67% of the time.

The panels themselves are worth eyeballing. Random init is, well, random: almost no structure. Online draws a smoothly sloped boundary that agrees with B1 at the easy corners and disagrees at the boundary. Distilled draws a sharp B1-shaped rectangle in the region the training trajectory visited, then smears into reduce and degrade in the corners it never saw.

Three claims, each with a pointer back to an experiment.

**Capacity is sufficient.** The 353-parameter MLP can represent a function indistinguishable from B1, to 98% train-set agreement, with a decision-boundary slope ~38× sharper than a single sigmoid would allow, at exactly the right threshold. (Experiment 2.)

**Covariate shift hides in the training distribution.** Even a perfectly fit distilled MLP fails to match B1 when deployed, because its own decisions shift the feature distribution off the trajectory it was trained on.
(Experiment 2, deployment.)

**Online SGD and distillation occupy different basins.** Cosine 0.105 at origin θ₀, with a same-seed baseline of 0.9997 for reference: a ~6400σ separation against the online-online noise floor. This is not a learning-rate gap or a training-length gap. It is a geometric fact about two loss landscapes that disagree on where the answer is. (Experiment 3, with baseline.)

The interpretation I have converged on, after staring at this for a while: Phase 5's ~5 pp residual gap on ramping workloads is a learnability gap under the online data-generation protocol. Not a capacity limit of the architecture. The Predictor has the parameters. It does not have a loss surface that leads them to the same place offline distillation would.

This is, I think, a more useful statement than "the MLP doesn't work". "Doesn't work" suggests a dead end. "Learnability gap under online self-generated data" suggests specific repair directions.

I am careful not to over-generalize. The setup is narrow: one simulated frame loop, four synthetic workloads, an MLP of a specific shape, one heuristic baseline. The claim I will defend as portable: when online learning fails to beat a heuristic at a task the architecture can demonstrably represent, the obstacle is more likely in the data-generation loop than in the optimizer's step size. That hypothesis is falsifiable, and the next three sections list how.

The connection to Phase 5 Part 2's pretrained + online cell is worth noting explicitly. Pretraining starts θ inside distillation's basin. Online dynamics then drift the weights back along the orthogonal direction experiment 3 quantifies. The observed ~40% closure of the scratch→B1 gap for pretrained + online is exactly the geometric balance point between those two forces: warm-start pushes the weights toward distillation's minimum, online pushes them back toward the online minimum, and you land between them. The headline number is consistent with the mechanism.

Three directions.
Each can be tested with the existing Phase 5 benchmark harness: no new architecture, no new dataset, no new metric.

**(a) DAgger-style relabeling.** Alternate MLP-scheduled rollouts and B1-scheduled rollouts. On the MLP's own rollouts, relabel each feature vector with what B1 would have decided, and add that to the training set. Retrain. Prediction: three iterations are enough to close the deployment gap in experiment 2; the distilled MLP goes from 6.67% jank to within seed noise of B1's 0%.

**(b) Distillation-anchored online loss.** Pretrain θ on the B1 trajectory (as in Part 2), then during online learning, minimize a two-term loss:

`L(θ; x, y) = BCE(f_θ(x), y) + λ · ‖f_θ(x) − f_{B1}(x)‖²`

where f_{B1} is the distilled MLP from Experiment 2 with frozen weights. The gradient is:

`∇_θ L = ∇_θ BCE + 2λ · (f_θ(x) − f_{B1}(x)) · ∇_θ f_θ(x)`

The penalty anchors θ near distillation's basin while online BCE still adapts to self-generated data. Predicted outcome: at λ ≈ 0.1, the online-to-B1 jank gap on sawtooth drops from ~5 pp to under 2 pp, verifiable by a Phase-5-style 120-run sweep.

**(c) Grid-supervised distillation.** Instead of sampling features from a B1 rollout (which has the trajectory-distribution blind spot), sample (ema_fast, ema_slow, …) uniformly over the full feature cube, label each point with B1, and train. Prediction: grid agreement rises from 75% to above 99%, and deployment jank drops to within seed noise of B1's.

All three predictions are mechanistically motivated. (a) attacks covariate shift directly. (b) attacks the orthogonal-basin problem by anchoring the online optimizer to offline's solution. (c) attacks the trajectory-distribution blind spot. If any of the three fails to produce its predicted improvement, the failure is diagnostic: it narrows which piece of the mechanism the story got wrong.

I am not claiming the 353-parameter MLP is the wrong size. Experiment 2 rules that out for this task.
A larger model might generalize better off-trajectory, but capacity is not the bottleneck for fitting the policy; the architecture can already do it. Scaling up would be solving a different problem (robustness to covariate shift) than the one experiment 3 isolates (objective mismatch).

I am not claiming online learning is a bad idea in general. Plenty of online learning setups work, most obviously ones where the label is independent of the learner's action. Browser frame scheduling is specifically the case where the label is influenced by the learner's action (the scheduler's cost decision is what produces the next dt), and that dependency is where the geometry breaks.

I am not claiming EMA thresholding is the right answer everywhere. On the burst workload, B1 already matches everything else; the task has no exploitable temporal structure for an EMA to smooth. Different workloads will have different right answers. What I am claiming is that within the class of tasks where the online distribution depends on the policy, the direction online SGD descends is not the direction offline supervision descends, and the gap is not closable by adjusting online's hyperparameters alone.

And I am not claiming the Phase 5 numbers are wrong. They are correct, as reported. What tempo.js adds is a mechanism for them.
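Repair direction (b) above is concrete enough to sketch end-to-end. Here it is reduced to a single logistic unit standing in for the MLP (my toy reduction, not the repo's code; the frozen teacher output f_B1 is just a scalar here), with the analytic gradient checked against a numerical one, in the spirit of the repo's vitest gradcheck.

```javascript
// Sketch of the distillation-anchored loss from direction (b), reduced to a
// single logistic unit f(x) = sigmoid(w*x + b). Toy stand-in, not repo code.
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// L = BCE(f(x), y) + lambda * (f(x) - fB1)^2, with fB1 the frozen teacher output.
function anchoredLoss(w, b, x, y, fB1, lambda) {
  const f = sigmoid(w * x + b);
  return -(y * Math.log(f) + (1 - y) * Math.log(1 - f)) + lambda * (f - fB1) ** 2;
}

// Analytic gradient. For BCE through a sigmoid, dL/dz = f - y; the anchor
// term adds 2*lambda*(f - fB1) * f*(1 - f) via the chain rule.
function anchoredGrad(w, b, x, y, fB1, lambda) {
  const f = sigmoid(w * x + b);
  const dz = (f - y) + 2 * lambda * (f - fB1) * f * (1 - f);
  return { dw: dz * x, db: dz };
}
```

At λ = 0 this is plain online BCE; as λ grows, the optimizer is pulled toward reproducing the frozen distilled output even where the self-generated labels disagree, which is exactly the anchoring the basin argument calls for.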
Tempo's repository separates the production pipeline from the mechanistic analysis:

```text
Tempo-js/
├── src/core/
│   ├── predictor.js            # 353-param MLP, Float32 forward/backward
│   ├── trainer.js              # Online SGD + momentum + grad-clip
│   ├── features.js             # 12-dim rolling feature extractor
│   └── schedulers.js           # B0, B1, PredictorScheduler
├── src/harness/
│   ├── sequential-loop.js      # Shared execution core
│   ├── workloads.js            # 4 workload generators
│   └── pretrained.js           # Phase 5 Part 2 inlined weights
├── scripts/
│   ├── benchmark.js            # Puppeteer headless runner
│   ├── analyze.js              # Mann–Whitney U, Cohen's d, bootstrap
│   └── generate-pretrained.js  # Distillation from Part 1 shadow log
├── docs/
│   ├── METHODOLOGY.md          # Full experimental protocol
│   ├── RESULTS.md              # Primary + secondary tables
│   └── PHASE5_*.md             # Per-phase raw results
└── tempo.js                    # The four experiments, one file
```

The modular codebase under src/ is what Phase 5 actually ran. tempo.js at the root is the self-contained mechanistic companion: no imports from src/, ~600 lines, same MLP architecture, same constants. Both reference the same deterministic seed protocol (mulberry32, seed 42). Repo: github.com/Kit4Some/Tempo-js.

Everything in this post is backed by code that runs on a laptop:

- `node tempo.js`: the four-experiment report (sawtooth, seed=42, 10 reps). Finishes in about ten seconds.
- `node tempo.js burst` or `node tempo.js scroll`: other workloads, benchmark only.
- `npm run bench:part1`: Part 1's 120 Puppeteer runs. About two and a half hours.
- `npm run bench:part2`: Part 2's 88 runs. About an hour forty.
- `npm test`: 260 vitest cases, including the analytic-vs-numeric gradcheck.

Raw results: docs/PHASE5_PART1_RESULTS.jsonl, docs/PHASE5_PART2_RESULTS.jsonl. Protocol: docs/METHODOLOGY.md. tempo.js source: tempo_onlycode.js.

Inspired by Karpathy's microGPT: "the full algorithmic content of what is needed." tempo.js tries to be that for this particular small story.
If the mechanism I described turns out to be wrong, the four experiments in that file are the right place to find out.

Tempo is a research artifact, not a maintained library. Numbers in this post are from Phase 5 Part 1 (n = 10 per cell, n = 12 for B1 after drift-check) and Phase 5 Part 2 (n = 10 per cell, 88 runs total). Cohen's d values in the raw-results tables are unusually large due to minimal run-to-run variance in headless Chrome; see docs/PHASE5_COHENS_D_VALIDATION.md. Full effect-size treatment in docs/RESULTS.md.
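On that last caveat, the inflation is visible directly in the formula. A minimal sketch of the standard pooled-SD Cohen's d (my own version; the repo's actual computation lives in scripts/analyze.js):

```javascript
// Standard Cohen's d with pooled standard deviation (my minimal version;
// the repo's implementation is in scripts/analyze.js).
function cohensD(a, b) {
  const mean = (xs) => xs.reduce((s, v) => s + v, 0) / xs.length;
  const ssq = (xs, m) => xs.reduce((s, v) => s + (v - m) ** 2, 0);
  const ma = mean(a), mb = mean(b);
  // Pooled SD over n1 + n2 - 2 degrees of freedom.
  const pooled = Math.sqrt((ssq(a, ma) + ssq(b, mb)) / (a.length + b.length - 2));
  return (ma - mb) / pooled;
}
```

The denominator is the run-to-run spread; when headless-Chrome runs are nearly deterministic, it approaches zero, so even a modest mean difference yields an enormous |d|.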