Historical Baseline — Sprint 1 Initial Notes¶

Date: 2026-05-02 Sprint goal: first working historical baseline of frontier training compute.

What's in this sprint¶

Repo + Python env scaffolded (uv, pandas, numpy, statsmodels, matplotlib).
Raw Epoch CSVs pulled to data/raw/.
Cleaning + frontier-flagging + log-linear fitting modules in model/.
Processed dataset at data/processed/historical_models.{csv,parquet} (1,011 rows × 35 cols).
First chart: outputs/charts/historical_compute_over_time.png.
First trend table: outputs/tables/initial_compute_growth_estimates.csv.

1,011 notable models, 521 with training compute, 179 with training cost. Compute coverage is strong post-2018 (≥30/yr through 2025); cost coverage is much sparser and will require careful handling in the cost sprint.
Epoch's Frontier model flag is set for only 123 rows. It is more conservative than any of our three rules — useful as a sanity benchmark.
2026 partial year: only 2 rows have known compute so far; trend fits that include 2026 are slightly tilted by recent disclosures and should be read with that caveat.
Long pre-2010 tail: dozens of pre-deep-learning systems (perceptrons, early reinforcement learners) with very small FLOP counts. Including them in a single linear fit pulls the slope down — this is why the 2018+ window produces a much steeper slope than the full window.

From outputs/tables/initial_compute_growth_estimates.csv:

Window	Rule	n	Annual ×	Doubling	R²
2018+	all models	370	6.30×	4.5 mo	0.50
2018+	Rule A (top-10)	113	5.97×	4.7 mo	0.84
2018+	Rule B (top/org/yr)	264	6.38×	4.5 mo	0.46
2018+	Rule C (≥1e23 FLOP)	137	2.00×	12.0 mo	0.30
Full	all models	521	2.12×	11.1 mo	0.76
Full	Rule A	245	2.14×	10.9 mo	0.83
Full	Epoch flag	113	2.05×	11.6 mo	0.89

Two early findings worth noting:

Frontier definition does change the answer. In the 2018+ window the implied annual growth multiplier ranges from ~2.0× (Rule C) to ~6.4× (Rule B). Rule C's slow rate is a selection effect: once you condition on FLOP ≥ 1e23 you have already truncated the lower tail, so the remaining variation through time is much smaller.
Rule A 2018+ has the cleanest fit (R²=0.84) and the most defensible reading. Headline candidate for the supply base case: ~6× per year, doubling every ~5 months.

A handful of pre-2010 entries jump several OOM above neighbors — these are likely AlphaGo predecessors and game-playing RL systems where the compute estimate aggregates self-play (lots of experience, low parameter count). Worth checking whether they should sit in the same trend as modern pretraining FLOP.
2024–2026 cluster around 1e25–1e26 with very tight grouping — consistent with reports of frontier-lab convergence on similar compute budgets.
Rule B picks up many small-lab models with relatively low compute, which depresses its R². Not a bug — it's the expected behavior of the rule and a good demonstration of why we run three definitions.

In rough priority order:

Cost trend fits (04_cost_trend_analysis.ipynb). Same regression structure on estimated_training_cost_usd, plus sensitivity across the headline / cloud / upfront cost variants.
Cost-per-FLOP trend. Should be strongly negative. Hardware-era diagnostics (V100 vs A100 vs H100/Blackwell era).
Residual diagnostics by year and by organization. outputs/charts/historical_residuals_compute.png, outputs/charts/historical_residuals_cost.png.
Organization-level breakouts — outputs/charts/historical_compute_by_organization.png and outputs/charts/historical_cost_by_organization.png.
Hardware / cluster descriptive timeline (no model, just charts + tables).
Final memo docs/historical_findings.md with explicit the supply capacity model handoff parameters (base / fast / slow compute and cost growth assumptions, recommended historical window, recommended frontier rule).

Should Rule A use a top-N percentile rather than a fixed N? Top-10 picks up smaller frontier sets in 2010 and oversamples them in 2024 when many more models are released per year. A fractional rule would be more stable.
What weighting (if any) for the regression? Currently OLS. We may want WLS using compute_estimate_quality as inverse-variance.
Cost in 2023 USD vs nominal: Epoch already inflation-adjusts — but the cost-per-FLOP trend will look different against nominal hardware prices, which is what cluster-builders actually paid. Decision to be documented in the cost-trend sprint.
Pre-2018 data inclusion: keep as a separate "long-run" comparison fit; do not mix into the supply base case.

These are not final — final values will live in historical_findings.md once the cost and residual sprints are complete. Provisional from sprint 1:

Recommended historical window: 2018-01-01 to last full quarter of data.
Recommended frontier rule: Rule A (top-N at release) for the base case.
Provisional base compute growth: ~6× per year, doubling ~5 months.
Provisional fast / slow bounds: 4×–9× per year (rough; tighten via bootstrap CIs in a later sprint).