Historical Baseline: Findings
Author: automated analysis pipeline
Date: 2026-05-02
Status: Historical baseline complete. Supply-capacity handoff parameters are in Section 9.
1. Summary
Under the recommended Frontier Rule A (top-10 by training compute at release), 2018+ window, the historical record from Epoch AI says:
| Metric | Annual multiplier | Doubling | R² | n |
|---|---|---|---|---|
| Training compute (FLOP) | 5.97× | 4.7 mo | 0.84 | 113 |
| Training cost (2023 USD) | 4.89× | 5.2 mo | 0.72 | 74 |
| Cost per FLOP (2023 USD/FLOP) | 0.76× | n/a (decline) | 0.21 | 74 |
In plain English: from 2018 through early 2026, frontier training runs grew by roughly 6× per year in raw compute and 5× per year in inflation-adjusted dollars, while the price per training FLOP fell ~24%/yr.
The single most important historical-baseline caveat is that these numbers are moderately sensitive to frontier definition and very sensitive to cost variant. Sections 5–8 quantify both.
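The multiplier and doubling columns in the summary table are two views of the same slope. As a quick check, assuming constant exponential growth, an annual multiplier m implies a doubling time of 12·ln 2 / ln m months. The helper names below are illustrative and are not taken from the pipeline.

```python
import math

def doubling_months(annual_multiplier: float) -> float:
    """Months to double, assuming constant exponential growth."""
    if annual_multiplier <= 1.0:
        raise ValueError("doubling time is only defined for multipliers above 1")
    return 12.0 * math.log(2.0) / math.log(annual_multiplier)

def annual_decline_pct(annual_multiplier: float) -> float:
    """Percent decline per year implied by a multiplier below 1."""
    return (1.0 - annual_multiplier) * 100.0

# Values taken from the summary table (Rule A, 2018+).
compute_doubling = doubling_months(5.97)
cost_doubling = doubling_months(4.89)
flop_price_decline = annual_decline_pct(0.76)
```

Rounded to one decimal, the two doubling times reproduce the 4.7 and 5.2 months in the table. The table's 24.2% decline is presumably computed from the unrounded multiplier, so the rounded 0.76 gives 24%.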
2. Data sources
- Primary: Epoch AI "Notable AI Models" CSV (https://epoch.ai/data/notable_ai_models.csv), retrieved 2026-05-02. 1,011 model rows, 1950–2026.
- Cross-checks: Epoch's `frontier_ai_models.csv` and `large_scale_ai_models.csv` (downloaded but not used in headline fits; Epoch's own frontier subset is replicated in our `epoch_frontier_flag` column).
- Local snapshot: `data/raw/epoch_*.csv` (immutable).
- Processed dataset: `data/processed/historical_models.{csv,parquet}` (1,011 rows × 35 cols).
Mappings from Epoch column names to historical-baseline schema names: docs/data_dictionary.md.
3. Inclusion criteria
- All 1,011 notable models retained in the processed dataset (full row preservation).
- For trend fits, rows are dropped only when the relevant `y` column is missing.
- Compute coverage: 521 / 1,011 models have a known `training_compute_flop`. Coverage is strong post-2018 (≥30/year through 2025).
- Cost coverage is sparser: 179 models with the headline 2023-USD cost figure, 33 with explicit upfront cost, 25 with explicit cloud-rental cost.
- Date quality: 1,007 / 1,011 rows have a parseable publication date.
- Confidence flags: Epoch's `Confidence` field is mapped to `compute_estimate_quality` (high / medium / low).
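The row-dropping rule above can be sketched as a per-fit filter over the full dataset. This is a minimal illustration, not the pipeline's code: `training_compute_flop` comes from the memo, while `publication_date`, `training_cost_2023_usd` and the toy rows are hypothetical. Requiring a parsed date is also an assumption, made here because a time-trend fit needs an x value.

```python
def rows_for_fit(rows, y_col, date_col="publication_date"):
    """Keep rows usable for one trend fit: a known y value and a parsed date."""
    return [r for r in rows if r.get(y_col) is not None and r.get(date_col) is not None]

def coverage(rows, col):
    """Return (known, total) counts for one column."""
    known = sum(1 for r in rows if r.get(col) is not None)
    return known, len(rows)

# Made-up rows for illustration only; None marks a missing value.
toy_rows = [
    {"publication_date": "2020-05-28", "training_compute_flop": 3.1e23, "training_cost_2023_usd": 4.3e6},
    {"publication_date": "2019-02-14", "training_compute_flop": 1.5e21, "training_cost_2023_usd": None},
    {"publication_date": None, "training_compute_flop": 2.0e22, "training_cost_2023_usd": 1.0e5},
    {"publication_date": "2021-01-05", "training_compute_flop": None, "training_cost_2023_usd": None},
]

compute_rows = rows_for_fit(toy_rows, "training_compute_flop")
cost_rows = rows_for_fit(toy_rows, "training_cost_2023_usd")
compute_coverage = coverage(toy_rows, "training_compute_flop")
```

Filtering per fit rather than once up front keeps all rows in the processed dataset while letting the compute and cost fits use different subsets.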
4. Frontier definitions
We use three independent rules plus Epoch's own flag as a sanity check. There is no neutral definition of "frontier" and treating any single rule as authoritative is precisely the kind of unforced error this project is trying to avoid.
| Rule | Definition | n flagged (full) | n flagged (2018+) |
|---|---|---|---|
| A | Top 10 by training compute within the 1-year window ending at release | 245 | 113 |
| B | Highest-compute model per organization per calendar year | 378 | 264 |
| C | `training_compute_flop` ≥ 1e23 | 137 | 137 |
| (Epoch) | Epoch's curated `Frontier model` boolean | 123 | n/a |
Rule C is the most restrictive but also the most fragile — by truncating the lower tail it removes most within-year variation, which collapses both its slope estimate and its R².
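The three rules can be sketched as flagging functions. This is one possible reading of the definitions in the table, not the pipeline's implementation: in particular, Rule A is interpreted here as ranking each model against everything released in the 365 days up to and including its own release date. The record keys (`org`, `date`, `flop`) and the toy models are hypothetical.

```python
from datetime import date, timedelta

def rule_a(models, top_n=10, window_days=365):
    """Flag models in the top_n by compute among models released in the
    window_days ending at (and including) their own release date."""
    flags = []
    for m in models:
        start = m["date"] - timedelta(days=window_days)
        peers = [p["flop"] for p in models if start < p["date"] <= m["date"]]
        rank = 1 + sum(1 for f in peers if f > m["flop"])
        flags.append(rank <= top_n)
    return flags

def rule_b(models):
    """Flag the highest-compute model per organization per calendar year."""
    best = {}
    for i, m in enumerate(models):
        key = (m["org"], m["date"].year)
        if key not in best or m["flop"] > models[best[key]]["flop"]:
            best[key] = i
    winners = set(best.values())
    return [i in winners for i in range(len(models))]

def rule_c(models, threshold=1e23):
    """Flag models at or above a fixed training-compute threshold."""
    return [m["flop"] >= threshold for m in models]

# Made-up models; top_n is lowered to 2 so the toy set shows an exclusion.
models = [
    {"org": "LabX", "date": date(2022, 3, 1), "flop": 3e23},
    {"org": "LabX", "date": date(2022, 9, 1), "flop": 8e23},
    {"org": "LabY", "date": date(2022, 6, 1), "flop": 5e22},
    {"org": "LabY", "date": date(2023, 2, 1), "flop": 2e24},
    {"org": "LabZ", "date": date(2023, 1, 15), "flop": 1e22},
]
flags_a = rule_a(models, top_n=2)
flags_b = rule_b(models)
flags_c = rule_c(models)
```

Even on five rows the rules disagree: the LabZ model is a Rule B frontier model (best of its lab that year) but fails both Rule A and Rule C, which is the kind of definitional spread the table counts reflect.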
5. Historical compute trend
Full table: outputs/tables/historical_trend_estimates.csv. Compute rows below.
| Window | Rule | n | ×/yr | Doubling (mo) | R² |
|---|---|---|---|---|---|
| 2018+ | all models | 370 | 6.30 | 4.5 | 0.50 |
| 2018+ | A (top-10) | 113 | 5.97 | 4.7 | 0.84 |
| 2018+ | B (top/org/yr) | 264 | 6.38 | 4.5 | 0.46 |
| 2018+ | C (≥1e23) | 137 | 2.00 | 12.0 | 0.30 |
| Full | all models | 521 | 2.12 | 11.1 | 0.76 |
| Full | A | 245 | 2.14 | 10.9 | 0.83 |
| Full | B | 378 | 2.02 | 11.9 | 0.75 |
| Full | Epoch flag | 113 | 2.05 | 11.6 | 0.89 |
Headline reading: the modern (2018+) frontier-compute trend is ~6×
per year under Rule A or B and converges with the all-models trend. The
long-run trend (1950–2026) is ~2× per year. The two regimes are real and
visible in the chart outputs/charts/historical_compute_over_time.png.
Rule C's 2× answer for 2018+ is a selection artifact rather than a disagreement: once you require ≥1e23 FLOP, the bottom of the post-2018 distribution disappears and you are left fitting a shorter dynamic range. Treat it as a downward-biased floor on the growth rate, not as a credible "slow" estimate.
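The truncation effect is easy to reproduce on synthetic data. The sketch below assumes the trend fits are ordinary least squares of log10(value) on decimal year (the memo says OLS but does not spell out the functional form), and uses a made-up population whose median grows 10^0.8 (about 6.3×) per year with a fixed spread of two orders of magnitude either side.

```python
import math

def fit_log_trend(years, values):
    """OLS of log10(value) on decimal year. Returns (annual multiplier, R^2)."""
    logs = [math.log10(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    syy = sum((y - mean_y) ** 2 for y in logs)
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return 10.0 ** slope, r2

# Synthetic population: five models per year, 2018 through 2025.
years, flops = [], []
for t in range(8):
    for offset in (-2, -1, 0, 1, 2):
        years.append(2018 + t)
        flops.append(10.0 ** (20.0 + 0.8 * t + offset))

full_mult, full_r2 = fit_log_trend(years, flops)

# Rule C style truncation: keep only rows above a fixed compute threshold.
threshold = 10.0 ** 22.9
kept = [(y, f) for y, f in zip(years, flops) if f >= threshold]
trunc_mult, trunc_r2 = fit_log_trend([y for y, _ in kept], [f for _, f in kept])
```

In this toy population the full fit recovers the true 6.3× per year, while the thresholded fit lands near 2.8× with a much lower R². Early years contribute only their largest models, which flattens the fitted line without any change in the underlying growth rate.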
6. Historical cost trend
| Window | Rule | n | ×/yr | Doubling (mo) | R² |
|---|---|---|---|---|---|
| 2018+ | all models | 155 | 3.13 | 7.3 | 0.27 |
| 2018+ | A | 74 | 4.89 | 5.2 | 0.72 |
| 2018+ | B | 121 | 3.02 | 7.5 | 0.23 |
| 2018+ | C | 51 | 3.03 | 7.5 | 0.50 |
| Full | A | 98 | 3.16 | 7.2 | 0.73 |
| Full | Epoch flag | 56 | 3.28 | 7.0 | 0.91 |
Headline cost reading: under Rule A 2018+, frontier training costs grew ~5× per year, doubling about every 5 months. Under Epoch's own flag (full window) the fit is even cleaner (R² = 0.91) at ~3.3× per year, doubling about every 7 months.
Note the spread across rules and windows: roughly 3× to 5× per year. We treat ~4× per year, about midway between the Rule A 2018+ and Epoch-flag estimates, as the central estimate.
Chart: outputs/charts/historical_cost_over_time.png.
Cost variant sensitivity (historical-baseline critical finding)
Epoch publishes three cost columns. Under Rule A 2018+, the same trend looks very different depending on which one you fit:
| Cost variant | ×/yr | Doubling (mo) | R² | n |
|---|---|---|---|---|
| Headline (2023 USD) | 4.89 | 5.2 | 0.72 | 74 |
| Cloud-rental | 3.41 | 6.8 | 0.85 | 22 |
| Upfront-hardware | 2.49 | 9.1 | 0.84 | 27 |
The headline figure grows almost twice as fast as the upfront-hardware figure. This is not noise — it is a real divergence:
- Upfront cost captures the price of the chips actually installed. Hardware prices (especially per-FLOP) have fallen, so total upfront cost grows more slowly than total compute.
- Cloud-rental cost is what you'd pay a hyperscaler at posted rates; it is closer to opportunity cost.
- Headline 2023 USD is Epoch's blended figure, closest to cloud-rental but with broader coverage.
For the supply capacity model we recommend carrying all three cost variants forward, with the headline 2023-USD figure as the base case and explicit fast/slow bounds that reflect cost-variant disagreement.
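To see why the variants should be carried separately, it helps to compound them forward. The sketch below is illustrative only: the multipliers are the Rule A 2018+ values from the table above, while the dictionary keys, the $100M starting cost and the three-year horizon are hypothetical choices.

```python
import math

# Rule A, 2018+ annual multipliers from the cost-variant table.
VARIANTS = {
    "headline_2023_usd": 4.89,
    "cloud_rental": 3.41,
    "upfront_hardware": 2.49,
}

def project(cost_now: float, annual_multiplier: float, years: float) -> float:
    """Compound a starting cost forward at a constant annual multiplier."""
    return cost_now * annual_multiplier ** years

def scenario_table(cost_now: float, years: float) -> dict:
    """One projection per cost variant, kept separate rather than averaged."""
    return {name: project(cost_now, m, years) for name, m in VARIANTS.items()}

# Hypothetical starting point: a $100M run, projected 3 years out.
projections = scenario_table(100e6, 3)
spread = projections["headline_2023_usd"] / projections["upfront_hardware"]
```

After three years the headline and upfront-hardware projections differ by a factor of roughly 7.6, so a single averaged rate would hide most of the decision-relevant range.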
7. Cost per FLOP trend
| Window | Rule | n | ×/yr | Annual decline | R² |
|---|---|---|---|---|---|
| 2018+ | A | 74 | 0.76 | 24.2% | 0.21 |
| 2018+ | C | 51 | 0.88 | 11.7% | 0.10 |
| Full | A | 98 | 0.69 | 31.1% | 0.46 |
| Full | Epoch flag | 56 | 0.72 | 27.8% | 0.68 |
Cost per FLOP declines roughly 25–30% per year across most reasonable cuts of the data. R² is materially lower than for compute or cost in isolation — unsurprising, because cost-per-FLOP combines the noise of two already-uncertain estimates and the dataset thins to ~50–100 rows.
The historical base estimate is ~25%/yr decline (0.75×/yr) for the modern window, with bounds at ~12% (Rule C, 2018+) and ~31% (Rule A, full window) reflecting rule and window sensitivity.
Chart: outputs/charts/historical_cost_per_flop_over_time.png.
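When all three series are fitted on the same rows, the cost-per-FLOP multiplier is exactly the cost multiplier divided by the compute multiplier, because OLS slopes are linear in the response. The sketch below demonstrates that identity on made-up rows, assuming log-linear OLS fits. In the summary table the ratio 4.89 / 5.97 is about 0.82 rather than the fitted 0.76; one likely reason is that the compute fit uses 113 rows and the cost fits only 74.

```python
import math

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical rows: (decimal year, training FLOP, cost in 2023 USD).
rows = [
    (2018.5, 2e21, 4e5),
    (2019.8, 3e22, 3e6),
    (2021.1, 4e23, 9e6),
    (2022.4, 2e24, 6e7),
    (2023.9, 5e25, 3e8),
]
years = [r[0] for r in rows]
log_flop = [math.log10(r[1]) for r in rows]
log_cost = [math.log10(r[2]) for r in rows]
log_cpf = [c - f for c, f in zip(log_cost, log_flop)]

compute_mult = 10 ** ols_slope(years, log_flop)
cost_mult = 10 ** ols_slope(years, log_cost)
cpf_mult = 10 ** ols_slope(years, log_cpf)
```

A practical consequence is that a cost-per-FLOP rate derived by dividing two headline multipliers is only comparable to a directly fitted one if both come from the same row set.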
8. Key uncertainties
- Cost variant divergence (largest single uncertainty). Same frontier rule, same window, same regression, yet the implied annual cost-growth multiplier ranges from 2.5× to 4.9× depending on which cost column you use (and the variants are fitted on different row counts: 74, 22 and 27). The supply capacity model must not silently average these.
- Cost coverage is thin. 74 frontier-rule-A rows have headline cost, only 22–27 have the more hardware-grounded variants. Late 2025 / 2026 are particularly sparse.
- 2026 partial year. Only 2 models with known compute disclosed so far. Strong negative residuals for 2026 in `residuals_compute_trend.png` probably reflect right-truncation, not a deceleration.
- OLS, no error model. Epoch's `Confidence` field is collected but not yet used in fitting. A WLS variant with weights derived from confidence (as a proxy for inverse variance) would probably tighten standard errors but is not expected to change central estimates materially, given how concentrated `Confident` ratings are post-2020.
- Selection artifacts in Rule C. Already discussed in Sections 4 and 5.
- Organization-level structure. The residual-by-org chart shows OpenAI, Google, Meta, DeepMind, NVIDIA, and xAI sit consistently above the trend line under Rule A 2018+; Alibaba, Anthropic, and Google Brain sit below. The supply capacity model should not assume one global rate fits all labs.
- Long-tail pre-1990 data. Including pre-deep-learning systems pulls the long-run slope down. The two-regime (slow until ~2010, fast after) structure is visible in every chart and is one of the cleaner findings in this dataset.
- Hardware quality / cluster scale data is too sparse for a model. 199 frontier rows with accelerator counts and 204 with training duration give a credible descriptive timeline (`outputs/charts/historical_hardware_timeline.png`) but not enough for a reliable multivariate fit yet.
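The WLS variant mentioned in the list above could look like the following. This is a sketch under stated assumptions: the memo does not specify how quality levels map to weights, so `QUALITY_WEIGHT` is hypothetical, and the toy data are made up.

```python
# Hypothetical weights per quality level; treated as relative inverse variances.
QUALITY_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.25}

def wls_slope(xs, ys, weights):
    """Weighted least-squares slope of y on x."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    return sxy / sxx

years = [2018.0, 2019.0, 2020.0, 2021.0, 2022.0, 2023.0]
log_flop = [21.0, 21.9, 22.5, 23.4, 24.3, 24.8]  # toy log10(FLOP) values
quality = ["low", "medium", "high", "high", "high", "high"]

# Equal weights reduce WLS to ordinary least squares.
ols = wls_slope(years, log_flop, [1.0] * len(years))
wls = wls_slope(years, log_flop, [QUALITY_WEIGHT[q] for q in quality])
```

When most rows share the same quality level, as in this toy series, the weighted and unweighted slopes stay close, which is consistent with the expectation stated above. Confirming that on the real data would still require running both fits.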
9. Recommended assumptions for the supply capacity model
These are the explicit handoff parameters. Each is a recommendation, not a forecast — the corresponding fast/slow bounds are intentionally wide enough to bracket the rule-sensitivity and variant-sensitivity exposed in Sections 5–7.
Base compute growth assumption: 6.0×/yr (Rule A, 2018+)
Fast compute growth assumption: 6.4×/yr (Rule B, 2018+)
Slow compute growth assumption: 2.0×/yr (long-run, 1950+)
Base training-cost growth assumption: 4.0×/yr (mid of Rule A 2018+ vs Epoch-flag full)
Fast training-cost growth assumption: 4.9×/yr (Rule A 2018+, headline 2023 USD)
Slow training-cost growth assumption: 2.5×/yr (Rule A 2018+, upfront-hardware variant)
Base cost-per-FLOP decline assumption: 0.75×/yr (~25% / yr decline)
Fast cost-per-FLOP decline assumption: 0.69×/yr (~31% / yr decline)
Slow cost-per-FLOP decline assumption: 0.88×/yr (~12% / yr decline)
Recommended historical window: 2018-01-01 → most recent full quarter
Recommended frontier definition: Rule A (top-10 at release), with Rule B + Epoch flag as cross-checks
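For a downstream consumer, the handoff parameters above could be carried as a small structure with an ordering check. The values are copied from the list; the dictionary layout, key names and `check_ordering` helper are hypothetical and only illustrate one way to encode them.

```python
HANDOFF = {
    "compute_growth": {"slow": 2.0, "base": 6.0, "fast": 6.4},
    "training_cost_growth": {"slow": 2.5, "base": 4.0, "fast": 4.9},
    # For cost per FLOP, "fast" means the steeper decline (smaller multiplier).
    "cost_per_flop": {"slow": 0.88, "base": 0.75, "fast": 0.69},
}

def check_ordering(params: dict) -> None:
    """Raise if any base value falls outside its slow/fast bracket."""
    for name, p in params.items():
        lo, hi = sorted((p["slow"], p["fast"]))
        if not lo <= p["base"] <= hi:
            raise ValueError(f"{name}: base {p['base']} outside [{lo}, {hi}]")

check_ordering(HANDOFF)

# Cost per FLOP implied by the other two base rates, for comparison with
# the separately fitted 0.75. The fits use different row sets, so exact
# agreement is not expected.
implied_cost_per_flop = (
    HANDOFF["training_cost_growth"]["base"] / HANDOFF["compute_growth"]["base"]
)
```

The implied value (about 0.67) differs from the fitted base of 0.75, so a model that takes all three base rates as inputs should decide which two it treats as primary.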
Known weaknesses
- Cost-variant divergence (2.5× ↔ 4.9×) is wider than rule-choice divergence and is not captured by quoting a single 2023-USD figure with a CI. The supply capacity model must explicitly track at least the cloud and upfront variants in parallel.
- The cost-per-FLOP fit is the noisiest of the three trends (R² = 0.21 under the headline rule). Treat its central estimate as ±10 percentage points on the annual decline rate, not ±2.
- Right-truncation: late-2025 and 2026 disclosures will continue to arrive for months. Re-fitting in Q3 2026 should tighten the modern-window slopes and may revise the headline numbers slightly upward (reporting bias historically favors high-compute systems).
- Organization-level residual structure is not modeled. The supply capacity model may need lab-level effects rather than a single global rate.
10. Open questions
- Should Rule A use a top-N percentile (e.g. top 10%) rather than a fixed top-10? The current rule under-counts the modern era when many more frontier-grade models are released per year.
- Should the fits be WLS with weights derived from `Confidence` as a proxy for inverse variance? Likely small effect, but worth confirming.
- Should we collect non-Epoch cost figures (lab disclosures, hardware purchase reports, public cloud pricing) as a follow-up cross-check on the cost-variant divergence?
- Do we need a separate "training compute including post-training" trend? RLHF/RLAIF and inference-time scaling are now non-negligible, and Epoch's `Finetune compute` column is sparse but populated for some recent models.
- Should the next phase formalize the two-regime finding (slow pre-2010, fast post-2010) with a piecewise fit and a structural break test rather than assuming a single 2018+ window?
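For the last question, one standard option is a Chow test: fit one line to the pooled data and separate lines either side of a candidate break, then compare residual sums of squares. The sketch below computes only the F statistic on made-up data with a kink at 2010. The break year, slopes and noise pattern are illustrative assumptions, and a p-value would additionally need the F(k, n − 2k) distribution.

```python
def fit_rss(xs, ys):
    """OLS line fit; returns (slope, intercept, residual sum of squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss

def chow_f(xs, ys, break_x):
    """Chow F statistic for a break at break_x, with k = 2 parameters per segment."""
    early = [(x, y) for x, y in zip(xs, ys) if x < break_x]
    late = [(x, y) for x, y in zip(xs, ys) if x >= break_x]
    k = 2
    _, _, rss_pooled = fit_rss(xs, ys)
    _, _, rss_early = fit_rss(*zip(*early))
    _, _, rss_late = fit_rss(*zip(*late))
    rss_split = rss_early + rss_late
    return ((rss_pooled - rss_split) / k) / (rss_split / (len(xs) - 2 * k))

# Toy log10(FLOP) series: slow growth before 2010, fast after, small wobble.
xs = [float(year) for year in range(1990, 2026)]
ys = []
for i, x in enumerate(xs):
    trend = 10.0 + 0.15 * (x - 1990) if x < 2010 else 13.0 + 0.6 * (x - 2010)
    ys.append(trend + (0.1 if i % 2 == 0 else -0.1))

f_stat = chow_f(xs, ys, 2010.0)
early_slope = fit_rss(xs[:20], ys[:20])[0]
late_slope = fit_rss(xs[20:], ys[20:])[0]
```

A large statistic indicates that two segments fit far better than one line. On real data the break date would itself be uncertain, so a search over candidate breaks (with the corresponding adjustment to critical values) is likely more appropriate than fixing 2010 in advance.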
Appendix: deliverable checklist
| Spec deliverable | File | Status |
|---|---|---|
| Clean dataset (parquet) | data/processed/historical_models.parquet | ✓ |
| Clean dataset (CSV) | data/processed/historical_models.csv | ✓ |
| Data dictionary | docs/data_dictionary.md | ✓ |
| Compute over time | outputs/charts/historical_compute_over_time.png | ✓ |
| Cost over time | outputs/charts/historical_cost_over_time.png | ✓ |
| Cost per FLOP | outputs/charts/historical_cost_per_flop_over_time.png | ✓ |
| Compute by organization | outputs/charts/historical_compute_by_organization.png | ✓ |
| Cost by organization | outputs/charts/historical_cost_by_organization.png | ✓ |
| Residuals (compute) | outputs/charts/historical_residuals_compute.png | ✓ |
| Residuals (cost) | outputs/charts/historical_residuals_cost.png | ✓ |
| Hardware timeline | outputs/charts/historical_hardware_timeline.png | ✓ (bonus) |
| Trend estimates table | outputs/tables/historical_trend_estimates.csv | ✓ (45 rows) |
| Hardware summary | outputs/tables/historical_hardware_summary.csv | ✓ (bonus) |
| Historical-baseline memo | docs/historical_findings.md | ✓ (this file) |