# Data Dictionary

## Source
- Dataset: Epoch AI "Notable AI Models" database
- URL: https://epoch.ai/data/notable_ai_models.csv
- Retrieved: 2026-05-02
- Local copy: `data/raw/epoch_notable_ai_models_raw.csv`

Two additional Epoch CSVs are pulled for cross-checks but not used directly:

- `data/raw/epoch_frontier_ai_models_raw.csv` (Epoch's own frontier subset)
- `data/raw/epoch_large_scale_ai_models_raw.csv` (large-scale subset)
The raw CSV has 1,011 model rows (47 columns) covering 1950–2026. The
processed dataset (`data/processed/historical_models.{csv,parquet}`)
preserves all 1,011 rows with the columns documented below, including the
three frontier-rule flags and their union `frontier_any`.
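
A minimal loading sketch, assuming the raw Epoch column names listed under "Source column" in the field table below:

```python
import pandas as pd

# Sketch only: load the raw export and derive the row key. The raw column
# names ("Model", "Organization", "Publication date") come from the field
# table below; the shape check reflects the 2026-05-02 pull.
raw = pd.read_csv("data/raw/epoch_notable_ai_models_raw.csv")
assert raw.shape == (1011, 47)

# Permissive parse: unparseable dates become NaT instead of raising.
pub_date = pd.to_datetime(raw["Publication date"], errors="coerce")

# model_id = "<model_name> | <organization> | <date>"
model_id = raw["Model"] + " | " + raw["Organization"] + " | " + pub_date.dt.date.astype(str)
```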
## Field definitions

| Column | Source column | Type | Notes |
|---|---|---|---|
| `model_id` | derived | string | `"<model_name> \| <organization> \| <date>"` |
| `model_name` | `Model` | string | unchanged |
| `organization` | `Organization` | string | normalized: Google Deepmind → Google DeepMind, OpenAi → OpenAI, Meta AI / Meta Platforms → Meta |
| `release_year` | derived | int | `Publication date`.year |
| `release_year_fractional` | derived | float | year + (day-of-year − 1) / 365.25 |
| `publication_date` | `Publication date` | datetime | parsed permissively; failures → NaT |
| `domain` | `Domain` | string | comma-separated tags (e.g. `Multimodal,Language,Vision`) |
| `training_compute_flop` | `Training compute (FLOP)` | float | total training FLOP |
| `training_compute_log10` | derived | float | `log10(training_compute_flop)` |
| `estimated_training_cost_usd` | `Training compute cost (2023 USD)` | float | Epoch's headline cost figure (2023 USD, inflation-adjusted) |
| `training_cost_log10` | derived | float | `log10(estimated_training_cost_usd)` |
| `training_cost_cloud_usd` | `Training compute cost (cloud)` | float | implied cloud-rental cost variant |
| `training_cost_upfront_usd` | `Training compute cost (upfront)` | float | implied upfront-hardware cost variant |
| `cost_per_flop` | derived | float | `estimated_training_cost_usd / training_compute_flop` |
| `cost_per_flop_log10` | derived | float | `log10(cost_per_flop)` |
| `parameters` | `Parameters` | float | parameter count |
| `dataset_tokens` | `Training dataset size (total)` | float | Epoch's "total" dataset size; units vary by domain (tokens, examples, hours) |
| `hardware_type` | `Training hardware` | string | e.g. NVIDIA H100, TPU v4 |
| `hardware_quantity` | `Hardware quantity` | float | accelerator count |
| `training_duration_days` | derived | float | `Training time (hours)` / 24 |
| `epoch_frontier_flag` | `Frontier model` | bool | Epoch's own frontier classification (NaN → False) |
| `epoch_confidence` | `Confidence` | string | Confident, Likely, Speculative, Unknown |
| `compute_estimate_quality` | derived | string | high/medium/low, mapped from `epoch_confidence`; NaN if no compute number |
| `cost_estimate_quality` | derived | string | high if upfront cost present, medium if only cloud cost, low if only headline figure |
| `date_quality` | derived | string | `publication_date` if `Publication date` parsed, else `unclear` |
| `notability_criteria` | `Notability criteria` | string | Epoch's reason for inclusion |
| `organization_category` | `Organization categorization` | string | e.g. Industry, Academia |
| `country` | `Country (of organization)` | string | |
| `source_url` | `Link` | string | upstream paper / blog / model card |
| `training_compute_notes` | `Training compute notes` | string | free text |
| `cost_notes` | `Compute cost notes` | string | free text |
| `frontier_rule_a` | derived | bool | top-10 by training compute within the trailing 1-year window ending at release |
| `frontier_rule_b` | derived | bool | top model by compute per organization per calendar year |
| `frontier_rule_c` | derived | bool | `training_compute_flop ≥ 1e23` |
| `frontier_any` | derived | bool | `rule_a OR rule_b OR rule_c` |
## Key derived definitions

### Training compute

Total floating-point operations performed during pretraining of the model, as estimated by Epoch (often reverse-engineered from architecture + parameters + tokens + epochs, sometimes from disclosed run details). Stored in absolute FLOP; typical 2024 frontier models land in the 1e25–1e26 range.
### Estimated training cost (USD)

Epoch's headline `Training compute cost (2023 USD)` field, inflation-adjusted.
This is the implied training-run cost, not total R&D spend. We deliberately
keep the cloud and upfront variants separate because they encode very
different assumptions about hardware utilization and amortization.
### Cost per FLOP

`estimated_training_cost_usd / training_compute_flop`. Sensitive to which of
the three cost variants is used. The historical baseline uses the headline
2023-USD figure for trend fits; sensitivity tests against the other two
variants are deferred to a later sprint.
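
What such a sensitivity pass could look like, as a sketch (the `cost_per_flop__*` output names are ours; the input columns are from the field table):

```python
import pandas as pd

# Sketch: recompute cost per FLOP under each of the three cost variants.
VARIANTS = {
    "headline": "estimated_training_cost_usd",
    "cloud": "training_cost_cloud_usd",
    "upfront": "training_cost_upfront_usd",
}

def cost_per_flop_variants(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for name, col in VARIANTS.items():
        # Missing cost or compute values propagate as NaN.
        out[f"cost_per_flop__{name}"] = out[col] / out["training_compute_flop"]
    return out
```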
### Frontier model

There is no neutral definition. The historical baseline uses three independent rules plus Epoch's own flag for cross-comparison:

- Rule A (top-10 in window): among the highest-compute models released in the 1-year window ending at this model's release. Captures "frontier-at-release."
- Rule B (top per org per year): the highest-compute model from each organization in each calendar year. Captures lab-level frontiers.
- Rule C (compute threshold): `training_compute_flop ≥ 1e23`, a round threshold matching some compute-governance frameworks.
- Epoch flag: Epoch's curated `Frontier model` boolean.

Trend-rate sensitivity to rule choice is one of the central diagnostics; see the sketch below for how the three rules can be computed.
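
A minimal pandas sketch of the three rules, assuming the processed column names documented above (not the pipeline's actual implementation):

```python
import pandas as pd

def add_frontier_flags(df: pd.DataFrame, top_k: int = 10) -> pd.DataFrame:
    out = df.copy()
    dated = out.dropna(subset=["publication_date", "training_compute_flop"])

    # Rule A: top-k by compute among models released in the trailing
    # 1-year (365-day) window ending at each model's own release date.
    out["frontier_rule_a"] = False
    for idx, row in dated.iterrows():
        window = dated[
            (dated["publication_date"] > row["publication_date"] - pd.Timedelta(days=365))
            & (dated["publication_date"] <= row["publication_date"])
        ]
        out.loc[idx, "frontier_rule_a"] = idx in window.nlargest(
            top_k, "training_compute_flop"
        ).index

    # Rule B: highest-compute model per organization per calendar year.
    out["frontier_rule_b"] = False
    best = dated.groupby(["organization", "release_year"])["training_compute_flop"].idxmax()
    out.loc[best, "frontier_rule_b"] = True

    # Rule C: absolute compute threshold (NaN compares as False).
    out["frontier_rule_c"] = out["training_compute_flop"] >= 1e23

    out["frontier_any"] = out[
        ["frontier_rule_a", "frontier_rule_b", "frontier_rule_c"]
    ].any(axis=1)
    return out
```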
### Confidence flags

Epoch's `Confidence` column rates how well-substantiated the compute
estimate is. We map it as:
| Epoch | Historical baseline |
|---|---|
| Confident | high |
| Likely | medium |
| Speculative | low |
| Unknown | low |
If `training_compute_flop` is missing entirely, `compute_estimate_quality`
is NA.
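
That mapping as a sketch, using the processed column names:

```python
import pandas as pd

# Mirrors the table above; "Speculative" and "Unknown" both map to low.
CONFIDENCE_MAP = {
    "Confident": "high",
    "Likely": "medium",
    "Speculative": "low",
    "Unknown": "low",
}

def add_compute_estimate_quality(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["compute_estimate_quality"] = out["epoch_confidence"].map(CONFIDENCE_MAP)
    # No compute number at all -> no quality rating.
    out.loc[out["training_compute_flop"].isna(), "compute_estimate_quality"] = pd.NA
    return out
```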
## Transformations applied

```
training_compute_log10  = log10(training_compute_flop)
training_cost_log10     = log10(estimated_training_cost_usd)
cost_per_flop           = estimated_training_cost_usd / training_compute_flop
cost_per_flop_log10     = log10(cost_per_flop)
training_duration_days  = training_time_hours / 24
release_year_fractional = year + (day_of_year - 1) / 365.25
```
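
The same transformations rendered in pandas, as a sketch (`training_time_hours` is assumed to be the renamed `Training time (hours)` raw column):

```python
import numpy as np
import pandas as pd

def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    # NaNs propagate through np.log10 and division, so rows with missing
    # compute or cost values simply get missing derived values.
    out = df.copy()
    out["training_compute_log10"] = np.log10(out["training_compute_flop"])
    out["training_cost_log10"] = np.log10(out["estimated_training_cost_usd"])
    out["cost_per_flop"] = out["estimated_training_cost_usd"] / out["training_compute_flop"]
    out["cost_per_flop_log10"] = np.log10(out["cost_per_flop"])
    out["training_duration_days"] = out["training_time_hours"] / 24
    date = out["publication_date"]
    out["release_year_fractional"] = date.dt.year + (date.dt.dayofyear - 1) / 365.25
    return out
```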
Organization normalization is intentionally minimal — we do not collapse parent/sub relationships (e.g. DeepMind vs Google DeepMind), since Epoch's labels track the org at publication time and merging them would erase a real signal.
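
For illustration only, the normalization can be a literal lookup (the pipeline's actual map may differ):

```python
# Illustrative: the minimal label cleanup described above. Parent/sub
# relationships (e.g. DeepMind vs Google DeepMind) are deliberately kept apart.
ORG_NORMALIZATION = {
    "Google Deepmind": "Google DeepMind",
    "OpenAi": "OpenAI",
    "Meta AI": "Meta",
    "Meta Platforms": "Meta",
}

def normalize_organization(name: str) -> str:
    return ORG_NORMALIZATION.get(name, name)
```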