
Can Europe train a frontier AI model on the compute it owns?

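Summary

The EuroMesh report argues that, rather than waiting for new 1 GW datacenters (which wait a mean of 7.6 years for a grid connection), Europe can federate existing public compute such as the EuroHPC machines with low-communication (DiLoCo-style) training and bring delivery of a frontier-class model forward to around 2028. The core logic is that the time gained on grid connection outweighs the loss in training efficiency. The report also names the practical obstacles: heterogeneous compute, the difficulty of political coordination, and the fact that frontier-scale distributed training is unproven above about 10B parameters.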

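Why read this

Europe already operates tens of exaflops of public AI compute across the EuroHPC supercomputers and the national AI Factories. Federated with DiLoCo-style low-communication training, that compute could produce a frontier-class model around 2028. The time-to-availability reasoning can be carried over to your own AI engineering projects.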

Original text

EuroMesh

A sourced model and short report on a single question:

Can Europe stand up a sovereign frontier-class AI model now, by federating the public compute it already owns, while the gigawatt datacenters it is planning take years to connect to the grid?

The answer the model gives is yes, as a stopgap. Europe already operates tens of exaflops of public AI compute across the EuroHPC supercomputers and the national AI Factories. A 1 GW campus, by contrast, waits a mean of 7.6 years for grid power. Federated with low-communication (DiLoCo-style) training, the compute Europe already has can deliver a frontier-class model around 2028, against around 2033 for a new gigawatt campus.

Read this first

The report is paper/compute-at-home.pdf (built from paper/compute-at-home.md). It is a short, sourced read aimed at a general audience. Title: "Do We Need OpenAI or Anthropic? Europe Has Tens of Exaflops at Home."

What is in the repo

euromesh/
├── README.md
├── requirements.txt
├── paper/
│   ├── compute-at-home.md / .pdf   the report
│   ├── grid_queue_dataset.md       sourced 1 GW vs 40 MW grid-connection lead times
│   ├── eurohpc_substrate.md        sourced EU public-compute inventory + "is it enough" math
│   ├── build_pdf.sh, _report.typ   PDF build (pandoc + typst)
│   └── figures/                    generated charts (PNG + SVG)
└── model/
    ├── MODEL_SPEC.md               the model specification (equations, params, invariants)
    ├── RESULTS.md                  full results, scenarios, sensitivity, caveats
    ├── run.py                      regenerates every CSV and figure
    ├── src/                        the three-layer model (efficiency, ramp, regions)
    ├── params/                     hardware.yaml, training.yaml, regions.csv + SOURCES
    ├── results/                    generated CSVs (do not hand-edit)
    └── tests/                      pytest suite (52 tests) + invariant self-checks

The model in one paragraph

Three layers. Layer 1 is the per-FLOP efficiency of low-communication training (how much the DiLoCo penalty costs). Layer 2 is time-to-availability (when sites energize and how fast cumulative compute accrues). Layer 3 is a per-region scorecard on time, cost, carbon, and feasibility. The headline result is set almost entirely by Layer 2 and reduces to one inequality: the federation wins if its sites are online before a gigawatt campus is. The training efficiency penalty is second-order, as the sensitivity tornado confirms.
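
The Layer 2 inequality can be illustrated with a minimal Python sketch. This is not the code in model/src/; the Route class, the function names, and every number except the 7.6-year mean grid wait are hypothetical placeholders chosen for illustration. It assumes delivery year is simply the online year plus training time, with an efficiency penalty stretching training time by 1 / efficiency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One route to a trained model. Field values used below are illustrative."""
    name: str
    online_year: float       # year the compute is energized and usable
    train_years: float       # wall-clock training time with no efficiency penalty
    efficiency: float = 1.0  # per-FLOP efficiency retained (1.0 = no penalty)

    def delivery_year(self) -> float:
        # An efficiency penalty stretches training time by 1 / efficiency.
        return self.online_year + self.train_years / self.efficiency


def federation_wins(federation: Route, campus: Route) -> bool:
    """True if the federated route delivers a model before the campus route."""
    return federation.delivery_year() < campus.delivery_year()


def break_even_efficiency(federation: Route, campus: Route) -> float:
    """Efficiency at which the federation would only tie the campus."""
    slack = campus.delivery_year() - federation.online_year
    if slack <= 0:
        raise ValueError("federation comes online after the campus delivers")
    return federation.train_years / slack


PLANNING_START = 2026.0  # assumption: the grid-queue clock starts in 2026
GRID_WAIT_YEARS = 7.6    # mean 1 GW grid wait quoted in the report

# Hypothetical inputs: online years, training times, and the 0.8 efficiency
# are placeholders, not values taken from model/params/.
federation = Route("federation", online_year=2027.0, train_years=1.0, efficiency=0.8)
campus = Route("gigawatt campus", online_year=PLANNING_START + GRID_WAIT_YEARS, train_years=0.5)

print(federation_wins(federation, campus))
print(round(break_even_efficiency(federation, campus), 2))
```

With these placeholder inputs the break-even efficiency is far below any plausible DiLoCo penalty, which is one way to see why the efficiency term would be second-order when the gap in online dates is several years.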

Run it

python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/python -m model.run          # regenerates all CSVs in model/results and figures in paper/figures
.venv/bin/python -m pytest model/tests/ # 52 passed
bash paper/build_pdf.sh                 # rebuilds paper/compute-at-home.pdf (needs pandoc + typst)

The run is reproducible from a clean tree: deleting every output and re-running exits 0 and regenerates everything.

Data and sources

Grid-connection lead times: paper/grid_queue_dataset.md, seven regions, per-region primary sources, anchored by the AWS "up to seven years" statement and the IEA 2-to-10-year range, with limitations stated.
EU public compute: paper/eurohpc_substrate.md, the EuroHPC flagships and the 19 AI Factories, accelerator counts and the training-time math.
Model parameters: model/params/SOURCES.md and model/params/SOURCES_hardware_training.md, with confidence tags.
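
The training-time math mentioned for paper/eurohpc_substrate.md is not reproduced here, but a common back-of-envelope version can be sketched in Python. It uses the standard approximation that training a dense transformer costs about 6 × parameters × tokens FLOPs. The function names and all inputs other than the 405B parameter count are hypothetical, and the repo's own calculation may differ.

```python
SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Approximate dense-transformer training compute: C ~ 6 * N * D."""
    return 6.0 * params * tokens


def training_days(
    params: float,
    tokens: float,
    peak_flops_per_s: float,
    addressable_fraction: float,
    utilization: float,
) -> float:
    """Days of wall-clock training on a pooled, partly addressable fleet."""
    sustained = peak_flops_per_s * addressable_fraction * utilization
    if sustained <= 0:
        raise ValueError("sustained throughput must be positive")
    return training_flops(params, tokens) / sustained / SECONDS_PER_DAY


days = training_days(
    params=405e9,              # the 405B figure named in the caveats
    tokens=15e12,              # hypothetical token budget
    peak_flops_per_s=20e18,    # hypothetical: 20 exaflops of pooled peak compute
    addressable_fraction=0.5,  # hypothetical: share released for one coordinated run
    utilization=0.3,           # hypothetical: achieved fraction of peak
)
print(f"{days:.0f} days")
```

The addressable_fraction argument is kept separate from utilization because the source treats the usable share of the EuroHPC machines as a political decision rather than a hardware property.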

Honest caveats

The point of this repo is clarity, not novelty. The thesis rests on grid-queue lead times, which are sourced central estimates rather than observed figures (no European operator has yet energized a 1 GW point load). The compute is owned but not yet usable for one coordinated run: the EuroHPC machines are shared, batch-scheduled, and heterogeneous, so the addressable fraction is a political decision rather than a hardware fact. Frontier-scale distributed training is unproven above about 10B parameters today, so the target is a credible frontier-class model rather than a guaranteed 405B. All of this is in model/RESULTS.md and the report's caveats section. Figures and dated events are as of June 2026. This is an independent model and analysis, not peer-reviewed.

Hacker News · 137 points · 277 comments
