QR code linking to sam.onl/talks/2026/06/03

Production Pre-Training at Scale:

The Good, the Bad, and the Restarts

Lessons from AuroraGPT

Sam Foreman¹, Nathan Nichols, Varuni Sastry, Samuel Wheeler, Khalid Hossain, Huihuo Zheng, Murali Emani, Filippo Simini, Marieme Ngom, Ethan Wong, Venkat Vishwanath

2026-06-03

Outline

The Good:
- AuroraGPT-2B:
  - Training on 7.8T tokens
- Software stack:
  - Using ezpz
  - Moving to torchtitan²
  - CODING AGENTS!!
The Bad:
- Rapidly evolving software (and hardware!)
  - Fork tax (fast-moving upstream!)
- At scale, failure is the default

The Restarts:
- Towards resilient training
- 3 layers of recovery:
  - Job → Node → Process

Motivation

How to do production training on a rapidly evolving software stack?
- across {Intel, NVIDIA, AMD, …} hardware?
- while also mitigating failures?
  - {hardware, system, network, lustre, …}
Tension between:
- long running pre-training jobs
- typical facility scheduling policies
  - INCITE (>= 20% of machine, 2k nodes on Aurora)
    -> very large batches
    -> bad model, unstable training³⁴⁵

The stack

Current stack:

🍋 saforem2/ezpz (+ ezpz.cool)
🧠 saforem2/torchtitan@ezpz: FSDP · TP · PP · EP · MoE
📚 zhenghh04/blendcorpus: weighted blending across datasets

Old stack (reference):

🪦 argonne-lcf/Megatron-DeepSpeed:
- AuroraGPT-2B reference (~7.77T tokens)
- pre-torchtitan

Hardware agnostic

Same code, every vendor; no CUDA in user code!

🍋 `ezpz`: write once, run anywhere

# train.py

import ezpz
# auto device + backend selection
rank = ezpz.setup_torch()
print(rank)

ezpz launch python3 train.py

Same code, every site. No per-cluster mpiexec / srun, CPU bindings, or tile-compact wrappers. → ezpz.cool

flowchart LR
    EZ(["ezpz launch"])
    EZ --> AURORA["PBS · Intel Aurora"]
    EZ --> POLARIS["PBS · NVIDIA Polaris"]
    EZ --> PM["SLURM · NVIDIA Perlmutter"]
    EZ --> FRONTIER["SLURM · AMD Frontier"]
    %% EZ --> CLUSTER["{SLURM, PBS} · ANY!} private-cluster"]
    EZ --> MAC["Multi-CPU"]
    classDef hub fill:#ee8f2408,stroke:#ee8f24,color:#ee8f24,stroke-width:1.5px
    classDef aurora fill:#3b82f608,stroke:#3b82f6,color:#3b82f6,stroke-width:1.5px
    classDef polaris fill:#10b98108,stroke:#10b981,color:#10b981,stroke-width:1.5px
    %% classDef cluster fill:#88888808,stroke:#888888,color:#888888,stroke-width:1.5px
    classDef perlmutter fill:#06b6d408,stroke:#06b6d4,color:#06b6d4,stroke-width:1.5px
    classDef frontier fill:#ef444408,stroke:#ef4444,color:#ef4444,stroke-width:1.5px
    classDef mac fill:#a78bfa08,stroke:#a78bfa,color:#a78bfa,stroke-width:1.5px
    class EZ hub
    class AURORA aurora
    class POLARIS polaris
    class PM perlmutter
    class FRONTIER frontier
    class MAC mac

AuroraGPT-2B: the reference run on Aurora

Spec	Value
Architecture	1.986B params, 12 layers, GQA (16h / 4 kv)
Hardware	256 Aurora nodes × 12 Intel Max GPUs = 3,072 GPUs, BF16
Framework	Megatron-DeepSpeed (ZeRO Stage 0)
Optimizer	SophiaG⁶ (β=0.9/0.95, ρ=0.01, wd=0.1, LR=2.28e-5)
Training Config	50M tok/batch (8192 ctx · LBS=2)
Tokenizer	SentencePiece, vocab=256K
Stages	3 (pretrain · continued-pretrain · math+code)
Tokens	~7.77T total
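
The batch figures in the table can be cross-checked from the hardware and context settings. The sketch below assumes pure data parallelism (one model replica per GPU, which is consistent with ZeRO Stage 0) and no gradient accumulation; the variable names are illustrative and not taken from the training config.

```python
# Values copied from the spec table.
nodes = 256
gpus_per_node = 12
local_batch_size = 2      # LBS, sequences per GPU per step
context_length = 8192     # tokens per sequence
total_tokens = 7.77e12    # ~7.77T tokens across all three stages

# Assumption: one data-parallel replica per GPU, no gradient accumulation.
gpus = nodes * gpus_per_node
global_batch_size = gpus * local_batch_size
tokens_per_step = global_batch_size * context_length
approx_steps = total_tokens / tokens_per_step

print(f"GPUs:            {gpus:,}")
print(f"GBS (sequences): {global_batch_size:,}")
print(f"tokens / step:   {tokens_per_step:,}")
print(f"approx. steps:   {approx_steps:,.0f}")
```

Under these assumptions the global batch is 6,144 sequences, or about 50.3M tokens per step, which matches the "50M tok/batch" entry and the GBS=6,144 quoted for the optimizer comparison.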

This is the pre-torchtitan reference. Everything that follows is the migration story: same scale, same data, what changed and what broke when we cut over.

Why MDS (Megatron-DeepSpeed) first: the only option at the time

When AuroraGPT kicked off, MDS was the only LLM pre-training framework that ran at scale and supported:

Intel XPU
Model, pipeline parallelism
DeepSpeed ZeRO Offloading

Supporting context:

PyTorch FSDP1 had Intel XPU gaps (collectives, AC patterns, optimizer-state sharding)
torchtitan existed as a research project (not tested)
MDS was the pragmatic choice

By early 2026, the calculus changed: torchtitan + DTensor + FSDP2 closed the gap and the MDS fork’s maintenance cost crossed over.

Why SophiaG: large-batch stability at 50M tok/batch (256N)

W&B Report: AuroraGPT-2B Pre-Training⁷

AuroraGPT-2B optimizer comparison: AdamW vs ipex.FusedLamb vs Muon vs MuonClip vs SophiaG at GBS=6,144 — **SophiaG** is the only one that stays in the low-loss band with bounded grad norms.

LR-finder — exponential sweep, blow-up / 10

Smith 2015⁸ / Gugger⁹: exponentially ramp LR over ~10% of training, record EMA-smoothed loss, pick LR at the steepest descent (or blow-up point / 10 as a conservative default).
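
A minimal sketch of the selection step described above, written against a recorded list of (LR, loss) pairs. The loss curve here is synthetic (a flat / descent / blow-up shape invented for illustration), and the function names, the EMA coefficient, and the "4× best smoothed loss" blow-up threshold are assumptions, not the settings used for the AuroraGPT sweeps.

```python
import math


def exponential_lrs(lr_min, lr_max, num_steps):
    """Learning rates spaced evenly in log-space from lr_min to lr_max."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def synthetic_loss(lr):
    """Stand-in for a real training loss: flat, then descent, then blow-up."""
    x = math.log10(lr)
    descent = 5.0 / (1.0 + math.exp(-(x + 3.5) / 0.3))
    blow_up = math.exp(4.0 * (x + 1.5))
    return 11.0 - descent + blow_up


def lr_finder(lrs, losses, beta=0.9, blowup_factor=4.0):
    """Return the steepest-descent LR and the blow-up LR / 10."""
    smoothed, ema, best, blow_up_lr = [], 0.0, math.inf, None
    for step, loss in enumerate(losses, start=1):
        ema = beta * ema + (1.0 - beta) * loss
        value = ema / (1.0 - beta ** step)  # bias-corrected EMA
        smoothed.append(value)
        best = min(best, value)
        if value > blowup_factor * best:    # assumed blow-up criterion
            blow_up_lr = lrs[step - 1]
            break

    # Steepest descent: most negative slope of smoothed loss vs log(LR).
    steepest_lr, steepest_slope = None, 0.0
    for i in range(1, len(smoothed)):
        slope = (smoothed[i] - smoothed[i - 1]) / (math.log(lrs[i]) - math.log(lrs[i - 1]))
        if slope < steepest_slope:
            steepest_slope, steepest_lr = slope, lrs[i]

    conservative_lr = blow_up_lr / 10.0 if blow_up_lr is not None else None
    return {"steepest_lr": steepest_lr, "blow_up_lr": blow_up_lr,
            "conservative_lr": conservative_lr}


lrs = exponential_lrs(1e-6, 1.0, 200)
result = lr_finder(lrs, [synthetic_loss(lr) for lr in lrs])
print(result)
```

In a real sweep the `losses` list would come from the first ~10% of training steps rather than from `synthetic_loss`.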

See also our recent work on cross-optimizer LR scaling¹⁰.

LR-vs-loss sweeps for AdamW, Muon, SophiaG across 2B and 20B configs on Aurora — Cross-optimizer sweep on Aurora. Full report: `docs/experiments/lr-finder/README.md`

2B reference: training loss

AuroraGPT-2B training loss across 3 stages — (1.) Pretrain¹¹ → (2.) continued-pretrain¹² → (3.) math+code¹³

Why we moved to `torchtitan`

	MDS	TT
Actively maintained	❌️	✅
Declarative parallelism (DTensor, FSDP2)	❌️	✅
FSDP+TP / EP / CP without plumbing	❌️	✅
MoE support	❌️	✅
Easy to extend, debug, maintain	❌️	✅

The trade-off we accepted: living on a fast-moving upstream pytorch/torchtitan@main; the “fork tax”

2B reference + `torchtitan` overlay

AuroraGPT-2B MDS 3-stage trajectory with TorchTitan 256N production run overlaid — Training loss comparison for AuroraGPT-2B trained with MDS vs. TT.

2B loss: MDS full trajectory vs `torchtitan`

2B training loss: MDS full 3-stage curve vs TT v2 256N chain — 256N / GBS=6,144. At matched tokens, **δ ≈ 0.02** — within run-to-run noise. The cutover preserved training behavior.

2B eval: MDS reference vs `torchtitan`

HellaSwag accuracy vs tokens — MDS vs TT 256N vs TT 512N — Data: `docs/evals/agpt/2b`, production run: `docs/production/agpt/2b`

ARC-Easy accuracy vs tokens — MDS vs TT 256N vs TT 512N — Data: `docs/evals/agpt/2b`, production run: `docs/production/agpt/2b`

20B eval: all-production overlay (2B + 20B)

HellaSwag accuracy vs tokens — 2B MDS, 2B TT 256N, 2B TT 512N, 20B TT 512N — Data: `docs/evals/agpt/20b`, production run: `docs/production/agpt/20b`

ARC-Easy accuracy vs tokens — 2B MDS, 2B TT 256N, 2B TT 512N, 20B TT 512N — Data: `docs/evals/agpt/20b`, production run: `docs/production/agpt/20b`

The fork tax: upstream-sync as a workflow

46 Upstream Syncs in 7 Weeks
Smoke test: bit-exact loss + grad-norm + peak memory
- Verify changes from upstream haven’t broken anything

flowchart TB
    A["start upstream @ HEAD"] --> B["resync · 50 steps · deterministic smoke test"]
    B --> D{"bit-exact?"}
    D -->|✅| C["diff = 0 ✓ ship"]
    D -.->|❌️| F["diff ≠ 0 bisect + fix"]
    F -.-> B
    classDef source fill:#7c4ed508,stroke:#7c4ed5,color:#7c4ed5,stroke-width:1.5px
    classDef stage fill:#118cc208,stroke:#118cc2,color:#118cc2,stroke-width:1.5px
    classDef ship fill:#1da81108,stroke:#1da811,color:#1da811,stroke-width:1.5px
    classDef gate fill:#8a8a8a18,stroke:#9a9a9a,color:#838383,stroke-width:1.5px
    classDef bad fill:#e0556008,stroke:#e05560,color:#e05560,stroke-width:1.5px
    class A source
    class B stage
    class D gate
    class C ship
    class F bad
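
One way to express the "bit-exact?" gate is to compare per-step metrics from a reference run and a post-sync run and report the first step where any of them differs. This is a sketch only: the record layout and the names `METRICS` and `first_divergence` are hypothetical, and it assumes both runs were executed deterministically with the same seed and config.

```python
import math

# Assumed metric names; the slides list loss, grad-norm, and peak memory.
METRICS = ("loss", "grad_norm", "peak_memory_bytes")


def first_divergence(reference, candidate):
    """Return (step, metric) of the first bit-level mismatch, or None.

    Both arguments are lists of dicts, one per training step. Values are
    compared through float.hex() so that the check is exact (no tolerance)
    and NaN compares equal to NaN.
    """
    if len(reference) != len(candidate):
        return (min(len(reference), len(candidate)) + 1, "num_steps")
    for step, (ref, cand) in enumerate(zip(reference, candidate), start=1):
        for name in METRICS:
            if float(ref[name]).hex() != float(cand[name]).hex():
                return (step, name)
    return None


# Illustrative 50-step trace (made-up numbers, not from a real run).
reference = [
    {"loss": 12.45 - 0.01 * s, "grad_norm": 1.0 + 0.001 * s, "peak_memory_bytes": 40_000_000_000}
    for s in range(50)
]
same = [dict(row) for row in reference]
drifted = [dict(row) for row in reference]
drifted[29]["loss"] = math.nextafter(drifted[29]["loss"], math.inf)  # 1-ULP change

print(first_divergence(reference, same))
print(first_divergence(reference, drifted))
```

A `None` result corresponds to the "diff = 0, ship" branch; any other result gives the step to start bisecting from.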

The Restarts: At Scale, Failure is the Default

Llama 3 405B — 16K H100s · 54 days · 419 failures (≈ 1 every 3h); 99% recovered via automation¹⁴
OPT-175B — 35 manual restarts + 100+ cycled hosts in 2 mo on ~1K A100s¹⁵
BLOOM-176B — frequent loss spikes; embedding-norm + checkpoint cadence on 384 A100s × 3.5 mo¹⁶
GLM-130B — loss spikes “increasingly frequent”; some recover, others go to NaN¹⁷

“The Restarts”: three layers of recovery

Bad-node failover, hang-watchdog, and PBS resubmit each operate at a different scope.

flowchart TB
    subgraph job["📋 JOB scope — PBS · hours"]
        JOB_TXT["crash / walltime → chained resubmit"]
        subgraph node["🖥 NODE scope — failover wrapper · minutes"]
            NODE_TXT["bad host detected → swap from spare pool"]
            subgraph proc["⏱ PROCESS scope — ezpz launch · seconds"]
                PROC_TXT["stdout idle ≥ N s → kill + backoff"]
            end
        end
    end
    classDef jobC fill:#118cc208,stroke:#118cc2,color:#118cc2,stroke-width:1.5px
    classDef nodeC fill:#ee8f2408,stroke:#ee8f24,color:#ee8f24,stroke-width:1.5px
    classDef procC fill:#1da81108,stroke:#1da811,color:#1da811,stroke-width:1.5px
    classDef jobTxt fill:transparent,stroke:none,color:#118cc2
    classDef nodeTxt fill:transparent,stroke:none,color:#ee8f24
    classDef procTxt fill:transparent,stroke:none,color:#1da811
    class JOB_TXT jobTxt
    class NODE_TXT nodeTxt
    class PROC_TXT procTxt
    class job jobC
    class node nodeC
    class proc procC

flowchart TB
    subgraph proc["⏱ PROCESS · seconds"]
        PROC_TXT["stdout idle ≥ N s → kill + backoff"]
    end
    subgraph node["🖥 NODE · minutes"]
        NODE_TXT["bad host detected → swap from spare pool"]
    end
    subgraph job["📋 JOB · hours"]
        JOB_TXT["crash / walltime → chained resubmit"]
    end
    proc -- "process exit" --> node
    node -- "node exhaustion" --> job
    classDef jobC fill:#118cc208,stroke:#118cc2,color:#118cc2,stroke-width:1.5px
    classDef nodeC fill:#ee8f2408,stroke:#ee8f24,color:#ee8f24,stroke-width:1.5px
    classDef procC fill:#1da81108,stroke:#1da811,color:#1da811,stroke-width:1.5px
    classDef jobTxt fill:transparent,stroke:none,color:#118cc2
    classDef nodeTxt fill:transparent,stroke:none,color:#ee8f24
    classDef procTxt fill:transparent,stroke:none,color:#1da811
    class JOB_TXT jobTxt
    class NODE_TXT nodeTxt
    class PROC_TXT procTxt
    class job jobC
    class node nodeC
    class proc procC

Inner loops catch most failures; outer loops catch the rest.

What generalizes, what doesn’t

Generalizes across vendors / sites / models

Bit-exact deterministic smoke gate after every upstream sync
lm-eval as the ground truth for “is it actually learning?”
Spare-node failover wrapper — same idea on Slurm
Launcher / env autodetect — push every vendor-shaped assumption out of training code

Doesn’t generalize (needs per-(config, hardware, version) tuning)

torch.compile decisions
AC boundaries (MoE + AC + compile = grief)
EP↔FSDP frontier
Collective tuning (XCCL vs gloo fallbacks, NCCL env)

Thanks

AuroraGPT team: Venkat Vishwanath, the AI/ML Group at ALCF, collaborators across ANL.

Argonne Leadership Computing Facility: Aurora time, Sunspot staging.

Intel: Intel Max 1550 XPU + oneAPI / XCCL / IPEX support throughout.

Code & docs

ezpz: github.com/saforem2/ezpz
torchtitan fork (experiments/ezpz): github.com/saforem2/torchtitan
These slides: sam.onl/talks/2026/06/03

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Questions?

Appendix: backup slides

Material that didn’t make the main path but is here for Q&A.

Open questions
Silent-correctness bugs
Failover engineering deep-dive
LR-finder
yeet-env tarball broadcast scaling

Open questions: the ask

A portable bit-exact regression suite across vendors — does anyone have one?
torch.compile at 1T scale — defensible decision tree?
Async-checkpoint Pareto frontier: recovery time × frequency × storage cost in production
Optimizer failure at 80B+: SophiaG (Hessian-diagonal estimate saturates) + Muon (Newton-Schulz iterations overflow bf16) both diverge — algorithmic limit or fp-precision artifact?
MoE EP↔FSDP scaling boundaries at 16B / 64B / 100B
Make xccl honor train_timeout_seconds so we don’t have to rely on an stdout-idle watchdog as the hang-detection ground truth

Operational reality — bad-node failover

5+ production jobs killed by bad-node failures in 2 weeks. Pattern: PBS gives us 256 / 512 nodes, one is bad, training crashes or hangs after N hours, walltime gone.

Job	Trajectory	Failure
8459818	2B 256N v2	`shepherd died from signal 9` after step 2070
8470102	20B 256N v2	gloo TCP `Connection closed by peer` after ~3h
8479579	20B 512N v2	silent hang at step 803 (heartbeat continued)

Failover wrapper: request select=N+spare (~2%, min 4). Split into active + spare pool. On crash, scrape bad nodes from log, swap a spare in, retry.

qsub -q prod -l select=522 -v NHOSTS_TRAIN=512 \
    submit_agpt_2b_aurora_venv_failover.sh

Handles 6 recurring crash modes. Does not handle silent hangs — those still need a heartbeat watchdog.
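
The sizing and swap logic can be sketched in a few lines. This is not the wrapper's actual code: the function names are hypothetical, and rounding 2% to the nearest integer is an assumption chosen because it reproduces the select=522 for 512 training nodes shown in the qsub line.

```python
def spare_count(train_nodes, fraction=0.02, minimum=4):
    """Spare nodes to request: ~2% of the training nodes, at least 4."""
    # Assumption: round to nearest; the slides only say "~2%, min 4".
    return max(minimum, round(train_nodes * fraction))


def split_pool(hosts, train_nodes):
    """Split the allocated hosts into an active set and a spare pool."""
    return list(hosts[:train_nodes]), list(hosts[train_nodes:])


def swap_bad_nodes(active, spares, bad_hosts):
    """Replace each bad host in the active set with the next spare."""
    bad = set(bad_hosts)
    remaining = list(spares)
    new_active = []
    for host in active:
        if host in bad:
            if not remaining:
                raise RuntimeError("spare pool exhausted")
            new_active.append(remaining.pop(0))
        else:
            new_active.append(host)
    return new_active, remaining


train_nodes = 512
select = train_nodes + spare_count(train_nodes)
hosts = [f"node{i:04d}" for i in range(select)]   # made-up hostnames
active, spares = split_pool(hosts, train_nodes)
active, spares = swap_bad_nodes(active, spares, ["node0003"])
print(select, len(active), len(spares), active[3])
```

Keeping the swapped-in host at the same position in the active list keeps the rank-to-slot layout stable across attempts, which is a design choice of this sketch rather than something the slides specify.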

Silent-hang detection: `ezpz launch --timeout / --retries`

The problem. xccl on XPU silently ignores train_timeout_seconds, so a torchtitan job stuck in a hung collective sits consuming the full PBS walltime instead of aborting. Every collective hang quiets stdout (every rank blocks in the same call, nothing reaches the log) — that’s the signal we can act on.

ezpz launch --timeout 600 --retries 3 \
    python -m torchtitan.train --config-file ./config.toml

--timeout SECONDS — kill the launched process if its stdout goes idle (not walltime) for this many consecutive seconds. Returns exit code 124 (matches GNU timeout(1)).
--retries N — re-execute on any non-zero exit (including the watchdog’s 124) up to N times. Exponential backoff: 5s → 10s → 20s → 40s → 60s (capped).

Scope caveat. Watches only the process ezpz launch spawns directly. If qsub runs a wrapper script that internally invokes python train.py, the watchdog needs to live inside that script (or you wrap the inner call with ezpz launch too).
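
To make the semantics concrete, here is a minimal standard-library sketch of a stdout-idle watchdog with capped exponential backoff. It illustrates the behavior described above and is not the ezpz implementation; `run_with_idle_timeout`, `backoff_seconds`, `run_with_retries`, and the grace period before SIGKILL are all hypothetical names and choices.

```python
import subprocess
import sys
import threading
import time

IDLE_EXIT_CODE = 124  # same convention as GNU timeout(1)


def run_with_idle_timeout(cmd, idle_seconds, grace_seconds=5.0):
    """Run cmd; kill it if it prints nothing for idle_seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.monotonic()]

    def pump():
        # Forward child output and record when the last line arrived.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            sys.stdout.write(line)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    while proc.poll() is None:
        if time.monotonic() - last_output[0] >= idle_seconds:
            proc.terminate()                      # SIGTERM first
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()                       # then SIGKILL
                proc.wait()
            return IDLE_EXIT_CODE
        time.sleep(0.1)
    reader.join(timeout=1.0)
    return proc.returncode


def backoff_seconds(attempt, base=5.0, cap=60.0):
    """Delay before retry number attempt (0-based): 5, 10, 20, 40, 60, 60, ..."""
    return min(base * 2 ** attempt, cap)


def run_with_retries(cmd, idle_seconds, retries, sleep=time.sleep):
    """Re-run cmd on any non-zero exit, up to retries additional times."""
    code = run_with_idle_timeout(cmd, idle_seconds)
    for attempt in range(retries):
        if code == 0:
            break
        sleep(backoff_seconds(attempt))
        code = run_with_idle_timeout(cmd, idle_seconds)
    return code
```

Because the timer resets on every line of output, a job that is slow but still logging is never killed; only silence counts against it.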

`ezpz launch --timeout`: one hang/recover cycle

stateDiagram-v2
    direction LR
    [*] --> Running
    Running --> Hung: no stdout for N seconds
    Hung --> Killed: SIGTERM → SIGKILL
    Killed --> Running: backoff, re-exec

Every collective hang shows up as silence on stdout — the process is “alive” by kill -0 but nothing is happening. The watchdog fires on the absence of progress, not on a heartbeat ping.

`--auto-retry`: bad-node failover, on tap

Allocate spares up front, swap them in on failure:

# 522 nodes allocated, train on 512, keep 10 as spares.
# Loop until success, walltime, or spare exhaustion.
ezpz launch --auto-retry --np 512 -- python -m torchtitan.train …

Classifies each attempt’s exit → success / walltime / bad-node / stuck-pre-training
On bad-node: scrapes the failing host from the log, swaps in a spare, re-execs
Guards against config bugs: 2 consecutive attempts with zero step= markers → stop (don’t burn the whole spare pool on a broken run)

Ships in saforem2/ezpz#144. Same scraper as the bash-lib path; pure-Python loop on top.
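
The control flow of such a loop might look like the sketch below. It is a simplified illustration, not the code in the PR: the function names are hypothetical, the exit codes 124 and 143 are the ones quoted elsewhere in these slides, and treating every remaining non-zero exit with progress as "bad-node" (and spending one spare per failed attempt) is an assumption.

```python
EXIT_WATCHDOG = 124  # stdout-idle watchdog (handled here as a bad-node case)
EXIT_WALLTIME = 143  # PBS walltime kill


def count_steps(log_text):
    """Number of 'step=' progress markers in an attempt's log."""
    return log_text.count("step=")


def classify(exit_code, log_text):
    """Map one attempt to success / walltime / stuck-pre-training / bad-node."""
    if exit_code == 0:
        return "success"
    if exit_code == EXIT_WALLTIME:
        return "walltime"
    if count_steps(log_text) == 0:
        return "stuck-pre-training"
    return "bad-node"  # assumption: any other failure after progress


def retry_loop(run_attempt, spares, max_stuck=2):
    """Call run_attempt() -> (exit_code, log_text) until a stop condition."""
    outcomes, stuck = [], 0
    while True:
        exit_code, log_text = run_attempt()
        outcome = classify(exit_code, log_text)
        outcomes.append(outcome)
        if outcome in ("success", "walltime"):
            return outcomes
        # Guard: consecutive attempts with no training progress at all.
        stuck = stuck + 1 if count_steps(log_text) == 0 else 0
        if stuck >= max_stuck or spares == 0:
            return outcomes
        spares -= 1  # assumption: each failed attempt consumes one spare


def scripted(attempts):
    """Test helper: replay a fixed list of (exit_code, log_text) results."""
    it = iter(attempts)
    return lambda: next(it)


recovered = retry_loop(scripted([(1, "step=1\nshepherd died from signal 9"),
                                 (0, "step=1\nstep=2")]), spares=10)
broken = retry_loop(scripted([(1, "ImportError"), (1, "ImportError"),
                              (1, "ImportError")]), spares=10)
print(recovered, broken)
```

The second scripted run stops after two attempts even though eight or more spares remain, which is the point of the zero-progress guard.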

Failover wrapper: caught a real silent hang in production

Job 8505298, 2026-05-23. Attempt 1 trains cleanly steps 1→37, then log goes completely silent at step 37. No traceback, no MPI error, no rank dying. Just dead.

Time (CT)	Event
21:06:41	step 37 logged · `loss 11.80 · tps 3,919`
21:36:41	30 min dead air · `ezpz launch --timeout=1800` SIGTERMs
21:36:43	wrapper classifies exit 124 → silent-hang (not walltime)
21:36:43	no traceback to scrape → blind swap of rank-0 host
21:36:45	attempt 2 launches on swapped node set
21:57:49	walltime hit · step 296 · loss 5.68 · ckpts persisted

Three new pieces had to fire in sequence on a real-world hang to prove production-readiness: --timeout=1800 watchdog · exit 124 classification distinct from PBS exit 143 · failover_swap_one_blind() when no specific bad node can be identified. They did.

Full writeup: docs/experiments/agpt/aurora/20260523-failover-silent-hang-recovery-8505298.md

LR-finder — 2B sweep + optimal LR per config

2B LR-vs-loss sweep — AdamW, Muon, SophiaG overlaid — 2B sweep — flat / descent / blow-up signature.

Optimal LR per (model size, optimizer) — bar chart — Optimal LR per config. AdamW most tolerant; SophiaG ~10× lower.

Findings: AdamW most LR-tolerant; SophiaG needs ~10× lower LR; SophiaG + Muon both break at 80B (bf16 overflow in Newton-Schulz / Hessian estimate on 9216-dim matrices).

Silent bug #1 — bf16 master ⇒ RMSNorm frozen

Symptom. Loss curves looked reasonable. lm-eval scores didn’t move — ARC-Easy stuck at ~0.27 (random baseline) for 17K+ steps.

Cause. training.dtype=bfloat16 → bf16 master copy. RMSNorm weights init at 1.0; bf16 ULP at scale 1.0 is ~7.8e-3. Per-step optimizer update is ~1.6e-5 — every update rounds to zero. All 25 RMSNorm tensors stayed at exactly 1.0 from step 100 → 17,400.

Why other params trained fine. Linear layers init at std≈0.02, where the bf16 ULP is ~1.2e-4 (2^-13) and finer still for the many weights smaller than 0.0156, so it sits far closer to the update size. RMSNorm’s larger init scale = coarser ULP = updates lost in rounding.

Fix. Default training.dtype=float32, FSDP MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=fp32). Master is fp32; forward/backward stay bf16. Extra ~1 GB master at 2B, ~10 GB at 20B — under budget.
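
The rounding argument for a weight sitting at 1.0 can be reproduced without a GPU by emulating bf16 with the standard library. The helpers below are illustrative (not part of torchtitan or ezpz), use round-to-nearest-even, and model the optimizer as a constant update of 1.6e-5 per step, which is a simplification of the real update.

```python
import math
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a float to the nearest bfloat16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias on the low 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_spacing(x):
    """Gap between adjacent bf16 values at magnitude x (8 significant bits)."""
    _, exponent = math.frexp(abs(x))
    return 2.0 ** (exponent - 8)


update = 1.6e-5          # per-step update size quoted on the slide
steps = 17_400 - 100     # step range quoted on the slide

w_bf16 = 1.0             # bf16 master weight
w_fp32 = 1.0             # fp32 master weight
for _ in range(steps):
    w_bf16 = to_bf16(w_bf16 - update)
    w_fp32 = to_fp32(w_fp32 - update)

print(f"bf16 spacing at 1.0: {bf16_spacing(1.0):.4e}")
print(f"bf16 master after {steps} steps: {w_bf16}")
print(f"fp32 master after {steps} steps: {w_fp32:.4f}")
```

In this toy model the bf16 master never leaves 1.0, while the fp32 master accumulates all of the updates.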

Silent bug #1 — the smoking gun

v1 (bf16-master) vs v2 (fp32-master) — 20B lm-eval

v1 (bf16 master, 256N): ARC-Easy 0.27 flat across 2,500 steps · v2 (fp32 master, 512N): ARC-Easy 0.27 → 0.44 by step 800 · HellaSwag breaking out.

Lesson: loss looks like training. lm-eval is the only ground truth for “is the model actually learning?” Add a periodic eval gate.
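
A periodic eval gate can be as simple as flagging a run whose accuracy has not left the random baseline after several evaluations. The function name, margin, and window below are hypothetical choices, and the middle values in the example histories are illustrative; only the 0.27 and 0.44 figures come from the slide.

```python
def eval_is_stuck(accuracies, random_baseline, margin=0.02, min_evals=3):
    """True if the last min_evals scores are all within margin of chance."""
    if len(accuracies) < min_evals:
        return False  # not enough evidence yet
    recent = accuracies[-min_evals:]
    return all(acc <= random_baseline + margin for acc in recent)


baseline = 0.27                       # ARC-Easy chance level as quoted above
v1_history = [0.27, 0.27, 0.27, 0.27]  # flat, as in the bf16-master run
v2_history = [0.27, 0.31, 0.38, 0.44]  # rising; middle values are made up

print(eval_is_stuck(v1_history, baseline))
print(eval_is_stuck(v2_history, baseline))
```

A gate like this would have to be tuned per benchmark, since the chance level and the number of steps before a model leaves it both vary.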

Silent bug #2 — TP loss reported / `dp_world_size`

Symptom: Step-1 loss for agpt_* (vocab=256128) should be ln(256128) ≈ 12.45
- On TP=1 we see 12.95 ✓
- On TP=2 after upstream commit 1786292d (2026-04-27):
  - step-1 reported as 1.07 ✗, exactly 12.84 / 12 where dp_world_size = 12
Cause: _dist_reduce() short-circuits on DTensor with full_tensor()
- But the loss is Replicated on the TP mesh, and the reduction was requested over batch_mesh (orthogonal)
- The short-circuit silently drops the cross-batch sum
Why it survived review: Gradients + optimizer steps are correct
- Only the loss: field that lands in stdout / W&B is wrong
- Loss curves look “reasonable” — just 1/12 of the true value
- Filed as pytorch/torchtitan#3204; our workaround calls loss.full_tensor() before dist_sum in ezpz/trainer.py:503-516
Lesson: Bit-exact smoke caught this immediately; the prior 2B TP=1 baseline gave us a number to disagree with
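
The step-1 sanity check used above is easy to automate: an untrained model should score roughly ln(vocab_size), so a reported loss far from that value points at a reporting or reduction bug. The helper names and the 10% tolerance are assumptions made for this sketch.

```python
import math


def expected_initial_loss(vocab_size):
    """Cross-entropy of a uniform prediction over the vocabulary."""
    return math.log(vocab_size)


def step1_loss_ok(reported_loss, vocab_size, rel_tol=0.10):
    """True if the reported step-1 loss is within rel_tol of ln(vocab)."""
    expected = expected_initial_loss(vocab_size)
    return abs(reported_loss - expected) <= rel_tol * expected


vocab = 256128
print(f"ln({vocab}) = {expected_initial_loss(vocab):.2f}")
print("TP=1, 12.95:", step1_loss_ok(12.95, vocab))
print("TP=2,  1.07:", step1_loss_ok(1.07, vocab))
print("12.84 / 12 =", round(12.84 / 12, 2))
```

The last line reproduces the observation that the bad value is the true loss divided by dp_world_size.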

`ezpz yeet`: Efficiently Running 50k Python Processes

Nodes	yeet (s)	First-step (s)	Per-node (ms)
8	69.7	29.3	8,712
16	89.7	31.6	5,606
32	89.2	20.9	2,788
64	91.2	34.6	1,425
128	110.4	30.5	862
256	132.9	37.6	519
512	174.5	44.5	341
1024	255.4	60.8	249
2048	421.4	94.8	206
4096	750.6	194.0	183

Two regimes. 8–64 nodes are extract-bound (~70–91s, nearly flat; per-node cost falls 8.7s → 1.4s); ≥128 nodes are broadcast-bound, where each 2× in nodes multiplies wall-clock by ~1.2× at first and by ~1.5–1.8× beyond 512 nodes.
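
The per-node column and the cost of each doubling can be recomputed directly from the table. The numbers below are copied from the table; only the variable names are new.

```python
# (nodes, yeet seconds) copied from the table above.
yeet_seconds = {
    8: 69.7, 16: 89.7, 32: 89.2, 64: 91.2, 128: 110.4,
    256: 132.9, 512: 174.5, 1024: 255.4, 2048: 421.4, 4096: 750.6,
}

# Per-node cost in milliseconds: total yeet time divided by node count.
per_node_ms = {n: 1000.0 * t / n for n, t in yeet_seconds.items()}

# Wall-clock multiplier for each doubling of the node count.
doubling_ratio = {
    n: yeet_seconds[n] / yeet_seconds[n // 2]
    for n in yeet_seconds if n // 2 in yeet_seconds
}

for n in sorted(yeet_seconds):
    ratio = doubling_ratio.get(n)
    ratio_text = f"{ratio:.2f}x" if ratio is not None else "  -  "
    print(f"{n:>5} nodes  {yeet_seconds[n]:>6.1f} s  {per_node_ms[n]:>7.0f} ms/node  {ratio_text}")
```

At 12 processes per node, the 4096-node row corresponds to 49,152 Python processes, which is the "50k" in the slide title.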

Full write-up: sam.onl/posts/2026/05/01

Footnotes

¹ Argonne National Laboratory
² And away from argonne-lcf/Megatron-DeepSpeed!
³ An Empirical Model of Large-Batch Training
⁴ How Does Critical Batch Size Scale in Pre-training?
⁵ How to Set the Batch Size for Large-Scale Pre-training?
⁶ Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
⁷ See the 📊 All Runs section.
⁸ Cyclical Learning Rates for Training Neural Networks (Smith 2015)
⁹ How do you find a good learning rate (Gugger 2017)
¹⁰ Extending µP: Spectral Conditions for Feature Learning Across Optimizers (Gupta et al. 2026)
¹¹ allenai/olmo-mix-1124: 0 → 4.67T tokens
¹² allenai/dolmino-mix-1124: 4.67T → 7.06T tokens
¹³ NVIDIA/{Nemotron-CC-Math-v1, Code-CC-v1}: 7.06T → 7.77T tokens
¹⁴ Llama 3 herd of models (Meta AI, 2024), §3.3.2 (Training reliability)
¹⁵ OPT-175B chronicles + dev log (Zhang et al., 2022)
¹⁶ BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (BigScience, 2022)
¹⁷ GLM-130B: An Open Bilingual Pre-Trained Model (Zeng et al., ICLR 2023)
