Production Pre-Training at Scale:
The Good, the Bad, and the Restarts
Lessons from AuroraGPT
Sam Foreman[1], Nathan Nichols, Varuni Sastry, Samuel Wheeler, Khalid Hossain, Huihuo Zheng, Murali Emani, Filippo Simini, Marieme Ngom, Ethan Wong, Venkat Vishwanath
2026-06-03
Outline
- The Good:
  - AuroraGPT-2B:
    - Training on 7.8T tokens
  - Software stack:
    - Using ezpz
    - Moving to torchtitan[2]
    - CODING AGENTS !!
- The Bad:
  - Rapidly evolving software (and hardware!)
    - Fork tax (fast-moving upstream!)
  - At scale, failure is the default
- The Restarts:
  - Towards resilient training
  - 3 layers of recovery:
    - Job → Node → Process
Motivation
- How to do production training on a rapidly evolving software stack?
  - across {Intel, NVIDIA, AMD, …} hardware?
  - while also mitigating failures?
    - {hardware, system, network, Lustre, …}
- Tension between:
The stack
Current stack:
- 🍋 saforem2/ezpz (+ ezpz.cool)
- 🧠 saforem2/torchtitan@ezpz: FSDP · TP · PP · EP · MoE
- 📚 zhenghh04/blendcorpus: weighted blending across datasets
Old stack (reference):
- 🪦 argonne-lcf/Megatron-DeepSpeed:
  - AuroraGPT-2B reference (~7.77T tokens)
  - pre-torchtitan
  - cuda in user code!
🍋 ezpz: write once, run anywhere
# train.py
import ezpz
# auto device + backend selection
rank = ezpz.setup_torch()
print(rank)

ezpz launch python3 train.py

Same code, every site. No per-cluster mpiexec / srun, CPU bindings, or tile-compact wrappers. → ezpz.cool
AuroraGPT-2B: the reference run on Aurora
| Spec | Value |
|---|---|
| Architecture | 1.986B params, 12 layers, GQA (16h / 4 kv) |
| Hardware | 256 Aurora nodes × 12 Intel Max GPUs = 3,072 GPUs, BF16 |
| Framework | Megatron-DeepSpeed (ZeRO Stage 0) |
| Optimizer | SophiaG[6] (β=0.9/0.95, ρ=0.01, wd=0.1, LR=2.28e-5) |
| Training Config | 50M tok/batch (8192 ctx · LBS=2) |
| Tokenizer | SentencePiece, vocab=256K |
| Stages | 3 (pretrain · continued-pretrain · math+code) |
| Tokens | ~7.77T total |
This is the pre-torchtitan reference. Everything that follows is
the migration story: same scale, same data, what changed and what
broke when we cut over.
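The "50M tok/batch" row follows from the hardware and batch settings in the table. A minimal arithmetic check, assuming every GPU is its own data-parallel rank and there is no gradient accumulation (neither is stated in the table; variable names are illustrative):

```python
# Sanity-check the "50M tok/batch" row from the spec table.
# Assumption: pure data parallelism, one micro-batch per GPU per step.
nodes = 256
gpus_per_node = 12
local_batch_size = 2      # LBS: sequences per GPU
context_length = 8192     # tokens per sequence

num_gpus = nodes * gpus_per_node
global_batch_size = num_gpus * local_batch_size         # sequences per step
tokens_per_batch = global_batch_size * context_length   # tokens per step

# Rough step count for the ~7.77T-token run at this batch size.
total_tokens = 7.77e12
approx_steps = total_tokens / tokens_per_batch

print(f"GPUs:             {num_gpus:,}")
print(f"GBS (sequences):  {global_batch_size:,}")
print(f"tokens per batch: {tokens_per_batch:,}")
print(f"steps for ~7.77T: {approx_steps:,.0f}")
```

Under these assumptions the global batch is 6,144 sequences, or about 50.3M tokens per step.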
Why MDS (Megatron-DeepSpeed) first: the only option at the time
When AuroraGPT kicked off, MDS was the only LLM pre-training framework that ran at scale and supported:
- Intel XPU
- Model, pipeline parallelism
- DeepSpeed ZeRO Offloading
Supporting context:
- PyTorch FSDP1 had Intel XPU gaps (collectives, AC patterns, optimizer-state sharding)
- torchtitan existed as a research project (not tested)
- MDS was the pragmatic choice
By early 2026, the calculus changed: torchtitan + DTensor + FSDP2 closed the gap
and the MDS fork’s maintenance cost crossed over.
Why SophiaG: large-batch stability at 50M tok/batch (256N)
W&B Report: AuroraGPT-2B Pre-Training[7]
SophiaG is the only optimizer that stays in the low-loss band with bounded grad norms.
LR-finder — exponential sweep, blow-up / 10
Smith 2015[8] / Gugger[9]: exponentially ramp LR over ~10% of training, record EMA-smoothed loss, pick LR at the steepest descent (or blow-up point / 10 as a conservative default).
See also our recent work on cross-optimizer LR scaling[10].
Cross-optimizer sweep on Aurora. Full report: docs/experiments/lr-finder/README.md
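The selection rule above can be sketched in a few lines. This is an illustration, not the code behind the report: `synthetic_loss` is a made-up stand-in for a real training sweep, and the function names, the bias-corrected EMA, and the "4× the best smoothed loss" blow-up threshold are all assumptions.

```python
import math

def exponential_lr_schedule(lr_min, lr_max, num_steps):
    """LR grows geometrically from lr_min to lr_max over num_steps."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

def ema_smooth(values, beta=0.9):
    """Bias-corrected exponential moving average."""
    smoothed, avg = [], 0.0
    for i, v in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * v
        smoothed.append(avg / (1.0 - beta ** i))
    return smoothed

def pick_lr(lrs, losses, beta=0.9, blowup_factor=4.0):
    """Return (lr at steepest descent, lr at blow-up, blow-up / 10)."""
    smooth = ema_smooth(losses, beta)
    # Steepest descent: most negative slope of smoothed loss vs log(lr).
    slopes = [
        (smooth[i + 1] - smooth[i]) / (math.log(lrs[i + 1]) - math.log(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    lr_steepest = lrs[min(range(len(slopes)), key=slopes.__getitem__)]
    # Blow-up: first point where smoothed loss exceeds blowup_factor x best so far.
    best, lr_blowup = float("inf"), None
    for lr, s in zip(lrs, smooth):
        best = min(best, s)
        if s > blowup_factor * best:
            lr_blowup = lr
            break
    lr_conservative = lr_blowup / 10 if lr_blowup is not None else None
    return lr_steepest, lr_blowup, lr_conservative

def synthetic_loss(lr):
    """Made-up loss curve: a smooth descent followed by divergence at high LR."""
    x = math.log10(lr)
    descent = 4.0 / (1.0 + math.exp(-3.0 * (x + 4.0)))
    blowup = math.exp(4.0 * (x + 2.0))
    return 11.0 - descent + blowup

lrs = exponential_lr_schedule(1e-6, 1.0, 240)
losses = [synthetic_loss(lr) for lr in lrs]
lr_steepest, lr_blowup, lr_conservative = pick_lr(lrs, losses)
print(f"steepest descent: {lr_steepest:.2e}")
print(f"blow-up:          {lr_blowup:.2e}  ->  conservative: {lr_conservative:.2e}")
```

In a real sweep `losses` would be the per-step training loss recorded while the LR ramps; only `pick_lr` would carry over.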
Why we moved to torchtitan
| | MDS | TT |
|---|---|---|
| Actively maintained | ❌️ | ✅ |
| Declarative parallelism (DTensor, FSDP2) | ❌️ | ✅ |
| FSDP+TP / EP / CP without plumbing | ❌️ | ✅ |
| MoE support | ❌️ | ✅ |
| Easy to extend, debug, maintain | ❌️ | ✅ |
The trade-off we accepted: living on a fast-moving upstream pytorch/torchtitan@main; the “fork tax”
2B reference + torchtitan overlay
Training loss comparison for AuroraGPT-2B trained with MDS vs. TT.
2B loss: MDS full trajectory vs torchtitan
256N / GBS=6,144. At matched tokens, δ ≈ 0.02 — within run-to-run noise. The cutover preserved training behavior.
2B eval: MDS reference vs torchtitan
Data: docs/evals/agpt/2b, production run: docs/production/agpt/2b
20B eval: all-production overlay (2B + 20B)
Data: docs/evals/agpt/20b, production run: docs/production/agpt/20b
The fork tax: upstream-sync as a workflow
- 46 Upstream Syncs in 7 Weeks
- Smoke test: bit-exact loss + grad-norm + peak memory
- Verify changes from upstream haven’t broken anything
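A bit-exact smoke gate can be as small as a comparison of per-step metrics against a stored baseline. The sketch below is one possible shape, not the project's actual gate: the metric keys, the record layout, and the sample numbers are all hypothetical.

```python
def smoke_gate(baseline, candidate):
    """Compare per-step metrics bit-for-bit; return a list of mismatches.

    An empty list means the candidate run reproduced the baseline exactly.
    Floats are compared via float.hex(), so no tolerance is applied.
    """
    mismatches = []
    if len(baseline) != len(candidate):
        mismatches.append(f"step count {len(baseline)} != {len(candidate)}")
    for step, (ref, new) in enumerate(zip(baseline, candidate), start=1):
        for key in ("loss", "grad_norm"):
            if float(ref[key]).hex() != float(new[key]).hex():
                mismatches.append(f"step {step}: {key} {ref[key]!r} != {new[key]!r}")
        if ref["peak_mem_bytes"] != new["peak_mem_bytes"]:
            mismatches.append(
                f"step {step}: peak_mem_bytes "
                f"{ref['peak_mem_bytes']} != {new['peak_mem_bytes']}"
            )
    return mismatches

# Made-up baseline from a short deterministic run before an upstream sync.
baseline = [
    {"loss": 12.4531, "grad_norm": 3.9102, "peak_mem_bytes": 41_875_931_136},
    {"loss": 11.8704, "grad_norm": 2.7750, "peak_mem_bytes": 41_875_931_136},
    {"loss": 11.2019, "grad_norm": 2.1433, "peak_mem_bytes": 41_875_931_136},
]

# After the sync, rerun the same config and seed, then gate on the result.
after_sync = [dict(m) for m in baseline]
problems = smoke_gate(baseline, after_sync)
print("PASS" if not problems else "\n".join(problems))
```

The point of comparing without a tolerance is that any drift, however small, forces a look at the upstream change that caused it.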
The Restarts: At Scale, Failure is the Default
- Llama 3 405B: 16K H100s · 54 days · 419 failures (≈ 1 every 3h); 99% recovered via automation[14]
- OPT-175B: 35 manual restarts + 100+ cycled hosts in 2 mo on ~1K A100s[15]
- BLOOM-176B: frequent loss spikes; embedding-norm + checkpoint cadence on 384 A100s × 3.5 mo[16]
- GLM-130B: loss spikes “increasingly frequent”; some recover, others go to NaN[17]
“The Restarts”: three layers of recovery
Bad-node failover, hang-watchdog, and PBS resubmit each operate at a different scope.
Inner loops catch most failures; outer loops catch the rest.
What generalizes, what doesn’t
Generalizes across vendors / sites / models
- Bit-exact deterministic smoke gate after every upstream sync
- lm-eval as the ground truth for “is it actually learning?”
- Spare-node failover wrapper — same idea on Slurm
- Launcher / env autodetect — push every vendor-shaped assumption out of training code
Doesn’t generalize (needs per-(config, hardware, version) tuning)
- torch.compile decisions
- AC boundaries (MoE + AC + compile = grief)
- EP↔FSDP frontier
- Collective tuning (XCCL vs gloo fallbacks, NCCL env)
Thanks
AuroraGPT team: Venkat Vishwanath, the AI/ML Group at ALCF, collaborators across ANL.
Argonne Leadership Computing Facility: Aurora time, Sunspot staging.
Intel: Intel Max 1550 XPU + oneAPI / XCCL / IPEX support throughout.
Code & docs
- ezpz: github.com/saforem2/ezpz
- torchtitan fork (experiments/ezpz): github.com/saforem2/torchtitan
- These slides: sam.onl/talks/2026/06/03
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Questions?
Appendix: backup slides
Material that didn’t make the main path but is here for Q&A.
- Open questions
- Silent-correctness bugs
- Failover engineering deep-dive
- LR-finder
- yeet-env tarball broadcast scaling
Open questions: the ask
- A portable bit-exact regression suite across vendors — does anyone have one?
- torch.compile at 1T scale: defensible decision tree?
- Async-checkpoint Pareto frontier: recovery time × frequency × storage cost in production
- Optimizer failure at 80B+: SophiaG (Hessian-diagonal estimate saturates) + Muon (Newton-Schulz iterations overflow bf16) both diverge — algorithmic limit or fp-precision artifact?
- MoE EP↔FSDP scaling boundaries at 16B / 64B / 100B
- Make xccl honor train_timeout_seconds so we don’t have to rely on a stdout-idle watchdog as the hang-detection ground truth
Operational reality — bad-node failover
5+ production jobs killed by bad-node failures in 2 weeks. Pattern: PBS gives us 256 / 512 nodes, one is bad, training crashes or hangs after N hours, walltime gone.
| Job | Trajectory | Failure |
|---|---|---|
| 8459818 | 2B 256N v2 | shepherd died from signal 9 after step 2070 |
| 8470102 | 20B 256N v2 | gloo TCP Connection closed by peer after ~3h |
| 8479579 | 20B 512N v2 | silent hang at step 803 (heartbeat continued) |
Failover wrapper: request select=N+spare (~2%, min 4). Split
into active + spare pool. On crash, scrape bad nodes from log, swap a
spare in, retry.
qsub -q prod -l select=522 -v NHOSTS_TRAIN=512 \
  submit_agpt_2b_aurora_venv_failover.sh

Handles 6 recurring crash modes. Does not handle silent hangs; those still need a heartbeat watchdog.
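The `select=522` for a 512-node run follows from the "~2%, min 4" rule. A small sketch of the sizing and the active/spare split, assuming the 2% is rounded to the nearest integer (the rounding mode is not stated above) and using placeholder hostnames:

```python
def spare_count(n_train, fraction=0.02, minimum=4):
    """Spares to request on top of n_train nodes: ~2%, but at least `minimum`."""
    return max(minimum, round(n_train * fraction))

def split_pool(hosts, n_train):
    """Split an allocation into the active training set and the spare pool."""
    if len(hosts) < n_train:
        raise ValueError(f"need {n_train} hosts, allocation has {len(hosts)}")
    return hosts[:n_train], hosts[n_train:]

n_train = 512
n_select = n_train + spare_count(n_train)

# Placeholder hostnames; a real job would read these from the PBS nodefile.
hosts = [f"node{i:04d}" for i in range(n_select)]
active, spares = split_pool(hosts, n_train)

print(f"select={n_select}  active={len(active)}  spares={len(spares)}")
```

For small jobs the minimum dominates: a 64-node run would still request 4 spares under this rule.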
Silent-hang detection: ezpz launch --timeout / --retries
The problem. xccl on XPU silently ignores
train_timeout_seconds, so a torchtitan job stuck in a hung
collective sits consuming the full PBS walltime instead of aborting.
Every collective hang quiets stdout (every rank blocks in the same
call, nothing reaches the log) — that’s the signal we can act on.
ezpz launch --timeout 600 --retries 3 \
  python -m torchtitan.train --config-file ./config.toml

- --timeout SECONDS: kill the launched process if its stdout goes idle (not walltime) for this many consecutive seconds. Returns exit code 124 (matches GNU timeout(1)).
- --retries N: re-execute on any non-zero exit (including the watchdog’s 124) up to N times. Exponential backoff: 5s → 10s → 20s → 40s → 60s (capped).
Scope caveat. Watches only the process ezpz launch spawns
directly. If qsub runs a wrapper script that internally invokes
python train.py, the watchdog needs to live inside that script (or
you wrap the inner call with ezpz launch too).
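The behavior described for `--timeout` and `--retries` can be approximated with the standard library. This is a sketch of the idea only, not ezpz's implementation: the function names are hypothetical, and it treats stderr as part of the watched stream, which is an assumption.

```python
import subprocess
import sys
import threading
import time

IDLE_EXIT_CODE = 124  # same code GNU timeout(1) returns on timeout

def run_with_idle_timeout(cmd, idle_seconds):
    """Run cmd, echo its output, and SIGTERM it if output stalls."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_output = [time.monotonic()]
    timed_out = threading.Event()

    def watchdog():
        # Fires on absence of output, not on total elapsed time.
        while proc.poll() is None:
            if time.monotonic() - last_output[0] > idle_seconds:
                timed_out.set()
                proc.terminate()
                return
            time.sleep(min(1.0, idle_seconds / 10))

    threading.Thread(target=watchdog, daemon=True).start()
    for line in proc.stdout:
        last_output[0] = time.monotonic()
        sys.stdout.write(line)
    proc.wait()
    return IDLE_EXIT_CODE if timed_out.is_set() else proc.returncode

def backoff_delays(retries, base=5.0, cap=60.0):
    """Exponential backoff: 5s, 10s, 20s, 40s, then capped at 60s."""
    return [min(cap, base * 2 ** i) for i in range(retries)]

def launch_with_retries(cmd, idle_seconds, retries, sleep=time.sleep):
    """Re-run cmd on any non-zero exit, up to `retries` extra attempts."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        code = run_with_idle_timeout(cmd, idle_seconds)
        if code == 0 or attempt == retries:
            return code
        sleep(delays[attempt])

# Usage (not executed here):
#   launch_with_retries(["python", "train.py"], idle_seconds=600, retries=3)
```

Keying the watchdog on output silence rather than elapsed time is what lets it coexist with long but healthy runs.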
ezpz launch --timeout: one hang/recover cycle
Every collective hang shows up as silence on stdout — the process is “alive” by kill -0 but nothing is happening. The watchdog fires on the absence of progress, not on a heartbeat ping.
--auto-retry: bad-node failover, on tap
Allocate spares up front, swap them in on failure:
# 522 nodes allocated, train on 512, keep 10 as spares.
# Loop until success, walltime, or spare exhaustion.
ezpz launch --auto-retry --np 512 -- python -m torchtitan.train …

- Classifies each attempt’s exit → success / walltime / bad-node / stuck-pre-training
- On bad-node: scrapes the failing host from the log, swaps in a spare, re-execs
- Guards against config bugs: 2 consecutive attempts with zero step= markers → stop (don’t burn the whole spare pool on a broken run)
Ships in saforem2/ezpz#144. Same scraper as the bash-lib path; pure-Python loop on top.
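The classify-swap-retry loop can be sketched as below. This is not the code in the PR: the classification rules (exit 143 as walltime, zero `step=` markers as stuck-pre-training), the return labels, and the `run_attempt` callback signature are assumptions made for illustration.

```python
import re

STEP_MARKER = re.compile(r"\bstep=\d+")

def classify(exit_code, steps_seen):
    """Map one attempt to an outcome label (rules are illustrative)."""
    if exit_code == 0:
        return "success"
    if exit_code == 143:          # SIGTERM from the scheduler at walltime
        return "walltime"
    if steps_seen == 0:
        return "stuck-pre-training"
    return "bad-node"

def auto_retry(run_attempt, active, spares, max_zero_step_streak=2):
    """Retry until success, walltime, spare exhaustion, or a suspected config bug.

    run_attempt(active_hosts) must return (exit_code, log_text, bad_hosts).
    """
    active, spares = list(active), list(spares)
    zero_streak, history = 0, []
    while True:
        exit_code, log, bad_hosts = run_attempt(active)
        steps = len(STEP_MARKER.findall(log))
        outcome = classify(exit_code, steps)
        history.append(outcome)
        if outcome in ("success", "walltime"):
            return outcome, history
        # Guard: repeated attempts that never reach a training step
        # point at a broken config, not a broken node.
        zero_streak = zero_streak + 1 if steps == 0 else 0
        if zero_streak >= max_zero_step_streak:
            return "stopped: no step= markers", history
        # Swap identified hosts; with nothing to go on, blind-swap the first host.
        to_swap = [h for h in bad_hosts if h in active] or active[:1]
        if len(spares) < len(to_swap):
            return "stopped: spares exhausted", history
        for host in to_swap:
            active[active.index(host)] = spares.pop(0)
```

The zero-step guard is what keeps a typo in a config file from consuming every spare in the pool.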
Failover wrapper: caught a real silent hang in production
Job 8505298, 2026-05-23. Attempt 1 trains cleanly steps 1→37, then
log goes completely silent at step 37. No traceback, no MPI error,
no rank dying. Just dead.
| Time (CT) | Event |
|---|---|
| 21:06:41 | step 37 logged · loss 11.80 · tps 3,919 |
| 21:36:41 | 30 min dead air · ezpz launch --timeout=1800 SIGTERMs |
| 21:36:43 | wrapper classifies exit 124 → silent-hang (not walltime) |
| 21:36:43 | no traceback to scrape → blind swap of rank-0 host |
| 21:36:45 | attempt 2 launches on swapped node set |
| 21:57:49 | walltime hit · step 296 · loss 5.68 · ckpts persisted |
Three new pieces had to fire in sequence on a real-world hang to
prove production-readiness: --timeout=1800 watchdog · exit 124
classification distinct from PBS exit 143 · failover_swap_one_blind()
when no specific bad node can be identified. They did.
Full writeup: docs/experiments/agpt/aurora/20260523-failover-silent-hang-recovery-8505298.md
LR-finder — 2B sweep + optimal LR per config
Optimal LR per config. AdamW most tolerant; SophiaG ~10× lower.
Findings: AdamW most LR-tolerant; SophiaG needs ~10× lower LR; SophiaG + Muon both break at 80B (bf16 overflow in Newton-Schulz / Hessian estimate on 9216-dim matrices).
Silent bug #1 — bf16 master ⇒ RMSNorm frozen
Symptom. Loss curves looked reasonable. lm-eval scores didn’t move — ARC-Easy stuck at ~0.27 (random baseline) for 17K+ steps.
Cause. training.dtype=bfloat16 → bf16 master copy. RMSNorm
weights init at 1.0; bf16 ULP at scale 1.0 is ~7.8e-3. Per-step
optimizer update is ~1.6e-5 — every update rounds to zero. All
25 RMSNorm tensors stayed at exactly 1.0 from step 100 → 17,400.
Why other params trained fine. Linear layers init at std≈0.02 →
bf16 ULP at scale 0.02 is ~3.8e-5, same magnitude as the update.
RMSNorm’s larger init scale = coarser ULP = updates lost in rounding.
Fix. Default training.dtype=float32, FSDP
MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=fp32). Master is
fp32; forward/backward stay bf16. Extra ~1 GB master at 2B, ~10 GB at
20B — under budget.
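The rounding argument for RMSNorm can be reproduced without a GPU. The sketch below emulates bf16 and fp32 storage with `struct`, and applies the same ~1.6e-5 update for 17,300 steps (step 100 → 17,400) to a weight initialized at 1.0. A constant, same-sign update is a simplification, and the helper names are illustrative.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round to the nearest bfloat16 value (ties to even).

    bf16 is the top 16 bits of float32. Sketch only: no NaN/inf handling.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

update = 1.6e-5      # per-step optimizer update magnitude quoted above
steps = 17_300       # step 100 -> 17,400

w_bf16 = 1.0         # RMSNorm weight stored as a bf16 master
w_fp32 = 1.0         # same weight stored as an fp32 master
for _ in range(steps):
    w_bf16 = to_bf16(w_bf16 - update)
    w_fp32 = to_fp32(w_fp32 - update)

ulp_at_one = to_bf16(1.0 + 2 ** -7) - 1.0
print(f"bf16 ULP at 1.0:  {ulp_at_one}")
print(f"bf16 master:      {w_bf16}")
print(f"fp32 master:      {w_fp32:.4f}")
```

With a bf16 master the weight never leaves 1.0, because every update is far below half a ULP; the fp32 master accumulates the same updates normally.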
Silent bug #1 — the smoking gun

v1 (bf16 master, 256N): ARC-Easy 0.27 flat across 2,500 steps · v2
(fp32 master, 512N): ARC-Easy 0.27 → 0.44 by step 800 · HellaSwag breaking
out.
Lesson: loss looks like training. lm-eval is the only ground truth for “is the model actually learning?” Add a periodic eval gate.
Silent bug #2 — TP loss reported / dp_world_size
- Symptom: Step-1 loss for agpt_* (vocab=256128) should be ln(256128) ≈ 12.45
  - On TP=1 we see 12.95 ✓
  - On TP=2 after upstream commit 1786292d (2026-04-27):
    - step-1 reported as 1.07 ✗: exactly 12.84 / 12 where dp_world_size = 12
- Cause: _dist_reduce() short-circuits on DTensor with full_tensor()
  - But the loss is Replicated on the TP mesh, and the reduction was requested over batch_mesh (orthogonal)
  - The short-circuit silently drops the cross-batch sum
- Why it survived review: Gradients + optimizer steps are correct
  - Only the loss: field that lands in stdout / W&B is wrong
  - Loss curves look “reasonable”, just 1/12 of the true value
  - Filed as pytorch/torchtitan#3204; our workaround calls loss.full_tensor() before dist_sum in ezpz/trainer.py:503-516
- Lesson: Bit-exact smoke caught this immediately; the prior 2B TP=1 baseline gave us a number to disagree with
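The step-1 check above is cheap to automate: an untrained model should score close to ln(vocab), and a reported value that only matches after multiplying by dp_world_size points at a dropped reduction. A minimal sketch; the function name, labels, and the 10% tolerance are illustrative choices, not part of the trainer.

```python
import math

def diagnose_step1_loss(reported, vocab_size, dp_world_size, rel_tol=0.10):
    """Compare a reported step-1 loss with the uniform-prediction baseline."""
    expected = math.log(vocab_size)   # cross-entropy of a uniform guess
    tolerance = rel_tol * expected
    if abs(reported - expected) <= tolerance:
        return "ok"
    if abs(reported * dp_world_size - expected) <= tolerance:
        return "scaled by 1/dp_world_size"
    return "unexpected"

vocab_size = 256128
dp_world_size = 12

print(f"expected step-1 loss: {math.log(vocab_size):.2f}")
print("TP=1:", diagnose_step1_loss(12.95, vocab_size, dp_world_size))
print("TP=2:", diagnose_step1_loss(1.07, vocab_size, dp_world_size))
```

The check only inspects the logged number, so it catches reporting bugs like this one even when gradients and optimizer steps are correct.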
ezpz yeet: Efficiently Running 50k Python Processes
| Nodes | yeet (s) | First-step (s) | Per-node (ms) |
|---|---|---|---|
| 8 | 69.7 | 29.3 | 8,712 |
| 16 | 89.7 | 31.6 | 5,606 |
| 32 | 89.2 | 20.9 | 2,788 |
| 64 | 91.2 | 34.6 | 1,425 |
| 128 | 110.4 | 30.5 | 862 |
| 256 | 132.9 | 37.6 | 519 |
| 512 | 174.5 | 44.5 | 341 |
| 1024 | 255.4 | 60.8 | 249 |
| 2048 | 421.4 | 94.8 | 206 |
| 4096 | 750.6 | 194.0 | 183 |
Two regimes. 8–64 nodes extract-bound (~70–91s flat, per-node cost falls 8.7s → 1.4s); ≥128 nodes broadcast-bound, where each 2× in nodes multiplies wall-clock by ~1.2× (128 → 256) rising to ~1.8× (2048 → 4096).
Full write-up: sam.onl/posts/2026/05/01
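The per-node column and the cost of each doubling can be recomputed directly from the yeet timings in the table. A short sketch (the variable names are illustrative; the numbers are the table's):

```python
# yeet wall-clock seconds by node count, copied from the table above.
yeet_seconds = {
    8: 69.7, 16: 89.7, 32: 89.2, 64: 91.2, 128: 110.4,
    256: 132.9, 512: 174.5, 1024: 255.4, 2048: 421.4, 4096: 750.6,
}

nodes = sorted(yeet_seconds)

# Per-node cost in milliseconds: total yeet time divided by node count.
per_node_ms = {n: 1000 * yeet_seconds[n] / n for n in nodes}

# Wall-clock multiplier for each doubling of the node count.
doubling_ratio = {
    b: yeet_seconds[b] / yeet_seconds[a] for a, b in zip(nodes, nodes[1:])
}

print(f"{'nodes':>6} {'per-node (ms)':>14} {'vs previous':>12}")
for n in nodes:
    ratio = f"{doubling_ratio[n]:.2f}x" if n in doubling_ratio else "-"
    print(f"{n:>6} {per_node_ms[n]:>14.0f} {ratio:>12}")
```

At 12 processes per node, the 4096-node row corresponds to 49,152 Python processes.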
Footnotes
- [1] Argonne National Laboratory
- [2] And away from argonne-lcf/Megatron-DeepSpeed!
- [6] Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
- [7] See the 📊 All Runs section.
- [8] Cyclical Learning Rates for Training Neural Networks (Smith 2015)
- [9] How do you find a good learning rate (Gugger 2017)
- [10] Extending µP: Spectral Conditions for Feature Learning Across Optimizers (Gupta et al. 2026)
- [11] allenai/olmo-mix-1124: 0 → 4.67T tokens
- [12] allenai/dolmino-mix-1124: 4.67T → 7.06T tokens
- [13] NVIDIA/{Nemotron-CC-Math-v1, Code-CC-v1}: 7.06T → 7.77T tokens
- [14] Llama 3 herd of models (Meta AI, 2024), §3.3.2 (Training reliability)
- [15] OPT-175B chronicles + dev log (Zhang et al., 2022)
- [16] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (BigScience, 2022)
- [17] GLM-130B: An Open Bilingual Pre-Trained Model (Zeng et al., ICLR 2023)