<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/rss/styles.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Sam Foreman</title><description>Personal site and blog of Sam Foreman -- computational scientist at Argonne National Laboratory.</description><link>https://sam.onl/</link><item><title>[talk] Production Pre-Training at Scale: The Good, the Bad, and the Restarts</title><link>https://sam.onl/talks/2026/06/03/</link><guid isPermaLink="true">https://sam.onl/talks/2026/06/03/</guid><description>Production pre-training at scale is less a story of peak FLOP/s than of what works, what breaks, how hyperparameters are scaled and tuned, and how cheaply you recover: corrupted shards, ECC errors, fabric flaps, silent data corruption, and the operational cost of separating transient from systemic faults. This talk describes the open-source stack we have built and battle-tested through the AuroraGPT effort at the Argonne Leadership Computing Facility, where we routinely train across thousands of Intel GPU nodes on Aurora. The stack pairs `blendcorpus` for reproducible, weighted data blending over petabyte-scale scientific corpora with a fork of `torchtitan` extended to target Intel XPUs and the SYCL toolchain, driven by `ezpz`, our orchestration layer that makes distributed PyTorch launches portable across {NVIDIA, AMD, Intel, MPS, CPU} with zero code changes. We will walk through concrete pain points at each layer, including upstream data churn and shard reproducibility, collective tuning and dataloader behavior on non-NVIDIA accelerators, and node-failure management at scale. We will also describe our efforts with various first- and second-order optimizers and the related tuning for scaling pre-training.
We will close on the good, the bad, and the restarts: what generalizes across vendors, what we got wrong, and the open questions we would like the TPC community to help us answer.</description><pubDate>Wed, 03 Jun 2026 00:00:00 GMT</pubDate><content:encoded>Production pre-training at scale is less a story of peak FLOP/s than of what works, what breaks, how hyperparameters are scaled and tuned, and how cheaply you recover: corrupted shards, ECC errors, fabric flaps, silent data corruption, and the operational cost of separating transient from systemic faults. This talk describes the open-source stack we have built and battle-tested through the AuroraGPT effort at the Argonne Leadership Computing Facility, where we routinely train across thousands of Intel GPU nodes on Aurora. The stack pairs `blendcorpus` for reproducible, weighted data blending over petabyte-scale scientific corpora with a fork of `torchtitan` extended to target Intel XPUs and the SYCL toolchain, driven by `ezpz`, our orchestration layer that makes distributed PyTorch launches portable across {NVIDIA, AMD, Intel, MPS, CPU} with zero code changes. We will walk through concrete pain points at each layer, including upstream data churn and shard reproducibility, collective tuning and dataloader behavior on non-NVIDIA accelerators, and node-failure management at scale. We will also describe our efforts with various first- and second-order optimizers and the related tuning for scaling pre-training.
We will close on the good, the bad, and the restarts: what generalizes across vendors, what we got wrong, and the open questions we would like the TPC community to help us answer.</content:encoded><category>talk</category></item><item><title>[post] Running 50k Python Processes on Aurora with ezpz yeet</title><link>https://sam.onl/posts/2026/05/01/</link><guid isPermaLink="true">https://sam.onl/posts/2026/05/01/</guid><description>How `ezpz yeet` distributes Python environments to every worker node in an HPC job, and how it scales from 8 to 4096 nodes on Aurora.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><content:encoded>How `ezpz yeet` distributes Python environments to every worker node in an HPC job, and how it scales from 8 to 4096 nodes on Aurora.</content:encoded><category>post</category></item><item><title>[post] Pre-Training AuroraGPT with TorchTitan</title><link>https://sam.onl/posts/2026/04/27/</link><guid isPermaLink="true">https://sam.onl/posts/2026/04/27/</guid><description>Pre-training AuroraGPT with TorchTitan and ezpz: Last Two Weeks (Apr 12–27, 2026)</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><content:encoded>Pre-training AuroraGPT with TorchTitan and ezpz: Last Two Weeks (Apr 12–27, 2026)</content:encoded><category>post</category></item><item><title>[post] ⏱️ Comparing Launchers on Aurora</title><link>https://sam.onl/posts/2026/02/28/</link><guid isPermaLink="true">https://sam.onl/posts/2026/02/28/</guid><description>Benchmarking and comparing the performance of different launchers on Aurora at ALCF: `torchrun` vs. `ezpz launch`</description><pubDate>Sat, 28 Feb 2026 00:00:00 GMT</pubDate><content:encoded>Benchmarking and comparing the performance of different launchers on Aurora at ALCF: `torchrun` vs. 
`ezpz launch`</content:encoded><category>post</category></item><item><title>[post] 🍋 ezpz: distributed PyTorch across any hardware</title><link>https://sam.onl/posts/2026/01/10/</link><guid isPermaLink="true">https://sam.onl/posts/2026/01/10/</guid><description>A history and overview of `ezpz`, with AMD and Intel PyTorch enablement timelines and why portable distributed training across GPU vendors is finally possible.</description><pubDate>Sat, 10 Jan 2026 00:00:00 GMT</pubDate><content:encoded>A history and overview of `ezpz`, with AMD and Intel PyTorch enablement timelines and why portable distributed training across GPU vendors is finally possible.</content:encoded><category>post</category></item><item><title>[post] 🎉 Happy New Year!</title><link>https://sam.onl/posts/2026/01/07/</link><guid isPermaLink="true">https://sam.onl/posts/2026/01/07/</guid><description>A New Year update summarizing ongoing projects including AuroraGPT, AERIS, and other involvements at Argonne.</description><pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate><content:encoded>A New Year update summarizing ongoing projects including AuroraGPT, AERIS, and other involvements at Argonne.</content:encoded><category>post</category></item><item><title>[talk] AuroraGPT: Training Foundation Models on Supercomputers</title><link>https://sam.onl/talks/demo-slides/</link><guid isPermaLink="true">https://sam.onl/talks/demo-slides/</guid><description>Faithful structural port of the 2025-12-16 talk @ Argonne National Laboratory</description><pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate><content:encoded>Faithful structural port of the 2025-12-16 talk @ Argonne National Laboratory</content:encoded><category>talk</category></item><item><title>[talk] AuroraGPT: Training Foundation Models on Supercomputers</title><link>https://sam.onl/talks/2025/12/16/</link><guid isPermaLink="true">https://sam.onl/talks/2025/12/16/</guid><pubDate>Tue, 16 Dec 2025 00:00:00 
GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation</title><link>https://sam.onl/posts/2025/11/12/</link><guid isPermaLink="true">https://sam.onl/posts/2025/11/12/</guid><description>Best practices for cooling down model checkpoints before evaluation to improve validation loss comparisons.</description><pubDate>Wed, 12 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Best practices for cooling down model checkpoints before evaluation to improve validation loss comparisons.</content:encoded><category>post</category></item><item><title>[talk] Training Foundation Models on Supercomputers</title><link>https://sam.onl/talks/2025/10/24/</link><guid isPermaLink="true">https://sam.onl/talks/2025/10/24/</guid><pubDate>Fri, 24 Oct 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] Training Foundation Models on Supercomputers</title><link>https://sam.onl/talks/2025/10/15/</link><guid isPermaLink="true">https://sam.onl/talks/2025/10/15/</guid><pubDate>Wed, 15 Oct 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] AERIS: Argonne&apos;s Earth Systems Model</title><link>https://sam.onl/talks/2025/10/08/</link><guid isPermaLink="true">https://sam.onl/talks/2025/10/08/</guid><pubDate>Wed, 08 Oct 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 🎨 Mixing Between Distributions While Training</title><link>https://sam.onl/posts/2025/10/06/</link><guid isPermaLink="true">https://sam.onl/posts/2025/10/06/</guid><description>A mathematical framework for smoothly interpolating between data distributions during training using an annealing schedule.</description><pubDate>Mon, 06 Oct 2025 00:00:00 GMT</pubDate><content:encoded>A mathematical framework for smoothly interpolating between data distributions during training using an annealing 
schedule.</content:encoded><category>post</category></item><item><title>[talk] Training Foundation Models on Supercomputers</title><link>https://sam.onl/talks/2025/09/24/</link><guid isPermaLink="true">https://sam.onl/talks/2025/09/24/</guid><pubDate>Wed, 24 Sep 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 📊 `pbs-tui`: TUI for PBS Job Scheduler Monitoring</title><link>https://sam.onl/posts/2025/09/17/</link><guid isPermaLink="true">https://sam.onl/posts/2025/09/17/</guid><description>A terminal dashboard for monitoring PBS Pro job schedulers with interactive keybindings and snapshot modes.</description><pubDate>Wed, 17 Sep 2025 00:00:00 GMT</pubDate><content:encoded>A terminal dashboard for monitoring PBS Pro job schedulers with interactive keybindings and snapshot modes.</content:encoded><category>post</category></item><item><title>[post] 🍹 BlendCorpus + TorchTitan @ ALCF</title><link>https://sam.onl/posts/2025/09/12/</link><guid isPermaLink="true">https://sam.onl/posts/2025/09/12/</guid><description>A walkthrough of running BlendCorpus with TorchTitan for multi-source data training at ALCF.</description><pubDate>Fri, 12 Sep 2025 00:00:00 GMT</pubDate><content:encoded>A walkthrough of running BlendCorpus with TorchTitan for multi-source data training at ALCF.</content:encoded><category>post</category></item><item><title>[talk] Open SkAI2025</title><link>https://sam.onl/talks/openskai25/</link><guid isPermaLink="true">https://sam.onl/talks/openskai25/</guid><pubDate>Tue, 02 Sep 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] Scientific AI at Scale: AuroraGPT</title><link>https://sam.onl/talks/openskai25/ai4science/</link><guid isPermaLink="true">https://sam.onl/talks/openskai25/ai4science/</guid><pubDate>Tue, 02 Sep 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] Scientific AI at Scale: Distributed 
Training</title><link>https://sam.onl/talks/openskai25/training/</link><guid isPermaLink="true">https://sam.onl/talks/openskai25/training/</guid><pubDate>Tue, 02 Sep 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] AuroraGPT</title><link>https://sam.onl/talks/auroragpt-siam25/</link><guid isPermaLink="true">https://sam.onl/talks/auroragpt-siam25/</guid><pubDate>Thu, 31 Jul 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 🏗️ Building PyTorch 2.8 from Source on Aurora</title><link>https://sam.onl/posts/2025/06/14/</link><guid isPermaLink="true">https://sam.onl/posts/2025/06/14/</guid><description>Step-by-step instructions for building PyTorch 2.8 and related Intel libraries from source on Aurora.</description><pubDate>Sat, 14 Jun 2025 00:00:00 GMT</pubDate><content:encoded>Step-by-step instructions for building PyTorch 2.8 and related Intel libraries from source on Aurora.</content:encoded><category>post</category></item><item><title>[post] 🧜‍♀️ Mermaid</title><link>https://sam.onl/posts/2025/06/02/</link><guid isPermaLink="true">https://sam.onl/posts/2025/06/02/</guid><description>Experiments with Mermaid diagram rendering for flowcharts depicting distributed GPU training.</description><pubDate>Mon, 02 Jun 2025 00:00:00 GMT</pubDate><content:encoded>Experiments with Mermaid diagram rendering for flowcharts depicting distributed GPU training.</content:encoded><category>post</category></item><item><title>[post] 📰 Nice Headings</title><link>https://sam.onl/posts/2025/06/01/</link><guid isPermaLink="true">https://sam.onl/posts/2025/06/01/</guid><description>Recreating neovim-style heading aesthetics for website content in both light and dark themes.</description><pubDate>Sun, 01 Jun 2025 00:00:00 GMT</pubDate><content:encoded>Recreating neovim-style heading aesthetics for website content in both light and dark 
themes.</content:encoded><category>post</category></item><item><title>[talk] LLMs on Aurora: Overview</title><link>https://sam.onl/talks/incite-hackathon-2025/auroragpt/</link><guid isPermaLink="true">https://sam.onl/talks/incite-hackathon-2025/auroragpt/</guid><pubDate>Wed, 21 May 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] ALCF Incite Hackathon 2025</title><link>https://sam.onl/talks/incite-hackathon-2025/</link><guid isPermaLink="true">https://sam.onl/talks/incite-hackathon-2025/</guid><pubDate>Wed, 07 May 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] LLMs on Aurora: Hands-On</title><link>https://sam.onl/talks/incite-hackathon-2025/ezpz/</link><guid isPermaLink="true">https://sam.onl/talks/incite-hackathon-2025/ezpz/</guid><pubDate>Wed, 07 May 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 🚧 Frameworks Issue with numpy &gt; 2</title><link>https://sam.onl/posts/2025/05/03/</link><guid isPermaLink="true">https://sam.onl/posts/2025/05/03/</guid><description>Documenting a breaking issue where upgrading numpy beyond version 2 breaks TensorFlow in the ALCF frameworks module.</description><pubDate>Sat, 03 May 2025 00:00:00 GMT</pubDate><content:encoded>Documenting a breaking issue where upgrading numpy beyond version 2 breaks TensorFlow in the ALCF frameworks module.</content:encoded><category>post</category></item><item><title>[post] 🔥 Building PyTorch 2.6 from Source on Aurora</title><link>https://sam.onl/posts/2025/04/28/</link><guid isPermaLink="true">https://sam.onl/posts/2025/04/28/</guid><description>Build instructions for compiling PyTorch 2.6 from source on Intel Aurora using environment variables and ezpz.</description><pubDate>Mon, 28 Apr 2025 00:00:00 GMT</pubDate><content:encoded>Build instructions for compiling PyTorch 2.6 from source on Intel Aurora using environment variables and
ezpz.</content:encoded><category>post</category></item><item><title>[post] 🧑🏻‍💻 Sam Foreman’s Résumé</title><link>https://sam.onl/posts/resume/</link><guid isPermaLink="true">https://sam.onl/posts/resume/</guid><description>Professional resume covering education, experience, publications, and talks.</description><pubDate>Sat, 26 Apr 2025 00:00:00 GMT</pubDate><content:encoded>Professional resume covering education, experience, publications, and talks.</content:encoded><category>post</category></item><item><title>[post] 🪛 Torchtune on Aurora</title><link>https://sam.onl/posts/torchtune-aurora/</link><guid isPermaLink="true">https://sam.onl/posts/torchtune-aurora/</guid><description>Patches and instructions for getting torchtune working on Intel Aurora with PyTorch 2.3 and 2.5.</description><pubDate>Sun, 23 Mar 2025 00:00:00 GMT</pubDate><content:encoded>Patches and instructions for getting torchtune working on Intel Aurora with PyTorch 2.3 and 2.5.</content:encoded><category>post</category></item><item><title>[post] 🚑 Torchtune Patch on Aurora</title><link>https://sam.onl/posts/torchtune-patch-aurora/</link><guid isPermaLink="true">https://sam.onl/posts/torchtune-patch-aurora/</guid><description>A specific diff patch to resolve torchtune import issues with FSDP on Aurora.</description><pubDate>Sun, 23 Mar 2025 00:00:00 GMT</pubDate><content:encoded>A specific diff patch to resolve torchtune import issues with FSDP on Aurora.</content:encoded><category>post</category></item><item><title>[talk] AuroraGPT: Foundation Models for Science</title><link>https://sam.onl/talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid/</link><guid isPermaLink="true">https://sam.onl/talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid/</guid><pubDate>Wed, 12 Feb 2025 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 🫥 svgbob</title><link>https://sam.onl/posts/svgbob/</link><guid 
isPermaLink="true">https://sam.onl/posts/svgbob/</guid><description>Experimenting with svgbob as an ASCII-art-to-SVG alternative to Mermaid diagrams.</description><pubDate>Fri, 15 Nov 2024 00:00:00 GMT</pubDate><content:encoded>Experimenting with svgbob as an ASCII-art-to-SVG alternative to Mermaid diagrams.</content:encoded><category>post</category></item><item><title>[talk] Parallel Training Methods</title><link>https://sam.onl/talks/ai-for-science-2024/</link><guid isPermaLink="true">https://sam.onl/talks/ai-for-science-2024/</guid><pubDate>Tue, 05 Nov 2024 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] AuroraGPT: ANL&apos;s General Purpose Scientific LLM</title><link>https://sam.onl/talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024/</link><guid isPermaLink="true">https://sam.onl/talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024/</guid><pubDate>Wed, 30 Oct 2024 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[talk] Deep Learning and Foundation Models at Scale</title><link>https://sam.onl/talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024/</link><guid isPermaLink="true">https://sam.onl/talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024/</guid><pubDate>Tue, 29 Oct 2024 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 💾 Converting Checkpoints</title><link>https://sam.onl/posts/auroragpt/checkpoints/</link><guid isPermaLink="true">https://sam.onl/posts/auroragpt/checkpoints/</guid><description>Scripts and procedures for converting Megatron-DeepSpeed checkpoints to HuggingFace format and back.</description><pubDate>Thu, 17 Oct 2024 00:00:00 GMT</pubDate><content:encoded>Scripts and procedures for converting Megatron-DeepSpeed checkpoints to HuggingFace format and back.</content:encoded><category>post</category></item><item><title>[post] 🏔️ Spike 
Skipper</title><link>https://sam.onl/posts/auroragpt/spike-skipper/</link><guid isPermaLink="true">https://sam.onl/posts/auroragpt/spike-skipper/</guid><description>Implementation of a mechanism to skip bad-data training steps that cause loss spikes during LLM training.</description><pubDate>Tue, 17 Sep 2024 00:00:00 GMT</pubDate><content:encoded>Implementation of a mechanism to skip bad-data training steps that cause loss spikes during LLM training.</content:encoded><category>post</category></item><item><title>[talk] AuroraGPT</title><link>https://sam.onl/talks/hpc-user-forum/auroragpt/</link><guid isPermaLink="true">https://sam.onl/talks/hpc-user-forum/auroragpt/</guid><pubDate>Wed, 04 Sep 2024 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 🍋 ezpz @ ALCF</title><link>https://sam.onl/posts/ezpz-at-alcf/</link><guid isPermaLink="true">https://sam.onl/posts/ezpz-at-alcf/</guid><description>Getting started with the ezpz Python library and shell utilities for streamlined environment setup at ALCF.</description><pubDate>Fri, 23 Aug 2024 00:00:00 GMT</pubDate><content:encoded>Getting started with the ezpz Python library and shell utilities for streamlined environment setup at ALCF.</content:encoded><category>post</category></item><item><title>[post] 📝 ezpz-v1</title><link>https://sam.onl/posts/ezpz-v1/</link><guid isPermaLink="true">https://sam.onl/posts/ezpz-v1/</guid><description>Documentation and usage guide for ezpz v1, a library for simplifying distributed training setup and testing.</description><pubDate>Fri, 23 Aug 2024 00:00:00 GMT</pubDate><content:encoded>Documentation and usage guide for ezpz v1, a library for simplifying distributed training setup and testing.</content:encoded><category>post</category></item><item><title>[post] 💅 How to Make Dope Slides</title><link>https://sam.onl/posts/dope-slides/</link><guid isPermaLink="true">https://sam.onl/posts/dope-slides/</guid><description>A guide to creating polished 
presentation slides using Quarto and Reveal.js with custom CSS.</description><pubDate>Tue, 13 Aug 2024 00:00:00 GMT</pubDate><content:encoded>A guide to creating polished presentation slides using Quarto and Reveal.js with custom CSS.</content:encoded><category>post</category></item><item><title>[talk] Training LLMs at Scale</title><link>https://sam.onl/talks/llms-at-scale/</link><guid isPermaLink="true">https://sam.onl/talks/llms-at-scale/</guid><pubDate>Fri, 09 Aug 2024 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 🔳 `l2hmc-qcd` Example: 4D SU(3)</title><link>https://sam.onl/posts/jupyter/l2hmc-4dsu3/</link><guid isPermaLink="true">https://sam.onl/posts/jupyter/l2hmc-4dsu3/</guid><description>A Jupyter notebook demonstrating l2hmc-qcd training and evaluation on 4D SU(3) lattice gauge theory.</description><pubDate>Wed, 24 Jul 2024 00:00:00 GMT</pubDate><content:encoded>A Jupyter notebook demonstrating l2hmc-qcd training and evaluation on 4D SU(3) lattice gauge theory.</content:encoded><category>post</category></item><item><title>[talk] Training LLMs on Polaris</title><link>https://sam.onl/talks/llms-on-polaris/</link><guid isPermaLink="true">https://sam.onl/talks/llms-on-polaris/</guid><pubDate>Wed, 17 Jul 2024 00:00:00 GMT</pubDate><content:encoded/><category>talk</category></item><item><title>[post] 📸 `flash-attn` on Sunspot</title><link>https://sam.onl/posts/auroragpt/flash-attn-sunspot/</link><guid isPermaLink="true">https://sam.onl/posts/auroragpt/flash-attn-sunspot/</guid><description>Debugging flash attention discrepancies on Sunspot and documenting framework comparison results with Intel.</description><pubDate>Mon, 17 Jun 2024 00:00:00 GMT</pubDate><content:encoded>Debugging flash attention discrepancies on Sunspot and documenting framework comparison results with Intel.</content:encoded><category>post</category></item><item><title>[post] 🏎️ Megatron-DeepSpeed on Intel 
XPU</title><link>https://sam.onl/posts/auroragpt/aurora-gpt/</link><guid isPermaLink="true">https://sam.onl/posts/auroragpt/aurora-gpt/</guid><description>Setup and running instructions for Megatron-DeepSpeed on Intel XPU hardware for AuroraGPT.</description><pubDate>Sat, 15 Jun 2024 00:00:00 GMT</pubDate><content:encoded>Setup and running instructions for Megatron-DeepSpeed on Intel XPU hardware for AuroraGPT.</content:encoded><category>post</category></item><item><title>[post] 🐛 `mpi4py` bug on Sunspot</title><link>https://sam.onl/posts/auroragpt/mpi4py-reproducer/</link><guid isPermaLink="true">https://sam.onl/posts/auroragpt/mpi4py-reproducer/</guid><description>A minimal reproducer for an mpi4py import error on Sunspot caused by missing MPI symbols.</description><pubDate>Sat, 25 May 2024 00:00:00 GMT</pubDate><content:encoded>A minimal reproducer for an mpi4py import error on Sunspot caused by missing MPI symbols.</content:encoded><category>post</category></item><item><title>[post] 🎲 MCMC + Diffusion Sampling</title><link>https://sam.onl/posts/ai-for-physics/diffusion/</link><guid isPermaLink="true">https://sam.onl/posts/ai-for-physics/diffusion/</guid><description>Combining denoising diffusion probabilistic models with HMC sampling for 2D U(1) lattice gauge theory.</description><pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate><content:encoded>Combining denoising diffusion probabilistic models with HMC sampling for 2D U(1) lattice gauge theory.</content:encoded><category>post</category></item><item><title>[post] 🐢 Starting Up Distributed Training on Aurora</title><link>https://sam.onl/posts/auroragpt/startup-times/</link><guid isPermaLink="true">https://sam.onl/posts/auroragpt/startup-times/</guid><description>Methodology for measuring and reducing distributed training startup times on Aurora.</description><pubDate>Thu, 21 Mar 2024 00:00:00 GMT</pubDate><content:encoded>Methodology for measuring and reducing distributed training startup times on 
Aurora.</content:encoded><category>post</category></item><item><title>[post] 🚂 Loooooooong Sequence Lengths</title><link>https://sam.onl/posts/auroragpt/long-sequences/</link><guid isPermaLink="true">https://sam.onl/posts/auroragpt/long-sequences/</guid><description>Optimizations for extremely long sequence lengths in Megatron-DeepSpeed as part of the DeepSpeed4Science project.</description><pubDate>Mon, 12 Feb 2024 00:00:00 GMT</pubDate><content:encoded>Optimizations for extremely long sequence lengths in Megatron-DeepSpeed as part of the DeepSpeed4Science project.</content:encoded><category>post</category></item><item><title>[post] 🏁 `l2hmc` Example: 2D $U(1)$</title><link>https://sam.onl/posts/jupyter/test/</link><guid isPermaLink="true">https://sam.onl/posts/jupyter/test/</guid><description>A Jupyter notebook walkthrough of the l2hmc framework for 2D U(1) gauge theory experiments.</description><pubDate>Mon, 12 Feb 2024 00:00:00 GMT</pubDate><content:encoded>A Jupyter notebook walkthrough of the l2hmc framework for 2D U(1) gauge theory experiments.</content:encoded><category>post</category></item><item><title>[post] 🎢 l2hmc-qcd Example: 2D U(1)</title><link>https://sam.onl/posts/ai-for-physics/l2hmc-qcd/2du1/</link><guid isPermaLink="true">https://sam.onl/posts/ai-for-physics/l2hmc-qcd/2du1/</guid><description>A walkthrough of the l2hmc-qcd framework applied to 2D U(1) gauge theory with PyTorch and TensorFlow.</description><pubDate>Thu, 14 Dec 2023 00:00:00 GMT</pubDate><content:encoded>A walkthrough of the l2hmc-qcd framework applied to 2D U(1) gauge theory with PyTorch and TensorFlow.</content:encoded><category>post</category></item><item><title>[post] 🔳 l2hmc-qcd Example: 4D SU(3)</title><link>https://sam.onl/posts/jupyter/l2hmc/4dsu3/</link><guid isPermaLink="true">https://sam.onl/posts/jupyter/l2hmc/4dsu3/</guid><description>An earlier version of the l2hmc-qcd 4D SU(3) example notebook with training and evaluation steps.</description><pubDate>Wed, 06 Dec 
2023 00:00:00 GMT</pubDate><content:encoded>An earlier version of the l2hmc-qcd 4D SU(3) example notebook with training and evaluation steps.</content:encoded><category>post</category></item><item><title>[post] 🎢 L2HMC for LQCD</title><link>https://sam.onl/posts/ai-for-physics/l2hmc-qcd/</link><guid isPermaLink="true">https://sam.onl/posts/ai-for-physics/l2hmc-qcd/</guid><description>Overview of the L2HMC framework applied to lattice QCD simulations.</description><pubDate>Fri, 01 Dec 2023 00:00:00 GMT</pubDate><content:encoded>Overview of the L2HMC framework applied to lattice QCD simulations.</content:encoded><category>post</category></item></channel></rss>