How Frontier Labs Use FP8 to Train Faster and Spend Less

Written by arkadiibe | Published 2026/04/13
Tech Story Tags: engineering | llm-engineering | llms | optimization | performance | llm-training | mixed-precision-training | hackernoon-top-story

TL;DR: A practical look at FP8 in LLM pretraining: how it works, where to apply it, what to watch out for, and what speedups you can realistically expect — with real numbers from an MoE model.

Training a frontier LLM costs as much as flying every single person in a sold-out Wembley Stadium from Paris to London (or back). Llama 3 70B alone required roughly 6.4 million GPU-hours. At ~$2/hour per H100, that's $13 million for a single training run.

At $13 million a run, every percentage point is worth fighting for. FP8 is one of the biggest fights — and it comes down to something surprisingly simple: the format in which numbers are stored during training.

In this post we'll get into why this works, how deep the rabbit hole goes, and what it actually saves in practice. No hand-waving — we'll look at real numbers from an actual MoE layer.

It’s all matrix multiplications

For simplicity, we'll treat cost as proportional to time — if you rent GPUs, you pay per hour, so faster training directly means lower cost.

Modern LLMs are built from repeated transformer blocks, and within each block, matrix multiplications dominate the compute — accounting for roughly 60–80% of total training compute. We'll use an MoE architecture as our running example. The diagram below shows a simplified MoE transformer block, with the GEMM operations highlighted (note that experts are a grouped GEMM).

One thing to keep in mind: the diagram represents a computation graph — a directed graph of operations where each node transforms its inputs into outputs. During pretraining, we traverse this graph in both directions. The forward pass computes the model's predictions. The backward pass traverses the graph in reverse to propagate gradients — and along the way, computes weight gradients for every matrix in the model. That's additional matmuls on top of the forward pass, which is why the backward pass costs roughly 2× the forward pass in compute.
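That 2× factor can be seen just by counting matmuls: for a forward computation Y = XW, the backward pass needs two products, dX = dY·Wᵀ and dW = Xᵀ·dY. A minimal FLOP-counting sketch (the shapes are illustrative, not taken from any particular model):

```python
# Rough FLOP count for one linear layer Y = X @ W during training.
# Forward: one matmul. Backward: two matmuls (dX = dY @ W.T, dW = X.T @ dY),
# which is where the ~2x backward cost comes from.
def matmul_flops(m, k, n):
    """FLOPs for an (m x k) @ (k x n) multiply: 2 ops (mul + add) per output element."""
    return 2 * m * k * n

batch_tokens, d_in, d_out = 4096, 8192, 8192   # illustrative shapes

fwd    = matmul_flops(batch_tokens, d_in, d_out)   # Y  = X  @ W
bwd_dx = matmul_flops(batch_tokens, d_out, d_in)   # dX = dY @ W.T
bwd_dw = matmul_flops(d_in, batch_tokens, d_out)   # dW = X.T @ dY

print((bwd_dx + bwd_dw) / fwd)   # backward/forward FLOP ratio: 2.0
```

For square-ish shapes the ratio is exactly 2; in a real model it drifts slightly, but "backward ≈ 2× forward" holds as a rule of thumb.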

The matrices involved are not small. In a typical modern LLM, hidden dimensions run from thousands to tens of thousands, and MLP intermediate dimensions are often 4× the hidden dim or larger. A single weight matrix in a large model can have hundreds of millions of elements.

All these weights and activations have to live somewhere — and that somewhere is GPU HBM memory. Reducing the memory footprint of these matrices is therefore just as valuable as speeding up the compute — and ideally, you want both at once.

So the question becomes: what actually limits how fast your GPU can do matmuls — and how can we shrink the data they operate on?

Fewer bits, more compute

Numbers in a computer are not just numbers — they're bit patterns of a fixed width. The most common formats in deep learning are FP32, BF16, and FP16, each trading range and precision for size. For most of recent deep learning history, BF16 has been the standard for training — it's wide enough to represent the dynamic range of weights and gradients, and modern GPUs are heavily optimized for it. Accumulation inside kernels is typically done in FP32.

The NVIDIA Hopper architecture (H100) introduced hardware support for a new format: FP8. Two variants are available — E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits) — allowing practitioners to trade range for precision depending on the use case. The diagram below shows how bits are distributed across all these formats.

So why go smaller? Look at the H100 SXM5 spec sheet:

BF16 gives 1,979 TFLOPS and FP8 gives 3,958 — a 2× difference on this architecture (these are the spec-sheet figures with structured sparsity; dense throughput is half, but the 2× ratio is the same). To understand where this comes from, you need to know how tensor cores actually work.

A tensor core executes a matrix multiply-accumulate on a small tile each cycle. The key constraint is the total bit-width of the operand delivery path — registers to MAC array — which is fixed in silicon. With BF16, each element is 16 bits. Switch to FP8 and each element is 8 bits, so you feed 2× the elements per cycle through the same datapath. The MAC array is physically wired to exploit this: it performs twice as many 8-bit multiplications per cycle as 16-bit ones. The accumulators stay wide (FP32), so accumulation cost doesn't change.

The 2× gain isn't magic — it follows directly from the datapath width being fixed. Halving the operand size doubles the throughput, and the accumulation tree, register file, and instruction dispatch are all sized to match exactly that. The silicon doesn't give you more than the datapath allows.

On H100, the design choices add up to 2×. A different architecture could land anywhere — but the direction is always the same: fewer bits per operand means more operations per cycle.

This also helps on the memory side. Smaller elements mean less data to load from HBM — which matters even in compute-bound regimes, since every byte you don't load is bandwidth you save for something else.

You can't just cast to FP8

With BF16, life is relatively simple — the format is wide enough that you can cast tensors and multiply without much ceremony. FP8 is different. The representable range is extremely limited: E4M3 maxes out at 448, E5M2 at 57344. Naively casting a typical weight or activation tensor to FP8 will cause overflow for large values and underflow for small ones — most of your numerical information is gone before the matmul even runs.
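To see what a naive cast destroys, here is a simplified pure-Python simulation of E4M3 rounding. It models saturation, subnormals, and the 3-bit mantissa, and ignores NaN encoding details — a sketch, not a bit-exact reference:

```python
import math

def quantize_e4m3(x):
    """Round x to the nearest FP8 E4M3 value (simplified simulation)."""
    E4M3_MAX = 448.0
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = abs(x)
    if mag > E4M3_MAX:
        return sign * E4M3_MAX                 # saturate instead of overflowing
    exp = max(math.floor(math.log2(mag)), -6)  # clamp to the min normal exponent (subnormal range below)
    step = 2.0 ** (exp - 3)                    # 3 mantissa bits -> 8 representable steps per binade
    return sign * round(mag / step) * step

print(quantize_e4m3(1000.0))  # 448.0 -- large values saturate
print(quantize_e4m3(1e-5))    # 0.0   -- small values underflow to zero
```

A typical unscaled activation tensor has values on both ends of this failure mode at once, which is exactly why scaling has to happen before the cast.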

This is why FP8 training requires explicit quantization before the cast. There are two main approaches.

Per-tensor scaling

The idea is straightforward: before casting to FP8, scale the tensor so its values fit within the representable range. Concretely, you compute a scaling factor from the tensor's maximum absolute value — specifically, FP8_MAX divided by that maximum — multiply the tensor by this factor, then cast to FP8. The diagram below illustrates this.

The result is an FP8 tensor and a single FP32 scaling coefficient. When multiplying two quantized matrices, you need to rescale the result accordingly:

The GEMM kernel handles this rescaling internally — the scaling factors are passed as inputs and folded into the output write. This is what unlocks the 2× peak throughput we described earlier.
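The whole scheme fits in a few lines. This sketch simulates only the scaling math — the actual FP8 cast and the rescale fused into the GEMM kernel are elided, and a dot product stands in for the full matmul:

```python
def quantize_per_tensor(t, fp8_max=448.0):
    """Scale a tensor so its max magnitude maps to FP8_MAX; return tensor + scale."""
    amax = max(abs(v) for v in t) or 1.0
    scale = fp8_max / amax          # single FP32 scaling coefficient
    q = [v * scale for v in t]      # real code would now cast q to FP8
    return q, scale

def fp8_dot(a, b):
    """Dot product of two per-tensor-quantized vectors, with the output rescale."""
    qa, sa = quantize_per_tensor(a)
    qb, sb = quantize_per_tensor(b)
    acc = sum(x * y for x, y in zip(qa, qb))   # accumulate in high precision (FP32 on GPU)
    return acc / (sa * sb)                     # fold both scales back into the output

print(fp8_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # ~32.0
```

On the GPU, the division by `sa * sb` costs essentially nothing: the kernel applies it once per output element as it writes the result.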

One practical question: which FP8 format to use? The answer depends on what you're quantizing — a pattern established empirically across FP8 training papers (for example, this one). For weights and activations, E4M3 is the right choice — values tend to be roughly Gaussian, concentrated near zero, and E4M3's extra mantissa bit gives finer precision in that dense region. For gradients, E5M2 works better — gradients are inherently noisy, and the gradient just needs to point in roughly the right direction. Losing a mantissa bit barely matters when the signal is already stochastic.
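The range gap between the two formats falls straight out of their bit layouts. A small helper that derives the max representable value from the exponent/mantissa split — E4M3 reserves only the all-ones-mantissa encoding in its top exponent for NaN, while E5M2 follows IEEE convention and reserves the entire top exponent for inf/NaN:

```python
def fp8_max(e_bits, m_bits, ieee_reserved):
    """Largest finite value of an FP8 format with the given bit split."""
    bias = 2 ** (e_bits - 1) - 1
    if ieee_reserved:
        # E5M2-style: top exponent fully reserved (inf/NaN), like IEEE 754
        max_exp = (2 ** e_bits - 2) - bias
        max_mant = 2 - 2 ** (-m_bits)
    else:
        # E4M3-style: only mantissa=all-ones in the top exponent is NaN
        max_exp = (2 ** e_bits - 1) - bias
        max_mant = 2 - 2 * 2 ** (-m_bits)
    return max_mant * 2 ** max_exp

print(fp8_max(4, 3, ieee_reserved=False))  # 448.0   (E4M3)
print(fp8_max(5, 2, ieee_reserved=True))   # 57344.0 (E5M2)
```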

Blockwise quantization

Per-tensor scaling works well in the average case. But at large-scale pretraining, outliers happen — a single large outlier can dominate the scaling factor, compressing all other values toward zero where they become indistinguishable from each other after the cast. The diagram below illustrates this.

The fix is to increase quantization granularity. Instead of one scaling factor per tensor, we compute one per block — a contiguous slice of the tensor quantized independently. This is blockwise quantization.

DeepSeek V3 was one of the first large models publicly reported to use blockwise quantization in pretraining. Their recipe: 1D blocks of size 128 for activations and gradients, 2D blocks of size 128×128 for weights. They also used E4M3 for all three — weights, activations, and gradients — since within a small block the dynamic range is tight enough that precision matters more than range.
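A 1D-blockwise sketch in the spirit of that recipe (block size 128 as in DeepSeek V3; the FP8 cast itself is again elided):

```python
def quantize_blockwise_1d(t, block=128, fp8_max=448.0):
    """One scale per contiguous block, so an outlier only distorts its own block."""
    out, scales = [], []
    for i in range(0, len(t), block):
        blk = t[i:i + block]
        amax = max(abs(v) for v in blk) or 1.0
        s = fp8_max / amax
        scales.append(s)                  # one FP32 scale per block, not per tensor
        out.extend(v * s for v in blk)    # real code would cast these to FP8
    return out, scales

# One outlier in the second block leaves the first block's scale untouched:
values = [0.01] * 128 + [500.0] * 128
_, scales = quantize_blockwise_1d(values)
print(scales)  # first block scaled by ~44800, second by ~0.896
```

With per-tensor scaling, that single 500.0 would have forced a scale of 0.896 on every element, squashing the 0.01s into the subnormal range.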

The tradeoff: blockwise quantization is significantly more expensive than per-tensor scaling. But in practice it is more numerically stable.

What FP8 actually buys you

Enough theory — let's look at real profiling data from an MoE layer forward pass in a several-hundred-billion parameter model. To keep things simple, I've highlighted only the parts that matter most. The profile covers forward only — backward follows the same idea (quantize operands, run FP8 matmul), but has its own engineering challenges we won't cover here.

The profile shows three configurations: BF16 baseline, FP8, and FP8 + combine. Everything outside the MLP block remains exactly as in the BF16 setup.

Even with this simple on-the-fly quantization, FP8 alone gives a 12% speedup over the BF16 baseline. That's with no kernel fusion, minimal reuse of quantized tensors — just cast, multiply, rescale.

The additional 2% in "FP8 + combine" comes from a subtler optimization. In MoE with expert parallelism, the dispatch stage sends tokens to their assigned experts across GPUs — an all-to-all communication. In BF16 training, this happens in BF16. But since we're going to quantize the activations before the expert matmul anyway, we can quantize before dispatch instead of after — and send tokens directly in FP8. This cuts communication volume by almost 2× (scaling coefficients add back only a few percent of the total volume).
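The savings are easy to verify with back-of-envelope arithmetic. Token count, hidden size, and block size below are illustrative, not the profiled model's:

```python
def dispatch_bytes(tokens, hidden, elem_bytes, block=None, scale_bytes=4):
    """Bytes sent in the dispatch all-to-all for one batch of routed tokens."""
    payload = tokens * hidden * elem_bytes
    if block:  # FP8 path also ships one FP32 scale per block of activations
        payload += (tokens * hidden // block) * scale_bytes
    return payload

bf16 = dispatch_bytes(4096, 8192, elem_bytes=2)              # BF16: 2 bytes/element
fp8  = dispatch_bytes(4096, 8192, elem_bytes=1, block=128)   # FP8: 1 byte + scales

print(bf16 / fp8)  # ~1.94x less traffic; the scales add back ~3% of volume
```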

This points to a broader flexibility that FP8 opens up. Once you commit to quantization, you can move the quantization boundary to places where it saves not just compute, but also communication. Beyond that, working with 8-bit tensors means less memory pressure on HBM — which in turn opens up opportunities to optimize data access patterns in kernels and reduce redundant memory reads.

Where FP8 helps — and where it doesn't

As you may have noticed, we didn't apply FP8 everywhere in the profile. There are two reasons for this.

First: quantization has overhead, and it isn't always worth it.

Quantization is fundamentally a memory-bound operation — you're reading every element, dividing by a scale factor, and casting. No arithmetic intensity, no compute-bound regime. It lives on the left side of the roofline (see A Practical Roofline Model for ML Training).

This means FP8 only pays off when the matmul itself is compute-bound — large enough that the 2× FLOP throughput gain outweighs the quantization cost. For large MLP blocks with big hidden dimensions, this is clearly the case. For QKV and attention projection, the picture is more nuanced — the matmul benefit may be smaller, but there are potential wins on the communication side through FP8 All-Gather in FSDP setups. Whether it's worth it depends on your specific configuration and requires careful stability experiments.
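A quick arithmetic-intensity comparison makes the point. The per-element cost of quantization below is a rough assumption (read a 2-byte value, scale and round it, write one byte):

```python
def matmul_intensity(m, k, n, elem_bytes=1):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul, reading A, B and writing C once."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * elem_bytes
    return flops / bytes_moved

def quantize_intensity():
    """Per element: ~2 ops (scale + round), 2 bytes read, 1 byte written -- an assumption."""
    return 2 / 3

print(matmul_intensity(8192, 8192, 8192))  # thousands of FLOPs per byte: compute-bound
print(quantize_intensity())                # well under 1 FLOP per byte: memory-bound
```

Orders of magnitude apart — so the quantization pass is essentially free next to a large matmul, and pure overhead next to a small one.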

Second: not all operations are numerically safe in FP8.

Some operations are simply too sensitive to the precision loss FP8 introduces. Unembedding layers are known to cause instabilities — the logits over a large vocabulary are numerically delicate. Attention softmax operates on values whose relative differences matter a lot, making it another risky candidate. Layer norms are similarly sensitive. Some labs run the first and last few transformer layers entirely in BF16 as an extra safety margin.

The practical takeaway: FP8 is a tool for the heavy compute-bound matmuls in the middle of your network — not a global precision switch.

Conclusion

In our profiling example, applying FP8 to the MLP block gave a 14% speedup on the forward pass. The backward pass, at roughly 2× the forward in compute, follows the same quantize-and-multiply pattern and so benefits by about the same ratio — which puts the estimated end-to-end speedup at ~14% as well.
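A quick sanity check on that claim: if both passes speed up by the same factor, the fwd/bwd split doesn't affect the end-to-end number.

```python
# If forward and backward get the same relative speedup, the end-to-end
# speedup equals the per-pass speedup regardless of the 1:2 fwd/bwd split.
fwd_time, bwd_time = 1.0, 2.0   # backward costs ~2x the forward
speedup = 1.14

total_before = fwd_time + bwd_time
total_after  = fwd_time / speedup + bwd_time / speedup

print(total_before / total_after)  # 1.14
```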

Applied to a training run like Llama 3 70B — $13 million at ~$2/hour per H100 — that's roughly $1.8 million saved. Enough to fly about 25,000 of those Wembley passengers back to Paris.

And that's the conservative estimate. Our example used only basic on-the-fly quantization with no kernel fusion and minimal engineering effort. In practice, carefully designed FP8 training systems — with quantization reuse, kernel fusion, and FP8-aware distributed training — save closer to 20–30% end-to-end. The more effort you put into the integration, the closer you get to the theoretical ceiling.

Getting there isn't trivial. Real FP8 integration requires numerical stability experiments, careful decisions about what runs in FP8 and what doesn't, and non-trivial changes to your distributed training stack. These are fascinating engineering challenges — and well beyond the scope of this post.

If you're interested in the engineering behind large-scale pretraining, I'll be writing more on this topic. Follow to stay updated!


Written by arkadiibe | LLM pretrain engineer
Published by HackerNoon on 2026/04/13