Why LLMs Are Non-Deterministic, and It Is Not Just the Temperature

A lot of engineers coming from classical scientific computing, myself included, are skeptical of LLMs due to their non-determinism. Same exact prompt can give two very different answers, which makes reproducibility a challenge. And that is a real risk even more critical now: at the end of the day, who wants to deploy a production stack on top of a random token generator?

But where does this randomness come from? The actual reasons turn out to be much older than LLMs themselves. They are the same finite-precision and parallel-reduction issues that have been around in scientific computing for decades. LLMs just inherited them.

To see how, let's walk down the implementation stack layer by layer...

Temperature and Sampling

A language model processes input tokens through stacked transformer blocks, and the final layer produces one real-valued score for every possible next token. These raw scores are called logits, written $z_i \in \mathbb{R}$ for token $i$ , and there is one logit per entry of the model's vocabulary (typically tens of thousands). The logits are then converted into a probability distribution by the softmax function, and one token is drawn from that distribution by sampling. This pipeline is illustrated in Figure 1.

The sampling pipeline from forward pass through logits, softmax, and sampled token Figure 1: The sampling pipeline. The model produces one logit per vocabulary entry, softmax converts them into a probability distribution, and a single token is drawn by sampling.

Softmax is the standard map from a real-valued vector to a probability distribution:

\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}},

mapping $\mathbf{z} \in \mathbb{R}^K$ to a vector in $(0, 1)^K$ that sums to one. Rescaling the logits by a positive constant $T$ before applying softmax gives

\sigma(\mathbf{z}/T)_i = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}.

This $T$ is the temperature: a parameter that controls how sharp or flat the resulting distribution is. The formula and the name are not arbitrary; they come directly from statistical physics. In a system of discrete energy levels $E_i$ at thermal equilibrium, the Boltzmann distribution gives the probability of finding the system in state $i$ :

p_i = \frac{e^{-E_i / k_B T}}{\sum_j e^{-E_j / k_B T}},

where $k_B$ is the Boltzmann constant and $T$ is the physical temperature. The two formulas are structurally identical, with $z_i$ taking the place of $-E_i / k_B$ in the exponent and the same $T$ playing the same role in both. At high $T$ the exponential is gentler and the probability mass spreads across many states; at low $T$ the exponential is sharper and the mass concentrates on the highest-scoring state.

Quantifying randomness: entropy

The qualitative picture (high $T$ spreads probability, low $T$ concentrates it) can be made precise with entropy, which measures how uncertain a probability distribution is, as shown in Figure 2.

Temperature effect on the probability distribution over tokens Figure 2: The effect of temperature on the token probability distribution. Higher temperature spreads probability mass across more tokens; lower temperature concentrates it on the top-scoring token.

For a discrete distribution $\{p_i\}$ , Shannon entropy is

H(\{p_i\}) = -\sum_{i=1}^{K} p_i \log_2 p_i,

with the convention $0 \log 0 = 0$ . The unit is bits when the logarithm is base 2. A uniform distribution over $K$ outcomes has the maximum entropy $\log_2 K$ , while a distribution concentrated on one outcome has $H = 0$ .

The same quantity appears in physics as thermodynamic entropy,

S = -k_B \sum_i p_i \ln p_i,

differing only by the constant $k_B$ and the choice of logarithm base. The structural identity between softmax and the Boltzmann distribution carries over to their entropies: both measure the same thing, namely the uncertainty in which state the system will be found in.

In the limit $T \to 0$ , the softmax collapses onto a single token, so $p_k = 1$ for the top-scoring token and $p_i = 0$ for all others. Plugging into the formula, $1 \cdot \log 1 = 0$ and the convention $0 \log 0 = 0$ kills the rest, so $H = 0$ . Zero entropy means zero uncertainty: the next token is fully determined.

What sampling actually means

So far we have a probability distribution $\{p_i\}$ over tokens. Sampling is the procedure that turns this distribution into one concrete token. The standard method is inverse transform sampling: lay the probabilities end to end on the interval $[0, 1]$ , draw a uniform random number $U \sim \text{Uniform}(0, 1)$ , and return the token whose interval contains $U$ . Formally, return the index $k$ satisfying

\sum_{i=1}^{k-1} p_i \le U < \sum_{i=1}^{k} p_i.

This works but it hides the role of temperature. Temperature lives inside $p_i$ , and the random draw $U$ sits outside; the two pieces do not interact in a way that makes the $T \to 0$ limit easy to see.

The Gumbel-Max trick

A more revealing formulation comes from extreme value theory. The Gumbel distribution describes the statistics of maxima of large samples, in the same way the normal distribution describes the statistics of sums. A standard Gumbel random variable $G$ can be constructed from a uniform $U \sim \text{Uniform}(0, 1)$ via

G = -\ln(-\ln U).

The classical Gumbel-Max theorem [1] states that sampling from the softmax distribution with logits $z_i$ and temperature $T$ is equivalent to drawing independent Gumbel noises $G_i$ , one per token, and returning

\text{sampled token} = \arg\max_i \big( z_i + T \cdot G_i \big).

Each symbol in this expression has a clear role. The logit $z_i$ is the model's deterministic score for token $i$ . The noise $G_i$ is randomness drawn fresh for each sampling step. The temperature $T$ scales how much that noise matters relative to the logits.

The shape of this formula makes the deterministic limit visible. As $T \to 0$ , the term $T \cdot G_i$ shrinks to zero regardless of the noise, and the expression reduces to

\arg\max_i z_i,

which is exactly greedy sampling, the choice of the highest-scoring token. The transition from stochastic to deterministic is not a separate case to analyze; it falls out of a single formula by turning one knob.

So if $T$ is the control parameter for sampling, then $T \to 0$ should give us a fully deterministic LLM?

Layer 1: Real Numbers in Finite Bits

To understand why temperature alone is not enough, we need a quick detour into how these logits are actually stored. A logit is a real number, but the model does not work with real numbers. It works with finite-precision approximations of them, produced by billions of floating-point operations across matrix multiplications, attention, and normalization. Beyond the classical FP64 and FP32 from scientific computing, LLM compute has driven a proliferation of lower-precision formats motivated by bandwidth. The motivation is simple: modern GPUs spend more time moving weights than computing on them, so halving the bit width nearly halves the memory traffic.

Comparison of floating-point formats from FP32 down to FP8 and INT4 Figure 3: Floating-point formats used in LLM training and inference. Each format trades precision for bandwidth in a different way.

The memory savings are concrete. A 70B-parameter model needs 280 GB in FP32, more than any single GPU can hold. BF16 cuts that to 140 GB, FP8 to 70 GB (fitting on a single 80 GB GPU). The idea is intuitive: cut the exponent bits and you lose dynamic range; cut the mantissa bits and you lose precision. The question is which sacrifice hurts less, and the answer depends on what kind of number you are storing.

What flows through a neural network

At its core, a neural network has three kinds of numbers with fundamentally different statistical character. To see why they need different formats, consider what a single layer of a feed-forward network does.

In the forward pass, layer $\ell$ takes the previous output and computes:

\mathbf{a}^{(\ell+1)} = \sigma\!\left(W^{(\ell)}\, \mathbf{a}^{(\ell)}\right)

where $\sigma$ is a nonlinearity. In the backward pass, the network computes how each weight should change to reduce the training loss $\mathcal{L}$ , by propagating derivatives back through the same chain:

\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(\ell+1)}} \cdot \frac{\partial \mathbf{a}^{(\ell+1)}}{\partial W^{(\ell)}}

This gives us three quantities:

$W^{(\ell)} \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}}$ : the weight matrix at layer $\ell$ . Learned during training, fixed at inference.
$\mathbf{a}^{(\ell)}$ : the activation at layer $\ell$ . The output of the previous layer, fed as input to the next.
$\frac{\partial \mathcal{L}}{\partial W^{(\ell)}}$ : the gradient. Computed only during training to update the weights.

$W$ and $\mathbf{a}$ stay bounded by design. Without care, multiplying through dozens of layers would either blow the signal up or collapse it to zero. Two mechanisms prevent this. First, initialization [3] draws each weight from a distribution with variance $2 / n_{\text{in}}$ , calibrated so that the output of each layer has roughly the same magnitude as its input. A typical weight entry ends up around $0.02$ or $-0.01$ , not $500$ . Second, normalization layers like RMSNorm [4] act as a safety net at every layer, rescaling each activation to unit root-mean-square:

\text{RMSNorm}(\mathbf{x})_i = \frac{x_i}{\sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2}} \cdot \gamma_i

Between the two, weights and activations live in a narrow range. What matters for them is precision: resolving small differences within that range.

The gradient $\frac{\partial \mathcal{L}}{\partial W^{(\ell)}}$ does not stay bounded. Expanding the backward pass across all $L$ layers gives a chain of Jacobians:

\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}} \prod_{k=\ell+1}^{L} \frac{\partial \mathbf{a}^{(k)}}{\partial \mathbf{a}^{(k-1)}} \cdot \frac{\partial \mathbf{a}^{(\ell)}}{\partial W^{(\ell)}}

Each factor can shrink or amplify the signal. If the factors are consistently below 1, gradients vanish and early layers stop learning. If consistently above 1, gradients explode and training diverges (Figure 4).

Training loss curves and gradient magnitude across layers for vanishing, exploding, and balanced scenarios Figure 4: Three gradient regimes. Top: gradients shrink to near-zero in the early layers, the loss plateaus. Middle: gradients grow through the chain, the loss oscillates wildly. Bottom: with proper initialization and normalization, gradient magnitude stays balanced across layers and training proceeds smoothly.

What matters for gradients is dynamic range: covering many orders of magnitude without underflow or overflow.

This tension between precision and range shaped the evolution of formats. FP16, borrowed from graphics, cut both exponent and mantissa bits from FP32, collapsing the dynamic range to $[2^{-14}, 2^{15}]$ . Gradients regularly underflowed or overflowed. BF16 [2] fixed this by keeping all 8 exponent bits from FP32 and sacrificing only mantissa, trading precision for full dynamic range. Training tolerates the lost precision because SGD is already noisy. BF16 is now the standard training format.

FP8 halves the budget again, but 8 bits cannot serve both roles with one format. It splits into two variants [5]: E4M3 (4 exponent, 3 mantissa) for forward-pass weights and activations that need precision, and E5M2 (5 exponent, 2 mantissa) for backward-pass gradients that need range. The format became native on Hopper (H100) and was validated at frontier scale by DeepSeek-V3 in late 2024 [6]. Integer formats like INT8 and INT4 push compression even further for inference, and per-block scaling schemes like Microscaling (MX) [7] allow sub-byte precision on newer hardware (Figure 5), but those are a separate story.

Hardware support matrix for numeric formats across GPU generations Figure 5: Hardware support for numeric formats across GPU generations. INT8, INT4, and MX formats extend the trade-off further for inference workloads. Support remains fragmented across vendors.

Representation inherently lacks accuracy. But approximation alone does not mean non-determinism: the same number rounds to the same approximation every time. The non-determinism enters when we start doing arithmetic on these approximations.

Layer 2: Arithmetic on Approximations

In real arithmetic, addition is always associative: $(a + b) + c = a + (b + c)$ . In floating-point, however, this identity breaks. Each addition rounds the result back to the nearest representable number, and rounding depends on the magnitudes involved at that moment. This happens especially when you add a small number to a large running total and the small one disappears below the precision floor.

A concrete example makes this visible. Take $a = 1.0$ and eight copies of $\varepsilon = 0.0625$ , all stored in FP8 E4M3. The true sum is

a + \sum_{i=1}^8\ \varepsilon = 1.4375.

The actual FP8 result depends entirely on the order in which the additions are performed, and the three natural orderings can be drawn as reduction trees illustrated below.

Sequential reduction. The accumulator starts at $1.0$ and each $\varepsilon$ is added one at a time. In E4M3 the representable values near $1.0$ are spaced $0.125$ apart, so each $\varepsilon = 0.0625$ falls exactly at the halfway point and round-to-nearest-even rounds back down to $1.0$ . After all eight additions the result is still $1.0$ , losing $0.4375$ to rounding, as shown in Figure 6.

Sequential reduction tree where each value is added to a running accumulator Figure 6: Sequential reduction. The accumulator absorbs each small value one at a time, and rounding destroys them all.

Pairwise reduction. The eight $\varepsilon$ terms are summed among themselves first, in a balanced binary tree, before the large value $1.0$ enters. Pairs of $0.0625$ sum to $0.125$ , those sum to $0.25$ , and so on, reaching $0.5$ exactly. Adding $1.0$ at the end gives $1.5$ . Still not the true sum, but the error has shrunk from $0.4375$ to $0.0625$ , as shown in Figure 7.

Pairwise reduction tree where small values are summed among themselves first Figure 7: Pairwise reduction. Small values accumulate among themselves before meeting the large value, preserving more precision.

Split reduction. The terms are partitioned into two slices, each summed independently, and the slice results combined at the end. If slice 1 is summed sequentially and slice 2 is summed pairwise, the two slices return $1.0$ and $0.25$ , and the final answer is $1.25$ . A different choice of split point or per-slice ordering gives a different number again, as illustrated in Figure 8.

Split reduction tree where terms are partitioned into independently-summed slices Figure 8: Split reduction. Different partitioning and per-slice ordering produce yet another result.

We can see that three reduction paths lead to three different results, and none of them is equal to the ground truth $1.4375$ . The takeaway is not which tree is best, but that the shape of the reduction tree is part of what determines the result. In hand-written code the tree is fixed by the code, but on a CPU/GPU it is decided at runtime.

Every Matmul Is a Sum

So where does this reduction problem show up in an LLM? Look back at the forward pass from Layer 1. Each layer computes $\mathbf{a}^{(\ell+1)} = \sigma(W^{(\ell)} \mathbf{a}^{(\ell)})$ . For a single token, $W^{(\ell)} \mathbf{a}^{(\ell)}$ is a matrix-vector multiplication: each output element is a dot product, a sum of element-wise products:

(W \mathbf{a})_i = \sum_{k=1}^{K} W_{ik}\, a_k

That sum is exactly the kind of reduction we analyzed above. When the model processes a batch of tokens at once, the activation vectors stack into a matrix $A$ , and the operation becomes a full matrix multiplication:

C = AB, \quad C_{ij} = \sum_{k=1}^{K} A_{ik} B_{kj}

Each element of $C$ is the same dot product. Whether it is one vector or a thousand, every output value is computed by summing products, and that sum is where the reduction tree lives.

A transformer block contains two such operations: the feed-forward network (activation times weight) and attention (activation times activation, which we will revisit in Layer 3). The backward pass runs the same operations in reverse to compute gradients, so the reduction-order problem applies to training too.

With $K$ in the thousands, the sum can be partitioned across parallel workers. GPU kernels call this split-K: divide the K-axis into slices, let each worker sum its slice independently, and merge the partial results:

\sum_{k=1}^{K} A_{ik} B_{kj} = \underbrace{\sum_{k=1}^{K/2} A_{ik} B_{kj}}_{\text{slice 1}} + \underbrace{\sum_{k=K/2+1}^{K} A_{ik} B_{kj}}_{\text{slice 2}}

Apply this recursively and the matmul becomes a tree of partial sums. Each level of the GPU hierarchy implements one cut of this tree, and each level introduces its own variability.

Thread level. Inside a single thread, a small accumulator sums a handful of products in registers. This is the sequential pattern (Figure 6), fully ordered by the compiled instruction stream, and reproducible per thread.

Warp and thread-block level. A warp (NVIDIA's term for a group of 32 threads executing in lockstep; AMD calls them wavefronts) holds 32 partial accumulators that must be combined. The reduction uses shuffle instructions in a pairwise tree (Figure 7). The block-level reduction across multiple warps then combines those results, again pairwise. The exact tree depends on the kernel implementation and the tile shape chosen for this matmul.

Within a GPU: split-K across thread blocks. When $K$ is large, the kernel applies the split-K pattern from above across multiple thread blocks, each accumulating its slice, then a second-stage reduction combines them (Figure 8). The split point, the tile shape, and the reduction order are decided at runtime by the kernel library based on the matrix shape and the GPU.

Across GPUs: tensor parallelism. For models too large for one GPU, the same split happens at the network level. Each GPU computes its slice, then partial results are combined through an all-reduce. NCCL [8] chooses between a ring (sequential, as in Figure 6) and a tree (pairwise, as in Figure 7) based on message size and topology. The same three patterns from the worked example reappear at every scale.

Non-associativity is a property of the arithmetic and produces no non-determinism on its own. If the reduction tree were fixed across runs, the same inputs would give bit-identical results every time. Parallelism is what turns the reduction tree from a mathematical given into a runtime artifact, since the GPU chooses the split-K factor and NCCL chooses ring versus tree based on conditions that vary from launch to launch. The two together are what cost us bitwise reproducibility: floating-point supplies the sensitivity, the runtime supplies the variation.

Layer 3: Scheduling at Runtime

So far we have seen that the reduction tree determines the result, and that the kernel library picks the tree at runtime based on the matrix shape. The next question is what determines the matrix shape itself. In a single-user research setting, the shape comes from the model and the prompt and stays fixed. In a production serving system, the shape is a function of system state at the moment your request is processed.

There are three distinct mechanisms by which system state leaks into matrix shape:

Batch composition. How many independent requests are processed together in one forward pass.
Prefill chunking. How a single long prompt is split across multiple forward passes.
Continuous batching. How the batch itself evolves over time as new requests enter and finished ones leave.

Each acts on a different axis, and all three are decisions the inference engine makes without consulting the model.

Across requests: batch shape changes the kernel

In $C = AB$ from Layer 2, the dimensions $K$ and $N$ are fixed by the model architecture. The dimension $M$ , the number of token rows, is not. It is set by the workload: $M = b \cdot L$ for batch size $b$ and sequence length $L$ .

When $M$ is large, the kernel keeps each K-reduction local to one thread group and the reduction order stays fixed. When $M$ is small, most threads sit idle, so the kernel switches to split-K. The threshold depends on $M$ , $N$ , $K$ , the GPU model, and the kernel library's autotuning heuristics. A small change in batch composition can flip the kernel selection, and once it flips, the reduction tree changes (Figure 9).

How batch shape determines kernel selection between standard and split-K strategies Figure 9: Batch shape determines kernel selection. When $M$ (the number of token rows in $C = AB$ ) is large, each tile handles its own K-reduction. When $M$ is small, the kernel switches to split-K, changing the reduction tree.

Within one request: chunked prefill changes the attention sum

The second mechanism operates on the sequence axis, and on attention, the other major operation inside each transformer block. Unlike the feed-forward matmul where one operand is a fixed weight matrix, attention multiplies activations by activations:

\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V,

with $Q, K, V \in \mathbb{R}^{N \times d}$ . For each query row $i$ , the output is a softmax-weighted sum over all $N$ keys and values.

A long prompt is not processed in one shot; the inference engine splits it into chunks (chunked prefill), runs each chunk as its own forward pass, and stores the resulting keys and values in the KV cache. The attention sum then partitions at the chunk boundaries, which is the divide-and-conquer cut from Layer 2 applied along the sequence axis. Whether a 2000-token prompt is processed as four chunks of 500 or two of 1000 depends on the scheduler's state when the request arrives, as shown in Figure 10.

Two different chunked prefill partitions of the same prompt producing different reduction trees Figure 10: Chunked prefill. The same 2000-token prompt split into four chunks of 500 versus two chunks of 1000 produces different attention reduction trees. Each chunking changes the sequence length $N$ that the attention sum runs over.

Across time: continuous batching changes the batch step by step

The third mechanism is the most subtle. Modern serving stacks (vLLM [9], SGLang, TensorRT-LLM) do not run a batch to completion before accepting new requests. Instead, at every decode step they admit new requests into the batch and remove finished ones (continuous batching [10], also called in-flight batching). The matrix shape $M$ therefore changes step by step within a single request's lifetime, not just from one request to the next. Two identical requests submitted ten seconds apart will see different co-tenants, different $M$ at corresponding steps, and therefore different kernel selections, as illustrated in Figure 11.

Continuous batching timeline showing how the batch composition evolves step by step Figure 11: Continuous batching. The batch composition evolves at every decode step as new requests enter and finished ones leave. $M$ changes step by step within a single request's lifetime.

The runtime is now part of the math

Layer 2 turned the reduction order into a runtime decision. Layer 3 makes that decision a function of system state. Which other requests are batched with yours, which chunks the scheduler chose, where in the decode timeline your tokens land, all of these feed into the kernel selection, and the kernel selection determines the order of summation. The output of the model is no longer just a function of the input. It is also a function of everything else the system happened to be doing at the moment your request was processed.

Layer 4: Architecture-Level Routing (MoE)

In Mistral, DeepSeek-V3, and Qwen, the same token can be multiplied by different weights on two different runs, depending on what else was in the batch. The previous layers all kept the model fixed and only changed how the computation ran. Mixture of Experts allows the effective model to vary from batch to batch.

An MoE block [11] replaces the feed-forward network with $N$ smaller feed-forward networks called experts (typically 8 to 128), plus a small network called the gate. For each token $x$ , the gate produces a score $g_e(x)$ for every expert $e$ , and the token is routed to the $k$ experts with the highest scores (TopK selection, with $k$ usually 1 or 2). The output is the weighted sum of those experts' outputs:

\text{MoE}(x) = \sum_{e \in \text{TopK}(g(x))} g_e(x) \cdot \text{Expert}_e(x).

Figure 12 shows the $k = 1$ case with $N = 4$ experts. The gate scores all four, the highest-scoring expert (FFNN 1, with weight 0.45) processes the token, and its output is scaled by that weight.

MoE layer architecture with gating network routing a token to the top-scoring expert Figure 12: An MoE layer with four experts and top-1 routing. The gate scores each expert, and the token is processed by the highest-scoring one.

Mathematically, this is a function of $x$ alone. Given fixed model weights, the gate scores are determined by $x$ , the TopK choice follows from the scores, and the output follows from the choice. In production, however, tokens are not processed one at a time. They are packed into batches of thousands, and the engineering of MoE on GPUs breaks the per-token view.

GPUs need matmul shapes known in advance, but the gate's assignment is data-dependent and unpredictable. To keep shapes fixed, each expert is given a buffer of $C$ slots per batch, sized to roughly the average load under uniform routing. Routing is rarely uniform in practice. When more tokens want an expert than $C$ allows, the surplus is dropped (replaced by a residual passthrough), rerouted to a lower-scoring expert, or processed in an overflow path at a throughput penalty. Which of these happens to a given token depends on every other token in the batch competing for the same buffer, as shown in Figure 13.

MoE batch routing showing how buffer overflow causes tokens to be rerouted depending on batch composition Figure 13: MoE batch routing. When an expert's buffer overflows, surplus tokens are dropped or rerouted. The routing outcome for any single token depends on the entire batch.

Two batches that share your prompt but differ elsewhere can route the same token to different experts and produce a different output. Puigcerver and colleagues note this in the Soft MoE paper [12]: standard MoE is deterministic at the batch level, not at the sequence level.

Closing: The Stack Below the Stack

Sampling collapses to greedy at $T = 0$ and is therefore not the actual source of non-determinism. The actual sources are deeper: finite-precision arithmetic that breaks associativity (Layer 1), parallel reductions whose tree shape varies at runtime (Layer 2), scheduling decisions that change the matrix dimensions from one request to the next (Layer 3), and MoE routing that sends the same token through different weights depending on who else is in the batch (Layer 4).

None of this is new. Layers 1 through 3 are the same finite-precision and parallel-reduction issues that have shaped reproducibility in scientific computing for decades. LLMs just inherited them at scale. Layer 4 is the departure: MoE routing makes the divergence structural, not just numerical.

Every layer above Layer 1 can be pinned. The runtime choices can be held fixed, and bitwise determinism returns. The cost is throughput, which is the same trade-off scientific computing has always faced.

This article has covered only half of the trust question. Bitwise determinism guarantees that the same input produces the same output, but it says nothing about whether the output is correct. A deterministic model can still be confidently wrong, and the uncertainty and hallucination side of LLMs is a different story for another time.

References

Maddison, C. J. et al. (2014). A* Sampling. NeurIPS 2014.
Wang, S. & Kanwar, P. (2019). BFloat16: The secret to high performance on Cloud TPUs. Google Cloud Blog.
He, K. et al. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
Zhang, B. & Sennrich, R. (2019). Root Mean Square Layer Normalization. NeurIPS 2019.
Micikevicius, P. et al. (2022). FP8 Formats for Deep Learning. arXiv. Joint specification by NVIDIA, Intel, and Arm.
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv.
Rouhani, B. et al. (2023). OCP Microscaling Formats (MX) Specification. Open Compute Project.
NVIDIA (2025). NCCL Documentation. NVIDIA Developer.
Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
Yu, G. et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.
Puigcerver, J. et al. (2023). From Sparse to Soft Mixtures of Experts. arXiv.

Disclaimer: AI was used to assist with grammar polishing and rhetoric refining. The opinions and conclusions are the author's own and do not represent those of any past or current employer. Feedback and corrections are always welcome.