NVFP4 Scales, Quantization, and Intuition
NVFP4 represents each value using FP4 data plus two levels of scaling:
$$
x \approx \text{fp4_value} \times \text{block_scale} \times \text{global_scale}
$$
Where:
- $\text{fp4_value}$ is an FP4 E2M1 value.
- $\text{block_scale}$ is an FP8 E4M3 scale shared by one block, usually 16 values.
- $\text{global_scale}$ is a high-precision per-tensor scale shared by all blocks.
1. Scale Calculation
- FP4 E2M1 has maximum magnitude: 6
- FP8 E4M3 has maximum magnitude: 448
- For the whole tensor: $\text{global_amax} = \max(|x|)$
NVFP4 computes the per-tensor scale:
$$
\text{global_scale} = \frac{\text{global_amax}}{6 \times 448}
$$
For each block: $\text{block_amax} = \max(|x_{\text{block}}|)$
The true scale needed by this block is:
$$
\text{true_block_scale} = \frac{\text{block_amax}}{6}
$$
But NVFP4 does not store $\text{true_block_scale}$ in FP32 directly. Instead, it stores an FP8 block scale relative to $\text{global_scale}$:
$$
\text{block_scale} = \frac{\text{true_block_scale}}{\text{global_scale}}
$$
So:
$$
\text{block_scale} = \frac{\text{block_amax}}{6 \times \text{global_scale}}
$$
Then $\text{block_scale}$ is cast to FP8 E4M3.
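The scale calculation above can be sketched in plain Python. This is a minimal illustration, not the Model-Optimizer implementation: `round_to_e4m3` and `nvfp4_scales` are hypothetical helper names, the E4M3 cast is emulated in software with round-to-nearest and saturation at 448, and an all-zero tensor is assumed to get zero scales.

```python
import math

FP4_MAX = 6.0     # largest E2M1 magnitude
FP8_MAX = 448.0   # largest E4M3 magnitude

def round_to_e4m3(v):
    """Emulate a cast of a non-negative float to FP8 E4M3 (saturating at 448)."""
    if v <= 0.0:
        return 0.0
    v = min(v, FP8_MAX)
    _, exp = math.frexp(v)        # v = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)          # unbiased exponent; -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)         # 3 mantissa bits give 8 steps per power of two
    return min(round(v / step) * step, FP8_MAX)

def nvfp4_scales(values, block_size=16):
    """Return (global_scale, list of FP8-rounded block scales)."""
    global_amax = max(abs(v) for v in values)
    global_scale = global_amax / (FP4_MAX * FP8_MAX)
    block_scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        block_amax = max(abs(v) for v in block)
        # Assumption: an all-zero tensor gets zero block scales instead of dividing by zero.
        ratio = block_amax / (FP4_MAX * global_scale) if global_scale > 0.0 else 0.0
        block_scales.append(round_to_e4m3(ratio))
    return global_scale, block_scales

# Two blocks of 16 values; the first block holds the tensor-wide maximum.
x = [26.88] + [0.0] * 15 + [2.88] + [0.0] * 15
global_scale, block_scales = nvfp4_scales(x)
```

With this input the block holding the tensor-wide maximum ends up with the largest FP8 scale, and the second block gets a proportionally smaller one.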
2. Quantization
For each value $x$ in a block, NVFP4 first reconstructs the effective scale:
$$
\text{effective_scale} = \text{fp8}(\text{block_scale}) \times \text{global_scale}
$$
Then the value is normalized:
$$
x_{\text{scaled}} = \frac{x}{\text{effective_scale}}
$$
Finally, $x_{\text{scaled}}$ is cast to FP4 E2M1:
$$
\text{fp4_value} = \text{cast_to_e2m1}(x_{\text{scaled}})
$$
So quantization is approximately:
$$
\text{fp4_value} = \text{cast_to_e2m1}\left(\frac{x}{\text{fp8}(\text{block_scale}) \times \text{global_scale}}\right)
$$
The FP4 values are then packed, usually two FP4 values per byte.
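A small sketch of the per-block quantization step, assuming the FP8 block scale and global scale are already known. `cast_to_e2m1` here is a hypothetical software stand-in for the hardware cast: it picks the nearest of the eight E2M1 magnitudes, saturates at 6, and sends exact ties to the smaller magnitude, which may differ from the tie rule a real kernel uses.

```python
import math

# The eight non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def cast_to_e2m1(v):
    """Round v to the nearest E2M1 value, keeping the sign and saturating at +-6."""
    mag = min(abs(v), 6.0)
    # min() returns the first closest entry, so exact ties go to the smaller magnitude.
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, v)

def nvfp4_quantize_block(block, fp8_block_scale, global_scale):
    """Normalize each value by the effective scale, then cast to E2M1."""
    effective_scale = fp8_block_scale * global_scale
    if effective_scale == 0.0:
        # Assumption: a zero scale means the block is all zeros.
        return [0.0] * len(block)
    return [cast_to_e2m1(v / effective_scale) for v in block]

# Illustrative numbers: block scale 48 (exact in E4M3) and global scale 0.01.
fp4_values = nvfp4_quantize_block([2.88, -1.44, 0.5, 0.0], 48.0, 0.01)
```

The function returns the E2M1 numeric values; packing them into 4-bit codes is a separate step.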
3. Dequantization
Dequantization reverses the process.
First, unpack the FP4 value and convert it back to its E2M1 numeric value. Then reconstruct the effective scale:
$$
\text{effective_scale} = \text{fp8}(\text{block_scale}) \times \text{global_scale}
$$
Finally:
$$
x_{\text{dequantized}} = \text{fp4_value} \times \text{effective_scale}
$$
So:
$$
x_{\text{dequantized}} \approx \text{fp4_value} \times \text{fp8}(\text{block_scale}) \times \text{global_scale}
$$
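The dequantization path can be sketched the same way. The example assumes the 4-bit codes are already unpacked into integers and uses the E2M1 bit layout of one sign bit, two exponent bits, and one mantissa bit; `nvfp4_dequantize_codes` is a hypothetical name.

```python
# Magnitude for each 3-bit E2M1 pattern (2 exponent bits + 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def decode_e2m1(code):
    """Convert a 4-bit code (bit 3 = sign, bits 2..0 = magnitude index) to a float."""
    magnitude = E2M1_GRID[code & 0x7]
    return -magnitude if code & 0x8 else magnitude

def nvfp4_dequantize_codes(codes, fp8_block_scale, global_scale):
    """Decode each FP4 code and multiply by the reconstructed effective scale."""
    effective_scale = fp8_block_scale * global_scale
    return [decode_e2m1(c) * effective_scale for c in codes]

# Codes for 6, -3, 1, 0 with illustrative scales 48 and 0.01.
restored = nvfp4_dequantize_codes([0b0111, 0b1101, 0b0010, 0b0000], 48.0, 0.01)
```

The third input value in the earlier quantization direction would have been 0.5, but it comes back as about 0.48: the difference is the FP4 rounding error for that element.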
4. Why the Scales Are Computed This Way
The goal is to use the available numeric range efficiently.
For each block, we want the largest value in that block to map close to the largest FP4 value:
$$
\frac{\text{block_amax}}{\text{effective_scale}} \approx 6
$$
Since:
$$
\text{effective_scale} = \text{block_scale} \times \text{global_scale}
$$
we get:
$$
\frac{\text{block_amax}}{\text{block_scale} \times \text{global_scale}} = 6
$$
Solving for $\text{block_scale}$:
$$
\text{block_scale} = \frac{\text{block_amax}}{6 \times \text{global_scale}}
$$
$\text{global_scale}$ acts as a global unit for all block scales. The FP8 $\text{block_scale}$ stores how many of those global units each block needs.
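NVFP4 is two-stage because the block scales themselves are quantized to FP8 E4M3, whose largest magnitude is 448. Dividing each true block scale by $\text{global_scale}$ keeps every stored block scale inside that range.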
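A worked check with made-up numbers may help. The values $\text{global_amax} = 26.88$ and $\text{block_amax} = 2.88$ are illustrative only, chosen so that the block scale is exactly representable in E4M3:
$$
\text{global_scale} = \frac{26.88}{6 \times 448} = 0.01
$$
$$
\text{block_scale} = \frac{2.88}{6 \times 0.01} = 48, \qquad \text{effective_scale} = 48 \times 0.01 = 0.48, \qquad \frac{2.88}{0.48} = 6
$$
For the block that contains $\text{global_amax}$ itself:
$$
\text{block_scale} = \frac{26.88}{6 \times 0.01} = 448
$$
So the largest block scale in the tensor lands exactly on the FP8 E4M3 maximum, and every other block scale is smaller.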
MXFP4 Scales, Quantization, and Intuition
MXFP4 represents each value using FP4 data plus one block-level scale:
$$
x \approx \text{fp4_value} \times \text{block_scale}
$$
Where:
- $\text{fp4_value}$ is an FP4 E2M1 value.
- $\text{block_scale}$ is an E8M0 scale shared by one block, usually 32 values.
- (Compared to NVFP4: there is no $\text{global_scale}$ in MXFP4.)
1. Scale Calculation
- FP4 E2M1 has maximum magnitude: 6
- For each block: $\text{block_amax} = \max(|x_{\text{block}}|)$
The true scale we want for this block is:
$$
\text{true_block_scale} = \frac{\text{block_amax}}{6}
$$
because we want the largest value in the block to become approximately 6, the max FP4 value.
However, MXFP4 stores the block scale in E8M0, which can represent only powers of two: $\text{block_scale} = 2^k$.
So MXFP4 chooses the smallest power-of-two scale that can cover the block:
$$
k = \lceil \log_2(\text{block_amax} / 6) \rceil
$$
$$
\text{block_scale} = 2^k = 2^{\lceil \log_2(\text{block_amax} / 6) \rceil}
$$
The `ceil` is important because it chooses a scale large enough to avoid overflow, i.e. it makes sure the largest value in the block fits inside the FP4 range.
The stored scale is the biased exponent: $\text{stored_scale} = k + 127$.
MXFP4 stores $k$ instead of the scale itself because the scale is only allowed to be $2^k$. Storing $k$ is smaller, simpler, and faster to use in hardware. Because $k$ can be negative but the stored scale byte is unsigned, MXFP4 stores $k + 127$, and $\text{stored_scale} - 127$ recovers the real $k$.
Notes:
- `ceil(x)` = smallest integer ≥ x
- `floor(x)` = largest integer ≤ x
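The exponent selection can be sketched as follows. `mxfp4_block_scale` is a hypothetical helper; clamping $k$ to $[-127, 127]$ and mapping an all-zero block to the smallest exponent are assumptions made here so the example always yields a valid byte.

```python
import math

E2M1_MAX = 6.0
E8M0_BIAS = 127

def mxfp4_block_scale(block):
    """Return (k, stored_scale, block_scale) for one block of values."""
    block_amax = max(abs(v) for v in block)
    if block_amax == 0.0:
        # Assumption: an all-zero block uses the smallest exponent.
        k = -E8M0_BIAS
    else:
        # Smallest power of two that keeps block_amax / 2**k within the FP4 range.
        k = math.ceil(math.log2(block_amax / E2M1_MAX))
        # Assumption: clamp so that k + 127 stays a valid exponent byte.
        k = max(-E8M0_BIAS, min(E8M0_BIAS, k))
    stored_scale = k + E8M0_BIAS
    return k, stored_scale, 2.0 ** k

# block_amax = 10 needs a true scale of 10/6 ~ 1.67, which rounds up to 2**1.
k, stored_scale, block_scale = mxfp4_block_scale([10.0, -3.0, 0.25, 1.0])
```

With `floor` instead of `ceil`, this block would get a scale of 1, and 10 would saturate at 6 after the FP4 cast.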
2. Quantization
For each value $x$ in a block, MXFP4 first normalizes it by the block scale:
$$
x_{\text{scaled}} = \frac{x}{\text{block_scale}}
$$
Then $x_{\text{scaled}}$ is cast to FP4 E2M1:
$$
\text{fp4_value} = \text{cast_to_e2m1}(x_{\text{scaled}})
$$
So quantization is approximately:
$$
\text{fp4_value} = \text{cast_to_e2m1}\left(\frac{x}{\text{block_scale}}\right)
$$
The FP4 values are then packed, usually two FP4 values per byte.
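A sketch of quantizing and packing one MXFP4 block, assuming the exponent $k$ has already been chosen. `encode_e2m1` and `mxfp4_quantize_block` are hypothetical names; the nibble order (even-indexed element in the low 4 bits) and the tie rule (smaller magnitude wins) are assumptions, and real kernels may differ on both.

```python
import math

# Magnitude for each 3-bit E2M1 pattern; the list index is the bit pattern.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def encode_e2m1(v):
    """Return the 4-bit code (bit 3 = sign, bits 2..0 = magnitude index) nearest to v."""
    mag = min(abs(v), 6.0)                      # saturate at the FP4 maximum
    index = min(range(8), key=lambda i: abs(E2M1_GRID[i] - mag))
    sign = 0x8 if v < 0 else 0x0
    return sign | index

def mxfp4_quantize_block(block, k):
    """Scale an even-length block by 2**k, encode to FP4, and pack two codes per byte."""
    block_scale = math.ldexp(1.0, k)            # exactly 2**k
    codes = [encode_e2m1(v / block_scale) for v in block]
    # Assumed packing order: even-indexed element in the low nibble.
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

# block_amax = 12 gives k = ceil(log2(12 / 6)) = 1.
packed = mxfp4_quantize_block([12.0, -6.0, 1.0, 0.0], 1)
```

Four input values become two bytes, so a 32-value block takes 16 bytes of FP4 data plus one scale byte.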
3. Dequantization
Dequantization reverses the process.
First, unpack the FP4 value and convert it back to its E2M1 numeric value: $\text{fp4_value} \rightarrow \text{e2m1_value}$
Then recover the block scale from the stored E8M0 exponent:
$$
\text{block_scale} = 2^{(\text{stored_scale} - 127)}
$$
Finally:
$$
x_{\text{dequantized}} = \text{e2m1_value} \times \text{block_scale}
$$
So:
$$
x_{\text{dequantized}} \approx \text{fp4_value} \times \text{block_scale}
$$
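The reverse direction, again as a sketch under assumptions: two FP4 codes per byte with the first element in the low nibble, and `mxfp4_dequantize_block` as a hypothetical name.

```python
import math

# Magnitude for each 3-bit E2M1 pattern; the list index is the bit pattern.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def mxfp4_dequantize_block(packed, stored_scale):
    """Unpack FP4 codes from bytes and rescale them by 2**(stored_scale - 127)."""
    block_scale = math.ldexp(1.0, stored_scale - 127)
    values = []
    for byte in packed:
        # Assumed packing order: low nibble first, then high nibble.
        for code in (byte & 0x0F, byte >> 4):
            magnitude = E2M1_GRID[code & 0x7]
            sign = -1.0 if code & 0x8 else 1.0
            values.append(sign * magnitude * block_scale)
    return values

# stored_scale = 128 means k = 1, so every FP4 value is multiplied by 2.
restored = mxfp4_dequantize_block(bytes([0xD7, 0x01]), 128)
```

Because the scale is a power of two, the multiplication only shifts the exponent; all rounding error comes from the FP4 cast.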
References
- NVIDIA Model-Optimizer: `nvfp4_tensor.py`
- NVIDIA Model-Optimizer: `mxfp4_tensor.py`
- vLLM LLM-Compressor: W4A4 FP4 Quantization Example
- NVFP4 quantization generates per-tensor global scales and per-group (size 16) local quantization scales for the weights, as well as per-tensor global scales for the activations. Per-group local activation quantization scales are generated dynamically during inference time.
- If running inference on a machine that is `< SM100`, vLLM will not run activation quantization, only weight-only quantization.