This note describes the GPTQ quantizer format, not how parameters are updated for quantization (see the GPTQ paper and this paper for some subsequent updates).
Preliminaries
Asymmetric quantization
If we quantize weights to $b$ bits, the weights are stored as integer values in the range $[0, 2^b - 1]$. Suppose that the smallest weight within a set of weights $w$ is $w_{min} = \min(w)$ and the largest weight is $w_{max} = \max(w)$. We want to map the range $[w_{min}, w_{max}]$ to the range of integer values $[0, 2^b - 1]$.
To do this mapping, we first define a scaling factor $\mathrm{scale} = \frac{w_{max} - w_{min}}{q_{max}}$, where $q_{max} = 2^b - 1$. Suppose that we are quantizing weights using 4-bit integers, then $q_{max} = 2^{b=4} - 1 = 15$, and we have $w_{min} = -1.5$ and $w_{max} = 3.0$. Then $\mathrm{scale} = \frac{3.0 - (-1.5)}{15} = 0.3$. We can then divide the weights by $\mathrm{scale}$ and round the result to get integer weights, i.e. $[\mathrm{round}(\frac{-1.5}{0.3}), \mathrm{round}(\frac{3.0}{0.3})] = [-5, 10]$. Now $\frac{w_{max}}{\mathrm{scale}} - \frac{w_{min}}{\mathrm{scale}} = 15$, so after rounding the weights can be represented using 16 integer values, but we still need to add a bias (called $\mathrm{zero}$ from here on) to get to the desired range of $[0, 2^{b=4} - 1]$. To do so, we can add $\mathrm{zero} = \mathrm{round}(\frac{-w_{min}}{\mathrm{scale}}) = \mathrm{round}(\frac{1.5}{0.3}) = 5$.
Once we have determined $\mathrm{scale}$ and $\mathrm{zero}$, we can find the quantized weight $q(w_i)$ using $q(w_i) = \mathrm{clamp}(\mathrm{round}(\frac{w_i}{\mathrm{scale}}) + \mathrm{zero}, 0, 2^b - 1)$ and the reconstructed weight using $q'(q(w_i)) = (q(w_i) - \mathrm{zero}) \cdot \mathrm{scale}$.
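To make this concrete, here is a minimal sketch of these formulas in plain Python, using the 4-bit example above (the helper names are mine, not from any library):

```python
def asym_quantize_params(w, bits=4):
    """Compute scale and zero for asymmetric quantization of the weights w."""
    q_max = 2**bits - 1
    w_min, w_max = min(w), max(w)
    scale = (w_max - w_min) / q_max
    zero = round(-w_min / scale)
    return scale, zero

def quantize(w_i, scale, zero, bits=4):
    """Map a float weight to its b-bit integer representation."""
    q = round(w_i / scale) + zero
    return max(0, min(2**bits - 1, q))  # clamp to [0, 2^b - 1]

def dequantize(q_i, scale, zero):
    """Reconstruct an approximation of the original float weight."""
    return (q_i - zero) * scale

scale, zero = asym_quantize_params([-1.5, 0.2, 3.0])
print(scale, zero)                    # ~0.3, 5
q = quantize(3.0, scale, zero)
print(q, dequantize(q, scale, zero))  # 15, ~3.0
```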
Typically, $\mathrm{zero}$ is also stored in $b$ bits, but this only works when $w_{min} \leq 0$ (otherwise $\mathrm{zero}$ would be negative) and $w_{max} \geq 0$ (otherwise $\mathrm{zero} > q_{max}$). However, this should be true for most sets of weights.
Real-world quantization
We could make a very simple quantizer by finding $\mathrm{scale}$ and $\mathrm{zero}$ for a set of model parameters as-is, but it would usually result in an inaccurate model. The rounding in the quantizer may be too crude to represent the weight distribution without a significant loss. Real-world quantization methods update a model's weights to work in tandem with the quantizers.
Symmetric quantization
The idea behind symmetric quantization is that $|w_{min}| = |w_{max}|$ (in contrast, this condition is not required in asymmetric quantization). To enforce this, the values are redefined as $w_{max} = \max(|w|)$ and $w_{min} = -w_{max}$. The value of $\mathrm{zero}$ is then always the same: $\mathrm{zero} = \frac{2^b}{2}$.
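Under the same assumptions as the sketch above, the symmetric variant only changes how $\mathrm{scale}$ and $\mathrm{zero}$ are computed (again an illustrative sketch, not library code):

```python
def sym_quantize_params(w, bits=4):
    """Compute scale and zero for symmetric quantization of the weights w."""
    w_max = max(abs(x) for x in w)           # w_min is implicitly -w_max
    scale = (w_max - (-w_max)) / (2**bits - 1)
    zero = 2**bits // 2                      # always the midpoint of the integer range
    return scale, zero

scale, zero = sym_quantize_params([-1.5, 0.2, 3.0])
print(scale, zero)  # 0.4, 8
```

Note that for these example weights the quantizer now has to cover $[-3.0, 3.0]$ even though no weight is smaller than $-1.5$.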
Symmetric quantization is the default in the most popular GPTQ implementations. I am not sure why, because fidelity is lost in cases where $w_{min}$ and $w_{max}$ are not close to $\min(w)$ and $\max(w)$ respectively.
Packing integers
When the values are quantized, they are not stored as-is, but packed into int32 values. This serves a dual purpose: operations on 32-bit integers are fast on most hardware, and packing avoids wasting bits. When we are packing weights such that $32 \bmod b = 0$, packing is straightforward using bit shifts. For instance, we can pack four 8-bit weights into one int32 using bitwise operators: packed = w0 | (w1 << 8) | (w2 << 16) | (w3 << 24). We can unpack e.g. w2 using w2 = (packed >> 16) & 0xff.
Masking
The & 0xff in the unpacking example above is the mask that clears all bits except the 8 least significant bits. The mask can be computed using 2**bits-1.
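Putting the shifting and masking together, here is a small self-contained sketch of packing and unpacking four 8-bit values in plain Python (illustrative, not taken from any particular kernel):

```python
def pack_uint8x4(ws):
    """Pack four 8-bit integers into a single 32-bit integer."""
    assert len(ws) == 4 and all(0 <= w < 256 for w in ws)
    packed = 0
    for i, w in enumerate(ws):
        packed |= w << (8 * i)
    return packed

def unpack_uint8x4(packed):
    """Unpack four 8-bit integers from a 32-bit integer."""
    mask = 2**8 - 1  # 0xff
    return [(packed >> (8 * i)) & mask for i in range(4)]

packed = pack_uint8x4([1, 2, 3, 4])
print(hex(packed))             # 0x4030201
print(unpack_uint8x4(packed))  # [1, 2, 3, 4]
```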
AutoGPTQ supports 2, 3, 4, and 8-bit quantization. 3-bit quantization is the odd one out, because $32 \bmod b = 2$. We could pack 30 bits into a 32-bit integer, but we would have two redundant bits. This is addressed by packing 32 3-bit integers into 3 int32s. Packing/unpacking is fiddly, because the int32s contain partial values.
Luckily, 4-bit GPTQ quantization seems to have become the standard, and some kernels (e.g. ExLlama) only support 4-bit GPTQ.
Let's ignore 3-bit GPTQ
For the remainder of this note I'll assume that $32 \bmod b = 0$, so no 3-bit quantization. It makes the shape definitions a little cleaner.
Bug in common implementations
Many implementations have an issue in packing that results in invalid zero values when used with asymmetric quantization. During packing, they subtract 1 from the zero values and then convert them to uint32 for packing. They then add 1 during unpacking. However, this results in an incorrect value when $\mathrm{zero} = 0$ before packing.
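For example, here is a minimal sketch of the failure mode in plain Python (illustrative only, not the code of any specific implementation), assuming 4-bit quantization:

```python
bits = 4
mask = 2**bits - 1

zero = 0                           # a valid asymmetric zero point

# Packing: subtract 1 and store the result as an unsigned 32-bit value.
stored = (zero - 1) & 0xFFFFFFFF   # -1 wraps around to 0xFFFFFFFF
packed = stored & mask             # only the low 4 bits survive packing: 0b1111 = 15

# Unpacking: add 1 to undo the earlier subtraction.
unpacked = packed + 1
print(unpacked)                    # 16, which does not fit in 4 bits; the original zero of 0 is lost
```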
The bug is quite unfortunate, because it makes asymmetric quantization perform considerably worse than it should. However, it cannot be resolved without breaking compatibility with existing checkpoints. There is work on a new version of the GPTQ format to solve this issue.
How GPTQ quantizers are stored
What are we quantizing?
Before looking at the storage format, it’s a good idea to take a step back and look at what we are quantizing. In transformers these are linear layers. In Torch we construct a linear layer using:
linear = nn.Linear(in_features, out_features)
Linear will store the weights as a matrix with shape [out_features, in_features], which is applied to an input vector as $Wx$. We could quantize the matrix with a single scale and zero. However, this would typically lead to a large quantization loss, since there can be a large variance in the weights. Instead, GPTQ uses scale and zero parameters for every output feature (row). This makes sense, since the weights in a row participate in the same dot product in $Wx$.
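As an illustration of what per-row (per-output-feature) quantization looks like, here is a rough sketch in PyTorch. It only applies the asymmetric quantizer from the preliminaries row by row; actual GPTQ additionally adjusts the remaining weights to compensate for the rounding error:

```python
import torch

def quantize_rows(W, bits=4):
    """Asymmetric per-row quantization of a [out_features, in_features] weight matrix."""
    q_max = 2**bits - 1
    w_min = W.min(dim=1, keepdim=True).values   # [out_features, 1]
    w_max = W.max(dim=1, keepdim=True).values   # [out_features, 1]
    scale = (w_max - w_min) / q_max
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(W / scale) + zero, 0, q_max).to(torch.int32)
    return q, scale, zero

W = torch.randn(8, 16)                          # out_features=8, in_features=16
q, scale, zero = quantize_rows(W)
W_reconstructed = (q - zero) * scale
print((W - W_reconstructed).abs().max())        # maximum quantization error
```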
Simplified format
We will start with a simplified description of GPTQ before getting to the actual format, to see how it naturally follows from the preliminaries, before some additional complexity is added. This simplified GPTQ stores the weight matrix with shape [out_features, in_features] using $b$-bit quantization, packing $c = \lfloor \frac{32}{b} \rfloor$ quantized weights into an int32:
| Parameter suffix | Shape | Type | Description |
|---|---|---|---|
| qweight | (out_features, in_features / c) | int32 | Integer-quantized weights, packed |
| scales | (out_features,) | float16 | Scale per output feature |
| qzeros | (out_features / c,) | int32 | Zero per output feature, packed |
Since we are quantizing rows, we have a scale and zero per row. The only quirk is that the quantized weights and zeros are packed by storing c values in an int32.
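Here is a sketch of how per-row quantized values could be packed into this layout, using numpy (illustrative; the helper name is mine, and real implementations vectorize the same shifts):

```python
import numpy as np

def pack_int32(values, bits=4):
    """Pack b-bit integers along the last axis into int32s (c = 32 // bits values per int32)."""
    c = 32 // bits
    values = values.astype(np.uint32).reshape(*values.shape[:-1], -1, c)
    packed = np.zeros(values.shape[:-1], dtype=np.uint32)
    for i in range(c):
        packed |= values[..., i] << np.uint32(bits * i)
    return packed.view(np.int32)

out_features, in_features, bits = 8, 16, 4
q = np.random.randint(0, 2**bits, size=(out_features, in_features))  # quantized weights
zeros = np.random.randint(0, 2**bits, size=(out_features,))          # one zero per row

qweight = pack_int32(q)     # shape (out_features, in_features / 8) = (8, 2)
qzeros = pack_int32(zeros)  # shape (out_features / 8,) = (1,)
print(qweight.shape, qzeros.shape)
```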
Groups
Since having a per-row scale and zero reduces the quantization loss, we could do the same for the columns. Of course, it would be rather pointless to have in_features scales/zeros for each output feature, because then the scales matrix would consume as much memory as the original weight matrix. As a compromise, GPTQ instead divides in_features evenly into n_groups groups. In the GPTQ quantizer configuration, this is usually configured as group_size (group_size = in_features / n_groups), resulting in the following shapes:
| Parameter suffix | Shape | Type | Description |
|---|---|---|---|
| qweight | (out_features, in_features / c) | int32 | Integer-quantized weights, packed |
| scales | (n_groups, out_features) | float16 | Scale per group + output feature |
| qzeros | (n_groups, out_features / c) | int32 | Zero per group + output feature |
| g_idx | (in_features,) | int32 | The group identifier for each input feature |
scales and qzeros are now matrices to store the parameters for n_groups groups. The new g_idx tensor maps input features to groups. So to get the scale for input feature/column i, we get group=g_idx[i] and we can then get the scales/zeros using scales[group]/qzeros[group].
Quantizer parameter groups are optional. Disabling this functionality is equivalent to using one group and g_idx==0.
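For example, looking up the quantizer parameters of a single column, with hypothetical shapes and qzeros shown here already unpacked for clarity:

```python
import torch

in_features, out_features, n_groups = 8, 4, 2
g_idx = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1], dtype=torch.int32)   # input feature -> group
scales = torch.rand(n_groups, out_features).to(torch.float16)       # [n_groups, out_features]
zeros = torch.randint(0, 16, (n_groups, out_features))              # unpacked qzeros

i = 5                       # some input feature / column
group = g_idx[i]            # -> 1
scale_col = scales[group]   # scales for column i, one per output feature
zero_col = zeros[group]     # zeros for column i, one per output feature
```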
Transposition
The second difference between the simplified GPTQ format and the actual format is that the weight matrix is transposed before storage, so in the checkpoints we see the following parameters:
| Parameter suffix | Shape | Type | Description |
|---|---|---|---|
| qweight | (in_features / c, out_features) | int32 | Integer-quantized weights, packed |
| scales | (n_groups, out_features) | float16 | Scale per output feature in a group |
| qzeros | (n_groups, out_features / c) | int32 | Zero per output feature in a group |
| g_idx | (in_features,) | int32 | The group identifier for each input feature |
And this is the actual storage format that you will find in PyTorch checkpoints of GPTQ models on e.g. the Hugging Face hub.
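To tie the pieces together, here is a rough sketch of reconstructing the float weight matrix from these tensors with PyTorch, assuming 4-bit quantization and ignoring the zero off-by-one behaviour of existing implementations discussed earlier (illustrative only, not a drop-in replacement for a real dequantization kernel):

```python
import torch

def unpack_int32(packed, bits=4, dim=-1):
    """Unpack each int32 into c = 32 // bits small integers along dimension `dim`."""
    c = 32 // bits
    mask = 2**bits - 1
    dim = dim % packed.dim()
    shifts = torch.arange(c, dtype=torch.int32) * bits
    shape = [1] * (packed.dim() + 1)
    shape[dim + 1] = c
    # Expand each int32 into c values, then merge them back into dimension `dim`.
    unpacked = (packed.unsqueeze(dim + 1) >> shifts.view(shape)) & mask
    return unpacked.flatten(dim, dim + 1)

def dequantize(qweight, scales, qzeros, g_idx, bits=4):
    """Reconstruct the float weight matrix with shape [out_features, in_features]."""
    q = unpack_int32(qweight, bits, dim=0)      # [in_features, out_features]
    zeros = unpack_int32(qzeros, bits, dim=-1)  # [n_groups, out_features]
    idx = g_idx.long()
    scale = scales[idx]                         # [in_features, out_features]
    zero = zeros[idx]                           # [in_features, out_features]
    return ((q - zero) * scale).t()             # transpose back to [out_features, in_features]
```

With the tensors of a quantized linear layer loaded from a checkpoint, dequantize(qweight, scales, qzeros, g_idx) should give a close approximation of the original weight matrix (modulo the zero-handling caveat above).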
Other GPTQ configuration options
These are some other configuration options that change how the quantizer works, but have no ramifications on the serialized format:
desc_act: the GPTQ quantization method is sensitive to the order in which weights are processed. When this option is enabled, the weights are sorted by descending activation. This prioritizes reducing the quantization loss of parameters that have a larger impact on the activations. Activation sorting makes quantizer parameter lookups less efficient: the quantizer is constructed from the permuted weight matrix, so the scales and qzeros lookups are random-access after quantization.
static_groups: pre-calculates the quantization parameters before permuting the weight matrix when desc_act is used. Avoids the random-access pattern that desc_act introduces.