Switching from Gemma 4 to Gemma 4 QAT

While improving perplexity (4 minutes read)

2026-06-06

Quantization aware training

Gemma 4 QAT was released, which got me hyped up. In a nutshell, on consumer hardware I have to run heavily quantized models (4 bits per weight works well for me), which in turn costs them accuracy and raises their perplexity. The problem is that the quantization usually happens post-training, on a model that was never trained to expect it, so the quality degrades. With QAT, the training process already knows that the weights will be quantized and that this will introduce errors, so the model gets a chance to compensate for those errors during training, with neighboring weights helping to absorb them.

To make it more visual I will use an analogy with a photo: here is me holding my drone, in grayscale at 8 bits per pixel:

This is what it would look like if we quantized the image the way we usually quantize our LLMs (8 bits quantized down to 1 bit):

Yes, there is an 8x size saving and some resemblance to the original, but the image has degraded way too much. Now suppose we let each pixel figure out how much error it introduces once it is quantized, and feed that error back into the not-yet-quantized pixels so they can compensate. We end up with a dithered image:

Of course it's not 100% like the original and there is quality loss, but it's still a 1-bit image, still 8x smaller yet it's significantly closer to the original than our first attempt at quantization.
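The error feedback described above can be sketched in a few lines of Python. This is a simplified 1D error diffusion over a single row of pixels, not necessarily the algorithm used for the photo above (2D schemes such as Floyd-Steinberg spread the error over several neighbors). The function names and the sample row are made up for illustration.

```python
def threshold(pixels):
    """Naive 1-bit quantization: every pixel independently becomes 0 or 255."""
    return [255 if p >= 128 else 0 for p in pixels]


def dither(pixels):
    """1-bit quantization with 1D error diffusion.

    The rounding error of each pixel is carried over to the next pixel,
    so the average brightness of the row is preserved.
    """
    out = []
    error = 0.0
    for p in pixels:
        value = p + error              # pixel plus the error inherited so far
        q = 255 if value >= 128 else 0
        out.append(q)
        error = value - q              # what this pixel got wrong
    return out


def mean(values):
    return sum(values) / len(values)


row = [100] * 64                       # a flat dark-gray row, 8 bits per pixel
naive = threshold(row)
dithered = dither(row)

print("original mean:", mean(row))
print("naive mean:   ", mean(naive))
print("dithered mean:", mean(dithered))
```

The naive version turns the whole row black, while the dithered row is still made of only 0 and 255 values but keeps an average brightness close to the original.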

And something similar happens with QAT: when we are smarter during training, giving the model a chance to adjust to the fact that it will be quantized and feeding back the errors that the future quantization will introduce, the model behaves much better. This way we can get heavily quantized LLMs that perform very close to their unquantized counterparts.
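QAT implementations commonly simulate quantization in the forward pass by rounding the weights to the low-bit grid and immediately converting them back ("fake quantization"), so the training loss already sees the rounding error. Below is a minimal sketch of that round trip, assuming symmetric 4-bit rounding with a single scale for the whole list. The actual Gemma 4 QAT recipe is not described in this post and may differ; the weights are made-up values.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a signed `bits`-bit grid and convert them back to floats."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0  # avoid dividing by zero
    # Integer codes, clamped to the representable range [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes], scale


weights = [0.82, -0.31, 0.07, -0.55, 0.19]          # hypothetical weights
dequantized, scale = fake_quantize(weights)
errors = [d - w for d, w in zip(dequantized, weights)]

for w, d, e in zip(weights, dequantized, errors):
    print(f"{w:+.3f} -> {d:+.3f} (error {e:+.3f})")
```

The `errors` list is what post-training quantization silently bakes into the model; in QAT the training loop gets to see those errors and adjust the weights around them.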

Another blog post goes over QAT itself before it was used in Gemma 4.

Unsloth dynamic

On top of QAT, I'm using the UD (Unsloth Dynamic) variant of the model, which in a nutshell doesn't apply the same quantization uniformly across the whole model. My analogy for this is variable-bitrate compression in movie or music files, which saves bandwidth where it isn't needed and doesn't affect the result much. In the movie analogy, this means allocating less bitrate to a slow-moving credits screen than to a high-activity action scene. UD selects which quantization to use where, aiming to save space without noticeably affecting the end result.
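To make the variable-bitrate idea concrete, here is a toy sketch of mixed-precision allocation. Everything in it is hypothetical: the tensor names, sizes, sensitivity scores, and the thresholds in `pick_bits` are invented for illustration and are not how Unsloth actually decides which quantization goes where.

```python
# Hypothetical tensors: (name, number of weights, sensitivity score in [0, 1]).
tensors = [
    ("attention.q_proj", 4_000_000, 0.9),
    ("mlp.up_proj", 12_000_000, 0.3),
    ("mlp.down_proj", 12_000_000, 0.6),
]


def pick_bits(sensitivity):
    """Made-up rule: the more sensitive a tensor is, the more bits it gets."""
    if sensitivity >= 0.8:
        return 6
    if sensitivity >= 0.5:
        return 5
    return 4


def average_bits(tensors):
    """Average bits per weight across all tensors, weighted by tensor size."""
    total_bits = sum(count * pick_bits(score) for _, count, score in tensors)
    total_weights = sum(count for _, count, _ in tensors)
    return total_bits / total_weights


for name, count, score in tensors:
    print(f"{name:18s} {pick_bits(score)} bits")
print(f"average: {average_bits(tensors):.2f} bits per weight")
```

The point of the sketch is only that the average can stay close to 4 bits even though some tensors get more than that.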

Tests

Performance

Original post-training quantized Gemma 4:

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -n 512 -d 8192,32768,65536,131072

model	size	params	backend	ngl	test	t/s
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	pp512 @ d8192	824.93 ± 4.40
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	tg512 @ d8192	20.84 ± 0.00
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	pp512 @ d32768	551.16 ± 1.25
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	tg512 @ d32768	18.90 ± 0.00
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	pp512 @ d65536	382.09 ± 0.65
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	tg512 @ d65536	17.22 ± 0.00
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	pp512 @ d131072	236.39 ± 0.22
gemma4 31B Q4_0	16.13 GiB	30.70 B	CUDA	-1	tg512 @ d131072	13.93 ± 0.00

And the new QAT UD variant, which is slightly smaller and runs at essentially the same speed (marginally faster in every row):

llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL -n 512 -d 8192,32768,65536,131072

model	size	params	backend	ngl	test	t/s
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	pp512 @ d8192	827.57 ± 4.16
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	tg512 @ d8192	20.89 ± 0.00
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	pp512 @ d32768	552.09 ± 1.33
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	tg512 @ d32768	18.94 ± 0.00
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	pp512 @ d65536	382.28 ± 0.64
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	tg512 @ d65536	17.25 ± 0.00
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	pp512 @ d131072	236.46 ± 0.21
gemma4 31B Q4_0	16.09 GiB	30.70 B	CUDA	-1	tg512 @ d131072	13.95 ± 0.00
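The size and params columns above can be turned into an effective bits-per-weight figure with a little arithmetic. This sketch assumes that GiB means 2^30 bytes and that "B" in the params column means 10^9 parameters; the function name is my own.

```python
def bits_per_weight(size_gib, params_billion):
    """Effective bits per weight from a file size in GiB and a parameter count in billions."""
    total_bits = size_gib * 1024 ** 3 * 8
    return total_bits / (params_billion * 1e9)


q4_0 = bits_per_weight(16.13, 30.70)      # post-training quantized Q4_0
qat_ud = bits_per_weight(16.09, 30.70)    # QAT UD-Q4_K_XL

print(f"Q4_0:   {q4_0:.2f} bits per weight")
print(f"QAT UD: {qat_ud:.2f} bits per weight")
```

Both files land at roughly 4.5 bits per weight, so the comparison between them is close to like-for-like in terms of size.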

Perplexity test

Perplexity measures how confident the model is in the text it is scoring: roughly, how many tokens it is effectively choosing between at each step. Even if the correct token ends up on top, high perplexity signals that the model was hesitating between many candidates and was not sure of its output. Lower is better.

The perplexity test was run for each model and each test file like this:
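In formula terms, perplexity is the exponential of the average negative log-probability the model assigned to each actual token. A minimal sketch with made-up probabilities (the two lists below are illustrative, not measurements):

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the probabilities given to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


confident = [0.9, 0.8, 0.95, 0.85]    # model was fairly sure of each token
hesitant = [0.25, 0.25, 0.25, 0.25]   # model was torn between 4 equal options

print(f"confident: {perplexity(confident):.2f}")
print(f"hesitant:  {perplexity(hesitant):.2f}")
```

The hesitant case comes out at 4, matching the intuition that the model was effectively choosing between four tokens at every step.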

llama-perplexity -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -f _tests/wiki.test.raw -s 1024 -c 8192 -b 1024 -ub 1024 -ngl all

0.00.931.928 I common_init_result: fitting params to device memory ...
0.00.931.937 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.03.880.221 W load: override 'tokenizer.ggml.add_bos_token' to 'true' for Gemma4
0.03.944.207 W load: control-looking token:    212 '' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.944.760 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.973.296 W load: special_eog_ids contains '<|tool_response>', removing '' token from EOG list
0.08.242.752 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.08.296.367 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.08.378.919 I
0.08.379.010 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CUDA : ARCHS = 1200 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | REPACK = 1 |
0.08.379.021 I perplexity: tokenizing the input ..
0.09.259.797 I perplexity: tokenization took 880.771 ms
0.09.259.943 I perplexity: calculating perplexity over 36 chunks, n_ctx=8192, batch_size=1024, n_seq=1
0.18.500.650 I perplexity: 9.24 seconds per pass - ETA 5.53 minutes
[1]2414.0873,[2]744.1431,[3]1208.4171,[4]1098.5517,[5]1136.4832,[6]1087.0484,[7]1362.3209,[8]1247.2345,[9]1245.1584,[10]1079.0721,[11]1232.4814,[12]1300.3726,[13]1304.6480,[14]1237.1090,[15]1219.7609,[16]1274.1374,[17]1271.9768,[18]1297.0858,[19]1304.4827,[20]1367.1843,[21]1315.0807,[22]1312.6935,[23]1405.5037,[24]1460.9886,[25]1499.5799,[26]1551.1594,[27]1583.2841,[28]1590.3732,[29]1624.3423,[30]1636.1713,[31]1663.1279,[32]1652.8445,[33]1644.3609,[34]1612.1382,[35]1664.2287,[36]1716.7822,
5.26.687.123 I Final estimate: PPL = 1716.7822 +/- 26.47241
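With several models and test files, copying the final number out of each log by hand gets tedious. Here is a small sketch that pulls the final estimate out of a log like the one above; the regular expression is written against the "Final estimate" line shown in this post, and `parse_final_ppl` is a helper name I made up.

```python
import re

# Matches e.g. "Final estimate: PPL = 1716.7822 +/- 26.47241"
FINAL_PPL = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_ppl(log_text):
    """Return (perplexity, uncertainty) from the log text, or None if not found."""
    match = FINAL_PPL.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


line = "5.26.687.123 I Final estimate: PPL = 1716.7822 +/- 26.47241"
print(parse_final_ppl(line))
```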

The final perplexity of each run is captured in the table below (lower is better):

model	size (MiB)	wikipedia	frontend (js/css/html)	golang	c#
gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL	16486	1716.78	1.98	639.18	11.78
gemma-4-31B-it-GGUF:Q4_0	16535	11981.15	3.29	170.12	32.51

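The QAT variant is slightly smaller than the previous Q4 variant, yet it yields much lower perplexity on the Wikipedia, frontend, and C# test sets. The Go test set is the exception in the table above: there the QAT variant scores higher (639.18 vs 170.12).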
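To put the table into relative terms, the ratio between the two rows can be computed per test set. The numbers below are copied from the table; the dictionary names are my own.

```python
# Final perplexity per test set, copied from the table above.
qat_ud = {"wikipedia": 1716.78, "frontend": 1.98, "golang": 639.18, "c#": 11.78}
q4_0 = {"wikipedia": 11981.15, "frontend": 3.29, "golang": 170.12, "c#": 32.51}

# Ratio > 1 means the QAT UD variant has the lower (better) perplexity.
ratios = {name: q4_0[name] / qat_ud[name] for name in qat_ud}

for name, ratio in ratios.items():
    print(f"{name:10s} Q4_0 / QAT UD = {ratio:.2f}")
```

A ratio above 1 favors the QAT UD variant and a ratio below 1 favors the original Q4_0 file.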

Conclusion

This is a huge win; the QAT 31B and 12B models are now my default go-to models.