
Switching from Gemma 4 to Gemma 4 QAT

While improving perplexity

2026-06-06

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:IQ4_XS -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor -n 512 -d 8192,32768,65536,131072

ai@ai:~$ llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL -n 512 -d 8192,32768,65536,131072
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31699 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB

| model           | size      | params  | backend | ngl | test            | t/s           |
| --------------- | --------- | ------- | ------- | --- | --------------- | ------------- |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | pp512 @ d8192   | 827.57 ± 4.16 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tg512 @ d8192   | 20.89 ± 0.00  |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | pp512 @ d32768  | 552.09 ± 1.33 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tg512 @ d32768  | 18.94 ± 0.00  |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | pp512 @ d65536  | 382.28 ± 0.64 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tg512 @ d65536  | 17.25 ± 0.00  |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | pp512 @ d131072 | 236.46 ± 0.21 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tg512 @ d131072 | 13.95 ± 0.00  |

llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor -n 512 -d 8192,32768,65536,131072


| model           | size      | params  | backend | ngl | sm     | test            | t/s            |
| --------------- | --------- | ------- | ------- | --- | ------ | --------------- | -------------- |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | pp512 @ d8192   | 973.03 ± 18.59 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | tg512 @ d8192   | 36.12 ± 0.06   |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | pp512 @ d32768  | 769.23 ± 12.14 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | tg512 @ d32768  | 32.98 ± 0.04   |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | pp512 @ d65536  | 603.22 ± 7.29  |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | tg512 @ d65536  | 30.44 ± 0.04   |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | pp512 @ d131072 | 420.75 ± 3.54  |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | tg512 @ d131072 | 25.11 ± 0.03   |
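
To compare the two runs above row by row, here is a minimal Python sketch that divides the `--split-mode tensor` throughput by the default-split throughput. The numbers are copied from the two llama-bench outputs; the variable names are my own and the ± spread is ignored, so treat the ratios as rough.

```python
# t/s values copied from the two llama-bench runs above, keyed by (test, depth).
default_tps = {
    ("pp512", 8192): 827.57, ("tg512", 8192): 20.89,
    ("pp512", 32768): 552.09, ("tg512", 32768): 18.94,
    ("pp512", 65536): 382.28, ("tg512", 65536): 17.25,
    ("pp512", 131072): 236.46, ("tg512", 131072): 13.95,
}
tensor_tps = {
    ("pp512", 8192): 973.03, ("tg512", 8192): 36.12,
    ("pp512", 32768): 769.23, ("tg512", 32768): 32.98,
    ("pp512", 65536): 603.22, ("tg512", 65536): 30.44,
    ("pp512", 131072): 420.75, ("tg512", 131072): 25.11,
}

# Speedup of tensor split over the default split, per test and depth.
ratios = {key: tensor_tps[key] / default_tps[key] for key in default_tps}

for (test, depth), ratio in sorted(ratios.items()):
    print(f"{test} @ d{depth}: {ratio:.2f}x")
```

On these numbers tensor split is faster in every row, and the prompt-processing gain grows with context depth.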

ai@ai:~/llama.cpp$ ./build/bin/llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor -n 512 -d 32768
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31699 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB

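| model           | size      | params  | backend | ngl | sm     | test           | t/s            |
| --------------- | --------- | ------- | ------- | --- | ------ | -------------- | -------------- |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | pp512 @ d32768 | 765.46 ± 12.06 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA    | -1  | tensor | tg512 @ d32768 | 33.18 ± 0.00   |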

build: d73cd0767 (9585)

Perplexity test

llama-perplexity -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL -f _tests/wiki.test.raw -s 1024 -c 8192 -b 1024 -ub 1024 -ngl all

0.00.931.928 I common_init_result: fitting params to device memory ...
0.00.931.937 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.03.880.221 W load: override 'tokenizer.ggml.add_bos_token' to 'true' for Gemma4
0.03.944.207 W load: control-looking token:    212 '' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.944.760 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.973.296 W load: special_eog_ids contains '<|tool_response>', removing '' token from EOG list
0.08.242.752 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.08.296.367 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.08.378.919 I
0.08.379.010 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CUDA : ARCHS = 1200 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | REPACK = 1 |
0.08.379.021 I perplexity: tokenizing the input ..
0.09.259.797 I perplexity: tokenization took 880.771 ms
0.09.259.943 I perplexity: calculating perplexity over 36 chunks, n_ctx=8192, batch_size=1024, n_seq=1
0.18.500.650 I perplexity: 9.24 seconds per pass - ETA 5.53 minutes
[1]2414.0873,[2]744.1431,[3]1208.4171,[4]1098.5517,[5]1136.4832,[6]1087.0484,[7]1362.3209,[8]1247.2345,[9]1245.1584,[10]1079.0721,[11]1232.4814,[12]1300.3726,[13]1304.6480,[14]1237.1090,[15]1219.7609,[16]1274.1374,[17]1271.9768,[18]1297.0858,[19]1304.4827,[20]1367.1843,[21]1315.0807,[22]1312.6935,[23]1405.5037,[24]1460.9886,[25]1499.5799,[26]1551.1594,[27]1583.2841,[28]1590.3732,[29]1624.3423,[30]1636.1713,[31]1663.1279,[32]1652.8445,[33]1644.3609,[34]1612.1382,[35]1664.2287,[36]1716.7822,
5.26.687.123 I Final estimate: PPL = 1716.7822 +/- 26.47241
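
For readers unfamiliar with the metric: perplexity is the exponential of the mean negative log-likelihood per token, and the bracketed `[n]` values in the log are the running estimate after each chunk, which is why `[36]` matches the final estimate. A minimal Python sketch of that definition follows. The function names and the log-probabilities are made up for illustration; this is not llama-perplexity's actual code.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def running_perplexity(chunks):
    """Cumulative perplexity after each chunk, like the [n] values in the log."""
    total_nll = 0.0
    total_tokens = 0
    estimates = []
    for chunk in chunks:
        total_nll -= sum(chunk)
        total_tokens += len(chunk)
        estimates.append(math.exp(total_nll / total_tokens))
    return estimates

# Hypothetical per-token log-probabilities, two chunks of two tokens each.
chunks = [
    [math.log(0.5), math.log(0.25)],
    [math.log(0.125), math.log(0.5)],
]
running = running_perplexity(chunks)
flat = [lp for chunk in chunks for lp in chunk]

for i, value in enumerate(running, start=1):
    print(f"[{i}]{value:.4f}")
print(f"Final estimate: PPL = {perplexity(flat):.4f}")
```

A model that assigned probability 0.5 to every token would score a perplexity of exactly 2, so lower is better.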

llama-perplexity -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -f _tests/wiki.test.raw -s 1024 -c 8192 -b 1024 -ub 1024 -ngl all

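| model                              | size (MiB) | wikipedia | frontend (js/css/html) | golang | c#    |
| ---------------------------------- | ---------- | --------- | ---------------------- | ------ | ----- |
| gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL | 16486      | 1716.78   | 1.98                   | 639.18 | 11.78 |
| gemma-4-31B-it-GGUF:Q4_0           | 16535      | 11981.15  | 3.29                   | 170.12 | 32.51 |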

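The QAT build is slightly smaller than the previous Q4_0 variant (16486 MiB vs 16535 MiB), yet it yields much lower perplexity on three of the four corpora I tried: Wikipedia, frontend and C#. Go is the exception, where the table shows Q4_0 ahead (170.12 vs 639.18).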
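
To put the table in relative terms, here is a small Python sketch that divides the Q4_0 perplexity by the QAT perplexity for each corpus. The values are copied from the table above and the dictionary keys are my own labels. A ratio above 1 means the QAT build has the lower perplexity on that corpus.

```python
# Perplexities and sizes copied from the table above.
ppl = {
    "qat":  {"wikipedia": 1716.78,  "frontend": 1.98, "golang": 639.18, "c#": 11.78},
    "q4_0": {"wikipedia": 11981.15, "frontend": 3.29, "golang": 170.12, "c#": 32.51},
}
size_mib = {"qat": 16486, "q4_0": 16535}

# Q4_0 perplexity relative to QAT; above 1 favours QAT, below 1 favours Q4_0.
ppl_ratio = {corpus: ppl["q4_0"][corpus] / ppl["qat"][corpus] for corpus in ppl["qat"]}
size_saving_mib = size_mib["q4_0"] - size_mib["qat"]

for corpus, ratio in ppl_ratio.items():
    print(f"{corpus}: Q4_0 / QAT = {ratio:.2f}")
print(f"QAT is {size_saving_mib} MiB smaller")
```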
