Compressing KV cache

And expanding the context window

2026-06-11

I was running out of memory with gemma4-31b:q4 when going just a tiny bit over a 128k context window, so I decided to use q8 quantization on the KV cache. Now I still max out RAM, but with a context window of up to 210k:


llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor --flash-attn on -ctk q8_0 -ctv q8_0 -n 512 -d 194560,215040

| model                          |       size |     params | backend    | ngl | type_k | type_v |     sm |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | --: | --------------: | -------------------: |
| gemma4 31B Q4_0                |  16.09 GiB |    30.70 B | CUDA       |  -1 |   q8_0 |   q8_0 | tensor |   1 | pp512 @ d194560 |        316.88 ± 2.48 |
| gemma4 31B Q4_0                |  16.09 GiB |    30.70 B | CUDA       |  -1 |   q8_0 |   q8_0 | tensor |   1 | tg512 @ d194560 |         12.54 ± 0.00 |
| gemma4 31B Q4_0                |  16.09 GiB |    30.70 B | CUDA       |  -1 |   q8_0 |   q8_0 | tensor |   1 | pp512 @ d215040 |        294.93 ± 2.08 |
| gemma4 31B Q4_0                |  16.09 GiB |    30.70 B | CUDA       |  -1 |   q8_0 |   q8_0 | tensor |   1 | tg512 @ d215040 |         11.73 ± 0.00 |
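
For a rough sense of why a q8_0 KV cache buys this much extra context, here is a back-of-the-envelope sketch in Python. It relies on the ggml q8_0 layout (blocks of 32 int8 values plus one fp16 scale, so 34 bytes per 32 elements) compared against 2 bytes per element for f16. The `kv_cache_bytes` helper and the layer/head numbers are hypothetical placeholders for illustration, not the real Gemma configuration, and the formula ignores sliding-window layers, which would make the real cache smaller.

```python
# Back-of-the-envelope KV cache sizing: f16 vs q8_0.
# The model dimensions further down are HYPOTHETICAL placeholders.

F16_BYTES_PER_ELEM = 2.0
# ggml q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
Q8_0_BYTES_PER_ELEM = 34 / 32


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes for K plus V, assuming every layer caches the full context."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return n_tokens * elems_per_token * bytes_per_elem


# How much smaller the cache gets per element when moving from f16 to q8_0.
ratio = F16_BYTES_PER_ELEM / Q8_0_BYTES_PER_ELEM
print(f"f16 / q8_0 size ratio: {ratio:.2f}")

# The -d depths from the benchmark, expressed in multiples of 1024 tokens.
for depth in (194560, 215040):
    print(f"{depth} tokens = {depth // 1024} x 1024")

# Hypothetical dimensions, only to show how the formula is used.
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128)
for name, bpe in (("f16", F16_BYTES_PER_ELEM), ("q8_0", Q8_0_BYTES_PER_ELEM)):
    gib = kv_cache_bytes(215040, bytes_per_elem=bpe, **dims) / 2**30
    print(f"{name}: {gib:.2f} GiB at depth 215040")
```

The per-element ratio is a bit under 2x, while the context only grew from about 128k to about 210k. That is plausible, since the model weights and compute buffers take the same space regardless of the cache type, so only part of the memory budget shrinks.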
