Ollama vs Llama.cpp

How I improved token/s by 10%

2026-06-05

Why

Previously I ran llamacpp just to prove things were working, then later switched to ollama for simplicity, but now I want to return to llamacpp. It should be more performant, but I want to check for myself. I want to run the Gemma4 31B model, but with llamacpp there are so many more options to choose from with various quantizations that I want to benchmark which one will be the go-to one for my setups. One for dual GPU setup and dedicated to AI, but powered only on demand when I need it. And one with a single GPU but running 24/7. So having something for random ad hoc tasks and one for more targeted and intentional API calls.

alt

Benchmark

Running similarly sized variants of gemma4:31b. On these two base platforms, my model decision is here if you want to skip it.

Rig AI

Dedicated for AI but not running constantly:

Will be running:

Rig NAS

A virtual machine running inside the Proxmox NAS setup, with PCIe passthrough for the GPU:

Will be running with the following command: llama-server -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_M --host 0.0.0.0 --no-mmproj The idea is to not load the model straight away but only on demand and offload it as soon as possible to not hog the shared machine and GPU with this task. Also, a much more quantized model had to be used to fit into the VRAM.

Why Gemma4

It runs well with Cline, thinking/reasoning, chain, tools invocation and editing files works well enough. Doesn't go spastic as much as others in my experience. While the 26b is faster and smaller, it is a mix of experts and not always as good in Cline (can end up failing on a file edit and looping with errors much more easily). And then the 12b could be used on the NAS for more casual requests when the dual GPU rig is not running.

Summary and decision

Both engines are optimized for environments with single user and maximum single request running at any time. And to make it fair, I matched llamacpp to ollama's parameters.

Model Engine Quantization Size (GB) Context t/s Split mode
gemma4:31b Ollama Q4_K_M 19 32k 19.96 default
gemma-4-31B-it-GGUF llamacpp Q4_K_M 18.3 32k 22.0 default

The size discrepancy is likely due to the fact that Ollama's model bundles in a single file the multimodal model, while with llamacpp it's a separate file (which we can later explicitly tell not to load).

Commands used:

Going to switch to llamacpp because:

alt

So llamacpp won and I am going to switch but now I want to figure out which exact model quantization I should use. The section below shows what runs I did on ollama and llamacpp. And also want to experiment a tiny bit with single GPU model sizes and compare how much overhead there is running it on two GPUs.

You can skip the details below if not interested.

Runs

Ollama - splitting the load on two GPUs

Going to figure out the details of the ollama model: ollama show gemma4:31b

1   Model
2     architecture        gemma4
3     parameters          31.3B
4     context length      262144
5     embedding length    5376
6     quantization        Q4_K_M
7     requires            0.20.0
8 
9   Capabilities
10     completion
11     vision
12     tools
13     thinking
14 
15   Parameters
16     temperature    1
17     top_k          64
18     top_p          0.95
19 
20   License
21     Apache License
22     Version 2.0, January 2004
23     ...

Quantization is Q4_K_M, temperature 1, top_k 64 and top_p 0.95

I ran multiple runs, they fluctuate tiny bit, but average is under 20 tokens/s. ollama run --verbose gemma4:31b "write simple C++ webserver"

1 total duration:       1m55.384730898s
2 load duration:        12.326818119s
3 prompt eval count:    22 token(s)
4 prompt eval duration: 69.191131ms
5 prompt eval rate:     317.96 tokens/s
6 eval count:           2032 token(s)
7 eval duration:        1m41.752473912s
8 eval rate:            19.97 tokens/s
1 total duration:       1m35.045527887s
2 load duration:        297.417321ms
3 prompt eval count:    22 token(s)
4 prompt eval duration: 63.148366ms
5 prompt eval rate:     348.39 tokens/s
6 eval count:           1874 token(s)
7 eval duration:        1m33.667665968s
8 eval rate:            19.99 tokens/s
1 total duration:       1m55.730248723s
2 load duration:        294.014649ms
3 prompt eval count:    22 token(s)
4 prompt eval duration: 63.19743ms
5 prompt eval rate:     348.12 tokens/s
6 eval count:           2275 token(s)
7 eval duration:        1m54.110752464s
8 eval rate:            19.94 tokens/s

And the context-window is 32k:

 ollama ps
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:31b    6316f0629137    28 GB    100% GPU     32768      4 hours from now

Llamacpp

The GeForce RTX 5060 Ti uses the Blackwell architecture and has a CUDA Compute Capability of 12.0 and built llamacpp with 120 on.

Checking GPU support:

1 ai@ai:~$ llama-server --list-devices
2 Available devices:
3   CUDA0: NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)
4   CUDA1: NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)

2-bit and 3-bit dual GPU

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_M -hf unsloth/gemma-4-31B-it-GGUF:Q3_K_M -n 512 -d 8192,32768,65536,131072,190000

Bit higher context windows: llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_M -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS -n 512 -d 230000

Maximum context window: llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS -n 512 -d 262144

model size params backend ngl test t/s
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 pp512 @ d8192 798.88 ± 2.13
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 tg512 @ d8192 31.25 ± 0.00
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 pp512 @ d32768 538.31 ± 2.36
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 tg512 @ d32768 27.21 ± 0.00
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 pp512 @ d65536 376.02 ± 0.49
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 tg512 @ d65536 23.77 ± 0.00
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 pp512 @ d131072 234.13 ± 0.19
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 tg512 @ d131072 17.90 ± 0.00
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 pp512 @ d190000 174.62 ± 0.11
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 tg512 @ d190000 14.97 ± 0.00
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 pp512 @ d230000 149.02 ± 0.09
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 tg512 @ d230000 13.38 ± 0.00
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 pp512 @ d262144 132.02 ± 0.10
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 tg512 @ d262144 12.44 ± 0.00
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 pp512 @ d8192 660.15 ± 1.79
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 tg512 @ d8192 27.65 ± 0.00
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 pp512 @ d32768 472.82 ± 0.81
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 tg512 @ d32768 24.43 ± 0.00
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 pp512 @ d65536 342.64 ± 0.46
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 tg512 @ d65536 21.62 ± 0.00
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 pp512 @ d131072 220.65 ± 0.13
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 tg512 @ d131072 16.65 ± 0.00
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 pp512 @ d190000 166.96 ± 0.11
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 tg512 @ d190000 14.09 ± 0.00
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 pp512 @ d230000 141.98 ± 0.14
gemma4 31B IQ2_M - 2.7 bpw 10.00 GiB 30.70 B CUDA -1 tg512 @ d230000 12.69 ± 0.00
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 pp512 @ d8192 684.59 ± 2.77
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 tg512 @ d8192 22.12 ± 0.00
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 pp512 @ d32768 485.47 ± 0.95
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 tg512 @ d32768 19.93 ± 0.00
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 pp512 @ d65536 348.94 ± 0.52
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 tg512 @ d65536 18.07 ± 0.00
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 pp512 @ d131072 223.23 ± 0.20
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 tg512 @ d131072 14.48 ± 0.00
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 pp512 @ d190000 166.84 ± 0.10
gemma4 31B Q3_K - Medium 13.71 GiB 30.70 B CUDA -1 tg512 @ d190000 12.48 ± 0.00

2-bit on single GPU

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS --split-mode none --main-gpu 1 -n 512 -d 8192,32768,65536

model size params backend ngl main_gpu sm test t/s
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 none pp512 @ d8192 791.74 ± 9.66
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 none tg512 @ d8192 30.55 ± 0.09
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 none pp512 @ d32768 537.26 ± 2.05
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 none tg512 @ d32768 26.82 ± 0.03
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 none pp512 @ d65536 375.05 ± 1.08
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 none tg512 @ d65536 23.69 ± 0.01

Similar to default dual GPU setup, overhead from layer setup is minimal and running single GPU will let it trigger thermals sooner and not let it turbo as much as two GPUs with half load each.

2-bit tensor dual GPU

Instead of dividing the model sequentially (where GPU 1 sits around waiting for GPU 0), --split-mode tensor splits the individual matrix math blocks of every single layer across both GPUs simultaneously.

It requires a lot of communication between the cards. If GPUs are connected via a slow motherboard slot, the overhead can very have negative effects on the performance.

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS --split-mode tensor --main-gpu 1 -n 512 -d 8192,32768,65536

model size params backend ngl main_gpu sm test t/s
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 tensor pp512 @ d8192 953.70 ± 18.05
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 tensor tg512 @ d8192 50.23 ± 0.06
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 tensor pp512 @ d32768 757.52 ± 11.50
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 tensor tg512 @ d32768 44.35 ± 0.07
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 tensor pp512 @ d65536 595.87 ± 7.08
gemma4 31B IQ2_XXS - 2.0625 bpw 7.93 GiB 30.70 B CUDA -1 1 tensor tg512 @ d65536 40.01 ± 0.06

Results are promising, using tensor almost doubles the performance, of course the extreme PCIe overhead will not allow it to be exactly double the performance.

4-bit on dual GPU

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:IQ4_XS -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 -n 512 -d 8192,32768,65536,131072

and to test bigger context window:

llama-bench -hf unsloth/gemma-4-31B-it-GGUF:IQ4_XS -n 512 -d 163840

llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL -n 512 -d 8192,32768,65536,131072

model size params backend ngl test t/s
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 pp512 @ d8192 841.03 ± 4.60
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 tg512 @ d8192 21.78 ± 0.00
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 pp512 @ d32768 558.09 ± 1.25
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 tg512 @ d32768 19.67 ± 0.00
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 pp512 @ d65536 385.18 ± 0.66
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 tg512 @ d65536 17.85 ± 0.00
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 pp512 @ d131072 237.50 ± 0.22
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 tg512 @ d131072 14.34 ± 0.00
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 pp512 @ d163840 197.36 ± 0.18
gemma4 31B IQ4_XS - 4.25 bpw 15.23 GiB 30.70 B CUDA -1 tg512 @ d163840 13.29 ± 0.00
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 pp512 @ d8192 824.93 ± 4.40
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 tg512 @ d8192 20.84 ± 0.00
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 pp512 @ d32768 551.16 ± 1.25
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 tg512 @ d32768 18.90 ± 0.00
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 pp512 @ d65536 382.09 ± 0.65
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 tg512 @ d65536 17.22 ± 0.00
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 pp512 @ d131072 236.39 ± 0.22
gemma4 31B Q4_0 16.13 GiB 30.70 B CUDA -1 tg512 @ d131072 13.93 ± 0.00
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 pp512 @ d8192 780.59 ± 3.65
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 tg512 @ d8192 19.16 ± 0.00
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 pp512 @ d32768 531.21 ± 1.25
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 tg512 @ d32768 17.51 ± 0.00
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 pp512 @ d65536 372.39 ± 0.61
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 tg512 @ d65536 16.05 ± 0.00
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 pp512 @ d131072 230.52 ± 0.26
gemma4 31B Q4_1 17.79 GiB 30.70 B CUDA -1 tg512 @ d131072 13.14 ± 0.00

5-bit on dual GPU

Model unsloth/gemma-4-31B-it-GGUF:Q5_K_XL might be too big for context window bigger than 64k. While medium and small variants can have ~90k context window.

So I run just medium and small variants: llama-bench -hf unsloth/gemma-4-31B-it-GGUF:Q5_K_M -hf unsloth/gemma-4-31B-it-GGUF:Q5_K_S -n 256,512 -d 8192,32768,65536,95000

model size params backend ngl test t/s
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 pp512 @ d8192 731.06 ± 3.07
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg256 @ d8192 17.11 ± 0.00
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg512 @ d8192 17.09 ± 0.00
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 pp512 @ d32768 507.55 ± 1.17
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg256 @ d32768 15.82 ± 0.00
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg512 @ d32768 15.81 ± 0.00
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 pp512 @ d65536 360.52 ± 0.56
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg256 @ d65536 14.58 ± 0.00
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg512 @ d65536 14.57 ± 0.00
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 pp512 @ d95000 282.92 ± 0.45
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg256 @ d95000 13.26 ± 0.00
gemma4 31B Q5_K - Medium 20.16 GiB 30.70 B CUDA -1 tg512 @ d95000 13.26 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 pp512 @ d8192 753.69 ± 1.99
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg256 @ d8192 17.45 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg512 @ d8192 17.43 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 pp512 @ d32768 518.44 ± 1.11
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg256 @ d32768 16.11 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg512 @ d32768 16.10 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 pp512 @ d65536 366.11 ± 0.57
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg256 @ d65536 14.82 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg512 @ d65536 14.82 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 pp512 @ d95000 288.83 ± 0.37
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg256 @ d95000 13.47 ± 0.00
gemma4 31B Q5_K - Small 19.66 GiB 30.70 B CUDA -1 tg512 @ d95000 13.47 ± 0.00

Irrelevant

As mentioned above these are now not as important thanks to Gemma 4 Quantization-Aware-Training variants, where now the Q4 variant provides better perplexity than before and instead of speculating if I should run Q5 or Q4_0 or Q4_1, now the answer is simpler, the qat_Q4 should be better than these. More in the follow-up article. But still will leave it in place in case somebody will find it interesting, like the performance scaling down depending on the context size, etc.

References