Can IBM’s Granite 4.1 code? I ran it against other local models to find out

Full transparency: Except for some initial prompting, most of this blog and the research behind it was AI generated. I did check everything manually though. While I tried to include all relevant files in this repo, reproducibility may be hampered by some of my older Claude contexts leaking into this process (e.g. about how opinionated I can be).

Inspiration came from (https://adrianco.medium.com/how-reliable-fast-and-expensive-is-each-version-of-claude-code-sonnet-through-opus-4-8-fast-272d74ffc869)[Adrian Cockcroft’s retort]. Consider this like a ‘hello world’ version that that.

Definitely reach out to comment.

– Mostly genAI text follows –

I’ve been curious about how the newer IBM Granite models hold up for actual coding tasks, not just benchmarks designed to flatter them. So I set up a simple test on my M1 Pro MacBook: one identical prompt, five local models, six automated tests. Here’s what happened.

Setup

Hardware: MacBook Pro 14", M1 Pro, 32 GB unified memory. Ollama running locally, all models at default quantization.

The lineup, smallest to largest:

Model Size Avg tok/s
granite4:latest 2.1 GB 49.6
mistral:latest 4.4 GB 25.9
deepseek-r1:8b 5.2 GB 20.0
granite4.1:8b 5.3 GB 17.2
devstral-small-2 15 GB — (all runs timed out)

I initially planned to include gemma3:27b and granite4.1:30b, but at 0.8 and 0.1 tok/s respectively, those aren’t models you run — they’re meditation exercises. Excluded before the benchmark started.

devstral-small-2 (Mistral’s new coding model) is a different story. It downloaded fine, but at 15 GB it leaves almost no breathing room on a 32 GB machine. All three of its runs hit the 5-minute timeout. I’m treating it as a DNF.

That leaves four models that actually produced results.

The task

One-shot prompt: write transformer.py, a Python CLI that converts between JSON and YAML. Standard library plus PyYAML. No explanation, just the code.

Then I run six automated tests:

Test What it checks
T1 JSON → YAML, flat object
T2 YAML → JSON, flat object
T3 JSON → YAML, nested with list
T4 JSON → YAML with --indent flag
T5 YAML → JSON with --indent flag
T6 Unknown format → non-zero exit code

Each model ran three times. I’m reporting averages across valid runs (runs that completed within 5 minutes).

Results

Model Tok/s Avg score Perfect runs
granite4:latest 49.6 3.7/6 0/3
mistral:latest 25.9 3.0/6 0/3
deepseek-r1:8b 20.0 6.0/6 1/1
granite4.1:8b 17.2 6.0/6 3/3
devstral-small-2

Test matrix

Model T1 T2 T3 T4 T5 T6
granite4:latest 2/3
mistral:latest 2/3 2/3 2/3
deepseek-r1:8b
granite4.1:8b
devstral-small-2

The clear winner is deepseek-r1:8b — average 6.0/6, 1 perfect runs out of 1. The chain-of-thought model wins on quality, when it finishes. Reliability is another question.

One pattern stands out: T2 and T5 (YAML→JSON) failed consistently for granite4:latest, mistral:latest. JSON→YAML they handle fine. YAML→JSON they don’t. My guess: the training data has far more JSON→YAML examples than the reverse. It’s a one-line Python change to swap the direction, but somehow these models get it wrong.

granite4.1:8b scored +2.3 points higher than granite4:latest — which is what you’d hope for from a newer, larger model. The cost: it runs 2.9× slower (17.2 vs 49.6 tok/s). Whether that trade-off makes sense depends on whether you’re batch processing or waiting for a response.

deepseek-r1:8b was the wild card. Its chain-of-thought reasoning is visible in the output, and when it works, the code is correct — 6/6 on the 1 run(s) that completed. But 2 of 3 runs hit the 5-minute timeout. The thinking process apparently sometimes goes off the rails and never stops. That’s a problem if you’re running automated benchmarks. Or, you know, want an answer today.

devstral-small-2 (15 GB) timed out on all three attempts with a 5-minute cap. I’m not surprised — 15 GB leaves very little headroom on a 32 GB machine once the OS and Ollama runtime take their cut. It might finish eventually, but “eventually” isn’t useful. Excluded from results.

What the speed numbers actually mean

granite4:latest at 49.6 tok/s is fast enough that you barely notice it’s generating. granite4.1:8b at 17.2 tok/s feels noticeably slower — you watch the text appear. For interactive use, that matters. For batch processing (like this benchmark), it doesn’t.

The anomaly worth noting: granite4.1:8b is slower than deepseek-r1:8b (17.2 vs 20.0 tok/s), despite being a similar size. IBM’s quantization choices for the 4.1 series seem to trade speed for something — quality, presumably, though deepseek-r1 kept pace on score.

What this doesn’t tell you

A few honest caveats:

  • One task, one prompt. JSON/YAML transformation is a clean, well-defined problem. Models that struggle here might shine on something messier, and vice versa.
  • No iteration. Real coding involves debugging, follow-up questions, reading error messages. This tests one-shot instruction following only.
  • Default quantization throughout. The same model at different quantization levels can behave quite differently. I didn’t explore that.
  • No frontier comparison. I’m not claiming granite4.1:8b beats GPT-4 or Claude. I’m asking what runs usably on 32 GB of unified memory.

Verdict

For one-shot coding tasks on an M1 Pro MacBook:

  • granite4.1:8b is the one to use if you want reliability. Slower than the baseline, but it actually writes correct code.
  • granite4:latest is fine if speed matters more than correctness — good enough on JSON→YAML, falls apart on the reverse.
  • deepseek-r1:8b is interesting but flaky. When it completes, the code is correct. Whether it completes is another question.
  • mistral:latest underperformed expectations. The YAML→JSON blind spot is a real limitation.
  • devstral-small-2: maybe on a 64 GB machine. Not here.

The gap between granite4:latest and granite4.1:8b on this specific task is significant enough to justify the speed trade-off — if you’re doing code generation rather than quick lookups. Whether IBM’s 4.1 series holds that advantage across a broader task set, I don’t know. One benchmark, one task, one machine. Take it accordingly.