GPT-5 vs Gemini 2.5 Pro vs Claude Opus 4 on Aider Polyglot Coding

June 17, 2026 · 20 min read · Claude / GPT / Gemini

The Scoreboard Has a Clear Winner

GPT-5 high is sitting at 88.0% on Aider Polyglot. That is 198 solved cases out of 225, and it is not a tiny leaderboard wobble. On the same benchmark, Gemini 2.5 Pro Preview 06-05 with 32k thinking lands at 83.1%, while Claude Opus 4 with 32k thinking lands at 72.0%, according to Aider’s official Polyglot leaderboard (Aider).

That gap matters because Aider Polyglot is not a toy “write a Fibonacci function” test. Aider describes it as 225 Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust (Aider). The model has to edit code, produce a valid diff, and get tests passing. This is much closer to the daily grind of agentic coding than a single-shot multiple-choice benchmark.

The headline: GPT-5 wins on raw pass rate and cost per successful fix. Gemini is close on correctness and much better at output format discipline. Claude Opus 4 looks expensive and behind on this particular benchmark, despite Anthropic launching it as a top coding model in May 2025 (Anthropic).

[Figure: horizontal bar chart comparing Aider Polyglot pass rates, led by GPT-5 high at 88.0%.]

Raw Results: Pass Rate, Cost, Format Reliability

Here is the compact view. These are Aider run results, not vendor marketing numbers.

| Model | Aider run date | Pass rate | Solved / 225 | Cost per run | Cost per solved case | Correct edit format | Edit format |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5 high | 2025-08-23 | 88.0% | 198 | $29.08 | $0.147 | 91.6% | diff |
| Gemini 2.5 Pro Preview 06-05, 32k thinking | 2025-06-06 | 83.1% | 187 | $49.88 | $0.267 | 99.6% | diff-fenced |
| Claude Opus 4 20250514, 32k thinking | 2025-05-25 | 72.0% | 162 | $65.75 | $0.406 | 97.3% | diff |

The pass-rate deltas are simple:

  • GPT-5 beats Gemini by 4.9 percentage points, or 11 more solved tasks.
  • GPT-5 beats Claude Opus 4 by 16.0 points, or 36 more solved tasks.
  • Gemini beats Claude Opus 4 by 11.1 points, or 25 more solved tasks.

The cost deltas are sharper. GPT-5’s run cost is about 42% lower than Gemini’s and 56% lower than Claude’s. Claude costs about 2.26x GPT-5 per run while solving 36 fewer cases.
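The derived figures are easy to recompute. Here is a minimal sketch that takes only the solved counts and run costs from the table; the function and variable names are my own and are not part of Aider.

```python
# Figures copied from the results table (Aider Polyglot, 225 cases).
TOTAL_CASES = 225

runs = {
    "GPT-5 high": {"solved": 198, "cost_usd": 29.08},
    "Gemini 2.5 Pro Preview 06-05": {"solved": 187, "cost_usd": 49.88},
    "Claude Opus 4": {"solved": 162, "cost_usd": 65.75},
}


def pass_rate(run):
    """Percentage of the 225 cases solved."""
    return 100 * run["solved"] / TOTAL_CASES


def cost_per_solved(run):
    """Run cost divided by solved cases, in dollars."""
    return run["cost_usd"] / run["solved"]


def cost_saving_pct(cheaper, pricier):
    """How much lower the cheaper run's cost is, as a percentage of the pricier run."""
    return 100 * (1 - cheaper["cost_usd"] / pricier["cost_usd"])


for name, run in runs.items():
    print(f"{name}: {pass_rate(run):.1f}% pass, ${cost_per_solved(run):.3f} per solved case")

gpt5 = runs["GPT-5 high"]
for other in ("Gemini 2.5 Pro Preview 06-05", "Claude Opus 4"):
    print(f"GPT-5 run cost is {cost_saving_pct(gpt5, runs[other]):.0f}% lower than {other}")
```

Running it reproduces the 88.0%, 83.1%, and 72.0% pass rates, the per-solved-case costs in the table, and the 42% and 56% run-cost savings.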

That is the uncomfortable part for Claude here. Opus 4 is not merely behind; it is behind while costing more.

Why the Cost Gap Is So Large

List price explains Claude’s gap. Token volume explains Gemini’s.

OpenAI priced GPT-5 at $1.25 per million input tokens and $10 per million output tokens when it launched the API model family (OpenAI). Google’s current Gemini 2.5 Pro standard pricing is also $1.25 input and $10 output per million tokens for prompts up to 200k tokens, with higher rates above 200k (Google AI). Anthropic’s Claude Opus 4 was priced at $15 input and $75 output per million tokens, and Anthropic’s current pricing page now marks Claude Opus 4 as retired except on Vertex AI while still listing those historical rates (Anthropic Docs).

Aider’s token usage lines up with that story. The Gemini run used about 2.72M prompt tokens and 4.65M completion tokens, which maps almost exactly to the $49.88 reported run cost at $1.25/$10 pricing. The Claude run used fewer completion tokens, about 363k, but Opus 4’s $75/M output price still pushed the run to $65.75.

GPT-5 is the interesting case. It used about 2.68M prompt tokens and 2.62M completion tokens. It solved more cases than Gemini while emitting about 2.0M fewer completion tokens. OpenAI also said GPT-5 scored 88% on Aider Polyglot and described that as a new record in its developer launch post (OpenAI).

For developers, this is the part to care about: benchmark cost is not just list price. It is list price multiplied by the model’s tendency to think, retry, explain, and emit large diffs.
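To make that multiplication concrete, here is a rough sketch that estimates run cost from the approximate token counts and list prices quoted above. It assumes flat per-token pricing and nothing else; the `PRICES` keys and the function name are illustrative, not an official API.

```python
# USD per million tokens, as quoted in the pricing paragraph.
PRICES = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
}


def estimate_run_cost(prompt_tokens_m, completion_tokens_m, price):
    """Estimate cost in USD from token counts expressed in millions."""
    return prompt_tokens_m * price["input"] + completion_tokens_m * price["output"]


# Approximate token counts from the Aider runs discussed above.
gemini = estimate_run_cost(2.72, 4.65, PRICES["gemini-2.5-pro"])
gpt5 = estimate_run_cost(2.68, 2.62, PRICES["gpt-5"])

print(f"Gemini estimate: ${gemini:.2f} (reported $49.88)")
print(f"GPT-5 estimate:  ${gpt5:.2f} (reported $29.08)")
```

The Gemini estimate comes out at about $49.90 and the GPT-5 estimate at about $29.55. Both are close to the reported costs but not exact, which is expected: the token counts are rounded and the sketch models nothing beyond flat list price. At identical prices, nearly the whole difference between the two runs comes from completion tokens.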

Scatter plot with x-axis cost per Aider run and y-axis pass rate; GPT-5 high in upper-left at $29.08 and 88.0%, Gemini 2

Edit Format: Gemini Is the Neatest Operator

GPT-5 wins the benchmark, but Gemini wins the format-discipline column.

Aider reports Gemini 2.5 Pro Preview 06-05 at 99.6% correct edit format, with only one malformed response. Claude Opus 4 is also strong at 97.3%. GPT-5 high is lower at 91.6%, with 22 malformed responses across the run (Aider).

That sounds like a small implementation detail until you run agents in a real repo. Bad edit format means wasted turns, failed patch application, or the human having to rescue the tool. If your workflow is “model proposes diff, CI checks, agent iterates,” format reliability is part of intelligence.
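A pipeline can catch the crudest format failures before trying to apply a patch. The sketch below checks that search/replace markers appear in the right order. It is a deliberately simplified illustration, not Aider's actual parser, and the function name is hypothetical.

```python
# Markers used by search/replace style edit blocks.
# Simplified illustration only; this is not Aider's real parsing logic.
SEARCH = "<<<<<<< SEARCH"
DIVIDER = "======="
REPLACE = ">>>>>>> REPLACE"


def is_well_formed(edit_text):
    """Return True if the text holds at least one complete block with markers in order."""
    expected = SEARCH
    saw_block = False
    for line in edit_text.splitlines():
        marker = line.strip()
        if marker == SEARCH:
            if expected != SEARCH:
                return False  # a new block started before the last one closed
            expected = DIVIDER
        elif marker == DIVIDER:
            if expected == DIVIDER:
                expected = REPLACE
        elif marker == REPLACE:
            if expected != REPLACE:
                return False  # closing marker without a divider before it
            expected = SEARCH
            saw_block = True
    return saw_block and expected == SEARCH


good = "app.py\n<<<<<<< SEARCH\nx = 1\n=======\nx = 2\n>>>>>>> REPLACE\n"
bad = "app.py\n<<<<<<< SEARCH\nx = 1\nx = 2\n>>>>>>> REPLACE\n"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check like this only says whether a response is parseable. It says nothing about whether the search text matches the file or whether the patch is correct.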

Still, GPT-5’s lower format score did not stop it from winning. That tells us two things. First, it can recover or produce enough correct edits despite more malformed attempts. Second, pass rate is ultimately a harsher metric than neatness. A perfectly formatted wrong patch is still wrong.

A reasonable read: use GPT-5 when the hardest part is solving the bug. Use Gemini when your pipeline is highly sensitive to strict diff formatting and you can tolerate a small drop in solved cases.

Model Versions and Date Traps

There is one trap in this comparison: these are historical benchmark entries, not a statement about every current production endpoint on June 17, 2026.

Gemini 2.5 Pro Preview 06-05 was released on June 5, 2025, with adaptive thinking. Google released stable gemini-2.5-pro on June 17, 2025, and its changelog says the old preview IDs were later shut down or redirected (Google AI changelog). Claude Opus 4 launched on May 22, 2025, and Anthropic’s pricing docs now mark it as retired except on Vertex AI (Anthropic Docs). GPT-5 launched last, on August 7, 2025, and OpenAI’s API post lists gpt-5, gpt-5-mini, and gpt-5-nano as the API sizes at launch (OpenAI).

So the fair comparison is: “How did these named models perform in Aider’s recorded runs?” It is not: “Which vendor’s newest model family is best today?”

That distinction matters for procurement and for engineering decisions. If your team is choosing a coding model today, rerun a small internal eval on your own stack. Include your repo size, your test latency, your preferred edit format, and your actual retry policy.

A minimal harness can be boring and useful:

aider --model openai/gpt-5 --reasoning-effort high
aider --model gemini/gemini-2.5-pro --thinking-tokens 32k
aider --model anthropic/claude-opus-4-20250514 --thinking-tokens 32k

Then measure solved tickets, failed patch applications, CI passes, wall time, and dollars per merged fix.
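Here is one way those measurements could be rolled up, as a sketch only. The `TicketResult` record and all of its field names are hypothetical; adapt them to whatever your own eval logs actually capture.

```python
from dataclasses import dataclass


@dataclass
class TicketResult:
    """One ticket attempted by one model. All field names are hypothetical."""
    model: str
    patch_applied: bool
    ci_passed: bool
    merged: bool
    wall_seconds: float
    cost_usd: float


def summarize(results):
    """Aggregate per-model counts, wall time, and dollars per merged fix."""
    summary = {}
    for r in results:
        s = summary.setdefault(r.model, {
            "tickets": 0, "failed_patches": 0, "ci_passes": 0,
            "merged": 0, "wall_seconds": 0.0, "cost_usd": 0.0,
        })
        s["tickets"] += 1
        s["failed_patches"] += not r.patch_applied
        s["ci_passes"] += r.ci_passed
        s["merged"] += r.merged
        s["wall_seconds"] += r.wall_seconds
        s["cost_usd"] += r.cost_usd
    for s in summary.values():
        # Leave the ratio undefined when nothing merged, rather than dividing by zero.
        s["usd_per_merged_fix"] = s["cost_usd"] / s["merged"] if s["merged"] else None
    return summary


# Placeholder records for illustration; these are not benchmark data.
sample = [
    TicketResult("model-a", True, True, True, 120.0, 0.40),
    TicketResult("model-a", False, False, False, 90.0, 0.20),
]
print(summarize(sample))
```

Dividing total spend by merged fixes, rather than by attempts, keeps failed patches and wasted retries inside the price of each success.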

What I’d Choose

If I had to pick one model from this Aider snapshot for a coding agent, I would start with GPT-5 high. It has the best pass rate, the lowest run cost among the three, and the best cost per solved case. The 91.6% correct edit format is a blemish, but not enough to erase an 88.0% solve rate.

Gemini 2.5 Pro is the strong second choice. Its 83.1% pass rate is close enough that teams with strict patch-format automation should take it seriously. The 99.6% correct edit-format rate is excellent. The downside is cost in this run: $49.88 is a lot to pay for 11 fewer solved cases than GPT-5.

Claude Opus 4 is the hard sell here. Anthropic positioned Opus 4 as a serious coding and agent model, and its release post claimed strong results on other coding benchmarks, including SWE-bench and Terminal-bench (Anthropic). But on Aider Polyglot, this particular Opus 4 run is both weaker and more expensive. Unless your internal workload shows Claude-specific strengths such as codebase taste, long-context collaboration, or fewer destructive edits, the Aider data does not justify choosing Opus 4 over GPT-5 or Gemini for this job.

The practical rule: do not buy “best coding model” as a brand claim. Buy passed tests per dollar, with edit reliability as a guardrail.
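That rule can be written down as a small ranking sketch: filter out runs below a format threshold, then sort by solved cases per dollar. The data comes from the results table; the two thresholds are arbitrary values chosen for illustration, not recommendations.

```python
# Solved cases, run cost, and correct-edit-format rate from the results table.
runs = [
    {"model": "GPT-5 high", "solved": 198, "cost_usd": 29.08, "format_ok": 0.916},
    {"model": "Gemini 2.5 Pro Preview 06-05", "solved": 187, "cost_usd": 49.88, "format_ok": 0.996},
    {"model": "Claude Opus 4", "solved": 162, "cost_usd": 65.75, "format_ok": 0.973},
]


def rank(runs, min_format_ok):
    """Drop runs below the format guardrail, then sort by solved cases per dollar."""
    eligible = [r for r in runs if r["format_ok"] >= min_format_ok]
    return sorted(eligible, key=lambda r: r["solved"] / r["cost_usd"], reverse=True)


# Hypothetical guardrail thresholds, chosen only to show how the ranking shifts.
for threshold in (0.90, 0.95):
    order = [r["model"] for r in rank(runs, threshold)]
    print(f"guardrail {threshold:.0%}: {order}")
```

With a 90% guardrail GPT-5 high ranks first. Raise it to 95% and GPT-5 drops out, leaving Gemini ahead of Claude Opus 4. Where you set the guardrail depends on how badly a malformed edit hurts your pipeline.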