Gemini 3.1 Pro vs GPT-5.2 vs Claude Opus 4.6 on Terminal-Bench 2.0
June 16, 2026 · 21 min read · Claude / GPT / Gemini

The number that matters: 68.5%
Google DeepMind’s model card puts Gemini 3.1 Pro at 68.5% on Terminal-Bench 2.0 using the Terminus-2 harness. In the same table and under the same harness, Claude Opus 4.6 scores 65.4%, GPT-5.3-Codex scores 64.7%, and GPT-5.2 scores 54.0% (Google DeepMind).
That is the cleanest apples-to-apples row in the current public material. It says: if you run these models through the same reported Terminus-2 setup, Gemini 3.1 Pro is ahead, Claude Opus 4.6 is close, GPT-5.3-Codex is basically in the same band as Claude, and GPT-5.2 trails the leader by 14.5 points.
But there is a trap here. Terminal-Bench is not just a model benchmark. It is a model plus agent plus harness benchmark.
Epoch AI’s Terminal-Bench 2.0 page describes the benchmark as tasks where agents must operate inside a real terminal: understand the filesystem, use installed programs, reason about running processes, and complete tasks without being told every command. It also says scores are reported for model-agent combinations because the agent can materially change performance (Epoch AI). That one sentence should change how you read every leaderboard.

The benchmark table developers should actually use
Here is the useful cut, restricted to numbers I could verify from primary sources.

| Model | Terminal-Bench 2.0 score | Harness / source context | API price or note |
|---|---|---|---|
| Gemini 3.1 Pro | 68.5% | Terminus-2 harness in Google model card | $2/$12 per 1M input/output tokens for prompts ≤200k; $4/$18 above 200k (Google AI) |
| Claude Opus 4.6 | 65.4% | Terminus-2 harness in Google model card; public leaderboard per Google methodology note | $5/$25 per 1M input/output tokens (Anthropic) |
| GPT-5.3-Codex | 64.7% | Terminus-2 harness in Google model card | OpenAI’s provider run reports 77.3% using Codex CLI, not the same harness (OpenAI) |
| GPT-5.2 | 54.0% | Terminus-2 harness in Google model card | $1.75/$14 per 1M input/output tokens (OpenAI) |

The ordering under Terminus-2 is straightforward: Gemini 3.1 Pro > Claude Opus 4.6 > GPT-5.3-Codex > GPT-5.2.
The bigger engineering point is less tidy. Google’s own methodology PDF says Gemini scores are self-computed, while non-Gemini model numbers are generally provider-reported unless otherwise stated. For Terminal-Bench 2.0 specifically, it says Gemini 3.1 Pro is self-computed, other models come from the public leaderboard, and results are reported both for the default Terminus-2 harness and for other best self-reported harnesses where applicable (Google DeepMind methodology PDF).
So the fair read is not “Gemini crushes everyone.” It is: Gemini leads the shared Terminus-2 comparison by 3.1 percentage points over Claude and 3.8 points over GPT-5.3-Codex. GPT-5.2 is the clear laggard in this setup.
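The gaps quoted above are plain subtraction on the Terminus-2 row. A minimal Python sketch reproduces them; the names `terminus2_scores` and `gaps_to_leader` are chosen here for illustration and do not come from any of the cited sources.

```python
# Terminus-2 scores (percent) as quoted in this article from the
# Google DeepMind model card.
terminus2_scores = {
    "Gemini 3.1 Pro": 68.5,
    "Claude Opus 4.6": 65.4,
    "GPT-5.3-Codex": 64.7,
    "GPT-5.2": 54.0,
}


def gaps_to_leader(scores):
    """Return each model's gap to the top score, in percentage points."""
    top = max(scores.values())
    # Round to one decimal so float noise does not leak into the report.
    return {model: round(top - score, 1) for model, score in scores.items()}


gaps = gaps_to_leader(terminus2_scores)
for model, gap in gaps.items():
    print(f"{model}: {gap:.1f} points behind the leader")
```

The top three models sit within 3.8 points of each other, while GPT-5.2 is 14.5 points back.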
Why GPT-5.3-Codex has two different stories
The most confusing row is GPT-5.3-Codex.
Google’s model card gives it 64.7% on the Terminus-2 harness. One line below, the same card lists “other best self-reported harness” numbers: GPT-5.2 at 62.2% using Codex, and GPT-5.3-Codex at 77.3% using Codex (Google DeepMind). OpenAI’s own GPT-5.3-Codex announcement also reports 77.3% on Terminal-Bench 2.0, with xhigh reasoning effort, and explicitly frames the model as an agentic coding model for Codex (OpenAI).
Both can be true.
A CLI coding agent is not a stateless chat completion. The harness decides how files are exposed, how commands are run, how patches are applied, how state is summarized, how often the model can recover from a bad path, and sometimes how reasoning effort is selected. If you use Codex CLI, OpenAI’s 77.3% number is relevant. If you are comparing models inside the same third-party agent harness, the 64.7% Terminus-2 number is the cleaner comparison.
That distinction maps directly to real usage:
```bash
# Same task, different agent harness: the result can change.
# `agent` is a placeholder for whatever runner you use, not a specific tool.
agent run --model gemini-3.1-pro-preview --harness terminus-2
agent run --model gpt-5.3-codex --harness codex-cli
```
If your team is building its own CLI agent, do not copy a provider’s best harness score into a spreadsheet and call it model quality. Treat it as system quality: model, tool loop, memory, retry policy, patch mechanics, sandbox, and prompt contract.
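One way to keep that discipline in a tracking script is to store every score with its harness and refuse to compare across harnesses. The following is a sketch under stated assumptions: `ScoreRecord`, `compare`, `ranking`, and the harness labels `"terminus-2"` and `"codex"` are names invented here, and the numbers are the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    model: str
    harness: str      # label chosen for this sketch, e.g. "terminus-2"
    score_pct: float  # Terminal-Bench 2.0 score in percent


RECORDS = [
    ScoreRecord("Gemini 3.1 Pro", "terminus-2", 68.5),
    ScoreRecord("Claude Opus 4.6", "terminus-2", 65.4),
    ScoreRecord("GPT-5.3-Codex", "terminus-2", 64.7),
    ScoreRecord("GPT-5.2", "terminus-2", 54.0),
    ScoreRecord("GPT-5.3-Codex", "codex", 77.3),
    ScoreRecord("GPT-5.2", "codex", 62.2),
]


def compare(a: ScoreRecord, b: ScoreRecord) -> float:
    """Return a minus b in percentage points, only within one harness."""
    if a.harness != b.harness:
        raise ValueError(
            f"cannot compare {a.model} on {a.harness} "
            f"with {b.model} on {b.harness}"
        )
    return round(a.score_pct - b.score_pct, 1)


def ranking(records, harness):
    """Rank only the records that share the given harness."""
    same = [r for r in records if r.harness == harness]
    return sorted(same, key=lambda r: r.score_pct, reverse=True)


for record in ranking(RECORDS, "terminus-2"):
    print(f"{record.model}: {record.score_pct}%")
```

Calling `compare` on the two GPT-5.3-Codex rows raises `ValueError`, which is the point: the 64.7% and 77.3% figures describe different systems.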

What Terminal-Bench 2.0 measures better than SWE-Bench
SWE-Bench is still useful, but Terminal-Bench catches a different failure mode. A model can generate a plausible patch and still be bad at living inside a shell.
Terminal-Bench tasks include things like building a Linux kernel in QEMU, configuring a Git web server, cracking a 7z hash for a provided secret, generating OpenSSL certificates, and reshaping data files. The Terminal-Bench site describes version 2.0 as 89 high-quality tasks across software engineering, machine learning, security, data science, and more (Terminal-Bench).
That matters because CLI agents fail in boring, expensive ways:
- They forget where they are in the filesystem.
- They run a command, ignore stderr, and patch the wrong file.
- They pass visible tests but miss the hidden invariant.
- They burn tokens exploring instead of forming a plan.
- They get stuck after one failed install or one flaky test.
On those tasks, Gemini 3.1 Pro’s 68.5% Terminus-2 score is impressive because it suggests strong command-loop behavior, not just code synthesis. Claude Opus 4.6 at 65.4% is close enough that I would not migrate a mature Claude Code workflow on benchmark delta alone. GPT-5.2 at 54.0% is the one I would avoid for hard terminal automation unless cost is the dominant constraint or you have a very strong harness around it.
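Two of the failure modes listed above, losing track of the working directory and ignoring stderr, are things a harness can guard against regardless of model. Here is a minimal Python sketch using only the standard-library `subprocess` module. The helper names `run_step` and `observation` are hypothetical, the commands assume a POSIX shell, and nothing here describes how Terminus-2 or Codex CLI work internally.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class StepResult:
    command: str
    cwd: str
    returncode: int
    stdout: str
    stderr: str

    @property
    def ok(self) -> bool:
        return self.returncode == 0


def run_step(command: str, cwd: str, timeout_s: float = 60.0) -> StepResult:
    """Run one shell command with an explicit working directory."""
    try:
        proc = subprocess.run(
            command,
            shell=True,           # the agent emits shell strings
            cwd=cwd,              # never rely on an implicit directory
            capture_output=True,  # keep stdout and stderr separately
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report a timeout as a failed step instead of crashing the loop.
        return StepResult(command, cwd, -1, "", f"timed out after {timeout_s}s")
    return StepResult(command, cwd, proc.returncode, proc.stdout, proc.stderr)


def observation(result: StepResult) -> str:
    """Format what the model sees: cwd, exit code, and stderr are always shown."""
    return (
        f"cwd: {result.cwd}\n"
        f"$ {result.command}\n"
        f"exit: {result.returncode}\n"
        f"stdout:\n{result.stdout}"
        f"stderr:\n{result.stderr}"
    )


failed = run_step("echo oops >&2; exit 3", cwd=".")
print(observation(failed))
```

Feeding the exit code and stderr back on every step makes it harder for the model to treat a failed command as a success and patch the wrong file.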
Picking a model for a CLI coding agent
My practical ranking depends on what you are buying.
If you want the strongest shared-harness terminal score, start with Gemini 3.1 Pro. It leads the verified comparison row and is cheaper than Claude Opus 4.6 on standard per-token pricing below 200k prompt tokens. The catch: for large-repo prompts above 200k, Google’s posted price steps up from $2/$12 to $4/$18 per 1M input/output tokens, so long-context agent runs need budgets and cache discipline (Google AI).
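To see what the price step means for a single request, here is a rough Python estimate using the rates quoted above. It is a sketch under two assumptions that you should check against Google's current pricing page: that the tier is selected by prompt size, and that the higher rates apply to the whole request once the prompt exceeds 200k tokens. It ignores caching discounts entirely, and the function name is made up for this example.

```python
TIER_THRESHOLD_TOKENS = 200_000


def gemini_31_pro_cost_usd(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD from the prices quoted in this article.

    Assumption: prompts above 200k tokens move the whole request
    (input and output) to the higher rates.
    """
    if prompt_tokens <= TIER_THRESHOLD_TOKENS:
        input_rate, output_rate = 2.00, 12.00   # USD per 1M tokens
    else:
        input_rate, output_rate = 4.00, 18.00   # USD per 1M tokens
    return (prompt_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical agent turns: same output size, prompt on either side of the step.
small = gemini_31_pro_cost_usd(150_000, 8_000)
large = gemini_31_pro_cost_usd(250_000, 8_000)
print(f"150k-token prompt: ${small:.3f}")
print(f"250k-token prompt: ${large:.3f}")
```

Under these assumptions, an agent loop that lets its context drift past the threshold pays the higher rate on every subsequent turn, which is why trimming or summarizing context before it crosses 200k tokens can matter more than the headline price.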
If you already run a Claude-heavy workflow, Claude Opus 4.6 remains a strong choice. Anthropic released Opus 4.6 on February 5, 2026, with stronger coding, better performance on longer agentic tasks, and a 1M-token context window in beta on the developer platform (Anthropic). Its 65.4% Terminus-2 score is close to Gemini's 68.5%. The price is higher: $5 input and $25 output per million tokens in standard pricing.
If you want the best Codex-specific terminal number, GPT-5.3-Codex deserves a separate evaluation. Its provider-reported 77.3% Terminal-Bench 2.0 score is tied to OpenAI’s Codex setup, while the shared Terminus-2 row is 64.7%. That is not a contradiction. It is a warning label.
If you are looking at GPT-5.2, the case is cost and general capability, not peak terminal agency. OpenAI prices GPT-5.2 at $1.75/$14 per million input/output tokens and says it supports xhigh reasoning effort in the API (OpenAI). But on the shared Terminal-Bench 2.0 row, 54.0% leaves it 14.5 points behind Gemini 3.1 Pro and more than 10 points behind the other two models.
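One rough way to weigh price against success rate is expected cost per solved task: cost per attempt divided by pass rate. The Python sketch below uses the prices and Terminus-2 scores quoted in this article and rests on strong assumptions, all of them mine: every model uses the same number of tokens per attempt, the benchmark score stands in for your own pass rate, failed attempts are simply retried independently, and prompts stay at or below 200k tokens. GPT-5.3-Codex is left out because no API price is quoted for it here.

```python
# USD per 1M input/output tokens and Terminus-2 pass rates from this article.
MODELS = {
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00, "pass_rate": 0.685},
    "Claude Opus 4.6": {"input": 5.00, "output": 25.00, "pass_rate": 0.654},
    "GPT-5.2": {"input": 1.75, "output": 14.00, "pass_rate": 0.540},
}


def cost_per_attempt(model, input_tokens, output_tokens):
    """Token cost of one attempt in USD."""
    return (
        input_tokens * model["input"] + output_tokens * model["output"]
    ) / 1_000_000


def cost_per_solved_task(model, input_tokens, output_tokens):
    """Expected spend per solved task under the independent-retry assumption."""
    return cost_per_attempt(model, input_tokens, output_tokens) / model["pass_rate"]


# Hypothetical workload: 100k input and 10k output tokens per attempt.
for name, model in MODELS.items():
    per_solve = cost_per_solved_task(model, 100_000, 10_000)
    print(f"{name}: ${per_solve:.2f} per solved task")
```

For this made-up workload, GPT-5.2 and Gemini 3.1 Pro cost almost the same per attempt, so the lower pass rate makes GPT-5.2 the more expensive of the two per solved task. Different token mixes or real-world pass rates can change that, so treat this as a template for your own numbers rather than a result.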
Bottom line
For terminal-based coding agents, I would shortlist models this way: Gemini 3.1 Pro for best shared-harness Terminal-Bench 2.0 performance, Claude Opus 4.6 for teams that value long-context coding reliability and already trust Claude workflows, GPT-5.3-Codex when the target runtime is Codex itself, and GPT-5.2 only when price or API availability matters more than terminal success rate.
The main lesson is methodological. A Terminal-Bench score is never just “the model.” For developers building CLI agents, the harness is part of the product. Track both numbers: the shared-harness score that tells you raw portability, and the provider-harness score that tells you what the full native stack can do.
Readers who want to try these models hands-on can call Claude and other models on onehop with an OpenAI-compatible API: change one base_url, keep the rest of the client mostly the same, and compare costs against first-party routes. New accounts get $10 free credit with no card required.