Gemini 3.1 Pro vs GPT-5.2 vs Claude Opus 4.6 on Terminal-Bench 2.0

June 16, 2026 · 21 min read · Claude / GPT / Gemini

The number that matters: 68.5%

Google DeepMind’s model card for Gemini 3.1 Pro puts the model at 68.5% on Terminal-Bench 2.0 using the Terminus-2 harness. In the same table, Claude Opus 4.6 scores 65.4%, GPT-5.3-Codex scores 64.7%, and GPT-5.2 scores 54.0%, all on the Terminus-2 row (Google DeepMind).

That is the cleanest apples-to-apples row in the current public material. It says: if you run these models through the same reported Terminus-2 setup, Gemini 3.1 Pro is ahead, Claude Opus 4.6 is close, GPT-5.3-Codex is basically in the same band, and GPT-5.2 trails by a lot.

But there is a trap here. Terminal-Bench is not just a model benchmark. It is a model plus agent plus harness benchmark.

Epoch AI’s Terminal-Bench 2.0 page describes the benchmark as tasks where agents must operate inside a real terminal: understand the filesystem, use installed programs, reason about running processes, and complete tasks without being told every command. It also says scores are reported for model-agent combinations because the agent can materially change performance (Epoch AI). That one sentence should change how you read every leaderboard.

[Figure: horizontal bar chart comparing Terminal-Bench 2.0 scores under the Terminus-2 harness.]

The benchmark table developers should actually use

Here is the useful cut, restricted to numbers I could verify from primary sources.

| Model | Terminal-Bench 2.0 score | Harness / source context | API price or note |
| --- | --- | --- | --- |
| Gemini 3.1 Pro | 68.5% | Terminus-2 harness in Google model card | $2/$12 per 1M input/output tokens for prompts ≤200k; $4/$18 above 200k (Google AI) |
| Claude Opus 4.6 | 65.4% | Terminus-2 harness in Google model card; public leaderboard per Google methodology note | $5/$25 per 1M input/output tokens (Anthropic) |
| GPT-5.3-Codex | 64.7% | Terminus-2 harness in Google model card | OpenAI’s provider run reports 77.3% using Codex CLI, not the same harness (OpenAI) |
| GPT-5.2 | 54.0% | Terminus-2 harness in Google model card | $1.75/$14 per 1M input/output tokens (OpenAI) |

The ordering under Terminus-2 is straightforward: Gemini 3.1 Pro > Claude Opus 4.6 > GPT-5.3-Codex > GPT-5.2.

The bigger engineering point is less tidy. Google’s own methodology PDF says Gemini scores are self-computed, while non-Gemini model numbers are generally provider-reported unless otherwise stated. For Terminal-Bench 2.0 specifically, it says Gemini 3.1 Pro is self-computed, other models come from the public leaderboard, and results are reported both for the default Terminus-2 harness and for other best self-reported harnesses where applicable (Google DeepMind methodology PDF).

So the fair read is not “Gemini crushes everyone.” It is: Gemini leads the shared Terminus-2 comparison by 3.1 percentage points over Claude and 3.8 points over GPT-5.3-Codex. GPT-5.2 is the clear laggard in this setup.
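The gaps quoted above are simple subtraction, and it can help to keep the arithmetic next to the source numbers. The snippet below is a minimal sketch that recomputes them from the four Terminus-2 scores cited in this article; the function name `gaps_to_leader` is made up for illustration.

```python
# Terminus-2 scores (percent) as quoted in this article from Google's model card.
TERMINUS_2 = {
    "Gemini 3.1 Pro": 68.5,
    "Claude Opus 4.6": 65.4,
    "GPT-5.3-Codex": 64.7,
    "GPT-5.2": 54.0,
}


def gaps_to_leader(scores):
    """Return the top-scoring model and each other model's gap in percentage points."""
    leader = max(scores, key=scores.get)
    top = scores[leader]
    # Round to one decimal place to match the precision of the published scores.
    gaps = {m: round(top - s, 1) for m, s in scores.items() if m != leader}
    return leader, gaps


leader, gaps = gaps_to_leader(TERMINUS_2)
for model, gap in gaps.items():
    print(f"{leader} leads {model} by {gap} points")
```

These are differences in percentage points on one harness row, not relative improvements, and they say nothing about run-to-run variance.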

Why GPT-5.3-Codex has two different stories

The most confusing row is GPT-5.3-Codex.

Google’s model card gives it 64.7% on the Terminus-2 harness. One line below, the same card lists “other best self-reported harness” numbers: GPT-5.2 at 62.2% using Codex, and GPT-5.3-Codex at 77.3% using Codex (Google DeepMind). OpenAI’s own GPT-5.3-Codex announcement also reports 77.3% on Terminal-Bench 2.0, with xhigh reasoning effort, and explicitly frames the model as a coding agent model for Codex (OpenAI).

Both can be true.

A CLI coding agent is not a stateless chat completion. The harness decides how files are exposed, how commands are run, how patches are applied, how state is summarized, how often the model can recover from a bad path, and sometimes how reasoning effort is selected. If you use Codex CLI, OpenAI’s 77.3% number is relevant. If you are comparing models inside the same third-party agent harness, the 64.7% Terminus-2 number is the cleaner comparison.

That distinction maps directly to real usage:

# Same task, different agent harness can change the result
agent run --model gemini-3.1-pro-preview --harness terminus-2
agent run --model gpt-5.3-codex --harness codex-cli

If your team is building its own CLI agent, do not copy a provider’s best harness score into a spreadsheet and call it model quality. Treat it as system quality: model, tool loop, memory, retry policy, patch mechanics, sandbox, and prompt contract.
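One way to keep the two stories apart in that spreadsheet is to key every result on the (model, harness) pair rather than on the model alone. The sketch below uses only the four figures quoted in this section; the `Result` class, the harness labels, and `harness_uplift` are illustrative names, not part of any benchmark tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    harness: str   # shorthand label chosen for this sketch
    score: float   # percent on Terminal-Bench 2.0


# Figures quoted in this article from Google's model card.
RESULTS = [
    Result("GPT-5.3-Codex", "terminus-2", 64.7),
    Result("GPT-5.3-Codex", "codex", 77.3),
    Result("GPT-5.2", "terminus-2", 54.0),
    Result("GPT-5.2", "codex", 62.2),
]


def harness_uplift(results, model, shared="terminus-2", native="codex"):
    """Percentage-point difference between the native and shared harness for one model."""
    by_key = {(r.model, r.harness): r.score for r in results}
    return round(by_key[(model, native)] - by_key[(model, shared)], 1)


for name in ("GPT-5.3-Codex", "GPT-5.2"):
    print(f"{name}: {harness_uplift(RESULTS, name)} points from the harness change")
```

Looking up a score without naming the harness fails with a `KeyError`, which is the point: the data structure refuses to answer "what does the model score?" without the other half of the question.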

[Figure: flow diagram of a developer task entering an agent harness, which splits into the model, terminal tools, filesystem state, and other components.]

What Terminal-Bench 2.0 measures better than SWE-Bench

SWE-Bench is still useful, but Terminal-Bench catches a different failure mode. A model can generate a plausible patch and still be bad at living inside a shell.

Terminal-Bench tasks include things like building a Linux kernel in QEMU, configuring a Git web server, cracking a 7z hash for a provided secret, generating OpenSSL certificates, and reshaping data files. The Terminal-Bench site describes version 2.0 as 89 high-quality tasks across software engineering, machine learning, security, data science, and more (Terminal-Bench).

That matters because CLI agents fail in boring, expensive ways:

  • They forget where they are in the filesystem.
  • They run a command, ignore stderr, and patch the wrong file.
  • They pass visible tests but miss the hidden invariant.
  • They burn tokens exploring instead of forming a plan.
  • They get stuck after one failed install or one flaky test.
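Several of these failures are harness problems as much as model problems. As a rough sketch, and not a description of how Terminus-2 or any provider harness works, a tool loop can pin the working directory on every call, always hand stderr and the exit code back to the model, and cap retries. All names here (`StepResult`, `run_step`, `run_with_retries`) are hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StepResult:
    command: list
    cwd: str
    returncode: int
    stdout: str
    stderr: str

    @property
    def ok(self):
        return self.returncode == 0


def run_step(command, cwd, timeout=60):
    """Run one command in an explicit, resolved directory and keep stderr."""
    cwd = str(Path(cwd).resolve())
    proc = subprocess.run(
        command, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    return StepResult(command, cwd, proc.returncode, proc.stdout, proc.stderr)


def run_with_retries(command, cwd, max_attempts=2):
    """Retry a failing command a bounded number of times; return every attempt."""
    attempts = []
    for _ in range(max_attempts):
        result = run_step(command, cwd)
        attempts.append(result)
        if result.ok:
            break
    return attempts


# Demo: a command that writes to stderr and exits non-zero.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
for attempt in run_with_retries(failing, cwd="."):
    print(attempt.cwd, attempt.returncode, repr(attempt.stderr))
```

This only covers the first, second, and last bullets. Hidden invariants and wasted exploration need planning and verification steps that a wrapper like this cannot supply.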

On those tasks, Gemini 3.1 Pro’s 68.5% Terminus-2 score is impressive because it suggests strong command-loop behavior, not just code synthesis. Claude Opus 4.6 at 65.4% is close enough that I would not migrate a mature Claude Code workflow on benchmark delta alone. GPT-5.2 at 54.0% is the one I would avoid for hard terminal automation unless cost is the dominant constraint or you have a very strong harness around it.

Picking a model for a CLI coding agent

My practical ranking depends on what you are buying.

If you want the strongest shared-harness terminal score, start with Gemini 3.1 Pro. It leads the verified comparison row and is cheaper than Claude Opus 4.6 on standard per-token pricing below 200k prompt tokens. The catch: for large-repo prompts above 200k, Google’s posted price steps up from $2/$12 to $4/$18 per 1M input/output tokens, so long-context agent runs need budgets and cache discipline (Google AI).

If you already run a Claude-heavy workflow, Claude Opus 4.6 remains a strong choice. Anthropic released Opus 4.6 on February 5, 2026, with stronger coding, better performance on longer agentic tasks, and a 1M-token context window in beta for the developer platform (Anthropic). Its 65.4% Terminus-2 score is close to Gemini’s. The price is higher: $5 input and $25 output per million tokens in standard pricing.

If you want the best Codex-specific terminal number, GPT-5.3-Codex deserves a separate evaluation. Its provider-reported 77.3% Terminal-Bench 2.0 score is tied to OpenAI’s Codex setup, while the shared Terminus-2 row is 64.7%. That is not a contradiction. It is a warning label.

If you are looking at GPT-5.2, the case is cost and general capability, not peak terminal agency. OpenAI prices GPT-5.2 at $1.75/$14 per million input/output tokens and says it supports xhigh reasoning effort in the API (OpenAI). But on the shared Terminal-Bench 2.0 row, 54.0% is a large gap.
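The per-token prices quoted in this section can be turned into a per-request estimate. The sketch below assumes the Gemini tier is selected per request by prompt size and that the higher rate then applies to every token in that request; it also ignores caching, batch discounts, and any long-context pricing for the other models. Treat those as assumptions to check against each provider's pricing page. The function and constant names are invented for this example.

```python
# USD per 1M tokens as (input, output), as quoted in this article.
FLAT_RATES = {
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.2": (1.75, 14.00),
}
GEMINI_TIERS = {"short": (2.00, 12.00), "long": (4.00, 18.00)}
GEMINI_LONG_PROMPT_THRESHOLD = 200_000  # prompt tokens


def request_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request under the assumptions stated above."""
    if model == "Gemini 3.1 Pro":
        # Assumption: prompts above the threshold are billed entirely at the long tier.
        tier = "long" if input_tokens > GEMINI_LONG_PROMPT_THRESHOLD else "short"
        in_rate, out_rate = GEMINI_TIERS[tier]
    else:
        in_rate, out_rate = FLAT_RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


for model in ("Gemini 3.1 Pro", "Claude Opus 4.6", "GPT-5.2"):
    small = request_cost(model, 100_000, 10_000)
    large = request_cost(model, 300_000, 10_000)
    print(f"{model}: ${small:.3f} at 100k prompt, ${large:.3f} at 300k prompt")
```

An agent run is many requests, each resending growing context, so the total depends far more on how the harness manages context than on any single call.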

Bottom line

For terminal-based coding agents, I would shortlist models this way: Gemini 3.1 Pro for best shared-harness Terminal-Bench 2.0 performance, Claude Opus 4.6 for teams that value long-context coding reliability and already trust Claude workflows, GPT-5.3-Codex when the target runtime is Codex itself, and GPT-5.2 only when price or API availability matters more than terminal success rate.

The main lesson is methodological. A Terminal-Bench score is never just “the model.” For developers building CLI agents, the harness is part of the product. Track both numbers: the shared-harness score, which tells you how the model performs when it is dropped into a common third-party agent, and the provider-harness score, which tells you what the full native stack can do.

Readers who want to try these models hands-on can call Claude and other models on onehop with an OpenAI-compatible API: change one base_url, keep the rest of the client mostly the same, and compare costs against first-party routes. New accounts get $10 free credit with no card required.