Use Groq GPT-OSS 120B with the OpenAI SDK: Base URL, Pricing, and Caching
June 17, 2026 · 24 min read · GPT / Claude / Gemini

Groq’s OpenAI-compatible endpoint is a real one-line swap: set base_url to https://api.groq.com/openai/v1 and keep using the OpenAI SDK. As of June 17, 2026, Groq lists openai/gpt-oss-120b at $0.15 per 1M uncached input tokens, $0.075 per 1M cached input tokens, and $0.60 per 1M output tokens on its pricing page (Groq Pricing).
That is the useful part. The trap is assuming “OpenAI-compatible” means “identical behavior and identical billing.” It does not. You still need to choose the Groq model ID, watch cache hits, avoid unsupported OpenAI parameters, and count built-in tool calls separately.

What You Are Actually Switching
Groq’s docs say its API is “mostly compatible with OpenAI’s client libraries” and show the exact OpenAI client configuration: pass your Groq key and set base_url="https://api.groq.com/openai/v1" (Groq OpenAI Compatibility).
Install the OpenAI Python SDK:
pip install openai
export GROQ_API_KEY="gsk_..."
Then call GPT-OSS 120B through Groq:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a concise senior backend engineer."},
        {"role": "user", "content": "Write a PostgreSQL index plan for a slow user_events query."},
    ],
)
print(response.choices[0].message.content)
print(response.usage)
The important difference is the model. On OpenAI’s own platform you might use an OpenAI-hosted model name. On Groq, GPT-OSS is addressed as openai/gpt-oss-120b or openai/gpt-oss-20b.
OpenAI released gpt-oss-120b and gpt-oss-20b on August 5, 2025 as open-weight reasoning models under Apache 2.0 (OpenAI). OpenAI says gpt-oss-120b has 117B total parameters, 5.1B active parameters per token, 128 experts, 4 active experts per token, and native support for context lengths up to 128k (OpenAI). Groq’s model page lists the hosted openai/gpt-oss-120b context window as 131,072 tokens and max output tokens as 65,536 (Groq model card).
Pick 120B or 20B
Use 120B when quality matters more than raw cost: agent planning, harder coding tasks, complex extraction, or multi-step reasoning. Use 20B when you need cheaper throughput for routing, summarization, classification, short assistants, or high-volume background jobs.
| Groq model ID | Listed speed | Uncached input | Cached input | Output |
|---|---|---|---|---|
| openai/gpt-oss-120b | 500 TPS | $0.15 / 1M | $0.075 / 1M | $0.60 / 1M |
| openai/gpt-oss-20b | 1,000 TPS | $0.075 / 1M | $0.0375 / 1M | $0.30 / 1M |
Those prices come from Groq’s current pricing table and prompt caching table (Groq Pricing). The 20B model is exactly half the listed token price of 120B. That makes it a good default for “try it first” workflows. If the answer quality is not good enough, promote that path to 120B.
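To see what the halved prices mean per request, the table can be encoded as a small lookup. This is only a sketch: `PRICES_PER_1M` and `estimate_cost` are names invented here, and the figures are the listed prices copied from the table, so re-check them against Groq's pricing page before budgeting.

```python
# Listed prices in USD per 1M tokens, copied from the table above.
# PRICES_PER_1M and estimate_cost are illustrative names, not part of any SDK.
PRICES_PER_1M = {
    "openai/gpt-oss-120b": {"input": 0.15, "cached_input": 0.075, "output": 0.60},
    "openai/gpt-oss-20b": {"input": 0.075, "cached_input": 0.0375, "output": 0.30},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimate the token cost in USD for one request at the listed prices."""
    price = PRICES_PER_1M[model]
    uncached = max(prompt_tokens - cached_tokens, 0)
    return (
        uncached * price["input"]
        + cached_tokens * price["cached_input"]
        + completion_tokens * price["output"]
    ) / 1_000_000


if __name__ == "__main__":
    # Same hypothetical request on both models: 4,000 prompt tokens, 800 output tokens.
    for model in PRICES_PER_1M:
        print(model, round(estimate_cost(model, 4_000, 800), 6))
```

For that hypothetical request the estimate is about $0.00108 on 120B and about $0.00054 on 20B, which is the "exactly half" relationship in concrete terms.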

Here is a tiny switch you can keep in config:
import os
from openai import OpenAI

MODEL = os.getenv("GROQ_MODEL", "openai/gpt-oss-20b")

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer with practical developer steps."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(ask("Give me a minimal Redis rate limiter design."))
Run it with 20B:
GROQ_MODEL=openai/gpt-oss-20b python app.py
Run it with 120B:
GROQ_MODEL=openai/gpt-oss-120b python app.py
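The "try 20B first, promote to 120B" idea can also be applied per request instead of per deployment. The sketch below is one possible shape, not a Groq feature: `ask_with_escalation`, `complete`, and `is_good_enough` are hypothetical names, and `complete(model, prompt)` stands in for whatever wrapper you put around `client.chat.completions.create`.

```python
from typing import Callable

CHEAP_MODEL = "openai/gpt-oss-20b"
STRONG_MODEL = "openai/gpt-oss-120b"


def ask_with_escalation(
    prompt: str,
    complete: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the 20B model first; retry on 120B if the answer fails the check.

    `complete(model, prompt)` is a hypothetical wrapper around the chat
    completions call. `is_good_enough` is whatever check fits the task
    (schema validation, a length floor, a keyword test).
    Returns (model_used, answer).
    """
    answer = complete(CHEAP_MODEL, prompt)
    if is_good_enough(answer):
        return CHEAP_MODEL, answer
    return STRONG_MODEL, complete(STRONG_MODEL, prompt)


if __name__ == "__main__":
    # Stand-in for a real API call so the sketch runs offline.
    def fake_complete(model: str, prompt: str) -> str:
        return "" if model == CHEAP_MODEL else "1. Create a composite index."

    print(ask_with_escalation("Plan an index.", fake_complete, lambda a: bool(a.strip())))
```

Keep in mind that an escalated request pays for both calls, so this only saves money when the 20B answer passes the check most of the time.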
Price Requests with Cache Hits
Groq prompt caching is automatic for supported models, including openai/gpt-oss-20b and openai/gpt-oss-120b. Groq says no code changes are required, cache hits use exact prefix matching, cached portions get a 50% input-token discount, and cached data expires automatically after 2 hours without use (Groq Prompt Caching).
The practical rule: put stable text first.
Good order:
- System prompt
- Tool definitions
- Few-shot examples
- Shared documents or schemas
- User-specific input
- Timestamps, IDs, per-request data
Bad order: putting a timestamp, request ID, or user-specific field before a 20,000-token shared prompt. That breaks the prefix.
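The ordering rule above can be sketched as a message builder that keeps every stable section in one leading block and appends per-request fields at the very end. The prompt text, schema, and the `build_messages` helper are made up for illustration; the only point is that the leading content is identical across requests.

```python
import json

# Stable sections: identical text on every request, so they can form a shared prefix.
# The prompt text and schema here are made up for illustration.
SYSTEM_PROMPT = "You are a support assistant for an internal billing API."
SHARED_SCHEMA = json.dumps(
    {"invoice": {"id": "string", "total_cents": "integer"}}, sort_keys=True
)
STABLE_PREFIX = f"{SYSTEM_PROMPT}\n\nSchema:\n{SHARED_SCHEMA}"


def build_messages(user_input: str, request_id: str, timestamp: str) -> list[dict]:
    """Put stable text first and per-request data last."""
    return [
        {"role": "system", "content": STABLE_PREFIX},
        {
            "role": "user",
            # Everything here changes per request, so it goes after the shared prefix.
            "content": f"{user_input}\n\nrequest_id={request_id}\ntimestamp={timestamp}",
        },
    ]


if __name__ == "__main__":
    a = build_messages("Why was I charged twice?", "req-1", "2026-06-17T10:00:00Z")
    b = build_messages("Refund invoice 42.", "req-2", "2026-06-17T10:05:00Z")
    # The system message is identical across requests; only the user message differs.
    print(a[0] == b[0], a[1] == b[1])
```

Serializing the schema with `sort_keys=True` is a small guard against key order changing the prefix text between processes.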
Here is a cost helper for GPT-OSS 120B:
def groq_gpt_oss_120b_cost(prompt_tokens, cached_tokens, completion_tokens):
    uncached_tokens = max(prompt_tokens - cached_tokens, 0)
    return (
        uncached_tokens / 1_000_000 * 0.15
        + cached_tokens / 1_000_000 * 0.075
        + completion_tokens / 1_000_000 * 0.60
    )
And here is a runnable call that prints cache usage if Groq returns it:
import os
from openai import OpenAI

# Assumes groq_gpt_oss_120b_cost() from the cost helper above is defined in the same module.

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

STATIC_POLICY = """
You are an internal code review assistant.
Always check correctness, security, performance, and migration risk.
Return JSON with keys: summary, risks, suggested_patch.
"""

def review(diff: str):
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {"role": "system", "content": STATIC_POLICY},
            {"role": "user", "content": diff},
        ],
        response_format={"type": "json_object"},
    )
    usage = response.usage
    # prompt_tokens_details or cached_tokens may be missing or None; treat both as 0.
    details = getattr(usage, "prompt_tokens_details", None)
    cached = (getattr(details, "cached_tokens", 0) or 0) if details else 0
    print("prompt_tokens:", usage.prompt_tokens)
    print("cached_tokens:", cached)
    print("completion_tokens:", usage.completion_tokens)
    print("estimated_cost_usd:", groq_gpt_oss_120b_cost(
        usage.prompt_tokens,
        cached,
        usage.completion_tokens,
    ))
    return response.choices[0].message.content
Groq’s caching docs show cached_tokens under usage.prompt_tokens_details and define cache hit rate as cached_tokens / prompt_tokens × 100% (Groq Prompt Caching). Do not assume every second request is cheaper. Exact prefixes matter.
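The hit-rate definition is simple enough to track per request. A minimal sketch follows, assuming the 120B listed prices quoted earlier; `cache_hit_rate` and `input_savings_usd` are hypothetical helper names, not fields or functions from Groq or the OpenAI SDK.

```python
def cache_hit_rate(prompt_tokens: int, cached_tokens: int) -> float:
    """Cache hit rate in percent: cached_tokens / prompt_tokens * 100."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens * 100


def input_savings_usd(
    cached_tokens: int,
    input_price_per_1m: float = 0.15,    # listed 120B uncached input price
    cached_price_per_1m: float = 0.075,  # listed 120B cached input price
) -> float:
    """USD saved versus paying the uncached input price for the cached tokens."""
    return cached_tokens * (input_price_per_1m - cached_price_per_1m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 20,000 prompt tokens, of which 18,000 hit the cache.
    print(round(cache_hit_rate(20_000, 18_000), 1), "% hit rate")
    print(round(input_savings_usd(18_000), 6), "USD saved on input")
```

In that hypothetical case the hit rate is 90% and the saving is about $0.00135 per request, so the discount only becomes material at volume or with long shared prefixes.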
Count Tool Calls Separately
Token prices are not the whole bill if you enable built-in tools. Groq’s pricing page lists Built-In Tools for GPT-OSS separately: Browser Search basic search is $5 / 1000 requests, Browser Search visit website is $1 / 1000 requests, and Code Execution Python is $0.18 / hour (Groq Pricing).
That changes how you design agents. A support bot that calls search once per user message has a different cost shape from a summarizer that only uses prompt tokens. Cache helps with repeated tool schemas, but it does not make external tool calls free.
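A rough way to see that cost shape is to price tool usage on its own line. The sketch below uses the listed tool prices quoted above; `tool_cost_usd` is a hypothetical helper, and prorating code execution by the second is an assumption made here for illustration, not something the pricing page is quoted as saying.

```python
# Listed built-in tool prices quoted above (USD); re-check before budgeting.
SEARCH_USD_PER_1000 = 5.00
VISIT_USD_PER_1000 = 1.00
CODE_EXEC_USD_PER_HOUR = 0.18


def tool_cost_usd(searches: int = 0, visits: int = 0,
                  code_exec_seconds: float = 0.0) -> float:
    """Estimate built-in tool spend, separate from token spend.

    Assumption: code execution is prorated by the second for estimation purposes.
    """
    return (
        searches / 1000 * SEARCH_USD_PER_1000
        + visits / 1000 * VISIT_USD_PER_1000
        + code_exec_seconds / 3600 * CODE_EXEC_USD_PER_HOUR
    )


if __name__ == "__main__":
    # Hypothetical support bot: 10,000 user messages, one basic search each.
    print(round(tool_cost_usd(searches=10_000), 2))
```

At one search per message, 10,000 messages come to roughly $50 in search fees before any tokens are counted.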
Also check compatibility before copy-pasting OpenAI code. Groq’s OpenAI compatibility docs list unsupported fields that can return 400, including logprobs, logit_bias, top_logprobs, messages[].name, and n values other than 1 (Groq OpenAI Compatibility). Groq also says temperature=0 is converted to 1e-8.
A safe minimal request looks like this:
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    temperature=0.2,
    messages=[
        {"role": "system", "content": "Be precise. If unsure, say so."},
        {"role": "user", "content": "Explain this stack trace and suggest a fix: ..."},
    ],
)
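If existing OpenAI code builds its request kwargs in one place, the unsupported fields named above can be stripped before the call. This is a sketch under that assumption: `sanitize_for_groq` is a hypothetical helper, and it covers only the fields listed in this article, not Groq's full compatibility list.

```python
from copy import deepcopy

# Top-level fields this article lists as unsupported on Groq's compatible endpoint.
UNSUPPORTED_TOP_LEVEL = ("logprobs", "logit_bias", "top_logprobs")


def sanitize_for_groq(params: dict) -> dict:
    """Return a copy of chat-completion kwargs without the listed unsupported fields."""
    clean = deepcopy(params)
    for key in UNSUPPORTED_TOP_LEVEL:
        clean.pop(key, None)
    # Only n == 1 is listed as accepted, so drop any other value and use the default.
    if clean.get("n", 1) != 1:
        clean.pop("n")
    # messages[].name is listed as unsupported, so strip it from every message.
    clean["messages"] = [
        {k: v for k, v in message.items() if k != "name"}
        for message in clean.get("messages", [])
    ]
    return clean


if __name__ == "__main__":
    original = {
        "model": "openai/gpt-oss-120b",
        "n": 3,
        "logprobs": True,
        "messages": [{"role": "user", "name": "alice", "content": "Hi"}],
    }
    print(sanitize_for_groq(original))
```

Silently dropping `n` changes behavior for callers that expect several choices, so raising an error instead may be the better choice in code paths that depend on it.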
Avoid migrating your whole app in one commit. Put provider settings behind environment variables:
LLM_BASE_URL=https://api.groq.com/openai/v1
LLM_API_KEY=$GROQ_API_KEY
LLM_MODEL=openai/gpt-oss-120b
Then wire them into the SDK:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LLM_API_KEY"],
    base_url=os.environ["LLM_BASE_URL"],
)

response = client.chat.completions.create(
    model=os.environ["LLM_MODEL"],
    messages=[{"role": "user", "content": "Summarize this incident report."}],
)
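Reading the three variables in one place also lets the app fail at startup instead of on the first request. A minimal sketch, assuming the `LLM_*` variable names above; `LLMSettings` and `load_settings` are hypothetical names introduced here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str


def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read the three LLM_* variables and fail early if any is missing or empty."""
    required = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return LLMSettings(
        base_url=env["LLM_BASE_URL"],
        api_key=env["LLM_API_KEY"],
        model=env["LLM_MODEL"],
    )


if __name__ == "__main__":
    # Passing a plain dict keeps the example runnable without real credentials.
    settings = load_settings({
        "LLM_BASE_URL": "https://api.groq.com/openai/v1",
        "LLM_API_KEY": "gsk_...",
        "LLM_MODEL": "openai/gpt-oss-120b",
    })
    print(settings.base_url, settings.model)
```

The resulting object can then be passed to `OpenAI(api_key=settings.api_key, base_url=settings.base_url)`, with `settings.model` used as the `model` argument.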
A Practical Multi-Provider Escape Hatch
If you are wiring this into production, do not hard-code provider assumptions all over your codebase. Keep base_url, api_key, and model in config. That makes Groq easy to test, and it also makes provider routing boring.
For teams that want one OpenAI-compatible endpoint for Claude, GPT, and Gemini, onehop is the easy path: change the base URL to https://api.onehop.ai/v1. It is OpenAI/Anthropic compatible, priced cheaper than first-party, and new accounts get $10 free with no card required.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ONEHOP_API_KEY"],
    base_url="https://api.onehop.ai/v1",
)

response = client.chat.completions.create(
    model=os.environ["ONEHOP_MODEL"],
    messages=[
        {"role": "user", "content": "Compare this API design against common REST mistakes."}
    ],
)
print(response.choices[0].message.content)
Use Groq when you specifically want fast GPT-OSS inference and Groq’s tool/caching stack. Use onehop when you want a single integration surface across Claude, GPT, Gemini, and other hosted models without rewriting client code. You can call Claude and other models on onehop or sign up for $10 free credit.
Production Checklist
Before shipping, run this checklist:
- Pin the model ID: openai/gpt-oss-120b for quality, openai/gpt-oss-20b for lower cost.
- Keep stable prompt sections first so Groq can reuse cached prefixes.
- Log prompt_tokens, cached_tokens, and completion_tokens for every request.
- Add separate accounting for Browser Search, Visit Website, and Code Execution.
- Remove unsupported OpenAI parameters before routing traffic to Groq.
- Keep base_url configurable so you can test Groq, first-party APIs, or onehop without touching business logic.
The whole migration can be one line. The reliable migration is three things: base URL, model ID, and cost telemetry. Start there, then decide whether 120B’s quality is worth the output-token spend for each path in your app. If you want the same base-URL pattern for broader model access, call Claude and other models on onehop and sign up for $10 free credit.