Use Groq GPT-OSS 120B with the OpenAI SDK: Base URL, Pricing, and Caching

Groq’s OpenAI-compatible endpoint is a real one-line swap: set base_url to https://api.groq.com/openai/v1 and keep using the OpenAI SDK. As of June 17, 2026, Groq lists openai/gpt-oss-120b at $0.15 per 1M uncached input tokens, $0.075 per 1M cached input tokens, and $0.60 per 1M output tokens on its pricing page (Groq Pricing).

That is the useful part. The trap is assuming “OpenAI-compatible” means “identical behavior and identical billing.” It does not. You still need to choose the Groq model ID, watch cache hits, avoid unsupported OpenAI parameters, and count built-in tool calls separately.

Figure: a compact price card comparing Groq GPT-OSS 120B and GPT-OSS 20B.

What You Are Actually Switching

Groq’s docs say its API is “mostly compatible with OpenAI’s client libraries” and show the exact OpenAI client configuration: pass your Groq key and set base_url="https://api.groq.com/openai/v1" (Groq OpenAI Compatibility).

Install the OpenAI Python SDK:

pip install openai
export GROQ_API_KEY="gsk_..."

Then call GPT-OSS 120B through Groq:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a concise senior backend engineer."},
        {"role": "user", "content": "Write a PostgreSQL index plan for a slow user_events query."},
    ],
)

print(response.choices[0].message.content)
print(response.usage)

The important difference is the model. On OpenAI’s own platform you might use an OpenAI-hosted model name. On Groq, GPT-OSS is addressed as openai/gpt-oss-120b or openai/gpt-oss-20b.

OpenAI released gpt-oss-120b and gpt-oss-20b on August 5, 2025 as open-weight reasoning models under Apache 2.0 (OpenAI). OpenAI says gpt-oss-120b has 117B total parameters, 5.1B active parameters per token, 128 experts, 4 active experts per token, and native support for context lengths up to 128k (OpenAI). Groq’s model page lists the hosted openai/gpt-oss-120b context window as 131,072 tokens and max output tokens as 65,536 (Groq model card).

Pick 120B or 20B

Use 120B when quality matters more than raw cost: agent planning, harder coding tasks, complex extraction, or multi-step reasoning. Use 20B when you need cheaper throughput for routing, summarization, classification, short assistants, or high-volume background jobs.

Groq model ID	Listed speed	Uncached input	Cached input	Output
`openai/gpt-oss-120b`	500 TPS	$0.15 / 1M	$0.075 / 1M	$0.60 / 1M
`openai/gpt-oss-20b`	1,000 TPS	$0.075 / 1M	$0.0375 / 1M	$0.30 / 1M

Those prices come from Groq’s current pricing table and prompt caching table (Groq Pricing). The 20B model is exactly half the listed token price of 120B. That makes it a good default for “try it first” workflows. If the answer quality is not good enough, promote that path to 120B.
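As a rough sketch of how the table translates into a bill, the snippet below prices the same workload on both models. The per-token prices are the listed values from the table above; the workload numbers and the names PRICES_PER_MILLION and workload_cost are illustrative assumptions, not anything Groq provides.

```python
# Listed prices in USD per 1M tokens, copied from the table above.
PRICES_PER_MILLION = {
    "openai/gpt-oss-120b": {"input": 0.15, "cached": 0.075, "output": 0.60},
    "openai/gpt-oss-20b": {"input": 0.075, "cached": 0.0375, "output": 0.30},
}


def workload_cost(model, uncached_in, cached_in, out):
    """Estimate USD cost for a token workload on one model (hypothetical helper)."""
    p = PRICES_PER_MILLION[model]
    return (
        uncached_in * p["input"] + cached_in * p["cached"] + out * p["output"]
    ) / 1_000_000


# Hypothetical daily workload: 40M uncached input, 10M cached input, 5M output tokens.
for model in PRICES_PER_MILLION:
    print(model, round(workload_cost(model, 40_000_000, 10_000_000, 5_000_000), 2))
```

Because every listed 20B price is half the 120B price, the 20B total for any fixed workload comes out at half the 120B total. Real savings depend on whether 20B needs retries or longer prompts to reach acceptable quality.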

Figure: model family map showing GPT-OSS 20B as fast/low-cost and GPT-OSS 120B as higher-capability.

Here is a tiny switch you can keep in config:

import os
from openai import OpenAI

MODEL = os.getenv("GROQ_MODEL", "openai/gpt-oss-20b")

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer with practical developer steps."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(ask("Give me a minimal Redis rate limiter design."))

Run it with 20B:

GROQ_MODEL=openai/gpt-oss-20b python app.py

Run it with 120B:

GROQ_MODEL=openai/gpt-oss-120b python app.py

Price Requests with Cache Hits

Groq prompt caching is automatic for supported models, including openai/gpt-oss-20b and openai/gpt-oss-120b. Groq says no code changes are required, cache hits use exact prefix matching, cached portions get a 50% input-token discount, and cached data expires automatically after 2 hours without use (Groq Prompt Caching).

The practical rule: put stable text first.

Good order:

System prompt
Tool definitions
Few-shot examples
Shared documents or schemas
User-specific input
Timestamps, IDs, per-request data

Bad order: putting a timestamp, request ID, or user-specific field before a 20,000-token shared prompt. That breaks the prefix.
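The ordering rule above can be sketched as a small message builder. This is a minimal illustration, assuming the shared context fits in the system message; build_messages and its parameters are hypothetical names. It only keeps the prefix stable. Whether a given request actually hits the cache is decided by Groq's exact prefix matching.

```python
def build_messages(system_prompt, shared_context, user_input, request_meta):
    """Order content so the stable, cacheable text always comes first."""
    return [
        # Stable prefix: identical across requests, so it can match a cached prefix.
        {"role": "system", "content": f"{system_prompt}\n\n{shared_context}"},
        # Volatile suffix: per-request text and metadata go last.
        {"role": "user", "content": f"{user_input}\n\nrequest_meta: {request_meta}"},
    ]


a = build_messages("You review SQL.", "SCHEMA: users(id, email)", "Why is this slow?", "req-1")
b = build_messages("You review SQL.", "SCHEMA: users(id, email)", "Add an index.", "req-2")

# The system message is byte-for-byte identical; only the final message differs.
print(a[0] == b[0], a[1] == b[1])
```

Keeping request IDs and timestamps out of the system message is the design choice that matters here. A single changed character early in the prompt means everything after it no longer matches.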

Here is a cost helper for GPT-OSS 120B:

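def groq_gpt_oss_120b_cost(prompt_tokens, cached_tokens, completion_tokens):
    """Estimate USD cost for one request at the listed GPT-OSS 120B prices."""
    # cached_tokens is treated as a subset of prompt_tokens; the rest is billed uncached.
    uncached_tokens = max(prompt_tokens - cached_tokens, 0)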

    return (
        uncached_tokens / 1_000_000 * 0.15
        + cached_tokens / 1_000_000 * 0.075
        + completion_tokens / 1_000_000 * 0.60
    )

And here is a call that prints cache usage if Groq returns it. It reuses the groq_gpt_oss_120b_cost helper defined above:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

STATIC_POLICY = """
You are an internal code review assistant.
Always check correctness, security, performance, and migration risk.
Return JSON with keys: summary, risks, suggested_patch.
"""

def review(diff: str):
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {"role": "system", "content": STATIC_POLICY},
            {"role": "user", "content": diff},
        ],
        response_format={"type": "json_object"},
    )

    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    # cached_tokens may be missing or None; treat both as zero cached tokens.
    cached = (getattr(details, "cached_tokens", 0) or 0) if details else 0

    print("prompt_tokens:", usage.prompt_tokens)
    print("cached_tokens:", cached)
    print("completion_tokens:", usage.completion_tokens)
    print("estimated_cost_usd:", groq_gpt_oss_120b_cost(
        usage.prompt_tokens,
        cached,
        usage.completion_tokens,
    ))

    return response.choices[0].message.content

Groq’s caching docs show cached_tokens under usage.prompt_tokens_details and define cache hit rate as cached_tokens / prompt_tokens × 100% (Groq Prompt Caching). Do not assume every second request is cheaper. Exact prefixes matter.
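To make the hit-rate formula concrete, here is a small sketch that computes it and blends the listed 120B input prices into one effective rate. The function names and the sample token counts are assumptions for illustration only.

```python
UNCACHED_INPUT_PRICE = 0.15  # USD per 1M tokens, listed 120B price
CACHED_INPUT_PRICE = 0.075   # USD per 1M tokens, listed 120B price


def cache_hit_rate(prompt_tokens, cached_tokens):
    """Return cached_tokens / prompt_tokens as a percentage (0.0 for an empty prompt)."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens * 100


def effective_input_price(hit_rate_percent):
    """Blend cached and uncached prices into one USD-per-1M input rate."""
    hit = hit_rate_percent / 100
    return UNCACHED_INPUT_PRICE * (1 - hit) + CACHED_INPUT_PRICE * hit


# Hypothetical request: 20,000 of 25,000 prompt tokens served from cache.
rate = cache_hit_rate(25_000, 20_000)
print(rate, effective_input_price(rate))
```

At an 80% hit rate the blended input price works out to about $0.09 per 1M tokens (0.15 × 0.2 + 0.075 × 0.8). Even a fully cached prompt only halves the input bill, and output tokens are unaffected.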

Count Tool Calls Separately

Token prices are not the whole bill if you enable built-in tools. Groq’s pricing page lists Built-In Tools for GPT-OSS separately: Browser Search basic search is $5 / 1000 requests, Browser Search visit website is $1 / 1000 requests, and Code Execution Python is $0.18 / hour (Groq Pricing).

That changes how you design agents. A support bot that calls search once per user message has a different cost shape from a summarizer that only uses prompt tokens. Cache helps with repeated tool schemas, but it does not make external tool calls free.
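A separate tool ledger might look like the sketch below. The rates are the listed figures quoted above. The tool_cost helper is hypothetical, and prorating code execution by the second is an assumption, so check Groq's billing granularity before relying on it.

```python
# Listed built-in tool prices in USD, from the pricing figures quoted above.
BASIC_SEARCH_PER_REQUEST = 5 / 1000
VISIT_WEBSITE_PER_REQUEST = 1 / 1000
CODE_EXECUTION_PER_HOUR = 0.18


def tool_cost(basic_searches=0, website_visits=0, code_exec_seconds=0.0):
    """Estimate USD spend on built-in tools, separate from token cost."""
    return (
        basic_searches * BASIC_SEARCH_PER_REQUEST
        + website_visits * VISIT_WEBSITE_PER_REQUEST
        # Assumption: execution time is prorated per second of the hourly rate.
        + code_exec_seconds / 3600 * CODE_EXECUTION_PER_HOUR
    )


# Hypothetical day: 10,000 searches, 4,000 page visits, 2 hours of code execution.
print(round(tool_cost(10_000, 4_000, 7_200), 2))
```

In this made-up day the searches alone cost $50, which is more than many token bills at these prices. That is why tool calls deserve their own line in cost telemetry.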

Also check compatibility before copy-pasting OpenAI code. Groq’s OpenAI compatibility docs list unsupported fields that can return 400, including logprobs, logit_bias, top_logprobs, messages[].name, and n values other than 1 (Groq OpenAI Compatibility). Groq also says temperature=0 is converted to 1e-8.

A safe minimal request looks like this:

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    temperature=0.2,
    messages=[
        {"role": "system", "content": "Be precise. If unsure, say so."},
        {"role": "user", "content": "Explain this stack trace and suggest a fix: ..."},
    ],
)
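If existing OpenAI call sites already pass some of the unsupported fields listed above, one option is to filter request kwargs before they reach Groq. The sketch below is a hypothetical helper, not part of either SDK, and it covers only the fields named in this article.

```python
# Fields the compatibility notes above describe as unsupported.
UNSUPPORTED_FIELDS = ("logprobs", "logit_bias", "top_logprobs")


def sanitize_for_groq(params):
    """Return a copy of chat-completion kwargs without the fields listed as unsupported."""
    clean = {k: v for k, v in params.items() if k not in UNSUPPORTED_FIELDS}
    # n values other than 1 are listed as unsupported, so drop the key entirely.
    if clean.get("n", 1) != 1:
        clean.pop("n")
    # messages[].name is listed as unsupported, so strip it from each message copy.
    clean["messages"] = [
        {k: v for k, v in m.items() if k != "name"} for m in params.get("messages", [])
    ]
    return clean


request = {
    "model": "openai/gpt-oss-120b",
    "n": 2,
    "logprobs": True,
    "temperature": 0.2,
    "messages": [{"role": "user", "name": "bob", "content": "hi"}],
}
print(sanitize_for_groq(request))
```

Silently dropping n changes how many choices come back, so a stricter version could raise an error instead and force the caller to decide.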

Avoid migrating your whole app in one commit. Put provider settings behind environment variables:

LLM_BASE_URL=https://api.groq.com/openai/v1
LLM_API_KEY=$GROQ_API_KEY
LLM_MODEL=openai/gpt-oss-120b

Then wire them into the SDK:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LLM_API_KEY"],
    base_url=os.environ["LLM_BASE_URL"],
)

response = client.chat.completions.create(
    model=os.environ["LLM_MODEL"],
    messages=[{"role": "user", "content": "Summarize this incident report."}],
)

A Practical Multi-Provider Escape Hatch

If you are wiring this into production, do not hard-code provider assumptions all over your codebase. Keep base_url, api_key, and model in config. That makes Groq easy to test, and it also makes provider routing boring.

For teams that want one OpenAI-compatible endpoint for Claude, GPT, and Gemini, onehop is the easy path: change the base URL to https://api.onehop.ai/v1. It is OpenAI/Anthropic compatible, priced cheaper than first-party, and new accounts get $10 free with no card required.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ONEHOP_API_KEY"],
    base_url="https://api.onehop.ai/v1",
)

response = client.chat.completions.create(
    model=os.environ["ONEHOP_MODEL"],
    messages=[
        {"role": "user", "content": "Compare this API design against common REST mistakes."}
    ],
)

print(response.choices[0].message.content)

Use Groq when you specifically want fast GPT-OSS inference and Groq’s tool/caching stack. Use onehop when you want a single integration surface across Claude, GPT, Gemini, and other hosted models without rewriting client code. You can call Claude and other models on onehop or sign up for $10 free credit.

Production Checklist

Before shipping, run this checklist:

Pin the model ID: openai/gpt-oss-120b for quality, openai/gpt-oss-20b for lower cost.
Keep stable prompt sections first so Groq can reuse cached prefixes.
Log prompt_tokens, cached_tokens, and completion_tokens for every request.
Add separate accounting for Browser Search, Visit Website, and Code Execution.
Remove unsupported OpenAI parameters before routing traffic to Groq.
Keep base_url configurable so you can test Groq, first-party APIs, or onehop without touching business logic.
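For the logging item in the checklist, a minimal sketch is to flatten the usage object into one JSON line per request. The names usage_record and log_usage and the logger name are hypothetical. The field access mirrors the usage.prompt_tokens_details.cached_tokens layout described earlier and falls back to zero when the field is absent.

```python
import json
import logging

logger = logging.getLogger("llm.usage")


def usage_record(model, usage):
    """Flatten a chat-completion usage object into a dict suitable for logging."""
    details = getattr(usage, "prompt_tokens_details", None)
    # cached_tokens may be missing or None; treat both as zero.
    cached = (getattr(details, "cached_tokens", 0) or 0) if details else 0
    return {
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "cached_tokens": cached,
        "completion_tokens": usage.completion_tokens,
    }


def log_usage(model, usage):
    """Emit the usage record as one JSON line and return it."""
    record = usage_record(model, usage)
    logger.info(json.dumps(record))
    return record
```

Call log_usage(model, response.usage) right after each completion. Nothing is printed until the application configures logging, for example with logging.basicConfig(level=logging.INFO).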

The whole migration can be one line. The reliable migration takes three changes: the base URL, the model ID, and cost telemetry. Start there, then decide whether 120B’s quality is worth the output-token spend for each path in your app. If you want the same base-URL pattern for broader model access, you can call Claude and other models on onehop, which gives new accounts $10 in free credit.
