Calling the Gemini API with the OpenAI SDK: How to Configure base_url, Model Names, and Gemini 3.5 Flash Pricing

Google’s OpenAI compatibility documentation, updated on 2026-05-18, states directly that Gemini models can be called through the OpenAI Python and JavaScript SDKs. Only three fields change: api_key, base_url, and model (Google). As of 2026-06-14, the pricing page lists the standard paid-tier price for gemini-3.5-flash as $1.50 per 1 million input tokens and $9.00 per 1 million output tokens, with thinking tokens counted as output (Google Pricing).

My judgment is simple: if you already have an OpenAI SDK project, don’t rewrite it yet. First connect Gemini as an OpenAI-compatible backend, verify cost, streaming, and tool calling, and then decide whether to migrate to the native Gemini SDK.

[Figure: before-and-after migration comparison. An OpenAI SDK configuration card on the left and a Gemini configuration card on the right; only the three changed lines (api_key, base_url, model) are highlighted.]

1. Minimal change: only switch the endpoint and model name

First install the official OpenAI SDK for Python and export your API key:

pip install openai
export GEMINI_API_KEY="your Gemini API key"

Then change the client to this:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

resp = client.chat.completions.create(
    model="gemini-3.5-flash",
    reasoning_effort="low",
    messages=[
        {"role": "system", "content": "You are a direct, accurate coding assistant."},
        {"role": "user", "content": "Explain SSE streaming output in three sentences."},
    ],
)

print(resp.choices[0].message.content)

Call shapes such as chat.completions.create, messages, and tools stay in the OpenAI Chat Completions style; OpenAI’s own API reference still defines Chat Completions as an interface that generates a reply from a list of messages (OpenAI). So the migration work sits in the configuration layer, not in business code.

2. Don’t omit the final slash in base_url

The address in Google’s documentation is:

https://generativelanguage.googleapis.com/v1beta/openai/

If you omit the final /, a client that joins paths with standard URL resolution can drop the trailing openai segment and send requests to the wrong path. In production code, move these values into environment variables:

OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="$GEMINI_API_KEY"
OPENAI_MODEL="gemini-3.5-flash"
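
To make the trailing-slash point concrete, here is a minimal sketch using Python's urllib.parse.urljoin, which follows standard relative-URL resolution. It illustrates what a client that joins paths this way would do; it is not a claim about the internals of any particular SDK. The helper names normalize_base_url and load_settings are hypothetical, and the environment variable names are the ones listed above.

```python
import os
from urllib.parse import urljoin

GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"


def normalize_base_url(url: str) -> str:
    """Return the URL with exactly one trailing slash."""
    return url.rstrip("/") + "/"


def load_settings(env=os.environ) -> dict:
    """Read the three migration fields from environment variables.

    Raises KeyError if OPENAI_API_KEY is missing, so a misconfigured
    deployment fails at startup instead of on the first request.
    """
    return {
        "api_key": env["OPENAI_API_KEY"],
        "base_url": normalize_base_url(env.get("OPENAI_BASE_URL", GEMINI_BASE_URL)),
        "model": env.get("OPENAI_MODEL", "gemini-3.5-flash"),
    }


# Without the trailing slash, resolution replaces the last path segment.
without_slash = urljoin(GEMINI_BASE_URL.rstrip("/"), "chat/completions")
with_slash = urljoin(GEMINI_BASE_URL, "chat/completions")
print(without_slash)  # the "openai" segment is gone
print(with_slash)     # the "openai" segment is kept
```

With this in place, the client can be built from load_settings() by passing its api_key and base_url values to OpenAI(...), and its model value to each chat.completions.create call.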

If you want to avoid switching among multiple vendor accounts, quotas, and bills, onehop is the easier path: change base_url to https://api.onehop.ai/v1, and use the same OpenAI / Anthropic-compatible interface to call Claude, GPT, and Gemini. New accounts get $10 with no card required; it is suitable for doing a PoC first, then deciding whether to connect directly to the official provider.

from openai import OpenAI

client = OpenAI(
    api_key="your onehop key",
    base_url="https://api.onehop.ai/v1",
)

resp = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[{"role": "user", "content": "Give me a FastAPI health check endpoint"}],
)
print(resp.choices[0].message.content)

Entry points: call models such as Claude on onehop, or register to get a free $10 trial credit.

3. Estimate costs based on output tokens first

The standard tier of gemini-3.5-flash is not “so cheap you don’t need to care.” The output price is 6 times the input price:

| Model | Tier | Input (USD per 1 million tokens) | Output (USD per 1 million tokens) |
| --- | --- | --- | --- |
| `gemini-3.5-flash` | Standard | `$1.50` | `$9.00` |
| `gemini-3.5-flash` | Batch | `$0.75` | `$4.50` |
| `gemini-3.5-flash` | Flex | `$0.75` | `$4.50` |

The Batch and Flex figures come from the same Google pricing page. In application code, cap max_completion_tokens, especially for summarization, code generation, and Agent tool loops. Long inputs can still be cached, but runaway output is real money burned.
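
A back-of-the-envelope estimate makes the asymmetry visible. The sketch below hard-codes the prices from the table above; the function name estimate_cost and the tier keys are hypothetical, and cache discounts are not modeled.

```python
# Prices in USD per 1 million tokens, copied from the table above.
PRICES = {
    "standard": {"input": 1.50, "output": 9.00},
    "batch": {"input": 0.75, "output": 4.50},
    "flex": {"input": 0.75, "output": 4.50},
}


def estimate_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Estimate the USD cost of one request.

    output_tokens should include thinking tokens, since the output
    price covers them.
    """
    price = PRICES[tier]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# 2,000 input tokens and 1,000 output tokens on the standard tier.
print(f"${estimate_cost(2_000, 1_000):.4f}")  # prints $0.0120
```

In this example the 1,000 output tokens account for $0.009 of the $0.012 total, three times the cost of the 2,000 input tokens.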

[Figure: bar chart of input and output prices in dollars per million tokens for the Standard, Batch, and Flex tiers; output is the taller bar in every tier.]

4. How reasoning_effort maps

Google’s compatibility layer accepts the OpenAI-style reasoning_effort and maps it to Gemini’s thinking configuration (Google):

| `reasoning_effort` | Gemini 3 Flash `thinking_level` |
| --- | --- |
| `minimal` | `minimal` |
| `low` | `low` |
| `medium` | `medium` |
| `high` | `high` |

If you do not pass it, the model default is used. Google’s documentation also states one key limitation: Gemini 3 cannot disable thinking; none only applies to some 2.5 models. My recommendation is to default to low in production, and only raise it to medium or high for complex planning or long-chain tool calls. Because output pricing includes thinking tokens, reasoning intensity is not a free knob.
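
One way to keep that default explicit is a small lookup that falls back to low and rejects values outside the table above. This is only a sketch: the task labels and the names EFFORT_BY_TASK and pick_reasoning_effort are hypothetical, not part of either SDK.

```python
# Values the compatibility layer maps to thinking_level, per the table above.
ALLOWED_EFFORTS = ("minimal", "low", "medium", "high")

# Hypothetical task labels; adjust to the workloads in your own project.
EFFORT_BY_TASK = {
    "planning": "high",
    "tool_loop": "medium",
}


def pick_reasoning_effort(task: str, default: str = "low") -> str:
    """Return the reasoning_effort for a task, falling back to the default."""
    effort = EFFORT_BY_TASK.get(task, default)
    if effort not in ALLOWED_EFFORTS:
        # "none" is rejected here because Gemini 3 cannot disable thinking.
        raise ValueError(f"unsupported reasoning_effort: {effort!r}")
    return effort


print(pick_reasoning_effort("summarize"))  # falls back to the default
print(pick_reasoning_effort("planning"))
```

The result can be passed straight through as reasoning_effort=pick_reasoning_effort(task) in chat.completions.create.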

5. Streaming and function calling: usable, but guard against empty chunks

Streaming calls keep the OpenAI SDK style:

stream = client.chat.completions.create(
    model="gemini-3.5-flash",
    messages=[{"role": "user", "content": "Write a Redis cache wrapper"}],
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no choices before indexing into them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

The if delta here is very practical. Streaming responses may contain roles, tool calls, or empty deltas; do not assume every chunk has text.

Function calling also uses tools and tool_choice="auto". Google’s compatibility documentation provides a weather function example and confirms that the Gemini API supports function calling (Google). In real projects, don’t just print what the model returns; check message.tool_calls, execute the local function, and then feed the tool result back to the model as the next-round message.
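
The execute-and-feed-back step can be sketched as a pure function that turns message.tool_calls into role "tool" messages in the Chat Completions format. Everything named here is illustrative: get_weather is a stand-in local function, TOOL_REGISTRY and run_tool_calls are hypothetical helpers, and the fake message is built with SimpleNamespace so the example runs without an API call.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> dict:
    """Hypothetical local tool; a real one would call a weather service."""
    return {"location": location, "forecast": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message, registry=TOOL_REGISTRY) -> list:
    """Execute each requested tool and build the follow-up tool messages."""
    results = []
    for call in message.tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            output = {"error": f"unknown tool: {call.function.name}"}
        else:
            # Arguments arrive as a JSON string, not a dict.
            output = func(**json.loads(call.function.arguments))
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results


# Fake assistant message shaped like message.tool_calls in the SDK.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(
                name="get_weather", arguments='{"location": "Paris"}'
            ),
        )
    ]
)
print(run_tool_calls(fake_message))
```

To close the loop, append the assistant message and the returned tool messages to messages, then call chat.completions.create again so the model can produce its final answer.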

Conclusion: the minimum cost of migrating to Gemini is three lines of configuration; what you really need to watch is output tokens, thinking intensity, empty streaming chunks, and the closed loop for tool calling. If you simply want to quickly put Claude, GPT, and Gemini into the same OpenAI SDK project, using onehop’s unified entry point will save quite a bit of configuration time: call models such as Claude on onehop, or first claim the free $10 trial credit upon registration.
