API Reference

OpenAI-compatible API for Gemini and Claude. A drop-in replacement: just change the base URL and API key.

Base URL

https://getinference.io/v1

Quickstart

  1. Create an account and buy credits
  2. Copy your API key from the dashboard
  3. Use any OpenAI-compatible SDK with our base URL

OpenAI compatible. Any library or tool that works with the OpenAI API works with GetInference. Just swap the base URL and API key.

PYTHON

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-key-here",
    base_url="https://getinference.io/v1"
)

response = client.chat.completions.create(
    model="gemini-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

JAVASCRIPT

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-your-key-here",
  baseURL: "https://getinference.io/v1",
});

const res = await client.chat.completions.create({
  model: "gemini-flash",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(res.choices[0].message.content);

CURL

curl https://getinference.io/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-flash",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

GETINFERENCE SDK

# Coming soon: GetInference Python SDK
# pip install getinference

from getinference import GetInference

client = GetInference(api_key="sk-your-key-here")

response = client.chat("Hello!", model="gemini-flash")
print(response)

Authentication

Pass your API key as a Bearer token in the Authorization header with every request.

Get your key from the dashboard. You can create up to 10 active keys per account. All keys share your account credit balance.

Authorization: Bearer sk-your-key-here
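
If you're not using an SDK, set the header yourself on each request. A minimal sketch with Python's requests library, posting to the chat completions endpoint documented below:

# Sketch: authenticating a raw HTTP request with the requests library.
# Endpoint and body format follow the Chat Completions docs below.
import requests

resp = requests.post(
    "https://getinference.io/v1/chat/completions",
    headers={"Authorization": "Bearer sk-your-key-here"},
    json={
        "model": "gemini-flash",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])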

Models

All models are accessed via the same endpoint. Pass the model ID in the request body.

Gemini (Google)

Model ID               Underlying model       Description
gemini-flash           Gemini 2.0 Flash       Fast, cheap, general purpose
gemini-flash-lite      Gemini 2.0 Flash Lite  Cheapest option for simple tasks
gemini-2-5-flash       Gemini 2.5 Flash       Latest-generation flash with strong reasoning
gemini-2-5-flash-lite  Gemini 2.5 Flash Lite  Budget-friendly latest generation
gemini-pro             Gemini 2.5 Pro         Most capable Gemini for complex tasks

Claude (Anthropic)

Model ID       Underlying model  Description
claude-haiku   Claude Haiku 4.5  Fast and affordable Claude
claude-sonnet  Claude Sonnet 4   Balanced performance and cost
claude-opus    Claude Opus 4     Most capable Claude for demanding tasks
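
Switching providers is just a different model ID on the same endpoint. A quick sketch, reusing the Quickstart client setup:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-key-here",
    base_url="https://getinference.io/v1",
)

# Same call, different model IDs from the tables above.
for model in ["gemini-flash", "claude-sonnet"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(model, "->", response.choices[0].message.content)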

Chat Completions

POST /v1/chat/completions

Create a model response for a conversation. This is the primary endpoint for generating text.

Request body

Parameter    Type            Required  Description
model        string          Required  Model ID (e.g. gemini-flash)
messages     array           Required  Conversation messages with role and content
temperature  float           Optional  Sampling temperature, 0-2. Default: 1
max_tokens   integer         Optional  Maximum tokens to generate
stream       boolean         Optional  Enable streaming. Default: false
top_p        float           Optional  Nucleus sampling threshold, 0-1
stop         string | array  Optional  Stop sequences
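
The optional parameters compose freely. A sketch exercising the sampling controls above, with the client configured as in the Quickstart:

# Sketch: optional sampling parameters from the table above.
response = client.chat.completions.create(
    model="gemini-flash",
    messages=[{"role": "user", "content": "List three colors"}],
    temperature=0.2,   # lower = more deterministic
    top_p=0.9,         # nucleus sampling threshold
    max_tokens=64,     # hard cap on generated tokens
    stop=["\n\n"],     # stop at the first blank line
)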

Response

Returns a chat completion object with the model's response, token usage, and metadata.

PYTHON

response = client.chat.completions.create(
    model="gemini-flash",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What is quantum computing?"}
    ],
    temperature=0.7,
    max_tokens=1024
)

JAVASCRIPT

const response = await client.chat.completions.create({
  model: "gemini-flash",
  messages: [
    { role: "system", content: "You are helpful." },
    { role: "user", content: "What is quantum computing?" }
  ],
  temperature: 0.7,
  max_tokens: 1024
});

CURL

curl https://getinference.io/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-flash",
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "What is quantum computing?"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024
  }'

GETINFERENCE SDK

# Coming soon
response = client.chat(
    "What is quantum computing?",
    model="gemini-flash",
    system="You are helpful.",
    temperature=0.7,
    max_tokens=1024
)

RESPONSE

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gemini-flash",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing uses..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 156,
    "total_tokens": 180
  }
}
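
Continuing the Python request example above, a short sketch for pulling the text and token usage out of the SDK's completion object (fields mirror the JSON above):

text = response.choices[0].message.content
usage = response.usage
print(f"{usage.prompt_tokens} prompt + {usage.completion_tokens} completion "
      f"= {usage.total_tokens} tokens")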

Streaming

Set stream: true to receive partial responses as server-sent events. The response is delivered incrementally as the model generates tokens.

Each event contains a delta object with the new content. The stream ends with a [DONE] message.

PYTHON

stream = client.chat.completions.create(
    model="gemini-flash",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

JAVASCRIPT

const stream = await client.chat.completions.create({
  model: "gemini-flash",
  messages: [{ role: "user", content: "Write a poem" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(
    chunk.choices[0]?.delta?.content || ""
  );
}

CURL

curl https://getinference.io/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-flash",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "stream": true
  }'

STREAM OUTPUT

data: {"choices":[{"delta":{"content":"Once"}}]}

data: {"choices":[{"delta":{"content":" upon"}}]}

data: {"choices":[{"delta":{"content":" a"}}]}

data: [DONE]
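
If you're consuming the stream without an SDK, you parse the server-sent events yourself. A minimal sketch with Python's requests library, based on the event format shown above:

# Sketch: reading the SSE stream by hand. Each "data:" line carries a
# JSON chunk; the literal "[DONE]" payload terminates the stream.
import json
import requests

with requests.post(
    "https://getinference.io/v1/chat/completions",
    headers={"Authorization": "Bearer sk-your-key-here"},
    json={
        "model": "gemini-flash",
        "messages": [{"role": "user", "content": "Write a poem"}],
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip blank keep-alive lines
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        print(chunk["choices"][0]["delta"].get("content", ""), end="")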

List Models

GET /v1/models

Returns a list of all available models.

curl https://getinference.io/v1/models \
  -H "Authorization: Bearer sk-your-key-here"

Errors

The API returns standard HTTP status codes. Error responses include a JSON body with details.

Status  Meaning          What to do
400     Bad request      Check your request body format
401     Unauthorized     Check your API key
402     Budget exceeded  Buy more credits
404     Model not found  Check the model ID
429     Rate limited     Slow down, retry with backoff
500     Server error     Retry with exponential backoff
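
For the retryable statuses (429 and 500), exponential backoff is the usual pattern. A sketch assuming the OpenAI Python SDK's APIStatusError type; the helper name is hypothetical:

import time

from openai import OpenAI, APIStatusError

client = OpenAI(api_key="sk-your-key-here",
                base_url="https://getinference.io/v1")

def chat_with_retry(messages, model="gemini-flash", attempts=5):
    # Retry 429/500 with exponential backoff; re-raise everything else.
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except APIStatusError as e:
            if e.status_code not in (429, 500) or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...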

Rate Limits & Budgets

Each API key has a budget equal to your total purchased credits. When spend reaches the budget, requests return 402.

You can create up to 10 active API keys per account. All keys share the same account budget. Revoke unused keys from the dashboard.

Tip: Use separate keys for different projects or environments (dev, staging, prod) so you can track spend per key on the dashboard.