API Reference

OpenAI-compatible API for Gemini and Claude. A drop-in replacement: just change the base URL and API key.

Base URL

https://getinference.io/v1

Quickstart

  1. Create an account and buy credits
  2. Copy your API key from the dashboard
  3. Use any OpenAI-compatible SDK with our base URL

OpenAI compatible. Any library or tool that works with the OpenAI API works with GetInference. Just swap the base URL and API key.

PYTHON

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-key-here",
    base_url="https://getinference.io/v1"
)

response = client.chat.completions.create(
    model="gemini-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

JAVASCRIPT

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-your-key-here",
  baseURL: "https://getinference.io/v1",
});

const res = await client.chat.completions.create({
  model: "gemini-flash",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(res.choices[0].message.content);

CURL

curl https://getinference.io/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-flash",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

GETINFERENCE SDK

# Coming soon: GetInference Python SDK
# pip install getinference

from getinference import GetInference

client = GetInference(api_key="sk-your-key-here")

response = client.chat("Hello!", model="gemini-flash")
print(response)

Authentication

Pass your API key as a Bearer token in the Authorization header with every request.

Get your key from the dashboard. You can create up to 10 active keys per account. All keys share your account credit balance.

Authorization: Bearer sk-your-key-here
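
If you're not using an SDK, set the header yourself on each request. A minimal sketch with Python's requests library, posting to the chat completions endpoint documented below:

# Sketch: authenticating a raw HTTP request with the requests library.
# Endpoint and body format follow the Chat Completions docs below.
import requests

resp = requests.post(
    "https://getinference.io/v1/chat/completions",
    headers={"Authorization": "Bearer sk-your-key-here"},
    json={
        "model": "gemini-flash",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])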

Models

All models are accessed via the same endpoint. Pass the model ID in the request body.

Gemini (Google)

Model ID               Underlying model       Description
gemini-flash           Gemini 2.0 Flash       Fast, cheap, general purpose
gemini-flash-lite      Gemini 2.0 Flash Lite  Cheapest option for simple tasks
gemini-2-5-flash       Gemini 2.5 Flash       Latest-generation flash with strong reasoning
gemini-2-5-flash-lite  Gemini 2.5 Flash Lite  Budget-friendly latest generation
gemini-pro             Gemini 2.5 Pro         Most capable Gemini for complex tasks

Claude (Anthropic)

Model ID       Underlying model  Description
claude-haiku   Claude Haiku 4.5  Fast and affordable Claude
claude-sonnet  Claude Sonnet 4   Balanced performance and cost
claude-opus    Claude Opus 4     Most capable Claude for demanding tasks
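
Switching providers is just a different model ID on the same endpoint. A quick sketch, reusing the Quickstart client setup:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-key-here",
    base_url="https://getinference.io/v1",
)

# Same call, different model IDs from the tables above.
for model in ["gemini-flash", "claude-sonnet"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(model, "->", response.choices[0].message.content)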

Chat Completions

POST /v1/chat/completions

Create a model response for a conversation. This is the primary endpoint for generating text.

Request body

Parameter    Type            Required  Description
model        string          Required  Model ID (e.g. gemini-flash)
messages     array           Required  Conversation messages with role and content
temperature  float           Optional  Sampling temperature, 0-2. Default: 1
max_tokens   integer         Optional  Maximum tokens to generate
stream       boolean         Optional  Enable streaming. Default: false
top_p        float           Optional  Nucleus sampling threshold, 0-1
stop         string | array  Optional  Stop sequences
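
The optional parameters compose freely. A sketch exercising the sampling controls above, with the client configured as in the Quickstart:

# Sketch: optional sampling parameters from the table above.
response = client.chat.completions.create(
    model="gemini-flash",
    messages=[{"role": "user", "content": "List three colors"}],
    temperature=0.2,   # lower = more deterministic
    top_p=0.9,         # nucleus sampling threshold
    max_tokens=64,     # hard cap on generated tokens
    stop=["\n\n"],     # stop at the first blank line
)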

Response

Returns a chat completion object with the model's response, token usage, and metadata.

PYTHON

response = client.chat.completions.create(
    model="gemini-flash",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What is quantum computing?"}
    ],
    temperature=0.7,
    max_tokens=1024
)

JAVASCRIPT

const response = await client.chat.completions.create({
  model: "gemini-flash",
  messages: [
    { role: "system", content: "You are helpful." },
    { role: "user", content: "What is quantum computing?" }
  ],
  temperature: 0.7,
  max_tokens: 1024
});

CURL

curl https://getinference.io/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-flash",
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "What is quantum computing?"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024
  }'

GETINFERENCE SDK

# Coming soon
response = client.chat(
    "What is quantum computing?",
    model="gemini-flash",
    system="You are helpful.",
    temperature=0.7,
    max_tokens=1024
)

RESPONSE

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gemini-flash",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing uses..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 156,
    "total_tokens": 180
  }
}
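
Continuing the Python request example above, a short sketch for pulling the text and token usage out of the SDK's completion object (fields mirror the JSON above):

text = response.choices[0].message.content
usage = response.usage
print(f"{usage.prompt_tokens} prompt + {usage.completion_tokens} completion "
      f"= {usage.total_tokens} tokens")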

Streaming

Set stream: true to receive partial responses as server-sent events. The response is delivered incrementally as the model generates tokens.

Each event contains a delta object with the new content. The stream ends with a [DONE] message.

PYTHON

stream = client.chat.completions.create(
    model="gemini-flash",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

JAVASCRIPT

const stream = await client.chat.completions.create({
  model: "gemini-flash",
  messages: [{ role: "user", content: "Write a poem" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(
    chunk.choices[0]?.delta?.content || ""
  );
}

CURL

curl https://getinference.io/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-flash",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "stream": true
  }'

STREAM OUTPUT

data: {"choices":[{"delta":{"content":"Once"}}]}

data: {"choices":[{"delta":{"content":" upon"}}]}

data: {"choices":[{"delta":{"content":" a"}}]}

data: [DONE]
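
If you're consuming the stream without an SDK, you parse the server-sent events yourself. A minimal sketch with Python's requests library, based on the event format shown above:

# Sketch: reading the SSE stream by hand. Each "data:" line carries a
# JSON chunk; the literal "[DONE]" payload terminates the stream.
import json
import requests

with requests.post(
    "https://getinference.io/v1/chat/completions",
    headers={"Authorization": "Bearer sk-your-key-here"},
    json={
        "model": "gemini-flash",
        "messages": [{"role": "user", "content": "Write a poem"}],
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip blank keep-alive lines
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        print(chunk["choices"][0]["delta"].get("content", ""), end="")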

List Models

GET /v1/models

Returns a list of all available models.

curl https://getinference.io/v1/models \
  -H "Authorization: Bearer sk-your-key-here"

Errors

The API returns standard HTTP status codes. Error responses include a JSON body with details.

Status  Meaning          What to do
400     Bad request      Check your request body format
401     Unauthorized     Check your API key
402     Budget exceeded  Buy more credits
404     Model not found  Check the model ID
429     Rate limited     Slow down, retry with backoff
500     Server error     Retry with exponential backoff
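
For the retryable statuses (429 and 500), exponential backoff is the usual pattern. A sketch assuming the OpenAI Python SDK's APIStatusError type; the helper name is hypothetical:

import time

from openai import OpenAI, APIStatusError

client = OpenAI(api_key="sk-your-key-here",
                base_url="https://getinference.io/v1")

def chat_with_retry(messages, model="gemini-flash", attempts=5):
    # Retry 429/500 with exponential backoff; re-raise everything else.
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except APIStatusError as e:
            if e.status_code not in (429, 500) or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...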

Rate Limits & Budgets

Each API key has a budget equal to your total purchased credits. When spend reaches the budget, requests return 402.

You can create up to 10 active API keys per account. All keys share the same account budget. Revoke unused keys from the dashboard.

Tip: Use separate keys for different projects or environments (dev, staging, prod) so you can track spend per key on the dashboard.