API reference
Every endpoint and the arbiter response block.
Arbiter exposes an OpenAI-compatible chat endpoint plus a handful of read-only
observability endpoints and two control endpoints for demos. All paths are
served by main.py. Examples assume the default http://localhost:8000.
Summary
| Method & path | Purpose |
|---|---|
| POST /v1/chat/completions | The router. OpenAI-compatible; returns the completion plus an arbiter block. |
| GET /health | Liveness check. |
| GET /v1/report | Cumulative savings versus the baseline. |
| GET /v1/policy | What the router has learned per task type. |
| GET /v1/recent | The most recent routing decisions (newest first). |
| GET /v1/overview | Summary stats: pool size, classifier split, alert count. |
| GET /v1/pricing | The runtime's chat-surface catalog with list prices, each tagged routable. |
| GET /v1/alerts | Recent price-shift events. |
| POST /v1/simulate-price | Demo hook: scale a model's cost to imitate a re-price. |
| POST /v1/reset | Clear learned state and feeds for a fresh run. |
| POST /v1/register | Mint a client API key from an email (open, no auth). |
| GET /v1/key | Info and usage for the calling key. |
| POST /v1/key/{pause,resume,revoke} | Change the calling key's status. |
| GET /, /app, /start, /docs | The web app (see interface.md). |
/v1/chat/completions, /v1/reset and /v1/simulate-price require an
Authorization: Bearer <key> header (see Client authentication);
a missing or invalid key returns 401. The /v1/key endpoints are authenticated
with the key they manage. The remaining read-only endpoints are open.
POST /v1/chat/completions
The one endpoint your application calls. It accepts a standard OpenAI chat
request. The model field is ignored - Arbiter selects the model. Every
other field is passed through to the runtime unchanged.
Request
{
"model": "anything",
"messages": [{"role": "user", "content": "Calculate 6 * 7"}],
"max_tokens": 15
}
Only messages is required (a 422 is returned otherwise).
Optional budget. Add "arbiter_max_cost": <usd> to cap the per-request cost.
Arbiter estimates each model's cost for the request from list prices and routes
only among models within the ceiling (falling back to the cheapest if none fit).
The field is stripped before the request reaches the runtime.
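The budget rule described above can be sketched as a small filter. This is a minimal illustration, not Arbiter's actual code: the function names, the flat per-token pricing model, and the example model names and prices are all assumptions made for the sketch.

```python
from typing import Dict, List, Tuple


def estimate_cost(price_per_token: float, tokens_needed: int) -> float:
    """Estimate a request's cost from a list price (assumed flat per token)."""
    return price_per_token * tokens_needed


def filter_by_budget(
    list_prices: Dict[str, float], tokens_needed: int, max_cost: float
) -> Tuple[List[str], bool]:
    """Return (candidate models, budget_met) for a non-empty model pool.

    Models whose estimated cost fits under max_cost stay eligible. If none
    fit, fall back to the single cheapest model and report budget_met=False.
    """
    estimates = {
        model: estimate_cost(price, tokens_needed)
        for model, price in list_prices.items()
    }
    within = [model for model, cost in estimates.items() if cost <= max_cost]
    if within:
        return sorted(within), True
    cheapest = min(estimates, key=estimates.get)
    return [cheapest], False


# Hypothetical list prices in USD per token, for illustration only.
PRICES = {"small-model": 1e-7, "large-model": 5e-6}
```

With these made-up prices and 18 tokens, a ceiling of 1e-5 leaves only small-model eligible, while a ceiling of 1e-7 fits nothing and falls back to small-model with budget_met set to false.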
Response. The normal OpenAI completion body, with an added arbiter object
describing the decision:
{
"id": "chatcmpl_...",
"choices": [{ "message": {"role": "assistant", "content": "42"}, "...": "..." }],
"usage": { "total_tokens": 18, "...": "..." },
"arbiter": {
"task": "math",
"classified_by": "rules",
"model": "mistral-small-3.2-24b-instruct-2506",
"mode": "explore",
"reason": "gathering baseline data",
"quality": 0.0,
"quality_reason": "expected 42, missing",
"cost": 5e-06,
"baseline_cost": null,
"saved": null,
"tokens_needed": 18,
"eligible_models": 9
}
}
The arbiter block
| Field | Meaning |
|---|---|
| task | Detected task type: code/math/structured/factual/open. |
| classified_by | How the task was decided: rules, model, or model-fallback. |
| model | The model Arbiter routed to. |
| mode | explore (still learning this model for the task) or exploit. |
| reason | Human-readable explanation of the routing decision. |
| quality | Score of this answer, 0..1. |
| quality_reason | How the score was derived (objective check, judge, or learned). |
| cost | The real charge for this call, from x-btl-customer-charge. |
| baseline_cost | Learned mean cost of the baseline for this task, or null if not yet sampled. |
| saved | baseline_cost - cost for this call, or null if baseline unknown. |
| tokens_needed | Estimated tokens required (used by the context filter). |
| eligible_models | How many models passed the context and budget filters. |
| budget_max_cost | The per-request cost ceiling, if arbiter_max_cost was set (else null). |
| budget_met | Whether any model fit the budget (false means it fell back to the cheapest). |
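As a rough client-side sketch, the request and the arbiter block might be handled like this using only the Python standard library. The helper names build_chat_request and describe_decision are hypothetical, and the key value is a placeholder; the URL, header, and field names come from this reference.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

BASE_URL = "http://localhost:8000"  # default address assumed by this reference


def build_chat_request(
    api_key: str,
    messages: List[Dict[str, str]],
    max_cost: Optional[float] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    body: Dict[str, Any] = {"model": "anything", "messages": messages}
    if max_cost is not None:
        body["arbiter_max_cost"] = max_cost  # optional per-request ceiling in USD
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def describe_decision(response: Dict[str, Any]) -> str:
    """Summarize the arbiter block of a parsed completion response."""
    arb = response["arbiter"]
    saved = arb.get("saved")
    # saved is null until the baseline has been sampled for this task.
    saved_text = "baseline not yet sampled" if saved is None else f"saved ${saved:.6f}"
    return (
        f"{arb['task']} -> {arb['model']} ({arb['mode']}), "
        f"quality {arb['quality']}, {saved_text}"
    )
```

To send the request, pass it to urllib.request.urlopen and decode the body with json.load before handing it to describe_decision.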
Streaming
Set "stream": true and the response is a standard OpenAI SSE stream
(text/event-stream) of chat.completion.chunk events, so any OpenAI streaming
client works unchanged. Routing happens before the first token; the answer is
scored and folded into the policy once the stream finishes.
Routing details are exposed two ways:
- Response headers available immediately: X-Arbiter-Model, X-Arbiter-Task, X-Arbiter-Mode, X-Arbiter-Classified-By, X-Arbiter-Eligible.
- A trailing arbiter event after the stream, carrying the final quality, cost, saved and cost_estimated. Strict clients stop at [DONE] and ignore it.
event: arbiter
data: {"task":"math","model":"...","quality":1.0,"cost":2e-06,"cost_estimated":true,"saved":4e-05}
Cost on streaming. The runtime does not report a cost header on streaming
responses. When it is absent, Arbiter prices the call at the model's learned
average cost for that task (measured from non-streaming calls) and flags it with
cost_estimated: true; price-shift detection is skipped for those calls. Cost on
non-streaming calls is always the real measured charge.
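A client that wants the trailing arbiter event has to keep reading after [DONE]. The sketch below shows one way to do that over already-decoded SSE lines. The function name parse_arbiter_stream is hypothetical, and the parser assumes the standard OpenAI chunk shape (choices[].delta.content) and that each data payload fits on one line.

```python
import json
from typing import Any, Dict, Iterable, Optional, Tuple


def parse_arbiter_stream(lines: Iterable[str]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Return (assembled text, arbiter payload or None) from SSE lines."""
    text_parts = []
    arbiter = None
    event = "message"  # SSE default event name when no "event:" field is sent
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            event = "message"  # a blank line terminates the current event
            continue
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = line[len("data:"):].strip()
            if event == "arbiter":
                arbiter = json.loads(data)  # final quality, cost, saved, cost_estimated
            elif data != "[DONE]":  # do not stop at [DONE]; the arbiter event follows it
                chunk = json.loads(data)
                for choice in chunk.get("choices", []):
                    text_parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(text_parts), arbiter
```

If cost_estimated is true in the returned payload, the cost is the learned average rather than a measured charge, as described above.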
GET /health
{ "status": "ok" }
GET /v1/report
Cumulative savings versus running everything on the baseline. Actual spend is exact; baseline spend re-prices each call at the baseline's measured mean cost per task (see strategies.md).
{
"calls": 486,
"actual_spend": 0.01650,
"baseline_spend": 0.05770,
"saved": 0.04120,
"saved_pct": 71.4
}
saved_pct can be negative early on, while exploration is paying to learn.
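The sample figures are consistent with saved being baseline_spend minus actual_spend, and saved_pct being saved as a percentage of baseline_spend. The sketch below reproduces that arithmetic. The formula is inferred from the example rather than stated by the source, and returning 0.0 for a zero baseline is an assumption.

```python
from typing import Dict


def savings_report(actual_spend: float, baseline_spend: float) -> Dict[str, float]:
    """Derive saved and saved_pct from the two cumulative spend totals."""
    saved = baseline_spend - actual_spend
    # Assumption: report 0.0 percent when there is no baseline spend yet.
    saved_pct = 100.0 * saved / baseline_spend if baseline_spend else 0.0
    return {"saved": round(saved, 5), "saved_pct": round(saved_pct, 1)}


report = savings_report(actual_spend=0.01650, baseline_spend=0.05770)
early = savings_report(actual_spend=0.02, baseline_spend=0.01)  # exploration cost more
```

The second call illustrates the negative case: when actual spend exceeds the re-priced baseline, both saved and saved_pct come out below zero.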
GET /v1/policy
What has been learned, grouped by task type. Each row is one model's running stats for that task.
{
"code": [
{ "model": "mistral-small-3.2-24b-instruct-2506", "n": 1, "quality": 0.4, "avg_cost": 6e-06 }
]
}
| Field | Meaning |
|---|---|
| model | Model id. |
| n | Number of observations for this task. |
| quality | Mean quality, 0..1 (null if n is 0). |
| avg_cost | Mean measured cost per call. |
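One plausible way to maintain such a row is an incremental mean, which updates n, quality, and avg_cost without storing every observation. This is a sketch under that assumption; the PolicyRow class and its observe method are hypothetical names, not Arbiter's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyRow:
    """Running stats for one model on one task type."""

    model: str
    n: int = 0
    quality: Optional[float] = None  # null until the first observation
    avg_cost: float = 0.0

    def observe(self, quality: float, cost: float) -> None:
        """Fold one scored call into the running means."""
        self.n += 1
        prev_quality = self.quality if self.quality is not None else 0.0
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.quality = prev_quality + (quality - prev_quality) / self.n
        self.avg_cost += (cost - self.avg_cost) / self.n


row = PolicyRow("mistral-small-3.2-24b-instruct-2506")
row.observe(quality=0.4, cost=6e-06)  # matches the single-observation sample row
```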
GET /v1/recent
The most recent routing decisions, newest first (bounded ring buffer).
[
{
"ts": 1783175779.47,
"task": "factual",
"classified_by": "model",
"model": "mistral-small-3.2-24b-instruct-2506",
"mode": "explore",
"quality": 0.5,
"cost": 6e-06,
"saved": null
}
]
GET /v1/overview
Summary stats for the dashboard beyond raw savings.
{
"pool_size": 9,
"classifier": { "rules": 2, "model": 2, "model-fallback": 0 },
"alerts": 0,
"active_price_overrides": {}
}
| Field | Meaning |
|---|---|
| pool_size | Number of candidate models (baseline included). |
| classifier | Cumulative counts of how requests were classified. |
| alerts | Number of price-shift events recorded. |
| active_price_overrides | Any demo multipliers currently applied. |
GET /v1/alerts
Recent price-shift events that forced a model to be re-learned, newest first.
[
{
"task": "math",
"model": "mistral-small-3.2-24b-instruct-2506",
"old_unit": 4.0e-08,
"new_unit": 3.2e-07,
"direction": "up",
"ts": 1783175800.12
}
]
old_unit / new_unit are cost-per-token before and after the shift.
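The source does not say how a shift is detected, so the following is only a guess at the general shape: compare the observed cost per token with the learned one and flag a change beyond some ratio. The function name detect_price_shift and the 1.5x threshold are assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional


def detect_price_shift(
    old_unit: float, cost: float, tokens: int, ratio_threshold: float = 1.5
) -> Optional[Dict[str, Any]]:
    """Return an alert-like dict if cost per token moved past the threshold.

    ratio_threshold is a hypothetical tuning value, not a documented one.
    """
    if tokens <= 0 or old_unit <= 0:
        return None  # nothing to compare against
    new_unit = cost / tokens
    ratio = new_unit / old_unit
    if ratio >= ratio_threshold:
        direction = "up"
    elif ratio <= 1.0 / ratio_threshold:
        direction = "down"
    else:
        return None  # within normal variation
    return {"old_unit": old_unit, "new_unit": new_unit, "direction": direction}


# Mirrors the sample alert: 4.0e-08 per token rising to about 3.2e-07.
alert = detect_price_shift(old_unit=4.0e-08, cost=3.2e-05, tokens=100)
```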
POST /v1/simulate-price
Demo hook. Scales a model's reported cost to imitate a provider re-price, so the
price-shift re-routing can be shown on cue. Set the multiplier back to 1 to
clear it.
Request
{ "model": "gpt-4o", "multiplier": 2.0 }
Response
{ "model": "gpt-4o", "multiplier": 2.0, "active": { "gpt-4o": 2.0 } }
model is required (422 otherwise).
POST /v1/reset
Clears the learned policy, the decision feed, the alerts, the classifier counters, and any price overrides. Useful before a clean demo run.
{ "status": "reset" }
POST /v1/register
Mint a client API key in exchange for an email. Open (no auth).
Request
{ "email": "you@example.com" }
Response
{ "api_key": "arb_..." }
A valid email is required (422 otherwise). Pass the returned key as
Authorization: Bearer <key> on protected endpoints.
Managing a key
Manage the calling key (authenticated with that key). GET /v1/key returns its
email, status, and rolling usage:
{ "email": "you@example.com", "status": "active",
"used_6h": 12, "limit_6h": 50, "used_week": 88, "limit_week": 600 }
POST /v1/key/pause, /resume, /revoke change the status. A paused key still
authenticates (so it can resume itself) but /v1/chat/completions returns 403
until resumed. A revoked key stops authenticating entirely. Every routing
feature - budgets, streaming, and the rest - is on the same API a caller uses;
the web app just drives it.
Errors
Validation errors return 422 (for example, a chat request with no messages,
or a register call with a bad email). A protected endpoint called without a valid
key returns 401. A minted key over its rate limit (50 requests per 6 hours or
600 per week) returns 429 with a Retry-After header.
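A rolling two-window limit like this can be sketched with a queue of request timestamps. The class below is an illustration of the idea, not Arbiter's implementation: the name KeyRateLimiter, the choice to pass the clock in as now, and the way the Retry-After value is derived are all assumptions.

```python
import math
from collections import deque

SIX_HOURS = 6 * 60 * 60
ONE_WEEK = 7 * 24 * 60 * 60


class KeyRateLimiter:
    """Rolling-window limiter for one key: 50 per 6 hours, 600 per week."""

    def __init__(self, limit_6h: int = 50, limit_week: int = 600) -> None:
        self.limit_6h = limit_6h
        self.limit_week = limit_week
        self._stamps = deque()  # accepted request times, oldest first

    def check(self, now: float) -> int:
        """Return 0 and record the request if allowed, else seconds to wait."""
        # Drop requests that have aged out of the weekly window.
        while self._stamps and self._stamps[0] <= now - ONE_WEEK:
            self._stamps.popleft()
        recent = [t for t in self._stamps if t > now - SIX_HOURS]
        if len(self._stamps) >= self.limit_week:
            # Wait until the oldest request of the week expires.
            return math.ceil(self._stamps[0] + ONE_WEEK - now)
        if len(recent) >= self.limit_6h:
            # Wait until the oldest request of the last 6 hours expires.
            return math.ceil(recent[0] + SIX_HOURS - now)
        self._stamps.append(now)
        return 0
```

A non-zero return value is what a server in this sketch would send as Retry-After alongside the 429. Passing now explicitly keeps the logic easy to exercise without waiting on a real clock.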
When the runtime or an upstream provider rejects a routed call, Arbiter does not
turn it into an opaque 500. It surfaces the upstream status directly - a 402
(out of credit), 429 (rate limited), or 400 (bad request) is passed through
with a JSON detail describing the upstream error and the model that was tried:
{ "detail": { "upstream": "btl_runtime", "model": "gpt-4.1-mini", "error": { "...": "..." } } }
A failed call is not recorded into the policy, so a transient upstream error
never poisons what the router has learned. If the runtime is unreachable, a 502
is returned instead.
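On the client side, the status codes in this section can be told apart with a small helper. The sketch below assumes that only passed-through upstream errors carry a detail object with an upstream key, which is how it separates an upstream 429 from Arbiter's own key rate limit. The function name classify_error is hypothetical.

```python
from typing import Any, Dict, Optional


def classify_error(status: int, body: Optional[Dict[str, Any]] = None) -> str:
    """Map an error status and parsed JSON body to a short description."""
    detail = (body or {}).get("detail")
    if isinstance(detail, dict) and "upstream" in detail:
        # Passed-through rejection from the runtime or an upstream provider.
        model = detail.get("model", "unknown model")
        meaning = {402: "out of credit", 429: "rate limited", 400: "bad request"}
        return f"upstream {meaning.get(status, 'rejected')} ({model})"
    # Errors raised by Arbiter itself, per this reference.
    local = {
        401: "missing or invalid key",
        403: "key paused",
        422: "validation error",
        429: "key rate limit reached",
        502: "runtime unreachable",
    }
    return local.get(status, f"unexpected status {status}")
```

A caller might retry on a key rate limit after the Retry-After interval, but treat an upstream 402 as something to fix outside the request loop.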
GET / and the web app
/, /app, /start, and /docs serve the web interface
(interface.md). If the interface has not been built, / serves
the fallback single-file dashboard from static/index.html, which polls
report, overview, recent, alerts, and policy on an interval.