Arbiter Docs

API reference

Every endpoint and the arbiter response block.

Arbiter exposes an OpenAI-compatible chat endpoint plus a handful of read-only observability endpoints and two control endpoints for demos. All paths are served by main.py. Examples assume the default http://localhost:8000.

Summary

| Method & path | Purpose |
| --- | --- |
| POST /v1/chat/completions | The router. OpenAI-compatible; returns the completion plus an arbiter block. |
| GET /health | Liveness check. |
| GET /v1/report | Cumulative savings versus the baseline. |
| GET /v1/policy | What the router has learned per task type. |
| GET /v1/recent | The most recent routing decisions (newest first). |
| GET /v1/overview | Summary stats: pool size, classifier split, alert count. |
| GET /v1/pricing | The runtime's chat-surface catalog with list prices, each tagged routable. |
| GET /v1/alerts | Recent price-shift events. |
| POST /v1/simulate-price | Demo hook: scale a model's cost to imitate a re-price. |
| POST /v1/reset | Clear learned state and feeds for a fresh run. |
| POST /v1/register | Mint a client API key from an email (open, no auth). |
| GET /v1/key | Info and usage for the calling key. |
| POST /v1/key/{pause,resume,revoke} | Change the calling key's status. |
| GET /, /app, /start, /docs | The web app (see interface.md). |

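/v1/chat/completions, /v1/reset, and /v1/simulate-price require an Authorization: Bearer <key> header (see Client authentication); a missing or invalid key returns 401. The /v1/key endpoints are authenticated with the key they manage. /v1/register and the read-only observability endpoints are open.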
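As an illustration of the header described above, the following sketch builds a request with or without a key. The helper name build_request is hypothetical and not part of Arbiter; only the default base URL and the Authorization: Bearer <key> format come from this page.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000"  # default address assumed throughout this page


def build_request(
    method: str,
    path: str,
    api_key: Optional[str] = None,
    body: Optional[Dict[str, Any]] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a request; hypothetical helper, not an Arbiter API."""
    headers: Dict[str, str] = {}
    data = None
    if api_key is not None:
        # Protected endpoints expect the key as a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)
```

Passing the result to urllib.request.urlopen sends it; a protected endpoint called without api_key would be expected to answer 401.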


POST /v1/chat/completions

The one endpoint your application calls. It accepts a standard OpenAI chat request. The model field is ignored, because Arbiter selects the model itself. Every other field is passed through to the runtime unchanged.

Request

{
  "model": "anything",
  "messages": [{"role": "user", "content": "Calculate 6 * 7"}],
  "max_tokens": 15
}

Only messages is required (a 422 is returned otherwise).

Optional budget. Add "arbiter_max_cost": <usd> to cap the per-request cost. Arbiter estimates each model's cost for the request from list prices and routes only among models within the ceiling (falling back to the cheapest if none fit). The field is stripped before the request reaches the runtime.
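The budget filter described above can be sketched roughly as follows. This is a minimal illustration, not Arbiter's implementation: the function name, the single per-token price per model, and the token estimate are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def filter_by_budget(
    price_per_token: Dict[str, float],  # assumed: one flat list price per model
    tokens_needed: int,
    max_cost: float,
) -> Tuple[List[str], bool]:
    """Return (candidate models, budget_met) for one request."""
    if not price_per_token:
        raise ValueError("the model pool is empty")
    # Estimate what this request would cost on each model.
    estimates = {model: price * tokens_needed for model, price in price_per_token.items()}
    within = [model for model, cost in estimates.items() if cost <= max_cost]
    if within:
        return within, True
    # Nothing fits the ceiling: fall back to the single cheapest model.
    cheapest = min(estimates, key=estimates.get)
    return [cheapest], False
```

The second return value mirrors the idea of reporting whether the ceiling could be honoured or the router had to fall back.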

Response. The normal OpenAI completion body, with an added arbiter object describing the decision:

{
  "id": "chatcmpl_...",
  "choices": [{ "message": {"role": "assistant", "content": "42"}, "...": "..." }],
  "usage": { "total_tokens": 18, "...": "..." },
  "arbiter": {
    "task": "math",
    "classified_by": "rules",
    "model": "mistral-small-3.2-24b-instruct-2506",
    "mode": "explore",
    "reason": "gathering baseline data",
    "quality": 0.0,
    "quality_reason": "expected 42, missing",
    "cost": 5e-06,
    "baseline_cost": null,
    "saved": null,
    "tokens_needed": 18,
    "eligible_models": 9
  }
}

The arbiter block

| Field | Meaning |
| --- | --- |
| task | Detected task type: code, math, structured, factual, or open. |
| classified_by | How the task was decided: rules, model, or model-fallback. |
| model | The model Arbiter routed to. |
| mode | explore (still learning this model for the task) or exploit. |
| reason | Human-readable explanation of the routing decision. |
| quality | Score of this answer, 0..1. |
| quality_reason | How the score was derived (objective check, judge, or learned). |
| cost | The real charge for this call, from x-btl-customer-charge. |
| baseline_cost | Learned mean cost of the baseline for this task, or null if not yet sampled. |
| saved | baseline_cost - cost for this call, or null if baseline unknown. |
| tokens_needed | Estimated tokens required (used by the context filter). |
| eligible_models | How many models passed the context and budget filters. |
| budget_max_cost | The per-request cost ceiling, if arbiter_max_cost was set (else null). |
| budget_met | Whether any model fit the budget (false means it fell back to the cheapest). |
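A client reading the arbiter block has to allow for the nullable fields. The sketch below shows one way to do that; the helper names are hypothetical, and only the field names and the saved = baseline_cost - cost relationship are taken from this section.

```python
from typing import Any, Dict, Optional


def expected_saved(arbiter: Dict[str, Any]) -> Optional[float]:
    """Recompute saved as baseline_cost - cost, or None while the baseline is unknown."""
    baseline = arbiter.get("baseline_cost")
    if baseline is None:
        return None
    return baseline - arbiter["cost"]


def describe(arbiter: Dict[str, Any]) -> str:
    """One-line summary of a routing decision (illustrative formatting only)."""
    parts = [f'{arbiter["task"]} -> {arbiter["model"]} ({arbiter["mode"]})']
    saved = expected_saved(arbiter)
    parts.append("baseline not yet sampled" if saved is None else f"saved ${saved:.6f}")
    # budget_met may be absent or null when no ceiling was set, so test for False explicitly.
    if arbiter.get("budget_met") is False:
        parts.append("budget not met, fell back to cheapest")
    return "; ".join(parts)
```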

Streaming

Set "stream": true and the response is a standard OpenAI SSE stream (text/event-stream) of chat.completion.chunk events, so any OpenAI streaming client works unchanged. Routing happens before the first token; the answer is scored and folded into the policy once the stream finishes.

Routing details are exposed two ways:

  • Response headers available immediately: X-Arbiter-Model, X-Arbiter-Task, X-Arbiter-Mode, X-Arbiter-Classified-By, X-Arbiter-Eligible.
  • A trailing arbiter event after the stream, carrying the final quality, cost, saved and cost_estimated. Strict clients stop at [DONE] and ignore it.
event: arbiter
data: {"task":"math","model":"...","quality":1.0,"cost":2e-06,"cost_estimated":true,"saved":4e-05}

Cost on streaming. The runtime does not report a cost header on streaming responses. When it is absent, Arbiter prices the call at the model's learned average cost for that task (measured from non-streaming calls) and flags it with cost_estimated: true; price-shift detection is skipped for those calls. Cost on non-streaming calls is always the real measured charge.
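A client that wants the trailing arbiter event has to keep reading past [DONE]. The following sketch parses already-decoded SSE lines; it assumes each data field fits on one line and that chunks carry text in choices[].delta.content, as in the OpenAI chunk format. The function name is hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple


def read_arbiter_stream(lines: Iterable[str]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Collect the streamed text and the trailing arbiter event, if one arrives."""
    text_parts: List[str] = []
    arbiter: Optional[Dict[str, Any]] = None
    event_name = "message"  # SSE default when no "event:" field is sent
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            event_name = "message"  # a blank line ends the current event
            continue
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
            continue
        if not line.startswith("data:"):
            continue  # ignore comments and other SSE fields
        data = line[len("data:"):].strip()
        if event_name == "arbiter":
            arbiter = json.loads(data)
        elif data != "[DONE]":  # keep reading after [DONE] instead of stopping
            chunk = json.loads(data)
            for choice in chunk.get("choices", []):
                text_parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(text_parts), arbiter
```

When the returned block has cost_estimated set to true, its cost is the learned average rather than a measured charge.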


GET /health

{ "status": "ok" }

GET /v1/report

Cumulative savings versus running everything on the baseline. Actual spend is exact; baseline spend re-prices each call at the baseline's measured mean cost per task (see strategies.md).

{
  "calls": 486,
  "actual_spend": 0.01650,
  "baseline_spend": 0.05770,
  "saved": 0.04120,
  "saved_pct": 71.4
}

saved_pct can be negative early on, while exploration is paying to learn.
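The report arithmetic can be illustrated with a short sketch. It assumes saved_pct is saved divided by baseline_spend (which matches the example figures above) and that every task already has a measured baseline mean; the function name and input shapes are hypothetical.

```python
from typing import Any, Dict, List


def savings_report(calls: List[Dict[str, Any]], baseline_mean_cost: Dict[str, float]) -> Dict[str, Any]:
    """Re-price each call at the baseline's mean cost for its task and compare."""
    actual = sum(call["cost"] for call in calls)  # exact spend
    baseline = sum(baseline_mean_cost[call["task"]] for call in calls)  # re-priced spend
    saved = baseline - actual
    saved_pct = saved / baseline * 100 if baseline else 0.0
    return {
        "calls": len(calls),
        "actual_spend": actual,
        "baseline_spend": baseline,
        "saved": saved,
        "saved_pct": round(saved_pct, 1),
    }
```

If the routed calls cost more than the baseline would have, saved and saved_pct come out negative, which is the early-exploration case mentioned above.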


GET /v1/policy

What has been learned, grouped by task type. Each row is one model's running stats for that task.

{
  "code": [
    { "model": "mistral-small-3.2-24b-instruct-2506", "n": 1, "quality": 0.4, "avg_cost": 6e-06 }
  ]
}

| Field | Meaning |
| --- | --- |
| model | Model id. |
| n | Number of observations for this task. |
| quality | Mean quality, 0..1 (null if n is 0). |
| avg_cost | Mean measured cost per call. |

GET /v1/recent

The most recent routing decisions, newest first (bounded ring buffer).

[
  {
    "ts": 1783175779.47,
    "task": "factual",
    "classified_by": "model",
    "model": "mistral-small-3.2-24b-instruct-2506",
    "mode": "explore",
    "quality": 0.5,
    "cost": 6e-06,
    "saved": null
  }
]
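A bounded, newest-first feed like this one is commonly built on a fixed-size deque. The sketch below is one possible shape, not Arbiter's actual code; the class name and the default capacity of 50 are assumptions.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class RecentFeed:
    """Keep only the latest decisions, newest first (illustrative sketch)."""

    def __init__(self, capacity: int = 50) -> None:  # capacity is a made-up default
        self._items: Deque[Dict[str, Any]] = deque(maxlen=capacity)

    def record(self, decision: Dict[str, Any]) -> None:
        # appendleft puts the newest entry first; once full, the oldest
        # entry is dropped from the right-hand end automatically.
        self._items.appendleft(decision)

    def snapshot(self) -> List[Dict[str, Any]]:
        return list(self._items)
```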

GET /v1/overview

Summary stats for the dashboard beyond raw savings.

{
  "pool_size": 9,
  "classifier": { "rules": 2, "model": 2, "model-fallback": 0 },
  "alerts": 0,
  "active_price_overrides": {}
}

| Field | Meaning |
| --- | --- |
| pool_size | Number of candidate models (baseline included). |
| classifier | Cumulative counts of how requests were classified. |
| alerts | Number of price-shift events recorded. |
| active_price_overrides | Any demo multipliers currently applied. |

GET /v1/alerts

Recent price-shift events that forced a model to be re-learned, newest first.

[
  {
    "task": "math",
    "model": "mistral-small-3.2-24b-instruct-2506",
    "old_unit": 4.0e-08,
    "new_unit": 3.2e-07,
    "direction": "up",
    "ts": 1783175800.12
  }
]

old_unit / new_unit are cost-per-token before and after the shift.
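As a small worked example, the size of a shift can be derived from the two unit costs. The helper names below are hypothetical; only the alert fields come from the response shown above.

```python
from typing import Any, Dict


def shift_factor(alert: Dict[str, Any]) -> float:
    """Ratio of the new cost-per-token to the old one."""
    return alert["new_unit"] / alert["old_unit"]


def format_alert(alert: Dict[str, Any]) -> str:
    """Render one alert as a short line for a log or dashboard."""
    return (
        f'{alert["task"]}/{alert["model"]}: '
        f'unit cost {alert["direction"]} {shift_factor(alert):.1f}x'
    )
```

For the sample alert, the unit cost moved from 4.0e-08 to 3.2e-07, roughly an eightfold increase.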


POST /v1/simulate-price

Demo hook. Scales a model's reported cost to imitate a provider re-price, so the price-shift re-routing can be shown on cue. Set the multiplier back to 1 to clear it.

Request

{ "model": "gpt-4o", "multiplier": 2.0 }

Response

{ "model": "gpt-4o", "multiplier": 2.0, "active": { "gpt-4o": 2.0 } }

model is required (422 otherwise).
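The override behaviour can be pictured as a map of model to multiplier that is consulted whenever a cost is reported. This is a sketch under that assumption; the function names are hypothetical, and the real bookkeeping may differ.

```python
from typing import Dict


def set_multiplier(active: Dict[str, float], model: str, multiplier: float) -> Dict[str, float]:
    """Return a new override map; a multiplier of 1 clears the model's entry."""
    updated = dict(active)
    if multiplier == 1:
        updated.pop(model, None)
    else:
        updated[model] = multiplier
    return updated


def apply_override(active: Dict[str, float], model: str, reported_cost: float) -> float:
    """Scale a reported cost by the model's multiplier (1.0 when none is set)."""
    return reported_cost * active.get(model, 1.0)
```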


POST /v1/reset

Clears the learned policy, the decision feed, the alerts, the classifier counters, and any price overrides. Useful before a clean demo run.

{ "status": "reset" }

POST /v1/register

Mint a client API key in exchange for an email. Open (no auth).

Request

{ "email": "you@example.com" }

Response

{ "api_key": "arb_..." }

A valid email is required (422 otherwise). Pass the returned key as Authorization: Bearer <key> on protected endpoints.

Managing a key

Manage the calling key (authenticated with that key). GET /v1/key returns its email, status, and rolling usage:

{ "email": "you@example.com", "status": "active",
  "used_6h": 12, "limit_6h": 50, "used_week": 88, "limit_week": 600 }

POST /v1/key/pause, /v1/key/resume, and /v1/key/revoke change the status. A paused key still authenticates (so it can resume itself), but /v1/chat/completions returns 403 until it is resumed. A revoked key stops authenticating entirely. The web app drives this same API, so every routing feature (budgets, streaming, and the rest) is available to any caller.
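The key lifecycle above amounts to a small state machine. The sketch below is illustrative only: the status names paused and revoked are inferred from the endpoint names, and the functions are hypothetical rather than Arbiter's own.

```python
from typing import Optional

# Assumed mapping from the three actions to the status each one produces.
TRANSITIONS = {"pause": "paused", "resume": "active", "revoke": "revoked"}


def apply_action(status: str, action: str) -> str:
    """Apply pause, resume, or revoke to a key that can still authenticate."""
    if status == "revoked":
        # A revoked key no longer authenticates, so it cannot change itself.
        raise PermissionError("revoked keys cannot be managed")
    return TRANSITIONS[action]


def chat_status(status: Optional[str]) -> int:
    """HTTP status a chat request would get for a key in this state (None = unknown key)."""
    if status is None or status == "revoked":
        return 401
    if status == "paused":
        return 403
    return 200
```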

Errors

Validation errors return 422 (for example, a chat request with no messages, or a register call with a bad email). A protected endpoint called without a valid key returns 401, and a paused key gets 403 from /v1/chat/completions until it is resumed. A minted key over its rate limit (50 requests per 6 hours or 600 per week) returns 429 with a Retry-After header.

When the runtime or an upstream provider rejects a routed call, Arbiter does not turn it into an opaque 500. It passes the upstream status through: a 402 (out of credit), 429 (rate limited), or 400 (bad request) is returned with a JSON detail describing the upstream error and the model that was tried:

{ "detail": { "upstream": "btl_runtime", "model": "gpt-4.1-mini", "error": { "...": "..." } } }

A failed call is not recorded into the policy, so a transient upstream error never poisons what the router has learned. If the runtime is unreachable, a 502 is returned instead.
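Because a 429 can come either from Arbiter's own rate limit or from upstream, a client may want to tell the two apart. The sketch below does so by looking for the upstream key in detail; the function name and result shape are hypothetical, header names are assumed to be already normalized, and only the delta-seconds form of Retry-After is handled.

```python
from typing import Any, Dict


def classify_error(status: int, body: Dict[str, Any], headers: Dict[str, str]) -> Dict[str, Any]:
    """Summarize an error response for retry or alerting logic (illustrative only)."""
    detail = body.get("detail")
    from_upstream = isinstance(detail, dict) and "upstream" in detail
    result: Dict[str, Any] = {
        "status": status,
        "source": "upstream" if from_upstream else "arbiter",
        "model": detail.get("model") if from_upstream else None,
        "retry_after": None,
    }
    retry_after = headers.get("Retry-After")
    # Only the integer-seconds form is parsed; an HTTP-date value is left as None.
    if status == 429 and retry_after is not None and retry_after.isdigit():
        result["retry_after"] = int(retry_after)
    return result
```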

GET / and the web app

/, /app, /start, and /docs serve the web interface (interface.md). If the interface has not been built, / serves the fallback single-file dashboard from static/index.html, which polls report, overview, recent, alerts, and policy on an interval.
