API reference

Arbiter exposes an OpenAI-compatible chat endpoint plus a handful of read-only observability endpoints and two control endpoints for demos. All paths are served by main.py. Examples assume the default http://localhost:8000.

Summary

Method & path	Purpose
`POST /v1/chat/completions`	The router. OpenAI-compatible; returns the completion plus an `arbiter` block.
`GET /health`	Liveness check.
`GET /v1/report`	Cumulative savings versus the baseline.
`GET /v1/policy`	What the router has learned per task type.
`GET /v1/recent`	The most recent routing decisions (newest first).
`GET /v1/overview`	Summary stats: pool size, classifier split, alert count.
`GET /v1/pricing`	The runtime's chat-surface catalog with list prices, each tagged `routable`.
`GET /v1/alerts`	Recent price-shift events.
`POST /v1/simulate-price`	Demo hook: scale a model's cost to imitate a re-price.
`POST /v1/reset`	Clear learned state and feeds for a fresh run.
`POST /v1/register`	Mint a client API key from an email (open, no auth).
`GET /v1/key`	Info and usage for the calling key.
`POST /v1/key/{pause,resume,revoke}`	Change the calling key's status.
`GET /`, `/app`, `/start`, `/docs`	The web app (see interface.md).

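/v1/chat/completions, /v1/reset and /v1/simulate-price require an Authorization: Bearer <key> header (see Client authentication); a missing or invalid key returns 401. The /v1/key endpoints are authenticated with the key they manage. /v1/register and the remaining read-only endpoints are open.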

POST /v1/chat/completions

The one endpoint your application calls. It accepts a standard OpenAI chat request. The model field is ignored - Arbiter selects the model. Every other field is passed through to the runtime unchanged, except the optional arbiter_max_cost budget field described below, which is stripped first.

Request

{
  "model": "anything",
  "messages": [{"role": "user", "content": "Calculate 6 * 7"}],
  "max_tokens": 15
}

Only messages is required (a 422 is returned otherwise).

Optional budget. Add "arbiter_max_cost": <usd> to cap the per-request cost. Arbiter estimates each model's cost for the request from list prices and routes only among models within the ceiling (falling back to the cheapest if none fit). The field is stripped before the request reaches the runtime.
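The request above, with a budget attached, could be built as follows. This is a minimal sketch using only Python's standard library; the helper name build_chat_request, the placeholder key arb_example, and the 0.0001 ceiling are illustrative assumptions, not part of Arbiter.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default from this page; adjust as needed


def build_chat_request(api_key, messages, max_tokens=None, max_cost=None):
    """Build (but do not send) a POST to /v1/chat/completions.

    Hypothetical helper for illustration; not part of Arbiter.
    """
    body = {"model": "anything", "messages": messages}  # model is ignored
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if max_cost is not None:
        body["arbiter_max_cost"] = max_cost  # per-request ceiling in USD
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


req = build_chat_request(
    "arb_example",  # placeholder key; use one minted by /v1/register
    [{"role": "user", "content": "Calculate 6 * 7"}],
    max_tokens=15,
    max_cost=0.0001,  # illustrative ceiling
)
```

Sending it is then a matter of passing req to urllib.request.urlopen and decoding the JSON body of the response.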

Response. The normal OpenAI completion body, with an added arbiter object describing the decision:

{
  "id": "chatcmpl_...",
  "choices": [{ "message": {"role": "assistant", "content": "42"}, "...": "..." }],
  "usage": { "total_tokens": 18, "...": "..." },
  "arbiter": {
    "task": "math",
    "classified_by": "rules",
    "model": "mistral-small-3.2-24b-instruct-2506",
    "mode": "explore",
    "reason": "gathering baseline data",
    "quality": 0.0,
    "quality_reason": "expected 42, missing",
    "cost": 5e-06,
    "baseline_cost": null,
    "saved": null,
    "tokens_needed": 18,
    "eligible_models": 9
  }
}

The arbiter block

Field	Meaning
`task`	Detected task type: `code`/`math`/`structured`/`factual`/`open`.
`classified_by`	How the task was decided: `rules`, `model`, or `model-fallback`.
`model`	The model Arbiter routed to.
`mode`	`explore` (still learning this model for the task) or `exploit`.
`reason`	Human-readable explanation of the routing decision.
`quality`	Score of this answer, 0..1.
`quality_reason`	How the score was derived (objective check, judge, or learned).
`cost`	The real charge for this call, from the `x-btl-customer-charge` response header (estimated on streaming calls; see Streaming).
`baseline_cost`	Learned mean cost of the baseline for this task, or `null` if not yet sampled.
`saved`	`baseline_cost - cost` for this call, or `null` if baseline unknown.
`tokens_needed`	Estimated tokens required (used by the context filter).
`eligible_models`	How many models passed the context and budget filters.
`budget_max_cost`	The per-request cost ceiling, if `arbiter_max_cost` was set (else `null`).
`budget_met`	Whether any model fit the budget (`false` means it fell back to the cheapest).
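A client might condense the block into a log line along these lines. This is a sketch only: describe_decision is a hypothetical helper, and it assumes the JSON body has already been decoded into a Python dict (so null arrives as None).

```python
def describe_decision(response):
    """Return a one-line summary of the arbiter block in a completion."""
    a = response["arbiter"]
    # saved is None until the baseline has been sampled for this task.
    saved = "n/a" if a.get("saved") is None else f"${a['saved']:.6f}"
    line = (
        f"{a['task']} -> {a['model']} ({a['mode']}), "
        f"quality {a['quality']:.2f}, cost ${a['cost']:.6f}, saved {saved}"
    )
    # budget_met False means no model fit and the cheapest was used instead.
    if a.get("budget_max_cost") is not None and a.get("budget_met") is False:
        line += ", over budget (cheapest fallback)"
    return line


sample = {
    "arbiter": {
        "task": "math",
        "model": "mistral-small-3.2-24b-instruct-2506",
        "mode": "explore",
        "quality": 0.0,
        "cost": 5e-06,
        "saved": None,
    }
}
summary = describe_decision(sample)
```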

Streaming

Set "stream": true and the response is a standard OpenAI SSE stream (text/event-stream) of chat.completion.chunk events, so any OpenAI streaming client works unchanged. Routing happens before the first token; the answer is scored and folded into the policy once the stream finishes.

Routing details are exposed two ways:

Response headers available immediately: X-Arbiter-Model, X-Arbiter-Task, X-Arbiter-Mode, X-Arbiter-Classified-By, X-Arbiter-Eligible.
A trailing arbiter event after the stream, carrying the final quality, cost, saved and cost_estimated. Strict clients stop at [DONE] and ignore it.

event: arbiter
data: {"task":"math","model":"...","quality":1.0,"cost":2e-06,"cost_estimated":true,"saved":4e-05}

Cost on streaming. The runtime does not report a cost header on streaming responses. When it is absent, Arbiter prices the call at the model's learned average cost for that task (measured from non-streaming calls) and flags it with cost_estimated: true; price-shift detection is skipped for those calls. Cost on non-streaming calls is always the real measured charge.
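A client that wants the trailing arbiter event has to keep reading past [DONE]. The sketch below shows one way to do that with a small hand-written SSE parser; parse_sse and collect are hypothetical helper names, and the chunk shape (choices[0].delta.content) is the standard OpenAI streaming format rather than anything specific to Arbiter.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines.

    Events without an "event:" field are reported as "message",
    which is the SSE default.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space from the field value.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:  # stream ended without a final blank line
        yield event, "\n".join(data)


def collect(lines):
    """Return (answer_text, arbiter_dict_or_None) for a whole stream."""
    text, arbiter = [], None
    for event, data in parse_sse(lines):
        if event == "arbiter":
            arbiter = json.loads(data)
        elif data != "[DONE]":  # do not stop at [DONE]; the trailer follows
            choices = json.loads(data).get("choices") or []
            if choices:
                text.append(choices[0].get("delta", {}).get("content") or "")
    return "".join(text), arbiter


# A hand-written stream for illustration, not captured server output.
stream = [
    'data: {"choices":[{"delta":{"content":"4"}}]}', "",
    'data: {"choices":[{"delta":{"content":"2"}}]}', "",
    "data: [DONE]", "",
    "event: arbiter",
    'data: {"task":"math","quality":1.0,"cost":2e-06,"cost_estimated":true}', "",
]
answer, arbiter = collect(stream)
```

Against a live server, the same functions can be fed the HTTP response line by line after decoding each line as UTF-8.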

GET /health

{ "status": "ok" }

GET /v1/report

Cumulative savings versus running everything on the baseline. Actual spend is exact; baseline spend re-prices each call at the baseline's measured mean cost per task (see strategies.md).

{
  "calls": 486,
  "actual_spend": 0.01650,
  "baseline_spend": 0.05770,
  "saved": 0.04120,
  "saved_pct": 71.4
}

saved_pct can be negative early on, while exploration is paying to learn.
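The example figures are consistent with saved being baseline_spend minus actual_spend and saved_pct being that difference as a share of baseline_spend. The sketch below assumes those formulas (they are inferred from the sample, not confirmed by the source) and shows how the percentage goes negative when actual spend exceeds the baseline.

```python
def savings(actual_spend, baseline_spend):
    """Return (saved, saved_pct) under the assumed formulas."""
    saved = baseline_spend - actual_spend
    # Guard against an empty report, where baseline_spend may be zero.
    pct = 100.0 * saved / baseline_spend if baseline_spend else 0.0
    return saved, pct


saved, pct = savings(0.01650, 0.05770)  # figures from the example above
early_saved, early_pct = savings(0.02, 0.01)  # made-up early-run figures
```

With the example figures, pct rounds to 71.4; with the made-up early-run figures, both values are negative.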

GET /v1/policy

What has been learned, grouped by task type. Each row is one model's running stats for that task.

{
  "code": [
    { "model": "mistral-small-3.2-24b-instruct-2506", "n": 1, "quality": 0.4, "avg_cost": 6e-06 }
  ]
}

Field	Meaning
`model`	Model id.
`n`	Number of observations for this task.
`quality`	Mean quality, 0..1 (`null` if `n` is 0).
`avg_cost`	Mean measured cost per call.
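For a quick look at the policy, a client could pick the highest-quality model per task, breaking ties by lower average cost. This ranking is purely illustrative and is not claimed to be the rule the router itself applies; best_by_quality and the model names in the sample are hypothetical.

```python
def best_by_quality(policy, min_n=1):
    """Map each task to its best-scoring model with at least min_n samples."""
    best = {}
    for task, rows in policy.items():
        # Skip rows with too few observations or no quality yet.
        seen = [r for r in rows if r["n"] >= min_n and r["quality"] is not None]
        if seen:
            top = max(seen, key=lambda r: (r["quality"], -r["avg_cost"]))
            best[task] = top["model"]
    return best


# Made-up rows; the shape of an unsampled row (n = 0) is an assumption.
policy = {
    "code": [
        {"model": "model-a", "n": 12, "quality": 0.9, "avg_cost": 4e-05},
        {"model": "model-b", "n": 9, "quality": 0.9, "avg_cost": 6e-06},
        {"model": "model-c", "n": 0, "quality": None, "avg_cost": None},
    ]
}
best = best_by_quality(policy)
```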

GET /v1/recent

The most recent routing decisions, newest first (bounded ring buffer).

[
  {
    "ts": 1783175779.47,
    "task": "factual",
    "classified_by": "model",
    "model": "mistral-small-3.2-24b-instruct-2506",
    "mode": "explore",
    "quality": 0.5,
    "cost": 6e-06,
    "saved": null
  }
]

GET /v1/overview

Summary stats for the dashboard beyond raw savings.

{
  "pool_size": 9,
  "classifier": { "rules": 2, "model": 2, "model-fallback": 0 },
  "alerts": 0,
  "active_price_overrides": {}
}

Field	Meaning
`pool_size`	Number of candidate models (baseline included).
`classifier`	Cumulative counts of how requests were classified.
`alerts`	Number of price-shift events recorded.
`active_price_overrides`	Any demo multipliers currently applied.

GET /v1/alerts

Recent price-shift events that forced a model to be re-learned, newest first.

[
  {
    "task": "math",
    "model": "mistral-small-3.2-24b-instruct-2506",
    "old_unit": 4.0e-08,
    "new_unit": 3.2e-07,
    "direction": "up",
    "ts": 1783175800.12
  }
]

old_unit / new_unit are cost-per-token before and after the shift.
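The size of a shift is the ratio of the two unit costs. A minimal sketch, with describe_alert as a hypothetical helper:

```python
def describe_alert(alert):
    """Summarize a price-shift event as model, direction, and factor."""
    factor = alert["new_unit"] / alert["old_unit"]
    return f"{alert['model']} ({alert['task']}): {alert['direction']} x{factor:.1f}"


alert = {
    "task": "math",
    "model": "mistral-small-3.2-24b-instruct-2506",
    "old_unit": 4.0e-08,
    "new_unit": 3.2e-07,
    "direction": "up",
}
line = describe_alert(alert)
```

For the example event the cost per token rose roughly eightfold.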

POST /v1/simulate-price

Demo hook. Scales a model's reported cost to imitate a provider re-price, so the price-shift re-routing can be shown on cue. Set the multiplier back to 1 to clear it.

Request

{ "model": "gpt-4o", "multiplier": 2.0 }

Response

{ "model": "gpt-4o", "multiplier": 2.0, "active": { "gpt-4o": 2.0 } }

model is required (422 otherwise).
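A demo script would typically apply an override and later clear it by sending a multiplier of 1. The sketch below only builds the two requests; price_override_request and the arb_example key are placeholders, and the default base URL from this page is assumed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default from this page


def price_override_request(api_key, model, multiplier):
    """Build (but do not send) a POST to /v1/simulate-price."""
    body = {"model": model, "multiplier": multiplier}
    return urllib.request.Request(
        BASE_URL + "/v1/simulate-price",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,  # protected endpoint
        },
        method="POST",
    )


apply_req = price_override_request("arb_example", "gpt-4o", 2.0)
clear_req = price_override_request("arb_example", "gpt-4o", 1)  # clears it
```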

POST /v1/reset

Clears the learned policy, the decision feed, the alerts, the classifier counters, and any price overrides. Useful before a clean demo run.

{ "status": "reset" }

POST /v1/register

Mint a client API key in exchange for an email. Open (no auth).

Request

{ "email": "you@example.com" }

Response

{ "api_key": "arb_..." }

A valid email is required (422 otherwise). Pass the returned key as Authorization: Bearer <key> on protected endpoints.
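The register-then-authenticate flow can be sketched as two small helpers. Both names are hypothetical, the base URL is the default from this page, and arb_example stands in for whatever key the server actually returns.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default from this page


def register_request(email):
    """Build (but do not send) a POST to /v1/register; no auth header needed."""
    return urllib.request.Request(
        BASE_URL + "/v1/register",
        data=json.dumps({"email": email}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def auth_headers(register_response):
    """Turn the decoded register response into headers for protected calls."""
    return {"Authorization": "Bearer " + register_response["api_key"]}


req = register_request("you@example.com")
headers = auth_headers({"api_key": "arb_example"})  # placeholder response
```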

Managing a key

Manage the calling key (authenticated with that key). GET /v1/key returns its email, status, and rolling usage:

{ "email": "you@example.com", "status": "active",
  "used_6h": 12, "limit_6h": 50, "used_week": 88, "limit_week": 600 }

POST /v1/key/pause, /v1/key/resume, and /v1/key/revoke change the status. A paused key still authenticates (so it can resume itself), but /v1/chat/completions returns 403 until the key is resumed. A revoked key stops authenticating entirely. Every routing feature (budgets, streaming, and the rest) is available through this same API; the web app is simply another client of it.
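A client can use the GET /v1/key body to decide whether a chat call is likely to be accepted. This is a sketch with a hypothetical helper name; it assumes the counters and limits are plain integers as in the example.

```python
def key_headroom(key_info):
    """Return remaining requests per window and whether routing should work."""
    left_6h = max(0, key_info["limit_6h"] - key_info["used_6h"])
    left_week = max(0, key_info["limit_week"] - key_info["used_week"])
    return {
        "left_6h": left_6h,
        "left_week": left_week,
        # Paused keys get 403 on chat calls; exhausted keys get 429.
        "can_route": key_info["status"] == "active"
        and left_6h > 0
        and left_week > 0,
    }


info = {
    "email": "you@example.com", "status": "active",
    "used_6h": 12, "limit_6h": 50, "used_week": 88, "limit_week": 600,
}
headroom = key_headroom(info)
```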

Errors

Validation errors return 422 (for example, a chat request with no messages, or a register call with a bad email). A protected endpoint called without a valid key returns 401. A minted key over its rate limit (50 requests per 6 hours or 600 per week) returns 429 with a Retry-After header.

When the runtime or an upstream provider rejects a routed call, Arbiter does not turn it into an opaque 500. It surfaces the upstream status directly - a 402 (out of credit), 429 (rate limited), or 400 (bad request) is passed through with a JSON detail describing the upstream error and the model that was tried:

{ "detail": { "upstream": "btl_runtime", "model": "gpt-4.1-mini", "error": { "...": "..." } } }

A failed call is not recorded into the policy, so a transient upstream error never poisons what the router has learned. If the runtime is unreachable, a 502 is returned instead.
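One way a client might tell these cases apart is sketched below. interpret_error is a hypothetical helper; it assumes the body has been decoded from JSON and that headers is a plain dict keyed by the canonical header name. Upstream errors are checked first because a passed-through 429 would otherwise be indistinguishable from Arbiter's own rate limit.

```python
def interpret_error(status, headers, body):
    """Map an error response to a short description of what went wrong."""
    detail = body.get("detail") if isinstance(body, dict) else None
    # Passed-through upstream errors carry an "upstream" field in detail.
    if isinstance(detail, dict) and "upstream" in detail:
        return (
            f"upstream {status} from {detail['upstream']} "
            f"for model {detail.get('model')}"
        )
    if status == 401:
        return "missing or invalid key"
    if status == 403:
        return "key is paused"
    if status == 422:
        return "validation error in the request"
    if status == 429:
        return f"rate limited; Retry-After: {headers.get('Retry-After', 'unknown')}"
    if status == 502:
        return "runtime unreachable"
    return f"unexpected status {status}"


upstream = interpret_error(
    402, {},
    {"detail": {"upstream": "btl_runtime", "model": "gpt-4.1-mini", "error": {}}},
)
limited = interpret_error(429, {"Retry-After": "120"}, {"detail": "rate limit"})
```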

GET / and the web app

/, /app, /start, and /docs serve the web interface (interface.md). If the interface has not been built, / serves the fallback single-file dashboard from static/index.html, which polls report, overview, recent, alerts, and policy on an interval.
