Arbiter Docs

Configuration

Environment variables, model registry, and tunables.

Arbiter has three layers of configuration: environment variables (deployment settings), the model registry (which models it may route to), and a handful of code-level tunables (how it decides). Environment variables are the only layer you can change without touching code.

Environment variables

config.py loads these from .env; values set in the real environment take precedence over the file. See .env.example.

| Variable | Default | Purpose |
|---|---|---|
| GATEWAY_API_KEY | none (required) | Your BTL machine key. Every runtime call uses it. Arbiter refuses to start a request without it. |
| BTL_BASE_URL | https://api.badtheorylabs.com/v1 | The runtime's OpenAI-compatible base URL. |
| BASELINE_MODEL | gpt-4o | The premium model savings are measured against. Must be an OpenAI-surface route (not an Anthropic-direct model). |
| BASELINE_CONTEXT | 128000 | Assumed context window for the baseline, in tokens, used by the context filter. Change it if you point BASELINE_MODEL at a model with a different window. |
| REQUEST_TIMEOUT | 120 | Per-request timeout to the runtime, in seconds. |
| ARBITER_DB | data/arbiter.db | Path to the SQLite policy store. Point it at a mounted volume in production so learned state survives redeploys. |
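
The precedence rule (defaults, then .env, then the real environment) can be sketched in a few lines of standard-library Python. This is an illustration, not the contents of config.py: `parse_dotenv` and `load_settings` are hypothetical names, and the sketch fails at load time when the key is missing, whereas Arbiter is described as refusing per request.

```python
import os

# Defaults copied from the table above. GATEWAY_API_KEY has none: it is required.
DEFAULTS = {
    "BTL_BASE_URL": "https://api.badtheorylabs.com/v1",
    "BASELINE_MODEL": "gpt-4o",
    "BASELINE_CONTEXT": "128000",
    "REQUEST_TIMEOUT": "120",
    "ARBITER_DB": "data/arbiter.db",
}
KNOWN_KEYS = set(DEFAULTS) | {"GATEWAY_API_KEY"}


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(dotenv_text="", environ=None):
    """Merge defaults < .env < real environment (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    merged = dict(DEFAULTS)
    # Values from the .env file override the built-in defaults.
    merged.update({k: v for k, v in parse_dotenv(dotenv_text).items() if k in KNOWN_KEYS})
    # Values from the real environment override the .env file.
    merged.update({k: v for k, v in environ.items() if k in KNOWN_KEYS})
    if not merged.get("GATEWAY_API_KEY"):
        raise RuntimeError("GATEWAY_API_KEY is required")
    # Numeric settings arrive as strings and are converted once here.
    merged["BASELINE_CONTEXT"] = int(merged["BASELINE_CONTEXT"])
    merged["REQUEST_TIMEOUT"] = float(merged["REQUEST_TIMEOUT"])
    return merged
```

Passing `environ` explicitly keeps the sketch testable without touching the process environment.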

The model registry

The pool of models Arbiter may route to lives in models.py:

  • CANDIDATES - the list of routable models, each a ModelSpec(id, tier, context, in_price, out_price).
  • BASELINE - the premium baseline, built from BASELINE_MODEL / BASELINE_CONTEXT.

ModelSpec("deepseek-chat-v3", "small", 128_000, 0.20, 0.80)
#          id                  tier     context   in    out ($/1M tokens)

To add or remove a model, edit CANDIDATES. Each entry needs:

  • an id that answers on the /v1/chat/completions surface - verify with the runtime's GET /v1/models, and confirm it isn't an Anthropic-direct route (those need /v1/messages and are out of scope);
  • a tier (small/mid/large) - a rough prior that affects only the order in which models are explored, not the final choice;
  • the context window in tokens, which the filter uses to decide eligibility;
  • the input/output list prices ($/1M tokens), used by the budget filter.
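
To make the role of each field concrete, here is a minimal sketch of a ModelSpec and of how the context and budget filters might use it. The field names follow the signature shown earlier; `estimated_cost`, `eligible`, the `budget` parameter, and the "example-large" entry are hypothetical, and the real filters in Arbiter may apply different rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    id: str
    tier: str          # "small", "mid", or "large"
    context: int       # context window, in tokens
    in_price: float    # $ per 1M input tokens
    out_price: float   # $ per 1M output tokens


def estimated_cost(spec, prompt_tokens, max_output_tokens):
    """List-price cost in dollars for one call."""
    total = prompt_tokens * spec.in_price + max_output_tokens * spec.out_price
    return total / 1_000_000


def eligible(candidates, prompt_tokens, max_output_tokens, budget=None):
    """Keep models whose window fits the request and whose cost fits the budget."""
    kept = []
    for spec in candidates:
        # Context filter: prompt plus requested output must fit the window.
        if prompt_tokens + max_output_tokens > spec.context:
            continue
        # Budget filter (assumed form): skip models whose list-price cost is too high.
        if budget is not None and estimated_cost(spec, prompt_tokens, max_output_tokens) > budget:
            continue
        kept.append(spec)
    return kept


CANDIDATES = [
    ModelSpec("deepseek-chat-v3", "small", 128_000, 0.20, 0.80),
    ModelSpec("example-large", "large", 32_000, 5.00, 15.00),  # hypothetical entry
]
```

With these entries, `eligible(CANDIDATES, 40_000, 1_000)` drops the hypothetical 32k-window model and keeps only deepseek-chat-v3, at an estimated list-price cost of about $0.0088 for the call.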

Keeping the pool modest matters: every new model adds MIN_SAMPLES exploration calls per task type before the policy can trust it. A handful of models spread well across tiers and price points routes better and cheaper than a huge pool.
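
The cold-start cost grows with the product of pool size and task types, which a one-line calculation makes visible. The function name and the count of six task types are assumptions for illustration; the result is a lower bound, since it ignores any later re-exploration.

```python
MIN_SAMPLES = 2  # default from policy.py


def cold_start_calls(n_models, n_task_types, min_samples=MIN_SAMPLES):
    """Minimum exploration calls before every model is trusted on every task type."""
    return n_models * n_task_types * min_samples


small_pool = cold_start_calls(5, 6)    # 5 models, 6 task types (assumed count)
large_pool = cold_start_calls(20, 6)   # 20 models, same task types
print(small_pool, large_pool)          # 60 240
```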

Role models

Two models play fixed roles rather than being routed to. Both are constants:

| Constant | File | Default | Role |
|---|---|---|---|
| CLASSIFIER_MODEL | classifier.py | deepseek-v4-flash | Reads ambiguous prompts to pick a task type. Chosen because it's $0 on the runtime and, unlike some free models, isn't a reasoning model (see strategies.md). |
| JUDGE_MODEL | judge.py | deepseek-v4-pro | Rates open-ended answers 0..1, during exploration only. |

Tunables (the policy thresholds)

These live as constants at the top of policy.py. They control how the router learns and reacts. Defaults are chosen to keep the cold start cheap and price detection quiet; adjust with the trade-offs in mind.

| Constant | Default | Effect | Raise it to... |
|---|---|---|---|
| MIN_SAMPLES | 2 | Observations per model before its numbers are trusted. | Trust the data more, at a longer/costlier cold start. |
| EPSILON | 0.10 | Steady-state chance of re-exploring instead of exploiting. | React faster to drift, at slightly higher steady-state cost. |
| QUALITY_TOLERANCE | 0.05 | How much quality you'll trade for a cheaper model. | Prefer cheaper models more aggressively (accepting more quality risk). |
| PRICE_SHIFT | 0.75 | Unit-price move that triggers re-learning a model. | Ignore more price noise; lower it to react to smaller moves (risking false alerts). |
| MIN_TOKENS_FOR_PRICE | 40 | Token history required before price detection runs. | Reduce false price alerts on tiny, rounding-noisy calls. |
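
One way to picture how the first three thresholds interact is the sketch below. It is not the code in policy.py: the `choose` function and the shape of `stats` are hypothetical, and it assumes the tolerance is measured against the best observed quality, which the real policy may define differently (for example, against the baseline).

```python
import random

MIN_SAMPLES = 2
EPSILON = 0.10
QUALITY_TOLERANCE = 0.05


def choose(stats, rng=random):
    """Pick a model id and report why.

    stats maps model id -> {"n": samples, "quality": mean 0..1, "cost": mean $ per call}.
    """
    if not stats:
        raise ValueError("no candidate models")
    # 1. Cold start: a model with too few observations is explored first.
    under_sampled = [m for m, s in stats.items() if s["n"] < MIN_SAMPLES]
    if under_sampled:
        return under_sampled[0], "explore"
    # 2. Steady state: occasionally re-explore so drift gets noticed.
    if rng.random() < EPSILON:
        return rng.choice(list(stats)), "explore"
    # 3. Exploit: cheapest model within tolerance of the best observed quality.
    best = max(s["quality"] for s in stats.values())
    good_enough = [m for m, s in stats.items() if s["quality"] >= best - QUALITY_TOLERANCE]
    return min(good_enough, key=lambda m: stats[m]["cost"]), "exploit"
```

Raising QUALITY_TOLERANCE widens the `good_enough` set, so cheaper models win more often; raising MIN_SAMPLES keeps models in step 1 for longer.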

For the reasoning behind these mechanisms, see strategies.md.
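
As a rough illustration of the two price-detection thresholds in the table, the sketch below treats PRICE_SHIFT as a relative change in per-token price and MIN_TOKENS_FOR_PRICE as a minimum token count before the check runs. Both readings are assumptions, and `price_shifted` is a hypothetical name rather than a function in policy.py.

```python
PRICE_SHIFT = 0.75
MIN_TOKENS_FOR_PRICE = 40


def price_shifted(learned_unit_price, observed_cost, observed_tokens):
    """Return True when the observed per-token price moved enough to re-learn a model."""
    # Too little history: rounding noise on tiny calls would dominate the estimate.
    if observed_tokens < MIN_TOKENS_FOR_PRICE:
        return False
    # A zero or unknown learned price gives no meaningful relative change.
    if learned_unit_price <= 0:
        return False
    observed_unit_price = observed_cost / observed_tokens
    relative_move = abs(observed_unit_price - learned_unit_price) / learned_unit_price
    return relative_move >= PRICE_SHIFT
```

Under this reading, a price that doubles triggers re-learning, a 20% move does not, and any call under 40 tokens is ignored regardless of its price.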

What is not configurable (by design)

  • The caller's model field. Always ignored - choosing the model is the product.
  • Per-user / per-session state. There is one shared, persistent policy; there is no isolation to configure.
  • The routing decision itself. The cheapest-within-tolerance rule is fixed; you tune its thresholds, not the rule.

Client keys and per-key rate limits are configurable - see integration.md.
