Architecture
The request lifecycle and how the pieces fit.
Arbiter is a small FastAPI service that speaks the OpenAI chat-completions protocol on its front, and calls the BTL runtime on its back. Between the two it runs a fixed pipeline that decides which model should answer, checks the answer, and learns from the result.
This document is the map. For the reasoning behind each stage, see strategies.md; for the HTTP surface, see api-reference.md.
The pipeline
Every chat request flows through the same six stages:
OpenAI-compatible request
|
v
1. CLASSIFY (classifier.py) rules first; a free model when ambiguous
-> task type: code / math / structured / factual / open
|
v
2. FILTER (models.py: fits) drop any model whose context cannot hold the
prompt -> the eligible set
|
v
3. ROUTE (policy.py: choose) explore new models, else exploit the cheapest
one within tolerance of the best
|
v
4. CALL (btl.py: chat) send the chosen model to the runtime; capture
the answer and the cost headers
|
v
5. GRADE (scoring / judge) objective check, or the judge for open-ended
tasks -> a quality score from 0 to 1
|
v
6. LEARN/REACT (policy.py: record) fold cost and quality into the policy; a price
move re-opens exploration
|
v
answer + arbiter metadata

Stages 1-3 are free and fast (no model call, except the ambiguous-classify case). The paid work is stage 4 - and, during exploration only, stage 5's judge.
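
As a rough illustration of stage 3 (ROUTE), the sketch below explores any under-sampled model first and otherwise exploits the cheapest model whose learned quality is within a tolerance of the best. It is not the code in policy.py: the `Stats` shape, the `min_samples` and `tolerance` parameters, and their default values are all assumptions made for the example. Only the `Decision` fields (model, mode, reason) come from this document.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stats:
    n: int          # graded calls recorded for this model on this task (assumed field)
    quality: float  # mean quality score, 0 to 1
    cost: float     # mean cost per call, in any consistent unit


@dataclass
class Decision:
    model: str
    mode: str       # "explore" or "exploit"
    reason: str     # human-readable explanation


def choose(eligible: List[str], stats: Dict[str, Stats],
           min_samples: int = 3, tolerance: float = 0.05) -> Decision:
    """Explore under-sampled models; else pick the cheapest near-best model.

    min_samples and tolerance are hypothetical knobs for this sketch.
    """
    if not eligible:
        raise ValueError("no eligible models")

    # Explore: an eligible model with too few samples is tried first.
    for model in eligible:
        seen = stats[model].n if model in stats else 0
        if seen < min_samples:
            return Decision(model, "explore", f"only {seen} samples so far")

    # Exploit: cheapest model whose quality is within tolerance of the best.
    best = max(stats[m].quality for m in eligible)
    good_enough = [m for m in eligible if stats[m].quality >= best - tolerance]
    pick = min(good_enough, key=lambda m: stats[m].cost)
    return Decision(pick, "exploit",
                    f"cheapest within {tolerance} of best quality {best:.2f}")
```

With three well-sampled models scoring 0.90, 0.88, and 0.60, the second is chosen if it is cheaper than the first, because it falls inside the tolerance band while the third does not.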
The request lifecycle, end to end
This is what happens for a single POST /v1/chat/completions, and where each
step lives in the code (main.py:chat_completions orchestrates it):
- Parse. The request must contain messages. The model field the caller sent is intentionally ignored - choosing the model is Arbiter's job.
- Classify (classifier.py:classify_smart). Rules bucket the request instantly; if none match, a free model reads it. Result: a task type plus how it was decided (rules / model / model-fallback).
- Filter (models.py:fits). Estimate the tokens the request needs (~4 chars per token plus the requested max_tokens) and keep only models whose context window fits. This is a correctness guard, ahead of cost.
- Route (policy.py:choose). Given the task and the eligible set, the policy returns a Decision - a model, a mode (explore / exploit), and a human-readable reason.
- Call (btl.py:chat). The chosen model id is written into the payload and sent to the runtime. The response comes back with the answer and the cost headers parsed into a Cost.
- Grade (scoring.py:score, judge.py:judge). Objective tasks are checked directly (code parses, arithmetic matches, JSON is valid). Open-ended tasks get a neutral score here and are sent to the judge - but only while exploring; in exploit mode the learned quality is reused.
- Learn / react (policy.py:record). The measured cost and quality are folded into the policy. If the model's cost-per-token has moved beyond the threshold, its stats are wiped so it is re-priced and re-routed.
- Respond. The original OpenAI response is returned, plus an arbiter block describing the decision, quality, cost, and savings.
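
The Filter step can be sketched in a few lines. This is a minimal illustration of the heuristic described above (about 4 characters per token, plus the requested max_tokens), not the code in models.py. The function signatures, the `eligible` helper, and the model names in the usage note are assumptions.

```python
import math
from typing import Dict, List

CHARS_PER_TOKEN = 4  # rough heuristic from the text, not a real tokenizer


def estimate_tokens(messages: List[dict], max_tokens: int) -> int:
    """Estimate prompt tokens from character count, then add the reply budget."""
    chars = sum(len(str(m.get("content", ""))) for m in messages)
    return math.ceil(chars / CHARS_PER_TOKEN) + max_tokens


def fits(context_window: int, messages: List[dict], max_tokens: int) -> bool:
    """True when the estimated request fits inside the model's context window."""
    return estimate_tokens(messages, max_tokens) <= context_window


def eligible(windows: Dict[str, int], messages: List[dict],
             max_tokens: int) -> List[str]:
    """Keep only the models whose context window can hold the request."""
    return [m for m, w in windows.items() if fits(w, messages, max_tokens)]
```

For example, a 4,000-character prompt with max_tokens of 512 is estimated at 1,512 tokens, so a hypothetical model with a 1,024-token window is dropped while one with an 8,192-token window stays eligible.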
Component map
| File | Responsibility |
|---|---|
| main.py | FastAPI app; orchestrates the lifecycle above; owns the HTTP endpoints and the in-memory feeds. |
| config.py | Loads settings from .env / environment (key, base URL, baseline model, timeout). |
| btl.py | The only place that calls the runtime. BTLClient.chat plus the Cost and Completion value objects. |
| classify.py | Deterministic rule-based classification and the shared text helpers. |
| classifier.py | The hybrid classifier: rules first, then a free model on ambiguity. |
| models.py | The candidate model registry, the baseline, context windows, and the fits guard. |
| scoring.py | Objective quality checks for code, math, and structured tasks. |
| judge.py | LLM-as-judge for open-ended and factual tasks. |
| policy.py | The routing brain and persistent memory (SQLite): choose, record, report. |
| static/index.html | The fallback single-file dashboard, served at / when the web app has not been built. |
| ui/ | The Next.js web app (landing, onboarding, dashboard, docs); exported to static files and served by FastAPI. See interface.md. |
| scripts/bench.py | A workload generator for demos and testing. |
State and persistence
Arbiter keeps two kinds of state:
- Learned policy - on disk, shared, persistent. The per-task, per-model quality and cost stats live in a SQLite database (data/arbiter.db). There is one shared brain: every request reads from and writes to it, regardless of who sent the request or in what session, and it survives restarts. There is no per-user or per-session isolation.
- Live feeds - in memory, ephemeral. The recent-decisions feed, the price-shift alerts, the classifier counters, and the demo price multipliers live in app.state and reset when the process restarts.
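
To make the learned-policy store concrete, here is a hedged sketch of a per-task, per-model stats table in SQLite, with a `record` step that folds in a new observation and wipes a model's stats when its cost-per-token moves past a threshold. The schema, the running-mean update, the 25% threshold, and the choice to wipe the model across all tasks are assumptions for illustration. The real policy.py may differ on every one of them. The sketch defaults to an in-memory database rather than data/arbiter.db.

```python
import sqlite3


def open_policy(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the stats table. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS stats (
               task TEXT, model TEXT, n INTEGER,
               quality REAL, cost_per_token REAL,
               PRIMARY KEY (task, model))"""
    )
    return db


def record(db: sqlite3.Connection, task: str, model: str, quality: float,
           cost_per_token: float, shift_threshold: float = 0.25) -> str:
    """Fold one observation into the stats; wipe the model on a price shift."""
    row = db.execute(
        "SELECT n, quality, cost_per_token FROM stats WHERE task=? AND model=?",
        (task, model),
    ).fetchone()

    if row is not None:
        n, q, c = row
        if c > 0 and abs(cost_per_token - c) / c > shift_threshold:
            # Price moved: drop everything learned about this model (assumed
            # to span all tasks) and restart from the fresh observation.
            db.execute("DELETE FROM stats WHERE model=?", (model,))
            db.execute("INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
                       (task, model, 1, quality, cost_per_token))
            db.commit()
            return "price-shift"
        # Incremental running means for quality and cost.
        n += 1
        q += (quality - q) / n
        c += (cost_per_token - c) / n
        db.execute(
            "UPDATE stats SET n=?, quality=?, cost_per_token=? "
            "WHERE task=? AND model=?", (n, q, c, task, model))
    else:
        db.execute("INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
                   (task, model, 1, quality, cost_per_token))
    db.commit()
    return "recorded"
```

Because every caller shares one table, a price shift observed on one request resets what every later request sees for that model, which matches the single shared brain described above.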
Where Arbiter touches the runtime
BTL is load-bearing in four distinct places, not one:
- Answering requests - the routed model call (btl.py:chat).
- Classifying - the free model used for ambiguous prompts (classifier.py).
- Judging - the quality rater for open-ended tasks, during exploration (judge.py).
- Measuring and reacting to cost - the x-btl-* headers drive both the savings figures and the price-shift detector (btl.py:Cost, policy.py:record).
Every one of these runs through a single BTL key on the /v1/chat/completions
surface. The reasoning behind each is in strategies.md.
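
As a small illustration of the cost-measurement touchpoint, the sketch below collects every response header that starts with x-btl- into a plain dictionary. This document does not list the individual header names, so the sketch does not assume any. The function name and the returned shape are hypothetical, and the real Cost object in btl.py presumably parses specific fields.

```python
from typing import Dict, Mapping

PREFIX = "x-btl-"


def btl_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Collect all x-btl-* headers, lower-cased, with the prefix stripped.

    HTTP header names are case-insensitive, so names are normalised first.
    """
    out: Dict[str, str] = {}
    for name, value in headers.items():
        key = name.lower()
        if key.startswith(PREFIX):
            out[key[len(PREFIX):]] = value
    return out
```

Any mapping of header names to values works as input, so the same helper can be fed the headers object of whichever HTTP client the service uses.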