- Saturated flag now triggers on load saturation (was dead code)
- Heavy tier routes MoE(131K) first instead of Dense(98K)
- Token estimation now uses JSON length / 3.5 (was content length / 4)
- Cross-turn session tracking via X-Session-Id + Redis TTL 24h
- Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K)
- Heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests
- Response headers: X-Context-Remaining, X-Context-Model
- Routing data includes context_remaining field
- Agents can use this to trigger compaction when nearing limits
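
A minimal sketch of how the token estimate, the GPU_CONTEXT map, and the
context headers listed above could fit together. The map keys, the function
names, the reading of 131K/65K as 131072/65536, and the use of the map key as
the X-Context-Model value are assumptions for illustration, not the actual
router code.

```python
import json

# Context window per backend, in tokens. Assumption: "131K" and "65K" in the
# changelog mean 131072 and 65536; the lowercase keys are hypothetical.
GPU_CONTEXT = {"moe": 131_072, "vlm": 131_072, "dense": 65_536}

CHARS_PER_TOKEN = 3.5  # divisor named in the changelog


def estimate_tokens(messages):
    """Estimate tokens as (length of the JSON-serialized messages) / 3.5."""
    return int(len(json.dumps(messages)) / CHARS_PER_TOKEN)


def context_headers(model, messages):
    """Build the two response headers for the chosen backend."""
    remaining = max(GPU_CONTEXT[model] - estimate_tokens(messages), 0)
    return {
        "X-Context-Remaining": str(remaining),
        "X-Context-Model": model,  # assumption: header carries the map key
    }
```

An agent could read X-Context-Remaining from each response and start
compaction once the value drops below a margin of its own choosing.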
Agent conversations with system prompts easily exceed 4000 tokens, which
previously forced everything to Dense. Now only truly heavy work triggers
Dense. Most agent conversations will route to MoE (default) instead.
Previously the router picked the least-loaded GPU globally, ignoring priority
order. Now it tries candidates in order: MoE → VLM → Dense. It only falls
back to least-loaded when ALL candidates are busy.
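
The selection order described above can be sketched as follows. The function
name, the backend keys, and the shape of the `busy` and `load` inputs are
hypothetical; only the MoE → VLM → Dense order and the least-loaded fallback
come from this change.

```python
PRIORITY = ["moe", "vlm", "dense"]  # candidate order from the changelog


def pick_gpu(busy, load, candidates=PRIORITY):
    """Return the first non-busy candidate in priority order.

    busy: dict mapping backend name to True/False (hypothetical shape)
    load: dict mapping backend name to a comparable load figure
    """
    for name in candidates:
        if not busy[name]:
            return name
    # Every candidate is busy: fall back to the least-loaded one.
    return min(candidates, key=lambda name: load[name])
```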
System messages are common in agent conversations but don't indicate a
heavy workload. Now only token count (>4000) and turn count (>8) trigger
heavy routing. Simple conversations with system prompts can now route to VLM.
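
A sketch of the heavy-tier check under these rules. The thresholds come from
this change; the function name and the choice to count every non-system
message as a turn are assumptions.

```python
HEAVY_TOKEN_THRESHOLD = 4000
HEAVY_TURN_THRESHOLD = 8


def is_heavy(messages, estimated_tokens):
    """Heavy only if the token or turn threshold is exceeded.

    The presence of a system message is deliberately not a signal.
    Assumption: a "turn" is any message whose role is not "system".
    """
    turns = sum(1 for m in messages if m.get("role") != "system")
    return (estimated_tokens > HEAVY_TOKEN_THRESHOLD
            or turns > HEAVY_TURN_THRESHOLD)
```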
Mumuni's Ollama client probes /v1/props for model discovery and
/v1/models/<id> for per-model details. Previously both returned 404,
causing client retries. Both now return proper model properties and details.
When all GPUs are saturated, requests now enter a queue loop (poll every 500ms)
instead of immediately returning 503. The timeout is configurable via the
QUEUE_TIMEOUT env var (default 30s) or per request via the X-Queue-Timeout
header.
This prevents agent failures from cluster saturation: agents wait for a slot
instead of crashing on fallback.
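
A minimal sketch of the queue loop, assuming both timeout values are given in
seconds and that the per-request header takes precedence over the env var.
`try_acquire` is a hypothetical callback that returns a backend name or None;
the real handler may be structured differently.

```python
import os
import time

POLL_INTERVAL_S = 0.5  # poll every 500ms, per the changelog


def resolve_timeout(headers, env=os.environ):
    """X-Queue-Timeout header, else QUEUE_TIMEOUT env var, else 30s."""
    raw = headers.get("X-Queue-Timeout") or env.get("QUEUE_TIMEOUT") or "30"
    return float(raw)


def wait_for_slot(try_acquire, timeout_s, poll_s=POLL_INTERVAL_S,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll try_acquire until it yields a slot or the timeout elapses.

    Returns the slot, or None on timeout (the caller would then send 503).
    """
    deadline = clock() + timeout_s
    while True:
        slot = try_acquire()
        if slot is not None:
            return slot
        if clock() >= deadline:
            return None
        sleep(poll_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without
real delays.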