OLD: checked only whether the CURRENT agent was on a GPU
Tanko→MoE, Mumuni also→MoE (didn't see Tanko)
NEW: checks if ANY agent is on a GPU (cross-agent awareness)
Pass 1: prefer GPUs with 0 agents
Pass 2: prefer GPU this agent is not already on
Pass 3: any non-busy GPU
Prevents Tanko+Mumuni piling onto same GPU simultaneously
even when both slots are free. Combined with MoE=1 slot,
guarantees overflow goes to idle Dense.
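The three passes above can be sketched as follows. This is a minimal illustration, not the actual select_best_gpu() code: the function name select_gpu, its parameters, and the lowercase model names are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def select_gpu(agent: str,
               candidates: List[str],
               agents_on_gpu: Dict[str, Set[str]],
               busy: Set[str]) -> Optional[str]:
    """Pick a GPU for `agent`; `candidates` is in tier priority order.

    Hypothetical signature: `agents_on_gpu` maps GPU -> agents with a
    request in flight, `busy` holds GPUs with no free slot.
    """
    free = [gpu for gpu in candidates if gpu not in busy]

    # Pass 1: prefer GPUs with 0 agents.
    for gpu in free:
        if not agents_on_gpu.get(gpu):
            return gpu

    # Pass 2: prefer a GPU this agent is not already on.
    for gpu in free:
        if agent not in agents_on_gpu.get(gpu, set()):
            return gpu

    # Pass 3: any non-busy GPU (None if every GPU is busy).
    return free[0] if free else None
```

With Tanko already on MoE, a heavy request from Mumuni over ["moe", "dense", "vlm"] lands on "dense" in pass 1, even though MoE is not marked busy.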
MoE at 95°C with p50=13s latency: thermal throttling is causing a
death spiral. Both slots stay stuck processing (p95=113s).
Dense idle at 38C with 2 free slots. Reducing MoE to 1 slot
forces heavy overflow to Dense, giving MoE thermal headroom.
Heavy tier: MoE → Dense → VLM still valid — first heavy goes
to MoE, second overflows to Dense.
Mumuni's 23K-token responses split the final SSE timings chunk
across HTTP frames. The old per-chunk check missed timings when
the chunk was split. The router now accumulates lines in a buffer
before parsing.
Also fixed: store_perf_record was accidentally dropped in a prior edit.
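A minimal sketch of the line-buffering idea, assuming the upstream emits `data: {json}` SSE lines and that the timings object sits under a "timings" key in the final JSON frame. The function name extract_timings is hypothetical and this is not the router's actual code.

```python
import json
from typing import Iterable, Optional


def extract_timings(chunks: Iterable[bytes]) -> Optional[dict]:
    """Return the last `timings` object seen in an SSE byte stream."""
    buffer = b""
    timings = None
    for chunk in chunks:
        buffer += chunk
        # Parse only complete lines; the unterminated tail stays in the
        # buffer until a later chunk delivers the rest of the line.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            line = line.strip()
            if not line.startswith(b"data:"):
                continue
            payload = line[len(b"data:"):].strip()
            if payload == b"[DONE]":
                continue
            try:
                event = json.loads(payload)
            except ValueError:
                # Malformed or non-JSON payload: skip it.
                continue
            if isinstance(event, dict) and "timings" in event:
                timings = event["timings"]
    return timings
```

Because parsing happens per completed line rather than per HTTP chunk, a timings frame split at any byte offset is still recovered once its terminating newline arrives.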
select_best_gpu() now spreads different agents across GPUs:
- If agent already has a request on a GPU, prefer other GPUs first
- Tracked via Redis agent_gpu:{agent}:{model} with 120s TTL
- Same agent can still use multiple slots on same GPU if needed
- Falls back to normal priority when only one option available
Prevents Tanko+Mumuni from piling onto MoE simultaneously
while Dense sits idle. Each agent naturally spreads across
available GPUs.
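The tracking described above can be sketched with an in-memory stand-in for Redis. TTLStore, mark_agent_on_gpu and order_for_agent are hypothetical names for this illustration; only the key pattern agent_gpu:{agent}:{model} and the 120s TTL come from the commit.

```python
import time
from typing import Callable, Dict, List

AGENT_GPU_TTL_S = 120  # TTL stated in the commit message


class TTLStore:
    """In-memory stand-in for a Redis key with expiry (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expiry: Dict[str, float] = {}
        self._clock = clock

    def setex(self, key: str, ttl_s: float) -> None:
        self._expiry[key] = self._clock() + ttl_s

    def exists(self, key: str) -> bool:
        deadline = self._expiry.get(key)
        return deadline is not None and deadline > self._clock()


def mark_agent_on_gpu(store: TTLStore, agent: str, model: str) -> None:
    """Record that `agent` has a request in flight on `model`."""
    store.setex(f"agent_gpu:{agent}:{model}", AGENT_GPU_TTL_S)


def order_for_agent(store: TTLStore, agent: str,
                    candidates: List[str]) -> List[str]:
    """Stable reorder: GPUs the agent is not already on come first."""
    return sorted(candidates,
                  key=lambda m: store.exists(f"agent_gpu:{agent}:{m}"))
```

The reorder only changes preference, so a GPU the agent is already on stays in the list and remains usable as a fallback. Once the key expires, the normal priority order returns.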
Heavy tier keeps MoE primary (workhorse for >25K tok).
Default tier routes Dense → VLM → MoE to prevent MoE overload.
MoE had 5 timeouts in 15 min when Default pushed overflow to it.
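A sketch of the fallback orders as a lookup table. The Heavy and Default orders are the ones stated in these commits; the table name TIER_ORDER, the route() helper, and the lowercase model keys are assumptions for illustration.

```python
from typing import Dict, List, Optional, Set

# Fallback order per tier, first entry is the primary.
TIER_ORDER: Dict[str, List[str]] = {
    "heavy": ["moe", "dense", "vlm"],
    "default": ["dense", "vlm", "moe"],
}


def route(tier: str, busy: Set[str]) -> Optional[str]:
    """Return the first model in the tier's order that has a free slot."""
    for model in TIER_ORDER[tier]:
        if model not in busy:
            return model
    return None  # every model in the tier is busy
```

With this ordering, Default traffic reaches MoE only when both Dense and VLM are busy.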
More conversations now route to VLM as primary. The 9B VLM has a 262K
context window and averages 88 tok/s, which suits moderate
conversations. Dense absorbs overflow and heavy reasoning.
Strix Halo running qwen3.6-35B-A3B was hitting 94°C with 2 concurrent
slots, causing 300s request timeouts. Mumuni + Koby accumulated 15
timeouts in the last hour. Reduced to 1 slot for thermal headroom.
Medium and Default tiers already route VLM before MoE as fallback,
minimizing overflow traffic to the hot GPU.
Router: new /metrics/scatter endpoint returns individual data points
(prompt_tokens, inference_ms, model, agent, reason, stream)
for scatter visualization.
Dashboard: new panel showing latency vs prompt size by model.
- Log-scale X axis (prompt tokens) with model color coding
- Dropdown to filter by individual model or view all
- Hover tooltips with details per point
- Auto-refresh every 30s
Enables direct observation of context-length vs latency
relationship — validates routing tier decisions.
Router now buffers streaming response chunks to extract timings
(prompt_n, predicted_n, predicted_per_second) from the final
SSE data frame before yielding to the client. Streaming requests
get real throughput data instead of 0 tok/s.
Uses llama.cpp timings field in the last content chunk:
- completion_tokens = predicted_n
- tokens_per_sec = predicted_per_second
- inference_ms = predicted_ms (generation only)
Client sees identical stream, no perceptible delay.
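The field mapping above could look like the sketch below. The function name timings_to_perf and the zero defaults are assumptions, as is mapping prompt_n to prompt_tokens; the other three mappings are the ones listed in the commit.

```python
from typing import Any, Dict


def timings_to_perf(timings: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a llama.cpp `timings` object into perf-record fields."""
    return {
        # Assumed mapping: prompt_n is extracted, target field is a guess.
        "prompt_tokens": timings.get("prompt_n", 0),
        "completion_tokens": timings.get("predicted_n", 0),
        "tokens_per_sec": timings.get("predicted_per_second", 0.0),
        # Generation time only; prompt processing is not included.
        "inference_ms": timings.get("predicted_ms", 0.0),
    }
```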
router/router.py (+158 lines):
- store_perf_record(): captures queue_ms, inference_ms, prompt_tokens,
completion_tokens, tokens_per_sec per request in Redis
- Per-model, per-reason, per-agent rolling windows (last 200-500)
- /metrics/performance?window=N endpoint with percentiles (p50/p95/p99)
for latency, throughput, and queue time per model/reason/agent
- Queue time now surfaced in routing metadata and routes:recent
- Streaming requests tracked with estimated prompt tokens
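A sketch of the rolling-window percentile computation, kept in memory for clarity (the router stores records in Redis). The nearest-rank method, the window size of 200, and the names percentile and summarize are assumptions for illustration.

```python
import math
from collections import deque
from typing import Deque, Dict, Iterable, List, Optional

WINDOW_SIZE = 200  # assumed; the commit mentions windows of 200-500


def percentile(values: List[float], pct: int) -> Optional[float]:
    """Nearest-rank percentile; None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(window: Iterable[float]) -> Dict[str, Optional[float]]:
    """Compute p50/p95/p99 over the current window contents."""
    samples = list(window)
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# deque(maxlen=...) drops the oldest sample once the window is full.
latency_ms: Deque[float] = deque(maxlen=WINDOW_SIZE)
for ms in range(1, 301):
    latency_ms.append(float(ms))
```

One such window per model, reason, and agent is enough to serve all three breakdowns from the same summarize() call.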
nginx/nginx.conf:
- Added /metrics/ proxy pass to router_api
Enables model performance comparison and routing tier validation.
VRAM percentage no longer marks GPU as saturated.
Saturation is about slot availability (handled by is_gpu_busy()),
not memory usage. Added vram_warning boolean flag (≥95% threshold)
for informational monitoring without affecting routing decisions.
27B Dense now correctly shows healthy at 91% VRAM.
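The separation between saturation and the VRAM warning can be sketched as below. The function name gpu_status and its parameters are hypothetical; only the 95% threshold and the vram_warning flag name come from the commit.

```python
from typing import Dict

VRAM_WARNING_PCT = 95.0  # informational threshold from the commit


def gpu_status(vram_used_pct: float, active_slots: int,
               max_concurrent: int) -> Dict[str, bool]:
    """Saturation depends on slots only; VRAM is reported as a warning."""
    return {
        "saturated": active_slots >= max_concurrent,
        "vram_warning": vram_used_pct >= VRAM_WARNING_PCT,
    }
```

A GPU at 91% VRAM with a free slot reports neither flag, and one at 96% with a free slot raises vram_warning while staying routable.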
router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking)
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)
docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s
Fixes dashboard hang when any GPU is unreachable.