Heavy tier keeps MoE primary (workhorse for >25K tok).
Default tier routes Dense → VLM → MoE to prevent MoE overload.
MoE had 5 timeouts in 15 min when Default pushed overflow to it.
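A minimal sketch of the fallback ordering described above, assuming each tier holds an ordered list of backends and the first non-busy one wins. TIER_ORDER, pick_backend, and the is_busy callback are hypothetical names, not taken from router.py. Only the Default order and the Heavy primary come from this message; Heavy's fallbacks are not stated, so none are listed.

```python
from typing import Callable, Optional

# Ordered backend preference per tier (hypothetical structure).
# Default order and Heavy primary are from the commit message.
TIER_ORDER = {
    "heavy": ["moe"],
    "default": ["dense", "vlm", "moe"],
}


def pick_backend(tier: str, is_busy: Callable[[str], bool]) -> Optional[str]:
    """Return the first backend in the tier's order that is not busy."""
    for backend in TIER_ORDER.get(tier, TIER_ORDER["default"]):
        if not is_busy(backend):
            return backend
    return None  # every candidate is busy; the caller decides whether to queue
```

With this ordering, MoE only receives Default traffic when both Dense and VLM are busy, which is the overflow path the change is meant to keep small.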
More conversations now route to VLM as primary. The 9B VLM has a 262K
context window and averages 88 tok/s, which suits moderate
conversations. Dense absorbs overflow and heavy reasoning.
Strix Halo running qwen3.6-35B-A3B was hitting 94°C with 2 concurrent
slots, causing 300s request timeouts. Mumuni + Koby accumulated 15
timeouts in the last hour. Reduced to 1 slot for thermal headroom.
Medium and Default tiers already route VLM before MoE as fallback,
minimizing overflow traffic to the hot GPU.
Router: new /metrics/scatter endpoint returns individual data points
(prompt_tokens, inference_ms, model, agent, reason, stream)
for scatter visualization.
Dashboard: new panel showing latency vs prompt size by model.
- Log-scale X axis (prompt tokens) with model color coding
- Dropdown to filter by individual model or view all
- Hover tooltips with details per point
- Auto-refresh every 30s
Enables direct observation of the context-length vs latency
relationship, which helps validate routing tier decisions.
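A rough sketch of how the scatter payload could be assembled from stored performance records. The field list is the one named above; scatter_points, the optional server-side model filter, and the dropping of non-positive prompt sizes (which a log-scale axis cannot place) are assumptions for illustration, not the router's actual behavior.

```python
from typing import Any, Dict, Iterable, List, Optional

# Fields returned per point, as listed in the commit message.
SCATTER_FIELDS = ("prompt_tokens", "inference_ms", "model", "agent", "reason", "stream")


def scatter_points(
    records: Iterable[Dict[str, Any]], model: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Project raw records down to the scatter fields, optionally for one model."""
    points = []
    for rec in records:
        if model is not None and rec.get("model") != model:
            continue
        prompt_tokens = rec.get("prompt_tokens")
        # Assumption: skip points a log-scale X axis cannot display.
        if not prompt_tokens or prompt_tokens <= 0:
            continue
        # Missing fields come back as None rather than raising.
        points.append({field: rec.get(field) for field in SCATTER_FIELDS})
    return points
```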
Router now buffers streaming response chunks to extract timings
(prompt_n, predicted_n, predicted_per_second) from the final
SSE data frame before yielding to the client. Streaming requests
get real throughput data instead of 0 tok/s.
Uses llama.cpp timings field in the last content chunk:
- completion_tokens = predicted_n
- tokens_per_sec = predicted_per_second
- inference_ms = predicted_ms (generation only)
Client sees identical stream, no perceptible delay.
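The field mapping above can be sketched as a small parser over SSE lines. This assumes each frame arrives as a "data: {json}" line and that the timings object sits at the top level of the frame; extract_timings is a hypothetical helper, and mapping prompt_tokens to prompt_n is an assumption (the message only lists prompt_n among the extracted timings).

```python
import json
from typing import Any, Dict, Iterable, Optional


def extract_timings(sse_lines: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Return perf fields from the last SSE frame that carries a timings object."""
    timings = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            continue  # end-of-stream sentinel, not JSON
        try:
            frame = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore partial or malformed frames
        if isinstance(frame, dict) and isinstance(frame.get("timings"), dict):
            timings = frame["timings"]  # keep the latest one seen
    if timings is None:
        return None
    return {
        "prompt_tokens": timings.get("prompt_n"),  # assumed mapping
        "completion_tokens": timings.get("predicted_n"),
        "tokens_per_sec": timings.get("predicted_per_second"),
        "inference_ms": timings.get("predicted_ms"),  # generation only
    }
```

Returning None when no frame carries timings lets the caller fall back to recording the request without throughput data.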
- Dashboard: when a model has zero non-streaming records, shows
  "streaming only" instead of misleading 0 tok/s
- Dashboard: minimum bar width enforced (6% avg, 4% p50) so
  low-tps models are always visible
- Router: removed inflated streaming tps estimate (prompt tokens
  skewed results for long conversations)
Fixes Dense model appearing to "register nothing" when Mumuni
sends mostly streaming requests.
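The two dashboard rules above, written out in Python purely for illustration. The dashboard's real implementation language and function names are not given in this message; bar_width_pct and tps_label are hypothetical.

```python
def bar_width_pct(value: float, max_value: float, min_pct: float) -> float:
    """Scale value to a 0-100% bar width, never narrower than min_pct."""
    if max_value <= 0:
        return min_pct  # nothing to scale against; show the minimum bar
    return max(min_pct, min(100.0, 100.0 * value / max_value))


def tps_label(non_streaming_count: int, avg_tps: float) -> str:
    """Show 'streaming only' when there are no non-streaming records."""
    if non_streaming_count == 0:
        return "streaming only"
    return f"{avg_tps:.1f} tok/s"
```

Under these assumptions, bar_width_pct(1, 100, 6.0) gives 6.0 for the avg bar, and tps_label(0, 0.0) gives "streaming only".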
router/router.py (+158 lines):
- store_perf_record(): captures queue_ms, inference_ms, prompt_tokens,
  completion_tokens, tokens_per_sec per request in Redis
- Per-model, per-reason, per-agent rolling windows (last 200-500)
- /metrics/performance?window=N endpoint with percentiles (p50/p95/p99)
  for latency, throughput, and queue time per model/reason/agent
- Queue time now surfaced in routing metadata and routes:recent
- Streaming requests tracked with estimated prompt tokens
nginx/nginx.conf:
- Added /metrics/ proxy pass to router_api
Enables model performance comparison and routing tier validation.
VRAM percentage no longer marks GPU as saturated.
Saturation is about slot availability (handled by is_gpu_busy()),
not memory usage. Added vram_warning boolean flag (≥95% threshold)
for informational monitoring without affecting routing decisions.
27B Dense now correctly shows healthy at 91% VRAM.
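The separation of the two signals can be sketched as follows. GpuStatus and gpu_status are hypothetical names, and the busy_slots >= max_slots comparison is only a stand-in for whatever is_gpu_busy() actually checks; the 95% threshold is from this message.

```python
from dataclasses import dataclass

VRAM_WARNING_THRESHOLD = 95.0  # percent, from the commit message


@dataclass
class GpuStatus:
    saturated: bool     # drives routing; depends only on slot availability
    vram_warning: bool  # informational only; never affects routing


def gpu_status(vram_percent: float, busy_slots: int, max_slots: int) -> GpuStatus:
    """Compute routing saturation and the VRAM warning independently."""
    return GpuStatus(
        saturated=busy_slots >= max_slots,  # stand-in for is_gpu_busy()
        vram_warning=vram_percent >= VRAM_WARNING_THRESHOLD,
    )
```

Under these assumptions, a GPU at 91% VRAM with a free slot reports neither flag, matching the 27B Dense case described above.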
router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking)
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)
docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s
Fixes dashboard hang when any GPU is unreachable.
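A minimal sketch of a health check with per-call timeouts, assuming the sidecar and the GPU server each expose an HTTP URL to probe. Only the check_gpu_health name and the sidecar_timeout and gpu_timeout parameters come from this message; the URL arguments, the 5.0s defaults, the return shape, and the injectable probe are assumptions.

```python
import http.client
import urllib.request
from typing import Callable, Dict


def http_ok(url: str, timeout: float) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, socket timeouts, and refused connections
        # all derive from OSError; treat any of them as "not healthy".
        return False


def check_gpu_health(
    sidecar_url: str,
    gpu_url: str,
    sidecar_timeout: float = 5.0,  # hypothetical default
    gpu_timeout: float = 5.0,      # hypothetical default
    probe: Callable[[str, float], bool] = http_ok,
) -> Dict[str, bool]:
    """Probe sidecar and GPU server, each bounded by its own timeout."""
    sidecar_up = probe(sidecar_url, sidecar_timeout)
    gpu_up = probe(gpu_url, gpu_timeout)
    return {"sidecar": sidecar_up, "gpu": gpu_up, "healthy": sidecar_up and gpu_up}
```

A fast caller such as /health would then pass sidecar_timeout=1.5 and gpu_timeout=1.0 (assuming that is the order of the 1.5s/1s pair above), so an unreachable GPU costs a bounded wait per check instead of stalling the dashboard.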