Commit Graph

24 Commits

Author SHA1 Message Date
Abiba 93d0d3cc4b revert: MoE concurrency back to 2 (Dense-first routing handles thermal load) 2026-05-27 00:04:42 +00:00
Abiba c4ea5e3a98 fix: flip Tier 4 (Heavy) to Dense-first for thermal safety
Dense → MoE → VLM instead of MoE → Dense → VLM.
Combined with MoE at 1 concurrent slot, Dense absorbs all
primary traffic. MoE only activates when Dense is saturated.
Prevents Strix Halo from hitting the 94°C thermal limit.
2026-05-27 00:01:33 +00:00
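
The fallback behaviour described in this commit (Dense first, MoE only once Dense is saturated, then VLM) can be illustrated with a minimal sketch. This is not the code in router/router.py: the function name `pick_backend`, the backend keys, and the in-memory counters are hypothetical, and the slot limits are assumptions read off this log at this point in history (Dense 1, MoE 1, VLM 2).

```python
from typing import Dict, List, Optional

# Assumed slot limits, taken from the commit log (hypothetical names).
MAX_CONCURRENT: Dict[str, int] = {"dense": 1, "moe": 1, "vlm": 2}

# Tier 4 (Heavy) order after this fix: Dense -> MoE -> VLM.
HEAVY_ORDER: List[str] = ["dense", "moe", "vlm"]


def pick_backend(order: List[str], active: Dict[str, int],
                 limits: Dict[str, int] = MAX_CONCURRENT) -> Optional[str]:
    """Return the first backend in `order` that has a free slot.

    `active` maps backend name to in-flight request count.
    Returns None when every backend is full (the caller would queue).
    """
    for name in order:
        if active.get(name, 0) < limits[name]:
            return name
    return None
```

With these limits, a heavy request reaches MoE only while Dense already holds one in-flight request, which is the thermal-relief effect the commit describes.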
Abiba ebe8f9ced4 fix: reduce MoE concurrency 2→1 to prevent thermal timeout (94°C)
Strix Halo running qwen3.6-35B-A3B was hitting 94°C with 2 concurrent
slots, causing 300s request timeouts. Mumuni + Koby accumulated 15
timeouts in the last hour. Reduced to 1 slot for thermal headroom.

Medium and Default tiers already route VLM before MoE as fallback,
minimizing overflow traffic to the hot GPU.
2026-05-26 23:47:08 +00:00
Abiba b3db0841ef feat: redesigned routing tiers for even GPU distribution + speed priority
OLD: Dense was last choice in every tier, got 4% of auto-routed traffic
NEW: 5-tier routing with speed-first prioritization

Tier 1 (Lightweight): VLM → Dense → MoE    (≤500 tok, ≤100 words)
Tier 2 (Simple):      VLM → Dense → MoE    (≤4000 tok, ≤6 turns)
Tier 3 (Medium):      DENSE → MoE → VLM    (≤25000 tok, ≤15 turns)
Tier 4 (Heavy):       MoE → Dense → VLM    (>25000 tok or >15 turns)
Tier 5 (Default):     DENSE → MoE → VLM    (balanced fallback)

Also: quality hint now routes to MoE (better reasoning)
Bugfix: Tier 1 now checks token count to prevent giant single-word
inputs from being routed as lightweight
2026-05-26 22:00:20 +00:00
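
The five tiers listed in this commit can be read as a simple classifier. The sketch below is one possible reading, not the router's actual code: `classify` and `TIER_ORDER` are hypothetical names, tiers are checked in the listed order, and the Default tier is assumed to apply only when no token count is available (the log does not say when Default is chosen).

```python
from typing import Dict, List, Optional

# Backend preference per tier, as listed in the commit message.
TIER_ORDER: Dict[str, List[str]] = {
    "lightweight": ["vlm", "dense", "moe"],
    "simple": ["vlm", "dense", "moe"],
    "medium": ["dense", "moe", "vlm"],
    "heavy": ["moe", "dense", "vlm"],
    "default": ["dense", "moe", "vlm"],
}


def classify(prompt_tokens: Optional[int], turns: int, words: int) -> str:
    """Map request size to a tier name, checking tiers in listed order."""
    if prompt_tokens is None:
        # Assumption: fall back to Default when tokens cannot be counted.
        return "default"
    # Tier 1 checks tokens as well as words, so a giant single-word
    # input is not treated as lightweight (the bugfix in this commit).
    if prompt_tokens <= 500 and words <= 100:
        return "lightweight"
    if prompt_tokens <= 4000 and turns <= 6:
        return "simple"
    if prompt_tokens <= 25000 and turns <= 15:
        return "medium"
    # Remaining case: more than 25000 tokens or more than 15 turns.
    return "heavy"
```

For example, `TIER_ORDER[classify(30000, 2, 5000)]` gives the MoE-first order that this commit assigns to Tier 4.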
Abiba f47c3f3304 feat: latency vs prompt size scatter plot on dashboard
Router: new /metrics/scatter endpoint returns individual data points
(prompt_tokens, inference_ms, model, agent, reason, stream)
for scatter visualization.

Dashboard: new panel showing latency vs prompt size by model.
- Log-scale X axis (prompt tokens) with model color coding
- Dropdown to filter by individual model or view all
- Hover tooltips with details per point
- Auto-refresh every 30s

Enables direct observation of context-length vs latency
relationship — validates routing tier decisions.
2026-05-26 12:18:31 +00:00
Abiba cfb05fa501 feat: capture streaming token counts from SSE final chunk
Router now buffers streaming response chunks to extract timings
(prompt_n, predicted_n, predicted_per_second) from the final
SSE data frame before yielding to the client. Streaming requests
get real throughput data instead of 0 tok/s.

Uses llama.cpp timings field in the last content chunk:
- completion_tokens = predicted_n
- tokens_per_sec = predicted_per_second
- inference_ms = predicted_ms (generation only)

Client sees identical stream, no perceptible delay.
2026-05-25 19:58:51 +00:00
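
A rough sketch of the extraction step described above, assuming the stream arrives as text lines in `data: <json>` SSE framing and ends with a `data: [DONE]` sentinel. The function name `extract_timings` and the output key names are hypothetical; the field mapping follows the commit message.

```python
import json
from typing import Any, Dict, Iterable, Optional


def extract_timings(sse_lines: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Scan SSE lines and return metrics from the last frame that has `timings`."""
    timings = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            continue  # assumed end-of-stream sentinel
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore frames that are not valid JSON
        if isinstance(chunk, dict) and isinstance(chunk.get("timings"), dict):
            timings = chunk["timings"]  # keep the most recent one seen
    if timings is None:
        return None
    return {
        "prompt_tokens": timings.get("prompt_n"),
        "completion_tokens": timings.get("predicted_n"),
        "tokens_per_sec": timings.get("predicted_per_second"),
        "inference_ms": timings.get("predicted_ms"),  # generation only
    }
```

In a real proxy the lines would still be forwarded to the client unchanged; this sketch covers only the parsing side.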
Abiba 8c5c922a4e fix: handle single data point in performance percentiles 2026-05-25 17:00:40 +00:00
Abiba b849cd3395 feat: per-request performance tracking + /metrics/performance endpoint
router/router.py (+158 lines):
- store_perf_record(): captures queue_ms, inference_ms, prompt_tokens,
  completion_tokens, tokens_per_sec per request in Redis
- Per-model, per-reason, per-agent rolling windows (last 200-500)
- /metrics/performance?window=N endpoint with percentiles (p50/p95/p99)
  for latency, throughput, and queue time per model/reason/agent
- Queue time now surfaced in routing metadata and routes:recent
- Streaming requests tracked with estimated prompt tokens

nginx/nginx.conf:
- Added /metrics/ proxy pass to router_api

Enables model performance comparison and routing tier validation.
2026-05-25 16:50:45 +00:00
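
The percentile summary over a rolling window might look roughly like the sketch below. It is an illustration only: `percentile` and `summarize` are hypothetical names, linear interpolation is an assumed method (the log does not say which one the router uses), and a plain list stands in for the Redis-backed window. The single-sample branch mirrors the case addressed by the follow-up fix 8c5c922a4e.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], p: float) -> float:
    """Linear-interpolation percentile of pre-sorted data, p in [0, 100]."""
    if not sorted_values:
        raise ValueError("no data points")
    if len(sorted_values) == 1:
        # A single sample is every percentile of itself.
        return sorted_values[0]
    rank = (p / 100.0) * (len(sorted_values) - 1)
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def summarize(values: List[float], window: int = 200) -> Dict[str, float]:
    """Return p50/p95/p99 over the most recent `window` samples."""
    recent = sorted(values[-window:])
    return {name: percentile(recent, p)
            for name, p in (("p50", 50), ("p95", 95), ("p99", 99))}
```

Calling `summarize(latencies_ms, window=500)` on a list of per-request latencies would give the three figures that an endpoint like /metrics/performance reports per model, reason, or agent.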
Abiba b7882b2434 fix: reduce 27B Dense context to 192K to free VRAM
RTX 3090 was at 94.9% VRAM at 262K context. Reduced to 192K (196608),
freeing ~2.4GB. VRAM now at 85% with room for active inference.
2026-05-25 00:31:40 +00:00
Abiba ddde6646de fix: decouple VRAM usage from saturation status
VRAM percentage no longer marks GPU as saturated.
Saturation is about slot availability (handled by is_gpu_busy()),
not memory usage. Added vram_warning boolean flag (≥95% threshold)
for informational monitoring without affecting routing decisions.

27B Dense now correctly shows healthy at 91% VRAM.
2026-05-23 06:00:37 +00:00
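
The separation between saturation and VRAM usage can be sketched as two independent checks. The names `GpuStatus` and `gpu_status` are hypothetical, and the slot comparison is an assumed stand-in for whatever is_gpu_busy() does internally; only the 95% threshold comes from the commit message.

```python
from dataclasses import dataclass

VRAM_WARNING_PCT = 95.0  # informational threshold from the commit message


@dataclass
class GpuStatus:
    saturated: bool      # drives routing: no free slot
    vram_warning: bool   # monitoring only, never affects routing


def gpu_status(active_slots: int, max_concurrent: int,
               vram_pct: float) -> GpuStatus:
    """Derive both flags independently of each other."""
    return GpuStatus(
        saturated=active_slots >= max_concurrent,
        vram_warning=vram_pct >= VRAM_WARNING_PCT,
    )
```

Under this reading, an idle GPU at 91% VRAM reports neither flag, which matches the "healthy at 91% VRAM" outcome noted for the 27B Dense model.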
Abiba 41939104c7 fix: non-blocking GPU health checks + 256K turboquant context upgrade
router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking)
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)

docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s

Fixes dashboard hang when any GPU is unreachable.
2026-05-23 05:57:13 +00:00
Abiba 0983337fdb fix: heavy tier Dense→MoE→VLM 2026-05-19 21:24:36 +00:00
Abiba 28d62e27ba feat: context-aware routing + compaction signals 2026-05-19 21:13:57 +00:00
Abiba 714ebb003e fix: heavy threshold → 50000 tokens, 25 turns 2026-05-19 21:08:18 +00:00
Abiba e90bf0216d fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns 2026-05-19 20:10:07 +00:00
Abiba 5971ceee4e security: reject requests without valid API key (401) 2026-05-19 19:15:13 +00:00
Abiba 5f05f46c7c fix: heavy tier — Dense first for reasoning, MoE workhorse, VLM overflow 2026-05-19 18:27:24 +00:00
Abiba 911fdc9f3f fix: routing priority — MoE first, VLM second, Dense last 2026-05-19 17:38:29 +00:00
Abiba d9d2c213f6 fix: routing — remove turn limit from default tier, no gaps 2026-05-19 17:24:41 +00:00
Abiba 6625892908 feat: redesigned routing tiers — VLM handles more traffic 2026-05-19 17:01:58 +00:00
Abiba fcb99a26c8 revert: remove Ollama endpoints 2026-05-19 16:57:05 +00:00
Abiba 2234d03079 fix: add /v1/props and /v1/models/<id> endpoints 2026-05-19 16:08:58 +00:00
Abiba 5b99b16712 feat: add request queuing to router (replaces hard 503) 2026-05-19 15:55:13 +00:00
Abiba 28fc57c5c7 May 19, 2026: Full harness update
- Model migration: gemma-4-E4B → qwen3.5-9b-vlm
- Dashboard reorder: Usage Over Time + GPU Metrics to top
- Router counter leak fix (gpu_decr in except handler)
- VLM slot upgrade 1→2
- Automated maintenance cron job
- LiteLLM config update
2026-05-19 15:03:47 +00:00
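
The counter leak mentioned in this update is the kind of bug where an in-flight counter is incremented but never decremented when a request fails. The commit fixes it by calling gpu_decr in the except handler; the sketch below shows a related pattern instead, try/finally inside a context manager, which releases the slot on every exit path. The name `gpu_slot` and the in-memory dict are hypothetical stand-ins for the router's real counters.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def gpu_slot(name: str, counters: Dict[str, int]) -> Iterator[None]:
    """Hold one slot on backend `name` for the duration of the block."""
    counters[name] = counters.get(name, 0) + 1
    try:
        yield
    finally:
        # Runs on success, on exceptions, and on early return.
        counters[name] -= 1
```

A request handler would wrap the backend call in `with gpu_slot("vlm", counters):` so that a failed request cannot leave the counter stuck above zero.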