inference-harness

Author	SHA1	Message	Date
Abiba	93d0d3cc4b	revert: MoE concurrency back to 2 (Dense-first routing handles thermal)	2026-05-27 00:04:42 +00:00
Abiba	c4ea5e3a98	fix: flip Tier 4 (Heavy) to Dense-first for thermal safety. Dense → MoE → VLM instead of MoE → Dense → VLM. Combined with MoE at 1 concurrent slot, Dense absorbs all primary traffic. MoE only activates when Dense is saturated. Prevents Strix Halo from hitting the 94°C thermal limit.	2026-05-27 00:01:33 +00:00
Abiba	ebe8f9ced4	fix: reduce MoE concurrency 2→1 to prevent thermal timeout (94°C). Strix Halo running qwen3.6-35B-A3B was hitting 94°C with 2 concurrent slots, causing 300s request timeouts. Mumuni + Koby accumulated 15 timeouts in the last hour. Reduced to 1 slot for thermal headroom. Medium and Default tiers already route VLM before MoE as fallback, minimizing overflow traffic to the hot GPU.	2026-05-26 23:47:08 +00:00
Abiba	b3db0841ef	feat: redesigned routing tiers for even GPU distribution + speed priority. OLD: Dense was last choice in every tier, got 4% of auto-routed traffic. NEW: 5-tier routing with speed-first prioritization. Tier 1 (Lightweight): VLM → Dense → MoE (≤500 tok, ≤100 words); Tier 2 (Simple): VLM → Dense → MoE (≤4000 tok, ≤6 turns); Tier 3 (Medium): DENSE → MoE → VLM (≤25000 tok, ≤15 turns); Tier 4 (Heavy): MoE → Dense → VLM (>25000 tok or >15 turns); Tier 5 (Default): DENSE → MoE → VLM (balanced fallback). Also: quality hint now routes to MoE (better reasoning). Bugfix: Tier 1 now checks token count to prevent giant single-word inputs from being routed as lightweight.	2026-05-26 22:00:20 +00:00
Abiba	f47c3f3304	feat: latency vs prompt size scatter plot on dashboard. Router: new /metrics/scatter endpoint returns individual data points (prompt_tokens, inference_ms, model, agent, reason, stream) for scatter visualization. Dashboard: new panel showing latency vs prompt size by model: log-scale X axis (prompt tokens) with model color coding; dropdown to filter by individual model or view all; hover tooltips with details per point; auto-refresh every 30s. Enables direct observation of context-length vs latency relationship; validates routing tier decisions.	2026-05-26 12:18:31 +00:00
Abiba	cfb05fa501	feat: capture streaming token counts from SSE final chunk. Router now buffers streaming response chunks to extract timings (prompt_n, predicted_n, predicted_per_second) from the final SSE data frame before yielding to the client. Streaming requests get real throughput data instead of 0 tok/s. Uses llama.cpp timings field in the last content chunk: completion_tokens = predicted_n; tokens_per_sec = predicted_per_second; inference_ms = predicted_ms (generation only). Client sees identical stream, no perceptible delay.	2026-05-25 19:58:51 +00:00
Abiba	8c5c922a4e	fix: handle single data point in performance percentiles	2026-05-25 17:00:40 +00:00
Abiba	b849cd3395	feat: per-request performance tracking + /metrics/performance endpoint. router/router.py (+158 lines): store_perf_record() captures queue_ms, inference_ms, prompt_tokens, completion_tokens, tokens_per_sec per request in Redis; per-model, per-reason, per-agent rolling windows (last 200-500); /metrics/performance?window=N endpoint with percentiles (p50/p95/p99) for latency, throughput, and queue time per model/reason/agent; queue time now surfaced in routing metadata and routes:recent; streaming requests tracked with estimated prompt tokens. nginx/nginx.conf: added /metrics/ proxy pass to router_api. Enables model performance comparison and routing tier validation.	2026-05-25 16:50:45 +00:00
Abiba	b7882b2434	fix: reduce 27B Dense context to 192K to free VRAM. RTX 3090 was at 94.9% VRAM at 262K context. Reduced to 192K (196608), freeing ~2.4GB. VRAM now at 85% with room for active inference.	2026-05-25 00:31:40 +00:00
Abiba	ddde6646de	fix: decouple VRAM usage from saturation status. VRAM percentage no longer marks GPU as saturated. Saturation is about slot availability (handled by is_gpu_busy()), not memory usage. Added vram_warning boolean flag (≥95% threshold) for informational monitoring without affecting routing decisions. 27B Dense now correctly shows healthy at 91% VRAM.	2026-05-23 06:00:37 +00:00
Abiba	41939104c7	fix: non-blocking GPU health checks + 256K turboquant context upgrade. router/router.py: check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout); /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking); /v1/models now calls check_gpu_health once per model instead of twice; GPU_CONTEXT updated to 262144 across all models (turboquant upgrade); 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context). docker-compose.yml: Router healthcheck timeout 5s→15s, interval 15s→30s; Nginx healthcheck timeout 5s→15s, interval 15s→30s. Fixes dashboard hang when any GPU is unreachable.	2026-05-23 05:57:13 +00:00
Abiba	0983337fdb	fix: heavy tier Dense→MoE→VLM	2026-05-19 21:24:36 +00:00
Abiba	28d62e27ba	feat: context-aware routing + compaction signals	2026-05-19 21:13:57 +00:00
Abiba	714ebb003e	fix: heavy threshold → 50000 tokens, 25 turns	2026-05-19 21:08:18 +00:00
Abiba	e90bf0216d	fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns	2026-05-19 20:10:07 +00:00
Abiba	5971ceee4e	security: reject requests without valid API key (401)	2026-05-19 19:15:13 +00:00
Abiba	5f05f46c7c	fix: heavy tier — Dense first for reasoning, MoE workhorse, VLM overflow	2026-05-19 18:27:24 +00:00
Abiba	911fdc9f3f	fix: routing priority — MoE first, VLM second, Dense last	2026-05-19 17:38:29 +00:00
Abiba	d9d2c213f6	fix: routing — remove turn limit from default tier, no gaps	2026-05-19 17:24:41 +00:00
Abiba	6625892908	feat: redesigned routing tiers — VLM handles more traffic	2026-05-19 17:01:58 +00:00
Abiba	fcb99a26c8	revert: remove Ollama endpoints	2026-05-19 16:57:05 +00:00
Abiba	2234d03079	fix: add /v1/props and /v1/models/<id> endpoints	2026-05-19 16:08:58 +00:00
Abiba	5b99b16712	feat: add request queuing to router (replaces hard 503)	2026-05-19 15:55:13 +00:00
Abiba	28fc57c5c7	May 19, 2026: Full harness update. Model migration: gemma-4-E4B → qwen3.5-9b-vlm; Dashboard reorder: Usage Over Time + GPU Metrics to top; Router counter leak fix (gpu_decr in except handler); VLM slot upgrade 1→2; Automated maintenance cron job; LiteLLM config update.	2026-05-19 15:03:47 +00:00
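
The tier rules described in b3db0841ef, with the Tier 4 flip from c4ea5e3a98 applied, can be sketched roughly as below. This is an illustration only, not the code in router/router.py: the names `TIER_ORDER`, `classify_tier`, and `pick_backend` are hypothetical, and the sketch assumes the Default tier applies when request metadata is missing, which the log does not state.

```python
from typing import Optional, Set

# Fallback order per tier, as listed in b3db0841ef; "heavy" uses the
# Dense-first order introduced by c4ea5e3a98. Dict and key names are hypothetical.
TIER_ORDER = {
    "lightweight": ["vlm", "dense", "moe"],
    "simple": ["vlm", "dense", "moe"],
    "medium": ["dense", "moe", "vlm"],
    "heavy": ["dense", "moe", "vlm"],
    "default": ["dense", "moe", "vlm"],
}


def classify_tier(prompt_tokens: Optional[int],
                  turns: Optional[int],
                  words: Optional[int]) -> str:
    """Map request size to a tier using the thresholds from the commit message."""
    # Assumption: fall back to the Default tier when any measurement is missing.
    if prompt_tokens is None or turns is None or words is None:
        return "default"
    # Tier 1 checks tokens as well as words, so a giant single-word
    # input is not treated as lightweight (the bugfix in b3db0841ef).
    if prompt_tokens <= 500 and words <= 100:
        return "lightweight"
    if prompt_tokens <= 4000 and turns <= 6:
        return "simple"
    if prompt_tokens <= 25000 and turns <= 15:
        return "medium"
    return "heavy"  # >25000 tokens or >15 turns


def pick_backend(tier: str, busy: Set[str]) -> Optional[str]:
    """Return the first backend in the tier's order that is not busy."""
    for backend in TIER_ORDER[tier]:
        if backend not in busy:
            return backend
    return None  # every backend is saturated; the caller would queue the request
```

With Dense first in the Heavy tier, `pick_backend("heavy", {"dense"})` falls through to MoE only once Dense is saturated, which matches the intent stated in c4ea5e3a98.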

24 Commits
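
For the streaming capture described in cfb05fa501, a minimal sketch of pulling the `timings` fields out of buffered SSE lines might look like the following. The function name `extract_timings` and the exact frame layout are assumptions; only the field names (`prompt_n`, `predicted_n`, `predicted_per_second`, `predicted_ms`) come from the commit message.

```python
import json
from typing import Iterable, Optional


def extract_timings(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return metrics from the last SSE data frame that carries a timings object."""
    last = None
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip comments, blank lines, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue  # the stream terminator carries no JSON
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore partial or malformed frames
        timings = chunk.get("timings") if isinstance(chunk, dict) else None
        if isinstance(timings, dict):
            last = timings  # keep overwriting so the final frame wins
    if last is None:
        return None
    # Field mapping as listed in the commit message.
    return {
        "prompt_tokens": last.get("prompt_n"),
        "completion_tokens": last.get("predicted_n"),
        "tokens_per_sec": last.get("predicted_per_second"),
        "inference_ms": last.get("predicted_ms"),
    }
```

Returning `None` when no frame carries timings lets the caller fall back to estimated values instead of recording 0 tok/s.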