inference-harness

Author	SHA1	Message	Date
Abiba	34fb7516e1	fix: cross-agent GPU spreading prevents hotspot hammering. OLD: checked only whether the CURRENT agent was on a GPU (Tanko→MoE, Mumuni also→MoE because it didn't see Tanko). NEW: checks whether ANY agent is on a GPU (cross-agent awareness). Pass 1: prefer GPUs with 0 agents. Pass 2: prefer a GPU this agent is not already on. Pass 3: any non-busy GPU. Prevents Tanko+Mumuni piling onto the same GPU simultaneously even when both slots are free. Combined with MoE=1 slot, guarantees overflow goes to idle Dense.	2026-05-30 12:55:29 +00:00
Abiba	acbcb20837	fix: MoE concurrency 2→1 (95°C thermal emergency). MoE at 95°C with p50=13s latency; thermal throttling is causing a death spiral. Both slots stuck processing (113s p95). Dense idle at 38°C with 2 free slots. Reducing MoE to 1 slot forces heavy overflow to Dense, giving MoE thermal headroom. Heavy tier order MoE → Dense → VLM is still valid: the first heavy request goes to MoE, the second overflows to Dense.	2026-05-30 12:52:23 +00:00
Abiba	a3bca93d9b	fix: buffer SSE chunks for large streaming responses. Mumuni's 23K-token responses split the final SSE timings chunk across HTTP frames, and the old per-chunk check missed timings when the chunk was split. The router now accumulates lines in a buffer before parsing. Also restores store_perf_record, which was accidentally dropped in a prior edit.	2026-05-29 09:45:41 +00:00
Abiba	d53685d874	feat: agent-aware GPU load balancing. select_best_gpu() now spreads different agents across GPUs: (1) if an agent already has a request on a GPU, prefer other GPUs first; (2) tracked via Redis agent_gpu:{agent}:{model} with 120s TTL; (3) the same agent can still use multiple slots on the same GPU if needed; (4) falls back to normal priority when only one option is available. Prevents Tanko+Mumuni from piling onto MoE simultaneously while Dense sits idle. Each agent naturally spreads across available GPUs.	2026-05-28 21:45:23 +00:00
Abiba	54a4f26db7	fix: Default tier back to Dense-first (MoE overheating at 91°C) Heavy tier keeps MoE primary (workhorse for >25K tok). Default tier routes Dense → VLM → MoE to prevent MoE overload. MoE had 5 timeouts in 15 min when Default pushed overflow to it.	2026-05-28 21:40:18 +00:00
Abiba	fb1d51b93b	restructure: routing prioritized by reasoning requirements. Tier 1 (Lightweight): VLM → Dense → MoE (≤500 tok, 1 turn). Tier 2 (Simple): VLM → Dense → MoE (≤15K tok, ≤12 turns; was 10K/10). Tier 3 (Medium): Dense → VLM → MoE (≤25K tok). Tier 4 (Heavy): MoE → Dense → VLM (>25K tok; MoE is the PRIMARY workhorse). Tier 5 (Default): MoE → Dense → VLM (MoE primary fallback). Target: MoE ~50% (heavy primary), VLM ~25% (raised simple + fallback), Dense ~25% (medium primary + heavy fallback). Removed the turn limit from the Medium tier; the Simple tier now handles conversational requests up to 12 turns.	2026-05-27 07:22:30 +00:00
Abiba	9a0d69ce8d	feat: Dense 128K context + 2 slots, VLM second in Heavy tier. Dense GPU_CONTEXT: 192K→128K (131072) to free VRAM. Dense max_concurrent: 1→2 (VRAM now sufficient). Heavy tier: Dense → VLM → MoE (VLM handles 262K context). Total slots: 6 (2 Dense + 2 MoE + 2 VLM). Distribution target: Dense 50%, VLM 30%, MoE 20%. NOTE: Requires llama.cpp restart on 192.168.68.8 with --ctx-size 131072	2026-05-27 07:15:58 +00:00
Abiba	621a897bec	tune: raise Tier 2 threshold 4K→10K tok, 6→10 turns for VLM More conversations now route to VLM as primary. 9B VLM has 262K context window and 88 tok/s average — well suited for moderate conversations. Dense absorbs overflow and heavy reasoning.	2026-05-27 00:29:25 +00:00
Abiba	93d0d3cc4b	revert: MoE concurrency back to 2 (Dense-first routing handles thermal)	2026-05-27 00:04:42 +00:00
Abiba	c4ea5e3a98	fix: flip Tier 4 (Heavy) to Dense-first for thermal safety. Dense → MoE → VLM instead of MoE → Dense → VLM. Combined with MoE at 1 concurrent slot, Dense absorbs all primary traffic. MoE only activates when Dense is saturated. Prevents Strix Halo from hitting the 94°C thermal limit.	2026-05-27 00:01:33 +00:00
Abiba	ebe8f9ced4	fix: reduce MoE concurrency 2→1 to prevent thermal timeout (94°C) Strix Halo running qwen3.6-35B-A3B was hitting 94°C with 2 concurrent slots, causing 300s request timeouts. Mumuni + Koby accumulated 15 timeouts in the last hour. Reduced to 1 slot for thermal headroom. Medium and Default tiers already route VLM before MoE as fallback, minimizing overflow traffic to the hot GPU.	2026-05-26 23:47:08 +00:00
Abiba	b3db0841ef	feat: redesigned routing tiers for even GPU distribution + speed priority. OLD: Dense was the last choice in every tier and got 4% of auto-routed traffic. NEW: 5-tier routing with speed-first prioritization. Tier 1 (Lightweight): VLM → Dense → MoE (≤500 tok, ≤100 words). Tier 2 (Simple): VLM → Dense → MoE (≤4000 tok, ≤6 turns). Tier 3 (Medium): DENSE → MoE → VLM (≤25000 tok, ≤15 turns). Tier 4 (Heavy): MoE → Dense → VLM (>25000 tok or >15 turns). Tier 5 (Default): DENSE → MoE → VLM (balanced fallback). Also: the quality hint now routes to MoE (better reasoning). Bugfix: Tier 1 now checks token count to prevent giant single-word inputs from being routed as lightweight.	2026-05-26 22:00:20 +00:00
Abiba	f47c3f3304	feat: latency vs prompt size scatter plot on dashboard. Router: new /metrics/scatter endpoint returns individual data points (prompt_tokens, inference_ms, model, agent, reason, stream) for scatter visualization. Dashboard: new panel showing latency vs prompt size by model, with a log-scale X axis (prompt tokens) and model color coding, a dropdown to filter by individual model or view all, hover tooltips with details per point, and auto-refresh every 30s. Enables direct observation of the context-length vs latency relationship, which validates routing tier decisions.	2026-05-26 12:18:31 +00:00
Abiba	cfb05fa501	feat: capture streaming token counts from SSE final chunk. Router now buffers streaming response chunks to extract timings (prompt_n, predicted_n, predicted_per_second) from the final SSE data frame before yielding to the client. Streaming requests get real throughput data instead of 0 tok/s. Uses the llama.cpp timings field in the last content chunk: completion_tokens = predicted_n; tokens_per_sec = predicted_per_second; inference_ms = predicted_ms (generation only). Client sees an identical stream with no perceptible delay.	2026-05-25 19:58:51 +00:00
Abiba	8c5c922a4e	fix: handle single data point in performance percentiles	2026-05-25 17:00:40 +00:00
Abiba	b849cd3395	feat: per-request performance tracking + /metrics/performance endpoint. router/router.py (+158 lines): store_perf_record() captures queue_ms, inference_ms, prompt_tokens, completion_tokens, tokens_per_sec per request in Redis; per-model, per-reason, per-agent rolling windows (last 200-500); /metrics/performance?window=N endpoint with percentiles (p50/p95/p99) for latency, throughput, and queue time per model/reason/agent; queue time now surfaced in routing metadata and routes:recent; streaming requests tracked with estimated prompt tokens. nginx/nginx.conf: added /metrics/ proxy pass to router_api. Enables model performance comparison and routing tier validation.	2026-05-25 16:50:45 +00:00
Abiba	b7882b2434	fix: reduce 27B Dense context to 192K to free VRAM RTX 3090 was at 94.9% VRAM at 262K context. Reduced to 192K (196608), freeing ~2.4GB. VRAM now at 85% with room for active inference.	2026-05-25 00:31:40 +00:00
Abiba	ddde6646de	fix: decouple VRAM usage from saturation status VRAM percentage no longer marks GPU as saturated. Saturation is about slot availability (handled by is_gpu_busy()), not memory usage. Added vram_warning boolean flag (≥95% threshold) for informational monitoring without affecting routing decisions. 27B Dense now correctly shows healthy at 91% VRAM.	2026-05-23 06:00:37 +00:00
Abiba	41939104c7	fix: non-blocking GPU health checks + 256K turboquant context upgrade. router/router.py: check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout); /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking); /v1/models now calls check_gpu_health once per model instead of twice; GPU_CONTEXT updated to 262144 across all models (turboquant upgrade); 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context). docker-compose.yml: router healthcheck timeout 5s→15s, interval 15s→30s; Nginx healthcheck timeout 5s→15s, interval 15s→30s. Fixes dashboard hang when any GPU is unreachable.	2026-05-23 05:57:13 +00:00
Abiba	0983337fdb	fix: heavy tier Dense→MoE→VLM	2026-05-19 21:24:36 +00:00
Abiba	28d62e27ba	feat: context-aware routing + compaction signals	2026-05-19 21:13:57 +00:00
Abiba	714ebb003e	fix: heavy threshold → 50000 tokens, 25 turns	2026-05-19 21:08:18 +00:00
Abiba	e90bf0216d	fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns	2026-05-19 20:10:07 +00:00
Abiba	5971ceee4e	security: reject requests without valid API key (401)	2026-05-19 19:15:13 +00:00
Abiba	5f05f46c7c	fix: heavy tier — Dense first for reasoning, MoE workhorse, VLM overflow	2026-05-19 18:27:24 +00:00
Abiba	911fdc9f3f	fix: routing priority — MoE first, VLM second, Dense last	2026-05-19 17:38:29 +00:00
Abiba	d9d2c213f6	fix: routing — remove turn limit from default tier, no gaps	2026-05-19 17:24:41 +00:00
Abiba	6625892908	feat: redesigned routing tiers — VLM handles more traffic	2026-05-19 17:01:58 +00:00
Abiba	fcb99a26c8	revert: remove Ollama endpoints	2026-05-19 16:57:05 +00:00
Abiba	2234d03079	fix: add /v1/props and /v1/models/<id> endpoints	2026-05-19 16:08:58 +00:00
Abiba	5b99b16712	feat: add request queuing to router (replaces hard 503)	2026-05-19 15:55:13 +00:00
Abiba	28fc57c5c7	May 19, 2026: Full harness update. Model migration: gemma-4-E4B → qwen3.5-9b-vlm; dashboard reorder: Usage Over Time + GPU Metrics to top; router counter leak fix (gpu_decr in except handler); VLM slot upgrade 1→2; automated maintenance cron job; LiteLLM config update.	2026-05-19 15:03:47 +00:00
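
The three-pass selection described in commit 34fb7516e1 can be sketched roughly as follows. This is an illustration only, not the code in router/router.py: the function name `pick_gpu`, its parameters, and the in-memory `agents_on_gpu` mapping are all assumptions made for the example (the commits say this state is tracked in Redis under agent_gpu:{agent}:{model} with a TTL, which is not modeled here).

```python
from typing import Dict, List, Optional, Set


def pick_gpu(
    agent: str,
    priority: List[str],
    agents_on_gpu: Dict[str, Set[str]],
    busy: Set[str],
) -> Optional[str]:
    """Hypothetical three-pass GPU choice for one request.

    priority      -- GPUs in the tier's preferred order, e.g. ["moe", "dense", "vlm"]
    agents_on_gpu -- which agents currently have a request on each GPU (assumed shape)
    busy          -- GPUs with no free slot
    """
    # Only GPUs with a free slot are ever eligible.
    candidates = [gpu for gpu in priority if gpu not in busy]

    # Pass 1: prefer a GPU that no agent is currently using.
    for gpu in candidates:
        if not agents_on_gpu.get(gpu):
            return gpu

    # Pass 2: prefer a GPU this agent is not already on.
    for gpu in candidates:
        if agent not in agents_on_gpu.get(gpu, set()):
            return gpu

    # Pass 3: any non-busy GPU, in tier priority order; None if all are busy.
    return candidates[0] if candidates else None
```

With Tanko already on MoE and both MoE slots free, a heavy-tier request from Mumuni would skip MoE in pass 1 and land on Dense, which is the behavior the commit message describes. Each pass walks the tier's priority order, so the tier preference still breaks ties inside a pass.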