Commit Graph

41 Commits

Author SHA1 Message Date
root 5116e4b1a7 router: heavy tier Dense→MoE→Light + X-Context-Warning headers (compact_soon/compact_recommended/compact_urgent) 2026-05-22 09:48:00 +00:00
Abiba e55bcef21a router: 4 optimizations — saturated flag fix, heavy tier MoE-first, better token est, session tracking
- Saturated flag now triggers on load saturation (was dead code)
- Heavy tier routes MoE(131K) first instead of Dense(98K)
- Token estimation uses JSON length/3.5 (was content/4)
- Cross-turn session tracking via X-Session-Id + Redis TTL 24h
2026-05-21 20:47:48 +00:00
Abiba 32bd817e97 fix: heavy tier back to Dense→MoE→VLM (Dense now 98K) 2026-05-19 21:24:36 +00:00
Abiba 79965450bb fix: Dense context 65K→98K, parallel restored to 2 2026-05-19 21:20:29 +00:00
Abiba 6c829abef5 fix: variable collision (r = Redis vs Response) in stream handler 2026-05-19 21:15:23 +00:00
Abiba 6efd5ff51c feat: context-aware routing + compaction signals
- Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K)
- Heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests
- Response headers: X-Context-Remaining, X-Context-Model
- Routing data includes context_remaining field
- Agents can use this to trigger compaction when nearing limits
2026-05-19 21:13:56 +00:00
Abiba 350a90b524 fix: sync tier 4 default threshold to 50000 tokens (was stale at 4000) 2026-05-19 21:11:34 +00:00
Abiba 3156c093d5 fix: heavy threshold → 50000 tokens, 25 turns (agent contexts are huge) 2026-05-19 21:08:18 +00:00
Abiba 3cbf38e3e2 fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns
Agent conversations with system prompts easily exceed 4000 tokens,
forcing everything to Dense. Now only truly heavy work triggers Dense.
Most agent convos will route to MoE (default) instead.
2026-05-19 20:09:59 +00:00
Abiba b67021ac69 docs: complete design documentation — auth, routing tiers, queue, models, maintenance 2026-05-19 19:17:52 +00:00
Abiba 46dda918de security: reject requests without valid API key (401 instead of defaulting to starter) 2026-05-19 19:13:52 +00:00
Abiba 7a78c0f98d fix: heavy tier — Dense first (best for reasoning), then MoE, then VLM 2026-05-19 18:20:20 +00:00
Abiba 15c474aea0 fix: select_best_gpu respects candidate order — first non-busy wins
Previously it picked the least-loaded GPU globally, ignoring priority order.
Now it tries candidates in order: MoE → VLM → Dense. Only falls back to
least-loaded when ALL candidates are busy.
2026-05-19 18:18:00 +00:00
Abiba bfc38f5436 fix: routing priority — MoE first, VLM second, Dense last (slow)
All tiers now follow MoE → VLM → Dense priority order since
Dense (RTX 3090) can be slow. VLM acts as overflow absorber.
2026-05-19 17:38:21 +00:00
Abiba f519a3fa60 fix: routing — system prompts no longer force heavy tier
System messages are common in agent conversations but don't indicate
heavy workload. Now only token count (>4000) and turn count (>8) trigger
heavy routing. Simple conversations with system prompts can now route to VLM.
2026-05-19 17:19:29 +00:00
Abiba 941e8db65e feat: redesigned routing tiers — VLM handles more traffic
New 4-tier routing:
- TIER 1 (Lightweight): ≤100 words, single-turn → VLM first, fallback Dense
- TIER 2 (Simple Conv): ≤1000 tokens, ≤4 turns → VLM preferred, fallback Dense
- TIER 3 (Heavy): >4000 tokens, system prompts, >8 turns → Dense→MoE→VLM cascade
- TIER 4 (Default): Medium tasks → Dense preferred, MoE default, VLM overflow

VLM gets more utilization for simple conversations instead of defaulting
everything to MoE.
2026-05-19 17:01:55 +00:00
Abiba 241de4f38c revert: remove Ollama endpoints (llama.cpp uses OpenAI format, not Ollama) 2026-05-19 16:57:04 +00:00
Abiba beb2d1790a fix: add /v1/props and /v1/models/<id> Ollama-compatible endpoints
Mumuni's Ollama client probes /v1/props for model discovery and
/v1/models/<id> for per-model details. Previously both returned 404,
causing client retries. Now returns proper model properties and details.
2026-05-19 16:08:24 +00:00
Abiba f2f8e8c921 feat: add request queuing to router (replaces hard 503 on saturation)
When all GPUs are saturated, requests now enter a queue loop (poll every 500ms)
instead of immediately returning 503. Configurable via QUEUE_TIMEOUT env var
(default 30s) or X-Queue-Timeout header per-request.

This prevents agent failures from cluster saturation — agents wait for a slot
instead of crashing on fallback.
2026-05-19 15:55:05 +00:00
Abiba 76ade81fda docs: add Koonimo to agent API keys table 2026-05-19 15:48:39 +00:00
Abiba 9c31b5d622 May 19, 2026: Full harness update
- Model migration: gemma-4-E4B → qwen3.5-9b-vlm
- Dashboard reorder: Usage Over Time + GPU Metrics to top
- Router counter leak fix (gpu_decr in except handler)
- VLM slot upgrade 1→2
- Redis stale key cleanup
- Automated maintenance cron job
- LiteLLM config update
- GPU router config update
- README update
2026-05-19 15:03:34 +00:00
Abiba (pi) 4f032b035c Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation 2026-05-17 09:05:27 +00:00
Abiba (pi) 8f3b0c6647 Router: health check verifies actual llama.cpp endpoint, gpu_decr negative guard, AMD sidecar fixed (sysfs fallback) 2026-05-17 01:52:28 +00:00
Abiba (pi) 808c9d3d13 Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout. 2026-05-16 22:12:21 +00:00
Abiba (pi) 9817fe2ef2 Dashboard: clean rebuild with Queue Status ring chart, GPU slot indicators, organized layout (GPU/Queue+Model+Agent/Usage/Live) 2026-05-16 21:05:19 +00:00
Abiba (pi) 654cdff718 Dashboard: GPU slot indicators show active/max concurrent requests. Koonimo API key added. Real-time queuing visibility. 2026-05-16 20:43:22 +00:00
Abiba (pi) bf90e57c5f Load-aware routing: tracks active GPU requests in Redis, distributes overflow when MoE saturated. 6 concurrent requests now spread across all 3 GPUs instead of queuing on one. 2026-05-16 20:23:32 +00:00
Abiba (pi) 2db2796e53 Dashboard: rename to SyslogAI Harness, GPU bar now shows utilization instead of VRAM 2026-05-16 19:26:46 +00:00
Abiba (pi) ec0f9fac63 Fix: clean_unicode now uses chr()-based replacements + ASCII strip to prevent bash heredoc corruption. Emoji and all non-ASCII now fully stripped. 2026-05-16 19:12:58 +00:00
Abiba (pi) 3d42ea4767 Merge: add Abiba harness code — nginx, LiteLLM, router, dashboard, Redis 2026-05-16 18:53:31 +00:00
Abiba (pi) 7b6c6aabe1 Initial commit: CT 116 inference harness — nginx, LiteLLM, router, dashboard, Redis
- Complexity-based routing (MoE default, Dense heavy, Gemma light)
- Per-agent API keys with metrics tracking
- Time-series usage graphs (24h/7d/30d)
- Streaming support (SSE passthrough)
- Unicode cleanup (ASCII-only output)
- Vision support (gemma-4-E4B)
- Tier enforcement (starter/professional/enterprise)
- GPU health monitoring via sidecar polling
- Unified dashboard with line graph
2026-05-16 18:51:50 +00:00
mumuni-bot b65ea22765 Update Nginx Docker config 2026-05-15 21:35:13 +00:00
mumuni-bot cf7f61650f Add Dockerfile.dashboard 2026-05-15 21:34:52 +00:00
mumuni-bot 7d00bbec0e Add Dockerfile.queue 2026-05-15 21:34:49 +00:00
mumuni-bot 37f7c95b05 Add env example 2026-05-15 21:07:34 +00:00
mumuni-bot a28b3a557d Add Nginx router config 2026-05-15 21:07:33 +00:00
mumuni-bot c42f3a9979 Add migration plan 2026-05-15 21:07:32 +00:00
mumuni-bot e1f12c3462 Add dashboard 2026-05-15 21:07:07 +00:00
mumuni-bot b55b954967 Add queue service 2026-05-15 21:07:05 +00:00
mumuni-bot c85aaa570b Add docker-compose 2026-05-15 21:07:05 +00:00
mumuni-bot 43382dac5b Initial commit: README 2026-05-15 21:07:03 +00:00