Commit Graph

17 Commits

Author SHA1 Message Date
Abiba 46dda918de security: reject requests without valid API key (401 instead of defaulting to starter) 2026-05-19 19:13:52 +00:00
Abiba 7a78c0f98d fix: heavy tier — Dense first (best for reasoning), then MoE, then VLM 2026-05-19 18:20:20 +00:00
Abiba 15c474aea0 fix: select_best_gpu respects candidate order — first non-busy wins
Previously it picked the least-loaded GPU globally, ignoring priority order.
Now it tries candidates in order: MoE → VLM → Dense. Only falls back to
least-loaded when ALL candidates are busy.
2026-05-19 18:18:00 +00:00
Abiba bfc38f5436 fix: routing priority — MoE first, VLM second, Dense last (slow)
All tiers now follow MoE → VLM → Dense priority order since
Dense (RTX 3090) can be slow. VLM acts as overflow absorber.
2026-05-19 17:38:21 +00:00
Abiba f519a3fa60 fix: routing — system prompts no longer force heavy tier
System messages are common in agent conversations but don't indicate
heavy workload. Now only token count (>4000) and turn count (>8) trigger
heavy routing. Simple conversations with system prompts can now route to VLM.
2026-05-19 17:19:29 +00:00
Abiba 941e8db65e feat: redesigned routing tiers — VLM handles more traffic
New 4-tier routing:
- TIER 1 (Lightweight): ≤100 words, single-turn → VLM first, fallback Dense
- TIER 2 (Simple Conv): ≤1000 tokens, ≤4 turns → VLM preferred, fallback Dense
- TIER 3 (Heavy): >4000 tokens, system prompts, >8 turns → Dense→MoE→VLM cascade
- TIER 4 (Default): Medium tasks → Dense preferred, MoE default, VLM overflow

VLM gets more utilization for simple conversations instead of defaulting
everything to MoE.
2026-05-19 17:01:55 +00:00
Abiba 241de4f38c revert: remove Ollama endpoints (llama.cpp uses OpenAI format, not Ollama) 2026-05-19 16:57:04 +00:00
Abiba beb2d1790a fix: add /v1/props and /v1/models/<id> Ollama-compatible endpoints
Mumuni's Ollama client probes /v1/props for model discovery and
/v1/models/<id> for per-model details. Previously both returned 404,
causing client retries. Now returns proper model properties and details.
2026-05-19 16:08:24 +00:00
Abiba f2f8e8c921 feat: add request queuing to router (replaces hard 503 on saturation)
When all GPUs are saturated, requests now enter a queue loop (poll every 500ms)
instead of immediately returning 503. Configurable via QUEUE_TIMEOUT env var
(default 30s) or X-Queue-Timeout header per-request.

This prevents agent failures from cluster saturation — agents wait for a slot
instead of crashing on fallback.
2026-05-19 15:55:05 +00:00
Abiba 9c31b5d622 May 19, 2026: Full harness update
- Model migration: gemma-4-E4B → qwen3.5-9b-vlm
- Dashboard reorder: Usage Over Time + GPU Metrics to top
- Router counter leak fix (gpu_decr in except handler)
- VLM slot upgrade 1→2
- Redis stale key cleanup
- Automated maintenance cron job
- LiteLLM config update
- GPU router config update
- README update
2026-05-19 15:03:34 +00:00
Abiba (pi) 4f032b035c Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation 2026-05-17 09:05:27 +00:00
Abiba (pi) 8f3b0c6647 Router: health check verifies actual llama.cpp endpoint, gpu_decr negative guard, AMD sidecar fixed (sysfs fallback) 2026-05-17 01:52:28 +00:00
Abiba (pi) 808c9d3d13 Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout. 2026-05-16 22:12:21 +00:00
Abiba (pi) 654cdff718 Dashboard: GPU slot indicators show active/max concurrent requests. Koonimo API key added. Real-time queuing visibility. 2026-05-16 20:43:22 +00:00
Abiba (pi) bf90e57c5f Load-aware routing: tracks active GPU requests in Redis, distributes overflow when MoE saturated. 6 concurrent requests now spread across all 3 GPUs instead of queuing on one. 2026-05-16 20:23:32 +00:00
Abiba (pi) ec0f9fac63 Fix: clean_unicode now uses chr()-based replacements + ASCII strip to prevent bash heredoc corruption. Emoji and all non-ASCII now fully stripped. 2026-05-16 19:12:58 +00:00
Abiba (pi) 7b6c6aabe1 Initial commit: CT 116 inference harness — nginx, LiteLLM, router, dashboard, Redis
- Complexity-based routing (MoE default, Dense heavy, Gemma light)
- Per-agent API keys with metrics tracking
- Time-series usage graphs (24h/7d/30d)
- Streaming support (SSE passthrough)
- Unicode cleanup (ASCII-only output)
- Vision support (gemma-4-E4B)
- Tier enforcement (starter/professional/enterprise)
- GPU health monitoring via sidecar polling
- Unified dashboard with line graph
2026-05-16 18:51:50 +00:00