Abiba
32bd817e97
fix: heavy tier back to Dense→MoE→VLM (Dense now 98K)
2026-05-19 21:24:36 +00:00
Abiba
79965450bb
fix: Dense context 65K→98K, parallel restored to 2
2026-05-19 21:20:29 +00:00
Abiba
6c829abef5
fix: variable collision (r = Redis vs Response) in stream handler
2026-05-19 21:15:23 +00:00
Abiba
6efd5ff51c
feat: context-aware routing + compaction signals
...
- Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K)
- Heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests
- Response headers: X-Context-Remaining, X-Context-Model
- Routing data includes context_remaining field
- Agents can use this to trigger compaction when nearing limits
2026-05-19 21:13:56 +00:00
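The context signals in 6efd5ff51c can be sketched roughly as below. This is only an illustration: the exact token counts behind "131K" and "65K" (taken here as 131072 and 65536), the function names, the value carried by X-Context-Model, and the 8192-token reserve an agent keeps before compacting are all assumptions, not values taken from the router.

```python
# Context windows in tokens. The commit says "131K" and "65K"; the exact
# integers below are an assumption.
GPU_CONTEXT = {"moe": 131072, "vlm": 131072, "dense": 65536}


def context_headers(model: str, prompt_tokens: int) -> dict:
    """Build the two response headers named in the commit (hypothetical helper)."""
    remaining = max(GPU_CONTEXT[model] - prompt_tokens, 0)
    return {"X-Context-Remaining": str(remaining), "X-Context-Model": model}


def should_compact(headers: dict, reserve: int = 8192) -> bool:
    """Agent-side check: compact once fewer than `reserve` tokens remain.

    The reserve value is illustrative only.
    """
    return int(headers["X-Context-Remaining"]) < reserve
```

An agent would call should_compact on each response and summarize its history before the next request when it returns True.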
Abiba
350a90b524
fix: sync tier 4 default threshold to 50000 tokens (was stale at 4000)
2026-05-19 21:11:34 +00:00
Abiba
3156c093d5
fix: heavy threshold → 50000 tokens, 25 turns (agent contexts are huge)
2026-05-19 21:08:18 +00:00
Abiba
3cbf38e3e2
fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns
...
Agent conversations with system prompts easily exceed 4000 tokens,
which forced nearly everything to Dense. Now only truly heavy work
triggers Dense; most agent conversations route to MoE (the default) instead.
2026-05-19 20:09:59 +00:00
Abiba
46dda918de
security: reject requests without valid API key (401 instead of defaulting to starter)
2026-05-19 19:13:52 +00:00
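A minimal sketch of the behavior change in 46dda918de, assuming keys arrive as an OpenAI-style "Authorization: Bearer ..." header and that known keys map to a tier name. Both the header scheme and the resolve_tier helper are hypothetical; the commit only states that a missing or invalid key now yields 401 instead of the starter tier.

```python
def resolve_tier(headers: dict, api_keys: dict):
    """Return (status, tier). Unknown or missing keys get 401, never a default tier."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or token not in api_keys:
        return 401, None  # before this commit the request would fall back to "starter"
    return 200, api_keys[token]
```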
Abiba
7a78c0f98d
fix: heavy tier — Dense first (best for reasoning), then MoE, then VLM
2026-05-19 18:20:20 +00:00
Abiba
15c474aea0
fix: select_best_gpu respects candidate order — first non-busy wins
...
Previously it picked the least-loaded GPU globally, ignoring priority order.
Now it tries candidates in order: MoE → VLM → Dense. Only falls back to
least-loaded when ALL candidates are busy.
2026-05-19 18:18:00 +00:00
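The selection rule described in 15c474aea0 can be sketched as follows. The signature and the definition of "busy" (active requests at or above the GPU's slot count) are assumptions; only the ordering behavior comes from the commit message.

```python
def select_best_gpu(candidates, active, max_slots):
    """Pick the first candidate with a free slot, in priority order.

    candidates: GPU names in priority order, e.g. ["moe", "vlm", "dense"]
    active:     current in-flight request count per GPU
    max_slots:  concurrent request limit per GPU (hypothetical input)
    """
    for gpu in candidates:
        if active.get(gpu, 0) < max_slots.get(gpu, 1):
            return gpu  # first non-busy candidate wins
    # Every candidate is busy: fall back to the least-loaded one.
    # min() keeps the earliest candidate on ties, so priority still breaks ties.
    return min(candidates, key=lambda g: active.get(g, 0))
```

With MoE full and VLM idle, select_best_gpu(["moe", "vlm", "dense"], {"moe": 2}, {"moe": 2, "vlm": 2, "dense": 2}) returns "vlm" even if Dense is equally idle, which is the priority behavior the fix restores.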
Abiba
bfc38f5436
fix: routing priority — MoE first, VLM second, Dense last (slow)
...
All tiers now follow MoE → VLM → Dense priority order since
Dense (RTX 3090) can be slow. VLM acts as overflow absorber.
2026-05-19 17:38:21 +00:00
Abiba
f519a3fa60
fix: routing — system prompts no longer force heavy tier
...
System messages are common in agent conversations but don't indicate a
heavy workload. Now only token count (>4000) and turn count (>8) trigger
heavy routing. Simple conversations with system prompts can now route to VLM.
2026-05-19 17:19:29 +00:00
Abiba
941e8db65e
feat: redesigned routing tiers — VLM handles more traffic
...
New 4-tier routing:
- TIER 1 (Lightweight): ≤100 words, single-turn → VLM first, fallback Dense
- TIER 2 (Simple Conv): ≤1000 tokens, ≤4 turns → VLM preferred, fallback Dense
- TIER 3 (Heavy): >4000 tokens, system prompts, >8 turns → Dense→MoE→VLM cascade
- TIER 4 (Default): Medium tasks → Dense preferred, MoE default, VLM overflow
VLM gets more utilization for simple conversations instead of defaulting
everything to MoE.
2026-05-19 17:01:55 +00:00
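The four tiers in 941e8db65e could be classified roughly as below. This is a sketch under assumptions: the function name, counting turns as user messages, counting words by whitespace split, and evaluating the heavy tier first are all illustrative choices. The thresholds are the ones in this commit; later commits in this log raise them and drop the system-prompt trigger.

```python
def classify_tier(messages, tokens):
    """Return the tier number (1-4) for a chat request.

    messages: list of {"role": ..., "content": ...} dicts
    tokens:   token estimate for the whole request (supplied by the caller)
    """
    has_system = any(m["role"] == "system" for m in messages)
    turns = sum(1 for m in messages if m["role"] == "user")
    words = sum(len(m["content"].split()) for m in messages)

    # TIER 3 (Heavy) is checked first so a system prompt overrides the light tiers.
    if tokens > 4000 or has_system or turns > 8:
        return 3
    # TIER 1 (Lightweight): at most 100 words, single turn.
    if words <= 100 and turns <= 1:
        return 1
    # TIER 2 (Simple Conv): at most 1000 tokens and 4 turns.
    if tokens <= 1000 and turns <= 4:
        return 2
    # TIER 4 (Default): everything else.
    return 4
```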
Abiba
241de4f38c
revert: remove Ollama endpoints (llama.cpp uses OpenAI format, not Ollama)
2026-05-19 16:57:04 +00:00
Abiba
beb2d1790a
fix: add /v1/props and /v1/models/<id> Ollama-compatible endpoints
...
Mumuni's Ollama client probes /v1/props for model discovery and
/v1/models/<id> for per-model details. Previously both returned 404,
causing client retries. Both now return the expected model properties and details.
2026-05-19 16:08:24 +00:00
Abiba
f2f8e8c921
feat: add request queuing to router (replaces hard 503 on saturation)
...
When all GPUs are saturated, requests now enter a queue loop (poll every 500ms)
instead of immediately returning 503. Configurable via QUEUE_TIMEOUT env var
(default 30s) or X-Queue-Timeout header per-request.
This prevents agent failures from cluster saturation — agents wait for a slot
instead of crashing on fallback.
2026-05-19 15:55:05 +00:00
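The queue loop from f2f8e8c921 can be sketched as a bounded polling wait. The helper names, the injectable sleep/clock parameters, treating the timeout as seconds, and letting the per-request header take precedence over the environment variable are assumptions made for illustration.

```python
import os
import time


def resolve_queue_timeout(headers, environ=os.environ, default=30.0):
    """Per-request X-Queue-Timeout wins over QUEUE_TIMEOUT, else the 30s default."""
    raw = headers.get("X-Queue-Timeout") or environ.get("QUEUE_TIMEOUT")
    if raw is None:
        return default
    try:
        return max(float(raw), 0.0)
    except ValueError:
        return default  # ignore malformed values rather than failing the request


def wait_for_slot(try_acquire, timeout_s=30.0, poll_s=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll try_acquire() every poll_s seconds until it returns a GPU or time runs out.

    Returns the GPU name, or None when the caller should answer 503.
    """
    deadline = clock() + timeout_s
    while True:
        gpu = try_acquire()
        if gpu is not None:
            return gpu
        if clock() >= deadline:
            return None
        sleep(poll_s)
```

A synchronous loop is shown for brevity; inside an async server the same shape would use a non-blocking sleep so that queued requests do not stall other handlers.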
Abiba
9c31b5d622
May 19, 2026: Full harness update
...
- Model migration: gemma-4-E4B → qwen3.5-9b-vlm
- Dashboard reorder: Usage Over Time + GPU Metrics to top
- Router counter leak fix (gpu_decr in except handler)
- VLM slot upgrade 1→2
- Redis stale key cleanup
- Automated maintenance cron job
- LiteLLM config update
- GPU router config update
- README update
2026-05-19 15:03:34 +00:00
Abiba (pi)
4f032b035c
Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation
2026-05-17 09:05:27 +00:00
Abiba (pi)
8f3b0c6647
Router: health check verifies actual llama.cpp endpoint, gpu_decr negative guard, AMD sidecar fixed (sysfs fallback)
2026-05-17 01:52:28 +00:00
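The gpu_decr negative guard mentioned in 8f3b0c6647 can be illustrated with an in-memory stand-in for the Redis counters. The SlotCounter class and run_with_slot helper are hypothetical, and the sketch releases the slot in a finally block, which is a swapped-in variant of the except-handler release the log describes.

```python
import threading


class SlotCounter:
    """In-memory stand-in for the per-GPU active-request counters kept in Redis."""

    def __init__(self):
        self._active = {}
        self._lock = threading.Lock()

    def gpu_incr(self, gpu):
        with self._lock:
            self._active[gpu] = self._active.get(gpu, 0) + 1

    def gpu_decr(self, gpu):
        with self._lock:
            # Negative guard: a stray extra decrement must not push the count below zero.
            self._active[gpu] = max(self._active.get(gpu, 0) - 1, 0)

    def active(self, gpu):
        with self._lock:
            return self._active.get(gpu, 0)


def run_with_slot(counter, gpu, handler):
    """Hold a slot for the duration of handler(), releasing it even on failure."""
    counter.gpu_incr(gpu)
    try:
        return handler()
    finally:
        counter.gpu_decr(gpu)  # prevents the counter leak when the upstream call raises
```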
Abiba (pi)
808c9d3d13
Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout.
2026-05-16 22:12:21 +00:00
Abiba (pi)
654cdff718
Dashboard: GPU slot indicators show active/max concurrent requests. Koonimo API key added. Real-time queuing visibility.
2026-05-16 20:43:22 +00:00
Abiba (pi)
bf90e57c5f
Load-aware routing: tracks active GPU requests in Redis, distributes overflow when MoE saturated. 6 concurrent requests now spread across all 3 GPUs instead of queuing on one.
2026-05-16 20:23:32 +00:00
Abiba (pi)
ec0f9fac63
Fix: clean_unicode now uses chr()-based replacements + ASCII strip to prevent bash heredoc corruption. Emoji and all other non-ASCII characters are now fully stripped.
2026-05-16 19:12:58 +00:00
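The approach named in ec0f9fac63 (chr()-based replacements followed by an ASCII strip) might look like the sketch below. The replacement table is hypothetical; the commit does not list which characters are mapped. Writing the keys as chr() calls keeps the source file itself pure ASCII, which is what makes it safe to pass through a bash heredoc.

```python
# Hypothetical mapping of common typographic characters to ASCII equivalents.
REPLACEMENTS = {
    chr(0x2018): "'",    # left single quote
    chr(0x2019): "'",    # right single quote
    chr(0x201C): '"',    # left double quote
    chr(0x201D): '"',    # right double quote
    chr(0x2013): "-",    # en dash
    chr(0x2014): "-",    # em dash
    chr(0x2026): "...",  # horizontal ellipsis
    chr(0x00A0): " ",    # non-breaking space
}


def clean_unicode(text: str) -> str:
    """Map known characters to ASCII, then drop anything still outside ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Emoji and any remaining non-ASCII characters are discarded here.
    return text.encode("ascii", "ignore").decode("ascii")
```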
Abiba (pi)
7b6c6aabe1
Initial commit: CT 116 inference harness — nginx, LiteLLM, router, dashboard, Redis
...
- Complexity-based routing (MoE default, Dense heavy, Gemma light)
- Per-agent API keys with metrics tracking
- Time-series usage graphs (24h/7d/30d)
- Streaming support (SSE passthrough)
- Unicode cleanup (ASCII-only output)
- Vision support (gemma-4-E4B)
- Tier enforcement (starter/professional/enterprise)
- GPU health monitoring via sidecar polling
- Unified dashboard with line graph
2026-05-16 18:51:50 +00:00