syslog-harness

Author	SHA1	Message	Date
Abiba	32bd817e97	fix: heavy tier back to Dense→MoE→VLM (Dense now 98K)	2026-05-19 21:24:36 +00:00
Abiba	79965450bb	fix: Dense context 65K→98K, parallel restored to 2	2026-05-19 21:20:29 +00:00
Abiba	6c829abef5	fix: variable collision (r = Redis vs Response) in stream handler	2026-05-19 21:15:23 +00:00
Abiba	6efd5ff51c	feat: context-aware routing + compaction signals. Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K); heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests; response headers: X-Context-Remaining, X-Context-Model; routing data includes context_remaining field; agents can use this to trigger compaction when nearing limits.	2026-05-19 21:13:56 +00:00
Abiba	350a90b524	fix: sync tier 4 default threshold to 50000 tokens (was stale at 4000)	2026-05-19 21:11:34 +00:00
Abiba	3156c093d5	fix: heavy threshold → 50000 tokens, 25 turns (agent contexts are huge)	2026-05-19 21:08:18 +00:00
Abiba	3cbf38e3e2	fix: raise heavy threshold (4000→12000 tokens, 8→15 turns). Agent conversations with system prompts easily exceed 4000 tokens, forcing everything to Dense. Now only truly heavy work triggers Dense. Most agent convos will route to MoE (default) instead.	2026-05-19 20:09:59 +00:00
Abiba	46dda918de	security: reject requests without valid API key (401 instead of defaulting to starter)	2026-05-19 19:13:52 +00:00
Abiba	7a78c0f98d	fix: heavy tier — Dense first (best for reasoning), then MoE, then VLM	2026-05-19 18:20:20 +00:00
Abiba	15c474aea0	fix: select_best_gpu respects candidate order (first non-busy wins). Previously it picked the least-loaded GPU globally, ignoring priority order. Now it tries candidates in order: MoE → VLM → Dense. Only falls back to least-loaded when ALL candidates are busy.	2026-05-19 18:18:00 +00:00
Abiba	bfc38f5436	fix: routing priority is MoE first, VLM second, Dense last (slow). All tiers now follow MoE → VLM → Dense priority order since Dense (RTX 3090) can be slow. VLM acts as overflow absorber.	2026-05-19 17:38:21 +00:00
Abiba	f519a3fa60	fix: routing (system prompts no longer force heavy tier). System messages are common in agent conversations but don't indicate heavy workload. Now only token count (>4000) and turn count (>8) trigger heavy routing. Simple conversations with system prompts can now route to VLM.	2026-05-19 17:19:29 +00:00
Abiba	941e8db65e	feat: redesigned routing tiers (VLM handles more traffic). New 4-tier routing: TIER 1 (Lightweight): ≤100 words, single-turn → VLM first, fallback Dense; TIER 2 (Simple Conv): ≤1000 tokens, ≤4 turns → VLM preferred, fallback Dense; TIER 3 (Heavy): >4000 tokens, system prompts, >8 turns → Dense→MoE→VLM cascade; TIER 4 (Default): medium tasks → Dense preferred, MoE default, VLM overflow. VLM gets more utilization for simple conversations instead of defaulting everything to MoE.	2026-05-19 17:01:55 +00:00
Abiba	241de4f38c	revert: remove Ollama endpoints (llama.cpp uses OpenAI format, not Ollama)	2026-05-19 16:57:04 +00:00
Abiba	beb2d1790a	fix: add /v1/props and /v1/models/<id> Ollama-compatible endpoints. Mumuni's Ollama client probes /v1/props for model discovery and /v1/models/<id> for per-model details. Previously both returned 404, causing client retries. Now returns proper model properties and details.	2026-05-19 16:08:24 +00:00
Abiba	f2f8e8c921	feat: add request queuing to router (replaces hard 503 on saturation). When all GPUs are saturated, requests now enter a queue loop (poll every 500ms) instead of immediately returning 503. Configurable via QUEUE_TIMEOUT env var (default 30s) or X-Queue-Timeout header per-request. This prevents agent failures from cluster saturation: agents wait for a slot instead of crashing on fallback.	2026-05-19 15:55:05 +00:00
Abiba	9c31b5d622	May 19, 2026: Full harness update. Model migration: gemma-4-E4B → qwen3.5-9b-vlm; dashboard reorder: Usage Over Time + GPU Metrics to top; router counter leak fix (gpu_decr in except handler); VLM slot upgrade 1→2; Redis stale key cleanup; automated maintenance cron job; LiteLLM config update; GPU router config update; README update.	2026-05-19 15:03:34 +00:00
Abiba (pi)	4f032b035c	Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation	2026-05-17 09:05:27 +00:00
Abiba (pi)	8f3b0c6647	Router: health check verifies actual llama.cpp endpoint, gpu_decr negative guard, AMD sidecar fixed (sysfs fallback)	2026-05-17 01:52:28 +00:00
Abiba (pi)	808c9d3d13	Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout.	2026-05-16 22:12:21 +00:00
Abiba (pi)	654cdff718	Dashboard: GPU slot indicators show active/max concurrent requests. Koonimo API key added. Real-time queuing visibility.	2026-05-16 20:43:22 +00:00
Abiba (pi)	bf90e57c5f	Load-aware routing: tracks active GPU requests in Redis, distributes overflow when MoE saturated. 6 concurrent requests now spread across all 3 GPUs instead of queuing on one.	2026-05-16 20:23:32 +00:00
Abiba (pi)	ec0f9fac63	Fix: clean_unicode now uses chr()-based replacements + ASCII strip to prevent bash heredoc corruption. Emoji and all non-ASCII now fully stripped.	2026-05-16 19:12:58 +00:00
Abiba (pi)	7b6c6aabe1	Initial commit: CT 116 inference harness (nginx, LiteLLM, router, dashboard, Redis). Complexity-based routing (MoE default, Dense heavy, Gemma light); per-agent API keys with metrics tracking; time-series usage graphs (24h/7d/30d); streaming support (SSE passthrough); Unicode cleanup (ASCII-only output); vision support (gemma-4-E4B); tier enforcement (starter/professional/enterprise); GPU health monitoring via sidecar polling; unified dashboard with line graph.	2026-05-16 18:51:50 +00:00
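
The candidate-order behavior described in commit 15c474aea0 (first non-busy candidate wins, least-loaded only when all are busy) might look roughly like the sketch below. This is not the repository's code: the function signature, the `active` and `max_slots` dictionaries, and the choice of raw active-request count as the "least-loaded" measure are all assumptions made for illustration. The log indicates the real router tracks active requests in Redis; plain dictionaries stand in for that here.

```python
from typing import Mapping, Sequence


def select_best_gpu(
    candidates: Sequence[str],
    active: Mapping[str, int],
    max_slots: Mapping[str, int],
) -> str:
    """Pick a GPU from `candidates`, honoring their priority order.

    Hypothetical sketch: `active` maps GPU name -> in-flight requests,
    `max_slots` maps GPU name -> concurrent slots (assumed default: 1).
    """
    if not candidates:
        raise ValueError("candidates must not be empty")

    # First pass: the first candidate with a free slot wins.
    for name in candidates:
        if active.get(name, 0) < max_slots.get(name, 1):
            return name

    # All candidates are busy: fall back to the least-loaded one.
    # min() returns the first minimum, so ties keep priority order.
    return min(candidates, key=lambda name: active.get(name, 0))


if __name__ == "__main__":
    order = ["moe", "vlm", "dense"]
    slots = {"moe": 2, "vlm": 2, "dense": 2}
    print(select_best_gpu(order, {"moe": 2, "vlm": 0, "dense": 0}, slots))
```

In this sketch a saturated MoE passes the request to VLM even when Dense is equally idle, which matches the MoE → VLM → Dense ordering the commit message describes.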

24 Commits
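
The queue loop described in commit f2f8e8c921 (poll every 500ms, give up after a timeout that defaults to 30s) can be sketched as follows. Everything here is illustrative rather than taken from the repository: `wait_for_slot`, `try_acquire`, and the injectable `clock` and `sleep` parameters are hypothetical names, and how the real router reads QUEUE_TIMEOUT or X-Queue-Timeout is not shown in the log.

```python
import time
from typing import Callable, Optional


def wait_for_slot(
    try_acquire: Callable[[], Optional[str]],
    timeout_s: float = 30.0,   # assumed to mirror the QUEUE_TIMEOUT default
    poll_s: float = 0.5,       # assumed to mirror the 500ms poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll `try_acquire` until it returns a GPU name or the timeout expires.

    Returns the GPU name on success, or None on timeout, in which case a
    caller would presumably respond with 503.
    """
    deadline = clock() + timeout_s
    while True:
        gpu = try_acquire()
        if gpu is not None:
            return gpu
        if clock() >= deadline:
            return None
        sleep(poll_s)


if __name__ == "__main__":
    attempts = []

    def fake_acquire() -> Optional[str]:
        # Hypothetical stand-in: a slot frees up on the third poll.
        attempts.append(1)
        return "vlm" if len(attempts) >= 3 else None

    print(wait_for_slot(fake_acquire, timeout_s=5.0, poll_s=0.01))
```

Passing the clock and sleep functions as parameters is only a convenience for testing the loop without real delays; a production handler would more likely use an async sleep so that waiting requests do not block a worker.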