Abiba e55bcef21a router: 4 optimizations — saturated flag fix, heavy tier MoE-first, better token est, session tracking
- Saturated flag now triggers on load saturation (was dead code)
- Heavy tier routes MoE(131K) first instead of Dense(98K)
- Token estimation uses JSON length/3.5 (was content/4)
- Cross-turn session tracking via X-Session-Id + Redis TTL 24h
2026-05-21 20:47:48 +00:00
2026-05-15 21:07:05 +00:00
2026-05-15 21:07:34 +00:00
2026-05-19 15:03:34 +00:00
2026-05-15 21:34:52 +00:00
2026-05-15 21:34:49 +00:00
2026-05-15 21:07:32 +00:00

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
                                     Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL Purpose
/v1/chat/completions Inference API (OpenAI-compatible) — API key required
/v1/models Available models
/ Dashboard (GPU health, routing, agents, timeseries)

Authentication

All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.

Agent API Keys

Agent Key
Abiba sk-syslog-abiba
Mumuni sk-syslog-mumuni
Tanko sk-syslog-tanko
Koby sk-syslog-koby
Kagenz0 sk-syslog-kagenz0
Koonimo sk-syslog-koonimo

Routing Tiers

Tier Trigger Priority
Lightweight No system prompt, ≤1 turn, ≤100 words VLM → MoE → Dense
Simple Conv ≤1000 tokens, ≤4 turns VLM → MoE → Dense
Heavy >4000 tokens OR >8 turns Dense → MoE → VLM
Default Everything else MoE → VLM → Dense

Queue

When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).

Models

GPU Model VRAM Slots
Strix Halo qwen3.6-35B-A3B (MoE) 65GB 2
RTX 3090 qwen3.6-27B-code (Dense) 24GB 2
RTX 5070 qwen3.5-9b-vlm (VLM) 12GB 2

Maintenance

Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):

  • Cleans Redis timeseries keys >60 days
  • Prunes Docker build cache >7 days
  • Logs container health and Redis memory

Logs: /var/log/harness-maintenance.log

S
Description
Syslog Operational Agent Harness — Nginx routing, Redis queue, circuit breaker, monitoring, Docker migration
Readme 311 KiB
Languages
Python 99.2%
Dockerfile 0.8%