f519a3fa604cb0e085071311b5effa4377d6c5ba
System messages are common in agent conversations but don't indicate heavy workload. Now only token count (>4000) and turn count (>8) trigger heavy routing. Simple conversations with system prompts can now route to VLM.
syslog-harness — Inference API Harness
CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
Architecture
nginx :80 → router :9000 → GPU backends
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
Deploy
cd /opt/inference-harness
docker compose up -d
Endpoints
| URL | Purpose |
|---|---|
/v1/chat/completions |
Inference API (OpenAI-compatible) |
/v1/models |
Available models |
/ |
Dashboard (GPU health, routing, agents, timeseries) |
Agent API Keys
| Agent | Key |
|---|---|
| Abiba | sk-syslog-abiba |
| Mumuni | sk-syslog-mumuni |
| Tanko | sk-syslog-tanko |
| Koby | sk-syslog-koby |
| Kagenz0 | sk-syslog-kagenz0 |
| Koonimo | sk-syslog-koonimo |
Description
Syslog Operational Agent Harness — Nginx routing, Redis queue, circuit breaker, monitoring, Docker migration
Languages
Python
99.2%
Dockerfile
0.8%