3cbf38e3e25793f674adc0500fcfd894bf5c7eb5
Agent conversations with system prompts easily exceed 4000 tokens, forcing everything to Dense. Now only truly heavy work triggers Dense. Most agent convos will route to MoE (default) instead.
syslog-harness — Inference API Harness
CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
Architecture
nginx :80 → router :9000 → GPU backends
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
Total: 6 concurrent slots
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
Deploy
cd /opt/inference-harness
docker compose up -d
Endpoints
| URL | Purpose |
|---|---|
/v1/chat/completions |
Inference API (OpenAI-compatible) — API key required |
/v1/models |
Available models |
/ |
Dashboard (GPU health, routing, agents, timeseries) |
Authentication
All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.
Agent API Keys
| Agent | Key |
|---|---|
| Abiba | sk-syslog-abiba |
| Mumuni | sk-syslog-mumuni |
| Tanko | sk-syslog-tanko |
| Koby | sk-syslog-koby |
| Kagenz0 | sk-syslog-kagenz0 |
| Koonimo | sk-syslog-koonimo |
Routing Tiers
| Tier | Trigger | Priority |
|---|---|---|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
| Default | Everything else | MoE → VLM → Dense |
Queue
When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).
Models
| GPU | Model | VRAM | Slots |
|---|---|---|---|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
Maintenance
Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):
- Cleans Redis timeseries keys >60 days
- Prunes Docker build cache >7 days
- Logs container health and Redis memory
Logs: /var/log/harness-maintenance.log
Description
Syslog Operational Agent Harness — Nginx routing, Redis queue, circuit breaker, monitoring, Docker migration
Languages
Python
99.2%
Dockerfile
0.8%