6efd5ff51ca39b0ad2b6b448bb46f2936cebed59
- Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K) - Heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests - Response headers: X-Context-Remaining, X-Context-Model - Routing data includes context_remaining field - Agents can use this to trigger compaction when nearing limits
syslog-harness — Inference API Harness
CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
Architecture
nginx :80 → router :9000 → GPU backends
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
Total: 6 concurrent slots
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
Deploy
cd /opt/inference-harness
docker compose up -d
Endpoints
| URL | Purpose |
|---|---|
/v1/chat/completions |
Inference API (OpenAI-compatible) — API key required |
/v1/models |
Available models |
/ |
Dashboard (GPU health, routing, agents, timeseries) |
Authentication
All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.
Agent API Keys
| Agent | Key |
|---|---|
| Abiba | sk-syslog-abiba |
| Mumuni | sk-syslog-mumuni |
| Tanko | sk-syslog-tanko |
| Koby | sk-syslog-koby |
| Kagenz0 | sk-syslog-kagenz0 |
| Koonimo | sk-syslog-koonimo |
Routing Tiers
| Tier | Trigger | Priority |
|---|---|---|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
| Default | Everything else | MoE → VLM → Dense |
Queue
When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).
Models
| GPU | Model | VRAM | Slots |
|---|---|---|---|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
Maintenance
Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):
- Cleans Redis timeseries keys >60 days
- Prunes Docker build cache >7 days
- Logs container health and Redis memory
Logs: /var/log/harness-maintenance.log
Description
Syslog Operational Agent Harness — Nginx routing, Redis queue, circuit breaker, monitoring, Docker migration
Languages
Python
99.2%
Dockerfile
0.8%