# syslog-harness: Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

## Architecture

```
nginx :80 → router :9000 → GPU backends
                            ├─ qwen3.6-35B-A3B  (MoE)   @ 192.168.68.15:8080   [2 slots]
                            ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080    [2 slots]
                            └─ qwen3.5-9b-vlm   (VLM)   @ 192.168.68.110:8080  [2 slots]
                            Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
```

## Deploy

```bash
cd /opt/inference-harness
docker compose up -d
```

## Endpoints

| URL | Purpose |
|-----|---------|
| `/v1/chat/completions` | Inference API (OpenAI-compatible); **API key required** |
| `/v1/models` | Available models |
| `/` | Dashboard (GPU health, routing, agents, timeseries) |

## Authentication

**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <API_KEY>`. Missing or invalid keys return **401 Unauthorized**.

## Agent API Keys

| Agent | Key |
|-------|-----|
| Abiba | `sk-syslog-abiba` |
| Mumuni | `sk-syslog-mumuni` |
| Tanko | `sk-syslog-tanko` |
| Koby | `sk-syslog-koby` |
| Kagenz0 | `sk-syslog-kagenz0` |
| Koonimo | `sk-syslog-koonimo` |

## Routing Tiers

| Tier | Trigger | Priority |
|------|---------|----------|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
| Default | Everything else | MoE → VLM → Dense |

## Queue

When all GPUs are saturated, requests enter a polling queue (checked every 500 ms) instead of returning 503 immediately. Timeout: 30s (configurable via the `QUEUE_TIMEOUT` environment variable or the `X-Queue-Timeout` header).

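The routing tiers and the polling queue described above can be illustrated with a small Python sketch. This is not the router's actual code. The function names (`classify_tier`, `wait_for_slot`), the top-to-bottom evaluation order of the tiers, and the idea that the caller supplies word and token counts and a `free_slots` callback are all assumptions made for illustration. Only the thresholds, the priority orders, the 500 ms interval, and the 30 s default come from this README.

```python
import time
from typing import Callable, Dict, List, Optional

# Backend priority per tier, copied from the Routing Tiers table.
PRIORITY: Dict[str, List[str]] = {
    "lightweight": ["VLM", "MoE", "Dense"],
    "simple": ["VLM", "MoE", "Dense"],
    "heavy": ["Dense", "MoE", "VLM"],
    "default": ["MoE", "VLM", "Dense"],
}


def classify_tier(has_system_prompt: bool, turns: int, words: int, tokens: int) -> str:
    """Map a request to a tier; rules are checked top to bottom (assumed order)."""
    if not has_system_prompt and turns <= 1 and words <= 100:
        return "lightweight"
    if tokens <= 1000 and turns <= 4:
        return "simple"
    if tokens > 4000 or turns > 8:
        return "heavy"
    return "default"


def wait_for_slot(
    priority: List[str],
    free_slots: Callable[[], Dict[str, int]],  # hypothetical: backend name -> free slots
    timeout_s: float = 30.0,                   # README default queue timeout
    poll_s: float = 0.5,                       # README polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until a backend in `priority` has a free slot, or give up at the deadline."""
    deadline = clock() + timeout_s
    while True:
        slots = free_slots()
        for backend in priority:
            if slots.get(backend, 0) > 0:
                return backend
        if clock() >= deadline:
            return None  # the caller would presumably answer 503 here
        sleep(poll_s)
```

A caller might combine the two as `wait_for_slot(PRIORITY[classify_tier(True, 6, 1500, 2500)], free_slots)`. Passing `sleep` and `clock` as parameters is a design choice for this sketch only, so that the loop can be exercised without real waiting.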
## Models

| GPU | Model | VRAM | Slots |
|-----|-------|------|-------|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |

## Maintenance

An automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):

- Cleans Redis timeseries keys older than 60 days
- Prunes Docker build cache older than 7 days
- Logs container health and Redis memory

Logs: `/var/log/harness-maintenance.log`
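
The 60-day retention rule for timeseries keys can be pictured with a short Python sketch. It is illustrative only: the contents of `maintenance.sh` are not shown in this README, and the `expired_keys` helper, the key names, and the mapping of key to last-write time are hypothetical. The sketch assumes "older than 60 days" means strictly older than the cutoff.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

RETENTION = timedelta(days=60)  # retention window from the Maintenance list


def expired_keys(last_write: Dict[str, datetime], now: datetime) -> List[str]:
    """Return, sorted, the keys whose last write is strictly older than the cutoff."""
    cutoff = now - RETENTION
    return sorted(key for key, ts in last_write.items() if ts < cutoff)


# Hypothetical key names and timestamps, evaluated at a 03:00 UTC run.
run_time = datetime(2025, 6, 1, 3, 0, tzinfo=timezone.utc)
sample = {
    "ts:gpu:old": run_time - timedelta(days=61),
    "ts:gpu:edge": run_time - timedelta(days=60),
    "ts:gpu:new": run_time - timedelta(days=5),
}
to_delete = expired_keys(sample, run_time)
```

With this sample data only `ts:gpu:old` is selected; a key written exactly 60 days ago is kept.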