Files

2.3 KiB

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
                                     Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL Purpose
/v1/chat/completions Inference API (OpenAI-compatible) — API key required
/v1/models Available models
/ Dashboard (GPU health, routing, agents, timeseries)

Authentication

All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.

Agent API Keys

Agent Key
Abiba sk-syslog-abiba
Mumuni sk-syslog-mumuni
Tanko sk-syslog-tanko
Koby sk-syslog-koby
Kagenz0 sk-syslog-kagenz0
Koonimo sk-syslog-koonimo

Routing Tiers

Tier Trigger Priority
Lightweight No system prompt, ≤1 turn, ≤100 words VLM → MoE → Dense
Simple Conv ≤1000 tokens, ≤4 turns VLM → MoE → Dense
Heavy >4000 tokens OR >8 turns Dense → MoE → VLM
Default Everything else MoE → VLM → Dense

Queue

When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).

Models

GPU Model VRAM Slots
Strix Halo qwen3.6-35B-A3B (MoE) 65GB 2
RTX 3090 qwen3.6-27B-code (Dense) 24GB 2
RTX 5070 qwen3.5-9b-vlm (VLM) 12GB 2

Maintenance

Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):

  • Cleans Redis timeseries keys >60 days
  • Prunes Docker build cache >7 days
  • Logs container health and Redis memory

Logs: /var/log/harness-maintenance.log