syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
                                     Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL	Purpose
`/v1/chat/completions`	Inference API (OpenAI-compatible) — API key required
`/v1/models`	Available models
`/`	Dashboard (GPU health, routing, agents, timeseries)

Authentication

All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.

Agent API Keys

Agent	Key
Abiba	`sk-syslog-abiba`
Mumuni	`sk-syslog-mumuni`
Tanko	`sk-syslog-tanko`
Koby	`sk-syslog-koby`
Kagenz0	`sk-syslog-kagenz0`
Koonimo	`sk-syslog-koonimo`

Routing Tiers

Tier	Trigger	Priority
Lightweight	No system prompt, ≤1 turn, ≤100 words	VLM → MoE → Dense
Simple Conv	≤1000 tokens, ≤4 turns	VLM → MoE → Dense
Heavy	>4000 tokens OR >8 turns	Dense → MoE → VLM
Default	Everything else	MoE → VLM → Dense

Queue

When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).

Models

GPU	Model	VRAM	Slots
Strix Halo	qwen3.6-35B-A3B (MoE)	65GB	2
RTX 3090	qwen3.6-27B-code (Dense)	24GB	2
RTX 5070	qwen3.5-9b-vlm (VLM)	12GB	2

Maintenance

Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):

Cleans Redis timeseries keys >60 days
Prunes Docker build cache >7 days
Logs container health and Redis memory

Logs: /var/log/harness-maintenance.log

2.3 KiB Raw Permalink Blame History