Abiba 3cbf38e3e2 fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns
Agent conversations with system prompts easily exceed 4000 tokens,
forcing everything to Dense. Now only truly heavy work triggers Dense.
Most agent convos will route to MoE (default) instead.
2026-05-19 20:09:59 +00:00
2026-05-15 21:07:05 +00:00
2026-05-15 21:07:34 +00:00
2026-05-19 15:03:34 +00:00
2026-05-15 21:34:52 +00:00
2026-05-15 21:34:49 +00:00
2026-05-15 21:07:32 +00:00

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
                                     Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL Purpose
/v1/chat/completions Inference API (OpenAI-compatible) — API key required
/v1/models Available models
/ Dashboard (GPU health, routing, agents, timeseries)

Authentication

All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.

Agent API Keys

Agent Key
Abiba sk-syslog-abiba
Mumuni sk-syslog-mumuni
Tanko sk-syslog-tanko
Koby sk-syslog-koby
Kagenz0 sk-syslog-kagenz0
Koonimo sk-syslog-koonimo

Routing Tiers

Tier Trigger Priority
Lightweight No system prompt, ≤1 turn, ≤100 words VLM → MoE → Dense
Simple Conv ≤1000 tokens, ≤4 turns VLM → MoE → Dense
Heavy >4000 tokens OR >8 turns Dense → MoE → VLM
Default Everything else MoE → VLM → Dense

Queue

When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).

Models

GPU Model VRAM Slots
Strix Halo qwen3.6-35B-A3B (MoE) 65GB 2
RTX 3090 qwen3.6-27B-code (Dense) 24GB 2
RTX 5070 qwen3.5-9b-vlm (VLM) 12GB 2

Maintenance

Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):

  • Cleans Redis timeseries keys >60 days
  • Prunes Docker build cache >7 days
  • Logs container health and Redis memory

Logs: /var/log/harness-maintenance.log

S
Description
Syslog Operational Agent Harness — Nginx routing, Redis queue, circuit breaker, monitoring, Docker migration
Readme 311 KiB
Languages
Python 99.2%
Dockerfile 0.8%