T

Abiba 3cbf38e3e2 fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns

Agent conversations with system prompts easily exceed 4000 tokens,
forcing everything to Dense. Now only truly heavy work triggers Dense.
Most agent convos will route to MoE (default) instead.

2026-05-19 20:09:59 +00:00

dashboard

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

nginx

Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout.

2026-05-16 22:12:21 +00:00

queue-service

Add queue service

2026-05-15 21:07:05 +00:00

router

fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns

2026-05-19 20:09:59 +00:00

.env.example

Add env example

2026-05-15 21:07:34 +00:00

.gitignore

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

docker-compose.yml

Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation

2026-05-17 09:05:27 +00:00

Dockerfile.dashboard

Add Dockerfile.dashboard

2026-05-15 21:34:52 +00:00

Dockerfile.queue

Add Dockerfile.queue

2026-05-15 21:34:49 +00:00

gpu-router-docker.conf

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

gpu-router.conf

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

litellm_config.yaml

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

MIGRATION_PLAN.md

Add migration plan

2026-05-15 21:07:32 +00:00

README.md

docs: complete design documentation — auth, routing tiers, queue, models, maintenance

2026-05-19 19:17:52 +00:00

README.md

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
                                     Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL	Purpose
`/v1/chat/completions`	Inference API (OpenAI-compatible) — API key required
`/v1/models`	Available models
`/`	Dashboard (GPU health, routing, agents, timeseries)

Authentication

All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.

Agent API Keys

Agent	Key
Abiba	`sk-syslog-abiba`
Mumuni	`sk-syslog-mumuni`
Tanko	`sk-syslog-tanko`
Koby	`sk-syslog-koby`
Kagenz0	`sk-syslog-kagenz0`
Koonimo	`sk-syslog-koonimo`

Routing Tiers

Tier	Trigger	Priority
Lightweight	No system prompt, ≤1 turn, ≤100 words	VLM → MoE → Dense
Simple Conv	≤1000 tokens, ≤4 turns	VLM → MoE → Dense
Heavy	>4000 tokens OR >8 turns	Dense → MoE → VLM
Default	Everything else	MoE → VLM → Dense

Queue

When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).

Models

GPU	Model	VRAM	Slots
Strix Halo	qwen3.6-35B-A3B (MoE)	65GB	2
RTX 3090	qwen3.6-27B-code (Dense)	24GB	2
RTX 5070	qwen3.5-9b-vlm (VLM)	12GB	2

Maintenance

Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):

Cleans Redis timeseries keys >60 days
Prunes Docker build cache >7 days
Logs container health and Redis memory

Logs: /var/log/harness-maintenance.log