T

Abiba 6efd5ff51c feat: context-aware routing + compaction signals

- Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K)
- Heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests
- Response headers: X-Context-Remaining, X-Context-Model
- Routing data includes context_remaining field
- Agents can use this to trigger compaction when nearing limits

2026-05-19 21:13:56 +00:00

dashboard

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

nginx

Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout.

2026-05-16 22:12:21 +00:00

queue-service

Add queue service

2026-05-15 21:07:05 +00:00

router

feat: context-aware routing + compaction signals

2026-05-19 21:13:56 +00:00

.env.example

Add env example

2026-05-15 21:07:34 +00:00

.gitignore

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

docker-compose.yml

Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation

2026-05-17 09:05:27 +00:00

Dockerfile.dashboard

Add Dockerfile.dashboard

2026-05-15 21:34:52 +00:00

Dockerfile.queue

Add Dockerfile.queue

2026-05-15 21:34:49 +00:00

gpu-router-docker.conf

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

gpu-router.conf

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

litellm_config.yaml

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

MIGRATION_PLAN.md

Add migration plan

2026-05-15 21:07:32 +00:00

README.md

docs: complete design documentation — auth, routing tiers, queue, models, maintenance

2026-05-19 19:17:52 +00:00

README.md

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
                                     Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL	Purpose
`/v1/chat/completions`	Inference API (OpenAI-compatible) — API key required
`/v1/models`	Available models
`/`	Dashboard (GPU health, routing, agents, timeseries)

Authentication

All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.

Agent API Keys

Agent	Key
Abiba	`sk-syslog-abiba`
Mumuni	`sk-syslog-mumuni`
Tanko	`sk-syslog-tanko`
Koby	`sk-syslog-koby`
Kagenz0	`sk-syslog-kagenz0`
Koonimo	`sk-syslog-koonimo`

Routing Tiers

Tier	Trigger	Priority
Lightweight	No system prompt, ≤1 turn, ≤100 words	VLM → MoE → Dense
Simple Conv	≤1000 tokens, ≤4 turns	VLM → MoE → Dense
Heavy	>4000 tokens OR >8 turns	Dense → MoE → VLM
Default	Everything else	MoE → VLM → Dense

Queue

When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).

Models

GPU	Model	VRAM	Slots
Strix Halo	qwen3.6-35B-A3B (MoE)	65GB	2
RTX 3090	qwen3.6-27B-code (Dense)	24GB	2
RTX 5070	qwen3.5-9b-vlm (VLM)	12GB	2

Maintenance

Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):

Cleans Redis timeseries keys >60 days
Prunes Docker build cache >7 days
Logs container health and Redis memory

Logs: /var/log/harness-maintenance.log