Abiba f519a3fa60 fix: routing — system prompts no longer force heavy tier

System messages are common in agent conversations but don't indicate
heavy workload. Now only token count (>4000) and turn count (>8) trigger
heavy routing. Simple conversations with system prompts can now route to VLM.

2026-05-19 17:19:29 +00:00
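
The routing rule in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the router's actual code. The function names, the 4-characters-per-token estimate, counting a "turn" as one user message, and treating either threshold as sufficient on its own are all assumptions. Only the two thresholds (4000 tokens, 8 turns) and the fact that system prompts no longer matter come from the commit message.

```python
from typing import Dict, List

# Thresholds quoted in the commit message.
HEAVY_TOKEN_THRESHOLD = 4000
HEAVY_TURN_THRESHOLD = 8


def estimate_tokens(messages: List[Dict[str, str]]) -> int:
    """Rough token estimate. Assumption: about 4 characters per token."""
    return sum(len(m.get("content", "")) for m in messages) // 4


def count_turns(messages: List[Dict[str, str]]) -> int:
    """Assumption: one turn equals one user message."""
    return sum(1 for m in messages if m.get("role") == "user")


def is_heavy(messages: List[Dict[str, str]]) -> bool:
    """Return True when the conversation should go to the heavy tier.

    The presence of a system message is deliberately not checked here.
    Assumption: exceeding either threshold is enough.
    """
    return (
        estimate_tokens(messages) > HEAVY_TOKEN_THRESHOLD
        or count_turns(messages) > HEAVY_TURN_THRESHOLD
    )
```

Under these assumptions, a short conversation that carries a system prompt is not classified as heavy, which matches the behaviour the commit describes.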

File	Last commit	Updated
dashboard	May 19, 2026: Full harness update	2026-05-19 15:03:34 +00:00
nginx	Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout.	2026-05-16 22:12:21 +00:00
queue-service	Add queue service	2026-05-15 21:07:05 +00:00
router	fix: routing — system prompts no longer force heavy tier	2026-05-19 17:19:29 +00:00
.env.example	Add env example	2026-05-15 21:07:34 +00:00
.gitignore	May 19, 2026: Full harness update	2026-05-19 15:03:34 +00:00
docker-compose.yml	Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation	2026-05-17 09:05:27 +00:00
Dockerfile.dashboard	Add Dockerfile.dashboard	2026-05-15 21:34:52 +00:00
Dockerfile.queue	Add Dockerfile.queue	2026-05-15 21:34:49 +00:00
gpu-router-docker.conf	May 19, 2026: Full harness update	2026-05-19 15:03:34 +00:00
gpu-router.conf	May 19, 2026: Full harness update	2026-05-19 15:03:34 +00:00
litellm_config.yaml	May 19, 2026: Full harness update	2026-05-19 15:03:34 +00:00
MIGRATION_PLAN.md	Add migration plan	2026-05-15 21:07:32 +00:00
README.md	docs: add Koonimo to agent API keys table	2026-05-19 15:48:39 +00:00

README.md

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL	Purpose
`/v1/chat/completions`	Inference API (OpenAI-compatible)
`/v1/models`	Available models
`/`	Dashboard (GPU health, routing, agents, timeseries)
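
As an illustration of how a client might address the endpoints listed above, here is a small sketch using only the Python standard library. The base URL `http://localhost` is a placeholder assumption (the README does not state the hostname of the nginx front end), the helper names are hypothetical, and using the architecture labels such as `qwen3.5-9b-vlm` as model IDs is also an assumption. The request body follows the usual OpenAI-compatible shape (`model` plus `messages`).

```python
import json
import urllib.request
from typing import Dict, List

# Assumption: placeholder host; replace with the real address of the nginx front end.
BASE_URL = "http://localhost"


def models_url(base_url: str = BASE_URL) -> str:
    """URL of the model listing endpoint."""
    return f"{base_url}/v1/models"


def build_chat_request(
    model: str,
    messages: List[Dict[str, str]],
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request object can be passed to `urllib.request.urlopen` once the real host is filled in. Building it separately from sending keeps the sketch testable without a running stack.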

Agent API Keys

Agent	Key
Abiba	`sk-syslog-abiba`
Mumuni	`sk-syslog-mumuni`
Tanko	`sk-syslog-tanko`
Koby	`sk-syslog-koby`
Kagenz0	`sk-syslog-kagenz0`
Koonimo	`sk-syslog-koonimo`
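
The README does not say how these keys are presented to the API. A common convention for OpenAI-compatible services is an `Authorization: Bearer <key>` header, so the sketch below assumes that. The `auth_headers` helper is hypothetical, and the mapping simply mirrors the table above.

```python
from typing import Dict

# Mirrors the Agent API Keys table in the README.
AGENT_KEYS: Dict[str, str] = {
    "Abiba": "sk-syslog-abiba",
    "Mumuni": "sk-syslog-mumuni",
    "Tanko": "sk-syslog-tanko",
    "Koby": "sk-syslog-koby",
    "Kagenz0": "sk-syslog-kagenz0",
    "Koonimo": "sk-syslog-koonimo",
}


def auth_headers(agent: str) -> Dict[str, str]:
    """Return request headers for an agent.

    Assumption: the harness accepts the key as a Bearer token.
    Raises KeyError for an agent that is not in the table.
    """
    return {"Authorization": f"Bearer {AGENT_KEYS[agent]}"}
```

For example, `auth_headers("Koonimo")` yields the header dictionary to merge into a chat completion request.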