T

Abiba 941e8db65e feat: redesigned routing tiers — VLM handles more traffic

New 4-tier routing:
- TIER 1 (Lightweight): ≤100 words, single-turn → VLM first, fallback Dense
- TIER 2 (Simple Conv): ≤1000 tokens, ≤4 turns → VLM preferred, fallback Dense
- TIER 3 (Heavy): >4000 tokens, system prompts, >8 turns → Dense→MoE→VLM cascade
- TIER 4 (Default): Medium tasks → Dense preferred, MoE default, VLM overflow

VLM gets more utilization for simple conversations instead of defaulting
everything to MoE.

2026-05-19 17:01:55 +00:00

dashboard

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

nginx

Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout.

2026-05-16 22:12:21 +00:00

queue-service

Add queue service

2026-05-15 21:07:05 +00:00

router

feat: redesigned routing tiers — VLM handles more traffic

2026-05-19 17:01:55 +00:00

.env.example

Add env example

2026-05-15 21:07:34 +00:00

.gitignore

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

docker-compose.yml

Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation

2026-05-17 09:05:27 +00:00

Dockerfile.dashboard

Add Dockerfile.dashboard

2026-05-15 21:34:52 +00:00

Dockerfile.queue

Add Dockerfile.queue

2026-05-15 21:34:49 +00:00

gpu-router-docker.conf

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

gpu-router.conf

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

litellm_config.yaml

May 19, 2026: Full harness update

2026-05-19 15:03:34 +00:00

MIGRATION_PLAN.md

Add migration plan

2026-05-15 21:07:32 +00:00

README.md

docs: add Koonimo to agent API keys table

2026-05-19 15:48:39 +00:00

README.md

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL	Purpose
`/v1/chat/completions`	Inference API (OpenAI-compatible)
`/v1/models`	Available models
`/`	Dashboard (GPU health, routing, agents, timeseries)

Agent API Keys

Agent	Key
Abiba	`sk-syslog-abiba`
Mumuni	`sk-syslog-mumuni`
Tanko	`sk-syslog-tanko`
Koby	`sk-syslog-koby`
Kagenz0	`sk-syslog-kagenz0`
Koonimo	`sk-syslog-koonimo`