abiba-bot/inference-harness

T

Abiba ac13ecaaf7 auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

backups/20260602_103344

security: move API keys to env var, strip from source code fallback

2026-06-03 12:40:04 +00:00

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

Add queue service

2026-05-15 21:07:05 +00:00

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

fix: Security hardening from CT116 deep-dive review

2026-06-02 10:37:10 +00:00

.env.example

Add env example

2026-05-15 21:07:34 +00:00

.gitignore

chore: add .gitignore for .bak and .backup files

2026-06-25 20:33:50 +00:00

docker-compose.yml

fix: LiteLLM OIDC + Admin UI fixes - Authentik integration restored

2026-06-25 20:33:22 +00:00

docker-compose.yml.bak

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

docker-compose.yml.bak.20260616-215242

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

docker-compose.yml.bak.20260617-104030

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

docker-compose.yml.bak.20260625-192313-fix

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

docker-compose.yml.bak.20260625-195716-remove-docs-url

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

docker-compose.yml.bak.20260625-195847-set-docs

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

docker-compose.yml.bak.20260625-202429-auth-proxy

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

Dockerfile.dashboard

Add Dockerfile.dashboard

2026-05-15 21:34:52 +00:00

Dockerfile.queue

Add Dockerfile.queue

2026-05-15 21:34:49 +00:00

gpu-router-docker.conf

VLM migration: qwen3.5-9b-vlm → gemma-4-12b across entire harness

2026-06-05 21:37:18 +00:00

gpu-router.conf

VLM migration: qwen3.5-9b-vlm → gemma-4-12b across entire harness

2026-06-05 21:37:18 +00:00

litellm_config.yaml

fix: LiteLLM OIDC + Admin UI fixes - Authentik integration restored

2026-06-25 20:33:22 +00:00

litellm_config.yaml.bak.20260617-110719

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

litellm_config.yaml.bak.poc

auto-fix: harness-dashboard restarted — container was down, now healthy

2026-06-28 01:36:00 +00:00

LITELLM-MIGRATION-PLAN.md

feat(plan): add fallback chains and resolve model identity gap

2026-06-14 22:48:53 +00:00

maintenance.sh

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

MIGRATION_PLAN.md

Add migration plan

2026-05-15 21:07:32 +00:00

phase0-dual-keys.json

Phase 0: Dual-key router — no hardcoded keys, gemma sync, 262K context fix

2026-06-06 00:51:29 +00:00

README.md

docs: update model table with context windows, capabilities, GPU labels

2026-06-05 23:49:03 +00:00

README.md

syslog-harness — Inference API Harness

CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

Architecture

nginx :80 → router :9000 → GPU backends
                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots, 262K ctx]
                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots, 262K ctx]
                └─ gemma-4-12b (VLM) @ 192.168.68.110:8080    [2 slots, 262K ctx]
                                     Total: 6 concurrent slots

LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)

Deploy

cd /opt/inference-harness
docker compose up -d

Endpoints

URL	Purpose
`/v1/chat/completions`	Inference API (OpenAI-compatible) — API key required
`/v1/models`	Available models
`/`	Dashboard (GPU health, routing, agents, timeseries)

Authentication

All /v1/chat/completions requests require a valid API key via Authorization: Bearer <key>. Missing or invalid keys return 401 Unauthorized.

Agent API Keys

Agent	Key
Abiba	`sk-syslog-abiba`
Mumuni	`sk-syslog-mumuni`
Tanko	`sk-syslog-tanko`
Koby	`sk-syslog-koby`
Kagenz0	`sk-syslog-kagenz0`
Koonimo	`sk-syslog-koonimo`

Routing Tiers

Tier	Trigger	Priority
Lightweight	No system prompt, ≤1 turn, ≤100 words	VLM → MoE → Dense
Simple Conv	≤1000 tokens, ≤4 turns	VLM → MoE → Dense
Heavy	>4000 tokens OR >8 turns	Dense → MoE → VLM
Default	Everything else	MoE → VLM → Dense

Queue

When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via QUEUE_TIMEOUT env or X-Queue-Timeout header).

Models

| GPU | Model | VRAM | Slots | Context | Best For | |-----|-------|------|-------| | Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 | 262K | General quality | | RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 | 262K | Code, reasoning | | RTX 5070 | gemma-4-12b (VLM) | 12GB | 2 | 262K | Speed, vision |

Maintenance

Automated cron job runs daily at 3:00 AM UTC (/opt/inference-harness/maintenance.sh):

Cleans Redis timeseries keys >60 days
Prunes Docker build cache >7 days
Logs container health and Redis memory

Logs: /var/log/harness-maintenance.log