From b67021ac69be4f05e90de22d6fab7af70e73b3c8 Mon Sep 17 00:00:00 2001
From: Abiba
Date: Tue, 19 May 2026 19:17:52 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20complete=20design=20documentation=20?=
 =?UTF-8?q?=E2=80=94=20auth,=20routing=20tiers,=20queue,=20models,=20maint?=
 =?UTF-8?q?enance?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 43 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 09fe187..0aea3cf 100644
--- a/README.md
+++ b/README.md
@@ -6,9 +6,10 @@ CT 116 Docker stack for routing local GPU models through a unified OpenAI-compat

 ```
 nginx :80 → router :9000 → GPU backends
- ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
- ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
- └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080
+ ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
+ ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
+ └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
+ Total: 6 concurrent slots

 LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
 ```
@@ -24,10 +25,14 @@ docker compose up -d

 | URL | Purpose |
 |-----|---------|
-| `/v1/chat/completions` | Inference API (OpenAI-compatible) |
+| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
 | `/v1/models` | Available models |
 | `/` | Dashboard (GPU health, routing, agents, timeseries) |

+## Authentication
+
+**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer `. Missing or invalid keys return **401 Unauthorized**.
+
 ## Agent API Keys
 | Agent | Key |
 |-------|-----|
@@ -38,3 +43,33 @@ docker compose up -d
 | Koby | `sk-syslog-koby` |
 | Kagenz0 | `sk-syslog-kagenz0` |
 | Koonimo | `sk-syslog-koonimo` |
+
+## Routing Tiers
+
+| Tier | Trigger | Priority |
+|------|---------|----------|
+| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
+| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
+| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
+| Default | Everything else | MoE → VLM → Dense |
+
+## Queue
+
+When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
+
+## Models
+
+| GPU | Model | VRAM | Slots |
+|-----|-------|------|-------|
+| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
+| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
+| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
+
+## Maintenance
+
+Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
+- Cleans Redis timeseries keys >60 days
+- Prunes Docker build cache >7 days
+- Logs container health and Redis memory
+
+Logs: `/var/log/harness-maintenance.log`
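
Note on the Routing Tiers table added by this patch: the thresholds read naturally as a first-match classifier. The sketch below is one possible reading, not the router's actual code. The `RequestStats` fields, the `classify` function, and the top-to-bottom evaluation order are all assumptions made for illustration; only the thresholds and the priority orders come from the table.

```python
from dataclasses import dataclass

# Backend labels as they appear in the README's Priority column.
VLM, MOE, DENSE = "VLM", "MoE", "Dense"


@dataclass
class RequestStats:
    """Hypothetical per-request summary; field names are illustrative only."""
    tokens: int              # estimated prompt tokens
    turns: int               # conversation turns in the request
    words: int               # word count of the user input
    has_system_prompt: bool  # whether a system message is present


def classify(req: RequestStats) -> tuple[str, list[str]]:
    """Return (tier, backend priority), checking tiers in table order.

    The evaluation order is an assumption; the README does not state it.
    """
    if not req.has_system_prompt and req.turns <= 1 and req.words <= 100:
        return "Lightweight", [VLM, MOE, DENSE]
    if req.tokens <= 1000 and req.turns <= 4:
        return "Simple Conv", [VLM, MOE, DENSE]
    if req.tokens > 4000 or req.turns > 8:
        return "Heavy", [DENSE, MOE, VLM]
    return "Default", [MOE, VLM, DENSE]


if __name__ == "__main__":
    sample = RequestStats(tokens=5000, turns=2, words=3000, has_system_prompt=True)
    tier, priority = classify(sample)
    print(tier, priority)
```

Under this reading, a request between 1000 and 4000 tokens with at most 8 turns matches no named tier and falls through to Default, which is why Default is checked last.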