From b67021ac69be4f05e90de22d6fab7af70e73b3c8 Mon Sep 17 00:00:00 2001
From: Abiba <abiba@sysloggh.com>
Date: Tue, 19 May 2026 19:17:52 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20complete=20design=20documentation=20?=
 =?UTF-8?q?=E2=80=94=20auth,=20routing=20tiers,=20queue,=20models,=20maint?=
 =?UTF-8?q?enance?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 43 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 4 deletions(-)
diff --git a/README.md b/README.md
index 09fe187..0aea3cf 100644
--- a/README.md
+++ b/README.md
@@ -6,9 +6,10 @@ CT 116 Docker stack for routing local GPU models through a unified OpenAI-compat
 
 ```
 nginx :80 → router :9000 → GPU backends
-                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
-                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
-                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080
+                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080    [2 slots]
+                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
+                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
+                                     Total: 6 concurrent slots
 
 LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
 ```
@@ -24,10 +25,14 @@ docker compose up -d
 
 | URL | Purpose |
 |-----|---------|
-| `/v1/chat/completions` | Inference API (OpenAI-compatible) |
+| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
 | `/v1/models` | Available models |
 | `/` | Dashboard (GPU health, routing, agents, timeseries) |
 
+## Authentication
+
+**All `/v1/chat/completions` requests require a valid API key**, sent as an `Authorization: Bearer <key>` header. Requests with a missing or invalid key return **401 Unauthorized**.
+
 ## Agent API Keys
 
 | Agent | Key |
@@ -38,3 +43,33 @@ docker compose up -d
 | Koby | `sk-syslog-koby` |
 | Kagenz0 | `sk-syslog-kagenz0` |
 | Koonimo | `sk-syslog-koonimo` |
+
+## Routing Tiers
+
+| Tier | Trigger | Backend order (first choice → last fallback) |
+|------|---------|----------------------------------------------|
+| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
+| Simple conversation | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
+| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
+| Default | Everything else | MoE → VLM → Dense |
+
+## Queue
+
+When all GPU slots are saturated, requests wait in a queue that polls for a free slot every 500 ms instead of returning 503 immediately. The default timeout is 30 s, configurable via the `QUEUE_TIMEOUT` environment variable or the `X-Queue-Timeout` request header.
+
+## Models
+
+| GPU | Model | VRAM | Slots |
+|-----|-------|------|-------|
+| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
+| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
+| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
+
+## Maintenance
+
+An automated cron job runs daily at 03:00 UTC (`/opt/inference-harness/maintenance.sh`). It:
+- Deletes Redis timeseries keys older than 60 days
+- Prunes Docker build cache older than 7 days
+- Logs container health and Redis memory
+
+Logs: `/var/log/harness-maintenance.log`
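
The Routing Tiers table added by this patch can be read as a small decision procedure. The sketch below is illustrative only and is not the router's actual code: the `RequestShape` type, the `classify` function, and the tier keys are hypothetical names. It also assumes that rules are evaluated in table order with the first match winning, which the README does not state.

```python
from dataclasses import dataclass

# Backend order per tier, copied from the Routing Tiers table.
# The dictionary keys are hypothetical identifiers, not names used by the router.
TIER_BACKENDS = {
    "lightweight": ["VLM", "MoE", "Dense"],
    "simple_conv": ["VLM", "MoE", "Dense"],
    "heavy": ["Dense", "MoE", "VLM"],
    "default": ["MoE", "VLM", "Dense"],
}


@dataclass
class RequestShape:
    """Hypothetical summary of a chat request, measured before routing."""
    has_system_prompt: bool
    turns: int
    words: int
    tokens: int


def classify(req: RequestShape) -> str:
    """Return the tier for a request.

    Assumption: rules are checked in table order and the first match wins.
    """
    if not req.has_system_prompt and req.turns <= 1 and req.words <= 100:
        return "lightweight"
    if req.tokens <= 1000 and req.turns <= 4:
        return "simple_conv"
    if req.tokens > 4000 or req.turns > 8:
        return "heavy"
    # Anything between the thresholds (for example 1001 to 4000 tokens).
    return "default"


def backend_order(req: RequestShape) -> list[str]:
    """Backends to try for this request, first choice first."""
    return TIER_BACKENDS[classify(req)]
```

For example, `backend_order(RequestShape(True, 6, 1500, 2000))` falls through every threshold and yields the Default order, MoE first. A request that satisfies both the Lightweight and Simple Conv triggers gets the same backend order either way, so the evaluation-order assumption only matters at the Heavy and Default boundaries.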