docs: complete design documentation — auth, routing tiers, queue, models, maintenance

2026-05-19 19:17:52 +00:00
parent 46dda918de
commit b67021ac69
1 changed files with 39 additions and 4 deletions
@@ -6,9 +6,10 @@ CT 116 Docker stack for routing local GPU models through a unified OpenAI-compat
 ```
 nginx :80 → router :9000 → GPU backends
-                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
+                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
-                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
+                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
-                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080
+                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
                                     Total: 6 concurrent slots
 LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
 ```
@@ -24,10 +25,14 @@ docker compose up -d
 | URL | Purpose |
 |-----|---------|
-| `/v1/chat/completions` | Inference API (OpenAI-compatible) |
+| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
 | `/v1/models` | Available models |
 | `/` | Dashboard (GPU health, routing, agents, timeseries) |
 ## Authentication
 **All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <key>`. Missing or invalid keys return **401 Unauthorized**.
 ## Agent API Keys
 | Agent | Key |
@@ -38,3 +43,33 @@ docker compose up -d
 | Koby | `sk-syslog-koby` |
 | Kagenz0 | `sk-syslog-kagenz0` |
 | Koonimo | `sk-syslog-koonimo` |
 ## Routing Tiers
 | Tier | Trigger | Priority |
 |------|---------|----------|
 | Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
 | Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
 | Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
 | Default | Everything else | MoE → VLM → Dense |
 ## Queue
 When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
 ## Models
 | GPU | Model | VRAM | Slots |
 |-----|-------|------|-------|
 | Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
 | RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
 | RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
 ## Maintenance
 Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
 - Cleans Redis timeseries keys >60 days
 - Prunes Docker build cache >7 days
 - Logs container health and Redis memory
 Logs: `/var/log/harness-maintenance.log`