docs: complete design documentation — auth, routing tiers, queue, models, maintenance

This commit is contained in:
Abiba
2026-05-19 19:17:52 +00:00
parent 46dda918de
commit b67021ac69
+39 -4
View File
@@ -6,9 +6,10 @@ CT 116 Docker stack for routing local GPU models through a unified OpenAI-compat
``` ```
nginx :80 → router :9000 → GPU backends nginx :80 → router :9000 → GPU backends
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
Total: 6 concurrent slots
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local) LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
``` ```
@@ -24,10 +25,14 @@ docker compose up -d
| URL | Purpose | | URL | Purpose |
|-----|---------| |-----|---------|
| `/v1/chat/completions` | Inference API (OpenAI-compatible) | | `/v1/chat/completions` | Inference API (OpenAI-compatible)**API key required** |
| `/v1/models` | Available models | | `/v1/models` | Available models |
| `/` | Dashboard (GPU health, routing, agents, timeseries) | | `/` | Dashboard (GPU health, routing, agents, timeseries) |
## Authentication
**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <key>`. Missing or invalid keys return **401 Unauthorized**.
## Agent API Keys ## Agent API Keys
| Agent | Key | | Agent | Key |
@@ -38,3 +43,33 @@ docker compose up -d
| Koby | `sk-syslog-koby` | | Koby | `sk-syslog-koby` |
| Kagenz0 | `sk-syslog-kagenz0` | | Kagenz0 | `sk-syslog-kagenz0` |
| Koonimo | `sk-syslog-koonimo` | | Koonimo | `sk-syslog-koonimo` |
## Routing Tiers
| Tier | Trigger | Priority |
|------|---------|----------|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
| Default | Everything else | MoE → VLM → Dense |
## Queue
When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
## Models
| GPU | Model | VRAM | Slots |
|-----|-------|------|-------|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
## Maintenance
Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
- Cleans Redis timeseries keys >60 days
- Prunes Docker build cache >7 days
- Logs container health and Redis memory
Logs: `/var/log/harness-maintenance.log`