docs: complete design documentation — auth, routing tiers, queue, models, maintenance
This commit is contained in:
@@ -6,9 +6,10 @@ CT 116 Docker stack for routing local GPU models through a unified OpenAI-compat
|
|||||||
|
|
||||||
```
|
```
|
||||||
nginx :80 → router :9000 → GPU backends
|
nginx :80 → router :9000 → GPU backends
|
||||||
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
|
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
|
||||||
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
|
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
|
||||||
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080
|
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
|
||||||
|
Total: 6 concurrent slots
|
||||||
|
|
||||||
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
|
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
|
||||||
```
|
```
|
||||||
@@ -24,10 +25,14 @@ docker compose up -d
|
|||||||
|
|
||||||
| URL | Purpose |
|
| URL | Purpose |
|
||||||
|-----|---------|
|
|-----|---------|
|
||||||
| `/v1/chat/completions` | Inference API (OpenAI-compatible) |
|
| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
|
||||||
| `/v1/models` | Available models |
|
| `/v1/models` | Available models |
|
||||||
| `/` | Dashboard (GPU health, routing, agents, timeseries) |
|
| `/` | Dashboard (GPU health, routing, agents, timeseries) |
|
||||||
|
|
||||||
|
## Authentication
|
||||||
|
|
||||||
|
**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <key>`. Missing or invalid keys return **401 Unauthorized**.
|
||||||
|
|
||||||
## Agent API Keys
|
## Agent API Keys
|
||||||
|
|
||||||
| Agent | Key |
|
| Agent | Key |
|
||||||
@@ -38,3 +43,33 @@ docker compose up -d
|
|||||||
| Koby | `sk-syslog-koby` |
|
| Koby | `sk-syslog-koby` |
|
||||||
| Kagenz0 | `sk-syslog-kagenz0` |
|
| Kagenz0 | `sk-syslog-kagenz0` |
|
||||||
| Koonimo | `sk-syslog-koonimo` |
|
| Koonimo | `sk-syslog-koonimo` |
|
||||||
|
|
||||||
|
## Routing Tiers
|
||||||
|
|
||||||
|
| Tier | Trigger | Priority |
|
||||||
|
|------|---------|----------|
|
||||||
|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
|
||||||
|
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
|
||||||
|
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
|
||||||
|
| Default | Everything else | MoE → VLM → Dense |
|
||||||
|
|
||||||
|
## Queue
|
||||||
|
|
||||||
|
When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
|
||||||
|
|
||||||
|
## Models
|
||||||
|
|
||||||
|
| GPU | Model | VRAM | Slots |
|
||||||
|
|-----|-------|------|-------|
|
||||||
|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
|
||||||
|
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
|
||||||
|
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
|
||||||
|
|
||||||
|
## Maintenance
|
||||||
|
|
||||||
|
Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
|
||||||
|
- Cleans Redis timeseries keys >60 days
|
||||||
|
- Prunes Docker build cache >7 days
|
||||||
|
- Logs container health and Redis memory
|
||||||
|
|
||||||
|
Logs: `/var/log/harness-maintenance.log`
|
||||||
|
|||||||
Reference in New Issue
Block a user