Files

76 lines
2.3 KiB
Markdown

# syslog-harness — Inference API Harness
CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
## Architecture
```
nginx :80 → router :9000 → GPU backends
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
Total: 6 concurrent slots
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
```
## Deploy
```bash
cd /opt/inference-harness
docker compose up -d
```
## Endpoints
| URL | Purpose |
|-----|---------|
| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
| `/v1/models` | Available models |
| `/` | Dashboard (GPU health, routing, agents, timeseries) |
## Authentication
**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <key>`. Missing or invalid keys return **401 Unauthorized**.
## Agent API Keys
| Agent | Key |
|-------|-----|
| Abiba | `sk-syslog-abiba` |
| Mumuni | `sk-syslog-mumuni` |
| Tanko | `sk-syslog-tanko` |
| Koby | `sk-syslog-koby` |
| Kagenz0 | `sk-syslog-kagenz0` |
| Koonimo | `sk-syslog-koonimo` |
## Routing Tiers
| Tier | Trigger | Priority |
|------|---------|----------|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
| Default | Everything else | MoE → VLM → Dense |
## Queue
When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
## Models
| GPU | Model | VRAM | Slots |
|-----|-------|------|-------|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
## Maintenance
Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
- Cleans Redis timeseries keys >60 days
- Prunes Docker build cache >7 days
- Logs container health and Redis memory
Logs: `/var/log/harness-maintenance.log`