7b6c6aabe1
- Complexity-based routing (MoE default, Dense heavy, Gemma light) - Per-agent API keys with metrics tracking - Time-series usage graphs (24h/7d/30d) - Streaming support (SSE passthrough) - Unicode cleanup (ASCII-only output) - Vision support (gemma-4-E4B) - Tier enforcement (starter/professional/enterprise) - GPU health monitoring via sidecar polling - Unified dashboard with line graph
40 lines
958 B
Markdown
40 lines
958 B
Markdown
# syslog-harness — Inference API Harness
|
|
|
|
CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
|
|
|
|
## Architecture
|
|
|
|
```
|
|
nginx :80 → router :9000 → GPU backends
|
|
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
|
|
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
|
|
└─ gemma-4-E4B (Light) @ 192.168.68.110:8080
|
|
|
|
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
|
|
```
|
|
|
|
## Deploy
|
|
|
|
```bash
|
|
cd /opt/inference-harness
|
|
docker compose up -d
|
|
```
|
|
|
|
## Endpoints
|
|
|
|
| URL | Purpose |
|
|
|-----|---------|
|
|
| `/v1/chat/completions` | Inference API (OpenAI-compatible) |
|
|
| `/v1/models` | Available models |
|
|
| `/` | Dashboard (GPU health, routing, agents, timeseries) |
|
|
|
|
## Agent API Keys
|
|
|
|
| Agent | Key |
|
|
|-------|-----|
|
|
| Abiba | `sk-syslog-abiba` |
|
|
| Mumuni | `sk-syslog-mumuni` |
|
|
| Tanko | `sk-syslog-tanko` |
|
|
| Koby | `sk-syslog-koby` |
|
|
| Kagenz0 | `sk-syslog-kagenz0` |
|