Initial commit: CT 116 inference harness — nginx, LiteLLM, router, dashboard, Redis

- Complexity-based routing (MoE default, Dense heavy, Gemma light) - Per-agent API keys with metrics tracking - Time-series usage graphs (24h/7d/30d) - Streaming support (SSE passthrough) - Unicode cleanup (ASCII-only output) - Vision support (gemma-4-E4B) - Tier enforcement (starter/professional/enterprise) - GPU health monitoring via sidecar polling - Unified dashboard with line graph
2026-05-16 18:51:50 +00:00
commit 7b6c6aabe1
11 changed files with 749 additions and 0 deletions
@@ -0,0 +1,39 @@
+# syslog-harness — Inference API Harness
+
+CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
+
+## Architecture
+
+```
+nginx :80 → router :9000 → GPU backends
+                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
+                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
+                └─ gemma-4-E4B (Light) @ 192.168.68.110:8080
+
+LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
+```
+
+## Deploy
+
+```bash
+cd /opt/inference-harness
+docker compose up -d
+```
+
+## Endpoints
+
+| URL | Purpose |
+|-----|---------|
+| `/v1/chat/completions` | Inference API (OpenAI-compatible) |
+| `/v1/models` | Available models |
+| `/` | Dashboard (GPU health, routing, agents, timeseries) |
+
+## Agent API Keys
+
+| Agent | Key |
+|-------|-----|
+| Abiba | `sk-syslog-abiba` |
+| Mumuni | `sk-syslog-mumuni` |
+| Tanko | `sk-syslog-tanko` |
+| Koby | `sk-syslog-koby` |
+| Kagenz0 | `sk-syslog-kagenz0` |