41 Commits

Author SHA1 Message Date
root 5116e4b1a7 router: heavy tier Dense→MoE→Light + X-Context-Warning headers (compact_soon/compact_recommended/compact_urgent) 2026-05-22 09:48:00 +00:00
Abiba e55bcef21a router: 4 optimizations — saturated flag fix, heavy tier MoE-first, better token est, session tracking
- Saturated flag now triggers on load saturation (was dead code)
- Heavy tier routes MoE(131K) first instead of Dense(98K)
- Token estimation uses JSON length/3.5 (was content/4)
- Cross-turn session tracking via X-Session-Id + Redis TTL 24h
2026-05-21 20:47:48 +00:00
Abiba 32bd817e97 fix: heavy tier back to Dense→MoE→VLM (Dense now 98K) 2026-05-19 21:24:36 +00:00
Abiba 79965450bb fix: Dense context 65K→98K, parallel restored to 2 2026-05-19 21:20:29 +00:00
Abiba 6c829abef5 fix: variable collision (r = Redis vs Response) in stream handler 2026-05-19 21:15:23 +00:00
Abiba 6efd5ff51c feat: context-aware routing + compaction signals
- Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K)
- Heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests
- Response headers: X-Context-Remaining, X-Context-Model
- Routing data includes context_remaining field
- Agents can use this to trigger compaction when nearing limits
2026-05-19 21:13:56 +00:00
Abiba 350a90b524 fix: sync tier 4 default threshold to 50000 tokens (was stale at 4000) 2026-05-19 21:11:34 +00:00
Abiba 3156c093d5 fix: heavy threshold → 50000 tokens, 25 turns (agent contexts are huge) 2026-05-19 21:08:18 +00:00
Abiba 3cbf38e3e2 fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns
Agent conversations with system prompts easily exceed 4000 tokens,
forcing everything to Dense. Now only truly heavy work triggers Dense.
Most agent convos will route to MoE (default) instead.
2026-05-19 20:09:59 +00:00
Abiba b67021ac69 docs: complete design documentation — auth, routing tiers, queue, models, maintenance 2026-05-19 19:17:52 +00:00
Abiba 46dda918de security: reject requests without valid API key (401 instead of defaulting to starter) 2026-05-19 19:13:52 +00:00
Abiba 7a78c0f98d fix: heavy tier — Dense first (best for reasoning), then MoE, then VLM 2026-05-19 18:20:20 +00:00
Abiba 15c474aea0 fix: select_best_gpu respects candidate order — first non-busy wins
Previously it picked the least-loaded GPU globally, ignoring priority order.
Now it tries candidates in order: MoE → VLM → Dense. Only falls back to
least-loaded when ALL candidates are busy.
2026-05-19 18:18:00 +00:00
Abiba bfc38f5436 fix: routing priority — MoE first, VLM second, Dense last (slow)
All tiers now follow MoE → VLM → Dense priority order since
Dense (RTX 3090) can be slow. VLM acts as overflow absorber.
2026-05-19 17:38:21 +00:00
Abiba f519a3fa60 fix: routing — system prompts no longer force heavy tier
System messages are common in agent conversations but don't indicate
heavy workload. Now only token count (>4000) and turn count (>8) trigger
heavy routing. Simple conversations with system prompts can now route to VLM.
2026-05-19 17:19:29 +00:00
Abiba 941e8db65e feat: redesigned routing tiers — VLM handles more traffic
New 4-tier routing:
- TIER 1 (Lightweight): ≤100 words, single-turn → VLM first, fallback Dense
- TIER 2 (Simple Conv): ≤1000 tokens, ≤4 turns → VLM preferred, fallback Dense
- TIER 3 (Heavy): >4000 tokens, system prompts, >8 turns → Dense→MoE→VLM cascade
- TIER 4 (Default): Medium tasks → Dense preferred, MoE default, VLM overflow

VLM gets more utilization for simple conversations instead of defaulting
everything to MoE.
2026-05-19 17:01:55 +00:00
Abiba 241de4f38c revert: remove Ollama endpoints (llama.cpp uses OpenAI format, not Ollama) 2026-05-19 16:57:04 +00:00
Abiba beb2d1790a fix: add /v1/props and /v1/models/<id> Ollama-compatible endpoints
Mumuni's Ollama client probes /v1/props for model discovery and
/v1/models/<id> for per-model details. Previously both returned 404,
causing client retries. Now returns proper model properties and details.
2026-05-19 16:08:24 +00:00
Abiba f2f8e8c921 feat: add request queuing to router (replaces hard 503 on saturation)
When all GPUs are saturated, requests now enter a queue loop (poll every 500ms)
instead of immediately returning 503. Configurable via QUEUE_TIMEOUT env var
(default 30s) or X-Queue-Timeout header per-request.

This prevents agent failures from cluster saturation — agents wait for a slot
instead of crashing on fallback.
2026-05-19 15:55:05 +00:00
Abiba 76ade81fda docs: add Koonimo to agent API keys table 2026-05-19 15:48:39 +00:00
Abiba 9c31b5d622 May 19, 2026: Full harness update
- Model migration: gemma-4-E4B → qwen3.5-9b-vlm
- Dashboard reorder: Usage Over Time + GPU Metrics to top
- Router counter leak fix (gpu_decr in except handler)
- VLM slot upgrade 1→2
- Redis stale key cleanup
- Automated maintenance cron job
- LiteLLM config update
- GPU router config update
- README update
2026-05-19 15:03:34 +00:00
Abiba (pi) 4f032b035c Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation 2026-05-17 09:05:27 +00:00
Abiba (pi) 8f3b0c6647 Router: health check verifies actual llama.cpp endpoint, gpu_decr negative guard, AMD sidecar fixed (sysfs fallback) 2026-05-17 01:52:28 +00:00
Abiba (pi) 808c9d3d13 Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout. 2026-05-16 22:12:21 +00:00
Abiba (pi) 9817fe2ef2 Dashboard: clean rebuild with Queue Status ring chart, GPU slot indicators, organized layout (GPU/Queue+Model+Agent/Usage/Live) 2026-05-16 21:05:19 +00:00
Abiba (pi) 654cdff718 Dashboard: GPU slot indicators show active/max concurrent requests. Koonimo API key added. Real-time queuing visibility. 2026-05-16 20:43:22 +00:00
Abiba (pi) bf90e57c5f Load-aware routing: tracks active GPU requests in Redis, distributes overflow when MoE saturated. 6 concurrent requests now spread across all 3 GPUs instead of queuing on one. 2026-05-16 20:23:32 +00:00
Abiba (pi) 2db2796e53 Dashboard: rename to SyslogAI Harness, GPU bar now shows utilization instead of VRAM 2026-05-16 19:26:46 +00:00
Abiba (pi) ec0f9fac63 Fix: clean_unicode now uses chr()-based replacements + ASCII strip to prevent bash heredoc corruption. Emoji and all non-ASCII now fully stripped. 2026-05-16 19:12:58 +00:00
Abiba (pi) 3d42ea4767 Merge: add Abiba harness code — nginx, LiteLLM, router, dashboard, Redis 2026-05-16 18:53:31 +00:00
Abiba (pi) 7b6c6aabe1 Initial commit: CT 116 inference harness — nginx, LiteLLM, router, dashboard, Redis
- Complexity-based routing (MoE default, Dense heavy, Gemma light)
- Per-agent API keys with metrics tracking
- Time-series usage graphs (24h/7d/30d)
- Streaming support (SSE passthrough)
- Unicode cleanup (ASCII-only output)
- Vision support (gemma-4-E4B)
- Tier enforcement (starter/professional/enterprise)
- GPU health monitoring via sidecar polling
- Unified dashboard with line graph
2026-05-16 18:51:50 +00:00
mumuni-bot b65ea22765 Update Nginx Docker config 2026-05-15 21:35:13 +00:00
mumuni-bot cf7f61650f Add Dockerfile.dashboard 2026-05-15 21:34:52 +00:00
mumuni-bot 7d00bbec0e Add Dockerfile.queue 2026-05-15 21:34:49 +00:00
mumuni-bot 37f7c95b05 Add env example 2026-05-15 21:07:34 +00:00
mumuni-bot a28b3a557d Add Nginx router config 2026-05-15 21:07:33 +00:00
mumuni-bot c42f3a9979 Add migration plan 2026-05-15 21:07:32 +00:00
mumuni-bot e1f12c3462 Add dashboard 2026-05-15 21:07:07 +00:00
mumuni-bot b55b954967 Add queue service 2026-05-15 21:07:05 +00:00
mumuni-bot c85aaa570b Add docker-compose 2026-05-15 21:07:05 +00:00
mumuni-bot 43382dac5b Initial commit: README 2026-05-15 21:07:03 +00:00
29 changed files with 1031 additions and 4051 deletions
+8
@@ -0,0 +1,8 @@
# Syslog Harness Environment
REDIS_HOST=192.168.68.8
REDIS_PORT=6379
AMDPVE_ENDPOINT=http://192.168.68.15:8080
LLMGPU_ENDPOINT=http://192.168.68.8:8080
OCU_LLM_ENDPOINT=http://192.168.68.110:8080
CIRCUIT_BREAKER_THRESHOLD=5
CIRCUIT_BREAKER_TIMEOUT=30
+3
@@ -0,0 +1,3 @@
.git
__pycache__/
*.pyc
-390
@@ -1,390 +0,0 @@
# Syslog Harness Architecture Review & Improvement Recommendations
**Date:** 2026-05-17
**Commit:** `e95475f` "Add GPU dashboard container + Nginx routing"
**Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git
---
## 1. Current Architecture Overview
```
Host (192.168.68.123)

Agent --:8080--> Nginx Router --> Queue Service --> Dashboard
                 :8080            :8091             :3001
                   |                |
                   v                v
                 GPU Pool         Redis --> GPU Dashboard
                 :8080            :6379     :8092
                   |
     +-------------+-------------+
     |             |             |
   amdpve        llmgpu        ocu_llm
   .15:8080      .8:8080       .110:8080
   MoE 35B       Dense 27B     Light 4B
```
### Services
| Service | Port | Container | Image | Purpose |
|---|---|---|---|---|
| **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
| **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
| **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
| **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
| **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |
### GPU Backends
| Host | GPU | Model | Capacity |
|---|---|---|---|
| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |
### Data Flow
1. **Agent** sends request with `X-Syslog-Model` header → Nginx :8080
2. **Nginx** routes to appropriate GPU based on header mapping
3. **GPU backend** (llama.cpp) processes request
4. **Fallback:** If GPU returns 502/503/timeout → Nginx redirects to queue-service :8091
5. **Queue** stores request in the Redis list `inference:requests` via LPUSH
6. **Dashboard** :3001 polls queue-service + GPU health for display
7. **GPU Dashboard** :8092 collects hardware metrics every 10s
---
## 2. File Inventory
```
docker-compose.yml # Main compose (Docker networking)
gpu-router-docker.conf # Nginx config for Docker deployment
Dockerfile.gpu # GPU dashboard container
Dockerfile.dashboard # Dashboard container (root-level)
queue-service/Dockerfile # Queue service container
queue-service/queue-service.py # Queue logic (121 lines)
dashboard/harness-dashboard.py # Dashboard app (133 lines)
dashboard/Dockerfile # Dashboard container (subdir)
dashboard/Dockerfile.dashboard # Dashboard container (duplicate)
gpu-dashboard/gpu_collector.py # GPU hardware collector (115 lines)
gpu-dashboard/gpu.html # GPU dashboard UI (183 lines)
gpu-dashboard/collector.py # Duplicate collector (hermes-workspace path)
gpu-dashboard/start.sh # Legacy startup script
MIGRATION_PLAN.md # Production migration plan
README.md # Documentation
syslog-harness-check/ # Checkpoint subdirectory (mirror)
```
---
## 3. Detailed Findings
### 3.1 Queue Service (`queue-service/queue-service.py`)
**Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.
**Issues Found:**
| # | Severity | Location | Issue |
|---|---|---|---|
| Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
| Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. No configuration source of truth. |
| Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
| Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
| Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once tripped, it stays tripped until the queue is manually drained. |
| Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with 3s timeout means `/status` can take up to 9s. |
| Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
| Q8 | **MEDIUM** | Lines 83-84 | **Headers filtered to only X-* prefixed.** The `Content-Type` header is dropped entirely, meaning the receiver can't determine payload format. |
| Q9 | **LOW** | Line 121 | **No graceful shutdown.** Flask development server doesn't handle SIGTERM gracefully. |
### 3.2 Nginx Gateway (`gpu-router-docker.conf`)
**Architecture:** Nginx routes requests to GPU backends based on `X-Syslog-Model` header value. Has rate limiting, streaming support, and queue fallback.
**Issues Found:**
| # | Severity | Location | Issue |
|---|---|---|---|
| N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** means 20 requests are served immediately beyond the rate limit, then throttled. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
| N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `tries 2`** means on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
| N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down. |
| N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is non-standard. Nginx automatically passes request headers; this directive is for response headers. The model header is already passed implicitly via `proxy_set_header` inheritance. |
| N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change based on docker-compose project prefix. Should use service names. |
| N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** has `X-Forwarded-Proto` but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations. |
### 3.3 Dashboard (`dashboard/harness-dashboard.py`)
**Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.
**Issues Found:**
| # | Severity | Location | Issue |
|---|---|---|---|
| D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2 + 3×3s = 11s response time. |
| D2 | **MEDIUM** | Lines 101-127 | **Uses `SimpleHTTPRequestHandler`** which is single-threaded. Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`. |
| D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
| D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |
### 3.4 GPU Dashboard (`gpu-dashboard/`)
**Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to `gpu_metrics.json`. Static HTTP server serves the dashboard.
**Issues Found:**
| # | Severity | Location | Issue |
|---|---|---|---|
| G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
| G2 | **HIGH** | Lines 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files. |
| G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
| G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
| G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment. |
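
Findings G1 and G3 (and the sequential health checks noted in Q6 and D1) could be addressed together by polling the GPUs concurrently and recording each failure instead of swallowing it. The following is a minimal sketch, not the collector's actual code: `poll_all` and the `probe` callable are hypothetical names, and `probe` is assumed to enforce its own network timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def poll_all(hosts: Dict[str, str], probe: Callable[[str], dict]) -> Dict[str, dict]:
    """Probe every host in parallel and report success or failure per host.

    `probe` is a hypothetical callable that fetches metrics for one URL and
    raises on failure; it must apply its own timeout.
    """
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=len(hosts) or 1) as pool:
        # Submit all probes first so a slow host does not delay the others.
        futures = {name: pool.submit(probe, url) for name, url in hosts.items()}
        for name, future in futures.items():
            try:
                results[name] = {"ok": True, "data": future.result()}
            except Exception as exc:
                # Keep the error so "no data" and "collector error" stay distinguishable.
                results[name] = {"ok": False, "error": repr(exc)}
    return results
```

With this shape the worst-case collection time is roughly the slowest single probe rather than the sum of all three.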
### 3.5 Docker Compose (`docker-compose.yml`)
**Issues Found:**
| # | Severity | Location | Issue |
|---|---|---|---|
| C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
| C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
| C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. With `always`, a manually stopped container comes back when the Docker daemon restarts, making maintenance harder. |
| C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect if a service is actually healthy, only if the container is running. |
| C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
| C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor: defaults to bridge). No IPAM configuration for large deployments. |
### 3.6 Container Images
**Issues Found:**
| # | Severity | Location | Issue |
|---|---|---|---|
| I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
| I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`** is an unnecessary dependency for the GPU dashboard (only uses `urllib`). Adds ~300KB to the image. |
| I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`** has no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
| I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
| I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |
---
## 4. Smart Queuing Analysis & Recommendations
### Current State: No Smart Queuing
The queue service is a **passive storage mechanism**: it stores requests but has no intelligence:
- **No load balancing:** no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
- **No job prioritization:** FIFO only, no priority levels
- **No backpressure:** simple threshold, no exponential backoff or adaptive limits
- **No retry logic:** failed GPU requests go to queue but are never reprocessed
- **No dead letter handling:** stuck or failed jobs have no lifecycle management
- **No consumer:** nothing dequeues and forwards to GPUs
- **No job tracking:** no job IDs, no status updates, no result retrieval
### Recommended Architecture: Smart Queue with Consumer
```
Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
                                              |
                                              v
                                        Consumer Pool
                                              |
                      +-----------------------+-----------------------+
                      v                       v                       v
                 GPU 1 (load)            GPU 2 (load)            GPU 3 (load)
                    Health                  Health                  Health
                      |                       |                       |
                      +-----------------------+-----------------------+
                                              |
                                      Update GPU scores

Priority Queue (sorted by urgency)
Dead Letter Queue (failed jobs)
Backpressure (adaptive rate limit)
```
### Specific Recommendations
#### R1: Implement Redis Streams as Queue Backend
- Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
- Streams support consumer groups, message acknowledgment, and pending messages
- Enables proper dead letter queue handling and retry logic
- **File:** `queue-service/queue-service.py`
```python
# Before: Simple list
r.rpush(QUEUE_KEY, json.dumps(job))
# After: Redis Stream with consumer group
stream_key = "inference:stream"
consumer_group = "gpu-workers"
# redis-py names the approximate-trimming flag (MAXLEN ~) `approximate`, not `approx`
r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
```
#### R2: Build a Queue Consumer Pool
- Deploy 1+ consumer containers that poll the stream and forward to GPUs
- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
- **File:** New `queue-service/consumer.py`
```python
class LoadBalancedConsumer:
    def select_gpu(self, job):
        """Select GPU based on load, health, and model compatibility."""
        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
        if not candidates:
            return None
        # Sort by: slots_idle (descending), VRAM_available (descending)
        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
        return candidates[0]
```
#### R3: Implement Priority Queuing
- Add priority field to job payload: `high`, `normal`, `low`
- Use Redis Streams with multiple stream keys per priority level
- Consumer checks `high` → `normal` → `low` in order
- **File:** `queue-service/queue-service.py` enqueue endpoint
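
The strict ordering described in R3 can be sketched as a loop over one stream per priority level. This is an illustration only: the stream key names and the `pop` callable are assumptions, and an in-memory `deque` stands in for Redis so the sketch runs without a server.

```python
from collections import deque
from typing import Callable, Dict, Optional

# Hypothetical stream keys, one per priority level, listed from most to least urgent.
PRIORITY_STREAMS = [
    "inference:stream:high",
    "inference:stream:normal",
    "inference:stream:low",
]


def next_job(pop: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Return the first available job, checking high, then normal, then low."""
    for key in PRIORITY_STREAMS:
        job = pop(key)
        if job is not None:
            return job
    return None


# In-memory stand-in for the Redis streams (assumption, for demonstration only).
queues: Dict[str, deque] = {key: deque() for key in PRIORITY_STREAMS}


def pop_in_memory(key: str) -> Optional[dict]:
    """Pop the oldest job from one queue, or return None when it is empty."""
    return queues[key].popleft() if queues[key] else None


queues["inference:stream:low"].append({"id": "a", "priority": "low"})
queues["inference:stream:high"].append({"id": "b", "priority": "high"})
first = next_job(pop_in_memory)   # the high-priority job, although it was enqueued later
second = next_job(pop_in_memory)  # the low-priority job
```

Strict ordering like this can starve `low` jobs under sustained `high` traffic, so a real consumer may want an aging or quota rule on top.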
#### R4: Add Backpressure Mechanism
- Instead of hard threshold at 50, implement **adaptive backpressure**:
  - Queue depth 0-29: normal operation
  - Queue depth 30-39: return `Retry-After` header with increasing delay
  - Queue depth 40-50: return 503 with exponentially increasing `Retry-After`
  - Queue depth >50: circuit breaker open
- **File:** `queue-service/queue-service.py`
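
A rough sketch of how the depth bands in R4 might map to responses follows. The function name, the concrete delay values, the 60s cap, the exact band boundaries, and the choice to still accept jobs in the second band are all assumptions; the recommendation only fixes the approximate ranges.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    status: int                 # HTTP status code to return
    retry_after: Optional[int]  # seconds for the Retry-After header, or None
    accept: bool                # whether the job is enqueued


def backpressure(depth: int) -> Decision:
    """Map the current queue depth to an admission decision."""
    if depth < 30:
        # Normal operation: accept with no hint.
        return Decision(202, None, True)
    if depth < 40:
        # Linear delay: 1s at depth 30 up to 10s at depth 39 (assumed values).
        return Decision(202, depth - 29, True)
    if depth <= 50:
        # Exponential delay: 2s at depth 40, doubling every 2 entries, capped at 60s.
        return Decision(503, min(60, 2 ** (1 + (depth - 40) // 2)), False)
    # Beyond 50 the circuit breaker is open: reject with the maximum delay.
    return Decision(503, 60, False)
```

Keeping the decision in a pure function makes the thresholds easy to unit test and to tune without touching the Flask handler.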
#### R5: Dead Letter Queue (DLQ)
- Move failed/unprocessable jobs to a `inference:dead-letter` stream
- Include failure reason, attempt count, and original payload
- Provide admin API to inspect, retry, or discard DLQ entries
- **File:** `queue-service/queue-service.py`
```python
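# New endpoints (assumes a redis-py client created with decode_responses=True)
@app.route("/dlq", methods=["GET"])
def list_dlq():
    # XRANGE returns (message_id, fields) pairs; convert them to JSON-friendly dicts
    entries = r.xrange("inference:dead-letter")
    return jsonify([{"id": mid, "fields": fields} for mid, fields in entries])

@app.route("/dlq/retry/<message_id>", methods=["POST"])
def retry_dlq(message_id):
    # Redis has no XGET; fetch a single entry with XRANGE bounded by its own ID
    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
    if not entries:
        return jsonify({"error": "not found"}), 404
    _, fields = entries[0]
    r.xadd("inference:stream", fields)
    r.xdel("inference:dead-letter", message_id)  # avoid retrying the same entry twice
    return jsonify({"status": "requeued"}), 202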
#### R6: GPU-Aware Routing
- Queue consumer should check GPU `slots_busy` before routing
- If a GPU is busy, try the next available GPU
- Track per-GPU queue depth and avoid overloading a single GPU
- **File:** New consumer logic
#### R7: Job Status API
- Add job ID generation on enqueue
- Provide `/status/<job_id>` endpoint to check progress
- Store job state in Redis: `queued` → `processing` → `completed`/`failed`
- **File:** `queue-service/queue-service.py`
```python
@app.route("/enqueue", methods=["POST"])
def enqueue():
    job_id = str(uuid.uuid4())
    job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
    r.xadd(stream_key, {"job": json.dumps(job)})
    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
    return jsonify({"job_id": job_id, "status": "queued"}), 202

@app.route("/status/<job_id>")
def job_status(job_id):
    status = r.hget("job:status", job_id)
    # Return 404 only when the job is unknown (the one-line form always returned 404)
    if status is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(json.loads(status))
```
#### R8: Health-Based Circuit Breaker
- Replace simple depth threshold with **per-GPU circuit breakers**
- Track consecutive failures per GPU
- Implement half-open state: after cooldown, probe one GPU to test recovery
- **File:** `queue-service/queue-service.py`
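
One way to read R8 is a small state machine per GPU: closed, open, and half-open with a single probe. The sketch below is hypothetical. The class and method names are invented, and the defaults of 5 failures and a 30s cooldown are borrowed from `CIRCUIT_BREAKER_THRESHOLD` and `CIRCUIT_BREAKER_TIMEOUT` in the env example on the assumption that they are meant for this purpose.

```python
import time
from typing import Callable, Optional


class GpuCircuitBreaker:
    """Per-GPU breaker: opens after consecutive failures, probes after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock                       # injectable for testing
        self.failures = 0                        # consecutive failures seen so far
        self.opened_at: Optional[float] = None   # None means the breaker is closed
        self.probe_in_flight = False

    def allow(self) -> bool:
        """Return True if a request may be sent to this GPU right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at < self.cooldown:
            return False
        # Half-open: let exactly one probe request through.
        if self.probe_in_flight:
            return False
        self.probe_in_flight = True
        return True

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self.failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker when warranted."""
        self.failures += 1
        self.probe_in_flight = False
        if self.opened_at is not None or self.failures >= self.threshold:
            self.opened_at = self.clock()  # restart the cooldown window
```

A consumer would keep one instance per GPU and call `allow()` before routing, then report the outcome with `record_success()` or `record_failure()`.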
#### R9: Centralized Configuration
- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
  - Redis config key: `config:gpus`
  - Or environment file mounted to all containers
- Nginx can use Lua/variable from config instead of static upstreams
- **File:** New `config/` directory or Redis-based config
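
For the environment-file option in R9, a shared loader could look roughly like this. The variable names come from the env example in this change set; the `load_gpu_endpoints` helper and its fail-fast behaviour are assumptions, not existing code.

```python
import os
from typing import Dict, Mapping

# Logical GPU name -> environment variable holding its endpoint.
GPU_ENV_VARS = {
    "amdpve": "AMDPVE_ENDPOINT",
    "llmgpu": "LLMGPU_ENDPOINT",
    "ocu_llm": "OCU_LLM_ENDPOINT",
}


def load_gpu_endpoints(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read all GPU endpoints from one place, failing loudly if any is missing."""
    missing = [var for var in GPU_ENV_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing GPU endpoint variables: " + ", ".join(missing))
    # Strip trailing slashes so callers can append paths consistently.
    return {name: env[var].rstrip("/") for name, var in GPU_ENV_VARS.items()}
```

If the queue service and both dashboards imported the same helper, the only remaining copy of the addresses would be the Nginx upstream block.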
---
## 5. Priority Issue Summary
### Critical (Fix Immediately)
1. **Q1:** Queue has no consumer; enqueued requests are never processed
2. **Q4:** No job ID or result retrieval mechanism
3. **N3:** Queue fallback triggers on individual GPU failure, not all-down
### High (Fix Before Production)
4. **Q5:** Circuit breaker has no recovery mechanism
5. **Q6:** `/status` endpoint blocks on GPU health checks
6. **D1:** Dashboard `/api/status` makes 4 sequential calls, up to 11s
7. **C2:** `Dockerfile.queue` path mismatch in docker-compose
8. **I1:** No dependency pinning in any Dockerfile
9. **I3:** Multi-process CMD without supervisor in GPU dashboard
### Medium (Improve in Next Iteration)
10. **Q3:** Redis host default conflicts with Docker networking
11. **Q7:** Silent exception swallowing in Redis access
12. **Q8:** Content-Type header dropped in queue
13. **D2:** Single-threaded dashboard server
14. **D3:** Three separate sources of truth for GPU addresses
15. **G1:** Sequential GPU collection blocks on single failure
16. **N1:** Rate limit burst of 20 nodelay defeats protection
17. **N5:** Hardcoded container names in Nginx
18. **C1:** Queue API exposed externally
19. **C4:** No Docker health checks
### Low (Nice to Have)
20. **Q9:** No graceful shutdown
21. **C3:** `restart: always` vs `unless-stopped`
22. **C5:** No Redis authentication
23. **G4:** 60s dead threshold is too aggressive
24. **I2:** Unnecessary `requests` dependency
25. **I4:** No `.dockerignore`
26. **I5:** Duplicate Dockerfiles
---
## 6. Deployment Architecture Summary
### What Works Well
- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
- SSE streaming support in Nginx for LLM response streaming
- Rate limiting at the gateway layer
- Circuit breaker pattern implemented (even if basic)
### What Needs Work
- **Queue is incomplete:** storage without processing is the most critical gap
- **No job lifecycle:** requests go in and never come out
- **Duplicated configuration:** GPU addresses in 3+ places
- **No monitoring/alerting:** no Prometheus metrics, no alerting rules
- **Single point of failure:** no Redis replication, no container redundancy
- **No logging:** Flask dev server logs are minimal; no structured logging
### Recommended Next Steps
1. **Priority 1:** Implement queue consumer with GPU load-based routing
2. **Priority 2:** Add job status tracking and result retrieval
3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
4. **Priority 4:** Add Docker health checks and proper dependency management
5. **Priority 5:** Centralize GPU configuration in Redis or environment
6. **Priority 6:** Add Prometheus metrics endpoint for observability
-5
@@ -1,5 +0,0 @@
FROM python:3.11-slim
WORKDIR /app
COPY dashboard/harness-dashboard.py .
EXPOSE 3001
CMD ["python3", "harness-dashboard.py"]
-14
@@ -1,14 +0,0 @@
FROM python:3.11-slim
RUN pip install requests
COPY gpu-dashboard/ /app/
WORKDIR /app
RUN mkdir -p /app/public && \
cp gpu.html /app/public/ && \
touch /app/public/gpu_metrics.json
EXPOSE 8092
CMD ["sh", "-c", "python3 gpu_collector.py & python3 -m http.server 8092 --directory /app/public & wait"]
+57 -45
@@ -1,63 +1,75 @@
# Syslog Harness
# syslog-harness — Inference API Harness
Operational orchestration layer for Syslog's internal AI agents.
CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
## Architecture
```
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Agent │────>│ Nginx │────>│ GPU Pool │
│ (Hermes) │ │ Router │ │ (MoE/Dense)│
└─────────────┘ └──────────────┘ └─────────────┘
├──> :8091 Queue Service (Docker)
└──> :3001 Dashboard (Docker)
nginx :80 → router :9000 → GPU backends
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
Total: 6 concurrent slots
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
```
## Components
| Service | Port | Container | Purpose |
|---|---|---|---|
| Nginx Router | 8080 | Host | Routes requests to GPU backends |
| Queue Service | 8091 | `syslog-queue` | Enqueues requests when GPUs are down |
| Dashboard | 3001 | `syslog-dashboard` | Observability UI + API |
## GPU Routing
| Header `X-Syslog-Model` | Backend | Model |
|---|---|---|
| (none) / `standard` | amdpve (.15) | qwen3.6-35B-A3B (MoE) |
| `heavy` / `qwen3.5-27B` | llmgpu (.8) | qwen3.5-27B (Dense) |
| `light` / `gemma-4` | ocu_llm (.110) | gemma-4-E4B (Light) |
## Quick Start
## Deploy
```bash
# Build & start
docker compose build
cd /opt/inference-harness
docker compose up -d
# Verify
curl http://localhost:8091/health
curl http://localhost:3001/api/status
```
## Dashboard
## Endpoints
- **UI:** `http://<host>:8080/dashboard/harness.html`
- **API:** `http://<host>:8080/dashboard/api/status`
| URL | Purpose |
|-----|---------|
| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
| `/v1/models` | Available models |
| `/` | Dashboard (GPU health, routing, agents, timeseries) |
## Circuit Breaker
## Authentication
- Rate limit: 10 req/s per IP
- Burst: 20 requests
- Excess returns 503
- Queue fallback on GPU 502/503
**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <key>`. Missing or invalid keys return **401 Unauthorized**.
## Production Migration
## Agent API Keys
See [MIGRATION_PLAN.md](./MIGRATION_PLAN.md)
| Agent | Key |
|-------|-----|
| Abiba | `sk-syslog-abiba` |
| Mumuni | `sk-syslog-mumuni` |
| Tanko | `sk-syslog-tanko` |
| Koby | `sk-syslog-koby` |
| Kagenz0 | `sk-syslog-kagenz0` |
| Koonimo | `sk-syslog-koonimo` |
---
*Built for Syslog Solution LLC — Quality over speed.*
## Routing Tiers
| Tier | Trigger | Priority |
|------|---------|----------|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
| Default | Everything else | MoE → VLM → Dense |
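
The tier table can be read as a small classifier. The sketch below follows the table's thresholds only; the function name, the order in which tiers are checked, counting a "turn" as one user message, and taking the token estimate as an input are all assumptions rather than the router's actual logic.

```python
from typing import Dict, List

# GPU priority per tier, copied from the table above.
PRIORITY = {
    "lightweight": ["VLM", "MoE", "Dense"],
    "simple": ["VLM", "MoE", "Dense"],
    "heavy": ["Dense", "MoE", "VLM"],
    "default": ["MoE", "VLM", "Dense"],
}


def classify(messages: List[Dict[str, str]], est_tokens: int) -> str:
    """Pick a routing tier for an OpenAI-style message list."""
    turns = sum(1 for m in messages if m["role"] == "user")
    has_system = any(m["role"] == "system" for m in messages)
    words = sum(len(m["content"].split()) for m in messages)
    if est_tokens > 4000 or turns > 8:
        return "heavy"
    if not has_system and turns <= 1 and words <= 100:
        return "lightweight"
    if est_tokens <= 1000 and turns <= 4:
        return "simple"
    return "default"
```

The router would then try `PRIORITY[classify(messages, est_tokens)]` in order and take the first GPU that is not busy.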
## Queue
When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
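
The polling behaviour described here might look like the following loop. It is a sketch under stated assumptions: `wait_for_slot` and `find_free_gpu` are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without real delays.

```python
import time
from typing import Callable, Optional


def wait_for_slot(find_free_gpu: Callable[[], Optional[str]],
                  timeout: float = 30.0,
                  poll_interval: float = 0.5,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll until a GPU has a free slot or the timeout expires.

    Returns the GPU name, or None on timeout (the caller would then send 503).
    """
    deadline = clock() + timeout
    while True:
        gpu = find_free_gpu()
        if gpu is not None:
            return gpu
        if clock() >= deadline:
            return None
        sleep(poll_interval)
```

A per-request `X-Queue-Timeout` header would simply be parsed and passed as `timeout`, with `QUEUE_TIMEOUT` supplying the default.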
## Models
| GPU | Model | VRAM | Slots |
|-----|-------|------|-------|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
## Maintenance
Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
- Cleans Redis timeseries keys >60 days
- Prunes Docker build cache >7 days
- Logs container health and Redis memory
Logs: `/var/log/harness-maintenance.log`
File diff suppressed because it is too large
File diff suppressed because it is too large
+6 -7
@@ -1,8 +1,7 @@
FROM python:3.13-slim
COPY harness-dashboard.py /app/harness-dashboard.py
FROM python:3.12-slim
WORKDIR /app
EXPOSE 3001
CMD ["python3", "harness-dashboard.py"]
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY dashboard.py .
EXPOSE 3000
CMD ["python", "dashboard.py"]
-5
@@ -1,5 +0,0 @@
FROM python:3.11-slim
WORKDIR /app
COPY harness-dashboard.py .
EXPOSE 3001
CMD ["python3", "harness-dashboard.py"]
+232
@@ -0,0 +1,232 @@
"""SyslogAI Harness Dashboard — Modern Design."""
import os, json, time, queue, threading
import requests
from flask import Flask, request, render_template_string, Response, stream_with_context
ROUTER_METRICS = os.environ.get("ROUTER_METRICS_URL", "http://router:9000/metrics")
app = Flask(__name__)
sse_subscribers = []; sse_lock = threading.Lock()
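def fetch_state():
    """Fetch router metrics; fall back to an empty state if the router is unreachable."""
    try:
        r = requests.get(ROUTER_METRICS, timeout=5)
        if r.status_code == 200:
            return r.json()
    except Exception:
        pass
    return {"gpus":[],"route_counts":{},"agent_counts":{},"recent":[],"timestamp":time.time()}

def broadcast_loop():
    """Push the latest state to every SSE subscriber queue every 3 seconds."""
    while True:
        time.sleep(3)
        payload = json.dumps(fetch_state())
        with sse_lock:
            dead = []
            for q in sse_subscribers:
                # Queue.put() returns None, so `not q.put(...)` marked every subscriber dead;
                # drop a subscriber only when its queue is actually full.
                try:
                    q.put_nowait(payload)
                except queue.Full:
                    dead.append(q)
            for q in dead:
                sse_subscribers.remove(q)

threading.Thread(target=broadcast_loop, daemon=True).start()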
DASHBOARD_HTML = r"""<!DOCTYPE html>
<html lang="en" data-bs-theme="dark">
<head>
<meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>SyslogAI Harness</title>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet">
<style>
body { background: #0b0f17; color: #bcc3cd; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif; padding: 20px 24px; }
.card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; height: 100%; }
.stat-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 18px 20px; text-align: center; }
.stat-value { font-size: 28px; font-weight: 700; line-height: 1.1; }
.stat-label { font-size: 11px; text-transform: uppercase; letter-spacing: 0.6px; color: #64748b; margin-top: 4px; }
.gpu-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 16px 18px; height: 100%; }
.gpu-card .title { font-size: 13px; font-weight: 600; color: #e2e8f0; margin-bottom: 12px; display: flex; align-items: center; gap: 8px; }
.gpu-card .status-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
.gpu-card .row-metric { display: flex; justify-content: space-between; font-size: 12px; padding: 2px 0; }
.gpu-card .row-metric .lbl { color: #64748b; }
.gpu-card .row-metric .val { color: #e2e8f0; font-variant-numeric: tabular-nums; }
.gpu-card .slot-bar { display: flex; gap: 3px; margin-top: 8px; }
.gpu-card .slot-bar .s { flex: 1; height: 5px; border-radius: 2px; background: #1e293b; }
.gpu-card .slot-bar .s.active { background: #38bdf8; }
.chart-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 16px 18px; height: 100%; display: flex; flex-direction: column; }
.chart-card .title { font-size: 13px; font-weight: 600; color: #e2e8f0; margin-bottom: 12px; }
.bar-row { margin-bottom: 8px; }
.bar-label { display: flex; justify-content: space-between; font-size: 11px; margin-bottom: 3px; color: #64748b; }
.bar-label .name { color: #cbd5e1; }
.bar-track { height: 5px; background: #1e293b; border-radius: 3px; overflow: hidden; }
.bar-fill { height: 100%; border-radius: 3px; transition: width 0.6s ease; }
.table-custom { font-size: 11px; margin: 0; }
.table-custom th { color: #64748b; font-weight: 500; font-size: 10px; text-transform: uppercase; border-color: #1e293b; padding: 8px 10px; }
.table-custom td { color: #94a3b8; border-color: rgba(30,41,59,0.5); padding: 6px 10px; }
.agent-badge { font-size: 10px; padding: 2px 7px; border-radius: 8px; font-weight: 600; }
.btn-sm-period { font-size: 10px; padding: 3px 10px; border-radius: 6px; border: 1px solid #1e293b; color: #64748b; background: transparent; cursor: pointer; }
.btn-sm-period.active { background: #1d4ed8; color: #fff; border-color: #1d4ed8; }
.ring-label { font-size: 22px; font-weight: 700; }
.ring-sublabel { font-size: 10px; color: #64748b; }
</style>
</head>
<body>
<!-- HEADER -->
<div class="d-flex justify-content-between align-items-center mb-4">
<div>
<h5 class="mb-0 text-white fw-bold">&#x26A1; SyslogAI Harness</h5>
<div class="small text-secondary" id="live-indicator">
<span class="status-dot" id="live-dot" style="width:6px;height:6px;border-radius:50%;display:inline-block;background:#22c55e;animation:pulse 2s infinite"></span>
<span id="connection-status">live</span> &middot; <span id="update-time"></span>
</div>
</div>
<div class="d-flex gap-2">
<div class="stat-card" style="min-width:100px"><div class="stat-value text-info" id="kpi-total">0</div><div class="stat-label">Requests</div></div>
<div class="stat-card" style="min-width:100px"><div class="stat-value text-warning" id="kpi-active">0</div><div class="stat-label">Active</div></div>
<div class="stat-card" style="min-width:100px"><div class="stat-value" style="color:#a78bfa" id="kpi-agents">0</div><div class="stat-label">Agents</div></div>
</div>
</div>
<div class="row g-3 align-items-stretch">
<!-- ROW 1: Usage Chart (8) + GPU Metrics (4) -->
<div class="col-md-8"><div class="chart-card"><div class="title d-flex justify-content-between align-items-center">
<span>Usage Over Time</span>
<div class="d-flex gap-1">
<button class="btn-sm-period active" onclick="switchPeriod('day')">24h</button>
<button class="btn-sm-period" onclick="switchPeriod('week')">7d</button>
<button class="btn-sm-period" onclick="switchPeriod('month')">30d</button>
</div>
</div><div id="timeseries-chart" style="height:150px"></div><div id="timeseries-legend" class="d-flex justify-content-center gap-3 mt-2 flex-wrap small"></div></div></div>
<div class="col-md-4"><div class="chart-card"><div class="title">GPU Metrics</div><div id="gpu-metrics-card"></div></div></div>
<!-- ROW 2: 3 GPU Cards -->
<div class="col-md-4"><div class="gpu-card" id="gpu-moe"><div class="text-secondary small">Loading...</div></div></div>
<div class="col-md-4"><div class="gpu-card" id="gpu-dense"><div class="text-secondary small">Loading...</div></div></div>
<div class="col-md-4"><div class="gpu-card" id="gpu-light"><div class="text-secondary small">Loading...</div></div></div>
<!-- ROW 3: Queue + Model + Agent -->
<div class="col-md-4"><div class="chart-card"><div class="title">Queue Status</div><div class="text-center" id="queue-viz"></div></div></div>
<div class="col-md-4"><div class="chart-card"><div class="title">Model Distribution</div><div id="route-bars"></div></div></div>
<div class="col-md-4"><div class="chart-card"><div class="title">Agent Activity</div><div id="agent-bars"></div></div></div>
<!-- ROW 4: Live Stream -->
<div class="col-12"><div class="chart-card"><div class="title">Live Stream</div>
<div class="table-responsive"><table class="table table-custom mb-0">
<thead><tr><th>Time</th><th>Agent</th><th>Model</th><th>Reason</th><th>Tier</th></tr></thead>
<tbody id="route-tbody"></tbody>
</table></div>
</div></div>
</div>
<script>
var MC={'qwen3.5-9b-vlm':'#22c55e','qwen3.6-27B-code':'#f59e0b','qwen3.6-35B-A3B':'#a78bfa'};
var ML={'qwen3.5-9b-vlm':'Qwen3.5 9B VLM','qwen3.6-27B-code':'Qwen Code','qwen3.6-35B-A3B':'Qwen MoE'};
var GL={'qwen3.6-35B-A3B':'MoE - Strix Halo','qwen3.6-27B-code':'Dense - RTX 3090','qwen3.5-9b-vlm':'VLM - RTX 5070'};
function $(id){return document.getElementById(id);}
function render(data){
if(!data||!data.gpus)return;
var t=Object.values(data.route_counts||{}).reduce((a,b)=>a+b,0);
var ta=0,tm=0;data.gpus.forEach(function(g){ta+=(g.active_requests||0);tm+=(g.max_concurrent||1)});
$('kpi-total').textContent=t;$('kpi-active').textContent=ta+'/'+tm;$('kpi-agents').textContent=Object.keys(data.agent_counts||{}).length;
$('update-time').textContent=new Date().toLocaleTimeString();
var ids={'qwen3.6-35B-A3B':'gpu-moe','qwen3.6-27B-code':'gpu-dense','qwen3.5-9b-vlm':'gpu-light'};
data.gpus.forEach(function(g){
var el=$(ids[g.id]);if(!el)return;
var a=g.active_requests||0,mx=g.max_concurrent||1;
var sc=g.status==='healthy'?'#22c55e':g.status==='saturated'?'#f59e0b':'#ef4444';
var ss=g.status==='healthy'?'Online':g.status==='saturated'?'Busy':'Offline';
var slots='';for(var i=0;i<mx;i++)slots+='<span class=\"s'+(i<a?' active':'')+'\"></span>';
var h='<div class=\"title\"><span class=\"status-dot\" style=\"background:'+sc+'\"></span>'+GL[g.id]+'<span class=\"ms-auto small\" style=\"color:'+sc+'\">'+ss+'</span></div>';
h+='<div class=\"row-metric\"><span class=\"lbl\">VRAM</span><span class=\"val\">'+g.vram_used_mb+' / '+g.vram_total_mb+' MB</span></div>';
h+='<div class=\"row-metric\"><span class=\"lbl\">Utilization</span><span class=\"val\">'+g.gpu_util_pct+'%</span></div>';
h+='<div class=\"row-metric\"><span class=\"lbl\">Temperature</span><span class=\"val\" style=\"color:'+(g.temp_c>85?'#ef4444':g.temp_c>70?'#f59e0b':'#22c55e')+'\">'+g.temp_c+'C</span></div>';
if(g.power_w)h+='<div class=\"row-metric\"><span class=\"lbl\">Power</span><span class=\"val\">'+g.power_w+'W'+(g.power_limit_w?'/'+g.power_limit_w+'W':'')+'</span></div>';
h+='<div class=\"row-metric\"><span class=\"lbl\">Slots</span><span class=\"val\" style=\"color:'+(a>=mx?'#ef4444':'#e2e8f0')+'\">'+a+' / '+mx+'</span></div>';
h+='<div class=\"slot-bar\">'+slots+'</div>';el.innerHTML=h;
});
renderQueue(data);renderGPUMetrics(data);
var rc=data.route_counts||{},mr=Math.max(1,...Object.values(rc));
$('route-bars').innerHTML=Object.entries(rc).length?Object.entries(rc).sort((a,b)=>b[1]-a[1]).map(function(e){var m=e[0],c=e[1];return'<div class=\"bar-row\"><div class=\"bar-label\"><span class=\"name\">'+(ML[m]||m)+'</span><span>'+c+' ('+(t?Math.round(c/t*100):0)+'%)</span></div><div class=\"bar-track\"><div class=\"bar-fill\" style=\"width:'+(c/mr*100)+'%;background:'+(MC[m]||'#38bdf8')+'\"></div></div></div>';}).join(''):'<div class=\"text-secondary small\">-</div>';
var ac=data.agent_counts||{},ma=Math.max(1,...Object.values(ac));
$('agent-bars').innerHTML=Object.entries(ac).length?Object.entries(ac).sort((a,b)=>b[1]-a[1]).map(function(e){return'<div class=\"bar-row\"><div class=\"bar-label\"><span class=\"name\">'+e[0]+'</span><span>'+e[1]+'</span></div><div class=\"bar-track\"><div class=\"bar-fill\" style=\"width:'+(e[1]/ma*100)+'%;background:#38bdf8\"></div></div></div>';}).join(''):'<div class=\"text-secondary small\">-</div>';
var recent=data.recent||[];
$('route-tbody').innerHTML=recent.length?recent.slice(0,20).map(function(r){var d=new Date(r.ts*1000),ag=r.agent||'?';return'<tr><td class=\"text-secondary\">'+d.toLocaleTimeString()+'</td><td><span class=\"agent-badge\" style=\"background:rgba(56,189,248,0.12);color:#38bdf8\">'+ag+'</span></td><td>'+(ML[r.model]||r.model)+'</td><td class=\"text-secondary\">'+(r.reason||'')+'</td><td class=\"text-uppercase\" style=\"font-size:10px;color:'+(r.tier==='enterprise'?'#a78bfa':'#64748b')+'\">'+(r.tier||'')+'</td></tr>';}).join(''):'<tr><td colspan=\"5\" class=\"text-secondary\">Waiting...</td></tr>';
}
function renderQueue(data){
var el=$('queue-viz');if(!el)return;
var ta=0,tm=0;data.gpus.forEach(function(g){ta+=(g.active_requests||0);tm+=(g.max_concurrent||1)});
var pct=tm>0?Math.round(ta/tm*100):0,st=pct>=100?'SATURATED':pct>=50?'BUSY':'IDLE';
var sc=pct>=100?'#ef4444':pct>=50?'#f59e0b':'#22c55e';
var circ=188.5,dash=(pct/100)*circ;
var h='<div class=\"d-inline-block position-relative mb-2\"><svg width=\"72\" height=\"72\"><circle cx=\"36\" cy=\"36\" r=\"30\" fill=\"none\" stroke=\"#1e293b\" stroke-width=\"6\"/><circle cx=\"36\" cy=\"36\" r=\"30\" fill=\"none\" stroke=\"'+sc+'\" stroke-width=\"6\" stroke-dasharray=\"'+dash+' '+(circ-dash)+'\" stroke-linecap=\"round\" transform=\"rotate(-90 36 36)\"/></svg><div style=\"position:absolute;top:50%;left:50%;transform:translate(-50%,-50%);text-align:center\"><div class=\"ring-label\" style=\"color:'+sc+'\">'+ta+'</div><div class=\"ring-sublabel\">/ '+tm+' slots</div></div></div>';
h+='<div class=\"fw-bold mb-2 small\" style=\"color:'+sc+'\">'+st+'</div>';
var lb={'qwen3.6-35B-A3B':'MoE','qwen3.6-27B-code':'Dense','qwen3.5-9b-vlm':'VLM'};
data.gpus.forEach(function(g){var a=g.active_requests||0,mx=g.max_concurrent||1,gp=mx>0?Math.round(a/mx*100):0;h+='<div class=\"d-flex align-items-center gap-2 mb-1 justify-content-center\"><span class=\"small\" style=\"min-width:32px;text-align:right;font-size:10px\">'+(lb[g.id]||g.id)+'</span><div style=\"flex:1;max-width:70px;height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+gp+'%;background:'+sc+';border-radius:2px\"></div></div><span class=\"small\" style=\"min-width:22px;font-size:10px\">'+a+'/'+mx+'</span></div>'});
el.innerHTML=h;
}
function renderGPUMetrics(data){
var el=$('gpu-metrics-card');if(!el)return;
var lb={'qwen3.6-35B-A3B':'MoE','qwen3.6-27B-code':'Dense','qwen3.5-9b-vlm':'VLM'};
var h='';data.gpus.forEach(function(g){
var nm=lb[g.id]||g.id,tp=g.temp_c||0,ut=g.gpu_util_pct||0,pw=g.power_w||0,pl=g.power_limit_w||0;
var tc=tp>85?'#ef4444':tp>70?'#f59e0b':'#22c55e',uc=ut>90?'#ef4444':ut>70?'#f59e0b':'#22c55e';
h+='<div class=\"mb-3\"><div class=\"fw-bold small text-white-50 mb-1\">'+nm+'</div>';
h+='<div class=\"d-flex align-items-center gap-2 mb-1\"><span class=\"small text-secondary\" style=\"min-width:30px\">T</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+Math.min(tp,100)+'%;background:'+tc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+tc+';min-width:30px;text-align:right\">'+tp+'C</span></div>';
h+='<div class=\"d-flex align-items-center gap-2 mb-1\"><span class=\"small text-secondary\" style=\"min-width:30px\">U</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+ut+'%;background:'+uc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+uc+';min-width:30px;text-align:right\">'+ut+'%</span></div>';
if(pw>0){var pp=pl>0?Math.round(pw/pl*100):0,pc=pp>90?'#ef4444':pp>70?'#f59e0b':'#22c55e';h+='<div class=\"d-flex align-items-center gap-2\"><span class=\"small text-secondary\" style=\"min-width:30px\">P</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+pp+'%;background:'+pc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+pc+';min-width:30px;text-align:right\">'+pw+'W</span></div>';}
h+='</div>';});
el.innerHTML=h;
}
var cp='day';
function switchPeriod(p){cp=p;document.querySelectorAll('.btn-sm-period').forEach(function(b){b.classList.remove('active')});event.target.classList.add('active');loadTS();}
function loadTS(){fetch('/api/timeseries?period='+cp).then(function(r){return r.json()}).then(renderTS).catch(function(){})}
function renderTS(d){
var models=d.models||{},labels=d.labels||[];
if(!labels.length)return;
var cn=$('timeseries-chart'),lg=$('timeseries-legend'),mn=Object.keys(models);
if(!mn.length){cn.innerHTML='<div class=\"text-secondary small text-center py-4\">-</div>';return;}
var mv=1;for(var m in models)for(var i=0;i<models[m].length;i++)if(models[m][i]>mv)mv=models[m][i];mv=Math.ceil(mv*1.15)||1;
var W=labels.length>1?100/(labels.length-1):100,H=130;
var paths='';for(var mi=0;mi<mn.length;mi++){var m=mn[mi],vals=models[m]||[],d='';for(var i=0;i<vals.length;i++){var x=i*W,y=H-(vals[i]/mv)*H;d+=(i===0?'M':'L')+x.toFixed(1)+','+y.toFixed(1)+' ';}paths+='<path d=\"'+d+'\" fill=\"none\" stroke=\"'+(MC[m]||'#38bdf8')+'\" stroke-width=\"2\" stroke-linecap=\"round\" opacity=\"0.8\"/>';}
var grid='';for(var g=0;g<=4;g++){var y=(g/4)*H;grid+='<line x1=\"0\" y1=\"'+y.toFixed(1)+'\" x2=\"100\" y2=\"'+y.toFixed(1)+'\" stroke=\"#1e293b\" stroke-width=\"1\"/>';}
cn.innerHTML='<svg viewBox=\"0 0 100 '+(H+16)+'\" style=\"width:100%;height:'+(H+20)+'px;display:block\" preserveAspectRatio=\"none\">'+grid+paths+'</svg>';
lg.innerHTML=mn.map(function(m){return'<span class=\"d-flex align-items-center gap-1\"><svg width=\"14\" height=\"8\"><line x1=\"0\" y1=\"4\" x2=\"14\" y2=\"4\" stroke=\"'+(MC[m]||'#38bdf8')+'\" stroke-width=\"2\"/></svg>'+(ML[m]||m)+'</span>';}).join('');
}
function poll(){fetch('/api/state').then(function(r){return r.json()}).then(function(data){render(data);$('connection-status').textContent='live';}).catch(function(){$('connection-status').textContent='reconnecting';});}
poll();setInterval(poll,3000);loadTS();
</script>
</body>
</html>"""
@app.route("/")
def dashboard(): return render_template_string(DASHBOARD_HTML)
@app.route("/api/state")
def api_state(): return fetch_state()
@app.route("/api/timeseries")
def api_timeseries():
    period = request.args.get("period", "day")
    try:
        # Pass period via params so the value is URL-encoded
        r = requests.get("http://router:9000/metrics/timeseries", params={"period": period}, timeout=5)
        if r.status_code == 200: return r.json()
    except Exception: pass
    return {"models": {}, "labels": []}
@app.route("/api/stream")
def api_stream():
    def ev():
        q = queue.Queue()
        with sse_lock: sse_subscribers.append(q)
        try:
            yield "data: "+json.dumps(fetch_state())+"\n\n"
            while True:
                try: msg = q.get(timeout=3); yield "data: "+msg+"\n\n"
                except queue.Empty: yield "data: "+json.dumps(fetch_state())+"\n\n"
        except GeneratorExit: pass
        finally:
            with sse_lock:
                if q in sse_subscribers: sse_subscribers.remove(q)
    return Response(stream_with_context(ev()), mimetype="text/event-stream", headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
@app.route("/health")
def health(): return {"status":"healthy","service":"harness-dashboard"}
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3000, debug=False)
-133
@@ -1,133 +0,0 @@
#!/usr/bin/env python3
"""Syslog Harness Dashboard — Simple HTTP server exposing GPU health + metrics."""
import json
import os
import time
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler
from datetime import datetime
GPUS = {
    "amdpve": {"endpoint": os.getenv("AMDVE_EP", "192.168.68.15:8080"), "model": "qwen3.6-35B-A3B (MoE)", "vram": "65GB"},
    "llmgpu": {"endpoint": os.getenv("LLMGPU_EP", "192.168.68.8:8080"), "model": "qwen3.5-27B (Dense)", "vram": "24GB"},
    "ocu_llm": {"endpoint": os.getenv("OCU_LLM_EP", "192.168.68.110:8080"), "model": "gemma-4-E4B (Light)", "vram": "12GB"},
}
def check_gpu(name, info):
    try:
        start = time.time()
        # Use simple HTTP GET to check if the GPU endpoint is alive
        resp = urllib.request.urlopen(f"http://{info['endpoint']}/", timeout=3)
        latency = (time.time() - start) * 1000
        return {
            "status": "up",
            "latency_ms": round(latency, 1),
            "model": info["model"],
            "vram": info["vram"],
        }
    except Exception as e:
        return {"status": "down", "error": str(e)[:50], "model": info["model"], "vram": info["vram"]}
def get_queue_status():
    try:
        req = urllib.request.Request("http://queue-service:8091/status")
        resp = urllib.request.urlopen(req, timeout=2)
        return json.loads(resp.read())
    except Exception:
        return {"queue_depth": -1, "circuit_breaker": "unknown", "gpu_health": {}}
DASHBOARD_HTML = """
<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>🦅 Syslog Harness</title>
<style>
body { background: #1a1a2e; color: #e0e0e0; font-family: monospace; margin: 0; padding: 20px; }
.card { background: #16213e; border-radius: 8px; padding: 16px; margin: 10px 0; border-left: 4px solid #0f3460; }
.up { border-left-color: #00d26a; } .down { border-left-color: #ff4757; }
.warn { border-left-color: #ffa502; }
h1 { color: #00d26a; font-size: 24px; } h2 { color: #0f3460; font-size: 16px; }
.metric { display: inline-block; margin: 4px 12px; }
.value { font-weight: bold; color: #00d26a; }
#refresh { position: fixed; top: 10px; right: 10px; background: #0f3460; color: white;
border: none; padding: 8px 16px; border-radius: 4px; cursor: pointer; }
table { width: 100%; border-collapse: collapse; margin: 10px 0; }
th, td { text-align: left; padding: 8px; border-bottom: 1px solid #0f3460; }
th { color: #00d26a; }
</style></head><body>
<button id="refresh" onclick="location.reload()">↻ Refresh</button>
<h1>🦅 Syslog Harness Dashboard</h1>
<h2>Updated: <span id="ts"></span></h2>
<div class="card" id="queue-card">
<h2>Queue & Circuit Breaker</h2>
<div class="metric">Depth: <span class="value" id="depth">--</span></div>
<div class="metric">Circuit: <span class="value" id="circuit">--</span></div>
<div class="metric">Threshold: <span class="value" id="threshold">--</span></div>
</div>
<div class="card">
<h2>GPU Endpoints</h2>
<table><tr><th>GPU</th><th>Model</th><th>VRAM</th><th>Status</th><th>Latency</th></tr>
<tbody id="gpu-table"></tbody></table>
</div>
<script>
document.getElementById('ts').textContent = new Date().toISOString();
fetch('/api/status').then(r => r.json()).then(data => {
document.getElementById('depth').textContent = data.queue_depth;
document.getElementById('circuit').textContent = data.circuit_breaker;
document.getElementById('threshold').textContent = 'warn:' + data.thresholds.warn + ' / open:' + data.thresholds.open;
const card = document.getElementById('queue-card');
if (data.circuit_breaker === 'open') card.className = 'card warn';
else if (data.circuit_breaker === 'warn') card.className = 'card warn';
else card.className = 'card up';
let html = '';
for (const [name, gpu] of Object.entries(data.gpu_health)) {
const status = gpu.status === 'up' ? 'up' : 'down';
const latency = gpu.status === 'up' ? gpu.latency_ms + 'ms' : gpu.error;
const rowClass = gpu.status === 'up' ? '' : 'down';
html += `<tr class="${rowClass}"><td>${name}</td><td>${gpu.model}</td><td>${gpu.vram}</td><td>${status}</td><td>${latency}</td></tr>`;
}
document.getElementById('gpu-table').innerHTML = html;
});
setInterval(() => location.reload(), 10000);
</script></body></html>
"""
class Handler(SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/" or self.path == "/harness.html":
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(DASHBOARD_HTML.encode())
        elif self.path == "/api/status":
            status = get_queue_status()
            enriched = {
                "queue_depth": status.get("queue_depth", -1),
                "circuit_breaker": status.get("circuit_breaker", "unknown"),
                "thresholds": status.get("thresholds", {"warn": 30, "open": 50}),
                "gpu_health": {},
            }
            for name, info in GPUS.items():
                enriched["gpu_health"][name] = check_gpu(name, info)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(enriched).encode())
        else:
            self.send_response(404)
            self.end_headers()
    def log_message(self, format, *args):
        pass  # Suppress request logs
if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", 3001), Handler)
    print("Dashboard running on :3001/harness.html")
    server.serve_forever()
+2
@@ -0,0 +1,2 @@
flask==3.1.*
requests==2.32.*
+80 -37
@@ -1,54 +1,97 @@
version: "3.8"
version: '3.8'
services:
redis:
image: redis:7-alpine
restart: always
networks:
- gpu-router-net
container_name: harness-redis
restart: unless-stopped
ports:
- "127.0.0.1:6379:6379"
volumes:
- redis-data:/data
command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 5
queue-service:
build:
context: .
dockerfile: Dockerfile.queue
restart: always
networks:
- gpu-router-net
router:
build: ./router
container_name: harness-router
restart: unless-stopped
ports:
- "8091:8091"
depends_on:
- redis
- "9000:9000"
environment:
- REDIS_HOST=redis
- REDIS_PORT=6379
- REDIS_URL=redis://redis:6379
- GPU_MOE_URL=http://192.168.68.15:8080/v1
- GPU_DENSE_URL=http://192.168.68.8:8080/v1
- GPU_LIGHT_URL=http://192.168.68.110:8080/v1
healthcheck:
test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9000/health')"]
interval: 15s
timeout: 5s
retries: 3
depends_on:
redis:
condition: service_healthy
litellm:
image: ghcr.io/berriai/litellm:main-stable
command: ["--config", "/app/config.yaml", "--port", "4000"]
container_name: harness-litellm
restart: unless-stopped
ports:
- "8081:4000"
volumes:
- ./litellm_config.yaml:/app/config.yaml
environment:
- LITELLM_MASTER_KEY=sk-syslog-local-master-key
healthcheck:
test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:4000/health/liveliness')"]
interval: 15s
timeout: 5s
retries: 3
depends_on:
redis:
condition: service_healthy
nginx:
image: nginx:alpine
container_name: harness-nginx
restart: unless-stopped
ports:
- "80:80"
volumes:
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
healthcheck:
test: ["CMD", "curl", "-f", "http://127.0.0.1/health"]
interval: 15s
timeout: 5s
retries: 3
depends_on:
- litellm
- dashboard
dashboard:
build:
context: .
dockerfile: Dockerfile.dashboard
restart: always
networks:
- gpu-router-net
build: ./dashboard
container_name: harness-dashboard
restart: unless-stopped
ports:
- "3001:3001"
- "3000:3000"
environment:
- REDIS_URL=redis://redis:6379
- GPU_SIDECARS=192.168.68.15:8090,192.168.68.8:8090,192.168.68.110:8090
healthcheck:
test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:3000/health')"]
interval: 15s
timeout: 5s
retries: 3
depends_on:
- redis
gpu-dashboard:
build:
context: .
dockerfile: Dockerfile.gpu
restart: always
networks:
- gpu-router-net
ports:
- "8092:8092"
networks:
gpu-router-net:
driver: bridge
volumes:
redis-data:
# LiteLLM command override to load config
# (appended to fix config loading issue)
-115
@@ -1,115 +0,0 @@
#!/usr/bin/env python3
"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
import urllib.request, json, time, os
HOSTS = [
    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
]
OUTPUT = "/root/hermes-workspace/public/gpu_metrics.json"
INTERVAL = 10
STALE_THRESHOLD = 30  # seconds before marking stale
DEAD_THRESHOLD = 60  # seconds before marking unreachable
last_seen = {}
def fetch_json(url, timeout=3):
    try:
        req = urllib.request.Request(url)
        resp = urllib.request.urlopen(req, timeout=timeout)
        return json.loads(resp.read().decode())
    except Exception:
        return None
def collect_one(h):
    """Collect GPU hardware + llama.cpp inference state for one host."""
    name = h["name"]
    host = h["host"]
    now = time.time()
    # GPU hardware from sidecar
    gpu = fetch_json(f"http://{host}:8090/")
    # llama.cpp inference state
    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
    # Determine inference state
    model_name = None
    inference_state = "unknown"
    if llamacpp_models:
        models = llamacpp_models.get("data", [])
        if models:
            model_name = models[0].get("id")
    if llamacpp_health:
        status = llamacpp_health.get("status", "")
        if status == "ok":
            idle = llamacpp_health.get("slots_idle", 0)
            processing = llamacpp_health.get("slots_processing", 0)
            if idle and not processing:
                inference_state = "idle"
            elif processing:
                inference_state = "busy"
            else:
                inference_state = "idle"
    # Check for /slots endpoint for is_processing detail
    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
    if slots and isinstance(slots, list) and len(slots) > 0:
        if slots[0].get("is_processing"):
            inference_state = "busy"
    result = {
        "host": name,
        "gpu_name": h["gpu"],
        "inference": {
            "state": inference_state,
            "model": model_name,
        },
        "hardware": gpu if gpu else None,
        "online": gpu is not None,
        "timestamp": now,
    }
    if gpu is not None:
        last_seen[name] = now
    if name in last_seen:
        age = now - last_seen[name]
        if age > DEAD_THRESHOLD:
            result["online"] = False
        elif age > STALE_THRESHOLD:
            result["stale"] = True
    return result
def main():
    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
    while True:
        start = time.time()
        results = [collect_one(h) for h in HOSTS]
        payload = {
            "updated": start,
            "gpus": results,
        }
        with open(OUTPUT + ".tmp", "w") as f:
            json.dump(payload, f)
        os.rename(OUTPUT + ".tmp", OUTPUT)
        elapsed = time.time() - start
        sleep_for = max(0, INTERVAL - elapsed)
        time.sleep(sleep_for)
if __name__ == "__main__":
    main()
-183
@@ -1,183 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>GPU Monitor</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { background: #0d1117; color: #c9d1d9; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; padding: 20px; }
h1 { font-size: 1.3em; margin-bottom: 4px; }
.topbar { display: flex; justify-content: space-between; align-items: center; margin-bottom: 20px; padding-bottom: 12px; border-bottom: 1px solid #21262d; }
.topbar .status { font-size: 0.85em; color: #8b949e; }
.topbar .status .dot { display: inline-block; width: 8px; height: 8px; border-radius: 50%; margin-right: 6px; }
.dot.green { background: #3fb950; }
.dot.yellow { background: #d2991d; }
.dot.red { background: #f85149; }
.cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); gap: 16px; }
.card { background: #161b22; border: 1px solid #21262d; border-radius: 8px; padding: 16px; }
.card.stale { opacity: 0.5; }
.card.dead { opacity: 0.3; border-color: #f85149; }
.card-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
.card-header .name { font-weight: 600; font-size: 1.05em; }
.card-header .host { font-size: 0.8em; color: #8b949e; }
.card-header .state { font-size: 0.75em; padding: 2px 8px; border-radius: 10px; font-weight: 600; }
.state.idle { background: #1b3826; color: #3fb950; }
.state.busy { background: #3d1f1a; color: #f85149; }
.state.unknown { background: #21262d; color: #8b949e; }
.metric { margin-bottom: 10px; }
.metric-label { display: flex; justify-content: space-between; font-size: 0.82em; color: #8b949e; margin-bottom: 2px; }
.metric-label .val { color: #c9d1d9; font-weight: 500; }
.bar { height: 6px; border-radius: 3px; background: #21262d; overflow: hidden; }
.bar-fill { height: 100%; border-radius: 3px; transition: width 0.5s ease; }
.bar-fill.temp-cool { background: #3fb950; }
.bar-fill.temp-warm { background: #d2991d; }
.bar-fill.temp-hot { background: #f85149; }
.bar-fill.util { background: #58a6ff; }
.bar-fill.vram { background: #bc8cff; }
.bar-fill.power { background: #f0883e; }
.model-line { font-size: 0.82em; color: #8b949e; margin-top: 8px; padding-top: 8px; border-top: 1px solid #21262d; }
.model-line span { color: #c9d1d9; }
.error { color: #f85149; font-size: 0.85em; }
</style>
</head>
<body>
<div class="topbar">
<div>
<h1><a href="/" style="color:#58a6ff;text-decoration:none;">← Workspace</a> · GPU Monitor</h1>
<span class="status"><span class="dot green" id="status-dot"></span><span id="status-text">Loading...</span></span>
</div>
<div class="status" id="age"></div>
</div>
<div class="cards" id="cards"></div>
<script>
const INTERVAL = 5000;
let lastFetchTime = null;
function updateClock() {
const el = document.getElementById('age');
if (!lastFetchTime) { el.textContent = '—'; return; }
const age = Math.round((Date.now() / 1000) - lastFetchTime);
el.textContent = age <= 60 ? `updated ${age}s ago` : `stale ${age}s ago`;
}
setInterval(updateClock, 1000);
const TEMP_WARN = 70, TEMP_HOT = 82;
const VRAM_WARN = 80, VRAM_HOT = 92;
function tempClass(c) { return c > TEMP_HOT ? 'temp-hot' : c > TEMP_WARN ? 'temp-warm' : 'temp-cool'; }
function vramClass(pct) { return pct > VRAM_HOT ? 'temp-hot' : pct > VRAM_WARN ? 'temp-warm' : 'temp-cool'; }
function pct(val, max) { return max ? Math.round(val / max * 100) : 0; }
function mbToGB(mb) { return mb ? (mb / 1024).toFixed(1) : '—'; }
function renderCard(g) {
const hw = g.hardware || {};
const inf = g.inference || {};
const online = g.online !== false;
const stale = g.stale === true;
let cardClass = '';
if (!online) cardClass = 'dead';
else if (stale) cardClass = 'stale';
let stateClass = inf.state || 'unknown';
let stateLabel = inf.state ? inf.state.toUpperCase() : 'UNKNOWN';
if (!online) { stateClass = 'unknown'; stateLabel = 'OFFLINE'; }
const temp = hw.temp_c;
const util = hw.gpu_util_pct;
const vramUsed = hw.vram_used_mb;
const vramTotal = hw.vram_total_mb;
const power = hw.power_w;
const powerLimit = hw.power_limit_w;
const fan = hw.fan_pct;
const vendor = hw.vendor;
let html = `<div class="card ${cardClass}">`;
html += `<div class="card-header">`;
html += `<div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div>`;
html += `<div class="state ${stateClass}">${stateLabel}</div>`;
html += `</div>`;
if (!online) {
html += `<div class="error">Unreachable</div>`;
} else if (hw.error) {
html += `<div class="error">${hw.error}</div>`;
} else {
// Temperature
if (temp != null) {
html += `<div class="metric"><div class="metric-label"><span>Temperature</span><span class="val">${temp}°C</span></div>`;
html += `<div class="bar"><div class="bar-fill ${tempClass(temp)}" style="width:${Math.min(temp,100)}%"></div></div></div>`;
}
// Utilization
if (util != null) {
html += `<div class="metric"><div class="metric-label"><span>GPU Utilization</span><span class="val">${util}%</span></div>`;
html += `<div class="bar"><div class="bar-fill util" style="width:${util}%"></div></div></div>`;
}
// VRAM
if (vramUsed != null && vramTotal != null) {
const vramPct = pct(vramUsed, vramTotal);
html += `<div class="metric"><div class="metric-label"><span>VRAM</span><span class="val">${mbToGB(vramUsed)} / ${mbToGB(vramTotal)} GB</span></div>`;
html += `<div class="bar"><div class="bar-fill ${vramClass(vramPct)}" style="width:${vramPct}%"></div></div></div>`;
}
// Power
if (power != null) {
const powerPct = powerLimit ? pct(power, powerLimit) : 0;
const powerText = powerLimit ? `${power}W / ${powerLimit}W` : `${power}W`;
html += `<div class="metric"><div class="metric-label"><span>Power</span><span class="val">${powerText}</span></div>`;
if (powerLimit) html += `<div class="bar"><div class="bar-fill power" style="width:${powerPct}%"></div></div>`;
html += `</div>`;
}
// Fan (NVIDIA only)
if (fan != null) {
html += `<div class="metric"><div class="metric-label"><span>Fan Speed</span><span class="val">${fan}%</span></div>`;
html += `<div class="bar"><div class="bar-fill util" style="width:${fan}%"></div></div></div>`;
}
}
// Model loaded
html += `<div class="model-line">Model: <span>${inf.model || '—'}</span></div>`;
html += `</div>`;
return html;
}
async function refresh() {
try {
const resp = await fetch('gpu_metrics.json?t=' + Date.now());
const data = await resp.json();
const gpus = data.gpus || [];
document.getElementById('cards').innerHTML = gpus.map(renderCard).join('');
// Top bar status
const online = gpus.filter(g => g.online !== false).length;
const total = gpus.length;
const dot = document.getElementById('status-dot');
const txt = document.getElementById('status-text');
if (online === total) { dot.className = 'dot green'; txt.textContent = `${online}/${total} online`; }
else if (online > 0) { dot.className = 'dot yellow'; txt.textContent = `${online}/${total} online`; }
else { dot.className = 'dot red'; txt.textContent = 'All offline'; }
// Capture fetch time for live clock
lastFetchTime = Date.now() / 1000;
} catch(e) {
document.getElementById('status-dot').className = 'dot red';
document.getElementById('status-text').textContent = 'Collector down';
}
}
// Render skeletons instantly
const SKELETONS = [
{host:'amdpve', gpu_name:'AMD Strix Halo', hardware:{}, inference:{}, online:true},
{host:'llmgpu', gpu_name:'RTX 3090', hardware:{}, inference:{}, online:true},
{host:'ocu-llm', gpu_name:'RTX 5070', hardware:{}, inference:{}, online:true},
];
document.getElementById('cards').innerHTML = SKELETONS.map(g =>
`<div class="card"><div class="card-header"><div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div><div class="state unknown">···</div></div><div class="model-line" style="color:#8b949e;">Loading metrics...</div></div>`
).join('');
refresh();
setInterval(refresh, INTERVAL);
</script>
</body>
</html>
-115
@@ -1,115 +0,0 @@
#!/usr/bin/env python3
"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
import urllib.request, json, time, os
HOSTS = [
{"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
{"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
{"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
]
OUTPUT = "/app/public/gpu_metrics.json"
INTERVAL = 10
STALE_THRESHOLD = 30 # seconds before marking stale
DEAD_THRESHOLD = 60 # seconds before marking unreachable
last_seen = {}
def fetch_json(url, timeout=3):
try:
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req, timeout=timeout)
return json.loads(resp.read().decode())
except Exception:
return None
def collect_one(h):
"""Collect GPU hardware + llama.cpp inference state for one host."""
name = h["name"]
host = h["host"]
now = time.time()
# GPU hardware from sidecar
gpu = fetch_json(f"http://{host}:8090/")
# llama.cpp inference state
llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
# Determine inference state
model_name = None
inference_state = "unknown"
if llamacpp_models:
models = llamacpp_models.get("data", [])
if models:
model_name = models[0].get("id")
if llamacpp_health:
status = llamacpp_health.get("status", "")
if status == "ok":
idle = llamacpp_health.get("slots_idle", 0)
processing = llamacpp_health.get("slots_processing", 0)
if idle and not processing:
inference_state = "idle"
elif processing:
inference_state = "busy"
else:
inference_state = "idle"
# Check for /slots endpoint for is_processing detail
slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
if slots and isinstance(slots, list) and len(slots) > 0:
if slots[0].get("is_processing"):
inference_state = "busy"
result = {
"host": name,
"gpu_name": h["gpu"],
"inference": {
"state": inference_state,
"model": model_name,
},
"hardware": gpu if gpu else None,
"online": gpu is not None,
"timestamp": now,
}
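if gpu is not None:
last_seen[name] = now
elif name in last_seen:
# Sidecar missed this poll: keep the host online during a grace period,
# flag it stale after STALE_THRESHOLD, report it offline after DEAD_THRESHOLD
age = now - last_seen[name]
if age <= DEAD_THRESHOLD:
result["online"] = True
if age > STALE_THRESHOLD:
result["stale"] = True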
return result
def main():
print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
while True:
start = time.time()
results = [collect_one(h) for h in HOSTS]
payload = {
"updated": start,
"gpus": results,
}
with open(OUTPUT + ".tmp", "w") as f:
json.dump(payload, f)
os.rename(OUTPUT + ".tmp", OUTPUT)
elapsed = time.time() - start
sleep_for = max(0, INTERVAL - elapsed)
time.sleep(sleep_for)
if __name__ == "__main__":
main()
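The STALE_THRESHOLD and DEAD_THRESHOLD constants in the collector describe a three-state lifecycle for each host: online, stale, then unreachable. The sketch below isolates that classification as a pure function so it can be unit-tested without polling any sidecar. The helper name classify_host and the "dead" label for a host that has never answered are assumptions for illustration, not part of the original script.

```python
STALE_THRESHOLD = 30  # seconds, same value as the collector
DEAD_THRESHOLD = 60   # seconds, same value as the collector


def classify_host(last_seen_ts, now):
    """Hypothetical helper: map time since the last good sidecar poll to a state.

    last_seen_ts is None when the sidecar has never answered (assumed "dead").
    """
    if last_seen_ts is None:
        return "dead"
    age = now - last_seen_ts
    if age > DEAD_THRESHOLD:
        return "dead"
    if age > STALE_THRESHOLD:
        return "stale"
    return "online"


if __name__ == "__main__":
    # Walk one host through the three states
    for age in (0, 31, 61):
        print(age, classify_host(1000.0, 1000.0 + age))
```

Keeping the comparison strict (age > threshold) matches the collector, so a host seen exactly 30 seconds ago still counts as online.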
-14
@@ -1,14 +0,0 @@
#!/bin/bash
set -e
# Start collector as background process
cd /root/hermes-workspace/public
python3 /app/collector.py &
COLLECTOR_PID=$!
echo "Collector started (PID $COLLECTOR_PID)"
echo "Serving dashboard on :8092"
# Serve the public directory (contains gpu.html + gpu_metrics.json)
cd /root/hermes-workspace/public
python3 -m http.server 8092
+3 -19
@@ -13,7 +13,7 @@ upstream llmgpu_pool {
}
upstream ocu_llm_pool {
-## RTX 5070 — gemma-4 (Dense 4B) — Ultra-light tasks
+## RTX 5070 — qwen3.5-9b-vlm (VLM) — Vision + light tasks
server 192.168.68.110:8080;
}
@@ -24,12 +24,7 @@ upstream queue_service {
upstream dashboard_service {
## Harness dashboard (Docker container)
-server syslog-harness-dashboard-1:3001;
-}
-upstream gpu_dashboard_pool {
-## GPU dashboard (Docker container)
-server syslog-harness-gpu-dashboard-1:8092;
+server dashboard:3001;
}
## ------------------------------------------------------------------
@@ -41,7 +36,7 @@ map $http_x_syslog_model $gpu_upstream {
"heavy" llmgpu_pool;
"qwen3.5-27B" llmgpu_pool;
"light" ocu_llm_pool;
-"gemma-4" ocu_llm_pool;
+"qwen3.5-9b-vlm" ocu_llm_pool;
}
## Rate limit zone — 10 req/s per IP, burst of 20
@@ -61,17 +56,6 @@ server {
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
-## ------------------------------------------------------------------
-## GPU Dashboard — observability UI (MUST be before / catch-all)
-## ------------------------------------------------------------------
-location /gpu {
-proxy_pass http://gpu_dashboard_pool/;
-proxy_set_header Host $host;
-proxy_set_header X-Real-IP $remote_addr;
-proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-proxy_set_header X-Forwarded-Proto $scheme;
-}
## ------------------------------------------------------------------
## Main location — proxy to selected upstream
## ------------------------------------------------------------------
+106
@@ -0,0 +1,106 @@
## Syslog GPU Router — Nginx Configuration
## Routes incoming agent requests to the appropriate GPU backend
## based on the X-Syslog-Model header.
upstream amdpve_pool {
## Strix Halo 395 — qwen3.6-35B-A3B (MoE) — Default workhorse
server 192.168.68.15:8080;
}
upstream llmgpu_pool {
## RTX 3090 — qwen3.5-27B (Dense) — Heavy reasoning
server 192.168.68.8:8080;
}
upstream ocu_llm_pool {
## RTX 5070 — qwen3.5-9b-vlm (VLM) — Vision + light tasks
server 192.168.68.110:8080;
}
upstream queue_service {
## Agent queue with circuit breaker (Docker container)
server 127.0.0.1:8091;
}
upstream dashboard_service {
## Harness dashboard (Docker container)
server 127.0.0.1:3001;
}
## ------------------------------------------------------------------
## Mapping: X-Syslog-Model header → upstream backend
## ------------------------------------------------------------------
map $http_x_syslog_model $gpu_upstream {
default amdpve_pool; # missing header → default workhorse
"standard" amdpve_pool;
"heavy" llmgpu_pool;
"qwen3.5-27B" llmgpu_pool;
"light" ocu_llm_pool;
"qwen3.5-9b-vlm" ocu_llm_pool;
}
## Rate limit zone — 10 req/s per IP, burst of 20
## (limit_req_zone is valid only in the http context, so it is declared outside the server block)
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
server {
listen 8080;
server_name _;
## ------------------------------------------------------------------
## Dashboard — observability UI (MUST be before / catch-all)
## ------------------------------------------------------------------
location /dashboard {
proxy_pass http://dashboard_service/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
## ------------------------------------------------------------------
## Main location — proxy to selected upstream
## ------------------------------------------------------------------
location / {
limit_req zone=perip burst=20 nodelay;
limit_req_status 503;
proxy_pass http://$gpu_upstream;
## Preserve original host and headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
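## Forward the model header to the backend so it can log it
## (proxy_pass_header only affects response headers coming back from the upstream)
proxy_set_header X-Syslog-Model $http_x_syslog_model;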
## Streaming support (SSE for LLM responses)
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
## Basic failover — retry on error or timeout
proxy_next_upstream error timeout http_502 http_503;
proxy_next_upstream_tries 2;
## Add a response header for observability
add_header X-Routed-To $gpu_upstream always;
## Fallback to queue when all GPU upstreams are down
error_page 502 503 504 = @queue_fallback;
}
## ------------------------------------------------------------------
## Queue fallback — enqueue when GPUs are unavailable
## ------------------------------------------------------------------
location @queue_fallback {
rewrite ^ /enqueue break;
proxy_pass http://queue_service;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Content-Type $content_type;
proxy_pass_request_body on;
}
}
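The map block in this config resolves the X-Syslog-Model request header to an upstream pool, with amdpve_pool as the default. A minimal Python sketch of the same lookup follows, useful for checking routing expectations outside nginx. The function name resolve_upstream is hypothetical, and the lower-casing assumes nginx's documented behaviour of matching map string values case-insensitively.

```python
# Header value -> upstream pool, mirroring the nginx map block.
# Keys are lower-cased because nginx map matches strings ignoring case.
UPSTREAMS = {
    "standard": "amdpve_pool",
    "heavy": "llmgpu_pool",
    "qwen3.5-27b": "llmgpu_pool",
    "light": "ocu_llm_pool",
    "qwen3.5-9b-vlm": "ocu_llm_pool",
}
DEFAULT_UPSTREAM = "amdpve_pool"  # missing or unknown header -> default workhorse


def resolve_upstream(header_value):
    """Hypothetical helper: return the pool name nginx would select."""
    if not header_value:
        return DEFAULT_UPSTREAM
    return UPSTREAMS.get(header_value.lower(), DEFAULT_UPSTREAM)


if __name__ == "__main__":
    for value in (None, "standard", "heavy", "qwen3.5-27B", "light", "unknown"):
        print(repr(value), "->", resolve_upstream(value))
```

An unknown header value falls through to the default pool rather than failing, which is the same behaviour the `default` entry gives the nginx map.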
+25
@@ -0,0 +1,25 @@
model_list:
  - model_name: qwen3.6-35B-A3B
    litellm_params:
      model: openai/qwen3.6-35B-A3B
      api_base: http://192.168.68.15:8080/v1
      api_key: "not-needed"
  - model_name: qwen3.6-27B-code
    litellm_params:
      model: openai/qwen3.6-27B-code-text
      api_base: http://192.168.68.8:8080/v1
      api_key: "not-needed"
  - model_name: qwen3.5-9b-vlm
    litellm_params:
      model: openai/qwen3.5-9b-vlm
      api_base: http://192.168.68.110:8080/v1
      api_key: "not-needed"
general_settings:
  master_key: sk-syslog-local-master-key
litellm_settings:
  drop_params: true
  request_timeout: 120
+79
@@ -0,0 +1,79 @@
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events { worker_connections 1024; }
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main launching rt=;
access_log /var/log/nginx/access.log main;
error_log /var/log/nginx/error.log;
sendfile on;
keepalive_timeout 65;
upstream router_api { server router:9000; }
upstream dashboard_ui { server dashboard:3000; }
upstream litellm_backend { server litellm:4000; }
server {
listen 80;
# Disable buffering for SSE streams
proxy_buffering off;
# API — through router
location /v1/ {
proxy_pass http://router_api;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Authorization $http_authorization;
proxy_connect_timeout 10s;
proxy_read_timeout 600s;
proxy_buffering off;
}
# SSE streaming endpoint
location /stream {
proxy_pass http://router_api;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header Connection "";
proxy_buffering off;
chunked_transfer_encoding off;
}
# Dashboard API proxy for SSE
location /api/ {
proxy_pass http://dashboard_ui;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_buffering off;
}
# LiteLLM debug
location /litellm/ {
rewrite ^/litellm/(.*) /$1 break;
proxy_pass http://litellm_backend;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header Authorization $http_authorization;
}
# Dashboard
location / {
proxy_pass http://dashboard_ui;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_buffering off;
}
location /health {
# default_type sets the Content-Type of the body produced by "return"
default_type application/json;
return 200 "{\"status\":\"healthy\"}";
}
}
}
-10
@@ -1,10 +0,0 @@
FROM python:3.13-slim
RUN pip install --no-cache-dir flask redis
COPY queue-service.py /app/queue-service.py
WORKDIR /app
EXPOSE 8091
CMD ["python3", "queue-service.py"]
+9
@@ -0,0 +1,9 @@
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY router.py .
EXPOSE 9000
CMD ["python", "router.py"]
+3
@@ -0,0 +1,3 @@
flask==3.1.*
redis==5.2.*
requests==2.32.*
+418
@@ -0,0 +1,418 @@
import os, json, time, logging, traceback, threading, queue
import requests, redis
from flask import Flask, request, jsonify, Response, stream_with_context
REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379")
GPU_MOE_URL = os.environ.get("GPU_MOE_URL", "http://192.168.68.15:8080/v1")
GPU_DENSE_URL = os.environ.get("GPU_DENSE_URL", "http://192.168.68.8:8080/v1")
GPU_LIGHT_URL = os.environ.get("GPU_LIGHT_URL", "http://192.168.68.110:8080/v1")
GPU_SIDECARS = {
"qwen3.6-35B-A3B": "http://192.168.68.15:8090",
"qwen3.6-27B-code": "http://192.168.68.8:8090",
"qwen3.5-9b-vlm": "http://192.168.68.110:8090",
}
GPU_URLS = {
"qwen3.6-35B-A3B": GPU_MOE_URL,
"qwen3.6-27B-code": GPU_DENSE_URL,
"qwen3.5-9b-vlm": GPU_LIGHT_URL,
}
# Max concurrent requests per GPU (based on llama.cpp --parallel)
GPU_MAX_CONCURRENT = {
"qwen3.6-35B-A3B": 2, # 2 slots
"qwen3.6-27B-code": 2, # 2 slots
"qwen3.5-9b-vlm": 2, # 2 slots (12GB VRAM, 4GB headroom)
}
# Context window sizes (tokens) — used for compaction signals
GPU_CONTEXT = {
"qwen3.6-35B-A3B": 131072,
"qwen3.6-27B-code": 98304,
"qwen3.5-9b-vlm": 131072,
}
TIER_MODELS = {
"starter": ["qwen3.5-9b-vlm"],
"professional": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "qwen3.5-9b-vlm"],
"enterprise": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "qwen3.5-9b-vlm"],
}
API_KEYS = {
"sk-syslog-local-master-key": {"tier": "enterprise", "agent": "admin"},
"sk-syslog-abiba": {"tier": "enterprise", "agent": "Abiba"},
"sk-syslog-mumuni": {"tier": "enterprise", "agent": "Mumuni"},
"sk-syslog-tanko": {"tier": "enterprise", "agent": "Tanko"},
"sk-syslog-koby": {"tier": "enterprise", "agent": "Koby"},
"sk-syslog-kagenz0": {"tier": "enterprise", "agent": "Kagenz0"},
"sk-syslog-koonimo": {"tier": "enterprise", "agent": "Koonimo"},
"sk-starter-abc123": {"tier": "starter", "agent": "test-starter"},
"sk-professional-xyz789": {"tier": "professional", "agent": "test-pro"},
}
logging.basicConfig(level=logging.INFO, format="%(asctime)s [ROUTER] %(levelname)s %(message)s")
log = logging.getLogger("router")
try: r = redis.from_url(REDIS_URL, decode_responses=True); r.ping()
except Exception: r = None
def counter_audit_loop():
"""Every 30s, check GPU slots and reset counters if all slots idle."""
while True:
time.sleep(30)
if not r: continue
for model, url in GPU_URLS.items():
try:
resp = requests.get(url.replace("/v1","") + "/slots",
headers={"Authorization": "Bearer not-needed"}, timeout=5)
if resp.status_code == 200:
slots = resp.json()
all_idle = all(not s.get("is_processing", False) for s in slots)
if all_idle:
current = int(r.get("active:" + model) or 0)
if current > 0:
r.set("active:" + model, 0)
log.info("AUDIT: Reset stuck counter for %s (was %d)", model, current)
except Exception:
pass
threading.Thread(target=counter_audit_loop, daemon=True).start()
app = Flask(__name__)
sse_subscribers = []; sse_lock = threading.Lock()
def gpu_active_count(model):
"""Get number of in-flight requests for a GPU."""
if r:
return int(r.get("active:" + model) or 0)
return 0
def gpu_incr(model):
if r: r.incr("active:" + model)
def gpu_decr(model):
if r:
v = r.decr("active:" + model)
if v and int(v) < 0:
r.set("active:" + model, 0) # never go negative
def check_gpu_health(model):
url = GPU_SIDECARS.get(model)
if not url: return {"status": "unknown"}
try:
resp = requests.get(url, timeout=5)
if resp.status_code == 200:
d = resp.json()
pct = (d.get("vram_used_mb",0) / max(d.get("vram_total_mb",1), 1)) * 100
status = "healthy" if pct < 90 else "saturated"
# Also check if llama.cpp endpoint is actually responding
gpu_url = GPU_URLS.get(model, "")
try:
hr = requests.get(gpu_url.replace("/v1","") + "/health", headers={"Authorization": "Bearer not-needed"}, timeout=3)
if hr.status_code != 200:
status = "down"
except Exception:
status = "down"
return {"status": status, "vram_used_mb": d.get("vram_used_mb"), "vram_total_mb": d.get("vram_total_mb"), "vram_pct": round(pct,1), "temp_c": d.get("temp_c"), "gpu_util_pct": d.get("gpu_util_pct"), "gpu_name": d.get("gpu_name"), "power_w": d.get("power_w"), "power_limit_w": d.get("power_limit_w")}
except Exception: pass
return {"status": "down"}
def available_models(): return [m for m in GPU_URLS if check_gpu_health(m)["status"] in ("healthy","saturated")]
def estimate_tokens(msgs):
"""Estimate token count from messages. Uses JSON length / 3.5 (closer to real tokenizer ratios for dense text)."""
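# Return an int: "//" with a float operand yields a float, and a value stored in Redis as "1234.0" cannot be parsed back with int()
return int(len(json.dumps(msgs, default=str)) / 3.5)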
def is_gpu_busy(model):
"""Check if GPU is at or near max concurrent capacity."""
active = gpu_active_count(model)
max_c = GPU_MAX_CONCURRENT.get(model, 1)
return active >= max_c
def select_best_gpu(candidates, reason):
"""Pick the best GPU from candidates IN ORDER — first non-busy one wins."""
for m in candidates:
if not is_gpu_busy(m):
return {"model": m, "reason": reason}
# All busy — pick least loaded
best = None
best_load = 999
for m in candidates:
load = gpu_active_count(m)
if load < best_load:
best_load = load
best = m
if best:
return {"model": best, "reason": "load_balanced_" + reason}
return None
def route(rd, tier):
msgs = rd.get("messages",[]); t = estimate_tokens(msgs)
sys = any(m.get("role")=="system" for m in msgs)
turns = len([m for m in msgs if m.get("role") in ("user","assistant")])
hints = rd.get("routing_hints",{})
allowed = TIER_MODELS.get(tier, ["qwen3.5-9b-vlm"])
avail = [m for m in available_models() if m in allowed]
if not avail: return {"model": allowed[0], "reason": "all_saturated", "saturated": True}
# Check if all available GPUs are at max capacity
if all(is_gpu_busy(m) for m in avail):
return {"model": avail[0], "reason": "all_saturated", "saturated": True}
req = rd.get("model","auto")
if req != "auto":
target = req if req in avail else avail[0]
# If explicit model is busy, check if another can take it
if is_gpu_busy(target) and req in allowed:
alts = [m for m in avail if m != target and m in allowed]
if alts:
alt = select_best_gpu(alts, "explicit")
if alt: return alt
return {"model": target, "reason": "explicit"}
if hints:
if hints.get("priority")=="speed" and "qwen3.5-9b-vlm" in avail:
return select_best_gpu(["qwen3.5-9b-vlm"], "hint_speed") or {"model":"qwen3.5-9b-vlm","reason":"hint_speed"}
if hints.get("priority")=="quality" and "qwen3.6-27B-code" in avail:
return select_best_gpu(["qwen3.6-27B-code"], "hint_quality") or {"model":"qwen3.6-27B-code","reason":"hint_quality"}
first_msg = msgs[0].get("content","") if msgs else ""
words = len(first_msg.split()) if isinstance(first_msg, str) else 99
# TIER 1: Lightweight — single-turn short queries → VLM first
if not sys and turns <= 1 and words <= 100 and "qwen3.5-9b-vlm" in avail:
if not is_gpu_busy("qwen3.5-9b-vlm"):
return {"model":"qwen3.5-9b-vlm","reason":"lightweight"}
# VLM busy — fall back to MoE, then Dense
fallback = [m for m in ["qwen3.6-35B-A3B","qwen3.6-27B-code"] if m in avail]
result = select_best_gpu(fallback, "lightweight_fallback")
if result: return result
# TIER 2: Simple conversations — short context, any prompt → VLM preferred
if t <= 1000 and turns <= 4 and "qwen3.5-9b-vlm" in avail:
if not is_gpu_busy("qwen3.5-9b-vlm"):
return {"model":"qwen3.5-9b-vlm","reason":"simple_conv"}
# VLM busy — try Dense
if "qwen3.6-27B-code" in avail and not is_gpu_busy("qwen3.6-27B-code"):
return {"model":"qwen3.6-27B-code","reason":"simple_conv_fallback"}
# TIER 3: Heavy reasoning — extremely large context or very long conversations
if t > 50000 or turns > 25:
# MoE first (131K context handles heavy sessions), then Dense (98K reasoning), then Light (131K fallback)
candidates = [m for m in ["qwen3.6-35B-A3B","qwen3.6-27B-code","qwen3.5-9b-vlm"] if m in avail]
result = select_best_gpu(candidates, "heavy_reasoning")
if result: return result
# TIER 4: Default — MoE first, VLM helps, Dense last (slow)
if t <= 50000:
candidates = [m for m in ["qwen3.6-35B-A3B","qwen3.5-9b-vlm","qwen3.6-27B-code"] if m in avail]
result = select_best_gpu(candidates, "default")
if result: return result
# Fallback — best available
if "qwen3.6-35B-A3B" in avail and not is_gpu_busy("qwen3.6-35B-A3B"):
return {"model":"qwen3.6-35B-A3B","reason":"default_moe"}
result = select_best_gpu([m for m in avail], "fallback")
if result: return result
return {"model":avail[0],"reason":"last_resort"}
def clean_unicode(text):
if not isinstance(text, str): return text
text = text.replace(chr(0x2014), "-"); text = text.replace(chr(0x2013), "-")
text = text.replace(chr(0x2018), "'"); text = text.replace(chr(0x2019), "'")
text = text.replace(chr(0x201C), '"'); text = text.replace(chr(0x201D), '"')
text = text.replace(chr(0x2026), "..."); text = text.replace(chr(0x00A0), " ")
return text.encode("ascii", "ignore").decode("ascii")
def clean_response(d):
if isinstance(d, dict): return {k: clean_response(v) for k,v in d.items()}
if isinstance(d, list): return [clean_response(v) for v in d]
if isinstance(d, str): return clean_unicode(d)
return d
def get_metrics():
d = {"gpus":[],"route_counts":{},"agent_counts":{},"tier_counts":{},"recent":[],"timestamp":time.time(),"active_requests":{}}
for m in GPU_URLS:
h = check_gpu_health(m)
d["gpus"].append({"id":m,"gpu_name":h.get("gpu_name",m),"status":h.get("status"),"vram_used_mb":h.get("vram_used_mb"),"vram_total_mb":h.get("vram_total_mb"),"vram_pct":h.get("vram_pct"),"temp_c":h.get("temp_c"),"gpu_util_pct":h.get("gpu_util_pct"),"power_w":h.get("power_w"),"power_limit_w":h.get("power_limit_w"),"active_requests":gpu_active_count(m), "max_concurrent": GPU_MAX_CONCURRENT.get(m, 1)})
d["active_requests"][m] = gpu_active_count(m)
if r:
try:
for m in GPU_URLS: d["route_counts"][m] = int(r.get("routes:"+m) or 0)
for k,v in API_KEYS.items():
c = int(r.get("routes:agent:"+v["agent"]) or 0)
if c>0: d["agent_counts"][v["agent"]] = c
for t in TIER_MODELS: d["tier_counts"][t] = int(r.get("routes:tier:"+t) or 0)
raw = r.lrange("routes:recent",0,49)
d["recent"] = [json.loads(x) for x in raw] if raw else []
except Exception: pass
return d
def bcast():
data = get_metrics(); payload = json.dumps(data)
with sse_lock:
dead = []
for q in sse_subscribers:
try: q.put(payload)
except Exception: dead.append(q)
for q in dead: sse_subscribers.remove(q)
QUEUE_TIMEOUT = int(os.environ.get("QUEUE_TIMEOUT", "30")) # max seconds to queue before 503
@app.route("/v1/chat/completions", methods=["POST"])
def chat():
try:
rd = request.get_json(force=True)
ak = request.headers.get("Authorization","").replace("Bearer ","")
if not ak or ak not in API_KEYS:
log.warning("AUTH_REJECTED: no/invalid API key from %s", request.remote_addr)
return jsonify({"error": "Unauthorized — valid API key required"}), 401
ki = API_KEYS[ak]
tier, agent = ki["tier"], ki["agent"]
# Allow agent to override queue timeout via header
q_timeout = int(request.headers.get("X-Queue-Timeout", str(QUEUE_TIMEOUT)))
# Cross-turn context tracking: accumulate tokens per session
session_id = request.headers.get("X-Session-Id", "")
session_tokens = 0
if session_id and r:
try:
prev = int(r.get("session:" + session_id) or 0)
current = estimate_tokens(rd.get("messages",[]))
session_tokens = max(prev, current) # context only grows
r.set("session:" + session_id, session_tokens, ex=86400) # TTL 24h
except Exception: pass
d = route(rd, tier)
queue_start = time.time()
# Queue loop: wait for a GPU slot instead of immediate 503
while d.get("saturated"):
elapsed = time.time() - queue_start
if elapsed > q_timeout:
resp = jsonify({"error": "All GPUs saturated", "queued_s": round(elapsed,1), "retry_after_s": 5})
resp.headers["Retry-After"] = "5"
log.warning("QUEUE_TIMEOUT: %s waited %.1fs, all GPUs saturated", agent, elapsed)
return resp, 503
time.sleep(0.5) # poll every 500ms
d = route(rd, tier)
waited = time.time() - queue_start
if waited > 0.5:
log.info("QUEUED: %s waited %.1fs before slot opened", agent, waited)
model, reason, url = d["model"], d["reason"], GPU_URLS[d["model"]]
is_stream = rd.get("stream", False)
gpu_incr(model)
log.info("ROUTE: %s -> %s (%s) stream=%s active=%d/%d", agent, model, reason, is_stream, gpu_active_count(model), GPU_MAX_CONCURRENT.get(model,1))
if r:
try:
r.incr("routes:"+model); r.incr("routes:tier:"+tier); r.incr("routes:agent:"+agent)
r.incr("ts:"+model+":"+time.strftime("%Y%m%d%H"))
r.lpush("routes:recent", json.dumps({"ts":time.time(),"model":model,"reason":reason,"tier":tier,"agent":agent}))
r.ltrim("routes:recent",0,999)
except Exception: pass
start = time.time()
resp = requests.post(url+"/chat/completions", json=rd,
headers={"Content-Type":"application/json","Authorization":"Bearer not-needed"}, timeout=300, stream=is_stream)
lat = int((time.time()-start)*1000)
gpu_decr(model)
if resp.status_code != 200: return jsonify({"error":"GPU error "+str(resp.status_code)}), 502
if is_stream:
def gen():
for raw in resp.iter_content(chunk_size=None, decode_unicode=True):
if raw: yield clean_unicode(raw)
bcast()
ctx_remaining = GPU_CONTEXT.get(model, 65536) - max(session_tokens, estimate_tokens(rd.get("messages",[])))
ctx_pct = ctx_remaining / GPU_CONTEXT.get(model, 65536) * 100
ctx_warning = "compact_urgent" if ctx_pct < 5 else ("compact_recommended" if ctx_pct < 15 else ("compact_soon" if ctx_pct < 30 else "ok"))
sse_resp = Response(stream_with_context(gen()), mimetype="text/event-stream")
sse_resp.headers["X-Context-Remaining"] = str(max(0, ctx_remaining))
sse_resp.headers["X-Context-Warning"] = ctx_warning
sse_resp.headers["X-Context-Model"] = model
return sse_resp
data = clean_response(resp.json())
for c in data.get("choices",[]):
msg = c.get("message",{})
if not msg.get("content") and msg.get("reasoning_content"):
msg["content"] = msg["reasoning_content"]
ctx_remaining = GPU_CONTEXT.get(model, 65536) - max(session_tokens, estimate_tokens(rd.get("messages",[])))
ctx_pct = ctx_remaining / GPU_CONTEXT.get(model, 65536) * 100
ctx_warning = "compact_urgent" if ctx_pct < 5 else ("compact_recommended" if ctx_pct < 15 else ("compact_soon" if ctx_pct < 30 else "ok"))
data["routing"] = {"model":model,"reason":reason,"gpu":url,"tier":tier,"agent":agent,"latency_ms":lat,"active_gpu":gpu_active_count(model),"context_remaining": max(0, ctx_remaining),"context_pct": round(ctx_pct,1),"context_warning": ctx_warning}
resp = jsonify(data)
resp.headers["X-Context-Remaining"] = str(max(0, ctx_remaining))
resp.headers["X-Context-Warning"] = ctx_warning
resp.headers["X-Context-Model"] = model
bcast()
return resp
except requests.Timeout:
gpu_decr(model)
log.error("TIMEOUT: %s -> %s", agent, model)
return jsonify({"error":"timeout"}), 504
except Exception as e:
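# model is unbound if the failure happened before routing (e.g. invalid JSON body)
if "model" in locals(): gpu_decr(model)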
log.error("Error: %s\n%s", e, traceback.format_exc())
return jsonify({"error":str(e)}), 500
@app.route("/v1/models")
def models(): return jsonify({"object":"list","data":[{"id":m,"object":"model","owned_by":"syslog","status":check_gpu_health(m).get("status"),"gpu":check_gpu_health(m).get("gpu_name")} for m in GPU_URLS]})
@app.route("/health")
def health():
gpus = {}
for m in GPU_URLS:
h = check_gpu_health(m)
h["active_requests"] = gpu_active_count(m)
h["max_concurrent"] = GPU_MAX_CONCURRENT.get(m, 1)
gpus[m] = h
return jsonify({"status":"healthy","redis":"connected" if r else "down","gpus":gpus,"available_models":available_models()})
@app.route("/metrics")
def metrics(): return jsonify(get_metrics())
@app.route("/metrics/timeseries")
def metrics_timeseries():
period = request.args.get("period", "day"); models_list = list(GPU_URLS.keys())
data = {"models": {}, "labels": []}
if period == "day":
buckets = [time.strftime("%Y%m%d%H", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
data["labels"] = [time.strftime("%H:00", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
elif period == "week":
buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
data["labels"] = [time.strftime("%a", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
else:
buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
data["labels"] = [time.strftime("%m/%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
if r:
for model in models_list:
counts = []
for bucket in buckets:
total = 0
if period in ("week","month"):
for hh in range(24): total += int(r.get("ts:"+model+":"+bucket+"{:02d}".format(hh)) or 0)
else: total = int(r.get("ts:"+model+":"+bucket) or 0)
counts.append(total)
data["models"][model] = counts
return jsonify(data)
@app.route("/stream")
def stream():
def ev():
q = queue.Queue()
with sse_lock: sse_subscribers.append(q)
try:
yield "data: "+json.dumps(get_metrics())+"\n\n"
while True:
try: yield "data: "+q.get(timeout=3)+"\n\n"
except queue.Empty: yield "data: "+json.dumps(get_metrics())+"\n\n"
except GeneratorExit: pass
finally:
with sse_lock:
if q in sse_subscribers: sse_subscribers.remove(q)
return Response(stream_with_context(ev()), mimetype="text/event-stream",
headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
if __name__ == "__main__":
log.info("Router on :9000 (load-aware)")
app.run(host="0.0.0.0", port=9000, debug=False)
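The router reports remaining context through the X-Context-Remaining and X-Context-Warning headers, using thresholds of 30%, 15% and 5% of the model's window. The sketch below restates that calculation as a standalone function so the thresholds can be checked in isolation. The function name context_warning and its tuple return shape are assumptions for illustration. The window sizes and the 65536 fallback are copied from router.py.

```python
# Context window sizes (tokens), copied from GPU_CONTEXT in router.py
GPU_CONTEXT = {
    "qwen3.6-35B-A3B": 131072,
    "qwen3.6-27B-code": 98304,
    "qwen3.5-9b-vlm": 131072,
}
FALLBACK_CONTEXT = 65536  # same fallback router.py uses for unknown models


def context_warning(model, used_tokens):
    """Hypothetical helper: return (remaining_tokens, remaining_pct, warning)."""
    limit = GPU_CONTEXT.get(model, FALLBACK_CONTEXT)
    remaining = limit - used_tokens
    pct = remaining / limit * 100
    if pct < 5:
        warning = "compact_urgent"
    elif pct < 15:
        warning = "compact_recommended"
    elif pct < 30:
        warning = "compact_soon"
    else:
        warning = "ok"
    # The header value is clamped at zero, as in the router
    return max(0, remaining), round(pct, 1), warning


if __name__ == "__main__":
    for model, used in (("qwen3.6-35B-A3B", 1000),
                        ("qwen3.6-35B-A3B", 100000),
                        ("qwen3.6-27B-code", 90000)):
        print(model, used, context_warning(model, used))
```

An agent could call the same logic on its own token estimate, or simply read the X-Context-Warning response header and trigger compaction once the value is anything other than "ok".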
Submodule syslog-harness-check deleted from b65ea22765