Compare commits
41 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 5116e4b1a7 | |||
| e55bcef21a | |||
| 32bd817e97 | |||
| 79965450bb | |||
| 6c829abef5 | |||
| 6efd5ff51c | |||
| 350a90b524 | |||
| 3156c093d5 | |||
| 3cbf38e3e2 | |||
| b67021ac69 | |||
| 46dda918de | |||
| 7a78c0f98d | |||
| 15c474aea0 | |||
| bfc38f5436 | |||
| f519a3fa60 | |||
| 941e8db65e | |||
| 241de4f38c | |||
| beb2d1790a | |||
| f2f8e8c921 | |||
| 76ade81fda | |||
| 9c31b5d622 | |||
| 4f032b035c | |||
| 8f3b0c6647 | |||
| 808c9d3d13 | |||
| 9817fe2ef2 | |||
| 654cdff718 | |||
| bf90e57c5f | |||
| 2db2796e53 | |||
| ec0f9fac63 | |||
| 3d42ea4767 | |||
| 7b6c6aabe1 | |||
| b65ea22765 | |||
| cf7f61650f | |||
| 7d00bbec0e | |||
| 37f7c95b05 | |||
| a28b3a557d | |||
| c42f3a9979 | |||
| e1f12c3462 | |||
| b55b954967 | |||
| c85aaa570b | |||
| 43382dac5b |
@@ -0,0 +1,8 @@
|
|||||||
|
# Syslog Harness Environment
|
||||||
|
REDIS_HOST=192.168.68.8
|
||||||
|
REDIS_PORT=6379
|
||||||
|
AMDPVE_ENDPOINT=http://192.168.68.15:8080
|
||||||
|
LLMGPU_ENDPOINT=http://192.168.68.8:8080
|
||||||
|
OCU_LLM_ENDPOINT=http://192.168.68.110:8080
|
||||||
|
CIRCUIT_BREAKER_THRESHOLD=5
|
||||||
|
CIRCUIT_BREAKER_TIMEOUT=30
|
||||||
@@ -0,0 +1,3 @@
|
|||||||
|
.git
|
||||||
|
__pycache__/
|
||||||
|
*.pyc
|
||||||
@@ -1,390 +0,0 @@
|
|||||||
# Syslog Harness Architecture Review & Improvement Recommendations
|
|
||||||
|
|
||||||
**Date:** 2026-05-17
|
|
||||||
**Commit:** `e95475f` "Add GPU dashboard container + Nginx routing"
|
|
||||||
**Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 1. Current Architecture Overview
|
|
||||||
|
|
||||||
```
|
|
||||||
|
|
||||||
Host (192.168.68.123)
|
|
||||||
|
|
||||||
|
|
||||||
Agent :8080> Nginx Router > Queue Service > Dashboard
|
|
||||||
:8080 :8091 :3001
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
GPU Pool Redis > GPU Dashboard
|
|
||||||
:8080 :6379 :8092
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
amdpve llmgpu ocu_llm
|
|
||||||
.15:8080 .8:8080 .110:8080
|
|
||||||
MoE 35B Dense 27B Light 4B
|
|
||||||
|
|
||||||
```
|
|
||||||
|
|
||||||
### Services
|
|
||||||
|
|
||||||
| Service | Port | Container | Image | Purpose |
|
|
||||||
|---|---|---|---|---|
|
|
||||||
| **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
|
|
||||||
| **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
|
|
||||||
| **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
|
|
||||||
| **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
|
|
||||||
| **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |
|
|
||||||
|
|
||||||
### GPU Backends
|
|
||||||
|
|
||||||
| Host | GPU | Model | Capacity |
|
|
||||||
|---|---|---|---|
|
|
||||||
| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
|
|
||||||
| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
|
|
||||||
| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |
|
|
||||||
|
|
||||||
### Data Flow
|
|
||||||
|
|
||||||
1. **Agent** sends request with `X-Syslog-Model` header Nginx :8080
|
|
||||||
2. **Nginx** routes to appropriate GPU based on header mapping
|
|
||||||
3. **GPU backend** (llama.cpp) processes request
|
|
||||||
4. **Fallback:** If GPU returns 502/503/timeout Nginx redirects to queue-service :8091
|
|
||||||
5. **Queue** stores request in Redis `inference:requests` LPUSH
|
|
||||||
6. **Dashboard** :3001 polls queue-service + GPU health for display
|
|
||||||
7. **GPU Dashboard** :8092 collects hardware metrics every 10s
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 2. File Inventory
|
|
||||||
|
|
||||||
```
|
|
||||||
docker-compose.yml # Main compose (Docker networking)
|
|
||||||
gpu-router-docker.conf # Nginx config for Docker deployment
|
|
||||||
Dockerfile.gpu # GPU dashboard container
|
|
||||||
Dockerfile.dashboard # Dashboard container (root-level)
|
|
||||||
queue-service/Dockerfile # Queue service container
|
|
||||||
queue-service/queue-service.py # Queue logic (121 lines)
|
|
||||||
dashboard/harness-dashboard.py # Dashboard app (133 lines)
|
|
||||||
dashboard/Dockerfile # Dashboard container (subdir)
|
|
||||||
dashboard/Dockerfile.dashboard # Dashboard container (duplicate)
|
|
||||||
gpu-dashboard/gpu_collector.py # GPU hardware collector (115 lines)
|
|
||||||
gpu-dashboard/gpu.html # GPU dashboard UI (183 lines)
|
|
||||||
gpu-dashboard/collector.py # Duplicate collector (hermes-workspace path)
|
|
||||||
gpu-dashboard/start.sh # Legacy startup script
|
|
||||||
MIGRATION_PLAN.md # Production migration plan
|
|
||||||
README.md # Documentation
|
|
||||||
syslog-harness-check/ # Checkpoint subdirectory (mirror)
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 3. Detailed Findings
|
|
||||||
|
|
||||||
### 3.1 Queue Service (`queue-service/queue-service.py`)
|
|
||||||
|
|
||||||
**Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.
|
|
||||||
|
|
||||||
**Issues Found:**
|
|
||||||
|
|
||||||
| # | Severity | Location | Issue |
|
|
||||||
|---|---|---|---|
|
|
||||||
| Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
|
|
||||||
| Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. No configuration source of truth. |
|
|
||||||
| Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
|
|
||||||
| Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
|
|
||||||
| Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once closed, it stays closed until manually drained. |
|
|
||||||
| Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with 3s timeout means `/status` can take up to 9s. |
|
|
||||||
| Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. This makes Redis failures silent queue depth returns 0 on failure (line 47), potentially allowing overflow. |
|
|
||||||
| Q8 | **MEDIUM** | Lines 83-84 | **Headers filtered to only X-* prefixed** the `Content-Type` header is dropped entirely, meaning the receiver can't determine payload format. |
|
|
||||||
| Q9 | **LOW** | Line 121 | **No graceful shutdown.** Flask development server doesn't handle SIGTERM gracefully. |
|
|
||||||
|
|
||||||
### 3.2 Nginx Gateway (`gpu-router-docker.conf`)
|
|
||||||
|
|
||||||
**Architecture:** Nginx routes requests to GPU backends based on `X-Syslog-Model` header value. Has rate limiting, streaming support, and queue fallback.
|
|
||||||
|
|
||||||
**Issues Found:**
|
|
||||||
|
|
||||||
| # | Severity | Location | Issue |
|
|
||||||
|---|---|---|---|
|
|
||||||
| N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** means 20 requests are served immediately beyond the rate limit, then throttled. This defeats the purpose of rate limiting under burst traffic all 20 could still overwhelm a GPU. |
|
|
||||||
| N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `tries 2`** means on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
|
|
||||||
| N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down. |
|
|
||||||
| N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is non-standard. Nginx automatically passes request headers; this directive is for response headers. The model header is already passed implicitly via `proxy_set_header` inheritance. |
|
|
||||||
| N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change based on docker-compose project prefix. Should use service names. |
|
|
||||||
| N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** has `X-Forwarded-Proto` but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations. |
|
|
||||||
|
|
||||||
### 3.3 Dashboard (`dashboard/harness-dashboard.py`)
|
|
||||||
|
|
||||||
**Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.
|
|
||||||
|
|
||||||
**Issues Found:**
|
|
||||||
|
|
||||||
| # | Severity | Location | Issue |
|
|
||||||
|---|---|---|---|
|
|
||||||
| D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2 + 33s = 11s response time. |
|
|
||||||
| D2 | **MEDIUM** | Lines 101-127 | **Uses `SimpleHTTPRequestHandler`** which is single-threaded. Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`. |
|
|
||||||
| D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
|
|
||||||
| D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |
|
|
||||||
|
|
||||||
### 3.4 GPU Dashboard (`gpu-dashboard/`)
|
|
||||||
|
|
||||||
**Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to `gpu_metrics.json`. Static HTTP server serves the dashboard.
|
|
||||||
|
|
||||||
**Issues Found:**
|
|
||||||
|
|
||||||
| # | Severity | Location | Issue |
|
|
||||||
|---|---|---|---|
|
|
||||||
| G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
|
|
||||||
| G2 | **HIGH** | Line 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files. |
|
|
||||||
| G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
|
|
||||||
| G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
|
|
||||||
| G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment. |
|
|
||||||
|
|
||||||
### 3.5 Docker Compose (`docker-compose.yml`)
|
|
||||||
|
|
||||||
**Issues Found:**
|
|
||||||
|
|
||||||
| # | Severity | Location | Issue |
|
|
||||||
|---|---|---|---|
|
|
||||||
| C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
|
|
||||||
| C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
|
|
||||||
| C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. On crash, `always` restarts even after manual stop, making maintenance harder. |
|
|
||||||
| C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect if a service is actually healthy, only if the container is running. |
|
|
||||||
| C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
|
|
||||||
| C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor defaults to bridge). No IPAM configuration for large deployments. |
|
|
||||||
|
|
||||||
### 3.6 Container Images
|
|
||||||
|
|
||||||
**Issues Found:**
|
|
||||||
|
|
||||||
| # | Severity | Location | Issue |
|
|
||||||
|---|---|---|---|
|
|
||||||
| I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
|
|
||||||
| I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`** unnecessary dependency for the GPU dashboard (only uses `urllib`). Adds ~300KB to the image. |
|
|
||||||
| I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`** no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
|
|
||||||
| I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
|
|
||||||
| I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 4. Smart Queuing Analysis & Recommendations
|
|
||||||
|
|
||||||
### Current State: No Smart Queuing
|
|
||||||
|
|
||||||
The queue service is a **passive storage mechanism** it stores requests but has no intelligence:
|
|
||||||
|
|
||||||
- **No load balancing** no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
|
|
||||||
- **No job prioritization** FIFO only, no priority levels
|
|
||||||
- **No backpressure** simple threshold, no exponential backoff or adaptive limits
|
|
||||||
- **No retry logic** failed GPU requests go to queue but are never reprocessed
|
|
||||||
- **No dead letter handling** stuck or failed jobs have no lifecycle management
|
|
||||||
- **No consumer** nothing dequeues and forwards to GPUs
|
|
||||||
- **No job tracking** no job IDs, no status updates, no result retrieval
|
|
||||||
|
|
||||||
### Recommended Architecture: Smart Queue with Consumer
|
|
||||||
|
|
||||||
```
|
|
||||||
Agent > Nginx > Smart Queue API > Redis Streams (with consumers)
|
|
||||||
|
|
||||||
|
|
||||||
Consumer
|
|
||||||
Pool
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
GPU 1 (load) GPU 2 (load) GPU 3 (load)
|
|
||||||
|
|
||||||
|
|
||||||
Health Health Health
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Update GPU scores
|
|
||||||
|
|
||||||
Priority Queue (sorted by urgency)
|
|
||||||
Dead Letter Queue (failed jobs)
|
|
||||||
Backpressure (adaptive rate limit)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Specific Recommendations
|
|
||||||
|
|
||||||
#### R1: Implement Redis Streams as Queue Backend
|
|
||||||
- Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
|
|
||||||
- Streams support consumer groups, message acknowledgment, and pending messages
|
|
||||||
- Enables proper dead letter queue handling and retry logic
|
|
||||||
- **File:** `queue-service/queue-service.py`
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Before: Simple list
|
|
||||||
r.rpush(QUEUE_KEY, json.dumps(job))
|
|
||||||
|
|
||||||
# After: Redis Stream with consumer group
|
|
||||||
stream_key = "inference:stream"
|
|
||||||
consumer_group = "gpu-workers"
|
|
||||||
r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approx=True)
|
|
||||||
```
|
|
||||||
|
|
||||||
#### R2: Build a Queue Consumer Pool
|
|
||||||
- Deploy 1+ consumer containers that poll the stream and forward to GPUs
|
|
||||||
- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
|
|
||||||
- **File:** New `queue-service/consumer.py`
|
|
||||||
|
|
||||||
```python
|
|
||||||
class LoadBalancedConsumer:
|
|
||||||
def select_gpu(self, job):
|
|
||||||
"""Select GPU based on load, health, and model compatibility."""
|
|
||||||
candidates = [g for g in self.gpus if g.health == "up" and not g.full]
|
|
||||||
if not candidates:
|
|
||||||
return None
|
|
||||||
# Sort by: slots_idle (descending), VRAM_available (descending)
|
|
||||||
candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
|
|
||||||
return candidates[0]
|
|
||||||
```
|
|
||||||
|
|
||||||
#### R3: Implement Priority Queuing
|
|
||||||
- Add priority field to job payload: `high`, `normal`, `low`
|
|
||||||
- Use Redis Streams with multiple stream keys per priority level
|
|
||||||
- Consumer checks `high` `normal` `low` in order
|
|
||||||
- **File:** `queue-service/queue-service.py` enqueue endpoint
|
|
||||||
|
|
||||||
#### R4: Add Backpressure Mechanism
|
|
||||||
- Instead of hard threshold at 50, implement **adaptive backpressure**:
|
|
||||||
- Queue depth 0-30: normal operation
|
|
||||||
- Queue depth 30-40: return `retry-after` header with increasing delay
|
|
||||||
- Queue depth 40-50: return 503 with exponential retry-after
|
|
||||||
- Queue depth >50: circuit breaker open
|
|
||||||
- **File:** `queue-service/queue-service.py`
|
|
||||||
|
|
||||||
#### R5: Dead Letter Queue (DLQ)
|
|
||||||
- Move failed/unprocessable jobs to a `inference:dead-letter` stream
|
|
||||||
- Include failure reason, attempt count, and original payload
|
|
||||||
- Provide admin API to inspect, retry, or discard DLQ entries
|
|
||||||
- **File:** `queue-service/queue-service.py`
|
|
||||||
|
|
||||||
```python
|
|
||||||
# New endpoint
|
|
||||||
@app.route("/dlq", methods=["GET"])
|
|
||||||
def list_dlq():
|
|
||||||
return r.xrange("inference:dead-letter")
|
|
||||||
|
|
||||||
@app.route("/dlq/retry/<message_id>", methods=["POST"])
|
|
||||||
def retry_dlq(message_id):
|
|
||||||
job = r.xget("inference:dead-letter", message_id)
|
|
||||||
r.xadd("inference:stream", {"job": job})
|
|
||||||
```
|
|
||||||
|
|
||||||
#### R6: GPU-Aware Routing
|
|
||||||
- Queue consumer should check GPU `slots_busy` before routing
|
|
||||||
- If a GPU is busy, try the next available GPU
|
|
||||||
- Track per-GPU queue depth and avoid overloading a single GPU
|
|
||||||
- **File:** New consumer logic
|
|
||||||
|
|
||||||
#### R7: Job Status API
|
|
||||||
- Add job ID generation on enqueue
|
|
||||||
- Provide `/status/<job_id>` endpoint to check progress
|
|
||||||
- Store job state in Redis: `queued` `processing` `completed`/`failed`
|
|
||||||
- **File:** `queue-service/queue-service.py`
|
|
||||||
|
|
||||||
```python
|
|
||||||
@app.route("/enqueue", methods=["POST"])
|
|
||||||
def enqueue():
|
|
||||||
job_id = str(uuid.uuid4())
|
|
||||||
job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
|
|
||||||
r.xadd(stream_key, {"job": json.dumps(job)})
|
|
||||||
r.hset("job:status", job_id, json.dumps({"status": "queued"}))
|
|
||||||
return jsonify({"job_id": job_id, "status": "queued"}), 202
|
|
||||||
|
|
||||||
@app.route("/status/<job_id>")
|
|
||||||
def job_status(job_id):
|
|
||||||
status = r.hget("job:status", job_id)
|
|
||||||
return jsonify(json.loads(status)) if status else {"error": "not found"}, 404
|
|
||||||
```
|
|
||||||
|
|
||||||
#### R8: Health-Based Circuit Breaker
|
|
||||||
- Replace simple depth threshold with **per-GPU circuit breakers**
|
|
||||||
- Track consecutive failures per GPU
|
|
||||||
- Implement half-open state: after cooldown, probe one GPU to test recovery
|
|
||||||
- **File:** `queue-service/queue-service.py`
|
|
||||||
|
|
||||||
#### R9: Centralized Configuration
|
|
||||||
- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
|
|
||||||
- Redis config key: `config:gpus`
|
|
||||||
- Or environment file mounted to all containers
|
|
||||||
- Nginx can use Lua/variable from config instead of static upstreams
|
|
||||||
- **File:** New `config/` directory or Redis-based config
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 5. Priority Issue Summary
|
|
||||||
|
|
||||||
### Critical (Fix Immediately)
|
|
||||||
1. **Q1** Queue has no consumer; enqueued requests are never processed
|
|
||||||
2. **Q4** No job ID or result retrieval mechanism
|
|
||||||
3. **N3** Queue fallback triggers on individual GPU failure, not all-down
|
|
||||||
|
|
||||||
### High (Fix Before Production)
|
|
||||||
4. **Q5** Circuit breaker has no recovery mechanism
|
|
||||||
5. **Q6** `/status` endpoint blocks on GPU health checks
|
|
||||||
6. **D1** Dashboard `/api/status` makes 4 sequential calls, up to 11s
|
|
||||||
7. **C2** `Dockerfile.queue` path mismatch in docker-compose
|
|
||||||
8. **I1** No dependency pinning in any Dockerfile
|
|
||||||
9. **I3** Multi-process CMD without supervisor in GPU dashboard
|
|
||||||
|
|
||||||
### Medium (Improve in Next Iteration)
|
|
||||||
10. **Q3** Redis host default conflicts with Docker networking
|
|
||||||
11. **Q7** Silent exception swallowing in Redis access
|
|
||||||
12. **Q8** Content-Type header dropped in queue
|
|
||||||
13. **D2** Single-threaded dashboard server
|
|
||||||
14. **D3** Three separate sources of truth for GPU addresses
|
|
||||||
15. **G1** Sequential GPU collection blocks on single failure
|
|
||||||
16. **N1** Rate limit burst of 20 nodelay defeats protection
|
|
||||||
17. **N5** Hardcoded container names in Nginx
|
|
||||||
18. **C1** Queue API exposed externally
|
|
||||||
19. **C4** No Docker health checks
|
|
||||||
|
|
||||||
### Low (Nice to Have)
|
|
||||||
20. **Q9** No graceful shutdown
|
|
||||||
21. **C3** `restart: always` vs `unless-stopped`
|
|
||||||
22. **C5** No Redis authentication
|
|
||||||
23. **G4** 60s dead threshold is too aggressive
|
|
||||||
24. **I2** Unnecessary `requests` dependency
|
|
||||||
25. **I4** No `.dockerignore`
|
|
||||||
26. **I5** Duplicate Dockerfiles
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 6. Deployment Architecture Summary
|
|
||||||
|
|
||||||
### What Works Well
|
|
||||||
- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
|
|
||||||
- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
|
|
||||||
- SSE streaming support in Nginx for LLM response streaming
|
|
||||||
- Rate limiting at the gateway layer
|
|
||||||
- Circuit breaker pattern implemented (even if basic)
|
|
||||||
|
|
||||||
### What Needs Work
|
|
||||||
- **Queue is incomplete** storage without processing is the most critical gap
|
|
||||||
- **No job lifecycle** requests go in and never come out
|
|
||||||
- **Duplicated configuration** GPU addresses in 3+ places
|
|
||||||
- **No monitoring/alerting** no Prometheus metrics, no alerting rules
|
|
||||||
- **Single point of failure** no Redis replication, no container redundancy
|
|
||||||
- **No logging** Flask dev server logs are minimal; no structured logging
|
|
||||||
|
|
||||||
### Recommended Next Steps
|
|
||||||
1. **Priority 1:** Implement queue consumer with GPU load-based routing
|
|
||||||
2. **Priority 2:** Add job status tracking and result retrieval
|
|
||||||
3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
|
|
||||||
4. **Priority 4:** Add Docker health checks and proper dependency management
|
|
||||||
5. **Priority 5:** Centralize GPU configuration in Redis or environment
|
|
||||||
6. **Priority 6:** Add Prometheus metrics endpoint for observability
|
|
||||||
@@ -1,5 +0,0 @@
|
|||||||
FROM python:3.11-slim
|
|
||||||
WORKDIR /app
|
|
||||||
COPY dashboard/harness-dashboard.py .
|
|
||||||
EXPOSE 3001
|
|
||||||
CMD ["python3", "harness-dashboard.py"]
|
|
||||||
|
|||||||
@@ -1,14 +0,0 @@
|
|||||||
FROM python:3.11-slim
|
|
||||||
|
|
||||||
RUN pip install requests
|
|
||||||
|
|
||||||
COPY gpu-dashboard/ /app/
|
|
||||||
WORKDIR /app
|
|
||||||
|
|
||||||
RUN mkdir -p /app/public && \
|
|
||||||
cp gpu.html /app/public/ && \
|
|
||||||
touch /app/public/gpu_metrics.json
|
|
||||||
|
|
||||||
EXPOSE 8092
|
|
||||||
|
|
||||||
CMD ["sh", "-c", "python3 gpu_collector.py & python3 -m http.server 8092 --directory /app/public & wait"]
|
|
||||||
@@ -1,63 +1,75 @@
|
|||||||
# Syslog Harness
|
# syslog-harness — Inference API Harness
|
||||||
|
|
||||||
Operational orchestration layer for Syslog's internal AI agents.
|
CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
```
|
```
|
||||||
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
|
nginx :80 → router :9000 → GPU backends
|
||||||
│ Agent │────>│ Nginx │────>│ GPU Pool │
|
├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080 [2 slots]
|
||||||
│ (Hermes) │ │ Router │ │ (MoE/Dense)│
|
├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080 [2 slots]
|
||||||
└─────────────┘ └──────────────┘ └─────────────┘
|
└─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080 [2 slots]
|
||||||
│
|
Total: 6 concurrent slots
|
||||||
├──> :8091 Queue Service (Docker)
|
|
||||||
│
|
LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
|
||||||
└──> :3001 Dashboard (Docker)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Components
|
## Deploy
|
||||||
|
|
||||||
| Service | Port | Container | Purpose |
|
|
||||||
|---|---|---|---|
|
|
||||||
| Nginx Router | 8080 | Host | Routes requests to GPU backends |
|
|
||||||
| Queue Service | 8091 | `syslog-queue` | Enqueues requests when GPUs are down |
|
|
||||||
| Dashboard | 3001 | `syslog-dashboard` | Observability UI + API |
|
|
||||||
|
|
||||||
## GPU Routing
|
|
||||||
|
|
||||||
| Header `X-Syslog-Model` | Backend | Model |
|
|
||||||
|---|---|---|
|
|
||||||
| (none) / `standard` | amdpve (.15) | qwen3.6-35B-A3B (MoE) |
|
|
||||||
| `heavy` / `qwen3.5-27B` | llmgpu (.8) | qwen3.5-27B (Dense) |
|
|
||||||
| `light` / `gemma-4` | ocu_llm (.110) | gemma-4-E4B (Light) |
|
|
||||||
|
|
||||||
## Quick Start
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Build & start
|
cd /opt/inference-harness
|
||||||
docker compose build
|
|
||||||
docker compose up -d
|
docker compose up -d
|
||||||
|
|
||||||
# Verify
|
|
||||||
curl http://localhost:8091/health
|
|
||||||
curl http://localhost:3001/api/status
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Dashboard
|
## Endpoints
|
||||||
|
|
||||||
- **UI:** `http://<host>:8080/dashboard/harness.html`
|
| URL | Purpose |
|
||||||
- **API:** `http://<host>:8080/dashboard/api/status`
|
|-----|---------|
|
||||||
|
| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
|
||||||
|
| `/v1/models` | Available models |
|
||||||
|
| `/` | Dashboard (GPU health, routing, agents, timeseries) |
|
||||||
|
|
||||||
## Circuit Breaker
|
## Authentication
|
||||||
|
|
||||||
- Rate limit: 10 req/s per IP
|
**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <key>`. Missing or invalid keys return **401 Unauthorized**.
|
||||||
- Burst: 20 requests
|
|
||||||
- Excess returns 503
|
|
||||||
- Queue fallback on GPU 502/503
|
|
||||||
|
|
||||||
## Production Migration
|
## Agent API Keys
|
||||||
|
|
||||||
See [MIGRATION_PLAN.md](./MIGRATION_PLAN.md)
|
| Agent | Key |
|
||||||
|
|-------|-----|
|
||||||
|
| Abiba | `sk-syslog-abiba` |
|
||||||
|
| Mumuni | `sk-syslog-mumuni` |
|
||||||
|
| Tanko | `sk-syslog-tanko` |
|
||||||
|
| Koby | `sk-syslog-koby` |
|
||||||
|
| Kagenz0 | `sk-syslog-kagenz0` |
|
||||||
|
| Koonimo | `sk-syslog-koonimo` |
|
||||||
|
|
||||||
---
|
## Routing Tiers
|
||||||
*Built for Syslog Solution LLC — Quality over speed.*
|
|
||||||
|
| Tier | Trigger | Priority |
|
||||||
|
|------|---------|----------|
|
||||||
|
| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
|
||||||
|
| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
|
||||||
|
| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
|
||||||
|
| Default | Everything else | MoE → VLM → Dense |
|
||||||
|
|
||||||
|
## Queue
|
||||||
|
|
||||||
|
When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
|
||||||
|
|
||||||
|
## Models
|
||||||
|
|
||||||
|
| GPU | Model | VRAM | Slots |
|
||||||
|
|-----|-------|------|-------|
|
||||||
|
| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
|
||||||
|
| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
|
||||||
|
| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
|
||||||
|
|
||||||
|
## Maintenance
|
||||||
|
|
||||||
|
Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
|
||||||
|
- Cleans Redis timeseries keys >60 days
|
||||||
|
- Prunes Docker build cache >7 days
|
||||||
|
- Logs container health and Redis memory
|
||||||
|
|
||||||
|
Logs: `/var/log/harness-maintenance.log`
|
||||||
|
|||||||
File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
@@ -1,8 +1,7 @@
|
|||||||
FROM python:3.13-slim
|
FROM python:3.12-slim
|
||||||
|
|
||||||
COPY harness-dashboard.py /app/harness-dashboard.py
|
|
||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
|
COPY requirements.txt .
|
||||||
EXPOSE 3001
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
COPY dashboard.py .
|
||||||
CMD ["python3", "harness-dashboard.py"]
|
EXPOSE 3000
|
||||||
|
CMD ["python", "dashboard.py"]
|
||||||
|
|||||||
@@ -1,5 +0,0 @@
|
|||||||
FROM python:3.11-slim
|
|
||||||
WORKDIR /app
|
|
||||||
COPY harness-dashboard.py .
|
|
||||||
EXPOSE 3001
|
|
||||||
CMD ["python3", "harness-dashboard.py"]
|
|
||||||
@@ -0,0 +1,232 @@
|
|||||||
|
"""SyslogAI Harness Dashboard — Modern Design."""
|
||||||
|
import os, json, time, queue, threading
|
||||||
|
import requests
|
||||||
|
from flask import Flask, request, render_template_string, Response, stream_with_context
|
||||||
|
|
||||||
|
ROUTER_METRICS = os.environ.get("ROUTER_METRICS_URL", "http://router:9000/metrics")
|
||||||
|
app = Flask(__name__)
|
||||||
|
sse_subscribers = []; sse_lock = threading.Lock()
|
||||||
|
|
||||||
|
def fetch_state():
|
||||||
|
try:
|
||||||
|
r = requests.get(ROUTER_METRICS, timeout=5)
|
||||||
|
if r.status_code == 200: return r.json()
|
||||||
|
except Exception: pass
|
||||||
|
return {"gpus":[],"route_counts":{},"agent_counts":{},"recent":[],"timestamp":time.time()}
|
||||||
|
|
||||||
|
def broadcast_loop():
|
||||||
|
while True:
|
||||||
|
time.sleep(3)
|
||||||
|
data = fetch_state(); payload = json.dumps(data)
|
||||||
|
with sse_lock:
|
||||||
|
dead = [q for q in sse_subscribers if not q.put(payload)]
|
||||||
|
for q in dead: sse_subscribers.remove(q)
|
||||||
|
threading.Thread(target=broadcast_loop, daemon=True).start()
|
||||||
|
|
||||||
|
DASHBOARD_HTML = r"""<!DOCTYPE html>
|
||||||
|
<html lang="en" data-bs-theme="dark">
|
||||||
|
<head>
|
||||||
|
<meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||||
|
<title>SyslogAI Harness</title>
|
||||||
|
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet">
|
||||||
|
<style>
|
||||||
|
body { background: #0b0f17; color: #bcc3cd; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif; padding: 20px 24px; }
|
||||||
|
.card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; height: 100%; }
|
||||||
|
.stat-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 18px 20px; text-align: center; }
|
||||||
|
.stat-value { font-size: 28px; font-weight: 700; line-height: 1.1; }
|
||||||
|
.stat-label { font-size: 11px; text-transform: uppercase; letter-spacing: 0.6px; color: #64748b; margin-top: 4px; }
|
||||||
|
.gpu-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 16px 18px; height: 100%; }
|
||||||
|
.gpu-card .title { font-size: 13px; font-weight: 600; color: #e2e8f0; margin-bottom: 12px; display: flex; align-items: center; gap: 8px; }
|
||||||
|
.gpu-card .status-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
|
||||||
|
.gpu-card .row-metric { display: flex; justify-content: space-between; font-size: 12px; padding: 2px 0; }
|
||||||
|
.gpu-card .row-metric .lbl { color: #64748b; }
|
||||||
|
.gpu-card .row-metric .val { color: #e2e8f0; font-variant-numeric: tabular-nums; }
|
||||||
|
.gpu-card .slot-bar { display: flex; gap: 3px; margin-top: 8px; }
|
||||||
|
.gpu-card .slot-bar .s { flex: 1; height: 5px; border-radius: 2px; background: #1e293b; }
|
||||||
|
.gpu-card .slot-bar .s.active { background: #38bdf8; }
|
||||||
|
.chart-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 16px 18px; height: 100%; display: flex; flex-direction: column; }
|
||||||
|
.chart-card .title { font-size: 13px; font-weight: 600; color: #e2e8f0; margin-bottom: 12px; }
|
||||||
|
.bar-row { margin-bottom: 8px; }
|
||||||
|
.bar-label { display: flex; justify-content: space-between; font-size: 11px; margin-bottom: 3px; color: #64748b; }
|
||||||
|
.bar-label .name { color: #cbd5e1; }
|
||||||
|
.bar-track { height: 5px; background: #1e293b; border-radius: 3px; overflow: hidden; }
|
||||||
|
.bar-fill { height: 100%; border-radius: 3px; transition: width 0.6s ease; }
|
||||||
|
.table-custom { font-size: 11px; margin: 0; }
|
||||||
|
.table-custom th { color: #64748b; font-weight: 500; font-size: 10px; text-transform: uppercase; border-color: #1e293b; padding: 8px 10px; }
|
||||||
|
.table-custom td { color: #94a3b8; border-color: rgba(30,41,59,0.5); padding: 6px 10px; }
|
||||||
|
.agent-badge { font-size: 10px; padding: 2px 7px; border-radius: 8px; font-weight: 600; }
|
||||||
|
.btn-sm-period { font-size: 10px; padding: 3px 10px; border-radius: 6px; border: 1px solid #1e293b; color: #64748b; background: transparent; cursor: pointer; }
|
||||||
|
.btn-sm-period.active { background: #1d4ed8; color: #fff; border-color: #1d4ed8; }
|
||||||
|
.ring-label { font-size: 22px; font-weight: 700; }
|
||||||
|
.ring-sublabel { font-size: 10px; color: #64748b; }
|
||||||
|
</style>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
|
||||||
|
<!-- HEADER -->
|
||||||
|
<div class="d-flex justify-content-between align-items-center mb-4">
|
||||||
|
<div>
|
||||||
|
<h5 class="mb-0 text-white fw-bold">⚡ SyslogAI Harness</h5>
|
||||||
|
<div class="small text-secondary" id="live-indicator">
|
||||||
|
<span class="status-dot" id="live-dot" style="width:6px;height:6px;border-radius:50%;display:inline-block;background:#22c55e;animation:pulse 2s infinite"></span>
|
||||||
|
<span id="connection-status">live</span> · <span id="update-time"></span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div class="d-flex gap-2">
|
||||||
|
<div class="stat-card" style="min-width:100px"><div class="stat-value text-info" id="kpi-total">0</div><div class="stat-label">Requests</div></div>
|
||||||
|
<div class="stat-card" style="min-width:100px"><div class="stat-value text-warning" id="kpi-active">0</div><div class="stat-label">Active</div></div>
|
||||||
|
<div class="stat-card" style="min-width:100px"><div class="stat-value" style="color:#a78bfa" id="kpi-agents">0</div><div class="stat-label">Agents</div></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="row g-3 align-items-stretch">
|
||||||
|
<!-- ROW 1: Usage Chart (8) + GPU Metrics (4) -->
|
||||||
|
<div class="col-md-8"><div class="chart-card"><div class="title d-flex justify-content-between align-items-center">
|
||||||
|
<span>Usage Over Time</span>
|
||||||
|
<div class="d-flex gap-1">
|
||||||
|
<button class="btn-sm-period active" onclick="switchPeriod('day')">24h</button>
|
||||||
|
<button class="btn-sm-period" onclick="switchPeriod('week')">7d</button>
|
||||||
|
<button class="btn-sm-period" onclick="switchPeriod('month')">30d</button>
|
||||||
|
</div>
|
||||||
|
</div><div id="timeseries-chart" style="height:150px"></div><div id="timeseries-legend" class="d-flex justify-content-center gap-3 mt-2 flex-wrap small"></div></div></div>
|
||||||
|
<div class="col-md-4"><div class="chart-card"><div class="title">GPU Metrics</div><div id="gpu-metrics-card"></div></div></div>
|
||||||
|
|
||||||
|
<!-- ROW 2: 3 GPU Cards -->
|
||||||
|
<div class="col-md-4"><div class="gpu-card" id="gpu-moe"><div class="text-secondary small">Loading...</div></div></div>
|
||||||
|
<div class="col-md-4"><div class="gpu-card" id="gpu-dense"><div class="text-secondary small">Loading...</div></div></div>
|
||||||
|
<div class="col-md-4"><div class="gpu-card" id="gpu-light"><div class="text-secondary small">Loading...</div></div></div>
|
||||||
|
|
||||||
|
<!-- ROW 3: Queue + Model + Agent -->
|
||||||
|
<div class="col-md-4"><div class="chart-card"><div class="title">Queue Status</div><div class="text-center" id="queue-viz"></div></div></div>
|
||||||
|
<div class="col-md-4"><div class="chart-card"><div class="title">Model Distribution</div><div id="route-bars"></div></div></div>
|
||||||
|
<div class="col-md-4"><div class="chart-card"><div class="title">Agent Activity</div><div id="agent-bars"></div></div></div>
|
||||||
|
|
||||||
|
<!-- ROW 4: Live Stream -->
|
||||||
|
<div class="col-12"><div class="chart-card"><div class="title">Live Stream</div>
|
||||||
|
<div class="table-responsive"><table class="table table-custom mb-0">
|
||||||
|
<thead><tr><th>Time</th><th>Agent</th><th>Model</th><th>Reason</th><th>Tier</th></tr></thead>
|
||||||
|
<tbody id="route-tbody"></tbody>
|
||||||
|
</table></div>
|
||||||
|
</div></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<script>
|
||||||
|
var MC={'qwen3.5-9b-vlm':'#22c55e','qwen3.6-27B-code':'#f59e0b','qwen3.6-35B-A3B':'#a78bfa'};
|
||||||
|
var ML={'qwen3.5-9b-vlm':'Qwen3.5 9B VLM','qwen3.6-27B-code':'Qwen Code','qwen3.6-35B-A3B':'Qwen MoE'};
|
||||||
|
var GL={'qwen3.6-35B-A3B':'MoE - Strix Halo','qwen3.6-27B-code':'Dense - RTX 3090','qwen3.5-9b-vlm':'VLM - RTX 5070'};
|
||||||
|
function $(id){return document.getElementById(id);}
|
||||||
|
|
||||||
|
function render(data){
|
||||||
|
if(!data||!data.gpus)return;
|
||||||
|
var t=Object.values(data.route_counts||{}).reduce((a,b)=>a+b,0);
|
||||||
|
var ta=0,tm=0;data.gpus.forEach(function(g){ta+=(g.active_requests||0);tm+=(g.max_concurrent||1)});
|
||||||
|
$('kpi-total').textContent=t;$('kpi-active').textContent=ta+'/'+tm;$('kpi-agents').textContent=Object.keys(data.agent_counts||{}).length;
|
||||||
|
$('update-time').textContent=new Date().toLocaleTimeString();
|
||||||
|
var ids={'qwen3.6-35B-A3B':'gpu-moe','qwen3.6-27B-code':'gpu-dense','qwen3.5-9b-vlm':'gpu-light'};
|
||||||
|
data.gpus.forEach(function(g){
|
||||||
|
var el=$(ids[g.id]);if(!el)return;
|
||||||
|
var a=g.active_requests||0,mx=g.max_concurrent||1;
|
||||||
|
var sc=g.status==='healthy'?'#22c55e':g.status==='saturated'?'#f59e0b':'#ef4444';
|
||||||
|
var ss=g.status==='healthy'?'Online':g.status==='saturated'?'Busy':'Offline';
|
||||||
|
var slots='';for(var i=0;i<mx;i++)slots+='<span class=\"s'+(i<a?' active':'')+'\"></span>';
|
||||||
|
var h='<div class=\"title\"><span class=\"status-dot\" style=\"background:'+sc+'\"></span>'+GL[g.id]+'<span class=\"ms-auto small\" style=\"color:'+sc+'\">'+ss+'</span></div>';
|
||||||
|
h+='<div class=\"row-metric\"><span class=\"lbl\">VRAM</span><span class=\"val\">'+g.vram_used_mb+' / '+g.vram_total_mb+' MB</span></div>';
|
||||||
|
h+='<div class=\"row-metric\"><span class=\"lbl\">Utilization</span><span class=\"val\">'+g.gpu_util_pct+'%</span></div>';
|
||||||
|
h+='<div class=\"row-metric\"><span class=\"lbl\">Temperature</span><span class=\"val\" style=\"color:'+(g.temp_c>85?'#ef4444':g.temp_c>70?'#f59e0b':'#22c55e')+'\">'+g.temp_c+'C</span></div>';
|
||||||
|
if(g.power_w)h+='<div class=\"row-metric\"><span class=\"lbl\">Power</span><span class=\"val\">'+g.power_w+'W'+(g.power_limit_w?'/'+g.power_limit_w+'W':'')+'</span></div>';
|
||||||
|
h+='<div class=\"row-metric\"><span class=\"lbl\">Slots</span><span class=\"val\" style=\"color:'+(a>=mx?'#ef4444':'#e2e8f0')+'\">'+a+' / '+mx+'</span></div>';
|
||||||
|
h+='<div class=\"slot-bar\">'+slots+'</div>';el.innerHTML=h;
|
||||||
|
});
|
||||||
|
renderQueue(data);renderGPUMetrics(data);
|
||||||
|
var rc=data.route_counts||{},mr=Math.max(1,...Object.values(rc));
|
||||||
|
$('route-bars').innerHTML=Object.entries(rc).length?Object.entries(rc).sort((a,b)=>b[1]-a[1]).map(function(e){var m=e[0],c=e[1];return'<div class=\"bar-row\"><div class=\"bar-label\"><span class=\"name\">'+(ML[m]||m)+'</span><span>'+c+' ('+(t?Math.round(c/t*100):0)+'%)</span></div><div class=\"bar-track\"><div class=\"bar-fill\" style=\"width:'+(c/mr*100)+'%;background:'+(MC[m]||'#38bdf8')+'\"></div></div></div>';}).join(''):'<div class=\"text-secondary small\">-</div>';
|
||||||
|
var ac=data.agent_counts||{},ma=Math.max(1,...Object.values(ac));
|
||||||
|
$('agent-bars').innerHTML=Object.entries(ac).length?Object.entries(ac).sort((a,b)=>b[1]-a[1]).map(function(e){return'<div class=\"bar-row\"><div class=\"bar-label\"><span class=\"name\">'+e[0]+'</span><span>'+e[1]+'</span></div><div class=\"bar-track\"><div class=\"bar-fill\" style=\"width:'+(e[1]/ma*100)+'%;background:#38bdf8\"></div></div></div>';}).join(''):'<div class=\"text-secondary small\">-</div>';
|
||||||
|
var recent=data.recent||[];
|
||||||
|
$('route-tbody').innerHTML=recent.length?recent.slice(0,20).map(function(r){var d=new Date(r.ts*1000),ag=r.agent||'?';return'<tr><td class=\"text-secondary\">'+d.toLocaleTimeString()+'</td><td><span class=\"agent-badge\" style=\"background:rgba(56,189,248,0.12);color:#38bdf8\">'+ag+'</span></td><td>'+(ML[r.model]||r.model)+'</td><td class=\"text-secondary\">'+(r.reason||'')+'</td><td class=\"text-uppercase\" style=\"font-size:10px;color:'+(r.tier==='enterprise'?'#a78bfa':'#64748b')+'\">'+(r.tier||'')+'</td></tr>';}).join(''):'<tr><td colspan=\"5\" class=\"text-secondary\">Waiting...</td></tr>';
|
||||||
|
}
|
||||||
|
|
||||||
|
function renderQueue(data){
|
||||||
|
var el=$('queue-viz');if(!el)return;
|
||||||
|
var ta=0,tm=0;data.gpus.forEach(function(g){ta+=(g.active_requests||0);tm+=(g.max_concurrent||1)});
|
||||||
|
var pct=tm>0?Math.round(ta/tm*100):0,st=pct>=100?'SATURATED':pct>=50?'BUSY':'IDLE';
|
||||||
|
var sc=pct>=100?'#ef4444':pct>=50?'#f59e0b':'#22c55e';
|
||||||
|
var circ=188.5,dash=(pct/100)*circ;
|
||||||
|
var h='<div class=\"d-inline-block position-relative mb-2\"><svg width=\"72\" height=\"72\"><circle cx=\"36\" cy=\"36\" r=\"30\" fill=\"none\" stroke=\"#1e293b\" stroke-width=\"6\"/><circle cx=\"36\" cy=\"36\" r=\"30\" fill=\"none\" stroke=\"'+sc+'\" stroke-width=\"6\" stroke-dasharray=\"'+dash+' '+(circ-dash)+'\" stroke-linecap=\"round\" transform=\"rotate(-90 36 36)\"/></svg><div style=\"position:absolute;top:50%;left:50%;transform:translate(-50%,-50%);text-align:center\"><div class=\"ring-label\" style=\"color:'+sc+'\">'+ta+'</div><div class=\"ring-sublabel\">/ '+tm+' slots</div></div></div>';
|
||||||
|
h+='<div class=\"fw-bold mb-2 small\" style=\"color:'+sc+'\">'+st+'</div>';
|
||||||
|
var lb={'qwen3.6-35B-A3B':'MoE','qwen3.6-27B-code':'Dense','qwen3.5-9b-vlm':'VLM'};
|
||||||
|
data.gpus.forEach(function(g){var a=g.active_requests||0,mx=g.max_concurrent||1,gp=mx>0?Math.round(a/mx*100):0;h+='<div class=\"d-flex align-items-center gap-2 mb-1 justify-content-center\"><span class=\"small\" style=\"min-width:32px;text-align:right;font-size:10px\">'+(lb[g.id]||g.id)+'</span><div style=\"flex:1;max-width:70px;height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+gp+'%;background:'+sc+';border-radius:2px\"></div></div><span class=\"small\" style=\"min-width:22px;font-size:10px\">'+a+'/'+mx+'</span></div>'});
|
||||||
|
el.innerHTML=h;
|
||||||
|
}
|
||||||
|
|
||||||
|
function renderGPUMetrics(data){
|
||||||
|
var el=$('gpu-metrics-card');if(!el)return;
|
||||||
|
var lb={'qwen3.6-35B-A3B':'MoE','qwen3.6-27B-code':'Dense','qwen3.5-9b-vlm':'VLM'};
|
||||||
|
var h='';data.gpus.forEach(function(g){
|
||||||
|
var nm=lb[g.id]||g.id,tp=g.temp_c||0,ut=g.gpu_util_pct||0,pw=g.power_w||0,pl=g.power_limit_w||0;
|
||||||
|
var tc=tp>85?'#ef4444':tp>70?'#f59e0b':'#22c55e',uc=ut>90?'#ef4444':ut>70?'#f59e0b':'#22c55e';
|
||||||
|
h+='<div class=\"mb-3\"><div class=\"fw-bold small text-white-50 mb-1\">'+nm+'</div>';
|
||||||
|
h+='<div class=\"d-flex align-items-center gap-2 mb-1\"><span class=\"small text-secondary\" style=\"min-width:30px\">T</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+Math.min(tp,100)+'%;background:'+tc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+tc+';min-width:30px;text-align:right\">'+tp+'C</span></div>';
|
||||||
|
h+='<div class=\"d-flex align-items-center gap-2 mb-1\"><span class=\"small text-secondary\" style=\"min-width:30px\">U</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+ut+'%;background:'+uc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+uc+';min-width:30px;text-align:right\">'+ut+'%</span></div>';
|
||||||
|
if(pw>0){var pp=pl>0?Math.round(pw/pl*100):0,pc=pp>90?'#ef4444':pp>70?'#f59e0b':'#22c55e';h+='<div class=\"d-flex align-items-center gap-2\"><span class=\"small text-secondary\" style=\"min-width:30px\">P</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+pp+'%;background:'+pc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+pc+';min-width:30px;text-align:right\">'+pw+'W</span></div>';}
|
||||||
|
h+='</div>';});
|
||||||
|
el.innerHTML=h;
|
||||||
|
}
|
||||||
|
|
||||||
|
var cp='day';
|
||||||
|
function switchPeriod(p){cp=p;document.querySelectorAll('.btn-sm-period').forEach(function(b){b.classList.remove('active')});event.target.classList.add('active');loadTS();}
|
||||||
|
function loadTS(){fetch('/api/timeseries?period='+cp).then(function(r){return r.json()}).then(renderTS).catch(function(){})}
|
||||||
|
function renderTS(d){
|
||||||
|
var models=d.models||{},labels=d.labels||[];
|
||||||
|
if(!labels.length)return;
|
||||||
|
var cn=$('timeseries-chart'),lg=$('timeseries-legend'),mn=Object.keys(models);
|
||||||
|
if(!mn.length){cn.innerHTML='<div class=\"text-secondary small text-center py-4\">-</div>';return;}
|
||||||
|
var mv=1;for(var m in models)for(var i=0;i<models[m].length;i++)if(models[m][i]>mv)mv=models[m][i];mv=Math.ceil(mv*1.15)||1;
|
||||||
|
var W=labels.length>1?100/(labels.length-1):100,H=130;
|
||||||
|
var paths='';for(var mi=0;mi<mn.length;mi++){var m=mn[mi],vals=models[m]||[],d='';for(var i=0;i<vals.length;i++){var x=i*W,y=H-(vals[i]/mv)*H;d+=(i===0?'M':'L')+x.toFixed(1)+','+y.toFixed(1)+' ';}paths+='<path d=\"'+d+'\" fill=\"none\" stroke=\"'+(MC[m]||'#38bdf8')+'\" stroke-width=\"2\" stroke-linecap=\"round\" opacity=\"0.8\"/>';}
|
||||||
|
var grid='';for(var g=0;g<=4;g++){var y=(g/4)*H;grid+='<line x1=\"0\" y1=\"'+y.toFixed(1)+'\" x2=\"100\" y2=\"'+y.toFixed(1)+'\" stroke=\"#1e293b\" stroke-width=\"1\"/>';}
|
||||||
|
cn.innerHTML='<svg viewBox=\"0 0 100 '+(H+16)+'\" style=\"width:100%;height:'+(H+20)+'px;display:block\" preserveAspectRatio=\"none\">'+grid+paths+'</svg>';
|
||||||
|
lg.innerHTML=mn.map(function(m){return'<span class=\"d-flex align-items-center gap-1\"><svg width=\"14\" height=\"8\"><line x1=\"0\" y1=\"4\" x2=\"14\" y2=\"4\" stroke=\"'+(MC[m]||'#38bdf8')+'\" stroke-width=\"2\"/></svg>'+(ML[m]||m)+'</span>';}).join('');
|
||||||
|
}
|
||||||
|
function poll(){fetch('/api/state').then(function(r){return r.json()}).then(function(data){render(data);$('connection-status').textContent='live';}).catch(function(){$('connection-status').textContent='reconnecting';});}
|
||||||
|
poll();setInterval(poll,3000);loadTS();
|
||||||
|
</script>
|
||||||
|
</body>
|
||||||
|
</html>"""
|
||||||
|
|
||||||
|
@app.route("/")
|
||||||
|
def dashboard(): return render_template_string(DASHBOARD_HTML)
|
||||||
|
|
||||||
|
@app.route("/api/state")
|
||||||
|
def api_state(): return fetch_state()
|
||||||
|
|
||||||
|
@app.route("/api/timeseries")
|
||||||
|
def api_timeseries():
|
||||||
|
period = request.args.get("period", "day")
|
||||||
|
try:
|
||||||
|
r = requests.get("http://router:9000/metrics/timeseries?period=" + period, timeout=5)
|
||||||
|
if r.status_code == 200: return r.json()
|
||||||
|
except Exception: pass
|
||||||
|
return {"models": {}, "labels": []}
|
||||||
|
|
||||||
|
@app.route("/api/stream")
|
||||||
|
def api_stream():
|
||||||
|
def ev():
|
||||||
|
q = queue.Queue()
|
||||||
|
with sse_lock: sse_subscribers.append(q)
|
||||||
|
try:
|
||||||
|
yield "data: "+json.dumps(fetch_state())+"\n\n"
|
||||||
|
while True:
|
||||||
|
try: msg = q.get(timeout=3); yield "data: "+msg+"\n\n"
|
||||||
|
except queue.Empty: yield "data: "+json.dumps(fetch_state())+"\n\n"
|
||||||
|
except GeneratorExit: pass
|
||||||
|
finally:
|
||||||
|
with sse_lock:
|
||||||
|
if q in sse_subscribers: sse_subscribers.remove(q)
|
||||||
|
return Response(stream_with_context(ev()), mimetype="text/event-stream", headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
|
||||||
|
|
||||||
|
@app.route("/health")
|
||||||
|
def health(): return {"status":"healthy","service":"harness-dashboard"}
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
app.run(host="0.0.0.0", port=3000, debug=False)
|
||||||
@@ -1,133 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""Syslog Harness Dashboard — Simple HTTP server exposing GPU health + metrics."""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import os
|
|
||||||
import time
|
|
||||||
import urllib.request
|
|
||||||
from http.server import HTTPServer, SimpleHTTPRequestHandler
|
|
||||||
from datetime import datetime
|
|
||||||
|
|
||||||
GPUS = {
|
|
||||||
"amdpve": {"endpoint": os.getenv("AMDVE_EP", "192.168.68.15:8080"), "model": "qwen3.6-35B-A3B (MoE)", "vram": "65GB"},
|
|
||||||
"llmgpu": {"endpoint": os.getenv("LLMGPU_EP", "192.168.68.8:8080"), "model": "qwen3.5-27B (Dense)", "vram": "24GB"},
|
|
||||||
"ocu_llm": {"endpoint": os.getenv("OCU_LLM_EP", "192.168.68.110:8080"), "model": "gemma-4-E4B (Light)", "vram": "12GB"},
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def check_gpu(name, info):
|
|
||||||
try:
|
|
||||||
start = time.time()
|
|
||||||
# Use simple HTTP GET to check if the GPU endpoint is alive
|
|
||||||
resp = urllib.request.urlopen(f"http://{info['endpoint']}/", timeout=3)
|
|
||||||
latency = (time.time() - start) * 1000
|
|
||||||
return {
|
|
||||||
"status": "up",
|
|
||||||
"latency_ms": round(latency, 1),
|
|
||||||
"model": info["model"],
|
|
||||||
"vram": info["vram"],
|
|
||||||
}
|
|
||||||
except Exception as e:
|
|
||||||
return {"status": "down", "error": str(e)[:50], "model": info["model"], "vram": info["vram"]}
|
|
||||||
|
|
||||||
|
|
||||||
def get_queue_status():
|
|
||||||
try:
|
|
||||||
req = urllib.request.Request("http://queue-service:8091/status")
|
|
||||||
resp = urllib.request.urlopen(req, timeout=2)
|
|
||||||
return json.loads(resp.read())
|
|
||||||
except Exception:
|
|
||||||
return {"queue_depth": -1, "circuit_breaker": "unknown", "gpu_health": {}}
|
|
||||||
|
|
||||||
|
|
||||||
DASHBOARD_HTML = """
|
|
||||||
<!DOCTYPE html>
|
|
||||||
<html><head><meta charset="utf-8"><title>🦅 Syslog Harness</title>
|
|
||||||
<style>
|
|
||||||
body { background: #1a1a2e; color: #e0e0e0; font-family: monospace; margin: 0; padding: 20px; }
|
|
||||||
.card { background: #16213e; border-radius: 8px; padding: 16px; margin: 10px 0; border-left: 4px solid #0f3460; }
|
|
||||||
.up { border-left-color: #00d26a; } .down { border-left-color: #ff4757; }
|
|
||||||
.warn { border-left-color: #ffa502; }
|
|
||||||
h1 { color: #00d26a; font-size: 24px; } h2 { color: #0f3460; font-size: 16px; }
|
|
||||||
.metric { display: inline-block; margin: 4px 12px; }
|
|
||||||
.value { font-weight: bold; color: #00d26a; }
|
|
||||||
#refresh { position: fixed; top: 10px; right: 10px; background: #0f3460; color: white;
|
|
||||||
border: none; padding: 8px 16px; border-radius: 4px; cursor: pointer; }
|
|
||||||
table { width: 100%; border-collapse: collapse; margin: 10px 0; }
|
|
||||||
th, td { text-align: left; padding: 8px; border-bottom: 1px solid #0f3460; }
|
|
||||||
th { color: #00d26a; }
|
|
||||||
</style></head><body>
|
|
||||||
<button id="refresh" onclick="location.reload()">↻ Refresh</button>
|
|
||||||
<h1>🦅 Syslog Harness Dashboard</h1>
|
|
||||||
<h2>Updated: <span id="ts"></span></h2>
|
|
||||||
|
|
||||||
<div class="card" id="queue-card">
|
|
||||||
<h2>Queue & Circuit Breaker</h2>
|
|
||||||
<div class="metric">Depth: <span class="value" id="depth">--</span></div>
|
|
||||||
<div class="metric">Circuit: <span class="value" id="circuit">--</span></div>
|
|
||||||
<div class="metric">Threshold: <span class="value" id="threshold">--</span></div>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<div class="card">
|
|
||||||
<h2>GPU Endpoints</h2>
|
|
||||||
<table><tr><th>GPU</th><th>Model</th><th>VRAM</th><th>Status</th><th>Latency</th></tr>
|
|
||||||
<tbody id="gpu-table"></tbody></table>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<script>
|
|
||||||
document.getElementById('ts').textContent = new Date().toISOString();
|
|
||||||
fetch('/api/status').then(r => r.json()).then(data => {
|
|
||||||
document.getElementById('depth').textContent = data.queue_depth;
|
|
||||||
document.getElementById('circuit').textContent = data.circuit_breaker;
|
|
||||||
document.getElementById('threshold').textContent = 'warn:' + data.thresholds.warn + ' / open:' + data.thresholds.open;
|
|
||||||
const card = document.getElementById('queue-card');
|
|
||||||
if (data.circuit_breaker === 'open') card.className = 'card warn';
|
|
||||||
else if (data.circuit_breaker === 'warn') card.className = 'card warn';
|
|
||||||
else card.className = 'card up';
|
|
||||||
let html = '';
|
|
||||||
for (const [name, gpu] of Object.entries(data.gpu_health)) {
|
|
||||||
const status = gpu.status === 'up' ? '✅' : '❌';
|
|
||||||
const latency = gpu.status === 'up' ? gpu.latency_ms + 'ms' : gpu.error;
|
|
||||||
const rowClass = gpu.status === 'up' ? '' : 'down';
|
|
||||||
html += `<tr class="${rowClass}"><td>${name}</td><td>${gpu.model}</td><td>${gpu.vram}</td><td>${status}</td><td>${latency}</td></tr>`;
|
|
||||||
}
|
|
||||||
document.getElementById('gpu-table').innerHTML = html;
|
|
||||||
});
|
|
||||||
setInterval(() => location.reload(), 10000);
|
|
||||||
</script></body></html>
|
|
||||||
"""
|
|
||||||
|
|
||||||
|
|
||||||
class Handler(SimpleHTTPRequestHandler):
|
|
||||||
def do_GET(self):
|
|
||||||
if self.path == "/" or self.path == "/harness.html":
|
|
||||||
self.send_response(200)
|
|
||||||
self.send_header("Content-Type", "text/html; charset=utf-8")
|
|
||||||
self.end_headers()
|
|
||||||
self.wfile.write(DASHBOARD_HTML.encode())
|
|
||||||
elif self.path == "/api/status":
|
|
||||||
status = get_queue_status()
|
|
||||||
enriched = {
|
|
||||||
"queue_depth": status.get("queue_depth", -1),
|
|
||||||
"circuit_breaker": status.get("circuit_breaker", "unknown"),
|
|
||||||
"thresholds": status.get("thresholds", {"warn": 30, "open": 50}),
|
|
||||||
"gpu_health": {},
|
|
||||||
}
|
|
||||||
for name, info in GPUS.items():
|
|
||||||
enriched["gpu_health"][name] = check_gpu(name, info)
|
|
||||||
self.send_response(200)
|
|
||||||
self.send_header("Content-Type", "application/json")
|
|
||||||
self.end_headers()
|
|
||||||
self.wfile.write(json.dumps(enriched).encode())
|
|
||||||
else:
|
|
||||||
self.send_response(404)
|
|
||||||
self.end_headers()
|
|
||||||
|
|
||||||
def log_message(self, format, *args):
|
|
||||||
pass # Suppress request logs
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
server = HTTPServer(("0.0.0.0", 3001), Handler)
|
|
||||||
print("Dashboard running on :3001/harness.html")
|
|
||||||
server.serve_forever()
|
|
||||||
@@ -0,0 +1,2 @@
|
|||||||
|
flask==3.1.*
|
||||||
|
requests==2.32.*
|
||||||
+80
-37
@@ -1,54 +1,97 @@
|
|||||||
version: "3.8"
|
version: '3.8'
|
||||||
|
|
||||||
services:
|
services:
|
||||||
redis:
|
redis:
|
||||||
image: redis:7-alpine
|
image: redis:7-alpine
|
||||||
restart: always
|
container_name: harness-redis
|
||||||
networks:
|
restart: unless-stopped
|
||||||
- gpu-router-net
|
ports:
|
||||||
|
- "127.0.0.1:6379:6379"
|
||||||
volumes:
|
volumes:
|
||||||
- redis-data:/data
|
- redis-data:/data
|
||||||
|
command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "redis-cli", "ping"]
|
||||||
|
interval: 10s
|
||||||
|
timeout: 3s
|
||||||
|
retries: 5
|
||||||
|
|
||||||
queue-service:
|
router:
|
||||||
build:
|
build: ./router
|
||||||
context: .
|
container_name: harness-router
|
||||||
dockerfile: Dockerfile.queue
|
restart: unless-stopped
|
||||||
restart: always
|
|
||||||
networks:
|
|
||||||
- gpu-router-net
|
|
||||||
ports:
|
ports:
|
||||||
- "8091:8091"
|
- "9000:9000"
|
||||||
depends_on:
|
|
||||||
- redis
|
|
||||||
environment:
|
environment:
|
||||||
- REDIS_HOST=redis
|
- REDIS_URL=redis://redis:6379
|
||||||
- REDIS_PORT=6379
|
- GPU_MOE_URL=http://192.168.68.15:8080/v1
|
||||||
|
- GPU_DENSE_URL=http://192.168.68.8:8080/v1
|
||||||
|
- GPU_LIGHT_URL=http://192.168.68.110:8080/v1
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9000/health')"]
|
||||||
|
interval: 15s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
|
depends_on:
|
||||||
|
redis:
|
||||||
|
condition: service_healthy
|
||||||
|
|
||||||
|
litellm:
|
||||||
|
image: ghcr.io/berriai/litellm:main-stable
|
||||||
|
command: ["--config", "/app/config.yaml", "--port", "4000"]
|
||||||
|
container_name: harness-litellm
|
||||||
|
restart: unless-stopped
|
||||||
|
ports:
|
||||||
|
- "8081:4000"
|
||||||
|
volumes:
|
||||||
|
- ./litellm_config.yaml:/app/config.yaml
|
||||||
|
environment:
|
||||||
|
- LITELLM_MASTER_KEY=sk-syslog-local-master-key
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9000/health')"]
|
||||||
|
interval: 15s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
|
depends_on:
|
||||||
|
redis:
|
||||||
|
condition: service_healthy
|
||||||
|
|
||||||
|
nginx:
|
||||||
|
image: nginx:alpine
|
||||||
|
container_name: harness-nginx
|
||||||
|
restart: unless-stopped
|
||||||
|
ports:
|
||||||
|
- "80:80"
|
||||||
|
volumes:
|
||||||
|
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "curl", "-f", "http://127.0.0.1/health"]
|
||||||
|
interval: 15s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
|
depends_on:
|
||||||
|
- litellm
|
||||||
|
- dashboard
|
||||||
|
|
||||||
dashboard:
|
dashboard:
|
||||||
build:
|
build: ./dashboard
|
||||||
context: .
|
container_name: harness-dashboard
|
||||||
dockerfile: Dockerfile.dashboard
|
restart: unless-stopped
|
||||||
restart: always
|
|
||||||
networks:
|
|
||||||
- gpu-router-net
|
|
||||||
ports:
|
ports:
|
||||||
- "3001:3001"
|
- "3000:3000"
|
||||||
|
environment:
|
||||||
|
- REDIS_URL=redis://redis:6379
|
||||||
|
- GPU_SIDECARS=192.168.68.15:8090,192.168.68.8:8090,192.168.68.110:8090
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:3000/health')"]
|
||||||
|
interval: 15s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
depends_on:
|
depends_on:
|
||||||
- redis
|
- redis
|
||||||
|
|
||||||
gpu-dashboard:
|
|
||||||
build:
|
|
||||||
context: .
|
|
||||||
dockerfile: Dockerfile.gpu
|
|
||||||
restart: always
|
|
||||||
networks:
|
|
||||||
- gpu-router-net
|
|
||||||
ports:
|
|
||||||
- "8092:8092"
|
|
||||||
|
|
||||||
networks:
|
|
||||||
gpu-router-net:
|
|
||||||
driver: bridge
|
|
||||||
|
|
||||||
volumes:
|
volumes:
|
||||||
redis-data:
|
redis-data:
|
||||||
|
|
||||||
|
# LiteLLM command override to load config
|
||||||
|
# (appended to fix config loading issue)
|
||||||
|
|||||||
@@ -1,115 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
|
|
||||||
|
|
||||||
import urllib.request, json, time, os
|
|
||||||
|
|
||||||
HOSTS = [
|
|
||||||
{"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
|
|
||||||
{"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
|
|
||||||
{"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
|
|
||||||
]
|
|
||||||
OUTPUT = "/root/hermes-workspace/public/gpu_metrics.json"
|
|
||||||
INTERVAL = 10
|
|
||||||
STALE_THRESHOLD = 30 # seconds before marking stale
|
|
||||||
DEAD_THRESHOLD = 60 # seconds before marking unreachable
|
|
||||||
|
|
||||||
last_seen = {}
|
|
||||||
|
|
||||||
|
|
||||||
def fetch_json(url, timeout=3):
|
|
||||||
try:
|
|
||||||
req = urllib.request.Request(url)
|
|
||||||
resp = urllib.request.urlopen(req, timeout=timeout)
|
|
||||||
return json.loads(resp.read().decode())
|
|
||||||
except Exception:
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def collect_one(h):
|
|
||||||
"""Collect GPU hardware + llama.cpp inference state for one host."""
|
|
||||||
name = h["name"]
|
|
||||||
host = h["host"]
|
|
||||||
now = time.time()
|
|
||||||
|
|
||||||
# GPU hardware from sidecar
|
|
||||||
gpu = fetch_json(f"http://{host}:8090/")
|
|
||||||
|
|
||||||
# llama.cpp inference state
|
|
||||||
llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
|
|
||||||
llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
|
|
||||||
|
|
||||||
# Determine inference state
|
|
||||||
model_name = None
|
|
||||||
inference_state = "unknown"
|
|
||||||
if llamacpp_models:
|
|
||||||
models = llamacpp_models.get("data", [])
|
|
||||||
if models:
|
|
||||||
model_name = models[0].get("id")
|
|
||||||
|
|
||||||
if llamacpp_health:
|
|
||||||
status = llamacpp_health.get("status", "")
|
|
||||||
if status == "ok":
|
|
||||||
idle = llamacpp_health.get("slots_idle", 0)
|
|
||||||
processing = llamacpp_health.get("slots_processing", 0)
|
|
||||||
if idle and not processing:
|
|
||||||
inference_state = "idle"
|
|
||||||
elif processing:
|
|
||||||
inference_state = "busy"
|
|
||||||
else:
|
|
||||||
inference_state = "idle"
|
|
||||||
|
|
||||||
# Check for /slots endpoint for is_processing detail
|
|
||||||
slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
|
|
||||||
if slots and isinstance(slots, list) and len(slots) > 0:
|
|
||||||
if slots[0].get("is_processing"):
|
|
||||||
inference_state = "busy"
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"host": name,
|
|
||||||
"gpu_name": h["gpu"],
|
|
||||||
"inference": {
|
|
||||||
"state": inference_state,
|
|
||||||
"model": model_name,
|
|
||||||
},
|
|
||||||
"hardware": gpu if gpu else None,
|
|
||||||
"online": gpu is not None,
|
|
||||||
"timestamp": now,
|
|
||||||
}
|
|
||||||
|
|
||||||
if gpu is not None:
|
|
||||||
last_seen[name] = now
|
|
||||||
|
|
||||||
if name in last_seen:
|
|
||||||
age = now - last_seen[name]
|
|
||||||
if age > DEAD_THRESHOLD:
|
|
||||||
result["online"] = False
|
|
||||||
elif age > STALE_THRESHOLD:
|
|
||||||
result["stale"] = True
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
|
|
||||||
os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
|
|
||||||
|
|
||||||
while True:
|
|
||||||
start = time.time()
|
|
||||||
results = [collect_one(h) for h in HOSTS]
|
|
||||||
|
|
||||||
payload = {
|
|
||||||
"updated": start,
|
|
||||||
"gpus": results,
|
|
||||||
}
|
|
||||||
|
|
||||||
with open(OUTPUT + ".tmp", "w") as f:
|
|
||||||
json.dump(payload, f)
|
|
||||||
os.rename(OUTPUT + ".tmp", OUTPUT)
|
|
||||||
|
|
||||||
elapsed = time.time() - start
|
|
||||||
sleep_for = max(0, INTERVAL - elapsed)
|
|
||||||
time.sleep(sleep_for)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@@ -1,183 +0,0 @@
|
|||||||
<!DOCTYPE html>
|
|
||||||
<html lang="en">
|
|
||||||
<head>
|
|
||||||
<meta charset="UTF-8">
|
|
||||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
|
||||||
<title>GPU Monitor</title>
|
|
||||||
<style>
|
|
||||||
* { margin: 0; padding: 0; box-sizing: border-box; }
|
|
||||||
body { background: #0d1117; color: #c9d1d9; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; padding: 20px; }
|
|
||||||
h1 { font-size: 1.3em; margin-bottom: 4px; }
|
|
||||||
.topbar { display: flex; justify-content: space-between; align-items: center; margin-bottom: 20px; padding-bottom: 12px; border-bottom: 1px solid #21262d; }
|
|
||||||
.topbar .status { font-size: 0.85em; color: #8b949e; }
|
|
||||||
.topbar .status .dot { display: inline-block; width: 8px; height: 8px; border-radius: 50%; margin-right: 6px; }
|
|
||||||
.dot.green { background: #3fb950; }
|
|
||||||
.dot.yellow { background: #d2991d; }
|
|
||||||
.dot.red { background: #f85149; }
|
|
||||||
.cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); gap: 16px; }
|
|
||||||
.card { background: #161b22; border: 1px solid #21262d; border-radius: 8px; padding: 16px; }
|
|
||||||
.card.stale { opacity: 0.5; }
|
|
||||||
.card.dead { opacity: 0.3; border-color: #f85149; }
|
|
||||||
.card-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
|
|
||||||
.card-header .name { font-weight: 600; font-size: 1.05em; }
|
|
||||||
.card-header .host { font-size: 0.8em; color: #8b949e; }
|
|
||||||
.card-header .state { font-size: 0.75em; padding: 2px 8px; border-radius: 10px; font-weight: 600; }
|
|
||||||
.state.idle { background: #1b3826; color: #3fb950; }
|
|
||||||
.state.busy { background: #3d1f1a; color: #f85149; }
|
|
||||||
.state.unknown { background: #21262d; color: #8b949e; }
|
|
||||||
.metric { margin-bottom: 10px; }
|
|
||||||
.metric-label { display: flex; justify-content: space-between; font-size: 0.82em; color: #8b949e; margin-bottom: 2px; }
|
|
||||||
.metric-label .val { color: #c9d1d9; font-weight: 500; }
|
|
||||||
.bar { height: 6px; border-radius: 3px; background: #21262d; overflow: hidden; }
|
|
||||||
.bar-fill { height: 100%; border-radius: 3px; transition: width 0.5s ease; }
|
|
||||||
.bar-fill.temp-cool { background: #3fb950; }
|
|
||||||
.bar-fill.temp-warm { background: #d2991d; }
|
|
||||||
.bar-fill.temp-hot { background: #f85149; }
|
|
||||||
.bar-fill.util { background: #58a6ff; }
|
|
||||||
.bar-fill.vram { background: #bc8cff; }
|
|
||||||
.bar-fill.power { background: #f0883e; }
|
|
||||||
.model-line { font-size: 0.82em; color: #8b949e; margin-top: 8px; padding-top: 8px; border-top: 1px solid #21262d; }
|
|
||||||
.model-line span { color: #c9d1d9; }
|
|
||||||
.error { color: #f85149; font-size: 0.85em; }
|
|
||||||
</style>
|
|
||||||
</head>
|
|
||||||
<body>
|
|
||||||
<div class="topbar">
|
|
||||||
<div>
|
|
||||||
<h1><a href="/" style="color:#58a6ff;text-decoration:none;">← Workspace</a> · GPU Monitor</h1>
|
|
||||||
<span class="status"><span class="dot green" id="status-dot"></span><span id="status-text">Loading...</span></span>
|
|
||||||
</div>
|
|
||||||
<div class="status" id="age">—</div>
|
|
||||||
</div>
|
|
||||||
<div class="cards" id="cards"></div>
|
|
||||||
|
|
||||||
<script>
|
|
||||||
const INTERVAL = 5000;
|
|
||||||
let lastFetchTime = null;
|
|
||||||
|
|
||||||
function updateClock() {
|
|
||||||
const el = document.getElementById('age');
|
|
||||||
if (!lastFetchTime) { el.textContent = '—'; return; }
|
|
||||||
const age = Math.round((Date.now() / 1000) - lastFetchTime);
|
|
||||||
el.textContent = age <= 60 ? `updated ${age}s ago` : `stale ${age}s ago`;
|
|
||||||
}
|
|
||||||
setInterval(updateClock, 1000);
|
|
||||||
|
|
||||||
const TEMP_WARN = 70, TEMP_HOT = 82;
|
|
||||||
const VRAM_WARN = 80, VRAM_HOT = 92;
|
|
||||||
|
|
||||||
function tempClass(c) { return c > TEMP_HOT ? 'temp-hot' : c > TEMP_WARN ? 'temp-warm' : 'temp-cool'; }
|
|
||||||
function vramClass(pct) { return pct > VRAM_HOT ? 'temp-hot' : pct > VRAM_WARN ? 'temp-warm' : 'temp-cool'; }
|
|
||||||
function pct(val, max) { return max ? Math.round(val / max * 100) : 0; }
|
|
||||||
function mbToGB(mb) { return mb ? (mb / 1024).toFixed(1) : '—'; }
|
|
||||||
|
|
||||||
function renderCard(g) {
|
|
||||||
const hw = g.hardware || {};
|
|
||||||
const inf = g.inference || {};
|
|
||||||
const online = g.online !== false;
|
|
||||||
const stale = g.stale === true;
|
|
||||||
let cardClass = '';
|
|
||||||
if (!online) cardClass = 'dead';
|
|
||||||
else if (stale) cardClass = 'stale';
|
|
||||||
|
|
||||||
let stateClass = inf.state || 'unknown';
|
|
||||||
let stateLabel = inf.state ? inf.state.toUpperCase() : 'UNKNOWN';
|
|
||||||
if (!online) { stateClass = 'unknown'; stateLabel = 'OFFLINE'; }
|
|
||||||
|
|
||||||
const temp = hw.temp_c;
|
|
||||||
const util = hw.gpu_util_pct;
|
|
||||||
const vramUsed = hw.vram_used_mb;
|
|
||||||
const vramTotal = hw.vram_total_mb;
|
|
||||||
const power = hw.power_w;
|
|
||||||
const powerLimit = hw.power_limit_w;
|
|
||||||
const fan = hw.fan_pct;
|
|
||||||
const vendor = hw.vendor;
|
|
||||||
|
|
||||||
let html = `<div class="card ${cardClass}">`;
|
|
||||||
html += `<div class="card-header">`;
|
|
||||||
html += `<div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div>`;
|
|
||||||
html += `<div class="state ${stateClass}">${stateLabel}</div>`;
|
|
||||||
html += `</div>`;
|
|
||||||
|
|
||||||
if (!online) {
|
|
||||||
html += `<div class="error">Unreachable</div>`;
|
|
||||||
} else if (hw.error) {
|
|
||||||
html += `<div class="error">${hw.error}</div>`;
|
|
||||||
} else {
|
|
||||||
// Temperature
|
|
||||||
if (temp != null) {
|
|
||||||
html += `<div class="metric"><div class="metric-label"><span>Temperature</span><span class="val">${temp}°C</span></div>`;
|
|
||||||
html += `<div class="bar"><div class="bar-fill ${tempClass(temp)}" style="width:${Math.min(temp,100)}%"></div></div></div>`;
|
|
||||||
}
|
|
||||||
// Utilization
|
|
||||||
if (util != null) {
|
|
||||||
html += `<div class="metric"><div class="metric-label"><span>GPU Utilization</span><span class="val">${util}%</span></div>`;
|
|
||||||
html += `<div class="bar"><div class="bar-fill util" style="width:${util}%"></div></div></div>`;
|
|
||||||
}
|
|
||||||
// VRAM
|
|
||||||
if (vramUsed != null && vramTotal != null) {
|
|
||||||
const vramPct = pct(vramUsed, vramTotal);
|
|
||||||
html += `<div class="metric"><div class="metric-label"><span>VRAM</span><span class="val">${mbToGB(vramUsed)} / ${mbToGB(vramTotal)} GB</span></div>`;
|
|
||||||
html += `<div class="bar"><div class="bar-fill ${vramClass(vramPct)}" style="width:${vramPct}%"></div></div></div>`;
|
|
||||||
}
|
|
||||||
// Power
|
|
||||||
if (power != null) {
|
|
||||||
const powerPct = powerLimit ? pct(power, powerLimit) : 0;
|
|
||||||
const powerText = powerLimit ? `${power}W / ${powerLimit}W` : `${power}W`;
|
|
||||||
html += `<div class="metric"><div class="metric-label"><span>Power</span><span class="val">${powerText}</span></div>`;
|
|
||||||
if (powerLimit) html += `<div class="bar"><div class="bar-fill power" style="width:${powerPct}%"></div></div>`;
|
|
||||||
html += `</div>`;
|
|
||||||
}
|
|
||||||
// Fan (NVIDIA only)
|
|
||||||
if (fan != null) {
|
|
||||||
html += `<div class="metric"><div class="metric-label"><span>Fan Speed</span><span class="val">${fan}%</span></div>`;
|
|
||||||
html += `<div class="bar"><div class="bar-fill util" style="width:${fan}%"></div></div></div>`;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Model loaded
|
|
||||||
html += `<div class="model-line">Model: <span>${inf.model || '—'}</span></div>`;
|
|
||||||
html += `</div>`;
|
|
||||||
return html;
|
|
||||||
}
|
|
||||||
|
|
||||||
async function refresh() {
|
|
||||||
try {
|
|
||||||
const resp = await fetch('gpu_metrics.json?t=' + Date.now());
|
|
||||||
const data = await resp.json();
|
|
||||||
const gpus = data.gpus || [];
|
|
||||||
|
|
||||||
document.getElementById('cards').innerHTML = gpus.map(renderCard).join('');
|
|
||||||
|
|
||||||
// Top bar status
|
|
||||||
const online = gpus.filter(g => g.online !== false).length;
|
|
||||||
const total = gpus.length;
|
|
||||||
const dot = document.getElementById('status-dot');
|
|
||||||
const txt = document.getElementById('status-text');
|
|
||||||
if (online === total) { dot.className = 'dot green'; txt.textContent = `${online}/${total} online`; }
|
|
||||||
else if (online > 0) { dot.className = 'dot yellow'; txt.textContent = `${online}/${total} online`; }
|
|
||||||
else { dot.className = 'dot red'; txt.textContent = 'All offline'; }
|
|
||||||
|
|
||||||
// Capture fetch time for live clock
|
|
||||||
lastFetchTime = Date.now() / 1000;
|
|
||||||
} catch(e) {
|
|
||||||
document.getElementById('status-dot').className = 'dot red';
|
|
||||||
document.getElementById('status-text').textContent = 'Collector down';
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Render skeletons instantly
|
|
||||||
const SKELETONS = [
|
|
||||||
{host:'amdpve', gpu_name:'AMD Strix Halo', hardware:{}, inference:{}, online:true},
|
|
||||||
{host:'llmgpu', gpu_name:'RTX 3090', hardware:{}, inference:{}, online:true},
|
|
||||||
{host:'ocu-llm', gpu_name:'RTX 5070', hardware:{}, inference:{}, online:true},
|
|
||||||
];
|
|
||||||
document.getElementById('cards').innerHTML = SKELETONS.map(g =>
|
|
||||||
`<div class="card"><div class="card-header"><div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div><div class="state unknown">···</div></div><div class="model-line" style="color:#8b949e;">Loading metrics...</div></div>`
|
|
||||||
).join('');
|
|
||||||
|
|
||||||
refresh();
|
|
||||||
setInterval(refresh, INTERVAL);
|
|
||||||
</script>
|
|
||||||
</body>
|
|
||||||
</html>
|
|
||||||
@@ -1,115 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
|
|
||||||
|
|
||||||
import urllib.request, json, time, os
|
|
||||||
|
|
||||||
HOSTS = [
|
|
||||||
{"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
|
|
||||||
{"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
|
|
||||||
{"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
|
|
||||||
]
|
|
||||||
OUTPUT = "/app/public/gpu_metrics.json"
|
|
||||||
INTERVAL = 10
|
|
||||||
STALE_THRESHOLD = 30 # seconds before marking stale
|
|
||||||
DEAD_THRESHOLD = 60 # seconds before marking unreachable
|
|
||||||
|
|
||||||
last_seen = {}
|
|
||||||
|
|
||||||
|
|
||||||
def fetch_json(url, timeout=3):
|
|
||||||
try:
|
|
||||||
req = urllib.request.Request(url)
|
|
||||||
resp = urllib.request.urlopen(req, timeout=timeout)
|
|
||||||
return json.loads(resp.read().decode())
|
|
||||||
except Exception:
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def collect_one(h):
|
|
||||||
"""Collect GPU hardware + llama.cpp inference state for one host."""
|
|
||||||
name = h["name"]
|
|
||||||
host = h["host"]
|
|
||||||
now = time.time()
|
|
||||||
|
|
||||||
# GPU hardware from sidecar
|
|
||||||
gpu = fetch_json(f"http://{host}:8090/")
|
|
||||||
|
|
||||||
# llama.cpp inference state
|
|
||||||
llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
|
|
||||||
llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
|
|
||||||
|
|
||||||
# Determine inference state
|
|
||||||
model_name = None
|
|
||||||
inference_state = "unknown"
|
|
||||||
if llamacpp_models:
|
|
||||||
models = llamacpp_models.get("data", [])
|
|
||||||
if models:
|
|
||||||
model_name = models[0].get("id")
|
|
||||||
|
|
||||||
if llamacpp_health:
|
|
||||||
status = llamacpp_health.get("status", "")
|
|
||||||
if status == "ok":
|
|
||||||
idle = llamacpp_health.get("slots_idle", 0)
|
|
||||||
processing = llamacpp_health.get("slots_processing", 0)
|
|
||||||
if idle and not processing:
|
|
||||||
inference_state = "idle"
|
|
||||||
elif processing:
|
|
||||||
inference_state = "busy"
|
|
||||||
else:
|
|
||||||
inference_state = "idle"
|
|
||||||
|
|
||||||
# Check for /slots endpoint for is_processing detail
|
|
||||||
slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
|
|
||||||
if slots and isinstance(slots, list) and len(slots) > 0:
|
|
||||||
if slots[0].get("is_processing"):
|
|
||||||
inference_state = "busy"
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"host": name,
|
|
||||||
"gpu_name": h["gpu"],
|
|
||||||
"inference": {
|
|
||||||
"state": inference_state,
|
|
||||||
"model": model_name,
|
|
||||||
},
|
|
||||||
"hardware": gpu if gpu else None,
|
|
||||||
"online": gpu is not None,
|
|
||||||
"timestamp": now,
|
|
||||||
}
|
|
||||||
|
|
||||||
if gpu is not None:
|
|
||||||
last_seen[name] = now
|
|
||||||
|
|
||||||
if name in last_seen:
|
|
||||||
age = now - last_seen[name]
|
|
||||||
if age > DEAD_THRESHOLD:
|
|
||||||
result["online"] = False
|
|
||||||
elif age > STALE_THRESHOLD:
|
|
||||||
result["stale"] = True
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
|
|
||||||
os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
|
|
||||||
|
|
||||||
while True:
|
|
||||||
start = time.time()
|
|
||||||
results = [collect_one(h) for h in HOSTS]
|
|
||||||
|
|
||||||
payload = {
|
|
||||||
"updated": start,
|
|
||||||
"gpus": results,
|
|
||||||
}
|
|
||||||
|
|
||||||
with open(OUTPUT + ".tmp", "w") as f:
|
|
||||||
json.dump(payload, f)
|
|
||||||
os.rename(OUTPUT + ".tmp", OUTPUT)
|
|
||||||
|
|
||||||
elapsed = time.time() - start
|
|
||||||
sleep_for = max(0, INTERVAL - elapsed)
|
|
||||||
time.sleep(sleep_for)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@@ -1,14 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
set -e
|
|
||||||
|
|
||||||
# Start collector as background process
|
|
||||||
cd /root/hermes-workspace/public
|
|
||||||
python3 /app/collector.py &
|
|
||||||
COLLECTOR_PID=$!
|
|
||||||
|
|
||||||
echo "Collector started (PID $COLLECTOR_PID)"
|
|
||||||
echo "Serving dashboard on :8092"
|
|
||||||
|
|
||||||
# Serve the public directory (contains gpu.html + gpu_metrics.json)
|
|
||||||
cd /root/hermes-workspace/public
|
|
||||||
python3 -m http.server 8092
|
|
||||||
+3
-19
@@ -13,7 +13,7 @@ upstream llmgpu_pool {
|
|||||||
}
|
}
|
||||||
|
|
||||||
upstream ocu_llm_pool {
|
upstream ocu_llm_pool {
|
||||||
## RTX 5070 — gemma-4 (Dense 4B) — Ultra-light tasks
|
## RTX 5070 — qwen3.5-9b-vlm (VLM) — Vision + light tasks
|
||||||
server 192.168.68.110:8080;
|
server 192.168.68.110:8080;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -24,12 +24,7 @@ upstream queue_service {
|
|||||||
|
|
||||||
upstream dashboard_service {
|
upstream dashboard_service {
|
||||||
## Harness dashboard (Docker container)
|
## Harness dashboard (Docker container)
|
||||||
server syslog-harness-dashboard-1:3001;
|
server dashboard:3001;
|
||||||
}
|
|
||||||
|
|
||||||
upstream gpu_dashboard_pool {
|
|
||||||
## GPU dashboard (Docker container)
|
|
||||||
server syslog-harness-gpu-dashboard-1:8092;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
## ------------------------------------------------------------------
|
## ------------------------------------------------------------------
|
||||||
@@ -41,7 +36,7 @@ map $http_x_syslog_model $gpu_upstream {
|
|||||||
"heavy" llmgpu_pool;
|
"heavy" llmgpu_pool;
|
||||||
"qwen3.5-27B" llmgpu_pool;
|
"qwen3.5-27B" llmgpu_pool;
|
||||||
"light" ocu_llm_pool;
|
"light" ocu_llm_pool;
|
||||||
"gemma-4" ocu_llm_pool;
|
"qwen3.5-9b-vlm" ocu_llm_pool;
|
||||||
}
|
}
|
||||||
|
|
||||||
## Rate limit zone — 10 req/s per IP, burst of 20
|
## Rate limit zone — 10 req/s per IP, burst of 20
|
||||||
@@ -61,17 +56,6 @@ server {
|
|||||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||||
}
|
}
|
||||||
|
|
||||||
## ------------------------------------------------------------------
|
|
||||||
## GPU Dashboard — observability UI (MUST be before / catch-all)
|
|
||||||
## ------------------------------------------------------------------
|
|
||||||
location /gpu {
|
|
||||||
proxy_pass http://gpu_dashboard_pool/;
|
|
||||||
proxy_set_header Host $host;
|
|
||||||
proxy_set_header X-Real-IP $remote_addr;
|
|
||||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
|
||||||
proxy_set_header X-Forwarded-Proto $scheme;
|
|
||||||
}
|
|
||||||
|
|
||||||
## ------------------------------------------------------------------
|
## ------------------------------------------------------------------
|
||||||
## Main location — proxy to selected upstream
|
## Main location — proxy to selected upstream
|
||||||
## ------------------------------------------------------------------
|
## ------------------------------------------------------------------
|
||||||
|
|||||||
+106
@@ -0,0 +1,106 @@
|
|||||||
|
## Syslog GPU Router — Nginx Configuration
|
||||||
|
## Routes incoming agent requests to the appropriate GPU backend
|
||||||
|
## based on the X-Syslog-Model header.
|
||||||
|
|
||||||
|
upstream amdpve_pool {
|
||||||
|
## Strix Halo 395 — qwen3.6-35B-A3B (MoE) — Default workhorse
|
||||||
|
server 192.168.68.15:8080;
|
||||||
|
}
|
||||||
|
|
||||||
|
upstream llmgpu_pool {
|
||||||
|
## RTX 3090 — qwen3.5-27B (Dense) — Heavy reasoning
|
||||||
|
server 192.168.68.8:8080;
|
||||||
|
}
|
||||||
|
|
||||||
|
upstream ocu_llm_pool {
|
||||||
|
## RTX 5070 — qwen3.5-9b-vlm (VLM) — Vision + light tasks
|
||||||
|
server 192.168.68.110:8080;
|
||||||
|
}
|
||||||
|
|
||||||
|
upstream queue_service {
|
||||||
|
## Agent queue with circuit breaker (Docker container)
|
||||||
|
server 127.0.0.1:8091;
|
||||||
|
}
|
||||||
|
|
||||||
|
upstream dashboard_service {
|
||||||
|
## Harness dashboard (Docker container)
|
||||||
|
server 127.0.0.1:3001;
|
||||||
|
}
|
||||||
|
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
## Mapping: X-Syslog-Model header → upstream backend
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
map $http_x_syslog_model $gpu_upstream {
|
||||||
|
default amdpve_pool; # missing header → default workhorse
|
||||||
|
"standard" amdpve_pool;
|
||||||
|
"heavy" llmgpu_pool;
|
||||||
|
"qwen3.5-27B" llmgpu_pool;
|
||||||
|
"light" ocu_llm_pool;
|
||||||
|
"qwen3.5-9b-vlm" ocu_llm_pool;
|
||||||
|
}
|
||||||
|
|
||||||
|
server {
|
||||||
|
listen 8080;
|
||||||
|
server_name _;
|
||||||
|
|
||||||
|
# Rate limit zone — 10 req/s per IP, burst of 20
|
||||||
|
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
|
||||||
|
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
## Dashboard — observability UI (MUST be before / catch-all)
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
location /dashboard {
|
||||||
|
proxy_pass http://dashboard_service/;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header X-Real-IP $remote_addr;
|
||||||
|
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||||
|
}
|
||||||
|
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
## Main location — proxy to selected upstream
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
location / {
|
||||||
|
limit_req zone=perip burst=20 nodelay;
|
||||||
|
limit_req_status 503;
|
||||||
|
proxy_pass http://$gpu_upstream;
|
||||||
|
|
||||||
|
## Preserve original host and headers
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header X-Real-IP $remote_addr;
|
||||||
|
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||||
|
proxy_set_header X-Forwarded-Proto $scheme;
|
||||||
|
|
||||||
|
## Pass through the model header so backends can log it
|
||||||
|
proxy_pass_header X-Syslog-Model;
|
||||||
|
|
||||||
|
## Streaming support (SSE for LLM responses)
|
||||||
|
proxy_buffering off;
|
||||||
|
proxy_cache off;
|
||||||
|
proxy_read_timeout 300s;
|
||||||
|
proxy_send_timeout 300s;
|
||||||
|
|
||||||
|
## Basic failover — retry on error or timeout
|
||||||
|
proxy_next_upstream error timeout http_502 http_503;
|
||||||
|
proxy_next_upstream_tries 2;
|
||||||
|
|
||||||
|
## Add a response header for observability
|
||||||
|
add_header X-Routed-To $gpu_upstream always;
|
||||||
|
|
||||||
|
## Fallback to queue when all GPU upstreams are down
|
||||||
|
error_page 502 503 504 = @queue_fallback;
|
||||||
|
}
|
||||||
|
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
## Queue fallback — enqueue when GPUs are unavailable
|
||||||
|
## ------------------------------------------------------------------
|
||||||
|
location @queue_fallback {
|
||||||
|
rewrite ^ /enqueue break;
|
||||||
|
proxy_pass http://queue_service;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header X-Real-IP $remote_addr;
|
||||||
|
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||||
|
proxy_set_header X-Forwarded-Proto $scheme;
|
||||||
|
proxy_set_header Content-Type $content_type;
|
||||||
|
proxy_pass_request_body on;
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,25 @@
|
|||||||
|
model_list:
|
||||||
|
- model_name: qwen3.6-35B-A3B
|
||||||
|
litellm_params:
|
||||||
|
model: openai/qwen3.6-35B-A3B
|
||||||
|
api_base: http://192.168.68.15:8080/v1
|
||||||
|
api_key: "not-needed"
|
||||||
|
|
||||||
|
- model_name: qwen3.6-27B-code
|
||||||
|
litellm_params:
|
||||||
|
model: openai/qwen3.6-27B-code-text
|
||||||
|
api_base: http://192.168.68.8:8080/v1
|
||||||
|
api_key: "not-needed"
|
||||||
|
|
||||||
|
- model_name: qwen3.5-9b-vlm
|
||||||
|
litellm_params:
|
||||||
|
model: openai/qwen3.5-9b-vlm
|
||||||
|
api_base: http://192.168.68.110:8080/v1
|
||||||
|
api_key: "not-needed"
|
||||||
|
|
||||||
|
general_settings:
|
||||||
|
master_key: sk-syslog-local-master-key
|
||||||
|
|
||||||
|
litellm_settings:
|
||||||
|
drop_params: true
|
||||||
|
request_timeout: 120
|
||||||
@@ -0,0 +1,79 @@
|
|||||||
|
worker_processes auto;
|
||||||
|
error_log /var/log/nginx/error.log warn;
|
||||||
|
pid /var/run/nginx.pid;
|
||||||
|
|
||||||
|
events { worker_connections 1024; }
|
||||||
|
|
||||||
|
http {
|
||||||
|
include /etc/nginx/mime.types;
|
||||||
|
default_type application/octet-stream;
|
||||||
|
|
||||||
|
log_format main launching rt=;
|
||||||
|
access_log /var/log/nginx/access.log main;
|
||||||
|
error_log /var/log/nginx/error.log;
|
||||||
|
sendfile on;
|
||||||
|
keepalive_timeout 65;
|
||||||
|
|
||||||
|
upstream router_api { server router:9000; }
|
||||||
|
upstream dashboard_ui { server dashboard:3000; }
|
||||||
|
upstream litellm_backend { server litellm:4000; }
|
||||||
|
|
||||||
|
server {
|
||||||
|
listen 80;
|
||||||
|
|
||||||
|
# Disable buffering for SSE streams
|
||||||
|
proxy_buffering off;
|
||||||
|
|
||||||
|
# API — through router
|
||||||
|
location /v1/ {
|
||||||
|
proxy_pass http://router_api;
|
||||||
|
proxy_http_version 1.1;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header X-Real-IP $remote_addr;
|
||||||
|
proxy_set_header Authorization $http_authorization;
|
||||||
|
proxy_connect_timeout 10s;
|
||||||
|
proxy_read_timeout 600s;
|
||||||
|
proxy_buffering off;
|
||||||
|
}
|
||||||
|
|
||||||
|
# SSE streaming endpoint
|
||||||
|
location /stream {
|
||||||
|
proxy_pass http://router_api;
|
||||||
|
proxy_http_version 1.1;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header Connection "";
|
||||||
|
proxy_buffering off;
|
||||||
|
chunked_transfer_encoding off;
|
||||||
|
}
|
||||||
|
|
||||||
|
# Dashboard API proxy for SSE
|
||||||
|
location /api/ {
|
||||||
|
proxy_pass http://dashboard_ui;
|
||||||
|
proxy_http_version 1.1;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_buffering off;
|
||||||
|
}
|
||||||
|
|
||||||
|
# LiteLLM debug
|
||||||
|
location /litellm/ {
|
||||||
|
rewrite ^/litellm/(.*) /$1 break;
|
||||||
|
proxy_pass http://litellm_backend;
|
||||||
|
proxy_http_version 1.1;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header Authorization $http_authorization;
|
||||||
|
}
|
||||||
|
|
||||||
|
# Dashboard
|
||||||
|
location / {
|
||||||
|
proxy_pass http://dashboard_ui;
|
||||||
|
proxy_http_version 1.1;
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_buffering off;
|
||||||
|
}
|
||||||
|
|
||||||
|
location /health {
|
||||||
|
return 200 "{\"status\":\"healthy\"}";
|
||||||
|
add_header Content-Type application/json;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -1,10 +0,0 @@
|
|||||||
FROM python:3.13-slim
|
|
||||||
|
|
||||||
RUN pip install --no-cache-dir flask redis
|
|
||||||
|
|
||||||
COPY queue-service.py /app/queue-service.py
|
|
||||||
WORKDIR /app
|
|
||||||
|
|
||||||
EXPOSE 8091
|
|
||||||
|
|
||||||
CMD ["python3", "queue-service.py"]
|
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
FROM python:3.12-slim
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
COPY requirements.txt .
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
COPY router.py .
|
||||||
|
|
||||||
|
EXPOSE 9000
|
||||||
|
CMD ["python", "router.py"]
|
||||||
@@ -0,0 +1,3 @@
|
|||||||
|
flask==3.1.*
|
||||||
|
redis==5.2.*
|
||||||
|
requests==2.32.*
|
||||||
@@ -0,0 +1,418 @@
|
|||||||
|
import os, json, time, logging, traceback, threading, queue
|
||||||
|
import requests, redis
|
||||||
|
from flask import Flask, request, jsonify, Response, stream_with_context
|
||||||
|
|
||||||
|
REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379")
|
||||||
|
GPU_MOE_URL = os.environ.get("GPU_MOE_URL", "http://192.168.68.15:8080/v1")
|
||||||
|
GPU_DENSE_URL = os.environ.get("GPU_DENSE_URL", "http://192.168.68.8:8080/v1")
|
||||||
|
GPU_LIGHT_URL = os.environ.get("GPU_LIGHT_URL", "http://192.168.68.110:8080/v1")
|
||||||
|
|
||||||
|
GPU_SIDECARS = {
|
||||||
|
"qwen3.6-35B-A3B": "http://192.168.68.15:8090",
|
||||||
|
"qwen3.6-27B-code": "http://192.168.68.8:8090",
|
||||||
|
"qwen3.5-9b-vlm": "http://192.168.68.110:8090",
|
||||||
|
}
|
||||||
|
GPU_URLS = {
|
||||||
|
"qwen3.6-35B-A3B": GPU_MOE_URL,
|
||||||
|
"qwen3.6-27B-code": GPU_DENSE_URL,
|
||||||
|
"qwen3.5-9b-vlm": GPU_LIGHT_URL,
|
||||||
|
}
|
||||||
|
# Max concurrent requests per GPU (based on llama.cpp --parallel)
|
||||||
|
GPU_MAX_CONCURRENT = {
|
||||||
|
"qwen3.6-35B-A3B": 2, # 2 slots
|
||||||
|
"qwen3.6-27B-code": 2, # 2 slots
|
||||||
|
"qwen3.5-9b-vlm": 2, # 2 slots (12GB VRAM, 4GB headroom)
|
||||||
|
}
|
||||||
|
|
||||||
|
# Context window sizes (tokens) — used for compaction signals
|
||||||
|
GPU_CONTEXT = {
|
||||||
|
"qwen3.6-35B-A3B": 131072,
|
||||||
|
"qwen3.6-27B-code": 98304,
|
||||||
|
"qwen3.5-9b-vlm": 131072,
|
||||||
|
}
|
||||||
|
|
||||||
|
TIER_MODELS = {
|
||||||
|
"starter": ["qwen3.5-9b-vlm"],
|
||||||
|
"professional": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "qwen3.5-9b-vlm"],
|
||||||
|
"enterprise": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "qwen3.5-9b-vlm"],
|
||||||
|
}
|
||||||
|
API_KEYS = {
|
||||||
|
"sk-syslog-local-master-key": {"tier": "enterprise", "agent": "admin"},
|
||||||
|
"sk-syslog-abiba": {"tier": "enterprise", "agent": "Abiba"},
|
||||||
|
"sk-syslog-mumuni": {"tier": "enterprise", "agent": "Mumuni"},
|
||||||
|
"sk-syslog-tanko": {"tier": "enterprise", "agent": "Tanko"},
|
||||||
|
"sk-syslog-koby": {"tier": "enterprise", "agent": "Koby"},
|
||||||
|
"sk-syslog-kagenz0": {"tier": "enterprise", "agent": "Kagenz0"},
|
||||||
|
"sk-syslog-koonimo": {"tier": "enterprise", "agent": "Koonimo"},
|
||||||
|
"sk-starter-abc123": {"tier": "starter", "agent": "test-starter"},
|
||||||
|
"sk-professional-xyz789": {"tier": "professional", "agent": "test-pro"},
|
||||||
|
}
|
||||||
|
|
||||||
|
logging.basicConfig(level=logging.INFO, format="%(asctime)s [ROUTER] %(levelname)s %(message)s")
|
||||||
|
log = logging.getLogger("router")
|
||||||
|
try: r = redis.from_url(REDIS_URL, decode_responses=True); r.ping()
|
||||||
|
except Exception: r = None
|
||||||
|
|
||||||
|
|
||||||
|
def counter_audit_loop():
|
||||||
|
"""Every 30s, check GPU slots and reset counters if all slots idle."""
|
||||||
|
while True:
|
||||||
|
time.sleep(30)
|
||||||
|
if not r: continue
|
||||||
|
for model, url in GPU_URLS.items():
|
||||||
|
try:
|
||||||
|
resp = requests.get(url.replace("/v1","") + "/slots",
|
||||||
|
headers={"Authorization": "Bearer not-needed"}, timeout=5)
|
||||||
|
if resp.status_code == 200:
|
||||||
|
slots = resp.json()
|
||||||
|
all_idle = all(not s.get("is_processing", False) for s in slots)
|
||||||
|
if all_idle:
|
||||||
|
current = int(r.get("active:" + model) or 0)
|
||||||
|
if current > 0:
|
||||||
|
r.set("active:" + model, 0)
|
||||||
|
log.info("AUDIT: Reset stuck counter for %s (was %d)", model, current)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
threading.Thread(target=counter_audit_loop, daemon=True).start()
|
||||||
|
|
||||||
|
app = Flask(__name__)
|
||||||
|
sse_subscribers = []; sse_lock = threading.Lock()
|
||||||
|
|
||||||
|
def gpu_active_count(model):
|
||||||
|
"""Get number of in-flight requests for a GPU."""
|
||||||
|
if r:
|
||||||
|
return int(r.get("active:" + model) or 0)
|
||||||
|
return 0
|
||||||
|
|
||||||
|
def gpu_incr(model):
|
||||||
|
if r: r.incr("active:" + model)
|
||||||
|
|
||||||
|
def gpu_decr(model):
|
||||||
|
if r:
|
||||||
|
v = r.decr("active:" + model)
|
||||||
|
if v and int(v) < 0:
|
||||||
|
r.set("active:" + model, 0) # never go negative
|
||||||
|
|
||||||
|
def check_gpu_health(model):
|
||||||
|
url = GPU_SIDECARS.get(model)
|
||||||
|
if not url: return {"status": "unknown"}
|
||||||
|
try:
|
||||||
|
resp = requests.get(url, timeout=5)
|
||||||
|
if resp.status_code == 200:
|
||||||
|
d = resp.json()
|
||||||
|
pct = (d.get("vram_used_mb",0) / max(d.get("vram_total_mb",1), 1)) * 100
|
||||||
|
status = "healthy" if pct < 90 else "saturated"
|
||||||
|
# Also check if llama.cpp endpoint is actually responding
|
||||||
|
gpu_url = GPU_URLS.get(model, "")
|
||||||
|
try:
|
||||||
|
hr = requests.get(gpu_url.replace("/v1","") + "/health", headers={"Authorization": "Bearer not-needed"}, timeout=3)
|
||||||
|
if hr.status_code != 200:
|
||||||
|
status = "down"
|
||||||
|
except Exception:
|
||||||
|
status = "down"
|
||||||
|
return {"status": status, "vram_used_mb": d.get("vram_used_mb"), "vram_total_mb": d.get("vram_total_mb"), "vram_pct": round(pct,1), "temp_c": d.get("temp_c"), "gpu_util_pct": d.get("gpu_util_pct"), "gpu_name": d.get("gpu_name"), "power_w": d.get("power_w"), "power_limit_w": d.get("power_limit_w")}
|
||||||
|
except Exception: pass
|
||||||
|
return {"status": "down"}
|
||||||
|
|
||||||
|
def available_models(): return [m for m in GPU_URLS if check_gpu_health(m)["status"] in ("healthy","saturated")]
|
||||||
|
|
||||||
|
def estimate_tokens(msgs):
|
||||||
|
"""Estimate token count from messages. Uses JSON length / 3.5 (closer to real tokenizer ratios for dense text)."""
|
||||||
|
return len(json.dumps(msgs, default=str)) // 3.5
|
||||||
|
|
||||||
|
def is_gpu_busy(model):
|
||||||
|
"""Check if GPU is at or near max concurrent capacity."""
|
||||||
|
active = gpu_active_count(model)
|
||||||
|
max_c = GPU_MAX_CONCURRENT.get(model, 1)
|
||||||
|
return active >= max_c
|
||||||
|
|
||||||
|
def select_best_gpu(candidates, reason):
|
||||||
|
"""Pick the best GPU from candidates IN ORDER — first non-busy one wins."""
|
||||||
|
for m in candidates:
|
||||||
|
if not is_gpu_busy(m):
|
||||||
|
return {"model": m, "reason": reason}
|
||||||
|
# All busy — pick least loaded
|
||||||
|
best = None
|
||||||
|
best_load = 999
|
||||||
|
for m in candidates:
|
||||||
|
load = gpu_active_count(m)
|
||||||
|
if load < best_load:
|
||||||
|
best_load = load
|
||||||
|
best = m
|
||||||
|
if best:
|
||||||
|
return {"model": best, "reason": "load_balanced_" + reason}
|
||||||
|
return None
|
||||||
|
|
||||||
|
def route(rd, tier):
|
||||||
|
msgs = rd.get("messages",[]); t = estimate_tokens(msgs)
|
||||||
|
sys = any(m.get("role")=="system" for m in msgs)
|
||||||
|
turns = len([m for m in msgs if m.get("role") in ("user","assistant")])
|
||||||
|
hints = rd.get("routing_hints",{})
|
||||||
|
allowed = TIER_MODELS.get(tier, ["qwen3.5-9b-vlm"])
|
||||||
|
avail = [m for m in available_models() if m in allowed]
|
||||||
|
if not avail: return {"model": allowed[0], "reason": "all_saturated", "saturated": True}
|
||||||
|
# Check if all available GPUs are at max capacity
|
||||||
|
if all(is_gpu_busy(m) for m in avail):
|
||||||
|
return {"model": avail[0], "reason": "all_saturated", "saturated": True}
|
||||||
|
|
||||||
|
req = rd.get("model","auto")
|
||||||
|
if req != "auto":
|
||||||
|
target = req if req in avail else avail[0]
|
||||||
|
# If explicit model is busy, check if another can take it
|
||||||
|
if is_gpu_busy(target) and req in allowed:
|
||||||
|
alts = [m for m in avail if m != target and m in allowed]
|
||||||
|
if alts:
|
||||||
|
alt = select_best_gpu(alts, "explicit")
|
||||||
|
if alt: return alt
|
||||||
|
return {"model": target, "reason": "explicit"}
|
||||||
|
|
||||||
|
if hints:
|
||||||
|
if hints.get("priority")=="speed" and "qwen3.5-9b-vlm" in avail:
|
||||||
|
return select_best_gpu(["qwen3.5-9b-vlm"], "hint_speed") or {"model":"qwen3.5-9b-vlm","reason":"hint_speed"}
|
||||||
|
if hints.get("priority")=="quality" and "qwen3.6-27B-code" in avail:
|
||||||
|
return select_best_gpu(["qwen3.6-27B-code"], "hint_quality") or {"model":"qwen3.6-27B-code","reason":"hint_quality"}
|
||||||
|
|
||||||
|
first_msg = msgs[0].get("content","") if msgs else ""
|
||||||
|
words = len(first_msg.split()) if isinstance(first_msg, str) else 99
|
||||||
|
|
||||||
|
# TIER 1: Lightweight — single-turn short queries → VLM first
|
||||||
|
if not sys and turns <= 1 and words <= 100 and "qwen3.5-9b-vlm" in avail:
|
||||||
|
if not is_gpu_busy("qwen3.5-9b-vlm"):
|
||||||
|
return {"model":"qwen3.5-9b-vlm","reason":"lightweight"}
|
||||||
|
# VLM busy — fall back to Dense, then MoE
|
||||||
|
fallback = [m for m in ["qwen3.6-35B-A3B","qwen3.6-27B-code"] if m in avail]
|
||||||
|
result = select_best_gpu(fallback, "lightweight_fallback")
|
||||||
|
if result: return result
|
||||||
|
|
||||||
|
# TIER 2: Simple conversations — short context, any prompt → VLM preferred
|
||||||
|
if t <= 1000 and turns <= 4 and "qwen3.5-9b-vlm" in avail:
|
||||||
|
if not is_gpu_busy("qwen3.5-9b-vlm"):
|
||||||
|
return {"model":"qwen3.5-9b-vlm","reason":"simple_conv"}
|
||||||
|
# VLM busy — try Dense
|
||||||
|
if "qwen3.6-27B-code" in avail and not is_gpu_busy("qwen3.6-27B-code"):
|
||||||
|
return {"model":"qwen3.6-27B-code","reason":"simple_conv_fallback"}
|
||||||
|
|
||||||
|
# TIER 3: Heavy reasoning — extremely large context or very long conversations
|
||||||
|
if t > 50000 or turns > 25:
|
||||||
|
# MoE first (131K context handles heavy sessions), then Dense (98K reasoning), then Light (131K fallback)
|
||||||
|
candidates = [m for m in ["qwen3.6-35B-A3B","qwen3.6-27B-code","qwen3.5-9b-vlm"] if m in avail]
|
||||||
|
result = select_best_gpu(candidates, "heavy_reasoning")
|
||||||
|
if result: return result
|
||||||
|
|
||||||
|
# TIER 4: Default — MoE first, VLM helps, Dense last (slow)
|
||||||
|
if t <= 50000:
|
||||||
|
candidates = [m for m in ["qwen3.6-35B-A3B","qwen3.5-9b-vlm","qwen3.6-27B-code"] if m in avail]
|
||||||
|
result = select_best_gpu(candidates, "default")
|
||||||
|
if result: return result
|
||||||
|
|
||||||
|
# Fallback — best available
|
||||||
|
if "qwen3.6-35B-A3B" in avail and not is_gpu_busy("qwen3.6-35B-A3B"):
|
||||||
|
return {"model":"qwen3.6-35B-A3B","reason":"default_moe"}
|
||||||
|
result = select_best_gpu([m for m in avail], "fallback")
|
||||||
|
if result: return result
|
||||||
|
return {"model":avail[0],"reason":"last_resort"}
|
||||||
|
|
||||||
|
def clean_unicode(text):
|
||||||
|
if not isinstance(text, str): return text
|
||||||
|
text = text.replace(chr(0x2014), "-"); text = text.replace(chr(0x2013), "-")
|
||||||
|
text = text.replace(chr(0x2018), "'"); text = text.replace(chr(0x2019), "'")
|
||||||
|
text = text.replace(chr(0x201C), '"'); text = text.replace(chr(0x201D), '"')
|
||||||
|
text = text.replace(chr(0x2026), "..."); text = text.replace(chr(0x00A0), " ")
|
||||||
|
return text.encode("ascii", "ignore").decode("ascii")
|
||||||
|
|
||||||
|
def clean_response(d):
|
||||||
|
if isinstance(d, dict): return {k: clean_response(v) for k,v in d.items()}
|
||||||
|
if isinstance(d, list): return [clean_response(v) for v in d]
|
||||||
|
if isinstance(d, str): return clean_unicode(d)
|
||||||
|
return d
|
||||||
|
|
||||||
|
def get_metrics():
|
||||||
|
d = {"gpus":[],"route_counts":{},"agent_counts":{},"tier_counts":{},"recent":[],"timestamp":time.time(),"active_requests":{}}
|
||||||
|
for m in GPU_URLS:
|
||||||
|
h = check_gpu_health(m)
|
||||||
|
d["gpus"].append({"id":m,"gpu_name":h.get("gpu_name",m),"status":h.get("status"),"vram_used_mb":h.get("vram_used_mb"),"vram_total_mb":h.get("vram_total_mb"),"vram_pct":h.get("vram_pct"),"temp_c":h.get("temp_c"),"gpu_util_pct":h.get("gpu_util_pct"),"power_w":h.get("power_w"),"power_limit_w":h.get("power_limit_w"),"active_requests":gpu_active_count(m), "max_concurrent": GPU_MAX_CONCURRENT.get(m, 1)})
|
||||||
|
d["active_requests"][m] = gpu_active_count(m)
|
||||||
|
if r:
|
||||||
|
try:
|
||||||
|
for m in GPU_URLS: d["route_counts"][m] = int(r.get("routes:"+m) or 0)
|
||||||
|
for k,v in API_KEYS.items():
|
||||||
|
c = int(r.get("routes:agent:"+v["agent"]) or 0)
|
||||||
|
if c>0: d["agent_counts"][v["agent"]] = c
|
||||||
|
for t in TIER_MODELS: d["tier_counts"][t] = int(r.get("routes:tier:"+t) or 0)
|
||||||
|
raw = r.lrange("routes:recent",0,49)
|
||||||
|
d["recent"] = [json.loads(x) for x in raw] if raw else []
|
||||||
|
except Exception: pass
|
||||||
|
return d
|
||||||
|
|
||||||
|
def bcast():
|
||||||
|
data = get_metrics(); payload = json.dumps(data)
|
||||||
|
with sse_lock:
|
||||||
|
dead = []
|
||||||
|
for q in sse_subscribers:
|
||||||
|
try: q.put(payload)
|
||||||
|
except Exception: dead.append(q)
|
||||||
|
for q in dead: sse_subscribers.remove(q)
|
||||||
|
|
||||||
|
QUEUE_TIMEOUT = int(os.environ.get("QUEUE_TIMEOUT", "30")) # max seconds to queue before 503
|
||||||
|
|
||||||
|
@app.route("/v1/chat/completions", methods=["POST"])
|
||||||
|
def chat():
|
||||||
|
try:
|
||||||
|
rd = request.get_json(force=True)
|
||||||
|
ak = request.headers.get("Authorization","").replace("Bearer ","")
|
||||||
|
if not ak or ak not in API_KEYS:
|
||||||
|
log.warning("AUTH_REJECTED: no/invalid API key from %s", request.remote_addr)
|
||||||
|
return jsonify({"error": "Unauthorized — valid API key required"}), 401
|
||||||
|
ki = API_KEYS[ak]
|
||||||
|
tier, agent = ki["tier"], ki["agent"]
|
||||||
|
|
||||||
|
# Allow agent to override queue timeout via header
|
||||||
|
q_timeout = int(request.headers.get("X-Queue-Timeout", str(QUEUE_TIMEOUT)))
|
||||||
|
|
||||||
|
# Cross-turn context tracking: accumulate tokens per session
|
||||||
|
session_id = request.headers.get("X-Session-Id", "")
|
||||||
|
session_tokens = 0
|
||||||
|
if session_id and r:
|
||||||
|
try:
|
||||||
|
prev = int(r.get("session:" + session_id) or 0)
|
||||||
|
current = estimate_tokens(rd.get("messages",[]))
|
||||||
|
session_tokens = max(prev, current) # context only grows
|
||||||
|
r.set("session:" + session_id, session_tokens, ex=86400) # TTL 24h
|
||||||
|
except Exception: pass
|
||||||
|
|
||||||
|
d = route(rd, tier)
|
||||||
|
queue_start = time.time()
|
||||||
|
|
||||||
|
# Queue loop: wait for a GPU slot instead of immediate 503
|
||||||
|
while d.get("saturated"):
|
||||||
|
elapsed = time.time() - queue_start
|
||||||
|
if elapsed > q_timeout:
|
||||||
|
resp = jsonify({"error": "All GPUs saturated", "queued_s": round(elapsed,1), "retry_after_s": 5})
|
||||||
|
resp.headers["Retry-After"] = "5"
|
||||||
|
log.warning("QUEUE_TIMEOUT: %s waited %.1fs, all GPUs saturated", agent, elapsed)
|
||||||
|
return resp, 503
|
||||||
|
time.sleep(0.5) # poll every 500ms
|
||||||
|
d = route(rd, tier)
|
||||||
|
|
||||||
|
waited = time.time() - queue_start
|
||||||
|
if waited > 0.5:
|
||||||
|
log.info("QUEUED: %s waited %.1fs before slot opened", agent, waited)
|
||||||
|
model, reason, url = d["model"], d["reason"], GPU_URLS[d["model"]]
|
||||||
|
is_stream = rd.get("stream", False)
|
||||||
|
|
||||||
|
gpu_incr(model)
|
||||||
|
|
||||||
|
log.info("ROUTE: %s -> %s (%s) stream=%s active=%d/%d", agent, model, reason, is_stream, gpu_active_count(model), GPU_MAX_CONCURRENT.get(model,1))
|
||||||
|
if r:
|
||||||
|
try:
|
||||||
|
r.incr("routes:"+model); r.incr("routes:tier:"+tier); r.incr("routes:agent:"+agent)
|
||||||
|
r.incr("ts:"+model+":"+time.strftime("%Y%m%d%H"))
|
||||||
|
r.lpush("routes:recent", json.dumps({"ts":time.time(),"model":model,"reason":reason,"tier":tier,"agent":agent}))
|
||||||
|
r.ltrim("routes:recent",0,999)
|
||||||
|
except Exception: pass
|
||||||
|
start = time.time()
|
||||||
|
resp = requests.post(url+"/chat/completions", json=rd,
|
||||||
|
headers={"Content-Type":"application/json","Authorization":"Bearer not-needed"}, timeout=300, stream=is_stream)
|
||||||
|
lat = int((time.time()-start)*1000)
|
||||||
|
gpu_decr(model)
|
||||||
|
|
||||||
|
if resp.status_code != 200: return jsonify({"error":"GPU error "+str(resp.status_code)}), 502
|
||||||
|
if is_stream:
|
||||||
|
def gen():
|
||||||
|
for raw in resp.iter_content(chunk_size=None, decode_unicode=True):
|
||||||
|
if raw: yield clean_unicode(raw)
|
||||||
|
bcast()
|
||||||
|
ctx_remaining = GPU_CONTEXT.get(model, 65536) - max(session_tokens, estimate_tokens(rd.get("messages",[])))
|
||||||
|
ctx_pct = ctx_remaining / GPU_CONTEXT.get(model, 65536) * 100
|
||||||
|
ctx_warning = "compact_urgent" if ctx_pct < 5 else ("compact_recommended" if ctx_pct < 15 else ("compact_soon" if ctx_pct < 30 else "ok"))
|
||||||
|
sse_resp = Response(stream_with_context(gen()), mimetype="text/event-stream")
|
||||||
|
sse_resp.headers["X-Context-Remaining"] = str(max(0, ctx_remaining))
|
||||||
|
sse_resp.headers["X-Context-Warning"] = ctx_warning
|
||||||
|
sse_resp.headers["X-Context-Model"] = model
|
||||||
|
return sse_resp
|
||||||
|
data = clean_response(resp.json())
|
||||||
|
for c in data.get("choices",[]):
|
||||||
|
msg = c.get("message",{})
|
||||||
|
if not msg.get("content") and msg.get("reasoning_content"):
|
||||||
|
msg["content"] = msg["reasoning_content"]
|
||||||
|
ctx_remaining = GPU_CONTEXT.get(model, 65536) - max(session_tokens, estimate_tokens(rd.get("messages",[])))
|
||||||
|
ctx_pct = ctx_remaining / GPU_CONTEXT.get(model, 65536) * 100
|
||||||
|
ctx_warning = "compact_urgent" if ctx_pct < 5 else ("compact_recommended" if ctx_pct < 15 else ("compact_soon" if ctx_pct < 30 else "ok"))
|
||||||
|
data["routing"] = {"model":model,"reason":reason,"gpu":url,"tier":tier,"agent":agent,"latency_ms":lat,"active_gpu":gpu_active_count(model),"context_remaining": max(0, ctx_remaining),"context_pct": round(ctx_pct,1),"context_warning": ctx_warning}
|
||||||
|
resp = jsonify(data)
|
||||||
|
resp.headers["X-Context-Remaining"] = str(max(0, ctx_remaining))
|
||||||
|
resp.headers["X-Context-Warning"] = ctx_warning
|
||||||
|
resp.headers["X-Context-Model"] = model
|
||||||
|
bcast()
|
||||||
|
return resp
|
||||||
|
except requests.Timeout:
|
||||||
|
gpu_decr(model)
|
||||||
|
log.error("TIMEOUT: %s -> %s", agent, model)
|
||||||
|
return jsonify({"error":"timeout"}), 504
|
||||||
|
except Exception as e:
|
||||||
|
gpu_decr(model)
|
||||||
|
log.error("Error: %s\n%s", e, traceback.format_exc())
|
||||||
|
return jsonify({"error":str(e)}), 500
|
||||||
|
|
||||||
|
@app.route("/v1/models")
|
||||||
|
def models(): return jsonify({"object":"list","data":[{"id":m,"object":"model","owned_by":"syslog","status":check_gpu_health(m).get("status"),"gpu":check_gpu_health(m).get("gpu_name")} for m in GPU_URLS]})
|
||||||
|
|
||||||
|
@app.route("/health")
|
||||||
|
def health():
|
||||||
|
gpus = {}
|
||||||
|
for m in GPU_URLS:
|
||||||
|
h = check_gpu_health(m)
|
||||||
|
h["active_requests"] = gpu_active_count(m)
|
||||||
|
h["max_concurrent"] = GPU_MAX_CONCURRENT.get(m, 1)
|
||||||
|
gpus[m] = h
|
||||||
|
return jsonify({"status":"healthy","redis":"connected" if r else "down","gpus":gpus,"available_models":available_models()})
|
||||||
|
|
||||||
|
@app.route("/metrics")
|
||||||
|
def metrics(): return jsonify(get_metrics())
|
||||||
|
|
||||||
|
@app.route("/metrics/timeseries")
|
||||||
|
def metrics_timeseries():
|
||||||
|
period = request.args.get("period", "day"); models_list = list(GPU_URLS.keys())
|
||||||
|
data = {"models": {}, "labels": []}
|
||||||
|
if period == "day":
|
||||||
|
buckets = [time.strftime("%Y%m%d%H", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
|
||||||
|
data["labels"] = [time.strftime("%H:00", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
|
||||||
|
elif period == "week":
|
||||||
|
buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
|
||||||
|
data["labels"] = [time.strftime("%a", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
|
||||||
|
else:
|
||||||
|
buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
|
||||||
|
data["labels"] = [time.strftime("%m/%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
|
||||||
|
if r:
|
||||||
|
for model in models_list:
|
||||||
|
counts = []
|
||||||
|
for bucket in buckets:
|
||||||
|
total = 0
|
||||||
|
if period in ("week","month"):
|
||||||
|
for hh in range(24): total += int(r.get("ts:"+model+":"+bucket+"{:02d}".format(hh)) or 0)
|
||||||
|
else: total = int(r.get("ts:"+model+":"+bucket) or 0)
|
||||||
|
counts.append(total)
|
||||||
|
data["models"][model] = counts
|
||||||
|
return jsonify(data)
|
||||||
|
|
||||||
|
@app.route("/stream")
|
||||||
|
def stream():
|
||||||
|
def ev():
|
||||||
|
q = queue.Queue()
|
||||||
|
with sse_lock: sse_subscribers.append(q)
|
||||||
|
try:
|
||||||
|
yield "data: "+json.dumps(get_metrics())+"\n\n"
|
||||||
|
while True:
|
||||||
|
try: yield "data: "+q.get(timeout=3)+"\n\n"
|
||||||
|
except queue.Empty: yield "data: "+json.dumps(get_metrics())+"\n\n"
|
||||||
|
except GeneratorExit: pass
|
||||||
|
finally:
|
||||||
|
with sse_lock:
|
||||||
|
if q in sse_subscribers: sse_subscribers.remove(q)
|
||||||
|
return Response(stream_with_context(ev()), mimetype="text/event-stream",
|
||||||
|
headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
log.info("Router on :9000 (load-aware)")
|
||||||
|
app.run(host="0.0.0.0", port=9000, debug=False)
|
||||||
Submodule syslog-harness-check deleted from b65ea22765
Reference in New Issue
Block a user