feat: Smart Queue Consumer implementation draft + architecture review
- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines) with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts

Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector

Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API
@@ -0,0 +1,390 @@
# Syslog Harness Architecture Review & Improvement Recommendations

**Date:** 2026-05-17
**Commit:** `e95475f` "Add GPU dashboard container + Nginx routing"
**Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git

---

## 1. Current Architecture Overview

```
Host (192.168.68.123)

Agent --:8080--> Nginx Router --> Queue Service --> Dashboard
                 :8080            :8091             :3001

                 GPU Pool         Redis         --> GPU Dashboard
                 :8080            :6379             :8092

                 amdpve           llmgpu            ocu_llm
                 .15:8080         .8:8080           .110:8080
                 MoE 35B          Dense 27B         Light 4B
```

### Services

| Service | Port | Container | Image | Purpose |
|---|---|---|---|---|
| **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
| **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
| **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
| **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
| **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |

### GPU Backends

| Host | GPU | Model | Capacity |
|---|---|---|---|
| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |

### Data Flow

1. **Agent** sends a request with the `X-Syslog-Model` header to Nginx on :8080
2. **Nginx** routes to the appropriate GPU based on the header mapping
3. **GPU backend** (llama.cpp) processes the request
4. **Fallback:** If the GPU returns 502/503 or times out, Nginx redirects to queue-service :8091
5. **Queue** stores the request in the Redis list `inference:requests` via LPUSH
6. **Dashboard** :3001 polls queue-service and GPU health for display
7. **GPU Dashboard** :8092 collects hardware metrics every 10s

---

## 2. File Inventory

```
docker-compose.yml                # Main compose (Docker networking)
gpu-router-docker.conf            # Nginx config for Docker deployment
Dockerfile.gpu                    # GPU dashboard container
Dockerfile.dashboard              # Dashboard container (root-level)
queue-service/Dockerfile          # Queue service container
queue-service/queue-service.py    # Queue logic (121 lines)
dashboard/harness-dashboard.py    # Dashboard app (133 lines)
dashboard/Dockerfile              # Dashboard container (subdir)
dashboard/Dockerfile.dashboard    # Dashboard container (duplicate)
gpu-dashboard/gpu_collector.py    # GPU hardware collector (115 lines)
gpu-dashboard/gpu.html            # GPU dashboard UI (183 lines)
gpu-dashboard/collector.py        # Duplicate collector (hermes-workspace path)
gpu-dashboard/start.sh            # Legacy startup script
MIGRATION_PLAN.md                 # Production migration plan
README.md                         # Documentation
syslog-harness-check/             # Checkpoint subdirectory (mirror)
```

---

## 3. Detailed Findings

### 3.1 Queue Service (`queue-service/queue-service.py`)

**Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
| Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. There is no single source of truth for configuration. |
| Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose, which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
| Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once a request is enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
| Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once tripped (open), it stays open until the queue is manually drained. |
| Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with a 3s timeout each means `/status` can take up to 9s. |
| Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. Redis failures are therefore silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
| Q8 | **MEDIUM** | Lines 83-84 | **Headers are filtered to only `X-*`-prefixed ones**, so the `Content-Type` header is dropped entirely and the receiver can't determine the payload format. |
| Q9 | **LOW** | Line 121 | **No graceful shutdown.** Flask development server doesn't handle SIGTERM gracefully. |

### 3.2 Nginx Gateway (`gpu-router-docker.conf`)

**Architecture:** Nginx routes requests to GPU backends based on the `X-Syslog-Model` header value. It has rate limiting, streaming support, and queue fallback.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** means 20 requests beyond the rate limit are served immediately, and only then does throttling apply. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
| N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `tries 2`** means that on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
| N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. A single slow GPU therefore triggers queue fallback, which should happen only when ALL GPUs are down. |
| N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is misapplied. The directive controls which upstream *response* headers are passed back to the client. Request headers, including the model header, are already forwarded to the upstream by default, so the line has no effect on routing. |
| N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change with the docker-compose project prefix. Should use service names. |
| N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** sets `X-Forwarded-Proto`, but the dashboard service (a simple HTTP server) doesn't use it. Header handling is inconsistent across locations. |

### 3.3 Dashboard (`dashboard/harness-dashboard.py`)

**Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time. |
| D2 | **MEDIUM** | Lines 101-127 | **Uses `HTTPServer` with `SimpleHTTPRequestHandler`**, which handles one request at a time. Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`. |
| D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
| D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |
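
Findings Q6, D1, and G1 share one cause: GPU probes run one after another, so their timeouts add up. The following is a minimal sketch of running the probes concurrently, not the project's implementation. The `check_all` function and the `probe(name, endpoint, timeout_s)` callable are hypothetical names introduced here; the existing `check_gpu` could be adapted to fit that shape.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_all(endpoints, probe, timeout_s=3.0):
    """Probe every GPU in parallel; worst case is about one timeout, not their sum.

    `endpoints` maps GPU name -> "host:port". `probe(name, endpoint, timeout_s)` is a
    hypothetical callable that returns a status dict or raises on failure.
    """
    def safe_probe(item):
        name, endpoint = item
        try:
            return name, probe(name, endpoint, timeout_s)
        except Exception as exc:
            # Report the failure explicitly instead of swallowing it.
            return name, {"status": "down", "error": str(exc)[:50]}

    # One worker per GPU so a slow host cannot delay the others.
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        return dict(pool.map(safe_probe, endpoints.items()))
```

With three GPUs and a 3s timeout each, the worst case for the GPU portion drops from roughly 9s to roughly 3s, because the slowest probe bounds the total.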

### 3.4 GPU Dashboard (`gpu-dashboard/`)

**Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s and writes JSON to `gpu_metrics.json`. A static HTTP server serves the dashboard.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
| G2 | **HIGH** | Lines 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). The two collector files are inconsistent. |
| G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
| G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
| G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. The legacy and current deployments are inconsistent. |

### 3.5 Docker Compose (`docker-compose.yml`)

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
| C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
| C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. With `always`, a container that was stopped manually comes back as soon as the Docker daemon restarts, making maintenance harder. |
| C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect whether a service is actually healthy, only whether the container is running. |
| C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
| C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor: it defaults to bridge). No IPAM configuration for large deployments. |

### 3.6 Container Images

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
| I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`** is an unnecessary dependency for the GPU dashboard (it only uses `urllib`). Adds ~300KB to the image. |
| I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`** and no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
| I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
| I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |

---

## 4. Smart Queuing Analysis & Recommendations

### Current State: No Smart Queuing

The queue service is a **passive storage mechanism**. It stores requests but has no intelligence:

- **No load balancing:** no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
- **No job prioritization:** FIFO only, no priority levels
- **No backpressure:** simple threshold, no exponential backoff or adaptive limits
- **No retry logic:** failed GPU requests go to the queue but are never reprocessed
- **No dead letter handling:** stuck or failed jobs have no lifecycle management
- **No consumer:** nothing dequeues and forwards to GPUs
- **No job tracking:** no job IDs, no status updates, no result retrieval

### Recommended Architecture: Smart Queue with Consumer

```
Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
                                              |
                                         Consumer
                                           Pool
                                              |
                  GPU 1 (load)          GPU 2 (load)          GPU 3 (load)
                     |                       |                     |
                   Health                  Health                Health
                                              |
                                      Update GPU scores

Priority Queue (sorted by urgency)
Dead Letter Queue (failed jobs)
Backpressure (adaptive rate limit)
```

### Specific Recommendations

#### R1: Implement Redis Streams as Queue Backend
- Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
- Streams support consumer groups, message acknowledgment, and pending messages
- Enables proper dead letter queue handling and retry logic
- **File:** `queue-service/queue-service.py`

```python
# Before: simple list
r.rpush(QUEUE_KEY, json.dumps(job))

# After: Redis Stream with consumer group
stream_key = "inference:stream"
consumer_group = "gpu-workers"
# maxlen with approximate=True trims the stream to roughly 10,000 entries
r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
```
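
`XADD` alone does not create the `gpu-workers` group, and a consumer must acknowledge entries for the pending-message tracking to be useful. The sketch below shows one way the group setup and a single read/ack cycle could look. It assumes a redis-py client created with `decode_responses=True`; `ensure_group`, `consume_once`, and the `handle` callback are hypothetical names, not part of the existing service.

```python
import json


def ensure_group(client, stream="inference:stream", group="gpu-workers"):
    """Create the consumer group (and the stream) if it does not exist yet."""
    try:
        client.xgroup_create(stream, group, id="0", mkstream=True)
    except Exception as exc:
        # Redis answers BUSYGROUP when the group already exists; that is fine.
        if "BUSYGROUP" not in str(exc):
            raise


def consume_once(client, handle, consumer, stream="inference:stream",
                 group="gpu-workers", block_ms=5000):
    """Read at most one new entry, run `handle(job)`, and XACK it on success.

    Returns the entry ID, or None if nothing arrived. If `handle` raises, the entry
    is not acknowledged and stays in the group's pending list for a later retry.
    """
    reply = client.xreadgroup(group, consumer, {stream: ">"}, count=1, block=block_ms)
    if not reply:
        return None
    _stream, entries = reply[0]
    entry_id, fields = entries[0]
    handle(json.loads(fields["job"]))
    client.xack(stream, group, entry_id)
    return entry_id
```

Acknowledging only after `handle` returns gives at-least-once delivery: a consumer that crashes mid-job leaves the entry pending, where a retry or dead-letter step can pick it up.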

#### R2: Build a Queue Consumer Pool
- Deploy 1+ consumer containers that poll the stream and forward to GPUs
- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
- **File:** New `queue-service/consumer.py`

```python
class LoadBalancedConsumer:
    def select_gpu(self, job):
        """Select the least-loaded healthy GPU (model compatibility is not checked here)."""
        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
        if not candidates:
            return None
        # Sort by: slots_idle (descending), VRAM_available (descending)
        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
        return candidates[0]
```
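
To make the selection rule concrete, here is a self-contained sketch with an explicit GPU record and an optional model filter. The `GpuState` dataclass and its fields are assumptions about what the sidecar metrics could be mapped to, and the numbers in the usage note are illustrative, not measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuState:
    name: str
    model: str
    health: str          # "up" or "down"
    slots_idle: int      # free llama.cpp slots
    vram_free_mb: int

    @property
    def full(self) -> bool:
        # A GPU with no idle slot cannot take another job right now.
        return self.slots_idle <= 0


def select_gpu(gpus: list[GpuState], model: Optional[str] = None) -> Optional[GpuState]:
    """Pick the healthy GPU with the most idle slots, then the most free VRAM."""
    candidates = [
        g for g in gpus
        if g.health == "up" and not g.full and (model is None or g.model == model)
    ]
    # max() with default=None returns None when no GPU qualifies.
    return max(candidates, key=lambda g: (g.slots_idle, g.vram_free_mb), default=None)
```

For example, given `amdpve` with 1 idle slot and `llmgpu` with 3, `select_gpu(gpus)` returns `llmgpu`, while `select_gpu(gpus, model="qwen3.6-35B-A3B")` returns `amdpve` because the filter excludes the other hosts first.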

#### R3: Implement Priority Queuing
- Add priority field to job payload: `high`, `normal`, `low`
- Use Redis Streams with one stream key per priority level
- Consumer checks `high`, then `normal`, then `low`, in that order
- **File:** `queue-service/queue-service.py` enqueue endpoint
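
A rough sketch of the strict-priority read follows. The three stream keys are hypothetical names (the review only fixes `inference:stream`), and the client is assumed to be a redis-py client whose consumer group already exists on each stream.

```python
PRIORITY_STREAMS = [
    "inference:stream:high",
    "inference:stream:normal",
    "inference:stream:low",
]


def stream_for(priority):
    """Map a job's priority field to its stream key; unknown values fall back to normal."""
    key = f"inference:stream:{priority}"
    return key if key in PRIORITY_STREAMS else "inference:stream:normal"


def next_job(client, group, consumer):
    """Return (stream, entry_id, fields) from the highest-priority non-empty stream.

    Each stream is polled without blocking, in priority order, so a waiting `high`
    job is always taken before any `normal` or `low` job. Returns None when all
    streams are empty.
    """
    for stream in PRIORITY_STREAMS:
        reply = client.xreadgroup(group, consumer, {stream: ">"}, count=1)
        if reply:
            _name, entries = reply[0]
            entry_id, fields = entries[0]
            return stream, entry_id, fields
    return None
```

Strict ordering can starve `low` jobs under sustained `high` traffic; if that matters, an aging rule or a weighted pick would be needed on top of this.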

#### R4: Add Backpressure Mechanism
- Instead of a hard threshold at 50, implement **adaptive backpressure**:
  - Queue depth 0-30: normal operation
  - Queue depth 30-40: return `Retry-After` header with increasing delay
  - Queue depth 40-50: return 503 with exponential `Retry-After`
  - Queue depth >50: circuit breaker open
- **File:** `queue-service/queue-service.py`
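
The tiers above can be sketched as a pure function from queue depth to a response decision. Several details are assumptions made for illustration: the overlapping boundaries are resolved as half-open ranges (30 belongs to the warn tier, 40 to the 503 tier, 50 still to the 503 tier), the delay formulas and the 60s cap are invented, and a negative depth is treated as "Redis unreachable" and rejected, which is one way to avoid the silent-zero problem from Q7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    status: int        # HTTP status to return to the caller
    retry_after: int   # seconds; 0 means no Retry-After header
    breaker: str       # "closed", "warn", or "open"


def backpressure(depth: int) -> Decision:
    """Map the current queue depth to an admission decision."""
    if depth < 0:
        # Unknown depth (e.g. Redis unreachable): fail closed instead of assuming 0.
        return Decision(503, 30, "open")
    if depth < 30:
        return Decision(202, 0, "closed")
    if depth < 40:
        # Accept, but ask clients to slow down: 1s at depth 30 up to 10s at depth 39.
        return Decision(202, depth - 29, "warn")
    if depth <= 50:
        # Reject with exponential delay: 2s at depth 40, doubling, capped at 60s.
        return Decision(503, min(60, 2 ** (depth - 39)), "warn")
    return Decision(503, 60, "open")
```

Keeping the policy in a pure function makes the thresholds easy to unit-test and to tune without touching the Flask handler.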

#### R5: Dead Letter Queue (DLQ)
- Move failed/unprocessable jobs to an `inference:dead-letter` stream
- Include failure reason, attempt count, and original payload
- Provide admin API to inspect, retry, or discard DLQ entries
- **File:** `queue-service/queue-service.py`

```python
# New endpoints (assumes decode_responses=True and the payload stored under "job")
@app.route("/dlq", methods=["GET"])
def list_dlq():
    entries = r.xrange("inference:dead-letter")  # list of (id, fields) pairs
    return jsonify([{"id": mid, "fields": fields} for mid, fields in entries])

@app.route("/dlq/retry/<message_id>", methods=["POST"])
def retry_dlq(message_id):
    # redis-py has no xget(); a one-entry xrange fetches a single message
    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
    if not entries:
        return jsonify({"error": "not found"}), 404
    r.xadd("inference:stream", {"job": entries[0][1]["job"]})
    return jsonify({"retried": message_id}), 202
```

#### R6: GPU-Aware Routing
- Queue consumer should check GPU `slots_busy` before routing
- If a GPU is busy, try the next available GPU
- Track per-GPU queue depth and avoid overloading a single GPU
- **File:** New consumer logic
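
One possible shape for this routing loop is sketched below. The dict keys (`slots_busy`, `in_flight`), the `max_in_flight` cap, and the `send(gpu, job)` callable are all hypothetical; `send` stands in for the HTTP call to llama.cpp and is assumed to block until the job finishes and to raise on failure.

```python
def dispatch(job, gpus, send, max_in_flight=2):
    """Try GPUs from least to most busy; return the name of the GPU that took the job.

    `gpus` is a list of dicts with "name", "slots_busy", and "in_flight" keys.
    Returns None if every GPU is saturated or failed.
    """
    for gpu in sorted(gpus, key=lambda g: (g["slots_busy"], g["in_flight"])):
        if gpu["in_flight"] >= max_in_flight:
            continue  # per-GPU cap: do not pile more work onto this GPU
        gpu["in_flight"] += 1
        try:
            send(gpu, job)
            return gpu["name"]
        except Exception:
            continue  # this GPU failed; fall through to the next candidate
        finally:
            # Runs on success and on failure, so the counter never leaks.
            gpu["in_flight"] -= 1
    return None
```

A `None` result is the signal to leave the job unacknowledged (or move it to the DLQ after enough attempts) rather than dropping it.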

#### R7: Job Status API
- Add job ID generation on enqueue
- Provide `/status/<job_id>` endpoint to check progress
- Store job state in Redis: `queued`, then `processing`, then `completed`/`failed`
- **File:** `queue-service/queue-service.py`

```python
import json
import time
import uuid

# app, r, jsonify, and stream_key come from the existing queue service module
@app.route("/enqueue", methods=["POST"])
def enqueue():
    job_id = str(uuid.uuid4())
    # "payload" is left elided here; fill in the request body before serializing
    job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
    r.xadd(stream_key, {"job": json.dumps(job)})
    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
    return jsonify({"job_id": job_id, "status": "queued"}), 202

@app.route("/status/<job_id>")
def job_status(job_id):
    status = r.hget("job:status", job_id)
    # Return 404 only when the job is unknown; known jobs get a plain 200
    if status is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(json.loads(status))
```

#### R8: Health-Based Circuit Breaker
- Replace simple depth threshold with **per-GPU circuit breakers**
- Track consecutive failures per GPU
- Implement half-open state: after cooldown, probe one GPU to test recovery
- **File:** `queue-service/queue-service.py`
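
A minimal sketch of a per-GPU breaker with the three states follows; one instance would be kept per GPU. The class name, the failure threshold of 3, and the 30s cooldown are assumptions chosen for illustration. The clock is injectable so the state machine can be tested without sleeping.

```python
import time


class GpuCircuitBreaker:
    """closed -> open after N consecutive failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None        # None means the breaker is closed
        self._probe_in_flight = False

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def allow_request(self):
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self._probe_in_flight:
            self._probe_in_flight = True  # let exactly one probe through
            return True
        return False

    def record_success(self):
        # Any success (including a half-open probe) closes the breaker.
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self):
        self._probe_in_flight = False
        self._failures += 1
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            # Trip the breaker, or re-open it after a failed probe, and restart the cooldown.
            self._opened_at = self._clock()
```

The consumer would call `allow_request()` before routing to a GPU and report the outcome with `record_success()` or `record_failure()`, so one failing GPU no longer affects the other two.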

#### R9: Centralized Configuration
- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
  - Redis config key: `config:gpus`
  - Or environment file mounted to all containers
- Nginx can use Lua/variable from config instead of static upstreams
- **File:** New `config/` directory or Redis-based config
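
As a sketch of how the Python services could share one loader, the function below checks Redis first, then an environment variable, then built-in defaults. The `GPU_CONFIG_JSON` variable name and the JSON layout are assumptions; the default endpoints and models are the ones listed in this review.

```python
import json
import os

DEFAULT_GPUS = {
    "amdpve": {"endpoint": "192.168.68.15:8080", "model": "qwen3.6-35B-A3B"},
    "llmgpu": {"endpoint": "192.168.68.8:8080", "model": "qwen3.5-27B"},
    "ocu_llm": {"endpoint": "192.168.68.110:8080", "model": "gemma-4-E4B"},
}


def load_gpus(env=None, redis_client=None):
    """Resolve the GPU map from Redis, then the environment, then built-in defaults."""
    env = os.environ if env is None else env
    raw = None
    if redis_client is not None:
        raw = redis_client.get("config:gpus")  # JSON string or bytes, or None
    if raw is None:
        raw = env.get("GPU_CONFIG_JSON")       # hypothetical variable name
    if raw is None:
        return dict(DEFAULT_GPUS)
    gpus = json.loads(raw)
    for name, info in gpus.items():
        # Fail loudly on a malformed entry instead of routing to nowhere.
        if "endpoint" not in info:
            raise ValueError(f"GPU {name!r} has no endpoint")
    return gpus
```

Nginx would still need its own mechanism (templating or Lua, as noted above), since it cannot import this module.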

---

## 5. Priority Issue Summary

### Critical (Fix Immediately)
1. **Q1:** Queue has no consumer; enqueued requests are never processed
2. **Q4:** No job ID or result retrieval mechanism
3. **N3:** Queue fallback triggers on individual GPU failure, not all-down

### High (Fix Before Production)
4. **Q5:** Circuit breaker has no recovery mechanism
5. **Q6:** `/status` endpoint blocks on GPU health checks
6. **D1:** Dashboard `/api/status` makes 4 sequential calls, up to 11s
7. **C2:** `Dockerfile.queue` path mismatch in docker-compose
8. **I1:** No dependency pinning in any Dockerfile
9. **I3:** Multi-process CMD without supervisor in GPU dashboard

### Medium (Improve in Next Iteration)
10. **Q3:** Redis host default conflicts with Docker networking
11. **Q7:** Silent exception swallowing in Redis access
12. **Q8:** Content-Type header dropped in queue
13. **D2:** Single-threaded dashboard server
14. **D3:** Three separate sources of truth for GPU addresses
15. **G1:** Sequential GPU collection blocks on single failure
16. **N1:** Rate limit burst of 20 nodelay defeats protection
17. **N5:** Hardcoded container names in Nginx
18. **C1:** Queue API exposed externally
19. **C4:** No Docker health checks

### Low (Nice to Have)
20. **Q9:** No graceful shutdown
21. **C3:** `restart: always` vs `unless-stopped`
22. **C5:** No Redis authentication
23. **G4:** 60s dead threshold is too aggressive
24. **I2:** Unnecessary `requests` dependency
25. **I4:** No `.dockerignore`
26. **I5:** Duplicate Dockerfiles

---

## 6. Deployment Architecture Summary

### What Works Well
- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
- SSE streaming support in Nginx for LLM response streaming
- Rate limiting at the gateway layer
- Circuit breaker pattern implemented (even if basic)

### What Needs Work
- **Queue is incomplete:** storage without processing is the most critical gap
- **No job lifecycle:** requests go in and never come out
- **Duplicated configuration:** GPU addresses in 3+ places
- **No monitoring/alerting:** no Prometheus metrics, no alerting rules
- **Single point of failure:** no Redis replication, no container redundancy
- **No logging:** Flask dev server logs are minimal; no structured logging

### Recommended Next Steps
1. **Priority 1:** Implement queue consumer with GPU load-based routing
2. **Priority 2:** Add job status tracking and result retrieval
3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
4. **Priority 4:** Add Docker health checks and proper dependency management
5. **Priority 5:** Centralize GPU configuration in Redis or environment
6. **Priority 6:** Add Prometheus metrics endpoint for observability
@@ -0,0 +1,5 @@
FROM python:3.11-slim
WORKDIR /app
COPY dashboard/harness-dashboard.py .
EXPOSE 3001
CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,71 @@
# Syslog Harness — Production Migration Plan

## Current State (Development)
- **Host:** CT 114 (192.168.68.123)
- **Docker containers:** `syslog-queue` (:8091), `syslog-dashboard` (:3001)
- **Nginx:** Local on CT 114, routing to GPUs + Docker services
- **Status:** All components verified and operational

## Target State (Production)
- **Host:** New CT (e.g., `docker-vm` on 192.168.68.x)
- **Docker containers:** Same queue + dashboard services
- **Nginx:** Containerized on production CT
- **GPU backends:** Same (192.168.68.15, .8, .110)

## Migration Steps

### 1. Prepare Production CT
```bash
# Create new CT on Proxmox
# Install Docker
apt update && apt install -y docker.io docker-compose-plugin

# Clone the harness repo
git clone <repo-url> /root/syslog-harness
cd /root/syslog-harness
```

### 2. Update docker-compose.yml for Production
- Change `REDIS_HOST` to production Redis IP
- Update GPU endpoint env vars if IPs change
- Add volume mounts for persistence

### 3. Build & Deploy
```bash
# Build images
docker compose build

# Start services
docker compose up -d

# Verify health
curl http://localhost:8091/health
curl http://localhost:3001/api/status
```

### 4. Configure Nginx
- Copy `/etc/nginx/conf.d/gpu-router.conf` to production CT
- Update upstream IPs if needed
- Test and reload

### 5. DNS / Routing Update
- Point agent traffic to new CT IP
- Update Hermes config `inference_api_url`
- Test agent routing

### 6. Verification Checklist
- [ ] Queue service health check passes
- [ ] Dashboard API returns GPU health
- [ ] Nginx routes to correct GPU based on header
- [ ] Circuit breaker triggers on excess load
- [ ] Queue fallback works when GPUs are down
- [ ] Agent requests reach correct model

## Rollback Plan
- Keep CT 114 running as backup
- Revert DNS/routing to .123 if issues arise
- Docker containers can be stopped/started instantly

---
*Created: May 15, 2026*
*Status: Development verified, ready for production migration*
@@ -0,0 +1,63 @@
# Syslog Harness

Operational orchestration layer for Syslog's internal AI agents.

## Architecture

```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Agent     │────>│   Nginx      │────>│  GPU Pool   │
│  (Hermes)   │     │   Router     │     │  (MoE/Dense)│
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ├──> :8091 Queue Service (Docker)
                           │
                           └──> :3001 Dashboard (Docker)
```

## Components

| Service | Port | Container | Purpose |
|---|---|---|---|
| Nginx Router | 8080 | Host | Routes requests to GPU backends |
| Queue Service | 8091 | `syslog-queue` | Enqueues requests when GPUs are down |
| Dashboard | 3001 | `syslog-dashboard` | Observability UI + API |

## GPU Routing

| Header `X-Syslog-Model` | Backend | Model |
|---|---|---|
| (none) / `standard` | amdpve (.15) | qwen3.6-35B-A3B (MoE) |
| `heavy` / `qwen3.5-27B` | llmgpu (.8) | qwen3.5-27B (Dense) |
| `light` / `gemma-4` | ocu_llm (.110) | gemma-4-E4B (Light) |

## Quick Start

```bash
# Build & start
docker compose build
docker compose up -d

# Verify
curl http://localhost:8091/health
curl http://localhost:3001/api/status
```

## Dashboard

- **UI:** `http://<host>:8080/dashboard/harness.html`
- **API:** `http://<host>:8080/dashboard/api/status`

## Circuit Breaker

- Rate limit: 10 req/s per IP
- Burst: 20 requests
- Excess returns 503
- Queue fallback on GPU 502/503

## Production Migration

See [MIGRATION_PLAN.md](./MIGRATION_PLAN.md)

---
*Built for Syslog Solution LLC — Quality over speed.*
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,8 @@
FROM python:3.13-slim

COPY harness-dashboard.py /app/harness-dashboard.py
WORKDIR /app

EXPOSE 3001

CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,5 @@
FROM python:3.11-slim
WORKDIR /app
COPY harness-dashboard.py .
EXPOSE 3001
CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,133 @@
#!/usr/bin/env python3
"""Syslog Harness Dashboard — Simple HTTP server exposing GPU health + metrics."""

import json
import os
import time
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler
from datetime import datetime

GPUS = {
    "amdpve": {"endpoint": os.getenv("AMDVE_EP", "192.168.68.15:8080"), "model": "qwen3.6-35B-A3B (MoE)", "vram": "65GB"},
    "llmgpu": {"endpoint": os.getenv("LLMGPU_EP", "192.168.68.8:8080"), "model": "qwen3.5-27B (Dense)", "vram": "24GB"},
    "ocu_llm": {"endpoint": os.getenv("OCU_LLM_EP", "192.168.68.110:8080"), "model": "gemma-4-E4B (Light)", "vram": "12GB"},
}


def check_gpu(name, info):
    try:
        start = time.time()
        # Use simple HTTP GET to check if the GPU endpoint is alive
        resp = urllib.request.urlopen(f"http://{info['endpoint']}/", timeout=3)
        latency = (time.time() - start) * 1000
        return {
            "status": "up",
            "latency_ms": round(latency, 1),
            "model": info["model"],
            "vram": info["vram"],
        }
    except Exception as e:
        return {"status": "down", "error": str(e)[:50], "model": info["model"], "vram": info["vram"]}


def get_queue_status():
    try:
        req = urllib.request.Request("http://queue-service:8091/status")
        resp = urllib.request.urlopen(req, timeout=2)
        return json.loads(resp.read())
    except Exception:
        return {"queue_depth": -1, "circuit_breaker": "unknown", "gpu_health": {}}


DASHBOARD_HTML = """
<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>🦅 Syslog Harness</title>
<style>
body { background: #1a1a2e; color: #e0e0e0; font-family: monospace; margin: 0; padding: 20px; }
.card { background: #16213e; border-radius: 8px; padding: 16px; margin: 10px 0; border-left: 4px solid #0f3460; }
.up { border-left-color: #00d26a; } .down { border-left-color: #ff4757; }
.warn { border-left-color: #ffa502; }
h1 { color: #00d26a; font-size: 24px; } h2 { color: #0f3460; font-size: 16px; }
.metric { display: inline-block; margin: 4px 12px; }
.value { font-weight: bold; color: #00d26a; }
#refresh { position: fixed; top: 10px; right: 10px; background: #0f3460; color: white;
  border: none; padding: 8px 16px; border-radius: 4px; cursor: pointer; }
table { width: 100%; border-collapse: collapse; margin: 10px 0; }
th, td { text-align: left; padding: 8px; border-bottom: 1px solid #0f3460; }
th { color: #00d26a; }
</style></head><body>
<button id="refresh" onclick="location.reload()">↻ Refresh</button>
<h1>🦅 Syslog Harness Dashboard</h1>
<h2>Updated: <span id="ts"></span></h2>

<div class="card" id="queue-card">
<h2>Queue & Circuit Breaker</h2>
<div class="metric">Depth: <span class="value" id="depth">--</span></div>
<div class="metric">Circuit: <span class="value" id="circuit">--</span></div>
<div class="metric">Threshold: <span class="value" id="threshold">--</span></div>
</div>

<div class="card">
<h2>GPU Endpoints</h2>
<table><tr><th>GPU</th><th>Model</th><th>VRAM</th><th>Status</th><th>Latency</th></tr>
<tbody id="gpu-table"></tbody></table>
</div>

<script>
document.getElementById('ts').textContent = new Date().toISOString();
fetch('/api/status').then(r => r.json()).then(data => {
  document.getElementById('depth').textContent = data.queue_depth;
  document.getElementById('circuit').textContent = data.circuit_breaker;
  document.getElementById('threshold').textContent = 'warn:' + data.thresholds.warn + ' / open:' + data.thresholds.open;
  const card = document.getElementById('queue-card');
  if (data.circuit_breaker === 'open') card.className = 'card warn';
  else if (data.circuit_breaker === 'warn') card.className = 'card warn';
|
||||||
|
else card.className = 'card up';
|
||||||
|
let html = '';
|
||||||
|
for (const [name, gpu] of Object.entries(data.gpu_health)) {
|
||||||
|
const status = gpu.status === 'up' ? '✅' : '❌';
|
||||||
|
const latency = gpu.status === 'up' ? gpu.latency_ms + 'ms' : gpu.error;
|
||||||
|
const rowClass = gpu.status === 'up' ? '' : 'down';
|
||||||
|
html += `<tr class="${rowClass}"><td>${name}</td><td>${gpu.model}</td><td>${gpu.vram}</td><td>${status}</td><td>${latency}</td></tr>`;
|
||||||
|
}
|
||||||
|
document.getElementById('gpu-table').innerHTML = html;
|
||||||
|
});
|
||||||
|
setInterval(() => location.reload(), 10000);
|
||||||
|
</script></body></html>
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
class Handler(SimpleHTTPRequestHandler):
|
||||||
|
def do_GET(self):
|
||||||
|
if self.path == "/" or self.path == "/harness.html":
|
||||||
|
self.send_response(200)
|
||||||
|
self.send_header("Content-Type", "text/html; charset=utf-8")
|
||||||
|
self.end_headers()
|
||||||
|
self.wfile.write(DASHBOARD_HTML.encode())
|
||||||
|
elif self.path == "/api/status":
|
||||||
|
status = get_queue_status()
|
||||||
|
enriched = {
|
||||||
|
"queue_depth": status.get("queue_depth", -1),
|
||||||
|
"circuit_breaker": status.get("circuit_breaker", "unknown"),
|
||||||
|
"thresholds": status.get("thresholds", {"warn": 30, "open": 50}),
|
||||||
|
"gpu_health": {},
|
||||||
|
}
|
||||||
|
for name, info in GPUS.items():
|
||||||
|
enriched["gpu_health"][name] = check_gpu(name, info)
|
||||||
|
self.send_response(200)
|
||||||
|
self.send_header("Content-Type", "application/json")
|
||||||
|
self.end_headers()
|
||||||
|
self.wfile.write(json.dumps(enriched).encode())
|
||||||
|
else:
|
||||||
|
self.send_response(404)
|
||||||
|
self.end_headers()
|
||||||
|
|
||||||
|
def log_message(self, format, *args):
|
||||||
|
pass # Suppress request logs
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
server = HTTPServer(("0.0.0.0", 3001), Handler)
|
||||||
|
print("Dashboard running on :3001/harness.html")
|
||||||
|
server.serve_forever()
|
||||||
@@ -0,0 +1,115 @@
#!/usr/bin/env python3
"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""

import urllib.request, json, time, os

HOSTS = [
    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
]
OUTPUT = "/root/hermes-workspace/public/gpu_metrics.json"
INTERVAL = 10
STALE_THRESHOLD = 30  # seconds before marking stale
DEAD_THRESHOLD = 60  # seconds before marking unreachable

last_seen = {}


def fetch_json(url, timeout=3):
    try:
        req = urllib.request.Request(url)
        resp = urllib.request.urlopen(req, timeout=timeout)
        return json.loads(resp.read().decode())
    except Exception:
        return None


def collect_one(h):
    """Collect GPU hardware + llama.cpp inference state for one host."""
    name = h["name"]
    host = h["host"]
    now = time.time()

    # GPU hardware from sidecar
    gpu = fetch_json(f"http://{host}:8090/")

    # llama.cpp inference state
    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")

    # Determine inference state
    model_name = None
    inference_state = "unknown"
    if llamacpp_models:
        models = llamacpp_models.get("data", [])
        if models:
            model_name = models[0].get("id")

    if llamacpp_health:
        status = llamacpp_health.get("status", "")
        if status == "ok":
            idle = llamacpp_health.get("slots_idle", 0)
            processing = llamacpp_health.get("slots_processing", 0)
            if idle and not processing:
                inference_state = "idle"
            elif processing:
                inference_state = "busy"
            else:
                inference_state = "idle"

    # Check for /slots endpoint for is_processing detail
    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
    if slots and isinstance(slots, list) and len(slots) > 0:
        if slots[0].get("is_processing"):
            inference_state = "busy"

    result = {
        "host": name,
        "gpu_name": h["gpu"],
        "inference": {
            "state": inference_state,
            "model": model_name,
        },
        "hardware": gpu if gpu else None,
        "online": gpu is not None,
        "timestamp": now,
    }

    if gpu is not None:
        last_seen[name] = now

    if name in last_seen:
        age = now - last_seen[name]
        if age > DEAD_THRESHOLD:
            result["online"] = False
        elif age > STALE_THRESHOLD:
            result["stale"] = True

    return result


def main():
    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)

    while True:
        start = time.time()
        results = [collect_one(h) for h in HOSTS]

        payload = {
            "updated": start,
            "gpus": results,
        }

        with open(OUTPUT + ".tmp", "w") as f:
            json.dump(payload, f)
        os.rename(OUTPUT + ".tmp", OUTPUT)

        elapsed = time.time() - start
        sleep_for = max(0, INTERVAL - elapsed)
        time.sleep(sleep_for)


if __name__ == "__main__":
    main()
@@ -0,0 +1,14 @@
#!/bin/bash
set -e

# Start collector as background process
cd /root/hermes-workspace/public
python3 /app/collector.py &
COLLECTOR_PID=$!

echo "Collector started (PID $COLLECTOR_PID)"
echo "Serving dashboard on :8092"

# Serve the public directory (contains gpu.html + gpu_metrics.json)
cd /root/hermes-workspace/public
python3 -m http.server 8092
@@ -24,7 +24,7 @@ upstream queue_service {
 
 upstream dashboard_service {
     ## Harness dashboard (Docker container)
-    server dashboard:3001;
+    server syslog-harness-dashboard-1:3001;
 }
 
 upstream gpu_dashboard_pool {
@@ -0,0 +1,10 @@
FROM python:3.13-slim

RUN pip install --no-cache-dir flask redis

COPY queue-service.py /app/queue-service.py
WORKDIR /app

EXPOSE 8091

CMD ["python3", "queue-service.py"]
@@ -0,0 +1,121 @@
#!/usr/bin/env python3
"""Syslog Inference Queue Service — Circuit breaker + request queuing.

Ports: 8091
Endpoints:
  /health — liveness probe (Nginx upstream check)
  /enqueue — POST inference request into queue (fallback from Nginx)
  /status — GET queue depth + circuit breaker state
"""

import json
import os
import sys
import time
import urllib.request
from flask import Flask, request, jsonify

app = Flask(__name__)

# Configuration
REDIS_HOST = os.getenv("REDIS_HOST", "192.168.68.7")
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379"))
QUEUE_KEY = "inference:requests"
CIRCUIT_OPEN_THRESHOLD = 50
CIRCUIT_WARN_THRESHOLD = 30

# GPU endpoints for draining
GPUS = {
    "amdpve": "192.168.68.15:8080",
    "llmgpu": "192.168.68.8:8080",
    "ocu_llm": "192.168.68.110:8080",
}


def get_redis():
    try:
        import redis
        return redis.Redis(host=REDIS_HOST, port=REDIS_PORT, decode_responses=True)
    except Exception:
        return None


def get_queue_depth(r):
    try:
        return r.llen(QUEUE_KEY)
    except Exception:
        return 0


def check_gpu_health(endpoint):
    try:
        req = urllib.request.Request(f"http://{endpoint}/v1/models")
        req.add_header("User-Agent", "queue-service/1.0")
        resp = urllib.request.urlopen(req, timeout=3)
        return resp.status == 200
    except Exception:
        return False


@app.route("/health")
def health():
    """Nginx upstream health probe. Returns 200 if service is alive."""
    return jsonify({"status": "ok", "service": "queue-service"}), 200


@app.route("/enqueue", methods=["POST"])
def enqueue():
    """Fallback endpoint — Nginx calls this when all GPU upstreams are down."""
    r = get_redis()
    if not r:
        return jsonify({"error": "Redis unavailable"}), 503

    depth = get_queue_depth(r)
    if depth >= CIRCUIT_OPEN_THRESHOLD:
        return jsonify({
            "error": "Circuit breaker OPEN",
            "queue_depth": depth,
            "threshold": CIRCUIT_OPEN_THRESHOLD
        }), 503

    # Store the request in queue
    payload = request.get_data(as_text=True)
    headers = {k: v for k, v in request.headers if k.startswith("X-")}
    r.rpush(QUEUE_KEY, json.dumps({
        "payload": payload,
        "headers": headers,
        "queued_at": time.time()
    }))

    new_depth = get_queue_depth(r)
    return jsonify({
        "status": "queued",
        "position": new_depth,
        "circuit": "warn" if new_depth >= CIRCUIT_WARN_THRESHOLD else "closed"
    }), 202


@app.route("/status")
def status():
    """GET queue depth + circuit breaker state + GPU health."""
    r = get_redis()
    depth = get_queue_depth(r) if r else -1
    circuit = "open" if depth >= CIRCUIT_OPEN_THRESHOLD else ("warn" if depth >= CIRCUIT_WARN_THRESHOLD else "closed")

    gpu_health = {}
    for name, endpoint in GPUS.items():
        gpu_health[name] = "up" if check_gpu_health(endpoint) else "down"

    return jsonify({
        "queue_depth": depth,
        "circuit_breaker": circuit,
        "gpu_health": gpu_health,
        "thresholds": {
            "warn": CIRCUIT_WARN_THRESHOLD,
            "open": CIRCUIT_OPEN_THRESHOLD
        }
    })


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8091)
Submodule
+1
Submodule syslog-harness-check added at b65ea22765