# Syslog Harness Architecture Review & Improvement Recommendations

**Date:** 2026-05-17

**Commit:** `e95475f` "Add GPU dashboard container + Nginx routing"

**Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git

---

## 1. Current Architecture Overview

```
Host (192.168.68.123)

Agent --:8080--> Nginx Router ----> Queue Service ----> Dashboard
                 :8080              :8091               :3001
                   |                  |
                   v                  v
                GPU Pool            Redis ------------> GPU Dashboard
                 :8080              :6379               :8092
                   |
     +-------------+-------------+
     |             |             |
   amdpve        llmgpu       ocu_llm
  .15:8080      .8:8080      .110:8080
  MoE 35B      Dense 27B     Light 4B
```

### Services

| Service | Port | Container | Image | Purpose |
|---|---|---|---|---|
| **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
| **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
| **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
| **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
| **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |

### GPU Backends

| Host | GPU | Model | Capacity |
|---|---|---|---|
| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |

### Data Flow

1. **Agent** sends a request with the `X-Syslog-Model` header to Nginx :8080
2. **Nginx** routes to the appropriate GPU based on the header mapping
3. **GPU backend** (llama.cpp) processes the request
4. **Fallback:** If the GPU returns 502/503 or times out, Nginx redirects to queue-service :8091
5. **Queue** stores the request in Redis `inference:requests` via LPUSH
6. **Dashboard** :3001 polls queue-service + GPU health for display
7. **GPU Dashboard** :8092 collects hardware metrics every 10s

---

## 2. File Inventory

```
docker-compose.yml                # Main compose (Docker networking)
gpu-router-docker.conf            # Nginx config for Docker deployment
Dockerfile.gpu                    # GPU dashboard container
Dockerfile.dashboard              # Dashboard container (root-level)
queue-service/Dockerfile          # Queue service container
queue-service/queue-service.py    # Queue logic (121 lines)
dashboard/harness-dashboard.py    # Dashboard app (133 lines)
dashboard/Dockerfile              # Dashboard container (subdir)
dashboard/Dockerfile.dashboard    # Dashboard container (duplicate)
gpu-dashboard/gpu_collector.py    # GPU hardware collector (115 lines)
gpu-dashboard/gpu.html            # GPU dashboard UI (183 lines)
gpu-dashboard/collector.py        # Duplicate collector (hermes-workspace path)
gpu-dashboard/start.sh            # Legacy startup script
MIGRATION_PLAN.md                 # Production migration plan
README.md                         # Documentation
syslog-harness-check/             # Checkpoint subdirectory (mirror)
```

---

## 3. Detailed Findings

### 3.1 Queue Service (`queue-service/queue-service.py`)

**Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
| Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. No configuration source of truth. |
| Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose, which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
| Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once a request is enqueued, there is no API to poll for completion, get a job ID, or retrieve results. |
| Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once tripped, it stays tripped until the queue is manually drained. |
| Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with a 3s timeout means `/status` can take up to 9s. |
| Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
| Q8 | **MEDIUM** | Lines 83-84 | **Headers are filtered to only `X-*` prefixed ones.** The `Content-Type` header is dropped entirely, so the receiver can't determine the payload format. |
| Q9 | **LOW** | Line 121 | **No graceful shutdown.** The Flask development server doesn't handle SIGTERM gracefully. |

### 3.2 Nginx Gateway (`gpu-router-docker.conf`)

**Architecture:** Nginx routes requests to GPU backends based on the `X-Syslog-Model` header value. Has rate limiting, streaming support, and queue fallback.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** means up to 20 requests beyond the rate limit are served immediately; only requests past the burst are rejected. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
| N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `tries 2`** means that on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
| N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. Individual GPU slowness therefore causes queue fallback, instead of queuing only when ALL GPUs are down. |
| N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is misapplied. Nginx passes request headers to the upstream automatically; this directive applies to response headers. The model header is already passed implicitly via `proxy_set_header` inheritance. |
| N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change with the docker-compose project prefix. Should use service names. |
| N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** sets `X-Forwarded-Proto`, but the dashboard service (a simple HTTP server) doesn't use it. Inconsistent header handling across locations. |

### 3.3 Dashboard (`dashboard/harness-dashboard.py`)

**Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time. |
| D2 | **MEDIUM** | Lines 101-127 | **Uses `SimpleHTTPRequestHandler`** on a single-threaded server. Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`. |
| D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in the dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
| D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |

### 3.4 GPU Dashboard (`gpu-dashboard/`)

**Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s and writes JSON to `gpu_metrics.json`. A static HTTP server serves the dashboard.

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
| G2 | **HIGH** | Lines 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files. |
| G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
| G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
| G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment. |

### 3.5 Docker Compose (`docker-compose.yml`)

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
| C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
| C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. With `always`, a manually stopped container is started again when the Docker daemon restarts, making maintenance harder. |
| C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect whether a service is actually healthy, only whether the container is running. |
| C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
| C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor: defaults to bridge). No IPAM configuration for large deployments. |

### 3.6 Container Images

**Issues Found:**

| # | Severity | Location | Issue |
|---|---|---|---|
| I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
| I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`** is an unnecessary dependency for the GPU dashboard (it only uses `urllib`). Adds ~300KB to the image. |
| I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`** and no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
| I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
| I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |

---

## 4. Smart Queuing Analysis & Recommendations

### Current State: No Smart Queuing

The queue service is a **passive storage mechanism**: it stores requests but has no intelligence.

- **No load balancing:** no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
- **No job prioritization:** FIFO only, no priority levels
- **No backpressure:** simple threshold, no exponential backoff or adaptive limits
- **No retry logic:** failed GPU requests go to the queue but are never reprocessed
- **No dead letter handling:** stuck or failed jobs have no lifecycle management
- **No consumer:** nothing dequeues and forwards to GPUs
- **No job tracking:** no job IDs, no status updates, no result retrieval

### Recommended Architecture: Smart Queue with Consumer

```
Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
                                              |
                                              v
                                        Consumer Pool
                                              |
                    +-------------------------+-------------------------+
                    |                         |                         |
               GPU 1 (load)              GPU 2 (load)              GPU 3 (load)
                 Health                    Health                    Health
                    |                         |                         |
                    +-------------------------+-------------------------+
                                              |
                                      Update GPU scores

Priority Queue (sorted by urgency)
Dead Letter Queue (failed jobs)
Backpressure (adaptive rate limit)
```

### Specific Recommendations

#### R1: Implement Redis Streams as Queue Backend

- Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
- Streams support consumer groups, message acknowledgment, and pending messages
- Enables proper dead letter queue handling and retry logic
- **File:** `queue-service/queue-service.py`

```python
# Before: simple list
r.rpush(QUEUE_KEY, json.dumps(job))

# After: Redis Stream with consumer group
stream_key = "inference:stream"
consumer_group = "gpu-workers"  # created once via XGROUP CREATE; read with XREADGROUP
# redis-py spells the trimming flag `approximate` (not `approx`)
r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
```

#### R2: Build a Queue Consumer Pool

- Deploy 1+ consumer containers that poll the stream and forward to GPUs
- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
- **File:** New `queue-service/consumer.py`

```python
class LoadBalancedConsumer:
    def __init__(self, gpus):
        # Each GPU object is expected to expose: health, full, slots_idle, vram_free_mb
        self.gpus = gpus

    def select_gpu(self, job):
        """Select the least-loaded healthy GPU (model compatibility is not checked yet)."""
        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
        if not candidates:
            return None
        # Sort by: slots_idle (descending), VRAM available (descending)
        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
        return candidates[0]
```

#### R3: Implement Priority Queuing

- Add a priority field to the job payload: `high`, `normal`, `low`
- Use Redis Streams with one stream key per priority level
- Consumer checks `high`, then `normal`, then `low`
- **File:** `queue-service/queue-service.py` enqueue endpoint

#### R4: Add Backpressure Mechanism

- Instead of a hard threshold at 50, implement **adaptive backpressure**:
  - Queue depth 0-30: normal operation
  - Queue depth 30-40: return `retry-after` header with increasing delay
  - Queue depth 40-50: return 503 with exponential retry-after
  - Queue depth >50: circuit breaker open
- **File:** `queue-service/queue-service.py`

#### R5: Dead Letter Queue (DLQ)

- Move failed/unprocessable jobs to an `inference:dead-letter` stream
- Include failure reason, attempt count, and original payload
- Provide an admin API to inspect, retry, or discard DLQ entries
- **File:** `queue-service/queue-service.py`

```python
# New endpoints (assumes the Redis client was created with decode_responses=True)
@app.route("/dlq", methods=["GET"])
def list_dlq():
    entries = r.xrange("inference:dead-letter")
    return jsonify([{"id": mid, **fields} for mid, fields in entries])

@app.route("/dlq/retry/<message_id>", methods=["POST"])
def retry_dlq(message_id):
    # Redis has no XGET; XRANGE with min == max fetches a single entry
    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
    if not entries:
        return jsonify({"error": "not found"}), 404
    job = entries[0][1]["job"]  # assumes the DLQ entry keeps the payload under "job"
    r.xadd("inference:stream", {"job": job})
    return jsonify({"retried": message_id}), 202
```

#### R6: GPU-Aware Routing

- Queue consumer should check GPU `slots_busy` before routing
- If a GPU is busy, try the next available GPU
- Track per-GPU queue depth and avoid overloading a single GPU
- **File:** New consumer logic

#### R7: Job Status API

- Add job ID generation on enqueue
- Provide a `/status/<job_id>` endpoint to check progress
- Store job state in Redis: `queued` → `processing` → `completed`/`failed`
- **File:** `queue-service/queue-service.py`

```python
import json
import time
import uuid

@app.route("/enqueue", methods=["POST"])
def enqueue():
    job_id = str(uuid.uuid4())
    # "payload" is left elided here, as in the original sketch
    job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
    r.xadd(stream_key, {"job": json.dumps(job)})
    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
    return jsonify({"job_id": job_id, "status": "queued"}), 202

@app.route("/status/<job_id>")
def job_status(job_id):
    status = r.hget("job:status", job_id)
    if status is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(json.loads(status))
```

#### R8: Health-Based Circuit Breaker

- Replace the simple depth threshold with **per-GPU circuit breakers**
- Track consecutive failures per GPU
- Implement a half-open state: after cooldown, probe one GPU to test recovery
- **File:** `queue-service/queue-service.py`

#### R9: Centralized Configuration

- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
  - Redis config key: `config:gpus`
  - Or an environment file mounted to all containers
- Nginx can use Lua/variable from config instead of static upstreams
- **File:** New `config/` directory or Redis-based config

---

## 5. Priority Issue Summary

### Critical (Fix Immediately)

1. **Q1:** Queue has no consumer; enqueued requests are never processed
2. **Q4:** No job ID or result retrieval mechanism
3. **N3:** Queue fallback triggers on individual GPU failure, not all-down

### High (Fix Before Production)

4. **Q5:** Circuit breaker has no recovery mechanism
5. **Q6:** `/status` endpoint blocks on GPU health checks
6. **D1:** Dashboard `/api/status` makes 4 sequential calls, up to 11s
7. **C2:** `Dockerfile.queue` path mismatch in docker-compose
8. **I1:** No dependency pinning in any Dockerfile
9. **I3:** Multi-process CMD without supervisor in GPU dashboard

### Medium (Improve in Next Iteration)

10. **Q3:** Redis host default conflicts with Docker networking
11. **Q7:** Silent exception swallowing in Redis access
12. **Q8:** Content-Type header dropped in queue
13. **D2:** Single-threaded dashboard server
14. **D3:** Three separate sources of truth for GPU addresses
15. **G1:** Sequential GPU collection blocks on single failure
16. **N1:** Rate limit burst of 20 with `nodelay` defeats protection
17. **N5:** Hardcoded container names in Nginx
18. **C1:** Queue API exposed externally
19. **C4:** No Docker health checks

### Low (Nice to Have)

20. **Q9:** No graceful shutdown
21. **C3:** `restart: always` vs `unless-stopped`
22. **C5:** No Redis authentication
23. **G4:** 60s dead threshold is too aggressive
24. **I2:** Unnecessary `requests` dependency
25. **I4:** No `.dockerignore`
26. **I5:** Duplicate Dockerfiles

---

## 6. Deployment Architecture Summary

### What Works Well

- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
- SSE streaming support in Nginx for LLM response streaming
- Rate limiting at the gateway layer
- Circuit breaker pattern implemented (even if basic)

### What Needs Work

- **Queue is incomplete:** storage without processing is the most critical gap
- **No job lifecycle:** requests go in and never come out
- **Duplicated configuration:** GPU addresses in 3+ places
- **No monitoring/alerting:** no Prometheus metrics, no alerting rules
- **Single point of failure:** no Redis replication, no container redundancy
- **No logging:** Flask dev server logs are minimal; no structured logging

### Recommended Next Steps

1. **Priority 1:** Implement queue consumer with GPU load-based routing
2. **Priority 2:** Add job status tracking and result retrieval
3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
4. **Priority 4:** Add Docker health checks and proper dependency management
5. **Priority 5:** Centralize GPU configuration in Redis or environment
6. **Priority 6:** Add Prometheus metrics endpoint for observability
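
As a starting point for the adaptive backpressure described in R4, the depth bands could be expressed as a pure function that the enqueue endpoint consults before accepting a job. This is a minimal sketch, not the project's code: the `Decision` type, the `backpressure` function name, the linear 1-10s delay in the 30-39 band, the 60s cap, and the choice to treat each band's lower bound as inclusive are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    """Outcome of a backpressure check (hypothetical type, not in the repo)."""
    accept: bool                  # whether the job should be enqueued
    status: int                   # HTTP status the API would return
    retry_after_s: Optional[int]  # value for a Retry-After header, if any


MAX_RETRY_AFTER_S = 60  # assumed cap on the advertised delay


def backpressure(depth: int) -> Decision:
    """Map the current queue depth to an accept/reject decision."""
    if depth < 30:
        # Normal operation: accept without any hint to slow down.
        return Decision(True, 202, None)
    if depth < 40:
        # Still accepting, but advertise a linearly increasing delay (1..10s).
        return Decision(True, 202, depth - 29)
    if depth <= 50:
        # Reject with an exponentially growing delay, capped at the maximum.
        return Decision(False, 503, min(2 ** (depth - 40), MAX_RETRY_AFTER_S))
    # Above 50 the breaker is considered open: reject with the maximum delay.
    return Decision(False, 503, MAX_RETRY_AFTER_S)
```

Keeping the policy free of Flask and Redis calls makes the thresholds easy to unit-test and to tune later without touching the endpoint code.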
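
The per-GPU circuit breaker with a half-open state proposed in R8 might look roughly like the sketch below. It is an illustration under stated assumptions rather than the harness's implementation: the `CircuitBreaker` class, its parameter names, and the default threshold and cooldown values are hypothetical, and the clock is injectable only so the behavior can be tested without sleeping.

```python
import time


class CircuitBreaker:
    """Consecutive-failure breaker with closed / open / half_open states."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0            # consecutive failures while closed
        self._opened_at = None        # time the breaker last tripped
        self._probe_in_flight = False

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def allow(self):
        """Return True if a request may be sent to this GPU right now."""
        state = self.state
        if state == "closed":
            return True
        if state == "half_open" and not self._probe_in_flight:
            self._probe_in_flight = True  # let exactly one probe through
            return True
        return False

    def record_success(self):
        """A request succeeded: reset the counter and close the breaker."""
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self):
        """A request failed: trip on threshold, or re-open after a failed probe."""
        self._failures += 1
        self._probe_in_flight = False
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


# One breaker per backend, keyed by the host names from the architecture diagram.
breakers = {name: CircuitBreaker() for name in ("amdpve", "llmgpu", "ocu_llm")}
```

A consumer would call `breakers[name].allow()` before forwarding a job and report the outcome with `record_success()` or `record_failure()`, so a failing GPU is skipped during its cooldown and then retested with a single request.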
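
Findings Q6, D1 and G1 all stem from probing the three GPUs one after another. One possible remedy is to run the probes concurrently so the total wait is bounded by the slowest host instead of the sum of all timeouts. The sketch below uses only the standard library; the `check_all` and `probe_http` names are hypothetical, and the `/health` path is an assumption about the llama.cpp backends that should be verified against the deployed servers.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Backend addresses as listed in the GPU Backends table.
GPUS = {
    "amdpve": "http://192.168.68.15:8080",
    "llmgpu": "http://192.168.68.8:8080",
    "ocu_llm": "http://192.168.68.110:8080",
}


def probe_http(base_url, timeout_s=3.0):
    """Return True if GET <base_url>/health answers 200 (path is an assumption)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts and non-2xx responses count as unhealthy.
        return False


def check_all(gpus, probe=probe_http):
    """Probe every GPU in parallel and return {name: healthy}."""
    with ThreadPoolExecutor(max_workers=max(len(gpus), 1)) as pool:
        futures = {name: pool.submit(probe, url) for name, url in gpus.items()}
        return {name: future.result() for name, future in futures.items()}
```

With a 3s timeout per probe, `check_all(GPUS)` should return after roughly 3s in the worst case rather than 9s, since an unreachable host no longer delays the others.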