- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines) with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts
Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector
Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API
Syslog Harness Architecture Review & Improvement Recommendations
Date: 2026-05-17
Commit: e95475f "Add GPU dashboard container + Nginx routing"
Repo: http://192.168.68.17:3000/SyslogSolution/syslog-harness.git
1. Current Architecture Overview
Host (192.168.68.123)
Agent > Nginx Router (:8080) > Queue Service (:8091) > Dashboard (:3001)
GPU Pool (:8080), Redis (:6379) > GPU Dashboard (:8092)
GPU Pool members:
amdpve (.15:8080, MoE 35B), llmgpu (.8:8080, Dense 27B), ocu_llm (.110:8080, Light 4B)
Services
| Service | Port | Container | Image | Purpose |
|---|---|---|---|---|
| Nginx Router | 8080 | Host-level | OS nginx | Routes by X-Syslog-Model header |
| Queue Service | 8091 | syslog-queue | python:3.13-slim | Request queue + circuit breaker |
| Dashboard | 3001 | syslog-dashboard | python:3.11-slim | Observability UI + GPU health |
| GPU Dashboard | 8092 | syslog-gpu-dashboard | python:3.11-slim | Hardware metrics (temp, VRAM, power) |
| Redis | 6379 | syslog-redis | redis:7-alpine | Queue storage |
GPU Backends
| Host | GPU | Model | Capacity |
|---|---|---|---|
| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |
Data Flow
- Agent sends request with X-Syslog-Model header to Nginx :8080
- Nginx routes to appropriate GPU based on header mapping
- GPU backend (llama.cpp) processes request
- Fallback: if the GPU returns 502/503/timeout, Nginx redirects to queue-service :8091
- Queue stores request in Redis (inference:requests, LPUSH)
- Dashboard :3001 polls queue-service + GPU health for display
- GPU Dashboard :8092 collects hardware metrics every 10s
2. File Inventory
docker-compose.yml # Main compose (Docker networking)
gpu-router-docker.conf # Nginx config for Docker deployment
Dockerfile.gpu # GPU dashboard container
Dockerfile.dashboard # Dashboard container (root-level)
queue-service/Dockerfile # Queue service container
queue-service/queue-service.py # Queue logic (121 lines)
dashboard/harness-dashboard.py # Dashboard app (133 lines)
dashboard/Dockerfile # Dashboard container (subdir)
dashboard/Dockerfile.dashboard # Dashboard container (duplicate)
gpu-dashboard/gpu_collector.py # GPU hardware collector (115 lines)
gpu-dashboard/gpu.html # GPU dashboard UI (183 lines)
gpu-dashboard/collector.py # Duplicate collector (hermes-workspace path)
gpu-dashboard/start.sh # Legacy startup script
MIGRATION_PLAN.md # Production migration plan
README.md # Documentation
syslog-harness-check/ # Checkpoint subdirectory (mirror)
3. Detailed Findings
3.1 Queue Service (queue-service/queue-service.py)
Architecture: Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.
Issues Found:
| # | Severity | Location | Issue |
|---|---|---|---|
| Q1 | CRITICAL | Lines 82-88 | Queue is fire-and-forget with no consumer. Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
| Q2 | CRITICAL | Lines 28-32 | Hardcoded GPU IPs in the queue service duplicate the Nginx config. No configuration source of truth. |
| Q3 | HIGH | Lines 21-22 | Redis host fallback to 192.168.68.7 (line 21) conflicts with docker-compose which sets REDIS_HOST=redis (line 24). The default is unreachable inside Docker. |
| Q4 | HIGH | Lines 66-95 | No job result retrieval mechanism. Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
| Q5 | HIGH | Lines 73-79 | Circuit breaker is a simple depth threshold. No backoff, no recovery window, no sliding window. Once tripped, it stays open until the queue is manually drained. |
| Q6 | MEDIUM | Lines 50-57 | GPU health check is synchronous and blocks the /status endpoint. Checking 3 GPUs sequentially with 3s timeout means /status can take up to 9s. |
| Q7 | MEDIUM | Lines 35-40 | get_redis() swallows all exceptions and returns None. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
| Q8 | MEDIUM | Lines 83-84 | Headers are filtered to only X-* prefixed ones; the Content-Type header is dropped entirely, so the receiver can't determine the payload format. |
| Q9 | LOW | Line 121 | No graceful shutdown. Flask development server doesn't handle SIGTERM gracefully. |
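Finding Q6 can be illustrated with a minimal sketch: probing the GPUs concurrently bounds /status latency by the slowest single probe (about 3s) instead of the sum (up to 9s). The names check_all and probe are hypothetical, and the /health path is an assumption about the llama.cpp backends rather than something taken from queue-service.py.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import URLError
from urllib.request import urlopen

# GPU endpoints as listed in the GPU Backends table
GPUS = {
    "amdpve": "http://192.168.68.15:8080",
    "llmgpu": "http://192.168.68.8:8080",
    "ocu_llm": "http://192.168.68.110:8080",
}

def probe(base_url, timeout=3.0):
    """Return True if the backend answers its health endpoint with HTTP 200."""
    try:
        # Assumption: the backend exposes GET /health
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def check_all(gpus, probe_fn=probe):
    """Probe every GPU in parallel and return {name: bool}."""
    with ThreadPoolExecutor(max_workers=max(1, len(gpus))) as pool:
        futures = {name: pool.submit(probe_fn, url) for name, url in gpus.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

The same pattern would apply to the dashboard's /api/status (D1) and the collector loop (G1), since all three poll the same hosts one after another.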
3.2 Nginx Gateway (gpu-router-docker.conf)
Architecture: Nginx routes requests to GPU backends based on X-Syslog-Model header value. Has rate limiting, streaming support, and queue fallback.
Issues Found:
| # | Severity | Location | Issue |
|---|---|---|---|
| N1 | HIGH | Lines 79-80 | burst=20 nodelay means up to 20 requests beyond the rate limit are served immediately, and only requests beyond that burst are rejected. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
| N2 | HIGH | Lines 99-100 | proxy_next_upstream with tries 2 means on error/timeout/502/503, Nginx retries once. But it retries against the same GPU pool, not a different one. The same GPU that failed gets hit again. |
| N3 | HIGH | Lines 106, 112-121 | Queue fallback (@queue_fallback) is triggered for ANY 502/503/504, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down. |
| N4 | MEDIUM | Line 90 | proxy_pass_header X-Syslog-Model is non-standard. Nginx automatically passes request headers; this directive is for response headers. The model header is already passed implicitly via proxy_set_header inheritance. |
| N5 | MEDIUM | Lines 27, 32 | Hardcoded container names (syslog-harness-dashboard-1, syslog-harness-gpu-dashboard-1). These change based on docker-compose project prefix. Should use service names. |
| N6 | LOW | Lines 67-73 | GPU dashboard at /gpu path has X-Forwarded-Proto but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations. |
3.3 Dashboard (dashboard/harness-dashboard.py)
Architecture: Simple HTTP server using Python's http.server. Fetches queue status and GPU health, renders HTML.
Issues Found:
| # | Severity | Location | Issue |
|---|---|---|---|
| D1 | HIGH | Lines 34-40 | get_queue_status() calls queue-service synchronously. Combined with per-GPU health checks (lines 18-31), the /api/status endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time. |
| D2 | MEDIUM | Lines 101-127 | Uses SimpleHTTPRequestHandler which is single-threaded. Under concurrent dashboard access, requests queue up. Should use ThreadingHTTPServer. |
| D3 | MEDIUM | Lines 16-18 | GPU endpoints hardcoded in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
| D4 | LOW | Line 127 | Silent log suppression. While intentional, this makes debugging impossible without modifying the source. |
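For D2, a minimal sketch of the suggested switch to ThreadingHTTPServer, which serves each connection on its own thread. StatusHandler and make_server are hypothetical stand-ins, not the code in harness-dashboard.py, and the response body is a placeholder.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the dashboard's request handler."""

    def do_GET(self):
        if self.path != "/api/status":
            self.send_error(404)
            return
        # Placeholder payload; the real handler would gather queue and GPU status
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(host="0.0.0.0", port=3001):
    """Build a server that handles each connection on a separate thread."""
    return ThreadingHTTPServer((host, port), StatusHandler)
```

Calling make_server().serve_forever() would start it on the dashboard port; a slow request then no longer blocks other dashboard viewers.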
3.4 GPU Dashboard (gpu-dashboard/)
Architecture: gpu_collector.py polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to gpu_metrics.json. Static HTTP server serves the dashboard.
Issues Found:
| # | Severity | Location | Issue |
|---|---|---|---|
| G1 | HIGH | Lines 97-98 | Sequential collection. All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
| G2 | HIGH | Lines 105-107 | /app/public/gpu_metrics.json path is hardcoded and differs from collector.py (line 11: /root/hermes-workspace/public/gpu_metrics.json). Inconsistent between the two collector files. |
| G3 | MEDIUM | Lines 19-25 | fetch_json swallows all exceptions. A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
| G4 | MEDIUM | Line 14 | DEAD_THRESHOLD = 60 seconds is aggressive. A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
| G5 | LOW | Lines 10-14 | start.sh references /root/hermes-workspace/public but Dockerfile.gpu creates /app/public. Inconsistent between legacy and current deployment. |
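For G3, one possible shape for a fetch_json that reports failures instead of hiding them. This is a sketch under assumptions: the (data, error) return convention, the logger name, and the injectable opener parameter are illustrative and not part of gpu_collector.py.

```python
import json
import logging
from urllib.request import urlopen

log = logging.getLogger("gpu_collector")

def fetch_json(url, timeout=3.0, opener=urlopen):
    """Return (data, error); exactly one of the two is None."""
    try:
        with opener(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8")), None
    except (OSError, ValueError) as exc:
        # OSError covers URLError, timeouts and refused connections;
        # ValueError covers malformed JSON and malformed URLs.
        log.warning("fetch failed for %s: %s", url, exc)
        return None, f"{type(exc).__name__}: {exc}"
```

Writing the error string into gpu_metrics.json next to the data would let the UI tell "no data" apart from "collector error".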
3.5 Docker Compose (docker-compose.yml)
Issues Found:
| # | Severity | Location | Issue |
|---|---|---|---|
| C1 | HIGH | Lines 19-20 | Queue service exposes port 8091 externally. In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
| C2 | MEDIUM | Lines 13-15 | Dockerfile.queue referenced but doesn't exist at root level. The file is at queue-service/Dockerfile. The compose build context is . (root) but the dockerfile path doesn't match. |
| C3 | MEDIUM | Lines 6, 16, 26, 31, 43 | restart: always instead of restart: unless-stopped. With always, a manually stopped container is started again when the Docker daemon restarts, making maintenance harder. |
| C4 | LOW | Lines 23-25 | No health checks defined for any service. Docker can't detect if a service is actually healthy, only if the container is running. |
| C5 | LOW | Line 10 | Redis has no password. Unauthenticated Redis exposed on the Docker network. |
| C6 | LOW | Lines 49-51 | No network driver specified for the bridge network (minor: defaults to bridge). No IPAM configuration for large deployments. |
3.6 Container Images
Issues Found:
| # | Severity | Location | Issue |
|---|---|---|---|
| I1 | HIGH | All Dockerfiles | No requirements.txt or dependency pinning. All dependencies (flask, redis, requests) are installed without version pins. Builds are non-reproducible. |
| I2 | MEDIUM | Dockerfile.gpu line 3 | pip install requests is an unnecessary dependency for the GPU dashboard (only uses urllib). Adds ~300KB to the image. |
| I3 | MEDIUM | Dockerfile.gpu line 14 | Multi-process CMD with & and no process supervisor. If the collector crashes, it won't restart. The http.server also won't receive SIGTERM properly. |
| I4 | LOW | All Dockerfiles | No .dockerignore file. The entire context is sent to the Docker daemon, including .git directories and any local artifacts. |
| I5 | LOW | Dockerfile.dashboard (root) vs dashboard/Dockerfile.dashboard | Duplicate Dockerfiles with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |
4. Smart Queuing Analysis & Recommendations
Current State: No Smart Queuing
The queue service is a passive storage mechanism: it stores requests but has no intelligence.
- No load balancing: no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
- No job prioritization: FIFO only, no priority levels
- No backpressure: simple threshold, no exponential backoff or adaptive limits
- No retry logic: failed GPU requests go to queue but are never reprocessed
- No dead letter handling: stuck or failed jobs have no lifecycle management
- No consumer: nothing dequeues and forwards to GPUs
- No job tracking: no job IDs, no status updates, no result retrieval
Recommended Architecture: Smart Queue with Consumer
Agent > Nginx > Smart Queue API > Redis Streams (with consumers)
Consumer Pool > GPU 1 (load), GPU 2 (load), GPU 3 (load)
Health (per GPU) > Update GPU scores
Priority Queue (sorted by urgency)
Dead Letter Queue (failed jobs)
Backpressure (adaptive rate limit)
Specific Recommendations
R1: Implement Redis Streams as Queue Backend
- Replace LPUSH/RPUSH (FIFO list) with Redis Streams (XADD/XREADGROUP)
- Streams support consumer groups, message acknowledgment, and pending messages
- Enables proper dead letter queue handling and retry logic
- File: queue-service/queue-service.py
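# Before: Simple list
r.rpush(QUEUE_KEY, json.dumps(job))
# After: Redis Stream with consumer group
stream_key = "inference:stream"
consumer_group = "gpu-workers"
# Create the group once at startup (raises ResponseError if it already exists)
r.xgroup_create(stream_key, consumer_group, id="0", mkstream=True)
# Cap the stream at roughly 10000 entries
r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)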
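The consumer side of R1 might look like the sketch below: read a batch through the consumer group, process each entry, and acknowledge only the ones that succeeded. The function name consume_once and the handle callback are hypothetical; r is assumed to be a redis.Redis client created with decode_responses=True.

```python
import json

STREAM_KEY = "inference:stream"
GROUP = "gpu-workers"

def consume_once(r, consumer_name, handle, count=10, block_ms=5000):
    """Read one batch for this consumer, process it, and ack successes.

    handle(job) is a hypothetical callback that forwards one job to a GPU
    and raises on failure. Returns the number of acknowledged entries.
    """
    # ">" asks for entries never delivered to any consumer in the group
    batches = r.xreadgroup(GROUP, consumer_name, {STREAM_KEY: ">"},
                           count=count, block=block_ms)
    acked = 0
    for _stream, entries in batches or []:
        for entry_id, fields in entries:
            try:
                handle(json.loads(fields["job"]))
            except Exception:
                # Unacked entries stay in the pending list for a later retry
                continue
            r.xack(STREAM_KEY, GROUP, entry_id)
            acked += 1
    return acked
```

Entries left unacknowledged remain visible through the group's pending list, which is what makes retry and dead-letter handling possible.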
R2: Build a Queue Consumer Pool
- Deploy 1+ consumer containers that poll the stream and forward to GPUs
- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
- File: New queue-service/consumer.py
class LoadBalancedConsumer:
    def __init__(self, gpus):
        # gpus: objects exposing health, full, slots_idle and vram_free_mb
        self.gpus = gpus

    def select_gpu(self, job):
        """Select GPU based on load, health, and model compatibility."""
        # Model compatibility is not checked yet, so job is currently unused
        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
        if not candidates:
            return None
        # Sort by: slots_idle (descending), VRAM_available (descending)
        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
        return candidates[0]
R3: Implement Priority Queuing
- Add priority field to job payload: high, normal, low
- Use Redis Streams with multiple stream keys per priority level
- Consumer checks high -> normal -> low in order
- File: queue-service/queue-service.py enqueue endpoint
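A minimal sketch of R3 with one stream per priority level. The stream key names are hypothetical, and r is assumed to be a redis.Redis client created with decode_responses=True whose consumer group already exists on each stream.

```python
PRIORITY_STREAMS = {
    "high": "inference:stream:high",      # hypothetical key names
    "normal": "inference:stream:normal",
    "low": "inference:stream:low",
}
GROUP = "gpu-workers"

def stream_for(priority):
    """Map a job's priority field to its stream; unknown values fall back to normal."""
    return PRIORITY_STREAMS.get(priority, PRIORITY_STREAMS["normal"])

def next_entry(r, consumer_name):
    """Return (stream, entry_id, fields) from the highest-priority stream with work, or None."""
    for priority in ("high", "normal", "low"):
        stream = PRIORITY_STREAMS[priority]
        # No block argument: returns immediately if the stream has nothing new
        batches = r.xreadgroup(GROUP, consumer_name, {stream: ">"}, count=1)
        for _name, entries in batches or []:
            if entries:
                entry_id, fields = entries[0]
                return stream, entry_id, fields
    return None
```

Strict ordering like this can starve low-priority jobs under sustained high-priority load; an aging rule or a weighted pick would be needed if that matters.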
R4: Add Backpressure Mechanism
- Instead of hard threshold at 50, implement adaptive backpressure:
  - Queue depth 0-30: normal operation
  - Queue depth 30-40: return retry-after header with increasing delay
  - Queue depth 40-50: return 503 with exponential retry-after
  - Queue depth >50: circuit breaker open
- File: queue-service/queue-service.py
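The tiers above could be expressed as a single pure function. This is a sketch under assumptions: the tier boundaries overlap in the list, so depth 30 is treated as the start of the warning tier and depth 40 as the start of the 503 tier; the tier names, the 202 status for accepted jobs, and the specific delay ramps are illustrative choices.

```python
def backpressure(depth):
    """Map queue depth to (tier, http_status, retry_after_seconds)."""
    if depth < 30:
        return "normal", 202, 0
    if depth < 40:
        # Linear ramp: 1s at depth 30 up to 10s at depth 39
        return "warn", 202, depth - 29
    if depth <= 50:
        # Exponential ramp: 2s at depth 40, doubling every 2 jobs, capped at 60s
        return "throttle", 503, min(60, 2 ** (1 + (depth - 40) // 2))
    # Beyond 50 the breaker is open and every request is refused
    return "open", 503, 60
```

The enqueue endpoint would call this with the current depth and copy the third value into the response's Retry-After header whenever it is non-zero.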
R5: Dead Letter Queue (DLQ)
- Move failed/unprocessable jobs to an inference:dead-letter stream
- Include failure reason, attempt count, and original payload
- Provide admin API to inspect, retry, or discard DLQ entries
- File: queue-service/queue-service.py
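# New endpoints (assumes r was created with decode_responses=True)
@app.route("/dlq", methods=["GET"])
def list_dlq():
    entries = r.xrange("inference:dead-letter")
    return jsonify([{"id": entry_id, "fields": fields} for entry_id, fields in entries])

@app.route("/dlq/retry/<message_id>", methods=["POST"])
def retry_dlq(message_id):
    # Fetch one entry by using the same ID as both range bounds
    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
    if not entries:
        return jsonify({"error": "not found"}), 404
    _, fields = entries[0]
    # Assumes DLQ entries keep the original payload in a "job" field
    r.xadd("inference:stream", {"job": fields["job"]})
    return jsonify({"retried": message_id}), 202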
R6: GPU-Aware Routing
- Queue consumer should check GPU slots_busy before routing
- If a GPU is busy, try the next available GPU
- Track per-GPU queue depth and avoid overloading a single GPU
- File: New consumer logic
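The "try the next available GPU" rule could be sketched as follows. The shape of the gpus mapping and the send callback are hypothetical; model compatibility is deliberately left out to keep the example short.

```python
def route(gpus, send):
    """Try GPUs from least to most busy; return (name, result), or None if all fail.

    gpus maps name -> {"healthy": bool, "slots_busy": int, "slots_total": int}.
    send(name) is a hypothetical callable that forwards the job and raises on error.
    """
    candidates = [
        (info["slots_busy"], name)
        for name, info in gpus.items()
        if info["healthy"] and info["slots_busy"] < info["slots_total"]
    ]
    for _busy, name in sorted(candidates):
        try:
            return name, send(name)
        except Exception:
            continue  # this GPU failed; move on to the next candidate
    return None  # every GPU was unhealthy, full, or failed
```

A None result is the only case where the job should go back to the queue, which matches the intent of finding N3: fall back when all GPUs are unavailable, not when one is.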
R7: Job Status API
- Add job ID generation on enqueue
- Provide /status/<job_id> endpoint to check progress
- Store job state in Redis: queued -> processing -> completed/failed
- File: queue-service/queue-service.py
import json, time, uuid
from flask import jsonify, request

@app.route("/enqueue", methods=["POST"])
def enqueue():
    job_id = str(uuid.uuid4())
    # Assumption: the payload is the JSON request body
    payload = request.get_json(silent=True)
    job = {"id": job_id, "payload": payload, "status": "queued", "created_at": time.time()}
    r.xadd(stream_key, {"job": json.dumps(job)})
    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
    return jsonify({"job_id": job_id, "status": "queued"}), 202

@app.route("/status/<job_id>")
def job_status(job_id):
    status = r.hget("job:status", job_id)
    # Return 404 only when the job is unknown
    if status is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(json.loads(status))
R8: Health-Based Circuit Breaker
- Replace simple depth threshold with per-GPU circuit breakers
- Track consecutive failures per GPU
- Implement half-open state: after cooldown, probe one GPU to test recovery
- File: queue-service/queue-service.py
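A per-GPU breaker with a half-open state might be sketched like this. The class name, thresholds, and injectable clock are illustrative assumptions; the sketch is not thread-safe and would need a lock if shared between consumer threads.

```python
import time

class GpuBreaker:
    """Circuit breaker for one GPU: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed
        self.probing = False    # True while a half-open probe is in flight

    def allow(self):
        """Return True if a request may be sent to this GPU now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown and not self.probing:
            self.probing = True  # half-open: let exactly one probe through
            return True
        return False

    def record_success(self):
        """A success closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None
        self.probing = False

    def record_failure(self):
        """Open after max_failures in a row, or immediately if a probe failed."""
        self.failures += 1
        if self.probing or self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cooldown
        self.probing = False
```

The consumer would keep one GpuBreaker per backend, call allow() before routing, and report the outcome with record_success() or record_failure().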
R9: Centralized Configuration
- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
  - Redis config key: config:gpus
  - Or environment file mounted to all containers
- Nginx can use Lua/variable from config instead of static upstreams
- File: New config/ directory or Redis-based config
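Both options in R9 could share one loader, sketched below. The GPU_ENDPOINTS variable name and the JSON object format ({name: base_url}) are assumptions for illustration; only the config:gpus key comes from the recommendation itself.

```python
import json
import os

def load_gpus(environ=os.environ, redis_client=None):
    """Return {name: base_url} from Redis key config:gpus, else from the environment."""
    raw = None
    if redis_client is not None:
        raw = redis_client.get("config:gpus")
    if raw is None:
        raw = environ.get("GPU_ENDPOINTS")  # hypothetical variable name
    if raw is None:
        raise RuntimeError("no GPU configuration found")
    gpus = json.loads(raw)
    if not isinstance(gpus, dict) or not gpus:
        raise ValueError("GPU config must be a non-empty JSON object")
    return gpus
```

Failing loudly when no configuration is present avoids silently falling back to a stale hardcoded address, which is the failure mode behind Q2, Q3 and D3.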
5. Priority Issue Summary
Critical (Fix Immediately)
- Q1: Queue has no consumer; enqueued requests are never processed
- Q4: No job ID or result retrieval mechanism
- N3: Queue fallback triggers on individual GPU failure, not all-down
High (Fix Before Production)
- Q5: Circuit breaker has no recovery mechanism
- Q6: /status endpoint blocks on GPU health checks
- D1: Dashboard /api/status makes 4 sequential calls, up to 11s
- C2: Dockerfile.queue path mismatch in docker-compose
- I1: No dependency pinning in any Dockerfile
- I3: Multi-process CMD without supervisor in GPU dashboard
Medium (Improve in Next Iteration)
- Q3: Redis host default conflicts with Docker networking
- Q7: Silent exception swallowing in Redis access
- Q8: Content-Type header dropped in queue
- D2: Single-threaded dashboard server
- D3: Three separate sources of truth for GPU addresses
- G1: Sequential GPU collection blocks on single failure
- N1: Rate limit burst of 20 nodelay defeats protection
- N5: Hardcoded container names in Nginx
- C1: Queue API exposed externally
- C4: No Docker health checks
Low (Nice to Have)
- Q9: No graceful shutdown
- C3: restart: always vs unless-stopped
- C5: No Redis authentication
- G4: 60s dead threshold is too aggressive
- I2: Unnecessary requests dependency
- I4: No .dockerignore
- I5: Duplicate Dockerfiles
6. Deployment Architecture Summary
What Works Well
- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
- SSE streaming support in Nginx for LLM response streaming
- Rate limiting at the gateway layer
- Circuit breaker pattern implemented (even if basic)
What Needs Work
- Queue is incomplete: storage without processing is the most critical gap
- No job lifecycle: requests go in and never come out
- Duplicated configuration: GPU addresses in 3+ places
- No monitoring/alerting: no Prometheus metrics, no alerting rules
- Single point of failure: no Redis replication, no container redundancy
- No logging: Flask dev server logs are minimal; no structured logging
Recommended Next Steps
- Priority 1: Implement queue consumer with GPU load-based routing
- Priority 2: Add job status tracking and result retrieval
- Priority 3: Fix Nginx fallback to only trigger when ALL GPUs are down
- Priority 4: Add Docker health checks and proper dependency management
- Priority 5: Centralize GPU configuration in Redis or environment
- Priority 6: Add Prometheus metrics endpoint for observability
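As a starting point for Priority 6, a /metrics endpoint could return gauges in the Prometheus text exposition format without pulling in a client library. This is a sketch only: the metric names syslog_queue_depth and syslog_gpu_up are hypothetical, and a real deployment may prefer an official client library.

```python
def render_metrics(queue_depth, gpu_up):
    """Render queue depth and per-GPU health as Prometheus text-format gauges."""
    lines = [
        "# HELP syslog_queue_depth Number of jobs waiting in the queue.",
        "# TYPE syslog_queue_depth gauge",
        f"syslog_queue_depth {queue_depth}",
        "# HELP syslog_gpu_up Whether the GPU backend passed its last health check.",
        "# TYPE syslog_gpu_up gauge",
    ]
    # One labelled sample per GPU, sorted for stable output
    for name in sorted(gpu_up):
        lines.append(f'syslog_gpu_up{{gpu="{name}"}} {1 if gpu_up[name] else 0}')
    return "\n".join(lines) + "\n"
```

Served with Content-Type text/plain from queue-service, this output would give a scraper the two signals the review relies on most: queue depth and GPU availability.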