feat: Smart Queue Consumer implementation draft + architecture review

- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines) with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts

Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector

Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API
2026-05-17 03:55:20 +00:00
parent e95475f431
commit b09a93f45c
15 changed files with 3895 additions and 1 deletions
@@ -0,0 +1,390 @@
+# Syslog Harness: Architecture Review & Improvement Recommendations
+
+**Date:** 2026-05-17  
+**Commit:** `e95475f` - "Add GPU dashboard container + Nginx routing"  
+**Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git
+
+---
+
+## 1. Current Architecture Overview
+
+```
+Host (192.168.68.123)
+
+Agent :8080 --> Nginx Router --> Queue Service --> Dashboard
+                :8080            :8091             :3001
+                  |                |
+                  v                v
+                GPU Pool         Redis --------->  GPU Dashboard
+                :8080            :6379             :8092
+                  |
+       +----------+----------+
+       |          |          |
+     amdpve     llmgpu     ocu_llm
+     .15:8080   .8:8080    .110:8080
+     MoE 35B    Dense 27B  Light 4B
+```
+
+### Services
+
+| Service | Port | Container | Image | Purpose |
+|---|---|---|---|---|
+| **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
+| **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
+| **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
+| **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
+| **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |
+
+### GPU Backends
+
+| Host | GPU | Model | Capacity |
+|---|---|---|---|
+| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
+| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
+| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |
+
+### Data Flow
+
+1. **Agent** sends request with `X-Syslog-Model` header to Nginx :8080
+2. **Nginx** routes to appropriate GPU based on header mapping
+3. **GPU backend** (llama.cpp) processes request
+4. **Fallback:** If GPU returns 502/503/timeout, Nginx redirects to queue-service :8091
+5. **Queue** stores request in Redis `inference:requests` LPUSH
+6. **Dashboard** :3001 polls queue-service + GPU health for display
+7. **GPU Dashboard** :8092 collects hardware metrics every 10s
+
+---
+
+## 2. File Inventory
+
+```
+docker-compose.yml                          # Main compose (Docker networking)
+gpu-router-docker.conf                      # Nginx config for Docker deployment
+Dockerfile.gpu                              # GPU dashboard container
+Dockerfile.dashboard                        # Dashboard container (root-level)
+queue-service/Dockerfile                    # Queue service container
+queue-service/queue-service.py              # Queue logic (121 lines)
+dashboard/harness-dashboard.py              # Dashboard app (133 lines)
+dashboard/Dockerfile                        # Dashboard container (subdir)
+dashboard/Dockerfile.dashboard              # Dashboard container (duplicate)
+gpu-dashboard/gpu_collector.py              # GPU hardware collector (115 lines)
+gpu-dashboard/gpu.html                      # GPU dashboard UI (183 lines)
+gpu-dashboard/collector.py                  # Duplicate collector (hermes-workspace path)
+gpu-dashboard/start.sh                      # Legacy startup script
+MIGRATION_PLAN.md                           # Production migration plan
+README.md                                   # Documentation
+syslog-harness-check/                       # Checkpoint subdirectory (mirror)
+```
+
+---
+
+## 3. Detailed Findings
+
+### 3.1 Queue Service (`queue-service/queue-service.py`)
+
+**Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
+| Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. No configuration source of truth. |
+| Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
+| Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
+| Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once tripped (open), it stays open until the queue is manually drained. |
+| Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with 3s timeout means `/status` can take up to 9s. |
+| Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
+| Q8 | **MEDIUM** | Lines 83-84 | **Headers filtered to only X-* prefixed.** The `Content-Type` header is dropped entirely, so the receiver can't determine the payload format. |
+| Q9 | **LOW** | Line 121 | **No graceful shutdown.** Flask development server doesn't handle SIGTERM gracefully. |
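
One way to address Q6 is to probe the GPUs concurrently, so `/status` latency is bounded by the slowest probe (about 3s) rather than the sum (up to 9s). The sketch below is a minimal illustration under stated assumptions: `check_all_gpus`, the `probe` callable, and the `/health` path are hypothetical names, not taken from `queue-service.py`; only the host addresses come from the GPU Backends table.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hosts come from the GPU Backends table; the /health path is an assumption.
GPU_ENDPOINTS = {
    "amdpve": "http://192.168.68.15:8080/health",
    "llmgpu": "http://192.168.68.8:8080/health",
    "ocu_llm": "http://192.168.68.110:8080/health",
}


def check_all_gpus(probe, endpoints=GPU_ENDPOINTS):
    """Run probe(url) for every GPU in parallel; return {name: "up" | "down"}."""

    def safe_probe(item):
        name, url = item
        try:
            return name, "up" if probe(url) else "down"
        except Exception:
            # A failing probe marks only that GPU as down.
            return name, "down"

    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        return dict(pool.map(safe_probe, endpoints.items()))


if __name__ == "__main__":
    def slow_probe(url):
        time.sleep(0.2)  # stand-in for a network round trip
        return "68.110" not in url

    start = time.monotonic()
    print(check_all_gpus(slow_probe))
    print(f"elapsed: {time.monotonic() - start:.2f}s")
```

Passing the probe in as a callable keeps the timeout policy (for example `requests.get(url, timeout=3)`) outside the fan-out logic, which also makes the function easy to test without real GPUs.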
+
+### 3.2 Nginx Gateway (`gpu-router-docker.conf`)
+
+**Architecture:** Nginx routes requests to GPU backends based on `X-Syslog-Model` header value. Has rate limiting, streaming support, and queue fallback.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** means 20 requests are served immediately beyond the rate limit, then throttled. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
+| N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `tries 2`** means on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
+| N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down. |
+| N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is misapplied. `proxy_pass_header` controls which *response* headers from the upstream are passed back to the client; it has no effect on request headers. Nginx forwards client request headers to the upstream by default, so the model header already reaches the backend without it. |
+| N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change based on docker-compose project prefix. Should use service names. |
+| N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** has `X-Forwarded-Proto` but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations. |
+
+### 3.3 Dashboard (`dashboard/harness-dashboard.py`)
+
+**Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time. |
+| D2 | **MEDIUM** | Lines 101-127 | **Uses `SimpleHTTPRequestHandler` on a single-threaded server.** Requests are handled one at a time, so concurrent dashboard access queues up. Should use `ThreadingHTTPServer`. |
+| D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
+| D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |
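
For D2, the standard library already ships a threaded server class, so the change can be small. The following is a minimal sketch, not the dashboard's actual code: `StatusHandler` and `make_server` are hypothetical names, and the JSON body is a placeholder for the real status payload.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical handler; the real dashboard renders HTML and GPU health."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=0):
    # ThreadingHTTPServer handles each request in its own thread.
    # port=0 lets the OS pick a free port (useful for local experiments).
    return ThreadingHTTPServer(("127.0.0.1", port), StatusHandler)


if __name__ == "__main__":
    server = make_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=2)
    conn.request("GET", "/api/status")
    print(conn.getresponse().read())
    conn.close()
    server.shutdown()
    server.server_close()
```

The handler class stays unchanged when switching server classes, so an existing `SimpleHTTPRequestHandler` subclass could be passed to `ThreadingHTTPServer` in the same way.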
+
+### 3.4 GPU Dashboard (`gpu-dashboard/`)
+
+**Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to `gpu_metrics.json`. Static HTTP server serves the dashboard.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
+| G2 | **HIGH** | Lines 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files. |
+| G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
+| G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
+| G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment. |
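
For G3, the collector could return a small result object instead of swallowing exceptions, so the caller can tell "no data" from "collector error". This is a sketch under assumptions: the return shape (`ok`, `data`, `error`) is invented for illustration and is not what `gpu_collector.py` currently returns.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, timeout=3):
    """Return {"ok": True, "data": ...} or {"ok": False, "error": "..."}.

    Unlike a bare `except: return None`, the caller can tell a collector
    error apart from a host that simply reported no data.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"ok": True, "data": json.loads(resp.read().decode("utf-8"))}
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # URLError: refused or unreachable host; OSError: socket timeout;
        # ValueError: malformed URL or a body that is not valid JSON.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

The error string can then be written into `gpu_metrics.json` next to the host entry, which gives the dashboard something to display when a sidecar times out.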
+
+### 3.5 Docker Compose (`docker-compose.yml`)
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
+| C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
+| C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. With `always`, a manually stopped container comes back when the Docker daemon restarts, making maintenance harder. |
+| C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect if a service is actually healthy, only if the container is running. |
+| C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
+| C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor: defaults to bridge). No IPAM configuration for large deployments. |
+
+### 3.6 Container Images
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
+| I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`**: unnecessary dependency for the GPU dashboard (only uses `urllib`). Adds ~300KB to the image. |
+| I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`**: no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
+| I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
+| I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |
+
+---
+
+## 4. Smart Queuing Analysis & Recommendations
+
+### Current State: No Smart Queuing
+
+The queue service is a **passive storage mechanism**: it stores requests but has no intelligence.
+
+- **No load balancing**: no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
+- **No job prioritization**: FIFO only, no priority levels
+- **No backpressure**: simple threshold, no exponential backoff or adaptive limits
+- **No retry logic**: failed GPU requests go to queue but are never reprocessed
+- **No dead letter handling**: stuck or failed jobs have no lifecycle management
+- **No consumer**: nothing dequeues and forwards to GPUs
+- **No job tracking**: no job IDs, no status updates, no result retrieval
+
+### Recommended Architecture: Smart Queue with Consumer
+
+```
+Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
+                                              |
+                                        Consumer Pool
+                                              |
+                       +----------------------+----------------------+
+                       |                      |                      |
+                 GPU 1 (load)           GPU 2 (load)           GPU 3 (load)
+                       |                      |                      |
+                    Health                 Health                 Health
+                       |                      |                      |
+                       +----------------------+----------------------+
+                                              |
+                                      Update GPU scores
+                                              |
+                             Priority Queue (sorted by urgency)
+                             Dead Letter Queue (failed jobs)
+                             Backpressure (adaptive rate limit)
+```
+
+### Specific Recommendations
+
+#### R1: Implement Redis Streams as Queue Backend
+- Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
+- Streams support consumer groups, message acknowledgment, and pending messages
+- Enables proper dead letter queue handling and retry logic
+- **File:** `queue-service/queue-service.py`
+
+```python
+# Before: simple list
+r.rpush(QUEUE_KEY, json.dumps(job))
+
+# After: Redis Stream with consumer group
+stream_key = "inference:stream"
+consumer_group = "gpu-workers"
+# redis-py spells the trimming flag `approximate` (there is no `approx` keyword)
+r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
+```
+
+#### R2: Build a Queue Consumer Pool
+- Deploy 1+ consumer containers that poll the stream and forward to GPUs
+- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
+- **File:** New `queue-service/consumer.py`
+
+```python
+class LoadBalancedConsumer:
+    def select_gpu(self, job):
+        """Select GPU based on load, health, and model compatibility."""
+        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
+        if not candidates:
+            return None
+        # Sort by: slots_idle (descending), VRAM_available (descending)
+        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
+        return candidates[0]
+```
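
The consumer side of R1 and R2 can be sketched as a single read/forward/acknowledge step. This is an illustration under assumptions, not the draft in `SMART_QUEUE_IMPLEMENTATION.md`: `consume_once` and the `forward` callable are hypothetical, the client is assumed to be a `redis.Redis` created with `decode_responses=True`, and the consumer group is assumed to exist already.

```python
import json

STREAM_KEY = "inference:stream"
GROUP = "gpu-workers"


def consume_once(r, consumer_name, forward, block_ms=5000):
    """Read one job from the group, forward it, and acknowledge on success.

    r is a redis.Redis client created with decode_responses=True.
    forward(job) is a hypothetical callable that sends the job to a GPU
    and returns True on success.
    Returns the message ID that was handled, or None if nothing arrived.
    """
    # ">" asks for messages never delivered to any consumer in this group.
    reply = r.xreadgroup(GROUP, consumer_name, {STREAM_KEY: ">"}, count=1, block=block_ms)
    if not reply:
        return None
    _stream, messages = reply[0]
    message_id, fields = messages[0]
    job = json.loads(fields["job"])
    if forward(job):
        # Only acknowledged messages leave the group's pending entries list.
        r.xack(STREAM_KEY, GROUP, message_id)
    return message_id
```

Jobs whose `forward` call fails stay in the pending entries list, which is what later retry or dead-letter logic would inspect.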
+
+#### R3: Implement Priority Queuing
+- Add priority field to job payload: `high`, `normal`, `low`
+- Use Redis Streams with multiple stream keys per priority level
+- Consumer checks `high` -> `normal` -> `low` in order
+- **File:** `queue-service/queue-service.py` enqueue endpoint
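
The priority check in R3 amounts to polling one stream per level in a fixed order. A minimal sketch, assuming hypothetical stream names (`inference:stream:high` and so on) and a `redis.Redis` client whose consumer group exists on each stream:

```python
# Stream names are assumptions; the review only names the three levels.
PRIORITY_STREAMS = [
    "inference:stream:high",
    "inference:stream:normal",
    "inference:stream:low",
]


def read_by_priority(r, group, consumer):
    """Return (stream, message_id, fields) from the highest-priority
    stream that has an undelivered message, or None if all are empty."""
    for stream in PRIORITY_STREAMS:
        # No `block` argument: return immediately if this level is empty.
        reply = r.xreadgroup(group, consumer, {stream: ">"}, count=1)
        if reply:
            _name, messages = reply[0]
            if messages:
                message_id, fields = messages[0]
                return stream, message_id, fields
    return None
```

Strict ordering like this can starve `low` jobs under sustained `high` traffic, so a production version would likely need an aging or quota rule on top.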
+
+#### R4: Add Backpressure Mechanism
+- Instead of hard threshold at 50, implement **adaptive backpressure**:
+  - Queue depth 0-30: normal operation
+  - Queue depth 30-40: return `retry-after` header with increasing delay
+  - Queue depth 40-50: return 503 with exponential retry-after
+  - Queue depth >50: circuit breaker open
+- **File:** `queue-service/queue-service.py`
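
The bands above can be expressed as a pure function of queue depth. The sketch below is one possible reading: the band edges overlap in the list (30 and 40 appear twice), so the boundary handling, the delay values, and the state labels are all assumptions chosen for illustration.

```python
def backpressure(depth):
    """Map queue depth to (http_status, retry_after_seconds, state)."""
    if depth < 30:
        return 202, 0, "normal"
    if depth < 40:
        # Linear ramp: 1s at depth 30 up to 10s at depth 39 (assumed values).
        return 202, depth - 29, "warn"
    if depth <= 50:
        # Exponential ramp: 2s at depth 40, doubling every 2 jobs, capped at 60s.
        return 503, min(60, 2 ** (1 + (depth - 40) // 2)), "shed"
    # Beyond 50 the circuit breaker is open; callers get the maximum delay.
    return 503, 60, "open"
```

In the Flask handler, the second element would be sent as the `Retry-After` header whenever it is non-zero.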
+
+#### R5: Dead Letter Queue (DLQ)
+- Move failed/unprocessable jobs to a `inference:dead-letter` stream
+- Include failure reason, attempt count, and original payload
+- Provide admin API to inspect, retry, or discard DLQ entries
+- **File:** `queue-service/queue-service.py`
+
+```python
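+# New endpoints (assumes r uses decode_responses=True and DLQ entries keep a "job" field)
+@app.route("/dlq", methods=["GET"])
+def list_dlq():
+    entries = r.xrange("inference:dead-letter")
+    return jsonify([{"id": msg_id, "fields": fields} for msg_id, fields in entries])
+
+@app.route("/dlq/retry/<message_id>", methods=["POST"])
+def retry_dlq(message_id):
+    # Redis has no XGET; fetch a single entry with XRANGE bounded to its ID
+    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
+    if not entries:
+        return jsonify({"error": "not found"}), 404
+    _, fields = entries[0]
+    r.xadd("inference:stream", {"job": fields["job"]})
+    r.xdel("inference:dead-letter", message_id)
+    return jsonify({"retried": message_id}), 202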
+```
+
+#### R6: GPU-Aware Routing
+- Queue consumer should check GPU `slots_busy` before routing
+- If a GPU is busy, try the next available GPU
+- Track per-GPU queue depth and avoid overloading a single GPU
+- **File:** New consumer logic
+
+#### R7: Job Status API
+- Add job ID generation on enqueue
+- Provide `/status/<job_id>` endpoint to check progress
+- Store job state in Redis: `queued` -> `processing` -> `completed`/`failed`
+- **File:** `queue-service/queue-service.py`
+
+```python
+@app.route("/enqueue", methods=["POST"])
+def enqueue():
+    job_id = str(uuid.uuid4())
+    job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
+    r.xadd(stream_key, {"job": json.dumps(job)})
+    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
+    return jsonify({"job_id": job_id, "status": "queued"}), 202
+
+@app.route("/status/<job_id>")
+def job_status(job_id):
+    status = r.hget("job:status", job_id)
+    # Return 404 only when the job ID is unknown
+    if status is None:
+        return jsonify({"error": "not found"}), 404
+    return jsonify(json.loads(status))
+```
+
+#### R8: Health-Based Circuit Breaker
+- Replace simple depth threshold with **per-GPU circuit breakers**
+- Track consecutive failures per GPU
+- Implement half-open state: after cooldown, probe one GPU to test recovery
+- **File:** `queue-service/queue-service.py`
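
The closed, open, and half-open states in R8 can be sketched as a small state machine kept per GPU. This is a minimal illustration, not the implementation draft: the class name, the default threshold of 3 failures, and the 30s cooldown are assumptions.

```python
import time


class GpuCircuitBreaker:
    """Per-GPU breaker: closed -> open after N consecutive failures,
    then half-open once the cooldown has elapsed."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed
        self.probe_in_flight = False

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def allow_request(self):
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self.probe_in_flight:
            self.probe_in_flight = True  # let exactly one probe through
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self):
        self.failures += 1
        self.probe_in_flight = False
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)start the cooldown


if __name__ == "__main__":
    breakers = {name: GpuCircuitBreaker() for name in ("amdpve", "llmgpu", "ocu_llm")}
    for _ in range(3):
        breakers["llmgpu"].record_failure()
    print({name: b.state for name, b in breakers.items()})
```

Injecting the clock keeps the cooldown testable without sleeping; the consumer would call `allow_request()` before routing to a GPU and report the outcome afterwards.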
+
+#### R9: Centralized Configuration
+- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
+  - Redis config key: `config:gpus`
+  - Or environment file mounted to all containers
+- Nginx can use Lua/variable from config instead of static upstreams
+- **File:** New `config/` directory or Redis-based config
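
For the environment-file option in R9, each Python service could load the GPU list through one shared helper. A minimal sketch under assumptions: the `GPU_CONFIG` variable name, the `name`/`url` entry shape, and the `load_gpu_config` helper are hypothetical; only the host addresses come from the GPU Backends table.

```python
import json
import os

# Fallback values mirror the GPU Backends table.
DEFAULT_GPUS = [
    {"name": "amdpve", "url": "http://192.168.68.15:8080"},
    {"name": "llmgpu", "url": "http://192.168.68.8:8080"},
    {"name": "ocu_llm", "url": "http://192.168.68.110:8080"},
]


def load_gpu_config(environ=os.environ):
    """Read the GPU list from the GPU_CONFIG variable (a JSON array),
    falling back to DEFAULT_GPUS when it is unset or empty."""
    raw = environ.get("GPU_CONFIG")
    if not raw:
        return DEFAULT_GPUS
    gpus = json.loads(raw)
    for gpu in gpus:
        if not {"name", "url"} <= gpu.keys():
            raise ValueError(f"GPU entry missing name/url: {gpu!r}")
    return gpus
```

Validating at load time means a malformed entry fails the container at startup instead of surfacing later as a routing error.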
+
+---
+
+## 5. Priority Issue Summary
+
+### Critical (Fix Immediately)
+1. **Q1**: Queue has no consumer; enqueued requests are never processed
+2. **Q4**: No job ID or result retrieval mechanism
+3. **N3**: Queue fallback triggers on individual GPU failure, not all-down
+
+### High (Fix Before Production)
+4. **Q5**: Circuit breaker has no recovery mechanism
+5. **Q6**: `/status` endpoint blocks on GPU health checks
+6. **D1**: Dashboard `/api/status` makes 4 sequential calls, up to 11s
+7. **C2**: `Dockerfile.queue` path mismatch in docker-compose
+8. **I1**: No dependency pinning in any Dockerfile
+9. **I3**: Multi-process CMD without supervisor in GPU dashboard
+
+### Medium (Improve in Next Iteration)
+10. **Q3**: Redis host default conflicts with Docker networking
+11. **Q7**: Silent exception swallowing in Redis access
+12. **Q8**: Content-Type header dropped in queue
+13. **D2**: Single-threaded dashboard server
+14. **D3**: Three separate sources of truth for GPU addresses
+15. **G1**: Sequential GPU collection blocks on single failure
+16. **N1**: Rate limit burst of 20 nodelay defeats protection
+17. **N5**: Hardcoded container names in Nginx
+18. **C1**: Queue API exposed externally
+19. **C4**: No Docker health checks
+
+### Low (Nice to Have)
+20. **Q9**: No graceful shutdown
+21. **C3**: `restart: always` vs `unless-stopped`
+22. **C5**: No Redis authentication
+23. **G4**: 60s dead threshold is too aggressive
+24. **I2**: Unnecessary `requests` dependency
+25. **I4**: No `.dockerignore`
+26. **I5**: Duplicate Dockerfiles
+
+---
+
+## 6. Deployment Architecture Summary
+
+### What Works Well
+- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
+- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
+- SSE streaming support in Nginx for LLM response streaming
+- Rate limiting at the gateway layer
+- Circuit breaker pattern implemented (even if basic)
+
+### What Needs Work
+- **Queue is incomplete**: storage without processing is the most critical gap
+- **No job lifecycle**: requests go in and never come out
+- **Duplicated configuration**: GPU addresses in 3+ places
+- **No monitoring/alerting**: no Prometheus metrics, no alerting rules
+- **Single point of failure**: no Redis replication, no container redundancy
+- **No logging**: Flask dev server logs are minimal; no structured logging
+
+### Recommended Next Steps
+1. **Priority 1:** Implement queue consumer with GPU load-based routing
+2. **Priority 2:** Add job status tracking and result retrieval
+3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
+4. **Priority 4:** Add Docker health checks and proper dependency management
+5. **Priority 5:** Centralize GPU configuration in Redis or environment
+6. **Priority 6:** Add Prometheus metrics endpoint for observability