syslog-harness/ARCHITECTURE_REVIEW.md
SyslogBot b09a93f45c feat: Smart Queue Consumer implementation draft + architecture review
- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines)
  with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts

Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector

Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API
2026-05-17 03:55:20 +00:00


Syslog Harness Architecture Review & Improvement Recommendations

Date: 2026-05-17
Commit: e95475f "Add GPU dashboard container + Nginx routing"
Repo: http://192.168.68.17:3000/SyslogSolution/syslog-harness.git


1. Current Architecture Overview

                          
Host (192.168.68.123)

  Agent --:8080--> Nginx Router ---> Queue Service ---> Dashboard
                   :8080             :8091              :3001
                     |                 |
                     v                 v
                   GPU Pool          Redis -----------> GPU Dashboard
                   :8080             :6379              :8092
                     |
          +----------+----------+
          |          |          |
        amdpve     llmgpu     ocu_llm
        .15:8080   .8:8080    .110:8080
        MoE 35B    Dense 27B  Light 4B

Services

| Service | Port | Container | Image | Purpose |
|---|---|---|---|---|
| Nginx Router | 8080 | Host-level | OS nginx | Routes by X-Syslog-Model header |
| Queue Service | 8091 | syslog-queue | python:3.13-slim | Request queue + circuit breaker |
| Dashboard | 3001 | syslog-dashboard | python:3.11-slim | Observability UI + GPU health |
| GPU Dashboard | 8092 | syslog-gpu-dashboard | python:3.11-slim | Hardware metrics (temp, VRAM, power) |
| Redis | 6379 | syslog-redis | redis:7-alpine | Queue storage |

GPU Backends

| Host | GPU | Model | Capacity |
|---|---|---|---|
| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |

Data Flow

  1. Agent sends a request with the X-Syslog-Model header to Nginx :8080
  2. Nginx routes to the appropriate GPU based on the header mapping
  3. GPU backend (llama.cpp) processes the request
  4. Fallback: if the GPU returns 502/503 or times out, Nginx redirects to queue-service :8091
  5. Queue service stores the request in the Redis list inference:requests via LPUSH
  6. Dashboard :3001 polls queue-service and GPU health for display
  7. GPU Dashboard :8092 collects hardware metrics every 10s

2. File Inventory

docker-compose.yml                          # Main compose (Docker networking)
gpu-router-docker.conf                      # Nginx config for Docker deployment
Dockerfile.gpu                              # GPU dashboard container
Dockerfile.dashboard                        # Dashboard container (root-level)
queue-service/Dockerfile                    # Queue service container
queue-service/queue-service.py              # Queue logic (121 lines)
dashboard/harness-dashboard.py              # Dashboard app (133 lines)
dashboard/Dockerfile                        # Dashboard container (subdir)
dashboard/Dockerfile.dashboard              # Dashboard container (duplicate)
gpu-dashboard/gpu_collector.py              # GPU hardware collector (115 lines)
gpu-dashboard/gpu.html                      # GPU dashboard UI (183 lines)
gpu-dashboard/collector.py                  # Duplicate collector (hermes-workspace path)
gpu-dashboard/start.sh                      # Legacy startup script
MIGRATION_PLAN.md                           # Production migration plan
README.md                                   # Documentation
syslog-harness-check/                       # Checkpoint subdirectory (mirror)

3. Detailed Findings

3.1 Queue Service (queue-service/queue-service.py)

Architecture: Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.

Issues Found:

| # | Severity | Location | Issue |
|---|---|---|---|
| Q1 | CRITICAL | Lines 82-88 | Queue is fire-and-forget with no consumer. Requests are pushed to Redis but nothing dequeues or processes them. The queue is dead storage. |
| Q2 | CRITICAL | Lines 28-32 | Hardcoded GPU IPs in the queue service duplicate the Nginx config. There is no single configuration source of truth. |
| Q3 | HIGH | Lines 21-22 | Redis host fallback to 192.168.68.7 (line 21) conflicts with docker-compose, which sets REDIS_HOST=redis (line 24). The default is unreachable inside Docker. |
| Q4 | HIGH | Lines 66-95 | No job result retrieval mechanism. Once a request is enqueued, there is no job ID and no API to poll for completion or retrieve results. |
| Q5 | HIGH | Lines 73-79 | Circuit breaker is a simple depth threshold. No backoff, no recovery window, no sliding window. Once tripped, it stays tripped until the queue is manually drained. |
| Q6 | MEDIUM | Lines 50-57 | GPU health check is synchronous and blocks the /status endpoint. Checking 3 GPUs sequentially with a 3s timeout means /status can take up to 9s. |
| Q7 | MEDIUM | Lines 35-40 | get_redis() swallows all exceptions and returns None. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
| Q8 | MEDIUM | Lines 83-84 | Headers are filtered to only X-* prefixed ones, so the Content-Type header is dropped entirely and the receiver can't determine the payload format. |
| Q9 | LOW | Line 121 | No graceful shutdown. The Flask development server doesn't handle SIGTERM gracefully. |
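
One way to address Q6 is to probe the GPUs concurrently, so /status is bounded by the slowest probe (about 3s) rather than the sum of all three (9s). The following is a minimal sketch, not the project's code: the /health path and the names GPU_ENDPOINTS, probe, and check_all are assumptions, and the addresses are taken from the GPU Backends table.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Addresses from the GPU Backends table; the /health path is an assumption.
GPU_ENDPOINTS = {
    "amdpve": "http://192.168.68.15:8080/health",
    "llmgpu": "http://192.168.68.8:8080/health",
    "ocu_llm": "http://192.168.68.110:8080/health",
}


def probe(url, timeout=3.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, timeouts and connection errors derive from OSError
        return False


def check_all(endpoints, probe_fn=probe):
    """Probe every endpoint in parallel and return {name: healthy}."""
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        futures = {name: pool.submit(probe_fn, url) for name, url in endpoints.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

The probe_fn parameter exists only so the fan-out can be exercised without real GPUs. The same pattern would apply to the dashboard's sequential calls (D1) and the collector loop (G1).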

3.2 Nginx Gateway (gpu-router-docker.conf)

Architecture: Nginx routes requests to GPU backends based on X-Syslog-Model header value. Has rate limiting, streaming support, and queue fallback.

Issues Found:

| # | Severity | Location | Issue |
|---|---|---|---|
| N1 | HIGH | Lines 79-80 | burst=20 nodelay means up to 20 requests beyond the rate limit are served immediately, and only requests beyond that burst are rejected. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
| N2 | HIGH | Lines 99-100 | proxy_next_upstream with tries 2 means that on error/timeout/502/503, Nginx retries once. But it retries against the same GPU pool, not a different one, so the GPU that failed gets hit again. |
| N3 | HIGH | Lines 106, 112-121 | Queue fallback (@queue_fallback) is triggered for ANY 502/503/504, including when a single GPU is overloaded. Individual GPU slowness therefore causes queue fallback, instead of queuing only when ALL GPUs are down. |
| N4 | MEDIUM | Line 90 | proxy_pass_header X-Syslog-Model is non-standard. Nginx automatically passes request headers; this directive is for response headers. The model header is already passed implicitly via proxy_set_header inheritance. |
| N5 | MEDIUM | Lines 27, 32 | Hardcoded container names (syslog-harness-dashboard-1, syslog-harness-gpu-dashboard-1). These change with the docker-compose project prefix. Should use service names. |
| N6 | LOW | Lines 67-73 | GPU dashboard at the /gpu path sets X-Forwarded-Proto, but the dashboard service (a simple HTTP server) doesn't use it. Inconsistent header handling across locations. |

3.3 Dashboard (dashboard/harness-dashboard.py)

Architecture: Simple HTTP server using Python's http.server. Fetches queue status and GPU health, renders HTML.

Issues Found:

| # | Severity | Location | Issue |
|---|---|---|---|
| D1 | HIGH | Lines 34-40 | get_queue_status() calls queue-service synchronously. Combined with per-GPU health checks (lines 18-31), the /api/status endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time. |
| D2 | MEDIUM | Lines 101-127 | Uses SimpleHTTPRequestHandler, which is served single-threaded. Under concurrent dashboard access, requests queue up. Should use ThreadingHTTPServer. |
| D3 | MEDIUM | Lines 16-18 | GPU endpoints are hardcoded in the dashboard, separately from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
| D4 | LOW | Line 127 | Silent log suppression. While intentional, this makes debugging impossible without modifying the source. |
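
For D2, the standard library's ThreadingHTTPServer handles each request in its own thread, so one slow status call no longer blocks other viewers. A minimal sketch, assuming a stand-in handler (StatusHandler and make_server are hypothetical names, not the dashboard's actual code):

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the dashboard's request handler."""

    def do_GET(self):
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=0):
    """Create a server that handles each request in its own thread.

    port=0 lets the OS pick a free port; the dashboard would pass 3001.
    """
    return ThreadingHTTPServer((host, port), StatusHandler)
```

Calling make_server(host="0.0.0.0", port=3001).serve_forever() would start it. The existing handler class could be passed in place of StatusHandler without other changes.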

3.4 GPU Dashboard (gpu-dashboard/)

Architecture: gpu_collector.py polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to gpu_metrics.json. Static HTTP server serves the dashboard.

Issues Found:

| # | Severity | Location | Issue |
|---|---|---|---|
| G1 | HIGH | Lines 97-98 | Sequential collection. All 3 GPUs are polled one after another (line 98: list comprehension). If one host is unreachable, its timeout delays collection for all three. |
| G2 | HIGH | Lines 105-107 | The /app/public/gpu_metrics.json path is hardcoded and differs from collector.py (line 11: /root/hermes-workspace/public/gpu_metrics.json). Inconsistent between the two collector files. |
| G3 | MEDIUM | Lines 19-25 | fetch_json swallows all exceptions. A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
| G4 | MEDIUM | Line 14 | DEAD_THRESHOLD = 60 seconds is aggressive. A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
| G5 | LOW | Lines 10-14 | start.sh references /root/hermes-workspace/public but Dockerfile.gpu creates /app/public. Inconsistent between legacy and current deployment. |
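
For G3, the fetch helper could return an explicit result instead of swallowing exceptions, so the collector can tell "no data" apart from "collector error". This is a sketch under assumptions: the return shape and the opener parameter are illustrative and not taken from gpu_collector.py.

```python
import json
import urllib.request


def fetch_json(url, timeout=3.0, opener=urllib.request.urlopen):
    """Fetch JSON and report failures explicitly instead of hiding them.

    Returns {"ok": True, "data": ...} on success, or
    {"ok": False, "error": "<ExceptionType>: <message>"} on failure.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return {"ok": True, "data": json.loads(resp.read())}
    except (OSError, ValueError) as exc:
        # OSError covers URLError and timeouts; ValueError covers malformed JSON.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

The error string can be written into gpu_metrics.json next to the GPU entry, which lets the dashboard show why a host has no metrics.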

3.5 Docker Compose (docker-compose.yml)

Issues Found:

| # | Severity | Location | Issue |
|---|---|---|---|
| C1 | HIGH | Lines 19-20 | Queue service exposes port 8091 externally. In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
| C2 | MEDIUM | Lines 13-15 | Dockerfile.queue is referenced but doesn't exist at root level. The file is at queue-service/Dockerfile. The compose build context is . (root) but the dockerfile path doesn't match. |
| C3 | MEDIUM | Lines 6, 16, 26, 31, 43 | restart: always instead of restart: unless-stopped. With always, a manually stopped container is started again when the Docker daemon restarts, making maintenance harder. |
| C4 | LOW | Lines 23-25 | No health checks defined for any service. Docker can't detect whether a service is actually healthy, only whether the container is running. |
| C5 | LOW | Line 10 | Redis has no password. Unauthenticated Redis is exposed on the Docker network. |
| C6 | LOW | Lines 49-51 | No network driver specified for the bridge network (minor: defaults to bridge). No IPAM configuration for large deployments. |

3.6 Container Images

Issues Found:

| # | Severity | Location | Issue |
|---|---|---|---|
| I1 | HIGH | All Dockerfiles | No requirements.txt or dependency pinning. All dependencies (flask, redis, requests) are installed without version pins. Builds are non-reproducible. |
| I2 | MEDIUM | Dockerfile.gpu line 3 | pip install requests is an unnecessary dependency for the GPU dashboard (it only uses urllib). Adds ~300KB to the image. |
| I3 | MEDIUM | Dockerfile.gpu line 14 | Multi-process CMD using & with no process supervisor. If the collector crashes, it won't restart. The http.server also won't receive SIGTERM properly. |
| I4 | LOW | All Dockerfiles | No .dockerignore file. The entire context is sent to the Docker daemon, including .git directories and any local artifacts. |
| I5 | LOW | Dockerfile.dashboard (root) vs dashboard/Dockerfile.dashboard | Duplicate Dockerfiles with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |

4. Smart Queuing Analysis & Recommendations

Current State: No Smart Queuing

The queue service is a passive storage mechanism: it stores requests but has no intelligence.

  • No load balancing: no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
  • No job prioritization: FIFO only, no priority levels
  • No backpressure: simple threshold, no exponential backoff or adaptive limits
  • No retry logic: failed GPU requests go to the queue but are never reprocessed
  • No dead letter handling: stuck or failed jobs have no lifecycle management
  • No consumer: nothing dequeues and forwards to GPUs
  • No job tracking: no job IDs, no status updates, no result retrieval

Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
                                              |
                                        Consumer Pool
                                              |
                          +-------------------+-------------------+
                          |                   |                   |
                     GPU 1 (load)        GPU 2 (load)        GPU 3 (load)
                          |                   |                   |
                        Health              Health              Health
                          |                   |                   |
                          +-------------------+-------------------+
                                              |
                                      Update GPU scores
                                              |
                              Priority Queue (sorted by urgency)
                              Dead Letter Queue (failed jobs)
                              Backpressure (adaptive rate limit)

Specific Recommendations

R1: Implement Redis Streams as Queue Backend

  • Replace LPUSH/RPUSH (FIFO list) with Redis Streams (XADD/XREADGROUP)
  • Streams support consumer groups, message acknowledgment, and pending messages
  • Enables proper dead letter queue handling and retry logic
  • File: queue-service/queue-service.py
# Before: simple list
r.rpush(QUEUE_KEY, json.dumps(job))

# After: Redis Stream with consumer group
stream_key = "inference:stream"
consumer_group = "gpu-workers"
# redis-py names the trimming flag `approximate` (there is no `approx` keyword)
r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
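
On the consuming side, a consumer group read followed by an acknowledgment might look like the sketch below. It assumes r is a redis-py client created with decode_responses=True and the default RESP2 reply format; ensure_group, consume_one, and the handler callback are hypothetical names.

```python
STREAM_KEY = "inference:stream"
GROUP = "gpu-workers"


def ensure_group(r, stream=STREAM_KEY, group=GROUP):
    """Create the consumer group (and the stream) if they do not exist yet."""
    try:
        r.xgroup_create(stream, group, id="0", mkstream=True)
    except Exception as exc:
        # Redis answers BUSYGROUP when the group already exists; re-raise anything else.
        if "BUSYGROUP" not in str(exc):
            raise


def consume_one(r, consumer, handler, stream=STREAM_KEY, group=GROUP, block_ms=5000):
    """Read one new entry for this consumer, process it, then acknowledge it.

    Returns the entry ID, or None if nothing arrived within block_ms.
    If handler raises, the entry is not acknowledged and stays pending,
    so it can later be retried or moved to a dead letter stream.
    """
    reply = r.xreadgroup(group, consumer, {stream: ">"}, count=1, block=block_ms)
    if not reply:
        return None
    _stream, entries = reply[0]
    entry_id, fields = entries[0]
    handler(fields)
    r.xack(stream, group, entry_id)
    return entry_id
```

Acknowledging only after the handler returns is the design choice that makes retries possible: unacknowledged entries remain in the group's pending list.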

R2: Build a Queue Consumer Pool

  • Deploy 1+ consumer containers that poll the stream and forward to GPUs
  • Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
  • File: New queue-service/consumer.py
class LoadBalancedConsumer:
    def select_gpu(self, job):
        """Select GPU based on load, health, and model compatibility."""
        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
        if not candidates:
            return None
        # Sort by: slots_idle (descending), VRAM_available (descending)
        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
        return candidates[0]
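
To see how the sort key behaves, the selection logic can be exercised with a small data class. GpuState and its sample values are hypothetical; the field names mirror those used in select_gpu above, and model compatibility is deliberately left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class GpuState:
    """Hypothetical snapshot of one GPU, as a sidecar might report it."""
    name: str
    health: str
    slots_idle: int
    vram_free_mb: int

    @property
    def full(self):
        return self.slots_idle == 0


def select_gpu(gpus):
    """Pick the healthy GPU with the most idle slots, breaking ties on free VRAM."""
    candidates = [g for g in gpus if g.health == "up" and not g.full]
    if not candidates:
        return None
    return max(candidates, key=lambda g: (g.slots_idle, g.vram_free_mb))


gpus = [
    GpuState("amdpve", "up", slots_idle=1, vram_free_mb=20000),
    GpuState("llmgpu", "up", slots_idle=2, vram_free_mb=8000),
    GpuState("ocu_llm", "down", slots_idle=4, vram_free_mb=10000),
]
print(select_gpu(gpus).name)  # llmgpu: most idle slots among the healthy GPUs
```

Because the three backends serve different models, a real consumer would also need to filter candidates by the model the job requests before ranking them.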

R3: Implement Priority Queuing

  • Add priority field to job payload: high, normal, low
  • Use Redis Streams with one stream key per priority level
  • Consumer checks high, then normal, then low, in that order
  • File: queue-service/queue-service.py enqueue endpoint
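
The priority check order in R3 can be sketched as a loop over per-priority streams. The stream key names and next_job are hypothetical, and r is assumed to be a redis-py client using the default RESP2 reply format.

```python
# Hypothetical key names, one stream per priority level, highest first.
PRIORITY_STREAMS = [
    "inference:stream:high",
    "inference:stream:normal",
    "inference:stream:low",
]


def next_job(r, group, consumer, streams=PRIORITY_STREAMS):
    """Return (stream, entry_id, fields) from the highest-priority non-empty stream.

    Each stream is polled without blocking, so a waiting high-priority job is
    always taken before normal or low ones. Returns None if all are empty.
    """
    for stream in streams:
        reply = r.xreadgroup(group, consumer, {stream: ">"}, count=1)
        if reply:
            _name, entries = reply[0]
            entry_id, fields = entries[0]
            return stream, entry_id, fields
    return None
```

Strict ordering like this can starve the low stream under sustained high-priority load; whether that is acceptable depends on the workload.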

R4: Add Backpressure Mechanism

  • Instead of hard threshold at 50, implement adaptive backpressure:
    • Queue depth 0-30: normal operation
    • Queue depth 30-40: return a Retry-After header with increasing delay
    • Queue depth 40-50: return 503 with an exponentially increasing Retry-After
    • Queue depth >50: circuit breaker open
  • File: queue-service/queue-service.py
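
The depth bands above can be expressed as a single pure function. This is an illustrative sketch: the band edges follow R4, but the boundary handling (30 and 40 fall into the higher band), the status code for accepted jobs, and all delay values are assumptions.

```python
def backpressure(depth):
    """Map queue depth to (http_status, retry_after_seconds, breaker_open).

    Delay values are placeholders chosen for illustration only.
    """
    if depth < 30:
        return 202, None, False        # normal operation
    if depth < 40:
        return 202, depth - 29, False  # accept, but hint clients to slow down: 1..10 s
    if depth <= 50:
        # reject with a delay that doubles every two entries: 2 s at 40, 64 s at 50
        return 503, 2 ** ((depth - 40) // 2 + 1), False
    return 503, 60, True               # circuit breaker open
```

The enqueue endpoint would call this with the current stream length and copy the second value into a Retry-After response header when it is not None.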

R5: Dead Letter Queue (DLQ)

  • Move failed/unprocessable jobs to an inference:dead-letter stream
  • Include failure reason, attempt count, and original payload
  • Provide admin API to inspect, retry, or discard DLQ entries
  • File: queue-service/queue-service.py
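# New endpoints (assumes r was created with decode_responses=True)
@app.route("/dlq", methods=["GET"])
def list_dlq():
    entries = r.xrange("inference:dead-letter")
    return jsonify([{"id": mid, "fields": fields} for mid, fields in entries])

@app.route("/dlq/retry/<message_id>", methods=["POST"])
def retry_dlq(message_id):
    # redis-py has no xget; fetch a single entry with a one-ID XRANGE
    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
    if not entries:
        return jsonify({"error": "not found"}), 404
    _, fields = entries[0]
    r.xadd("inference:stream", {"job": fields["job"]})
    r.xdel("inference:dead-letter", message_id)
    return jsonify({"retried": message_id}), 202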

R6: GPU-Aware Routing

  • Queue consumer should check GPU slots_busy before routing
  • If a GPU is busy, try the next available GPU
  • Track per-GPU queue depth and avoid overloading a single GPU
  • File: New consumer logic

R7: Job Status API

  • Add job ID generation on enqueue
  • Provide /status/<job_id> endpoint to check progress
  • Store job state in Redis: queued -> processing -> completed/failed
  • File: queue-service/queue-service.py
@app.route("/enqueue", methods=["POST"])
def enqueue():
    job_id = str(uuid.uuid4())
    job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
    r.xadd(stream_key, {"job": json.dumps(job)})
    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
    return jsonify({"job_id": job_id, "status": "queued"}), 202

@app.route("/status/<job_id>")
def job_status(job_id):
    status = r.hget("job:status", job_id)
    # Branch explicitly: the one-line conditional attached 404 to every response
    if status is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(json.loads(status))

R8: Health-Based Circuit Breaker

  • Replace simple depth threshold with per-GPU circuit breakers
  • Track consecutive failures per GPU
  • Implement half-open state: after cooldown, probe one GPU to test recovery
  • File: queue-service/queue-service.py
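
A per-GPU breaker with a half-open state could be sketched as below, with one instance kept per GPU. The class name, thresholds, and cooldown are assumptions for illustration; the clock is injectable so the state machine can be tested without waiting.

```python
import time


class GpuCircuitBreaker:
    """Per-GPU breaker: closed -> open after N failures -> half-open after cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None         # None means the breaker is closed
        self.probe_in_flight = False

    def allow_request(self):
        """Return True if a request may be sent to this GPU right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at < self.cooldown_s:
            return False              # open: still cooling down
        if self.probe_in_flight:
            return False              # half-open: only one probe at a time
        self.probe_in_flight = True
        return True

    def record_success(self):
        """Close the breaker and reset the failure count."""
        self.failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self):
        """Count a failure; open the breaker, or re-open it after a failed probe."""
        self.failures += 1
        self.probe_in_flight = False
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A consumer would call allow_request() before routing to a GPU and report the outcome with record_success() or record_failure(), so a failed probe restarts the cooldown.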

R9: Centralized Configuration

  • Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
    • Redis config key: config:gpus
    • Or environment file mounted to all containers
  • Nginx can use Lua/variable from config instead of static upstreams
  • File: New config/ directory or Redis-based config
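
For the environment-file option, each Python service could read the GPU list from one shared variable instead of carrying its own copy. A minimal sketch, assuming a hypothetical GPU_ENDPOINTS variable that holds a JSON object of name to base URL:

```python
import json
import os


def load_gpu_endpoints(env=os.environ):
    """Read {name: base_url} from the GPU_ENDPOINTS JSON variable.

    Fails loudly when the variable is missing or malformed, so no service
    silently falls back to its own hardcoded addresses.
    """
    raw = env.get("GPU_ENDPOINTS")
    if not raw:
        raise RuntimeError("GPU_ENDPOINTS is not set")
    endpoints = json.loads(raw)
    if not isinstance(endpoints, dict) or not endpoints:
        raise ValueError("GPU_ENDPOINTS must be a non-empty JSON object")
    return endpoints
```

With an env_file entry shared by the queue service and both dashboards in docker-compose, the addresses would live in one place; Nginx would still need its own mechanism, as noted in the list above.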

5. Priority Issue Summary

Critical (Fix Immediately)

  1. Q1: Queue has no consumer; enqueued requests are never processed
  2. Q4: No job ID or result retrieval mechanism
  3. N3: Queue fallback triggers on individual GPU failure, not all-down

High (Fix Before Production)

  1. Q5: Circuit breaker has no recovery mechanism
  2. Q6: /status endpoint blocks on GPU health checks
  3. D1: Dashboard /api/status makes 4 sequential calls, up to 11s
  4. C2: Dockerfile.queue path mismatch in docker-compose
  5. I1: No dependency pinning in any Dockerfile
  6. I3: Multi-process CMD without supervisor in GPU dashboard

Medium (Improve in Next Iteration)

  1. Q3: Redis host default conflicts with Docker networking
  2. Q7: Silent exception swallowing in Redis access
  3. Q8: Content-Type header dropped in queue
  4. D2: Single-threaded dashboard server
  5. D3: Three separate sources of truth for GPU addresses
  6. G1: Sequential GPU collection blocks on single failure
  7. N1: Rate limit burst of 20 with nodelay defeats protection
  8. N5: Hardcoded container names in Nginx
  9. C1: Queue API exposed externally
  10. C4: No Docker health checks

Low (Nice to Have)

  1. Q9: No graceful shutdown
  2. C3: restart: always vs unless-stopped
  3. C5: No Redis authentication
  4. G4: 60s dead threshold is too aggressive
  5. I2: Unnecessary requests dependency
  6. I4: No .dockerignore
  7. I5: Duplicate Dockerfiles

6. Deployment Architecture Summary

What Works Well

  • Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
  • Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
  • SSE streaming support in Nginx for LLM response streaming
  • Rate limiting at the gateway layer
  • Circuit breaker pattern implemented (even if basic)

What Needs Work

  • Queue is incomplete: storage without processing is the most critical gap
  • No job lifecycle: requests go in and never come out
  • Duplicated configuration: GPU addresses in 3+ places
  • No monitoring/alerting: no Prometheus metrics, no alerting rules
  • Single point of failure: no Redis replication, no container redundancy
  • No logging: Flask dev server logs are minimal; no structured logging

  1. Priority 1: Implement queue consumer with GPU load-based routing
  2. Priority 2: Add job status tracking and result retrieval
  3. Priority 3: Fix Nginx fallback to only trigger when ALL GPUs are down
  4. Priority 4: Add Docker health checks and proper dependency management
  5. Priority 5: Centralize GPU configuration in Redis or environment
  6. Priority 6: Add Prometheus metrics endpoint for observability