SyslogBot b09a93f45c feat: Smart Queue Consumer implementation draft + architecture review

- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines)
  with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts

Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector

Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API

2026-05-17 03:55:20 +00:00


Syslog Harness Architecture Review & Improvement Recommendations

Date: 2026-05-17
Commit: e95475f "Add GPU dashboard container + Nginx routing"
Repo: http://192.168.68.17:3000/SyslogSolution/syslog-harness.git

1. Current Architecture Overview

                          
Host (192.168.68.123)

Agent --:8080--> Nginx Router ----> Queue Service ----> Dashboard
                 :8080              :8091               :3001
                    |                  |
                    v                  v
                 GPU Pool           Redis ------------> GPU Dashboard
                 :8080              :6379               :8092
                    |
       +------------+------------+
       |            |            |
    amdpve       llmgpu       ocu_llm
    .15:8080     .8:8080      .110:8080
    MoE 35B      Dense 27B    Light 4B

Services

Service	Port	Container	Image	Purpose
Nginx Router	8080	Host-level	OS nginx	Routes by `X-Syslog-Model` header
Queue Service	8091	`syslog-queue`	`python:3.13-slim`	Request queue + circuit breaker
Dashboard	3001	`syslog-dashboard`	`python:3.11-slim`	Observability UI + GPU health
GPU Dashboard	8092	`syslog-gpu-dashboard`	`python:3.11-slim`	Hardware metrics (temp, VRAM, power)
Redis	6379	`syslog-redis`	`redis:7-alpine`	Queue storage

GPU Backends

Host	GPU	Model	Capacity
192.168.68.15	AMD Strix Halo	qwen3.6-35B-A3B (MoE)	65GB VRAM
192.168.68.8	RTX 3090	qwen3.5-27B (Dense)	24GB VRAM
192.168.68.110	RTX 5070	gemma-4-E4B (Light)	12GB VRAM

Data Flow

Agent sends a request with the X-Syslog-Model header to Nginx :8080
Nginx routes to the appropriate GPU based on the header mapping
GPU backend (llama.cpp) processes the request
Fallback: if the GPU returns 502/503 or times out, Nginx redirects to queue-service :8091
Queue stores the request in the Redis list inference:requests (LPUSH)
Dashboard :3001 polls queue-service and GPU health for display
GPU Dashboard :8092 collects hardware metrics every 10s

2. File Inventory

docker-compose.yml                          # Main compose (Docker networking)
gpu-router-docker.conf                      # Nginx config for Docker deployment
Dockerfile.gpu                              # GPU dashboard container
Dockerfile.dashboard                        # Dashboard container (root-level)
queue-service/Dockerfile                    # Queue service container
queue-service/queue-service.py              # Queue logic (121 lines)
dashboard/harness-dashboard.py              # Dashboard app (133 lines)
dashboard/Dockerfile                        # Dashboard container (subdir)
dashboard/Dockerfile.dashboard              # Dashboard container (duplicate)
gpu-dashboard/gpu_collector.py              # GPU hardware collector (115 lines)
gpu-dashboard/gpu.html                      # GPU dashboard UI (183 lines)
gpu-dashboard/collector.py                  # Duplicate collector (hermes-workspace path)
gpu-dashboard/start.sh                      # Legacy startup script
MIGRATION_PLAN.md                           # Production migration plan
README.md                                   # Documentation
syslog-harness-check/                       # Checkpoint subdirectory (mirror)

3. Detailed Findings

3.1 Queue Service (`queue-service/queue-service.py`)

Architecture: Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.

Issues Found:

#	Severity	Location	Issue
Q1	CRITICAL	Lines 82-88	Queue is fire-and-forget with no consumer. Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit.
Q2	CRITICAL	Lines 28-32	Hardcoded GPU IPs in the queue service duplicate the Nginx config. No configuration source of truth.
Q3	HIGH	Lines 21-22	Redis host fallback to `192.168.68.7` (line 21) conflicts with docker-compose which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker.
Q4	HIGH	Lines 66-95	No job result retrieval mechanism. Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results.
Q5	HIGH	Lines 73-79	Circuit breaker is a simple depth threshold. No backoff, no recovery window, no sliding window. Once tripped (open), it stays open until the queue is manually drained.
Q6	MEDIUM	Lines 50-57	GPU health check is synchronous and blocks the `/status` endpoint. Checking 3 GPUs sequentially with 3s timeout means `/status` can take up to 9s.
Q7	MEDIUM	Lines 35-40	`get_redis()` swallows all exceptions and returns `None`. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow.
Q8	MEDIUM	Lines 83-84	Headers are filtered to `X-`-prefixed ones only; the `Content-Type` header is dropped entirely, meaning the receiver can't determine the payload format.
Q9	LOW	Line 121	No graceful shutdown. Flask development server doesn't handle SIGTERM gracefully.
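
To illustrate the cost described in Q6, here is a minimal sketch of running the three health checks concurrently, so the worst case is roughly one timeout (about 3s) rather than the sum (9s). The names `GPU_ENDPOINTS` and `check_all` and the injected `check` callable are assumptions for illustration; they are not taken from `queue-service.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical endpoint map; the addresses mirror the GPU Backends table.
GPU_ENDPOINTS: Dict[str, str] = {
    "amdpve": "http://192.168.68.15:8080",
    "llmgpu": "http://192.168.68.8:8080",
    "ocu_llm": "http://192.168.68.110:8080",
}


def check_all(endpoints: Dict[str, str],
              check: Callable[[str], bool],
              max_workers: int = 8) -> Dict[str, bool]:
    """Run check(url) for every GPU concurrently and return {name: healthy}.

    An exception raised by check() marks only that GPU as down; the other
    checks are unaffected.
    """
    def safe_check(url: str) -> bool:
        try:
            return bool(check(url))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the checks overlap in time.
        futures = {name: pool.submit(safe_check, url)
                   for name, url in endpoints.items()}
        return {name: future.result() for name, future in futures.items()}
```

The `check` callable would wrap whatever HTTP probe the service already uses, with its own per-request timeout. The same pattern would apply to D1 and G1.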

3.2 Nginx Gateway (`gpu-router-docker.conf`)

Architecture: Nginx routes requests to GPU backends based on X-Syslog-Model header value. Has rate limiting, streaming support, and queue fallback.

Issues Found:

#	Severity	Location	Issue
N1	HIGH	Lines 79-80	`burst=20 nodelay` means up to 20 requests beyond the rate limit are served immediately; only requests beyond the burst are rejected. This weakens rate limiting under burst traffic: all 20 could still hit a GPU at once.
N2	HIGH	Lines 99-100	`proxy_next_upstream` with `proxy_next_upstream_tries 2` means that on error/timeout/502/503, Nginx retries once. But it retries against the same GPU pool, not a different one, so the same GPU that failed can get hit again.
N3	HIGH	Lines 106, 112-121	Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down.
N4	MEDIUM	Line 90	`proxy_pass_header X-Syslog-Model` has no effect here. Nginx forwards client request headers to the upstream by default; `proxy_pass_header` controls which response headers from the upstream are passed back to the client. The model header already reaches the GPU backend without it.
N5	MEDIUM	Lines 27, 32	Hardcoded container names (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change based on docker-compose project prefix. Should use service names.
N6	LOW	Lines 67-73	GPU dashboard at `/gpu` path has `X-Forwarded-Proto` but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations.

3.3 Dashboard (`dashboard/harness-dashboard.py`)

Architecture: Simple HTTP server using Python's http.server. Fetches queue status and GPU health, renders HTML.

Issues Found:

#	Severity	Location	Issue
D1	HIGH	Lines 34-40	`get_queue_status()` calls queue-service synchronously. Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time.
D2	MEDIUM	Lines 101-127	Uses `SimpleHTTPRequestHandler` on a single-threaded server. Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`.
D3	MEDIUM	Lines 16-18	GPU endpoints hardcoded in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses.
D4	LOW	Line 127	Silent log suppression. While intentional, this makes debugging impossible without modifying the source.
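
As a rough illustration of the D2 recommendation, the sketch below serves a handler from `ThreadingHTTPServer`, which handles each request in its own thread. `StatusHandler`, `start_server`, and `fetch_status` are hypothetical stand-ins, not the dashboard's actual code, and the JSON body is a placeholder.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the dashboard's request handler."""

    def do_GET(self):
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # quiet for the sketch; see D4 on why real logging matters


def start_server(host="127.0.0.1", port=0):
    """Start a threaded server in the background; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_status(port, host="127.0.0.1"):
    """Small client helper used to exercise the server."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/api/status")
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

With this server class, a slow `/api/status` request no longer blocks other dashboard clients, though it does not shorten the slow request itself.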

3.4 GPU Dashboard (`gpu-dashboard/`)

Architecture: gpu_collector.py polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to gpu_metrics.json. Static HTTP server serves the dashboard.

Issues Found:

#	Severity	Location	Issue
G1	HIGH	Lines 97-98	Sequential collection. All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three.
G2	HIGH	Lines 105-107	`/app/public/gpu_metrics.json` path is hardcoded and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files.
G3	MEDIUM	Lines 19-25	`fetch_json` swallows all exceptions. A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error".
G4	MEDIUM	Line 14	`DEAD_THRESHOLD = 60` seconds is aggressive. A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s.
G5	LOW	Lines 10-14	`start.sh` references `/root/hermes-workspace/public` but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment.
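
One possible way to address G3 is to have `fetch_json` return the failure reason instead of discarding it, so the collector can tell "no data" from "collector error". The sketch below is an assumption about how that could look; `FetchResult` and the `opener` parameter are hypothetical and exist mainly to make the function testable.

```python
import json
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FetchResult:
    data: Optional[Any] = None   # parsed JSON on success
    error: Optional[str] = None  # short reason on failure

    @property
    def ok(self) -> bool:
        return self.error is None


def fetch_json(url: str, timeout: float = 3.0,
               opener: Callable[..., Any] = urllib.request.urlopen) -> FetchResult:
    """Fetch and parse JSON, reporting why a fetch failed instead of hiding it."""
    try:
        with opener(url, timeout=timeout) as resp:
            return FetchResult(data=json.loads(resp.read().decode("utf-8")))
    except (urllib.error.URLError, TimeoutError, OSError) as exc:
        # Network-level problem: host down, refused, or timed out.
        return FetchResult(error=f"unreachable: {exc}")
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        # The host answered, but not with valid JSON.
        return FetchResult(error=f"bad payload: {exc}")
```

The collector could then write the `error` string into `gpu_metrics.json` next to the last known values.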

3.5 Docker Compose (`docker-compose.yml`)

Issues Found:

#	Severity	Location	Issue
C1	HIGH	Lines 19-20	Queue service exposes port 8091 externally. In a multi-tenant or public-facing deployment, the queue API should be internal-only.
C2	MEDIUM	Lines 13-15	`Dockerfile.queue` referenced but doesn't exist at root level. The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match.
C3	MEDIUM	Lines 6, 16, 26, 31, 43	`restart: always` instead of `restart: unless-stopped`. With `always`, a manually stopped container comes back when the Docker daemon restarts, making maintenance harder.
C4	LOW	Lines 23-25	No health checks defined for any service. Docker can't detect if a service is actually healthy, only if the container is running.
C5	LOW	Line 10	Redis has no password. Unauthenticated Redis exposed on the Docker network.
C6	LOW	Lines 49-51	No network driver specified for the network (minor: defaults to bridge). No IPAM configuration for large deployments.

3.6 Container Images

Issues Found:

#	Severity	Location	Issue
I1	HIGH	All Dockerfiles	No `requirements.txt` or dependency pinning. All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible.
I2	MEDIUM	`Dockerfile.gpu` line 3	`pip install requests` is an unnecessary dependency for the GPU dashboard (it only uses `urllib`). Adds ~300KB to the image.
I3	MEDIUM	`Dockerfile.gpu` line 14	Multi-process CMD with `&` and no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly.
I4	LOW	All Dockerfiles	No `.dockerignore` file. The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts.
I5	LOW	`Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard`	Duplicate Dockerfiles with slight differences (Python 3.11 vs 3.13, WORKDIR differences).

4. Smart Queuing Analysis & Recommendations

Current State: No Smart Queuing

The queue service is a passive storage mechanism. It stores requests but has no intelligence:

No load balancing: no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
No job prioritization: FIFO only, no priority levels
No backpressure: simple threshold, no exponential backoff or adaptive limits
No retry logic: failed GPU requests go to the queue but are never reprocessed
No dead letter handling: stuck or failed jobs have no lifecycle management
No consumer: nothing dequeues and forwards to GPUs
No job tracking: no job IDs, no status updates, no result retrieval

Recommended Architecture: Smart Queue with Consumer

Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
                                              |
                                              v
                                        Consumer Pool
                                              |
                         +--------------------+--------------------+
                         |                    |                    |
                         v                    v                    v
                    GPU 1 (load)         GPU 2 (load)         GPU 3 (load)
                         |                    |                    |
                       Health               Health               Health
                         |                    |                    |
                         +--------------------+--------------------+
                                              |
                                      Update GPU scores
                                              |
                              Priority Queue (sorted by urgency)
                              Dead Letter Queue (failed jobs)
                              Backpressure (adaptive rate limit)

Specific Recommendations

R1: Implement Redis Streams as Queue Backend

Replace LPUSH/RPUSH (FIFO list) with Redis Streams (XADD/XREADGROUP)
Streams support consumer groups, message acknowledgment, and pending messages
Enables proper dead letter queue handling and retry logic
File: queue-service/queue-service.py

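# Before: Simple list
r.rpush(QUEUE_KEY, json.dumps(job))

# After: Redis Stream with consumer group
stream_key = "inference:stream"
consumer_group = "gpu-workers"
# One-time setup; raises a ResponseError (BUSYGROUP) if the group already exists
r.xgroup_create(stream_key, consumer_group, id="0", mkstream=True)
# redis-py spells the trimming keyword `approximate`, not `approx`
r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)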

R2: Build a Queue Consumer Pool

Deploy 1+ consumer containers that poll the stream and forward to GPUs
Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
File: New queue-service/consumer.py

class LoadBalancedConsumer:
    def select_gpu(self, job):
        """Select the least-loaded healthy GPU.

        Model compatibility is not checked in this sketch; `job` is unused.
        """
        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
        if not candidates:
            return None
        # Sort by: slots_idle (descending), then vram_free_mb (descending)
        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
        return candidates[0]
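
The consumer loop itself could look roughly like the sketch below, which reads a batch with `XREADGROUP`, hands each job to a handler, and acknowledges it with `XACK`. This is a sketch under assumptions: `process_batch` and `handle` are hypothetical names, the client is assumed to be a redis-py client created with `decode_responses=True`, and the stream, group, and dead-letter names are the ones proposed in R1 and R5.

```python
import json
from typing import Any, Callable

STREAM_KEY = "inference:stream"    # from R1
GROUP = "gpu-workers"              # from R1
DLQ_KEY = "inference:dead-letter"  # from R5


def process_batch(client: Any, consumer: str,
                  handle: Callable[[dict], None],
                  count: int = 10, block_ms: int = 5000) -> int:
    """Read up to `count` new jobs for this consumer, handle them, and ack.

    A job whose handler raises is copied to the dead-letter stream with the
    failure reason, then acked so it does not stay pending forever.
    Returns the number of jobs handled successfully.
    """
    # ">" asks for entries never delivered to any consumer in the group.
    response = client.xreadgroup(GROUP, consumer, {STREAM_KEY: ">"},
                                 count=count, block=block_ms)
    handled = 0
    for _stream, entries in response or []:
        for message_id, fields in entries:
            try:
                handle(json.loads(fields["job"]))
                handled += 1
            except Exception as exc:
                client.xadd(DLQ_KEY, {"job": fields["job"], "reason": str(exc)})
            client.xack(STREAM_KEY, GROUP, message_id)
    return handled
```

A worker would call `process_batch` in a loop, with `handle` forwarding the job to the GPU chosen by `select_gpu`. Retrying before dead-lettering is left out here to keep the sketch short.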

R3: Implement Priority Queuing

Add priority field to job payload: high, normal, low
Use Redis Streams with multiple stream keys per priority level
Consumer checks high -> normal -> low in order
File: queue-service/queue-service.py enqueue endpoint
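
The strict ordering in R3 can be sketched with in-memory queues standing in for the per-priority streams. The key pattern `inference:stream:<priority>` is a hypothetical naming choice, and `next_job` is an illustrative helper rather than part of the existing service.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Priority levels from R3, checked in this order.
PRIORITY_ORDER = ("high", "normal", "low")


def stream_key(priority: str) -> str:
    """Map a priority level to its (hypothetical) stream key."""
    if priority not in PRIORITY_ORDER:
        raise ValueError(f"unknown priority: {priority!r}")
    return f"inference:stream:{priority}"


def next_job(queues: Dict[str, Deque[dict]]) -> Optional[Tuple[str, dict]]:
    """Return (priority, job) from the highest non-empty priority, or None."""
    for priority in PRIORITY_ORDER:
        queue = queues.get(stream_key(priority))
        if queue:
            return priority, queue.popleft()
    return None
```

Strict ordering like this can starve low-priority jobs under sustained high-priority load, so a real consumer may want an aging or quota rule on top.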

R4: Add Backpressure Mechanism

Instead of hard threshold at 50, implement adaptive backpressure:
- Queue depth 0-30: normal operation
- Queue depth 30-40: return retry-after header with increasing delay
- Queue depth 40-50: return 503 with exponential retry-after
- Queue depth >50: circuit breaker open
File: queue-service/queue-service.py
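
The thresholds above could be turned into a single decision function, sketched below. Several details are assumptions not stated in the list: depths of exactly 30 and 40 are treated as belonging to the higher band, jobs in the 30-40 band are still accepted (with a `Retry-After` hint), the delay ramps are arbitrary, and 202 is used for accepted jobs to match the enqueue example in R7.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    http_status: int              # status the enqueue endpoint would return
    retry_after_s: Optional[int]  # value for a Retry-After header, if any
    accept: bool                  # whether the job is still enqueued


def backpressure(depth: int, base_delay_s: int = 1,
                 max_delay_s: int = 60) -> Decision:
    """Map the current queue depth to an enqueue decision."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth < 30:
        return Decision(202, None, True)  # normal operation
    if depth < 40:
        # Linear ramp: base delay at depth 30, ten times that at depth 39.
        return Decision(202, base_delay_s * (depth - 29), True)
    if depth <= 50:
        # Exponential ramp: doubles with each extra queued job, capped.
        delay = min(max_delay_s, base_delay_s * 2 ** (depth - 40))
        return Decision(503, delay, False)
    return Decision(503, max_delay_s, False)  # circuit breaker open
```

For example, `backpressure(45)` yields a 503 with a 32-second `Retry-After` under the default parameters.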

R5: Dead Letter Queue (DLQ)

Move failed/unprocessable jobs to an inference:dead-letter stream
Include failure reason, attempt count, and original payload
Provide admin API to inspect, retry, or discard DLQ entries
File: queue-service/queue-service.py

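# New endpoints (assumes `app = Flask(__name__)` and a redis-py client `r`
# created with decode_responses=True)
@app.route("/dlq", methods=["GET"])
def list_dlq():
    entries = r.xrange("inference:dead-letter")
    return jsonify([{"id": entry_id, "fields": fields} for entry_id, fields in entries])

@app.route("/dlq/retry/<message_id>", methods=["POST"])
def retry_dlq(message_id):
    # redis-py has no xget(); fetch a single entry with XRANGE bounded to its ID
    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
    if not entries:
        return jsonify({"error": "not found"}), 404
    _, fields = entries[0]
    r.xadd("inference:stream", {"job": fields["job"]})
    r.xdel("inference:dead-letter", message_id)  # remove once re-queued
    return jsonify({"retried": message_id}), 202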

R6: GPU-Aware Routing

Queue consumer should check GPU slots_busy before routing
If a GPU is busy, try the next available GPU
Track per-GPU queue depth and avoid overloading a single GPU
File: New consumer logic

R7: Job Status API

Add job ID generation on enqueue
Provide /status/<job_id> endpoint to check progress
Store job state in Redis: queued -> processing -> completed/failed
File: queue-service/queue-service.py

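import json, time, uuid
from flask import jsonify, request

@app.route("/enqueue", methods=["POST"])
def enqueue():
    job_id = str(uuid.uuid4())
    # Assumption: the request's JSON body is the job payload
    payload = request.get_json(silent=True)
    job = {"id": job_id, "payload": payload, "status": "queued", "created_at": time.time()}
    r.xadd(stream_key, {"job": json.dumps(job)})
    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
    return jsonify({"job_id": job_id, "status": "queued"}), 202

@app.route("/status/<job_id>")
def job_status(job_id):
    status = r.hget("job:status", job_id)
    # Return 404 only when the job is unknown (the one-line form always returned 404)
    if status is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(json.loads(status))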

R8: Health-Based Circuit Breaker

Replace simple depth threshold with per-GPU circuit breakers
Track consecutive failures per GPU
Implement half-open state: after cooldown, probe one GPU to test recovery
File: queue-service/queue-service.py
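
A per-GPU breaker with a half-open state might be sketched as follows. The class name, default threshold, and cooldown are assumptions chosen for illustration, and the injectable `clock` exists only so the behavior can be tested without sleeping. The sketch is not thread-safe.

```python
import time
from typing import Callable, Optional


class GpuCircuitBreaker:
    """Breaker for one GPU.

    closed -> open after N consecutive failures,
    open -> half-open once the cooldown has elapsed,
    half-open -> closed on a successful probe, or back to open on failure.
    """

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None while closed
        self._probe_in_flight = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self._probe_in_flight:
            self._probe_in_flight = True  # let exactly one probe through
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._failures += 1
        self._probe_in_flight = False
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the cooldown
```

The consumer would keep one instance per GPU, for example in a dict keyed by host name, and skip any GPU whose `allow_request()` returns False.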

R9: Centralized Configuration

Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
- Redis config key: config:gpus
- Or environment file mounted to all containers
Nginx can use Lua/variable from config instead of static upstreams
File: New config/ directory or Redis-based config
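
For the environment-file option, one way the Python services could read a shared GPU list is sketched below. The variable name `SYSLOG_GPUS`, its JSON shape, and the `GpuEndpoint` fields are all assumptions for illustration; nothing in the current repo defines them.

```python
import json
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class GpuEndpoint:
    name: str
    url: str
    model: str


def load_gpu_config(env: Mapping[str, str] = os.environ,
                    var: str = "SYSLOG_GPUS") -> List[GpuEndpoint]:
    """Parse the GPU list from one JSON-valued environment variable."""
    raw = env.get(var)
    if not raw:
        # Fail loudly rather than fall back to hardcoded addresses.
        raise RuntimeError(f"{var} is not set")
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{var} is not valid JSON: {exc}") from exc
    return [GpuEndpoint(name=e["name"], url=e["url"], model=e["model"])
            for e in entries]
```

A value such as `[{"name": "llmgpu", "url": "http://192.168.68.8:8080", "model": "qwen3.5-27B"}]` in a shared env file would then be the only place the queue service and dashboard read GPU addresses from. Nginx would still need its own mechanism, as noted above.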

5. Priority Issue Summary

Critical (Fix Immediately)

Q1: Queue has no consumer; enqueued requests are never processed
Q4: No job ID or result retrieval mechanism
N3: Queue fallback triggers on individual GPU failure, not all-down

High (Fix Before Production)

Q5: Circuit breaker has no recovery mechanism
Q6: /status endpoint blocks on GPU health checks
D1: Dashboard /api/status makes 4 sequential calls, up to 11s
C2: Dockerfile.queue path mismatch in docker-compose
I1: No dependency pinning in any Dockerfile
I3: Multi-process CMD without supervisor in GPU dashboard

Medium (Improve in Next Iteration)

Q3: Redis host default conflicts with Docker networking
Q7: Silent exception swallowing in Redis access
Q8: Content-Type header dropped in queue
D2: Single-threaded dashboard server
D3: Three separate sources of truth for GPU addresses
G1: Sequential GPU collection blocks on single failure
N1: Rate limit burst of 20 with nodelay weakens protection
N5: Hardcoded container names in Nginx
C1: Queue API exposed externally
C4: No Docker health checks

Low (Nice to Have)

Q9: No graceful shutdown
C3: restart: always vs unless-stopped
C5: No Redis authentication
G4: 60s dead threshold is too aggressive
I2: Unnecessary requests dependency
I4: No .dockerignore
I5: Duplicate Dockerfiles

6. Deployment Architecture Summary

What Works Well

Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
SSE streaming support in Nginx for LLM response streaming
Rate limiting at the gateway layer
Circuit breaker pattern implemented (even if basic)

What Needs Work

Queue is incomplete: storage without processing is the most critical gap
No job lifecycle: requests go in and never come out
Duplicated configuration: GPU addresses in 3+ places
No monitoring/alerting: no Prometheus metrics, no alerting rules
Single point of failure: no Redis replication, no container redundancy
No logging: Flask dev server logs are minimal; no structured logging

Recommended Next Steps

Priority 1: Implement queue consumer with GPU load-based routing
Priority 2: Add job status tracking and result retrieval
Priority 3: Fix Nginx fallback to only trigger when ALL GPUs are down
Priority 4: Add Docker health checks and proper dependency management
Priority 5: Centralize GPU configuration in Redis or environment
Priority 6: Add Prometheus metrics endpoint for observability
