feat: Smart Queue Consumer implementation draft + architecture review

- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines) with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts

Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector

Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API
Add GPU dashboard container + Nginx routing
2026-05-17 03:55:20 +00:00 · 2026-05-15 22:25:56 +00:00
20 changed files with 4007 additions and 127 deletions
@@ -1,8 +0,0 @@
 # Syslog Harness Environment
 REDIS_HOST=192.168.68.8
 REDIS_PORT=6379
 AMDPVE_ENDPOINT=http://192.168.68.15:8080
 LLMGPU_ENDPOINT=http://192.168.68.8:8080
 OCU_LLM_ENDPOINT=http://192.168.68.110:8080
 CIRCUIT_BREAKER_THRESHOLD=5
 CIRCUIT_BREAKER_TIMEOUT=30
@@ -0,0 +1,390 @@
 # Syslog Harness: Architecture Review & Improvement Recommendations
 **Date:** 2026-05-17  
 **Commit:** `e95475f`, "Add GPU dashboard container + Nginx routing"  
 **Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git
 ---
 ## 1. Current Architecture Overview
 ```
 Host (192.168.68.123)
 Agent --:8080--> Nginx Router --> Queue Service --> Dashboard
                  :8080            :8091             :3001
                  GPU Pool         Redis         --> GPU Dashboard
                  :8080            :6379             :8092
                  amdpve       llmgpu       ocu_llm
                  .15:8080     .8:8080      .110:8080
                  MoE 35B      Dense 27B    Light 4B
 ```
 ### Services
 | Service | Port | Container | Image | Purpose |
 |---|---|---|---|---|
 | **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
 | **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
 | **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
 | **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
 | **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |
 ### GPU Backends
 | Host | GPU | Model | Capacity |
 |---|---|---|---|
 | 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
 | 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
 | 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |
 ### Data Flow
 1. **Agent** sends request with `X-Syslog-Model` header → Nginx :8080
 2. **Nginx** routes to appropriate GPU based on header mapping
 3. **GPU backend** (llama.cpp) processes request
 4. **Fallback:** If GPU returns 502/503/timeout → Nginx redirects to queue-service :8091
 5. **Queue** stores request in Redis `inference:requests` LPUSH
 6. **Dashboard** :3001 polls queue-service + GPU health for display
 7. **GPU Dashboard** :8092 collects hardware metrics every 10s
 ---
 ## 2. File Inventory
 ```
 docker-compose.yml                          # Main compose (Docker networking)
 gpu-router-docker.conf                      # Nginx config for Docker deployment
 Dockerfile.gpu                              # GPU dashboard container
 Dockerfile.dashboard                        # Dashboard container (root-level)
 queue-service/Dockerfile                    # Queue service container
 queue-service/queue-service.py              # Queue logic (121 lines)
 dashboard/harness-dashboard.py              # Dashboard app (133 lines)
 dashboard/Dockerfile                        # Dashboard container (subdir)
 dashboard/Dockerfile.dashboard              # Dashboard container (duplicate)
 gpu-dashboard/gpu_collector.py              # GPU hardware collector (115 lines)
 gpu-dashboard/gpu.html                      # GPU dashboard UI (183 lines)
 gpu-dashboard/collector.py                  # Duplicate collector (hermes-workspace path)
 gpu-dashboard/start.sh                      # Legacy startup script
 MIGRATION_PLAN.md                           # Production migration plan
 README.md                                   # Documentation
 syslog-harness-check/                       # Checkpoint subdirectory (mirror)
 ```
 ---
 ## 3. Detailed Findings
 ### 3.1 Queue Service (`queue-service/queue-service.py`)
 **Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.
 **Issues Found:**
 | # | Severity | Location | Issue |
 |---|---|---|---|
 | Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
 | Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. No configuration source of truth. |
 | Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
 | Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
 | Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once open, it stays open until the queue is manually drained. |
 | Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with 3s timeout means `/status` can take up to 9s. |
 | Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
 | Q8 | **MEDIUM** | Lines 83-84 | **Headers are filtered to `X-*` prefixed only**, so the `Content-Type` header is dropped entirely and the receiver can't determine the payload format. |
 | Q9 | **LOW** | Line 121 | **No graceful shutdown.** Flask development server doesn't handle SIGTERM gracefully. |
 ### 3.2 Nginx Gateway (`gpu-router-docker.conf`)
 **Architecture:** Nginx routes requests to GPU backends based on `X-Syslog-Model` header value. Has rate limiting, streaming support, and queue fallback.
 **Issues Found:**
 | # | Severity | Location | Issue |
 |---|---|---|---|
 | N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** means 20 requests are served immediately beyond the rate limit, then throttled. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
 | N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `proxy_next_upstream_tries 2`** means on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
 | N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down. |
 | N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is misused. This directive controls which upstream *response* headers are passed on to the client. Client request headers such as `X-Syslog-Model` are forwarded to the upstream by default, so the directive has no effect on routing. |
 | N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change based on docker-compose project prefix. Should use service names. |
 | N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** has `X-Forwarded-Proto` but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations. |
 ### 3.3 Dashboard (`dashboard/harness-dashboard.py`)
 **Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.
 **Issues Found:**
 | # | Severity | Location | Issue |
 |---|---|---|---|
 | D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time. |
 | D2 | **MEDIUM** | Lines 101-127 | **Uses `HTTPServer`** (with a `SimpleHTTPRequestHandler` subclass), which handles one request at a time. Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`. |
 | D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
 | D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |
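
 Finding D1 (and the similar Q6 and G1) could be addressed roughly as follows. This is a minimal sketch, not the project's code: `check_all` is a hypothetical helper, and the `check` callable is assumed to have the same `(name, info)` signature as the dashboard's `check_gpu`. Running the probes in a thread pool bounds the worst case by the slowest single probe rather than the sum of all of them.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(gpus, check, max_workers=3):
    """Run check(name, info) for every GPU concurrently.

    `gpus` maps name -> info dict. `check` is passed in so this sketch
    stays independent of the real dashboard module.
    Returns {name: result} in the same order as `gpus`.
    """
    if not gpus:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every probe first, then collect, so they overlap in time.
        futures = {name: pool.submit(check, name, info) for name, info in gpus.items()}
        return {name: future.result() for name, future in futures.items()}
```

 In the handler, the per-GPU loop would become something like `enriched["gpu_health"] = check_all(GPUS, check_gpu)`. For D2, the standard library's `http.server.ThreadingHTTPServer` (Python 3.7+) takes the same constructor arguments as `HTTPServer`.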
 ### 3.4 GPU Dashboard (`gpu-dashboard/`)
 **Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to `gpu_metrics.json`. Static HTTP server serves the dashboard.
 **Issues Found:**
 | # | Severity | Location | Issue |
 |---|---|---|---|
 | G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
 | G2 | **HIGH** | Lines 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files. |
 | G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
 | G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
 | G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment. |
 ### 3.5 Docker Compose (`docker-compose.yml`)
 **Issues Found:**
 | # | Severity | Location | Issue |
 |---|---|---|---|
 | C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
 | C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
 | C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. With `always`, a manually stopped container is started again when the Docker daemon restarts, making maintenance harder. |
 | C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect if a service is actually healthy, only if the container is running. |
 | C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
 | C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor; defaults to bridge). No IPAM configuration for large deployments. |
 ### 3.6 Container Images
 **Issues Found:**
 | # | Severity | Location | Issue |
 |---|---|---|---|
 | I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
 | I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`** is an unnecessary dependency for the GPU dashboard (only uses `urllib`). Adds ~300KB to the image. |
 | I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`** and no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
 | I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
 | I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |
 ---
 ## 4. Smart Queuing Analysis & Recommendations
 ### Current State: No Smart Queuing
 The queue service is a **passive storage mechanism**: it stores requests but has no intelligence:
 - **No load balancing**: no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
 - **No job prioritization**: FIFO only, no priority levels
 - **No backpressure**: simple threshold, no exponential backoff or adaptive limits
 - **No retry logic**: failed GPU requests go to queue but are never reprocessed
 - **No dead letter handling**: stuck or failed jobs have no lifecycle management
 - **No consumer**: nothing dequeues and forwards to GPUs
 - **No job tracking**: no job IDs, no status updates, no result retrieval
 ### Recommended Architecture: Smart Queue with Consumer
 ```
 Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
                                         Consumer Pool
                         GPU 1 (load)  GPU 2 (load)  GPU 3 (load)
                         Health        Health        Health
                                  Update GPU scores
                             Priority Queue (sorted by urgency)
                             Dead Letter Queue (failed jobs)
                             Backpressure (adaptive rate limit)
 ```
 ### Specific Recommendations
 #### R1: Implement Redis Streams as Queue Backend
 - Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
 - Streams support consumer groups, message acknowledgment, and pending messages
 - Enables proper dead letter queue handling and retry logic
 - **File:** `queue-service/queue-service.py`
 ```python
 # Before: Simple list
 r.rpush(QUEUE_KEY, json.dumps(job))
 # After: Redis Stream with consumer group
 stream_key = "inference:stream"
 consumer_group = "gpu-workers"
 # redis-py names the trim flag `approximate` (equivalent to MAXLEN ~ 10000)
 r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
 ```
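
 The consumer-group side of R1 might look like the sketch below. It assumes `r` is a redis-py client created with `decode_responses=True`; `ensure_group`, `consume_once`, and the `handle` callback are hypothetical names introduced only for illustration. Entries whose handler raises are deliberately left unacknowledged, so they remain in the group's pending list for a later retry or a move to a dead letter stream.

```python
def ensure_group(r, stream_key, group):
    """Create the consumer group (and the stream) if it does not exist yet."""
    try:
        r.xgroup_create(stream_key, group, id="0", mkstream=True)
    except Exception as exc:
        # Redis answers with a BUSYGROUP error when the group already exists.
        if "BUSYGROUP" not in str(exc):
            raise


def consume_once(r, stream_key, group, consumer, handle, count=10, block_ms=5000):
    """Read up to `count` new entries, process each, and XACK on success.

    Returns the number of acknowledged entries.
    """
    acked = 0
    # ">" asks for entries never delivered to any consumer in this group.
    batches = r.xreadgroup(group, consumer, {stream_key: ">"}, count=count, block=block_ms)
    for _stream, entries in batches or []:
        for message_id, fields in entries:
            try:
                handle(fields)
            except Exception:
                continue  # not acknowledged, so it stays pending
            r.xack(stream_key, group, message_id)
            acked += 1
    return acked
```

 A worker loop would call `ensure_group` once at startup and then `consume_once` repeatedly with its own consumer name.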
 #### R2: Build a Queue Consumer Pool
 - Deploy 1+ consumer containers that poll the stream and forward to GPUs
 - Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
 - **File:** New `queue-service/consumer.py`
 ```python
 class LoadBalancedConsumer:
    def select_gpu(self, job):
        """Select GPU based on load, health, and model compatibility."""
        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
        if not candidates:
            return None
        # Sort by: slots_idle (descending), VRAM_available (descending)
        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
        return candidates[0]
 ```
 #### R3: Implement Priority Queuing
 - Add priority field to job payload: `high`, `normal`, `low`
 - Use Redis Streams with multiple stream keys per priority level
 - Consumer checks `high` → `normal` → `low` in order
 - **File:** `queue-service/queue-service.py` enqueue endpoint
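
 The priority order in R3 can be sketched without committing to a particular Redis call. The stream naming scheme (`inference:stream:<priority>`) and the `read_stream` callback are assumptions made for this example. The callback stands in for a non-blocking read of one entry from one stream.

```python
PRIORITIES = ("high", "normal", "low")  # checked in this order


def stream_for(priority, prefix="inference:stream"):
    """Map a job's priority field to its stream key; unknown values fall back to normal."""
    if priority not in PRIORITIES:
        priority = "normal"
    return f"{prefix}:{priority}"


def read_next(read_stream, prefix="inference:stream"):
    """Return (priority, entry) from the highest-priority non-empty stream, else None.

    `read_stream(key)` is a caller-supplied function returning one entry or None.
    """
    for priority in PRIORITIES:
        entry = read_stream(stream_for(priority, prefix))
        if entry is not None:
            return priority, entry
    return None
```

 Strict ordering like this can starve `low` jobs under sustained `high` traffic, so a real consumer may want an aging or quota rule on top.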
 #### R4: Add Backpressure Mechanism
 - Instead of hard threshold at 50, implement **adaptive backpressure**:
  - Queue depth 0-30: normal operation
  - Queue depth 30-40: return `retry-after` header with increasing delay
  - Queue depth 40-50: return 503 with exponential retry-after
  - Queue depth >50: circuit breaker open
 - **File:** `queue-service/queue-service.py`
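
 One way to express the R4 tiers is a pure function from queue depth to a response decision. The tier boundaries follow the list above, with each boundary value assigned to the stricter tier. The tier names, the concrete delays (one second per job in the warn tier, powers of two capped at 60 seconds in the reject tier), and returning 202 for accepted jobs are assumptions for illustration.

```python
def backpressure(depth, warn=30, reject=40, open_at=50):
    """Map queue depth to (http_status, retry_after_seconds, tier).

    retry_after_seconds is None when no Retry-After header should be sent.
    """
    if depth > open_at:
        return 503, 60, "open"
    if depth >= reject:
        # Exponential delay: 2, 4, 8, ... seconds, capped at 60.
        return 503, min(60, 2 ** (depth - reject + 1)), "reject"
    if depth >= warn:
        # Request is still accepted; the delay grows by one second per queued job.
        return 202, depth - warn + 1, "warn"
    return 202, None, "normal"
```

 The enqueue endpoint would call this with the current stream length and copy the second value into a `Retry-After` response header when it is not `None`.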
 #### R5: Dead Letter Queue (DLQ)
 - Move failed/unprocessable jobs to an `inference:dead-letter` stream
 - Include failure reason, attempt count, and original payload
 - Provide admin API to inspect, retry, or discard DLQ entries
 - **File:** `queue-service/queue-service.py`
 ```python
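 # New endpoints (assumes a redis-py client created with decode_responses=True)
 @app.route("/dlq", methods=["GET"])
 def list_dlq():
     # XRANGE returns (message_id, fields) pairs; jsonify turns them into a response
     return jsonify(r.xrange("inference:dead-letter"))
 @app.route("/dlq/retry/<message_id>", methods=["POST"])
 def retry_dlq(message_id):
     # Streams have no single-entry "get"; XRANGE with min == max fetches one ID
     entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
     if not entries:
         return jsonify({"error": "not found"}), 404
     _, fields = entries[0]
     r.xadd("inference:stream", fields)
     r.xdel("inference:dead-letter", message_id)
     return jsonify({"requeued": message_id}), 202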
 ```
 #### R6: GPU-Aware Routing
 - Queue consumer should check GPU `slots_busy` before routing
 - If a GPU is busy, try the next available GPU
 - Track per-GPU queue depth and avoid overloading a single GPU
 - **File:** New consumer logic
 #### R7: Job Status API
 - Add job ID generation on enqueue
 - Provide `/status/<job_id>` endpoint to check progress
 - Store job state in Redis: `queued` → `processing` → `completed`/`failed`
 - **File:** `queue-service/queue-service.py`
 ```python
 @app.route("/enqueue", methods=["POST"])
 def enqueue():
     job_id = str(uuid.uuid4())
     job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
     r.xadd(stream_key, {"job": json.dumps(job)})
     r.hset("job:status", job_id, json.dumps({"status": "queued"}))
     return jsonify({"job_id": job_id, "status": "queued"}), 202
 @app.route("/status/<job_id>")
 def job_status(job_id):
     status = r.hget("job:status", job_id)
     # Return 404 only when the job is unknown; a found job gets the default 200
     if status is None:
         return jsonify({"error": "not found"}), 404
     return jsonify(json.loads(status))
 ```
 #### R8: Health-Based Circuit Breaker
 - Replace simple depth threshold with **per-GPU circuit breakers**
 - Track consecutive failures per GPU
 - Implement half-open state: after cooldown, probe one GPU to test recovery
 - **File:** `queue-service/queue-service.py`
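
 The half-open behaviour in R8 could be sketched as below. This is an illustrative state machine, not the project's implementation: the class name and method names are hypothetical, the defaults mirror the `CIRCUIT_BREAKER_THRESHOLD=5` and `CIRCUIT_BREAKER_TIMEOUT=30` values from the environment file, and the clock is injectable so the logic can be tested without waiting. It is not thread-safe as written.

```python
import time


class GpuCircuitBreaker:
    """Circuit breaker for one GPU: closed -> open -> half-open -> closed."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold  # consecutive failures before opening
        self.cooldown = cooldown    # seconds to wait before allowing a probe
        self.clock = clock
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed
        self.probing = False        # True while a half-open probe is in flight

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def allow(self):
        """Return True if a request may be sent to this GPU right now."""
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self.probing:
            self.probing = True     # let exactly one probe through
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None
        self.probing = False

    def record_failure(self):
        self.failures += 1
        self.probing = False
        if self.opened_at is not None or self.failures >= self.threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe
```

 A consumer would keep one breaker per GPU, for example `breakers = {name: GpuCircuitBreaker() for name in ("amdpve", "llmgpu", "ocu_llm")}`, and skip any GPU whose `allow()` returns `False`.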
 #### R9: Centralized Configuration
 - Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
  - Redis config key: `config:gpus`
  - Or environment file mounted to all containers
 - Nginx can use Lua/variable from config instead of static upstreams
 - **File:** New `config/` directory or Redis-based config
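
 For the environment-file option in R9, each service could load GPU endpoints through one shared helper instead of carrying its own defaults. The sketch below assumes the `<NAME>_ENDPOINT` variable naming used in the environment file (`AMDPVE_ENDPOINT`, `LLMGPU_ENDPOINT`, `OCU_LLM_ENDPOINT`); `load_gpu_endpoints` is a hypothetical function name.

```python
import os

GPU_NAMES = ("amdpve", "llmgpu", "ocu_llm")


def load_gpu_endpoints(env=None, names=GPU_NAMES):
    """Read <NAME>_ENDPOINT variables into a {name: url} dict.

    Raises KeyError naming every missing variable, so a misconfigured
    container fails at startup instead of falling back to a stale default.
    """
    env = os.environ if env is None else env
    endpoints, missing = {}, []
    for name in names:
        var = f"{name.upper()}_ENDPOINT"
        value = env.get(var, "").strip()
        if value:
            endpoints[name] = value.rstrip("/")
        else:
            missing.append(var)
    if missing:
        raise KeyError("missing GPU endpoint variables: " + ", ".join(missing))
    return endpoints
```

 Failing fast on a missing variable is a design choice here: it avoids the situation in Q3, where an unreachable hardcoded default hides a configuration error.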
 ---
 ## 5. Priority Issue Summary
 ### Critical (Fix Immediately)
 1. **Q1**: Queue has no consumer; enqueued requests are never processed
 2. **Q4**: No job ID or result retrieval mechanism
 3. **N3**: Queue fallback triggers on individual GPU failure, not all-down
 ### High (Fix Before Production)
 4. **Q5**: Circuit breaker has no recovery mechanism
 5. **Q6**: `/status` endpoint blocks on GPU health checks
 6. **D1**: Dashboard `/api/status` makes 4 sequential calls, up to 11s
 7. **C2**: `Dockerfile.queue` path mismatch in docker-compose
 8. **I1**: No dependency pinning in any Dockerfile
 9. **I3**: Multi-process CMD without supervisor in GPU dashboard
 ### Medium (Improve in Next Iteration)
 10. **Q3**: Redis host default conflicts with Docker networking
 11. **Q7**: Silent exception swallowing in Redis access
 12. **Q8**: Content-Type header dropped in queue
 13. **D2**: Single-threaded dashboard server
 14. **D3**: Three separate sources of truth for GPU addresses
 15. **G1**: Sequential GPU collection blocks on single failure
 16. **N1**: Rate limit burst of 20 nodelay defeats protection
 17. **N5**: Hardcoded container names in Nginx
 18. **C1**: Queue API exposed externally
 19. **C4**: No Docker health checks
 ### Low (Nice to Have)
 20. **Q9**: No graceful shutdown
 21. **C3**: `restart: always` vs `unless-stopped`
 22. **C5**: No Redis authentication
 23. **G4**: 60s dead threshold is too aggressive
 24. **I2**: Unnecessary `requests` dependency
 25. **I4**: No `.dockerignore`
 26. **I5**: Duplicate Dockerfiles
 ---
 ## 6. Deployment Architecture Summary
 ### What Works Well
 - Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
 - Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
 - SSE streaming support in Nginx for LLM response streaming
 - Rate limiting at the gateway layer
 - Circuit breaker pattern implemented (even if basic)
 ### What Needs Work
 - **Queue is incomplete**: storage without processing is the most critical gap
 - **No job lifecycle**: requests go in and never come out
 - **Duplicated configuration**: GPU addresses in 3+ places
 - **No monitoring/alerting**: no Prometheus metrics, no alerting rules
 - **Single point of failure**: no Redis replication, no container redundancy
 - **No logging**: Flask dev server logs are minimal; no structured logging
 ### Recommended Next Steps
 1. **Priority 1:** Implement queue consumer with GPU load-based routing
 2. **Priority 2:** Add job status tracking and result retrieval
 3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
 4. **Priority 4:** Add Docker health checks and proper dependency management
 5. **Priority 5:** Centralize GPU configuration in Redis or environment
 6. **Priority 6:** Add Prometheus metrics endpoint for observability
@@ -0,0 +1,5 @@
 FROM python:3.11-slim
 WORKDIR /app
 COPY dashboard/harness-dashboard.py .
 EXPOSE 3001
 CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,14 @@
 FROM python:3.11-slim
 RUN pip install requests
 COPY gpu-dashboard/ /app/
 WORKDIR /app
 RUN mkdir -p /app/public && \
    cp gpu.html /app/public/ && \
    touch /app/public/gpu_metrics.json
 EXPOSE 8092
 CMD ["sh", "-c", "python3 gpu_collector.py & python3 -m http.server 8092 --directory /app/public & wait"]
@@ -0,0 +1,8 @@
 FROM python:3.13-slim
 COPY harness-dashboard.py /app/harness-dashboard.py
 WORKDIR /app
 EXPOSE 3001
 CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,5 @@
 FROM python:3.11-slim
 WORKDIR /app
 COPY harness-dashboard.py .
 EXPOSE 3001
 CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,133 @@
 #!/usr/bin/env python3
 """Syslog Harness Dashboard — Simple HTTP server exposing GPU health + metrics."""
 import json
 import os
 import time
 import urllib.request
 from http.server import HTTPServer, SimpleHTTPRequestHandler
 from datetime import datetime
 GPUS = {
    "amdpve": {"endpoint": os.getenv("AMDVE_EP", "192.168.68.15:8080"), "model": "qwen3.6-35B-A3B (MoE)", "vram": "65GB"},
    "llmgpu": {"endpoint": os.getenv("LLMGPU_EP", "192.168.68.8:8080"), "model": "qwen3.5-27B (Dense)", "vram": "24GB"},
    "ocu_llm": {"endpoint": os.getenv("OCU_LLM_EP", "192.168.68.110:8080"), "model": "gemma-4-E4B (Light)", "vram": "12GB"},
 }
 def check_gpu(name, info):
    try:
        start = time.time()
        # Use simple HTTP GET to check if the GPU endpoint is alive
        resp = urllib.request.urlopen(f"http://{info['endpoint']}/", timeout=3)
        latency = (time.time() - start) * 1000
        return {
            "status": "up",
            "latency_ms": round(latency, 1),
            "model": info["model"],
            "vram": info["vram"],
        }
    except Exception as e:
        return {"status": "down", "error": str(e)[:50], "model": info["model"], "vram": info["vram"]}
 def get_queue_status():
    try:
        req = urllib.request.Request("http://queue-service:8091/status")
        resp = urllib.request.urlopen(req, timeout=2)
        return json.loads(resp.read())
    except Exception:
        return {"queue_depth": -1, "circuit_breaker": "unknown", "gpu_health": {}}
 DASHBOARD_HTML = """
 <!DOCTYPE html>
 <html><head><meta charset="utf-8"><title>🦅 Syslog Harness</title>
 <style>
  body { background: #1a1a2e; color: #e0e0e0; font-family: monospace; margin: 0; padding: 20px; }
  .card { background: #16213e; border-radius: 8px; padding: 16px; margin: 10px 0; border-left: 4px solid #0f3460; }
  .up { border-left-color: #00d26a; } .down { border-left-color: #ff4757; }
  .warn { border-left-color: #ffa502; }
  h1 { color: #00d26a; font-size: 24px; } h2 { color: #0f3460; font-size: 16px; }
  .metric { display: inline-block; margin: 4px 12px; }
  .value { font-weight: bold; color: #00d26a; }
  #refresh { position: fixed; top: 10px; right: 10px; background: #0f3460; color: white;
             border: none; padding: 8px 16px; border-radius: 4px; cursor: pointer; }
  table { width: 100%; border-collapse: collapse; margin: 10px 0; }
  th, td { text-align: left; padding: 8px; border-bottom: 1px solid #0f3460; }
  th { color: #00d26a; }
 </style></head><body>
 <button id="refresh" onclick="location.reload()">↻ Refresh</button>
 <h1>🦅 Syslog Harness Dashboard</h1>
 <h2>Updated: <span id="ts"></span></h2>
 <div class="card" id="queue-card">
  <h2>Queue & Circuit Breaker</h2>
  <div class="metric">Depth: <span class="value" id="depth">--</span></div>
  <div class="metric">Circuit: <span class="value" id="circuit">--</span></div>
  <div class="metric">Threshold: <span class="value" id="threshold">--</span></div>
 </div>
 <div class="card">
  <h2>GPU Endpoints</h2>
  <table><tr><th>GPU</th><th>Model</th><th>VRAM</th><th>Status</th><th>Latency</th></tr>
  <tbody id="gpu-table"></tbody></table>
 </div>
 <script>
  document.getElementById('ts').textContent = new Date().toISOString();
  fetch('/api/status').then(r => r.json()).then(data => {
    document.getElementById('depth').textContent = data.queue_depth;
    document.getElementById('circuit').textContent = data.circuit_breaker;
    document.getElementById('threshold').textContent = 'warn:' + data.thresholds.warn + ' / open:' + data.thresholds.open;
    const card = document.getElementById('queue-card');
    if (data.circuit_breaker === 'open') card.className = 'card warn';
    else if (data.circuit_breaker === 'warn') card.className = 'card warn';
    else card.className = 'card up';
    let html = '';
    for (const [name, gpu] of Object.entries(data.gpu_health)) {
      const status = gpu.status === 'up' ? '✅' : '❌';
      const latency = gpu.status === 'up' ? gpu.latency_ms + 'ms' : gpu.error;
      const rowClass = gpu.status === 'up' ? '' : 'down';
      html += `<tr class="${rowClass}"><td>${name}</td><td>${gpu.model}</td><td>${gpu.vram}</td><td>${status}</td><td>${latency}</td></tr>`;
    }
    document.getElementById('gpu-table').innerHTML = html;
  });
  setInterval(() => location.reload(), 10000);
 </script></body></html>
 """
 class Handler(SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/" or self.path == "/harness.html":
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(DASHBOARD_HTML.encode())
        elif self.path == "/api/status":
            status = get_queue_status()
            enriched = {
                "queue_depth": status.get("queue_depth", -1),
                "circuit_breaker": status.get("circuit_breaker", "unknown"),
                "thresholds": status.get("thresholds", {"warn": 30, "open": 50}),
                "gpu_health": {},
            }
            for name, info in GPUS.items():
                enriched["gpu_health"][name] = check_gpu(name, info)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(enriched).encode())
        else:
            self.send_response(404)
            self.end_headers()
    def log_message(self, format, *args):
        pass  # Suppress request logs
 if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", 3001), Handler)
    print("Dashboard running on :3001/harness.html")
    server.serve_forever()
@@ -1,27 +1,54 @@
 version: "3.8"
 services:
  redis:
    image: redis:7-alpine
    restart: always
    networks:
      - gpu-router-net
    volumes:
      - redis-data:/data
  queue-service:
-    build: ./queue-service
+    build:
-    container_name: syslog-queue
+      context: .
-    restart: unless-stopped
+      dockerfile: Dockerfile.queue
    restart: always
    networks:
      - gpu-router-net
    ports:
      - "8091:8091"
    depends_on:
      - redis
    environment:
-      - REDIS_HOST=192.168.68.7
+      - REDIS_HOST=redis
      - REDIS_PORT=6379
    networks:
      - harness-net
  dashboard:
-    build: ./dashboard
+    build:
-    container_name: syslog-dashboard
+      context: .
-    restart: unless-stopped
+      dockerfile: Dockerfile.dashboard
    restart: always
    networks:
      - gpu-router-net
    ports:
      - "3001:3001"
    depends_on:
-      - queue-service
+      - redis
  gpu-dashboard:
    build:
      context: .
      dockerfile: Dockerfile.gpu
    restart: always
    networks:
-      - harness-net
+      - gpu-router-net
    ports:
      - "8092:8092"
 networks:
-  harness-net:
+  gpu-router-net:
    driver: bridge
 volumes:
  redis-data:
@@ -0,0 +1,115 @@
 #!/usr/bin/env python3
 """GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
 import urllib.request, json, time, os
 HOSTS = [
    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
 ]
 OUTPUT = "/root/hermes-workspace/public/gpu_metrics.json"
 INTERVAL = 10
 STALE_THRESHOLD = 30  # seconds before marking stale
 DEAD_THRESHOLD = 60   # seconds before marking unreachable
 last_seen = {}
 def fetch_json(url, timeout=3):
    try:
        req = urllib.request.Request(url)
        resp = urllib.request.urlopen(req, timeout=timeout)
        return json.loads(resp.read().decode())
    except Exception:
        return None
 def collect_one(h):
    """Collect GPU hardware + llama.cpp inference state for one host."""
    name = h["name"]
    host = h["host"]
    now = time.time()
    # GPU hardware from sidecar
    gpu = fetch_json(f"http://{host}:8090/")
    # llama.cpp inference state
    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
    # Determine inference state
    model_name = None
    inference_state = "unknown"
    if llamacpp_models:
        models = llamacpp_models.get("data", [])
        if models:
            model_name = models[0].get("id")
    if llamacpp_health:
        status = llamacpp_health.get("status", "")
        if status == "ok":
            idle = llamacpp_health.get("slots_idle", 0)
            processing = llamacpp_health.get("slots_processing", 0)
            if idle and not processing:
                inference_state = "idle"
            elif processing:
                inference_state = "busy"
            else:
                inference_state = "idle"
    # Check for /slots endpoint for is_processing detail
    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
    if slots and isinstance(slots, list) and len(slots) > 0:
        if slots[0].get("is_processing"):
            inference_state = "busy"
    result = {
        "host": name,
        "gpu_name": h["gpu"],
        "inference": {
            "state": inference_state,
            "model": model_name,
        },
        "hardware": gpu if gpu else None,
        "online": gpu is not None,
        "timestamp": now,
    }
    if gpu is not None:
        last_seen[name] = now
    if name in last_seen:
        age = now - last_seen[name]
        if age > DEAD_THRESHOLD:
            result["online"] = False
        elif age > STALE_THRESHOLD:
            result["stale"] = True
    return result
 def main():
    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
    while True:
        start = time.time()
        results = [collect_one(h) for h in HOSTS]
        payload = {
            "updated": start,
            "gpus": results,
        }
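        # Write to a temp file, then swap it in atomically so the dashboard
        # never reads a half-written JSON document.
        with open(OUTPUT + ".tmp", "w") as f:
            json.dump(payload, f)
        os.replace(OUTPUT + ".tmp", OUTPUT)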
        elapsed = time.time() - start
        sleep_for = max(0, INTERVAL - elapsed)
        time.sleep(sleep_for)
 if __name__ == "__main__":
    main()
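
The two thresholds in the collector can be read as a three-way liveness classification. The following is a minimal sketch, assuming the intent is that a host moves from online to stale to dead as the time since its last successful sidecar poll grows. The function `classify_liveness` and its parameter names are hypothetical and are not part of this commit.

```python
import time

# Mirrors the collector's constants (seconds since the last successful poll).
STALE_THRESHOLD = 30
DEAD_THRESHOLD = 60


def classify_liveness(last_seen_ts, now=None,
                      stale_after=STALE_THRESHOLD, dead_after=DEAD_THRESHOLD):
    """Return 'online', 'stale', or 'dead' for a host.

    last_seen_ts is the Unix time of the last successful poll,
    or None if the host has never answered.
    """
    if last_seen_ts is None:
        return "dead"
    if now is None:
        now = time.time()
    age = now - last_seen_ts
    if age > dead_after:
        return "dead"
    if age > stale_after:
        return "stale"
    return "online"
```

Keeping the classification in a pure function like this makes the thresholds testable without polling real hosts.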
@@ -0,0 +1,183 @@
 <!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
 <title>GPU Monitor</title>
 <style>
 * { margin: 0; padding: 0; box-sizing: border-box; }
 body { background: #0d1117; color: #c9d1d9; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; padding: 20px; }
 h1 { font-size: 1.3em; margin-bottom: 4px; }
 .topbar { display: flex; justify-content: space-between; align-items: center; margin-bottom: 20px; padding-bottom: 12px; border-bottom: 1px solid #21262d; }
 .topbar .status { font-size: 0.85em; color: #8b949e; }
 .topbar .status .dot { display: inline-block; width: 8px; height: 8px; border-radius: 50%; margin-right: 6px; }
 .dot.green { background: #3fb950; }
 .dot.yellow { background: #d2991d; }
 .dot.red { background: #f85149; }
 .cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); gap: 16px; }
 .card { background: #161b22; border: 1px solid #21262d; border-radius: 8px; padding: 16px; }
 .card.stale { opacity: 0.5; }
 .card.dead { opacity: 0.3; border-color: #f85149; }
 .card-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
 .card-header .name { font-weight: 600; font-size: 1.05em; }
 .card-header .host { font-size: 0.8em; color: #8b949e; }
 .card-header .state { font-size: 0.75em; padding: 2px 8px; border-radius: 10px; font-weight: 600; }
 .state.idle { background: #1b3826; color: #3fb950; }
 .state.busy { background: #3d1f1a; color: #f85149; }
 .state.unknown { background: #21262d; color: #8b949e; }
 .metric { margin-bottom: 10px; }
 .metric-label { display: flex; justify-content: space-between; font-size: 0.82em; color: #8b949e; margin-bottom: 2px; }
 .metric-label .val { color: #c9d1d9; font-weight: 500; }
 .bar { height: 6px; border-radius: 3px; background: #21262d; overflow: hidden; }
 .bar-fill { height: 100%; border-radius: 3px; transition: width 0.5s ease; }
 .bar-fill.temp-cool { background: #3fb950; }
 .bar-fill.temp-warm { background: #d2991d; }
 .bar-fill.temp-hot { background: #f85149; }
 .bar-fill.util { background: #58a6ff; }
 .bar-fill.vram { background: #bc8cff; }
 .bar-fill.power { background: #f0883e; }
 .model-line { font-size: 0.82em; color: #8b949e; margin-top: 8px; padding-top: 8px; border-top: 1px solid #21262d; }
 .model-line span { color: #c9d1d9; }
 .error { color: #f85149; font-size: 0.85em; }
 </style>
 </head>
 <body>
 <div class="topbar">
  <div>
    <h1><a href="/" style="color:#58a6ff;text-decoration:none;">← Workspace</a> · GPU Monitor</h1>
    <span class="status"><span class="dot green" id="status-dot"></span><span id="status-text">Loading...</span></span>
  </div>
  <div class="status" id="age">—</div>
 </div>
 <div class="cards" id="cards"></div>
 <script>
 const INTERVAL = 5000;
 let lastFetchTime = null;
 function updateClock() {
  const el = document.getElementById('age');
  if (!lastFetchTime) { el.textContent = '—'; return; }
  const age = Math.round((Date.now() / 1000) - lastFetchTime);
  el.textContent = age <= 60 ? `updated ${age}s ago` : `stale ${age}s ago`;
 }
 setInterval(updateClock, 1000);
 const TEMP_WARN = 70, TEMP_HOT = 82;
 const VRAM_WARN = 80, VRAM_HOT = 92;
 function tempClass(c) { return c > TEMP_HOT ? 'temp-hot' : c > TEMP_WARN ? 'temp-warm' : 'temp-cool'; }
 function vramClass(pct) { return pct > VRAM_HOT ? 'temp-hot' : pct > VRAM_WARN ? 'temp-warm' : 'temp-cool'; }
 function pct(val, max) { return max ? Math.round(val / max * 100) : 0; }
 function mbToGB(mb) { return mb ? (mb / 1024).toFixed(1) : '—'; }
 function renderCard(g) {
  const hw = g.hardware || {};
  const inf = g.inference || {};
  const online = g.online !== false;
  const stale = g.stale === true;
  let cardClass = '';
  if (!online) cardClass = 'dead';
  else if (stale) cardClass = 'stale';
  let stateClass = inf.state || 'unknown';
  let stateLabel = inf.state ? inf.state.toUpperCase() : 'UNKNOWN';
  if (!online) { stateClass = 'unknown'; stateLabel = 'OFFLINE'; }
  const temp = hw.temp_c;
  const util = hw.gpu_util_pct;
  const vramUsed = hw.vram_used_mb;
  const vramTotal = hw.vram_total_mb;
  const power = hw.power_w;
  const powerLimit = hw.power_limit_w;
  const fan = hw.fan_pct;
  const vendor = hw.vendor;
  let html = `<div class="card ${cardClass}">`;
  html += `<div class="card-header">`;
  html += `<div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div>`;
  html += `<div class="state ${stateClass}">${stateLabel}</div>`;
  html += `</div>`;
  if (!online) {
    html += `<div class="error">Unreachable</div>`;
  } else if (hw.error) {
    html += `<div class="error">${hw.error}</div>`;
  } else {
    // Temperature
    if (temp != null) {
      html += `<div class="metric"><div class="metric-label"><span>Temperature</span><span class="val">${temp}°C</span></div>`;
      html += `<div class="bar"><div class="bar-fill ${tempClass(temp)}" style="width:${Math.min(temp,100)}%"></div></div></div>`;
    }
    // Utilization
    if (util != null) {
      html += `<div class="metric"><div class="metric-label"><span>GPU Utilization</span><span class="val">${util}%</span></div>`;
      html += `<div class="bar"><div class="bar-fill util" style="width:${util}%"></div></div></div>`;
    }
    // VRAM
    if (vramUsed != null && vramTotal != null) {
      const vramPct = pct(vramUsed, vramTotal);
      html += `<div class="metric"><div class="metric-label"><span>VRAM</span><span class="val">${mbToGB(vramUsed)} / ${mbToGB(vramTotal)} GB</span></div>`;
      html += `<div class="bar"><div class="bar-fill ${vramClass(vramPct)}" style="width:${vramPct}%"></div></div></div>`;
    }
    // Power
    if (power != null) {
      const powerPct = powerLimit ? pct(power, powerLimit) : 0;
      const powerText = powerLimit ? `${power}W / ${powerLimit}W` : `${power}W`;
      html += `<div class="metric"><div class="metric-label"><span>Power</span><span class="val">${powerText}</span></div>`;
      if (powerLimit) html += `<div class="bar"><div class="bar-fill power" style="width:${powerPct}%"></div></div>`;
      html += `</div>`;
    }
    // Fan (NVIDIA only)
    if (fan != null) {
      html += `<div class="metric"><div class="metric-label"><span>Fan Speed</span><span class="val">${fan}%</span></div>`;
      html += `<div class="bar"><div class="bar-fill util" style="width:${fan}%"></div></div></div>`;
    }
  }
  // Model loaded
  html += `<div class="model-line">Model: <span>${inf.model || '—'}</span></div>`;
  html += `</div>`;
  return html;
 }
 async function refresh() {
  try {
    const resp = await fetch('gpu_metrics.json?t=' + Date.now());
    if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
    const data = await resp.json();
    const gpus = data.gpus || [];
    document.getElementById('cards').innerHTML = gpus.map(renderCard).join('');
    // Top bar status
    const online = gpus.filter(g => g.online !== false).length;
    const total = gpus.length;
    const dot = document.getElementById('status-dot');
    const txt = document.getElementById('status-text');
    if (online === total) { dot.className = 'dot green'; txt.textContent = `${online}/${total} online`; }
    else if (online > 0) { dot.className = 'dot yellow'; txt.textContent = `${online}/${total} online`; }
    else { dot.className = 'dot red'; txt.textContent = 'All offline'; }
    // Capture fetch time for live clock
    lastFetchTime = Date.now() / 1000;
  } catch(e) {
    document.getElementById('status-dot').className = 'dot red';
    document.getElementById('status-text').textContent = 'Collector down';
  }
 }
 // Render skeletons instantly
 const SKELETONS = [
  {host:'amdpve', gpu_name:'AMD Strix Halo', hardware:{}, inference:{}, online:true},
  {host:'llmgpu', gpu_name:'RTX 3090', hardware:{}, inference:{}, online:true},
  {host:'ocu-llm', gpu_name:'RTX 5070', hardware:{}, inference:{}, online:true},
 ];
 document.getElementById('cards').innerHTML = SKELETONS.map(g =>
  `<div class="card"><div class="card-header"><div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div><div class="state unknown">···</div></div><div class="model-line" style="color:#8b949e;">Loading metrics...</div></div>`
 ).join('');
 refresh();
 setInterval(refresh, INTERVAL);
 </script>
 </body>
 </html>
@@ -0,0 +1,115 @@
 #!/usr/bin/env python3
 """GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
 import json
 import os
 import time
 import urllib.request
 HOSTS = [
    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
 ]
 OUTPUT = "/app/public/gpu_metrics.json"
 INTERVAL = 10
 STALE_THRESHOLD = 30  # seconds before marking stale
 DEAD_THRESHOLD = 60   # seconds before marking unreachable
 last_seen = {}
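 def fetch_json(url, timeout=3):
    """Return the decoded JSON body of url, or None on any failure."""
    try:
        # The context manager closes the connection even if decoding fails.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except Exception:
        return None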
 def collect_one(h):
    """Collect GPU hardware + llama.cpp inference state for one host."""
    name = h["name"]
    host = h["host"]
    now = time.time()
    # GPU hardware from sidecar
    gpu = fetch_json(f"http://{host}:8090/")
    # llama.cpp inference state
    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
    # Determine inference state
    model_name = None
    inference_state = "unknown"
    if llamacpp_models:
        models = llamacpp_models.get("data", [])
        if models:
            model_name = models[0].get("id")
    if llamacpp_health:
        status = llamacpp_health.get("status", "")
        if status == "ok":
            # Busy whenever a slot is processing; otherwise idle.
            processing = llamacpp_health.get("slots_processing", 0)
            inference_state = "busy" if processing else "idle"
    # /slots gives per-slot detail: any processing slot means the host is busy
    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
    if isinstance(slots, list) and any(
        isinstance(s, dict) and s.get("is_processing") for s in slots
    ):
        inference_state = "busy"
    result = {
        "host": name,
        "gpu_name": h["gpu"],
        "inference": {
            "state": inference_state,
            "model": model_name,
        },
        "hardware": gpu if gpu else None,
        "online": gpu is not None,
        "timestamp": now,
    }
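    if gpu is not None:
        last_seen[name] = now
    elif name in last_seen:
        # The sidecar missed this poll. Keep the host online until
        # DEAD_THRESHOLD has elapsed, and flag it stale after STALE_THRESHOLD;
        # otherwise a single failed poll would skip the stale state entirely.
        age = now - last_seen[name]
        if age <= DEAD_THRESHOLD:
            result["online"] = True
            if age > STALE_THRESHOLD:
                result["stale"] = True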
    return result
 def main():
    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
    while True:
        start = time.time()
        results = [collect_one(h) for h in HOSTS]
        payload = {
            "updated": start,
            "gpus": results,
        }
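        # Write to a temp file, then swap it in atomically so the dashboard
        # never reads a half-written JSON document.
        with open(OUTPUT + ".tmp", "w") as f:
            json.dump(payload, f)
        os.replace(OUTPUT + ".tmp", OUTPUT)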
        elapsed = time.time() - start
        sleep_for = max(0, INTERVAL - elapsed)
        time.sleep(sleep_for)
 if __name__ == "__main__":
    main()
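
The inference-state logic in `collect_one` can also be isolated from the HTTP calls. The following is a rough sketch, assuming the `/health` payload carries `status` and `slots_processing` and the `/slots` payload is a list of objects with `is_processing`, as the collector code reads them. Whether a given llama.cpp build actually returns these fields is an assumption, and `derive_inference_state` is a hypothetical helper, not part of this commit.

```python
def derive_inference_state(health, slots=None):
    """Map decoded /health and /slots payloads to 'idle', 'busy', or 'unknown'.

    Either argument may be None when the corresponding request failed.
    """
    state = "unknown"
    if isinstance(health, dict) and health.get("status") == "ok":
        state = "busy" if health.get("slots_processing", 0) else "idle"
    # Per-slot detail overrides the summary: any processing slot means busy.
    if isinstance(slots, list) and any(
        isinstance(s, dict) and s.get("is_processing") for s in slots
    ):
        state = "busy"
    return state
```

Checking every slot rather than only the first matters when the server is started with more than one parallel slot.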
@@ -0,0 +1,14 @@
 #!/bin/bash
 set -e
 # Start collector as background process (it uses absolute paths, so no cd is needed)
 python3 /app/collector.py &
 COLLECTOR_PID=$!
 echo "Collector started (PID $COLLECTOR_PID)"
 echo "Serving dashboard on :8092"
 # Serve the public directory (contains gpu.html + gpu_metrics.json)
 cd /root/hermes-workspace/public
 python3 -m http.server 8092
@@ -24,7 +24,12 @@ upstream queue_service {
 upstream dashboard_service {
    ## Harness dashboard (Docker container)
-    server dashboard:3001;
+    server syslog-harness-dashboard-1:3001;
 }
 upstream gpu_dashboard_pool {
    ## GPU dashboard (Docker container)
    server syslog-harness-gpu-dashboard-1:8092;
 }
 ## ------------------------------------------------------------------
@@ -56,6 +61,17 @@ server {
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    }
    ## ------------------------------------------------------------------
    ## GPU Dashboard — observability UI (MUST be before / catch-all)
    ## ------------------------------------------------------------------
    ## Trailing slash on both location and proxy_pass so /gpu/x maps to /x (not //x)
    location /gpu/ {
        proxy_pass http://gpu_dashboard_pool/;
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    ## ------------------------------------------------------------------
    ## Main location — proxy to selected upstream
    ## ------------------------------------------------------------------
@@ -1,106 +0,0 @@
 ## Syslog GPU Router — Nginx Configuration
 ## Routes incoming agent requests to the appropriate GPU backend
 ## based on the X-Syslog-Model header.
 upstream amdpve_pool {
    ## Strix Halo 395 — qwen3.6-35B-A3B (MoE) — Default workhorse
    server 192.168.68.15:8080;
 }
 upstream llmgpu_pool {
    ## RTX 3090 — qwen3.5-27B (Dense) — Heavy reasoning
    server 192.168.68.8:8080;
 }
 upstream ocu_llm_pool {
    ## RTX 5070 — gemma-4 (Dense 4B) — Ultra-light tasks
    server 192.168.68.110:8080;
 }
 upstream queue_service {
    ## Agent queue with circuit breaker (Docker container)
    server 127.0.0.1:8091;
 }
 upstream dashboard_service {
    ## Harness dashboard (Docker container)
    server 127.0.0.1:3001;
 }
 ## ------------------------------------------------------------------
 ## Mapping: X-Syslog-Model header → upstream backend
 ## ------------------------------------------------------------------
 map $http_x_syslog_model $gpu_upstream {
    default          amdpve_pool;   # missing header → default workhorse
    "standard"       amdpve_pool;
    "heavy"          llmgpu_pool;
    "qwen3.5-27B"    llmgpu_pool;
    "light"          ocu_llm_pool;
    "gemma-4"        ocu_llm_pool;
 }
 server {
    listen 8080;
    server_name _;
    # Rate limit zone — 10 req/s per IP, burst of 20
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
    ## ------------------------------------------------------------------
    ## Dashboard — observability UI (MUST be before / catch-all)
    ## ------------------------------------------------------------------
    location /dashboard {
        proxy_pass http://dashboard_service/;
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    }
    ## ------------------------------------------------------------------
    ## Main location — proxy to selected upstream
    ## ------------------------------------------------------------------
    location / {
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 503;
        proxy_pass http://$gpu_upstream;
        ## Preserve original host and headers
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        ## Pass through the model header so backends can log it
        proxy_pass_header X-Syslog-Model;
        ## Streaming support (SSE for LLM responses)
        proxy_buffering off;
        proxy_cache     off;
        proxy_read_timeout  300s;
        proxy_send_timeout  300s;
        ## Basic failover — retry on error or timeout
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 2;
        ## Add a response header for observability
        add_header X-Routed-To $gpu_upstream always;
        ## Fallback to queue when all GPU upstreams are down
        error_page 502 503 504 = @queue_fallback;
    }
    ## ------------------------------------------------------------------
    ## Queue fallback — enqueue when GPUs are unavailable
    ## ------------------------------------------------------------------
    location @queue_fallback {
        rewrite ^ /enqueue break;
        proxy_pass http://queue_service;
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Content-Type      $content_type;
        proxy_pass_request_body            on;
    }
 }
@@ -0,0 +1,10 @@
 FROM python:3.13-slim
 RUN pip install --no-cache-dir flask redis
 COPY queue-service.py /app/queue-service.py
 WORKDIR /app
 EXPOSE 8091
 CMD ["python3", "queue-service.py"]