feat: Smart Queue Consumer implementation draft + architecture review

- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines) with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts

Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector

Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API
Add GPU dashboard container + Nginx routing
2026-05-17 03:55:20 +00:00 · 2026-05-15 22:25:56 +00:00
29 changed files with 4052 additions and 831 deletions
@@ -1,8 +0,0 @@
-# Syslog Harness Environment
-REDIS_HOST=192.168.68.8
-REDIS_PORT=6379
-AMDPVE_ENDPOINT=http://192.168.68.15:8080
-LLMGPU_ENDPOINT=http://192.168.68.8:8080
-OCU_LLM_ENDPOINT=http://192.168.68.110:8080
-CIRCUIT_BREAKER_THRESHOLD=5
-CIRCUIT_BREAKER_TIMEOUT=30
@@ -1,5 +0,0 @@
-__pycache__/
-*.pyc
-.env
-redis-data/
-ssl/
@@ -0,0 +1,390 @@
+# Syslog Harness: Architecture Review & Improvement Recommendations
+
+**Date:** 2026-05-17  
+**Commit:** `e95475f`: "Add GPU dashboard container + Nginx routing"  
+**Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git
+
+---
+
+## 1. Current Architecture Overview
+
+```
+Host (192.168.68.123):
+
+  Agent --:8080--> [ Nginx Router :8080 ] --> [ Queue Service :8091 ] --> [ Dashboard :3001 ]
+                            |                           |
+                            v                           v
+                   [ GPU Pool :8080 ]           [ Redis :6379 ] --------> [ GPU Dashboard :8092 ]
+                            |
+            .---------------+---------------.
+            |               |               |
+         amdpve          llmgpu          ocu_llm
+         .15:8080        .8:8080         .110:8080
+         MoE 35B         Dense 27B       Light 4B
+```
+
+### Services
+
+| Service | Port | Container | Image | Purpose |
+|---|---|---|---|---|
+| **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
+| **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
+| **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
+| **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
+| **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |
+
+### GPU Backends
+
+| Host | GPU | Model | Capacity |
+|---|---|---|---|
+| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
+| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
+| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |
+
+### Data Flow
+
+1. **Agent** sends request with `X-Syslog-Model` header → Nginx :8080
+2. **Nginx** routes to appropriate GPU based on header mapping
+3. **GPU backend** (llama.cpp) processes request
+4. **Fallback:** If GPU returns 502/503/timeout → Nginx redirects to queue-service :8091
+5. **Queue** stores the request in the Redis list `inference:requests` via `LPUSH`
+6. **Dashboard** :3001 polls queue-service + GPU health for display
+7. **GPU Dashboard** :8092 collects hardware metrics every 10s
+
+---
+
+## 2. File Inventory
+
+```
+docker-compose.yml                          # Main compose (Docker networking)
+gpu-router-docker.conf                      # Nginx config for Docker deployment
+Dockerfile.gpu                              # GPU dashboard container
+Dockerfile.dashboard                        # Dashboard container (root-level)
+queue-service/Dockerfile                    # Queue service container
+queue-service/queue-service.py              # Queue logic (121 lines)
+dashboard/harness-dashboard.py              # Dashboard app (133 lines)
+dashboard/Dockerfile                        # Dashboard container (subdir)
+dashboard/Dockerfile.dashboard              # Dashboard container (duplicate)
+gpu-dashboard/gpu_collector.py              # GPU hardware collector (115 lines)
+gpu-dashboard/gpu.html                      # GPU dashboard UI (183 lines)
+gpu-dashboard/collector.py                  # Duplicate collector (hermes-workspace path)
+gpu-dashboard/start.sh                      # Legacy startup script
+MIGRATION_PLAN.md                           # Production migration plan
+README.md                                   # Documentation
+syslog-harness-check/                       # Checkpoint subdirectory (mirror)
+```
+
+---
+
+## 3. Detailed Findings
+
+### 3.1 Queue Service (`queue-service/queue-service.py`)
+
+**Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
+| Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. No configuration source of truth. |
+| Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
+| Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
+| Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once tripped (open), it stays open until the queue is manually drained. |
+| Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with 3s timeout means `/status` can take up to 9s. |
+| Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
+| Q8 | **MEDIUM** | Lines 83-84 | **Headers filtered to only X-* prefixed.** The `Content-Type` header is dropped entirely, meaning the receiver can't determine payload format. |
+| Q9 | **LOW** | Line 121 | **No graceful shutdown.** Flask development server doesn't handle SIGTERM gracefully. |
+
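Finding Q6 (like D1 and G1 in the sections below) comes down to probing GPUs one after another. A minimal sketch of running the probes concurrently with the standard library follows. `check_all` and `probe` are hypothetical names, not functions from `queue-service.py`, and `probe` is assumed to apply its own per-request timeout.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(endpoints, probe, max_workers=8):
    """Probe every endpoint concurrently.

    endpoints: dict mapping a GPU name to its URL.
    probe:     callable(url) that returns a result or raises on failure.
    Returns {name: {"ok": True, "detail": ...}} or {name: {"ok": False, "error": ...}}.
    """
    def safe_probe(item):
        name, url = item
        try:
            return name, {"ok": True, "detail": probe(url)}
        except Exception as exc:
            # Report the failure per GPU instead of swallowing it.
            return name, {"ok": False, "error": repr(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the probes in parallel and preserves input order.
        return dict(pool.map(safe_probe, endpoints.items()))
```

With three GPUs and a 3s timeout per probe, the worst case under this approach is roughly one timeout rather than three, and a failing host shows up as an explicit error entry instead of blocking the others.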
+### 3.2 Nginx Gateway (`gpu-router-docker.conf`)
+
+**Architecture:** Nginx routes requests to GPU backends based on `X-Syslog-Model` header value. Has rate limiting, streaming support, and queue fallback.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** lets up to 20 requests beyond the rate limit through immediately; only requests beyond the burst are rejected. Under burst traffic this weakens the rate limit, since all 20 could still overwhelm a GPU. |
+| N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `tries 2`** means on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
+| N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down. |
+| N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is misapplied. `proxy_pass_header` controls which upstream response headers are passed back to the client. Request headers such as `X-Syslog-Model` are forwarded to the upstream by default, so the directive has no effect here. |
+| N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change based on docker-compose project prefix. Should use service names. |
+| N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** has `X-Forwarded-Proto` but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations. |
+
+### 3.3 Dashboard (`dashboard/harness-dashboard.py`)
+
+**Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2s + 3 × 3s = 11s response time. |
+| D2 | **MEDIUM** | Lines 101-127 | **Serves `SimpleHTTPRequestHandler` from a single-threaded server.** Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`. |
+| D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
+| D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |
+
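For D2, threading is a property of the server class rather than the request handler, so the fix can be limited to the class that is instantiated. A minimal sketch, assuming a JSON status endpoint; `StatusHandler` and `make_server` are hypothetical names and do not mirror `harness-dashboard.py`:

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that returns a fixed JSON body for any GET."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=0):
    # ThreadingHTTPServer handles each connection in its own thread,
    # so one slow request no longer blocks the others. Port 0 lets the
    # OS pick a free port, which is convenient for local testing.
    return ThreadingHTTPServer((host, port), StatusHandler)
```

In the dashboard itself the port would stay at 3001; only the server class changes.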
+### 3.4 GPU Dashboard (`gpu-dashboard/`)
+
+**Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to `gpu_metrics.json`. Static HTTP server serves the dashboard.
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
+| G2 | **HIGH** | Lines 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files. |
+| G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
+| G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
+| G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment. |
+
+### 3.5 Docker Compose (`docker-compose.yml`)
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
+| C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
+| C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. With `always`, a manually stopped container is started again whenever the Docker daemon restarts, making maintenance harder. |
+| C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect if a service is actually healthy, only if the container is running. |
+| C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
+| C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor; defaults to bridge). No IPAM configuration for large deployments. |
+
+### 3.6 Container Images
+
+**Issues Found:**
+
+| # | Severity | Location | Issue |
+|---|---|---|---|
+| I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
+| I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`**: unnecessary dependency for the GPU dashboard (only uses `urllib`). Adds ~300KB to the image. |
+| I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`**: no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
+| I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
+| I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |
+
+---
+
+## 4. Smart Queuing Analysis & Recommendations
+
+### Current State: No Smart Queuing
+
+The queue service is a **passive storage mechanism**. It stores requests but has no intelligence:
+
+- **No load balancing**: no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
+- **No job prioritization**: FIFO only, no priority levels
+- **No backpressure**: simple threshold, no exponential backoff or adaptive limits
+- **No retry logic**: failed GPU requests go to queue but are never reprocessed
+- **No dead letter handling**: stuck or failed jobs have no lifecycle management
+- **No consumer**: nothing dequeues and forwards to GPUs
+- **No job tracking**: no job IDs, no status updates, no result retrieval
+
+### Recommended Architecture: Smart Queue with Consumer
+
+```
+Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
+                                              |
+                                              v
+                                        Consumer Pool
+                                              |
+                      .-----------------------+-----------------------.
+                      v                       v                       v
+                GPU 1 (load)            GPU 2 (load)            GPU 3 (load)
+                      |                       |                       |
+                      v                       v                       v
+                   Health                  Health                  Health
+                      |                       |                       |
+                      '-----------------------+-----------------------'
+                                              |
+                                              v
+                                      Update GPU scores
+                                              |
+                                 Priority Queue (sorted by urgency)
+                                 Dead Letter Queue (failed jobs)
+                                 Backpressure (adaptive rate limit)
+```
+
+### Specific Recommendations
+
+#### R1: Implement Redis Streams as Queue Backend
+- Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
+- Streams support consumer groups, message acknowledgment, and pending messages
+- Enables proper dead letter queue handling and retry logic
+- **File:** `queue-service/queue-service.py`
+
+```python
+# Before: simple list
+r.rpush(QUEUE_KEY, json.dumps(job))
+
+# After: Redis Stream, to be read through a consumer group
+stream_key = "inference:stream"
+consumer_group = "gpu-workers"
+# redis-py names the trimming flag `approximate` (MAXLEN ~ 10000), not `approx`
+r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
+```
+
+#### R2: Build a Queue Consumer Pool
+- Deploy 1+ consumer containers that poll the stream and forward to GPUs
+- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
+- **File:** New `queue-service/consumer.py`
+
+```python
+class LoadBalancedConsumer:
+    def select_gpu(self, job):
+        """Select GPU based on load, health, and model compatibility."""
+        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
+        if not candidates:
+            return None
+        # Sort by: slots_idle (descending), VRAM_available (descending)
+        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
+        return candidates[0]
+```
+
+#### R3: Implement Priority Queuing
+- Add priority field to job payload: `high`, `normal`, `low`
+- Use Redis Streams with multiple stream keys per priority level
+- Consumer checks `high` → `normal` → `low` in order
+- **File:** `queue-service/queue-service.py` enqueue endpoint
+
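The check order in R3 can be sketched as a small selection function. The stream key names and the `read_one` callable are assumptions made for illustration; in a real consumer `read_one` would wrap a non-blocking `XREADGROUP` with a count of 1.

```python
# Hypothetical stream keys, one per priority level, highest first.
PRIORITY_STREAMS = (
    "inference:stream:high",
    "inference:stream:normal",
    "inference:stream:low",
)


def next_job(read_one, streams=PRIORITY_STREAMS):
    """Return (stream, job) from the highest-priority non-empty stream.

    read_one: callable(stream_key) returning a job, or None when the
              stream currently has nothing to deliver.
    Returns None when every stream is empty.
    """
    for stream in streams:
        job = read_one(stream)
        if job is not None:
            return stream, job
    return None
```

Strict ordering like this can starve `low` while `high` stays busy. Whether that is acceptable is a design choice the recommendation leaves open.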
+#### R4: Add Backpressure Mechanism
+- Instead of hard threshold at 50, implement **adaptive backpressure**:
+  - Queue depth 0-30: normal operation
+  - Queue depth 30-40: return `retry-after` header with increasing delay
+  - Queue depth 40-50: return 503 with exponential retry-after
+  - Queue depth >50: circuit breaker open
+- **File:** `queue-service/queue-service.py`
+
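The bands in R4 can be expressed as one pure function. This is a sketch under stated assumptions: the listed ranges share the boundaries 30 and 40, so the code treats depths 30-39 as the warning band and 40-50 as the 503 band; the 202 status follows the enqueue example in R7; the delay formulas and the `base_delay`/`max_delay` parameters are illustrative choices, not values from the review.

```python
def backpressure(depth, base_delay=1, max_delay=60):
    """Map queue depth to (http_status, retry_after_seconds, breaker_open)."""
    if depth < 30:
        # Normal operation: accept without a Retry-After hint.
        return 202, None, False
    if depth < 40:
        # Warning band: still accept, with a linearly increasing delay (1..10s).
        return 202, base_delay * (depth - 29), False
    if depth <= 50:
        # Reject with an exponentially growing Retry-After, capped at max_delay.
        return 503, min(max_delay, base_delay * 2 ** (depth - 40)), False
    # Above 50: circuit breaker open.
    return 503, max_delay, True
```

Keeping the decision free of Redis and Flask calls makes the thresholds easy to unit test and to tune later.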
+#### R5: Dead Letter Queue (DLQ)
+- Move failed/unprocessable jobs to a `inference:dead-letter` stream
+- Include failure reason, attempt count, and original payload
+- Provide admin API to inspect, retry, or discard DLQ entries
+- **File:** `queue-service/queue-service.py`
+
+```python
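+# New endpoints (assumes `r` was created with decode_responses=True)
+@app.route("/dlq", methods=["GET"])
+def list_dlq():
+    # XRANGE yields (message_id, fields) pairs; wrap them so Flask can serialize them
+    entries = r.xrange("inference:dead-letter")
+    return jsonify([{"id": mid, "fields": fields} for mid, fields in entries])
+
+@app.route("/dlq/retry/<message_id>", methods=["POST"])
+def retry_dlq(message_id):
+    # Streams have no single-entry get; XRANGE with min == max fetches one ID
+    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
+    if not entries:
+        return jsonify({"error": "not found"}), 404
+    r.xadd("inference:stream", {"job": entries[0][1]["job"]})
+    return jsonify({"retried": message_id}), 202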
+```
+
+#### R6: GPU-Aware Routing
+- Queue consumer should check GPU `slots_busy` before routing
+- If a GPU is busy, try the next available GPU
+- Track per-GPU queue depth and avoid overloading a single GPU
+- **File:** New consumer logic
+
+#### R7: Job Status API
+- Add job ID generation on enqueue
+- Provide `/status/<job_id>` endpoint to check progress
+- Store job state in Redis: `queued` → `processing` → `completed`/`failed`
+- **File:** `queue-service/queue-service.py`
+
+```python
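+import json, time, uuid
+from flask import jsonify, request
+
+@app.route("/enqueue", methods=["POST"])
+def enqueue():
+    job_id = str(uuid.uuid4())
+    # Payload handling is a placeholder: this assumes the request body is JSON
+    job = {"id": job_id, "payload": request.get_json(silent=True),
+           "status": "queued", "created_at": time.time()}
+    r.xadd(stream_key, {"job": json.dumps(job)})
+    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
+    return jsonify({"job_id": job_id, "status": "queued"}), 202
+
+@app.route("/status/<job_id>")
+def job_status(job_id):
+    status = r.hget("job:status", job_id)
+    # Branch explicitly: a one-line conditional followed by ", 404" returns 404 even for known jobs
+    if status is None:
+        return jsonify({"error": "not found"}), 404
+    return jsonify(json.loads(status))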
+```
+
+#### R8: Health-Based Circuit Breaker
+- Replace simple depth threshold with **per-GPU circuit breakers**
+- Track consecutive failures per GPU
+- Implement half-open state: after cooldown, probe one GPU to test recovery
+- **File:** `queue-service/queue-service.py`
+
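A per-GPU breaker with a half-open state, as described in R8, might look like the sketch below. The class and method names are hypothetical. The defaults of 5 failures and 30 seconds mirror `CIRCUIT_BREAKER_THRESHOLD` and `CIRCUIT_BREAKER_TIMEOUT` from the removed environment file, and the clock is injectable so the state machine can be tested without sleeping.

```python
import time


class CircuitBreaker:
    """closed -> open after N consecutive failures -> half-open after cooldown."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed
        self.probing = False    # True while a half-open probe is in flight

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def allow(self):
        """Return True if a request may be sent to this GPU right now."""
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self.probing:
            self.probing = True  # let exactly one probe through
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None
        self.probing = False

    def record_failure(self):
        self.failures += 1
        self.probing = False
        if self.opened_at is not None or self.failures >= self.threshold:
            self.opened_at = self.clock()  # open, or restart the cooldown
```

A consumer would keep one instance per backend, for example a dict keyed by `amdpve`, `llmgpu`, and `ocu_llm`, and call `allow()` before routing a job.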
+#### R9: Centralized Configuration
+- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
+  - Redis config key: `config:gpus`
+  - Or environment file mounted to all containers
+- Nginx can use Lua/variable from config instead of static upstreams
+- **File:** New `config/` directory or Redis-based config
+
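For the environment-file option in R9, every Python service could read the GPU map from one variable. A minimal sketch follows; the variable name `GPU_ENDPOINTS` and its JSON-object format are assumptions made for illustration, not something defined in the repository.

```python
import json
import os


def load_gpu_config(env=os.environ, var="GPU_ENDPOINTS"):
    """Parse a JSON object of {gpu_name: base_url} from one environment variable."""
    raw = env.get(var)
    if not raw:
        # Fail loudly instead of falling back to a hardcoded address.
        raise RuntimeError(f"{var} is not set")
    gpus = json.loads(raw)
    if not isinstance(gpus, dict) or not gpus:
        raise ValueError(f"{var} must be a non-empty JSON object")
    for name, url in gpus.items():
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            raise ValueError(f"invalid URL for GPU {name!r}: {url!r}")
    return gpus
```

The queue service and the dashboard could then share one env file mounted into both containers. Nginx would still need its own mechanism, as the recommendation notes.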
+---
+
+## 5. Priority Issue Summary
+
+### Critical (Fix Immediately)
+1. **Q1**: Queue has no consumer; enqueued requests are never processed
+2. **Q4**: No job ID or result retrieval mechanism
+3. **N3**: Queue fallback triggers on individual GPU failure, not all-down
+
+### High (Fix Before Production)
+4. **Q5**: Circuit breaker has no recovery mechanism
+5. **Q6**: `/status` endpoint blocks on GPU health checks
+6. **D1**: Dashboard `/api/status` makes 4 sequential calls, up to 11s
+7. **C2**: `Dockerfile.queue` path mismatch in docker-compose
+8. **I1**: No dependency pinning in any Dockerfile
+9. **I3**: Multi-process CMD without supervisor in GPU dashboard
+
+### Medium (Improve in Next Iteration)
+10. **Q3**: Redis host default conflicts with Docker networking
+11. **Q7**: Silent exception swallowing in Redis access
+12. **Q8**: Content-Type header dropped in queue
+13. **D2**: Single-threaded dashboard server
+14. **D3**: Three separate sources of truth for GPU addresses
+15. **G1**: Sequential GPU collection blocks on single failure
+16. **N1**: Rate limit burst of 20 nodelay defeats protection
+17. **N5**: Hardcoded container names in Nginx
+18. **C1**: Queue API exposed externally
+19. **C4**: No Docker health checks
+
+### Low (Nice to Have)
+20. **Q9**: No graceful shutdown
+21. **C3**: `restart: always` vs `unless-stopped`
+22. **C5**: No Redis authentication
+23. **G4**: 60s dead threshold is too aggressive
+24. **I2**: Unnecessary `requests` dependency
+25. **I4**: No `.dockerignore`
+26. **I5**: Duplicate Dockerfiles
+
+---
+
+## 6. Deployment Architecture Summary
+
+### What Works Well
+- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
+- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
+- SSE streaming support in Nginx for LLM response streaming
+- Rate limiting at the gateway layer
+- Circuit breaker pattern implemented (even if basic)
+
+### What Needs Work
+- **Queue is incomplete**: storage without processing is the most critical gap
+- **No job lifecycle**: requests go in and never come out
+- **Duplicated configuration**: GPU addresses in 3+ places
+- **No monitoring/alerting**: no Prometheus metrics, no alerting rules
+- **Single point of failure**: no Redis replication, no container redundancy
+- **No logging**: Flask dev server logs are minimal; no structured logging
+
+### Recommended Next Steps
+1. **Priority 1:** Implement queue consumer with GPU load-based routing
+2. **Priority 2:** Add job status tracking and result retrieval
+3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
+4. **Priority 4:** Add Docker health checks and proper dependency management
+5. **Priority 5:** Centralize GPU configuration in Redis or environment
+6. **Priority 6:** Add Prometheus metrics endpoint for observability
@@ -0,0 +1,5 @@
+FROM python:3.11-slim
+WORKDIR /app
+COPY dashboard/harness-dashboard.py .
+EXPOSE 3001
+CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,14 @@
+FROM python:3.11-slim
+
+RUN pip install requests
+
+COPY gpu-dashboard/ /app/
+WORKDIR /app
+
+RUN mkdir -p /app/public && \
+    cp gpu.html /app/public/ && \
+    touch /app/public/gpu_metrics.json
+
+EXPOSE 8092
+
+CMD ["sh", "-c", "python3 gpu_collector.py & python3 -m http.server 8092 --directory /app/public & wait"]
@@ -1,39 +1,63 @@
-# syslog-harness — Inference API Harness
+# Syslog Harness

-CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.
+Operational orchestration layer for Syslog's internal AI agents.

 ## Architecture

 ```
-nginx :80 → router :9000 → GPU backends
-                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080
-                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080
-                └─ gemma-4-E4B (Light) @ 192.168.68.110:8080
-
-LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
+┌─────────────┐     ┌──────────────┐     ┌─────────────┐
+│  Agent      │────>│  Nginx       │────>│  GPU Pool   │
+│  (Hermes)   │     │  Router      │     │  (MoE/Dense)│
+└─────────────┘     └──────────────┘     └─────────────┘
+                         │
+                         ├──> :8091 Queue Service (Docker)
+                         │
+                         └──> :3001 Dashboard (Docker)
 ```

-## Deploy
+## Components
+
+| Service | Port | Container | Purpose |
+|---|---|---|---|
+| Nginx Router | 8080 | Host | Routes requests to GPU backends |
+| Queue Service | 8091 | `syslog-queue` | Enqueues requests when GPUs are down |
+| Dashboard | 3001 | `syslog-dashboard` | Observability UI + API |
+
+## GPU Routing
+
+| Header `X-Syslog-Model` | Backend | Model |
+|---|---|---|
+| (none) / `standard` | amdpve (.15) | qwen3.6-35B-A3B (MoE) |
+| `heavy` / `qwen3.5-27B` | llmgpu (.8) | qwen3.5-27B (Dense) |
+| `light` / `gemma-4` | ocu_llm (.110) | gemma-4-E4B (Light) |
+
+## Quick Start

 ```bash
-cd /opt/inference-harness
+# Build & start
+docker compose build
 docker compose up -d
+
+# Verify
+curl http://localhost:8091/health
+curl http://localhost:3001/api/status
 ```

-## Endpoints
+## Dashboard

-| URL | Purpose |
-|-----|---------|
-| `/v1/chat/completions` | Inference API (OpenAI-compatible) |
-| `/v1/models` | Available models |
-| `/` | Dashboard (GPU health, routing, agents, timeseries) |
+- **UI:** `http://<host>:8080/dashboard/harness.html`
+- **API:** `http://<host>:8080/dashboard/api/status`

-## Agent API Keys
+## Circuit Breaker

-| Agent | Key |
-|-------|-----|
-| Abiba | `sk-syslog-abiba` |
-| Mumuni | `sk-syslog-mumuni` |
-| Tanko | `sk-syslog-tanko` |
-| Koby | `sk-syslog-koby` |
-| Kagenz0 | `sk-syslog-kagenz0` |
+- Rate limit: 10 req/s per IP
+- Burst: 20 requests
+- Excess returns 503
+- Queue fallback on GPU 502/503
+
+## Production Migration
+
+See [MIGRATION_PLAN.md](./MIGRATION_PLAN.md)
+
+---
+*Built for Syslog Solution LLC — Quality over speed.*
@@ -1,7 +1,8 @@
-FROM python:3.12-slim
+FROM python:3.13-slim
+
+COPY harness-dashboard.py /app/harness-dashboard.py
 WORKDIR /app
-COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
-COPY dashboard.py .
-EXPOSE 3000
-CMD ["python", "dashboard.py"]
+
+EXPOSE 3001
+
+CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,5 @@
+FROM python:3.11-slim
+WORKDIR /app
+COPY harness-dashboard.py .
+EXPOSE 3001
+CMD ["python3", "harness-dashboard.py"]
@@ -1,290 +0,0 @@
-"""Harness Dashboard."""
-import os, json, time, queue, threading
-import requests
-from flask import Flask, request, render_template_string, Response, stream_with_context
-
-ROUTER_METRICS = os.environ.get("ROUTER_METRICS_URL", "http://router:9000/metrics")
-
-app = Flask(__name__)
-sse_subscribers = []
-sse_lock = threading.Lock()
-
-def fetch_state():
-    try:
-        r = requests.get(ROUTER_METRICS, timeout=5)
-        if r.status_code == 200: return r.json()
-    except Exception: pass
-    return {"gpus":[],"route_counts":{},"agent_counts":{},"recent":[],"timestamp":time.time()}
-
-def broadcast_loop():
-    while True:
-        time.sleep(3)
-        data = fetch_state()
-        payload = json.dumps(data)
-        with sse_lock:
-            dead = []
-            for q in sse_subscribers:
-                try: q.put(payload)
-                except Exception: dead.append(q)
-            for q in dead: sse_subscribers.remove(q)
-
-threading.Thread(target=broadcast_loop, daemon=True).start()
-
-DASHBOARD_HTML = r"""<!DOCTYPE html>
-<html lang="en">
-<head>
-<meta charset="UTF-8">
-<meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>Inference Harness - Syslog Solution LLC</title>
-<style>
-:root {
-  --bg: #0a0e14; --card: #131820; --border: #1e2a3a; --text: #c9d1d9;
-  --dim: #5c6670; --accent: #39bae6; --green: #7fd962; --yellow: #ffb454;
-  --red: #f26d78; --blue: #59c2ff; --purple: #d2a6ff;
-}
-* { margin:0; padding:0; box-sizing:border-box; }
-body {
-  font-family: -apple-system, BlinkMacSystemFont, 'SF Pro Display', 'Segoe UI', system-ui, sans-serif;
-  background: var(--bg); color: var(--text); min-height: 100vh;
-  padding: clamp(12px, 3vw, 32px);
-}
-.header {
-  display: flex; align-items: center; justify-content: space-between;
-  flex-wrap: wrap; gap: 12px; margin-bottom: 24px;
-}
-.header h1 { font-size: clamp(18px, 4vw, 26px); font-weight: 700; color: #fff; }
-.header h1 span { color: var(--accent); }
-.status-bar { display: flex; gap: 16px; align-items: center; flex-wrap: wrap; font-size: 13px; color: var(--dim); }
-.status-dot { width: 8px; height: 8px; border-radius: 50%; display: inline-block; }
-.status-dot.live { background: var(--green); animation: pulse 2s infinite; }
-@keyframes pulse { 0%,100%{opacity:1} 50%{opacity:0.3} }
-.grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(min(100%, 340px), 1fr)); gap: 16px; }
-.card {
-  background: var(--card); border: 1px solid var(--border);
-  border-radius: 12px; padding: clamp(12px, 3vw, 20px);
-}
-.card-title {
-  font-size: 13px; font-weight: 600; text-transform: uppercase;
-  letter-spacing: 0.5px; color: var(--dim); margin-bottom: 14px;
-}
-.gpu-row {
-  display: flex; align-items: center; gap: 14px; padding: 10px 0;
-  border-bottom: 1px solid rgba(255,255,255,0.04);
-}
-.gpu-row:last-child { border-bottom: none; }
-.gpu-icon {
-  width: 40px; height: 40px; border-radius: 10px; display: flex;
-  align-items: center; justify-content: center; font-size: 18px; flex-shrink: 0;
-}
-.gpu-icon.green { background: rgba(127,217,98,0.12); color: var(--green); }
-.gpu-icon.yellow { background: rgba(255,180,84,0.12); color: var(--yellow); }
-.gpu-icon.red { background: rgba(242,109,120,0.12); color: var(--red); }
-.gpu-info { flex:1; min-width: 0; }
-.gpu-name { font-size: 14px; font-weight: 600; color: #e6edf3; }
-.gpu-metrics { display: flex; gap: 20px; flex-wrap: wrap; margin-top: 6px; }
-.gpu-metric { font-size: 12px; }
-.gpu-metric .label { color: var(--dim); }
-.gpu-metric .value { color: #e6edf3; font-weight: 500; font-variant-numeric: tabular-nums; }
-.vram-bar { width: 100%; height: 4px; background: rgba(255,255,255,0.06); border-radius: 2px; margin-top: 6px; overflow: hidden; }
-.vram-fill { height: 100%; border-radius: 2px; transition: width 0.6s ease; }
-.vram-fill.green { background: var(--green); }
-.vram-fill.yellow { background: var(--yellow); }
-.vram-fill.red { background: var(--red); }
-.bar-row { margin-bottom: 10px; }
-.bar-label { display: flex; justify-content: space-between; font-size: 12px; margin-bottom: 4px; }
-.bar-label .name { color: #e6edf3; }
-.bar-label .count { color: var(--dim); font-variant-numeric: tabular-nums; }
-.bar-track { height: 6px; background: rgba(255,255,255,0.06); border-radius: 3px; overflow: hidden; }
-.bar-fill { height: 100%; border-radius: 3px; transition: width 0.6s ease; }
-.route-table { width: 100%; font-size: 12px; border-collapse: collapse; }
-.route-table th, .route-table td { text-align: left; padding: 6px 10px; }
-.route-table th { color: var(--dim); font-weight: 500; font-size: 11px; text-transform: uppercase; letter-spacing: 0.3px; border-bottom: 1px solid var(--border); }
-.route-table td { border-bottom: 1px solid rgba(255,255,255,0.03); color: #b0b8c4; }
-.agent-tag { display: inline-block; padding: 1px 7px; border-radius: 10px; font-size: 11px; font-weight: 600; }
-.agent-abiba { background: rgba(57,186,230,0.15); color: var(--accent); }
-.agent-mumuni { background: rgba(210,166,255,0.15); color: var(--purple); }
-.agent-tanko { background: rgba(255,180,84,0.15); color: var(--yellow); }
-.agent-koby { background: rgba(89,194,255,0.15); color: var(--blue); }
-.agent-kagenz0 { background: rgba(127,217,98,0.15); color: var(--green); }
-.agent-unknown { background: rgba(255,255,255,0.06); color: var(--dim); }
-.agent-admin { background: rgba(255,255,255,0.08); color: #e6edf3; }
-.full { grid-column: 1 / -1; }
-.period-btn {
-  background: var(--card); border: 1px solid var(--border); color: var(--dim);
-  padding: 4px 12px; border-radius: 6px; font-size: 12px; cursor: pointer;
-  font-family: inherit; transition: all 0.2s;
-}
-.period-btn.active { background: var(--accent); color: #000; border-color: var(--accent); }
-.period-btn:hover { border-color: var(--accent); color: #e6edf3; }
-@media (max-width: 600px) {
-  .gpu-metrics { gap: 10px; }
-  .route-table { font-size: 11px; }
-  .route-table th, .route-table td { padding: 4px 6px; }
-}
-</style>
-</head>
-<body>
-<div class="header">
-  <h1><span>&#x26A1;</span> Inference Harness</h1>
-  <div class="status-bar">
-    <span class="status-dot" id="live-dot"></span>
-    <span id="connection-status">connecting...</span>
-    <span id="update-time"></span>
-    <span id="total-requests">0 requests</span>
-  </div>
-</div>
-<div class="grid">
-  <div class="card full">
-    <div class="card-title">GPU Health</div>
-    <div id="gpu-container">Loading...</div>
-  </div>
-  <div class="card">
-    <div class="card-title">Model Distribution</div>
-    <div id="route-bars">-</div>
-  </div>
-  <div class="card" style="grid-column: span 2">
-    <div class="card-title" style="display:flex;justify-content:space-between;align-items:center;flex-wrap:wrap;gap:4px">
-      <span>Usage Over Time</span>
-      <div style="display:flex;gap:4px">
-        <button class="period-btn active" onclick="switchPeriod('day')">24h</button>
-        <button class="period-btn" onclick="switchPeriod('week')">7d</button>
-        <button class="period-btn" onclick="switchPeriod('month')">30d</button>
-      </div>
-    </div>
-    <div id="timeseries-chart" style="height:140px;position:relative;overflow:hidden">
-      <div style="color:var(--dim);font-size:13px;padding:50px 0;text-align:center">Loading...</div>
-    </div>
-    <div id="timeseries-legend" style="display:flex;gap:16px;justify-content:center;margin-top:8px;flex-wrap:wrap"></div>
-  </div>
-  <div class="card">
-    <div class="card-title">Agent Activity</div>
-    <div id="agent-bars">-</div>
-  </div>
-  <div class="card full">
-    <div class="card-title">Live Request Stream</div>
-    <div style="overflow-x:auto">
-      <table class="route-table">
-        <thead><tr><th>Time</th><th>Agent</th><th>Model</th><th>Reason</th><th>Tier</th></tr></thead>
-        <tbody id="route-tbody"><tr><td colspan="5">Waiting for data...</td></tr></tbody>
-      </table>
-    </div>
-  </div>
-</div>
-<script>
-const MODEL_COLORS = {'gemma-4-E4B':'#7fd962','qwen3.6-27B-code':'#ffb454','qwen3.6-35B-A3B':'#d2a6ff'};
-const MODEL_LABELS = {'gemma-4-E4B':'Gemma 4B','qwen3.6-27B-code':'Qwen Code 27B','qwen3.6-35B-A3B':'Qwen MoE 35B'};
-const GPU_LABELS = {'NVIDIA GeForce RTX 5070':'RTX 5070 - Gemma 4B','NVIDIA GeForce RTX 3090':'RTX 3090 - Qwen Code 27B','AMD Radeon (Strix Halo)':'Strix Halo - Qwen MoE 35B'};
-
-function statusIcon(status) {
-  if (status === 'healthy') return '<span class="gpu-icon green">&#x25CF;</span>';
-  if (status === 'saturated') return '<span class="gpu-icon yellow">&#x25C9;</span>';
-  return '<span class="gpu-icon red">&#x25CB;</span>';
-}
-function vramClass(pct) { if(pct>90)return'red';if(pct>75)return'yellow';return'green'; }
-
-function render(data) {
-  if(!data||!data.gpus)return;
-  const total = Object.values(data.route_counts||{}).reduce((a,b)=>a+b,0);
-  document.getElementById('total-requests').textContent = total + ' requests';
-  document.getElementById('update-time').textContent = new Date().toLocaleTimeString();
-
-  const gpus = data.gpus||[];
-  document.getElementById('gpu-container').innerHTML = gpus.map(g => '<div class="gpu-row">'+statusIcon(g.status)+'<div class="gpu-info"><div class="gpu-name">'+(GPU_LABELS[g.gpu_name]||g.gpu_name||g.id||'?')+'</div><div class="gpu-metrics"><div class="gpu-metric"><span class="label">VRAM</span> <span class="value">'+(g.vram_used_mb||'?')+'/'+(g.vram_total_mb||'?')+' MB</span></div><div class="gpu-metric"><span class="label">Temp</span> <span class="value">'+(g.temp_c||'?')+'C</span></div><div class="gpu-metric"><span class="label">Util</span> <span class="value">'+(g.gpu_util_pct||0)+'%</span></div>'+(g.power_w!=null?'<div class="gpu-metric"><span class="label">Power</span> <span class="value">'+g.power_w+'W</span></div>':'')+'</div><div class="vram-bar"><div class="vram-fill '+vramClass(g.vram_pct||0)+'" style="width:'+(g.vram_pct||0)+'%"></div></div></div><div style="font-size:24px;font-weight:700;color:'+(vramClass(g.vram_pct||0)==='red'?'var(--red)':vramClass(g.vram_pct||0)==='yellow'?'var(--yellow)':'var(--green)')+';min-width:50px;text-align:right">'+(g.vram_pct||0)+'%</div></div>').join('');
-
-  const rc = data.route_counts||{};
-  const maxR = Math.max(1,...Object.values(rc));
-  document.getElementById('route-bars').innerHTML = Object.entries(rc).length ? Object.entries(rc).sort((a,b)=>b[1]-a[1]).map(([m,c])=>'<div class="bar-row"><div class="bar-label"><span class="name">'+(MODEL_LABELS[m]||m)+'</span><span class="count">'+c+' ('+(total?Math.round(c/total*100):0)+'%)</span></div><div class="bar-track"><div class="bar-fill" style="width:'+(c/maxR*100)+'%;background:'+(MODEL_COLORS[m]||'#39bae6')+'"></div></div></div>').join('') : '<div style="color:var(--dim);font-size:13px">No data yet</div>';
-
-  const ac = data.agent_counts||{};
-  const maxA = Math.max(1,...Object.values(ac));
-  document.getElementById('agent-bars').innerHTML = Object.entries(ac).length ? Object.entries(ac).sort((a,b)=>b[1]-a[1]).map(([a,c])=>'<div class="bar-row"><div class="bar-label"><span class="name agent-'+a.toLowerCase().replace(/[^a-z]/g,'')+'">'+a+'</span><span class="count">'+c+' reqs</span></div><div class="bar-track"><div class="bar-fill" style="width:'+(c/maxA*100)+'%;background:var(--accent)"></div></div></div>').join('') : '<div style="color:var(--dim);font-size:13px">No agent activity yet</div>';
-
-  const recent = data.recent||[];
-  document.getElementById('route-tbody').innerHTML = recent.length ? recent.slice(0,25).map(r=>{const d=new Date(r.ts*1000);const a=r.agent||'?';const cl='agent-'+a.toLowerCase().replace(/[^a-z0-9]/g,'');return'<tr><td style="color:var(--dim);font-size:11px">'+d.toLocaleTimeString()+'</td><td><span class="agent-tag '+cl+'">'+a+'</span></td><td>'+(MODEL_LABELS[r.model]||r.model)+'</td><td style="color:var(--dim);font-size:11px">'+(r.reason||'')+'</td><td style="font-size:11px;text-transform:uppercase;color:'+(r.tier==='enterprise'?'var(--purple)':r.tier==='professional'?'var(--blue)':'var(--dim)')+'">'+(r.tier||'')+'</td></tr>';}).join('') : '<tr><td colspan="5" style="color:var(--dim)">Waiting for requests...</td></tr>';
-}
-
-let currentPeriod = 'day';
-async function switchPeriod(p) {
-  currentPeriod = p;
-  document.querySelectorAll('.period-btn').forEach(b => b.classList.remove('active'));
-  document.querySelectorAll('.period-btn').forEach(b => { if(b.textContent.trim().startsWith(p==='day'?'24h':p==='week'?'7d':'30d')) b.classList.add('active'); });
-  await loadTimeseries();
-}
-async function loadTimeseries() {
-  try { const r = await fetch('/api/timeseries?period='+currentPeriod); renderTimeseries(await r.json()); } catch(e) {}
-}
-function renderTimeseries(d) {
-  const models = d.models||{}, labels = d.labels||[];
-  if(!labels.length)return;
-  const container = document.getElementById('timeseries-chart');
-  const legend = document.getElementById('timeseries-legend');
-  const modelNames = Object.keys(models);
-  if(!modelNames.length){container.innerHTML='<div style="color:var(--dim);font-size:13px;padding:50px 0;text-align:center">No data yet</div>';return;}
-  const colors = {'gemma-4-E4B':'#7fd962','qwen3.6-27B-code':'#ffb454','qwen3.6-35B-A3B':'#d2a6ff'};
-  const shortNames = {'gemma-4-E4B':'Gemma','qwen3.6-27B-code':'Qwen Code','qwen3.6-35B-A3B':'Qwen MoE'};
-  let maxVal = 1;
-  for(const m in models) for(const v of models[m]) if(v>maxVal) maxVal=v;
-  maxVal = Math.ceil(maxVal*1.15)||1;
-  const W = labels.length>1?100/(labels.length-1):100, H=130;
-  let paths='';
-  for(const m of modelNames){const vals=models[m]||[];let d='';for(let i=0;i<vals.length;i++){const x=i*W,y=H-(vals[i]/maxVal)*H;d+=(i===0?'M':'L')+x.toFixed(1)+','+y.toFixed(1)+' ';}paths+='<path d="'+d+'" fill="none" stroke="'+(colors[m]||'#39bae6')+'" stroke-width="2.5" stroke-linecap="round" stroke-linejoin="round" opacity="0.85"/>';}
-  let grid='';
-  for(let g=0;g<=4;g++){const y=(g/4)*H;grid+='<line x1="0" y1="'+y.toFixed(1)+'" x2="100" y2="'+y.toFixed(1)+'" stroke="rgba(255,255,255,0.05)" stroke-width="1"/>';}
-  const svg='<svg viewBox="0 0 100 '+(H+16)+'" style="width:100%;height:'+(H+20)+'px;display:block" preserveAspectRatio="none">'+grid+paths+'</svg>';
-  const step=Math.max(1,Math.floor(labels.length/8));
-  let lh='<div style="display:flex;margin-top:2px;font-size:10px;color:var(--dim);overflow:hidden">';
-  for(let i=0;i<labels.length;i+=step) lh+='<div style="flex:1;text-align:center">'+labels[i]+'</div>';
-  lh+='</div>';
-  container.innerHTML=svg+lh;
-  legend.innerHTML=modelNames.map(m=>'<span style="display:flex;align-items:center;gap:6px;font-size:11px;color:var(--dim)"><svg width="18" height="10"><line x1="0" y1="5" x2="18" y2="5" stroke="'+(colors[m]||'#39bae6')+'" stroke-width="2.5"/></svg>'+shortNames[m]+'</span>').join('');
-}
-
-function poll(){fetch('/api/state').then(r=>r.json()).then(data=>{render(data);document.getElementById('connection-status').textContent='live';document.getElementById('live-dot').className='status-dot live';}).catch(()=>{document.getElementById('connection-status').textContent='reconnecting...';document.getElementById('live-dot').className='status-dot';});}
-poll();setInterval(poll,3000);loadTimeseries();
-</script>
-</body>
-</html>"""
-
-@app.route("/")
-def dashboard():
-    return render_template_string(DASHBOARD_HTML)
-
-@app.route("/api/state")
-def api_state():
-    return fetch_state()
-
-@app.route("/api/timeseries")
-def api_timeseries():
-    period = request.args.get("period", "day")
-    try:
-        r = requests.get("http://router:9000/metrics/timeseries?period=" + period, timeout=5)
-        if r.status_code == 200: return r.json()
-    except Exception: pass
-    return {"models": {}, "labels": []}
-
-@app.route("/api/stream")
-def api_stream():
-    def event_stream():
-        q = queue.Queue()
-        with sse_lock: sse_subscribers.append(q)
-        try:
-            data = fetch_state()
-            yield "data: " + json.dumps(data) + "\n\n"
-            while True:
-                try: msg = q.get(timeout=3); yield "data: " + msg + "\n\n"
-                except queue.Empty:
-                    data = fetch_state()
-                    yield "data: " + json.dumps(data) + "\n\n"
-        except GeneratorExit: pass
-        finally:
-            with sse_lock:
-                if q in sse_subscribers: sse_subscribers.remove(q)
-    return Response(stream_with_context(event_stream()), mimetype="text/event-stream",
-                    headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
-
-@app.route("/health")
-def health():
-    return {"status":"healthy","service":"harness-dashboard"}
-
-if __name__ == "__main__":
-    app.run(host="0.0.0.0", port=3000, debug=False)
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""Syslog Harness Dashboard — Simple HTTP server exposing GPU health + metrics."""
+
+import json
+import os
+import time
+import urllib.request
+from http.server import HTTPServer, SimpleHTTPRequestHandler
+from datetime import datetime
+
+GPUS = {
+    "amdpve": {"endpoint": os.getenv("AMDPVE_EP", "192.168.68.15:8080"), "model": "qwen3.6-35B-A3B (MoE)", "vram": "65GB"},
+    "llmgpu": {"endpoint": os.getenv("LLMGPU_EP", "192.168.68.8:8080"), "model": "qwen3.5-27B (Dense)", "vram": "24GB"},
+    "ocu_llm": {"endpoint": os.getenv("OCU_LLM_EP", "192.168.68.110:8080"), "model": "gemma-4-E4B (Light)", "vram": "12GB"},
+}
+
+
+def check_gpu(name, info):
+    try:
+        start = time.time()
+        # Use a simple HTTP GET to check whether the GPU endpoint is alive;
+        # the context manager closes the connection as soon as the check is done.
+        with urllib.request.urlopen(f"http://{info['endpoint']}/", timeout=3):
+            latency = (time.time() - start) * 1000
+        return {
+            "status": "up",
+            "latency_ms": round(latency, 1),
+            "model": info["model"],
+            "vram": info["vram"],
+        }
+    except Exception as e:
+        return {"status": "down", "error": str(e)[:50], "model": info["model"], "vram": info["vram"]}
+
+
+def get_queue_status():
+    try:
+        req = urllib.request.Request("http://queue-service:8091/status")
+        with urllib.request.urlopen(req, timeout=2) as resp:
+            return json.loads(resp.read())
+    except Exception:
+        return {"queue_depth": -1, "circuit_breaker": "unknown", "gpu_health": {}}
+
+
+DASHBOARD_HTML = """
+<!DOCTYPE html>
+<html><head><meta charset="utf-8"><title>🦅 Syslog Harness</title>
+<style>
+  body { background: #1a1a2e; color: #e0e0e0; font-family: monospace; margin: 0; padding: 20px; }
+  .card { background: #16213e; border-radius: 8px; padding: 16px; margin: 10px 0; border-left: 4px solid #0f3460; }
+  .up { border-left-color: #00d26a; } .down { border-left-color: #ff4757; }
+  .warn { border-left-color: #ffa502; }
+  h1 { color: #00d26a; font-size: 24px; } h2 { color: #0f3460; font-size: 16px; }
+  .metric { display: inline-block; margin: 4px 12px; }
+  .value { font-weight: bold; color: #00d26a; }
+  #refresh { position: fixed; top: 10px; right: 10px; background: #0f3460; color: white;
+             border: none; padding: 8px 16px; border-radius: 4px; cursor: pointer; }
+  table { width: 100%; border-collapse: collapse; margin: 10px 0; }
+  th, td { text-align: left; padding: 8px; border-bottom: 1px solid #0f3460; }
+  th { color: #00d26a; }
+</style></head><body>
+<button id="refresh" onclick="location.reload()">↻ Refresh</button>
+<h1>🦅 Syslog Harness Dashboard</h1>
+<h2>Updated: <span id="ts"></span></h2>
+
+<div class="card" id="queue-card">
+  <h2>Queue &amp; Circuit Breaker</h2>
+  <div class="metric">Depth: <span class="value" id="depth">--</span></div>
+  <div class="metric">Circuit: <span class="value" id="circuit">--</span></div>
+  <div class="metric">Threshold: <span class="value" id="threshold">--</span></div>
+</div>
+
+<div class="card">
+  <h2>GPU Endpoints</h2>
+  <table><tr><th>GPU</th><th>Model</th><th>VRAM</th><th>Status</th><th>Latency</th></tr>
+  <tbody id="gpu-table"></tbody></table>
+</div>
+
+<script>
+  document.getElementById('ts').textContent = new Date().toISOString();
+  fetch('/api/status').then(r => r.json()).then(data => {
+    document.getElementById('depth').textContent = data.queue_depth;
+    document.getElementById('circuit').textContent = data.circuit_breaker;
+    document.getElementById('threshold').textContent = 'warn:' + data.thresholds.warn + ' / open:' + data.thresholds.open;
+    const card = document.getElementById('queue-card');
+    if (data.circuit_breaker === 'open') card.className = 'card down';
+    else if (data.circuit_breaker === 'warn') card.className = 'card warn';
+    else card.className = 'card up';
+    let html = '';
+    for (const [name, gpu] of Object.entries(data.gpu_health)) {
+      const status = gpu.status === 'up' ? '✅' : '❌';
+      const latency = gpu.status === 'up' ? gpu.latency_ms + 'ms' : gpu.error;
+      const rowClass = gpu.status === 'up' ? '' : 'down';
+      html += `<tr class="${rowClass}"><td>${name}</td><td>${gpu.model}</td><td>${gpu.vram}</td><td>${status}</td><td>${latency}</td></tr>`;
+    }
+    document.getElementById('gpu-table').innerHTML = html;
+  });
+  setInterval(() => location.reload(), 10000);
+</script></body></html>
+"""
+
+
+class Handler(SimpleHTTPRequestHandler):
+    def do_GET(self):
+        if self.path == "/" or self.path == "/harness.html":
+            self.send_response(200)
+            self.send_header("Content-Type", "text/html; charset=utf-8")
+            self.end_headers()
+            self.wfile.write(DASHBOARD_HTML.encode())
+        elif self.path == "/api/status":
+            status = get_queue_status()
+            enriched = {
+                "queue_depth": status.get("queue_depth", -1),
+                "circuit_breaker": status.get("circuit_breaker", "unknown"),
+                "thresholds": status.get("thresholds", {"warn": 30, "open": 50}),
+                "gpu_health": {},
+            }
+            for name, info in GPUS.items():
+                enriched["gpu_health"][name] = check_gpu(name, info)
+            self.send_response(200)
+            self.send_header("Content-Type", "application/json")
+            self.end_headers()
+            self.wfile.write(json.dumps(enriched).encode())
+        else:
+            self.send_response(404)
+            self.end_headers()
+
+    def log_message(self, format, *args):
+        pass  # Suppress request logs
+
+
+if __name__ == "__main__":
+    server = HTTPServer(("0.0.0.0", 3001), Handler)
+    print("Dashboard running on :3001/harness.html")
+    server.serve_forever()
@@ -1,2 +0,0 @@
-flask==3.1.*
-requests==2.32.*
@@ -1,77 +1,54 @@
-version: '3.8'
+version: "3.8"

 services:
  redis:
    image: redis:7-alpine
-    container_name: harness-redis
-    restart: unless-stopped
-    ports:
-      - "127.0.0.1:6379:6379"
+    restart: always
+    networks:
+      - gpu-router-net
    volumes:
      - redis-data:/data
-    command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
-    healthcheck:
-      test: ["CMD", "redis-cli", "ping"]
-      interval: 10s
-      timeout: 3s
-      retries: 5

-  router:
-    build: ./router
-    container_name: harness-router
-    restart: unless-stopped
+  queue-service:
+    build:
+      context: .
+      dockerfile: Dockerfile.queue
+    restart: always
+    networks:
+      - gpu-router-net
    ports:
-      - "9000:9000"
+      - "8091:8091"
+    depends_on:
+      - redis
    environment:
-      - REDIS_URL=redis://redis:6379
-      - GPU_MOE_URL=http://192.168.68.15:8080/v1
-      - GPU_DENSE_URL=http://192.168.68.8:8080/v1
-      - GPU_LIGHT_URL=http://192.168.68.110:8080/v1
-    depends_on:
-      redis:
-        condition: service_healthy
-
-  litellm:
-    image: ghcr.io/berriai/litellm:main-stable
-    command: ["--config", "/app/config.yaml", "--port", "4000"]
-    container_name: harness-litellm
-    restart: unless-stopped
-    ports:
-      - "8081:4000"
-    volumes:
-      - ./litellm_config.yaml:/app/config.yaml
-    environment:
-      - LITELLM_MASTER_KEY=sk-syslog-local-master-key
-    depends_on:
-      redis:
-        condition: service_healthy
-
-  nginx:
-    image: nginx:alpine
-    container_name: harness-nginx
-    restart: unless-stopped
-    ports:
-      - "80:80"
-    volumes:
-      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
-    depends_on:
-      - litellm
-      - dashboard
+      - REDIS_HOST=redis
+      - REDIS_PORT=6379

  dashboard:
-    build: ./dashboard
-    container_name: harness-dashboard
-    restart: unless-stopped
+    build:
+      context: .
+      dockerfile: Dockerfile.dashboard
+    restart: always
+    networks:
+      - gpu-router-net
    ports:
-      - "3000:3000"
-    environment:
-      - REDIS_URL=redis://redis:6379
-      - GPU_SIDECARS=192.168.68.15:8090,192.168.68.8:8090,192.168.68.110:8090
+      - "3001:3001"
    depends_on:
      - redis

+  gpu-dashboard:
+    build:
+      context: .
+      dockerfile: Dockerfile.gpu
+    restart: always
+    networks:
+      - gpu-router-net
+    ports:
+      - "8092:8092"
+
+networks:
+  gpu-router-net:
+    driver: bridge
+
 volumes:
  redis-data:
-
-# LiteLLM command override to load config
-# (appended to fix config loading issue)
@@ -0,0 +1,115 @@
+#!/usr/bin/env python3
+"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
+
+import urllib.request, json, time, os
+
+HOSTS = [
+    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
+    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
+    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
+]
+OUTPUT = "/root/hermes-workspace/public/gpu_metrics.json"
+INTERVAL = 10
+STALE_THRESHOLD = 30  # seconds before marking stale
+DEAD_THRESHOLD = 60   # seconds before marking unreachable
+
+last_seen = {}
+
+
+def fetch_json(url, timeout=3):
+    try:
+        req = urllib.request.Request(url)
+        resp = urllib.request.urlopen(req, timeout=timeout)
+        return json.loads(resp.read().decode())
+    except Exception:
+        return None
+
+
+def collect_one(h):
+    """Collect GPU hardware + llama.cpp inference state for one host."""
+    name = h["name"]
+    host = h["host"]
+    now = time.time()
+
+    # GPU hardware from sidecar
+    gpu = fetch_json(f"http://{host}:8090/")
+
+    # llama.cpp inference state
+    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
+    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
+
+    # Determine inference state
+    model_name = None
+    inference_state = "unknown"
+    if llamacpp_models:
+        models = llamacpp_models.get("data", [])
+        if models:
+            model_name = models[0].get("id")
+
+    if llamacpp_health:
+        status = llamacpp_health.get("status", "")
+        if status == "ok":
+            # Any slot that is processing means the server is busy; otherwise it is idle.
+            processing = llamacpp_health.get("slots_processing", 0)
+            inference_state = "busy" if processing else "idle"
+
+    # The /slots endpoint gives per-slot detail: any processing slot means busy
+    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
+    if isinstance(slots, list) and any(s.get("is_processing") for s in slots):
+        inference_state = "busy"
+
+    result = {
+        "host": name,
+        "gpu_name": h["gpu"],
+        "inference": {
+            "state": inference_state,
+            "model": model_name,
+        },
+        "hardware": gpu if gpu else None,
+        "online": gpu is not None,
+        "timestamp": now,
+    }
+
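+    if gpu is not None:
+        last_seen[name] = now
+
+    # Keep a previously seen host online through brief sidecar outages:
+    # flag it stale after STALE_THRESHOLD and offline after DEAD_THRESHOLD.
+    if name in last_seen:
+        age = now - last_seen[name]
+        result["online"] = age <= DEAD_THRESHOLD
+        if STALE_THRESHOLD < age <= DEAD_THRESHOLD:
+            result["stale"] = True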
+
+    return result
+
+
+def main():
+    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
+    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
+
+    while True:
+        start = time.time()
+        results = [collect_one(h) for h in HOSTS]
+
+        payload = {
+            "updated": start,
+            "gpus": results,
+        }
+
+        with open(OUTPUT + ".tmp", "w") as f:
+            json.dump(payload, f)
+        os.replace(OUTPUT + ".tmp", OUTPUT)  # atomic overwrite of the previous snapshot
+
+        elapsed = time.time() - start
+        sleep_for = max(0, INTERVAL - elapsed)
+        time.sleep(sleep_for)
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,183 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>GPU Monitor</title>
+<style>
+* { margin: 0; padding: 0; box-sizing: border-box; }
+body { background: #0d1117; color: #c9d1d9; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; padding: 20px; }
+h1 { font-size: 1.3em; margin-bottom: 4px; }
+.topbar { display: flex; justify-content: space-between; align-items: center; margin-bottom: 20px; padding-bottom: 12px; border-bottom: 1px solid #21262d; }
+.topbar .status { font-size: 0.85em; color: #8b949e; }
+.topbar .status .dot { display: inline-block; width: 8px; height: 8px; border-radius: 50%; margin-right: 6px; }
+.dot.green { background: #3fb950; }
+.dot.yellow { background: #d2991d; }
+.dot.red { background: #f85149; }
+.cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); gap: 16px; }
+.card { background: #161b22; border: 1px solid #21262d; border-radius: 8px; padding: 16px; }
+.card.stale { opacity: 0.5; }
+.card.dead { opacity: 0.3; border-color: #f85149; }
+.card-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
+.card-header .name { font-weight: 600; font-size: 1.05em; }
+.card-header .host { font-size: 0.8em; color: #8b949e; }
+.card-header .state { font-size: 0.75em; padding: 2px 8px; border-radius: 10px; font-weight: 600; }
+.state.idle { background: #1b3826; color: #3fb950; }
+.state.busy { background: #3d1f1a; color: #f85149; }
+.state.unknown { background: #21262d; color: #8b949e; }
+.metric { margin-bottom: 10px; }
+.metric-label { display: flex; justify-content: space-between; font-size: 0.82em; color: #8b949e; margin-bottom: 2px; }
+.metric-label .val { color: #c9d1d9; font-weight: 500; }
+.bar { height: 6px; border-radius: 3px; background: #21262d; overflow: hidden; }
+.bar-fill { height: 100%; border-radius: 3px; transition: width 0.5s ease; }
+.bar-fill.temp-cool { background: #3fb950; }
+.bar-fill.temp-warm { background: #d2991d; }
+.bar-fill.temp-hot { background: #f85149; }
+.bar-fill.util { background: #58a6ff; }
+.bar-fill.vram { background: #bc8cff; }
+.bar-fill.power { background: #f0883e; }
+.model-line { font-size: 0.82em; color: #8b949e; margin-top: 8px; padding-top: 8px; border-top: 1px solid #21262d; }
+.model-line span { color: #c9d1d9; }
+.error { color: #f85149; font-size: 0.85em; }
+</style>
+</head>
+<body>
+<div class="topbar">
+  <div>
+    <h1><a href="/" style="color:#58a6ff;text-decoration:none;">← Workspace</a> · GPU Monitor</h1>
+    <span class="status"><span class="dot green" id="status-dot"></span><span id="status-text">Loading...</span></span>
+  </div>
+  <div class="status" id="age">—</div>
+</div>
+<div class="cards" id="cards"></div>
+
+<script>
+const INTERVAL = 5000;
+let lastFetchTime = null;
+
+function updateClock() {
+  const el = document.getElementById('age');
+  if (!lastFetchTime) { el.textContent = '—'; return; }
+  const age = Math.round((Date.now() / 1000) - lastFetchTime);
+  el.textContent = age <= 60 ? `updated ${age}s ago` : `stale ${age}s ago`;
+}
+setInterval(updateClock, 1000);
+
+const TEMP_WARN = 70, TEMP_HOT = 82;
+const VRAM_WARN = 80, VRAM_HOT = 92;
+
+function tempClass(c) { return c > TEMP_HOT ? 'temp-hot' : c > TEMP_WARN ? 'temp-warm' : 'temp-cool'; }
+function vramClass(pct) { return pct > VRAM_HOT ? 'temp-hot' : pct > VRAM_WARN ? 'temp-warm' : 'temp-cool'; }
+function pct(val, max) { return max ? Math.round(val / max * 100) : 0; }
+function mbToGB(mb) { return mb ? (mb / 1024).toFixed(1) : '—'; }
+
+function renderCard(g) {
+  const hw = g.hardware || {};
+  const inf = g.inference || {};
+  const online = g.online !== false;
+  const stale = g.stale === true;
+  let cardClass = '';
+  if (!online) cardClass = 'dead';
+  else if (stale) cardClass = 'stale';
+
+  let stateClass = inf.state || 'unknown';
+  let stateLabel = inf.state ? inf.state.toUpperCase() : 'UNKNOWN';
+  if (!online) { stateClass = 'unknown'; stateLabel = 'OFFLINE'; }
+
+  const temp = hw.temp_c;
+  const util = hw.gpu_util_pct;
+  const vramUsed = hw.vram_used_mb;
+  const vramTotal = hw.vram_total_mb;
+  const power = hw.power_w;
+  const powerLimit = hw.power_limit_w;
+  const fan = hw.fan_pct;
+  const vendor = hw.vendor;
+
+  let html = `<div class="card ${cardClass}">`;
+  html += `<div class="card-header">`;
+  html += `<div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div>`;
+  html += `<div class="state ${stateClass}">${stateLabel}</div>`;
+  html += `</div>`;
+
+  if (!online) {
+    html += `<div class="error">Unreachable</div>`;
+  } else if (hw.error) {
+    html += `<div class="error">${hw.error}</div>`;
+  } else {
+    // Temperature
+    if (temp != null) {
+      html += `<div class="metric"><div class="metric-label"><span>Temperature</span><span class="val">${temp}°C</span></div>`;
+      html += `<div class="bar"><div class="bar-fill ${tempClass(temp)}" style="width:${Math.min(temp,100)}%"></div></div></div>`;
+    }
+    // Utilization
+    if (util != null) {
+      html += `<div class="metric"><div class="metric-label"><span>GPU Utilization</span><span class="val">${util}%</span></div>`;
+      html += `<div class="bar"><div class="bar-fill util" style="width:${util}%"></div></div></div>`;
+    }
+    // VRAM
+    if (vramUsed != null && vramTotal != null) {
+      const vramPct = pct(vramUsed, vramTotal);
+      html += `<div class="metric"><div class="metric-label"><span>VRAM</span><span class="val">${mbToGB(vramUsed)} / ${mbToGB(vramTotal)} GB</span></div>`;
+      html += `<div class="bar"><div class="bar-fill ${vramClass(vramPct)}" style="width:${vramPct}%"></div></div></div>`;
+    }
+    // Power
+    if (power != null) {
+      const powerPct = powerLimit ? pct(power, powerLimit) : 0;
+      const powerText = powerLimit ? `${power}W / ${powerLimit}W` : `${power}W`;
+      html += `<div class="metric"><div class="metric-label"><span>Power</span><span class="val">${powerText}</span></div>`;
+      if (powerLimit) html += `<div class="bar"><div class="bar-fill power" style="width:${powerPct}%"></div></div>`;
+      html += `</div>`;
+    }
+    // Fan (NVIDIA only)
+    if (fan != null) {
+      html += `<div class="metric"><div class="metric-label"><span>Fan Speed</span><span class="val">${fan}%</span></div>`;
+      html += `<div class="bar"><div class="bar-fill util" style="width:${fan}%"></div></div></div>`;
+    }
+  }
+
+  // Model loaded
+  html += `<div class="model-line">Model: <span>${inf.model || '—'}</span></div>`;
+  html += `</div>`;
+  return html;
+}
+
+async function refresh() {
+  try {
+    const resp = await fetch('gpu_metrics.json?t=' + Date.now());
+    const data = await resp.json();
+    const gpus = data.gpus || [];
+
+    document.getElementById('cards').innerHTML = gpus.map(renderCard).join('');
+
+    // Top bar status
+    const online = gpus.filter(g => g.online !== false).length;
+    const total = gpus.length;
+    const dot = document.getElementById('status-dot');
+    const txt = document.getElementById('status-text');
+    if (online === total) { dot.className = 'dot green'; txt.textContent = `${online}/${total} online`; }
+    else if (online > 0) { dot.className = 'dot yellow'; txt.textContent = `${online}/${total} online`; }
+    else { dot.className = 'dot red'; txt.textContent = 'All offline'; }
+
+    // Capture fetch time for live clock
+    lastFetchTime = Date.now() / 1000;
+  } catch(e) {
+    document.getElementById('status-dot').className = 'dot red';
+    document.getElementById('status-text').textContent = 'Collector down';
+  }
+}
+
+// Render skeletons instantly
+const SKELETONS = [
+  {host:'amdpve', gpu_name:'AMD Strix Halo', hardware:{}, inference:{}, online:true},
+  {host:'llmgpu', gpu_name:'RTX 3090', hardware:{}, inference:{}, online:true},
+  {host:'ocu-llm', gpu_name:'RTX 5070', hardware:{}, inference:{}, online:true},
+];
+document.getElementById('cards').innerHTML = SKELETONS.map(g =>
+  `<div class="card"><div class="card-header"><div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div><div class="state unknown">···</div></div><div class="model-line" style="color:#8b949e;">Loading metrics...</div></div>`
+).join('');
+
+refresh();
+setInterval(refresh, INTERVAL);
+</script>
+</body>
+</html>
@@ -0,0 +1,115 @@
+#!/usr/bin/env python3
+"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
+
+import urllib.request, json, time, os
+
+HOSTS = [
+    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
+    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
+    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
+]
+OUTPUT = "/app/public/gpu_metrics.json"
+INTERVAL = 10
+STALE_THRESHOLD = 30  # seconds before marking stale
+DEAD_THRESHOLD = 60   # seconds before marking unreachable
+
+last_seen = {}
+
+
+def fetch_json(url, timeout=3):
+    """Return parsed JSON from url, or None on any error."""
+    try:
+        with urllib.request.urlopen(url, timeout=timeout) as resp:
+            return json.loads(resp.read().decode())
+    except Exception:
+        return None
+
+
+def collect_one(h):
+    """Collect GPU hardware + llama.cpp inference state for one host."""
+    name = h["name"]
+    host = h["host"]
+    now = time.time()
+
+    # GPU hardware from sidecar
+    gpu = fetch_json(f"http://{host}:8090/")
+
+    # llama.cpp inference state
+    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
+    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
+
+    # Determine inference state
+    model_name = None
+    inference_state = "unknown"
+    if llamacpp_models:
+        models = llamacpp_models.get("data", [])
+        if models:
+            model_name = models[0].get("id")
+
+    if llamacpp_health:
+        status = llamacpp_health.get("status", "")
+        if status == "ok":
+            idle = llamacpp_health.get("slots_idle", 0)
+            processing = llamacpp_health.get("slots_processing", 0)
+            if idle and not processing:
+                inference_state = "idle"
+            elif processing:
+                inference_state = "busy"
+            else:
+                inference_state = "idle"
+
+    # Check the /slots endpoint: any slot that is processing means busy
+    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
+    if slots and isinstance(slots, list):
+        if any(isinstance(s, dict) and s.get("is_processing") for s in slots):
+            inference_state = "busy"
+
+    result = {
+        "host": name,
+        "gpu_name": h["gpu"],
+        "inference": {
+            "state": inference_state,
+            "model": model_name,
+        },
+        "hardware": gpu if gpu else None,
+        "online": gpu is not None,
+        "timestamp": now,
+    }
+
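+    if gpu is not None:
+        last_seen[name] = now
+    elif name in last_seen:
+        # Sidecar missed this poll: keep the host online (flagged stale once
+        # past STALE_THRESHOLD) until DEAD_THRESHOLD has elapsed.
+        age = now - last_seen[name]
+        if age <= DEAD_THRESHOLD:
+            result["online"] = True
+            result["stale"] = age > STALE_THRESHOLD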
+
+    return result
+
+
+def main():
+    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
+    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
+
+    while True:
+        start = time.time()
+        results = [collect_one(h) for h in HOSTS]
+
+        payload = {
+            "updated": start,
+            "gpus": results,
+        }
+
+        with open(OUTPUT + ".tmp", "w") as f:
+            json.dump(payload, f)
+        os.replace(OUTPUT + ".tmp", OUTPUT)  # atomic overwrite of the live file
+
+        elapsed = time.time() - start
+        sleep_for = max(0, INTERVAL - elapsed)
+        time.sleep(sleep_for)
+
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,14 @@
+#!/bin/bash
+set -e
+
+# Start collector as background process
+cd /root/hermes-workspace/public
+python3 /app/collector.py &
+COLLECTOR_PID=$!
+
+echo "Collector started (PID $COLLECTOR_PID)"
+echo "Serving dashboard on :8092"
+
+# Serve the public directory (contains gpu.html + gpu_metrics.json)
+cd /root/hermes-workspace/public
+python3 -m http.server 8092
@@ -24,7 +24,12 @@ upstream queue_service {

 upstream dashboard_service {
    ## Harness dashboard (Docker container)
-    server dashboard:3001;
+    server syslog-harness-dashboard-1:3001;
+}
+
+upstream gpu_dashboard_pool {
+    ## GPU dashboard (Docker container)
+    server syslog-harness-gpu-dashboard-1:8092;
 }

 ## ------------------------------------------------------------------
@@ -56,6 +61,17 @@ server {
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    }

+    ## ------------------------------------------------------------------
+    ## GPU Dashboard — observability UI (MUST be before / catch-all)
+    ## ------------------------------------------------------------------
+    location /gpu/ {
+        proxy_pass http://gpu_dashboard_pool/;
+        proxy_set_header Host              $host;
+        proxy_set_header X-Real-IP         $remote_addr;
+        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+
    ## ------------------------------------------------------------------
    ## Main location — proxy to selected upstream
    ## ------------------------------------------------------------------
@@ -1,106 +0,0 @@
-## Syslog GPU Router — Nginx Configuration
-## Routes incoming agent requests to the appropriate GPU backend
-## based on the X-Syslog-Model header.
-
-upstream amdpve_pool {
-    ## Strix Halo 395 — qwen3.6-35B-A3B (MoE) — Default workhorse
-    server 192.168.68.15:8080;
-}
-
-upstream llmgpu_pool {
-    ## RTX 3090 — qwen3.5-27B (Dense) — Heavy reasoning
-    server 192.168.68.8:8080;
-}
-
-upstream ocu_llm_pool {
-    ## RTX 5070 — gemma-4 (Dense 4B) — Ultra-light tasks
-    server 192.168.68.110:8080;
-}
-
-upstream queue_service {
-    ## Agent queue with circuit breaker (Docker container)
-    server 127.0.0.1:8091;
-}
-
-upstream dashboard_service {
-    ## Harness dashboard (Docker container)
-    server 127.0.0.1:3001;
-}
-
-## ------------------------------------------------------------------
-## Mapping: X-Syslog-Model header → upstream backend
-## ------------------------------------------------------------------
-map $http_x_syslog_model $gpu_upstream {
-    default          amdpve_pool;   # missing header → default workhorse
-    "standard"       amdpve_pool;
-    "heavy"          llmgpu_pool;
-    "qwen3.5-27B"    llmgpu_pool;
-    "light"          ocu_llm_pool;
-    "gemma-4"        ocu_llm_pool;
-}
-
-server {
-    listen 8080;
-    server_name _;
-
-    # Rate limit zone — 10 req/s per IP, burst of 20
-    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
-
-    ## ------------------------------------------------------------------
-    ## Dashboard — observability UI (MUST be before / catch-all)
-    ## ------------------------------------------------------------------
-    location /dashboard {
-        proxy_pass http://dashboard_service/;
-        proxy_set_header Host              $host;
-        proxy_set_header X-Real-IP         $remote_addr;
-        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
-    }
-
-    ## ------------------------------------------------------------------
-    ## Main location — proxy to selected upstream
-    ## ------------------------------------------------------------------
-    location / {
-        limit_req zone=perip burst=20 nodelay;
-        limit_req_status 503;
-        proxy_pass http://$gpu_upstream;
-
-        ## Preserve original host and headers
-        proxy_set_header Host              $host;
-        proxy_set_header X-Real-IP         $remote_addr;
-        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
-        proxy_set_header X-Forwarded-Proto $scheme;
-
-        ## Pass through the model header so backends can log it
-        proxy_pass_header X-Syslog-Model;
-
-        ## Streaming support (SSE for LLM responses)
-        proxy_buffering off;
-        proxy_cache     off;
-        proxy_read_timeout  300s;
-        proxy_send_timeout  300s;
-
-        ## Basic failover — retry on error or timeout
-        proxy_next_upstream error timeout http_502 http_503;
-        proxy_next_upstream_tries 2;
-
-        ## Add a response header for observability
-        add_header X-Routed-To $gpu_upstream always;
-
-        ## Fallback to queue when all GPU upstreams are down
-        error_page 502 503 504 = @queue_fallback;
-    }
-
-    ## ------------------------------------------------------------------
-    ## Queue fallback — enqueue when GPUs are unavailable
-    ## ------------------------------------------------------------------
-    location @queue_fallback {
-        rewrite ^ /enqueue break;
-        proxy_pass http://queue_service;
-        proxy_set_header Host              $host;
-        proxy_set_header X-Real-IP         $remote_addr;
-        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
-        proxy_set_header X-Forwarded-Proto $scheme;
-        proxy_set_header Content-Type      $content_type;
-        proxy_pass_request_body            on;
-    }
-}
@@ -1,25 +0,0 @@
-model_list:
-  - model_name: qwen3.6-35B-A3B
-    litellm_params:
-      model: openai/qwen3.6-35B-A3B
-      api_base: http://192.168.68.15:8080/v1
-      api_key: "not-needed"
-
-  - model_name: qwen3.6-27B-code
-    litellm_params:
-      model: openai/qwen3.6-27B-code-text
-      api_base: http://192.168.68.8:8080/v1
-      api_key: "not-needed"
-
-  - model_name: gemma-4-E4B
-    litellm_params:
-      model: openai/gemma-4-E4B
-      api_base: http://192.168.68.110:8080/v1
-      api_key: "not-needed"
-
-general_settings:
-  master_key: sk-syslog-local-master-key
-
-litellm_settings:
-  drop_params: true
-  request_timeout: 120
@@ -1,79 +0,0 @@
-worker_processes auto;
-error_log /var/log/nginx/error.log warn;
-pid /var/run/nginx.pid;
-
-events { worker_connections 1024; }
-
-http {
-    include /etc/nginx/mime.types;
-    default_type application/octet-stream;
-
-    log_format main  launching rt=;
-    access_log /var/log/nginx/access.log main;
-    error_log /var/log/nginx/error.log;
-    sendfile on;
-    keepalive_timeout 65;
-
-    upstream router_api { server router:9000; }
-    upstream dashboard_ui { server dashboard:3000; }
-    upstream litellm_backend { server litellm:4000; }
-
-    server {
-        listen 80;
-
-        # Disable buffering for SSE streams
-        proxy_buffering off;
-
-        # API — through router
-        location /v1/ {
-            proxy_pass http://router_api;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_set_header X-Real-IP $remote_addr;
-            proxy_set_header Authorization $http_authorization;
-            proxy_connect_timeout 10s;
-            proxy_read_timeout 300s;
-            proxy_buffering off;
-        }
-
-        # SSE streaming endpoint
-        location /stream {
-            proxy_pass http://router_api;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_set_header Connection "";
-            proxy_buffering off;
-            chunked_transfer_encoding off;
-        }
-
-        # Dashboard API proxy for SSE
-        location /api/ {
-            proxy_pass http://dashboard_ui;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_buffering off;
-        }
-
-        # LiteLLM debug
-        location /litellm/ {
-            rewrite ^/litellm/(.*) /$1 break;
-            proxy_pass http://litellm_backend;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_set_header Authorization $http_authorization;
-        }
-
-        # Dashboard
-        location / {
-            proxy_pass http://dashboard_ui;
-            proxy_http_version 1.1;
-            proxy_set_header Host $host;
-            proxy_buffering off;
-        }
-
-        location /health {
-            return 200 "{\"status\":\"healthy\"}";
-            add_header Content-Type application/json;
-        }
-    }
-}
@@ -0,0 +1,10 @@
+FROM python:3.13-slim
+
+RUN pip install --no-cache-dir flask redis
+
+COPY queue-service.py /app/queue-service.py
+WORKDIR /app
+
+EXPOSE 8091
+
+CMD ["python3", "queue-service.py"]
@@ -1,9 +0,0 @@
-FROM python:3.12-slim
-
-WORKDIR /app
-COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
-COPY router.py .
-
-EXPOSE 9000
-CMD ["python", "router.py"]
@@ -1,3 +0,0 @@
-flask==3.1.*
-redis==5.2.*
-requests==2.32.*
@@ -1,213 +0,0 @@
-import os, json, time, logging, traceback, threading, queue
-import requests, redis
-from flask import Flask, request, jsonify, Response, stream_with_context
-
-REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379")
-GPU_MOE_URL = os.environ.get("GPU_MOE_URL", "http://192.168.68.15:8080/v1")
-GPU_DENSE_URL = os.environ.get("GPU_DENSE_URL", "http://192.168.68.8:8080/v1")
-GPU_LIGHT_URL = os.environ.get("GPU_LIGHT_URL", "http://192.168.68.110:8080/v1")
-
-GPU_SIDECARS = {
-    "qwen3.6-35B-A3B": "http://192.168.68.15:8090",
-    "qwen3.6-27B-code": "http://192.168.68.8:8090",
-    "gemma-4-E4B": "http://192.168.68.110:8090",
-}
-GPU_URLS = {
-    "qwen3.6-35B-A3B": GPU_MOE_URL,
-    "qwen3.6-27B-code": GPU_DENSE_URL,
-    "gemma-4-E4B": GPU_LIGHT_URL,
-}
-TIER_MODELS = {
-    "starter": ["gemma-4-E4B"],
-    "professional": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "gemma-4-E4B"],
-    "enterprise": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "gemma-4-E4B"],
-}
-API_KEYS = {
-    "sk-syslog-local-master-key": {"tier": "enterprise", "agent": "admin"},
-    "sk-syslog-abiba": {"tier": "enterprise", "agent": "Abiba"},
-    "sk-syslog-mumuni": {"tier": "enterprise", "agent": "Mumuni"},
-    "sk-syslog-tanko": {"tier": "enterprise", "agent": "Tanko"},
-    "sk-syslog-koby": {"tier": "enterprise", "agent": "Koby"},
-    "sk-syslog-kagenz0": {"tier": "enterprise", "agent": "Kagenz0"},
-    "sk-starter-abc123": {"tier": "starter", "agent": "test-starter"},
-    "sk-professional-xyz789": {"tier": "professional", "agent": "test-pro"},
-}
-
-logging.basicConfig(level=logging.INFO, format="%(asctime)s [ROUTER] %(levelname)s %(message)s")
-log = logging.getLogger("router")
-try: r = redis.from_url(REDIS_URL, decode_responses=True); r.ping()
-except Exception: r = None
-
-app = Flask(__name__)
-sse_subscribers = []; sse_lock = threading.Lock()
-
-def check_gpu_health(model):
-    url = GPU_SIDECARS.get(model)
-    if not url: return {"status": "unknown"}
-    try:
-        resp = requests.get(url, timeout=5)
-        if resp.status_code == 200:
-            d = resp.json()
-            pct = (d.get("vram_used_mb",0) / max(d.get("vram_total_mb",1), 1)) * 100
-            return {"status": "healthy" if pct < 90 else "saturated", "vram_used_mb": d.get("vram_used_mb"), "vram_total_mb": d.get("vram_total_mb"), "vram_pct": round(pct,1), "temp_c": d.get("temp_c"), "gpu_util_pct": d.get("gpu_util_pct"), "gpu_name": d.get("gpu_name"), "power_w": d.get("power_w"), "power_limit_w": d.get("power_limit_w")}
-    except Exception: pass
-    return {"status": "down"}
-
-def available_models(): return [m for m in GPU_URLS if check_gpu_health(m)["status"] in ("healthy","saturated")]
-
-def estimate_tokens(msgs): return sum(len(str(m.get("content",""))) for m in msgs) // 4
-
-def route(rd, tier):
-    msgs = rd.get("messages",[]); t = estimate_tokens(msgs)
-    sys = any(m.get("role")=="system" for m in msgs)
-    turns = len([m for m in msgs if m.get("role") in ("user","assistant")])
-    hints = rd.get("routing_hints",{})
-    allowed = TIER_MODELS.get(tier, ["gemma-4-E4B"])
-    avail = [m for m in available_models() if m in allowed]
-    if not avail: return {"model": allowed[0], "reason": "all_saturated"}
-    req = rd.get("model","auto")
-    if req != "auto": return {"model": req if req in avail else avail[0], "reason": "explicit"}
-    if hints:
-        if hints.get("priority")=="speed" and "gemma-4-E4B" in avail: return {"model":"gemma-4-E4B","reason":"hint_speed"}
-        if hints.get("priority")=="quality" and "qwen3.6-27B-code" in avail: return {"model":"qwen3.6-27B-code","reason":"hint_quality"}
-    if t > 4000 or sys or turns > 6:
-        for m in ["qwen3.6-27B-code","qwen3.6-35B-A3B","gemma-4-E4B"]:
-            if m in avail: return {"model":m,"reason":"heavy_reasoning"}
-    first_msg = msgs[0].get("content","") if msgs else ""
-    words = len(first_msg.split()) if isinstance(first_msg, str) else 99
-    if words <= 3 and turns <= 1 and not sys and "gemma-4-E4B" in avail:
-        return {"model":"gemma-4-E4B","reason":"ultra_light"}
-    if "qwen3.6-35B-A3B" in avail: return {"model":"qwen3.6-35B-A3B","reason":"default_moe"}
-    return {"model":avail[0],"reason":"fallback"}
-
-def clean_unicode(text):
-    if not isinstance(text, str): return text
-    return text.replace("\u2014","-").replace("\u2013","-").replace("\u2018","'").replace("\u2019","'").replace("\u201c",'"').replace("\u201d",'"').replace("\u2026","...").replace("\u00a0"," ")
-
-def clean_response(d):
-    if isinstance(d, dict): return {k: clean_response(v) for k,v in d.items()}
-    if isinstance(d, list): return [clean_response(v) for v in d]
-    if isinstance(d, str): return clean_unicode(d)
-    return d
-
-def get_metrics():
-    d = {"gpus":[],"route_counts":{},"agent_counts":{},"tier_counts":{},"recent":[],"timestamp":time.time()}
-    for m in GPU_URLS:
-        h = check_gpu_health(m)
-        d["gpus"].append({"id":m,"gpu_name":h.get("gpu_name",m),"status":h.get("status"),"vram_used_mb":h.get("vram_used_mb"),"vram_total_mb":h.get("vram_total_mb"),"vram_pct":h.get("vram_pct"),"temp_c":h.get("temp_c"),"gpu_util_pct":h.get("gpu_util_pct"),"power_w":h.get("power_w"),"power_limit_w":h.get("power_limit_w")})
-    if r:
-        try:
-            for m in GPU_URLS: d["route_counts"][m] = int(r.get("routes:"+m) or 0)
-            for k,v in API_KEYS.items():
-                c = int(r.get("routes:agent:"+v["agent"]) or 0)
-                if c>0: d["agent_counts"][v["agent"]] = c
-            for t in TIER_MODELS: d["tier_counts"][t] = int(r.get("routes:tier:"+t) or 0)
-            raw = r.lrange("routes:recent",0,49)
-            d["recent"] = [json.loads(x) for x in raw] if raw else []
-        except Exception: pass
-    return d
-
-def bcast():
-    data = get_metrics(); payload = json.dumps(data)
-    with sse_lock:
-        dead = []
-        for q in sse_subscribers:
-            try: q.put(payload)
-            except Exception: dead.append(q)
-        for q in dead: sse_subscribers.remove(q)
-
-@app.route("/v1/chat/completions", methods=["POST"])
-def chat():
-    try:
-        rd = request.get_json(force=True)
-        ak = request.headers.get("Authorization","").replace("Bearer ","")
-        ki = API_KEYS.get(ak, {"tier":"starter","agent":"unknown"})
-        tier, agent = ki["tier"], ki["agent"]
-        d = route(rd, tier); model, reason, url = d["model"], d["reason"], GPU_URLS[d["model"]]
-        is_stream = rd.get("stream", False)
-        log.info("ROUTE: %s -> %s (%s) stream=%s", agent, model, reason, is_stream)
-        if r:
-            try:
-                r.incr("routes:"+model); r.incr("routes:tier:"+tier); r.incr("routes:agent:"+agent)
-                r.incr("ts:"+model+":"+time.strftime("%Y%m%d%H"))
-                r.lpush("routes:recent", json.dumps({"ts":time.time(),"model":model,"reason":reason,"tier":tier,"agent":agent}))
-                r.ltrim("routes:recent",0,999)
-            except Exception: pass
-        start = time.time()
-        resp = requests.post(url+"/chat/completions", json=rd,
-            headers={"Content-Type":"application/json","Authorization":"Bearer not-needed"}, timeout=120, stream=is_stream)
-        lat = int((time.time()-start)*1000)
-        if resp.status_code != 200: return jsonify({"error":"GPU error "+str(resp.status_code)}), 502
-        if is_stream:
-            def gen():
-                for raw in resp.iter_content(chunk_size=None, decode_unicode=True):
-                    if raw: yield clean_unicode(raw)
-            bcast()
-            return Response(stream_with_context(gen()), mimetype="text/event-stream")
-        data = clean_response(resp.json())
-        for c in data.get("choices",[]):
-            msg = c.get("message",{})
-            if not msg.get("content") and msg.get("reasoning_content"):
-                msg["content"] = msg["reasoning_content"]
-        data["routing"] = {"model":model,"reason":reason,"gpu":url,"tier":tier,"agent":agent,"latency_ms":lat}
-        bcast()
-        return jsonify(data)
-    except requests.Timeout: return jsonify({"error":"timeout"}), 504
-    except Exception as e:
-        log.error("Error: %s\n%s", e, traceback.format_exc())
-        return jsonify({"error":str(e)}), 500
-
-@app.route("/v1/models")
-def models(): return jsonify({"object":"list","data":[{"id":m,"object":"model","owned_by":"syslog","status":check_gpu_health(m).get("status"),"gpu":check_gpu_health(m).get("gpu_name")} for m in GPU_URLS]})
-
-@app.route("/health")
-def health(): return jsonify({"status":"healthy","redis":"connected" if r else "down","gpus":{m:check_gpu_health(m) for m in GPU_URLS},"available_models":available_models()})
-
-@app.route("/metrics")
-def metrics(): return jsonify(get_metrics())
-
-@app.route("/metrics/timeseries")
-def metrics_timeseries():
-    period = request.args.get("period", "day"); models_list = list(GPU_URLS.keys())
-    data = {"models": {}, "labels": []}
-    if period == "day":
-        buckets = [time.strftime("%Y%m%d%H", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
-        data["labels"] = [time.strftime("%H:00", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
-    elif period == "week":
-        buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
-        data["labels"] = [time.strftime("%a", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
-    else:
-        buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
-        data["labels"] = [time.strftime("%m/%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
-    if r:
-        for model in models_list:
-            counts = []
-            for bucket in buckets:
-                total = 0
-                if period in ("week","month"):
-                    for hh in range(24): total += int(r.get("ts:"+model+":"+bucket+"{:02d}".format(hh)) or 0)
-                else: total = int(r.get("ts:"+model+":"+bucket) or 0)
-                counts.append(total)
-            data["models"][model] = counts
-    return jsonify(data)
-
-@app.route("/stream")
-def stream():
-    def ev():
-        q = queue.Queue()
-        with sse_lock: sse_subscribers.append(q)
-        try:
-            yield "data: "+json.dumps(get_metrics())+"\n\n"
-            while True:
-                try: yield "data: "+q.get(timeout=3)+"\n\n"
-                except queue.Empty: yield "data: "+json.dumps(get_metrics())+"\n\n"
-        except GeneratorExit: pass
-        finally:
-            with sse_lock:
-                if q in sse_subscribers: sse_subscribers.remove(q)
-    return Response(stream_with_context(ev()), mimetype="text/event-stream",
-                    headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
-
-if __name__ == "__main__":
-    log.info("Router on :9000")
-    app.run(host="0.0.0.0", port=9000, debug=False)