router: heavy tier Dense→MoE→Light + X-Context-Warning headers (compact_soon/compact_recommended/compact_urgent)

router: 4 optimizations — saturated flag fix, heavy tier MoE-first, better token est, session tracking
- Saturated flag now triggers on load saturation (was dead code)
- Heavy tier routes MoE (131K) first instead of Dense (98K)
- Token estimation uses JSON length / 3.5 (was content / 4)
- Cross-turn session tracking via X-Session-Id + Redis TTL 24h
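
A minimal sketch of the token-estimation change listed in this commit message, assuming the router receives an OpenAI-style `messages` list. `estimate_tokens` and `estimate_tokens_old` are hypothetical names; only the two ratios (JSON length / 3.5 and content length / 4) come from the commit message.

```python
import json

CHARS_PER_TOKEN = 3.5  # ratio quoted in the commit message


def estimate_tokens(messages):
    """Rough estimate from the serialized JSON length (hypothetical helper)."""
    return int(len(json.dumps(messages)) / CHARS_PER_TOKEN)


def estimate_tokens_old(messages):
    """Previous heuristic as described: content characters divided by 4."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4
```

Serializing the whole payload also counts role names and JSON punctuation, so the new estimate is never lower than a content-only count for the same request. Whether that overhead matches the real tokenizer is not something this sketch can confirm.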
2026-05-22 09:48:00 +00:00 · 2026-05-21 20:47:48 +00:00 · 2026-05-19 21:24:36 +00:00 · 2026-05-19 21:20:29 +00:00 · 2026-05-19 21:15:23 +00:00 · 2026-05-19 21:13:56 +00:00
29 changed files with 1031 additions and 4051 deletions
@@ -0,0 +1,8 @@
+# Syslog Harness Environment
+REDIS_HOST=192.168.68.8
+REDIS_PORT=6379
+AMDPVE_ENDPOINT=http://192.168.68.15:8080
+LLMGPU_ENDPOINT=http://192.168.68.8:8080
+OCU_LLM_ENDPOINT=http://192.168.68.110:8080
+CIRCUIT_BREAKER_THRESHOLD=5
+CIRCUIT_BREAKER_TIMEOUT=30
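
The diff does not show how `CIRCUIT_BREAKER_THRESHOLD` and `CIRCUIT_BREAKER_TIMEOUT` are consumed. One plausible reading, sketched below purely as an assumption, is that the threshold counts consecutive failures and the timeout is the cooldown in seconds before a probe is allowed again. The `CircuitBreaker` class and its method names are hypothetical and are not taken from the repository.

```python
import os
import time


class CircuitBreaker:
    """Hypothetical breaker driven by the two .env values above."""

    def __init__(self, threshold=None, timeout=None, clock=time.monotonic):
        env = os.environ
        self.threshold = int(threshold if threshold is not None
                             else env.get("CIRCUIT_BREAKER_THRESHOLD", "5"))
        self.timeout = float(timeout if timeout is not None
                             else env.get("CIRCUIT_BREAKER_TIMEOUT", "30"))
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """True while closed, or once the cooldown has elapsed (probe phase)."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.timeout

    def record_success(self):
        """A successful call closes the breaker and resets the counter."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Open (or re-open) the breaker once the failure count hits the threshold."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Injecting `clock` keeps the sketch testable without sleeping. A failed probe after the cooldown restarts the timer, which approximates a half-open state.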
@@ -0,0 +1,3 @@
+.git
+__pycache__/
+*.pyc
@@ -1,390 +0,0 @@
-# Syslog Harness: Architecture Review & Improvement Recommendations
-
-**Date:** 2026-05-17  
-**Commit:** `e95475f`: "Add GPU dashboard container + Nginx routing"  
-**Repo:** http://192.168.68.17:3000/SyslogSolution/syslog-harness.git
-
---
-
-## 1. Current Architecture Overview
-
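-```
-                +-------------------- Host (192.168.68.123) --------------------+
-                |                                                               |
-                |  +---------------+    +---------------+    +---------------+  |
-Agent --:8080-->|  | Nginx Router  |--->| Queue Service |--->| Dashboard     |  |
-                |  | :8080         |    | :8091         |    | :3001         |  |
-                |  +---------------+    +---------------+    +---------------+  |
-                |          |                    |                               |
-                |          v                    v                               |
-                |  +---------------+    +---------------+    +---------------+  |
-                |  | GPU Pool      |    | Redis         |--->| GPU Dashboard |  |
-                |  | :8080         |    | :6379         |    | :8092         |  |
-                |  +---------------+    +---------------+    +---------------+  |
-                +----------+----------------------------------------------------+
-           +---------------+---------------+
-           |               |               |
-     +-----------+   +-----------+   +-----------+
-     | amdpve    |   | llmgpu    |   | ocu_llm   |
-     | .15:8080  |   | .8:8080   |   | .110:8080 |
-     | MoE 35B   |   | Dense 27B |   | Light 4B  |
-     +-----------+   +-----------+   +-----------+
-```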
-
-### Services
-
-| Service | Port | Container | Image | Purpose |
-|---|---|---|---|---|
-| **Nginx Router** | 8080 | Host-level | OS nginx | Routes by `X-Syslog-Model` header |
-| **Queue Service** | 8091 | `syslog-queue` | `python:3.13-slim` | Request queue + circuit breaker |
-| **Dashboard** | 3001 | `syslog-dashboard` | `python:3.11-slim` | Observability UI + GPU health |
-| **GPU Dashboard** | 8092 | `syslog-gpu-dashboard` | `python:3.11-slim` | Hardware metrics (temp, VRAM, power) |
-| **Redis** | 6379 | `syslog-redis` | `redis:7-alpine` | Queue storage |
-
-### GPU Backends
-
-| Host | GPU | Model | Capacity |
-|---|---|---|---|
-| 192.168.68.15 | AMD Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB VRAM |
-| 192.168.68.8 | RTX 3090 | qwen3.5-27B (Dense) | 24GB VRAM |
-| 192.168.68.110 | RTX 5070 | gemma-4-E4B (Light) | 12GB VRAM |
-
-### Data Flow
-
-1. **Agent** sends request with `X-Syslog-Model` header → Nginx :8080
-2. **Nginx** routes to appropriate GPU based on header mapping
-3. **GPU backend** (llama.cpp) processes request
-4. **Fallback:** If GPU returns 502/503/timeout → Nginx redirects to queue-service :8091
-5. **Queue** stores request in Redis `inference:requests` LPUSH
-6. **Dashboard** :3001 polls queue-service + GPU health for display
-7. **GPU Dashboard** :8092 collects hardware metrics every 10s
-
---
-
-## 2. File Inventory
-
-```
-docker-compose.yml                          # Main compose (Docker networking)
-gpu-router-docker.conf                      # Nginx config for Docker deployment
-Dockerfile.gpu                              # GPU dashboard container
-Dockerfile.dashboard                        # Dashboard container (root-level)
-queue-service/Dockerfile                    # Queue service container
-queue-service/queue-service.py              # Queue logic (121 lines)
-dashboard/harness-dashboard.py              # Dashboard app (133 lines)
-dashboard/Dockerfile                        # Dashboard container (subdir)
-dashboard/Dockerfile.dashboard              # Dashboard container (duplicate)
-gpu-dashboard/gpu_collector.py              # GPU hardware collector (115 lines)
-gpu-dashboard/gpu.html                      # GPU dashboard UI (183 lines)
-gpu-dashboard/collector.py                  # Duplicate collector (hermes-workspace path)
-gpu-dashboard/start.sh                      # Legacy startup script
-MIGRATION_PLAN.md                           # Production migration plan
-README.md                                   # Documentation
-syslog-harness-check/                       # Checkpoint subdirectory (mirror)
-```
-
---
-
-## 3. Detailed Findings
-
-### 3.1 Queue Service (`queue-service/queue-service.py`)
-
-**Architecture:** Simple Flask app using Redis LPUSH/RPUSH for a FIFO queue. A basic circuit breaker prevents queue overflow at 50 messages.
-
-**Issues Found:**
-
-| # | Severity | Location | Issue |
-|---|---|---|---|
-| Q1 | **CRITICAL** | Lines 82-88 | **Queue is fire-and-forget with no consumer.** Requests are pushed to Redis but nothing dequeues or processes them. The queue is a dead storage pit. |
-| Q2 | **CRITICAL** | Lines 28-32 | **Hardcoded GPU IPs** in the queue service duplicate the Nginx config. No configuration source of truth. |
-| Q3 | **HIGH** | Lines 21-22 | **Redis host fallback to `192.168.68.7`** (line 21) conflicts with docker-compose which sets `REDIS_HOST=redis` (line 24). The default is unreachable inside Docker. |
-| Q4 | **HIGH** | Lines 66-95 | **No job result retrieval mechanism.** Once enqueued, there's no API to poll for completion, get a job ID, or retrieve results. |
-| Q5 | **HIGH** | Lines 73-79 | **Circuit breaker is a simple depth threshold.** No backoff, no recovery window, no sliding window. Once tripped, it stays open (rejecting requests) until the queue is manually drained. |
-| Q6 | **MEDIUM** | Lines 50-57 | **GPU health check is synchronous and blocks** the `/status` endpoint. Checking 3 GPUs sequentially with 3s timeout means `/status` can take up to 9s. |
-| Q7 | **MEDIUM** | Lines 35-40 | **`get_redis()` swallows all exceptions** and returns `None`. This makes Redis failures silent: queue depth returns 0 on failure (line 47), potentially allowing overflow. |
-| Q8 | **MEDIUM** | Lines 83-84 | **Headers filtered to only X-* prefixed**: the `Content-Type` header is dropped entirely, meaning the receiver can't determine payload format. |
-| Q9 | **LOW** | Line 121 | **No graceful shutdown.** Flask development server doesn't handle SIGTERM gracefully. |
-
-### 3.2 Nginx Gateway (`gpu-router-docker.conf`)
-
-**Architecture:** Nginx routes requests to GPU backends based on `X-Syslog-Model` header value. Has rate limiting, streaming support, and queue fallback.
-
-**Issues Found:**
-
-| # | Severity | Location | Issue |
-|---|---|---|---|
-| N1 | **HIGH** | Lines 79-80 | **`burst=20 nodelay`** means 20 requests are served immediately beyond the rate limit, then throttled. This defeats the purpose of rate limiting under burst traffic: all 20 could still overwhelm a GPU. |
-| N2 | **HIGH** | Lines 99-100 | **`proxy_next_upstream` with `tries 2`** means on error/timeout/502/503, Nginx retries once. But it retries against the *same GPU pool*, not a different one. The same GPU that failed gets hit again. |
-| N3 | **HIGH** | Lines 106, 112-121 | **Queue fallback (`@queue_fallback`) is triggered for ANY 502/503/504**, including when a single GPU is overloaded. This means individual GPU slowness causes queue fallback instead of just queuing when ALL GPUs are down. |
-| N4 | **MEDIUM** | Line 90 | **`proxy_pass_header X-Syslog-Model`** is non-standard. Nginx automatically passes request headers; this directive is for response headers. The model header is already passed implicitly via `proxy_set_header` inheritance. |
-| N5 | **MEDIUM** | Lines 27, 32 | **Hardcoded container names** (`syslog-harness-dashboard-1`, `syslog-harness-gpu-dashboard-1`). These change based on docker-compose project prefix. Should use service names. |
-| N6 | **LOW** | Lines 67-73 | **GPU dashboard at `/gpu` path** has `X-Forwarded-Proto` but the dashboard service (simple HTTP server) doesn't use it. Inconsistent header handling across locations. |
-
-### 3.3 Dashboard (`dashboard/harness-dashboard.py`)
-
-**Architecture:** Simple HTTP server using Python's `http.server`. Fetches queue status and GPU health, renders HTML.
-
-**Issues Found:**
-
-| # | Severity | Location | Issue |
-|---|---|---|---|
-| D1 | **HIGH** | Lines 34-40 | **`get_queue_status()` calls queue-service synchronously.** Combined with per-GPU health checks (lines 18-31), the `/api/status` endpoint makes 4 sequential HTTP calls. Worst case: 2 + 3×3s = 11s response time. |
-| D2 | **MEDIUM** | Lines 101-127 | **Uses `SimpleHTTPRequestHandler`** which is single-threaded. Under concurrent dashboard access, requests queue up. Should use `ThreadingHTTPServer`. |
-| D3 | **MEDIUM** | Lines 16-18 | **GPU endpoints hardcoded** in dashboard, separate from queue-service and Nginx. Three separate sources of truth for GPU addresses. |
-| D4 | **LOW** | Line 127 | **Silent log suppression.** While intentional, this makes debugging impossible without modifying the source. |
-
-### 3.4 GPU Dashboard (`gpu-dashboard/`)
-
-**Architecture:** `gpu_collector.py` polls sidecar (port 8090) and llama.cpp (port 8080) endpoints every 10s, writes JSON to `gpu_metrics.json`. Static HTTP server serves the dashboard.
-
-**Issues Found:**
-
-| # | Severity | Location | Issue |
-|---|---|---|---|
-| G1 | **HIGH** | Lines 97-98 | **Sequential collection.** All 3 GPUs are polled sequentially (line 98: list comprehension). If one host is unreachable, it blocks collection for all three. |
-| G2 | **HIGH** | Line 105-107 | **`/app/public/gpu_metrics.json` path is hardcoded** and differs from `collector.py` (line 11: `/root/hermes-workspace/public/gpu_metrics.json`). Inconsistent between the two collector files. |
-| G3 | **MEDIUM** | Lines 19-25 | **`fetch_json` swallows all exceptions.** A timeout on one GPU's sidecar is silently ignored, making it impossible to distinguish "no data" from "collector error". |
-| G4 | **MEDIUM** | Line 14 | **`DEAD_THRESHOLD = 60` seconds is aggressive.** A GPU that restarts takes 60s before reappearing as online, even if it's back in 5s. |
-| G5 | **LOW** | Lines 10-14 | **`start.sh` references `/root/hermes-workspace/public`** but `Dockerfile.gpu` creates `/app/public`. Inconsistent between legacy and current deployment. |
-
-### 3.5 Docker Compose (`docker-compose.yml`)
-
-**Issues Found:**
-
-| # | Severity | Location | Issue |
-|---|---|---|---|
-| C1 | **HIGH** | Lines 19-20 | **Queue service exposes port 8091 externally.** In a multi-tenant or public-facing deployment, the queue API should be internal-only. |
-| C2 | **MEDIUM** | Lines 13-15 | **`Dockerfile.queue` referenced but doesn't exist at root level.** The file is at `queue-service/Dockerfile`. The compose build context is `.` (root) but the dockerfile path doesn't match. |
-| C3 | **MEDIUM** | Lines 6, 16, 26, 31, 43 | **`restart: always`** instead of `restart: unless-stopped`. With `always`, a manually stopped container is started again whenever the Docker daemon restarts, making maintenance harder. |
-| C4 | **LOW** | Lines 23-25 | **No health checks defined** for any service. Docker can't detect if a service is actually healthy, only if the container is running. |
-| C5 | **LOW** | Line 10 | **Redis has no password.** Unauthenticated Redis exposed on the Docker network. |
-| C6 | **LOW** | Lines 49-51 | **No network driver specified** for the bridge network (minor: defaults to bridge). No IPAM configuration for large deployments. |
-
-### 3.6 Container Images
-
-**Issues Found:**
-
-| # | Severity | Location | Issue |
-|---|---|---|---|
-| I1 | **HIGH** | All Dockerfiles | **No `requirements.txt` or dependency pinning.** All dependencies (`flask`, `redis`, `requests`) are installed without version pins. Builds are non-reproducible. |
-| I2 | **MEDIUM** | `Dockerfile.gpu` line 3 | **`pip install requests`**: unnecessary dependency for the GPU dashboard (only uses `urllib`). Adds ~300KB to the image. |
-| I3 | **MEDIUM** | `Dockerfile.gpu` line 14 | **Multi-process CMD with `&`**: no process supervisor. If the collector crashes, it won't restart. The `http.server` also won't receive SIGTERM properly. |
-| I4 | **LOW** | All Dockerfiles | **No `.dockerignore` file.** The entire context is sent to the Docker daemon, including `.git` directories and any local artifacts. |
-| I5 | **LOW** | `Dockerfile.dashboard` (root) vs `dashboard/Dockerfile.dashboard` | **Duplicate Dockerfiles** with slight differences (Python 3.11 vs 3.13, WORKDIR differences). |
-
---
-
-## 4. Smart Queuing Analysis & Recommendations
-
-### Current State: No Smart Queuing
-
-The queue service is a **passive storage mechanism** that stores requests but has no intelligence:
-
-- **No load balancing**: no awareness of GPU load (slots_busy, VRAM usage, queue depth per GPU)
-- **No job prioritization**: FIFO only, no priority levels
-- **No backpressure**: simple threshold, no exponential backoff or adaptive limits
-- **No retry logic**: failed GPU requests go to queue but are never reprocessed
-- **No dead letter handling**: stuck or failed jobs have no lifecycle management
-- **No consumer**: nothing dequeues and forwards to GPUs
-- **No job tracking**: no job IDs, no status updates, no result retrieval
-
-### Recommended Architecture: Smart Queue with Consumer
-
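-```
-Agent --> Nginx --> Smart Queue API --> Redis Streams (with consumers)
-                                              |
-                                      +---------------+
-                                      | Consumer Pool |
-                                      +---------------+
-                                              |
-                       +----------------------+----------------------+
-                       |                      |                      |
-                 GPU 1 (load)           GPU 2 (load)           GPU 3 (load)
-                       |                      |                      |
-                    Health                 Health                 Health
-                       |                      |                      |
-                       +----------------------+----------------------+
-                                              |
-                                      Update GPU scores
-                                              |
-                             Priority Queue (sorted by urgency)
-                             Dead Letter Queue (failed jobs)
-                             Backpressure (adaptive rate limit)
-```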
-
-### Specific Recommendations
-
-#### R1: Implement Redis Streams as Queue Backend
-- Replace `LPUSH/RPUSH` (FIFO list) with **Redis Streams** (`XADD/XREADGROUP`)
-- Streams support consumer groups, message acknowledgment, and pending messages
-- Enables proper dead letter queue handling and retry logic
-- **File:** `queue-service/queue-service.py`
-
-```python
-# Before: Simple list
-r.rpush(QUEUE_KEY, json.dumps(job))
-
-# After: Redis Stream with consumer group
-stream_key = "inference:stream"
-consumer_group = "gpu-workers"
-# redis-py names this keyword `approximate` (it maps to MAXLEN ~)
-r.xadd(stream_key, {"job": json.dumps(job)}, maxlen=10000, approximate=True)
-```
-
-#### R2: Build a Queue Consumer Pool
-- Deploy 1+ consumer containers that poll the stream and forward to GPUs
-- Consumer selects GPU based on: health status, current load (slots_busy), and VRAM availability
-- **File:** New `queue-service/consumer.py`
-
-```python
-class LoadBalancedConsumer:
-    def select_gpu(self, job):
-        """Select GPU based on load, health, and model compatibility."""
-        candidates = [g for g in self.gpus if g.health == "up" and not g.full]
-        if not candidates:
-            return None
-        # Sort by: slots_idle (descending), VRAM_available (descending)
-        candidates.sort(key=lambda g: (g.slots_idle, g.vram_free_mb), reverse=True)
-        return candidates[0]
-```
-
-#### R3: Implement Priority Queuing
-- Add priority field to job payload: `high`, `normal`, `low`
-- Use Redis Streams with multiple stream keys per priority level
-- Consumer checks `high` → `normal` → `low` in order
-- **File:** `queue-service/queue-service.py` enqueue endpoint
-
-#### R4: Add Backpressure Mechanism
-- Instead of hard threshold at 50, implement **adaptive backpressure**:
-  - Queue depth 0-30: normal operation
-  - Queue depth 30-40: return `retry-after` header with increasing delay
-  - Queue depth 40-50: return 503 with exponential retry-after
-  - Queue depth >50: circuit breaker open
-- **File:** `queue-service/queue-service.py`
-
-#### R5: Dead Letter Queue (DLQ)
-- Move failed/unprocessable jobs to an `inference:dead-letter` stream
-- Include failure reason, attempt count, and original payload
-- Provide admin API to inspect, retry, or discard DLQ entries
-- **File:** `queue-service/queue-service.py`
-
-```python
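-# New endpoints (assumes a redis-py client created with decode_responses=True)
-@app.route("/dlq", methods=["GET"])
-def list_dlq():
-    # xrange returns (id, fields) pairs; jsonify makes them a valid Flask response
-    return jsonify(r.xrange("inference:dead-letter"))
-
-@app.route("/dlq/retry/<message_id>", methods=["POST"])
-def retry_dlq(message_id):
-    # redis-py has no xget(); fetch one entry with an id-bounded XRANGE
-    entries = r.xrange("inference:dead-letter", min=message_id, max=message_id)
-    if not entries:
-        return jsonify({"error": "not found"}), 404
-    _, fields = entries[0]
-    r.xadd("inference:stream", fields)
-    return jsonify({"requeued": message_id}), 202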
-```
-
-#### R6: GPU-Aware Routing
-- Queue consumer should check GPU `slots_busy` before routing
-- If a GPU is busy, try the next available GPU
-- Track per-GPU queue depth and avoid overloading a single GPU
-- **File:** New consumer logic
-
-#### R7: Job Status API
-- Add job ID generation on enqueue
-- Provide `/status/<job_id>` endpoint to check progress
-- Store job state in Redis: `queued` → `processing` → `completed`/`failed`
-- **File:** `queue-service/queue-service.py`
-
-```python
-@app.route("/enqueue", methods=["POST"])
-def enqueue():
-    job_id = str(uuid.uuid4())
-    job = {"id": job_id, "payload": ..., "status": "queued", "created_at": time.time()}
-    r.xadd(stream_key, {"job": json.dumps(job)})
-    r.hset("job:status", job_id, json.dumps({"status": "queued"}))
-    return jsonify({"job_id": job_id, "status": "queued"}), 202
-
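-@app.route("/status/<job_id>")
-def job_status(job_id):
-    status = r.hget("job:status", job_id)
-    # Return 404 only when the job is unknown; the one-line form always returned 404
-    if status is None:
-        return jsonify({"error": "not found"}), 404
-    return jsonify(json.loads(status))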
-```
-
-#### R8: Health-Based Circuit Breaker
-- Replace simple depth threshold with **per-GPU circuit breakers**
-- Track consecutive failures per GPU
-- Implement half-open state: after cooldown, probe one GPU to test recovery
-- **File:** `queue-service/queue-service.py`
-
-#### R9: Centralized Configuration
-- Move GPU endpoints from 3 locations (queue-service, dashboard, Nginx) to:
-  - Redis config key: `config:gpus`
-  - Or environment file mounted to all containers
-- Nginx can use Lua/variable from config instead of static upstreams
-- **File:** New `config/` directory or Redis-based config
-
---
-
-## 5. Priority Issue Summary
-
-### Critical (Fix Immediately)
-1. **Q1**: Queue has no consumer; enqueued requests are never processed
-2. **Q4**: No job ID or result retrieval mechanism
-3. **N3**: Queue fallback triggers on individual GPU failure, not all-down
-
-### High (Fix Before Production)
-4. **Q5**: Circuit breaker has no recovery mechanism
-5. **Q6**: `/status` endpoint blocks on GPU health checks
-6. **D1**: Dashboard `/api/status` makes 4 sequential calls, up to 11s
-7. **C2**: `Dockerfile.queue` path mismatch in docker-compose
-8. **I1**: No dependency pinning in any Dockerfile
-9. **I3**: Multi-process CMD without supervisor in GPU dashboard
-
-### Medium (Improve in Next Iteration)
-10. **Q3**: Redis host default conflicts with Docker networking
-11. **Q7**: Silent exception swallowing in Redis access
-12. **Q8**: Content-Type header dropped in queue
-13. **D2**: Single-threaded dashboard server
-14. **D3**: Three separate sources of truth for GPU addresses
-15. **G1**: Sequential GPU collection blocks on single failure
-16. **N1**: Rate limit burst of 20 nodelay defeats protection
-17. **N5**: Hardcoded container names in Nginx
-18. **C1**: Queue API exposed externally
-19. **C4**: No Docker health checks
-
-### Low (Nice to Have)
-20. **Q9**: No graceful shutdown
-21. **C3**: `restart: always` vs `unless-stopped`
-22. **C5**: No Redis authentication
-23. **G4**: 60s dead threshold is too aggressive
-24. **I2**: Unnecessary `requests` dependency
-25. **I4**: No `.dockerignore`
-26. **I5**: Duplicate Dockerfiles
-
---
-
-## 6. Deployment Architecture Summary
-
-### What Works Well
-- Clean separation of concerns: routing (Nginx), queuing (Redis + queue-service), observability (two dashboards)
-- Good GPU hardware monitoring with temperature, VRAM, power, fan metrics
-- SSE streaming support in Nginx for LLM response streaming
-- Rate limiting at the gateway layer
-- Circuit breaker pattern implemented (even if basic)
-
-### What Needs Work
-- **Queue is incomplete**: storage without processing is the most critical gap
-- **No job lifecycle**: requests go in and never come out
-- **Duplicated configuration**: GPU addresses in 3+ places
-- **No monitoring/alerting**: no Prometheus metrics, no alerting rules
-- **Single point of failure**: no Redis replication, no container redundancy
-- **No logging**: Flask dev server logs are minimal; no structured logging
-
-### Recommended Next Steps
-1. **Priority 1:** Implement queue consumer with GPU load-based routing
-2. **Priority 2:** Add job status tracking and result retrieval
-3. **Priority 3:** Fix Nginx fallback to only trigger when ALL GPUs are down
-4. **Priority 4:** Add Docker health checks and proper dependency management
-5. **Priority 5:** Centralize GPU configuration in Redis or environment
-6. **Priority 6:** Add Prometheus metrics endpoint for observability
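
Recommendation R4 in the removed review describes adaptive backpressure only as depth bands. A rough sketch of those bands follows, assuming inclusive upper bounds (the review's ranges overlap at 30 and 40) and made-up delay curves. `backpressure`, the 202 status for accepted jobs, and the 300-second cap are illustrative assumptions, not part of the reviewed code.

```python
from typing import Optional, Tuple


def backpressure(depth: int) -> Tuple[int, Optional[int]]:
    """Map a queue depth to (http_status, retry_after_seconds)."""
    if depth <= 30:
        return 202, None                        # normal operation
    if depth <= 40:
        return 202, depth - 30                  # accept, hint a growing delay (1..10s)
    if depth <= 50:
        return 503, min(2 ** (depth - 40), 300) # reject, exponential hint, capped
    return 503, None                            # breaker open: no retry hint


if __name__ == "__main__":
    for depth in (10, 35, 45, 60):
        print(depth, backpressure(depth))
```

Returning the pair rather than a Flask response keeps the policy separate from the web layer, so the thresholds can be tuned and tested on their own.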
@@ -1,5 +0,0 @@
-FROM python:3.11-slim
-WORKDIR /app
-COPY dashboard/harness-dashboard.py .
-EXPOSE 3001
-CMD ["python3", "harness-dashboard.py"]
@@ -1,14 +0,0 @@
-FROM python:3.11-slim
-
-RUN pip install requests
-
-COPY gpu-dashboard/ /app/
-WORKDIR /app
-
-RUN mkdir -p /app/public && \
-    cp gpu.html /app/public/ && \
-    touch /app/public/gpu_metrics.json
-
-EXPOSE 8092
-
-CMD ["sh", "-c", "python3 gpu_collector.py & python3 -m http.server 8092 --directory /app/public & wait"]
@@ -1,63 +1,75 @@
-# Syslog Harness
+# syslog-harness — Inference API Harness

-Operational orchestration layer for Syslog's internal AI agents.
+CT 116 Docker stack for routing local GPU models through a unified OpenAI-compatible API.

 ## Architecture

 ```
-┌─────────────┐     ┌──────────────┐     ┌─────────────┐
-│  Agent      │────>│  Nginx       │────>│  GPU Pool   │
-│  (Hermes)   │     │  Router      │     │  (MoE/Dense)│
-└─────────────┘     └──────────────┘     └─────────────┘
-                         │
-                         ├──> :8091 Queue Service (Docker)
-                         │
-                         └──> :3001 Dashboard (Docker)
+nginx :80 → router :9000 → GPU backends
+                ├─ qwen3.6-35B-A3B (MoE) @ 192.168.68.15:8080  [2 slots]
+                ├─ qwen3.6-27B-code (Dense) @ 192.168.68.8:8080  [2 slots]
+                └─ qwen3.5-9b-vlm (VLM) @ 192.168.68.110:8080    [2 slots]
+                                     Total: 6 concurrent slots
+
+LiteLLM :8081 (fallback) | Dashboard :3000 | Redis :6379 (local)
 ```

-## Components
-
-| Service | Port | Container | Purpose |
-|---|---|---|---|
-| Nginx Router | 8080 | Host | Routes requests to GPU backends |
-| Queue Service | 8091 | `syslog-queue` | Enqueues requests when GPUs are down |
-| Dashboard | 3001 | `syslog-dashboard` | Observability UI + API |
-
-## GPU Routing
-
-| Header `X-Syslog-Model` | Backend | Model |
-|---|---|---|
-| (none) / `standard` | amdpve (.15) | qwen3.6-35B-A3B (MoE) |
-| `heavy` / `qwen3.5-27B` | llmgpu (.8) | qwen3.5-27B (Dense) |
-| `light` / `gemma-4` | ocu_llm (.110) | gemma-4-E4B (Light) |
-
-## Quick Start
+## Deploy

 ```bash
-# Build & start
-docker compose build
+cd /opt/inference-harness
 docker compose up -d
-
-# Verify
-curl http://localhost:8091/health
-curl http://localhost:3001/api/status
 ```

-## Dashboard
+## Endpoints

-- **UI:** `http://<host>:8080/dashboard/harness.html`
-- **API:** `http://<host>:8080/dashboard/api/status`
+| URL | Purpose |
+|-----|---------|
+| `/v1/chat/completions` | Inference API (OpenAI-compatible) — **API key required** |
+| `/v1/models` | Available models |
+| `/` | Dashboard (GPU health, routing, agents, timeseries) |

-## Circuit Breaker
+## Authentication

-- Rate limit: 10 req/s per IP
-- Burst: 20 requests
-- Excess returns 503
-- Queue fallback on GPU 502/503
+**All `/v1/chat/completions` requests require a valid API key** via `Authorization: Bearer <key>`. Missing or invalid keys return **401 Unauthorized**.

-## Production Migration
+## Agent API Keys

-See [MIGRATION_PLAN.md](./MIGRATION_PLAN.md)
+| Agent | Key |
+|-------|-----|
+| Abiba | `sk-syslog-abiba` |
+| Mumuni | `sk-syslog-mumuni` |
+| Tanko | `sk-syslog-tanko` |
+| Koby | `sk-syslog-koby` |
+| Kagenz0 | `sk-syslog-kagenz0` |
+| Koonimo | `sk-syslog-koonimo` |

---
-*Built for Syslog Solution LLC — Quality over speed.*
+## Routing Tiers
+
+| Tier | Trigger | Priority |
+|------|---------|----------|
+| Lightweight | No system prompt, ≤1 turn, ≤100 words | VLM → MoE → Dense |
+| Simple Conv | ≤1000 tokens, ≤4 turns | VLM → MoE → Dense |
+| Heavy | >4000 tokens OR >8 turns | Dense → MoE → VLM |
+| Default | Everything else | MoE → VLM → Dense |
+
+## Queue
+
+When all GPUs are saturated, requests enter a polling queue (500ms intervals) instead of returning 503 immediately. Timeout: 30s (configurable via `QUEUE_TIMEOUT` env or `X-Queue-Timeout` header).
+
+## Models
+
+| GPU | Model | VRAM | Slots |
+|-----|-------|------|-------|
+| Strix Halo | qwen3.6-35B-A3B (MoE) | 65GB | 2 |
+| RTX 3090 | qwen3.6-27B-code (Dense) | 24GB | 2 |
+| RTX 5070 | qwen3.5-9b-vlm (VLM) | 12GB | 2 |
+
+## Maintenance
+
+Automated cron job runs daily at 3:00 AM UTC (`/opt/inference-harness/maintenance.sh`):
+- Cleans Redis timeseries keys >60 days
+- Prunes Docker build cache >7 days
+- Logs container health and Redis memory
+
+Logs: `/var/log/harness-maintenance.log`
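
The Routing Tiers table in the new README could be read as the classifier sketched below. This is a guess at the logic, not the router's code: `RequestShape`, `classify`, and the order in which the rules are checked are assumptions, and the priorities are copied from the README table as it stands in this diff (the later commit message says the heavy tier was changed to try MoE first).

```python
from dataclasses import dataclass


@dataclass
class RequestShape:
    """Hypothetical summary of a request; field names are illustrative."""
    tokens: int
    turns: int
    words: int
    has_system_prompt: bool


# Backend priority per tier, copied from the README table in this diff
PRIORITY = {
    "lightweight": ["VLM", "MoE", "Dense"],
    "simple": ["VLM", "MoE", "Dense"],
    "heavy": ["Dense", "MoE", "VLM"],
    "default": ["MoE", "VLM", "Dense"],
}


def classify(req: RequestShape) -> str:
    """Pick a tier; checking heavy first is an assumption about rule order."""
    if req.tokens > 4000 or req.turns > 8:
        return "heavy"
    if not req.has_system_prompt and req.turns <= 1 and req.words <= 100:
        return "lightweight"
    if req.tokens <= 1000 and req.turns <= 4:
        return "simple"
    return "default"
```

A request with 2,000 tokens and 5 turns matches none of the three explicit rules and falls through to `default`, which is the gap the "Everything else" row covers.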
@@ -1,8 +1,7 @@
-FROM python:3.13-slim
-
-COPY harness-dashboard.py /app/harness-dashboard.py
+FROM python:3.12-slim
 WORKDIR /app
-
-EXPOSE 3001
-
-CMD ["python3", "harness-dashboard.py"]
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY dashboard.py .
+EXPOSE 3000
+CMD ["python", "dashboard.py"]
@@ -1,5 +0,0 @@
-FROM python:3.11-slim
-WORKDIR /app
-COPY harness-dashboard.py .
-EXPOSE 3001
-CMD ["python3", "harness-dashboard.py"]
@@ -0,0 +1,232 @@
+"""SyslogAI Harness Dashboard — Modern Design."""
+import os, json, time, queue, threading
+import requests
+from flask import Flask, request, render_template_string, Response, stream_with_context
+
+ROUTER_METRICS = os.environ.get("ROUTER_METRICS_URL", "http://router:9000/metrics")
+app = Flask(__name__)
+sse_subscribers = []; sse_lock = threading.Lock()
+
+def fetch_state():
+    try:
+        r = requests.get(ROUTER_METRICS, timeout=5)
+        if r.status_code == 200: return r.json()
+    except Exception: pass
+    return {"gpus":[],"route_counts":{},"agent_counts":{},"recent":[],"timestamp":time.time()}
+
+def broadcast_loop():
+    while True:
+        time.sleep(3)
+        data = fetch_state(); payload = json.dumps(data)
+        with sse_lock:
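+            dead = []
+            for q in sse_subscribers:
+                # Queue.put() always returns None, so `not q.put(...)` flagged every subscriber as dead
+                try: q.put_nowait(payload)
+                except queue.Full: dead.append(q)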
+            for q in dead: sse_subscribers.remove(q)
+threading.Thread(target=broadcast_loop, daemon=True).start()
+
+DASHBOARD_HTML = r"""<!DOCTYPE html>
+<html lang="en" data-bs-theme="dark">
+<head>
+<meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>SyslogAI Harness</title>
+<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet">
+<style>
+body { background: #0b0f17; color: #bcc3cd; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif; padding: 20px 24px; }
+.card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; height: 100%; }
+.stat-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 18px 20px; text-align: center; }
+.stat-value { font-size: 28px; font-weight: 700; line-height: 1.1; }
+.stat-label { font-size: 11px; text-transform: uppercase; letter-spacing: 0.6px; color: #64748b; margin-top: 4px; }
+.gpu-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 16px 18px; height: 100%; }
+.gpu-card .title { font-size: 13px; font-weight: 600; color: #e2e8f0; margin-bottom: 12px; display: flex; align-items: center; gap: 8px; }
+.gpu-card .status-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
+.gpu-card .row-metric { display: flex; justify-content: space-between; font-size: 12px; padding: 2px 0; }
+.gpu-card .row-metric .lbl { color: #64748b; }
+.gpu-card .row-metric .val { color: #e2e8f0; font-variant-numeric: tabular-nums; }
+.gpu-card .slot-bar { display: flex; gap: 3px; margin-top: 8px; }
+.gpu-card .slot-bar .s { flex: 1; height: 5px; border-radius: 2px; background: #1e293b; }
+.gpu-card .slot-bar .s.active { background: #38bdf8; }
+.chart-card { background: #111827; border: 1px solid #1e293b; border-radius: 10px; padding: 16px 18px; height: 100%; display: flex; flex-direction: column; }
+.chart-card .title { font-size: 13px; font-weight: 600; color: #e2e8f0; margin-bottom: 12px; }
+.bar-row { margin-bottom: 8px; }
+.bar-label { display: flex; justify-content: space-between; font-size: 11px; margin-bottom: 3px; color: #64748b; }
+.bar-label .name { color: #cbd5e1; }
+.bar-track { height: 5px; background: #1e293b; border-radius: 3px; overflow: hidden; }
+.bar-fill { height: 100%; border-radius: 3px; transition: width 0.6s ease; }
+.table-custom { font-size: 11px; margin: 0; }
+.table-custom th { color: #64748b; font-weight: 500; font-size: 10px; text-transform: uppercase; border-color: #1e293b; padding: 8px 10px; }
+.table-custom td { color: #94a3b8; border-color: rgba(30,41,59,0.5); padding: 6px 10px; }
+.agent-badge { font-size: 10px; padding: 2px 7px; border-radius: 8px; font-weight: 600; }
+.btn-sm-period { font-size: 10px; padding: 3px 10px; border-radius: 6px; border: 1px solid #1e293b; color: #64748b; background: transparent; cursor: pointer; }
+.btn-sm-period.active { background: #1d4ed8; color: #fff; border-color: #1d4ed8; }
+.ring-label { font-size: 22px; font-weight: 700; }
+.ring-sublabel { font-size: 10px; color: #64748b; }
+</style>
+</head>
+<body>
+
+<!-- HEADER -->
+<div class="d-flex justify-content-between align-items-center mb-4">
+  <div>
+    <h5 class="mb-0 text-white fw-bold">&#x26A1; SyslogAI Harness</h5>
+    <div class="small text-secondary" id="live-indicator">
+      <span class="status-dot" id="live-dot" style="width:6px;height:6px;border-radius:50%;display:inline-block;background:#22c55e;animation:pulse 2s infinite"></span>
+      <span id="connection-status">live</span> &middot; <span id="update-time"></span>
+    </div>
+  </div>
+  <div class="d-flex gap-2">
+    <div class="stat-card" style="min-width:100px"><div class="stat-value text-info" id="kpi-total">0</div><div class="stat-label">Requests</div></div>
+    <div class="stat-card" style="min-width:100px"><div class="stat-value text-warning" id="kpi-active">0</div><div class="stat-label">Active</div></div>
+    <div class="stat-card" style="min-width:100px"><div class="stat-value" style="color:#a78bfa" id="kpi-agents">0</div><div class="stat-label">Agents</div></div>
+  </div>
+</div>
+
+<div class="row g-3 align-items-stretch">
+  <!-- ROW 1: Usage Chart (8) + GPU Metrics (4) -->
+  <div class="col-md-8"><div class="chart-card"><div class="title d-flex justify-content-between align-items-center">
+    <span>Usage Over Time</span>
+    <div class="d-flex gap-1">
+      <button class="btn-sm-period active" onclick="switchPeriod('day')">24h</button>
+      <button class="btn-sm-period" onclick="switchPeriod('week')">7d</button>
+      <button class="btn-sm-period" onclick="switchPeriod('month')">30d</button>
+    </div>
+  </div><div id="timeseries-chart" style="height:150px"></div><div id="timeseries-legend" class="d-flex justify-content-center gap-3 mt-2 flex-wrap small"></div></div></div>
+  <div class="col-md-4"><div class="chart-card"><div class="title">GPU Metrics</div><div id="gpu-metrics-card"></div></div></div>
+
+  <!-- ROW 2: 3 GPU Cards -->
+  <div class="col-md-4"><div class="gpu-card" id="gpu-moe"><div class="text-secondary small">Loading...</div></div></div>
+  <div class="col-md-4"><div class="gpu-card" id="gpu-dense"><div class="text-secondary small">Loading...</div></div></div>
+  <div class="col-md-4"><div class="gpu-card" id="gpu-light"><div class="text-secondary small">Loading...</div></div></div>
+
+  <!-- ROW 3: Queue + Model + Agent -->
+  <div class="col-md-4"><div class="chart-card"><div class="title">Queue Status</div><div class="text-center" id="queue-viz"></div></div></div>
+  <div class="col-md-4"><div class="chart-card"><div class="title">Model Distribution</div><div id="route-bars"></div></div></div>
+  <div class="col-md-4"><div class="chart-card"><div class="title">Agent Activity</div><div id="agent-bars"></div></div></div>
+
+  <!-- ROW 4: Live Stream -->
+  <div class="col-12"><div class="chart-card"><div class="title">Live Stream</div>
+    <div class="table-responsive"><table class="table table-custom mb-0">
+      <thead><tr><th>Time</th><th>Agent</th><th>Model</th><th>Reason</th><th>Tier</th></tr></thead>
+      <tbody id="route-tbody"></tbody>
+    </table></div>
+  </div></div>
+</div>
+
+<script>
+var MC={'qwen3.5-9b-vlm':'#22c55e','qwen3.6-27B-code':'#f59e0b','qwen3.6-35B-A3B':'#a78bfa'};
+var ML={'qwen3.5-9b-vlm':'Qwen3.5 9B VLM','qwen3.6-27B-code':'Qwen Code','qwen3.6-35B-A3B':'Qwen MoE'};
+var GL={'qwen3.6-35B-A3B':'MoE - Strix Halo','qwen3.6-27B-code':'Dense - RTX 3090','qwen3.5-9b-vlm':'VLM - RTX 5070'};
+function $(id){return document.getElementById(id);}
+
+function render(data){
+if(!data||!data.gpus)return;
+var t=Object.values(data.route_counts||{}).reduce((a,b)=>a+b,0);
+var ta=0,tm=0;data.gpus.forEach(function(g){ta+=(g.active_requests||0);tm+=(g.max_concurrent||1)});
+$('kpi-total').textContent=t;$('kpi-active').textContent=ta+'/'+tm;$('kpi-agents').textContent=Object.keys(data.agent_counts||{}).length;
+$('update-time').textContent=new Date().toLocaleTimeString();
+var ids={'qwen3.6-35B-A3B':'gpu-moe','qwen3.6-27B-code':'gpu-dense','qwen3.5-9b-vlm':'gpu-light'};
+data.gpus.forEach(function(g){
+var el=$(ids[g.id]);if(!el)return;
+var a=g.active_requests||0,mx=g.max_concurrent||1;
+var sc=g.status==='healthy'?'#22c55e':g.status==='saturated'?'#f59e0b':'#ef4444';
+var ss=g.status==='healthy'?'Online':g.status==='saturated'?'Busy':'Offline';
+var slots='';for(var i=0;i<mx;i++)slots+='<span class=\"s'+(i<a?' active':'')+'\"></span>';
+var h='<div class=\"title\"><span class=\"status-dot\" style=\"background:'+sc+'\"></span>'+GL[g.id]+'<span class=\"ms-auto small\" style=\"color:'+sc+'\">'+ss+'</span></div>';
+h+='<div class=\"row-metric\"><span class=\"lbl\">VRAM</span><span class=\"val\">'+g.vram_used_mb+' / '+g.vram_total_mb+' MB</span></div>';
+h+='<div class=\"row-metric\"><span class=\"lbl\">Utilization</span><span class=\"val\">'+g.gpu_util_pct+'%</span></div>';
+h+='<div class=\"row-metric\"><span class=\"lbl\">Temperature</span><span class=\"val\" style=\"color:'+(g.temp_c>85?'#ef4444':g.temp_c>70?'#f59e0b':'#22c55e')+'\">'+g.temp_c+'C</span></div>';
+if(g.power_w)h+='<div class=\"row-metric\"><span class=\"lbl\">Power</span><span class=\"val\">'+g.power_w+'W'+(g.power_limit_w?'/'+g.power_limit_w+'W':'')+'</span></div>';
+h+='<div class=\"row-metric\"><span class=\"lbl\">Slots</span><span class=\"val\" style=\"color:'+(a>=mx?'#ef4444':'#e2e8f0')+'\">'+a+' / '+mx+'</span></div>';
+h+='<div class=\"slot-bar\">'+slots+'</div>';el.innerHTML=h;
+});
+renderQueue(data);renderGPUMetrics(data);
+var rc=data.route_counts||{},mr=Math.max(1,...Object.values(rc));
+$('route-bars').innerHTML=Object.entries(rc).length?Object.entries(rc).sort((a,b)=>b[1]-a[1]).map(function(e){var m=e[0],c=e[1];return'<div class=\"bar-row\"><div class=\"bar-label\"><span class=\"name\">'+(ML[m]||m)+'</span><span>'+c+' ('+(t?Math.round(c/t*100):0)+'%)</span></div><div class=\"bar-track\"><div class=\"bar-fill\" style=\"width:'+(c/mr*100)+'%;background:'+(MC[m]||'#38bdf8')+'\"></div></div></div>';}).join(''):'<div class=\"text-secondary small\">-</div>';
+var ac=data.agent_counts||{},ma=Math.max(1,...Object.values(ac));
+$('agent-bars').innerHTML=Object.entries(ac).length?Object.entries(ac).sort((a,b)=>b[1]-a[1]).map(function(e){return'<div class=\"bar-row\"><div class=\"bar-label\"><span class=\"name\">'+e[0]+'</span><span>'+e[1]+'</span></div><div class=\"bar-track\"><div class=\"bar-fill\" style=\"width:'+(e[1]/ma*100)+'%;background:#38bdf8\"></div></div></div>';}).join(''):'<div class=\"text-secondary small\">-</div>';
+var recent=data.recent||[];
+$('route-tbody').innerHTML=recent.length?recent.slice(0,20).map(function(r){var d=new Date(r.ts*1000),ag=r.agent||'?';return'<tr><td class=\"text-secondary\">'+d.toLocaleTimeString()+'</td><td><span class=\"agent-badge\" style=\"background:rgba(56,189,248,0.12);color:#38bdf8\">'+ag+'</span></td><td>'+(ML[r.model]||r.model)+'</td><td class=\"text-secondary\">'+(r.reason||'')+'</td><td class=\"text-uppercase\" style=\"font-size:10px;color:'+(r.tier==='enterprise'?'#a78bfa':'#64748b')+'\">'+(r.tier||'')+'</td></tr>';}).join(''):'<tr><td colspan=\"5\" class=\"text-secondary\">Waiting...</td></tr>';
+}
+
+function renderQueue(data){
+var el=$('queue-viz');if(!el)return;
+var ta=0,tm=0;data.gpus.forEach(function(g){ta+=(g.active_requests||0);tm+=(g.max_concurrent||1)});
+var pct=tm>0?Math.round(ta/tm*100):0,st=pct>=100?'SATURATED':pct>=50?'BUSY':'IDLE';
+var sc=pct>=100?'#ef4444':pct>=50?'#f59e0b':'#22c55e';
+var circ=188.5,dash=(pct/100)*circ;
+var h='<div class=\"d-inline-block position-relative mb-2\"><svg width=\"72\" height=\"72\"><circle cx=\"36\" cy=\"36\" r=\"30\" fill=\"none\" stroke=\"#1e293b\" stroke-width=\"6\"/><circle cx=\"36\" cy=\"36\" r=\"30\" fill=\"none\" stroke=\"'+sc+'\" stroke-width=\"6\" stroke-dasharray=\"'+dash+' '+(circ-dash)+'\" stroke-linecap=\"round\" transform=\"rotate(-90 36 36)\"/></svg><div style=\"position:absolute;top:50%;left:50%;transform:translate(-50%,-50%);text-align:center\"><div class=\"ring-label\" style=\"color:'+sc+'\">'+ta+'</div><div class=\"ring-sublabel\">/ '+tm+' slots</div></div></div>';
+h+='<div class=\"fw-bold mb-2 small\" style=\"color:'+sc+'\">'+st+'</div>';
+var lb={'qwen3.6-35B-A3B':'MoE','qwen3.6-27B-code':'Dense','qwen3.5-9b-vlm':'VLM'};
+data.gpus.forEach(function(g){var a=g.active_requests||0,mx=g.max_concurrent||1,gp=mx>0?Math.round(a/mx*100):0;h+='<div class=\"d-flex align-items-center gap-2 mb-1 justify-content-center\"><span class=\"small\" style=\"min-width:32px;text-align:right;font-size:10px\">'+(lb[g.id]||g.id)+'</span><div style=\"flex:1;max-width:70px;height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+gp+'%;background:'+sc+';border-radius:2px\"></div></div><span class=\"small\" style=\"min-width:22px;font-size:10px\">'+a+'/'+mx+'</span></div>'});
+el.innerHTML=h;
+}
+
+function renderGPUMetrics(data){
+var el=$('gpu-metrics-card');if(!el)return;
+var lb={'qwen3.6-35B-A3B':'MoE','qwen3.6-27B-code':'Dense','qwen3.5-9b-vlm':'VLM'};
+var h='';data.gpus.forEach(function(g){
+var nm=lb[g.id]||g.id,tp=g.temp_c||0,ut=g.gpu_util_pct||0,pw=g.power_w||0,pl=g.power_limit_w||0;
+var tc=tp>85?'#ef4444':tp>70?'#f59e0b':'#22c55e',uc=ut>90?'#ef4444':ut>70?'#f59e0b':'#22c55e';
+h+='<div class=\"mb-3\"><div class=\"fw-bold small text-white-50 mb-1\">'+nm+'</div>';
+h+='<div class=\"d-flex align-items-center gap-2 mb-1\"><span class=\"small text-secondary\" style=\"min-width:30px\">T</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+Math.min(tp,100)+'%;background:'+tc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+tc+';min-width:30px;text-align:right\">'+tp+'C</span></div>';
+h+='<div class=\"d-flex align-items-center gap-2 mb-1\"><span class=\"small text-secondary\" style=\"min-width:30px\">U</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+ut+'%;background:'+uc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+uc+';min-width:30px;text-align:right\">'+ut+'%</span></div>';
+if(pw>0){var pp=pl>0?Math.round(pw/pl*100):0,pc=pp>90?'#ef4444':pp>70?'#f59e0b':'#22c55e';h+='<div class=\"d-flex align-items-center gap-2\"><span class=\"small text-secondary\" style=\"min-width:30px\">P</span><div class=\"flex-grow-1\" style=\"height:3px;background:#1e293b;border-radius:2px;overflow:hidden\"><div style=\"height:100%;width:'+pp+'%;background:'+pc+';border-radius:2px\"></div></div><span class=\"small\" style=\"color:'+pc+';min-width:30px;text-align:right\">'+pw+'W</span></div>';}
+h+='</div>';});
+el.innerHTML=h;
+}
+
+var cp='day';
+function switchPeriod(p){cp=p;document.querySelectorAll('.btn-sm-period').forEach(function(b){b.classList.remove('active')});event.target.classList.add('active');loadTS();}
+function loadTS(){fetch('/api/timeseries?period='+cp).then(function(r){return r.json()}).then(renderTS).catch(function(){})}
+function renderTS(d){
+var models=d.models||{},labels=d.labels||[];
+if(!labels.length)return;
+var cn=$('timeseries-chart'),lg=$('timeseries-legend'),mn=Object.keys(models);
+if(!mn.length){cn.innerHTML='<div class=\"text-secondary small text-center py-4\">-</div>';return;}
+var mv=1;for(var m in models)for(var i=0;i<models[m].length;i++)if(models[m][i]>mv)mv=models[m][i];mv=Math.ceil(mv*1.15)||1;
+var W=labels.length>1?100/(labels.length-1):100,H=130;
+var paths='';for(var mi=0;mi<mn.length;mi++){var m=mn[mi],vals=models[m]||[],pd='';for(var i=0;i<vals.length;i++){var x=i*W,y=H-(vals[i]/mv)*H;pd+=(i===0?'M':'L')+x.toFixed(1)+','+y.toFixed(1)+' ';}paths+='<path d=\"'+pd+'\" fill=\"none\" stroke=\"'+(MC[m]||'#38bdf8')+'\" stroke-width=\"2\" stroke-linecap=\"round\" opacity=\"0.8\"/>';}
+var grid='';for(var g=0;g<=4;g++){var y=(g/4)*H;grid+='<line x1=\"0\" y1=\"'+y.toFixed(1)+'\" x2=\"100\" y2=\"'+y.toFixed(1)+'\" stroke=\"#1e293b\" stroke-width=\"1\"/>';}
+cn.innerHTML='<svg viewBox=\"0 0 100 '+(H+16)+'\" style=\"width:100%;height:'+(H+20)+'px;display:block\" preserveAspectRatio=\"none\">'+grid+paths+'</svg>';
+lg.innerHTML=mn.map(function(m){return'<span class=\"d-flex align-items-center gap-1\"><svg width=\"14\" height=\"8\"><line x1=\"0\" y1=\"4\" x2=\"14\" y2=\"4\" stroke=\"'+(MC[m]||'#38bdf8')+'\" stroke-width=\"2\"/></svg>'+(ML[m]||m)+'</span>';}).join('');
+}
+function poll(){fetch('/api/state').then(function(r){return r.json()}).then(function(data){render(data);$('connection-status').textContent='live';}).catch(function(){$('connection-status').textContent='reconnecting';});}
+poll();setInterval(poll,3000);loadTS();
+</script>
+</body>
+</html>"""
+
+@app.route("/")
+def dashboard(): return render_template_string(DASHBOARD_HTML)
+
+@app.route("/api/state")
+def api_state(): return fetch_state()
+
+@app.route("/api/timeseries")
+def api_timeseries():
+    period = request.args.get("period", "day")
+    try:
+        r = requests.get("http://router:9000/metrics/timeseries", params={"period": period}, timeout=5)
+        if r.status_code == 200: return r.json()
+    except Exception: pass
+    return {"models": {}, "labels": []}
+
+@app.route("/api/stream")
+def api_stream():
+    def ev():
+        q = queue.Queue()
+        with sse_lock: sse_subscribers.append(q)
+        try:
+            yield "data: "+json.dumps(fetch_state())+"\n\n"
+            while True:
+                try: msg = q.get(timeout=3); yield "data: "+msg+"\n\n"
+                except queue.Empty: yield "data: "+json.dumps(fetch_state())+"\n\n"
+        except GeneratorExit: pass
+        finally:
+            with sse_lock:
+                if q in sse_subscribers: sse_subscribers.remove(q)
+    return Response(stream_with_context(ev()), mimetype="text/event-stream", headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
+
+@app.route("/health")
+def health(): return {"status":"healthy","service":"harness-dashboard"}
+
+if __name__ == "__main__":
+    app.run(host="0.0.0.0", port=3000, debug=False)
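The `/api/stream` handler above drains one `queue.Queue` per connected client, but the code that fills those queues is not part of this hunk. The following is a minimal sketch of what a producer could look like, assuming `sse_subscribers` is a plain list guarded by `sse_lock` and that messages are pre-serialized JSON strings (both inferred from `api_stream()`). The function name `broadcast` is hypothetical and not taken from the repository.

```python
import json
import queue
import threading

# Assumed to mirror the module-level names used by api_stream().
sse_lock = threading.Lock()
sse_subscribers = []  # one queue.Queue per connected SSE client


def broadcast(state):
    """Serialize state once and enqueue it for every subscriber.

    Returns the number of subscribers that accepted the message.
    """
    msg = json.dumps(state)
    delivered = 0
    with sse_lock:
        # Iterate over a copy so the list can change without breaking the loop.
        for q in list(sse_subscribers):
            try:
                q.put_nowait(msg)
                delivered += 1
            except queue.Full:
                # Only reachable if a subscriber uses a bounded queue;
                # a slow client then misses this update instead of blocking.
                pass
    return delivered
```

With the unbounded `queue.Queue()` that `api_stream()` creates, `queue.Full` is never raised, so a stalled client would accumulate messages until it disconnects. Giving the queue a `maxsize` is one way to cap that, at the cost of dropped updates.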
@@ -1,133 +0,0 @@
-#!/usr/bin/env python3
-"""Syslog Harness Dashboard — Simple HTTP server exposing GPU health + metrics."""
-
-import json
-import os
-import time
-import urllib.request
-from http.server import HTTPServer, SimpleHTTPRequestHandler
-from datetime import datetime
-
-GPUS = {
-    "amdpve": {"endpoint": os.getenv("AMDVE_EP", "192.168.68.15:8080"), "model": "qwen3.6-35B-A3B (MoE)", "vram": "65GB"},
-    "llmgpu": {"endpoint": os.getenv("LLMGPU_EP", "192.168.68.8:8080"), "model": "qwen3.5-27B (Dense)", "vram": "24GB"},
-    "ocu_llm": {"endpoint": os.getenv("OCU_LLM_EP", "192.168.68.110:8080"), "model": "gemma-4-E4B (Light)", "vram": "12GB"},
-}
-
-
-def check_gpu(name, info):
-    try:
-        start = time.time()
-        # Use simple HTTP GET to check if the GPU endpoint is alive
-        resp = urllib.request.urlopen(f"http://{info['endpoint']}/", timeout=3)
-        latency = (time.time() - start) * 1000
-        return {
-            "status": "up",
-            "latency_ms": round(latency, 1),
-            "model": info["model"],
-            "vram": info["vram"],
-        }
-    except Exception as e:
-        return {"status": "down", "error": str(e)[:50], "model": info["model"], "vram": info["vram"]}
-
-
-def get_queue_status():
-    try:
-        req = urllib.request.Request("http://queue-service:8091/status")
-        resp = urllib.request.urlopen(req, timeout=2)
-        return json.loads(resp.read())
-    except Exception:
-        return {"queue_depth": -1, "circuit_breaker": "unknown", "gpu_health": {}}
-
-
-DASHBOARD_HTML = """
-<!DOCTYPE html>
-<html><head><meta charset="utf-8"><title>🦅 Syslog Harness</title>
-<style>
-  body { background: #1a1a2e; color: #e0e0e0; font-family: monospace; margin: 0; padding: 20px; }
-  .card { background: #16213e; border-radius: 8px; padding: 16px; margin: 10px 0; border-left: 4px solid #0f3460; }
-  .up { border-left-color: #00d26a; } .down { border-left-color: #ff4757; }
-  .warn { border-left-color: #ffa502; }
-  h1 { color: #00d26a; font-size: 24px; } h2 { color: #0f3460; font-size: 16px; }
-  .metric { display: inline-block; margin: 4px 12px; }
-  .value { font-weight: bold; color: #00d26a; }
-  #refresh { position: fixed; top: 10px; right: 10px; background: #0f3460; color: white;
-             border: none; padding: 8px 16px; border-radius: 4px; cursor: pointer; }
-  table { width: 100%; border-collapse: collapse; margin: 10px 0; }
-  th, td { text-align: left; padding: 8px; border-bottom: 1px solid #0f3460; }
-  th { color: #00d26a; }
-</style></head><body>
-<button id="refresh" onclick="location.reload()">↻ Refresh</button>
-<h1>🦅 Syslog Harness Dashboard</h1>
-<h2>Updated: <span id="ts"></span></h2>
-
-<div class="card" id="queue-card">
-  <h2>Queue & Circuit Breaker</h2>
-  <div class="metric">Depth: <span class="value" id="depth">--</span></div>
-  <div class="metric">Circuit: <span class="value" id="circuit">--</span></div>
-  <div class="metric">Threshold: <span class="value" id="threshold">--</span></div>
-</div>
-
-<div class="card">
-  <h2>GPU Endpoints</h2>
-  <table><tr><th>GPU</th><th>Model</th><th>VRAM</th><th>Status</th><th>Latency</th></tr>
-  <tbody id="gpu-table"></tbody></table>
-</div>
-
-<script>
-  document.getElementById('ts').textContent = new Date().toISOString();
-  fetch('/api/status').then(r => r.json()).then(data => {
-    document.getElementById('depth').textContent = data.queue_depth;
-    document.getElementById('circuit').textContent = data.circuit_breaker;
-    document.getElementById('threshold').textContent = 'warn:' + data.thresholds.warn + ' / open:' + data.thresholds.open;
-    const card = document.getElementById('queue-card');
-    if (data.circuit_breaker === 'open') card.className = 'card warn';
-    else if (data.circuit_breaker === 'warn') card.className = 'card warn';
-    else card.className = 'card up';
-    let html = '';
-    for (const [name, gpu] of Object.entries(data.gpu_health)) {
-      const status = gpu.status === 'up' ? '✅' : '❌';
-      const latency = gpu.status === 'up' ? gpu.latency_ms + 'ms' : gpu.error;
-      const rowClass = gpu.status === 'up' ? '' : 'down';
-      html += `<tr class="${rowClass}"><td>${name}</td><td>${gpu.model}</td><td>${gpu.vram}</td><td>${status}</td><td>${latency}</td></tr>`;
-    }
-    document.getElementById('gpu-table').innerHTML = html;
-  });
-  setInterval(() => location.reload(), 10000);
-</script></body></html>
-"""
-
-
-class Handler(SimpleHTTPRequestHandler):
-    def do_GET(self):
-        if self.path == "/" or self.path == "/harness.html":
-            self.send_response(200)
-            self.send_header("Content-Type", "text/html; charset=utf-8")
-            self.end_headers()
-            self.wfile.write(DASHBOARD_HTML.encode())
-        elif self.path == "/api/status":
-            status = get_queue_status()
-            enriched = {
-                "queue_depth": status.get("queue_depth", -1),
-                "circuit_breaker": status.get("circuit_breaker", "unknown"),
-                "thresholds": status.get("thresholds", {"warn": 30, "open": 50}),
-                "gpu_health": {},
-            }
-            for name, info in GPUS.items():
-                enriched["gpu_health"][name] = check_gpu(name, info)
-            self.send_response(200)
-            self.send_header("Content-Type", "application/json")
-            self.end_headers()
-            self.wfile.write(json.dumps(enriched).encode())
-        else:
-            self.send_response(404)
-            self.end_headers()
-
-    def log_message(self, format, *args):
-        pass  # Suppress request logs
-
-
-if __name__ == "__main__":
-    server = HTTPServer(("0.0.0.0", 3001), Handler)
-    print("Dashboard running on :3001/harness.html")
-    server.serve_forever()
@@ -0,0 +1,2 @@
+flask==3.1.*
+requests==2.32.*
@@ -1,54 +1,97 @@
-version: "3.8"
+version: '3.8'

 services:
  redis:
    image: redis:7-alpine
-    restart: always
-    networks:
-      - gpu-router-net
+    container_name: harness-redis
+    restart: unless-stopped
+    ports:
+      - "127.0.0.1:6379:6379"
    volumes:
      - redis-data:/data
+    command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
+    healthcheck:
+      test: ["CMD", "redis-cli", "ping"]
+      interval: 10s
+      timeout: 3s
+      retries: 5

-  queue-service:
-    build:
-      context: .
-      dockerfile: Dockerfile.queue
-    restart: always
-    networks:
-      - gpu-router-net
+  router:
+    build: ./router
+    container_name: harness-router
+    restart: unless-stopped
    ports:
-      - "8091:8091"
-    depends_on:
-      - redis
+      - "9000:9000"
    environment:
-      - REDIS_HOST=redis
-      - REDIS_PORT=6379
+      - REDIS_URL=redis://redis:6379
+      - GPU_MOE_URL=http://192.168.68.15:8080/v1
+      - GPU_DENSE_URL=http://192.168.68.8:8080/v1
+      - GPU_LIGHT_URL=http://192.168.68.110:8080/v1
+    healthcheck:
+      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9000/health')"]
+      interval: 15s
+      timeout: 5s
+      retries: 3
+    depends_on:
+      redis:
+        condition: service_healthy
+
+  litellm:
+    image: ghcr.io/berriai/litellm:main-stable
+    command: ["--config", "/app/config.yaml", "--port", "4000"]
+    container_name: harness-litellm
+    restart: unless-stopped
+    ports:
+      - "8081:4000"
+    volumes:
+      - ./litellm_config.yaml:/app/config.yaml
+    environment:
+      - LITELLM_MASTER_KEY=sk-syslog-local-master-key
+    healthcheck:
+      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:4000/health/liveliness')"]
+      interval: 15s
+      timeout: 5s
+      retries: 3
+    depends_on:
+      redis:
+        condition: service_healthy
+
+  nginx:
+    image: nginx:alpine
+    container_name: harness-nginx
+    restart: unless-stopped
+    ports:
+      - "80:80"
+    volumes:
+      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://127.0.0.1/health"]
+      interval: 15s
+      timeout: 5s
+      retries: 3
+    depends_on:
+      - litellm
+      - dashboard

  dashboard:
-    build:
-      context: .
-      dockerfile: Dockerfile.dashboard
-    restart: always
-    networks:
-      - gpu-router-net
+    build: ./dashboard
+    container_name: harness-dashboard
+    restart: unless-stopped
    ports:
-      - "3001:3001"
+      - "3000:3000"
+    environment:
+      - REDIS_URL=redis://redis:6379
+      - GPU_SIDECARS=192.168.68.15:8090,192.168.68.8:8090,192.168.68.110:8090
+    healthcheck:
+      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:3000/health')"]
+      interval: 15s
+      timeout: 5s
+      retries: 3
    depends_on:
      - redis

-  gpu-dashboard:
-    build:
-      context: .
-      dockerfile: Dockerfile.gpu
-    restart: always
-    networks:
-      - gpu-router-net
-    ports:
-      - "8092:8092"
-
-networks:
-  gpu-router-net:
-    driver: bridge
-
 volumes:
  redis-data:
+
+# LiteLLM command override to load config
+# (appended to fix config loading issue)
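The Python-based healthchecks in the compose file above rely on `urllib.request.urlopen` raising an exception when the connection fails or the final response is a 4xx/5xx, which makes the `python3 -c` one-liner exit non-zero. The sketch below spells that behavior out as a standalone function. The name `probe` and the explicit 0/1 return values are illustrative assumptions, not code from the repository.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return 0 if the URL answers successfully, 1 otherwise.

    urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses,
    URLError/OSError for connection problems, and ValueError for a
    malformed URL, so any of those counts as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return 0
    except (urllib.error.URLError, OSError, ValueError):
        return 1


# Hypothetical usage inside a container entrypoint or script:
#   import sys; sys.exit(probe("http://localhost:3000/health"))
```

A probe is only meaningful if it targets the port the service listens on inside its own container, not the port published on the host.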
@@ -1,115 +0,0 @@
-#!/usr/bin/env python3
-"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
-
-import urllib.request, json, time, os
-
-HOSTS = [
-    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
-    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
-    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
-]
-OUTPUT = "/root/hermes-workspace/public/gpu_metrics.json"
-INTERVAL = 10
-STALE_THRESHOLD = 30  # seconds before marking stale
-DEAD_THRESHOLD = 60   # seconds before marking unreachable
-
-last_seen = {}
-
-
-def fetch_json(url, timeout=3):
-    try:
-        req = urllib.request.Request(url)
-        resp = urllib.request.urlopen(req, timeout=timeout)
-        return json.loads(resp.read().decode())
-    except Exception:
-        return None
-
-
-def collect_one(h):
-    """Collect GPU hardware + llama.cpp inference state for one host."""
-    name = h["name"]
-    host = h["host"]
-    now = time.time()
-
-    # GPU hardware from sidecar
-    gpu = fetch_json(f"http://{host}:8090/")
-
-    # llama.cpp inference state
-    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
-    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
-
-    # Determine inference state
-    model_name = None
-    inference_state = "unknown"
-    if llamacpp_models:
-        models = llamacpp_models.get("data", [])
-        if models:
-            model_name = models[0].get("id")
-
-    if llamacpp_health:
-        status = llamacpp_health.get("status", "")
-        if status == "ok":
-            idle = llamacpp_health.get("slots_idle", 0)
-            processing = llamacpp_health.get("slots_processing", 0)
-            if idle and not processing:
-                inference_state = "idle"
-            elif processing:
-                inference_state = "busy"
-            else:
-                inference_state = "idle"
-
-    # Check for /slots endpoint for is_processing detail
-    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
-    if slots and isinstance(slots, list) and len(slots) > 0:
-        if slots[0].get("is_processing"):
-            inference_state = "busy"
-
-    result = {
-        "host": name,
-        "gpu_name": h["gpu"],
-        "inference": {
-            "state": inference_state,
-            "model": model_name,
-        },
-        "hardware": gpu if gpu else None,
-        "online": gpu is not None,
-        "timestamp": now,
-    }
-
-    if gpu is not None:
-        last_seen[name] = now
-
-    if name in last_seen:
-        age = now - last_seen[name]
-        if age > DEAD_THRESHOLD:
-            result["online"] = False
-        elif age > STALE_THRESHOLD:
-            result["stale"] = True
-
-    return result
-
-
-def main():
-    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
-    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
-
-    while True:
-        start = time.time()
-        results = [collect_one(h) for h in HOSTS]
-
-        payload = {
-            "updated": start,
-            "gpus": results,
-        }
-
-        with open(OUTPUT + ".tmp", "w") as f:
-            json.dump(payload, f)
-        os.rename(OUTPUT + ".tmp", OUTPUT)
-
-        elapsed = time.time() - start
-        sleep_for = max(0, INTERVAL - elapsed)
-        time.sleep(sleep_for)
-
-
-if __name__ == "__main__":
-    main()
@@ -1,183 +0,0 @@
-<!DOCTYPE html>
-<html lang="en">
-<head>
-<meta charset="UTF-8">
-<meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>GPU Monitor</title>
-<style>
-* { margin: 0; padding: 0; box-sizing: border-box; }
-body { background: #0d1117; color: #c9d1d9; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; padding: 20px; }
-h1 { font-size: 1.3em; margin-bottom: 4px; }
-.topbar { display: flex; justify-content: space-between; align-items: center; margin-bottom: 20px; padding-bottom: 12px; border-bottom: 1px solid #21262d; }
-.topbar .status { font-size: 0.85em; color: #8b949e; }
-.topbar .status .dot { display: inline-block; width: 8px; height: 8px; border-radius: 50%; margin-right: 6px; }
-.dot.green { background: #3fb950; }
-.dot.yellow { background: #d2991d; }
-.dot.red { background: #f85149; }
-.cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(320px, 1fr)); gap: 16px; }
-.card { background: #161b22; border: 1px solid #21262d; border-radius: 8px; padding: 16px; }
-.card.stale { opacity: 0.5; }
-.card.dead { opacity: 0.3; border-color: #f85149; }
-.card-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
-.card-header .name { font-weight: 600; font-size: 1.05em; }
-.card-header .host { font-size: 0.8em; color: #8b949e; }
-.card-header .state { font-size: 0.75em; padding: 2px 8px; border-radius: 10px; font-weight: 600; }
-.state.idle { background: #1b3826; color: #3fb950; }
-.state.busy { background: #3d1f1a; color: #f85149; }
-.state.unknown { background: #21262d; color: #8b949e; }
-.metric { margin-bottom: 10px; }
-.metric-label { display: flex; justify-content: space-between; font-size: 0.82em; color: #8b949e; margin-bottom: 2px; }
-.metric-label .val { color: #c9d1d9; font-weight: 500; }
-.bar { height: 6px; border-radius: 3px; background: #21262d; overflow: hidden; }
-.bar-fill { height: 100%; border-radius: 3px; transition: width 0.5s ease; }
-.bar-fill.temp-cool { background: #3fb950; }
-.bar-fill.temp-warm { background: #d2991d; }
-.bar-fill.temp-hot { background: #f85149; }
-.bar-fill.util { background: #58a6ff; }
-.bar-fill.vram { background: #bc8cff; }
-.bar-fill.power { background: #f0883e; }
-.model-line { font-size: 0.82em; color: #8b949e; margin-top: 8px; padding-top: 8px; border-top: 1px solid #21262d; }
-.model-line span { color: #c9d1d9; }
-.error { color: #f85149; font-size: 0.85em; }
-</style>
-</head>
-<body>
-<div class="topbar">
-  <div>
-    <h1><a href="/" style="color:#58a6ff;text-decoration:none;">← Workspace</a> · GPU Monitor</h1>
-    <span class="status"><span class="dot green" id="status-dot"></span><span id="status-text">Loading...</span></span>
-  </div>
-  <div class="status" id="age">—</div>
-</div>
-<div class="cards" id="cards"></div>
-
-<script>
-const INTERVAL = 5000;
-let lastFetchTime = null;
-
-function updateClock() {
-  const el = document.getElementById('age');
-  if (!lastFetchTime) { el.textContent = '—'; return; }
-  const age = Math.round((Date.now() / 1000) - lastFetchTime);
-  el.textContent = age <= 60 ? `updated ${age}s ago` : `stale ${age}s ago`;
-}
-setInterval(updateClock, 1000);
-
-const TEMP_WARN = 70, TEMP_HOT = 82;
-const VRAM_WARN = 80, VRAM_HOT = 92;
-
-function tempClass(c) { return c > TEMP_HOT ? 'temp-hot' : c > TEMP_WARN ? 'temp-warm' : 'temp-cool'; }
-function vramClass(pct) { return pct > VRAM_HOT ? 'temp-hot' : pct > VRAM_WARN ? 'temp-warm' : 'temp-cool'; }
-function pct(val, max) { return max ? Math.round(val / max * 100) : 0; }
-function mbToGB(mb) { return mb ? (mb / 1024).toFixed(1) : '—'; }
-
-function renderCard(g) {
-  const hw = g.hardware || {};
-  const inf = g.inference || {};
-  const online = g.online !== false;
-  const stale = g.stale === true;
-  let cardClass = '';
-  if (!online) cardClass = 'dead';
-  else if (stale) cardClass = 'stale';
-
-  let stateClass = inf.state || 'unknown';
-  let stateLabel = inf.state ? inf.state.toUpperCase() : 'UNKNOWN';
-  if (!online) { stateClass = 'unknown'; stateLabel = 'OFFLINE'; }
-
-  const temp = hw.temp_c;
-  const util = hw.gpu_util_pct;
-  const vramUsed = hw.vram_used_mb;
-  const vramTotal = hw.vram_total_mb;
-  const power = hw.power_w;
-  const powerLimit = hw.power_limit_w;
-  const fan = hw.fan_pct;
-  const vendor = hw.vendor;
-
-  let html = `<div class="card ${cardClass}">`;
-  html += `<div class="card-header">`;
-  html += `<div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div>`;
-  html += `<div class="state ${stateClass}">${stateLabel}</div>`;
-  html += `</div>`;
-
-  if (!online) {
-    html += `<div class="error">Unreachable</div>`;
-  } else if (hw.error) {
-    html += `<div class="error">${hw.error}</div>`;
-  } else {
-    // Temperature
-    if (temp != null) {
-      html += `<div class="metric"><div class="metric-label"><span>Temperature</span><span class="val">${temp}°C</span></div>`;
-      html += `<div class="bar"><div class="bar-fill ${tempClass(temp)}" style="width:${Math.min(temp,100)}%"></div></div></div>`;
-    }
-    // Utilization
-    if (util != null) {
-      html += `<div class="metric"><div class="metric-label"><span>GPU Utilization</span><span class="val">${util}%</span></div>`;
-      html += `<div class="bar"><div class="bar-fill util" style="width:${util}%"></div></div></div>`;
-    }
-    // VRAM
-    if (vramUsed != null && vramTotal != null) {
-      const vramPct = pct(vramUsed, vramTotal);
-      html += `<div class="metric"><div class="metric-label"><span>VRAM</span><span class="val">${mbToGB(vramUsed)} / ${mbToGB(vramTotal)} GB</span></div>`;
-      html += `<div class="bar"><div class="bar-fill ${vramClass(vramPct)}" style="width:${vramPct}%"></div></div></div>`;
-    }
-    // Power
-    if (power != null) {
-      const powerPct = powerLimit ? pct(power, powerLimit) : 0;
-      const powerText = powerLimit ? `${power}W / ${powerLimit}W` : `${power}W`;
-      html += `<div class="metric"><div class="metric-label"><span>Power</span><span class="val">${powerText}</span></div>`;
-      if (powerLimit) html += `<div class="bar"><div class="bar-fill power" style="width:${powerPct}%"></div></div>`;
-      html += `</div>`;
-    }
-    // Fan (NVIDIA only)
-    if (fan != null) {
-      html += `<div class="metric"><div class="metric-label"><span>Fan Speed</span><span class="val">${fan}%</span></div>`;
-      html += `<div class="bar"><div class="bar-fill util" style="width:${fan}%"></div></div></div>`;
-    }
-  }
-
-  // Model loaded
-  html += `<div class="model-line">Model: <span>${inf.model || '—'}</span></div>`;
-  html += `</div>`;
-  return html;
-}
-
-async function refresh() {
-  try {
-    const resp = await fetch('gpu_metrics.json?t=' + Date.now());
-    const data = await resp.json();
-    const gpus = data.gpus || [];
-
-    document.getElementById('cards').innerHTML = gpus.map(renderCard).join('');
-
-    // Top bar status
-    const online = gpus.filter(g => g.online !== false).length;
-    const total = gpus.length;
-    const dot = document.getElementById('status-dot');
-    const txt = document.getElementById('status-text');
-    if (online === total) { dot.className = 'dot green'; txt.textContent = `${online}/${total} online`; }
-    else if (online > 0) { dot.className = 'dot yellow'; txt.textContent = `${online}/${total} online`; }
-    else { dot.className = 'dot red'; txt.textContent = 'All offline'; }
-
-    // Capture fetch time for live clock
-    lastFetchTime = Date.now() / 1000;
-  } catch(e) {
-    document.getElementById('status-dot').className = 'dot red';
-    document.getElementById('status-text').textContent = 'Collector down';
-  }
-}
-
-// Render skeletons instantly
-const SKELETONS = [
-  {host:'amdpve', gpu_name:'AMD Strix Halo', hardware:{}, inference:{}, online:true},
-  {host:'llmgpu', gpu_name:'RTX 3090', hardware:{}, inference:{}, online:true},
-  {host:'ocu-llm', gpu_name:'RTX 5070', hardware:{}, inference:{}, online:true},
-];
-document.getElementById('cards').innerHTML = SKELETONS.map(g =>
-  `<div class="card"><div class="card-header"><div><div class="name">${g.gpu_name}</div><div class="host">${g.host}</div></div><div class="state unknown">···</div></div><div class="model-line" style="color:#8b949e;">Loading metrics...</div></div>`
-).join('');
-
-refresh();
-setInterval(refresh, INTERVAL);
-</script>
-</body>
-</html>
@@ -1,115 +0,0 @@
-#!/usr/bin/env python3
-"""GPU metrics collector — polls sidecars + llama.cpp every 10s, writes to Workspace."""
-
-import urllib.request, json, time, os
-
-HOSTS = [
-    {"name": "amdpve", "host": "192.168.68.15", "gpu": "AMD Strix Halo", "llama_port": 8080},
-    {"name": "llmgpu", "host": "192.168.68.8", "gpu": "RTX 3090", "llama_port": 8080},
-    {"name": "ocu-llm", "host": "192.168.68.110", "gpu": "RTX 5070", "llama_port": 8080},
-]
-OUTPUT = "/app/public/gpu_metrics.json"
-INTERVAL = 10
-STALE_THRESHOLD = 30  # seconds before marking stale
-DEAD_THRESHOLD = 60   # seconds before marking unreachable
-
-last_seen = {}
-
-
-def fetch_json(url, timeout=3):
-    try:
-        req = urllib.request.Request(url)
-        resp = urllib.request.urlopen(req, timeout=timeout)
-        return json.loads(resp.read().decode())
-    except Exception:
-        return None
-
-
-def collect_one(h):
-    """Collect GPU hardware + llama.cpp inference state for one host."""
-    name = h["name"]
-    host = h["host"]
-    now = time.time()
-
-    # GPU hardware from sidecar
-    gpu = fetch_json(f"http://{host}:8090/")
-
-    # llama.cpp inference state
-    llamacpp_health = fetch_json(f"http://{host}:{h['llama_port']}/health")
-    llamacpp_models = fetch_json(f"http://{host}:{h['llama_port']}/v1/models")
-
-    # Determine inference state
-    model_name = None
-    inference_state = "unknown"
-    if llamacpp_models:
-        models = llamacpp_models.get("data", [])
-        if models:
-            model_name = models[0].get("id")
-
-    if llamacpp_health:
-        status = llamacpp_health.get("status", "")
-        if status == "ok":
-            idle = llamacpp_health.get("slots_idle", 0)
-            processing = llamacpp_health.get("slots_processing", 0)
-            if idle and not processing:
-                inference_state = "idle"
-            elif processing:
-                inference_state = "busy"
-            else:
-                inference_state = "idle"
-
-    # Check for /slots endpoint for is_processing detail
-    slots = fetch_json(f"http://{host}:{h['llama_port']}/slots")
-    if slots and isinstance(slots, list) and len(slots) > 0:
-        if slots[0].get("is_processing"):
-            inference_state = "busy"
-
-    result = {
-        "host": name,
-        "gpu_name": h["gpu"],
-        "inference": {
-            "state": inference_state,
-            "model": model_name,
-        },
-        "hardware": gpu if gpu else None,
-        "online": gpu is not None,
-        "timestamp": now,
-    }
-
-    if gpu is not None:
-        last_seen[name] = now
-
-    if name in last_seen:
-        age = now - last_seen[name]
-        if age > DEAD_THRESHOLD:
-            result["online"] = False
-        elif age > STALE_THRESHOLD:
-            result["stale"] = True
-
-    return result
-
-
-def main():
-    print(f"GPU collector starting, output={OUTPUT}, interval={INTERVAL}s")
-    os.makedirs(os.path.dirname(OUTPUT), exist_ok=True)
-
-    while True:
-        start = time.time()
-        results = [collect_one(h) for h in HOSTS]
-
-        payload = {
-            "updated": start,
-            "gpus": results,
-        }
-
-        with open(OUTPUT + ".tmp", "w") as f:
-            json.dump(payload, f)
-        os.rename(OUTPUT + ".tmp", OUTPUT)
-
-        elapsed = time.time() - start
-        sleep_for = max(0, INTERVAL - elapsed)
-        time.sleep(sleep_for)
-
-
-if __name__ == "__main__":
-    main()
@@ -1,14 +0,0 @@
-#!/bin/bash
-set -e
-
-# Start collector as background process
-cd /root/hermes-workspace/public
-python3 /app/collector.py &
-COLLECTOR_PID=$!
-
-echo "Collector started (PID $COLLECTOR_PID)"
-echo "Serving dashboard on :8092"
-
-# Serve the public directory (contains gpu.html + gpu_metrics.json)
-cd /root/hermes-workspace/public
-python3 -m http.server 8092
@@ -13,7 +13,7 @@ upstream llmgpu_pool {
 }

 upstream ocu_llm_pool {
-    ## RTX 5070 — gemma-4 (Dense 4B) — Ultra-light tasks
+    ## RTX 5070 — qwen3.5-9b-vlm (VLM) — Vision + light tasks
    server 192.168.68.110:8080;
 }

@@ -24,12 +24,7 @@ upstream queue_service {

 upstream dashboard_service {
    ## Harness dashboard (Docker container)
-    server syslog-harness-dashboard-1:3001;
-}
-
-upstream gpu_dashboard_pool {
-    ## GPU dashboard (Docker container)
-    server syslog-harness-gpu-dashboard-1:8092;
+    server dashboard:3001;
 }

 ## ------------------------------------------------------------------
@@ -41,7 +36,7 @@ map $http_x_syslog_model $gpu_upstream {
    "heavy"          llmgpu_pool;
    "qwen3.5-27B"    llmgpu_pool;
    "light"          ocu_llm_pool;
-    "gemma-4"        ocu_llm_pool;
+    "qwen3.5-9b-vlm"        ocu_llm_pool;
 }

 ## Rate limit zone — 10 req/s per IP, burst of 20
@@ -61,17 +56,6 @@ server {
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    }

-    ## ------------------------------------------------------------------
-    ## GPU Dashboard — observability UI (MUST be before / catch-all)
-    ## ------------------------------------------------------------------
-    location /gpu {
-        proxy_pass http://gpu_dashboard_pool/;
-        proxy_set_header Host              $host;
-        proxy_set_header X-Real-IP         $remote_addr;
-        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
-        proxy_set_header X-Forwarded-Proto $scheme;
-    }
-
    ## ------------------------------------------------------------------
    ## Main location — proxy to selected upstream
    ## ------------------------------------------------------------------
@@ -0,0 +1,106 @@
+## Syslog GPU Router — Nginx Configuration
+## Routes incoming agent requests to the appropriate GPU backend
+## based on the X-Syslog-Model header.
+
+upstream amdpve_pool {
+    ## Strix Halo 395 — qwen3.6-35B-A3B (MoE) — Default workhorse
+    server 192.168.68.15:8080;
+}
+
+upstream llmgpu_pool {
+    ## RTX 3090 — qwen3.5-27B (Dense) — Heavy reasoning
+    server 192.168.68.8:8080;
+}
+
+upstream ocu_llm_pool {
+    ## RTX 5070 — qwen3.5-9b-vlm (VLM) — Vision + light tasks
+    server 192.168.68.110:8080;
+}
+
+upstream queue_service {
+    ## Agent queue with circuit breaker (Docker container)
+    server 127.0.0.1:8091;
+}
+
+upstream dashboard_service {
+    ## Harness dashboard (Docker container)
+    server 127.0.0.1:3001;
+}
+
+## ------------------------------------------------------------------
+## Mapping: X-Syslog-Model header → upstream backend
+## ------------------------------------------------------------------
+map $http_x_syslog_model $gpu_upstream {
+    default          amdpve_pool;   # missing header → default workhorse
+    "standard"       amdpve_pool;
+    "heavy"          llmgpu_pool;
+    "qwen3.5-27B"    llmgpu_pool;
+    "light"          ocu_llm_pool;
+    "qwen3.5-9b-vlm"        ocu_llm_pool;
+}
+
+## Rate limit zone (http context only, so declared outside server {}): 10 req/s per IP, burst of 20
+limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
+
+server {
+    listen 8080;
+    server_name _;
+
+    ## ------------------------------------------------------------------
+    ## Dashboard — observability UI (MUST be before / catch-all)
+    ## ------------------------------------------------------------------
+    location /dashboard {
+        proxy_pass http://dashboard_service/;
+        proxy_set_header Host              $host;
+        proxy_set_header X-Real-IP         $remote_addr;
+        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
+    }
+
+    ## ------------------------------------------------------------------
+    ## Main location — proxy to selected upstream
+    ## ------------------------------------------------------------------
+    location / {
+        limit_req zone=perip burst=20 nodelay;
+        limit_req_status 503;
+        proxy_pass http://$gpu_upstream;
+
+        ## Preserve original host and headers
+        proxy_set_header Host              $host;
+        proxy_set_header X-Real-IP         $remote_addr;
+        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+
+        ## Forward the model header explicitly so backends can log it (proxy_pass_header only affects response headers)
+        proxy_set_header X-Syslog-Model    $http_x_syslog_model;
+
+        ## Streaming support (SSE for LLM responses)
+        proxy_buffering off;
+        proxy_cache     off;
+        proxy_read_timeout  300s;
+        proxy_send_timeout  300s;
+
+        ## Basic failover — retry on error or timeout
+        proxy_next_upstream error timeout http_502 http_503;
+        proxy_next_upstream_tries 2;
+
+        ## Add a response header for observability
+        add_header X-Routed-To $gpu_upstream always;
+
+        ## Fallback to queue when all GPU upstreams are down
+        error_page 502 503 504 = @queue_fallback;
+    }
+
+    ## ------------------------------------------------------------------
+    ## Queue fallback — enqueue when GPUs are unavailable
+    ## ------------------------------------------------------------------
+    location @queue_fallback {
+        rewrite ^ /enqueue break;
+        proxy_pass http://queue_service;
+        proxy_set_header Host              $host;
+        proxy_set_header X-Real-IP         $remote_addr;
+        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_set_header Content-Type      $content_type;
+        proxy_pass_request_body            on;
+    }
+}
@@ -0,0 +1,25 @@
+model_list:
+  - model_name: qwen3.6-35B-A3B
+    litellm_params:
+      model: openai/qwen3.6-35B-A3B
+      api_base: http://192.168.68.15:8080/v1
+      api_key: "not-needed"
+
+  - model_name: qwen3.6-27B-code
+    litellm_params:
+      model: openai/qwen3.6-27B-code-text
+      api_base: http://192.168.68.8:8080/v1
+      api_key: "not-needed"
+
+  - model_name: qwen3.5-9b-vlm
+    litellm_params:
+      model: openai/qwen3.5-9b-vlm
+      api_base: http://192.168.68.110:8080/v1
+      api_key: "not-needed"
+
+general_settings:
+  master_key: sk-syslog-local-master-key
+
+litellm_settings:
+  drop_params: true
+  request_timeout: 120
@@ -0,0 +1,79 @@
+worker_processes auto;
+error_log /var/log/nginx/error.log warn;
+pid /var/run/nginx.pid;
+
+events { worker_connections 1024; }
+
+http {
+    include /etc/nginx/mime.types;
+    default_type application/octet-stream;
+
+    log_format main  launching rt=;
+    access_log /var/log/nginx/access.log main;
+    error_log /var/log/nginx/error.log;
+    sendfile on;
+    keepalive_timeout 65;
+
+    upstream router_api { server router:9000; }
+    upstream dashboard_ui { server dashboard:3000; }
+    upstream litellm_backend { server litellm:4000; }
+
+    server {
+        listen 80;
+
+        # Disable buffering for SSE streams
+        proxy_buffering off;
+
+        # API — through router
+        location /v1/ {
+            proxy_pass http://router_api;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header Authorization $http_authorization;
+            proxy_connect_timeout 10s;
+            proxy_read_timeout 600s;
+            proxy_buffering off;
+        }
+
+        # SSE streaming endpoint
+        location /stream {
+            proxy_pass http://router_api;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_set_header Connection "";
+            proxy_buffering off;
+            chunked_transfer_encoding off;
+        }
+
+        # Dashboard API proxy for SSE
+        location /api/ {
+            proxy_pass http://dashboard_ui;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_buffering off;
+        }
+
+        # LiteLLM debug
+        location /litellm/ {
+            rewrite ^/litellm/(.*) /$1 break;
+            proxy_pass http://litellm_backend;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_set_header Authorization $http_authorization;
+        }
+
+        # Dashboard
+        location / {
+            proxy_pass http://dashboard_ui;
+            proxy_http_version 1.1;
+            proxy_set_header Host $host;
+            proxy_buffering off;
+        }
+
+        location /health {
+            default_type application/json;
+            return 200 "{\"status\":\"healthy\"}";
+        }
+    }
+}
@@ -1,10 +0,0 @@
-FROM python:3.13-slim
-
-RUN pip install --no-cache-dir flask redis
-
-COPY queue-service.py /app/queue-service.py
-WORKDIR /app
-
-EXPOSE 8091
-
-CMD ["python3", "queue-service.py"]
@@ -0,0 +1,9 @@
+FROM python:3.12-slim
+
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY router.py .
+
+EXPOSE 9000
+CMD ["python", "router.py"]
@@ -0,0 +1,3 @@
+flask==3.1.*
+redis==5.2.*
+requests==2.32.*
@@ -0,0 +1,418 @@
+import os, json, time, logging, traceback, threading, queue
+import requests, redis
+from flask import Flask, request, jsonify, Response, stream_with_context
+
+REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379")
+GPU_MOE_URL = os.environ.get("GPU_MOE_URL", "http://192.168.68.15:8080/v1")
+GPU_DENSE_URL = os.environ.get("GPU_DENSE_URL", "http://192.168.68.8:8080/v1")
+GPU_LIGHT_URL = os.environ.get("GPU_LIGHT_URL", "http://192.168.68.110:8080/v1")
+
+GPU_SIDECARS = {
+    "qwen3.6-35B-A3B": "http://192.168.68.15:8090",
+    "qwen3.6-27B-code": "http://192.168.68.8:8090",
+    "qwen3.5-9b-vlm": "http://192.168.68.110:8090",
+}
+GPU_URLS = {
+    "qwen3.6-35B-A3B": GPU_MOE_URL,
+    "qwen3.6-27B-code": GPU_DENSE_URL,
+    "qwen3.5-9b-vlm": GPU_LIGHT_URL,
+}
+# Max concurrent requests per GPU (based on llama.cpp --parallel)
+GPU_MAX_CONCURRENT = {
+    "qwen3.6-35B-A3B": 2,   # 2 slots
+    "qwen3.6-27B-code": 2,  # 2 slots
+    "qwen3.5-9b-vlm": 2,       # 2 slots (12GB VRAM, 4GB headroom)
+}
+
+# Context window sizes (tokens) — used for compaction signals
+GPU_CONTEXT = {
+    "qwen3.6-35B-A3B": 131072,
+    "qwen3.6-27B-code": 98304,
+    "qwen3.5-9b-vlm": 131072,
+}
+
+TIER_MODELS = {
+    "starter": ["qwen3.5-9b-vlm"],
+    "professional": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "qwen3.5-9b-vlm"],
+    "enterprise": ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "qwen3.5-9b-vlm"],
+}
+API_KEYS = {
+    "sk-syslog-local-master-key": {"tier": "enterprise", "agent": "admin"},
+    "sk-syslog-abiba": {"tier": "enterprise", "agent": "Abiba"},
+    "sk-syslog-mumuni": {"tier": "enterprise", "agent": "Mumuni"},
+    "sk-syslog-tanko": {"tier": "enterprise", "agent": "Tanko"},
+    "sk-syslog-koby": {"tier": "enterprise", "agent": "Koby"},
+    "sk-syslog-kagenz0": {"tier": "enterprise", "agent": "Kagenz0"},
+    "sk-syslog-koonimo": {"tier": "enterprise", "agent": "Koonimo"},
+    "sk-starter-abc123": {"tier": "starter", "agent": "test-starter"},
+    "sk-professional-xyz789": {"tier": "professional", "agent": "test-pro"},
+}
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s [ROUTER] %(levelname)s %(message)s")
+log = logging.getLogger("router")
+try: r = redis.from_url(REDIS_URL, decode_responses=True); r.ping()
+except Exception: r = None
+
+
+def counter_audit_loop():
+    """Every 30s, check GPU slots and reset counters if all slots idle."""
+    while True:
+        time.sleep(30)
+        if not r: continue
+        for model, url in GPU_URLS.items():
+            try:
+                resp = requests.get(url.replace("/v1","") + "/slots",
+                    headers={"Authorization": "Bearer not-needed"}, timeout=5)
+                if resp.status_code == 200:
+                    slots = resp.json()
+                    all_idle = all(not s.get("is_processing", False) for s in slots)
+                    if all_idle:
+                        current = int(r.get("active:" + model) or 0)
+                        if current > 0:
+                            r.set("active:" + model, 0)
+                            log.info("AUDIT: Reset stuck counter for %s (was %d)", model, current)
+            except Exception:
+                pass
+
+threading.Thread(target=counter_audit_loop, daemon=True).start()
+
+app = Flask(__name__)
+sse_subscribers = []; sse_lock = threading.Lock()
+
+def gpu_active_count(model):
+    """Get number of in-flight requests for a GPU."""
+    if r:
+        return int(r.get("active:" + model) or 0)
+    return 0
+
+def gpu_incr(model):
+    if r: r.incr("active:" + model)
+
+def gpu_decr(model):
+    if r:
+        v = r.decr("active:" + model)
+        if v and int(v) < 0:
+            r.set("active:" + model, 0)  # never go negative
+
+def check_gpu_health(model):
+    url = GPU_SIDECARS.get(model)
+    if not url: return {"status": "unknown"}
+    try:
+        resp = requests.get(url, timeout=5)
+        if resp.status_code == 200:
+            d = resp.json()
+            pct = (d.get("vram_used_mb",0) / max(d.get("vram_total_mb",1), 1)) * 100
+            status = "healthy" if pct < 90 else "saturated"
+            # Also check if llama.cpp endpoint is actually responding
+            gpu_url = GPU_URLS.get(model, "")
+            try:
+                hr = requests.get(gpu_url.replace("/v1","") + "/health", headers={"Authorization": "Bearer not-needed"}, timeout=3)
+                if hr.status_code != 200:
+                    status = "down"
+            except Exception:
+                status = "down"
+            return {"status": status, "vram_used_mb": d.get("vram_used_mb"), "vram_total_mb": d.get("vram_total_mb"), "vram_pct": round(pct,1), "temp_c": d.get("temp_c"), "gpu_util_pct": d.get("gpu_util_pct"), "gpu_name": d.get("gpu_name"), "power_w": d.get("power_w"), "power_limit_w": d.get("power_limit_w")}
+    except Exception: pass
+    return {"status": "down"}
+
+def available_models(): return [m for m in GPU_URLS if check_gpu_health(m)["status"] in ("healthy","saturated")]
+
+def estimate_tokens(msgs):
+    """Estimate token count from messages. Uses JSON length / 3.5 (closer to real tokenizer ratios for dense text)."""
+    return int(len(json.dumps(msgs, default=str)) / 3.5)  # int, so the Redis session counter round-trips through int()
+
+def is_gpu_busy(model):
+    """Check if GPU is at or near max concurrent capacity."""
+    active = gpu_active_count(model)
+    max_c = GPU_MAX_CONCURRENT.get(model, 1)
+    return active >= max_c
+
+def select_best_gpu(candidates, reason):
+    """Pick the best GPU from candidates IN ORDER — first non-busy one wins."""
+    for m in candidates:
+        if not is_gpu_busy(m):
+            return {"model": m, "reason": reason}
+    # All busy — pick least loaded
+    best = None
+    best_load = 999
+    for m in candidates:
+        load = gpu_active_count(m)
+        if load < best_load:
+            best_load = load
+            best = m
+    if best:
+        return {"model": best, "reason": "load_balanced_" + reason}
+    return None
+
+def route(rd, tier):
+    msgs = rd.get("messages",[]); t = estimate_tokens(msgs)
+    sys = any(m.get("role")=="system" for m in msgs)
+    turns = len([m for m in msgs if m.get("role") in ("user","assistant")])
+    hints = rd.get("routing_hints",{})
+    allowed = TIER_MODELS.get(tier, ["qwen3.5-9b-vlm"])
+    avail = [m for m in available_models() if m in allowed]
+    if not avail: return {"model": allowed[0], "reason": "all_saturated", "saturated": True}
+    # Check if all available GPUs are at max capacity
+    if all(is_gpu_busy(m) for m in avail):
+        return {"model": avail[0], "reason": "all_saturated", "saturated": True}
+    
+    req = rd.get("model","auto")
+    if req != "auto":
+        target = req if req in avail else avail[0]
+        # If explicit model is busy, check if another can take it
+        if is_gpu_busy(target) and req in allowed:
+            alts = [m for m in avail if m != target and m in allowed]
+            if alts:
+                alt = select_best_gpu(alts, "explicit")
+                if alt: return alt
+        return {"model": target, "reason": "explicit"}
+    
+    if hints:
+        if hints.get("priority")=="speed" and "qwen3.5-9b-vlm" in avail:
+            return select_best_gpu(["qwen3.5-9b-vlm"], "hint_speed") or {"model":"qwen3.5-9b-vlm","reason":"hint_speed"}
+        if hints.get("priority")=="quality" and "qwen3.6-27B-code" in avail:
+            return select_best_gpu(["qwen3.6-27B-code"], "hint_quality") or {"model":"qwen3.6-27B-code","reason":"hint_quality"}
+    
+    first_msg = msgs[0].get("content","") if msgs else ""
+    words = len(first_msg.split()) if isinstance(first_msg, str) else 99
+    
+    # TIER 1: Lightweight — single-turn short queries → VLM first
+    if not sys and turns <= 1 and words <= 100 and "qwen3.5-9b-vlm" in avail:
+        if not is_gpu_busy("qwen3.5-9b-vlm"):
+            return {"model":"qwen3.5-9b-vlm","reason":"lightweight"}
+        # VLM busy: fall back to MoE, then Dense (same order as the list below)
+        fallback = [m for m in ["qwen3.6-35B-A3B","qwen3.6-27B-code"] if m in avail]
+        result = select_best_gpu(fallback, "lightweight_fallback")
+        if result: return result
+    
+    # TIER 2: Simple conversations — short context, any prompt → VLM preferred
+    if t <= 1000 and turns <= 4 and "qwen3.5-9b-vlm" in avail:
+        if not is_gpu_busy("qwen3.5-9b-vlm"):
+            return {"model":"qwen3.5-9b-vlm","reason":"simple_conv"}
+        # VLM busy — try Dense
+        if "qwen3.6-27B-code" in avail and not is_gpu_busy("qwen3.6-27B-code"):
+            return {"model":"qwen3.6-27B-code","reason":"simple_conv_fallback"}
+    
+    # TIER 3: Heavy reasoning — extremely large context or very long conversations
+    if t > 50000 or turns > 25:
+        # MoE first (131K context handles heavy sessions), then Dense (98K reasoning), then Light (131K fallback)
+        candidates = [m for m in ["qwen3.6-35B-A3B","qwen3.6-27B-code","qwen3.5-9b-vlm"] if m in avail]
+        result = select_best_gpu(candidates, "heavy_reasoning")
+        if result: return result
+    
+    # TIER 4: Default — MoE first, VLM helps, Dense last (slow)
+    if t <= 50000:
+        candidates = [m for m in ["qwen3.6-35B-A3B","qwen3.5-9b-vlm","qwen3.6-27B-code"] if m in avail]
+        result = select_best_gpu(candidates, "default")
+        if result: return result
+    
+    # Fallback — best available
+    if "qwen3.6-35B-A3B" in avail and not is_gpu_busy("qwen3.6-35B-A3B"):
+        return {"model":"qwen3.6-35B-A3B","reason":"default_moe"}
+    result = select_best_gpu([m for m in avail], "fallback")
+    if result: return result
+    return {"model":avail[0],"reason":"last_resort"}
+
+def clean_unicode(text):
+    if not isinstance(text, str): return text
+    text = text.replace(chr(0x2014), "-"); text = text.replace(chr(0x2013), "-")
+    text = text.replace(chr(0x2018), "'"); text = text.replace(chr(0x2019), "'")
+    text = text.replace(chr(0x201C), '"'); text = text.replace(chr(0x201D), '"')
+    text = text.replace(chr(0x2026), "..."); text = text.replace(chr(0x00A0), " ")
+    return text.encode("ascii", "ignore").decode("ascii")
+
+def clean_response(d):
+    if isinstance(d, dict): return {k: clean_response(v) for k,v in d.items()}
+    if isinstance(d, list): return [clean_response(v) for v in d]
+    if isinstance(d, str): return clean_unicode(d)
+    return d
+
+def get_metrics():
+    d = {"gpus":[],"route_counts":{},"agent_counts":{},"tier_counts":{},"recent":[],"timestamp":time.time(),"active_requests":{}}
+    for m in GPU_URLS:
+        h = check_gpu_health(m)
+        d["gpus"].append({"id":m,"gpu_name":h.get("gpu_name",m),"status":h.get("status"),"vram_used_mb":h.get("vram_used_mb"),"vram_total_mb":h.get("vram_total_mb"),"vram_pct":h.get("vram_pct"),"temp_c":h.get("temp_c"),"gpu_util_pct":h.get("gpu_util_pct"),"power_w":h.get("power_w"),"power_limit_w":h.get("power_limit_w"),"active_requests":gpu_active_count(m), "max_concurrent": GPU_MAX_CONCURRENT.get(m, 1)})
+        d["active_requests"][m] = gpu_active_count(m)
+    if r:
+        try:
+            for m in GPU_URLS: d["route_counts"][m] = int(r.get("routes:"+m) or 0)
+            for k,v in API_KEYS.items():
+                c = int(r.get("routes:agent:"+v["agent"]) or 0)
+                if c>0: d["agent_counts"][v["agent"]] = c
+            for t in TIER_MODELS: d["tier_counts"][t] = int(r.get("routes:tier:"+t) or 0)
+            raw = r.lrange("routes:recent",0,49)
+            d["recent"] = [json.loads(x) for x in raw] if raw else []
+        except Exception: pass
+    return d
+
+def bcast():
+    data = get_metrics(); payload = json.dumps(data)
+    with sse_lock:
+        dead = []
+        for q in sse_subscribers:
+            try: q.put(payload)
+            except Exception: dead.append(q)
+        for q in dead: sse_subscribers.remove(q)
+
+QUEUE_TIMEOUT = int(os.environ.get("QUEUE_TIMEOUT", "30"))  # max seconds to queue before 503
+
+@app.route("/v1/chat/completions", methods=["POST"])
+def chat():
+    try:
+        rd = request.get_json(force=True)
+        ak = request.headers.get("Authorization","").replace("Bearer ","")
+        if not ak or ak not in API_KEYS:
+            log.warning("AUTH_REJECTED: no/invalid API key from %s", request.remote_addr)
+            return jsonify({"error": "Unauthorized — valid API key required"}), 401
+        ki = API_KEYS[ak]
+        tier, agent = ki["tier"], ki["agent"]
+        
+        # Allow agent to override queue timeout via header
+        q_timeout = int(request.headers.get("X-Queue-Timeout", str(QUEUE_TIMEOUT)))
+        
+        # Cross-turn context tracking: accumulate tokens per session
+        session_id = request.headers.get("X-Session-Id", "")
+        session_tokens = 0
+        if session_id and r:
+            try:
+                prev = int(r.get("session:" + session_id) or 0)
+                current = estimate_tokens(rd.get("messages",[]))
+                session_tokens = max(prev, current)  # context only grows
+                r.set("session:" + session_id, session_tokens, ex=86400)  # TTL 24h
+            except Exception: pass
+        
+        d = route(rd, tier)
+        queue_start = time.time()
+        
+        # Queue loop: wait for a GPU slot instead of immediate 503
+        while d.get("saturated"):
+            elapsed = time.time() - queue_start
+            if elapsed > q_timeout:
+                resp = jsonify({"error": "All GPUs saturated", "queued_s": round(elapsed,1), "retry_after_s": 5})
+                resp.headers["Retry-After"] = "5"
+                log.warning("QUEUE_TIMEOUT: %s waited %.1fs, all GPUs saturated", agent, elapsed)
+                return resp, 503
+            time.sleep(0.5)  # poll every 500ms
+            d = route(rd, tier)
+        
+        waited = time.time() - queue_start
+        if waited > 0.5:
+            log.info("QUEUED: %s waited %.1fs before slot opened", agent, waited)
+        model, reason, url = d["model"], d["reason"], GPU_URLS[d["model"]]
+        is_stream = rd.get("stream", False)
+        
+        gpu_incr(model)
+        
+        log.info("ROUTE: %s -> %s (%s) stream=%s active=%d/%d", agent, model, reason, is_stream, gpu_active_count(model), GPU_MAX_CONCURRENT.get(model,1))
+        if r:
+            try:
+                r.incr("routes:"+model); r.incr("routes:tier:"+tier); r.incr("routes:agent:"+agent)
+                r.incr("ts:"+model+":"+time.strftime("%Y%m%d%H"))
+                r.lpush("routes:recent", json.dumps({"ts":time.time(),"model":model,"reason":reason,"tier":tier,"agent":agent}))
+                r.ltrim("routes:recent",0,999)
+            except Exception: pass
+        start = time.time()
+        resp = requests.post(url+"/chat/completions", json=rd,
+            headers={"Content-Type":"application/json","Authorization":"Bearer not-needed"}, timeout=300, stream=is_stream)
+        lat = int((time.time()-start)*1000)
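+        # A streamed body is still being generated after the headers arrive, so a
+        # successful stream keeps its slot until gen() finishes or the client disconnects.
+        if not (is_stream and resp.status_code == 200): gpu_decr(model)
+
+        if resp.status_code != 200: return jsonify({"error":"GPU error "+str(resp.status_code)}), 502
+        if is_stream:
+            def gen():
+                try:
+                    for raw in resp.iter_content(chunk_size=None, decode_unicode=True):
+                        if raw: yield clean_unicode(raw)
+                finally:
+                    gpu_decr(model)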
+            bcast()
+            ctx_remaining = GPU_CONTEXT.get(model, 65536) - max(session_tokens, estimate_tokens(rd.get("messages",[])))
+            ctx_pct = ctx_remaining / GPU_CONTEXT.get(model, 65536) * 100
+            ctx_warning = "compact_urgent" if ctx_pct < 5 else ("compact_recommended" if ctx_pct < 15 else ("compact_soon" if ctx_pct < 30 else "ok"))
+            sse_resp = Response(stream_with_context(gen()), mimetype="text/event-stream")
+            sse_resp.headers["X-Context-Remaining"] = str(max(0, ctx_remaining))
+            sse_resp.headers["X-Context-Warning"] = ctx_warning
+            sse_resp.headers["X-Context-Model"] = model
+            return sse_resp
+        data = clean_response(resp.json())
+        for c in data.get("choices",[]):
+            msg = c.get("message",{})
+            if not msg.get("content") and msg.get("reasoning_content"):
+                msg["content"] = msg["reasoning_content"]
+        ctx_remaining = GPU_CONTEXT.get(model, 65536) - max(session_tokens, estimate_tokens(rd.get("messages",[])))
+        ctx_pct = ctx_remaining / GPU_CONTEXT.get(model, 65536) * 100
+        ctx_warning = "compact_urgent" if ctx_pct < 5 else ("compact_recommended" if ctx_pct < 15 else ("compact_soon" if ctx_pct < 30 else "ok"))
+        data["routing"] = {"model":model,"reason":reason,"gpu":url,"tier":tier,"agent":agent,"latency_ms":lat,"active_gpu":gpu_active_count(model),"context_remaining": max(0, ctx_remaining),"context_pct": round(ctx_pct,1),"context_warning": ctx_warning}
+        resp = jsonify(data)
+        resp.headers["X-Context-Remaining"] = str(max(0, ctx_remaining))
+        resp.headers["X-Context-Warning"] = ctx_warning
+        resp.headers["X-Context-Model"] = model
+        bcast()
+        return resp
+    except requests.Timeout:
+        gpu_decr(model)
+        log.error("TIMEOUT: %s -> %s", agent, model)
+        return jsonify({"error":"timeout"}), 504
+    except Exception as e:
+        gpu_decr(model)
+        log.error("Error: %s\n%s", e, traceback.format_exc())
+        return jsonify({"error":str(e)}), 500
+
+@app.route("/v1/models")
+def models():
+    # Query each GPU's health once per model instead of twice.
+    data = []
+    for m in GPU_URLS:
+        h = check_gpu_health(m)
+        data.append({"id":m,"object":"model","owned_by":"syslog","status":h.get("status"),"gpu":h.get("gpu_name")})
+    return jsonify({"object":"list","data":data})
+
+@app.route("/health")
+def health():
+    gpus = {}
+    for m in GPU_URLS:
+        h = check_gpu_health(m)
+        h["active_requests"] = gpu_active_count(m)
+        h["max_concurrent"] = GPU_MAX_CONCURRENT.get(m, 1)
+        gpus[m] = h
+    return jsonify({"status":"healthy","redis":"connected" if r else "down","gpus":gpus,"available_models":available_models()})
+
+@app.route("/metrics")
+def metrics(): return jsonify(get_metrics())
+
+@app.route("/metrics/timeseries")
+def metrics_timeseries():
+    period = request.args.get("period", "day")
+    models_list = list(GPU_URLS.keys())
+    data = {"models": {}, "labels": []}
+    if period == "day":
+        buckets = [time.strftime("%Y%m%d%H", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
+        data["labels"] = [time.strftime("%H:00", time.gmtime(time.time()-h*3600)) for h in range(23,-1,-1)]
+    elif period == "week":
+        buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
+        data["labels"] = [time.strftime("%a", time.gmtime(time.time()-d*86400)) for d in range(6,-1,-1)]
+    else:
+        buckets = [time.strftime("%Y%m%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
+        data["labels"] = [time.strftime("%m/%d", time.gmtime(time.time()-d*86400)) for d in range(29,-1,-1)]
+    if r:
+        for model in models_list:
+            counts = []
+            for bucket in buckets:
+                total = 0
+                if period != "day":
+                    # Daily buckets (week, month, and the fallback branch for any other period): sum the 24 hourly keys.
+                    for hh in range(24): total += int(r.get("ts:"+model+":"+bucket+"{:02d}".format(hh)) or 0)
+                else: total = int(r.get("ts:"+model+":"+bucket) or 0)
+                counts.append(total)
+            data["models"][model] = counts
+    return jsonify(data)
+
+@app.route("/stream")
+def stream():
+    def ev():
+        q = queue.Queue()
+        with sse_lock: sse_subscribers.append(q)
+        try:
+            yield "data: "+json.dumps(get_metrics())+"\n\n"
+            while True:
+                try: yield "data: "+q.get(timeout=3)+"\n\n"
+                except queue.Empty: yield "data: "+json.dumps(get_metrics())+"\n\n"
+        except GeneratorExit: pass
+        finally:
+            with sse_lock:
+                if q in sse_subscribers: sse_subscribers.remove(q)
+    return Response(stream_with_context(ev()), mimetype="text/event-stream",
+                    headers={"Cache-Control":"no-cache","X-Accel-Buffering":"no","Access-Control-Allow-Origin":"*"})
+
+if __name__ == "__main__":
+    log.info("Router on :9000 (load-aware)")
+    app.run(host="0.0.0.0", port=9000, debug=False)
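The streaming and non-streaming branches of the handler above compute the same context-budget values and response headers twice. A minimal sketch of how that logic could be factored into one place is shown here. `context_budget` and `context_headers` are hypothetical names that are not part of this commit; the thresholds (5%, 15%, 30%) and the header names are copied from the diff.

```python
def context_budget(context_size: int, used_tokens: int) -> dict:
    """Return remaining context and a warning level for one request.

    Hypothetical helper (not in the commit); thresholds mirror the diff.
    """
    remaining = context_size - used_tokens
    pct = remaining / context_size * 100
    if pct < 5:
        warning = "compact_urgent"
    elif pct < 15:
        warning = "compact_recommended"
    elif pct < 30:
        warning = "compact_soon"
    else:
        warning = "ok"
    return {
        "context_remaining": max(0, remaining),  # clamped, as in the diff
        "context_pct": round(pct, 1),            # not clamped, as in the diff
        "context_warning": warning,
    }


def context_headers(budget: dict, model: str) -> dict:
    """Build the three X-Context-* response headers from a budget dict."""
    return {
        "X-Context-Remaining": str(budget["context_remaining"]),
        "X-Context-Warning": budget["context_warning"],
        "X-Context-Model": model,
    }
```

In the handler, `context_size` would presumably be `GPU_CONTEXT.get(model, 65536)` and `used_tokens` would be `max(session_tokens, estimate_tokens(rd.get("messages", [])))`, so both branches could call the helper once and reuse the result for the headers and the `routing` payload.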
Author	SHA1	Message	Date
root	5116e4b1a7	router: heavy tier Dense→MoE→Light + X-Context-Warning headers (compact_soon/compact_recommended/compact_urgent)	2026-05-22 09:48:00 +00:00
Abiba	e55bcef21a	router: 4 optimizations — saturated flag fix, heavy tier MoE-first, better token est, session tracking - Saturated flag now triggers on load saturation (was dead code) - Heavy tier routes MoE(131K) first instead of Dense(98K) - Token estimation uses JSON length/3.5 (was content/4) - Cross-turn session tracking via X-Session-Id + Redis TTL 24h	2026-05-21 20:47:48 +00:00
Abiba	32bd817e97	fix: heavy tier back to Dense→MoE→VLM (Dense now 98K)	2026-05-19 21:24:36 +00:00
Abiba	79965450bb	fix: Dense context 65K→98K, parallel restored to 2	2026-05-19 21:20:29 +00:00
Abiba	6c829abef5	fix: variable collision (r = Redis vs Response) in stream handler	2026-05-19 21:15:23 +00:00
Abiba	6efd5ff51c	feat: context-aware routing + compaction signals - Added GPU_CONTEXT map (MoE 131K, VLM 131K, Dense 65K) - Heavy tier now prefers MoE/VLM (131K) over Dense (65K) for large requests - Response headers: X-Context-Remaining, X-Context-Model - Routing data includes context_remaining field - Agents can use this to trigger compaction when nearing limits	2026-05-19 21:13:56 +00:00
Abiba	350a90b524	fix: sync tier 4 default threshold to 50000 tokens (was stale at 4000)	2026-05-19 21:11:34 +00:00
Abiba	3156c093d5	fix: heavy threshold → 50000 tokens, 25 turns (agent contexts are huge)	2026-05-19 21:08:18 +00:00
Abiba	3cbf38e3e2	fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns Agent conversations with system prompts easily exceed 4000 tokens, forcing everything to Dense. Now only truly heavy work triggers Dense. Most agent convos will route to MoE (default) instead.	2026-05-19 20:09:59 +00:00
Abiba	b67021ac69	docs: complete design documentation — auth, routing tiers, queue, models, maintenance	2026-05-19 19:17:52 +00:00
Abiba	46dda918de	security: reject requests without valid API key (401 instead of defaulting to starter)	2026-05-19 19:13:52 +00:00
Abiba	7a78c0f98d	fix: heavy tier — Dense first (best for reasoning), then MoE, then VLM	2026-05-19 18:20:20 +00:00
Abiba	15c474aea0	fix: select_best_gpu respects candidate order — first non-busy wins Previously it picked the least-loaded GPU globally, ignoring priority order. Now it tries candidates in order: MoE → VLM → Dense. Only falls back to least-loaded when ALL candidates are busy.	2026-05-19 18:18:00 +00:00
Abiba	bfc38f5436	fix: routing priority — MoE first, VLM second, Dense last (slow) All tiers now follow MoE → VLM → Dense priority order since Dense (RTX 3090) can be slow. VLM acts as overflow absorber.	2026-05-19 17:38:21 +00:00
Abiba	f519a3fa60	fix: routing — system prompts no longer force heavy tier System messages are common in agent conversations but don't indicate heavy workload. Now only token count (>4000) and turn count (>8) trigger heavy routing. Simple conversations with system prompts can now route to VLM.	2026-05-19 17:19:29 +00:00
Abiba	941e8db65e	feat: redesigned routing tiers — VLM handles more traffic New 4-tier routing: - TIER 1 (Lightweight): ≤100 words, single-turn → VLM first, fallback Dense - TIER 2 (Simple Conv): ≤1000 tokens, ≤4 turns → VLM preferred, fallback Dense - TIER 3 (Heavy): >4000 tokens, system prompts, >8 turns → Dense→MoE→VLM cascade - TIER 4 (Default): Medium tasks → Dense preferred, MoE default, VLM overflow VLM gets more utilization for simple conversations instead of defaulting everything to MoE.	2026-05-19 17:01:55 +00:00
Abiba	241de4f38c	revert: remove Ollama endpoints (llama.cpp uses OpenAI format, not Ollama)	2026-05-19 16:57:04 +00:00
Abiba	beb2d1790a	fix: add /v1/props and /v1/models/<id> Ollama-compatible endpoints Mumuni's Ollama client probes /v1/props for model discovery and /v1/models/<id> for per-model details. Previously both returned 404, causing client retries. Now returns proper model properties and details.	2026-05-19 16:08:24 +00:00
Abiba	f2f8e8c921	feat: add request queuing to router (replaces hard 503 on saturation) When all GPUs are saturated, requests now enter a queue loop (poll every 500ms) instead of immediately returning 503. Configurable via QUEUE_TIMEOUT env var (default 30s) or X-Queue-Timeout header per-request. This prevents agent failures from cluster saturation — agents wait for a slot instead of crashing on fallback.	2026-05-19 15:55:05 +00:00
Abiba	76ade81fda	docs: add Koonimo to agent API keys table	2026-05-19 15:48:39 +00:00
Abiba	9c31b5d622	May 19, 2026: Full harness update - Model migration: gemma-4-E4B → qwen3.5-9b-vlm - Dashboard reorder: Usage Over Time + GPU Metrics to top - Router counter leak fix (gpu_decr in except handler) - VLM slot upgrade 1→2 - Redis stale key cleanup - Automated maintenance cron job - LiteLLM config update - GPU router config update - README update	2026-05-19 15:03:34 +00:00
Abiba (pi)	4f032b035c	Mumuni review action items: health checks for all containers, version pinning, 503+Retry-After on all-GPU saturation	2026-05-17 09:05:27 +00:00
Abiba (pi)	8f3b0c6647	Router: health check verifies actual llama.cpp endpoint, gpu_decr negative guard, AMD sidecar fixed (sysfs fallback)	2026-05-17 01:52:28 +00:00
Abiba (pi)	808c9d3d13	Router: 300s timeout, gpu_decr bugfix. Dashboard: Bootstrap 5 modern redesign with KPI stats, equal-height cards, queue ring. Nginx: 600s timeout.	2026-05-16 22:12:21 +00:00
Abiba (pi)	9817fe2ef2	Dashboard: clean rebuild with Queue Status ring chart, GPU slot indicators, organized layout (GPU/Queue+Model+Agent/Usage/Live)	2026-05-16 21:05:19 +00:00
Abiba (pi)	654cdff718	Dashboard: GPU slot indicators show active/max concurrent requests. Koonimo API key added. Real-time queuing visibility.	2026-05-16 20:43:22 +00:00
Abiba (pi)	bf90e57c5f	Load-aware routing: tracks active GPU requests in Redis, distributes overflow when MoE saturated. 6 concurrent requests now spread across all 3 GPUs instead of queuing on one.	2026-05-16 20:23:32 +00:00
Abiba (pi)	2db2796e53	Dashboard: rename to SyslogAI Harness, GPU bar now shows utilization instead of VRAM	2026-05-16 19:26:46 +00:00
Abiba (pi)	ec0f9fac63	Fix: clean_unicode now uses chr()-based replacements + ASCII strip to prevent bash heredoc corruption. Emoji and all non-ASCII now fully stripped.	2026-05-16 19:12:58 +00:00
Abiba (pi)	3d42ea4767	Merge: add Abiba harness code — nginx, LiteLLM, router, dashboard, Redis	2026-05-16 18:53:31 +00:00
Abiba (pi)	7b6c6aabe1	Initial commit: CT 116 inference harness — nginx, LiteLLM, router, dashboard, Redis - Complexity-based routing (MoE default, Dense heavy, Gemma light) - Per-agent API keys with metrics tracking - Time-series usage graphs (24h/7d/30d) - Streaming support (SSE passthrough) - Unicode cleanup (ASCII-only output) - Vision support (gemma-4-E4B) - Tier enforcement (starter/professional/enterprise) - GPU health monitoring via sidecar polling - Unified dashboard with line graph	2026-05-16 18:51:50 +00:00
mumuni-bot	b65ea22765	Update Nginx Docker config	2026-05-15 21:35:13 +00:00
mumuni-bot	cf7f61650f	Add Dockerfile.dashboard	2026-05-15 21:34:52 +00:00
mumuni-bot	7d00bbec0e	Add Dockerfile.queue	2026-05-15 21:34:49 +00:00
mumuni-bot	37f7c95b05	Add env example	2026-05-15 21:07:34 +00:00
mumuni-bot	a28b3a557d	Add Nginx router config	2026-05-15 21:07:33 +00:00
mumuni-bot	c42f3a9979	Add migration plan	2026-05-15 21:07:32 +00:00
mumuni-bot	e1f12c3462	Add dashboard	2026-05-15 21:07:07 +00:00
mumuni-bot	b55b954967	Add queue service	2026-05-15 21:07:05 +00:00
mumuni-bot	c85aaa570b	Add docker-compose	2026-05-15 21:07:05 +00:00
mumuni-bot	43382dac5b	Initial commit: README	2026-05-15 21:07:03 +00:00
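Commit 15c474aea0 in the log above describes the selection rule for `select_best_gpu`: try the candidates in priority order, take the first one that is not busy, and fall back to the least-loaded candidate only when all of them are busy. The sketch below illustrates that rule under stated assumptions. It uses plain dictionaries in place of the Redis counters, treats "least-loaded" as the lowest active-request count, and its parameters and model names are illustrative rather than the repository's actual signature.

```python
from typing import Dict, List


def select_best_gpu(candidates: List[str],
                    active: Dict[str, int],
                    max_concurrent: Dict[str, int]) -> str:
    """Pick the first candidate with a free slot, else the least-loaded one.

    Illustrative only: `active` and `max_concurrent` stand in for the
    router's Redis counters and its GPU_MAX_CONCURRENT map.
    """
    if not candidates:
        raise ValueError("no candidates")
    # Priority order wins: the first candidate with spare capacity is chosen.
    for name in candidates:
        if active.get(name, 0) < max_concurrent.get(name, 1):
            return name
    # All candidates are busy: fall back to the lowest active count.
    # min() returns the first minimum, so ties keep the priority order.
    return min(candidates, key=lambda n: active.get(n, 0))
```

For example, with the order `["moe", "vlm", "dense"]` and two slots each, a saturated `moe` sends the request to `vlm` even if `dense` is idle, because the candidate order takes precedence over global load.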