feat: Smart Queue Consumer implementation draft + architecture review

- SMART_QUEUE_IMPLEMENTATION.md: Complete implementation draft (1572 lines) with 10 quick-win fixes and full smart queue consumer rewrite
- ARCHITECTURE_REVIEW.md: 26-issue audit with prioritized findings
- Verified all 3 GPUs live: amdpve (73% util), llmgpu (idle), ocu_llm (idle)
- Redis 7.4.9 confirmed streams support
- GPU sidecar metrics verified on all hosts

Key fixes:
- QW-1: Dockerfile path mismatch (Dockerfile.queue -> queue-service/Dockerfile)
- QW-2: Nginx fallback only on ALL-GPU failure (not single GPU)
- QW-3: Container names fixed to Docker service names
- QW-4: Redis host default fixed (192.168.68.7 -> redis)
- QW-5: Dependency version pinning
- QW-7-10: Health checks, restart policy, Gunicorn, single-process collector

Smart queue features:
- Redis Streams + consumer groups
- GPU-aware load balancing via sidecar metrics
- Per-GPU circuit breakers with half-open recovery
- Adaptive backpressure (0-30 normal, 30-40 warn, 40-50 503, >50 open)
- Dead letter queue with retry endpoint
- Job ID tracking and /status/<job_id> API
@@ -0,0 +1,71 @@
# Syslog Harness — Production Migration Plan

## Current State (Development)
- **Host:** CT 114 (192.168.68.123)
- **Docker containers:** `syslog-queue` (:8091), `syslog-dashboard` (:3001)
- **Nginx:** Local on CT 114, routing to GPUs + Docker services
- **Status:** All components verified and operational

## Target State (Production)
- **Host:** New CT (e.g., `docker-vm` on 192.168.68.x)
- **Docker containers:** Same queue and dashboard services
- **Nginx:** Containerized on the production CT
- **GPU backends:** Unchanged (192.168.68.15, 192.168.68.8, 192.168.68.110)

## Migration Steps

### 1. Prepare Production CT
```bash
# Create a new CT on Proxmox first, then run the rest inside it

# Install Docker and the Compose plugin.
# Note: docker-compose-plugin ships in Docker's own apt repository; with only
# distro repositories enabled, the Compose v2 package name may differ.
apt update && apt install -y docker.io docker-compose-plugin

# Clone the harness repo
git clone <repo-url> /root/syslog-harness
cd /root/syslog-harness
```

### 2. Update docker-compose.yml for Production
- Set `REDIS_HOST` to the production Redis address (the default is now the `redis` service name, per QW-4)
- Update the GPU endpoint env vars if the GPU IPs change
- Add volume mounts for persistence
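
One way to apply the changes above without editing the base file is a Compose override file. The following is a minimal sketch, not the repository's actual layout: the service name `syslog-queue`, the volume name `queue-data`, the mount path `/data`, and the helper `write_prod_override` are all hypothetical and should be adjusted to match the real `docker-compose.yml`.

```bash
# Sketch: generate a production override instead of editing docker-compose.yml.
# ASSUMPTIONS (not taken from the repo): the Compose service is named
# "syslog-queue"; "queue-data" and "/data" are placeholder volume/mount names.
write_prod_override() {
  # $1 = production Redis host, $2 = output file (optional)
  cat > "${2:-docker-compose.prod.yml}" <<EOF
services:
  syslog-queue:
    environment:
      REDIS_HOST: "$1"
    volumes:
      - queue-data:/data
volumes:
  queue-data:
EOF
}

# Usage (from the repo root):
#   write_prod_override "$PROD_REDIS_HOST"
#   docker compose -f docker-compose.yml -f docker-compose.prod.yml config   # print the merged config
#   docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

Keeping production values in a separate file leaves the development compose file untouched, which also simplifies falling back to CT 114.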

### 3. Build & Deploy
```bash
# Build images
docker compose build

# Start services
docker compose up -d

# Verify health (-f makes curl exit non-zero on HTTP errors)
curl -f http://localhost:8091/health
curl -f http://localhost:3001/api/status
```

### 4. Configure Nginx
- Copy `/etc/nginx/conf.d/gpu-router.conf` from CT 114 to the production CT
- Update upstream IPs if needed
- Test the configuration, then reload Nginx
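
The test-then-reload step can be wrapped so that a broken config is never reloaded. This is a sketch under assumptions: `reload_nginx` is a hypothetical helper, and the containerized invocation assumes a Compose service named `nginx`, which this plan does not confirm. Only the standard `nginx -t` and `nginx -s reload` commands are used.

```bash
# Sketch: validate the Nginx configuration and reload only if the check passes.
# With no arguments it runs the local `nginx` binary; pass a command prefix to
# target a container instead (the "nginx" service name below is an assumption).
reload_nginx() {
  [ "$#" -gt 0 ] || set -- nginx
  if "$@" -t; then            # syntax-check the configuration files
    "$@" -s reload            # ask the running master process to reload
  else
    echo "nginx config test failed; not reloading" >&2
    return 1
  fi
}

# Usage:
#   reload_nginx                                    # Nginx installed on the host
#   reload_nginx docker compose exec -T nginx nginx # containerized Nginx
```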

### 5. DNS / Routing Update
- Point agent traffic to the new CT's IP
- Update `inference_api_url` in the Hermes config
- Test agent routing

### 6. Verification Checklist
- [ ] Queue service health check passes
- [ ] Dashboard API returns GPU health
- [ ] Nginx routes to the correct GPU based on the request header
- [ ] Circuit breaker triggers on excess load
- [ ] Queue fallback works when all GPUs are down
- [ ] Agent requests reach the correct model
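
The first two checklist items can be scripted, since services may need a few seconds after `docker compose up -d` before they answer. The sketch below polls an endpoint with retries; `wait_for_http` and its default attempt count and delay are illustrative choices, not part of the harness.

```bash
# Sketch: poll an HTTP endpoint until it responds successfully or attempts run out.
# Arguments: URL, max attempts (default 10), delay in seconds (default 3).
wait_for_http() {
  local url="$1"
  local attempts="${2:-10}"
  local delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP >= 400, -sS: quiet but still report errors, -o: drop the body
    if curl -fsS -o /dev/null "$url"; then
      echo "OK $url"
      return 0
    fi
    # Pause between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "FAIL $url" >&2
  return 1
}

# Usage:
#   wait_for_http http://localhost:8091/health
#   wait_for_http http://localhost:3001/api/status
```

The remaining items (header routing, circuit breaker, fallback) depend on details not given in this plan and are left as manual checks.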

## Rollback Plan
- Keep CT 114 running as a backup
- Revert DNS/routing to 192.168.68.123 if issues arise
- Docker containers can be stopped and restarted quickly
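
A rollback can be sketched as "confirm the backup is healthy, then stop production". The helper name `rollback_to_dev` and the `DEV_HEALTH_URL` variable are hypothetical; the default URL assumes the queue service on CT 114 is still listening on port 8091 as listed under Current State. Reverting DNS/routing remains a manual step.

```bash
# Sketch: stop the production stack only after the development CT answers.
# DEV_HEALTH_URL is a hypothetical override; the default assumes the CT 114
# queue service is still reachable at 192.168.68.123:8091.
rollback_to_dev() {
  local dev_url="${DEV_HEALTH_URL:-http://192.168.68.123:8091/health}"
  if ! curl -fsS -o /dev/null "$dev_url"; then
    echo "backup not healthy, aborting rollback: $dev_url" >&2
    return 1
  fi
  # `stop` keeps the containers, so `docker compose start` resumes them later
  docker compose stop
}

# Usage (on the production CT, from the repo root):
#   rollback_to_dev
```

Checking the backup first avoids a window in which neither environment is serving requests.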

---
*Created: May 15, 2026*
*Status: Development verified, ready for production migration*