Syslog Harness — Production Migration Plan
Current State (Development)
- Host: CT 114 (192.168.68.123)
- Docker containers: syslog-queue (:8091), syslog-dashboard (:3001)
- Nginx: Local on CT 114, routing to GPUs + Docker services
- Status: All components verified and operational
Target State (Production)
- Host: New CT (e.g., docker-vm on 192.168.68.x)
- Docker containers: Same queue + dashboard services
- Nginx: Containerized on production CT
- GPU backends: Same (192.168.68.15, .8, .110)
Migration Steps
1. Prepare Production CT
# Create new CT on Proxmox
# Install Docker
apt update && apt install -y docker.io docker-compose-plugin
# Clone the harness repo
git clone <repo-url> /root/syslog-harness
cd /root/syslog-harness
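Before moving on, it can help to confirm that the tools installed above are actually on the PATH of the new CT. The following is a minimal sketch; the helper name require_cmd is hypothetical and not part of the harness repo.

```shell
#!/bin/sh
# require_cmd NAME
# Succeeds if NAME resolves to a command on PATH, fails otherwise.
# Hypothetical helper for illustration only.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Possible usage on the new CT:
#   require_cmd docker && require_cmd git && docker compose version
```

The final `docker compose version` call checks that the Compose plugin is installed, since the plugin is not a standalone binary on PATH.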
2. Update docker-compose.yml for Production
- Change REDIS_HOST to the production Redis IP
- Update GPU endpoint env vars if IPs change
- Add volume mounts for persistence
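One way to apply these changes without hand-editing docker-compose.yml is to keep the values in an env file that Compose reads for variable substitution. The sketch below assumes the compose file references the variable as ${REDIS_HOST}, which this plan does not confirm; the helper name set_env_var and the use of a .env file are illustrative assumptions.

```shell
#!/bin/sh
# set_env_var FILE KEY VALUE
# Removes any existing KEY=... line from FILE and appends KEY=VALUE.
# Hypothetical helper; assumes docker-compose.yml reads KEY via ${KEY}.
set_env_var() {
  file=$1
  key=$2
  value=$3
  touch "$file"
  tmp=$(mktemp) || return 1
  # Keep every line except a previous definition of KEY.
  grep -v "^${key}=" "$file" > "$tmp"
  # Append the new definition.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its original permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Possible usage (substitute the real production Redis IP):
#   set_env_var .env REDIS_HOST <production-redis-ip>
```

Because the helper rewrites only the named key, running it again with a different value is safe and leaves other entries in the file untouched.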
3. Build & Deploy
# Build images
docker compose build
# Start services
docker compose up -d
# Verify health
curl http://localhost:8091/health
curl http://localhost:3001/api/status
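Containers may need a few seconds after startup before they accept connections, so a single curl can fail even when the deployment is fine. A minimal retry sketch follows; the function names probe and wait_for_health and the default attempt count and delay are assumptions, while the endpoints are the ones used above.

```shell
#!/bin/sh
# probe URL: succeeds if URL answers with a non-error HTTP status within 5 s.
probe() {
  curl -fsS --max-time 5 "$1" >/dev/null 2>&1
}

# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Retries probe until it succeeds or ATTEMPTS is exhausted.
wait_for_health() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "OK   $url (attempt $i)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "FAIL $url after $attempts attempts" >&2
  return 1
}

# Possible usage after "docker compose up -d":
#   wait_for_health http://localhost:8091/health
#   wait_for_health http://localhost:3001/api/status
```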
4. Configure Nginx
- Copy /etc/nginx/conf.d/gpu-router.conf to production CT
- Update upstream IPs if needed
- Test and reload
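The test-and-reload step can be sketched so that a reload only happens when the configuration validates. The NGINX variable and the reload_nginx function are illustrative names; for the containerized Nginx on the production CT the command prefix would differ, and the Compose service name shown in the comment is an assumption.

```shell
#!/bin/sh
# NGINX is the command used to reach the nginx binary. The default assumes a
# host-level install; for a containerized Nginx it might be something like
# "docker compose exec -T nginx nginx" (service name "nginx" is an assumption).
NGINX=${NGINX:-nginx}

# Validate the configuration first and reload only if validation succeeds,
# so a broken gpu-router.conf is never loaded into the running server.
reload_nginx() {
  if $NGINX -t; then
    $NGINX -s reload
  else
    echo "nginx config test failed; not reloading" >&2
    return 1
  fi
}

# Possible usage:
#   reload_nginx
```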
5. DNS / Routing Update
- Point agent traffic to new CT IP
- Update Hermes config inference_api_url
- Test agent routing
6. Verification Checklist
- Queue service health check passes
- Dashboard API returns GPU health
- Nginx routes to correct GPU based on header
- Circuit breaker triggers on excess load
- Queue fallback works when all GPUs are down
- Agent requests reach correct model
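Parts of this checklist can be scripted. The sketch below is a small pass/fail runner; the function names check and summary are hypothetical. Only the first two items map directly onto endpoints named in this plan. The routing, circuit-breaker, and fallback items depend on header names and load thresholds not specified here, so they are left out.

```shell
#!/bin/sh
PASS=0
FAIL=0

# check DESCRIPTION COMMAND [ARGS...]
# Runs COMMAND with its output discarded and records whether it succeeded.
check() {
  desc=$1
  shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1))
    echo "[PASS] $desc"
  else
    FAIL=$((FAIL + 1))
    echo "[FAIL] $desc"
  fi
}

# summary: prints totals and returns non-zero if any check failed.
summary() {
  echo "$PASS passed, $FAIL failed"
  [ "$FAIL" -eq 0 ]
}

# Possible usage against the endpoints from step 3:
#   check "Queue service health check passes" curl -fsS http://localhost:8091/health
#   check "Dashboard API responds" curl -fsS http://localhost:3001/api/status
#   summary
```

Returning a non-zero status from summary lets the script gate a later step, such as the DNS cutover, on every check passing.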
Rollback Plan
- Keep CT 114 running as backup
- Revert DNS/routing to CT 114 (192.168.68.123) if issues arise
- Docker containers can be stopped/started instantly
Created: May 15, 2026
Status: Development verified, ready for production migration