Syslog Harness — Production Migration Plan

Current State (Development)

  • Host: CT 114 (192.168.68.123)
  • Docker containers: syslog-queue (:8091), syslog-dashboard (:3001)
  • Nginx: Local on CT 114, routing to GPUs + Docker services
  • Status: All components verified and operational

Target State (Production)

  • Host: New CT (e.g., docker-vm on 192.168.68.x)
  • Docker containers: Same queue + dashboard services
  • Nginx: Containerized on production CT
  • GPU backends: Unchanged (192.168.68.15, 192.168.68.8, 192.168.68.110)

Migration Steps

1. Prepare Production CT

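# Create a new CT on Proxmox, then install Docker inside it.
# Note: docker-compose-plugin is published in Docker's own apt repository,
# which may need to be configured before this install succeeds.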
apt update && apt install -y docker.io docker-compose-plugin

# Clone the harness repo
git clone <repo-url> /root/syslog-harness
cd /root/syslog-harness

2. Update docker-compose.yml for Production

  • Change REDIS_HOST to production Redis IP
  • Update GPU endpoint env vars if IPs change
  • Add volume mounts for persistence
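
One way to apply these changes without editing the base file is a Compose override file, which `docker compose` merges automatically when it is named `docker-compose.override.yml`. The sketch below is only an illustration: the service name `syslog-queue`, the volume name `queue-data`, the mount path `/data`, and the helper `write_override` are all assumptions, not taken from the harness repo.

```shell
# Hypothetical helper: writes a Compose override into the given directory.
# Usage: write_override <project-dir> <redis-ip>
write_override() {
  dir="$1"
  redis_ip="$2"
  cat > "$dir/docker-compose.override.yml" <<EOF
services:
  syslog-queue:            # assumed service name
    environment:
      REDIS_HOST: "$redis_ip"
    volumes:
      - queue-data:/data   # assumed mount path for persistent state
volumes:
  queue-data:
EOF
}
```

Running `docker compose config` afterwards prints the merged configuration, which is a quick way to confirm the override took effect.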

3. Build & Deploy

# Build images
docker compose build

# Start services
docker compose up -d

# Verify health
curl http://localhost:8091/health
curl http://localhost:3001/api/status
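
Services can take a few seconds to come up after `docker compose up -d`, so a single `curl` may fail spuriously. A minimal retry wrapper, assuming the two endpoints above answer without an HTTP error status when healthy (the function name `wait_healthy` and its defaults are illustrative):

```shell
# Poll a URL until curl succeeds (no connection error, no HTTP status >= 400)
# or the attempts run out.
# Usage: wait_healthy <url> [attempts] [delay-seconds]
wait_healthy() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      echo "healthy: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unhealthy after $attempts attempts: $url" >&2
  return 1
}
```

For example, `wait_healthy http://localhost:8091/health && wait_healthy http://localhost:3001/api/status` would block until both services respond or one of them exhausts its attempts.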

4. Configure Nginx

  • Copy /etc/nginx/conf.d/gpu-router.conf from CT 114 to the production CT
  • Update upstream IPs if the GPU backends have moved
  • Test the configuration with nginx -t, then reload Nginx
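
Because the target state runs Nginx in a container, the test-and-reload step would likely go through Compose rather than the host's service manager. A sketch, assuming the Compose service is named `nginx` (a guess; substitute the real service name) and that `gpu-router.conf` is already mounted into the container:

```shell
# Validate the Nginx config inside the container and reload only if it passes.
# Usage: nginx_test_and_reload [service-name]
nginx_test_and_reload() {
  svc="${1:-nginx}"   # assumed Compose service name
  # nginx -t parses the configuration and reports errors without applying it
  if docker compose exec -T "$svc" nginx -t; then
    # nginx -s reload signals the master process to re-read the configuration
    docker compose exec -T "$svc" nginx -s reload
  else
    echo "nginx config test failed; not reloading" >&2
    return 1
  fi
}
```

Gating the reload on the test keeps a broken `gpu-router.conf` from taking down routing that is currently working.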

5. DNS / Routing Update

  • Point agent traffic to the new CT IP
  • Update inference_api_url in the Hermes config to the new CT address
  • Test agent routing end to end

6. Verification Checklist

  • Queue service health check passes
  • Dashboard API returns GPU health
  • Nginx routes to correct GPU based on header
  • Circuit breaker triggers on excess load
  • Queue fallback works when GPUs are down
  • Agent requests reach correct model
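
The first two checklist items map onto the endpoints from step 3 and can be scripted. The remaining items (header-based routing, circuit breaker, fallback) depend on request headers and load thresholds this plan does not specify, so they are left out. A small sketch with hypothetical helper names `run_check` and `verify_basic`; note that it only confirms the endpoints respond, not that the dashboard payload actually contains GPU health:

```shell
FAILED=0

# Run a command quietly and record PASS/FAIL under the given label.
# Usage: run_check <label> <command> [args...]
run_check() {
  label="$1"
  shift
  if "$@" > /dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    FAILED=$((FAILED + 1))
  fi
}

# Checks the two endpoints named in this plan; returns non-zero if any failed.
verify_basic() {
  FAILED=0
  run_check "queue service health" curl -fsS --max-time 5 http://localhost:8091/health
  run_check "dashboard API status" curl -fsS --max-time 5 http://localhost:3001/api/status
  [ "$FAILED" -eq 0 ]
}
```

Calling `verify_basic` on the production CT prints one line per check and returns a non-zero status if either endpoint is unreachable, which makes it easy to chain into a deploy script.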

Rollback Plan

  • Keep CT 114 running as a backup
  • Revert DNS/routing to CT 114 (192.168.68.123) if issues arise
  • Docker containers can be stopped and started quickly

Created: May 15, 2026
Status: Development verified, ready for production migration