fix: non-blocking GPU health checks + 256K turboquant context upgrade

router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking)
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)

docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s

Fixes dashboard hang when any GPU is unreachable.
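
Illustrative sketch only, not the code in router/router.py: one way the timeout-bounded health check described above could look. The parameter names sidecar_timeout and gpu_timeout and the value GPU_CONTEXT = 262144 come from this commit message. Everything else is an assumption made for the example: the helper names probe and list_models, the probe_fn parameter, the URL fields, the return shape, and the reading of "1.5s/1s" as sidecar/GPU in that order.

```python
import http.client
import urllib.request

GPU_CONTEXT = 262144  # 256K tokens, value taken from the commit message


def probe(url, timeout):
    """Hypothetical helper: True if `url` answers 2xx within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and timeouts are all OSError.
        return False


def check_gpu_health(gpu_url, sidecar_url, sidecar_timeout=1.5,
                     gpu_timeout=1.0, probe_fn=probe):
    """Probe the sidecar and the GPU backend, each bounded by its own timeout.

    Defaults assume "1.5s/1s" means sidecar/GPU; the real defaults may differ.
    """
    sidecar_ok = probe_fn(sidecar_url, sidecar_timeout)
    gpu_ok = probe_fn(gpu_url, gpu_timeout)
    return {"sidecar": sidecar_ok, "gpu": gpu_ok,
            "healthy": sidecar_ok and gpu_ok}


def list_models(models, probe_fn=probe):
    """Build a /v1/models-style listing with one health check per model."""
    listing = []
    for name, cfg in models.items():
        # Call once and reuse the result instead of checking twice.
        health = check_gpu_health(cfg["gpu_url"], cfg["sidecar_url"],
                                  probe_fn=probe_fn)
        listing.append({"id": name,
                        "healthy": health["healthy"],
                        "context": GPU_CONTEXT})
    return listing
```

With bounded probes, the worst case for one model is roughly sidecar_timeout + gpu_timeout (about 2.5s under the assumed defaults), which stays well inside the 15s container healthcheck timeout set in docker-compose.yml.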
2026-05-23 05:57:13 +00:00
parent 0983337fdb
commit 41939104c7
2 changed files with 43 additions and 19 deletions
@@ -29,8 +29,8 @@ services:
      - GPU_LIGHT_URL=http://192.168.68.110:8080/v1
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9000/health')"]
-      interval: 15s
-      timeout: 5s
+      interval: 30s
+      timeout: 15s
      retries: 3
    depends_on:
      redis:
@@ -68,8 +68,8 @@ services:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    healthcheck:
      test: ["CMD", "curl", "-f", "http://127.0.0.1/health"]
-      interval: 15s
-      timeout: 5s
+      interval: 30s
+      timeout: 15s
      retries: 3
    depends_on:
      - litellm