fix: non-blocking GPU health checks + 256K turboquant context upgrade

router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use short 1.5s/1s timeouts, so an unreachable GPU can no longer stall the response
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)
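
A minimal sketch of the timeout change described above, not the actual router.py code. Everything except the names check_gpu_health, sidecar_timeout and gpu_timeout is hypothetical: the URL fields, the dict return shape, the 5s defaults, the injectable probe_fn, and the assumption that 1.5s applies to the sidecar and 1s to the GPU.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout):
    """Return True if `url` answers with HTTP 2xx within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Unreachable, timed out, or malformed: report unhealthy instead of raising.
        return False


def check_gpu_health(sidecar_url, gpu_url, sidecar_timeout=5.0, gpu_timeout=5.0,
                     probe_fn=probe):
    """Probe the sidecar and the GPU backend, each with its own time limit."""
    return {
        "sidecar": probe_fn(sidecar_url, sidecar_timeout),
        "gpu": probe_fn(gpu_url, gpu_timeout),
    }


def list_models(models, health_fn=check_gpu_health):
    """Build a /v1/models-style listing with one health check per model."""
    listing = []
    for name, cfg in models.items():
        # Single call per model; the result is reused for every field below.
        health = health_fn(cfg["sidecar_url"], cfg["gpu_url"],
                           sidecar_timeout=1.5, gpu_timeout=1.0)
        listing.append({
            "id": name,
            "healthy": health["sidecar"] and health["gpu"],
            "health": health,
        })
    return listing
```

With these limits a dead backend costs at most about 2.5s per model in this sketch, instead of the full default timeout for each probe.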

docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s
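
As a rough illustration of what the new healthcheck numbers imply, the sketch below estimates how long Docker may take to mark a container unhealthy and how many backends could be probed sequentially inside one healthcheck. The function names are hypothetical, start_period is ignored, and sequential probing at 1.5s + 1s per backend is an assumption about the router, not something this commit states.

```python
def worst_case_unhealthy_after(interval_s, timeout_s, retries):
    """Upper bound in seconds, measured from the start of the first failing probe.

    Each of the `retries` consecutive failures may run for the full timeout,
    and Docker waits `interval_s` between the end of one probe and the next.
    """
    return retries * timeout_s + (retries - 1) * interval_s


def max_sequential_backends(healthcheck_timeout_s, per_backend_s):
    """How many back-to-back backend probes fit inside one healthcheck timeout."""
    return int(healthcheck_timeout_s // per_backend_s)


old_bound = worst_case_unhealthy_after(interval_s=15, timeout_s=5, retries=3)
new_bound = worst_case_unhealthy_after(interval_s=30, timeout_s=15, retries=3)
fits = max_sequential_backends(healthcheck_timeout_s=15, per_backend_s=1.5 + 1.0)

print(old_bound, new_bound, fits)  # 45 105 6
```

Under these assumptions the longer timeout leaves room for several slow probes per check, at the cost of slower detection of a container that is genuinely down.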

Fixes dashboard hang when any GPU is unreachable.
Abiba
2026-05-23 05:57:13 +00:00
parent 0983337fdb
commit 41939104c7
2 changed files with 43 additions and 19 deletions
docker-compose.yml (+4 -4)
@@ -29,8 +29,8 @@ services:
       - GPU_LIGHT_URL=http://192.168.68.110:8080/v1
     healthcheck:
       test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9000/health')"]
-      interval: 15s
-      timeout: 5s
+      interval: 30s
+      timeout: 15s
       retries: 3
     depends_on:
       redis:
@@ -68,8 +68,8 @@ services:
       - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
     healthcheck:
       test: ["CMD", "curl", "-f", "http://127.0.0.1/health"]
-      interval: 15s
-      timeout: 5s
+      interval: 30s
+      timeout: 15s
       retries: 3
     depends_on:
       - litellm