Abiba 41939104c7 fix: non-blocking GPU health checks + 256K turboquant context upgrade
router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking)
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)

docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s

Fixes dashboard hang when any GPU is unreachable.
2026-05-23 05:57:13 +00:00
2026-05-19 15:03:47 +00:00
2026-05-19 15:03:47 +00:00
S
Description
SyslogAI Inference Harness — 3-GPU router, dashboard, LiteLLM proxy
371 KiB
Languages
Python 97.6%
Shell 1.9%
Dockerfile 0.5%