inference-harness

Files

Abiba 41939104c7 fix: non-blocking GPU health checks + 256K turboquant context upgrade

router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking)
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)

docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s

Fixes dashboard hang when any GPU is unreachable.

2026-05-23 05:57:13 +00:00

Dockerfile

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

http_patch.py

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

requirements.txt

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

router.py

fix: non-blocking GPU health checks + 256K turboquant context upgrade

2026-05-23 05:57:13 +00:00

router.py.bak.20260518074236

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

ts_patch.py

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00