15 Commits

Author SHA1 Message Date
Abiba ddde6646de fix: decouple VRAM usage from saturation status
VRAM percentage no longer marks GPU as saturated.
Saturation is about slot availability (handled by is_gpu_busy()),
not memory usage. Added vram_warning boolean flag (≥95% threshold)
for informational monitoring without affecting routing decisions.

27B Dense now correctly shows healthy at 91% VRAM.
2026-05-23 06:00:37 +00:00
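
The separation described in ddde6646de can be sketched roughly as follows. This is an illustration under assumptions, not the code in router/router.py: `GpuStatus`, `gpu_status`, and the slot-count arguments are hypothetical, and the slot comparison only stands in for `is_gpu_busy()`, whose real logic is not shown in this log. Only the `vram_warning` name and the 95% threshold come from the commit message.

```python
from dataclasses import dataclass

# Threshold taken from the commit message (>= 95%).
VRAM_WARNING_THRESHOLD = 95.0


@dataclass
class GpuStatus:
    saturated: bool      # drives routing decisions
    vram_warning: bool   # informational only, never affects routing


def gpu_status(active_slots: int, max_concurrent: int, vram_percent: float) -> GpuStatus:
    """Hypothetical helper: saturation depends on slots, not on memory."""
    # Stand-in for is_gpu_busy(): the GPU is busy when no slot is free.
    busy = active_slots >= max_concurrent
    return GpuStatus(
        saturated=busy,
        vram_warning=vram_percent >= VRAM_WARNING_THRESHOLD,
    )
```

Under this sketch, a GPU at 91% VRAM with a free slot reports neither saturation nor a warning, which matches the 27B Dense case mentioned in the commit.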
Abiba 41939104c7 fix: non-blocking GPU health checks + 256K turboquant context upgrade
router/router.py:
- check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout)
- /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking)
- /v1/models now calls check_gpu_health once per model instead of twice
- GPU_CONTEXT updated to 262144 across all models (turboquant upgrade)
- 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context)

docker-compose.yml:
- Router healthcheck timeout 5s→15s, interval 15s→30s
- Nginx healthcheck timeout 5s→15s, interval 15s→30s

Fixes dashboard hang when any GPU is unreachable.
2026-05-23 05:57:13 +00:00
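
A minimal sketch of what a health check with configurable timeouts might look like, assuming plain HTTP probes via the standard library. The `sidecar_timeout` and `gpu_timeout` parameter names come from the commit message; the `probe` helper, the `probe_fn` hook, the URLs, and the returned dictionary shape are assumptions made for illustration. The log does not say which of the 1.5s/1s values applies to which probe, so the usage line below guesses that they follow the parameter order.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float) -> bool:
    """Return True if the URL answers 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable host, timeout, or malformed URL: report unhealthy
        # instead of raising, so the caller is never stuck on one GPU.
        return False


def check_gpu_health(sidecar_url, gpu_url, sidecar_timeout, gpu_timeout, probe_fn=probe):
    """Hypothetical shape: probe both endpoints, each with its own timeout."""
    return {
        "sidecar_ok": probe_fn(sidecar_url, sidecar_timeout),
        "gpu_ok": probe_fn(gpu_url, gpu_timeout),
    }
```

A dashboard-facing endpoint could then call `check_gpu_health(sidecar_url, gpu_url, sidecar_timeout=1.5, gpu_timeout=1.0)` once per model and reuse the result, so an unreachable GPU costs a bounded wait rather than a hang.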
Abiba 0983337fdb fix: heavy tier Dense→MoE→VLM 2026-05-19 21:24:36 +00:00
Abiba 28d62e27ba feat: context-aware routing + compaction signals 2026-05-19 21:13:57 +00:00
Abiba 714ebb003e fix: heavy threshold → 50000 tokens, 25 turns 2026-05-19 21:08:18 +00:00
Abiba e90bf0216d fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns 2026-05-19 20:10:07 +00:00
Abiba 5971ceee4e security: reject requests without valid API key (401) 2026-05-19 19:15:13 +00:00
Abiba 5f05f46c7c fix: heavy tier — Dense first for reasoning, MoE workhorse, VLM overflow 2026-05-19 18:27:24 +00:00
Abiba 911fdc9f3f fix: routing priority — MoE first, VLM second, Dense last 2026-05-19 17:38:29 +00:00
Abiba d9d2c213f6 fix: routing — remove turn limit from default tier, no gaps 2026-05-19 17:24:41 +00:00
Abiba 6625892908 feat: redesigned routing tiers — VLM handles more traffic 2026-05-19 17:01:58 +00:00
Abiba fcb99a26c8 revert: remove Ollama endpoints 2026-05-19 16:57:05 +00:00
Abiba 2234d03079 fix: add /v1/props and /v1/models/<id> endpoints 2026-05-19 16:08:58 +00:00
Abiba 5b99b16712 feat: add request queuing to router (replaces hard 503) 2026-05-19 15:55:13 +00:00
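
One way to read "request queuing (replaces hard 503)" is that a request waits for a free slot up to a deadline instead of being rejected at once. The sketch below illustrates that idea with a semaphore; `SlotQueue`, `queue_timeout`, and the status tuples are hypothetical, and the actual router may queue requests differently.

```python
import threading


class SlotQueue:
    """Hypothetical: wait for a free slot up to a deadline before giving up."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, handler, queue_timeout: float):
        # Block (queue) until a slot frees up or the deadline passes.
        if not self._slots.acquire(timeout=queue_timeout):
            return 503, "queue wait timed out"
        try:
            return 200, handler()
        finally:
            # Always hand the slot back, even if the handler raises.
            self._slots.release()
```

In this sketch a 503 is still possible, but only after the caller has waited `queue_timeout` seconds without a slot becoming available.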
Abiba 28fc57c5c7 May 19, 2026: Full harness update
- Model migration: gemma-4-E4B → qwen3.5-9b-vlm
- Dashboard reorder: Usage Over Time + GPU Metrics to top
- Router counter leak fix (gpu_decr in except handler)
- VLM slot upgrade 1→2
- Automated maintenance cron job
- LiteLLM config update
2026-05-19 15:03:47 +00:00
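
The counter leak fix listed in 28fc57c5c7 ("gpu_decr in except handler") can be illustrated as below. Only the `gpu_decr` name appears in the commit; `gpu_incr`, `active_count`, `forward`, and the in-memory counter are assumptions made for this sketch.

```python
import threading

_active = {}               # hypothetical per-GPU in-flight request counter
_lock = threading.Lock()


def gpu_incr(gpu: str) -> None:
    with _lock:
        _active[gpu] = _active.get(gpu, 0) + 1


def gpu_decr(gpu: str) -> None:
    with _lock:
        _active[gpu] = max(0, _active.get(gpu, 0) - 1)


def active_count(gpu: str) -> int:
    with _lock:
        return _active.get(gpu, 0)


def forward(gpu: str, send):
    """Run `send()` while counting the request against `gpu`."""
    gpu_incr(gpu)
    try:
        result = send()
    except Exception:
        # Without this decrement a failed request would hold its slot
        # forever, and the GPU would eventually look permanently busy.
        gpu_decr(gpu)
        raise
    gpu_decr(gpu)
    return result
```

A `try`/`finally` would express the same guarantee more compactly; the explicit `except` branch is kept here only because the commit describes the fix that way.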