inference-harness

15 Commits

Author	SHA1	Message	Date
Abiba	ddde6646de	fix: decouple VRAM usage from saturation status. VRAM percentage no longer marks GPU as saturated. Saturation is about slot availability (handled by is_gpu_busy()), not memory usage. Added vram_warning boolean flag (≥95% threshold) for informational monitoring without affecting routing decisions. 27B Dense now correctly shows healthy at 91% VRAM.	2026-05-23 06:00:37 +00:00
Abiba	41939104c7	fix: non-blocking GPU health checks + 256K turboquant context upgrade. router/router.py: check_gpu_health() now accepts configurable timeouts (sidecar_timeout, gpu_timeout); /health and /v1/models endpoints use fast 1.5s/1s timeouts (non-blocking); /v1/models now calls check_gpu_health once per model instead of twice; GPU_CONTEXT updated to 262144 across all models (turboquant upgrade); 27B max_concurrent reduced 2→1 (24GB VRAM saturated at 256K context). docker-compose.yml: Router healthcheck timeout 5s→15s, interval 15s→30s; Nginx healthcheck timeout 5s→15s, interval 15s→30s. Fixes dashboard hang when any GPU is unreachable.	2026-05-23 05:57:13 +00:00
Abiba	0983337fdb	fix: heavy tier Dense→MoE→VLM	2026-05-19 21:24:36 +00:00
Abiba	28d62e27ba	feat: context-aware routing + compaction signals	2026-05-19 21:13:57 +00:00
Abiba	714ebb003e	fix: heavy threshold → 50000 tokens, 25 turns	2026-05-19 21:08:18 +00:00
Abiba	e90bf0216d	fix: raise heavy threshold — 4000→12000 tokens, 8→15 turns	2026-05-19 20:10:07 +00:00
Abiba	5971ceee4e	security: reject requests without valid API key (401)	2026-05-19 19:15:13 +00:00
Abiba	5f05f46c7c	fix: heavy tier — Dense first for reasoning, MoE workhorse, VLM overflow	2026-05-19 18:27:24 +00:00
Abiba	911fdc9f3f	fix: routing priority — MoE first, VLM second, Dense last	2026-05-19 17:38:29 +00:00
Abiba	d9d2c213f6	fix: routing — remove turn limit from default tier, no gaps	2026-05-19 17:24:41 +00:00
Abiba	6625892908	feat: redesigned routing tiers — VLM handles more traffic	2026-05-19 17:01:58 +00:00
Abiba	fcb99a26c8	revert: remove Ollama endpoints	2026-05-19 16:57:05 +00:00
Abiba	2234d03079	fix: add /v1/props and /v1/models/<id> endpoints	2026-05-19 16:08:58 +00:00
Abiba	5b99b16712	feat: add request queuing to router (replaces hard 503)	2026-05-19 15:55:13 +00:00
Abiba	28fc57c5c7	May 19, 2026: Full harness update. Model migration: gemma-4-E4B → qwen3.5-9b-vlm; Dashboard reorder: Usage Over Time + GPU Metrics to top; Router counter leak fix (gpu_decr in except handler); VLM slot upgrade 1→2; Automated maintenance cron job; LiteLLM config update.	2026-05-19 15:03:47 +00:00
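
Commit ddde6646de describes separating slot-based saturation from VRAM monitoring. The following is a minimal Python sketch of that idea, not the harness's actual code. Only is_gpu_busy(), the vram_warning flag, and the ≥95% threshold come from the commit message; GpuStatus, its fields, and summarize_gpu() are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Threshold taken from the commit message (>=95%); the constant name is an assumption.
VRAM_WARNING_THRESHOLD = 95.0


@dataclass
class GpuStatus:
    """Hypothetical snapshot of one GPU backend."""
    active_requests: int   # requests currently occupying slots
    max_concurrent: int    # configured slot count for this model
    vram_percent: float    # VRAM usage as a percentage, 0-100


def is_gpu_busy(status: GpuStatus) -> bool:
    """Saturation depends only on slot availability, never on memory."""
    return status.active_requests >= status.max_concurrent


def summarize_gpu(status: GpuStatus) -> dict:
    """Report saturation and the VRAM warning as independent signals."""
    return {
        "saturated": is_gpu_busy(status),
        # Informational only: a router would not consult this flag.
        "vram_warning": status.vram_percent >= VRAM_WARNING_THRESHOLD,
    }
```

Under these assumptions, a GPU at 91% VRAM with a free slot reports neither flag, which matches the "27B Dense now correctly shows healthy at 91% VRAM" note in the commit. A GPU at 96% VRAM with a free slot raises vram_warning while remaining routable.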