inference-harness


Abiba 9a0d69ce8d feat: Dense 128K context + 2 slots, VLM second in Heavy tier

- Dense GPU_CONTEXT: 192K→128K (131072) to free VRAM
- Dense max_concurrent: 1→2 (VRAM now sufficient)
- Heavy tier fallback order: Dense → VLM → MoE (VLM supports a 262K context)
- Total slots: 6 (2 Dense + 2 MoE + 2 VLM)
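The slot and tier changes above can be read as a first-free-slot fallback across the Heavy tier. What follows is a minimal sketch under stated assumptions. The `Backend` class, `pick_heavy_backend`, and `route` are hypothetical names, and the router's actual selection logic is not shown in this listing. Only the slot counts and the Dense → VLM → MoE order come from the commit.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical backend record: slot counts mirror the commit message,
# but this class is an illustration, not the router's real code.
@dataclass
class Backend:
    name: str
    max_concurrent: int
    in_flight: int = 0

    def has_free_slot(self) -> bool:
        return self.in_flight < self.max_concurrent


BACKENDS = {
    "Dense": Backend("Dense", max_concurrent=2),  # raised from 1 to 2
    "MoE": Backend("MoE", max_concurrent=2),
    "VLM": Backend("VLM", max_concurrent=2),
}

# Heavy-tier fallback order after this commit: Dense, then VLM, then MoE.
HEAVY_TIER = ["Dense", "VLM", "MoE"]


def pick_heavy_backend(backends: dict[str, Backend]) -> Backend | None:
    """Return the first backend in Heavy-tier order that has a free slot."""
    for name in HEAVY_TIER:
        backend = backends[name]
        if backend.has_free_slot():
            return backend
    return None  # every slot is busy; a caller would queue or reject


def route(backends: dict[str, Backend]) -> str | None:
    """Hypothetical helper: claim a slot and return the chosen backend name."""
    backend = pick_heavy_backend(backends)
    if backend is None:
        return None
    backend.in_flight += 1
    return backend.name


total_slots = sum(b.max_concurrent for b in BACKENDS.values())

# Seven concurrent Heavy requests with no completions in between:
# Dense fills first, then VLM, then MoE, and the seventh finds no slot.
order = [route(BACKENDS) for _ in range(7)]
```

With six slots in total, the seventh request in this sketch gets `None`. A real router would also release slots as requests finish.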

Distribution target: Dense 50%, VLM 30%, MoE 20%
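The commit does not say how the router hits this 50/30/20 target. One way to meet such a target deterministically is smooth weighted round-robin, the scheme nginx uses for weighted upstreams. It is swapped in here purely for illustration. The weights 5/3/2 and the function name `smooth_wrr` are assumptions. The sketch also assumes every backend always has a free slot, which the Heavy-tier fallback order would not guarantee.

```python
def smooth_wrr(weights: dict[str, int], n: int) -> list[str]:
    """Smooth weighted round-robin: pick n backends in proportion to weights.

    Each round, every backend's running score grows by its weight. The highest
    score wins (ties go to the earlier key), and the winner's score drops by
    the total weight. Over any run of sum(weights) picks, each backend is
    chosen exactly `weight` times, and picks are interleaved, not bunched.
    """
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks: list[str] = []
    for _ in range(n):
        for name, weight in weights.items():
            current[name] += weight
        best = max(current, key=current.get)
        current[best] -= total
        picks.append(best)
    return picks


# Assumed weights matching the stated target: Dense 50%, VLM 30%, MoE 20%.
TARGET = {"Dense": 5, "VLM": 3, "MoE": 2}
picks = smooth_wrr(TARGET, 10)
```

Over 10 picks, the sequence is Dense, VLM, MoE, Dense, Dense, VLM, Dense, MoE, VLM, Dense, which is exactly 5/3/2. Interleaving like this keeps any one backend from receiving a burst of consecutive requests.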

NOTE: Requires llama.cpp restart on 192.168.68.8 with --ctx-size 131072

2026-05-27 07:15:58 +00:00

dashboard

fix: default performance window to 24h so all models appear immediately

2026-05-26 12:37:52 +00:00

nginx

feat: per-request performance tracking + /metrics/performance endpoint

2026-05-25 16:50:45 +00:00

router

feat: Dense 128K context + 2 slots, VLM second in Heavy tier

2026-05-27 07:15:58 +00:00

.gitignore

May 19, 2026: Full harness update