inference-harness

Files


Abiba 9a0d69ce8d feat: Dense 128K context + 2 slots, VLM second in Heavy tier

- Dense GPU_CONTEXT: 192K→128K (131072) to free VRAM
- Dense max_concurrent: 1→2 (VRAM now sufficient)
- Heavy tier: Dense → VLM → MoE (VLM handles 262K context)
- Total slots: 6 (2 Dense + 2 MoE + 2 VLM)
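A minimal sketch of the slot and Heavy-tier fallback logic listed above. It assumes a hypothetical `Backend` record and an in-flight counter per backend, and it is not the actual router.py code. The slot counts, the Dense and VLM context sizes, and the Dense → VLM → MoE order come from the commit message. The MoE context size is not stated, so the sketch leaves it unchecked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str
    max_concurrent: int
    ctx_tokens: Optional[int]  # None = limit not stated in the commit; not checked


# Values taken from the commit message; the Backend type itself is hypothetical.
BACKENDS = {
    "dense": Backend("dense", max_concurrent=2, ctx_tokens=131072),
    "vlm": Backend("vlm", max_concurrent=2, ctx_tokens=262144),
    "moe": Backend("moe", max_concurrent=2, ctx_tokens=None),
}

# 2 + 2 + 2 = 6, matching "Total slots: 6".
TOTAL_SLOTS = sum(b.max_concurrent for b in BACKENDS.values())

# Heavy tier fallback order: Dense -> VLM -> MoE.
HEAVY_ORDER = ["dense", "vlm", "moe"]


def pick_heavy_backend(in_flight: dict, prompt_tokens: int) -> Optional[str]:
    """Return the first backend in HEAVY_ORDER with a free slot that fits the prompt."""
    for name in HEAVY_ORDER:
        backend = BACKENDS[name]
        if in_flight.get(name, 0) >= backend.max_concurrent:
            continue  # every slot on this backend is busy
        if backend.ctx_tokens is not None and prompt_tokens > backend.ctx_tokens:
            continue  # prompt is longer than this backend's context window
        return name
    return None  # all backends full or too small; the caller must queue or reject
```

With both Dense slots busy, a short prompt falls through to VLM. A 200K-token prompt skips Dense (128K) and lands on VLM even when Dense is idle, which is why VLM sits second in the Heavy order.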

Distribution target: Dense 50%, VLM 30%, MoE 20%
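One way to hit a fixed 50/30/20 split deterministically is smooth weighted round-robin, the scheme nginx uses for weighted upstreams. It is shown here only as an illustration and is not necessarily what router.py does. The weights 5/3/2 are simply the target percentages scaled down, and `TARGET_WEIGHTS` and `smooth_wrr` are hypothetical names.

```python
# Target percentages scaled to small integer weights (50/30/20 -> 5/3/2).
TARGET_WEIGHTS = {"dense": 5, "vlm": 3, "moe": 2}


def smooth_wrr(weights: dict, n: int) -> list:
    """Smooth weighted round-robin: each pick, every backend's running score
    grows by its weight; the highest score wins and is reduced by the total."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for name, weight in weights.items():
            current[name] += weight
        best = max(current, key=current.get)  # ties go to the earlier key
        current[best] -= total
        picks.append(best)
    return picks


print(smooth_wrr(TARGET_WEIGHTS, 10))
```

Every block of 10 picks contains exactly 5 Dense, 3 VLM and 2 MoE, interleaved rather than bunched. In practice, busy slots and the Heavy-tier fallback will pull the observed mix away from this target.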

NOTE: Requires a llama.cpp restart on 192.168.68.8 with --ctx-size 131072
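Presumably the restart is needed because llama.cpp allocates its KV cache for the configured context size when the server starts, so a smaller window only frees VRAM once the process is relaunched. The rough estimate below assumes a standard transformer with grouped-query attention and an unquantized f16 cache. The layer count, KV-head count and head dimension are hypothetical placeholders, not the Dense model's real values.

```python
GIB = 2**30


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: one K and one V tensor per layer, each holding
    n_ctx * n_kv_heads * head_dim elements (f16 = 2 bytes per element)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape; substitute the Dense model's actual hyperparameters.
SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128)

old = kv_cache_bytes(196608, **SHAPE)  # 192K context
new = kv_cache_bytes(131072, **SHAPE)  # 128K context (--ctx-size 131072)
print(f"192K: {old / GIB:.0f} GiB, 128K: {new / GIB:.0f} GiB, "
      f"freed: {(old - new) / GIB:.0f} GiB")
```

Because the cache grows linearly with the context length, the cut frees one third of it (64K of 192K tokens) whatever the model shape. Weights, compute buffers and quantized KV cache types are not modelled here.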

2026-05-27 07:15:58 +00:00

Dockerfile

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

http_patch.py

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

requirements.txt

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

router.py

feat: Dense 128K context + 2 slots, VLM second in Heavy tier

2026-05-27 07:15:58 +00:00

router.py.bak.20260518074236

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

ts_patch.py

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00