inference-harness

T

Abiba d53685d874 feat: agent-aware GPU load balancing

select_best_gpu() now spreads different agents across GPUs:
- If agent already has a request on a GPU, prefer other GPUs first
- Tracked via Redis agent_gpu:{agent}:{model} with 120s TTL
- Same agent can still use multiple slots on same GPU if needed
- Falls back to normal priority when only one option available

Prevents Tanko+Mumuni from piling onto MoE simultaneously
while Dense sits idle. Each agent naturally spreads across
available GPUs.

2026-05-28 21:45:23 +00:00

dashboard

fix: default performance window to 24h so all models appear immediately

2026-05-26 12:37:52 +00:00

nginx

feat: per-request performance tracking + /metrics/performance endpoint

2026-05-25 16:50:45 +00:00

router

feat: agent-aware GPU load balancing

2026-05-28 21:45:23 +00:00

.gitignore

May 19, 2026: Full harness update