inference-harness

abiba-bot/inference-harness

Commit Graph

Author	SHA1	Message	Date

Abiba

b849cd3395

feat: per-request performance tracking + /metrics/performance endpoint

router/router.py (+158 lines):
- store_perf_record(): captures queue_ms, inference_ms, prompt_tokens,
  completion_tokens, tokens_per_sec per request in Redis
- Per-model, per-reason, per-agent rolling windows (last 200-500)
- /metrics/performance?window=N endpoint with percentiles (p50/p95/p99)
  for latency, throughput, and queue time per model/reason/agent
- Queue time now surfaced in routing metadata and routes:recent
- Streaming requests tracked with estimated prompt tokens
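The rolling-window and percentile behaviour described in this list could look roughly like the sketch below. It is not the code from router/router.py: an in-memory deque stands in for Redis (with Redis, the same window is commonly kept with LPUSH followed by LTRIM), and the key names, the WINDOW_MAX value, the summarize() helper, and the nearest-rank percentile method are all assumptions made for illustration. Only the field names and store_perf_record() come from the commit message.

```python
from collections import defaultdict, deque

WINDOW_MAX = 500  # assumed cap; the commit only says "last 200-500"

# In-memory stand-in for Redis lists (assumption). Each key keeps only the
# most recent WINDOW_MAX records, like LPUSH + LTRIM would in Redis.
_windows = defaultdict(lambda: deque(maxlen=WINDOW_MAX))


def store_perf_record(model, reason, agent, queue_ms, inference_ms,
                      prompt_tokens, completion_tokens):
    """Append one request's timings to the per-model/reason/agent windows."""
    tokens_per_sec = (completion_tokens / (inference_ms / 1000.0)
                      if inference_ms > 0 else 0.0)
    record = {
        "queue_ms": queue_ms,
        "inference_ms": inference_ms,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "tokens_per_sec": tokens_per_sec,
    }
    # Hypothetical key layout: one window per model, per reason, per agent.
    for key in (f"perf:model:{model}",
                f"perf:reason:{reason}",
                f"perf:agent:{agent}"):
        _windows[key].append(record)
    return record


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct; None if no samples."""
    if not values:
        return None
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(key, window=200):
    """p50/p95/p99 over the most recent `window` records for one key."""
    recent = list(_windows[key])[-window:] if window > 0 else []
    summary = {"count": len(recent)}
    for field in ("queue_ms", "inference_ms", "tokens_per_sec"):
        samples = [r[field] for r in recent]
        summary[field] = {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
    return summary
```

A handler for /metrics/performance?window=N would then call something like summarize("perf:model:<name>", window=N) for each tracked key and return the results as JSON.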

nginx/nginx.conf:
- Added /metrics/ proxy pass to router_api

Enables model performance comparison and routing tier validation.

2026-05-25 16:50:45 +00:00

Abiba

28fc57c5c7

May 19, 2026: Full harness update

- Model migration: gemma-4-E4B → qwen3.5-9b-vlm
- Dashboard reorder: Usage Over Time + GPU Metrics to top
- Router counter leak fix (gpu_decr in except handler)
- VLM slot upgrade 1→2
- Automated maintenance cron job
- LiteLLM config update
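The router counter leak fix in this list can be illustrated with a small sketch. Only the name gpu_decr and the idea of calling it in the except handler come from the commit message; gpu_incr(), gpu_in_flight(), dispatch(), and the module-level counter are hypothetical stand-ins for whatever the router actually uses.

```python
import threading

_lock = threading.Lock()
_gpu_in_flight = 0  # stand-in for the router's shared in-flight counter


def gpu_incr():
    """Count one more request as occupying a GPU slot."""
    global _gpu_in_flight
    with _lock:
        _gpu_in_flight += 1


def gpu_decr():
    """Release one slot, never dropping below zero."""
    global _gpu_in_flight
    with _lock:
        _gpu_in_flight = max(0, _gpu_in_flight - 1)


def gpu_in_flight():
    """Return the current number of counted in-flight requests."""
    with _lock:
        return _gpu_in_flight


def dispatch(call_backend, payload):
    """Run one backend call while keeping the in-flight counter balanced."""
    gpu_incr()
    try:
        result = call_backend(payload)
    except Exception:
        # Without this decrement, a failed request would stay counted
        # forever and the router would see fewer free slots over time.
        gpu_decr()
        raise
    gpu_decr()
    return result
```

A try/finally block would give the same guarantee with a single gpu_decr() call; the except form is shown here only because that is how the commit message describes the fix.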

2026-05-19 15:03:47 +00:00

2 Commits