inference-harness


Abiba cfb05fa501 feat: capture streaming token counts from SSE final chunk

The router now buffers streaming response chunks so it can extract
timings (prompt_n, predicted_n, predicted_ms, predicted_per_second)
from the final SSE data frame before yielding to the client. Streaming
requests now report real throughput data instead of 0 tok/s.

Uses the llama.cpp timings field from the last content chunk:
- completion_tokens = predicted_n
- tokens_per_sec = predicted_per_second
- inference_ms = predicted_ms (generation only)

Client sees identical stream, no perceptible delay.
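
The mapping above could be sketched roughly as follows. This is a minimal illustration, not the router's actual code: the function name relay_sse, the stats dict, and the prompt_tokens key are hypothetical, and it assumes the upstream stream arrives as decoded text lines in which the final JSON data frame carries a llama.cpp-style timings object.

```python
import json
from typing import Iterable, Iterator


def relay_sse(lines: Iterable[str], stats: dict) -> Iterator[str]:
    """Yield every SSE line unchanged, recording the last timings object seen.

    `relay_sse` and `stats` are hypothetical names used for illustration only.
    """
    for line in lines:
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            # "[DONE]" is the stream terminator, not JSON.
            if payload and payload != "[DONE]":
                try:
                    chunk = json.loads(payload)
                except json.JSONDecodeError:
                    chunk = None  # never let a parse failure break the stream
                timings = chunk.get("timings") if isinstance(chunk, dict) else None
                if isinstance(timings, dict):
                    # Field mapping taken from the commit message.
                    stats["completion_tokens"] = timings.get("predicted_n")
                    stats["tokens_per_sec"] = timings.get("predicted_per_second")
                    stats["inference_ms"] = timings.get("predicted_ms")
                    # Key name below is an assumption; the commit only says
                    # prompt_n is extracted.
                    stats["prompt_tokens"] = timings.get("prompt_n")
        yield line  # the client receives exactly what the upstream sent


# Example frames (made-up values) shaped like an OpenAI-style stream.
frames = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [{"delta": {}}], "timings": {"prompt_n": 12, '
    '"predicted_n": 40, "predicted_ms": 800.0, "predicted_per_second": 50.0}}',
    "",
    "data: [DONE]",
]
stats = {}
forwarded = list(relay_sse(frames, stats))
```

Because each line is yielded as soon as it has been inspected, the only extra work per chunk in this sketch is one JSON parse, which is consistent with the claim that the client sees no perceptible delay.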

2026-05-25 19:58:51 +00:00

dashboard

fix: throughput panel handles streaming-only models gracefully

2026-05-25 19:45:21 +00:00

nginx

feat: per-request performance tracking + /metrics/performance endpoint

2026-05-25 16:50:45 +00:00

router

feat: capture streaming token counts from SSE final chunk

2026-05-25 19:58:51 +00:00

.gitignore

May 19, 2026: Full harness update