Abiba cfb05fa501 feat: capture streaming token counts from SSE final chunk
The router now buffers streaming response chunks and extracts timings
(prompt_n, predicted_n, predicted_per_second) from the final
SSE data frame before yielding to the client. Streaming requests
now report real throughput instead of 0 tok/s.

Uses the llama.cpp timings field from the last content chunk:
- completion_tokens = predicted_n
- tokens_per_sec = predicted_per_second
- inference_ms = predicted_ms (generation only)
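
A minimal sketch of how the mapping above might look in Python. The function name extract_timings and the shape of the returned dict are hypothetical and not taken from the router's code; only the timings keys and the three target names come from this commit message. The target name for prompt_n is not stated, so it is passed through under its original key.

```python
import json
from typing import Optional


def extract_timings(sse_frame: str) -> Optional[dict]:
    """Map llama.cpp timings in one SSE frame to router metrics.

    Hypothetical helper: returns None when the frame has no timings.
    """
    for line in sse_frame.splitlines():
        if not line.startswith("data:"):
            continue  # ignore comments, event:, id: and blank lines
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue  # stream terminator carries no JSON
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not JSON, nothing to extract
        timings = chunk.get("timings") if isinstance(chunk, dict) else None
        if not isinstance(timings, dict):
            continue
        return {
            # Target name for prompt_n is an assumption: kept as-is.
            "prompt_n": timings.get("prompt_n"),
            "completion_tokens": timings.get("predicted_n"),
            "tokens_per_sec": timings.get("predicted_per_second"),
            "inference_ms": timings.get("predicted_ms"),  # generation only
        }
    return None
```

Frames without a timings object return None, so a caller can keep the last non-None result as the metrics for the request.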

The client sees an identical stream with no perceptible delay.
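
One way the pass-through could work, sketched under assumptions: relay_stream and the metrics dict are hypothetical names, frames are assumed to be separated by a blank line ("\n\n"), and the real router may structure this differently. Each chunk is inspected and then yielded byte-for-byte, so the client-visible stream is unchanged.

```python
import json
from typing import Iterable, Iterator


def relay_stream(chunks: Iterable[bytes], metrics: dict) -> Iterator[bytes]:
    """Yield upstream chunks unchanged while recording the latest timings.

    Hypothetical sketch: `metrics` is updated in place with the raw
    llama.cpp timings dict from the most recent frame that carries one.
    """
    pending = b""  # bytes of a frame that has not been terminated yet
    for chunk in chunks:
        pending += chunk
        # A network chunk may hold several frames or only part of one.
        while b"\n\n" in pending:
            frame, pending = pending.split(b"\n\n", 1)
            for line in frame.split(b"\n"):
                if not line.startswith(b"data:"):
                    continue
                payload = line[5:].strip()
                if not payload or payload == b"[DONE]":
                    continue
                try:
                    obj = json.loads(payload)
                except ValueError:  # covers JSON and decoding errors
                    continue
                if isinstance(obj, dict) and isinstance(obj.get("timings"), dict):
                    metrics.update(obj["timings"])
        yield chunk  # forwarded exactly as received
```

Because only a copy of the bytes is parsed, the added latency per chunk is limited to the cost of scanning and JSON-decoding complete frames.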
2026-05-25 19:58:51 +00:00
2026-05-19 15:03:47 +00:00
Description
SyslogAI Inference Harness — 3-GPU router, dashboard, LiteLLM proxy
371 KiB
Languages
Python 97.6%
Shell 1.9%
Dockerfile 0.5%