inference-harness



Abiba cfb05fa501 feat: capture streaming token counts from SSE final chunk

Router now buffers streaming response chunks to extract timings
(prompt_n, predicted_n, predicted_per_second) from the final
SSE data frame before yielding to the client. Streaming requests
get real throughput data instead of 0 tok/s.

Uses the llama.cpp timings field in the last content chunk:
- completion_tokens = predicted_n
- tokens_per_sec = predicted_per_second
- inference_ms = predicted_ms (generation only)

Client sees identical stream, no perceptible delay.
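
The mapping above could be sketched roughly as follows. This is not the code in router.py; it is a minimal illustration that assumes each upstream chunk carries whole SSE "data:" lines and that the timings object sits at the top level of the JSON frame. The names parse_sse_timings, relay_stream, and stats are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_sse_timings(chunk: bytes) -> Optional[dict]:
    """Return the `timings` object found in an SSE chunk, or None."""
    found = None
    for line in chunk.decode("utf-8", errors="replace").splitlines():
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        try:
            frame = json.loads(payload)
        except json.JSONDecodeError:
            # A frame split across chunks is simply skipped in this sketch.
            continue
        if isinstance(frame, dict) and isinstance(frame.get("timings"), dict):
            found = frame["timings"]
    return found


def relay_stream(chunks: Iterable[bytes], stats: dict) -> Iterator[bytes]:
    """Yield every chunk unchanged while recording timings into `stats`."""
    for chunk in chunks:
        timings = parse_sse_timings(chunk)
        if timings is not None:
            # Field mapping as listed in the commit message.
            stats["completion_tokens"] = timings.get("predicted_n", 0)
            stats["tokens_per_sec"] = timings.get("predicted_per_second", 0.0)
            stats["inference_ms"] = timings.get("predicted_ms", 0.0)
        yield chunk
```

Because each chunk is yielded as received, the client-facing bytes are unchanged; the stats dictionary is only complete once the generator has been fully consumed.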

2026-05-25 19:58:51 +00:00

Dockerfile

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

http_patch.py

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

requirements.txt

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

router.py

feat: capture streaming token counts from SSE final chunk

2026-05-25 19:58:51 +00:00

router.py.bak.20260518074236

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00

ts_patch.py

May 19, 2026: Full harness update

2026-05-19 15:03:47 +00:00