fix: MoE concurrency 2→1 (95C thermal emergency)

MoE is at 95C with p50 latency of 13s: thermal throttling is causing
a death spiral. Both slots are stuck processing, with p95 at 113s.
Dense is idle at 38C with 2 free slots. Reducing MoE to 1 slot
forces heavy overflow to Dense, giving MoE thermal headroom.

Heavy tier order MoE → Dense → VLM is still valid: the first heavy
request goes to MoE, the second overflows to Dense.
Abiba
2026-05-30 12:52:23 +00:00
parent a3bca93d9b
commit acbcb20837
+1 -1
@@ -19,7 +19,7 @@ GPU_URLS = {
 }
 # Max concurrent requests per GPU (based on llama.cpp --parallel)
 GPU_MAX_CONCURRENT = {
-    "qwen3.6-35B-A3B": 2,  # 2 slots (Dense-first routing reduces thermal load)
+    "qwen3.6-35B-A3B": 1,  # 1 slot (95C thermal emergency)
     "qwen3.6-27B-code": 2,  # 2 slots (128K context frees VRAM)
     "qwen3.5-9b-vlm": 2,  # 2 slots (12GB VRAM, 4GB headroom)
 }
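
The overflow behaviour described in the commit message can be illustrated with a minimal sketch. It is not the router's actual code: `HEAVY_TIER_ORDER`, `pick_gpu`, and the `in_flight` counter are hypothetical names, and the mapping of MoE, Dense, and VLM to the three model keys is an assumption based on the key names.

```python
# Hypothetical sketch of heavy-tier overflow routing; not the repository's code.
# Limits mirror the post-commit GPU_MAX_CONCURRENT values.
GPU_MAX_CONCURRENT = {
    "qwen3.6-35B-A3B": 1,   # assumed to be the MoE GPU
    "qwen3.6-27B-code": 2,  # assumed to be the Dense GPU
    "qwen3.5-9b-vlm": 2,    # assumed to be the VLM GPU
}

# Assumed preference order for heavy requests: MoE -> Dense -> VLM.
HEAVY_TIER_ORDER = ["qwen3.6-35B-A3B", "qwen3.6-27B-code", "qwen3.5-9b-vlm"]


def pick_gpu(in_flight, order=HEAVY_TIER_ORDER, limits=GPU_MAX_CONCURRENT):
    """Return the first GPU in `order` with a free slot, or None if all are full."""
    for name in order:
        if in_flight.get(name, 0) < limits[name]:
            return name
    return None


# Two heavy requests arrive back to back while nothing else is running.
in_flight = {}
first = pick_gpu(in_flight)
in_flight[first] = in_flight.get(first, 0) + 1
second = pick_gpu(in_flight)
in_flight[second] = in_flight.get(second, 0) + 1
```

Under these assumptions the first request takes the single MoE slot and the second falls through to Dense, which matches the behaviour the commit message describes.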