fix: reduce MoE concurrency 2→1 to prevent thermal timeout (94°C)
Strix Halo running qwen3.6-35B-A3B was hitting 94°C with 2 concurrent slots, causing 300s request timeouts. Mumuni + Koby accumulated 15 timeouts in the last hour. Reduced to 1 slot for thermal headroom. Medium and Default tiers already route VLM before MoE as fallback, minimizing overflow traffic to the hot GPU.
This commit is contained in:
+3
-3
@@ -19,7 +19,7 @@ GPU_URLS = {
|
||||
}
|
||||
# Max concurrent requests per GPU (based on llama.cpp --parallel)
|
||||
GPU_MAX_CONCURRENT = {
|
||||
"qwen3.6-35B-A3B": 2, # 2 slots
|
||||
"qwen3.6-35B-A3B": 1, # 1 slot (thermal management: 94C at 2 concurrent)
|
||||
"qwen3.6-27B-code": 1, # 1 slot (24GB VRAM saturated at 256K ctx)
|
||||
"qwen3.5-9b-vlm": 2, # 2 slots (12GB VRAM, 4GB headroom)
|
||||
}
|
||||
@@ -229,7 +229,7 @@ def route(rd, tier):
|
||||
|
||||
# TIER 3: Medium complexity — Dense primary (speed), MoE fallback
|
||||
if t <= 25000 and turns <= 15:
|
||||
candidates = [m for m in ["qwen3.6-27B-code","qwen3.6-35B-A3B","qwen3.5-9b-vlm"] if m in avail]
|
||||
candidates = [m for m in ["qwen3.6-27B-code","qwen3.5-9b-vlm","qwen3.6-35B-A3B"] if m in avail]
|
||||
result = select_best_gpu(candidates, "medium")
|
||||
if result: return result
|
||||
|
||||
@@ -240,7 +240,7 @@ def route(rd, tier):
|
||||
if result: return result
|
||||
|
||||
# TIER 5: Default — balanced distribution: Dense first (speed), MoE second (capacity)
|
||||
candidates = [m for m in ["qwen3.6-27B-code","qwen3.6-35B-A3B","qwen3.5-9b-vlm"] if m in avail]
|
||||
candidates = [m for m in ["qwen3.6-27B-code","qwen3.5-9b-vlm","qwen3.6-35B-A3B"] if m in avail]
|
||||
result = select_best_gpu(candidates, "default")
|
||||
if result: return result
|
||||
return {"model":avail[0],"reason":"last_resort"}
|
||||
|
||||
Reference in New Issue
Block a user