Priority lanes serialize requests.
USER_CHAT > REMINDER > AGENT_COMMS > BACKGROUND. User-active work jumps the queue. Background heartbeats wait their turn.
Every local-model call gets a priority based on its source (chat / reminder / inter-agent / background) and joins the corresponding lane.
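The lanes above can be sketched as a single priority heap, with FIFO ordering inside each lane. A minimal sketch; the `Priority` values and `PriorityLanes` name are hypothetical, not the actual implementation:

```python
import heapq
import itertools
from enum import IntEnum

# Hypothetical priority levels; lower value is served first.
class Priority(IntEnum):
    USER_CHAT = 0
    REMINDER = 1
    AGENT_COMMS = 2
    BACKGROUND = 3

class PriorityLanes:
    """Serialized queue: highest-priority request dequeues first;
    ties break FIFO via a monotonically increasing sequence number."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def enqueue(self, priority: Priority, request):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dequeue(self):
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request

lanes = PriorityLanes()
lanes.enqueue(Priority.BACKGROUND, "heartbeat")
lanes.enqueue(Priority.USER_CHAT, "chat turn")
lanes.enqueue(Priority.REMINDER, "reminder fire")
```

Even though the heartbeat arrived first, the chat turn dequeues ahead of it: user-active work jumps the queue.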
If the active chat is mid-stream on model X and a new request needs model Y, the new request waits. Active chat doesn't get its model evicted.
When the slot frees up, the highest-priority waiting request takes it. Streaming holds the slot until the last token; non-streaming releases on completion.
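The slot rules reduce to: never evict the active holder, queue newcomers by priority, admit the best-ranked waiter on release. A minimal single-slot sketch under those assumptions; `ModelSlot` and its methods are illustrative names:

```python
import heapq
import itertools

class ModelSlot:
    """Single loaded-model slot (hypothetical sketch). The active request
    holds the slot until it completes (streaming included); waiters queue
    by priority and the best-ranked one is admitted on release."""
    def __init__(self):
        self.holder = None               # request currently holding the slot
        self._waiters = []               # heap of (priority, seq, request)
        self._seq = itertools.count()

    def acquire(self, priority, request):
        if self.holder is None:
            self.holder = request        # slot free: run immediately
            return True
        # Slot busy (e.g. active chat mid-stream): queue, never evict.
        heapq.heappush(self._waiters, (priority, next(self._seq), request))
        return False

    def release(self):
        # Called after the final streamed token, or on completion for
        # non-streaming calls. Highest-priority waiter takes the slot.
        self.holder = None
        if self._waiters:
            _, _, request = heapq.heappop(self._waiters)
            self.holder = request
        return self.holder
```

A background heartbeat that queued before a reminder still loses the slot to it on release, because admission is by priority, not arrival order.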
If the model server returns a load failure or crashes, the request retries with exponential backoff. Three failures inside the window trip the health-monitor cooldown.
After the 60-second cooldown, a probe request tests the provider. Pass → traffic resumes. Fail → another cooldown cycle.
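The failure-handling cycle can be sketched as a small circuit breaker: count failures inside a sliding window, trip a 60-second cooldown, then let one probe decide whether traffic resumes. A minimal sketch; the class name, window length, and backoff schedule are assumptions beyond the stated 3-failure / 60-second numbers:

```python
import time

class HealthMonitor:
    """Hypothetical sketch of the retry/cooldown cycle: three failures
    inside the window trip a 60 s cooldown; a probe decides recovery."""
    FAILURE_THRESHOLD = 3
    WINDOW_SECONDS = 60.0
    COOLDOWN_SECONDS = 60.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable clock for testing
        self._failures = []              # timestamps of recent failures
        self._cooldown_until = None

    def record_failure(self):
        now = self._clock()
        # Keep only failures still inside the sliding window.
        self._failures = [t for t in self._failures
                          if now - t < self.WINDOW_SECONDS]
        self._failures.append(now)
        if len(self._failures) >= self.FAILURE_THRESHOLD:
            self._cooldown_until = now + self.COOLDOWN_SECONDS
            self._failures.clear()

    def allow_request(self):
        if self._cooldown_until is None:
            return True
        if self._clock() < self._cooldown_until:
            return False                 # still cooling down
        return True                      # cooldown elapsed: next call probes

    def probe_result(self, ok):
        if ok:
            self._cooldown_until = None  # pass: traffic resumes
        else:                            # fail: another cooldown cycle
            self._cooldown_until = self._clock() + self.COOLDOWN_SECONDS

def backoff_delays(base=0.5, factor=2, retries=3):
    # Per-request exponential backoff schedule (assumed constants).
    return [base * factor**i for i in range(retries)]
```

Injecting the clock keeps the cooldown logic testable without real sleeps; in production the default `time.monotonic` applies.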