We support local LLMs because some teams need them. Air-gapped environments, regulated data, "we paid for the Mac Studio, please let us use it." When teams choose local, we want it to be a first-class experience, not a degraded one.
Here's what we found out the hard way. Local inference servers (Ollama, LM Studio, llama.cpp variants) almost universally handle one request at a time per loaded model. Concurrent requests cause crashes, model swap thrash, GPU memory fragmentation, or all three. The hardware can theoretically run more. The runtime layer serializes everything anyway.
That collides with how PACKWOLF wants to operate. A user is in chat. A heartbeat fires. A reminder triggers. Three contexts, three requests, one local server. Without coordination you get a fight for the GPU, and the user-facing chat loses because everything is first-come, first-served.
The constraint
We thought about a few options:
Throw money at it. Run two model servers on different ports, route concurrent traffic. Doesn't work: most teams running local don't have two GPUs to throw at this. The whole point is the one machine they have.
Use Ollama's built-in queueing. It exists, it works for simple cases, but it's FIFO. Your background heartbeat gets the same priority as your active chat. If a background job is ahead of you in line, you wait.
Build a coordination layer in PACKWOLF. What we ended up doing.
The queue
Every local-model call passes through lib/local-model-queue.ts. Each request enters with a priority based on its source:
```typescript
// lib/local-model-queue.ts
export const RequestPriority = {
  USER_CHAT: 1,   // operator-to-agent web chat, Telegram
  REMINDER: 2,    // time-sensitive system follow-ups
  AGENT_COMMS: 3, // agent-to-agent messages
  BACKGROUND: 4,  // heartbeats, consolidation, warmup
} as const

queue.enqueue({ model, priority, execute })
```

Lower number wins. A USER_CHAT call preempts a BACKGROUND heartbeat that hasn't started yet. There are only four levels on purpose. We tried six early on. The extra granularity meant nothing in practice and made it harder to reason about which class of work was about to lose.
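To make the ordering concrete, here's a minimal sketch of what the dequeue side might look like. The QueuedRequest shape, pickNext, and the seq tie-break are our assumptions for illustration, not PACKWOLF's actual code:

```typescript
type QueuedRequest = {
  model: string
  priority: number // RequestPriority value: 1 (USER_CHAT) .. 4 (BACKGROUND)
  seq: number      // arrival counter; assumed tie-break within a priority class
  execute: () => Promise<unknown>
}

function pickNext(pending: QueuedRequest[]): QueuedRequest | undefined {
  // Lower priority value wins; equal priorities run first-come, first-served.
  return [...pending].sort(
    (a, b) => a.priority - b.priority || a.seq - b.seq
  )[0]
}
```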
Once a request is in flight, it completes. Cancelling a model call mid-stream corrupts state on most local servers, so the slot is held atomically for the whole call, streaming included. Sounds obvious; getting it right took two iterations because the first version had a race where a streaming call could be marked complete by the wrong handler.
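Here's a sketch of that slot discipline, assuming a per-call token as the ownership check. The token trick is one way to close the wrong-handler race; the names are illustrative:

```typescript
let inFlight: symbol | null = null

async function runExclusively(execute: () => Promise<unknown>): Promise<void> {
  if (inFlight) throw new Error('slot already held') // dispatcher bug if hit
  const token = Symbol('call')
  inFlight = token
  try {
    await execute() // the slot is held for the whole call, streaming included
  } finally {
    // Only the handler that claimed the slot may release it; a stale
    // handler's token no longer matches, so it can't free someone else's slot.
    if (inFlight === token) inFlight = null
  }
}
```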
Model affinity guards
Here's a failure mode we didn't see coming. The user is mid-chat on model X. A background heartbeat fires that needs model Y. The priority queue doesn't stop it: priorities only order waiting requests, and the moment the user's in-flight turn completes (atomic, remember), the queued heartbeat takes the slot. It needs a different model, so the local server unloads X to load Y.
Now the user's next turn has to reload model X. Reloading a 70B-parameter model takes 30+ seconds. The user sees their chat stall.
The fix: model affinity guards. If the active chat is using model X and a background request needs model Y, the background request waits. The active chat keeps its model. Two windows protect it:
- MODEL_WARM_WINDOW_MS (2 min). If the warm model was last touched by a user-chat call within the last 2 minutes, background work cannot evict it.
- USER_ACTIVITY_WINDOW_MS (5 min). A wider guard: if there's been any user-side activity at all in the last 5 minutes, background model switches are deferred regardless of which model is currently warm.
Deferred background requests stay in the queue. A retry timer fires when the longer of the two windows expires.
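A sketch of how the two guards and the retry timer could compose. The window constants and their values come from above; AffinityState, backgroundMayLoad, and retryDelayMs are hypothetical names for illustration:

```typescript
const MODEL_WARM_WINDOW_MS = 2 * 60 * 1000
const USER_ACTIVITY_WINDOW_MS = 5 * 60 * 1000

interface AffinityState {
  warmModel: string | null   // model currently loaded on the server
  warmLastUserChatAt: number // last user-chat call that used warmModel
  lastUserActivityAt: number // any user-side activity at all
}

// May a background request load `model`, possibly evicting the warm one?
function backgroundMayLoad(model: string, s: AffinityState, now = Date.now()): boolean {
  if (s.warmModel === null || model === s.warmModel) return true // no eviction needed
  if (now - s.warmLastUserChatAt < MODEL_WARM_WINDOW_MS) return false  // guard 1
  if (now - s.lastUserActivityAt < USER_ACTIVITY_WINDOW_MS) return false // guard 2
  return true
}

// Deferred requests stay queued; retry once the longer window has expired.
function retryDelayMs(s: AffinityState, now = Date.now()): number {
  return Math.max(
    s.warmLastUserChatAt + MODEL_WARM_WINDOW_MS - now,
    s.lastUserActivityAt + USER_ACTIVITY_WINDOW_MS - now,
    0
  )
}
```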
Exponential backoff for transient failures
Local servers fail in ways cloud APIs mostly don't: "model failed to load," "the model has crashed," "connection refused" because the server is restarting. Retrying 2 seconds later usually works. We do exponential backoff (2s base, 15s ceiling). Three failures in a row trips a 60-second health-monitor cooldown; after that a probe request decides whether traffic resumes or the cooldown cycles again.
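For concreteness, a sketch of that policy using the numbers above. withRetries and startCooldown are illustrative names; the real health monitor presumably lives outside the retry loop:

```typescript
const BASE_MS = 2_000
const CEILING_MS = 15_000
const COOLDOWN_MS = 60_000
const FAILURES_BEFORE_COOLDOWN = 3

// attempt 0 -> 2s, attempt 1 -> 4s, attempt 2 -> 8s, then clamped at 15s
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_MS * 2 ** attempt, CEILING_MS)
}

async function withRetries(
  call: () => Promise<void>,
  startCooldown: (ms: number) => void
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch {
      if (attempt + 1 >= FAILURES_BEFORE_COOLDOWN) {
        // Hand off to the health monitor: a 60s cooldown, after which a
        // probe decides whether traffic resumes or the cooldown cycles again.
        startCooldown(COOLDOWN_MS)
        throw new Error('local model server unhealthy')
      }
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)))
    }
  }
}
```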
HMR-resilient state
Last detail. In dev, Next.js's hot module reload resets module state on every save. The first version of the queue lived in a module-scoped variable. Save a file, queue resets, in-flight streaming requests lose their slot tracking and start the model swap thrash all over again. Fun to watch on a 70B model.
Fix: queue state lives on globalThis. HMR doesn't clear globals. The queue survives every dev save. Sounds janky; works perfectly.
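A minimal sketch of the pattern; the property name and class body are placeholders:

```typescript
// The class body is a stand-in; the real queue's state is richer.
class LocalModelQueue {
  pending: unknown[] = []
  inFlight = false
}

// Stash the singleton on globalThis: HMR resets module scope, not globals.
const g = globalThis as typeof globalThis & { __localModelQueue?: LocalModelQueue }

export const queue = (g.__localModelQueue ??= new LocalModelQueue())
```

It's the same trick commonly used to keep a single Prisma client alive across Next.js dev reloads, for the same reason.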
What this gives you
A team can put their pack on a single shared local server and the user-facing experience stays sharp. Background agents wait. Cross-agent comms wait. The active chat doesn't. Model affinity keeps the loaded model loaded. Transient failures retry. Every queue wait shows up in the activity trace, so you can see exactly where time went and which class of work was holding the line.
Most local inference runtimes don't do any of this themselves. That's why we wrote the layer that does.