Capability · Execution layer

Your GPU stays sane. User chat jumps the queue.

When a team chooses local LLMs (Ollama, LM Studio), PACKWOLF's priority queue prevents the GPU thrashing that kills shared inference. Request priority lanes, model affinity guards, exponential-backoff retries, and HMR-resilient queue state keep background work from ever evicting your active chat.

4-tier · Priority lanes
5s · Switching delay
Exp. backoff · Retry policy
HMR-resilient · Queue state
[Screenshot: packwolf.app · Local models]
AI Models settings. Provider health, queue depth, model affinity, all visible. Test connections without leaving the page.
What it actually does

The parts that make this work.

Priority lanes serialize requests.

USER_CHAT > REMINDER > AGENT_COMMS > BACKGROUND. User-active work jumps the queue. Background heartbeats wait their turn.
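As a sketch, a four-tier lane ordering can be as small as an enum plus a tie-break rule. Everything here (names, values, the FIFO tie-break) is illustrative, not PACKWOLF's actual internals.

```ts
// Illustrative sketch: lane ordering for a priority queue (names are assumptions).
enum Lane {
  USER_CHAT = 0,   // highest priority: user-active work
  REMINDER = 1,
  AGENT_COMMS = 2,
  BACKGROUND = 3,  // lowest priority: heartbeats, housekeeping
}

interface QueuedRequest {
  lane: Lane;
  enqueuedAt: number;
  run: () => Promise<void>;
}

// Pick the next request: lowest lane value first, then FIFO within a lane.
function nextRequest(queue: QueuedRequest[]): QueuedRequest | undefined {
  return [...queue].sort(
    (a, b) => a.lane - b.lane || a.enqueuedAt - b.enqueuedAt
  )[0];
}
```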

Model affinity guards block eviction.

If your active chat is using model X, background work that needs model Y waits, instead of forcing a switch that kicks your model out of GPU memory.
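One way such a guard could look, assuming the dispatcher knows which model the active chat is holding. The names are hypothetical, and the rule that only user-chat work may force a model switch is our reading of the behavior described above.

```ts
// Hypothetical affinity guard: defer work that would evict the active chat's model.
interface ActiveSession {
  model: string;       // model currently held by the user's chat
  streaming: boolean;  // true while a response is still streaming
}

function canDispatch(
  requestedModel: string,
  isUserChat: boolean,
  active: ActiveSession | null
): boolean {
  if (active === null) return true;                  // nothing to protect
  if (requestedModel === active.model) return true;  // same model, no switch needed
  return isUserChat;  // assumption: only user-active work may force a model switch
}
```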

Exponential backoff handles transient failures.

Base 2s, max 15s. Catches "model failed to load" and "server crashed" errors automatically. Retries don't pile up; they back off.
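The schedule itself is small. A sketch with the documented base and cap; the retry count and the catch-everything error handling here are assumptions.

```ts
// Backoff schedule sketch: base 2s, doubling per attempt, capped at 15s.
const BASE_MS = 2_000;
const MAX_MS = 15_000;

function backoffDelay(attempt: number): number {
  return Math.min(BASE_MS * 2 ** attempt, MAX_MS);
}

// attempt 0 → 2s, 1 → 4s, 2 → 8s, 3+ → 15s (capped)
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
  throw lastError;
}
```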

Streaming spans the slot.

Streaming requests hold the queue slot until the stream completes. No mid-stream evictions, no half-finished responses.
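Mechanically, that just means the slot release lives in the stream's completion path. A minimal sketch under that assumption; the callback names are illustrative.

```ts
// Sketch: the queue slot is released only after the stream finishes (or errors).
async function runStreaming(
  stream: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
  releaseSlot: () => void
): Promise<void> {
  try {
    for await (const chunk of stream) {
      onChunk(chunk); // slot stays held while chunks are still arriving
    }
  } finally {
    releaseSlot();    // no mid-stream eviction: release happens here, not at dispatch
  }
}
```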

HMR-resilient state.

Queue state lives on globalThis so dev hot-reloads don't reset it mid-stream. Saving a file doesn't kill your model session.
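This is the standard globalThis singleton pattern for dev servers with hot module replacement; the __llmQueueState key and its fields below are made up for illustration.

```ts
// Common dev-server pattern: keep singleton state on globalThis so HMR
// module re-evaluation doesn't recreate the queue mid-stream.
type QueueState = { pendingCount: number; activeModel: string | null };

const g = globalThis as typeof globalThis & { __llmQueueState?: QueueState };

export const queueState: QueueState =
  g.__llmQueueState ?? (g.__llmQueueState = { pendingCount: 0, activeModel: null });
```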

Health monitor cuts losses.

Three failures within the window → 60-second cooldown. The provider stops getting traffic until it's healthy again.
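A sketch of that policy in TypeScript. The page doesn't state the failure-window length, so the 60-second window below is an assumption, as are the class and field names.

```ts
// Sketch of a failure-window health monitor: three failures inside the window
// trip a 60-second cooldown during which the provider receives no traffic.
const FAILURE_THRESHOLD = 3;
const WINDOW_MS = 60_000;   // assumed window length; the page only says "within window"
const COOLDOWN_MS = 60_000;

class ProviderHealth {
  private failures: number[] = [];  // timestamps of recent failures
  private cooldownUntil = 0;

  recordFailure(now = Date.now()): void {
    this.failures = this.failures.filter((t) => now - t < WINDOW_MS);
    this.failures.push(now);
    if (this.failures.length >= FAILURE_THRESHOLD) {
      this.cooldownUntil = now + COOLDOWN_MS;
      this.failures = [];
    }
  }

  isHealthy(now = Date.now()): boolean {
    return now >= this.cooldownUntil;
  }
}
```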

How it works

The path through local models.

  1. Request enters the queue.

    Every local-model call gets a priority based on its source (chat / reminder / inter-agent / background). It joins the right lane.

  2. Affinity check.

    If the active chat is mid-stream on model X and a new request needs model Y, the new request waits. Active chat doesn't get its model evicted.

  3. Slot available.

    Highest-priority waiting request takes the slot. Streaming holds the slot until done; non-streaming releases on completion.

  4. Failure → backoff.

    If the model server returns a load-failure error or crashes, the request retries with exponential backoff. Three failures inside the window trip the health-monitor cooldown.

  5. Cooldown ends, traffic resumes.

    After the 60-second cooldown, a probe request tests the provider. Pass → traffic resumes. Fail → another cooldown cycle.
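To make step 5 concrete, a hedged sketch of a cooldown probe, using Ollama's model-list endpoint as an example of a cheap health call; PACKWOLF's actual probe request may look different.

```ts
// Sketch of a post-cooldown probe (endpoint choice and timeout are examples).
async function probeProvider(baseUrl: string): Promise<boolean> {
  try {
    // Any cheap call that exercises the model server works; Ollama exposes GET /api/tags.
    const res = await fetch(`${baseUrl}/api/tags`, { signal: AbortSignal.timeout(5_000) });
    return res.ok;  // pass → traffic resumes
  } catch {
    return false;   // fail → another cooldown cycle
  }
}
```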

Common questions

Things engineers actually ask.

Do we have to run local models?

No. Local LLMs are an option, not a requirement. PACKWOLF works with Claude, OpenAI, our in-house model, or local models. Your call, per agent.

Source: docs/local_model_opt.md

See it in your workspace.

Closed-beta cohorts are small. Tell us what you'd want this capability to handle for your team.

Request beta access