Capability · Execution layer

Your GPU stays sane. User chat jumps the queue.

When a team chooses local LLMs (Ollama, LM Studio), PACKWOLF's priority queue prevents the GPU thrashing that kills shared inference. Request priority lanes, model affinity guards, exponential-backoff retries, and HMR-resilient queue state keep background work from ever evicting your active chat.

4-tier · Priority lanes
5s · Switching delay
Exp. backoff · Retry policy
HMR-resilient · Queue state
[Screenshot: packwolf.app · Local models]
AI Models settings. Provider health, queue depth, model affinity, all visible. Test connections without leaving the page.
What it actually does

The parts that make this work.

Priority lanes serialize requests.

USER_CHAT > REMINDER > AGENT_COMMS > BACKGROUND. User-active work jumps the queue. Background heartbeats wait their turn.
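As a sketch, a four-tier lane ordering can be as small as an enum plus a tie-break rule. Everything here (names, values, the FIFO tie-break) is illustrative, not PACKWOLF's actual internals.

```ts
// Illustrative sketch: lane ordering for a priority queue (names are assumptions).
enum Lane {
  USER_CHAT = 0,   // highest priority: user-active work
  REMINDER = 1,
  AGENT_COMMS = 2,
  BACKGROUND = 3,  // lowest priority: heartbeats, housekeeping
}

interface QueuedRequest {
  lane: Lane;
  enqueuedAt: number;
  run: () => Promise<void>;
}

// Pick the next request: lowest lane value first, then FIFO within a lane.
function nextRequest(queue: QueuedRequest[]): QueuedRequest | undefined {
  return [...queue].sort(
    (a, b) => a.lane - b.lane || a.enqueuedAt - b.enqueuedAt
  )[0];
}
```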

Model affinity guards block eviction.

If your active chat is using model X, background work that needs model Y waits, instead of forcing a switch that kicks your model out of GPU memory.
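One way such a guard could look, assuming the dispatcher knows which model the active chat is holding. The names are hypothetical, and the rule that only user-chat work may force a model switch is our reading of the behavior described above.

```ts
// Hypothetical affinity guard: defer work that would evict the active chat's model.
interface ActiveSession {
  model: string;       // model currently held by the user's chat
  streaming: boolean;  // true while a response is still streaming
}

function canDispatch(
  requestedModel: string,
  isUserChat: boolean,
  active: ActiveSession | null
): boolean {
  if (active === null) return true;                  // nothing to protect
  if (requestedModel === active.model) return true;  // same model, no switch needed
  return isUserChat;  // assumption: only user-active work may force a model switch
}
```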

Exponential backoff handles transient failures.

Base 2s, max 15s. Catches "model failed to load" and "server crashed" errors automatically. Retries don't pile up; they back off.
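The schedule itself is small. A sketch with the documented base and cap; the retry count and the catch-everything error handling here are assumptions.

```ts
// Backoff schedule sketch: base 2s, doubling per attempt, capped at 15s.
const BASE_MS = 2_000;
const MAX_MS = 15_000;

function backoffDelay(attempt: number): number {
  return Math.min(BASE_MS * 2 ** attempt, MAX_MS);
}

// attempt 0 → 2s, 1 → 4s, 2 → 8s, 3+ → 15s (capped)
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
  throw lastError;
}
```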

Streaming spans the slot.

Streaming requests hold the queue slot until the stream completes. No mid-stream evictions, no half-finished responses.
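Mechanically, that just means the slot release lives in the stream's completion path. A minimal sketch under that assumption; the callback names are illustrative.

```ts
// Sketch: the queue slot is released only after the stream finishes (or errors).
async function runStreaming(
  stream: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
  releaseSlot: () => void
): Promise<void> {
  try {
    for await (const chunk of stream) {
      onChunk(chunk); // slot stays held while chunks are still arriving
    }
  } finally {
    releaseSlot();    // no mid-stream eviction: release happens here, not at dispatch
  }
}
```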

HMR-resilient state.

Queue state lives on globalThis so dev hot-reloads don't reset it mid-stream. Saving a file doesn't kill your model session.
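This is the standard globalThis singleton pattern for dev servers with hot module replacement; the __llmQueueState key and its fields below are made up for illustration.

```ts
// Common dev-server pattern: keep singleton state on globalThis so HMR
// module re-evaluation doesn't recreate the queue mid-stream.
type QueueState = { pendingCount: number; activeModel: string | null };

const g = globalThis as typeof globalThis & { __llmQueueState?: QueueState };

export const queueState: QueueState =
  g.__llmQueueState ?? (g.__llmQueueState = { pendingCount: 0, activeModel: null });
```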

Health monitor cuts losses.

Three failures within the window → 60-second cooldown. The provider stops getting traffic until it's healthy again.
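A sketch of that policy in TypeScript. The page doesn't state the failure-window length, so the 60-second window below is an assumption, as are the class and field names.

```ts
// Sketch of a failure-window health monitor: three failures inside the window
// trip a 60-second cooldown during which the provider receives no traffic.
const FAILURE_THRESHOLD = 3;
const WINDOW_MS = 60_000;   // assumed window length; the page only says "within window"
const COOLDOWN_MS = 60_000;

class ProviderHealth {
  private failures: number[] = [];  // timestamps of recent failures
  private cooldownUntil = 0;

  recordFailure(now = Date.now()): void {
    this.failures = this.failures.filter((t) => now - t < WINDOW_MS);
    this.failures.push(now);
    if (this.failures.length >= FAILURE_THRESHOLD) {
      this.cooldownUntil = now + COOLDOWN_MS;
      this.failures = [];
    }
  }

  isHealthy(now = Date.now()): boolean {
    return now >= this.cooldownUntil;
  }
}
```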

How it works

The path through local models.

  1. Request enters the queue.

    Every local-model call gets a priority based on its source (chat / reminder / inter-agent / background). It joins the right lane.

  2. Affinity check.

    If the active chat is mid-stream on model X and a new request needs model Y, the new request waits. Active chat doesn't get its model evicted.

  3. Slot available.

    Highest-priority waiting request takes the slot. Streaming holds the slot until done; non-streaming releases on completion.

  4. Failure → backoff.

    If the model server returns a load-failure error or crashes, the request retries with exponential backoff. Three failures inside the window trip the health-monitor cooldown.

  5. Cooldown ends, traffic resumes.

    After the 60-second cooldown, a probe request tests the provider. Pass → traffic resumes. Fail → another cooldown cycle.
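To make step 5 concrete, a hedged sketch of a cooldown probe, using Ollama's model-list endpoint as an example of a cheap health call; PACKWOLF's actual probe request may look different.

```ts
// Sketch of a post-cooldown probe (endpoint choice and timeout are examples).
async function probeProvider(baseUrl: string): Promise<boolean> {
  try {
    // Any cheap call that exercises the model server works; Ollama exposes GET /api/tags.
    const res = await fetch(`${baseUrl}/api/tags`, { signal: AbortSignal.timeout(5_000) });
    return res.ok;  // pass → traffic resumes
  } catch {
    return false;   // fail → another cooldown cycle
  }
}
```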

Common questions

Things engineers actually ask.

Do we have to run local models?

No. Local LLMs are an option, not a requirement. PACKWOLF works with Claude, OpenAI, our in-house model, or local models. Your call, per agent.

Source: docs/local_model_opt.md

See it in your workspace.

Closed-beta cohorts are small. Tell us what you'd want this capability to handle for your team.

Request beta access