IQBAL.LOG

AI // Applied AI & machine learning, explained twice.

Distributed State in Large-Scale LLM Orchestration

How we held p99 cross-region reads under 50ms for ephemeral model state, and why most teams reach for consensus protocols they do not need.

12 May 2026 3 min read Human-written

When you put a large language model behind a global API, the model weights are the easy part. They are immutable, cacheable, and embarrassingly parallel. The hard part is everything that changes during a request: the conversation buffer, the tool-call ledger, the partial KV cache, the rate-limit counters. That is state, and state is where distributed systems go to die.

TL;DR // Key Takeaways
  • 01 Most LLM “state” is session-scoped and ephemeral — it does not need linearizable consensus.
  • 02 Pin a session to a region with sticky routing; replicate lazily, reconcile on failover.
  • 03 Reserve Raft/Paxos for the small set of values that genuinely must be globally agreed (quotas, billing).
  • 04 We cut p99 cross-region read latency from 180ms to 47ms by removing a consensus layer, not adding one.

The mistake: treating chat state like a bank ledger

The instinct of a senior engineer who has done finance work — and I have spent years there — is to reach for strong consistency by default. If you have ever debugged a double-spend, linearizability feels like safety. So the first design we shipped put every session’s state machine behind a Raft group.

It worked. It was also four times slower than it needed to be, because we were paying the cost of global agreement for data that exactly one user would ever read, within a few seconds, from one region.

The reframe: classify state by who reads it

The unlock was boring and effective: we drew a table of every piece of state and asked two questions of each row. Who reads it? How bad is a stale read?

state-policy.ts
type Consistency = 'session-local' | 'eventually-global' | 'linearizable';

interface StatePolicy {
  key: string;
  // Question 1: who reads this piece of state?
  readers: 'single-session' | 'cross-session';
  // Question 2: how bad is a stale read?
  staleReadCost: 'invisible' | 'annoying' | 'incorrect';
  // The consistency level that follows from the two answers.
  consistency: Consistency;
}

const policies: StatePolicy[] = [
  { key: 'conversation_buffer',  readers: 'single-session', staleReadCost: 'invisible', consistency: 'session-local' },
  { key: 'tool_call_ledger',     readers: 'single-session', staleReadCost: 'annoying',  consistency: 'session-local' },
  { key: 'usage_quota',          readers: 'cross-session',  staleReadCost: 'incorrect', consistency: 'linearizable' },
  { key: 'model_router_weights', readers: 'cross-session',  staleReadCost: 'annoying',  consistency: 'eventually-global' },
];

Once you sort by that last column, the architecture writes itself. The vast majority of rows are session-local. They want sticky routing, not consensus.
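
The two questions can be folded into a small decision rule. What follows is a minimal sketch, not the rule we run in production: the function name `classify` is hypothetical, and the rule itself is only inferred from the four example rows in state-policy.ts, so a real table may need exceptions.

```typescript
type Readers = "single-session" | "cross-session";
type StaleReadCost = "invisible" | "annoying" | "incorrect";
type Consistency = "session-local" | "eventually-global" | "linearizable";

// Hypothetical rule inferred from the example rows; treat it as a starting point.
function classify(readers: Readers, staleReadCost: StaleReadCost): Consistency {
  // Only one session ever reads it: no cross-region agreement is needed.
  if (readers === "single-session") return "session-local";
  // Shared state where a stale read is simply wrong must be globally agreed.
  if (staleReadCost === "incorrect") return "linearizable";
  // Shared state that tolerates some staleness can converge lazily.
  return "eventually-global";
}

console.log(classify("single-session", "annoying")); // session-local
console.log(classify("cross-session", "incorrect")); // linearizable
console.log(classify("cross-session", "annoying")); // eventually-global
```

The ordering of the checks is the point: the reader question is asked first, so session-scoped state never reaches the consensus branch, whatever its stale-read cost.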

Session-local: pin and replicate lazily

We route a session to a home region on first contact and keep it there with a signed routing token. State lives in that region’s fast store. We replicate asynchronously to one neighbour purely for failover. If the home region dies mid-conversation, the user gets one slightly stale turn — annoying, not incorrect — and we reconcile forward.
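
One way to build a signed routing token is an HMAC over the session ID and home region. The sketch below assumes Node.js and its built-in `crypto` module. The token layout, the function names, the region names and the hard-coded secret are all hypothetical and chosen for illustration; this is not a description of our actual token format.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: a shared signing key, hard-coded here only to keep the sketch runnable.
const SECRET = "demo-secret-do-not-use";

// Hypothetical token layout: "<sessionId>.<homeRegion>.<hex HMAC-SHA256>".
// Assumes neither the session ID nor the region name contains a ".".
function sign(payload: string): string {
  return createHmac("sha256", SECRET).update(payload).digest("hex");
}

// Issued on first contact, once a home region has been chosen.
function issueRoutingToken(sessionId: string, homeRegion: string): string {
  const payload = `${sessionId}.${homeRegion}`;
  return `${payload}.${sign(payload)}`;
}

// Returns the pinned region, or null if the token is malformed or was tampered with.
function homeRegionFromToken(token: string): string | null {
  const macStart = token.lastIndexOf(".");
  if (macStart < 0) return null;
  const payload = token.slice(0, macStart);
  const regionStart = payload.indexOf(".");
  if (regionStart < 0) return null;

  const expected = Buffer.from(sign(payload), "hex");
  const given = Buffer.from(token.slice(macStart + 1), "hex");
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  if (given.length !== expected.length) return null;
  return timingSafeEqual(given, expected) ? payload.slice(regionStart + 1) : null;
}

const token = issueRoutingToken("sess-1", "eu-west");
console.log(homeRegionFromToken(token)); // eu-west
console.log(homeRegionFromToken(token.replace("eu-west", "us-east"))); // null
```

Because the edge can verify the token without a lookup, any region can forward a request to the session's home without consulting shared state.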

Linearizable: keep the set tiny

Only quotas and billing counters genuinely need global agreement. Those, and only those, sit behind the consensus layer. Because the set is small and writes are infrequent relative to token streaming, the round-trip cost is amortised across thousands of cheap session-local operations.
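
The amortisation argument is easy to check with back-of-envelope arithmetic. In the sketch below the per-operation costs and the 999-to-1 mix are hypothetical numbers picked to make the arithmetic obvious. They are not measurements from our system, and `meanCostMs` is an illustrative name.

```typescript
type Consistency = "session-local" | "eventually-global" | "linearizable";

// Hypothetical per-operation costs in milliseconds; illustrative, not measured.
const COST_MS: Record<Consistency, number> = {
  "session-local": 1,
  "eventually-global": 1,
  "linearizable": 100,
};

// Mean cost per operation for a given mix of operations.
function meanCostMs(ops: Consistency[]): number {
  if (ops.length === 0) return 0;
  const total = ops.reduce((sum, op) => sum + COST_MS[op], 0);
  return total / ops.length;
}

// One consensus write (say, a quota update) per 999 session-local operations.
const mix: Consistency[] = [
  ...new Array<Consistency>(999).fill("session-local"),
  "linearizable",
];

console.log(meanCostMs(mix)); // (999 * 1 + 100) / 1000 = 1.099
```

Under these assumed numbers, routing every operation through consensus would cost 100 ms on average, while the mixed workload averages about 1.1 ms.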

The result

The latency win came from deletion. We removed consensus from the hot path and p99 cross-region read latency dropped from 180ms to 47ms. The lesson generalises well beyond LLMs: the cheapest distributed system is the one you talk yourself out of building.
