When you put a large language model behind a global API, the model weights are the easy part. They are immutable, cacheable, and embarrassingly parallel. The hard part is everything that changes during a request: the conversation buffer, the tool-call ledger, the partial KV cache, the rate-limit counters. That is state, and state is where distributed systems go to die.
- 01 Most LLM “state” is session-scoped and ephemeral — it does not need linearizable consensus.
- 02 Pin a session to a region with sticky routing; replicate lazily, reconcile on failover.
- 03 Reserve Raft/Paxos for the small set of values that genuinely must be globally agreed (quotas, billing).
- 04 We cut p99 cross-region read latency from 180ms to 47ms by removing a consensus layer, not adding one.
The mistake: treating chat state like a bank ledger
The instinct of a senior engineer who has done finance work — and I have spent years there — is to reach for strong consistency by default. If you have ever debugged a double-spend, linearizability feels like safety. So the first design we shipped put every session’s state machine behind a Raft group.
It worked. It was also four times slower than it needed to be, because we were paying the cost of global agreement for data that exactly one user would ever read, within a few seconds, from one region.
The reframe: classify state by who reads it
The unlock was boring and effective — we drew a table of every piece of state and asked two questions: who reads it and how bad is a stale read.
type Consistency = 'session-local' | 'eventually-global' | 'linearizable';
interface StatePolicy { key: string; readers: 'single-session' | 'cross-session'; staleReadCost: 'invisible' | 'annoying' | 'incorrect'; consistency: Consistency;}
const policies: StatePolicy[] = [ { key: 'conversation_buffer', readers: 'single-session', staleReadCost: 'invisible', consistency: 'session-local' }, { key: 'tool_call_ledger', readers: 'single-session', staleReadCost: 'annoying', consistency: 'session-local' }, { key: 'usage_quota', readers: 'cross-session', staleReadCost: 'incorrect', consistency: 'linearizable' }, { key: 'model_router_weights',readers: 'cross-session', staleReadCost: 'annoying', consistency: 'eventually-global' },];Once you sort by that last column, the architecture writes itself. The vast
majority of rows are session-local. They want sticky routing, not
consensus.
Session-local: pin and replicate lazily
We route a session to a home region on first contact and keep it there with a
signed routing token. State lives in that region’s fast store. We replicate
asynchronously to one neighbour purely for failover. If the home region dies
mid-conversation, the user gets one slightly stale turn — annoying, not
incorrect — and we reconcile forward.
Linearizable: keep the set tiny
Only quotas and billing counters are truly global. Those — and only those — sit behind the consensus layer. Because the set is small and writes are infrequent relative to token streaming, the round-trip cost is amortised across thousands of cheap session-local operations.
The result
The latency win came from deletion. We removed consensus from the hot path and the p99 cross-region read dropped from 180ms to 47ms. The lesson generalises well beyond LLMs: the cheapest distributed system is the one you talk yourself out of building.