Why Your LLM Eval Harness Is Quietly Lying to You
Offline eval scores that climb while production quality flatlines are the default failure mode of applied AI. Here is how the gap opens, and how to close it.
I take startups and scale-ups from zero to scale, pairing AI-first delivery with classical systems-design foundations — and I write about the engineering and the human systems around it, including raising an autistic child and the discipline of training. A technical archive for engineers and the AI-curious alike — written so both walk away with the mechanics.