Inference engineering and the real-world deployment of LLMs, with Philip Kiely
Patrick McKenzie (patio11) and Philip Kiely, early employee at Baseten, discuss the inference stack: the critical layer of software and hardware that sits between a model’s weights and a user’s prompt. They cover inference engineering, how intermediate layers are evolving atop a technical stack that is changing every six months, and how sophisticated organizations are actually consuming LLMs beyond just writing their questions into chatbot apps.
Presenting Sponsors: Mercury, Meter, & Granola
Complex Systems is presented by Mercury—radically better banking for founders. Mercury offers the best wire experience anywhere: fast, reliable, and free for domestic U.S. wires, so you can stay focused on growing your business. Apply online in minutes at mercury.com.
Networking infrastructure has a way of accumulating technical debt faster than almost anything else in IT. Meter handles the full stack (wired, wireless, and cellular) as a single integrated solution: designed, deployed, and managed end-to-end so there's only one vendor to call when something goes wrong. Visit meter.com/complexsystems to book a demo.
If meetings consistently leave you with hazy action items and lost context, Granola handles the transcription so you can actually participate and gives you searchable notes afterward. Try it free at granola.ai/complexsystems with code COMPLEXSYSTEMS.
Timestamps:
(00:00) Intro
(00:30) The AI deployment pipeline
(03:04) Evolution of abstraction layers in engineering
(05:14) Defining inference and model weights
(08:45) Architecture of language and diffusion models
(10:11) AI adoption in the broader economy
(11:30) The shift toward agentic workflows and RL
(14:55) Function calling and real-world actions
(20:10) Sponsors: Mercury | Meter
(22:59) Technologies for agentic tools: MCP and skills
(25:32) The craft of writing a harness
(29:56) Using AI for automated proofreading and tool creation
(34:12) Balancing LLMs with deterministic code
(37:31) Observability and chain of thought reasoning
(39:31) Sponsor: Granola
(41:21) Observability and chain of thought reasoning
(50:45) Speculative decoding and hidden states
(55:37) The value of smaller, task-specific models
(59:55) Internal competencies versus buying solutions
(01:09:27) Self-publishing a technical book in record time
(01:23:20) Wrap
Transcript
Patrick: Hideho, everybody. My name is Patrick McKenzie, better known as patio11 on the internet, and I’m here with my buddy Philip Kiely, who’s the author of Inference Engineering and an early employee at Baseten, an inference platform.
Philip: Hey, Patrick, great to be here today.
Patrick: Great to have you as well. There’s quite a bit of discourse about how AI and LLMs are going to impact the real economy and society at large. Often that discourse properly involves people in many places in civil society, including many that don’t write code on a day-to-day basis.
I think actually writing code against these platforms on a day-to-day basis does inform one on how they actually work in the real world. So I thought we’d have a more-technical-than-usual discussion for the benefit of people who would be well served by hearing explicitly how these systems are going to work within large companies. How is this going to be deployed into the real economy?
The AI deployment pipeline
Patrick: To start us off, let’s talk about the pipeline that gets us from a couple of engineers or researchers sitting in one of the AI labs. They’ve got a blank whiteboard and they’re thinking, "Okay, we’ve got a new model name to put on top of the blank whiteboard, and it’s gonna do some stuff... insert Underpants gnomes here... profit." And then it has an impact on the real economy.
My sketch—as someone whose CS degree is increasingly out of date and whose undergrad concentration in AI is 20-plus years out of date—is that this looks something like a training step, which is this multi-month pipeline that typically involves some novel advances in basic science.
[Patrick notes: In principle you can train models without needing to do any novel science. In practice, the leading labs are the leading labs because they’re pushing the frontier every week. Some of that work is classically science, and some is cruftier engineering, and some is engineering that is not called science basically because the institutions which society expects to conduct science are incapable of it and therefore define it out of the field.]
We then maybe have a post-training fine-tuning step, which might be conducted at the lab, at a customer, or at a platform in between the lab and the customer. And then we have where the rubber hits the road: some prompt writing, some harness writing, and some other engineering work that goes into making this model consumable at an actual business operating in the real economy. That is inference, which is similar to the runtime versus compile-time distinction from previous generations of computer use, but different enough that it deserves its own nomenclature and discussion.
[Patrick notes: On reflection, it’s closer to the deployment cycle for software prior to the Internet rehoming much software to servers under control of the writers of it. In that cycle, there was typically a year or so of work, that work got reduced to a single artifact shipped to customers, and then customers would use the artifact with relatively minimal interaction with the software company until the next purchase cycle. Although people were most familiar with e.g. Microsoft Word on this model, much software purchased in this fashion was used at companies, including by companies which had substantial engineering teams in charge of operating the software but not writing it specifically.]
Does that taxonomy sound broadly right to you?
Philip: Yeah, that’s broadly correct. It’s interesting for me to hear it compared to compile-time versus runtime. I am joyously unburdened of knowledge of how the world worked before, but I feel that my computer science degree, while much more recent, is also increasingly out of date in the modern world of what we’re building here.
Evolution of abstraction layers in engineering
Patrick: I think all of us are out of date because all of our educations and professional lives up until 12 months ago assumed that being able to do extremely precise edits on a representation of code was a core part of the value proposition for engineering. The world entirely changed in the last 12 months. Modern coding tools are very good at doing precise edits on complicated representations of code.
[Patrick notes: I cannot stress this point enough: engineering is a highly paid occupation where a very core skill was high-speed high-accuracy symbolic manipulation where some choices are simply wrong. Every engineer up until 12 months ago had a huge portion of the job be banging their heads against a computer interface navigating the minefield of wrong answers while trying to claw something valuable out of the ether. Most of us have had the experience of an entire day wasted over a single misplaced semicolon.
It was thrilling, it was a puzzle, it was a craft, and it is now obviously obsolete. Not engineering! Engineers are fine! But the thing they spent much of their day doing, particularly early in their careers, is now fundamentally transformed. There are over three million professional software developers in the U.S. and their job is now either almost unrecognizable or on a very short trajectory to getting there.
Many policymakers and columnists think that AI is overhyped and that economic impact will, if it arrives, be comfortably in the future. This belief is wrong and people who cling to it, against increasingly mounting evidence, will make costly errors in the next few years.]
The craft of engineering is increasingly turning to how you direct modern tools to edit the code on your behalf, which is not exactly unprecedented. It’s sort of like how "compiler" used to be a job description and not a tool. These days, essentially no one writes assembly by hand. Even the "high-level" languages like C that we wrote on top of compilers and assemblers are things that almost no engineer under the age of 40 would say is a high-level language. We are increasingly moving up abstraction layers. Now we have this kind of meta-abstraction layer that abstracts us away from, potentially, code as a concept.
But digression aside, what’s the day-to-day of doing inference engineering look like?
Philip: It reminds me of when I was assigned a cryptography assignment in college. My professor suggested that we all might like to write our assignment in C. I turned in a Python script along with a brief note explaining that I felt the concepts of the class did not require me to recall how to exactly allocate memory. I didn’t receive a particularly good grade on that assignment, but I’ve always been a believer in working at the highest level of abstraction that still gives me the tools to get the job done in a precise way.
[Patrick notes: Speaking of institutions and their discontents, in industry, attempting to write cryptographic code in C would be considered nearly suicidally risky. This is because industry had many people try it for many decades and paid, well, billions of dollars for the mistakes they made along the way. Many of those mistakes can be rounded to “highly paid people with engineering degrees are surprisingly bad at counting to small numbers.”]
Defining inference and model weights
Philip: A lot of the interest in the AI space in the last couple years has been on those first two steps you were talking about: the training side in particular. How do you get the data? How do you architect the model? But at the end of the day, a model released from a lab is just a file—or a bunch of files because it's too big to be one file. It's a bunch of safe tensors files. It's basically an enormous matrix or set of matrices of numbers.
Patrick: This is what is colloquially called the “weights,” right? We’ve distilled some of human knowledge into an enormously large matrix. To a first approximation, the way attention works is you give a string of characters, perform some heavy matrix multiplication using your GPU of choice, and then it tells you what the next character looks like. That’s hand-waving away an enormous amount of complexity.
[Patrick notes: The seminal paper which kicked off the modern LLM boom is Attention is All You Need. There is a rich secondary literature at this point if you want explanations of how transformers work, across a wide spectrum of technical detail. Or, of course, you could just ask an LLM to explain it to you.
I’ll note, for technical accuracy, that LLMs operate on tokens not characters. A token is typically a few characters in length; generally part of a word.]
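The next-token idea Patrick is hand-waving at can be illustrated with a deliberately tiny stand-in. This is a bigram count model, not a transformer, and the corpus is invented for the example; it only shows the shape of the operation — given context, score every candidate token and emit the best one:

```python
# Toy illustration (NOT a real transformer): next-token prediction as
# "given the context, score every candidate token and pick the best".
from collections import defaultdict

def train_bigrams(corpus_tokens):
    """Count how often each token follows each context token."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, context_token):
    """Greedy decode: return the most frequent follower, or None."""
    followers = counts.get(context_token)
    if not followers:
        return None
    return max(followers, key=followers.get)

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigrams(tokens)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

A real model replaces the count table with those enormous weight matrices and the lookup with heavy matrix multiplication, but the interface — context in, next token out — is the same.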
Philip: We have to do a lot of hand-waving to get anywhere in a discussion. But once you're done with the training, you just have a gigantic file sitting there. Perhaps you have a trillion-parameter model. If you're running it in a four-bit floating-point format, each parameter is half a byte. So you have 500 gigabytes of model weights sitting on a hard drive, and you want to turn that into the API that runs the economy. How do you do that? That’s inference, and that’s what I get to work on every day.
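Philip's 500-gigabyte figure is straightforward arithmetic — parameters times bytes per parameter — and it is worth being able to do on a napkin, since it determines how many GPUs a model needs:

```python
# Back-of-envelope memory for model weights: parameters × bytes per parameter.
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    return num_params * bits_per_param // 8

ONE_TRILLION = 10**12
GB = 10**9

# A 1-trillion-parameter model in a 4-bit floating-point format:
# half a byte per parameter, so 500 GB of weights on disk.
print(weight_bytes(ONE_TRILLION, 4) / GB)   # 500.0
# The same model in 16-bit precision would be four times larger.
print(weight_bytes(ONE_TRILLION, 16) / GB)  # 2000.0
```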
Patrick: The trillion-parameter model being 500 gigabytes is one data point. The most surprising data point I got—now "ancient" in AI terms—was when Stable Diffusion came out. Unlike closed-source models, you could just download it, and you’d know how much you had downloaded.
It was for generating images. The model was two gigabytes. We were able to compress the entire visual history of humanity into two gigabytes. It’s mind-blowing. It’s lossy compression, but the thing knows what a panther looks like, and what a panther looks like if you draw it in charcoal.
Architecture of language and diffusion models
Philip: I’m a huge believer in open-weight models, partially because they enable the business I work at, but also because it is amazing to be able to download this stuff onto your computer. Modern image models are less lossy and about 10 times that size. They also have a different architecture: a diffusion model. This is part of the taxonomy of inference. You have auto-regressive transformer-style models—those are your language models, but also things like voice generation. Speech-to-text and text-to-speech models are actually very small, generally one to three billion parameters, which is why you can run them on your phone. Then you have diffusion-based models, although those are increasingly looking more like auto-regressive transformers. Regardless of the model, you want to be able to serve it as an API and integrate that into a product—whether it’s a coding assistant, a legal assistant, or music generation.
AI adoption in the broader economy
Patrick: A lot of those sound like products in the standard Silicon Valley sense. A company consumes the output of the labs via API and exposes it to users. Like Suno—it’s an experience that couldn't have existed prior to LLMs being able to generate music on command. I think one of the major touchpoints in the economy is less going to be new products and more about larger companies adopting these primitives in the same way they adopt "if" statements and "for" loops.
In 2026, if there’s something we’re doing that doesn't have an "if" statement or "for" loop in it, something has gone catastrophically wrong. That is going to be the story of AI adoption in the next five to ten years.
The shift toward agentic workflows and RL
Philip: Anything that was a spreadsheet is now SaaS, a database and system of record.
[Patrick notes: Please accept this as an exaggeration for effect. Spreadsheets still run the world in a lot of places.]
Now anything that was a workflow is becoming an agent. This transforms the way you think about inference. In a historical chatbot system, you send one request and get one response. In these more complicated systems, one user request can spawn 50 or 100 calls. Or if it’s an RL (reinforcement learning) job to fine-tune a model for a specific task, you might do a million inference requests from a single user action.
RL is, as opposed to just showing the model more data, letting the model do things and then telling it what happened and using that to adjust its behavior.
Patrick: It’s not broadly understood outside of AI circles, but current inference budgets per person are relatively low. Most people interact with an app and have a countable number of interactions. What is actually going to happen is that automated systems will increasingly work on behalf of every person in a company without requiring user interaction. The number of inference calls will be more like the number of database accesses made about you. Over the last 50 years, that might have been one query per quarter per person in 1970; now you’ve probably got five figures' worth of queries happening before you wake up in the morning.
Philip: I’m also relatively poor at estimating my personal AI consumption because I’m in a privileged position where my consumption is a drop in the bucket compared to the amount of inputs the company I work for is responsible for.
Patrick: The revenue numbers publicly reported about companies suggest that somebody is spending tens of billions of dollars on inference. Industry is starting to materially adopt these.
[Patrick notes: Anthropic publicly claims a $14 billion run rate as of about a month ago. OpenAI publicly claims over $20 billion in revenue for 2025. I believe both these numbers. Stripe does not endorse what I say in my personal spaces, but has made its own claims about observed AI revenue numbers in its annual letters the past two years, characterizing the revenue growth of AI companies as extraordinary, unprecedented, and lapping the best SaaS companies of a few years ago, which Stripe also typically had a close working relationship with.
Booms tend to bring out some innovation in gaming revenue numbers, but most forms of gaming are very difficult to pull off against one’s payment processor, and I think most observers who assume boomtime revenue numbers are per se suspect are underestimating the quality of this revenue.]
What are the sorts of things a large enterprise uses inference on that isn't just an internal chatbot?
Function calling and real-world actions
Philip: We are seeing a new class of AI-native companies emerge. These companies are quickly selling into the enterprise or even becoming the enterprise. You have the ability to completely change a business process. I wouldn't say the marginal cost approaches zero because inference is actually very expensive. Every time we make inference cheaper, we are rewarded with substantially more inference. But if the unit cost falls by 100 or 10,000 times, how does that change your domain? AI is truly the "you can build whatever you want" technology.
Patrick: What we often see in early adoption is taking generation n-1 and just doing that in the new substrate. Early websites took forms and made them HTML forms. That was magical at the time, but you can’t approximate DoorDash by saying it’s just a paper form lifted into a mobile phone. We’re seeing business processes that already existed getting a minimal lift into the AI era, but they are still being managed by the same technologies and metrics. To your point about AI-native firms, things will look very different soon. It won't just be a customer service rep with an AI layer.
[Patrick notes: See the past episode with Des Traynor on how leading organizations are reorienting now that AI is available in customer support. It’s not just “smarter autoresponses.”]
Philip: DoorDash is a great example because it’s something happening in the real world. Similarly, AI systems are moving from an "ask and tell" mode to an agentic mode. Something is agentic if it takes an action on your behalf in the real world. Technically, this is just function calling. It’s providing a language model with a list of actions it’s capable of taking and the input specs required. The model picks from that list and provides a structured output guaranteed to fulfill the input specs of the tool you're trying to trigger. That’s where the amount of input starts to explode because now you are probably making multiple decisions for every action you take. But it's also where you get to start building genuinely new and novel things, rather than just providing additional context to whatever existing operations are happening.
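The function-calling pattern Philip describes can be sketched in a few lines. The tool names and the fake model response below are invented for illustration; the point is that the model only ever emits structured text, and the harness validates and executes it:

```python
# Minimal function-calling sketch: the model never acts directly; it emits
# structured output naming a tool, and the harness runs that tool.
import json

# The "list of actions the model is capable of taking" — toy examples.
TOOLS = {
    "add_to_order": lambda items: f"added {', '.join(items)}",
    "cancel_order": lambda order_id: f"cancelled {order_id}",
}

def dispatch(model_output: str) -> str:
    """Parse the model's structured output and run the named tool."""
    call = json.loads(model_output)
    name, args = call["tool"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"model asked for unknown tool: {name}")
    return TOOLS[name](**args)

# Pretend the model responded with this structured output:
fake_model_output = (
    '{"tool": "add_to_order",'
    ' "arguments": {"items": ["cheese", "burger", "fries"]}}'
)
print(dispatch(fake_model_output))  # added cheese, burger, fries
```

Real systems add schema validation and retries on malformed output, but the control flow — model proposes, harness disposes — is exactly this.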
Technologies for agentic tools: MCP and skills
Patrick: For the people who don't work in this every day, it would benefit them to hear about the technologies that enable these tool calls. One is called MCP, which is essentially API rebranded for agents to use. The notion is that there is a computer system you want LLMs to interface with. You're building an affordance that you expect no humans to ever look at—it's only for LLMs to drive the operation of that system. Another refinement is called "skills," where the skill is just a description in plain text: "You would be better at filing taxes if you knew the following." You have the model read the plain text and respond with text or structured data that your harness can interpret.
Philip: Exactly. A common misconception is that the model itself is actually capable of doing anything. It is only capable of asking the harness to do things on its behalf. For now, the native language of LLMs is plain text—often English or Chinese. The difference between an HTTP API and an MCP or a skill is solely the structure of that text. This benefits us as humans because it makes the actions of the system more interpretable. You can review the logs and see, "Oh, it said to take this function and use these parameters: cheese, burger, and fries."
The craft of writing a harness
Patrick: Reading the execution of computer code that no human actually wrote is wild.
To define a term: a harness is the engineered system. The model lives in the cloud, but the harness is the subsystem that repeatedly calls the model, asks it to make decisions, and operates other computer systems on the model's behalf. Writing a harness doesn't require Space Shuttle levels of engineering difficulty. A minimum viable harness is a couple hundred lines of code—maybe a day of work, or even five minutes with modern coding tools.
Six months ago, I made a video game art project.
[Patrick notes: Still up at IsekaiGame.com and discussed in some detail previously.]
I thought it would be useful for debugging if the LLM helping me write the game could actually play the game and find bugs. I wrote a command-line harness that allowed it to "click buttons" via the CLI. The CLI tells the model what happens, and the model tells the CLI what to do next. It was a huge productivity increase.
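The observe/act loop Patrick describes is small enough to sketch whole. Everything here is a stand-in: `call_llm` would be a real API client, and `FakeGame` is a two-screen toy in place of the actual game:

```python
# Sketch of a game-playing harness: the harness tells the model the game
# state, the model replies with an action, the harness applies it and
# reports what happened.
def call_llm(observation: str) -> str:
    # Placeholder for a real model call: this stub just "clicks"
    # the first button it is offered.
    return observation.split("buttons: ")[1].split(",")[0]

def play(game, max_turns=10):
    """Run the observe → decide → act loop until the game ends."""
    log = []
    for _ in range(max_turns):
        if game.finished:
            break
        obs = game.observe()       # e.g. "screen 0, buttons: next,quit"
        action = call_llm(obs)     # model chooses a button
        game.click(action)         # harness performs the click
        log.append((obs, action))
    return log

class FakeGame:
    """Stand-in for the real game: two screens, then done."""
    def __init__(self):
        self.screen = 0
        self.finished = False
    def observe(self):
        return f"screen {self.screen}, buttons: next,quit"
    def click(self, button):
        if button == "next":
            self.screen += 1
        if self.screen >= 2 or button == "quit":
            self.finished = True

log = play(FakeGame())
print(len(log))  # 2 turns before the fake game ends
```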
Philip: What you actually created there, Patrick, was an RL environment. You created a mechanism for a model to observe its behavior in a virtual world and adjust its behavior on top of that.
Patrick: I was cognizant when doing the work that I was a few days away from science fiction. "Play this video game and tell me if you're not having fun; if you're not, change the game."
[Patrick notes: I would like to encourage people who are skeptical that an LLM could usefully do this to actually try it.]
I mostly used it for basic things. You can be much less precise with an LLM-driven system. Instead of giving a sequence of 125 actions with checkpoints, I just say, "Play through chapter one and tell me if you are surprised in a negative fashion." The objective is a repeatable system for identifying negative surprises.
Why not have the computer play through the entire game every time we change a line of code? Its time is almost free.
[Patrick notes: It is utterly routine in industry to have 40 computers work for an hour to test one minute of human work. The human’s time is valuable and the consequences of a typo in e.g. a financial system can be quite severe. Also routine in industry is having a team in charge of building systems which make this automatic and reliable, which does not describe the level of rigor I bring to infrastructure supporting art projects.]
Using AI for automated proofreading and tool creation
Philip: It doesn't have to be that complex. Every time I write a book, I write custom software to make editing easier. Now I don’t even have to write that software myself. For proofreading, you could throw the entire text of a book in a ChatGPT window and ask for errors. Theoretically, these models have a large enough context window to absorb that. But if you write a simple harness that chunks the text into 500-word pieces and passes each to a separate LLM call—and maybe even throws it at every frontier model simultaneously to collate results—you’re going to find a lot more errors. That’s still a harness: you’re taking a model and adapting its behavior to do better at a task by wrapping it in a set of deterministic loops. And it’s five minutes of work whose functioning just about anyone can understand conceptually.
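The chunk-and-fan-out harness Philip describes is only a few lines. `ask_model` below is a stub standing in for a real LLM API call, and the toy "typo detector" it contains is purely illustrative:

```python
# Sketch of a chunked proofreading harness: split the manuscript into
# ~500-word pieces and send each piece as its own model request.
def chunk_words(text: str, size: int = 500):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def ask_model(chunk: str) -> list:
    # Placeholder: a real harness would ask an LLM
    # "list any typos in this passage" here.
    return [w for w in chunk.split() if w == "teh"]

def proofread(text: str) -> list:
    findings = []
    for chunk in chunk_words(text):
        findings.extend(ask_model(chunk))   # one model call per chunk
    return findings

manuscript = "teh quick brown fox " * 300   # 1200 words → 3 chunks
print(len(chunk_words(manuscript)))  # 3
print(len(proofread(manuscript)))    # 300
```

A production version would make the per-chunk calls concurrently and deduplicate overlapping findings; the structure is unchanged.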
I will also say that this system was not a great proofreader and that my human proofreader caught about four times as many errors. So either I'm very bad at writing harnesses or proofreading is still something that LLMs are surprisingly bad at.
Patrick: In my experience, LLMs are getting much better at proofreading. I use them extensively these days, which should shock no one who has been paying attention. However, we aren't entirely 'out of the loop' on proofreading yet. I will say that they are extremely good at operating existing computer tools and improvising when creating new ones. You can ask them to write software that would have been economically irrational to write before—cases where the expected lifetime of that software is anywhere from a day down to a single execution.

For example, a common error in writing is incorrect capitalization. Existing spellcheck engines are pretty good at identifying capitalization errors in some circumstances, but they often over-identify errors. They might flag the word 'McKenzie' everywhere, despite it being correctly capitalized.

To fix this, you can tell an LLM: 'Write a sed script'—sed is a Unix utility, and LLMs are very good at operating it—'that extracts every capitalized word in this document.' Even if it’s a very large document, you can include one sentence of context before and after each word. You then pass that through a pipeline that knocks out any word capitalized in the usual fashion (like at the beginning of a sentence) to find only the possible mistakes. Deduplicate that list and pass it to an LLM for evaluation.

Given those instructions, the LLM will look at the word 'McKenzie' and say, 'Probably correctly capitalized; many Irish orthographies use this style.' It moves on to the next one, and you can have an infinitely patient computer system happily review 800 false positives to find you 30 honest typos.
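The triage step of the pipeline Patrick describes — extract capitalized words, discard the sentence-initial ones, deduplicate, then hand the survivors to a model — can be sketched with a regex standing in for the sed stage:

```python
# Sketch of the capitalization-triage pipeline: the deterministic part
# narrows the candidates; an LLM (not shown) judges the survivors.
import re

def candidate_caps(text: str):
    """Capitalized words that are NOT at the start of a sentence."""
    candidates = set()   # the set also deduplicates for us
    for match in re.finditer(r"[A-Z][a-zA-Z]+", text):
        prefix = text[:match.start()].rstrip()
        # Skip words that begin the text or follow sentence-ending punctuation:
        # those are capitalized in the usual fashion and not suspicious.
        if prefix and prefix[-1] not in ".!?":
            candidates.add(match.group())
    return candidates

text = "Patrick met McKenzie in Tokyo. The Weather was odd."
print(sorted(candidate_caps(text)))  # ['McKenzie', 'Tokyo', 'Weather']
```

Only 'Weather' is an honest typo here, which is the point: the deterministic filter shrinks the list, and the infinitely patient model sorts the proper nouns from the mistakes.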
Balancing LLMs with deterministic code
Philip: It is important to have both parts: the LLM-based piece and the code-based piece. Recently I had to alphabetize an appendix. LLMs are not great alphabetizers, but they are great writers of Python scripts that can alphabetize. That’s a script I used exactly once and then never again.
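The alphabetizer Philip mentions is the canonical single-use script: the LLM writes it, it runs once against the appendix, and the sorting itself is fully deterministic. The entry list below is invented for illustration:

```python
# The kind of run-once script an LLM writes for you: deterministic
# alphabetization, no model in the loop at execution time.
entries = [
    "speculative decoding",
    "attention",
    "quantization",
    "diffusion models",
]
# Case-insensitive sort, the usual convention for an appendix or index.
appendix = sorted(entries, key=str.lower)
print(appendix)  # ['attention', 'diffusion models', 'quantization', 'speculative decoding']
```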
Patrick: I think this is increasingly going to be a way that they drive a lot of computer tools, like: "Write a script to do this once." I have an ongoing back and forth with the product Claude Cowork, which is sort of like Claude Code, but for laptop jobs that are not specifically writing software code. But Cowork is actually writing a lot of code in the background. And if you are a perceptive user, you can perceive the fact that it's writing code and even review some of the code it’s written; but it's designed for a non-technical professional to use, and so you don't actually have to review the Python code.
So, I'm constructing a balance sheet and I have PDFs grabbed from all the banks and brokerages and similar that I want each of them to represent a line in the balance sheet. And the way Claude is interfacing with Microsoft Excel is not, "I drive a mouse that clicks here to open the file and then clicks here to hit cell A1 and then types the following into a virtual keyboard." It's actually writing a Python script, which uses the fact that .xlsx is a relatively open format and that Python has existing libraries which do spreadsheet manipulation, to write the code which does the appropriate spreadsheet manipulation, execute that code, and then confirm that the execution came back correctly.
Philip: Open file formats are essential. docx is a somewhat less open format, which is why I converted my entire book into Markdown to give it to the language models and then converted it back for the human editors.
Patrick: Yep. It is very interesting to me how Markdown has kind of become the lingua franca of LLM tool invocation under the hood, despite being designed for an entirely different purpose. Markdown was originally written by the gentleman behind Daring Fireball, which is a blog, to make the toolchain—from typing text into a text editor to a beautifully designed post appearing on the internet—easier than it was; it was widely adopted by various blogging services for that.
Outside of blogging services and people who write API documentation for a living, there were relatively few serious uses of Markdown in the world. Then the LLMs got really good at generating it and reading it. It happens to be very token-efficient versus HTML.
That combination of attributes made the humans supervising the LLMs increasingly say, 'No, I’d prefer your output in Markdown; if we need it to be styled, we will have a deterministically written styling step at the end of the pipeline'.

Philip: Yeah, I mean, there is also probably some amount of overlap between those of us for whom blogging and documentation are two of our former and current professions, and that being fairly common among my peers in the AI space as well.
Observability and chain of thought reasoning
Patrick: I think there’s an interesting sociological-slash-historical observation to make: the network is intensely connected at some points—both the human network and, increasingly, this vague, multidimensional 'idea space' network. And so, they write Python not because Python is necessarily the best programming language ever, but because of some combination of factors: there is a lot of open-source Python available, which is a plus.
The people writing the LLMs at the labs and similar organizations write Python natively, whereas they don’t write C as their first choice. It is easiest to review the outputs in Python. So, you tell iteration one of the LLM, 'Prefer Python over something else because I can read Python most easily,' and then they get good at what you keep telling them to get good at.
They’ve gotten very good at Python and very good at Ruby, but maybe less good at writing, for example, Rust. But the broader sociological observation is this: the set of interests and activities that are central to the early adopters of LLMs are probably going to be the ones that see the greatest change in the economy first, versus things where the early adopters are less indexed on that.
I think that’s one reason why, to date, the most interesting economic implication of LLMs is clearly the modern coding tools, by several orders of magnitude. They will hit everywhere eventually—or at least everywhere that has been impacted by an 'if' statement or 'for' loop before. But it will take them longer to get into, say, sculpture, just because there are fewer sculptors in the pool of people working in and around LLMs every day.
Philip: And that's where some of the work on the RL (reinforcement learning) fine-tuning side comes in, as well as on the inference-time side. Language models are good general learners, and so that's why the paradigm of reasoning models—you could call it 'test-time compute' or 'scaling inference'—is growing.
There’s a bunch of names for it, but in general, giving the model the ability to generate a number of tokens before it generates the final result helps a lot in generalizing its capabilities into new domains. You can give them enough context and time to sort of iterate ideas, add in a harness, and then all of a sudden the trial and error becomes less speculative and more concrete.
In this case, to build a robust system in a domain unfamiliar to those who pre-trained the foundation models, you no longer need to be a foundation model lab with billions of dollars in GPUs. You can instead use hundreds of thousands or millions of dollars of compute—still a lot, but materially less. Combine that with industry expertise and domain knowledge to get a system working extremely well at a fraction of the investment.
Patrick: State-of-the-art has changed over the last couple of years. To what extent will we see a family of models in every company, versus everyone paying API costs to one of three firms?

Philip: I have strong feelings about that one. Caveated by my economic interest, I do have a strong belief that we are not headed for a future where everyone is renting intelligence a token at a time from a handful of firms. Because of open models, the proliferation of tooling around RL and SFT (supervised fine-tuning), and the rise of AI engineering, you have a world where every engineer has the opportunity to 'own their intelligence' and to build genuinely differentiated AI products, rather than just wrapping a handful of models.

There will always be a huge role for foundation models in this field. I use closed-source foundation models in my daily work. At times, I use open models in my daily work; at other times, I use custom models created by specialized companies. All of these models are going to be well-optimized for different tasks. They're going to have different cost characteristics, and the world of AI engineering is going to be much more heterogeneous than I think you might have expected looking at the state of the industry a couple of years ago.

And that's a very good thing. It gives us more exciting stuff to work on, and it means that there's more opportunity for everyone in the industry today—and everyone who's not in the industry today but wants to be—to do interesting and useful work instead of just relying on a few companies.
Patrick: As we increasingly move from a world where there is one logical user interaction and one API call made for it, to there being hundreds or even more API calls made per user interaction, it makes a lot of sense to have a local, cheap source of intelligence.
This allows you to do things like routing a request to the best provider or asking a 'smart' LLM to construct a plan of what we're going to do. I'm then going to farm out parts of the plan which don't require much more intelligence than an 'if' statement to places that are cheap and fast. We can then annotate which parts require a great deal of sensitivity and use one of the state-of-the-art models for that—perhaps routing to a model we know is particularly good on that specific topic. There are some firms, like Stripe—Emily Sands has discussed this on the podcast—that train their own models. Specifically, while the leading industry lab LLMs can answer a question on anything connected to the human experience, some answers will be better and some will be worse. There is even a different model that cannot speak Japanese but 'speaks credit cards' more than any human who has ever lived.
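The routing pattern Patrick describes can be sketched in a few lines. This is a minimal illustration, not a real system: all model names, the routing table, and the step fields are hypothetical.

```python
# Hypothetical request router: cheap local model for "if-statement" level
# steps, a frontier model for sensitive ones, and a domain specialist where
# one exists. Every name here is made up for illustration.

ROUTES = {
    "simple": "local-small-model",      # cheap, fast, low-stakes work
    "sensitive": "frontier-model",      # state-of-the-art for delicate steps
    "payments": "payments-specialist",  # a model that 'speaks credit cards'
}

def route(step: dict) -> str:
    """Pick a model for one step of a plan produced by the 'smart' planner."""
    if step.get("domain") in ROUTES:
        return ROUTES[step["domain"]]
    if step.get("sensitive"):
        return ROUTES["sensitive"]
    return ROUTES["simple"]

# A plan with three steps of differing sensitivity and domain.
plan = [
    {"task": "parse date", "sensitive": False},
    {"task": "refund decision", "sensitive": True},
    {"task": "card BIN lookup", "domain": "payments"},
]
assignments = [route(s) for s in plan]
```

In practice the routing table would be learned or configured per deployment; the point is only that the planner's output, not the user's request, drives which model handles each step.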
Philip: Well, it's still Transformer-based, according to the blog post. So, you can run with “LLM.” It's close enough.
Patrick: Yeah. So, there will not be as many models trained as there are programs in the current status quo that the cost-benefit curve isn't there for most uses. But there will certainly be a lot more models trained in the next 10 years than there were in the past 10, as the stack for training models and the know-how get better.
Philip: If you look at HuggingFace, five or six years ago there were a few tens of thousands of models on there. Now we are well past 2 million, possibly 3 million. Many of those are just fine-tuned versions or redos of existing foundation models. But directionally, the number of open-source models has been rising at 5x a year without any signs of slowing down.
Patrick: HuggingFace, for those who aren't familiar, is a community site where people upload and share models. Similar to GitHub, each model gets an explanation, and a star system surfaces the best ones. This moves the state of the industry forward because people can fine-tune against existing models and run automated benchmarks.
Philip: And similar to GitHub, it's leveraged this community uptake to become the de facto standard for companies publishing closed and fine-tuned models to share with one another, similar to private code repositories.
Patrick: This is a thing which felt really important to me as I was doing my LLM-assisted coding six months ago—though I've been told that’s been entirely obsoleted since December. The observability of these tools is both better and worse than expected. It’s better in terms of interpretability in a 'Chain of Thought' world, where you ask the LLM to think to itself in a 'scratchpad' before giving a final answer.
It's an ongoing research question whether that interpretability is actually true, or just an incredibly complex pantomime. Regardless, you can read an LLM's internal reasoning, and as an engineer, you kind of have to. If you tried to hit the DoorDash API to order a burger from Culver's, and it ordered from New Jersey while you're in Chicago, you need to know why it picked New Jersey. The tooling for this is less mature than standard server log review. So, how do people doing inference engineering think about building out these new toolchains?
Speculative decoding and hidden states
Philip: The world’s getting less observable. Text is not actually that good of a medium for inference. The world of agent observability and all that stuff—that’s very much in the AI engineering space. There are a number of really excellent companies working on it. Within the inference space in particular, there’s actually a trend toward knowing a little bit less about what’s going on. Let me give you a concrete example: there is a method of accelerating LLM inference called speculative decoding.
The basic premise behind this is that you have a large model that creates an output. That output is probably pretty similar to what a small model—a model of the same material dimensions, shape, and training process—would create most of the time. So, what you do is you let the small model kind of 'front-run' the large model. It generates a few tokens at a time, which you then check against the large model to see if they are correct. The checking process is much cheaper than the generation process; if the tokens are correct, you get to skip those steps through the large model, and everything just becomes that much faster.
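The draft-and-verify loop Philip describes can be shown with a toy simulation. Real speculative decoding compares token probabilities between two actual models; here the "large model" is just a fixed target sequence and the "draft model" is a function that guesses it imperfectly, which is enough to show why accepted drafts save large-model steps.

```python
# Toy illustration of speculative decoding's accept/reject loop.
# The "models" here are stand-ins, not real model calls.

TARGET = list("the quick brown fox")  # what the large model would generate

def large_model_next(prefix):
    return TARGET[len(prefix)]  # ground-truth next token

def draft_model_guess(prefix, k):
    # The draft is right most of the time but garbles position 10.
    out = []
    for i in range(len(prefix), min(len(prefix) + k, len(TARGET))):
        out.append("X" if i == 10 else TARGET[i])
    return out

def generate(k=4):
    prefix, large_calls = [], 0
    while len(prefix) < len(TARGET):
        draft = draft_model_guess(prefix, k)
        # One large-model pass verifies the whole draft at once.
        large_calls += 1
        for tok in draft:
            if tok == large_model_next(prefix):
                prefix.append(tok)  # accepted for free
            else:
                # Rejection: the large model supplies the correct token.
                prefix.append(large_model_next(prefix))
                break
    return "".join(prefix), large_calls

text, calls = generate()
```

With a mostly accurate draft, the output is identical to what the large model alone would produce, but in far fewer large-model passes (5 here versus 19 for token-by-token generation).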
Patrick: So, you might have a state-of-the-art model, like, say, Opus 4.6 (or whatever it's up to), and you have a document that you're asking it to write, which might be 60,000 characters—60,000 tokens—in length. You would tell Opus, 'Just write me the first 4,000 characters.'
Now, I'm going to give those 4,000 characters to a cheap, fast, local, open-source model and ask it to generate the next 2,000 characters. I compare them against each other, and if I achieve sufficient alignment with the cheap model, in lieu of paying the token costs for the next 54,000 characters, I just have the cheap model do it all.
Philip: It's a lot more granular than that, because it happens not with you and the client; you're not the one making this decision. This is just happening for you under the hood on the server, and it's happening on the order of five to 10 tokens rather than thousands.
This is an example of the sort of optimization that might make sense for you to build and think about in your harness, or it might just be swallowed by the inference layer. This is happening in basically any system that you use; it is part of what makes them so fast.
It’s just automatically happening under the hood where, for example, in the past, you might have had the Llama 405B model and then the Llama 3B model front-running it and trying its best. This is actually a very inefficient way to do this because the models are generating text, and text is the user-facing output.
The way we do it today is by training a speculator model through a process called Eagle. We train a speculator model on something called the hidden states of the main model. The hidden states in a neural network are the intermediary outputs at each layer.
We take those as the input and output data and train a small, approximately 1-billion-parameter model—not to create text, but to guess what the hidden state of a larger model will be when it gets queries from a specific domain like coding.
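A drastically simplified version of this idea: train a small model to predict the large model's hidden states rather than its text. Here the "large model's" layer is simulated by a fixed linear map and the speculator is a least-squares fit; a real EAGLE-style speculator is a small neural network, but the shape of the training data (hidden state in, hidden state out) is the same. Everything below is a toy under those stated assumptions.

```python
# Toy sketch of training a speculator on hidden states instead of text.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # hidden dimension (tiny, for illustration)
W_large = rng.normal(size=(d, d))      # stand-in for the large model's layer

# Collect (input hidden state, output hidden state) pairs from the "large model".
X = rng.normal(size=(1000, d))
Y = X @ W_large.T

# The speculator here is a plain linear least-squares fit on those pairs;
# a real EAGLE head is a small network trained on the same kind of data.
W_spec, *_ = np.linalg.lstsq(X, Y, rcond=None)

# On new inputs, the speculator's predicted hidden states match closely.
X_test = rng.normal(size=(10, d))
err = float(np.max(np.abs(X_test @ W_spec - X_test @ W_large.T)))
```

The point of the exercise: nothing in the training loop ever touches text, which is exactly why, as Philip says next, there is no longer a human-readable intermediate output to eyeball.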
Where in the past we would be able to manually look at the output of a speculator model and say, 'Oh, this speculator model is getting this type of information wrong,' or, 'For this particular token, it’s struggling to generate it,' now it’s these completely imperceptible hidden states.
This is not to say that observability is not possible in the AI world or that it is getting worse—in fact, it's getting better. This is more of me saying that, especially within these sophisticated patterns happening hundreds of times per second as we generate tokens, we're moving away from our natural advantage of observability: the native language of these LLMs being text—generally English or Chinese. Instead, these systems are using a much more native representation of semantic meaning, and we need to build new tools to interpret that in ways we can review with our brains.
The value of smaller, task-specific models
Patrick: It is an interesting observation about this great, unbounded field of research that it isn't simply a matter of, 'We need to make the state-of-the-art models smarter.' We do need them to be smarter, to be clear.
But after we have an arbitrarily smart model, that doesn't mean that the value of having less smart models available goes to zero. Indeed, having a smarter model gives you the opportunity to say, 'Okay, look at the artifact that is the smarter model,' and then create a model based on that—or an intellectual descendant of it—which has different characteristics.
It might be smaller, more restricted in domain, execute faster, or execute cheaper. You can then either use that in tandem with the larger parent model or entirely disconnected from it.
Philip: Exactly. And the reason this matters is—let's say that six months ago, you built an AI application with the smartest model available. You just kind of lived with the fact that it cost $10, $20, or $50 every million tokens; you lived with the fact that it took five or ten seconds to perform an action; and you lived with the fact that you were getting, like, 'two or three nines' of uptime.
As the frontier models have gotten faster, you might just continue using them because you've always used the 'best' model—back when you started building, that was the only thing that could do your task. But after a few months, there will generally exist smaller, cheaper models that are equally capable of doing that task.
Benchmark saturation is actually a very good thing. It means that a task is sufficiently solved such that you can take the smallest model that saturates the benchmark and just use that; that gives you a natural advantage in inference. It's much easier to build a fast, efficient, reliable system on a 10-billion-parameter model than on a 100-billion-parameter model—and easier on 100 billion than on a trillion, and on a trillion than on two or three trillion.
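The selection rule Philip describes ("the smallest model that saturates the benchmark") is simple enough to write down. The candidate list, names, sizes, and scores below are all invented for illustration.

```python
# Hypothetical: among models meeting a benchmark threshold, pick the one
# with the fewest parameters. Names, sizes, and scores are made up.

candidates = [
    {"name": "giant-3t",   "params_b": 3000, "score": 0.97},
    {"name": "large-100b", "params_b": 100,  "score": 0.96},
    {"name": "small-10b",  "params_b": 10,   "score": 0.95},
    {"name": "tiny-1b",    "params_b": 1,    "score": 0.71},
]

def pick(candidates, threshold=0.95):
    """Smallest model whose benchmark score saturates the threshold."""
    ok = [c for c in candidates if c["score"] >= threshold]
    return min(ok, key=lambda c: c["params_b"])["name"]
```

Once the 10B model clears the bar, the remaining effort goes into latency, cost, and reliability rather than raw intelligence.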
We're kind of bounded there by the amount of VRAM in an 8x B200 NVIDIA system. As Rubin comes out, models will get bigger, but the general premise is that as you get smaller and smaller models, everything else gets easier. So, as long as your task-specific intelligence is fully saturated, you stop incrementing on intelligence and start incrementing everywhere else.
Patrick: This implies some meta-system that is capable of evaluating these models against each other. There are many different engineering pathways—and people will have to read the book for some of them—but broadly speaking, we should probably record prior invocations of the task and what the model came up with, and then ask future candidate models: 'What would you have done on this prior invocation of the task?'
You then test that against either the ground truth of what a prior, more intelligent model came up with, or maybe some sort of evaluation heuristics. These things frequently have the characteristic where it is harder to answer the task than it is to evaluate the correctness of the task.
So you can say, 'Okay, well, I have 10,000 or a hundred million data points from the real world about how this actually shows up in our business. Produce 10,000 candidate outputs, and then I will have a very intelligent model score each of those outputs against a rubric.'
If you scored a 3.7, and that is better than the previous leaderboard of 3.5, I’m going to speculatively give you execution for the next two weeks. If nobody gets paged and Twitter doesn't light up about how our customer support is broken, then you’ll likely continue being the platform for the future until you get beaten by someone who scores a 3.9.
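The replay-and-judge loop Patrick sketches can be written as a few lines of harness code. `run_model` and `judge` below are placeholders for real inference and LLM-as-judge API calls; the scores are hard-coded purely so the example is self-contained.

```python
# Sketch of a model-promotion eval: rerun recorded tasks through each
# candidate, score outputs with a rubric-following judge, and promote the
# challenger only if it beats the incumbent. All names are placeholders.

recorded_tasks = ["task-1", "task-2", "task-3"]  # prior real-world invocations

def run_model(model, task):
    return f"{model}:{task}"  # placeholder for an actual inference call

def judge(output):
    # Placeholder rubric score; a real judge is a strong LLM scoring
    # each output against a rubric (often easier than doing the task).
    return 3.9 if output.startswith("challenger") else 3.5

def evaluate(model):
    scores = [judge(run_model(model, t)) for t in recorded_tasks]
    return sum(scores) / len(scores)

incumbent_score = evaluate("incumbent")
challenger_score = evaluate("challenger")
promoted = "challenger" if challenger_score > incumbent_score else "incumbent"
```

The promotion itself would then be gated behind a canary period (the "two weeks of speculative execution" in Patrick's framing) before becoming the default.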
Philip: Yeah, that’s the sort of continuous learning model, which I think in many cases—the sort of real version of that—is a little bit less automated than many people would hope for.
Patrick: Yeah, certainly at my shop.
Internal competencies versus buying solutions
Philip: That’s kind of the goal: that we are able to, you know, continuously refine the models. Similarly, in my world of inference, you want to continuously refine the systems, as the nature of the traffic that you're receiving changes the effectiveness of the inference optimizations that you make. And so, you want to be able to sort of adjust the optimizations you're making on traffic.
In a fully refined system, yes, all of this stuff would kind of update automatically. Today—perhaps fortunately, if you're in the business of updating it and find doing so interesting and intellectually rewarding—it does not. We still have to do that piece ourselves.
Patrick: I think it'll take a few human engineer cycles in terms of diffusion of mindshare—conferences that share these learnings, books being published, et cetera. It will be some time before mainline companies have this level of infrastructure built around most of their offerings. This will also likely imply the existence of infrastructure companies in the middle, such as your own.
Philip: Yes. As you were describing this, I was thinking, 'Wow, that sounds like a lot of inference.'
Patrick: It’s interesting that this gets us back to something analogous to a world we used to live in, where performance is an important characteristic of computer programs. For about a 30-year period in my professional career, the performance of all your computer programs got better every 24 months because you would upgrade the computer you were running them on, and the new computer was faster.
And then Moore’s Law broke down a little bit, and the subjective experience of computer programs getting faster has not been a mainline experience for engineers over the course of the last, call it, 10 or 15 years. In that previous fashion, you didn’t have to do ongoing investment in your line-of-business application to take advantage of the fact that the new Pentiums were faster than the old Pentiums; that just happened, and you benefited from it.
What we’re hoping for is that—without there needing to be staffed engineering teams against every application written in capitalism—a lot of the applications are just going to passively get better. This is because the infrastructure layer reads Twitter—well, OK, while that is a fun mental image, probably not technically accurate—finds there’s a new model in the candidate set for this week, tests the new model, and says, 'Oh, it turns out that the "new hotness" model, which is 60% cheaper than the state-of-the-art model, actually performs at parity or slightly better on benchmarks'.
Therefore, I’m going to give it a candidate set of traffic over the course of the next two weeks; therefore, we’re going to adopt it for this purpose. And then some human gets an email updating them: 'Changes in Q2: we’ve changed the model from "blah" to "blah" under the hood. Projected cost savings for this year: $12 million. Have a nice day.’
Philip: Pretty much. And then you don't only have to do that at the model level; you can also do that at the hardware level. You can do that in terms of which inference engine you're running, training a new speculative decoding model, improving your KV cache reuse rates, or figuring out how to effectively quantize the model without reducing your precision.
There are a bunch of smaller ways beyond the model itself where—yes—the quarterly generation upgrade of the model is an important step function, but there's also continuous improvement in the interim. As these systems continue to scale, a 1% or even a few basis points increase in efficiency actually has a very material impact.
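One of the levers Philip lists, quantization, can be illustrated with a toy per-tensor int8 scheme. Production stacks use more sophisticated per-channel or block-wise formats (FP8, INT4, and so on), but the core trade is the same: store fewer bits per weight, pay a small, bounded precision cost.

```python
# Toy weight quantization: round float32 weights to int8 with a single
# per-tensor scale, dequantize, and measure the error introduced.
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)

scale = np.max(np.abs(w)) / 127.0       # map the largest weight onto int8 range
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale    # dequantized weights used at inference

max_err = float(np.max(np.abs(w - w_hat)))  # bounded by half the scale step
```

The int8 tensor is a quarter the size of the float32 original, and the worst-case rounding error stays below half a quantization step, which is the kind of precision/footprint trade being tuned continuously in the interim between model generations.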
Patrick: So, the level of sophistication of the engineering organization that you implied in saying that there are these various levers you can pull strikes me as being quite high. For an uninvolved person, where would you expect an enterprise to have an internal competency in inference engineering versus where they just buy that capability?
Philip: I think that you are always going to need the capability internally, even if you are buying it, because you are going to need to know what to ask for. For the majority of companies, the investment in owning an AI stack end-to-end probably does not make sense. The parts that are really hard are fairly undifferentiated, like distributed systems and applied research problems.
Patrick: They also seem to have a very poor shelf life over the course of the last 24 to 36 months.
Philip: They do. It's ironic that I wrote a physical book in this space knowing that within 12 to 18 months, I would have to come out with a second edition or be consigned to irrelevance. But it’s still very important to understand the efficient frontier of trade-offs possible between latency, throughput, cost, and quality. It seems to follow the pattern of cloud service providers: you have a platform that produces the tooling and a large class of professional engineers whose job it is to operate that platform.
Patrick: Playing back what I just heard, I think there will be a large, differentiated inference engineering practice in companies that have thousands of engineers available. Companies with smaller teams will likely have a handful of internal experts but will largely consume APIs and solutions brought to market by leading labs or cloud providers or maybe point solutions as well.
Philip: I would perhaps index less on the size of the engineering team. I've found that in many cases, the most sophisticated teams I work with are actually relatively small. Right now, engineering talent to do this is so scarce—there are maybe a few tens of thousands of engineers in the world who really understand this stuff well enough to build it. And however many tens of millions of engineers there are in total, maybe that’ll change as this knowledge becomes more disseminated. But I’ve found that, in many cases, it’s the larger teams who sort of have very diffuse knowledge of this and thus require a lot of expertise to bring these systems into production.
And then there are the smaller teams who maybe actually could do it themselves, but simply choose not to because they have more pressing things to work on. And even with, you know, 10x to 100x acceleration from AI coding tools, they still have very limited bandwidth.
Patrick: I think there's precedent for this in technology adoption curves. There were few people capable of shipping iOS apps in 2008; by 2018, most large banks had at least a functional app; and by 2026, it's table stakes.
Indeed, one of the things that some of the largest, most successful banks did was they said, 'Well, we have a hundred thousand technologists at this organization, and apparently none of them are either capable of doing iOS apps—or, I think more likely, the industrial organization that was the IT department could not successfully anoint a team of five people to ship an iOS app that was decent.'
And so, they went to Silicon Valley and basically bought teams wholesale and said, 'You know, 200 million Americans use our app; it better not be crap. That’s your job now.'
Self-publishing a technical book in record time
Patrick: Switching gears, you’ve self-published a meaningful technical book, Inference Engineering. Can you talk about that process?
Philip: I wrote a book called Inference Engineering. I published it a couple of weeks ago to a surprisingly massive reception.
If you are familiar with the criteria to be on the New York Times bestseller list for nonfiction, it's generally 5,000 to 10,000 copies in your first week. I, of course, don't qualify because of the 'sell' part of 'bestseller list'—I made this book free for reasons I'll get to in a second. But I did 14,000 copies in my first week. Most were digital; I have a massive waitlist because I'm completely sold out of print books.
I wrote this book on behalf of my employer. Or, I should say, I leveraged having three to four years of internal track record to have them let me bet a big chunk of my time over the last few months on bringing this project to fruition. Because of that, I was able to just completely give this away for free, which was very exciting because I think everyone should have an opportunity to learn this stuff.
And yeah, it was really interesting because I’ve never done a print book before. That was quite a lot of work and involved a number of unexpected headaches. I've also never done a sort of category-defining book before. I think it’s very presumptuous to say that the book is category-defining two weeks into its lifetime, but that is certainly the goal: to define this job of inference engineering as well as lay out the tools.
Patrick: Did you substantially invent that term?
Philip: No. Well, kind of. It's definitely a term that I checked to ensure no one else had published about before. But I was already seeing, here and there, job postings on LinkedIn saying 'Member of Technical Staff, Inference,' or maybe even 'Software Engineer, Inference'. Or maybe they even take out the comma and just say, 'Inference Engineer'.
But it was definitely a—you know, I've never titled one of my own books; someone else always comes up with a title much better than I can. Writing for Software Developers—the book referenced earlier—was called The Technical Content Development Handbook for the first several months of its life, until my mom renamed it.
Patrick: Your mother has excellent marketing instincts.
Philip: She sure does. She's also actually the editor of Inference Engineering, as well as everything else I've written. The title came out of various conversations about what the correct label for this collection of technologies and engineering practices is.
I thought about calling it, you know, "LLM performance optimization" or something, but that was insufficiently broad in scope. I also thought about calling it just Inference, but that was perhaps a little bit too artistic. So, I took the path of many technical book authors before me and just stuck "Engineering" after the thing that I wanted to talk about.
Patrick: I think this is underrated, why books continue to be an important Schelling point, even in a world where much of the technical documentation exists as either explicit documentation or less formal writing, such as blogs, podcasts, Twitter, and similar.
The book, by the way, is physically beautiful. This is an audio-only podcast, so people can't see it, but I worked at Stripe, including adjacent to Stripe Press, which has something of a deserved reputation for producing beautiful artifacts. And this is a beautiful artifact.
Philip: I certainly owe quite a bit to Stripe Press. We used the same printer that does their books, which allowed me to skip quite a bit of R&D. I definitely took a lot of inspiration from them; if this weren't an audio podcast, you could see every single Stripe Press book behind me, in chronological order. I have always aspired to that quality, so I took a shortcut and went straight to the guys who print them.
Patrick: Well, I’ll make my obligatory disclaimer that they do not necessarily endorse things I say in my own spaces, but I’m happy that more people are taking shots on goal. I think a book is kind of like planting a flag in the world and in 'thought space'—signaling that this particular orientation of looking at the world matters.
That has occasionally mattered in the technology industry. The Google SRE book, for example, catalyzed the shift from sysadmins being the underpaid 'redheaded stepchild' of engineering into being engineers in their own right, and it substantially accelerated the DevOps movement. I think there is at least one other book out there waiting to be written about operations professionals and how they should likewise realize a step-change in importance and career trajectories as a result of AI and other improvements in the economy.
Patrick: I wonder who could possibly write that?
Philip: I will buy a copy, but I might not be the person to write it. But, uh, we shall see. There are many talented people in the world who should take a stab at writing more things, even in a world where writing appears to be easier than it was previously.
I’ll say that I tried my best to get language models to do as much of this for me as they could, and they could do basically none of it. I had to write this book myself with my own fingers on the keyboard; the language models were useful for research and fact-checking—well, kind of useful for research. Actually, I hope that the next generation of LLMs includes this book in their training corpus because there was quite a bit of information missing from their knowledge of inference engineering topics.
But yeah, that was part of the thesis behind this project: I wanted to create a beautiful object. I wanted to create something that was the antithesis of 'slop,' where it was very clear that I had spent many days and nights turning the last four years of my life into 250 pages because I believe in books. I love writing books, and also because I believe in inference. I believe that there is a lot of really interesting work left to do in this space and that it's very early.
If people can just take a long weekend to read this book, they'll have the background they need to figure out how to join this space, find their niche within it, and start making real contributions to the frontier within weeks or months—rather than the years or decades it takes to become an expert in many other fields.
Patrick: I think part of the social purpose of being an author is this: being an author is so legendarily terrible, relative to other things authors could be doing, that if someone was nonetheless possessed to spend two years of their life on a topic (it wasn't actually two years for you, but that's the typical course of a nonfiction author's life), you should see that sacrifice of effort and assume that somewhere in 'idea space' adjacent to this, there is something worth thinking about.
Philip: That’s the other thesis: it’s sort of asymmetrically high-status to write a book. I’ve found that perhaps my novel or unique skill set is writing nonfiction books quickly. I just have to gather enough expertise in a specific topic so that I have enough content to write a book, and then writing it is the thing I want to do.
Patrick: I would love for you to tell people what the timeline looked like, because I read it on your blog and—hats off—most people can't do that.
Philip: I actually received advice that I should stop talking about how quickly I wrote it because of how it might make people perceive the final product. I will say that, if not for four years of working in this space, there is absolutely no way I could have picked up a keyboard and written a book on any timeline.
But the main outlining process, followed by the writing process, was about six weeks for the initial draft. That was followed by two months of revisions and two months of formatting, designing, printing, and publishing. The reason I was possessed to do this on such an advanced timeline is the same reason I ended up self-publishing. I talked to a number of the best publishers in the world, and the general timeline was that after you provide a finished manuscript, you’ll be in print in 12 to 18 months. I basically said that is just an unacceptable timeline for this industry. I simply had to do every single step myself to make it as fast as possible.
The thing is, this book was essentially finished and finalized—though I was making small factual updates through January 2026. I definitely wrote it to be more about the theory and the 'large pillars' of this field rather than specific implementation details. Even still, I imagine it having a shelf life of 12 to 18 months before I have to come out with a second edition.
So, to the questions of: 'Why did you write a book so fast?' 'Why did you self-publish it?' 'Why did you do your own cover in-house with Baseten’s designers?' 'Why did you do the printing and coordinate the editing yourself?' It was entirely because the tradeoff between latency and throughput exists in the world beyond just the input space. I was trading off working with 'high-throughput' book producers to become a 'low-latency' book producer, while still hopefully keeping the same level of quality.
Patrick: I haven't finished the book yet, but I hope people understand—holding it in your hands, that is a wonderful document. I hadn't realized it was 'free as in beer.' Why did you make that decision?
Philip: I have the fortune to be an early employee of Baseten. Baseten is a very well-capitalized AI startup, and I was able to take this passion project and work it into a marketing budget. Obviously, I like to say this book is 98% vendor-neutral, though I’ll say there is a nice one-page advertisement at the back for Baseten services. We also have a logo on the spine and on the cover.
I genuinely believe that the sort of inference-first view of the world is the right way to look at the space; otherwise, I would work at a different company. If everyone else believes the same things we do, that’s generally good for us as a company and makes people more interested in our products and services.
It was a pretty easy sell internally that this project would be beneficial to Baseten and worthwhile to spend my time and the company’s resources on in terms of launching, marketing, and distributing it. In the face of that, the idea of selling a PDF for a few dollars each was not particularly appealing.
As soon as Amazon quits clowning around and decides that I am, in fact, a legitimate author at a legitimate business, there will be the ability to purchase the book more or less at the cost of us printing and shipping it—just because I don’t want to do all of the fulfillment myself. I’ve made very close personal friends with every single member of my fabulous FedEx store right next to me, but I would like to stop visiting them every day.
I was free of the profit motive this time around in writing, and I am happy that I was still able to create a great artifact and take advantage of the heavy subsidy to make something beautiful that I otherwise wouldn’t have been able to create by myself.
Patrick: Well, thanks very much for talking briefly about the craft of becoming a self-published author and for the wonderful discussion on inference engineering more broadly. And Philip, where can people find you on the internet?
Philip: Yes, so in college, I defeated a pediatrician in Vermont to take over 'Philip Kiely' as, you know, the number one search result for my own name. So, you can find me at https://philipkiely.com/
You can find me on Twitter at @philipkiely. You can find me on LinkedIn at https://www.linkedin.com/in/philipkiely/. I'm also 'Philip Kiely' on various gaming platforms because I'm perhaps a little bit too consistent in my online branding.
Anyway, the best place to download the book is https://www.baseten.com/inference-engineering/.
Again, the PDF is free. I have a waitlist set up right now for paper copies, and we'll be fulfilling those as soon as I have the logistical capability to do so.
Patrick: Awesome. Well, thank you very much for coming on the program today, and thank you, listeners, for listening to this week's episode of Complex Systems. We'll see you next week.