The AI infrastructure stack with Jennifer Li, a16z

This week, I'm joined by Jennifer Li, a general partner at a16z investing in enterprise, infrastructure and AI. We discuss how AI is reshaping every layer of the software stack, creating demand for new types of middleware. Jennifer talks about emerging infrastructure categories and why the next wave of valuable companies might be the unsexy infrastructure providers powering tomorrow's intelligent applications.
Sponsor: Vanta
Vanta automates security compliance and builds trust, helping companies streamline ISO, SOC 2, and AI framework certifications. Learn more at https://vanta.com/complex
Timestamps
(00:00) Intro
(00:55) The AI shift and infrastructure
(02:24) Diving into middleware and AI models
(04:23) Challenges in AI infrastructure
(07:07) Real-world applications and optimizations
(19:05) Reinforcement learning and synthetic environments
(23:05) The future of SaaS and AI integration
(26:02) Observability and self-healing systems
(32:49) Web scraping and automation
(37:29) API economy and agent interactions
(44:47) Wrap
Transcript
Patrick McKenzie: Hi, everybody. My name is Patrick McKenzie, better known as patio11 on the Internet. And I'm here with Jennifer Li, who is a general partner at a16z.
Patrick disclaims: When in the course of human events it becomes necessary for one man to read a disclaimer, a decent respect to the opinions of mankind compels him to read exactly the text written by Compliance:
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. For more details please see a16z.com/disclosures.
Jennifer Li: Hi, Patrick, it's great to be here. Thanks for having me.
Patrick: Thanks very much for coming on.
The AI shift and infrastructure
One of the things that I think is underappreciated when there's a major shift in how people consume software is that all software is built on many layers of infrastructure, both software infrastructure and hardware infrastructure. We seem to be in the opening innings of what I think will ultimately be a transformation of society due to the AI shift, but which is certainly impacting the software and hardware stack we work on top of.
That seems to be your primary investing focus at a16z. And so I'd love to just chat with you about what are the things that people might not see behind the chat windows when they're using these products.
Jennifer: Amazing. This is my favorite topic. My everyday life is living and breathing through infrastructure changes. And I feel like my whole career has been waiting for this moment where every single piece of the stack is shifting. So it's definitely a very exciting time for me. As you said, people probably see a lot of the products, whether it's consumer or B2B, just around the capability of the models. But I think if you realize how much infrastructure is being built or needs to be advanced to support this new type of workload, it's fascinating. And I'm very excited to dive into this.
Patrick: Sure. So I guess we can start close to the metal or close to the user, whichever you prefer.
Jennifer: Let's start with the user. I came from a product background and I always like to think about problems when they're close to the user. So let's start from there.
Diving into middleware and AI models
Patrick: Sounds great. So presumably there's the application layer these days and then a model layer beneath it, sometimes the same people. And what we've seen in previous iterations of this game in SaaS and similar is there ends up being a middleware layer between those two providers as well. Is there a developing middleware in AI yet?
Jennifer: So the answer is yes. And this middleware, whether we start from the frameworks themselves or connectivity tissues, protocols, pipelines, there are quite a few moving pieces. And I spend a lot of time thinking about what the new application stack looks like.
To be honest, largely, it's not that dissimilar to the current application stack, where you still have databases, you still have CDNs, content delivery networks, you still need the front-end clients and servers. A lot of these things are not changing. However, now we have this new modality or new capability called AI and AI agents, which puts a lot of pressure on thinking about what real-time and low-latency workloads look like when we're, let's say, delivering a large amount of image, video, or audio inference, and where does that capacity and also capability come from? So that's an area where I have been spending quite a bit of time.
Maybe to dive into one area specifically: I think a lot of the applications we're seeing today are mostly centered around large language models, whether they're delivering really amazing answers to your questions about a product, parsing PDF documentation, or retrieving information and knowledge. But I think a very underestimated and under-appreciated area is this diffusion model world, where lots and lots of creative tooling is being revamped and reimagined.
[Patrick notes: Diffusion models are a technique which slightly predates LLMs, and are more used in multimedia generation. They’re frequently quite remarkable; Stable Diffusion in 2022 (!) felt like compressing the visual history of humanity into about 2.X GB of downloadable weights.]
While we know large language models are really powerful, these diffusion models are incredibly creative and they're incredibly diverse too, given the different genres and capabilities that produce either great graphics or really imaginative images. And it's a huge infrastructure challenge to deliver low latency and high quality outputs.
Challenges in AI infrastructure
This is where I'm seeing, on the infra layer, the capabilities bifurcate, or at least how the infra stack is being built is bifurcating based on modality. Language / transformer models are more expensive to train, but they are a lot more economical to run inference on, versus diffusion models, which need several steps to inference. Also, we know these multimodal or multimedia assets are much more expensive to deliver to end users. So there needs to be quite a bit of optimization done at the infrastructure layer to make sure the steps are well optimized and the results are delivered in a very low-latency, high-throughput way, so the user doesn't, let's say, wait 30 seconds for a video to be generated. And even just compared to the last two years, optimization has really created new experiences for users. So we're definitely happy to dive into that.
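[Patrick notes: The “several steps” Jennifer mentions are denoising steps; each one is a full pass through the model, so latency scales roughly with step count. A minimal sketch of the knob infra teams trade against quality, assuming the Hugging Face diffusers library; the model name and step counts are illustrative, not a recommendation:]

```python
# Minimal sketch, assuming the Hugging Face diffusers library.
# Model name and step counts are illustrative placeholders.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a lighthouse at dawn"

# Each denoising step is a full forward pass, so wall-clock time scales
# roughly linearly with num_inference_steps; distilled / few-step variants
# try to keep quality acceptable at far lower step counts.
for steps in (50, 20, 4):
    start = time.time()
    image = pipe(prompt, num_inference_steps=steps).images[0]
    print(f"{steps} steps -> {time.time() - start:.1f}s")
```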
Edge Computing and Model Distillation
Patrick: Yeah, I think it's an open question on how much inference is going to be done on the user's device. How much is going to be done relatively local to the user in some sort of forward deployed edge layer and how much inference is going to be done quote unquote at the data center.
One of the most wild things I've seen, which I'm afraid I don't have enough background to either remember the acronym or remember the exact technology stack, but NVIDIA is training models on the outputs of video games that have been fully rendered on NVIDIA GPUs, with the notion that for, say, less powerful GPUs, rather than simulating the entire fictional world the game takes place in and doing ray casting and all the fun magic that we usually do to render real-time scenes, we just give the model the input that we would give to that sort of engine and say, just hallucinate the details, because it probably looks like most screenshots from this video game anyhow. And that is not something that will happen in five years; that's something you get on consumer-grade chips today.
[Patrick notes: You can read more about DLSS (Deep Learning Super Sampling) from Nvidia directly. And indeed if you’ve played Baldur’s Gate 3 on a recent PC and not inspected your settings carefully then congratulations, you’re watching a machine hallucinating results from a not-actually-instantiated predicted universe. Tadpoles, all the way down.]
And at least as far as I can tell, and you can see I have glasses on, I can't tell the difference between the fully artisanal massive calculations done on the GPU and the one where the GPU is just winging it on the fly.
Jennifer: Yeah, and that's a great point. And it's one area where there's lots of research happening on how you split the compute between local and cloud to really deliver the best user experiences. Because we still probably want a little bit more central control to orchestrate, you know, different... these are not like one model's output. These are probably quite a few models orchestrated to deliver the end experience. We need the central planning compute layer still living in the cloud, but a lot of the near-user experiences, whether it's audio, whether it's last-mile delivery, can really live on the device, and how to distill these models is another very active research area.
Real-world applications and optimizations
Patrick: So distillation is the process, I think, of taking a large model that has many parameters and squeezing it down a little bit to try not to lose fidelity with the model, or at least lose as little fidelity as possible, but make it possible to execute faster on cheaper hardware with lower power constraints and similar, right?
Jennifer: Correct, yes. And I'd say that these distillation methods have largely produced really amazing output, especially for transformer models. But I think it's still, again, to produce high output video games, you don't want your video games to look like just pixelated characters. So how to continuously deliver really fast but high quality experiences for these diffusion models is still, I think, unknown territory.
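[Patrick notes: For the technically curious, distillation in one screen: you train a small “student” model to match the output distribution of a large “teacher.” A minimal sketch, assuming PyTorch; the temperature and mixing weight are illustrative placeholders, not anyone's production recipe:]

```python
# Minimal knowledge-distillation sketch, assuming PyTorch.
# The teacher/student modules, temperature, and alpha are placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: push the student toward the teacher's (temperature-softened)
    # probability distribution, not just its top answer.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the student still learns the ground-truth labels directly.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop, the big teacher runs frozen (no gradients) and the
# small student is what you actually ship to cheaper hardware:
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   loss = distillation_loss(student(batch), teacher_logits, labels)
```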
Patrick: Yep. I was blown away maybe three years ago, which is of course ancient history in AI years, when Stable Diffusion came out and it was two gigabytes. And like, how can you compress so much of the visual arts history of humanity into two gigabytes and also photographs in the natural world and similar?
It turns out maybe the world is, I won't say much less interesting, but it's much more compressible than we expected.
[Patrick notes: I would not particularly recommend Stable Diffusion these days; MidJourney and OpenAI’s Image Generation get much better results out of the box. However, you can run it locally on your machine for free, which is something of a feature, and indeed much of the community excitement around it is in pushing generation of images that no cloud provider wants to be responsible for.]
Jennifer: No joke.
Model Orchestration and Layering
Patrick: Yeah. So let's see. You mentioned that we have multiple models working in concert with an orchestration layer to deliver some of these experiences for users. I assume that's even more important when many of our users are accessing on relatively underpowered devices, like say the phone in their pocket versus a gaming PC that has the newest consumer grade thing from NVIDIA in it.
Have you seen much iteration in using cheap models to do some amount of compute before you send a prompt to a more expensive model or similar?
Jennifer: Actually, this is another very interesting phenomenon that I think only really started happening in the last 12 months.
The production stack, especially for consumer-grade applications, has really gotten very sophisticated in how the models are layered before going into, let's say, a GPT-4 or Claude-level large model. And that's not just happening on consumer products or consumer hardware. It's happening across the board, even for cloud SaaS applications. One reason is that I think we understand the model quality and the model characteristics much better: where we should apply smaller models to do more deterministic tasks, and where we should really leverage the reasoning capabilities of these large models.
I'll give you an example. Document processing is actually a very interesting area for me as well. It really is what I call the ETL for the AI agent era. A lot of the unstructured data or a lot of really underutilized business information is still living in unstructured data. But do you apply a Claude Sonnet model to all the PDFs you're getting to really extract information and process that with these largest, most capable reasoning models? You can, but it just will take you a lot of time and a lot of tokens and a lot of money, knowing that these omni models and vision models are much more expensive. And you probably can also...
Patrick: I sometimes literally apologize to the smarter models. “I realize this is a waste of your time, but it would be even more of a waste of my time: What's the sum of column H?”
[Patrick notes: Every time I mention this sort of thing on Twitter someone says that hallucinations make this far less useful than it actually is, and what can I say, the Department of the Treasury is welcome to object to my Report of Foreign Bank and Financial Accounts on accuracy grounds any time it wants to, but it appears that the answers that LLMs got (and I spot checked) from dozens of underlying documents were Good Enough For Government Work.]
But yeah.
Jennifer: Exactly. Yeah.
The problem is not just, you know, overuse of the capability; you probably also won't get the best and most optimal results from them. Because some of these documents are really hard to understand. They have very nested tables. It's sort of a long-tail, domain-specific problem that is still best addressed by traditional machine learning and OCR or vision models that are really fine-tuned toward those capabilities and tasks. This is what I love about infrastructure, you know, these are the corners of the world where really good teams or people will go dive in and just gather all the most complex documents and the most nested and cumbersome tables and invoices, to really understand what the distribution looks like in the long tail and build very powerful, smaller models to address that problem.
And once you have the parsed data, feed that into the largest and most capable models to perform the next round of tasks. That kind of layering and pipelining is happening across the board, where smaller, more deterministic, less reasoning-capable models run either on device or as a pre-processing step before we really leverage larger models for their creativity or intelligence.
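[Patrick notes: In code, this layering is often nothing fancier than a router: cheap, specialized models handle the predictable cases and only the hard ones escalate to the expensive model. A minimal sketch; every helper function and threshold here is a hypothetical placeholder, not any particular vendor's pipeline:]

```python
# Sketch of a layered document-processing pipeline: cheap, deterministic
# models first, the large reasoning model only for the long tail.
# run_ocr, small_extraction_model, and large_reasoning_model are hypothetical.

def parse_invoice(pdf_bytes: bytes) -> dict:
    # Step 1: cheap, deterministic extraction (classic OCR / layout analysis).
    text, layout_confidence = run_ocr(pdf_bytes)

    # Step 2: a small fine-tuned model handles the common, well-understood formats.
    if layout_confidence > 0.9:
        fields = small_extraction_model(text)
        if fields.get("total") is not None:
            return fields

    # Step 3: only the nasty, nested-table, long-tail documents reach the
    # large (and expensive) reasoning model.
    return large_reasoning_model(
        "Extract vendor, line items, and total as JSON:\n" + text
    )
```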
The Evolution of B2B SaaS
Patrick: I like how this sort of reduces the buy-in to doing many of these classic tasks for B2B SaaS tools, like invoice processing and similar. Back in the ancient history of five years ago, if you were a seed-stage startup and didn't know what the bank statement formats looked like at the 4,500 different banks in the United States that are running on approximately 1,000 different tech stacks, what you would probably do is source it across your team: okay, you've got Bank of America, I've got Chase, someone get Amex, go, go, for the top 10.
And then for the rest of them, we're gonna try to do some handwavy heuristics. And often we're going to end up piping it to Mechanical Turk and telling our users and our investors “Yep, we’ve got AI to do that.” Because, you know, we do have some amount of code written for the 10 that we have easy access to.
But these days, I think it might be like, okay, the default option is we pipe it to ChatGPT or Claude or similar. And then we do some post-processing and ask, which of these has the worst errors? And for the ones with the worst errors and the largest representation in our user base, we get the in-house machine learning team to build more classical machine learning models to crunch down the error rate and eventually crunch down the cost.
And then potentially you don't ever need to go out to like the First National Bank of Kansas and have a hand-trained model for them because like for the five users per week that is relevant to, GPT will do a good enough job most of the time.
[Patrick notes: dang it, my mental Markov chain for stereotypical bank names returned an actual bank. Apologies to the actual First National Bank of Kansas; I didn’t mean you all specifically.]
Jennifer: It will be a very disappointing future if we let humans do the most soul-crushing work, reading very fine-printed text, while models do the most fun part of the work, which is planning, reasoning, debating interesting concepts. So we're definitely going to solve this problem so humans don't have to go through, let's say, call transcripts to correct the jargon, or go through documents to figure out what the handwriting really says.
Patrick: As someone who got his start in the industry doing OCR via the Mk 1 eyeball on faxes and then has done any number of transcript corrections myself over the years, AI please take my job. No one should have to be doing that now that we have machines that can do it.
Jennifer: Definitely. And I know this is maybe going off on a little bit of a tangent, but I know, Patrick, you lived in Japan, and the characters of Japanese or Mandarin Chinese are another game entirely: you know, very similar looking, but semantically meaning very different things. And that's again where large language model capabilities can really solve a lot of human-unsolvable problems. I'm very excited for that.
Language Models and Translation
Patrick: Yeah, it was almost wild to me. I haven't had the experience with Chinese, obviously not a language I can read well. But the quality of machine translation to and from Japanese got much better as soon as large language models came out, which is odd because I don't think any of the labs were specifically targeting that as a result. In fact, I've heard tell that many of them did not officially have Japanese in the dataset.
But there was enough Japanese in what was believed to be an English-language curated dataset that the models were able to infer many of the rules of the Japanese language. The first conversation I had with a particular model (I’ll elide the name) was:
Here's an article from this morning on CNN. Translate into Japanese, like you were a professional translator for a high status media publication.
And as someone who was a professional translator for a while, yeah, it did really, really well at that.
[Patrick notes: Many professional translators, particularly in this language pair, would prefer if I were grading a translation going J->E rather than E->J, but since you’re probably not a professional translator, please just accept that I’m literate and capable of telling whether a paragraph is reasonably decent or very obviously not.]
Whereas, as you're probably aware, if you ask (pre-LLMs) Google Translate for the same task, you'd be able to infer the topic of the article, but on the sentence level, it would be gibberish.
Jennifer: Exactly, exactly. It's a surprising side benefit of the Internet having lots of variety of languages.
Infrastructure Evolution
Patrick: Yep. So we mentioned the infrastructure layers with respect to where inference is conducted and this notion of assembling and orchestrating low cost, high cost models and near and far to the user.
I also get the impression that infrastructure companies themselves are probably going to change a little bit. And this is similar to B2B SaaS. There will be some existing players which just take the existing solution, bolt AI onto it, now you're done. And then there's probably going to be other companies which say, no, if we had had large language models when we built this product, the product would look entirely different.
[Patrick notes: We’re seeing fascinating examples of this everywhere. Take Cursor or Codex, for example. What would a programming environment look like if it were about a dialogue with a codebase rather than an editor-with-sprinkles on top of a codebase? We’re going to find out, and I doubt that is the final form of programming, either.
Extend that intuition into e.g. invoice reconciliation. Keshikomi Simulator, a game some of us at Stripe Japan made to dramatize invoice reconciliation, should be 100% solvable by modern LLMs. Should that impact invoice reconciliation UIs in the future, as much as the desk/inbox/etc skeumorphism impacts the UI of Keshikomi Simulator? Very probably yes!]
Do you see examples of that in the AI infrastructure as well, where some of it is doing the old stuff in a slightly new way and some of it is just doing fundamentally new things?
Jennifer: 100%, and it's such an interesting time and also such an interesting question of, like, what does infrastructure even mean? Because if you look across all the model companies, let's put it this way: a lot of the engineering resources are already put there. The research side is of course advancing the model capabilities, but the platform side is really building really solid foundational infrastructure.
That includes the training data pipeline, the fine-tuning and post-training pipeline, and large-scale inference, and also supporting the training scale, given these models are becoming more and more multimodal and more capable. These are huge infrastructure problems, and they all live within, like, one organization. In the past you would probably outsource, let's say, ETL or your data collection, or the compute, to, let's say, a cloud provider. Of course the cloud provider still plays a big role in this world, where they support both the GPU and CPU workloads, but a lot of the specialty and scaling capability lives within these organizations.
Patrick: It will be interesting to see how much of that sort of spins out of those organizations. I think one of the classic stories in SaaS, both as a user and presumably from the investing side, was there's a capability you would have if you were a systems engineer at Google. You are no longer a systems engineer at Google, but you tried building something and you really wanted this capability, and so you made a company to externalize that.
And so you end up seeing things like orchestration layers or alerting or logging and things that the large companies have an entire team of people working on would become companies like PagerDuty or Datadog or Splunk, for all the places that can’t justify 200 people working on those tools internally.
Reinforcement learning and synthetic environments
Jennifer: Yeah, and we're already seeing that happening. For example, there's a new category of reinforcement learning environment companies popping up. And it's really interesting, because we know the way to get the next mileage or next level of capability out of these models is to really apply reinforcement learning, which means you're situating the model in a very specific scenario. Let's say understanding checkouts: navigating all the e-commerce websites, figuring out how to go through a whole transaction, and also comparing products and so on. To enable an agent to be really good at these types of tasks, you need to build environments that put agents in this kind of gym, have them practice again and again, and set up the end goal, which is evals, so they complete a task at a higher and higher success rate. This capability largely lived within labs before, but there have been companies coming out to build this as a standalone business, because for SaaS companies or new agent companies to take their agent to the next level, they need this type of environment to improve the agent workflow and the agent capabilities. So that's already something we're seeing: capabilities that lived in labs, and people with the domain knowledge of how to post-train, how to do RL, and how to evaluate these models and agents, coming out to build new companies. And to your earlier question…
Patrick: And so for—can I just give a little bit of background for people who might not understand the problem this is solving. So the large language models were originally trained on, it's a little bit unclear, but let's say a large crawl of the Internet and other open text sources. And there may be closed text sources as well.
But we don't have infinite amounts of data to train them with, and we've tapped out much of the low-hanging fruit, and relatively few people have taken the time online to describe step-by-step their experience in, say, using your example, buying things through a particular e-commerce shop that does $10,000 a month of revenue. And so what we're doing is creating synthetic environments where we can simulate the things that we want them to do at a very high cycle rate, far higher than human activity could generate, and just let them sort of self-play this game a million times, 10 million times, et cetera, to get really, really good at the game that is winning one particular checkout flow, or to be able to generalize from, I've played 10 million games against 200 checkout flows. I don't think the 201st checkout flow is going to surprise me so much now.
And this is just fundamentally something that you need to go and build, both to figure out, like, short of doing 10 million transactions every day, how do I actually get a computer to be able to look over the shoulder of that interaction? And then there's some craftiness: there's presumably people spinning up VMs that are each doing an instance of Magento [Patrick notes: I blanked on the name of this while recording], some sort of emulation of the Amazon flow, et cetera, et cetera, and maybe even some sort of multistage thing where, okay, coding bot of choice, please create something which functions a lot like this flow. And then I'm going to have the agent bot train itself against your thing, and then compare, in a more costly environment, whether the training environment matched the real-life experience.
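[Patrick notes: If you squint, these synthetic environments look like the classic gym interface from reinforcement learning: reset the simulated shop, let the agent act, score it against an eval. A minimal sketch, with the environment, actions, and scoring all made up for illustration:]

```python
# Toy sketch of a synthetic checkout environment in the gym style
# (reset / step / reward). Everything here is an illustrative placeholder.
import random

class CheckoutEnv:
    def reset(self):
        # Spin up a fresh simulated storefront (in practice: a containerized
        # Magento/Shopify clone, or a generated lookalike of a real flow).
        self.page = "home"
        self.steps = 0
        return self.observe()

    def observe(self):
        return {"page": self.page, "dom": f"<html>...{self.page}...</html>"}

    def step(self, action):
        self.steps += 1
        # Toy transition table: the agent has to find the right buttons in order.
        transitions = {("home", "click_product"): "product",
                       ("product", "add_to_cart"): "cart",
                       ("cart", "checkout"): "done"}
        self.page = transitions.get((self.page, action), self.page)
        done = self.page == "done" or self.steps >= 20
        reward = 1.0 if self.page == "done" else 0.0   # the "eval"
        return self.observe(), reward, done

# Self-play loop: run millions of cheap episodes like this one, then train
# the agent on the trajectories that scored well.
env = CheckoutEnv()
obs, total = env.reset(), 0.0
for _ in range(20):
    action = random.choice(["click_product", "add_to_cart", "checkout", "scroll"])
    obs, reward, done = env.step(action)
    total += reward
    if done:
        break
print("episode reward:", total)
```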
Jennifer: That's exactly right. And a lot of people, I think very accurately, compare the agent improvement steps to autonomous vehicles or autonomous driving, where you kind of need to simulate the roads, the highways, the corners, the pedestrians on the road, different types and sizes of vehicles, and you run the simulation often and hard enough for the AI models to understand these corner cases and also know how to navigate through them. Eventually the agents will be powerful enough to complete a task as simple as checking out in online shopping, or something even more involved, like replacing a cumbersome or long-running knowledge worker workflow.
The future of SaaS and AI integration
Patrick: So you mentioned that you had another comment or we can go to another topic, whichever you would like.
Jennifer: Yeah, I want to comment on earlier, when you were mentioning what things are not changing for infra companies. The good news for infrastructure is, you know, these are very horizontal capabilities that support all kinds of workloads, and AI is just another level or another form of workload. So a lot of the core infrastructure, whether it's, let's say, API gateways, where companies need to manage communication with third-party traffic, or databases, transactional databases, analytics, observability: these are still cornerstones for setting up production systems, and they need to be robust and reliable, where AI is just really driving more volume and more demand to these types of systems.
Even though all the players, including Datadog and Databricks, are all really adapting to the AI world, I think the net benefit is there. It's just, you know, there's more demand for digital formats of data and throughput, and that's going to drive sheer computing workloads.
The Future of Observability
Patrick: I imagine we're going to see some totally new observability companies, certainly at the margin, because historically, the observability business has been like, we're selling storage and compute at a markup.
Your application is generating billions of lines of logging per day. Each of those lines was handcrafted by an engineer at your company. And then we're going to put them all into some storage solution. And then occasionally, we will run a heuristic that was also handcrafted by engineers at your company to raise an alert to the appropriate team to say something might be going wrong or might not. And you as the on-call engineer at 3 a.m. get to decide, do I wake other people up or do I just say, this is the third false positive today? This is a model for directing the use of scarce cognition in the world.
We only have so many engineers that can only write so many log lines. They can only write so many heuristics. And then there's only so many people we can bother so many times at 3 a.m. in the morning before we need new staff engineers to replace the ones that quit.
Jennifer: I'm sure every on-call engineer cannot wait to have self-healing and agents to complete that on-call duty for them.
Patrick: Amen to that.
And so you might imagine the next wave of these things is, OK, we're probably still going to need a silly amount of storage. We're probably going to need perhaps even a larger amount of compute. But we will be just passively monitoring things and suggesting rules, or maybe self-implementing rules with, again, that like stair-step approach of we're going to have relatively dumb things looking through 10 million lines of log to flag to a relatively smarter thing. Hey, this looks like an anomaly. Can you dig into it? And maybe the relatively smarter thing gives them a new prompt to run, et cetera, et cetera. And at some point, this system will say “Uh oh, we’ve got to call the bosses”, and then finally interrupt the engineer at 3 AM, but hopefully less frequently than in the status quo.
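[Patrick notes: The stair-step looks something like this in practice: a dumb, cheap filter reads everything, anomalies earn the attention of a more expensive model, and only then do we consider interrupting a human's sleep. A minimal sketch; all of the helper calls here are hypothetical:]

```python
# Sketch of tiered log triage: cheap heuristics first, a smarter (pricier)
# model second, a human page only as the last resort.
# llm_summarize, page_oncall, and file_ticket_for_morning are hypothetical.

def triage(log_lines: list[str]) -> None:
    # Tier 1: dumb and cheap, runs over millions of lines.
    suspicious = [line for line in log_lines
                  if "ERROR" in line or "timeout" in line]
    if not suspicious:
        return

    # Tier 2: a smarter model summarizes, classifies, and proposes next steps.
    summary = llm_summarize(
        "Given these log lines, is this a known benign pattern or a likely "
        "incident? Suggest one diagnostic query to run next.\n"
        + "\n".join(suspicious[:200])
    )

    # Tier 3: only now do we consider waking someone up at 3 a.m.
    if summary.get("severity") == "likely_incident":
        page_oncall(summary)
    else:
        file_ticket_for_morning(summary)
```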
Observability and self-healing systems
Jennifer: I started my career at a company called AppDynamics, which is an APM, or application performance monitoring, solution that really serves DevOps engineers for these types of incidents and for understanding the health and performance of large-scale systems. We supported the really big banks and also the largest travel agencies. I was a product manager on the web monitoring product, and when I dug into the product, I thought: this is a lot of alerts, a lot of charts and graphs, to understand what's really going on.
And if you are just managing one part of the stack or one part of the system, to connect the dots of where exactly this error is happening, you kind of have to, like you said, ping quite a few teams to do the root cause analysis and put together a war room. That's another feature I worked on. Today, I think that's still largely how teams troubleshoot and resolve incidents: they need to huddle on Slack, figure out where exactly in the stack or the system the error is happening, and go, you know, three, four steps deep into the database to figure out if one database just cracked.
I have been seeing a lot of innovation and advancement in using agentic technology to at least solve this issue-identification and diagnosis problem. I think self-healing is always the Holy Grail and always, you know, the North Star. I don't know when or if we get there. But even just understanding the error messages, putting the whole set of incident alerts and issues together, and giving a human SRE a summary of what's going on and what the potential causes are, to, like you said, save a couple of pings to other teams, and just scoping the impact of the problem itself, is hugely beneficial to teams that are supporting large production workloads.
One day, I believe we're potentially going to get to some amount of self-healing and self-autoscaling. It's still a very hard problem, but especially given all the coding agents that are much more powerful today and can start to cross system boundaries, I'm very hopeful for that to start showing its capability and power.
Rubber Ducking 2.0
Patrick: One of the early examples that I heard of this was engineers adopting it sort of on a bottom-up level. Again, the person who gets the call at 3 a.m. in the morning might be a 25-year-old engineer in their first job. That's why they are the one getting woken up at 3 a.m. in the morning: to protect the sleep of the people who have already gone through this very necessary ritual, which is still, in part, a hazing ritual.
And so we have this word in software engineering called rubber ducking, where sometimes when you have a question that you really want to ask your team lead or a senior engineer around the company, you should ask a rubber duck first. And just the act of verbalizing the question will cause you to realize the answer. And so Rubber Ducking 2.0 was, before I wake anybody up, ask the LLM: what's the worst thing that can happen if the caching layer is down and I restart it?
And a model could tell you before a senior team lead needed to be woken up to tell you, well, that could cause a thundering herd problem.
And then you might say, look, I've only been here for a year and a half. I don't know what thundering herd means. It would say, OK, well, that means that when you reset all the caches up at once, all the data is not going to be in the cache. That's going to cause increased load at the database layer. And this is a prior cause of outages at this firm and others.
And then you might realize, good thing I asked. The thing I wanted to do will instead blow me up. I should call in someone more senior to help me come up with a better plan, or I should ask the model for a better plan, or I should read my run book again, which probably said on page three, by the way, don't restart all the caches at once, thundering herd. And once these get more integrated into the natural affordances for doing these things, once they're available on Slack, once they're sort of pre-digesting the error messages and the telemetry data for the engineers to give them personalized recommendations...
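[Patrick notes: For the curious, the fix the senior engineer would suggest here is usually some flavor of staggering plus jitter, so the database sees a trickle of cache misses instead of all of them at once. A minimal sketch; the node list and restart() call are placeholders:]

```python
# Sketch of avoiding a thundering herd when cycling a cache fleet: restart
# nodes in small batches with random jitter instead of all at once.
# cache_nodes and restart() are illustrative placeholders.
import random
import time

cache_nodes = [f"cache-{i:02d}" for i in range(20)]
BATCH_SIZE = 2        # only a sliver of cache capacity is cold at any moment
BASE_DELAY_S = 60     # time for each batch to warm back up
JITTER_S = 15         # jitter so the resulting misses don't land in lockstep

for i in range(0, len(cache_nodes), BATCH_SIZE):
    for node in cache_nodes[i:i + BATCH_SIZE]:
        restart(node)  # hypothetical call to your orchestration tooling
    time.sleep(BASE_DELAY_S + random.uniform(0, JITTER_S))
```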
Jennifer: Now we have the most relentless rubber duck.
Patrick: And, you can take this or leave it, but after people get comfortable with using them, and after they realize that the teams that opt into this just have many fewer escalations and many fewer high-severity incidents than the teams that don't, it's going to transform a lot of people's lives for the better, I think.
Jennifer: Yeah, this is part of the reason why I really love infrastructure and love working on these, let's say, potentially boring and unsexy problems. They first really impact the lives of all the people that are on call and being woken up at 3 a.m., and their kids and wives or husbands as well. They also impact all the customer experiences on the other side: whoever is, you know, trying to buy something for their family or book a flight for a business trip. It impacts both sides, and if we really improve the experience and the performance of the software and the systems, it really changes lives.
Patrick: Yep. For some of the largest firms in capitalism, there's a number of transactions that happen in every minute of the day. And the difference between 17-minute time to recovery and eight-minute time to recovery means potentially hundreds of thousands of people don't get negatively inconvenienced by something. And that might sound sort of frivolous, like maybe if it's an e-commerce shop, they just come back later and do it, fine. [Patrick notes: It was, in fact, not fine.]
But if they're sitting at an ATM and the ATM won't give them their own money because of transient problems in the ATM network, that can be an emergency for someone who needs that money to get on the bus to go to work, or needs the money because they want to feed their family that day, et cetera, et cetera. “Come back in 45 minutes” doesn't necessarily solve the problem. And so decreasing the time to detect errors and the time to recover from errors is compounding good news across society, both for the people who are sort of in the room resolving them and for everybody else who is just downstream of the widely useful applications that we all use in our lives.
I think we probably have enough time for one more deep dive into a fun facet of infrastructure. What's one that you wouldn't expect people to think of, but it's really been seizing your attention recently?
Web scraping and automation
Jennifer: I think we talked a little bit about this earlier: what is the infrastructure that supports agents or feeds information into them? Because for agents to perform their tasks, it's not just about the models being really smart, really capable. They really need to be able to use tools, have context, have data to act on. And I think a lot of the attention has been spent on how autonomously and how smartly agents can navigate the web and understand documentation. But I think what's really under-appreciated is these really small, intuitive tasks, such as scraping web pages and clicking the button that's off on the side and maybe hidden and not so obvious, whether that's the checkout button or the “no thank you, I want to return to the homepage” button. That's really around web scraping capabilities and, again, understanding unstructured data.
And I think a lot of people probably imagine at this point, scraping and web automation is a solved problem.
Patrick: <joking but serious>Said no one who has ever done scraping professionally.</joking but serious>
Jennifer: For anyone who has done scripting, like you said: in the past, we were writing these very long and definitive scripts to understand if this is a header, if this is the division that has the content, if this is the text I really want to grab.
I think even in the last generation, I'd say before large language model capabilities came out, there was already really interesting innovation in using a combination of vision models plus these JavaScript automation solutions to build really robust scripting technology. There's a company called Diffbot that I've been really obsessed with over the past few years. I think now, once we have more capable vision models, and the combination of scripting languages plus vision models is driving these navigation types of agents, we can do a lot more in understanding really complex websites that have both horizontal and vertical scrolls, have lots of nested components, and maybe are even a canvas with multiple cursors and multiple people interacting with the application together.
We have much better capabilities to understand complex DOMs and web structure, to grab the right information to feed into agents and perform the next task. That's another area I've been really excited about, seeing lots of companies being built around how to advance agents navigating complex applications, and also how to feed that information, as sort of a unit of value, through an API to downstream agent workflows. So I think this is another area where lots of infrastructure companies will be built.
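[Patrick notes: The modern version of that “very long and definitive script” increasingly looks like a browser automation tool driving the page plus a model doing the understanding. A minimal sketch, assuming the Playwright library; the ask_model() extraction step is a hypothetical placeholder:]

```python
# Sketch of LLM-assisted scraping: Playwright drives the browser, a model
# (ask_model, hypothetical) does the extraction instead of hand-written selectors.
from playwright.sync_api import sync_playwright

def extract_product(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Let client-side rendering settle before grabbing the DOM.
        page.wait_for_load_state("networkidle")
        html = page.content()
        screenshot = page.screenshot()  # for a vision model, if the DOM alone is ambiguous
        browser.close()

    # Instead of a brittle hand-written selector per site, ask a model to find
    # the fields in whatever structure this particular page happens to use.
    return ask_model(
        "From this page, return JSON with the product name, the price, and a "
        "selector for the checkout button.",
        html=html,
        image=screenshot,
    )
```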
Patrick: Yep. I think it's sort of underrated in importance, because these things that sound like one task to a human, like “just book my tickets to Hawaii, that's the one thing I want you to do today,” are actually, you know, 40 things in a row, where it's like, okay, navigate to Delta.com, click this button, click this button, click this button. And if you have a 99% success rate on each of these component tasks and they're relatively independent of each other, it's like 50-50 on whether you get tickets to Hawaii or not.
[Patrick notes: log 0.5 / log 0.99 = approximately 69, so you need approximately 69 independent 99% accurate tasks before you have only a 50% chance of overall success. I did not successfully mentally compute that in real time.]
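[Patrick notes: Or, if you want to check that arithmetic in two lines of Python:]

```python
import math
print(math.log(0.5) / math.log(0.99))  # ~68.97 steps before overall success is a coin flip
print(0.99 ** 40)                      # ~0.67: forty 99% steps still fail about one time in three
```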
Patrick: And users won't bite, and businesses won't use it, if it's 50-50 on tasks that anybody at the company can successfully accomplish. And so I think one of the things to do is just make the infrastructure around agents more robust on things like parsing the DOM, like you said, giving perceptual feedback to them, and decreasing the number of decisions they need to make.
I think another thing that's increasingly going to happen is we create sort of side doors for the AI where, for example, the MCP servers that are becoming somewhat popular these days. For people who haven't played with them, MCP is essentially an API++ that you expose to agents directly. And you explain to the agent, like, the typical way people get things done in the API is to hit these methods in this fashion. Here's the documentation. And it's far more complicated than you could give a human to do because you can't expect the typical person interacting with the Delta website to intuit how to use a JSON-based API quickly. But since agents are scarily good at that, you can just say, OK, here's the typical way to book a ticket. And then they don't have to reconstruct the exact sequence of database queries and API calls that your application does.
Jennifer: [Laughs]
Patrick: They could just invent a series of API calls. And if it successfully gets the user's task accomplished, you're happy about that. And it's an open question for me, how much of the MCP development is going to be done by agent companies? And we're going to use a published API or build a scraping engine, et cetera, et cetera, just to make the inference time compute costs lower. Or how much of it is going to be exposed directly by the companies that have some good or service in the economy to sell and realize like, okay, some portion of the future goods and services purchased through our company will not be sold through the website. They'll not be sold through a phone order or a fax. They're going to be sold through this new channel with a human ultimately driving the purchasing decision in some fashion, but with the actual mechanics of it getting done by an agent that is driving this MCP script. And so I can put an appropriate amount of engineering effort into making my MCP server one that will successfully sell people airline tickets.
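[Patrick notes: For developers who haven't seen one, an MCP server is mostly “describe your tools well enough that a model can call them.” A minimal sketch, assuming the official Python MCP SDK's FastMCP helper; the airline booking functions themselves are hypothetical placeholders:]

```python
# Minimal MCP server sketch, assuming the official Python MCP SDK (FastMCP).
# search_inventory and create_booking are hypothetical back-end calls.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("airline-booking")

@mcp.tool()
def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Return candidate flights for a route on a date (YYYY-MM-DD)."""
    return search_inventory(origin, destination, date)

@mcp.tool()
def book_flight(flight_id: str, passenger_name: str) -> dict:
    """Book a specific flight and return the confirmation record."""
    return create_booking(flight_id, passenger_name)

if __name__ == "__main__":
    # The agent reads the tool names, signatures, and docstrings above and
    # decides how to sequence calls; that description is the real product surface.
    mcp.run()
```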
API economy and agent interactions
Jennifer: Yeah, that's such a great point, going back to sort of the middleware layer. I am very certain the whole API economy and API tooling is going to have another boom, just because, you know, I like to say your API spec, which is what a lot of these MCP servers are built upon or are wrapping, is more important than your homepage these days. Your homepage is being visited by human beings, but your API, your MCP servers, your documentation are being read by agents, which are much higher volume and much more demanding of quality and reliability.

This is also a reason why I invested in a couple of companies in this domain. One is called Stainless API, where they help companies like OpenAI and Anthropic generate their SDKs, and they build MCP servers as well for a lot of SaaS companies, really providing this interface for AI agents to interact with your product, your API endpoints, your capabilities and data. That's really the next surface, I would call it the product surface, of a lot of application layer players as well as infrastructure players.

And I invested in this other company called Mintlify. That's an AI-first documentation tool where, again, we'd probably think documentation is a solved problem: good, clean formatting that gives developers what they need to understand to adopt a technology. But today, as we know, coding agents are probably writing more lines of code than human developers. So the documentation tool, which is, again, the interface between coding agents and the corresponding API endpoints or infrastructure tooling, is the surface area for agents to adopt and understand tooling. And how to make both the content and the format easy for agents to understand and leverage is another problem that the Mintlify team is solving.
Patrick: Yep. And I love how this demonstrates how infrastructure is fractal because, you know, the readme.io and other software documentation as a service providers, you start by, okay, the design needs to look nice. Engineer comes in, gives the platform a markdown text or similar. It gets put on a subdomain. Everybody's happy.
But then like the next layer of that is, well, everybody's not actually happy. Like some parts of the documentation are more effective than others. There needs to be a commenting interface through here and maybe some way to tie directly into the user's workflows or the authoring engineer's workflows or bug reports to the company or similar. In an MCP world, you can't simply ask Google Analytics, what are the top pages in our documentation and which do people bounce off of quickly? Because that metaphor no longer works for agents. They're accessing hundreds of pages of documentation, and who knows if they understand them or not.
And so you need to come up with a new metaphor and a new feedback loop between the agents that are attempting to consume an MCP and the MCP server provider and then the company that is ultimately providing the code or service. And so it might be as simple as, hello, Mr. Agent, would you like to take a quick five-second survey to tell us what instruction you were most confused by? But it's probably going to be some combination of that and passive analytics tooling that probably feeds lines into an API and says, for this agent that, again, we do not control directly, we can only give it input and see what it does with it. Which inputs do we give them that didn't seem to work well? And what could we do about those inputs to make that agent or other similarly situated agents more effective in the future with the ultimate goal of increasing task success and therefore increasing the amounts of goods or services that we sell? And so...
Jennifer: Totally.
Patrick: There are going to be companies on top of companies on top of companies and layers on top of layers on top of layers here.
Jennifer: Yeah, I just remember a data point, I think maybe Vercel put it out, that 30 or 40% of traffic today, at least on the sites that they're hosting, is coming from agents or bots. And a lot of that effort is being spent on crawling 404 pages, which is another efficiency issue we need to solve; a 404 means the page doesn't exist.
Patrick: Oh boy.
The Future of SaaS
Patrick: A thing I hear a lot from people recently:
They think the age of SaaS, the age of being able to build meaningful software businesses, might be coming to a close because all software in the future is going to be generated by business style users who are going to create it at need versus paying monthly for services. I am extremely bearish on this point of view. But I'd love to hear, do you think there are still going to be investable companies in two years?
Jennifer: I've been hearing this argument for a good part of the last two years. There have been two versions of it: you know, SaaS is dead, we don't need software anymore, everyone can build their own; or, you only need a chat interface that can go execute and perform tasks against the respective software, and you don't need these large teams building all this UI that needs human beings to navigate it, do data input, look at dashboards, and so on. I'm very bearish on that view as well. I do believe a lot of the interaction with SaaS software will change. We probably will have much cleaner, simpler, or more intuitive interfaces that are much more adaptable to users' needs, which is good news for everybody. But I feel like largely what SaaS companies are solving is deeply understanding knowledge workers' and professionals' domains and workflows, and encompassing that into a schema and a software interface that addresses it in a very efficient way. And a lot of that also serves the needs of not just individuals, but teams collaborating on one task. This is where the system problems come through, right? When you have large software systems, it's about communication, the boundaries, the networking, and API endpoints talking to each other. When software serves teams and human beings, it's a lot about going through those kinds of human and team boundaries and consolidating information and truth into one repository. I don't think that is going to go away, as much as I believe AI agents, and especially coding agents, will help us build better and better.
Patrick: Yeah, I think it would be a tremendous disservice to the world if we went away from a model where DocuSign, for example, has to understand the ins and outs of contracts in the American insurance or real estate industry, to one where we push that onto the desktops of 2 million people every day, where that might be 5% of their job and they need to be an expert on all things.
The fundamental trade that SaaS does for the economy is not simply hiring the coders that you can't hire and then we bill them to you at $100 per month. We employ the experts, we do the ethnography work, we think of the edge cases so you don't have to, and we hire this team of sales and marketing professionals to tell you, yes, you could do this with $500 a month of paraprofessional time, but you probably shouldn't because we can sell you a better thing for $100.
I think it is extremely unlikely, contingent on humans doing valuable work in the economy, that humans stop successfully selling software to other humans. We will still need to understand the world around us, and we will still need to build models of it, and still need to convince other people to use those models versus attempting to do their own voyage of discovery with either a pen and paper or a chat app, as the case may be.
So I know we're coming to the end of time. Jennifer, where can people follow you on the Internet?
Jennifer: I'm on Twitter. My handle is @jenniferhli. Please follow me. I post quite a bit about infrastructure and infrastructure go-to-market challenges. So I'd love to hear your point of view and get feedback.
Patrick: Okay, thanks very much for coming on the program today and for the rest of you, thanks very much and we'll see you on Complex Systems next week.
Jennifer: Thank you so much. This has been a lot of fun. It's my pleasure.
Patrick: It's been a lot of fun. Thank you.