The great developer speed-up, with Joel Becker of METR

The great developer speed-up, with Joel Becker of METR
Joel Becker from METR discusses how rigorous measurement of AI tools reveals a persistent gap between perceived productivity gains and objective reality.

This week on Patrick McKenzie is joined by Joel Becker from METR. They discuss groundbreaking research on AI coding assistants.

Joel et al’s randomized controlled trial of 16 expert developers working on major open source projects revealed a counterintuitive finding: despite predictions of 24-40% speed improvements, developers actually took 19% longer to complete tasks when using AI tools, even though they retrospectively believed they were 20% faster. The conversation explores why even sophisticated professionals struggle to accurately assess their own productivity with AI tools, the industrial organization of software development, and the implications for AI's recursive self-improvement in research and development. It also touches on other perspectives from software developers using these tools professionally, and where we can expect them to improve rapidly.

[Patrick notes: As always, there are some after-the-fact observations sprinkled into the transcript, set out in this format.]

Sponsor: Mercury
This episode is brought to you by Mercury, the fintech trusted by 200K+ companies — from first milestones to running complex systems. Mercury offers banking that truly understands startups and scales with them. Start today at Mercury.com 
Mercury is a financial technology company, not a bank. Banking services provided by Choice Financial Group, Column N.A., and Evolve Bank & Trust; Members FDIC.

Timestamps

(00:00) Intro
(00:34) Understanding AI evaluation methods
(02:04) METR's unique approach to AI evaluation
(03:10) The evolution of AI capabilities
(06:44) AI as coding assistants
(09:15) Research on AI's impact on developer productivity
(13:55) Sponsor: Mercury
(15:07) Challenges in measuring developer productivity
(20:38) Insights from the research paper
(31:26) The formalities of software development
(32:07) Automated tools and human discussions
(32:47) AI and style transfer in software
(34:35) The role of comments in AI coding
(36:51) The future of AI in software engineering
(40:25) Economic implications of AI in software
(46:53) Challenges and risks of AI in software
(59:03) Security concerns with AI-generated code
(01:04:59) Wrap

Transcript

Patrick McKenzie: Hi, I'm Patrick McKenzie, better known as patio11 on the Internet. I'm here with Joel Becker, who is a researcher at METR. METR is a nonprofit which does research into the capabilities of AIs and potential dangers associated with them. Thanks for coming on the program today.

Joel Becker: Thank you very much for having me, Patrick. I'm excited to be here.

Understanding AI evaluation methods

Patrick: So before we discuss the particular research results that you've developed recently—and I was quite interested to read the papers, I haven't gotten as much chance to read obviously economically relevant CS-oriented papers since being an undergrad many, many years ago—but let's just talk about the general problem space here in the research labs.

By convention, the companies that develop AI products these days are often called “labs” as well: Anthropic, OpenAI, and so forth. In the labs, they have this concept called an "eval," which is some sort of automated checking as to whether a checkpoint of a model or release of a model is increasing or regressing with respect to some capability that they want. And they have many, many, many evals.

And then there are sort of public evals and benchmarks that we can use to test models against each other and get some level of signal on whether, for example, the leading edge models today are capable of winning the International Mathematics Olympiad at a gold level or similar. 

[Patrick notes: Spoiler, yes. They’d also trivially cite the name of the competition accurately. 

Many, many moons ago I was a mathlete, but not at that level. In a bit of a sobering moment for me, Simon Willison recently published instructions for running a benchmark against OpenAI’s open weights release. That benchmark includes mathematical reasoning problems which were well-calibrated against my skill level ~25 years ago. The first question in the benchmark is supposed to take a talented high school student a few minutes to solve. I tried it on a lark and didn’t make sufficient progress in 15 minutes. (I plead a combination of being rusty and it being late in the day.) OpenAI’s open weights release, which is not a state of the art model, bingoes it instantly, regardless of whether it is rusty or late in the day.]

But that isn't the final result that we care about as society in whether these are complex and powerful systems or not. We actually care about them impacting the rest of the human social system.

And with that out of the way, what is some of the work that METR has done with respect to evaluating these systems? And how do you think of those evaluations as different than quote-unquote "the evals" that are run as part of the development process?

METR's unique approach to AI evaluation

Joel: Yeah, great question. So I think historically, in some ways, METR's work has looked very similar to some of these evals that have come out of labs or been documented publicly. We have internally some sort of set of tasks that we can, in the best instance when things are working well, run automatically. Although they're imperfect and might not perfectly approximate the distribution of things we care about in the wild, they provide some helpful signal as to how capable AIs can be at these tasks in the wild.

One important difference that's reflected in this graph that you might have seen on Twitter, where we show that the length of tasks as measured in human time to complete is increasing exponentially over time, is that instead of measuring benchmark performance—10% on this benchmark, 20% on this benchmark, or even 80% relative to the human baseline, 100% relative to the human baseline, sometimes it's hard to interpret things like that. And so instead, we're cashing out in this "human time to complete" measure. That's an innovation that's been associated with METR and I think helpful for public understanding.

Then this more recent work that I think we're about to discuss.

The evolution of AI capabilities

Patrick: Can I just ask for a quick timeout for people who haven't seen this graph, which I'll drop a link to in the show notes. The intuition here is that in the beginning, it was really impressive that these machines generated sentences which looked kind of like English, as long as you didn't probe it for that much internal consistency. And then somewhere around, say, GPT-3, those sentences started to have striking amounts of internal consistency, but maybe not all that much consistency with reality.

And as we've gone along, the state of the art models have gotten more powerful. It's gotten to be less impressive that they can do something that a human undergraduate or early career professional could do in five minutes, like say write a professional email asking for the status of this project update. And they have this increasing window of using this thinking loop and their external tools where they can do tasks which might take an early career professional, as you said, an hour in the status quo. And then we think that in the future, they might get to the point where they can go off and run unsupervised and do something which would take eight hours of a professional's work in the status quo. Am I interpreting this graph correctly?

Joel: Yeah, indeed. So early on, the task that GPT-2 is sometimes getting correct looks something like: here's a list of four files, one of them is called passwords.txt. Which file are the passwords contained in? GPT-2, I'm not sure was getting 100% on that task, but seeing some success. And then later on, they're finding some facts on the internet. Sometime way after that, they're training a machine learning model end to end. And after that, they're completing sort of novel AI research tasks. I think we're not at very high reliability at that kind of thing yet, but these are the kinds of tasks that the METR graph is spanning.

Patrick: One of the underlying reasons that the METR graph is interesting is the point at which the machines get capable of either recursive self-improvement or recursively making the labs much more effective at iterative model development starts to increase the first order, second order, third order, 52nd order speed at which their capabilities increases. And that starts to amplify all of the risks that we might associate with having very powerful intelligences in the world in the future.

And so part of the order of the day is: okay, well, let's instrument this so we can have some notion of how are they at generalized tasks across time scales, and then we can see, okay, is that graph going up exponentially? Is it like every couple of releases, they get a few more minutes worth of human cognition? Because those are two very different worlds and it's important to understand which of those worlds we are in.

Joel: Indeed, I think as we're about to discuss, the mapping from how long the tasks are that AIs can reliably complete to real world utility might be tricky and uneven. But we are very interested in making these forecasts, especially about the next few months, next few years. And to do that, it's helpful to have some trend to extrapolate. And this is a remarkably consistent trend, which has extrapolated surprisingly well.

Patrick: I think that as we discuss real world utility for these, a surprising amount of the real world utility is not going to be internal to the models themselves, but will be in the systems of how we apply them to different problems in the real world. We will build out infrastructure and other computing systems and other social, technical, economic systems to push more work more effectively at them, particularly to push it in places where it makes economic sense or where it can leverage things that the AIs do somewhat uniquely well and avoid using it for things that are their current weak points.

AI as coding assistants

But that skips a bit ahead of the course. We're going to be discussing the specific task which has been blowing up in all the best ways recently of AIs acting as coding assistants for developers. And I think as useful background knowledge, for those of you who aren't software developers yourself, software developers are a white collar occupation which is quite well compensated, involves no small amount of intellectual labor, and which for a variety of reasons, the AI has gotten firstest-with-the-mostest to the core of the developer experience.

And we can talk about some of the reasons why it got there first, in part because of decisions by the labs to commercialize something that they were using internally, and partly because the training data for developers is just abundantly available due to the way in which our industry tends to function.

Be that as it may, the labs and companies building on top of the labs, like say Cursor, have released coding assistants, which started with one model, which was sort of in-IDE coding assistance, where in the program that you use to do coding, you could select a bit of code and ask the LLM questions. And the LLM could answer questions about that code and maybe try to write the next bit of code for you, given the context around your cursor.

The more advanced method that's—goodness, it seems to be like six months old or less at this point—is called "agents," where in lieu of you primarily being the driver as the developer, you have essentially a chat box open to an agent, quote-unquote, and you tell the agent what you wanted to accomplish. "Build me a system which blah, blah, blah, blah." And then the agent spins and iteratively creates bits of that system and tests bits of that system and thinks to itself, "Am I moving towards the correct outcome here?" And if it gets the correct outcome, it comes back to you and says, "OK, boss, I think I'm done with the thing you want, what do you want me to do next?" And that's the success case. The failure case might get very funny.

[Patrick notes: If you’re a software developer and you haven’t used Claude Code or one of the competing agents yet, you owe it to yourself to spend a few days playing with it on a toy project. It is obviously the future of our industry/profession. Mere words don’t do the experience justice and I do not, as of this writing, have examples I can conveniently show you.] 

And these agents are—almost transformatively, actually, I don't think I need the word "almost" there. They are transformative for the craft of software development. The pre-agent and post-agent world do not—they are discontinuous with each other. And I'll drop a link to an essay by my friend and erstwhile co-founder, Thomas Ptacek about this. Thomas is the most accomplished technologist I've ever met in my life and I've met lots. And he has a very strong endorsement for this being one of the most important things that's happened in his career.

Research on AI's impact on developer productivity

And for what it's worth, this has been something that's been echoed by many people who are credible to me, which sets up something of a research result that you folks did.

Do you want to discuss the paper that you and some other authors at METR wrote about measuring the empirical change to developers' behavior when they were allowed to use AI?

Joel: Yeah, so this work comes out of some of that previous METR work, which recall is using kind of benchmark style tasks. They're automatically graded. They don't interact with the real world in complicated ways often. And so we wanted to go out there and find maximally real tasks and maximally—perhaps not maximally real, but a deployment environment that's most similar to the ways in which AI is currently deployed today in these agentic setups, as you suggested.

So we recruit 16 developers from top open source repositories. For your listeners who are familiar with open source development, this includes Scikit-learn, Transformers, the Haskell compiler—these large, complicated, long-lived, famous, valuable repositories.

And these are developers who have, on average, five years experience on this repository, and something like the third top contributor out of hundreds to thousands of contributors on average. We enter them into a randomized control trial where they...

Patrick: For the benefit of people who aren't themselves technologists, to play back what you're just hearing: the authors have identified some commercially significant work in the world, which is done using a method of industrial organization called open source development. These are not generally amateurs in their bathrooms at home that are doing these projects. These are projects which are staffed by senior technologists in many cases at the who's who of the tech industry and other companies. So Transformers would have contributors from Google and similar who are all contributing to a commons, which is itself used internally at Google, it's used at many Fortune 500 companies, on a non-paid basis.

There's some interesting ethnographic work about the open source economy that people can read in Nadia's book and similar, I'll drop a link. But the briefest encapsulation of a developer who has five years of experience on an industry-leading open source project is this is a very serious professional at a high skill level and commanding a commensurate charge-out rate. And that was one interesting thing I saw from your paper. You paid people for their time, and probably broke the bank for a university undergraduate behavioral economics research, because that usually pays out like $10 an hour and you paid people serious professional developer rates, which—nicely done.

[Patrick notes: The paper cites $150 per hour. This is not necessarily a market-competitive rate for these developers, but is much closer to the ballpark than I’d expect most research organizations to offer, which is why I praised the attempt.]

Sorry, returning flow to you.

Joel: Yeah, so I think that's right. That's going to be a very important part of the story that these developers are so highly skilled and that these software projects—these are not projects that you're building from scratch. So these developers are going to submit the real problems that they work on as part of their open source development work to us. And then we're going to randomize these tasks to allow AI or disallow AI.

"Allow AI" means everything that you think it means. It means Cursor Pro, which we buy them a license for. It could mean using ChatGPT or Claude or Gemini via the web UI. It means tab auto-complete that's AI powered, if they want to use that. It means any AI tools of their choosing. In practice, they're mostly using Cursor, which as mentioned, we purchased for them.

And "AI disallowed" means what you intuitively think of as AI disallowed. It means development in 2019. It means no Cursor, no ChatGPT, no tab auto-complete if it's AI powered.

And then we measure both these developers' and experts' forecasts of how much they might be sped up by having access to AI on particular issues versus the reality of how much they get sped up by AI. The headline is that economics experts, machine learning experts ahead of time are thinking something like there'll be a reduction in the time to complete these tasks of 40%. The developers themselves are thinking there'll be a reduction in the time to complete these tasks of 24%, I think, ahead of time.

Challenges in measuring developer productivity

Then after the fact—this is after the experiment when they completed the issues—retrodicting how much they were sped up in the past, the developers retrodict that they took 20% less time to complete issues when they were allowed to use AI.

And then what we find is, in fact, they took 19% longer in terms of issue time to complete when they have access to AI. Now, there's lots of important caveats to this. I'm excited to dive into it, but that's the headline.

Patrick: I think the retrodiction thing is quite interesting. There's much going on there about both the industrial organization of software development and also the minute-to-minute experience of actually being a software developer.

In the literature around the experience of being a software developer, there is this identified qualia called "flow state," where one becomes subsumed in the problem that one is working on and time seems to stand still for you and you are unblocked and moving forward very quickly. And many software developers having achieved flow can have hours pass without the conscious passage of time there until they get interrupted by a meeting or the end of the workday or some of the things that can happen during software development. (Although some of us enjoy our meetings occasionally.)

The defining characteristic of flow state is that one's intuitive understanding of the passage of time ceases to be reliable during flow state, which people often describe as the most productive environment for themselves. And then something of the muscle memory in the community of being a productive developer is attempting to organize your life such that you achieve flow state for as much time as possible, which makes it quite difficult for you to confidently make statements about how much you've worked and what exactly you accomplished.

Which: the industrial organization of software developers is a little bit annoyed that the engineers don't know how much they worked and don't know what they got done today. And so a lot of the industrial organization built around the industry is about using objective metrics to determine how much actually got done, is this organization progressing, and so forth. And that has all the problems that tracking productivity usually has.

I'm mentioning this to say that tracking developer productivity is a notoriously difficult problem. And I don't think it is one that any reasonable person would think we have solved this today entirely as a side effect of this paper, but it is extremely interesting to hear things like, for relatively scoped problems, where you would assume that a working professional knows that they've been working on something for four hours versus five hours. And you have objective automatically recorded start times and stop times. And you can see that in a randomized controlled experiment, when they use AI, their ability to recall how many hours they've worked starts to change. That is in itself a fascinating insight into the computational sociology of these systems.

Joel: Totally, I think in some ways it's even more remarkable than that. Again, these developers are extremely highly skilled. I think in many ways they are better set up to recall how long things took and to think about the degrees to which they were sped up or slowed down versus themselves in other settings or other people in other settings. They are recording ahead of time their forecast for how long this task implementation will take if they're allowed to use AI and if they're not allowed to use AI. They're recording afterwards how long it actually did take. They're recording their qualitative notes on where AI was more or less helpful, the kinds of things that came up in the issue, whether the AI was especially helpful for understanding how to use an external API that they were unfamiliar with or not, this sort of thing. Again, they're extremely sophisticated people.

And still, their expectations are quite far off reality. This suggests one of the takeaways that we do have from the paper, which is that people's self-reports about the degree of speed-ups that they might be experiencing—I'm not sure we want to say they're systematically too optimistic necessarily, this experiment isn't so large—but I think we at least want to say that they are unreliable. That one method that people propose for measuring the degree to which AI R&D is being sped up by access to AIs, as you mentioned earlier, is to ask researchers the degree to which their work is being sped up. And we think this paper is providing evidence that you shouldn't put so much faith in those kinds of numbers.

Patrick: A bit of background knowledge: these developers are both in the contours of this experiment and also in the modern practice of the software industry using a technology which we call source control, which I imagine differs for each of the 16 individual developers based on their particular habits with regard to using source control. But source control will give you very granular evidence of how much work was accomplished every few minutes.

Similar to a doctor that has a schedule with an electronic medical record system that is keeping a timestamp of every time the doctor orders a test and reviews the results and contemporaneously recording all their notes. If the source control says that they began this work at 10:02 and ended at 10:19, that 17 minute delta is extremely reliable, even if retrospectively the developer has a different opinion as to how long that task actually took. [Patrick notes: I am quite aware that there are different tasks involved and not all will be automatically continuously recorded in a legible fashion—more engineering happens in the shower than anyone would expect, and even more than that in e.g. Googling or writing correspondence—but I find it useful to make this point anyway. Sometimes, when you say that you wrote a working subsystem in 37 minutes, you want people to understand that is not an exaggeration.]

Joel: Indeed. Maybe now would be a good time to say some things that we're emphatically not saying just because I think it'd be so easy to take some misleading takeaways away from this paper.

Patrick: Definitely. This is the perennial bugbear of people that are in scientific communication, which is that after you publish the paper, what people got out of it is not exactly necessarily what was in the paper, or what went into it. And it takes on a life of its own after it is in the discourse.

Insights from the research paper

Joel: So I've been very positively surprised with the takeaways that people have had publicly from this paper. I expected what I think you might have expected, that broadly people would take away "AI bad," "AI is not helpful for software engineering." And in fact, I think in the public discourse, things were well caveated, people understood the nuances. To be super explicit about them:

First, the setting is extremely weird. I think it's extremely interesting. It's weird for the same reasons it's interesting and the reasons it's weird, the reasons it's interesting, we think importantly drive results. These are extremely experienced developers working on projects that they are intimately familiar with. And these projects are large, complex, very long lived. There are different parts of the code base that talk to one another that's going to be non-obvious to these models, all these unusual things going on relative to what a typical software engineer might work on, although of course that varies greatly.

And the second thing is, we're measuring a point in time. Most of these tasks are taking place in February and March of 2025. We're not saying that this was true when the paper was published in 2025. We're not saying it's true now. We're not saying it's going to be true in future, but even in this setting, even these developers would be slowed down by having access to AI tooling. AI is a very dynamic situation. We expect an enormous deal of progress in future.

Patrick: And if I can just voice over some of these, I think the thing that makes the experimental design very interesting to me is that this isn't the sort of LARPing that sometimes happens in behavioral economics departments. This is a reasonable facsimile of commercially valuable work that is actually done in industry, because this very much is commercially valuable work being done in industry by the usual people in the usual fashion.

However, this is not a one-to-one mapping for all commercially valuable work that is done by developers in industry or in the broader world. And particularly, there is a distinction between greenfield and brownfield development. The vast majority of development that exists in the world is brownfield programming. It's working in a system that already exists. And you are somewhat hemmed in by the reality that you are the 100,000th engineer-day of labor going into a particular subsystem at Google right now. And so upending all of the work is not possible. And making major changes to systems around you requires six months of meetings that you probably haven't done. So you are constrained by the structural reality you find yourself in.

Versus there is important economic work that is net new greenfield development. And the anecdotal reports of people doing greenfield development with these tools. And I will throw my hat in the ring that I've been doing greenfield development myself this month and will give +1 on these anecdotal reports. [Patrick notes: A software developer idiom, including professional spaces, is to use +1 to either indicate that you have also observed some behavior or, through linguistic drift, indicate generalized agreement. Frequently it is used to indicate assent of a senior developer to ship code produced by a junior developer, on a code review of a pull request, which are discussed a bit below.] 

Holy cow. It is unreal the speed-up you get in the first, subjectively at least, few weeks of a new system. And then again, there is the question of is that me reporting accurately my experience or is that me being lulled into a false sense of security by the ergonomics of this new form of development. Although I'm also capable of keeping timestamps and I will report that something that I predicted was an eight hour task ended up being a 37 minute task in the recent past. [Patrick notes: Cited already.] So holy cow. 

But not a randomized control trial, an anecdotal observation.

Joel: Indeed, even for you Patrick, I think I now have some reason to doubt your self-reports. [Patrick notes: I will flag this as being delivered with a jocular demeanor.]

Although I feel similarly in my own case, and have reasons to doubt my own self-reports. I will say that on my part, the large majority of work for this paper was done inside of Cursor. I'm not one of the extremely skilled developers working on very brownfield projects...

Patrick: The other thing that I would like to emphasize that you already emphasized is that this is very much a moving target. Cursor—goodness, we'll drop a link to some public data points with respect to their revenue graph—but it's been not merely with regards to AI tools, not merely with regards to developer tools, the public recorded numbers on the revenue graph suggest that it's been one of the most rapidly adopted technologies in the history of computers and adopted by serious professionals that are paying serious money versus vaporware or it's very buzzy on the Internet, although it is quite buzzy on the Internet for all the usual reasons.

This agentic approach to development is—I will have to check my subjective understanding against the official record of the internet, which of course is Twitter—but it is impressionistically like less than six months old at this point. And something of the skill of software development is knowing the tools you are using and being able to get the most use out of them. And so the old saw is that non-serious companies will ask for five years of experience with a tool that was invented three years ago. There's literally no one who has more than six months of experience with agentic software development because we literally invented that yesterday. [Patrick notes: In the spirit of rigor, Cursor’s agent mode dates to November 2024, with Windsurf’s Cascade slightly preceding that. So, nine months, not six. Claude Code and OpenAI’s Codex both debuted early in 2025.]

So both as the agents themselves get better, as the labs see the actual use of these tools and are able to feed that back into the development cycles. And as the users themselves get better at interacting with an agent in this new mode where you have an instantly available understudy who forgets what they worked on every day because that is a feature at the moment and who occasionally become sharply less intelligent at five hours into the workday as a fun constraint that is not a law of nature, but it's just a product choice at the moment to not burn down the entire economy in inference costs.

As they get better, as people get better using them, we might see the experience of these things change. And so, we might need to rerun this paper in a few months, a few years to see if the results hold or to what extent they hold.

Joel: Indeed, I will say that when we first saw the data coming in, we thought—perhaps what you thought, possibly what some people in the audience thought—this is kind of unbelievable. This can't be true. Recall that these developers have access to the zero speed-up point whenever they want. It's "AI allowed," not "AI required." [Patrick notes: I feel this is a good point and likely to be underappreciated.] They make the decision to not use AI in something like 16% of issues.

So we went through and watched many, many, many hours of these developers coding to see if there was something that was messed up in our experiment.

Broadly, we find that they're doing very reasonable things. You can find more of this documented in the paper. One thing I'll say about that is at least I am not seeing areas where with today's agentic coding tools, they could be making enormously better use. It's possible that I don't know about these strategies myself, but some of the strategies that I am aware of, like defining some tests and then asking the model to iterate against those tests, it seems to me are very frequently available to these developers. And it is rather something else going on.

One intuition that I have that I think—I have no evidence for this intuition, it's merely an intuition—is that maybe some of the longer-lived tools, the tab auto-completes that are AI-powered that you were mentioning earlier, the developers maybe do have some longer-lived familiarity with, are in some ways more powerful. Perhaps you're more in flow with the problem when you're reasoning about how exactly to solve the thing in front of you rather than typing the problem statements into the LLM input. So it's possible there are those kinds of wins on the table.

Patrick: I think another thing that the paper goes into a bit of detail on and this will be most interesting, probably to folks that are keenly interested in the industrial organization of the software industry is demonstrating a true fact which is that engineers—software developers, however one wants to call the position—are despite many having the self-conception that their primary professional output is within the four walls or four corners of the code that they write, a lot of the actual work is interfacing with people about the code.

This is getting something up to the local standards of the project, writing the pull request, which is essentially a memo to other developers on the project saying that the code that you have created definitively solves the issue that you set out to solve and that they should accept this code into the project. And that memo is—it's partly an argument and it's partly a roadmap into reducing the friction associated with reviewing the code that you have written. And you can be better or worse at writing that memo.

And I think an interesting question I have is, the first and easiest thing to deploy LLMs into in the development process is the thing that your IDE makes very easy, which is the tab autocomplete that a company has spent, I'll make up a number, a billion dollars worth of staff time in making that feature very easy and shiny out of the box. And at the moment, from my inspection of these tools, there is not exactly a tab autocomplete for "write my pull request for me."

But to the extent that developers are rate limited on their ability to write pull requests to the standards of their local projects, which is a very real thing that happens in the world, including at companies where you're writing a pull request to other members of your own staff, very plausibly, that is a thing that will change as this gets more integrated into our ways of doing things, including the—

The phrase "pull request" is not a law of nature. This is not a requirement for computer systems that we organize ourselves this way. Pull requests are less than 20 years old at this point. And they were invented by a particular website that was sitting on top of one particular source control system that was invented by the same guy who invented the Linux operating system. [Patrick notes: The source control system is git, and the “website” that most professional developers spend large portions of their lives interacting with is Github. There is a fascinating synergy between them, probably discussed in more detail by Ben Thompson somewhere than you’ll find in this parenthetical.] And it is an attempt to make the internals of that system comprehensible to mortal minds.

And we all do this because pull requests turn out to be the way that we do our job. But possibly if the technological substrate moves on us in the fashion that is currently moving, this won't be the way we work in 2030 in the same way that the way we work right now is not the way we worked in 2005.

Joel: Yeah, I love that. An important part of the story here, I think, is that developers have very strong professional incentives in this setting, at least, to be putting up extremely high quality code changes so that their colleagues effectively think of them as high quality contributors. And so when AIs are making progress on their problem, and often they are making some degree of progress on their problem—albeit it's imperfect, doesn't quite match the specifications in the right way—they're needing to clean up that code, maybe rewrite that code so that it does meet this very high quality bar.

Of course, in some sense, that depends on this pull request abstraction, as you were mentioning, this industrial organization of software development. And perhaps you can imagine different setups, different ways of organizing the factory, as it were, that better integrates AIs into software development.

The formalities of software development

Patrick: And much of the industrial organization here is, say, other than formal. It's interesting. The metaphor of software development involves some very formal portions, including parts which are either written out in procedures documents or enforced automatically by various systems. And so, for example, there might be automated test suites or automated linters that look over the code that you've contributed. A test suite will flag if the code doesn't do the thing that one expects it to. And developers at this level are extremely unlikely to submit code that does not pass the automated test suites.

A linter, on the other hand, embodies taste at scale. 

[Patrick notes: Software developers have a longstanding split on whether they prefer tabs or spaces for formatting code, and disagreements about it have burned hundreds of millions of dollars of salary. One of the simplest possible things a linter can do is detect whether a developer is not in compliance with the locally blessed style and number of tabs or spaces.] 

Automated tools and human discussions

Not all “taste” in reviewing software code is the kind that systems can automatically detect at the moment. And so something which consumes an unbelievable amount of staff time at every company which produces software is discussions—is one way to put it, arguments might be another way to put it, professional coaching might be another way to put it—between various developers who might be of different seniority levels with different preferences and opinions as to how the code can be written. And these can be legendarily nitpicky based on personality types that make it in software development and also the fundamental difficulty of getting a team of 10,000 people to agree on one standard for what good writing looks like.

[Patrick notes: These aren’t simply nitpicks, by the way. Industry leading software shops also use linters to detect e.g. programming constructs which are allowed by the programming language but discouraged by the company or team adopting it, perhaps because they are known to be difficult to reason about and frequently cause subtle bugs. For example, many languages implement something called the ternary operator, which very tersely allows you to do a comparison while you’re also doing e.g. an assignment. Some developers love the ternary operator. Some companies think their aesthetic preferences resulted in four sev-1 incidents and as a result they can indulge those aesthetic preferences at their next employer.]

AI and style transfer in software

So I think the paper delves into that much of the work of correcting these AIs was not necessarily fixing their bugs, the places where their system didn't work the way it was expected to. But adjusting things into the house style. An interesting thing that people that have been playing around with LLMs for the last couple of years might flag on when they hear the word "style" is that one of the things that LLMs seem to be scary good, preternaturally good, at is at least a first pass of doing style transfer between various documents.

So if getting code into a project's house style is indeed necessary for LLM-based development, then very plausibly—with relatively small efforts from both tool makers and users—we could see rapid improvement in this area.

Joel: That seems plausible to me, but I would want to find out empirically. I had similar expectations to these experts ahead of time who thought that today's AI tools in today's large complex repositories would see significant speed-up. And I share the intuition that more elicitation, as we call it, work could be done, and maybe that could improve AI performance.

One thing I'll say is that I think you should be interpreting "style" extremely broadly. This is not just the syntax of how things are written and whether that conforms to some maintainers' preferences. This is, are different parts of the code base talking to each other in the right way? Things that are norm-based, that maybe are affected by maintainers' preferences, but are importantly substantive and contributing to how well the project works, and this provides real economic value.

The role of comments in AI coding

Patrick: In my own experiments with these tools over the last month, I've recognized something interesting: I have intuitions about coding that I don't think I've ever discussed with other professionals, and I certainly haven't written them down. Take code comments, for example—those non-executable notes developers write in files. One of the weird adjustments in AI-oriented coding is that comments suddenly matter in a new way. For the first time, a computer might actually act on a comment, when the entire purpose of comments throughout history was specifically not to be acted on by computers—until yesterday.

The degree to which you comment things, the degree to which you assume this bit of code is non-trivial, therefore I will declare my intent before writing the code or explain to a maintenance developer, here's what this thing is doing or here's what I expect to change in the future here or, legendarily, "Warning, we have lost six staff engineers over attempting to refactor the following code. Please do not be the seventh and schedule a meeting prior to attempting to do it."

The choices that the model made with respect to comments it would make and the level of comments it would make were often what I felt: One, extremely poorly calibrated relative to my own preferences. Two, it's not obvious to me how I even convey in words to the model, "No comments like that," because there's almost a subverbal level of aversion to that particular type of comment it was doing. I'm like, "How do I describe taste? That is not a tasteful comment. That does not add to my human understanding of this. Maybe don't write that unless you need to write that," in which case, now for whom is this code really being written for?

Is it to accomplish the business logic that we've set out to write? Is it for the benefit of the maintenance programmer who needs to do next year's work in this code base, which is an important audience that professional developers write towards? Or is it OK, we want to 80% write code for the maintenance programmer next year and 20% write code for the tool that is interpreting all the code?

The future of AI in software engineering

Joel: Yeah, I think again, this takes us back to how the factory is organized. My weak sense is that for self-driving cars, it's important to redesign maps of cities in some sense to make the job easier for AIs. And I expect, as you're saying, that that might be true in code bases as well. One thing you might notice is that it seems in some ways substantially easier to get these maps quote-unquote for software engineering projects than for cities, so we might expect considerably faster progress in this domain than in past domains.

Patrick: I was reflecting earlier today that one of the reasons that we've seen the fastest uptake in terms of this upending a white collar profession with regards to software is both that the broader firm-to-firm industrial organization of software has high quality public artifacts that are representative of the actual work being done. The code is—most code that is written in the world is not open source. But there is a lot of open source code that is written in the world. And open source in this sense means without paying a monetary fee, you as an individual can go and get a non-trivial percentage of all the code that is running on, without loss of generality, Google systems. You cannot replace Google by doing that. There are 100,000 engineer-years of code that is not open. [Patrick notes: Almost certainly an underestimate. Google has more than twenty years of corporate history, tens of thousands of technologists, and is unlikely to OSS even a few percentage points of all code they write. But if you read it as “Google Search” then 100,000 engineer-years sounds roughly plausible to me.]

But you can get many, many millions of lines of sample code in any language of your choice from the internet and that helped train up AIs to be very good at replicating things that are like that sample code, which covers a large portion of all economic activity that is intermediated by the software industry.

I've also been reflecting on the ergonomics of software development—things that are important but perhaps optional, depending on the preferences of a particular developer, team, or company. These might include automated testing or containerization tools like Docker.

Without going deep into the technical weeds, Docker lets you create virtual computers that sit on your real computer—and sometimes virtual computers on top of virtual computers on top of virtual computers, because nothing warms engineers' hearts more than adding another virtualization layer. The key is that it makes these computer-like environments very cheap and very easy to throw away when you don't need them anymore.

What I noticed this month while playing with agentic tools is that they need this capability far more than I do, or more than the engineering teams I've worked with. These tools need clear start and stop points. When they get out of control and spiral farther from success, you need to be able to say: 'Nope, you're out of control. Stop. Throw away the entire system right now—it's just bits, nobody cares.' Then restart from a known good state.

The technology that makes this trivial already exists. I don't have to develop it myself. I just tell the agent system: “Use Docker like all the cool kids do these days.”

That seems to have made those systems much more successful than they would have been in a domain that doesn't have Docker available. And perhaps there are affordances like that sitting out somewhere in radiology or in Supreme Court appellate cases. Tell the justice to forget everything that I've told them up until this point, we're starting over from square one. But given that those are perhaps not naturally part of the industrial organization of Supreme Court appellate cases, it might be a little bit harder to get these things working in those domains.

Economic implications of AI in software

Joel: Patrick, you mentioned some prior reasons that we might have to think that automating software development might be easier than it might be true for other fields, namely that there are these plentiful high quality public artifacts. I'll add one more with my academic economics hat on. That is that these firms, in their production function that are creating these tools rely in part on those tools for creating ever better models, products, and so on and so forth, and so might internalize some incentive to automate software engineering in particular, which I think is part of the concern that some people might have.

Patrick: There's the received wisdom in the software industry that the best products are the ones that are dogfooded internally, and there is no project that is easier to dogfood than one that you are doing to dogfood. The phrase is "eating your own dog food," which means you should use the product that you are creating for the rest of the world yourself. And it's easy to use a product that helps out software developers if you are yourself a software developer, whereas it is harder to make material use of something that is designed to, I don't know, make SNAP applications more useful if you are a professional software engineer and probably don't have to apply for the program formerly known as food stamps that frequently.

Although previous podcast guest Dave Guarino has done some interesting work on LLMs applied to SNAP applications. I'll drop a link to his write-up of that. But I will say with my less academic economics hat and more previous operator slash occasional investor in software businesses: the fact that software is a large and growing portion of the economy and that software developers are extremely well compensated and the organizations that employ them are very happy with doing labor and capital trade-offs with regards to software development suggests that you can build lucrative businesses serving software developers in a way which—it is less obviously pencils out that if you made SNAP applications, for example, 10% more efficient that there would be a billion dollar software business available in that, whereas if you ask any venture capitalist, "Hey, finger to the wind, software developers get 10% more effective. Is there at least one billion dollar company in that?" I think most clueless venture capitalists would say, "Wow, that seems like a billion dollar company generating machine there. There's probably a hundred different companies that you can make out of there, and the math would still pencil if it had a hundred companies."

Joel: Of course, and not to say that this will necessarily lead to reduction in software engineering employment in the short term. My sense is from the macro data that this hasn't showed up. Even for junior engineers, not only are you able to replace some work or have some work happen in a shorter period of time, but you're also able to produce more high value work. These employees are more productive as a result of access to AI tools, perhaps not in the setting that we've studied in particular at the point in time in which we studied, but in general.

Patrick: Something I have a relatively strong point of view on and will continue shaking my fist with regards to the strong point of view. And I'm particularly socially worried that we might mislead young people that "the computer is going to do programming these days, so go into any field but programming yourself." We have a long history in software development of making software that makes people more effective at software. The first AI that was designed to make programmers go out of a job was called the compiler, which would reduce the amount of human hours required to write fiddly machine code by having a smaller number of programmers use higher level languages.

And as we remember from the experience of the last 70 years, compilers did not decrease the global demand for programmers. They made them much more effective at creating business value and ultimately impact in the world they lived in. And that caused an essentially exponential curve in the demand for software development. Given that these tools make engineers much more—hypothetically, in some future, hopefully—if these tools make engineers much more effective, then I would expect there to be more engineers in the world rather than less engineers in the world.

Now, if they only make engineers more fulfilled but don't make them more effective, I think then we expect relatively little change versus the existing shape of the curve in engineering employment, which again has been up and to the right for very many years at this point with minor dislocations as the business cycle changes and there's new headlines.

Joel: Seems right to me, and yet, of course, all consistent with extremely fast progress in AI.

Challenges and risks of AI in software

Patrick: So stepping back from the specific research result that we have, what is interesting and on your agenda that you might want to talk about in terms of, I don't know, fun problems that you want to explore in the next six months or things that we as the community that is interested in the effects of these systems should be keeping a weather eye on in say a six month time frame.

Joel: Yeah, so at METR, one of the things that we're most interested in is thinking about the timing, possibility and nature of AI R&D self-recursion or the automation of R&D leading to very rapid improvements in the capabilities of models. We think that might be destabilizing in all number of ways that we can talk about. And so coming out of this research, how best to attack that question? There are lots of things I can think about, be interested in brainstorming with you.

One thing that I have in mind at the moment is further exploring the ways in which benchmark performance might come apart from real-world utility. Of course, all else equal, I expect impressive technical benchmark scores to lead to more real-world utility. And I don't think I especially have reason to doubt the very impressive technical benchmark scores that we've seen—these models are solving real hard problems today.

Patrick: Can I make a trivial observation for the benefit of people who haven't followed benchmarks and these are all numbers on a page? The easiest way to encapsulate state of the art model development for someone who has not spent a lot of time thinking about that question specifically is: every standardized test that you ever took in your entire life is solved

A computer will quote-unquote saturate the score on that test, which means there is no meaningful ability to differentiate state of the art models from a human with any test that you have ever taken in your life. I don't care if you have a Juris Doctorate, the LSAT—it's solved. The MCAT—it's solved. The SAT—so far beyond solved at this point.

[Patrick notes: This is a slight overstatement for rhetorical effect. State of the art models routinely score 90th percentile on these exams. This doesn’t imply they’ll get a 1600 every time you try.]

Now, that's not true. There are ways to write tests that humans can solve, that computers are still bad at. Legendarily, counting the number of R's in the word "strawberry" is something you expect someone to get a 1600 on the SAT, they will probably get that question right. A model will get a 1600 on the SAT a very large portion of the time, will not necessarily count the number of R's in the word "strawberry." But basically, every standardized test in every domain to almost arbitrarily difficult levels—the International Math Olympiad—all solved. And so when we are discussing the performance on computer benchmarks, computer programming benchmarks and similar, we are increasingly looking at the smaller and smaller slice of the world that we don't understand these things will get the highest possible score on.

Joel: Yeah, this speaks to another motivation that we had for this study, which is this observation that benchmarks are going from providing zero signal to being totally saturated faster and faster and faster. And it's becoming increasingly difficult to create any benchmark style tasks that don't have this property or aren't providing little signal because they're close to fully saturated, even more so these Q and A style tasks that you're talking about, rather than agentic style tasks. Perhaps we can get more signal going forwards into the future if we think about tasks where we can measure "Are people going 1.2 times faster, 1.5 times faster, 2 times faster" if we measure through RCTs rather than using these benchmark approaches.

Patrick: So as you are concerned with respect to the potential for recursive self-improvement—I personally think there's two stories for recursive self-improvement. One is the sort of—I don't want to use the word science fiction adjacent, but let's say we've seen it in many stories where the computer is itself just off thinking into a corner and invents its own universe to continue that in and then the rest of us discover that later to our dismay or a sort of more gradual recursive self-improvement where there are still humans in the loop, the labs are still in control of a research agenda, but that research agenda is just getting more done with regards to every cycle of effort and every quantity of resources that they employ in it. So now we are keeping a weather eye on the degree to which AI research makes AI research faster in particular in the ways that it hypothetically makes it transformatively faster than the status quo. The status quo is advancing pretty quickly.

Joel: Indeed, we're hoping to get some handle on this. So far, people have mostly been trying to get a handle on this using what I might describe as lab experiments or these benchmark-style tasks. I think a classic takeaway in a machine learning or AI research paper is that "benchmarks don't tell the full story," quote-unquote. In some ways, our paper is no different. The particular flavor here is that models passing automated tests like you might find in a benchmark like SWE-bench is not quite the same as models producing solutions that are high enough quality on other axes to be worth quote-unquote merging into main or being accepted as a contribution to valuable software projects.

Of course I expect that some decent fraction of the time they are today able to make contributions that would be accepted. But just noting that these two things are different. One of the things that I personally am interested in in the next few months is further exploring this—to what degree is it the case that models are able to pass these automated test cases or solve some parts of the problem. And that's what we're measuring when we report results on SWE-bench, for instance, but aren't solving the whole problem in some sense.

Patrick: This is a constant challenge in machine learning research, even predating LLMs, because, well, it's a constant challenge with regards to the OG free range intelligences as well. As soon as we have a metric for something, it ceases to be a useful target. But a frequent experience with regards to machine learning systems that are given an objective function and told to climb up the hill to maximize their score on the objective function is that they—without having any intent to because some of the earlier systems certainly are too simple to have anything that a human recognizes as intent. But they reverse engineer the objective function and then give the thing that gives them the maximum score, even though a human reviewing the output there would say, "No, no, that's pathological compliance with what we thought we wanted. That does not produce any real-world value, just gets you the maximum score in the quickest way possible."

And unfortunately, the way these systems are constructed "get the maximum score in the most efficient way possible by essentially cheating on the test" is a recurring pathology. And we've seen various flavors of that in LLM land, which are sometimes reported, I think, in somewhat breathless tone where "the computers are trying to escape in a way that no computer has tried to escape before," that kind of gloss that people give. But I have an increasingly out-of-date background in undergraduate AI. And that was an issue back in 2004—providing the pathological solution versus the one that, quote-unquote, we really wanted. And the LLMs are very good at getting to the pathological solution and have done so in a number of cases.

[Patrick notes: see generally.]

Joel: Yeah, providing the pathological solution or "reward hacking" as this is sometimes referred to, I think is a real problem with these models today and I think is not really providing a sort of, quote-unquote, catastrophic risk today. That said, when models are contributing to enormously larger open source projects that perhaps we're not even monitoring the changes that they're making, perhaps we're not able to monitor the changes that they're making at some point. Perhaps this reward hacking-like behavior, to the extent that it's persisting, might be significantly more concerning.

Patrick: Yep. I think there are some interesting questions around how the craft of software engineering changes as a result of LLMs being available and the—every line of code that has been written in the economy has been reviewed by many engineers at this point. Well, the minimum is one. The minimum is slightly less than one, but we won't go very into the weeds. 

[Patrick notes: Even prior to LLMs we had systems which automatically created or transformed computer code and, when those systems are well-designed, not all code written by them was necessarily rechecked by a human. But in 2024 almost all new intellectually significant code was seen by at least one human and I strongly suspect 2024 is one of the last years for which that will be true.

But almost all artifacts are handcrafted and there is theoretically speaking some human that has signed up for accountability with respect to every line of code that has been written, for funny values of the word accountability.

These systems give us the possibility that after some adoption cycle in the industry, which might be 12 months—well, probably more realistically, I think, given the pace of change in the industry, it'd be somewhere between 36 and 72 months—there might be vastly more code written than currently. And after some period of industrial adaptation to that new reality, we might be okay with engineers reviewing code sporadically, maybe on a somewhat stochastic basis. And maybe if there is a bug detected in the system, the AI gets the first pass at taking a crack at fixing that bug. And if the bug goes away as a result of the monitoring that the AI wrote, no human is ever made aware of that.

And it might be the case that in the future, engineers at well-regarded software companies are ambiently aware that certain varieties of software development are happening over in the black box over there. And they review it as a program, but are very much not reviewing every line of code as it is written anymore. And that would be an interesting world in a lot of ways. And frankly, there are some risks in that world. And I think they are less the kind of cinematic risks that we imagine and more just the pedestrian risks of—okay, legendarily IBM had a line in a document somewhere that no researcher at IBM has been able to find. I know their archives has been asked a number of times, but "a computer cannot be accountable for a decision and therefore a computer can never be allowed to make a management decision," which is an interesting statement of principles dating 70 years ago.

Now, in the world we actually live in right now, computers make management decisions all the time on the basis of programs that humans have written, business logic that humans have structured, and occasionally due to weirder interactions of the built environment that we have. And if the—as those interactions get further and further from individuals and groups making the decisions. I think there are some interesting societal and industrial organizational consequences downstream of that. But they are every sort of Tuesday kind of worries around, do large corporations operate effectively? Do they do the things that we set out to do? Are we monitoring them internally and externally in effective fashions versus things that are problems of a societal nature uniquely created by the existence of LLMs in the world.

Joel: Indeed, speaking for myself, I don't see reason why we're not entering into that world, including perhaps the more cinematic world. We may well be entering an era where at these frontier AI companies, for instance, the majority of what we, at least today refer to as labor is happening by AI systems that may or may not be effectively monitored. There may or may not be in different situations, humans checking that code—of course, humans can make mistakes when checking the code too, and there will be so much of it that perhaps some mistakes would get through. And in some sense, we'd rely on the good intentions—we can debate whether the term "intentions" is appropriate, I think that's a good debate to have—of the AI systems not inserting backdoors into code or doing something else that's nefarious. I'm not here trying to claim that we have necessarily strong reasons to think that might be the case, although that also is an interesting discussion, just noting that it seems like the vulnerability might be there, perhaps not in the near-term future.

Security concerns with AI-generated code

Patrick: I think even orthogonal to the question of whether AIs intend quote-unquote to insert backdoors or whether they quote-unquote are working against us. The OG artisanal free range intelligences have not solved making secure software systems yet, to put it mildly. And that observation exists in tension with: computer systems run the entire modern world.

And to what extent does this alter the attacker slash defender balance, which is this uneasy ecosystemic question of: to what degree do bad actors who themselves own computers get to free ride off the rest of society and extract resources from us as a consequence of us liking the fact that bank apps work?

And to what extent is it not merely resource extraction, but issues that are much more tangible or difficult to ensure against or difficult to reverse than simple resource extraction? And if there is 10 times more code written in the world or 10,000 times more code written in the world, is that comparable quality to the existing stuff? Are attackers' lives much easier as a result of that? Are attackers' lives more difficult as a result of that? Are attackers empowered by their own copies of LLMs and similar, such that they get much better at writing attacks than in the status quo? And a lot of these are empirical questions, and I hope as much of that research as possible gets done at METR and not at criminal gangs.

Unfortunately, I think the economic incentives are somewhat inevitable that criminal gangs will have their own very well staffed, very talented research teams running these evaluations across the entire economy and picking up in some cases billions of dollars when they win.

Joel: I couldn't be further from an expert on this question, but I do share the intuition, I think, that it's ambiguous and an empirical question what the attacker-defender balance looks like. One note: there are probably significant economic incentives as well for companies working on the defensive side to use these same tools that might be helpful for the attackers to work on defenses, to automatically patch vulnerabilities as well.

Patrick: Certainly it is an empirical question and a very open question. And again, history started six months ago for—well, history keeps starting every six months ago in this stuff. But I do think that it is important for people who aren't, say, deep in the weeds here to understand that even if we chose not to use these technologies or to not progress these technologies, that the genie is not going in the bottle with regards to some of these things.

A specific example of this is there exist open models these days, or "open weights." And after a model with open weights is released onto the internet, it is effectively impossible to recall. But if you were hypothetically an actor in a position of unitary control, say, I don't know, I'm the CEO at Google. I've decided for whatever reason, Google engineers are getting out of the AI business. We are doing only artisanally handcrafted code from this point on. In fact, ban build systems. I don't like build systems anymore. And somehow convinced 100,000 very fractious engineers at Google to go along with that. You would still be impacted by these systems existing in the world because the people trying to hack into you would not necessarily be under your unitary control.

So it will be an interesting few years of headlines on the security front, regardless of what we choose and regardless of that, with regards to the future shape of the capabilities graph. I think a great line that Thomas gave in his write-up, which I'll link to in the show notes, is that even if capabilities progress in AI stops today, the abilities of a gigantic system to write code. Just exploring that possibility space will be transformative to the industry. And even if capabilities progress stops today, we have a lot to think about in terms of the attacker defender trade-off and similar.

Who knows what the future holds with regards to these things. But the smart money is metaphorically and very literally putting trillions of dollars of chips behind, we really don't think capabilities progress is going to stop next week. That seems sort of unlikely at this point.

Joel: Yeah. Have you seen this article from Sean Heelan where he uses o3 to find a zero-day vulnerability in the Linux kernel? o3 is no longer at the frontier of AI. This emphasizes the degree to which if capabilities progress stops today, there still might be incredible things.

Patrick: Was that Sean or Simon Willison or another person independently achieving directionally similar results using directionally similar technology?

Joel: Yeah, I think so Sean Heelan.

Patrick: Okay, so yeah, I'm thinking of another person who did a very similar thing.

[Patrick notes: My apologies, I suffered from hallucination. I remembered Simon had written about this but misremembered his contribution. I would bet against that being the only significant software vulnerability discovered by AI, in personally significant size, at almost any odds.] 

It is an article of faith in the software development community that "more eyes make all bugs shallow" and therefore widely distributed open source software systems probabilistically have less vulnerabilities than closed source systems. I think the empirical evidence behind that article of faith is a mixed bag.

But the notion that you can just point these new intelligences at extremely important economic systems and say, "find the bad things." And then they start rattling off what are considered relatively high severity vulnerabilities that in some cases would have material costs if one were to acquire them on the open market in the ancient history of 12 months ago.

That is a very interesting update for those of us who are professionally implicated in securing these systems and those of us who are professionally implicated in the downstream consequences of software security, which is everybody. So we shall see. But lots of reasons to be optimistic too.

Lots more work to be done. And speaking of that, more work to be done, as you find interesting research results in the future, where can people follow you on the internet?

Joel: The METR website at metr.org has job postings for those of you listeners who might be interested in joining us after hearing this conversation. My personal Twitter is Joel_BKR.

Patrick: Thanks very much for coming on the show today, Joel. And for the rest of you, thanks very much. And we'll see you next week on Complex Systems.