LLM-enabled Developer Experience (as of April 2024)
Let's set aside all the prognostication about where LLMs might be going, and focus a little more narrowly on what they are today. There are plenty of breathless demos, but how do they perform in day-to-day workloads? What are their strengths and weaknesses?
I've been getting back to writing code for the first time in about a decade, and I've been making consistent use of LLM-enabled IDEs for about a month. I'm perhaps the ideal test subject: I have a good computer science educational foundation, some very stale experience writing research-grade software, and a dream of making a niche game. If LLMs-for-writing-software are going to be useful to anyone, they should be useful for me. I know concepts but not syntax.
My screenshots all come from Cursor, but I used VSCode and Copilot for a while too. They're roughly similar.
Let's dive in and see what it looks like!
Autocomplete
As you type, the model actively works in the background to suggest code completions. This is a natural extension of the auto-complete patterns that IDEs have had for ages.
Here's a relatively complex autocompletion it suggested. The LLM-driven suggestion is in grey on the right, while the "basic" IDE auto-complete is the dropdown below the line. In this case, the suggestion is great.
This suggestion does not come from nowhere, though. I had just implemented a new method, anyEntityHasComponent, and was tracking down all the locations that should be using it. In each case I was doing the same operation: swapping out calls to a previous variant of the method for the new version. So there's plenty of code around this line, including a bunch of variables whose names resemble the names of the method's arguments. The model successfully puts all these pieces together to produce code that fits the pattern of my activity.
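For concreteness, here's a minimal sketch of the shape of that refactor. Only anyEntityHasComponent comes from my codebase; every other name here is invented for illustration:

```typescript
type ComponentName = string;

interface Entity {
  components: Set<ComponentName>;
}

class World {
  constructor(private entities: Entity[]) {}

  // Older variant: check a single entity directly.
  entityHasComponent(entity: Entity, name: ComponentName): boolean {
    return entity.components.has(name);
  }

  // New method: check whether any entity carries the component.
  anyEntityHasComponent(name: ComponentName): boolean {
    return this.entities.some((e) => e.components.has(name));
  }
}

// The repeated edit at each call site looked roughly like:
//   before: world.entityHasComponent(player, "collider")
//   after:  world.anyEntityHasComponent("collider")
```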
Sometimes, the model will make surprisingly accurate leaps. In this case I typed if and it generated a totally plausible bounds-checker. Now, this was not what I was actually here to do. But it was useful code. I kept it!
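It looked roughly like this sketch -- I'm reconstructing from memory, so the names and limits are invented:

```typescript
interface Position {
  x: number;
  y: number;
}

const GRID_WIDTH = 64;
const GRID_HEIGHT = 64;

// A plausible bounds check of the sort the model volunteered unprompted.
function isInBounds(pos: Position): boolean {
  return pos.x >= 0 && pos.x < GRID_WIDTH && pos.y >= 0 && pos.y < GRID_HEIGHT;
}
```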
Just as often, though, it generates garbage. Here you can see the code in grey is the LLM-suggested approach. But the IDE's type-aware autocomplete knows better: on() is not a method available on this type.
This seems like an important area of development to me. The LLM suggestions need to tap into the structural knowledge offered by types and method signatures, not just the vibes of the code.
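Here's a contrived TypeScript example of that gap (all names invented): the suggestion fits the vibes of the surrounding code, but the compiler rejects it on sight.

```typescript
class SoundPlayer {
  play(track: string): void {
    console.log(`playing ${track}`);
  }
}

const player = new SoundPlayer();
player.play("theme");

// An event-emitter-flavored suggestion matches the surrounding "vibes" but
// not the type, and the IDE flags it instantly:
// player.on("finished", () => console.log("done"));
//        ^ Property 'on' does not exist on type 'SoundPlayer'.
```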
Implement
Most LLM-enabled IDEs I've used offer some interface for pointing the model at a part of the codebase and following instructions to update it. Here's a simple example of that.
Nothing complex about this, but it saved me a bunch of keystrokes writing this boilerplate code. Note also that it chose to edit spaces into the if statements along the way. This habit of touching code it doesn't need to is annoying and happens often.
The model really shines at doing bulk text manipulation like this. But you do have to baby it a fair amount; in this case it took a second pass to change the variable names to normal CONSTANT_CASE.
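As a sketch of the kind of bulk edit that does work well -- all names here invented -- consider the magic-number extraction pattern, including the casing fix from the second pass:

```typescript
// Before: magic numbers scattered through the logic.
function chooseAction(health: number, ammo: number): string {
  if (health < 10) return "flee";
  if (ammo < 3) return "reload";
  return "attack";
}

// After two passes ("extract these into constants", then "rename these to
// CONSTANT_CASE"):
const LOW_HEALTH_THRESHOLD = 10;
const LOW_AMMO_THRESHOLD = 3;

function chooseActionRefactored(health: number, ammo: number): string {
  if (health < LOW_HEALTH_THRESHOLD) return "flee";
  if (ammo < LOW_AMMO_THRESHOLD) return "reload";
  return "attack";
}
```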
Notably, there is no promise the generated code actually works. As before, if you are interacting with a library that the model doesn't have a lot of example code for, it will happily generate garbage that the IDE immediately identifies issues with. So in these successful examples, I'm keeping it scoped quite narrowly and setting it up for success proactively. Simply writing a method signature and then saying "implement" to the model basically never works outside of toy situations.
This is especially aggravating when you ask it to fix something. My IDE surfaces this capability in a high-profile way, and its success rate is terrible. It frequently produces code that fails in some novel way, or simply removes the offending code entirely. It will also sometimes gaslight me, insisting that my code works while the IDE's "problems" tab tells me separately that it does not.
This capability is also noticeably slow. Like, open-Mastodon-on-your-phone-while-you-wait slow. That might be okay if it usually produced a good result without further prompting. But at least half the time I need 2-3 attempts to get what I want, and each attempt might take 30s or more to complete. That in turn pushes you to spend more time writing more elaborate prompts to avoid a third round, which just makes the whole experience feel inefficient. I use this capability infrequently.
Aside: Names
I remember in high school I was doing a programming competition, and one of my teammates just zoned out and materialized this crazy solution to a problem that he could not explain, with variables named foo, foofoofoo, and foofoofoofoo. This sort of thing would send the LLM into fits, because it is not actually reasoning about code execution at all; it is simply looking at the patterns of characters. The LLM will do things like infer that a Rect class will have a width and a height property. All well and good, but if I have a class named EntityMap, it will assume it has all the methods of the JavaScript Map object, too.
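Here's a contrived illustration of that trap; EntityMap is the only name drawn from my project, and its internals here are invented:

```typescript
class EntityMap {
  private byId: Record<string, { id: string }> = {};

  add(id: string): void {
    this.byId[id] = { id };
  }

  lookup(id: string): { id: string } | undefined {
    return this.byId[id];
  }
}

const entities = new EntityMap();
entities.add("goblin-1");
entities.lookup("goblin-1");

// Because of the name alone, the model reaches for the built-in Map API:
// entities.get("goblin-1");   // Property 'get' does not exist on 'EntityMap'
// entities.has("goblin-1");   // neither does 'has'
```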
You know the joke about the two hard problems in computer science? Well, LLMs are pretty dependent on you solving one of them: naming things. Not just in a way that makes sense to you, but in a way that is similar to other codebases. As with most of these capabilities, it's a double-edged sword. I'm a rusty and incompetent engineer, so getting reminders about the sorts of method names other people would use on a class named EntityMap is useful overall. But if you have a stronger point of view on what you're trying to do, I bet this is a hassle. In a general sense, it will probably contribute to a homogenization of naming over time, because being different makes the tool less useful.
Codebase-aware Interaction
The last major capability is to "chat" with the LLM, while bringing elements of your codebase into the context of the model's attention. I've found two useful modes here. First, you can express a somewhat complex question about how to do something you don't know how to do. The UI here is great -- it draws on your actual code, explains what's going on, and offers an "apply" button to copy that cleanly into your codebase where it would belong.
It's also great for someone learning new concepts for the first time. I hadn't written any TypeScript before this project, and being able to ask questions in English about why something is happening, and get answers specific to my codebase, is handy. This works well for a context like TypeScript, which is very well documented and very broadly adopted. I will definitely keep turning to the model for conceptual questions about programming patterns that aren't familiar to me. This is better than Google and Stack Overflow in most cases, because the answers are often grounded in my own codebase, and I can build on the dialog to smoothly turn it into actual code. However, when it comes to more detail-oriented issues -- patterns for using a specific library, or fixing problems in the build or deploy systems -- it's notably worse than Stack Overflow. In other words, when getting it exactly right is important, the model is bad. When you want general context on a common topic, it's useful.
This dynamic plays out on questions of software architecture. It tends to equivocate and restate the tradeoffs on any qualitative question that requires judgement. If you are not familiar with the tradeoffs, this may be helpful. But for the most part I found getting "advice" from the model, even with all my codebase in scope, to be basically useless. This makes it hard for me to believe it can complete any non-trivial software design on its own without a lot of scaffolding and oversight.
Silver Bullets
In the mid-1980s, Fred Brooks argued:
There is no single development [...] in technology [...] which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity.
None of the breakthroughs he had experienced at that point (high-level languages, time-sharing operating systems, modular libraries), nor the ones he identified coming soon (object-oriented programming, graphical programming, AI, formal verification, code generation, etc.), yielded changes of that magnitude.
I think there's broad agreement with this conclusion historically. None of those movements yielded order-of-magnitude changes in software development efficiency. But the degree of hype associated with LLM-enabled IDEs might make you think otherwise. Are we at an inflection point? Do we have a "silver bullet" in hand finally, or is this another incremental improvement in productivity?
Let's discuss some key insights and try to answer that question about inflection points.
The IDE is always correct, but limited. The LLM is powerful, but error-prone. Finding ways to bridge these kinds of knowledge seems important to me. LLM-based auto-complete should simply not be returning non-functional code as often as it does.
Quality of LLM output depends highly on your relevant corpus size. Are you building a basic TODO app in React? Autocomplete will be uncannily good. Are you doing anything novel with niche tools? Autocomplete will be very limited.
LLM-enabled IDEs are a significant time saver, driven primarily by saving repetitive mechanical work. I can complete repetitive tasks much faster with an LLM. When the API is down or the plugin is disabled because it needs to be updated, I feel it and fix the issue quickly. I already don't want to write code without this capability.
Can a non-engineer produce non-trivial software with an LLM-enabled IDE? Absolutely not. There may be alternate types of tools that are not so code-forward and bring some constraints that make it more feasible. But a fully-featured IDE -- even with an LLM bolted to the side -- is too elaborate and capable a tool to operate without a lot of programming foundational knowledge.
Could a novice learn faster with the support of an LLM-enabled IDE? Yes. When I think back to my early days learning to write code, this is miles ahead. No trawling through Majordomo listserv records or IRC chat logs to find examples of someone doing what I'm trying to do. The LLM will never tell me to "RTFM" or admonish me for not finding the answer to my question in some barely-searchable archive.
The copyright vibes are still weird. There are some major legal battles yet to be fought, and I have no insight into how they will land. But I can say that in my own experience, I do sometimes get the spooky feeling that I'm just getting someone else's code auto-completed into mine. This is most acute when I'm working through a tutorial and auto-complete clearly has the tutorial in its corpus and spits it back at me, character for character. But it also happens when I prompt it to implement any common algorithm, like "write a depth first search of this tree." What stands out to me is that the code styles are consistent neither with my personal style nor between prompts, which creates the feeling that it's drawing on something very specific to another person. I have no idea if this is true or not, but if this is something that bothers you ... I get it. If I were writing corporate code, I would understand if my Legal partners wanted to put tight limits on what kinds of code generation were acceptable.
There is still no silver bullet. This feels like a ~30% overall speed improvement to me. Higher when I'm doing menial refactoring, lower when I'm working on a structural issue or hashing out a complex behavior.
My instinct is that we're not on the path towards an order-of-magnitude change here on average. But the effects will be extremely uneven. I can imagine a significant (> 2x) speed-up on extremely routine tasks with common libraries and design patterns. For engineers working on something materially novel or niche, I suspect there will be limited value. I doubt anyone submitting Linux kernel patches, for example, would get anything out of this toolset today.
Comments
Software Engineer / Executive · 11 months ago
I've found LLMs to be really good at doing the shit I don't want to learn about. I'm often prompting Copilot with things like "write me a Pulumi file for creating a DynamoDB instance which is good for testing a quick idea on"... obviously I could go and learn that same IaaS or whatever, but I don't want to, because while it lies on my critical path, it's not where my focus needs to be.
Sr. Director of Legal Operations & Strategy · 11 months ago
I appreciate the focus on where things are today. The marketing hype makes LLMs out to be a panacea, or a "silver bullet" as you refer to it. My own experience has been exactly the same as yours. I've been doing some programming for personal projects and out of curiosity, and I would sum up the tools today as providing a shortcut. Your "~30% overall speed improvement" feels about right anecdotally. There is still a requirement for active human-in-the-loop participation.