With, By, or For - Three ways I use LLMs as a software engineer
anelson May 18, 2025 #chatgpt #llm #ai #tech

Like most everyone in my company (and, if you're reading this, probably yours as well), my colleagues and I have been enthusiastically adopting AI tech, particularly LLMs, since ChatGPT 3.5 first showed us what was possible with next-token prediction and insane amounts of compute and training. In the 2+ years since then, I've used quite a few tools and numerous models, trying to figure out how to get the most productive value out of LLMs as a software engineer and engineering leader. In that time I've learned quite a bit about how to get value out of LLMs without succumbing to either of the most common failure modes: outright dismissal of modern LLMs as "stochastic parrots", and breathless idiotic "AI cloned AirBnb in one shot, devs are NGMI!!!" clickbaiting. This article is a distillation of my thinking on using LLMs as a software engineer, as of May 2025.
Disclaimer: AI technology is changing very rapidly. I expect I'll read this article five years from now and scoff at how primitive our tools were and how clueless I was. Consider this article a snapshot of the state of my use of LLMs as of this moment in time. I'm neither an AI doomer nor an AI hypebeast; I'm a working SWE and engineering leader who has to deal with these technologies whether I like it or not, and who is trying to find ways to get the most value from LLMs without drowning in AI slop.
tl;dr - Three ways I use LLMs as an SWE
I break down the ways I use LLMs as an SWE into three categories, in the order in which I discovered them:
- Code is written with LLMs - By this I mean I'm doing the thinking and writing the code, but consulting with an LLM on specific problems or to explore an idea that I'm not yet ready to commit to coding. What ten years ago would have been done with a mix of Google and Stack Overflow, I now do with LLMs. I use Claude Desktop, ChatGPT Desktop, and Perplexity in this role.
- Code is written by LLMs - This refers to modern agentic coding tools, including Cursor's Agent mode, Claude Code, and the like. In this use case, the LLM is writing the code and I am driving it with prompts. I may jump in and modify code myself here and there, but the majority of the code is authored by the LLM.
- Code is written for LLMs - This refers to choices of tooling, language, infrastructure, and documentation made with the express purpose of improving the productivity and accuracy of agentic coding tools. The agentic tools don't know enough to do this themselves, so you as the developer need to go to the effort of giving the agents what they need, but don't yet know they require, in order to be productive.
None of these is better or worse than the others; you can do stupid things and hurt yourself with all three of these techniques, and you can get value out of LLMs without using any of them. This is simply the combination that works for me in May 2025. The rest of this article is an elaboration on the three techniques.
With
This was the original use case for LLMs in the ChatGPT 3.5 era, when all we had was the chat interface. You'd prompt it to do something, it would spit out some code or a command, which you'd copy-paste or modify or just run; if you got an error you'd copy-paste that back into the session, ad infinitum. Or you'd use it as a stochastic Google, asking it questions and hoping it produces the right answers. As far as I can tell, for the vast majority of people who use AI, this continues to be the primary workflow.
This can go badly wrong when you don't understand the limitations of the LLM, but it can also be immensely helpful and save a ton of time if you know what you're doing.
For my day-to-day coding with LLMs, I use Claude Desktop and the latest thinking model, which as of now is Sonnet 3.7 (update: now Sonnet 4 and Opus 4 are out). If I don't like what I'm getting, or if Claude is down or I'm throttled, I will use the ChatGPT Desktop or Mobile app. Both of these are $20/mo and are well worth the price for the value I get.
If the task involves up-to-date information or searching the Internet is for some reason key to success, I also have a Perplexity Pro subscription which I use constantly.
Here are a few real-world examples pulled from my recent history in the aforementioned applications:
- (Claude) - Getting recommendations on how to set up a new Typescript project in 2025 using the latest tooling and `tsconfig.json` values
- (Claude) - Feed in an OpenAPI specification for an API that is very poorly designed, pointing out some of the things I don't like about it, and getting a more comprehensive list of problems in Markdown format with specific endpoints called out
- (Claude) - What are the options for automatically generating OpenAPI specifications for REST APIs built on Node with Fastify (as part of my effort to unfuck the aforementioned poorly designed OpenAPI spec)
- (Claude) - Explain how FF3-1 format-preserving encryption and Feistel networks work, after I ran across the concept somewhere and wanted to understand how it actually worked. This ended up being a long conversation with many follow-up questions, after which I felt I understood the concept well enough to evaluate its use for my particular case
- (Claude) - How to make Jest tests fail with a meaningful error message in a particular failure case that I wasn't sure how best to represent (I have almost no experience with Typescript or Node tooling, so this was a noob question)
- (Claude) - Write a shell script to list processes by the tmux session and window that they are running in. It took a lot of back-and-forth to get something I could live with, but I'm reasonably happy with the result. I don't often need this, but when I do it's great having it available. Sadly this needs to work on Mac as well, so it's limited to some pretty old Bash syntax and thus performs very poorly.
- (OpenAI) - Refresh my memory on random forests, and how they are implemented using modern Python ML frameworks
- (OpenAI) - Paste in an error from the Rust compiler and help me understand what the problem actually is (interestingly, OpenAI `o4-mini-high` utterly failed to analyze this correctly and came up with a completely ridiculous but plausible-sounding explanation for the cause; thankfully I'm a very experienced Rust programmer, so I was able to recognize this and went about solving the problem myself)
- (Perplexity) - Search for help solving a problem with Claude Code caused by a new release that didn't work with Google Vertex AI yet
- (Perplexity) - Figure out how I can install a Typescript "binary" program into the PATH on Windows using `pnpm`
- (Perplexity) - Find sources for a large Ukrainian language corpus as part of a research project related to detecting tampering with data
By
Starting in Q4 last year, "agentic AI" became all the rage, and agentic features started to appear in Cursor and similar tools. AI influencers fell over each other to be the first to state the obvious, that 2025 was to be the "year of agentic AI". In February 2025, Andrej Karpathy christened the term vibe coding to refer to the low-effort generation of AI slop code that was already by then rampant. AI grifters on YouTube and X performatively gaped at the ease with which primitive agentic coding tools turned a screenshot of some web app into a React application, heralding the end of software engineering and the urgency with which one must join their Patreon or perish.
If you've experimented with these tools, you've likely noticed how quickly and confidently they produce garbage code that you wouldn't accept from the greenest junior developer. You can be forgiven for dismissing agentic coding tools as gimmicks hyped by charlatans who need to somehow justify their absurd VC investments, for indeed they are in many cases exactly that. However, I have been able to get some valuable output from them, and whether you like it or not, your laziest and least-capable colleagues are churning out AI-written code anyway, so you may as well come to terms with it now.
As of right now my go-to agentic coding tool is Claude Code, although I still pay $20/mo for Cursor and occasionally use it.
Claude Code uses the Anthropic APIs, which are billed per token, so comparing it to the $20/mo Cursor subscription isn't really fair, but I'm doing it anyway. My company has some generous Google GCP credits, and Claude Sonnet 3.7 is available via Vertex AI, so for us in particular Claude Code is "free", in the sense that it doesn't use up any of our runway. Paying Cursor per token plus their 10% markup would cost actual dollars, and Claude Code works very well for me, so I haven't put any effort into exploring other options yet.
Cursor's agent mode has come a long way since I first mentioned it in my year-end GenAI tooling review. There's even (finally) a background mode, so you can have multiple agents churning on tasks. However, I have grown very tired of the throttling on Cursor when I use up all of my "fast" credits, which usually happened within a few days of the start of the billing cycle. I also get the sense that Cursor is motivated to minimize the amount of tokens that they pay for on the $20/mo plans, which may explain the poor performance I experienced. But Cursor is a VS Code fork, and sometimes I prefer those ergonomics to those of the terminal, which is when I still find myself reaching for Cursor. Also, since I do most of my day-to-day work with Claude Code, Cursor throttling is less of a problem for me.
As for how to get decent code written by LLMs, the guidance in the Anthropic Claude Code best practices doc is what finally helped me get decent results. That doc is very specific to Claude Code, but the section "3. Try common workflows" seems like broadly applicable guidance that will improve results with other agentic coding tools that work similarly to Claude Code.
Thanks to that Anthropic guidance, I've been able to get several useful results out of LLMs faster than I could have produced them on my own, even taking into account rework and time spent reviewing and correcting the code. Here are a few examples:
- Write some complex Python scripts implementing an Apache Beam job on Google Dataflow. I didn't know anything about Beam or Dataflow before I started, so the bar for "what I could have done on my own" was so low that even with several stumbles the LLM got me something good enough much faster than I could have myself. Also, these are internal tools for testing models; they're not customer-facing or mission-critical, so I felt more at ease mostly vibe-coding the scripts.
- Read a complex and very poorly constructed Typescript codebase to figure out how some APIs work. I am only vaguely familiar with Typescript and my web development days are 15 years behind me already, so having the agent explore the code, summarize what it found, and point me to actual code lines where I could see things for myself was a huge time savings.
- Plan, execute, and test a substantial refactoring of the aforementioned Typescript codebase to correct a pretty serious structural deficiency in how it was constructed. I can't emphasize enough the importance of "plan" here. In fact this was done over multiple agent sessions; the first one was entirely dedicated to authoring and refining a Markdown plan document. By the time the document was done, the context window was already full. Each subsequent step was implemented in another session with a fresh context window, but that was fine because I just instructed the agent to read the plan doc and pick up where we left off. Here again, if I were a master Typescript REST API developer, I'm not entirely sure that the agent would have been faster than me doing it myself. But as it was, the agent gave me the ability to quickly and thoroughly refactor a foreign codebase in a language that I don't know well, in just a day.
- Write a Rust program to do a complex analysis on a memory dump. I am a very strong Rust programmer and could easily have done this one myself, but in this case it was a weekend, my mental energy level was low, and I wanted to try an experiment using an agent on a language and stack that I knew very well. If I had been fresh and highly motivated, I definitely could have finished this task faster than the agent did, but under the circumstances having the agent was the difference between doing it and not getting around to it. It didn't do a very good job initially, but thankfully the Rust compiler is so fastidious that I only rarely had to intervene to nudge it in the right direction. This program was also a research project and doesn't operate on customer data, so I had no reservations at all about vibe-coding it.
If you do nothing more than follow the Anthropic best practices with Claude Code and make a good-faith effort to learn the nuances of how the various coding models work, I think you'll get good results at least some of the time. This is especially true if you use agents to do tasks that otherwise would not be done at all, for reasons of mental energy or familiarity with a codebase or tech stack.
However, I would also urge you to augment the Anthropic best practices by investing heavily in the technique described in the next section. Your codebase needs to be written for AI, since it's probably inevitable at this point that at least parts of it are going to be written by AI.
For
Already in last December's year-end GenAI tooling review, the kernels of what became "coding for AI" were present:
- The corollary of the previous bullet is that generating docs optimized for LLM consumption will be much more important, particularly for new tools and languages. I think it's inevitable that software development agents will need to get much better at looking up documentation, and when they do, the extent to which that documentation is easily consumed by whatever mechanism they use will be important. Right now it seems like dumping all documentation content into a big Markdown file is a pretty good approach, but I bet this will be refined over time. This applies not just to developer docs but also to end-user docs. On the plus side, perhaps this will finally be the death of product docs locked away behind a login?
In the intervening 5 months of working with agentic systems, it's become abundantly clear that that prediction was right but insufficient: it's not just docs that help LLMs, but anything that can be invoked as a tool to provide actionable feedback on their output.
In fact, it turns out that the things I've been doing throughout my career to harden a code base against the predations of eager juniors and incompetent offshore "seniors" brought in for the latest "this-time-it's-different" management cost-cutting scheme also go a long way toward making agents more useful. Every programmer I know who has done anything with LLMs in the last two years has inevitably characterized the experience as that of working with an eager junior, tireless and overconfident. If you, like me, enjoy the experience of mentoring a promising and eager junior as he or she grows into a more capable programmer, then you will probably protest that a stochastic parrot wrapped in an agentic framework is something entirely different and qualitatively inferior. I won't argue that point, but just like an eager junior (or the latest outsourcing scammer), LLMs have no actual understanding of anything they write, and they lack any judgement by which to evaluate what they have built. If you force them to get their code to pass a type check or compilation step, a linter, a beautifier, unit tests, integration tests, maybe some dynamic analysis, you automate much of the tedious and error-prone verification work, so that by the time the code gets to you for review you at least know you won't see any mistakes that those earlier steps could have caught.
In the case of coding agents, this isn't just a way to spare yourself the brunt of their vibe-coded stupidity. In many cases it seems that this feedback cycle somehow guides the agent along a random walk to more likely arrive at an acceptable answer. I suppose if you think of the underlying LLM as a stochastic parrot, then it makes sense that the more guardrails you put in place, the more values the stochastic parrot can sample, thus increasing the odds that it eventually produces something that's at least acceptable.
Here are some of the actual things I've put in place in codebases where I want to enable (or in some cases, lack the power to prohibit) productive use of coding agents:
- Maintain a CLAUDE.md (for Claude Code) or Cursor Rules (for Cursor) or whatever your agent uses.

  You can make agents read a README or some other document, but they will always look at the agent-specific documentation and load some or all of it into the context window. This is where you should put exactly the kinds of instructions that you would write for a new junior. Explain what the project is, what it does, how to compile it, what the coding standards are, how to run the tests, etc.

  As you see the agent doing something stupid, update the rules doc accordingly. For example, I had a Python project where Cursor kept trying to add dependencies by just doing a `pip install`, but this was a `uv` project with a `pyproject.toml`. I wrote a Cursor rule explicitly forbidding ever running `pip install` or creating a `requirements.txt`, and stating that all dependencies must go into `pyproject.toml`. On a Rust project we used workspace dependencies, but agents always try to add dependencies to the `Cargo.toml` for the crate that will use the dependency, so I wrote a script that worked like `cargo add` but behaved how I wanted, and put some text in the rules file forbidding direct edits to `Cargo.toml` when adding dependencies and requiring the use of the script instead. (There is a sketch of this kind of rules file after this list.)

  As with bargain-basement offshore scammers, you will likely never stop discovering new agent misbehavior that needs to be explicitly prohibited. But if you keep up this discipline you will be rewarded with better agent performance. And unlike those garbage-tier outsourced idiots, I find that coding agents actually try to follow the rules at least 75% of the time.
- If something can be done deterministically, do that - By this I mean that it's better to have an entry in the agent's rules file that says "always run `frobnulator.sh` to check your work for errors before considering a task complete" than to write "always run `foo` and also `bar` and then `baz` to check your work for errors before considering a task complete". It takes fewer tokens, and gives the agent less opportunity to fuck up. Maybe it ran `foo`, that passed, then `bar` failed, then it made a fix for that failure which happens to break `foo`, but it doesn't re-run `foo` and declares victory when `bar` passes.

  In a Rust codebase recently I used `just` to make a command `vibecheck` that ran a multitude of checks: `cargo check --tests` and `clippy` and a few quick unit tests and some more expensive unit tests and some integration tests. In the agent's rules file I just had "Run `just vibecheck` to compile and run all tests after every task". (There is a sketch of what that looks like after this list.)

  This is also helpful in cases where there are multiple ways to run things, like in a Python project where you want to use `uv run`. Agents can try to get clever and run a tool the wrong way or with the wrong version of Python, which will cause them to get wrapped around that axle, so it's better to constrain them as much as you can.
- Do as much static type checking as possible - Let the compiler or type checker do as much of the verification work as possible to catch LLM fuckups.

  If it's a Python project, use `mypy` and require types everywhere.

  If it's Typescript, configure the compiler as strictly as possible and prohibit `any`. (There is a sketch of a strict `tsconfig.json` after this list.)

  If you have the pleasure of working on a Rust codebase, make sure you run not only the compiler but also `clippy` with at least the default lints, and consider enabling some additional ones on a case-by-case basis.

  In all cases, make sure whatever compiler or linter you use is configured to fail the build on any warnings as well as all errors, and obviously make sure that running these checks is a stated requirement in the agent's rules file.
- Automate all the tests - You really should be doing this anyway, but coding agents make it even more valuable to have reliable, quick, and automated tests. If you have flaky tests, they will absolutely drive a coding agent mad. If you don't have any automated tests, then the coding agent will very happily "fix" things by disabling authentication or turning off a feature or silently eating an exception.

  The good news is that one of the most compelling use cases for coding agents is the generation of unit tests. The Anthropic best practices guide that I linked above goes into their recommendations on this in more detail, but suffice it to say that in 2025 there is no excuse for a project not having at least some basic unit tests covering the majority of the application's functionality. If engineering leadership or the Product org push back and insist that your JIRA tickets are higher priority, show them this post and explain how they are failing to capitalize on AI-powered efficiency gains by not having reliable tests, and ask them how they will explain to the CEO why they are not utilizing GenAI to drive engineering efficiencies. If that doesn't result in an immediate test automation mandate then you and I clearly have very different professional milieus.

  The more bugs you can protect against with automated tests, the better. I think of it in adversarial terms: there's an AI run amok intent on ruining your project with subtly-wrong vibecode, and your best defense is your own army of (deterministic) machines to catch stochastic fuckups before they can land in `master`. As an added bonus, these tests help you and any other humans on your team as well, none of whom are infallible.

- Don't assume the agent has run any of your checks - Most of the time the agents will follow instructions about mandatory checks, but in my experience even with Claude Code it's not 100% of the time. So you still need to have CI, and that CI still needs to run at least all of the same checks that the agent is required to run, and if those checks fail it still needs to block the pull request.
- CI should be able to deploy the whole stack and test it - Arguably this is just part of the previous point, but I want to call it out specifically.

  In many projects, for various reasons, it's not practical for developers to run the entire product or solution on their local systems. If they can, that's ideal, because then you can do that as part of the "automate all the tests" point and you're done. But even if not, there's still immense value in having a CI step that deploys the whole solution and runs it, as realistically as possible, with some automated tests that can verify the actual behavior of the whole system.

  I know, if you don't have this yet it's quite a bit of effort to set up. But coding agents are now starting to move out of the developer environment and onto separate dedicated compute, where they grind asynchronously on a Github issue and prepare a pull request for you to review at your leisure. That means that if you have CI checks that can catch problems, even if they take 30 minutes to run and aren't practical to make the agent run itself, as long as they block the PR from landing and provide some meaningful failure message when they break, they can still save some of your time by catching stupid things that the other layers of testing somehow missed. And here again, this is a good thing to have even if GenAI turns out to have been an opioid fever dream and you go back to writing your own code.
- Accept that vibe-coded garbage is going to get into the codebase and know what to do when that happens - If you are one of those teams that are always shipping and do their debugging in prod, you hopefully have already ensured that you can quickly recover from a bad push, whether it's a 100% organic human-originated fuckup or the cheap AI-powered kind.

  What changes when using coding agents is the volume of (often dubious) code that can be produced. This phenomenon is addressed in a great article titled The Coming Knowledge-Work Supply-Chain Crisis, which I urge you to read. In the past, without any GenAI tools, teams could do everything right, and shit still broke in production due to human error. Coding agents, if they live up to even a fraction of their creators' hype, will dramatically increase the amount of code being produced, and I see no reason to believe that code will be any less fallible than the code humans write; I'm confident that it will in fact be worse.

  So you need to be able to recover from bad commits quickly.

  In my case with Elastio, most of our product is installed in customer accounts, so we don't have the luxury of pushing that code to prod dozens of times a day. That fact, combined with the nature of Elastio and its role in our customers' security postures, means there is a high bar for quality in much of the code base. For this kind of product, you should consider the provenance of all code very carefully and make it clear which named developer is personally responsible for which pull request, regardless of where it came from. I would suggest that low-effort vibe-coded PRs that waste reviewers' time and risk the integrity of the product should be prohibited, and perpetrators punished when caught, although I'm also aware that in most orgs that's just a fantasy.
- Use a beautifier, but only at the end of tasks - LLMs produce some really ugly code, particularly if you are sensitive to things like trailing whitespace. In the LLMs' defense, humans are also pretty terrible at this. Fortunately we have beautifiers now. Definitely use one to check at the CI stage that new code has been formatted according to the standards. But there's a nuance that I ran into when using this with Claude Code specifically, and it probably applies to other agents as well.

  When the agent is working on a file, it loads part of the file into its context window, which lets it rewrite that part of the file with some other content relatively efficiently. There's a safety feature in the tool that Claude Code uses to write to a file that detects when what the LLM thinks is currently in the file doesn't match what's actually there. There are very good reasons for that check, which I'm sure is there to avoid an otherwise common LLM failure mode. However, if the beautifier has run since the LLM wrote the code, subsequent writes will fail and the agent has to load the file again. That not only slows down the agent but also uses up more context window tokens.

  So resist the urge to make the beautifier part of the standard tools that you make the agent run as part of its work. Beautifiers don't provide any feedback that would help the agent anyway, so it's better to run them in a commit hook, or as some explicit final step separate from the code checks.
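To make the rules-file point concrete, here is a minimal sketch of the kind of CLAUDE.md I mean. The specifics (a `uv`-managed Python project, the exact commands) are illustrative rather than copied from a real project:

```markdown
# CLAUDE.md

## Project overview
A REST API service written in Python 3.12. Dependencies and tooling are managed
with uv; all project configuration lives in pyproject.toml.

## Dependencies
- NEVER run `pip install` and NEVER create a `requirements.txt`.
- All dependencies go in `pyproject.toml`. Add them with `uv add <package>`.

## Checks
- Run `uv run mypy .` and `uv run pytest` before considering any task complete.
- A task is not done until both pass with zero errors and zero warnings.
```

The point is not these particular rules; it's that every time you catch the agent doing something stupid, a new prohibition gets added here.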
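Similarly, here is roughly the shape of the `just vibecheck` setup; the recipe below is a simplified sketch of the real thing, which splits the quick and expensive test suites:

```just
# justfile: one deterministic entry point for the agent (and humans) to run
vibecheck:
    cargo check --tests
    cargo clippy --all-targets -- -D warnings
    cargo test
```

The agent only ever needs to know the one command named in the rules file, and the recipe can change underneath it without spending any more context-window tokens.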
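And for the static-typing point, the Typescript side of "as strict as possible" mostly comes down to compiler options; a sketch along these lines (see the Typescript docs for the full set of strictness flags):

```jsonc
{
  "compilerOptions": {
    "strict": true,                   // turns on noImplicitAny, strictNullChecks, etc.
    "noUncheckedIndexedAccess": true, // indexed access may be undefined; forces handling it
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true
  }
}
```

Prohibiting explicit `any` generally takes a lint rule on top of this (typescript-eslint has one for exactly that), since the compiler alone only catches the implicit kind.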
I'm sure over time we'll discover more techniques for building guardrails to keep the agents on the road. I personally would love to find a way to detect the zero-value comments that LLMs are so fond of injecting into the code, as well as the deletion of valuable comments that for whatever reason they seem to deem unnecessary.
Conclusion
The tone of this text may suggest that I'm an AI skeptic, and possibly even a curmudgeonly graybeard gatekeeper nostalgic for the days of punch cards and walking to school uphill both ways in the snow. That is very much not the case. I am bullish on AI tools in general, I already get a lot of value from the tools as they exist today, and it seems certain that they will continue to increase in capability.
But I am also a jaded and cynical SWE who can clearly see the lazy and careless use of AI by people whose unaugmented abilities are low enough that they are not capable of evaluating the slop that their GenAI tooling is producing in their name. This is already wasting my time by making me read AI-generated slop docs and look at vibe-coded PRs from idiots who don't know, or possibly don't care, how obvious it is that their incompetence in their actual job is matched only by their inability to prompt LLMs to do their job for them.
I've written this article for others like me who feel the same way. You cannot stop this AI revolution, and you cannot hide from the slop, but I urge you to keep an open mind and try to regard this new technology with the wonderment that I remember from my youth as I first discovered programming and then the Internet. There is real value to be had here, and it's worth your time to figure out how to take advantage of it.
Statement of Provenance
I wrote the text of this article entirely myself, by thinking thoughts and translating those thoughts into words which I typed with my hands on a keyboard. Any em-dashes, proper grammar and spelling, or use of the words "underscore" and "delve" is entirely coincidental.
OpenAI's GPT-4.5 model was used to copyedit a draft of this text for typos and sloppy or lazy writing. Its feedback was read by me with my eyeballs, the proposed changes were considered by me with my own brain, and changes that I agreed with were again made with my own hands.
All thoughts and clever turns of phrase are my own, and do not necessarily reflect the opinions of Elastio, nor did they emerge from a high-dimensional latent space on NVIDIA silicon.