
2024 Year-end GenAI Tooling Review

anelson December 31, 2024 #genai

I don't normally spend any time on navel-gazing posts about tools that I use, to say nothing of year-end predictions. However, I want to record the GenAI tooling I'm using at the end of 2024 for my day-to-day software engineering work, mainly because it's such a fast-moving field that I suspect it will be amusing to revisit this in a year or five and marvel at how primitive our lives were. At the same time, I'm also curious to see how some of my predictions about the future of the field hold up over time.

With that preamble out of the way, on to the listicle:

GenAI Tooling Iā€™m Using Daily

Cursor

I'm not exactly on the cutting edge with this one. I don't recall when I discovered Cursor; sometime earlier in 2024 on a Hacker News thread, most likely. I did an eval of an early version and found it to be an incremental improvement on the prevailing GenAI dev workflow at the time, which was copy-pasting code from Claude and then copy-pasting compiler errors and program output back into Claude when something went wrong.

However, what has changed recently, and what leads me to gladly pay the $20/mo for Cursor, is the new "agent" feature in Composer. "Agentic" is the hotness right now, to the point that if your AI grift startup doesn't have an agentic story then you're not going to be taken seriously. In Cursor's case the agentic features are pretty simple, but they also unlock a more productive way of working with LLMs. Cursor can now not only make changes to code files as part of conversations in Composer, but also run commands in the shell (subject to user approval) and see the results automatically. It sounds like a small thing, something you can already do by copy-pasting between Claude or ChatGPT and existing tools, but for me it's removed annoying friction and lets me use existing tooling to help the LLM correct its inevitable hallucinations and screw-ups.

Now that this feature is available, it's more important than ever to have tooling in place like static analysis tools and automated test suites. I can tell Cursor to write some bit of code, and then tell it to run the tests or run a linter, knowing that this will surface a lot of the typical LLM fuckups that I've come to expect when programming with GenAI. It's not perfect of course, but when it works I can almost mindlessly hit Command-Enter repeatedly to approve the model's various flights of fancy and let it figure out the details as it runs afoul of clippy or a test or the compiler itself.
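To make that concrete, here's a toy sketch of the kind of guardrail I mean. The parse_duration_secs function and its test are invented for illustration, not lifted from a real project; the point is just that a boring unit test, plus telling the agent to run cargo test and cargo clippy -- -D warnings after each change, catches a lot of the silent breakage an LLM will otherwise happily introduce.

    // Hypothetical example: a small function plus an invariant test. If the
    // agent "refactors" the parsing logic and breaks it, `cargo test` fails
    // and the error output gets fed straight back into the conversation.

    /// Parses a duration string like "15m" or "2h" into seconds.
    fn parse_duration_secs(s: &str) -> Option<u64> {
        // Split off the trailing unit character; empty input bails out here.
        let unit = s.chars().last()?;
        let num = &s[..s.len() - unit.len_utf8()];
        let n: u64 = num.parse().ok()?;
        match unit {
            's' => Some(n),
            'm' => Some(n * 60),
            'h' => Some(n * 3600),
            _ => None,
        }
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn parses_units_and_rejects_garbage() {
            assert_eq!(parse_duration_secs("15m"), Some(900));
            assert_eq!(parse_duration_secs("2h"), Some(7200));
            assert_eq!(parse_duration_secs(""), None);
            assert_eq!(parse_duration_secs("weeks"), None);
        }
    }

The test itself is deliberately dull; the value is that the agent runs it automatically and reacts to the failure output, instead of me copy-pasting errors around by hand.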

It's become a cliche at this point to characterize current SOTA LLMs as eager, tireless, but often incompetent junior developers. That goes double now, when using the agent feature. Just like you would harden a repo against the well-intended predations of junior devs with branch protection and a bunch of checks in the CI system, that same effort pays dividends when using LLMs. Today, the agent workflow in Cursor is interactive, but companies are already starting to sell junior-dev AI services that operate entirely in the background on a GitHub issue, potentially coming back hours later with a PR. The more you can automate checking the work of the LLM with existing tooling, the more likely it is that these tools (well, future iterations of them anyway; right now they're still pretty raw) will be able to provide some value.

Speaking of cliches, another one is that using Cursor to take on unfamiliar tasks, languages, or frameworks empowers one to take on work that would otherwise be too time-consuming or intimidating. I can confirm this as well. My preferred language today is Rust, which I know very well (and which LLMs generally don't know very well, presumably due to a lack of training data). However, sometimes as punishment for sins I must have committed in a past life, I need to work with languages that aren't Rust. Most recently, Python. Being able to have the LLM guide me through the subtleties of the language and which packages are available for which tasks is a huge unlock. The resulting code is not great in many cases, and I'm sure professional Python devs would cringe at my output, but this isn't a shipping product; it's internal tooling and glue, where getting something done matters a lot more than stylistic purity. In many cases the LLM makes the difference between a quick-and-dirty script that exists and helps get things done, and not having anything at all.

Claude

I have access to the Claude 3.5 Sonnet model as part of my Cursor subscription, but I also pay Anthropic $20/mo for access to Claude on my own. I need this for a few reasons:

First, I use LLMs for non-programming tasks as well. To give but one example, I currently live in Budapest but speak no Hungarian. I can take a photo of some document or sign posted by the entrance of my building, and Claude will not only translate it but also explain it and answer questions. My wife uses my account as well, interacting in Russian and Ukrainian, to similar good effect (don't tell Anthropic, please!).

Second, in Cursor's presentation of Claude one doesn't get the raw model; Cursor has extensive prompting in place to guide the model to the task at hand. I find that this often works great, but sometimes it confuses the model and results in it doing dumb things that obviously won't work. When I see this happening, I'll sometimes pop over to Claude directly and tell it what I'm trying to do, have it generate a prompt and code example for use with an LLM, then paste that back into the Composer conversation and get things back on track.

Finally, Anthropic frequently releases cool new functionality that is only available in the app, or at least starts there, and I like to be able to pick up and play with new stuff as it lands.

ChatGPT

I recently re-activated my $20/mo ChatGPT subscription, which I had canceled once Claude 3.5 Sonnet took over as the SOTA model for dev tasks. Part of this was to play with the stuff they announced rapid-fire at the end of the year, and part of it was to be able to play with o1 (having done so, I don't think it's superior to Sonnet for my work). I probably ought to cancel this again, but I'm keeping it around for the rare cases where I want to play Sonnet against another model to sanity-check its analyses.

Perplexity

I have found myself using Perplexity search by default now, unless I'm doing a very mechanical lookup that I know Kagi will handle quicker (I also pay for Kagi). The Pro search is clearly better, and does a pretty good job of sifting through SEO crap and blogspam to get to meaningful content. I find this is particularly true when I'm shopping for something and want to find the best option. Long gone are the days when one could search "best gonkolator" and actually get back actionable, unbiased results that help you figure out who sells the best gonkolator. Instead one must take an adversarial approach, interrogating each result on the assumption that it's a bad actor trying to trick you into clicking an affiliate link. For the most part, Perplexity Pro does that first pass on its own, making it much easier to sift through what's left.

GenAI Predictions for 2025

In no particular order: