2024 Year-end GenAI Tooling Review
anelson December 31, 2024 #genai

I don't normally spend any time on navel-gazing posts about tools that I use, to say nothing of year-end predictions. However, I want to record the GenAI tooling I'm using at the end of 2024 for my day-to-day software engineering work, mainly because it's such a fast-moving field that I suspect it will be amusing to revisit this in a year or five and marvel at how primitive our lives were. At the same time, I'm also curious to see how some of my predictions about the future of the field hold up over time.
With that preamble out of the way, on to the listicle:
GenAI Tooling I'm Using Daily
Cursor
I'm not exactly on the cutting edge with this one. I don't recall when I discovered Cursor; sometime earlier in 2024 on a Hacker News thread, most likely. I did an eval of an early version and found it to be an incremental improvement on the prevailing GenAI dev workflow at the time, which was copy-pasting code from Claude and then copy-pasting compiler errors and program output back into Claude when something went wrong.
However, what has changed recently, and what leads me to gladly pay the $20/mo for Cursor, is the new "agent" feature in Composer. "Agentic" is the hotness right now, to the point that if your AI grift startup doesn't have an agentic story then you're not going to be taken seriously. In Cursor's case the agentic features are pretty simple, but they also unlock a more productive way of working with LLMs. Cursor can now not only make changes to code files as part of conversations in Composer, but it can run commands in the shell (subject to user approval) and see the results automatically. It sounds like a small thing, something you can already do by copy-pasting between Claude or ChatGPT and existing tools, but for me it's removed annoying friction and lets me use existing tooling to help the LLM correct its inevitable hallucinations and screw-ups.
Now that this feature is available, it's more important than ever to have tooling in place like static analysis tools and automated test suites. I can tell Cursor to write some bit of code, and then tell it to run the tests or run a linter, knowing that this will surface a lot of the typical LLM fuckups that I've come to expect when programming with GenAI. It's not perfect of course, but when it works I can almost mindlessly Command-Enter repeatedly to approve the model's various flights of fancy and let it figure out the details as it runs afoul of clippy or a test or the compiler itself.
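To make that concrete, here's the flavor of fuckup I mean. The function and test below are invented for illustration, not pulled from a real session: the code compiles cleanly and looks plausible, but the exclusive range quietly drops the last value, and a one-line test catches it.

```rust
// Plausible LLM output: compiles, reads fine, and is wrong.
// `1..n` is an exclusive range, so `n` itself is never added.
// (The fix is `(1..=n).sum()`.)
fn sum_first_n(n: u32) -> u32 {
    (1..n).sum()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn sums_inclusively() {
        // 1 + 2 + 3 + 4 + 5 = 15; the buggy version above returns 10.
        assert_eq!(sum_first_n(5), 15);
    }
}
```

Point the agent at `cargo test` (and `cargo clippy -- -D warnings` for the stylistic stuff) and it will usually trip over the failing assertion and fix the range itself.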
It's become a cliche at this point to characterize current SOTA LLMs as eager, tireless, but often incompetent junior developers. That goes double now, when using the agent feature. Just like you would harden a repo against the well-intended predations of junior devs with branch protection and a bunch of checks in the CI system, that same effort pays dividends when using LLMs. Today, the agent workflow in Cursor is interactive, but companies are already starting to sell junior dev AI services that operate entirely in the background on a GitHub issue, potentially coming back hours later with a PR. The more you can automate checking the work of the LLM with existing tooling, the more likely these tools (well, future iterations of them anyway; right now they're still pretty raw) will be able to provide some value.
Speaking of cliches, another one is that Cursor's help with unfamiliar tasks, languages, or frameworks empowers one to take on work that would otherwise be too time-consuming or intimidating. I can confirm this as well. My preferred language today is Rust, which I know very well (and which LLMs generally don't know very well, presumably due to lack of training data). However, sometimes, as punishment for sins I must have committed in a past life, I need to work with languages that aren't Rust. Most recently, Python. Being able to have the LLM guide me through the subtleties of the language and which packages are available for which tasks is a huge unlock. The resulting code is not great in many cases, and I'm sure professional Python devs would cringe at my output, but this isn't a shipping product; it's internal tooling and glue and such, where getting something done matters a lot more than stylistic purity. In many cases the LLM makes the difference between a quick-and-dirty script existing and helping get things done, and not having anything at all.
Claude
I have access to the Claude 3.5 Sonnet model as part of my Cursor subscription, but I also pay Anthropic $20/mo for access to the Claude model on my own. I need this for a few reasons:
First, I use LLMs for non-programming tasks as well. To give but one example, I currently live in Budapest but speak no Hungarian. I can take a photo of some document or sign posted by the entrance of my building, and Claude will not only translate it but also explain it and answer questions. My wife uses my account as well, interacting in Russian and Ukrainian, to similar good effect (don't tell Anthropic please!).
Second, in Cursor's presentation of Claude one doesn't get the raw model; Cursor has extensive prompting in place to guide the model to the task at hand. I find that this often works great, but sometimes confuses the model and results in it doing dumb things that obviously won't work. When I see this happening, I'll sometimes pop over to Claude directly and tell it what I'm trying to do and have it generate a prompt and code example for use with an LLM, then paste that back into the Composer conversation and get things back on track.
Finally, Anthropic frequently release cool new functionality that is only available in the Claude app, or at least starts there, and I like to be able to pick up and play with new stuff as it lands.
ChatGPT
I recently re-activated my $20/mo ChatGPT subscription, which I had canceled once Claude 3.5 Sonnet took over as the SOTA model for dev tasks. Part of this was to play with the stuff they announced rapid-fire at the end of the year, and part of it was to be able to play with o1 (having done so, I don't think it's superior to Sonnet for my work). I probably ought to cancel this again, but I'm keeping it around for the rare cases where I want to play Sonnet against another model to sanity-check its analyses.
Perplexity
I have found myself using Perplexity search by default now, unless I'm doing a very mechanical lookup that I know Kagi will find quicker (I also pay for Kagi). The Pro search is clearly better, and does a pretty good job of sifting through SEO crap and blogspam to get to meaningful content. I find this is particularly true when I'm shopping for something and want to find the best option. Long gone are the days when one could search "best gonkolator" and actually get actionable and unbiased results back that would help you figure out who sells the best gonkolator. Instead one must take an adversarial approach, interrogating each result on the assumption that it's a bad actor trying to trick you into clicking an affiliate link. For the most part, Perplexity Pro does the first pass on its own, making it much easier to sift through what's left.
GenAI Predictions for 2025
In no particular order:
- 2025 is the year of agentic systems. This isn't so much a prediction as a parroting of the Zeitgeist on X at the moment. However one defines "agentic", Cursor's primitive agent feature has shown that the way to at least ameliorate the limitations of current SOTA LLMs is to plug them into deterministic systems that can automatically call out their bullshit hallucinations and keep them on a somewhat straight and narrow path (there's a minimal sketch of that loop after this list). To the point that I doubt this distinction will exist for much longer; systems that use GenAI and don't suck completely will just obviously be built as agents incorporating tool use.
- AI junior devs will be widely adopted. Not because it's a good idea, or because it will be a net positive for engineering productivity, but because the pressure on management to replace expensive and annoying developers with cheap and compliant AI will be too powerful to resist. To be clear, I'm bullish on GenAI tooling for devs in general, and use multiple tools every day to increase my productivity. But nothing that I have experienced in my long career gives me any reason to believe that adoption of these new technologies will be principled, measured, and rational. I fully expect that part of the job of senior engineers will be wrangling the army of AI agents submitting PRs based on issues and requirements documents that themselves were generated by AI tools, and I fully expect that to suck. It won't be politically acceptable to turn this off until enough negative experience permeates the MBA Zeitgeist and managers can be confident that they're not missing out on a hot new trend. That is multiple years away, unfortunately.
- Software will get a lot worse. Since I use LLMs to help me write code every day, in multiple languages, I have a pretty good sense of what they're capable of. Not a day goes by that I don't get code from a SOTA model that is very obviously wrong but compiles. "Obviously wrong" always includes stylistic, structural, and nominative defects, and in many cases subtle functional bugs as well. I'm quite certain that many developers will just Command-Enter until the code seems to work, and then ship it. At scale this will show up as flaky, confusing, slow, and in some cases utterly broken software. Some of that software might even be my own, if I am not constantly vigilant and never complacent. It's tempting to rail against the injustice, but I don't think that changes the outcome.
- New tooling and languages will be much harder to get adopted. Already I think a compelling steelman argument against adopting Rust in projects that are otherwise a good fit for it is that LLMs don't do a very good job of writing Rust, since there's comparatively little Rust code in public for LLMs to train on. As more and more of the code that gets shipped is built with LLMs by developers who don't understand the code that's being written, the ability to outsource thinking to the GenAI tools will be an important consideration for teams that don't have the luxury of employing motivated humans to write and review every line of code for correctness. This will impact languages but also tools. In the old days, part of the calculus when choosing a tool or a language was the ability to hire developers with those skills; now it will be the extent to which the foundational models that power modern AI agents can effectively analyze and generate code with those tools or languages. I suppose if you hate having to learn a new JavaScript bundler every year, this is good news, but I think it's a net negative.
- The corollary of the previous bullet is that generating docs optimized for LLM consumption will be much more important, particularly for new tools and languages. I think it's inevitable that software development agents will need to get much better at looking up documentation, and when they do, the extent to which that documentation is easily consumed by whatever mechanism they use will be important. Right now it seems like dumping all documentation content into a big Markdown file is a pretty good approach (see the second sketch after this list), but I bet this will be refined over time. This applies not just to developer docs but to end-user docs as well. On the plus side, perhaps this will finally be the death of product docs locked away behind a login?
- It will turn out that letting LLMs build entire codebases with little human understanding of what is being built will result in a mess that the LLM itself can't work with. This year there was a lot of breathless "holy shit!" posting on X in which an AI tool took a spec like "here's a screenshot of Facebook, now write the code for it" and produced the right HTML, CSS, and React. I've also seen it claimed that no one will pay for SaaS products anymore, since they can just tell ChatGPT to write whatever they need. I am quite confident that this is nonsense, and that the people making these claims are either grifters, ignorant of how complex software is built and operated, or in many cases likely both. However, we naysayers will be ignored in favor of the sugar rush of exciting AI future vibes, and funded companies will make a big deal out of using AI to build the whole product. At some point, I think likely later in 2025, the failure of this approach will become impossible to deny, but in the meantime expect a lot of breathless "the end is nigh for software engineers" takes.
- Developers without AI augmentation will be at a huge disadvantage. There is still a large population of developers posting on HN and elsewhere who insist that they see no value in GenAI tooling for the software engineering profession, and are much more productive without it. Without passing judgement on the merits of those statements one way or the other, I predict that this will be a position one will need to keep to oneself when looking for software dev jobs in 2025. As for myself, I can honestly say that I get a lot of value out of Cursor and SOTA models and I look forward to synergistically working tirelessly to increase shareholder value by leveraging disruptive AI technologies to push the boundary of what is possible whilst constantly shipping.
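As promised above, here's a minimal sketch of the agentic loop I have in mind, in Rust since that's where I live. `call_llm` is a hypothetical stand-in for whatever model API you'd actually use; the real point is that `cargo check` plays the deterministic judge that calls out the hallucinations.

```rust
use std::fs;
use std::process::Command;

/// Hypothetical stand-in for a real model API; takes a prompt, returns source code.
fn call_llm(prompt: &str) -> String {
    unimplemented!("wire this up to your model provider of choice; prompt was: {prompt}")
}

fn main() {
    let mut prompt = String::from("Write src/lib.rs implementing <the task>.");

    // The loop: generate, verify with a deterministic tool, feed the errors back.
    for _ in 0..5 {
        let code = call_llm(&prompt);
        fs::write("src/lib.rs", &code).expect("failed to write src/lib.rs");

        // `cargo check` is the part that calls out the bullshit hallucinations.
        let output = Command::new("cargo")
            .arg("check")
            .output()
            .expect("failed to run cargo");

        if output.status.success() {
            println!("compiles; now run the tests and clippy");
            return;
        }

        // The compiler's complaints become the next prompt.
        let errors = String::from_utf8_lossy(&output.stderr);
        prompt = format!(
            "That code failed to compile with these errors:\n{errors}\nFix it and return the complete file."
        );
    }

    eprintln!("gave up after 5 attempts; time for a human to look at it");
}
```

Cursor's agent is obviously far more elaborate than this, but squint and the shape is the same: generate, verify with a tool that can't be sweet-talked, feed the complaints back.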
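And the second sketch, for the docs bullet: all I mean by "dump everything into one big Markdown file" is something like the following, where the `docs/` directory and the `llm-docs.md` output name are made up for the example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively concatenate every Markdown file under `dir` into `out`,
/// prefixing each with its path so the consumer can tell where chunks came from.
fn bundle_docs(dir: &Path, out: &mut String) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            bundle_docs(&path, out)?;
        } else if path.extension().is_some_and(|ext| ext == "md") {
            out.push_str(&format!("\n\n# File: {}\n\n", path.display()));
            out.push_str(&fs::read_to_string(&path)?);
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut bundle = String::new();
    bundle_docs(Path::new("docs"), &mut bundle)?;
    fs::write("llm-docs.md", bundle)
}
```

No ordering, no cleverness; the path headers are just there so the model (or whoever pastes the file into a context window) can cite where each chunk came from.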