
2024 Year-end GenAI Tooling Review

anelson December 31, 2024 #genai

I don't normally spend any time on navel-gazing posts about tools that I use, to say nothing of year-end predictions. However, I want to record the GenAI tooling I'm using at the end of 2024 for my day-to-day software engineering work, mainly because it's such a fast-moving field that I suspect it will be amusing to revisit this in a year or five and marvel at how primitive our lives were. At the same time, I'm also curious to see how some of my predictions about the future of the field hold up over time.

With that preamble out of the way, on to the listicle:

GenAI Tooling Iā€™m Using Daily

Cursor

I'm not exactly on the cutting edge with this one. I don't recall when I discovered Cursor; sometime earlier in 2024 on a Hacker News thread, most likely. I did an eval of an early version and found it to be an incremental improvement on the prevailing GenAI dev workflow at the time, which was copy-pasting code from Claude and then copy-pasting compiler errors and program output back into Claude when something went wrong.

However, what has changed recently, and what leads me to gladly pay the $20/mo for Cursor, is the new "agent" feature in Composer. "Agentic" is the hotness right now, to the point that if your AI grift startup doesn't have an agentic story then you're not going to be taken seriously. In Cursor's case the agentic features are pretty simple, but they also unlock a more productive way of working with LLMs. Cursor can now not only make changes to code files as part of conversations in Composer, but also run commands in the shell (subject to user approval) and see the results automatically. It sounds like a small thing, something you can already do by copy-pasting between Claude or ChatGPT and existing tools, but for me it's removed annoying friction and lets me use existing tooling to help the LLM correct its inevitable hallucinations and screw-ups.

Now that this feature is available, it's more important than ever to have tooling in place like static analysis tools and automated test suites. I can tell Cursor to write some bit of code, and then tell it to run the tests or run a linter, knowing that this will surface a lot of the typical LLM fuckups that I've come to expect when programming with GenAI. It's not perfect of course, but when it works I can almost mindlessly hit Command-Enter repeatedly to approve the model's various flights of fancy and let it figure out the details as it runs afoul of clippy or a test or the compiler itself.
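To make that concrete, here's a toy sketch of the kind of guardrail I mean. The parse_duration_secs function and its test are invented for illustration, not lifted from a real project; the point is just that a boring unit test, plus telling the agent to run cargo test and cargo clippy -- -D warnings after each change, catches a lot of the silent breakage an LLM will otherwise happily introduce.

    // Hypothetical example: a small function plus an invariant test. If the
    // agent "refactors" the parsing logic and breaks it, `cargo test` fails
    // and the error output gets fed straight back into the conversation.

    /// Parses a duration string like "15m" or "2h" into seconds.
    fn parse_duration_secs(s: &str) -> Option<u64> {
        // Split off the trailing unit character; empty input bails out here.
        let unit = s.chars().last()?;
        let num = &s[..s.len() - unit.len_utf8()];
        let n: u64 = num.parse().ok()?;
        match unit {
            's' => Some(n),
            'm' => Some(n * 60),
            'h' => Some(n * 3600),
            _ => None,
        }
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn parses_units_and_rejects_garbage() {
            assert_eq!(parse_duration_secs("15m"), Some(900));
            assert_eq!(parse_duration_secs("2h"), Some(7200));
            assert_eq!(parse_duration_secs(""), None);
            assert_eq!(parse_duration_secs("weeks"), None);
        }
    }

The test itself is deliberately dull; the value is that the agent runs it automatically and reacts to the failure output, instead of me copy-pasting errors around by hand.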

It's become a cliche at this point to characterize current SOTA LLMs as eager, tireless, but often incompetent junior developers. That goes double now, when using the agent feature. Just like you would harden a repo against the well-intended predations of junior devs with branch protection and a bunch of checks in the CI system, that same effort pays dividends when using LLMs. Today, the agent workflow in Cursor is interactive, but companies are already starting to sell junior-dev AI services that operate entirely in the background on a GitHub issue, potentially coming back hours later with a PR. The more you can automate checking the work of the LLM with existing tooling, the more likely it is that these tools (well, future iterations of them anyway; right now they're still pretty raw) will be able to provide some value.

Speaking of cliches, another one is that using Cursor to take on unfamiliar tasks, languages, or frameworks empowers one to take on work that would otherwise be too time-consuming or intimidating. I can confirm this as well. My preferred language today is Rust, which I know very well (and which LLMs generally don't know very well, presumably due to a lack of training data). However, sometimes as punishment for sins I must have committed in a past life, I need to work with languages that aren't Rust. Most recently, Python. Being able to have the LLM guide me through the subtleties of the language and which packages are available for which tasks is a huge unlock. The resulting code is not great in many cases, and I'm sure professional Python devs would cringe at my output, but this isn't a shipping product; it's internal tooling and glue, where getting something done matters a lot more than stylistic purity. In many cases the LLM makes the difference between a quick-and-dirty script that exists and helps get things done, and not having anything at all.

Claude

I have access to the Claude 3.5 Sonnet model as part of my Cursor subscription, but I also pay Anthropic $20/mo for access to Claude on my own. I need this for a few reasons:

First, I use LLMs for non-programming tasks as well. To give but one example, I currently live in Budapest but speak no Hungarian. I can take a photo of some document or sign posted by the entrance of my building, and Claude will not only translate it but also explain it and answer questions. My wife uses my account as well, interacting in Russian and Ukrainian, to similar good effect (don't tell Anthropic, please!).

Second, in Cursor's presentation of Claude one doesn't get the raw model; Cursor has extensive prompting in place to guide the model to the task at hand. I find that this often works great, but sometimes it confuses the model and results in it doing dumb things that obviously won't work. When I see this happening, I'll sometimes pop over to Claude directly and tell it what I'm trying to do, have it generate a prompt and code example for use with an LLM, then paste that back into the Composer conversation and get things back on track.

Finally, Anthropic frequently releases cool new functionality that is only available in the app, or at least starts there, and I like to be able to pick up and play with new stuff as it lands.

ChatGPT

I recently re-activated my $20/mo ChatGPT subscription, which I had canceled once Claude 3.5 Sonnet took over as the SOTA model for dev tasks. Part of this was to play with the stuff they announced rapid-fire at the end of the year, and part of it was to be able to play with o1 (having done so, I don't think it's superior to Sonnet for my work). I probably ought to cancel this again, but I'm keeping it around for the rare cases where I want to play Sonnet against another model to sanity-check its analyses.

Perplexity

I have found myself using Perplexity search by default now, unless I'm doing a very mechanical lookup that I know Kagi will handle quicker (I also pay for Kagi). The Pro search is clearly better, and does a pretty good job of sifting through SEO crap and blogspam to get to meaningful content. I find this is particularly true when I'm shopping for something and want to find the best option. Long gone are the days when one could search "best gonkolator" and actually get back actionable, unbiased results that help you figure out who sells the best gonkolator. Instead one must take an adversarial approach, interrogating each result on the assumption that it's a bad actor trying to trick you into clicking an affiliate link. For the most part, Perplexity Pro does that first pass on its own, making it much easier to sift through what's left.

GenAI Predictions for 2025

In no particular order: