Musings and misadventures of an expat entrepreneur

Late Summer 2025 GenAI Tooling Review

anelson August 12, 2025 #genai

It’s been just eight months since my 2024 year-end GenAI tool review, and the landscape already feels unrecognizable. I am posting this update to record how much has changed, mainly so that I can amuse myself by revisiting this in the future and marveling at how primitive things were back in the antediluvian GenAI epoch of 2025.

Tools I’ve Added

Claude Code

Around May of 2025 I mentioned using Claude Code in my With, By, or For post; by that point I had already found myself preferring Claude Code for much of my LLM-augmented programming work, but I still had a few Cursor windows open on projects that I had started in Cursor and kept there out of inertia. Now I’m using Claude Code exclusively for all of my LLM programming assistant and vibe-coding needs.

Anecdotally, I’m not alone. By now Claude Code has exploded in popularity, with many users who would actually prefer a VS Code fork like Cursor putting up with the terminal aesthetic simply because Claude Code is a better tool (having said that I think there are now VS Code extensions that embed the Claude Code engine so you don’t have to touch that icky terminal if you don’t want to). It’s even penetrated the lazy vibe-coder zeitgeist, such that I’ve noticed the developers in my company who used to make slop PRs with Cursor and Copilot now use Claude Code instead.

Anthropic didn’t even invent terminal-based LLM coding assistants. I recall playing with the open-source aider shortly after OpenAI released the GPT-3.5 API. I found it clunky and limiting and not smart enough to do anything meaningful, and went back to what was then the state of the art, copy-pasting text between a terminal window and the ChatGPT web interface. There were several terminal-based LLM clients of various kinds, most of which I never tried since Aider was such a bust. When I learned about Claude Code on Hacker News, I fully expected it to suck. Then I was blown away by its capabilities almost immediately.

I can’t say exactly how Anthropic was able to make such a great coding assistant that runs in the terminal. I have to assume it was built by a few independent developers without the supervision of a PM, as it’s hard to imagine a Product organization ever allowing a team to build a TUI as their first foray into LLM coding assistants. But kudos to those developers, they really nailed the text-mode UX, and it keeps getting subtly better over time without making jarring changes that completely break existing workflows and muscle memory (Cursor, I’m glaring at you!). Most importantly, though, somehow the agentic plumbing and specialized tools and prompts in Claude Code really make the Anthropic models sing. It’s still an overconfident and eager junior developer who will happily lie to try to please you, but it’s as if this junior developer has better instincts and is generally smarter than the ones that inhabit the other LLM coding assistant tools that I’ve used. That’s particularly wild because in fact it’s not true: Cursor uses the same Anthropic models as Claude Code, but sucks so much more!

I don’t have rigorous evals for performance on my coding tasks; my evidence is purely subjective vibe checks. If I were to invent a metric, it would be to analyze my chat transcripts with Cursor vs Claude Code and count how often I use expletives. I haven’t done that analysis, but I would be shocked if the ratio weren’t at least 5:1 in favor of Claude Code. It’s not that the underlying model doesn’t make stupid mistakes all the time; it absolutely does. It fails to apply changes, fucks up tool calls, and is constantly forgetting what the current directory is. But if you let it run with auto-accept enabled (or, for those of us who like to live --dangerously-skip-permissions, with the training wheels completely removed (IYKYK)), it for the most part figures it out.
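
For the uninitiated, the no-training-wheels mode is literally just a flag on the claude binary; a minimal invocation, assuming the flag hasn’t been renamed since I wrote this, looks something like:

    # start Claude Code in the current project with every permission prompt disabled;
    # only do this in a sandbox or a repo you can afford to let an eager junior loose on
    claude --dangerously-skip-permissions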

A month or two ago, I switched from metered use of Claude Code to their $200/mo Max subscription. The month before I switched, I spent almost $500 in Anthropic API usage, entirely due to Claude Code. The Max subscription is a great deal right now, to the point that I’m afraid Anthropic might come to their senses and jack up the price.

I may write a separate post to capture my current techniques for getting the most out of Claude Code, but suffice it to say that if you have not tried this yet, you need to stop reading this article and set up Claude Code. In the meantime, Anthropic has published some great practical guidance on how to get the most out of Claude Code based on their own internal dogfooding.

Tools I’ve Discarded

Cursor

I was a pretty enthusiastic early adopter of Cursor last year, and encouraged several of my team members at Elastio to try it. I got a lot of utility out of it, which I recounted at length in my previous tool review post. But it has long since been discarded; I already explained what motivated the move in the Claude Code section above, so refer back to that for the details.

Part of the problem with Cursor is that, once Anthropic showed me what it could be like to access SOTA LLM programming assistants inside a terminal integrated with the rest of my terminal tools like tmux, nvim, and zsh, having to touch the mouse and click around in Cursor felt like puttering around on a little scooter after a brisk ride on a Ducati. I didn’t know how much I wanted a terminal-based LLM assistant until I used one, and then I couldn’t go back.

But the bigger driver of the change from Cursor to Claude Code wasn’t the improved ergonomics of the terminal; it was the fact that Claude Code just seemed to perform so much better. This was kind of surprising, since when I used Cursor it was almost always with Anthropic models, the same ones that power Claude Code. Why does Cursor suck while Claude Code is so good? I think the issue is the economics of the two products. When Claude Code first came out, you gave it an Anthropic API key and paid for every token it consumed. That got expensive very quickly, but it also meant that there were no tricks being played on Anthropic’s side to economize on tokens and save money; quite the contrary, the more tokens the tool used, the more they earned. Cursor, on the other hand, got a flat $20/mo from me but had to pay Anthropic for every token. They did have some rate limiting, but clearly it wasn’t enough, and they had to do what they could to minimize the amount of context sent to Anthropic, resulting in much shittier performance in the tool. I believe they now have a different pricing model whereby you pay for usage, and for all I know maybe that has improved the quality of the LLM’s assistance, but I don’t care. Claude Code just feels more solid, like it was built by people who use it for serious software engineering, while Cursor felt to me like the kind of tool you use to vibe-code some slop so you can performatively ship fast and call yourself “cracked” on social media while script kiddies download the API tokens your slop leaked all over the Internet.

Tools That I Tried and Hated

Gemini CLI

Once Anthropic’s Claude Code had shown Product teams that there was an enthusiastic market for terminal-based coding assistants, the other big AI players rushed to ship their own me-too offerings. The one I most eagerly anticipated was Google’s CLI, which they ended up calling “Gemini CLI”, after their series of frontier models of the same name.

The Gemini Pro models, most recently 2.5 Pro, are widely hyped on Twitter and YouTube. They have something like a 1M-token context window and some impressive multimodal capabilities. LLM grifters, er, influencers, are constantly gushing about the galaxy-brain capabilities of this model. It’s sometimes hard to sift through the slop and figure out who is actually smart enough that a model out-thinking them on some task is a compelling endorsement. But the Gemini models are not a scam; they really do seem to perform well on programming-related tasks, at least in my limited testing. Back when I was still using Cursor, sometimes if it shat the bed running Claude Sonnet I’d switch over to Gemini 2.5 Pro and get a much smarter result, so I took it for granted that the Gemini CLI tool was going to be similarly competitive.

As an added bonus, my company has some generous GCP credits, so Gemini CLI is effectively free for me, unlike Claude Code which I have to pay for. So all Gemini CLI needed to do was be about as good as Claude Code, and I’d switch to it.

My God in heaven, it was not even close. Gemini CLI sucks. The UX sucks; it lacks the elegant but text-native polish of Claude Code. But worse than that, the performance on agentic coding tasks is…I think in this case it’s appropriate to use the word “retarded”. Just like I am amazed that Claude Code gets better performance from the same models that Cursor is using under the covers, I’m astonished that Gemini CLI is able to make Gemini 2.5 Pro suck as much as it does.

I immediately configured it to use gemini-2.5-pro, and gave it some simple Python and Rust tasks. Not even ones that Claude Code struggled with, just whatever I was working on at the time. It made stupid decisions, failed to make use of available information, simply forgot or ignored guidance in the prompt, and went in circles. It was utterly useless.
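
For reference, pointing the tool at the Pro model is just a command-line switch; if memory serves, the invocation was something like this, though the flag may well have changed by the time you read it:

    # run Gemini CLI with the Pro model pinned explicitly, rather than whatever default it picks
    gemini -m gemini-2.5-pro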

Sometimes if the other models I have access to get stuck on something, I’ll try giving it to Gemini 2.5 Pro, but using the Vertex AI web interface. It typically gives me some useful output, even if it’s not able to solve the problem itself. So something about the way the CLI wrapper prompts or invokes Gemini seems to lobotomize it.

Just an epic fail on Google’s part. My friends who are also eager early adopters of LLM assistants, some professional SWEs and some closer to a PM level of sophistication, have all reported the same results. It’s actually impressive how a seemingly-capable model can be made to be so stupid given the right scaffolding.

Jules

Jules is another Google AI project, this one an AI coding agent that runs on its own given a GitHub repo and a task description. Behind the scenes, Google spins up a VM, checks out the repo, and sets an LLM agent loose in that sandbox environment, where it can work as long as it needs to in order to accomplish the task.

This is in the same product category as OpenAI Codex (no, not that Codex), the widely hyped Devin, one of the dozens of different products sold under the “Copilot” brand, and approximately millions of other frantic me-too AI startup cash grabs.

It’s clear to me that, as a category, this is going to be a big part of the future of LLM coding assistants. There are a lot of advantages to this approach. If the tooling is good enough, agents could take a crack at dozens or hundreds of issues from the backlog, particularly chores like updating dependencies and making minor text changes, then produce PRs for human engineers to review. I’ve heard of teams inside OpenAI (meaning, teams with effectively infinite budget for AI spend) spawning 10 or more instances of these agents for a single task, then reviewing all of the solutions and picking the best one. Even if a problem is too hard for an agent to one-shot, you could leave PR comments just like you would with a human colleague, and maybe it’ll get it right after a few tries.

That’s how I envision these tools working someday. Perhaps there are tools that work like that today. However Jules is not such a tool.

Jules feels to me as if it were built by people who had no prior experience building software for a living. Or at least, no experience with the typical software lifecycles around backlog items, branches, pull requests, and eventual merges. If I’m being more generous, it feels like it was built by people whose bosses were anxious about getting their promotions in an era at Google where “AI” is the answer and no one gives a fuck what the question was, and who noticed that no exec had yet laid claim to building a background autonomous LLM programming agent, so they decided that this crude demo whipped up over a weekend to show the basic idea should be shipped to production without any regard to whether or not it fucking worked. It is an embarrassment, one which makes me feel actual pity for whoever works in the utterly dysfunctional org that allowed this to ship.

I gave it very simple tasks, like updating a Node.js dependency and using a new field added in the updated version of that dependency. It made the code change easily enough, so I assume it’s using Gemini 2.5 Pro without whatever makes Gemini CLI suck so much. But everything about the interaction sucked:

The “agent runs a branch in a sandbox and ships a PR” pattern is the future. Jules is not. Even if they somehow fix the leadership failure that led to this shipping, knowing Google, by the time it starts to get good they’ll kill it off.

Tools I’m Still Using Daily

These were all in the year-end tool review; although my use of them has evolved somewhat since then, they’re still daily drivers.

Claude Desktop (Max plan)

If I need an LLM to do something for me other than writing or debugging code, I first reach for Claude Sonnet 4 or Opus 4.1 in Claude Desktop.

Here is a selection of prompts from my actual Claude Desktop history:

My preference for the Anthropic models for these use cases stems from the fact that I already use them for my LLM coding needs in Claude Code, and my Max subscription gives me very generous limits in the Desktop app as well. I’ve never hit a usage limit; I do, however, hit outages much more often than I’d like.

ChatGPT Desktop (Plus plan)

Though the Anthropic models are still my default go-to when I want an LLM to do something for me or answer a question, I keep a paid ChatGPT subscription around for a few reasons:

Here are some examples of prompts I’ve used with the various OpenAI models recently:

Perplexity (Pro plan)

I would estimate that about 80% of the searches that I would once have done with Kagi (for which I also maintain a paid subscription), I now do with Perplexity, either Pro or Research. As the Internet descends further and further into the abyss of AI slop and weaponized SEO hacks, a tool that sorts through search results and gets to the information I actually want is well worth the $20/mo. Unfortunately, all of the LLM caveats apply. Not only does it outright hallucinate from time to time, but it’s not the best judge of character, and will uncritically regurgitate claims from whatever pages match the search terms. It’s not unusual to have to fall back to Kagi for searches where Perplexity gets…well, perplexed.

The Research feature is all of that, but more so. It’s incredibly valuable for finding sources of information that I wouldn’t have found myself without exhaustive searching, but it’s not at all good at separating out the bullshit from what’s real, and it often gets confused by competing claims in search results. Often, the only valuable output from a Research run is just the list of links it’s drawing on, which I can then read myself to get the information I needed.

I would share a few examples from recent Perplexity activity, but the Perplexity Electron app is user-hostile and prevents me from selecting and copying text from a Perplexity session. All I can do is use the Share feature to generate a link to it. The cynic in me suspects that some sociopath in Product had some incentive to drive engagement on shared Perplexity sessions and realized making those links the only way to share information would tweak the stats. Damn you!

As much value as I get from Perplexity the service, I would welcome a complete rewrite of the Perplexity Mac app. I hate it. The inability to copy-paste text is unforgivable, but it seems that whenever I Command-Tab over to it after it’s been out of focus for a few hours, I get the spinning volleyball of death and force quit rather than wait for it to right itself. This is on a 2024 M4 MBP with 48GB of RAM where absolutely nothing else, including DaVinci Resolve Studio, lags at all, so it takes a special kind of idiot to suck wind on this system.

That said, if there is a credible competitor to Perplexity (other than the web search and research features in the other LLM apps that I use), I’m not aware of it. I hope to see more competition in this space, as I don’t think Perplexity is particularly great at what it does; the mere fact that it works at all has value, but I’m sure it can be done better.

Predictions for H2 2025