Musings and misadventures of an expat entrepreneur

Late Summer 2025 GenAI Tooling Review

anelson August 12, 2025 #genai

It’s been just eight months since my 2024 year-end GenAI tool review, and the landscape already feels unrecognizable. I am posting this update to record how much has changed, mainly so that I can amuse myself by revisiting this in the future and marveling at how primitive things were back in the antediluvian GenAI epoch of 2025.

Tools I’ve Added

Claude Code

Around May of 2025 I mentioned using Claude Code in my With, By, or For post; by that point I had already found myself preferring Claude Code for much of my LLM-augmented programming work, but I still had a few Cursor windows open on projects that I had started in Cursor and kept there out of inertia. Now I’m using Claude Code exclusively for all of my LLM programming assistant and vibe-coding needs.

Anecdotally, I’m not alone. By now Claude Code has exploded in popularity, with many users who would actually prefer a VS Code fork like Cursor putting up with the terminal aesthetic simply because Claude Code is a better tool (having said that I think there are now VS Code extensions that embed the Claude Code engine so you don’t have to touch that icky terminal if you don’t want to). It’s even penetrated the lazy vibe-coder zeitgeist, such that I’ve noticed the developers in my company who used to make slop PRs with Cursor and Copilot now use Claude Code instead.

Anthropic didn’t even invent terminal-based LLM coding assistants. I recall playing with the open-source aider shortly after OpenAI released the GPT-3.5 API. I found it clunky and limiting and not smart enough to do anything meaningful, and went back to what was then the state of the art, copy-pasting text between a terminal window and the ChatGPT web interface. There were several terminal-based LLM clients of various kinds, most of which I never tried since Aider was such a bust. When I learned about Claude Code on Hacker News, I fully expected it to suck. Then I was blown away by its capabilities almost immediately.

I can’t say exactly how Anthropic was able to make such a great coding assistant that runs in the terminal. I have to assume it was built by a few independent developers without the supervision of a PM, as it’s hard to imagine a Product organization ever allowing a team to build a TUI as their first foray into LLM coding assistants. But kudos to those developers, they really nailed the text-mode UX, and it keeps getting subtly better over time without making jarring changes that completely break existing workflows and muscle memory (Cursor, I’m glaring at you!). Most importantly, though, somehow the agentic plumbing and specialized tools and prompts in Claude Code really make the Anthropic models sing. It’s still an overconfident and eager junior developer who will happily lie to try to please you, but it’s as if this junior developer has better instincts and is generally smarter than the ones that inhabit the other LLM coding assistant tools that I’ve used. That’s particularly wild because in fact it’s not true: Cursor uses the same Anthropic models as Claude Code, but sucks so much more!

I don’t have rigorous evals for performance on my coding tasks; my evidence is purely subjective vibe checks. If I were to invent a metric, it would be to analyze my chat transcripts with Cursor vs Claude Code and count how often I use expletives. I haven’t done that analysis, but I would be shocked if the ratio weren’t at least 5:1 in favor of Claude Code. It’s not that the underlying model doesn’t make stupid mistakes all the time; it absolutely does. It fails to apply changes, fucks up tool calls, and is constantly forgetting what the current directory is. But if you let it run with auto-accept enabled (or, for those of us who like to live --dangerously-skip-permissions, with the training wheels completely removed (IYKYK)), it for the most part figures it out.
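
For the uninitiated, the no-training-wheels mode is literally just a flag on the claude binary; a minimal invocation, assuming the flag hasn’t been renamed since I wrote this, looks something like:

    # start Claude Code in the current project with every permission prompt disabled;
    # only do this in a sandbox or a repo you can afford to let an eager junior loose on
    claude --dangerously-skip-permissions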

A month or two ago, I switched from metered use of Claude Code to their $200/mo Max subscription. The month before I switched, I spent almost $500 in Anthropic API usage, entirely due to Claude Code. The Max subscription is a great deal right now, to the point that I’m afraid Anthropic might come to their senses and jack up the price.

I may write a separate post to capture my current techniques for getting the most out of Claude Code, but suffice it to say that if you have not tried this yet, you need to stop reading this article and set up Claude Code. In the meantime, Anthropic has published some great practical guidance on how to get the most out of Claude Code based on their own internal dogfooding.

Tools I’ve Discarded

Cursor

I was a pretty enthusiastic early adopter of Cursor last year, and encouraged several of my team members at Elastio to try it. I got a lot of utility out of it, which I recounted at length in my previous tool review post. But it has long since been discarded; I already explained what motivated the move in the Claude Code section above, so refer back to that for the details.

Part of the problem with Cursor is that, once Anthropic showed me what it could be like to access SOTA LLM programming assistants inside a terminal integrated with the rest of my terminal tools like tmux, nvim, and zsh, having to touch the mouse and click around in Cursor felt like puttering around on a little scooter after a brisk ride on a Ducati. I didn’t know how much I wanted a terminal-based LLM assistant until I used one, and then I couldn’t go back.

But the bigger driver of the change from Cursor to Claude Code wasn’t the improved ergonomics of the terminal; it was the fact that Claude Code just seemed to perform so much better. This was kind of surprising, since when I used Cursor it was almost always with Anthropic models, the same ones that power Claude Code. Why does Cursor suck while Claude Code is so good? I think the issue is the economics of the two products. When Claude Code first came out, you gave it an Anthropic API key and paid for every token it consumed. That got expensive very quickly, but it also meant that there were no tricks being played on Anthropic’s side to economize on tokens and save money; quite the contrary, the more tokens the tool used, the more they earned. Cursor, on the other hand, got a flat $20/mo from me but had to pay Anthropic for every token. They did have some rate limiting, but clearly it wasn’t enough, and they had to do what they could to minimize the amount of context sent to Anthropic, resulting in much shittier performance in the tool. I believe they now have a different pricing model whereby you pay for usage, and for all I know maybe that has improved the quality of the LLM’s assistance, but I don’t care. Claude Code just feels more solid, like it was built by people who use it for serious software engineering, while Cursor felt to me like the kind of tool you use to vibe-code some slop so you can performatively ship fast and call yourself “cracked” on social media while script kiddies download the API tokens your slop leaked all over the Internet.

Tools That I Tried and Hated

Gemini CLI

Once Anthropic’s Claude Code had shown Product teams that there was an enthusiastic market for terminal-based coding assistants, the other big AI players rushed to ship their own me-too offerings. The one I most eagerly anticipated was Google’s CLI, which they ended up calling “Gemini CLI”, after their series of frontier models of the same name.

The Gemini Pro models, most recently 2.5 Pro, are widely hyped on Twitter and YouTube. They have something like a 1M-token context window and some impressive multimodal capabilities. LLM grifters, er, influencers, are constantly gushing about the galaxy-brain capabilities of this model. It’s sometimes hard to sift through the slop and figure out who is actually smart enough that a model out-thinking them on some task is a compelling endorsement. But the Gemini models are not a scam; they really do seem to perform well on programming-related tasks, at least in my limited testing. Back when I was still using Cursor, sometimes if it shat the bed running Claude Sonnet I’d switch over to Gemini 2.5 Pro and get a much smarter result, so I took it for granted that the Gemini CLI tool was going to be similarly competitive.

As an added bonus, my company has some generous GCP credits, so Gemini CLI is effectively free for me, unlike Claude Code which I have to pay for. So all Gemini CLI needed to do was be about as good as Claude Code, and I’d switch to it.

My God in heaven, it was not even close. Gemini CLI sucks. The UX sucks; it lacks the elegant but text-native polish of Claude Code. But worse than that, the performance on agentic coding tasks is…I think in this case it’s appropriate to use the word “retarded”. Just like I am amazed that Claude Code gets better performance from the same models that Cursor is using under the covers, I’m astonished that Gemini CLI is able to make Gemini 2.5 Pro suck as much as it does.

I immediately configured it to use gemini-2.5-pro, and gave it some simple Python and Rust tasks. Not even ones that Claude Code struggled with, just whatever I was working on at the time. It made stupid decisions, failed to make use of available information, simply forgot or ignored guidance in the prompt, and went in circles. It was utterly useless.
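
For reference, pointing the tool at the Pro model is just a command-line switch; if memory serves, the invocation was something like this, though the flag may well have changed by the time you read it:

    # run Gemini CLI with the Pro model pinned explicitly, rather than whatever default it picks
    gemini -m gemini-2.5-pro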

Sometimes if the other models I have access to get stuck on something, I’ll try giving it to Gemini 2.5 Pro, but using the Vertex AI web interface. It typically gives me some useful output, even if it’s not able to solve the problem itself. So something about the way the CLI wrapper prompts or invokes Gemini seems to lobotomize it.

Just an epic fail on Google’s part. My friends who are also eager early adopters of LLM assistants, some professional SWEs and some closer to a PM level of sophistication, have all reported the same results. It’s actually impressive how a seemingly-capable model can be made to be so stupid given the right scaffolding.

Jules

Jules is another Google AI project, this one an AI coding agent that runs on its own given a GitHub repo and a task description. Behind the scenes, Google spins up a VM, checks out the repo, and sets an LLM agent loose in that sandbox environment, where it can work as long as it needs to in order to accomplish the task.

This is in the same product category as OpenAI Codex (no, not that Codex), the widely hyped Devin, one of the dozens of different products sold under the “Copilot” brand, and approximately millions of other frantic me-too AI startup cash grabs.

It’s clear to me that, as a category, this is going to be a big part of the future of LLM coding assistants. There are a lot of advantages to this approach. If the tooling is good enough, agents could take a crack at dozens or hundreds of issues from the backlog, particularly chores like updating dependencies and making minor text changes, then produce PRs for human engineers to review. I’ve heard of teams inside OpenAI (meaning, teams with effectively infinite budget for AI spend) spawning 10 or more instances of these agents for a single task, then reviewing all of the solutions and picking the best one. Even if a problem is too hard for an agent to one-shot, you could leave PR comments just like you would with a human colleague, and maybe it’ll get it right after a few tries.

That’s how I envision these tools working someday. Perhaps there are tools that work like that today. However Jules is not such a tool.

Jules feels to me as if it were built by people who had no prior experience building software for a living. Or at least, no experience with the typical software lifecycles around backlog items, branches, pull requests, and eventual merges. If I’m being more generous, it feels like it was built by people whose bosses were anxious about getting their promotions in an era at Google where “AI” is the answer and no one gives a fuck what the question was, and who noticed that no exec had yet laid claim to building a background autonomous LLM programming agent, so they decided that this crude demo whipped up over a weekend to show the basic idea should be shipped to production without any regard to whether or not it fucking worked. It is an embarrassment, one which makes me feel actual pity for whoever works in the utterly dysfunctional org that allowed this to ship.

I gave it very simple tasks, like updating a Node.js dependency and using a new field added in the updated version of that dependency. It made the code change easily enough, so I assume it’s using Gemini 2.5 Pro without whatever makes Gemini CLI suck so much. But everything about the interaction sucked:

The “agent runs a branch in a sandbox and ships a PR” pattern is the future. Jules is not. Even if they somehow fix the leadership failure that led to this shipping, knowing Google, by the time it starts to get good they’ll kill it off.

Tools I’m Still Using Daily

These were all in the year-end tool review; although my use of them has evolved somewhat since then, they’re still daily drivers.

Claude Desktop (Max plan)

If I need an LLM to do something for me other than writing or debugging code, I first reach for Claude Sonnet 4 or Opus 4.1 in Claude Desktop.

Here is a selection of prompts from my actual Claude Desktop history:

My preference for the Anthropic models for these use cases stems from the fact that I already use them for my LLM coding needs in Claude Code, and my Max subscription gives me very generous limits in the Desktop app as well. I’ve never hit a usage limit; I do, however, hit outages much more often than I’d like.

ChatGPT Desktop (Plus plan)

Though the Anthropic models are still my default go-to when I want an LLM to do something for me or answer a question, I keep a paid ChatGPT subscription around for a few reasons:

Here are some examples of prompts I’ve used with the various OpenAI models recently:

Perplexity (Pro plan)

I would estimate that about 80% of the searches that I would once have done with Kagi (for which I also maintain a paid subscription), I now do with Perplexity, either Pro or Research. As the Internet descends further and further into the abyss of AI slop and weaponized SEO hacks, a tool that sorts through search results and gets to the information I actually want is well worth the $20/mo. Unfortunately, all of the LLM caveats apply. Not only does it outright hallucinate from time to time, but it’s not the best judge of character, and will uncritically regurgitate claims from whatever pages match the search terms. It’s not unusual to have to fall back to Kagi for searches where Perplexity gets…well, perplexed.

The Research feature is all of that, but more so. It’s incredibly valuable for finding sources of information that I wouldn’t have found myself without exhaustive searching, but it’s not at all good at separating out the bullshit from what’s real, and it often gets confused by competing claims in search results. Often, the only valuable output from a Research run is just the list of links it’s drawing on, which I can then read myself to get the information I needed.

I would share a few examples from recent Perplexity activity, but the Perplexity Electron app is user-hostile and prevents me from selecting and copying text from a Perplexity session. All I can do is use the Share feature to generate a link to it. The cynic in me suspects that some sociopath in Product had some incentive to drive engagement on shared Perplexity sessions and realized making those links the only way to share information would tweak the stats. Damn you!

As much value as I get from Perplexity the service, I would welcome a complete rewrite of the Perplexity Mac app. I hate it. The inability to copy-paste text is unforgivable, but it seems that whenever I Command-Tab over to it after it’s been out of focus for a few hours, I get the spinning volleyball of death and force quit rather than wait for it to right itself. This is on a 2024 M4 MBP with 48GB of RAM where absolutely nothing else, including DaVinci Resolve Studio, lags at all, so it takes a special kind of idiot to suck wind on this system.

That said, if there is a credible competitor to Perplexity (other than the web search and research features in the other LLM apps that I use), I’m not aware of it. I hope to see more competition in this space, as I don’t think Perplexity is particularly great at what it does; the mere fact that it works at all has value, but I’m sure it can be done better.

Predictions for H2 2025