Musings and misadventures of an expat entrepreneur

Mid Summer 2026 GenAI Tooling Review

anelson June 25, 2026 #genai

It’s hard to believe that it’s been more than ten months since I posted my 2025 summer GenAI tool review. I guess I’ve been too busy shipping loudly and performatively to take a breath and write down what’s going on. I fear if I wait any longer I won’t even remember how primitive life was way back in…late summer 2025.

The headline update is that I still lean heavily on Claude Code and Anthropic’s models, mostly Opus 4.8 these days. But that makes it sound like not much has changed; Claude Code has completely changed in the last ten months, and the new models are also quite a bit more capable.

Changes in how I use tools

Reasoning about reasoning effort

I don’t recall what reasoning effort tunables were in Claude Code when I wrote up my last review, but the latest Opus 4.8 model definitely has different reasoning effort options than I was using last year.

I often set the model to xhigh reasoning effort and then leave it there. I have the Claude Max $200/mo subscription and have literally never exhausted my quota, and that’s running multiple agents in parallel all day every day. xhigh is much slower, so I really ought to make more of an effort to adjust the effort depending on the task. As it happens, I crank it up to xhigh for some challenging planning activity, and inevitably forget to drop it back down. On the other hand, it’s not like I’m sitting at the terminal waiting patiently for the next turn; I kick off some work and switch over to another terminal where some other work is ongoing.

/goal-setting

I’ve also started to look for ways to use the /goal skill built into Claude Code. One has to be careful with this as it’s prone to the model reward-hacking to achieve the goal; it’s important to set unambiguous goals that the model can’t hack around. But with this in mind, I’ve been able to get some pretty cool results. Such as:

Skill issues

I’ve known about the concept of skills (not to be confused with plugins or agents!) in Claude Code since that was first released, but I didn’t feel a need to use them before. It was always enough to update CLAUDE.md every time the agent screwed something up. But as the models get more capable and I want to wire up the harness to more systems, I found making custom skills to be a great help.

I have probably made a dozen skills at this point, but I’ll point out a couple that have added a ton of value this year:

Since skills are basically just prompts but with some structured front-matter and a principled way to embed deterministic scripts, they should in theory work with any harness/model combination that has tool calling. But, annoyingly, Claude Code looks for skills in a Claude-specific place, which makes using my skills from other agents a small hassle. I could use symlinks but these are under source control and symlinks aren’t a great solution there. Maybe by now Claude Code has a config param that I can use to make it look in the more standard places for skill files, but when I last looked into it that wasn’t the case. This is a user-hostile decision on Anthropic’s part that I must admit has the effect of making me reach for Claude Code whenever I need to use one of my skills; you devious bastards!

YOLO

By default now when launching any agentic coding harness, I disable all of the safety checks and sandboxing. Clutch your pearls if you must, but if you’re really reviewing each operation your agent wants to do, in mid-2026, you are leaving a huge amount of productivity on the table. And for what? The safety crowd is, IMHO, way over-indexed on the elimination of risk in AI tooling, when the focus should actually be on managing risk just like we do with infosec today. After all, it’s incredibly insecure to let users use their computers, especially if those computers are connected to each other, and yet we do that because THAT IS THE WHOLE FUCKING POINT OF THEM! Likewise, agents need agency.

I’m not saying you should run OpenClaw with all of your private creds (I wouldn’t run it at all in fact, but that’s another post), but if the only thing protecting you from being p0wned or having your prod DB deleted is the trust and safety mommies, I’m afraid you’re already doing it wrong. You can segregate sensitive creds, require human confirmation of sensitive operations like using your SSH key to auth somewhere, and run the agent as an unprivileged user without passwordless sudo access and get a lot of protection from actual threats without giving up all of the agency that makes the agent so wonderfully agentic.

In my case, I try to think rationally about my threat model and act accordingly. I won’t go into detail here about my setup in this regard; I just want to capture the fact that --dangerously-skip-permissions (Claude Code) and --dangerously-bypass-approvals-and-sandbox (Codex) are enabled in all of my agentic coding sessions, and I’m still here to talk about it.

Worktrees and tmux

I first learned about the worktree feature in git whilst reading some Anthropic docs in the early days of Claude Code. I played with it but was immediately turned off by the fact that a given branch can only be checked out in a single worktree. I didn’t get the value of doing this over a separate checkout. But I have seen the light.

Claude Code has two command line options that together are a real boon for spastic agentic multi-taskers such as myself: --tmux and --worktree. If I’m about to work on an issue that I know is going to be self-contained and probably not take that long, I’ll go to my main tmux AI shell window for that project, and run something like:

claude --dangerously-skip-permissions --worktree 269-tokio-runtime-monitoring --tmux &

That will do two things:

Being a separate worktree and branch, this is isolated from the main checkout of the project (which I typically keep on master). Each tmux session/worktree is dedicated to some specific task. When the work is done I just tmux kill-session the session, and periodically purge the old worktrees (each one has a separate cargo target/ dir so they do need to be pruned from time to time).
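The periodic purge described above can be scripted with stock git commands. What follows is a minimal sketch rather than the actual setup used here: `linked_worktrees` and `report_target_sizes` are hypothetical helper names, and the sketch relies on git listing the main checkout first in `git worktree list --porcelain` output.

```shell
# Hypothetical helper (not part of git or Claude Code): read
# `git worktree list --porcelain` output on stdin and print the path of
# every linked worktree. git lists the main worktree first, so skip it.
linked_worktrees() {
  awk '/^worktree / { n++; if (n > 1) print substr($0, 10) }'
}

# Print the size of each linked worktree's cargo target/ dir, if it has one.
# Assumes worktree paths contain no newlines.
report_target_sizes() {
  git worktree list --porcelain | linked_worktrees | while IFS= read -r wt; do
    if [ -d "$wt/target" ]; then
      du -sh "$wt/target"
    fi
  done
}
```

Running `report_target_sizes` from inside the repository shows which worktrees are eating disk. From there, `git worktree remove <path>` deletes a finished worktree (it refuses if the worktree has untracked or modified files unless `--force` is given), and `git worktree prune` clears the leftover records for worktrees whose directories were deleted by hand.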

This makes it trivial to multi-task to what is probably a pretty unhealthy extent.

As of this writing, codex doesn’t have this feature and I have not found sufficient motivation to script it. That doesn’t mean I don’t ever use Codex in these worktree sessions though. Since they’re just tmux sessions, I can easily open another pane and launch codex, nvim, and a shell.

Tools I’ve added

Codex

Sometime around January of this year I started to play with OpenAI’s Claude Code competitor, Codex. This was motivated by a recommendation from a colleague who is the exact opposite of a breathless AI influencer on Twitter, but also by frustration with Claude Code quality. Claude Code is famously, unapologetically vibe-coded (using, naturally, Claude Code itself). This has allowed it to be built and extended very quickly, and given the revenue growth I very much doubt Anthropic investors would say that this was a mistake. However, it means that the quality is highly variable from one vibe-shipped release to another, with some persistent problems with rendering the TUI that continue to drive me mad to this day. I’ll probably write up my many thoughts on the inevitable failure mode of vibe-coding shipping products, but for now suffice it to say that I was longing for a tool that could do what Claude Code does but also work properly.

Codex is also famously and performatively vibe-coded, but there are a couple of differences I noticed right away. First, being written in Rust instead of TypeScript, it immediately benefited from my goodwill towards Rust (to say nothing of the very robust compiler tooling that Rust offers). Second, whatever OpenAI’s vibe-coders prompted their models to do regarding terminal output was much smarter than what Anthropic did, which was apparently to make a terminal renderer for React (LOLWUT!?). There are still glitches, I still hesitate a bit when I pull down an update, but overall Codex is more stable (and also less ravenously hungry for memory).

There are online religious wars over which model has better vibes, and they are as shallow and pointless and engagement-farmy as you probably imagine. I try to ignore AI influencer Twitter (harder than it sounds), but I do my own vibe-checks with the models and harnesses. I’ve had cases where one did well and another did poorly, but I consider them roughly equivalent in terms of capability.

Pretty often I’ll run both Claude Code and Codex on the same task and have them each make a plan. I pick the plan I like better and feed it to the other agent to critique, feed that critique back to the agent that made the plan, and repeat until the feedback becomes trivial or useless. Only then do I engage with the plan meaningfully myself, by which time most of the stupidity has been filtered out.
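That back-and-forth is a simple bounded loop, and it can be sketched as a shell function. This is an illustration under assumptions, not a built-in feature of either harness: `refine_plan`, the two callback commands, and the `LGTM` stop marker are all hypothetical, and in practice a human judges when the feedback has become trivial.

```shell
# Hypothetical sketch: bounce a plan between a critic and a reviser.
#   $1 = command that prints a revised plan given (plan, feedback)
#   $2 = command that prints a critique given (plan)
#   $3 = initial plan text
#   $4 = maximum number of rounds (default 5)
# The body is a subshell ( ... ) so its variables do not leak to the caller.
refine_plan() (
  revise=$1
  critique=$2
  plan=$3
  max_rounds=${4:-5}
  round=1
  while [ "$round" -le "$max_rounds" ]; do
    feedback=$("$critique" "$plan")
    # Assumed stop rule: empty output or a bare "LGTM" means the critic
    # has nothing substantive left to say.
    if [ -z "$feedback" ] || [ "$feedback" = "LGTM" ]; then
      break
    fi
    plan=$("$revise" "$plan" "$feedback")
    round=$((round + 1))
  done
  printf '%s\n' "$plan"
)
```

Each callback would wrap a non-interactive invocation of one of the agents; the exact commands are left out because they depend on the harness and its version. The round cap is there so the loop terminates even if the critic never returns the stop marker.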

I should note that my company has some startup credits with OpenAI and I’m on the $200/mo plan, so for the moment I burn tokens with reckless abandon. I expect future me will read this and weep, but for now there is no task too trivial to merit throwing an agentic loop at it at least once.

Tools I’m Still Using Daily

Claude Desktop and Mobile (Max plan)

This is largely unchanged from my last review, although the go-to model is now Opus 4.8.

The new feature “Cowork” in Claude Desktop has come in handy a few times. I installed the Claude extension in Brave, which lets Cowork actually drive the browser instead of just issuing HTTP requests itself (which often get blocked, or fail because the requested content requires JavaScript). I don’t use Cowork daily (except for the scheduled tasks) but when I need it, it’s a useful feature. I think this kind of agentic capability is the future of these kinds of tools, and I expect that they will be rolled out more widely this year.

Here is a sample of stuff I’ve used Cowork for:

I actually do not use the various connectors to things like Office that Anthropic are pushing, because I don’t want that level of integration yet. If I’m authoring a doc it’s almost always Markdown anyway. I seem to be the only one left in my company who isn’t lazily prompting Claude to crank out AI slop that is then sent unreviewed to coworkers, customers, and partners. I will die on this hill. I hate low-effort AI slop documents, and I call them out whenever I see them. Our customers and partners at the very least deserve documents that reflect reality and have been reviewed by a competent human.

ChatGPT Desktop and Mobile (Pro plan)

I think in the last review I was on the Plus plan ($20/mo) but now I’ve switched to my company’s account and I’m on the $200/mo Pro plan. I do that mainly for Codex, but the desktop and mobile apps are included so I may as well use them.

I find that the ChatGPT app and OpenAI models are more to my liking when it comes to researching something or going back and forth on an idea. If I’m actually implementing something in code or even just investigating something that benefits from shell access, I’ll use Claude Code or Codex, but for higher level stuff I will use the Desktop and Mobile apps. A few things I’ve done with ChatGPT specifically recently:

Tools I’ve discarded

Perplexity

It’s hard to believe that just ten months ago, I listed Perplexity under tools that I’m still using daily. It must have been just shortly after that when I dropped it. By now, both Claude and ChatGPT mobile and desktop applications competently search the Internet to provide grounded answers to prompts. Plus, I’m a paid Kagi subscriber so I can use their AI quick answer feature which doesn’t suck either. Perplexity now feels to me like MSN or Excite; it’s like a thing that my grandparents used at the dawn of the consumer Internet and still use because they don’t know that there are much better options.

Apparently Perplexity is still a going concern, much to my surprise. I see no value in it at this point.

Predictions for H2 2026

Re-reading my predictions from the last review, I actually don’t disagree with many of them now. So first I’ll repeat them here before I get to new predictions.

To my surprise, this is only now becoming a thing. Claude Tag was announced a few days ago, and for a while there’s been Claude Code in the desktop app that kind of does this. I know that Big Tech have their own bespoke AI slop production tooling to make it much easier and faster to run a coding agent on an issue, produce a PR, review the PR, and ship it to prod. But I’m not aware of anything that is generally available and reliably used in anger for this purpose. Perhaps I’m just not aware of it, but I’m pretty plugged into the AI software engineering zeitgeist.

This has started to happen. You already cannot use third-party agents like Pi with your Max subscription; you have to use credits to pay per-token for extra usage. So far the ChatGPT subscription does still allow this, although for how long I don’t know.

There’s also the kerfuffle around Fable, which was abruptly yanked after the US Commerce Department got bribed/tricked/prompted to take Anthropic’s claims of the danger of their frontier models at face value; I won’t count that as enshittification since it wasn’t Anthropic’s decision to pull that model.

I think there’s much more to come here, especially as Anthropic and OpenAI are both planning IPOs. They’re going to need to start revenuemaxxing sooner than later. We’ll see more wailing and gnashing of teeth about how absurdly expensive AI tooling is out of reach of the common man, mostly from preening retards who wouldn’t know a common man if he fixed their toilet.

This was hardly a bold prediction at the time, and it has already begun to come true, although we’re still far away from the zeitgeist acknowledging any negative consequences of all of this slop. Meta have already vibe-shipped an AI agent that let attackers recover the password on target accounts just by bullshitting the LLM, which I predict is not going to be the most retarded AI slop fail this year. In my own company, clumsily prompting the LLM is just about all anyone can be bothered to do now, and the proliferation of slop emails, slop documents, slop issues, and slop messages is accelerating.

Nothing to add on this one. One didn’t have to be Nostradamus to see this coming.

Another big-brain bet that, to no one’s surprise, is coming true. Opus 4.8 and GPT 5.5 are both really good. Fable seemed a bit better for the brief time I had access to it. I don’t think AGI is around the corner, but I do expect incremental improvements in models (and most especially the deterministic tooling that we wrap the models in) to continue. Even if progress on LLMs stopped today, I think we have years of improvements we can discover just around how and when to use what agentic tooling mechanisms to get the most out of a given model.

Given the valuations being floated for the Anthropic and OpenAI IPOs, this prediction was, if anything, insufficiently ambitious.

Now my new predictions for the second half of this year: