Mid Summer 2026 GenAI Tooling Review

anelson June 25, 2026 #genai

It’s hard to believe that it’s been more than ten months since I posted my 2025 summer GenAI tool review. I guess I’ve been too busy shipping loudly and performatively to take a breath and write down what’s going on. I fear if I wait any longer I won’t even remember how primitive life was way back in…late summer 2025.

The headline update is that I still lean heavily on Claude Code and Anthropic’s models, mostly Opus 4.8 these days. But that makes it sound like not much has changed; Claude Code has completely changed in the last ten months, and the new models are also quite a bit more capable.

Changes in how I use tools

Reasoning about reasoning effort

I don’t recall what reasoning effort tunables were in Claude Code when I wrote up my last review, but the latest Opus 4.8 model definitely has different reasoning effort options than I was using last year.

I often set the model to xhigh reasoning effort and then leave it there. I have the Claude Max $200/mo subscription and have literally never exhausted my quota, and that’s running multiple agents in parallel all day every day. xhigh is much slower, so I really ought to make more of an effort to adjust the effort depending on the task. As it happens, I crank it up to xhigh for some challenging planning activity, and inevitably forget to drop it back down. On the other hand, it’s not like I’m sitting at the terminal waiting patiently for the next turn; I kick off some work and switch over to another terminal where some other work is ongoing.

`/goal`-setting

I’ve also started to look for ways to use the /goal skill built into Claude Code. One has to be careful with this, as the model is prone to reward-hacking its way to the goal; it’s important to set unambiguous goals that the model can’t hack around. But with this in mind, I’ve been able to get some pretty cool results, such as:

Write a Python script to extract the password hash from a particular notebook vendor’s BIOS, and reverse-engineer the BIOS machine code to figure out the hashing algorithm, and invoke hashcat to crack the password.

I had a laptop that I bought cheap because it was BIOS locked, but still able to SecureBoot a Linux LiveCD. I booted Fedora 44, enabled SSH, pointed Claude Code at the system over SSH, and had it figure out how to dump the BIOS memory. This I had to do interactively because it required rebooting with some different kernel command line options, but once I had the dump and BIOS code extracted to my local workstation I just let it go to town with /goal to crack the password. It did so, and I was able to unlock the BIOS, and now I have a fully working laptop.
Figure out the right qemu command line incantation to get a particular set of Windows QCOW2 images to boot in QEMU.

For reasons that aren’t important, I had dumped snapshots of some Windows systems on a cloud hypervisor and converted them to QCOW2, and then needed to boot them using QEMU on my local KVM hypervisor, far enough to get the QEMU Guest Agent installed and then drive some automated testing on those VMs. None of them wanted to boot, owing to the differences in hypervisor and crash-consistent nature of the snapshots. Each one had some different nuance that prevented it from working. I left a /goal running that took almost the limit of 6 hours, at the end of which I had some Python code that would reliably go from an unmodified original QCOW2 to a working Windows VM with QGA enabled.
Migrate a complex data collection pipeline from an older RHEL to the latest Ubuntu 26.04 image.

This doesn’t sound that impressive, but it’s a good example of the kind of quality-of-life improvement that I couldn’t justify spending a few hours or a day doing myself, but is well worth lazily firing off a /goal for the clanker to chew on. None of the migration tasks were hard, but they were tedious and error-prone, involving many repeated runs of the pipeline until all of the different packages and config options and kernel nuances were worked out. At the end of this I had a big diff of changes sensitive enough that it would be irresponsible to vibe-ship it without review; however, the effort of reviewing that diff and correcting a few slop-isms was much less than doing the migration myself.

Skill issues

I’ve known about the concept of skills (not to be confused with plugins or agents!) in Claude Code since that was first released, but I didn’t feel a need to use them before. It was always enough to update CLAUDE.md every time the agent screwed something up. But as the models get more capable and I want to wire up the harness to more systems, I found making custom skills to be a great help.

I have probably made a dozen skills at this point, but I’ll point out a couple that have added a ton of value this year:

/pipeline-testing is a skill that I wrote on a specific repo in Elastio for a product that I’m developing there. This is a complex appliance that integrates with complex storage systems, which is a bit of a pain to build and deploy into staging. So I started a new Claude Code session, and talked the clanker through what it needed to do, as if I was explaining it to a junior. Once it had successfully done the deployment and run a few of my usual semi-manual tests, I prompted it to capture that all in a skill. It produced the usual clanker vomit initially, which I edited to remove the verbose crap and leave just the relevant details, and explicitly state some invariants and constraints that it missed.

Now, when I’ve made a change and tested it myself to my satisfaction, I lazily kick off a /pipeline-testing and go switch to another session. When I come back in an hour or so, I have the results of some additional thorough testing that I otherwise would have had to do myself. It is surprisingly fastidious about user experience as well as the more obvious correctness, and gives me a lot more confidence in the code I then push to a PR.

Obviously, it would be irresponsible to just vibe-test shit and then ship it. But it’s quite responsible to test it yourself AND THEN vibe-test it for additional coverage, and then ship it to a QA team that also tests it, and then ship it. Anything that the clanker finds is stuff that otherwise would have slipped into a release for QA, where it would be more expensive to find and fix. This is a no-brainer.
/grafana-telem was created under similar conditions as /pipeline-testing. It instructs the clanker how to pull logs and metrics from our telemetry systems for a given prod deployment of our product, and also lays out some common failure modes. When I get a problem report or when I notice something wrong myself, I fire off /grafana-telem to start with, before I do any analysis of my own (and, as always, switch over to some other session to keep the dopamine flowing while the clanker cogitates). I won’t say that the model always one-shots the issue (although it does often enough), but it always grinds through a bunch of telemetry that I didn’t have to squint at myself, leaving me with only the anomalous bits to look into. Even better, once this is in the context, I can immediately have the clanker work with me on a plan to fix the issue, add more telemetry, or whatever it is that needs to be done to resolve the problem.

Since skills are basically just prompts but with some structured front-matter and a principled way to embed deterministic scripts, they should in theory work with any harness/model combination that has tool calling. But, annoyingly, Claude Code looks for skills in a Claude-specific place, which makes using my skills from other agents a small hassle. I could use symlinks but these are under source control and symlinks aren’t a great solution there. Maybe by now Claude Code has a config param that I can use to make it look in the more standard places for skill files, but when I last looked into it that wasn’t the case. This is a user-hostile decision on Anthropic’s part that I must admit has the effect of making me reach for Claude Code whenever I need to use one of my skills; you devious bastards!
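To make "prompts with some structured front-matter" concrete, here is a minimal sketch of splitting a skill file into its header and its prompt body. It assumes the header is flat `key: value` lines between two `---` lines; the skill text is a made-up example (the description and steps are hypothetical), and anything beyond flat keys would want a real YAML parser.

```python
def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (front_matter, body).

    Simplifying assumption: the front matter is flat `key: value` lines
    between two `---` lines. Nested YAML is deliberately not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter: the whole file is the prompt
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing fence is the prompt body.
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # unterminated header: treat as a plain prompt


# Hypothetical skill file, for illustration only.
EXAMPLE_SKILL = """\
---
name: pipeline-testing
description: Deploy the appliance to staging and run the semi-manual tests.
---
1. Build the appliance image.
2. Deploy it to staging.
"""

meta, body = split_skill(EXAMPLE_SKILL)
print(meta["name"])
print(body)
```

Nothing in the file format itself is specific to one harness, which is why the lookup path is the only real obstacle to sharing skills between agents.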

YOLO

By default now when launching any agentic coding harness, I disable all of the safety checks and sandboxing. Clutch your pearls if you must, but if you’re really reviewing each operation your agent wants to do, in mid-2026, you are leaving a huge amount of productivity on the table. And for what? The safety crowd is, IMHO, way over-indexed on the elimination of risk in AI tooling, when the focus should actually be on managing risk just like we do with infosec today. After all, it’s incredibly insecure to let users use their computers, especially if those computers are connected to each other, and yet we do that because THAT IS THE WHOLE FUCKING POINT OF THEM! Likewise, agents need agency.

I’m not saying you should run OpenClaw with all of your private creds (I wouldn’t run it at all in fact, but that’s another post), but if the only thing protecting you from being p0wned or having your prod DB deleted is the trust and safety mommies, I’m afraid you’re already doing it wrong. You can segregate sensitive creds, require human confirmation of sensitive operations like using your SSH key to auth somewhere, and run the agent as an unprivileged user without passwordless sudo access and get a lot of protection from actual threats without giving up all of the agency that makes the agent so wonderfully agentic.

In my case, I try to think rationally about my threat model and act accordingly. I won’t go into detail here about my setup in this regard; I just want to capture the fact that --dangerously-skip-permissions (Claude Code) and --dangerously-bypass-approvals-and-sandbox (Codex) are enabled in all of my agentic coding sessions, and I’m still here to talk about it.

Worktrees and tmux

I first learned about the worktree feature in git whilst reading some Anthropic docs in the early days of Claude Code. I played with it but was immediately turned off by the fact that a given branch can only be checked out in a single worktree. I didn’t get the value of doing this over a separate checkout. But I have seen the light.

Claude Code has two command line options that together are a real boon for spastic agentic multi-taskers such as myself: --tmux and --worktree. If I’m about to work on an issue that I know is going to be self-contained and probably not take that long, I’ll go to my main tmux AI shell window for that project, and run something like:

claude --dangerously-skip-permissions --worktree 269-tokio-runtime-monitoring --tmux &

That will do two things:

Creates a git branch worktree-269-tokio-runtime-monitoring and checks out that branch in a new git worktree in .claude/worktrees/
Creates a new tmux session called something like $repo_worktree-269-tokio-runtime-monitoring with a new claude instance running from the new worktree’s directory.

Being a separate worktree and branch, this is isolated from the main checkout of the project (which I typically keep on master). Each tmux session/worktree is dedicated to some specific task. When the work is done I just tmux kill-session the session, and periodically purge the old worktrees (each one has a separate cargo target/ dir so they do need to be pruned from time to time).
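The periodic purge can be scripted. The sketch below parses the output of `git worktree list --porcelain` and builds a `git worktree remove` command for each worktree under `.claude/worktrees/`. The sample listing and its paths are made up, and the script only prints commands rather than running them, so treat it as a starting point and not a finished tool.

```python
import subprocess


def parse_worktrees(porcelain: str) -> list[dict[str, str]]:
    """Parse `git worktree list --porcelain` output into one dict per worktree."""
    worktrees = []
    # Records are separated by blank lines; each line is "key value" or a bare key.
    for record in porcelain.strip().split("\n\n"):
        entry: dict[str, str] = {}
        for line in record.splitlines():
            key, _, value = line.partition(" ")
            entry[key] = value
        if "worktree" in entry:
            worktrees.append(entry)
    return worktrees


def removal_commands(worktrees, marker: str = ".claude/worktrees/"):
    """Build removal commands for agent-created worktrees only."""
    return [
        ["git", "worktree", "remove", wt["worktree"]]
        for wt in worktrees
        if marker in wt["worktree"]
    ]


def list_worktrees() -> str:
    """Ask git for the real listing (must be run inside a repository)."""
    result = subprocess.run(
        ["git", "worktree", "list", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up listing for illustration; use list_worktrees() for the real thing.
SAMPLE = """\
worktree /src/myrepo
HEAD 1111111111111111111111111111111111111111
branch refs/heads/master

worktree /src/myrepo/.claude/worktrees/269-tokio-runtime-monitoring
HEAD 2222222222222222222222222222222222222222
branch refs/heads/worktree-269-tokio-runtime-monitoring
"""

for cmd in removal_commands(parse_worktrees(SAMPLE)):
    print(" ".join(cmd))  # dry run: print instead of executing
```

Note that `git worktree remove` refuses to delete a worktree with uncommitted changes unless `--force` is passed, which is a useful safety net when an agent left work behind.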

This makes it trivial to multi-task to what is probably a pretty unhealthy extent.

As of this writing, Codex doesn’t have this feature and I have not found sufficient motivation to script it. That doesn’t mean I don’t ever use Codex in these worktree sessions though. Since they’re just tmux sessions, I can easily open another pane and launch codex, nvim, and a shell.
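For the curious, scripting the equivalent for another agent is mostly a matter of two commands: `git worktree add -b` and `tmux new-session`. The sketch below only builds those commands; the directory layout and session naming mirror what is described above but are assumptions, as are the function name and the `/src/myrepo` path.

```python
import shlex
from pathlib import Path


def worktree_session_commands(repo: Path, task: str, agent: str = "codex"):
    """Build the git and tmux commands for one isolated agent session.

    Assumed conventions (not confirmed for any particular tool):
    branch `worktree-<task>`, checkout under `.claude/worktrees/<task>`,
    and a tmux session named `<repo>_<branch>`.
    """
    branch = f"worktree-{task}"
    worktree_dir = repo / ".claude" / "worktrees" / task
    session = f"{repo.name}_{branch}"
    return [
        # Create the branch and check it out in a fresh worktree.
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(worktree_dir)],
        # Start a detached tmux session in that directory running the agent.
        ["tmux", "new-session", "-d", "-s", session, "-c", str(worktree_dir), agent],
    ]


for cmd in worktree_session_commands(Path("/src/myrepo"), "269-tokio-runtime-monitoring"):
    print(shlex.join(cmd))  # dry run: pass each list to subprocess.run to execute
```

Any flags the agent itself needs would have to be folded into the final tmux argument as a single shell command string.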

Tools I’ve added

Codex

Sometime around January of this year I started to play with OpenAI’s Claude Code competitor, Codex. This was motivated by a recommendation from a colleague who is the exact opposite of a breathless AI influencer on Twitter, but also by frustration with Claude Code quality. Claude Code is famously, unapologetically vibe-coded (using, naturally, Claude Code itself). This has allowed it to be built and extended very quickly, and given the revenue growth I very much doubt Anthropic investors would say that this was a mistake. However, it means that the quality is highly variable from one vibe-shipped release to another, with some persistent problems with rendering the TUI that continue to drive me mad to this day. I’ll probably write up my many thoughts on the inevitable failure mode of vibe-coding shipping products, but for now just suffice it to say that I was longing for a tool that could do what Claude Code does but also work properly.

Codex is also famously and performatively vibe-coded, but there are a couple of differences I noticed right away. First, being written in Rust instead of TypeScript, it immediately benefited from my goodwill towards Rust (to say nothing of the very robust compiler tooling that Rust offers). Second, whatever OpenAI’s vibe-coders prompted their models to do regarding terminal output was much smarter than what Anthropic did, which was apparently to make a terminal renderer for React (LOLWUT!?). There are still glitches, I still hesitate a bit when I pull down an update, but overall Codex is more stable (and also less ravenously hungry for memory).

There are online religious wars over which model has better vibes, and they are as shallow and pointless and engagement-farmy as you probably imagine. I try to ignore AI influencer Twitter (harder than it sounds), but I do my own vibe-checks with the models and harnesses. I’ve had cases where one did well and another did poorly, but I consider them roughly equivalent in terms of capability.

Pretty often I’ll run both Claude Code and Codex on the same task and have each of them make a plan. I pick the plan I like better, feed it to the other agent for critique, feed that critique back to the agent that wrote the plan, and repeat until the feedback becomes trivial or useless. Only then do I engage with the plan meaningfully myself, by which time most of the stupidity has been filtered out.
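The shape of that plan ping-pong is simple enough to sketch as a loop. In the sketch below the two agents are plain callables that take a prompt and return text, and `is_trivial` stands in for the human judgment call about when the feedback has stopped being useful; all of these names are hypothetical, and the fake agents exist only to make the example runnable.

```python
from typing import Callable

Agent = Callable[[str], str]  # takes a prompt, returns the agent's reply


def refine_plan(
    plan: str,
    author: Agent,
    critic: Agent,
    is_trivial: Callable[[str], bool],
    max_rounds: int = 5,
) -> str:
    """Alternate critique and revision until the critique is trivial."""
    for _ in range(max_rounds):
        feedback = critic(f"Critique this plan:\n\n{plan}")
        if is_trivial(feedback):
            break  # nothing substantive left to fix
        plan = author(
            "Revise the plan below to address the feedback.\n\n"
            f"Plan:\n{plan}\n\nFeedback:\n{feedback}"
        )
    return plan


# Fake agents for illustration: the critic objects once, then approves.
feedback_queue = ["Add a rollback step.", "LGTM"]


def fake_critic(prompt: str) -> str:
    return feedback_queue.pop(0)


def fake_author(prompt: str) -> str:
    return "plan v2 (with rollback)"


final = refine_plan("plan v1", fake_author, fake_critic, lambda fb: fb.strip() == "LGTM")
print(final)  # plan v2 (with rollback)
```

The `max_rounds` cap is there because two agents left to critique each other have no natural incentive to stop.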

I should note that my company has some startup credits with OpenAI and I’m on the $200/mo plan, so for the moment I burn tokens with reckless abandon. I expect future me will read this and weep, but for now there is no task too trivial to merit throwing an agentic loop at it at least once.

Tools I’m Still Using Daily

Claude Desktop and Mobile (Max plan)

This is largely unchanged from my last review, although the go-to model is now Opus 4.8.

The new feature “Cowork” in Claude Desktop has come in handy a few times. I installed the Claude extension in Brave, which lets Cowork actually drive the browser instead of just issuing HTTP requests itself (which often get blocked, or fail because the requested content requires JavaScript). I don’t use Cowork daily (except for the scheduled tasks) but when I need it, it’s a useful feature. I think this kind of agentic capability is the future of these kinds of tools, and I expect that they will be rolled out more widely this year.

Here is a sample of stuff I’ve used Cowork for:

Using Cowork’s scheduled tasks feature, I have set up an automatic daily task that opens a particular eBay saved search and goes through the results putting them in an Excel spreadsheet for me to review at my leisure. This saved search is part of my ongoing search for some particular surplus computing devices that require some tedious manual filtering to separate from the abundant chaff that mostly obscures the models that I actually want.
Using the Slack connector, I was able to use this prompt to extract a ton of useful context from a Slack conversation that I wanted to store in Markdown:

I want this entire conversation in slack: https://elastio.slack.com/archives/C092WLZKCAW/p1781206874798799 downloaded into a single markdown file showing all of the messages in the thread, including who sent each one.
Using Chrome to navigate the clumsy eBay UI to make a spreadsheet of all of my eBay payouts, and breaking down the payout amount per item.

Using Chrome, go to https://www.ebay.com/mes/transactionlist?filter=transactionType:{PAYOUT}&pillFilterId=PAYOUT_FILTER and prepare an Excel spreadsheet with a list of all of my payouts. The challenge is that I want to calculate which payouts include revenue from which listings that I sold, so I want the analysis broken down such that it lists every item that I sold, how much I sold the item for, and how much I was paid out net of commission, shipping, and whatever else reduces the payouts. If you can get it, I’d prefer both the title of the listing, and the link to view the sale, to help me understand which item is which.

I would never have taken the time to do this myself, but tasking a clanker to do it made it much easier to account for the revenue for some sold items relative to their input costs.
Examine a bunch of photos downloaded from my phone of several different laptops and tablets that I bought used, grouping multiple photos into a single item and reading out the serial number from either screenshots of the BIOS or photos of the serial number sticker on each unit.

This was a challenging task. I won’t share the prompt here: first, because it’s very long and task-specific, and second, because it was not a one-shot prompt but rather required quite a bit of back and forth and “constructive” criticism on my part to get it to work. But the scale of the toil to do this task myself was so massive that I was willing to put this comparatively minimal effort in, with the result being many hours of my own time saved on a boring and low-value slog.

I actually do not use the various connectors to things like Office that Anthropic are pushing, because I don’t want that level of integration yet. If I’m authoring a doc it’s almost always Markdown anyway. I seem to be the only one left in my company who isn’t lazily prompting Claude to crank out AI slop that is then sent unreviewed to coworkers, customers, and partners. I will die on this hill. I hate low-effort AI slop documents, and I call them out whenever I see them. Our customers and partners at the very least deserve documents that reflect reality and have been reviewed by a competent human.

ChatGPT Desktop and Mobile (Pro plan)

I think in the last review I was on the Plus plan ($20/mo) but now I’ve switched to my company’s account and I’m on the $200/mo Pro plan. I do that mainly for Codex, but the desktop and mobile apps are included so I may as well use them.

I find that the ChatGPT app and OpenAI models are more to my liking when it comes to researching something or going back and forth on an idea. If I’m actually implementing something in code or even just investigating something that benefits from shell access, I’ll use Claude Code or Codex, but for higher level stuff I will use the Desktop and Mobile apps. A few things I’ve done with ChatGPT specifically recently:

Go through the research and influencer slop regarding how to tune the “voice” of SOTA models, by which I mean how to prompt away the maddening tics that SOTA models for some reason seem biased towards. I’m talking about emojis, bulleted lists with bold prefixes, “delve”, “seam”, “load-bearing”, “X not Y”, etc. The prompt itself was very long and the conversation involved a lot of back and forth, but suffice it to say that it was illuminating, and inspired the VOICE.md file that I use now.
What is the maximum for a gift between immediate family members under federal tax code before the gift is taxable?
I got this screenshot from the console on a Linux VM that was hung. at some point while this VM was running, before it hung, I updated my software including a new version of the referenced bdamserver, that is supposed to have a fix for the hung task. i am trying to tell if that hung task message is from before or after I applied that fix. how do I translate the timestamps on these lines into an actual date/time?

(With a screenshot attached)

I could have easily done this in Codex, except that I run all of my agentic harnesses on a remote headless Linux server in a data center, and I got this screenshot from a colleague over Slack. The path of least resistance was to fire up ChatGPT and paste the screenshot there, so that’s what I did. It read the contents no problem and helped me reason about them.
Dictating text. I hate HATE HATE the voice chat mode in ChatGPT, because all of the voices are so maddeningly patronizing and chirpy, but I really like the speech-to-text (STT) feature. If I need to capture a bunch of text, usually as input to an LLM, it’s much faster for me to dictate it than to type it. And I find if I prompt ChatGPT first with some context, and call out product names and which words should be fixed width, before I dump the text on it, it can clean it up and put it into a nice Markdown format without altering the semantic content in any way.

One big win from this workflow: in one complex project with a lot of external context, I keep a docs/braindumps directory. Whenever something comes up that the agent stumbles on for lack of some context, I quickly drop a braindump file there which I author using this STT workflow. I put that file under source control so I can point agents at it when they need that context (and of course they can also find it themselves if they grep for key concepts). Crucially, this isn’t AI slop text, it’s literally my own words, dictated and transcribed, so it has a high signal-to-noise ratio.

There are dedicated STT tools (I have used and paid for MacWhisper in the past), but now that the STT capability in the SOTA models is good enough, I much prefer doing this within ChatGPT. That way I’m also leveraging the power of the SOTA model to transform the raw text of my transcript. Sure, tools like MacWhisper can do that too if you give them an OpenAI API key and a prompt, but why would I bother with that when it’s built in and part of my paid ChatGPT plan already?
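Going back to the hung-VM question in the list above: the bracketed number at the start of a kernel log line is seconds since boot, so turning it into a date is just boot time plus that offset. Here is a minimal sketch, assuming the boot time is known from somewhere else (the hypervisor, or an earlier log); the boot time, PID, and log line below are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Kernel log lines start with "[  seconds.microseconds]" counted from boot.
_KERNEL_TS = re.compile(r"^\[\s*(\d+\.\d+)\]")


def kernel_line_time(line: str, boot_time: datetime) -> datetime:
    """Convert a kernel log line's timestamp into a wall-clock datetime."""
    match = _KERNEL_TS.match(line)
    if match is None:
        raise ValueError(f"no kernel timestamp in: {line!r}")
    return boot_time + timedelta(seconds=float(match.group(1)))


# Made-up values: a boot time and a hung-task message one hour after boot.
boot = datetime(2026, 6, 1, 12, 0, 0)
line = "[ 3600.500000] INFO: task bdamserver:1234 blocked for more than 120 seconds."
print(kernel_line_time(line, boot))  # 2026-06-01 13:00:00.500000
```

The result is approximate: the kernel's clock does not advance while a machine is suspended or a VM is paused, so the computed time can lag the real one.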

Tools I’ve discarded

Perplexity

It’s hard to believe that just ten months ago, I listed Perplexity under tools that I’m still using daily. It must have been just shortly after that when I dropped it. By now, both Claude and ChatGPT mobile and desktop applications competently search the Internet to provide grounded answers to prompts. Plus, I’m a paid Kagi subscriber so I can use their AI quick answer feature which doesn’t suck either. Perplexity now feels to me like MSN or Excite; it’s like a thing that my grandparents used in the dawn of the consumer Internet and still use because they don’t know that there are much better options.

Apparently Perplexity is still a going concern, much to my surprise. I see no value in it at this point.

Predictions for H2 2026

Re-reading my predictions from the last review, I actually don’t disagree with many of them now. So first I’ll repeat them here before I get to new predictions.

Background coding agents will hit mainstream workflows. One vendor will ship a credible “branch-per-issue” loop that knows about existing CI/CD infrastructure; the rest will copy it. Jules will continue to suck until Google kills it.

To my surprise, this is only now becoming a thing. Claude Tag was announced a few days ago, and for a while there’s been Claude Code in the desktop app that kind of does this. I know that Big Tech have their own bespoke AI slop production tooling to make it much easier and faster to run a coding agent on an issue, produce a PR, review the PR, and ship it to prod. But I’m not aware of anything that is generally available and reliably used in anger for this purpose. Perhaps I’m just not aware of it, but I’m pretty plugged into the AI software engineering zeitgeist.

The Claude Max plan will be enshittified somehow. Anthropic already announced more restrictive usage limits for Pro, adding not only limits within five-hour windows but also weekly usage limits. They claim that this won’t impact most users, and for now this is specific to Pro and not Max, but given the economics of hosted LLMs I think further arbitrary usage restrictions are inevitable.

This has started to happen. You already cannot use third-party agents like Pi with your Max subscription; you have to use credits to pay per-token for extra usage. So far the ChatGPT subscription does still allow this, although for how long I don’t know.

There’s also the kerfuffle around Fable, which was abruptly yanked after the US Commerce Department got bribed/tricked/prompted to take Anthropic’s claims of the danger of their frontier models at face value; I won’t count that as enshittification since it wasn’t Anthropic’s decision to pull that model.

I think there’s much more to come here, especially as Anthropic and OpenAI are both planning IPOs. They’re going to need to start revenuemaxxing sooner than later. We’ll see more wailing and gnashing of teeth about how absurdly expensive AI tooling is out of reach of the common man, mostly from preening retards who wouldn’t know a common man if he fixed their toilet.

Vibe-coded slop will get worse. The economic and psychological incentives for lazily turning off your brain and vibe-coding slop that you then ship are simply too strong, and the consequences of doing so still too diffuse and remote. Even if there were zero hype around LLMs and no frenzy of AI FOMO intoxicating every exec and investor on the planet, I think you’d still see widespread vibe-coding simply because it’s easier than having to think for yourself, and you can potentially work multiple jobs at once shipping slop at each one. Add in the AI mandates from execs and investors and I see no hope that software doesn’t get much worse very quickly. I don’t think we will ever recover, but I do hope at some point in the coming years the damage will be sufficiently undeniable and MBA programs will teach enough cautionary case studies that most execs and engineering managers will have learned not to tolerate vibe-coding slop-merchants (or, at the very least, the vibe-coding slop-merchants will be forced to make some minimal effort to hide the telltale signs of slop which might incidentally make them less sloppy).

This was hardly a bold prediction at the time, and it has already begun to come true, although we’re still far away from the zeitgeist acknowledging any negative consequences of all of this slop. Meta have already vibe-shipped an AI agent that let attackers recover the password on target accounts just by bullshitting the LLM, which I predict is not going to be the most retarded AI slop fail this year. In my own company, clumsily prompting the LLM is just about all anyone can be bothered to do now, and the proliferation of slop emails, slop documents, slop issues, and slop messages is accelerating.

LLMs will continue to not be capable of replacing software engineers, and will continue to get more useful in the hands of competent professionals. More leverage for pros; more mess from everyone else. This is a double-edged sword. I’m glad that there will still be professional opportunities for people like me in the future, but I also am pretty sure I will not enjoy those opportunities due to all of the aforementioned slop.

Nothing to add on this one. One didn’t have to be Nostradamus to see this coming.

LLM performance will gradually increase, which will continue to give me the ability to knock out quick tasks and explore random ideas that I would never have been able to justify spending my own time on before LLM coding agents were developed. This is the positive side to the negative take in the previous bullet. I’m very excited about what I can do when augmented by a SOTA LLM coding agent. My lament is entirely concerned with what others will do with the same tools.

Another big-brain bet that, to no one’s surprise, is coming true. Opus 4.8 and GPT 5.5 are both really good. Fable seemed a bit better for the brief time I had access to it. I don’t think AGI is around the corner, but I do expect incremental improvements in models (and most especially the deterministic tooling that we wrap the models in) to continue. Even if progress on LLMs stopped today, I think we have years of improvements we can discover just around how and when to use what agentic tooling mechanisms to get the most out of a given model.

Valuations on AI plays will continue to be insane, and I will continue to seethe with jealousy and resentment as I compare my equity stake and comp at Elastio with packages at even the dumbest me-too LLM wrapper startups.

Given the valuations being floated for the Anthropic and OpenAI IPOs, this prediction was, if anything, insufficiently ambitious.

Now my new predictions for the second half of this year:

The prevalence of AI psychosis will increase but will not peak this year. Execs, PMs, engineers, shameless shit-tier influencers, and of course investors will continue to uncritically adopt and promote AI for all the things, and when problems inevitably arise the solution will always be more AI for all the things. Staggering amounts of financial and human capital will be expended in the AI gold rush, out of proportion to the real, measurable, lasting value produced (that’s not counting insane .ai valuations, which for at least the rest of this year will continue to rise). Fear not though, the worst is still yet to come.
The new model will come out, and it will be ZOMG WOW THIS IS AGI WE’RE COOKED!!!!, and at the same time also dogshit and worse than GPT-4o and Y U SO EXPENSIVE, grifters will keep grifting, and the minority of engineers getting real value out of thoughtful use of GenAI tooling will be vastly outnumbered by the shambling slop horde, some of which will even be humans.
The US government will continue to behave as if it’s actually controlled by a CCP committee tasked with neutralizing American AI advantages via bone-headed policy moves. The EU will continue to set taxpayer ~~dollars~~ euros on fire in the delusional belief that sovereign SOTA AI is just a matter of a few more billion euros “invested” into the pockets of the European primes. Chinese models will mostly lag US SOTA model performance but cost vastly less, blocked from corporate US adoption by compliance and security concerns.
Some bright spark will realize that there’s money to be made selling a special US-only model that is explicitly subject to ITAR controls (all the best stuff is ITAR!). It will suck more than the SOTA models, but its customers won’t (be legally permitted to) care.
The importance of independence from any one lab, model, or provider will become increasingly obvious and drive more interest in agentic coding harnesses that aren’t bound to a specific model or vendor. Self-hosted open weight models will continue to be almost but not quite viable as an alternative to paying for inference.