Mid Summer 2026 GenAI Tooling Review
anelson June 25, 2026 #genai
It's hard to believe that it's been more than ten months since I posted my 2025 summer GenAI tool review. I guess I've been too busy shipping loudly and performatively to take a breath and write down what's going on. I fear if I wait any longer I won't even remember how primitive life was way back in...late summer 2025.
The headline update is that I still lean heavily on Claude Code and Anthropic's models, mostly Opus 4.8 these days. But that makes it sound like not much has changed; Claude Code has completely changed in the last ten months, and the new models are also quite a bit more capable.
Changes in how I use tools
Reasoning about reasoning effort
I don't recall what reasoning effort tunables were in Claude Code when I wrote up my last review, but the latest Opus 4.8 model definitely has different reasoning effort options than I was using last year.
I often set the model to xhigh reasoning effort and then leave it there. I have the Claude Max $200/mo subscription and have literally never exhausted my quota, and that's running multiple agents in parallel all day every day. xhigh is much slower, so I really ought to make more of an effort to adjust the effort depending on the task. As it happens, I crank it up to xhigh for some challenging planning activity, and inevitably forget to drop it back down. On the other hand, it's not like I'm sitting at the terminal waiting patiently for the next turn; I kick off some work and switch over to another terminal where some other work is ongoing.
/goal-setting
I've also started to look for ways to use the /goal skill built into Claude Code. One has to be careful with this as it's prone to the model reward-hacking to achieve the goal; it's important to set unambiguous goals that the model can't hack around. But with this in mind, I've been able to get some pretty cool results. Such as:
- Write a Python script to extract the password hash from a particular notebook vendor's BIOS, and reverse-engineer the BIOS machine code to figure out the hashing algorithm, and invoke hashcat to crack the password.
  I had a laptop that I bought cheap because it was BIOS locked, but still able to SecureBoot a Linux LiveCD. I booted Fedora 44, enabled SSH, pointed Claude Code at the system over SSH, and had it figure out how to dump the BIOS memory. This I had to do interactively because it required rebooting with some different kernel command line options, but once I had the dump and BIOS code extracted to my local workstation I just let it go to town with /goal to crack the password. It did so, and I was able to unlock the BIOS, and now I have a fully working laptop.
- Figure out the right qemu command line incantation to get a particular set of Windows QCOW2 images to boot in QEMU.
  For reasons that aren't important, I had dumped snapshots of some Windows systems on a cloud hypervisor and converted them to QCOW2, and then needed to boot them using QEMU on my local KVM hypervisor, far enough to get the QEMU Guest Agent installed and then drive some automated testing on those VMs. None of them wanted to boot, owing to the differences in hypervisor and crash-consistent nature of the snapshots. Each one had some different nuance that prevented it from working. I left a /goal running that took almost the limit of 6 hours, at the end of which I had some Python code that would reliably go from an unmodified original QCOW2 to a working Windows VM with QGA enabled.
- Migrate a complex data collection pipeline from an older RHEL to the latest Ubuntu 26.04 image.
  This doesn't sound that impressive, but it's a good example of the kind of quality-of-life improvement that I couldn't justify spending a few hours or a day doing myself, but is well worth lazily firing off a /goal for the clanker to chew on. None of the migration tasks were hard, but they were tedious and error prone, involving many repeated runs of the pipeline until all of the different packages and config options and kernel nuances were worked out. At the end of this I had a big diff of changes sensitive enough that it would be irresponsible to vibe-ship; however, the effort of reviewing that diff and correcting a few slop-isms was much less than doing the migration myself.
Skill issues
I've known about the concept of skills (not to be confused with plugins or agents!) in Claude Code since that was first released, but I didn't feel a need to use them before. It was always enough to update CLAUDE.md every time the agent screwed something up. But as the models get more capable and I want to wire up the harness to more systems, I found making custom skills to be a great help.
I have probably made a dozen skills at this point, but I'll point out a couple that have added a ton of value this year:
- /pipeline-testing is a skill that I wrote on a specific repo in Elastio for a product that I'm developing there. This is a complex appliance that integrates with complex storage systems, which is a bit of a pain to build and deploy into staging. So I started a new Claude Code session, and talked the clanker through what it needed to do, as if I was explaining it to a junior. Once it had successfully done the deployment and run a few of my usual semi-manual tests, I prompted it to capture that all in a skill. It produced the usual clanker vomit initially, which I edited to remove the verbose crap and leave just the relevant details, and explicitly state some invariants and constraints that it missed.
  Now, when I've made a change and tested it myself to my satisfaction, I lazily kick off a /pipeline-testing and go switch to another session. When I come back in an hour or so, I have the results of some additional thorough testing that I otherwise would have had to do myself. It is surprisingly fastidious about user experience as well as the more obvious correctness, and gives me a lot more confidence in the code I then push to a PR.
  Obviously, it would be irresponsible to just vibe-test shit and then ship it. But it's quite responsible to test it yourself AND THEN vibe-test it for additional coverage, and then ship it to a QA team that also tests it, and then ship it. Anything that the clanker finds is stuff that otherwise would have slipped into a release for QA, where it would be more expensive to find and fix. This is a no-brainer.
- /grafana-telem was created under similar conditions as /pipeline-testing. It instructs the clanker how to pull logs and metrics from our telemetry systems for a given prod deployment of our product, and also lays out some common failure modes. When I get a problem report or when I notice something wrong myself, I fire off /grafana-telem to start with, before I do any analysis of my own (and, as always, switch over to some other session to keep the dopamine flowing while the clanker cogitates). I won't say that the model always one-shots the issue (although it does often enough), but it always grinds through a bunch of telemetry that I didn't have to squint at myself, leaving me with only the anomalous bits to look into. Even better, once this is in the context, I can immediately have the clanker work with me on a plan to fix the issue, add more telemetry, or whatever it is that needs to be done to resolve the problem.
Since skills are basically just prompts but with some structured front-matter and a principled way to embed deterministic scripts, they should in theory work with any harness/model combination that has tool calling. But, annoyingly, Claude Code looks for skills in a Claude-specific place, which makes using my skills from other agents a small hassle. I could use symlinks but these are under source control and symlinks aren't a great solution there. Maybe by now Claude Code has a config param that I can use to make it look in the more standard places for skill files, but when I last looked into it that wasn't the case. This is a user-hostile decision on Anthropic's part that I must admit has the effect of making me reach for Claude Code whenever I need to use one of my skills; you devious bastards!
YOLO
By default now when launching any agentic coding harness, I disable all of the safety checks and sandboxing. Clutch your pearls if you must, but if you're really reviewing each operation your agent wants to do, in mid-2026, you are leaving a huge amount of productivity on the table. And for what? The safety crowd is, IMHO, way over-indexed on the elimination of risk in AI tooling, when the focus should actually be on managing risk just like we do with infosec today. After all, it's incredibly insecure to let users use their computers, especially if those computers are connected to each other, and yet we do that because THAT IS THE WHOLE FUCKING POINT OF THEM! Likewise, agents need agency.
I'm not saying you should run OpenClaw with all of your private creds (I wouldn't run it at all in fact, but that's another post), but if the only thing protecting you from being p0wned or having your prod DB deleted is the trust and safety mommies, I'm afraid you're already doing it wrong. You can segregate sensitive creds, require human confirmation of sensitive operations like using your SSH key to auth somewhere, and run the agent as an unprivileged user without passwordless sudo access and get a lot of protection from actual threats without giving up all of the agency that makes the agent so wonderfully agentic.
In my case, I try to think rationally about my threat model and act accordingly. I won't go into detail here about my setup in this regard; I just want to capture the fact that --dangerously-skip-permissions (Claude Code) and --dangerously-bypass-approvals-and-sandbox (Codex) are enabled in all of my agentic coding sessions, and I'm still here to talk about it.
Worktrees and tmux
I first learned about the worktree feature in git whilst reading some Anthropic docs in the early days of Claude Code. I played with it but was immediately turned off by the fact that a given branch can only be checked out in a single worktree. I didn't get the value of doing this over a separate checkout. But I have seen the light.
Claude Code has two command line options that together are a real boon for spastic agentic multi-taskers such as myself: --tmux and --worktree. If I'm about to work on an issue that I know is going to be self-contained and probably not take that long, I'll go to my main tmux AI shell window for that project, and run something like:
claude --dangerously-skip-permissions --worktree 269-tokio-runtime-monitoring --tmux &
That will do two things:
- Creates a git branch worktree-269-tokio-runtime-monitoring and checks out that branch in a new git worktree in .claude/worktrees/
- Creates a new tmux session called something like $repo_worktree-269-tokio-runtime-monitoring with a new claude instance running from the new worktree's directory.
Being a separate worktree and branch, this is isolated from the main checkout of the project (which I typically keep on master). Each tmux session/worktree is dedicated to some specific task. When the work is done I just tmux kill-session the session, and periodically purge the old worktrees (each one has a separate cargo target/ dir so they do need to be pruned from time to time).
This makes it trivial to multi-task to what is probably a pretty unhealthy extent.
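The periodic purge of old worktrees could be scripted roughly as follows. This is a minimal sketch, not the author's actual tooling: it assumes the worktrees live under .claude/worktrees/ as described above, and the function names (parse_worktrees, claude_worktrees, removal_commands, list_worktrees) are hypothetical. It only builds the removal commands rather than running them, so the list can be reviewed first.

```python
import subprocess

# Directory named in the post; everything else here is illustrative.
CLAUDE_WORKTREE_DIR = ".claude/worktrees/"


def parse_worktrees(porcelain: str) -> list[dict[str, str]]:
    """Parse `git worktree list --porcelain` output into one dict per worktree.

    Records are separated by blank lines; each line is "<key> <value>"
    (or a bare key such as "detached").
    """
    worktrees: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in porcelain.splitlines():
        if not line.strip():
            if current:
                worktrees.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        current[key] = value
    if current:
        worktrees.append(current)
    return worktrees


def claude_worktrees(worktrees: list[dict[str, str]]) -> list[str]:
    """Return paths of worktrees that sit under .claude/worktrees/."""
    return [
        w["worktree"]
        for w in worktrees
        if CLAUDE_WORKTREE_DIR in w.get("worktree", "")
    ]


def removal_commands(paths: list[str]) -> list[list[str]]:
    """Build (but do not run) one `git worktree remove` command per path."""
    return [["git", "worktree", "remove", path] for path in paths]


def list_worktrees(repo: str = ".") -> list[dict[str, str]]:
    """Ask git for the worktrees of the repository at `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "worktree", "list", "--porcelain"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_worktrees(result.stdout)
```

Calling removal_commands(claude_worktrees(list_worktrees())) from the main checkout yields the commands to review. Note that git refuses to remove a worktree with uncommitted changes unless --force is added, which is a useful safety net for sessions that were killed mid-task.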
As of this writing, codex doesn't have this feature and I have not found sufficient motivation to script it. That doesn't mean I don't ever use Codex in these worktree sessions though. Since they're just tmux sessions, I can easily open another pane and launch codex, nvim, and a shell.
Tools I've added
Codex
Sometime around January of this year I started to play with OpenAI's Claude Code competitor, Codex. This was motivated by a recommendation from a colleague who is the exact opposite of a breathless AI influencer on Twitter, but also by frustration with Claude Code quality. Claude Code is famously, unapologetically vibe-coded (using, naturally, Claude Code itself). This has allowed it to be built and extended very quickly, and given the revenue growth I very much doubt Anthropic investors would say that this was a mistake. However, it means that the quality is highly variable from one vibe-shipped release to another, with some persistent problems with rendering the TUI that continue to drive me mad to this day. I'll probably write up my many thoughts on the inevitable failure mode of vibe-coding shipping products, but for now just suffice it to say that I was longing for a tool that could do what Claude Code does but also work properly.
Codex is also famously and performatively vibe-coded, but there are a couple of differences I noticed right away. First, being written in Rust instead of TypeScript, it immediately benefited from my goodwill towards Rust (to say nothing of the very robust compiler tooling that Rust offers). Second, whatever OpenAI's vibe-coders prompted their models to do regarding terminal output was much smarter than what Anthropic did, which was apparently to make a terminal renderer for React (LOLWUT!?). There are still glitches, I still hesitate a bit when I pull down an update, but overall Codex is more stable (and also less ravenously hungry for memory).
There are online religious wars over which model has better vibes, and they are as shallow and pointless and engagement-farmy as you probably imagine. I try to ignore AI influencer Twitter (harder than it sounds), but I do my own vibe-checks with the models and harnesses. I've had cases where one did well and another did poorly, but I consider them roughly equivalent in terms of capability.
Pretty often I'll run both Claude Code and Codex on the same task, having them both make a plan, and then I pick the plan I like better and feed that plan into the other agent to critique the plan, feed the output of that back into the agent that made the plan, and repeat until the feedback becomes trivial or useless. Only then do I engage with the plan meaningfully myself, by which time most of the stupidity has been filtered out.
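The critique loop described above has a simple shape, sketched here in Python. This is an illustration under stated assumptions, not a real integration: planner and critic are hypothetical callables standing in for whatever invokes each agent, is_trivial stands in for the human judgment that the feedback has stopped being useful, and the prompt strings are made up. The initial step of having both agents plan and picking the better plan is left out.

```python
from typing import Callable

# Hypothetical stand-in for "send a prompt to an agent, get text back".
Agent = Callable[[str], str]


def refine_plan(
    task: str,
    planner: Agent,
    critic: Agent,
    is_trivial: Callable[[str], bool],
    max_rounds: int = 5,
) -> tuple[str, list[str]]:
    """Bounce a plan between two agents until the critique stops being useful.

    Returns the final plan and the list of critiques received along the way.
    """
    plan = planner(f"Make a plan for: {task}")
    critiques: list[str] = []
    for _ in range(max_rounds):
        feedback = critic(f"Critique this plan:\n{plan}")
        critiques.append(feedback)
        if is_trivial(feedback):
            # Nothing substantive left; hand the plan to the human.
            break
        plan = planner(
            "Revise the plan using this feedback:\n"
            f"{feedback}\n\nCurrent plan:\n{plan}"
        )
    return plan, critiques
```

The max_rounds cap is a design choice of the sketch: two models can trade nitpicks indefinitely, so a bound keeps the loop from burning tokens when the stopping test never fires.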
I should note that my company has some startup credits with OpenAI and I'm on the $200/mo plan, so for the moment I burn tokens with reckless abandon. I expect future me will read this and weep, but for now there is no task too trivial to merit throwing an agentic loop at it at least once.
Tools I'm Still Using Daily
Claude Desktop and Mobile (Max plan)
This is largely unchanged from my last review, although the go-to model is now Opus 4.8.
The new feature "Cowork" in Claude Desktop has come in handy a few times. I installed the Claude extension in Brave, which lets Cowork actually drive the browser instead of just issuing HTTP requests itself (which often get blocked or the content being requested requires JavaScript). I don't use Cowork daily (except for the scheduled tasks) but when I need it, it's a useful feature. I think this kind of agentic capability is the future of these kinds of tools, and I expect that they will be rolled out more widely this year.
Here is a sample of stuff I've used Cowork for:
- Using Cowork's scheduled tasks feature, I have set up an automatic daily task that opens a particular eBay saved search and goes through the results putting them in an Excel spreadsheet for me to review at my leisure. This saved search is part of my ongoing search for some particular surplus computing devices that require some tedious manual filtering to separate from the abundant chaff that mostly obscures the models that I actually want.
- Using the Slack connector, I was able to use this prompt to extract a ton of useful context from a Slack conversation that I wanted to store in Markdown:
  I want this entire conversation in slack: https://elastio.slack.com/archives/C092WLZKCAW/p1781206874798799 downloaded into a single markdown file showing all of the messages in the thread, including who sent each one.
- Using Chrome to navigate the clumsy eBay UI to make a spreadsheet of all of my eBay payouts, and breaking down the payout amount per item.
  Using Chrome, go to https://www.ebay.com/mes/transactionlist?filter=transactionType:{PAYOUT}&pillFilterId=PAYOUT_FILTER and prepare an Excel spreadsheet with a list of all of my payouts. The challenge is that I want to calculate which payouts include revenue from which listings that I sold, so I want the analysis broken down such that it lists every item that I sold, how much I told the item for, and how much I was paid out net of commission, shipping, and whatever else reduces the payouts. If you can get it, I'd prefer both the title of the listing, and the link to view the sale, so help me understand which item is which.
  I would never have taken the time to do this myself, but tasking a clanker to do it made it much easier to account for the revenue for some sold items relative to their input costs.
- Examine a bunch of photos downloaded from my phone of several different laptops and tablets that I bought used, grouping multiple photos into a single item and reading out the serial number from either screenshots of the BIOS or photos of the serial number sticker on each unit.
  This was a challenging task. I won't share the prompt here: first, because it's very long and task-specific and second, because it was not a one-shot prompt but rather required quite a bit of back and forth and "constructive" criticism on my part to get it to work. But the scale of the toil to do this task myself was so massive that I was willing to put this comparatively minimal effort in, with the result being many hours of my own time saved on a boring and low-value slog.
I actually do not use the various connectors to things like Office that Anthropic are pushing, because I don't want that level of integration yet. If I'm authoring a doc it's almost always Markdown anyway. I seem to be the only one left in my company who isn't lazily prompting Claude to crank out AI slop that is then sent unreviewed to coworkers, customers, and partners. I will die on this hill. I hate low-effort AI slop documents, and I call them out whenever I see them. Our customers and partners at the very least deserve documents that reflect reality and have been reviewed by a competent human.
ChatGPT Desktop and Mobile (Pro plan)
I think in the last review I was on the Plus plan ($20/mo) but now I've switched to my company's account and I'm on the $200/mo Pro plan. I do that mainly for Codex, but the desktop and mobile apps are included so I may as well use them.
I find that the ChatGPT app and OpenAI models are more to my liking when it comes to researching something or going back and forth on an idea. If I'm actually implementing something in code or even just investigating something that benefits from shell access, I'll use Claude Code or Codex, but for higher level stuff I will use the Desktop and Mobile apps. A few things I've done with ChatGPT specifically recently:
- Go through the research and influencer slop regarding how to tune the "voice" of SOTA models, by which I mean how to prompt away the maddening tics that SOTA models for some reason seem biased towards. I'm talking about emojis, bulleted lists with bold prefixes, "delve", "seam", "load-bearing", "X not Y", etc. The prompt itself was very long and the conversation involved a lot of back and forth, but suffice it to say that it was illuminating, and inspired the VOICE.md file that I use now.
- What is the maximum for a gift between immediate family members under federal tax code before the gift is taxable?
- I got this screenshot from the console on a Linux VM that was hung. at some point while this VM was running, before it hung, I updated my software including a new version of the referenced bdamserver, that is supposed to have a fix for the hung task. i am trying to tell if that hung task message is from before or after I applied that fix. how do I translate the timestamps on these lines into an actual date/time?
  (With a screenshot attached)
  I could have easily done this in Codex, except that I run all of my agentic harnesses on a remote headless Linux server in a data center, and I got this screenshot from a colleague over Slack. The path of least resistance was to fire up ChatGPT and paste the screenshot there, so that's what I did. It read the contents no problem and helped me reason about the contents there.
- Dictating text. I hate HATE HATE the voice chat mode in ChatGPT, because all of the voices are so maddeningly patronizing and chirpy, but I really like the speech-to-text (STT) feature. If I need to capture a bunch of text, usually as input to an LLM, it's much faster for me to dictate it than to type it. And I find if I prompt ChatGPT first with some context, and call out product names and which words should be fixed width, before I dump the text on it, it can clean it up and put it into a nice Markdown format without altering the semantic content in any way.
  One big win I got doing this is in one complex project with a lot of external context, I have a docs/braindumps directory. Whenever something comes up that the agent stumbles on for lack of some context, I quickly drop a braindump file there which I author using this STT workflow. I put that file under source control so I can point agents at it when they need that context (and of course they can also find it themselves if they grep for key concepts). Crucially, this isn't AI slop text, it's literally my own words, dictated and transcribed, so it has a high signal-to-noise ratio.
  There are dedicated STT tools (I have used and paid for MacWhisper in the past), but now that the STT capability in the SOTA models is good enough, I much prefer doing this within ChatGPT. That way I'm also leveraging the power of the SOTA model to transform the raw text of my transcript. Sure, tools like MacWhisper can do that too if you give them an OpenAI API key and a prompt, but why would I bother with that when it's built in and part of my paid ChatGPT plan already?
Tools I've discarded
Perplexity
It's hard to believe that just ten months ago, I listed Perplexity under tools that I'm still using daily. It must have been just shortly after that when I dropped it. By now, both Claude and ChatGPT mobile and desktop applications competently search the Internet to provide grounded answers to prompts. Plus, I'm a paid Kagi subscriber so I can use their AI quick answer feature which doesn't suck either. Perplexity now feels to me like MSN or Excite; it's like a thing that my grandparents used in the dawn of the consumer Internet and still use because they don't know that there are much better options.
Apparently Perplexity is still a going concern, much to my surprise. I see no value in it at this point.
Predictions for H2 2026
Re-reading my predictions from the last review, I actually don't disagree with many of them now. So first I'll repeat them here before I get to new predictions.
- Background coding agents will hit mainstream workflows. One vendor will ship a credible "branch-per-issue" loop that knows about existing CI/CD infrastructure; the rest will copy it. Jules will continue to suck until Google kills it.
  To my surprise, this is only now becoming a thing. Claude Tag was announced a few days ago, and for a while there's been Claude Code in the desktop app that kind of does this. I know that Big Tech have their own bespoke AI slop production tooling to make it much easier and faster to run a coding agent on an issue, produce a PR, review the PR, and ship it to prod. But I'm not aware of anything that is generally available and reliably used in anger for this purpose. Perhaps I'm just not aware of it, but I'm pretty plugged into the AI software engineering zeitgeist.
- The Claude Max plan will be enshittified somehow. Anthropic already announced more restrictive usage limits for Pro, adding not only limits within five-hour windows but also weekly usage limits. They claim that this won't impact most users, and for now this is specific to Pro and not Max, but given the economics of hosted LLMs I think further arbitrary usage restrictions are inevitable.
  This has started to happen. You already cannot use third-party agents like Pi with your Max subscription; you have to use credits to pay per-token for extra usage. So far the ChatGPT subscription does still allow this, although for how long I don't know.
  There's also the kerfuffle around Fable, which was abruptly yanked after the US Commerce Department got bribed/tricked/prompted to take Anthropic's claims of the danger of their frontier models at face value; I won't count that as enshittification since it wasn't Anthropic's decision to pull that model.
  I think there's much more to come here, especially as Anthropic and OpenAI are both planning IPOs. They're going to need to start revenuemaxxing sooner than later. We'll see more wailing and gnashing of teeth about how absurdly expensive AI tooling is out of reach of the common man, mostly from preening retards who wouldn't know a common man if he fixed their toilet.
- Vibe-coded slop will get worse. The economic and psychological incentives for lazily turning off your brain and vibe-coding slop that you then ship are simply too strong, and the consequences of doing so still too diffuse and remote. Even if there were zero hype around LLMs and no frenzy of AI FOMO intoxicating every exec and investor on the planet, I think you'd still see widespread vibe-coding simply because it's easier than having to think for yourself, and you can potentially work multiple jobs at once shipping slop at each one. Add in the AI mandates from execs and investors and I see no hope that software doesn't get much worse very quickly. I don't think we will ever recover, but I do hope at some point in the coming years the damage will be sufficiently undeniable and MBA programs will teach enough cautionary case studies that most execs and engineering managers will have learned not to tolerate vibe-coding slop-merchants (or, at the very least, the vibe-coding slop-merchants will be forced to make some minimal effort to hide the telltale signs of slop which might incidentally make them less sloppy).
  This was hardly a bold prediction at the time, and it has already begun to come true, although we're still far away from the zeitgeist acknowledging any negative consequences of all of this slop. Meta have already vibe-shipped an AI agent that let attackers recover the password on target accounts just by bullshitting the LLM, which I predict is not going to be the most retarded AI slop fail this year. In my own company, clumsily prompting the LLM is just about all anyone can be bothered to do now, and the proliferation of slop emails, slop documents, slop issues, and slop messages is accelerating.
- LLMs will continue to not be capable of replacing software engineers, and will continue to get more useful in the hands of competent professionals. More leverage for pros; more mess from everyone else. This is a double-edged sword. I'm glad that there will still be professional opportunities for people like me in the future, but I also am pretty sure I will not enjoy those opportunities due to all of the aforementioned slop.
  Nothing to add on this one. One didn't have to be Nostradamus to see this coming.
- LLM performance will gradually increase, which will continue to give me the ability to knock out quick tasks and explore random ideas that I would never have been able to justify spending my own time on before LLM coding agents were developed. This is the positive side to the negative take in the previous bullet. I'm very excited about what I can do when augmented by a SOTA LLM coding agent. My lament is entirely concerned with what others will do with the same tools.
  Another big-brain bet that, to no one's surprise, is coming true. Opus 4.8 and GPT 5.5 are both really good. Fable seemed a bit better for the brief time I had access to it. I don't think AGI is around the corner, but I do expect incremental improvements in models (and most especially the deterministic tooling that we wrap the models in) to continue. Even if progress on LLMs stopped today, I think we have years of improvements we can discover just around how and when to use what agentic tooling mechanisms to get the most out of a given model.
- Valuations on AI plays will continue to be insane, and I will continue to seethe with jealousy and resentment as I compare my equity stake and comp at Elastio with packages at even the dumbest me-too LLM wrapper startups.
  Given the valuations being floated for the Anthropic and OpenAI IPOs, this prediction was, if anything, insufficiently ambitious.
Now my new predictions for the second half of this year:
- The prevalence of AI psychosis will increase but will not peak this year. Execs, PMs, engineers, shameless shit-tier influencers, and of course investors will continue to uncritically adopt and promote AI for all the things, and when problems inevitably arise the solution will always be more AI for all the things. Staggering amounts of financial and human capital will be expended in the AI gold rush, out of proportion to the real, measurable, lasting value produced (that's not counting insane .ai valuations, which for at least the rest of this year will continue to rise). Fear not though, the worst is still yet to come.
- The new model will come out, and it will be ZOMG WOW THIS IS AGI WE'RE COOKED!!!!, and at the same time also dogshit and worse than GPT-4o and Y U SO EXPENSIVE, grifters will keep grifting, and the minority of engineers getting real value out of thoughtful use of GenAI tooling will be vastly outnumbered by the shambling slop horde, some of which will even be humans.
- The US government will continue to behave as if it's actually controlled by a CCP committee tasked with neutralizing American AI advantages via bone-headed policy moves. The EU will continue to set taxpayer ~~dollars~~ euros on fire in the delusional belief that sovereign SOTA AI is just a matter of a few more billion euros "invested" into the pockets of the European primes. Chinese models will mostly lag US SOTA model performance but cost vastly less, blocked from corporate US adoption by compliance and security concerns.
- Some bright spark will realize that there's money to be made selling a special US-only model that is explicitly subject to ITAR controls (all the best stuff is ITAR!). It will suck more than the SOTA models, but its customers won't (be legally permitted to) care.
- The importance of independence from any one lab, model, or provider will become increasingly obvious and drive more interest in agentic coding harnesses that aren't bound to a specific model or vendor. Self-hosted open weight models will continue to be almost but not quite viable as an alternative to paying for inference.