I'm presently in the process of building (read: directing claude/codex to build) my own AI agent from the ground up, and it's been an absolute blast.
Building it exactly to my design specs, giving it only the tool calls I need, owning all the data it stores about me for RAG, integrating it to the exact services/pipelines I care about... It's nothing short of invigorating to have this degree of control over something so powerful.
In a couple of days' work, I have a Discord bot that's about as useful as ChatGPT, using open models, running on a VPS I manage, for less than $20/mo (including inference). And I have full control over what capabilities I add to it in the future. Truly wild.
> It's nothing short of invigorating to have this degree of control over something so powerful
I'm a SWE w/ >10 years, and you're right, this part has always been invigorating.
I suppose what's "new" here is the drastically reduced amount of cognitive energy I need to build complex projects in my spare time. As someone who was originally drawn to software because of how much it lowered the barrier to entry of birthing an idea into existence (when compared to hardware), I am genuinely thrilled to see said barrier lowered so much further.
Sharing my own anecdotal experience:
My current day job is leading development of a React Native mobile app in TypeScript with a backend PaaS, and the bulk of my working memory is filled up by information in that domain. Given this is currently what pays the bills, it's hard to justify devoting all that much of my brain to deep-diving into other technologies or stacks merely for fun or to satisfy my curiosity.
But today, despite those limitations, I find myself having built a bespoke AI agent written from scratch in Go, using a janky beta AI Inference API with weird bugs and sub-par documentation, on a VPS sandbox with a custom Tmux & Neovim config I can "mosh" into from anywhere using finely-tuned Tailscale access rules.
I have enough experience and high-level knowledge that it's pretty easy for me to develop a clear idea of what exactly I want to build from a tooling/architecture standpoint, but prior to Claude, Codex, etc., the "how" of building it tended to be a big stumbling block. I'd excitedly start building, only to run into the random barriers of "my laptop has an ancient version of Go from the last project I abandoned" or "neovim is having trouble starting the lsp/linter/formatter" and eventually go "ugh, not worth it" and give up.
Frankly, as my career progressed and the increasingly complex problems at work left me with vanishingly little brain-space for passion projects, I was beginning to feel this crushing sense of apathy & borderline despair. I felt I'd never be able to make good on my younger self's desire to bring these exciting ideas of mine into existence. I even got to the point where I convinced myself it was "my fault" because I lacked the mettle to stomach the challenges of day-to-day software development.
Now I can just decide "Hmm.. I want a lightweight agent in a portable binary. Makes sense to use Go." or "this beta API offers super cheap inference, so it's worth dealing with some jank" and then let an LLM work out all the details and do all the troubleshooting for me. Feels like a complete 180 from where I was even just a year or two ago.
At the risk of sounding hyperbolic, I don't think it's overstating things to say that the advent of "agentic engineering" has saved my career.
I'm using kimi-k2-instruct as the primary model and building out tool calls that use gpt-oss-120b, letting the agent opt in to reasoning capabilities when it needs them.
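The opt-in reasoning idea can be sketched as a tool registry where one registered tool simply forwards its input to a second model. This is a minimal illustration, not the author's actual code: the tool name `deep_reason`, the `complete` function type, and the stubbed completion are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// complete is a stand-in for a chat-completion call; in the real agent it
// would hit the inference API with the given model and prompt.
type complete func(model, prompt string) (string, error)

// Tool is a capability the primary model can invoke by name.
type Tool struct {
	Name        string
	Description string
	Run         func(input string) (string, error)
}

// Registry dispatches tool calls emitted by the primary model.
type Registry map[string]Tool

func (r Registry) Register(t Tool) { r[t.Name] = t }

func (r Registry) Dispatch(name, input string) (string, error) {
	t, ok := r[name]
	if !ok {
		return "", errors.New("unknown tool: " + name)
	}
	return t.Run(input)
}

// NewReasonTool wraps a second model as an opt-in reasoning step: when the
// primary model calls "deep_reason", the input is forwarded to the reasoning
// model and its answer comes back as the tool result.
func NewReasonTool(llm complete, model string) Tool {
	return Tool{
		Name:        "deep_reason",
		Description: "Delegate a hard sub-problem to a reasoning model.",
		Run:         func(input string) (string, error) { return llm(model, input) },
	}
}

func main() {
	// Stubbed completion so the sketch runs without network access.
	stub := func(model, prompt string) (string, error) {
		return fmt.Sprintf("[%s] answer to: %s", model, prompt), nil
	}
	reg := Registry{}
	reg.Register(NewReasonTool(stub, "gpt-oss-120b"))

	out, _ := reg.Dispatch("deep_reason", "why is the sky blue?")
	fmt.Println(out)
}
```

The nice property of this shape is that the primary model only pays for reasoning tokens on the turns where it actually asks for them.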
Using Vultr for the VPS hosting, as well as their inference product, which AFAIK is by far the cheapest option for hosting models of this class ($10/mo for 50M tokens, and $0.20/M tokens after that). They also offer Vector Storage as part of their inference subscription, which makes it very convenient to get inference + durable memory & RAG w/ a single API key.
Their inference product is currently in beta, so not sure whether the price will stay this low for the long haul.
You can definitely get gpt-oss-120b for much less than $0.20/M on OpenRouter (the cheapest is currently 3.9c/M in, 14c/M out). Kimi K2 is an order of magnitude larger and more expensive, though.
What other models do they offer? The web page is very light on details
K2 is the only one of the five that supports tool calling. In my testing, all five seem to support RAG, but K2 loses knowledge of its registered tools when you access it through the RAG endpoint, forcing you to pick one capability or the other (I have a ticket open for this).
Also, the R1-distill models are annoying to use because reasoning tokens are included in the output wrapped in <think> tags instead of being parsed into the "reasoning_content" field on responses. Also also, gpt-oss-120b has a "reasoning" field instead of "reasoning_content" like the R1 models.
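The three response shapes described above (`reasoning_content`, `reasoning`, and inline `<think>` tags) can be normalized with one small adapter. A sketch, with field names taken from the comment and everything else (struct and function names) my own:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// message mirrors the union of reasoning fields seen across providers:
// R1-style responses use "reasoning_content", gpt-oss-120b uses
// "reasoning", and some R1 distills inline <think>...</think> in "content".
type message struct {
	Content          string `json:"content"`
	Reasoning        string `json:"reasoning"`
	ReasoningContent string `json:"reasoning_content"`
}

// extractReasoning normalizes the three shapes into (reasoning, answer).
func extractReasoning(raw []byte) (string, string, error) {
	var m message
	if err := json.Unmarshal(raw, &m); err != nil {
		return "", "", err
	}
	switch {
	case m.ReasoningContent != "":
		return m.ReasoningContent, m.Content, nil
	case m.Reasoning != "":
		return m.Reasoning, m.Content, nil
	}
	// Fall back to stripping inline <think> tags out of the content.
	if start := strings.Index(m.Content, "<think>"); start != -1 {
		if end := strings.Index(m.Content, "</think>"); end > start {
			reasoning := m.Content[start+len("<think>") : end]
			answer := strings.TrimSpace(m.Content[:start] + m.Content[end+len("</think>"):])
			return reasoning, answer, nil
		}
	}
	return "", m.Content, nil
}

func main() {
	r, a, _ := extractReasoning([]byte(`{"content":"<think>check units</think>42"}`))
	fmt.Printf("reasoning=%q answer=%q\n", r, a) // reasoning="check units" answer="42"
}
```

Routing every model's output through one function like this keeps the rest of the agent oblivious to which backend produced the response.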
Having a similar experience. Durable memory with accurate, low-latency recall is not at all easy. Loads of subtle design decisions to make around how exactly you want the thing to work.
Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.
Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.
Codex is steerable to a fault, and will gladly "monkey paw" your requests.
Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.
Both are bad, and both paper over the actual issue, which is that these models don't really have the ability to selectively choose their behavior per situation (i.e. ask for follow-up where needed, ignore the user where needed, follow instructions where needed). Behavior is largely global.
In my experience, Claude gradually stops being opinionated as the task at hand becomes more arcane. I frequently add "treat the above as a suggestion, and don't hesitate to push back" to change requests, and it seems to help quite a bit.
For sure. I imagine it'd be pretty difficult to evaluate the "correct" amount of steerability. You'd probably have to measure the delta in eagerness on the same task between highly specified prompts and more open-ended ones. Probably not dissimilar from how artificialanalysis.ai does their "omniscience index".
OpenAI, Anthropic, and other model providers have created tools (the LLMs) with unprecedented new capabilities. The key problems are a) these new tools have weird limitations that make them hard to deploy effectively, and b) these tools are so fundamentally new that creating useful products out of them is an exercise in discovery and requires incredibly novel, forward-thinking vision.
Pete, more than anyone in the OSS community IMO, exemplifies both of these qualities. He is living very much on the bleeding edge, so yes, the 10s of projects he's shipped faster than most devs can ship 1 are not as polished as if he'd created them by hand. But he's been pushing the envelope in ways that few, if any, are, and I'd argue that OpenClaw is much more the result of Pete living on that edge and understanding the trade-offs of these tools better than just about anyone.
Personally, I'm much more jealous of the fact that Pete has already had a successful exit under his belt and had the freedom to explore & learn these tools to the fullest. There is definitely a degree of luck involved with the degree to which OpenClaw took off, but that Pete discovered it is 100% earned IMO.
My guess is that the true impact of this will be difficult to measure for a while. Most "single-person start-ups" will probably not be high-visibility VC-backed, YC affairs, and rather solopreneurs with a handful of niche moonlighted apps each making 3-4 digit monthly revenue.
I'm fascinated by how the economy is catching up to demand for inference. The vast majority of today's capacity comes from silicon that merely happens to be good at inference, and it's clear that there's a lot of room for innovation when you design silicon for inference from the ground up.
With CapEx going crazy, I wonder where costs will stabilize and what OpEx will look like once these initial investments are paid back (or go bust). The common consensus seems to be that there will be a rug pull and frontier model inference costs will spike, but I'm not entirely convinced.
I suspect it largely comes down to how much more efficient custom silicon is compared to GPUs, as well as how accurately the supply chain is able to predict future demand relative to future efficiency gains. To me, it is not at all obvious what will happen. I don't see any reason why a rug pull is any more or less likely than today's supply chain over-estimating tomorrow's capacity needs, and creating a hardware (and maybe energy) surplus in 5-10 years.
It'd be easy to simply say "skill issue" and dismiss this, but I think it's interesting to look at the possible outcomes here:
Option 1: The cost/benefit delta of agentic engineering never improves past net-zero, and bespoke hand-written code stays as valuable as ever.
Option 2: The cost/benefit becomes net positive, and economies of scale forever tie the cost of code production directly to the cost of inference tokens.
Given that many are saying option #2 is already upon us, I'm gonna keep challenging myself to engineer a way past the hurdles I run into with agent-oriented programming.
The deeper I get, the more articles like this feel like the modern equivalent of saying "internet connections are too slow to do real work" or "computers are too expensive to be useful for regular people".
In my experience, AI coding agents need highly specific success criteria, and an easy way to verify their output against those criteria.
My biggest successes have come when I take a TDD approach. First I carve a subset of my work into a module with an API that can be easily tested, then I collaborate with the agent on writing correct test cases, and finally I tell it to implement the module such that the test cases pass without any lint or typing errors.
It forces me to spend much more time thinking about use cases, project architecture, and test coverage than about nitty-gritty implementation details. I can imagine that in a system that evolved over time without a clear testing strategy, AI would struggle mightily to be even barely useful.
Not saying this applies to your system, but I've definitely worked on systems in the past that fit the "big ball of mud" description pretty neatly, and I have zero clue how I'd have been able to make effective use of these AI tools.