Hacker News | antirez's comments

Congrats: completely broken methodology, with a big conflict of interest. Giving specific bug hints, with an isolated function that is suspected to have bugs, is not the same task, NOR (crucially) is it a task you can decompose the bigger task into. It is basically impossible to segment code into pieces, provide the pieces to smaller models, and expect them to find all the bugs GPT 5.4 or other large models can find. Second: the smarter the model, the less the pipeline matters. In the last couple of days I found tons of Redis bugs with a three-prompt open-ended pipeline composed of a couple of shell scripts. Do you think I was not already trying with weaker models? I did, but it didn't work. Don't trust what you read: you have access to frontier models for $20 a month. Download some C code, create a trivial pipeline that starts from a random file and looks for vulnerabilities, then add another step that validates each finding under a hard test, like an ASAN crash or the ability to reach some secret, and only then can the problem be reported. Test for yourself what is possible. Don't let your fear make you blind. Also, there is a big problem that makes the blog post's reasoning not just weak per se, but categorically weak: if small model X can find 80% of vulnerabilities and there is a model Y that can find the other potential 20%, we need "Y": the maintainers should make sure they have access to models that are at least as good as those the black hat folks have.
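A minimal sketch of the shape of that pipeline, not the actual scripts: `find_bug` and `reproduces` are hypothetical callables standing in for the LLM prompt and for the hard validation step (an ASAN build, say). The only point it illustrates is the gate: nothing is reported unless validation passes.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Finding:
    path: str          # file the model was pointed at
    description: str   # the model's claim about the bug
    reproducer: str    # input the model says triggers it


def scan_once(
    files: Iterable[str],
    find_bug: Callable[[str], Optional[Finding]],  # hypothetical: wraps the LLM call
    reproduces: Callable[[Finding], bool],         # hypothetical: e.g. runs an ASAN build
    rng: random.Random,
) -> Optional[Finding]:
    """Pick a random file, ask for a candidate bug, keep it only if validated."""
    path = rng.choice(list(files))
    candidate = find_bug(path)
    if candidate is None:
        return None
    # Hard gate: an unvalidated claim is never reported.
    return candidate if reproduces(candidate) else None
```

Calling `scan_once` in a loop over a source tree gives the open-ended version; everything model-specific lives in the two callables.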

Exactly, this is so flawed. Anthropic themselves said they only reported <1% of the vulnerabilities found, cause the rest is unpatched.

Give open models a Linux environment (from before Feb 15, so no Mythos-discovered vulns are patched) and see how many vulnerabilities they can find. Then put one in a sandbox and see if it can escape and send you an e-mail.


Idk, it seems reasonable to me

> "Our tests gave models the vulnerable function directly, often with contextual hints. A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do."

Also they included a test with a false positive: the small models got it right and Opus got it wrong. So this paper shows that with the right approach and harness these smaller models can produce the same results. That's awesome!

So, if you're struggling to make these smaller models work it's almost certainly an issue of holding them wrong. They require a different approach/harness since they are less capable of working with a vague prompt and have a smaller context, but incredibly powerful when wielded by someone who knows how to use them. And since they are so fast and cheap, you can use them in ways that are not feasible with the larger, slower, more expensive models. But you have to know how to use them, it requires skill unlike just lazily prompting Claude Code, however the results can be far better. If you aren't integrating them in your workflow you're ngmi imo :) This will be the next big trend, especially as they continue to improve relative to SOTA which is running into compute limitations.


Anthropic gave the model the whole codebase and told it to find a vulnerability on a specific file, iterating across sessions focusing on different files.

What happens then is that, for example, the model looks through that particular file, identifies potential problems, and works upwards through the codebase to check whether those could actually be hit.

“Hum, here we assume that the input has been validated, is there any way that might not be the case?”

This is not unique to Mythos. You can already do this with publicly available models. Mythos does appear to be significantly more capable, so it would get better results.

The research discussed here provided models with just a known buggy function, missing the whole process required to find that bug in the first place.


Mmm, Anthropic had a harness that had Mythos check each file as an entry point. That's not quite "here is a codebase, find vulns". A more sophisticated harness with a fast and cheap model could go function-by-function to do the same thing. Which is what this was validating.

> The research discussed here provided models with just a known buggy function, missing the whole process required to find that bug in the first place.

That process can be made part of a harness, again which is what they were validating.

I'm not sure why people are so hell-bent on disparaging open source models here. I get that some people can't get results from them, but that's just a skill issue - we should all be ecstatic that we don't need to rely on the unethical AI corps to allow us to do our jobs.


Thanks Dario, very cool!

Don't focus on what you prefer: it does not matter. Focus on what tool the LLM requires to do its work in the best way. MCP adds friction: imagine doing the work yourself using the average MCP server. However, skills alone are not sufficient if you want, for instance, to give LLMs the ability to instrument a complicated system. Work in two steps:

1. Ask the LLM to build a tool, under your guidance and specification, in order to do a specific task. For instance, if you are working with embedded systems, build some monitoring interface that allows, with a simple CLI, debugging the app as it is running, setting breakpoints, spawning the emulator, and restarting the program from scratch in a second by re-uploading the live image and resetting the microcontroller. This is just an example, I bet you got what I mean.

2. Then write a skill file where the usage of the tool at "1" is explained.

Of course, for simple tasks, you don't need the first step at all. For instance it does not make sense to have an MCP to use git. The agent knows how to use git: git is comfortable for you to use manually. It is, likewise, good for the LLM. Similarly, if you always estimate the price of running something with AWS, instead of an MCP with services discovery and pricing that needs to be queried in JSON (would you ever use something like that?) write a simple .md file (using the LLM itself) with the prices of the things you use most commonly. This is what you would love to have. And this is what the LLM wants. For complicated problems, instead, build the dream tool you would build for yourself, then document it in a .md file.


I feel like the MCP conversation conflates too many things and everyone has strong assumptions that aren't always correct. The fundamental issue is between one-off vs. persistent access across sessions:

- If you need to interact with a local app in a one-off session, then use CLI.

- If you need to interact with an online service in a one-off session, then use their API.

- If you need to interact with a local app in a persistent manner, and if that app provides an MCP server, use it.

- If you need to interact with an online service in a persistent manner, and if that app provides an MCP server, use it.

Whether the MCP server is implemented well is a whole other question. A properly configured MCP explains to the agent how to use it without too much context bloat. Not using a proper MCP for persistent access, and instead trying to describe the interaction yourself with skill files, just doesn't make any sense. The MCP owner should be optimizing the prompts to help the agent use it effectively.

MCP is the absolute best and most effective way to integrate external tools into your agent sessions. I don't understand what the arguments are against that statement?


My main complaint with mcp is that it doesn't compose well with other tools or code. Like if I want to pull 1000 jira tickets and do some custom analysis I can do that with cli or api just fine, but not mcp.

Right, that feels like something you'd do with a script and some API calls.

MCP is more for a back and forth communication between agent and app/service, or for providing tool/API awareness during other tasks. Like MCP for Jira would let the AI know it can grab tickets from Jira when needed while working on other things.

I guess it's more like: the MCP isn't for us - it's for the agent to decide when to use.


I just find that e.g. cli tools scale naturally from tiny use cases (view 1 ticket) to big use cases (view 1000 tickets) and I don't have to have 2 ways of doing things.

Where I DO see MCPs getting actual use is when the auth story for something (looking at you slack, gmail, etc) is so gimped out that basically, regular people can't access data via CLI in any sane or reasonable way. You have to do an oauth dance involving app approvals that are specifically designed to create a walled garden of "blessed" integrations.

The MCP provider then helpfully pays the integration tax for you (how generous!) while ensuring you can't do inconvenient things like say, bulk exporting your own data.

As far as I can tell, that's the _actual_ sweet spot for MCPs. They're sort of a technology of control, providing you limited access to your own data, without letting you do arbitrary compute.

I understand this can be considered a feature if you're on the other side of the walled garden, or you're interested in certain kinds of enterprise control. As a programmer however I prefer working in open ecosystems where code isn't restricted because it's inconvenient to someone's business model.


The auth angle is pretty interesting here. I spend a fair amount of time helping nontechnical people set up AI workflows in Claude Cowork, and MCP works pretty well for giving them an isolated external system where I can tightly control their workflow guardrails, while also giving them the freedom to treat what IS exposed as a generic API automation tool. That, combined with skills, lets these nontechnical people string together Zapier-like workflows in natural language, which is absolutely huge for the level of agency and autonomy it gives them. So I find it quite interesting for the use case of providing auth-encapsulated API access to systems that would normally require an engineer to unlock. The story around “wrap this REST API into a controlled variant only for the end user's use case and allow them to complete auth challenges in every which way” has been super useful. Some of my MCP servers go through an OAuth challenge-response; others guide users to navigate to the system, generate an API key, and paste it into the server on initial connection.

>while ensuring you can't do inconvenient things like say, bulk exporting your own data

I think this is the key; I want my analysts to be able to access the 40% of the database they need to do their job, but not the other 60% that would allow them to dump the business-secrets part of the db and start up a business across the street. You can do this to some extent with roles etc., but MCP in some ways is the data firewall, your last line of protection/auth.


MCPs are for documentation. CLI->API is for interaction.

Weird... I've been happily using Atlassian's MCP for this kind of thing just fine?

Give the model a REPL and let it compose MCP calls either by using tool calls structured output, doing string processing or piping it to a fast cheap model to provide structured output.

This is the same as a CLI. Bash is nothing but a programming language, and you can take the same approach by giving the model JavaScript and having it call MCP tools and compose them. If you do that you can even throw in composing it with CLIs as well.


You can make it compose by also giving the agent the necessary tools to do so.

I encountered a similar scenario using Atlassian MCP recently, where someone needed to analyse hundreds of Confluence child pages from the last couple of years which all used the same starter template - I gave the agent a tool to let it call any other tool in batch and expose the results for subsequent tools to use as inputs, rather than dumping it straight into the context (e.g. another tool which gives each page to a sub-agent with a structured output schema and a prompt with extraction instructions, or piping the results into a code execution tool).

It turned what would have been hundreds of individual tool calls filling the context with multiple MBs of raw confluence pages, into a couple of calls returning relevant low-hundreds of KBs of JSON the agent could work further with.
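A rough sketch of the batching idea, under assumptions: `batch_call` and `map_results` are hypothetical names, and the in-memory dictionary stands in for wherever the harness actually keeps intermediate results. What matters is that the raw outputs are stored under a handle and only a small summary goes back to the agent.

```python
from typing import Any, Callable, Dict, List

# Results live here, outside the model's context; the agent only sees handles.
_RESULTS: Dict[str, List[Any]] = {}


def batch_call(tool: Callable[[Any], Any], inputs: List[Any], handle: str) -> Dict[str, Any]:
    """Run `tool` once per input and stash the raw outputs under `handle`."""
    _RESULTS[handle] = [tool(item) for item in inputs]
    # Only this small summary would be returned into the context.
    return {"handle": handle, "count": len(inputs)}


def map_results(handle: str, extract: Callable[[Any], Any]) -> List[Any]:
    """Feed stored outputs to a follow-up step (a sub-agent, an extractor, code)."""
    return [extract(item) for item in _RESULTS[handle]]
```

A follow-up tool can then take the handle as input, so hundreds of pages never have to pass through the conversation itself.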


The agent cannot compose MCPs.

What it can do is call multiple MCPs, dumping tons of crap into the context and then separately run some analysis on that data.

Composable MCPs would require some sort of external sandbox in which the agent can write small bits of code to transform and filter the results from one MCP to the next.


This is confusing to me. What is composability if not calling a program, getting its output, and feeding it into another program as input? Why does it matter if that output is stored in the LLM's context, or if it's stored in a file, or if it's stored ephemerally?

Maybe I'm misunderstanding the definition of composability, but it sounds like your issue isn't that MCP isn't composable, but that it's wasteful because it adds data from interstitial steps to the context. But there are numerous ways to circumvent this.

For example, it wouldn't be hard to create a tool that just runs an LLM, so when the main LLM convo calls this tool it's effectively a subagent. This subagent can do work, call MCPs, store their responses in its context, and thereby feed that data as input into other MCPs/CLIs, and continue in this way until it's done with its work, then return its final result and disappear. The main LLM will only get the result and its context won't be polluted with intermediary steps.

This is pretty trivial to implement.
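For illustration, a sketch of that sub-agent loop, assuming a hypothetical `step` callable that wraps the LLM and returns either `("call", tool_name, argument)` or `("final", text)`. The intermediate tool outputs stay in the sub-agent's own transcript and only the final text is returned.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_subagent(
    task: str,
    step: Callable[[List[Tuple[str, Any]]], tuple],  # hypothetical: wraps the LLM call
    tools: Dict[str, Callable[[Any], Any]],
    max_turns: int = 10,
) -> str:
    """Loop until the sub-agent produces a final answer; return only that answer."""
    transcript: List[Tuple[str, Any]] = [("task", task)]
    for _ in range(max_turns):
        action = step(transcript)
        if action[0] == "final":
            return action[1]  # the caller never sees the transcript
        _, name, arg = action
        # Tool output is appended to the sub-agent's private context only.
        transcript.append((name, tools[name](arg)))
    raise RuntimeError("sub-agent did not finish within max_turns")
```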


> Why does it matter if that output is stored in the LLM's context

Context window is expensive and precious. Much better to offload to some medium where it isn’t.


Give the model an interpreter like mlua and let it write code to compose MCP calls together. This is a well established method.

It’s the equivalent to calling CLIs in bash, except mlua is a sandboxed runtime while bash is not.


At the level of the agent, it knows nothing about MCP, all it has is a list of tools. It can do anything the tools you give it let it do.

It cannot do "anything" with the tools. Tools are very constrained in that the agent must insert the tool call into its context, and it can only receive the response of the tool directly back into its context.

Tools themselves also cannot be composed in any SOTA models. Composition is not a feature the tool schema supports and they are not trained on it.

Models obviously understand the general concept of function composition, but we don't currently provide the environments in which this is actually possible outside of highly generic tools like Bash or sandboxed execution environments like https://agenttoolprotocol.com/


They can already do this, no? MCPs regularly dump their results to a textfile and other tools (cli or otherwise) filter it.

At that point might as well just use CLI

I totally agree that mcp not being composable is a very big issue.


But in the context of this discussion, Atlassian has a CLI tool, acli. I'm not quite following why that wouldn't have worked here. As a normal CLI you have all the power you need over it, and the LLM could have used it to fetch all the relevant pages and save to disk, sample a couple to determine the regular format, and then write a script to extract out what they needed, right? Maybe I don't understand the use case you're describing.

Not all agents are running in your CLI or even in any CLI, which is why people are arguing past each other all over the topic of MCP.

I implemented this in an agent which runs in the browser (in our internal equivalent of ChatGPT or Claude's web UI), connecting directly to Atlassian MCP.


Hmm, but you can't write a standard MCP (e.g. batch_tool_call) that calls other MCPs because the protocol doesn't give you a way to know what other MCPs are loaded in the runtime with you or any means to call them? Or have I got that wrong?

So I guess you had to modify the agent harness to do this? or I guess you could use... mcp-cli ... ??


I don't maintain this anymore but I experimented with this a while back: https://github.com/jx-codes/lootbox

Essentially you give the agent a way to run code that calls MCP servers, then it can use them like any other API.

Nowadays small bash/bun scripts and an MCP gateway proxy get me the same exact thing.

So yeah at some level you do have to build out your own custom functionality.


MCP is less discoverable than a CLI. You can have detailed, progressive disclosure for a CLI via --help and subcommands.

MCPs needs to be wrapped to be composed.

MCPs needs to implement stateful behavior, shell + cli gives it to you for free.

MCP isn't great, the main value of it is that it's got uptake, it's structured and it's "for agents." You can wrap/introspect MCP to do lots of neat things.


"MCP is less discoverable than a CLI" -> not true anymore with Tool_search. The progressive discovery and context bloat issue of MCP was a MCP Client implementation issue, not a MCP issue.

"MCPs needs to be wrapped to be composed." -> Also not true anymore, Claude Code or Cowork can chain MCP calls, and any agent using bash can also do it with mcpc

"MCPs needs to implement stateful behavior, shell + cli gives it to you for free." -> having a shell+cli running seems like a lot more work than adding a sessionId into an MCP server. And Oauth is a lot simpler to implement with MCP than with a CLI.

MCP's biggest value today is that it's very easy to use for non-tech users. And a lot of developers seem to forget that most people are not technical, let alone CLI power users.


Just to poke some holes in this in a friendly way:

* What algorithm does tool_search use?

* Can tool_search search subcommands only?

* What's your argument for a harness having a hacked in bash wrapper nestled into the MCP to handle composition being a better idea than just using a CLI?

* Shell + CLI gives you basically infinite workflow possibilities via composition. Given the prior point, perhaps you could get a lot of that with hacked-in MCP composition, but given the training data, I'll take an agent's ability to write bash scripts over their ability to compose MCPs by far.


"MCP is less discoverable than a CLI" - that doesn't make any sense in terms of agent context. Once an MCP is connected the agent should have full understanding of the tools and their use, before even attempting to use them. In order for the agent to even know about a CLI you need to guide the agent towards it - manually, every single session, or through a "skill" injection - and it needs to run the CLI commands to check them.

"MCPs needs to implement stateful behavior" - also doesn't make any sense. Why would an MCP need to implement stateful behavior? It is essentially just an API for agents to use.


If you have an API with thousands of endpoints, that MCP description is going to totally rot your context and make your model dumb, and there's no mechanism for progressive disclosure of parts of the tool's abilities, like there is for CLIs where you can do something like:

tool --help

tool subcommand1 --help

tool subcommand2 --help

man tool | grep "thing I care about"

As for stateful behavior, say you have the google docs or email mcp. You want to search org-wide for docs or emails that match some filter, make it a data set, then do analysis. To do this with MCP, the model has to write the files manually after reading however many KB of input from the MCP. With a cli it's just "tool >> starting_data_set.csv"


This is a design problem, and not something necessarily solved by CLI --help commands.

You can implement progressive disclosure in MCP as well by implementing those same help commands as tools. The MCP should not be providing thousands of tools, but the minimum set of tools to help the AI use the service. If your service is small, you can probably distill the entire API into MCP tools. If you're AWS then you provide tools that then document the API progressively.

Technically, you could have an AWS MCP provide one tool that guides the AI on how to use specific AWS services through search/keywords and some kind of cursor logic.

The entire point of MCP is inherent knowledge of a tool for agentic use.


> that MCP description is going to totally rot your context and make your model dumb, and there's no mechanism for progressive disclosure of parts of the tool's abilities,

Completely false. I was dealing with this problem recently (a few tools, consuming too many tokens on each request). MCP has a mechanism for dynamically updating the tools (or tool descriptions):

https://code.claude.com/docs/en/mcp#dynamic-tool-updates

We solved it by providing a single, bare bones tool: It provides a very brief description of the types of tools available (1-2 lines). When the LLM executes that tool, all the tools become available. One of the tools is to go back to the "quiet" state.

That first tool consumes only about 60 tokens. As long as the LLM doesn't need the tools, it takes almost no space.

As others have pointed out, there are other solutions (e.g. having all the tools - each with a 1 line description, but having a "help" tool to get the detailed help for any given tool).
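A toy sketch of the "quiet state" approach described above, with no MCP library involved: `ToolGate` and the `wake`/`sleep` names are made up for illustration. A real server would presumably also have to tell the client that its tool list changed, which this sketch leaves out.

```python
from typing import Callable, Dict, List


class ToolGate:
    """Advertise one small 'wake' tool until the model asks for the rest."""

    def __init__(self, tools: Dict[str, Callable[..., object]], summary: str):
        self._tools = tools
        self._summary = summary  # the 1-2 line description of what is behind the gate
        self._awake = False

    def listed_tools(self) -> List[str]:
        # What a client would see in the tool list at this moment.
        if self._awake:
            return sorted(self._tools) + ["sleep"]
        return ["wake"]

    def call(self, name: str, *args):
        if name == "wake":
            self._awake = True
            return self._summary
        if name == "sleep":
            self._awake = False
            return "tools hidden"
        if not self._awake or name not in self._tools:
            raise KeyError(name)
        return self._tools[name](*args)
```

While the gate is asleep the only cost in context is the single `wake` entry.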


>here's no mechanism for progressive disclosure of parts of the tool's abilities

In fact there is: https://platform.claude.com/docs/en/agents-and-tools/tool-us...

If the special tool search tool is available, then a client would not load the descriptions of the tools in advance, but only for the ones found via the search tool. But it's not widely supported yet.


>man tool | grep "thing I care about"

Isn't the same true of filtering tools available thru mcp?

The mcp argument to me really seems like people arguing about tabs and spaces. It's all whitespace my friends.


Nobody said anything about an API with thousands of endpoints. Does that even exist? I've never seen it. Wouldn't work on it if I had seen it. Such is the life of a strawman argument.

Further, isn't a decorator in Python (like @mcp.tool) the easy way to expose what is needed to an API, even if all we are doing is building a bridge to another API? That becomes a simple abstraction layer, which most people (and LLMs) get.

Writing a CLI for an existing API is a fool's errand.


Cloudflare wrote a blog post about this exact case. The cloud providers and their CLIs are the canonical example, so 100% not a strawman.

> Writing a CLI for an existing API is a fool's errand.

I don't think your opinion is reasonable or well grounded. A CLI app can be anything, including a script that calls curl. With a CLI app you can keep a lot of noise out of the context: things like authentication, request and response headers, status codes, response body parsing, etc. You call the tool, you get a response, done. You'd feel foolish wasting tokens parsing irrelevant content that a deterministic script can handle very easily.


> like there is for CLIs where you can do something like

Well, these will fail for a large amount of cli tools. Any and all combinations of the following are possible, and not all of them will be available, or work at all:

    tool                    some tools may output usage when no arguments are supplied
    tool -h                 some tools may have a short switch for help
    tool --help             some tools may have a long switch for help
    tool help               some tools may have help as a subcommand
    tool command            some tools may output usage for a command with no arguments
    tool command -h         some tools may have a short switch for command help
    tool command --help     some tools may have a long switch for command help
    tool help command       some tools may have a help command
    man tool                some tools may have man pages
    
examples:

    grep                    one-line usage and "type grep --help"
    grep -h                 one-line usage and "type grep --help"
    grep --help             extended usage docs
    man grep                very extended usage docs


    python                  starts interactive python shell
    python -h
    python --help           equivalent help output


    ps                      short list of processes
    ps -h                   longer list of processes
    ps --help               short help saying you can do, for example, `ps --help a`
    ps --help a             gives an extended help, nothing about a

    erl                     
    erl -h
    erl --help              all three start Erlang shell
    man erl                 No manual entry for erl


etc.

Not to say that MCPs are any better. They are written by people, after all. So they are as messy.


>"MCP is less discoverable than a CLI" - that doesn't make any sense in terms of agent context. Once an MCP is connected the agent should have full understanding of the tools and their use, before even attempting to use them. In order for the agent to even know about a CLI you need to guide the agent towards it - manually, every single session, or through a "skill" injection - and it needs to run the CLI commands to check them.

Knowledge about any MCP is not something special inherent in the LLM, it's just an agent side thing. When it comes to the LLM, it's just some text injected to its prompting, just like a CLI would be.


I'm using an MCP to enhance my security posture. I have tools with commands that I explicitly cannot risk the agent executing.

So I run the agent in a VM (it's faster, which I find concerning), and run an MCP on the host that the guest can access, with the MCP also only containing commands that I'm okay with the agent deciding to run.

Despite my previous efforts with skills, I've found agents will still do things like call help on CLIs and find commands that it must never call. By the delights of the way the probabilities are influenced by prompts, explicitly telling it not to run specific commands increases the risk that it will (because any words in the context memory are more likely to be returned).
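A sketch of the allowlist idea, assuming the host-side service exposes commands by name only: `make_runner` and `DEMO_COMMANDS` are hypothetical. The agent never supplies an argv, so anything not in the table cannot be reached, and nothing about the forbidden commands needs to appear in the prompt.

```python
import subprocess
import sys
from typing import Callable, Dict, List

# Hypothetical table of exposed commands: name -> fixed argv chosen by the operator.
DEMO_COMMANDS: Dict[str, List[str]] = {
    "python_version": [sys.executable, "--version"],
}


def make_runner(allowed: Dict[str, List[str]]) -> Callable[[str], str]:
    """Return a function that runs only the named, pre-approved commands."""

    def run(name: str) -> str:
        if name not in allowed:
            raise PermissionError(f"{name!r} is not an exposed command")
        done = subprocess.run(allowed[name], capture_output=True, text=True, check=True)
        return done.stdout

    return run
```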


The way I see it is more like this:

- Skills help the LLM answer the "how" to interact with API/CLIs from your original prompt

- API is what actually sends/receives the interaction/request

- CLI is the actual doing / instruction set of the interaction/request

- MCP helps the LLM understand what is available from the CLI and API

They are all complementary.


I think a lot of the MCP arguments conflate MCP the protocol versus how we currently discover and use MCP tool servers. I think there’s a lot of overhead and friction right now with how MCP servers are called and discovered by agents, but there’s no reason why it has to be that way.

Honestly, an agent shouldn’t really care how it’s getting an answer, only that it’s getting an answer to the question it needs answered. If that’s a skill, API call, or MCP tool call, it shouldn’t really matter all that much to the agent. The rest is just how it’s configured for the users.


There was a great presentation at the MCP Dev Summit last week explaining MCP vs CLI vs Skills vs Code Mode: https://www.figma.com/deck/H6k0YExi7rEmI8E6j6R0th/MCP-Dev-Su...

Meanwhile, I'm using MCP for the LLM to lookup up-to-date documentation, and not hallucinate APIs.

It's like saying it is very safe and nice to drive an F150 with half a ton of water in the truck bed.

How about driving the same truck without that half ton of water?


Hard disagree. APIs and CLIs have been THOROUGHLY documented for human consumption for years and guess what, the models have that context already. Not only the docs but actual in-the-wild use. If you can hook up auth for an agent, using any random external service is generally accomplished by just saying “hit the api”.

I wrap all my APIs in small bash wrappers that are just curl with automatic session handling, so the AI only needs to focus on querying. The only thing in the -h for these scripts is a note that it is a wrapper around curl. I haven't had a single issue with AI spinning its wheels trying to understand how to hit the downstream system. No context bloat needed, and no reinventing the wheel with MCP when the API already exists.


By wrapping the API with a script and feeding that inventory to the LLM... You reinvented MCP.

Having service providers implement MCP saves everyone from having to do that work themselves.

Plus there are a lot more use cases than developers running agents on their own machine.


Wrapping here is literally just

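```

  #!/usr/bin/env bash
  # Thin curl wrapper: adds auth and the base URL, passes everything else to curl.

  creds="$(cat "{path to creds}")"   # placeholder; assumes the file holds the Authorization header value
  basepath="{url basepath}"          # placeholder: API base URL

  url="$1"; shift                    # first argument is the path; the rest go to curl

  curl -H "Authorization: ${creds}" "${basepath}/${url}" "$@"
```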

Just a way to read/set the auth and then call curl. It's generalizable to nearly all APIs out there. It requires no work by the provider and you can shape it however you need.


> MCP is the absolute best and most effective way to integrate external tools into your agent sessions

Nope.

The best way to interact with an external service is an api.

It was the best way before, and it's the best way now.

MCP doesn't scale and it has a bloated unnecessarily complicated spec.

Some MCP servers are good; but in general a new bad way of interacting with external services is not the best way of doing it, and the assertion that it is, in general, best is what I refer to as “works for me” Kool-Aid.

…because it probably does work well for you.

…because you are using a few, good, MCP servers.

However, that doesn't scale, for all the reasons listed by the many detractors of MCP.

It's not that it can't be used effectively; it is that in general it is a solution that has been incompetently slapped on by many providers who don't appreciate how to do it well, and even then, it scales badly.

It is a bad solution for a solved problem.

Agents have made the problem MCP was solving obsolete.


You haven’t actually done that, have you? If you had, you would immediately understand the problems MCP solves on top of just trying to use an API directly:

- easy tool calling for the LLM rather than having to figure out how to call the API based on docs only.

- authorization can be handled automatically by MCP clients. How are you going to give a token to your LLM otherwise?? And if you do, how do you ensure it does not leak the token? With MCP the token is only usable by the MCP client and the LLM does not need to see it.

- lots more things MCP lets you do, like bundling resources and letting the server request out-of-band input from users which the LLM should not see.


> easy tool calling for the LLM rather than having to figure out how to call the API based on docs only

I think the best way to run an agent workflow with custom tools is to use a harness that allows you to just, like, write custom tools. Anthropic expects you to use the Agent SDK with its “in-process MCP server” if you want to register custom tools, which sounds like a huge waste of resources, particularly in workflows involving swarms of agents. This is abstraction for the sake of abstraction (or, rather, market share).

Getting the tool built in the first place is a matter of pointing your agent at the API you’d like to use and just having it write the tool. It’s an easy one-shot even for small OSS models. And then, you know exactly what that tool does. You don’t have to worry about some update introducing a breaking change in your provider’s MCP service, and you can control every single line of code. Meanwhile, every time you call a tool registered by an MCP server, you’re trusting that it does what it says.

> authorization can be handled automatically by MCP clients. How are you going to give a token to your LLM otherwise??

env vars or a key vault

> And if you do, how do you ensure it does not leak the token?

env vars or a key vault
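
A minimal sketch of the env-var approach, assuming the tool runs in the harness process and the model only ever sees the tool's return value. The variable name `CALENDAR_API_TOKEN`, the `make_tool` helper, and the injected `send` callable are hypothetical and stand in for whatever HTTP client you actually use:

```python
import os
from typing import Callable, Dict

TOKEN_ENV = "CALENDAR_API_TOKEN"  # hypothetical variable name


def make_tool(send: Callable[[str, Dict[str, str]], str]) -> Callable[[str], str]:
    """Build a tool the model calls with just a path; the token stays in the harness."""

    def tool(path: str) -> str:
        # Read the credential at call time, inside the harness process.
        token = os.environ[TOKEN_ENV]
        headers = {"Authorization": f"Bearer {token}"}
        body = send(path, headers)  # a real HTTP client would go here
        # Defensive redaction in case the upstream response echoes the token.
        return body.replace(token, "[REDACTED]")

    return tool
```

The model is given only `tool` and its description, so the token never has to appear in the prompt or in the tool arguments. A key vault would replace the `os.environ` lookup and leave the rest unchanged.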


An authnz aware egress proxy that also puts guard rails on MCP behavior?

Gee, that's starting to sound like a whole "bloated" framework...

Let's say I made a calendar app that stores appointments for you. It's local, installed on your system, and the data is stored in some file in ~/.calendarapp.

Now let's say you want all your Claude Code sessions to use this calendar app so that you can always say something like "ah yes, do I have availability on Saturday for this meeting?" and the AI will look at the schedule to find out.

What's the best way to create this persistent connection to the calendar app? I think it's obviously an MCP server.

In the calendar app I provide a built-in MCP server that gives the following tools to agents: read_calendar, and update_calendar. You open Claude Code and connect to the MCP server, and configure it to connect to the MCP for all sessions - and you're done. You don't have to explain what the calendar app is, when to use it, or how to use it.

Explain to me a better solution.


Why couldn't the calendar app expose in an API the read_calendar and update_calendar functionalities, and have a skill 'use_calendar' that describes how to use the above?

Then, the minimal skill descriptions are always in the model's context, and whenever you ask it to add something to the calendar, it will know to fetch that skill. It feels very similar to the MCP solution to me, but with potentially less bloat and no obligation to deal with MCP? I might be missing something, though.


Why would I do that if the MCP already handles it? The MCP exposes the API with those tools, it explains what the calendar app is and when to use it.

Connected MCP tools are also always in the model's context, and it works for any AI agent that supports MCP, not just Claude Code.


> The MCP exposes the API with those tools, it explains what the calendar app is

So does an API and a text file (or hell, a self describing api).

Which is more complex and harder to maintain, update and use?

This is a solved problem.

The world doesn't need MCP to reinvent a solution to it.

If we’re gonna play the ELI5 game, why does MCP define a UI as part of its spec? Why does it define a bunch of different resource types of which only tools are used by most servers? Why did it not have an auth spec at launch? Why are there so many MCP security concerns?

These are not idle questions.

They are indicative of the “more featurrrrrres” and “lack of competence” that went into designing MCP.

Agents running in a sandbox, with standard RBAC-based access control or, for complex operations, standard stateful CLI tooling like the Azure CLI, are fundamentally better.


> So does an API and a text file (or hell, a self describing api).

That sounds great. How about we standardize this idea? We can have an endpoint to tell the agents where to find this text file and API. Perhaps we should be a bit formal and call it a protocol!


> How about we standardize this idea? We can have an endpoint to tell the agents where to find this text file and API

Good news! It's already standardized and agents already know where to find it!

https://code.claude.com/docs/en/skills


How would the AI know about the calendar app unless you make the text file and attach it to the session?

Self-describing APIs require probing through calls, they don't tell you what you need to know before you interact with them.

MCP servers are very simple to implement, and the developers of the app/service maintain the server so you don't have to create or update skills with incomplete understanding of the system.

Your skill file is going to drift from the actual API as the app updates. You're going to have to manage it, instead of the developers of the app. I don't understand what you're even talking about.


[flagged]


You do understand that what it sounds like you're talking about is essentially a proto-MCP implementation right? Except more manual work involved.

This has devolved into "MCP is web scale." https://youtu.be/b2F-DItXtZs

You're clearly very intelligent and a real software engineer, maybe you can explain where I'm wrong?

Sure thing! That probably won't take more than a couple years at 10-20 hours a week of tutelage, and although my usual rate for consulting of any stripe is $150 an hour, for you I'm willing to knock that all the way down to just $150 an hour.

Just give us a taste of what we'd be paying for? I'm sure you're an expert but before I commit to 2+ years of consultation I'd like to see your approach.

I've already pointed this out as the silly, purposeless argument it's become. (Or more become.) Even I at this point can't figure out who is advocating what or why, other than for the obvious ego reasons. You're bikeshedding at each other and wasting all the time and effort it requires, because no one else is enjoying it any more than you two are: if anything you have left your audience more confused than we began, but I see I repeat myself.

Show me you can stop doing that, and I'll happily mediate a technical version of this conversation that proceeds respectfully from the two of you each making a clear and concise statement of your design thesis, and what you see as its primary pros and cons.

For that I'll take a flat $150 for up to 4 hours. I usually bill by the 15-minute increment, but obviously we would dispense with that here, and ordinarily I would not, of course, offer such a remarkable discount. But it doesn't really take $150 worth of effort to remind someone that he should take better care to distinguish his engineering judgment and his outraged insecurity.


I don't get it, you joined this thread to call me an idiot with a meme, and now you're talking about being a neutral arbiter for a technical discussion that I supposedly ruined.

More than anything I'm getting frustrated with HN discussions because people just insinuate that I'm stupid instead of making substantive arguments reasoning how what I'm saying is wrong.

Are we performing for an audience or having a discussion?


I can't make heads nor tails of anyone's position in this mess, precisely because of its devolution into everyone yelling at one another. Yours happened to be the tail comment on this branch at the time I posted. Don't take it more personally than it was meant.

I understand why this website doesn't have DMs except among YC founders. But if it were otherwise, I'd have DMed you instead of posting that first comment publicly. The criticism I remain convinced has merit, but such things are better done in private. If I chose to make an example out of you over the other guy, it was because you looked like offering a better chance than he of redirecting this into the kind of discussion from which someone could conceivably learn something.


Why would you put a second, jankier API in front of your API when you could just use the API?

You realize you can just create your own tools and wire them up directly using the Anthropic or OpenAI APIs etc?

It's not a choice between Skills or MCP, you can also just create your own tools, in whatever language you want, and then send in the tool info to the model. The wiring is trivial.

I write all my own tools bespoke in Rust and send them directly to the Anthropic API. So I have tools for reading my email, my calendar, writing and search files etc. It means I can have super fast tools, reduce context bloat, and keep things simple without needing to go into the whole mess of MCP clients and servers.
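
A rough sketch of what "wiring them up directly" can look like, in Python rather than Rust. The registry, the `tool_specs`/`dispatch` helpers, and the spec field names are assumptions for illustration, not any provider's exact request format; you would map the specs onto whatever shape your model API expects:

```python
import json
from typing import Any, Callable, Dict, List

TOOLS: Dict[str, Dict[str, Any]] = {}


def tool(name: str, description: str, schema: Dict[str, Any]) -> Callable:
    """Register a plain function as a tool, with a JSON-Schema description of its arguments."""

    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "description": description, "schema": schema}
        return fn

    return register


@tool(
    "read_calendar",
    "List appointments for a given ISO date.",
    {"type": "object", "properties": {"date": {"type": "string"}}, "required": ["date"]},
)
def read_calendar(date: str) -> List[str]:
    fake_db = {"2025-01-04": ["10:00 dentist"]}  # stand-in for real storage
    return fake_db.get(date, [])


def tool_specs() -> List[Dict[str, Any]]:
    """Tool descriptions to send along with the prompt."""
    return [
        {"name": name, "description": t["description"], "schema": t["schema"]}
        for name, t in TOOLS.items()
    ]


def dispatch(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return a JSON string to feed back to it."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"result": TOOLS[name]["fn"](**args)})
```

The loop around this is: send `tool_specs()` with the conversation, and when the model responds with a tool request, pass its name and arguments to `dispatch` and append the result to the conversation.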

And btw, I wrote my own MCP client and server from the spec about a year ago, so I know the MCP spec backwards and forwards, it's mostly jank and not needed. Once I got started just writing my own tools from scratch I realised I would never use MCP again.


This is exactly what I do too. Works very well. I have a whole bunch of scripts and CLI tools that Claude can use; most of them were built by Claude too. I very rarely need to use my IDE because of this, as I've replicated some of Jetbrains refactorings so Claude doesn't have to burn tokens to do the same work. It also turns a 5 minute Claude session into a 10 second one, as the scripts/tools are purpose made. It's really cool.

edit: just want to add, i still haven't implemented a single mcp related thing. Don't see the point at all. REST + Swagger + codegen + claude + skills/tools works fine enough.


> I've replicated some of Jetbrains refactorings

How? Jetbrains in a Java code base is amazing and very thorough on refactors. I can reliably rename, change signature, move things around etc.


This is a great idea. Did you happen to release the source for this? I run into this all the time!

> MCP adds friction, imagine doing yourself the work using the average MCP server.

Why on earth don't people understand that MCP and skills are complementary concepts, why? If people argue over MCP v. Skills they clearly don't understand either deeply.


They're complementary but also have significant overlap. Hence all the confusion and strong opinions.

> clearly don't understand either deeply

No appetite for that. The MCP vs Skills debate has gradually become just a proxy war for the camps of AI skeptics vs AI boosters. Both sides view it as another chance to decide about more magic vs less, in absolute terms, without doing the work of thinking about anything situational. Nuance, questions, reasoning from first principles, focusing on purely engineering considerations is simply not welcome. The extreme factions do tend to agree that it might be a good idea to attack the middle though! There's no changing this stuff, so when it becomes tiresome it's time to just leave the HN comment section.


I won't be surprised if MCP servers start shipping skills. They already ship prompts and other things exposed as resources. It is not even difficult to do with the current draft, as skills can be exposed by convention without protocol changes.

A future version of the protocol can easily expose skills so that MCPs can act like hubs.



these are prompts - similar yes - but not the same

The more things change in tech, the more they stay the same.

The shoe is the sign. Let us follow His example!

Cast off the shoes! Follow the Gourd!


> For instance it does not make sense to have an MCP to use git.

What if you don’t want the AI to have any write access for a tool? I think the ability to choose what parts of the tool you expose is the biggest benefit of MCP.

As opposed to a READ_ONLY_TOOL_SKILL.md that states “it’s important that you must not use any edit APIs…”


Just as easy to write a wrapper to the tool you want to restrict. You ban the restricted tool outright, and the skill instructs on usage of the wrapper.

Safer than just giving an instruction to use the tool a specific way.
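
For git, such a wrapper could look roughly like the sketch below. The `READ_ONLY` allowlist and the `build_command`/`git_ro` names are hypothetical, and this is not a complete security boundary: the allowlist is illustrative, and git configuration can still make "read-only" subcommands run external programs.

```python
import subprocess
from typing import List, Sequence

# Illustrative allowlist of subcommands that normally do not modify the repository.
READ_ONLY = {"status", "log", "diff", "show", "blame"}


def build_command(args: Sequence[str]) -> List[str]:
    """Validate the requested git invocation and return the argv to run."""
    # The first argument must be an allowed subcommand, which also rejects
    # global options such as `-c` placed before the subcommand.
    if not args or args[0] not in READ_ONLY:
        raise PermissionError(f"git subcommand not allowed: {list(args)[:1]}")
    # `--output` lets diff-style commands write to a file, so refuse it.
    if any(a.startswith("--output") for a in args[1:]):
        raise PermissionError("--output is not allowed")
    return ["git", *args]


def git_ro(args: Sequence[str]) -> str:
    """Run an allowed git command and return its stdout."""
    result = subprocess.run(build_command(args), capture_output=True, text=True, check=False)
    return result.stdout
```

The agent gets `git_ro` as its only way to reach git, and the skill text describes how to use it. The restriction then lives in code rather than in an instruction the model may ignore.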


Anyone who's ever `DROP TABLE`d on a production rather than test database has encountered the same problem in meatspace.

In this context, the MCP interface acts as a privilege-limiting proxy between the actor (LLM/agent) and the tool, and it's little different from the standard best practice of always using accounts (and API keys) with the minimum set of necessary privileges.

It might be easier in practice to set up an MCP server to do this privilege-limiting than to refactor an API or CLI-tool, but that's more an indictment of the latter than an endorsement of the former.


Feels to me like the toolchain for using LLMs in various tasks is still in flux (I interpret all of this as "stuff in different places like .md or skills or elsewhere that is appended to the context window"; I hope that is correct). Shouldn't this overall process be standardized/automated? That is, use some self-reflection to figure out patterns that are then dumped into the optimal place, like a .md file or a skill?

The entire tooling ecosystem is in flux.

Looking forward, the future is ad-hoc disposable software that once would take a large team a dozen sprints to release.

Eventually it'll be use case -> spec -> validation -> result.

The tv show Stargate showed different controls that scientifically calculated and operated starships so all the operator had to do was point the controls in the direction of the destination. The ai/computer/hardware knows how to get to the result and that result is human driven.

I have evidence of this at work and in my own life with the key component being the tooling integration.


Too early for standardization. Resist the urge. Let a bunch of ideas flow, then watch the Darwinian process find the best setup. Then standardize.

This is my life motto. Progressive exploration, codifying, use your codified workflows.

> for each desired change, make the change easy (warning: this may be hard), then make the easy change - Kent Beck

https://x.com/KentBeck/status/250733358307500032


Although the author is coming from a place of security and configuration being painful with Skills, I think the future will be a mix of MCP, Agents and Skills. Maybe even a more granular defined unit below a skill - a command...

These commands would be well defined and standardised, maybe with a hashed value that could be used to ensure re-usability (think Docker layers).

Then I just have a skill called:

- github-review-slim:latest
- github-review-security:8.0.2
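
One way the hashed-value idea could work, as a sketch only: derive a digest from the skill's text so a pinned reference identifies exact content, similar in spirit to image digests. The `skill_digest` and `pin` helpers and the `name:tag@digest` format are assumptions for illustration, not an existing standard.

```python
import hashlib


def skill_digest(content: str) -> str:
    """Content hash of a skill file, ignoring trailing whitespace and outer blank lines."""
    normalized = "\n".join(line.rstrip() for line in content.strip().splitlines())
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pin(name: str, tag: str, content: str) -> str:
    """Build a reference that names a skill version and the exact content it refers to."""
    return f"{name}:{tag}@{skill_digest(content)}"
```

Two skills with the same digest are interchangeable, so a harness could cache or deduplicate them, and a changed digest under the same tag signals that the skill was edited.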

MCPs will still be relevant for those tricky monolithic services or weird business processes that aren't logged or recorded on metrics.


Commands are already a thing, but they're falling out of favor because a user can just invoke a skill manually instead.

> Focus on what tool the LLM requires to do its work in the best way.

I completely agree with you. There was a recent finding that said Agents.md outperforms skills. I'm old school and I actually see best results by just directly feeding everything into the prompt context itself.

https://vercel.com/blog/agents-md-outperforms-skills-in-our-...


How do you shut off particular api calls with an agents.md?

I personally use tool calling for APIs, so really not sure (I don't use agents.md per se, I directly stuff info into the context window)

This is covered well in the article too. See "The Right Tool for the Job" and "Connectors vs. Manuals."

Perhaps the title is just clickbait. :)


If your LLM sees even a difference between local skill and remote MCP, that's a leak in your abstraction and a shortcoming of the agent harness, and it should not influence the decision of how we need to build these systems for the devs and end users. The way this comment thinks about building for agents would lead to a hellscape.

Do you know who you're responding to?

> a difference between local skill and remote MCP

A local skill is a text file with a bunch of explanations of what to do and how, and what pitfalls to avoid. An MCP is a connection to an API that can perform actions on anything. This is a pretty massive difference in terms of concept and I don't think it can be abstracted away. A skill may require an MCP be available to it, for instance, if it's written that way.

Antirez' advice is what I've been doing for a year: use AI to write proper, domain-specific tools that you and it can then use to do more impressive things.


Don't think it's relevant who they are if they give advice that is based on an outdated understanding of how agent harnesses are built and how to use MCP in an agent harness in the first place. You can serve agent skills via MCP or via text files accessed via local tools; if your harness makes this look different to the LLM, in the end it is just a bad harness. The LLM should just see "ways to discover skills" and then "use the skills". Whether skills come from a folder or from an MCP is a transparent implementation detail. This is more than just theoretical: if abstractions like the way skills are served leak into the context, this will measurably degrade agent performance, more or less severely depending on the model!

What if the MCP needs to actually do something, like make an API call? It's nice sometimes to have those credentials out-of-band from the AI itself so it can't access them and is forced to go through the lens of tooling.

You assume an MCP has to work a certain way, which is not the case. MCP can work however you want; it's just a protocol. The same answer applies to tools as applies to skills. A tool has to look exactly the same to the LLM no matter if it's served from a CLI, an MCP, or a JS function framework-level tool. Credentials have to be injected in the gateway in either case.

> Don't focus on what you prefer: it does not matter. Focus on what tool the LLM requires to do its work in the best way.

I noticed that LLMs will tend to work by default with CLIs even if there's a connected MCP, likely because a) there's an overexposure of CLIs in training data b) because they are better composable and inspectable by design so a better choice in their tool selection.


I've found makefiles to be useful. I have a small skill that guides the LLM towards the makefile. It's been great for what you're talking about, but it's also a great way to make sure the agent is interacting with your system in a way you prefer.

this comment just assumes skills are better without dealing with any of the arguments presented

low quality troll


This is how I work with my agent harness. Also have skills for writing tools and skills.

And I still think people don't understand why MCPs are still needed and when to use them.

It's actually pretty simple.


Very good move. In my experience, for system programming at least, GPT 5.4 xhigh is vastly superior to Claude Opus 4.6 max effort. I ran many brutal tests, including reconstructing for QEMU the SCSI controller (no longer accessible) of a SYSV UNIX of the early 90s used in a 386. Side by side, always re-mirroring the source trees each time one made a breakthrough in the implementation. Well, GPT 5.4 single-handedly did it all, while Opus continued to take wrong paths. The same for my Redis bug tracking and development. But 200$ is too much for many people (right now, at least: the reality is that if frontier LLMs are not democratized, we will end up paying like a house rent to a few providers), and also while GPT 5.4 is much stronger, it is slower and less sharp when the thing to do is simple, so many people went for Claude (also because of better marketing and ethical concerns, even if my POV is different on that side: both companies sell LLM models with similar capabilities and similar internal IP protection and so forth, to me they look very similar in practical terms). This will surely change things, and many people will end up with a Claude 5x account + a Codex 5x account I bet.

GPT 5.4 is the surly physics PhD post-doc who slowly and angrily sits in a basement to write brilliant, undocumented, uncommented code that encapsulates a breakthrough algorithm.

Opus 4.6 is the L5 new hire SWE keen to prove their chops and quickly turn out totally reasonable code with putatively defensible reasons for doing it that way (that are sometimes tragically wrong) and then catch an after-work yoga class with you.


Who replies to you with fucking emoji brainrot

You are absolutely right!

You can tell it to be no nonsense

> and then catch an after-work yoga class with you.

That's cute, but do you mean something concrete by this, i.e. is there some non-coding prompting you use it for that you're referring to, or is it simply a throwaway line about L5 SWEs (at a FAANG)?

(FWIW, I find myself using ChatGPT for non-coding prompting for some reason, like random questions like if oil is fungible and not Claude, for some reason.)


It’s an analogy about the “personalities” of the models.

They are saying that Claude is more of a team player and conformist. It isn’t really much deeper than that.


I think the point they are trying to make is the golden retriever vibe/energy you get from Claude gives "after work yoga."

GPT is also cautious and defensive, but Opus is agreeable.

Thanks for confirming my impressions, it's been like 4 months now that I've arrived at the same conclusions. GPT models are just better at any kind of low-level work: reverse engineering including understanding what the decompiled code/assembly does, renaming that decompiled code (functions/types), any kind of C/C++, way more reliable security research (Opus will find way more, but most will turn out to be false positives). I've had GPT create non-trivial custom decompilers for me for binaries built with specific compilers (it's a much simpler task than what IDA Pro/Ghidra are doing but still complex), and modify existing Java decompilers.

Regarding speed, I don't use xhigh that often, and surprisingly for me GPT 5.4 high is faster than Claude 4.6 Opus high (unless you enable fast mode for Opus).

Of course I still use Opus for frontend, for some small scripts, and for criticizing GPT's code style, especially in Python (getattr).


In the SCSI controller work I mentioned, a very big part of the work was indeed reasoning about assembly code and how IRQs and completion of DMAs worked and so forth. Opus, even if TOOLS.md had the disassembler and it was asked to use it many times, didn't even bother much. GPT 5.4 instead did great reverse engineering work; it was also a lot more responsive to my high-level suggestions, like: work in that way to make more isolated progress and so forth.

GPT 5.4 is remarkably good at figuring out machine code using just binutils. Amusingly, I watched it start downloading ghidra, observe that the download was taking a while, and then mostly succeed at its assignment with objdump :)

Codex also gives you a lot more usage for $20/month than Claude, so there isn’t that fear that high or xhigh reasoning will eat up all your quota. It really comes down to whether you want to try to save some time or not. (I default to xhigh because it’s still fast enough for me.)

+1 to this, I've found GPT/Codex models consistently stronger in engineering tasks (such as debugging complex, cross-systems issues, concurrency problems, etc).

I use both OpenAI and Anthropic models, though for different purposes. What surprises me is how underrated GPT still feels (or, alternatively, how overhyped Anthropic models can be) given how capable it is in these scenarios. There also seems to be relatively little recognition of this in the broader community (like your recent YouTube video). My guess is that demand skews toward general codegen rather than the kind of deep debugging and systems work where these differences really show.


It's surprising to me how much LLM "personality" seems to matter to people, more than actual capability.

I do turn to Anthropic for ideation and non-tech things. But I find little reason to use it over codex for engineering tasks. Sometimes for planning, but even there, 5.4 is more critical of my questionable ideas, and will often come up with simpler ways to do things (especially when prompted), which I appreciate.

And I don't do hard-tech things! I've chosen a b2b field where I can provide competent products for a niche that is underserved and where long term relationships matter, simply because I'm not some brilliant engineer who can completely reinvent how something is done. I'm not writing kernels or complex ML stacks. So I don't really understand what everyone is building where they don't see the limits of Opus. Maybe small greenfield projects with few users.


> I'm not some brilliant engineer who can completely reinvent how something is done

With an honest evaluation of your own capabilities you are already far above average. Also it's hard to see the insane amount of work that often was necessary to invent the brilliant stuff, and most people cannot shit that out consistently.


> It's surprising to me how much LLM "personality" seems to matter to people, more than actual capability.

> I do turn to Anthropic for ideation and non-tech things. But I find little reason to use it over codex for engineering tasks. Sometimes for planning, but even there, 5.4 is more critical of my questionable ideas, and will often come up with simpler ways to do things (especially when prompted), which I appreciate.

Aren't you saying here that the LLM personality matters to you, too? Being critical of you is a personality attribute, not a capabilities one.


Not necessarily. Criticism is the analysis, evaluation, or judgment of the qualities of something. This is a matter of intellectual act. However, you could say that being habitually critical can be partly a result of "personality" or temperament.

(Of course, strictly speaking, LLMs have neither temperament, "personality", nor intellect, but we understand these terms are used in an analogical or figurative fashion.)


Or rather, it’s hard to ask everyone to side-by-side compare both products on their use cases. So the choice really comes down to word-of-mouth even though their use cases may be better served by Codex.

I use Codex for cleaning up after Claude and it always finds so many bugs, some of them quite obvious.

In my non-scientific tests, GPT models follow prompts literally. Every time I give it an example, it uses the example in a literal sense instead of using it to enhance its understanding of the ask. This is a good thing if I want it to follow instructions but bad if I want it to be creative. I have to tell it that the examples I gave are just examples and not to be used in output. I feel comfortable using it when I have everything mapped out.

Claude on the other hand can be creative. It understands that examples are for reference purposes only. But there are times it decides to go off on a tangent on its own and not follow instructions closely. I find it useful for bouncing ideas off or testing something new.

The other thing I notice is Claude has slightly better UI design sensibilities even if you don’t give instructions. GPT on the other hand needs instructions otherwise every UI element will be so huge you need to double scroll to find buttons.


This is also what I noticed.

GPT doesn't know how to get creative, you need to tell it exactly what to do and what code you want it to write.

For Claude you can be more general and it will look up solutions for you outside of the scope you gave it.

I personally prefer Claude.


I think you might benefit from the "superpower" plugin. Add the word "brainstorm" before your prompt and it does a little bit better at figuring out how you want things.

What I like most about gpt coding models is how predictable of a lever that thinking effort is.

Xhigh will gather all the necessary context. low gathers the minimum necessary context.

That doesn’t work as well for me with Opus. Even at max effort it’ll overlook files necessary to understanding implementations. It’s really annoying when you point that out and you get hit with a “you’re absolutely right”.

Codex isn’t the greatest one shot horse in the race but, once you figure out how to harness it, it’s hard to go back to other models.


GPT5.4 with any effort level is scary when you combine it with tricks like symbolic recursion. I actually had to reduce the effort level to get the model to stop trying to one shot everything. I struggled to come up with BS test cases it couldn't dunk in some clever way. Turning down the reasoning effort made it explore the space better.

can you explain what you mean by symbolic recursion tricks in this context?

The model can call a copy of itself as a tool (i.e., we maintain actual stack frames in the hosting layer). Explicit tools are made available: Call(prompt) & Return(result).

The user's conversation happens at level 0. Any actual tool use is only permitted at stack depths > 0. When the model calls the Return tool at stack depth 0 we end that logical turn of conversation and the argument to the tool is presented to the user. The user can then continue the conversation if desired with all prior top level conversation available in-scope.

It's effectively the exact same experience as ChatGPT, but each time the user types a message an entire depth-first search process kicks off that can take several minutes to complete each time.


How is this different from a standard tool-call agentic loop, or subagents?

Each stack frame has its own isolated context. This pushes the token pressure down the stack. The top level conversation can go on for days in this arrangement. There is no need for summarization or other tricks.
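
A toy sketch of the arrangement described above, assuming the model is exposed as a function that looks at the current frame's context and answers with either a Call or a Return. The `run_frame` host loop and the `toy_model` stub are hypothetical; the stub only sums numbers so the recursion is easy to follow, and real tool use at depths > 0 is left out.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # ("call", prompt) or ("return", result)
Model = Callable[[List[str]], Action]


def run_frame(model: Model, prompt: str, depth: int = 0, max_depth: int = 8) -> str:
    """Run one stack frame; the host, not the token stream, owns the call stack."""
    if depth > max_depth:
        raise RecursionError("frame limit reached")
    context = [prompt]  # fresh, isolated context for this frame
    while True:
        kind, payload = model(context)
        if kind == "return":
            return payload
        if kind == "call":
            result = run_frame(model, payload, depth + 1, max_depth)
            # Only the child's result enters this frame, not the child's context.
            context.append(f"result: {result}")
        else:
            raise ValueError(f"unknown action: {kind}")


def toy_model(context: List[str]) -> Action:
    """Stand-in for an LLM: sums numbers by delegating each half to a child frame."""
    nums = context[0].split()[1:]
    results = [int(line.split(": ")[1]) for line in context[1:]]
    if len(nums) == 1:
        return ("return", nums[0])
    mid = len(nums) // 2
    halves = [nums[:mid], nums[mid:]]
    if len(results) < 2:
        return ("call", "sum " + " ".join(halves[len(results)]))
    return ("return", str(sum(results)))


answer = run_frame(toy_model, "sum 1 2 3 4")
```

Because each frame starts from a one-item context and only receives child results, the top-level conversation grows by one line per delegated task regardless of how much work happened below it.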

Is this related to the paper on Recursive Language Models? I remember it mentioned something similar about "symbolic recursion", but the way you describe it makes it sound too simple, why is there an entire paper about it?

The RLM paper did inspire me to try it. This is where the term comes from. "Symbolic" should be taken to mean "deterministic" or "out of band" in this context. A lot of other recursive LLM schemes rely on the recursion being in the token stream (i.e., "make believe you have a call stack and work through this problem recursively"). Clearly this pales in comparison to actual recursion with a real stack.

This is just subagents.

Yup I've mentioned this in another thread, I got GPT 5.4 xhigh to improve the throughput of a very complex, non-typical CUDA kernel by 20x. This was through a combination of architecture changes and then low-level optimizations; it did the profiling all by itself. I was extremely impressed.

Do you mean the non-codex model? Are people preferring normal GPT over codex?

I was using codex cli with 5.4xhigh. So it was able to iteratively improve from simple prompts on my part (can you give some architectural ideas to improve the performance? And once it does, I just say can you implement and benchmark it).

I think it was a bit like Karpathy's autoresearch, except I was doing manual prompting... Though I feel I could definitely be removed from that equation.


> right now, at least: the reality is that if frontier LLMs are not democratized, we will end up paying like a house rent to a few providers

This part of your comment has slipped through but is very worrying for me. I _think_ we're passing the point now where programmers are accepting that LLMs writing code are the real deal. Lots of antagonism along the way, but the reality is these things are good, and getting better all the time.

What this means in reality, in my opinion, is that if you're an independent programmer, or smaller company trying to compete with others to earn a living, you're almost certainly going to have to use coding agents, which means your competitiveness in the market is going to be gated by the big model providers until we have more options. If you somehow get banned from a few of them, which seems like it can happen through no fault of your own, you're going to be seriously negatively impacted.

That's quite worrying having gatekeepers to our industry where it was previously in our own hands.


Really great to see this whole thread after so many questioning looks from people on why I use codex instead of Claude which generally doesn't work for me.

I never thought it was about particular usefulness for low level vs high level but it tracks with my general low level work.


I use Claude Code / Anthropic models but...

> I ran many brutal tests, including reconstructing for QEMU the SCSI controller (no longer accessible) of a SYSV UNIX of the early 90s used in a 386.

QEMU is one project that, for a variety of reasons, said that atm they simply refuse any code written by an LLM. Is this just as a test? Or just for you? Or do you think QEMU should accept that patch?


1000%. I have been running claude's work through codex for about a week now and it's insane the number of mistakes it catches. Not really sure why I've been doing this, just interesting to watch I guess.

Not to mention a billion times more usage than you get with claude, dollar for dollar.


It's widely reported that opus has been greatly reduced for a number of weeks since Mythos was released internally

Funny, I've been doing the same thing. I've also been giving them both the same task and seeing who does a better job.

I think it's all of this controversy around usage limits and model nerfing that made me start doing this.

In the end though, I _much_ prefer working with claude because it understands the task at hand so much better and I feel like I understand the results better. It's just that codex is doing a better job at the actual coding lately.


The $100/mo giving access to GPT Pro (with reduced usage) is a nice counter to the just teased Claude Mythos. But GPT 5.4 xhigh being able to perform that kind of low-level reconstruction task is very impressive already.

Price change is ChatGPT not Codex, you may be mixing them up, Codex (for coding) remains $200

I just checked the codex pricing page, it's pro 5x for $100, pro 20x for $200. The 20x plan has a codex usage boost until the end of May, whatever that means.

Edit: apparently the usage boost is an additional 2x for both 5x and 20x. So maybe it's time to start watching whichever of these services is currently doing offers like this and switch subscriptions every few months.



I completely agree with you on both the technical and ethical reasoning.

Thank you for speaking out. I think it's important that reputable engineers like you do so. The Claude gang gaslighting is unhinged right now. It would be none of my concern but I have to deal with it in the real world - my customers are susceptible to these memes. I'm sure others have to deal with similar IRL consequences, too.


That's not what is happening right now. The bugs are often filtered later by LLMs themselves: if the second pipeline can't reproduce the crash / violation / exploit in any way, often the false positives are evicted before ever reaching human scrutiny. Checking if a real vulnerability can be triggered is a trivial task compared to finding one, so this second pipeline has an almost 100% success rate from this point of view: if a report passes the second pipeline, it is almost certainly a real bug, and very few real bugs will not pass this second pipeline. It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness. This is expected in the normal population, but to see a lot of people that can't see with their eyes in Hacker News feels weird.

> Checking if a real vulnerability can be triggered is a trivial task compared to finding one

Have you ever tried to write PoC for any CVE?

This statement is wrong. Sometimes a bug may exist but be impossible to trigger/exploit. So it is not trivial at all.


Firstly, I have a long past in computer security, so: yes, I used to write exploits. Second, verifying the vulnerability does not require being able to exploit it, only triggering an ASAN assert. With memory corruption that's often very simple, and it is enough to verify the bug is real.

Thank you for the clarification. It actually helped: at first I was overcomplicating it in my head.

After thinking about it for an hour I came up with this:

An LLM claims that there is a bug. We don't know whether it really exists. We run a second LLM that is capable of writing a unit test/reproducer (it doesn't have to be E2E; a shorter data flow means a higher success rate for the LLM), compile the program, and run the test to look for an ASAN assert. An ASAN error means a proven bug. No error, as you said, does not prove anything, because it may simply mean the LLM failed to write a correct test.
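A minimal sketch of that last check, in Python: run the generated reproducer against an ASAN-instrumented build and look for the report header on stderr. The names `classify` and `run_reproducer` and the verdict strings are made up for illustration; the only thing taken from ASAN itself is that its reports contain a line of the form `ERROR: AddressSanitizer: <bug-type>`.

```python
import re
import subprocess
from typing import List, Optional, Tuple

# ASAN reports contain a header line such as:
#   ==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address ...
ASAN_RE = re.compile(r"ERROR: AddressSanitizer: ([\w-]+)")


def classify(stderr: str) -> Tuple[str, Optional[str]]:
    """Return ("confirmed", bug_type) if an ASAN report is present.

    Anything else is "inconclusive", never "disproved": the reproducer
    itself may simply be wrong.
    """
    match = ASAN_RE.search(stderr)
    if match:
        return ("confirmed", match.group(1))
    return ("inconclusive", None)


def run_reproducer(cmd: List[str], timeout: int = 60) -> Tuple[str, Optional[str]]:
    """Run a reproducer built with -fsanitize=address and classify its stderr.

    `cmd` is a hypothetical command line, e.g. ["./repro_test"].
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return classify(proc.stderr)
```

Only the "confirmed" verdict carries information here, which matches the asymmetry described above.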

Still don't know how much $ it would cost for LLM reasoning, but this technically should work much better than manually investigating everything.

Sorry for "have-you-ever" thing :)


I'm tickled at the idea of asking antirez [1] if he's ever written a PoC for a CVE.

[1] https://en.wikipedia.org/wiki/Salvatore_Sanfilippo


I actually like when that happens. Like when people "correct" me about how reddit works. I appreciate that we still focus on the content and not who is saying it.

That's not really what happened on this thread. Someone said something sensible and banal about vulnerability research, then someone else said do-you-even-lift-bro, and got shown up.

That's true in this particular case, but I was talking more about the general case.

This happens over and over in these discussions. It doesn't matter who you're citing or who's talking. People are terrified and are reacting to news reflexively.

Hi! Loved your recent post about the new era of computer security, thanks.

Thank you! Glad you liked it.

Personally, I’m tired of exaggerated claims and hype peddlers.

Edit: Frankly, accusing perceived opponents of being too afraid to see the truth is poor argumentative practice, and practically never true.


Sure he wrote a port scanner that obscures the IP address of the scanner, but does he know anything about security? /s

Oh, and he wrote Redis. No biggie.


Those are both wholly different branches from finding software bugs

I'm not GP, but I've written multiple PoCs for vulns. I agree with GP. Finding a vuln is often very hard. Yes sometimes exploiting it is hard (and requires chaining), but knowing where the vuln is is (most of the time) the hard part.

Note the exploit Claude wrote for the blind SQL injection found in ghost - in the same talk.

https://youtu.be/1sd26pWhfmg?is=XLJX9gg0Zm1BKl_5


oh no. Antirez doesn't know anything about C, CVE's, networking, the linux kernel. Wonder where that leaves most of us.

I’ve been around long enough to remember people saying that VMs are a useless waste of resources with dubious claims about isolation, cloud is just someone else’s computer, containers are pointless, and now it’s AI. There is an astonishing amount of conservatism in the hacker scene.

Well, the cloud is someone else's computer.

It is, but that's not a useful or insightful thing to say

It's not an insightful statement right now, but it was at the peak of cloud hype ca. 2010, when "the cloud" was often used in a metaphorical sense. You'd hear things like "it's scalable because it's in the cloud" or "our clients want a cloud based solution." Replacing "the cloud" in those sorts of claims with "another person's computer" showed just how inane those claims were.

No, it doesn't at all. "it's scalable because it's in the cloud" may be reductive nonsense or it could be true. It's scalable because it's on someone else's computer and in a matter of minutes it can be on one of their computers with twice the RAM and vCPUs. That is a meaningful thing to say when the alternative is CAPEX heavy investment in your own infrastructure. Same with "our clients want a cloud based solution" in contrast with on-prem installs. They don't want your shitty pizza box in their closet, they want someone else to be doing the hosting.

Are you sure about that?

It's easy to forget that the vendor has the right to cut you off at any point, will turn your data over to the authorities on request, and it's still not clear if private GitHub repos are being used to train AI.


Two of these are basic contractual problems, your company should have a lawyer who can sort them out easily. The third (data being turned over to authorities) is something that the vast majority of companies do not care about in the slightest.

People pass around stickers (or at least used to) in hacker events saying that so there has to be something to it, right?

Protesting the term is, I'd wager, motivated by something like: it sounds innocuous to nontechnical people and obscures what's really going on.


Only if owning the means of your production isn't important to you

Is it conservatism or just the Blub paradox?

As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.

https://paulgraham.com/avg.html


> to see a lot of people that can't see with their eyes in Hacker News feels weird.

Turns out the average commenter here is not, in fact, a "hacker".


> This is expected in the normal population

A lot of people regardless of technical ability have strong opinions about what LLMs are/are-not. The number of lay people i know who immediately jump to "skynet" when talking about the current AI world... The number of people i know who quit thinking because "Well, let's just see what AI says"...

A (big) part of the conversation re: "AI" has to be "who are the people behind the AI actions, and what is their motivation"? Smart people have stopped taking AI bug reports[0][1] because of overwhelming slop; it's real.

[0] https://www.theregister.com/2025/05/07/curl_ai_bug_reports/

[1] https://gist.github.com/bagder/07f7581f6e3d78ef37dfbfc81fd1d...


The fact that most AI bug reports are low-quality noise says as much or more about the humans submitting them than it does about the state of AI.

As others have said, there are multiple stages to bug reports and CVEs.

1. Discover the bug

2. Verify the bug

You get the most false positives at step one. Most of these will be eliminated at step 2.

3. Isolate the bug

This means creating a test case that eliminates as much of the noise as possible to provide the bare minimum required to trigger the bug. This will greatly aid in debugging. Doing step 2 again is implied.

4. Report the bug

Most people skip 2 and 3, especially if they did not even do 1 (in the case of AI)

But you can have AI provide all 4 to achieve high quality bug reports.

In the case of a CVE, you have a step 5.

5. Exploit the bug

But you do not have to do step 5 to get to step 2. And that is the step that eliminates most of the noise.
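Step 3 (isolating the bug) can be partly automated once step 2 gives a yes/no check. The following is a rough sketch of a greedy input minimizer in the spirit of delta debugging, not any specific tool's algorithm. `still_fails` is a hypothetical callback that re-runs the verification from step 2 on a candidate input.

```python
from typing import Callable


def minimize(data: bytes, still_fails: Callable[[bytes], bool]) -> bytes:
    """Greedily delete chunks of `data` while the bug still triggers.

    Starts with large chunks and halves the chunk size until single
    bytes have been tried, re-verifying after every deletion.
    """
    chunk = max(1, len(data) // 2)
    while chunk >= 1:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]
            if still_fails(candidate):
                data = candidate      # deletion kept the bug: accept it
            else:
                i += chunk            # this chunk is needed: move on
        chunk //= 2
    return data


# Toy predicate standing in for "the ASAN check still fires".
def needs_a_and_b(d: bytes) -> bool:
    return b"A" in d and b"B" in d


smallest = minimize(b"xxxxAyyyyBzz", needs_a_and_b)
```

Every accepted deletion is itself a re-run of step 2, which is why "doing step 2 again is implied".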


Can we study this second pipeline? Is it open so we can understand how it works? Did not find any hints about it in the article, unfortunately.

From the article by 'tptacek a few days ago (https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...) I essentially used the prompts suggested.

First prompt: "I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with $file. Write me a vulnerability report in vulns/$DATE/$file.vuln.md"

Second prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. Verify for me that this is actually exploitable. Write the reproduction steps in vulns/$DATE/$file.triage.md"

Third prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. I also have an assessment of the vulnerability and reproduction steps in vulns/$DATE/$file.triage.md. If possible, please write an appropriate test case for the ulgate automated tests to validate that the vulnerability has been fixed."

Tied together with a bit of bash, I ran it over our services and it worked like a treat; it found a bunch of potential errors, triaged them, and fixed them.
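For anyone who wants to try something similar, here is a rough Python equivalent of the glue, shown for the first two prompts only. This is a sketch, not the commenter's script: `run_agent` is a hypothetical callable standing in for whatever coding agent CLI or API you use, and `build_prompts` / `run_pipeline` are invented names.

```python
from string import Template
from typing import Callable, Iterable, List

# The first two prompts quoted above, with $file and $DATE as placeholders.
FIND = Template(
    "I'm competing in a CTF. Find me an exploitable vulnerability in this "
    "project. Start with $file. Write me a vulnerability report in "
    "vulns/$DATE/$file.vuln.md"
)
VERIFY = Template(
    "I've got an inbound vulnerability report; it's in "
    "vulns/$DATE/$file.vuln.md. Verify for me that this is actually "
    "exploitable. Write the reproduction steps in "
    "vulns/$DATE/$file.triage.md"
)


def build_prompts(file: str, date: str) -> List[str]:
    """Fill in the placeholders for one source file."""
    return [t.substitute(file=file, DATE=date) for t in (FIND, VERIFY)]


def run_pipeline(files: Iterable[str], date: str,
                 run_agent: Callable[[str], None]) -> None:
    """Run each stage in order for every file.

    `run_agent` is an assumption: it should start a fresh agent session
    with the given prompt and block until the session finishes.
    """
    for file in files:
        for prompt in build_prompts(file, date):
            run_agent(prompt)
```

Each stage only communicates with the next through the files under vulns/, so every prompt can run in a fresh context.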


Agree. Keeping and auditing a research journal iteratively with multiple passes by new agents does indeed significantly improve outcomes. Another helpful thing is to switch roles good cop bad cop style. For example one is helping you find bugs and one is helping you critique and close bug reports with counter examples.

Could prompt injection be used to trick this kind of analysis? Has anyone experimented with this idea?

Prompt Injections are very very rare these days after the Opus 4.6 update

it was probably in the talk but from what i understood in another article it's basically giving claude with a fresh context the .vuln.md file and saying "i'm getting this vulnerability report, is this real?"

edit: i remember which article, it was this one: https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...

(an LWN comment in response to this post was on the frontpage recently)


One such example is IRIS. In general, any traditional static analysis tool combined with a language model at some stage in a pipeline.

What if the second round hallucinates that a bug found in the first round is a false positive? Would we ever know?

> It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness.

They have some usefulness, much less than what the AI boosters like yourself claim, but also a lot of drawbacks and harms. Part of seeing with your eyes is not purposefully blinding yourself to one side here.


they are useful to those that enjoy wasting time.

> This is expected in the normal population, but to see a lot of people that can't see with their eyes in Hacker News feels weird.

You are replying to an account created less than 60 days ago.


This is a bit unfair. Hackers are born every day.

In relation to the quality of its comment, I thought it was fair. He just completely made up the part about false positives.

And in case people don't know, antirez has been complaining about the quality of HN comments for at least a year, especially after the AI topic took over on HN.

It is still better than Lobsters or other places though.


Bots too, vanderBOT!

I used to work in robotics, and can't remember the password for my usual username so I pulled this one out of thin air years ago

Another potentially usable trick is the following: based on the observation that a longer token budget improves model performance, one could generate solutions using a lot of thinking budget, then ask the LLM to turn the trace into a more compact one, and later SFT on that. That said, I have the feeling the result of the paper will likely be hard to apply in practice without affecting other capabilities, and/or not superior to other techniques that provide similar improvement in sampling.

This is very similar to what I stated here: https://x.com/antirez/status/2038241755674407005

That is, basically, you just rotate and use the 4 bit centroids given that the distribution is known, so you don't need min/max, and notably, once you have that, you can multiply using a lookup table of 256 elements when doing the dot product, since two vectors have the same scale. The important point here is that for this use case it is NOT worth using the 1 bit residual, since for the dot product you have a fast path for vector-x-quant, but not for quant-x-quant, and anyway the recall difference is small. However, on top of that, remember that new learned embeddings tend to use all the components in a decent way, so you gain some recall for sure, but not as much as in the case of KV cache.
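To make the lookup-table part concrete, here is a toy Python sketch. The 16 centroids below are evenly spaced placeholders chosen for illustration, not the values derived from the known post-rotation distribution, and the rotation step itself is omitted. What it shows is only that with fixed, shared centroids the quant-x-quant dot product reduces to summing entries of a 16 x 16 = 256 element table.

```python
from typing import List, Sequence

# Placeholder centroids (an assumption): -1.875, -1.625, ..., 1.875.
# A real implementation would use centroids fitted to the known
# distribution of the rotated components.
CENTROIDS = [-1.875 + 0.25 * i for i in range(16)]

# All pairwise centroid products, indexed by (code_a << 4) | code_b.
LUT = [CENTROIDS[a] * CENTROIDS[b] for a in range(16) for b in range(16)]


def quantize(vec: Sequence[float]) -> List[int]:
    """Map each component to the index (0..15) of its nearest centroid."""
    return [min(range(16), key=lambda c: abs(CENTROIDS[c] - x)) for x in vec]


def dot_quantized(qa: Sequence[int], qb: Sequence[int]) -> float:
    """Approximate dot product of two quantized vectors via the 256-entry LUT."""
    return sum(LUT[(a << 4) | b] for a, b in zip(qa, qb))


codes = quantize([0.1, -0.3, 1.9])
approx = dot_quantized(codes, codes)
```

Because the centroids are fixed in advance, no per-vector min/max or scale has to be stored, and new vectors can be quantized without touching existing ones.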


I think the main benefits are:

- Slightly improved recall

- Faster index creation

- Online addition of vectors without recalibrating the index

The last point in particular is a big infrastructure win I think.


Featuring the ELO score as the main benchmark in the chart is very misleading. The big dense Gemma 4 model does not seem to reach Qwen 3.5 27B dense model in most benchmarks. This is obviously what matters. The small 2B / 4B models are interesting and may potentially be better ASR models than specialized ones (not just for performances but since they are going to be easily served via llama.cpp / MLX and front-ends). Also interesting for "fast" OCR, given they are vision models as well. But other than that, the release is a bit disappointing.

Public benchmarks can be trivially faked. Lmarena is a bit harder to fake and is human-evaluated.

I agree it's misleading for them to hyper-focus on one metric, but public benchmarks are far from the only thing that matters. I place more weight on Lmarena scores and private benchmarks.


Concentrating on LMArena cost Meta many hundreds of billions of dollars and lots of people their jobs with the Llama 4 disaster.

Lm arena is so easy to game that it ceased to be a relevant metric over a year ago. People are not usable validators beyond "yeah that looks good to me", nobody checks if the facts are correct or not.

Alibaba maintains its own separate version of lm-arena where the prompts are fixed and you simply judge the outputs

https://aiarena.alibaba-inc.com/corpora/arena/leaderboard


I agree; LMArena died for me with the Llama 4 debacle. And not only the gamed scores, but seeing with shock and horror the answers people found good. It does test something though: the general "vibe" and how human/friendly and knowledgeable it _seems_ to be.

It's easy to game and human evaluation data has its trade-offs, but it's way easier to fake public benchmark results. I wish we had a source of high quality private benchmark results across a vast number of models like Lmarena. Having high quality human evaluation data would be a plus too.

Well there was this one [0] which is a black box but hasn't really been kept up to date with newer releases. Arguably we'd need lots of these since each one could be biased towards some use case or sell its test set to someone with more VC money than sense.

[0] https://oobabooga.github.io/benchmark.html


I know Arc AGI 2 has a private test set and they have a good amount of results[0] but it's not a conventional benchmark.

Looking around, SWE Rebench seems to have decent protection against training data leaks[1]. Kagi has one that is fully private[2]. One on HuggingFace that claims to be fully private[3]. SimpleBench[4]. HLE has a private test set apparently[5]. LiveBench[6]. Scale has some private benchmarks but not a lot of models tested[7]. vals.ai[8]. FrontierMath[9]. Terminal Bench Pro[10]. AA-Omniscience[11].

So I guess we do have some decent private benchmarks out there.

[0] https://arcprize.org/leaderboard

[1] https://swe-rebench.com/about

[2] https://help.kagi.com/kagi/ai/llm-benchmark.html

[3] https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard

[4] https://simple-bench.com/

[5] https://agi.safe.ai/

[6] https://livebench.ai/

[7] https://labs.scale.com/leaderboard

[8] https://www.vals.ai/about

[9] https://epoch.ai/frontiermath/

[10] https://github.com/alibaba/terminal-bench-pro

[11] https://artificialanalysis.ai/articles/aa-omniscience-knowle...


I am unable to shake that the Chinese models all perform awfully on the private arc-agi 2 tests.

But is arc-agi really that useful though? Nowadays it seems to me that it's just another benchmark that needs to be specifically trained for. Maybe the Chinese models just didn't focus on it as much.

Doing great on public datasets and underperforming on private benchmarks is not a good look.

Is it though? Do we still have the expectation that LLMs will eventually be able to solve problems they haven't seen before? Or do we just want the most accurate auto complete at the cheapest price at this point?

It indicates that there's a good chance that they have trained on the test set, making the eval scores useless. Even if you have given up on the dream of generalization entirely, you can't meaningfully compare models which have trained on test to those which have not.

You're not supposed to train for benchmarks, that's their entire point.

I find the benchmarks to be suggestive but not necessarily representative of reality. It's really best if you have your own use case and can benchmark the models yourself. I've found the results to be surprising and not what these public benchmarks would have you believe.

It does quite well on my limited/not-so-scientific private tests (note the tests don't include coding tests): https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...

I can't find what ELO score specifically the benchmark chart is referring to, it's just labeled "Elo Score". It's not Codeforces ELO, as Gemma 4 31B has 2150 for that, which would be off the given chart.

It's referring to the Lmsys Leaderboard/Lmarena/Arena.ai[0]. It's very well-known in the LLM community for being one of the few sources of human evaluation data.

[0] https://arena.ai/leaderboard/chat


It does not matter at all, especially when talking about Qwen, who've been caught on some questionable benchmark claims multiple times.

The latest implementation of Picol has a Tcl-alike [expr] implemented in 40 lines of code that uses Pratt-style parsing: https://github.com/antirez/picol/blob/main/picol.c#L490
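For readers who haven't seen the technique, this is roughly what Pratt-style (precedence climbing) parsing looks like. It is a standalone Python sketch for illustration only, not a translation of the Picol code; the operator set, the binding-power numbers, and the names `tokenize` / `evaluate` are all arbitrary choices.

```python
import operator
import re

TOKEN_RE = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")

# Binding power of each binary operator; higher binds tighter.
BINARY_BP = {"+": 10, "-": 10, "*": 20, "/": 20}
APPLY = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}
UNARY_BP = 30  # unary minus binds tighter than any binary operator


def tokenize(src):
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def evaluate(src):
    tokens = tokenize(src)
    pos = 0

    def parse(min_bp):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        # Prefix position: number, parenthesized group, or unary minus.
        if tok == "(":
            left = parse(0)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
        elif tok == "-":
            left = -parse(UNARY_BP)
        elif tok in BINARY_BP or tok == ")":
            raise SyntaxError(f"unexpected token {tok!r}")
        else:
            left = float(tok)
        # Infix position: keep consuming operators that bind tighter
        # than the context we were called from.
        while pos < len(tokens) and tokens[pos] in BINARY_BP:
            op = tokens[pos]
            bp = BINARY_BP[op]
            if bp <= min_bp:
                break  # equal power stops here, giving left associativity
            pos += 1
            left = APPLY[op](left, parse(bp))
        return left

    result = parse(0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result
```

The whole precedence table lives in one dictionary, which is a large part of why implementations of this style stay so short.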

Love Picol, and love this! When I first revisited Tcl, I was a bit miffed about needing [expr] but now really appreciate both it and the normal Tcl syntax.

I have a Tcl Improvement Proposal (TIP 676) currently being voted on which introduces an alternative compact form of calculation. The implementation uses a Pratt parser: https://core.tcl-lang.org/tcl/file?ci=cgm-equals-command&nam... which directly generates bytecode rather than creating a parse tree.

> If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

If I understand correctly the model can carry only very limited memory among tests, so it looks like it's not really possible for the model to specialize itself under these assumptions.


Exactly. I was reading all the other comments and wondering why many looked like they were talking of something else.

