Hacker News | past | comments | ask | show | jobs | submit | sjmaplesec's comments

Pass in a skill and it'll roast the contents:

"This is not just useless - it is an insult to the very concept of functionality."


I ran some evals to see which Anthropic model uses skills best: Opus, Sonnet, or Haiku.

I was pretty impressed by how well Haiku completed various tasks when given skills.


The link to all the review scans is here (scores mostly in the 50-70% range): https://tessl.io/registry/skills/github/googleworkspace/cli


There's so much more we can do around activation and skill creation. Looking at the eval results, there are even cases where the added context makes the agent worse.

Scenario 5, test 1: 72% -> 22%

https://tessl.io/eval-runs/019cc02f-bb26-76e0-a7c9-598a7337e...


The review eval tests a skill's language, activation, etc. If you're using Tessl, I guess you could quickly move it all into a skill and then run an eval on that. This checks whether the way you write the instructions is actually well understood by the agent.


An eval is to an LLM as a test is to code.
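To make the analogy concrete, here's a toy eval written exactly like a unit test. The grader and threshold are made up for illustration; real evals use stronger graders than phrase matching:

```python
# A hypothetical eval shaped like a unit test: instead of asserting exact
# output, we score the model's answer against a rubric and assert a threshold.

def score_answer(answer: str, required_phrases: list[str]) -> float:
    """Fraction of required phrases that appear in the answer (toy grader)."""
    hits = sum(1 for p in required_phrases if p.lower() in answer.lower())
    return hits / len(required_phrases)

def test_refund_policy_eval():
    # In a real eval this string would come from a model call; stubbed here.
    answer = "Refunds are issued within 14 days of purchase, minus shipping."
    score = score_answer(answer, ["14 days", "shipping"])
    assert score >= 0.8, f"eval score {score:.0%} below threshold"

test_refund_policy_eval()
print("eval passed")
```

Same workflow as code tests: run it on every change, and a failing eval means the change regressed behavior.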


Tessl can generate the evals, both to test against Anthropic best practices and to run scenarios with and without the skill to check whether it's actually helping.


You can add this as a skill, or as part of a skill, so you don't need to keep prompting the same things.


No, the context can be human-created just as much as LLM-generated. The suggestions are based on Anthropic best practices and help the agents activate and use the skills better, make the text clearer for the agent, etc.


This resonates with my experience: we have dozens of internal “playbooks” and prompt snippets floating around, and nobody knows which ones still work after model changes. If you can make “skill quality” visible over time (regressions, drift), that’s valuable. Do you have a CI integration where you can pin a skill version and fail builds if eval scores drop?

