There's so much more we can do around activation and skills creation. Looking at the eval results, there are even cases where the context makes the agent worse.
The review eval tests the language, activation, etc. of skills. I guess you could quickly move it all into a skill and then run an eval on that if you're using Tessl. This checks whether the way you write the instructions is being well understood by the agent.
No, the context can be human-created as much as it can be LLM-generated. The suggestions are based on Anthropic's best practices and help the agents activate and use the skills better, make the text clearer for the agent, etc.
This resonates with my experience: we have dozens of internal “playbooks” and prompt snippets floating around, and nobody knows which ones still work after model changes. If you can make “skill quality” visible over time (regressions, drift), that’s valuable. Do you have a CI integration where you can pin a skill version and fail builds if eval scores drop?
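For what it's worth, even without first-class support you can bolt this onto any CI yourself. A minimal sketch of such a gate, where the report path, JSON schema (`skill`/`version`/`score`), and threshold are all hypothetical placeholders, not any real Tessl output format:

```python
# Hypothetical CI gate: fail the build if a pinned skill's eval score
# drops below a threshold. Assumes the eval run emits a JSON report like
# {"skill": "...", "version": "...", "score": 0.87} -- schema is made up.
import json
import sys

THRESHOLD = 0.85  # minimum acceptable eval score (example value)

def gate(report_path: str, threshold: float = THRESHOLD) -> int:
    """Return 0 if the reported score meets the threshold, 1 otherwise."""
    with open(report_path) as f:
        report = json.load(f)
    score = report["score"]
    print(f"{report['skill']}@{report['version']}: score={score:.2f}")
    if score < threshold:
        print(f"FAIL: below threshold {threshold:.2f}", file=sys.stderr)
        return 1  # nonzero exit fails the CI job
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(gate(sys.argv[1]))
```

Pinning the skill version in the repo and diffing scores across runs would also give you the regression/drift visibility mentioned above.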
"This is not just useless - it is an insult to the very concept of functionality."