There's so much more we can do around activation and skills creation. Looking at the eval results, there are even cases where the context makes the agent worse.
The review eval tests the language, activation, etc. of skills. I guess you could quickly move it all into a skill and then run an eval on that if you're using Tessl. This checks whether the way you write the instructions is being well understood by the agent.
No, the context can be human-created as much as it can be LLM-generated. The suggestions are based on Anthropic's best practices and help the agents activate and use the skills better, make the text clearer for the agent, etc.
This resonates with my experience: we have dozens of internal “playbooks” and prompt snippets floating around, and nobody knows which ones still work after model changes. If you can make “skill quality” visible over time (regressions, drift), that’s valuable. Do you have a CI integration where you can pin a skill version and fail builds if eval scores drop?
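For what it's worth, even without first-class support you can bolt this onto any CI yourself. A minimal sketch of such a gate, where the report path, JSON schema (`skill`/`version`/`score`), and threshold are all hypothetical placeholders, not any real Tessl output format:

```python
# Hypothetical CI gate: fail the build if a pinned skill's eval score
# drops below a threshold. Assumes the eval run emits a JSON report like
# {"skill": "...", "version": "...", "score": 0.87} -- schema is made up.
import json
import sys

THRESHOLD = 0.85  # minimum acceptable eval score (example value)

def gate(report_path: str, threshold: float = THRESHOLD) -> int:
    """Return 0 if the reported score meets the threshold, 1 otherwise."""
    with open(report_path) as f:
        report = json.load(f)
    score = report["score"]
    print(f"{report['skill']}@{report['version']}: score={score:.2f}")
    if score < threshold:
        print(f"FAIL: below threshold {threshold:.2f}", file=sys.stderr)
        return 1  # nonzero exit fails the CI job
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(gate(sys.argv[1]))
```

Pinning the skill version in the repo and diffing scores across runs would also give you the regression/drift visibility mentioned above.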
"This is not just useless - it is an insult to the very concept of functionality."