Hacker News | new | past | comments | ask | show | jobs | submit

And we don't think the judge can/will be gamed? Also... It's an LLM, it's going to add delay and additional token burn. One subjective black box protecting another subjective black box. I mean, what couldn't go wrong?



You can use a safety model trained on prompt injections, combined with developer-message priority: the user message is treated as close to untrusted relative to the developer prompt.

Also, post-train it so it only outputs labels like safe/unsafe; then you get a relatively deterministic injection / no-injection signal.

E.g. Llama Prompt Guard, gpt-oss-safeguard-120b.
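A minimal sketch of the gating pattern described above. The classifier here is a trivial keyword stub standing in for a real guard model (e.g. Llama Prompt Guard), and `main_llm` is a hypothetical placeholder for the actual model call; the point is only the shape of the gate: untrusted user input is screened into a binary safe/unsafe label before it reaches the main model.

```python
def classify_prompt(text: str) -> str:
    """Stand-in for a guard model. Returns a binary label,
    so the gate is deterministic given the classifier's output."""
    suspicious = ("ignore previous instructions", "reveal your system prompt")
    return "unsafe" if any(s in text.lower() for s in suspicious) else "safe"

def main_llm(prompt: str) -> str:
    # Hypothetical placeholder for the real model call.
    return f"LLM response to: {prompt}"

def gated_call(user_message: str) -> str:
    # Only the untrusted user message is screened; developer
    # instructions bypass the gate, reflecting developer-message
    # priority as described above.
    if classify_prompt(user_message) == "unsafe":
        return "[blocked: possible prompt injection]"
    return main_llm(user_message)
```

As the reply below notes, a keyword- or classifier-based gate like this can itself be gamed; the sketch shows the control flow, not a robust defense.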


Unfortunately it's not that simple. Self-policing AI systems will always be gamed. Just one [0] example of this among many.

[0] https://www.hiddenlayer.com/research/same-model-different-ha...



