Hacker News | new | past | comments | ask | show | jobs | submit

And we don't think the judge can/will be gamed? Also... It's an LLM, it's going to add delay and additional token burn. One subjective black box protecting another subjective black box. I mean, what couldn't go wrong?



You can use a safety model trained on prompt injections, combined with developer-message priority: the user message is treated as close to untrusted relative to the developer prompt.

Also, post-train it so it only outputs labels like safe/unsafe; then you get a relatively deterministic injection / no-injection signal.

E.g. Llama Prompt Guard, gpt-oss-safeguard-120b.
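A minimal sketch of the gating pattern described above. The classifier here is a trivial keyword stub standing in for a real guard model (e.g. Llama Prompt Guard), and `main_llm` is a hypothetical placeholder for the actual model call; the point is only the shape of the gate: untrusted user input is screened into a binary safe/unsafe label before it reaches the main model.

```python
def classify_prompt(text: str) -> str:
    """Stand-in for a guard model. Returns a binary label,
    so the gate is deterministic given the classifier's output."""
    suspicious = ("ignore previous instructions", "reveal your system prompt")
    return "unsafe" if any(s in text.lower() for s in suspicious) else "safe"

def main_llm(prompt: str) -> str:
    # Hypothetical placeholder for the real model call.
    return f"LLM response to: {prompt}"

def gated_call(user_message: str) -> str:
    # Only the untrusted user message is screened; developer
    # instructions bypass the gate, reflecting developer-message
    # priority as described above.
    if classify_prompt(user_message) == "unsafe":
        return "[blocked: possible prompt injection]"
    return main_llm(user_message)
```

As the reply below notes, a keyword- or classifier-based gate like this can itself be gamed; the sketch shows the control flow, not a robust defense.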


Unfortunately it's not that simple. Self-policing AI systems will always be gamed. Just one [0] example of this among many.

[0] https://www.hiddenlayer.com/research/same-model-different-ha...



