
How true is this? How does a regulated industry confirm the model itself wasn't trained with malicious intent?


Why would it matter if the model is trained with malicious intent? It's a pure function. The harness controls security policies.


Much like a developer can insert a backdoor disguised as a "bug", so can an LLM that was trained to do so.

One way you could probably do it: identify a commonly used library that can be misused in a way that allows a time-of-check to time-of-use (TOCTOU) exploit, then train the LLM to use the library incorrectly in exactly that way.
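To make the TOCTOU idea concrete, here is a minimal Python sketch of what such a plausible-looking "bug" could look like. It uses only the standard library; the function names and the `attacker` callback are hypothetical, and the callback just stands in for a concurrent process that wins the race. This is an illustration of the bug class, not a claim about any particular library or model.

```python
import os
import stat


def read_racy(path, attacker=None):
    """Check-then-use: validates the path, then opens it in a separate step."""
    # Time of check: the path is a regular file and not a symlink.
    if not os.path.islink(path) and os.path.isfile(path):
        if attacker is not None:
            attacker()  # hypothetical hook: simulates a swap inside the race window
        # Time of use: the path may now point somewhere else.
        with open(path) as f:
            return f.read()
    return None


def read_safer(path):
    """Open once, then validate the descriptor that was actually opened."""
    # O_NOFOLLOW refuses a symlink as the final path component
    # (falls back to 0 on platforms that lack the flag).
    flags = os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        return None
    with os.fdopen(fd) as f:
        # fstat inspects the open file itself, not whatever the path names now.
        if not stat.S_ISREG(os.fstat(f.fileno()).st_mode):
            return None
        return f.read()
```

In review, the racy version looks like careful defensive code, which is the point: if something swaps the file for a symlink between the check and the `open`, it reads whatever the symlink targets. The safer version has no gap between the check and the use because both apply to the same file descriptor.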



