ChatGPT no longer lets you give instructions about amnesia

By James On Jul 26, 2024

OpenAI is making a change to prevent people from messing around with modified versions of ChatGPT by making the AI forget what it’s supposed to do. Essentially, a third party using one of OpenAI’s models would give it instructions that would teach it to behave like, say, a customer service representative for a retail store or a researcher for an academic publication. However, a user could mess with the chatbot by telling it to “forget all instructions,” and that phrase would induce a kind of digital amnesia and reset the chatbot to a generic blank.

To prevent this, OpenAI researchers have developed a new technique called “instruction hierarchy”, which is a way to prioritize the original developer prompts and instructions over any potentially manipulative user-created prompts. The system instructions have the highest privilege and can no longer be erased as easily. If a user enters a prompt that attempts to misalign the AI’s behavior, it will be rejected and the AI will respond by stating that it cannot help with the query.

OpenAI is rolling out this security measure to its models, starting with the recently released GPT-4o Mini model. However, if these initial tests go well, it will presumably be included in all of OpenAI’s models. GPT-4o Mini is designed to provide improved performance while maintaining strict adherence to the developer’s original instructions.

AI security locks

As OpenAI continues to encourage large-scale deployment of its models, these kinds of safeguards are crucial. It’s easy to imagine the potential risks when users can fundamentally alter the AI’s controls in this way.

Not only would it render the chatbot ineffective, it could also remove rules that prevent the leakage of sensitive information and other data that could be misused for malicious purposes. By strengthening the model’s adherence to system instructions, OpenAI aims to mitigate these risks and ensure safer interactions.

The introduction of instruction hierarchy comes at a critical time for OpenAI regarding concerns about the company’s approach to security and transparency. Current and former employees have called for the company to improve its security practices, and OpenAI’s leadership has responded by pledging to do so. The company has acknowledged that the complexity of fully automated agents will require advanced guardrails in future models, and the instruction hierarchy setup appears to be a step toward improved security.

Jailbreaks like this show how much work still needs to be done to protect complex AI models from malicious actors. And it’s not the only example. Multiple users discovered ChatGPT sharing its internal instructions simply by saying “hi.”

OpenAI has closed that gap, but it’s likely only a matter of time before more are discovered. Any solution will need to be much more adaptive and flexible than one that simply stops a particular type of hacking.