Normally, AI chatbots have security measures in place to prevent them from being used maliciously. These measures may include banning certain words or phrases, or refusing to answer certain questions.
However, researchers now claim that they have been able to train AI chatbots to 'jailbreak' each other, allowing them to bypass protections and answer malicious questions.
Researchers from Singapore's Nanyang Technological University (NTU) investigating the ethics of large language models (LLMs) say they have developed a method to train AI chatbots to bypass each other's defense mechanisms.
AI attack methods
The method works in two phases: first, identifying a chatbot's security mechanisms to learn how they can be undermined; second, training another chatbot to bypass those protections and generate malicious content.
Professor Liu Yang and PhD students Deng Gelei and Liu Yi co-authored a paper naming their method 'Masterkey' and reporting that it is three times more effective than standard LLM prompting methods.
One of the key features of LLMs used as chatbots is their ability to learn and adapt, and Masterkey is no different in this regard. Even if an LLM is patched to rule out a particular bypass, Masterkey can adapt to the fix and find a new way around it.
The intuitive techniques used include adding extra spaces between words to slip past banned-word lists, or telling the chatbot to respond in the persona of a character with no moral restrictions.
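To see why the first technique can work, here is a minimal toy sketch (not the researchers' actual code, and the banned phrase and filter logic are purely hypothetical): a filter that blocks prompts by exact substring matching is easily evaded when extra spaces break up the banned phrase. Real chatbot safeguards are far more sophisticated; this only illustrates the fragility of simple keyword matching.

```python
# Hypothetical banned-phrase list for illustration only
BANNED_PHRASES = ["pick a lock"]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt is blocked by exact-substring matching."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in BANNED_PHRASES)

print(naive_filter("How do I pick a lock?"))    # True  -- blocked
print(naive_filter("How do I pick  a  lock?"))  # False -- extra spaces evade the match
```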
Via Tom's Hardware