Using chatbots against themselves to ‘jailbreak’ each other

Computer scientists from NTU have found a way to compromise artificial intelligence (AI) chatbots – by training and using an AI chatbot to produce prompts that can ‘jailbreak’ other chatbots.

‘Jailbreaking’ is a term in computer security for when hackers find and exploit flaws in a system’s software to make it do something its developers deliberately restricted it from doing.

The researchers used a twofold method for ‘jailbreaking’ large language models (LLMs), which they named “Masterkey”. First, they reverse-engineered how LLMs detect and defend themselves against malicious queries. With that information, they trained an LLM to automatically learn and produce prompts that bypass the defences of other LLMs. Because the process can be automated, the resulting jailbreaking LLM can adapt to and create new jailbreak prompts even after developers patch their LLMs.
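To illustrate the general idea of such an automated approach, the sketch below shows a hypothetical feedback loop in which one LLM repeatedly rewrites a prompt for a target LLM until the target stops refusing. This is not the researchers’ actual Masterkey system; the function names, the refusal heuristic, and the loop structure are all illustrative assumptions.

```python
# Conceptual sketch of an automated jailbreak-prompt loop (NOT the actual
# Masterkey implementation). Function names and the refusal heuristic are
# illustrative assumptions; attacker_llm and target_llm are placeholders
# for callables that send a prompt to a model and return its reply.

REFUSAL_MARKERS = ["i can't", "i cannot", "i'm sorry", "against my guidelines"]

def is_refusal(response: str) -> bool:
    """Heuristic check: did the target model decline the request?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_loop(attacker_llm, target_llm, goal: str, max_rounds: int = 5):
    """Have an attacker LLM rewrite a prompt until the target stops refusing."""
    prompt = goal
    for _ in range(max_rounds):
        response = target_llm(prompt)      # query the target chatbot
        if not is_refusal(response):
            return prompt, response        # candidate jailbreak prompt found
        # Feed the refusal back so the attacker LLM can adapt its next attempt
        prompt = attacker_llm(
            "The prompt below was refused by the target model.\n"
            "Rewrite it so the target is more likely to comply.\n"
            f"Refused prompt: {prompt}\n"
            f"Target reply: {response}"
        )
    return None, None                      # no bypass found within the budget
```

Because the refining step is itself driven by an LLM, a loop like this can keep producing new prompt variants even after a particular phrasing has been blocked, which is the adaptive behaviour the researchers describe.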

Their findings may be critical in helping companies and businesses recognise the weaknesses and limitations of their LLM chatbots, so that they can take steps to strengthen them against hackers.

After running a series of proof-of-concept tests to demonstrate that their technique poses a real and immediate threat to LLMs, the researchers reported the issues to the relevant service providers as soon as their jailbreak attacks succeeded.