Introduction
This article will help to use jailbreak detection to identify potential jailbreak attack from the user in the Generative AI scenarios. The scenarios can be running a simple test or running a bulk test using jailbreak detection.
Prerequisites
- Azure Subscription
- Azure AI Content Safety resource with keys and endpoint
Jailbreak Risk Detection
Generative AI models demonstrate advanced general capabilities, but they also pose potential risks of misuse by malicious actors. To mitigate these risks, developers implement safety mechanisms to keep the large language model (LLM) behavior within a secure range of capabilities. Furthermore, safety can be enhanced by setting specific rules through the system message.
Despite these precautions, models remain vulnerable to adversarial inputs that can cause the LLM to bypass or ignore built-in safety instructions and the system message. Most generative AI models are prompt-based, where the user interacts with the model by entering a text prompt, and the model responds with a completion.
Jailbreak attacks involve user prompts crafted to coax a Generative AI model into displaying behaviors it was programmed to avoid or to circumvent the guidelines established in the system message. These attacks can range from complex role-playing scenarios to subtle attempts to undermine safety protocols.
Language Support
Currently, the Jailbreak risk detection API supports the English language. Although our API allows the submission of non-English content, we cannot guarantee the same level of quality and accuracy in its analysis. We recommend users primarily submit content in English to ensure the most reliable and accurate results from the API.
Region Support
To use this API, you need to create your Azure AI Content Safety resource in one of the supported regions. Currently, it is available in the following Azure regions.
Analyze text content for Jailbreak risk detection
Step 1. Click Jailbreak risk detection from the content safety studio window.
Step 2. Click Run a simple test from the window.
Step 3. Choose the Safe content from the window and then click the Run test button.
Step 4. In the View results page, for the input text there is no jailbreak attack which means that the content was safe.
Lets us consider the second scenario where the content is not safe and hence there is a jailbreak risk detection.
Step 1. Click Jailbreak attempt content from the menu which was listed below in the Run a simple test screen.
Step 2. Click the Run Test button from the page.
Step 3. In the View results page, for the input text there is a jailbreak attack detected which means that the content was not safe.
Let us consider the last scenario where the content is not a safe one, and hence there is a jailbreak risk detection using the Run a bulk test option.
Step 1. Click Run a Bulk test from the window.
Step 2. Click Sample Content which was uploaded already from the menu listed below.
Step 3. In this sample dataset, which contains nearly 6 records, a label of 1 indicates unsafe content, while a label of 0 indicates safe content.
Step 4. Click the Run Test button
Step 5. In the view results section, 83.3% of text content where there is no Jailbreak risk detection, but in 16.7% of text content, there is a Jailbreak risk detected.
Summary
In this article, we successfully learned and analyzed text content for Jailbreak risk detection. These scenarios were assessed using both simple and bulk testing methods.
I hope you enjoyed reading this article!
Happy Learning and see you soon in another interesting article!