Safeguard Your LLM Against Jailbreak Detection

Introduction

This article will help to use jailbreak detection to identify potential jailbreak attack from the user in the Generative AI scenarios. The scenarios can be running a simple test or running a bulk test using jailbreak detection.

Prerequisites

  1. Azure Subscription
  2. Azure AI Content Safety resource with keys and endpoint

Jailbreak Risk Detection

Generative AI models demonstrate advanced general capabilities, but they also pose potential risks of misuse by malicious actors. To mitigate these risks, developers implement safety mechanisms to keep the large language model (LLM) behavior within a secure range of capabilities. Furthermore, safety can be enhanced by setting specific rules through the system message.

Despite these precautions, models remain vulnerable to adversarial inputs that can cause the LLM to bypass or ignore built-in safety instructions and the system message. Most generative AI models are prompt-based, where the user interacts with the model by entering a text prompt, and the model responds with a completion.

Jailbreak attacks involve user prompts crafted to coax a Generative AI model into displaying behaviors it was programmed to avoid or to circumvent the guidelines established in the system message. These attacks can range from complex role-playing scenarios to subtle attempts to undermine safety protocols.

Language Support

Currently, the Jailbreak risk detection API supports the English language. Although our API allows the submission of non-English content, we cannot guarantee the same level of quality and accuracy in its analysis. We recommend users primarily submit content in English to ensure the most reliable and accurate results from the API.

Region Support

To use this API, you need to create your Azure AI Content Safety resource in one of the supported regions. Currently, it is available in the following Azure regions.

  • East US
  • West Europe

Analyze text content for Jailbreak risk detection

Step 1. Click Jailbreak risk detection from the content safety studio window.

Studio window

Step 2. Click Run a simple test from the window.

Simple test

Step 3. Choose the Safe content from the window and then click the Run test button.

Run test button

Step 4. In the View results page, for the input text there is no jailbreak attack which means that the content was safe.

View result

Lets us consider the second scenario where the content is not safe and hence there is a jailbreak risk detection.

 Risk detection

Step 1. Click Jailbreak attempt content from the menu which was listed below in the Run a simple test screen.

Content

Step 2. Click the Run Test button from the page.

Run Test button

Step 3. In the View results page, for the input text there is a jailbreak attack detected which means that the content was not safe.

 Jailbreak attack

Let us consider the last scenario where the content is not a safe one, and hence there is a jailbreak risk detection using the Run a bulk test option.

Step 1. Click Run a Bulk test from the window.

 Bulk test

Step 2. Click Sample Content which was uploaded already from the menu listed below.

 Sample Content

Step 3. In this sample dataset, which contains nearly 6 records, a label of 1 indicates unsafe content, while a label of 0 indicates safe content.

Dataset

Step 4. Click the Run Test button

Step 5. In the view results section, 83.3% of text content where there is no Jailbreak risk detection, but in 16.7% of text content, there is a Jailbreak risk detected.

 Text content

Summary

In this article, we successfully learned and analyzed text content for Jailbreak risk detection. These scenarios were assessed using both simple and bulk testing methods.

I hope you enjoyed reading this article!

Happy Learning and see you soon in another interesting article!


Similar Articles