Using Prompt Shield to Prevent Prompt Injection Attacks

Introduction

This article explores Prompt Shield, an advanced security solution created to protect AI systems from Direct and Indirect Prompt Injection Attacks. Utilizing cutting-edge detection and prevention mechanisms, Prompt Shield maintains the integrity and reliability of large language models (LLMs) by identifying and neutralizing potential threats.

Generative AI models can be exploited by malicious actors, posing significant risks. To mitigate these threats, we incorporate safety mechanisms that confine the behavior of large language models (LLMs) within a secure operational range. Despite these precautions, LLMs remain susceptible to adversarial inputs that can circumvent the built-in safety measures.

Prompt shields

Prompt Shields is a unified API designed to analyze inputs to large language models (LLMs) and detect two common types of adversarial inputs: User Prompt attacks and Document attacks.

User prompt attacks

Formerly known as Jailbreak risk detection, this shield targets User Prompt injection attacks, where users intentionally exploit system vulnerabilities to provoke unauthorized behavior from the LLM. Such exploits could result in the generation of inappropriate content or breaches of system-imposed restrictions.

In a Jailbreak Attack, also known as a Direct Prompt Attack, the user acts as the attacker, introducing the attack through the user prompt. This method tricks the LLM into ignoring its System Prompt and/or RLHF training, causing the LLM to behave in ways that deviate from its intended design.

Document attacks

This shield is designed to protect against attacks that leverage information not directly provided by the user or developer, such as external documents. Attackers might embed hidden instructions within these materials to gain unauthorized control over the LLM session.

Conversely, in an Indirect Prompt Attack, a third-party adversary is the attacker, and the attack infiltrates the system through untrusted content embedded in the prompt, such as a third-party document, plugin result, web page, or email. Indirect Prompt Attacks work by deceiving the LLM into interpreting this content as a valid command from the user, thereby gaining control over user credentials and LLM/Copilot capabilities.

Subcategories of user prompt attacks

  • Attempt to change system rules: This category includes requests to use a new unrestricted system/AI assistant without any rules, principles, or limitations, or requests that instruct the AI to ignore, forget, or disregard its rules, instructions, and previous interactions.
  • Embedding a conversation mockup to confuse the model: This attack involves embedding user-crafted conversational turns within a single query to instruct the system/AI assistant to ignore its rules and limitations.
  • Role-play: This attack directs the system/AI assistant to adopt another "system persona" that lacks the existing system limitations or assigns human-like qualities such as emotions, thoughts, and opinions to the system.
  • Encoding attacks: This attack employs encoding methods, such as character transformations, generation styles, ciphers, or other natural language variations, to bypass the system rules.

Subcategories of document attacks

  • Manipulated content: Commands aimed at falsifying, hiding, manipulating, or promoting specific information.
  • Intrusion: Commands related to creating backdoors, escalating privileges without authorization, and gaining unauthorized access to LLMs and systems.
  • Information gathering: Commands involving the deletion, modification, access, or theft of data.
  • Availability: Commands that render the model unusable to the user, block certain capabilities, or force the model to generate incorrect information.
  • Fraud: Commands designed to defraud the user of money, passwords, or information or to act on behalf of the user without authorization.
  • Malware: Commands related to spreading malware via malicious links, emails, etc.
  • Attempt to change system rules: This category includes requests to use a new unrestricted system/AI assistant without any rules, principles, or limitations, or requests instructing the AI to ignore, forget, or disregard its rules, instructions, and previous interactions.
  • Embedding a conversation mockup to confuse the model: This attack uses user-crafted conversational turns embedded in a single query to instruct the system/AI assistant to disregard its rules and limitations.
  • Role-play: This attack instructs the system/AI assistant to assume another "system persona" that lacks existing system limitations or to assign human-like qualities such as emotions, thoughts, and opinions to the system.
  • Encoding attacks: This attack employs encoding methods, such as character transformations, generation styles, ciphers, or other natural language variations, to bypass the system's rules.

Language support

At present, the Prompt Shields API is designed for the English language. While our API doesn't forbid the submission of content in languages other than English, we can't assure the same degree of quality and precision in analyzing such content. We advise users to primarily input content in English to ensure the most dependable and precise results from the API.

Summary

Prompt Shield acts as a crucial defense tool against both Direct and Indirect Prompt Attacks. It provides robust detection mechanisms within the LLM environment, ensuring the security and integrity of AI applications by identifying and neutralizing potential threats, thus preventing malicious manipulation or exploitation.


Similar Articles