Strengthening LLMs Against Prompt Injection

The Dual LLM Model to protect against prompt injection

The Dual LLM Model Concept was introduced by Simon Willison in April 2023. The concept of prompt injection—where a malicious command is embedded within legitimate input to influence a Large Language Model (LLM) into unintended actions—poses a significant challenge in developing secure AI assistants. As businesses increasingly use LLM-powered AI assistants that interact with private data and trusted tools, mitigating prompt injection risks is critical.

One compelling approach to minimizing these vulnerabilities is through a Dual LLM pattern, where two separate LLM instances, labeled as "Privileged" and "Quarantined," work together to securely process different types of data. The Privileged LLM handles trusted user inputs and has access to perform actions (such as sending emails or searching databases). In contrast, the Quarantined LLM manages untrusted data, such as inputs from public websites or user-generated content, without tool access. This pattern ensures that sensitive commands and untrusted inputs are processed in isolated spaces, reducing potential exploitations.

To ensure sensitive commands remain secure, a Controller layer mediates between the two models. For instance, if a user requests an email summary, the Privileged LLM instructs the Controller to retrieve emails and pass them to the Quarantined LLM, which processes the content. The Controller then handles the Quarantined LLM’s output as a protected variable, which prevents the Privileged LLM from being exposed to any potentially malicious content. This setup effectively keeps maliciously manipulated outputs from entering the action-capable Privileged LLM’s processing stream.

However, this layered setup is not entirely foolproof. Social engineering remains a risk, where attackers might trick users into copying and pasting secure information out of the AI environment. While adding extra prompts or warnings for potential copy-paste threats might help, vigilance is essential.

Sazakan is a dedicated tool that aids developers in identifying and mitigating prompt injection risks within AI models. By simulating prompt injection and testing the robustness of dual-model setups, Sazakan enables the safe and effective development of AI systems that safeguard user data. It provides a crucial layer of security by allowing companies to proactively address risks before they compromise sensitive information.

Ultimately, building secure AI systems requires balancing functionality with a complex array of safeguards, and tools like Sazakan make this endeavor far more achievable. By applying rigorous security frameworks and continuously testing for vulnerabilities, developers can create AI solutions that responsibly manage prompt injection risks, providing safer AI assistants for users everywhere.