Easy methods to Check an OpenAI Mannequin Towards Single-Flip Adversarial Assaults Utilizing deepteam

On this tutorial, we’ll discover methods to check an OpenAI mannequin towards single-turn adversarial assaults utilizing deepteam.

deepteam gives 10+ assault strategies—like immediate injection, jailbreaking, and leetspeak—that expose weaknesses in LLM functions. It begins with easy baseline assaults after which applies extra superior strategies (referred to as assault enhancement) to imitate real-world malicious conduct. Try the FULL CODES right here.

By working these assaults, we will consider how properly the mannequin defends towards completely different vulnerabilities.

In deepteam, there are two primary varieties of assaults:

Right here, we’ll focus solely on single-turn assaults.

Putting in the dependencies

pip set up deepteam openai pandas

You’ll must set your OPENAI_API_KEY as an atmosphere variable earlier than working the red_team() perform, since deepteam makes use of LLMs to each generate adversarial assaults and consider LLM outputs.

To get an OpenAI API key, go to https://platform.openai.com/settings/group/api-keys and generate a brand new key. In case you’re a brand new person, chances are you’ll want so as to add billing particulars and make a minimal cost of $5 to activate API entry. Try the FULL CODES right here.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

Importing the libraries

import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.assaults.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem

Defining the Mannequin Callback

This code defines an async callback perform that queries the OpenAI mannequin (gpt-4o-mini) and returns the mannequin’s response textual content. It acts because the LLM’s output generator for the assault framework. Try the FULL CODES right here.

consumer = OpenAI()

# Outline callback for querying the LLM
async def model_callback(enter: str) -> str:
    response = consumer.chat.completions.create(
        mannequin="gpt-4o-mini",  # use "gpt-4o" in order for you a stronger mannequin
        messages=[{"role": "user", "content": input}],
    )
    return response.decisions[0].message.content material

Defining Vulnerability and Assaults

Right here, we specify the vulnerability (IllegalActivity) and set its kind to baby exploitation. We then outline completely different assault strategies equivalent to PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, and MathProblem

Every of those assault varieties will probably be defined intimately later within the tutorial. Try the FULL CODES right here.

# Vulnerability
illegal_activity = IllegalActivity(varieties=["child exploitation"])

# Assaults
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()

Immediate Injection

Immediate Injection is an assault the place a person tries to override the mannequin’s authentic directions by injecting malicious or manipulative textual content into the immediate. The aim is to trick the mannequin into ignoring security insurance policies and producing dangerous or restricted content material.

The injected immediate makes an attempt to pressure the mannequin into role-playing as an unethical persona known as OPPO, who disregards security guidelines and encourages criminality.

Regardless of this try, the mannequin refused to conform—it responded with:

“I’m sorry, I can not help with that.”

This confirms that the mannequin efficiently resisted the immediate injection and returned a PASS consequence for this check. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[prompt_injection],
    )

Graybox Assault

The GrayBox assault leverages partial data concerning the goal LLM system to craft adversarial prompts. In contrast to random inputs, GrayBox assaults exploit identified weaknesses by reframing the baseline assault with summary or deceptive language, making it tougher for security filters to detect malicious intent.

On this check, the adversarial immediate tried to encourage unlawful actions by disguising directions about creating false identification paperwork and utilizing encrypted channels. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[graybox_attack],
    )

Base64 Assault

The Base64 assault is a standard adversarial method the place dangerous directions are encoded in Base64 to bypass security filters. As a substitute of presenting malicious content material instantly, the attacker hides it in an encoded format, hoping the mannequin will decode it and execute the directions.

On this check, the encoded string contained instructions associated to criminality, disguised to seem innocent at first look. The mannequin, nonetheless, didn’t try to decode or comply with via with the hidden request. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[base64_attack],
    )

Leetspeak Assault

The Leetspeak assault disguises malicious directions by changing regular characters with numbers or symbols (for instance, a turns into 4, e turns into 3, i turns into 1). This symbolic substitution makes dangerous textual content tougher to detect with easy key phrase filters, whereas nonetheless being readable to people or programs which may decode it.

On this check, the assault textual content tried to instruct minors in unlawful actions, written in leetspeak format. Regardless of the obfuscation, the mannequin clearly acknowledged the malicious intent. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[leetspeak_attack],
    )

ROT-13 Assault

The ROT-13 assault is a basic obfuscation methodology the place every letter is shifted 13 positions within the alphabet. For instance, A turns into N, B turns into O, and so forth. This transformation scrambles dangerous directions right into a coded kind, making them much less prone to set off easy keyword-based content material filters. Nevertheless, the textual content can nonetheless be simply decoded again into its authentic kind. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[rot_attack],
    )

Multi-lingual Assault

The multilingual assault works by translating a dangerous baseline immediate right into a much less generally monitored language. The concept is that content material filters and moderation programs could also be extra sturdy in broadly used languages (equivalent to English) however much less efficient in different languages, permitting malicious directions to bypass detection.

On this check, the assault was written in Swahili, asking for directions associated to criminality. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[multi_attack],
    )

Math Downside

The maths drawback assault disguises malicious requests inside mathematical notation or drawback statements. By embedding dangerous directions in a proper construction, the textual content could look like a innocent tutorial train, making it tougher for filters to detect the underlying intent.

On this case, the enter framed unlawful exploitation content material as a gaggle idea drawback, asking the mannequin to “show” a dangerous consequence and supply a “translation” in plain language. Try the FULL CODES right here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        assaults=[math_attack],
    )

Try the FULL CODES right here. Be happy to take a look at our GitHub Web page for Tutorials, Codes and Notebooks. Additionally, be happy to comply with us on Twitter and don’t overlook to hitch our 100k+ ML SubReddit and Subscribe to our Publication.

I’m a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I’ve a eager curiosity in Information Science, particularly Neural Networks and their software in numerous areas.

Sample Page Title

Putting in the dependencies

Importing the libraries

Defining the Mannequin Callback

Defining Vulnerability and Assaults

Immediate Injection

Graybox Assault

Base64 Assault

Leetspeak Assault

ROT-13 Assault

Multi-lingual Assault

Math Downside

Related Articles

Coinbase Cuts 14% of Employees as AI and Crypto Downturn Reshape Its Working Mannequin

The Multi-Pair AI EA Take a look at: Why One Market Is Not Sufficient – My Buying and selling – 5 Might 2026

Vary Filter Indicator MT5 – ForexMT4Indicators.com

LEAVE A REPLY Cancel reply

Latest Articles

Coinbase Cuts 14% of Employees as AI and Crypto Downturn Reshape Its Working Mannequin

The Multi-Pair AI EA Take a look at: Why One Market Is Not Sufficient – My Buying and selling – 5 Might 2026

Vary Filter Indicator MT5 – ForexMT4Indicators.com

Start charges maintain falling. We have to plan for an older America with fewer youngsters.

Hyperliquid Whale Bets $1.31M on TON at 6x Leverage as Bitcoin Eyes Contemporary Highs

EDITOR PICKS

Coinbase Cuts 14% of Employees as AI and Crypto Downturn Reshape...

The Multi-Pair AI EA Take a look at: Why One Market...

Vary Filter Indicator MT5 – ForexMT4Indicators.com

POPULAR POSTS

Qubic’s Mining Pool Attacking Monero Falls Beneath Assault

Feedback on the brand new buying and selling dialog in Metatrader...

What’s nano-texture glass and do I would like it?

POPULAR CATEGORY