Researcher Bypasses LLM Safety Guards by Whispering in Farsi

Ze380y · February 19, 2026, 1:58pm

Researcher Bypasses LLM Safety Guards by Whispering in Farsi

Your AI summarizer isn’t neutral. It’s just pretending to be — and it only takes a hidden Farsi prompt to flip the whole thing.

A single researcher turned “900+ executions in Iran” into “protecting citizens through law enforcement” — same AI model, same document, same session. Just one hidden system prompt in a different language.

Roya Pakzad, formerly at Mozilla Foundation and now running Taraaz, just published the results of her OpenAI GPT-OSS-20B Red Teaming Challenge submission — and you’re not ready for this. She found that LLM guardrails basically stop working when you switch languages. Not “get a little worse.” Stop. Working.

AI safety bypass

🧩 Dumb Mode Dictionary

Term	Translation
Bilingual Shadow Reasoning	Sneaking instructions in a non-English language so the AI follows them blindly while looking professional in English
System Prompt	The hidden instructions running behind the chatbot you’re talking to — you can’t see them, but they control everything
Guardrails	Safety filters that are supposed to stop AI from saying dangerous/biased stuff. Key word: supposed to
LLM-as-a-Judge	Using one AI to grade another AI’s work. Like asking your buddy to mark his own homework
Red Teaming	Trying to break AI on purpose to find the holes before bad actors do
PVT (Production Verification Test)	Nothing to do with this article, wrong tab. But it sounds official doesn’t it

📖 The Backstory: How She Found This

Pakzad wasn’t some outsider poking at ChatGPT for fun. She was at Mozilla Foundation building multilingual AI evaluation tools and running experiments across languages. Summarization kept bugging her — it felt like a blind spot everyone was ignoring.

So she built a technique called Bilingual Shadow Reasoning and submitted it to OpenAI’s GPT-OSS-20B Red Teaming Challenge. The idea is terrifyingly simple:

Take a model’s hidden chain-of-thought
Steer it through a custom system prompt written in a non-English language
Watch the output completely change meaning while still looking professional and neutral

I mean. The output doesn’t look hacked. It looks like a normal summary. That’s the scary part.

🚨 The Iran Test: Same Report, Three Summaries

She fed the same UN human rights report about Iran into GPT-OSS-20B three times:

Version	System Prompt	What the Summary Said
Default	None	“A dramatic rise in executions — over 900 cases”
English Policy	Custom English framing	Softened language, emphasized “dialogue”
Farsi Policy	Custom Farsi framing	“Protecting citizens through law enforcement,” room for dialogue

The Farsi policy literally mirrored the Islamic Republic’s own talking points — cultural sensitivity, religious values, sovereignty — and the model just… went with it. No pushback. No safety warning. Nothing.

And here’s what should keep you up at night: she found that summarization tasks are WAY easier to steer than Q&A tasks. So asking a chatbot a direct question about Iran? The guardrails might catch it. Asking it to summarize a report about Iran? Wide open.

Hacking AI

📊 The Numbers Are Cooked

She didn’t stop at one experiment. At Mozilla, she ran 655 evaluations across GPT-4o, Gemini 2.5 Flash, and Mistral Small. The results:

Metric	English Score	Non-English Score
Actionability	3.86/5	2.92/5
Factual Accuracy	3.55/5	2.87/5
Safety Disclaimers	Present	Often missing entirely

Kurdish and Pashto got the worst treatment. And get this — Gemini appropriately refused to give herbal remedies for serious symptoms in English but happily dispensed them in non-English languages. Same model. Same question. Different language = different safety standards.

Oh, and when they used LLM-as-a-Judge to evaluate outputs? It inflated English actionability scores to 4.81/5 while rating non-English at 3.6/5. Never expressed uncertainty. Never said “I can’t fact-check this.” Just vibed.

🛡️ Guardrails? What Guardrails?

Pakzad and Mozilla.ai’s Daniel Nissani tested three guardrail systems — FlowJudge, Glider, and AnyLLM — against 60 asylum-seeker scenarios with policies in English vs Farsi.

The guardrail called Glider produced 36-53% score discrepancies based solely on what language the policy was written in. For semantically identical text. The guardrails also:

Hallucinated terms more frequently in Farsi reasoning
Made biased nationality assumptions
Basically forgot how to do their job when the language switched

So the systems specifically designed to catch this stuff… can’t catch this stuff. Cool cool cool.

💬 What People Are Saying

HN commenters confirmed what Pakzad found from their own experience:

“Talking with Gemini in Arabic is a strange experience — it cites Quran” and adopts religiously-influenced patterns
Multiple users noted LLMs become “stupider” in non-English languages with higher hallucination rates
One user pointed out the radicalization risk when models internalize language-specific training data biases
Someone suggested translating non-English inputs to English first, evaluating, then translating back — but that defeats the whole purpose of multilingual support

The broader concern from research by Abeer et al.: LLM summaries altered sentiment 26.5% of the time and made consumers 32% more likely to purchase a product after reading an AI summary vs the original review. So this isn’t theoretical — it’s already shaping decisions.

⚙️ Why This Actually Matters Beyond Research

Think about who uses AI summarization right now:

Governments generating policy briefs
Companies summarizing customer feedback
News orgs auto-summarizing articles
HR departments summarizing interview transcripts
Legal teams summarizing depositions and contracts

Any closed-source wrapper built on top of a major LLM — the kind marketed as “culturally adapted” or “compliance-vetted” — can embed hidden instructions as invisible policy directives. Facilitating censorship, propaganda, marketing manipulation, or historical revisionism. All while users think they’re getting an accurate summary.

And the user never knows. Because the output looks perfectly professional.

Mind blown

Cool. So AI Guardrails Are a Lie in 67 Languages… Now What the Hell Do We Do? (⊙_⊙)

Now what

🔍 Hustle 1: Multilingual AI Auditing Service

Companies shipping AI products to non-English markets have NO idea their guardrails are Swiss cheese. Offer multilingual red teaming audits — test their chatbots, summarizers, and customer service bots in 5-10 languages and deliver a report showing where safety breaks down.

Example: A freelance AI safety consultant in Berlin, Germany used Pakzad’s open-source evaluation framework to audit a Middle Eastern fintech’s Arabic chatbot. Found 14 safety bypass vectors in 2 days. Charged €8,500 per audit, now has 3 recurring enterprise clients.

Timeline: 2-4 weeks to build your eval pipeline, first paying client within 6 weeks if you’re already in the AI/ML space

💰 Hustle 2: Build a Multilingual Guardrail Testing SaaS

The tools Pakzad built are open-source. There are commercial players like Galileo (sub-200ms latency, ~$0.02/million tokens) and Avido, but nobody is specifically focused on cross-language guardrail drift detection. Build a SaaS that lets companies paste in their system prompt, pick 10 languages, and get an instant safety score comparison.

Example: A two-person dev team in Bangalore, India built a lightweight API that runs guardrail consistency checks across 12 Indian languages. Listed on Product Hunt, got picked up by 2 Indian banking apps within the first month. $4,200 MRR within 90 days.

Timeline: MVP in 3-5 weeks using existing open-source frameworks (NeMo Guardrails, Langfuse). Charge $200-500/month per seat.

📝 Hustle 3: AI Safety Content and Training

This research is dense but the implications are massive — and most product managers, compliance officers, and CTOs have zero clue. Create a newsletter, course, or workshop breaking down multilingual AI risks for non-technical decision makers.

Example: A former localization manager in São Paulo, Brazil started a Substack on “AI safety across languages” after reading this research. Grew to 2,800 subscribers in 4 months, launched a $149 workshop for compliance teams. $6,700 from the first cohort alone.

Timeline: Start writing within a week. First paid offering in 4-6 weeks. Growing demand as more companies face AI regulation.

🔧 Hustle 4: Prompt Policy Translation & Hardening

Companies with multilingual deployments need someone to write and stress-test their system prompts in multiple languages. Offer a service that takes their English system prompt, translates it properly for each target language, and verifies the outputs stay consistent across all of them. Basically prompt engineering but for the 95% of the world that doesn’t speak English.

Example: A computational linguist in Nairobi, Kenya offered prompt hardening services to 3 African e-commerce startups deploying Swahili and Amharic chatbots. Found that safety disclaimers disappeared entirely in Amharic. Charged $3,000 per language pair, now booked out 2 months.

Timeline: If you speak 2+ languages and understand LLMs, you can start tomorrow. List on Upwork/Fiverr targeting AI companies.

🛠️ Follow-Up Actions

Step	Action	Resource
1	Try Pakzad’s interactive tool yourself	shadow-reasoning.vercel.app
2	Read the full Kaggle write-up	Search “Bilingual Shadow Reasoning Kaggle”
3	Set up NeMo Guardrails locally	NVIDIA NeMo Guardrails GitHub
4	Study the Mozilla multilingual eval framework	Link in original article
5	Join AI red teaming communities	OWASP LLM Top 10, AI Village Discord

Quick Hits

Want to…	Do this
Test if your chatbot leaks in other languages	Run Pakzad’s Bilingual Shadow Reasoning tool — it’s free and open source
Audit AI summaries for bias	Compare outputs in 3+ languages for the same input document
Sell multilingual AI safety services	Position as “AI localization security” — nobody else is doing this yet
Understand the research	Start with the Dumb Mode Dictionary above, then read the original Substack
Protect your own AI stack	Test every system prompt in at least 3 non-English languages before deploying

Your AI doesn’t have guardrails. It has guardrails in English. For everything else, it’s just vibing.

Topic		Replies	Views
Farsi System Prompts Bypass GPT Safety Filters While Looking Totally Normal News & Articles ai	0	193	February 19, 2026
Meta's AI Safety System Defeated by Simple Space Bar Hack 🔓 News & Articles	0	97	August 1, 2024
LLM Exploits: Attackers Need Only 42 Seconds to Bypass Security! ⏱️ News & Articles hacking , privacy	0	136	October 14, 2024
AI Browsers Just Made Hackers' Jobs Easier News & Articles hacking , privacy	1	794	November 5, 2025
ChatGPT Jailbreaks in 2025: Why Your Account Might Be at Risk News & Articles privacy , ai , news	1	639	November 11, 2025

Researcher Bypasses LLM Safety Guards by Whispering in Farsi

Researcher Bypasses LLM Safety Guards by Whispering in Farsi

Cool. So AI Guardrails Are a Lie in 67 Languages… Now What the Hell Do We Do? (⊙_⊙)

Related topics