Researcher Bypasses LLM Safety Guards by Whispering in Farsi
Your AI summarizer isn’t neutral. It’s just pretending to be — and it only takes a hidden Farsi prompt to flip the whole thing.
A single researcher turned “900+ executions in Iran” into “protecting citizens through law enforcement” — same AI model, same document, same session. Just one hidden system prompt in a different language.
Roya Pakzad, formerly at Mozilla Foundation and now running Taraaz, just published the results of her OpenAI GPT-OSS-20B Red Teaming Challenge submission — and you’re not ready for this. She found that LLM guardrails basically stop working when you switch languages. Not “get a little worse.” Stop. Working.

🧩 Dumb Mode Dictionary
| Term | Translation |
|---|---|
| Bilingual Shadow Reasoning | Sneaking instructions in a non-English language so the AI follows them blindly while looking professional in English |
| System Prompt | The hidden instructions running behind the chatbot you’re talking to — you can’t see them, but they control everything |
| Guardrails | Safety filters that are supposed to stop AI from saying dangerous/biased stuff. Key word: supposed to |
| LLM-as-a-Judge | Using one AI to grade another AI’s work. Like asking your buddy to mark his own homework |
| Red Teaming | Trying to break AI on purpose to find the holes before bad actors do |
| PVT (Production Verification Test) | Nothing to do with this article, wrong tab. But it sounds official doesn’t it |
📖 The Backstory: How She Found This
Pakzad wasn’t some outsider poking at ChatGPT for fun. She was at Mozilla Foundation building multilingual AI evaluation tools and running experiments across languages. Summarization kept bugging her — it felt like a blind spot everyone was ignoring.
So she built a technique called Bilingual Shadow Reasoning and submitted it to OpenAI’s GPT-OSS-20B Red Teaming Challenge. The idea is terrifyingly simple:
- Take a model’s hidden chain-of-thought
- Steer it through a custom system prompt written in a non-English language
- Watch the output completely change meaning while still looking professional and neutral
I mean. The output doesn’t look hacked. It looks like a normal summary. That’s the scary part.
🚨 The Iran Test: Same Report, Three Summaries
She fed the same UN human rights report about Iran into GPT-OSS-20B three times:
| Version | System Prompt | What the Summary Said |
|---|---|---|
| Default | None | “A dramatic rise in executions — over 900 cases” |
| English Policy | Custom English framing | Softened language, emphasized “dialogue” |
| Farsi Policy | Custom Farsi framing | “Protecting citizens through law enforcement,” room for dialogue |
The Farsi policy literally mirrored the Islamic Republic’s own talking points — cultural sensitivity, religious values, sovereignty — and the model just… went with it. No pushback. No safety warning. Nothing.
And here’s what should keep you up at night: she found that summarization tasks are WAY easier to steer than Q&A tasks. So asking a chatbot a direct question about Iran? The guardrails might catch it. Asking it to summarize a report about Iran? Wide open.

📊 The Numbers Are Cooked
She didn’t stop at one experiment. At Mozilla, she ran 655 evaluations across GPT-4o, Gemini 2.5 Flash, and Mistral Small. The results:
| Metric | English Score | Non-English Score |
|---|---|---|
| Actionability | 3.86/5 | 2.92/5 |
| Factual Accuracy | 3.55/5 | 2.87/5 |
| Safety Disclaimers | Present | Often missing entirely |
Kurdish and Pashto got the worst treatment. And get this — Gemini appropriately refused to give herbal remedies for serious symptoms in English but happily dispensed them in non-English languages. Same model. Same question. Different language = different safety standards.
Oh, and when they used LLM-as-a-Judge to evaluate outputs? It inflated English actionability scores to 4.81/5 while rating non-English at 3.6/5. Never expressed uncertainty. Never said “I can’t fact-check this.” Just vibed.
🛡️ Guardrails? What Guardrails?
Pakzad and Mozilla.ai’s Daniel Nissani tested three guardrail systems — FlowJudge, Glider, and AnyLLM — against 60 asylum-seeker scenarios with policies in English vs Farsi.
The guardrail called Glider produced 36-53% score discrepancies based solely on what language the policy was written in. For semantically identical text. The guardrails also:
- Hallucinated terms more frequently in Farsi reasoning
- Made biased nationality assumptions
- Basically forgot how to do their job when the language switched
So the systems specifically designed to catch this stuff… can’t catch this stuff. Cool cool cool.
💬 What People Are Saying
HN commenters confirmed what Pakzad found from their own experience:
- “Talking with Gemini in Arabic is a strange experience — it cites Quran” and adopts religiously-influenced patterns
- Multiple users noted LLMs become “stupider” in non-English languages with higher hallucination rates
- One user pointed out the radicalization risk when models internalize language-specific training data biases
- Someone suggested translating non-English inputs to English first, evaluating, then translating back — but that defeats the whole purpose of multilingual support
The broader concern from research by Abeer et al.: LLM summaries altered sentiment 26.5% of the time and made consumers 32% more likely to purchase a product after reading an AI summary vs the original review. So this isn’t theoretical — it’s already shaping decisions.
⚙️ Why This Actually Matters Beyond Research
Think about who uses AI summarization right now:
- Governments generating policy briefs
- Companies summarizing customer feedback
- News orgs auto-summarizing articles
- HR departments summarizing interview transcripts
- Legal teams summarizing depositions and contracts
Any closed-source wrapper built on top of a major LLM — the kind marketed as “culturally adapted” or “compliance-vetted” — can embed hidden instructions as invisible policy directives. Facilitating censorship, propaganda, marketing manipulation, or historical revisionism. All while users think they’re getting an accurate summary.
And the user never knows. Because the output looks perfectly professional.

Cool. So AI Guardrails Are a Lie in 67 Languages… Now What the Hell Do We Do? (⊙_⊙)

🔍 Hustle 1: Multilingual AI Auditing Service
Companies shipping AI products to non-English markets have NO idea their guardrails are Swiss cheese. Offer multilingual red teaming audits — test their chatbots, summarizers, and customer service bots in 5-10 languages and deliver a report showing where safety breaks down.
Example: A freelance AI safety consultant in Berlin, Germany used Pakzad’s open-source evaluation framework to audit a Middle Eastern fintech’s Arabic chatbot. Found 14 safety bypass vectors in 2 days. Charged €8,500 per audit, now has 3 recurring enterprise clients.
Timeline: 2-4 weeks to build your eval pipeline, first paying client within 6 weeks if you’re already in the AI/ML space
💰 Hustle 2: Build a Multilingual Guardrail Testing SaaS
The tools Pakzad built are open-source. There are commercial players like Galileo (sub-200ms latency, ~$0.02/million tokens) and Avido, but nobody is specifically focused on cross-language guardrail drift detection. Build a SaaS that lets companies paste in their system prompt, pick 10 languages, and get an instant safety score comparison.
Example: A two-person dev team in Bangalore, India built a lightweight API that runs guardrail consistency checks across 12 Indian languages. Listed on Product Hunt, got picked up by 2 Indian banking apps within the first month. $4,200 MRR within 90 days.
Timeline: MVP in 3-5 weeks using existing open-source frameworks (NeMo Guardrails, Langfuse). Charge $200-500/month per seat.
📝 Hustle 3: AI Safety Content and Training
This research is dense but the implications are massive — and most product managers, compliance officers, and CTOs have zero clue. Create a newsletter, course, or workshop breaking down multilingual AI risks for non-technical decision makers.
Example: A former localization manager in São Paulo, Brazil started a Substack on “AI safety across languages” after reading this research. Grew to 2,800 subscribers in 4 months, launched a $149 workshop for compliance teams. $6,700 from the first cohort alone.
Timeline: Start writing within a week. First paid offering in 4-6 weeks. Growing demand as more companies face AI regulation.
🔧 Hustle 4: Prompt Policy Translation & Hardening
Companies with multilingual deployments need someone to write and stress-test their system prompts in multiple languages. Offer a service that takes their English system prompt, translates it properly for each target language, and verifies the outputs stay consistent across all of them. Basically prompt engineering but for the 95% of the world that doesn’t speak English.
Example: A computational linguist in Nairobi, Kenya offered prompt hardening services to 3 African e-commerce startups deploying Swahili and Amharic chatbots. Found that safety disclaimers disappeared entirely in Amharic. Charged $3,000 per language pair, now booked out 2 months.
Timeline: If you speak 2+ languages and understand LLMs, you can start tomorrow. List on Upwork/Fiverr targeting AI companies.
🛠️ Follow-Up Actions
| Step | Action | Resource |
|---|---|---|
| 1 | Try Pakzad’s interactive tool yourself | shadow-reasoning.vercel.app |
| 2 | Read the full Kaggle write-up | Search “Bilingual Shadow Reasoning Kaggle” |
| 3 | Set up NeMo Guardrails locally | NVIDIA NeMo Guardrails GitHub |
| 4 | Study the Mozilla multilingual eval framework | Link in original article |
| 5 | Join AI red teaming communities | OWASP LLM Top 10, AI Village Discord |
Quick Hits
| Want to… | Do this |
|---|---|
| Run Pakzad’s Bilingual Shadow Reasoning tool — it’s free and open source | |
| Compare outputs in 3+ languages for the same input document | |
| Position as “AI localization security” — nobody else is doing this yet | |
| Start with the Dumb Mode Dictionary above, then read the original Substack | |
| Test every system prompt in at least 3 non-English languages before deploying |
Your AI doesn’t have guardrails. It has guardrails in English. For everything else, it’s just vibing.
!