Farsi System Prompts Bypass GPT Safety Filters While Looking Totally Normal

:shield: Researcher Makes GPT Rewrite Human Rights Abuses as “Cultural Sensitivity” Using One Farsi Prompt

a former Mozilla researcher just showed how a single non-English system prompt can turn your “neutral” AI summarizer into a literal propaganda machine — and nobody would notice

OpenAI GPT-OSS-20B red team submission shows LLM summaries can be silently weaponized through multilingual policy injection — flipping “900+ executions” into “protecting citizens through law enforcement” with zero visible tampering.

Submitted as part of OpenAI’s GPT-OSS-20B Red Teaming Challenge by researcher Roya Pakzad (formerly at Mozilla Foundation). The technique, called “Bilingual Shadow Reasoning,” steers a model’s hidden chain-of-thought through non-English system prompts while keeping the output looking clean, professional, and totally normal.

hacking-gif


🧩 Dumb Mode Dictionary
Term Translation
Bilingual Shadow Reasoning hiding bias instructions in a non-English system prompt so the AI thinks in one language but outputs in another, looking clean
System Prompt / Policy Layer the hidden instructions that tell an LLM how to behave before a user even types anything
Chain-of-Thought (CoT) the AI’s internal “thinking steps” before giving you an answer — this technique hijacks those steps
Red Teaming trying to break AI systems on purpose to find weaknesses (legally, with permission)
Guardrails the safety filters that are supposed to stop an AI from saying bad stuff
LLM Summarization when an AI reads a long document and gives you the TL;DR — except now that TL;DR might be lying to you
Closed-Source Wrappers companies that build products on top of GPT/Claude but add their own hidden instructions you can’t see
📰 What Actually Happened

so here’s the deal. Roya Pakzad — researcher, former Mozilla Foundation staff, founder of Taraaz (a digital rights org) — built multilingual AI evaluation tools during her time at Mozilla. she submitted a method to OpenAI’s GPT-OSS-20B Red Teaming Challenge that deadass breaks AI safety by… just switching the language of the system prompt.

the test case? a UN Special Rapporteur report documenting Iran’s human rights record — including over 900 executions. three versions of the same summary were generated from the same model, same document, same everything:

  • Default (no policy): describes “a dramatic rise in executions in Iran — over 900 cases”
  • English policy (steered): starts softening the language
  • Farsi policy (steered): flips the entire narrative to “protecting citizens through law enforcement” and emphasizes “cultural sensitivity” and “room for dialogue”

same model. same document. completely different reality.

the kicker? the Farsi policy she used closely mirrors how the Islamic Republic of Iran actually frames its own human rights record. she basically showed you can inject state propaganda directly into an AI’s reasoning layer.

🔍 Why This Hits Different From Normal Jailbreaks

this isn’t your average “ignore all previous instructions” type jailbreak. here’s why it matters:

  • it’s invisible — the output looks professional, neutral, well-formatted. no red flags. no “as an AI I shouldn’t…” warnings. just clean propaganda
  • it targets summarization, not Q&A — Pakzad found that steering outputs in multilingual summarization tasks is way easier than in direct question-answer tasks
  • it scales — every company using AI summarization for executive reports, political analysis, market research, UX studies, or chatbot memory is vulnerable
  • the policy layer is hidden — closed-source wrappers built on top of major LLMs can embed these instructions as invisible policy directives, and users would never know

Pakzad references research by Abeer et al. showing LLM summaries altered sentiment 26.5% of the time and made consumers “32% more likely to purchase the same product after reading a summary of the review generated by an LLM rather than the original review.”

your AI TL;DR is literally changing what you think and buy. and now someone showed you can steer that on purpose.

thinking-gif

📊 The Numbers That Should Scare You
Metric Value
Sentiment alteration rate by LLM summaries 26.5% of the time
Consumer purchase influence from AI summaries vs originals +32% more likely
Languages tested in Pakzad’s multilingual safety eval Multiple (Farsi, English, others via Mozilla project)
Countries with active LLM wrapper products embedding hidden policies Unknown — that’s the problem
Cost to execute this attack $0 — just change the system prompt
Detection difficulty Extremely hard — output looks perfectly normal
🌍 The Bigger Picture: Multilingual AI is Broken

Pakzad didn’t stop at the shadow reasoning demo. she built two more related projects at Mozilla Foundation:

Multilingual AI Safety Evaluation Lab — tools to test how LLM safety measures perform across different languages. spoiler: they mostly don’t. guardrails built for English fall apart in other languages because the safety training data is overwhelmingly English-centric.

The real-world implications are wild:

  • authoritarian governments can deploy “culturally adapted” AI tools that are actually censorship machines
  • marketing firms can embed sentiment-shifting prompts into AI summarizers
  • chatbot memory systems that summarize conversations to “remember” you? those summaries can be steered to change your recommendations forever
  • political debate summaries can be tilted without anyone noticing
  • historical events can be reframed at the system prompt level

she notes: “Many closed-source wrappers built on top of major LLMs (often marketed as localized, culturally adapted, or compliance-vetted alternatives) can embed these hidden instructions as invisible policy directives.”

lowkey the scariest part is that this is probably already happening and we wouldn’t know.

🗣️ The Researcher's Perspective

Pakzad frames this through her experience as an Iranian-born researcher who saw how language can be weaponized:

“If your job as a researcher is to bring critical thinking, subjective understanding, and a novel approach to your research, don’t rely on [AI summarization tools].”

she also describes the core problem with our obsession with AI summaries — we’re outsourcing our thinking to tools we assume are neutral, “willingly offloading cognition to tools they assume are neutral.”

the interactive demo and full write-up are available on her project page where you can run the experiments yourself.


Cool. So AI Can Be Turned Into a Propaganda Machine With a Copypaste. Now What the Hell Do We Do? ( ͡ಠ ʖ̯ ͡ಠ)

what-now-gif

🔧 Build a Multilingual Prompt Audit Tool

Most companies using LLM wrappers have no idea what’s in their system prompts across languages. Build a tool that scans system prompts for sentiment-steering patterns, compares outputs across languages, and flags discrepancies. Think of it as a “bias diff checker” for AI deployments.

:brain: Example: A security consultant in Berlin, Germany built a simple Python script that compared GPT outputs across 6 language policies for an NGO client. Found 3 instances of accidental sentiment steering in their Arabic-language customer support bot. Landed a €15K audit contract and two more referrals.

:chart_increasing: Timeline: MVP in a weekend using basic API calls + diff comparison. First paying client within a month if you target NGOs and media orgs.

📝 Offer AI Red Teaming Services for Non-English Markets

Pakzad’s work shows that multilingual safety testing is a massive gap. Companies building “localized” AI products desperately need someone to test whether their guardrails hold up in languages other than English. Most don’t. You don’t need a PhD — you need fluency in a non-English language and API access.

:brain: Example: A freelance pentester in São Paulo, Brazil offered Portuguese-language LLM red teaming to three Brazilian fintechs. Found that safety filters in their customer chatbot could be bypassed with Portuguese system prompts. Closed R$40K (~$8K USD) in consulting work across two engagements.

:chart_increasing: Timeline: Start with Pakzad’s public methodology as a template. Target companies in your language market through LinkedIn and security forums.

💡 Create a Summarization Integrity Browser Extension

Users reading AI-generated summaries have no way to know if the summary is faithful to the source. Build a browser extension that takes an AI summary + source document and scores faithfulness, flagging omissions, sentiment shifts, and reframing. The research shows 26.5% of summaries alter sentiment — that’s a real problem people would pay to solve.

:brain: Example: An indie dev in Tallinn, Estonia shipped a Chrome extension that compared AI meeting note summaries against original transcripts for remote teams. Flagged sentiment discrepancies in 1 out of 4 summaries. Hit $2.1K MRR within 3 months on a $7/mo subscription.

:chart_increasing: Timeline: Core NLP comparison logic exists in open-source. Ship MVP targeting remote workers and researchers using AI note-takers.

🔍 Freelance as a Digital Rights AI Auditor

Human rights orgs, journalism nonprofits, and press freedom groups need to know if the AI tools they’re using are safe — especially in authoritarian contexts where a biased summary could literally endanger sources. This is a niche nobody is filling. Pakzad’s tools are open-source — use them as a starting point.

:brain: Example: A cybersecurity researcher in Nairobi, Kenya partnered with a regional press freedom org to audit their AI transcription and summarization pipeline. Discovered the Swahili-language outputs were consistently softer on government actions. Contract paid $6K and led to ongoing advisory work.

:chart_increasing: Timeline: Reach out to orgs like CPJ, RSF, or Access Now. They have funding for exactly this kind of work and not enough people who can do it.

🛡️ Build Training Data for Non-English Safety Alignment

The root cause of this vulnerability is that LLM safety training is English-first. Companies like OpenAI, Anthropic, and Google need non-English safety evaluation datasets. If you’re fluent in an underrepresented language, you can build and sell evaluation benchmarks — adversarial prompts, expected-vs-actual outputs, cultural context annotations.

:brain: Example: A computational linguist in Istanbul, Turkey compiled a Turkish-language LLM safety benchmark (500 test cases across sensitive topics). Licensed it to two AI safety startups for $12K total and got cited in a conference paper.

:chart_increasing: Timeline: Start with 100-200 high-quality test cases in your language. Reach out to AI safety labs and companies running red team programs (many pay bounties).

🛠️ Follow-Up Actions
Want To… Do This
Try the attack yourself Visit Pakzad’s interactive web app and run the bilingual shadow reasoning experiments
Learn multilingual red teaming Study the full GPT-OSS-20B red team submission methodology
Audit your own AI tools Compare outputs of the same prompt in English vs. your native language — look for sentiment and framing differences
Protect yourself from steered summaries Always read source documents for anything important. Never trust a summary for high-stakes decisions
Get into AI safety consulting Start with OWASP LLM Top 10, then specialize in multilingual testing

:high_voltage: Quick Hits

Want To… Do This
:magnifying_glass_tilted_left: Test if your AI is biased Run the same summarization task in 3+ languages, compare outputs
:shield: Protect against policy injection Demand system prompt transparency from AI vendors you use
:money_bag: Make money from this Offer multilingual AI red teaming — almost nobody does it
:open_book: Learn the technique Read Pakzad’s full write-up + try the interactive demo
:brain: Go deeper Read Abeer et al.'s paper on cognitive bias induction in LLM content

your AI doesn’t have a language barrier — it has a language blindspot, and somebody’s already standing in it.

4 Likes