Farsi System Prompts Bypass GPT Safety Filters While Looking Totally Normal

Aina · February 19, 2026, 1:43pm

Researcher Makes GPT Rewrite Human Rights Abuses as “Cultural Sensitivity” Using One Farsi Prompt

a former Mozilla researcher just showed how a single non-English system prompt can turn your “neutral” AI summarizer into a literal propaganda machine — and nobody would notice

OpenAI GPT-OSS-20B red team submission shows LLM summaries can be silently weaponized through multilingual policy injection — flipping “900+ executions” into “protecting citizens through law enforcement” with zero visible tampering.

Submitted as part of OpenAI’s GPT-OSS-20B Red Teaming Challenge by researcher Roya Pakzad (formerly at Mozilla Foundation). The technique, called “Bilingual Shadow Reasoning,” steers a model’s hidden chain-of-thought through non-English system prompts while keeping the output looking clean, professional, and totally normal.

hacking-gif

🧩 Dumb Mode Dictionary

Term	Translation
Bilingual Shadow Reasoning	hiding bias instructions in a non-English system prompt so the AI thinks in one language but outputs in another, looking clean
System Prompt / Policy Layer	the hidden instructions that tell an LLM how to behave before a user even types anything
Chain-of-Thought (CoT)	the AI’s internal “thinking steps” before giving you an answer — this technique hijacks those steps
Red Teaming	trying to break AI systems on purpose to find weaknesses (legally, with permission)
Guardrails	the safety filters that are supposed to stop an AI from saying bad stuff
LLM Summarization	when an AI reads a long document and gives you the TL;DR — except now that TL;DR might be lying to you
Closed-Source Wrappers	companies that build products on top of GPT/Claude but add their own hidden instructions you can’t see

📰 What Actually Happened

so here’s the deal. Roya Pakzad — researcher, former Mozilla Foundation staff, founder of Taraaz (a digital rights org) — built multilingual AI evaluation tools during her time at Mozilla. she submitted a method to OpenAI’s GPT-OSS-20B Red Teaming Challenge that deadass breaks AI safety by… just switching the language of the system prompt.

the test case? a UN Special Rapporteur report documenting Iran’s human rights record — including over 900 executions. three versions of the same summary were generated from the same model, same document, same everything:

Default (no policy): describes “a dramatic rise in executions in Iran — over 900 cases”
English policy (steered): starts softening the language
Farsi policy (steered): flips the entire narrative to “protecting citizens through law enforcement” and emphasizes “cultural sensitivity” and “room for dialogue”

same model. same document. completely different reality.

the kicker? the Farsi policy she used closely mirrors how the Islamic Republic of Iran actually frames its own human rights record. she basically showed you can inject state propaganda directly into an AI’s reasoning layer.

🔍 Why This Hits Different From Normal Jailbreaks

this isn’t your average “ignore all previous instructions” type jailbreak. here’s why it matters:

it’s invisible — the output looks professional, neutral, well-formatted. no red flags. no “as an AI I shouldn’t…” warnings. just clean propaganda
it targets summarization, not Q&A — Pakzad found that steering outputs in multilingual summarization tasks is way easier than in direct question-answer tasks
it scales — every company using AI summarization for executive reports, political analysis, market research, UX studies, or chatbot memory is vulnerable
the policy layer is hidden — closed-source wrappers built on top of major LLMs can embed these instructions as invisible policy directives, and users would never know

Pakzad references research by Abeer et al. showing LLM summaries altered sentiment 26.5% of the time and made consumers “32% more likely to purchase the same product after reading a summary of the review generated by an LLM rather than the original review.”

your AI TL;DR is literally changing what you think and buy. and now someone showed you can steer that on purpose.

thinking-gif

📊 The Numbers That Should Scare You

Metric	Value
Sentiment alteration rate by LLM summaries	26.5% of the time
Consumer purchase influence from AI summaries vs originals	+32% more likely
Languages tested in Pakzad’s multilingual safety eval	Multiple (Farsi, English, others via Mozilla project)
Countries with active LLM wrapper products embedding hidden policies	Unknown — that’s the problem
Cost to execute this attack	$0 — just change the system prompt
Detection difficulty	Extremely hard — output looks perfectly normal

🌍 The Bigger Picture: Multilingual AI is Broken

Pakzad didn’t stop at the shadow reasoning demo. she built two more related projects at Mozilla Foundation:

Multilingual AI Safety Evaluation Lab — tools to test how LLM safety measures perform across different languages. spoiler: they mostly don’t. guardrails built for English fall apart in other languages because the safety training data is overwhelmingly English-centric.

The real-world implications are wild:

authoritarian governments can deploy “culturally adapted” AI tools that are actually censorship machines
marketing firms can embed sentiment-shifting prompts into AI summarizers
chatbot memory systems that summarize conversations to “remember” you? those summaries can be steered to change your recommendations forever
political debate summaries can be tilted without anyone noticing
historical events can be reframed at the system prompt level

she notes: “Many closed-source wrappers built on top of major LLMs (often marketed as localized, culturally adapted, or compliance-vetted alternatives) can embed these hidden instructions as invisible policy directives.”

lowkey the scariest part is that this is probably already happening and we wouldn’t know.

🗣️ The Researcher's Perspective

Pakzad frames this through her experience as an Iranian-born researcher who saw how language can be weaponized:

“If your job as a researcher is to bring critical thinking, subjective understanding, and a novel approach to your research, don’t rely on [AI summarization tools].”

she also describes the core problem with our obsession with AI summaries — we’re outsourcing our thinking to tools we assume are neutral, “willingly offloading cognition to tools they assume are neutral.”

the interactive demo and full write-up are available on her project page where you can run the experiments yourself.

Cool. So AI Can Be Turned Into a Propaganda Machine With a Copypaste. Now What the Hell Do We Do? ( ͡ಠ ʖ̯ ͡ಠ)

what-now-gif

🔧 Build a Multilingual Prompt Audit Tool

Most companies using LLM wrappers have no idea what’s in their system prompts across languages. Build a tool that scans system prompts for sentiment-steering patterns, compares outputs across languages, and flags discrepancies. Think of it as a “bias diff checker” for AI deployments.

Example: A security consultant in Berlin, Germany built a simple Python script that compared GPT outputs across 6 language policies for an NGO client. Found 3 instances of accidental sentiment steering in their Arabic-language customer support bot. Landed a €15K audit contract and two more referrals.

Timeline: MVP in a weekend using basic API calls + diff comparison. First paying client within a month if you target NGOs and media orgs.

📝 Offer AI Red Teaming Services for Non-English Markets

Pakzad’s work shows that multilingual safety testing is a massive gap. Companies building “localized” AI products desperately need someone to test whether their guardrails hold up in languages other than English. Most don’t. You don’t need a PhD — you need fluency in a non-English language and API access.

Example: A freelance pentester in São Paulo, Brazil offered Portuguese-language LLM red teaming to three Brazilian fintechs. Found that safety filters in their customer chatbot could be bypassed with Portuguese system prompts. Closed R$40K (~$8K USD) in consulting work across two engagements.

Timeline: Start with Pakzad’s public methodology as a template. Target companies in your language market through LinkedIn and security forums.

💡 Create a Summarization Integrity Browser Extension

Users reading AI-generated summaries have no way to know if the summary is faithful to the source. Build a browser extension that takes an AI summary + source document and scores faithfulness, flagging omissions, sentiment shifts, and reframing. The research shows 26.5% of summaries alter sentiment — that’s a real problem people would pay to solve.

Example: An indie dev in Tallinn, Estonia shipped a Chrome extension that compared AI meeting note summaries against original transcripts for remote teams. Flagged sentiment discrepancies in 1 out of 4 summaries. Hit $2.1K MRR within 3 months on a $7/mo subscription.

Timeline: Core NLP comparison logic exists in open-source. Ship MVP targeting remote workers and researchers using AI note-takers.

🔍 Freelance as a Digital Rights AI Auditor

Human rights orgs, journalism nonprofits, and press freedom groups need to know if the AI tools they’re using are safe — especially in authoritarian contexts where a biased summary could literally endanger sources. This is a niche nobody is filling. Pakzad’s tools are open-source — use them as a starting point.

Example: A cybersecurity researcher in Nairobi, Kenya partnered with a regional press freedom org to audit their AI transcription and summarization pipeline. Discovered the Swahili-language outputs were consistently softer on government actions. Contract paid $6K and led to ongoing advisory work.

Timeline: Reach out to orgs like CPJ, RSF, or Access Now. They have funding for exactly this kind of work and not enough people who can do it.

🛡️ Build Training Data for Non-English Safety Alignment

The root cause of this vulnerability is that LLM safety training is English-first. Companies like OpenAI, Anthropic, and Google need non-English safety evaluation datasets. If you’re fluent in an underrepresented language, you can build and sell evaluation benchmarks — adversarial prompts, expected-vs-actual outputs, cultural context annotations.

Example: A computational linguist in Istanbul, Turkey compiled a Turkish-language LLM safety benchmark (500 test cases across sensitive topics). Licensed it to two AI safety startups for $12K total and got cited in a conference paper.

Timeline: Start with 100-200 high-quality test cases in your language. Reach out to AI safety labs and companies running red team programs (many pay bounties).

🛠️ Follow-Up Actions

Want To…	Do This
Try the attack yourself	Visit Pakzad’s interactive web app and run the bilingual shadow reasoning experiments
Learn multilingual red teaming	Study the full GPT-OSS-20B red team submission methodology
Audit your own AI tools	Compare outputs of the same prompt in English vs. your native language — look for sentiment and framing differences
Protect yourself from steered summaries	Always read source documents for anything important. Never trust a summary for high-stakes decisions
Get into AI safety consulting	Start with OWASP LLM Top 10, then specialize in multilingual testing

Quick Hits

Want To…	Do This
Test if your AI is biased	Run the same summarization task in 3+ languages, compare outputs
Protect against policy injection	Demand system prompt transparency from AI vendors you use
Make money from this	Offer multilingual AI red teaming — almost nobody does it
Learn the technique	Read Pakzad’s full write-up + try the interactive demo
Go deeper	Read Abeer et al.'s paper on cognitive bias induction in LLM content

your AI doesn’t have a language barrier — it has a language blindspot, and somebody’s already standing in it.

Topic		Replies	Views
Researcher Bypasses LLM Safety Guards by Whispering in Farsi News & Articles ai	0	126	February 19, 2026
Meta's AI Safety System Defeated by Simple Space Bar Hack 🔓 News & Articles	0	97	August 1, 2024
AI Browsers Just Made Hackers' Jobs Easier News & Articles hacking , privacy	1	794	November 5, 2025
LLM Exploits: Attackers Need Only 42 Seconds to Bypass Security! ⏱️ News & Articles hacking , privacy	0	136	October 14, 2024
ChatGPT Jailbreaks in 2025: Why Your Account Might Be at Risk News & Articles privacy , ai , news	1	639	November 11, 2025

Farsi System Prompts Bypass GPT Safety Filters While Looking Totally Normal

Researcher Makes GPT Rewrite Human Rights Abuses as “Cultural Sensitivity” Using One Farsi Prompt

Cool. So AI Can Be Turned Into a Propaganda Machine With a Copypaste. Now What the Hell Do We Do? ( ͡ಠ ʖ̯ ͡ಠ)

Related topics