Six weeks into a brokerage agent build, a native-speaker tester sent us one line of feedback that reshaped the project: 'It answers like a news anchor.' Grammatically perfect, dialectally alien. Buyers were writing Khaleeji and Egyptian; our test set was Modern Standard Arabic (MSA).
The common belief
'The model supports Arabic' — true, and dangerously incomplete. Benchmarks skew MSA; real customers text in dialect, switch to English mid-sentence, transliterate into Arabizi (3 for ع, 7 for ح), and expect the register of a helpful local, not a broadcast.
If your Arabic test set is MSA, you've tested a language your customers don't text in.
What actually bites
Dialect register: Khaleeji, Egyptian, and Levantine differ enough that a single 'Arabic prompt' can read as formal to speakers of one dialect and simply odd to speakers of another. We maintain dialect-specific few-shot examples and test each set with native speakers before launch.
Code-switching and Arabizi: real threads mix scripts and languages mid-sentence. The intake layer must detect and follow the customer's mix rather than forcing a lane.
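To make "detect the mix" concrete, here is a minimal sketch of a token-level classifier, not a description of any production intake layer. The names `classify_mix` and `lanes_present` are hypothetical, the sample message is made up, and the Arabizi rule is a deliberately crude assumption: a Latin-letter token that also contains one of the digits mentioned above (3 or 7).

```python
import re

# Standard Arabic letters (hamza through yeh).
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
LATIN_LETTER = re.compile(r"[A-Za-z]")
# Digits standing in for Arabic letters in Arabizi. Only the two
# named in the text (3 for ع, 7 for ح); extend for your own traffic.
ARABIZI_DIGITS = set("37")
PUNCTUATION = ".,!?;:\"'()؟،؛"


def classify_mix(text: str) -> dict[str, int]:
    """Count tokens per lane: Arabic script, Arabizi, plain Latin, other."""
    counts = {"arabic": 0, "arabizi": 0, "latin": 0, "other": 0}
    for token in text.split():
        word = token.strip(PUNCTUATION)
        if ARABIC_LETTER.search(word):
            counts["arabic"] += 1
        elif LATIN_LETTER.search(word) and ARABIZI_DIGITS & set(word):
            # Assumption: Latin letters plus a 3 or 7 means Arabizi.
            counts["arabizi"] += 1
        elif LATIN_LETTER.search(word):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


def lanes_present(counts: dict[str, int]) -> list[str]:
    """Lanes the customer actually used, so the reply can follow the mix."""
    return [lane for lane in ("arabic", "arabizi", "latin") if counts[lane] > 0]


# Illustrative message only, not taken from real traffic.
message = "3ayez شقة in Dubai Marina"
counts = classify_mix(message)
print(counts)
print(lanes_present(counts))
```

A heuristic this simple will misfire (it would tag 'mp3' as Arabizi, for example), and it cannot tell an English brand name from transliterated Arabic. The point is the output shape: a per-message picture of which lanes are in play, rather than a single 'language' label that forces the customer into one.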
RTL is product work, not CSS work: bidirectional text with embedded English brand names, numerals, and URLs breaks naive rendering. Budget design time, not a stylesheet pass.
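One small, mechanical piece of this can be sketched in code: isolating embedded left-to-right values so they do not reorder the Arabic around them. The sketch below wraps each interpolated value in the Unicode FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL ISOLATE (U+2069) characters. The helper names `isolate` and `render`, the template, and the field values are all hypothetical, and the approach assumes the rendering surface honours Unicode bidi isolates.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction taken from the content itself
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the isolate


def isolate(value: str) -> str:
    """Wrap a value so its direction cannot leak into surrounding text."""
    return f"{FSI}{value}{PDI}"


def render(template: str, **fields: str) -> str:
    """Fill a template, isolating every interpolated field."""
    return template.format(**{name: isolate(value) for name, value in fields.items()})


# Made-up example values: a price with a currency code and an English project name.
reply = render(
    "السعر {price} في {project}",
    price="1,200,000 AED",
    project="Marina Gate 2",
)
print(repr(reply))
```

In HTML the equivalent is the `<bdi>` element or `dir="auto"` on the wrapping element. Either way, this covers only the string-assembly part; layout, alignment, and how numerals and URLs look inside the chat surface still need a designer's eye.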
Evaluation needs humans: automated metrics miss register entirely. Our launch gate includes native-speaker review across the dialects the client's traffic actually shows — we re-tested two prompts post-launch on one project because they read 'too formal' to Khaleeji speakers. Small thing; mattered.
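A launch gate like this can be written down as a simple check, which at least makes it hard to skip. The following is a sketch under stated assumptions, not our actual tooling: the `Review` record, the `missing_signoffs` function, the 5% traffic threshold, and every number in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    dialect: str
    reviewer_is_native: bool
    approved: bool


def missing_signoffs(
    traffic_share: dict[str, float],
    reviews: list[Review],
    min_share: float = 0.05,  # assumed cut-off; pick one that fits the client
) -> list[str]:
    """Dialects with meaningful traffic that lack an approving native review."""
    required = {d for d, share in traffic_share.items() if share >= min_share}
    signed = {r.dialect for r in reviews if r.reviewer_is_native and r.approved}
    return sorted(required - signed)


# Illustrative figures only.
traffic = {"khaleeji": 0.60, "egyptian": 0.30, "levantine": 0.02}
reviews = [
    Review("khaleeji", reviewer_is_native=True, approved=True),
    Review("egyptian", reviewer_is_native=True, approved=False),
]
blockers = missing_signoffs(traffic, reviews)
print(blockers)  # launch is blocked while this list is non-empty
```

The check encodes the two conditions from the paragraph above: the dialects come from the client's real traffic, and only a native speaker's approval counts.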
When MSA-first is fine
Government, legal, and formal-document contexts expect MSA — there, dialect tuning is wasted effort. Match register to channel: WhatsApp is dialect country; an official portal isn't.
- Build dialect test sets from real traffic, not benchmarks.
- Handle Arabizi and code-switching in the intake layer.
- Treat RTL as design scope.
- Gate launch on native-speaker review.


