You are sitting in the masjid on a Friday. The imam is six rows in front of you, speaking fluent Arabic for twenty minutes, and you follow roughly one word in five. You know the khutbah is about something. You know there was an ayah in there, and probably a hadith, and a story. You walk out having prayed but not having understood. This is the problem live khutbah translation is built to solve: put the phone on your lap, watch the Arabic transcribe in real time, and read the translation in your language as the khatib speaks.

This post is the technical explainer. What is actually happening between the imam's voice and the English (or Urdu, French, Turkish, Indonesian) sentences scrolling on your screen. Why real-time Arabic translation is harder than real-time English. What the system gets right, what it gets wrong, and when to trust it.

Who actually needs live khutbah translation

Two groups of people, mostly. The first is reverts and born Muslims who never learned Arabic beyond the prayer. In the United States, the United Kingdom, Canada, Australia, most of West Africa, a lot of South Asia, and increasingly western Europe, this is a majority of the congregation. The khutbah happens in Arabic either fully or for the opening and the ayah recitation, and the translation afterwards (if there is one) is a summary, not the khutbah itself.

The second is travellers. You are in Istanbul, Jakarta, Casablanca, or Dubai for the week, you walk into whichever masjid is closest at 1:15pm on Friday, and the khutbah is in a language you do not share with the local congregation. Arabic is the common denominator. Real-time Arabic translation closes the gap.

A third, smaller group: students of Arabic who can follow most of the khutbah but miss a few terms per minute and want a running safety net. Live translation is useful here as a dictionary that keeps pace with the speaker.

Why real-time Arabic translation is harder than English

Most real-time translation demos you have seen are English to something, or something to English. English automatic speech recognition (ASR) has had fifteen years of relentless attention, with models trained on tens of thousands of hours of clean speech. Arabic does not have that luxury, and khutbah audio is harder than the Arabic the research datasets are built on.

Three layers of difficulty stack up inside a single sermon.

Three kinds of Arabic in one khutbah

The khatib is not speaking one register. A typical twenty-minute khutbah contains:

Classical Arabic. Quranic verses quoted verbatim, plus hadith in their original wording. Vowels fully marked, archaic vocabulary, syntactic patterns an MSA speaker does not use in conversation.
Modern Standard Arabic. The body of the khutbah, the actual sermon content, delivered in the formal register used on Arabic television and in news broadcasts.
Dialectal asides. When the khatib tells a story or makes a joke, he drops into his own dialect: Egyptian, Levantine, Gulf, Moroccan Darija, Sudanese. Sometimes he stays in dialect for a minute before climbing back into MSA.

ASR models trained on MSA handle the middle layer well. On clean MSA speech, a fine-tuned Whisper model reaches around 12 percent word error rate, which is usable. On the Classical Arabic of Quranic quotation, the same model struggles because the phonetic patterns and vocabulary are older than the training distribution. On Moroccan Darija, the best models sit around 18 percent WER, and that is on clean isolated speech, not a speaker swapping in and out of dialect mid-sentence.
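For readers who have not met the metric: word error rate is the word-level edit distance between what was said and what was transcribed, divided by the number of words actually said. The sketch below is a generic textbook implementation, not the evaluation code behind the figures above, and the transliterated example in the usage note is purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A four-word reference transcribed with one word dropped, such as `word_error_rate("inna ma'a al-usri yusra", "inna ma'a yusra")`, scores 0.25. At 12 percent WER, roughly one word in eight is wrong.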

Code-switching

Many khatibs, especially in diaspora masjids, switch languages. A khutbah in the United States might open in Arabic, translate the ayah into English, give a story in English, then close in Arabic. A UK khutbah can splice Urdu. A Paris khutbah can splice French. The ASR has to detect the language change within a second and switch models, or it will transliterate English words as broken Arabic.

Masjid acoustics

Khutbahs are delivered in large rooms with hard surfaces and carpeted floors. Reverb smears consonants. HVAC hum sits at 60 to 120Hz and masks bass frequencies. Children cry, phones buzz, latecomers whisper. The microphone on your phone is a few metres from a speaker that is itself amplified through a PA, so you are recording a reverberant re-radiation of the original voice, not the voice itself. Every one of these factors raises the WER by a measurable amount.

The two-stage pipeline

Live khutbah translation, like most production real-time speech translation, runs as a cascade: audio goes through ASR first, and the ASR output goes through machine translation. Both stages run streaming, meaning they produce output while the imam is still speaking rather than waiting for a full sentence.

Stage 1: streaming ASR

The microphone feeds audio into a voice activity detector (VAD), a small neural net that decides which segments contain speech and which are silence or background noise. The VAD runs continuously, at roughly 30 times faster than real time on a phone CPU. When it marks a segment as speech, the segment goes to the ASR model in chunks, typically 200 to 500 milliseconds each.
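To make the gating-and-chunking step concrete, here is a minimal sketch. It swaps the neural VAD for a simple energy threshold, which is far cruder than what a production system would use. The sample rate, frame size, threshold, and the function name `speech_chunks` are all assumptions chosen for illustration; only the 200 to 500 millisecond chunk range comes from the description above.

```python
import math
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000  # Hz (assumed)
FRAME_MS = 30         # length of one speech/silence decision (assumed)
CHUNK_MS = 300        # inside the 200-500 ms range described above


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_chunks(samples: Sequence[float],
                  threshold: float = 0.02) -> Iterator[List[float]]:
    """Yield chunks of consecutive speech frames, at most CHUNK_MS long."""
    frame_len = SAMPLE_RATE * FRAME_MS // 1000
    frames_per_chunk = CHUNK_MS // FRAME_MS
    buffer: List[float] = []
    frames_in_buffer = 0

    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) >= threshold:
            # Energy stand-in for the VAD decision "this frame is speech".
            buffer.extend(frame)
            frames_in_buffer += 1
            if frames_in_buffer == frames_per_chunk:
                yield buffer
                buffer, frames_in_buffer = [], 0
        elif buffer:
            # Silence ends the segment: flush the short remainder.
            yield buffer
            buffer, frames_in_buffer = [], 0

    if buffer:
        yield buffer
```

Each yielded chunk is what would be handed to the ASR model. Silence never leaves the phone in this sketch, which is one reason a VAD sits at the front of the pipeline.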

The ASR produces partial hypotheses: low-confidence transcriptions that appear within a second of the words being spoken, then get revised as more audio arrives. You see this on screen as words appearing, sometimes changing, then stabilising. Each chunk is decoded with beam search, keeping the top several candidate transcriptions rather than committing to one. A language model rescores the beams to prefer output that looks like real Arabic.
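The rescoring idea can be shown with a toy example. The sketch assumes the decoder hands back a short list of candidate strings with acoustic log-probabilities, and that a language model can score any string. The weight, the tiny vocabulary, and the names `rescore` and `toy_lm_logprob` are hypothetical; a real system would use a trained Arabic language model rather than a word list.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: known words are cheap,
# unknown words are heavily penalised.
VOCAB = {"inna", "ma'a", "al-usri", "yusra"}


def toy_lm_logprob(text: str) -> float:
    return sum(-1.0 if word in VOCAB else -5.0 for word in text.split())


def rescore(beams: List[Tuple[str, float]],
            lm_logprob: Callable[[str], float],
            lm_weight: float = 0.5) -> List[Tuple[str, float]]:
    """Sort candidates by acoustic score plus weighted LM score, best first."""
    scored = [(text, acoustic + lm_weight * lm_logprob(text))
              for text, acoustic in beams]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


beams = [
    ("inna maa al-usri yusra", -4.0),   # slightly better acoustically
    ("inna ma'a al-usri yusra", -4.2),  # better under the language model
]
best_text, _ = rescore(beams, toy_lm_logprob)[0]
```

Here the second candidate wins after rescoring even though the acoustic model alone preferred the first, which is the behaviour the paragraph above describes.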

When the VAD detects a pause longer than about 800 milliseconds, or when the beam converges, the current hypothesis is marked final. Final output does not get revised.
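The pause rule can be sketched as a small state machine. This is an illustration under assumptions: the class name `Segmenter` and its fields are invented here, the 800 millisecond threshold is the approximate figure from the text, and the beam-convergence trigger is left out because it depends on decoder internals.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PAUSE_MS = 800  # approximate pause threshold from the text


@dataclass
class Segmenter:
    partial: str = ""        # current revisable hypothesis
    silence_ms: int = 0      # silence accumulated since the last speech
    finals: List[str] = field(default_factory=list)

    def on_chunk(self, is_speech: bool, chunk_ms: int,
                 hypothesis: Optional[str] = None) -> Optional[str]:
        """Feed one chunk; return newly finalised text, or None."""
        if is_speech:
            self.silence_ms = 0
            if hypothesis is not None:
                # Partials may be overwritten as more audio arrives.
                self.partial = hypothesis
            return None

        self.silence_ms += chunk_ms
        if self.silence_ms >= PAUSE_MS and self.partial:
            # A long enough pause freezes the hypothesis for good.
            final, self.partial = self.partial, ""
            self.finals.append(final)
            return final
        return None
```

Everything in `partial` is what you see flickering on screen; everything in `finals` is text that will not change again.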

Stage 2: streaming machine translation

Finalised Arabic chunks go to the translation model. Partial chunks can be translated too, but the results are less stable, so most systems wait for finalisation to produce text you actually see in the translated language. The translation model is a separate neural net trained on parallel Arabic-to-target-language pairs, running with its own beam search.

The target-language text streams to your screen below the Arabic. Latency from mouth to screen is typically two to four seconds for the translation, one to two seconds for the Arabic. The reason the translation lags behind the Arabic is that MT needs slightly more context to avoid producing sentences that get rewritten mid-display.
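Put together, the cascade looks roughly like the sketch below: Arabic partials go straight to the screen, and only finalised chunks are sent on to translation. The event shape and the `translate` callable are assumptions for illustration; they are placeholders for whatever ASR and MT services actually sit behind the app.

```python
from typing import Callable, Iterable, Iterator, Tuple

AsrEvent = Tuple[str, str]           # (kind, arabic); kind is "partial" or "final"
DisplayUpdate = Tuple[str, str, str]  # (kind, arabic, translation)


def cascade(asr_events: Iterable[AsrEvent],
            translate: Callable[[str], str]) -> Iterator[DisplayUpdate]:
    """Turn a stream of ASR events into display updates."""
    for kind, arabic in asr_events:
        if kind == "final":
            # Stable text: worth paying for a translation call.
            yield ("final", arabic, translate(arabic))
        else:
            # Unstable text: show the Arabic, hold the translation back.
            yield ("partial", arabic, "")
```

Holding translation until finalisation is what makes the translated line lag behind the Arabic line, as described above.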

Why not end-to-end speech translation

An alternative architecture skips the intermediate Arabic text and maps Arabic audio directly to English (or your target language) text. This is called end-to-end speech translation. It has a latency advantage (one model instead of two) and can sometimes preserve information that ASR discards, like emphasis and emotion.

The tradeoff is that end-to-end systems are harder to train, harder to debug, and less accurate on code-switched input. For khutbah translation specifically, the cascaded approach has one big win: because the Arabic text is produced explicitly, we can intercept it between stages and do things to it. The most important of those things is the next section.

Quran-quotation handling

The khatib quotes an ayah. The ASR transcribes it as Arabic text. The MT model translates that Arabic text. The output is garbage, because MT models are trained on news, Wikipedia, and web text, and the Classical Arabic of the Quran does not translate well through an MSA-tuned model. You get something that resembles the ayah but is neither accurate nor dignified.

The fix is closed-corpus matching, the same technique RecitID uses for reciter identification and Quran detection. After the ASR produces Arabic text, the pipeline fuzzy-matches every phrase against the full text of the Quran. If a match clears a confidence threshold (accounting for ASR error), the matched ayah is treated as a quotation: instead of routing the words through MT, the system inserts the canonical, authoritative translation from the user's preferred translation source (Saheeh International, Pickthall, Hamidullah for French, whichever they set).

The same treatment applies to common hadith with known canonical wording, though the hadith corpus is larger and messier than the Quran so matches are less reliable. For the Quran itself, where the text is fixed and finite, quotation detection is close to solved if the ASR gets enough of the Arabic right. A three-word match is usually enough to identify the ayah with high confidence.
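A minimal sketch of closed-corpus matching, using only the Python standard library. The three-ayah corpus, the 0.85 threshold, and the function names are assumptions for illustration; a real matcher would index the full Quran text and would need a faster search than this brute-force sliding window.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Toy corpus keyed by "surah:ayah". Illustration only.
QURAN: Dict[str, str] = {
    "1:2": "الحمد لله رب العالمين",
    "94:6": "إن مع العسر يسرا",
    "112:1": "قل هو الله أحد",
}


def normalise(text: str) -> List[str]:
    """Strip diacritics and hamza marks, unify alef forms, split into words."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")
    bare = bare.replace("\u0671", "\u0627")  # alef wasla -> plain alef
    bare = bare.replace("\u0640", "")        # drop tatweel (elongation)
    return bare.split()


def find_ayah(transcript: str,
              threshold: float = 0.85) -> Optional[Tuple[str, float]]:
    """Return (reference, score) for the best ayah match, or None."""
    words = normalise(transcript)
    best: Optional[Tuple[str, float]] = None
    for ref, ayah in QURAN.items():
        target_words = normalise(ayah)
        target = " ".join(target_words)
        n = len(target_words)
        # Slide an ayah-sized window across the transcript.
        for i in range(max(1, len(words) - n + 1)):
            window = " ".join(words[i:i + n])
            score = SequenceMatcher(None, window, target).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (ref, score)
    return best
```

Normalising both sides first means a fully vowelled transcript and an unvowelled corpus entry compare as equal, and the character-level ratio tolerates a dropped or substituted letter from the ASR.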

The result: when the khatib quotes inna ma'a al-usri yusra, you see the real translation ("indeed, with hardship comes ease") rather than an MT approximation. When he quotes a hadith, the system tries canonical first and falls back to MT if no match is found.
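The canonical-first, MT-as-fallback ordering can be written as a small routing function. The matchers and the translator are passed in as callables because their real implementations are not described here; the names `render`, `match_quran`, `match_hadith`, and `translate_mt` are hypothetical.

```python
from typing import Callable, Optional, Tuple

# A matcher returns the canonical translation, or None when nothing matches.
Matcher = Callable[[str], Optional[str]]


def render(arabic: str,
           match_quran: Matcher,
           match_hadith: Matcher,
           translate_mt: Callable[[str], str]) -> Tuple[str, str]:
    """Return (source, text), where source is 'quran', 'hadith' or 'mt'."""
    for source, matcher in (("quran", match_quran), ("hadith", match_hadith)):
        canonical = matcher(arabic)
        if canonical is not None:
            return source, canonical
    # No canonical text found: fall back to machine translation.
    return "mt", translate_mt(arabic)
```

Returning the source alongside the text lets the display mark which lines are authoritative translations and which are machine output.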

How accurate is it, really

Accuracy varies by three factors that are independent of the app: the acoustics of the room, the dialect and diction of the khatib, and the ambient noise. Here is an honest breakdown of what to expect.

Clean MSA, well-amplified, quiet masjid. Transcription is mostly correct. Translation reads well. Quranic quotations are caught and rendered with the canonical text. This is the case the system is built for.
Heavy reverb. Large masjid with a high dome or marble interior. Transcription degrades, translation degrades proportionally. Sitting closer to the imam or to a PA speaker helps more than anything else.
Thick dialect. Moroccan, Algerian, deep Egyptian. Dialectal segments will be transcribed with higher error rate. The MSA portions of the same khutbah remain fine. Quran quotations still get caught because the ayah text is the same regardless of dialect.
Thick non-Arab accent. A khatib whose first language is Urdu, Turkish, or Malay and who pronounces Arabic with a strong substrate accent. Usually works better than dialect because the underlying phonology is closer to MSA, just coloured.
Noisy masjid. Crying children, HVAC, latecomers shuffling. Partial hypotheses flicker more. Finalised output is usually recoverable.

To be open about the failure modes: if you are in the back row of a 2,000-person masjid with heavy reverb and the khatib is speaking colloquial Moroccan, the translation will be noticeably worse than if you are three rows back in a 200-person masjid with a clear PA and a Cairo-trained khatib. The feature is a tool, not a miracle.

Supported languages

Live Khutbah covers 53 translation languages: English, French, Spanish, German, Turkish, Urdu, Indonesian, Malay, Bengali, Russian, Chinese, Japanese, Korean, Portuguese, Italian, Dutch, Swahili, Hausa, Somali, Hindi, Persian, Punjabi, Tamil, Thai, Pashto, Kurdish, Amharic, Albanian, Azerbaijani, Bosnian, Filipino, Gujarati, Kazakh, Malayalam, Tajik, Uzbek, Yoruba, Vietnamese, Polish, Romanian, Greek, Czech, Swedish, Norwegian, Danish, Finnish, Hungarian, Bulgarian, Croatian, Slovak, Lithuanian, Slovenian, and more.

Worth flagging a difference that trips people up. Live Khutbah supports 53 languages. The reading translations that appear next to detected ayat are a different list, currently 40 languages with vetted canonical translations (Saheeh International for English, Hamidullah for French, Muhammad Asad as an option, and so on). The 53 languages for Live Khutbah use streaming machine translation; the 40 reading translations use human-produced authoritative texts. Two different features, two different numbers.

Privacy: what happens to the audio

The short version: audio streams to our servers for processing during the session, and we do not keep a permanent copy of the raw audio. The transcript and translation stay on your device unless you choose to save the session. If you save, the text stays in your account; the original audio is still discarded after processing.

Real-time ASR at this quality currently requires cloud inference. The models are too large to run on phone hardware without melting the battery in ten minutes. We watch the on-device model space closely, and if a model ships that can do streaming Arabic ASR on an iPhone or Pixel at acceptable quality, we will move there. Until then, cloud plus aggressive audio discard is the honest answer. Full retention windows are in the privacy policy.

How to use it in the app

Three steps.

Open RecitID and switch to Live Khutbah mode. Select your translation language from the list.
Tap record when the khatib starts speaking. Place the phone face-up on your lap or on the carpet in front of you. The built-in phone mic is fine.
Watch the Arabic and the translation appear in parallel. Save the session at the end if you want to re-read the khutbah later.

Works for any Arabic speech, not only Friday khutbahs. Islamic lectures, halaqas, taraweeh du'a, weekend seminars, online classes. Anywhere the speaker is speaking Arabic and you want the meaning in your language.

What pairs well with Live Khutbah

Two other features cover cases Live Khutbah does not.

AI Explain takes a specific ayah and gives you context, tafsir summary, and asbab al-nuzul. Useful after the khutbah when you want to go deeper on a verse the khatib quoted. Pair it with Asbab al-Nuzul if you want to read about the classical circumstances-of-revelation literature.

AI Chat answers free-form questions about what you just heard. "The khatib mentioned three conditions for a valid repentance, can you list them with sources?" That sort of thing.

For the separate problem of identifying the Quranic recitation itself, RecitID Detect matches recited audio against the Quran text and returns the surah, verse, and reciter. Detect and Live Khutbah are different tools: Detect is for recitation (closed text, fixed corpus), Live Khutbah is for sermons (open speech, MT). The Shazam-for-Quran explainer goes into the difference in more detail.

Pricing and limits

Live Khutbah is a paid feature because the cloud ASR and MT cost real money per minute of audio. The pricing page has the current numbers; at time of writing, Monthly Pro includes 3 hours of Live Translation per month, and Annual Pro+ includes 60 hours per year (boosted to 70 during Ramadan to cover the extra taraweeh and lecture sessions). Extra hour packs are available if you burn through the quota.

Sixty hours is enough for every Friday khutbah for a year (roughly 22 hours: 52 Fridays at 25 minutes each), plus a weekly lecture, plus travel. The Ramadan bump exists because heavy users, reverts especially, tend to attend a lot more talks in that month.
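The quota arithmetic is easy to check. The sketch below uses the 60-hour and 25-minute figures from the text and assumes 52 Fridays in a year; the function name is made up for the example.

```python
from typing import Tuple

FRIDAYS_PER_YEAR = 52      # assumed
KHUTBAH_MINUTES = 25       # figure used in the text
ANNUAL_QUOTA_HOURS = 60    # Annual Pro+ at time of writing


def annual_budget() -> Tuple[float, float, float]:
    """Return (khutbah hours, hours left over, leftover minutes per week)."""
    khutbah_hours = FRIDAYS_PER_YEAR * KHUTBAH_MINUTES / 60
    leftover_hours = ANNUAL_QUOTA_HOURS - khutbah_hours
    leftover_minutes_per_week = leftover_hours * 60 / FRIDAYS_PER_YEAR
    return khutbah_hours, leftover_hours, leftover_minutes_per_week


khutbah_hours, leftover_hours, weekly_minutes = annual_budget()
print(f"Khutbahs: {khutbah_hours:.1f} h, left over: {leftover_hours:.1f} h "
      f"(about {weekly_minutes:.0f} min per week)")
```

That works out to about 21.7 hours of khutbahs and about 38.3 hours left over, or roughly 44 minutes a week for lectures and travel.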

Frequently asked

Does live khutbah translation work offline?

No. Real-time Arabic ASR at the quality needed for a sermon requires cloud inference today. A connection over the masjid WiFi or 4G is enough; the uplink bandwidth is modest because audio is compressed before upload.
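For a sense of scale, here is the upload size for one session at a given bitrate. The 24 kbps figure is an assumption picked as a typical speech-codec rate; the text does not say which codec or bitrate the app uses.

```python
def upload_megabytes(minutes: float, kbps: float) -> float:
    """Total upload in megabytes (10**6 bytes) for a constant-bitrate stream."""
    bits = kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


# A 25-minute khutbah at an assumed 24 kbps.
session_mb = upload_megabytes(minutes=25, kbps=24)
```

Under that assumption a full khutbah uploads about 4.5 MB, which a weak 4G connection can comfortably sustain.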

How accurate is the translation?

On clean MSA with a clear PA and a modest reverb, accurate enough to follow the argument and catch most phrasing. On heavy dialect, thick reverb, or loud masjids, accuracy drops. Quranic quotations are handled separately using canonical translations, so the ayat are rendered with authoritative text rather than MT output.

What languages does it support?

53 target languages, including English, French, Spanish, German, Turkish, Urdu, Indonesian, Malay, Bengali, Chinese, Russian, Swahili, Hausa, Somali, Bosnian, Albanian, Vietnamese, Yoruba, and more. Full list on the Live Khutbah page.

Does it handle non-MSA dialects like Moroccan Darija or Egyptian?

Partially. MSA portions transcribe well across dialects. Purely dialectal segments (deep Darija, colloquial Egyptian asides) transcribe with higher error rates. Most khatibs code-switch between dialect and MSA, so the MSA segments give the translation enough backbone to still make sense. A khutbah delivered entirely in colloquial Moroccan will be noticeably less accurate than one in MSA.

Is my audio stored?

No. Raw audio is discarded after the ASR finishes processing it. The transcript and translation stay on your device, and you can save or delete sessions at will. For exact retention windows and the third-party providers we route through, see the privacy policy.

Does it work for English khutbahs?

Yes. Set the source language to English (or let auto-detect handle it) and the target language to whichever you want. The quality on English source is very high because English is the most mature language in speech recognition. Code-switched khutbahs (half Arabic, half English) work too, with auto-detect switching on language change.

Try it this Friday

Install RecitID from the App Store or Google Play, open Live Khutbah on Friday, pick your language, and put the phone on your lap. The free plan gives you a 1-minute preview per session, enough to see the feature in action. For regular Friday use, you will want Monthly Pro or Annual Pro+.

Related reading: Shazam for the Quran, murattal vs mujawwad, and the ten qiraat.

Live Khutbah Translation: How It Works in Real Time