The Church of Jesus Christ of Latter-day Saints is a global faith with over 30,000 congregations around the world, and according to church statistics its materials are translated into over 188 languages. For church-wide conferences, hundreds of interpreters provide translation of talks into various languages. However, many regional meetings and local congregations have translation needs but lack the resources to meet them.
In recent years, advances in deep learning have enabled high performance in both machine translation and transcription of spoken language. If these advances are sufficient to accurately and reliably translate talks in real time during church services, the language barriers that limit unity in local units could be overcome.
Effective evaluation of speech-to-text models is crucial to building systems that provide translation during church services. Although some languages (such as English) can be transcribed with high accuracy by speech-to-text models, lower-resource languages see substantially higher error rates. To enable effective real-time translation, an accurate real-time transcription model must be identified before translating into other languages.
In this post, I share the performance of four speech-to-text models on transcribing general conference talks. I use the most popular metric for transcription evaluation (word error rate; WER) as well as an LLM-as-a-judge procedure to evaluate model performance.
To provide a robust evaluation of speech-to-text transcription ability, I used conference talks from five different conference sessions (April 2025, October 2025, October 2024, April 2024, and April 2023). I selected a variety of speakers to capture different accents in English, and I selected different sessions in hopes that the person reading translations would differ between sessions.
Data were downloaded using an adaptation of the general-conference-extractor Python package.
To evaluate speech-to-text models, I used Word Error Rate (WER).
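For reference, WER is edit distance computed at the word level: (substitutions + deletions + insertions) divided by the number of words in the reference. Here is a minimal sketch of that computation (in practice a library such as jiwer would also handle text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why some values in the results below are above 1.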
I was concerned that word error rate may struggle in this context, in part because the ground truth texts include footnotes and references that are never spoken aloud, so exact word-level matching can penalize otherwise correct transcriptions.
To help address these concerns, I also used LLM-as-a-judge to evaluate audio transcriptions against the ground truth text.
I used the following prompt for my LLM-as-a-judge procedure:
You are evaluating the quality of a speech transcription. Note that the ground truth text may contain additional footnotes or references not spoken in the transcription. You may ignore those while making your evaluation.
Rate the transcription quality on a scale from 1 to 8:
8 = Perfect transcription. There are no errors and the transcription matches the ground truth perfectly besides any footnotes existing in the ground truth.
7 = Nearly Perfect transcription. There may be minor inconsistencies in spelling or formatting but a human would be able to understand the full meaning of every part of the transcription perfectly.
6 = Very Good transcription. Some words in the transcription may differ, but a human would be able to understand the general content of each part of the transcription.
5 = Good transcription. Many words in the transcription may differ, but a human would be able to understand the general content of most parts of the transcription.
4 = Fair. A human may be able to understand the content of most sections, but many sections are garbled and difficult to understand.
3 = Poor. A human may be able to guess at the general subject of the transcription based on some of the words, and they may be able to understand the content of some sections, but most sections are garbled and impossible to understand.
2 = Very Poor. A human may be able to guess at the general subject of the transcription based on some of the words, but would not be able to understand any of the points the speaker is trying to convey.
1 = Completely unusable. No or very few words match between the ground truth and transcription. Words do not follow grammatical structure.
Ground truth:
{reference}

Transcription:
{hypothesis}

Respond with only a single integer (1-8). Do not include anything else in your response. Correct example responses are:
5
1
3
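The template above can be wired up roughly as follows. This is a sketch, not my exact implementation: `JUDGE_PROMPT` here is abbreviated (the full rubric goes in the template), the helper names are hypothetical, and the actual model call is provider-specific and omitted.

```python
# Abbreviated template; the full 1-8 rubric from the post goes here.
JUDGE_PROMPT = """You are evaluating the quality of a speech transcription.
Rate the transcription quality on a scale from 1 to 8.

Ground truth:
{reference}

Transcription:
{hypothesis}

Respond with only a single integer (1-8). Do not include anything else in your response."""


def build_judge_prompt(reference: str, hypothesis: str) -> str:
    """Fill the template slots with the ground truth and the model output."""
    return JUDGE_PROMPT.format(reference=reference, hypothesis=hypothesis)


def parse_judge_score(raw: str) -> int:
    """Validate the judge's reply: a single integer in [1, 8].

    Tolerates stray whitespace and a trailing period, since models
    sometimes append punctuation despite the instructions.
    """
    score = int(raw.strip().rstrip("."))
    if not 1 <= score <= 8:
        raise ValueError(f"judge score out of range: {score}")
    return score
```

Validating the reply matters because a single malformed response would otherwise silently corrupt the averaged scores.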
I evaluated models from providers that listed Haitian Creole as a supported language somewhere in their documentation. Whisper, Gladia, and AssemblyAI all met this criterion.
Code for generating benchmarks is available in my repository.
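The benchmark loop itself is simple in outline. The sketch below is a simplified stand-in for the repository code (the function and parameter names are illustrative, and real transcriber callables would wrap each provider's API):

```python
from typing import Callable, Dict, List, Tuple


def run_benchmark(
    samples: List[Tuple[str, str]],                 # (audio_path, reference_text)
    transcribers: Dict[str, Callable[[str], str]],  # model name -> audio path -> text
    metric: Callable[[str, str], float],            # (reference, hypothesis) -> score
) -> Dict[str, float]:
    """Average a metric over all samples for each model."""
    results: Dict[str, float] = {}
    for name, transcribe in transcribers.items():
        scores = [metric(ref, transcribe(audio)) for audio, ref in samples]
        results[name] = sum(scores) / len(scores)
    return results
```

With `metric` set to WER this produces one averaged row per model, which is how the per-language results below are aggregated.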
Whisper was the best-performing transcription model for English, Norwegian, Portuguese, and Mandarin. Note that this is not an exact 1:1 comparison: OpenAI does not support streaming for Whisper, so its performance can be expected to decline when it is adapted for streaming use.
LLM-as-a-judge scores had a strong negative correlation with WER (-0.91), supporting the validity of the scoring strategy.
| Language | Model | WER | LLM score |
|---|---|---|---|
| eng | AssemblyAI multilingual | 0.124 | 4.5 |
| eng | AssemblyAI u3_pro | 0.099 | 5.0 |
| eng | Gladia | 0.126 | 4.0 |
| eng | Whisper API | 0.094 | 5.75 |
| hat | AssemblyAI multilingual | - | - |
| hat | AssemblyAI u3_pro | 0.988 | 1.8 |
| hat | Gladia | 0.718 | 2.8 |
| hat | Whisper API | 1.019 | 1.2 |
| nor | AssemblyAI multilingual | - | - |
| nor | AssemblyAI u3_pro | - | - |
| nor | Gladia | 0.221 | 3.5 |
| nor | Whisper API | 0.173 | 4.5 |
| por | AssemblyAI multilingual | 0.155 | 4.8 |
| por | AssemblyAI u3_pro | 0.152 | 5.0 |
| por | Gladia | 0.167 | 4.6 |
| por | Whisper API | 0.133 | 5.8 |
| spa | AssemblyAI multilingual | 0.162 | 5.5 |
| spa | AssemblyAI u3_pro | 0.165 | 4.5 |
| spa | Gladia | 0.175 | 4.75 |
| spa | Whisper API | 0.191 | 4.75 |
| zho | AssemblyAI multilingual | - | - |
| zho | AssemblyAI u3_pro | - | - |
| zho | Gladia | 0.419 | 3.8 |
| zho | Whisper API | 0.197 | 4.8 |
Note: If a model did not support a given language via its API, I did not run benchmarks for that language/model pair.
With these benchmarks in place, I plan to explore options for providing high-quality transcription in languages such as Haitian Creole. I hope to evaluate: