SOTA for Machine Translation System
1. Introduction
The goal is to develop a modular, scalable, and open-source machine translation system for the Moore language, one of the major languages spoken in Burkina Faso.
The system has three primary components:
- Text-to-Text Translation (MT): Translation between French and Moore.
- Speech-to-Text (STT): Converting spoken French and Moore into text.
- Text-to-Speech (TTS): Converting translated text into natural-sounding speech in Moore.
Given that Moore is a low-resource language with limited labeled data, the system builds on state-of-the-art (SOTA) deep learning models while leveraging fine-tuning, transfer learning, and domain adaptation techniques.
A specificity of the Moore language: monolingual data is relatively plentiful, even though parallel data is scarce.
This document provides a technical overview of the selection, adaptation, and deployment of models, and discusses key challenges, solutions, and future directions.
2. Text-to-Text Translation (MT)
2.1. Model Selection
Current SOTA models are Transformer-based architectures, which have become the standard in machine translation (MT).
Interesting options to explore:
- mBART-50 (Facebook AI): A multilingual denoising autoencoder trained on 50 languages, capable of unsupervised translation and adaptation.
- M2M-100 (Facebook AI): A fully multilingual model supporting direct translation between 100 languages without relying on English as an intermediary.
- MarianMT / OPUS-MT: Open-source Transformer models trained on OPUS parallel corpora, well-suited for low-resource languages.
- NLLB-200 (Meta AI): A model aimed at scaling machine translation to 200 languages, with explicit support for low-resource languages.
- Another interesting option is lightweight fine-tuning of an LLM such as Mistral (https://mistral.ai/); other compact models such as Phi or LLaMA could also be used.
2.2. Benchmarking NLLB Against Previous Models
Evaluation Criteria:
- Model Size and Inference Speed:
  - NLLB: Although larger than some older models, it has been optimized for scalability and often leverages quantization techniques for faster inference on edge devices.
  - MarianMT/OPUS-MT: Generally lighter in terms of model parameters, which can be beneficial in resource-constrained environments, albeit sometimes at the cost of translation accuracy.
- Adaptability to Low-Resource Languages:
  - NLLB is designed explicitly with low-resource languages in mind, offering tailored fine-tuning and domain adaptation strategies that give it an edge over more general models like mBART-50 and M2M-100 when applied to Moore.
- Ease of Integration and Fine-Tuning:
  - All models support transfer learning and can be fine-tuned on domain-specific data. However, NLLB's architecture incorporates recent advances in multilingual training, potentially reducing the amount of fine-tuning required to achieve high accuracy.
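The quantization mentioned above can be sketched in miniature. The snippet below applies symmetric per-tensor int8 post-training quantization to a toy weight list; it is illustrative only (real deployments would use a framework's quantization toolkit), and all names here are our own.

```python
# Minimal sketch of symmetric per-tensor int8 post-training quantization,
# illustrating the idea behind faster inference on edge devices.
# Toy weight list only; not a real model.

def quantize_int8(weights):
    """Map float weights to int8 values sharing a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [max(-128, min(127, round(w / scale))) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in quantized]

weights = [0.12, -0.5, 0.33, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the originals, within scale / 2
```

The round-trip error is bounded by half the scale step, which is why int8 quantization usually costs little accuracy while shrinking storage 4x versus float32.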
Benchmarking Summary:

| Model | BLEU (Low-Resource) | Inference Speed | Model Size | Adaptability to Moore |
|---|---|---|---|---|
| NLLB | High | Moderate | Large (~billions of parameters) | Excellent |
| mBART-50 | Moderate to High | Moderate | Medium | Not tested |
| M2M-100 | Moderate | Moderate | Medium to Large | Not tested |
| MarianMT/OPUS-MT | Moderate | Fast | Small | Not tested |
2.3. Techniques for Improving Translation Performance
- Fine-tuning on domain-specific data: Training models with a curated French-Moore parallel corpus.
- Back-translation: Translating monolingual Moore text into French to generate synthetic parallel data and increase training data.
- Data Augmentation & Denoising: Introducing noise, paraphrasing, and synthetic data generation to improve robustness.
- Adapter Layers: Training lightweight adapter modules to specialize in Moore translation without modifying the entire model.
- Self-labelling: Using STT models to transcribe Moore audio, producing additional text data.
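As a concrete sketch of the back-translation technique listed above, the snippet below turns monolingual Moore sentences into synthetic (French, Moore) training pairs. The translation function is a hypothetical stub standing in for a real Moore-to-French model.

```python
# Sketch of back-translation for data augmentation.
# translate_mo_to_fr is a placeholder stub; in practice it would call a
# trained Moore->French MT model.

def translate_mo_to_fr(sentence):
    # Hypothetical stub standing in for a real reverse-direction model.
    return f"<fr translation of: {sentence}>"

def back_translate(monolingual_moore):
    """Build synthetic (French, Moore) pairs from monolingual Moore text.

    The French side is model output (possibly noisy); the Moore side is
    genuine text, which is what the forward model learns to produce.
    """
    return [(translate_mo_to_fr(mo), mo) for mo in monolingual_moore]

pairs = back_translate(["sentence 1 in Moore", "sentence 2 in Moore"])
```

This is how the abundant Moore monolingual data mentioned in the introduction can compensate for the scarcity of parallel data.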
2.4. Evaluation Metrics
- BLEU Score: Measures translation accuracy by comparing model outputs with human translations.
- CHRF++: A character-based metric useful for morphologically rich languages like Moore.
- Human Evaluation: Moore speakers assess fluency and adequacy.
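To make the character-based metric concrete, here is a minimal, simplified chrF implementation (character n-grams only; the full CHRF++ also mixes in word n-grams and differs in details). It shows how the metric rewards partial character-level overlap, which matters for a morphologically rich language.

```python
# Simplified chrF: mean character n-gram F-beta over orders 1..max_n.
# Illustrative only; use a library such as sacrebleu for reportable scores.
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams with spaces removed, as in common chrF setups."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta ** 2) * prec * rec
                            / (beta ** 2 * prec + rec))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Unlike BLEU, a hypothesis with a correct stem but a wrong affix still scores partial credit here, since most of its character n-grams match.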
3. STT – Automatic Speech Recognition for Moore
3.1. Model Selection
3.1.1. Whisper (OpenAI) – Pretrained Model Approach
- Whisper: A multilingual ASR model trained on a large and diverse dataset, supporting Moore transcription and direct speech translation.
- Fine-tuning: We can enhance Whisper’s accuracy on Moore speech by training on additional labeled Moore audio data.
3.1.2. Wav2Vec 2.0 (Facebook AI)
- Wav2Vec 2.0: A powerful self-supervised learning framework for speech representations. It has shown excellent results in various ASR tasks and can be fine-tuned for low-resource languages like Moore.
3.1.3. From-Scratch STT Model
For developing a custom Moore ASR model, we consider:
- Acoustic Modeling:
  - Wav2Vec 2.0: Self-supervised speech representation learning.
  - Conformer: A hybrid convolutional and Transformer-based ASR model.
- Language Modeling:
  - Train a Moore-specific language model using Transformer-based architectures.
  - Lexicon-based Decoding: Improves rare word recognition.
- End-to-End Architectures: Unified models that combine acoustic and language modeling.
3.2. Data Collection & Augmentation
- Crowdsourced Moore audio datasets.
- Synthetic speech data augmentation.
- Phonetic-based augmentation for pronunciation variation coverage.
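One simple form of synthetic augmentation is speed perturbation. The function below changes a waveform's speed by linear-interpolation resampling; it is a pure-Python sketch (in practice a toolkit such as torchaudio or SoX would be used).

```python
# Sketch of speed perturbation for audio data augmentation.
# Resamples a waveform (list of float samples) via linear interpolation.

def speed_perturb(samples, factor):
    """factor > 1 speeds up (fewer samples), factor < 1 slows down."""
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor                       # fractional source position
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Linear blend between the two neighboring samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

wave = [float(i) for i in range(100)]
fast = speed_perturb(wave, 1.25)  # 80 samples: same audio, 25% faster
slow = speed_perturb(wave, 0.5)   # 200 samples: half speed
```

Typical recipes generate copies at factors around 0.9, 1.0, and 1.1, effectively multiplying the amount of training audio.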
3.3. Evaluation Metrics
- Word Error Rate (WER) – Measures transcription accuracy.
- Phoneme Error Rate (PER) – Useful for phonetic consistency.
- Real-time Factor (RTF) – Measures inference speed.
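WER is edit distance computed at the word level; a minimal reference implementation makes the definition precise:

```python
# Word Error Rate: (substitutions + insertions + deletions) / reference words,
# computed with word-level Levenshtein distance.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is common for noisy low-resource ASR output.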
4. TTS – Speech Synthesis for Moore
4.1. Model Selection
We consider neural TTS models optimized for low-resource languages:
- Tacotron 2 (Google) – Sequence-to-sequence model for natural speech synthesis.
- FastSpeech 2 (Microsoft) – Non-autoregressive model for fast inference.
- VITS – End-to-end model with prosody control.
4.2. Techniques for Improvement
- Speaker Adaptation: Fine-tune on Moore voice datasets.
- Prosody & Expressiveness Modeling: Enhancing pitch and tone variation.
- Multilingual Pretraining: Using models trained on African languages.
- Data Augmentation:
  - Speech perturbation (speed, pitch, noise)
  - Phoneme-based synthesis
4.3. Evaluation Metrics
- MOS (Mean Opinion Score) – Human evaluation of naturalness.
- Mel Cepstral Distortion (MCD) – Measures synthesized speech quality.
- CER (Character Error Rate) – Measures intelligibility.
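MCD compares mel-cepstral coefficients of reference and synthesized speech. Below is a minimal sketch of the commonly used 10/ln(10) · sqrt(2 · Σd²) formulation; frames are assumed already time-aligned (e.g. via DTW), and the 0th (energy) coefficient is conventionally excluded by the caller.

```python
# Sketch of Mel Cepstral Distortion (MCD), in dB.
# Inputs are lists of mel-cepstral coefficient vectors, one per frame,
# assumed time-aligned between reference and synthesized speech.
import math

def mcd_frame(mc_ref, mc_syn):
    """MCD for one aligned frame pair."""
    sq = sum((a - b) ** 2 for a, b in zip(mc_ref, mc_syn))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

def mcd(ref_frames, syn_frames):
    """Mean frame-level MCD over aligned frame sequences."""
    distances = [mcd_frame(r, s) for r, s in zip(ref_frames, syn_frames)]
    return sum(distances) / len(distances)
```

Lower is better; identical cepstra give 0 dB, and reported MCD values for good neural TTS are typically in the single digits.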
5. Pipeline Integration & Deployment
TTT
graph LR;
    A[Input Source Text] --> B[Text Preprocessing & Normalization]
    B --> C[MT Model Selection]
    C --> D[Translation Model Options: NLLB / mBART-50 / M2M-100 / MarianMT / Fine-tuned Mistral]
    D --> E[Post-Translation Processing]
    E --> F[Output: Translated Text]
STT
graph LR;
    A[Audio Input] --> B[Audio Preprocessing: Noise Reduction, Normalization];
    B --> C[ASR Model: Whisper / Wav2Vec2 / Conformer];
    C --> D[Language Model Integration];
    D --> E[Post-processing];
    E --> F[Output: Transcribed Text];
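Chained together, the stages above yield a French-speech to Moore-speech flow. The skeleton below sketches that chain; every function is a placeholder stub (all names and return values here are hypothetical), standing in for the ASR, MT, and TTS models discussed in sections 2-4.

```python
# Skeleton of an end-to-end speech-to-speech pipeline. All stages are stubs.

def transcribe_fr(audio):           # STT stage (e.g. Whisper / Wav2Vec 2.0)
    return "bonjour"                # stub transcription

def translate_fr_to_mo(text):       # MT stage (e.g. NLLB / MarianMT)
    return f"<mo: {text}>"          # stub translation

def synthesize_mo(text):            # TTS stage (e.g. VITS / FastSpeech 2)
    return [0.0] * 16000            # stub: one second of silence at 16 kHz

def speech_to_speech(audio):
    """French audio in, synthesized Moore audio out."""
    text_fr = transcribe_fr(audio)
    text_mo = translate_fr_to_mo(text_fr)
    return synthesize_mo(text_mo)
```

Keeping each stage behind its own function boundary is what makes the system modular: any single model can be swapped (e.g. Whisper for Wav2Vec 2.0) without touching the rest of the pipeline.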
6. Challenges
- Lack of Moore training data → Data collection & augmentation.
- Dialectal Variations → Phonetic modeling techniques.
- Efficient deployment → Lightweight models.
- Multimodal learning (text, audio, visual cues) → Future direction.