SOTA for Machine Translation System
1. Introduction
The goal is to develop a modular, scalable, and open-source machine translation system for the Moore language, one of the major languages spoken in Burkina Faso.
The system has three primary components:
- Text-to-Text Translation (MT): Translation between French and Moore.
- Speech-to-Text (STT): Converting spoken French and Moore into text.
- Text-to-Speech (TTS): Converting translated text into natural-sounding speech in Moore.
Given that Moore is a low-resource language with limited labeled data, the system builds on state-of-the-art (SOTA) deep learning models while leveraging fine-tuning, transfer learning, and domain adaptation techniques.
A specificity of the Moore language: monolingual data is relatively plentiful, even though parallel data is scarce.
This document provides a technical overview of the selection, adaptation, and deployment of models, and discusses key challenges, solutions, and future directions.
2. Text-to-Text Translation (MT)
2.1. Model Selection
Current SOTA models are Transformer-based architectures, which have become the standard in machine translation (MT).
Interesting options to explore:
- mBART-50 (Facebook AI): A multilingual denoising autoencoder trained on 50 languages, capable of unsupervised translation and adaptation.
- M2M-100 (Facebook AI): A fully multilingual model supporting direct translation between 100 languages without relying on English as an intermediary.
- MarianMT / OPUS-MT: Open-source Transformer models trained on OPUS parallel corpora, well-suited for low-resource languages.
- NLLB-200 (Meta AI): A model aimed at scaling machine translation to 200 languages, with explicit support for low-resource languages.
- Another interesting option is lightweight fine-tuning of an LLM such as Mistral (https://mistral.ai/); other compact models such as Phi or LLaMA could also be used.
2.2. Benchmarking NLLB Against Previous Models
Evaluation Criteria:
- Model Size and Inference Speed:
  - NLLB: Although larger than some older models, it has been optimized for scalability and often leverages quantization techniques for faster inference on edge devices.
  - MarianMT/OPUS-MT: Generally lighter in terms of model parameters, which can be beneficial in resource-constrained environments, albeit sometimes at the cost of translation accuracy.
- Adaptability to Low-Resource Languages:
  - NLLB is designed explicitly with low-resource languages in mind, offering tailored fine-tuning and domain adaptation strategies that give it an edge over more general models like mBART-50 and M2M-100 when applied to Moore.
- Ease of Integration and Fine-Tuning:
  - All models support transfer learning and can be fine-tuned on domain-specific data. However, NLLB's architecture incorporates recent advances in multilingual training, potentially reducing the amount of fine-tuning required to achieve high accuracy.
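The quantization mentioned above can be sketched in miniature. The snippet below applies symmetric per-tensor int8 post-training quantization to a toy weight list; it is illustrative only (real deployments would use a framework's quantization toolkit), and all names here are our own.

```python
# Minimal sketch of symmetric per-tensor int8 post-training quantization,
# illustrating the idea behind faster inference on edge devices.
# Toy weight list only; not a real model.

def quantize_int8(weights):
    """Map float weights to int8 values sharing a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [max(-128, min(127, round(w / scale))) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in quantized]

weights = [0.12, -0.5, 0.33, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to the originals, within scale / 2
```

The round-trip error is bounded by half the scale step, which is why int8 quantization usually costs little accuracy while shrinking storage 4x versus float32.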
Benchmarking Summary:

| Model | BLEU (Low-Resource) | Inference Speed | Model Size | Adaptability to Moore |
|---|---|---|---|---|
| NLLB | High | Moderate | Large (~billions of parameters) | Excellent |
| mBART-50 | Moderate to High | Moderate | Medium | Not tested |
| M2M-100 | Moderate | Moderate | Medium to Large | Not tested |
| MarianMT/OPUS-MT | Moderate | Fast | Small | Not tested |
2.3. Techniques for Improving Translation Performance
- Fine-tuning on domain-specific data: Training models with a curated French-Moore parallel corpus.
- Back-translation: Translating monolingual Moore text into French to generate synthetic parallel data and increase training data.
- Data Augmentation & Denoising: Introducing noise, paraphrasing, and synthetic data generation to improve robustness.
- Adapter Layers: Training lightweight adapter modules to specialize in Moore translation without modifying the entire model.
- Self-labelling: Using STT models to transcribe Moore audio, producing additional text data.
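As a concrete sketch of the back-translation technique listed above, the snippet below turns monolingual Moore sentences into synthetic (French, Moore) training pairs. The translation function is a hypothetical stub standing in for a real Moore-to-French model.

```python
# Sketch of back-translation for data augmentation.
# translate_mo_to_fr is a placeholder stub; in practice it would call a
# trained Moore->French MT model.

def translate_mo_to_fr(sentence):
    # Hypothetical stub standing in for a real reverse-direction model.
    return f"<fr translation of: {sentence}>"

def back_translate(monolingual_moore):
    """Build synthetic (French, Moore) pairs from monolingual Moore text.

    The French side is model output (possibly noisy); the Moore side is
    genuine text, which is what the forward model learns to produce.
    """
    return [(translate_mo_to_fr(mo), mo) for mo in monolingual_moore]

pairs = back_translate(["sentence 1 in Moore", "sentence 2 in Moore"])
```

This is how the abundant Moore monolingual data mentioned in the introduction can compensate for the scarcity of parallel data.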
2.4. Evaluation Metrics
- BLEU Score: Measures translation accuracy by comparing model outputs with human translations.
- CHRF++: A character-based metric useful for morphologically rich languages like Moore.
- Human Evaluation: Moore speakers assess fluency and adequacy.
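To make the character-based metric concrete, here is a minimal, simplified chrF implementation (character n-grams only; the full CHRF++ also mixes in word n-grams and differs in details). It shows how the metric rewards partial character-level overlap, which matters for a morphologically rich language.

```python
# Simplified chrF: mean character n-gram F-beta over orders 1..max_n.
# Illustrative only; use a library such as sacrebleu for reportable scores.
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams with spaces removed, as in common chrF setups."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta ** 2) * prec * rec
                            / (beta ** 2 * prec + rec))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Unlike BLEU, a hypothesis with a correct stem but a wrong affix still scores partial credit here, since most of its character n-grams match.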
3. STT – Automatic Speech Recognition for Moore
3.1. Model Selection
3.1.1. Whisper (OpenAI) – Pretrained Model Approach
- Whisper: A multilingual ASR model trained on a large and diverse dataset, supporting Moore transcription and direct speech translation.
- Fine-tuning: We can enhance Whisper’s accuracy on Moore speech by training on additional labeled Moore audio data.
3.1.2. Wav2Vec 2.0 (Facebook AI)
- Wav2Vec 2.0: A powerful self-supervised learning framework for speech representations. It has shown excellent results in various ASR tasks and can be fine-tuned for low-resource languages like Moore.
3.1.3. From-Scratch STT Model
For developing a custom Moore ASR model, we consider:
- Acoustic Modeling:
  - Wav2Vec 2.0: Self-supervised speech representation learning.
  - Conformer: A hybrid convolutional and Transformer-based ASR model.
- Language Modeling:
  - Train a Moore-specific language model using Transformer-based architectures.
  - Lexicon-based Decoding: Improves rare word recognition.
- End-to-End Architectures: Unified models that combine acoustic and language modeling.
3.2. Data Collection & Augmentation
- Crowdsourced Moore audio datasets.
- Synthetic speech data augmentation.
- Phonetic-based augmentation for pronunciation variation coverage.
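One simple form of synthetic augmentation is speed perturbation. The function below changes a waveform's speed by linear-interpolation resampling; it is a pure-Python sketch (in practice a toolkit such as torchaudio or SoX would be used).

```python
# Sketch of speed perturbation for audio data augmentation.
# Resamples a waveform (list of float samples) via linear interpolation.

def speed_perturb(samples, factor):
    """factor > 1 speeds up (fewer samples), factor < 1 slows down."""
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor                       # fractional source position
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Linear blend between the two neighboring samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

wave = [float(i) for i in range(100)]
fast = speed_perturb(wave, 1.25)  # 80 samples: same audio, 25% faster
slow = speed_perturb(wave, 0.5)   # 200 samples: half speed
```

Typical recipes generate copies at factors around 0.9, 1.0, and 1.1, effectively multiplying the amount of training audio.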
3.3. Evaluation Metrics
- Word Error Rate (WER) – Measures transcription accuracy.
- Phoneme Error Rate (PER) – Useful for phonetic consistency.
- Real-time Factor (RTF) – Measures inference speed.
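WER is edit distance computed at the word level; a minimal reference implementation makes the definition precise:

```python
# Word Error Rate: (substitutions + insertions + deletions) / reference words,
# computed with word-level Levenshtein distance.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is common for noisy low-resource ASR output.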
4. TTS – Speech Synthesis for Moore
4.1. Model Selection
We consider neural TTS models optimized for low-resource languages:
- Tacotron 2 (Google) – Sequence-to-sequence model for natural speech synthesis.
- FastSpeech 2 (Microsoft) – Non-autoregressive model for fast inference.
- VITS – End-to-end model with prosody control.
4.2. Techniques for Improvement
- Speaker Adaptation: Fine-tune on Moore voice datasets.
- Prosody & Expressiveness Modeling: Enhancing pitch and tone variation.
- Multilingual Pretraining: Using models trained on African languages.
- Data Augmentation:
  - Speech perturbation (speed, pitch, noise)
  - Phoneme-based synthesis
4.3. Evaluation Metrics
- MOS (Mean Opinion Score) – Human evaluation of naturalness.
- Mel Cepstral Distortion (MCD) – Measures synthesized speech quality.
- CER (Character Error Rate) – Measures intelligibility.
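MCD compares mel-cepstral coefficients of reference and synthesized speech. Below is a minimal sketch of the commonly used 10/ln(10) · sqrt(2 · Σd²) formulation; frames are assumed already time-aligned (e.g. via DTW), and the 0th (energy) coefficient is conventionally excluded by the caller.

```python
# Sketch of Mel Cepstral Distortion (MCD), in dB.
# Inputs are lists of mel-cepstral coefficient vectors, one per frame,
# assumed time-aligned between reference and synthesized speech.
import math

def mcd_frame(mc_ref, mc_syn):
    """MCD for one aligned frame pair."""
    sq = sum((a - b) ** 2 for a, b in zip(mc_ref, mc_syn))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

def mcd(ref_frames, syn_frames):
    """Mean frame-level MCD over aligned frame sequences."""
    distances = [mcd_frame(r, s) for r, s in zip(ref_frames, syn_frames)]
    return sum(distances) / len(distances)
```

Lower is better; identical cepstra give 0 dB, and reported MCD values for good neural TTS are typically in the single digits.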
5. Pipeline Integration & Deployment
TTT
graph LR;
    A[Input Source Text] --> B[Text Preprocessing & Normalization]
    B --> C[MT Model Selection]
    C --> D[Translation Model Options: NLLB / mBART-50 / M2M-100 / MarianMT / Fine-tuned Mistral]
    D --> E[Post-Translation Processing]
    E --> F[Output: Translated Text]
STT
graph LR;
    A[Audio Input] --> B[Audio Preprocessing: Noise Reduction, Normalization];
    B --> C[ASR Model: Whisper / Wav2Vec2 / Conformer];
    C --> D[Language Model Integration];
    D --> E[Post-processing];
    E --> F[Output: Transcribed Text];
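Chained together, the stages above yield a French-speech to Moore-speech flow. The skeleton below sketches that chain; every function is a placeholder stub (all names and return values here are hypothetical), standing in for the ASR, MT, and TTS models discussed in sections 2-4.

```python
# Skeleton of an end-to-end speech-to-speech pipeline. All stages are stubs.

def transcribe_fr(audio):           # STT stage (e.g. Whisper / Wav2Vec 2.0)
    return "bonjour"                # stub transcription

def translate_fr_to_mo(text):       # MT stage (e.g. NLLB / MarianMT)
    return f"<mo: {text}>"          # stub translation

def synthesize_mo(text):            # TTS stage (e.g. VITS / FastSpeech 2)
    return [0.0] * 16000            # stub: one second of silence at 16 kHz

def speech_to_speech(audio):
    """French audio in, synthesized Moore audio out."""
    text_fr = transcribe_fr(audio)
    text_mo = translate_fr_to_mo(text_fr)
    return synthesize_mo(text_mo)
```

Keeping each stage behind its own function boundary is what makes the system modular: any single model can be swapped (e.g. Whisper for Wav2Vec 2.0) without touching the rest of the pipeline.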
6. Challenges
- Lack of Moore training data → Data collection & augmentation.
- Dialectal Variations → Phonetic modeling techniques.
- Efficient deployment → Lightweight models.
- Multimodal learning (text, audio, visual cues) → Future direction.