Abstract
The Bahnar people, an ethnic minority in Vietnam with a rich ancestral heritage, possess a language of immense cultural and historical significance. The government places strong emphasis on preserving and promoting the Bahnaric language by making it accessible online and encouraging communication across generations. Recent advances in artificial intelligence, such as Neural Machine Translation (NMT), have transformed translation by improving accuracy and fluency. This, in turn, contributes to the revival of the language through education, communication, and documentation. In particular, NMT enhances accessibility for Bahnaric speakers by making information and content more readily available. Nevertheless, translating Vietnamese into Bahnaric faces practical challenges because of the limited resources available for the Bahnaric language. To address this, we employ state-of-the-art NMT techniques along with two augmentation strategies for the domain-specific Vietnamese-Bahnaric translation task. In the multi-task data augmentation approach, new sentence pairs are generated through transformations, and these augmented sentences are used as auxiliary tasks within a multi-task framework during training. The objective is to introduce new contexts in which the target prefix alone does not provide sufficient information to predict the next word accurately. This approach strengthens the encoder and compels the decoder to pay more attention to the encoder's source representations. The sentence boundary augmentation method, in turn, extends the noising-based approach beyond the word level to the sentence level. In neural machine translation, errors related to grammatical structure and sentence boundaries pose significant challenges to robustness. A thorough examination of errors shows that sentence boundary segmentation has the most substantial impact on translation quality. To enhance robustness to segmentation errors, a straightforward data augmentation strategy is devised. Importantly, both approaches are flexible and can be used with various NMT models. Additionally, they require no complex data preprocessing, no training of additional systems, and no extra data beyond the existing parallel training corpora.
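The two augmentation ideas can be illustrated with a minimal Python sketch. The abstract does not specify the transformations or the truncation scheme, so everything below is an assumption made for illustration: target reversal as the auxiliary-task transformation, the `<rev>` task tag, the `keep_ratio` parameter, and the function names are all hypothetical, and the tokens are placeholders rather than real Vietnamese or Bahnaric text.

```python
def reverse_target_task(src_tokens, tgt_tokens, tag="<rev>"):
    """Auxiliary-task pair (assumed transformation): reverse the target.

    A reversed target makes the target prefix a poor predictor of the next
    word, so the model has to rely on the source. The tag (hypothetical)
    marks the pair as an auxiliary task rather than real translation.
    """
    return [tag] + list(src_tokens), list(tgt_tokens)[::-1]


def boundary_augment(pair_a, pair_b, keep_ratio=0.5):
    """Sentence-boundary noise (assumed scheme): join the tail of one
    sentence pair with the head of the next, imitating bad segmentation."""
    (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b

    def tail(tokens):
        n = max(1, int(len(tokens) * keep_ratio))
        return list(tokens[-n:])

    def head(tokens):
        n = max(1, int(len(tokens) * keep_ratio))
        return list(tokens[:n])

    return tail(src_a) + head(src_b), tail(tgt_a) + head(tgt_b)


# Placeholder tokens only; real use would pass tokenized parallel sentences.
aux_src, aux_tgt = reverse_target_task(["a", "b", "c"], ["x", "y", "z"])
noisy_src, noisy_tgt = boundary_augment(
    (["a1", "a2", "a3", "a4"], ["x1", "x2", "x3", "x4"]),
    (["b1", "b2"], ["y1", "y2"]),
)
```

Both functions only rearrange the existing parallel data, which is consistent with the claim that neither approach needs extra data or additional trained systems. Truncating source and target by the same ratio is a simplification that ignores word alignment.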
Issue: Vol 6 No 4 (2023)
Page No.: 2099-2117
Published: May 13, 2024
Section: Research article
DOI: https://doi.org/10.32508/stdjet.v6i4.1284