Open Access

Abstract

The Bahnar people, an ethnic minority in Vietnam with a rich ancestral heritage, speak a language of great cultural and historical significance. The government places strong emphasis on preserving and promoting the Bahnaric language by making it accessible online and encouraging communication across generations. Recent advances in artificial intelligence, such as Neural Machine Translation (NMT), have transformed translation by improving accuracy and fluency, which in turn supports the revival of the language through education, communication, and documentation. In particular, NMT makes information and content more readily available to Bahnaric speakers. Nevertheless, translating Vietnamese into Bahnaric faces practical challenges because few resources are available for the Bahnaric language. To address this, we employ state-of-the-art NMT techniques together with two augmentation strategies for the domain-specific Vietnamese-Bahnaric translation task. In the multi-task data augmentation approach, new sentence pairs are generated through transformations and used as auxiliary tasks within a multi-task framework during training. The objective is to introduce new contexts in which the target prefix alone does not provide enough information to predict the next word accurately. This approach strengthens the encoder and compels the decoder to attend more closely to the encoder's source representations. The sentence boundary augmentation method, on the other hand, extends the noising-based approach beyond the word level to the sentence level. In neural machine translation, errors in grammatical structure and sentence boundaries pose significant challenges to robustness.
A thorough examination of errors shows that sentence boundary segmentation has the most substantial impact on translation quality. To make the model more robust to segmentation errors, a straightforward data augmentation strategy is devised. Importantly, both approaches are flexible and can be used with various neural machine translation models. They also require no complex data preprocessing, no training of additional systems, and no data beyond the existing parallel training corpora.
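
As an illustration only, the two augmentation ideas described in the abstract might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `<rev>` task tag, the choice of target reversal as the example transformation, and the rule of keeping the tail of one pair and the head of the next are all hypothetical. Placeholder tokens stand in for real Vietnamese and Bahnaric text.

```python
def reverse_target(src, tgt, tag="<rev>"):
    """Auxiliary-task pair (assumed transformation): the source is kept and
    prefixed with a hypothetical task tag, the target tokens are reversed.
    With a reversed target, the target prefix says little about the next
    word, so the model has to rely on the source."""
    return f"{tag} {src}", " ".join(reversed(tgt.split()))


def _keep_tail(sentence, ratio):
    """Return the last `ratio` share of the tokens (at least one if any)."""
    tokens = sentence.split()
    k = max(1, round(len(tokens) * ratio))
    return tokens[-k:]


def _keep_head(sentence, ratio):
    """Return the first `ratio` share of the tokens (at least one if any)."""
    tokens = sentence.split()
    k = max(1, round(len(tokens) * ratio))
    return tokens[:k]


def boundary_augment(pair_a, pair_b, ratio=0.5):
    """Sentence-level noising (assumed rule): join the tail of one sentence
    pair with the head of the following pair, on both the source and the
    target side, to imitate a wrongly placed sentence boundary."""
    src = _keep_tail(pair_a[0], ratio) + _keep_head(pair_b[0], ratio)
    tgt = _keep_tail(pair_a[1], ratio) + _keep_head(pair_b[1], ratio)
    return " ".join(src), " ".join(tgt)


if __name__ == "__main__":
    # Placeholder tokens only; no real corpus data is used here.
    print(reverse_target("a b c", "x y z"))
    print(boundary_augment(("a b c d", "w x y z"), ("e f g h", "p q r s")))
```

Both functions only rewrite pairs that already exist in the parallel corpus, which is consistent with the abstract's statement that no extra data or additional systems are needed. The augmented pairs would simply be appended to the training set.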



Authors' Affiliations
  • Sang Tan Nguyen

    ORCID: https://orcid.org/0009-0004-9209-9155

  • Nguyen Quoc Pham

    Email for correspondence: quocnguyenkh@gmail.com

  • Tho Thanh Quan

    ORCID: https://orcid.org/0000-0003-0467-6254

Article Details

Issue: Vol 6 No 4 (2023)
Page No.: 2099-2117
Published: May 13, 2024
Section: Research article
DOI: https://doi.org/10.32508/stdjet.v6i4.1284

 Copyright Info

Creative Commons License

Copyright: The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License CC-BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

 How to Cite
Nguyen, S., Pham, N., & Quan, T. (2024). Leveraging Sentence-oriented Augmentation And Transformer-Based Architecture For Vietnamese-Bahnaric Translation. VNUHCM Journal of Engineering and Technology, 6(4), 2099-2117. https://doi.org/10.32508/stdjet.v6i4.1284

