Leveraging Sentence-oriented Augmentation And Transformer-Based Architecture For Vietnamese-Bahnaric Translation

Sang Tan Nguyen; Nguyen Quoc Pham; Tho Thanh Quan

doi:10.32508/stdjet.v6i4.1284


Open Access

Abstract

The Bahnar people, an ethnic minority in Vietnam with a rich ancestral heritage, possess a language of immense cultural and historical significance. The government places a strong emphasis on preserving and promoting the Bahnaric language by making it accessible online and encouraging communication across generations. Recent advances in artificial intelligence, such as Neural Machine Translation (NMT), have transformed translation by improving accuracy and fluency. This, in turn, contributes to the revival of the language through education, communication, and documentation. Specifically, NMT is pivotal in enhancing accessibility for Bahnaric speakers, making information and content more readily available. Nevertheless, translating Vietnamese into Bahnaric faces practical challenges because of the limited resources available for the Bahnaric language. To address this, we employ state-of-the-art NMT techniques along with two augmentation strategies for the domain-specific Vietnamese-Bahnaric translation task. In the multi-task data augmentation approach, new sentence pairs are generated through transformations. These augmented sentences are used as auxiliary tasks within a multi-task framework during training. The objective is to introduce new contexts in which the target prefix alone does not provide sufficient information for predicting the next word accurately. This approach strengthens the encoder and compels the decoder to attend more to the source representations from the encoder. The sentence boundary augmentation method, on the other hand, extends the noising-based approach beyond the word level to sentence-level augmentation. In neural machine translation, errors related to grammatical structure and sentence boundaries pose significant challenges to robustness.
A thorough examination of errors shows that sentence boundary segmentation has the most substantial impact on translation quality. To improve robustness to segmentation errors, a straightforward data augmentation strategy is devised. Importantly, both approaches are flexible and can be used with various neural machine translation models. Additionally, they require no complex data preprocessing, no training of additional systems, and no extra data beyond the existing parallel training corpora.
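
The two augmentation strategies described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify which transformations are used, how auxiliary tasks are marked, or how sentences are cut, so the reverse and swap transformations, the task tags `<rev>` and `<swap>`, the function names, and the `ratio` parameter are all assumptions made for illustration. Token lists use placeholder strings rather than real Vietnamese or Bahnaric text.

```python
import random


def reverse_task(src, tgt, tag="<rev>"):
    """Auxiliary pair (assumed transformation): tagged source, reversed target.

    With the target order changed, the target prefix is less informative,
    so the decoder has to rely more on the encoder's source representation.
    """
    return [tag] + list(src), list(reversed(tgt))


def swap_task(src, tgt, rng, n_swaps=1, tag="<swap>"):
    """Auxiliary pair (assumed transformation): tagged source, target with
    n_swaps random pairs of positions exchanged."""
    out = list(tgt)
    for _ in range(n_swaps):
        if len(out) < 2:
            break
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return [tag] + list(src), out


def boundary_augment(pair_a, pair_b, ratio=0.5):
    """Sentence-level noising (assumed form): drop the first `ratio` share of
    pair_a, keep the first `ratio` share of pair_b, and join the remainders.

    The same proportion is applied to source and target so that the new pair
    stays roughly parallel while its boundaries no longer match a sentence.
    """
    (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b

    def tail(tokens):
        return list(tokens[int(len(tokens) * ratio):])

    def head(tokens):
        return list(tokens[:int(len(tokens) * ratio)])

    return tail(src_a) + head(src_b), tail(tgt_a) + head(tgt_b)


# Placeholder parallel pairs (source tokens, target tokens).
pair_a = (["s1", "s2", "s3", "s4"], ["t1", "t2", "t3", "t4"])
pair_b = (["s5", "s6", "s7", "s8"], ["t5", "t6"])

rev_pair = reverse_task(*pair_a)
swap_pair = swap_task(*pair_a, rng=random.Random(0))
mixed_pair = boundary_augment(pair_a, pair_b, ratio=0.5)
```

In this sketch the augmented pairs would simply be appended to the original parallel corpus before training, which matches the abstract's claim that neither strategy needs extra data or additional systems. The proportional cut in `boundary_augment` is a simplification that assumes source and target tokens are roughly aligned in order.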

Authors' Affiliations

Sang Tan Nguyen

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam; Vietnam National University Ho Chi Minh City (VNU-HCM), Linh Trung Ward, Thu Duc City, Ho Chi Minh City, Vietnam
https://orcid.org/0009-0004-9209-9155

Nguyen Quoc Pham

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam; Vietnam National University Ho Chi Minh City (VNU-HCM), Linh Trung Ward, Thu Duc City, Ho Chi Minh City, Vietnam

Email for correspondence: quocnguyenkh@gmail.com

Tho Thanh Quan

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam; Vietnam National University Ho Chi Minh City (VNU-HCM), Linh Trung Ward, Thu Duc City, Ho Chi Minh City, Vietnam
https://orcid.org/0000-0003-0467-6254

Article Details

Issue: Vol 6 No 4 (2023)

Page No.: 2099-2117

Published: May 13, 2024

Section: Research article

DOI: https://doi.org/10.32508/stdjet.v6i4.1284

Copyright: The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

How to Cite

Nguyen, S., Pham, N., & Quan, T. (2024). Leveraging Sentence-oriented Augmentation And Transformer-Based Architecture For Vietnamese-Bahnaric Translation. VNUHCM Journal of Engineering and Technology, 6(4), 2099-2117. https://doi.org/10.32508/stdjet.v6i4.1284

VNUHCM Journal of Engineering and Technology

An official journal of Viet Nam National University Ho Chi Minh City, Viet Nam since 2018

ISSN 2615-9872
