NLP-Method for Identifying and Extracting Vietnamese Education Legal Documents Based on OCR and XML Techniques

Nguyen Van Sinh; Kiet Anh Phan; Tuan Thanh Nguyen; Sang Thi Thanh Nguyen; Son Thanh Le

doi:10.32508/stdjet.v9i1.1536


Open Access

Abstract

Digital transformation in education requires intelligent systems for extracting, standardizing, and analyzing large volumes of legal documents in various formats. Vietnamese legal documents in education have a complex structure with many components, including national emblems, issuing agencies, symbols, dates of issuance, clauses, appendices, and sometimes signatures or handwritten notes. They also differ by type: constitutions, laws, decrees, circulars, decisions, and so on. In addition, the relevant decree specifies the document format (font, font size, margins, line spacing), the arrangement of components, and the rules for presenting appendices and attached documents. Processing these documents therefore poses many challenges for information retrieval and legal data management. Natural language processing (NLP) plays a crucial role in such text and document processing. In this paper, we propose a novel approach that integrates NLP, Optical Character Recognition (OCR), and image processing to process Vietnamese legal documents in the education sector. This work lays the foundation for a future information retrieval system built on Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI). Our method comprises the following steps: (1) collecting a database of legal documents in PDF format from the legal documents website of the Ministry of Education and Training (MOET); (2) removing noise, segmenting the different components, and converting them into plain text via OCR and image processing techniques; and (3) structuring the extracted text into XML format for our future system. Compared to the existing application, our system achieves an expected accuracy of over 99% on printed documents and is also able to recognize handwritten text.
This study is a step toward a technical solution for a data platform that enables the intelligent application of LLMs and GenAI in an optimized database search engine, supporting informed decision-making based on legal documents.
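Step (3) of the method, structuring extracted text into XML, can be illustrated with a minimal sketch. It is not the authors' implementation: the element names (`document`, `issuer`, `body`, `preamble`, `clause`, `title`, `paragraph`), the function `text_to_xml`, the clause-heading pattern, and the sample lines are all hypothetical, chosen only to show how plain OCR output might be wrapped in a tree using Python's standard `xml.etree.ElementTree` module.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a clause heading looks like "Điều <number>. <title>"
# ("Article <number>. <title>"). The paper does not state its pattern.
CLAUSE_RE = re.compile(r"^Điều\s+(\d+)\.\s*(.*)$")


def text_to_xml(doc_type, issuer, lines):
    """Wrap OCR output lines in a simple, hypothetical XML schema."""
    root = ET.Element("document", {"type": doc_type})
    ET.SubElement(root, "issuer").text = issuer
    body = ET.SubElement(root, "body")
    current = None  # the clause element currently being filled
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from OCR
        match = CLAUSE_RE.match(line)
        if match:
            # A new clause heading starts a new <clause> element.
            current = ET.SubElement(body, "clause", {"number": match.group(1)})
            ET.SubElement(current, "title").text = match.group(2)
        elif current is not None:
            # Text after a heading belongs to the current clause.
            ET.SubElement(current, "paragraph").text = line
        else:
            # Text before the first clause is treated as preamble.
            ET.SubElement(body, "preamble").text = line
    return root


# Invented sample lines standing in for OCR output (not real data).
sample = [
    "Căn cứ Luật Giáo dục;",
    "Điều 1. Phạm vi điều chỉnh",
    "Thông tư này quy định về ...",
    "Điều 2. Đối tượng áp dụng",
]
root = text_to_xml("circular", "MOET", sample)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A real schema would also need elements for the other components the abstract lists, such as the date of issuance, appendices, and signatures; this sketch covers only the preamble and clauses.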

Authors' Affiliations

Nguyen Van Sinh

International University - Vietnam National University of Ho Chi Minh City, Vietnam

Email for correspondence:
nvsinh@hcmiu.edu.vn


Kiet Anh Phan

International University - Vietnam National University of Ho Chi Minh City, Vietnam

Tuan Thanh Nguyen

International University - Vietnam National University of Ho Chi Minh City, Vietnam

Sang Thi Thanh Nguyen

International University - Vietnam National University of Ho Chi Minh City, Vietnam

Son Thanh Le

International University - Vietnam National University of Ho Chi Minh City, Vietnam

Article Details

Issue: Vol 9 No 1 (2026)

Page No.: 2702-2714

Published: Dec 31, 2025

Section: Research article

DOI: https://doi.org/10.32508/stdjet.v9i1.1536

Copyright: The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

How to Cite

Sinh, N., Phan, K., Nguyen, T., Nguyen, S., & Le, S. (2025). NLP-Method for Identifying and Extracting Vietnamese Education Legal Documents Based on OCR and XML Techniques. VNUHCM Journal of Engineering and Technology, 9(1), 2702-2714. https://doi.org/10.32508/stdjet.v9i1.1536

VNUHCM Journal of Engineering and Technology

An official journal of Viet Nam National University Ho Chi Minh City, Viet Nam since 2018

ISSN 2615-9872
