Open Access

Abstract

Digital transformation in education requires intelligent systems for extracting, standardizing, and analyzing large volumes of legal documents in various formats. Natural language processing (NLP) plays a crucial role in such text and document processing. In this paper, we propose a novel approach that integrates NLP, Optical Character Recognition (OCR), and image processing to process Vietnamese legal documents in the education sector. This work lays the foundation for a future optimized information retrieval system built on Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI). Our method consists of the following steps: (1) collecting a database of legal documents in PDF format from the legal document website of the Ministry of Education and Training (MOET); (2) removing noise, segmenting the different components, and converting them into plain text via OCR and image processing techniques; and (3) structuring the extracted text into XML format for our future system. Compared with the existing application, our system achieves an expected accuracy of over 99% on printed documents and can also recognize handwritten text. This study moves from a technical solution toward a data platform that can support the intelligent application of LLMs and GenAI in an optimized database search engine for decision-making based on legal documents.
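As an illustration of step (3) only, the following minimal sketch shows one way plain text produced by OCR could be structured into XML. It is not the authors' implementation: the element names (`document`, `preamble`, `article`, `title`, `paragraph`), the function `text_to_xml`, the sample text, and the rule that an article begins with a line of the form "Điều N. ..." are all assumptions made for this example.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Assumption: each article starts with a heading line such as "Điều 1. Title".
# The pattern is normalized to NFC so it matches NFC-normalized OCR output.
ARTICLE_RE = re.compile(unicodedata.normalize("NFC", r"^Điều\s+(\d+)\.\s*(.*)$"))


def text_to_xml(doc_id: str, text: str) -> ET.Element:
    """Group OCR plain text into a hypothetical <document>/<article> tree."""
    root = ET.Element("document", {"id": doc_id})
    current = None  # the <article> element currently being filled, if any
    for raw in text.splitlines():
        # Normalize Vietnamese diacritics and trim surrounding whitespace.
        line = unicodedata.normalize("NFC", raw).strip()
        if not line:
            continue
        match = ARTICLE_RE.match(line)
        if match:
            # A new article heading opens a new <article> element.
            current = ET.SubElement(root, "article", {"number": match.group(1)})
            ET.SubElement(current, "title").text = match.group(2)
        elif current is not None:
            # Body lines belong to the most recently opened article.
            ET.SubElement(current, "paragraph").text = line
        else:
            # Lines before the first article are kept as preamble.
            ET.SubElement(root, "preamble").text = line
    return root


# Placeholder text invented for this example, not taken from a real document.
sample = """MINISTRY HEADER (placeholder)
Điều 1. Scope
This is placeholder text.
Điều 2. Applicability
More placeholder text.
"""

root = text_to_xml("sample-001", sample)
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

A real pipeline would also need to handle chapters, clauses, signatures, and OCR errors in the headings themselves; the sketch only shows the general shape of mapping line-level text onto a tree.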



Article Details

Issue: Vol 9 No 1 (2026)
Page No.: 2702-2714
Published: Dec 31, 2025
Section: Research article
DOI: https://doi.org/10.32508/stdjet.v9i1.1536

 Copyright Info

Copyright: The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

 How to Cite
Sinh, N., Phan, K., Nguyen, T., Nguyen, S., & Le, S. (2025). NLP-Method for Identifying and Extracting Vietnamese Education Legal Documents Based on OCR and XML Techniques. VNUHCM Journal of Engineering and Technology, 9(1), 2702-2714. https://doi.org/10.32508/stdjet.v9i1.1536
