Abstract
Digital transformation in education requires intelligent systems for extracting, standardizing, and analyzing large volumes of legal documents in various formats. Natural language processing (NLP) plays a crucial role in such text and document processing. In this paper, we propose a novel approach that integrates NLP, Optical Character Recognition (OCR), and image processing to process Vietnamese legal documents in the education sector. This work lays the foundation for a future optimized information retrieval system based on Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI). Our method includes the following steps: (1) collecting the legal document database in PDF format from the legal documents website of the Ministry of Education and Training (MOET); (2) removing noise, segmenting the different components, and converting them into plain text via OCR and image processing techniques; and (3) structuring the extracted text into XML format for our future system. Compared with the existing application, our system achieves an expected accuracy of over 99% on printed documents and can also recognize handwritten text. This study is a step from a technical solution toward a data platform, paving the way for the intelligent application of LLMs and GenAI in an optimized database search engine for decision-making based on legal documents.
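As an illustration of step (3) only, the following is a minimal Python sketch of how OCR-extracted plain text might be wrapped in XML. The abstract does not specify the XML schema, so the function name `text_to_xml` and the element and attribute names (`document`, `title`, `body`, `paragraph`, `id`, `n`) are assumptions made for this example, not the authors' format.

```python
import xml.etree.ElementTree as ET


def text_to_xml(doc_id: str, title: str, plain_text: str) -> str:
    """Wrap OCR-extracted plain text in a simple XML structure.

    All element and attribute names are hypothetical; the actual
    schema used by the authors is not given in the abstract.
    """
    root = ET.Element("document", attrib={"id": doc_id})
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")

    # Treat each blank-line-separated block as one paragraph.
    blocks = [b.strip() for b in plain_text.split("\n\n")]
    for index, block in enumerate((b for b in blocks if b), start=1):
        para = ET.SubElement(body, "paragraph", attrib={"n": str(index)})
        # Collapse line breaks left over from the page layout.
        para.text = " ".join(block.split())

    # encoding="unicode" returns a str and keeps Vietnamese characters as-is.
    return ET.tostring(root, encoding="unicode")
```

A real legal document would likely need finer-grained elements (for example articles and clauses), which this sketch does not attempt to model.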
Issue: Vol 9 No 1 (2026)
Page No.: 2702-2714
Published: Dec 31, 2025
Section: Research article
DOI: https://doi.org/10.32508/stdjet.v9i1.1536
Open Access 





