VNUHCM Journal of Engineering and Technology
Feed: http://stdjet.scienceandtechnology.com.vn/index.php/stdjet/issue/feed
Updated: 2025-12-31T13:47:29+07:00
Contact: Pham Tan Thi (ptthi@hcmut.edu.vn)
Generator: Open Journal Systems

NLP-Method for Identifying and Extracting Vietnamese Education Legal Documents Based on OCR and XML Techniques
http://stdjet.scienceandtechnology.com.vn/index.php/stdjet/article/view/1536
Updated: 2025-12-31T13:47:29+07:00
Authors: Nguyen Van Sinh (nvsinh@hcmiu.edu.vn), Kiet Anh Phan (kietpa27@mp.hcmiu.edu.vn), Tuan Thanh Nguyen (nttuan@hcmiu.edu.vn), Sang Thi Thanh Nguyen (nttsang@hcmiu.edu.vn), Son Thanh Le (ltson@hcmiu.edu.vn)

<p>Digital transformation in education requires intelligent systems for extracting, standardizing, and analyzing large volumes of legal documents in various formats. Vietnamese legal documents in education have a complex structure made up of many components: national emblems, issuing agencies, document symbols, dates of issuance, clauses, appendices, and sometimes signatures or handwritten notes. The documents also differ by type: constitutions, laws, decrees, circulars, decisions, and so on. In addition, a decree regulates the document format (font, font size, margins, line spacing), the arrangement of components, and the rules for presenting appendices and attached documents. Processing these documents therefore poses many challenges for information retrieval and legal data management. Natural language processing (NLP) plays a crucial role in this kind of text and document processing. In this paper, we propose a novel approach that integrates NLP, Optical Character Recognition (OCR), and image processing to process Vietnamese legal documents in the education sector. This work lays the foundation for a future information retrieval system built on Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI).
Our method consists of the following steps: (1) collecting a database of legal documents in PDF format from the legal documents website of the Ministry of Education and Training (MOET); (2) removing noise, segmenting the different components, and converting them into plain text using OCR and image processing techniques; and (3) structuring the extracted text into XML format for our future system. Compared with existing applications, our system achieves an expected accuracy of over 99% on printed documents and can also recognize handwritten text. This study is a step toward a technical solution for a data platform that enables the intelligent application of LLMs and GenAI in an optimized database search engine, supporting informed decision-making based on legal documents.</p>
Published: 2025-12-31T00:00:00+07:00
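As an illustration of step (3), the following is a minimal Python sketch of how OCR output lines might be wrapped in XML. It is not the authors' implementation: the function name `text_to_xml`, the tag names (`document`, `issuer`, `issued`, `body`, `preamble`, `article`, `title`, `paragraph`), and the sample data are all assumptions made for this example. The only domain convention it relies on is that articles in Vietnamese legal texts typically begin with "Điều <number>.".

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Assumption: each article heading starts with "Điều <n>." on its own line.
ARTICLE_RE = re.compile(r"^Điều\s+(\d+)\.\s*(.*)$")


def text_to_xml(doc_type, issuer, issued, lines):
    """Wrap plain-text lines in a small XML tree (tag names are illustrative)."""
    root = ET.Element("document", {"type": doc_type})
    ET.SubElement(root, "issuer").text = issuer
    ET.SubElement(root, "issued").text = issued
    body = ET.SubElement(root, "body")

    current_article = None
    for raw in lines:
        # OCR engines may emit decomposed diacritics; normalize to NFC first.
        line = unicodedata.normalize("NFC", raw).strip()
        if not line:
            continue
        match = ARTICLE_RE.match(line)
        if match:
            # Start a new <article> and record its number and heading text.
            current_article = ET.SubElement(
                body, "article", {"number": match.group(1)}
            )
            ET.SubElement(current_article, "title").text = match.group(2)
        elif current_article is not None:
            # Text following an article heading belongs to that article.
            ET.SubElement(current_article, "paragraph").text = line
        else:
            # Text before the first article is kept as preamble.
            ET.SubElement(body, "preamble").text = line
    return root


# Made-up sample input, standing in for OCR output of one document.
sample_lines = [
    "Pursuant to the Law on Education;",
    "Điều 1. Scope",
    "This circular applies to secondary schools.",
    "",
    "Điều 2. Effect",
    "This circular takes effect on the date of signing.",
]

root = text_to_xml(
    "circular", "Ministry of Education and Training", "2025-01-01", sample_lines
)
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

A real schema would also need elements for the other components listed in the abstract (national emblem, document symbol, appendices, signatures); the flat article/paragraph layout here is only meant to show the general shape of the text-to-XML step.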