ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-09262024-153342


Thesis type
Master's thesis
Author
AYUSHI, AYUSHI
Email address
a.vijaykumar@studenti.unipi.it, ayushisrisri07@gmail.com
URN
etd-09262024-153342
Title
Design and Implementation of a Cloud-Based AI System for Automated Document Processing: Integrating Google Cloud Storage, Document AI, and Vertex AI
Department
INFORMATICA
Degree programme
DATA SCIENCE AND BUSINESS INFORMATICS
Supervisors
supervisor Prof. Pollacci, Laura
Keywords
  • Data analysis
  • Document AI
  • Google Cloud
  • TypeScript code
  • Vertex AI
Exam session start date
11/10/2024
Availability
Not available for consultation
Release date
11/10/2094
Abstract
The growing complexity and volume of unstructured data, particularly in sectors such as finance, legal, and healthcare, have created significant demand for more efficient and accurate document management solutions. Traditional document processing, which often relies on manual review and data extraction, is slow and error-prone, and it does not scale to large document volumes. This thesis presents a framework for automated document processing that combines a cloud-based approach with artificial intelligence (AI) technologies.
Problem Context and Motivation
The research addresses the challenges of handling unstructured financial documents, typically PDFs, which are common in sectors that deal with financial instruments, contracts, and other legal documentation. These documents contain critical data points, such as the International Securities Identification Number (ISIN), coupon rate, issue date, and maturity date, that are essential for financial analysis and reporting but are often buried within dense, unstructured content.
Manual extraction of this data is labor-intensive and error-prone, particularly in industries where large volumes of documents must be processed regularly. This thesis, therefore, focuses on building an automated system that not only reduces the time and labor associated with document processing but also enhances accuracy and scalability. By leveraging cloud infrastructure and AI-driven tools, the proposed system seeks to improve the extraction, classification, and storage of critical data from financial documents.
Proposed Framework
The proposed system integrates several cloud-based and AI technologies, including Google Cloud Storage, MongoDB, Document AI, and Vertex AI. Each of these components plays a crucial role in the pipeline, which is designed to optimize the entire document processing workflow, from ingestion to data extraction and storage.
Google Cloud Storage is used for securely storing the raw unstructured documents, such as PDFs and images, while ensuring scalability.
MongoDB serves as the NoSQL database system, where structured data extracted from the documents is stored. This enables efficient querying and retrieval of extracted information for downstream applications.
Document AI and Vertex AI power the data extraction process by applying machine learning models trained to recognize and extract relevant financial fields, such as ISIN, issue price, maturity date, and other key financial metrics.
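As an illustration of what a structured record of this kind might look like, the sketch below defines a hypothetical ExtractedSecurityRecord type and a check-digit test for the ISIN field. The type name, the field names, and the idea of validating a value after extraction are assumptions made for illustration and are not taken from the thesis. The check itself follows the standard ISIN rule: letters are expanded to numbers and the Luhn algorithm is applied.

```typescript
// Hypothetical record shape; field names are illustrative, not the thesis schema.
interface ExtractedSecurityRecord {
  isin: string;
  couponRate?: number;   // percent per year
  issueDate?: string;    // ISO 8601 date
  maturityDate?: string; // ISO 8601 date
}

// Returns true when the string has the ISIN layout (2 letters, 9 alphanumerics,
// 1 digit) and its final check digit satisfies the Luhn algorithm.
function isValidIsin(isin: string): boolean {
  if (!/^[A-Z]{2}[A-Z0-9]{9}[0-9]$/.test(isin)) {
    return false;
  }
  // Expand each character to its base-36 value: "0".."9" stay as they are,
  // "A" becomes "10", ..., "Z" becomes "35".
  const digits = isin
    .split("")
    .map((ch) => parseInt(ch, 36).toString())
    .join("");

  // Luhn: walking from the right, double every second digit, subtract 9 from
  // any result above 9, and add everything up.
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" has char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// A well-known valid ISIN, used here only to exercise the check.
const record: ExtractedSecurityRecord = { isin: "US0378331005" };
console.log(isValidIsin(record.isin)); // true
```

A check like this could flag extracted ISIN values that are structurally impossible before they are stored, independently of the model's own confidence.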
The design of the system is modular, allowing for adaptability and future scalability. Each component can be replaced or upgraded without disrupting the overall workflow, ensuring that the system remains flexible to accommodate future advancements in AI technology or changes in document formats and content.
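One way to picture this modularity is as a set of narrow interfaces, one per stage, with the pipeline depending only on the interfaces. The sketch below is an assumption about the shape of such a design, not the thesis code. Every name in it is hypothetical, the in-memory classes merely stand in for Google Cloud Storage, Document AI or Vertex AI, and MongoDB, and the calls are synchronous for brevity where real cloud clients would be asynchronous.

```typescript
// All names are hypothetical stand-ins for the components named in the text.
interface RawDocument {
  id: string;
  text: string; // stands in for the PDF payload
}

type ExtractedFields = Record<string, string>;

interface DocumentStore {
  fetch(id: string): RawDocument;
}

interface FieldExtractor {
  extract(doc: RawDocument): ExtractedFields;
}

interface RecordSink {
  save(id: string, fields: ExtractedFields): void;
}

// The pipeline depends only on the three interfaces, so any stage can be
// swapped without touching the others.
function processDocument(
  id: string,
  store: DocumentStore,
  extractor: FieldExtractor,
  sink: RecordSink
): ExtractedFields {
  const doc = store.fetch(id);
  const fields = extractor.extract(doc);
  sink.save(id, fields);
  return fields;
}

// In-memory stand-in for the object store.
class InMemoryStore implements DocumentStore {
  private docs = new Map<string, RawDocument>();

  add(doc: RawDocument): void {
    this.docs.set(doc.id, doc);
  }

  fetch(id: string): RawDocument {
    const doc = this.docs.get(id);
    if (doc === undefined) {
      throw new Error(`Unknown document: ${id}`);
    }
    return doc;
  }
}

// Toy extractor: a regular expression in place of a trained model.
class RegexIsinExtractor implements FieldExtractor {
  extract(doc: RawDocument): ExtractedFields {
    const match = doc.text.match(/\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b/);
    return match ? { ISIN: match[0] } : {};
  }
}

// In-memory stand-in for the database.
class InMemorySink implements RecordSink {
  readonly saved = new Map<string, ExtractedFields>();

  save(id: string, fields: ExtractedFields): void {
    this.saved.set(id, fields);
  }
}

const store = new InMemoryStore();
store.add({ id: "doc-1", text: "Notes due 2030, ISIN US0378331005." });
const sink = new InMemorySink();
const fields = processDocument("doc-1", store, new RegexIsinExtractor(), sink);
console.log(fields); // { ISIN: 'US0378331005' }
```

Replacing RegexIsinExtractor with a class that calls a hosted model would leave processDocument and the other two stages unchanged, which is the property the modular design aims for.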
Evaluation of the AI-Based Document Processor
A key contribution of this research is the evaluation of the AI document parser, particularly in its ability to process financial documents. The system was tested using financial data such as ISIN codes, coupon rates, and maturity dates, and the results were analyzed using key evaluation metrics, including F1 score, precision, and recall.
The evaluation metrics show how well the system balances precision (the proportion of extracted values that are correct) and recall (the proportion of values actually present that were extracted). The overall F1 score across all labels was 0.869, with precision of 90.6% and recall of 83.6%. The system therefore produces relatively few false positives, while missing roughly one in six of the values that should have been extracted. The F1 scores for individual fields such as ISIN (0.974), Issue Date (0.974), and Maturity (0.947) indicate high accuracy on key financial terms.
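To make the relationship between these figures concrete, the following sketch computes precision, recall, and F1 from raw counts, and then recomputes F1 from the precision and recall reported above. The function and type names are illustrative, and the counts in the first example are invented; only 0.906 and 0.836 come from the abstract.

```typescript
// Illustrative names; the counts below are invented for the example.
interface MatchCounts {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
}

// Share of extracted values that are correct.
function precision(c: MatchCounts): number {
  const predicted = c.truePositives + c.falsePositives;
  return predicted === 0 ? 0 : c.truePositives / predicted;
}

// Share of values actually present that were extracted.
function recall(c: MatchCounts): number {
  const actual = c.truePositives + c.falseNegatives;
  return actual === 0 ? 0 : c.truePositives / actual;
}

// F1 is the harmonic mean of precision and recall.
function f1(p: number, r: number): number {
  return p + r === 0 ? 0 : (2 * p * r) / (p + r);
}

const counts: MatchCounts = { truePositives: 8, falsePositives: 2, falseNegatives: 2 };
console.log(precision(counts), recall(counts)); // 0.8 0.8

// Recompute the overall F1 from the rounded precision and recall in the abstract.
console.log(f1(0.906, 0.836).toFixed(3)); // 0.870
```

The recomputed value of about 0.870 agrees with the reported 0.869 to within the rounding of the two inputs.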
The system also incorporates a "fuzzy matching" feature, allowing some degree of flexibility when matching patterns within documents, which is particularly useful when dealing with minor variations in the way information is presented across different documents.
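The abstract does not say how fuzzy matching is implemented. One common approach is to compare strings by edit distance after light normalization. The sketch below assumes that approach, and the function names and the 0.85 threshold are illustrative choices rather than values from the thesis.

```typescript
// Levenshtein distance: the minimum number of single-character insertions,
// deletions, or substitutions needed to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Ignore case and differences in whitespace before comparing.
function normalize(s: string): string {
  return s.trim().toLowerCase().replace(/\s+/g, " ");
}

// Similarity in [0, 1]: 1 means identical after normalization.
function similarity(a: string, b: string): number {
  const x = normalize(a);
  const y = normalize(b);
  const longest = Math.max(x.length, y.length);
  return longest === 0 ? 1 : 1 - levenshtein(x, y) / longest;
}

// The threshold is an assumed value chosen for the example.
function fuzzyMatch(a: string, b: string, threshold: number = 0.85): boolean {
  return similarity(a, b) >= threshold;
}

console.log(fuzzyMatch("Maturity Date", "maturity  date")); // true
console.log(fuzzyMatch("Issue Date", "Isue Date"));         // true
console.log(fuzzyMatch("Issue Date", "Maturity Date"));     // false
```

A lower threshold tolerates more variation between documents but raises the risk of matching unrelated labels, so the value would need tuning against real data.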
Challenges and Limitations
While the system shows promising results, there is room for improvement. The CURRENCY (F1 score of 0.800) and ISSUER (F1 score of 0.718) fields scored below the overall F1 score of 0.869, which suggests that more targeted training data may be needed to improve extraction accuracy for these fields.
Additionally, the current system is largely limited to handling PDF files and financial documents. Although these are widely used formats in industries like finance, expanding the system to support other formats such as spreadsheets or images would increase its applicability across a broader range of industries and document types. Moreover, incorporating more training data from diverse sources would further refine the model's ability to generalize across different document styles and structures.
Future Work
The modular nature of the proposed system allows for several future enhancements. One is to expand the range of supported document formats, for example by integrating optical character recognition (OCR) to process scanned images of documents. This would be particularly useful for industries that handle paper-based documents in addition to digital PDFs.
Another avenue for improvement lies in the AI models themselves. By incorporating more extensive and diverse training data, the system’s ability to accurately extract key information from financial documents can be significantly enhanced. This is especially important for fields where the current F1 scores indicate room for improvement, such as the currency and issuer fields.
The user experience is another important area for future development. Currently, the system focuses primarily on back-end processing and data extraction. However, developing a more intuitive and user-friendly dashboard would enhance the overall usability of the system. Such a dashboard could provide real-time status updates, performance analytics, and an easier interface for monitoring document processing and extracted data.
Conclusion
This thesis demonstrates the feasibility of using AI and cloud technologies to automate document processing, specifically for financial documents. The system improves on traditional manual methods in accuracy, scalability, and speed. Through the integration of Google Cloud Storage, MongoDB, Document AI, and Vertex AI, it provides a flexible and robust solution that can be adapted to future advances in AI and cloud technology.
In conclusion, the research presented in this thesis addresses a critical need in industries dealing with large volumes of unstructured data, offering a scalable, accurate, and efficient solution for automating the extraction of key financial information. As technology continues to advance, systems like the one proposed in this thesis will be essential in meeting the growing demands for real-time, accurate document processing in data-intensive industries.
File