Harnessing LLMs for complex document processing

Jesper Edström

August 26, 2024

1. Introduction

2. The Data Extraction Tool: Complexity and Inconsistency in Data Extraction

3. Developing the Tool

- Challenges

- Future Potential

4. Text-to-Illustration of Architecture

5. Conclusion

Introduction

Jesper Edström, an engineering physics student from Uppsala University, completed his Master's thesis with us this fall. Under the guidance of Rasmus Lindqvist, Jesper designed, researched, and built a document reader powered by large language models (LLMs).

As LLMs grow more capable, new areas keep emerging where they can be put to work. Retrieval-augmented generation (RAG) is an architecture introduced by Facebook AI researchers in 2020 that lets an LLM draw on external data when answering a question, instead of relying solely on its pre-trained parameters. At the end of this piece, there is a link to an article about RAG for those interested in learning more.

The Data Extraction Tool

Complexity and Inconsistency in Data Extraction

In the battery industry, technical documents such as Material Safety Data Sheets (MSDS), Technical Specification Sheets, and Test Reports are crucial. However, these documents often present challenges due to their length, inconsistent structure, and complex data, making accurate extraction difficult.

Jesper explains, "The real challenge lies in the inconsistency across these documents. Each one is structured differently, with varying levels of detail and complexity, which makes manual extraction prone to errors."

Misinterpreting these documents can lead to significant inefficiencies in data analysis and decision-making, highlighting the need for a reliable tool that can accurately extract data from such diverse sources.

Developing the Tool

To address these challenges, a pipeline was developed that takes one of these documents (an MSDS, a Technical Specification Sheet, or a Test Report) along with a set of predefined questions, and returns the extracted data in a structured, easily digestible format. The goal was to create a system that could classify and parse any of these document types correctly and answer a set of relevant, generic questions for each. By restricting users to these document types and using predefined questions and prompts tailored to each type, the tool achieves more consistent results in data retrieval.
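
As a minimal sketch of this idea, the snippet below pairs each supported document type with a fixed list of questions and rejects anything else. The type keys, the example questions, and the names `PREDEFINED_QUESTIONS`, `ExtractionJob`, and `build_job` are all hypothetical; the article does not list the questions actually used in the thesis.

```python
from dataclasses import dataclass

# Hypothetical document types and questions, for illustration only.
PREDEFINED_QUESTIONS = {
    "msds": [
        "What hazards are identified for the product?",
        "What first-aid measures are recommended?",
    ],
    "technical_specification": [
        "What is the nominal voltage?",
        "What is the rated capacity?",
    ],
    "test_report": [
        "Which tests were performed?",
        "Did the pack or module pass each test?",
    ],
}


@dataclass
class ExtractionJob:
    """A document paired with the fixed questions for its type."""
    document_type: str
    document_text: str
    questions: list[str]


def build_job(document_type: str, document_text: str) -> ExtractionJob:
    """Reject unsupported types so only the known formats enter the pipeline."""
    if document_type not in PREDEFINED_QUESTIONS:
        raise ValueError(f"Unsupported document type: {document_type!r}")
    return ExtractionJob(
        document_type, document_text, PREDEFINED_QUESTIONS[document_type]
    )
```

Keeping the questions in one place means every document of a given type is asked exactly the same things, which is what makes the results comparable across documents.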

An illustration of the RAG architecture: text files (orange) are parsed and divided into chunks before being embedded into a vector store. The user prompt (blue) is then embedded and used to search the vector store for relevant data, which the LLM (green) uses to answer the question.

From this pipeline, two specialized tools were developed:

- Technical Specification Tool: This tool accurately parses Technical Specification Sheets, extracting pertinent data relevant to the user's predefined questions.
- Test Reports Tool: This tool focuses on Test Reports, extracting key data and even offering graphical representations of how packs/modules performed.

"By focusing on specific document types and predefined questions, we streamlined the extraction process and reduced the likelihood of errors," Jesper shares. "The structured format we produce is much more manageable and actionable for further analysis."


Challenges

Developing these tools involved overcoming several hurdles. Technical documents mix text blocks, tables, and graphs, and some interweave all three in complex layouts, which makes it hard for any tool to extract information correctly.

The tools must break each document down into parts and determine the best order in which to read them and extract information, whatever the length and structure of the technical battery data it contains.
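
To make the idea of breaking a document down concrete, here is a simplified sketch that groups plain-text paragraphs into chunks while preserving reading order. It assumes the document has already been converted to text with blank lines between paragraphs; it does not handle tables or graphs, and the function name `chunk_paragraphs` and the `max_chars` limit are illustrative choices, not details from the thesis.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Group consecutive paragraphs into chunks of at most roughly max_chars.

    Paragraphs are never reordered, and a single paragraph longer than
    max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # The current chunk is full: store it and start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps sentences intact, which tends to matter when a value and its unit or label sit close together.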


Future Potential

The tools' potential for future improvement is significant. As newer and more powerful LLMs emerge, they will enhance the tools' performance, enabling them to handle increasingly complex documents with greater accuracy. These tools mark just the beginning of how LLMs can improve workflows and accessibility, transforming large, complex documents into structured, actionable data.

Screenshot of the tools in editing stage.

Text-to-Illustration of Architecture

To ensure consistent and accurate data extraction, the tools employ an adapted version of the Retrieval-Augmented Generation (RAG) architecture. In this tailored approach, user input is limited to supplying the document to be parsed, while prompts and questions are pre-structured to maintain uniformity across document types. This modification streamlines the extraction process and reduces errors caused by unstructured inputs.
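
One way to picture pre-structured prompts is a fixed template that the system fills in, so the user never writes free-form prompt text. The sketch below is an assumption about how such a template could look; the wording of `PROMPT_TEMPLATE` and the helper `build_prompt` are hypothetical and not taken from the thesis.

```python
# Hypothetical template; the real prompts are not shown in the article.
PROMPT_TEMPLATE = (
    "You are extracting data from a {document_type}.\n"
    "Answer using only the context below. "
    "If the answer is not present, reply 'not found'.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def build_prompt(document_type: str, question: str, context_chunks: list[str]) -> str:
    """Fill the fixed template with a predefined question and retrieved chunks."""
    context = "\n---\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(
        document_type=document_type, context=context, question=question
    )
```

Because only the document type, the retrieved context, and a predefined question vary, two runs on similar documents produce prompts that differ only where the documents do.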

The process begins by converting the document into manageable chunks, which are then embedded and stored in a vector store. The system later matches predefined questions against these embedded chunks to retrieve the most relevant data. This retrieved context, along with the structured prompts, is then passed to the large language model (LLM), which processes and returns the information in a standardized format. To further enhance reliability, a safety net using Pydantic is implemented, ensuring that any formatting inconsistencies are automatically corrected before the final data output.
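
The retrieval and validation steps can be sketched as follows. This is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, a plain list stands in for the vector store, the LLM call is omitted, and the `CellSpecification` schema with its field names is hypothetical. Only `BaseModel` and `ValidationError` from Pydantic are used.

```python
import math
import re
from collections import Counter
from typing import Optional

from pydantic import BaseModel, ValidationError


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:top_k]


class CellSpecification(BaseModel):
    """Hypothetical output schema; the field names are illustrative."""
    nominal_voltage_v: float
    rated_capacity_ah: Optional[float] = None


def validate_answer(raw: dict) -> Optional[CellSpecification]:
    """Coerce and validate LLM output; return None if it cannot be parsed."""
    try:
        return CellSpecification(**raw)
    except ValidationError:
        return None
```

In this sketch, a numeric string such as `"3.7"` returned by the model is coerced to a float by Pydantic, while a value that cannot be interpreted as a number is rejected, so the caller can retry or flag the field instead of passing malformed data downstream.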

Conclusion

Jesper Edström's Master's thesis demonstrates the significant impact of large language models (LLMs) on solving real-world data extraction challenges. By leveraging the Retrieval-Augmented Generation (RAG) architecture, he developed a document reader that effectively manages complex technical documents, addressing issues like inconsistency and format diversity.

More information

