1. Introduction
2. The Data Extraction Tool: Complexity and Inconsistency in Data Extraction
3. Developing the Tool
- Challenges
- Future Potential
4. Text-to-Illustration of Architecture
5. Conclusion
Introduction
Jesper Edström, an engineering physics student from Uppsala University, completed his Master's thesis with us this fall. Under the guidance of Rasmus Lindqvist, Jesper designed, researched, and built a document reader powered by large language models (LLMs).
As LLMs grow more capable, new areas keep emerging where they can be put to work as the powerful tools they are. Retrieval-augmented generation (RAG) is an architecture introduced by Facebook in 2020 that lets an LLM answer questions using external data as a source, rather than relying solely on its pre-trained parameters. A link to the original RAG paper is included at the end of this piece for those interested in learning more.
The Data Extraction Tool
Complexity and Inconsistency in Data Extraction
In the battery industry, technical documents such as Material Safety Data Sheets (MSDS), Technical Specification Sheets, and Test Reports are crucial. However, these documents often present challenges due to their length, inconsistent structure, and complex data, making accurate extraction difficult.
Jesper explains, "The real challenge lies in the inconsistency across these documents. Each one is structured differently, with varying levels of detail and complexity, which makes manual extraction prone to errors."
Misinterpreting these documents can lead to significant inefficiencies in data analysis and decision-making, highlighting the need for a reliable tool that can accurately extract data from such diverse sources.
Developing the Tool
To address these challenges, a pipeline was developed that takes these documents (MSDS, Technical Specification Sheets, or Test Reports) along with a set of predefined questions and returns the extracted data in a structured, easily digestible format. The goal was to create a system that could correctly classify and parse any of these document types, applying a set of relevant, generic questions to each. By limiting the user to providing only these types of documents and using predefined questions and prompts tailored to each type, the tool achieves more consistent results in data retrieval.
From this pipeline, two specialized tools were developed:
- Technical Specification Tool: This tool accurately parses Technical Specification Sheets, extracting pertinent data relevant to the user's predefined questions.
- Test Reports Tool: This tool focuses on Test Reports, extracting key data and even offering graphical representations of how packs/modules performed.
"By focusing on specific document types and predefined questions, we streamlined the extraction process and reduced the likelihood of errors," Jesper shares. "The structured format we produce is much more manageable and actionable for further analysis."
Challenges
Developing these tools involved overcoming several hurdles. Technical documents come in diverse formats, mixing text blocks, tables, and graphs, and not every document is straightforward: some interweave all three in complex layouts, making it tough for any tool to extract information correctly.
The tools must break documents down effectively and determine the best order in which to read and extract information, especially when handling technical battery data that varies in length and structure.
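As a rough illustration of what "breaking down" a document can look like, the sketch below splits raw text into overlapping chunks. The chunk size and overlap values are arbitrary placeholders, and a real pipeline would likely need layout-aware parsing for tables and graphs before a step like this.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split raw document text into overlapping character-based chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        # Overlap preserves context that would otherwise be cut at a chunk boundary.
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```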
Future Potential
The tools' potential for future improvement is significant. As newer and more powerful LLMs emerge, they will enhance the tools' performance, enabling them to handle increasingly complex documents with greater accuracy. These tools mark just the beginning of how LLMs can improve workflows and accessibility, transforming large, complex documents into structured, actionable data.
Text-to-Illustration of Architecture
To ensure consistent and accurate data extraction, the tools employ an adapted version of the Retrieval-Augmented Generation (RAG) architecture. In this tailored approach, user input is limited to document parsing, with prompts and questions pre-structured to maintain uniformity across various document types. This modification streamlines the extraction process, reducing potential errors caused by unstructured inputs.
The process begins by converting the document into manageable chunks, which are then embedded and stored in vector storage. The system later matches predefined questions against these embedded chunks to retrieve the most relevant data. This retrieved context, along with the structured prompts, is then passed to the large language model (LLM), which processes and returns the information in a standardized format. To further enhance reliability, a safety net using Pydantic is implemented, ensuring that any formatting inconsistencies are automatically corrected before the final data output.
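The sketch below walks through that flow under a few stated assumptions: embed() and call_llm() are placeholders for whichever embedding model and LLM the pipeline actually uses, the SpecAnswer fields are invented example fields for a Technical Specification Sheet, and Pydantic v2 is assumed for the validation step.

```python
# Minimal sketch of the retrieval + validation flow. embed(), call_llm(), and
# the SpecAnswer schema are illustrative assumptions, not the thesis code.
import math
from typing import Optional

from pydantic import BaseModel, ValidationError


class SpecAnswer(BaseModel):
    question: str
    value: str
    unit: Optional[str] = None


def embed(text: str) -> list[float]:
    """Placeholder: swap in a real embedding model here."""
    raise NotImplementedError


def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real LLM call that returns a JSON string."""
    raise NotImplementedError


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_question(question: str, chunks: list[str], top_k: int = 3) -> SpecAnswer:
    # 1. Embed the document chunks (the "vector storage" step) and the question.
    chunk_vectors = [(chunk, embed(chunk)) for chunk in chunks]
    q_vector = embed(question)

    # 2. Retrieve the most relevant chunks by similarity to the question.
    ranked = sorted(chunk_vectors, key=lambda cv: cosine(q_vector, cv[1]), reverse=True)
    context = "\n\n".join(chunk for chunk, _ in ranked[:top_k])

    # 3. Pass the retrieved context and a pre-structured prompt to the LLM.
    prompt = (
        "Answer using only the context below. "
        "Respond as JSON with the keys question, value, unit.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    raw = call_llm(prompt)

    # 4. Safety net: validate the output format; retry once if it is malformed.
    try:
        return SpecAnswer.model_validate_json(raw)
    except ValidationError:
        raw = call_llm(prompt + "\n\nReturn valid JSON only.")
        return SpecAnswer.model_validate_json(raw)
```

Retrying once with a stricter instruction is just one simple way to realise the safety net; the important part is that malformed output is caught and corrected before the data reaches the final, standardized format.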
Conclusion
Jesper Edström's Master's thesis demonstrates the significant impact of large language models (LLMs) on solving real-world data extraction challenges. By leveraging the Retrieval-Augmented Generation (RAG) architecture, he developed a document reader that effectively manages complex technical documents, addressing issues like inconsistency and format diversity.
More information
- Link to Jesper’s thesis: https://uu.diva-portal.org/smash/record.jsf?dswid=2329&pid=diva2%3A1877456&c=1&searchType=SIMPLE&language=en&query=jesper+Edstr%C3%B6m&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%5D%5D&aqe=%5B%5D&noOfRows=50&sortOrder=author_sort_asc&sortOrder2=title_sort_asc&onlyFullText=false&sf=all
- Link to the RAG paper: https://research.facebook.com/file/4283170945104179/Retrieval-Augmented-Generation-for-Knowledge-Intensive-NLP-Tasks.pdf
- Interested in learning more about these projects or Cling's services? Get in touch here: https://www.clingsystems.com/battery-expert?utm_source=linkedin&utm_medium=article&utm_campaign=aiarticle