I read your PDF

In 1944, inspired by Rommel's book, George Patton embarked on his own French adventure. Hollywood responsibly supported the national war spirit in various ways. The WW2 defense effort was bolstered by high-quality paper drawings, statistical process control and strict enforcement of industry standards.

In 1984, Ronald Reagan won in a landslide. Tom Clancy published "The Hunt for Red October," and Orwell's nightmare was brought to the screen, each demonstrating the value of information in its own fashion. Adobe released PostScript, making printing both author-controlled and device-agnostic.

In 1994, the world was supposedly getting "flat," while Hollywood was definitely shifting left. Adobe released PDF, a descendant of PostScript. PDF guaranteed consistent, author-controlled, and device-agnostic client-side viewing and printing, with the added capability of secure signing. Around the same time, S1000D emerged to enforce rules for technical documentation layout.

In 2004, the USA was deeply engaged in the Global War on Terror. Hollywood's role in generating partisan consent began to overshadow its original entertainment purpose. The concepts of the digital thread started to proliferate through the engineering and manufacturing domains, encompassing CAD, simulations, and requirements. Meanwhile, PDF continued to conquer the world.

  • PDF became ISO 32000, and Adobe opened its API to external developers. Subsequently, PDF evolved into the default method for information presentation and distribution, especially for legally binding documents.
  • PDF/A was introduced to address PDF dependencies on potentially unstable external features. Combined with the PDF Document Catalog's hierarchical embedding of a wide variety of documents (such as STEP/JT, MS Office, etc.), PDF/A is arguably the leading tool for long-term data archiving. As of today, the veraPDF project validates any PDF/A file for compliance at any level.
  • 3D PDF was born to facilitate the easy read-only sharing of 3D MBD data. With native support in Adobe Acrobat and enthusiastic adoption by the DoD, its success seemed assured. Unfortunately, for reasons too lengthy to cover here, it did not live up to its promise and is now largely abandoned by all but its most devoted proponents.

It is November 2024, and I suddenly feel quite optimistic about many topics. Hollywood's very existence is under threat. In the ongoing quest to connect the engineering and manufacturing puzzle to the digital thread, the PDF consumption process might be up for disruption. In one such case, recipients either print PDFs authored in various PLM ecosystems onto paper (sic!) or manually retype and copy-paste their content into MES. They do this because, due to the 40-year-old system architecture, PDF data cannot be readily extracted into JSON or XML. This issue is relevant for PDFs created from authoring tools like MS Word through an API, as well as those created from scans.
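To make the extraction problem concrete, here is a minimal Python sketch of why a table inside a PDF does not map directly to JSON. It assumes a hand-written, uncompressed content-stream fragment (the `CONTENT_STREAM` value, the part number, and the torque figure are all hypothetical) in which each text run is painted at absolute coordinates with the `Tm` and `Tj` operators. Real PDFs add compression, font encodings, and many more operators, so this is an illustration of the idea, not a working extractor.

```python
import json
import re
from collections import defaultdict

# Hypothetical, hand-written fragment of an uncompressed PDF content stream.
# Each text run is painted at absolute coordinates; nothing marks it as a table.
CONTENT_STREAM = """
BT /F1 10 Tf 1 0 0 1 72 700 Tm (Part No) Tj ET
BT /F1 10 Tf 1 0 0 1 200 700 Tm (Torque) Tj ET
BT /F1 10 Tf 1 0 0 1 72 680 Tm (A-1001) Tj ET
BT /F1 10 Tf 1 0 0 1 200 680 Tm (12 Nm) Tj ET
"""

# Matches "1 0 0 1 x y Tm (text) Tj": a text matrix with translation (x, y)
# followed by a simple string shown with Tj. Escapes and TJ arrays are ignored.
RUN = re.compile(r"1 0 0 1 ([\d.]+) ([\d.]+) Tm \(([^)]*)\) Tj")


def text_runs(stream):
    """Return (x, y, text) tuples for every positioned text run."""
    return [(float(x), float(y), text) for x, y, text in RUN.findall(stream)]


def rows_from_runs(runs):
    """Guess table rows by grouping runs that share the same y coordinate."""
    by_y = defaultdict(list)
    for x, y, text in runs:
        by_y[y].append((x, text))
    # The PDF y axis grows upward, so sort descending to read top to bottom.
    ordered = sorted(by_y.items(), reverse=True)
    return [[text for _, text in sorted(cells)] for _, cells in ordered]


def table_to_records(rows):
    """Treat the first row as the header and the rest as records."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = table_to_records(rows_from_runs(text_runs(CONTENT_STREAM)))
print(json.dumps(records, indent=2))
```

The structure (rows, columns, header) has to be inferred from geometry. That inference is easy for this toy fragment and fragile for real engineering documents, and a scanned page does not even provide the text runs.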

Patton's exploits in France in 1944 originated from British conceptual musings circa 1924, German and Russian experiments around 1934, and the subsequent German blitzkrieg successes. He hardly invented anything; instead, he was able to orchestrate the already matured stack of technologies, battlefield techniques, and the overwhelming American industrial and logistical advantages in the most creative and consistent manner. We can learn a lot from Patton as we think about the next phase of the industrial revolution.

We still expect to see PDF being used on a grand scale in the MES/MRO domain in 2034, as it will remain extremely cost-efficient, especially in the context of AI. The Senticore team has experimented extensively with several LLMs and a number of public GitHub projects to extract data from PDFs, and we would like to share our conclusions.

  • So far, we haven't seen a fully comprehensive and reliable solution for PDF consumption in the MES/MRO domain. The latest research at Google, Microsoft, and several prominent startups seems to pragmatically concentrate on relatively structured data mixes, such as invoices.
  • Unless there is a qualitative leap with AI, our own concept is to keep humans in the loop as a standard feature, steadily moving generative AI-led automation from 20:80 to 80:20. This approach allows us to teach the system to process text, tables, diagrams, and drawings. In a sense, a fusion of generative AI, neural networks, and the right IDE functions like Patton's combined arms warfare against the German lines, breaking through PDF constraints, identifying data types, and allowing other algorithmic tools to come into play and extract them correctly into JSON or XML.
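One way to picture the human-in-the-loop idea is a confidence gate. The Python sketch below is an illustration, not Senticore's implementation. It assumes that every extracted block carries a self-reported confidence score and that "20:80" means the share of blocks accepted automatically versus the share sent to a reviewer. The `Extraction` class, the scores, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    block_id: str
    kind: str          # "text", "table", "diagram", or "drawing"
    confidence: float  # hypothetical model self-score in [0, 1]


def route(items, threshold):
    """Split extractions into auto-accepted and human-review queues."""
    auto = [item for item in items if item.confidence >= threshold]
    review = [item for item in items if item.confidence < threshold]
    return auto, review


def automation_share(auto, review):
    """Fraction of blocks that were accepted without a human."""
    total = len(auto) + len(review)
    return len(auto) / total if total else 0.0


# Hypothetical batch of blocks extracted from one inbound PDF.
batch = [
    Extraction("p1-b1", "text", 0.97),
    Extraction("p1-b2", "table", 0.88),
    Extraction("p2-b1", "table", 0.75),
    Extraction("p2-b2", "diagram", 0.62),
    Extraction("p3-b1", "drawing", 0.41),
]

# A strict gate early on: most blocks go to a reviewer.
early = automation_share(*route(batch, threshold=0.95))
# A relaxed gate later, once reviewer corrections have earned trust.
later = automation_share(*route(batch, threshold=0.60))
print(f"automated share: early={early:.0%}, later={later:.0%}")
```

In this sketch the ratio moves because the gate is loosened as reviewed results accumulate. In practice the shift could equally come from a better model producing higher scores at a fixed threshold.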

Feeling exhausted from plowing through the avalanche of inbound PDF files? Would you like to integrate the engineering and manufacturing data trapped inside these files into your digital thread ecosystem reliably and at a reasonable cost? Talk to us; like General Patton, we may have a solution for you.

Originally published on the Senticore blog.

Steef Klein

Digital Business/IT Strategist | IT Director | Program Management | Enterprise Architectures | CRM-ERP-PLM-SCM Consultancy |

2 days ago
Alex Bruskin

Bespoke Generative AI for Engineering & Manufacturing (PLM, MES, ERP) | Cloud Native | Air Gapped | System Integration | Concepts, Technologies, Execution

4 days ago
Patrick Hillberg Ph.D.

Adjunct Professor @ Oakland University | Product Lifecycle Management (PLM) | Speaker, Consultant, Expert Witness | Advocate for Workforce Development | Ex-Siemens PLM

5 days ago

I gave a lecture last week where I showed the following slide, hypothesizing the nature of a Digital Thread which might have avoided the Boeing 737 Max crashes. There are 4 different organizations in this thread, making decisions across 7 years, and the impact of the failure was 346 lives and $20B lost. (At least. Boeing issued $20B in bonds for the Max crashes, but recently issued another $19B for the variety of problems which have come to light since these crashes.) So, there are lives and dollars which might be saved by a comprehensive digital thread that truly spans the lifecycle, but as I point out in my lecture, I don’t see either a business model or an ethical model which can address this. 10 and 15 years ago I was doing defense work, and was told: “we use the 3D to create prints, and then we pitch the 3D”. In the past couple of years I’ve heard: “if it currently floats, flies, or drives, the only info we have on it is 2D PDF.” We are where we are due to decades-long business and cultural models. I wouldn’t bother with the tech until we address incentives and culture.

  • [Slide image; no alt text provided]
Alex Bruskin

Bespoke Generative AI for Engineering & Manufacturing (PLM, MES, ERP) | Cloud Native | Air Gapped | System Integration | Concepts, Technologies, Execution

5 days ago

James Allen Regenor, PhD, Col USAF(ret) is Veritex somehow a fit here?

Nice read. I think PDF is here to stay; however, some trends will become visible gradually.

  • Knowledge-intensive PDFs like standards and specs will be converted semi-automatically to knowledge graphs to serve in parallel with hybrid RAGs. These PDFs are maybe 2% of the unstructured data in companies.
  • Less critical PDFs or content like instructions, tutorials, and minutes will be used mainly to build different RAG systems. These PDFs are probably 10% of the unstructured data in companies.
  • The remaining content and the above (i.e. 100%) will be used to fine-tune LLMs.

And PDFs will stay in use as they are not only relevant for legacy systems and approaches, but also legally binding almost everywhere in the world. And you know how quickly legal terms change ;).
