PDF Generation Service - Efficient PDF Generation from DOCX Templates using LibreOffice - A Fast Alternative to Puppeteer with Few Limitations.
We are lucky to work in an environment that encourages original thinking and idea generation. This week our Scotland-based Engineering Lead, Francesco Belvedere, shares a solution that was developed to address a real-life challenge.
The topic in question:
PDF Generation Service - Efficient PDF Generation from DOCX Templates using LibreOffice - A Fast Alternative to Puppeteer with Few Limitations.
As we could not find this topic covered in depth elsewhere, Engineering had to produce its own solution. We are all about empowering the collective knowledge base within Fintech, and we think this will be of great interest to developers and solution architects who follow the Nomo Fintech page. Feel free to share this article and tag us.
This article gives a real insight into how challenging the build is and into the opportunity our Engineering teams have to work on cutting-edge solutions. Have a detailed read below:
Generating PDF documents flexibly is a hard problem to “get right”. A popular approach is to use Puppeteer, a Node.js library that controls a headless browser instance programmatically. However, this comes with some major drawbacks. A less common approach is to use LibreOffice, an open-source technology that can quickly generate PDFs and has some key advantages. In this article, we explore the solution some teams at Nomo are using to generate PDFs, and we discuss its pros and cons.
Goals
When thinking about our next document generation service, these were the goals we wanted to accomplish:
The Problem with Puppeteer and HTML/XML-based PDF Generation
Puppeteer's PDF generation functionality relies on loading a webpage in a headless browser and “printing” it as a PDF. This approach can be slow and resource-intensive since a full web browser must be started (although the browser instance can be long-lived) and a DOM loaded before being able to “print” what we see in a PDF format. This gets particularly intensive for complex documents with many elements and images.
Solutions that convert HTML or XML to PDF without starting a browser, such as wkhtmltopdf, pdfkit, dom2pdf and others, come with their own challenges: dealing with pages, margins and overflow can become a nightmare to work with and maintain.
The Solution: LibreOffice to the Rescue!
An alternative approach to PDF generation is to use LibreOffice (LO), an open-source technology that offers a word processor, spreadsheet, and presentation software, as well as tooling and a command-line interface to convert documents to different formats, including DOCX to PDF.
DOCX is a file format used to store documents. It is an XML-based file format and can contain text, images, tables, and other types of content.
Using DOCX docs allows:
Next, let’s explore how LibreOffice fits into the bigger picture of a document generation service.
Document Generation Service
Step 1: Create a DOCX Template and an Optional Layout
You can use any word processor to create DOCX layouts and templates, such as Microsoft Word or LibreOffice Writer. The template should include placeholders for the data that will be replaced during the PDF generation process (this is also known as string interpolation). For example, you could include placeholders for the name, address, and date in the format {{property_name}}.
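To make the placeholder idea concrete, here is a minimal sketch of string interpolation for the {{property_name}} format. The function name `interpolate` and the sample values are hypothetical and only for illustration; this is not the code of the service described in this article.

```python
import re

# Matches placeholders of the form {{property_name}}, allowing optional spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def interpolate(text: str, values: dict[str, str]) -> str:
    """Replace each {{name}} in text with values[name].

    Unknown placeholders raise KeyError so that a missing value is noticed
    instead of silently leaking "{{...}}" into the final document.
    """
    def substitute(match: re.Match) -> str:
        return str(values[match.group(1)])

    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    template = "Dear {{name}}, your statement is dated {{date}}."
    print(interpolate(template, {"name": "Ada", "date": "2023-01-31"}))
```

One thing to watch for: a word processor may split a placeholder across several formatting runs in the underlying XML, in which case a plain text search like this one will not find it. Typing each placeholder in one go, without changing formatting halfway through, usually helps.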
Step 2: Install LibreOffice
Install LO locally using your package manager or download it from the official website. For your live system, you can spin up a LO service or use a ready-made solution like https://github.com/unoconv/unoserver or https://github.com/gotenberg/gotenberg, which provides a REST API on top of LO.
At Nomo, we run document generation in AWS Lambda and call the LibreOffice (LO) binaries, which are hosted in a Lambda Layer, through the built-in LO CLI. To keep generation fast and prevent AWS from wiping the temporary storage where LibreOffice is unpacked, we use provisioned concurrency and ping our Lambda function every 5 minutes. Going forward, however, we plan to separate the LibreOffice install and try out one of the wrapper solutions mentioned in the paragraph above.
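As an illustration of the keep-warm ping, a Lambda handler can return early when it recognises a ping instead of running a full generation. The sketch below is an assumption about how this could look: the `warmup` event field, the `template` field and the function names are hypothetical, not the shape of our actual events.

```python
def is_warmup_ping(event: dict) -> bool:
    """Return True for the hypothetical keep-warm event {"warmup": true}."""
    return bool(event.get("warmup"))


def handler(event: dict, context: object) -> dict:
    """Lambda entry point (handler name and event shape are assumptions)."""
    if is_warmup_ping(event):
        # Return early: the only goal is to keep the execution environment,
        # and the LibreOffice files unpacked in it, alive.
        return {"status": "warm"}
    # A real request would be validated, rendered and converted here.
    return {"status": "accepted", "template": event.get("template")}
```

Returning early keeps each ping cheap, since no LibreOffice process is started for it.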
Step 3: Validate the Passed Parameters Using JSON Schema
Alongside our template and layout files, we host a JSON Schema file that describes which properties are required to build the document and what their type or format is. This schema is used at runtime to validate the data passed to the generation service.
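The runtime check can be sketched as follows. To stay dependency-free, this example hand-rolls only two JSON Schema keywords, `required` and `type`; in practice a complete validator library (for example the `jsonschema` package in Python) would be the better choice. The schema contents and the `validate` function are hypothetical examples, not our production schema.

```python
# A hypothetical schema for a document that needs a name, an address and a date.
SCHEMA = {
    "type": "object",
    "required": ["name", "address", "date"],
    "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"},
        "date": {"type": "string"},
    },
}

# Maps JSON Schema type names to Python types (a subset of the specification).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(data: dict, schema: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required property: {key}")
    for key, rules in schema.get("properties", {}).items():
        if key not in data or "type" not in rules:
            continue
        value = data[key]
        expected = JSON_TYPES[rules["type"]]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and rules["type"] in ("number", "integer")
        if not isinstance(value, expected) or bool_as_number:
            errors.append(f"property {key} should be of type {rules['type']}")
    return errors


if __name__ == "__main__":
    print(validate({"name": "Ada", "address": "1 Main St"}, SCHEMA))
```

Collecting all errors in a list, rather than stopping at the first one, lets the caller fix every problem with a request in one round trip.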
Step 4: Combine and Generate the Complete DOCX Document
A DOCX document is essentially a zip archive of XML files that are bundled together. Each part of the document is usually represented as a separate .xml file, and the archive has a structure like the following:
.
├── _rels
│   ├── document.xml.rels
│   ├── footer2.xml.rels
│   ├── header1.xml.rels
│   └── header2.xml.rels
├── …
├── document.xml
├── footer1.xml
├── footer2.xml
├── header1.xml
├── header2.xml
├── media
│   └── image1.png
├── styles.xml
└── theme
    └── theme1.xml
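Because a DOCX file is a zip archive, its parts can be inspected with any zip library. The sketch below uses Python's standard `zipfile` module; the helper names are hypothetical, and the sample archive is a tiny stand-in with DOCX-like part names rather than a valid Word document. It stores the parts under a `word/` folder, which is where word processors commonly place them.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted names of the files stored inside a DOCX archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def build_sample_archive() -> bytes:
    """Build a tiny zip with DOCX-like part names (not a valid Word document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/header1.xml", "<w:hdr/>")
        archive.writestr("word/footer1.xml", "<w:ftr/>")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_parts(build_sample_archive()):
        print(name)
```

Running the same `list_parts` function on a real template is a quick way to see which header, footer and media parts it contains.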
For maximum flexibility and reusability, we use layouts and templates which we can dynamically combine such that the resulting file replaces:
To obtain the final combined document, we follow these steps:
At this point, we have a fully usable and formatted DOCX document. You could stop here if all you needed is a DOCX document with a custom header and footer, and placeholders replaced by values.
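The placeholder replacement part of this step can be sketched as a copy of the archive in which every XML part has its {{property_name}} placeholders filled in. This is a simplified illustration under stated assumptions: the function name `fill_docx` is hypothetical, the XML parts are assumed to be UTF-8 encoded, and merging a layout's header and footer into the template is not shown.

```python
import io
import re
import zipfile
from xml.sax.saxutils import escape

# Matches placeholders of the form {{property_name}}, allowing optional spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_docx(template_bytes: bytes, values: dict[str, str]) -> bytes:
    """Copy a DOCX archive, replacing {{name}} placeholders in its XML parts."""
    def substitute(match: re.Match) -> str:
        # Escape &, < and > so that a value cannot break the surrounding XML.
        return escape(str(values[match.group(1)]))

    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename.endswith(".xml"):
                # Assumption: XML parts are UTF-8 encoded.
                text = data.decode("utf-8")
                data = PLACEHOLDER.sub(substitute, text).encode("utf-8")
            # Non-XML parts such as images are copied through unchanged.
            target.writestr(item.filename, data)
    return output.getvalue()
```

Escaping the values matters because the replacement happens inside XML: a customer name containing an ampersand would otherwise produce a document that fails to open.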
Step 5: Convert the DOCX Document to PDF
Finally, since we have a working DOCX document, we can now use LO to convert the file into a PDF. If you are using LO locally, you can do this using the LO command-line interface. For example, this command converts document.docx to a PDF and saves it as document.pdf in the current directory:
soffice --headless --convert-to pdf document.docx
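If the conversion is driven from code rather than from a terminal, the same command can be wrapped in a small helper. The sketch below assumes that `soffice` is available on the PATH; the function names and the 60-second timeout are hypothetical choices for illustration. It adds the `--outdir` option so that the location of the generated PDF is explicit.

```python
import subprocess
from pathlib import Path


def build_command(docx_path: Path, out_dir: Path) -> list[str]:
    """Return the soffice argument list for a DOCX to PDF conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path: Path, out_dir: Path, timeout_s: float = 60.0) -> Path:
    """Run LibreOffice and return the path of the generated PDF."""
    # check=True raises CalledProcessError on a non-zero exit code;
    # timeout raises TimeoutExpired if LibreOffice hangs.
    subprocess.run(build_command(docx_path, out_dir), check=True, timeout=timeout_s)
    pdf_path = out_dir / (docx_path.stem + ".pdf")
    # Guard against a run that finishes without writing the expected file.
    if not pdf_path.exists():
        raise RuntimeError(f"LibreOffice did not produce {pdf_path}")
    return pdf_path
```

Passing the arguments as a list, instead of a single shell string, avoids quoting problems when file names contain spaces.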
Results
We created a flow that addresses some of the original goals we had in mind. Work is still needed to achieve some of the other goals, and this approach comes with a few caveats too:
In conclusion, we feel this solution offers good performance and is a good compromise for achieving some of the goals that we set ourselves, although, as with every other approach to PDF generation, it comes with its own set of challenges.