The PDF/A Myths
Frederik Rosseel
Digitizing and Preserving the Regulated Customer Interaction since 2006
In recent years, as the need for Long Term Archiving and Long Term Preservation of "Digital Born" information has increased, so has the call for a universal archiving format.
The idea behind this is obviously to have a file format that allows you to permanently store and read information without losing any of it. If possible, even the more “graphical” information, such as text color, image color space or the fonts used, should be kept if it conveys any relevant information.
Enter PDF/A, the solution to it all. Or not? The two most important words in the first paragraph are: "Digital Born". The idea of PDF/A was to create a digital version of what your document should look like “when printed”. (To avoid lengthy discussions, I’ll consider scanned documents as giving birth to a “Digital Document” once they are transferred to a digital carrier, and yes, to suppress any urge to print them again, you’d like them properly archived).
But, what do you mean? Is this not the solution to all our archiving needs? Well…. let me explain.
To begin with, there are a lot of misconceptions or myths with regard to the format. Such as:
- We can convert everything to PDF/A
- We should convert every file that we want or need to archive to PDF/A
- PDF/A will help us to prove that the file has not been modified during the retention period
Nowhere have I found any information or article to contradict these misconceptions. But let’s look at these one by one.
Should you convert every file to PDF/A?
The question is not whether you should convert the original file to PDF/A, but whether you should convert it to an archiving format at all. If the information will remain legible throughout the retention period that applies to a particular file, then retention alone is not a reason to convert that file to PDF/A.
This doesn't mean in any way that you can't have other reasons to migrate your file to PDF/A. The most important reasons why you might want to do so are elaborated here.
PDF/A will help us to prove that the file has not been modified or tampered with during the retention period.
Unfortunately, the answer is no. It is a general misconception that PDF/A files create a non-repudiable, unalterable, immutable version of the document. As with any other PDF, the file can be edited. As an example: we once even received a request from a customer to create a custom module that would add a specific header to the PDF/A, with information that wasn’t included in the original files. Original?
For one reason or another, a PDF/A file is often mistaken for a PDF(/A) that has been digitally sealed (sealed, not signed).
But if you want to prove that your file has not been tampered with during the retention period, you will need one of the following:
- Digital Sealing (can be achieved via a solution such as this.)
- Via Audit Trailing on your ECM system
- Via auditing on your Trustworthy Repository
However, if you wish to achieve this without the use of an ECM-solution, or other audited repository, migrating your file to PDF(/A) is a step in the right direction, as this format can be more easily sealed than other file formats. This is especially useful for documents that might need to be distributed or e.g. when you need to produce the document as evidence in a court case.
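To make the point concrete, here is a minimal sketch of a fixity check in Python. It is not a digital seal and certainly not a qualified one: it only shows that detecting a change is something you do around the file, by recording a digest at ingest and comparing it later, and that the PDF/A format plays no part in it. The function names `sha256_of_file` and `is_unmodified` are my own illustrations, and the sketch assumes the recorded digest is kept somewhere trustworthy (for example in the audited repository), since anyone who can change both the file and the digest can hide a modification.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with the one recorded at ingest."""
    return hmac.compare_digest(sha256_of_file(path), recorded_digest)
```

Adding a header to an archived PDF/A, as in the customer request mentioned above, changes the digest, so `is_unmodified` would return False for that file even though it may still validate as PDF/A.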
Can we convert everything to PDF/A?
No, you can’t. There are simply file formats that cannot be converted to PDF/A. And, most surprisingly to many, a lot of PDF files cannot be properly converted to PDF/A. This poses a major issue for a lot of organizations, as they have very often stored thousands of files as PDF (and not as PDF/A).
How come?
Well, as many people know or have already experienced, a file might look different depending on the workstation or viewing application used. This can cause the document to display different fonts and colors or even cause the document to have a different layout.
A few examples:
- Opening an MS Word document, created with MS Word, with OpenOffice.
- Editing an MS Word document, created on Windows, using Word on Mac (Oooh, the fun for the original author when he gets his document back).
- Displaying a document that contains TrueType fonts on a Unix machine without those fonts.
To tackle these issues, PDF/A was created as an archiving file format, and it quickly became an ISO standard. The file itself contains the information necessary to create a correct visual presentation of the document, while only using techniques and information that are the same across systems and applications. This means that, independent of the viewing application in use or the client environment, the file will look identical now and in, let’s say, 50 years.
This was mainly aimed at digital born files and, even more specifically: digital born text documents.
Converting a file to PDF/A-1a or PDF/A-1b (which are the most commonly used versions of the PDF/A format) imposes some requirements and limitations on the source format.
In short, these are:
Requirements:
- All fonts used should be embedded (stored in the file itself).
- Colors should be specified in a device-independent way.
- Extensible Metadata Platform (XMP) metadata should be included.
Limitations are (mainly related to PDF/A-1):
The files should not contain any:
- Encryption
- LZW Compression (replaced by ZIP)
- Embedded files (allowed in PDF/A-2, for PDF/A attachments only, and in PDF/A-3)
- External content references
- PDF Transparency
- Multi-media
- JavaScript
Creating a correct PDF/A file is more than just adding a “flag”. It should be checked whether the file complies with the above requirements and limitations. One has to be aware of the fact that, when creating a PDF/A file, the original (digital) file is often necessary in order to be able to create a target file that meets all requirements. The file should also be created using a tool (or machine) that contains all the necessary information (such as the fonts). This page contains a very good description of how a correct PDF/A can be created in several ways.
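As a rough illustration of what “checking” means, the sketch below scans the raw bytes of a PDF for a few of the features listed as limitations above. It is a first-pass heuristic of my own and not a PDF/A validator: markers inside compressed object streams are missed, a match inside page text is a false positive, and requirements such as embedded fonts, device-independent color and XMP metadata are not checked at all. The names `FORBIDDEN_MARKERS` and `find_forbidden_features` are illustrative.

```python
import re

# Byte patterns for some features the list above names as not allowed in
# PDF/A-1. This is a naive scan of the raw bytes, not a parse of the PDF.
FORBIDDEN_MARKERS = {
    "encryption": rb"/Encrypt\b",
    "LZW compression": rb"/LZWDecode\b",
    "embedded files": rb"/EmbeddedFiles?\b",
    "JavaScript": rb"/(?:JavaScript|JS)\b",
}


def find_forbidden_features(pdf_bytes: bytes) -> list:
    """Return the names of the forbidden features whose marker was found."""
    return [
        name
        for name, pattern in FORBIDDEN_MARKERS.items()
        if re.search(pattern, pdf_bytes)
    ]
```

An empty result only means that none of these markers were spotted. A dedicated PDF/A validation tool is still needed before calling a file compliant.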
The most commonly used version of PDF/A is PDF/A-1, which is based on PDF 1.4. If you want to convert a PDF file that uses a later PDF version to PDF/A-1a or PDF/A-1b, you will have to remove all features that are not already part of the PDF 1.4 format. As PDF/A-2 is based on PDF 1.7, that information loss is reduced, but it is still not good archiving practice to store your documents as PDF/A-2 without keeping the original format.
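A quick way to see which files are likely to be affected is to read the version from the PDF header. The sketch below is a hedged illustration with made-up function names. It only looks at the `%PDF-x.y` header near the start of the file, whereas a PDF can also declare a newer version in its document catalog, and a 1.4 header is no guarantee that the file is PDF/A-1 compliant.

```python
import re

HEADER_PATTERN = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(pdf_bytes: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    match = HEADER_PATTERN.search(pdf_bytes[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def exceeds_pdfa1_base(pdf_bytes: bytes) -> bool:
    """True if the header declares a version later than PDF 1.4."""
    version = header_version(pdf_bytes)
    return version is not None and version > (1, 4)
```

Files for which `exceeds_pdfa1_base` returns True are the ones where a conversion to PDF/A-1 may have to strip features, so they deserve a closer look before and after conversion.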
Should we convert all files to PDF/A?
No, you shouldn’t. But, if PDF/A offers so many advantages, why shouldn’t we convert all files to PDF/A?
There are several reasons why not to do so. From our point of view this might sound a bit bizarre, as we offer our own solution to convert files to PDF/A, correct them and validate them. But the DocShifter solution also allows for your files to be converted to other formats. You should only convert those files to PDF/A for which it makes sense and for which it works fine without too much of an additional cost.
Let’s have a look at different file types that you might want to convert to PDF/A:
- Image files:
There is often not much added value in converting image files (e.g. TIFF files) to PDF/A for the sake of visual presentation, as these files generally already have a correct graphical and content representation of themselves. In some cases, if other (non-image) file types cannot correctly be converted to PDF/A, they are, as a last resort, converted to an image and stored in a PDF/A container.
Converting your image files to PDF(/A) offers advantages that might be interesting to many users:
- Compression techniques can be applied
- The PDF can be searchable
- The file is easier to transfer and to view on different devices
- Scanned Documents:
Scanned documents are in essence image files (unless they are converted to a document via OCR techniques). So, for scanned documents, the previous bullet point on image files applies.
- Non-text documents:
You should always consider what information is pertinent to archive, before deciding what format you want to use. Let me use the example of an Excel Worksheet. If the calculations and the logic used to attain a result or report are essential to the information, then you should not consider only preserving a PDF(/A) version of the document, as that information is then lost.
- Converting PDF files to PDF/A:
Well, unfortunately, this is a question that we have already received several times, and it is an issue that is not always easily solved. For starters, it makes a difference which type of PDF file you’re starting with (I won’t cover them all here):
- Applicable to all PDF versions:
All features that are not supported by your target PDF/A version will be removed from the PDF file.
- PDF Image:
This can easily be converted to PDF/A, as there is no need to embed the fonts. It might require a different compression method or some other modifications.
- PDF Text:
This file type is normally created using a text document as a source file. As opposed to PDF/A, it does not necessarily have the fonts embedded, nor does it necessarily meet the other requirements and limitations. When trying to convert such a file to PDF/A, the fonts need to be embedded, if possible, and a set of other actions needs to be applied. This is a process of verifying, applying modifications and validating. It might be necessary to repeat this process several times, and in the end it might still fail in some cases. If all else fails, the original file might be transformed to an image contained in a PDF/A.
- Applicable to all PDF versions:
One has to keep in mind that a PDF/A file can pass validation 100% and still contain visual inconsistencies. If you wish to eliminate these, a visual validation of the PDF/A is required, followed by a manual correction if necessary.
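The verify, modify and validate cycle described for PDF Text files can be sketched as a small loop. This is only an outline under stated assumptions: `validate`, `repair` and `rasterize` are hypothetical callables standing in for whatever conversion and validation tooling you use, and they do not refer to any real library or to a specific product.

```python
from typing import Callable, List, Tuple


def convert_with_retries(
    pdf: bytes,
    validate: Callable[[bytes], List[str]],      # returns a list of problems
    repair: Callable[[bytes, List[str]], bytes],  # tries to fix those problems
    rasterize: Callable[[bytes], bytes],          # last resort: page images
    max_rounds: int = 3,
) -> Tuple[bytes, str]:
    """Repeat validate/repair up to max_rounds, then fall back to images."""
    current = pdf
    for _ in range(max_rounds):
        problems = validate(current)
        if not problems:
            return current, "converted"
        current = repair(current, problems)
    # One final check after the last repair attempt.
    if not validate(current):
        return current, "converted"
    # All else failed: turn the original input into an image-based file.
    return rasterize(pdf), "rasterized"
```

Note that a result labelled "converted" has only passed the automated validation. As said above, it can still contain visual inconsistencies, so a visual check remains necessary, and the rasterized fallback should itself be validated as well.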
“Functional” or Compliance Limitations:
As said earlier in this post, people consider PDF/A a best-practice format for archiving purposes and expect it to serve one of the following goals. But you won’t achieve these goals with PDF/A alone:
- Demonstrating the file has not been tampered with.
This is not something that can be achieved via PDF/A, but via solutions that either create an audit trail for the file or add a digital seal to the document.
- Audit trailing: This can be done via an ECM solution armed with the right functionalities or via a solution such as Kazeon.
- Digital Sealing: A qualified digital seal can prove, within the right conditions, that the file has not been tampered with.
- Ensuring legibility of the digitally preserved file:
The international standard in this field is the OAIS standard. OAIS proposes a set of best practices that should be implemented, but does not impose any specific format. The most important thing to remember is that you always, ALWAYS, need to preserve the original format. The migrated (or converted) format is there to make sure you are still able to read and access the information. Over time the “secondary” format might change. Also, at any time, you need to be able to verify or validate the secondary format against the primary format.
Imagine if anything goes wrong with the conversion, what will you do then?
Always keep the original file format!
Some more PDF(/A) challenges in a future post.
Traveling & Writing
5 years ago: You say, "the files need to be embedded and a set of other actions need to be applied." I think you mean "fonts" in place of "files."
Enabling Cybersecurity & Data Protection @ Scale
9 years ago: Nice article debunking the myth of the role of PDF/A in archiving!
ICT Manager at Van Benthem & Keulen | Volunteer at AHN and HiP | Organizer of Paasroute | Creatively engaged
9 years ago: Thanks for this article. Very useful.
VP Marketing/Product Strategist turned Industry Analyst & Board Director. Excited about Category Creation | Martech/Adtech Trends | Go-to-market strategy | Market Analysis | Innovation
9 years ago: Great post. You’re right: the myths and complexities of PDF/A abound (even in this post I think some of the details need updating based on the newer versions - 1,2,3,a,b,u of the standard). I should also add, our (@adlibsoftware) customers are finding that PDF/A is only part of the story... intelligent compression, de-duplication, ROT analysis, etc. are also critical to truly effective long-term digital preservation.
Experienced and committed in corporate finance advisory. CFO, CDO, Expertise in ERP, financial modelling, M&A projects, information management and digitalization. Eager to analyze, clarify, optimize, complex biz-issues.
9 years ago: Great post to be read by all actors in the digital transition of the financial supply chain. PDF is not dead in e-invoicing, especially not in the lower mid and long tail, and for compliance and archiving.