Using print image files (part of a continuing series)
Last week I talked about PDF as a data transfer choice. The logical extension is to discuss print image files, as they serve a similar purpose. Both are representations of a printed document, but print image files are pure text, and are therefore a much more reliable source for transferring data.
As with any data transfer protocol, there are a number of characteristics that you should be aware of. Luckily, since this is a well-established technique, most of the tools you might use handle the bulk of the issues consistently and well. Virtually every tool reads the source report and extracts the data you select into a flat file format of some type. Because of the similarities, I will focus only on those characteristics that may differentiate your choices, or are likely to be problematic.
Re-use of existing templates can be a big time saver with print image files. Because these files are reports, it is quite common to encounter the same format repeatedly, perhaps for other locations or time periods. There tend to be a lot of fussy definition steps when working with these files, regardless of your tool, since you need to identify which information is relevant and how to pick it out of the report. Assuming this isn't the first time you've encountered this report format, an ideal tool will let you choose an existing template to save you as much of this work as possible. Re-defining the data is not only extra work, it also increases the chance of errors: you may miss some subtle characteristic needed to identify the appropriate records, resulting in an incorrect analysis. Of course, Arbutus lets you re-use existing templates with new reports, as do some other tools, eliminating this risk and tedium.
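To make the idea of template re-use concrete, here is a minimal sketch in Python. It is not how Arbutus or any particular tool stores its templates. The TEMPLATE layout, the column positions, the extract and make_report helpers and both sample reports are all invented for illustration. The point is only that the fussy definition work lives in one place and can be applied unchanged to a second report in the same format.

```python
import re

# Hypothetical template: a pattern that identifies detail lines, plus
# fixed column positions (start, end) for each field to capture.
TEMPLATE = {
    "detail_pattern": r"^\d{5} ",  # detail lines start with a 5-digit item number
    "fields": {
        "item": (0, 5),
        "description": (6, 26),
        "amount": (26, 36),
    },
}

def extract(report_text, template):
    """Return one dict per detail line, using the template's column positions."""
    rows = []
    for line in report_text.splitlines():
        if re.match(template["detail_pattern"], line):
            rows.append({name: line[start:end].strip()
                         for name, (start, end) in template["fields"].items()})
    return rows

def make_report(location, items):
    """Build a small made-up report in the shared layout (illustration only)."""
    lines = [f"INVENTORY REPORT - {location}", ""]
    lines += [f"{item:<5} {desc:<20}{amount:>10}" for item, desc, amount in items]
    lines.append("END OF REPORT")
    return "\n".join(lines)

# Same format, two locations: the template is defined once and used twice.
vancouver = make_report("VANCOUVER", [("10001", "Widget", "12.50"),
                                      ("10002", "Gadget", "7.25")])
toronto = make_report("TORONTO", [("20001", "Sprocket", "3.10")])

vancouver_rows = extract(vancouver, TEMPLATE)
toronto_rows = extract(toronto, TEMPLATE)
```

Heading and trailer lines are skipped because they do not match the detail pattern, which is exactly the kind of subtle rule that is easy to get wrong when a template is rebuilt by hand.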
Changes to an existing template are also very common. A typical scenario is to define a report template to capture just the information you know you need right now. The reason you don't capture everything is a combination of the tedious nature of defining every single report item and the size of the resulting file. As such, you may only define the key items on each line, omitting those that you think won't be relevant. Even if you define all the items on the detail line, who would capture the page number, or the time/date of creation from the report heading? In either case, you sometimes realize later that some of this uncaptured information would be useful after all. A poor product will force you to repeat all your steps from scratch (not only tedious, but risky, as explained above). An ideal product (Arbutus included) will let you retrieve the existing template and add to or modify it as required. The very existence of this capability allows you to be frugal at the outset, knowing that you can always add more data capture later.
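As a rough illustration of adding to a template rather than starting over, the Python sketch below begins with a frugal template and later extends a copy of it with one more detail field and the page number from the report heading. The template structure, field names, column positions and sample report are all assumptions made for this example, not the format of any real product.

```python
import copy
import re

# Hypothetical starting template: only the detail fields needed at first.
base_template = {
    "detail_pattern": r"^\d{5} ",
    "detail_fields": {"item": (0, 5), "amount": (26, 36)},
    "header_fields": {},  # nothing captured from page headings yet
}

# Later: extend a copy of the existing template instead of redefining it.
extended_template = copy.deepcopy(base_template)
extended_template["detail_fields"]["description"] = (6, 26)
extended_template["header_fields"]["page"] = r"PAGE\s+(\d+)"

def extract(report_text, template):
    """Capture detail fields, carrying the latest heading values onto each row."""
    current_header = {}
    rows = []
    for line in report_text.splitlines():
        for name, pattern in template["header_fields"].items():
            found = re.search(pattern, line)
            if found:
                current_header[name] = found.group(1)
        if re.match(template["detail_pattern"], line):
            row = {name: line[start:end].strip()
                   for name, (start, end) in template["detail_fields"].items()}
            row.update(current_header)
            rows.append(row)
    return rows

# A made-up two-page report in the layout the template expects.
report = "\n".join([
    "SALES REPORT                  PAGE 1",
    f"{'10001':<5} {'Widget':<20}{'12.50':>10}",
    "SALES REPORT                  PAGE 2",
    f"{'10002':<5} {'Gadget':<20}{'7.25':>10}",
])

frugal_rows = extract(report, base_template)
fuller_rows = extract(report, extended_template)
```

The original definitions (the detail pattern and the first two fields) are untouched in the extended template, so none of the earlier work has to be repeated or re-verified.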
Character sets are also an ongoing issue. Virtually every tool will allow you to define ASCII data, but if the data is coming from the mainframe world you may encounter EBCDIC. That said, this is typically not an issue, as most transfer programs automatically provide EBCDIC-to-ASCII conversion. What can be a problem is Unicode. There are two main variants: UTF-8 and UTF-16. You may well encounter data in either of these formats, depending on the circumstances, and supporting both is a definite asset. Since most text editors support both, you might think that you could read it in one format (say UTF-8) and have the editor convert it to the other (UTF-16). The problem with this is that UTF-8 is a multi-byte character set (MBCS): a single character can take up to 4 bytes, and in typical use you will see 1, 2 or 3. After conversion, you will often find that the data has become misaligned, since most other character sets use the same width for every character, so field positions worked out against the UTF-8 bytes no longer line up.
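The misalignment is easy to see with a small experiment. The Python sketch below uses two made-up report lines that look identical in layout, a 10-character name column followed by an amount, and compares their sizes in characters, UTF-8 bytes and UTF-16 bytes. The sample names and the column position are assumptions chosen for illustration.

```python
# Two made-up lines with the same visible layout: a name padded to
# 10 characters, followed by an amount.
plain = "Zoe".ljust(10) + "100.00"
accented = "Zoë".ljust(10) + "100.00"

def sizes(line):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for one line."""
    return (len(line),
            len(line.encode("utf-8")),
            len(line.encode("utf-16-le")))  # little-endian, no byte-order mark

plain_sizes = sizes(plain)
accented_sizes = sizes(accented)

# Slicing by character position keeps the amount column aligned.
amount_by_char = accented[10:]

# Slicing the raw UTF-8 bytes at the same offset does not: the accented
# letter occupies two bytes, so everything after it sits one byte later.
amount_by_byte = accented.encode("utf-8")[10:].decode("utf-8")

print(plain_sizes, accented_sizes)
print(repr(amount_by_char), repr(amount_by_byte))
```

Both lines have the same number of characters and the same UTF-16 size, but the accented line is one byte longer in UTF-8, so a field position defined in bytes picks up a stray leading space. With more accented characters on a line the drift grows, and a byte offset can even land in the middle of a character.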
As mentioned at the start, print image files are a much better data transfer choice than PDF, but you want to make sure your tool choice doesn't force you to repeatedly re-invent the wheel in normal usage.
Check out some of my other weekly posts at https://www.dhirubhai.net/today/author/0_1c4mnoBSwKJ9wfyxYP_FLh?trk=prof-sm