Using print image files (part of a continuing series)

Last week I talked about PDF as a data transfer choice.  The logical extension is to discuss print image files, as they serve a similar purpose.  Both are representations of a printed document, but print image files are pure text, and are therefore a much more reliable source from which to extract data.

As with any data transfer protocol, there are a number of characteristics that you should be aware of.  Luckily, since this is a well-established technique, most of the tools you might use handle the bulk of the issues consistently and well.  Virtually every tool reads the source report and extracts the data you select into a flat file format of some type.  Because of the similarities, I will focus only on those characteristics that may differentiate your choices, or are likely to be problematic.
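To make the basic idea concrete, here is a minimal sketch of what such a tool does under the hood: pick out the detail lines of a report and slice the selected fields into a flat file.  The report layout, the column positions and the names `DETAIL_LINE`, `extract_rows` and `to_csv` are all invented for illustration; they are assumptions, not the behaviour of any particular product.

```python
import csv
import io
import re

# Hypothetical sample report; the layout is invented for illustration.
REPORT = """\
ACME CORP          SALES REPORT          PAGE 1
INVOICE   CUSTOMER        AMOUNT
10001     SMITH J         125.50
10002     JONES A          89.99
                  TOTAL   215.49
"""

# Assumption: a detail line starts with a five-digit invoice number.
DETAIL_LINE = re.compile(r"^\d{5}\s")


def extract_rows(report_text):
    """Return (invoice, customer, amount) for each detail line."""
    rows = []
    for line in report_text.splitlines():
        if DETAIL_LINE.match(line):
            # Fields are picked out by fixed column position.
            invoice = line[0:10].strip()
            customer = line[10:26].strip()
            amount = line[26:].strip()
            rows.append((invoice, customer, amount))
    return rows


def to_csv(rows):
    """Write the rows to a delimited flat file held in memory."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["invoice", "customer", "amount"])
    writer.writerows(rows)
    return out.getvalue()


rows = extract_rows(REPORT)
flat_file = to_csv(rows)
print(flat_file)
```

The heading, column titles and total line are skipped because they do not match the detail-line pattern, which is exactly the kind of fussy rule that has to be right for the extract to be trustworthy.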

Re-use of existing templates can be a big time saver with print image files.  Because these files are reports, it is quite common to encounter the same format repeatedly, but perhaps for other locations or time periods.  There tend to be a lot of fussy definition steps when working with these files, regardless of your tool, since you need to identify which information is relevant, and how to pick it out of the report.  Assuming this isn't the first time you've encountered this report format, an ideal tool will let you choose an existing template to save you as much of this work as possible.  Re-defining the data is not only extra work, it also increases the chance of errors.  When redefining, you may miss some subtle characteristic needed to identify the appropriate records, resulting in an incorrect analysis.  Of course, Arbutus lets you re-use existing templates with new reports, as do some other tools, eliminating this risk and tedium.
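One way to picture template re-use is to keep the report definition as data, separate from the report itself, so it can be saved once and applied to any report with the same layout.  The sketch below is only an illustration under that assumption: the template structure, the `apply_template` function and the two sample reports are hypothetical and do not describe how Arbutus or any other tool stores its templates.

```python
import json
import re

# Hypothetical template: a record-selection pattern plus named column ranges.
TEMPLATE = {
    "detail_pattern": r"^\d{5}\s",
    "fields": {"invoice": [0, 10], "customer": [10, 26], "amount": [26, 36]},
}


def apply_template(template, report_text):
    """Extract one dict per detail line, using the template's definitions."""
    pattern = re.compile(template["detail_pattern"])
    records = []
    for line in report_text.splitlines():
        if pattern.match(line):
            records.append({
                name: line[start:end].strip()
                for name, (start, end) in template["fields"].items()
            })
    return records


# Serialising to JSON stands in for saving the template and loading it later.
saved = json.dumps(TEMPLATE)
reloaded = json.loads(saved)

# Two invented reports with the same layout but different periods.
JANUARY_REPORT = (
    "SALES REPORT  JANUARY\n"
    "10001     SMITH J         125.50\n"
)
FEBRUARY_REPORT = (
    "SALES REPORT  FEBRUARY\n"
    "10417     LEE K           310.00\n"
    "10418     PATEL R          42.75\n"
)

january = apply_template(reloaded, JANUARY_REPORT)
february = apply_template(reloaded, FEBRUARY_REPORT)
print(january)
print(february)
```

Because the selection rule and column positions are defined once, the second report is read with exactly the same logic as the first, and there is no opportunity to mistype a column boundary.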

Changes to an existing template are also very common.  A typical scenario is to define a report template to capture just the information you know you need right now.  You don't capture everything because defining every single report item is tedious, and because the resulting file gets larger.  As such, you may only define the key items on each line, omitting those that you think won't be relevant.  Even if you define all the items on the detail line, who would capture the page number, or the time/date of creation from the report heading?  In either case, you sometimes subsequently realize that some of this uncaptured information would be useful after all.  A poor product will force you to repeat all your steps again from scratch (not only tedious, but risky, as explained above).  An ideal product (Arbutus included) will let you retrieve the existing template and add/modify as required.  The very existence of this capability allows you to be frugal at the outset, knowing that you can always add more data capture later.
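As a rough illustration of adding to a template rather than starting over, the sketch below begins with a frugal definition that captures only detail fields, then extends a copy of it to pick up the page number and run date from the report heading.  Everything here is an assumption made for the example: the template structure, the heading layout, and the names `extend_template` and `apply_template` are hypothetical.

```python
import copy
import re

# Hypothetical starting template: only two detail fields are captured.
BASE_TEMPLATE = {
    "detail_pattern": r"^\d{5}\s",
    "detail_fields": {"invoice": (0, 10), "amount": (26, 36)},
    "header_fields": {},  # nothing captured from the page heading yet
}


def extend_template(template, header_fields):
    """Return a copy of the template with extra heading fields added."""
    extended = copy.deepcopy(template)
    extended["header_fields"].update(header_fields)
    return extended


def apply_template(template, report_text):
    """Extract detail records, carrying the latest heading values down."""
    detail = re.compile(template["detail_pattern"])
    headers = {
        name: re.compile(pattern)
        for name, pattern in template["header_fields"].items()
    }
    current = {name: None for name in headers}
    records = []
    for line in report_text.splitlines():
        # Remember the most recent value seen for each heading field.
        for name, pattern in headers.items():
            match = pattern.search(line)
            if match:
                current[name] = match.group(1)
        if detail.match(line):
            record = {
                name: line[start:end].strip()
                for name, (start, end) in template["detail_fields"].items()
            }
            record.update(current)
            records.append(record)
    return records


# Invented two-page report; the date and layout are placeholders.
REPORT = (
    "RUN DATE 2016-03-01   SALES REPORT   PAGE 1\n"
    "10001     SMITH J         125.50\n"
    "RUN DATE 2016-03-01   SALES REPORT   PAGE 2\n"
    "10002     JONES A          89.99\n"
)

template_v2 = extend_template(
    BASE_TEMPLATE,
    {"page": r"PAGE\s+(\d+)", "run_date": r"RUN DATE\s+(\S+)"},
)

before = apply_template(BASE_TEMPLATE, REPORT)
after = apply_template(template_v2, REPORT)
print(before)
print(after)
```

The original definitions are untouched, so the record selection that was already working keeps working, and only the new heading fields need to be checked.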

Character sets are also an ongoing issue.  Virtually every tool will allow you to define ASCII data, but if the data is coming from the mainframe world you may encounter EBCDIC.  That said, this is typically not an issue, as most transfer programs automatically provide EBCDIC-to-ASCII conversion.  What can be a problem is Unicode.  There are two main variants: UTF-8 and UTF-16.  You may well encounter data in either of these formats, depending on the circumstances, and supporting both is a definite asset.  Since most text editors support both, you might think that you could read it in one format (say UTF-8) and have the editor convert it to the other (UTF-16).  The problem with this is that UTF-8 is a multi-byte character set (MBCS): a single character can take anywhere from 1 to 4 bytes, and in typical use you will see a mix of 1, 2 and 3 byte characters.  Print image files rely on fixed column positions, so after conversion you will often find that the data has become misaligned, since most other character sets have the same width for every character.
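A small sketch shows why variable-width characters upset column positions.  It assumes a hypothetical two-field line (name in columns 0 to 9, amount in columns 10 to 15) and compares picking the amount by character position with picking it by byte position; the helper names `field_by_chars` and `field_by_bytes` are invented for the example.

```python
def field_by_chars(line, start, end):
    """Slice a field using character positions."""
    return line[start:end].strip()


def field_by_bytes(raw, start, end, encoding):
    """Slice a field using byte positions, then decode it."""
    return raw[start:end].decode(encoding, errors="replace").strip()


# "\u00dc" is U with an umlaut; the line is 16 characters long.
line = "M\u00dcLLER J  125.50"

utf8 = line.encode("utf-8")        # the accented letter takes 2 bytes
utf16 = line.encode("utf-16-le")   # every character here takes 2 bytes

by_chars = field_by_chars(line, 10, 16)
by_utf8_bytes = field_by_bytes(utf8, 10, 16, "utf-8")
by_utf16_bytes = field_by_bytes(utf16, 20, 32, "utf-16-le")

print(len(line), len(utf8), len(utf16))
print(by_chars, by_utf8_bytes, by_utf16_bytes)
```

With UTF-8 the single two-byte character shifts everything after it by one byte, so the byte-based slice loses the last digit of the amount.  With UTF-16 the byte offsets for this line are simply double the character offsets, so the field still lines up.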

As mentioned at the start, print image files are a much better data transfer choice than PDF, but you want to make sure your tool choice doesn't force you to repeatedly re-invent the wheel in normal usage.

Check out some of my other weekly posts at https://www.dhirubhai.net/today/author/0_1c4mnoBSwKJ9wfyxYP_FLh?trk=prof-sm
