Using Delimited Data (last of a series)

The delimited data format is the workhorse of data transfer standards, and for good reason. It is designed specifically for data transfer, typically produces a very compact file, and is still flexible enough to handle most typical data. Even so, there are some factors you should consider when working with this type of data.

A very important factor with delimited data is the speed of conversion. Not all applications require an import (Arbutus doesn't, for example), but for those that do, this is often a very time-consuming process. It's seldom an issue with small files of a few megabytes, but for larger files you might want to benchmark this step, to at least get a sense of what you're in for.
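If you want a quick sense of the parsing cost before committing to a large import, a minimal Python sketch is enough to time a full pass over the file. The file name here ("transactions.csv") is just a placeholder, and note this only times the raw parse; your tool's actual import may well be slower:

    import csv
    import time

    # Time a full parse of a delimited file to gauge import cost.
    # "transactions.csv" is a hypothetical file name.
    start = time.perf_counter()
    with open("transactions.csv", newline="", encoding="utf-8") as f:
        row_count = sum(1 for _ in csv.reader(f))
    elapsed = time.perf_counter() - start
    print(f"Parsed {row_count:,} rows in {elapsed:.2f} seconds")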

Almost as important as speed is the size of the data after import. Simplistically, you might assume that this doubles the size, but that is often a low estimate, for two reasons. The first is that the import process typically wants to line up the data in each column. If any of the columns hold variable-length data (like names, addresses or comments), the converted file pads the shorter fields with blanks so that the fields all line up in the output file. If you have lots of variable data (or even one field, like a comment, that is unusually variable), the extra blank padding can easily double your initial expectation.
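To make the padding effect concrete, here is a rough sizing sketch: it pads every column to its widest observed value, the way a fixed-width "flattened" layout must, and compares that to the delimited file's actual size. It assumes a non-empty, well-formed file ("customers.csv" is a placeholder) in which every row has the same number of fields:

    import csv
    import os

    # Estimate the flattened size: every field gets padded out to the
    # widest value seen in its column. "customers.csv" is hypothetical,
    # and the file is assumed non-empty with uniform row lengths.
    path = "customers.csv"
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    flattened = len(rows) * sum(widths)  # each row padded to full width
    print(f"Delimited: {os.path.getsize(path):,} bytes; "
          f"flattened estimate: {flattened:,} bytes")

Notice that a single long value in one row forces every row to carry that column's full width, which is exactly how one outlier comment field can blow up the total.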

The second size issue is character sets. If your tool converts the delimited data to Unicode UTF-16, the size may well double again. Combined with the padding issue above, you may find that the converted data is four times larger than the source!
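You can verify the UTF-16 doubling for yourself in a couple of lines: ASCII text needs one byte per character in UTF-8 but two in UTF-16.

    # ASCII text: one byte per character in UTF-8, two in UTF-16.
    text = "123 Main Street, Springfield"
    print(len(text.encode("utf-8")))      # 28 bytes
    print(len(text.encode("utf-16-le")))  # 56 bytes: exactly double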

Of course, neither of these factors is an issue for Arbutus, as we read delimited data directly, without conversion. No time is spent "flattening" a delimited file; all you need to do is point at it like any other file. Similarly, you don't need to allow extra space for a flattened file, because there is no flattened file! And all character set options are supported on input.

Following up on my XML discussion last week, you should be aware that delimited files inherently represent a relational model, which works well with most analytics tools. This will only be a factor if your source data is hierarchical. An example of this might be a master file that supports multiple addresses, for when a customer moves. Some people move frequently, so no fixed number of address fields in a single flat record will ever be enough. In this case, the source system may well use a hierarchical model, which doesn't map as cleanly onto delimited files.
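A small sketch shows the mismatch. A hierarchical customer record with a variable number of addresses has no natural single-row form; the usual relational workaround is a child table, produced here as one delimited row per address (the sample data is invented for illustration):

    # One customer, many addresses: a hierarchy that a single
    # delimited row cannot hold. Sample data is purely illustrative.
    customer = {
        "id": "C001",
        "name": "Jane Smith",
        "addresses": ["12 Oak Ave", "98 Elm St", "4 Pine Rd"],
    }

    # Relational workaround: a child "addresses" table, with the
    # parent key repeated on every row.
    for seq, addr in enumerate(customer["addresses"], start=1):
        print(f'{customer["id"]},{seq},"{addr}"')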

Delimiters tend to be an issue you should at least be aware of. They matter because your source data may well contain the very characters chosen as delimiters. The easy example is using double quotes to delimit strings; but what happens when your data contains double quotes? There are two approaches to this problem. One is to use unusual delimiters (like a pipe character) that are less likely to appear in your data. The second is to use some special convention to handle the case. For example, one convention is to use two consecutive quotes to represent a single quote; another is to precede any "special" character with a backslash (\). Either way, you should be aware of what is being used in the file, and how your software handles it.
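Python's csv module happens to support both conventions, which makes for a compact illustration: quote doubling is its default behavior, while backslash escaping needs an explicit escapechar.

    import csv
    import io

    row = ['ACME "Widgets" Inc', "42"]

    # Convention 1: double any embedded quote (the csv module default).
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
    print(buf.getvalue().strip())  # "ACME ""Widgets"" Inc","42"

    # Convention 2: precede special characters with a backslash.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL, doublequote=False,
               escapechar="\\").writerow(row)
    print(buf.getvalue().strip())  # "ACME \"Widgets\" Inc","42"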

Headings are a fairly minor issue. I only mention them because some delimited files include them and others don't. Make sure you don't inadvertently treat headings as data.
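In Python, for instance, csv.DictReader consumes the first row as field names, so a heading row never leaks into your data (the file and column names here are hypothetical):

    import csv

    # DictReader treats row one as headings, not data.
    with open("payments.csv", newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            print(record["amount"])  # assumes an "amount" heading exists

The flip side is that if the file has no heading row, this same code silently swallows your first data record, so it pays to check which kind of file you have.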

As with almost any data format these days, UTF-8 is an important factor. It has become the standard encoding for most data transfers due to its support for international characters, and your analytics tool should support it if you are reading international data.
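In practice this usually just means declaring the encoding when you open the file. A minimal sketch (the file name is a placeholder); "utf-8-sig" also strips the byte-order mark that some exporters, Excel among them, prepend to UTF-8 files:

    import csv

    # Declare the encoding explicitly; utf-8-sig also handles a leading
    # byte-order mark. "orders.csv" is a hypothetical file name.
    with open("orders.csv", newline="", encoding="utf-8-sig") as f:
        for row in csv.reader(f):
            ...  # process each row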

Subsequent refreshes are something to remember. If you intend to repeatedly read data in the same format, then you need to refresh your data, unless you want to repeat your import definition each and every time. The problem here is that next month's file may include new data that is longer than expected. A tool that flattens the source (like our competitor) will likely truncate this extra data, essentially corrupting the source, and it does so without even advising you. A direct-read tool, like Arbutus, doesn't suffer this shortcoming: the full data is read directly, so nothing is missing.

You will most likely encounter delimited data if you perform any but the simplest analytics.  This is generally an excellent data format to work with, as long as you are aware of the issues noted above.

Check out some of my other weekly posts at https://www.dhirubhai.net/today/author/0_1c4mnoBSwKJ9wfyxYP_FLh?trk=prof-sm
