Using Delimited Data (last of a series)

The delimited data format is the workhorse of data transfer standards, and for good reason. It is designed specifically for data transfer, typically produces a very compact file, and is still flexible enough to handle most typical data. Even so, there are some factors you should consider when working with this type of data.

A very important factor with delimited data is the speed of conversion. Not all applications require an import (Arbutus doesn't, for example), but for those that do, this is often a very time-consuming process. It's seldom an issue with small files of a few megabytes, but for larger files you might want to benchmark this step, to at least get a sense of what you're in for.
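If you want a quick sense of the parsing cost before committing to a large import, a minimal Python sketch is enough to time a full pass over the file. The file name here ("transactions.csv") is just a placeholder, and note this only times the raw parse; your tool's actual import may well be slower:

    import csv
    import time

    # Time a full parse of a delimited file to gauge import cost.
    # "transactions.csv" is a hypothetical file name.
    start = time.perf_counter()
    with open("transactions.csv", newline="", encoding="utf-8") as f:
        row_count = sum(1 for _ in csv.reader(f))
    elapsed = time.perf_counter() - start
    print(f"Parsed {row_count:,} rows in {elapsed:.2f} seconds")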

Almost as important as speed is the size of the data after import. Simplistically, you might assume that this doubles the size, but that is often a low estimate, for two reasons. The first is that the import process typically wants to line up the data in each column. If any of the columns hold variable-length data (like names, addresses or comments), the converted file pads the shorter fields with blanks so that the fields all line up in the output file. If you have lots of variable data (or even one field, like a comment, that is unusually variable), the extra blank padding can easily double your initial expectation.
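To make the padding effect concrete, here is a rough sizing sketch: it pads every column to its widest observed value, the way a fixed-width "flattened" layout must, and compares that to the delimited file's actual size. It assumes a non-empty, well-formed file ("customers.csv" is a placeholder) in which every row has the same number of fields:

    import csv
    import os

    # Estimate the flattened size: every field gets padded out to the
    # widest value seen in its column. "customers.csv" is hypothetical,
    # and the file is assumed non-empty with uniform row lengths.
    path = "customers.csv"
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    flattened = len(rows) * sum(widths)  # each row padded to full width
    print(f"Delimited: {os.path.getsize(path):,} bytes; "
          f"flattened estimate: {flattened:,} bytes")

Notice that a single long value in one row forces every row to carry that column's full width, which is exactly how one outlier comment field can blow up the total.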

The second size issue is character sets. If your tool converts the delimited data to Unicode UTF-16, the size may well double again. Combined with the padding issue above, you may find that the converted data is four times larger than the source!
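You can verify the UTF-16 doubling for yourself in a couple of lines: ASCII text needs one byte per character in UTF-8 but two in UTF-16.

    # ASCII text: one byte per character in UTF-8, two in UTF-16.
    text = "123 Main Street, Springfield"
    print(len(text.encode("utf-8")))      # 28 bytes
    print(len(text.encode("utf-16-le")))  # 56 bytes: exactly double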

Of course, neither of these factors is an issue for Arbutus, as we read delimited data directly, without conversion. No time is spent "flattening" a delimited file; all you need to do is point at it like any other file. Similarly, you don't need to allow extra space for a flattened file, because there is no flattened file! And all character set options are supported on input.

Following up on my XML discussion last week, you should be aware that delimited files inherently represent a relational model, which works well with most analytics tools. This will only be a factor if your source data is hierarchical. An example of this might be a master file that supports multiple addresses, for when a customer moves. Some people move frequently, so no fixed number of address fields in a single flat record will ever be enough. In this case, the source system may well use a hierarchical model, which doesn't map as cleanly onto delimited files.
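A small sketch shows the mismatch. A hierarchical customer record with a variable number of addresses has no natural single-row form; the usual relational workaround is a child table, produced here as one delimited row per address (the sample data is invented for illustration):

    # One customer, many addresses: a hierarchy that a single
    # delimited row cannot hold. Sample data is purely illustrative.
    customer = {
        "id": "C001",
        "name": "Jane Smith",
        "addresses": ["12 Oak Ave", "98 Elm St", "4 Pine Rd"],
    }

    # Relational workaround: a child "addresses" table, with the
    # parent key repeated on every row.
    for seq, addr in enumerate(customer["addresses"], start=1):
        print(f'{customer["id"]},{seq},"{addr}"')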

Delimiters tend to be an issue you should at least be aware of. They matter because your source data may well contain the very characters chosen as delimiters. The easy example is using double quotes to delimit strings; but what happens when your data contains double quotes? There are two approaches to this problem. One is to use unusual delimiters (like a pipe character) that are less likely to appear in your data. The second is to use some special convention to handle the case. For example, one convention is to use two consecutive quotes to represent a single quote; another is to precede any "special" character with a backslash (\). Either way, you should be aware of what is being used in the file, and how your software handles it.
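Python's csv module happens to support both conventions, which makes for a compact illustration: quote doubling is its default behavior, while backslash escaping needs an explicit escapechar.

    import csv
    import io

    row = ['ACME "Widgets" Inc', "42"]

    # Convention 1: double any embedded quote (the csv module default).
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
    print(buf.getvalue().strip())  # "ACME ""Widgets"" Inc","42"

    # Convention 2: precede special characters with a backslash.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL, doublequote=False,
               escapechar="\\").writerow(row)
    print(buf.getvalue().strip())  # "ACME \"Widgets\" Inc","42"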

Headings are a fairly minor issue. I only mention them because some delimited files include them and others don't. Make sure you don't inadvertently treat headings as data.
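In Python, for instance, csv.DictReader consumes the first row as field names, so a heading row never leaks into your data (the file and column names here are hypothetical):

    import csv

    # DictReader treats row one as headings, not data.
    with open("payments.csv", newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            print(record["amount"])  # assumes an "amount" heading exists

The flip side is that if the file has no heading row, this same code silently swallows your first data record, so it pays to check which kind of file you have.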

As with almost any data format these days, UTF-8 is an important factor. It has become the standard encoding for most data transfers due to its support for international characters, and your analytics tool should support it if you are reading international data.
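In practice this usually just means declaring the encoding when you open the file. A minimal sketch (the file name is a placeholder); "utf-8-sig" also strips the byte-order mark that some exporters, Excel among them, prepend to UTF-8 files:

    import csv

    # Declare the encoding explicitly; utf-8-sig also handles a leading
    # byte-order mark. "orders.csv" is a hypothetical file name.
    with open("orders.csv", newline="", encoding="utf-8-sig") as f:
        for row in csv.reader(f):
            ...  # process each row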

Subsequent refreshes are something to remember. If you intend to repeatedly read data in the same format, then you need to refresh your data, unless you want to repeat your import definition each and every time. The problem here is that next month's file may include new data that is longer than expected. A tool that flattens the source (like our competitor) will likely truncate this extra data, essentially corrupting the source, and it does so without even advising you. A direct-read tool, like Arbutus, doesn't suffer this shortcoming: the full data is read directly, so nothing is missing.

You will most likely encounter delimited data if you perform any but the simplest analytics.  This is generally an excellent data format to work with, as long as you are aware of the issues noted above.

Check out some of my other weekly posts at https://www.dhirubhai.net/today/author/0_1c4mnoBSwKJ9wfyxYP_FLh?trk=prof-sm
