In netCDF-4/HDF5, Only Use the Shuffle Filter on Integer Data!

NetCDF-4 allows users to turn on the shuffle filter for a variable.

This regroups the bytes of the values in each chunk by byte position: the first bytes of all the values are stored together, then all the second bytes, and so on. For example, if you are storing four 4-byte integers, 10, 11, 12, and 13, they will look like this in hex (big-endian):

0000 000A 0000 000B 0000 000C 0000 000D

If the shuffle filter is used, they will be stored like this instead:

0000 0000 0000 0000 0000 0000 0A0B 0C0D

The three high-order bytes of every value, all of them zero, now sit next to each other, followed by the four low-order bytes.

The point is that this produces long runs of zeros in many cases. In the unshuffled storage, the longest run of zero bytes is 3 (24 bits); in the shuffled data, it is 12 (96 bits). This makes deflation (compression based on the zlib library) more effective.
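To experiment with this outside netCDF, here is a minimal Python sketch. It assumes the filter groups bytes by their position within each value (all first bytes, then all second bytes, and so on), as the HDF5 documentation describes. `byte_shuffle` and `longest_zero_run` are hypothetical helper names, not part of any library, and the 0..999 dataset is made up for illustration.

```python
import struct
import zlib


def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    """Group bytes by position: all first bytes, then all second bytes, etc."""
    if len(data) % itemsize != 0:
        raise ValueError("data length must be a multiple of itemsize")
    return b"".join(data[i::itemsize] for i in range(itemsize))


def longest_zero_run(data: bytes) -> int:
    """Return the length, in bytes, of the longest run of 0x00 bytes."""
    best = run = 0
    for b in data:
        run = run + 1 if b == 0 else 0
        best = max(best, run)
    return best


# The four example integers as big-endian 32-bit values.
raw = struct.pack(">4i", 10, 11, 12, 13)
shuffled = byte_shuffle(raw, 4)

print(raw.hex(" ", 4))       # 0000000a 0000000b 0000000c 0000000d
print(shuffled.hex(" ", 4))  # 00000000 00000000 00000000 0a0b0c0d
print(longest_zero_run(raw), longest_zero_run(shuffled))  # 3 12

# A larger, made-up dataset: compare deflated sizes with and without shuffle.
big = struct.pack(">1000i", *range(1000))
print(len(zlib.compress(big)), len(zlib.compress(byte_shuffle(big, 4))))
```

The last line prints the deflated size of the unshuffled and shuffled buffers, so you can compare them for your own data by swapping in a different `big`.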

But the shuffle filter should not be used for floating-point data. In IEEE 754 single-precision floating point, 10.1, 11.1, 12.1, and 13.1 look like this:

4121999a 4131999a 4141999a 4151999a

So in floating-point data, there are no long runs of zeros. Shuffling the data has no effect on compression, except to add an extra step to the work that the underlying HDF5 library has to do in order to read and write the data.
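The bit patterns above can be reproduced with Python's standard `struct` module. This is only a quick check, assuming big-endian IEEE 754 single precision; the variable names are illustrative.

```python
import struct

values = [10.1, 11.1, 12.1, 13.1]

# Pack each value as a big-endian 32-bit IEEE 754 float and show it in hex.
hex_words = [struct.pack(">f", v).hex() for v in values]
print(" ".join(hex_words))  # 4121999a 4131999a 4141999a 4151999a

# Count the zero bytes in the packed buffer.
raw = b"".join(struct.pack(">f", v) for v in values)
print(raw.count(0))  # 0
```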

Only use the shuffle filter for integer data that will be deflated. Otherwise, it is not worth using.


