On the Use of Nested Data in Hive

We recently wrapped up some efficiency improvements to Voldemort's Build and Push process. As we started rolling it out in production, I wanted to crunch some operational data to measure the resource utilization gains on the Hadoop cluster.

I asked around to determine what would be the best way to get visibility into our Build and Push jobs' metrics. Various methods were suggested, but I was eventually pointed to the most promising one of the bunch: an experimental* Hive table containing metadata about all of our Hadoop jobs.

*I was warned in advance that the table was not ready for general consumption because its schema was not yet finalized.

Hive, We Meet Again

It has been almost two years since I last used Hive, so I was enthusiastic to dive back in.

I'm glad to report that Hive works a lot more smoothly nowadays. I'm not sure how much of that improvement is because of Hive itself getting better and how much is thanks to the Hadoop team at LinkedIn running a pretty tight operation.

In any case, I clearly remember that when I was playing around with it 2-3 years ago, there were still plenty of rough edges: queries that would run forever unless they were written just right, manual steps to optimize query plans, etc. Now, everything Just Works™.

Everything? No, of course not. Nothing is ever perfect :)

Nested Data

The table I needed to deal with contained a fair amount of nested data. The structure of the table is one row per job execution, and then any metadata for which there is more than one item per job is stored in nested arrays. For example, the Map-Reduce counters are stored in an array of counter groups, each of which contains an array of counters. The counter is a struct containing name, display name and value. You can see a glimpse of one of the field's nested structure in this post's banner picture, above.
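As a rough sketch of such a layout (the real table's schema was not finalized, so all column and field names here are hypothetical), the declaration might look something like this in Hive:

```sql
-- Hypothetical Hive DDL; the actual table's schema was still in flux.
CREATE TABLE job_executions (
  job_id         STRING,
  -- One entry per counter group, each holding its own array of counters.
  counter_groups ARRAY<STRUCT<
    name: STRING,
    counters: ARRAY<STRUCT<
      name: STRING,
      display_name: STRING,
      value: BIGINT
    >>
  >>
);
```

The key point is the two levels of nesting: an array of groups, each containing an array of counter structs.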

I appreciate the possibilities that nested data brings to the table (pun not intended), but unfortunately, SQL is not always well suited to deal with it. In my case, I needed to find a few specific elements of the array, extract their values, and compute some aggregations on them. Let's see how this can be achieved.

Querying Nested Arrays

At first, I found the explode(array) table-generating function, but that seemed quite limited. I needed to explode two levels of nested arrays, and that didn't seem possible (or at least not without creating intermediate tables). I also needed to get a hold of the job ID associated with my counters, in order to do some further queries, and that didn't seem possible either with the explode function.

Then I tried lateral views, and that did the trick. The end result was a bit convoluted, with my main select query joined against several lateral views of the same table. But at least it worked.
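For illustration, assuming a hypothetical `job_executions` table where `counter_groups` is an array of structs, each containing a nested `counters` array, such a query might look roughly like this (the table, field, and counter names are examples, not the real schema):

```sql
-- Two LATERAL VIEWs unnest both levels of the array, while keeping
-- access to the job_id column from the outer row.
SELECT
  e.job_id,
  c.counter.name,
  SUM(c.counter.value) AS total_value
FROM job_executions e
LATERAL VIEW explode(e.counter_groups) g AS counter_group
LATERAL VIEW explode(g.counter_group.counters) c AS counter
WHERE g.counter_group.name = 'org.apache.hadoop.mapreduce.TaskCounter'
  AND c.counter.name = 'CPU_MILLISECONDS'
GROUP BY e.job_id, c.counter.name;
```

Each successive lateral view can reference the aliases produced by the previous one, which is what makes the two-level explode possible in a single query.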

Can We Do Better?

I am not sure whether there could be a better syntax to support nested array exploration in SQL. I imagine there could be, but that is a conversation for another day. In any case, I tend to agree with Postgres' documentation:

Tip: Arrays are not sets; searching for specific array elements can be a sign of database misdesign.

In the case of my table containing metadata on job executions, I believe nested data is perfectly fine, but it would have been more appropriate to use nested maps rather than nested arrays. In this case, one of the nested fields (the counter's name) would have been a perfectly appropriate map key, since the array of counter elements would be considered bogus if it contained two counters with the same name.

Using nested maps would have allowed me to access exactly the counters I wanted with a very simple syntax, without any joining against lateral views. Indeed, the syntax for accessing specific map elements is even simpler than performing a join against a normalized set of tables!
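Sketching the same hypothetical table with nested maps instead, keyed on group name and counter name, the lookup collapses to a single expression:

```sql
-- Hypothetical map-keyed variant of the same schema:
--   counter_groups MAP<STRING, MAP<STRING, BIGINT>>
SELECT
  job_id,
  counter_groups['org.apache.hadoop.mapreduce.TaskCounter']['CPU_MILLISECONDS']
    AS cpu_ms
FROM job_executions;
```

No explosion, no joins: just direct subscript access, one level of brackets per level of nesting.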

Closing Remarks

When designing Hive table schemas, always carefully consider whether the use of a nested array is appropriate. If there is any field inside the array's elements which you expect to be unique across all elements of the array, then consider the possibility of using a map keyed on that field, instead of an array.

More generally, whenever designing any data schema, try to anticipate which query patterns you expect to see. If getting at some part of your data seems convoluted or otherwise costly, take a moment to see if you potentially have a schema design issue on your hands.

Finally, if working on large scale challenges with a data-driven approach appeals to you, feel free to reach out to me. There might be a place for you within LinkedIn's Data org!

In a future installment, I'd like to share some of the specific efficiency improvement numbers that came out of the analysis mentioned in this post. Stay tuned!

EDIT [2016-02-24] : The analysis that came out of this Hive work is now published here: Slashing Voldemort's Hadoop Resource Usage

Félix GV

Planet-Scale Systems

9 years ago

If anyone's interested, the analysis that came out of this work is now published here: https://www.dhirubhai.net/pulse/slashing-voldemorts-hadoop-resource-usage-felix-gv

Nice write up... Did you think about using Pig? Whenever the SQL gets messy I end up wondering if Pig would be a better choice, what are your thoughts?

Felix GV, Pig's Flatten takes a collection and removes one level of nesting. It would take a little bit of playing around, but when the SQL gets weird, the verbose nature of Pig Latin can make it easier to think about the problem. A nested foreach is an operation across elements in a collection. One or the other, or a combination, would be a starting point. Pig had a JobHistoryLoader class, but that might be from before the JobHistory was written in JSON. -- Tom
孙力

Data Architect

9 years ago

Nice write up, Felix GV. A lot of deeply/heavily nested structures are pretty easy to generate or manipulate when we have application code that deals with one record at a time. But we pay an extra price in not-so-easy-to-read code at analysis time, when a big set of such data needs to be processed. Anticipating the API query pattern is very different from anticipating the batch query pattern. We have to keep a pragmatic balance here. Using a map can be problematic if the keys are free-form.

Drew Moxon

Leading Data Teams in Tech and Gaming

9 years ago

Great article Felix. I'll be writing a lot of logging specs soon where the data are accessed in Hive and it's always good to have a reminder to consider the proper nesting structures for querying efficiency.
