Slashing Voldemort's Hadoop Resource Usage

Towards the end of last year, we developed some improvements to Voldemort's Read-Only Build and Push process. After some stabilization efforts, I'm happy to report that the new code has now been running in production without issues for more than a month.

Around the time we rolled it out, I used Hive to analyze the impact on Hadoop's resource usage. Now that some more time has passed, I re-ran the analysis, and this post is an overview of the results.

What Changed?

We made three changes which impact the Voldemort Build and Push (BnP) job's usage of Hadoop resources:

  1. Reducer per Bucket: An existing feature is now enabled by default.
  2. Build Primary Replicas Only: An entirely new feature.
  3. Number of Jobs We Run: A change in how we configured our jobs. 

In this section, we'll give an overview of these two features and the job configuration change.

Reducer per Bucket

This is a very simple change. The "reducer.per.bucket" configuration already existed; we merely changed its default value from false to true. Let's see what this means for BnP jobs.

This configuration controls parallelism when the size of the job's input data is large. For larger data sets, the BnP job automatically attempts to divide each data partition into buckets (of up to 1 GB each, by default).

In the map tasks of the BnP job, we determine which partition/bucket each key belongs to in order to shuffle it to the appropriate reduce tasks. Then, it is the reduce tasks which write the final bucketed files. The "reducer.per.bucket" setting controls whether all buckets of a given partition should be sent to the same reduce task, or to independent reduce tasks.
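
To make this more concrete, here's a simplified sketch of the routing logic described above (an illustration only, not the actual Voldemort code, and the names are made up):

    // Simplified sketch (not actual Voldemort code): which reduce task a given
    // (partition, bucket) pair is routed to, depending on "reducer.per.bucket".
    class BucketRoutingSketch {
        static int reduceTaskFor(int partition, int bucket, int bucketsPerPartition,
                                 boolean reducerPerBucket) {
            if (reducerPerBucket) {
                // All buckets of a partition go to the same reduce task
                // (fewer, larger tasks).
                return partition;
            } else {
                // Each bucket of a partition goes to its own reduce task
                // (more, smaller tasks).
                return partition * bucketsPerPartition + bucket;
            }
        }
    }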

The trade-off here is that with "reducer.per.bucket=false" (the previous default), more reduce tasks are spawned in the Hadoop cluster, each of which has a smaller amount of work to do. This is sometimes desirable for use cases with a very aggressive SLA. However, this setting may be costly if the Hadoop cluster is already heavily loaded, since the start-up cost of each task is paid more often, and tasks may be queued while waiting for an available slot.

With the change to the configuration's default value, an interesting metric to look at is the wall-clock time the jobs take to complete (i.e., from the user's perspective, how long the Map Reduce part of the BnP job takes to run). Because of the reduced parallelism, the hypothesis is that there would be a slight degradation in the jobs' wall-clock time.

Build Primary Replicas Only

This one is a more significant change. Remember those reduce tasks we talked about in the Reducer per Bucket section? Another factor which can influence the number of reduce tasks used by the BnP job is the Voldemort store's replication factor.

In the legacy implementation (before this change), the BnP job would create one reducer per replica, and each reducer would be independently writing the exact same data! The likely rationale for this architecture was that it simplified some other parts of the system, but since we needed to change that code for other reasons anyway, we got a chance to revisit (and fix!) this inefficiency.

The BnP job can now operate in a new mode where each replica is built only once. The new mode is automatically enabled if both the servers and the BnP job are running on a version which supports it (and the process gracefully degrades to the old mode if not).

With this change, a couple of interesting metrics to look at would be the number of reduce tasks spawned per BnP job, and the sum of time spent in Hadoop tasks per BnP job. Since most of our Read-Only stores have a replication factor of 2, and a few have a replication factor of 3, the hypothesis is that both of those metrics would be between one third and one half of their original values.
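
To make the hypothesis more concrete, here's a back-of-the-envelope sketch of the expected reduce task counts, assuming "reducer.per.bucket=true" (again, this illustrates the description above and is not the actual BnP code):

    // Back-of-the-envelope reduce task counts, assuming one reduce task per
    // partition ("reducer.per.bucket=true"). Illustrative only.
    class ReducerCountSketch {
        // Legacy mode: one set of reduce tasks per replica, each writing the same data.
        static int legacyMode(int partitions, int replicationFactor) {
            return partitions * replicationFactor;
        }

        // "Build Primary Replicas Only" mode: each partition is built just once.
        static int primaryReplicasOnlyMode(int partitions) {
            return partitions;
        }
    }

For example, with 540 partitions, a store with a replication factor of 2 would go from 1080 reduce tasks down to 540 (a 50% reduction), and a store with a replication factor of 3 would go from 1620 down to 540 (a roughly 67% reduction).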

Number of Jobs We Run

There is one more thing which affects the amount of Hadoop resources used by Voldemort's Build and Push job: how many such jobs run each day?

One would think that this number of jobs would remain roughly the same before and after the roll out of the new code, but that was not the case. The reason for this is that we had many Hadoop flows which were configured in a wasteful way, with an independent BnP job for each destination data center.

Historically, the main reason why our partner teams (the Voldemort users) configured separate BnP jobs for each data center was in order to push concurrently. The BnP job did support pushing to a list of many destination clusters, and it would build the data set only once (thus not wasting Hadoop resources), but it would then push to each destination sequentially, thus increasing the total BnP run time.

This shortcoming was fixed a while ago and BnP now supports pushing to multiple clusters concurrently. Thus, we used our latest data center build out as an opportunity to do the right thing and get everyone to clean up the duplicate BnP jobs they owned.
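
As a rough illustration of the difference, here's a minimal sketch of sequential versus concurrent pushes (the pushTo() method and the overall structure are made up for illustration; this is not the actual BnP code):

    // Minimal sketch of sequential vs. concurrent pushes to multiple destination
    // clusters. The pushTo() method and overall structure are hypothetical.
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    class PushSketch {

        static void pushSequentially(List<String> clusters) {
            // Old behavior: total push time is roughly the sum of all the push times.
            for (String cluster : clusters) {
                pushTo(cluster);
            }
        }

        static void pushConcurrently(List<String> clusters) {
            // Current behavior: total push time is roughly that of the slowest push.
            CompletableFuture<?>[] pushes = clusters.stream()
                    .map(cluster -> CompletableFuture.runAsync(() -> pushTo(cluster)))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(pushes).join();
        }

        static void pushTo(String cluster) {
            // Placeholder for pushing the built data set to one Voldemort cluster.
        }
    }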

It's hard to hypothesize what the precise reduction in the number of jobs will be, but since we were pushing data to three data centers before this re-configuration initiative, the best case scenario would be to reduce the number of jobs by 2/3. Conversely, the worst case would be that all jobs are already properly configured, thus yielding no reduction at all.

Looking at the Numbers

Now comes the crunchy part! The numbers presented here are a comparison of the 30 days prior to the roll out of this change, alongside the 30 days following the roll out of the change (except when noted otherwise).

The roll out itself was done gradually over the course of a few days, one cluster at a time. The shaded blue area in the graphs represents the progress of the roll out, and the days where the change was only partially rolled out are excluded from the before and after 30 days averages.

Wall-Clock Time per Job

This metric is the average time it took for the Map Reduce part of the BnP job to complete end-to-end.

  • Average of the 30 days before: 5m 26s / job
  • Average of the 30 days after: 5m 2s / job
  • Delta: 7.5% reduction

Although the wall-clock time after the roll out appears to be slightly better on average, the difference is probably not big enough to be significant, given the variability of the metric. That being said, it is good that the wall-clock time is in the same ballpark as before, given that each task needs to do more work on average (because of the change in the Reducer per Bucket default configuration).

Reduce Tasks per Job

This metric is the average number of reduce tasks per job.

  • Average of the 30 days before: 1143 tasks / job
  • Average of the 30 days after: 540 tasks / job
  • Delta: 52.8% reduction

As we can see, the number of reduce tasks is now not only significantly smaller, but also much more predictable. Since we run our Voldemort Read-Only clusters with 540 partitions each, the BnP jobs now always generate 540 reduce tasks per job, no matter the size of the input data or the replication factor of the store.

In the future, we may want to selectively increase parallelism on some of the biggest BnP jobs by having them disable the Reducer per Bucket feature, which would slightly affect this metric, but so far this has not been necessary.

Total Hadoop Task Time per Job

This metric is the sum of the run time of all map and reduce tasks, averaged per job. The number may seem high, but that's because BnP jobs typically spawn hundreds or even thousands of parallel tasks. As you can see, there's a lot going on during the 5 minutes of wall-clock time each job takes on average!

  • Average of the 30 days before: 103 hours / job
  • Average of the 30 days after: 38 hours / job
  • Delta: 63.2% reduction

This is a very significant saving, demonstrating that BnP now requires far fewer resources!
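
To put these numbers in perspective: dividing roughly 103 hours of task time by 5 and a half minutes of wall-clock time suggests that, before the change, an average BnP job had on the order of 1,100 tasks in flight at any given moment, and roughly 450 after the change (back-of-the-envelope figures that ignore scheduling and queuing overhead).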

Keep in mind that this number is based on the LinkedIn workload, where most stores are configured with a replication factor of 2, and a few stores (which happen to be among our biggest, however) with a replication factor of 3. Another company running Voldemort Read-Only with a different breakdown of store sizes and replication factors would likely see a different amount of savings by upgrading to the latest version. Please do share your numbers if you try it out!

Number of Jobs per Day

This metric is the number of production BnP jobs run each day. Since the main driver of this change is the re-configuration of our partner teams' jobs, and not the roll out of the new code, the average "after" number presented below starts counting after the end of the re-configuration effort, rather than after the roll out.

  • Average of the 30 days before: 381 jobs / day
  • Average of the 15 days after re-configuration: 261 jobs / day
  • Delta: 31.4% reduction

We can see a few notable events on this graph:

  1. The dips during the beginning of the blue section are caused by the roll out process itself, because Voldemort servers currently do not support incoming Build and Push jobs while doing a rolling restart.
  2. We can see that the number of jobs right after the end of the roll out process is similar to what it was just before.
  3. Then, roughly two weeks later, our partner teams start modifying their jobs (which causes some disruption for several days) in order to remove duplicate jobs and support the new data center.
  4. Finally, roughly another week later, once the partner teams are done re-configuring, we can see that the number of jobs per day stabilizes at a lower number.

Since all of the graphs presented here have their time axes aligned, you can compare this graph with the next one to see the direct impact that the number of daily jobs has on total Hadoop resources used per day.

Final Numbers

With the changes described above, which greatly improved the efficiency of individual BnP jobs, as well as the reduction in the number of BnP jobs run per day, the final tally for the amount of Hadoop task time used by Voldemort at LinkedIn each day is:

  • Average of the 30 days before: 4.48 years / day
  • Average of the 15 days after re-configuration: 1.15 years / day
  • Delta: 74.4% reduction

Yes, that's not a mistake. The number is in years, because that's what tens of thousands of hours of compute time per day add up to!
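
As a rough sanity check, these figures follow directly from the per-job and per-day averages above (the small differences are due to rounding):

  • Before: 381 jobs/day × 103 hours/job ≈ 39,000 hours/day ≈ 4.5 years/day
  • After: 261 jobs/day × 38 hours/job ≈ 9,900 hours/day ≈ 1.1 years/day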

So there you have it: a near 75% reduction in Hadoop task slot time usage!

Closing Remarks

I'd like to thank Mark Wagner from the Hadoop team for helping me run the analysis that crunched the numbers presented here.

If I could go back and do something differently, it would be to separately measure the impact of the Reducer per Bucket default configuration change and that of the new Build Primary Replicas Only feature. Lumping both changes together makes it impossible to evaluate the merits of each.

That being said, I'm really glad that amidst all of the work required to build out a new data center, we were able to clearly see what's right for the company and to shed technical debt rather than pile on even more of it.

At LinkedIn's scale, optimizations can yield massive impacts. If that kind of work seems interesting to you, feel free to reach out to me! There might be an opening for you somewhere at LinkedIn!

Finally, if you are already running Voldemort Read-Only, we greatly encourage you to upgrade to the latest and greatest! And if you have any questions, feel free to ask them on the mailing list (:

Arunachalam Thirupathi

Software Engineer at Google

9y

Great work and nice post with beautiful graphs! Love it! We should probably post this to our open source forum as well :)

Richard S.

Talent @ Apple

9y

Great stuff Felix! Thanks for taking the time out to write these blogs and write them in such a way that many different audiences can understand.

Anthony Hsu

Software engineer

9y

This is really really awesome! I wonder if Dr. Elephant is now marking Voldemort BnP jobs as green after these changes?
