Slashing Voldemort's Hadoop Resource Usage

Towards the end of last year, we developed some improvements to Voldemort's Read-Only Build and Push process. After some stabilization efforts, I'm happy to report that the new code has now been running in production without issues for more than a month.

Around the time we rolled it out, I used Hive to analyze the impact on Hadoop's resource usage. Now that some more time has passed, I re-ran the analysis, and this post is an overview of the results.

What Changed?

We made three changes which impact the Voldemort Build and Push (BnP) job's usage of Hadoop resources:

  1. Reducer per Bucket: An existing feature is now enabled by default.
  2. Build Primary Replicas Only: An entirely new feature.
  3. Number of Jobs We Run: A change in how we configured our jobs. 

In this section, we'll give an overview of these two features and the job configuration change.

Reducer per Bucket

This is a very simple change. The "reducer.per.bucket" configuration already existed; we merely changed its default value from false to true. Let's see what this means for BnP jobs.

This configuration controls parallelism when the size of the job's input data is large. For larger data sets, the BnP job automatically attempts to divide each data partition into buckets (of up to 1 GB each, by default).

In the map tasks of the BnP job, we determine which partition/bucket each key belongs to in order to shuffle it to the appropriate reduce tasks. Then, it is the reduce tasks which write the final bucketed files. The "reducer.per.bucket" setting controls whether all buckets of a given partition should be sent to the same reduce task, or to independent reduce tasks.
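
To make this more concrete, here's a simplified sketch of the routing logic described above (an illustration only, not the actual Voldemort code, and the names are made up):

    // Simplified sketch (not actual Voldemort code): which reduce task a given
    // (partition, bucket) pair is routed to, depending on "reducer.per.bucket".
    class BucketRoutingSketch {
        static int reduceTaskFor(int partition, int bucket, int bucketsPerPartition,
                                 boolean reducerPerBucket) {
            if (reducerPerBucket) {
                // All buckets of a partition go to the same reduce task
                // (fewer, larger tasks).
                return partition;
            } else {
                // Each bucket of a partition goes to its own reduce task
                // (more, smaller tasks).
                return partition * bucketsPerPartition + bucket;
            }
        }
    }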

The trade-off here is that with "reducer.per.bucket=false" (the previous default), more reduce tasks are spawned in the Hadoop cluster, each of which has a smaller amount of work to do. This is sometimes desirable for use cases with a very aggressive SLA. However, this setting may be costly if the Hadoop cluster is already heavily loaded, since the start-up cost of each task is paid more often, and tasks may be queued while waiting for an available slot.

With the change to the configuration's default value, an interesting metric to look at is the wall-clock time the jobs take to complete (i.e., from the user's perspective, how long the Map Reduce part of the BnP job takes to run). Because of the reduced parallelism, the hypothesis is that there would be a slight degradation in the jobs' wall-clock time.

Build Primary Replicas Only

This one is a more significant change. Remember those reduce tasks we talked about in the Reducer per Bucket section? Another factor which can influence the number of reduce tasks used by the BnP job is the Voldemort store's replication factor.

In the legacy implementation (before this change), the BnP job would create one reducer per replica, and each reducer would be independently writing the exact same data! The likely rationale for this architecture was that it simplified some other parts of the system, but since we needed to change that code for other reasons anyway, we got a chance to revisit (and fix!) this inefficiency.

The BnP job can now operate in a new mode where each replica is built only once. The new mode is automatically enabled if both the servers and the BnP job are running on a version which supports it (and the process gracefully degrades to the old mode if not).

With this change, a couple of interesting metrics to look at would be the number of reduce tasks spawned per BnP job, and the sum of time spent in Hadoop tasks per BnP job. Since most of our Read-Only stores have a replication factor of 2, and a few have a replication factor of 3, the hypothesis is that both of those metrics would be between one third and one half of their original values.
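
To make the hypothesis more concrete, here's a back-of-the-envelope sketch of the expected reduce task counts, assuming "reducer.per.bucket=true" (again, this illustrates the description above and is not the actual BnP code):

    // Back-of-the-envelope reduce task counts, assuming one reduce task per
    // partition ("reducer.per.bucket=true"). Illustrative only.
    class ReducerCountSketch {
        // Legacy mode: one set of reduce tasks per replica, each writing the same data.
        static int legacyMode(int partitions, int replicationFactor) {
            return partitions * replicationFactor;
        }

        // "Build Primary Replicas Only" mode: each partition is built just once.
        static int primaryReplicasOnlyMode(int partitions) {
            return partitions;
        }
    }

For example, with 540 partitions, a store with a replication factor of 2 would go from 1080 reduce tasks down to 540 (a 50% reduction), and a store with a replication factor of 3 would go from 1620 down to 540 (a roughly 67% reduction).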

Number of Jobs We Run

There is one more thing which affects the amount of Hadoop resources used by Voldemort's Build and Push job: how many such jobs run each day?

One would think that this number of jobs would remain roughly the same before and after the roll out of the new code, but that was not the case. The reason for this is that we had many Hadoop flows which were configured in a wasteful way, with an independent BnP job for each destination data center.

Historically, the main reason why our partner teams (the Voldemort users) configured separate BnP jobs for each data center was in order to push concurrently. The BnP job did support pushing to a list of many destination clusters, and it would build the data set only once (thus not wasting Hadoop resources), but it would then push to each destination sequentially, thus increasing the total BnP run time.

This shortcoming was fixed a while ago and BnP now supports pushing to multiple clusters concurrently. Thus, we used our latest data center build out as an opportunity to do the right thing and get everyone to clean up the duplicate BnP jobs they owned.
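
As a rough illustration of the difference, here's a minimal sketch of sequential versus concurrent pushes (the pushTo() method and the overall structure are made up for illustration; this is not the actual BnP code):

    // Minimal sketch of sequential vs. concurrent pushes to multiple destination
    // clusters. The pushTo() method and overall structure are hypothetical.
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    class PushSketch {

        static void pushSequentially(List<String> clusters) {
            // Old behavior: total push time is roughly the sum of all the push times.
            for (String cluster : clusters) {
                pushTo(cluster);
            }
        }

        static void pushConcurrently(List<String> clusters) {
            // Current behavior: total push time is roughly that of the slowest push.
            CompletableFuture<?>[] pushes = clusters.stream()
                    .map(cluster -> CompletableFuture.runAsync(() -> pushTo(cluster)))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(pushes).join();
        }

        static void pushTo(String cluster) {
            // Placeholder for pushing the built data set to one Voldemort cluster.
        }
    }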

It's hard to hypothesize what the precise reduction in the number of jobs will be, but since we were pushing data to three data centers before this re-configuration initiative, the best case scenario would be to reduce the number of jobs by 2/3. Conversely, the worst case would be that all jobs are already properly configured, thus yielding no reduction at all.

Looking at the Numbers

Now comes the crunchy part! The numbers presented here are a comparison of the 30 days prior to the roll out of this change, alongside the 30 days following the roll out of the change (except when noted otherwise).

The roll out itself was done gradually over the course of a few days, one cluster at a time. The shaded blue area in the graphs represents the progress of the roll out, and the days where the change was only partially rolled out are excluded from the before and after 30 days averages.

Wall-Clock Time per Job

This metric is the average time it took for the Map Reduce part of the BnP job to complete end-to-end.

  • Average of the 30 days before: 5m 26s / job
  • Average of the 30 days after: 5m 2s / job
  • Delta: 7.5% reduction

Although the wall-clock time after the roll out appears to be slightly better on average, the difference is probably not big enough to be significant, given the variability of the metric. That being said, it is good that the wall-clock time is in the same ballpark as before, given that each task needs to do more work on average (because of the change in the Reducer per Bucket default configuration).

Reduce Tasks per Job

This metric is the average number of reduce tasks per job.

  • Average of the 30 days before: 1143 tasks / job
  • Average of the 30 days after: 540 tasks / job
  • Delta: 52.8% reduction

As we can see, the number of reduce tasks is now not only significantly smaller, but also much more predictable. Since we run our Voldemort Read-Only clusters with 540 partitions each, the BnP jobs now always generate 540 reduce tasks per job, no matter the size of the input data or the replication factor of the store.

In the future, we may want to selectively increase parallelism on some of the biggest BnP jobs by having them disable the Reducer per Bucket feature, which would slightly affect this metric, but so far this has not been necessary.

Total Hadoop Task Time per Job

This metric is the sum of the run time of all map and reduce tasks, averaged per job. The number may seem high, but that's because BnP jobs typically spawn hundreds or even thousands of parallel tasks. As you can see, there's a lot going on during the 5 minutes of wall-clock time each job takes on average!

  • Average of the 30 days before: 103 hours / job
  • Average of the 30 days after: 38 hours / job
  • Delta: 63.2% reduction

This is a very significant saving, demonstrating that BnP now requires far fewer resources!
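
To put these numbers in perspective: dividing roughly 103 hours of task time by 5 and a half minutes of wall-clock time suggests that, before the change, an average BnP job had on the order of 1,100 tasks in flight at any given moment, and roughly 450 after the change (back-of-the-envelope figures that ignore scheduling and queuing overhead).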

Keep in mind that this number is based on the LinkedIn workload, where most stores are configured with a replication factor of 2, and a few stores (which happen to be among our biggest, however) with a replication factor of 3. Another company running Voldemort Read-Only with a different breakdown of store sizes and replication factors would likely see a different amount of savings by upgrading to the latest version. Please do share your numbers if you try it out!

Number of Jobs per Day

This metric is the number of production BnP jobs run each day. Since the main driver of this change is the re-configuration of our partner teams' jobs, and not the roll out of the new code, the average "after" number presented below starts counting after the end of the re-configuration effort, rather than after the roll out.

  • Average of the 30 days before: 381 jobs / day
  • Average of the 15 days after re-configuration: 261 jobs / day
  • Delta: 31.4% reduction

We can see a few notable events on this graph:

  1. The dips during the beginning of the blue section are caused by the roll out process itself, because Voldemort servers currently do not support incoming Build and Push jobs while doing a rolling restart.
  2. We can see that the number of jobs right after the end of the roll out process is similar to what it was just before.
  3. Then, roughly two weeks later, our partner teams start modifying their jobs (which causes some disruption for several days) in order to remove duplicate jobs and support the new data center.
  4. Finally, roughly another week later, once the partner teams are done re-configuring, we can see that the number of jobs per day stabilizes at a lower number.

Since all of the graphs presented here have their time axes aligned, you can compare this graph with the next one to see the direct impact that the number of daily jobs has on total Hadoop resources used per day.

Final Numbers

With the changes described above, which greatly improved the efficiency of individual BnP jobs, as well as the reduction in the number of BnP jobs run per day, the final tally for the amount of Hadoop task time used by Voldemort at LinkedIn each day is:

  • Average of the 30 days before: 4.48 years / day
  • Average of the 15 days after re-configuration: 1.15 years / day
  • Delta: 74.4% reduction

Yes, that's not a mistake. The number is in years, because that's what tens of thousands of hours of compute time per day add up to!
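
As a rough sanity check, these figures follow directly from the per-job and per-day averages above (the small differences are due to rounding):

  • Before: 381 jobs/day × 103 hours/job ≈ 39,000 hours/day ≈ 4.5 years/day
  • After: 261 jobs/day × 38 hours/job ≈ 9,900 hours/day ≈ 1.1 years/day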

So there you have it: a near 75% reduction in Hadoop task slot time usage!

Closing Remarks

I'd like to thank Mark Wagner from the Hadoop team for helping me run the analysis that crunched the numbers presented here.

If I could go back and do something differently, it would be to separately measure the impact of the Reducer per Bucket default configuration change and that of the new Build Primary Replicas Only feature. Lumping both changes together makes it impossible to evaluate the merits of each.

That being said, I'm really glad that amidst all of the work required to build out a new data center, we were able to clearly see what's right for the company and to shed technical debt rather than pile on even more of it.

At LinkedIn's scale, optimizations can yield massive impacts. If that kind of work seems interesting to you, feel free to reach out to me! There might be an opening for you somewhere at LinkedIn!

Finally, if you are already running Voldemort Read-Only, we greatly encourage you to upgrade to the latest and greatest! And if you have any questions, feel free to ask them on the mailing list (:

Arunachalam Thirupathi

Software Engineer at Google

9y

Great work and nice post with beautiful graphs! Love it! We should probably post this to our open source forum as well :)

Richard S.

Talent @ Apple

9y

Great stuff Felix! Thanks for taking the time out to write these blogs and write them in such a way that many different audiences can understand.

Anthony Hsu

Software engineer

9y

This is really really awesome! I wonder if Dr. Elephant is now marking Voldemort BnP jobs as green after these changes?
