Horizontal Scaling for Media Encoding

When building out cloud workloads, it is critical to think about your #CloudArchitecture. How you scale, how you install software, and how you will control costs all deserve careful thought up front.

I am going to share the story of moving part of our #media encoding pipeline from on-premises infrastructure to a cloud implementation. I will talk about #LegacySoftware and how we made it run on cloud infrastructure, a potentially scary lesson in cloud economics, and finally how we brought our costs under control.

Background

In #MediaAndEntertainment, large-scale video encoding is a challenging problem to solve. #Video supply chain systems do not face millions of users hitting them directly. Our challenge is encoding tens of thousands of video files from source content that can reach half a terabyte per file. The problem is not millisecond UI response times but handling petabytes of storage and moving those petabytes around a network.

During a period of rapid expansion, we had reached the limits of our on-premises infrastructure. We simply could not continue to encode the volume of video in a timely fashion. We could rack more servers, but our storage system and network infrastructure could not handle the load.

We started down the path of moving the encoders to #AWS. It seemed like a logical solution and, at the outset, an easy transition: simply install the encoder software on an EC2 instance and magic would happen. Sadly, this was not the case.

Encoder Woes

Our primary highest-quality encoder still uses dongles for licensing. For those of you who are younger, this technology dates back to the anti-piracy efforts of the early '90s. You buy the software and the vendor gives you a piece of hardware that you plug into the machine; the software looks for this dongle and will not operate without it. Yes, in the 2010s we were dealing with hardware dongles, but at least they did not require an LPT port.

We analyzed the available cloud encoders, testing quite a few options, from cloud-vendor-supplied encoders to third-party cloud encoding services. None of them met our video quality standards.

We found a vendor that sold encoders for on-premises use but licensed them by IP address. Their quality is outstanding and they were open to using the encoders in a cloud solution, so we started our experiment with these encoders at AWS. As an aside, encoder vendors must suffer a very high rate of piracy, because these are some of the most licensing-intensive pieces of software we use.

EC2 Transition

We installed the controller software and encoders on appropriately sized EC2 instances. The software ran fine and we were able to perform some simple transcodes. Then we started solving the S3 problem.

Object storage presented us with two problems. First, our encoders were designed to read only from block storage, not S3. The larger problem was that the encoder system ran under Windows, so standard NFS mounts would not work. We decided the lowest-impact solution was to use EFS for our servers: all the encoders would use a common file share to get to the video files.

That was the first big hurdle down: making legacy Windows software run at AWS. The solution required one controller (think license master) and a few encoders. The encoders would then read and write through EFS to create the video files we needed.

Horizontal Scaling

The next problem to solve was horizontal scaling. Remember, the core problem we were trying to solve was running more encoders at the same time. The challenge here was the legacy software: it required an IP address to license each encoder. That meant manually spinning up an EC2 instance, installing the software, getting its IP address, and finally registering it with the controller.

Fortunately, my team has some brilliant engineers. We were able to determine the process that the controller used to register and deregister a worker and automate that. I don’t want to trivialize this step. The documentation was not helpful and it took getting down to watching network traffic to emulate the process.

We hooked up the EC2 lifecycle events to register a server when it was scaled out. Conversely, we were able to deregister an instance when it was scaled in. From an engineering standpoint, this was pretty exciting as it was really thinking outside the box to solve a problem.
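For readers curious what that wiring can look like, here is a minimal sketch: an Auto Scaling lifecycle hook delivers launch and terminate events to a Lambda function, which tells the controller about the encoder before letting the scaling action finish. The controller URL and its register/deregister payload are hypothetical stand-ins for the vendor protocol we reverse-engineered; only the AWS pieces (the lifecycle event fields and complete_lifecycle_action) are standard.

```python
import json
import urllib.request

import boto3

# Hypothetical controller endpoint -- the real registration protocol was
# reverse-engineered from the vendor's network traffic and is not shown here.
CONTROLLER_URL = "http://encoder-controller.internal:8080/workers"

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")


def handler(event, context):
    """Handle 'EC2 Instance-launch/terminate Lifecycle Action' events."""
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]
    launching = detail["LifecycleTransition"] == "autoscaling:EC2_INSTANCE_LAUNCHING"

    # The vendor licenses each encoder by IP, so look up the instance's private IP.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])
    private_ip = reservations["Reservations"][0]["Instances"][0]["PrivateIpAddress"]

    # Register on scale-out, deregister on scale-in (hypothetical HTTP call).
    payload = {"action": "register" if launching else "deregister", "ip": private_ip}
    request = urllib.request.Request(
        CONTROLLER_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

    # Let Auto Scaling finish the launch or terminate.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
    )
```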

That was the second hurdle down. Our encoders would now be scaled automatically based on the amount of work in the queue. As jobs increased the system would scale up, and as jobs decreased the system would scale down. This is elastic horizontal scaling: you pay only for what you are using. You do not buy bigger hardware; you buy more low-cost hardware.
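To make "the amount of work in the queue" concrete, here is one hedged sketch: publish pending jobs per running encoder as a custom CloudWatch metric and let a target-tracking scaling policy on the encoder Auto Scaling group chase a target value. The namespace, metric name, group name, and the queue_depth input are assumptions standing in for whatever your job scheduler actually exposes.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "encoder-fleet"  # hypothetical Auto Scaling group name


def publish_backlog(queue_depth: int) -> None:
    """Publish pending encode jobs per running encoder as a custom metric."""
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    running = max(len(group["Instances"]), 1)

    cloudwatch.put_metric_data(
        Namespace="EncodeFarm",
        MetricData=[{
            "MetricName": "JobsPerEncoder",
            "Value": queue_depth / running,
            "Unit": "Count",
        }],
    )
```

A target-tracking policy that holds JobsPerEncoder at, say, 2 will then add encoders as the backlog grows and remove them as it drains.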

Cloud Economics

When you want to try something new on a cloud platform, the cost and risk are minimal. I spend a lot of weekend time spinning something up at AWS just to play with it and see how it works. I don't have to buy any hardware and I don't have to wait. When I am done experimenting, I simply spin it down and stop paying for it.

The ability to rapidly prototype is one of the best features of cloud solutions. Spin it up, experiment, spin it down. However, this can also lull you into a false sense of cost comfort. That is exactly what happened to us at this point. Our cloud solution was costing very little to operate and when doing some quick math it seemed that our solution was going to be very cost-effective. 

What you can miss when doing this type of experimentation are all the little costs. At the prototype phase, they amount to rounding errors. At scale, those rounding errors can become material. In our case, it was going from a couple of encoders and 100 GB of video to hundreds of encoders and petabytes of video.

We were at a stable point in the project and it was time to do some real scale testing. We sent 1,000 jobs through the encoder. There were a few shakeout issues to deal with, but the system scaled to hundreds of encoders and started producing videos at an amazing rate. Our prototype was easily capable of 10 times the throughput of our on-premises encode farm.

After some refinement and testing, we pushed the solution in a limited capacity to our production environment. We effectively had a canary process running. Lower risk jobs would go to the new encoders and all others would go to the existing farm. Everything was looking great. 

The Bill Arrives

When the bill arrived I had a bit of sticker shock. While it did not break the budget, it was well over what I had estimated. My Windows servers, at about $3.50 per hour, were really adding up. The storage system was the real surprise, costing far more than I anticipated. At our current operational scale, we would have blown through our annual budget in months.
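A quick back-of-the-envelope shows how fast prototype rounding errors become real money. Only the roughly $3.50-per-hour Windows instance rate comes from the bill described above; the fleet size, runtime, storage volume, and storage rate below are illustrative assumptions, not our actual figures.

```python
# Illustrative only: fleet size, hours, storage volume, and the storage rate
# are assumptions; the ~$3.50/hour Windows rate matches the bill described above.
instance_rate = 3.50        # USD per instance-hour (Windows EC2)
encoders = 200              # hypothetical peak fleet size
hours_per_day = 12          # hypothetical average runtime per encoder

compute = instance_rate * encoders * hours_per_day * 30
print(f"Compute alone: ${compute:,.0f}/month")                 # $252,000/month

efs_rate = 0.30             # USD per GB-month, approximate EFS standard rate
storage_gb = 1_000_000      # roughly a petabyte sitting on the share
print(f"Storage alone: ${efs_rate * storage_gb:,.0f}/month")   # $300,000/month
```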

To solve the problem, we had to really break it down. That meant diving deep into our workload and understanding exactly what it required. We needed to radically lower costs without sacrificing throughput.

I was speaking to one encode vendor and their cloud strategy was to wait for companies to come back from the cloud. The vendor said companies were breaking the bank trying to encode video in the cloud. These words were haunting me at this point. Fortunately, our smaller test prevented that from happening to us.  

Our Solution

We took a multi-step approach. We moved workloads that could be done by cheaper processes off these instances. We moved to Spot instances for our encoders. Finally, we changed our file system. This can be boiled down to process optimization. 

Process optimization can be tricky: what is optimal in one situation may not be optimal in another. When you are dealing with sunk-cost capital, forking work out to multiple servers is not always the most cost-efficient way to solve a problem. However, when you move your costs to op-ex and pay by utilization, it often is. In our case, moving operations like muxing and text transformations to cheap, low-power compute instances took a pretty big workload off the encoders.

Spot Instances were the real money saver. Spot Instances are essentially spare compute capacity sitting around unutilized. At the time we built this, you would bid on an instance size, and if any were available you could have them very cheaply. Most of the time we saved around 50% by using Spot.

The downside of Spot is that instances can be taken away at any time. If you bid on an instance and get halfway through an encode, a higher bid can take your instance away and your job will be killed. You have to make sure your application is designed to handle this and can resume a canceled job without intervention. Amazon has done some great work here; Spot Blocks and Spot Fleets can help mitigate this problem.
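One common way to detect an interruption is the two-minute notice EC2 publishes through instance metadata. The sketch below polls that endpoint (IMDSv1 form for brevity); checkpoint() and requeue() are hypothetical stand-ins for however your encode controller actually resumes a killed job.

```python
import time
import urllib.error
import urllib.request

# Instance metadata path that reports a pending Spot interruption (404 otherwise).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def interruption_pending() -> bool:
    """Return True once AWS has scheduled this Spot instance for reclaim."""
    try:
        urllib.request.urlopen(SPOT_ACTION_URL, timeout=1)
        return True                      # 200 means a stop/terminate is coming
    except urllib.error.URLError:
        return False                     # 404 (or no metadata) means keep working


def watch(job):
    """Hypothetical worker loop: hand the job back to the queue before reclaim."""
    while job.running():
        if interruption_pending():
            job.checkpoint()             # flush partial output / note progress
            job.requeue()                # let another encoder pick the job up
            return
        time.sleep(5)
```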

Lastly, we had to change our file system. EFS is great for transitioning workloads and for putting a shared file system in front of object storage, but if you start moving petabytes of video content in and out of EFS it gets expensive very fast. We switched over to FSx, which is a much better fit for what I would classify as transient workloads. This saved us quite a bit of money as well.

Production

Having completed our phase 1 cost controls, we moved to production. We had an elastic, horizontally scaling, cost-effective solution in place. Our cost per video encode-hour (my main cost metric) had dropped below that of our on-premises solution.
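For clarity, the metric is simply total spend for a period divided by the encode-hours produced in that period; the figures in this sketch are made up.

```python
def cost_per_encode_hour(total_spend_usd: float, encode_hours: float) -> float:
    """Total platform spend divided by the encode-hours it produced."""
    return total_spend_usd / encode_hours


# Hypothetical month: $60,000 of spend producing 40,000 encode-hours
print(cost_per_encode_hour(60_000, 40_000))   # 1.5 USD per encode-hour
```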

We were able to encode substantially more in parallel. I don't want to give away specifics, but our cloud solution was easily capable of 50 times the throughput of our on-premises farm. Without this transition, our corporate growth strategy would have been substantially more challenging.

Where are we now

We have made more strides to reduce costs. Since going into production we have reduced our costs another 30%. This was done by optimizing our testing labs and further refining our EC2 utilization.

This year our goal is to reduce our costs by another 28% while increasing scale and decreasing delivery time.

Mike Benson is Vice President, Software Technology Solutions for Starz Entertainment, a leading global entertainment brand. Starz / Lionsgate is rapidly expanding to markets around the world to bring the highest-value content to new audiences. Mike drives the delivery of all assets to Starz/Lionsgate customers in a low-cost, high-volume supply chain.
