Scaling the Hotstar Platform for 50M
Akash Saxena
CTO Jiocinema | CTO Excellence Award 2024 | ex-CTO Hotstar[Asia|MENA|SEA] | ex-OpenTable
TL;DR: Hotstar is the home of Indian cricket and of scale. It's not rocket science, but we did use some rocket metaphors! While we covered first principles earlier, this time around we talk about rockets and going even beyond. What does it take to be ready for 50M?
Ridiculous Speed
What’s better than 10M cricket fans? 25M, of course! Handling this level of scale was not a given, and we had to ideate on newer techniques to get the platform to deliver even more concurrency.
The platform could comfortably hum at 15M and change; however, we knew we’d need to go back to the drawing board if we wanted to go beyond the 20s.
One of our key insights from 2018 was that auto-scaling would not work, which meant we maintained static “ladders” that we stepped up or down, based on the amount of “headroom” left. While this was effective and safe, it required human oversight and was costly.
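To make the ladder idea concrete, here is a minimal sketch. The thresholds, fleet sizes and headroom fraction are hypothetical illustrations, not our production numbers or tooling:

```python
# Hypothetical sketch of a static scaling "ladder": pre-computed capacity steps
# chosen by how much headroom remains, rather than by reactive auto-scaling.
# All thresholds and instance counts are illustrative only.

LADDER = [
    # (concurrency ceiling this rung is rated for, instances to run)
    (5_000_000, 400),
    (10_000_000, 900),
    (15_000_000, 1500),
    (20_000_000, 2200),
]

HEADROOM_FRACTION = 0.25  # step up while we still have ~25% buffer left


def pick_step(current_concurrency: int) -> int:
    """Return the instance count for the smallest ladder rung that still
    leaves the desired headroom above current concurrency."""
    for ceiling, instances in LADDER:
        if current_concurrency <= ceiling * (1 - HEADROOM_FRACTION):
            return instances
    # Beyond the top rung: hold at maximum and page a human.
    return LADDER[-1][1]


if __name__ == "__main__":
    print(pick_step(8_200_000))  # -> 1500; the 10M rung no longer leaves 25% headroom
```

The point of the ladder is predictability: every step is pre-tested, so stepping up is safe, but someone still has to watch the headroom and pull the lever.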
While 10x engineers might be all the rage, our engineers built 10x infrastructure!
10x Infrastructure
We believed that using containers would allow us to fine-tune our compute needs. We were not aware of anyone who had orchestrated containers at Hotstar scale, and our team took an audacious bet to run 2019 on Kubernetes (K8s). This meant abandoning the tried-and-tested scale infrastructure of 2018. Here be dragons!
Tuning K8s at scale and porting our key services over was a massive task and a really high peak to climb! Through several test runs in the run-up to the big events, gradually gaining confidence, failing, learning and stabilising, this became a big win for us in 2019. This is a whole blog in itself.
Once we were able to demonstrate in the lab, and in production, that we could run our scale loads on K8s, it became possible to think of building our own auto-scaling engine that took into account the multiple variables that mattered to our system. We ran auto-scaling in shadow till we finally had enough data to let it fly on its own, and so far it has done a stellar job. This also cut down human oversight during events significantly; we just let the automation take over.
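A rough sketch of what shadow mode looks like is below. The specific signals, per-replica capacities and numbers are assumptions for illustration, not our real engine:

```python
# Illustrative multi-signal scaler run in "shadow" mode: it computes what it
# would do and logs that next to the human-driven ladder's choice, without
# touching the cluster. Signals and constants are hypothetical examples.
import math
from dataclasses import dataclass


@dataclass
class Signals:
    concurrency: int          # current concurrent viewers
    requests_per_sec: float   # edge request rate hitting this service
    cpu_utilisation: float    # fleet-wide average, 0.0 - 1.0


def desired_replicas(s: Signals, current: int,
                     per_replica_concurrency: int = 20_000,
                     per_replica_rps: float = 600.0,
                     target_cpu: float = 0.60) -> int:
    """Pick the most conservative (largest) answer across the signals we trust."""
    by_concurrency = math.ceil(s.concurrency / per_replica_concurrency)
    by_rps = math.ceil(s.requests_per_sec / per_replica_rps)
    by_cpu = max(1, round(current * s.cpu_utilisation / target_cpu))
    return max(by_concurrency, by_rps, by_cpu)


def shadow_tick(s: Signals, current: int, ladder_choice: int) -> None:
    """Shadow mode: log what the engine proposes next to the ladder's decision."""
    proposal = desired_replicas(s, current)
    print(f"shadow: engine={proposal} ladder={ladder_choice} delta={proposal - ladder_choice}")


if __name__ == "__main__":
    shadow_tick(Signals(concurrency=12_000_000,
                        requests_per_sec=350_000.0,
                        cpu_utilisation=0.72),
                current=600, ladder_choice=700)
    # -> shadow: engine=720 ladder=700 delta=20
```

Running in shadow first builds the evidence: only when the engine's proposals consistently track (or beat) the human-driven ladder does it get to act for real.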
We supported 2x more concurrency in 2019 with 10x less compute overall.
The move to K8s helped us land the goal of doing more with less, and also contributed to simplifying infrastructure operations during the big event.
Cautionary note → this was a 6–8 month journey that had its roots in 2–3 months of ideation before we undertook it. This section might make it sound easy; it isn’t.
Jettison
The story of Apollo 13 is an incredible tale.
It’s a textbook on leadership in the face of adversity and a lesson in holding on to what’s important. As we ramped up towards the end of the ICC World Cup 2019, the possibility of a very high concurrency game was looming. Breaching the 30M barrier was not unthinkable.
This time we threw an Apollo 13 sized problem at the team: what would it take to go to 50M? It was a forcing function for the team to focus only on what was needed. We reviewed all our service interactions, what their rated SLAs were, what their rated peak limits were, and finally, what level of traffic would be the tsunami that took each service down.
We converged on a sequence of which services to jettison (turn off) as the concurrency marched upwards, so that customers could keep coming in and the video would keep playing.
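Conceptually, the jettison plan is just an ordered list of non-essential features and the concurrency at which each one goes. The sketch below is a hypothetical example; the feature names and thresholds are made up for illustration:

```python
# Hypothetical "jettison" sequence: as concurrency climbs, non-essential
# features are switched off in a pre-agreed order so that sign-in and video
# playback keep working. Flags and thresholds are illustrative only.

JETTISON_SEQUENCE = [
    # (concurrency at which we turn it off, feature flag)
    (25_000_000, "personalised_recommendations"),
    (30_000_000, "social_feed"),
    (35_000_000, "live_score_widgets"),
    (45_000_000, "watchlist_sync"),
    # playback and auth are never on this list: the video must play on
]


def features_to_disable(concurrency: int) -> list[str]:
    """Everything whose threshold has been crossed gets jettisoned."""
    return [flag for threshold, flag in JETTISON_SEQUENCE if concurrency >= threshold]


if __name__ == "__main__":
    print(features_to_disable(32_000_000))
    # -> ['personalised_recommendations', 'social_feed']
```

The hard work is not the lookup; it is agreeing, ahead of time, which features are expendable and in what order.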
This exercise was akin to the sequence in Apollo 13 where ground control worked within very hard constraints and brought the astronauts home; our hard constraint was 50M. While our concurrency journey can never match up to that amazing study in crisis management and leadership, we can all learn a lesson about focusing on what’s important.
Nobody cares for your surround features if they cannot consume your primary feature. The video must play on.
It’s important to know what to turn off, and when to turn it off, without impacting your customers, while still satisfying your goal of letting cricket fans watch the event uninterrupted.
Performance Is Bespoke
A complement to the previous principle about jettisoning stuff is the fact that your system is unique, and it will require a unique solution.
Knowing your data patterns and tweaking the system to match observed behaviour becomes important. We spent a lot of time reviewing our end-to-end request chain and optimised each segment of delivery. At each step we found processing layers that were built to tackle generic traffic and therefore performed completely unnecessary work, causing many second-order effects in the system.
The most notable of these was a system spending more than 70% of its compute on a feature we weren’t even using. Adding more capacity would have been the obvious fix; instead, by tuning the system to match what we knew of our traffic and data patterns, we freed up massive headroom. This little nugget stayed buried until a particularly harsh traffic simulation brought down the test system and exposed how a hot-spot triggered a domino-effect cascading failure.
It’s Not The CDN (alone)
Every now and then we hear comments about how Hotstar’s scaling success is due to the CDN. The simple fact is that there is no silver bullet; nothing absolves you from deeply understanding your system and then leveraging every bullet at your disposal. A CDN is one of many such bullets.
A CDN is an essential element in your scaling toolbox. However, you have to architect your CDN to sit in front of your origin and act as an amazing shock absorber. This takes serious tuning of your infrastructure, a deep understanding of your traffic patterns and cacheability, and ultimately finding the limits of your CDN.
Much like your services, a CDN has breaking points, and understanding them is key if you want to scale beyond what you’re able to do today.
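Making the CDN an effective shock absorber starts with the origin telling it, explicitly, what is safe to cache and for how long. A minimal sketch follows; the paths, TTLs and rules are assumptions for illustration, not our actual configuration:

```python
# Sketch of cache-friendly origin responses so the CDN can absorb traffic.
# The path prefixes and TTL values below are hypothetical examples.

CACHE_RULES = [
    # (path prefix, seconds the CDN may serve it, allow serving stale while refreshing)
    ("/static/",   86400, True),   # versioned assets: cache aggressively
    ("/playback/", 6,     True),   # playback metadata: short TTL, but still shared
    ("/score/",    2,     True),   # live score: even a 2s TTL collapses huge fan-in
    ("/me/",       0,     False),  # per-user data: never cache in a shared cache
]


def cache_headers(path: str) -> dict[str, str]:
    """Return the Cache-Control header the origin should attach to this path."""
    for prefix, ttl, allow_stale in CACHE_RULES:
        if path.startswith(prefix):
            if ttl == 0:
                return {"Cache-Control": "private, no-store"}
            value = f"public, s-maxage={ttl}, max-age={ttl}"
            if allow_stale:
                value += f", stale-while-revalidate={ttl}"
            return {"Cache-Control": value}
    return {"Cache-Control": "no-cache"}


if __name__ == "__main__":
    print(cache_headers("/score/live"))
    # -> {'Cache-Control': 'public, s-maxage=2, max-age=2, stale-while-revalidate=2'}
```

Even tiny TTLs matter at this scale: if millions of clients poll the same object, a two-second shared cache turns millions of requests per interval into one origin hit.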
TTL Cloudbursts
While we’re talking about caching and CDNs, the topic of understanding TTLs (Time To Live) cannot be far behind. The biggest snafus can occur if you don’t have a clear line of sight into every cache TTL through the chain.
Especially if you have multi-level caches, you want to ensure there is clarity on what each object’s TTL is throughout the request chain. Apart from very hard-to-find data inconsistency issues, you also open the door to making your origin vulnerable.
Imagine a request cloudburst at your origin triggered by a large number of objects expiring at around the same time. Think of customer-facing timers, authentication tokens and so on. These seemingly harmless decisions can have serious consequences.
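One common mitigation for synchronized expiry is to add jitter to TTLs so a cohort of objects written at the same moment does not expire at the same moment. A minimal sketch, with an assumed base TTL and jitter fraction:

```python
# Sketch of TTL jitter: spread expiries out so objects cached together do not
# all expire together and stampede the origin. Values are illustrative.
import random


def jittered_ttl(base_seconds: int, jitter_fraction: float = 0.2) -> int:
    """Return the base TTL perturbed by +/- jitter_fraction of its value."""
    spread = int(base_seconds * jitter_fraction)
    return base_seconds + random.randint(-spread, spread)


if __name__ == "__main__":
    # A million tokens cached with ttl=300 would all expire together;
    # with 20% jitter they expire across roughly a two-minute window instead.
    print(sorted(jittered_ttl(300) for _ in range(5)))
```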
While we’ve talked about this in Part 1, it’s important enough to bring up again.
Preparing To Fail
As we continue our scale journey and prepare to take on the concurrency demands of the future, our core principle remains preparing to fail. Failure engineering is at the core of everything we set out to do at Hotstar. Everything has a redundancy, and that redundancy has a redundancy. Even then, there are failures.
We don’t believe in surprises; we prepare for the worst-case scenarios and run drills around them so that we are not caught off guard if failure comes to pass.
Learning We Are
We’re not the Yoda of scaling. We don’t have all the answers; nobody does. Fail, Learn, Adapt, Rinse, Repeat. We take inspiration from solution patterns around us, from Formula 1, from space missions, and this is what makes engineering so beautifully unpredictable and creative.
If you’re an Indian cricket fan, Hotstar is your home. If 25M people can watch a game when our women and men don blue, and we fans want to follow along in the glory, the platform will be ready to make it possible for millions more to watch it glitch-free.
That, remains our mission.
We’re constantly looking for the brightest, bravest and most creative engineers to join our amazing team. Check out job postings and more at tech.hotstar.com.