Scaling Is Not An Accident
Akash Saxena
TL;DR: The entire Hotstar team has spent the last 6+ months getting ready for our marquee event, the IPL on Hotstar. While there are many scaling tales from every team, we are kicking off the series with a round-up of the things we did to tune our world-record-setting platform.
A few days before the IPL was to begin, as we were taking stock of our readiness, the question before us was whether we were ready for the big day. Had we done everything we could to be ready for the big event on our new platform? After months of late nights, strategising and incredible teamwork, we knew we had thrown everything and then some at the problem. There was a brief pause to reflect, and then the realisation that while we felt ready, real-life traffic patterns on a new platform can be very unpredictable. Cautiously optimistic was the mood of the hour.
The Tsunami Problem
Hotstar is the home of live cricket, and having run cricketing events at scale in the past, we had the fortune of real game-time traffic models to fall back on. Repeatedly, our older platform would come under duress when notifications went out as something interesting happened; the ensuing spike in traffic would overwhelm our back-end systems and occasionally disrupt video playback. The legacy back-end was unable to scale beyond a point given its architecture, which further limited us from accepting newer customers onto the platform.
We’d learnt to tackle tsunamis the hard way. The lessons were these:
There is no auto-scaling. Give yourself headroom.
Given the time it takes to auto-scale, it’s a useless strategy when you need to deal with spikes. Our strategy is to estimate the peak concurrency, then scale up ahead of the big event. Do you order food only as guests start arriving for your party? I hope you don’t. We don’t either. There is no auto-scaling. We use auto-scaling to ensure the right number of servers always exists in a pool, but we’re always scaled up for the peak. When we breach the headroom thresholds that we define after scale testing, we scale up in anticipation.
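To make that concrete, here is a minimal sketch of headroom-driven scaling. The concurrency numbers, per-server capacity and headroom figure are illustrative stand-ins, not our actual values.

```python
# A rough sketch of headroom-driven pre-scaling. All numbers and names here
# (PEAK_CONCURRENCY, PER_SERVER_CONCURRENCY, HEADROOM) are illustrative.

import math

PEAK_CONCURRENCY = 10_000_000     # pessimistic estimate for the biggest match
PER_SERVER_CONCURRENCY = 4_000    # measured on a tuned instance during scale tests
HEADROOM = 0.30                   # always keep 30% spare capacity

def fleet_size(expected_concurrency: int) -> int:
    """Servers needed to carry expected_concurrency while preserving headroom."""
    return math.ceil(expected_concurrency / (PER_SERVER_CONCURRENCY * (1 - HEADROOM)))

def headroom_breached(current_concurrency: int, current_fleet: int) -> bool:
    """True when usage eats into the reserved headroom; time to add the next step."""
    usable = current_fleet * PER_SERVER_CONCURRENCY * (1 - HEADROOM)
    return current_concurrency >= usable

# Scale to the estimated peak well before the event starts...
desired_fleet = fleet_size(PEAK_CONCURRENCY)
# ...and during the event, a headroom breach triggers a further scale-up in anticipation.
```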
Improve single server utilisation
Your cloud provider also has physical limits to how much you can scale. Work with them closely and make the right projections ahead of time. Even then, nothing can save you if you are inefficient per server. This calls for rabid tuning of all your system parameters. Moving from development to production requires knowing what hardware your code will run on and tweaking it to suit that system. Be lean on a single server and you get more room to scale horizontally. Review all your configurations with a fine-tooth comb; it’ll save you blushes in production. Each system must be tuned specifically to the traffic pattern and the hardware you choose to run it on.
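As a flavour of what that configuration review can look like, here is a toy script that checks a few Linux settings which commonly matter under heavy connection churn. The paths and target values are assumptions, not our production tuning.

```python
# Toy configuration review: compare a few kernel settings that commonly matter
# under heavy connection churn against the values we expect after tuning.
# The paths and target values below are assumptions, not production settings.

EXPECTED_MINIMUMS = {
    "/proc/sys/net/core/somaxconn": 65535,            # listen backlog
    "/proc/sys/net/ipv4/tcp_max_syn_backlog": 65535,  # half-open connection queue
    "/proc/sys/fs/file-max": 2_000_000,               # system-wide file descriptors
}

def review_settings():
    for path, minimum in EXPECTED_MINIMUMS.items():
        try:
            with open(path) as f:
                actual = int(f.read().split()[0])
        except (OSError, ValueError):
            print(f"SKIP {path} (not readable on this host)")
            continue
        verdict = "OK " if actual >= minimum else "LOW"
        print(f"{verdict} {path} = {actual} (want >= {minimum})")

if __name__ == "__main__":
    review_settings()
```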
No dumb clients
At Indian cricket scale, we cannot afford clients that rely completely on the server systems to make decisions. Tsunamis can overwhelm the back-end. Retries will make the problem worse. Clients must be smart about inferring when things don’t look right and add “jitter” to the requests they make to the servers. Caching, exponential back-offs and panic protocols all come together to ensure a seamless customer experience.
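A minimal sketch of that client behaviour, assuming a generic HTTP fetch; the retry limits and delays are placeholders.

```python
# Minimal sketch of a "smart" client: on failure, back off exponentially with
# full jitter so millions of devices don't retry in lockstep, and eventually
# give up and fall back to cached data. URL, timeouts and limits are placeholders.

import random
import time
import urllib.request

def fetch_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                return resp.read()
        except Exception:
            # Sleep a random amount up to the exponential cap ("full jitter").
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    return None  # degrade gracefully; the UI can keep showing the cached state
```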
The Social Problem
For all the wealth of information we had in our traffic models, the big monkey wrench this year was the addition of social interaction to the platform. The “Watch And Play” game is a prediction-cum-trivia game. Given our concurrency, and the nature of requesting answers within a time band, the concurrency on the end-point that accepts answers is pretty crazy. Here, we used the golden rule of performance tuning: Don’t Do Anything In Real Time That You Don’t Need To Do In Real Time.
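A sketch of what that rule means for a hypothetical answers end-point: acknowledge immediately, push the answer onto a queue, and score it later in bulk. The names and the in-process queue are illustrative; in practice this would be a durable message bus.

```python
# Sketch of deferring work off the hot path for a hypothetical answers endpoint:
# validate minimally, enqueue, acknowledge. Scoring happens later, in bulk.
# The queue and handler names are illustrative.

import json
import queue
import threading

answer_queue: "queue.Queue[dict]" = queue.Queue()

def handle_answer(raw_body: bytes) -> dict:
    """Hot path: no scoring, no database writes, just accept and acknowledge."""
    answer_queue.put(json.loads(raw_body))
    return {"status": "accepted"}

def score_answers_forever():
    """Cold path: drain the queue and score answers in batches, off the request path."""
    while True:
        answer = answer_queue.get()
        # ... persist / score via a batch pipeline here ...
        answer_queue.task_done()

threading.Thread(target=score_answers_forever, daemon=True).start()
```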
Also, given the game, our concurrency patterns show a marked difference in stickiness: as opposed to historical traffic models, where we saw traffic peak and fall, we now see customers spend more time on the platform once they are on. This directly impacts the infrastructure needed to keep simultaneous streams running.
The Three Pillars
Our platform has three core pillars: the subscription engine, the meta-data engine and our streaming infrastructure. Each of these has unique scale needs and was tuned separately. We built pessimistic traffic models for each, and based on these we came up with ladders that control the size of each server farm depending on the estimated concurrency. Knowing what your key pillars are and what kind of patterns they are going to experience is pivotal when it comes to tuning. One size does not fit all.
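A toy version of such a ladder, with invented bands and server counts, looks like this:

```python
# Illustrative "ladder": pessimistic concurrency bands mapped to fleet sizes per
# pillar. The bands and server counts are invented; the point is that each pillar
# scales on its own ladder rather than off one global knob.

import bisect

LADDERS = {
    # (concurrency threshold, servers) pairs, thresholds sorted ascending
    "subscription": [(1_000_000, 40), (5_000_000, 160), (10_000_000, 300)],
    "metadata":     [(1_000_000, 20), (5_000_000, 90),  (10_000_000, 180)],
    "streaming":    [(1_000_000, 80), (5_000_000, 350), (10_000_000, 700)],
}

def servers_for(pillar: str, estimated_concurrency: int) -> int:
    ladder = LADDERS[pillar]
    thresholds = [threshold for threshold, _ in ladder]
    rung = min(bisect.bisect_left(thresholds, estimated_concurrency), len(ladder) - 1)
    return ladder[rung][1]

# servers_for("streaming", 7_500_000) -> 700
```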
Once Only
Scaling effectively at this level means driving as much traffic as possible away from the origin servers. Depending on your business patterns, caching strategies at the serving layer and smart TTL controls on the client end can give your server systems breathing room.
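A small sketch of how that can look at the serving layer, with illustrative TTLs:

```python
# Sketch of the "once only" idea: responses that aren't personalised carry
# explicit cache headers so the CDN and clients answer repeats instead of the
# origin. The TTL values are illustrative.

SCOREBOARD_TTL = 5   # seconds; even near-live data can be cached briefly
METADATA_TTL = 300   # slower-moving content can live longer at the edge

def cache_headers(ttl_seconds: int) -> dict:
    return {
        # s-maxage lets the CDN hold the object independently of the client TTL
        "Cache-Control": f"public, max-age={ttl_seconds}, s-maxage={ttl_seconds}",
    }

# One origin render of the scoreboard every SCOREBOARD_TTL seconds can serve
# millions of concurrent viewers via the cache.
scoreboard_headers = cache_headers(SCOREBOARD_TTL)
```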
Reject Early
Security is a key tenet, and we leverage this layer to also turn away traffic that doesn’t need to come to us at the top of the funnel. Using a combination of whitelisting and industry best practices, we drive away a lot of spurious traffic up front.
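As an illustration (not our actual edge rules), rejecting early can be as cheap as an allowlist check before any real work happens:

```python
# Sketch of rejecting spurious traffic at the top of the funnel: a cheap
# allowlist check before any expensive work is done. The client identifiers and
# path prefixes are illustrative.

ALLOWED_CLIENTS = {"android", "ios", "web", "tv"}
ALLOWED_PREFIXES = ("/api/", "/play/", "/subscribe/")

def should_reject(path: str, client_id: str) -> bool:
    if client_id not in ALLOWED_CLIENTS:
        return True   # unknown callers never reach the origin
    if not path.startswith(ALLOWED_PREFIXES):
        return True   # scanners probing random paths get dropped early
    return False
```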
The Telescope
Like any other subscription platform, we’re ultimately beholden to the processing rates that our payment partners provide us. During a peak, this might mean adding jitter to our funnel so that customers follow through at an acceptable rate, enabling a higher success rate overall. Again, these funnels, or telescopes, are designed keeping in mind the traffic patterns your platform will experience. Often these decisions involve customer-experience changes to be gentler on the infrastructure.
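One way to sketch such a telescope is a token bucket sized to the partner’s processing rate; the rate and burst numbers below are illustrative:

```python
# Sketch of a "telescope": admit payment attempts no faster than the partner can
# process them, and ask everyone else to wait a moment. Rate and burst numbers
# are illustrative.

import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_admit(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # forward this attempt to the payment partner
        return False      # show a gentle "hold on" screen and retry shortly

payment_gate = TokenBucket(rate_per_sec=200, burst=50)
```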
The Latency Problem
As the leading OTT player in India, we’ve been steadily improving our streaming infrastructure. The motto is simple: leaner on the wire, faster than broadcast. As simple as this sounds, it’s one of the most complex things to get right. Through the year we have brought our latency down from roughly 55s behind broadcast to approximately 15–20s behind broadcast, and only a couple of seconds behind on our redone web platform.
This was the result of meticulous measurement of how much time each segment of our encoding workflow took, followed by tweaking operations and encoder settings to do better. We did this by profiling the workflow and instrumenting each segment. This is another classical tenet: tuning cannot happen without instrumentation.
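A minimal sketch of that kind of per-segment instrumentation, with hypothetical stage names:

```python
# Sketch of per-segment instrumentation: time each stage of a pipeline so you
# know where the latency actually goes. The segment names and the hypothetical
# encode/package stages are placeholders.

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(segment: str):
    start = time.monotonic()
    try:
        yield
    finally:
        timings[segment] = time.monotonic() - start

with timed("ingest"):
    pass   # receive the broadcast feed
with timed("encode"):
    pass   # transcode into the bitrate ladder
with timed("package"):
    pass   # segment and publish manifests

print({segment: round(seconds, 3) for segment, seconds in timings.items()})
```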
We continue to tweak bit-rate settings to provide an uncompromised experience to our customers while being efficient in bandwidth consumption for Indian conditions.
Lower latencies and smarter use of player controls to provide a smooth viewing experience also help with smoother traffic patterns: fewer customers repeat the funnel, which would otherwise ripple through the whole system with retries and the consequent additional events that pass through it.
Server Morghulis (aka Client Resiliency)
The Hotstar client applications have been downloaded several hundred million times so far. Suffice to say that when game time comes, millions of folks are using Hotstar as their primary screen. Dealing with such high concurrency means we cannot rely on a classical tight coupling between the client and the back-end infrastructure.
We build our client applications to be resilient and to degrade gracefully. While we maintain a very high degree of availability, we also prepare for the worst by reviewing all the client-server interactions and either indicating gracefully that the servers are experiencing high load, or flipping one of a variety of panic switches in the infrastructure. These switches tell our client applications to ease off momentarily, using either an exponential back-off or sometimes a custom back-off depending on the interaction, so as to build jitter into the system and give the back-end infrastructure time to heal.
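A sketch of what such a panic switch might carry, with illustrative field names and values:

```python
# Sketch of a "panic switch": a small config document that clients poll, telling
# them which features to ease off on and how hard to back off. Field names and
# values are illustrative.

import random

panic_config = {
    "disable_features": ["watch_n_play", "recommendations"],
    "backoff_seconds": 30,   # base interval clients should wait before retrying
    "jitter_seconds": 15,    # spread so retries don't resynchronise into a spike
}

def next_retry_delay(config: dict) -> float:
    return config["backoff_seconds"] + random.uniform(0, config["jitter_seconds"])

def feature_enabled(config: dict, feature: str) -> bool:
    return feature not in config.get("disable_features", [])
```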
While the application has many capabilities, our primary function is to render video to our customers reliably. If things don’t look completely in control, specific functionality can degrade gracefully and keep the primary function unaffected.
Ensure that the primary function always works and build resiliency around server touch-points. Not every failure is fatal, and intelligent caching with the right TTLs can buy a lot of precious headroom. This is an important tenet.
Planning to Fail
Years ago, I had the opportunity to visit the Boeing factory in Phoenix, AZ. This particular facility was manufacturing Apache helicopters, and it was quite a treat. As the lead engineer described how these fearsome war machines were put together, he kept emphasising testing. Every other line was about how they tested these machines relentlessly before they took flight.
No matter how good your design, no matter how good your architects and team are, the proof is in the flight. The team spent close to two months arriving first at traffic models, then figuring out which product could help replicate the massive traffic the platform would inevitably see. We had to see this baby fly before it took on passengers. No souls lost!
Once we had figured out a way to scale-test our systems (we used Flood.io on top of Gatling), we held a series of “game days”: simulations of a real IPL event before it actually happened. This helped our teams get used to how the event would play out, and the simulated tsunamis and ensuing mayhem helped us tune our configurations as well as grow our operational muscle, so that teams knew what to do in a crisis ahead of time, rather than figuring it out when a true crisis arose. We discovered some interesting behaviours at scale around SSL handshake errors, keep-alives and general impedance mismatches across AZs and regions that led to pressure in the system.
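This isn’t the Flood.io/Gatling setup itself, but a toy sketch of the tsunami-shaped profile a game day tries to reproduce; all numbers are made up.

```python
# Not the Flood.io/Gatling setup itself, just a toy sketch of the "tsunami"
# shape a game-day tries to reproduce: steady traffic with a sudden spike after
# a notification. All numbers are made up.

def tsunami_profile(baseline_rps, spike_multiplier, duration_s, spike_at_s, spike_len_s):
    """Yield (second, target_rps) pairs for a load generator to follow."""
    for second in range(duration_s):
        in_spike = spike_at_s <= second < spike_at_s + spike_len_s
        yield second, baseline_rps * (spike_multiplier if in_spike else 1)

# e.g. 60 seconds at 5,000 rps with a 10x spike starting at t=20s, lasting 15s
for second, target_rps in tsunami_profile(5_000, 10, 60, 20, 15):
    pass  # hand target_rps to the load generator for this second
```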
In addition to simulated load conditions, we also spent some time on chaos testing, which simulated a variety of downtime scenarios to see how the system would react and heal.
Telemetry
Staying with the airplane metaphor, a key part of aviation is having the right telemetry so you can make decisions that ensure a safe flight. We take the same approach to operations on our platform. While we had excellent hygiene around telemetry for each service, it took some doing to agree on which metrics mattered from a decision-making perspective and to layer them all into a single view that could quickly show how the system was behaving. This meant working not only with our service owners, but also with key partners to ensure we had visibility into their systems as well.
Scale Journey
The scale journey is never over; there are always new mountains to climb as the platform hums along at the last peak it scaled. As always, doing the simple things right makes a huge difference. Simplicity in design helps. Sweating the small configurations helps. Relentless testing helps. Focusing on operational excellence helps.
If you’re an Indian cricket fan, Hotstar is your home. If 6.1M people can watch an IPL game today, then when our women and men don blue and we fans want to follow along in the glory, the platform will be ready to make it possible for millions more to watch, glitch-free. That is our mission.