A Look at the Netflix Live Issues from the Love Is Blind Reunion
“What is wrong with TV?”
Normally when I get this question from my wife, my stomach goes into knots. This time, however, after taking a second and looking at the TV, I realized there were other people having a far worse night than I was. Netflix’s second live event, The Love Is Blind – Live Reunion, failed to be presented live.
Within half an hour, at least 2.6 million people had logged into Twitter to find out just how disappointed they were going to be. My wife sat on the couch, upset at first, then started riffing through memes to cope with the loss of the #loveisblind reunion. Was this a problem with Netflix’s capacity, a Content Delivery Network (#cdn) issue, or an issue with broadcasting a live event from #AWS?
Let’s look at it.
Was it an issue broadcasting a live event from AWS?
No, AWS is the infrastructure behind many of the live events broadcast today. One “prime” example (I don’t think that even counts as a pun) is the NFL on Amazon Prime. Plenty of other live TV is broadcast on AWS, but this example gives us a clear architecture straight from Amazon itself.
AWS has a set of services called Elemental, based on a software company it acquired, for live broadcast over the internet. You can host other broadcast solutions on AWS, but Elemental has features built in for AWS accounts. While Elemental encodes the video, the video is distributed by Amazon CloudFront, AWS’ Content Delivery Network, or CDN. CloudFront creates localized edge points for users to send requests to, whether they are watching from a mobile device, a TV, or a PC. These requests are then directed locally to the datastore where the video is stored, in this case an S3 bucket. Simply put, users connect to Amazon #cloudfront, which leads the session to the closest edge serving the datastore where the end product of the Elemental broadcast is placed as it is processed live. An average of 11.3 million people remember there is an NFL game on Thursday and tune in. AWS, much like the other major cloud solution providers, has more than enough scale to handle events like this.
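To make that flow concrete, here is a minimal sketch of the viewer’s side of such a pipeline: a player polling a live HLS manifest from a CDN edge and fetching whatever new segments the encoder has pushed to the origin. The URL, polling interval, and manifest layout are illustrative assumptions, not Netflix’s or Amazon’s actual endpoints.

```python
import time
import requests

# Hypothetical CDN edge URL for a live HLS playlist. Behind it, an encoder
# (e.g. Elemental) keeps appending new segments to the origin (e.g. an S3 bucket).
MANIFEST_URL = "https://example-edge.example.com/live/channel1/playlist.m3u8"


def fetch_manifest(url: str) -> list[str]:
    """Download the live playlist and return the segment URIs it lists."""
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()
    # In an HLS playlist, lines starting with '#' are tags; the remaining
    # non-empty lines are segment URIs relative to the manifest location.
    return [line for line in resp.text.splitlines() if line and not line.startswith("#")]


def play_live(poll_seconds: float = 4.0) -> None:
    """Poll the manifest and fetch only the segments we have not seen yet."""
    seen: set[str] = set()
    base = MANIFEST_URL.rsplit("/", 1)[0]
    while True:
        for segment in fetch_manifest(MANIFEST_URL):
            if segment in seen:
                continue
            seen.add(segment)
            media = requests.get(f"{base}/{segment}", timeout=10)
            media.raise_for_status()
            print(f"fetched {segment} ({len(media.content)} bytes)")
        # Live players re-poll roughly once per segment duration.
        time.sleep(poll_seconds)


if __name__ == "__main__":
    play_live()
```

The point of the architecture is that every viewer only ever talks to a nearby edge, and only the edges talk to the origin, which is why the pattern scales to millions of concurrent viewers.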
So if it wasn’t AWS, could it have been the CDN?
Netflix, despite being one of AWS’ biggest customers, decided in 2012 to build its own CDN, called Open Connect, rather than relying on #cloudfront. In 2014, Netflix started paying Comcast to stop throttling its traffic, a deal that not only fed the later Net Neutrality debates but also locked in Netflix’s investment in its own delivery network. That said, Netflix has built one of the most impressive in-house CDNs in the world, and it accounts for a little under 10% of the world’s global app traffic. The other five largest brands (#microsoft, #google, #amazon, #facebook, and #apple) each have extensive CDNs that could easily host an event like Love Is Blind. And a session on #netflix was established by my TV app; instead of a delivery failure, it looked like a caching issue.
So if it wasn’t AWS or the CDN, was it something with Netflix itself?
Netflix handles over 40 million concurrent users at any given moment, so why would a live event cause it any problems? The issue might lie in what makes Netflix work so well at scale: its micro-segmentation. Netflix uses its own API gateway, #Zuul, which it built to run on AWS. It allows Netflix to deal with millions of sessions across multiple types of devices: TVs, mobile phones, and web browsers. It is a reliable open-source gateway whose main weakness shows up in a scaling event with overloaded instances. When presented with such a scaling event, Zuul can start throttling and refusing connections, with preference given to older sessions. Normally this makes sense: why ruin a session for someone in the middle of a movie or show in favor of new requests? A new session could also be pointing at a bug or an invalid configuration property, for example loading a show that isn’t configured correctly and causing multiple bad sessions all at once (foreshadowing).
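Netflix’s real Zuul filters are written in Java and are far more sophisticated, but a toy sketch of the idea, shed load by refusing the newest sessions first, looks roughly like this. The capacity number and the session model here are invented purely for illustration.

```python
import time
from dataclasses import dataclass, field

# Invented capacity threshold, purely for illustration.
MAX_ACTIVE_SESSIONS = 1000


@dataclass
class Session:
    session_id: str
    started_at: float = field(default_factory=time.monotonic)


class ToyGateway:
    """Illustrative gateway: under overload, refuse new sessions rather than
    disturbing viewers who are already mid-stream (older sessions win)."""

    def __init__(self) -> None:
        self.active: dict[str, Session] = {}

    def handle_request(self, session_id: str) -> str:
        if session_id in self.active:
            return "OK"  # established session: always let it through

        if len(self.active) >= MAX_ACTIVE_SESSIONS:
            # Overloaded: shed load by rejecting the *new* session instead of
            # evicting someone already watching a show.
            return "429 Too Many Requests"

        self.active[session_id] = Session(session_id)
        return "OK"
```

In steady state that preference is exactly the right trade-off. The trouble with a live premiere is that nearly every request in the same minute is a brand-new session, so nearly everyone ends up on the wrong side of it.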
Of course, this is an extreme simplification, but it starts painting a picture of what could happen if, say, you tried to use this system while switching an extremely large number of sessions from one datastore (a holding screen) to another (a live broadcast) at the same time.
If you had an issue like that, new sessions would be throttled while the internal service tried to recover; in fact, internal retries would be disabled until the service recovered. That would leave people either stuck on a loading screen or bounced back to the holding screen if Zuul was moving the datastore over in a canary release. People would get frustrated and start new sessions, sending even more requests to be throttled or denied. While Zuul is built to handle millions of sessions across thousands of different assets, it is not built to deal with millions of requests to a single asset at the same time when there is an error with that asset. The issue could cause a cascading event. This is just an educated guess at what could have happened, but these are the kinds of issues you can see when deploying a new application like #netflixlive in the wild.
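As a back-of-the-envelope illustration (every number here is made up), here is a toy model of that cascade: viewers who get throttled come straight back and retry, so the offered load keeps climbing even though the number of genuinely new viewers stays flat.

```python
def simulate_retry_storm(rounds: int = 10,
                         new_viewers_per_round: int = 1_000_000,
                         capacity: int = 800_000,
                         retry_fraction: float = 0.9) -> None:
    """Toy model of a cascading overload: every viewer the gateway throttles
    this round comes back and retries next round, on top of new arrivals."""
    waiting = 0
    for r in range(1, rounds + 1):
        offered = new_viewers_per_round + waiting
        served = min(offered, capacity)
        throttled = offered - served
        # Frustrated viewers restart the app and try again almost immediately.
        waiting = int(throttled * retry_fraction)
        print(f"round {r}: offered={offered:,} served={served:,} throttled={throttled:,}")


if __name__ == "__main__":
    simulate_retry_storm()
```

Run it for a few rounds and the throttled count keeps growing toward a much larger steady state even though demand never increased: that is roughly the shape of a retry storm, and it is why clients are usually asked to back off rather than hammer a struggling service.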
But Tom, how do you test this? Well, Netflix did, with a smaller event; however, they may not have had a simplified alternative ready if something went wrong. Or, if they did, they were always just a little bit away from solving the problem as it grew slightly out of reach of an extremely talented team. Everyone in IT has had this happen. The best designs fail at the worst moment, and major outages affect the customer experience. This is why it is always helpful to have outside experience while planning, whether that is trusted colleagues (or a good message board) you can reach out to, solution architects from AWS, or outside consultants. It is important to build a team of resources that you trust to execute your mission. If you are looking for one, please feel free to reach out to us at Oxford Global Resources.