The Cost of Doing Business in the Cloud Age; Lessons from the Super Bowl

Keep a business running. 

In today's digital age, this often means keeping services running and available to customers 24/7/365, and accessible from a variety of devices. Users no longer access your platform from a single type of device, but from laptops, smartphones, tablets, Chromecast, smart TVs, Apple TV, and probably more. Users access the NFL Game Pass streaming service at all hours of the night, whether to catch highlights in Australia or to stream a game replay in Germany.

All of which makes today's fiasco even more stunning.

Global users of the paid service NFL Game Pass experienced not one but TWO service outages during last night's Super Bowl. The first outage came in the first few minutes of the game, just as the San Francisco 49ers were on their way to scoring the first points of the game. The second occurred at an even more crucial point: with the game on the line and the Kansas City Chiefs driving toward the game-winning score, Game Pass crashed again, not to return.

Now re-read that last sentence. This is not just any random Sunday game in November. This is the Super Bowl, the most watched event in the US for at least the last 35 years. This is not a breaking event that comes out of nowhere or goes viral; it is a known "peak" in terms of viewership and exposure. The NFL and its partners are supposed to know the number of users who already subscribe to the service, how many users buy single-game Super Bowl passes, and how many concurrent users they have had in previous years. Any forward-looking company that offers a cloud-based service has to know how to plan for an influx of users, and what specific actions to take (which should be automated, BTW; this is where a cloud orchestration platform can help).
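To make the planning point concrete, here is a minimal sketch of the arithmetic involved, assuming you know your peak concurrent users from previous years. The function name, the growth and headroom percentages, and every number in it are hypothetical; I don't have the real Game Pass figures.

```python
def required_instances(previous_peaks, growth_pct, headroom_pct, users_per_instance):
    """Estimate how many streaming instances to provision ahead of a known peak.

    previous_peaks     -- peak concurrent users from earlier years (hypothetical data)
    growth_pct         -- expected year-over-year growth, in percent (an assumption)
    headroom_pct       -- safety margin on top of the forecast, in percent
    users_per_instance -- concurrent users one instance can serve (an assumption)
    """
    if not previous_peaks:
        raise ValueError("need at least one historical peak")
    if users_per_instance <= 0:
        raise ValueError("users_per_instance must be positive")

    baseline = max(previous_peaks)
    # Integer math keeps the result exact: scale by both percentages,
    # then divide with rounding up so capacity is never underestimated.
    numerator = baseline * (100 + growth_pct) * (100 + headroom_pct)
    denominator = users_per_instance * 100 * 100
    return -(-numerator // denominator)  # ceiling division


# Hypothetical example: three past peaks, 10% growth, 30% headroom,
# 5,000 concurrent users per instance.
print(required_instances([400_000, 450_000, 500_000], 10, 30, 5_000))
```

The point is not the specific formula but that the inputs are known well before kickoff, so the capacity can be provisioned ahead of time instead of discovered during the first quarter.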

The failure of the technology providers behind NFL Game Pass to accommodate this was a key ingredient in this fiasco. From my research, a company called Deltatre is the technology provider that operates the NFL's OTT streaming service. Judging by the skills Deltatre looks for in its job posts for SREs (Site Reliability Engineers), it looks like they use a pretty modern tech stack, including AWS, Kubernetes, Infrastructure as Code platforms, the Istio service mesh, and more.

If Deltatre is indeed using this as the base of their tech stack for the Game Pass streaming service, it's hard to imagine that they don't have high availability or geo-redundancy in place for this core service, especially during their annual peak, with an added spike from casual fans also purchasing the streaming service. In fact, the lack of such fallback options is likely to end in some sort of user compensation, as people have been livid about this double crash.

Today's users expect always-on services; this expectation is heightened exponentially when said services are live event streams.

Quite frankly, it doesn't take a technical genius to figure this out, or to work out how to prepare adequately for a big event. I don't have the numbers for how many users NFL Game Pass gets for the Super Bowl, or how they compare to another streaming event like the Game of Thrones premiere (a talk about which mentions much of the same tooling). Without this and other data, it is difficult to ascertain what exactly went wrong here. I did see, though, that many users on Twitter have been experiencing recurring service outages with NFL Game Pass for at least a couple of years now.

Without Deltatre explaining what caused the service outage or how they dealt with it, we cannot really know whether they could have prevented it. My personal guess is that both the NFL and Deltatre will try to sweep this under the rug, and maybe give partial credit to a few disgruntled users who persist in their complaints. Not only would this be continued disregard for paying customers, but it would also mark a failure to own up to their mistakes and to plan their systems to run properly in next year's Super Bowl. One can only hope that this is the straw that pushes Deltatre to plan and make the necessary changes.

In closing, let me bring it back to the cloud age and to always-on services that directly impact your line of business. Your cloud strategy needs to make proper use of policy-driven automation that takes into account the different events that may impact your service level. Whether it's a downed data center or a misconfigured cluster, you need to have a plan, and a platform to execute the relevant actions when such events happen. If you're reaching peak service capacity, have the orchestrator scale up, and then scale back down when the event ends. With streaming increasingly becoming the way people consume live events, a move toward a fully automated system to handle the many "moving parts" is crucial.
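As a rough illustration of the "scale up at peak, scale down when the event ends" policy, here is a minimal sketch of the decision an orchestrator could make on each evaluation cycle. This is not any particular platform's API; the `ScalingPolicy` class, the thresholds, and the `event_active` flag are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy; field names and values are illustrative only."""
    min_replicas: int
    max_replicas: int
    scale_up_at: int    # utilization percent at or above which we add capacity
    scale_down_at: int  # utilization percent at or below which we may remove it
    step: int = 2       # replicas added or removed per decision


def desired_replicas(policy, current, utilization_pct, event_active):
    """Return the replica count the orchestrator should converge on."""
    if utilization_pct >= policy.scale_up_at:
        target = current + policy.step
    elif utilization_pct <= policy.scale_down_at and not event_active:
        # Only shrink once the live event is over, so a lull during the
        # game (for example at halftime) does not remove capacity.
        target = current - policy.step
    else:
        target = current
    # Never go below the floor or above the ceiling set by the policy.
    return max(policy.min_replicas, min(policy.max_replicas, target))


policy = ScalingPolicy(min_replicas=4, max_replicas=40, scale_up_at=75, scale_down_at=30)
print(desired_replicas(policy, current=10, utilization_pct=80, event_active=True))
print(desired_replicas(policy, current=10, utilization_pct=20, event_active=True))
print(desired_replicas(policy, current=10, utilization_pct=20, event_active=False))
```

Holding capacity while the event is still active is a deliberate choice in this sketch: scaling down too eagerly during a live stream trades a small cost saving for the risk of exactly the kind of outage described above.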

More articles by Ilan Adler

  • What's the business value of Environment as a Service?

    A lot of people have been asking me lately, what exactly does Environment as a Service do? How does it actually help my…

  • Orchestration at the Edge

    Note: This is based on a talk and blog post written by Cloudify's Senior Architect Shay Naeh A bold claim by many…

  • Cloudify vs. Ansible - The Best of Both Worlds

    In October 2018, we posted an article outlining a comparison between Cloudify and Terraform, HashiCorp’s infrastructure…

  • Automation Evolution: The Path to Intelligent Orchestration

    Note - This article was written by DeWayne Filppi, a Director of Solution Architecture from the Cloudify CTO team…

  • Go Beyond Automation with Orchestration

    I often get asked what exactly Orchestration is? What is the difference between using Orchestration and other…