Why mobile networks should be decentralized
Telecom outages make headlines
When reading the press coverage of major mobile service outages, you may wonder what is going wrong. In February 2024 it was AT&T, in July KDDI, in October Verizon. In Australia, the public shake-up after the Optus outage of November 2023, which directly affected 10 million people, continues.
Yes, there have been service outages in the past. Here and there a switch went down or a fiber was cut. Restoring service could take a long time – especially if proprietary hardware had to be replaced. Now all the essential software is supposed to run in the cloud, with redundancy schemes allowing load sharing and sophisticated resilience mechanisms. It all works fine until – oops – a single configuration change or update causes a wildfire of problems that brings the whole system out of control. The theory of wildfires is quite well understood: models considering fuel, weather and topography describe the behavior of fires quite accurately and help keep them at bay.
For telecom services, system reliability used to be an important topic – but that was before pooling, cloud and microservices. Now telecom looks and smells like IT. Those who detest the traditional telco obscurities – like SCTP multihoming – may celebrate. The more conservative may point out that telcos are now in the same boat as government agencies and airlines when it comes to incidents like the SolarWinds hack or the more recent CrowdStrike Falcon disaster.
You do not need a crystal ball to predict that we will see more headlines about large telco outages. To return to the wildfire analogy, most telcos look like hilly, dense old forests in the summer heat, just waiting for a spark.
The mess
One reason telcos are so vulnerable to wildfire-like problems can be found in the long life cycles of the technologies powering their services. Generations of outdated software need to be kept alive and integrated. It is one of the wonders of this world that you can get your DSL and 5G subscriptions charged on the same electronically delivered invoice. [Editor’s note: Sometimes I experienced those invoices to be correct. Often, they were not.]
Most of the legacy software (except what still runs on operational proprietary hardware) is now running in the cloud. This, however, does not mean that the software has been rewritten. If it used RADIUS then, it uses RADIUS now – or SS7.
With mobile networks there is also the topic of backwards compatibility. 2G was designed in the 1980s and 3G in the 1990s. For a smooth introduction of each next mobile generation, support for and interworking with the previous generations has been very high on the agenda. We have seen CS fallback for voice calls – essentially 4G networks pushing phones back to 3G to establish a voice call. Or 5G NSA, currently the most widely used 5G service, where the core network is 4G.
For end-users, the mobile generations are mostly visible in marketing. 5G is faster, better and more secure than anything before it, but the apps on the phone are the same. A text message remains a text message even if it is delivered using RCS, and a voice call is the same as before – except that you now get odd error messages about a server not being available when your call is cut.
For service availability, the parallel availability of many generations of mobile services may be a big plus. The catch is that a mobile operator uses the same subscriber database for all of them. If that database becomes unavailable, you might be able to hold your breath long enough to experience the impact on 5G and 4G connections. With the older generations you need to wait a bit longer, but eventually everything will go down. These massive subscriber databases are the Achilles heel of the mobile network. They are not just a single point of failure – they are also a very large attack surface for anyone targeting individual users or the whole network.
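The fate-sharing argument above can be sketched in a few lines of code. This is a toy model, not any real network-element API: the class and method names (`SubscriberDatabase`, `GenerationCore`, `attach`) are illustrative stand-ins for the HLR/HSS/UDM and the per-generation cores that all query it.

```python
# Toy model of fate sharing: every mobile generation of one operator
# authenticates against the same subscriber database (HLR/HSS/UDM).
# All names here are illustrative, not real network-element APIs.

class SubscriberDatabase:
    def __init__(self):
        self.available = True
        self.subscribers = {"262011234567890": {"ki": "shared-secret"}}

    def lookup(self, imsi):
        if not self.available:
            raise ConnectionError("subscriber database unreachable")
        return self.subscribers.get(imsi)

class GenerationCore:
    """Stand-in for a 2G/3G/4G/5G core that shares one database."""
    def __init__(self, name, db):
        self.name, self.db = name, db

    def attach(self, imsi):
        try:
            return self.db.lookup(imsi) is not None
        except ConnectionError:
            return False

db = SubscriberDatabase()
cores = [GenerationCore(g, db) for g in ("2G", "3G", "4G", "5G")]

assert all(c.attach("262011234567890") for c in cores)      # normal operation

db.available = False                                         # the single point fails
assert not any(c.attach("262011234567890") for c in cores)   # every generation is down
```

The point of the sketch: no amount of redundancy inside each generation's core helps when they all depend on the same lookup.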
While in many markets 2G and 3G have been sunset in recent years, the legacy still haunts us: our 5G phones are happy to connect to 2G and 3G networks whenever we travel to countries where 4G and 5G are not yet the norm. In much of Africa, 3G is still the best service you can get. For roaming, the legacy technologies with all their known and unknown issues are still with us.
Roaming as such is a nice trust-based concept built on agreements between mobile operators. It worked fine when the number of mobile operators and users was limited. Now there are 9 billion mobile subscriptions and 1200+ licensed operators, of which quite a few are ailing or failing. There is a long list of known GTP, SS7 and Diameter vulnerabilities that can be – and are being – exploited through roaming partners.
Many of the built-in security challenges of mobile networks have been around for years. GTP firewalls were introduced already in the 3G era, but most such solutions were practically swept out of the mainstream by the rapid growth of mobile data traffic. The focus was on keeping pace with traffic growth while trying to survive on very little revenue growth.
If you operated a candy shop the way telcos have run their mobile business over the last 10 years, you would sell candy in bags. Every year you double the size of the bags and do some branding work while keeping the old prices, because that is what the other candy stores across the street do. You can be sure you are busy designing bigger bags and adding new flavorings and colorings to the bulk of sugar, emulsifiers and stabilizers that your products consist of. There is neither money nor energy left for experimenting with different business models. Besides, you are increasingly harassed by activists and authorities claiming that too many candies cause health problems…
Now the mobile broadband market is saturated, traffic growth is slowing, and the telco industry is in a state of hangover after the sugar high – like your candy store customers after a mega pack of gummy bears. Reducing screen time and social media exposure is as hot a topic as excess sugar consumption. While waiting for new use cases (XR, AI, IoT) to start fueling new growth in any significant way, we may see more stagnation and disappointments. Just shouting "AI" does not make more people buy expensive fancy gadgets – as Apple learned with the iPhone 16 launch.
All the above is not just bad business for the telcos. It also kills innovation.
The difficulty of untangling the mess
No single player in the telecom ecosystem can change the course of the industry. While mobile operators in a given market seem more or less able to coordinate the sunsets of old generations, this does not apply on a wider scale. If, say, MediaTek decides to drop legacy support from its chips, phone manufacturers will pick Qualcomm. If Ericsson tries the same with infrastructure, Nokia will jump in.
An unpleasant side effect of the sluggish telecom market is that money is tight. Massive layoffs and restructurings are the norm. In such an environment all players focus on their core business – which means they produce more of the same, while loonshots and other cool stuff get axed.
Authorities to the rescue
After network outages the first reaction is to require tougher controls and penalties for network operators. These can be justified by pointing out that the telcos are breaking the service-level requirements embedded in their licenses – or by referring to some other regulation, like NIS2 in the EU.
The challenge with more controls and fines is that they direct scarce resources to what looks like risk minimization. But more logging, added cybersecurity and bigger legal teams do nothing to reduce the immense complexity that makes the networks vulnerable – just like adding fences and no-smoking signs has little impact on the occurrence of massive wildfires.
Enforcing architectural changes
While the availability of cat videos, WhatsApp and the occasional phone call to consumers is not very high on the agenda of authorities, the ability to make emergency calls is.
Another topic is that mobile broadband is increasingly used for mission-critical services as well. These services may either run fully on a commercial network, or use RAN sharing to piggyback on the base stations and transport infrastructure of a mobile provider while connecting to their own mobile core with their own SIM cards. The latter approach limits the fate sharing between the big commercial network (with roaming and legacy support) and the small dedicated 4G or 5G network of the authorities.
The above RAN-sharing model is good for the fire brigade: they can connect as long as the central core of the mission-critical network is available. This approach could easily be complemented with a solution allowing people to make emergency calls, and the fire brigade to remain reachable, even if the base stations lose connection to all core networks. Just add small local core networks here and there.
From cellular to modular
The trick of turning cellular networks into self-sufficient modules is outlined in the figure below. It is a matter of taste whether you want the many small island networks operational all the time or only in case of emergency. Just provision one extra network code and make the base stations advertise it among the ones they regularly do. If no other networks are available, the small bubble will accept emergency calls – which need to be routed to nearby authorities instead of the large national or regional emergency centers, as those tend to be unreachable when the large core networks are down.
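The selection logic this scheme implies can be sketched as follows. This is a simplified, hypothetical model: the PLMN codes, the `select_network` function and the "bubble" naming are illustrative assumptions, not 3GPP-specified behavior, and real network selection involves priorities, forbidden-PLMN lists and much more.

```python
# Sketch of the "modular" idea: a base station advertises its regular
# network codes plus one extra "bubble" code served by a small local core.
# PLMN IDs below are illustrative placeholders.

REGULAR_PLMNS = {"24401", "24405"}   # codes the base station normally advertises
BUBBLE_PLMN = "24499"                # extra code for the local island core

def select_network(advertised, reachable_cores):
    """Pick a regular network whose core answers; otherwise fall back to
    the local bubble, which accepts emergency calls only."""
    for plmn in advertised:
        if plmn in REGULAR_PLMNS and plmn in reachable_cores:
            return plmn, "full service"
    if BUBBLE_PLMN in advertised:
        return BUBBLE_PLMN, "emergency calls only, routed to nearby authorities"
    return None, "no service"

advertised = REGULAR_PLMNS | {BUBBLE_PLMN}

# Normal day: at least one big core is reachable.
print(select_network(advertised, reachable_cores={"24401"}))

# Wildfire day: all big cores are gone, the bubble still takes emergency calls.
print(select_network(advertised, reachable_cores=set()))
```

The design point is that the fallback lives in the selection logic the device already performs; the bubble network is just one more advertised code with a tiny core behind it.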
Now you may ask why it is not enough to rely on at least one of the three or four commercial mobile networks being operational. First, not all networks have coverage everywhere. Especially in more remote areas you are lucky to have reasonable coverage from any of the providers. Second, the likelihood of all mobile networks going down as a result of targeted attacks might be much higher than we generally perceive. Recently the US press reported that both AT&T and Verizon had been hacked by foreign, allegedly state-sponsored actors. In this case the target was identified to be lawful interception – a target better protected than almost anything else. If the mission had been to wipe out the networks, you might wonder if T-Mobile would still be standing or if all the big carriers would have been gone.
6G to the rescue
Modular in addition to cellular could be implemented as part of 6G. The most elegant way to do this is to make SIM authentication optional. There is no technical reason why an end-user device could not connect to the base station and negotiate a service contract locally. Asymmetric authentication is used in most other networks – only 3GPP continues with symmetric SIM authentication, a heritage from GSM times.
If users could connect to mobile networks using public-key cryptography, the whole concept of roaming would change. Asking for the public key of a previously unknown user is much easier and requires much less trust between the different actors compared to relaying SIM authentication requests and tunneled traffic between networks.
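The contrast can be made concrete with a toy challenge-response sketch. Everything here is deliberately simplified: the HMAC step only gestures at SIM/AKA-style symmetric authentication, the RSA numbers are textbook-tiny and insecure, and all function names are hypothetical. The structural difference is what matters: the symmetric check requires the verifier to hold (or reach someone who holds) the shared secret, while the asymmetric check needs only the public key.

```python
# Toy contrast: symmetric SIM-style authentication vs. public-key
# authentication. Textbook-RSA numbers and made-up message shapes --
# purely illustrative, never use parameters like these for real.

import hashlib, hmac, os

# --- Symmetric (SIM/AKA-style): home network and SIM share a secret Ki ---
KI = b"secret-known-only-to-sim-and-home-network"

def sim_response(challenge):                   # runs on the SIM card
    return hmac.new(KI, challenge, hashlib.sha256).digest()

def home_network_verify(challenge, response):  # a visited network must ask home
    return hmac.compare_digest(sim_response(challenge), response)

# --- Asymmetric: anyone holding the public key can verify the device ---
P, Q, E = 61, 53, 17                # toy primes; real keys are 2048+ bits
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, stays on the device

def device_sign(challenge):          # runs on the device, uses private key D
    digest = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % N
    return pow(digest, D, N)

def visited_network_verify(challenge, signature, public_key=(E, N)):
    e, n = public_key                # no round trip to the home network needed
    digest = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % n
    return pow(signature, e, n) == digest

c = os.urandom(16)
assert home_network_verify(c, sim_response(c))    # verifier needs the shared Ki
assert visited_network_verify(c, device_sign(c))  # verifier needs only (E, N)
```

In the symmetric world the visited network must tunnel authentication back to whoever holds Ki; in the asymmetric world it can verify the visitor locally, which is exactly why the trust requirements between operators shrink.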
Back to preventing wildfires
Mobile networks have experienced two growth bursts. 2G was Connecting People with mobile telephony and SMS; all the later generations have been about adding data capacity as fast as possible. During this time most of the underlying hardware has morphed from highly proprietary telephone exchanges to regular IT servers and cloud platforms, but the protocols have remained telco-specific. Because backwards compatibility has been of the highest priority, it has been easy for the whole industry to lift its legacy code base into the cloud and keep adding new functionality around it – to the extent that no single expert can know it all. The result is the equivalent of an overgrown dense forest, just waiting for a spark.
The most recent savior promoted by technology enthusiasts is of course AI, which can help optimize and secure the complex environment. The flip side of the coin is that bad actors can feed their AI models with specs and manuals and look for weaknesses. AI is a zero-sum game at best.
What remains is what rangers do when guarding dense old forests. Thinning: remove legacy services. Creating firebreaks: breaking networks into smaller self-contained entities is the optimal solution – but cutting access from all but truly trusted roaming partners is a good start.
Rangers also do controlled burns. This medicine is hard to recommend – but then again, Chaos Monkey and other chaos-engineering tools that randomly kill processes and test resilience schemes are in regular use in other industries. Telcos, I guess, do not even dare to try.
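The chaos-engineering idea mentioned above can be illustrated with a minimal sketch. This is not the actual Chaos Monkey tool; `Replica`, `service_up` and the kill loop are made-up names for a toy experiment that randomly removes redundant instances and checks that the service survives as long as one replica remains.

```python
# Minimal chaos-engineering sketch in the spirit of Chaos Monkey:
# randomly kill redundant "instances" and assert the service stays up
# as long as at least one replica survives. Purely illustrative.

import random

class Replica:
    def __init__(self, name):
        self.name, self.alive = name, True

def service_up(replicas):
    """The service is up if any replica is still alive."""
    return any(r.alive for r in replicas)

random.seed(7)                       # fixed seed for a repeatable experiment
replicas = [Replica(f"core-{i}") for i in range(3)]

for _ in range(2):                   # the "monkey" kills two replicas at random
    victim = random.choice([r for r in replicas if r.alive])
    victim.alive = False
    assert service_up(replicas)      # resilience claim holds after each kill

assert sum(r.alive for r in replicas) == 1   # exactly one replica left standing
```

The value of such an experiment is not the toy assertion itself but the habit: if you never kill instances on purpose, the first real test of your failover happens during the outage.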
cc Ismaila "izzo" Wane - you might have some views on this
Entrepreneurial Telco Industry Professional - Biz Dev, Products and Technology
5 months ago · While the HLR (and its various variants over the generations) is a single point of failure (ignoring redundancy architectures and the like), it also is a "single" point for external commercial entities to communicate with and "authenticate" a device/user. Isn't that fundamental to a capability like roaming?
Decentralized transactional ecosystem enabler
5 months ago · For the "Achilles heel / subscriber databases" – have you considered IPFS?
- IPFS: a globally addressable decentralized hash table. No cloud required.
- Immutable: content-based addressing and data duplication at the data layer for high availability.
- Ownership: deploy dedicated nodes or use an IPFS pinning service – 100M+ public nodes exist.
- Simplicity: no API complexity required. Open-source gateways exist for all platforms, and unstructured data is supported (i.e. P2P).
For identity: IPFS works as a registry for W3C DID documents (user contact data); no blockchain is required. The IPFS CID is effectively the URI and can be used in place of a mobile number. Additionally, there are many deployments where IPFS is used as the connectivity layer for streaming or messaging. Would IPFS be useful for decentralized mobile networks?
Professor for Communication Networks and Cybersecurity
5 months ago · I firmly believe that you can deploy resilient network functions on cloud infrastructure. However, as you point out, the software needs to be designed for the cloud. As we saw with NFV and subsequent efforts, we often only see the lift-and-shift approach without really thinking too much about what this move really means in terms of performance and resilience. Particularly in mobile networks, we have the additional problem that the standards mandate a relatively narrow set of possible approaches to implementation that are geared towards the classical box world and cannot fully take advantage of cloud services.
Member of Technology Leadership team within MN Global Business Development / now MN Marketing at Nokia
5 months ago · Authentication is a "hot topic" – and for sure webscalers have a keen interest ...