Understanding 5G, A Practical Guide to Deploying and Operating 5G Networks, Automation and Optimization (Part 8)
Houman S. Kaji
Founder, Board Member & Executive Vice President, Chief Innovation Architect Strategy and Ecosystem
We are asking a lot of our 5G systems. 5G must deliver a wide range of challenging use cases with extraordinarily diverse characteristics in terms of data rates, latency, reliability, and the number of connected devices. Some of these requirements contradict each other, yet the system must be able to deliver these use cases simultaneously on the same network.
Earlier parts showed the innovations of radio technology for spectral efficiency, antenna arrays for massive MIMO and beamforming, and new architectural flexibility and distributed computing environments that will play a major role in meeting these requirements. In those parts, we saw how 5G is architecturally designed and implemented so that it can deliver these use cases.
However, technology development is only one part of the story. The way it is operated, managed, and optimized has a critical role in delivering the ITU 5G service enablers. In this part, we explore the areas of automation and optimization, including what can be optimized along with the beneficial outcomes of optimization. The part will also cover artificial intelligence and machine learning and how these will play a major role in delivering on the promise of 5G.
1. Business Drivers for Automation and Optimization
Optimization has been a cornerstone of wireless networks and has facilitated increased performance since the earliest technology generations. Even in the early analog days of wireless communication, the network could be tuned to offer capacity and coverage better matched to the needs of the users. This was achieved, for example, through the assignment of specific carrier frequencies to cells and the adjustment of transmit powers.
As the digital era dawned with 2G narrowband time division multiple access systems such as GSM, the challenge of frequency planning continued, although it was somewhat mitigated by frequency hopping and the introduction of flexible voice coder-decoders (CODECs).
The introduction of handover between cells introduced the need to plan the lists of candidate neighbors that mobiles could seek as potential handover targets. With spread-spectrum systems such as 3G UMTS, the planning of frequencies became less important and was eclipsed by the need for management of interference to maximize capacity while maintaining the desired coverage and achievable data rates.
With 4G LTE came the concept of self-optimizing and self-organizing networks (SON), and various capabilities were built into the standards to facilitate the automation of what had previously been manual. This included various procedures for the initial configuration of parameters, from the choice of physical identities to prevent close-proximity clashes, to the choice of which cells a mobile would be directed to search for as candidates for handover. Other aspects of SON were concerned with the parameters that control when a handover takes place between cells, so that the risky transition process is more robust. SON also included facilities for balancing the often-conflicting objectives of coverage, capacity, and quality.
Many of these features have yet to make their way into the 3GPP 5G standards. If and when they do, there will still be many aspects of operating a network that require optimization. We could think of optimization as the configuration and operation of the network in a way that best delivers the services required, at the capacities required, in the locations where they are required, while meeting certain criteria such as the level of service, reliability, or energy consumption.
Manual optimization today is performed by engineers and includes analysis to identify problems, troubleshooting of the root cause, and manual interventions to change various aspects of the network configuration. This manual process is slow, and its workflows are laborious to complete. Scaling it up naively requires an at least linear increase in time, engineering resources, and ultimately cost.
Automation takes the principles of manual optimization and encodes the process in software so that it can be performed automatically with minimal or even no human intervention. This vastly reduces the cycle time for interventions and frees engineers to manage other aspects of network operations. The workflows of manual optimization can be automated, but as we shall see, automation opens the door to entirely new approaches that are not even possible with the current manual methods.
There are various other motivating drivers for optimization. One is the fact that the network is being designed to satisfy the connectivity needs of multiple vertical use cases with sometimes strict requirements in terms of data rates, latency, reliability, and device density. These requirements are in many cases inversely correlated with one another; to satisfy one, another must be relaxed. Ultimately, optimization will be expected to find a state for the various aspects of a network configuration that delivers a service representing the best compromise between the various conflicting goals.
An additional driver for optimization and automation is the newly dynamic nature of the networks. The way that networks are used changes constantly. While some Internet of Things (IoT) devices, such as street furniture sensors or smart meters, are expected to be stationary, and fixed wireless access (FWA) access points stay attached to the side of buildings, many UEs will be moving. People with smartphones traveling around as part of their daily routine and connected cars driving at high speed on highways place varying demands on the network.
This dynamic property goes well beyond spatial distribution as the use of applications varies. Critical communications for connected cars or smart manufacturing will vary by time of day such that the usage by application and network slice will be much more dynamic. This presents a challenge for the operator. At one end of the spectrum, a network can be optimized statically for average demand. At the other end of the spectrum, it can be constantly reparametrized so that it is always perfectly tuned for the needs of the users where they happen to be located, using the applications they want and need. While this latter vision may never be achieved, it has seen progress in recent years. It envisions a network that has a greater capacity to serve subscribers since it can tune its parameters to adapt to what connectivity is required, what quality of service mix is required, where it is required, and when it is required. This results in a network that delivers more return for a given investment.
There is also a trend toward a greater mix of network infrastructure and virtual function vendors along with more disaggregation into logical components with well-defined interfaces in a network. This means that there is less scope for optimization to be performed within the ecosystem of a single vendor. Or rather, if nothing else is done, the push for openness and commoditization will be in vain unless the ability to automatically manage and optimize a system of heterogeneous components is also delivered. The alternative is a strong incentive to select a single vendor who has a proprietary intelligence layer for controlling the performance by optimizing at the system layer, with all the cost and vendor lock-in that this entails. An automation layer is needed to deal with the complexity and abstraction of network system behavior irrespective of the vendors.
Within the operator, keeping the operational cost of running the network under control is an important factor in optimization. This may be achieved by choosing an optimization paradigm that can identify cases where performance issues arise from poor configuration, root-cause them, and resolve them through network reconfiguration.
Figure 1. 5G technologies life cycle
1.1. Stakeholders in Automation and Optimization
The drivers for optimization have stakeholders who care about the outcome, and there are various outcomes of optimization and automation that they care about. Here we examine the various stakeholders in network behavior and the nature of their interest.
The users of a service are stakeholders in its performance. Their satisfaction with the service will depend on how well the network is optimized for delivering the applications they want to use in the locations and at the times that they want to use them, along with sufficient quality of experience.
The operator of the network has a stake in optimization and automation. As well as ensuring that their customer base is happy with the service, the CFO will want to ensure that the capital and operational costs are balanced with the revenue a network generates. These are all outcomes on which optimization and automation will have a bearing.
Connectivity services for specific industrial and commercial verticals may be provided by third-party specialists who procure network connectivity from the operator, typically in the form of a network slice, and market the service to their customers. As well as having an interest in the outcomes of optimization and automation in terms of network performance, these entities need ease of service creation along with assurance and confirmation that service level agreements (SLAs) are met.
The regulator also has a stake in the outcomes of optimization and automation. They will typically attach conditions to the right to use spectrum. These obligations can be in terms of the performance, where the network can be used, and what applications can be used. Whether these criteria are met will often depend on optimization and automation.
2. Benefits of Optimization and Automation
Optimization and automation are performed for a reason: to satisfy the needs of the stakeholders outlined in the previous section. Let us explore what motivates these outcomes in more detail. Naively, these activities are intended to enhance the performance of the network. But what exactly is the performance that we seek to improve? This has many dimensions and layers, including what services are available in what locations and with what capacities. There are also objectives beyond what can be characterized as performance.
2.1. Delivery of the 5G Service Enablers
The success of 5G will be measured on its ability to deliver on the promised 5G service enablers, namely enhanced Mobile Broadband, Ultra-Reliable Low Latency Communication, and massive Machine Type Communications, and optimization has a major role to play in making these a reality. The services built on these slices will succeed or fail depending on the network's ability to adapt to delivering the vertical use case slices in concert.
The success with which a service is delivered will often be measured in terms of SLAs. These are typically enshrined in the contract between the network operator and another party such as the provider of the vertical use case or the end-user. These SLAs can be expressed in a variety of ways but will typically include measures of latency, capacity, and reliability, and may vary by geographical area. If these performance measures are not met, then there will typically be penalty clauses in which the network operator will receive reduced fees for access to their network or, in extreme cases, must pay compensation to the provider of the vertical service. Optimization and automation should enhance the ability to meet SLAs. This can be achieved most directly in terms of the measures of application performance. But optimization can also deliver improved resilience in being able to recover quickly from an impairment, outage, or other interruption to the service.
2.2. Enabling Coexistence of Network Slices for Rich Service Ecosystems
5G networks will be expected to deliver slices of connectivity for multiple vertical industries simultaneously. Each of these will have its own applications, with a unique set of network performance characteristics that define what constitutes a good, acceptable, or poor user experience. Each may have its own SLAs associated with it, and these will typically differ from slice to slice. Where these performance characteristics and SLAs vary between slices, optimization and automation will play a major role in balancing the requirements together.
There may be cases where there is a conflict between the requirements of the various slices that cannot be satisfied simultaneously. In this case, the optimization process will have to balance the requirements between the slices while some lower priority slices are denied a full service so that the higher priority slices can meet their service requirements fully. This becomes particularly important in the presence of impairments. Malfunctioning network functions or degraded connectivity between them will temporarily restrict the ability to deliver the service for all users on all slices.
2.3. A Platform for Frictionless Service Creation and a Market in Connectivity
Network slices with specific expectations of performance and associated SLAs may often be operated by third parties within an industry such as automotive, utilities, transportation, media, or entertainment. To foster a rich ecosystem of services that are engaging to the consumer along with being profitable to both the operator of the network and the slice, this partitioning of a network into slices will require some degree of automation. Any manual steps will add friction and present barriers to bringing innovative services to the market. Some manual aspects of the lifecycle of service delivery, such as commercial negotiations, may be hard to remove entirely. But operators can aspire to remove even these steps.
By whatever means a new service vertical is accepted onto a network, that network will generally have to respond to ensure that it is optimally configured for the new service along with the existing network customers and slices. The reduction in friction to new services will depend on the ability to make the right decisions on how to reconfigure a network and automate the deployment of the reconfigurations.
But this reduction in friction goes beyond just the step at which the new service is integrated into a network. That optimization step will generally depend on the models of a network. These models are discussed later in the part, but in summary, they will model various aspects of a network including how a network responds to stimuli and change. As well as underpinning the ability to change the configuration of a network for more optimal performance, these models can also be used earlier in the lifecycle of a service. At the stage of negotiation between the network operator and the operator of the vertical service, the network operator can assess the impact of the proposed new service on the network, whether the service performance targets can be achieved, whether the SLAs can be met, and at what cost, if any, to the other users of a network.
The answers to these questions can help with the decision of whether the service can be supported, and how much should be charged for the connectivity. If these questions can be answered by an automation and optimization system using the underlying models of a network, then the decisions themselves are more amenable to automation. This reduces further the friction of the entire service creation lifecycle.
At its limit, it is plausible that a market in connectivity may emerge, giving rise to a rich and dynamic ecosystem of services in which the provider of the connectivity is rewarded for their investment in the next generation of high-performance, flexible communication infrastructure and the spectral resources on which these depend.
Together, these capabilities will result in a more responsive network able to deliver new services and retire legacy services with very low friction. The network of the future, when supported with the right automation and optimization, will be a platform for innovation with a low cost of entry for new services, fostering a thriving industry of innovation.
2.4. Achieving Optimized OPEX to Manage Operational Complexity
As we have seen, delivering a flexible ecosystem in which innovative and valuable services can be created easily to enable new applications is a key benefit of automation and optimization. But there are other outcomes to which we can aspire that go beyond the ability of a network to deliver service to subscribers and meet SLAs. Some of these can be addressed in part by optimization and automation. We explore these here.
As we saw in part 6, we risk driving up the operational costs substantially unless there is a transition away from the traditional approach to managing a network. This change of mindset must take account of the complexity of a 5G network and of delivering what is expected of it. A shift in approach is not just about delivering a set of network slices that cost-effectively coexist in harmony. Rather, it is an imperative born of the fact that no group of human network engineers will be able to achieve what is required. The entity or entities managing a network must understand the operation and dynamics of the network in sufficient detail to identify suboptimal performance, impairments and interruptions to service, and cases of SLA infringement.
The managing entity must be able to troubleshoot the identified issues with enough resolution to discover the root cause quickly and accurately. It must be able to identify the changes that might resolve the issues and select from these the change or changes that most effectively achieve the resolution. And it must do all this across the entire end-to-end network from the edge through the transport network to the centralized functions, from the application layer to the physical layer, and in sufficiently short timescales to minimize the impact of suboptimal performance and impairment.
This vision of operating a 5G network must be achieved to deliver on the full value of 5G. But it cannot be delivered as it has been in the past, with teams of engineers working in manual ways. They will have to depend increasingly on more sophisticated solutions to rapidly assimilate large quantities of heterogeneous information from across a network and make decisions that are substantially more autonomous than before. This vision is hugely ambitious, but as we saw in part 6 it is also achievable and has precedent in other industries. The keen CFO may hope that the need for expensive operational staff will instantly evaporate, leading to the dream of a better-performing network combined with dramatic reductions in staffing costs. Unfortunately, the latter is unlikely to happen immediately. Although the staff may be making fewer decisions about when and how to reconfigure the network parameterization, there will be plenty of other concerns that will require their attention.
This is because the systems that optimize and control a network will themselves require management and maintenance, along with expansion and upgrading. The DevOps role will become critically important initially, much as it has in web-scale companies. The need to incorporate new automated ways of creating and operating network slices will keep these roles busy as the transition to automation progresses. As new classes of service using a network come along, new ways to automate their happy coexistence alongside the existing ecosystem of slices will be paramount. Unless and until robots can be expected to replace faulty hardware and upgrade obsolete parts, staff will still play a key role in ensuring the network operates efficiently, especially in keeping track of physical resources in increasingly distributed locations, with all the many issues that can befall a physical plant.
2.5. Reduced or Deferred CAPEX
The early days of a commercial 5G network are characterized by the rapid buildout of new sites in the race to cover as many potential subscribers as possible. The attention will shift to keeping abreast of the capacity demand once the initial coverage targets are achieved. This increased capacity generally cannot be achieved through a single mechanism but typically will be delivered through a mix of methods. These include network densification, addition of carriers, re-farming of spectrum from older technologies, or upgrading network infrastructure to support more advanced versions of the standards. In general, these will incur spending on network infrastructure and the connectivity to deliver the transport.
Also needing to be factored in are the associated costs of installation, commissioning, integration, spectrum licensing costs, and other activities associated with network growth.
Any mechanism to avoid these expenses, or at least to defer them, can result in substantial savings for an operator. This is where the optimization process can have a role to play. For example, optimization of the capacity can squeeze more performance out of a network with the same infrastructure by delivering a system that has better signal-to-noise ratios and thus can utilize higher-order modulation schemes and channel coders. In some cases, the cleaner RF and lower interference arising from optimization will mean that coverage is extended and could avoid some of the coverage expansion work.
A significant component of the capital expenditure will be for physical computing platforms for logical functions and the fiber links to provide the transport that will connect them.
This will include the costs of laying fiber and acquiring sites to house hardware along with the costs of fiber and compute-associated hardware. The choice of where to physically place logical functions and how to route user plane and control plane traffic will have impacts on how much of each logical resource is required and how much physical transport is required. The choice of where to break out the user plane will also be a factor in these costs. An optimization suitably guided by these considerations can take capital expenditure into account so that physical compute or transport is avoided or deferred.
2.6. Reduced Energy Consumption
Reducing the energy consumed by a network leads to a reduction in operational costs.
A major operator will incur annual electricity bills totaling many hundreds of millions of dollars, so a reduction in energy demand will produce significant savings. But such a saving also has peripheral benefits. More and more, companies can brandish their credentials as environmentally friendly enterprises. Having lower carbon footprints is seen as a competitive advantage as consumers migrate to reduce the impact of their consumption. One of the design considerations of 5G has been to keep power demands under control even in the face of delivering vastly more data to greater multitudes of devices. For example, the design of the 5G NR frame structure means that synchronization signals are transmitted only intermittently, so that when demand is low the radios can save power by only transmitting for small portions of the radio frame and shutting down for the remainder.
Optimization can augment these advances in the standards in a variety of ways. Accurate prediction of demand for services at different locations around a network makes it possible to take entire beams, cells, or base stations offline, or even logical functions, allowing the underlying physical compute resources to be powered down. When objectives for power consumption are introduced into the control functions directing this activity, the power required can be balanced against the coverage and capacity along with likely demand, such that a resilient service can match the demand and meet SLAs while minimizing energy requirements.
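To make the idea concrete, the sketch below is a minimal, hypothetical illustration of such a control rule: capacity cells whose predicted demand falls below a threshold are put to sleep, while a designated coverage layer is always kept on to absorb residual traffic. The cell names, thresholds, and forecast values are purely illustrative and not part of any standard interface.

```python
# Hypothetical sketch: switch off capacity cells when predicted demand is low,
# keeping an umbrella coverage cell active. Thresholds and names are illustrative.

def plan_cell_sleep(predicted_load_mbps, sleep_threshold_mbps=50.0,
                    coverage_cells=("macro_low_band",)):
    """Return per-cell on/off decisions for the next optimization interval."""
    decisions = {}
    for cell, load in predicted_load_mbps.items():
        if cell in coverage_cells:
            decisions[cell] = "on"          # never sleep the coverage layer
        elif load < sleep_threshold_mbps:
            decisions[cell] = "sleep"       # demand can be absorbed elsewhere
        else:
            decisions[cell] = "on"
    return decisions

forecast = {"macro_low_band": 120.0, "small_cell_1": 12.0, "small_cell_2": 85.0}
print(plan_cell_sleep(forecast))
# {'macro_low_band': 'on', 'small_cell_1': 'sleep', 'small_cell_2': 'on'}
```

In a real system the forecast would come from the demand models described later in this part, and the decision would also weigh resilience and SLA constraints.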
2.7. Cyber Security and Network Security
The smooth operation of a 5G network will be about far more than identifying and dealing with impaired services arising from the demand on a network and its ability to satisfy that demand. There will also be threats to a network, from users seeking to obtain service without being subscribers to unauthorized agencies attempting to eavesdrop on communications. These threats include those determined to take computing cycles without authorization or to disrupt service for malicious reasons, be it by jamming the physical spectral resources with interference or by causing viruses to propagate through a network. All of these can affect the integrity of a network, not just to the extent that the service is impaired but also in that users are threatened by loss of data or personal information.
There is an intersection between cyber security and network security interests and the optimization and automation of the network. Many of the same approaches used for network optimization also have applications in the detection of security issues along with establishing the root cause. Models developed to describe the demands placed on a network and the way it responds to satisfy those demands will be able to detect deviations from normal operation. The anomalies in network behavior arising from nefarious activity will in many instances be discernible by the same models. In other cases, similar models using the same approach and infrastructure can be deployed at relatively low cost to improve the detectability of this type of activity and attenuate its ability to impact network performance.
2.8. Meeting of Regulatory Requirements
Some regulators around the world are imposing stricter requirements on mobile operators to deliver high-grade services. This typically includes the requirement to deliver coverage to ever-increasing proportions of the population. In some cases, there are requirements for certain data rates or latencies at specific places, such as on roads. Meeting these regulatory requirements is yet another constraint to be considered in managing a network, alongside the delivery of capacity and coverage, meeting SLAs for the various network slices, and keeping capital and operational costs under control. Again, a mature optimization and automation system plays a significant role in helping the operator meet the constraints imposed by regulators.
2.9. Intent-Driven Optimization
The theme running through these drivers and benefits is that control of a network will experience a shift. There will be an inexorable move away from simply changing configuration settings to achieve lower-level network goals such as coverage, capacity, and quality. Instead, the focus will shift to specifying increasingly abstract policy objectives for the network in the form of intent, which can then be translated into the optimization decisions that must be made. This was discussed in detail in part 6 in the context of intent-driven orchestration. Orchestration of virtual network functions configured on physical infrastructure is a major component of this. It also encompasses the configuration of the functions themselves, including the physical and radio aspects, and of the transport network, for example.
Ultimately the operator will be setting policies about the services and user experience, and about how a system will offer a variety of rich services on a set of network slices operating together. These intents and policies will translate into what sorts of services will be allowed on a network, in which locations, and how close to capacity these services will be allowed to operate, for example. This will mark the transformation towards the intent-driven operation of a network, which tells the optimization what objectives should be achieved, not how they should be achieved.
3. Technology Enablers for Automation and Optimization
5G is anticipated to have several characteristics that will benefit from automation and optimization and create an atmosphere that allows mobile operators to thrive.
3.1. Disaggregation
The 5G network has more logical entities than previous generations. The gNB, which is the base station in 5G systems, is now defined as three logical elements. The central unit (CU) manages the processing of the higher network layers above the RLC layer. This is connected to the distributed unit (DU), which manages the lower network layers. The CU and DU are defined by 3GPP and connected by the F1 interface. The remote radio unit (RRU) is additionally defined and is connected via the lower-layer split (LLS) as defined by the Open Radio Access Network (O-RAN) industry consortium.
One of the motivations for disaggregating the gNB into these logical elements was to address the challenge of network densification. As more cells are added to a network it will become increasingly hard to find locations where base stations can be placed. In many of these new locations, such as streetlights or the sides of buildings, there will be insufficient power and space for a full base station.
To address this problem, disaggregation allows the minimum functionality to be placed at the radio in cases where power and space are limited. The rest of the functionality can then be placed in a more central location such as a local aggregation center. Disaggregation can be configured differently for different entities, such as for different carriers and parts of the spectrum, and for the user plane and control plane. Although centralization has distinct advantages in facilitating network densification, it comes at a cost. The transport network for the lower-layer split carries a high data rate and has very strict timing alignment requirements for the radio interface to function correctly. Thus, high-grade transport connectivity, typically using costly fiber optic links, is required. Another disadvantage of centralization is that some very low-latency applications will require the user plane data to be broken out as close to the edge as possible; however, the user plane data is not available at network elements closer to the edge than the CU. Disaggregation and the resulting centralization therefore have benefits but also the disadvantages identified above. Finding the right balance of cost and performance is an essential objective for optimization.
3.2. Network Flexibility and Complexity
The way that 5G has been defined permits vastly more flexibility in how it is constructed and configured. But with this flexibility comes potential complexity. As long as the complexity can be tamed, the flexibility can be exploited and used as an enabler for innovative ways of constructing the resulting system.
The disaggregation described in the previous section is one example of the way that 5G systems are more flexible. Beyond this, there are choices of which carrier bands to use within the mmWave bands and the lower FR1 band. There is also the choice about how to break carrier bands into carriers and bandwidth parts, and which numerologies to use for these. How different UEs are scheduled onto these physical resources and how the resources are distributed between network slices are examples of the increased flexibility and complexity of the 5G system. When systems are flexible and complex, they can be constructed and reconfigured in a multitude of ways. As a result, ever-larger sets of objectives can be achieved through careful choice of the configuration. Thus, the flexibility that we see in 5G systems is an enabler for a potent optimization layer that can drive performance to new limits, maximizing the return on investment for an operator.
3.3. Network Programmability
Flexibility and complexity are one side of the equation for automation and optimization. But unless the system is also programmable, the power of flexibility is locked away from the system operator. Parameters must be exposed and configurable to influence how the network operates. Traditionally, making changes has involved navigating graphical user interfaces in network element managers or operation support systems to change parameters manually. In extreme cases, changing a parameter or characteristic has involved visiting a site to manually change the direction in which an antenna is pointing, for example. While optimization is possible with these constraints, automation is not. There is a trend to make it possible to change the network via well-defined programmable interfaces such as APIs. It is this openness of exposed interfaces that brings the power of optimization to life with automation. It means not only that changes can be performed autonomously but also that they can be made faster and more dynamically.
There will always be some level at which vendors of network functions make decisions and implement some aspects of a capability with no possibility to influence the operation externally. But these vendor-specific decisions will become less common as more disaggregation of functionality takes place in initiatives such as the O-RAN consortium, for example. Vendors can draw comfort from the fact that even if the opportunities for differentiated intelligence in the components become more limited, this will be counterbalanced by increasing opportunities for intelligent control systems to automate and optimize the 5G systems.
3.4. Virtualization and Service-Based Architecture
As we saw in part 6, virtualization of network functions brings decisions about which functions to instantiate on what physical resource, in what physical locations, and with what connectivity. This allows more control over how the objectives of a network are achieved, with what performance characteristics, and with what resilience. Thus, virtualization is an enabler for optimization.
With the evolution to 5G, 3GPP has decided to define many of the functions of the network as services. Not all functions are part of the service-based architecture, however. In particular, the RAN functions generally are not defined in this way. But many of the core functions are defined in terms of services. These have advantages such as well-defined interfaces and discoverability. With this transition, these functions start to resemble the services and microservices that are used to construct a typical web service. This architecture facilitates a flexible network in which the logical entities used to construct a service end-to-end are chosen based on the service, the network slice, the service KPIs, and the SLAs. It also means that the service functions can be placed in the physical locations, toward the edge or more centralized, where they are best able to deliver the latency requirements given the availability of computing and other physical resources.
3.5. Artificial Intelligence and Machine Learning
Artificial intelligence (AI) and machine learning (ML) are another cornerstone enabler for optimization and automation. Modern wireless communication systems that convey vast quantities of data for their users are also awash in data generated by test and assurance systems, state information from network elements, and other telemetry supporting network operations. Many machine learning algorithms and models thrive on data in large quantities, and these algorithms and models in turn underpin the artificial intelligence required to automate the optimization processes. This area will be covered in more depth later in this part.
4. A Closer Look at Automation and Optimization
We have seen that there are many drivers for automation and optimization, as well as multiple stakeholders in the outcomes. Moreover, we have seen that there are many technological enablers for these. Here we explore the different resolutions over which optimization decisions are made in 5G systems. We consider the temporal and spatial scales of decisions along with how and where the data are collected, and computation performed.
4.1. Optimization Timescales
Different types of optimizations are performed over a vast range of timescales and cadence. Some aspects of the operation of a network are essentially optimized only once or are costly to change once deployed. The choice of where cells are placed is an example of this. Procuring sites along with power and transport connectivity is a costly business; there must be a compelling reason to change this once the site is deployed.
While there is significant friction to changing some characteristics of a network in cases where decisions must be made at the time of network design, other factors are much less of a barrier to modification. These can have diminishing or zero marginal costs in making a change. Some decisions can be made with great effect even over non-real-time timescales.
Examples of aspects of the network that can change more easily include the choice of the parameters that control the interactions in a network. These parameters control the transmission powers and beam directions that affect coverage and interference. Other examples include the parameters that control mobility between beams, cells, and radio technologies. Also in this category are the configurations that control how traffic is routed through the transport network and where virtual functions are instantiated. Decisions about these factors can incorporate data and knowledge from multiple parts of a network. Decisions can be made considering the emergent system behavior and how this can best be aligned to the objectives of optimization or intent of the network.
Where static decisions made at the time of network design are at one end of the spectrum of timescales for optimization, the opposite end of the spectrum includes the near-real-time decisions that must be made almost instantaneously in response to the changing dynamics of the network, the state of the radio, transport and network functions, along with the demand from the subscriber devices. Examples include which data to schedule in what order and on which parts of the OFDMA resource grid, or which beams to utilize for communication with specific UEs. These decisions must be made in a matter of a few milliseconds or less and must necessarily be made at or near the edge of a network where the resource mapping and other functionality is performed. This means that the real-time data that the algorithms have access to is limited; generally, only fresh data collected or generated at or near the network element making the decision can be used. However, the algorithms can rely on models and other aggregated information that have been gathered over longer timescales based on historic data. In this sense, the algorithms can learn what strategies are successful and which are not. We shall cover this in more detail later in the section on machine learning.
4.2. Spatial Extent of Optimization
In addition to different timescales, optimization can be performed on different spatial scales. Decisions of where a 5G NR beam should be directed and with what power are important because they will affect which UEs can access the network from that cell, what degree of modulation they can achieve, and thus what data rate is possible. For a network containing a multitude of UEs and beams, the problem is to find the direction and power of each static beam such that as many users as possible achieve coverage while minimizing the interference that non-serving beams cause to those UEs that they do not serve. This is generally a problem best solved at the system level over larger clusters of beams, cells, and gNBs.
As the power or directivity of a single beam changes, UEs will elect to get access from different beams; beams that once offered beneficial service will become detrimental as interferers, and vice versa.
The beams can be thought of as interdependent. They must help each other by providing enough coverage but not to the extent that they have excessive overlap; that would result in excessive interference and loss of capacity. Thus, optimization becomes a problem of how to balance the beam powers and directivity of many beams over an area or cluster of cells. A piecemeal approach of localized decisions will be unable to find the unique mix of changes across a cluster of cells. But considering the cells in a cluster together facilitates the optimal trade-off between coverage and interference, allowing delivery of the capacity profile in the places across the geographical area that best satisfies the demand from users of the applications offered by a network.
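As a rough, hypothetical illustration of why cluster-level decisions matter, the sketch below scores a candidate set of per-beam power offsets by the number of users covered above an SINR floor and by the mean SINR across a small cluster, and it searches the joint combinations rather than tuning each beam in isolation. The received-power matrix, offset grid, and thresholds are invented for illustration only.

```python
# Hypothetical sketch: score a cluster-wide beam-power configuration by trading
# coverage (users above an SINR floor) against interference (mean SINR).
import itertools
import math

def sinr_db(serving_dbm, interferers_dbm, noise_dbm=-100.0):
    lin = lambda d: 10 ** (d / 10.0)
    interference = sum(lin(p) for p in interferers_dbm) + lin(noise_dbm)
    return 10 * math.log10(lin(serving_dbm) / interference)

def cluster_score(rx_power_dbm, power_offsets_db, sinr_floor_db=0.0):
    """rx_power_dbm[u][b]: received power of beam b at user u before offsets."""
    covered, total_sinr = 0, 0.0
    for per_beam in rx_power_dbm:
        adjusted = [p + o for p, o in zip(per_beam, power_offsets_db)]
        serving = max(range(len(adjusted)), key=lambda b: adjusted[b])
        s = sinr_db(adjusted[serving],
                    [p for b, p in enumerate(adjusted) if b != serving])
        covered += s >= sinr_floor_db
        total_sinr += s
    return covered, total_sinr / len(rx_power_dbm)

# Toy cluster: 3 users x 3 beams; jointly try small power offsets per beam.
rx = [[-80, -95, -97], [-92, -83, -96], [-99, -94, -81]]
best = max(itertools.product([-3, 0, 3], repeat=3),
           key=lambda offs: cluster_score(rx, offs))
print("best per-beam offsets (dB):", best, "->", cluster_score(rx, best))
```

A real coverage-and-capacity optimizer would work with far larger clusters, measured geolocated data, and smarter search strategies, but the essential point is the same: the objective is evaluated over the cluster, not beam by beam.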
The fact that some optimization decisions must be made within the context of a group of entities such as gNBs in a network raises the question of where and how these decisions should be made. Making such decisions requires data from across multiple entities over a finite geographical area of the network. Typically, the data must be used to create models that reduce the volume of the data into much smaller aggregates that capture the essence of the network, its performance, subscribers, or state. These models must be used to support the optimization decisions and must have a scope that includes an area of the network that is sometimes significant. This impedes the ability to make the decisions at points in a network that implement the decision, such as the gNB. Thus, some degree of centralization is required for this type of decision.
However, any drive towards centralization should be balanced against the need for data to use as inputs to the algorithms. More centralization can mean that data is transferred in large volumes from the network edge to a more centralized data center, placing extra strain on the transport network. This need to move data around can also impact the latency of decisions and lead to slower optimization dynamics. A hybrid approach that overcomes these challenges is to perform processing as close to the edge as possible. For example, models that underpin the decisions can be constructed at or near the edge, where the data required to build the models are available. Once these distributed models exist, they can be used in a variety of ways. One alternative is to use them in situ, calling upon them to produce outputs that are then transferred to where the decisions are made. Alternatively, the models that capture the essence of the network can themselves be transferred to the more central locations where optimization decisions are made. The optimal choice for this distributed modeling and decision-making architecture will depend on many factors, such as the volumes of data required to create the models, the availability of distributed computing, and the latency required for decisions to be made.
5. Network Tuning for Optimization
In this section, we build on the principles of the previous sections and describe the details of what can be configured, tuned, or otherwise changed to deliver optimization in the next generation 5G network. We review some of the specific ways that a 5G network can be adjusted or tuned to achieve a change in various aspects of performance. A thorough analysis of this would require much longer treatment, so this section contains only a selection of the many ways to effect change, giving a flavor of the breadth of changes that are possible.
5.1. Beam Configuration
The power with which each beam is transmitted will naturally be a significant factor in determining how far from the transmitter it can be received by different UEs. The transmit power will also contribute to what level of service can be received at which locations. Increasing the power will tend to increase the range and the coverage, but there are limits to this. If the beam is directed toward a structure that is opaque to the electromagnetic spectrum at the frequency of the carrier, then increased power will not achieve further coverage. Conversely, if the power is increased excessively then it will create interference with other beams using the same spectral resources. This will reduce the ability to support UEs in the area of interference. Where capacity is reduced in this way, the types of service that can be supported may be restricted or the service may become entirely unavailable. There will also be constraints on the power at which the beams can be transmitted. These include the capacity of the power amplifier that prepares the signal for transmission. There are also sometimes regulatory constraints that must be respected.
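A minimal sketch of this power trade-off, under assumed values: the transmit power is clamped to hypothetical amplifier and regulatory limits, and the resulting SINR is converted to a Shannon-bound spectral efficiency to show why requesting ever more power eventually stops paying off.

```python
# Hypothetical sketch: clamp transmit power to amplifier/regulatory limits and
# estimate the spectral efficiency a UE could achieve from the resulting SINR.
import math

def achievable_efficiency(tx_power_dbm, path_loss_db, interference_dbm,
                          noise_dbm=-100.0, pa_limit_dbm=46.0,
                          regulatory_limit_dbm=43.0):
    tx = min(tx_power_dbm, pa_limit_dbm, regulatory_limit_dbm)   # hard limits
    rx_dbm = tx - path_loss_db
    lin = lambda d: 10 ** (d / 10.0)
    sinr = lin(rx_dbm) / (lin(interference_dbm) + lin(noise_dbm))
    return math.log2(1.0 + sinr)    # Shannon bound, bits/s/Hz

# Raising the requested power beyond the limits yields no further gain.
for request in (40, 43, 49):
    print(request, "dBm ->",
          round(achievable_efficiency(request, 120, -105), 2), "bit/s/Hz")
```

The numbers are illustrative; a real analysis would also model how the added power raises interference seen by neighboring beams, which is precisely the coupling that cluster-level optimization must manage.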
In addition to the power with which the beams are transmitted, the direction in which they are transmitted is also important. Together with the transmitted power, the beam direction can be used to carefully deliver coverage where it is needed. The objective is a system that does not have undesirable holes in coverage yet manages interference and thus achieves a high level of capacity. In some cases, particularly for time division duplex (TDD) carriers, dynamic beamforming may be used to direct the beams in response to where the users are located, or even track them as they move. In this case, coverage beams facilitate the initial system access and handover between cells. These coverage beams must also be configured to achieve the required coverage while maximizing capacity.
Some systems will deliver retail or commercial broadband connectivity only to static customer premises equipment (CPE). This is a simpler scenario in general, as the areas where coverage is required are dynamic only to the extent that new subscribers are added to the service. To provide a more resilient service, some operators may impose targets for beam redundancy so that if one beam is occluded by a transient body such as a truck, the service is maintained through the other beams. But it is in systems that deliver full mobility that coverage becomes so important. In this case, for those beams that are not dynamically formed, the task of optimization is to maintain enough overlap between static beams such that users can move between the coverage areas of beams on the same cell (intra-cell beam mobility) and between beams on different cells (inter-cell beam mobility). This must be done in a way that keeps the beams spatially separated enough to avoid excessive interference and the associated loss of capacity.
When coordinated scheduling across multiple beams serving the same UE is used, data can be received on the same spectral resources from multiple beams for the same UE. This coordination can also be exploited for optimization.
The way that feedback is configured for beam management and mobility is also flexible. The optimal configuration will depend on whether the carrier uses frequency division duplex, time division duplex, or supplemental uplink or downlink. But there will be flexibility in the feedback configuration that will impact the system’s ability to respond to dynamics promptly.
5.2. Neighbor Lists
One of the functions of the UE is to measure beams that are not currently serving it, but which are candidates for providing service. When certain conditions are met, the UE will send a measurement report to the network including the quality of various candidate beams. The network may then elect to direct the UE to receive service from a new beam. To help the UE perform this, it is sent a neighbor list by the network, telling it what beams to attempt to measure. The network must decide how to populate the neighbor list and how many candidate handover target beams to include. Excessive numbers of neighbors can cause the UE to perform unnecessary measurements and also delay the process of transition to service from new cells. If good neighbor candidates are missing from the neighbor list, the risk of the connection being interrupted increases.
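The sketch below shows one hypothetical way such a list could be maintained: candidate cells are ranked by how often UEs have reported them above a signal threshold, and the list is capped so the UE is not burdened with unnecessary measurements. The thresholds, cell names, and report format are assumptions made for illustration.

```python
# Hypothetical sketch: build a bounded neighbor list from historical UE
# measurement reports, keeping candidates that were reported strong most often.
from collections import Counter

def build_neighbor_list(reports, max_neighbors=8, min_reports=5):
    """reports: iterable of (neighbor_cell_id, rsrp_dbm) measurement samples."""
    strong = Counter(cell for cell, rsrp in reports if rsrp > -110.0)
    ranked = [cell for cell, n in strong.most_common() if n >= min_reports]
    return ranked[:max_neighbors]

samples = [("cell_17", -98), ("cell_17", -101), ("cell_42", -95)] * 5 \
          + [("cell_99", -118)] * 20        # too weak: never a good candidate
print(build_neighbor_list(samples))         # ['cell_17', 'cell_42']
```

In practice this kind of pruning is what automatic neighbor relation functions do, drawing on far richer statistics such as handover success rates per neighbor.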
5.3. Physical Cell Identity
The beam mobility and handover process described above is facilitated by cell and beam identities. Each cell in a network is given an identifier that is unique in the network, and each beam on a cell is assigned a unique identifier within that cell. As the cell identifiers are unique within the network, they have to be represented by many bits of data to avoid collisions between identifiers. The process of decoding the full cell identity is a relatively costly operation for the UE. To speed up the process of measuring and reporting the strength with which beams from different cells are received, cells are also identified by a more concise physical cell identity (PCI). This PCI for a cell is determined by the combination of the identities conveyed by both the Primary Synchronization Signal (PSS), which can take one of three values, and the Secondary Synchronization Signal (SSS), which can take one of 336 values. Together there are 1,008 unique PCIs.
These 1,008 PCIs cannot be assigned uniquely across a network once the number of cells exceeds 1,008. To overcome this lack of uniqueness, there must be sufficient spatial separation between cells with coincident PCIs. It is important to maintain this PCI integrity so that when a UE reports a PCI associated with a measurement, the target cell can be unambiguously identified. This may require ongoing management of the PCIs, for example where network densification adds new cells and the distribution of PCIs becomes suboptimal over time.
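The PCI arithmetic itself is simple, and a small sketch can show both the derivation from the synchronization signal identities and a basic check for colliding PCIs among neighboring cells. The neighbor relation used here is a hypothetical input; in practice it would come from planning data or automatic neighbor relations.

```python
# Sketch: derive the PCI from the synchronization signal identities and flag
# neighboring cells whose PCIs collide, making measurement reports ambiguous.

def pci(sss_id, pss_id):
    """PCI = 3 * N_ID1 (SSS, 0..335) + N_ID2 (PSS, 0..2) -> 0..1007."""
    assert 0 <= sss_id <= 335 and 0 <= pss_id <= 2
    return 3 * sss_id + pss_id

def pci_conflicts(cell_pcis, neighbors):
    """Return neighbor pairs that share the same PCI."""
    return [(a, b) for a, b in neighbors if cell_pcis[a] == cell_pcis[b]]

cells = {"A": pci(10, 1), "B": pci(10, 1), "C": pci(200, 0)}
print(pci_conflicts(cells, [("A", "B"), ("A", "C")]))   # [('A', 'B')]
```

A full PCI planning tool would also check mod-3 and mod-30 relationships that affect reference signal interference, but the collision check above captures the core integrity requirement described in the text.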
5.4. Handover Parameters
As well as deciding which neighbor cells and beams to consider for handover, the system must be able to decide when to hand over service to new cells. Handing over too early can result in ping-ponging between cells, which can interrupt service and make connections less stable, increasing the chance of a service drop. Handing over too late can also cause instability and potential loss of coverage. This process is typically controlled using a variety of parameters that determine what quantities to compare between the serving beams and candidate target beams along with timers and hysteresis parameters.
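As a simplified, hypothetical sketch of this kind of logic, the snippet below evaluates an A3-style condition: the candidate beam must exceed the serving beam by an offset plus hysteresis for a sustained number of samples (a crude stand-in for a time-to-trigger timer) before a handover is proposed. The parameter values are illustrative, and real implementations follow the full 3GPP event definitions.

```python
# Simplified, hypothetical sketch of an A3-style handover trigger: the neighbor
# must exceed the serving beam by an offset plus hysteresis for a sustained
# time-to-trigger window before a handover is proposed.

def handover_triggered(serving_rsrp, neighbor_rsrp, offset_db=3.0,
                       hysteresis_db=1.0, time_to_trigger=3):
    """rsrp arguments are equal-length lists of consecutive samples (dBm)."""
    consecutive = 0
    for s, n in zip(serving_rsrp, neighbor_rsrp):
        if n > s + offset_db + hysteresis_db:
            consecutive += 1
            if consecutive >= time_to_trigger:
                return True
        else:
            consecutive = 0          # condition must hold continuously
    return False

serving  = [-95, -96, -97, -98, -99]
neighbor = [-93, -92, -92, -93, -93]
print(handover_triggered(serving, neighbor))
# True: the margin is exceeded for 3 consecutive samples at the end
```

Tightening the offset, hysteresis, or time-to-trigger makes handover later and more conservative; loosening them makes it earlier and more prone to ping-pong, which is exactly the balance optimization must strike.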
5.5. Physical Layer Parameters
The physical layer is characterized by flexibility in the numerology, namely the choice of what subcarrier spacing to use. To some degree, this choice is limited by the carrier frequency and the bandwidth, but often some flexibility is possible. The objective here is to balance the risk of inter-symbol interference against the risk of phase noise causing inter-subcarrier interference. In general, the risk of inter-symbol interference diminishes as the carrier frequency increases. In contrast, phase noise is less of a problem at lower carrier frequencies.
Where numerology 2 with a 60 kHz subcarrier spacing is used, there is the option of an extended cyclic prefix for cases of very complex propagation with excessive non-line-of-sight components. The degree to which mini-slots and slots for pre-emptive uplink transmission (such as non-orthogonal multiple access) can be configured balances the low-latency requirements against the capacity of the air interface.
In the case of the time division duplex, the limits on the ratio between downlink and uplink transmissions will be controlled by the choice of slot format. The configuration of bandwidth parts (BWPs) will balance what parts of the spectrum can be accessed by which UEs and consequently how the capacity is shared between the mix of device capabilities.
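The basic numerology relationships are fixed by the standard: the subcarrier spacing is 15 kHz scaled by a power of two, and slots shorten proportionally because a subframe is always 1 ms. The sketch below simply tabulates these relationships; the helper function and its output format are illustrative.

```python
# Sketch of the standard NR numerology relationships: subcarrier spacing scales
# as 15 kHz * 2**mu, and slots get proportionally shorter, which is what makes
# the higher numerologies attractive for low-latency scheduling.

def numerology(mu):
    scs_khz = 15 * 2 ** mu              # subcarrier spacing
    slots_per_subframe = 2 ** mu        # a subframe is always 1 ms
    slot_ms = 1.0 / slots_per_subframe
    symbols_per_slot = 14               # normal cyclic prefix
    return {"scs_khz": scs_khz, "slot_ms": slot_ms,
            "symbols_per_slot": symbols_per_slot}

for mu in range(4):                     # 15, 30, 60, 120 kHz
    print(mu, numerology(mu))
# Only numerology 2 (60 kHz) additionally supports an extended cyclic prefix
# with 12 symbols per slot for heavily dispersive channels.
```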
5.6. Layer Management
5G is capable of aggregating multiple carriers from 5G low band, mid-band, and high band (mmWave) together with LTE carriers in different combinations, as well as splitting 5G carriers into BWPs with their own numerologies. Managing which carriers and which BWPs are used in what circumstances will determine what performance characteristics, such as data rates and latencies, are achieved, in addition to the overall system capacity.
Managing load between lower frequency coverage layers and higher frequency bandwidth layers will be important.
5.7. Scheduler Configuration
Transmissions from or to different UEs must be multiplexed onto the spectral resources by the scheduler. How this is done and how the streams from different users, bearers, classes of device, and network slices are prioritized will determine the quality of service experienced by the various applications, network slices, and subscriber classes, as well as the capacity of the overall system.
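One common family of approaches is a proportional-fair style metric, optionally weighted per slice. The sketch below is a hypothetical, heavily simplified version: it picks the UE with the best ratio of instantaneous to average rate, boosted by an assumed slice priority weight. The UE names, rates, and weights are invented for illustration and do not reflect any particular scheduler implementation.

```python
# Hypothetical sketch: a proportional-fair style metric, weighted by a per-slice
# priority, used to pick which UE is scheduled on the next resource block.

def pick_next_ue(ues, slice_weight):
    """ues: dict ue_id -> (instantaneous_rate, average_rate, slice_name)."""
    def metric(ue_id):
        inst, avg, slc = ues[ue_id]
        return slice_weight.get(slc, 1.0) * inst / max(avg, 1e-9)
    return max(ues, key=metric)

ues = {
    "ue1": (50e6, 40e6, "embb"),     # good channel, already well served
    "ue2": (20e6, 5e6,  "embb"),     # weaker channel but starved recently
    "ue3": (15e6, 10e6, "urllc"),    # critical slice gets a priority boost
}
weights = {"embb": 1.0, "urllc": 3.0}
print(pick_next_ue(ues, weights))    # 'ue3': the slice weight tips the balance
```

Changing the slice weights, or the balance between instantaneous and historical rates, is exactly the kind of scheduler configuration that determines how capacity and quality of service are shared between slices and subscriber classes.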
5.8. Idle and Inactive State Operation and Access Parameters
How the UEs operate when not in an active state can be configured. Parameters determine which cell is monitored by a UE in case there is data for it to receive. Other parameters determine when this cell should change. The configuration also determines how often the UE should communicate with the network and in what circumstances it should indicate that it is monitoring a new cell. This will affect many outcomes such as how rapidly the UE can be reached when there is data to send, and how much system capacity is consumed for these communications. This configuration will also determine how much UE battery is consumed. The optimal configuration for these will depend on the type of device. For example, a battery life spanning many years is a paramount requirement for a battery-powered IoT device such as a smart utility meter, while being able to establish connectivity very rapidly is unimportant. This is very different from the requirements for a commercial smartphone, the users of which will generally value instantaneous connectivity and will be happy to charge their devices regularly.
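A simple way to see the trade-off is to compare how often the receiver must wake against the worst-case delay before an idle device can be reached. The sketch below does this arithmetic with hypothetical energy costs and cycle lengths ranging from a smartphone-like cycle to a long IoT-style cycle; none of the figures are drawn from the standards.

```python
# Illustrative sketch (hypothetical numbers): a longer paging cycle means fewer
# receiver wake-ups per day (battery saving) but a larger worst-case delay
# before an idle device can be reached.

def idle_mode_tradeoff(paging_cycle_s, wakeup_cost_mj=2.0):
    wakeups_per_day = 86400 / paging_cycle_s
    energy_mj_per_day = wakeups_per_day * wakeup_cost_mj
    worst_case_reach_delay_s = paging_cycle_s
    return energy_mj_per_day, worst_case_reach_delay_s

for cycle in (1.28, 10.24, 2621.44):    # smartphone-like vs long IoT-style cycles
    energy, delay = idle_mode_tradeoff(cycle)
    print(f"cycle={cycle:8.2f}s  energy={energy:9.1f} mJ/day  "
          f"worst-case reach delay={delay:7.2f}s")
```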
5.9. Placement and Configuration of Network Functions
As discussed above, the 5G RAN is already disaggregated into the CU, DU, and RU logical functions, with further disaggregation planned by initiatives such as O-RAN. This brings choices of where these new elements are hosted, apart from the RU, which must be placed at the point of transmission. There are additional choices concerning where the mobile core functions are placed, along with where the user plane is broken out.
In situations where interference is excessive or extra capacity is required, coordination of beams and carriers between sites will encourage more centralization. The degree to which transmissions can be coordinated will depend on the nature of the transport for the fronthaul links. In cases of high-performance, ideal transport, coordination can be performed on a phase-coherent basis to overcome interference. Less ideal transport will still facilitate transmit and receive diversity, giving an engineering benefit with a single transmission and reception chain consuming fewer resources. The placement of functions may vary by radio carrier or even by BWP, offering yet more flexibility.
Mobile edge computing and distributed compute allow application functions to be deployed flexibly: either centrally or towards the edge. These decisions can vary depending on various factors. The network slices with low latency needs can have their functions pushed as close to the edge as the available compute will allow. Different connections can be assigned to different logical functions for application or core functions, and the degree of redundancy can be balanced against the amount of compute resource consumed depending on the resilience required.
5.10. Transport Configuration
The transport network has its part to play in delivering the performance of the overall system. The quality of services with poor tolerance to latency will depend on how effectively the traffic is prioritized by application, network slice, and device. The route taken through the transport network will also play a part. The resilience of the overall system will depend on whether redundancy is built-in and how the routing is configured to take advantage of this.
6. The Role of Artificial Intelligence and Machine Learning
Many industries have seen substantial benefits arising from the appropriate application of artificial intelligence (AI) and machine learning (ML). Applications are numerous but include speech recognition, language translation, image classification and automated description, detection and prediction of disease, drug discovery, energy and carbon emission saving, automated driving, and many more. It makes sense to ask whether mobile telecommunications can similarly benefit from this technology area. To deliver the promise of 5G, AI and ML will play a central role.
AI has a variety of definitions. In the context of automation, we regard artificial intelligence as using an automated system to solve a problem that humans can solve intuitively but that computers typically find hard. Classifying a picture as containing an image of a cat is easy for a three-year-old child. In contrast, a programmer requires advanced skills and diverse tooling to reproduce the same capability on a computer. This is what characterizes AI. AI often involves ML, but this is not always necessary.
In contrast to AI, ML is the creation of models that describe relationships between inputs, typically called features, and outputs, typically called targets. A well-constructed model will be able to predict the targets given an example of the features. The model may make a variety of different types of predictions. For example, a target may be a number in a continuous range, in which case the prediction is known as regression; a concrete example is a model that predicts how many people will be using a particular mobile service in one hour's time. Another model may make categorical predictions, such as whether certain characteristics are represented in the features; an example is whether users of a particular 5G MIMO beam are experiencing congestion.
Generally, ML models have parameters or state, and an optimal configuration for these must be found for the model to predict the targets accurately from the features. A set of parameters that achieves this accurate predictive power is found using a process known as training. Training involves presenting real examples of the features to the model along with the corresponding targets. The model is then allowed to find a configuration that better describes the relationship, so that it becomes better at classification or regression.
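As a minimal illustration of features, targets, and training, the Python sketch below fits a simple regression model to synthetic hourly usage data and then predicts demand for the next hour. The feature choices, figures, and data are assumptions made only for the example.

```python
# Minimal sketch of the feature/target idea: train a regression model to predict how many
# users a service will have in the next hour from the hour of day and the current user
# count. The data here is synthetic; in practice the features and targets would come from
# network counters.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=500)
current_users = 200 + 80 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 10, 500)
next_hour_users = current_users * 1.05 + rng.normal(0, 8, 500)   # targets

X = np.column_stack([hours, current_users])      # features
model = LinearRegression().fit(X, next_hour_users)  # "training": fit parameters to examples
print("predicted users next hour:", model.predict([[18, 260]])[0])
```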
Models can suffer from limitations. Typically, a model will not perfectly predict the targets from the features. Systematic differences between model predictions and the expected targets are known as bias. Bias can be reduced by training, but this carries a risk: if training is allowed to progress too far, or there is too much flexibility in the model state, then overfitting can occur. An overfitted model predicts the targets well for the data used to train it, but its predictions on unseen data outside the training set will be poor, with generally larger errors. This sensitivity to the particular training data is known as variance. It can arise because the training data do not cover the whole space of features, or because the targets used for training contain errors. The best models combine low bias with low variance, giving the most flexible and accurate predictions.
6.1. Various ML Algorithms
There is much literature available on the various types of ML algorithms and we do not propose to discuss these in detail here. However, we will briefly introduce a selection of approaches that are used to deliver results, and which can play a part in the optimization and automation of wireless communication networks.
Parametric models are used when there is some knowledge about the form of the function underlying the relationship between features and targets. For example, some targets are related to features with linear relationships. Others have higher-order polynomials or cyclical relationships such as some trigonometric functions. In this case, the model can be chosen, and training the model involves finding the values for the parameters that best describe the relationship between the features and the target.
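A minimal parametric example, assuming we already know the relationship is quadratic, is sketched below; training then amounts to estimating just three coefficients. The synthetic data and chosen coefficients are for illustration only.

```python
# Hedged sketch of a parametric model: if we believe the feature/target relationship is a
# low-order polynomial, training reduces to finding the few coefficients of that polynomial.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)                                     # feature
y = 2.0 + 0.5 * x - 0.3 * x**2 + rng.normal(0, 1.0, x.size)    # noisy quadratic target

coeffs = np.polyfit(x, y, deg=2)                                # fit the three assumed parameters
print("fitted coefficients (highest order first):", coeffs)
```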
In contrast to parametric models, non-parametric models do not depend on detailed knowledge of the form of the function underlying the relationship between features and targets. Rather, the model is allowed, through training, to discover this form itself, along with the associated parameters and other components of its state. The terminology can be confusing; non-parametric models do typically have parameters. It is just that non-parametric models allow more freedom in the relationship. A parametric model in the form of a fourth-order polynomial will model a quartic relationship between features and targets very well, and will do so with only five parameters. However, that model will generally be poor at modeling a cyclical relationship, a logarithmic relationship, or a hybrid of these. A suitable non-parametric model could learn the relationship between features and targets for a wide range of function classes.

Neural networks are an example of non-parametric models. They comprise discrete, simple units of computation, linking their inputs to their outputs by simple algebraic manipulation and simple functions. These units are connected together, sometimes in very large numbers. Although each discrete computation unit has very simple predictive power, together the system of units has the emergent power to model arbitrarily complex systems. In this sense it mimics the construction and operation of the human brain, although to date no neural network has come close to the scale or complexity of the organ by which it was inspired. A neural network generally has many parameters, typically comprising weights and biases. Weights control how important each input is to each unit of the network; biases offset the output of a computation unit to boost or suppress that part. These parameters are optimized as part of the training process using a mechanism called back-propagation.
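The sketch below illustrates the non-parametric idea with a small neural network that learns a relationship whose form was never specified to it; the layer sizes and the sine-shaped target are arbitrary choices made only for illustration.

```python
# Illustrative non-parametric model: a small neural network learns a relationship whose
# functional form we have not specified in advance. The sine target stands in for any
# unknown feature/target relationship; layer sizes are arbitrary choices.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 2 * np.pi, size=(1000, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.05, 1000)

net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X, y)                                     # weights and biases found by back-propagation
print("prediction at pi/2:", net.predict([[np.pi / 2]])[0])   # should be close to 1.0
```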
We turn our attention from neural networks to decision trees. These are ML models that repeatedly partition the vector space of features into finer and finer portions, each of which is then assigned to a portion of the target space. Decision trees can be used for classification or regression. The partitions are typically made by logical comparisons involving components of the feature space. The number of decisions made gives the model increasing discriminative power, although allowing too many decisions, or excessive depth in the tree, risks overfitting the model and increasing variance.
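A minimal decision-tree sketch follows, classifying a beam as congested or not from two assumed features; the synthetic data, feature names, and depth limit are illustrative only.

```python
# Minimal sketch of a decision tree classifier: the tree repeatedly splits the feature
# space (here, assumed PRB utilization and active-user counts) to decide whether a beam
# is congested. The data and labelling rule are synthetic, for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
prb_util = rng.uniform(0, 1, 1000)               # fraction of PRBs in use
active_users = rng.integers(0, 200, 1000)
congested = ((prb_util > 0.8) & (active_users > 120)).astype(int)   # synthetic label

X = np.column_stack([prb_util, active_users])
tree = DecisionTreeClassifier(max_depth=3).fit(X, congested)  # limit depth to curb overfitting
print("congested?", tree.predict([[0.9, 150]])[0])
```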
Reinforcement learning is a way to allow a computer agent to interact with a system and learn how to achieve a goal. It has been used as the basis for Google’s AlphaGo system for playing the ancient Chinese game Go. By making exploratory moves in a game or changes in a system where the objectives are known, the reinforcement learner can develop a strategy for what is more successful at achieving an objective and what is less successful.
Reinforcement learning could be considered to deviate from pure ML in the sense that, in practice, it comprises several discrete models internally to deliver the capability.
However, externally it makes predictions about what is the best change to make given the system state to maximize the chance of achieving the goal. In that sense, it is a predictive machine learning model.
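The toy sketch below conveys the flavor of reinforcement learning using simple tabular Q-learning on a five-state chain; it is not the method behind AlphaGo, and all states, rewards, and learning constants are illustrative assumptions.

```python
# Toy reinforcement-learning sketch (tabular Q-learning): an agent learns which action to
# take in each state of a 5-state chain to reach the goal state at the right-hand end.
import numpy as np

n_states, actions = 5, [-1, +1]                  # move left or move right
Q = np.zeros((n_states, len(actions)))
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(4)

for _ in range(2000):                            # episodes of exploratory play
    s = 0
    while s != n_states - 1:
        # explore occasionally, otherwise exploit the current best-known action
        a = rng.integers(len(actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(max(s + actions[a], 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("learned best action per non-terminal state (1 = move right):", Q[:-1].argmax(axis=1))
```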
In systems that have many variables or features, building models able to accept all of these features as inputs presents challenges. Features typically have varying degrees of importance, and the importance of some may be very small or even vanish. This can make the model overly complex and expensive to train or evaluate. One solution is dimensionality reduction. This can include sensitivity analysis to determine which features contribute negligible predictive power, so that they can be eliminated from the model. Even when such features are eliminated, the remaining features may still have more dimensions than necessary. In this case a transformation to a lower-dimensional vector space, typically using linear transformations, can be performed. The outputs of the transformation can then be used as features in place of the untransformed features without harming the predictive power of the resulting models. Principal component analysis is an example of dimensionality reduction: it finds an optimal set of orthogonal vectors onto which to map the data, chosen to maximize the variance captured by each successive component. Auto-encoding is another method for dimensionality reduction that uses neural networks to map the full-dimensional feature data down to a smaller-dimensional layer in the network and then back again to the full dimension. The network is trained to approximate the identity function, that is, to predict its own inputs. If it can do this sufficiently accurately while squeezing the features through a lower-dimensional layer, then the layers beyond the compression layer can be discarded and the output of the lower-dimensional layer used as the reduced feature set.
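The short sketch below illustrates principal component analysis reducing ten correlated, synthetic KPI features to two components; the way the features are constructed is an assumption made only for the example.

```python
# Minimal dimensionality-reduction sketch: PCA maps correlated KPI features onto a small
# number of orthogonal components that retain most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
latent = rng.normal(size=(500, 2))               # two underlying drivers (e.g. load, radio quality)
mixing = rng.normal(size=(2, 10))
kpis = latent @ mixing + rng.normal(0, 0.05, size=(500, 10))   # ten correlated KPI features

pca = PCA(n_components=2).fit(kpis)
reduced = pca.transform(kpis)                    # use these 2 columns as features instead of 10
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```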
For a non-parametric model such as a neural network to perform well and have good predictive power, it must be sufficiently complex and flexible to represent the underlying relationship. However, excessive complexity will not increase the predictive power; it will only add to training and prediction cost and is therefore undesirable. While a neural network has weights and biases that are optimized as part of the training process, it also has other characteristics controlling its architecture. These are not trained by back-propagation but must be configured when the network is constructed: the number of layers, how many units comprise each layer, whether convolutional layers are included, and other aspects. These characteristics are typically referred to as hyperparameters. Their optimal values can be found in a variety of ways. Intuition and experimentation by the network designer are one such method, but a systematic approach to discovering the best way to construct the network can also be employed. This is referred to as hyperparameter optimization. It is also possible to build neural networks that can change their architecture as part of training and operation to adapt to the specific challenges of the features being modeled. Inspired in part by the brain, these are known as self-organizing neural networks.
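A minimal hyperparameter-optimization sketch follows, using a simple grid search over candidate layer architectures scored by cross-validation; the candidate grid and the synthetic data are assumptions made only for illustration.

```python
# Hedged sketch of hyperparameter optimization: a grid search over the number and size of
# hidden layers of a small network, scored by cross-validation. Real searches would cover
# many more architectural choices.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

grid = {"hidden_layer_sizes": [(8,), (32,), (32, 32)]}
search = GridSearchCV(MLPRegressor(max_iter=3000, random_state=0), grid, cv=3)
search.fit(X, y)
print("best architecture found:", search.best_params_)
```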
6.2. Models for Wireless Communications
Having introduced a selection of ML algorithms, here we explore how to bring the theory of ML modeling into the domain of wireless communications. ML gives us the power to build predictive models, but what models can we aspire to build to bring value? We can think in terms of the wireless communication network being a system with stimulus and response. In this case, the stimulus is the demand placed on the network by the variety of users. These users are from a variety of subscriber classes or network slices. They are attempting to access a variety of services in various locations at various times. The wireless network itself will have a state. This includes the physical resources available, along with any configuration state and any impairments in terms of malfunctioning infrastructure, software, or transport links.
The network experiences the demand for services placed on it by the subscribers. Moreover, the network has a state. Together, the demand and the state result in a response by the network. At a fundamental level, the response is the signal strength and signal-to-interference ratio experienced by each subscriber device at any instant, along with whether the device can access the network at all (and if so, for what services and whether it experiences any service disruption). The response also includes the performance metrics experienced, such as data rate, packet loss rate, delay, and jitter. It further includes more abstract aspects, such as how satisfied subscribers are with the service they receive; these are harder to measure as well as to model. Finally, the response includes more indirect characteristics such as the revenue generated from subscribers, how likely they are to cancel their subscriptions, and the energy consumed in operating the system.
We can aspire to model these different aspects of the wireless network: the demand stimulus, the state, and the response. We can seek to model them in isolation, and we can also seek to build models that capture the essence of the relationships between them. For example, a model of the geographical distribution of demand in isolation can be used to determine whether the demand is anomalous at any instant. We can also model the relationship between the geographical distribution of demand and the resulting response of the network. A more sophisticated model would also include the current state of the network when predicting the response to a given stimulus.
As we have seen, we can model the various characteristics of the network and their interactions. But how do we use these to perform functions that deliver business value? Models alone cannot do this; we need the branches of analytics that they enable to deliver value. We explore these next.
6.3. Analytics to Deliver Value from ML Models
We have seen how we can build models of the characteristics of a wireless network and the relationships among its different aspects. To understand how ML models deliver value, it is useful to think in terms of the various branches of analytics. The simplest level is descriptive analytics. Often implemented with an ML model, descriptive analytics is typically used to recognize a situation or phenomenon. For example, it can be employed to recognize that congestion is occurring, that UEs are experiencing poor signal strength, or that demand for a service is unusually high. The problem of determining how subscribers are geographically distributed is also an example of descriptive analytics. Descriptive analytics will recognize various situations, phenomena, or other characteristics, but it will not reveal why these situations are arising. For this, we need diagnostic analytics. Diagnostic models assign a root cause to a phenomenon.
For example, descriptive analytics may tell us that congestion is occurring, but diagnostic analytics will help us to determine whether this is because there are more subscribers than usual, because a transport link is impaired and carrying less data than normal, or because a gNB has failed.
Another stage of analytics is predictive analytics, which helps us to anticipate what will happen in the future. If we develop a model that can predict that the physical fiber that conveys a transport link will fail within the next month, or that demand will be fifty percent higher than normal in a geographical area in an hour, then these are examples of predictive analytics.
Another facet of predictive analytics is the ability to estimate what will occur as the result of a change before that change is made. One example is predicting whether there will be sufficient coverage to support certain applications in a specific area as a result of changing some parameters in the network.
Bringing these together, we can identify problems we have now and problems we will have in the future with descriptive analytics and predictive analytics respectively.
Furthermore, we can determine the root cause of these problems with diagnostic analytics, and then anticipate what might happen as a result of changing the network, again with predictive analytics. This is a lot of information and power afforded us by our models, but we still need to know what to do. This is where prescriptive analytics comes in. Prescriptive analytics takes information about the system, along with knowledge of the policies and the objectives to be achieved, and yields a course of action. Within the constraints of model accuracy and the completeness of the available information, the recommended course of action is the one that maximizes the chance of the objectives being achieved most completely. It is prescriptive analytics that underpins the intent-based optimization mentioned earlier and the intent-based orchestration introduced in part 6.
6.4. Applications of Machine Learning in Telecommunications
Above we introduced some examples of ML models and associated algorithms. We showed how models of wireless communication and associated infrastructure can be envisaged along with the demand placed on them by the users and how they respond in terms of services and other outcomes. We have also seen how models can be employed to deliver different types of analytics. Now we introduce and discuss classes of use cases for wireless communications that are enabled by machine learning.
Detection of anomalies is a key class of use cases for machine learning. Anomalies include changing behavior or characteristics. Anomalies can be indicative of an underlying problem or impairment with the network, or they may indicate a change in the way the network
is being used by the subscribers or how it is responding to the demand. Detecting an anomaly is not the same as a diagnosis of why the anomaly has occurred, which we cover later, but it can be a powerful vehicle to understanding network dynamics. Underlying issues will manifest as anomalies and alert the operator to the existence of problems. Here the operator could be an engineering team or a control system responsible for the automated operational management of the network. Once the anomaly is detected it will open the door to understanding the root cause, triaging, and ultimately resolving any problems thus uncovered.
Potentially, many anomalies can be discovered with machine learning anomaly detection algorithms. When the normal patterns are understood, or modeled, then anomalies can be found in terms of deviations from that modeled behavior. The characteristics that can be modeled include what services and applications are being used on the network, where they are being used, and by what network slices. Anomalies can be found in the state of the radio: Where coverage is good or poor will have a natural cycle from which it can deviate. Interference can similarly be nominal or anomalous. The times and locations when and where different subscribers can connect to the network will have their characteristics. Blocking subscribers is a strategy to avoid overloading the network; this will also have its characteristic patterns which may be normal or anomalous.
The behavior and state of the transport network can be measured, and anomalies detected. For example, measures such as data volume carried, packet loss rate, retransmissions, jitter, and delay will have characteristic behaviors from which they can deviate. Core network functions and applications also exhibit anomalies. This can be in terms of performance measures such as response times, resource utilization, or the physical locations of functions. The behavior and characteristics of the physical infrastructure on which virtual functions are hosted can also exhibit anomalies.
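As one possible shape for such a detector, the sketch below trains an isolation forest on synthetic transport-link KPIs and flags a sample that deviates from the learned normal behavior; the KPI names and values are invented for illustration.

```python
# Illustrative anomaly-detection sketch: learn the normal joint behaviour of a few
# transport-link KPIs, then flag samples that deviate from it. A real pipeline would use
# per-link counters collected from the transport network.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
normal = np.column_stack([
    rng.normal(400, 40, 2000),     # data volume (Mbit/s)
    rng.normal(0.1, 0.02, 2000),   # packet loss (%)
    rng.normal(2.0, 0.3, 2000),    # jitter (ms)
])
detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

suspect = np.array([[180.0, 1.5, 9.0]])          # low volume, high loss and jitter
print("anomaly!" if detector.predict(suspect)[0] == -1 else "looks normal")
```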
Anomalies can be detected on different timescales. Some need to be detected and reacted to in near-real-time to avoid serious adverse impacts. An example is the congestion of a transport link. If the link is congested, subscribers will be experiencing poor performance. If these are important subscribers, or they are using a critical communication network slice, the operator will want to diagnose the problem and resolve it as quickly as possible.
Detecting anomalies in near-real-time is more demanding for a machine learning anomaly detection algorithm. It must be able to characterize the behavior as anomalous with very few samples spanning a very short time, sometimes much less than a second. A retrospective analysis is more discerning and will be able to detect more subtle anomalies. While this type of anomaly detection cannot allow instantaneous anomalies to be reacted to, this non-real-time analysis can also have value. For example, a failing piece of network hardware, or a network function nearing capacity, may exhibit transient performance changes as a leading indicator of a future failure. Non-real-time anomaly detection can be a useful tool for detecting this.
While the detection of anomalies is important, it is generally followed by diagnosis, since this allows the underlying cause to be understood and potentially resolved. Here machine learning can assist us. Some anomalies are their own diagnosis: for example, the pattern of which applications are being used by which subscribers in which locations can itself change and become anomalous. But other characteristics are indicative of an underlying issue or problem that can be diagnosed. An increased packet loss rate on a transport link can have various causes. It may indicate more demand on the network from subscribers, or it could be due to degradation of the fiber that conveys the transport, which in turn could be caused by a fiber connector performing poorly.
Alternatively, a network switch could be degrading. The best action to take to resolve the packet loss issue will depend on which of these potential triggers is causing it.
Another example of an effect with multiple potential causes is radio interface performance. Achieving a high data rate depends on the signal-to-noise ratio (SNR) being high enough to support the more spectrally efficient modulation and coding schemes, along with the higher-order MIMO that allows more data to be multiplexed over the same channel. SNR falling at particular locations or times has various causes. It can happen when the ability of the signal to penetrate to the subscribers is impaired, for example when an object occludes the signal. It could be due to a transient phenomenon such as a vehicle parking in the line of sight between the transmitter and the receiver, or to a seasonal cycle such as foliage on trees. The SNR may also suffer a more permanent degradation resulting from new construction that impedes signal transmission. Another cause of falling SNR is an increase in interference. This can happen when signals on nearby beams intended for other subscribers are attenuated less than they were previously, are transmitted with more power, or utilize more of the spectral resources than before. How this loss of SNR is addressed will of course depend on the underlying cause. Machine learning models can be trained to detect these underlying causes. This can be done directly, so that the characteristics of each cause are recognized. Alternatively, some root causes can be diagnosed by creating and training models of the normal operation of each aspect of the network; anomalies that are correlated in time can then be found and organized into causal relationships.
Subscribers must be able to access the network initially when the device is powered on if they are to be able to achieve service. While anomalies in terms of what proportion of subscribers can access the network at different locations and times can be found, there are many and varied potential root causes for this. These range from poor SNR discussed above, with all the potential root causes that this can entail, to constraints of admission control. Problems with the transport network can also be the culprit as can the core network functions such as the access and mobility management function (AMF).
The virtualized functions may themselves be the problem, or the physical infrastructure on which they reside. Again, machine learning models can support troubleshooting by finding correlated anomalies or by directly recognizing the patterns of behavior that relate to specific diagnoses.
Anomaly detection will find situations potentially indicative of poor performance or other adverse issues, either during the issue or after it has taken place. There is also value in predicting when adverse situations will happen in the future. For example, the prediction of future demand on the network will allow the network capacity to be planned accordingly. New demand can lead to increased congestion or interference and reduced coverage. New sites, cells, or carriers can be added in anticipation of the growing demand for data from new services and increases in subscriptions, and the adverse impact on services can thus be avoided. Modeling the cyclical aspects of the demand on daily, weekly, and seasonal cycles, along with the secular trend, can be coupled with models of the relationship between new services and the corresponding change in demand for data.
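A minimal forecasting sketch along these lines appears below; it fits only a daily cycle plus a linear growth trend to synthetic traffic, standing in for the richer seasonal and service-aware models described above.

```python
# Minimal demand-forecasting sketch: fit the daily cycle and the underlying growth trend
# of traffic in an area, then extrapolate a week ahead. The synthetic series and the
# simple sinusoid-plus-trend form are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.arange(24 * 90)                               # 90 days of hourly samples
rng = np.random.default_rng(8)
traffic = (500 + 0.05 * hours                            # secular growth
           + 150 * np.sin(2 * np.pi * hours / 24)        # daily cycle
           + rng.normal(0, 20, hours.size))

X = np.column_stack([hours,
                     np.sin(2 * np.pi * hours / 24),
                     np.cos(2 * np.pi * hours / 24)])
model = LinearRegression().fit(X, traffic)

future = hours[-1] + 24 * 7                              # one week ahead
X_future = [[future, np.sin(2 * np.pi * future / 24), np.cos(2 * np.pi * future / 24)]]
print("forecast traffic (Mbit/s):", model.predict(X_future)[0])
```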
The transport network is critical to the integrity of the network. Service in parts of the network can be lost if a break occurs in a physical fiber. If a fiber passes through ground subject to subsidence, this can cause strain and result in partial failure or a complete break. Predicting these ahead of time using ML models allows the problem to be pre-empted and the fiber replaced. Models optimize this activity by allowing the replacement to be performed in time to avoid disruption, but not so early that there is an unnecessary increase in operating expenditure.
Just as the physical fiber can degrade and fail, so too can the computer platforms and other physical assets in a virtualized network. Some parts will generally not be virtualized, such as the radios and the high-performance signal processing hardware that generates the modulated radio signal. Predicting when these will fail through hardware faults allows them to be replaced or repaired before service is interrupted. Again, ML models can learn the leading indicators of what will fail and when.
As operators become more reliant on revenue from delivering services to industry verticals with strict SLAs attached, there is a growing need to predict when those SLAs will be breached, since breaches carry financial penalties and reputational costs. Similarly, predicting when and how requirements imposed by the regulator will be infringed is critical to minimizing the associated damage and costs.
If a UE has a GPS or other global navigation satellite system (GNSS) module, it can measure its own location. There are mechanisms in the mobile standards to support the UE in making location estimates using satellite location systems or other techniques that do not rely on satellites, and further mechanisms allow those location fixes to be reported to the network. These fixes can be used for various applications, such as responding to emergency calls or limiting service to certain geofenced areas. Coupled with measurements of the characteristics of the radio network at those locations, they can also be valuable for managing and optimizing the performance of the radio network.

However, some locations, such as inside buildings or in very dense urban areas with high-rise buildings, have poor visibility of the GNSS satellites. Some types of UE lack GNSS subsystems altogether and cannot make their own location fixes. Additionally, the connections in which the mobile routinely reports its location can be a tiny proportion of all mobile connections. Increasing the number of devices and connections in which the UE is requested to report its location is possible but has limitations: running the GNSS subsystem increases the drain on the UE battery, and reporting the location to the network consumes some network capacity, leaving a little less for user data. These constraints limit the visibility of the geographical distribution of network performance and can bias that distribution towards certain locations or device types.

A solution is to estimate the location of devices without relying on GNSS fixes. This can be achieved using measurements that the UE routinely reports to the network as part of the normal management of radio resources, including signal strength and timing information. These measurements are not designed for location estimation, but they are correlated with the location of the UE. This geolocation approach is heavily dependent on machine learning, because measurements made for radio resource management must be combined with external data and correlated with geographical location. In this way, UE locations can be estimated even when very few or no GNSS locations are reported.
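One possible form of such a geolocation model is sketched below: a k-nearest-neighbors regressor learns the mapping from assumed RSRP and timing measurements to position, trained on UEs that do report GNSS fixes, and then estimates positions for UEs that do not. The path-loss model, site layout, and measurement set are all assumptions made only for the example.

```python
# Hedged geolocation sketch: learn the mapping from routine radio measurements (assumed
# RSRP from three sites plus a coarse timing value) to position, using GNSS-reporting UEs
# as training data, then estimate positions for UEs without GNSS fixes.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(9)
true_xy = rng.uniform(0, 1000, size=(3000, 2))           # metres, GNSS-reported training UEs
sites = np.array([[0, 0], [1000, 0], [500, 1000]])
dists = np.linalg.norm(true_xy[:, None, :] - sites[None, :, :], axis=2)
rsrp = -60 - 35 * np.log10(dists + 1) + rng.normal(0, 3, dists.shape)  # toy path-loss model
timing = dists[:, 0] / 78.0 + rng.normal(0, 0.5, dists.shape[0])       # coarse timing proxy

features = np.column_stack([rsrp, timing])
locator = KNeighborsRegressor(n_neighbors=15).fit(features, true_xy)

# estimate the position of a UE from its reported measurements alone
estimate = locator.predict(features[:1])
print("estimated vs true position:", estimate[0], true_xy[0])
```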
Once the locations of UEs can be estimated, the geographical distribution of radio conditions can be measured and correlated with other factors such as terrain, building density, and so on. ML models can be created using this information to answer questions about how to expand the network, for example. The capacity that can be achieved with cells using MIMO and beamforming will depend in part on the environment in which they are installed and the locations chosen. Thus, ML models can support the decision of where to place MIMO cells to balance system performance with return on investment.
We have introduced various concepts where descriptive analytics is used for detecting anomalies, diagnostic analytics can rapidly identify the root cause of an issue, and predictive analytics can tell us about an adverse event before it becomes an issue.
Now we turn our attention to prescriptive analytics: deciding what to do to deliver the best performance that satisfies the requirements placed on the network. Various aspects must be optimized to achieve this. Coverage optimization encompasses optimization of the coverage beams and static user beams to ensure that the network can be accessed in the required locations while also meeting any targets for beam redundancy, for example. ML can help us here. Models can be built of the performance that would be achieved as a result of changing various parameters, such as transmit powers and beam parameters like direction. These models can predict coverage along with other targets of interest, such as interference, capacity, beam redundancy, spectral efficiency, and data rates, and how these vary by location. Such models can then underpin a system that searches for the combinations of parameter configurations that best achieve what the network is being asked to do: the effectiveness of each parameterization can be evaluated before deployment, and the parameterization predicted to best meet the needs can be chosen.
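The sketch below illustrates this prescriptive step in miniature: a handful of candidate parameter sets is scored against a toy surrogate for a trained predictive model, and the configuration that meets the coverage target with the highest predicted capacity is recommended. The parameters, thresholds, and surrogate model are all assumptions for illustration.

```python
# Illustrative prescriptive-analytics sketch: score candidate parameter sets against a
# predictive model of coverage and capacity, and recommend the best one. The toy
# predictive model below stands in for a model trained on network data.
def predict_outcome(tx_power_dbm, tilt_deg):
    """Toy surrogate for a trained model: returns (coverage %, capacity index)."""
    coverage = min(99.0, 88 + 1.2 * (tx_power_dbm - 40) - 0.8 * abs(tilt_deg - 4))
    capacity = 100 - 1.5 * (tx_power_dbm - 40) + 2.0 * (6 - abs(tilt_deg - 6))
    return coverage, capacity

candidates = [(43, 2), (43, 6), (46, 4), (49, 6)]          # (power dBm, tilt deg) options
target_coverage = 95.0

def score(params):
    coverage, capacity = predict_outcome(*params)
    return capacity if coverage >= target_coverage else -1  # coverage target must be met first

best = max(candidates, key=score)
print("recommended (power dBm, tilt deg):", best, "->", predict_outcome(*best))
```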
6.5. Network Slicing and Intent-Based Optimization
As models become more sophisticated, and as network slicing becomes the normal way to serve competing requirements within the same infrastructure, more aspects of the network parameterization choices and the corresponding performance will be required as inputs and outputs for these models. These will include choices such as the numerology, the cyclic prefix, the RACH channel configuration, and the configuration of BWPs. They will take account of the variations in UE capabilities across the device population, including which numerologies are supported and what channel bandwidth can be accessed on which carriers, so that the parts of the spectral resources accessible to each class of device are modeled. Thus, the overall composite capacity achievable for the mix of devices and subscribers over multiple network slices will be resolved by the model. The most sophisticated models will not only capture the dynamics of UEs as they move around a network on a specific spectral resource; they will complement this with model components that capture the interactions between network layers. Network layer modeling will include which carriers and BWPs are used to deliver service to each class of subscriber in the various locations, as well as the interworking between LTE and 5G NR carriers in dual connectivity mode, respecting the system parameters that control the management of spectral resource layers.
Such powerful models will capture more faithfully the complex system of interactions that is the NR and LTE radio interface and will underpin optimization of the many parameters and choices that can be tuned in the next-generation networks. The resulting networks will utilize the physical resources more efficiently and deliver the best balance of coverage and capacity for each class of subscriber on each network slice.
These AI and ML models, with ever more powerful predictive capability, will be in the ascendancy as initiatives such as disaggregation in the RAN, the service-based architecture in the core, and the O-RAN Alliance open up the network into more discrete components. As these components become more programmable, so will grow the data they expose to fuel the next generation of advanced AI models. These open, programmable networks and the transport networks that connect them, when combined with sophisticated models, optimization, and prescriptive analytics, will be a potent combination. Drawing on many areas of advanced technology, with autonomic monitoring, self-regulation, and intelligent adaptability, we will have the most sophisticated hybrid digital and physical systems ever created by humans. From these new artifacts will emerge a communication system that delivers a richness of experience with breadth and depth well beyond what we dare to imagine today.