Five Impactful Considerations for Your Successful Digital Transformation - Hybrid Cloud Architecture for Telecom, Finance & Health Verticals
Omar F. Mendoza
CTO | CISO | Architecture & Planning | Mission Critical Infrastructures | Security | 5G/LTE | Technology Migrations | Private Cloud | Technical Product/Program Management
The public cloud is only 98.5% reliable, sold as "reliable enough". I evangelize for private-cloud-only or hybrid cloud adoption to support the digital transformation of regulated verticals. By doing so, regulated companies can build a private carrier-grade cloud and still take advantage of some opportunities, and of the fast development of AI (Artificial Intelligence), in the public cloud.
Why private only or hybrid, why not public?
Note: in extreme circumstances, the private cloud can be one or many dedicated physical PODs (Points of Delivery) with restricted physical access, collocated in the same data center as the public cloud.
What is this article about?
This article is intended to be a guide to the most crucial factors that senior leadership, finance, HR, architects, and program managers should consider when they objectively align the needs of the company's digital transformation objectives. The company must align those needs with the EA (Enterprise Architecture) and the technical architecture of a mission-critical infrastructure that supports highly available global business operations for regulated companies like Telcos and financial companies. Although it applies to any company, there are several questions to answer before the company moves to the public cloud infrastructure or any cloud-based infrastructure model.
In this article, when I mention site reliability, I am talking specifically about the infrastructure as a whole, not about the reliability of a website or web application. Be careful: for availability-calculation purposes, the applications are connected in series with the infrastructure, never in parallel. Also, let's assume the applications are carrier-grade, with a minimum requirement of five nines of availability and reliability (five nines refers to 99.999% uptime).
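To make the series relationship concrete, here is a minimal sketch. It assumes independent failures, and the function name is illustrative; the two figures are the ones used in this article.

```python
from math import prod


def series_availability(*components: float) -> float:
    """Availability of components chained in series.

    Every component must be up for the service to be up, so (assuming
    independent failures) the result is the product of the individual values.
    """
    for value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError("availability must be a fraction between 0 and 1")
    return prod(components)


# A five-nines application running on top of a 98.5% infrastructure.
app = 0.99999
infrastructure = 0.985
combined = series_availability(app, infrastructure)
print(f"application on infrastructure: {combined:.4%}")
```

The combined figure can never be better than the weakest element in the chain, which is why a five-nines application cannot deliver five nines on a 98.5% infrastructure.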
Basic concepts
The organization needs to understand, know, and be conscious that:
You cannot wait until you have everything 100% perfectly set, or you will never have a product ready to deploy. Your company must understand the value proposition of the technology transformation and have a minimum viable product covering at least 99.999% of the identified critical variables and features from the company's "what-if" business plan, and then develop a business case based on that strategy, with the highest levels of security.
A beautifully written business case without a strategic "what-if" business plan may become a eulogy for your company.
Many companies skip the what-if business plan and present beautifully written business cases. Then, two years later, the company discovers the costly mistakes and tries to find financial equilibrium at the expense of its human resources and morale.
Speaking of security, notice that I don't talk about security policies, standards, or best practices. By the time you finish reading this article, the recommendations, policies, standards, and best practices for security may have already changed or evolved. Anything you are learning today is already obsolete or already improved upon. You can never be too paranoid about security. What I can tell you for sure is that security is expensive, it will transcend technology, and it is and always will be dynamic.
The cloud was born and evolved to support massive, unreliable web services using regular COTS (commercial off-the-shelf) hardware, and over time it was optimized to do exactly that: support web services and web applications over TCP. The cloud was never intended for UDP traffic.
What are Reliability and Availability?
Understand that the cloud infrastructure will provide 98.5% availability and reliability. There is a tendency to use these terms as if they meant the same thing, but reliability and availability are two different factors. Based on ITU-T Recommendation E.800:
TIP: You can think of reliability as how frequently a system fails, measured in MTBF (Mean Time Between Failures). It is critical to measure the MTTR (Mean Time To Repair) as a KPI (Key Performance Indicator) because the MTTR will influence the availability.
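The way MTTR influences availability can be sketched with the standard steady-state relation, availability = MTBF / (MTBF + MTTR). That formula is a common engineering convention rather than something stated in this article, and the hour values below are made up for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure frequency (same reliability), different repair times.
fast_repair = availability(mtbf_hours=1000, mttr_hours=1)
slow_repair = availability(mtbf_hours=1000, mttr_hours=10)
print(f"MTTR  1 h: {fast_repair:.4%}")
print(f"MTTR 10 h: {slow_repair:.4%}")
```

Two systems that fail equally often can still have very different availability, which is why the MTTR deserves its own KPI.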
For example: let us assume a tier-4 data center with an availability of 99.995%, and inside the data center a cloud infrastructure is deployed with a single physical POD (Point of Delivery) - not the Kubernetes Pod. This POD will lower the data center reliability and availability to 98.49%. If you have multiple physical cloud infrastructure PODs in parallel, using type-1 virtualization systems or containers, then the availability and reliability will increase. Very few private cloud infrastructures reach or exceed 99.995% availability per data center, like the ones I have had the opportunity to work on.
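The arithmetic behind this example can be sketched as follows. This is a simplified model that assumes independent failures and identical PODs, with the PODs in parallel with each other and in series with the data center; the function names are illustrative.

```python
def parallel_availability(single: float, count: int) -> float:
    """Availability of `count` identical redundant units when any one of them suffices."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - single) ** count


def site_availability(data_center: float, pod: float, pods: int) -> float:
    """PODs in parallel with each other, in series with the data center that hosts them."""
    return data_center * parallel_availability(pod, pods)


DATA_CENTER = 0.99995  # tier-4 data center figure from the text
POD = 0.985            # single physical POD figure from the text

for pods in (1, 2, 3):
    print(f"{pods} POD(s): {site_availability(DATA_CENTER, POD, pods):.4%}")
```

With one POD the result is just under 98.5%, in line with the 98.49% quoted above. Adding PODs moves the site toward the data center's own 99.995%, which under this model acts as the ceiling.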
Let's talk about the Cloud
The NIST definition of cloud: cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
My definition of cloud is similar:
Cloud: "The cloud", or cloud computing, is a set of orchestrated infrastructure that provides a service of shared, interconnected, and distributed resources - resources composed of the orchestrator, computing power, cooling, network, and storage - which are metered and scalable on demand.
This cloud infrastructure will eventually support PaaS (Platform as a Service) like OpenShift, or SaaS (Software as a Service) like Gmail or Office 365. That's why I see PaaS and SaaS as clients running on top of the cloud infrastructure, and they should follow the same recommendations that I am sharing in this article.
On the public cloud, the customers are responsible for the reliability and availability they want to reach. After building several "what-if" business plans and developing the respective complex business cases, I don't see any strong advantage for a well-established regulated corporation in moving to the public cloud. In some cases, however, it opens an opportunity through a hybrid architecture solution - meaning a private cloud plus some opportunities on the public cloud, such as AI analytics and/or web front-end scaling on demand.
In a private cloud, the company should have control of all the business and operations variables and be able to deliver the SDI (Software Defined Infrastructure). This should be a highly secure infrastructure architecture with automation, management, capacity, and scalability, not necessarily using COTS servers - the private cloud tends to use highly customized hardware for its server builds, and even Mainframes. In either case, you need to identify all the variables and features to make the right decision on what can be moved to the public cloud versus what needs to stay in the private cloud, and on whether the hybrid solution is the right one for your company. I have found architecture designs for private clouds that had built a fully redundant infrastructure and still were not taking advantage of it.
The Five Considerations To Move To The Cloud
First things first, ignore the cloud sales pitch: “Cloud will abstract the infrastructure, and you will move from a CAPEX model to an OPEX model, with fewer resources (human resources), and with complete automation and capacity on demand.”
The following steps will give the stakeholders, program managers, network, security, storage, and infrastructure architects, the opportunity to gather information on the most important variables that will help them estimate the actual business and finance impact and make the right decision. The PaaS and SaaS will require similar considerations.
Ask and understand how the cloud operations work and whether the company can easily integrate the existing SDI. At a minimum, your leadership, architecture, and program managers need to understand the business impact of the answers to the following questions:
1. Understand your current Infrastructure - Start with a self-discovery
Does your company have a solid business case with all the variables in place? Have you exercised the "what-if" for each application, such as: what if we move this app to the cloud? The what-if business plan must include quantifiable answers to the upcoming questions and the voice of the customer. Consider people, process, structure, culture, and strategy shifts, as professors Charles A. O’Reilly III and Michael Tushman show in the book Lead and Disrupt: How to Solve the Innovator's Dilemma.
Does your company clearly understand what is driving it to move to the cloud? Is it platformization or platformification with AI? If your digital transformation is based on taking exhaustive advantage of data analytics with AI, the legal variable for any regulated vertical may - and in my experience, will - make it neither financially nor strategically advantageous to move to the public cloud. The good news is that you can still have your private cloud, where anyone inside your organization can take advantage of the privately shared infrastructure. Your leadership needs to be aware of all the variables. I recommend the book by professors Marco Iansiti and Karim R. Lakhani: Competing in the Age of AI.
Does your company have a customized infrastructure and/or applications that give it an advantage over existing market competitors? Many companies in different verticals have a customized OS (Operating System) running on their servers, switches and routers, RAN (Radio Access Networks), the packet core, etc.
Does your company have SLI, SLO, and SLA defined? The Telcos, banks, and the stock exchange (market data networks) have the SLA defined as "five nines minimum" - an SLA of 99.999% availability and reliability - and you have to start from there to design your infrastructure and define your SLIs and SLOs. I point this out because public cloud providers and many unregulated communication companies begin with the SLI. Government regulations have different implications for each vertical.
Will you run a life-critical supporting service on a public cloud that offers you 98.5% availability on a single physical POD? That will touch on ethical variables that are very difficult to quantify, for example, when your direct or indirect customer is unable to use the phone service to call 911 or the police department. High availability and reliability are costly and can be more expensive than you expect if you use the public cloud.
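To put those percentages in terms a non-technical stakeholder can weigh, the following sketch converts an availability figure into expected downtime per year. It assumes a 365-day year, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, value in (("98.5% (single POD)", 0.985), ("99.999% (five nines)", 0.99999)):
    minutes = annual_downtime_minutes(value)
    print(f"{label}: {minutes:,.1f} minutes/year, about {minutes / 60:,.1f} hours")
```

That is roughly 131 hours of potential outage per year at 98.5%, against a little over 5 minutes per year at five nines.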
Is your company unable to forecast its complex capacity needs? The solution should come from taking advantage of analytics with ML (Machine Learning) and AI, which may require human resources adjustments and new skills. But this is not a fundamental reason to move to the public cloud.
Is your company unable to scale dynamically to adjust to demand? For elasticity, or scaling dynamically on demand, let's focus on the private cloud. The scaling capability is linked to the infrastructure forecast. Because your company needs high availability (99.999%), you will have at least two tier-4 data centers with multiple PODs deployed in each data center. Each one of the data centers alone should support the whole data traffic if the other data center goes down. The solution then lies in evaluating and adjusting how your company operates to handle the business demands. Moving to the public cloud is not the right way to solve the elastic scaling requirements.
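The requirement that each data center alone carries the whole traffic can be checked against a capacity forecast. The sketch below is illustrative only: the site names, the Gbps unit, and the numbers are assumptions, not figures from this article.

```python
from typing import Mapping


def survives_single_site_loss(site_capacity_gbps: Mapping[str, float],
                              peak_traffic_gbps: float) -> bool:
    """True if the remaining sites can carry peak traffic after losing any one site."""
    if len(site_capacity_gbps) < 2:
        return False  # a single site has no failover target
    total = sum(site_capacity_gbps.values())
    return all(total - capacity >= peak_traffic_gbps
               for capacity in site_capacity_gbps.values())


# Hypothetical forecast: two data centers, 80 Gbps peak traffic.
balanced = {"dc-a": 100.0, "dc-b": 100.0}
undersized = {"dc-a": 100.0, "dc-b": 60.0}
print(survives_single_site_loss(balanced, 80.0))    # each site alone covers the peak
print(survives_single_site_loss(undersized, 80.0))  # losing dc-a leaves only 60 Gbps
```

Running this check against the forecast, rather than against today's traffic, is what ties the scaling capability to the infrastructure forecast.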
Does your company have Mainframes and plan to move away from them? If yes, what is the reasoning behind moving from a highly reliable system with a minimum of five-nines availability (the Mainframe) to a cloud with a single nine: 99.999% Mainframe uptime versus 98.5% cloud uptime for a single-POD system? What is triggering the company's transformation to the cloud? Is it just COBOL? Think twice before you move. In the financial industry, banks process large numbers of transactions with large amounts of data. I understand that Fintechs were doing just fine using the public cloud, but they are already moving to a hybrid model. Please read the article "Will the cloud take down the Mainframe".
Check: before you start spending millions of dollars moving off your Mainframes, take a look at the comparison I made, Cloud vs Mainframe & COBOL.
Does your company have an updated CMDB (Configuration Management Database with inventory)? Does it map the physical inventory, including the exact physical and virtual location, applications, application interdependencies, application standby locations, data center capacity, storage capacity, bandwidth capacity, IOPS for storage, IP addresses, and an updated IPAM (IP Address Management) integration? Ideally, if you already have all of this, you should add application instances, source IPs, source ports, destination ports, destination IP addresses, and affinity and anti-affinity for each application, plus a skills map for the existing and future infrastructure associated with internal and external products. These additional variables can have a considerable economic impact on your "what-if" business plan. Ensure those impacts are highlighted in the business case, which is your tactical approach.
TIP: The most challenging part is mapping the dependencies of the existing and future requirements.
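As a rough illustration of why the dependency mapping matters, the sketch below models a tiny slice of a CMDB and walks the dependencies of one application. The record fields, application names, and locations are hypothetical; a real CMDB would carry many more of the attributes listed above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AppRecord:
    """Hypothetical, heavily reduced CMDB entry for one application."""
    name: str
    location: str            # physical or virtual location
    standby_location: str
    depends_on: set[str] = field(default_factory=set)


def transitive_dependencies(cmdb: dict[str, AppRecord], app: str) -> set[str]:
    """Every application that `app` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(cmdb[app].depends_on)
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited, which also protects against cycles
        seen.add(current)
        if current in cmdb:
            stack.extend(cmdb[current].depends_on)
    return seen


cmdb = {
    "billing": AppRecord("billing", "dc-a/pod-1", "dc-b/pod-1", {"auth", "ledger-db"}),
    "auth": AppRecord("auth", "dc-a/pod-2", "dc-b/pod-2", {"directory"}),
    "ledger-db": AppRecord("ledger-db", "dc-a/pod-1", "dc-b/pod-1"),
    "directory": AppRecord("directory", "dc-a/pod-2", "dc-b/pod-2"),
}
print(sorted(transitive_dependencies(cmdb, "billing")))
```

Moving "billing" in this example also raises the question of what happens to "directory", which no one would find by looking at the billing application alone.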
The following diagram is a high-level process to evaluate the technology transformation.
Is the existing network infrastructure at your company flat, or is it properly designed and segmented physically and logically, including your Wi-Fi, private LTE, or private 5G? If it is flat - and surprisingly, I have found that even some Fortune 1000 companies have flat network infrastructures - a human error can bring the whole infrastructure down or, worse, may open a door for a hacker to steal data.
Does the existing network infrastructure at your company separate the user traffic from the control or signaling traffic, security, IoT, and OAM, logically or physically, or both? The Telcos have, at a minimum, the user traffic, control traffic, and OAM separated physically and logically, not sharing the same physical or virtual network device. Multiple separated network infrastructures give you an extra level of security and make it easier to secure and monitor the activities of those accessing the VNFs (Virtual Network Functions). Your company may have the data-gathering pipelines associated with the OAM, or on a separate network infrastructure that aligns with your MEC (Multi-access Edge Computing) and SD-WAN deployments.
Does your company regularly update the network infrastructure diagrams? These are diagrams that show ports, IP addresses, and virtual IP addresses for the IGP and BGP with BGP peers - they should never include passwords or secret words - with the life of a packet for each application, aligned with the CMDB dependencies. Believe it or not, most companies do not have them. If the network and security diagrams only show you a group of boxes with lines representing interconnections and IP addresses, it may be better to build those diagrams from scratch.
Does your company have data pipelines for collecting, cleaning, and processing the data, in addition to access logs? Do you have, or plan to have, a data lake and elaborate BI (Business Intelligence) with analytics, and do you plan to take advantage of ML/AI? If you responded yes, then a private cloud or a hybrid cloud may be the best solution. One important factor is storage: it can be costly in the public cloud, as you cannot share storage space, and public cloud providers charge for data moved into and out of storage resources.
Does your company store and handle data related to PII (Personally Identifiable Information), PCI (Payment Card Industry), trade secrets, etc.? If so, it is better to start with a private cloud. When using the hybrid cloud, the company must eliminate the risk of sharing confidential information by masking the data before it is sent to the public cloud for processing, and only the results are sent back to the private cloud.
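One possible way to mask data before it leaves the private cloud is keyed pseudonymization, sketched below with HMAC-SHA256 from the Python standard library. This is only one masking technique among several, and the field names, key, and record are hypothetical. The key stays in the private cloud, so the public cloud only ever sees tokens.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a sensitive value with a keyed token that cannot be reversed without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, sensitive_fields: set, secret_key: bytes) -> dict:
    """Return a copy of `record` with the sensitive fields replaced by tokens."""
    return {
        key: pseudonymize(str(val), secret_key) if key in sensitive_fields else val
        for key, val in record.items()
    }


# Hypothetical key and record; a real key would come from a private key store.
SECRET_KEY = b"example-only-key-kept-in-the-private-cloud"
record = {"customer_id": "C-1001", "card_number": "4111111111111111", "amount": 42.50}
masked = mask_record(record, {"customer_id", "card_number"}, SECRET_KEY)
print(masked)
```

Because the same input and key always produce the same token, results returned from the public cloud can be matched back to the original customers inside the private cloud through a lookup table that never leaves it.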
Is your company moving to the cloud to be able to scale your WAN capacity on demand? If yes, have you checked the SD-WAN option or dynamic burst bandwidth allocation with your existing WAN provider?
Are your company's applications UDP-based or TCP-based? This is a critical question to answer. If they are UDP-based, I suggest building your own dedicated private cloud that covers all your needs, features, and infrastructure variables, for example with VMware, Mirantis (OpenStack), or Nutanix - no marketing intended. If they are TCP-based, they need to be evaluated case by case. In either situation, I will reemphasize that you need a "what-if" business plan analysis with every variable and feature for your physical and virtual infrastructure, segments, slices, security, applications, availability, reliability, automation, and your infrastructure orchestration and MANO (Management and Network Orchestration).
What is the current organization structure? Will the existing structure need to be changed? If it needs to be changed, how is the Enterprise Architecture impacted?
The following diagram is a high-level proposed organization structure with IaaS and AI departments.
For the finance vertical and Telcos, I suggest having your third-party VNFs (Virtual Network Functions). To start, the company should use its existing vendors, which will simplify re-training and give your company an architectural and operational advantage. That will increase your availability and reduce your MTTR, which in turn will increase the reliability of your operations. For example: if there is a need to deploy on any other public cloud, you can keep a standardized infrastructure design and architecture, uniform operations, and uniform training of the personnel. It is imperative to have a sandbox mimicking your private cloud and to validate all your VNFs. Do not leave anything to a risky and costly fate.
If your company plans to use VNFs tightly coupled to the public cloud in a multi-cloud architecture, your company will lose the infrastructure hegemony and the advantage of ubiquitous operation. In this case, regardless of what the cloud vendor tells you, your virtual infrastructure will be tied to their particular public cloud infrastructure.
TIP: Think of the cloud (private or public) as your virtual Layer 0, on top of which you are going to build the virtual infrastructure that will support your applications and services.
NOTE: Do not use the sandbox provided by the public cloud provider.
2. Security
Are the security and access controls to the VNFs and applications separated, or is there one giant IAM (Identity and Access Management)? In the public cloud, there is one giant IAM. My suggestion is to create a separate IAM group for the OAM, another for network engineering, and a separate one for the security engineering groups to deploy the VNFs. Build your authentication service running on the security or OAM NFVi to operate and access the VNFs with a virtual jump box on the OAM; you should not use the VDI (Virtual Desktop Infrastructure) as a jump box.
It is critical to consider the "coupledness" of the security virtual function applications. These should follow the same considerations I suggest in point four for any application when we talk about tightly coupled versus loosely coupled. In summary, avoid getting tightly coupled to the cloud infrastructure.
A good, secure hybrid design should consider having the private cloud connected through a private CIPX (Cloud Internetwork Packet eXchange). You can deploy this CIPX in your data centers or collocate it in a carrier hotel of your preference, which may simplify future connectivity for a multi-cloud service. Then, by using encryption on the end-to-end transport, you can securely exchange data between your public cloud infrastructure and your private cloud infrastructure.
What is a CIPX (Cloud Internetwork Packet Exchange)?
A CIPX (Cloud Internetwork Packet Exchange) is not a Cloud Exchange, or what some providers call a Cloud Exchange Fabric. The CIPX is a complete, highly available infrastructure whose main function is to be the security infrastructure in the middle, between the corporate data center(s) and the cloud exchange providers or internet service providers. The CIPX can be collocated in a carrier hotel with access to multiple public cloud providers.
A CIPX must have a secure extension of the OAM infrastructure with OAM routers, switches, and firewalls, and the core - where user traffic flows - must have firewalls, IPS, IDS, and routers. I recommend having network taps, a packet broker with a local temporary SIEM (Security Information and Event Management), and optionally a DMZ. Remember: security is paramount.
3. What are your Infrastructure dependability needs?
What is going to be the impact on the existing customers? If you have Mainframes, are there other Mainframes connected to yours? In finance, your company will have the ATM mainframe network and other financial services, like credit card financial services, connected to your Mainframe. Have your "what-if" business plan support the business case; when your company accounts for all the variables, it will find the high costs required to support the unparalleled number of transactions per second at the highest security, availability, and reliability. Please read the Cloud vs Mainframe & COBOL comparison I made and the executive summary of How to understand the design of Mainframes by Shaun Snapp.
Is the choreographer or the orchestrator moving your application instances between servers and between the physical PODs (east-west movement)? In this case, the company needs to understand the level of visibility of these operations, with automatic CMDB updates. The most important consideration is the versioning of the infrastructure the instances are moving to, and whether the NFVi automation aligns with it. If the movement of the applications, instances, and NFVi is not carefully planned, it will reduce the availability of your business. Also, yes, carefully planned movement and dynamic movement or scaling can't go hand in hand. When moving to a different version of the existing infrastructure, the backward or forward compatibility has to be validated in a sandbox. Why? Because this can have a negative impact on the MTTR and hence will affect the availability of the business operations.
TIP: It's critical to ensure that the automation of any dynamic movement of an application instance targets an identical infrastructure, including the version of the virtual machine or container software. The NFVi - the virtualized infrastructure - has to be moved first or built on the cloud first.
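A pre-move gate along the lines of this tip could look like the sketch below: the automation refuses to move an instance unless the target matches the source. The profile fields and version strings are hypothetical, and a real check would compare whatever attributes your NFVi automation actually tracks.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class InfraProfile:
    """Hypothetical description of the infrastructure an instance runs on."""
    hypervisor_version: str
    container_runtime_version: str
    nfvi_release: str


def can_move_instance(source: InfraProfile, target: InfraProfile) -> tuple[bool, list[str]]:
    """Allow an automated move only when the target matches the source exactly.

    Returns the decision together with the names of any mismatching fields.
    """
    fields = ("hypervisor_version", "container_runtime_version", "nfvi_release")
    mismatches = [name for name in fields
                  if getattr(source, name) != getattr(target, name)]
    return (not mismatches, mismatches)


pod_a = InfraProfile("7.0.3", "1.6.2", "2024.1")
pod_b = InfraProfile("7.0.3", "1.7.0", "2024.1")  # newer container runtime
print(can_move_instance(pod_a, pod_a))
print(can_move_instance(pod_a, pod_b))
```

A mismatch does not have to block the move forever; it signals that the combination must be validated in the sandbox first.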
Is your company considering any other specialized VNFs for your NFVi (Network Function Virtualization Infrastructure)? I would suggest using your preferred vendor's vIPS, vIDS, vProbes, vTaps, or vPacketBrokers to build the data pipelines with sophisticated analytics for security and marketing. Today many public cloud providers offer a marketplace with third-party VNFs - check the version of the VNFs.
Are any of your OSI layers tied to a specific library on the containers or the virtualization platform? The answer to this question is yes most of the time, regardless of whether you are using a VNF provided by the cloud provider (tightly coupled) or VNFs acquired from a third party (loosely coupled and highly customizable). If you deploy or move any of the previous items without validating every variable, your availability will almost certainly go down and may reach 0% sooner or later. Keep in mind: "slow is the new down".
TIP: Use your preferred VNF vendor for the vRouters, vFirewalls, vIPS, vIDS, vProbes, vTaps, or vPacketBrokers to build the data pipelines with sophisticated AI data analytics for security, operations, and marketing.
Are you going to use the cloud provider’s OAM? They will not, and should not, allow your company to use it. You have to build your own OAM-NFVi.
Storage flow dependability: as I commented before, consider the data volume, the costs associated with that volume and with moving data in and out, and also the data traffic between data centers.
In summary, you will use the IAM provided by the cloud provider to deploy the VNFs and build the NFVi, and then deploy and interconnect your applications on top of the NFVi. I would never recommend using the same cloud IAM to access and operate the NFVi. For a private cloud, I think you get the idea. Looking to the future, if your company follows this structure, it will be easier to build a hybrid cloud, as you will have a ubiquitous infrastructure.
Security is expensive, and the most secure infrastructures follow the CMM (Capability Maturity Model) linked to a waterfall methodology, or its successor, the CMMI (Capability Maturity Model Integration). Agilists can use the Agile-ish version of CMM, which is a mapping of the disciplined and well-educated Agile methodologies.
4. Now about your applications, let's talk about the minimum considerations to make
- Just a quick refresher: the idea is to serve as a guide for regulated, well-established companies.
Is your company planning a lift and shift? Just move the apps!
TIP: Don’t do it without building your own NFVi (Network Function Virtualization Infrastructure) on top of the cloud IaaS.
Is your company planning to refactor the applications? That implies recoding while keeping the same architecture. This is not for beginners, and your applications will end up tightly coupled to the public cloud you choose - which in my experience is not ideal in the long term, as the reliability of your application will be tied to the public infrastructure of your choice.
Are you rearchitecting (replatforming)? Check the "what-if" business plan to see whether it is strategically worth it. This is going to be very costly and tightly coupled to the public cloud you choose. As mentioned in the previous point, it is not ideal.
Are your applications mission-critical? Safety (911 for example)? My recommendation is to build your private cloud. And with the approval of your legal department, move those applications that do not require high availability to the public cloud. Maybe take advantage of some of the Public cloud ML/AI (visual recognition, data analytics). Remember to mask all your data.
Is your company using or moving to microservices? This can make the life of a developer easy but make infrastructure planning miserable, due to the number of variables to consider and the side-to-side traffic. This is mainly due to retransmissions by the microservices applications running over TCP. I can tell you that the data center traffic may be overwhelmed with TCP retransmissions. This can be alleviated using deep buffers on private cloud switches if the applications are not latency-sensitive, but a microservices architecture is latency-sensitive. You will also notice that these retransmissions are not related to congestion on the internetworking infrastructure but to latency in the server hardware, virtualization, and OS, although they will look like network infrastructure congestion.
Is your company dependent on every cloud platform's web console? In that case, you are locked into the specific public and/or private cloud provider. And in a multi-cloud environment, the company will require more people with skills for each cloud platform's web interface. The best approach is to take advantage of the APIs that the public and private clouds make available. Build your company's own web console with a friendly UI (User Interface), or evolve the existing console. This approach should unify the web UI across multiple public cloud providers, making the private console UI uniform for the company's operations teams. The development of the integrated web UI should run under R&D (Research and Development) and will require less headcount with vendor-specific cloud skills.
For regulated verticals, check the applications' affinity and anti-affinity. They may have countless restrictions, even physical restrictions.
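Affinity and anti-affinity rules can be validated mechanically before any placement is approved. The sketch below is a simplified illustration with hypothetical application names and locations; it treats affinity as "must share a location" and anti-affinity as "must not share a location".

```python
def placement_violations(placement, affinity, anti_affinity):
    """List rule violations for a proposed placement.

    placement:      mapping of application name to location (host, POD, or site)
    affinity:       pairs of applications that must share a location
    anti_affinity:  pairs of applications that must not share a location
    """
    violations = []
    for app_a, app_b in affinity:
        if placement[app_a] != placement[app_b]:
            violations.append(("affinity", app_a, app_b))
    for app_a, app_b in anti_affinity:
        if placement[app_a] == placement[app_b]:
            violations.append(("anti-affinity", app_a, app_b))
    return violations


# Hypothetical placement: the active and standby trading engines share a POD.
placement = {
    "trading-active": "pod-1",
    "trading-standby": "pod-1",
    "market-feed": "pod-2",
    "feed-cache": "pod-2",
}
affinity = [("market-feed", "feed-cache")]
anti_affinity = [("trading-active", "trading-standby")]
print(placement_violations(placement, affinity, anti_affinity))
```

Physical restrictions, such as two instances never sharing a rack or a power feed, fit the same pattern if the location string encodes the rack or feed.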
Telco, LTE, 5G, and market data services (investment infrastructure) run over UDP, for which the cloud was not designed. That is why many carriers have to design their own customized and optimized private cloud.
For example, a tightly coupled application happens when the application uses cloud messaging libraries or cloud accelerator libraries.
To eliminate the multiple risks associated with tight coupling, refactor the applications as loosely coupled. To do so, your company should provide the foundation that allows such a paradigm. Build a highly available private cloud infrastructure with an NFVi. A well-architected NFVi will provide flexibility and security and will work as an abstraction layer to avoid getting tightly coupled to the cloud infrastructure. Your company can start with OpenStack and move to VMware, or vice versa, without being bound to those multiple cloud libraries or extensions. As cloud technology tends to evolve quickly, and the libraries with it, sooner or later the application may lose backward compatibility.
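At the application level, loose coupling usually means the code depends on an interface the company owns rather than on a cloud messaging library directly. The sketch below illustrates that idea with hypothetical names; the in-memory adapter stands in for whatever broker or cloud SDK a real adapter would wrap.

```python
from typing import List, Protocol, Tuple


class MessageBus(Protocol):
    """Interface owned by the company; applications code against this only."""

    def publish(self, topic: str, payload: bytes) -> None: ...


class InMemoryBus:
    """Stand-in adapter for tests; a real adapter would wrap a specific broker or SDK."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, bytes]] = []

    def publish(self, topic: str, payload: bytes) -> None:
        self.sent.append((topic, payload))


class BillingService:
    """Application logic with no import of any cloud-specific library."""

    def __init__(self, bus: MessageBus) -> None:
        self._bus = bus

    def record_payment(self, account: str, cents: int) -> None:
        self._bus.publish("payments", f"{account}:{cents}".encode("utf-8"))


bus = InMemoryBus()
BillingService(bus).record_payment("ACC-1", 1250)
print(bus.sent)
```

Changing the underlying platform then means writing one new adapter instead of touching every application that publishes messages.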
5. Finance Operations Impact
Finance is as critical as security. Moving to the public cloud may have another impact: finance. Check with the legal and finance teams. Your CAPEX is becoming an expense (it may be non-tax-deductible), and some of the CAPEX will become OPEX. When your company moves to the public cloud, most of your CAPEX and OPEX will become expenses. The high initial capital expense will become a recurrent cost: monthly payments to sustain operations. These sustaining costs will need to be segregated by the engineering and finance teams, down to the granularity of each release, newly added functions per VNF, application, and the immediate applicability to your infrastructure. On the private cloud, you still have the CAPEX.