Orchestrating 'Your' Big Compute and Cloud Grid

Faisal J.

Head of Group's IT Infrastructure and Operations

Published: December 23, 2023

The more you peel the onion, the more eye-watering details emerge that can make or break your compute environment. Assuming break is not a choice :), if you are running 'Big Compute' housing workloads in the '000s, take time out to read this, and please contribute your own experiences.

Core fabric

Big Compute is back. And its enabler, the true hybrid grid (infrastructure), is now real and a must-have. But the multiple dimensions of your distributed compute, AKA hybrid, are often oversimplified: operations, reliability, sustainability, uptime, dynamic updates and business functions.

A parallel transition: classical computing is challenged by computational problems whose complexity is too great even at the scale of our most powerful supercomputers. Quantum computing expands our capabilities by using properties of quantum physics to perform calculations. For certain classes of problems, it promises exponential speed improvements.

Why should you spend the next 10 minutes of your life reading this?

This write-up is designed to give you a starting point: a brain dump of the components that will help you avoid becoming the most affected party in the ecosystem, whether you are an end consumer, a provider or a system integrator. No matter how simplified the pitch may be, managing these environments is a very different and far from simple story. It covers how you can avoid a direct business impact, and the must-dos to mitigate blind spots and ensure a long-term "sustainable" run.

Your feedback on this or my other drill-down subjects is gold dust; it's an evolving landscape.

Having gone through a number of these deployments, let me recap. Compute/Cloud/XaaS is easy to understand: it dumps all the complex, no-one-is-interested stuff in a box and lets someone else manage it, while the business focuses on what matters to the business, even if it costs more and has known drawbacks (technical and non-technical). I also believe that it is a must-do to keep pace with innovation and maintain the future readiness of your business, even if it comes with a long-term and risky dependency on a third party (the risk of being held to ransom to run your own business?).

Keep in mind that "managed by experts" says nothing about location: it can be on their premises or on yours.

By the way, as of today, 'cloud' does not necessarily mean remote infrastructure; in fact, true cloud compute must be independent of the premises (yours, the supplier's or a third party's). It's your choice.

It depends completely on the longer-term strategy and the technology stack in question. Rapidly updating systems make sense as SaaS, but enterprise-grade core business systems are a very different story.

However, for us technical folk, managing hybrid technologies and housings can pose challenges, such as ensuring security, compatibility, performance and cost-efficiency.

In this post, let me bring together the best practices for managing hybrid technologies and recommend some key systems that can help you achieve your goals.

Try to triangulate technology and resources, not locations.

You can always play (triangulate) with storage, compute and network, and monitor them, but try to use third-party tools rather than overpriced or overrated native ones. Maintaining the sanity of the system is your role.

The Foundation - your big compute design

A strong foundation is mandatory; don't rush it. Do not get confused by the vanilla designs that different providers show you, almost insisting that you use their design at the network, compute, middleware and application layers. In fact, this is the one thing you must keep proprietary to yourself and your organization.

It is the intellectual property of your organization, defined and maintained by you. If it helps, below are the mandatory elements of "your" design, which you may arrive at based on different technology and function components:

User Interface: The interface(s) through which users interact with the grid.
Security: Ensures the integrity and confidentiality of data and resources.
Scheduler: Determines the order in which tasks are processed.
Data Management: Handles the storage and retrieval of data.
Workload Management: Manages the distribution of tasks across the grid.
Resource Management: Allocates resources based on demand and availability.
Decentralized, local, and global coordination of resources: Such as computer clusters, data analytics, mass storage, and databases.
Standardized, open interfaces (nodes) and middleware (protocols or protocol bundles): Connects computing units to the main grid and distributes tasks.
Artificial Intelligence (AI) and Service-Management Systems: Supports advanced cognitive and analytic mission requirements.
Underlying Networking Infrastructure: Connects all components of the hybrid cloud.
Site Reliability Engineering (SRE): Concerned about the platform’s resiliency, robustness, security, stability, and capacity.
Edge Computing: Processes data closer to the source to reduce latency and bandwidth use.
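
To make the Scheduler, Workload Management and Resource Management elements above a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `Task` class, the `schedule` function, the node names and the "most free cores first" placement rule are all hypothetical, and a real grid scheduler would weigh far more than priority and core count.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Only priority takes part in ordering; lower number = scheduled earlier.
    priority: int
    name: str = field(compare=False)
    cores: int = field(compare=False)


def schedule(tasks, node_cores):
    """Place tasks in priority order on the node with the most free cores.

    Returns (placements, unplaced): a task-name -> node-name map and the
    names of tasks that did not fit anywhere.
    """
    free = dict(node_cores)      # copy so the caller's capacity map is untouched
    queue = list(tasks)
    heapq.heapify(queue)         # scheduler: decides the processing order
    placements, unplaced = {}, []
    while queue:
        task = heapq.heappop(queue)
        node = max(free, key=free.get)   # resource management: pick the emptiest node
        if free[node] >= task.cores:
            free[node] -= task.cores     # workload management: reserve the capacity
            placements[task.name] = node
        else:
            unplaced.append(task.name)
    return placements, unplaced


if __name__ == "__main__":
    jobs = [Task(2, "report", 4), Task(1, "risk-calc", 8), Task(3, "batch", 16)]
    print(schedule(jobs, {"node-a": 8, "node-b": 12}))
```

Keeping the three concerns as separate steps, even in a toy like this, mirrors the point of the list: each is a design decision you own rather than inherit from a provider.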

Remember, the design and management of these infrastructures require careful consideration of the organization's specific needs. Vanilla is good, but any blind spots will be expensive (technically, in impact and otherwise).

Minimum Landing Zone Considerations - Microsoft Azure

Showstoppers, Must-Haves and Mitigations

The core challenge of hybrid technologies is ensuring cost-efficiency across different platforms and services. To address this, you need to monitor and analyze your cloud usage and spending, optimize your resource utilization and allocation, and implement cost-saving measures such as scaling up or down according to demand. There is no single right answer, only the combination of variables that is best for you. Some of the key systems that can help you with cost-efficiency are:

Cloud Cost Management (CCM): The tip of the iceberg, but important to start from here. While the built-in tools from each provider are brilliant, you really need nOps, Morpheus, Aria or other independent third-party tools that not only monitor but also apply AI-enabled optimizations in real time. Look for more than a software tool that tracks and reports your cloud usage and spending; look for something that helps you identify trends, patterns, anomalies and opportunities for cost optimization.
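
As a rough illustration of what "identifying anomalies" in spending can mean, the sketch below flags days whose spend sits far outside the trailing week. It is not how any of the tools named above work; the function name `spend_anomalies`, the 7-day window and the threshold of 3 standard deviations are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def spend_anomalies(daily_spend, window=7, threshold=3.0):
    """Return the indexes of days whose spend deviates strongly from the
    trailing `window` days (a simple z-score test, assumed for illustration)."""
    flagged = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is worth a look.
            if daily_spend[i] != mu:
                flagged.append(i)
            continue
        if abs(daily_spend[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    spend = [100, 102, 98, 101, 99, 100, 103, 250, 101]
    print(spend_anomalies(spend))  # index 7 is the spike
```

A check like this only tells you that something changed; deciding whether it is waste, growth or a misconfiguration is still your call.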

Cloud Resource Optimization (CRO): A CRO is a system that analyzes your resource utilization and allocation across different cloud resources. It can help you optimize performance, availability, security, compliance, and cost of your hybrid cloud infrastructure.

Cloud Autoscaling (CAS): Autoscaling is a technique that automatically adjusts the amount of computing resources allocated to your applications based on the demand and performance metrics. This means that you can optimize your costs and performance without having to manually intervene.

With autoscaling, you can specify the minimum and maximum number of instances, pods, or nodes that you want to run for your applications. You can also define the scaling policies that determine when and how to scale up or down. For example, you can scale up when the CPU utilization exceeds 80%, and scale down when it falls below 20%. Autoscaling will monitor the metrics and adjust the resources accordingly.
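
The policy in the example above (scale up over 80% CPU, scale down under 20%, within a minimum and maximum) can be sketched as a single decision function. This is a simplified illustration, not any provider's API: the name `desired_instances`, the step of one instance at a time and the default bounds are assumptions.

```python
def desired_instances(current, cpu_percent, minimum=2, maximum=10,
                      scale_up_at=80.0, scale_down_at=20.0):
    """Return the instance count an autoscaler would aim for next.

    Adds one instance above the upper threshold, removes one below the
    lower threshold, and always stays within [minimum, maximum].
    """
    if cpu_percent > scale_up_at:
        target = current + 1
    elif cpu_percent < scale_down_at:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))


if __name__ == "__main__":
    for cpu in (85.0, 50.0, 10.0):
        print(cpu, desired_instances(4, cpu))
```

The wide gap between the two thresholds is deliberate: it keeps the system from flapping between scale-up and scale-down when the load hovers around a single value.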

Most platforms also expose an API or CLI so that you can configure autoscaling programmatically; refer to your provider's documentation for details. CAS is usually a dedicated system within your grid.

Cloud Management Platform (CMP): A CMP is a software tool that provides a single interface for managing multiple cloud resources. It can automate provisioning, configuration, monitoring, optimization, and governance of your hybrid cloud infrastructure.

The next challenge is ensuring performance and reliability across different platforms and networks. To address this, you need to optimize your network connectivity and bandwidth, balance your workload distribution and resource allocation, and implement backup and disaster recovery strategies. Some of the key systems that can help you with performance are:

Software-Defined Everything (but start with Networking - SDN): I've written a full article on this subject. Software-defined everything (compute, storage and so on) is now a must-have foundation. SDN is a network architecture that decouples the control plane from the data plane. It can enable dynamic routing, load balancing, quality of service, security, and policy enforcement across your network.

Cloud Load Balancer (CLB): A service, software-based or delivered as a network device, that distributes incoming traffic across multiple servers or instances. It assigns requests from client devices to available servers using an algorithm that considers factors such as server load and geographical distance. This improves the availability, scalability and performance of your applications by reducing latency, congestion and downtime, and by keeping any single server from being overloaded. Some cloud load balancers can also balance traffic across servers in different regions or cloud providers, which is known as global server load balancing (GSLB). Azure Traffic Manager, AWS Elastic Load Balancing, Web Works and NGINX are some really strong, benchmark-setting brands for this.

Cloud Backup and Recovery (CBR): An absolute must-have for us. It copies your data from one location to another for backup or recovery purposes. It can protect your data from loss or corruption due to hardware failure, human error, natural disaster, or cyberattack. I recently deployed Cohesity and am impressed with it, but there are many tool brands for you to choose from.
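
To show what "an algorithm that considers server load and geographical distance" might look like, here is a small sketch. It is not the algorithm used by any of the products named above; the `Server` fields, the `pick_server` function, the 70/30 weighting and the 20,000 km normalisation are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_connections: int
    capacity: int            # maximum concurrent connections
    distance_km: float       # distance from the client (assumed to be known)
    healthy: bool = True


def pick_server(servers, load_weight=0.7, distance_weight=0.3,
                max_distance_km=20000.0):
    """Pick the healthy server with the lowest weighted load/distance score."""
    candidates = [s for s in servers
                  if s.healthy and s.active_connections < s.capacity]
    if not candidates:
        raise RuntimeError("no healthy server with spare capacity")

    def score(server):
        load = server.active_connections / server.capacity
        distance = min(server.distance_km / max_distance_km, 1.0)
        return load_weight * load + distance_weight * distance

    return min(candidates, key=score)


if __name__ == "__main__":
    pool = [
        Server("near-busy", 90, 100, 100.0),
        Server("far-idle", 10, 100, 8000.0),
        Server("near-down", 0, 100, 50.0, healthy=False),
    ]
    print(pick_server(pool).name)
```

With these weights a lightly loaded server far away beats a nearly saturated one nearby; shifting the weights towards distance would favour latency over headroom.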
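
The core idea of "copy the data to another location and be sure the copy is good" can be sketched in a few lines of Python using only the standard library. This is a toy illustration, not how Cohesity or any other backup product works; the function names `sha256_of` and `backup_file` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)          # copies content and file metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} failed verification")
    return target
```

A backup that has never been verified or restored is only a hope, which is why even this sketch checks the copy before reporting success.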

Another challenge is ensuring security across different platforms and networks. To address this, you need to implement a unified security policy that covers both your public and private cloud resources. This means applying consistent encryption, authentication, authorization, and auditing standards across all your data and applications. You also need to monitor and protect your network traffic and endpoints from potential threats and vulnerabilities. Some of the key systems that can help you with security are:

Cloud Access Security Broker (CASB): A CASB is a software tool that acts as a gatekeeper between your cloud services and your users. It can enforce your security policy, detect and prevent data breaches, and provide visibility and control over your cloud usage.

Identity and Access Management (IAM): IAM is a system that manages the identities and access rights of your users and devices. It can verify the identity of your users, grant or deny access to your resources, and manage roles and permissions.

Firewall: A firewall is a network device that filters and blocks unwanted or malicious traffic from entering or leaving your network. It can protect your network perimeter, segment your network zones, and prevent unauthorized access to your data and applications.
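
The IAM idea of granting or denying access through roles and permissions can be illustrated with a very small role-based check. Everything here is hypothetical: the role names, the permission strings, the users and the `is_allowed` function are invented for the example and do not correspond to any particular IAM product.

```python
# Hypothetical role -> permission mapping for a compute grid.
ROLE_PERMISSIONS = {
    "grid-admin": {"job:submit", "job:cancel", "node:drain"},
    "analyst": {"job:submit"},
}

# Hypothetical user -> role assignments.
USER_ROLES = {
    "alice": {"grid-admin"},
    "bob": {"analyst"},
}


def is_allowed(user, permission):
    """Grant access only if one of the user's roles carries the permission.

    Unknown users and unknown roles get nothing (deny by default).
    """
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


if __name__ == "__main__":
    print(is_allowed("alice", "node:drain"))  # True
    print(is_allowed("bob", "node:drain"))    # False
```

Deny-by-default is the important design choice: the same policy table can then be applied consistently across public and private resources.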

The final challenge is compatibility and interoperability between different platforms and services. To address this, you need to adopt common standards and protocols that can facilitate communication and integration across your hybrid environment. You also need to use tools that can automate and orchestrate your workflows and processes across different cloud resources. Some of the key systems that can help you with compatibility are:

Application Programming Interface (API): An API is a set of rules and specifications that define how different software components can interact with each other. It can enable data exchange, functionality sharing, and service integration across different platforms and services.

Container: A container is a software package that bundles an application and its dependencies into a portable and isolated unit. It can run on any platform that supports the container technology, such as Docker or Kubernetes. It can simplify deployment, scaling, and management of your applications across different cloud environments.

Let me leave you with a good singular IT Management module, whose elements you must keep as base components of your overall operation design. They must be available at multiple layers of the design. Ultimately, your operating model will have each of the components below covered.

The application of these layers for administrators is as below.

Just like the operational components above, below is an example of the base stack to consider if you are creating a landing zone on Microsoft Azure. Having several of these on different platforms will replicate the same stack (x times). Each box and its design is a drill-down in its own right.

Closing off, managing hybrid technologies requires a holistic approach that covers all aspects while ensuring that you and your organisation remain in control of the new environment: its operation, technology, functions and so on. Applying some of these elements will at least reduce the possibility of full exposure, now or in a future state.

I have drilled down on some of these subjects in my other write-ups. Please feel free to contribute, add and amend as appropriate!

Thank you for reading!
