Enforcing isolation with your Ray clusters

Raghavendra Prakash

Sr Solutions Architect at AWS India

Published: November 9, 2024

Enforcing isolation relates to separation of concerns and to how application components are placed in different deployment environments. Let's break this down with an example:

An ML Pipeline

Say you're implementing an ML pipeline with the following requirements:

1. Data ingestion and preprocessing

2. Model training

3. Model evaluation

4. Model serving

As an Architect, you need to determine which of these stages should be separated based on factors like resource requirements, scaling needs, and isolation for security or performance reasons.

You might decide that data ingestion and preprocessing can be one application component. Model training should be a separate application component because of its intensive compute requirements, and model evaluation can be included in the training application. Model serving should be a separate application component for scalability and for isolation from the training process.

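# For illustration purposes only
import ray

ray.init()

class MyPlaceholderModel:
    # Stand-in for a real trained model
    def predict(self, input_data):
        return [0 for _ in input_data]

# Data ingestion and preprocessing
@ray.remote
def my_preprocess_data():
    # Preprocess data (placeholder values)
    my_preprocessed_data = [[1.0, 2.0], [3.0, 4.0]]
    return my_preprocessed_data

# Model training and evaluation (needs a node with at least one GPU)
@ray.remote(num_gpus=1)
def my_train_and_evaluate(data):
    # Train and evaluate model (placeholder)
    my_trained_model = MyPlaceholderModel()
    return my_trained_model

# Model serving
@ray.remote
class MyModelServer:
    def __init__(self, model):
        self.model = model

    def predict(self, input_data):
        return self.model.predict(input_data)

# Main pipeline
preprocessed_data = ray.get(my_preprocess_data.remote())
trained_model = ray.get(my_train_and_evaluate.remote(preprocessed_data))
model_server = MyModelServer.remote(trained_model)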

The Architect has separated the concerns into distinct Ray tasks and actors, effectively deciding that preprocessing, training and evaluation, and model serving are different application components of the pipeline.
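One way to reason about this grouping is to give each stage a small profile and merge the stages whose profiles match. The following is only a rough sketch: the `Stage` class and the two boolean attributes (`needs_gpu`, `latency_sensitive`) are assumptions made for illustration, not part of Ray or of any real decision framework.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    needs_gpu: bool          # assumed attribute: heavy compute that benefits from GPUs
    latency_sensitive: bool  # assumed attribute: answers live requests


# Hypothetical profiles for the four stages listed in this article.
STAGES = [
    Stage("data ingestion and preprocessing", needs_gpu=False, latency_sensitive=False),
    Stage("model training", needs_gpu=True, latency_sensitive=False),
    Stage("model evaluation", needs_gpu=True, latency_sensitive=False),
    Stage("model serving", needs_gpu=False, latency_sensitive=True),
]


def group_into_components(stages):
    """Group stages that share the same resource and isolation profile."""
    groups = defaultdict(list)
    for stage in stages:
        groups[(stage.needs_gpu, stage.latency_sensitive)].append(stage.name)
    return list(groups.values())


components = group_into_components(STAGES)
for names in components:
    print(" + ".join(names))
```

With these assumed profiles, training and evaluation land in the same component, while preprocessing and serving each get their own. Real decisions would weigh more factors, such as scaling needs and security boundaries.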

What is the Operations team's responsibility? (Put simply, the Ops team is the one that provisions and manages Ray clusters.) The Ops team is responsible for ensuring that these application components run in isolation. This effectively involves:

1. Providing separate compute resources (e.g., different EC2 instances on AWS)

2. Ensuring network isolation between components

3. Implementing security controls to prevent unauthorized access between components

4. Offering relevant optimized instance types for different workloads (e.g., GPU instances for training, high-memory instances for preprocessing): https://aws.amazon.com/ec2/instance-types/

The Ops team might set up the infrastructure like this:

1. Data ingestion and preprocessing run on general-purpose CPU instances

2. Model training runs on GPU-enabled instances with high compute capacity

3. Model serving runs on instances optimized for low-latency serving
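The setup described above could be captured as a small plan that maps each application component to its own cluster, along with a check that no two components share one. This is a minimal sketch under stated assumptions: `ClusterSpec`, the cluster names, and the subnet labels are hypothetical placeholders, and the instance classes are broad categories rather than concrete EC2 instance types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    cluster_name: str    # hypothetical Ray cluster name
    instance_class: str  # broad class only; concrete instance types depend on the workload
    subnet: str          # hypothetical network segment used for isolation


# Hypothetical plan: one Ray cluster per application component.
PLAN = {
    "preprocessing": ClusterSpec("ray-preprocessing", "general-purpose CPU", "subnet-a"),
    "training": ClusterSpec("ray-training", "GPU", "subnet-b"),
    "serving": ClusterSpec("ray-serving", "low-latency serving", "subnet-c"),
}


def isolation_violations(plan):
    """Return pairs of components that share a cluster or a subnet."""
    items = sorted(plan.items())
    violations = []
    for i, (name_a, spec_a) in enumerate(items):
        for name_b, spec_b in items[i + 1:]:
            same_cluster = spec_a.cluster_name == spec_b.cluster_name
            same_subnet = spec_a.subnet == spec_b.subnet
            if same_cluster or same_subnet:
                violations.append((name_a, name_b))
    return violations


print(isolation_violations(PLAN))
```

A check like this only validates the plan on paper. The actual isolation still has to be enforced when the clusters and networks are provisioned.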

It is the responsibility of the Ops team to ensure that these applications, deployed on different isolated Ray clusters, communicate securely.
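One common way to think about secure communication between isolated components is a deny-by-default allow-list of permitted flows. The sketch below only illustrates that idea. The component names and the `ALLOWED_FLOWS` set are assumptions, and in practice such rules would be enforced by network and access controls rather than by application code.

```python
# Hypothetical allow-list: (source, destination) pairs that may communicate.
ALLOWED_FLOWS = {
    ("preprocessing", "training"),  # preprocessed data is handed to training
    ("training", "serving"),        # the trained model is handed to serving
}


def is_flow_allowed(source, destination):
    """Deny by default; permit only flows that are listed explicitly."""
    return (source, destination) in ALLOWED_FLOWS


print(is_flow_allowed("preprocessing", "training"))  # True
print(is_flow_allowed("serving", "training"))        # False
```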

In summary, the Architect decides on the logical separation of the application components, while the Ops team ensures that this logical separation is reflected in the physical infrastructure and runtime environment, providing the required isolation and resources for each component.
