Enforcing isolation with your Ray clusters

Enforcing isolation relates to separation of concerns and to how application components are deployed in different environments. Let's break this down with an example:

An ML Pipeline

Say you're implementing an ML pipeline with the following stages:

1. Data ingestion and preprocessing

2. Model training

3. Model evaluation

4. Model serving

As an architect, you need to determine which of these stages should be separated, based on factors like resource requirements, scaling needs, and isolation for security or performance reasons.

You might decide that data ingestion and preprocessing can be one application component; that model training should be a separate application component because of its intensive compute requirements; that model evaluation can be included in the training application; and that model serving should be a separate application component, for scalability and for isolation from the training process.

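# For illustration purposes only
import ray

ray.init()

# Stand-in for a real trained model (placeholder)
class MyPlaceholderModel:
    def predict(self, input_data):
        return [0 for _ in input_data]

# Data ingestion and preprocessing
@ray.remote
def my_preprocess_data():
    # Preprocess data (placeholder values; replace with real ingestion logic)
    my_preprocessed_data = [[0.0, 1.0], [1.0, 0.0]]
    return my_preprocessed_data

# Model training and evaluation
# num_gpus=1 means this task is only scheduled on a node with a free GPU
@ray.remote(num_gpus=1)
def my_train_and_evaluate(preprocessed_data):
    # Train and evaluate model (placeholder; replace with real training logic)
    my_trained_model = MyPlaceholderModel()
    return my_trained_model

# Model serving
@ray.remote
class MyModelServer:
    def __init__(self, model):
        self.model = model

    def predict(self, input_data):
        return self.model.predict(input_data)

# Main pipeline: object references are passed between stages,
# and Ray resolves them to values before each stage runs
preprocessed_data = my_preprocess_data.remote()
trained_model = my_train_and_evaluate.remote(preprocessed_data)
model_server = MyModelServer.remote(trained_model)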

The architect has separated the concerns into distinct Ray tasks and actors, effectively deciding that preprocessing, training and evaluation, and model serving are different application components of the pipeline.

What is the Operations (Ops) team's responsibility? Put simply, the Ops team is the one that provisions and manages the Ray clusters, and it is responsible for ensuring that these application components run in isolation. This involves:

1. Providing separate compute resources (e.g., different EC2 instances on AWS)

2. Ensuring network isolation between components

3. Implementing security controls to prevent unauthorized access between components

4. Offering instance types optimized for each workload (e.g., GPU instances for training, high-memory instances for preprocessing); see https://aws.amazon.com/ec2/instance-types/

The Ops team might set up the infrastructure like this:

1. Data ingestion and preprocessing run on general-purpose CPU instances

2. Model training runs on GPU-enabled instances with high compute capacity

3. Model serving runs on instances optimized for low-latency serving
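One way to make such a plan easy to review is to write it down as data and check it before anything is provisioned. The sketch below is a minimal illustration in plain Python, not a Ray or AWS API: the `ClusterSpec` class, the `shared_clusters` helper, the cluster names, and the specific instance types are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of one isolated Ray cluster."""
    name: str
    instance_type: str  # example EC2 instance type, chosen for illustration
    needs_gpu: bool


# Hypothetical plan: one cluster per application component.
PLAN = {
    "preprocessing": ClusterSpec("ray-preprocess", "m5.2xlarge", needs_gpu=False),
    "training": ClusterSpec("ray-train", "p3.2xlarge", needs_gpu=True),
    "serving": ClusterSpec("ray-serve", "c5.xlarge", needs_gpu=False),
}


def shared_clusters(plan):
    """Return the names of clusters assigned to more than one component."""
    seen, shared = set(), set()
    for spec in plan.values():
        if spec.name in seen:
            shared.add(spec.name)
        seen.add(spec.name)
    return sorted(shared)


if __name__ == "__main__":
    violations = shared_clusters(PLAN)
    if violations:
        print("Isolation violated; shared clusters:", violations)
    else:
        print("Each component has its own cluster.")
```

A check like this only validates the plan on paper. Network isolation and access controls still have to be enforced in the infrastructure itself.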

The Ops team is also responsible for ensuring that these applications, deployed on separate, isolated Ray clusters, communicate securely.
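As one hedged illustration of what secure communication between clusters could look like, a component in one cluster might call the serving cluster over HTTPS with a bearer token rather than sharing a Ray cluster with it. The sketch below uses only the Python standard library. The endpoint URL, the token, the JSON payload shape, and the `build_predict_request` helper are assumptions made for this example, and the code only builds the request without sending it.

```python
import json
import urllib.request


def build_predict_request(endpoint, token, features):
    """Build (but do not send) an authenticated HTTPS prediction request."""
    # Refuse plain HTTP so traffic between clusters is always encrypted.
    if not endpoint.startswith("https://"):
        raise ValueError("serving endpoint must use HTTPS")
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical endpoint and token, for illustration only.
request = build_predict_request(
    "https://serving.example.internal/predict",
    "EXAMPLE_TOKEN",
    [1.0, 2.0],
)
# urllib.request.urlopen(request) would send it to a real endpoint.
```

In a real deployment the token would come from a secrets manager rather than source code, and the Ops team would typically pair this with network-level restrictions on which clusters can reach the serving endpoint.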

In summary, the Architect decides on the logical separation of the application components, while the Ops team ensures that this logical separation is reflected in the physical infrastructure and runtime environment, providing the required isolation and resources for each component.
