Enforcing isolation with your Ray clusters
Enforcing isolation relates to separation of concerns and to how workloads are placed in different deployment environments. Let's break this down with an example:
An ML Pipeline
Say you're implementing an ML pipeline with the following stages:
1. Data ingestion and preprocessing
2. Model training
3. Model evaluation
4. Model serving
As an Architect, you need to determine which of these stages should be separated based on factors like resource requirements, scaling needs, and isolation for security or performance reasons.
You might decide that data ingestion and preprocessing can be one application component; model training should be a separate component because of its intensive compute requirements; model evaluation can be included in the training application; and model serving should be a separate component for scalability and isolation from the training process.
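# For illustration purposes only
import ray

ray.init()  # Start a local Ray runtime, or connect to an existing cluster

class MyPlaceholderModel:
    # Stand-in for a real trained model (placeholder, not part of Ray)
    def predict(self, input_data):
        return [0 for _ in input_data]

# Data ingestion and preprocessing
@ray.remote
def my_preprocess_data():
    # Preprocess data (placeholder rows stand in for real ingestion)
    my_preprocessed_data = [[0.1, 0.2], [0.3, 0.4]]
    return my_preprocessed_data

# Model training and evaluation
@ray.remote(num_gpus=1)  # Stays pending until a node with a free GPU is available
def my_train_and_evaluate(data):
    # Train and evaluate model (placeholder: no real training happens here)
    my_trained_model = MyPlaceholderModel()
    return my_trained_model

# Model serving
@ray.remote
class MyModelServer:
    def __init__(self, model):
        self.model = model

    def predict(self, input_data):
        return self.model.predict(input_data)

# Main pipeline
preprocessed_data = ray.get(my_preprocess_data.remote())
trained_model = ray.get(my_train_and_evaluate.remote(preprocessed_data))
model_server = MyModelServer.remote(trained_model)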
The Architect has separated the concerns into distinct Ray tasks and actors, effectively deciding that preprocessing, training and evaluation, and model serving are different application components of the pipeline.
What is the Operations team's responsibility? (Simply put, the Ops team is the one that provisions and manages Ray clusters.) The Ops team is responsible for ensuring that these application components run in isolation. This involves:
1. Providing separate compute resources (e.g., different EC2 instances on AWS)
2. Ensuring network isolation between components
3. Implementing security controls to prevent unauthorized access between components
4. Offering instance types optimized for different workloads (e.g., GPU instances for training, high-memory instances for preprocessing); see https://aws.amazon.com/ec2/instance-types/
The Ops team might set up the infrastructure like this:
1. Data ingestion and preprocessing run on general-purpose CPU instances
2. Model training runs on GPU-enabled instances with high compute capacity
3. Model serving runs on instances optimized for low-latency serving
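A plan like this can be written down as data and checked before anything is provisioned. The following is a minimal sketch, not a Ray API: `ClusterSpec`, `check_isolation`, the cluster names, the subnet labels, and the specific EC2 instance types are all hypothetical, chosen only to illustrate a one-cluster-per-component split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of one isolated Ray cluster."""
    name: str            # cluster name (illustrative)
    component: str       # application component that runs on it
    instance_type: str   # example EC2 instance type (an assumption)
    subnet: str          # hypothetical network segment used for isolation


# Illustrative plan: one cluster per application component.
PLAN = [
    ClusterSpec("ray-preprocess", "preprocessing", "m5.2xlarge", "subnet-a"),
    ClusterSpec("ray-train", "training_and_evaluation", "p3.2xlarge", "subnet-b"),
    ClusterSpec("ray-serve", "serving", "c5.2xlarge", "subnet-c"),
]


def check_isolation(plan):
    """Return a list of problems; an empty list means nothing is shared."""
    problems = []
    for field in ("name", "component", "subnet"):
        values = [getattr(spec, field) for spec in plan]
        duplicates = sorted({v for v in values if values.count(v) > 1})
        for value in duplicates:
            problems.append(f"{field} {value!r} is used by more than one cluster")
    return problems


if __name__ == "__main__":
    # Print any sharing problems found in the illustrative plan.
    print(check_isolation(PLAN))
```

A check like this only validates the plan on paper. Actual isolation still depends on how the Ops team configures the cloud account, networking, and access controls.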
It is also the responsibility of the Ops team to ensure that these applications, deployed on different isolated Ray clusters, communicate securely.
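One possible building block for secure communication between separately deployed components is message authentication with a shared secret. The sketch below uses only Python's standard `hmac`, `hashlib`, and `json` modules. The helper names `sign_message` and `verify_message`, the payload fields, and the way the key is distributed are assumptions for illustration, and this is not a Ray feature. It complements transport encryption and network-level controls rather than replacing them.

```python
import hashlib
import hmac
import json


def sign_message(payload: dict, key: bytes) -> dict:
    """Serialize the payload and attach an HMAC-SHA256 signature."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"body": body.decode("utf-8"), "signature": signature}


def verify_message(message: dict, key: bytes) -> dict:
    """Return the payload if the signature matches, otherwise raise ValueError."""
    body = message["body"].encode("utf-8")
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    if not hmac.compare_digest(expected, message["signature"]):
        raise ValueError("signature mismatch: message rejected")
    return json.loads(body)


if __name__ == "__main__":
    # Hypothetical key; in practice it would come from a secrets manager.
    shared_key = b"example-key-for-illustration-only"
    # Hypothetical message from the training component to the serving component.
    message = sign_message({"model_uri": "s3://example-bucket/model"}, shared_key)
    print(verify_message(message, shared_key))
```

Here the training component would sign a message announcing a new model, and the serving component would reject anything whose signature does not verify.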
In summary, the Architect decides on the logical separation of the application components, while the Ops team ensures that this logical separation is reflected in the physical infrastructure and runtime environment, providing the required isolation and resources for each component.