Marvelous MLOps #44: Getting started with Databricks Feature Serving
Databricks recently announced the general availability of Feature Serving. This functionality is useful when you need to expose data that lives on Databricks to other applications.
To make it clearer when you need Feature Serving versus Model Serving, see the diagram below.
In the case of machine learning, Feature Serving can be used to expose your pre-computed predictions, for example, user recommendations for an online store that are generated in a batch process daily or weekly, depending on your business needs. This is the use case we are going to explore.
Remember, even though Feature Serving is in GA, it relies on Online Tables, which are still in Public Preview. Also, make sure you use databricks-sdk version 0.28.0 or higher to run the code (the SDK is currently in Beta).
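As a minimal sketch, assuming you run the code from a Databricks notebook, you can pin the SDK version like this:

# Install a databricks-sdk version that supports Online Tables (>= 0.28.0)
%pip install "databricks-sdk>=0.28.0"
# Restart the Python process so the new version is picked up
dbutils.library.restartPython()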
High-level overview
Let's start with the high-level overview, and then go through the process step-by-step.
Recommendations dataframe
Let's mimic the recommendations dataframe using the online retail dataset available on your Databricks workspace.
import pyspark.sql.functions as F
from pyspark.sql import Window

# Read the online retail sample dataset that ships with the Databricks workspace
df = spark.read.load("/databricks-datasets/online_retail/data-001/data.csv", format="csv", sep=",", inferSchema="true", header="true")
# Keep unique (customer, product) pairs with a known customer
df = df[["CustomerID", "StockCode"]].filter(df["CustomerID"].isNotNull()).drop_duplicates()
# Keep at most 5 products per customer to mimic a recommendation list
w = Window.partitionBy("CustomerID").orderBy("StockCode")
df = df.withColumn("rank", F.rank().over(w))
df = df.filter(df.rank <= 5).withColumnRenamed("StockCode", "ProductId")
# Collect the product ids into a single array column per customer
df = df.groupby("CustomerID").agg(F.collect_list("ProductId").alias("Recommendations"))
For each CustomerID, we now have a list of product IDs available as recommendations.
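As a quick, optional sanity check, you can display a few rows and the schema before writing the dataframe to a feature table:

# Inspect a handful of customers and confirm Recommendations is an array column
df.show(5, truncate=False)
df.printSchema()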
Create a feature table and an online table
Both the feature table and the online table are stored in Unity Catalog. To run the commands below, the catalog "mlops_test" and the schema "feature_serving" must already exist, and you need USE privileges on both. Users with CREATE privileges can create the catalog and schema themselves.
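If they do not exist yet, a minimal sketch for creating them (assuming you have the required CREATE privileges):

# Create the catalog and schema used in this article if they are missing
spark.sql("CREATE CATALOG IF NOT EXISTS mlops_test")
spark.sql("CREATE SCHEMA IF NOT EXISTS mlops_test.feature_serving")

With the catalog and schema in place, we can create the feature table: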
from databricks import feature_engineering

catalog_name = "mlops_test"
schema_name = "feature_serving"

fe = feature_engineering.FeatureEngineeringClient()

feature_table_name = f"{catalog_name}.{schema_name}.user_recommendations"
online_table_name = f"{catalog_name}.{schema_name}.user_recommendations_online"

# Create the offline feature table from the recommendations dataframe
fe.create_table(
    name=feature_table_name,
    primary_keys=["CustomerID"],
    df=df,
    description="User recommendations",
)
With "create_table" method, we create a new feature table. To update the existing feature table, use "write_table" method instead.
After the feature table is created, we need to generate an online table. The commands below will only run with databricks-sdk version 0.28.0 or higher.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import OnlineTableSpec, OnlineTableSpecTriggeredSchedulingPolicy

workspace = WorkspaceClient()

# Create an online table that is refreshed on demand (triggered updates)
spec = OnlineTableSpec(
    primary_key_columns=["CustomerID"],
    source_table_full_name=feature_table_name,
    run_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict({"triggered": "true"}),
    perform_full_copy=True,
)

online_table_pipeline = workspace.online_tables.create(name=online_table_name, spec=spec)
In the example above, the online table is updated using a trigger. The online_table_pipeline object contains the pipeline ID (in our case, 'f6bn131e-da4e-4431-abc7-4ae189c28001'), which can be used to trigger an update later:
workspace.pipelines.start_update(pipeline_id='f6bn131e-da4e-4431-abc7-4ae189c28001', full_refresh=True)
If the online table is not updated, it will not reflect changes in the feature table.
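To check whether a refresh has gone through, you can fetch the online table and inspect its status; a small sketch (the exact fields of the returned object may differ between SDK versions):

# Fetch the online table metadata and print its current provisioning/refresh status
online_table = workspace.online_tables.get(name=online_table_name)
print(online_table.status)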
An online table can also be updated continuously if it is created with a different scheduling policy (run_continuously=OnlineTableSpecContinuousSchedulingPolicy.from_dict({'continuous': 'true'})); see the sketch below. In that case, change data feed must be enabled for the feature table, which you can do by running the following SQL command on your feature table:
ALTER TABLE mlops_test.feature_serving.user_recommendations SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
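With change data feed enabled, a minimal sketch of creating the online table in continuous mode (reusing the names defined above and using run_continuously instead of run_triggered) could look like this:

from databricks.sdk.service.catalog import OnlineTableSpecContinuousSchedulingPolicy

# Same spec as before, but with a continuous scheduling policy
# (assumes the online table does not already exist)
continuous_spec = OnlineTableSpec(
    primary_key_columns=["CustomerID"],
    source_table_full_name=feature_table_name,
    run_continuously=OnlineTableSpecContinuousSchedulingPolicy.from_dict({"continuous": "true"}),
)
workspace.online_tables.create(name=online_table_name, spec=continuous_spec)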
It is unclear what consequences this will have for costs, but I would assume a triggered update is cheaper.
Create FeatureSpec
A FeatureSpec is a user-defined set of features and functions that is served behind a Feature Serving endpoint.
In our scenario, we just need a feature lookup.
from databricks.feature_engineering import FeatureLookup

# Look up the pre-computed recommendations by CustomerID
features = [
    FeatureLookup(
        table_name=feature_table_name,
        lookup_key="CustomerID",
    )
]

feature_spec_name = f"{catalog_name}.{schema_name}.recommendation"
fe.create_feature_spec(name=feature_spec_name, features=features, exclude_columns=None)
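If the feature table contained more columns than you want to expose, FeatureLookup also accepts a feature_names argument; a small sketch of the same lookup restricted to a single column:

# Only serve the Recommendations column, even if the table later gains extra columns
features = [
    FeatureLookup(
        table_name=feature_table_name,
        lookup_key="CustomerID",
        feature_names=["Recommendations"],
    )
]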
FeatureSpec is also stored in Unity Catalog and can be viewed via UI under "Functions". Note: you can see both the feature table and the online table under "Tables".
Deploy a Feature Serving endpoint and send a request
To create the Feature Serving endpoint, you need to specify a served entity, which is the FeatureSpec defined above.
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

endpoint_name = "recommendations-endpoint"

# Create a Feature Serving endpoint that serves the FeatureSpec
workspace.serving_endpoints.create(
    name=endpoint_name,
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name=feature_spec_name,
                scale_to_zero_enabled=True,
                workload_size="Small",
            )
        ]
    ),
)
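Endpoint creation takes a few minutes. A small sketch for checking the deployment state before sending traffic (treat the exact state fields as an assumption; they may vary between SDK versions):

# Fetch the endpoint and print its state; 'ready' should show READY once deployed
endpoint = workspace.serving_endpoints.get(name=endpoint_name)
print(endpoint.state)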
After the endpoint is deployed, we can send our first request:
import mlflow

client = mlflow.deployments.get_deploy_client("databricks")

# Look up recommendations for a single customer
response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "dataframe_records": [
            {"CustomerID": 12356},
        ]
    },
)
The response contains the list of recommended product IDs for the requested CustomerID.
It is important to mention that we do not have to update the endpoint every time we update the online table. The endpoint only needs to be updated if the underlying FeatureSpec changes.
To send requests to the endpoint in a production setting on Azure, use a service principal's Microsoft Entra ID access token to access the Databricks REST API: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/service-prin-aad-token
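A minimal sketch of such a request with the requests library; the workspace URL and token value are placeholders, and obtaining the Entra ID token is covered in the linked article:

import requests

# Placeholders: replace with your workspace URL and a Microsoft Entra ID access token
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
ENTRA_ID_TOKEN = "<service-principal-access-token>"

response = requests.post(
    f"{DATABRICKS_HOST}/serving-endpoints/recommendations-endpoint/invocations",
    headers={"Authorization": f"Bearer {ENTRA_ID_TOKEN}", "Content-Type": "application/json"},
    json={"dataframe_records": [{"CustomerID": 12356}]},
    timeout=30,
)
print(response.json())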
Conclusions and follow-up
Feature Serving is a great feature that enables you to serve pre-computed batch predictions. In such a scenario, Model Serving would be overkill.
As with Model Serving, you cannot define the payload in any way you like. In the Model Serving scenario, you have a bit more influence on how the payload looks than with Feature Serving. Existing integrations that migrate to a Feature Serving endpoint will therefore require a change on the consumer side.
In the follow-up articles, we will explore: