Marvelous MLOps #44: Getting started with Databricks Feature Serving

Databricks has recently announced the general availability of Feature Serving. This functionality is useful when you need to expose data stored on Databricks to other applications.

To make it clearer when you need Feature Serving vs Model Serving, see the diagram below.

In the case of machine learning, Feature Serving can be used to expose your pre-computed predictions, for example, user recommendations for an online store that are generated in a batch process daily or weekly, depending on your business needs. This is the use case we are going to explore.

Remember, even though Feature Serving is in GA, it relies on Online Tables, a feature that is still in Public Preview. Also, make sure you use databricks-sdk version 0.28.0 or higher to run the code; at the time of writing, the SDK is still in Beta.
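To make sure the right SDK version is installed in your notebook, you can pin it explicitly. A minimal sketch (restart Python afterwards so the upgraded version is picked up):

%pip install "databricks-sdk>=0.28.0"

# Restart the Python process so the upgraded SDK is picked up
dbutils.library.restartPython()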

High-level overview

Let's start with the high-level overview, and then go through the process step-by-step.

  • In this article, we are not going to explain how Databricks workflows work and how batch recommendations are generated. We just take a recommendations dataframe as given.
  • We store the recommendations in a feature table in Unity Catalog.
  • We create an online table, which is a read-only copy of your feature table designed for fast data lookups. According to Databricks documentation, "online tables provide low latency and high throughput access to data of any scale". Feature Serving does not work without online tables.
  • We create a FeatureSpec, a user-defined set of features and functions; this is what is served behind the Feature Serving endpoint.
  • Finally, we deploy the Feature Serving endpoint and query it to retrieve the predictions.

Recommendations dataframe

Let's mimic the recommendations dataframe using the online retail dataset available in your Databricks workspace.

import pyspark.sql.functions as F
from pyspark.sql import Window

# Read the online retail dataset shipped with Databricks workspaces
df = spark.read.load("/databricks-datasets/online_retail/data-001/data.csv",
                     format="csv", sep=",", inferSchema="true", header="true")
# Keep unique (customer, product) pairs with a known customer
df = df[["CustomerID", "StockCode"]].filter(df["CustomerID"].isNotNull()).drop_duplicates()
# Rank products per customer and keep the top 5 as "recommendations"
w = Window.partitionBy("CustomerID").orderBy("StockCode")
df = df.withColumn("rank", F.rank().over(w))
df = df.filter(df.rank <= 5).withColumnRenamed("StockCode", "ProductId")
# Collect the product ids into a single list per customer
df = df.groupby("CustomerID").agg(F.collect_list("ProductId").alias("Recommendations"))

For each CustomerID, we now have a list of product IDs available as recommendations.
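If you want to verify the result, printSchema gives a quick sanity check. The exact types depend on schema inference, so treat the shape below as an expectation rather than a guarantee:

df.printSchema()
# Expected shape (types may vary with inferSchema):
# root
#  |-- CustomerID: integer (nullable = true)
#  |-- Recommendations: array (nullable = true)
#  |    |-- element: string (containsNull = true)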

Create a feature table and an online table

Both the feature table and the online table are stored in Unity Catalog. To run the commands below, the catalog "mlops_test" and the schema "feature_serving" must exist, and you must have USE privileges on both. If they do not exist yet, a user with CREATE privileges can create them.

from databricks import feature_engineering

catalog_name = "mlops_test"
schema_name = "feature_serving"

fe = feature_engineering.FeatureEngineeringClient()
feature_table_name = f"{catalog_name}.{schema_name}.user_recommendations"
online_table_name = f"{catalog_name}.{schema_name}.user_recommendations_online"

# Create a feature table from the recommendations dataframe
fe.create_table(
    name=feature_table_name,
    primary_keys=["CustomerID"],
    df=df,
    description="User recommendations",
)

With "create_table" method, we create a new feature table. To update the existing feature table, use "write_table" method instead.

After the feature table is created, we need to generate an online table. The commands below will only run with databricks-sdk version 0.28.0 or higher.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import (
    OnlineTableSpec,
    OnlineTableSpecTriggeredSchedulingPolicy,
)

workspace = WorkspaceClient()

# Create an online table that is refreshed on demand (triggered updates)
spec = OnlineTableSpec(
    primary_key_columns=["CustomerID"],
    source_table_full_name=feature_table_name,
    run_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict({"triggered": "true"}),
    perform_full_copy=True,
)

online_table_pipeline = workspace.online_tables.create(name=online_table_name, spec=spec)

In the example above, the online table is updated via a trigger. The online_table_pipeline object contains the pipeline ID (in our case, 'f6bn131e-da4e-4431-abc7-4ae189c28001'), which can be used to trigger an update later:

workspace.pipelines.start_update(
    pipeline_id="f6bn131e-da4e-4431-abc7-4ae189c28001", full_refresh=True
)

If the online table is not updated, it will not reflect changes in the feature table.
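If you want to block until a refresh finishes before serving traffic, you can poll the update status via the pipelines API. A sketch, assuming the pipeline ID from the create response above and the update-status fields exposed by recent databricks-sdk versions:

import time

pipeline_id = "f6bn131e-da4e-4431-abc7-4ae189c28001"  # taken from online_table_pipeline

update = workspace.pipelines.start_update(pipeline_id=pipeline_id, full_refresh=True)
while True:
    info = workspace.pipelines.get_update(pipeline_id=pipeline_id, update_id=update.update_id)
    state = info.update.state.value
    if state in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(30)

print(f"Online table refresh finished with state: {state}")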

An online table can also be updated continuously if it is created with a different scheduling policy (run_continuously=OnlineTableSpecContinuousSchedulingPolicy.from_dict({'continuous': 'true'})). In that case, change data feed must be enabled on the feature table, which you can do by running the following SQL command:

ALTER TABLE mlops_test.feature_serving.user_recommendations SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
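Putting that together, a continuously updated online table spec could look like this. A sketch based on the triggered example above; I have not compared the two modes side by side:

from databricks.sdk.service.catalog import OnlineTableSpecContinuousSchedulingPolicy

# Continuously sync changes from the feature table (requires change data feed)
continuous_spec = OnlineTableSpec(
    primary_key_columns=["CustomerID"],
    source_table_full_name=feature_table_name,
    run_continuously=OnlineTableSpecContinuousSchedulingPolicy.from_dict({"continuous": "true"}),
)

workspace.online_tables.create(name=online_table_name, spec=continuous_spec)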

It is unclear what the cost consequences are, but I would assume a triggered update is cheaper.

Create FeatureSpec

FeatureSpec is a user-defined set of features and functions that is served behind the Feature Serving endpoint.

In our scenario, we just need to use feature lookup.

from databricks.feature_engineering import FeatureLookup

# Look up all feature columns by the primary key CustomerID
features = [
    FeatureLookup(
        table_name=feature_table_name,
        lookup_key="CustomerID",
    )
]

feature_spec_name = f"{catalog_name}.{schema_name}.recommendation"
fe.create_feature_spec(name=feature_spec_name, features=features, exclude_columns=None)

FeatureSpec is also stored in Unity Catalog and can be viewed in the UI under "Functions". Note: you can see both the feature table and the online table under "Tables".

Deploy a Feature Serving endpoint and send a request

To create the Feature Serving endpoint, you need to specify a served entity, which is the FeatureSpec defined above.

from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

endpoint_name = "recommendations-endpoint"

# Create a Feature Serving endpoint that serves the FeatureSpec
workspace.serving_endpoints.create(
    name=endpoint_name,
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name=feature_spec_name,
                scale_to_zero_enabled=True,
                workload_size="Small",
            )
        ]
    ),
)
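Endpoint creation takes a few minutes. If you want the notebook to block until the endpoint is ready, recent databricks-sdk versions expose a waiter; a sketch, assuming that helper is available in your SDK version:

# Block until the endpoint leaves the "updating" state
workspace.serving_endpoints.wait_get_serving_endpoint_not_updating(endpoint_name)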

After the endpoint is deployed, we can send our first request:

from mlflow.deployments import get_deploy_client

# Query the endpoint via the MLflow deployments client
client = get_deploy_client("databricks")
response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "dataframe_records": [
            {"CustomerID": 12356},
        ]
    },
)

The response contains the pre-computed recommendations for the requested CustomerID.

It is important to mention that we do not have to update the endpoint every time we update the online table. The endpoint must be updated only if the underlying FeatureSpec changes.
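If the FeatureSpec does change, for example after recreating it with additional lookups, the endpoint configuration can be updated in place. A sketch using the SDK's update_config call with the same served entity settings as above:

# Point the existing endpoint at the updated FeatureSpec
workspace.serving_endpoints.update_config(
    name=endpoint_name,
    served_entities=[
        ServedEntityInput(
            entity_name=feature_spec_name,
            scale_to_zero_enabled=True,
            workload_size="Small",
        )
    ],
)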

To send requests to the endpoint in a production setting on Azure, use a service principal's Microsoft Entra ID access token to access the Databricks REST API: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/service-prin-aad-token
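For illustration, such a request could look like this. A sketch: the host and token values are placeholders you would obtain from your workspace and identity provider:

import requests

# Hypothetical placeholders for your workspace URL and the service
# principal's Microsoft Entra ID access token
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
ENTRA_ID_TOKEN = "<service-principal-access-token>"

response = requests.post(
    f"{DATABRICKS_HOST}/serving-endpoints/recommendations-endpoint/invocations",
    headers={"Authorization": f"Bearer {ENTRA_ID_TOKEN}"},
    json={"dataframe_records": [{"CustomerID": 12356}]},
    timeout=30,
)
print(response.json())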

Conclusions and follow-up

Feature Serving is a great feature that enables you to serve pre-computed batch predictions. In such a scenario, Model Serving would be overkill.

As with Model Serving, you cannot define the payload format in any way you like. In the Model Serving scenario, you have a bit more influence on how the payload looks than in the Feature Serving scenario. For existing integrations, migrating to Feature Serving endpoints therefore requires a change on the client side.

In the follow-up articles, we will explore:

  • A deep dive into use cases and architectures for Feature Serving vs Model Serving.
  • How Feature Serving endpoints can be used in scenarios where we want to trigger long-running tasks via an API call, and then retrieve the results via another API call (something you can achieve with FastAPI + Celery if using open-source solutions).
  • How we can monitor endpoints using inference tables.
