Amazon SageMaker Feature Store
Recap
Just to recap, the main reasons we need a feature store are:
- Consistency in features for Models (variations can impact model results)
- Reuse of features across models and teams saving time and cost
- Enabling feature discovery and versioning
SageMaker
Amazon SageMaker Feature Store is a fully managed, purpose-built repository for features. Being fully managed means that AWS handles the infrastructure, setup and provisioning, so there is nothing for you to operate.
The gateway to the feature store is Amazon SageMaker Studio, a fully managed JupyterLab environment. When you open Studio, you will find a widget for the feature store in the Launcher tab.
Feature groups
To bundle related features together, we use feature groups. Think of a feature group as a table and each feature as a column. Each row is a record holding the feature values for one entity. As a simple example:
- The customer would be a feature group
- Recency, Frequency and Monetary categories/values would be separate features
- Each customer would be a separate record
Each record carries a RecordIdentifier that uniquely identifies it. In our example, it could be the customerId.
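The table analogy above can be sketched in plain Python. This is only an illustration: the RFM values, the column names and the helper infer_feature_definitions are made up for this example, and the type names String, Integral and Fractional are the feature types Feature Store uses.

```python
# Hypothetical RFM records: one dict per customer, one key per feature.
customer_records = [
    {"customerId": "C001", "recency": 12, "frequency": 5, "monetary": 430.0},
    {"customerId": "C002", "recency": 90, "frequency": 1, "monetary": 25.5},
]

RECORD_IDENTIFIER_NAME = "customerId"  # assumed identifier column

# Feature Store feature types, keyed by the Python type of a sample value.
FEATURE_TYPES = {str: "String", int: "Integral", float: "Fractional"}


def infer_feature_definitions(record):
    """Map each feature name to a feature type based on one sample record."""
    return {name: FEATURE_TYPES[type(value)] for name, value in record.items()}


feature_definitions = infer_feature_definitions(customer_records[0])
identifiers = [r[RECORD_IDENTIFIER_NAME] for r in customer_records]
```

Here the list plays the role of the feature group, each key is a feature, and each dict is a record keyed by customerId.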
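Feature groups can be made available online, offline or both. Online feature groups are mainly used for real-time predictions and store only the latest version of the feature data. The read latency for an online store is a few milliseconds. Offline feature groups keep the historical feature data in Amazon S3 and are typically used for model training and batch predictions.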
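To make the online/offline distinction concrete, here is a toy in-memory model of it. It is not the SageMaker API: the class ToyFeatureStore and its method names are invented for this sketch, and it only mimics the rule that the online store keeps the record with the newest event time while the offline store keeps every record.

```python
class ToyFeatureStore:
    """In-memory stand-in for a feature group with both stores enabled."""

    def __init__(self, record_identifier_name, event_time_name):
        self.record_identifier_name = record_identifier_name
        self.event_time_name = event_time_name
        self.online = {}   # record identifier -> latest record only
        self.offline = []  # append-only history of every record

    def put_record(self, record):
        # The offline store keeps every version, e.g. for building training sets.
        self.offline.append(record)
        # The online store keeps only the record with the newest event time.
        key = record[self.record_identifier_name]
        current = self.online.get(key)
        if current is None or record[self.event_time_name] > current[self.event_time_name]:
            self.online[key] = record

    def get_record(self, record_id):
        """Return the latest record for one identifier, or None if unknown."""
        return self.online.get(record_id)


store = ToyFeatureStore("customerId", "EventTime")
store.put_record({"customerId": "C001", "recency": 30, "EventTime": 1700000000.0})
store.put_record({"customerId": "C001", "recency": 2, "EventTime": 1700086400.0})
# A late-arriving older event is kept in history but does not replace the online value.
store.put_record({"customerId": "C001", "recency": 45, "EventTime": 1690000000.0})
```

After these three writes, the online lookup for C001 returns the record with recency 2, while the offline history holds all three versions.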
Back in Studio, navigating to the feature store from the Launcher gives you a view listing all features and feature groups.
In true AWS tradition, every action that you can perform from the UI is also available via APIs. The SageMaker Python SDK (the sagemaker library) can be used for API access.
The sample code to create a feature group is below and is largely self-explanatory:
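import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()  # IAM role ARN; resolves automatically inside SageMaker
# Placeholder values (assumptions, not from the original data set): adjust to your own names
products_feature_group_name = "products-feature-group"
record_identifier_feature_name = "productId"
products_feature_group = FeatureGroup(
    name=products_feature_group_name, sagemaker_session=sagemaker_session
)
product_data = pd.read_csv("data/product.csv")
# Every record needs an event time; here all rows get the current Unix timestamp
current_time_sec = int(round(time.time()))
product_data["EventTime"] = pd.Series([current_time_sec] * len(product_data), dtype="float64")
# Infer feature names and types from the DataFrame columns
products_feature_group.load_feature_definitions(data_frame=product_data)
products_feature_group.create(s3_uri="<s3 path>", record_identifier_name=record_identifier_feature_name, event_time_feature_name="EventTime", role_arn=role, enable_online_store=True)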
With a short code snippet like the one above, we can create a feature group in SageMaker. The API for retrieval is equally simple and intuitive.
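As a rough sketch of what retrieval can look like, the helpers below wrap the GetRecord call of the Feature Store runtime API. The function names are invented for this example, and the client is passed in as a parameter so the snippet stays independent of any AWS account. Records travel over this API as a list of FeatureName/ValueAsString pairs, so values come back as strings.

```python
def to_feature_store_record(row):
    """Convert a plain dict into the list-of-features shape used by PutRecord."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
    ]


def from_feature_store_record(record):
    """Convert the list-of-features shape back into a plain dict of strings."""
    return {item["FeatureName"]: item["ValueAsString"] for item in record}


def get_latest_features(client, feature_group_name, record_id):
    """Fetch the latest online record for one identifier; empty dict if absent.

    `client` is assumed to be a boto3 "sagemaker-featurestore-runtime" client.
    """
    response = client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=str(record_id),
    )
    # Assumption: a missing record yields a response without a "Record" key.
    return from_feature_store_record(response.get("Record", []))
```

In a real session the client would typically be created with boto3.client("sagemaker-featurestore-runtime"). To load data in bulk, the FeatureGroup object in the SageMaker Python SDK also offers an ingest(data_frame=...) method, which can be called once the feature group has finished creating.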
In a nutshell, SageMaker Feature Store makes it a breeze to store, discover and retrieve features for machine learning.
Footnotes:
- The SageMaker library version must be greater than 2.0