Unlocking Snowflake's Classification Cortex Function: A Hands-on Journey with InSights

Unlocking Snowflake's Classification Cortex Function: A Hands-on Journey with InSights

Introduction to Snowflake Classification Cortex Function

Classification is a pivotal task in data analysis and machine learning, involving sorting data into distinct categories based on identified patterns in the training data. Snowflake's Cortex ML functions, particularly the classification function, provide a powerful tool for performing such tasks directly within the Snowflake ecosystem. This feature supports both binary and multi-class classification. Common applications include customer churn prediction, credit card fraud detection, and spam detection.

How Classification Works in Snowflake

The classification process involves several steps:

  1. Creating a Classification Model Object: Define a model by referencing the training data.
  2. Training the Model: Fit the model to the provided training data using machine learning algorithms. Snowflake uses a gradient boosting machine for this purpose.
  3. Classifying New Data: Once trained, the model can classify new data points.
  4. Evaluating the Model: Assess the model’s accuracy and effectiveness through various evaluation APIs provided by Snowflake.

Important Considerations

  • Accuracy and Bias: Machine learning outputs can be inaccurate or biased. Human oversight is essential.
  • Data Types: Supports numeric, Boolean, and string data types for features and labels. Timestamps must be in the TIMESTAMP_NTZ format.
  • Model Training and Inference: Training data must have a target column representing labeled classes and feature columns. Inference data must have the same feature names and types as the training data.

Current Limitations

  • Data Types: Only numeric, TIMESTAMP_NTZ, Boolean, or string data types are supported.
  • Algorithm and Parameters: Users cannot modify the classification algorithm or its parameters.
  • Data Size: Supports up to 1,000 columns and 10 million rows.
  • Model Management: Models cannot be cloned, and the CREATE SNOWFLAKE.ML.CLASSIFICATION privilege must be granted to relevant roles.

Preparing for Classification

Before starting with classification in Snowflake, ensure that you have:

  • Selected an appropriate virtual warehouse based on your data size.
  • Granted the necessary privileges to create classification models.

Example Workflow

  1. Select a Virtual Warehouse: Choose a warehouse based on the size of your training data.
  2. Grant Privileges: Ensure the role you are using has the necessary permissions to create classification models.
  3. Create and Train the Model: Use the CREATE SNOWFLAKE.ML.CLASSIFICATION command.
  4. Run Inference: Classify new data points using the PREDICT method.
  5. Evaluate the Model: Use the provided evaluation methods to assess the model's performance.

HANDS-ON EXERCISE

Let's dive into a practical exercise using Snowflake's Cortex classification function. We'll work with a dataset available on the Snowflake marketplace, specifically from InSight’s dataset on Electric Vehicles Data Trends, which includes the view ALL_STATES_EV_REGISTRATION. You can access the dataset here:

InSights: Electric Vehicles Data Trends ( snowflake.com )

EXPLORE DATASET

First, let's take a look at our dataset by selecting data from the ALL_STATES_EV_REGISTRATION view, focusing on the state of Minnesota (MN):

select * from ALL_STATES_EV_REGISTRATION where state = 'MN';        

Our goal is to classify the data to predict the vehicle’s manufacturer and model based on the complete vehicle name.

Creating Training and Testing Views

To accomplish this, we'll create four views from our dataset: two for training and two for testing. We'll have one set for the vehicle make and another for the vehicle model.

In the CLASSIFICATION worksheet, we'll create four views:

create VIEW vw_ev_mn_training_make as
select VEHICLE_NAME, VEHICLE_MAKE from ALL_STATES_EV_REGISTRATION where state = 'MN';
select * from vw_ev_mn_training_make;        
create VIEW vw_ev_mn_testing_make as
select DISTINCT(VEHICLE_NAME) from ALL_STATES_EV_REGISTRATION where state = 'MN';
select * from vw_ev_mn_testing_make;        
create VIEW vw_ev_mn_training_model as
select VEHICLE_NAME, VEHICLE_MODEL from ALL_STATES_EV_REGISTRATION where state = 'MN';
select * from vw_ev_mn_training_model;        
create VIEW vw_ev_mn_testing_model as
select DISTINCT(VEHICLE_NAME) from ALL_STATES_EV_REGISTRATION where state = 'MN';
select * from vw_ev_mn_testing_model;        

CLASSIFICATION MODEL TRAINING AND TESTING

Now, let's create a different SQL worksheet, EV_VEHICLE_MAKE_CLASSIFICATION, for the model for classifying the vehicle make (manufacturer).

  1. Set up the environment: First, set up the environment in the SQL worksheet:

use role XXXXXX;
use warehouse XXXXXX;
use database XXXXXX;
use schema CORTEX;
select * from VW_EV_MN_TRAINING_MAKE limit 10;
select * from VW_EV_MN_TESTING_MAKE limit 10;        

2. Create our CLASSIFICATION MODEL and train it over training data view VW_EV_MN_TRAINING_MAKE:

CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION EV_VEHICLE_MAKE_CLASSIFICATION_MODEL(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'VW_EV_MN_TRAINING_MAKE'),
    TARGET_COLNAME => 'VEHICLE_MAKE',
    CONFIG_OBJECT => { 'ON_ERROR': 'SKIP' }
);
CALL EV_VEHICLE_MAKE_CLASSIFICATION_MODEL!SHOW_TRAINING_LOGS();        

3. Save table EV_VEHICLE_MAKE_CLASSIFICATION_MODEL from creating predictions using our classification model on the testing data stored inside VW_EV_MN_TESTING_MAKE view:

CREATE or REPLACE TABLE EV_VEHICLE_MAKE_CLASSIFICATION_MODEL AS SELECT
    *,
    EV_VEHICLE_MAKE_CLASSIFICATION_MODEL!PREDICT(
        OBJECT_CONSTRUCT(*),
        {'ON_ERROR': 'SKIP'}
    ) as predictions
from VW_EV_MN_TESTING_MAKE;        

4. To view the class (prediction) with highest probability of being correct:

SELECT * EXCLUDE predictions,
        predictions:class AS class,
        round(predictions['probability'][class], 3) as probability
FROM EV_VEHICLE_MAKE_CLASSIFICATION_MODEL  order by vehicle_name asc;        

5. Comparison of PREDICTION and ACTUAL:

SELECT
  A.VEHICLE_NAME,
  GET_PATH(A.predictions, 'class') AS PREDICTION,
  B.vehicle_make AS ACTUAL,
  CASE
    WHEN ACTUAL = PREDICTION THEN 1
    ELSE 0
  END AS match_indicator,
  ROUND(A.predictions['probability'][PREDICTION], 3) AS probability,
  COUNT(B.vehicle_name) AS count
FROM
  EV_VEHICLE_MAKE_CLASSIFICATION_MODEL AS A
  JOIN VW_EV_MN_TRAINING_MAKE AS B ON A.vehicle_name = B.vehicle_name
GROUP BY
  1,
  2,
  3,
  4,
  5
ORDER BY
1;        
Heat Grid to show higher the count(y-axis) higher is the probability(x-axis) of being correct(heatmap) predictions.

6. Inspect model's evaluation metrics:

CALL EV_VEHICLE_MAKE_CLASSIFICATION_MODEL!SHOW_EVALUATION_METRICS();        
CALL EV_VEHICLE_MAKE_CLASSIFICATION_MODEL!SHOW_GLOBAL_EVALUATION_METRICS();        
CALL EV_VEHICLE_MAKE_CLASSIFICATION_MODEL!SHOW_CONFUSION_MATRIX();        
CALL EV_VEHICLE_MAKE_CLASSIFICATION_MODEL!SHOW_FEATURE_IMPORTANCE();        

Conclusion

In this comprehensive guide, we've explored the capabilities of Snowflake's Classification Cortex function, a robust tool for performing classification tasks directly within the Snowflake ecosystem. By walking through the process of creating, training, and evaluating classification models, we've highlighted key considerations, limitations, and best practices to ensure a smooth workflow.

The hands-on exercise with the InSights Electric Vehicles Data Trends dataset demonstrated practical steps for data preparation, model training, inference, and evaluation, providing a clear roadmap for applying these techniques to your own data analysis projects. With this knowledge, you're equipped to leverage Snowflake's Classification Cortex function for a wide range of applications, from customer churn prediction to fraud detection and beyond.

As you embark on your classification journey, don't hesitate to experiment and refine your models to achieve optimal results. Engage with the provided exercises, share your insights, and embrace the power of Snowflake's classification capabilities. Now, it's your turn to try creating a Cortex model for the vehicle model classification using the same dataset. Feel free to leave comments if you have any questions or need further assistance. Happy classifying!

Madhurima Madgula

Data & Analytics @ CG Infinity | UT Dallas

5 个月

Nice one, Sidarth!??

要查看或添加评论,请登录

社区洞察

其他会员也浏览了