Easily create machine learning models with BigQuery ML! From data processing and model creation to predictions.
The following is a tech blog I wrote while working at CyberOwl, Inc. The original version, written in Japanese, can be found here.
In a business context, data is often collected and stored in data warehouses like BigQuery. BigQuery has a feature called BigQuery ML that allows you to create, evaluate, and make predictions with machine learning models using SQL queries on the stored data. With BigQuery ML, you don't need to build your own infrastructure or learn libraries like TensorFlow or PyTorch, making it easy to get started with machine learning. You can choose from a variety of models, from simple ones like linear regression and logistic regression to more complex ones like deep neural networks and gradient-boosted trees. Additionally, there are features for semi-automating hyperparameter tuning, which we will cover later in this post.
In this post, I have compiled the knowledge you need to get started with creating machine learning models in BigQuery ML.
Contents
- Basics of Machine Learning Model Creation
- Basics of Machine Learning Model Prediction
- DNN (Deep Neural Network) Regression Analysis
- Normalise Numerical Features! ML.STANDARD_SCALER
- Trying Predictions with the Created Model
- Convenient Feature of BigQuery ML: TRANSFORM Clause
- XGBoost Regression Analysis
- Effective Feature Engineering with Gradient Boosting Trees
- Generating Polynomials: ML.POLYNOMIAL_EXPAND
- Converting Numerical Features into Categorical Features: ML.QUANTILE_BUCKETIZE
- Combining Categorical Features: ML.FEATURE_CROSS
- Semi-Automating Hyperparameter Tuning!
- Conclusion
Basics of Machine Learning Model Creation
In this section, we introduce how to create a machine learning model in BigQuery ML. In the following example, we train a model on the data in the table project_id.dataset_name.table_name and save it as project_id.dataset_name.linear_model_name.
The table project_id.dataset_name.table_name contains the columns label_column, feature1, feature2, and feature3. With label_column as the target variable, the model learns to predict label_column from the data in feature1, feature2, and feature3. This means that all four columns are used during training, but only the feature columns are required at prediction time.
CREATE OR REPLACE MODEL `project_id.dataset_name.linear_model_name`
OPTIONS(model_type='LINEAR_REG', input_label_cols=['label_column']) AS
SELECT
label_column,
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.table_name`
CREATE OR REPLACE MODEL: Determines the model name. If a model with the same name exists in the dataset, it will overwrite the existing model. In the example above, the model name is set as linear_model_name.
OPTIONS Clause: Allows specifying the model type and its details.
- model_type: Specifies the model type. Use the model type names as introduced in BigQuery's documentation. For example, if you want to perform linear regression, you would specify LINEAR_REG, and for classification using a neural network, you would specify DNN_CLASSIFIER.
- input_label_cols[ ]: Specifies the column name in the training data that serves as the target variable. This column name becomes the variable that the model predicts. In the example above, project_id.dataset_name.table_name has a column named label_column, which is used as the target variable.
AS SELECT FROM: In the AS section, you use SQL's SELECT to specify the data to be used for training. Columns not specified in input_label_cols are used as features. In the example above, label_column becomes the target variable, while feature1, feature2, and feature3 become features.
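Once training finishes, you can check the model's quality with the ML.EVALUATE function. Below is a minimal sketch (not part of the original example) that evaluates the linear model created above; for regression models it returns metrics such as mean_absolute_error, mean_squared_error, and r2_score.
SELECT
*
FROM
ML.EVALUATE(MODEL `project_id.dataset_name.linear_model_name`)
If you pass no second argument, the evaluation runs on the data BigQuery ML reserved for evaluation during training (or on the training data itself if no split was used); you can also pass a SELECT over labelled data, shaped like the training query.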
Basics of Machine Learning Model Prediction
Once you have created a model, you can make predictions with it by passing the model and a SELECT over the feature columns to the ML.PREDICT function, as sketched below.
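The following is a minimal sketch, assuming the linear model created above and a test table test_table that contains the feature columns (the target variable is not needed at prediction time).
SELECT
*
FROM
ML.PREDICT(MODEL `project_id.dataset_name.linear_model_name`,
(SELECT
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.test_table`))
ML.PREDICT returns the rows of the input query together with an added column named predicted_label_column (more generally, predicted_ followed by the label column name) containing the model's predictions.
DNN (Deep Neural Network) Regression Analysis
Next, let's look at regression with a deep neural network. In the following example, a DNN regression model is created and saved as project_id.dataset_name.DNN_model_name.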
CREATE OR REPLACE MODEL `project_id.dataset_name.DNN_model_name`
OPTIONS(
model_type='DNN_REGRESSOR',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 25,
HIDDEN_UNITS = [16, 8],
LEARN_RATE = 0.02,
OPTIMIZER = 'ADAGRAD',
DATA_SPLIT_METHOD = 'SEQ',
DATA_SPLIT_EVAL_FRACTION = 0.2,
DATA_SPLIT_COL = 'feature3',
input_label_cols=['label_column']
) AS
SELECT
label_column,
ML.STANDARD_SCALER(feature1) OVER () AS scaled_feature1,
ML.STANDARD_SCALER(feature2) OVER () AS scaled_feature2,
feature3
FROM
`project_id.dataset_name.table_name`
This time, the model is a deep neural network regression, so we use DNN_REGRESSOR as the model_type. When using a neural network, you need to specify parameters such as the activation function (ACTIVATION_FN), batch size (BATCH_SIZE), learning rate (LEARN_RATE), the number of hidden layers and the units in each layer (HIDDEN_UNITS), and the optimisation algorithm (OPTIMIZER). You also need to specify how the data is split into training and evaluation sets (DATA_SPLIT_METHOD, DATA_SPLIT_EVAL_FRACTION, DATA_SPLIT_COL). If not specified, BigQuery ML will fall back to default values; however, since the defaults may not produce an accurate model, some machine learning knowledge is essential for tuning these values. You can check the details of what can be specified for DNN tuning in this document.
Normalise Numerical Features! ML.STANDARD_SCALER
One of the things to be mindful of when creating a neural network is the normalisation of numerical features. Features on very different scales can make gradient-based training slow or unstable, so in the query above ML.STANDARD_SCALER is applied to feature1 and feature2 to standardise them to mean 0 and standard deviation 1.
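As an illustrative sketch (not part of the model above), the following shows roughly what ML.STANDARD_SCALER computes, compared with a manual calculation. Note that outside a TRANSFORM clause the preprocessing functions are analytic and take an OVER() clause.
SELECT
feature1,
ML.STANDARD_SCALER(feature1) OVER () AS scaled_feature1,
(feature1 - AVG(feature1) OVER ()) / STDDEV(feature1) OVER () AS manual_scaled_feature1
FROM
`project_id.dataset_name.table_name`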
Trying Predictions with the Created Model
Now let's make predictions with the DNN model we just created. In the example below, the name of the test table is set as test_table.
SELECT
*
FROM
ML.PREDICT(MODEL `project_id.dataset_name.DNN_model_name`,
(SELECT
ML.STANDARD_SCALER(feature1) OVER () AS scaled_feature1,
ML.STANDARD_SCALER(feature2) OVER () AS scaled_feature2,
feature3
FROM
`project_id.dataset_name.test_table`))
For neural networks, the numerical features used in training were normalised with ML.STANDARD_SCALER, so the data used for prediction must be normalised in the same way. That is why the query above again applies ML.STANDARD_SCALER(feature1) AS scaled_feature1 and ML.STANDARD_SCALER(feature2) AS scaled_feature2.
However, when a model has many features, or when you create many models, it becomes cumbersome to remember how each feature was edited and to apply ML.STANDARD_SCALER every time you make a prediction. Moreover, when you perform feature engineering with functions like ML.FEATURE_CROSS or ML.POLYNOMIAL_EXPAND, repeating it at prediction time further increases processing time.
Fortunately, BigQuery ML provides a solution to this problem through the TRANSFORM clause.
Convenient Feature of BigQuery ML: TRANSFORM Clause
The TRANSFORM clause allows you to define data preprocessing during model creation, automating that processing during prediction and evaluation stages. In the example below, the same model as in the previous DNN regression analysis is created using the TRANSFORM clause.
CREATE OR REPLACE MODEL `project_id.dataset_name.DNN_model_name`
#-------TRANSFORM Clause-------
TRANSFORM(
label_column,
ML.STANDARD_SCALER(feature1) AS scaled_feature1,
ML.STANDARD_SCALER(feature2) AS scaled_feature2,
feature3
)
#------------------------------
OPTIONS(
model_type='DNN_REGRESSOR',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 25,
HIDDEN_UNITS = [16, 8],
LEARN_RATE = 0.02,
OPTIMIZER = 'ADAGRAD',
DATA_SPLIT_METHOD = 'SEQ',
DATA_SPLIT_EVAL_FRACTION = 0.2,
DATA_SPLIT_COL = 'feature3',
input_label_cols=['label_column']
) AS
SELECT
label_column,
feature1, # no preprocessing here; handled by the TRANSFORM clause
feature2, # no preprocessing here; handled by the TRANSFORM clause
feature3
FROM
`project_id.dataset_name.table_name`
In the SQL query following the OPTIONS clause and AS clause, you only need to specify the columns in the table that will be used as features and the target variable. The data processing is handled by the TRANSFORM clause. Below is the query for prediction when using the TRANSFORM clause.
SELECT
*
FROM
ML.PREDICT(MODEL `project_id.dataset_name.DNN_model_name`,
(SELECT
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.test_table`))
Since preprocessing is defined in the TRANSFORM clause, there is no need to use ML.STANDARD_SCALER here. As a result, data preprocessing can be omitted during the prediction and evaluation stages.
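The same applies to evaluation. As a sketch, assuming test_table also contains the label column, ML.EVALUATE can be given the raw columns and the preprocessing defined in the TRANSFORM clause is applied automatically.
SELECT
*
FROM
ML.EVALUATE(MODEL `project_id.dataset_name.DNN_model_name`,
(SELECT
label_column,
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.test_table`))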
XGBoost Regression Analysis
Next, let's take a look at regression analysis using gradient boosting trees; in BigQuery ML these are provided by the XGBoost-based BOOSTED_TREE_REGRESSOR model type. Generally, machine learning models based on decision trees do not require normalisation of numerical features, so the TRANSFORM clause in the following example focuses on feature engineering instead: polynomial expansion, quantile bucketizing, and feature crosses.
CREATE OR REPLACE MODEL `project_id.dataset_name.xgboost_model_name`
TRANSFORM(
label_column,
ML.POLYNOMIAL_EXPAND(STRUCT(feature1), 2) AS poly_feature1,
ML.QUANTILE_BUCKETIZE(feature2, 10) AS bucketized_feature2,
ML.FEATURE_CROSS(STRUCT(feature3, feature4)) AS feature_cross,
feature5
)
OPTIONS(
model_type='BOOSTED_TREE_REGRESSOR',
DATA_SPLIT_METHOD = 'SEQ',
DATA_SPLIT_EVAL_FRACTION = 0.2,
DATA_SPLIT_COL = 'feature5',
input_label_cols=['label_column']
) AS
SELECT
label_column,
feature1,
feature2,
feature3,
feature4,
feature5
FROM
`project_id.dataset_name.table_name`
Effective Feature Engineering with Gradient Boosting Trees
Generating Polynomials: ML.POLYNOMIAL_EXPAND
ML.POLYNOMIAL_EXPAND creates polynomial features by multiplying existing features together, making it easier for the model to learn nonlinear relationships. In the example above, feature1 is expanded into polynomial features up to degree 2, which are added as a new feature called poly_feature1. You can also pass multiple numerical features to multiply them with each other.
Converting Numerical Features into Categorical Features: ML.QUANTILE_BUCKETIZE
ML.QUANTILE_BUCKETIZE transforms a numerical feature into a categorical one, which helps the model capture patterns hidden within numerical data. In the example above, the numerical feature feature2 is split into 10 quantile-based buckets, each containing roughly the same number of rows, and added as a new feature called bucketized_feature2.
Combining Categorical Features: ML.FEATURE_CROSS
ML.FEATURE_CROSS combines two or more categorical features into new features, making it easier for the model to capture relationships between them. In the example above, it creates a new feature called feature_cross by crossing feature3 and feature4.
In addition to these, BigQuery ML offers various other feature engineering functions, which can be found in the documentation.
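If you want to inspect what these functions produce before baking them into a model, you can run them in an ordinary SELECT. Here is a small sketch, assuming the same table; note that ML.QUANTILE_BUCKETIZE (like ML.STANDARD_SCALER) is an analytic function outside a TRANSFORM clause and therefore needs OVER(), while ML.POLYNOMIAL_EXPAND and ML.FEATURE_CROSS do not.
SELECT
ML.POLYNOMIAL_EXPAND(STRUCT(feature1), 2) AS poly_feature1,
ML.QUANTILE_BUCKETIZE(feature2, 10) OVER () AS bucketized_feature2,
ML.FEATURE_CROSS(STRUCT(feature3, feature4)) AS feature_cross
FROM
`project_id.dataset_name.table_name`
ML.QUANTILE_BUCKETIZE returns bucket labels such as bin_1, bin_2, and so on, while ML.FEATURE_CROSS returns the crossed combinations of the input categories.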
Semi-Automating Hyperparameter Tuning!
Machine learning, as the name suggests, learns variables automatically, but there are parameters that need to be set before training the model. These are called hyperparameters, and identifying the appropriate hyperparameters is essential for creating a highly accurate machine learning model. Hyperparameters include the learning rate (how quickly the model learns from data), batch size (the number of data samples processed at once), and model-specific ones (such as the depth of decision trees or the number of hidden layers in neural networks), some of which have a significant impact on the model's accuracy.
The process of searching for better hyperparameters is called hyperparameter tuning, but it can be time-consuming, which is a challenge. Therefore, it is necessary to efficiently explore hyperparameters using methods like grid search or random search.
BigQuery ML provides features that support this exploration using specific algorithms. The example below is for the OPTIONS clause when performing DNN regression.
OPTIONS(
model_type='DNN_REGRESSOR',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 25,
HIDDEN_UNITS = HPARAM_CANDIDATES([struct([8]), struct([14]), struct([8, 4])]),
LEARN_RATE = HPARAM_RANGE(0.01, 0.03),
HPARAM_TUNING_ALGORITHM = 'GRID_SEARCH',
HPARAM_TUNING_OBJECTIVES= ['mean_squared_error'],
NUM_TRIALS = 20,
input_label_cols=['label_column']
)
HPARAM_CANDIDATES([candidates]): You can create a list of hyperparameters to explore. In the example above, it is set to explore the number of hidden layers and hidden units in a neural network. BigQuery ML will try different configurations, such as one hidden layer with 8 units, one hidden layer with 14 units, and two hidden layers with 8 units in the first layer and 4 units in the second layer. It is also possible to explore optimisation algorithms like OPTIMIZER=HPARAM_CANDIDATES(['adagrad', 'sgd', 'ftrl']).
HPARAM_RANGE(min, max): This is used when tuning numerical hyperparameters like learning rate or batch size. You set the lower and upper bounds of the search range, and BigQuery ML will try different values within that range. In the example above, it is searching within the range of learning rates from 0.01 to 0.03.
HPARAM_TUNING_ALGORITHM: Specifies the algorithm used for hyperparameter tuning. In the example above, GRID_SEARCH (grid search) is selected.
HPARAM_TUNING_OBJECTIVES[ ]: Specifies the loss function you want to minimise through hyperparameter tuning. In the example above, mean_squared_error (mean squared error) is selected.
NUM_TRIALS: Sets the number of trials for hyperparameter tuning. When HPARAM_CANDIDATES has many candidates or HPARAM_RANGE spans a wide range, NUM_TRIALS must be set sufficiently large, or the search may not find good hyperparameters. However, increasing NUM_TRIALS also requires more computation and time, so it is a trade-off. An appropriate value depends on the characteristics of the training data and the model, and is usually found through trial and error.
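After a tuning run completes, you can compare the trials. As a sketch, assuming the tuned model above, the ML.TRIAL_INFO function lists each trial's hyperparameters and objective value, so you can see which configuration performed best; ML.PREDICT then uses the optimal trial by default.
SELECT
*
FROM
ML.TRIAL_INFO(MODEL `project_id.dataset_name.DNN_model_name`)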
In Conclusion
BigQuery ML is an excellent service, considering its low learning cost compared to other machine learning frameworks and its ability to semi-automate tasks like hyperparameter tuning. It also offers the advantage of analysing data stored in BigQuery directly, without moving it elsewhere. As mentioned at the beginning, it is sometimes said that 'you can build machine learning models in BigQuery ML without specialised knowledge.' However, considering the need for model selection, feature selection, and hyperparameter tuning, this likely holds only when using AutoML; in other cases, some machine learning knowledge is still required. I have written in detail about AutoML and its accuracy in another blog post, so please check that out as well.
This concludes the basic usage of BigQuery ML. For more detailed usage instructions, please refer to the documentation and try implementing machine learning in BigQuery. Thank you for reading until the end.