Easily create machine learning models with BigQuery ML! From data processing and model creation to predictions.
The following is a tech blog I wrote while working at CyberOwl, Inc. The original version, written in Japanese, can be found here.
In a business context, data is often collected and stored in data warehouses like BigQuery. BigQuery has a feature called BigQuery ML that allows you to create, evaluate, and make predictions with machine learning models using SQL queries on the stored data. With BigQuery ML, you don't need to build your own infrastructure or learn libraries like TensorFlow or PyTorch, making it easy to get started with machine learning. You can choose from a variety of models, from simple ones like linear regression and logistic regression to more complex ones like deep neural networks and gradient-boosted trees. Additionally, there are features for semi-automating hyperparameter tuning, which we will cover later in this post.
In this post, I have compiled the knowledge you need to get started with creating machine learning models in BigQuery ML.
Contents
- Basics of Machine Learning Model Creation
- Basics of Machine Learning Model Prediction
- DNN (Deep Neural Network) Regression Analysis
- Normalise Numerical Features! ML.STANDARD_SCALER
- Trying Predictions with the Created Model
- Convenient Feature of BigQuery ML: TRANSFORM Clause
- XGBoost Regression Analysis
- Effective Feature Engineering with Gradient Boosting Trees
- Generating Polynomials: ML.POLYNOMIAL_EXPAND
- Converting Numerical Features into Categorical Features: ML.QUANTILE_BUCKETIZE
- Combining Categorical Features: ML.FEATURE_CROSS
- Semi-Automating Hyperparameter Tuning!
- Conclusion
Basics of Machine Learning Model Creation
In this section, we introduce how to create a machine learning model in BigQuery ML. In the following example, we train a model on the data in the table project_id.dataset_name.table_name and save it as project_id.dataset_name.linear_model_name.
The table project_id.dataset_name.table_name contains the columns label_column, feature1, feature2, and feature3. With label_column as the target variable, the model learns to predict label_column from the data in feature1, feature2, and feature3. This means that all four columns are used during training, but only the feature columns are required at prediction time.
CREATE OR REPLACE MODEL `project_id.dataset_name.linear_model_name`
OPTIONS(model_type='LINEAR_REG', input_label_cols=['label_column']) AS
SELECT
label_column,
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.table_name`
CREATE OR REPLACE MODEL: Determines the model name. If a model with the same name exists in the dataset, it will overwrite the existing model. In the example above, the model name is set as linear_model_name.
OPTIONS Clause: Allows specifying the model type and its details.
- model_type: Specifies the model type. Use the model type names as introduced in BigQuery's documentation. For example, if you want to perform linear regression, you would specify LINEAR_REG, and for classification using a neural network, you would specify DNN_CLASSIFIER.
- input_label_cols[ ]: Specifies the column name in the training data that serves as the target variable. This column name becomes the variable that the model predicts. In the example above, project_id.dataset_name.table_name has a column named label_column, which is used as the target variable.
AS SELECT FROM: In the AS section, you use SQL's SELECT to specify the data to be used for training. Columns not specified in input_label_cols are used as features. In the example above, label_column becomes the target variable, while feature1, feature2, and feature3 become features.
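Once training finishes, you can check the model's quality with the ML.EVALUATE function. Below is a minimal sketch (not part of the original example) that evaluates the linear model created above; for regression models it returns metrics such as mean_absolute_error, mean_squared_error, and r2_score.
SELECT
*
FROM
ML.EVALUATE(MODEL `project_id.dataset_name.linear_model_name`)
If you pass no second argument, the evaluation runs on the data BigQuery ML reserved for evaluation during training (or on the training data itself if no split was used); you can also pass a SELECT over labelled data, shaped like the training query.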
Basics of Machine Learning Model Prediction
Once you have created a model, you can make predictions with it by passing the model and a SELECT over the feature columns to the ML.PREDICT function, as sketched below.
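The following is a minimal sketch, assuming the linear model created above and a test table test_table that contains the feature columns (the target variable is not needed at prediction time).
SELECT
*
FROM
ML.PREDICT(MODEL `project_id.dataset_name.linear_model_name`,
(SELECT
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.test_table`))
ML.PREDICT returns the rows of the input query together with an added column named predicted_label_column (more generally, predicted_ followed by the label column name) containing the model's predictions.
DNN (Deep Neural Network) Regression Analysis
Next, let's look at regression with a deep neural network. In the following example, a DNN regression model is created and saved as project_id.dataset_name.DNN_model_name.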
CREATE OR REPLACE MODEL `project_id.dataset_name.DNN_model_name`
OPTIONS(
model_type='DNN_REGRESSOR',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 25,
HIDDEN_UNITS = [16, 8],
LEARN_RATE = 0.02,
OPTIMIZER = 'ADAGRAD',
DATA_SPLIT_METHOD = 'SEQ',
DATA_SPLIT_EVAL_FRACTION = 0.2,
DATA_SPLIT_COL = 'feature3',
input_label_cols=['label_column']
) AS
SELECT
label_column,
ML.STANDARD_SCALER(feature1) OVER () AS scaled_feature1,
ML.STANDARD_SCALER(feature2) OVER () AS scaled_feature2,
feature3
FROM
`project_id.dataset_name.table_name`
This time, the model is a deep neural network regression, so we use DNN_REGRESSOR as the model_type. When using a neural network, you need to specify parameters such as the activation function (ACTIVATION_FN), batch size (BATCH_SIZE), learning rate (LEARN_RATE), the number of hidden layers and the units in each layer (HIDDEN_UNITS), and the optimisation algorithm (OPTIMIZER). You also need to specify how the data is split into training and evaluation sets (DATA_SPLIT_METHOD, DATA_SPLIT_EVAL_FRACTION, DATA_SPLIT_COL). If not specified, BigQuery ML will fall back to default values; however, since the defaults may not produce an accurate model, some machine learning knowledge is essential for tuning these values. You can check the details of what can be specified for DNN tuning in this document.
Normalise Numerical Features! ML.STANDARD_SCALER
One of the things to be mindful of when creating a neural network is the normalisation of numerical features. Features on very different scales can make gradient-based training slow or unstable, so in the query above ML.STANDARD_SCALER is applied to feature1 and feature2 to standardise them to mean 0 and standard deviation 1.
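As an illustrative sketch (not part of the model above), the following shows roughly what ML.STANDARD_SCALER computes, compared with a manual calculation. Note that outside a TRANSFORM clause the preprocessing functions are analytic and take an OVER() clause.
SELECT
feature1,
ML.STANDARD_SCALER(feature1) OVER () AS scaled_feature1,
(feature1 - AVG(feature1) OVER ()) / STDDEV(feature1) OVER () AS manual_scaled_feature1
FROM
`project_id.dataset_name.table_name`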
Trying Predictions with the Created Model
Now let's make predictions with the DNN model we just created. In the example below, the name of the test table is set as test_table.
SELECT
*
FROM
ML.PREDICT(MODEL `project_id.dataset_name.DNN_model_name`,
(SELECT
ML.STANDARD_SCALER(feature1) OVER () AS scaled_feature1,
ML.STANDARD_SCALER(feature2) OVER () AS scaled_feature2,
feature3
FROM
`project_id.dataset_name.test_table`))
For neural networks, the numerical features used in training were normalised with ML.STANDARD_SCALER, so the data used for prediction must be normalised in the same way. That is why the query above again applies ML.STANDARD_SCALER(feature1) AS scaled_feature1 and ML.STANDARD_SCALER(feature2) AS scaled_feature2.
However, when a model has many features, or when you create many models, it becomes cumbersome to remember how each feature was edited and to apply ML.STANDARD_SCALER every time you make a prediction. Moreover, when you perform feature engineering with functions like ML.FEATURE_CROSS or ML.POLYNOMIAL_EXPAND, repeating it at prediction time further increases processing time.
Fortunately, BigQuery ML provides a solution to this problem through the TRANSFORM clause.
Convenient Feature of BigQuery ML: TRANSFORM Clause
The TRANSFORM clause allows you to define data preprocessing during model creation, automating that processing during prediction and evaluation stages. In the example below, the same model as in the previous DNN regression analysis is created using the TRANSFORM clause.
CREATE OR REPLACE MODEL `project_id.dataset_name.DNN_model_name`
#-------TRANSFORM Clause-------
TRANSFORM(
label_column,
ML.STANDARD_SCALER(feature1) AS scaled_feature1,
ML.STANDARD_SCALER(feature2) AS scaled_feature2,
feature3
)
#------------------------------
OPTIONS(
model_type='DNN_REGRESSOR',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 25,
HIDDEN_UNITS = [16, 8],
LEARN_RATE = 0.02,
OPTIMIZER = 'ADAGRAD',
DATA_SPLIT_METHOD = 'SEQ',
DATA_SPLIT_EVAL_FRACTION = 0.2,
DATA_SPLIT_COL = 'feature3',
input_label_cols=['label_column']
) AS
SELECT
label_column,
feature1, # no preprocessing here; handled by the TRANSFORM clause
feature2, # no preprocessing here; handled by the TRANSFORM clause
feature3
FROM
`project_id.dataset_name.table_name`
In the SQL query following the OPTIONS clause and AS clause, you only need to specify the columns in the table that will be used as features and the target variable. The data processing is handled by the TRANSFORM clause. Below is the query for prediction when using the TRANSFORM clause.
SELECT
*
FROM
ML.PREDICT(MODEL `project_id.dataset_name.DNN_model_name`,
(SELECT
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.test_table`))
Since preprocessing is defined in the TRANSFORM clause, there is no need to use ML.STANDARD_SCALER here. As a result, data preprocessing can be omitted during the prediction and evaluation stages.
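The same applies to evaluation. As a sketch, assuming test_table also contains the label column, ML.EVALUATE can be given the raw columns and the preprocessing defined in the TRANSFORM clause is applied automatically.
SELECT
*
FROM
ML.EVALUATE(MODEL `project_id.dataset_name.DNN_model_name`,
(SELECT
label_column,
feature1,
feature2,
feature3
FROM
`project_id.dataset_name.test_table`))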
XGBoost Regression Analysis
Next, let's take a look at regression analysis using gradient boosting trees; in BigQuery ML these are provided by the XGBoost-based BOOSTED_TREE_REGRESSOR model type. Generally, machine learning models based on decision trees do not require normalisation of numerical features, so the TRANSFORM clause in the following example focuses on feature engineering instead: polynomial expansion, quantile bucketizing, and feature crosses.
CREATE OR REPLACE MODEL `project_id.dataset_name.xgboost_model_name`
TRANSFORM(
label_column,
ML.POLYNOMIAL_EXPAND(STRUCT(feature1), 2) AS poly_feature1,
ML.QUANTILE_BUCKETIZE(feature2, 10) AS bucketized_feature2,
ML.FEATURE_CROSS(STRUCT(feature3, feature4)) AS feature_cross,
feature5
)
OPTIONS(
model_type='BOOSTED_TREE_REGRESSOR',
DATA_SPLIT_METHOD = 'SEQ',
DATA_SPLIT_EVAL_FRACTION = 0.2,
DATA_SPLIT_COL = 'feature5',
input_label_cols=['label_column']
) AS
SELECT
label_column,
feature1,
feature2,
feature3,
feature4,
feature5
FROM
`project_id.dataset_name.table_name`
Effective Feature Engineering with Gradient Boosting Trees
Generating Polynomials: ML.POLYNOMIAL_EXPAND
ML.POLYNOMIAL_EXPAND creates polynomial features by multiplying existing features together, making it easier for the model to learn nonlinear relationships. In the example above, feature1 is expanded into polynomial features up to degree 2, which are added as a new feature called poly_feature1. You can also pass multiple numerical features to multiply them with each other.
Converting Numerical Features into Categorical Features: ML.QUANTILE_BUCKETIZE
ML.QUANTILE_BUCKETIZE transforms a numerical feature into a categorical one, which helps the model capture patterns hidden within numerical data. In the example above, the numerical feature feature2 is split into 10 quantile-based buckets, each containing roughly the same number of rows, and added as a new feature called bucketized_feature2.
Combining Categorical Features: ML.FEATURE_CROSS
ML.FEATURE_CROSS combines two or more categorical features into new features, making it easier for the model to capture relationships between them. In the example above, it creates a new feature called feature_cross by crossing feature3 and feature4.
In addition to these, BigQuery ML offers various other feature engineering functions, which can be found in the documentation.
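If you want to inspect what these functions produce before baking them into a model, you can run them in an ordinary SELECT. Here is a small sketch, assuming the same table; note that ML.QUANTILE_BUCKETIZE (like ML.STANDARD_SCALER) is an analytic function outside a TRANSFORM clause and therefore needs OVER(), while ML.POLYNOMIAL_EXPAND and ML.FEATURE_CROSS do not.
SELECT
ML.POLYNOMIAL_EXPAND(STRUCT(feature1), 2) AS poly_feature1,
ML.QUANTILE_BUCKETIZE(feature2, 10) OVER () AS bucketized_feature2,
ML.FEATURE_CROSS(STRUCT(feature3, feature4)) AS feature_cross
FROM
`project_id.dataset_name.table_name`
ML.QUANTILE_BUCKETIZE returns bucket labels such as bin_1, bin_2, and so on, while ML.FEATURE_CROSS returns the crossed combinations of the input categories.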
Semi-Automating Hyperparameter Tuning!
Machine learning, as the name suggests, learns variables automatically, but there are parameters that need to be set before training the model. These are called hyperparameters, and identifying the appropriate hyperparameters is essential for creating a highly accurate machine learning model. Hyperparameters include the learning rate (how quickly the model learns from data), batch size (the number of data samples processed at once), and model-specific ones (such as the depth of decision trees or the number of hidden layers in neural networks), some of which have a significant impact on the model's accuracy.
The process of searching for better hyperparameters is called hyperparameter tuning, but it can be time-consuming, which is a challenge. Therefore, it is necessary to efficiently explore hyperparameters using methods like grid search or random search.
BigQuery ML provides features that support this exploration using specific algorithms. The example below is for the OPTIONS clause when performing DNN regression.
OPTIONS(
model_type='DNN_REGRESSOR',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 25,
HIDDEN_UNITS = HPARAM_CANDIDATES([struct([8]), struct([14]), struct([8, 4])]),
LEARN_RATE = HPARAM_RANGE(0.01, 0.03),
HPARAM_TUNING_ALGORITHM = 'GRID_SEARCH',
HPARAM_TUNING_OBJECTIVES= ['mean_squared_error'],
NUM_TRIALS = 20,
input_label_cols=['label_column']
)
HPARAM_CANDIDATES([candidates]): You can create a list of hyperparameters to explore. In the example above, it is set to explore the number of hidden layers and hidden units in a neural network. BigQuery ML will try different configurations, such as one hidden layer with 8 units, one hidden layer with 14 units, and two hidden layers with 8 units in the first layer and 4 units in the second layer. It is also possible to explore optimisation algorithms like OPTIMIZER=HPARAM_CANDIDATES(['adagrad', 'sgd', 'ftrl']).
HPARAM_RANGE(min, max): This is used when tuning numerical hyperparameters like learning rate or batch size. You set the lower and upper bounds of the search range, and BigQuery ML will try different values within that range. In the example above, it is searching within the range of learning rates from 0.01 to 0.03.
HPARAM_TUNING_ALGORITHM: Specifies the algorithm used for hyperparameter tuning. In the example above, GRID_SEARCH (grid search) is selected.
HPARAM_TUNING_OBJECTIVES[ ]: Specifies the loss function you want to minimise through hyperparameter tuning. In the example above, mean_squared_error (mean squared error) is selected.
NUM_TRIALS: Sets the number of trials for hyperparameter tuning. When HPARAM_CANDIDATES has many candidates or HPARAM_RANGE spans a wide range, NUM_TRIALS must be set sufficiently large, or the search may not find good hyperparameters. However, increasing NUM_TRIALS also requires more computation and time, so it is a trade-off. An appropriate value depends on the characteristics of the training data and the model, and is usually found through trial and error.
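After a tuning run completes, you can compare the trials. As a sketch, assuming the tuned model above, the ML.TRIAL_INFO function lists each trial's hyperparameters and objective value, so you can see which configuration performed best; ML.PREDICT then uses the optimal trial by default.
SELECT
*
FROM
ML.TRIAL_INFO(MODEL `project_id.dataset_name.DNN_model_name`)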
In Conclusion
BigQuery ML is an excellent service, considering its low learning cost compared to other machine learning frameworks and its ability to semi-automate tasks like hyperparameter tuning. It also offers the advantage of analysing data stored in BigQuery directly, without moving it elsewhere. As mentioned at the beginning, it is sometimes said that 'you can build machine learning models in BigQuery ML without specialised knowledge.' However, considering the need for model selection, feature selection, and hyperparameter tuning, this likely holds only when using AutoML; in other cases, some machine learning knowledge is still required. I have written in detail about AutoML and its accuracy in another blog post, so please check that out as well.
This concludes the basic usage of BigQuery ML. For more detailed usage instructions, please refer to the documentation and try implementing machine learning in BigQuery. Thank you for reading until the end.