Building a Custom Grid Search for Your Custom Model
Bunyamin Ergen
Artificial Intelligence Engineer @eTa??n | Multi-Agent AI Systems / Agentic AI, LLM, State-of-the-Art Technologies, Multi-Modal Learning, Speech-to-Text, Computer Vision and Adversarial Machine Learning
Introduction
In the ever-evolving world of data science, custom projects are becoming increasingly common. These unique tasks can be both exciting and challenging, requiring custom functions and code to tackle the problems at hand. One essential aspect of these projects is optimizing the model's parameters, which is where a custom grid search comes into play. In this article, we will delve into the logic of grid search and demonstrate how to build a custom grid search for your own use.
Understanding Grid Search Logic
Grid search is a process that repeatedly trains a model with different combinations of hyperparameters to find the best-performing set. For instance, if you have a Random Forest Regressor and want to determine the best hyperparameters, the grid search will create every combination of the given parameters:
grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}
Combinations of Parameters:
{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 5}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 5}
The grid search will then train the model using all possible parameter combinations to identify the best one. However, understanding the logic is not enough; you need to implement it effectively in your projects.
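In code, this enumeration is a one-liner with itertools.product. Here is a minimal sketch using the grid_search_params dictionary above:

import itertools

grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}

# Pair every value combination back with its parameter names
keys = grid_search_params.keys()
param_sets = [dict(zip(keys, values))
              for values in itertools.product(*grid_search_params.values())]

print(len(param_sets))  # 16 combinations: 2 * 2 * 2 * 2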
Enhancing the Grid Search Process
To improve the grid search process, follow these steps:
- Use itertools.product to enumerate all parameter combinations.
- Calculate error metrics for evaluation, writing custom metric functions where needed.
- Implement k-fold cross-validation for a more robust evaluation.
- Train the model on every parameter combination, using parallel workers for efficient computation, and include an option to stop the process at any time with a KeyboardInterrupt (sketched after this list).
- Return all combination results, along with the best parameters and score.
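As a condensed sketch of the k-fold split and the interruptible parallel evaluation (the full implementation appears under Complete Code below), the pieces might fit together like this; evaluate_fold is a hypothetical stand-in for the real train-and-score step:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def k_fold_split(X, y, n_splits, random_state=None):
    # Shuffle the row indices, then carve them into n_splits roughly equal folds
    indices = np.arange(len(y))
    rng = np.random.default_rng(random_state)
    rng.shuffle(indices)
    fold_sizes = np.full(n_splits, len(y) // n_splits, dtype=int)
    fold_sizes[:len(y) % n_splits] += 1
    current = 0
    for fold_size in fold_sizes:
        test_idx = indices[current:current + fold_size]
        train_idx = np.concatenate((indices[:current], indices[current + fold_size:]))
        yield train_idx, test_idx
        current += fold_size

def evaluate_fold(params, train_idx, test_idx):
    # Hypothetical stand-in: a real version would fit the model on the
    # training fold and return its error on the test fold
    return float(len(test_idx))

X, y = np.random.rand(20, 2), np.random.rand(20)
try:
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(evaluate_fold, {'n_estimators': 10}, tr, te)
                   for tr, te in k_fold_split(X, y, n_splits=5, random_state=42)]
        avg_error = sum(f.result() for f in futures) / 5
        print(avg_error)
except KeyboardInterrupt:
    # Allow the user to abort a long-running search with CTRL+C
    print("Interrupted; shutting down workers...")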
Implementation and Usage
The code for building a custom grid search is available in a GitHub gist linked in the comments below. Although the author's custom model is not used as an example due to confidentiality, you can apply this custom grid search function to any dataset, given a parameter grid and a model that produces predicted values to compare against the actual ones.
This custom grid search is compatible with scikit-learn models and can be adapted for other projects. Although scikit-learn already provides its own grid search utilities, creating your own can be a valuable learning experience and can offer greater flexibility.
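For comparison, scikit-learn's built-in equivalent is GridSearchCV; here is a minimal sketch with the same parameter grid, assuming the RandomForestRegressor example used throughout this article:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

np.random.seed(42)
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100)

grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}

# GridSearchCV enumerates the combinations, runs the CV splits,
# and tracks the best score internally
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      grid_search_params,
                      cv=5,
                      scoring='neg_mean_absolute_error',
                      n_jobs=-1)
search.fit(X, y)
print("Best Parameters:", search.best_params_)
print("MAE:", -search.best_score_)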
Conclusion
Creating a custom grid search for your unique model is an essential step in optimizing your model's parameters. By understanding the logic behind grid search and implementing it effectively, you can enhance the performance of your model and tackle complex projects with confidence. So go ahead and apply this custom grid search approach to your own projects, and enjoy the challenge of conquering new data science tasks!
Complete Code:
import sys
import os
import itertools
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.ensemble import RandomForestRegressor


def custom_grid_search(Model, X, y, GridSearchParams, cv=5, ParallelUnits=1,
                       error_metrics="mse", random_state=None):
    """
    Perform a grid search on the given model using the specified parameters,
    returning the results, the best parameters, and the best score.

    Parameters
    ----------
    Model : sklearn model
        The model to perform the grid search on
    X : array-like, shape (n_samples, n_features)
        The training input samples
    y : array-like, shape (n_samples,)
        The target values
    GridSearchParams : dict
        The parameters to search over, with keys as the parameter names and
        values as a list of possible values
    cv : int, optional (default=5)
        The number of cross-validation splits to perform
    ParallelUnits : int, optional (default=1)
        The number of parallel workers to use. If -1, use all available CPU cores
    error_metrics : str, optional (default="mse")
        The error metric to use when determining the best parameters.
        Can be one of "mae", "mse", "rmse", or "r2"
    random_state : int, optional (default=None)
        Seed for random number generator

    Returns
    -------
    results : list
        A list of tuples, each containing a dictionary of parameters and the
        corresponding average error
    best_params : dict
        The best parameters found
    error : float
        The best score found
    """

    # Function to split the data into k-folds for cross-validation
    def k_fold_split(X, y, n_splits, random_state=None):
        """
        Split the data into train and test sets for k-fold cross-validation.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input data
        y : array-like, shape (n_samples,)
            The target data
        n_splits : int
            The number of folds to split the data into
        random_state : int, optional (default=None)
            Seed for random number generator

        Returns
        -------
        generator
            A generator that yields the indices of the train and test sets
            for each fold
        """
        # Create an array of indices
        indices = np.arange(len(y))
        # Set the random seed if specified
        if random_state is not None:
            np.random.seed(random_state)
        # Shuffle the indices
        np.random.shuffle(indices)
        # Calculate the size of each fold
        fold_sizes = np.full(n_splits, len(y) // n_splits, dtype=int)
        fold_sizes[:len(y) % n_splits] += 1
        # Initialize the start position
        current = 0
        # Split the data into train and test sets for k-fold cross-validation
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            test_indices = indices[start:stop]
            train_indices = np.concatenate((indices[:start], indices[stop:]))
            yield train_indices, test_indices
            # Advance to the start of the next fold
            current = stop

    # Mean absolute error calculation function
    def mean_absolute_error(y_true, y_pred):
        """Calculate the mean absolute error between two arrays of target values."""
        return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

    # Mean squared error calculation function
    def mean_squared_error(y_true, y_pred):
        """Calculate the mean squared error between two arrays of target values."""
        return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

    # Root mean squared error calculation function
    def root_mean_squared_error(y_true, y_pred):
        """Calculate the root mean squared error between two arrays of target values."""
        return mean_squared_error(y_true, y_pred) ** 0.5

    # R^2 score calculation function
    def r2_score(y_true, y_pred):
        """Calculate the R^2 score between two arrays of target values."""
        mean_true = sum(y_true) / len(y_true)
        total_var = sum((yt - mean_true) ** 2 for yt in y_true)
        residual_var = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
        return 1 - (residual_var / total_var)

    # Dictionary mapping error metric names to their corresponding functions
    error_metrics_dict = {
        "mae": mean_absolute_error,
        "mse": mean_squared_error,
        "rmse": root_mean_squared_error,
        "r2": r2_score
    }

    # Train the model using the specified parameters and calculate the error
    def process_param_set(args, Model, error_metrics_dict):
        """
        Train the model on one fold using the specified parameters and
        return a tuple of the parameters and the calculated error.
        """
        params, X, y, error_metrics, train_indices, test_indices = args
        # Split the data into training and testing sets using the fold indices
        X_train, X_test = X[train_indices], X[test_indices]
        y_train, y_test = y[train_indices], y[test_indices]
        # Train the model using the specified parameters
        model = Model(**params)
        model.fit(X_train, y_train)
        # Predict using the trained model
        y_pred = model.predict(X_test)
        # Calculate the error using the specified error metric
        error = error_metrics_dict[error_metrics](y_test, y_pred)
        return params, error

    # Initialize variables to store the results
    Results = []
    Keys = GridSearchParams.keys()
    MinError = sys.maxsize
    MaxScore = -sys.maxsize
    BestParams = {}
    ParamSets = [dict(zip(Keys, values))
                 for values in itertools.product(*GridSearchParams.values())]

    # Check the number of parallel units
    if ParallelUnits == -1:
        ParallelUnits = os.cpu_count()
    elif ParallelUnits > os.cpu_count():
        raise RuntimeError(f"Maximum available CPU count: {os.cpu_count()}")

    # Start the thread pool executor
    try:
        with ThreadPoolExecutor(max_workers=ParallelUnits) as executor:
            for params in ParamSets:
                total_error = 0
                futures = []
                for train_indices, test_indices in k_fold_split(X, y, cv, random_state):
                    future = executor.submit(
                        process_param_set,
                        (params, X, y, error_metrics, train_indices, test_indices),
                        Model, error_metrics_dict)
                    futures.append(future)
                for future in futures:
                    _, error = future.result()
                    total_error += error
                avg_error = total_error / cv
                Results.append((params, avg_error))
                # R^2 is a score to maximize; the other metrics are errors to minimize
                if error_metrics == "r2":
                    if avg_error > MaxScore:
                        MaxScore = avg_error
                        BestParams = params
                else:
                    if avg_error < MinError:
                        MinError = avg_error
                        BestParams = params
    # Handle keyboard interrupt to stop the execution
    except KeyboardInterrupt:
        print("CTRL+C detected. Terminating the process...")
        executor.shutdown(wait=False)

    # Return the results, best parameters and error
    return Results, BestParams, MaxScore if error_metrics == "r2" else MinError


# Simple dataset
np.random.seed(42)
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100)

# Parameters
grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}

# Searching best parameters
results, best_params, error = custom_grid_search(
    RandomForestRegressor, X, y, grid_search_params,
    cv=5, ParallelUnits=-1, error_metrics="mae", random_state=42)

# Print results
print("Results:")
for res in results:
    print(res)
print("Best Parameters:", best_params)
print("Error:", error)