Building a Custom Grid Search for Your Custom Model

Introduction

In the ever-evolving world of data science, custom projects are becoming increasingly common. These unique tasks can be both exciting and challenging, requiring custom functions and code to tackle the problems at hand. One essential aspect of these projects is optimizing the model's parameters, which is where a custom grid search comes into play. In this article, we will delve into the logic of grid search and demonstrate how to build a custom grid search for your own use.

Understanding Grid Search Logic

Grid search is a process that repeatedly trains a model with different combinations of hyperparameter values to find the best-performing set. For instance, if you have a Random Forest Regressor and want to determine its best hyperparameters, the grid search will build every combination of the given parameter values:

grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}

Combinations of Parameters:

{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 5}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 10, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 40, 'min_samples_split': 3, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 2, 'min_samples_leaf': 5}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 4}
{'n_estimators': 20, 'max_depth': 50, 'min_samples_split': 3, 'min_samples_leaf': 5}

The grid search will then train the model using all possible parameter combinations to identify the best one. However, understanding the logic is not enough; you need to implement it effectively in your projects.
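
As a quick sketch, these combinations can be generated with itertools.product, which is also how the complete code at the end of this article builds them:

import itertools

grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}

# Every combination of the given values: 2 * 2 * 2 * 2 = 16 parameter dictionaries
keys = grid_search_params.keys()
param_sets = [dict(zip(keys, values))
              for values in itertools.product(*grid_search_params.values())]

for params in param_sets:
    print(params)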

Enhancing the Grid Search Process

To improve the grid search process, follow these steps:

  1. Use itertools to generate all parameter combinations.
  2. Calculate error metrics for evaluation, using custom functions if needed.
  3. Implement k-fold cross-validation for a more robust evaluation (a short sketch of steps 2 and 3 follows this list).
  4. Train the model with all parameter combinations, using parallel programming for efficient computation, and include an option to stop the process at any time with KeyboardInterrupt.
  5. Return all combination results, the best parameters, and the best score.
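
As a rough, sequential sketch of steps 2 and 3, assuming NumPy arrays and a scikit-learn-style estimator, a single parameter set can be scored with a simple k-fold loop. The complete, parallelized implementation with custom metrics and KeyboardInterrupt handling is given at the end of the article.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def score_param_set(params, X, y, n_splits=5):
    # Shuffle the row indices and cut them into n_splits folds
    indices = np.arange(len(y))
    np.random.shuffle(indices)
    folds = np.array_split(indices, n_splits)

    errors = []
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])

        # Train on the remaining folds and predict the held-out fold
        model = RandomForestRegressor(**params)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])

        # Mean squared error for this fold (step 2: the error metric)
        errors.append(np.mean((y[test_idx] - y_pred) ** 2))

    # Average error across all folds (step 3: k-fold cross-validation)
    return sum(errors) / n_splits

# Example call: score_param_set({'n_estimators': 10, 'max_depth': 40}, X, y)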

Implementation and Usage

The code for building a custom grid search is available in a GitHub gist linked in the comments below. Although the author's custom model is not used as an example due to confidentiality, you can easily apply this custom grid search function to any model that accepts parameters and to any dataset where predicted values can be compared against actual values.

This custom grid search is compatible with scikit-learn models and can be adapted for other projects. Although scikit-learn already provides GridSearchCV for exactly this purpose, building your own can be a valuable learning experience and offers greater flexibility.
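
For comparison, here is a minimal sketch of scikit-learn's built-in equivalent, GridSearchCV, applied to the same style of parameter grid. Note that scikit-learn maximizes its scoring functions, so error metrics such as MAE are negated.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy data and the same parameter grid used throughout this article
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100)
grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}

# Exhaustive search over the grid with 5-fold cross-validation
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=grid_search_params,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)

print("Best Parameters:", search.best_params_)
print("Best (negated) MAE:", search.best_score_)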

Conclusion

Creating a custom grid search for your unique model is an essential step in optimizing your model's parameters. By understanding the logic behind grid search and implementing it effectively, you can enhance the performance of your model and tackle complex projects with confidence. So go ahead and apply this custom grid search approach to your own projects, and enjoy the challenge of conquering new data science tasks!

Complete Code:

import sys
import os
import itertools
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.ensemble import RandomForestRegressor

def custom_grid_search(Model, X, y, GridSearchParams, cv=5, ParallelUnits=1, error_metrics="mse", random_state=None):

    """
    Perform a grid search on the given model using the specified parameters,
    returning the results, the best parameters, and the best score.

    Parameters
    ----------
    Model : sklearn model
        The model to perform the grid search on
    X : array-like, shape (n_samples, n_features)
        The training input samples
    y : array-like, shape (n_samples,)
        The target values
    GridSearchParams : dict
        The parameters to search over, with keys as the parameter names and
        values as a list of possible values
    cv : int, optional (default=5)
        The number of cross-validation splits to perform
    ParallelUnits : int, optional (default=1)
        The number of parallel workers to use. If -1, use all available CPU cores
    error_metrics : str, optional (default="mse")
        The error metric to use when determining the best parameters. Can be
        one of "mae", "mse", "rmse", or "r2"
    random_state : int, optional (default=None)
        Seed for random number generator

    Returns
    -------
    results : list
        A list of tuples, each containing a dictionary of parameters and the
        corresponding average error
    best_params : dict
        The best parameters found
    error : float
        The best score found

    """

    # Function to split the data into k-folds for cross-validation
    def k_fold_split(X, y, n_splits, random_state=None):

        """
        Split the data into train and test sets for k-fold cross-validation.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input data
        y : array-like, shape (n_samples,)
            The target data
        n_splits : int
            The number of folds to split the data into
        random_state : int, optional (default=None)
            Seed for random number generator

        Returns
        -------
        generator
            A generator that yields the indices of the train and test sets for
            each fold

        """

        # Create an array of indices
        indices = np.arange(len(y))

        # Set the random seed if specified
        if random_state is not None:
            np.random.seed(random_state)

        # Shuffle the indices
        np.random.shuffle(indices)

        # Calculate the size of each fold
        fold_sizes = np.full(n_splits, len(y) // n_splits, dtype=int)
        fold_sizes[:len(y) % n_splits] += 1

        # Initialize the start position
        current = 0

        # Split the data into train and test sets for k-fold cross-validation
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            test_indices = indices[start:stop]
            train_indices = np.concatenate((indices[:start], indices[stop:]))
            yield train_indices, test_indices

    # Mean absolute error calculation function
    def mean_absolute_error(y_true, y_pred):

        """
        Calculate the mean absolute error between two arrays of target values.

        Parameters
        ----------
        y_true : array-like, shape (n_samples,)
            The true target values
        y_pred : array-like, shape (n_samples,)
            The predicted target values

        Returns
        -------
        float
            The mean absolute error

        """

        return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

    # Mean squared error calculation function
    def mean_squared_error(y_true, y_pred):

        """
        Calculate the mean squared error between two arrays of target values.

        Parameters
        ----------
        y_true : array-like, shape (n_samples,)
            The true target values
        y_pred : array-like, shape (n_samples,)
            The predicted target values

        Returns
        -------
        float
            The mean squared error

        """

        return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

    # Root mean squared error calculation function
    def root_mean_squared_error(y_true, y_pred):

        """
        Calculate the root mean squared error between two arrays of target values.

        Parameters
        ----------
        y_true : array-like, shape (n_samples,)
            The true target values
        y_pred : array-like, shape (n_samples,)
            The predicted target values

        Returns
        -------
        float
            The root mean squared error

        """

        return (mean_squared_error(y_true, y_pred)) ** 0.5

    # R^2 score calculation function
    def r2_score(y_true, y_pred):

        """
        Calculate the R^2 score between two arrays of target values.

        Parameters
        ----------
        y_true : array-like, shape (n_samples,)
            The true target values
        y_pred : array-like, shape (n_samples,)
            The predicted target values

        Returns
        -------
        float
            The R^2 score

        """

        mean_true = sum(y_true) / len(y_true)
        total_var = sum((yt - mean_true) ** 2 for yt in y_true)
        residual_var = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
        return 1 - (residual_var / total_var)

    # Dictionary mapping error metric names to their corresponding functions
    error_metrics_dict = {
        "mae": mean_absolute_error,
        "mse": mean_squared_error,
        "rmse": root_mean_squared_error,
        "r2": r2_score
    }

    # Train the model using the specified parameters and calculate the error
    def process_param_set(args, Model, error_metrics_dict):

        """
        Train the model using the specified parameters and calculate the error.

        Parameters
        ----------
        args : tuple
            A tuple containing the parameters, input data, target data, error metric,
            train indices, and test indices
        Model : custom model
            The model to train
        error_metrics_dict : dict
            A dictionary mapping error metric names to their corresponding functions

        Returns
        -------
        tuple
            A tuple of the parameters and the calculated error

        """

        params, X, y, error_metrics, train_indices, test_indices = args

        # Split the data into training and testing set using the indices
        X_train, X_test = X[train_indices], X[test_indices]
        y_train, y_test = y[train_indices], y[test_indices]

        # Train the model using the specified parameters
        model = Model(**params)
        model.fit(X_train, y_train)

        # Predict using the trained model
        y_pred = model.predict(X_test)

        # Calculate the error using the specified error metric
        error = error_metrics_dict[error_metrics](y_test, y_pred)

        return params, error

    # Initialize variables to store the results
    Results = []
    Keys = GridSearchParams.keys()
    MinError = sys.maxsize
    MaxScore = -sys.maxsize
    BestParams = {}
    ParamSets = [dict(zip(Keys, values)) for values in itertools.product(*GridSearchParams.values())]

    # Check the number of parallel units
    if ParallelUnits == -1:
        ParallelUnits = os.cpu_count()
    elif ParallelUnits > os.cpu_count():
        raise RuntimeError(f"Maksimum CPU say?n?z: {os.cpu_count()}")

    # Start the thread pool executor
    try:
        with ThreadPoolExecutor(max_workers=ParallelUnits) as executor:
            for params in ParamSets:
                total_error = 0
                futures = []

                for train_indices, test_indices in k_fold_split(X, y, cv, random_state):
                    future = executor.submit(process_param_set,
                                             (params, X, y, error_metrics, train_indices, test_indices),
                                             Model, error_metrics_dict)
                    futures.append(future)

                for future in futures:
                    _, error = future.result()
                    total_error += error

                avg_error = total_error / cv
                Results.append((params, avg_error))

                if error_metrics == "r2":
                    if avg_error > MaxScore:
                        MaxScore = avg_error
                        BestParams = params
                else:
                    if avg_error < MinError:
                        MinError = avg_error
                        BestParams = params

    # Handle keyboard interrupt to stop the execution
    except KeyboardInterrupt:
        print("CTRL+C alg?land?. ??lem sonland?r?l?yor...")
        executor.shutdown(wait=False)
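        # Any results collected before the interrupt are still returned below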

    # Return the results, best parameters and error
    return Results, BestParams, MaxScore if error_metrics == "r2" else MinError

# Simple Dataset
np.random.seed(42)
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100)

# Parameters
grid_search_params = {
    'n_estimators': [10, 20],
    'max_depth': [40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [4, 5]
}

# Searching Best Parameters
results, best_params, error = custom_grid_search(RandomForestRegressor, X, y, grid_search_params, cv=5, ParallelUnits=-1, error_metrics="mae", random_state=42)

# Print results
print("Results:")
for res in results:
    print(res)

print("Best Parameters:", best_params)
print("Error:", error)
