Quick Review: Model Development and Business Metric Evaluation using Sci-Kit Learn

One of the things you will encounter in almost every dataset is missing, erroneous, and outlier values. Data scientists and engineers typically spend about 50-80% of their time on the data cleansing, transformations, and aggregations that turn raw data into curated features for machine learning models; how much depends on how sophisticated your data pipelines and storage are. A well-maintained data environment pays dividends in the quality of downstream reports and analytics. I will run through a quick exploratory analysis and model on the Titanic dataset (mainly because it is open source, not for its disaster) and create a script that can handle imputation on all data columns as well as evaluation.

When you find missing values in your dataset, you can either throw the column out or impute it with your favorite method, depending on what percentage is missing; I recommend setting a threshold to make that decision. KNN (k-nearest neighbors) is an imputation method that better preserves the distribution of a feature: rather than collapsing all missing observations into the mean or median, which creates an artificial spike at that single value, it fills each gap based on the most similar rows. At the end, I evaluate multiple machine learning models on performance metrics to gauge how well they predict a certain event.

Let's start by loading the dataset and setting the target variable to "Survived". I build a model that tries to predict whether or not someone survived the Titanic sinking. Essentially this is a binary (boolean) classification problem, for which logistic regression is a natural baseline. What do you think the most indicative variables would be for survival? Steerage class? Age? Sex? The results are interesting (see the quick groupby after the loading code below).

A few notes on Logistic Regression:

"Logistic regression can handle all sorts of relationships, because it applies a non-linear log transformation to the predicted odds ratio. Secondly, the independent variables do not need to be multivariate normal – although multivariate normality yields a more stable solution."

# Loading dataset
import pandas as pd

url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)
display(data)  ## --> display() renders a prettier view of a DataFrame than print()
target_var = 'Survived'
type_of_model = 'Boolean'

X = data.drop(columns=[target_var])
y = data[target_var]
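To get a first feel for the question above, a quick groupby shows survival rates by sex and passenger class (a rough exploratory sketch; exact values depend on the dataset version):

## quick exploratory check: survival rate by sex and by passenger class
print(data.groupby('Sex')[target_var].mean())
print(data.groupby('Pclass')[target_var].mean())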

Once the data is loaded into your environment, you can check which columns have missing values and print the percentage of each column that is missing, so you can take the appropriate action: either removing or imputing.

### Check to Identify which columns have null values
missing_values = data.isnull().sum()

missing_columns = missing_values[missing_values > 0].index.tolist()
print(f"Columns with missing values: {', '.join(missing_columns)}")
missing_values[missing_values > 0]/data.shape[0] ##percent of data that is missing
        
Columns with missing values: Age, Cabin, Embarked        
Age         0.198653
Cabin       0.771044
Embarked    0.002245        

In this example, three columns have missing values that must be addressed, because a model cannot run on missing data. Cabin, with over 70% missing, is a candidate column to throw away entirely. On bigger datasets you may have many more such columns, so it is wise to set up a script that systematically checks them all and imputes based on various methods. Below I write a script that handles this: it checks each column's data type and assigns it to either the categorical or numerical group, which need to be handled differently.
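A simple way to operationalize the drop-versus-impute decision is a missingness threshold (a sketch; the 0.7 cutoff is an assumption you would tune to your own tolerance):

## optional: flag columns whose share of missing values exceeds a threshold
drop_threshold = 0.7  ## assumed cutoff; tune to your tolerance
missing_pct = data.isnull().mean()
cols_to_drop = missing_pct[missing_pct > drop_threshold].index.tolist()
print(f"Columns over the threshold: {cols_to_drop}")  ## here: ['Cabin']
## data = data.drop(columns=cols_to_drop)  ## uncomment to actually drop them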

*Note: "column" and "feature" are used interchangeably here. In previous writing I have referred to these as fields, but according to some computer scientists that isn't the right term.

## impute missing values function with simple and KNN imputation methods
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import LabelEncoder

def impute_features(data, impute_type, num_method, cat_method):

    # Identify numerical and categorical features by dtype
    numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    if impute_type == 'simple':
        # Impute numerical features with the chosen strategy (e.g. 'median')
        numerical_imputer = SimpleImputer(strategy=num_method)
        data[numerical_features] = numerical_imputer.fit_transform(data[numerical_features])

        # Impute categorical features with the chosen strategy (e.g. 'most_frequent')
        categorical_imputer = SimpleImputer(strategy=cat_method)
        data[categorical_features] = categorical_imputer.fit_transform(data[categorical_features])

    if impute_type == 'knn':  ## --> use the nearest-neighbor algorithm
        # Label-encode categoricals so KNNImputer can work on numbers,
        # encoding only the non-null values so the NaNs stay missing for the imputer
        encoders = dict()
        imputer = KNNImputer(n_neighbors=5)

        for feature in categorical_features:
            encoders[feature] = LabelEncoder()
            non_null = data[feature].notna()
            data.loc[non_null, feature] = encoders[feature].fit_transform(data.loc[non_null, feature])
            data[feature] = pd.to_numeric(data[feature])

        data[categorical_features] = imputer.fit_transform(data[categorical_features])
        data[numerical_features] = imputer.fit_transform(data[numerical_features])

        ## To map encoded categories back to their original labels:
        ## for feature in categorical_features:
        ##     data[feature] = encoders[feature].inverse_transform(data[feature].astype('int'))

    return data
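A minimal usage sketch (working on copies so the raw DataFrame stays intact; 'median' and 'most_frequent' are the strategies assumed here):

## example calls: simple vs. KNN imputation on copies of the raw data
data_simple = impute_features(data.copy(), 'simple', 'median', 'most_frequent')
data_knn = impute_features(data.copy(), 'knn', 'median', 'most_frequent')
print(data_simple.isnull().sum().sum(), data_knn.isnull().sum().sum())  ## both 0 after imputation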
        

Let's experiment with imputation. Age is a numerical column with about 20% of its values missing. In the graphs below, you see Age imputed with the median method and with the KNN method. What do you notice about the distributions? If you want to learn more about how KNN imputation works, check out this site:

https://www.k2analytics.co.in/missing-value-imputation-using-knn/
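The two figures below can be reproduced with something like the following (a matplotlib sketch using the data_simple and data_knn frames from above; the original plots' exact styling is unknown):

## compare the Age distribution under median vs. KNN imputation
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].hist(data_simple['Age'], bins=30)
axes[0].set_title('Median Imputation Method of Age Column')
axes[1].hist(data_knn['Age'], bins=30)
axes[1].set_title('KNN Imputation Method of Age Column')
for ax in axes:
    ax.set_xlabel('Age')
plt.tight_layout()
plt.show()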


Median Imputation Method of Age Column


KNN Imputation Method of Age Column


Data scaling places your features on a similar scale so the sheer magnitude of one column doesn't dominate your model. Without scaling, larger-scale features can dominate the learning and produce skewed outcomes; scaling removes this bias so each feature contributes fairly to the model's predictions. There are different methods to scale and normalize your data, including z-score normalization (standardization) and min-max scaling. Scikit-learn exposes these scaling methods for you, and you can choose which one to use. I like this library because it offers a higher level of abstraction than writing the lower-level code from scratch, though that could be done in plain Python as well.
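As a quick illustration of the two common methods before wiring scaling into the full pipeline (a toy sketch with made-up Fare-like values; StandardScaler does z-score, MinMaxScaler does min-max):

## toy comparison of z-score vs. min-max scaling
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

fares = np.array([[7.25], [71.28], [8.05], [512.33]])  # made-up values
print(StandardScaler().fit_transform(fares).ravel())   # mean 0, std 1
print(MinMaxScaler().fit_transform(fares).ravel())     # squeezed into [0, 1]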

## standardizing the data for the model
## (assumes X was rebuilt from the imputed data, e.g. X = data_simple.drop(columns=[target_var]))
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply the preprocessing
X_preprocessed = preprocessor.fit_transform(X)

# Convert the result to a DataFrame for easy viewing (optional)
X_preprocessed_df = pd.DataFrame(X_preprocessed.toarray() if hasattr(X_preprocessed, "toarray") else X_preprocessed)

# Display the preprocessed features
print("\nPreprocessed features:\n", X_preprocessed_df)        

Finally, I evaluate multiple models and compare their accuracy in predicting whether or not someone survived the Titanic sinking. By placing the models in a dictionary, I can reference each through a single function call and collect whatever performance metrics I want to compare all at once. Here I check accuracy, which, simply put, is the proportion of correct predictions, but there are other measures and validation checks you should run to ensure you are not overfitting: precision, recall, and F1 score among them. To validate, beyond just setting up train, test, and validation splits, you can run k-fold cross-validation to measure performance over many sample sets (see the sketch after the model comparison code below).

The random forest performed best, with over 85% accuracy on this quick run; with hyperparameter tuning you can push it higher. It's important to remember that while machine learning models can help with predictions, they are not 100% accurate. If you are implementing one in your business, you should understand the trade-offs using a confusion matrix, and consider whether a different model might perform better on certain strata of the population. What is the cost to your business of being wrong? This can be assessed using precision-recall analysis or the confusion matrix. When you correctly identify a positive or negative case there is no cost to your business, but what is the cost of a false positive or a false negative? Breaking it down this way, you can optimize your models based on the dynamics of your business and gauge how improvements in data quality, model precision, and accuracy impact your bottom line.

Try it yourself. My script is also on my GitHub.

## model evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the preprocessed features into train and test sets
# (dense form, since GaussianNB requires dense input)
X_train, X_test, y_train, y_test = train_test_split(
    X_preprocessed_df, y, test_size=0.2, random_state=42)

# Initialize the models
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "GradientBoostingClassifier": GradientBoostingClassifier()
}

# Function to evaluate a model
def evaluate_model(model, X_train, y_train, X_test, y_test):
    ## TODO:  Should incorporate hyper-tuning here
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)

# Compare models
results = {}
for model_name, model in models.items():
    accuracy = evaluate_model(model, X_train, y_train, X_test, y_test)
    results[model_name] = accuracy
    print(f"{model_name}: {accuracy:.2f}")

# Show comparison results
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
print("\nModel Comparison:\n", results_df.sort_values('Accuracy', ascending=False))        
