Quick Review: Model Development and Business Metric Evaluation using Sci-Kit Learn

One of the things you will encounter in almost every dataset is missing, erroneous, and outlier values. Data scientists and engineers typically spend about 50-80% of their time on the data cleansing, transformations, and aggregations that turn raw data into curated features for machine learning models; how much depends on how sophisticated your data pipelines and storage are. A well-maintained data environment pays dividends in the quality of downstream reports and analytics. I will run through a quick exploratory analysis and model on the Titanic dataset (mainly because it is open source, not for its disaster) and create a script that can handle imputation on all data columns as well as evaluation.

When you find missing values in your dataset, you can either throw the column out or impute it with your favorite method, depending on what percentage is missing; I recommend setting a threshold to make that decision. KNN (k-nearest neighbors) is an imputation method that better preserves the distribution of a feature: rather than collapsing all missing observations into the mean or median, which creates an artificial spike at that single value, it fills each gap based on the most similar rows. At the end, I evaluate multiple machine learning models on performance metrics to gauge how well they predict a certain event.

Let's start by loading the dataset and setting the target variable to "Survived". I build a model that tries to predict whether or not someone survived the Titanic sinking. Essentially this is a binary (boolean) classification problem, for which logistic regression is a natural baseline. What do you think the most indicative variables would be for survival? Steerage class? Age? Sex? The results are interesting (see the quick groupby after the loading code below).

A few notes on Logistic Regression:

"Logistic regression can handle all sorts of relationships, because it applies a non-linear log transformation to the predicted odds ratio. Secondly, the independent variables do not need to be multivariate normal – although multivariate normality yields a more stable solution."

# Loading dataset
import pandas as pd

url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)
display(data)  ## --> display() renders a prettier view of a DataFrame than print()
target_var = 'Survived'
type_of_model = 'Boolean'

X = data.drop(columns=[target_var])
y = data[target_var]
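To get a first feel for the question above, a quick groupby shows survival rates by sex and passenger class (a rough exploratory sketch; exact values depend on the dataset version):

## quick exploratory check: survival rate by sex and by passenger class
print(data.groupby('Sex')[target_var].mean())
print(data.groupby('Pclass')[target_var].mean())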

Once the data is loaded into your environment, you can check which columns have missing values and print the percentage of each column that is missing, so you can take the appropriate action: either removing or imputing.

### Check to Identify which columns have null values
missing_values = data.isnull().sum()

missing_columns = missing_values[missing_values > 0].index.tolist()
print(f"Columns with missing values: {', '.join(missing_columns)}")
missing_values[missing_values > 0]/data.shape[0] ##percent of data that is missing
        
Columns with missing values: Age, Cabin, Embarked        
Age         0.198653
Cabin       0.771044
Embarked    0.002245        

In this example, three columns have missing values that must be addressed, because a model cannot run on missing data. Cabin, with over 70% missing, is a candidate column to throw away entirely. On bigger datasets you may have many more such columns, so it is wise to set up a script that systematically checks them all and imputes based on various methods. Below I write a script that handles this: it checks each column's data type and assigns it to either the categorical or numerical group, which need to be handled differently.
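A simple way to operationalize the drop-versus-impute decision is a missingness threshold (a sketch; the 0.7 cutoff is an assumption you would tune to your own tolerance):

## optional: flag columns whose share of missing values exceeds a threshold
drop_threshold = 0.7  ## assumed cutoff; tune to your tolerance
missing_pct = data.isnull().mean()
cols_to_drop = missing_pct[missing_pct > drop_threshold].index.tolist()
print(f"Columns over the threshold: {cols_to_drop}")  ## here: ['Cabin']
## data = data.drop(columns=cols_to_drop)  ## uncomment to actually drop them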

*Note: "column" and "feature" are used interchangeably here. In previous writing I have referred to these as fields, but according to some computer scientists that isn't the right term.

## impute missing values function with simple and KNN imputation methods
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import LabelEncoder

def impute_features(data, impute_type, num_method, cat_method):

    # Identify numerical and categorical features by dtype
    numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = data.select_dtypes(include=['object']).columns

    if impute_type == 'simple':
        # Impute numerical features with the chosen strategy (e.g. 'median')
        numerical_imputer = SimpleImputer(strategy=num_method)
        data[numerical_features] = numerical_imputer.fit_transform(data[numerical_features])

        # Impute categorical features with the chosen strategy (e.g. 'most_frequent')
        categorical_imputer = SimpleImputer(strategy=cat_method)
        data[categorical_features] = categorical_imputer.fit_transform(data[categorical_features])

    if impute_type == 'knn':  ## --> use the nearest-neighbor algorithm
        # Label-encode categoricals so KNNImputer can work on numbers,
        # encoding only the non-null values so the NaNs stay missing for the imputer
        encoders = dict()
        imputer = KNNImputer(n_neighbors=5)

        for feature in categorical_features:
            encoders[feature] = LabelEncoder()
            non_null = data[feature].notna()
            data.loc[non_null, feature] = encoders[feature].fit_transform(data.loc[non_null, feature])
            data[feature] = pd.to_numeric(data[feature])

        data[categorical_features] = imputer.fit_transform(data[categorical_features])
        data[numerical_features] = imputer.fit_transform(data[numerical_features])

        ## To map encoded categories back to their original labels:
        ## for feature in categorical_features:
        ##     data[feature] = encoders[feature].inverse_transform(data[feature].astype('int'))

    return data
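A minimal usage sketch (working on copies so the raw DataFrame stays intact; 'median' and 'most_frequent' are the strategies assumed here):

## example calls: simple vs. KNN imputation on copies of the raw data
data_simple = impute_features(data.copy(), 'simple', 'median', 'most_frequent')
data_knn = impute_features(data.copy(), 'knn', 'median', 'most_frequent')
print(data_simple.isnull().sum().sum(), data_knn.isnull().sum().sum())  ## both 0 after imputation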
        

Let's experiment with imputation. Age is a numerical column with about 20% of its values missing. In the graphs below, you see Age imputed with the median method and with the KNN method. What do you notice about the distributions? If you want to learn more about how KNN imputation works, check out this site:

https://www.k2analytics.co.in/missing-value-imputation-using-knn/
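The two figures below can be reproduced with something like the following (a matplotlib sketch using the data_simple and data_knn frames from above; the original plots' exact styling is unknown):

## compare the Age distribution under median vs. KNN imputation
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].hist(data_simple['Age'], bins=30)
axes[0].set_title('Median Imputation Method of Age Column')
axes[1].hist(data_knn['Age'], bins=30)
axes[1].set_title('KNN Imputation Method of Age Column')
for ax in axes:
    ax.set_xlabel('Age')
plt.tight_layout()
plt.show()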


Median Imputation Method of Age Column


KNN Imputation Method of Age Column


Data scaling places your features on a similar scale so the sheer magnitude of one column doesn't dominate your model. Without scaling, larger-scale features can dominate the learning and produce skewed outcomes; scaling removes this bias so each feature contributes fairly to the model's predictions. There are different methods to scale and normalize your data, including z-score normalization (standardization) and min-max scaling. Scikit-learn exposes these scaling methods for you, and you can choose which one to use. I like this library because it offers a higher level of abstraction than writing the lower-level code from scratch, though that could be done in plain Python as well.
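As a quick illustration of the two common methods before wiring scaling into the full pipeline (a toy sketch with made-up Fare-like values; StandardScaler does z-score, MinMaxScaler does min-max):

## toy comparison of z-score vs. min-max scaling
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

fares = np.array([[7.25], [71.28], [8.05], [512.33]])  # made-up values
print(StandardScaler().fit_transform(fares).ravel())   # mean 0, std 1
print(MinMaxScaler().fit_transform(fares).ravel())     # squeezed into [0, 1]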

## standardizing the data for the model
## (assumes X was rebuilt from the imputed data, e.g. X = data_simple.drop(columns=[target_var]))
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply the preprocessing
X_preprocessed = preprocessor.fit_transform(X)

# Convert the result to a DataFrame for easy viewing (optional)
X_preprocessed_df = pd.DataFrame(X_preprocessed.toarray() if hasattr(X_preprocessed, "toarray") else X_preprocessed)

# Display the preprocessed features
print("\nPreprocessed features:\n", X_preprocessed_df)        

Finally, I evaluate multiple models and compare their accuracy in predicting whether or not someone survived the Titanic sinking. By placing the models in a dictionary, I can reference each through a single function call and collect whatever performance metrics I want to compare all at once. Here I check accuracy, which, simply put, is the proportion of correct predictions, but there are other measures and validation checks you should run to ensure you are not overfitting: precision, recall, and F1 score among them. To validate, beyond just setting up train, test, and validation splits, you can run k-fold cross-validation to measure performance over many sample sets (see the sketch after the model comparison code below).

The random forest performed best, with over 85% accuracy on this quick run; with hyperparameter tuning you can push it higher. It's important to remember that while machine learning models can help with predictions, they are not 100% accurate. If you are implementing one in your business, you should understand the trade-offs using a confusion matrix, and consider whether a different model might perform better on certain strata of the population. What is the cost to your business of being wrong? This can be assessed using precision-recall analysis or the confusion matrix. When you correctly identify a positive or negative case there is no cost to your business, but what is the cost of a false positive or a false negative? Breaking it down this way, you can optimize your models based on the dynamics of your business and gauge how improvements in data quality, model precision, and accuracy impact your bottom line.

Try it yourself. My script is also on my GitHub.

## model evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the preprocessed features into train and test sets
# (dense form, since GaussianNB requires dense input)
X_train, X_test, y_train, y_test = train_test_split(
    X_preprocessed_df, y, test_size=0.2, random_state=42)

# Initialize the models
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "GradientBoostingClassifier": GradientBoostingClassifier()
}

# Function to evaluate a model
def evaluate_model(model, X_train, y_train, X_test, y_test):
    ## TODO:  Should incorporate hyper-tuning here
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)

# Compare models
results = {}
for model_name, model in models.items():
    accuracy = evaluate_model(model, X_train, y_train, X_test, y_test)
    results[model_name] = accuracy
    print(f"{model_name}: {accuracy:.2f}")

# Show comparison results
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
print("\nModel Comparison:\n", results_df.sort_values('Accuracy', ascending=False))        
