Model Explainability via LIME
What will speed up the adoption of Machine Learning in Business? Why is Machine Learning Interpretability Important?
Think of interpretability as the bridge between humans and machines. It's not just about understanding how a model works; it's about building trust and accountability. Interpretability empowers us to ask questions like "Why did the model make this prediction?" or "Which features were most influential?"
By demystifying AI models, interpretability fosters trust among users, regulators, and stakeholders.
In linear and logistic regression, the weights or coefficients play an important role: they indicate how much each variable contributes to the predictive model.
Say we are predicting an employee's salary, relying on two key features: years of experience and a previous performance rating. In this scenario, our model might look something like this:
Salary = w1*Experience + w2*Rating
These coefficients serve as indicators, shedding light on whether the rating holds more weight in determining an employee's salary or whether it is the experience that primarily influences the outcome. In essence, these weights offer insights into the relative importance of each feature, helping us understand the dynamics between variables and their impact on the predicted outcome.
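To make this concrete, here is a minimal sketch of how the fitted weights of a linear model expose the relative importance of experience and rating. The numbers below are made up purely for illustration and are not part of this article's dataset.

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical training data: [years_of_experience, performance_rating]
X = np.array([[2, 3.5], [5, 4.0], [8, 4.5], [10, 3.0], [12, 4.8]])
y = np.array([40000, 65000, 90000, 95000, 130000])  # illustrative salaries

lin_reg = LinearRegression().fit(X, y)

# the fitted weights w1 and w2 show how strongly each feature drives the prediction
for name, weight in zip(["Experience", "Rating"], lin_reg.coef_):
    print(f"{name}: {weight:,.2f}")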
But for Random Forest, XGBoost, or other complex ML models, interpreting the model is not so easy.
In this article, I have used the LIME framework to interpret a Random Forest regression model.
Before getting into how the LIME framework can be used to interpret a regression model, let's talk about a few things:
Surrogate Model: A surrogate model acts as a simplified approximation of a black-box model, offering a glimpse into its decision-making process. While black-box models, such as deep neural networks or ensemble methods, may deliver superior predictive performance, their inner workings often remain opaque, leaving users in the dark about how and why specific predictions are made. Surrogate models bridge this gap by providing a more interpretable alternative.
Global Explainability: Global explainability provides an overarching understanding of how a model works across the entire dataset.
In the realm of AI, global explainability helps us understand the overall behavior of a model. It answers questions like: "What features are most important for making predictions?" or "How does the model generalize across different subsets of data?" Think of it as the big picture view that guides our trust in the model's decision-making process. A surrogate model involves training a more interpretable model, such as a decision tree or linear regression, on the predictions or intermediate representations generated by the black-box model. By mapping the inputs to the outputs of the black-box model, the surrogate model encapsulates its underlying logic in a more digestible form.
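As a small illustration of the surrogate idea described above, a shallow decision tree can be fitted to the predictions of a random forest. This is only a sketch on synthetic data, not the dataset used later in this article.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.metrics import r2_score

# synthetic data standing in for any tabular problem
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)

# the "black box" whose behaviour we want to approximate
black_box = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
black_box_preds = black_box.predict(X)

# the surrogate is trained on the black-box predictions, not on the true labels
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, black_box_preds)

# fidelity: how closely the simple tree mimics the black box
print("Fidelity (R^2 vs black-box predictions):", r2_score(black_box_preds, surrogate.predict(X)))
print(export_text(surrogate, feature_names=[f"f{i}" for i in range(5)]))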
Local Explainability: Local Explainability focuses on explaining the model's prediction for a specific instance or observation. So, instead of looking at the entire dataset, we're analyzing how the model arrived at a particular decision for a particular input.
This is crucial for understanding why a model made a specific prediction for an individual case. It helps us answer questions like: "What factors influenced the model to deny a loan for this particular applicant?"
Model Agnostic Interpretability: Model-agnostic techniques allow the use of more complex models without losing all interpretability power. They can be applied to most machine learning models, regardless of type or complexity, and provide insights into how models make decisions without relying on the inner workings of a specific algorithm.
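One concrete example of a model-agnostic technique is permutation importance, which only ever calls the model's predict function. The sketch below uses synthetic data and is not part of the original walkthrough; it is included just to show the model-agnostic idea in code.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=4, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# any estimator with fit/predict works here - the technique never looks inside it
model = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)

# shuffle one feature at a time and measure how much the validation score drops
result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=1)
for i, score in enumerate(result.importances_mean):
    print(f"feature_{i}: importance {score:.3f}")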
LIME (Local Interpretable Model-agnostic Explanations) is a robust framework widely embraced in the industry for providing human-friendly explanations for tabular, text, and image data. Its versatility across data modalities creates trust and confidence in black-box machine learning models. Operating on the principle of local interpretability, LIME offers granular insights at the instance level, making complex model behaviour accessible and actionable for all stakeholders.
Below is the implementation in Python using LIME for a Random Forest Regression model:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import tree
from sklearn.metrics import mean_squared_error
# Read csv file
df = pd.read_csv('data.csv')
# preview of the sample data
df.head()
# Checking Missing Values
df.isnull().sum().sort_values()
# Missing Value Treatment - Item_Weight by median and Outlet_Size with mode
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].median())
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])
# Reducing the Cardinality of 'Item_Type_Combined'
df['Item_Type_Combined'] = df['Item_Identifier'].apply(lambda x: x[0:2])
df['Item_Type_Combined'] = df['Item_Type_Combined'].map({'FD':'Food', 'NC':'Non-Consumable', 'DR':'Drinks'})
df['Item_Type_Combined'].value_counts()
# No of Years of the existence of stores
df['Existence_Years'] = 2013 - df['Outlet_Establishment_Year']
# Updating the values of Item_Fat_Content
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF':'Low Fat', 'reg':'Regular', 'low fat':'Low Fat'})
df['Item_Fat_Content'].value_counts()
# label encoding of ordinal variables
lbl_enco = LabelEncoder()
df['Outlet'] = lbl_enco.fit_transform(df['Outlet_Identifier'])
cat =['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Item_Type_Combined','Outlet_Type','Outlet']
lbl_enco = LabelEncoder()
for i in cat:
    df[i] = lbl_enco.fit_transform(df[i])
# dropping the ID variables and variables that have been used to extract new variables
df.drop(['Item_Identifier', 'Outlet_Identifier','Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)
# separating the dependent and independent variables
X = df.drop('Item_Outlet_Sales',axis =1)
y = df['Item_Outlet_Sales']
# creating the training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)
# installing lime library
!pip install lime
# training the Random Forest model
model = RandomForestRegressor(n_estimators=200,max_depth=5, min_samples_leaf=100,n_jobs=-1, random_state=10)
model.fit(X_train, y_train)
# model - RMSE on validation Set
np.sqrt(mean_squared_error(y_test,model.predict(X_test)))
# creating the explainer function
from lime.lime_tabular import LimeTabularExplainer
explainer = LimeTabularExplainer(X_train.values, mode="regression", feature_names=X_train.columns)
# storing an observation from the validation set
j = 12
X_obs = X_test.iloc[[j], :]
X_obs
# Explaining the Random Forest model
expl = explainer.explain_instance(X_obs.values[0], model.predict)
expl.show_in_notebook(show_table=True, show_all=False)
print(expl.score)
The leftmost visualization shows the model's predicted value for this observation, along with a range of possible values depicting the best and worst case.
The middle visualization depicts which variables influence the prediction toward the higher or the lower side.
The most important variables identified are shown in the rightmost visualization, in descending order of importance.
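If the same information is needed programmatically, for example for logging or reporting, the Explanation object returned by explain_instance also exposes the feature contributions as a list. A small sketch continuing from the code above:

# feature contributions behind the plots, as (feature, weight) pairs
for feature, weight in expl.as_list():
    print(f"{feature}: {weight:.2f}")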