Extracting Rules from a Random Forest Classifier Written in PySpark and Visualizing Them with GraphViz

In the ever-evolving field of fraud detection, understanding the intricate patterns and relationships between variables is crucial. Leveraging machine learning models like Random Forest classifiers can significantly enhance our ability to detect fraudulent activities. In this article, I will demonstrate how to automatically extract rules from a Random Forest classifier using PySpark, providing valuable insights into the variables that influence our fraud detection models.

While I'll be using Google Colab for this demonstration, you can apply the same principles and techniques with any PySpark environment.

Download the dataset from:

Paymentcards-and-fraud/subscription.csv at main · shorya1996/Paymentcards-and-fraud · GitHub

To get started, we'll initialize our Spark session and define the schema for our dataset. Specifying the schema ensures each column is interpreted correctly, with appropriate data types. After defining the schema, we'll load the CSV file into a DataFrame so our data is structured and ready for analysis. This DataFrame will serve as the basis for training our Random Forest classifier to detect fraudulent activities.

from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Create (or reuse) the Spark session.
spark = SparkSession.builder.getOrCreate()

# Define the schema explicitly so every column is read with the intended data type.
Schema = StructType([
    StructField('Account ID', FloatType(), True),
    StructField('Subscription', StringType(), True),
    StructField('Subscription Year', StringType(), True),
    StructField('user age', FloatType(), True),
    StructField('job', StringType(), True),
    StructField('marital', StringType(), True),
    StructField('education', StringType(), True),
    StructField('housing', StringType(), True),
    StructField('loan', StringType(), True),
    StructField('contact', StringType(), True),
    StructField('Account Age', FloatType(), True),
    StructField('Count of linking accounts', FloatType(), True),
    StructField('default', FloatType(), True)
])

# Load the CSV with the schema, drop rows with missing values, and preview the data.
df = spark.read.csv("subscription.csv", header=True, schema=Schema)
df = df.dropna()
df.show(10)


Next, we'll focus on preparing our data for the Random Forest classifier by selecting important features and transforming categorical variables.

  1. Select Important Features: We choose a subset of relevant features from our DataFrame, including both categorical and numerical columns.
  2. Categorical and Numerical Columns: We define lists for our categorical columns (job, marital, education) and numerical columns (user age, Account Age, Count of linking accounts).
  3. String Indexing and One-Hot Encoding: We create a series of StringIndexer and OneHotEncoder transformations for the categorical columns. Indexing converts categorical variables into numerical indices, and encoding transforms these indices into one-hot encoded vectors.
  4. Assemble Features: Using VectorAssembler, we combine the encoded categorical columns and the numerical columns into a single feature vector.
  5. Create and Fit Pipeline: We define a pipeline that sequentially applies the indexers, encoders, and assembler. Fitting this pipeline to our DataFrame transforms our data into a format suitable for training the Random Forest model.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Keep only the columns used for modelling.
df_important_features = df.select('job', 'marital', 'education', 'user age', 'Account Age', 'Count of linking accounts', 'default')

categorical_columns = ['job', 'marital', 'education']
numerical_columns = ['user age', 'Account Age', 'Count of linking accounts']

# Index each categorical column, then one-hot encode the resulting indices.
indexers = [StringIndexer(inputCol=col, outputCol=col+"_index") for col in categorical_columns]
encoders = [OneHotEncoder(inputCol=col+"_index", outputCol=col+"_vec") for col in categorical_columns]

# Combine the encoded categorical vectors and the numerical columns into one feature vector.
assembler = VectorAssembler(
    inputCols=[col+"_vec" for col in categorical_columns] + numerical_columns,
    outputCol="features"
)

pipeline = Pipeline(stages=indexers + encoders + [assembler])
pipeline_model = pipeline.fit(df_important_features)

After crafting a tailored pipeline to encode categorical variables and assemble features, we apply the fitted pipeline model to transform the selected features into a new DataFrame. This step ensures that our dataset is properly prepared for training the Random Forest classifier, with categorical variables encoded and numerical variables combined into a unified feature representation.

df_transformed = pipeline_model.transform(df_important_features)        
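
Later, when we read rules off the trees, it helps to know which position in the assembled feature vector corresponds to which original column. The sketch below is one way to recover that mapping from the ML attribute metadata that VectorAssembler attaches to the features column; the feature_names dictionary is a helper introduced here, and the metadata keys ("ml_attr", "attrs") reflect Spark's usual layout rather than anything defined earlier in this article.

# Sketch: build an index -> feature-name mapping from the vector metadata.
attrs = df_transformed.schema["features"].metadata.get("ml_attr", {}).get("attrs", {})
feature_names = {}
for attr_group in attrs.values():      # typically "binary" (one-hot) and "numeric" groups
    for attr in attr_group:
        feature_names[attr["idx"]] = attr["name"]

for idx in sorted(feature_names):
    print(idx, feature_names[idx])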

With our dataset transformed and prepared, we're now poised to train our Random Forest classifier for fraud detection.

  1. Selecting Features and Target Variable: First, we extract the features column, containing our transformed feature vectors, along with the target variable – in this case, "default" – from the transformed DataFrame. This step ensures that our model learns from the relevant data.
  2. Data Splitting: To properly evaluate the performance of our model, we divide our data into training and testing sets. Here, we allocate 70% of the data for training (train_df) and reserve 30% for testing (test_df). This division allows us to assess how well our model generalizes to unseen data.
  3. Random Forest Classifier Initialization: Leveraging the PySpark ML library, we initialize a Random Forest classifier. We specify the label column ("default") and the features column ("features") to be utilized during model training. Additionally, we set the number of trees in the forest to 10, a parameter that influences the model's complexity and performance.
  4. Model Training: With the classifier instantiated, we proceed to train it using the training data (train_df). During this phase, the model learns intricate patterns and relationships between the features and the target variable, enabling it to make informed predictions.

from pyspark.ml.classification import RandomForestClassifier

# Keep only the feature vector and the label, then split into train/test sets.
df_ml = df_transformed.select("features", "default")
train_df, test_df = df_ml.randomSplit([0.7, 0.3], seed=42)

# Train a Random Forest with 10 trees on the training split.
rf = RandomForestClassifier(labelCol="default", featuresCol="features", numTrees=10)
rf_model = rf.fit(train_df)
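
Since the aim is to understand which variables influence the model, it is also worth looking at the forest's feature importances. featureImportances is a standard attribute of the fitted model; the feature_names mapping used below is the helper sketched earlier, so treat this as an illustrative snippet rather than part of the original pipeline.

# Sketch: rank features by the forest's importance scores.
importances = rf_model.featureImportances          # one score per feature in the vector
ranked = sorted(
    ((feature_names.get(i, f"feature_{i}"), importances[i]) for i in range(len(importances))),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.4f}")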

To evaluate the performance of our trained Random Forest classifier, we employ a Multiclass Classification Evaluator from the PySpark ML library.

  1. Model Predictions: We generate predictions on the test dataset (test_df) using our trained Random Forest model (rf_model). This step enables us to assess how well the model generalizes to unseen data and makes predictions on fraudulent activities.
  2. Evaluation Metric: Utilizing the Multiclass Classification Evaluator, we specify the label column ("default") and the prediction column ("prediction") from the generated predictions. Additionally, we specify the metric we wish to evaluate, in this case, "accuracy". The accuracy metric measures the proportion of correctly classified instances out of all instances.
  3. Evaluating Model Performance: By calling the evaluate method on the evaluator with the predictions, we obtain the accuracy score of our model. This score provides insight into how effectively our Random Forest classifier identifies fraudulent activities based on the test data.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Score the held-out test data and measure overall accuracy.
predictions = rf_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(labelCol="default", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Test accuracy: {accuracy:.4f}")
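
Fraud and default labels are usually heavily imbalanced, so accuracy on its own can look deceptively high. As a rough sketch, the same predictions can also be scored with weighted precision, weighted recall, and F1, plus the area under the ROC curve via BinaryClassificationEvaluator; the metric names below are the standard ones exposed by Spark's evaluators.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Sketch: complementary metrics that are more informative on imbalanced data.
for metric in ["weightedPrecision", "weightedRecall", "f1"]:
    score = MulticlassClassificationEvaluator(
        labelCol="default", predictionCol="prediction", metricName=metric
    ).evaluate(predictions)
    print(metric, score)

auc = BinaryClassificationEvaluator(
    labelCol="default", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
).evaluate(predictions)
print("areaUnderROC", auc)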

To gain deeper insights into the Random Forest classifier's decision-making process, we delve into the conversion of Java-based decision trees into Python-readable structures.

  1. Node Representation: We define named tuples – LeafNode and InternalNode – to represent leaf and internal nodes, respectively. These structures encapsulate crucial information such as prediction values, impurity measures, and splitting criteria.
  2. Split Types: Depending on whether a split is categorical or continuous, we define named tuples CategoricalSplit and ContinuousSplit. These structures hold details about the feature index, threshold (for continuous splits), and categories (for categorical splits).
  3. Conversion Function: The jtree_to_python function is pivotal in this conversion process. It traverses the Java-based decision tree recursively, translating each node and split into their Python equivalents.
  4. Java to Python Conversion: Within the conversion function, we utilize Java methods to extract pertinent information about nodes and splits. We handle internal nodes and leaf nodes differently, capturing details such as predictions, impurity statistics, and splitting criteria.
  5. Tree Extraction: Applying the conversion function to each tree in the Random Forest model (rf_model), we obtain a list of Python-readable representations of the decision trees, stored in the nodes variable.

from collections import namedtuple
import numpy as np

# Python-side representations of Spark's tree nodes and splits.
LeafNode = namedtuple("LeafNode", ("prediction", "impurity"))
InternalNode = namedtuple(
    "InternalNode", ("left", "right", "prediction", "impurity", "split"))
CategoricalSplit = namedtuple("CategoricalSplit", ("feature_index", "categories"))
ContinuousSplit = namedtuple("ContinuousSplit", ("feature_index", "threshold"))

def jtree_to_python(jtree):
    """Convert a Java decision-tree model into nested Python named tuples."""
    def jsplit_to_python(jsplit):
        # Continuous splits carry a threshold; categorical splits carry the
        # categories that are routed to the left child.
        if jsplit.getClass().toString().endswith(".ContinuousSplit"):
            return ContinuousSplit(jsplit.featureIndex(), jsplit.threshold())
        else:
            jcat = jsplit.toOld().categories()
            return CategoricalSplit(
                jsplit.featureIndex(),
                [jcat.apply(i) for i in range(jcat.length())])

    def jnode_to_python(jnode):
        prediction = jnode.prediction()
        stats = np.array(list(jnode.impurityStats().stats()))

        if jnode.numDescendants() != 0:  # internal node: recurse into children
            left = jnode_to_python(jnode.leftChild())
            right = jnode_to_python(jnode.rightChild())
            split = jsplit_to_python(jnode.split())
            return InternalNode(left, right, prediction, stats, split)
        else:
            return LeafNode(prediction, stats)

    return jnode_to_python(jtree.rootNode())

# Convert every tree in the trained forest.
nodes = [jtree_to_python(t) for t in rf_model._java_obj.trees()]
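
Because the goal is to extract rules, the converted trees can also be walked directly to produce one human-readable rule per root-to-leaf path. The extract_rules helper below is a sketch of my own (not a Spark API); it assumes Spark's convention that the left child of a continuous split takes values less than or equal to the threshold and the left child of a categorical split takes the stored categories, and it optionally uses the feature_names mapping from earlier.

# Sketch: flatten one converted tree into textual IF/THEN rules (hypothetical helper).
def extract_rules(node, feature_names=None, path=None):
    path = path or []

    def name(i):
        return (feature_names or {}).get(i, f"feature_{i}")

    if isinstance(node, LeafNode):
        return [f"IF {' AND '.join(path) or 'TRUE'} THEN predict {node.prediction}"]

    split = node.split
    if isinstance(split, ContinuousSplit):
        left_cond = f"{name(split.feature_index)} <= {split.threshold}"
        right_cond = f"{name(split.feature_index)} > {split.threshold}"
    else:  # CategoricalSplit: the stored categories go to the left child
        left_cond = f"{name(split.feature_index)} IN {split.categories}"
        right_cond = f"{name(split.feature_index)} NOT IN {split.categories}"
    return (extract_rules(node.left, feature_names, path + [left_cond]) +
            extract_rules(node.right, feature_names, path + [right_cond]))

for rule in extract_rules(nodes[0], feature_names):
    print(rule)
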
from graphviz import Digraph        

Next, we introduce a visualize_tree function that renders the decision trees in a human-readable format directly within our Python environment; a sketch of one possible implementation follows the list below. Let's explore how it works:

  1. Function Overview: The visualize_tree function is designed to render decision trees using the Graphviz library, providing a graphical representation of the tree's structure and decision-making process.
  2. Graph Creation: Within the function, we initialize a new Digraph object from the Graphviz library, which serves as the canvas for our tree visualization.
  3. Node Addition: Using a recursive approach, we traverse the decision tree provided as input (tree) and add nodes to the graph accordingly. Depending on whether a node is a leaf node or an internal node with a split, we customize the node's label and edges to reflect relevant information such as prediction values, impurity measures, feature names, and splitting criteria.
  4. Edge Labeling: For internal nodes, we label the edges based on the splitting criteria (e.g., threshold for continuous splits, categories for categorical splits) to visually represent the decision-making process.
  5. Visualization Output: Once the tree visualization is constructed, we return the Digraph object representing the decision tree, which can be further customized or displayed as needed.

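Here is a minimal sketch of what visualize_tree could look like, following the description above. The node labelling, the edge labels ("yes"/"no" and "in"/"not in"), and the optional feature_names argument are assumptions of this sketch rather than a definitive implementation; Digraph, node, and edge are standard Graphviz calls.

from graphviz import Digraph

def visualize_tree(tree, feature_names=None):
    # Minimal sketch of the visualizer described above (assumed implementation).
    dot = Digraph()
    counter = {"next_id": 0}

    def name(idx):
        return (feature_names or {}).get(idx, f"feature_{idx}")

    def add_node(node):
        node_id = str(counter["next_id"])
        counter["next_id"] += 1
        if isinstance(node, LeafNode):
            # Leaf: show the predicted class and the impurity statistics.
            dot.node(node_id, f"predict = {node.prediction}\\nstats = {node.impurity}")
            return node_id
        split = node.split
        if isinstance(split, ContinuousSplit):
            label = f"{name(split.feature_index)} <= {split.threshold}?"
            left_edge, right_edge = "yes", "no"
        else:
            label = f"{name(split.feature_index)} in {split.categories}?"
            left_edge, right_edge = "in", "not in"
        dot.node(node_id, label)
        dot.edge(node_id, add_node(node.left), label=left_edge)
        dot.edge(node_id, add_node(node.right), label=right_edge)
        return node_id

    add_node(tree)
    return dot

With the function defined, the loop below renders each tree in the forest to its own PNG file.
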
for i in range(0, len(nodes)):
    tree_visual = visualize_tree(nodes[i])
    tree_visual.render(f"DecisionTree_{i}", format="png", cleanup=True)        

Let's download one of the rendered decision trees and look at the graph:
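
If you are following along in Colab, you can also display a rendered tree inline instead of downloading it; the snippet below assumes the first tree was written to DecisionTree_0.png by the loop above.

from IPython.display import Image, display

# Show the first rendered tree inline (file name matches the rendering loop).
display(Image("DecisionTree_0.png"))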

In summary, our journey from data preparation to visualization equips us with powerful tools for insightful analysis. By leveraging PySpark, we've transformed data into actionable insights, trained robust models, and visualized decision trees to understand predictive logic. This holistic approach enhances accuracy and deepens our understanding of data relationships, paving the way for innovation in data-driven decision-making.

