Unveiling Consumer Behavior: Analysis and Prediction with Data Science

Understanding consumer behavior is essential for the success of any business in today's market. With the vast amount of data available, data science provides powerful tools to analyze and predict consumer actions. In this article, we will explore a real case study from a large marketplace, demonstrating how data analysis and machine learning techniques can be applied to uncover behavior patterns, optimize marketing strategies, and increase customer loyalty. We will dive into the theoretical and practical concepts underlying this process, providing valuable insights for those looking to improve their e-commerce operations.

Consumer Behavior

In consumer marketing, the consumer lifecycle is a term used to describe the progression of steps a customer goes through when considering, buying, using, and maintaining loyalty to a product or service.

Why It's Important

These metrics can be tracked over time (e.g., quarter over quarter, year over year) and compared to industry-wide benchmarks. Comparing customer lifecycle metrics can help close competitive gaps in product or service offerings and, above all, predict which consumers will buy from the store.

Propensity to Purchase

We want to measure a consumer's propensity to buy. Why is this important? For several reasons: first, we avoid losing the customer; second, we can create touchpoints to drive the sale; and finally, interacting with the consumers most likely to buy significantly increases the chance of a sale.

  1. Consumer Behavior: Definition: The consumer lifecycle describes the steps a customer goes through when considering, purchasing, using, and maintaining loyalty to a product or service. Importance: Measuring these metrics over time helps to identify gaps in product offerings and predict future purchasing behaviors.
  2. Propensity to Buy: Measure: Important for not losing consumers, creating points of contact, and increasing the chances of a sale. Example: Identification of consumers with a higher propensity to buy (e.g., 73%) versus a lower propensity (e.g., 11%).
  3. Counterfactual Modeling: Description: Measures the impact of interventions on churn. Example: Churn lift ranging between 11% and 17%, with actual results at 16% (see the sketch after this list).
  4. Business Case - Target: Problem: Loss of millions of dollars due to potential buyers switching to competitors. Solution: Predict behavior and optimize interactions to retain customers.
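
As a toy illustration of the counterfactual idea in item 3, the sketch below compares churn between a group that received an intervention (e.g., a retention campaign) and a control group that did not. The arrays are made-up indicator data for illustration only, not results from the case study.

# Toy uplift estimate: difference in churn rate between control and treated groups.
import numpy as np

control_churn = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])  # 1 = churned, no intervention
treated_churn = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])  # 1 = churned, after intervention

lift = control_churn.mean() - treated_churn.mean()
print(f"Estimated churn lift: {lift:.0%}")  # here: 0.60 - 0.30 = 30%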

From Data Lake

The architecture of a Data Lake is a data storage solution designed to manage large volumes of data coming from multiple sources. The Data Lake framework is made up of several layers, including the ingestion layer, where structured and unstructured data is collected from both streaming and batch sources. The data is then handled by the processing layer, where it is validated, cleaned, and transformed. After processing, the data is stored in the storage layer, which is divided into landing, clean, and curated zones. In addition, the cataloging and search layer allows for organization and easy retrieval of data. Finally, the processed data is made available at the consumption layer, where it can be used for analysis and visualization, facilitating data-driven decision-making.

Typical Data Lake Ecosystem

  1. Streaming and Batch: The source of the data can be from continuous streams (e.g. Twitter) or batch (e.g. spreadsheets).
  2. Ingestion Layer: Responsible for collecting structured and unstructured data from different sources.
  3. Storage Layer: Includes three zones: the Landing Zone, where the raw data is initially stored; the Clean Zone, where data is validated, cleaned, and standardized; and the Curated Zone, where data is transformed, enriched, and modeled.
  4. Processing Layer: Contains the data processing pipeline for validating, cleaning, standardizing, transforming, and enriching the data.
  5. Cataloging & Search Layer: Allows cataloging and searching of stored data.
  6. Consumption Layer: Where data is consumed through dashboards, analytics, and reports.

This architecture enables the efficient ingestion, storage, processing, and consumption of large volumes of data.
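
As a concrete, purely illustrative sketch of this flow, the PySpark snippet below moves session data from the landing zone through the clean zone to the curated zone. The storage paths, column names, and cleaning rules are assumptions for the example, not details from the case study.

# Hypothetical sketch of data flowing through Data Lake zones with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("datalake-zones").getOrCreate()

# Landing zone: raw files exactly as ingested (path is illustrative).
raw = spark.read.option("header", True).csv("s3://datalake/landing/sessions/")

# Clean zone: validate, deduplicate, and standardize.
clean = (raw
         .dropDuplicates(["SESSION_ID"])
         .filter(F.col("SESSION_ID").isNotNull()))
clean.write.mode("overwrite").parquet("s3://datalake/clean/sessions/")

# Curated zone: transformed and modeled for consumption (e.g., typed columns).
curated = clean.withColumn("BUY", F.col("BUY").cast("int"))
curated.write.mode("overwrite").parquet("s3://datalake/curated/sessions/")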

Possible Solutions with Data Science:

  1. Preliminary Models: Use machine learning to predict churn and propensity to buy. Tools: Python (Scikit-learn, TensorFlow), R (caret).
  2. Real-Time Data Analysis: Implement real-time analytics systems to monitor consumer behavior. Tools: Apache Kafka, Spark Streaming.
  3. Customer Segmentation: Develop targeted marketing strategies based on customer segmentation. Tools: K-means clustering, cluster analysis (see the sketch after this list).
  4. KPI Tracking: Monitor consumer behavior KPIs and adjust strategies as needed. Tools: Power BI, Tableau.
  5. Automation and Customization: Create automated, personalized campaigns to engage customers with a high propensity to buy. Tools: Marketing Automation Platforms (HubSpot, Marketo).
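
For item 3, a minimal customer-segmentation sketch with scikit-learn's KMeans is shown below. The per-customer features (orders per month, average ticket, days since the last purchase) and the sample values are illustrative assumptions.

# Minimal customer segmentation sketch with K-means (features are illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer features: [orders_per_month, avg_ticket, days_since_last_purchase]
X = np.array([[1,  50, 30],
              [8, 120,  3],
              [2,  45, 25],
              [9, 150,  2],
              [1,  40, 60]])

X_scaled = StandardScaler().fit_transform(X)  # K-means is sensitive to feature scale
segments = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)
print(segments)  # one segment label per customer, e.g., [0 1 0 1 0]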

Implementation in Data Lake:

  1. Ingestion Layer: Collecting data from multiple sources (e.g., app interactions, social media).
  2. Processing Layer: Processing pipeline to validate, clean, transform, and enrich data.
  3. Storage Layer: Data stored in different zones (landing, clean, curated) for easy access and analysis.
  4. Consumption Layer: Data consumed by BI tools and dashboards for analysis and decision-making.

These solutions will enable a deeper understanding of consumer behavior and help improve retention and sales through actionable insights derived from data.

Our Case

Collection of Information

Within the shopping site or mobile app, we will collect some important metrics from users who are logged into a session.

Metrics collected:

  1. SESSION_ID: Unique identifier of the session.
  2. Click_Image: Indicator of whether the user clicked on an image (0 or 1).
  3. Read_Review: Indicator of whether the user read a review (0 or 1).
  4. Category_View: Indicator of whether the user viewed the category (0 or 1).
  5. Read_Details: Indicator of whether the user read the product details (0 or 1).
  6. Video_View: Indicator of whether the user watched a product video (0 or 1).
  7. Add_to_List: Indicator of whether the user added the product to a list (0 or 1).
  8. Compare_Prc: Indicator of whether the user compared prices (0 or 1).
  9. View_Similar: Indicator of whether the user viewed similar products (0 or 1).
  10. Save_for_Later: Indicator of whether the user saved the product for later (0 or 1).
  11. Personalized: Indicator of whether the user used personalized recommendations (0 or 1).
  12. BUY: Indicator of whether the user purchased the product (0 or 1).

These metrics help to understand user behavior while browsing and purchasing on an e-commerce platform.

In our case study, 1 million records were collected, i.e., 1 million sessions with user interaction, stored in a CSV file.
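
As an illustration of the file's structure only (the rows below are hypothetical and show the schema, not the collected data), the CSV looks like this:

SESSION_ID,Click_Image,Read_Review,Category_View,Read_Details,Video_View,Add_to_List,Compare_Prc,View_Similar,Save_for_Later,Personalized,BUY
1000001,1,0,1,1,0,0,1,0,0,1,0
1000002,1,1,1,1,1,1,1,1,0,1,1
1000003,0,0,1,0,0,0,0,1,1,0,0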

Predictive Model

We are going to implement the first stage of this entire digital marketing process: building a predictive model.

# Imports
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score        

These lines import the libraries needed for data analysis and modeling. The imported libraries are:

  • pandas: for data manipulation and analysis.
  • numpy: for numerical operations.
  • os: for operations related to the operating system.
  • matplotlib.pylab: for data visualization.
  • scikit-learn: for machine learning modeling and evaluation.

# Function to load data

def load_data(file_path):
    return pd.read_csv(file_path)        

This function loads the data from a CSV file specified by the file path (file_path) and returns a pandas DataFrame.

# Function to inspect data

def inspect_data(data):
    print(data.dtypes)
    print(data.head())
    print(data.describe())
    print(data.corr()['BUY'])

This function prints:

  1. The data types for each column (data.dtypes).
  2. The first five rows of the DataFrame (data.head()).
  3. A statistical summary of the data (data.describe()).
  4. The correlation of each column with the 'BUY' column (data.corr()['BUY']).
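
One practical caveat when reproducing step 4: since pandas 2.0, DataFrame.corr() raises an error if non-numeric columns are present. If SESSION_ID is stored as a string, a safer call (the numeric_only parameter exists since pandas 1.5) is:

# Restrict the correlation to numeric columns only.
print(data.corr(numeric_only=True)['BUY'])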

# Function to prepare data

def prepare_data(data):
    # Note: the review column is named 'Read_Review' (singular), not 'Read_Reviews'.
    predictors = data[['Read_Review', 'Compare_Products', 'Add_to_List',
                       'Save_for_Later', 'Personalized', 'View_Similar']]
    targets = data.BUY
    return train_test_split(predictors, targets, test_size=0.3)

This function:

  1. Selects the predictor columns of interest (Read_Review, Compare_Products, Add_to_List, Save_for_Later, Personalized, View_Similar); these names must match the CSV header exactly (note that the collection step lists the price-comparison metric as Compare_Prc).
  2. Defines BUY as the target column.
  3. Splits the data into training and test sets using train_test_split, where 30% of the data is used for testing (test_size=0.3).
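
A small refinement worth considering (a suggestion on our part, not part of the original code): fixing random_state makes the split reproducible, and stratify keeps the proportion of buyers the same in the training and test sets. This variant relies on the same imports shown at the top of the article.

# Suggested variant: reproducible split, stratified by the target class.
def prepare_data_stratified(data):
    predictors = data[['Read_Review', 'Compare_Products', 'Add_to_List',
                       'Save_for_Later', 'Personalized', 'View_Similar']]
    targets = data.BUY
    return train_test_split(predictors, targets, test_size=0.3,
                            random_state=42, stratify=targets)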

# Function to train model

def train_model(X_train, y_train):
    model = GaussianNB()
    model.fit(X_train, y_train)
    return model        

This function:

  1. Creates a Gaussian Naïve Bayes (GaussianNB) model.
  2. Trains the model with the training data (X_train and y_train).
  3. Returns the trained model.

# Function to evaluate model

def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    print(confusion_matrix(y_test, predictions))
    print(accuracy_score(y_test, predictions))
    return model.predict_proba(X_test)        

This function:

  1. Uses the model to make predictions on the test set (X_test).
  2. Prints the confusion matrix (confusion_matrix) and the model's accuracy (accuracy_score).
  3. Returns the predicted probabilities for the test set (model.predict_proba(X_test)).
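
Note that classification_report is imported at the top but never called. Adding a line like the one below inside evaluate_model would also print per-class precision, recall, and F1-score:

# Optional extra inside evaluate_model: per-class precision, recall, and F1.
print(classification_report(y_test, predictions))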

# Function to predict propensity

def predict_propensity(model, data):
    data = np.array(data).reshape(1, -1)
    return model.predict_proba(data)[:,1]        

This function:

  1. Converts the input data to a numpy array and reshapes it into a single row, as expected by the model.
  2. Returns the predicted propensity, i.e., the probability of the positive class (BUY = 1).
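
One detail to watch: the values passed to predict_propensity must follow exactly the column order defined in prepare_data. A variant (a suggestion, not the original code) that ties each value to a named feature, and avoids scikit-learn's warning about missing feature names, is:

# Suggested variant: bind each value to its feature name via a one-row DataFrame.
import pandas as pd

FEATURES = ['Read_Review', 'Compare_Products', 'Add_to_List',
            'Save_for_Later', 'Personalized', 'View_Similar']

def predict_propensity_named(model, values):
    row = pd.DataFrame([values], columns=FEATURES)
    return model.predict_proba(row)[:, 1]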

# Execution Pipeline

file_path = "/…/market_app_correlated.csv"
prospect_data = load_data(file_path)
inspect_data(prospect_data)
X_train, X_test, y_train, y_test = prepare_data(prospect_data)
model = train_model(X_train, y_train)
probabilities = evaluate_model(model, X_test, y_test)

These lines:

  1. Set the path of the CSV file.
  2. Load the data from the CSV.
  3. Inspect the loaded data.
  4. Prepare the data for training and testing.
  5. Train the model with the training data.
  6. Evaluate the model with the test data.

Now let's do some simulations:

Predict propensity for new browsing data

# Feature order: [Read_Review, Compare_Products, Add_to_List, Save_for_Later, Personalized, View_Similar]
new_browsing_data = [0, 0, 0, 0, 0, 0]
print("New User: propensity:", predict_propensity(model, new_browsing_data))

Result:

New User: propensity: [0.19087601]

That is, simply by entering and logging in to the website or application, the chance that the user will buy is close to 19%.

Predict propensity after adding to list

add_to_list_data = [1, 1, 1, 0, 0, 0]
print("After Add_to_List: propensity:", predict_propensity(model, add_to_list_data))        

Result:

After Add_to_List: propensity: [0.61234248]        

After the user reads a review, compares products, and adds an item to the list, the chance of buying rises to about 61%.

Predict propensity after multiple interactions

full_interaction_data = [1, 1, 1, 1, 1, 1]
print("Full Interaction: propensity:", predict_propensity(model, full_interaction_data))        

Result:

Full Interaction: propensity: [0.80887743]        

For users who have performed all possible interactions, the chance of purchase rises to about 81%.
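
Beyond single-session simulations, the same model can score every session at once and rank users for the automated campaigns mentioned earlier. A minimal sketch follows; the 60% cutoff is an illustrative choice, not a figure from the case study.

# Score all test sessions and rank the most likely buyers for a campaign.
scored = X_test.copy()
scored['propensity'] = model.predict_proba(X_test)[:, 1]

# Illustrative cutoff: reach out first to users above 60% propensity.
campaign_targets = (scored[scored['propensity'] > 0.60]
                    .sort_values('propensity', ascending=False))
print(campaign_targets.head())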

To wrap up this comprehensive exploration of consumer behavior analytics and its implementation within a large marketplace, it is evident that leveraging data science and machine learning can significantly enhance our understanding of consumer actions and preferences. By systematically collecting and analyzing user interaction data, we can predict purchasing behaviors with remarkable accuracy, enabling the creation of targeted marketing strategies that boost engagement and conversion rates. This case study not only demonstrates the practical application of these techniques but also underscores their potential to drive substantial business growth and competitive advantage in the dynamic landscape of e-commerce.

