Where Did My Customer Go? (aka Churn Prediction)
Supreet Sethi
Let's elevate your business infrastructure together. #solutionengineering
Churn is a reality. Whether you run a bank, a real estate site, or an online game, you know that no matter how compelling the customer experience is, people drift away for all sorts of reasons: they lose interest, find a better service, or simply have other priorities. Predicting churn, that is, knowing when and why customers might leave, can be a game-changer. One of our customers, a gaming company, came to us at The Migration Workshop with exactly this problem. They needed a churn prediction solution that could scale, adapt to new games, and be simple enough to manage without a data science team.
We were faced with several challenges:
- The model had to scale with a fast-growing player base.
- It had to adapt to new games without being rebuilt from scratch.
- It had to stay accurate as player behavior (and the underlying data) drifted.
- It had to be simple enough to operate without a dedicated data science team.
Choosing the Right Technology: Elixir and Vowpal Wabbit
With these constraints in mind, we first looked at a Python-based fraud detection model built by Nirav D. Doshi using XGBoost. Fraud detection and churn prediction have similarities—both involve identifying rare patterns or anomalies in customer behavior. However, since our client didn’t have experience with Python or XGBoost, we decided to simplify things. We took the main ideas from that model and reworked them into an Elixir-based solution, using Vowpal Wabbit for machine learning. This gave us a streamlined and effective toolchain that fit the client’s needs without adding unnecessary complexity.
Why Vowpal Wabbit?
Vowpal Wabbit (VW) is an open-source, fast, and lightweight machine learning library built for scalability. Unlike many traditional machine learning libraries that can become bottlenecks at scale, VW shines when dealing with large datasets and high throughput.
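To make that concrete, here is the plain-text input format VW consumes and a typical training invocation. Labels are -1/+1 for logistic loss; the feature names below are purely illustrative, not our client's schema:

1 | sessions:3 days_since_login:12 platform_mobile
-1 | sessions:41 days_since_login:1 platform_desktop

$ vw churn.vw --loss_function logistic --binary -f model.vw

Because VW learns online, one example at a time, it streams through arbitrarily large files in roughly constant memory, which is exactly what you want at gaming-platform scale.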
Why Elixir?
We chose Elixir for its performance and scalability. It's designed for building large-scale, fault-tolerant systems, making it a perfect fit for gaming companies expecting rapid growth in their player base. Here are a few reasons why Elixir was the right choice:
- It runs on the BEAM (the Erlang VM), whose supervision trees restart crashed processes instead of letting one failure take the whole system down.
- Lightweight processes make it cheap to score many players concurrently.
- It is operationally simple, which matters for a team without a dedicated data science or ML-ops function.
A minimal sketch of the supervision idea follows.
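The module names here (Churn.Scorer) are hypothetical, not the client's actual code; the sketch just shows the OTP pattern:

# Minimal fault-tolerance sketch: a scorer process under a supervisor.
defmodule Churn.Scorer do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}

  # Shells out to the vw binary. If vw exits non-zero, the pattern match
  # fails, this process crashes, and the supervisor restarts it.
  @impl true
  def handle_call({:score, vw_file}, _from, state) do
    {out, 0} = System.cmd("vw", [vw_file, "-t", "-i", "model.vw", "-p", "/dev/stdout"])
    {:reply, out, state}
  end
end

# :one_for_one - a crashed scorer is restarted without touching its siblings.
{:ok, _sup} = Supervisor.start_link([Churn.Scorer], strategy: :one_for_one)
GenServer.call(Churn.Scorer, {:score, "prediction.vw"})

A crash in a scoring call takes down only that process; the rest of the platform keeps serving players while the supervisor brings the scorer back.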
Lightweight Feature Engineering
Although our client requested no feature engineering, we found that applying just a bit of it helped us reach a remarkable 98.364% accuracy. The minimal transformation involved selecting and reformatting the raw data to better suit VW's input format, with a few categorical transformations that can be applied in real time, without overcomplicating the pipeline.
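For illustration, this is the kind of real-time transformation we mean. The column layout and field names below are hypothetical, not the client's actual schema:

# Sketch: turn one raw CSV row into a VW line, one-hot-encoding a
# categorical column. Field positions and names are hypothetical.
defmodule Featurize do
  # row example: "42,ios,3,True."
  def to_vw_line(row) do
    [sessions, platform, days_idle, churned] = String.split(String.trim(row), ",")
    label = if churned == "True." do "1" else "-1" end
    # The categorical value becomes a named indicator feature;
    # numeric fields keep their values.
    "#{label} | sessions:#{sessions} days_idle:#{days_idle} platform_#{platform}"
  end
end

IO.puts(Featurize.to_vw_line("42,ios,3,True.\n"))
# => 1 | sessions:42 days_idle:3 platform_ios

Because the mapping is a pure function over a single row, it can run inline in the ingestion path, with no batch preprocessing step.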
The Solution: A Scalable, Resilient Model
We built a solution that ticks all the boxes for our client's needs. Here's how it works: player data is split into training and evaluation sets, converted into VW's input format, used to train a logistic model, and scored against held-out players, with the whole pipeline orchestrated from Elixir (the equivalent code is shown below). This churn prediction model, written in Elixir and powered by Vowpal Wabbit, went from concept to production inside the client's gaming platform in a flat three weeks, scaling as needed while maintaining high accuracy.
Fraud Detection Model with Zero Data Modeling
Although we cannot share the original churn prediction codebase due to customer confidentiality, we are presenting an equivalent fraud detection model built with the same Elixir and Vowpal Wabbit toolchain. This model, without any feature engineering, complex data modeling, or hyperparameter tuning, achieved a solid 74% accuracy. It shows how flexibly the toolchain adapts to different prediction tasks, from churn to fraud detection, without requiring deep data science expertise.
# churn_prediction.exs
defmodule ChurnPrediction do
  require Logger

  @csv_file "1Million.csv"
  @train_file "train.csv"
  @prediction_file "prediction.csv"
  @vw_train_file "train.vw"
  @vw_prediction_file "prediction.vw"
  @model_file "model.vw"
  @predictions_output_file "predictions.txt"

  # Step 1: Split the dataset into train and prediction datasets (70% train, 30% test)
  def split_data() do
    csv_content =
      File.stream!(@csv_file)
      |> Stream.drop(1) # skip the header row
      |> Enum.to_list()

    total_count = length(csv_content)
    train_count = div(total_count * 70, 100) # 70% for training

    train_data = Enum.take(csv_content, train_count)
    prediction_data = Enum.drop(csv_content, train_count)

    # Write the train and prediction datasets to separate CSV files
    File.write!(@train_file, Enum.join(train_data, ""))
    File.write!(@prediction_file, Enum.join(prediction_data, ""))

    Logger.info("Split data into #{@train_file} (#{train_count} rows) and #{@prediction_file} (#{total_count - train_count} rows).")
  end

  # Step 2: Convert CSV to Vowpal Wabbit format.
  # VW's logistic loss expects labels in {-1, +1}, so "True." (churned)
  # maps to 1 and everything else to -1.
  def convert_to_vw_format(csv_file, vw_file) do
    File.stream!(csv_file)
    |> Stream.map(fn line ->
      fields = String.split(line, ",")
      label = if String.trim(List.last(fields)) == "True." do "1" else "-1" end
      features = Enum.take(fields, length(fields) - 1) # all but the label column
      "#{label} | " <> Enum.join(features, " ") <> "\n"
    end)
    |> Stream.into(File.stream!(vw_file)) # write out the VW-format file
    |> Stream.run()

    Logger.info("Converted #{csv_file} to Vowpal Wabbit format: #{vw_file}")
  end

  # Step 3: Train the Vowpal Wabbit model.
  # --binary makes VW output hard -1/+1 predictions instead of raw scores.
  def train_model() do
    command =
      "vw #{@vw_train_file} -f #{@model_file} --loss_function logistic --binary --passes 10 --cache_file vw.cache"

    # Note: VW prints its progress table to stderr, so `output` holds stdout only.
    {output, _} = System.cmd("bash", ["-c", command])
    Logger.info("Model training completed: #{@model_file}")
    Logger.info(output)
  end

  # Step 4: Predict using the trained model (-t = test only, no learning)
  def predict() do
    command = "vw #{@vw_prediction_file} -t -i #{@model_file} -p #{@predictions_output_file}"
    {output, _} = System.cmd("bash", ["-c", command])
    Logger.info("Predictions saved to #{@predictions_output_file}")
    Logger.info(output)

    # Calculate accuracy
    calculate_accuracy()
  end

  # Step 5: Calculate accuracy by comparing actual vs predicted values
  def calculate_accuracy() do
    actual_values =
      File.stream!(@prediction_file)
      |> Stream.map(fn line ->
        fields = String.split(line, ",")
        if String.trim(List.last(fields)) == "True." do 1 else -1 end
      end)
      |> Enum.to_list()

    predicted_values =
      File.stream!(@predictions_output_file)
      |> Stream.map(&String.trim/1)
      # Float.parse/1 handles both "1" and "1.000000"; String.to_float/1
      # would crash on predictions printed without a decimal point.
      |> Enum.map(fn s ->
        {value, _rest} = Float.parse(s)
        value
      end)
      |> Enum.map(&round/1) # round to -1 or 1 (binary classification)

    # Compare actual vs predicted and calculate accuracy
    correct_predictions =
      Enum.zip(actual_values, predicted_values)
      |> Enum.count(fn {actual, predicted} -> actual == predicted end)

    total_predictions = length(predicted_values)
    accuracy = correct_predictions / total_predictions * 100

    Logger.info("Accuracy: #{accuracy}%")
    IO.puts("Accuracy: #{accuracy}%")
  end

  # Step 6: Full pipeline - split, convert, train, predict
  def process() do
    split_data()
    convert_to_vw_format(@train_file, @vw_train_file)
    convert_to_vw_format(@prediction_file, @vw_prediction_file)
    train_model()
    predict()
  end
end

# Run the pipeline when the script is executed
ChurnPrediction.process()
To run this code, a few dependencies are needed:
$ git clone https://github.com/niravdd/2018-GAM310.git
$ brew install elixir vowpal-wabbit
$ elixir churn_prediction.exs
Running the script prints the output from model training and prediction:
$ elixir churn_prediction.exs
22:27:13.973 [info] Split data into train.csv (700000 rows) and prediction.csv (300000 rows).
22:27:20.150 [info] Converted train.csv to Vowpal Wabbit format using headers: train.vw
22:27:22.728 [info] Converted prediction.csv to Vowpal Wabbit format using headers: prediction.vw
using l1 regularization = 1e-06
using l2 regularization = 1e-06
final_regressor = model.vw
creating cache_file = vw.cache
Reading datafile = train.vw
num sources = 1
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
Enabled learners: gd, scorer-identity, binary, count_label
Input label = SIMPLE
Output pred = SCALAR
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 -1.0000 8
0.500000 0.000000 2 2.0 1.0000 1.0000 8
0.800000 1.000000 3 5.0 -1.0000 1.0000 8
0.363636 0.000000 5 11.0 -1.0000 -1.0000 8
0.173913 0.000000 9 23.0 -1.0000 -1.0000 8
0.148936 0.125000 19 47.0 -1.0000 -1.0000 8
0.148936 0.148936 40 94.0 -1.0000 -1.0000 8
0.122340 0.095745 80 188.0 1.0000 1.0000 8
0.085106 0.047872 156 376.0 -1.0000 -1.0000 8
0.054449 0.023873 303 753.0 -1.0000 -1.0000 8
0.036521 0.018592 604 1506.0 -1.0000 -1.0000 8
0.024900 0.013280 1210 3012.0 1.0000 1.0000 7
0.017427 0.009957 2411 6025.0 -1.0000 -1.0000 8
0.013194 0.008961 4813 12051.0 -1.0000 -1.0000 8
0.010456 0.007717 9684 24102.0 -1.0000 -1.0000 8
0.013463 0.016470 19362 48206.0 -1.0000 -1.0000 8
0.018213 0.022963 38853 96413.0 -1.0000 -1.0000 8
0.021180 0.024146 77720 192826.0 -1.0000 -1.0000 8
0.026246 0.031313 155415 385653.0 -1.0000 -1.0000 8
0.046257 0.066267 310752 771306.0 1.0000 -1.0000 8
0.070875 0.095493 621604 1542612.0 1.0000 -1.0000 8
0.089065 0.089065 1243290 3085224.0 -1.0000 -1.0000 8 h
0.094183 0.099300 2486612 6170448.0 -1.0000 -1.0000 8 h
finished run
number of examples per pass = 630000
passes used = 4
weighted example sum = 6253096.000000
weighted label sum = -4946192.000000
average loss = 0.072802 h
best constant = -2.148189
best constant's loss = 0.334861
total feature number = 20154216
22:27:25.687 [info] Model training completed: model.vw
22:27:25.687 [info]
only testing
predictions = predictions.txt
using no cache
Reading datafile = prediction.vw
num sources = 1
Num weight bits = 18
learning rate = 0.5
initial_t = 1.73658e+06
power_t = 0.5
Enabled learners: gd, scorer-identity, binary, count_label
Input label = SIMPLE
Output pred = SCALAR
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 -1.0000 8
0.250000 0.000000 2 4.0 -1.0000 -1.0000 8
0.250000 0.250000 4 8.0 1.0000 -1.0000 8
0.117647 0.000000 7 17.0 -1.0000 -1.0000 8
0.117647 0.117647 14 34.0 -1.0000 -1.0000 8
0.073529 0.029412 26 68.0 -1.0000 -1.0000 8
0.073529 0.073529 52 136.0 -1.0000 -1.0000 8
0.113139 0.152174 112 274.0 -1.0000 -1.0000 8
0.124088 0.135036 228 548.0 -1.0000 -1.0000 8
0.118613 0.113139 452 1096.0 -1.0000 -1.0000 8
0.109033 0.099453 890 2192.0 1.0000 -1.0000 8
0.106703 0.104376 1774 4386.0 -1.0000 -1.0000 8
0.104309 0.101915 3534 8772.0 -1.0000 -1.0000 8
0.101847 0.099385 7040 17546.0 -1.0000 -1.0000 8
0.104206 0.106565 14136 35094.0 -1.0000 -1.0000 8
0.105428 0.106650 28330 70190.0 -1.0000 -1.0000 8
0.105007 0.104586 56621 140381.0 -1.0000 -1.0000 8
0.104932 0.104857 113228 280762.0 1.0000 -1.0000 8
0.104243 0.103554 226198 561524.0 -1.0000 -1.0000 8
finished run
number of examples = 300000
weighted example sum = 744476.000000
weighted label sum = -588952.000000
average loss = 0.104452
best constant = -0.791096
best constant's loss = 0.374167
total feature number = 2400000
22:27:26.956 [info] Predictions saved to predictions.txt
22:27:26.956 [info]
Accuracy: 74.07966666666667%
22:27:27.369 [info] Accuracy: 74.07966666666667%
The out-of-the-box accuracy, with zero feature engineering or tuning, is 74%.
Why Banks, Real Estate, and E-commerce Should Use This Approach
This type of model isn't just for gaming: it adapts readily to industries like banking, real estate, and e-commerce that deal with churn, fraud, or other anomalies in user behavior. The same properties carry over: the toolchain is cheap to run, fast to put into production, and accurate without a dedicated data science team. By using this approach, businesses can cut costs, speed up implementation, and maintain high accuracy, making it an ideal solution for banks, real estate platforms, and e-commerce companies.
Conclusion
By choosing Elixir for its scalability and fault tolerance and Vowpal Wabbit for its speed and cloud-agnostic nature, we delivered a churn prediction model that our customer could deploy with ease. The system was resilient to data drift, adapted effortlessly to new games, and required minimal maintenance—an essential feature for a company without a dedicated data science team.
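One reason the system coped well with drift is that VW can warm-start from an existing model, so refreshing it on new data is a one-line retraining job. A sketch, with illustrative file names (for best results the original model should itself be saved with --save_resume):

$ # continue training from last week's model on the newest data
$ vw new_week.vw -i model.vw -f model_updated.vw --save_resume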
The result was not just a technology win, but a business win, helping the gaming company predict and mitigate churn, ultimately improving retention rates. If you're grappling with a similar issue—whether in gaming, banking, or real estate—technology like this can be your competitive advantage.