Where Did My Customer Go? (aka Churn Prediction)
Supreet Sethi
Let's elevate your business infrastructure together. #solutionengineering
Churn is a reality. Whether you run a bank, a real estate site, or an online game, you know that no matter how compelling the customer experience is, people drift away for all sorts of reasons: they lose interest, find a better service, or simply have other priorities. Predicting churn, that is, knowing when and why customers might leave, can be a game-changer. One of our customers, a gaming company, came to us at The Migration Workshop with exactly this problem. They needed a churn prediction solution that could scale, adapt to new games, and be simple enough to manage without a data science team.
We were faced with several challenges:
- The model had to scale with a fast-growing player base.
- It had to adapt to new games without being rebuilt from scratch.
- It had to stay accurate as player behavior (and the underlying data) drifted.
- It had to be simple enough to operate without a dedicated data science team.
Choosing the Right Technology: Elixir and Vowpal Wabbit
With these constraints in mind, we first looked at a Python-based fraud detection model built by Nirav D. Doshi using XGBoost. Fraud detection and churn prediction have similarities—both involve identifying rare patterns or anomalies in customer behavior. However, since our client didn’t have experience with Python or XGBoost, we decided to simplify things. We took the main ideas from that model and reworked them into an Elixir-based solution, using Vowpal Wabbit for machine learning. This gave us a streamlined and effective toolchain that fit the client’s needs without adding unnecessary complexity.
Why Vowpal Wabbit?
Vowpal Wabbit (VW) is an open-source, fast, and lightweight machine learning library built for scalability. Unlike many traditional machine learning libraries that can become bottlenecks at scale, VW shines when dealing with large datasets and high throughput.
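To make that concrete, here is the plain-text input format VW consumes and a typical training invocation. Labels are -1/+1 for logistic loss; the feature names below are purely illustrative, not our client's schema:

1 | sessions:3 days_since_login:12 platform_mobile
-1 | sessions:41 days_since_login:1 platform_desktop

$ vw churn.vw --loss_function logistic --binary -f model.vw

Because VW learns online, one example at a time, it streams through arbitrarily large files in roughly constant memory, which is exactly what you want at gaming-platform scale.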
Why Elixir?
We chose Elixir for its performance and scalability. It's designed for building large-scale, fault-tolerant systems, making it a perfect fit for gaming companies expecting rapid growth in their player base. Here are a few reasons why Elixir was the right choice:
- It runs on the BEAM (the Erlang VM), whose supervision trees restart crashed processes instead of letting one failure take the whole system down.
- Lightweight processes make it cheap to score many players concurrently.
- It is operationally simple, which matters for a team without a dedicated data science or ML-ops function.
A minimal sketch of the supervision idea follows.
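The module names here (Churn.Scorer) are hypothetical, not the client's actual code; the sketch just shows the OTP pattern:

# Minimal fault-tolerance sketch: a scorer process under a supervisor.
defmodule Churn.Scorer do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}

  # Shells out to the vw binary. If vw exits non-zero, the pattern match
  # fails, this process crashes, and the supervisor restarts it.
  @impl true
  def handle_call({:score, vw_file}, _from, state) do
    {out, 0} = System.cmd("vw", [vw_file, "-t", "-i", "model.vw", "-p", "/dev/stdout"])
    {:reply, out, state}
  end
end

# :one_for_one - a crashed scorer is restarted without touching its siblings.
{:ok, _sup} = Supervisor.start_link([Churn.Scorer], strategy: :one_for_one)
GenServer.call(Churn.Scorer, {:score, "prediction.vw"})

A crash in a scoring call takes down only that process; the rest of the platform keeps serving players while the supervisor brings the scorer back.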
Lightweight Feature Engineering
Although our client requested no feature engineering, we found that applying just a bit of it helped us reach a remarkable 98.364% accuracy. The minimal transformation involved selecting and reformatting the raw data to better suit VW's input format, with a few categorical transformations that can be applied in real time, without overcomplicating the pipeline.
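For illustration, this is the kind of real-time transformation we mean. The column layout and field names below are hypothetical, not the client's actual schema:

# Sketch: turn one raw CSV row into a VW line, one-hot-encoding a
# categorical column. Field positions and names are hypothetical.
defmodule Featurize do
  # row example: "42,ios,3,True."
  def to_vw_line(row) do
    [sessions, platform, days_idle, churned] = String.split(String.trim(row), ",")
    label = if churned == "True." do "1" else "-1" end
    # The categorical value becomes a named indicator feature;
    # numeric fields keep their values.
    "#{label} | sessions:#{sessions} days_idle:#{days_idle} platform_#{platform}"
  end
end

IO.puts(Featurize.to_vw_line("42,ios,3,True.\n"))
# => 1 | sessions:42 days_idle:3 platform_ios

Because the mapping is a pure function over a single row, it can run inline in the ingestion path, with no batch preprocessing step.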
The Solution: A Scalable, Resilient Model
We built a solution that ticks all the boxes for our client's needs. Here's how it works: player data is split into training and evaluation sets, converted into VW's input format, used to train a logistic model, and scored against held-out players, with the whole pipeline orchestrated from Elixir (the equivalent code is shown below). This churn prediction model, written in Elixir and powered by Vowpal Wabbit, went from concept to production inside the client's gaming platform in a flat three weeks, scaling as needed while maintaining high accuracy.
Fraud Detection Model with Zero Data Modeling
Although we cannot share the original churn prediction codebase due to customer confidentiality, we are presenting an equivalent fraud detection model built with the same Elixir and Vowpal Wabbit toolchain. This model, without any feature engineering, complex data modeling, or hyperparameter tuning, achieved a solid 74% accuracy. It shows how flexibly the toolchain adapts to different prediction tasks, from churn to fraud detection, without requiring deep data science expertise.
# churn_prediction.exs
defmodule ChurnPrediction do
  require Logger

  @csv_file "1Million.csv"
  @train_file "train.csv"
  @prediction_file "prediction.csv"
  @vw_train_file "train.vw"
  @vw_prediction_file "prediction.vw"
  @model_file "model.vw"
  @predictions_output_file "predictions.txt"

  # Step 1: Split the dataset into train and prediction datasets (70% train, 30% test)
  def split_data() do
    csv_content =
      File.stream!(@csv_file)
      |> Stream.drop(1) # skip the header row
      |> Enum.to_list()

    total_count = length(csv_content)
    train_count = div(total_count * 70, 100) # 70% for training

    train_data = Enum.take(csv_content, train_count)
    prediction_data = Enum.drop(csv_content, train_count)

    # Write the train and prediction datasets to separate CSV files
    File.write!(@train_file, Enum.join(train_data, ""))
    File.write!(@prediction_file, Enum.join(prediction_data, ""))

    Logger.info("Split data into #{@train_file} (#{train_count} rows) and #{@prediction_file} (#{total_count - train_count} rows).")
  end

  # Step 2: Convert CSV to Vowpal Wabbit format.
  # VW's logistic loss expects labels in {-1, +1}, so "True." (churned)
  # maps to 1 and everything else to -1.
  def convert_to_vw_format(csv_file, vw_file) do
    File.stream!(csv_file)
    |> Stream.map(fn line ->
      fields = String.split(line, ",")
      label = if String.trim(List.last(fields)) == "True." do "1" else "-1" end
      features = Enum.take(fields, length(fields) - 1) # all but the label column
      "#{label} | " <> Enum.join(features, " ") <> "\n"
    end)
    |> Stream.into(File.stream!(vw_file)) # write out the VW-format file
    |> Stream.run()

    Logger.info("Converted #{csv_file} to Vowpal Wabbit format: #{vw_file}")
  end

  # Step 3: Train the Vowpal Wabbit model.
  # --binary makes VW output hard -1/+1 predictions instead of raw scores.
  def train_model() do
    command =
      "vw #{@vw_train_file} -f #{@model_file} --loss_function logistic --binary --passes 10 --cache_file vw.cache"

    # Note: VW prints its progress table to stderr, so `output` holds stdout only.
    {output, _} = System.cmd("bash", ["-c", command])
    Logger.info("Model training completed: #{@model_file}")
    Logger.info(output)
  end

  # Step 4: Predict using the trained model (-t = test only, no learning)
  def predict() do
    command = "vw #{@vw_prediction_file} -t -i #{@model_file} -p #{@predictions_output_file}"
    {output, _} = System.cmd("bash", ["-c", command])
    Logger.info("Predictions saved to #{@predictions_output_file}")
    Logger.info(output)

    # Calculate accuracy
    calculate_accuracy()
  end

  # Step 5: Calculate accuracy by comparing actual vs predicted values
  def calculate_accuracy() do
    actual_values =
      File.stream!(@prediction_file)
      |> Stream.map(fn line ->
        fields = String.split(line, ",")
        if String.trim(List.last(fields)) == "True." do 1 else -1 end
      end)
      |> Enum.to_list()

    predicted_values =
      File.stream!(@predictions_output_file)
      |> Stream.map(&String.trim/1)
      # Float.parse/1 handles both "1" and "1.000000"; String.to_float/1
      # would crash on predictions printed without a decimal point.
      |> Enum.map(fn s ->
        {value, _rest} = Float.parse(s)
        value
      end)
      |> Enum.map(&round/1) # round to -1 or 1 (binary classification)

    # Compare actual vs predicted and calculate accuracy
    correct_predictions =
      Enum.zip(actual_values, predicted_values)
      |> Enum.count(fn {actual, predicted} -> actual == predicted end)

    total_predictions = length(predicted_values)
    accuracy = correct_predictions / total_predictions * 100

    Logger.info("Accuracy: #{accuracy}%")
    IO.puts("Accuracy: #{accuracy}%")
  end

  # Step 6: Full pipeline - split, convert, train, predict
  def process() do
    split_data()
    convert_to_vw_format(@train_file, @vw_train_file)
    convert_to_vw_format(@prediction_file, @vw_prediction_file)
    train_model()
    predict()
  end
end

# Run the pipeline when the script is executed
ChurnPrediction.process()
To run this code, a few dependencies are needed:
$ git clone https://github.com/niravdd/2018-GAM310.git
$ brew install elixir vowpal-wabbit
$ elixir churn_prediction.exs
Running the script prints the output from model training and prediction:
$ elixir churn_prediction.exs
22:27:13.973 [info] Split data into train.csv (700000 rows) and prediction.csv (300000 rows).
22:27:20.150 [info] Converted train.csv to Vowpal Wabbit format using headers: train.vw
22:27:22.728 [info] Converted prediction.csv to Vowpal Wabbit format using headers: prediction.vw
using l1 regularization = 1e-06
using l2 regularization = 1e-06
final_regressor = model.vw
creating cache_file = vw.cache
Reading datafile = train.vw
num sources = 1
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
Enabled learners: gd, scorer-identity, binary, count_label
Input label = SIMPLE
Output pred = SCALAR
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 -1.0000 8
0.500000 0.000000 2 2.0 1.0000 1.0000 8
0.800000 1.000000 3 5.0 -1.0000 1.0000 8
0.363636 0.000000 5 11.0 -1.0000 -1.0000 8
0.173913 0.000000 9 23.0 -1.0000 -1.0000 8
0.148936 0.125000 19 47.0 -1.0000 -1.0000 8
0.148936 0.148936 40 94.0 -1.0000 -1.0000 8
0.122340 0.095745 80 188.0 1.0000 1.0000 8
0.085106 0.047872 156 376.0 -1.0000 -1.0000 8
0.054449 0.023873 303 753.0 -1.0000 -1.0000 8
0.036521 0.018592 604 1506.0 -1.0000 -1.0000 8
0.024900 0.013280 1210 3012.0 1.0000 1.0000 7
0.017427 0.009957 2411 6025.0 -1.0000 -1.0000 8
0.013194 0.008961 4813 12051.0 -1.0000 -1.0000 8
0.010456 0.007717 9684 24102.0 -1.0000 -1.0000 8
0.013463 0.016470 19362 48206.0 -1.0000 -1.0000 8
0.018213 0.022963 38853 96413.0 -1.0000 -1.0000 8
0.021180 0.024146 77720 192826.0 -1.0000 -1.0000 8
0.026246 0.031313 155415 385653.0 -1.0000 -1.0000 8
0.046257 0.066267 310752 771306.0 1.0000 -1.0000 8
0.070875 0.095493 621604 1542612.0 1.0000 -1.0000 8
0.089065 0.089065 1243290 3085224.0 -1.0000 -1.0000 8 h
0.094183 0.099300 2486612 6170448.0 -1.0000 -1.0000 8 h
finished run
number of examples per pass = 630000
passes used = 4
weighted example sum = 6253096.000000
weighted label sum = -4946192.000000
average loss = 0.072802 h
best constant = -2.148189
best constant's loss = 0.334861
total feature number = 20154216
22:27:25.687 [info] Model training completed: model.vw
22:27:25.687 [info]
only testing
predictions = predictions.txt
using no cache
Reading datafile = prediction.vw
num sources = 1
Num weight bits = 18
learning rate = 0.5
initial_t = 1.73658e+06
power_t = 0.5
Enabled learners: gd, scorer-identity, binary, count_label
Input label = SIMPLE
Output pred = SCALAR
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 -1.0000 8
0.250000 0.000000 2 4.0 -1.0000 -1.0000 8
0.250000 0.250000 4 8.0 1.0000 -1.0000 8
0.117647 0.000000 7 17.0 -1.0000 -1.0000 8
0.117647 0.117647 14 34.0 -1.0000 -1.0000 8
0.073529 0.029412 26 68.0 -1.0000 -1.0000 8
0.073529 0.073529 52 136.0 -1.0000 -1.0000 8
0.113139 0.152174 112 274.0 -1.0000 -1.0000 8
0.124088 0.135036 228 548.0 -1.0000 -1.0000 8
0.118613 0.113139 452 1096.0 -1.0000 -1.0000 8
0.109033 0.099453 890 2192.0 1.0000 -1.0000 8
0.106703 0.104376 1774 4386.0 -1.0000 -1.0000 8
0.104309 0.101915 3534 8772.0 -1.0000 -1.0000 8
0.101847 0.099385 7040 17546.0 -1.0000 -1.0000 8
0.104206 0.106565 14136 35094.0 -1.0000 -1.0000 8
0.105428 0.106650 28330 70190.0 -1.0000 -1.0000 8
0.105007 0.104586 56621 140381.0 -1.0000 -1.0000 8
0.104932 0.104857 113228 280762.0 1.0000 -1.0000 8
0.104243 0.103554 226198 561524.0 -1.0000 -1.0000 8
finished run
number of examples = 300000
weighted example sum = 744476.000000
weighted label sum = -588952.000000
average loss = 0.104452
best constant = -0.791096
best constant's loss = 0.374167
total feature number = 2400000
22:27:26.956 [info] Predictions saved to predictions.txt
22:27:26.956 [info]
Accuracy: 74.07966666666667%
22:27:27.369 [info] Accuracy: 74.07966666666667%
The out-of-the-box accuracy, with zero feature engineering or tuning, is 74%.
Why Banks, Real Estate, and E-commerce Should Use This Approach
This type of model isn't just for gaming: it adapts readily to industries like banking, real estate, and e-commerce that deal with churn, fraud, or other anomalies in user behavior. The same properties carry over: the toolchain is cheap to run, fast to put into production, and accurate without a dedicated data science team. By using this approach, businesses can cut costs, speed up implementation, and maintain high accuracy, making it an ideal solution for banks, real estate platforms, and e-commerce companies.
Conclusion
By choosing Elixir for its scalability and fault tolerance and Vowpal Wabbit for its speed and cloud-agnostic nature, we delivered a churn prediction model that our customer could deploy with ease. The system was resilient to data drift, adapted effortlessly to new games, and required minimal maintenance—an essential feature for a company without a dedicated data science team.
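One reason the system coped well with drift is that VW can warm-start from an existing model, so refreshing it on new data is a one-line retraining job. A sketch, with illustrative file names (for best results the original model should itself be saved with --save_resume):

$ # continue training from last week's model on the newest data
$ vw new_week.vw -i model.vw -f model_updated.vw --save_resume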
The result was not just a technology win, but a business win, helping the gaming company predict and mitigate churn, ultimately improving retention rates. If you're grappling with a similar issue—whether in gaming, banking, or real estate—technology like this can be your competitive advantage.