How to Generate Random (Not So Random) CDRs
ElMehdi Erroussafi
Solving telecom, AI, and cybersecurity challenges with a touch of creativity and a lot of will.
I often run into the same problem: in telecom and data science, Call Detail Records (CDRs) are crucial for many types of analysis, ranging from operational insights to anomaly detection. However, obtaining real-world CDRs for testing models can be difficult due to privacy concerns. I have been on the other side, and I know how valuable CDRs can be.
That’s where generating synthetic CDRs comes into play. But in most cases, purely random CDRs won't reflect the patterns or anomalies we need for real-world analysis.
In this article, I'll try to share with you how I use Python to generate synthetic CDRs with varying degrees of randomness and control. We’ll walk through generating basic random CDRs, applying constraints, and even introducing controlled anomalies.
One of the easiest ways to get started is by using the Faker library. This tool can help us generate synthetic data that looks realistic, including CDR fields such as calling number, called number, call duration, and cell ID.
from faker import Faker
import random
from datetime import datetime, timedelta

fake = Faker()

# Generating a random CDR
def generate_random_cdr():
    cdr_id = fake.uuid4()
    calling_number = fake.msisdn()
    called_number = fake.msisdn()
    duration = random.randint(1, 3600)  # Duration in seconds
    start_time = fake.date_time_this_month()
    end_time = start_time + timedelta(seconds=duration)
    cell_id = fake.random_int(min=1, max=1000)  # Mock cell IDs
    return {
        'cdr_id': cdr_id,
        'calling_number': calling_number,
        'called_number': called_number,
        'duration': duration,
        'start_time': start_time,
        'end_time': end_time,
        'cell_id': cell_id
    }

cdrs = [generate_random_cdr() for _ in range(10)]
for cdr in cdrs:
    print(cdr)
The above code will generate simple random CDRs with the minimal essential fields. It’s useful for initial testing but doesn’t reflect realistic telecom patterns.
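Because these records are random, every run produces a different dataset, which makes test results hard to compare. Below is a minimal sketch of one way to get repeatable output, using only the standard library and a seeded `random.Random` instance in place of Faker. The function name `generate_seeded_cdrs`, the number format, and the fixed base time are illustrative assumptions, not part of the code above. With Faker itself, `Faker.seed()` serves the same purpose.

```python
import random
from datetime import datetime, timedelta

def generate_seeded_cdrs(seed, count=10):
    """Return `count` CDR dicts that are identical for the same seed."""
    rng = random.Random(seed)         # private generator, independent of global state
    base_time = datetime(2024, 1, 1)  # fixed reference so timestamps are repeatable
    cdrs = []
    for i in range(count):
        duration = rng.randint(1, 3600)  # seconds
        # Random offset within a 30-day window after base_time
        start_time = base_time + timedelta(seconds=rng.randint(0, 30 * 24 * 3600 - 1))
        cdrs.append({
            'cdr_id': i,
            'calling_number': f"2126{rng.randint(0, 99999999):08d}",  # made-up MSISDN format
            'called_number': f"2126{rng.randint(0, 99999999):08d}",
            'duration': duration,
            'start_time': start_time,
            'end_time': start_time + timedelta(seconds=duration),
            'cell_id': rng.randint(1, 1000),
        })
    return cdrs

# Two runs with the same seed can be compared directly
first_run = generate_seeded_cdrs(42, 3)
second_run = generate_seeded_cdrs(42, 3)
print(first_run == second_run)
```

Passing the generator around explicitly, rather than seeding the global `random` module, keeps the dataset stable even if other code draws random numbers in between.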
In many cases, you need CDR data to conform to certain patterns or constraints. For example, the calling number might need to come from a specific pool of numbers, the call start times might need to fall within certain time windows, or the cell IDs might need to be restricted to a certain range or list.
def generate_constrained_cdr():
    cdr_id = fake.uuid4()
    # Use a specific pool of numbers
    pool_of_numbers = ['+212660000001', '+212660000002', '+212660000003']
    calling_number = random.choice(pool_of_numbers)
    called_number = fake.msisdn()
    # Constrain the call start time to working hours (between 9 AM and 5 PM)
    start_time = fake.date_time_between_dates(
        datetime_start=datetime.now() - timedelta(days=30),
        datetime_end=datetime.now(),
    ).replace(hour=random.randint(9, 16))  # Hours 9 to 16, so every call starts before 5 PM
    duration = random.randint(60, 1800)  # Duration between 1 and 30 minutes
    end_time = start_time + timedelta(seconds=duration)
    cell_id = fake.random_int(min=100, max=200)  # Restrict cell IDs to a certain range
    return {
        'cdr_id': cdr_id,
        'calling_number': calling_number,
        'called_number': called_number,
        'duration': duration,
        'start_time': start_time,
        'end_time': end_time,
        'cell_id': cell_id
    }

# Generate a batch of constrained CDRs
constrained_cdrs = [generate_constrained_cdr() for _ in range(10)]
for cdr in constrained_cdrs:
    print(cdr)
Here, we’ve constrained the calling_number to a specific pool and ensured that calls occur during working hours. Such constraints make the generated CDRs more useful for simulation in realistic scenarios.
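A uniform pick across working hours is still flatter than real traffic, which tends to have busier and quieter hours. Below is a sketch of one way to bias the start hour with `random.choices`. The hourly weights are invented for illustration and are not taken from real traffic, and `weighted_start_time` is a hypothetical helper name.

```python
import random
from datetime import datetime

# Hypothetical relative weights per start hour (09:00 to 16:59); not measured data
HOUR_WEIGHTS = {9: 2, 10: 4, 11: 5, 12: 2, 13: 2, 14: 4, 15: 4, 16: 3}

def weighted_start_time(day, rng=random):
    """Pick a start time on `day`, with the hour drawn according to HOUR_WEIGHTS."""
    hours = list(HOUR_WEIGHTS)
    hour = rng.choices(hours, weights=[HOUR_WEIGHTS[h] for h in hours], k=1)[0]
    return day.replace(hour=hour, minute=rng.randint(0, 59),
                       second=rng.randint(0, 59), microsecond=0)

# Draw a few start times for one example day
for _ in range(5):
    print(weighted_start_time(datetime(2024, 1, 15)))
```

The returned value could replace the `start_time` computed in `generate_constrained_cdr`, with `end_time` still derived from the duration.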
In telecom fraud detection or anomaly analysis, we often need CDRs with specific abnormal patterns. For instance, we might want a subset of calling numbers (A-numbers) to exhibit anomalous behavior, like making an unusually high number of calls compared to the baseline.
# Generate CDRs with anomalies
def generate_anomalous_cdrs(anomalous_numbers, total_cdrs=100):
    normal_cdrs = []
    anomalous_cdrs = []
    for _ in range(total_cdrs):
        cdr = generate_constrained_cdr()
        if cdr['calling_number'] in anomalous_numbers:
            # Keep this call and add a burst of extra calls from the same A-number
            anomalous_cdrs.append(cdr)
            for _ in range(random.randint(5, 20)):  # Increase call frequency for anomalies
                extra_cdr = generate_constrained_cdr()
                extra_cdr['calling_number'] = cdr['calling_number']
                anomalous_cdrs.append(extra_cdr)
        else:
            normal_cdrs.append(cdr)
    return normal_cdrs + anomalous_cdrs

# List of anomalous A-numbers; they must belong to the pool used by generate_constrained_cdr()
anomalous_numbers = ['+212660000001']
anomalous_cdrs = generate_anomalous_cdrs(anomalous_numbers)

# Simulate dataset with both normal and anomalous CDRs
for cdr in anomalous_cdrs:
    print(cdr)
This code intentionally introduces anomalies by making specific A-numbers generate more calls than the others. This is useful when building datasets for machine learning models or anomaly detection algorithms, where you want to highlight abnormal patterns.
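A quick way to check that an injected anomaly is actually visible in the data is to count calls per A-number. The sketch below uses a small hand-written sample and an arbitrary threshold (three times the median call count). Both are assumptions for illustration, and `flag_heavy_callers` is a hypothetical helper name.

```python
from collections import Counter

def flag_heavy_callers(cdrs, factor=3.0):
    """Return A-numbers whose call count exceeds `factor` times the median count."""
    counts = Counter(cdr['calling_number'] for cdr in cdrs)
    if not counts:
        return []
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]  # upper median, good enough for a sketch
    return sorted(number for number, count in counts.items() if count > factor * median)

# Hand-written sample: one A-number calls far more often than the other two
sample_cdrs = (
    [{'calling_number': '+212660000001'}] * 12
    + [{'calling_number': '+212660000002'}] * 2
    + [{'calling_number': '+212660000003'}] * 3
)
print(flag_heavy_callers(sample_cdrs))  # ['+212660000001']
```

The same function can be run on the generated dataset to confirm that the anomalous A-numbers stand out from the baseline before the data is handed to a model.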
Generating synthetic CDRs using Python can be incredibly powerful, especially when you need control over the randomness of the data. By applying constraints and intentionally introducing anomalies, you can simulate real-world scenarios for testing and model training. The Faker library combined with Python’s flexibility allows for highly customizable CDR datasets that cater to your specific needs in telecom and data science.
Note: I rely a lot on GPT to generate code snippets, and I discovered Faker thanks to ChatGPT, so all the credit really goes to my good old friend for helping me solve some of my daily problems.