How to Generate Random (Not So Random) CDRs

I often run into the same problem. In telecom and data science, Call Detail Records (CDRs) are crucial for many types of analysis, from operational insights to anomaly detection. However, obtaining real-world CDRs for testing models can be difficult because of privacy concerns. I have been on the other side, and I know how valuable CDRs can be.

That’s where generating synthetic CDRs comes into play. But in most cases, purely random CDRs won't reflect the patterns or anomalies we need for real-world analysis.

In this article, I'll share how I use Python to generate synthetic CDRs with varying degrees of randomness and control. We’ll walk through generating basic random CDRs, applying constraints, and even introducing controlled anomalies.


One of the easiest ways to get started is by using the Faker library. This tool can help us generate synthetic data that looks realistic, including CDR fields such as calling number, called number, call duration, and cell ID.

from faker import Faker
import random
from datetime import datetime, timedelta

fake = Faker()

# Generating a random CDR
def generate_random_cdr():
    cdr_id = fake.uuid4()
    calling_number = fake.msisdn()
    called_number = fake.msisdn()
    duration = random.randint(1, 3600)  # Duration in seconds
    start_time = fake.date_time_this_month()
    end_time = start_time + timedelta(seconds=duration)
    cell_id = fake.random_int(min=1, max=1000)  # Mock cell IDs

    return {
        'cdr_id': cdr_id,
        'calling_number': calling_number,
        'called_number': called_number,
        'duration': duration,
        'start_time': start_time,
        'end_time': end_time,
        'cell_id': cell_id
    }

cdrs = [generate_random_cdr() for _ in range(10)]
for cdr in cdrs:
    print(cdr)

The above code will generate simple random CDRs with the minimal essential fields. It’s useful for initial testing but doesn’t reflect realistic telecom patterns.

In many cases, you need CDR data to conform to certain patterns or constraints. For example, the calling number might need to come from a specific pool of numbers, call start times might need to fall within certain time windows, or cell IDs might need to be restricted to a certain range or list.

def generate_constrained_cdr():
    cdr_id = fake.uuid4()

    # Use a specific pool of numbers
    pool_of_numbers = ['+212660000001', '+212660000002', '+212660000003']
    calling_number = random.choice(pool_of_numbers)
    called_number = fake.msisdn()

    # Constrain call start time to working hours (between 9 AM and 5 PM)
    now = datetime.now()
    start_time = fake.date_time_between_dates(
        datetime_start=now - timedelta(days=30),
        datetime_end=now,
    ).replace(hour=random.randint(9, 16))  # hours 9..16 keep the start before 5 PM
    duration = random.randint(60, 1800)  # Duration between 1 and 30 minutes
    end_time = start_time + timedelta(seconds=duration)
    cell_id = fake.random_int(min=100, max=200)  # Restrict cell IDs to a certain range

    return {
        'cdr_id': cdr_id,
        'calling_number': calling_number,
        'called_number': called_number,
        'duration': duration,
        'start_time': start_time,
        'end_time': end_time,
        'cell_id': cell_id
    }

# Generate a batch of constrained CDRs
constrained_cdrs = [generate_constrained_cdr() for _ in range(10)]
for cdr in constrained_cdrs:
    print(cdr)

Here, we’ve constrained the calling_number to a specific pool and ensured that calls occur during working hours. Such constraints make the generated CDRs more useful for simulation in realistic scenarios.
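
The same idea can be pushed a little further by passing the constraints in as parameters instead of hard-coding them, including an explicit list of cell IDs rather than a range. The sketch below is one possible way to do that using only the standard library. The function name `generate_parameterized_cdr`, its arguments, and the simplified `called_number` format are assumptions made for illustration; they are not part of Faker or of the code above.

```python
import random
import uuid
from datetime import datetime, timedelta


def generate_parameterized_cdr(number_pool, allowed_cell_ids, window_start,
                               window_end, start_hour=9, end_hour=17, rng=None):
    """Build one CDR dict whose fields respect the given constraints.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = rng or random.Random()

    # Pick a random day inside [window_start, window_end]
    day_span = (window_end.date() - window_start.date()).days
    day = window_start.date() + timedelta(days=rng.randint(0, day_span))

    # Pick a start second so the call begins inside [start_hour, end_hour)
    seconds_in_window = (end_hour - start_hour) * 3600
    offset = rng.randrange(seconds_in_window)
    start_time = datetime.combine(day, datetime.min.time()) + timedelta(
        hours=start_hour, seconds=offset)

    duration = rng.randint(60, 1800)  # 1 to 30 minutes, in seconds

    return {
        'cdr_id': str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        'calling_number': rng.choice(number_pool),
        # Simplified stand-in for a called number (assumption, not a real numbering plan)
        'called_number': '+2126' + ''.join(rng.choice('0123456789') for _ in range(8)),
        'duration': duration,
        'start_time': start_time,
        'end_time': start_time + timedelta(seconds=duration),
        'cell_id': rng.choice(allowed_cell_ids),  # explicit list instead of a range
    }


pool = ['+212660000001', '+212660000002', '+212660000003']
cells = [101, 105, 150, 199]
rng = random.Random(42)  # seeded so the batch is reproducible
batch = [
    generate_parameterized_cdr(pool, cells, datetime(2024, 1, 1),
                               datetime(2024, 1, 31), rng=rng)
    for _ in range(5)
]
for cdr in batch:
    print(cdr)
```

Passing a seeded `random.Random` instance keeps each run reproducible, which is handy when a test dataset has to be regenerated later with the same content.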

In telecom fraud detection or anomaly analysis, we often need CDRs with specific abnormal patterns. For instance, we might want a subset of calling numbers (A-numbers) to exhibit anomalous behavior, such as making an unusually high number of calls compared to the baseline.

# Generate CDRs with anomalies
def generate_anomalous_cdrs(anomalous_numbers, total_cdrs=100):
    normal_cdrs = []
    anomalous_cdrs = []

    for _ in range(total_cdrs):
        cdr = generate_constrained_cdr()
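        if cdr['calling_number'] in anomalous_numbers:
            # Generate a burst of extra calls from the same anomalous A-number
            for _ in range(random.randint(5, 20)):  # Increase call frequency for anomalies
                burst_cdr = generate_constrained_cdr()
                burst_cdr['calling_number'] = cdr['calling_number']
                anomalous_cdrs.append(burst_cdr)
        else:
            normal_cdrs.append(cdr)

    return normal_cdrs + anomalous_cdrs

# List of anomalous A-numbers (must belong to the pool used in generate_constrained_cdr,
# otherwise the condition above never matches and no anomalies are produced)
anomalous_numbers = ['+212660000001']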
anomalous_cdrs = generate_anomalous_cdrs(anomalous_numbers)

# Simulate dataset with both normal and anomalous CDRs
for cdr in anomalous_cdrs:
    print(cdr)

This code intentionally introduces anomalies by making specific A-numbers generate more calls than the others. This is useful when building datasets for machine learning models or anomaly detection algorithms, where you want to highlight abnormal patterns.
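
Before feeding such a dataset to a model, it can help to check that the injected anomalies are actually visible. One simple way, sketched below under the assumption that each CDR is a dict with a `calling_number` key as in the examples above, is to count calls per A-number and flag the numbers whose count is well above the median. The function name `find_heavy_callers`, the `factor` threshold of 3, and the sample records are illustrative choices, not a standard detection rule.

```python
from collections import Counter
from statistics import median


def find_heavy_callers(cdrs, factor=3):
    """Return {calling_number: call_count} for numbers whose call count
    exceeds `factor` times the median count (hypothetical helper)."""
    counts = Counter(cdr['calling_number'] for cdr in cdrs)
    baseline = median(counts.values())
    return {num: n for num, n in counts.items() if n > factor * baseline}


# Tiny hand-built sample: one A-number makes far more calls than the others
sample_cdrs = (
    [{'calling_number': '+212660000001'}] * 40
    + [{'calling_number': '+212660000002'}] * 4
    + [{'calling_number': '+212660000003'}] * 5
)

heavy = find_heavy_callers(sample_cdrs)
print(heavy)
```

A median-based baseline only works when most A-numbers behave normally. With a very small pool in which most numbers are anomalous, the median itself shifts upward and the check may flag nothing.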

Generating synthetic CDRs using Python can be incredibly powerful, especially when you need control over the randomness of the data. By applying constraints and intentionally introducing anomalies, you can simulate real-world scenarios for testing and model training. The Faker library combined with Python’s flexibility allows for highly customizable CDR datasets that cater to your specific needs in telecom and data science.

Note: I rely a lot on GPT to generate code snippets, and I discovered Faker thanks to ChatGPT, so all the credit really goes to my good old friend for helping me solve some of my daily problems.

