RAFM Lookup Process using different Data representation

RAFM Lookup Process using different Data representation

In RAFM, lookup process is often required for various tasks. For example, in some cases, you may need to aggregate usage data and compare it against a list of predefined numbers. Whether it's for a fraud alarm, a revenue assurance audit, or control, or even for other complex multi-step audits performed daily, the process typically shares a common feature: both the aggregated data and the lookup table change with each run. Please refer to the graph below for an illustration,

In some cases, lookups may involve a variable at runtime, whether it's a list, an array, or another type of data structure. Alternatively, the lookup source could be a temporary table in a database or even a flat text file. Each method of storing consistent set of similar items—such as MSISDNs, call references, or other keys depends on the volume of data and the frequency of lookups.

In the realm of RAFM, speed and efficiency are critical. Certain audits may generate millions of records that need to be matched against hundreds of millions of events. Given this scale, what would be the optimal approach for handling these lookups efficiently?

In addition, the lookup process is often intensive in terms of frequency and the resources it consumes, including memory and storage. Therefore, we are always looking for the most optimal solution. In this context Bloom Filter might be what we are looking for. So, what is a Bloom Filter?

A Bloom Filter is a data structure used to efficiently check if an element is part of a set. It uses multiple hash functions to map an element to a set of positions in a bit array. When checking if an element is in the set, the Bloom Filter tests if all the corresponding positions in the array are set to 1. This allows for fast membership testing with low memory usage.

For example, imagine we need to store the later A and B in 8-bits array, as shown in the illustration below,

So the final result will be the following array,

An 8-bit representation, such as ASCII, hashes both A and B and may even include C later. This illustrates a common engineering dilemma: balancing trade-offs between competing factors. In the case of a Bloom Filter, the trade-off lies between efficiency and accuracy. This is the essence of how a Bloom Filter functions.

Back to our 8-bit array example, it might indicate the presence of A, B, or even C, but it definitively does not contain D. The core idea of a Bloom Filter is that it may sometimes produce false positives, suggesting an item is present when it isn't, but it will never produce false negatives. If the Bloom Filter states an item is absent, that conclusion is guaranteed to be correct.

This is how a Bloom Filter works. While we can reduce the false positive rate by optimizing the hashing algorithm and increasing the length of the bit array, some false positives will always remain—this is fundamental to how Bloom Filters operate.

One significant advantage of Bloom Filters is their portability and ease of use. Unlike tables, in-memory lists, or arrays, a Bloom Filter is simply a series of bits, making it easy to transfer between audits or processes during execution. This simplicity and efficiency make Bloom Filters a lightweight and effective choice

In the context of RAFM audits and controls, we can sometimes prioritize performance and efficiency over absolute accuracy. Therefore, we will compare several methods used for the lookup process. Specifically, we will evaluate the performance of Python Lists, Python Sets, an SQLite3 database, and a Bloom Filter in terms of efficiency (size) and performance (time). For this comparison, we will ignore the issue of Bloom Filter accuracy and later discuss how to mitigate its limitations in practical implementations.

We will simulate these four different approaches and compare their metrics.

Preparing the Dataset

?First we will prepare a 1 Million MSISDN in plain text file,

The text file size is 15 M-bytes, and now we will convert this MSISDNs to Bloom Filter,

Deploying Bloom Filter

Below is how we to deploy the bloom filter in Python,

There are multiple implementations of Bloom Filters available in Python. For this simulation, we will use the most popular repository.

There are two important variables in this implementation: capacity and error_rate. These variables allow us to mitigate or reduce false positive rates. After hashing the 1 million MSISDNs and pickling this instance to a text file, we obtain a 3.5 MB hashed instance.

Testing Bloom Filter with 10 million Lookup process

Below is a snippet of the deployed code, where we perform a lookup for a randomly generated number to check whether it is in the Bloom Filter or not

Doing the same for List, Set and SQlite3 DB (table and index), and we will obtain the following result,

Please find the full repository at https://github.com/abdofrea/rafm_lookup

In terms of size, the Bloom Filter outperforms all other types of data representations. However, it falls behind Python Sets in terms of performance. This highlights the importance of engineering judgment in choosing between a Bloom Filter, a Set, or any other data structure, depending on the specific circumstances or use case.

While there are hundreds of data structures available, each with its own advantages and limitations, it is impractical to cover them all. Here, we have focused on a common process within the RAFM domain, where a lightweight lookup table with reasonable performance and memory efficiency is required. In such cases, the Bloom Filter proves to be a very handy choice.


Abdulwahed Freaa

要查看或添加评论,请登录

Abdalwahed Frea的更多文章

  • Telecom CDRs Data Serialization (with Simulation)

    Telecom CDRs Data Serialization (with Simulation)

    Part I (CDRs) In the realm of the telecom industry, any event taking place within a node or system is commonly referred…

    8 条评论
  • Revenue Assurance & Python

    Revenue Assurance & Python

    Throughout my career as a Revenue Assurance Engineer, I've learned that there is one thing for sure, which is there no…

    2 条评论

社区洞察

其他会员也浏览了