Analysis of challenges and differences in the implementation of Big Data projects from an analytical perspective.
Good morning, Paweł Matławski. First of all, thank you for accepting our invitation and agreeing to participate in our podcast. You are the only representative from SofixIt featured in the entire series, so I hope that by showcasing our projects and our approach to them, the listeners will get to know us and our values better.
Paweł Matławski
Hi, thanks for inviting me. I’m happy to represent us in this podcast.
Paweł, let’s start with the basics. Let’s answer the question of what Big Data actually is, but let’s do it in a practical way. I’d like to ask you to present to our listeners a real-life project from the broad area of data.
I think one of the most common examples we’re currently working on is telecom data, meaning data related to mobile networks, the internet, and online TV. I could mention a subset of this data that concerns how well the network is performing. This is known as Performance Management data, which tells us whether the network has delays, whether it’s dropping packets, and how well the hardware is functioning, that is, how well the equipment and the cells managing connections are working. Where is this useful? Telecom companies sign agreements with their customers to provide telecom services. These agreements include quality commitments, meaning customers can expect, for example, low delays and stable internet service. Typically, these commitments are tied to certain percentages, known as an SLA (Service Level Agreement), for example guaranteeing that the network meets a specific quality level 95% of the time.
It’s one thing to promise this, but you also need to be able to prove it. That’s where this data comes in—Performance Management Data allows us to assess how individual cells are performing, such as whether there are delays, how many packets are being sent, and which cells are under heavy or light load. It can even identify which cells could be turned off without impacting service. Importantly, we’re not dealing with customer data here; we’re focusing strictly on how the system is functioning.
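To make the SLA idea concrete, here is a minimal sketch in Python. It assumes one pass/fail quality flag per 15-minute interval; the data shape, the helper name, and the threshold handling are illustrative, not the project’s actual model.

```python
# Minimal SLA-compliance check over Performance Management samples.
# Assumes one boolean "meets quality target" flag per 15-minute interval;
# the data shape is an illustrative stand-in, not the real PM model.

def sla_compliance(samples: list[bool], target: float = 0.95) -> tuple[float, bool]:
    """Return (fraction of intervals meeting quality, whether the SLA is met)."""
    if not samples:
        return 0.0, False
    ratio = sum(samples) / len(samples)
    return ratio, ratio >= target

# Example: one day of 96 intervals, 4 of which missed the latency target.
day = [True] * 92 + [False] * 4
ratio, ok = sla_compliance(day)
print(f"{ratio:.1%} of intervals met the target; SLA satisfied: {ok}")
```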
This data is gathered by special units that aggregate information from specific areas. For example, here in Wrocław, there are several devices collecting data from antennas and cells, which is then transported to a server. We process this data to create a tabular format suitable for analysis. Ultimately, we want to answer whether a given customer experienced a good-quality connection on a specific day.
Of course, you could take a reactive approach—wait for a frustrated customer to report an issue—but telecom companies aim to be proactive. So, we process this data, which usually comes in XML format, and transform it into an analyzable structure. These data points are aggregated and fed into visualization tools.
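A minimal sketch of that XML-to-table step, assuming a heavily simplified measurement file; real PM XML (for example, 3GPP-style measurement files) is far more nested, and every element and attribute name here is hypothetical.

```python
# Sketch: flatten a simplified PM XML file into rows ready for analysis.
# Element and attribute names are hypothetical; real telecom PM files
# are considerably more nested than this.
import xml.etree.ElementTree as ET

SAMPLE = """
<measurements>
  <cell id="WRO-0001">
    <interval start="2024-01-01T08:00" latency_ms="21" packets_dropped="3"/>
    <interval start="2024-01-01T08:15" latency_ms="19" packets_dropped="0"/>
  </cell>
</measurements>
"""

rows = []
for cell in ET.fromstring(SAMPLE).iter("cell"):
    for iv in cell.iter("interval"):
        rows.append({
            "cell_id": cell.get("id"),
            "interval_start": iv.get("start"),
            "latency_ms": int(iv.get("latency_ms")),
            "packets_dropped": int(iv.get("packets_dropped")),
        })

print(rows)  # tabular records, ready for a DataFrame or a warehouse load
```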
When it comes to volume, a single cell sends a 100-kilobyte file every 15 minutes, 96 times a day. This can be set to an hourly interval, but vendors typically opt for 15 minutes. In Sweden, where we currently manage a network, there are around 40,000 cells nationwide. If you multiply these figures, you get over 3 terabytes of data daily. This is an approximation, of course, but it illustrates the massive scale of these projects in the data domain.
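A quick back-of-the-envelope check of those figures, using only the numbers quoted above; anything beyond the bare multiplication is an assumption.

```python
# Volume check using the figures quoted above.
file_kb = 100           # one PM file per cell, every 15 minutes
intervals_per_day = 96
cells = 40_000          # roughly the Swedish network mentioned above

daily_gb = file_kb * intervals_per_day * cells / 1_000_000  # KB -> GB
print(f"{daily_gb:.0f} GB/day, i.e. {daily_gb / 1000:.2f} TB/day")
# -> 384 GB/day, i.e. 0.38 TB/day

# The bare multiplication lands well below the 3+ TB quoted above, so that
# figure presumably also covers other measurement families and sources;
# that is an assumption, not something stated in the episode.
```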
Do you think data projects differ significantly from other IT projects? What are the distinctive features of Big Data projects?
One key feature is the large-scale infrastructure involved. Often, there are numerous dependent systems that either consume or supply data. These systems are built over many years, often carrying technical debt, yet they are critical to the overall ecosystem. Clients are highly invested in integrating with them.
Another distinction is the extensive documentation. Data such as telecom data comes with detailed documentation prepared by leading industry players. This documentation is comprehensive and complex, which adds to the workload. Additionally, these projects often involve niche solutions, like ILUM, which we use in our products.
ILUM is a Spark-on-Kubernetes solution designed for ease of use—it’s a ready-made component that eliminates the need for custom setup and management. It’s an example of highly specialized tools that handle a specific slice of the project pie.
Another challenge is the variety and complexity of data transformations. We don’t have a single data type processed the same way every time. Transformations are often bespoke: they serve needs that recur, but in ways that can’t be standardized. This requires analysts to have specialized knowledge and a knack for uncovering information, often from past implementations or by reaching out to colleagues. It’s another hallmark of Big Data projects.
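As an illustration of one such transformation, here is a sketch of a typical roll-up in PySpark: 15-minute counters aggregated into daily per-cell statistics. The column names and storage paths are hypothetical, not the project’s actual schema.

```python
# Sketch of a typical PM roll-up: 15-minute counters aggregated to daily
# per-cell statistics. Column names and paths are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pm-daily-rollup").getOrCreate()

pm = spark.read.parquet("s3a://pm-data/normalized/")  # hypothetical input path

daily = (
    pm.withColumn("day", F.to_date("interval_start"))
      .groupBy("cell_id", "day")
      .agg(
          F.avg("latency_ms").alias("avg_latency_ms"),
          F.sum("packets_dropped").alias("dropped_total"),
          F.count("*").alias("intervals_reported"),  # 96 expected per day
      )
)

daily.write.mode("overwrite").partitionBy("day").parquet("s3a://pm-data/daily/")
```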
Clearly, specific expertise is needed. In your opinion, what are the main challenges associated with acquiring, storing, and processing large datasets in such projects?
The classic challenge is the lack of good examples. Often, clients provide us with a sample dataset containing one, two, or three specific cases.
Later, when the system begins processing actual production data, new cases emerge—ones that are difficult to extract beforehand because clients themselves may not be aware of them. We have to be prepared for the system to be used in slightly different ways or to quickly adapt to unexpected scenarios.
Another challenge often beyond our control is infrastructure. We might discuss files, solutions, or products with the client, only to find out that someone else decided on the infrastructure—for instance, the type of servers and their computational power. Naturally, we want more capacity, while clients want to minimize costs. It’s a clash of priorities where a middle ground must be found.
Clients also often prefer to remain recipients of the data rather than active participants in the processing pipeline.
The complexity of the tools and transformations we implement, described in dedicated syntax, can lead clients to delegate the work entirely to us. Essentially, we develop software for ourselves—solutions that meet the client’s needs but may not be directly operated by them. Occasionally, we do encounter clients who want to use and understand the system, but they’re in the minority.
A further challenge in data storage is planning for the system’s long-term requirements. For example, in one project, we calculated that the client’s data retention requirements—storing archived data in an aggregated format—would need 116 terabytes. This excludes raw data, which, as I mentioned earlier, would significantly exceed this volume if stored long-term. Aggregating and condensing data is more cost-effective, but 116 terabytes is still a substantial resource to manage.
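The sizing behind estimates like that is simple multiplication. The sketch below shows the method only; the daily volume and retention period are illustrative stand-ins chosen to land near the quoted figure, not the actual project parameters.

```python
# Illustrative retention sizing. The inputs are stand-in values that show
# the method, not the real parameters behind the 116 TB estimate.
daily_aggregated_gb = 106  # aggregated (not raw) output per day; assumed
retention_years = 3        # assumed retention requirement

total_tb = daily_aggregated_gb * 365 * retention_years / 1000
print(f"~{total_tb:.0f} TB of archived aggregates")  # ~116 TB at these inputs
```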
There’s also the question of backups and data retrieval, which are not always straightforward. All of this affects system performance, responsiveness, and user experience. Everyone wants systems to be fast and stable, but that’s not always feasible. Sometimes, issues arise—like overly large queries causing failures—and we need to have procedures in place to address these situations.
The success of any project depends on the people involved, their skills, and their experience—from project managers and analysts to the entire team. Could you share what skills, particularly in analytics, are essential for executing Big Data projects?
From my perspective, as I mentioned earlier, documentation is a major factor, so analytical skills like document analysis are crucial. Analysts must be able to sift through vast amounts of information, such as Excel files, and distill meaningful insights.
Additionally, understanding data interfaces and using tools like data flow diagrams to visualize data exchanges and transformations is important. While the volume of data can make comprehensive understanding challenging, these tools help in piecing together the bigger picture.
On the technical side, mockups are highly useful. By mockups, I mean both interface design and sample process designs for data transformation.
Sometimes, when we want to illustrate more complex cases, we need to create or simulate them ourselves to understand system behavior.
From a soft skills perspective, quick learning is vital. Analysts need to quickly adapt to new topics, understand client tools, and align with their expectations. Attention to detail is another critical skill. For example, when designing a data processing pipeline for a client, we identified an issue with colons in file names early on. Addressing this in the analysis phase helped avoid problems in production.
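The colon problem is the kind of issue a cheap validation step can catch before production. A minimal sketch, with a hypothetical naming rule:

```python
# Minimal early validation of incoming file names. The allowed pattern is
# hypothetical; the point is catching issues like the colons mentioned
# above before files reach production processing.
import re

SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")

def check_filename(name: str) -> None:
    if not SAFE_NAME.fullmatch(name):
        bad = sorted(set(re.sub(r"[A-Za-z0-9._-]", "", name)))
        raise ValueError(f"unsafe characters {bad} in file name: {name!r}")

for name in ("A20240101.0800-0815_cell0001.xml",
             "A20240101.0800:0815_cell0001.xml"):
    try:
        check_filename(name)
        print(f"ok:   {name}")
    except ValueError as exc:
        print(f"fail: {exc}")
```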
Lastly, communication is essential. Knowledge is often dispersed among various stakeholders, especially in telecom data projects. Effective communication helps gather, manage, and later retrieve critical insights.
What tools and technologies support analysts in Big Data projects?
From our perspective, Kubernetes is a key technology. While a basic understanding suffices, it’s invaluable for comprehending both our systems and the client’s infrastructure. Grafana is another fantastic tool for monitoring, though it requires technical skills, which means relying on developers for setup.
For interface design, I use Figma. For analyzing project data, database browsers like DBeaver are indispensable. Sometimes, tools for handling complex data formats like Parquet or Avro are necessary, though these are command-line-based and quite specialized. Occasionally, I use Enterprise Architect for modeling diagrams, such as communication or sequence diagrams.
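For Parquet specifically, a few lines of pyarrow can stand in for those command-line tools when inspecting a file; the path below is a placeholder.

```python
# Quick Parquet inspection without dedicated command-line tools.
# The file path is a placeholder.
import pyarrow.parquet as pq

pf = pq.ParquetFile("pm_daily.parquet")
print(pf.schema_arrow)                  # column names and types
print(pf.metadata.num_rows, "rows in", pf.metadata.num_row_groups, "row groups")
print(pf.read_row_group(0).to_pandas().head())  # peek at the first rows
```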
It’s clear you’re an expert in this area, so let me ask a slightly provocative question: what surprising challenges have you encountered in data-related IT products?
One surprising aspect is that we often create products the client doesn’t want to interact with. The client is primarily interested in the data output, not the intricacies of how it’s processed. They care about having data in a usable format, but the processing itself is largely delegated to us. This dynamic reminds me of a platform like Tinder. We aim to deliver a service—data processing—so seamless that clients don’t need to return to us once it’s set up.
If the system works perfectly, we won’t feel bad if the client never contacts us again, except to make occasional adjustments.
What direction do you foresee for future changes in data-related products?
Fewer people will be involved in data processing. Automation is increasing in many areas, and this will extend to data projects. For instance, we’ve implemented solutions that allow the processing of cellular-network data to adapt automatically to new data formats.
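Automatic adaptation usually starts with detecting that the format changed at all. A minimal schema-drift check, with hypothetical field names:

```python
# Minimal schema-drift check: compare an incoming file's columns against
# the expected schema and report what changed. Field names are hypothetical.
expected = {"cell_id": "string", "interval_start": "timestamp", "latency_ms": "int"}
incoming = {"cell_id": "string", "interval_start": "timestamp",
            "latency_ms": "int", "jitter_ms": "int"}  # a newly added counter

added = incoming.keys() - expected.keys()
removed = expected.keys() - incoming.keys()
retyped = {k for k in expected.keys() & incoming.keys() if expected[k] != incoming[k]}

if added or removed or retyped:
    print(f"schema drift: added={sorted(added)}, "
          f"removed={sorted(removed)}, retyped={sorted(retyped)}")
```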
Another trend is focusing on the quality of data rather than its quantity. As we gather less data, ensuring its value and consistency becomes critical.
Thank you, Paweł, for the fascinating discussion and for sharing your knowledge and experience in managing and analyzing data projects. I’m sure our listeners found the conversation engaging.
Thank you for the conversation. It was a pleasure.