Decoding the Data Platform: Latest thoughts from the field....
Bartlomiej Graczyk
Chief Technology Architect | Tech Evangelist & Public Speaker | Cloud Data, Analytics and AI | MVP Data Platform | Demystifying complexity of data solutions in the cloud | Helping others to grow in tech roles |
As I wrapped up my previous article, I had a clear roadmap for the next one.
The plan was to delve into the intricacies of data platforms, data lakes, data warehouses, and Lakehouses. However, I realized there’s a crucial piece of the puzzle we need to address first, before we enter the arcana of technology and the ubiquitous implementation concepts.
This missing piece of the puzzle is the model in which the data platform is built and implemented in an organization.
Centralized vs. Decentralized Data Platforms - let's make sure we speak the same language
The idea of a central data platform, serving as the nexus for all operational, reporting, and analytical data within an organization, is a well-established concept. However, the tide is turning. Recently, we've witnessed a paradigm shift towards decentralized, distributed, or federated models. This evolution is not limited to the technological infrastructure; it's also revolutionizing the operational frameworks of the enterprises adopting them.
To clarify, my mention of a decentralized model doesn't imply fragmenting the data platform into various services for distinct functions like storage or computation. Rather, it's about leveraging these components as modular platforms to cater to the specific needs of individual entities within large capital groups or distinct domains within an organization, such as marketing or supply chain. This approach leads to a network of connected platforms that either function independently or as part of a broader ecosystem—a federated organizational data platform. In essence, we move from a single central data hub to a series of smaller, federated implementations.
With the comprehensive introduction complete, I trust we've established a mutual understanding. Now, let's delve deeper into the previously discussed concepts, examining their benefits and drawbacks, and most importantly, their influence on organizations.
The old, well-known central model
It's hardly surprising that the first model we deconstruct into prime factors is the central one. This method harks back to when a data platform was entirely conceived as a monolithic solution. Constructed and deployed as a component of a cohesive technology stack, such as Microsoft SQL Server, it transitioned from a relational database with an analytical and reporting engine to embracing the realm of less structured data through a Big Data Cluster (indeed, such a thing existed).
Centralization means efficiency
The central model, which has been the standard for many years, offers several benefits. It ensures uniform data standards, governance, and security. Because all of the organization's data flows into a centralized repository, every stage of ingestion, whether batch or stream, can be processed in exactly the same way, following strict standards defined and enforced at the platform level. Furthermore, since everything is centralized, there is by default better visibility into the data itself: discoverability and governance can be improved, and security measures such as access levels, encryption, and classification can become an integral part of every process.
This pays off both in resource utilization (process consolidation and capacity sharing), which translates directly into cost optimization by reducing unused computing power and storage, and in unifying the work of maintaining, expanding, and improving the platform itself. All in all, this approach is efficient on both fronts.
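To make the "same standards at every stage of ingestion" idea concrete, here is a minimal sketch in Python. Everything in it is illustrative, not taken from any specific product: the `ClassifiedRecord` type, the `ingest` gate, and the `PII_FIELDS` classification rule are assumptions showing how a central platform could apply identical validation and classification to batch and stream records alike.

```python
from dataclasses import dataclass, field

# Hypothetical platform-level classification rule: which fields count as PII.
PII_FIELDS = {"email", "phone"}

@dataclass
class ClassifiedRecord:
    payload: dict
    source: str                      # e.g. "batch:crm" or "stream:clickstream"
    classification: set = field(default_factory=set)

def ingest(payload: dict, source: str) -> ClassifiedRecord:
    """Apply the same validation and classification to every record,
    regardless of whether it arrived in a batch or on a stream."""
    if not payload:
        raise ValueError("empty payload rejected by platform standard")
    # Tag any PII fields so downstream security (access levels, encryption)
    # can act on the classification uniformly.
    tags = {f for f in payload if f in PII_FIELDS}
    return ClassifiedRecord(payload=payload, source=source, classification=tags)

# Batch and stream records pass through the identical gate:
batch_rec = ingest({"email": "a@b.com", "amount": 10}, "batch:crm")
stream_rec = ingest({"page": "/home"}, "stream:clickstream")
```

The point is not the code itself but the shape: one enforcement point, so classification and governance cannot be skipped by any ingestion path.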
The sun doesn't always shine
However, the central model isn't without its drawbacks. For large organizations, centralizing all their data can be a significant challenge. Consider a company operating across a wide geographical dispersion: centralizing its data is not only a considerable challenge of transferring everything to one place, but also of finding such a place at all, since restrictions on moving data outside a specific region (e.g. under GDPR) can be a significant complication. In addition, companies with such a large footprint often operate as a group of independent entities, each with considerable autonomy over its broadly understood IT environment, which opens up another stream of considerations, conversations, and potential work to address. A further dimension to analyze when building a platform in this model is the need to centralize the people available for the platform itself, who may soon become a real bottleneck for the entire project. And even where they don't, such a central team can find it quite a challenge to stay well connected to the business and to address its daily needs.
Contrary to what it might seem, the central data platform indeed has the potential to maintain its relevance within any organization. For companies with a less extensive structure or a highly centralized operating model, adopting a platform structure can be very beneficial. Alternatively, a top-down approach to consolidating IT solutions can significantly influence the development of a data platform. Caution is advised here: if decisions regarding data are made exclusively from an IT perspective, it could be a serious warning sign of potential obstacles in the organization's effective transition to a data-driven model.
The Decentralized Model: A New Approach?
So, what's the alternative? On the other end of the spectrum lies the vision of complete decentralization: each department, country, or domain constructs its own data platform, leading to a federated solution in which every part operates independently. Taken to this extreme, we lose control over data flow. Resource efficiency disappears, data availability and discoverability become virtually impossible, and data quality degrades into a significant data mess.
Indeed, opting for decentralization necessitates the establishment of certain rules, guidelines, and even standards and blueprints for executing specific tasks, such as the methods of data processing and sharing, the allowable technological stack, or the adoption of standards for describing the developed solutions or even data products.
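One way to picture such rules and blueprints is as a declarative contract that every domain platform is checked against. The sketch below is a hypothetical illustration, assuming invented names (`BLUEPRINT`, `allowed_stack`, `required_metadata`) and an invented set of technologies; real federated governance would of course be richer than a dictionary.

```python
# Hypothetical federated "blueprint": central guidelines every domain platform
# must satisfy, while implementation details stay local to the domain.
BLUEPRINT = {
    "allowed_stack": {"spark", "kafka", "delta-lake"},
    "required_metadata": {"owner", "domain", "sla", "classification"},
}

def validate_platform(descriptor: dict, blueprint: dict = BLUEPRINT) -> list:
    """Return a list of violations; an empty list means the domain complies."""
    violations = []
    # Check the domain's chosen technology stack against the allowed one.
    illegal = set(descriptor.get("stack", [])) - blueprint["allowed_stack"]
    if illegal:
        violations.append(f"disallowed technologies: {sorted(illegal)}")
    # Check that the solution is described to the agreed standard.
    missing = blueprint["required_metadata"] - descriptor.get("metadata", {}).keys()
    if missing:
        violations.append(f"missing metadata: {sorted(missing)}")
    return violations

# A compliant, illustrative domain platform descriptor:
marketing = {
    "stack": ["spark", "kafka"],
    "metadata": {"owner": "marketing-data", "domain": "marketing",
                 "sla": "daily", "classification": "internal"},
}
issues = validate_platform(marketing)
```

The design choice worth noticing: the blueprint constrains *what* must hold (stack boundaries, description standards), not *how* each domain builds within them, which is exactly the flexibility-within-a-framework that federation promises.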
The metaphor of an umbrella over the whole solution aptly describes a federated system: flexibility within a defined framework, one that must be clearly delineated and verified. To me, it resembles the governance of entities known as "United ....", where some grant considerable liberalism and autonomy to their subunits, while others, despite a broad distribution, exert strong central influence. Although this is just one illustration (perhaps not the most fitting), it helps to conceptualize the primary challenges and benefits inherent in such an approach to deploying a data platform.
In my view, the federated model effectively tailors data platform solutions to an organization's needs and its distinct areas. Not all areas within an organization are alike. Some require more capabilities, such as processing data in near-real-time as well as in a traditional batch model. Others may not be ready and are content with the standard daily data load, while some areas need both methods and are also prepared to implement advanced analytics on the streaming data.
Maintaining data quality in such a model can be achieved by introducing the concept of a data product. This product would be well-defined and have a dedicated manufacturer concerned with its quality and brand. However, that is a subject for another article.
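Without preempting that future article, a data product can at minimum be thought of as a dataset with a named owner accountable for its quality. The descriptor below is a minimal sketch under that assumption; the field names (`freshness_sla`, `quality_checks`, and so on) are illustrative, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProduct:
    """An illustrative data-product descriptor: a well-defined dataset
    with a dedicated 'manufacturer' accountable for quality and brand."""
    name: str
    owner: str              # the team answerable for quality
    schema_version: str
    freshness_sla: str      # e.g. "daily" or "near-real-time"
    quality_checks: tuple   # checks the owner guarantees to enforce

orders = DataProduct(
    name="sales.orders",
    owner="supply-chain-team",
    schema_version="2.1",
    freshness_sla="daily",
    quality_checks=("not_null:order_id", "unique:order_id"),
)
```

Making the descriptor frozen reflects the idea that a published data product is a contract: consumers rely on it, so changes arrive as new versions rather than silent mutations.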
Conclusion
Upon reading this article, it should be clear how crucial it is to tailor the development of a data platform to the business requirements and operational framework of the organization. Doing so allows the organization to grasp the full scope of what constructing a data platform involves, the changes essential for its successful deployment, and the risks that may emerge along the way.