The (Modern) Big Data Platform

Disclaimer: The views expressed here are mine alone and do not necessarily reflect the view of my current, former, or future employers.

Although technology constantly changes, we can largely agree that data volumes continue to grow and that data has become one of an organization's most valuable assets. I previously stated, "Organizations that effectively utilize their data will be the ones to have a competitive advantage moving to the future of the big data era." This statement continues to hold true, but recent capabilities offer more effective ways to obtain value from data. In the previous article, I argued that the Enterprise Data Warehouse should be augmented with a Data Lake based on distributed technology, such as Hadoop, to support expanding big data use cases for data science, Machine Learning (ML), and Artificial Intelligence (AI). This post aims to provide updates on the ever-changing landscape of the Modern Big Data Platform.

Let's quickly review the need for a Data Warehouse complemented with a Data Lake architecture. With this design, the Data Warehouse continues to be the workhorse for structured reporting and analysis, supporting complex structured data and handling some unstructured data. There were a few gaps with this pattern:

  • New and varied data types are produced at high volumes, straining the Warehouse's flexibility
  • Warehouse capacity and sizing struggle to keep pace with growing data volumes
  • Data scientists need access to uncleansed, raw data far more quickly than a highly structured environment allows, increasing the demand for shadow IT

Data Lakes offer a way to support the three areas listed above without losing the capabilities of the legacy environment. This approach yields a low barrier of entry and low initial cost to add big data and analytical workloads to the enterprise. Data scientists become empowered with varied data for data discovery and AI/ML use cases at greater speed and scale than a Data Warehouse alone provides.

Example Legacy Big Data Reference Architecture (2018)

Although this legacy approach provides quick wins to support expanding use cases, organizations must make strategic choices with data, specifically tradeoffs for data management, sizing, and accessibility. Below I will describe the challenges with these tradeoffs and provide an updated view of Big Data for 2022.

  • Governance and Management – In the bifurcated architecture, technical leaders must decide where data is stored and who has access to which data assets. Because there are two platforms serving different purposes, these decisions create data governance and management challenges. Hadoop platforms have also become overly complex to operate, with constant patching, software upgrades, and package version management. To effectively manage the environment, several questions require answering: Do we put data in the data warehouse for managed reporting, provide it to data scientists in the lake for analysis, or both? Who should have access to the data within the lake, and how do we manage data security? How do we ensure that newly created data assets are appropriately governed? What happens to our code when we upgrade to the latest version? These questions force unfavorable tradeoffs that increase cost and complexity.
  • Scalability – Once scoped workloads are up and running, they typically perform well against the initial requirements and assumptions. Challenges begin when new use cases and data are introduced. For example, consider an organization that starts capturing IT security logs as a new workload. If not initially scoped, it may add terabytes of data and require purchasing capacity in the form of data nodes for the existing cluster. The new IT security use case also requires compute resources that will likely contend with current workloads, causing jobs to take longer than they did initially (or even fail).
  • Data Sharing – Moving data within an organization can be very complex, and even more so outside it, whether for regulatory, external reporting, or other reasons. Moving data between the Data Warehouse and the Lake alone requires a data pipeline that converts structured data to unstructured, or vice versa, which isn't generally difficult but is often time-consuming. This delay derails efforts to reduce shadow IT, given how long it takes to provide data scientists and other users with data. When data moves outside the analytical platform, users often don't know where to find what they are seeking (see Governance and Management above). Finding the correct version of the data, in the proper format, quickly enough to use it can be like searching for a needle in a haystack. Sharing data inside or outside an organization carries its own business and technical complexities; suffice it to say, many organizations still rely on technology from the 1970s (FTP) to move and share data with one another.

Fast-forward to 2022: I find this architecture adds technical debt, becomes a challenge to manage and scale, and fails to deliver on the promised long-term benefits. Enter the Cloud, which unlocks value by removing the choice between Data Warehouse and Data Lake. I ended the previous article with, "as we continue to obtain more and more data, I believe the Data Warehouse and Data Lake will become more and more blurred." Today the Data Warehouse and Data Lake have converged in the form of the Data Cloud. Cloud data platforms have existed for many years, but most were lifted and shifted rather than natively architected for the cloud, reducing their effectiveness.

Data Cloud Architecture. Source: https://www.snowflake.com/blog/beyond-modern-data-architecture/

The Data Cloud provides a single copy of data under one platform, removing data silos. Data can take many forms, be it structured, semi-structured, or unstructured, giving users greater value at a lower total cost of ownership (TCO) than the overhead of managing and maintaining multiple platforms. In addition, the Data Cloud provides an always-connected, continuously updated platform with no upgrades or patching required.

The architecture allows you to scale storage and compute independently and nearly without limit, so you don't need to worry about degradation or slowed performance over time. Need more horsepower? No problem: click a button, and capacity is available within seconds. No longer want to pay for the extra compute or storage behind a throw-away analysis? Stop the compute and delete the data. It's that simple.
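As a concrete illustration, on a Snowflake-style Data Cloud the "click a button" resize is a single SQL statement. The sketch below only builds that statement; the warehouse name and the set of size keywords are illustrative assumptions, so check your platform's documentation for the exact values it accepts.

```python
# Build an ALTER WAREHOUSE statement for a Snowflake-style platform.
# The warehouse name and size keywords below are illustrative assumptions.

VALID_SIZES = {"XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE"}

def resize_statement(warehouse: str, size: str) -> str:
    """Return the SQL that scales a virtual warehouse up or down."""
    size = size.upper()
    if size not in VALID_SIZES:
        raise ValueError(f"unsupported warehouse size: {size}")
    return f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{size}'"

# Scale up for a heavy job, then back down when it finishes:
print(resize_statement("analytics_wh", "xlarge"))
print(resize_statement("analytics_wh", "xsmall"))
```

Because compute is billed while it runs, scaling back down (or suspending the warehouse) is what makes the throw-away analysis cheap.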

The Data Cloud provides native ways for organizations to clone and share data within an environment without moving the actual data. Cloning gives developers, data scientists, and other users almost instant access to data, whatever form it takes, in a secured and governed manner. In Deloitte's 2022 Tech Trends, the authors state, "During the next 18 to 24 months, we expect to see more organizations explore opportunities to create seamless, secure data-sharing capabilities that can help them monetize their own information assets and accomplish business goals using other people's data." The Data Cloud also lets you securely share data assets within your organization or beyond it, and quickly obtain new data to enhance your own. Imagine needing COVID-19 data for demand planning, then sharing the results with your management team and suppliers, all with the press of a button.
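The near-instant clone works because only metadata is copied: the clone points at the same immutable storage files as the original, and later writes add new files (copy-on-write). Here is a minimal toy sketch of that idea; the class and file names are invented for illustration and are not any vendor's actual API.

```python
# Toy model of zero-copy cloning: a table is a list of pointers to
# immutable data files. Cloning copies the pointer list (metadata),
# not the files, so it is near-instant regardless of data size.
# Writes after the clone land in new files, leaving the original untouched.

class Table:
    def __init__(self, name, files=None):
        self.name = name
        self.files = list(files or [])  # metadata: pointers to immutable files

    def clone(self, new_name):
        # Copy only the metadata; the underlying files are shared.
        return Table(new_name, self.files)

    def append(self, new_file):
        # Copy-on-write: new data goes into a new file.
        self.files = self.files + [new_file]

sales = Table("sales", ["f1.parquet", "f2.parquet"])
dev = sales.clone("sales_dev")   # instant: shares f1 and f2
dev.append("f3.parquet")         # only the clone sees f3
print(sales.files)  # ['f1.parquet', 'f2.parquet']
print(dev.files)    # ['f1.parquet', 'f2.parquet', 'f3.parquet']
```

The same mechanism is what makes sharing cheap: a consumer account reads the provider's files through shared metadata rather than receiving a physical copy.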

I previously suggested modernizing your data architecture by adding a Data Lake to your existing environment, expanding capabilities and supporting new and varied data and use cases. Users then had to choose where to put their data, creating additional silos and bottlenecks. That is simply no longer the case. The modern big data platform is here, and because the cloud delivers automatic updates and new features continuously, the investment in shifting to the Data Cloud will pay off, not just for today but throughout the ever-evolving data landscape, providing organizations with a future-proof platform and a competitive advantage for years to come.
