Striking the Right Balance with the Modern Cloud Data Platform
SUMMARY: The cloud data platform must strike a balance on three dimensions: suite vs. extensibility, standardization vs. customization, and stability vs. agility. (Eckerson Group)
To understand the role of the cloud data platform, picture a seesaw with kids piling onto each side. It’s not easy to keep everyone in balance—and out of the dirt.
The cloud data platform is a combination of data warehouse and data lake elements that supports workloads ranging from business intelligence (BI) and data science to data-driven applications. Like a data warehouse, it queries and transforms data, delivers query outputs to BI tools, and secures and governs data usage. Like a data lake, it stores myriad types of data objects and applies data science models to those objects. Cloud data platforms, also known as data lakehouses, enable high performance and scalability on elastic cloud infrastructure.
The cloud data platform combines data warehouse and data lake elements to support business intelligence (BI), data science, and data-driven applications.
Needless to say, that’s a lot for one platform to support.
Data engineers, data analysts, data scientists, and developers all pile onto this seesaw. To support them all, the cloud data platform needs to strike a careful balance. First, it needs to combine the benefits of a software suite (all-in-one) and extensibility. Second, it needs to offer both standard and custom approaches. Third, it needs to give data teams both stability and agility.
This blog explores these three dichotomies to help data teams select a cloud data platform that satisfies their complex requirements. We start with suite vs. extensibility.
1. Suite vs. Extensibility
Suite
The cloud data platform should integrate the basic functionality of data management into a suite. Like a data warehouse, it should enable data engineers to build tables, validate records, and track lineage. Like a data lake, it should help apply multiple analytics methods to its wide-ranging datasets. It should execute structured query language (SQL) commands to retrieve data for BI tools, Python programs to run machine learning models, and Java code to feed data into mobile applications. And the suite continues to expand. Cloud data platforms now support the full lifecycle for machine learning projects, including data and feature engineering, model development, and model production.
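As a rough illustration, here is a minimal sketch of what "one platform, many workloads" can look like. It assumes a Spark-based lakehouse session (PySpark) serving both a SQL extract for a BI dashboard and a Python step that scores data with a previously trained model; the table, columns, and model file are hypothetical placeholders.

```python
# Hedged sketch: one Spark-based session serving both SQL (BI) and Python (ML) work.
# Table names, columns, and the model artifact are hypothetical placeholders.
from pyspark.sql import SparkSession
import joblib  # assumes a previously trained scikit-learn model saved to disk

spark = SparkSession.builder.appName("suite_example").getOrCreate()

# SQL path: retrieve an aggregate for a BI dashboard.
bi_extract = spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales.daily_orders
    GROUP BY region
""")
bi_extract.show()

# Python path: score the same platform's data with a machine learning model.
features = spark.table("sales.daily_orders").select("units", "discount").toPandas()
model = joblib.load("churn_model.pkl")            # hypothetical model artifact
features["score"] = model.predict(features)
```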
Extensibility
However, the cloud data platform cannot be all things to all people. It also must enable data teams to add functionality by integrating with an ecosystem of best of breed tools. This ecosystem includes a variety of commercial and open-source offerings. For example:
● Workflow tools such as Airflow orchestrate analytical and operational workflows (a brief example follows this list).
● Libraries such as PyTorch and TensorFlow provide AI/ML algorithms for data scientists to train.
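To make the orchestration point concrete, the following is a minimal Airflow sketch of a two-step ingest-then-transform workflow. The task bodies, DAG name, and schedule are hypothetical placeholders; the `schedule` argument shown assumes a recent Airflow 2.x release.

```python
# Hedged sketch of an Airflow DAG that orchestrates a simple two-step pipeline.
# Task bodies, DAG id, and schedule are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw files into cloud storage")    # placeholder for real ingest logic

def transform():
    print("run transformations on the platform")  # placeholder for real transform logic

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    ingest_task >> transform_task
```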
Data teams should select a cloud data platform whose data management suite still offers this level of extensibility.
Because it cannot be all things to all people, the cloud data platform must integrate with an ecosystem of best of breed tools.
2. Standardization vs. Customization
Standardization
The cloud data platform should help data teams standardize datasets, data pipelines, and consumption patterns to make data management more efficient. It helps standardize data by performing transformation tasks—for example, joining tables, filtering out unneeded data, converting file formats, or tagging semi-structured and unstructured data. It helps standardize pipelines by automating the configuration, execution, and monitoring of extract, transform, and load (ETL) commands, transformation (T) in particular. In addition, the cloud data platform helps create standard consumption patterns, using application programming interfaces (APIs) to deliver data to the BI tools, AI/ML models, and applications that consume it. In these ways, data teams get standard artifacts they can use, adapt, and reuse as they scale the business.
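A hedged sketch of a standardized transformation step, again assuming a Spark-based platform with hypothetical table and column names: join two tables, filter out unneeded rows, and land the result in a converted file format for downstream consumers.

```python
# Hedged sketch of a standardized transformation on a Spark-based platform:
# join two tables, filter unneeded data, and convert the result to Parquet.
# Table names, columns, and the output path are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("standard_transform").getOrCreate()

orders = spark.table("raw.orders")
customers = spark.table("raw.customers")

curated = (
    orders.join(customers, on="customer_id", how="inner")   # join tables
          .filter(F.col("status") != "cancelled")           # filter out unneeded data
          .select("order_id", "customer_id", "region", "amount")
)

# Convert the format and land the standardized dataset for BI tools and models.
curated.write.mode("overwrite").parquet("s3://curated-zone/orders/")
```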
Customization
However, the cloud data platform also needs to support customization. Expert data teams need to ingest and transform new datasets to address new use cases that the business demands. The cloud data platform must support custom-scripted pipelines that address these new use cases and deliver data for consumption by custom models or applications. To ensure they get this, data teams should select a cloud data platform that supports various programming languages, as well as development frameworks that help build, test, revise, deploy, and roll back different versions of pipeline code. They also need to maintain open access to the best of breed ecosystem.
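One way this plays out in practice: custom pipeline logic written as plain, testable code so it can be versioned, reviewed in CI, deployed, and rolled back. The sketch below is illustrative only; the function, columns, and test values are hypothetical.

```python
# Hedged sketch: custom pipeline logic packaged as a testable function so it can be
# versioned, tested in CI, deployed, and rolled back. Names are hypothetical.
import pandas as pd

def enrich_orders(orders: pd.DataFrame, fx_rates: pd.DataFrame) -> pd.DataFrame:
    """Custom transformation for a new use case: convert order amounts to USD."""
    enriched = orders.merge(fx_rates, on="currency", how="left")
    enriched["amount_usd"] = enriched["amount"] * enriched["rate_to_usd"]
    return enriched

def test_enrich_orders():
    """Unit test that runs before a new pipeline version is deployed."""
    orders = pd.DataFrame({"amount": [10.0], "currency": ["EUR"]})
    fx_rates = pd.DataFrame({"currency": ["EUR"], "rate_to_usd": [2.0]})  # toy rate
    result = enrich_orders(orders, fx_rates)
    assert result.loc[0, "amount_usd"] == 20.0
```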
The cloud data platform helps data teams create standard artifacts they can use, adapt, and reuse. It also helps experts customize to address new use cases.
3. Stability vs. Agility
Stability
The cloud data platform needs to help data teams maintain stability. They must meet their service level agreements (SLAs), support chargeback, and ensure compliance. It should make it easy for cloud operations (CloudOps) engineers to configure virtualized cloud resources; monitor performance and capacity utilization; and spot and remediate issues. The cloud data platform can also expose its workings—in the form of logs, traces, metrics, and alerts—to observability tools that monitor and optimize performance across heterogeneous environments. By stabilizing operations in these ways, the cloud data platform makes workloads more predictable and less risky.
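For a sense of what "exposing its workings" can mean, here is a minimal sketch that wraps a query and emits a structured log record with its duration and status, which an external observability tool could collect and alert on. The logger name, event fields, and query are hypothetical.

```python
# Hedged sketch: expose query timing as structured logs that an external
# observability tool can collect and alert on. Field names are hypothetical.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("platform.observability")

def run_monitored_query(query_id: str, run_query) -> None:
    """Wrap a query call, then emit a structured log record with its duration."""
    start = time.monotonic()
    status = "success"
    try:
        run_query()
    except Exception:
        status = "error"
        raise
    finally:
        logger.info(json.dumps({
            "event": "query_completed",
            "query_id": query_id,
            "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
        }))

# Example usage with a placeholder query function.
run_monitored_query("daily_revenue", lambda: time.sleep(0.1))
```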
Agility
However, the cloud data platform still needs to help data teams stay agile. They need to help the business address urgent market demands, innovate, and release competitive enhancements. Data scientists might need to spin up a sandbox to explore new data, train a new ML model on that data, and deploy that model into production. Developers might need to build, test, and release—then monitor and revise—a new data-driven application that operationalizes the ML model’s outputs. Data teams should select a cloud data platform that supports fast actions and iterations like these.
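As a rough illustration of that fast iteration loop, the sketch below pulls a sample from a sandbox, trains and evaluates a candidate model with scikit-learn, and persists the artifact for whatever deployment process the platform provides. The dataset, columns, and paths are hypothetical.

```python
# Hedged sketch of a fast iteration loop: explore sandbox data, train a model,
# and persist the artifact for deployment. Dataset and paths are hypothetical.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Sandbox exploration: pull a sample of new data into a dataframe.
df = pd.read_parquet("s3://sandbox-zone/churn_sample.parquet")
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate a candidate model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Hand the artifact off to the platform's deployment process.
joblib.dump(model, "churn_model_v1.joblib")
```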
So, what does this look like in practice? Shell offers an instructive case study. The company uses the Databricks Lakehouse platform to integrate data management tasks and support a variety of standard BI and custom data science use cases. Its data scientists use best of breed tools to devise agile, custom models that predict inventory levels across the supply chain and recommend purchases to customers. Shell maintains stability while supporting agile projects to compete.
Cloud data platforms will continue to evolve and absorb functionality. As they do, data teams should evaluate and select those that strike the right balance, and keep them from toppling into the dirt.