
Google BigLake - Dataplex - BigQuery - Changing the Data Paradigm in a Multi-cloud World

Sandeep Patel Author, VC networker, Google AI researcher

Principal Data Science, AI and Analytics Architect leader at Google - Data AI-ML advocate

Published: December 12, 2022

Let me try to decode some of the strides Google has made in how data is handled in a multi-cloud, data-lake and siloed-data world. Most organizations’ data environments are diverse and increasingly distributed. They have data across data lakes, data warehouses, data marts and databases. Some of their data is on-premises, and some is in the cloud, often with multiple providers. And each of these systems has its own way of handling metadata, security and governance. For many of our prospects, this creates an operational nightmare.

Data volume, variety and velocity have continued to grow. Data is increasingly spread across various storage systems and clouds. Data has many more use cases (AI, BI, SQL, self-serve, data apps, streaming, event-driven workflows). Organizations today need the ability to distribute ownership of data across the teams that have the most business context, while ensuring that overall data lifecycle management and governance are applied consistently across their distributed data landscape.

[Image: Data complexity in a multi-cloud, multi-data-lake and multi-warehouse world]

Data Mesh - Data Fabric - Multi-cloud - How to address it all

Data lakes and warehouses have separate data management layers and serve specific use cases. Use cases, however, require data regardless of where it is stored. Three core Google platforms provide the backbone of a data mesh for customers' digital assets:

BigLake is a storage engine that unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control. BigLake provides accelerated query performance. It lets you store a single copy of data with uniform features across data warehouses and lakes, apply fine-grained access control and multi-cloud governance over distributed data, and integrate seamlessly with open source analytics tools and open data formats. It also integrates with Dataplex to provide management at scale, including logical data organization, centralized policy and metadata management, and quality and lifecycle management for consistency across distributed data.

Dataplex is an intelligent data fabric that enables you to centrally manage, monitor, and govern data across data lakes, data warehouses, and data marts, and make this data securely accessible to a variety of analytics and data science tools. With Dataplex, enterprises can easily delegate ownership, usage, and sharing of data to data owners who have the right business context, while still having a single pane of glass to consistently monitor and govern data across the various data domains in their organization. With built-in data intelligence, Dataplex automates data discovery, data lifecycle management, and data quality, improving data productivity and accelerating analytics agility.

With Dataplex you can logically organize your data and related artifacts, such as code, notebooks, and logs, into a Dataplex Lake, which represents a data domain. You can model all the data in a particular domain as a set of Dataplex Assets within a lake without physically moving data or storing it in a single storage system. Assets can refer to Cloud Storage buckets and BigQuery datasets stored in multiple Google Cloud projects, and can cover both analytics and operational data, structured and unstructured, that logically belongs to a single domain. Dataplex Zones enable you to group assets and add structure that captures key aspects of your data: its readiness, the workloads it is associated with, or the data products it serves. The lakes and data zones in Dataplex enable you to unify distributed data and organize it based on business context. This forms the foundation for managing metadata, setting up governance policies, monitoring data quality and so on, giving you the ability to manage your distributed data at scale.
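The lake, zone and asset hierarchy described above can be pictured with a small in-memory model. This is only an illustrative sketch in plain Python, not the Dataplex API: the class names, the `add_asset` and `resources` helpers, and the bucket and dataset identifiers are all hypothetical. The point it shows is that assets are references to storage that stays where it is, possibly in different projects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Asset:
    """A reference to existing storage; the data itself is never copied."""
    name: str
    resource: str  # hypothetical identifier for a bucket or dataset


@dataclass
class Zone:
    """Groups assets by readiness or workload, e.g. 'raw' vs 'curated'."""
    name: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Lake:
    """A data domain, such as 'sales'."""
    name: str
    zones: Dict[str, Zone] = field(default_factory=dict)

    def add_asset(self, zone_name: str, asset: Asset) -> None:
        # Create the zone on first use, then attach the asset to it.
        zone = self.zones.setdefault(zone_name, Zone(zone_name))
        zone.assets.append(asset)

    def resources(self) -> List[str]:
        # Flatten every zone's assets into one list of storage references.
        return [a.resource for z in self.zones.values() for a in z.assets]


# One domain spanning storage in two (hypothetical) projects.
sales = Lake("sales")
sales.add_asset("raw", Asset("landing-files", "gs://project-a-sales-landing"))
sales.add_asset("curated", Asset("orders", "bigquery://project-b/sales_curated"))
print(sales.resources())
```

Metadata, policies and quality checks can then be attached at the lake or zone level instead of per storage system, which is the organizing idea the paragraph describes.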


Customers can automatically discover metadata across data sources. Dataplex provides metadata management and cataloging that enables all members of the domain to easily search, browse and discover tables and filesets, as well as augment them with business and domain-specific semantics. Once data is added as assets, Dataplex automatically extracts the associated metadata and keeps it up to date as the data evolves. This metadata is made available for search, discovery, and enrichment via integration with Data Catalog.

How does BigLake take on the structured, unstructured and semi-structured data model and multi-cloud challenges?

BigLake is a storage engine that unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control. BigLake provides accelerated query performance across multi-cloud storage and open formats such as Apache Iceberg. BigLake eliminates the need to grant end users file-level access.

Enable interoperability of tools: The metadata curated by Dataplex is automatically made available as runtime metadata to power federated open source analytics via Apache SparkSQL, HiveQL, Presto, and so on. Compatible metadata is also automatically published as external tables in BigQuery to enable federated analytics via BigQuery.
Govern data at scale: Dataplex enables data administrators and stewards to manage their IAM data policies consistently and scalably to control data access across distributed data. It provides the ability to centrally govern data across domains while enabling autonomous and delegated ownership of data. It also provides the ability to manage reader/writer permissions on the domains and the underlying physical storage resources. Dataplex integrates with Stackdriver (now part of Google Cloud's operations suite) to provide observability, including audit logs, data metrics and logs.
Enable access to high-quality data: Dataplex provides built-in data quality rules that can automatically surface issues in your data. You can run these rules as data quality tasks across your data in BigQuery and GCS.
One-click data exploration: Dataplex gives data engineers, data scientists and data analysts a built-in, self-serve, serverless data exploration experience to interactively explore data and metadata, iteratively develop scripts, and deploy and monitor data management workloads. It provides content management across SQL scripts and Jupyter notebooks that makes it easy to create domain-specific code artifacts and share or schedule them from the same interface.
Data management: You can also leverage the built-in data management tasks that address common needs such as tiering, archiving or refining data. Dataplex integrates with Google Cloud’s native data tools such as Dataproc Serverless, Dataflow, Data Fusion, and BigQuery to provide an integrated data management platform.
With the collective of data, metadata, policies, code, interactive and production analytics infrastructure, and data monitoring, Dataplex delivers on the core value proposition of a data mesh: data as the product.
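To make the idea of data quality rules from the list above concrete, here is a minimal sketch in plain Python. It does not use Dataplex's rule syntax or API; the rule names, the `run_quality_checks` helper and the sample rows are hypothetical, and real data quality tasks would run against tables in BigQuery or files in GCS instead of an in-memory list.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rule = Callable[[Row], bool]

# Hypothetical rules: each returns True when a row passes the check.
RULES: Dict[str, Rule] = {
    "order_id_not_null": lambda r: r.get("order_id") is not None,
    "amount_non_negative": lambda r: r.get("amount") is not None and r["amount"] >= 0,
}


def run_quality_checks(rows: List[Row], rules: Dict[str, Rule]) -> Dict[str, List[int]]:
    """Return, for each rule, the indexes of the rows that violate it."""
    failures: Dict[str, List[int]] = {name: [] for name in rules}
    for index, row in enumerate(rows):
        for name, rule in rules.items():
            if not rule(row):
                failures[name].append(index)
    return failures


rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": None, "amount": 5.0},   # violates order_id_not_null
    {"order_id": 3, "amount": -1.0},     # violates amount_non_negative
]
report = run_quality_checks(rows, RULES)
print(report)
```

Keeping rules declarative and separate from the data they check is what allows the same rule set to be scheduled repeatedly and its failures surfaced automatically.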

How can the challenges of multi-cloud be addressed with this approach?

World of multi-cloud: Customers can create BigLake tables on Google Cloud Storage (GCS), Amazon S3 and Azure Data Lake Storage (ADLS) Gen2 over supported open file formats, such as Parquet, ORC and Avro. BigLake tables are a new type of external table that can be managed much like data warehouse tables. Administrators do not need to grant end users access to files in object stores; instead, they manage access at the table, row or column level. These tables can be created from a query engine of your choice, such as BigQuery or open-source engines using the BigLake connector. Once these tables are created, BigLake and BigQuery tables can be centrally discovered in Data Catalog and managed at scale using Dataplex.
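As a rough illustration of what creating a BigLake table and restricting it at the row level can look like from BigQuery, the sketch below builds the two DDL statements as strings. The statement shapes (`CREATE EXTERNAL TABLE ... WITH CONNECTION ... OPTIONS` and `CREATE ROW ACCESS POLICY ... GRANT TO ... FILTER USING`) follow BigQuery's SQL syntax as I understand it; the helper functions, project, dataset, connection, bucket and group names are all hypothetical, and nothing here is executed against BigQuery.

```python
from typing import List

# Formats named in the text above; other formats are out of scope for this sketch.
SUPPORTED_FORMATS = {"PARQUET", "ORC", "AVRO"}


def biglake_table_ddl(table: str, connection: str, file_format: str, uris: List[str]) -> str:
    """Build DDL for an external table that reads files through a connection."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {file_format}")
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{fmt}', uris = [{uri_list}]);"
    )


def row_access_policy_ddl(policy: str, table: str, grantee: str, filter_expr: str) -> str:
    """Build DDL that limits which rows a grantee can see in the table."""
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f"GRANT TO ('{grantee}')\n"
        f"FILTER USING ({filter_expr});"
    )


# Hypothetical identifiers, for illustration only.
table_ddl = biglake_table_ddl(
    "my-project.sales.orders",
    "my-project.us.lake-connection",
    "parquet",
    ["gs://my-bucket/orders/*.parquet"],
)
policy_ddl = row_access_policy_ddl(
    "us_only", "my-project.sales.orders", "group:us-analysts@example.com", "region = 'US'"
)
print(table_ddl)
print(policy_ddl)
```

End users are granted access to the table, while the connection's service account reads the underlying files, which is why administrators do not need to hand out object-store permissions.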

BigLake extends the BigQuery Storage API to object stores to help you build a multi-compute architecture. BigLake connectors are built on the BigQuery Storage API and enable Google Cloud Dataflow and open-source query engines (such as Spark, Trino, Presto and Hive) to query BigLake tables with security enforced. This eliminates the need to move data for each query engine's use case; security needs to be configured in only one place and is enforced everywhere.
Multi-compute analytics: Maintain a single copy of data and make it uniformly accessible across Google Cloud and open-source engines, including BigQuery, Vertex AI, Dataflow, Spark, Presto, Trino, and Hive, using BigLake connectors. Centrally manage security policies in one place, and have them consistently enforced across the query engines by the API interface built into the connectors.
Multi-cloud governance: Discover all BigLake tables, including those defined over Amazon S3 and Azure Data Lake Storage Gen2, in Data Catalog. Configure fine-grained access control and have it enforced across clouds when querying with BigQuery Omni.
Performance acceleration: Achieve industry-leading performance over data lake tables on Google Cloud, AWS and Azure, powered by proven BigQuery infrastructure.
Built on open formats: Gain access to the most popular open data formats, including Parquet, Avro, ORC, CSV and JSON. The API serves multiple compute engines through Apache Arrow.
Data governance is a combination of processes that ensure that data is secure, private, accurate, available and usable. It includes people with specific roles and responsibilities and well-defined processes supported by technology. While you are responsible for defining a data governance strategy for your organization, Google Cloud provides several tools and technologies to operationalize such a strategy. Google Cloud also provides a framework for data governance in the cloud.
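The "configure security once, enforce it everywhere" idea behind the connectors can be sketched as a single read path that every engine goes through. This is a conceptual toy in plain Python, not the BigQuery Storage API: the `Policy` class, `read_rows`, the two stand-in "engines" and the sample table are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]


@dataclass
class Policy:
    allowed_columns: Set[str]          # column-level control
    row_filter: Callable[[Row], bool]  # row-level control


# A single copy of the data (stand-in for files in an object store).
TABLE: List[Row] = [
    {"customer": "a", "region": "US", "email": "a@example.com"},
    {"customer": "b", "region": "EU", "email": "b@example.com"},
]

# Policies are defined once, centrally.
POLICIES: Dict[str, Policy] = {
    "us-analyst": Policy({"customer", "region"}, lambda r: r["region"] == "US"),
}


def read_rows(principal: str) -> List[Row]:
    """The only read path: filters rows and drops columns per the policy."""
    policy = POLICIES[principal]
    return [
        {k: v for k, v in row.items() if k in policy.allowed_columns}
        for row in TABLE
        if policy.row_filter(row)
    ]


# Two stand-in engines; neither touches TABLE directly.
def sql_engine_count(principal: str) -> int:
    return len(read_rows(principal))


def dataframe_engine_columns(principal: str) -> List[str]:
    return sorted({key for row in read_rows(principal) for key in row})


print(sql_engine_count("us-analyst"))
print(dataframe_engine_columns("us-analyst"))
```

Because both stand-in engines call the same `read_rows`, neither can see the EU row or the `email` column, regardless of how each engine is configured.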

More in-depth industry perspectives by domain are coming in my next newsletter and article. Stay tuned.

