Google BigLake, Dataplex and BigQuery: Changing the Data Paradigm in a Multi-cloud World

Let me try to decode some of the advancements Google has made in how data is handled in a multi-cloud world of data lakes and siloed data. Most organizations' data environments are diverse and increasingly distributed. They have data across data lakes, data warehouses, data marts and databases. Some of their data is on-premises, and some is in the cloud, often with multiple providers. Each of these systems has its own way of handling metadata, security and governance. For many of our prospects, this creates an operational nightmare.

Data volume, variety and velocity continue to grow. Data is increasingly spread across various storage systems and clouds. Data also has many more use cases (AI, BI, SQL, self-serve, data apps, streaming, event-driven workflows). Organizations today need the ability to distribute ownership of data across the teams that have the most business context, while ensuring that overall data lifecycle management and governance are consistently applied across their distributed data landscape.

Figure: Data complexity in a multi-cloud, multi-data-lake and multi-warehouse world


Data Mesh, Data Fabric, Multi-cloud: How to Address It All

Data lakes and warehouses have separate data management layers and serve specific use cases, yet use cases require data regardless of where it is stored. Three core Google platforms provide the backbone of a data mesh for customers' digital assets:

BigLake is a storage engine that unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control. BigLake provides accelerated query performance. It lets you store a single copy of data with uniform features across data warehouses and lakes, apply fine-grained access control and multi-cloud governance over distributed data, and integrate seamlessly with open source analytics tools and open data formats. It also integrates with Dataplex to provide management at scale, including logical data organization, centralized policy and metadata management, and quality and lifecycle management for consistency across distributed data.

Dataplex is an intelligent data fabric that enables you to centrally manage, monitor, and govern data across data lakes, data warehouses, and data marts, and make this data securely accessible to a variety of analytics and data science tools. With Dataplex, enterprises can easily delegate ownership, usage, and sharing of data to data owners who have the right business context, while still having a single pane of glass to consistently monitor and govern data across the various data domains in their organization. With built-in data intelligence, Dataplex automates data discovery, data lifecycle management, and data quality, enabling data productivity and accelerating analytics agility.

With Dataplex you can logically organize your data and related artifacts, such as code, notebooks, and logs, into a Dataplex Lake, which represents a data domain. You can model all the data in a particular domain as a set of Dataplex Assets within a lake without physically moving data or storing it in a single storage system. Assets can refer to Cloud Storage buckets and BigQuery datasets stored in multiple Google Cloud projects, and can cover both analytics and operational data, structured and unstructured, that logically belongs to a single domain. Dataplex Zones enable you to group assets and add structure that captures key aspects of your data: its readiness, the workloads it is associated with, or the data products it is serving. The lakes and data zones in Dataplex enable you to unify distributed data and organize it based on business context. This forms the foundation for managing metadata, setting up governance policies, monitoring data quality and so on, giving you the ability to manage your distributed data at scale.
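The lake, zone and asset hierarchy described above can be pictured with a small data model. This is only a conceptual sketch in plain Python, not the Dataplex API: the class names, the `resources` helper, and the bucket, project and dataset names are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    """A pointer to data that stays where it is (illustrative model only)."""
    name: str
    resource: str  # hypothetical reference to a bucket or dataset


@dataclass
class Zone:
    """Groups assets by readiness or workload, e.g. raw vs. curated."""
    name: str
    kind: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Lake:
    """Represents one data domain made up of zones."""
    name: str
    zones: List[Zone] = field(default_factory=list)

    def resources(self) -> List[str]:
        # List every underlying storage resource without moving any data.
        return [a.resource for z in self.zones for a in z.assets]


# Hypothetical "sales" domain spanning object storage and a warehouse dataset.
sales = Lake(
    name="sales",
    zones=[
        Zone("landing", "raw",
             [Asset("orders-files", "gs://example-sales-raw")]),
        Zone("reporting", "curated",
             [Asset("orders-dw", "bigquery:example-project.sales_curated")]),
    ],
)

print(sales.resources())
```

The point of the sketch is that the domain is a logical grouping: the lake only holds references, so the same model can span several projects and storage systems.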

Customers can automatically discover metadata across data sources. Dataplex provides metadata management and cataloging that enables all members of a domain to easily search, browse and discover tables and filesets, as well as augment them with business and domain-specific semantics. Once data is added as assets, Dataplex automatically extracts the associated metadata and keeps it up to date as the data evolves. This metadata is made available for search, discovery, and enrichment via integration with Data Catalog.

How Does BigLake Address Structured, Unstructured and Semi-structured Data and the Challenges of Multi-cloud?


  • BigLake is a storage engine that unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control. BigLake provides accelerated query performance across multi-cloud storage and open formats such as Apache Iceberg. BigLake eliminates the need to grant end users file-level access.
  • Enable interoperability of tools: The metadata curated by Dataplex is automatically made available as runtime metadata to power federated open source analytics via Apache Spark SQL, HiveQL, Presto, and so on. Compatible metadata is also automatically published as external tables in BigQuery to enable federated analytics via BigQuery.
  • Govern data at scale: Dataplex enables data administrators and stewards to consistently and scalably manage their IAM data policies to control data access across distributed data. It provides the ability to centrally govern data across domains while enabling autonomous and delegated ownership of data. It provides the ability to manage reader/writer permissions on the domains and the underlying physical storage resources. Dataplex integrates with Stackdriver to provide observability including audit logs, data metrics and logs.
  • Enable access to high quality data: Dataplex provides built-in data quality rules that can automatically surface issues in your data. You can run these rules as data quality tasks across your data in BigQuery and GCS.
  • One-click data exploration: Dataplex enables data engineers, data scientists and data analysts with a built-in, self-serve, serverless data exploration experience to interactively explore data and metadata, iteratively develop scripts, and deploy and monitor data management workloads. It provides content management across SQL scripts and Jupyter notebooks that makes it easy to create domain-specific code artifacts and share or schedule them from that same interface.
  • Data management: You can also leverage the built-in data management tasks that address common tasks such as tiering, archiving or refining data. It integrates with Google Cloud’s native data tools such as Dataproc Serverless, Dataflow, Data Fusion, and BigQuery to provide an integrated data management platform.
  • By bringing together data, metadata, policies, code, interactive and production analytics infrastructure, and data monitoring, Dataplex delivers on the core value proposition of a data mesh: data as the product.
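To make the idea of data quality rules from the list above concrete, here is a minimal sketch of rule-based checks in plain Python. It does not use Dataplex or its rule format; the rule helpers (`not_null`, `in_range`), the `run_rules` function, and the sample rows are hypothetical and only show the general shape of "define rules once, run them over the data, surface failures".

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rule = Callable[[Row], bool]


def not_null(column: str) -> Rule:
    """Rule that passes when the column has a value."""
    return lambda row: row.get(column) is not None


def in_range(column: str, low: float, high: float) -> Rule:
    """Rule that passes when the column value lies within [low, high]."""
    def check(row: Row) -> bool:
        value = row.get(column)
        return value is not None and low <= value <= high
    return check


def run_rules(rows: List[Row], rules: Dict[str, Rule]) -> Dict[str, int]:
    """Count, per rule name, how many rows fail that rule."""
    failures = {name: 0 for name in rules}
    for row in rows:
        for name, rule in rules.items():
            if not rule(row):
                failures[name] += 1
    return failures


# Hypothetical sample data with one bad key and one bad amount.
rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": None, "amount": 10.0},
    {"order_id": 3, "amount": -5.0},
]
rules = {
    "order_id_not_null": not_null("order_id"),
    "amount_in_range": in_range("amount", 0, 10_000),
}

report = run_rules(rows, rules)
print(report)
```

In a managed service the rules would run as scheduled tasks against tables or files; the sketch only shows the rule-and-report pattern.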

How Can the Challenges of Multi-cloud Be Addressed with This Approach?

World of multi-cloud: Customers can create BigLake tables on Google Cloud Storage (GCS), Amazon S3 and Azure Data Lake Storage (ADLS) Gen2 over supported open file formats, such as Parquet, ORC and Avro. BigLake tables are a new type of external table that can be managed similarly to data warehouse tables. Administrators do not need to grant end users access to files in object stores, but instead manage access at the table, row or column level. These tables can be created from a query engine of your choice, such as BigQuery or open-source engines using the BigLake connector. Once these tables are created, BigLake and BigQuery tables can be centrally discovered in Data Catalog and managed at scale using Dataplex.
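As a rough illustration of what creating a BigLake table from BigQuery looks like, the helper below assembles a `CREATE EXTERNAL TABLE ... WITH CONNECTION ... OPTIONS (...)` statement as a string. The helper name `biglake_table_ddl`, the project, dataset, connection and bucket names are assumptions made for this sketch, and the format check only covers the formats named in this article. The statement is printed, not executed, and it assumes a connection resource has already been set up.

```python
from typing import Sequence

# Formats named in this article; treated here as the allowed set.
SUPPORTED_FORMATS = {"PARQUET", "ORC", "AVRO", "CSV", "JSON"}


def biglake_table_ddl(table: str, connection: str,
                      file_format: str, uris: Sequence[str]) -> str:
    """Build a DDL string for an external table that reads through a connection."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {file_format}")
    if not uris:
        raise ValueError("at least one URI is required")
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{fmt}', uris = [{uri_list}]);"
    )


# Hypothetical names throughout.
ddl = biglake_table_ddl(
    table="example-project.sales.orders",
    connection="example-project.us.example-connection",
    file_format="parquet",
    uris=["gs://example-sales-raw/orders/*.parquet"],
)
print(ddl)
```

Because end users query the table through the connection, they need table-level permissions but not direct access to the underlying files, which is the point made in the paragraph above.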

  • BigLake extends the BigQuery Storage API to object stores to help you build a multi-compute architecture. BigLake connectors are built on the BigQuery Storage API and enable Google Cloud Dataflow and open-source query engines (such as Spark, Trino, Presto and Hive) to query BigLake tables with security enforced. This eliminates the need to move data for each query-engine-specific use case; security only needs to be configured in one place and is enforced everywhere.
  • Multi-compute analytics: Maintain a single copy of data and make it uniformly accessible across Google Cloud and open-source engines, including BigQuery, Vertex AI, Dataflow, Spark, Presto, Trino, and Hive, using BigLake connectors. Centrally manage security policies in one place, and have them consistently enforced across the query engines by the API interface built into the connectors.
  • Multi-cloud governance: Discover all BigLake tables, including those defined over Amazon S3 and Azure Data Lake Storage Gen2, in Data Catalog. Configure fine-grained access control and have it enforced across clouds when querying with BigQuery Omni.
  • Performance acceleration: Achieve industry-leading performance over data lake tables on Google Cloud, AWS and Azure, powered by proven BigQuery infrastructure.
  • Built on open formats: Gain access to the most popular open data formats, including Parquet, Avro, ORC, CSV and JSON. The API serves multiple compute engines through Apache Arrow.
  • Data governance is a combination of processes that ensure that data is secure, private, accurate, available and usable. It includes people with specific roles and responsibilities and well-defined processes supported by technology. While you are responsible for defining a data governance strategy for your organization, Google Cloud provides several tools and technologies to operationalize such a strategy. Google Cloud also provides a framework for data governance in the cloud.
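The "configure security in one place, enforce it everywhere" idea in the list above can be sketched conceptually as follows. This is a toy simulation in plain Python, not how BigLake connectors are implemented: the policy structure, the `read_table` function, the two engine functions and the sample data are all hypothetical. It only shows that when every engine reads through the same policy-aware interface, row and column restrictions apply uniformly.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]


@dataclass
class TablePolicy:
    """Column allow-list plus a row filter for one principal."""
    allowed_columns: Set[str]
    row_filter: Callable[[Row], bool]


# Hypothetical table contents and a single, central policy store.
TABLE: List[Row] = [
    {"customer": "a", "region": "EU", "card_number": "1111"},
    {"customer": "b", "region": "US", "card_number": "2222"},
]
POLICIES: Dict[str, TablePolicy] = {
    "analyst-eu": TablePolicy({"customer", "region"},
                              lambda row: row["region"] == "EU"),
}


def read_table(principal: str) -> List[Row]:
    """Single access path: apply the row filter, then drop hidden columns."""
    policy = POLICIES.get(principal)
    if policy is None:
        raise PermissionError(f"no policy for {principal}")
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in TABLE
        if policy.row_filter(row)
    ]


def spark_like_engine(principal: str) -> List[Row]:
    # Stand-in for one engine; it cannot bypass read_table.
    return read_table(principal)


def sql_like_engine(principal: str) -> List[Row]:
    # Stand-in for a second engine using the same access path.
    return read_table(principal)


print(spark_like_engine("analyst-eu"))
print(sql_like_engine("analyst-eu"))
```

Both stand-in engines return the same filtered view, and a principal without a policy is rejected, because the rules live with the access layer instead of being copied into each engine.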

A more in-depth industry perspective by domain is coming in my next newsletter and article. Stay tuned.

More articles by Sandeep Patel, Author, VC networker, Google AI researcher