Google BigLake, Dataplex, BigQuery: Changing the Data Paradigm in a Multi-cloud World
Sandeep Patel Author, VC networker, Google AI researcher
Principal Data Science, AI and Analytics Architect leader at Google - Data AI-ML advocate
Let me try to decode some of the strides Google has made in how data is handled in a multi-cloud world of data lakes and siloed data. Most organizations' data environments are diverse and increasingly distributed. They have data across data lakes, data warehouses, data marts and databases. Some of their data is on-premises, and some is in the cloud, often with multiple providers. Each of these systems has its own way of handling metadata, security and governance. For many of our prospects, this creates an operational nightmare.
Data volume, variety and velocity have continued to grow. Data is increasingly spread across various storage systems and clouds. Data also has many more use cases (AI, BI, SQL, self-serve, data apps, streaming, event-driven workflows). Organizations today need the ability to distribute ownership of data across the teams that have the most business context, while ensuring that overall data lifecycle management and governance are applied consistently across their distributed data landscape.
Data Mesh - Data Fabric - Multi-cloud: How to Address It All
Data lakes and warehouses have separate data management layers and serve specific use cases, yet those use cases need data regardless of where it is stored. Three core Google platforms provide the backbone of a data mesh for customers' digital assets:
BigLake is a storage engine that unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control. BigLake provides accelerated query performance. It lets you store a single copy of data with uniform features across data warehouses and lakes, apply fine-grained access control and multi-cloud governance over distributed data, and integrate seamlessly with open source analytics tools and open data formats. It also integrates with Dataplex to provide management at scale, including logical data organization, centralized policy and metadata management, and quality and lifecycle management for consistency across distributed data.
Dataplex is an intelligent data fabric that enables you to centrally manage, monitor, and govern data across data lakes, data warehouses, and data marts, and to make this data securely accessible to a variety of analytics and data science tools. With Dataplex, enterprises can easily delegate ownership, usage, and sharing of data to data owners who have the right business context, while still having a single pane of glass to consistently monitor and govern data across the various data domains in their organization. With built-in data intelligence, Dataplex automates data discovery, data lifecycle management, and data quality, enabling data productivity and accelerating analytics agility.
With Dataplex you can logically organize your data and related artifacts, such as code, notebooks, and logs, into a Dataplex Lake, which represents a data domain. You can model all the data in a particular domain as a set of Dataplex Assets within a lake without physically moving data or storing it in a single storage system. Assets can refer to Cloud Storage buckets and BigQuery datasets stored in multiple Google Cloud projects, and can cover both analytics and operational data, structured and unstructured, that logically belongs to a single domain. Dataplex Zones enable you to group assets and add structure that captures key aspects of your data: its readiness, the workloads it is associated with, or the data products it is serving. The lakes and data zones in Dataplex enable you to unify distributed data and organize it based on business context. This forms the foundation for managing metadata, setting up governance policies, monitoring data quality and so on, giving you the ability to manage your distributed data at scale.
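To make the lake, zone and asset hierarchy described above concrete, here is a minimal sketch in plain Python. It is only an illustrative model, not the Dataplex client library: the class names, the `assets_by_project` helper, the project IDs and the resource paths are all hypothetical, and the `RAW`/`CURATED` and `STORAGE_BUCKET`/`BIGQUERY_DATASET` labels are used here as plain strings on the assumption that they match Dataplex's own terminology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Asset:
    """A pointer to data that stays where it is (a bucket or a dataset)."""
    name: str
    resource_type: str  # assumed labels: "STORAGE_BUCKET" or "BIGQUERY_DATASET"
    resource: str       # hypothetical resource path
    project: str        # assets in one lake may live in different projects


@dataclass
class Zone:
    """Groups assets by readiness, workload, or data product."""
    name: str
    kind: str           # assumed labels: "RAW" or "CURATED"
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Lake:
    """One lake per data domain."""
    domain: str
    zones: List[Zone] = field(default_factory=list)

    def assets_by_project(self) -> Dict[str, List[str]]:
        """Index asset names by the project that physically hosts them."""
        index: Dict[str, List[str]] = {}
        for zone in self.zones:
            for asset in zone.assets:
                index.setdefault(asset.project, []).append(asset.name)
        return index


# A hypothetical "sales" domain: one logical lake over data in two projects.
sales = Lake(
    domain="sales",
    zones=[
        Zone("landing", "RAW", [
            Asset("orders-files", "STORAGE_BUCKET", "gs://sales-raw", "proj-ingest"),
        ]),
        Zone("reporting", "CURATED", [
            Asset("orders-tables", "BIGQUERY_DATASET", "sales_curated", "proj-analytics"),
        ]),
    ],
)

print(sales.assets_by_project())
```

The point of the sketch is that the lake is purely a logical grouping: the assets only reference where the data already lives, so nothing is copied or moved when a domain is modelled this way.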
Customers can automatically discover metadata across data sources. Dataplex provides metadata management and cataloging that enables all members of the domain to easily search, browse and discover tables and filesets, as well as augment them with business and domain-specific semantics. Once data is added as assets, Dataplex automatically extracts the associated metadata and keeps it up to date as the data evolves. This metadata is made available for search, discovery, and enrichment via integration with Data Catalog.
How does BigLake take on structured, unstructured and semi-structured data models and multi-cloud challenges?
How can the challenges of multi-cloud be addressed with this approach?
World of multi-cloud: Customers can create BigLake tables on Google Cloud Storage (GCS), Amazon S3 and ADLS Gen 2 over supported open file formats, such as Parquet, ORC and Avro. BigLake tables are a new type of external table that can be managed similarly to data warehouse tables. Administrators do not need to grant end users access to files in object stores; instead, they manage access at the table, row or column level. These tables can be created from a query engine of your choice, such as BigQuery or open-source engines using the BigLake connector. Once these tables are created, BigLake and BigQuery tables can be centrally discovered in the data catalog and managed at scale using Dataplex.
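As a rough illustration of what "create a BigLake table, then manage access at the row level" can look like, the sketch below builds two BigQuery DDL statements as strings. The statement shapes follow BigQuery's `CREATE EXTERNAL TABLE ... WITH CONNECTION` and `CREATE ROW ACCESS POLICY` syntax as I understand it, so check them against the current documentation before use. The helper functions, project, dataset, connection, bucket and group names are all hypothetical, and nothing here actually calls BigQuery.

```python
from typing import List


def biglake_table_ddl(table: str, connection: str, fmt: str, uris: List[str]) -> str:
    """Build DDL for an external table that reads files through a connection."""
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{fmt}', uris = [{uri_list}]);"
    )


def row_access_policy_ddl(policy: str, table: str,
                          grantees: List[str], predicate: str) -> str:
    """Build DDL that limits which rows the listed principals can read."""
    grantee_list = ", ".join(f"'{g}'" for g in grantees)
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f"GRANT TO ({grantee_list})\n"
        f"FILTER USING ({predicate});"
    )


# Hypothetical names throughout; substitute your own resources.
table_ddl = biglake_table_ddl(
    table="my-project.sales.orders",
    connection="my-project.us.lake-conn",
    fmt="PARQUET",
    uris=["gs://sales-raw/orders/*.parquet"],
)
policy_ddl = row_access_policy_ddl(
    policy="emea_only",
    table="my-project.sales.orders",
    grantees=["group:emea-analysts@example.com"],
    predicate="region = 'EMEA'",
)

print(table_ddl)
print(policy_ddl)
```

Note that the analysts in this sketch are granted rows of the table, not the underlying Parquet files, which mirrors the access model described above: the connection reads the object store, and end users only ever see the table.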
A more in-depth industry perspective, by domain, is coming in my next newsletter and article. Stay tuned.