Modernize data lakes to be ready for Generative AI

Data lakes have been around for well over a decade now, supporting the analytic operations of some of the world’s largest corporations. Some argue, though, that the vast majority of these deployments have now become data “swamps”. Regardless of which side of this controversy you sit on, the reality is that a lot of data is still held in these systems, and such data volumes are not easy to move, migrate or modernize.

Figure 1: The emergence of data lakehouse architectures.

In the case of Hadoop, one of the more popular data lakes, the promise of implementing such a repository using open-source software, all running on commodity hardware, meant you could store a lot of data at very low cost. Data could be persisted in open formats, democratizing its consumption, and replicated automatically, which helped sustain high availability. The default processing framework offered the ability to recover from failures mid-flight. This was, without question, a significant departure from traditional analytic environments, which often meant vendor lock-in and the inability to work with data at scale.

The data lakehouse is an emerging architecture that offers the flexibility of a data lake with the performance and structure of a data warehouse. Most lakehouse solutions offer a high-performance query engine over low-cost storage in conjunction with a metadata governance layer. Intelligent metadata layers make it easier for users to categorize and classify unstructured data, such as video and voice, and semi-structured data, such as XML, JSON and emails.
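To make the pattern concrete, here is a minimal sketch of a query engine running over open-format data in object storage, using PySpark; the endpoint, bucket and column names are hypothetical, and a real deployment would also need the appropriate storage connector and credentials:

```python
from pyspark.sql import SparkSession

# The engine (Spark here) queries low-cost object storage directly;
# a metadata/catalog layer maps table names onto files and schemas.
spark = (
    SparkSession.builder
    .appName("lakehouse-query")
    # Point Spark at S3-compatible object storage (hypothetical endpoint).
    .config("spark.hadoop.fs.s3a.endpoint", "https://object-store.example.com")
    .getOrCreate()
)

# Read an open-format (Parquet) dataset straight from object storage.
events = spark.read.parquet("s3a://analytics-bucket/events/")

# Register it as a view so it can be queried with plain SQL, warehouse-style.
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").show()
```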

Figure 2: watsonx Architecture overview

Currently, we see the lakehouse as an augmentation, not a replacement, of existing data stores, whether on-premises or in the cloud. A lakehouse should make it easy to combine new data from a variety of different sources with mission-critical data about customers and transactions that resides in existing repositories. New insights are found in the combination of new data with existing data, and in the identification of new relationships. And AI, both supervised and unsupervised machine learning, is the best and sometimes only way to unlock these new insights at scale.

The data lakehouse was designed to bring together the best features of a data warehouse and a data lake, and it yields specific benefits to its users. These include:

  • Reduced data redundancy: A single data storage system provides a streamlined platform for all business data demands. Data lakehouses also simplify data observability by reducing the amount of data moving through the data pipeline into multiple systems.
  • Cost-effective: Since data lakehouses capitalize on the low cost of cloud object storage, their operational costs are comparatively lower than those of data warehouses. Additionally, the hybrid architecture of a data lakehouse eliminates the need to maintain multiple data storage systems, making it less expensive to operate.
  • Supports a wide variety of workloads: Data lakehouses can address different use cases across the data management lifecycle. They can support business intelligence and data visualization workstreams as well as more complex data science ones.
  • Better governance: The data lakehouse architecture mitigates the standard governance issues that come with data lakes. For example, as data is ingested and uploaded, the platform can verify that it meets the defined schema requirements, reducing downstream data quality issues (see the sketch after this list).
  • More scale: In traditional data warehouses, compute and storage were coupled together, which drove up operational costs. Data lakehouses separate storage and compute, allowing data teams to access the same data storage while using different compute nodes for different applications. This results in more scalability and flexibility.
  • Streaming support: The data lakehouse is built for today’s business and technology landscape, where many data sources stream in real time directly from devices. The lakehouse supports this real-time ingestion, which will only become more common.
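As an illustration of the governance point above, here is a minimal sketch of schema enforcement at ingest time with PySpark; the schema, landing path and table name are hypothetical, and writing to an Iceberg-style table assumes a suitably configured catalog:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("lakehouse-ingest").getOrCreate()

# The schema the governed table is expected to satisfy (hypothetical).
expected = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("event_ts", TimestampType(), nullable=False),
])

# FAILFAST makes the read error out on records that do not match the
# schema, catching quality issues at ingest rather than downstream.
incoming = (
    spark.read
    .schema(expected)
    .option("mode", "FAILFAST")
    .json("s3a://landing-zone/transactions/")  # hypothetical landing path
)

# Append to an existing governed lakehouse table; the write is rejected
# if the DataFrame does not match the table's declared schema.
incoming.writeTo("lakehouse.sales.transactions").append()
```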

IBM’s answer to the current analytics crossroads is watsonx.data. This is a new open data store for managing data at scale that allows companies to surround, augment and modernize their existing data lakes and data warehouses without the need to migrate. Its hybrid nature means you can run it on customer-managed infrastructure (on-premises and/or IaaS) and in the cloud.

Figure 3: Overview of key components of IBM watsonx.data

A key differentiator is the multi-engine strategy that allows users to leverage the right technology for the right job at the right time, all via a unified data platform. Watsonx.data enables customers to implement fully dynamic tiered storage (and associated compute). This can lead, over time, to very significant data management and processing cost savings.
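As a sketch of what multi-engine access can look like in practice, the following queries a shared lakehouse table through a Presto coordinator using the presto-python-client package; the host, credentials and table names are hypothetical placeholders rather than watsonx.data specifics:

```python
import prestodb  # pip install presto-python-client

# Connect to a Presto coordinator (hypothetical host and credentials).
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="lakehouse",  # same catalog the Spark ingest job wrote to
    schema="sales",
)

# Interactive SQL over the same shared storage and metadata: a
# lightweight engine for BI queries, while Spark handles heavy transforms.
cur = conn.cursor()
cur.execute(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM transactions GROUP BY customer_id ORDER BY total DESC LIMIT 10"
)
for customer_id, total in cur.fetchall():
    print(customer_id, total)
```

The point of the design is that both engines hit the same storage and metadata layer, so no per-engine copies of the data are needed.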

If your organization has existing on-premises big data implementations, a lakehouse offers a less expensive alternative for storing data in open formats on object storage. You’ll lower the cost of analytics, decrease complexity and improve time to value.
