Modernize data lakes to be ready for Generative AI

Data lakes have been around for well over a decade now, supporting the analytic operations of some of the world’s largest corporations. Some argue, though, that the vast majority of these deployments have now become data “swamps”. Regardless of which side of this controversy you sit on, the reality is that a lot of data is still held in these systems, and such data volumes are not easy to move, migrate or modernize.

Figure 1: The emergence of data lakehouse architectures.

In the case of Hadoop, one of the more popular data lakes, the promise of implementing such a repository using open-source software, all running on commodity hardware, meant you could store a lot of data at very low cost. Data could be persisted in open formats, democratizing its consumption, and replicated automatically, which helped sustain high availability. The default processing framework offered the ability to recover from failures mid-flight. This was, without question, a significant departure from traditional analytic environments, which often meant vendor lock-in and the inability to work with data at scale.

The data lakehouse is an emerging architecture that offers the flexibility of a data lake with the performance and structure of a data warehouse. Most lakehouse solutions offer a high-performance query engine over low-cost storage in conjunction with a metadata governance layer. Intelligent metadata layers make it easier for users to categorize and classify unstructured data, such as video and voice, and semi-structured data, such as XML, JSON and emails.
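To make the pattern concrete, here is a minimal sketch of a query engine running over open-format data in object storage, using PySpark; the endpoint, bucket and column names are hypothetical, and a real deployment would also need the appropriate storage connector and credentials:

```python
from pyspark.sql import SparkSession

# The engine (Spark here) queries low-cost object storage directly;
# a metadata/catalog layer maps table names onto files and schemas.
spark = (
    SparkSession.builder
    .appName("lakehouse-query")
    # Point Spark at S3-compatible object storage (hypothetical endpoint).
    .config("spark.hadoop.fs.s3a.endpoint", "https://object-store.example.com")
    .getOrCreate()
)

# Read an open-format (Parquet) dataset straight from object storage.
events = spark.read.parquet("s3a://analytics-bucket/events/")

# Register it as a view so it can be queried with plain SQL, warehouse-style.
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").show()
```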

Figure 2: watsonx Architecture overview

Currently, we see the lakehouse as an augmentation, not a replacement, of existing data stores, whether on-premises or in the cloud. A lakehouse should make it easy to combine new data from a variety of different sources with mission-critical data about customers and transactions that resides in existing repositories. New insights are found in the combination of new data with existing data, and in the identification of new relationships. And AI, both supervised and unsupervised machine learning, is the best and sometimes only way to unlock these new insights at scale.

The data lakehouse was designed to bring together the best features of a data warehouse and a data lake, and it yields specific benefits to its users. These include:

  • Reduced data redundancy: A single data storage system provides a streamlined platform for all business data demands. Data lakehouses also simplify data observability by reducing the amount of data moving through the data pipeline into multiple systems.
  • Cost-effective: Since data lakehouses capitalize on the low cost of cloud object storage, their operational costs are comparatively lower than those of data warehouses. Additionally, the hybrid architecture of a data lakehouse eliminates the need to maintain multiple data storage systems, making it less expensive to operate.
  • Supports a wide variety of workloads: Data lakehouses can address different use cases across the data management lifecycle. They can support business intelligence and data visualization workstreams as well as more complex data science ones.
  • Better governance: The data lakehouse architecture mitigates the standard governance issues that come with data lakes. For example, as data is ingested and uploaded, the platform can verify that it meets the defined schema requirements, reducing downstream data quality issues (see the sketch after this list).
  • More scale: In traditional data warehouses, compute and storage were coupled together, which drove up operational costs. Data lakehouses separate storage and compute, allowing data teams to access the same data storage while using different compute nodes for different applications. This results in more scalability and flexibility.
  • Streaming support: The data lakehouse is built for today’s business and technology landscape, where many data sources stream in real time directly from devices. The lakehouse supports this real-time ingestion, which will only become more common.
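As an illustration of the governance point above, here is a minimal sketch of schema enforcement at ingest time with PySpark; the schema, landing path and table name are hypothetical, and writing to an Iceberg-style table assumes a suitably configured catalog:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("lakehouse-ingest").getOrCreate()

# The schema the governed table is expected to satisfy (hypothetical).
expected = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("event_ts", TimestampType(), nullable=False),
])

# FAILFAST makes the read error out on records that do not match the
# schema, catching quality issues at ingest rather than downstream.
incoming = (
    spark.read
    .schema(expected)
    .option("mode", "FAILFAST")
    .json("s3a://landing-zone/transactions/")  # hypothetical landing path
)

# Append to an existing governed lakehouse table; the write is rejected
# if the DataFrame does not match the table's declared schema.
incoming.writeTo("lakehouse.sales.transactions").append()
```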

IBM’s answer to the current analytics crossroads is watsonx.data. This is a new open data store for managing data at scale that allows companies to surround, augment and modernize their existing data lakes and data warehouses without the need to migrate. Its hybrid nature means you can run it on customer-managed infrastructure (on-premises and/or IaaS) and in the cloud.

Figure 3: Overview of key components of IBM watsonx.data

A key differentiator is the multi-engine strategy that allows users to leverage the right technology for the right job at the right time, all via a unified data platform. Watsonx.data enables customers to implement fully dynamic tiered storage (and associated compute). This can lead, over time, to very significant data management and processing cost savings.
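As a sketch of what multi-engine access can look like in practice, the following queries a shared lakehouse table through a Presto coordinator using the presto-python-client package; the host, credentials and table names are hypothetical placeholders rather than watsonx.data specifics:

```python
import prestodb  # pip install presto-python-client

# Connect to a Presto coordinator (hypothetical host and credentials).
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="lakehouse",  # same catalog the Spark ingest job wrote to
    schema="sales",
)

# Interactive SQL over the same shared storage and metadata: a
# lightweight engine for BI queries, while Spark handles heavy transforms.
cur = conn.cursor()
cur.execute(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM transactions GROUP BY customer_id ORDER BY total DESC LIMIT 10"
)
for customer_id, total in cur.fetchall():
    print(customer_id, total)
```

The point of the design is that both engines hit the same storage and metadata layer, so no per-engine copies of the data are needed.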

If your organization has existing on-premises big data implementations, a lakehouse offers a less expensive alternative for storing data in open formats on object storage. You’ll lower the cost of analytics, decrease complexity and improve time to value.
