Will Apache Iceberg be the Catalyst to Revolutionize AI and Data Management?
Over the past four decades, the advancement of digital technology has gone through three distinct eras, each evolving in similar patterns marked by an open-source catalyst that drives fundamental change and disruption. In the PC & Internet era, personal computers connected by the Internet made access to information worldwide instantaneous. The mobile and cloud era made information available anywhere, at any time, and the big data and AI era is still evolving to enable computers to think like humans. Apache Iceberg's open table format will be the catalyst that moves the AI age forward and fundamentally changes how we work with data.
The evolution of each of these eras follows familiar patterns.
1) Market drivers such as the adoption of PCs and smartphones created a demand for more scalable and robust infrastructure.
2) More scalable infrastructures such as client-server networks, the cloud, and data lakes are adopted, but old processes are simply moved onto the new platforms.
3) Demand for greater quality and efficiency requires new processes that are better aligned with the capabilities of the new infrastructure.
4) Revolutionary technology catalysts such as TCP/IP, Docker, and Kubernetes enable software and data to leverage the capability of the new infrastructure.
These revolutionary technology catalysts share similar characteristics:
1) Disintegration - Data or code is broken into smaller, more manageable components. These components are packaged with metadata, making them more accessible and independent of the underlying platform.
2) Easy-to-use open-source standard - Standardized open-source software makes these components interoperable and accessible to any application layer.
3) Middleware - An orchestration engine or automated middleware manages these components, routing and tracking changes and data movement.
4) The technology can evolve with limited disruption - These characteristics allow the technology to evolve smoothly; innovations and new features can be integrated while the system continues to operate.
The state of the technology landscape after these innovations is fundamentally different than before each of these standards was adopted. Let us consider how these patterns are represented in the different eras and how they are replicated in today's environment.
PC & Internet Era
Before State
Before the creation of the Internet, computers communicated directly through modems using phone lines. This digital traffic was turned into audio signals and routed via circuits that connected machines directly. These circuit-switching technologies were proprietary, developed independently and controlled by each individual telephone company. The result was a very brittle and inefficient system.
Market Drivers - PC growth & shift to client-server networks
The growing popularity of PCs led to high demand for a more efficient way to connect PCs and corporate computers. To adapt to this change, enterprises and telecom firms built client-server networks. Initially, there was limited legacy digital technology, so innovators could build from scratch and define the new IP network.
Revolutionary Technology Catalyst - TCP/IP & HTTP
The development of TCP/IP was the last piece of the puzzle that led to the creation of the Internet we know today. The technology exhibits the characteristics of a revolutionary catalyst.
Disintegration - The shift from circuit switching to packet switching was a key innovation. Instead of using a dedicated connection, communications were broken down into packets that could take different routes to the user/client PC. This was defined by TCP/IP and became the backbone of the Internet. Each packet contained metadata describing its contents and where it should be routed.
Interoperability - IP was the standard addressing protocol that all networked computers understood. It also defined how packets were processed and transported. HTTP, built on top of TCP/IP, enables any browser to display a web document or page.
Automated Middleware - TCP ensured that packets were created and routed appropriately.
Observability and tracking - Network servers processed these packets dynamically and tracked them so they didn't get lost.
Concurrent users - Because packets are processed dynamically, multiple users can use a network node or access a website simultaneously.
Open-source - TCP/IP was created through government-funded research, so unlike the telephone networks, no single company controlled it.
Evolution path - Networks and communications were able to evolve without shutting down the whole system. Traffic can be rerouted if there is a problem with any part of the network.
End State
Everything changed. The Internet has fundamentally changed how we live our lives. Although the digital world has evolved since the creation of TCP/IP, the protocol is still the foundation of the web and of communications networks.
Mobile and Cloud Era
Before State - Application monoliths
Before mobile and cloud, applications were built as monoliths designed to run on PCs. Code was tightly integrated, and individual components could not function separately. If one component broke, the whole app broke. Components written in different languages could not easily interact with each other, and reusing code across applications was difficult.
Market Driver - Mobile and the move to the cloud
The proliferation of smartphones drove demand for access to data and applications from anywhere at any time, leading to the rise of the cloud. When organizations adopted the cloud to run apps, in many cases they moved their applications to the new platform as is. The same monolithic apps designed to run on a PC or smartphone were ported to a container in the cloud.
Demand for Quality and Agility
As smartphones and the cloud evolved, the customer experience became a competitive differentiator. Apps needed to be more resilient, easier to build, and quicker to maintain. Agile development practices existed, but there was no technology standard for DevOps to enable better quality control and faster, more collaborative app development.
Revolutionary Technology Catalyst - Docker & Kubernetes
The emergence and adoption of Docker and Kubernetes revolutionized how applications were developed. The technology exhibits the same characteristics as TCP/IP and HTTP.
Disintegration - Applications are broken up into blocks of code that are placed in Docker containers. Linux containers existed before Docker, but the new technology was easier to use.
Interoperable - Standard packaging and APIs, defined by Docker, enable software components to interoperate with each other. Any major cloud, whether Google, AWS, or Azure, can run Kubernetes.
Open-source - Kubernetes and Docker are both open-source, allowing the best software to emerge as the standard.
Automated middleware - Kubernetes schedules Docker containers dynamically to run software components (a brief sketch follows this list).
Observability - Containers are orchestrated and observed by Kubernetes.
Concurrent users - Code in containers can be changed and integrated back into an application without disruption or recompiling the app.
Evolution path - Applications can evolve without disruption. Components can be maintained and updated independently and integrated back into the program without having to rebuild it.
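To make the middleware idea concrete, here is a minimal sketch of how an application component packaged as a container could be handed to Kubernetes, using the official Kubernetes Python client. The app name, image, and replica count are hypothetical, and it assumes a cluster reachable through a local kubeconfig; the point is that the orchestrator, not the developer, keeps the declared number of copies running.

```python
# A minimal sketch, assuming the official "kubernetes" Python client and a
# cluster reachable through the local kubeconfig; "demo-app", its image, and
# the replica count are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # read cluster credentials from ~/.kube/config

container = client.V1Container(
    name="demo-app",
    image="demo-app:1.0",  # a container image holding one application component
    ports=[client.V1ContainerPort(container_port=8080)],
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="demo-app"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # Kubernetes keeps three copies running and reschedules failures
        selector=client.V1LabelSelector(match_labels={"app": "demo-app"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "demo-app"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Once the deployment is created, Kubernetes continuously reconciles the cluster toward this declared state, restarting or rescheduling containers as needed without the application being rebuilt.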
End state
Applications became much more resilient and reliable. The development process is also much more efficient and collaborative. Consequently, a large ecosystem of tools emerged to support DevOps.
Big Data and AI Era
Before State - Data warehouses
Early in the digital age, data was stored in data warehouses using relational databases. These are tightly coupled, well-organized data tables, which makes changes difficult. They are also not designed to store unstructured data.
Market Driver - Explosive data growth and data lake adoption
Mobile and IoT adoption generated vast amounts of data, both structured and unstructured. This data needed to be stored somewhere, and data warehouses could not scale to meet the demand. Data lakes were created to handle this extensive data volume, providing a place to dump data in its native format with limited organization. This made it difficult to access data and track its quality. Old ETL technologies were then adopted to move data in and out of data lakes.
This approach is similar to how the cloud era evolved: Applications were moved to the cloud, and old processes and technologies moved with them. The ETL process developed for data warehouses requires significant amounts of data engineering work and remains a popular way to move data between databases, data lakes, and analytical platforms.
Demand for Quality and Efficiency
The growth of ML and AI is driving demand for larger volumes of better-quality data. Improved AI performance and decision-making are becoming a competitive advantage, and access to high-quality data is a differentiator. Limited data engineering resources are also driving the need for a more efficient approach.
Revolutionary Technology Catalyst - Apache Iceberg Data Table Format
What is Iceberg
Iceberg is an open-source data table format with all the features required to support fundamental changes in data management. Its widespread adoption fits the evolutionary pattern of the previous Internet and cloud-native eras.
Disintegration - Iceberg offers an innovative approach to data table partitioning. Partitioning groups table rows into files so processing engines don't have to search a whole database to find data, only the appropriate files. This disintegration is similar to how TCP/IP breaks communications into packets and Docker packages code into containers: sections of a database can be accessed and acted upon independently of the entire database.
Previous partitioning approaches, such as Hive's, require users to define each partition. Iceberg's hidden partitioning removes the need to define partitions, making it much more dynamic and programmable (see the sketch after this list). This parallels Docker's relationship to the Linux containers that existed before it, where a more programmable approach drove rapid adoption.
Automated Middleware - Iceberg's metadata is separated from the underlying data, enabling centralized metadata management.
Interoperable - Standard data format and open metadata enable any data processing engine to access and manipulate data.
Open-source - Iceberg is an open-source data table format that is outside the control of any vendor.
Observability - The metadata layer keeps track of processes. Iceberg tracks changes to the data tables and can enable time travel and rollbacks.
Concurrent users - Iceberg takes snapshots of metadata before tables are changed. Because tables are partitioned, changes can be made to a single file instead of an entire table. If conflicts arise, the file can be rolled back independently of the table.
Evolutionary path - Schema evolution enables table schemas to be changed without disrupting ongoing operations.
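A minimal sketch of several of these characteristics, written in PySpark: it assumes a SparkSession named spark (Spark 3.3 or later) that is already configured with an Iceberg catalog called demo, and the database, table, and column names are hypothetical. It walks through hidden partitioning, schema evolution, and snapshot-based time travel in one short flow.

```python
# Assumes an existing SparkSession "spark" with an Iceberg catalog named "demo".
# Table, database, and column names below are illustrative only.

# Hidden partitioning: rows are grouped into daily files by event_ts,
# but queries never have to reference a partition column explicitly.
spark.sql("""
    CREATE TABLE demo.db.events (
        id       BIGINT,
        event_ts TIMESTAMP,
        payload  STRING)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

spark.sql(
    "INSERT INTO demo.db.events "
    "VALUES (1, TIMESTAMP '2024-05-01 10:00:00', 'clicked')"
)

# Schema evolution: add a column without rewriting data files or
# interrupting readers of the table.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (source STRING)")

# Observability: every commit is recorded as a snapshot in the metadata layer.
snapshots = spark.sql(
    "SELECT committed_at, snapshot_id, operation "
    "FROM demo.db.events.snapshots ORDER BY committed_at"
)
snapshots.show(truncate=False)

# Time travel: query the table as it existed at a specific snapshot.
first_snapshot_id = snapshots.first()["snapshot_id"]
spark.sql(
    f"SELECT * FROM demo.db.events VERSION AS OF {first_snapshot_id}"
).show()
```

Because every commit is simply a new snapshot in the metadata layer, readers keep querying a consistent view of the table while writers add data, change the schema, or roll back a bad change.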
End state
The ability of a single Iceberg table to support multiple users enables data to be analyzed in place. Since data does not need to be duplicated for analysis, teams can focus on a single source of truth and on ensuring its quality. The need to move data is eliminated, reducing the chance of errors and limiting the demand for data engineers.
Much like how DevOps practices evolved with Docker and Kubernetes, a much more collaborative data management process should emerge with the adoption of Iceberg. This will improve data quality and access. Better trust in data and higher quality will support better-performing AI that will revolutionize our lives, just like the innovations that created the Internet.
Recommendations
CIOs and CDOs can begin making changes now to prepare their data management strategies for the impact of Apache Iceberg.
1) Build expertise and an ecosystem around Iceberg. Iceberg's interoperability will enable new tools to be built on top of it, creating a new data tools ecosystem.
2) Build new processes to leverage more dynamic data.
3) Shift left. The Iceberg standard will accelerate the growth of tools and processes that improve data quality and catch data problems early in the pipeline.
4) Get ready for better AI built on higher-quality data.