Will Apache Iceberg be the Catalyst to Revolutionize AI and Data Management?
Over the past four decades, the advancement of digital technology has gone through three distinct eras, each evolving in similar patterns marked by an open-source catalyst that drives fundamental change and disruption. In the PC & Internet era, personal computers connected by the Internet made access to information worldwide instantaneous. The mobile and cloud era made information available anywhere, at any time, and the big data and AI era is still evolving to enable computers to think like humans. Apache Iceberg's open table format will be the catalyst that moves the AI age forward and fundamentally changes how we work with data.
The evolution of each of these eras follows familiar patterns.
1) Market drivers such as the adoption of PCs and smartphones created a demand for more scalable and robust infrastructure.
2) More scalable infrastructures such as client-server networks, the cloud, and data lakes are adopted, but old processes are simply moved onto the new platforms.
3) Demand for greater quality and efficiency requires new processes that are better aligned with the capabilities of the new infrastructure.
4) Revolutionary technology catalysts such as TCP/IP, Docker, and Kubernetes enable software and data to leverage the capability of the new infrastructure.
These revolutionary technology catalysts share similar characteristics:
1) Disintegration - Data or code is broken into smaller, more manageable components. These components are packaged with metadata, making them more accessible and independent of the underlying platform.
2) Easy-to-use open-source standard - Standardized open-source software makes these components interoperable and accessible to any application layer.
3) Middleware - An orchestration engine or automated middleware manages these components, routing and tracking changes and data movement.
4) The technology can evolve with limited disruption - These characteristics allow the technology to evolve smoothly; innovations and new features can be integrated while the system continues to operate.
The state of the technology landscape after these innovations is fundamentally different than before each of these standards was adopted. Let us consider how these patterns are represented in the different eras and how they are replicated in today's environment.
PC & Internet Era
Before State
Before the creation of the Internet, computers communicated directly through modems using phone lines. This digital traffic was turned into audio signals and routed via circuits that connected machines directly. These circuit-switching technologies were proprietary, developed independently and controlled by each individual telephone company. The result was a very brittle and inefficient system.
Market Drivers - PC growth & shift to client-server networks
The growing popularity of PCs led to high demand for a more efficient way to connect PCs and corporate computers. To adapt to this change, enterprises and telecom firms built client-server networks. Initially, there was limited legacy digital technology, so innovators could build from scratch and define the new IP network.
Revolutionary Technology Catalyst - TCP/IP & HTTP
The development of TCP/IP was the last piece of the puzzle that led to the creation of the Internet we know today. The technology exhibits the characteristics of a revolutionary catalyst.
Disintegration - The shift from circuit switching to packet switching was a key innovation. Instead of using a dedicated connection, communications were broken down into packets that could take different routes to the user/client PC. This was defined by TCP/IP and became the backbone of the Internet. Each packet contained metadata describing its contents and where it should be routed.
Interoperability - IP was the standard addressing protocol that all networked computers understood. It also defined how packets were processed and transported. HTTP, built on top of TCP/IP, enables any browser to display a web document or page.
Automated Middleware - TCP ensured that packets were created and routed appropriately.
Observability and tracking - Network servers processed these packets dynamically and tracked them so they didn't get lost.
Concurrent users - Because packets are processed dynamically, multiple users can use a network node or access a website simultaneously.
Open-source - TCP/IP was created through government-funded research, so unlike the telephone networks, no single company controlled it.
Evolution path - Networks and communications were able to evolve without shutting down the whole system. Traffic can be rerouted if there is a problem with any part of the network.
End State
Everything changed. The Internet has fundamentally changed how we live our lives. Although the digital world has evolved since the creation of TCP/IP, the protocol is still the foundation of the web and of communications networks.
Mobile and Cloud Era
Before State - Application monoliths
Before mobile and cloud, applications were built as monoliths designed to run on PCs. Code was tightly integrated, and individual components could not function separately. If one component broke, the whole app broke. Components written in different languages could not easily interact with each other, and reusing code across applications was difficult.
Market Driver - Mobile and the move to the cloud
The proliferation of smartphones drove demand for access to data and applications from anywhere at any time, leading to the rise of the cloud. When organizations adopted the cloud to run apps, in many cases they moved their applications to the new platform as is. The same monolithic apps designed to run on a PC or smartphone were ported to a container in the cloud.
Demand for Quality and Agility
As smartphones and the cloud evolved, the customer experience became a competitive differentiator. Apps needed to be more resilient, easier to build, and quicker to maintain. Agile development practices existed, but there was no technology standard for DevOps to enable better quality control and faster, more collaborative app development.
Revolutionary Technology Catalyst - Docker & Kubernetes
The emergence and adoption of Docker and Kubernetes revolutionized how applications were developed. The technology exhibits the same characteristics as TCP/IP and HTTP.
Disintegration - Applications are broken up into blocks of code that are placed in Docker containers. Linux containers existed before Docker, but the new technology was easier to use.
Interoperable - Standard packaging and APIs, defined by Docker, enable software components to interoperate with each other. Any major cloud, whether Google, AWS, or Azure, can run Kubernetes.
Open-source - Kubernetes and Docker are both open-source, allowing the best software to emerge as the standard.
Automated middleware - Kubernetes schedules Docker containers dynamically to run software components (a brief sketch follows this list).
Observability - Containers are orchestrated and observed by Kubernetes.
Concurrent users - Code in containers can be changed and integrated back into an application without disruption or recompiling the app.
Evolution path - Applications can evolve without disruption. Components can be maintained and updated independently and integrated back into the program without having to rebuild it.
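To make the middleware idea concrete, here is a minimal sketch of how an application component packaged as a container could be handed to Kubernetes, using the official Kubernetes Python client. The app name, image, and replica count are hypothetical, and it assumes a cluster reachable through a local kubeconfig; the point is that the orchestrator, not the developer, keeps the declared number of copies running.

```python
# A minimal sketch, assuming the official "kubernetes" Python client and a
# cluster reachable through the local kubeconfig; "demo-app", its image, and
# the replica count are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # read cluster credentials from ~/.kube/config

container = client.V1Container(
    name="demo-app",
    image="demo-app:1.0",  # a container image holding one application component
    ports=[client.V1ContainerPort(container_port=8080)],
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="demo-app"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # Kubernetes keeps three copies running and reschedules failures
        selector=client.V1LabelSelector(match_labels={"app": "demo-app"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "demo-app"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Once the deployment is created, Kubernetes continuously reconciles the cluster toward this declared state, restarting or rescheduling containers as needed without the application being rebuilt.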
End state
Applications became much more resilient and reliable. The development process is also much more efficient and collaborative. Consequently, a large ecosystem of tools emerged to support DevOps.
Big Data and AI Era
Before State - Data warehouses
Early in the digital age, data was stored in data warehouses using relational databases. These are tightly coupled, well-organized data tables, which makes changes difficult. They are also not designed to store unstructured data.
Market Driver - Explosive data growth and data lake adoption
Mobile and IoT adoption generated vast amounts of data, both structured and unstructured. This data needed to be stored somewhere, and data warehouses could not scale to meet the demand. Data lakes were created to handle this extensive data volume, providing a place to dump data in its native format with limited organization. This made it difficult to access data and track its quality. Old ETL technologies were then adopted to move data in and out of data lakes.
This approach is similar to how the cloud era evolved: Applications were moved to the cloud, and old processes and technologies moved with them. The ETL process developed for data warehouses requires significant amounts of data engineering work and remains a popular way to move data between databases, data lakes, and analytical platforms.
Demand for Quality and Efficiency
The growth of ML and AI is driving demand for larger volumes of better-quality data. Improved AI performance and decision-making are becoming a competitive advantage, and access to high-quality data is a differentiator. Limited data engineering resources are also driving the need for a more efficient approach.
Revolutionary Technology Catalyst - Apache Iceberg Data Table Format
What is Iceberg
Iceberg is an open-source data table format with all the features required to support fundamental changes in data management. Its widespread adoption fits the evolutionary pattern of the previous Internet and cloud-native eras.
Disintegration - Iceberg offers an innovative approach to data table partitioning. Partitioning groups table rows into files so processing engines don't have to search a whole database to find data, only the appropriate files. This disintegration is similar to how TCP/IP breaks communications into packets and Docker packages code into containers: sections of a database can be accessed and acted upon independently of the entire database.
Previous partitioning approaches, such as Hive's, require users to define each partition. Iceberg's hidden partitioning removes the need to define partitions, making it much more dynamic and programmable (see the sketch after this list). This parallels Docker's relationship to the Linux containers that existed before it, where a more programmable approach drove rapid adoption.
Automated Middleware - Iceberg's metadata is separated from the underlying data, enabling centralized metadata management.
Interoperable - Standard data format and open metadata enable any data processing engine to access and manipulate data.
Open-source - Iceberg is an open-source data table format that is outside the control of any vendor.
Observability - The metadata layer keeps track of processes. Iceberg tracks changes to the data tables and can enable time travel and rollbacks.
Concurrent users - Iceberg takes snapshots of metadata before tables are changed. Because tables are partitioned, changes can be made to a single file instead of an entire table. If conflicts arise, the file can be rolled back independently of the table.
Evolutionary path - Schema evolution enables table schemas to be changed without disrupting ongoing operations.
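A minimal sketch of several of these characteristics, written in PySpark: it assumes a SparkSession named spark (Spark 3.3 or later) that is already configured with an Iceberg catalog called demo, and the database, table, and column names are hypothetical. It walks through hidden partitioning, schema evolution, and snapshot-based time travel in one short flow.

```python
# Assumes an existing SparkSession "spark" with an Iceberg catalog named "demo".
# Table, database, and column names below are illustrative only.

# Hidden partitioning: rows are grouped into daily files by event_ts,
# but queries never have to reference a partition column explicitly.
spark.sql("""
    CREATE TABLE demo.db.events (
        id       BIGINT,
        event_ts TIMESTAMP,
        payload  STRING)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

spark.sql(
    "INSERT INTO demo.db.events "
    "VALUES (1, TIMESTAMP '2024-05-01 10:00:00', 'clicked')"
)

# Schema evolution: add a column without rewriting data files or
# interrupting readers of the table.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (source STRING)")

# Observability: every commit is recorded as a snapshot in the metadata layer.
snapshots = spark.sql(
    "SELECT committed_at, snapshot_id, operation "
    "FROM demo.db.events.snapshots ORDER BY committed_at"
)
snapshots.show(truncate=False)

# Time travel: query the table as it existed at a specific snapshot.
first_snapshot_id = snapshots.first()["snapshot_id"]
spark.sql(
    f"SELECT * FROM demo.db.events VERSION AS OF {first_snapshot_id}"
).show()
```

Because every commit is simply a new snapshot in the metadata layer, readers keep querying a consistent view of the table while writers add data, change the schema, or roll back a bad change.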
End state
The ability of a single Iceberg table to support multiple users enables data to be analyzed in place. Since data does not need to be duplicated for analysis, teams can focus on a single source of truth and on ensuring its quality. The need to move data is eliminated, reducing the chance of errors and limiting the demand for data engineers.
Much like how DevOps practices evolved with Docker and Kubernetes, a much more collaborative data management process should emerge with the adoption of Iceberg. This will improve data quality and access. Better trust in data and higher quality will support better-performing AI that will revolutionize our lives, just like the innovations that created the Internet.
Recommendations
CIOs and CDOs can begin making changes now to prepare their data management strategies for the impact of Apache Iceberg.
1) Build expertise and an ecosystem around Iceberg. Iceberg's interoperability will enable new tools to be built on top of it, creating a new data tools ecosystem.
2) Build new processes to leverage more dynamic data.
3) Shift left. The Iceberg standard will accelerate the growth of tools and processes that improve data quality and catch data problems early in the pipeline.
4) Get ready for better AI built on higher-quality data.