Data Gravity & AI Flare Up the EDP Wars
Piyush Malik
LinkedIn Top Voice 2023 | Data, AppliedAI, Technology & Strategy | CXO | BOD Advisor | Entrepreneur | Analytics | Cloud
Do you remember the “I’m a Mac and I’m a PC” ad campaign at the height of the Microsoft vs. Apple tech rivalry?
Well, attending recent data conferences and watching the recent Snowflake vs. Databricks actions and market moves sent me down memory lane and reminded me of the bygone era of the personal computer OS wars of the 2000s.
What do I mean by that? I’ll get to that story later in this newsletter, but if you are in a hurry, here’s the TL;DR:
“Data has gravity and the fight for Enterprise Data Platform (EDP) supremacy has AI applications at its core with Databricks and Snowflake heating up that market”
The fight for enterprise data platform supremacy has AI at its core
With the rapid pace of innovation in AI over the past decade, and the fast-paced experimentation with GenAI since ChatGPT rolled out less than two years ago, one thing is clear: AI will eat all software and disrupt industries and our society in unimaginable ways.
Organizations of all sizes are adopting impactful use cases that leverage their foundational data assets to better serve their customers with AI, and the outlook is promising: per McKinsey and PwC research studies, AI could increase global GDP by a staggering $13 trillion to $15.7 trillion by 2030.
If we zero in on enterprise data and how it is stored and used to make the decisions that help organizations thrive, that sets the context for how data and AI are interrelated through a concept called Data Gravity.
Data gravity is the tendency of large data sets to attract smaller data sets, applications, and services. This is similar to how a planet's gravity pulls objects toward it, with the accumulation of data increasing its "gravitational pull". The benefits of data gravity include full visibility and larger data volumes. The more data is gathered in one place, the better teams can paint a full story and make informed decisions, and the bigger the data set available for AI to train on, which in turn supports better predictions. However, data gravity also presents challenges. As datasets grow, they become "heavy" and unwieldy, which makes migrating data from one source to another more complicated and expensive and makes management even harder. Data warehouses and data lakes are primary examples of data gravity.
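To make the "heavy data" point concrete, here is a minimal back-of-the-envelope sketch in Python. It is not from any vendor's calculator: the function names, the 10 Gbps link, the 80% link efficiency and the flat $0.09 per GB transfer charge are all assumptions chosen purely for illustration. It only shows that transfer time and cost grow in step with dataset size, which is why large datasets tend to stay put and pull applications toward them.

```python
# Illustrative only: link speed, efficiency and price are assumed values,
# not figures taken from any cloud provider's price list.


def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move `dataset_tb` terabytes over a `link_gbps` link."""
    bits_to_move = dataset_tb * 1e12 * 8  # decimal terabytes -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / 3600


def egress_cost_usd(dataset_tb: float, usd_per_gb: float = 0.09) -> float:
    """Estimate a transfer charge at a hypothetical flat per-gigabyte rate."""
    return dataset_tb * 1000 * usd_per_gb


if __name__ == "__main__":
    # Compare a small, a medium and a very large dataset on the same assumed link.
    for size_tb in (1, 100, 10_000):
        hours = transfer_hours(size_tb, link_gbps=10)
        cost = egress_cost_usd(size_tb)
        print(f"{size_tb:>6} TB: ~{hours:,.1f} hours, ~${cost:,.0f} to move")
```

Swapping in your own link speed and per-gigabyte rate gives a rough feel for when moving the data stops being practical and moving the compute to the data becomes the easier option.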
The backstory: The elusive search for “Single source of truth” in the enterprise
I started my career in data three decades ago, so let me share what I observed from the trenches. Data warehouses initially emerged in the 1980s as a solution for organizing structured business data in enterprises and as the repository of the “single source of truth” for the company. However, by 2010, organizations began accumulating a significant amount of unstructured data to support more varied use cases, such as big data and predictive and prescriptive analytics. To address this, data lakes were introduced as an open, scalable system for any type of data, and the distributed data processing software Hadoop became synonymous with that hype. The 4 V’s of data (Volume, Velocity, Variety and Veracity) were discussed in every presentation I saw or made. By 2015, it had become common for most organizations to operate both data warehouses and data lakes. This dual-platform approach, however, presented significant challenges in governance, security, reliability and management, as the fundamentals of building a solid data strategy and a layered data foundation architecture were often skipped.
Hyperscaler cloud technologies like AWS, GCP and Azure gathered steam in the “move to cloud” wave and soon thereafter (~10 years ago), Snowflake emerged as the leading choice of “data warehouse on the cloud”. Snowflake built its early success around a SQL-centric architecture and tight integration with BI tools, catering to data analysts and traditional IT departments with a closed, “it just works” solution.
Spark, the open-source software conceived at UC Berkeley's AMPLab, developed under the Apache Software Foundation and later commercialized by Databricks, initially stepped in to overcome the shortcomings of the Hadoop ecosystem. It offered an efficient in-memory distributed data processing system that was touted as up to 100X faster and supported the iterative experiments that machine learning and data science use cases demanded.
Rise of the Enterprise Data Architecture and EDPs
A few years later, the concept of the lakehouse to combine and unify the best of both data lakes and data warehouses was introduced by Databricks. Lakehouses store and govern all data in open formats, and natively support workloads ranging from BI to AI. Lakehouses offered a unified system to (1) query all data sources in an organization together and (2) govern all the workloads that use data (BI, AI, etc.) in a unified way. Lakehouse became its own category of data platform and is now widely adopted by enterprises and incorporated into most vendors' stacks.
Despite the progress, the data platforms available in the market still face several major challenges: a steep learning curve and technical skill barrier; data quality; skyrocketing costs and poor performance when a platform is mismanaged due to complexity; concerns about lineage, privacy, security and governance of globally distributed data, amplified by compliance mandates of regulations such as GDPR, HIPAA, CCPA and now the recent European AI Act; and finally the iterative tuning and engineering demands of emerging AI applications, which need deep, domain-specific knowledge of the data.
Many of these issues arise because current data platforms do not fundamentally understand the data in organizations and how it is used. Fortunately, generative AI presents a powerful new way to address exactly these challenges.
In essence, the impact of AI on data platforms will not be incremental, but fundamental: massively democratizing access to data, automating manual administration, and enabling turnkey creation of custom AI applications. All this will be enabled by a new wave of unified platforms that deeply understand an organization's data.
Besides the leaders Databricks and Snowflake, all cloud as well as legacy tech vendors including IBM, Oracle, Google, Amazon, Microsoft, Salesforce are vying for a stronger foothold in the Enterprise Data Platform market.
Market Moves (some public and some stealthy) by the two leaders
High-profile public conferences held by Snowflake and Databricks in the past couple of months showcased their strategies as well as the outcomes of their aggressive market moves, including M&A activity from last year. Here are some market round-up observations:
Key technology & strategy announcements by the leaders and their ecosystem impact
Each platform can be used for ingesting and analyzing huge volumes of data. An airline, for example, might try to understand which customers are most likely to cancel their flights based on ticket price, destination and weather patterns. The market for this kind of software is growing rapidly and is not entirely zero-sum: many companies use both Databricks and Snowflake for different types of work, while countless others still use older-generation tools that are the traditional replacement targets, according to data from market research firms.
The strategic maneuvers highlighted in the previous section underscore how AI is redrawing the battle lines in enterprise data infrastructure. Enterprises are increasingly demanding interoperability and portable compute.
For Databricks, with its open-source roots, this is a natural evolution. For Snowflake, it marks a major shift from its traditionally closed approach. Both are racing to adapt as value migrates up the stack toward dynamic systems of models and tools built on top of their offerings.
Let's look at the implications of some recent technology and strategy moves and announcements by Databricks:
1. Unity Catalog Metrics enhancements and Unity Catalog open-sourced: This simplifies data governance, opens up the platform for broader developer collaboration and accelerates development and innovation. It also leads to enhanced customization, flexibility and faster advancements. Key features include a unified data view, fine-grained access controls, automated data lineage tracking and comprehensive auditing. Unity now offers improved lineage capture and customization, surpassing Snowflake's Polaris Iceberg Catalog. Enhanced Unity metrics centralize metric definitions for consistent and governed business metrics that are accessible from various Databricks interfaces and integrate seamlessly with third-party tools.
2. GA of Lakehouse Federation and Monitoring: This improves data integration and real-time insights, enhancing governance and operational efficiency. It streamlines data management and supports real-time operational insights as well as cross-platform integration and governance.
3. Attribute-Based Access Control (ABAC): Fine-grained access control, dynamic policies and simplified security management, leading to enhanced security, compliance and scalability. Access permissions are detailed and based on user attributes, and policies can be adjusted dynamically. ABAC also brings easy integration with existing systems, improved regulatory compliance and better data governance.
4. Enhanced Serverless Offerings: These target Snowflake's ease-of-use customers with streamlined deployment and management. Serverless-only features arriving in 2025 highlight a shift, despite the open-source ethos.
5. Metadata and AI Integration: Lakehouse IQ is an AI-driven data catalog that enhances querying, search and documentation. Mosaic Integration embeds AI into data workflows for advanced analytics.
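For readers less familiar with the attribute-based access control idea mentioned in item 3, the following is a minimal, generic sketch in Python. It is not Unity Catalog's API or policy syntax: the `Policy` class, the `is_allowed` function, the tag names and the user attributes are all hypothetical, invented here only to show the general pattern of granting access by matching user attributes against tagged resources, with everything denied by default.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical rule: allow `action` on resources tagged `resource_tag`
    for users whose attributes match every entry in `required_user_attrs`."""
    action: str
    resource_tag: str
    required_user_attrs: dict


def is_allowed(user_attrs: dict, resource_tags: set, action: str, policies: list) -> bool:
    """Return True if at least one policy grants the action; deny by default."""
    for policy in policies:
        if policy.action != action:
            continue
        if policy.resource_tag not in resource_tags:
            continue
        # Every attribute the policy requires must be present and equal.
        if all(user_attrs.get(key) == value
               for key, value in policy.required_user_attrs.items()):
            return True
    return False


# Example policy set (made-up values): only EU compliance staff may read PII-tagged tables.
policies = [
    Policy(action="select", resource_tag="pii",
           required_user_attrs={"department": "compliance", "region": "EU"}),
]

eu_auditor = {"department": "compliance", "region": "EU"}
us_analyst = {"department": "analytics", "region": "US"}

print(is_allowed(eu_auditor, {"pii"}, "select", policies))
print(is_allowed(us_analyst, {"pii"}, "select", policies))
```

The appeal of this pattern is that changing who can see what means editing a policy or a user attribute once, rather than re-granting permissions table by table.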
Impact of GenAI on convergence of Data, Analytics and EDPs
As the generative AI revolution has accelerated, the lines between the once distinct domains of data processing and iterative modeling have blurred. Building generative AI applications requires the ability to manage and process data (a traditional analytics skill) along with the ability to experiment with and fine-tune models (a data science skill). The worlds of analytics and AI are rapidly converging.
Databricks anticipated this convergence early and bet big on its "lakehouse" architecture, as discussed previously. This AI-friendly approach can efficiently store and process massive amounts of structured and unstructured data. Snowflake, despite its success in BI, was slower to adapt to the rising importance of AI. As the market shifted towards AI-centric use cases, it found itself falling behind, with support only for structured and semi-structured data.
Conclusion: What's in it for the CDOs & CIOs?
In short, it is getting interesting to watch these two EDP leaders fight it out in the market, just like in the “I’m a Mac and I'm a PC” era I mentioned at the beginning. The competitive environment in EDPs is intensifying, both between Snowflake and Databricks and with first-party products from the mega cloud providers. While Databricks has a smaller revenue base, it is growing faster than Snowflake and is outperforming it for net-new IT spend in the enterprise (SQL workloads, AI, legacy migrations, etc.).
As AI reshapes the software world, it is a common belief that the leaders in every industry will be those who leverage data and AI deeply to power their organizations. EDPs will be a cornerstone for these organizations, enabling them to create the next generation of data and AI applications with quality, speed and agility.
Having been deeply embedded in the Data and Applied AI world for the past three decades with clients big and small around the world, I personally have a lot of war stories, “in the trenches” wounds and success stories from the frontlines of customer implementations. It's an exciting time to watch the transformation from the front seat and correlate it with the rear view.
Parting thoughts & key questions:
Do get in touch if you wish to discuss more, especially if #TheDigitalAgenda could be of help, from strategy to blueprinting to organizational enablement for the Data & AI driven future.
1. Dear reader, what are your thoughts on this?
2. Do you have similar experiences, or do you wish to collaborate?
(Please do share your comments, reshare with your network and subscribe to this newsletter if you have not already done so.)
1 month ago: In related news, Apache Iceberg further cemented its position in the industry this week. More here: https://www.theregister.com/2024/10/14/apache_iceberg_feature_announcements/
CEO @ Lighted Road AI | Insurtech | Data | AI/ML | Drive profitable growth in Medicare
2 months ago: Excellent article.
GenAI Research Scientist
2 months ago: How to process 1 trillion rows in mere seconds? Can Databricks do it? It is now a reality, see https://mltblog.com/3z71oeP
Program and Operations Manager * Data Governance * Data Quality * Supply Chain Management * Business Intelligence *
2 months ago: Very insightful, thanks Piyush.
Great article Piyush Malik, this is the perfect read for a Monday morning over a coffee. Delighted to hear you found the Chill Data Summit in San Francisco helpful. Thank you for sharing your photos and giving us a mention.