Evolution of the data tech stack
Given the announcements by both Snowflake and Databricks in the past two weeks, it may be helpful to recap how far we have come in data warehousing and analytics.
2000s
As organizations started to realize the business value of the data they had amassed, there was a need for large-scale, distributed computing paradigms to unlock that value.
This was the impetus for various initiatives, with Apache Hadoop being the most successful of the various open-source projects (with an initial release on Sep 4, 2007).
This was the basis for multiple vendors to offer commercialized offerings around Hadoop (the more successful ones being Cloudera, MapR, and Hortonworks). In addition, large-scale enterprise data warehouses continued to evolve, with the likes of Teradata and Netezza leading the charge. They offered appliances with dedicated hardware and proprietary software to provide scale-out capabilities for ever-growing structured data volumes. Adoption of these platforms was hindered by their complexity and by the level of custom development required in programming languages that most organizations lacked expertise in.
2010s
With the adoption of public cloud, the desire to decouple data storage from compute gained momentum. The likes of AWS EMR provided a more cost-effective alternative to the incumbents while offering a pay-as-you-go model. The integration complexity, as well as the development needed to provide a unified view, still existed though.
The open-source ecosystem around data lakes also gained traction as high-tech consumers such as Uber and Netflix sought to address shortcomings in scale and complexity. This resulted in multiple, seemingly redundant projects and efforts to address the technical challenges, as well as the governance considerations, posed by open-source data lakes. Nascent interoperability standards such as open file formats emerged, along with a new set of compute engines (e.g., Spark, and Presto, now Trino) to make use of them.
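To make that interoperability point concrete, below is a minimal PySpark sketch of the pattern these engines enabled: one engine writes an open file format (Parquet) to shared storage, and any Parquet-aware engine (Spark, Trino, etc.) can read the same files. The paths and column names are hypothetical, not from any specific deployment.

```python
# Minimal sketch: persist data in an open file format (Parquet) so that any
# Parquet-aware engine can consume it. Paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-formats-demo").getOrCreate()

# Write a small DataFrame to Parquet; in practice this would live on shared
# object storage (e.g., S3) rather than the local path used here.
rides = spark.createDataFrame(
    [("r1", "NYC", 12.5), ("r2", "SF", 30.0)],
    ["ride_id", "city", "fare"],
)
rides.write.mode("overwrite").parquet("/tmp/demo/rides")

# Any engine that speaks Parquet (Spark, Trino, etc.) can now read the same
# files; here Spark SQL plays the role of the downstream consumer.
spark.read.parquet("/tmp/demo/rides").createOrReplaceTempView("rides")
spark.sql("SELECT city, AVG(fare) AS avg_fare FROM rides GROUP BY city").show()
```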
2020s
To aid adoption of a unified stack, a new set of solution providers emerged offering a pre-integrated stack that addressed additional non-functional requirements such as fine-grained entitlements, job observability, and polyglot language support. The desire to ‘bundle’ services together has driven better interoperability and standardization across vendors.
Key capabilities that a Lakehouse architecture provides:
1. Support for multiple compute engines (e.g., Spark, Flink) to cover various use cases (e.g., streaming, batch)
2. Transactional support for data (i.e., ACID transactions; see the sketch after this list)
3. Primitives / tools to allow for pattern-based integration with third parties
4. Management of data pipelines at scale
5. SSO integration with leading identity providers (e.g., Okta / Azure AD)
6. Support for fine-grained entitlements (e.g., row-level access, field obfuscation, attribute-based permission models)
7. Schema enforcement (i.e., a systematic approach to filtering out non-conformant data; also shown in the sketch below)
8. Broader and more performant support for SQL semantics, including accessibility via leading business intelligence and visualization tools (e.g., Tableau, Microsoft Power BI)
9. True decoupling of data storage from compute
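As a concrete illustration of items 2 and 7, here is a hedged sketch of lakehouse table semantics using the open-source Delta Lake format with PySpark. It assumes the delta-spark package is installed, and all paths, table, and column names are hypothetical; other table formats such as Apache Iceberg and Apache Hudi offer similar guarantees.

```python
# Sketch of lakehouse table semantics (ACID writes and schema enforcement)
# using open-source Delta Lake with PySpark. Assumes `pip install delta-spark`;
# all paths and names below are illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-demo")
    # Delta Lake's documented session settings for Spark SQL support.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# ACID write: the commit either fully succeeds or is invisible to readers;
# concurrent readers never observe a half-written table.
orders = spark.createDataFrame(
    [(1, "open"), (2, "shipped")], ["order_id", "status"]
)
orders.write.format("delta").mode("overwrite").save("/tmp/lakehouse/orders")

# Schema enforcement: appending data whose schema does not match the table
# raises an error instead of silently admitting non-conformant rows.
bad_rows = spark.createDataFrame([("oops", 3.14)], ["wrong_col", "fare"])
try:
    bad_rows.write.format("delta").mode("append").save("/tmp/lakehouse/orders")
except Exception as err:  # Delta raises a schema-mismatch AnalysisException
    print(f"Write rejected by schema enforcement: {err}")
```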
Currently, only a few vendors provide an end-to-end ecosystem. Databricks leads, given its multi-cloud support and its delivery of some capabilities via an OSS model (perhaps more of an exit-strategy play), with AWS aligning its services around Lake Formation and the AWS Glue catalog. Snowflake has recently made some progress in opening up its proprietary offerings, though there is more to come in terms of the level of interoperability and the completeness of its integrations.
2026 and beyond
There is a lot of innovation in GenAI, though it is early innings for how the overall LLM workflow will be integrated into current data stacks. While there are some initial forays, most approaches are point solutions, and more general approaches remain reliant on feedback from early adopters. The hope is that we will see some of the initial outcomes in the next 18-24 months.