OpenAI acquires Rockset - a tribute to my friends at Rockset, coupled with personal insights on data processing strategies


I spent some time at AWS as a Database Specialist Solutions Architect focusing on strategic enablement for technology partners and independent software vendors. Along with many other responsibilities, the role involved developing strategic technical relationships with software vendors like Rockset. I vividly remember meeting Julie Mills and Kevin Leong for the first time at a coffee shop in San Francisco, a few months before the pandemic. After coffee, we headed to the AWS Loft in San Francisco for a brief discussion. Their passion and vision for real-time analytics were truly inspiring. Less than 30 minutes into the conversation, I already knew that they were destined for great things. That first meeting in San Francisco led to over 18 months of close collaboration with Rockset. During that time, I also had the pleasure of meeting and working with Shruti Bhat and Venkat Venkataramani.


Rockset: A 10,000 ft overview

Rockset is a real-time analytics database company founded in 2016 by Venkat Venkataramani and Dhruba Borthakur. It enables application developers to create fast APIs using SQL search, aggregation and join queries directly on semi-structured data for building data-intensive applications at scale. Rockset's serverless search and analytics engine empowers developers to build applications without the complexities of extensive data pipelines. It leverages the principles behind RocksDB, a high-performance key-value store known for fast read/write operations, which makes it well suited to real-time data processing. This translates to real-time indexing, scalable data retrieval, and SQL-based querying for live data streams across various industries.

Rockset's core strength lies in real-time data processing and analysis, and I suspect this is a critical function for OpenAI's AI models. Generally, large language models require substantial data inputs to deliver accurate and timely outputs. By integrating Rockset's technology, OpenAI could potentially improve its data ingestion, processing, and analysis, ultimately enhancing the performance of its AI systems like ChatGPT and others. OpenAI's acquisition of Rockset signifies a potential shift towards unified data management solutions. As asserted in Michael Stonebraker and Andrew Pavlo's 2024 SIGMOD Record paper, "What Goes Around Comes Around... And Around...", the pendulum has swung back towards relational databases for complex data models. Additionally, the inclusion of vector search capabilities (offered by both Rockset and Oracle in Oracle Database 23ai) also hints at a future where a single system can handle diverse data types and workloads.


Ongoing evolution of online transaction processing (OLTP) and my journey with Rockset & DynamoDB

A lot of the work my team and I did with Rockset was driven through joint workshops, technical content, and office hours. This collaboration provided valuable insights for both teams: Rockset gained a better understanding of AWS Database Strategy and Partner Network dynamics, while the AWS Service Teams got a deeper dive into Rockset's product and technical focus. These discussions helped Rockset improve their integration with Amazon DynamoDB and refine their approach within the AWS Partner Network, ultimately expanding their partnership and integration with additional AWS Services.


Rockset: The perfect querying tool for Amazon DynamoDB

Key-value stores like Amazon DynamoDB were not originally designed for analytics and support a limited number of query operators and indexes. This limitation often leads to the need for pairing DynamoDB with an analytics solution to provide comprehensive querying and real-time analytics capabilities. Rockset, with its real-time indexing and serverless architecture, stood out as an ideal partner for DynamoDB, especially for read-heavy workloads. Rockset's ability to automatically index data for fast search, aggregations, and joins at scale made it an exceptional fit for DynamoDB. During my time working with Rockset, I was continually impressed by how their solution transformed the capabilities of DynamoDB, particularly in environments requiring extensive read operations, such as gaming leaderboards.

Architecture diagram for building a Real-Time Analytics Dashboard with Amazon DynamoDB and Rockset

Rockset uses a built-in connector with the DynamoDB Streams API to keep data constantly in sync. Initially, DynamoDB tables are scanned linearly, and then Rockset switches to the Streams API to maintain a time-ordered queue of updates. This seamless integration allows game developers to avoid the complexities of building and managing their own synchronization mechanisms.
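
The scan-then-stream handoff described above can be sketched in a few lines of Python. This is a simplified, in-memory illustration and not Rockset's connector code: the StreamRecord fields, the function names, and the use of a plain dictionary as the "table" are all assumptions made for the example, and a real connector must also deal with writes that arrive while the initial scan is still running.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StreamRecord:
    """One change event. Field names are hypothetical, not the DynamoDB Streams schema."""
    sequence: int               # position in the time-ordered stream
    key: str                    # primary key of the changed item
    new_image: Optional[dict]   # None means the item was deleted


def initial_scan(table: Dict[str, dict]) -> Dict[str, dict]:
    """Phase 1: copy every item out of the source table, one after another."""
    return {key: dict(item) for key, item in table.items()}


def apply_stream(index: Dict[str, dict],
                 records: List[StreamRecord],
                 last_seen: int) -> int:
    """Phase 2: replay changes in sequence order, skipping ones already applied.

    Returns the highest sequence number applied so the caller can resume later.
    """
    for record in sorted(records, key=lambda r: r.sequence):
        if record.sequence <= last_seen:
            continue  # already applied; makes replays harmless
        if record.new_image is None:
            index.pop(record.key, None)
        else:
            index[record.key] = dict(record.new_image)
        last_seen = record.sequence
    return last_seen
```

Calling initial_scan once and then apply_stream repeatedly with the returned checkpoint keeps the copy in step with the source. Because records at or below the checkpoint are skipped, replaying the same batch twice leaves the index unchanged.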


One of the key advantages of Rockset is its suitability for read-heavy microservices. By offloading these read operations, developers gain greater flexibility since the data model used for writes does not need to be carried over for reads. This separation of concerns allows for more efficient data handling and quicker adaptations to new requirements. Rockset’s indexing capabilities mean that queries can be added or modified without being restricted by the initial data modeling in DynamoDB. This flexibility accelerates the development and deployment of new read-heavy microservices, enabling faster time-to-market for new features and improvements.


An important cautionary note!

During my tenure at AWS, I was constantly exploring the latest advancements in database technology, as I worked towards developing strategic relationships with key technology partners. While my collaboration with Rockset showcased the potential of pairing a specialized key-value store like Amazon DynamoDB with a robust analytics solution (CQRS Pattern), it also highlighted the inherent limitations of relying solely on specialized databases like key-value stores. During that time, there was also an infatuation with single-table design in some sections of the NoSQL community. This paradigm largely influenced the schema design used in the architecture described in this technical blog I co-wrote with Rockset.

Key-value stores like DynamoDB, while excellent for specific use cases, come with significant limitations, particularly when it comes to supporting complex queries and diverse workloads. A case in point is Uber, which migrated a trillion entries from DynamoDB to its custom-built LedgerStore. The challenges Uber faced underscored a critical point: specialized databases often necessitate complex workarounds and additional integrations to meet evolving business needs. Uber's migration experience is a cautionary tale for organizations relying heavily on specialized databases. The key issues included the inability to efficiently handle complex queries, the need for additional systems to manage data transformations, and the challenges in maintaining data consistency across multiple platforms. These limitations are not unique to Uber but are indicative of the broader challenges faced by enterprises relying on purpose-built databases.

This is where Oracle's converged database strategy shines. Unlike the purpose-built approach, Oracle's strategy is to provide a single, unified database that supports multiple data models, workloads, and use cases. This convergence eliminates the need for disparate systems and complex integrations, offering a more streamlined and efficient data management solution.

Enterprises pursuing digital transformation face the same challenges as internet-scale applications like Uber, with the additional burden of integrating new environments with existing systems and enabling data exchange across both. This is where the innovation of a converged database becomes essential. Traditional approaches using different single-purpose databases for each environment complicate this process due to varying operational, security, and performance profiles.

Oracle's Converged Database Architecture simplifies data-driven applications

A converged database is a multi-model, multi-tenant, multi-workload database that supports the specific data model and access method preferred by each development team, without unnecessary functionality. It excels in handling diverse workloads like OLTP, analytics, and IoT, providing the necessary consolidation and isolation for different teams.


Parallels between Rockset and Oracle Globally Distributed Database Architecture

Rockset’s architecture is designed for real-time analytics and scales horizontally through document-based sharding. Each document in a Rockset collection is mapped to an abstract entity called a microshard using a microshard mapping function based on the document ID. These microshards are then grouped into Rockset shards. Similarly, Oracle horizontally partitions data to distribute it across multiple nodes using sharding keys. This partitioning is crucial for managing large datasets and ensuring efficient data access in both systems.
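
To make the two-level mapping concrete, here is a minimal Python sketch of routing a document ID to a microshard and then to a shard. The hash function, the microshard and shard counts, and the modulo grouping are all assumptions chosen for illustration; the text above does not specify the mapping function Rockset actually uses.

```python
import hashlib

NUM_MICROSHARDS = 64  # assumed value, for illustration only
NUM_SHARDS = 4        # assumed value, for illustration only


def microshard_for(doc_id: str) -> int:
    """Map a document ID to a microshard with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_MICROSHARDS


def shard_for(microshard: int) -> int:
    """Group microshards into shards; a simple modulo grouping is assumed here."""
    return microshard % NUM_SHARDS


def route(doc_id: str) -> int:
    """Full path: document ID -> microshard -> shard."""
    return shard_for(microshard_for(doc_id))
```

One plausible reason for the extra level of indirection is that the document-to-microshard mapping can stay fixed while microshards are regrouped across shards as the cluster grows, so individual documents never need to be rehashed.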

Oracle Globally Distributed Database leverages sharding for horizontal partitioning of data across multiple independent physical databases, known as shards. Each shard operates independently, without sharing CPU, memory, or storage devices, which ensures fault containment and near-linear scalability. The sharding key, which is used to partition the workload, ensures that most transactions operate within a single shard, optimizing performance and efficiency.

Oracle Globally Distributed Database Architecture

Both Rockset and Oracle benefit from shard-level parallelism. In Rockset, splitting data into multiple shards allows each shard to be scanned in parallel, significantly speeding up query processing and preventing any single shard from becoming a bottleneck. Oracle leverages shard-level parallelism by distributing data across multiple shards, enabling multiple shards to process queries concurrently. This approach enhances performance by reducing latency and improving throughput, making both systems highly efficient in handling large-scale data operations. Oracle's architecture also implements Raft replication, which introduces the concept of replication units. When Raft replication is enabled, a sharded database contains multiple replication units. A replication unit (RU) is a set of chunks that have the same replication topology. Each RU has multiple replicas placed on different shards. The Raft consensus protocol is used to maintain consistency between the replicas in case of failures, network partitioning, message loss, or delay.
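
The shard-level parallelism described above follows a scatter-gather shape: each shard computes a partial result over its own data, and a coordinator merges the partials. The Python sketch below illustrates only that shape, using in-memory lists as stand-in shards and a thread pool as the coordinator. The function names and the "score" field are hypothetical, and neither Rockset nor Oracle is implemented this way internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def scan_shard(rows: List[dict], min_score: int) -> Tuple[int, int]:
    """Runs on one shard: return (count, sum) for rows at or above min_score."""
    matching = [row["score"] for row in rows if row["score"] >= min_score]
    return len(matching), sum(matching)


def scatter_gather(shards: List[List[dict]], min_score: int) -> Tuple[int, int]:
    """Send the same scan to every shard in parallel, then merge the partials."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda rows: scan_shard(rows, min_score), shards))
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_count, total_sum
```

Counts and sums merge cleanly because they are additive across shards. An average would be derived from the merged count and sum rather than by averaging per-shard averages, which would weight small shards incorrectly.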

Ensuring high availability and fault tolerance is a cornerstone of both architectures. Rockset allows multiple replicas of the same shard to maintain data accessibility even in the event of hardware failures. Oracle’s architecture also includes replication mechanisms to ensure high availability. Each shard can have multiple replicas, and automated failover processes help maintain data accessibility and integrity during failures. This redundancy is vital for mission-critical applications, providing robustness and reliability in both systems. Oracle’s globally distributed architecture also supports global data distribution and data sovereignty, ensuring that data is stored and managed according to local regulations and requirements. This capability is essential for enterprises operating in multiple regions with varying data privacy laws.


Conclusion

The acquisition of Rockset by OpenAI marks a significant milestone in the database landscape, reflecting the growing importance of scalable data solutions for unlocking the power of artificial intelligence. The lessons learned from working with Rockset and observing industry trends, such as Uber's data management challenges with Amazon DynamoDB, highlight the advantages of leveraging a converged database strategy. Oracle's approach of providing a single, unified database that supports multiple data models, workloads, and use cases eliminates the complexities associated with managing disparate systems and integrations. This strategy not only simplifies data architecture but also enhances performance, scalability, and availability, making it a superior solution for modern enterprises.

To my friends at Rockset (acquired by OpenAI), congratulations on this remarkable achievement. Acquisitions of this scale come with their own set of challenges, particularly in integrating different technological stacks. Careful planning and execution will be crucial. We are cheering you on as you join forces with OpenAI to work towards achieving the ultimate goal of data processing: Artificial General Intelligence!



Disclaimer

The information provided in this article is intended for information purposes only. The views and opinions shared are solely mine, and this article may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Chukwuemeka Aduba

Technical Operations Manager (Databases) at AWS ZA

4 months ago

Nice article. Thanks for sharing!

Kirill Panov

Tech Program Manager | Reinforcing & Advancing Engineering Projects with Tech Know-How | HealthTech, Data-Obsessed, Solution Design

4 months ago

Great article, Kehinde Otubamowo! I really liked the clean architecture diagrams! I also really liked how you connected the dots, shining the light on the whole industry (Oracle, Amazon, Rockset, Uber, etc). It’s awesome that Rockset helps with read-heavy workloads. That was a highlight for me. And wow, really nice explanation of shard-level parallelism. You’ve talked about converged databases before (I think it was you or your colleague at Oracle), but I think this article really hit it home with the real-world examples. Let’s see what the future holds!

Shruti Bhat

CPO & CMO. Data & AI. Startup advisor. Ex-Oracle, ex-VMware

4 months ago

Thanks for the write up Kehinde Otubamowo. Fantastic working with you.

Velimir Radanovic

Architect, Development Manager, Product Manager, Developer

4 months ago

Smart move!

