HarperDB: An underdog SQL / NoSQL database
HarperDB flies in the face of conventional wisdom in a number of ways. But does the world need another database?
Start coding a new database in your garage with a buddy of yours. Use JavaScript. Name it after your dog. Patent your own data model. Do not go open source. Take over the world.
That is HarperDB's recipe for success. It seems so unlikely, it makes you wonder whether it's genius or just crazy.
THE EXPLODED DATA MODEL
Stephen Goldberg and Kyle Bernhardy do not seem like crazy people. They have long-standing experience in enterprise consulting, and this is precisely what got them started on HarperDB.
Goldberg and Bernhardy liked the scale and ease of use of NoSQL, but still wanted ANSI SQL for actionable analytics. They wanted the ability to perform multi-table joins and multi-condition statements.
Them, and pretty much everyone else in the database world. The convergence of SQL and NoSQL solutions is something that has been going on for a while. A typical way to deal with this requirement is multi-model databases. But Goldberg and Bernhardy decided to take a different approach.
They felt multi-model was inherently flawed as a design pattern, were frustrated by the performance of data lakes and map reduce solutions, and wanted something that would be ACID compliant.
They thought a single model was needed to accommodate all of the above, so they went ahead and created what they call the exploded data model, which is also the basis of their patent.
The exploded data model is the patented approach the creators of HarperDB came up with to accommodate both SQL and NoSQL. Image: HarperDB
In the exploded data model, each attribute from a JSON object, or column from a SQL insert/update statement, becomes an index upon write. These attributes and their values are stored discretely on disk.
Goldberg and Bernhardy say this avoids the need to configure foreign keys and indexes, and allows indexing every attribute/column without increasing disk footprint, as they neither store the entire record whole nor keep separate index tables.
Upon search, parallelization is used to coalesce the data back into an object based on which columns are requested. This, Goldberg and Bernhardy note, has the added benefit of allowing joins to be as performant as a single table search:
"Our data model allows for both read and write concurrently at high throughput. Each attribute transaction is discrete, and we don't experience row locking or need in-memory transformation, which often plague database solutions and cause them to fail under HTAP scenarios."
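To make the idea more concrete, here is a minimal in-memory sketch in Python of what storing each attribute separately and coalescing on read could look like. It is an illustration only, not HarperDB's implementation or API: the class `ExplodedTable`, the `key_attribute` parameter, and the sample sensor records are all hypothetical, and the sketch ignores disk layout, parallelization, and concurrency entirely.

```python
from collections import defaultdict


class ExplodedTable:
    """Toy illustration of per-attribute storage; all names are hypothetical."""

    def __init__(self, key_attribute="id"):
        self.key_attribute = key_attribute
        # attribute -> {record id -> value}: each attribute is kept on its own,
        # so the record is never stored whole.
        self.values = defaultdict(dict)
        # attribute -> {value -> set of record ids}: every attribute is
        # searchable as soon as it is written (values must be hashable here).
        self.index = defaultdict(lambda: defaultdict(set))

    def insert(self, record):
        record_id = record[self.key_attribute]
        for attribute, value in record.items():
            self.values[attribute][record_id] = value
            self.index[attribute][value].add(record_id)

    def search(self, attribute, value, columns):
        # Find matching ids through the attribute's own index, then rebuild
        # objects from only the requested columns.
        ids = sorted(self.index[attribute].get(value, ()))
        return [
            {col: self.values[col][rid] for col in columns if rid in self.values[col]}
            for rid in ids
        ]


sensors = ExplodedTable()
sensors.insert({"id": 1, "device": "thermostat", "room": "lab"})
sensors.insert({"id": 2, "device": "camera", "room": "lab", "fps": 30})
print(sensors.search("room", "lab", ["id", "device"]))
```

In this sketch a search touches only the attributes it needs, which is one way to read the claim that records are coalesced "based on which columns are requested". Whether the real system behaves this way cannot be confirmed from the available information.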
NO SCHEMA, NO MAINTENANCE
It sounds less crazy now, although at first glance their approach of coalescing data does not seem completely different from the multi-model approach. Evaluating this would require either access to their implementation and patent, or benchmarks against competing solutions, and we have neither.
What still sounds a little crazy though is to take on established solutions, in a market as crowded as the database market. Goldberg and Bernhardy say they are not trying to compete against entrenched solutions, but rather work alongside them and augment them.
That is part of the reason why they are launching today focusing on IoT, as they note there are a lot of greenfield projects, which need new architectural patterns to see success and to scale.
They also target working alongside traditional SQL data warehouses as a sidecar, providing SQL capability in real time for unstructured data via their JDBC driver, or making column/row data from SQL databases available to applications that were designed to interact with JSON.
HarperDB specifically targets IoT use cases, due to its small footprint and the fact that IoT is a relatively new field. Image: HarperDB
HarperDB is advertised as schema-less and configuration-free. Goldberg and Bernhardy clarify it is more accurate to say that HarperDB has a dynamic schema, while "no configuration" refers to the fact that no configuration for columns, foreign keys, data types, or indexes is needed.
HarperDB has the concept of schemas, tables, and attributes. Schemas and tables only provide namespaces for finding attributes and creating logical collections. Attributes are reflexively created on insert/update and do not have data types, but the ODBC and JDBC drivers sample data to suggest data types in BI tools.
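As a rough illustration of what a dynamic schema means in practice, the Python sketch below creates attributes as a side effect of inserts and guesses a type by sampling stored values. It is a simplified assumption about the general technique, not HarperDB's code or driver logic: `DynamicTable`, `suggest_type`, and the `sample_size` parameter are hypothetical names.

```python
from collections import Counter


class DynamicTable:
    """Toy dynamic-schema table; all names are hypothetical."""

    def __init__(self):
        self.attributes = set()  # grows as new attributes appear in inserts
        self.rows = []

    def insert(self, record):
        # No column definitions up front: unseen attributes are simply added.
        self.attributes.update(record)
        self.rows.append(dict(record))

    def suggest_type(self, attribute, sample_size=100):
        # Sample the first rows that carry the attribute and report the most
        # common Python type name, or None if the attribute never appears.
        sample = [row[attribute] for row in self.rows[:sample_size] if attribute in row]
        if not sample:
            return None
        counts = Counter(type(value).__name__ for value in sample)
        return counts.most_common(1)[0][0]


readings = DynamicTable()
readings.insert({"id": 1, "temp": 21.5})
readings.insert({"id": 2, "temp": 22.0, "unit": "C"})
print(sorted(readings.attributes))
print(readings.suggest_type("temp"))
```

The second insert introduces `unit` without any schema change, and the type suggestion is only a hint derived from the data seen so far, which matches the article's description of attributes having no declared data types.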