MemSQL 6.5: NewSQL with autonomous workload optimization, improved data ingestion and query execution speed
MemSQL wants to be the world's best database. Leading that race is a tall order, but the new version seems to improve on an already strong offering.
Is it SQL, NoSQL, or NewSQL?
Even though it's just a naming convention, MemSQL is usually thought of as part of the NewSQL camp. That is, databases grounded in the relational SQL model yet adopting some of the "knobs" the NoSQL crowd has introduced for people to turn in order to achieve horizontal scalability.
It's a balancing act between consistency and performance, one which, in the end, applies to any distributed database. NewSQL folks argue you should not have to ditch schema, or SQL, to get the scale-out benefits that NoSQL brings. If you like your schema and SQL, and you also like your database to have analytical as well as transactional capabilities, then MemSQL is probably on your radar.
Today, MemSQL is announcing version 6.5, introducing autonomous workload optimization and bringing improved data ingestion and query execution speed. ZDNet had a Q&A with MemSQL CEO Nikita Shamgunov on what the new version does and how.
Autonomous workload optimization
What autonomous workload optimization means in practice is that some of the tasks typically attended to by a DBA in MemSQL deployments are now handled, to some extent, by automation.
MemSQL uses a shared-nothing, distributed, in-memory processing architecture. Nodes are organized in two layers -- leaf nodes and aggregators -- with the latter splitting and routing workloads and the former executing them.
In MemSQL 6.5, explains Shamgunov, workload management was added to handle unpredictable spikes in queries more gracefully while still maintaining high performance under normal load. Workload management can be broken into three components that work together to address cases of heavy load: detection, prediction, and management.
MemSQL wants to take the lessons learned from NoSQL and apply them to SQL. (Image: MemSQL)
Detection refers to identifying when any node is struggling. MemSQL distinguishes between memory used for table data and memory used temporarily by queries to determine whether it is safe to continue forwarding more queries to the target nodes.
Prediction refers to estimating the resource usage of queries and classifying them into groups accordingly. In MemSQL 5.8, management views were introduced to allow users to see resource usage statistics of previously run queries. Workload management can also use these statistics to determine how expensive a query is from a memory consumption perspective.
The last component is management, which admits queries in three tiers. The cheapest queries, such as single-partition selects, inserts, or updates, skip the queue entirely. Queries that use a moderate amount of resources are queued on a FIFO (first-in, first-out) basis and released at a rate that depends on the highest load among the leaves.
The most expensive queries are queued with a cap on concurrency, and their execution is coordinated across all the aggregator nodes. Since this coordination has an associated cost, only the most resource-intensive queries fall into this category.
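To make the prediction step a bit more concrete, below is a minimal sketch of the kind of query a DBA might run against the management views mentioned above to spot the most memory-hungry query shapes. The view and column names (mv_activities_cumulative, mv_queries, memory_bs) are our reading of MemSQL's documented management views and should be treated as assumptions rather than verified 6.5 behavior:

    -- Sketch: surface the most memory-hungry query shapes via the
    -- management views used for workload profiling.
    -- View and column names are assumptions based on MemSQL's docs.
    SELECT q.query_text,
           a.memory_bs,        -- cumulative memory use (byte-seconds)
           a.cpu_time_ms,      -- cumulative CPU time
           a.run_count         -- how many times this query shape ran
    FROM information_schema.mv_activities_cumulative AS a
    JOIN information_schema.mv_queries AS q
      ON a.activity_name = q.activity_name
    ORDER BY a.memory_bs DESC
    LIMIT 10;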
When asked to compare MemSQL's autonomous workload optimization capabilities with those of other databases, such as Oracle, Datastax, and ScyllaDB, Shamgunov noted that MemSQL has capabilities similar to Oracle's Automatic Memory Management utility:
"MemSQL offers slightly more granular options such as managing pre-configured query pools while Oracle offers a more broad total memory management utility. Datastax and ScyllaDB offer simplified memory management configurations for allocating memory to entire applications but do not offer lower level configurations based on query profile or query pool."
Pipelines and stored procedures
In a previous discussion with Gary Orenstein, MemSQL's former CMO, one of the things we talked about was MemSQL's positioning in a real-time data ingestion/transformation pipeline. MemSQL introduced its own Pipelines feature for data ingestion in version 5.5, and Orenstein mentioned that MemSQL's capabilities are such that some MemSQL clients choose not to use solutions like Spark or Flink, but rather ingest data directly using MemSQL.
MemSQL Pipelines architecture. Now Pipelines is enhanced with the ability to execute stored procedures. (Image: MemSQL)
Shamgunov notes that MemSQL 6.5 introduces pipelines to stored procedures, augmenting the existing MemSQL Pipelines data flow by providing the option to replace the default Pipelines load phase with a stored procedure:
"The default Pipelines load phase only supports simple insertions into a single table, with the data being loaded either directly after extraction or following an optional transform.
Replacing this default loading phase with a stored procedure opens up the possibility for much more complex processing, providing the ability to insert into multiple tables, enrich incoming streams using existing data, and leverage the full power of MemSQL Extensibility.
Leveraging both a transform and a stored procedure in the same Pipeline allows you to combine the third-party library support of a traditional transform alongside the multi-insert and data-enrichment capabilities of a stored procedure."
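To ground this in syntax, here is a minimal sketch of a Pipeline feeding a stored procedure, based on MemSQL's CREATE PIPELINE ... INTO PROCEDURE form. The Kafka broker, topic, table, and procedure names are made up for illustration, and the batch handling is deliberately simplified:

    -- Hypothetical schema: a raw event log plus a per-day rollup.
    CREATE TABLE raw_events (user_id BIGINT, event_type TEXT, ts DATETIME);
    CREATE TABLE daily_event_counts (day DATE, event_type TEXT, cnt BIGINT);

    -- Stored procedure that receives each pipeline batch as a QUERY-typed
    -- parameter and fans it out to two tables (the "multi-insert" case).
    -- In an interactive client you may need to change the statement
    -- delimiter around the procedure body.
    CREATE OR REPLACE PROCEDURE load_events(
        batch QUERY(user_id BIGINT, event_type TEXT, ts DATETIME))
    AS
    BEGIN
        INSERT INTO raw_events (user_id, event_type, ts)
            SELECT user_id, event_type, ts FROM batch;
        INSERT INTO daily_event_counts (day, event_type, cnt)
            SELECT DATE(ts), event_type, COUNT(*) FROM batch
            GROUP BY DATE(ts), event_type;
    END;

    -- Pipeline that ingests from a (hypothetical) Kafka topic and hands
    -- each batch to the procedure instead of loading into a single table.
    CREATE PIPELINE events_pipeline AS
        LOAD DATA KAFKA 'kafka-broker:9092/events'
        INTO PROCEDURE load_events;

    START PIPELINE events_pipeline;

In a real deployment, the rollup insert would typically be an upsert so counts accumulate across batches; that detail is omitted to keep the sketch short.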
While data streaming and ingestion to support real-time insights is an essential feature, adding, configuring, and operating the streaming engines required to support this increases operational complexity. Whether Pipelines can match the capabilities of dedicated streaming engines is a nuanced discussion in its own right.
Assuming, however, that the transformations supported by Pipelines are enough to cover the scenarios prospective MemSQL users are interested in, flattening system architecture even beyond Kappa sounds like a strong selling point.
Resource optimization improvements for multi-tenant deployments
With multi-cloud and hybrid cloud strategies being a reality for most organizations, MemSQL has had a strategy in place for this since last year. But while using MemSQL in multi-cloud and hybrid cloud environments is possible, cross-querying is not -- it entails moving data around. That is a rather high price to pay, in more ways than one.
We asked Shamgunov whether anything has changed in the meantime, and what the improvements for multi-tenant deployments are about. While the operation of MemSQL in multi-cloud and hybrid cloud environments does not seem to have changed, Shamgunov pointed to the need many organizations have to support database-as-a-service workloads:
"This is a model where multiple organizations have data living side by side to reduce management costs but each organization requires a strong namespace and security boundary around its data. Organizations also often want to leverage the existing ecosystem of tooling available on top of the data layer. This is a challenge to do using legacy databases or NoSQL systems.
With MemSQL 6.5, we further enhanced the multi-tenant capabilities by optimizing memory utilization to deliver dramatically more tenants with fewer hardware resources. With the improvement, internal tests were able to configure nearly thousands of tenants with significantly fewer machine resources than the incumbent single node database solution."
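For readers unfamiliar with the pattern, the namespace and security boundary Shamgunov describes is commonly realized as a database per tenant, each with its own user and grants. The sketch below only illustrates that general pattern with made-up names; it is not a MemSQL-prescribed setup, and the 6.5 improvements concern how efficiently many such tenants share memory, not this DDL:

    -- Sketch: database-per-tenant layout, one namespace and one user per
    -- tenant. All names and the placeholder passwords are illustrative.
    CREATE DATABASE tenant_acme;
    CREATE USER 'acme_app'@'%' IDENTIFIED BY 'change-me';
    GRANT SELECT, INSERT, UPDATE, DELETE ON tenant_acme.* TO 'acme_app'@'%';

    CREATE DATABASE tenant_globex;
    CREATE USER 'globex_app'@'%' IDENTIFIED BY 'change-me';
    GRANT SELECT, INSERT, UPDATE, DELETE ON tenant_globex.* TO 'globex_app'@'%';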