Optimizing projections in Vertica
Coupole de la Samaritaine à Paris, France. Credit: Patrice Borne

Optimizing projections in Vertica

Years ago, I used to work at Vertica, the columnar database company. During my time there, I experimented with the database engine to see how it would behave with different workloads. One of those experiments was about multiplying sparse matrices, to leverage the columnar storage design as well as the parallelism of the engine itself when running on a cluster of machines.

I won't go into the details of how Vertica does what it does as there is already a lot of information on the web in general and on Vertica's website at Vertica.com in particular.

A sparse matrix contains mostly values set to 0 and is an excellent candidate to store in Vertica due to its capability to compress the data on disk.

This article is about optimizing the execution time of a SQL query in Vertica and how to improve performance by starting with the design of projections specifically crafted to help the engine be more efficient and thus, run faster.

Table Structure

Vertica runs on Linux and standard x86 hardware (64-bit only) and is built from the ground up to run on multiple machines in a cluster. The data is stored in columns and is compressed to reduce the storage footprint and the I/O cost when retrieving the data to answer a query.

Vertica is a relational database that uses SQL, like every other relational database.

As a first pass, the table structure to get started looks like this:

No alt text provided for this image
Initial definition of a table to store matrices

The ID field is used to be able to store more than one Matrix into the table. At this point in the design, we don’t try to optimize the physical storage and we are simply going to load into the default super-projection created by the engine when we insert the first tuple. We will only store non-zero values.

Multiplying matrices with SQL

Let’s assume that we have two matrices stored into our Matrix table. In order to know which is which, we identify them with a unique ID. We call C the matrix that is the product of these two matrices: C=AB.

By definition, each element in C is calculated with the following formula:

No alt text provided for this image
Formula to multiply matrices

Based on the formula above, it is clear that when either Aik?or Bkj?equals 0, there is no need to do the multiplication and the addition. This fits perfectly with our approach to only store non-zero values.

To retrieve the non-zero values from A, we use this query:

No alt text provided for this image
Retrieve the non-zero values for matrix 'A'

To retrieve the non-zero values from B, we use this query:

No alt text provided for this image
Retrieve the non-zero values for matrix 'B'

And to generate C=AB, we use this query:

No alt text provided for this image
SQL Query to multiply matrices

Inserting test data

We’d like to check what the database engine does with the query above. Instead of loading massive matrices to get started, we can use small matrices and see what happens. We are going to create the following matrices A and B and use the query above to calculate C=AB.

No alt text provided for this image
Sample sparse matrices

Based on the table definition above, these two matrices are created with the following insert statements. We only insert non-zero values.

No alt text provided for this image
Insert non-zero values into the matrix table

Let’s run the query below:

No alt text provided for this image
Query to compute C=AB with sample matrices

Result:

No alt text provided for this image
Result of the multiplication of sample matrices, C=AB

Let’s do the multiplication “by hand” and see if we get the same results:

No alt text provided for this image
Compute C=AB "by hand"

The query is working fine on small matrices, so we can continue with our design.

Execution Plan on one node

Let’s compute statistics first. These statistics are used by the Vertica optimizer to decide on what execution plan to run, just like any other database engine.

No alt text provided for this image
Compute statistics on all columns using 100% of the data

Then, we can ask the database what the execution plan on one node is:

No alt text provided for this image
Ask for the execution plan on one node

Result:

No alt text provided for this image
Execution plan on one node

Based on the plan above, there is one step that could limit the scalability of this approach. The hash join operator (in red above) needs to build a hash table in memory, based on the inner set, then, stream the outer set to do the lookup in the hash table.

This operator needs memory (potentially a lot when working with large matrices) to store its hash table and cannot start generating output data until the hash table is completely built in memory. This means that on top of taking time to do its job, it will introduce latency, potentially significant for large matrices.

Fortunately, Vertica is able to use another operator when the data is sorted the right way. We can have Vertica use a “MERGE JOIN” operator, which doesn’t need to build a hash table, and is able to produce output data right away. This operator doesn’t need a lot of memory to run and its latency is negligible.

Optimizing Projections

Without explicit projection creation, the system uses default super projections that are not encoded and the sorting is done on all fields by default. To see the existing default super projection on the table, we can ask Vertica as follows:

No alt text provided for this image
Ask for the definition of default super projection on the Matrix table

Result:

No alt text provided for this image
Definition of the super projection on the Matrix table

Clearly, there is no encoding done on the different fields stored in the table. The sort order is using all fields in the order they appeared during the table creation. The segmentation clause only applies when we have more than one node in the cluster.

In order to be able to use a “MERGE JOIN” operator, we need to have the data sorted so that the inner and outer portions of the join operator are sorted on the join key(s). In our case, we are joining like this:

No alt text provided for this image
Join key used in the query

The join condition above is using two different fields (C and R), which means the data cannot be sorted on both fields with just one projection. Our choice at this point is to either create two projections on a single table or to create two separate tables so we can have each one store its data into separate projections that are sorted differently.

This constraint is a direct result of how matrices are multiplied (cross product between a row and a column) and we have to design for it.

Considering that we intend to use this approach to multiply very large matrices, separating matrices out seems to be the best choice to save on physical storage at the cost of having to do a bit more work when writing the query.

If we were to use a single table with two projections, we would then store each Matrix twice, using different sorts. However, we are interested in calculating the product AB. The product BA may not even make sense, if the dimensions are not compatible. Breaking down the design into two separate tables helps us save disk space.

Create two tables

We are going to create two tables that will store data based on the position of the matrix in the multiplication i.e. left side and right side. In our case, when we calculate C=AB, we will say that A is on the left side and B on the right side.

No alt text provided for this image
Create two tables to store matrices

Now, let’s insert the test matrices we used above:

No alt text provided for this image
Insert sample data into the two tables

We need to have statistics on those tables too:

No alt text provided for this image
Ask Vertica to compute statistics on the tables, using 100% of the data

Let’s rewrite the query to use both tables.

No alt text provided for this image
New query to multiply matrices using two tables

Result:

No alt text provided for this image
Result of the multiplication C=AB

And let’s check the execution plan:

No alt text provided for this image
Execution plan using two tables

We are back to where we were with just one table. Let’s look at the default projections on those tables:

No alt text provided for this image
Ask Vertica to generate the definition of the default super projection on MatrixLeft

Result:

No alt text provided for this image
Definition of the default super projection on MatrixLeft

Let's get the DDL for the other table:

No alt text provided for this image
Ask Vertica to generate the definition of the default super projection on MatrixRight

Result:

No alt text provided for this image
Definition of the default super projection on MatrixRight

Let’s not worry about the segmentation clause at this point since we are still working on a single node. Let’s look into the "order by" clause before we look into the encodings per column.

We are going to create new projections on those tables. Note the different sort orders.

No alt text provided for this image
Change the sort order of the default super projection for MatrixLeft
No alt text provided for this image
Change the sort order of the default super projection for MatrixRight

Then, we ask Vertica to refresh the projections:

No alt text provided for this image
Refresh the projections

Ask Vertica to drop the default super projections:

No alt text provided for this image
Drop the default super projections

Let’s look at the execution plan again, this time, we should see a difference because we have new super projections defined with a different sort order.

No alt text provided for this image
Execution plan with new super projections

We now have projections sorted to specifically accommodate the join condition between the two sets and the optimizer decides to use a “MERGE JOIN” instead of a “HASH JOIN”, which is precisely what we wanted.

Loading Data

Typical File Format

Over the years, standard formats to store and exchange the content of matrices have emerged. A definition of the different formats can be found at this link:

Since this article is about using Vertica to multiply sparse matrices, we will use the “Matrix Market Exchange Formats” only because it is simpler to parse and can be directly mapped to our table structure.

Sample Matrix #1

This first example is using a sample Matrix file found at:

Its header is as follows:

No alt text provided for this image
Header in the file storing a sample matrix

This is a symmetrical matrix and it is square. Since it is a square matrix, we are going to multiply it by itself, in effect, we are going to compute C=AA=A^2.

After the header, the content of the file is straightforward and maps exactly our table structure. Here are the first three lines in the file:

No alt text provided for this image
Sample data found in the sample matrix file

To load this data, we are going to use the COPY command, which allows us to define a mapping from a text file to a table structure. For this exercise, we are not going to try to leverage as much parallelism as possible since this data file is very small (about 8.5 Mbytes uncompressed.) We are going to load the data into MATRIXLEFT only.

Since the matrix is symmetrical, only the lower half plus the diagonal is represented in the file. This means that we will have to do an INSERT … SELECT to store the other half of the matrix in the table.

Once this is done, we will simply copy all the data from MATRIXLEFT into MATRIXRIGHT to reuse what we built so far and do some more optimization work.

Truncate the tables

There is no commit needed since truncate is a DDL statement. Truncating makes sure we remove the sample data we tested with before loading the file.

No alt text provided for this image
Truncate the tables before loading the data

Load the data from the matrix file with the COPY command

The copy command is the way Vertica loads data in bulk. It can load a gzipped file directly, by uncompressing it on the fly.

No alt text provided for this image
COPY command used to load the matrix file in bulk

Check the number of tuples in the table:

The file format is simple enough. There shouldn’t be any entry rejected. A simple count of the number of tuples loaded will allow us to do a validation check.

No alt text provided for this image
Numer of rows loaded into the table

Check the number of rows in the source file:

No alt text provided for this image
Number of lines in the file we just loaded

The 14-line difference is the header.

Let’s look at the diagonal of the matrix we just loaded:

No alt text provided for this image
Count the number of non-zero values on the diagonal of the matrix

Populate the upper half of the matrix

The data file contains only half of the matrix values since it is symmetrical. This way, the file size can be halved by not storing the entire matrix. We now need to populate the other half in the table.

No alt text provided for this image
Populate the upper half of the matrix

A quick check shows that 448,799 + 90,597 = 539,396.

Populate MATRIXRIGHT

We need to do the same thing for the other table that stores the matrix used on the right side of the multiplication so we can leverage the sorting order of the super projection.

No alt text provided for this image
Store the data for matrix A into the MatrixRight table

Re-encoding of the projections

Vertica offers an API that will sample the data in the columns to determine the best encoding algorithm to use in order to get the best compression ratio. The reason we do this is because different encodings and compressions lead to different compression ratios, since there is no best universal encoding or compressing algorithm. Depending on the data, some encodings will do a better job than others.

First, let's see what the current situation is before optimizing the encodings.

No alt text provided for this image
Ask Vertica about the disk footprint before optimization

Result:

No alt text provided for this image
Disk footprint before optimization

This gives us a starting point so we can measure later on how good of a job the optimization did. Let’s ask Vertica to re-encode these projections for us. The following two function calls will schedule background jobs that will sample the columns in the projections and generate a DDL file ready to run.

Be sure to wait for the first one to be done before starting the second one, or you may get an error like this:

No alt text provided for this image
Typical error when scheduling more than one background job

Let's use the Database Designer (DBD) to analyze the best encodings to use for our new projections:

No alt text provided for this image
Run the database designer (DBD) to determine the best encodings per column

Here’s the generated DDL for an encoded super-projection for the table MatrixLeft:

No alt text provided for this image
Result of running DBD to re-encode columns in the super projection for MATRIXLEFT

Here’s the generated DDL for an encoded super-projection for the table MatrixRight:

No alt text provided for this image
Result of running DBD to re-encode columns in the super projection for MATRIXRIGHT

Once this is done, we run those two scripts and recompute statistics.

No alt text provided for this image
Apply the changes suggested by DBD

Let’s now check the result of these new encodings on a single node:

No alt text provided for this image
Ask Vertica about the disk footprint after optimization

This is the size needed to store our matrices after encoding.

No alt text provided for this image
Disk footprint after optimization

This looks much better now since we went from 2,840,193 bytes on disk to 430,215 bytes, per projection.

Here’s the result for a 4-node cluster:

No alt text provided for this image
Ask Vertica about the disk footprint after optimization on 4 nodes

Result:

No alt text provided for this image
Disk footprint after optimization on 4 nodes

Measuring performance on a single node

Hardware used

These tests were run in a Linux Virtual Machine running on a laptop with an i7 quad core at 2.2GHz, i.e. a low performance platform.

Measures

Let’s ask Vertica to measure how long queries run:

No alt text provided for this image
Turn on the timing of queries in vsql

Let’s run the query again:

No alt text provided for this image
Compute the result of C=AA with timing and "order by"

Let’s run it again without the order by:

No alt text provided for this image
Compute the result of C=AA with timing and without "order by"

Clearly, the “order by” clause is expensive as it makes the runtime jump from about 10.2 seconds to almost 19 seconds. We will now run the query with and without "order by" to compare the sorted and unsorted execution times.

Testing on a Desktop with Linux running?on bare metal (no VM)

Let's now run more tests on a desktop directly, without using any virtualization layer. Those tests were run multiple times to make sure they would stay consistent and not be subject to wide variations based on what the system might be doing concurrently.

Timing without “order by”

No alt text provided for this image
Timing of the multiplication C=AA on a single node, without "order by"

Even though sorting the result set is usually not needed, let's see the difference when we add "order by" to the query.

Timing with “order by”

No alt text provided for this image
Timing of the multiplication C=AA on a single node, with "order by"

As expected, sorting the result set takes more time than not sorting it, but the difference is not as dramatic as when running in a virtual machine on a laptop.

Scaling up to multiple nodes

Let's now see what happens when we scale our experiment on a 4-node cluster (16 physical cores each)

First, let's see how long it takes to run the multiplication query with “order by” on 4 nodes. The query was run four times to make sure it is consistent. No other query was running concurrently.

No alt text provided for this image
Timing of the multiplication on 4 nodes with "order by"

Let's measure again on 4 nodes without “order by”

No alt text provided for this image
Timing of the multiplication on 4 nodes without "order by"

As expected, sorting the result set takes more time than not sorting it but the difference between sorting and not sorting is getting even slimmer when running on a 4-node cluster.

Conclusion and further experimentation

This experiment has shown how we can optimize projections to accelerate the work done by Vertica to run queries. However, there is more work that can be done to leverage as many CPU cores as possible and optimize the different pools of memory.

In general, to optimize queries in Vertica, just like many other databases, the objective is to organize the data to help the engine be more efficient. In this case less is better. The less data there is to retrieve from storage at runtime and the less computation the engine has to do, the faster it goes.

Since Vertica is a distributed system that runs on multiple machines in a cluster, the other optimization that helps a lot is the distribution of the data across the nodes, to limit the need to move data over the network between machines.

Those are the next steps in optimization that will be the subject of another article.

要查看或添加评论,请登录

Patrice Borne的更多文章

  • Oracle workload migration – Building a Transpiler

    Oracle workload migration – Building a Transpiler

    Executive Summary This document continues the exploration of automating the migration out of Oracle. Our exploration…

  • Oracle workload migration

    Oracle workload migration

    Executive Summary The database market is large ($46B in 2019 according to Gartner) and Oracle is one of the major…

社区洞察

其他会员也浏览了