Optimizing projections in Vertica
Years ago, I used to work at Vertica, the columnar database company. During my time there, I experimented with the database engine to see how it would behave with different workloads. One of those experiments was about multiplying sparse matrices, to leverage the columnar storage design as well as the parallelism of the engine itself when running on a cluster of machines.
I won't go into the details of how Vertica does what it does as there is already a lot of information on the web in general and on Vertica's website at Vertica.com in particular.
A sparse matrix contains mostly values set to 0 and is an excellent candidate to store in Vertica due to its capability to compress the data on disk.
This article is about optimizing the execution time of a SQL query in Vertica, starting with the design of projections specifically crafted to help the engine work more efficiently and thus run faster.
Table Structure
Vertica runs on Linux and standard x86 hardware (64-bit only) and is built from the ground up to run on multiple machines in a cluster. The data is stored in columns and is compressed to reduce the storage footprint and the I/O cost when retrieving the data to answer a query.
Vertica is a relational database that uses SQL, like every other relational database.
As a first pass, the table structure to get started looks like this:
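A minimal sketch of that structure, assuming integer indices and a float value (the rest of the article relies only on the column names ID, R, C, and V):

    CREATE TABLE Matrix (
        ID INTEGER NOT NULL,  -- identifies which matrix a tuple belongs to
        R  INTEGER NOT NULL,  -- row index
        C  INTEGER NOT NULL,  -- column index
        V  FLOAT   NOT NULL   -- non-zero value stored at position (R, C)
    );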
The ID field lets us store more than one matrix in the table. At this point in the design, we don’t try to optimize the physical storage; we simply load into the default super-projection created by the engine when we insert the first tuple. We will only store non-zero values.
Multiplying matrices with SQL
Let’s assume that we have two matrices stored in our Matrix table. In order to know which is which, we identify them with a unique ID. We call C the matrix that is the product of these two: C = AB.
By definition, each element in C is calculated with the following formula:
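C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

where n is the number of columns of A, which must equal the number of rows of B.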
Based on the formula above, it is clear that when either A_{ik} or B_{kj} equals 0, there is no need to perform the multiplication or the addition. This fits perfectly with our approach of only storing non-zero values.
To retrieve the non-zero values from A, we use this query:
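Assuming A was loaded with ID = 1, something like:

    SELECT R, C, V
    FROM Matrix
    WHERE ID = 1;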
To retrieve the non-zero values from B, we use this query:
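    SELECT R, C, V
    FROM Matrix
    WHERE ID = 2;  -- same query, with B assumed to be stored under ID = 2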
And to generate C=AB, we use this query:
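A sketch of that query (the join pairs A(i,k) with B(k,j), and the GROUP BY produces one output cell per (row, column) pair; only combinations where both inputs are non-zero ever reach the SUM):

    SELECT a.R AS R, b.C AS C, SUM(a.V * b.V) AS V
    FROM Matrix a
    JOIN Matrix b ON a.C = b.R   -- pair A(i,k) with B(k,j)
    WHERE a.ID = 1               -- a is matrix A
      AND b.ID = 2               -- b is matrix B
    GROUP BY a.R, b.C
    ORDER BY R, C;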
Inserting test data
We’d like to check what the database engine does with the query above. Instead of loading massive matrices to get started, we can use small matrices and see what happens. We are going to create the following matrices A and B and use the query above to calculate C=AB.
Based on the table definition above, these two matrices are created with the following insert statements. We only insert non-zero values.
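For instance, taking A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]] as illustrative 2x2 matrices (the original figures are not reproduced here):

    -- Matrix A (ID = 1): a21 = 0 is not stored
    INSERT INTO Matrix VALUES (1, 1, 1, 1);
    INSERT INTO Matrix VALUES (1, 1, 2, 2);
    INSERT INTO Matrix VALUES (1, 2, 2, 3);
    -- Matrix B (ID = 2): b12 = 0 is not stored
    INSERT INTO Matrix VALUES (2, 1, 1, 4);
    INSERT INTO Matrix VALUES (2, 2, 1, 5);
    INSERT INTO Matrix VALUES (2, 2, 2, 6);
    COMMIT;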
Let’s run the multiplication query shown above.
Result:
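With the illustrative matrices above, the output is:

     R | C | V
    ---+---+----
     1 | 1 | 14
     1 | 2 | 12
     2 | 1 | 15
     2 | 2 | 18
    (4 rows)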
Let’s do the multiplication “by hand” and see if we get the same results:
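C_{11} = 1×4 + 2×5 = 14
C_{12} = 1×0 + 2×6 = 12
C_{21} = 0×4 + 3×5 = 15
C_{22} = 0×0 + 3×6 = 18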
The query is working fine on small matrices, so we can continue with our design.
Execution Plan on one node
Let’s compute statistics first. These statistics are used by the Vertica optimizer to decide on what execution plan to run, just like any other database engine.
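In Vertica, this is done with the ANALYZE_STATISTICS function:

    SELECT ANALYZE_STATISTICS('Matrix');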
Then, we can ask the database what the execution plan on one node is:
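    EXPLAIN
    SELECT a.R AS R, b.C AS C, SUM(a.V * b.V) AS V
    FROM Matrix a
    JOIN Matrix b ON a.C = b.R
    WHERE a.ID = 1 AND b.ID = 2
    GROUP BY a.R, b.C;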
Result:
Based on the plan above, there is one step that could limit the scalability of this approach: the HASH JOIN operator, which needs to build a hash table in memory from the inner set and then stream the outer set to probe that hash table.
This operator needs memory (potentially a lot when working with large matrices) to store its hash table and cannot start generating output data until the hash table is completely built in memory. This means that on top of taking time to do its job, it will introduce latency, potentially significant for large matrices.
Fortunately, Vertica is able to use another operator when the data is sorted the right way. We can have Vertica use a “MERGE JOIN” operator, which doesn’t need to build a hash table, and is able to produce output data right away. This operator doesn’t need a lot of memory to run and its latency is negligible.
Optimizing Projections
Without explicit projection creation, the system uses default super projections that are not encoded and the sorting is done on all fields by default. To see the existing default super projection on the table, we can ask Vertica as follows:
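    -- EXPORT_OBJECTS prints the DDL of the table, including its projections
    SELECT EXPORT_OBJECTS('', 'Matrix');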
Result:
Clearly, there is no encoding done on the different fields stored in the table. The sort order is using all fields in the order they appeared during the table creation. The segmentation clause only applies when we have more than one node in the cluster.
In order to be able to use a “MERGE JOIN” operator, we need to have the data sorted so that the inner and outer portions of the join operator are sorted on the join key(s). In our case, we are joining like this:
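    ON a.C = b.R   -- the column index of the left side must match the row index of the right side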
The join condition above is using two different fields (C and R), which means the data cannot be sorted on both fields with just one projection. Our choice at this point is to either create two projections on a single table or to create two separate tables so we can have each one store its data into separate projections that are sorted differently.
This constraint is a direct result of how matrices are multiplied (cross product between a row and a column) and we have to design for it.
Considering that we intend to use this approach to multiply very large matrices, separating matrices out seems to be the best choice to save on physical storage at the cost of having to do a bit more work when writing the query.
If we were to use a single table with two projections, we would store each matrix twice, once per sort order. However, we are only interested in calculating the product AB; the product BA may not even be defined if the dimensions are not compatible. Breaking the design into two separate tables therefore saves disk space.
Create two tables
We are going to create two tables that will store data based on the position of the matrix in the multiplication, i.e., the left side and the right side. In our case, when we calculate C = AB, we will say that A is on the left side and B on the right side.
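The two tables have the same structure as before (the types are the same assumptions as in the first sketch):

    CREATE TABLE MatrixLeft (
        ID INTEGER NOT NULL,
        R  INTEGER NOT NULL,
        C  INTEGER NOT NULL,
        V  FLOAT   NOT NULL
    );

    CREATE TABLE MatrixRight (
        ID INTEGER NOT NULL,
        R  INTEGER NOT NULL,
        C  INTEGER NOT NULL,
        V  FLOAT   NOT NULL
    );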
Now, let’s insert the test matrices we used above:
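One way to do it, assuming the original Matrix table is still populated:

    INSERT INTO MatrixLeft  SELECT * FROM Matrix WHERE ID = 1;  -- A
    INSERT INTO MatrixRight SELECT * FROM Matrix WHERE ID = 2;  -- B
    COMMIT;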
We need to have statistics on those tables too:
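    SELECT ANALYZE_STATISTICS('MatrixLeft');
    SELECT ANALYZE_STATISTICS('MatrixRight');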
Let’s rewrite the query to use both tables.
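    SELECT a.R AS R, b.C AS C, SUM(a.V * b.V) AS V
    FROM MatrixLeft a
    JOIN MatrixRight b ON a.C = b.R
    WHERE a.ID = 1 AND b.ID = 2
    GROUP BY a.R, b.C
    ORDER BY R, C;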
Result:
And let’s check the execution plan:
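    EXPLAIN
    SELECT a.R AS R, b.C AS C, SUM(a.V * b.V) AS V
    FROM MatrixLeft a
    JOIN MatrixRight b ON a.C = b.R
    WHERE a.ID = 1 AND b.ID = 2
    GROUP BY a.R, b.C;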
We are back to where we were with just one table. Let’s look at the default projections on those tables:
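    SELECT EXPORT_OBJECTS('', 'MatrixLeft');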
Result:
Let's get the DDL for the other table:
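    SELECT EXPORT_OBJECTS('', 'MatrixRight');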
Result:
Let’s not worry about the segmentation clause at this point since we are still working on a single node. Let’s look into the "order by" clause before we look into the encodings per column.
We are going to create new projections on those tables. Note the different sort orders.
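For example (the projection names and segmentation expressions are illustrative; what matters is that each side is sorted on its join key):

    -- Left side: sorted on C, its join key
    CREATE PROJECTION MatrixLeft_C (ID, R, C, V) AS
        SELECT ID, R, C, V
        FROM MatrixLeft
        ORDER BY C, R
        SEGMENTED BY HASH(C) ALL NODES;

    -- Right side: sorted on R, its join key
    CREATE PROJECTION MatrixRight_R (ID, R, C, V) AS
        SELECT ID, R, C, V
        FROM MatrixRight
        ORDER BY R, C
        SEGMENTED BY HASH(R) ALL NODES;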
Then, we ask Vertica to refresh the projections:
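    SELECT REFRESH('MatrixLeft');
    SELECT REFRESH('MatrixRight');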
Ask Vertica to drop the default super projections:
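Assuming the default names reported by EXPORT_OBJECTS earlier:

    DROP PROJECTION MatrixLeft_super;
    DROP PROJECTION MatrixRight_super;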
Let’s look at the execution plan again, this time, we should see a difference because we have new super projections defined with a different sort order.
We now have projections sorted to specifically accommodate the join condition between the two sets and the optimizer decides to use a “MERGE JOIN” instead of a “HASH JOIN”, which is precisely what we wanted.
Loading Data
Typical File Format
Over the years, standard formats to store and exchange the content of matrices have emerged. The different formats are described on the NIST Matrix Market site (https://math.nist.gov/MatrixMarket/formats.html).
Since this article is about using Vertica to multiply sparse matrices, we will use the “Matrix Market Exchange Formats” only because it is simpler to parse and can be directly mapped to our table structure.
Sample Matrix #1
This first example is using a sample Matrix file found at:
Its header is as follows:
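The comment lines are specific to each file, but a Matrix Market coordinate file always starts with a standard banner line, optional % comments, and a size line, in this general shape:

    %%MatrixMarket matrix coordinate real symmetric
    % ... comment lines describing the matrix ...
    M N nnz    (number of rows, number of columns, number of stored entries)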
This matrix is symmetric, and therefore square. Since it is square, we can multiply it by itself; in effect, we are going to compute C = AA = A^2.
After the header, the content of the file is straightforward and maps exactly to our table structure: each line is a “row column value” triple describing one non-zero entry.
To load this data, we are going to use the COPY command, which allows us to define a mapping from a text file to a table structure. For this exercise, we are not going to try to leverage as much parallelism as possible since this data file is very small (about 8.5 MB uncompressed). We are going to load the data into MATRIXLEFT only.
Since the matrix is symmetrical, only the lower half plus the diagonal is represented in the file. This means that we will have to do an INSERT … SELECT to store the other half of the matrix in the table.
Once this is done, we will simply copy all the data from MATRIXLEFT into MATRIXRIGHT to reuse what we built so far and do some more optimization work.
Truncate the tables
There is no commit needed since truncate is a DDL statement. Truncating makes sure we remove the sample data we tested with before loading the file.
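    TRUNCATE TABLE MatrixLeft;
    TRUNCATE TABLE MatrixRight;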
Load the data from the matrix file with the COPY command
The COPY command is the way Vertica loads data in bulk. It can load a gzipped file directly, uncompressing it on the fly.
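A sketch of such a load (the file path is hypothetical; SKIP 14 skips the header lines, and the COPY expression “ID AS 1” assigns a constant matrix ID since the file only carries row/column/value triples):

    COPY MatrixLeft (ID AS 1, R, C, V)
    FROM '/tmp/matrix.mtx.gz' GZIP
    DELIMITER ' '
    SKIP 14
    DIRECT;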
Check the number of tuples in the table:
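    SELECT COUNT(*) FROM MatrixLeft;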
The file format is simple enough. There shouldn’t be any entry rejected. A simple count of the number of tuples loaded will allow us to do a validation check.
Check the number of rows in the source file:
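For example, counting the lines of the gzipped file directly (hypothetical file name):

    zcat matrix.mtx.gz | wc -l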
The 14-line difference is the header.
Let’s look at the diagonal of the matrix we just loaded:
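    SELECT R, C, V
    FROM MatrixLeft
    WHERE R = C
    ORDER BY R
    LIMIT 10;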
Populate the upper half of the matrix
The data file contains only half of the matrix values since it is symmetrical. This way, the file size can be halved by not storing the entire matrix. We now need to populate the other half in the table.
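A sketch of that mirroring (swap R and C for every off-diagonal entry):

    INSERT INTO MatrixLeft
    SELECT ID, C, R, V   -- swap row and column indices
    FROM MatrixLeft
    WHERE R <> C;        -- the diagonal must not be duplicated
    COMMIT;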
A quick check shows that 448,799 + 90,597 = 539,396.
Populate MATRIXRIGHT
We need to do the same thing for the other table that stores the matrix used on the right side of the multiplication so we can leverage the sorting order of the super projection.
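    INSERT INTO MatrixRight SELECT * FROM MatrixLeft;
    COMMIT;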
Re-encoding of the projections
Vertica offers an API that samples the data in the columns to determine the encoding algorithm yielding the best compression ratio. We do this because different encodings lead to different compression ratios; there is no universally best encoding or compression algorithm. Depending on the data, some encodings will do a better job than others.
First, let's see what the current situation is before optimizing the encodings.
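One way to measure the on-disk footprint is the projection_storage system table:

    SELECT anchor_table_name, projection_name, SUM(used_bytes) AS used_bytes
    FROM v_monitor.projection_storage
    WHERE anchor_table_name ILIKE 'matrix%'
    GROUP BY anchor_table_name, projection_name
    ORDER BY anchor_table_name, projection_name;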
Result:
This gives us a starting point so we can measure later on how good of a job the optimization did. Let’s ask Vertica to re-encode these projections for us. The following two function calls will schedule background jobs that will sample the columns in the projections and generate a DDL file ready to run.
Be sure to wait for the first one to be done before starting the second one, or you may get an error like this:
Let's use the Database Designer (DBD) to analyze the best encodings to use for our new projections:
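A sketch of those calls (the output paths are hypothetical, and the trailing parameters, deploy and reanalyze-encodings, are assumptions to check against the documentation):

    SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS('MatrixLeft',  '/tmp/MatrixLeft_encodings.sql',  FALSE, TRUE);
    -- wait for the call above to complete, then:
    SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS('MatrixRight', '/tmp/MatrixRight_encodings.sql', FALSE, TRUE);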
Here’s the generated DDL for an encoded super-projection for the table MatrixLeft:
Here’s the generated DDL for an encoded super-projection for the table MatrixRight:
Once this is done, we run those two scripts and recompute statistics.
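From vsql, assuming the output paths used above:

    \i /tmp/MatrixLeft_encodings.sql
    \i /tmp/MatrixRight_encodings.sql
    SELECT ANALYZE_STATISTICS('MatrixLeft');
    SELECT ANALYZE_STATISTICS('MatrixRight');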
Let’s now check the result of these new encodings on a single node:
This is the size needed to store our matrices after encoding.
This looks much better: we went from 2,840,193 bytes on disk to 430,215 bytes per projection, roughly a 6.6x reduction.
Here’s the result for a 4-node cluster:
Result:
Measuring performance on a single node
Hardware used
These tests were run in a Linux virtual machine on a laptop with a quad-core i7 at 2.2 GHz, i.e., a low-performance platform.
Measures
Let’s ask Vertica to measure how long queries run:
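In vsql, timing is toggled with:

    \timing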
Let’s run the query again:
Let’s run it again without the order by:
Clearly, the “order by” clause is expensive: it makes the runtime jump from about 10.2 seconds to almost 19 seconds. From here on, we will run the query both with and without “order by” to compare the sorted and unsorted execution times.
Testing on a desktop with Linux running on bare metal (no VM)
Let's now run more tests on a desktop directly, without using any virtualization layer. Those tests were run multiple times to make sure they would stay consistent and not be subject to wide variations based on what the system might be doing concurrently.
Timing without “order by”
Even though sorting the result set is usually not needed, let's see the difference when we add "order by" to the query.
Timing with “order by”
As expected, sorting the result set takes more time than not sorting it, but the difference is not as dramatic as when running in a virtual machine on a laptop.
Scaling up to multiple nodes
Let's now see what happens when we scale our experiment to a 4-node cluster (16 physical cores each).
First, let's see how long it takes to run the multiplication query with “order by” on 4 nodes. The query was run four times to make sure it is consistent. No other query was running concurrently.
Let's measure again on 4 nodes without “order by”
As expected, sorting the result set takes more time than not sorting it but the difference between sorting and not sorting is getting even slimmer when running on a 4-node cluster.
Conclusion and further experimentation
This experiment has shown how we can optimize projections to accelerate the work done by Vertica to run queries. However, there is more work that can be done to leverage as many CPU cores as possible and optimize the different pools of memory.
In general, to optimize queries in Vertica, just like in many other databases, the objective is to organize the data to help the engine be more efficient. In this case, less is better: the less data there is to retrieve from storage at runtime and the less computation the engine has to do, the faster it goes.
Since Vertica is a distributed system that runs on multiple machines in a cluster, the other optimization that helps a lot is the distribution of the data across the nodes, to limit the need to move data over the network between machines.
Those are the next steps in optimization that will be the subject of another article.