DB Queries - Index Optimization

In this article we will discuss the problems that indexes solve, give a short overview of how an index is implemented, cover when (and when not) to use one, and walk through a small example. The example uses MySQL, but the index concepts apply to all databases, relational and non-relational.

Let's start with a use case. Say we work at a company that sells products to clients online. We have an example orders table that looks like this:

[Image: the example orders table - six rows, with user_id, date and total columns]

Our boss assigned us a task: get the total sum of orders made in 2013. A basic query to achieve our goal would look like this: select sum(total) from orders where year(date) = 2013;

Let's use the explain command (almost every DB nowadays has some sort of explain command for queries) to see how the DB engine executes our query, get some idea of how much it costs, and check whether we can optimize it.

explain select sum(total) from orders where year(date) = 2013;

[Image: explain output - type: ALL, key: NULL, rows: 6]

The columns we are currently interested in from the explain result are:

1. type - the scan type needed for our query. It can be one of three: a full table scan, meaning the engine will have to iterate through the entire table to fetch the result of our query; a range scan, meaning the engine will have to iterate through only a subset of our table; or a specific (point) lookup, meaning the engine knows exactly where to find our result. The last type is possible only when the query result depends on a single row, and it means our query cannot be optimized further.

2. key - the index being used while getting the query result.

3. rows - the approximate number of rows the engine estimates it will have to iterate over to get the query result (yes, DB engines are rather smart, and they keep interesting statistics up to date).

There are other interesting columns in the explain result, like possible_keys (indexes that our DB considered using to execute the query), which can point to wrong index usage on our side, but we won't go deeper into that in this article.
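To make this concrete, here is a minimal sketch of inspecting a query plan from code. It uses Python's built-in sqlite3 instead of MySQL (SQLite's equivalent of explain is EXPLAIN QUERY PLAN, and its output columns differ from MySQL's), and the table contents are made-up illustrative values, not the article's exact data:

```python
import sqlite3

# Illustrative schema and rows; the column names follow the article's example,
# the values are assumptions made up for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT, total REAL)"
)
rows = [
    (1, 1, "2013-03-01 10:00:00", 100.0),
    (2, 2, "2013-05-12 11:30:00", 250.0),
    (3, 1, "2013-07-04 09:15:00", 80.0),
    (4, 3, "2013-11-20 16:45:00", 40.0),
    (5, 2, "2014-01-02 12:00:00", 60.0),
    (6, 3, "2012-06-30 08:00:00", 90.0),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# SQLite has no year() function, so strftime('%Y', ...) plays the same role here.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT sum(total) FROM orders WHERE strftime('%Y', date) = '2013'"
).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SCAN orders" -- a full table scan, no index used
```

The detail text "SCAN" in SQLite's plan corresponds roughly to MySQL's type: ALL, a full table scan.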

In our explain output, we can see that our DB engine does a full table scan (type: ALL), doesn't use any index (key: NULL), and estimates that it will iterate over six rows (rows: 6), meaning the entire table. Currently that's as expected - nothing fancy here. We asked for the total sum of orders made in 2013, and our DB engine tells us exactly how it gets it: it fetches row by row and asks - is the year of this order 2013? If yes, add its total to the sum; if not, move on to the next row.

Currently our orders table contains only six rows, and our DB will handle this query just fine. But as our company grows, and millions of orders are made each year, that query will not run so fast - the engine will have to go through all the orders from all years to get our result, even though it only needs the orders made in the year we asked about, 2013. A slow query means lost time, which means lost money, which leads to an unhappy boss. That doesn't look good for us, the developers who store that data and query it.

So, can this query be optimized? Sure! As said in the last paragraph, our DB engine should not fetch all the rows in the orders table just to give us the total sum of a single year. So what if we could tell our engine to fetch only the subset of our table containing orders made in 2013, and sum their totals? How could this be done?

Imagine we stored our orders sorted by the date they were made. That way, we wouldn't have to go through all the orders to sum the totals of the orders made in 2013 [O(n) complexity]; we could use a very famous algorithm that is faster and works well on sorted values - binary search [O(log(n)) complexity]. Our engine would find the orders made in 2013 much faster and fetch only them in order to get our result.
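As a toy illustration of that idea (not how a real engine is implemented), here is the same lookup done with Python's bisect module over a list kept sorted by date; the order values are made up for the demo:

```python
from bisect import bisect_left

# Orders as (date, total) pairs kept sorted by date -- a toy stand-in
# for a sorted on-disk structure. Values are illustrative assumptions.
orders_by_date = sorted([
    ("2012-06-30", 90.0),
    ("2013-03-01", 100.0),
    ("2013-05-12", 250.0),
    ("2013-07-04", 80.0),
    ("2013-11-20", 40.0),
    ("2014-01-02", 60.0),
])
dates = [d for d, _ in orders_by_date]

# Binary search finds the boundaries of the 2013 range in O(log n),
# instead of checking every order's year in O(n).
lo = bisect_left(dates, "2013-01-01")
hi = bisect_left(dates, "2014-01-01")

# Only the rows inside the range are touched when summing.
total_2013 = sum(total for _, total in orders_by_date[lo:hi])
print(total_2013)  # 470.0
```

The two binary searches locate the first and last 2013 order, and the sum then only visits the rows in between.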

This is exactly how an index works! Say we index the date column: our DB will maintain another data structure in addition to our table, ordered by date, allowing O(log(n)) search on the date column. This data structure will not contain all the columns, though - only the indexed columns, along with a reference to where the full row lives in the table. DB engines commonly use the B-tree data structure, as it is optimized for inserting and reading large blocks of data and supports binary-search-style lookups. Other data structures can be used in certain databases to store and maintain indexes, like hash tables (optimized for exact-value lookups) or bitmaps (optimized for low-cardinality columns, like booleans), but we won't go deeper into that in this article.

So... let's create an index on the date column of our orders table and watch the optimization take place using the explain command! First, to create the index we will use: create index orders_date_index on orders(date); Before executing the explain command on our query again, let's think about what we expect the result to be. First, we expect the type column to show a range scan, as the 2013 orders are a subset of our table (no need to scan the full table). Second, we expect the key column to show orders_date_index (the name of the index we created). Lastly, we expect the rows column to show 4, as only four orders were made in 2013.

explain select sum(total) from orders where year(date) = 2013;

[Image: explain output - unchanged: type: ALL, key: NULL, rows: 6]

After running the explain command on our query again, we still see no change in the type, key and rows columns. We didn't expect that! Any idea why that may be happening? Let's have another look at our where clause: where year(date) = 2013; As we can see, we apply the MySQL year function to the date column and put our condition on its result. While we indeed created an index on the date column, we didn't create an index on the year function applied to date. That would be a functional index, which only exists in certain databases. One way around it would be to create another column called year and index that column, but that would duplicate our data, not to mention the complication of maintaining the year field. This is a hint that we are querying our table the wrong way. What is the right way, you ask? The SQL language offers the between operator just for cases like this. Let's run the explain query again, this time with a between condition.

explain select * from orders where date between '2013-01-01 00:00:00' and '2013-12-31 23:59:59'; (the between operator is inclusive at both ends).
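The same effect can be demonstrated from code. This sketch uses Python's built-in sqlite3 rather than MySQL (with SQLite's strftime standing in for MySQL's year function), but the planner behaves analogously: wrapping the column in a function defeats the index, while a bare range condition uses it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT, total REAL)"
)
conn.execute("CREATE INDEX orders_date_index ON orders(date)")

def plan(sql):
    # Return the human-readable detail column of SQLite's query plan.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# A function applied to the column hides it from the index -> full scan.
func_plan = plan(
    "SELECT sum(total) FROM orders WHERE strftime('%Y', date) = '2013'"
)

# A plain range condition on the raw column lets the engine use the index.
range_plan = plan(
    "SELECT * FROM orders "
    "WHERE date BETWEEN '2013-01-01 00:00:00' AND '2013-12-31 23:59:59'"
)

print(func_plan)   # e.g. ['SCAN orders']
print(range_plan)  # mentions orders_date_index, e.g. 'SEARCH ... USING INDEX orders_date_index'
```

The second plan switches from SCAN to an index SEARCH, which is the SQLite analogue of MySQL's type going from ALL to range.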

[Image: explain output - type: range, key: orders_date_index, rows: 4]

Voila! We optimized our query! We can see that we are doing a range scan, using orders_date_index, and fetching only 4 rows while executing our query. Way to go! Boss is happy :)

Important notice - at a large scale, our DB might not have chosen to use this index, because it would not be efficient to read each result row from the disk just to get the total column. Batch-reading the full table can be more efficient than iterating through the index and, for each matching row, performing an extra disk read to fetch total. A good way to solve this: add total to our index alongside date, creating a multiple-column index, so we skip that extra disk read. This is often called a covering index (meaning all of our selected columns are inside the index), and it is worth taking into consideration when creating an index. It is also why we try to avoid select * when not necessary!
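A covering index is easy to see in a query plan. This sketch again substitutes Python's sqlite3 for MySQL; SQLite explicitly labels a plan COVERING INDEX when the index alone can answer the query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT, total REAL)"
)
# Index both the filter column (date) and the selected column (total),
# so the query can be answered from the index without touching table rows.
conn.execute("CREATE INDEX orders_date_total_index ON orders(date, total)")

detail = [
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT sum(total) FROM orders "
        "WHERE date BETWEEN '2013-01-01 00:00:00' AND '2013-12-31 23:59:59'"
    )
]
print(detail)  # the plan should mention a COVERING INDEX
```

Had we selected a column that is not in the index (or used select *), the plan would fall back to reading each full row from the table.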

After our roaring success, our boss is hungry and assigns us another task: get the total sum of orders made in 2013 by the user with id '1'. When indexing multiple columns, the order matters: the first column is the primary sort, the next is the secondary sort, and so on. If we create the index create index orders_date_user_id_index on orders(date, user_id); and run the explain command on this query, we will see that while the engine uses our index and does a range scan, it still estimates it will fetch four rows when we expect two (only two orders were made by user id '1' in 2013; no need to fetch the rest). This means our query is not fully optimized, even though we added the user_id column to our index. As explained before, the column order in our index matters, and looking at our where clause, it seems we got the order wrong. We have two conditions: one strong condition (equality, =) and one weak condition (a range: between, <=, >=, !=). The column with the strong condition should come first in the sort order, so the index should be orders(user_id, date). After changing the index accordingly, we see the expected result.
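The effect of column order can be sketched the same way, again with Python's sqlite3 standing in for MySQL (the index names here are assumptions following the article's naming style):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT, total REAL)"
)

def plan(sql):
    # Return the detail column of SQLite's query plan.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = (
    "SELECT sum(total) FROM orders WHERE user_id = 1 "
    "AND date BETWEEN '2013-01-01 00:00:00' AND '2013-12-31 23:59:59'"
)

# Range column first: the index can only narrow by date;
# the user_id condition is checked row by row afterwards.
conn.execute("CREATE INDEX orders_date_user_id_index ON orders(date, user_id)")
plan_date_first = plan(query)
print(plan_date_first)

# Equality column first: the index narrows by user_id,
# then by the date range within that user's entries.
conn.execute("DROP INDEX orders_date_user_id_index")
conn.execute("CREATE INDEX orders_user_id_date_index ON orders(user_id, date)")
plan_user_first = plan(query)
print(plan_user_first)  # mentions user_id=? as part of the index search
```

With the equality column first, the plan's search condition includes both user_id and the date range, so only the two matching rows would be visited.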

Lastly, after seeing how an index can help us get query results faster, I want to talk about its cons, because nothing is perfect - especially in the world of tech. There is always a tradeoff; can you recognize the one being made here? An index improves our read queries, but now we have another data structure to maintain on every insert/update - meaning our write queries need more processing power and will slow our database down (more than before creating the index). The more indexes we have, the worse our write performance gets. When creating an index, we sacrifice write performance for read performance. This is often what we want, since nowadays data is usually read far more often than it is written, but it also tells us to index only what we really need (meaning queries that run a lot).

That's it! Just a taste of the very large and interesting world of storing data correctly and of methods to improve our queries. There are other methods, like partitioning and sharding, which we haven't discussed at all in this article, but those solutions apply only to really big tables.

Hope you enjoyed :)
