Bad Medicine
All MongoDB databases suffer from a genetic disease: they can scale to ginormous sizes.
What is the cure for scale-to-ginormous-itis? Optimization.
The symptoms of scale-to-ginormous-itis have a way of creeping into your application. At the beginning, when your application holds just some sample data, everything runs fine. All queries return in 0ms and you have nothing to worry about.
Fast-forward to when your application is in production, handling a significant amount of data, and the symptoms begin to show up. What are the symptoms? Slow queries.
You already know that the medicine for slow queries is indexes, but this medicine has a serious side effect: slower inserts (and possibly slower writes in general). Wait, what???
Are indexes BAD MEDICINE then? Well, sometimes. Indexes created without carefully studying utilization patterns can hurt performance more than they help it.
Design Good Medicine
How to write an index and how to make sure that it optimizes your queries is a great subject for a future post. For now, I will assume that you already know how to do that. In the examples below I will make sure to create an index that includes only the one key used for filtering.
The key to optimizing query performance is to study the utilization pattern of all your queries and to use the explain() method to measure the performance of each. That's it! It is not hard, but it takes time, and you have to make that time.
First, gather a list of all your queries and measure the performance of each using .explain("executionStats"). For example, for the query db.collection.find({"a": 1}) you would run db.collection.explain("executionStats").find({"a": 1}) and log the results. Put this in a script and you will be able to quickly measure performance before and after creating your indexes.
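Here is a minimal sketch of such a script for the mongo shell; the collection name and the queries listed are stand-ins for your application's real ones:

    // measureQueries.js -- run with: mongo mydb measureQueries.js
    // The queries below are examples; list your application's real queries here.
    var explains = [
        db.collection.find({}).explain("executionStats"),
        db.collection.find({}).sort({"a": 1}).explain("executionStats"),
        db.collection.find({"a": 1}).explain("executionStats")
    ];
    explains.forEach(function (ex) {
        var s = ex.executionStats;
        // success | nReturned | ms | keysExamined | docsExamined
        print([s.executionSuccess, s.nReturned, s.executionTimeMillis,
               s.totalKeysExamined, s.totalDocsExamined].join(" | "));
    });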
Second, run mongostat and add 1,000,000 documents (create a script to do that too). If your application performs inserts, make sure that the documents added here closely resemble the documents your application adds. Starting with no indexes and hardly any data, mongostat shows between 850 and 1,100 inserts per second. Let's say that we are adding 1,000 documents per second, i.e. that each insert takes about 1ms.
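A sketch of the loader, with a document shape of my own invention: the padding is an assumption to approximate ~1.5KB documents, and "a": i % 100 means each value of "a" matches 10,000 of the million documents, which lines up with the query tests below.

    // seedMillion.js -- run while mongostat is running in another terminal.
    // One insert at a time on purpose: no bulk insert, as discussed later.
    var padding = new Array(1500).join("x"); // ~1.5KB filler, an assumption
    for (var i = 1; i <= 1000000; i++) {
        db.collection.insert({ "a": i % 100, "b": padding });
    }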
Third, measure performance again using the script developed in step one. What used to run in 0ms now takes real time. Do not look only at executionTimeMillis; also look at nReturned, keysExamined, and docsExamined, and check whether your queries perform collection scans and in-memory sorts.
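To spot those last two conditions, you can walk the winning plan and look for COLLSCAN and SORT stages; a sketch:

    // Walk an explain() winning plan and collect the stage names.
    function stageNames(stage, acc) {
        acc.push(stage.stage);
        if (stage.inputStage) stageNames(stage.inputStage, acc);
        if (stage.inputStages) stage.inputStages.forEach(function (s) { stageNames(s, acc); });
        return acc;
    }
    var ex = db.collection.find({"a": 1}).explain("executionStats");
    var stages = stageNames(ex.queryPlanner.winningPlan, []);
    print("collection scan? " + (stages.indexOf("COLLSCAN") !== -1));
    print("in-memory sort?  " + (stages.indexOf("SORT") !== -1));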
In the results below, the five values reported are: success | nReturned | ms | keysExamined | docsExamined.
The query to return all documents without any filter does a collection scan: it returns 1,000,020 documents after inspecting all 1,000,020 documents. It does so in 315ms and inspects no index keys (obviously since we don't have any usable indexes).
The query that returns all documents sorted (db.collection.find({}).sort({"a": 1}).explain("executionStats")) fails outright because of an in-memory sort. In-memory sorts are capped at 32MB, a limit reached here after loading 20,701 documents. This is not good; we definitely need to create an index to fix it.
The third query finds the 10,000 documents that match a condition (db.collection.find({"a": 1}).explain("executionStats")). Without an index it has to examine all 1,000,020 documents, and it takes 933ms.
The fourth query is an update filtered on the same match condition (db.collection.explain("executionStats").update({"a": 1}, {"$set": {"b": 2}})). It takes about the same amount of time and effort.
The fifth query is a remove using the same match condition (db.collection.explain("executionStats").remove({"a": 1}, {"justOne": true})). It too scans all documents, clocking in at 1,033ms.
Administer and Evaluate Your Medicine
Fourth, given that all my queries filter on the same key, I will add an index on that key (db.collection.createIndex({"a": 1})). To support a proposed search feature, I will also add a text index covering every string field (db.collection.createIndex({"$**": "text"})). The first index took about 3.5 seconds to build; the second took six minutes in the foreground. That is a long time to lock up the application, so plan heavy index builds accordingly and consider building them in the background.
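On the shell versions of that era, this looks like the snippet below; { background: true } is the pre-4.2 way to avoid the foreground lock (from MongoDB 4.2 on, index builds no longer lock the collection and the option is ignored):

    // The single-key index builds quickly either way.
    db.collection.createIndex({"a": 1});
    // The wildcard text index is the heavy one: build it in the background
    // so it does not lock the collection for six minutes.
    db.collection.createIndex({"$**": "text"}, { background: true });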
Fifth, let's evaluate the medicine by measuring performance again. All queries now run in 0ms, except the third, which took 9ms, and the second, which had to inspect a million index keys instead of a million documents and still took about one second. That swap (index keys examined instead of documents examined) is almost as slow as a collection scan, but the in-memory sort was avoided and the query actually completed.
The text search also performed well. At first I searched for a word present in all one million documents, and it took four seconds; not great, but it did have to inspect a million index keys. To make the test more meaningful, I put a new word ("kaputt") in just 10,000 documents, and those 10,000 documents came back in 28ms (db.collection.find({$text: {$search: "kaputt"}})).
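Seeding the test word could be as simple as the sketch below; it assumes the "a" field from the hypothetical loader script, so that exactly 10,000 documents get tagged:

    // Put "kaputt" into the 10,000 documents where a equals 1.
    db.collection.updateMany(
        { "a": 1 },
        { "$set": { "b": "kaputt" } }
    );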
It is also interesting that the update and remove queries improved their performance thanks to the index: they find the documents to write instantly and then only have to write them. By contrast, a query that updates every document in the database (i.e. one with no match filter) would actually run more slowly than before, because it cannot use the index to find documents yet still has to maintain the index on every write.
We still have to look at the performance impact on inserts. First, the baseline: we had measured about 1,000 inserts per second, the equivalent of one insert per millisecond.
First we need to delete the million documents we added. I used .remove() and it took a while; it would have been faster to just drop the collection, as shown below. Then I ran the add-a-million-documents script again and monitored the mongostat output. This time it was only adding about 650 documents per second.
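The faster cleanup, for next time (note that dropping removes the indexes too, so they must be recreated afterwards):

    // Slow: deletes documents one by one, maintaining every index as it goes.
    db.collection.remove({});
    // Fast: drops the collection (and its indexes) in a single operation.
    db.collection.drop();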
Not bad, but significantly slower than before: where an insert used to take 1ms, it now takes about 1.5ms. Still, these are excellent numbers overall. As things stand, most reads and writes return in 0ms, inserts take 1.5ms, and a massive text search returns in 28ms. That's with a million documents.
Conclusion
You may ask: what about a billion documents? My first reaction is that, at 650 inserts per second, it would take an estimated 17 days on my underpowered laptop just to insert that many 1.5KB documents. I will keep that investigation for another blog post. I also did not use bulk inserts this time; it would be interesting to measure the speed difference.
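For the curious, a bulk variant of the loader would batch the writes with insertMany; a sketch, reusing the same hypothetical document shape as before:

    // Bulk variant: 1,000 documents per insertMany instead of one insert each.
    var padding = new Array(1500).join("x");
    var batch = [];
    for (var i = 1; i <= 1000000; i++) {
        batch.push({ "a": i % 100, "b": padding });
        if (batch.length === 1000) {
            db.collection.insertMany(batch);
            batch = [];
        }
    }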
The process described in this post is a bit tedious, but the scripts will come in handy again, so I think that optimizing performance by simulating a future state of the database with explain() and mongostat is well worth doing. Some people argue that indexes are something to worry about only once you have begun scaling. That may be a matter of opinion, but I disagree: once you have defined your queries and application, you can start observing your application's usage patterns and take action to prevent problems. An ounce of prevention is worth it, and if it turns out to be the wrong prevention you can fix it later; what you don't want is to start completely from scratch. IMHO, the sooner you start thinking about performance, the better. To measure performance you need data, and in this post I have shown you how to get that data using the explain() method and mongostat.
This is my methodology. What is yours? I am curious to hear about other ways to optimize queries using indexes.
I hope that this post has helped someone who is trying to figure out how to optimize performance with indexes. I look forward to your questions!