Database Choices for Startups
I've found myself having several discussions lately about NoSQL vs SQL recently. To be clear, I think that almost every system can be built on either. I think there's a lot of myths out there, and I'm writing this to express my thoughts on why I believe PaaS NoSQL is the right choice for the average tech startup. By "average tech startup" I mean a company that's raised less than a million dollars and is building a consumer-facing app where the role of the back end is simply to perform CRUD operations. Examples: a school LMS, a diet tracking app, a to-do list, note taking apps, dating apps, etc. Most apps fall in this category.
I'm not a database expert. I do think I have a lot of experience and I have built many greenfield apps. I have released applications with many different storage technologies: Postgres, SQL server, MongoDB, Firestore, and DynamoDB. My opinions here are based on those experiences.
Let me also say that for the average startup, every DBMS can work. You can code whatever little CRUD-tracking app in whatever database technology you want. It will work if it's properly built. This article deals with choices to make things easier, accelerate business iteration, and mitigate long term hidden risks.
Using SQL
One of the biggest, most prevalent myths out there is that "if your data has relations, you have to use SQL." I don't believe there's merit to this claim for a few reasons:
- Many (most?) Non-relational DBMS's can enforce relations too.
- Even if they don't, you can still enforce relations in the application layer by having a record-locking table and just hand-rolling your relations implementation. I've worked on a system that does this with DynamoDB before. It works fine.
- Relations are a concept that mostly comes into existence once you've accepted all the premises of entity-relation models. If you step back and look at those premises, in many cases it makes much more sense to embed objects rather than have relations.
While I don't think that the whole claim about relations has merit, I do agree that there are some things that relational databases tend to do a lot better. Generally, if you really need atomicity (ie. transaction support at the database level) and consistency, and you have predictable scaling constraints, relational is probably gonna be the way to go. While you can achieve atomicity and consistency with most NoSQL DBs by coding for it in your application layer, this is something you'll rarely want to do.
Another reason you might want to pick relational is if you need max performance. Having known attribute sizes allows relational databases to do simple pointer algebra to locate records. You can also cram more records per hard drive page. You can even choose to store records column-wise rather than row-wise. These little things can sometimes shave off a couple of microseconds (or even milliseconds), and if that matters to your application then you might want to use relational technologies.
Yet another reason to go with relational is flexibility in indexing. Most NoSQL databases don't have as many kinds of indexes or might even have some limits around them (eg. DynamoDB with its 20-index limitation).
Now that I have acknowledged the good things about relational database technologies, let's move on to why you probably shouldn't use them.
PaaS vs Self Hosted
Now, I understand that PaaS and NoSQL vs SQL are no longer tightly-bound constraints. We have Aurora Serverless and other cloud providers have other serverless SQL DB offerings. So I'll quickly cover the big reasons why you want to go PaaS in the context of a startup:
- Security. Startups are always starved for resources and systems operations go out the window. With a custom install of mongodb or postgres, good luck keeping the operating system up to date. It will not happen until you raise tens of millions or more, and even then it might still not happen. Solving the problem of patching a stateful machine without losing availability is a pretty tall order. What most startups end up doing is just risking it. They run on outdated OS versions and with outdated software, with tons of unaddressed vulnerabilities. This is not a risk that I think my clients and friends should take.
- Scalability. Almost every single PaaS offering is gonna give you auto scalability out of the box. Regardless of how good you think you are at setting things up, what tends to happen in practice, in my opinion, is that people don't have a robust system in place to scale their DB tier in response to a traffic spike. In theory you can set CPU/memory utilization alarms and monitor them. In theory you have a scaling plan you have rehearsed and tested before (Switching over your application server to a different DB is not the most trivial thing ever). In practice, in my opinion, this is never the case. Every startup I've ever been familiar with put less than $50k into their MVP and then they basically stayed at MVP phase, never increasing operational rigor to the point where they could easily support large traffic spikes. In other words, if they suddenly start succeeding, then their systems will crash.
- Cost. One way to mitigate the scalability risk is reserving some oversized capacity for your DB tier. This will run you like several hundred dollars per month. With certain PaaS offerings like dynamodb and firestore, you'd actually be operating for free (Within the free tier) until you hit that point of inflection with success, and then your system will scale easily if the data model is correctly designed.
So I think it should be without question that every budding startup following the lean model should opt for a PaaS solution. In fact, I'd even go as far as saying that if you are interviewing dev shops and one of them suggests custom installations of a DBMS, you should probably disqualify them outright.
Why Choose NoSQL
Now that there are good and viable PaaS solutions for SQL databases, that can be the end of the conversation. When it comes to SQL vs NoSQL, you should pick what your dev team has strong expertise in. But if your dev team has strong expertise with both (or neither, which is more often the case), I think NoSQL has some very strong things going for it:
- Logical coherence. In NoSQL, the way you persist your data can often be the same as the way you'd handle it in memory. This ends up resulting in shorter, simpler code that is significantly easier to understand. With SQL, you can use Object-relational-mappers to solve this problem but then you introduce a barely-predictable black box into your software. It's difficult to predict the queries that an ORM will generate. If you don't use an ORM then you're back to writing SQL statements with somewhat predictable performance (they can be predictable if you're an expert in B+trees and SQL performance, but in my experience almost nobody is). But in not using an ORM you're back to having a cumbersome application layer that does a lot of data transformations. This comparative virtue of NoSQL systems is often "hidden in plain sight." It requires a little of "Freeing your mind" in order to discover it. I hear engineers say things like "Oh, in SQL I could have done this operation just by changing an ID, and in NoSQL I had to copy and replace a document." Why yes, sure, but do you realize all the additional mental overhead the unnecessary SQL data model inflicted on you everywhere else in the application?
- Predictability of Performance. SQL databases were developed an insanely long time ago, and this still impacts how they work today. SQL databases chase every last bit of possible efficiency that can ever be achieved because they go back to the days when storage was actually expensive and limited. SQL databases do things such as specifying the max size of a particular attribute in a row, and if an attribute is too big, they'll actually store it in a different hard drive page. They do a lot of under-the-hood optimizations for you which result in significant query speed gains in some scenarios. However when your data model is complex and a given query requires several joins, it becomes damn near impossible to get a sense of whether a particular query is actually achievable in few I/O's or not. NoSQL databases are around 20% slower in a single table indexed query (I tested this with a 10gb mongodb collection vs a 10gb mariadb table) but the ability to embed data models results in significantly more predictable performance.
- Shardability. If you ever need truly massive scale, breaking up your data (sharding) is almost always a necessity in order to save costs. In a world constrained by relational data models, sharding is difficult to say the least. Consider a massive table of students, a big table of class sections, and a join table specifying which students are in which class sections. It would be very difficult to shard this data model in a way that would allow for scalability gains. The easiest way to shard this would be to embed objects into one another - in other words you have to use a non-relational data model. This is more natural to do in NoSQL databases (though certainly possible in modern SQL engines, it's uncommon and you lose some of the other benefits of SQL if you do it).
- Agility and Developer experience. This one is tough to explain without rambling. But once you free your mind from Entity relationship diagrams and embrace NoSQL, the business agility gains are staggering. I've stepped into projects where it's literally impossible to get the application running locally because there's a bug in the migrations code and a bunch of devs have a bunch of different branches expressing different migrations. Removing this headache of having a super-constrained schema speeds up development time. A LOT.
To summarize, for most startups, regardless of which database management system you choose, you should go with a PaaS model. When it comes to SQL vs NoSQL, there are many niche reasons to use SQL. None of these reasons usually apply to the average consumer-facing startup that's just trying to develop a "better mouse trap" version of some application that already exists. Such startups can tolerate some level of relaxation in ACID constraints for the sake of business agility, risk mitigation and cost savings, and therefore they should be using (or at least considering) NoSQL.