What I Like About Rick Houlihan's 2018 re:Invent Talk on NoSQL Data Modelling
Last week, I published the following post on LinkedIn: Saw this awesome video on NoSQL data modelling from last year's re:Invent by Rick Houlihan! Had to really bend my mind and do some serious homework to understand what Rick is saying, but now that I have understood it, it feels awesome! This is a *must-watch* for anybody who wants to understand NoSQL data modelling. Also, myth-shattering in more ways than one...
In response to this post, one of my friends asked me, “What did you like about this video?” I answered him privately, and he replied, “This is awesome Prashant! If you add a bit more beef to it, it will make a great blog.” Hence this blog. Here goes…
The short answer to my friend’s question is, “I liked this video because I now understand how to do NoSQL data modelling to achieve single-digit ms (read and write) latencies in large-scale OLTP systems, and for the very first time, I might add!”
And now the long answer…
Before watching this video I had read a few articles on NoSQL databases to understand what they are, how they are different from relational databases, and so on. But to be honest, that question – How to do NoSQL data modelling? – hadn’t even framed itself in my mind until recently when I had to help prepare an IoT proposal for a prospect of ours from the manufacturing industry. That is when I really started thinking seriously about it. In the case of this prospect, we found out that one factory of theirs could generate 100 million values per day in the worst case. That was a decent chunk of data to say the least and warranted the use of some “special” database. But the question was which one? Should I go for a time-series database? Or should I go for a “generic” NoSQL database with my own custom table design? If it was the latter, which NoSQL database should I go for? MongoDB? Cassandra? DynamoDB? CosmosDB? Some other one?
I searched for time-series databases and found out that InfluxDB would satisfy our requirements. But to feel confident that I had done a thorough search and made the right decision, I thought I should also get a good high-level understanding of the above-mentioned NoSQL databases. That is when I found a video that gives just that: 10 NoSQL Databases You Have to Know. In case you are thinking to yourself, “Jeez, that’s a cheesy sounding title,” I agree with you 100%! It *is* a cheesy title, but trust me, the video is anything but cheesy.
The video gave me a great high-level understanding of NoSQL databases, and also gave me the answer to one of my questions. The data that I was trying to store was simple time-series data, without the kind of elaborate structure that would necessitate the use of a “document” DB like MongoDB. So I could go with Cassandra (if our prospect forced us to go with an open source option) or DynamoDB/CosmosDB (if the prospect was okay with a managed service from AWS/Microsoft). But it did not answer my other question: if I do go with one of these databases, what should my tables look like? After a little bit more digging I found out that AWS recommends a single table design. Now that was perplexing! How can I convert my entire relational ERD (Entity Relationship Diagram) into a single table?
Some more searching and I found the following article: From relational DB to single DynamoDB table: a step-by-step exploration. This is how this article started: “Of all the sessions I’ve seen from AWS re:Invent 2018, my favourite is certainly this bewildering drop-kick of NoSQL expertise from AWS Principal Technologist, NoSQL and certified outer space wizard Rick Houlihan.” I decided to ignore the advice and continue reading. But I guess the author, Forrest Brazeal, knew that some people would do just that. So he next included his own tweet on the video with another piece of advice, “Seriously, watch the video, then come back to this article. You won't be disappointed!”
Now that’s what I call persistence, and I couldn’t ignore his advice this time. So I clicked on the link and watched the video, and boy was Forrest right! The video blew my mind. Like I wrote in my brief LinkedIn post, I had to really bend my mind and do some serious homework to understand what Rick said in the video because most of what he said was alien to me. So I agree 100% with Forrest that Rick is a “certified NoSQL wizard from outer space!” But I watched the video from start to finish multiple times and made sure that I understood every word of it. Here is what I liked about the video:
- All the quotes in the presentation (@ 2:29, 18:56, 22:45, 35:05, 37:39, 45:46, 47:25, 53:47, 57:43) are simply awesome!
- The timeline of database technology (@ 2:45) is great.
- The “Why NoSQL” slide (@ 7:43) had most of what I had expected, but the last line in the NoSQL column took me by surprise. I thought relational databases were used in OLTP applications. But the line says, “Built for OLTP at scale.” However, I know now that in that line the emphasis is not on OLTP but on “at scale”.
- The fundamentals of NoSQL table design are explained very well (@ 12:22): partition key (pk), sort key (sk), primary key (pk + sk), and attributes. However, I didn’t grasp the significance of the fact that “items don’t have to have the same attributes” until much later in the video; this feature is exactly what enables us to store items from many different tables in a relational ERD in a single DynamoDB table!
- LSIs (Local Secondary Indexes) and GSIs (Global Secondary Indexes) are explained crisply (@ 16:15 and 17:15 respectively).
  - LSIs have the same partition key as the table but a different sort key, which in other words means that LSIs are “a way to re-sort the data but not re-group the data”. This slide also shows the three types of allowed projections into the index: KEYS_ONLY (only index and primary keys are projected), INCLUDE (index, primary keys, and certain included attributes are projected), ALL (all attributes are projected).
  - GSIs, on the other hand, allow us to regroup the data differently, and that’s where some of the magic happens.
- The actual discussion on NoSQL data modelling (@ 23:00) begins with a line that struck me like lightning, “There is no such thing as ‘non-relational’ data. I don’t use that word because the bottom-line is that the data is relational. It does not (and can not) stop being relational just because I am using a different database!” And that’s what leads to the question, “How do I model relational data in a NoSQL database?” which is the title of Forrest’s article.
- How the data has been modelled traditionally and why it leads to the need for ACID transactions (@ 23:56) is insightful.
- The two key concepts of modelling data in NoSQL databases are explained clearly (@ 26:07):
  - Select a good partition key, one with high cardinality, and
  - Select the required sort key to sort the data as per one’s requirements (and to leverage range queries), and also to model 1:N and N:N relationships.
- The four tenets of modelling data in NoSQL databases are laid out clearly (@ 27:27):
  - Understand the use case
  - Identify the data access patterns (This is something that we don’t do with relational databases because we can perform any arbitrary query using SQL. Of course, we pay for that flexibility in terms of computation time.)
  - Data modelling (Again, this is very different because the advice is to model the ERD in a single table and not create multiple tables like in relational databases.)
  - Review, repeat, review.
- The need for creating composite keys and the way to do it is shown (@ 35:14) with a very good example. This concept can be used to do a faceted search which can greatly reduce the amount of data that is read from the database because it applies the filter before the sort.
- The design pattern for maintaining version history (@ 38:20) is neat.
- The explanation of an internal service of Amazon to resolve configuration items (@ 40:24) is fairly complex, and the first instance where I had to bend my mind. This example shows how to create adjacency lists in the main table and reverse lookups (pk and sk are interchanged and can thus be used to model the N:N relationships) using GSI. It also shows how data is duplicated across multiple partitions (or denormalized) to make queries run with single-digit ms latencies, which creates the need for “transactions” in NoSQL databases just like in relational databases, although for a different reason.
- The explanation of how hierarchical data can be handled elegantly using composite keys (@ 45:41) is given using the example of an internal Amazon service to locate their offices based on country, state, city, and office. But then he delivers the killer punch: “When you hear people say ‘NoSQL databases are missing joins’, you say ‘You are missing the point!’”
- The example of a theoretical delivery service (@ 47:37) has a very complex ERD which is not easy to model. This is where he shows how to model half a dozen entities and more than a dozen access patterns using one table and two GSIs! I had to bend my mind again to understand this, but when I finally understood it, it felt awesome! Rick also reveals which restaurants in Austin are his favourites: Torchy’s Tacos and Salt Lick. I know where I am going if I ever find myself in Austin with some free time on my hands.
- The last data modelling example is that of the Audible eBook Sync Service (@ 53:55) and it is the most complex of them all: one table and three GSIs to satisfy 20 different access patterns. To be honest, I didn’t understand this example partly because I didn’t properly understand the use case. But I did understand that I am supposed to extend the access patterns table by another column and write the query to satisfy that pattern. That’s how it’s done. That’s what Forrest does in his article as well.
- The cheapest datacentre infrastructure in the world (@ 57:50) can be built using serverless components. It runs on a few cents per day and therefore allows us to fail fast and cheap (the new mantra) if one has to, but can scale to 1000X higher loads with very little effort if successful.
- The conclusions (@ 59:28), just like the rest of the video, are awesome! 1) NoSQL does not mean non-relational. 2) The ERD still matters. 3) RDBMS is not deprecated by NoSQL. 4) Use NoSQL for OLTP or DSS at scale. 5) Use RDBMS for OLAP or OLTP where scale is not important.
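To make a few of the ideas from the list concrete (items with different attributes in one table, composite sort keys for hierarchies and range queries, and a GSI with pk and sk interchanged for reverse lookups), here is a minimal sketch in Python. It is a toy in-memory model and not the DynamoDB API. The class `SingleTable`, its methods, the key formats such as `CUSTOMER#1`, and all the sample items are assumptions made purely for illustration; none of them come from Rick's talk.

```python
from collections import defaultdict


class SingleTable:
    """Toy in-memory stand-in for one table keyed by (pk, sk).

    Hypothetical helper for illustration only; this is NOT the DynamoDB API.
    """

    def __init__(self):
        # partition key -> {sort key -> item}
        self._partitions = defaultdict(dict)

    def put(self, item):
        # Items only need "pk" and "sk"; all other attributes are optional.
        self._partitions[item["pk"]][item["sk"]] = item

    def query(self, pk, sk_prefix=""):
        # Read one partition only, in sort-key order, filtered by sk prefix.
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition) if sk.startswith(sk_prefix)]

    def inverted_index(self):
        # GSI-like view with pk and sk interchanged (reverse lookup).
        index = SingleTable()
        for partition in self._partitions.values():
            for item in partition.values():
                index.put({**item, "pk": item["sk"], "sk": item["pk"]})
        return index


table = SingleTable()

# Two entity types share one partition and carry different attributes.
table.put({"pk": "CUSTOMER#1", "sk": "PROFILE", "name": "Asha"})
table.put({"pk": "CUSTOMER#1", "sk": "ORDER#2019-01-05", "total": 40})
table.put({"pk": "CUSTOMER#1", "sk": "ORDER#2019-02-11", "total": 15})

# A hierarchy (state, city, office) encoded in a composite sort key.
table.put({"pk": "USA", "sk": "TX#Austin#Office1"})
table.put({"pk": "USA", "sk": "TX#Dallas#Office2"})
table.put({"pk": "USA", "sk": "WA#Seattle#Office3"})

# An N:N relationship stored as adjacency-list items.
table.put({"pk": "EMPLOYEE#7", "sk": "PROJECT#A"})
table.put({"pk": "EMPLOYEE#9", "sk": "PROJECT#A"})
table.put({"pk": "EMPLOYEE#7", "sk": "PROJECT#B"})

orders = table.query("CUSTOMER#1", "ORDER#")    # 1:N, sorted by date
texas_offices = table.query("USA", "TX#")       # one level of the hierarchy
project_a = table.inverted_index().query("PROJECT#A")  # reverse lookup

print([o["sk"] for o in orders])
print([o["sk"] for o in texas_offices])
print([e["sk"] for e in project_a])
```

Every read in this sketch touches exactly one partition and relies on the sort order of the keys, which is why the access patterns have to be known before the keys are designed.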
The main myth about NoSQL databases that this video shatters is that “NoSQL databases are schema-less and therefore very flexible.” Rick says, “I have designed thousands of NoSQL databases and they are anything but flexible!” You even have to know the data access patterns while designing the database. That's not the case with RDBMS systems, which can execute any arbitrary query, at the cost of computation time of course. NoSQL databases can’t do that; if you have to perform a join, you have to do it in the application layer, which many developers actually do even though it is a pain in the neck. But that is something that NoSQL developers are not supposed to do, as mentioned above. Also, if you mess up the NoSQL database design or if you discover a new, frequently-used data access pattern, you will have to re-design the database and fix it. But you won’t be able to do an “in-place fix”. You will have to migrate from the old database to the new one.
Like I wrote above, I had to see Rick's video from start to finish multiple times to understand it properly. But still there were some cobwebs in my mind. To clear them, I had to read the DynamoDB documentation on the AWS website (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-hybrid.html) thoroughly. While doing that, I had to go back to the video a few times to focus on some specific details. But in hindsight, it was all worth it!
While reading the documentation, I saw that a hybrid RDBMS-DynamoDB system is a viable option in some cases. The moment I read that I knew that it is exactly what one of our current customers needs. They have an RDBMS system which is having severe performance problems because it apparently performs crazy joins that span a hundred tables. So I talked to the architect of that project, and gave him a dump of what I had learnt from Rick’s video and the AWS documentation on DynamoDB. He understood what I told him and agreed that it is the way forward. So starting next week, we are going to start migrating the complex SPs (Stored Procedures), but one at a time, which is something that I stressed. I said, “When Amazon and Netflix decided to break their monolith into microservices, they didn't create a team of 100 engineers, lock them up in a room, and then reveal the brand-new system after 2-3 years. They did it one small part at a time, while keeping the whole thing functional. We have to do the exact same thing!”
The quote in Rick’s video @ 2:29 says, “History repeats itself because nobody was listening the first time!” I, for one, have decided not to do that.