Explain by Example: CosmosDB
Find out how my (virtual) date went with Azure CosmosDB.


Disclaimer: The following content is not officially affiliated with Microsoft.

Since I'm going to be giving a spiel (or two) about Data and AI at Microsoft Ignite later this year, I decided it's about time I stopped ignoring the Data guys. And what better way to get to know them than to take one of them on a virtual date? (I'm still in full lockdown and can't see anyone anyway.) So I went on a virtual date with Cosmos (his full name is Azure CosmosDB, but he likes to go by the nickname Cosmos).

Anyway, here is a summary of how the virtual date went...


Hey Cosmos, thanks for doing this virtual date with me, so what do you do for a living?

No worries, I'm glad you picked me. Some people are a little afraid of me because they think I'm trying to replace that SQL guy who used to be super popular, but I'm actually a pretty cool guy once you get to know me. And, not to boost my own ego here, but people kind of fall in love with me once they realize I'm not that mysterious. Anyway, I work in this field called "Databases" and I store things for a living.

Oh cool, what kind of things do you store? I might be moving houses soon. Do you do furniture storage?

Nah, not furniture storage. I store data, which is why I'm in the Databases field, but I don't have strict requirements like that SQL guy. I'm a bit more chill and laid back than he is. SQL likes to plan ahead of time. He tells people that they need to call him up beforehand and come up with a plan (which he calls a schema) that specifies exactly what data needs to be stored, how to arrange the data, all that kind of stuff, before he lets anyone use his storage services. And if people don't stick to his rigid schema plans, he throws a tantrum. Then everyone ends up having a big fat cry, especially those Database Administrator guys who have to deal with him.

Honestly, he's a bit of a neat freak in my opinion. Too structured and relational for my liking. I'm more of a semi-structured, non-relational type of guy. I don't like making plans, but I do have a few basic rules, so I just tell people, "Hey, if you can stick to these rules then you're welcome to use my services at any time. I don't really care what it is you're storing as long as you follow my rules."

Right, and what are these rules they have to follow?

Two simple rules really, they just need to tell me the Partition Key and the Throughput.

The Partition Key gives me an idea of how they want to separate their data (logically) and then I do some magic math in my head to work out how to physically store their logical partitions into my physical storage space.

And Throughput is really about how much work I have to do for them like will I have to move their data around a lot, find things for them, get and replace things for them, or remove things? All that labour work will cost them so they need to give me an idea of how much work they are expecting me to do for them. I call this Request Units (RUs) so I tend to ask people how many RUs they need from me. You can think of an RU as a unit of work or effort.
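Cosmos's back-of-the-envelope RU arithmetic can be sketched in a few lines. The numbers below come from Microsoft's commonly cited rule of thumb (a point read of a 1 KB item costs roughly 1 RU, a write of a 1 KB item roughly 5 RU); real costs depend on item size, indexing, and query complexity, and the helper function here is purely illustrative:

```python
READ_RU_PER_KB = 1   # approx. cost of a 1 KB point read
WRITE_RU_PER_KB = 5  # approx. cost of a 1 KB write

def estimate_rus(reads_per_sec, writes_per_sec, avg_item_kb=1.0):
    """Rough RU/s needed for a steady read/write workload."""
    read_cost = reads_per_sec * READ_RU_PER_KB * avg_item_kb
    write_cost = writes_per_sec * WRITE_RU_PER_KB * avg_item_kb
    return read_cost + write_cost

# A read-heavy workload: 500 reads/s and 100 writes/s of 1 KB items
print(estimate_rus(500, 100))  # 500*1 + 100*5 = 1000 RU/s
```

Notice how writes dominate the bill even at a fifth of the request rate, which is exactly why Cosmos says write-heavy workloads cost more RUs.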

Ok, but how would I decide what Partition Key or Throughput to give you?

So when it comes to data, you're generally doing either one of these two things: Read or Write.

Read is essentially getting data or filtering data or aggregating data, doing some sort of manipulation with the data and then returning the results. Write is generally inserting new data or modifying existing data or removing data. So when people ask me what to choose for their Partition Key, I tell them to look at what kind of queries they will be getting. Will it be read-heavy or write-heavy? You know, what are the most frequent queries they will be getting. Like if it is going to be read-heavy, then we probably want to have lots of replicas (basically copies of the same data) to avoid "hot" partitions and make sure that reads are highly available. If it is going to be write-heavy, then we probably want to make sure we can keep the replicas as consistent as possible in the shortest amount of time and try to avoid or mitigate any write conflicts that might occur.

It also affects the Throughput or number of RUs they need from me as well. For example, if it is read-heavy, I could probably go find your data from any of the replicas but if it is write-heavy, I will need to go and make changes to all the replicas and you know, I'm definitely going to charge more RUs for doing more work.

You mentioned some stuff I didn't understand here like Hot Partitions, Replicas, Conflicts, High Availability, and Consistency. Can you explain what they are?

Yeah, sure.

Let's see, I'll go through how I store data and why I do it that way. So when someone comes to me and says they need me to store their data, I tell them to create an account first because, you know, I need to keep track of my customers so I don't accidentally mistake one person's data for someone else's. Once they have set up their account, they can create a Database to start storing their data. Their data actually gets stored inside Containers, and this data could be Collections, Tables, or Graphs. I'll just talk about Collections for now, otherwise we could be here for a while.

A Collection is just documents of data stored in JSON format. JSON stands for JavaScript Object Notation and, do you have a pen and paper?

No, sorry.

That's ok. I have a pen and this napkin here will work just fine. So JSON is just a way of storing and exchanging data and it looks kind of like this:

{
  "name": "Cosmos",
  "bio": {
    "dateOfBirth": "May 2017",
    "friends": ["MongoDb", "Gremlin", "Cassandra",
                "Azure Table Storage", "Core SQL"]
  }
}

You see, if I wanted to find out my own bio, I just need to ask for "bio" and then inside the value for "bio", I can find my "dateOfBirth" or my "friends".
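The napkin document above is valid JSON, so any JSON library can parse it and walk into "bio" exactly the way Cosmos describes. A quick sketch with Python's standard library:

```python
import json

napkin = """
{
  "name": "Cosmos",
  "bio": {
    "dateOfBirth": "May 2017",
    "friends": ["MongoDb", "Gremlin", "Cassandra",
                "Azure Table Storage", "Core SQL"]
  }
}
"""

doc = json.loads(napkin)

# Ask for "bio", then look inside its value:
print(doc["bio"]["dateOfBirth"])  # May 2017
print(doc["bio"]["friends"][0])   # MongoDb
```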

Anyway, when someone comes up to me with a bunch of JSON documents to put into the container, I automatically create an index which just helps me keep track of the documents I have stored, so if they want me to look up some stuff inside their documents, I can do it easily by going through my index.

Sometimes people don't want me to keep track of everything in their documents so they can specify an Index Policy which basically tells me not to index certain stuff for them.
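An indexing policy in Cosmos DB is itself a small JSON document with an indexing mode plus included and excluded paths. The sketch below (written as a Python dict mirroring that JSON shape) keeps the default "index everything" behaviour but excludes the `/bio` subtree; the path names are just illustrative, borrowed from the napkin document above:

```python
# Illustrative Cosmos DB-style indexing policy: index everything
# under the document root except anything beneath /bio.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],      # index everything...
    "excludedPaths": [{"path": "/bio/*"}],  # ...except under /bio
}

print(indexing_policy["excludedPaths"][0]["path"])  # /bio/*
```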

So, what other Azure services have you also got lined up for a virtual date?

Well, I was thinking of Azure HDInsight, Azure Databricks, and um, maybe Azure Synapse Analytics?

Ok, let's say you are using my Document or Collection store to keep track of your virtual dates' profiles. You will probably create a document for each one of them, and inside that document will be all their details:

Document 1:
{
  "name" : "HDInsight", 
  "type" : "data analytics", 
  "cloud" : "Azure",
  "Apache" : ["Hadoop", "Spark", "HBase", "Storm", "Kafka", "Hive"]
} 

Document 2:

{
  "name" : "Databricks",
  "cloud" : "Azure", 
  "Apache" : "Spark"
}

Document 3:
{
  "name" : "Synapse Analytics", 
  "cloud" : "Azure", 
  "type" : ["data warehouse", "data analytics"],
  "SQLSupport" : true
}   

Just having a quick look at this, I can already tell you that a good Partition Key to use would be "name" because then I know that each document is its own logical partition, as the value is distinct, unique and will probably have a good range (as far as I know, all Azure services have different names). If we chose "cloud" as the Partition Key, then I would treat all 3 documents as a single logical partition because they all have the same value. Do you know why choosing a good Partition Key is important?

Is it so you can uniquely identify and find each document in the Collection?

Nah, not quite. Remember what I told you earlier, that the Partition Key helps me decide where to put stuff into my physical storage space? Yeah, so basically I take the Partition Key value that you give me and put it through a hash function (consistent hashing) to work out which physical partition to store that logical partition in. So you can imagine, if these partition key values are not distinct, then I'm probably going to end up mapping all the logical partitions to the same physical partition.

Um, physical partition?

Oh, uh, you can think of a physical partition like a physical storage room. If I put all your stuff in the same room then when you want me to get a lot of things out, I'm going to have to keep visiting the same room but, if I have them distributed across multiple rooms, I can hire other people to go look for all your stuff at the same time (in parallel) and so you can imagine just how much more effective that is.

In the Database world, when we talk about avoiding "hot" partitions, we essentially mean we want to avoid putting all your stuff in the same place, because then every request that comes in has to be sent to that one place and that's just bad.
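The key-to-partition mapping Cosmos just described can be sketched as a toy hash function. Cosmos DB's real hashing is internal, and the partition count here is made up; the sketch only shows the property that matters: equal key values always land on the same physical partition, so a low-cardinality key funnels everything into one "hot" room.

```python
import hashlib

NUM_PHYSICAL_PARTITIONS = 4  # made-up number, just for illustration

def physical_partition(partition_key_value):
    """Toy mapping of a partition key value to a physical partition.
    Deterministic: the same value always maps to the same partition."""
    digest = hashlib.md5(partition_key_value.encode()).hexdigest()
    return int(digest, 16) % NUM_PHYSICAL_PARTITIONS

# Distinct "name" values can spread across partitions...
for name in ["HDInsight", "Databricks", "Synapse Analytics"]:
    print(name, "->", physical_partition(name))

# ...but identical "cloud" values all collide on a single partition:
print({physical_partition("Azure") for _ in range(3)})  # one bucket
```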

Ah ok, got it.

Yeah, so anyway, my storage services have this reputation of being "highly available", "resilient", "durable", "consistent", and "globally connected" that I kind of need to uphold, otherwise I might lose my job and go out of business. So I tell people I can guarantee that 99.999% of the time, I can find and return the things they're looking for or perform some action on their data.

How are you so confident that you can make this guarantee?

Well, ok I'll let you in on a little secret. I have this Replica strategy which means when I store people's data, I actually make 3 more copies of it so in total I have 4 replicas stored in the same physical partition and I call this the Replica Set.

How does having 4 replicas help you?

So these replicas are actually spread across multiple fault domains. A fault domain is essentially anything that has a single point of failure. Think of your PC: if you pulled out the power plug, your PC would shut down due to a loss of power. That's a single point of failure for your PC. So if I had these replicas all live on the same fault domain, they might all get wiped if I ever experience a failure, no matter how many replicas I create, so that's why they're spread across multiple fault domains. So now, even if a replica experiences some failure, I've still got other replicas or copies and I can act like nothing went wrong and continue to serve my customers, but really what I'm doing in the background is trying to restore that lost replica using the other replicas. My customers won't see this impact, so in front of my customers, I am always highly available, resilient to failures and durable.

Anyway, in terms of consistency and being a globally distributed data storage service, I have 5 consistency offerings that my customers can choose from.

What do you mean by consistency offerings?

Basically I have consistency offerings that range from Strong consistency, to Bounded Staleness, to Session, to Consistent Prefix, to Eventual consistency. I have to make these consistency offerings because of the replicas that I make for high availability, resiliency and global distribution of the data.

Why?

Because there are trade-offs. Think about it. If I made all these replicas for high availability, it's pretty hard to guarantee consistency without some form of latency. Let's say someone came in and said, can you go and delete this thing inside this document for me. If I was only dealing with one copy, that'll be easy and I won't have to worry about consistency but the thing is I've made all these replicas so I need to go and delete the same thing from all the replicas otherwise someone might go look at one of the replicas and get inconsistent results. But then, if I have to go and make all my replicas consistent, that's going to take time, right?

Uh huh.

And so during this time, I can't make any of the replicas available for use and so I've lost my high availability status. But then, if I want to keep my high availability status, I can't guarantee 100% consistency at all times so typically with Databases, there is a trade-off between availability and consistency.

Ah, right, ok.

Anyway, the consistency offerings are basically a continuum from strong consistency (at the cost of lower availability or higher latency) to eventual consistency (for high availability, but the replicas may return inconsistent results). To give you a bit more detail:

  1. Strong consistency means any reads and writes will always be consistent, so whatever replica it is reading from or writing to, these changes will always be propagated such that there will never be any inconsistent results.
  2. With Bounded Staleness, they get to set a "window" of how many writes can lag from one replica to another or how long this lag from one replica to another can take. So the changes will still be written in the same order but there might just be some window of lag (or latency) from one to be completely updated and identical to another.
  3. With Session, basically those connected to the same session will be able to read the latest changes, those not connected to the same session will eventually see these changes but not immediately.
  4. With Consistent Prefix, the ordering of writes is still kept (like in a log book) and then these changes are eventually made to the rest of the replicas so there is latency in getting the latest updates replicated across all the copies but you'll never see any inconsistencies (out of order writes), you might just not be able to read the latest updates for a while.
  5. Finally Eventual Consistency is well, how do I say this? It will become consistent, eventually. So all the replicas will eventually be the same but until then, they might differ from one another. What that means is that you might see inconsistencies when reading from the replicas because there is no guarantee on the ordering. But it's highly available because you can always read from the replicas, they might just return inconsistent results (until it eventually becomes consistent).

So remember, as we move from Strong to Eventual, we are making that trade-off of consistency (and latency) for high availability.
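The availability-versus-consistency trade-off Cosmos walked through can be shown with a toy replica set. This is purely illustrative, not how Cosmos DB works internally: a write lands on one replica first and is propagated later; an "eventual" read answers from whichever replica you hit (possibly stale, but always available), while a "strong" read refuses to answer until every replica agrees.

```python
# Four replicas, as in Cosmos's Replica Set story
replicas = [{"value": "old"} for _ in range(4)]

def write(new_value):
    replicas[0]["value"] = new_value      # primary updated first

def propagate():
    for r in replicas[1:]:                # lagging replicas catch up
        r["value"] = replicas[0]["value"]

def eventual_read(i):
    return replicas[i]["value"]           # answers immediately, may be stale

def strong_read():
    values = {r["value"] for r in replicas}
    assert len(values) == 1, "replicas disagree, must wait"
    return values.pop()

write("new")
print(eventual_read(3))  # "old" — stale but highly available
propagate()
print(strong_read())     # "new" — consistent once all replicas agree
```

Between `write` and `propagate`, a strong read would have to block (the assertion fails), which is exactly the latency/availability cost of strong consistency.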

Oh wow, that was...quite a lot to take in. I feel like I should've brought a notebook or something to this virtual date.

Yeah, well, I haven't even talked about why I'm also known as Mr Worldwide (you know, as a globally distributed, multi-model database service and all) but if you're up for a second virtual date, you can always find me here.


P.S: I know this is a slightly different approach to what I normally do but I felt like doing something different (mostly to entertain myself). Again, I would love to hear your feedback (good, bad or otherwise). Also, I should probably mention that any characters created in this are purely fictional. The technology however, is real.


P.S (x2): If you want to support Explain by Example, you can buy me a coffee here.
