Explain by Example: Synapse Analytics
Disclaimer: The following content is not officially affiliated with Microsoft.
When I posted about my date with CosmosDB last year, everyone kept asking me, "When is the next date?" But you know, sometimes it really takes two to tango. Cosmos and I have decided to meet someone new to expand our social networks so today I decided to go on a date with a mutual friend, Synapse Analytics.
Who is Synapse Analytics?
When Cosmos first told me about their mutual friend Synapse Analytics, I got a little bit intimidated. Turns out Synapse is kind of a big deal in the data world. Synapse started off doing SQL data warehouse stuff for a while, but then Synapse's business really took off. They got a lot of investment from this big tech company called Microsoft, set up shop on the Azure cloud, and managed to form partnerships with Apache Spark, CosmosDB, Data Factory, Machine Learning, and Azure Purview. Anyway, Synapse is super busy running big data processing jobs for large data projects, so it was a bit difficult to schedule in some time for the date. We both agreed I could just meet Synapse at Synapse's Studio, and that way I could learn more about Synapse because everything they do is located at the Studio.
Visiting Synapse's Studio
Because of the large-scale nature of Synapse's business, I had to register my details with Microsoft Azure first so they could issue me a Workspace badge before I could enter Synapse's Studio. That took some time. I had to fill out some basic info, then some security info and also some networking info before they would issue me the workspace badge:
Then I was allowed to enter Synapse's Studio at https://ms.web.azuresynapse.net/ and Synapse was my tour guide for the Studio so I put on my curiosity hat and started asking a lot of questions.
The first room I went into was the SQL pools room. Here, I could see that a serverless SQL pool had already been set up and was ready to go. Synapse turned to me and asked, "Do you want to set up a dedicated SQL pool?"
I had no idea what a dedicated SQL pool was, so I said, "Sure, why not" and went through the process to create a dedicated SQL pool:
Once the dedicated SQL pool was created, I decided to ask about it...
"What does dedicated SQL pool and serverless SQL pool mean?"
Synapse looked at me in shock, "Are you telling me you don't know what dedicated SQL pool means?"
"Fair enough, we did rename it because of the rebranding we went through a couple of...anyway, a dedicated SQL pool is just SQL Datawarehouse. The difference between dedicated and serverless is really the resource that are used to run your SQL Datawarehouse jobs.
Dedicated means you create resources dedicated to run your jobs. Serverless means you share the resources you use to run your jobs with other people."
I nodded, "So dedicated is like having a private pool I can use and Serverless is like having a public pool I share with others?"
"Not quite" Synapse starts to say, "If we are going off that pool analogy you are using, you can think of dedicated SQL pools like having reserved swimming lanes at the swimming pool. That means for a set period of time, you have reserved that lane for your swimming activities only and no one else can enter into your swimming lane. Now, that doesn't mean that pool belongs to you, some of those lanes is just reserved for you.
Serverless is like going up to the pool and entering into a swimming lane that is available. Now, if you start swimming really fast, they might give you more lane space or move you to the fast lane, if you start swimming slower, they might take away some of your lane space or move you to the slow lane but none of the lanes are reserved for you. Does that kind of make sense?"
Yeah, I think that makes sense.
But how do these SQL pools process things?
"Good question, the first thing we need to do is look at the architecture of Synapse SQL pools. Both dedicated SQL pools and serverless SQL pools have the concept of a control node and compute nodes.
Now, the distinction between the control node and the compute node is that the control node gets to control how to use the compute nodes.
You can think of the control node as the swimming instructor and the compute nodes as the swimmers. The swimming instructor doesn't do any of the swimming but they usually figure out a strategy to direct swimmers.
The same goes for the control node: it doesn't do any of the work, but it figures out how to get all the compute nodes to do all the work in the most effective and efficient way."
Ok, what's the distinction between how dedicated SQL pools process things and serverless SQL pools process things?
"You like to get straight to the point, don't you? Well, seeing as it is almost lunch time, let's use the analogy of making a sandwich to describe how dedicated SQL pool differs to serverless SQL pool.
Imagine if I came up to you and said, we need to make 100 sandwiches to feed some very hungry children. Here are all the sandwich ingredients and here is a bunch of sandwich workers to help you out. Now, if you were the control node of this sandwich making request, what would you do?"
Well, I would...uh...
"If you operated similar to the dedicated SQL pool architecture model, you might look at the number of sandwich workers you have. Let's say that number is 10.
So you've got 10 sandwich workers. Then, what you do is look at the request which says you need to make 100 sandwiches. You would probably then tell each of your sandwich workers to make 10 sandwiches each.
They all start making sandwiches, and in roughly one tenth of the time it would take a single worker to make 100 sandwiches, you would have all 100 completed, in parallel. That's one way you could go about the division of labor.
And if we look at the dedicated SQL pool architecture, it operates pretty much the exact same way. Minus the sandwiches. We have a control node that accepts the main request. The control node runs a distributed query engine to figure out how to distribute the request across all its available compute nodes. The compute nodes then process all of the work in parallel."
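(A little aside from me, after doing some homework: the way a dedicated SQL pool divides up the work actually shows up when you create tables. Below is a minimal sketch, assuming a pyodbc connection to the dedicated pool's SQL endpoint; the server name, credentials, table and columns are all placeholders I made up, not anything Synapse showed me.)

```python
# A hedged sketch: creating a hash-distributed table in a dedicated SQL pool.
# Placeholders (<...>) must be replaced with your own workspace details.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-workspace>.sql.azuresynapse.net;"   # dedicated SQL endpoint
    "DATABASE=<your-dedicated-pool>;"
    "UID=<user>;PWD=<password>",
    autocommit=True,
)

# The WITH clause tells the pool how to spread the rows across its underlying
# distributions, which is what lets the compute nodes chew through a query in parallel.
conn.cursor().execute("""
    CREATE TABLE dbo.Sandwiches
    (
        SandwichId INT NOT NULL,
        Filling    NVARCHAR(50)
    )
    WITH
    (
        DISTRIBUTION = HASH(SandwichId),
        CLUSTERED COLUMNSTORE INDEX
    );
""")
```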
Oh, that makes sense. What about the serverless version?
"Hang on, I was just about to get to that.
The other approach is to follow the serverless SQL pool architecture model. Let's say there are about 5 tasks that need to be completed to make a sandwich, such as:
- Butter bread
- Peel lettuce
- Slice cheese
- Slice ham
- Put it all together
Now, rather than getting one sandwich worker to complete tasks 1 to 5, we might start giving "specialized" tasks to each worker. Workers 1 and 2 might only take care of buttering the bread. Workers 3 and 4 might only peel the lettuce. Workers 5 and 6 might only slice cheese. Workers 7 and 8 might only take care of slicing the ham. And Workers 9 and 10 put all the ingredients together. That is another way we can do the division of labor.
Now, a serverless SQL pool processes requests in a similar way. The control node still accepts the main request; however, it runs a distributed query processing engine to figure out how to split that large request into smaller tasks and hands each of these tasks to the compute nodes to complete. Some compute nodes might need to wait for other compute nodes to finish their work first, kind of like how Workers 9 and 10 had to wait for all the others to finish before putting it all together. But at the very end, we still get 100 fully constructed sandwiches to feed our very hungry children.
Speaking of, should we head out to lunch and discuss something else?"
Sure, I was getting quite hungry with all the sandwich talk.
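(Another aside from me: I later tried to picture what "just swim in whatever lane is free" looks like in practice. Here's a minimal sketch, assuming a pyodbc connection to the workspace's serverless SQL endpoint and a made-up parquet file sitting in a data lake; nothing is reserved or pre-provisioned, the pool just splits the query into tasks behind the scenes, sandwich-style.)

```python
# A hedged sketch: querying a (made-up) parquet file through the serverless SQL pool.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-workspace>-ondemand.sql.azuresynapse.net;"  # serverless endpoint
    "DATABASE=master;"
    "UID=<user>;PWD=<password>"
)

# OPENROWSET reads straight from the data lake; you only pay for the data processed.
rows = conn.cursor().execute("""
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sandwiches/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sandwiches;
""").fetchall()

for row in rows:
    print(row)
```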
"I'm sure you have many other questions like why did we form partnerships with Apache Spark in our studio."
Well, yes I did. But firstly, what is Apache Spark?
"Apache Spark is actually created from a group of about 300 companies made of about 1200 people that all came together to build this massive distributed big data processing engine and they put together this big data processing framework that can process lots and lots of data in-memory at a large scale for things like data preparation, data processing and machine learning which I think is pretty phenomenal. Anyway, we decided to partner up with them to make Apache Spark run in our Synapse studios because Apache Spark needs great hardware tools which our studio can provide. They also have a pretty big fanbase because they have been quite popular these days in the data space so we worked together to build Apache Spark for Synapse."
I'm not quite sure I understood what you just said.
"Oh. Well, I suppose I haven't told you why we exists" Synapse says. "Simply put, we exists to help our users find answers to their queries. We're not a search engine like a Google or a Bing but rather, we allow our users to build their big data systems by using our services with their data. So what we're doing is providing all of the necessary tools for our users to build their own data systems to help find answers to their own queries."
Why would someone want to build a data system to answer their own queries?
"Ok, think about this. You get home after work, you attempt to cook some dinner but because you're terrible at cooking, you accidentally burn your dinner so you order some takeout instead. While you wait for the food to arrive, you decide to put a movie on. How do you decide what to watch?"
Well, I would probably just go on Netflix and maybe go through the recommended list.
"Right, but what if Netflix didn't have that recommended list?
You would have to do some investigation and find some sources to help answer your own question about which movie to watch. For instance, you might go to YouTube and look up a movie trailer channel, or you might look up a movie review site and read through some of the reviews, or you might phone a friend like on that Who Wants to Be a Millionaire game show.
Now, those are all different forms of data sources to help you solve the problem of which movie to watch. You'll probably want to save those sources somewhere, be it the URL to that YouTube video, a screenshot snippet of that review, or the name of the movie your friend suggested over the phone. These are all the data sources you can collect to help answer your question.
But do you really want to watch every single movie trailer or re-read every single movie review every time you want to find a new movie to watch?"
No.
"Precisely. So you might want to generalize all of those data sources into a movie list and perhaps prioritize the ones that seems the most interesting to you to the top of the list. That process of you generalizing all of your data sources into this movie list is called data processing.
You have essentially processed all your data into something much more useable so that the next time you want to find out which movie to watch, you simply query your aggregate movie list instead of doing all of that pre-work. And if you want, you can throw away all of those URLs and screenshots because you can find out everything you need from your movie list anyway.
Does that make sense?"
Yeah, that makes sense.
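(If you like seeing things in code, here's a toy version of that movie-list idea I put together afterwards. The movies and scores are entirely made up; the point is just that the raw sources get processed once into something I can query over and over.)

```python
# A toy illustration of "processing raw data sources into a queryable movie list".
raw_sources = [
    {"movie": "Space Sharks", "source": "trailer channel", "my_interest": 7},
    {"movie": "Space Sharks", "source": "review site", "my_interest": 8},
    {"movie": "Sad French Drama", "source": "friend on the phone", "my_interest": 4},
    {"movie": "Cooking Disasters", "source": "review site", "my_interest": 9},
]

# "Data processing": aggregate the raw sources into one usable list...
scores = {}
for item in raw_sources:
    scores.setdefault(item["movie"], []).append(item["my_interest"])

movie_list = sorted(
    ((movie, sum(s) / len(s)) for movie, s in scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

# ...so next time I only query the processed list, not the raw trailers and reviews.
for movie, avg_score in movie_list:
    print(f"{avg_score:.1f}  {movie}")
```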
"So, organizations process data and build these data warehouse systems for the same reason. Not quite to the extent of finding out which movie to watch next but they build these data warehouse and data analytics systems to help them find answers to their questions and to make better business decisions.
Now typically, they're dealing with a lot more data and much more complex queries in their data systems, so you'll often hear scary words like big data processing, massively parallel processing engines and frameworks. But they're all just tools to help you build a system to answer queries and make better decisions."
Ah, I see.
"Alright, back to Apache Spark. Apache Spark is quite a complex organization, they do a lot of things. If you go to their official website, this is the description they provide:
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
I know that contains a lot of technical jargon but you can basically think of Apache Spark as this organization that has come up with a really, really good framework for doing data processing and they're popular because they support quite a few of the popular programming languages like Java, Scala, Python and R that people can use to process their data. They've also come up with frameworks for doing machine learning and graph processing which I'm not even going to bother going into because like I said, they do a lot.
At some point we just decided: hey, rather than trying to figure out our own framework and compete with them, why don't we just partner up with Apache Spark, leverage the framework they've put together, and run it in our studio? It wasn't hard to convince the executive board to let us do that, so now we fully support Apache Spark in our Synapse studio."
I see, smart move.
"Yeah, and now that also allows us to expand out our capabilities too because not only can people use our studio for building SQL data warehouses, they can now do their data preparations, data processing and even build machine learning models using Apache Spark's frameworks. Kind of best of all worlds, in my opinion."
So how does Apache Spark work in Synapse?
"Ok, so we start by telling our users to create an Apache Spark pool. This pool is like a configuration file. It allows our users to specify things like how many nodes will be part of the pool, what sizes these nodes should be, when to pause the pool, which Spark version to use and so on.
The nodes are essentially the worker nodes in the Spark pool that will then get allocated tasks to complete as part of a Spark job.
Obviously, the node sizes are dependent on what our user's data processing needs are so we give our users the ability to choose this.
We also give them other choices like whether or not they want us to take care of autoscaling their node numbers for them, so basically they can set a range. Let's say the range is 3 to 10, that means at a minimum, 3 nodes will be in the pool but if things start getting a little crazy, more nodes will be automatically added to the pool until it reaches 10 nodes."
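(For the note-takers: the pool settings Synapse just rattled off end up in a definition that looks roughly like the one below. I've written it as a Python dict whose field names loosely follow the ARM/REST payload for a Synapse Spark pool; treat the exact names as an assumption and check the docs, but the knobs themselves, Spark version, node size, autoscale range and auto-pause, are the ones described above.)

```python
# A rough, hedged sketch of a Spark pool definition; field names are approximate.
spark_pool_definition = {
    "name": "sandwichpool",                  # made-up pool name
    "properties": {
        "sparkVersion": "3.3",               # which Spark runtime to use
        "nodeSize": "Small",                 # size of each worker node
        "nodeSizeFamily": "MemoryOptimized",
        "autoScale": {                       # let Synapse grow/shrink the pool
            "enabled": True,
            "minNodeCount": 3,               # at least 3 nodes...
            "maxNodeCount": 10,              # ...at most 10 when things get crazy
        },
        "autoPause": {                       # pause when idle so it costs nothing
            "enabled": True,
            "delayInMinutes": 15,
        },
    },
}
```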
Can I only have one pool at a time?
"Absolutely not, you can create as many pools as you like with different configurations."
But wouldn't that cost more to have multiple pools?
"Well, technically no. A Spark pool does not cost anything until you start running Spark jobs but generally we recommend to our users to start off small, create a small pool to play around with it and test things out until they've figured out what it is that they need."
So then how do you charge for the use of Spark pools?
"That's a good question. When you want to run a Spark job, you have to connect to the Spark pool first. That's what we call a Spark instance. Now, a pool can have multiple connections by multiple users so that's why we call each connection a Spark instance.
Once you have a Spark instance, your Spark job is then given to a thing called a SparkContext, which sits inside the main driver program of a Spark application. Think of the SparkContext as a coordinator, sort of like that swimming instructor from earlier on.
Now your SparkContext is responsible for connecting to a cluster manager, which you can think of as a swimming agency that contracts out their swimmers for work. We'll call this swimming agency 'YARN' because it's the cluster manager that we use.
YARN has their own way of allocating or contracting out their worker nodes to us so for simplicity we'll just call it YARN magic.
So basically, SparkContext goes to YARN and asks for some nodes. Then YARN goes and does some YARN magic behind the scenes and then comes back to SparkContext and says, 'Hey mate, you can use node 1 and node 2'.
SparkContext then sends the application code and the job which is broken down into tasks to the available nodes 1 and 2.
These nodes run an executor which is in charge of processing the tasks. The nodes also contain a cache so that they can bring in data and store it as RDDs (Resilient Distributed Datasets) for in-memory processing, which is much faster than your typical disk processing because you don't have to do as many reads and writes to disk.
Once the tasks are processed, the results are sent back to SparkContext.
SparkContext then puts everything together, like those sandwich makers who were tasked with putting the entire sandwich together at the end.
SparkContext knows how to put everything together because it uses a directed acyclic graph (DAG) to keep track of the tasks.
Anyway, that was a very long-winded way to say we charge for Spark instances and not Spark pools so unless our users are utilizing our resources to process stuff, we won't charge them unnecessarily to reserve resources for them."
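(My own scribbled summary of that whole flow, as a small PySpark sketch. In a Synapse notebook the `spark` session is normally created for you, so consider the builder lines illustrative; the sandwich data is, of course, made up.)

```python
# A hedged PySpark sketch of the flow Synapse described: transformations build up a DAG,
# data can be cached in memory as an RDD, and only an action makes the executors do work.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandwich-count").getOrCreate()
sc = spark.sparkContext  # the coordinator, a.k.a. the "swimming instructor"

# Spread some made-up sandwich orders across the worker nodes.
orders = sc.parallelize(["ham", "cheese", "ham", "lettuce", "ham"])

# Transformations are lazy: they only add steps to the DAG, no tasks run yet.
counts = orders.map(lambda filling: (filling, 1)).reduceByKey(lambda a, b: a + b)
counts.cache()  # keep the RDD in memory for faster reuse

# The action is what makes SparkContext ship tasks to the executors and gather results.
print(counts.collect())

spark.stop()
```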
Ok, so what else can I explore in the Studio?
"Let's wrap up for lunch and head back in to check out the pipelines room."
Sure.
"Have you heard of Azure Data Factory before?"
No I haven't.
"Oh. Well, if you had, our pipelines room in our studio looks very similar to data factory and that's because we decided to not re-invent the wheels. We decided to partner up with data factory to bring their amazing data movement, orchestration, transformation and integration services in, some of which we support under Synapse Pipelines in our studio."
And how exactly does pipelines work?
"Let's imagine we want to bake a batch of cookies to pair with those sandwiches we made earlier to feed our hungry kids. Now, to bake a batch of cookies, we have a set of activities we need to complete, right?"
Right.
"This set of activities is called a pipeline in data factory terminology. There are 3 types of activities that pipelines support and these are:
The movement activity that is typically used is called the Copy Activity which takes care of sourcing all the cookie ingredients needed to make cookies as well as distributing the baked goods to our little cookie monsters.
Sourcing the cookie ingredients is an example of reading data from a data source into data factory so that we can process it into cookies.
Now, to bring in cookie ingredients, we need to link up with some cookie ingredient suppliers first. This is known as a linked service in data factory. A linked service specifies which data source we want to establish a connection with. There are a lot of cookie ingredient suppliers in the world, so we need to specify which one we want to connect with. A linked service is also used to establish a link to the compute resources you want to use. For example, you can bake the cookies at home or in a commercial kitchen that you rent, like Azure.
Now, the cookie ingredients could come in all different types so we will need to specify things like all the "white" ingredients are the flour, all the "yellow" ingredients are butter, all the "liquid" ingredients are milk and so on. This is known as a dataset in data factory because it defines the structure of the ingredients data that we are bringing in.
Once we have the ingredients or our cookie datasets, we can start turning them into actual cookies. This is where our data transformation activity comes in.
If you are a simple cookie baker, that is, most of the steps you'll take are fairly standard cookie making techniques, you might be interested in data flows. Think of data flows as a visual way of baking cookies. So rather than having the cookie instructions written out as text, they are drawn out as visual diagrams instead. And that's what data flows in data factory allow you to do with data: you can design your data transformation logic without having to write code.
Activities like Copy Activity which is a data movement activity or data flows which is a data transformation activity need to run on an integration runtime (IR)."
What's an integration runtime?
"Thought you would ask me that. An integration runtime is the compute environment where the activity is executed. You can think of it like the kitchen infrastructure that makes your cookies happen. Again, this integration runtime could be your own self-hosted kitchen or an Azure-managed kitchen."
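(To make all of that pipeline talk concrete for myself, I later sketched out what a pipeline with a single Copy activity roughly looks like. I've written the JSON as a Python dict; the names are made up and the shape loosely follows the data factory pipeline format, so treat it as illustrative rather than an exact schema.)

```python
# A hedged sketch of a pipeline definition with one Copy (data movement) activity.
# Dataset and linked service names are invented for the cookie analogy.
cookie_pipeline = {
    "name": "BakeCookiesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyCookieIngredients",
                "type": "Copy",  # the data movement activity
                "inputs": [
                    {"referenceName": "CookieIngredientsDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "PantryDataset", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},  # read from the supplier
                    "sink": {"type": "ParquetSink"},            # land it in the pantry
                },
            }
        ]
    },
}

# Each dataset would, in turn, point at a linked service (the supplier connection):
cookie_ingredients_dataset = {
    "name": "CookieIngredientsDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "CookieSupplierStorage",  # made-up linked service
            "type": "LinkedServiceReference",
        },
    },
}
```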
"Ok, here's a question for you: if we had two pipelines, let's say a sandwich pipeline and a cookie pipeline, and we want to make the sandwiches first before we make our cookies, what do you think we can do?"
Could we build the sandwich pipeline then destroy it and build the cookie pipeline?
"Not quite. We can build two pipelines then use our control flow activities to trigger the cookie pipeline to run after the sandwich pipeline."
Trigger?
"Yeah, so typically the pipelines in data factory run when the trigger it is assigned is invoked. You can also manually run the pipelines but triggers allows you to automate the pipeline runs. For example, you may want to schedule the pipeline to run at a certain time using a schedule trigger or you want to invoke the pipeline to run after some event using an event-based trigger.
Hey, speaking of schedules, I have another meeting to run off to so I might have to call our date to an end here but if you want to learn more, just reach out and we can book in some more time for next time."
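(One last scribble before the P.S.: here's what a schedule trigger that kicks off a pipeline roughly looks like, again as a Python dict that loosely mirrors the data factory trigger format. The names and times are made up, so take the exact fields as an assumption.)

```python
# A hedged sketch of a schedule trigger that runs the cookie pipeline once a day.
nightly_bake_trigger = {
    "name": "NightlyBakeTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",                    # run daily...
                "interval": 1,
                "startTime": "2021-06-01T06:00:00Z",   # ...starting from this slot
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "BakeCookiesPipeline",  # pipeline from the earlier sketch
                    "type": "PipelineReference",
                }
            }
        ],
    },
}
```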
P.S: I've been visiting a lot of cafes lately to write these blog posts on the weekends. It hasn't helped with my coffee addiction, but it has helped me finish this blog post, so if you want to support Explain by Example, you can buy me a coffee here.