How many databases can you handle?
Armando Lacerda
Azure Architect & Engineer | Data Platform Expert | Microsoft MVP | Microsoft Certified Trainer
Introduction
Azure Synapse Analytics is one of the most comprehensive and integrated tools for data engineering out there. You can implement data integration (a.k.a. data transformation) in code using Spark clusters and notebooks, or through a graphical user interface with minimal or no code using data flows and transformation pipelines, among other options.
We can use at least six distinct languages to implement ETL/ELT logic, and you can run your ETL/ELT processes on any of the three compute engines inside the Synapse workspace that are currently in GA. You only get more options at Starbucks when ordering your caffeine.
When it comes to storing the data, there are two options: the Azure Storage Account gen2 with hierarchical namespace support (a.k.a. the data lake), or the Dedicated SQL engine's proprietary storage, known as distributions.
In this article we will investigate these 2 options to store data and how we can implement data transformation on each one.
Dedicated SQL Pool database
Before the cloud came along, SQL Server professionals (a.k.a. DBAs) were used to implementing data movement and transformations with SSIS (SQL Server Integration Services). Even though an SSIS package could read from multiple sources and write to multiple destinations, a very common approach was to write the data to a SQL database and then transform it using the T-SQL language organized in stored procedures.
The Synapse Analytics Dedicated SQL Pool is your similar option when it comes to implementing ETL/ELT on the same model. Data must be ingested into a dedicated SQL pool database. Once imported into the database, you have the T-SQL language available to implement your data crunching.
It is important to note here that the techniques and logic used in the Dedicated SQL Pool should be different from what we usually see on on-prem SQL Servers. The C.T.A.S. (Create Table As Select) technique leverages the fact that SQL Server is more efficient writing to new pages in the database than updating already populated pages, and it should be used extensively to transform data inside the dedicated pool.
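As a minimal sketch (table names, the distribution column and the transformation itself are illustrative), a CTAS-based transformation in a dedicated pool could look like this:
-- Write the transformed result into a brand new table instead of updating rows in place
CREATE TABLE dbo.FactSales_Transformed
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT CustomerId,
       SaleDate,
       Quantity * UnitPrice AS SaleAmount
FROM dbo.FactSales_Staging;
Because the result is materialized in new pages, the old table can simply be dropped or renamed once the new one is in place.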
One drawback of the Dedicated SQL Pool is that the data engine and the storage service are coupled and work as one. You can’t read or write data into the distribution storage without having the Dedicated SQL pool online. That drives cost higher. And it can get really expensive really fast.
Lake Database
The concept, implementation and adoption of standard open-source file formats brought flexibility to data processing. Much like the way you can open a CSV file (file format) in Excel, Visual Studio Code or Notepad++ (compute resources), you can process the Apache Parquet file format using a Spark pool engine or the Serverless SQL pool engine, among other options outside Synapse.
Having just the Parquet files is not enough to organize the data in the lake. Parquet files don’t keep enough metadata about what is stored in the lake to support efficient queries over very large data volumes.
To supplement the missing metadata, the Spark pool engine allows the creation of databases. Spark databases are just a collection of metadata about tables and other assets. Spark databases don’t store the data itself. The actual data is kept in the data lake in an open-source file format, thus allowing other data engines to touch that data when needed. This means that Synapse and Databricks can share the same data in the lake, and we can use either engine to implement ETL/ELT on the same data storage. That is flexibility!
Inside the Synapse workspace, this database you create with the Spark pool engine is known as the Lake Database. For the SQL DBAs out there venturing into the Spark universe, it is important to notice that database and schema are the same thing. SQL's three-part name notation (database.schema.table) is not valid in the Spark environment; in Spark we only have database.table (or schema.table).
The Lake database metadata created and maintained by the Spark engine is shared between the Spark engine and the Serverless SQL pool, which means that when we create a table in the Spark pool it also shows up in the Serverless SQL pool.
Under the hood, the Serverless SQL pool is the same SQL engine we know and love, adapted to work with the Lake database. When you access the Lake database through the Serverless SQL pool, the three-part name notation applies. All the objects created in the Spark pool live in the dbo schema, and the dbo schema is read-only from the Serverless SQL pool engine. If we want to create a view based on tables in the shared Lake database, the view needs to belong to a schema other than dbo.
For instance, if we create a Spark database like this:
%%sql
create database MySparkDB;
use MySparkDB;
create table MyTable (col1 int);
We can query this table from Serverless SQL pool like this:
use MySparkDB;
select col1 from dbo.MyTable;
This gives great reach for querying data hosted in the data lake in the form of Spark tables, exposed as regular SQL tables through the Serverless SQL pool. One can use any SQL client tool like Power BI or Azure Data Studio (or even Excel!) to query data from the shared data lake storage through the compute power of the Serverless SQL pool.
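As a sketch of the pattern described above (the schema and view names are hypothetical), from a Serverless SQL pool database of our own we can wrap the shared Spark table in a view that lives outside dbo:
-- Run from the Serverless SQL pool, in a database we own
CREATE SCHEMA rpt;
GO
CREATE VIEW rpt.MyTableView
AS
SELECT col1
FROM MySparkDB.dbo.MyTable;  -- three-part name reaching into the shared Lake database
GO
Any SQL client connected to the Serverless SQL endpoint can then query rpt.MyTableView like a regular view.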
SQL Databases
Don’t be fooled by the name Microsoft has given to it. A SQL Database here is not your regular, on-prem SQL database implemented in the cloud. Not even close. It can be either the Dedicated SQL pool database or the Serverless SQL pool database.
The Serverless SQL pool database is not as generous with its metadata as the Spark pool is. Or, from another perspective, the Spark pool doesn't care about the Serverless SQL database metadata. Database objects created in a Serverless SQL pool database are not visible or accessible from the Spark pool, but the underlying data files in supported open-source file formats are available to the Spark pool engine.
The metadata sharing goes in one direction only: from the Spark pool to the Serverless SQL pool.
In the Serverless SQL pool we can use a subset of the T-SQL language to query and transform data in the data lake, either using Spark’s metadata, using SQL’s metadata, or reaching for the files in the lake with the OPENROWSET() T-SQL function. Each way has its benefits and pitfalls, in particular when it comes to character encoding, which we will investigate in another blog post.
I call them pitfalls because we are tempted to write the regular T-SQL code we would write on an on-prem SQL instance, and when we do, it usually fails. There is always a specific element of Serverless SQL that makes the query almost like the one you'd write on a regular SQL offering. But not quite the same.
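For illustration (the storage account URL and folder path are hypothetical), reaching for Parquet files directly with OPENROWSET() looks roughly like this:
-- No table metadata needed: the query reads the Parquet files straight from the lake
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;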
In the same way that Spark pool databases don’t keep the data, which is saved in open-source file formats in the data lake, the Serverless SQL pool follows the same pattern. When we create a Serverless SQL pool database, only the metadata is kept in the database. The actual tables and their data are stored in the data lake, in a supported open-source file format of your choice. In SQL terminology, a table that points to data stored outside a SQL database is known as an external table.
Therefore, in the Serverless SQL pool, we only have external tables. Views and stored procedures keep code, not data, and Serverless SQL keeps that code in the database's metadata store. Only tables are external objects.
The same CTAS technique applied in Dedicated SQL pool databases can be applied in Serverless SQL pool databases, with the caveat that all tables are external. Thus the CTAS technique, when done from the Serverless SQL pool, is known as C.E.T.A.S. (Create External Table As Select).
CETAS transformations done in Serverless SQL Pool have more restrictions when compared with CTAS in Dedicated SQL Pools, though. This is due to T-SQL limitations in the Serverless SQL pool.
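As a rough sketch (data source, file format, paths and table names are all illustrative), a CETAS transformation in the Serverless SQL pool could look like this:
-- One-time setup: where the external table's files will be written and in which format
CREATE EXTERNAL DATA SOURCE LakeCurated
WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/curated');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- CETAS: only the metadata lives in the Serverless SQL database; the data lands in the lake
CREATE EXTERNAL TABLE dbo.SalesCurated
WITH
(
    LOCATION = 'sales/',
    DATA_SOURCE = LakeCurated,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT CustomerId, SaleDate, Quantity * UnitPrice AS SaleAmount
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS src;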
Conclusion
Azure Synapse Analytics offers a variety of storage models and technologies for data architects to choose from to address specific use case needs. They can be split into two: data lake storage and the proprietary Dedicated SQL distribution storage.
Dedicated SQL distribution storage can only be queried through the potentially expensive Dedicated SQL pool engine. This engine offers great compatibility with the T-SQL language and is usually easier for SQL developers and DBAs to adopt.
Data in the data lake is stored in open-source file formats and can be queried and processed by any compatible data engine like Synapse Spark Pool, Databricks and Serverless SQL pool.
With all of that in mind, tell me in the comments below: how many databases can you handle?