Open Query Vs Linked Server
Leon Gordon
CEO of Onyx Data | Forbes Tech Council | Microsoft MVP | International Keynote Speaker | Gartner Ambassador
Introduction
Few users understand (or care) how far away they are from their data. Quite often, they expect instant results from their queries, despite there currently being possible issues like network bandwidth or the data sitting on a server on the other side of the world.
SQL Server makes it easy to connect to and query data from remote data sources. The common way of getting this done is with a linked server, which is little more than an OLEDB data source.
The beauty of a linked server connection is that it allows an end user to write a query that looks like any other T-SQL query, other than you need to include the name of the linked server.
Query that has been written to select data locally
· SELECT * FROM [Database_Name].[Schema_Name].[Table_Name]
Query that has been written to utilise a linked server
· SELECT * FROM [Linked_Server_Name].[Database_Name].[Schema_Name].[Table_Name]
However, underneath the surface of the linked server query, SQL Server is going to make decisions for you on how it mashes all the data together and returns the result set.
By default, when you run a distributed query using a linked server, the query is processed locally. This may or may not be efficient, depending on how much data must be sent from the remote server to the local server for processing. Sometimes it is more efficient to pass through the query so that it is run on the remote server. This way, if the query must process many rows, it can process them on the remote server, and only return to the local server the results of the query. The OPENQUERY function is used to specify that a distributed query be processed on the remote server instead of the local server.
Introduction excerpt taken from Thomas LaRock's article which provided motivation behind this article - https://thomaslarock.com/2013/05/top-3-performance-killers-for-linked-server-queries/
Linked Servers vs. Open Query
Linked servers (four part queries); are also called distributed queries. Distributed queries create an execution plan by looking at the query and breaking it into remote and local queries. Local queries are executed locally and data for remote queries are collected from the remote servers, scrubbed locally, combined together and presented to end user as single record set.
I.e.: even though you have a WHERE clause in your Distributed Query to filter the results, SQL Server may just send a SELECT * FROM the remote table, and then locally it will filter out the necessary data. Using OPENQUERY on the other hand, sends the complete query to the remote server and the resources of the remote server are spent in processing the query, generating a plan, and filtering the rows. Then, the filtered result set is sent back to the originating server and the results are displayed.
Limitation of Open Query
The alternative to using Linked Servers is to use the OPENQUERY statement, also known as a pass through query. When using an OPENQUERY statement, the WHERE clause gets executed at the remote server and the resultant (mapped) records traverse over the wire instead of an entire sourced data set. The only issue with the OPENQUERY statement is that the TSQL syntax is limited in that it does not accept variables for arguments. To get around this, you need to ensure that the OPENQUERY TSQL syntax is generically coded.
Possible Linked Server Performance Impact
If you change the security configuration of a linked server from using login UserA to login UserB, would you expect that to significantly influence the performance of a query that uses the linked server, assuming that both users have permission to select from the remote table referenced in the Distributed Query? Many people probably would answer no. After all, as long as the connection has the permission to access the remote table, how the connection is authenticated should not matter, right? Well, it is more complicated than that.
Let us say you have the following the query:
SELECT *
FROM [Linked_Server_Name].[Database_Name].[Schema_Name].[Table_Name]
WHERE OrderDate = '20090717'
To process the above query, the SQL Server optimizer on the [Linked_Server_Name] may probably choose between two strategies (among others). First, while utilising an API server cursor, it may decide to retrieve one row at a time from the table [Table_Name] that matches the Order Date value, or retrieve all the rows of [Table_Name] and then perform the WHERE clause locally. To make the correct decision as to which strategy to use, the local server first needs the distribution statistics for table [Table_Name]. The question is what permission does the local server need in order to retrieve the distribution statistics from the [Linked_Server_Name]?
It turns out that it needs the same permission as is required to execute DBCC SHOW_STATISTICS. Per Books Online, the permissions for DBCC SHOW_STATISTICS are:
User must own the table, or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role.
This is where the problem lies.
You may think that as long as you can access the table [Table_Name] from the [Linked_Server_Name], you are all set with the above Distributed Query.
After all, if you just want to retrieve some data from the [Table_Name] table, it would not be a security best practice to be given a sysadmin, db_owner, or even db_ddladmin role.
If the connection to the [Linked_Server_Name] is not the table owner, or a sysadmin, db_owner, or db_ddladmin, you will still get the result back. However, the query optimizer on the local server will not have the stats for [Table_Name], and may probably just decide to do a table scan.
Whether it is reasonable to require sysadmin, db_owner, or db_ddladmin in order to process a distributed query efficiently, is an issue for a different discussion.
Possible Linked Server Work-Around
When running distributed queries on a linked server, if the linked server has the same character set and sort order (collation) as the local SQL Server, then you can possibly reduce overhead and boost performance if you set the SP_SERVEROPTION “collation compatible” option to true. What this setting does is tell SQL Server to assume that all columns and character sets on the remote server are compatible with the local server.
If this option is not selected, then the Distributed Query being executed on the remote server must return the entire table to the local server in order for the WHERE clause to be applied. As you can imagine, this could potentially return a lot of unnecessary data over the network and slow it down. If the option is selected, (which is always recommended if the collations are the same on both servers), then the WHERE clause is applied on the remote server. This, of course, means that much less data is transmitted over the network, often greatly speeding up the distributed query.
Linked Servers vs. SSIS
If you are migrating data from a legacy system and you have the option of restoring the legacy system database to your new server, then you do not necessarily need SSIS. T-SQL commands on the new server are probably going to be faster to script out and run than a SSIS data flow if you are taking the raw data in whole.
If the legacy data is only accessible via linked server or through SSIS, then you are probably better off using SSIS with Fast Bulk Load to load the data. TSQL calls that pull the data across via linked servers are generally slower than SSIS data flow connections.
I would not recommend using the default wizards in SSIS to import data unless it is a one-time process. These are generally not the best from either a performance stand-point or from a maintenance stand-point. Even when loading the data once, you should script out each object individually to make certain you do not repeat the errors of the legacy database in the new database.
If you intend to load the data from the legacy system in a reoccurring manner, then you should use an SSIS package, just for maintainability and extensibility. Even if you decide it needs to execute nothing but TSQL statements, at least you can parallelize it and organize the TSQL calls in a meaningful way. If you go with just executing the TSQL via Linked servers sequentially in a SQL Server Agent job step, then you are not going to be able to easily parallelize the process.
Internet Marketing Manager at Devart
1 年For those interested in this topic, I suggest looking at this article (https://blog.devart.com/openquery-in-sql-server.html) that thoroughly explores using OPENQUERY in SQL Server.
Staff Process Computer and DCS Engineer
1 年Nice to see actual useful info on LI !
Data Engineer at Tesco Mobile
3 年Very well dissected; both have their strong points and weaknesses. I consider them as different tools in a toolbox; which one to use will depend on the situation. I should add that in order to run the OPENQUERY or OPENROWSET commands, the 'Ad Hoc Distributed Queries' option will need to be enabled. Careful consideration should be given to this as it does come with a warning.
SQL Server Database Administrator
5 年https://www.dotnetfunda.com/forums/show/3108/sql-server-xml-data-type-is-not-supported-in-distributed-queries-remot
This is excellent article.thank you.?