Building a high-performance back end - SQL Server
Gaurav Jain
Search stored procedure tuning:
I am sure the screen below looks familiar, because most of us as developers have built a screen like this in some project: a search feature that finds records according to the project's needs and brings the data back from the back end. Generally, we collect the parameters on the front end and pass them to a stored procedure.
Remember, in this type of screen, if the user has provided an ID (in the case below, a Service Provider ID), the query will bring back only one single record, while in the other boxes the user can provide a first name, a last name, or both.
For example, if the user has entered only Tom as the first name, the stored procedure will bring back all service providers whose first name is Tom.
We often use the following type of query for this:
WHERE ([C].[CustomerID] = @SProviderID OR @SProviderID IS NULL)
AND ([C].[lastname] LIKE @LastName OR @LastName IS NULL)
AND ([C].[firstname] LIKE @FirstName OR @FirstName IS NULL)
This is a BAD stored procedure design. This type of stored procedure will neither optimize well nor make good use of execution plan caching.
The first time such a stored procedure is called, SQL Server builds an execution plan and caches it. It then tries to reuse that same plan for every other set of parameters, even though one plan cannot fit all the provided data. As a result, we get horrible performance out of the stored procedure.
For example, suppose this SP is first called with a Service Provider ID. SQL Server returns one row for the provided value and saves that execution plan in the cache. The next time you call the same SP with only a first name, the cached plan still estimates the result at 1 row, while the query may actually return thousands of rows, and that will kill performance.
OPTION(RECOMPILE) is the quickest way to fix this.
To fix this type of stored procedure, OPTION(RECOMPILE) works wonderfully, but how you use it is equally important, because you cannot apply it blindly to every statement. If you do, SQL Server will spend time building an ideal plan on every single execution, and you might end up using far too much CPU. We can definitely reuse at least a few stable plans, if not all.
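For illustration, here is the blanket version of the fix applied to the catch-all query above (a sketch, assuming the same Customers table and parameters). Every call now compiles a fresh plan sized for the actual parameter values, at the price of a compile on every execution; the hybrid approach below avoids paying that price when it is not needed.

-- Catch-all query with the blanket fix: accurate row estimates
-- on every call, but also a full compile on every call
SELECT [C].*
FROM [dbo].[Customers] AS [C]
WHERE ([C].[CustomerID] = @SProviderID OR @SProviderID IS NULL)
AND ([C].[LastName] LIKE @LastName OR @LastName IS NULL)
AND ([C].[FirstName] LIKE @FirstName OR @FirstName IS NULL)
OPTION (RECOMPILE);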
A Hybrid solution:
In the example below, I have applied OPTION(RECOMPILE) dynamically inside the stored procedure, based on the incoming data.
If SProviderID has been provided, do not recompile; if anything else is provided, recompile based on where the first wildcard appears in the search string. That way, we ask SQL Server to create a new plan only when it is needed.
CREATE PROC [dbo].[GetServiceProviderData]
(
    @SProviderID BIGINT = NULL
    , @LastName VARCHAR (30) = NULL
    , @FirstName VARCHAR (30) = NULL
)
AS
-- Require at least one search parameter
IF (@SProviderID IS NULL
    AND @LastName IS NULL
    AND @FirstName IS NULL)
BEGIN
    RAISERROR (N'Please supply at least one search parameter.', 16, 1);
    RETURN;
END;

DECLARE @ExecStr NVARCHAR (4000),
    @Recompile BIT = 1;

-- Build the statement dynamically, adding only the predicates
-- for which a value was actually supplied
SELECT @ExecStr =
    N'SELECT * FROM [dbo].[Customers] AS [C] WHERE 1=1';

IF @SProviderID IS NOT NULL
    SELECT @ExecStr = @ExecStr + N' AND [C].[CustomerID] = @CustID';
IF @LastName IS NOT NULL
    SELECT @ExecStr = @ExecStr + N' AND [C].[LastName] LIKE @LName';
IF @FirstName IS NOT NULL
    SELECT @ExecStr = @ExecStr + N' AND [C].[FirstName] LIKE @FName';

-- An ID lookup returns at most one row: that plan is stable, so reuse it
IF (@SProviderID IS NOT NULL)
    SET @Recompile = 0;

-- If the first wildcard sits at position 4 or later, at least three
-- leading literal characters make the predicate selective enough
-- for a stable, reusable plan
IF (PATINDEX('%[%_?]%', @LastName) >= 4) OR (PATINDEX('%[%_?]%', @FirstName) >= 4)
    SET @Recompile = 0;

IF @Recompile = 1
BEGIN
    SELECT @ExecStr = @ExecStr + N' OPTION(RECOMPILE)';
END;

EXEC [sp_executesql] @ExecStr
    , N'@CustID BIGINT, @LName VARCHAR(30), @FName VARCHAR(30)'
    , @CustID = @SProviderID
    , @LName = @LastName
    , @FName = @FirstName;
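For example (with hypothetical parameter values), the three kinds of calls behave like this. Note that each parameter combination also produces a different statement string, so each combination gets its own cached plan:

-- ID lookup: at most one row, stable plan, safely cached and reused
EXEC [dbo].[GetServiceProviderData] @SProviderID = 12345;
-- 'Smi%': first wildcard at position 4, selective literal prefix, plan reused
EXEC [dbo].[GetServiceProviderData] @LastName = 'Smi%';
-- '%om%': leading wildcard, unpredictable row count, OPTION(RECOMPILE) is appended
EXEC [dbo].[GetServiceProviderData] @FirstName = '%om%';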
Index (Table) Tuning:
Choose the correct index:
We always assume that indexes improve performance, which is true most of the time, but before choosing an index we need to understand what type of index our case calls for and which columns it should be built on. Choosing the wrong type of index, or the wrong columns, can negatively impact performance.
Internal architecture of data storage:
Before diving into indexes, let's first look at how SQL Server stores data. SQL Server always stores data in the form of pages (8 KB = 8*1024 bytes).
There are different types of pages; some store data records, some store index records and others store metadata of various sorts. All of them have one thing in common, which is their structure. A page is always exactly 8 KB (8192 bytes) in size and contains two major sections, the header and the body. The header has a fixed size of 96 bytes and has the same contents and format, regardless of the page type. It contains information such as how much space is free in the body, how many records are stored in the body, the object to which the page belongs and, in an index, the pages that precede and succeed it.
Image 1: Typical structure of a data page.
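If you want to inspect a page header yourself, the undocumented (but widely used) DBCC PAGE command dumps a page's contents; a quick sketch, assuming a database named MyDB:

-- Redirect DBCC PAGE output to the client instead of the error log
DBCC TRACEON (3604);
-- Arguments: database, file id, page id, print option (3 = header plus row details)
DBCC PAGE ('MyDB', 1, 200, 3);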
Heap and Indexes:
A table that does not have a clustered index is referred to as a heap, and a table that has a clustered index is referred to as a clustered table. A table is either a heap or has exactly one clustered index.
In a heap, data is not stored in any particular order and the data pages are not linked, so sequential access has to go through the index allocation map (IAM) pages.
In a clustered table, data is stored in order of the clustered index key, so data can be retrieved quickly when a query filters on the indexed columns.
From the above, it is clear that having a clustered index on a large table will definitely help query performance, as long as the indexed column is used in the WHERE clause.
What all this means is that you should consider using a heap only when you are working with very small tables, or when your table operations are limited to inserts and your queries are fairly basic (and you are still using non-clustered indexes). Otherwise, stick with a well-designed clustered index.
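As a minimal sketch (hypothetical order tables), the only structural difference between the two is the clustered index:

-- A heap: no clustered index, rows go wherever free space is found
CREATE TABLE dbo.OrderStaging
(
    OrderID INT NOT NULL,
    OrderDate DATE NOT NULL
);
-- A clustered table: same columns, but rows are kept in OrderID order
CREATE TABLE dbo.Orders
(
    OrderID INT NOT NULL,
    OrderDate DATE NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderID ON dbo.Orders (OrderID);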
Generally, in large applications we cannot get by with heap tables alone; we need indexed tables. The next question, then, is: what is the ideal candidate (column) for an index key?
What is a clustered key? A clustered index is like the white pages of a phone book. The phone book stores every person's name in alphabetical order, making it easy to look up a particular individual, and once we find someone, their address and phone number are right there next to their name. That is exactly the great feature of a clustered index: if you ever need to retrieve many or all columns from your table, it will usually be efficient, because once it finds the indexed value you are searching on, it does not need to go anywhere else to get the remaining data for that row. The whole row is stored in the data page.
Who is the right candidate for a clustered/index key? Some people think they can ONLY put their clustered index on their PK. That's not true! It can often be much more beneficial to put your clustered index on something that isn't your PK, such as a different column that gets more use than the PK. For example, you might have an identity column set as your PK, yet every query against your table filters and sorts on a different column. Why store your table in PK order if you are always going to filter and return data on a separate column? Put the clustered index on the column you most commonly use for filtering and ordering, not on the PK. Remember, though, that inserts and updates can become a bit slower, since SQL Server needs to find the appropriate place in the data pages to keep each record.
In SQL Server, to get the maximum advantage of a clustered index, choose a column with a high degree of uniqueness that can be used in range queries, that is often accessed sequentially, and that is monotonic, incremental and unique.
Remember, a GUID appears to fulfill most of the characteristics mentioned in the above paragraph, but is it a good idea to have a GUID as a clustered key? NO. Why not?
Choosing a GUID as the clustering key is a horrible design. GUIDs may seem a natural choice for your primary key, but is using a GUID as the clustering key a good idea? By default, the primary key on a SQL Server table is also used as the clustering key, but it does not have to be that way; you can change it. You can keep the GUID as the PK but not as the clustering key, and create the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
A GUID is unnecessarily wide (at 16 bytes it is 4 times wider than an int-based identity), and because the clustering key is repeated in every non-clustered index, estimate what that costs on a big table with a few indexes; on top of that come the fragmentation costs, since random GUIDs cause constant page splits. Disk space is cheap, but this will hurt performance horribly.
An auto-increment INT or BIGINT column (narrow, ever-increasing, unique) is a good choice for the PK, while a column with a high degree of uniqueness that is used in range queries is the right choice for the clustering key (CK).
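A hypothetical sketch of that pattern: the GUID stays as the primary key, while a narrow, ever-increasing identity column carries the clustered index:

CREATE TABLE dbo.ServiceProvider
(
    ProviderGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()
        CONSTRAINT PK_ServiceProvider PRIMARY KEY NONCLUSTERED, -- GUID stays the PK, but non-clustered
    ProviderSeq BIGINT IDENTITY(1,1) NOT NULL, -- narrow, ever-increasing
    ProviderName VARCHAR(100) NOT NULL
);
-- The clustering (ordering) key goes on the identity column instead
CREATE UNIQUE CLUSTERED INDEX CIX_ServiceProvider_Seq
    ON dbo.ServiceProvider (ProviderSeq);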
What is a non-clustered index and how does it work:
Unlike a clustered index, the leaf nodes of a non-clustered index contain only the values from the indexed columns and row locators that point to the actual data rows, rather than the data rows themselves. This means the query engine must take an additional hop to locate the actual data. Remember: if the table is a clustered table, the row locator points to the clustered index, which is then traversed to fetch the actual data; if the table is a heap, the locator points directly to the data row.
Like a clustered index, a non-clustered index can have up to 16 key columns, but unlike a clustered index it also supports included columns, and there you are not restricted to that number: in SQL Server you can include up to 1,023 columns per non-clustered index. Just remember that having many key columns is not recommended.
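For example, a hypothetical covering index on the Customers table used earlier (Phone and Email are assumed columns): the two key columns drive the seek, while the included columns live only at the leaf level, so the query below is answered without touching the base table at all.

CREATE NONCLUSTERED INDEX IX_Customers_LastName_FirstName
    ON dbo.Customers (LastName, FirstName)
    INCLUDE (Phone, Email);
-- Fully covered: a seek on the index, no lookup into the clustered index or heap
SELECT LastName, FirstName, Phone, Email
FROM dbo.Customers
WHERE LastName = 'Smith';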
Don’t over-index the transactional table:
Before designing any index on a table, look at what kind of activity that table sees. Is it used only for reporting (SELECT statements), or is it also a transactional table (INSERT/UPDATE/DELETE), and how much transactional activity does it get? When you add a record to a transactional table, the INSERT happens not only on the data page but also in every associated index, and the same is true for UPDATE and DELETE. Whether your table is a heap or a clustered table, a lot of INSERTs, UPDATEs and DELETEs will fragment the data pages, which wastes space and forces extra page reads to satisfy queries. And if you have too many indexes, transactional activity slows down because every write turns into heavy I/O.
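One common way to spot over-indexing is to compare reads against writes per index using the index usage DMV (a sketch; the counters reset at instance restart, so treat them as indicative only):

-- Non-clustered indexes that are written far more often than they are read
SELECT OBJECT_NAME(s.[object_id]) AS table_name,
    i.[name] AS index_name,
    s.user_seeks + s.user_scans + s.user_lookups AS reads,
    s.user_updates AS writes
FROM sys.dm_db_index_usage_stats AS s
INNER JOIN sys.indexes AS i
    ON i.[object_id] = s.[object_id] AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
    AND i.type_desc = 'NONCLUSTERED'
ORDER BY s.user_updates DESC;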
Use fill factor wisely on a transactional table:
The fill-factor option is provided for fine-tuning index data storage and performance. When an index is created or rebuilt, the fill factor determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder as free space for future transactional activity.
For example, in the statement below, the leaf pages will be 80% full and 20% empty.
CREATE INDEX IX_Employee_Name ON HumanResources.Employee
(Employee_Name) WITH (FILLFACTOR = 80);
A correctly chosen fill-factor value can reduce page splits by leaving enough room for index expansion as data is added to the table, or as larger values are written over existing data on a page.
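The same setting can also be applied when rebuilding an existing index, which is usually where you tune it over time:

ALTER INDEX IX_Employee_Name ON HumanResources.Employee
REBUILD WITH (FILLFACTOR = 80);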
A few other performance/space killers:
Avoid using MAX if you already know the maximum data width:
Sometimes, without realizing it, we create a column as VARCHAR(MAX) even when we know the user is never going to enter more than 100 characters in that field. MAX is a safety net, but if you don't need it, don't put it up.
VARCHAR(MAX) and NVARCHAR(MAX) are considered 'large value types'. Large value types are usually stored 'out of row' (only if the size exceeds 8,000 bytes), which means the data row holds a pointer to another location where the 'large value' is stored. (Copied from MSDN.)
Avoid using NVARCHAR/NCHAR for all columns if it's not needed:
If your database is going to store multilingual data, you should use the nvarchar datatype instead of varchar; otherwise stick with varchar, because nvarchar takes twice as much space as varchar.
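A small hypothetical table that shows both points together: reserve MAX and nvarchar for the columns that genuinely need them.

CREATE TABLE dbo.Contact
(
    Email VARCHAR(100) NOT NULL, -- known width, ASCII-only: varchar is enough
    DisplayName NVARCHAR(100) NOT NULL, -- may hold multilingual names: nvarchar is justified
    CountryCode CHAR(2) NOT NULL, -- fixed width, ASCII-only: half the bytes of nchar
    Notes VARCHAR(MAX) NULL -- genuinely unbounded text: MAX is appropriate here
);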
Constraints (e.g. foreign keys) boost performance: Most of us as developers try to avoid putting constraints on back-end tables, believing that constraints exist only to maintain data integrity. Used wisely, however, they also contribute to performance.
Let’s try to see with a simple example.
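Here is a minimal version of the two tables (hypothetical definitions, just enough to reproduce the behavior):

CREATE TABLE dbo.Customer1
(
    CustomerId INT NOT NULL CONSTRAINT PK_Customer1 PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL
);
CREATE TABLE dbo.OrderHeader1
(
    OrderNo INT NOT NULL CONSTRAINT PK_OrderHeader1 PRIMARY KEY,
    OrderDate DATE NOT NULL,
    CustomerId INT NOT NULL -- references Customer1 via the FK added below
);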
I have created two simple tables, and I am fetching data ONLY from the OrderHeader1 table. Yet if you look at the execution plan, OrderHeader1 is scanned and Customer1 is sought as well, so both tables are being read just to validate the join.
SELECT OH.OrderDate, OH.OrderNo
FROM OrderHeader1 AS OH
INNER JOIN Customer1 AS C
    ON OH.CustomerId = C.CustomerId
Now I simply put a trusted foreign key constraint on the OrderHeader1 table, as below, and run the same query again.
ALTER TABLE [dbo].[OrderHeader1] WITH CHECK ADD CONSTRAINT [fkCust] FOREIGN KEY ([CustomerId])
REFERENCES [dbo].[Customer1] ([CustomerId])
Below is the execution plan after applying the FK constraint: only one table is read. Because the trusted constraint guarantees that every OrderHeader1 row has a matching Customer1 row, the optimizer can eliminate the join entirely (note that a constraint added WITH NOCHECK is not trusted and will not enable this). We can easily imagine the impact on performance when both tables are heavily loaded.
Thanks for reading!!