Database Corruption - Real world Issue
Ayman Elnory
Technical Lead, DBA Manager @ Najm | Certified Professional, Database Expert | Technical Author, Trainer
Corruption does happen, many times per day, all around the world
- Many people don’t realize they have corruption until too late
- Either they do not know how to check for corruption or they miss the warning signs that corruption has occurred.
- The faster you realize you have corruption, the more likely you will be able to recover with minimal downtime and data loss
- Many people don’t know what to do when they do have corruption, leading to:
- More data loss and downtime than necessary
- Monetary and even job losses
- Overall lowered perception of SQL Server
o Makes it harder to convince management that SQL Server is Enterprise-capable
I will discuss in this article one simple example of database corruption and a real-world story that I faced with database corruption.
What is SQL corruption? What causes it?
Before that let me give you a simple answer to these question:
So, if a data file page is “good” when written out of SQL Server’s memory, but “bad” when reading back into memory, that is corruption simply.
In 99.99% cases of corruption, the reason is almost always the I/O subsystem.
- The I/O subsystem means anything “underneath” SQL Server:
- Windows operating system
- File system filter drivers E.g. antivirus, defraggers, encryption
- Network cards, switches, cables
- SAN controllers
- RAID controllers
- Disks
Page Corruption
SQL Server has a bunch of mechanisms built into it that allows it to automatically detect when a problem occurs when it's reading a page from disk, whether that page is corrupt for instance or the I/O just fails. It also has mechanisms that allow it to tell you when problems have occurred rather than just detecting them:
- Torn-page detection
- Page checksums
I will not get into details about these two mechanisms now, but I encourage every DBA working with SQL server 2005 onwards to enable page checksums so that you can realize the corruption once it happens.
I/O Error Example
- There are actually three different kinds of I/O errors.
- 823: a hard I/O error
- 824: a soft I/O error
- 825: a read-retry error
So 823 and 824 are the ones that you are most likely to see. Now, these are both severity 24 errors, it is the highest severity of any error that you can get and it means that your connection is going to be broken. Anytime an 823 or an 824 occurs, it is going to be logged in the msdb.dbo.suspect_pagestable which can be used as input into single-page restore operations.
Also, it is going to be written to the SQL Server error log and the Windows Application event log.
Let us get into our simple example here. I have a database called “company” with one table called “RandomData” filled with 10 records as you can see:
Now, list all the pages in the table:
DBCC IND (N'Company', N'RandomData', -1);
GO
And pick a random one of them like page number 284 and run the following query which is going to overwrite the first couple of bytes in that page on disk.
ALTER DATABASE [Company] SET SINGLE_USER;
GO
DBCC WRITEPAGE (N'Company', 1, 284, 0, 2, 0x0000, 1);
GO
ALTER DATABASE [Company] SET MULTI_USER;
GO
And then I'm just going to get rid of any places that might have errors in them so I can show you a clean error log.
DELETE FROM [msdb].[dbo].[suspect_pages]; EXEC sp_cycle_errorlog; GO
Now if you select from the table you will get the corruption error message:
SELECT * FROM [Company].[dbo].[RandomData]; GO
So this is what we expected in terms of our page checksum and this is what we got. So those are different and the buffer pool recognized that as it read the page into memory and it threw the 824 error. It closed our connection, see down at the bottom it says disconnected there and it tells us which page ID it was, it was the page that we just corrupted. Now we can also see that error shown in the error log.
That is the same message we saw there in our output.
And we can also go and look in the Application event log.
There is the error right there and it is exactly the same message.
So if we go back into Management Studio, we can also look at the suspect_pages table in MSDB.
SELECT * FROM [msdb].[dbo].[suspect_pages]; GO
This has one entry for each page that problems have been detected on. And the event_type is all documented in books online, so we can see that just a few minutes ago where it came across the 824 and the number of times that has been hit.
So a very simple example just showing you what an I/O error actually looks like and you will be able to see them in the error log, you'll be able to see them in the Windows Application event log, you actually get the connection broken, and you can also look after the fact in the suspect_pages table in MSDB.
Real-world example
- Story
- Customer is a small bank in the US
- The database is corrupt and it takes literally days to bring online
- Cause
- Customer only had a full backup from January plus log backups every ? hour until April when corruption hit
- All backups are stored on tape
- Resolution
- Restored ALL the backups (approximately 5000!) and luckily none were corrupt
- Cost
- Aggravation to bank customers while bank was down for several days
- The bank went out of business
- Lessons learned
- Plan a backup strategy that allows an efficient and quick restore
- Test the disaster recovery strategy
- Get prepared with a third party tool like Stellar Phoenix SQL Database Repair software which can do all of the following:
o Repairs SQL server database
o Restores objects of MDF and NDF Files
o Performs deleted record recovery
o repairs stored procedures
o Multiple saving options
o Repairs all database components
o Preview of database objects
o Allows selective recovery
o Recovers ROW and PAGE compressed data
o Allows searching the items
All of that can be done in much less time than restoring backups.
Summary:
As I mentioned, almost every DBA is going to see corruption at some point in their career, so it is really useful to understand the different causes of corruption and why it is so inevitable that you are going to see corruption at some point. So I encourage you to start downloading the trial version of Stellar Phoenix SQL Database Repair software to be prepared for the most critical situation you might face. I hope this article has been informative for you.