Recovering from Critical Data Loss: My Experience with Oracle 12c, RMAN, and Spark

Introduction

Data loss incidents are among the most challenging scenarios for IT professionals. This article details my experience with a critical data loss event in my lab involving Oracle 12c, Recovery Manager (RMAN), and Spark, a fault that was entirely mine. Having been away from day-to-day DBA work since 2016, I found myself needing to brush up on my Oracle skills to handle the accidental data loss and the subsequent recovery.

For many years I have maintained my Hive metadata on Oracle, which I regard as a very reliable DBMS, and I keep various other schemas on Oracle as well. I run my personal accounts on a Hive data warehouse, which in turn needs a robust metadata host, Oracle in this case. Ironically, having root access to my lab Linux hosts only added to the risk, and since Spark works closely with Hive and Hadoop, the full impact became clear the morning after, so to speak.

The Critical Incident

Accidental Reformatting

The trouble began when I tried to fix spurious filesystem errors on /dev/sdc7 (a 4TB disk partition on Linux where Oracle files and Spark releases were kept), which held crucial files for both the Oracle database and the Spark binaries.

Filesystem Check Attempt:

cd /d4T
df -k .
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sdc7      890601504 486438068 404147052  55% /d4T

ls
hduser  oracle

# Try to see problems with e2fsck!
e2fsck -f /dev/sdc7        

I used e2fsck to check and repair filesystem issues.
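
With hindsight, a safer sequence would have been to unmount the filesystem and do a read-only pass first. A minimal sketch, using the same device as above:

# Unmount first; never run e2fsck against a mounted filesystem
umount /d4T

# Dry run: -n opens the filesystem read-only and answers "no" to all prompts
e2fsck -n /dev/sdc7

# Only if the dry run looks sane, run the real forced check
e2fsck -f /dev/sdc7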

Accidental Reformatting:

While googling and not concentrating, I came across the command below and, like a dummy, ran it without thinking twice!

# The deadly command: reformat the partition, destroying everything on it
mke2fs -T ext4 -m 0 /dev/sdc7
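
A trivial pre-flight check would have caught my mistake before it happened. A minimal sketch, again assuming /dev/sdc7:

# What is actually on this partition?
blkid /dev/sdc7

# Refuse to go any further if it is still mounted (script-style guard)
findmnt /dev/sdc7 && echo "STILL MOUNTED - do not format" && exit 1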

Immediate Consequences

The accidental reformatting led to:

  • Loss of Oracle Database Files: Datafiles, control files, and archived logs were erased.
  • Loss of Spark Binaries: All Spark binaries and related files, kept under the hduser folder, were lost.

The Recovery Process

  1. Brushing Up on Skills and Tools:

Having been away from DBA work for several years, I needed to quickly reacquaint myself with Oracle 12c and RMAN. I consulted the available online documentation and re-familiarized myself with RMAN commands and recovery procedures. Fortunately, I had performed an RMAN backup a few days before.

  2. Rebuild of Oracle and RMAN Instances:

At that moment it looked like a mammoth task. I recreated the 4TB filesystem (now freshly formatted), quickly created the oracle and hduser folders with the correct permissions, and used the Oracle dbca utility on Linux to create new Oracle 12c and RMAN catalog database instances. I know these are a bit old hat compared with the Oracle Cloud releases, but they do a good job without introducing additional risks!
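
For reference, a skeleton instance can be created with dbca in silent mode along these lines. This is only a sketch: the database name is taken from my backup piece prefix, the passwords and paths are placeholders, and the exact flags vary by release (check dbca -help):

dbca -silent -createDatabase \
     -templateName General_Purpose.dbc \
     -gdbName ORASOURC -sid ORASOURC \
     -sysPassword <sys_pwd> -systemPassword <system_pwd> \
     -datafileDestination /d4T/oracle/oradata \
     -characterSet AL32UTF8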


  3. Verify Recovery Catalog and RMAN Configuration:

Now that I had a skeleton Oracle DB, I ensured the RMAN recovery catalog was operational and that the database was registered correctly.
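
A quick way to confirm the catalog connection and registration, using an illustrative catalog connect string:

rman target / catalog rman_user/password@catalog_db

RMAN> REGISTER DATABASE;   # only needed if the target is not already registered
RMAN> REPORT SCHEMA;       # confirms RMAN can see the (skeleton) database structure
RMAN> SHOW ALL;            # review the current RMAN configuration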

Restore Control File and SPFILE:

  • Restore Control File:

RMAN> RESTORE CONTROLFILE FROM '/backup/oracle/rman_backups/ORASOURC_c-2409346222-20240820-01.ctl';
        

  • Restore SPFILE

RMAN> RESTORE SPFILE FROM '/backup/oracle/rman_backups/ORASOURC_c-2409346222-20240820-01.ctl';
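
In practice, with the control file gone, both restores have to be done with the instance started NOMOUNT and the DBID set by hand. Roughly, with the DBID taken from the backup piece names:

RMAN> STARTUP FORCE NOMOUNT;    # starts on a dummy parameter file if none exists
RMAN> SET DBID 2409346222;
RMAN> RESTORE SPFILE FROM '/backup/oracle/rman_backups/ORASOURC_c-2409346222-20240820-01.ctl';
RMAN> STARTUP FORCE NOMOUNT;    # restart on the restored SPFILE
RMAN> RESTORE CONTROLFILE FROM '/backup/oracle/rman_backups/ORASOURC_c-2409346222-20240820-01.ctl';
RMAN> ALTER DATABASE MOUNT;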
        

OK, time for the acid test.

Fortunately, I had some recent backups under a separate directory:

 /backup/oracle/rman_backups> ls -l
total 20946324
-rw-r-----. 1 oracle dba 3450314752 Jul 17 12:04 ORASOURC_2409346222_20240717_15304rda_1_1
-rw-r-----. 1 oracle dba 2984853504 Jul 17 12:07 ORASOURC_2409346222_20240717_16304rda_1_1
-rw-r-----. 1 oracle dba   10993664 Jul 17 12:07 ORASOURC_c-2409346222-20240717-00.ctl
-rw-r-----. 1 oracle dba 2499205632 Aug 14 11:53 ORASOURC_2409346222_20240814_1j32el00_1_1
-rw-r-----. 1 oracle dba 2845873152 Aug 14 11:54 ORASOURC_2409346222_20240814_1i32el00_1_1
-rw-r-----. 1 oracle dba   43698176 Aug 14 11:54 ORASOURC_2409346222_20240814_1k32em3q_1_1
-rw-r-----. 1 oracle dba   11321344 Aug 14 12:55 ORASOURC_c-2409346222-20240814-00.ctl
-rw-r-----. 1 oracle dba   11223040 Aug 14 13:06 snapcf_san.f
-rw-r-----. 1         

Using these backups, I got RMAN to load them back in.

  • Start RMAN and connect to the target database and, if applicable, the recovery catalog:

rman target / catalog rman_user/password@catalog_db        

Set the Backup Location

Instruct RMAN to look for backup pieces in the specific directory where your backup files are stored.

CATALOG START WITH '/backup/oracle/rman_backups/';        

Once done, verify that the backups have been cataloged correctly:

LIST BACKUP;        
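
It is also worth a crosscheck at this point, so that RMAN marks any pieces that have gone missing or are unreadable:

CROSSCHECK BACKUP;
LIST BACKUP SUMMARY;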

Restore the Control File (If Needed)

RESTORE CONTROLFILE FROM '/backup/oracle/rman_backups/ORASOURC_c-2409346222-20240814-00.ctl';        

After restoring the control file, mount the database:

ALTER DATABASE MOUNT;        

Restore the Database

RESTORE DATABASE;        

Recover the Database

RECOVER DATABASE;        

Open the Database

ALTER DATABASE OPEN RESETLOGS;        

Of course, the devil is in the details. You are likely to hit RMAN-07518 (Reason: Foreign database file DBID: 2409346222 Database Name: ORASOURCE), which means RMAN recognizes the backup files but considers them to belong to a different database, identified by its DBID (Database Identifier), and you will need to resolve that before the restore can proceed. It is a frustrating, iterative process: there are many steps to work through (and retreat is not an option), and you will find yourself recycling Oracle and starting it in different modes until the issues are sorted out. You may also need to increase the size of the Fast Recovery Area (FRA) and set its location. All sorts of weird and wonderful errors.
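
For the record, the two fixes I needed most often were along these lines (the FRA size and path shown are illustrative, not my exact values):

# RMAN-07518: tell RMAN explicitly which database the backup pieces belong to
RMAN> SET DBID 2409346222;

# FRA too small or not set: adjust from SQL*Plus
SQL> ALTER SYSTEM SET db_recovery_file_dest_size = 100G SCOPE=BOTH;
SQL> ALTER SYSTEM SET db_recovery_file_dest = '/d4T/oracle/fra' SCOPE=BOTH;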

Resolving hduser directory loss

As I have two Linux hosts, I was able to tar the hduser folder under /d4T/ on the unaffected host, scp the tarball to the affected host, and untar it there to recreate the files.
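
The copy itself was nothing more exotic than tar and scp, roughly as follows (the hostname is a placeholder):

# On the unaffected host
cd /d4T
tar -czf /tmp/hduser.tar.gz hduser
scp /tmp/hduser.tar.gz hduser@affected_host:/tmp/

# On the affected host
cd /d4T
tar -xzf /tmp/hduser.tar.gz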

Lessons learned

1. The Importance of Regular Backups and Data Management: The experience of accidentally formatting a critical filesystem reinforced the importance of maintaining regular and comprehensive backups. Having Oracle backups stored on a separate filesystem was a crucial safeguard that allowed for a full recovery, highlighting the need for a sound backup strategy in any data-intensive environment.

2. Critical Role of Documentation: Being away from day-to-day DBA work since 2016, I realized how essential proper documentation is. The absence of up-to-date notes meant I had to quickly brush up on my skills and rely heavily on available documentation and my past experience. This underscored the value of maintaining detailed and current documentation for all systems and processes.

3. Staying Calm Under Pressure: The situation required a methodical approach to recovery, emphasizing the importance of remaining calm and focused when dealing with critical system failures. Careful planning and execution are key when handling emergencies to avoid further complications.

4. Continuous Learning and Skill Refreshment: The necessity to recover the Oracle database and Spark binaries reminded me of the rapidly evolving nature of technology. Keeping skills sharp and staying updated on the latest tools and techniques are essential for effectively managing and troubleshooting complex systems.

5. Automation and Disaster Recovery Planning: This experience highlighted the need for robust disaster recovery planning and the potential of automation. Implementing automated backup verification and recovery processes can save valuable time and reduce the risk of human error during critical situations.

6. Cross-Platform Expertise: The ability to work across different cloud platforms (AWS, Azure, GCP) and integrate various tools (like Talend MDM and Kubernetes) proved to be invaluable. It reinforced the lesson that diversifying skillsets and understanding the interoperability of different technologies can significantly enhance problem-solving capabilities.


Developing a structured methodology for problem-solving is crucial, especially in complex scenarios like database recovery or dealing with accidental data loss. Here is a methodology one can follow to tackle such problems systematically:

1. Problem Identification

- Understand the Issue: Clearly define what happened (e.g., accidental formatting, data corruption, or deletion).

- Gather Information: Collect all relevant details such as error messages, logs, the specific command or action that caused the problem, and the current state of the system.

2. Impact Assessment

- Determine Affected Areas: Identify which parts of the system were impacted. In my case, this was the Oracle database, the Spark binaries, and any other critical data on the affected filesystem.

- Evaluate the Severity: Assess how critical the data or system affected is to your operations. This helps prioritize actions.

3. Stop Further Damage

- Isolate the Issue: If possible, take the affected system or filesystem offline to prevent further damage or data overwriting (see the sketch after this list).

- Prevent Access: Ensure no new data is written to the affected area until recovery is planned and initiated.
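
On Linux this usually amounts to one of the following, shown here against my own mount point as a minimal sketch:

# Take the damaged filesystem out of service entirely
umount /d4T

# Or, if read access is still needed, make it read-only
mount -o remount,ro /d4T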

4. Develop a Recovery Strategy

- Review Available Resources: Check what backups, logs, and recovery tools you have at your disposal.

- Define Recovery Goals: Determine what you need to restore, to what point in time (e.g., using an SCN or timestamp), and the acceptable downtime (see the RMAN sketch after this list).

- Plan the Steps: Outline the recovery steps, such as restoring specific files, running recovery commands, or reinstalling software. This plan should be methodical and ensure that each step is clearly understood before execution.
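
For an Oracle database, the recovery goal usually translates into an UNTIL clause in RMAN. A minimal sketch with an illustrative timestamp:

RMAN> RUN {
  SET UNTIL TIME "TO_DATE('2024-08-14 11:00:00','YYYY-MM-DD HH24:MI:SS')";
  RESTORE DATABASE;
  RECOVER DATABASE;
}
RMAN> ALTER DATABASE OPEN RESETLOGS;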

5. Execute the Recovery Plan

- Step-by-Step Execution: Follow the recovery plan meticulously, verifying each step before moving on to the next.

- Monitor the Process: As you recover data, continuously monitor the system for any signs of additional issues.

6. Verification and Testing

- Check Data Integrity: After restoration, validate that the data is intact and consistent. For a database, this involves running consistency checks (see the sketch after this list).

- Functional Testing: Ensure that all services (like Oracle DB and Spark) are running as expected. Perform test runs of applications and processes that depend on the restored data.

- System Performance: Verify that the system's performance is as expected post-recovery.
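
In the Oracle and Spark case, the checks I leaned on were along these lines (a minimal sketch; the Spark query simply confirms it can reach the restored Hive metastore):

# Physical consistency of the restored database
RMAN> VALIDATE DATABASE;

# Smoke test: Spark can talk to the restored Hive metastore
spark-sql -e "SHOW DATABASES;"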

7. Documentation

- Document the Process: Write down, or record in Confluence, each step taken: what worked, what didn't, and any lessons learned. This documentation is invaluable for future incidents.

- Update Procedures: If the recovery process revealed gaps in your backup strategy or other procedures, update them accordingly.

8. Post-Mortem Analysis

- Root Cause Analysis: Determine why the issue occurred (e.g., human error, lack of safeguards) and what can be done to prevent it in the future.

- Review and Reflect: Analyze what worked well and what didn’t during the recovery. This should inform improvements to your processes.

9. Continuous Improvement

- Implement Safeguards: Based on the post-mortem, implement measures to avoid a repeat of the incident. This could involve additional training, changes to access permissions, or more robust backup strategies.

10. Learn and Share

- Share Lessons: If applicable, share your experience and lessons learned with your team or broader community. This not only helps others but also reinforces the knowledge within your organization.

- Stay Updated: Keep your skills and knowledge up-to-date with regular training and by staying informed about new tools and techniques.

Example: Applying the Methodology

In my specific case:

1. Identification: Recognized that a critical filesystem was accidentally formatted.

2. Assessment: Evaluated that Oracle DB and Spark binaries were impacted.

3. Stop Damage: Ensured no further writes were made to /dev/sdc7.

4. Recovery Strategy: Planned to use RMAN backups to restore Oracle and reinstall Spark.

5. Execution: Restored Oracle, verified it, then reinstalled and tested Spark.

6. Verification: Ran checks on the database and executed Spark jobs to ensure everything was back to normal.

7. Documentation: Recorded all steps and updated disaster recovery protocols.

8. Post-Mortem: Analysed the cause and implemented new safety measures to prevent future incidents.

Disclaimer: Great care has been taken to make sure that the technical information presented in this article is accurate, but any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on its content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



