
Oracle® Database High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-02

10 Detailed Recovery Steps

This chapter describes the detailed recovery operations that are referred to in the outages and solutions tables in Chapter 9, "Recovering from Outages".

10.1 Summary of Recovery Operations

This chapter describes the detailed recovery operations that are referred to in the outages and solutions tables in Chapter 9, "Recovering from Outages". Table 10-1 summarizes the recovery operations that are described in this chapter.

Table 10-1 Recovery Operations

Recovery Operation Description
Complete or Partial Site Failover
In the complete site failover scenario, existing connections fail and new connections are routed to a secondary or failover site. This occurs when there is a true disaster and where the application stack is replicated.

In the partial site failover scenario, the primary site is intact, and the middle-tier applications need to be redirected after the database has failed over or switched over to a standby database on the secondary site. This configuration is not recommended if performance decreases dramatically because of the greater latency between the application servers and the database.

Database Failover
A Data Guard failover is invoked in the database tier because of an unscheduled outage. If complete recovery is attempted, then there is minimal or no data loss. A subsequent complete or partial site failover must occur. The previous production database can be converted to a standby database by flashing back the database.
Database Switchover
A Data Guard switchover occurs in the database tier. The previous production database becomes the new standby database while the previous standby database becomes the new production database. A subsequent complete or partial site failover must occur. This is a scheduled or planned outage.
RAC Recovery
RAC automatically handles instance and node failures at a given site to provide continued access to the backend database. The reconnection to available instances can be transparent from an application standpoint and occurs only on the primary site. The application service migrates to the available or designated instances.
Apply Instance Failover
When a standby node requires maintenance, you can switch to another standby instance to avoid any impact on the production database and to ensure that standby recovery does not fall behind. If maximum protection mode is enabled and the entire standby cluster requires maintenance or fails, then the production database must be downgraded to maximum availability or maximum performance mode.
Application Failover
The client or application tier automatically fails over to one or more surviving RAC instances when an instance or node failure occurs at the primary site.

See Also: "Recommendations for Fast Application Failover"

Recovery Solutions for Data Failures
When data failure occurs due to media corruption or media damage, you can use different recovery options that use the flash recovery area or switch over to the standby database. Alternatively, you can rebuild tables by fast imports or rebuild indexes in parallel.
Recovering from User Error with Flashback Technology
When a user error causes transactional or logical data inconsistencies, you can resolve these problems by using flashback error correction technology at the row, transaction, table, or database levels.
RAC Rolling Upgrade
With RAC, you can apply patches for specific customer issues incrementally to one node or instance at a time, which enables continual application and database availability.
Upgrade with Logical Standby Database
Data Guard enables you to upgrade the database on the standby database and perform a switchover, which minimizes the overall scheduled outage time for applying patch sets or software upgrades for the database.
Online Object Reorganization
Many scheduled outages related to the data server involve some reorganization of database objects, which must be accomplished while the database remains available. Oracle online object reorganization is used to manage these scheduled outages.

10.2 Complete or Partial Site Failover

This section describes the following client failover scenarios:

In the complete site failover scenario, existing connections fail and new connections are routed to a secondary or failover site. This occurs when there is a true disaster and where the application stack is replicated.

In the partial site failover scenario, the primary site is intact, and the middle-tier applications need to be redirected after the database has failed over or switched over to a standby database on the secondary site. This configuration is not recommended if performance decreases significantly because of the greater latency between the application servers and the database.

10.2.1 Complete Site Failover

A wide-area traffic manager is implemented on the primary and secondary sites to provide the site failover function. The wide-area traffic manager can redirect traffic automatically if the primary site or a specific application on the primary site is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is directed to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primary site fails, then user traffic is directed to the secondary site.

Figure 10-1 illustrates the network routes before site failover. Client requests enter the client tier of the primary site and travel by the WAN traffic manager. Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier. Requests are then forwarded through the active load balancer to the application servers. They are then sent through another firewall and into the database server tier. The application requests, if required, are routed to a RAC instance. Responses are sent back to the application and clients by a similar path.

Figure 10-1 Network Routes Before Site Failover

Description of maxav013.gif follows
Description of the illustration maxav013.gif

Figure 10-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow exactly the same path on the secondary site that they followed on the primary site.

Figure 10-2 Network Routes After Site Failover

Description of maxav028.gif follows
Description of the illustration maxav028.gif

The following steps describe what happens to network traffic during a failover or switchover.

  1. The administrator has failed over or switched over the production database to the secondary site.

  2. The administrator starts the middle-tier application servers on the secondary site.

  3. Typically, a Domain Name Service (DNS) administrator changes the wide-area traffic manager selection of the secondary site. Alternatively, the selection can be made automatically for an entire site failure. The wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site. In this scenario, the site failover is accomplished by a DNS failover. The following is an example of a manual DNS failover (an illustrative zone-file sketch follows the steps):

    1. Change the DNS to point to the secondary site load balancer.

    2. Set TTL (Time to Live) to a short interval for the DNS propagation.

    3. Disable DNS on the primary site.

    4. Execute a DNS "push" to point to the secondary site.

    5. Wait until all failover operations are complete.

    6. Change TTL back to its normal setting on the DNS server.

    7. The secondary site load balancer directs traffic to the secondary site middle-tier application server.

    8. The secondary site is ready to take client requests.
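
The TTL manipulation in steps 2 and 6 can be illustrated with a DNS zone-file fragment. This is a minimal sketch only; the hostname, addresses, and TTL values are hypothetical and depend on your DNS implementation and propagation requirements:

    ; Hypothetical BIND-style zone entries for the application's virtual hostname.
    ; Before the failover, lower the TTL so that cached entries expire quickly:
    app.example.com.    60     IN  A  192.0.2.10     ; primary site load balancer VIP
    ; After the DNS "push", the record resolves to the secondary site:
    app.example.com.    60     IN  A  198.51.100.10  ; secondary site load balancer VIP
    ; When all failover operations are complete, restore the normal TTL:
    app.example.com.    86400  IN  A  198.51.100.10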

Failover also depends on the client's web browser. Most browser applications cache the DNS entry for a period of time. Consequently, sessions in progress during an outage may not fail over until the cache timeout expires. The only way to resume service to such clients is to close the browser and restart it.

10.2.2 Partial Site Failover: Middle-Tier Applications Connect to a Remote Database Server

This usually occurs after the database has been failed over or switched over to the secondary site and the middle-tier applications remain on the primary site. The following steps describe what happens to network traffic during a partial site failover:

  1. The production database is failed over or switched over to the secondary site.

  2. The middle-tier application servers reconnect to the database on the secondary site using configuration best practices described in "Recommendations for Fast Application Failover".

Figure 10-3 shows the network routes after partial site failover. Client and application requests enter the primary site at the client tier and follow the same path to the database server tier as in Figure 10-1. When the requests enter the database server tier, they are routed to the database tier of the secondary site through any additional switches, routers, and possible firewalls.

Figure 10-3 Network Routes After Partial Site Failover

Description of maxav031.gif follows
Description of the illustration maxav031.gif

10.3 Database Failover

Failover is the operation of taking the production database offline on one site and bringing one of the standby databases online as the new production database. A failover operation can be invoked when an unplanned catastrophic failure occurs on the production database, and there is no possibility of recovering the production database in a timely manner.

Data Guard enables you to fail over by issuing the SQL statements described in subsequent sections, by using Oracle Enterprise Manager, or by using the Oracle Data Guard broker command-line interface.


See Also:

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line interface for database failover

Data Guard failover is a series of steps to convert a standby database into a production database. The standby database essentially assumes the role of production. A Data Guard failover is accompanied by a site failover to fail over the users to the new site and database. After the failover, the secondary site contains the production database. The former production database needs to be re-created as a new standby database to restore resiliency. The standby database can be quickly re-created by using Flashback Database. See "Restoring the Standby Database After a Failover".

During a failover operation, little or no data loss may be experienced. The complete description of a failover can be found in Oracle Data Guard Concepts and Administration.

The rest of this section includes the following topics:

10.3.1 When to Use Data Guard Failover

Data Guard failover should be used only in the case of an emergency and should be initiated due to an unplanned outage such as:

  • Site disaster

  • User errors

  • Data failures

A failover requires that the initial production database be re-created as a standby database to restore fault tolerance to your environment. The standby database can be quickly re-created by using Flashback Database. See "Restoring the Standby Database After a Failover".

10.3.2 When Not to Use Data Guard Failover

Do not use Data Guard failover when the problem can be fixed locally in a timely manner or when Data Guard switchover can be used. For failover with complete recovery scenarios, either the production database is not accessible or cannot be restarted. Data Guard failover should not be used where object recovery or flashback technology solutions provide a faster and more efficient alternative.

10.3.3 Data Guard Failover Using SQL*Plus

This section includes the following topics:

10.3.3.1 Physical Standby Failover Using SQL*Plus

  1. Check for archive gaps.

    SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# 
    FROM V$ARCHIVE_GAP;
    

See Also:

Oracle Data Guard Concepts and Administration for more information about what to do if a gap exists
  2. Shut down other standby instances.

  3. Finish recovery.

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
    
    
  4. Check the database state. Query switchover status readiness by executing the following statement:

    SELECT SWITCHOVER_STATUS FROM V$DATABASE;
    
    
  5. Convert the physical standby database to the production role.

    ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    
    
  6. Restart the instance.

10.3.3.2 Logical Standby Failover Using SQL*Plus

  1. Stop the current SQL Apply session.

    ALTER DATABASE STOP LOGICAL STANDBY APPLY;
    
    
  2. If there are additional logs to be registered (for example, if you can access the primary database or you are using LGWR transport to the standby destination), then register the log files.

    ALTER DATABASE REGISTER LOGICAL LOGFILE 'file_name';
    
    
  3. Start the SQL Apply session using the NODELAY and FINISH clauses.

    ALTER DATABASE START LOGICAL STANDBY APPLY NODELAY FINISH;
    
    
  4. Activate the logical standby database.

    ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE;
    
    

Failover has completed, and the new production database is available to process transactions.

10.4 Database Switchover

A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Thus, following a successful switchover operation, the standby database assumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that only one instance is active for each database, production and standby. At times the term "switchback" is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.

Data Guard enables you to change these roles dynamically by issuing the SQL statements described in subsequent sections, or by using Oracle Enterprise Manager, or by using the Oracle Data Guard broker command-line interface. Using Oracle Enterprise Manager or the Oracle Data Guard broker command-line interface is described in Oracle Data Guard Broker.

This section includes the following topics:

10.4.1 When to Use Data Guard Switchover

Switchover is a planned operation. Switchover is the capability to switch database roles between the production and standby databases without needing to instantiate any of the databases. Switchover can occur whenever a production database is started, the target standby database is available, and all the archived redo logs are available. It is useful in the following situations:

  • Scheduled maintenance, such as hardware or firmware patches, on the production host

  • Resolution of data failures while the production database is still open

  • Testing and validating the secondary resources, as a means to test disaster recovery readiness

  • Upgrades for database software and patch sets. See "Upgrade with Logical Standby Database".

10.4.2 When Not to Use Data Guard Switchover

Switchover is not possible or practical under the following circumstances:

  • Missing archived redo logs

  • Point-in-time recovery is required

  • The production database is not open and cannot be opened

Do not use Data Guard switchover when local recovery solutions provide a faster and more efficient alternative. The complete description of a switchover can be found in Oracle Data Guard Concepts and Administration.

10.4.3 Data Guard Switchover Using SQL*Plus

If you are not using Oracle Enterprise Manager, then the high-level steps in this section can be executed with SQL*Plus. These steps are described in detail in Oracle Data Guard Concepts and Administration.

This section includes the following topics:

10.4.3.1 Physical Standby Switchover Using SQL*Plus

  1. Shut down all production and standby instances except one for each site.

  2. Stop active sessions on the remaining active production instance.

    To identify active sessions, execute the following query:

    SELECT SID, PROCESS, PROGRAM 
    FROM V$SESSION 
    WHERE TYPE = 'USER' 
    AND SID <> (SELECT DISTINCT SID FROM V$MYSTAT);
    
    
  3. Check that the switchover status on the production database is 'TO STANDBY'.

    SELECT SWITCHOVER_STATUS FROM V$DATABASE;
    
    
  4. Switch over the current production database to the standby database.

    ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY [WITH SESSION SHUTDOWN];
    
    
  5. Start the new standby database.

    STARTUP MOUNT;
    RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
    
    
  6. Convert the former standby database to a production database.

    ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY [WITH SESSION SHUTDOWN];
    
    
  7. Restart all instances.

10.4.3.2 Logical Standby Switchover Using SQL*Plus

  1. Prepare the production database to become the logical standby database.

    ALTER DATABASE PREPARE TO SWITCHOVER TO LOGICAL STANDBY;
    
    
  2. Prepare the logical standby database to become the production database.

    Following this step, logs start to ship in both directions, although the current production database does not process the logs coming from the current logical standby database.

    ALTER DATABASE PREPARE TO SWITCHOVER TO PRIMARY;
    
    
  3. Commit the production database to become the logical standby database.

    This is the phase where current transactions on the production database are cancelled. All DML-related cursors are invalidated, preventing new records from being applied. The end of redo (EOR) marker is recorded in the online redo log and then shipped (immediately if using real-time apply) to the logical standby database and registered.

    ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY;
    
    
  4. Commit the logical standby database to become the production database.

    ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    
    
  5. Start the logical standby apply engine on the new logical standby database.

    If real-time apply is required, execute the following statement:

    ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
    
    

    Otherwise execute the following statement:

    ALTER DATABASE START LOGICAL STANDBY APPLY;
    

10.5 RAC Recovery

This section includes the following topics:

10.5.1 RAC Recovery for Unscheduled Outages

This section includes the following topics:

10.5.1.1 Automatic Instance Recovery for Failed Instances

Instance failure occurs when software or hardware problems disable an instance. After instance failure, Oracle automatically uses the online redo log file to perform database recovery as described in this section.

10.5.1.1.1 Single Node Failure in Real Application Clusters

Instance recovery in RAC does not include restarting the failed instance or the recovery of applications that were running on the failed instance. Applications that were running continue by using failure recognition and recovery as described in Oracle Real Application Clusters Installation and Configuration Guide. This provides consistent and uninterrupted service in the event of hardware or software failures. When one instance performs recovery for another instance, the surviving instance reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost. The instance that is performing recovery rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions.

10.5.1.1.2 Multiple Node Failures in Real Application Clusters

When multiple node failures occur, as long as one instance survives, RAC performs instance recovery for any other instances that fail. If all instances of a RAC database fail, then Oracle automatically recovers the instances the next time one instance opens the database. The instance that is performing recovery can mount the database in either shared or exclusive mode from any node of a RAC database. This recovery procedure is the same for Oracle running in shared mode as it is for Oracle running in exclusive mode, except that one instance performs instance recovery for all the failed instances in exclusive mode.

10.5.1.2 Automatic Service Relocation

Service reliability is achieved by configuring and failing over among redundant instances. More instances are enabled to provide a service than would otherwise be needed. If a hardware failure occurs and adversely affects a RAC database instance, then RAC automatically moves any services on that instance to another available instance. Then Cluster Ready Services (CRS) attempts to restart the failed nodes and instances.

An installation can specify the "preferred" and "available" configuration for each service. This configuration describes the preferred way to run the system, and is used when the service first starts up. For example, the ERP service runs on instance1 and instance2, and the HR service runs on instance3 when the system first starts. instance2 is available to run HR in the event of a failure or planned outage, and instance3 and instance4 are available to run ERP. The service configuration can be designed several ways.
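
The preferred and available placement described in this example is defined when the service is added. The following is a minimal SRVCTL sketch; the database name PROD is illustrative, and the instance names are taken from the example above:

srvctl add service -d PROD -s ERP -r instance1,instance2 -a instance3,instance4
srvctl add service -d PROD -s HR -r instance3 -a instance2

With this configuration, ERP starts on instance1 and instance2 and HR starts on instance3; the instances named with -a are used only when a preferred instance becomes unavailable.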

RAC recognizes when a failure affects a service and automatically fails over the service and redistributes the clients across the surviving instances that support the service. In parallel, CRS attempts to restart and integrate the failed instances and dependent resources back into the system. Notification of failures occurs at various levels, including notifying external parties through Enterprise Manager and callouts, recording the fault for tracking, event logging, and interrupting applications. Notification occurs from a surviving fault domain when the failed domain is out of service. The location and number of fault domains serving a service is transparent to the applications. Restart and recovery are automatic, including all the subsystems, not just the database.

10.5.2 RAC Recovery for Scheduled Outages

This section includes the following topics:

10.5.2.1 Disabling CRS-Managed Resources

When an outage occurs, RAC automatically restarts essential components. Components that are eligible for automatic restart include instances, listeners, and the database as well as several subcomponents. Some scheduled administrative tasks require that you prevent components from automatically restarting. To perform scheduled maintenance that requires a CRS-managed component to be down during the operation, the resource must be disabled to prevent CRS from trying to automatically restart the component. For example, to take a node and all of its instances and services offline for maintenance purposes, disable the instance and its services using either Enterprise Manager or SRVCTL, and then perform the required maintenance. Otherwise, if the node fails and then restarts, then CRS attempts to restart the instance during the administrative operation.
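
For example, the following SRVCTL sketch stops and disables an instance and one of its services before maintenance; the database name PROD, service name SALES, and instance name instance1 are illustrative:

srvctl stop service -d PROD -s SALES -i instance1
srvctl disable service -d PROD -s SALES -i instance1
srvctl stop instance -d PROD -i instance1
srvctl disable instance -d PROD -i instance1

After the maintenance is complete, reverse the operations with the corresponding enable and start commands.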

10.5.2.2 Planned Service Relocation

For a scheduled outage that requires an instance, node, or other component to be isolated, RAC provides the ability to relocate, disable, and enable services. Relocation migrates the service to another instance. The sessions can also be relocated. These interfaces also allow services, instances and databases to be selectively disabled while a repair, change, or upgrade is made and re-enabled after the change is complete. This ensures that the service is not started at the instance being repaired because of a dependency or a start operation on the service. The service is disabled on the instance at the beginning of the planned outage. It is then enabled at the end of the maintenance outage.

For example, to relocate the SALES service from instance1 to instance3 in order to perform scheduled maintenance on node1, the tasks can be performed using Enterprise Manager or SRVCTL commands. The following shows how to use SRVCTL commands:

  1. Relocate the SALES service to instance3.

    srvctl relocate service -d PROD -s SALES -i instance1 -t instance3
    
    
  2. Disable the SALES service on instance1 to prevent it from being relocated to instance1 while maintenance is performed.

    srvctl disable service -d PROD -s SALES -i instance1
    
    
  3. Stop instance1.

    srvctl stop instance -d PROD -i instance1
    
    
  1. Perform the scheduled maintenance.

  5. Start instance1.

    srvctl start instance -d PROD -i instance1
    
    
  6. Re-enable the SALES service on instance1.

    srvctl enable service -d PROD -s SALES -i instance1
    
    
  7. If desired, relocate the SALES service running on instance3 back to instance1.

    srvctl relocate service -d PROD -s SALES -i instance3 -t instance1
    

10.6 Apply Instance Failover

This section applies to MAA, with RAC and Data Guard on each site.

A standby database can have multiple standby instances. Only one instance can have the managed recovery process (MRP) or the logical standby apply process (LSP). The instance with the MRP or LSP is called the apply instance.

When you have a RAC-enabled standby database, you can fail over the apply instance of the standby RAC environment. Failing over to another apply instance may be necessary when incurring a planned or unplanned outage that affects the apply instance or node. Note the difference between apply instance failover, which utilizes multiple instances of the standby database at the secondary site, and Data Guard failover or Data Guard switchover, which converts the standby database into a production database.

For apply instance failover to work correctly, the recommendations in "Configuration Best Practices for MAA" must be followed.

When you follow these configuration recommendations, apply instance failover is automatic for a scheduled or unscheduled outage on the primary instance, and all standby instances have access to archived redo logs. By definition, all RAC standby instances already have access to standby redo logs because they must reside on shared storage.

The method of restarting the physical standby managed recovery process (MRP) or the logical standby apply process (LSP) depends on whether the Data Guard broker is being used. If the Data Guard broker is in use, then the MRP or LSP is automatically restarted on the first available standby instance if the apply instance fails. If the Data Guard broker is not being used, then the MRP or LSP must be restarted manually on the new apply instance. Consider using a shared file system, such as a clustered file system or a global file system, for the archived redo logs. A shared file system enables you to avoid reshipment of any unapplied archived redo logs that were already shipped to the standby.


See Also:

Oracle Data Guard Concepts and Administration for details about setting up cross-instance archiving

10.6.1 Performing an Apply Instance Failover Using SQL*Plus

If apply instance failover does not happen automatically, then follow these steps to restart your production database, if necessary, and restart MRP or LSP following an unscheduled apply instance or node outage:

10.6.1.1 Step 1: Ensure That the Chosen Standby Instance is Mounted

From the targeted standby instance, run the following query.

SELECT OPEN_MODE, DATABASE_ROLE FROM V$DATABASE;
Type of Standby Database    Output for Mounted Standby Database    If Not Mounted, Choose a Different Target or Open Manually
Physical                    MOUNTED, PHYSICAL STANDBY              STARTUP NOMOUNT
Logical                     READ WRITE, LOGICAL STANDBY            STARTUP

10.6.1.2 Step 2: Verify Oracle Net Connection to the Chosen Standby Host

  1. Ensure that the standby listener is started.

    % lsnrctl status listener_name
    
    
  2. Validate the Oracle Net alias from all the production hosts in the cluster.

    % tnsping standby_database_connection_service_name
    
    

If the connection cannot be made, then consult Oracle Net Services Administrator's Guide for further troubleshooting.

10.6.1.3 Step 3: Start Recovery on the Chosen Standby Instance

Use the following statements for a physical standby database:

RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Use the following statements for a logical standby database:

ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

10.6.1.4 Step 4: Copy Archived Redo Logs to the New Apply Host

Optionally, copy the archived redo logs to the new apply host.

The copy is not necessary for a physical standby database. For a physical standby database, when the managed recovery process detects an archive gap, it requests the production archived redo logs to be resent automatically.

For a logical standby database, archived redo logs that have already been registered but not yet applied are not resent automatically to the logical standby database. These archived redo logs must be sent manually to the same directory structure on the new apply host. You can identify the registered unapplied archived redo logs by executing a statement similar to the following:

SELECT LL.FILE_NAME, LL.THREAD#, LL.SEQUENCE#, LL.FIRST_CHANGE#, LL.NEXT_CHANGE#, LP.APPLIED_SCN, LP.READ_SCN
  FROM DBA_LOGSTDBY_LOG LL, DBA_LOGSTDBY_PROGRESS LP
  WHERE LEAST(LP.APPLIED_SCN, LP.READ_SCN) <= LL.NEXT_CHANGE#;

Compare the results of the statement to the contents of the STANDBY_ARCHIVE_DEST directory.
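
For example, the unapplied logs identified by the query can be copied to the new apply host with an operating system utility. This is a sketch only; the directory path and host name are hypothetical:

% scp /arch/SALES/*.arc new_apply_host:/arch/SALES/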


See Also:

"Oracle9i Data Guard: SQL Apply Best Practices" at http://otn.oracle.com/deploy/availability/htdocs/maa.htm

10.6.1.5 Step 5: Verify the New Configuration

  1. Verify that archived redo logs are being sent to the new apply host.

    Query V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS.

    SELECT NAME_SPACE, STATUS, TARGET, LOG_SEQUENCE,
    TYPE, PROCESS, REGISTER, ERROR FROM V$ARCHIVE_DEST
    WHERE STATUS != 'INACTIVE';

    SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS != 'INACTIVE';
    
    
  2. Verify that the managed recovery or logical apply on the new apply host is progressing.

    Issue the following queries to ensure that the sequence number is advancing over time.

    Use the following statements for a physical standby database:

    SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
    SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS FROM V$MANAGED_STANDBY;
    
    

    Use the following statements for a logical standby database:

    SELECT MAX(SEQUENCE#), THREAD# FROM DBA_LOGSTDBY_LOG GROUP BY THREAD#;
    SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;
    

10.7 Recovery Solutions for Data Failures

Recovering from a data failure is an unscheduled outage scenario. A data failure is usually, but not always, caused by some activity or failure that occurs outside the database, even though the problem may be evident within the database.

Data failure can affect many types of database objects, including the data dictionary, UNDO segments, application segments (tables, indexes, clusters), and temporary segments.

Data failure can be categorized as either datafile block corruption or media failure.

In all environments, you can resolve a data failure outage with one of the following local recovery methods: RMAN datafile media recovery, RMAN block media recovery, or manual re-creation of the affected objects.

In a Data Guard environment, you can also use a Data Guard switchover or failover to a standby database to recover from data failures.

Another category of related outages that result in database objects becoming unavailable or inconsistent are caused by user error, such as dropping a table or erroneously updating table data. Information about recovering from user error can be found in "Recovering from User Error with Flashback Technology".

The rest of this section includes the following topics:

10.7.1 Detecting and Recovering From Datafile Block Corruption

A corrupt datafile block can be accessed, but the contents within the block are invalid or inconsistent. The typical cause of datafile corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.

The database usually remains available when corrupt blocks have been detected, but some corrupt blocks may cause widespread problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an application unusable.

The rest of this section includes the following topics:

10.7.1.1 Detecting Datafile Block Corruption

A data fault is detected when it is recognized by the user, administrator, RMAN backup, or application because it has affected the availability of the application. For example:

  • A single corrupt data block in a user table that cannot be read by the application because of a bad spot on the physical disk

  • A database that automatically shuts down because of the invalid blocks of a datafile in the SYSTEM tablespace caused by a failing disk controller

Regularly monitor application logs (which may be distributed across the data server, middle-tier, and client machines), the alert log, and Oracle trace files for errors such as ORA-1578 and ORA-1110:

ORA-01578: ORACLE data block corrupted (file # 4, block # 26)
ORA-01110: data file 4: '/u01/oradata/objrs/obj_corr.dbf'
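
A simple way to watch for these errors is to scan the alert log periodically. The following sketch assumes the default 10g BACKGROUND_DUMP_DEST location; adjust the path for your installation:

% grep -E 'ORA-01578|ORA-01110' $ORACLE_BASE/admin/$ORACLE_SID/bdump/alert_$ORACLE_SID.log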

10.7.1.2 Recovering From Datafile Block Corruption

After you have identified datafile block corruption, follow these steps:

  1. Determine the Extent of the Corruption Problem

  2. Replace or Move Away From Faulty Hardware

  3. Determine Which Objects Are Affected

  4. Decide Which Recovery Method to Use

10.7.1.2.1 Determine the Extent of the Corruption Problem

Use the following methods to determine the extent of the corruption:

  • Gather details from error messages

    Gather the file number, file name, and block number from the error messages. For example:

ORA-01578: ORACLE data block corrupted (file # 22, block # 12698)
ORA-01110: data file 22: '/oradata/SALES/users01.dbf'

The file number is 22, the block number is 12698, and the file name is /oradata/SALES/users01.dbf.

  • Check log files for additional error messages

    Record additional error messages that appear in the alert log, Oracle trace files, or application logs. Note that log files may be distributed across the data server, middle tier, and client machines.

  • Use Oracle utilities to check for additional corruption

    Use Oracle detection tools to find other data failure problems that may exist on the same disk or set of disks and that have not yet been reported. For example, if file number 22 has corrupt blocks, then it is prudent to run the RMAN BACKUP VALIDATE DATAFILE 22 command to detect additional corruption; a sample session follows below. Table 10-2 summarizes the Oracle tools that are available to detect datafile block corruption.
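
The following is a minimal sketch of such a validation, continuing the file number 22 example. BACKUP VALIDATE reads the blocks without producing a backup, and any corruption found is recorded in the control file and exposed through V$DATABASE_BLOCK_CORRUPTION:

RMAN> BACKUP VALIDATE CHECK LOGICAL DATAFILE 22;

SELECT FILE#, BLOCK#, BLOCKS, CORRUPTION_TYPE
FROM V$DATABASE_BLOCK_CORRUPTION;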

Table 10-2 Oracle Tools for Detecting Datafile Block Corruption

Oracle Tool: RMAN BACKUP or RESTORE command with VALIDATE option

Description: RMAN scans the specified files and verifies their contents, checking for physical and logical errors, but does not actually perform the backup or restore operation. Oracle records the address of the corrupt block and the type of corruption in the control file. Access these records through the V$DATABASE_BLOCK_CORRUPTION view, which can be used by RMAN block media recovery.

If block change tracking is enabled, then do not use the INCREMENTAL LEVEL option with BACKUP VALIDATE; this ensures that all data blocks are read and verified.

To detect all types of corruption that are possible to detect:

  • Do not specify the MAXCORRUPT option

  • Do not specify the NOCHECKSUM option

  • Do specify the CHECK LOGICAL option

See Also: Oracle Database Backup and Recovery Advanced User's Guide

Oracle Tool: DBVERIFY utility

Description: External command-line utility that performs a physical structure data integrity check on a datafile or a specific segment within a datafile. Output of DBVERIFY goes to the screen or the specified LOGFILE.

See Also: Oracle Database Utilities

Oracle Tool: ANALYZE SQL statement with VALIDATE STRUCTURE option

Description: Verifies the integrity of the structure of an index, table, or cluster; identifies corruption; and checks or verifies that your tables and indexes are consistent. Problems are reported into the user session trace file in USER_DUMP_DEST.

See Also: Oracle Database Administrator's Guide

Oracle Tool: DBMS_REPAIR PL/SQL package

Description: Performs block checking for a specified table, partition, or index. DBMS_REPAIR should be used with care and under the guidance of Oracle Support.

See Also: Oracle Database Administrator's Guide
10.7.1.2.2 Replace or Move Away From Faulty Hardware

Some corruption problems are caused by faulty hardware. If there is a hardware fault or a suspect component, then it is sensible to either repair the problem, or make disk space available on a separate disk subsystem before proceeding with a recovery option.

If there are multiple errors, if there are operating system-level errors against the affected file, or if the errors are transient and keep moving about, then there is little point in proceeding until the underlying problem has been addressed or space is available on alternative disks. Ask your hardware vendor to verify system integrity.

In a Data Guard environment, a switchover can be performed to bring up the application and restore service quickly while the corruption problem is handled offline.

10.7.1.2.3 Determine Which Objects Are Affected

Using the file ID (fid) and block ID (bid) gathered from error messages and the output from Oracle block checking utilities, determine which database objects are affected by the corruption by using a query similar to the following:

SELECT tablespace_name, partition_name, segment_type,
   owner, segment_name
FROM dba_extents 
WHERE file_id = fid
  AND bid BETWEEN block_id AND block_id + blocks - 1;

The following is an example of the query and its resulting output:

SQL> select tablespace_name, partition_name, segment_type,
  2  owner, segment_name from dba_extents 
  3  where file_id=4 and 11 between block_id and block_id + blocks -1;

TABLESPACE_NAME   PARTITION_NAME     SEGMENT_TYPE    OWNER   SEGMENT_NAME
---------------   --------------     ------------    -----   ------------
USERS                                TABLE           SCOTT   EMP

The integrity of a table or index can be determined by using the ANALYZE statement.

10.7.1.2.4 Decide Which Recovery Method to Use

The recommended recovery methods are summarized in Table 10-3 and Table 10-4. The recovery methods depend on whether Data Guard is being used.

Table 10-3 summarizes recovery methods for data failure when Data Guard is not used.

Table 10-3 Recovering From Data Failure Without Data Guard

Object Affected: Data dictionary or UNDO segment
Extent of Problem: N/A
Action: Use RMAN Datafile Media Recovery

Object Affected: Application segment (user table, index, cluster)
Extent of Problem: Widespread or unknown
Action: Use RMAN Datafile Media Recovery or Re-Create Objects Manually

Object Affected: Application segment (user table, index, cluster)
Extent of Problem: Localized
Action: Use RMAN Block Media Recovery or Re-Create Objects Manually

Object Affected: TEMPORARY segment or temporary table
Extent of Problem: N/A
Action: No impact to permanent objects. Re-create temporary tablespace if required.

Table 10-4 summarizes recovery methods for data failure when Data Guard is present.

Table 10-4 Recovering from Data Failure With Data Guard

Object Affected: Data dictionary or UNDO segment
Extent of Problem: N/A; Impact to Application: N/A; Cost of Local Recovery: N/A
Action: Use Data Guard to Recover From Data Failure

Object Affected: Application segment (user table, index, cluster)
Extent of Problem: Widespread or unknown; Impact to Application: Low; Cost of Local Recovery: Low
Action: Use RMAN Datafile Media Recovery or Re-Create Objects Manually

Object Affected: Application segment (user table, index, cluster)
Extent of Problem: N/A; Impact to Application: High; Cost of Local Recovery: N/A
Action: Use Data Guard to Recover From Data Failure

Object Affected: Application segment (user table, index, cluster)
Extent of Problem: N/A; Impact to Application: N/A; Cost of Local Recovery: High
Action: Use Data Guard to Recover From Data Failure

Object Affected: Application segment (user table, index, cluster)
Extent of Problem: Localized; Impact to Application: Low; Cost of Local Recovery: Low
Action: Use RMAN Block Media Recovery or Re-Create Objects Manually

Object Affected: TEMPORARY segment or temporary table
Extent of Problem: N/A; Impact to Application: N/A; Cost of Local Recovery: N/A
Action: No impact to permanent objects. Re-create temporary tablespace if required.

The proper recovery method to use depends on the following criteria, as indicated in Table 10-3 and Table 10-4:

  • Object affected: The recovery actions available depend on which objects are affected by the corruption. Possible values are:

    • Data dictionary or UNDO segment: The object affected is owned by SYS or is part of UNDO tablespace

    • Temporary segment: Corruptions within a temporary segment or object within a temporary tablespace do not affect permanent objects

    • Application segment: Table, index, or cluster used by the application

  • Impact to application

    An object may be critical for the application to function. This includes objects that are critical for the performance and usability of the application. It could also be a history, reporting, or logging table, which may not be as critical. It could also be an object that is no longer in use or a temporary segment.

    This criterion only applies to a Data Guard environment and should be used to decide between recovering the affected object locally and using Data Guard failover. Possible values are:

    • High: Object is critical to the application such that availability or performance suffers significantly

    • Low: Object is not critical and has little or no impact to the application

  • Cost of local recovery

    This criterion only applies to a Data Guard environment and should be used to decide between recovering the affected object locally and using Data Guard failover. This is not the business cost (which is assumed to be implicit in deciding how critical an object is to the application), but the cost in terms of the feasibility of recovery, the resources required, their impact on performance, and the total time taken.

    Cost of local recovery should include the time to restore and recover the object from a valid source; the time to recover other dependent objects like indexes, constraints, and related tables and its indexes and constraints; availability of resources like disk space, data or index tablespace, temporary tablespace; and impact on performance and functionality of current normal application functions due to absence of the corrupt object.

  • Extent of corruption

    Corruption may be localized so that it affects a known number of blocks within one or a few objects, or it may be widespread so that it affects a large portion of an object.

10.7.2 Recovering From Media Failure

When media failure occurs, follow these steps:

  1. Determine the Extent of the Media Failure

  2. Replace or Move Away From Faulty Hardware

  3. Decide Which Recovery Action to Take

10.7.2.1 Determine the Extent of the Media Failure

Use the following methods to determine the extent of the media failure:

  • Gather details from error messages

    Gather the file number and file name from the error messages reported. Typical error message pairs are ORA-1157 with ORA-1110, and ORA-1115 with ORA-1110. For example:

ORA-01115: IO error reading block from file 22 (block # 12698)
ORA-01110: data file 22: '/oradata/SALES/users01.dbf'

  • Check log files for additional error messages

    Record additional error messages that appear in the system logs, volume manager logs, alert log, Oracle trace files, or application logs. Note that log files may be distributed across the data server, middle tier, and the client machines.

10.7.2.2 Replace or Move Away From Faulty Hardware

If there is a hardware fault or a suspect component, then it is sensible to either repair the problem or make disk space available on a separate disk subsystem before proceeding with a recovery option.

If there are multiple errors, if there are operating system-level errors against the affected file, or if the errors are transient and keep moving about, then there is little point in proceeding until the underlying problem has been addressed or space is available on alternative disks. Ask your hardware vendor to verify system integrity.

10.7.2.3 Decide Which Recovery Action to Take

The appropriate recovery action depends on what type of file is affected by the media failure. Table 10-5 shows the type of file and the appropriate recovery.

Table 10-5 Recovery Actions for Failure of Different Types of Files

Type of File Recovery Action
Datafile Media failure of a datafile is resolved in the same manner in which widespread datafile block corruption is handled.

See Also: "Recovering From Datafile Block Corruption"

Control file Loss of a control file causes the primary database to shut down. The steps to recover from control file failure include making a copy of a good control file, restoring a backup copy of the control file, or manually creating the control file with the CREATE CONTROLFILE statement. The proper recovery method depends on the following:
  • Whether all current control files were lost or just a member of a multiplexed control file

  • Whether or not a backup control file is available

See Also: "Performing User-Managed Flashback and Recovery" in Oracle Database Backup and Recovery Advanced User's Guide

Standby control file Loss of a standby control file causes the standby database to shut down. It may also, depending on the primary database protection mode, cause the primary database to shut down. To recover from a standby control file failure, a new standby control file must be created from the primary database and transferred to the standby system.

See Also: "Creating a Physical Standby Database" in Oracle Data Guard Concepts and Administration

Online redo log file If a media failure has affected the online redo logs of a database, then the appropriate recovery procedure depends on the following:
  • The configuration of the online redo log: mirrored or non-mirrored

  • The type of media failure: temporary or permanent

  • The status of the online redo log files affected by the media failure: current, active, unarchived, or inactive

See Also: "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide

If the online redo log failure causes the primary database to shut down and incomplete recovery must be used to make the database operational again, then Flashback Database can be used instead of restoring all datafiles. Use Flashback Database to take the database back to an SCN before the SCN of the lost online redo log group. The resetlogs operation that is done as part of the Flashback Database procedure reinitializes all online redo log files. Using Flashback Database is faster than restoring all datafiles.

If the online redo log failure causes the primary database to shut down in a Data Guard environment, it may be desirable to perform a Data Guard failover to reduce the time it takes to restore service to users and to reduce the amount of data loss incurred (when using the proper database protection mode). The decision to perform a failover (instead of recovering locally at the primary site with Flashback Database, for example) depends on the estimated time to recover from the outage at the primary site, the expected amount of data loss, and the impact the recovery procedures taken at the primary site may have on the standby database.

For example, if the decision is to recover at the primary site, then the recovery steps may require a Flashback Database operation and an open resetlogs, which may lose up to a full online redo log file of data. A standby database will have less data loss in most cases than recovering at the primary site because all redo data is available to the standby database. If recovery is done at the primary site and the standby database is ahead of the point to which the primary database is recovered, then the standby database must be re-created or flashed back to a point before the resetlogs SCN on the primary database.

See Also: "Creating a Physical Standby Database" in Oracle Data Guard Concepts and Administration

Standby redo log file Standby redo log failure affects only the standby database in a Data Guard environment. Most standby redo log failures are handled automatically by the standby database without affecting the primary database. However, if a standby redo log file fails while being archived to, then the primary database treats it as a log archive destination failure.

See Also: "Determine the Data Protection Mode"

Archived redo log file Loss of an archived redo log does not affect availability of the primary database directly, but it may significantly affect availability if another media failure occurs before the next scheduled backup, or if Data Guard is being used and the archived redo log had not been fully received by the standby system and applied to the standby database before losing the file.

See Also: "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide

If an archived redo log is lost in a Data Guard environment and the log has already been applied to the standby database, then there is no impact. If there is no valid backup copy of the lost file, then a backup should be taken immediately of either the primary or standby database because the lost log will be unavailable for media recovery that may be required for some other outage.

If the lost archived redo log has not yet been applied to the standby database, then a backup copy of the file must be restored and made available to the standby database. If there is no valid backup copy of the lost archived redo log, then the standby database must be reinstantiated from a backup of the primary database taken after the NEXT_CHANGE# of the lost log (see V$ARCHIVED_LOG).

Server parameter file (SPFILE) Loss of the server parameter file does not affect availability of a running database, but the SPFILE is necessary for database startup. With the flash recovery area and RMAN CONTROLFILE AUTOBACKUP features enabled, restoring a server parameter file from backup is a fast operation.

See Also: "Performing Recovery" of Oracle Database Backup and Recovery Basics

Oracle Cluster Registry (OCR) Loss of the Oracle Cluster Registry file affects the availability of RAC and Cluster Ready Services. The OCR file can be restored from a physical backup that is automatically created or from an export file that is manually created by using the ocrconfig tool.

See Also: "Administering Storage in Real Application Clusters" in Oracle Real Application Clusters Administrator's Guide

10.7.3 Recovery Methods for Data Failures

The following recovery methods can be used in all environments: RMAN datafile media recovery, RMAN block media recovery, and manual object re-creation.

Always use local recovery methods when Data Guard is not being used. Local recovery methods may also be appropriate in a Data Guard environment. In a Data Guard environment, you can additionally use a Data Guard switchover or failover, as described in "Use Data Guard to Recover From Data Failure".

10.7.3.1 Use RMAN Datafile Media Recovery

Datafile media recovery recovers an entire datafile or set of datafiles for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked media-corrupt and require media recovery, or when an entire file is lost, the affected datafiles must be restored and recovered.

Use RMAN file media recovery when the following conditions are true:

  • The number of blocks requiring recovery is large or unknown

  • Block media recovery is not available (for example, if incomplete recovery is required, or if only incremental backups are available for the datafile requiring recovery)


    See Also:

    "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide

10.7.3.2 Use RMAN Block Media Recovery

Block media recovery (BMR) recovers one or a set of data blocks marked "media corrupt" within a datafile by using the RMAN BLOCKRECOVER command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole datafiles. This results in lower mean time to recovery (MTTR) because only blocks that need recovery are restored and only necessary corrupt blocks undergo recovery. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected datafiles to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are completely recovered.

Use block media recovery when:

  • A small number of blocks require media recovery and the blocks that need recovery are known. If a significant portion of the datafile is corrupt, or if the amount of corruption is unknown, then a different recovery method should be used.

  • Blocks are marked corrupt (as verified with the RMAN BACKUP VALIDATE command), and only complete recovery is required.

Block media recovery cannot be used to recover from the following:

  • User error or software bugs that cause logical corruption where the data blocks are intact. See "Recovering from User Error with Flashback Technology" for additional details for this type of recovery.

  • Changes caused by corrupt redo data. Block media recovery requires that all available redo data be applied to the blocks being recovered.

The following are useful practices when using block media recovery:

  • The flash recovery area should have a retention policy such that backups of all datafiles are retained on disk longer than the frequency of use of an Oracle tool that detects block corruption. For example, if RMAN BACKUP VALIDATE is run once a week, then the flash recovery area should have a retention policy greater than one week. This ensures that corrupt blocks can be recovered quickly using block media recovery with disk-based backups.

  • Even if archived redo logs have missing redo data, block media recovery can still work if the block has been renewed since the data file backup used for the restoration of the block, or if there are no changes for the specific block in the missing redo data.

  • If RMAN block validation has been run proactively, then the V$DATABASE_BLOCK_CORRUPTION view has a list of blocks validated as corrupt by RMAN. RMAN can be instructed to recover all corrupt blocks listed in V$DATABASE_BLOCK_CORRUPTION using block media recovery, as shown in the sketch after this list.
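
The following RMAN sketch shows both forms of block media recovery: recovering a single named block (using the file and block numbers from the earlier example) and recovering every block listed in V$DATABASE_BLOCK_CORRUPTION:

RMAN> BLOCKRECOVER DATAFILE 22 BLOCK 12698;

RMAN> BLOCKRECOVER CORRUPTION LIST;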

10.7.3.3 Re-Create Objects Manually

Some database objects, such as small look-up tables or indexes, can be recovered quickly by manually re-creating the object instead of doing media recovery.

Use manual object re-creation when:

  • You need to re-create a small index because of media corruption. Creating an index online enables the base object to be used concurrently.

  • You need to re-create a look-up table and the scripts to re-create the table are readily available. Dropping and re-creating the table may be the fastest option, as shown in the sketch after this list.
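
For the index case, the following sketch drops and re-creates a corrupt index online so that the base table remains usable during the rebuild; the schema, table, and index names are hypothetical:

DROP INDEX sales.orders_date_ix;
CREATE INDEX sales.orders_date_ix ON sales.orders (order_date) ONLINE;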

10.7.3.4 Use Data Guard to Recover From Data Failure

Failover is the operation of taking the production database offline on one site and bringing one of the standby databases online as the new production database. A database switchover is a planned transition in which a standby database and a production database switch roles.

Use Data Guard switchover or failover for data failure when:

  • The database is down or when the database is up but the application is unavailable because of data failure, and the time to restore and recover locally is long or unknown.

  • The business cost of a switchover or failover and re-creating the standby database is less than the cost of the expected downtime or reduced capacity while local recovery steps are taken.

  • The proper decision to recover locally is unclear or too complex, and the data failure is affecting the availability of the application.

10.8 Recovering from User Error with Flashback Technology

Oracle flashback technology revolutionizes data recovery. In the past it took seconds to damage a database but hours to days to recover it. With flashback technology, the time to correct errors can be as short as the time it took to make the error. Fixing user errors that require rewinding the database, or reversing table-level, transaction-level, or row-level changes to a previous point in time, is easy and does not require any database or object restoration. Flashback technology provides fine-grained analysis and repair for localized damage such as erroneous row deletion. Flashback technology also enables correction of more widespread damage such as accidentally running the wrong application batch job. Furthermore, flashback technology is dramatically faster than a database restoration.

Flashback technologies are applicable only to repairing user errors, such as the row, transaction, table, tablespace, and database inconsistencies summarized in Table 10-6.

Flashback technologies cannot be used for media or data corruption such as block corruption, bad disks, or file deletions. See "Recovery Solutions for Data Failures" and "Database Failover" to repair these outages.

Table 10-6 summarizes the flashback solutions for each type of outage.

Table 10-6 Flashback Solutions for Different Outages

Impact of Outage Examples of User Errors Flashback Solutions
Row or transaction

See Also: "Resolving Row and Transaction Inconsistencies"

  • Accidental deletion of row
  • Erroneous transaction

Use a combination of:

  • Flashback Query
  • Flashback Version Query
  • Flashback Transaction Query

See Also: "Flashback Query"

Table

See Also: "Resolving Table Inconsistencies"

  • Dropped table
  • Erroneous transactions affecting one table or a set of tables

Use one of:

  • Flashback Drop
  • Flashback Table

Tablespace or database

See Also: "Resolving Database-Wide Inconsistencies"

  • Erroneous batch job affecting many tables or an unknown set of tables
  • Series of database-wide malicious transactions
  • Drop tablespace without removing the physical datafiles

Use Flashback Database

Table 10-7 summarizes each flashback feature.

Table 10-7 Summary of Flashback Features

Flashback Feature Description
Flashback Query
Flashback Query enables you to view data at a point in time in the past. It can be used to view and reconstruct lost data that was deleted or changed by accident. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors.

Note: Changes are propagated to physical and logical standby databases.

Flashback Version Query
Flashback Version Query uses undo data stored in the database to view the changes to one or more rows along with all the metadata of the changes.

Note: Changes are propagated to physical and logical standby databases.

Flashback Transaction Query
Flashback Transaction Query enables you to examine changes to the database at the transaction level. As a result, you can diagnose problems, perform analysis, and audit transactions.

Note: Changes are propagated to physical and logical standby databases.

Flashback Drop
Flashback Drop provides a way to restore accidentally dropped tables.

Note: Changes are propagated to physical standby databases.

Flashback Table
Flashback Table enables you to quickly recover a table to a point in time in the past without restoring a backup.

Note: Changes are propagated to physical and logical standby databases.

Flashback Database
Flashback Database enables you to quickly return the database to an earlier point in time by undoing all of the changes that have taken place since that time. This operation is fast because you do not need to restore the backups.

Flashback Database uses the Oracle Database flashback logs, while all other features of flashback technology use the Oracle Database unique undo and multiversion read consistency capabilities. See "Configuration Best Practices for the Database" for configuring flashback technologies to ensure that the resources from these solutions are available at a time of failure.

The rest of this section includes the following topics:

  • Resolving Row and Transaction Inconsistencies
  • Resolving Table Inconsistencies
  • Resolving Database-Wide Inconsistencies

10.8.1 Resolving Row and Transaction Inconsistencies

Resolving row and transaction inconsistencies may require a combination of Flashback Query, Flashback Version Query, Flashback Transaction Query, and the suggested undo statements to rectify the problem. The following sections describe a general approach, using a human resources example, to resolve row and transaction inconsistencies caused by erroneous or malicious user actions.

This section includes the following topics:

  • Flashback Query
  • Flashback Version Query
  • Flashback Transaction Query
  • Example: Using Flashback Technology to Investigate Salary Discrepancy

10.8.1.1 Flashback Query

Flashback Query, a feature introduced in the Oracle9i Database, enables an administrator or user to query any data at some point in time in the past. This powerful feature can be used to view and reconstruct data that may have been deleted or changed by accident. For example:

SELECT * FROM EMPLOYEES 
       AS OF TIMESTAMP 
       TO_DATE('28-Aug-03 14:00','DD-Mon-YY HH24:MI')
 WHERE ...

This partial statement displays rows from the EMPLOYEES table as they existed at 2 p.m. on August 28, 2003. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay, rather than burdening administrators to perform this task. Flashback Query is very simple to manage, because the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.

10.8.1.2 Flashback Version Query

Flashback Version Query provides a way to view changes made to the database at the row level. It is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. For example:

SELECT * FROM EMPLOYEES
       VERSIONS BETWEEN TIMESTAMP
       TO_DATE('28-Aug-03 14:00','dd-Mon-YY hh24:mi') AND
       TO_DATE('28-Aug-03 15:00','dd-Mon-YY hh24:mi')
WHERE ...

This statement displays each version of the row, each changed by a different transaction, between 2 and 3 p.m. on August 28, 2003. A DBA can use this to pinpoint when and how data was changed and trace it back to the user, application, or transaction. This enables the DBA to track down the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.

10.8.1.3 Flashback Transaction Query

Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that enables you to see all changes made by a transaction. For example:

SELECT UNDO_SQL
FROM FLASHBACK_TRANSACTION_QUERY
WHERE XID = HEXTORAW('000200030000002D');

This statement shows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like this, the DBA and application developer can precisely diagnose and correct logical problems in the database or application.

10.8.1.4 Example: Using Flashback Technology to Investigate Salary Discrepancy

Consider a human resources (HR) example involving the SCOTT schema. The HR manager reports to the DBA that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was increased. In addition, he has instructed his staff to reset the salary to the previous level of $1250, and this was completed around 9:15 a.m.

The following steps show how to approach the problem.

  1. Assess the problem.

    Fortunately, the HR manager has provided information about the time when the change occurred. We can query the information as it was at 9:00 a.m. with Flashback Query.

    SELECT EMPNO, ENAME, SAL
    FROM EMP
    AS OF TIMESTAMP TO_DATE('03-SEP-03 09:00','dd-Mon-yy hh24:mi')
    WHERE ENAME = 'WARD';
    
     EMPNO     ENAME             SAL
    ---------- ---------- ----------
          7521 WARD             1875
    
    

    We can confirm we have the correct employee by the fact that Ward's salary was $1875 at 09:00 a.m. Rather than using Ward's name, we can now use the employee number for subsequent investigation.

  2. Query past rows or versions of the data to acquire transaction information.

    Although it is possible to restrict the row version information to a specific date or SCN range, we decide to query all the row information that we have available for the employee WARD using Flashback Version Query.

    SELECT EMPNO, ENAME, SAL, VERSIONS_STARTTIME, VERSIONS_ENDTIME
    FROM EMP
    VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
    WHERE EMPNO = 7521
    ORDER BY NVL(VERSIONS_STARTSCN,1);
    
        EMPNO ENAME             SAL VERSIONS_STARTTIME     VERSIONS_ENDTIME
    -------- ---------- ---------- ---------------------- ----------------------
         7521 WARD             1250 03-SEP-03 08.48.43 AM  03-SEP-03 08.54.49 AM
         7521 WARD             1875 03-SEP-03 08.54.49 AM  03-SEP-03 09.10.09 AM
         7521 WARD             1250 03-SEP-03 09.10.09 AM
    
    

    We can see that WARD's salary was increased from $1250 to $1875 at 08:54:49 the same morning and was subsequently reset to $1250 at approximately 09:10:09.

    Also, we can modify the query to determine the transaction information for each of the changes affecting WARD using a similar Flashback Version Query. This time we use the VERSIONS_XID pseudocolumn.

    SELECT EMPNO, ENAME, SAL, VERSIONS_XID
    FROM EMP
    VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
    WHERE EMPNO = 7521
    ORDER BY NVL(VERSIONS_STARTSCN,1);
    
         EMPNO ENAME             SAL VERSIONS_XID
    ---------- ---------- ---------- ----------------
          7521 WARD             1250 0006000800000086
          7521 WARD             1875 0009000500000089
          7521 WARD             1250 000800050000008B
    
    
  3. Query the erroneous transaction and the scope of its impact.

    With the transaction information (VERSIONS_XID pseudocolumn), we can now query the database to determine the scope of the transaction, using Flashback Transaction Query.

    SELECT UNDO_SQL
    FROM FLASHBACK_TRANSACTION_QUERY
    WHERE XID = HEXTORAW('0009000500000089');
    
    UNDO_SQL
    ----------------------------------------------------------------------------
    update "SCOTT"."EMP" set "SAL" = '950' where ROWID = 'AAACV4AAFAAAAKtAAL';
    update "SCOTT"."EMP" set "SAL" = '1500' where ROWID = 'AAACV4AAFAAAAKtAAJ';
    update "SCOTT"."EMP" set "SAL" = '2850' where ROWID = 'AAACV4AAFAAAAKtAAF';
    update "SCOTT"."EMP" set "SAL" = '1250' where ROWID = 'AAACV4AAFAAAAKtAAE';
    update "SCOTT"."EMP" set "SAL" = '1600' where ROWID = 'AAACV4AAFAAAAKtAAB';

    6 rows selected.
    
    

    We can see that WARD's salary was not the only change that occurred in the transaction. (The sixth row returned has a NULL UNDO_SQL value; it corresponds to the transaction itself rather than to an individual row change.) The information that was changed for the other four employees at the same time as WARD can now be passed back to the HR manager for review.

  4. Determine if the corrective statements should be executed.

    If the HR manager decides that the corrective changes suggested by the UNDO_SQL column are correct, then the DBA can execute these statements individually.
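    For example, if the decision is to reverse only WARD's change, then the DBA can run the corresponding UNDO_SQL statement from the output shown earlier and commit:

    UPDATE "SCOTT"."EMP" SET "SAL" = '1250' WHERE ROWID = 'AAACV4AAFAAAAKtAAE';

    COMMIT;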

10.8.2 Resolving Table Inconsistencies

Oracle provides a FLASHBACK DROP statement to recover from an accidental DROP TABLE statement, and a FLASHBACK TABLE statement to restore a table to a previous point in time.

This section includes the following topics:

  • Flashback Table
  • Flashback Drop

10.8.2.1 Flashback Table

Flashback Table provides the DBA the ability to recover a table, or a set of tables, to a specified point in time quickly and easily. In many cases, Flashback Table alleviates the need to perform more complicated point in time recovery operations. For example:

FLASHBACK TABLE orders, order_items 
      TO TIMESTAMP 
      TO_DATE('29-AUG-03 14:00:00','dd-Mon-yy hh24:mi:ss');

This statement rewinds any updates to the ORDERS and ORDER_ITEMS tables that were made between the current time and the specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential integrity constraints between the tables.
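One prerequisite worth noting: Flashback Table requires row movement to be enabled on the affected tables before the FLASHBACK TABLE statement can be issued. For example:

ALTER TABLE orders ENABLE ROW MOVEMENT;
ALTER TABLE order_items ENABLE ROW MOVEMENT;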

10.8.2.2 Flashback Drop

Dropping or deleting database objects by accident is a common mistake. Users soon realize their mistake, but by then it is too late; there has been no easy way to recover a dropped table and its indexes, constraints, and triggers. Objects, once dropped, were dropped forever. Loss of very important tables or other objects (such as indexes, partitions, or clusters) required the DBA to perform a point-in-time recovery, which can be time-consuming and can lead to the loss of recent transactions.

Flashback Drop provides a safety net when dropping objects in Oracle Database 10g. When a user drops a table, Oracle places it in a recycle bin. Objects remain in the recycle bin until the user decides to permanently remove them or until space pressure occurs in the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users can look in the recycle bin and undrop a dropped table and its dependent objects. For example, the employees table and all its dependent objects would be undropped by the following statement:

FLASHBACK TABLE employees TO BEFORE DROP;
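The contents of the recycle bin can also be inspected before restoring, and a table can be restored under a new name. A brief sketch follows; the restored table name is hypothetical:

SELECT object_name, original_name, type FROM RECYCLEBIN;

FLASHBACK TABLE employees TO BEFORE DROP RENAME TO employees_restored;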

10.8.3 Resolving Database-Wide Inconsistencies

Oracle provides Flashback Database to rewind the entire database to a previous point in time. This section includes the following topics:

  • Flashback Database
  • Using Flashback Database to Repair a Dropped Tablespace

10.8.3.1 Flashback Database

To bring an Oracle database to a previous point in time, the traditional method is point-in-time recovery. However, point-in-time recovery can take hours or even days, because it requires the whole database to be restored from a backup and recovered to the point in time just before the error was introduced, and database sizes are constantly growing.

Flashback Database is a new strategy for doing point-in-time recovery. It quickly rewinds an Oracle database to a previous time to correct any problems caused by logical data corruption or user error. Flashback logs are used to capture old versions of changed blocks; one way to think of them is as a continuous backup or storage snapshot. When recovery needs to be performed, the flashback logs are quickly replayed to restore the database to a point in time before the error, and only the changed blocks are restored. This is extremely fast and reduces recovery time from hours to minutes. In addition, it is easy to use: a database can be recovered to 2:05 p.m. by issuing a single statement. Before the database can be recovered, all instances of the database must be shut down and one of the instances subsequently mounted. The following is an example of a FLASHBACK DATABASE statement.

FLASHBACK DATABASE TO TIMESTAMP TIMESTAMP'2002-11-05 14:00:00';

No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determine that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE statement or continue recovery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a production database, a physical standby database, and a logical standby database.

These steps are recommended for using Flashback Database:

  1. Determine the time or the SCN to which to flash back the database.
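    For example, the current SCN, or the SCN corresponding to a known timestamp, can be obtained as follows (the timestamp value shown is hypothetical):

    SELECT CURRENT_SCN FROM V$DATABASE;

    SELECT TIMESTAMP_TO_SCN(TO_TIMESTAMP('2003-09-03 09:00:00','YYYY-MM-DD HH24:MI:SS'))
       FROM DUAL;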

  2. Verify that there is sufficient flashback log information.

    SELECT OLDEST_FLASHBACK_SCN, 
       TO_CHAR(OLDEST_FLASHBACK_TIME, 'mon-dd-yyyy HH:MI:SS') 
       FROM V$FLASHBACK_DATABASE_LOG;
    
    
  3. Flash back the database to a specific time or SCN. (The database must be mounted to perform a Flashback Database.)

    FLASHBACK DATABASE TO SCN scn;
    
    

    or

    FLASHBACK DATABASE TO TIMESTAMP TO_DATE('date_string','format_mask');
    
    
  4. Open the database in read-only mode to verify that it is in the correct state.

    ALTER DATABASE OPEN READ ONLY;
    
    

    If more flashback data is required, then issue another FLASHBACK DATABASE statement. (The database must be mounted to perform a Flashback Database.)

    If you want to move forward in time, issue a statement similar to the following:

    RECOVER DATABASE UNTIL [TIME | CHANGE] date | scn;
    
    
  5. Open the database.


    Caution:

    After you open the database, you will not be able to flash back to an SCN before the resetlogs SCN, so ensure that the database is in the correct state before issuing the ALTER DATABASE OPEN RESETLOGS statement.

    ALTER DATABASE OPEN RESETLOGS;
    
    

Other considerations when using Flashback Database are as follows:

  • If there are not sufficient flashback logs, then use Data Guard if available or restore from backups.

  • After flashing back a database, any dependent database such as a standby database needs to be flashed back. See Chapter 11, "Restoring Fault Tolerance".

10.8.3.2 Using Flashback Database to Repair a Dropped Tablespace

Flashback Database does not by itself repair a dropped tablespace, but it can be used to dramatically reduce the downtime. You can flash back the production database to a point before the tablespace was dropped, then restore a backup of the corresponding datafiles from the affected tablespace and recover to a time before the tablespace was dropped.

Follow these recommended steps to use Flashback Database to repair a dropped tablespace:

  1. Determine the SCN or time you dropped the tablespace.

  2. Flash back the database to a time before the tablespace was dropped. You can use a statement similar to the following:

    FLASHBACK DATABASE TO BEFORE SCN drop_scn;
    
    
  3. Restore, rename, and bring datafiles online.

    1. Restore only the datafiles from the affected tablespace from a backup.

    2. Rename the unnamed placeholder files to the restored backup files.

    ALTER DATABASE RENAME FILE '.../UNNAMED00005' to 'restored_file';
    
    
    3. Bring the datafiles online.

    ALTER DATABASE DATAFILE 'name' ONLINE; 
             
    
  4. Query and recover the database.

    SELECT CHECKPOINT_CHANGE# FROM V$DATAFILE_HEADER WHERE FILE#=1;
    RECOVER DATABASE UNTIL CHANGE checkpoint_change#;
    
    
  5. Open the database.

    ALTER DATABASE OPEN RESETLOGS;
    

10.9 RAC Rolling Upgrade

"One-off" patches or interim patches to database software are usually applied to implement known fixes for software problems an installation has encountered or to apply diagnostic patches to gather information regarding a problem. Such patch application is often carried out during a scheduled maintenance outage.

Oracle now provides the capability to do rolling patch upgrades with Real Application Clusters with little or no database downtime. The tool used to achieve this is the opatch command-line utility.

The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched needs to be brought down; the other instances remain available. This further minimizes application downtime during such scheduled outages. Oracle's opatch utility enables the user to apply the patch successively to the different instances of the RAC installation.

Rolling upgrade is available only for patches that have been certified by Oracle as eligible for rolling upgrades. Rolling upgrade of patches is currently available for one-off patches only; it is not available for patch sets.

Rolling patch upgrades are not available for deployments where the Oracle Database software is shared across the different nodes. This is the case where the Oracle home is on Cluster File System (CFS) or on shared volumes provided by file servers or NFS-mounted drives. The feature is only available where each node has its own copy of the Oracle Database software.

This section includes the following topics:

  • Applying a Patch with opatch
  • Rolling Back a Patch with opatch
  • Using opatch to List Installed Software Components and Patches
  • Recommended Practices for RAC Rolling Upgrades

10.9.1 Applying a Patch with opatch

The opatch utility applies a patch successively to the nodes of the RAC cluster. The nature of the patch enables a RAC installation to run in a mixed environment: different instances of the database may be operating at the same time, with the patch applied to some instances and not others. The opatch utility automatically detects the nodes of the cluster on which a specific RAC deployment has been implemented. The patch is applied to each node, one at a time. For each node, the DBA is prompted to shut down the instance, the patch is applied to the database software installation on that node, and the instance is then restarted. The DBA then chooses the next RAC node to patch, and the cycle of instance shutdown, patch application, and instance startup is repeated. Thus, at any time during the patch application, only one node needs to be down.
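As a sketch, the cycle on each node might look like the following, assuming a database managed with srvctl; the database name, instance name, and patch location here are hypothetical:

srvctl stop instance -d SALES -i SALES1

opatch apply -local /u01/patches/4000000

srvctl start instance -d SALES -i SALES1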

To check if a patch is a rolling patch, execute the following on UNIX platforms:

opatch query -is_rolling

(On Windows, execute opatch.bat.)

Enter the patch location after the prompt.

To apply a patch to all nodes of the RAC cluster, execute the following command:

opatch apply patch_location

opatch automatically recognizes that the patch is a rolling patch and applies it to the cluster nodes one at a time.

To apply a patch to only the local node, enter the following command:

opatch apply -local patch_location

To check the results of a patch application, check the logs in the following location:

$ORACLE_HOME/.patch_storage/patch_id/patch_id_Apply_timestamp.log

10.9.2 Rolling Back a Patch with opatch

Patches can be rolled back with the opatch utility. This enables the DBA to remove a troublesome patch or a patch that is no longer required. This can be done as a rolling procedure.

To roll back a patch across all nodes of a RAC cluster, execute the following command:

opatch rollback -id patch_id -ph patch_location

To roll back a patch on the local node only, enter the following command:

opatch rollback -local -id patch_id -ph patch_location

To check the results of a patch rollback, check the logs in the following location:

$ORACLE_HOME/.patch_storage/patch_id/patch_id_RollBack_timestamp.log

10.9.3 Using opatch to List Installed Software Components and Patches

The opatch utility also provides an option to list the installed software components as well as the installed patches. Enter the following command:

opatch lsinventory 

For details on usage and the other options to these commands, see MetaLink Notes 242993.1 and 189489.1 at http://metalink.oracle.com.

10.9.4 Recommended Practices for RAC Rolling Upgrades

The following are recommended practices for all database patch upgrades:

  • Always confirm with Oracle Support Services that the patch is valid for your problem and for your deployment environment.

  • Have a plan for applying the patch as well as a plan for backing out the patch.

  • Apply the patch to your test environment first and verify that it fixes the problem.

  • When you plan the elapsed time for applying the patch, include time for starting up and shutting down the other tiers of your technology stack if necessary.

The following are additional recommended practices for RAC rolling upgrades.

  • If multiple instances share an Oracle home, then all of them will be affected by application of a patch. The DBA should verify that this will not cause unintentional side effects. Also, all such instances on a node must be shut down during the patch application. Scheduled outage planning should take this into account. As a best practice, only similar applications should share an Oracle home on a node. This provides greater flexibility for patching.

  • The Oracle inventory on each node is a repository of the Oracle Database software installed on the node. The inventory is node-specific. It is shared by all Oracle software installed on the node. It is similar across nodes only if all nodes are exactly the same in terms of the Oracle Database software deployed, the deployment configuration, and patch levels. Because the Oracle inventory greatly aids the patch application and patch management process, it is recommended that its integrity be maintained. The Oracle inventory on each node of the cluster should be backed up after any patch is applied to any Oracle software on that node.

  • Use the Oracle Universal Installer to install all Oracle database software. This creates the relevant repository entries in the Oracle inventory on each node of the cluster. Also, use the Oracle Universal Installer to add nodes to an existing RAC cluster.

    However, if this was not done or is not feasible for some reason, adding information about an existing Oracle database software installation to the Oracle inventory can be done with the attach option of the opatch utility. Node information can be also added with this option.

  • The nature of the rolling patch upgrade enables it to be applied to only some nodes of the RAC cluster. So an instance can be operating with the patch applied, while another instance is operating without the patch. This is not possible for nonrolling patch upgrades. Apply nonrolling patch upgrades to all instances before the RAC deployment is activated. A mixed environment is useful if a patch needs to be tested before deploying it to all the instances. Applying the patch with the -local option is the recommended way to do this.

    In the interest of keeping all instances of the RAC cluster at the same patch level, it is strongly recommended that after a patch has been validated, it should be applied to all nodes of the RAC installation. When all instances of a RAC cluster are at the same patch level, services can be migrated among instances without re-encountering a problem that the patch has already fixed.

  • All patches (including those applied by rolling upgrades) should be maintained online and not removed once they have been applied. This is useful if a patch needs to be rolled back or needs to be applied again.

    The patches should be stored in a location that is accessible by all nodes of the cluster. Thus all nodes of the cluster are equivalent in their capability to apply or roll back a patch.

  • Rolling patch upgrades, just like any other patch upgrade, should be done when no other patch upgrade or Oracle installation is being done on the node. Application of multiple patches is a sequential process. The scheduled outage should be planned accordingly.

  • If multiple patches need to be applied and they must be applied at the same time, and if only some of these patches are eligible for rolling upgrade, apply all of them in a nonrolling manner. This reduces the overall time required to get through the patching process.

  • For patches that are not eligible for rolling upgrade, the next best option for RAC deployments is the minimize_downtime option of the apply command.

  • Perform the rolling upgrade when system usage is low. This ensures minimal disruption of service for the end user.

10.10 Upgrade with Logical Standby Database

Using a logical standby database enables you to accomplish upgrades for database software and patch sets with almost no downtime.


Note:

This capability is not available for upgrading from Oracle9i to Oracle Database 10g.

If a logical standby database does not currently exist, then verify that a logical standby database supports all of the essential datatypes of your application.


See Also:

Oracle Data Guard Concepts and Administration for a list of datatypes supported by the logical standby database

If you cannot use a logical standby database because of the datatypes in your application, then perform the upgrade as documented in the Oracle Database Upgrade Guide.

First, create or establish a logical standby database. Figure 10-4 shows a production database and a logical standby database, which are both version X databases.

Figure 10-4 Establish a Logical Standby Database


Second, stop the SQL Apply process and upgrade the database software on the logical standby database to version X+1. Figure 10-5 shows the production database, version X, and the logical standby database, version X+1.
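For example, SQL Apply is stopped on the logical standby database with the following statement:

ALTER DATABASE STOP LOGICAL STANDBY APPLY;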

Figure 10-5 Upgrade the Logical Standby Database Version


Third, restart SQL Apply and operate with version X on the production database and version X+1 on the standby database. The configuration can run in the mixed mode shown in Figure 10-6 for an arbitrary period to validate the upgrade in the production environment.
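For example, SQL Apply is restarted on the upgraded standby database with the following statement:

ALTER DATABASE START LOGICAL STANDBY APPLY;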

Figure 10-6 Running in Mixed Mode


When you are satisfied that the upgraded software is operating properly, you can reverse the database roles by performing a switchover. This may take only a few seconds. Switch the database clients to the new production database, so that the application becomes active. If application service levels degrade for some reason, then you can open the previous production database again, switch users back, and back out of the previous steps.
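A minimal sketch of the switchover statements for a logical standby configuration follows; the first statement is issued on the production database and the second on the logical standby database:

ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY;

ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL PRIMARY;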

Figure 10-7 shows that the former standby database (version X+1) is now the production database, and the former production database (version X) is now the standby database. The clients are connected to the new production database.

Figure 10-7 After a Switchover Has Been Performed


Upgrade the new standby database. Figure 10-8 shows the system after both databases have been upgraded to version X+1.

Figure 10-8 Both Databases Have Been Upgraded


The role reversal that was just described includes the following steps:

  1. Initiate a Data Guard switchover, making the original standby database the production database. This can occur in a few seconds.

  2. Switch the database clients to the new production database, so that the application and service become active. If application service levels degrade for some reason, then you can open the previous production database again, switch users back, and back out of the previous steps.

  3. Upgrade the database software on the standby database after you are satisfied with current operations.

  4. Raise the compatibility level of both databases.
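For the final step, the COMPATIBLE initialization parameter can be raised on each database. The version string shown here is hypothetical, and because COMPATIBLE is not dynamically modifiable, the change takes effect at the next restart:

ALTER SYSTEM SET COMPATIBLE='10.1.0' SCOPE=SPFILE;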

10.11 Online Object Reorganization

Oracle's online object reorganization capabilities have been available since Oracle8i. These capabilities enable object reorganization to be performed even while the underlying data is being modified.

Table 10-8 describes a few of the object reorganization capabilities available with Oracle Database 10g.

Table 10-8 Some Object Reorganization Capabilities

Object Type Example of Object Reorganization Solution Description of Solution
Table DBMS_REDEFINITION PL/SQL package A PL/SQL package that provides a mechanism to redefine tables online
Index Rebuild index Rebuild an index that has previously been marked as unusable
Tablespace Rename tablespace Enables an existing tablespace to be renamed without rebuilding the tablespace and its contents

10.11.1 Online Table Reorganization

In highly available systems, it is occasionally necessary to redefine large tables that are constantly accessed to improve the performance of queries or DML performed. Oracle provides the DBMS_REDEFINITION PL/SQL package to redefine tables online. This package provides a significant increase in availability compared to traditional methods of redefining tables that require tables to be taken offline.
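The following is a minimal sketch of the online redefinition flow, assuming a hypothetical table SCOTT.EMP is redefined through an interim table SCOTT.EMP_INT that has already been created with the desired new layout:

-- Verify that the table can be redefined online (raises an error if it cannot)
EXECUTE DBMS_REDEFINITION.CAN_REDEF_TABLE('SCOTT', 'EMP');

-- Begin the redefinition; DML on SCOTT.EMP continues during this phase
EXECUTE DBMS_REDEFINITION.START_REDEF_TABLE('SCOTT', 'EMP', 'EMP_INT');

-- Optionally synchronize the interim table to shorten the final step
EXECUTE DBMS_REDEFINITION.SYNC_INTERIM_TABLE('SCOTT', 'EMP', 'EMP_INT');

-- Complete the redefinition; SCOTT.EMP now has the new definition
EXECUTE DBMS_REDEFINITION.FINISH_REDEF_TABLE('SCOTT', 'EMP', 'EMP_INT');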

10.11.2 Online Index Reorganization

An index can be rebuilt online using the previous index definition, optionally moving the index to a new tablespace.
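For example, the following statement rebuilds an index online while moving it to a new tablespace; the index and tablespace names here are hypothetical:

ALTER INDEX hr.emp_name_ix REBUILD ONLINE TABLESPACE new_ts;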

10.11.3 Online Tablespace Reorganization

Oracle Database 10g introduces the ability to rename a tablespace, similar to the ability to rename a column, table, or datafile. Previously, the only way to change a tablespace name was to drop and re-create the tablespace, which meant that its contents had to be dropped and rebuilt later. With the ability to rename a tablespace online, there is no interruption to the users. For example:

ALTER TABLESPACE USERS RENAME TO new_tablespace_name;

Tablespace altered.