Sunday, November 4, 2012


How to Restore ASM based OCR After Loss

In this post I will describe how to restore the OCR and voting disk if they are lost due to a hardware issue or a manual error.
I am using a single-node cluster for this demo; however, the process remains almost the same on a multi-node cluster.

[oracle@appsractest ~]$ cluvfy stage -post crsinst -n appsractest -verbose

Performing post-checks for cluster services setup
Checking node reachability...

Check: Node reachability from node "appsractest"

  Destination Node                      Reachable?
  ------------------------------------  ------------------------
  appsractest                           yes
Result: Node reachability check passed from node "appsractest"
Checking user equivalence...

Check: User equivalence for user "oracle"

  Node Name                             Comment
  ------------------------------------  ------------------------
  appsractest                           passed
Result: User equivalence check passed for user "oracle"

ERROR:

PRVF-4037 : CRS is not installed on any of the nodes
Verification cannot proceed
Post-check for cluster services setup was unsuccessful on all the nodes.

[oracle@appsractest ~]$ crsctl check cluster -all

**************************************************************
appsractest:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************

It is not possible to directly restore a manual or automatic OCR backup if the OCR is located in an ASM disk group. This is caused by the fact that the command 'ocrconfig -restore' requires ASM to be up & running in order to restore an OCR backup to an ASM disk group. However, for ASM to be available, the CSS and CRS stack must have been successfully started. 

On the other hand, for the restore to succeed, the OCR must not be in use (read/write), i.e. no CRS daemon may be running while the OCR is being restored.

A description of the general procedure to restore the OCR can be found in the documentation. This document explains how to recover from a complete loss of the ASM disk group that held the OCR and voting files in an 11gR2 Grid environment.


When using an ASM disk group for CRS there are typically 3 different types of files located in the disk group that potentially need to be restored/recreated:

the Oracle Cluster Registry file (OCR)
the Voting file(s)
the shared SPFILE for the ASM instances

The following example assumes that the OCR was located in a single disk group used exclusively for CRS. The disk group has just one disk using external redundancy.

Note - This document assumes that the name of the OCR diskgroup remains unchanged. If a different diskgroup name has to be used, the name of the OCR diskgroup must be modified in /etc/oracle/ocr.loc on all nodes before executing the following steps.
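If the diskgroup name does have to change, the edit to ocr.loc could be scripted roughly as below. This is only a sketch: set_ocr_diskgroup is a hypothetical helper of mine, and it assumes the file holds a line of the form ocrconfig_loc=+DISKGROUP, which you should confirm against your own /etc/oracle/ocr.loc before relying on it.

```shell
#!/bin/sh
# Sketch: point ocrconfig_loc at a different disk group.
# set_ocr_diskgroup is a hypothetical helper, not an Oracle tool.
# Assumption: the file contains a line like "ocrconfig_loc=+OCR_DISK".
set_ocr_diskgroup() {
    file=$1      # path to ocr.loc (or a copy of it)
    new_dg=$2    # new disk group name, without the leading '+'
    # Keep a backup copy, then rewrite only the ocrconfig_loc line.
    cp "$file" "$file.bak" &&
    sed "s|^ocrconfig_loc=.*|ocrconfig_loc=+$new_dg|" "$file.bak" > "$file"
}
```

It would be run as root on every node, for example set_ocr_diskgroup /etc/oracle/ocr.loc NEWDG, where NEWDG is a placeholder name. The previous file is kept next to it with a .bak suffix.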

-- Locate the latest OCR backup

When using a non-shared CRS home, automatic OCR backups can be located on any node of the cluster, so all nodes need to be checked for the most recent backup. In this demo I list the manual backups:
[root@appsractest appsractest]# ocrconfig -showbackup manual
appsractest     2012/10/30 06:49:31     /u01/app/grid/11.0/cdata/ractest/backup_20121030_064931.ocr
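On a multi-node cluster the listings from all nodes have to be compared by hand. A minimal sketch of how that comparison could be automated is below; latest_ocr_backup is a hypothetical helper, and it assumes every backup line has the same "node date time path" layout as the output above.

```shell
#!/bin/sh
# Sketch: pick the newest backup from `ocrconfig -showbackup` style output.
# latest_ocr_backup is a hypothetical helper; it reads the listing on stdin.
latest_ocr_backup() {
    # Keep only lines whose second field looks like YYYY/MM/DD, then sort by
    # date and time (both are zero-padded, so a plain text sort is enough).
    awk '$2 ~ /^[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ { print $2, $3, $4 }' |
        LC_ALL=C sort |
        tail -n 1 |
        awk '{ print $3 }'
}
```

Collect the listings from every node into one file, pass that file to the function on stdin, and it prints the path of the most recent backup. Blank lines and PROT messages in the listing are ignored.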

-- If you try to delete the only remaining OCR location, the command is rejected:
[root@appsractest appsractest]# ocrconfig -delete +DATA1
PROT-28: Cannot delete or replace the only configured Oracle Cluster Registry location

-- Make sure the Grid Infrastructure is shut down on all nodes
-- If the OCR diskgroup is missing, the GI stack will not be functional on any node; however, there may still be various daemon processes running. On each node shut down the GI stack, adding the force (-f) option if needed (a plain stop was enough in this demo):


[root@appsractest grid]# crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'appsractest'
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'appsractest' has completed
CRS-4133: Oracle High Availability Services has been stopped.


[root@appsractest bin]# ps -ef | grep d.bin
root     13781  3768  0 07:12 pts/1    00:00:00 grep d.bin
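In the output above the only match is the grep command itself, which is easy to misread on a busy system. A small sketch of a check that avoids this is below; gi_stack_is_down is a hypothetical helper that reads a process listing on stdin.

```shell
#!/bin/sh
# Sketch: succeed only if no clusterware daemon (*d.bin) shows up in a
# process listing. gi_stack_is_down is a hypothetical helper.
gi_stack_is_down() {
    # The pattern '[d]\.bin' matches the text "d.bin" but not a grep command
    # line that contains the pattern itself.
    ! grep -q '[d]\.bin'
}
```

Usage would be along the lines of: ps -ef | gi_stack_is_down && echo "no GI daemons left".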

Note - If you try to restore before shutting down CRS, you will get the following error:
[root@appsractest grid]# ocrconfig -restore /u01/app/11.2.0.2/grid/cdata/appsractest-cluster/backup_20121022_102735.ocr
PROT-19: Cannot proceed while the Cluster Ready Service is running

-- Start the CRS stack in exclusive mode
-- On the node that has the most recent OCR backup, log on as root and start CRS in exclusive mode. This mode allows ASM to start and stay up without the presence of a voting disk and without the CRS daemon process (crsd.bin) running.

Please note:

This document assumes that the CRS diskgroup was completely lost, in which case the CRS daemon (resource ora.crsd) will terminate again because the OCR is inaccessible, even if the startup output indicates that the start succeeded.
If this is not the case, i.e. if the CRS diskgroup is still present (but corrupt or incorrect), the CRS daemon needs to be shut down manually on 11.2.0.1, otherwise the subsequent OCR restore will fail:
# $CRS_HOME/bin/crsctl stop res ora.crsd -init

11.2.0.2:

# $CRS_HOME/bin/crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
...
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'auw2k3'
CRS-2672: Attempting to start 'ora.ctssd' on 'racnode1'
CRS-2676: Start of 'ora.drivers.acfs' on 'racnode1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'racnode1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'racnode1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
CRS-2676: Start of 'ora.asm' on 'racnode1' succeeded

IMPORTANT:
A new option '-nocrs' has been introduced with  11.2.0.2, which prevents the start of the ora.crsd resource. It is vital that this option is specified, otherwise the failure to start the ora.crsd resource will tear down ora.cluster_interconnect.haip, which in turn will cause ASM to crash.
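The version rule above can be written down as a small helper so the right start command is chosen consistently. This is only a sketch: excl_start_cmd is a hypothetical function, the version string is passed in by hand, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: print the exclusive-mode start command for a given GI version.
# excl_start_cmd is a hypothetical helper; it does not call crsctl itself.
excl_start_cmd() {
    version=$1                       # e.g. 11.2.0.1
    # Split the dotted version into its numeric fields.
    old_ifs=$IFS; IFS=.
    set -- $version
    IFS=$old_ifs
    a=${1:-0}; b=${2:-0}; c=${3:-0}; d=${4:-0}
    # Fold the first four fields into one number so they compare in order.
    n=$(( a * 1000000 + b * 10000 + c * 100 + d ))
    # -nocrs is available from 11.2.0.2 onwards (see the note above).
    if [ "$n" -ge 11020002 ]; then
        echo "crsctl start crs -excl -nocrs"
    else
        echo "crsctl start crs -excl"
    fi
}
```

For example, excl_start_cmd 11.2.0.1 prints the plain -excl form, while excl_start_cmd 11.2.0.2 prints the form with -nocrs.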

-- Since I'm using 11.2.0.1, I have to use the following command:
[root@appsractest bin]# crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.gipcd' on 'appsractest'
CRS-2672: Attempting to start 'ora.mdnsd' on 'appsractest'
CRS-2676: Start of 'ora.gipcd' on 'appsractest' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'appsractest' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'appsractest'
CRS-2676: Start of 'ora.gpnpd' on 'appsractest' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'appsractest'
CRS-2676: Start of 'ora.cssdmonitor' on 'appsractest' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'appsractest'
CRS-2679: Attempting to clean 'ora.diskmon' on 'appsractest'
CRS-2681: Clean of 'ora.diskmon' on 'appsractest' succeeded
CRS-2672: Attempting to start 'ora.diskmon' on 'appsractest'
CRS-2676: Start of 'ora.diskmon' on 'appsractest' succeeded
CRS-2676: Start of 'ora.cssd' on 'appsractest' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'appsractest'
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'appsractest'
CRS-2676: Start of 'ora.ctssd' on 'appsractest' succeeded
CRS-2676: Start of 'ora.drivers.acfs' on 'appsractest' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'appsractest'
CRS-2676: Start of 'ora.asm' on 'appsractest' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'appsractest'
CRS-2676: Start of 'ora.crsd' on 'appsractest' succeeded
[root@appsractest bin]# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

[root@appsractest bin]# crsctl check cluster
CRS-4692: Cluster Ready Services is online in exclusive mode
CRS-4529: Cluster Synchronization Services is online
-- Stop the CRS daemon (ora.crsd) so that the OCR is not in use during the restore
[root@appsractest dbs]# crsctl stop resource ora.crsd -init
CRS-2673: Attempting to stop 'ora.crsd' on 'appsractest'
CRS-2677: Stop of 'ora.crsd' on 'appsractest' succeeded
-- Restore the latest OCR backup (this must be done as the root user), then verify it with ocrcheck:
[root@appsractest dbs]# ocrconfig -restore /u01/app/grid/11.0/cdata/ractest/backup_20121030_064931.ocr
[root@appsractest dbs]#
[root@appsractest dbs]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2568
         Available space (kbytes) :     259552
         ID                       :  219028771
         Device/File Name         :  +OCR_DISK
                                    Device/File integrity check succeeded
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
                                    Device/File not configured
         Cluster registry integrity check succeeded
         Logical corruption check succeeded
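The two lines to look for in the ocrcheck output are the registry integrity check and the logical corruption check. A minimal sketch of a scripted version of that check is below; ocr_is_healthy is a hypothetical helper that reads ocrcheck output on stdin and relies on the exact message texts shown above.

```shell
#!/bin/sh
# Sketch: succeed only if ocrcheck reported both integrity checks as passed.
# ocr_is_healthy is a hypothetical helper; it reads ocrcheck output on stdin.
ocr_is_healthy() {
    out=$(cat)
    printf '%s\n' "$out" | grep -q 'Cluster registry integrity check succeeded' &&
    printf '%s\n' "$out" | grep -q 'Logical corruption check succeeded'
}
```

Usage would be along the lines of: ocrcheck | ocr_is_healthy && echo "OCR restore verified".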
-- Once the OCR is restored, stop the stack using the force (-f) option:
[root@appsractest dbs]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'appsractest'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'appsractest'
CRS-2673: Attempting to stop 'ora.ctssd' on 'appsractest'
CRS-2673: Attempting to stop 'ora.asm' on 'appsractest'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'appsractest'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'appsractest'
CRS-2677: Stop of 'ora.cssdmonitor' on 'appsractest' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'appsractest' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'appsractest' succeeded
CRS-2677: Stop of 'ora.asm' on 'appsractest' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'appsractest' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'appsractest'
CRS-2677: Stop of 'ora.cssd' on 'appsractest' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'appsractest'
CRS-2673: Attempting to stop 'ora.diskmon' on 'appsractest'
CRS-2677: Stop of 'ora.gpnpd' on 'appsractest' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'appsractest'
CRS-2677: Stop of 'ora.gipcd' on 'appsractest' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'appsractest' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'appsractest' has completed
CRS-4133: Oracle High Availability Services has been stopped.

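-- If you need to recreate the voting file, it has to be initialized in the OCR_DISK disk group. Note that crsctl replace votedisk needs CSS and ASM to be running, so run it while the stack is still up in exclusive mode, before the stop shown above: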
[root@appsractest dbs]# $CRS_HOME/bin/crsctl replace votedisk +OCR_DISK
Successful addition of voting disk 00caa5b9c0f54f3abf5bd2a2609f09a9.
Successfully replaced voting disk group with +CRS.
CRS-4266: Voting file(s) successfully replaced

-- Once done, you can safely start the HAS stack and verify that the cluster comes back cleanly.
[root@appsractest dbs]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

If you check again after some time, the whole Grid stack will be up and running.

Note - If your voting disk is also corrupted due to a maintenance command, you need to stop all running clusterware processes, then delete and recreate the OCR_DISK disk:
[root@appsractest dbs]# ps -ef | grep d.bin
root     24899  3768  0 08:18 pts/1    00:00:00 grep d.bin
[root@appsractest dbs]# ps -ef | grep crs
root     24902  3768  0 08:19 pts/1    00:00:00 grep crs
[root@appsractest dbs]# clear
[root@appsractest dbs]# oracleasm createdisk OCR_DISK /dev/hdd1
Writing disk header: done
Instantiating disk: done
[root@appsractest dbs]# oracleasm listdisks
DATA
OCR_DISK

-- The shared SPFILE for the ASM instance also needs to be recreated, as follows:
[root@appsractest dbs]# sqlplus  sys/oracle  as sysasm
SQL*Plus: Release 11.2.0.1.0 Production on Tue Oct 30 09:05:41 2012
Copyright (c) 1982, 2009, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Real Application Clusters and Automatic Storage Management options
SQL> sho parameter spfile
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      +DATA1/ractest/asmparameterfil
                                                 e/registry.253.798018457
SQL> create spfile='+DATA1/iedge/asmparameterfile/registry.253.798018457' from pfile;