Sunday, September 23, 2012


Node Reboot or Shutdown Skipped to Stop 11gR2 Grid Infrastructure

During maintenance in one of our RAC environments, a node was rebooted without first stopping Grid Infrastructure (GI) manually, since GI is expected to stop all of its processes on its own when it detects that the node is shutting down.
On startup, the ASM instance failed to start; as a result CSS, and then CRS, did not come up, which left the grid inoperable.
Research showed that this was caused by unpublished bug 8740030. Because of this bug, the K19ohasd script under /etc, which is supposed to stop Grid Infrastructure while the node reboots, is skipped because /var/lock/subsys/ohasd* does not exist:

# ls -l /var/lock/subsys/ohasd* | wc -l
output: 0
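
The same check can be wrapped in a small function that does not print an "ls" error when nothing matches. This is only a sketch: the function name count_ohasd_locks is made up for illustration, and the directory argument exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: count ohasd lock files the way the "ls | wc -l" check above does.
# count_ohasd_locks is a hypothetical helper name, not part of GI.
count_ohasd_locks() {
    # Directory to inspect; defaults to the path used in this post.
    lockdir="${1:-/var/lock/subsys}"
    count=0
    for f in "$lockdir"/ohasd*; do
        # When the glob matches nothing it stays literal, so test existence.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Calling count_ohasd_locks with no argument inspects /var/lock/subsys; a result of 0 matches the situation described above.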

The logs show the following:
CSS Logs - 
[    CSSD][4105858816]clssscProcessKillShutdown: Initiating shutdown due to process kill
[    CSSD][4105858816]###################################
[    CSSD][1145833792]clssgmSendShutdown: Aborting client (0x2aaaac01c850) proc (0x90f66c0), iocapables 1.

ASM Logs - 
ORA-29746: Cluster Synchronization Service is being shut down.
ORA-29702: error occurred in Cluster Group Service operation
GMON (ospid: 6595): terminating the instance due to error 29746
Instance terminated by GMON, pid = 6595
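
To see quickly whether an ASM alert log contains this signature, something like the following could be used. It is a rough sketch: find_css_shutdown_errors is a made-up name, and the log file path has to be supplied by the caller because it depends on the installation.

```shell
#!/bin/sh
# Sketch: print the lines of a log file that contain the two ORA errors
# quoted above. find_css_shutdown_errors is a hypothetical helper name.
find_css_shutdown_errors() {
    # Path to the ASM alert log; deliberately has no default value.
    logfile="$1"
    grep -E 'ORA-29746|ORA-29702' "$logfile"
}
```

The function returns grep's exit status, so it can be used directly in an "if" to test whether either error is present.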


Fix -
To fix the issue, one has to modify /etc/init.d/ohasd

1. From:
  Linux)
  ..
    LOGMSG="$LOGGER -puser.err"
    LOGERR="$LOGGER -puser.alert"
    ;;
To:
  Linux)
  ..
    LOGMSG="$LOGGER -puser.err"
    LOGERR="$LOGGER -puser.alert"
    SUBSYSFILE="/var/lock/subsys/ohasd"
    ;;

2. From:
start()
{
  $ECHO -n $"Starting $PROG: "
To:
start()
{
  case `/bin/uname` in
    Linux)
      /bin/touch $SUBSYSFILE
      ;;
    *)
      ;;
  esac
  $ECHO -n $"Starting $PROG: "

 
3. From:
stop()
{
  $ECHO -n "Stopping Oracle Clusterware stack"
  ..
}
To:
stop()
{
  case `/bin/uname` in
    Linux)
      $RMF $SUBSYSFILE
      ;;
    *)
      ;;
  esac
  $ECHO -n "Stopping Oracle Clusterware stack"
  ..
}
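
After editing, it may be worth confirming that all three changes made it into the script. The sketch below only looks for the literal lines shown in the steps above; verify_ohasd_patch is a made-up name, and the check does not prove that each line sits in the right function.

```shell
#!/bin/sh
# Sketch: check that the three edits described above are present.
# verify_ohasd_patch is a hypothetical helper name, not part of GI.
verify_ohasd_patch() {
    # Script to inspect; defaults to the file modified in this post.
    script="${1:-/etc/init.d/ohasd}"
    missing=0
    # Single quotes keep $SUBSYSFILE and $RMF literal for the fixed-string search.
    for pattern in \
        'SUBSYSFILE="/var/lock/subsys/ohasd"' \
        '/bin/touch $SUBSYSFILE' \
        '$RMF $SUBSYSFILE'
    do
        if ! grep -qF -- "$pattern" "$script"; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    # 0 when every line was found, 1 otherwise.
    return "$missing"
}
```

Run without arguments it inspects /etc/init.d/ohasd and lists any edit it could not find.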

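Once /etc/init.d/ohasd is modified, execute the following command before rebooting the node. The stack is already running, so the modified start() has not yet had a chance to create the lock file:
# /bin/touch /var/lock/subsys/ohasd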

Once this is done, reboot the node again without shutting down GI and check whether it stops gracefully. I tested it and it worked fine this time.
