Thursday, August 17, 2017

Golden Gate Extract/Pump/Replicat Shows "RUNNING" when it is Down


After a UNIX system crashes and is restarted, a status check of the GoldenGate group (process) may show "RUNNING", but the process really is not running.
This can happen for Extract, an Extract data pump, or Replicat.

GoldenGate considers a group as "RUNNING" when its checkpoint file is locked. In a mounted file system, (such as NFS or NAS) an operating system bug prevents the locks on the files from being released after a system crash. Because the checkpoint file is locked, the GoldenGate process is considered "RUNNING" when checked through GGSCI.

In a normal situation, the process should be accessing the .pce file (if Extract) or .pcr file (if Replicat) in the dirpcs directory AND the .cpe (Extract) or .cpr (Replicat) file in the dirchk directory (in the GoldenGate installation directory).

When I checked, I see following status. 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
JAGENT      STOPPED
REPLICAT    RUNNING     RQOG025     00:00:00      175:17:57
REPLICAT    RUNNING     RQOG035     00:00:00      175:17:56
REPLICAT    STOPPED     RQOG125     00:00:00      03:08:05
REPLICAT    RUNNING     RQOG135     00:00:00      175:17:48
REPLICAT    RUNNING     RQOG225     00:00:00      175:17:51
REPLICAT    RUNNING     RQOG235     00:00:00      175:17:46

Try to find out the pid for the process which is being shown as running. In my case the OS didnt return any PID so effectively no process was running. 

ps -ef | grep RQOG025

check the checkpoint directory to find out if the processes has uncleared checkpoint files. 

[ggsadmin 12.2.0]$cd dirpcs/
MGR.pcm      RQOG025.pcr  RQOG035.pcr  RQOG135.pcr  RQOG225.pcr  RQOG235.pcr
[ggsadmin 12.2.0]$cd dirpcs/
[ggsadmin dirpcs]$ll
total 24
-rw-r----- 1 ggsadmin oinstall 60 Aug 17 09:07 MGR.pcm
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG025.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG035.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG135.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG225.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG235.pcr

check the process id for replicat and check if the process exists, if it does, then kill it. 

[ggsadmin dirpcs]$cat RQOG025.pcr
PROGRAM REPLICAT        PROCESSID       RQOG025 PORT    xxxxxxdb11.7846       PID     93930
[ggsadmin dirpcs]$ps -p 93930
  PID TTY          TIME CMD
93930 ?        00:10:08 replicat

[ggsadmin dirpcs]$kill 93930
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG035.pcr
PROGRAM REPLICAT        PROCESSID       RQOG035 PORT    xxxxxxdb11.7845       PID     93931
[ggsadmin dirpcs]$kill 93931
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG135.pcr
PROGRAM REPLICAT        PROCESSID       RQOG135 PORT    xxxxxxdb11.7843       PID     93933
[ggsadmin dirpcs]$kill 93933
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$ccat RQOG225.pcr
-bash: ccat: command not found
[ggsadmin dirpcs]$cat RQOG225.pcr
PROGRAM REPLICAT        PROCESSID       RQOG225 PORT    xxxxxxdb11.7841       PID     93934
[ggsadmin dirpcs]$kill 93934
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG235.pcr
PROGRAM REPLICAT        PROCESSID       RQOG235 PORT    xxxxxxdb11.7842       PID     93935
[ggsadmin dirpcs]$kill 93935
[ggsadmin dirpcs]$

Start the GG instance using AGCTL 

[ggsadmin dirpcs]$agctl status   goldengate gg_inst
Goldengate  instance 'gg_inst' is not running

[ggsadmin dirpcs]$agctl start   goldengate gg_inst
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
ERROR: Manager not currently running.

CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' succeeded
CRS-2563: Attempt to start resource 'xag.gg_inst.goldengate' on 'xxxxxxdb11' has failed. Will re-retry on 'xxxxxxdb12' now.
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2632: There are no more servers to try to place resource 'xag.gg_inst.goldengate' on that would satisfy its placement policy
CRS-4000: Command Start failed, or completed with errors.

So as you ccan see the start up failed even after clearing up checkpoint files. 

[ggsadmin dirpcs]$agctl status   goldengate gg_inst
Goldengate  instance 'gg_inst' is not running

[ggsadmin 12.2.0]$./ggsci

GGSCI (xxxxxxdb11) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
JAGENT      STOPPED
REPLICAT    STOPPED     RQOG025     00:00:00      175:27:07
REPLICAT    STOPPED     RQOG035     00:00:00      175:27:06
REPLICAT    STOPPED     RQOG125     00:00:00      03:17:15
REPLICAT    STOPPED     RQOG135     00:00:00      175:26:57
REPLICAT    STOPPED     RQOG225     00:00:00      175:27:01
REPLICAT    STOPPED     RQOG235     00:00:00      175:26:55


GGSCI (xxxxxxdb11) 2> start mgr
Manager started.

GGSCI (xxxxxxdb11) 3> info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED    -- Manager still not started. 
JAGENT      STOPPED
REPLICAT    STOPPED     RQOG025     00:00:00      175:27:10
REPLICAT    STOPPED     RQOG035     00:00:00      175:27:09
REPLICAT    STOPPED     RQOG125     00:00:00      03:17:18
REPLICAT    STOPPED     RQOG135     00:00:00      175:27:01
REPLICAT    STOPPED     RQOG225     00:00:00      175:27:04
REPLICAT    STOPPED     RQOG235     00:00:00      175:26:59

THE PROBLEM HERE IS THE DATASTORE. ONE NEEDS TO RECREATE IT AS FOLLOWS. 

GGSCI (xxxxxxdb11) 2> delete datastore ;
Are you sure you want to delete the datastore? yes
2017-08-17T16:20:17Z  INFO    OGG-06492  Datastore deleted
GGSCI (xxxxxxdb11) 3> exit

[ggsadmin@xxxxxxdb11 12.2.0]$./ggsci
GGSCI (xxxxxxdb11) 1> create datastore shm id 1000
Profile 'Trace' added.
2017-08-17T16:20:30Z  INFO    OGG-06489  Datastore created
GGSCI (xxxxxxdb11) 2> exit

GGSCI (xxxxxxdb11) 1> info datastore
2017-08-17T16:20:38Z  INFO    OGG-06309  Datastore uses shared memory (SHM) environment files with a starting ID of 0x000003e8.
GGSCI (xxxxxxdb11) 2> exit

[ggsadmin@xxxxxxdb11 12.2.0]$agctl start  goldengate gg_inst
[ggsadmin@xxxxxxdb11 12.2.0]$agctl status  goldengate gg_inst
Goldengate  instance 'gg_inst' is running on xxxxxxdb51

GGSCI (xxxxxxdb11) 2> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
JAGENT      RUNNING
REPLICAT    STOPPED     RQOG025     00:00:00      175:31:41
REPLICAT    STOPPED     RQOG035     00:00:00      175:31:40
REPLICAT    STOPPED     RQOG125     00:00:00      03:21:49
REPLICAT    STOPPED     RQOG135     00:00:00      175:31:31
REPLICAT    STOPPED     RQOG225     00:00:00      175:31:35
REPLICAT    STOPPED     RQOG235     00:00:00      175:31:30

GGSCI (xxxxxxdb11) 3> start er *

Sending START request to MANAGER ...
REPLICAT RQOG025 starting

Sending START request to MANAGER ...
REPLICAT RQOG035 starting

Sending START request to MANAGER ...
REPLICAT RQOG125 starting

Sending START request to MANAGER ...
REPLICAT RQOG135 starting

Sending START request to MANAGER ...
REPLICAT RQOG225 starting

Sending START request to MANAGER ...
REPLICAT RQOG235 starting

GGSCI (xxxxxxdb11) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING     <-- MANAGER IS NOW RUNNING AFTER RECREATING DATASTORE
JAGENT      RUNNING
REPLICAT    RUNNING     RQOG025     00:00:00      175:32:03
REPLICAT    RUNNING     RQOG035     00:00:00      175:32:02
REPLICAT    RUNNING     RQOG125     00:00:00      03:22:11
REPLICAT    RUNNING     RQOG135     00:00:00      175:31:53
REPLICAT    RUNNING     RQOG225     00:00:00      175:31:57
REPLICAT    RUNNING     RQOG235     00:00:00      175:31:51

IF ABOVE FIX DOESNT WORK YOU CAN TRY FOLLOWING WORK AROUND TOO.... 

Workaround 1:
Rename the checkpoint file, then copy (not rename again) the file back to the original name. For example:

mv RQOG025.cpe RQOG025.cpe.save
cp -p RQOG025.cpe.save RQOG025.cpe

Workaround 2:
Move the checkpoint file to another file system, and then move it back again. For example:

mv RQOG025.cpe /tmp/RQOG025.cpe
mv /tmp/RQOG025.cpe ./RQOG025.cpe

1 comment:

  1. Thanks for the tips for the workarounds. they work like a charm

    ReplyDelete