elcarOnOsdnaH

Golden Gate Extract/Pump/Replicat Shows "RUNNING" when it is Down

After a UNIX system crashes and is restarted, a status check of the GoldenGate group (process) may show "RUNNING", but the process really is not running.
This can happen for Extract, an Extract data pump, or Replicat.

GoldenGate considers a group as "RUNNING" when its checkpoint file is locked. In a mounted file system, (such as NFS or NAS) an operating system bug prevents the locks on the files from being released after a system crash. Because the checkpoint file is locked, the GoldenGate process is considered "RUNNING" when checked through GGSCI.

In a normal situation, the process should be accessing the .pce file (if Extract) or .pcr file (if Replicat) in the dirpcs directory AND the .cpe (Extract) or .cpr (Replicat) file in the dirchk directory (in the GoldenGate installation directory).

When I checked, I see following status.

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER STOPPED
JAGENT STOPPED
REPLICAT RUNNING RQOG025 00:00:00 175:17:57
REPLICAT RUNNING RQOG035 00:00:00 175:17:56
REPLICAT STOPPED RQOG125 00:00:00 03:08:05
REPLICAT RUNNING RQOG135 00:00:00 175:17:48
REPLICAT RUNNING RQOG225 00:00:00 175:17:51
REPLICAT RUNNING RQOG235 00:00:00 175:17:46

Try to find out the pid for the process which is being shown as running. In my case the OS didnt return any PID so effectively no process was running.

ps -ef | grep RQOG025

check the checkpoint directory to find out if the processes has uncleared checkpoint files.

[ggsadmin 12.2.0]$cd dirpcs/
MGR.pcm RQOG025.pcr RQOG035.pcr RQOG135.pcr RQOG225.pcr RQOG235.pcr
[ggsadmin 12.2.0]$cd dirpcs/
[ggsadmin dirpcs]$ll
total 24
-rw-r----- 1 ggsadmin oinstall 60 Aug 17 09:07 MGR.pcm
-rw-r----- 1 ggsadmin oinstall 69 Aug 7 10:46 RQOG025.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug 7 10:46 RQOG035.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug 7 10:46 RQOG135.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug 7 10:46 RQOG225.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug 7 10:46 RQOG235.pcr

check the process id for replicat and check if the process exists, if it does, then kill it.

[ggsadmin dirpcs]$cat RQOG025.pcr
PROGRAM REPLICAT PROCESSID RQOG025 PORT xxxxxxdb11.7846 PID 93930
[ggsadmin dirpcs]$ps -p 93930
PID TTY TIME CMD
93930 ? 00:10:08 replicat

[ggsadmin dirpcs]$kill 93930
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG035.pcr
PROGRAM REPLICAT PROCESSID RQOG035 PORT xxxxxxdb11.7845 PID 93931
[ggsadmin dirpcs]$kill 93931
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG135.pcr
PROGRAM REPLICAT PROCESSID RQOG135 PORT xxxxxxdb11.7843 PID 93933
[ggsadmin dirpcs]$kill 93933
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$ccat RQOG225.pcr
-bash: ccat: command not found
[ggsadmin dirpcs]$cat RQOG225.pcr
PROGRAM REPLICAT PROCESSID RQOG225 PORT xxxxxxdb11.7841 PID 93934
[ggsadmin dirpcs]$kill 93934
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG235.pcr
PROGRAM REPLICAT PROCESSID RQOG235 PORT xxxxxxdb11.7842 PID 93935
[ggsadmin dirpcs]$kill 93935
[ggsadmin dirpcs]$

Start the GG instance using AGCTL

[ggsadmin dirpcs]$agctl status goldengate gg_inst
Goldengate instance 'gg_inst' is not running

[ggsadmin dirpcs]$agctl start goldengate gg_inst
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
ERROR: Manager not currently running.

CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' succeeded
CRS-2563: Attempt to start resource 'xag.gg_inst.goldengate' on 'xxxxxxdb11' has failed. Will re-retry on 'xxxxxxdb12' now.
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2632: There are no more servers to try to place resource 'xag.gg_inst.goldengate' on that would satisfy its placement policy
CRS-4000: Command Start failed, or completed with errors.

So as you ccan see the start up failed even after clearing up checkpoint files.

[ggsadmin dirpcs]$agctl status goldengate gg_inst
Goldengate instance 'gg_inst' is not running

[ggsadmin 12.2.0]$./ggsci

GGSCI (xxxxxxdb11) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER STOPPED
JAGENT STOPPED
REPLICAT STOPPED RQOG025 00:00:00 175:27:07
REPLICAT STOPPED RQOG035 00:00:00 175:27:06
REPLICAT STOPPED RQOG125 00:00:00 03:17:15
REPLICAT STOPPED RQOG135 00:00:00 175:26:57
REPLICAT STOPPED RQOG225 00:00:00 175:27:01
REPLICAT STOPPED RQOG235 00:00:00 175:26:55

GGSCI (xxxxxxdb11) 2> start mgr
Manager started.

GGSCI (xxxxxxdb11) 3> info all
Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER STOPPED -- Manager still not started.
JAGENT STOPPED
REPLICAT STOPPED RQOG025 00:00:00 175:27:10
REPLICAT STOPPED RQOG035 00:00:00 175:27:09
REPLICAT STOPPED RQOG125 00:00:00 03:17:18
REPLICAT STOPPED RQOG135 00:00:00 175:27:01
REPLICAT STOPPED RQOG225 00:00:00 175:27:04
REPLICAT STOPPED RQOG235 00:00:00 175:26:59

THE PROBLEM HERE IS THE DATASTORE. ONE NEEDS TO RECREATE IT AS FOLLOWS.

GGSCI (xxxxxxdb11) 2> delete datastore ;
Are you sure you want to delete the datastore? yes
2017-08-17T16:20:17Z INFO OGG-06492 Datastore deleted
GGSCI (xxxxxxdb11) 3> exit

[ggsadmin@xxxxxxdb11 12.2.0]$./ggsci
GGSCI (xxxxxxdb11) 1> create datastore shm id 1000
Profile 'Trace' added.
2017-08-17T16:20:30Z INFO OGG-06489 Datastore created
GGSCI (xxxxxxdb11) 2> exit

GGSCI (xxxxxxdb11) 1> info datastore
2017-08-17T16:20:38Z INFO OGG-06309 Datastore uses shared memory (SHM) environment files with a starting ID of 0x000003e8.
GGSCI (xxxxxxdb11) 2> exit

[ggsadmin@xxxxxxdb11 12.2.0]$agctl start goldengate gg_inst
[ggsadmin@xxxxxxdb11 12.2.0]$agctl status goldengate gg_inst
Goldengate instance 'gg_inst' is running on xxxxxxdb51

GGSCI (xxxxxxdb11) 2> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
JAGENT RUNNING
REPLICAT STOPPED RQOG025 00:00:00 175:31:41
REPLICAT STOPPED RQOG035 00:00:00 175:31:40
REPLICAT STOPPED RQOG125 00:00:00 03:21:49
REPLICAT STOPPED RQOG135 00:00:00 175:31:31
REPLICAT STOPPED RQOG225 00:00:00 175:31:35
REPLICAT STOPPED RQOG235 00:00:00 175:31:30

GGSCI (xxxxxxdb11) 3> start er *

Sending START request to MANAGER ...
REPLICAT RQOG025 starting

Sending START request to MANAGER ...
REPLICAT RQOG035 starting

Sending START request to MANAGER ...
REPLICAT RQOG125 starting

Sending START request to MANAGER ...
REPLICAT RQOG135 starting

Sending START request to MANAGER ...
REPLICAT RQOG225 starting

Sending START request to MANAGER ...
REPLICAT RQOG235 starting

GGSCI (xxxxxxdb11) 4> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING <-- MANAGER IS NOW RUNNING AFTER RECREATING DATASTORE
JAGENT RUNNING
REPLICAT RUNNING RQOG025 00:00:00 175:32:03
REPLICAT RUNNING RQOG035 00:00:00 175:32:02
REPLICAT RUNNING RQOG125 00:00:00 03:22:11
REPLICAT RUNNING RQOG135 00:00:00 175:31:53
REPLICAT RUNNING RQOG225 00:00:00 175:31:57
REPLICAT RUNNING RQOG235 00:00:00 175:31:51

IF ABOVE FIX DOESNT WORK YOU CAN TRY FOLLOWING WORK AROUND TOO....

Workaround 1:
Rename the checkpoint file, then copy (not rename again) the file back to the original name. For example:

mv RQOG025.cpe RQOG025.cpe.save
cp -p RQOG025.cpe.save RQOG025.cpe

Workaround 2:
Move the checkpoint file to another file system, and then move it back again. For example:

mv RQOG025.cpe /tmp/RQOG025.cpe
mv /tmp/RQOG025.cpe ./RQOG025.cpe

elcarOnOsdnaH

Thursday, August 17, 2017

Golden Gate Extract/Pump/Replicat Shows "RUNNING" when it is Down

2 comments: