ORA-12547: TNS:lost contact / Process W000 died / Process J000 died / kkjcre1p: unable to spawn jobq slave process
There are days when you can’t seem to do a thing wrong, and there are days when you can’t do a thing right. 26 March 13 was one of the latter kind.
The DBAs got paged that the production DB was not accessible for users to run batch jobs, and we jumped on the issue. The error reported was as follows:
ORA-12547: TNS:lost contact
This is a very generic error. After spending a bit of time on it, the DBAs figured out that even they could not connect to the DB. Things were not looking right. The usual suspect in this kind of situation is the process limit, so the alert file was scanned for a matching error (though users never reported one). Surprisingly, there was no error indicating that either the processes parameter or the sessions parameter in the DB had maxed out!!!
Instead, the errors we found were the following…
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /db/archive/app/oracle/product/diag/rdbms/mmprod/MMPROD/trace/MMPROD_cjq0_778436.trc:
Process W000 died, see its trace file
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Hmm.. Unable to spawn process… Maybe the JOB_QUEUE_PROCESSES parameter had reached its limit. But that would only stop background jobs; it would not block user connections, let alone connections as sysdba.
Something was amiss.
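For anyone who wants to scan an alert log for these messages quickly, here is a minimal Python sketch. It only looks for the two message shapes seen in the excerpt above; the function names and the pattern table are made up for illustration and are not part of any Oracle tooling.

```python
import re
from collections import Counter

# Patterns modelled on the alert log lines quoted in this post.
# The dictionary keys are hypothetical labels, not Oracle terminology.
SPAWN_FAILURE_PATTERNS = {
    "process_died": re.compile(r"Process (\w+) died, see its trace file"),
    "spawn_failed": re.compile(r"unable to spawn (\w+) slave process"),
}


def count_spawn_failures(alert_log_text):
    """Count how many lines match each spawn-failure pattern."""
    counts = Counter()
    for line in alert_log_text.splitlines():
        for label, pattern in SPAWN_FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def died_processes(alert_log_text):
    """Return the sorted, de-duplicated names of processes reported as died."""
    names = SPAWN_FAILURE_PATTERNS["process_died"].findall(alert_log_text)
    return sorted(set(names))
```

When many different background processes (here both W000 and J000) die in a short window, the problem is probably not specific to the job queue, which fits how this incident turned out.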
We needed to look at this from a different angle. Probably something was wrong with the OS, so the SAs were called in, and they declared (as always) that nothing seemed wrong with the OS. We also asked them to check the profile limits, and those seemed fine. So the ball was back in our court.
MOS was consulted, and we came across a few notes suggesting that this might be due to a permissions issue. But looking at the ORACLE_HOME binaries, things looked fine!!!
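One common form of that permissions check is to confirm that the oracle executable under ORACLE_HOME/bin still has its setuid and setgid bits. The sketch below assumes the usual Unix layout, where the binary is normally shown as -rwsr-s--x; verify that expectation against your own installation. The helper names are hypothetical.

```python
import os
import stat


def missing_setid_bits(mode):
    """Return which of the setuid/setgid bits are absent from a file mode."""
    missing = []
    if not mode & stat.S_ISUID:
        missing.append("setuid")
    if not mode & stat.S_ISGID:
        missing.append("setgid")
    return missing


def check_oracle_binary(oracle_home):
    """Check $ORACLE_HOME/bin/oracle (assumed location) for missing bits."""
    path = os.path.join(oracle_home, "bin", "oracle")
    mode = os.stat(path).st_mode
    return missing_setid_bits(mode)
```

An empty list from check_oracle_binary means both bits are present, which matches what we saw that day: the binaries were not the problem.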
So the decision was taken to bounce the DB and get it back in business, as much valuable time had been lost in troubleshooting and every SLA was in shambles. The DB was bounced, and now we found ourselves in more serious shit than we could imagine, as the DB would not start.
Things were getting more interesting by the minute. Suddenly we noticed that when we tried to set ORACLE_HOME, we got a PERMISSIONS DENIED error. One of the DBAs came across the following stack trace…
skgpgcmdout: read() for cmd /usr/bin/procstack 1425470 2>&1 timed out after 24.378 seconds
Now something was really messed up, and the primary DBA for one of our applications (a very experienced guy, whose opinion cannot be overruled!) again asked the SAs to check the OS more thoroughly.
And guess what, surprise, surprise: the SAs came back a few minutes later and declared that the permissions on the special device /dev/null had been changed by an unknown entity, and that had brought the whole house down.
Once the permissions were fixed, the DB came up nicely.
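A check like this could be scripted so it is not missed next time. The following is a minimal Python sketch, assuming /dev/null should be a character device that is readable and writable by everyone (mode 0666, shown as crw-rw-rw-), which is the usual setup on Unix systems. The function name is made up for this example.

```python
import os
import stat


def dev_null_problems(path="/dev/null"):
    """Return a list of human-readable problems found with the device node."""
    problems = []
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"cannot stat {path}: {exc}"]

    # /dev/null is expected to be a character special device.
    if not stat.S_ISCHR(st.st_mode):
        problems.append(f"{path} is not a character device")

    # Assumed expectation: read and write for owner, group and others.
    wanted = 0o666
    perms = stat.S_IMODE(st.st_mode)
    if perms & wanted != wanted:
        problems.append(f"{path} has mode {perms:04o}, expected at least {wanted:04o}")
    return problems


if __name__ == "__main__":
    for problem in dev_null_problems():
        print(problem)
```

On a healthy box the script prints nothing. Running it as the oracle OS user from cron or a monitoring agent would have pointed us at the real cause much earlier.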
So the important lesson learned: for connection issues, don’t look only at the DB, but also beyond it.