Wednesday, March 27, 2013

ORA-12547: TNS:lost contact / Process W000 died / Process J000 died / kkjcre1p: unable to spawn jobq slave process



There are days when you can’t seem to do a thing wrong, and there are days when you can’t do a thing right. 26 March 2013 was one of the latter kind.
The DBA team got a page that the production DB was not accessible to users running batch jobs, and we jumped on the issue. The error reported was as follows:

ORA-12547: TNS:lost contact

This is a very generic error. After spending a bit of time on it, the DBAs figured out that even they could not connect to the DB. Things were not looking right. The usual suspect in this kind of situation is the process limit, so the alert log was scanned for errors of that kind (though users never reported one). Surprisingly, there was no such error, which indicated that neither the PROCESSES parameter nor the SESSIONS parameter had maxed out!
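The alert log scan described above can be sketched roughly as follows. This is only an illustration, not the exact check we ran: it assumes the limit errors of interest are ORA-00020 (maximum number of processes exceeded) and ORA-00018 (maximum number of sessions exceeded), and the function name `find_limit_errors` is made up for this example.

```python
import re

# Assumption: these are the two limit errors worth looking for.
# ORA-00020: maximum number of processes exceeded
# ORA-00018: maximum number of sessions exceeded
LIMIT_ERRORS = re.compile(r"ORA-000(?:18|20)\b")


def find_limit_errors(alert_log_text):
    """Return (line_number, line) pairs that mention a process/session limit error."""
    hits = []
    for number, line in enumerate(alert_log_text.splitlines(), start=1):
        if LIMIT_ERRORS.search(line):
            hits.append((number, line.strip()))
    return hits


# The lines we actually saw in the alert log contain no limit error at all.
sample = """\
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Process W000 died, see its trace file
"""
print(find_limit_errors(sample))  # prints []
```

An empty result here is what pushed us away from the process/session limit theory.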

The errors we did find, though, were the following:

Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /db/archive/app/oracle/product/diag/rdbms/mmprod/MMPROD/trace/MMPROD_cjq0_778436.trc:
Process W000 died, see its trace file
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process

Hmm... unable to spawn a process. Maybe the JOB_QUEUE_PROCESSES parameter had reached its limit. But that would only stop background jobs from running; it would not block user connections, and certainly not connections as SYSDBA.
Something was amiss, and we needed to look at it from a different angle. Probably something was wrong with the OS, so the SAs were called in, and they declared (as always) that nothing seemed wrong with the OS. We also asked them to check the profile limits, and those seemed fine too. So the ball was back in our court.

MOS was consulted, and we came across a few notes suggesting that this might be due to permission issues. But the ORACLE_HOME binaries looked fine!
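For anyone who wants to repeat the permission check on the binaries, here is a small sketch. It only reports the mode and the setuid/setgid bits of a file and leaves the judgment to you. The helper name `describe_mode` is made up, and the path `$ORACLE_HOME/bin/oracle` is an assumption about where the main executable lives; what mode it should have on your platform is something to confirm against the relevant MOS notes.

```python
import os
import stat


def describe_mode(path):
    """Report the permission bits of a file, including setuid/setgid."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # permission bits plus setuid/setgid/sticky
    return {
        "mode": format(mode, "04o"),
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
    }


if __name__ == "__main__":
    # Assumption: the executable is at $ORACLE_HOME/bin/oracle.
    oracle_home = os.environ.get("ORACLE_HOME")
    if oracle_home:
        binary = os.path.join(oracle_home, "bin", "oracle")
        if os.path.exists(binary):
            print(binary, describe_mode(binary))
```

In our case this kind of check came back clean, which is exactly why it was misleading: the permission problem was real, but it was not in ORACLE_HOME.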

So the decision was taken to bounce the DB and get it back in business, as much valuable time had been lost in troubleshooting and all the SLAs were in shambles. The DB was bounced, and now we found ourselves in far deeper trouble than we could have imagined, because the DB refused to start.

Things were getting more interesting by the minute. Suddenly we noticed that when we tried to set ORACLE_HOME, we got a permission denied error. One of the DBAs also came across the following in a stack trace:

skgpgcmdout: read() for cmd /usr/bin/procstack 1425470 2>&1 timed out after 24.378 seconds

Now something was really messed up, and the primary DBA for one of our applications (a very experienced guy, whose opinion cannot be overruled!) again asked the SAs to check the OS more thoroughly.
And guess what: surprise, surprise. The SAs came back a few minutes later and declared that the permissions on the special device /dev/null had been changed by an unknown entity, and that had brought the whole house down.

Once the permissions were fixed, the DB came up nicely.
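A check like the one below would have saved us a lot of time. It is a minimal sketch, assuming the usual expectation that /dev/null is a character device that everyone can read and write (mode 0666); the function name `check_null_device` is made up for this example, and it is not what the SAs actually ran.

```python
import os
import stat


def check_null_device(path="/dev/null", expected_mode=0o666):
    """Return a list of problems found with the null device (empty if none)."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"cannot stat {path}: {exc}"]

    problems = []
    if not stat.S_ISCHR(st.st_mode):
        problems.append(f"{path} is not a character device")

    mode = stat.S_IMODE(st.st_mode)
    if mode != expected_mode:
        # Assumption: world read/write (0666) is the mode we expect.
        problems.append(
            f"{path} has mode {mode:04o}, expected {expected_mode:04o}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_null_device():
        print("WARNING:", problem)
```

Running something like this from cron, or at the start of any connectivity investigation, is cheap and rules out one more thing outside the database.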
So the important lesson learned: for connection issues, do not look only at the DB, but also beyond it.
