Wednesday, January 16, 2013

OMS hung with warnings <BEA-000449> <Closing socket as no data read from it during the configured idle timeout of 5 secs>

Recently our production OMS server hung and was not accessible. Since it was showing OMS as down, we tried to bounce it and bring it back up, but the OMS did not come up; only the WebTier did. We tried twice, and both times it failed miserably.

Error During Startup - Error during start oms. Please check error and log files

We checked emctl.trc and the log files to see what was going on, and found the following.
-- Here it shows a connection reset. But why reset a connection that was already established?? Something was not right, so we needed to look at the other logs.

2013-01-15 01:22:06,139 [EMUI_01_22_02_/console/database/instance/sitemap] ERROR svlt.PageHandler handleRequest.640 - javax.servlet.ServletException: javax.servlet.jsp.JspException: Connection reset
javax.servlet.ServletException: javax.servlet.jsp.JspException: Connection reset
        at weblogic.servlet.jsp.PageContextImpl.handlePageException(
        at jsp_servlet._database._instance._sitemap.__sitemap._jspService(

-- emctl.trc shows the following; still no clue what was happening!!
java.sql.SQLException: ORA-01403: no data found
ORA-06512: at "SYSMAN.EM_TARGET", line 3503
ORA-06512: at line 1

-- WebLogic logged the following, this time with an error code. At last, something to dig into to see what the real issue was.
WARN  jdbc.ConnectionCache _getConnection.354 - Got a fatal exeption when getting a connection; Error code = 17002; Cleaning up cache and retrying
<Jan 16, 2013 1:18:02 AM CST> <Warning> <Socket> <BEA-000449> <Closing socket as no data read from it during the configured idle timeout of 5 secs>

####<Jan 16, 2013 12:57:52 AM CST> <Error> <WebLogicServer> <EMGC_OMS1> <[STANDBY] ExecuteThread: '72' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1358319472827> <BEA-000337> <[STUCK] ExecuteThread: '32' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "632" seconds working on the request "weblogic.servlet.internal.ServletRequestImpl@385fca37[
which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:

-- However, the error that really gave the game away was this one, from the EMGC_OMS log file:
Exception in thread "OMSHeartbeatThread" java.lang.OutOfMemoryError: Java heap space

So now I knew the story behind the connection resets and the cache clean-ups. Hmm. The OMS was running low on memory. Meanwhile, we had made matters worse by trying to restart the OMS twice and then stopping it. How? Well, when we tried to stop the OMS after the failed restart, the stop action also errored out. Interestingly, that left the Java processes intact, alive and kicking!!
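If you suspect the same heap exhaustion, a quick way to confirm it is to count OutOfMemoryError entries in the managed server's log. Here is a small hedged sketch; `count_oom` is a helper name I made up, and the demo runs against a synthetic log excerpt rather than a real EMGC_OMS1 log path:

```shell
# Hypothetical helper: count heap-exhaustion events in a WebLogic server log.
count_oom() {
  grep -c 'java.lang.OutOfMemoryError' "$1"
}

# Demo on a synthetic log excerpt so this can be exercised anywhere;
# in practice you would point count_oom at your EMGC_OMS1 log file.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
<Jan 16, 2013 1:18:02 AM CST> <Warning> <Socket> <BEA-000449> <Closing socket ...>
Exception in thread "OMSHeartbeatThread" java.lang.OutOfMemoryError: Java heap space
EOF
count_oom "$demo_log"    # prints 1
rm -f "$demo_log"
```

A non-zero count at around the time of the hang is a strong hint that the JVM heap, not the database or the network, is the real culprit.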

Fix - To fix the issue, we first had to clean up the mess, and rather crudely at that. So we did the following.

-- Identify the running Java processes and kill them:
# ps -ef | grep java
# kill -9 <pid>

-- Identify any running EMGC processes and kill them:
# ps -ef | grep -v grep | grep EMGC
# kill -9 <pid>
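The two-step grep-then-kill above can be sketched as a single pass. `pids_from_ps` is a hypothetical helper I'm introducing for illustration; it filters out the grep process itself and extracts the PID column. Review the matched lines carefully before ever feeding them to kill -9:

```shell
# Hypothetical helper: pull candidate PIDs out of `ps -ef`-style output.
# $1 = ps output text, $2 = pattern to match (e.g. java or EMGC)
pids_from_ps() {
  echo "$1" | grep "$2" | grep -v grep | awk '{print $2}'
}

# Demo on canned ps output (second line is the grep process, which is excluded):
sample='oracle  4242     1  0 Jan15 ?  01:02:03 /u01/jdk/bin/java weblogic.Server
oracle  4300  4100  0 01:20 pts/0 00:00:00 grep java'
pids_from_ps "$sample" java    # prints 4242
```

On a live host you would use `pids_from_ps "$(ps -ef)" EMGC`, but only after confirming nothing unrelated matches the pattern.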

Now when we tried to start the OMS, it came up nice and clean. However, we had only fixed the issue temporarily. The root cause was lack of memory: the Grid Control 11g OMS for Linux 64-bit ships with a default maximum OMS JVM heap size of 512MB. This value can be a little low for larger Grid Control deployments and should be raised to 1024MB.
So, the following is the way to do it...

# cd /u01/app/Oracle/GC11g/gc_inst/user_projects/domains/GCDomain/bin

-- Take a backup of the file before making changes:
# cp

-- Add the following lines to the file before the invocation of the command:

if [ "${SERVER_NAME}" != "EMGC_ADMINSERVER" ] ; then
  USER_MEM_ARGS="-Xms256m -Xmx1024m -XX:CompileThreshold=8000 -XX:PermSize=128m -XX:MaxPermSize=512m"
  export USER_MEM_ARGS
fi

Save and exit. Start the OMS once again and check that it is running with the newly modified parameters!! From now on, your OMS should be much less likely to encounter this issue.
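To check that the restarted JVM actually picked up the new heap flag, you can inspect the OMS process's command line (from `ps -ef`) for the expected -Xmx value. `heap_ok` below is a hypothetical helper, shown against a canned command line:

```shell
# Hypothetical helper: report whether a java command line carries the
# expected -Xmx setting ($2, e.g. 1024m).
heap_ok() {
  case "$1" in
    *"-Xmx$2"*) echo "heap OK ($2)" ;;
    *)          echo "heap NOT set to $2" ;;
  esac
}

# Demo on a canned command line; on a live host, feed in the java
# command line of the EMGC_OMS1 process from ps output instead.
heap_ok "java -Xms256m -Xmx1024m -XX:MaxPermSize=512m weblogic.Server" 1024m
# prints: heap OK (1024m)
```

If this still reports the old 512MB value after a restart, the edited startup script is probably not the one WebLogic is invoking for that managed server.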