Thursday, June 12, 2014

Client TNS Connection Time Out 


Error Descripton -

TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0

##Following is the brief list of clients that got affected. This is not conclusive list, only indicative.

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=151.149.159.163)(PORT=3687))
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=49492))
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=151.149.159.90)(PORT=2338))

SQLNET.ORA Parameters -

SQLNET.EXPIRE_TIME = 10
# following parameter is not defined but has default value of 60 seconds
SQLNET.INBOUND_CONNECT_TIMEOUT =60

Diagnosis -
Diagnosis is done based on hosts/clients facing issue of connection time out. I have broken the analysis client wise for easier understanding of the problem.

HOST=157.241.198.49

# following line identifies the batch server as user and time at which connection establishment request is accepted by listener.
Listener directive -
02-MAY-2014 06:01:44 * (CONNECT_DATA=(SID=xxx)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=aimsbatc))) * (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=31563)) * establish * xxx* 0

# Following line describes the error recorded for the same client, same port on DB. Pls see that establish and time out time has 60 seconds gap.
DB Alert -
2014-05-02 06:02:44.552000 -04:00
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=31563))

---------------------------------------------------------------------------
HOST=157.241.198.49
Listener directive -
03-MAY-2014 01:31:42 * (CONNECT_DATA=(SID=P2PAIMS)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=aimsbatc))) * (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=44309)) * establish * P2PAIMS * 0

DB Alert -
Fatal NI connect error 12170.
  Time: 03-MAY-2014 01:32:42
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=44309))
WARNING: inbound connection timed out (ORA-3136)

## SIMILAR ERROR FOR SAME HOST AND PORT, ARE ALSO RECORDED ON  03 AND 05 MAY FOR DIFFERENT USERS.
Listener directive -

03-MAY-2014 05:30:54 * (CONNECT_DATA=(SID=P2PAIMS)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=aimsbatc))) * (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=44309)) * establish * P2PAIMS * 0

05-MAY-2014 15:05:06 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=P2PAIMS)(CID=(PROGRAM=sqlplus)(HOST=tryprmaim01.intra.searshc.com)(USER=aimsuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=44309)) * establish * P2PAIMS * 0

---------------------------------------------------------------------------------------------------------
HOST=157.241.198.49
Listener directive -
06-MAY-2014 00:32:14 * (CONNECT_DATA=(SID=xxx)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=aimsbatc))) * (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=25360)) * establish * xxx* 0

DB Alert -
Time: 06-MAY-2014 00:33:14
 
TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=157.241.198.49)(PORT=25360))

----------------------------------------------------------------------------
HOST=151.149.159.90

# For following user the connection terminated after 38 minutes, whcih could be due to dead connection detection.
Listener directive -
06-MAY-2014 14:51:00 * (CONNECT_DATA=(SID=xxx)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=rgupta4))) * (ADDRESS=(PROTOCOL=tcp)(HOST=151.149.159.90)(PORT=2122)) * establish * xxx* 0

DB Alert -
2014-05-06 15:28:02.831000 -04:00

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=151.149.159.90)(PORT=2122))

Interpretation -

As defined earlier the listener receives the request to create connection. Here for the authentication purpose the client has to complete the handshake in 60 seconds. If the client does not able to respond back in 60 seconds due to load on app server or congested network or DB could not process the connection due heavy load on the server AND if elapsed time > 60 seconds, the TIME OUT will occur.

Solution/Workaround -

It is often necessary to increase the values for INBOUND CONNECT TIMEOUT at  both the listener and the database in order to resolve this issue. It is usually advisable to set the database (sqlnet.ora) value slightly higher than the listener (listener.ora).  The authentication process is more demanding for the database than the listener.

1. To set these parameters to use values higher than the default of 60 seconds in RDBMS_HOME/network/admin/sqlnet.ora, follow these instructions and restart the listener.  There is no need to restart Oracle:

                SQLNET.INBOUND_CONNECT_TIMEOUT = 120

2. To set the INBOUND CONNECT TIMEOUT on listener, first check the existing value.

LSNRCTL> show inbound_connect_timeout
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1531)))
LISTENER parameter "inbound_connect_timeout" set to 60
The command completed successfully

# Now set the value to 110 seconds
LSNRCTL>set  INBOUND_CONNECT_TIMEOUT_LISTENER=110

Other Areas -

If the application server is using Connection Pool, then following parameters worth to look at. These are standard features offered by various application servers and not the platform/Technology specific.
If you run into a timeout it can mean that your definition of the timeouts in the data-source are not sufficient. The following timeout parameters can be set:

abandoned-connection-timeout
Sets the maximum time that a connection can remain unused before the connection is closed and returned to the cache. A connection is considered unused if it has not had SQL database activity. When AbandonedConnectionTimeout is set, JDBC monitors SQL database activity on each logical connection. For example, when stmt.execute() is invoked on the connection, a heartbeat is registered to convey that this connection is active. The heartbeats are set at each database execution. If a connection has been inactive for the specified amount of time, the underlying connection is reclaimed and returned to the cache for reuse.

time-to-live-timeout
The maximum time, in seconds, a used connection may be active. When this timeout expires, the used connection is unconditionally closed, the relevant statement handles are canceled, and the connection is returned to the connection pool.

Inactivity timeout
Maximum time for which an unused connection can be kept in the pool.

However, there was one caveat here. Even after changing time out value, the connection was still timing out in 60 seconds only. Upon close examination we found out that this home was upgraded from 11.2.0.3 to 11.2.0.4 . 

Normally the SQLNET.ORA file contained in TNS_ADMIN_HOME affects the time out values on DB. In this case the value was modified in 11.2.0.4 home but TNS_ADMIN_HOME  was pointing to 11.2.0.3 home hence the new time out value was not reflecting. So there are 2 ways to fix it.
1. Either update TNS_ADMIN value to 11.2.0.4 home
2. Modify the 11.2.0.3 home SQLNET.ORA file. 

Once we fixed it the issue got resolved. 

Thursday, May 8, 2014

Standby's For Primary Site Backups ??

In this post I will show how one can leverage the standby site for off-loading backup jobs from primary. Thus freeing up primary for other critical jobs and also improving the performance of the primary site.

On Primary Site - 

[oracle@rac1e ~]$ rman target /
Recovery Manager: Release 11.2.0.3.0 - Production on Mon Feb 24 07:43:40 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
connected to target database: DGTEST (DBID=3796352795)

RMAN> report schema;
using target database control file instead of recovery catalog
Report of database schema for database with db_unique_name TROY
List of Permanent Datafiles
===========================
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    710      SYSTEM               ***     +REPORTING_DATA/troy/datafile/system.262.839828803
2    720      SYSAUX               ***     +REPORTING_DATA/troy/datafile/sysaux.276.839828805
3    35       UNDOTBS1             ***     +REPORTING_DATA/troy/datafile/undotbs1.277.839828809
4    5        USERS                ***     +REPORTING_DATA/troy/datafile/users.278.839828815
5    25       UNDOTBS2             ***     +REPORTING_DATA/troy/datafile/undotbs2.283.839829369

List of Temporary Files
=======================
File Size(MB) Tablespace           Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1    25       TEMP                 32767       +REPORTING_DATA/troy/tempfile/temp.282.839829279


On Standby Site - 

[oracle@stby2 HS]$ rman target /
connected to target database: DGTEST (DBID=3796352795, not open)

RMAN> report schema;
using target database control file instead of recovery catalog
RMAN-06139: WARNING: control file is not current for REPORT SCHEMA
Report of database schema for database with db_unique_name HOFFMAN

List of Permanent Datafiles
===========================
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    710      SYSTEM               ***     +DATA/hoffman/datafile/system.293.839922187
2    720      SYSAUX               ***     +DATA/hoffman/datafile/sysaux.290.839922187
3    35       UNDOTBS1             ***     +DATA/hoffman/datafile/undotbs1.288.839922187
4    5        USERS                ***     +DATA/hoffman/datafile/users.286.839922187
5    25       UNDOTBS2             ***     +DATA/hoffman/datafile/undotbs2.287.839922187

List of Temporary Files
=======================
File Size(MB) Tablespace           Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1    25       TEMP                 32767       +DATA/hoffman/tempfile/temp.279.840090943


On standby - 

RMAN> backup as copy datafile 1 format '/home/oracle/HS/%U.rman';

RMAN> list backup ;

List of Backup Sets
===================

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
6       Full    80.00K     DISK        00:00:00     24-FEB-14
        BP Key: 6   Status: AVAILABLE  Compressed: NO  Tag: TAG20140224T065328
        Piece Name: /home/oracle/HS/0ap1dfgc_1_1.rman
  SPFILE Included: Modification time: 24-FEB-14
  SPFILE db_unique_name: HOFFMAN


On primary - 
delete the system datafile 

ASMCMD [+reporting_data/troy/datafile] > cp SYSTEM.262.839828803 SYSTEM.262.839828803.bak
copying +reporting_data/troy/datafile/SYSTEM.262.839828803 -> +reporting_data/troy/datafile/SYSTEM.262.839828803.bak

ASMCMD [+reporting_data/troy/datafile] > rm SYSTEM.262.839828803

#while starting up hit the following error - 
SQL> alter database open ;
alter database open
*
ERROR at line 1:
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: '+REPORTING_DATA/troy/datafile/system.262.839828803'


#NOW COPY THE BACKUP FILE FROM STANDBY TO PRIMARY .....
#SINCE THE PRIMARY IS NOT AWARE OF THIS BACKUP ONE NEED TO CATALOG IT  AS FOLLOWS

On Primary - 

RMAN> catalog start with '/home/oracle/HS/';
searching for all files that match the pattern /home/oracle/HS/

List of Files Unknown to the Database
=====================================
File Name: /home/oracle/HS/0ap1dfgc_1_1.rman

Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /home/oracle/HS/0ap1dfgc_1_1.rman

RMAN> restore datafile 1;

Starting restore at 24-FEB-14
using channel ORA_DISK_1

channel ORA_DISK_1: restoring datafile 00001
input datafile copy RECID=1 STAMP=840356556 file name=+REPORTING_DATA/troy/datafile/system.262.840356549
destination for restore of datafile 00001: +REPORTING_DATA/troy/datafile/system.262.839828803
channel ORA_DISK_1: copied datafile copy of datafile 00001
output file name=+REPORTING_DATA/troy/datafile/system.288.840356741
RECID=0 STAMP=0
Finished restore at 24-FEB-14

# So recovery finishes successfully. Now one needs to recover the datafile using the archive logs already available on primary site. 

RMAN> recover datafile 1 ;
Starting recover at 24-FEB-14
using channel ORA_DISK_1
starting media recovery
media recovery complete, elapsed time: 00:00:01
Finished recover at 24-FEB-14

RMAN> alter database open ;

SQL> select database_role, open_mode, SWITCHOVER_STATUS from gv$database;
DATABASE_ROLE    OPEN_MODE            SWITCHOVER_STATUS
---------------- -------------------- --------------------
PRIMARY          READ WRITE           TO STANDBY

SQL> select current_scn from gv$database;
CURRENT_SCN
-----------
    1682907

CHECK SCN ON STANDBY - 
  
SQL> select current_scn from gv$database /
CURRENT_SCN
-----------
    1682437

SCENARIO II - 

In this scenario, one can restore the whole database backup taken on standby site on primary site. 

ON STANDBY - 
# Take full database backup on standby site as follows

RMAN>  backup database tag 'full_standby_copy' format '/home/oracle/HS/%U.rman';

Starting backup at 24-FEB-14
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=+DATA/hoffman/datafile/sysaux.290.839922187
input datafile file number=00001 name=+DATA/hoffman/datafile/system.293.839922187
input datafile file number=00003 name=+DATA/hoffman/datafile/undotbs1.288.839922187
input datafile file number=00005 name=+DATA/hoffman/datafile/undotbs2.287.839922187
input datafile file number=00004 name=+DATA/hoffman/datafile/users.286.839922187
channel ORA_DISK_1: starting piece 1 at 24-FEB-14
channel ORA_DISK_1: finished piece 1 at 24-FEB-14
piece handle=/home/oracle/HS/0fp1diba_1_1.rman tag=FULL_STANDBY_COPY comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current control file in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 24-FEB-14
channel ORA_DISK_1: finished piece 1 at 24-FEB-14
piece handle=/home/oracle/HS/0gp1dibp_1_1.rman tag=FULL_STANDBY_COPY comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 24-FEB-14

#  and then catalog the backup files on primary.

On Primary - 
# Copy the backup on primary server. Once the backup is copied to the primary site, one needs to catalog it with primary control file as primary control file may not have this backup information.

RMAN> catalog start with '/home/oracle/HS';
using target database control file instead of recovery catalog
searching for all files that match the pattern /home/oracle/HS

List of Files Unknown to the Database
=====================================
File Name: /home/oracle/HS/0fp1diba_1_1.rman
File Name: /home/oracle/HS/0gp1dibp_1_1.rman
File Name: /home/oracle/HS/0ep1dial_1_1.rman

Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /home/oracle/HS/0fp1diba_1_1.rman
File Name: /home/oracle/HS/0gp1dibp_1_1.rman
File Name: /home/oracle/HS/0ep1dial_1_1.rman

# NOW SIMULATE THE CRASH BY DELETING THE DATAFILES. REMEMBER WE ARE NOT TAKING CONTROLFILE BACKUP FROM STANDBY AS ITS STANDBY CONTROLFILE. 
# IN THE EVENT WHERE CONTROLFILE IS ALSO LOST, ONE HAS TO RESTORE CONTROLFILE FROM THE BACKUP OF PRIMARY.
# HERE I'M NOT RESTORING CONTROLFILE AS WE ARE SIMULATING THE CRASH OF ONLY DATAFILES. 

ASMCMD [+reporting_data/troy/datafile] > ls
SYSAUX.276.839828805
SYSTEM.262.839828803.bak
SYSTEM.262.840356549
SYSTEM.288.840356741
UNDOTBS1.277.839828809
UNDOTBS2.283.839829369
USERS.278.839828815
ASMCMD [+reporting_data/troy/datafile] > rm *
You may delete multiple files and/or directories.
Are you sure? (y/n) y

 # Once done, connect to RMAN on primary database and restore the database with newly catalogged backup sets.

RMAN> restore database ;

Starting restore at 24-FEB-14
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=234 instance=DGTEST1 device type=DISK

channel ORA_DISK_1: restoring datafile 00001
input datafile copy RECID=1 STAMP=840356556 file name=+REPORTING_DATA/troy/datafile/system.262.840356549
destination for restore of datafile 00001: +REPORTING_DATA/troy/datafile/system.288.840356741
ORA-19505: failed to identify file "+REPORTING_DATA/troy/datafile/system.262.840356549"
ORA-17503: ksfdopn:2 Failed to open file +REPORTING_DATA/troy/datafile/system.262.840356549
ORA-15012: ASM file '+REPORTING_DATA/troy/datafile/system.262.840356549' does not exist
ORA-19600: input file is datafile-copy 1 (+REPORTING_DATA/troy/datafile/system.262.840356549)
ORA-19601: output file is datafile 1 (+REPORTING_DATA
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00002 to +REPORTING_DATA/troy/datafile/sysaux.276.839828805
channel ORA_DISK_1: restoring datafile 00003 to +REPORTING_DATA/troy/datafile/undotbs1.277.839828809
channel ORA_DISK_1: restoring datafile 00004 to +REPORTING_DATA/troy/datafile/users.278.839828815
channel ORA_DISK_1: restoring datafile 00005 to +REPORTING_DATA/troy/datafile/undotbs2.283.839829369
channel ORA_DISK_1: reading from backup piece /home/oracle/HS/0fp1diba_1_1.rman
channel ORA_DISK_1: piece handle=/home/oracle/HS/0fp1diba_1_1.rman tag=FULL_STANDBY_COPY
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
failover to previous backup

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to +REPORTING_DATA/troy/datafile/system.288.840356741
channel ORA_DISK_1: reading from backup piece /home/oracle/HS/0fp1diba_1_1.rman
channel ORA_DISK_1: piece handle=/home/oracle/HS/0fp1diba_1_1.rman tag=FULL_STANDBY_COPY
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:03
Finished restore at 24-FEB-14

RMAN> recover database ;
Starting recover at 24-FEB-14
using channel ORA_DISK_1

starting media recovery
archived log for thread 1 with sequence 41 is already on disk as file +APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_41.12295.840359309
archived log for thread 1 with sequence 42 is already on disk as file +APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_42.10005.840359313
archived log for thread 1 with sequence 43 is already on disk as file +APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_43.12416.840359315
archived log for thread 1 with sequence 44 is already on disk as file +APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_44.10048.840359317
archived log for thread 1 with sequence 45 is already on disk as file +APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_45.14052.840359317
archived log for thread 1 with sequence 46 is already on disk as file +APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_46.10027.840359317
archived log for thread 1 with sequence 47 is already on disk as file +APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_47.9986.840359319
archived log file name=+APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_41.12295.840359309 thread=1 sequence=41
archived log file name=+APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_42.10005.840359313 thread=1 sequence=42
archived log file name=+APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_43.12416.840359315 thread=1 sequence=43
archived log file name=+APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_44.10048.840359317 thread=1 sequence=44
archived log file name=+APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_45.14052.840359317 thread=1 sequence=45
archived log file name=+APP_DATA/troy/archivelog/2014_02_24/thread_1_seq_46.10027.840359317 thread=1 sequence=46
media recovery complete, elapsed time: 00:00:02
Finished recover at 24-FEB-14

## NOTE THAT THE DATAFILES PATH ON STANDBY SITE IS TO +DATA DISK GROUP under +DATA/hoffman DIRECTORY. 

Also, when restoring on primary the diskgroup name is +APP_DATA for archivelogs and +REPORTING_DATA for datafiles. RMAN automatically discovers the same (from primary controlfile) and restore them to original location regardless of different path on standby.

RMAN> alter database open;
database opened

SQL> select database_role, open_mode, SWITCHOVER_STATUS from gv$database;
DATABASE_ROLE    OPEN_MODE            SWITCHOVER_STATUS
---------------- -------------------- --------------------

PRIMARY          READ WRITE           TO STANDBY

Hope this will help you to leverage the data guard set up more efficiently by freeing up your primary for more meaningful work. 

Monday, May 5, 2014

Restore Database To New Host & New ASM DiskGroup 

Few weeks back, we need to perform a mock test of disaster i.e. Server Failure and check the response times of our recovery process. Since we are using 11g with Grid infrastructure we need to consider the dependency of ASM disk group while restoring the database. 

So to start with following is the snipped we used to take backup on source database.

run {
sql 'alter system archive log current';
backup tag level0_database format '/gold_backups/omsupg_bkP/%d_DAT_%U_%t' database plus archivelog;
backup current controlfile tag CURR_CONTROLFILE format '/gold_backups/omsupg_bkP/%d_CONTROL_%U_%t';
sql 'alter system archive log current';
}

On the target server the 11g binaries are already installed and patched as per source binaries. 


#export ORACLE_HOME=/u01/app/oracle/product/11.2.3/dbhome_1

We created a temporary parameter file to cut the task.

Transient Parameter file -



*.audit_file_dest='/u01/app/oracle/admin/OMSUPG/adump'

*.audit_trail='db'
*.compatible='11.2.0.0.0'

*.db_block_size=8192
*.control_files='+DATA_PSFT_HR/OMSUPG/CONTROLFILE/current.3884.820500121'

-- THE ABOVE LOCATION MAY NOT BE KNOWN AS START OF THE DATABASE AND MAY POINT TO NAME ON OLD SERVER. ONE NEEDS TO CHANGE THIS POST CONTROL FILE RESTORE.*.db_create_file_dest='+DATA_PSFT_HR'
*.db_create_online_log_dest_1='+DATA_PSFT_HR'
*.db_file_name_convert='+DATA_PSOFTHR','+DATA_PSFT_HR'
*.log_file_name_convert='+DATA_PSOFTHR','+DATA_PSFT_HR'
*.db_name='OMSUPG'
*.diagnostic_dest='/u01/app/oracle'
*.instance_number=1
OMSUPG.job_queue_processes=1000
*.log_archive_format='%t_%s_%r.dbf'
*.db_domain=''
*.memory_target=4G
*.open_cursors=300
*.processes=500
*.remote_login_passwordfile='EXCLUSIVE'
*.session_cached_cursors=300
*.undo_tablespace='UNDOTBS1'

SQL> create spfile from pfile;
File created.

SQL> startup nomount
ORACLE instance started.


Total System Global Area 4.0486E+10 bytes
Fixed Size                  2237088 bytes
Variable Size            1.9327E+10 bytes
Database Buffers         2.1072E+10 bytes
Redo Buffers               84631552 bytes
-- Now connect to recently started database thru RAN and perform as follows.

RMAN> set DBID=3245417037
executing command: SET DBID
RMAN> restore controlfile  to '+DATA_PSFT_HR' from '/oraclebackup/omsupg_bkP/snapcf_OMSUPG.f' ;

Starting restore at 11-JUL-13
using channel ORA_DISK_1

channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
Finished restore at 11-JUL-13


Note - 
ONCE CONTROL FILE IS RESTORED YOU NEED TO MODIFY PFILE TO POINT TO ABSOLUTE PATH OR OTHERWISE DATABASE WILL NOT BE ABLE TO FIND CORRECT CONTROL FILE

*.control_files='+DATA_PSFT_HR/OMSUPG/CONTROLFILE/current.3884.820500121'

THEN START THE DB IN MOUNT MODE

[oracle@hofsvdorarac2e ~]$ rman target / nocatalog
Recovery Manager: Release 11.2.0.3.0 - Production on Thu Jul 11 13:20:13 2013

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: OMSUPG (DBID=3245417037, not open)
using target database control file instead of recovery catalog

-- If your control file is backed up after backup is taken then control file will have backup info OR otherwise one has catalog the backup pieces with control file. 


RMAN> CATALOG START WITH '/oraclebackup/omsupg_bkP/';

searching for all files that match the pattern /oraclebackup/omsupg_bkP/

List of Files Unknown to the Database
=====================================
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1joefcg0_1_1_820490752
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1moefcj8_1_1_820490856
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1loefcj5_1_1_820490853
File Name: /oraclebackup/omsupg_bkP/OMSUPG_CONTROL_1noefcj9_1_1_820490857
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1koefcgq_1_1_820490778
File Name: /oraclebackup/omsupg_bkP/snapcf_OMSUPG.f

Do you really want to catalog the above files (enter YES or NO)? YES
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1joefcg0_1_1_820490752
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1moefcj8_1_1_820490856
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1loefcj5_1_1_820490853
File Name: /oraclebackup/omsupg_bkP/OMSUPG_CONTROL_1noefcj9_1_1_820490857
File Name: /oraclebackup/omsupg_bkP/OMSUPG_DAT_1koefcgq_1_1_820490778
File Name: /oraclebackup/omsupg_bkP/snapcf_OMSUPG.f

-- Since I am restoring the backup on new server which as has different disk group name, I need to rename all my data files to new disk group name

+DATA_PSOFTHR  --> +DATA_PSFT_HR

RMAN> RUN{
SET NEWNAME FOR DATAFILE 1 TO '+DATA_PSFT_HR';
SET NEWNAME FOR DATAFILE 2 TO '+DATA_PSFT_HR';
SET NEWNAME FOR DATAFILE 3 TO '+DATA_PSFT_HR';
SET NEWNAME FOR DATAFILE 4 TO '+DATA_PSFT_HR';
SET NEWNAME FOR DATAFILE 5 TO '+DATA_PSFT_HR';
SET NEWNAME FOR DATAFILE 6 TO '+DATA_PSFT_HR';
SET NEWNAME FOR DATAFILE 7 TO '+DATA_PSFT_HR';
SQL "ALTER DATABASE RENAME FILE ''+DATA_PSOFTHR/omsupg/redo01.log'' TO ''+DATA_PSFT_HR'' ";
SQL "ALTER DATABASE RENAME FILE ''+DATA_PSOFTHR/omsupg/redo02.log'' TO ''+DATA_PSFT_HR'' ";
SQL "ALTER DATABASE RENAME FILE ''+DATA_PSOFTHR/omsupg/redo03.log'' TO ''+DATA_PSFT_HR'' ";

RESTORE DATABASE;
SWITCH DATAFILE ALL;
RECOVER DATABASE;
}


executing command: SET NEWNAME
executing command: SET NEWNAME
executing command: SET NEWNAME
executing command: SET NEWNAME
executing command: SET NEWNAME
executing command: SET NEWNAME
executing command: SET NEWNAME
sql statement: ALTER DATABASE RENAME FILE ''+DATA_PSOFTHR/omsupg/redo01.log''    TO ''+DATA_PSFT_HR''
sql statement: ALTER DATABASE RENAME FILE ''+DATA_PSOFTHR/omsupg/redo02.log''    TO ''+DATA_PSFT_HR''
sql statement: ALTER DATABASE RENAME FILE ''+DATA_PSOFTHR/omsupg/redo03.log''    TO ''+DATA_PSFT_HR''

Starting restore at 11-JUL-13
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=456 device type=DISK

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to +DATA_PSFT_HR
channel ORA_DISK_1: restoring datafile 00002 to +DATA_PSFT_HR
channel ORA_DISK_1: restoring datafile 00003 to +DATA_PSFT_HR
channel ORA_DISK_1: restoring datafile 00004 to +DATA_PSFT_HR
channel ORA_DISK_1: restoring datafile 00005 to +DATA_PSFT_HR
channel ORA_DISK_1: restoring datafile 00006 to +DATA_PSFT_HR
channel ORA_DISK_1: restoring datafile 00007 to +DATA_PSFT_HR

channel ORA_DISK_1: reading from backup piece /gold_backups/omsupg_bkP/OMSUPG_DAT_1koefcgq_1_1_820490778
channel ORA_DISK_1: errors found reading piece handle=/gold_backups/omsupg_bkP/OMSUPG_DAT_1koefcgq_1_1_820490778
channel ORA_DISK_1: failover to piece handle=/oraclebackup/omsupg_bkP/OMSUPG_DAT_1koefcgq_1_1_820490778 tag=LEVEL0_DATABASE
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:55
Finished restore at 11-JUL-13

datafile 1 switched to datafile copy
input datafile copy RECID=9 STAMP=820503843 file name=+DATA_PSFT_HR/omsupg/datafile/system.3899.820503797
datafile 2 switched to datafile copy
input datafile copy RECID=10 STAMP=820503843 file name=+DATA_PSFT_HR/omsupg/datafile/sysaux.3902.820503797
datafile 3 switched to datafile copy
input datafile copy RECID=11 STAMP=820503843 file name=+DATA_PSFT_HR/omsupg/datafile/undotbs1.3870.820503801
datafile 4 switched to datafile copy
input datafile copy RECID=12 STAMP=820503843 file name=+DATA_PSFT_HR/omsupg/datafile/users.3900.820503801
datafile 5 switched to datafile copy
input datafile copy RECID=13 STAMP=820503843 file name=+DATA_PSFT_HR/omsupg/datafile/mgmt_ecm_depot_ts.3897.820503801
datafile 6 switched to datafile copy
input datafile copy RECID=14 STAMP=820503844 file name=+DATA_PSFT_HR/omsupg/datafile/mgmt_tablespace.3901.820503789
datafile 7 switched to datafile copy
input datafile copy RECID=15 STAMP=820503844 file name=+DATA_PSFT_HR/omsupg/datafile/mgmt_ad4j_ts.3898.820503801

Starting recover at 11-JUL-13
using channel ORA_DISK_1

starting media recovery

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 07/11/2013 13:44:06
RMAN-06053: unable to perform media recovery because of missing log
RMAN-06025: no backup of archived log for thread 1 with sequence 203 and starting SCN of 2855435 found to restore


RMAN> alter database open resetlogs;
database opened

-- if you see the above recovery failed during archive log recovery. The reason being is that the backup set does not contain any archive log files beyond that point. So when you log in to database, you will see that database is open in read write mode. 


Since I am restoring this database with Oracle Grid, I need to add the database to OCR and start it using SRVCTL utility. However, during start of database I hit following error. 

ERROR: failed to establish dependency between database OMSUPG and diskgroup resource ora.DATA_PSFT_HR.dg
Thu Jul 11 12:42:10 2013
SUCCESS: diskgroup DATA_PSFT_HR was dismounted

alter database mount
ORA-00210: cannot open the specified control file
ORA-00202: control file: '+DATA_PSFT_HR'
ORA-17503: ksfdopn:2 Failed to open file +DATA_PSFT_HR
ORA-15045: ASM file name '+DATA_PSFT_HR' is not in reference form
ORA-205 signalled during: alter database mount...

Thu Jul 11 12:42:16 2013

Solution to above error is as follows as one needs to change the hard dependency using srvctl command.


# srvctl add database -d OMSUPG -o /u01/app/oracle/product/11.2.3/dbhome_1 -c SINGLE  -i OMSUPG -x host1e

# srvctl modify database -d OMSUPG -a "DATA_PSFT_HR"


Once done, you will now be able to start the start the database with SRVCTL successfully. 

Monday, April 21, 2014


ACS, Histogram & Oracle Bug

with 11g, Oracle has enhanced the ACS feature. Ideally when you have multiple bind values for same SQL statement, Optimizer may generate different plan for different bind values. These plan are generated based on feature called as Bind Aware statements. So if Optimizer detects that different bind values generates the different and very varied foot prints of IO, then one execution plan will not suffice. So  it will create different execution plan and will mark the SQL as bind aware. 

So effectively you are sharing the Cursor but not the plan and that is expected behavior of 11g Optimizer. However, during one performance issue troubleshooting, we observed that this is not the case. This all started with one application having severe performance issue and one of the DBA executed the Gather Stats jobs and things went back to normal. 
After few hours things again started to went south as normal process of request (from frontend), that usually takes couple of minutes didn't even completed in 15 minutes. 
Again, someone executed Gather Stats job and things came back in line, but only for another few hours. Pattern has now started to emerge... and things were not looking good.

So I decided to take a look at DB health. Also I checked if there was any change went in recently but there were none. My strategy was simple, start with comparing good and bad times. So I ran AWRDIFF report with good and bad run-times (as received by application team).

From the report what I saw that there was severe CPU Scheduling problem on Database. The processes were doing nothing but waiting on CPU till someone either kill them or run gather stats. 


Top 5 Timed Events







select child_number, sql_id,hash_value, plan_hash_value, executions, buffer_gets, is_bind_sensitive, is_bind_aware, is_shareable from v$sql
where sql_text like 'SELECT xxxxxxx%';

-- excel output






So if you see here there are lots of SQLs which are shared and have different children based on the bind values provided. However if you see the attached table there are multiple child numbers with same SQL hash value and same plan hash value i.e. diff values of bind variables creates the new child cursor but doesn’t generate the new plan always i.e. its bind sensitive but not bind aware. Hence its using the same PLAN for different cursor which is not the ideal way to do it. Ideally it should create the new plan for each new child cursor. This usually requires the HISTOGRAMS. when I go and try to find the histograms I see the relevant histograms on the table. 
Also if you notice there are lots of hard parses going on. This is pretty evident by the number of child cursor created for each statement and there are few hundreds statement like that. 

So that brings to the most important question that, why the histograms are not used when they are in place ??
Looks like we hit some Oracle Bug ? its time to verify and so raised TAR with Oracle and they came up with following justification (after waiting of 5 days and Escalation of SR, which was then attended by Sr Manager from European Region)

SR Update - 

---------------------------------------------------------------------------------------------
https://support.oracle.com/epmos/adf/images/t.gif
the reason for not sharing seems to be caused by bug 10351178. See Note 10351178.8 for details: Bug 10351178 - High version counts caused by auto tuned PGA memory.

The bug is fixed in 11.2.0.4. The workaround is to disable auto tuned PGA. You can do this by setting

PGA_AGGREGATE_TARGET and SGA_TARGET explicitly and setting MEMORY_TARGET = 0. See Note 443746.1 (Automatic Memory Management (AMM) on 11g) in section "2. Automatic Shared Memory Management – For the SGA" on how to do that.
------------------------------------------------------------------------------------------- 


On a side note, I created an additional index on the table, which was then used by query which stabilized the performance a lot. So at the moment we are not modifying any parameter and will wait for some time as our upgrade cycle is approaching. 

Hope this will help you to get some insight in internal working of ACS and histograms.