Thursday, August 29, 2013

 Instance Caging

Recently on one of our production cluster running 8 nodes and having shared environment with multiple applications, ran into high CPU usage. Since this is shared environment, running few databases, its always difficult to pinpoint which is culprit. Hence after more than couple of incidents we were able to point out which one is actually causing mayhem. Also the pattern of problem is not predictable as application is used across globe. But since hang was not momentarily it really affects all other candidates as well. This is serious problem in shared environment as one wants to save cost on hardware and software, which is quite under stable. However, this also ensures the optimum usage of the capacity as not all the databases are always used up to the capacity (again, case to case).

So the question is how do we avoid such an instance and still provide the optimum usage of resources ? 

11g, introduces one new feature called "instance caging", where you can have more than one instance on the same server, and you can share the CPUs reported by the operating system

-- To find out total cpu and cores available to your system
SQL> select cpu_count_current,cpu_core_count_current from v$license;

System with 12 physical cpu cores


----------------- ----------------------

               12                     12

System with 6 physical cpu  and 12 cores, Hyper Threaded


----------------- ----------------------

               12                     6

One can use following query to find out the CPU SOCKET and CPU CORES available to database 

    select * from v$osstat where stat_name like '%CPU%'

In Hyper threaded Servers, the output will be like following.

Now one needs to determine how the database instances on your server will share the CPU.  With Instance Caging, each instance's cpu_count specifies the maximum number of CPUs you want it to use at any time for that particular instance. The sum of the cpu_counts across all database instances determines the amount of isolation between the database instances and the efficiency of the server.

For maximum isolation one can use  "partition" approach. With the partition approach, the sum of the cpu_counts is less than or equal to the number of CPUs. 

For example, suppose the total number of CPUs (i.e. CPU threads/cores) is 16.  Using the partition approach, we could set cpu_count=8 for database A, cpu_count=4 for database B, and cpu_count=4 for database C.  The sum of the cpu_counts is 16, which equals the number of CPUs.

The disadvantage of the partition approach is that any CPU unused by one database instance cannot be used by another. Therefore, for non-critical databases where you also want to achieve better CPU utilization efficiency, use the "over-subscribe" approach. With the over-subscribe approach, the sum of the cpu_counts is less than or equal to 3x the number of CPUs.

For example, for a server with 16 CPUs, you could use the over-subscribe approach and set cpu_count=8 for database A, cpu_count=8 for database B, and cpu_count=8 for database C.  The sum of the cpu_counts is 24, which is greater than the number of CPUs.  Therefore, if all databases are using their full CPU allocation, there will be some CPU contention.

Enable Instance Caging – 

-- Set the cpu_count initialization parameter.
-- This is a dynamic parameter, and can be set with the following statement:


To verify that Instance Caging is enabled, check that "instance_caging" equals "ON" and that "cpu_count" is set appropriately.

SQL> select instance_caging from v$rsrc_plan where is_top_plan = 'TRUE';




Once this is enabled,  The CPU spikes will be limited to only specified number of cpu's and will be not be hampering other databases.