Solution to the disk related errors / ASM Diskgroup for RAC

The main reason for sharing this tech note is that Oracle's Metalink note on this particular error did not provide a solution for our environment. (Disk Is not Discovered in ASM, Diskgroup Creation Fails with Ora-15018 Ora-15031 Ora-15014 [ID 431013.1])

While adding and dropping disks online in an ASM environment seems simple, we often run into errors that depend on the cluster software, the OS, and the way the SAN is presented to the cluster. Before going into the error and its workaround, let me briefly touch on general ASM (Automatic Storage Management) operation, specifically altering a disk group, for those who are new to ASM.

We can use the ALTER DISKGROUP statement to alter an ASM disk group configuration. Disks can be added, resized, or dropped while the database remains online. Whenever possible, combine multiple operations in a single ALTER DISKGROUP statement, since the changes can then be handled by a single rebalance.

ASM automatically rebalances when the configuration of a disk group changes. By default, the ALTER DISKGROUP statement does not wait until the operation is complete before returning. Query the V$ASM_OPERATION view to monitor the status of this operation.

We can use the REBALANCE WAIT clause if we want the ALTER DISKGROUP statement to wait until the rebalance operation is complete before returning. This is especially useful in scripts. The statement also accepts a REBALANCE NOWAIT clause, which invokes the default behavior of running the rebalance asynchronously in the background.
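
As a minimal sketch of combining operations with REBALANCE WAIT in one statement, the shell function below only prints the SQL text; it does not connect to the ASM instance. The function name build_alter_diskgroup and the disk name RAWDG01_0000 in the usage line are hypothetical, and the exact clause order should be checked against the SQL reference for your release.

```shell
# Print one ALTER DISKGROUP statement that adds a disk, drops a disk,
# and waits for the rebalance to finish before returning.
# Arguments: disk group name, path of the disk to add,
#            ASM name of the disk to drop (hypothetical in the usage below).
build_alter_diskgroup() {
    dg=$1
    add_path=$2
    drop_name=$3
    printf "ALTER DISKGROUP %s ADD DISK '%s' DROP DISK %s REBALANCE WAIT;\n" \
        "$dg" "$add_path" "$drop_name"
}

# Usage sketch:
#   build_alter_diskgroup RAWDG01 /dev/did/rdsk/d21s6 RAWDG01_0000
```

The printed statement can be pasted into a SQL*Plus session on the ASM instance. With WAIT, the prompt returns only after the rebalance completes, so a script does not need to poll V$ASM_OPERATION.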

With the syntax and concepts covered, let's look at the command and the error. As part of migrating ASM to new SAN storage, we wanted to add the new disks to the existing ASM disk group and then drop the old disks. The operation can be done online in Oracle 11g R1 onwards; we planned to do it offline because the environment is 10.2.0.4.

In the ASM instance:
==============

alter diskgroup RAWDG01 add disk '/dev/did/rdsk/d21s6';

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15031: disk specification '/dev/did/rdsk/d21s6' matches no disks

ORA-15014: location '/dev/did/rdsk/d21s6' is not in the discovery set

As per the Metalink note above, the recommendation is either to use the 'dd' command at OS level to wipe the prior contents of the disk or to use the FORCE option.

Since we were using new disks, there was no need for dd, and the FORCE option shown below did not work for us either.

With FORCE option:
====================

alter diskgroup RAWDG01 add disk '/dev/did/rdsk/d21s6' force

ORA-15032: not all alterations performed

ORA-15031: disk specification '/dev/did/rdsk/d22s6' matches no disks

ORA-15014: location '/dev/did/rdsk/d22s6' is not in the discovery set

SQL> CREATE DISKGROUP RAWDG03 disk '/dev/did/rdsk/d22s6' force;

CREATE DISKGROUP RAWDG03 disk '/dev/did/rdsk/d22s6' force

*

ERROR at line 1:

ORA-15018: diskgroup cannot be created

ORA-15031: disk specification '/dev/did/rdsk/d22s6' matches no disks

ORA-15014: location '/dev/did/rdsk/d22s6' is not in the discovery set

After going through hundreds of pages of ASM manuals, we came across a relevant parameter, asm_diskstring, which was empty in our environment.

SQL> show parameter asm_diskstring;

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

asm_diskstring                       string

SQL> alter system set asm_diskstring = '/dev/rdsk*';

alter system set asm_diskstring = '/dev/rdsk*'

*

ERROR at line 1:

ORA-02097: parameter cannot be modified because specified value is invalid

ORA-15014: location '/dev/rdsk/c4t60060160829012004CBE135E05ADDE11d0s0' is not

in the discovery set

ORA-15025: could not open disk '/dev/rdsk'

ORA-15056: additional error message

SVR4 Error: 13: Permission denied

Additional information: 42

Additional information: 109884796

Additional information: 107607288

SQL> alter system set asm_diskstring = '/dev/rdsk/*s0' SID='*';

System altered.

SQL> show parameter asm

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

asm_diskgroups                       string      RAWDG01, RAWDG02

asm_diskstring                       string      /dev/rdsk/*s0

asm_power_limit                      integer     1

SQL> alter diskgroup RAWDG01 add disk '/dev/did/rdsk/d21s6';

Diskgroup altered.

SQL> select substr(name,1,10) name,substr(path,1,20) path, REDUNDANCY, TOTAL_MB,  free_mb from V$ASM_DISK

2  /

NAME       PATH                 REDUNDA   TOTAL_MB    FREE_MB

---------- -------------------- ------- ---------- ----------

/dev/rdsk/c4t6006016 UNKNOWN        511          0

/dev/rdsk/c4t6006016 UNKNOWN        511          0

/dev/rdsk/c4t600144F UNKNOWN        511          0

/dev/rdsk/c4t600144F UNKNOWN        511          0

/dev/rdsk/c4t600144F UNKNOWN     204699          0

/dev/rdsk/c4t600144F UNKNOWN     102286          0

/dev/rdsk/c4t600144F UNKNOWN        519          0

/dev/did/rdsk/d23s6  UNKNOWN        519          0

/dev/did/rdsk/d22s6  UNKNOWN     102286          0

/dev/did/rdsk/d19s6  UNKNOWN        511          0

/dev/did/rdsk/d20s6  UNKNOWN        511          0

NAME       PATH                 REDUNDA   TOTAL_MB    FREE_MB

---------- -------------------- ------- ---------- ----------

RAWDG02_00 /dev/rdsk/c4t6006016 UNKNOWN     204788     141340

RAWDG01_00 /dev/rdsk/c4t6006016 UNKNOWN     102394      11609

RAWDG01_00 /dev/did/rdsk/d21s6  UNKNOWN     204699     194393


In conclusion, we were able to add the new disks to the existing disk group after setting asm_diskstring on both nodes. The fact that the earlier disk groups were created during installation and configuration of ASM suggests that this parameter may not need to be set when working through the cluster install or DBCA. In our experience, however, asm_diskstring had to be specified before the ASM disk groups could be managed manually.
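
One way to see why the first attempt with '/dev/rdsk*' failed with ORA-15025 on '/dev/rdsk' is to look at what such a pattern expands to. The sketch below uses ordinary shell globbing as a stand-in for ASM discovery, which is an assumption: ASM's own matching is not guaranteed to behave identically. The function name expand_diskstring is hypothetical.

```shell
# Print every existing path that a discovery pattern expands to,
# using shell globbing as an approximation of ASM disk discovery.
expand_diskstring() {
    pattern=$1
    # $pattern is deliberately unquoted so the shell expands the wildcard.
    # Paths containing whitespace are not handled by this sketch.
    for path in $pattern; do
        if [ -e "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
    return 0
}

# Usage sketch on the ASM host:
#   expand_diskstring '/dev/rdsk*'      # the directory entry itself
#   expand_diskstring '/dev/rdsk/*s0'   # the slice-0 device files inside it
```

A pattern ending in 'rdsk*' matches the directory entry rather than the device files below it, which is consistent with the "could not open disk '/dev/rdsk'" message in the transcript above. A pattern such as '/dev/rdsk/*s0' descends into the directory and matches individual slices.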

Hope this helps folks working in ASM/RAC environments.

Oracle 10G RAC / CRS install issues (Sun cluster 3.2/ Solaris 10): Diagnostics

Hello All - Just want to share a unique RAC/CRS install issue in an Oracle 10g R2 RAC / Sun Cluster 3.2 / Solaris 10 environment.

Problem:

Recently we encountered a problem during installation of Cluster Ready Services (CRS).
Oracle provides the runcluvfy utility to check whether the system is ready for CRS
and whether shared storage is configured properly.

As a pre-check, we ran the following command.

$ /runcluvfy.sh stage -pre crsinst -n node1,node2  -verbose

The final output was:

“Pre-check for cluster services setup was successful on all the nodes”

The next step is to install CRS. In our case, the Specify Nodes screen of the installer could not identify the nodes, even though cluvfy had passed all the prerequisites.

Diagnostics:

The /etc/hosts file, the ifconfig -a output, and the installActions *.err and *.out files produced by running the GUI installer with tracing enabled

(./runInstaller -J-DTRACING.ENABLED=true -J-DTRACING.LEVEL=2)

did not point to anything unusual.

Initially we suspected that we had an old version of the ORCLudlm package, which Sun Cluster needs in order to integrate well with Oracle.

We installed the latest ORCLudlm (“Cluster Membership Monitor”) and restarted it by rebooting the cluster node in cluster mode.

Before starting OUI for the CRS installation, we verified the cluster status and the package version with the following commands:

/usr/cluster/bin/scstat -q

/usr/cluster/bin/scstat -g

pkginfo -l ORCLudlm |grep VERSION

Then it came down to another important Sun Cluster daemon for Oracle RAC, called “ucmmd”.

In our case, the attempt to start the ucmmd daemon failed.

The output from attempting to start ucmmd:

bash-3.00# clresourcegroup online -emM -n node1 rac-fmwk-rg

rac-fmwk-rg: invalid resource group

clresourcegroup: (C918779) Invalid resource group "rac-fmwk-rg" specified.

bash-3.00# clresourcegroup online -emM -n node2 rac-fmwk-rg

rac-fmwk-rg: invalid resource group

clresourcegroup: (C918779) Invalid resource group "rac-fmwk-rg" specified.

The content in the brackets below is taken from the Sun docs, for those who want more detail.
http://docs.sun.com/app/docs/doc/819-0583/6n30h631j?l=en&a=view#ch8_ops-118
[[[The UCMM daemon, ucmmd, manages the reconfiguration of Sun Cluster Support for Oracle Real Application Clusters.

When a cluster is booted or rebooted, this daemon is started only after all components of Sun Cluster Support for Oracle Real Application Clusters are validated.

If the validation of a component on a node fails, the ucmmd fails to start on the node.

To determine the cause of the problem, examine the following files:

The UCMM reconfiguration log file can be found at /var/cluster/ucmm/ucmm_reconf.log

The system messages file

The most common causes of this problem are as follows:

The ORCLudlm package that contains the Oracle UDLM is not installed.

An error occurred during a previous reconfiguration of a component of Sun Cluster Support for Oracle Real Application Clusters.

A step in a previous reconfiguration of Sun Cluster Support for Oracle Real Application Clusters timed out, causing the node on which the timeout occurred to panic.

To correct the problem, perform the appropriate recovery action for the cause of the problem and reboot the node on which ucmmd failed to start.]]]

We believe that, in our case, some important ucmmd-related messages were overlooked during the Sun Cluster 3.2 installation and reconfiguration. We performed the recovery action to start the ucmmd daemon and rebooted the node on which ucmmd had failed to start.

Once the ucmmd daemon was up and running, the CRS GUI was able to identify the nodes.

In summary, a successful run of the Oracle-provided cluvfy pre-check is not a fully reliable indication that the CRS installation can proceed. In addition to lsnodes, the ucmmd daemon must be working properly for Oracle CRS to run. Hope this note is useful as a pointer on where to look when troubleshooting CRS installation problems on the Sun Cluster / Solaris combination.
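
As a small pre-flight check before launching the installer, the sketch below tests whether a named daemon appears in a list of running commands. The function name has_process is hypothetical, and it assumes a shell with POSIX parameter expansion (such as the bash shown in the transcripts above) rather than the legacy Bourne shell.

```shell
# Succeed if a process with the given name appears in a list of
# command names read from standard input (one per line).
has_process() {
    name=$1
    while IFS= read -r cmd; do
        # Strip any leading directory so /some/path/ucmmd matches ucmmd.
        if [ "${cmd##*/}" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# Usage sketch on each cluster node:
#   ps -e -o comm= | has_process ucmmd || echo "ucmmd is not running"
```

Running this on every node before starting OUI would catch the situation described above, where cluvfy passes but ucmmd is down.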

RESTORING THE ORACLE DB FROM THE LOSS/CORRUPTION OF AN ACTIVE REDO LOG FILE – AN EXTREME RECOVERY SCENARIO

                                                                                             

 
INTRODUCTION:
 
We recently encountered a very uncommon recovery scenario in which an “active” redo log file was corrupted and, as a result, the crashed instance could not be brought up. Recovering from the loss of an “inactive” redo log file is straightforward as per the Oracle docs; however, nothing is covered on recovering from the loss of an “active” redo member, apart from a brief suggestion to call Oracle tech support.
 
PROBLEM:
 
If a redo log file is lost or corrupted, the instance detects a mismatch between redo records and rollback (undo) records and crashes with PMON error 472, as shown below.
 
OPIRIP: Uncaught error 1089. Error stack:
ORA-01089: immediate shutdown in progress - no operations are permitted
ORA-00600: internal error code, arguments: [4194], [52], [46], [], [], [], [], []
PMON: terminating instance due to error 472
Instance terminated by PMON, pid = 7623
 
IMBROGLIO a.k.a CATCH 22:
 
1. The database will not open after the loss or corruption of the redo log.
2. Unless the database is open, redo log commands such as switch, clear, drop, and create log file member or log file group will not work.
3. RESETLOGS will not work either, because the header information differs between redo and undo and the instance will still look for a complete recovery.
 
ANALYSIS:
Given the mismatch of records between redo and undo, we need to recreate both the redo and the undo segments for the database to be functional.
 
RESOLUTION:
1. In the mount state:
 
SQL> SELECT GROUP#, MEMBERS, STATUS, ARCHIVED FROM V$LOG;
 
    GROUP#    MEMBERS STATUS           ARC
---------- ---------- ---------------- ---
         1          1 INACTIVE         NO
         2          1 CURRENT          NO
 
SQL> !ls -ltr /opt/oracle/oradata/test/redo02.log
ls: /opt/oracle/oradata/test/redo02.log: No such file or directory
 
 
2. SQL> ALTER DATABASE ADD LOGFILE GROUP 3 ('/opt/oracle/oradata/test/redo04.log', '/opt/oracle/oradata/test/redo05.log') SIZE 500K;
 
3. Shutdown the database.
 
4. Incorporate the below three hidden parameters and start up in the mount state.
 
_ALLOW_RESETLOGS_CORRUPTION = true
_CORRUPTED_ROLLBACK_SEGMENTS = true
_ALLOW_READ_ONLY_CORRUPTION = true
 
5. Note that the active (CURRENT) status has moved from group 2 to group 1. Now we need to bring the header and SCN information into sync for all the redo members and for the undo segment as well.
 
SQL> SELECT GROUP#, MEMBERS, STATUS, ARCHIVED FROM V$LOG;
 
    GROUP#    MEMBERS STATUS           ARC
---------- ---------- ---------------- ---
         1          1 CURRENT          NO
         2          1 INACTIVE         NO
         3          2 UNUSED           YES
 
 
6. SQL> ALTER DATABASE DROP LOGFILE GROUP 2;
 
 
7. SQL> ALTER DATABASE ADD LOGFILE GROUP 2;
     SQL> ALTER DATABASE ADD LOGFILE MEMBER '/opt/oracle/oradata/test/redo02.log' REUSE TO GROUP 2;
 
 
8. SQL> CREATE UNDO TABLESPACE UNDOTBS2 DATAFILE '/opt/oracle/oradata/test/undotbs02.dbf' SIZE 1024M REUSE AUTOEXTEND ON;
 
9. SQL> ALTER SYSTEM SET undo_tablespace = UNDOTBS2;
 
10. SQL> DROP TABLESPACE undotbs1 INCLUDING CONTENTS AND DATAFILES;
 
11. Finally, remove the three hidden underscore parameters from the parameter file, set the correct undo tablespace, and bring up the database.
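
In the steps above, group 2 is dropped only after V$LOG reports it as INACTIVE. As a hedged sketch of that check, the function below reads V$LOG-style rows (GROUP#, MEMBERS, STATUS, ARCHIVED, with the header lines removed) and prints the groups that are INACTIVE or UNUSED. The function name droppable_groups is hypothetical, and the filter is only a reading aid, not a guarantee that a drop will succeed.

```shell
# Read V$LOG-style rows (GROUP# MEMBERS STATUS ARCHIVED) on standard
# input and print the group numbers whose status is INACTIVE or UNUSED.
droppable_groups() {
    awk '$3 == "INACTIVE" || $3 == "UNUSED" { print $1 }'
}

# Usage sketch with the rows from step 5:
#   printf '1 1 CURRENT NO\n2 1 INACTIVE NO\n3 2 UNUSED YES\n' | droppable_groups
```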
 
In summary, this procedure outlines a technique for recovering the database from the loss or corruption of an active redo log.
Hope fellow Oracle-ites find the information useful.
 
Disclaimer: If you ever get into a similar situation in a production environment, please contact Oracle tech support, as each Oracle setup is different.
 

RAC/CRS stack will not start after host reboot: Problem, Analysis, Resolution

In a two-node RAC environment, UNIX host reboots are known to cause a variety of problems
for the CRS stack. Usually the first node comes up clean, and the second one starts
writing conflicting and confusing messages to all the evm, client, and crs logs.
There are myriad ways of addressing the issue mentioned on OTN and other tech forums for
the same type of error messages. Nevertheless, none of those solutions worked for us.
Since one can spend a day creating an SR and then wait
another week for a resolution, I thought I would share this troubleshooting experience to
save fellow RAC-ites some time and energy on a similar issue.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
If you see the below type of error message>>
–[ COMMCRS][1]clsc_connect: (1002f4fe0)
–[    EVMD][1] EVMD waiting for CSS to be ready err = 3
–[ CRSRTI][1] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
– Voting disk offline
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
We approached the problem with the understanding that CRS was unable to start
even though there were no symptoms of OCR or voting disk corruption. Additionally,
the evmd daemon was waiting for CSS to come up and seemed to have hung.
What we noticed is that, while bringing up the CRS stack daemons, Oracle writes
socket files to the /var/tmp/.oracle directory. This directory should be
clean in order for CRS to come up. We cleaned up the existing socket files and rebooted node 2.
All RAC components then started working without any issues.
We scrapped the SR draft for Oracle; the resolve to fix the CRS issue ourselves paid off.
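
Before cleaning anything up, it helps to see which socket files are actually present. The sketch below only lists them and deletes nothing; the function name list_socket_files is hypothetical, and the default directory is the one named above.

```shell
# List socket files under a directory (default: /var/tmp/.oracle)
# so they can be reviewed before any cleanup. Nothing is deleted.
list_socket_files() {
    dir=${1:-/var/tmp/.oracle}
    if [ -d "$dir" ]; then
        find "$dir" -type s -print
    fi
}

# Usage sketch on the node whose CRS stack will not start:
#   list_socket_files
```

In our case the cleanup was done while the CRS stack on that node was not running. Whether stale files may be removed safely on a node with live clusterware processes is something to confirm for your own environment.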
 
Hope this troubleshooting tip is useful.

RAC evolution since 2003

Gartner’s report, Feb 2009

 

http://mediaproducts.gartner.com/reprints/oracle/article61/article61.html

RAC -Tuning- Root cause for global cache blocks lost issue.

This post is the result of work to identify the root cause of the mysterious 'global cache blocks lost' statistic, an uncommon issue known to cause significantly poor performance and an indication of a sub-optimal interconnect configuration.

( Issue, Diagnosis and Solution ) ----------------------------------------

1) ISSUE: We saw a large number of global cache blocks lost (29 occurrences on node 1 and 287 on node 2), as reported by the query below:

SELECT
A.VALUE "GC BLOCKS LOST 1",
B.VALUE "GC BLOCKS CORRUPT 1",
C.VALUE "GC BLOCKS LOST 2",
D.VALUE "GC BLOCKS CORRUPT 2"
FROM GV$SYSSTAT A, GV$SYSSTAT B, GV$SYSSTAT C, GV$SYSSTAT D
WHERE A.INST_ID=1 AND A.NAME='gc blocks lost'
AND B.INST_ID=1 AND B.NAME='gc blocks corrupt'
AND C.INST_ID=2 AND C.NAME='gc blocks lost'
AND D.INST_ID=2 AND D.NAME='gc blocks corrupt';

2) DIAGNOSIS: The statistics below suggest that udp_max_buf and sq_max_size at the OS level are not set to optimal values on either node.

a) On node 1: netstat showed no collisions and only a handful of input errors

$ netstat -I ce0 <-- network stats can be misleading when they show zeroes; see the kernel statistics (kstat) below
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode1-priv-physical1 clusternode1-priv-physical1 1086594021 6 3478599066 0 0 0
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxx 3010334968 1 162602450 0 0 0

$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 0
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 21 <-- indicates packet loss; a small number compared with the 186 on ce0 of node 2 below, which matches the higher number of gc blocks lost reported by the database query
ce:5:ce5:tx_nocanput 0

b) On node 2: netstat showed no collisions and only a handful of input errors
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 2018406385 0 3189380678 0 0 0

$ netstat -I ce0
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode2-priv-physical1 clusternode2-priv-physical1 3478616312 8 1086431534 0 0 0

$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 186 <-- high packet loss, matching the higher number of blocks lost on node 2
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 10
ce:5:ce5:tx_nocanput 0
3) SOLUTION/ NEXT STEPS : Increase the udp_max_buf and sq_max_size at Solaris OS level.
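
To make the non-zero counters stand out in a long kstat listing, a small filter such as the one below can be used. It is a sketch that assumes the two-column "name value" form printed by kstat -p; the function name nonzero_nocanput is hypothetical.

```shell
# Read `kstat -p` output (statistic name, whitespace, value) on standard
# input and print only the nocanput counters with a value above zero.
nonzero_nocanput() {
    awk '$1 ~ /nocanput$/ && $2 + 0 > 0 { print $1, $2 }'
}

# Usage sketch on each node:
#   kstat -p -s '*nocanput*' | nonzero_nocanput
```

Running the filter before and after changing udp_max_buf and sq_max_size gives a quick way to check whether rx_nocanput keeps growing on the interconnect interfaces.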

Oracle Troubleshooting Snippet -When sys can not log in -DBA’s Courage Under Fire

When sys cannot log in to the server, the database floor becomes as tense as it can be, from panicked production support managers to multiple DBAs trying to help the fellow DBA in distress. While each situation demands its own solution, the steps below can help.
1) Stay calm
2) Log in using -prelim to open a sessionless connection
# sqlplus -prelim / as sysdba
3) SQL> oradebug setmypid
SQL> oradebug hanganalyze 12
4) Examine the trace files in the user_dump_dest directory

A good start ..
