Solution to the disk related errors / ASM Diskgroup for RAC
17 Nov 2011
I am sharing this tech note because Oracle's Metalink note on this particular error (Disk Is not Discovered in ASM, Diskgroup Creation Fails with Ora-15018 Ora-15031 Ora-15014 [ID 431013.1]) did not provide a solution for our environment.
While adding and dropping disks online in an ASM environment seems simple, we often encounter errors that depend on the cluster software, the OS, and the way the SAN is visible to the cluster. Before going into the details of the error and the workaround, let me briefly touch on general ASM operation, specifically altering a disk group, for those who are new to ASM (Automatic Storage Management).
We can use the ALTER DISKGROUP statement to alter an ASM disk group configuration. We can add, resize, or drop disks while the database remains online. Whenever possible, it is recommended to combine multiple operations in a single ALTER DISKGROUP statement.
ASM automatically rebalances when the configuration of a disk group changes. By default, the ALTER DISKGROUP statement does not wait until the operation is complete before returning. Query the V$ASM_OPERATION view to monitor the status of this operation.
We can use the REBALANCE WAIT clause if we want the ALTER DISKGROUP statement to wait until the rebalance operation is complete before returning. This is especially useful in scripts. The statement also accepts a REBALANCE NOWAIT clause, which invokes the default behavior of running the rebalance operation asynchronously in the background.
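To make the batching advice above concrete, here is a minimal sketch in Python that composes one ALTER DISKGROUP statement combining ADD DISK, DROP DISK and a REBALANCE clause. The helper name build_alter_diskgroup and the disk name RAWDG01_0000 are hypothetical and used only for illustration; the sketch only builds the text and does not connect to an ASM instance.

```python
def build_alter_diskgroup(diskgroup, add_paths=(), drop_names=(), power=None, wait=True):
    """Compose a single ALTER DISKGROUP statement that batches ADD and DROP.

    Hypothetical helper for illustration. It does no quoting or validation,
    so it should only be fed trusted disk group names, paths and disk names.
    """
    if not add_paths and not drop_names:
        raise ValueError("nothing to add or drop")
    parts = [f"ALTER DISKGROUP {diskgroup}"]
    if add_paths:
        # Disk paths are string literals in the statement.
        parts.append("ADD DISK " + ", ".join(f"'{p}'" for p in add_paths))
    if drop_names:
        # DROP DISK takes ASM disk names, not OS paths.
        parts.append("DROP DISK " + ", ".join(drop_names))
    rebalance = "REBALANCE"
    if power is not None:
        rebalance += f" POWER {int(power)}"
    # WAIT blocks until the rebalance finishes; NOWAIT is the default behavior.
    rebalance += " WAIT" if wait else " NOWAIT"
    parts.append(rebalance)
    return " ".join(parts) + ";"


if __name__ == "__main__":
    # RAWDG01_0000 is an assumed disk name, not one taken from the transcript.
    print(build_alter_diskgroup("RAWDG01", ["/dev/did/rdsk/d21s6"],
                                ["RAWDG01_0000"], power=4))
```

Batching the add and the drop into one statement means ASM performs a single rebalance instead of two, which is why a script would normally pair it with WAIT.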
With the syntax and concepts covered, let's look at the command and the error below. As part of migrating ASM to new SAN storage, we wanted to add the new disks to the existing ASM disk group and then drop the old disks. The operation can be done online from Oracle 11g R1 onwards; we planned to do it offline because the environment is 10.2.0.4.
In the ASM instance:
==============
alter diskgroup RAWDG01 add disk '/dev/did/rdsk/d21s6';
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15031: disk specification '/dev/did/rdsk/d21s6' matches no disks
ORA-15014: location '/dev/did/rdsk/d21s6' is not in the discovery set
According to the Metalink note above, the recommendation is to use the 'dd' command at OS level to wipe the prior contents, or to use FORCE.
Since we were using new disks, there was no need to use dd, and the following FORCE commands did not work for us either.
With FORCE option:
====================
alter diskgroup RAWDG01 add disk '/dev/did/rdsk/d21s6' force
ORA-15032: not all alterations performed
ORA-15031: disk specification '/dev/did/rdsk/d22s6' matches no disks
ORA-15014: location '/dev/did/rdsk/d22s6' is not in the discovery set
SQL> CREATE DISKGROUP RAWDG03 disk '/dev/did/rdsk/d22s6' force;
CREATE DISKGROUP RAWDG03 disk '/dev/did/rdsk/d22s6' force
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15031: disk specification '/dev/did/rdsk/d22s6' matches no disks
ORA-15014: location '/dev/did/rdsk/d22s6' is not in the discovery set
After going through hundreds of pages of ASM manuals, we came across a relevant parameter, asm_diskstring, which was empty in our environment.
SQL> show parameter asm_diskstring;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string
SQL> alter system set asm_diskstring = '/dev/rdsk*';
alter system set asm_diskstring = '/dev/rdsk*'
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-15014: location '/dev/rdsk/c4t60060160829012004CBE135E05ADDE11d0s0' is not
in the discovery set
ORA-15025: could not open disk '/dev/rdsk'
ORA-15056: additional error message
SVR4 Error: 13: Permission denied
Additional information: 42
Additional information: 109884796
Additional information: 107607288
SQL> alter system set asm_diskstring = '/dev/rdsk/*s0' SID='*';
System altered.
SQL> show parameter asm
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string      RAWDG01, RAWDG02
asm_diskstring                       string      /dev/rdsk/*s0
asm_power_limit                      integer     1
SQL> alter diskgroup RAWDG01 add disk '/dev/did/rdsk/d21s6';
Diskgroup altered.
SQL> select substr(name,1,10) name,substr(path,1,20) path, REDUNDANCY, TOTAL_MB, free_mb from V$ASM_DISK
2 /
NAME       PATH                 REDUNDA   TOTAL_MB    FREE_MB
---------- -------------------- ------- ---------- ----------
           /dev/rdsk/c4t6006016 UNKNOWN        511          0
           /dev/rdsk/c4t6006016 UNKNOWN        511          0
           /dev/rdsk/c4t600144F UNKNOWN        511          0
           /dev/rdsk/c4t600144F UNKNOWN        511          0
           /dev/rdsk/c4t600144F UNKNOWN     204699          0
           /dev/rdsk/c4t600144F UNKNOWN     102286          0
           /dev/rdsk/c4t600144F UNKNOWN        519          0
           /dev/did/rdsk/d23s6  UNKNOWN        519          0
           /dev/did/rdsk/d22s6  UNKNOWN     102286          0
           /dev/did/rdsk/d19s6  UNKNOWN        511          0
           /dev/did/rdsk/d20s6  UNKNOWN        511          0
NAME       PATH                 REDUNDA   TOTAL_MB    FREE_MB
---------- -------------------- ------- ---------- ----------
RAWDG02_00 /dev/rdsk/c4t6006016 UNKNOWN     204788     141340
RAWDG01_00 /dev/rdsk/c4t6006016 UNKNOWN     102394      11609
RAWDG01_00 /dev/did/rdsk/d21s6  UNKNOWN     204699     194393
In conclusion, we were able to add the new disks to the existing disk group after setting asm_diskstring on both nodes. The fact that the earlier disk groups were created during the installation and configuration of ASM suggests that this parameter may not need to be set when going through the cluster install or dbca. However, our experience here indicates that asm_diskstring must be specified when managing ASM disk groups afterwards.
Hope this helps folks working in ASM/RAC environments.
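The ORA-15014 "not in the discovery set" error above comes down to whether a disk path is covered by one of the patterns in asm_diskstring. The following Python sketch approximates that check with shell-style wildcard matching. This is an assumption made for illustration: ASM's own discovery logic is not identical to Python's fnmatch (for example, fnmatch lets * match across / characters), and the helper name in_discovery_set is hypothetical. The '/dev/did/rdsk/*s6' pattern is also an assumption; it does not appear in the transcript.

```python
from fnmatch import fnmatchcase


def in_discovery_set(path, diskstring):
    """Approximate check: does `path` match any pattern in a comma-separated
    asm_diskstring value? Uses shell-style wildcards as a stand-in for ASM
    discovery, which is an assumption rather than Oracle's exact algorithm.
    """
    patterns = [p.strip() for p in diskstring.split(",") if p.strip()]
    return any(fnmatchcase(path, pattern) for pattern in patterns)


if __name__ == "__main__":
    candidates = [
        "/dev/rdsk/c4t60060160829012004CBE135E05ADDE11d0s0",
        "/dev/did/rdsk/d21s6",
    ]
    # Second pattern is hypothetical, added to cover the /dev/did paths.
    diskstring = "/dev/rdsk/*s0, /dev/did/rdsk/*s6"
    for path in candidates:
        print(path, in_discovery_set(path, diskstring))
```

Running a check like this against the paths you intend to add, before issuing ALTER DISKGROUP, is a cheap way to see which directories the discovery string still needs to cover.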
Oracle 10G RAC / CRS install issues (Sun cluster 3.2/ Solaris 10): Diagnostics
01 Dec 2010
Problem:
Recently we encountered a problem during the installation of Cluster Ready Services (CRS).
Oracle provides the runcluvfy utility to check whether the system is ready for CRS and whether shared storage is configured properly.
As a pre-check, we ran the following command.
$ /runcluvfy.sh stage -pre crsinst -n node1,node2 -verbose
The final output was
"Pre-check for cluster services setup was successful on all the nodes"
The next step is to install CRS. In our situation, the "Specify Nodes" screen of the installer could not identify the nodes, even though the cluvfy tool had passed the prerequisites.
Diagnostics:
The /etc/hosts file, the ifconfig -a output, and the installActions *.err and *.out files produced by running the GUI installer with tracing
(./runInstaller -J-DTRACING.ENABLED=true -J-DTRACING.LEVEL=2) did not point to anything unusual.
Initially we thought we had an old version of the ORCLudlm package, which is needed for Sun Cluster to integrate well with Oracle.
We confirmed that the latest ORCLudlm ("Cluster Membership Monitor") was installed and restarted it by rebooting the cluster node in cluster mode.
Before starting OUI for the CRS installation, we verified that all was well by running the commands below to display the cluster status, the current cluster nodes, and the package version:
/usr/cluster/bin/scstat -q
/usr/cluster/bin/scstat -g
pkginfo -l ORCLudlm | grep VERSION
Then it came down to another important Sun Cluster daemon for Oracle RAC called "UCMMD".
In our case, the attempt to start the UCMMD daemon failed.
The output from attempting to start ucmmd:
bash-3.00# clresourcegroup online -emM -n node1 rac-fmwk-rg
rac-fmwk-rg: invalid resource group
clresourcegroup: (C918779) Invalid resource group "rac-fmwk-rg" specified.
bash-3.00# clresourcegroup online -emM -n node2 rac-fmwk-rg
rac-fmwk-rg: invalid resource group
clresourcegroup: (C918779) Invalid resource group "rac-fmwk-rg" specified.
The content in the brackets below is taken from the Sun docs, for those who are interested in the details.
http://docs.sun.com/app/docs/doc/819-0583/6n30h631j?l=en&a=view#ch8_ops-118
[[[The UCMM daemon, ucmmd, manages the reconfiguration of Sun Cluster Support for Oracle Real Application Clusters.
When a cluster is booted or rebooted, this daemon is started only after all components of Sun Cluster Support for Oracle Real Application Clusters are validated.
If the validation of a component on a node fails, the ucmmd fails to start on the node.
To determine the cause of the problem, examine the following files:
The UCMM reconfiguration log file can be found at /var/cluster/ucmm/ucmm_reconf.log
The system messages file
The most common causes of this problem are as follows:
The ORCLudlm package that contains the Oracle UDLM is not installed.
An error occurred during a previous reconfiguration of a component Sun Cluster Support for Oracle Real Application Clusters.
A step in a previous reconfiguration of Sun Cluster Support for Oracle Real Application Clusters timed out, causing the node on which the timeout occurred to panic.
To correct the problem, perform the appropriate recovery action for the cause of the problem and reboot the node on which ucmmd failed to start.]]]
We believe that in our case some important ucmmd-related messages were overlooked during the Sun Cluster 3.2 installation and reconfiguration. We performed the recovery action to start the ucmmd daemon and rebooted the node on which ucmmd had failed to start.
Once the ucmmd daemon was up and running, the CRS GUI was able to identify the nodes.
In summary, passing the Oracle-provided cluvfy pre-check does not by itself guarantee that the CRS installation can proceed. In addition to lsnodes, the ucmmd daemon must be working properly for Oracle CRS to run. Hope this note is useful as a pointer on where to look when troubleshooting CRS installation problems on the Sun Cluster/Solaris combination.
RESTORING THE ORACLE DB FROM THE LOSS/CORRUPTION OF AN ACTIVE REDO LOG FILE – AN EXTREME RECOVERY SCENARIO
07 Jul 2009
RAC/CRS Stack will not start after host reboot.Problem, Analysis, Resolution
07 Jul 2009
In a two-node RAC environment, UNIX host reboots are known to cause a variety of problems for the CRS stack. Usually the first node comes up clean and the second one starts writing conflicting and confusing messages to all the evm, client, and crs logs.
There are myriad ways of addressing the issue, as mentioned on OTN and other tech forums, based on the same type of error messages. Nevertheless, none of those solutions worked for us.
Since one can spend a day creating an SR and wait another week for a resolution, I thought I would share this troubleshooting experience to save fellow RAC-ites with a similar issue some time and energy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
If you see the below type of error message>>
–[ COMMCRS][1]clsc_connect: (1002f4fe0)
–[ EVMD][1] EVMD waiting for CSS to be ready err = 3
–[ CRSRTI][1] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
– Voting disk offline
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
We approached the problem from the observation that CRS was unable to start even though there were no symptoms of OCR or voting disk corruption. Additionally, the evmd daemon was waiting for CSS to come up and appeared to have hung.
What we noticed is that, while bringing up the CRS stack daemons, Oracle writes socket files to the /var/tmp/.oracle directory. This directory should be clean in order for CRS to come up. We cleaned up the existing socket files and rebooted node 2.
All RAC components then started working without any issues.
We scrapped the SR draft for Oracle; the resolve to fix the CRS issue ourselves paid off.
Hope this troubleshooting tip is useful.
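As a companion to the cleanup step described above, here is a small Python sketch that only lists the socket files under a directory such as /var/tmp/.oracle, so they can be reviewed before anything is removed. The function name list_socket_files is hypothetical, and the sketch deliberately does not delete anything; whether and when to remove the files (with the CRS stack down) remains a manual decision.

```python
import os
import stat


def list_socket_files(directory="/var/tmp/.oracle"):
    """Return the sorted paths of UNIX socket files directly under `directory`.

    Read-only helper for illustration: it inspects file types and removes
    nothing.
    """
    found = []
    for entry in os.scandir(directory):
        # Do not follow symlinks; only real socket entries are reported.
        mode = entry.stat(follow_symlinks=False).st_mode
        if stat.S_ISSOCK(mode):
            found.append(entry.path)
    return sorted(found)


if __name__ == "__main__":
    target = "/var/tmp/.oracle"
    if os.path.isdir(target):
        for path in list_socket_files(target):
            print(path)
    else:
        print(f"{target} not found on this host")
```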
RAC evolution since 2003
12 Mar 2009
Gartner’s report, Feb 2009
http://mediaproducts.gartner.com/reprints/oracle/article61/article61.html
RAC -Tuning- Root cause for global cache blocks lost issue.
12 Mar 2009
This post is the result of work to identify the root cause of the mysterious 'global cache blocks lost' statistic, an uncommon issue that is known to degrade performance significantly and that also indicates a sub-optimal interconnect configuration.
(Issue, Diagnosis and Solution) ------------------------------------------------------------
1) ISSUE: We have a large number of global cache blocks lost, as reported by the query below (29 occurrences on node 1 and 287 on node 2).
SELECT
A.VALUE "GC BLOCKS LOST 1",
B.VALUE "GC BLOCKS CORRUPT 1",
C.VALUE "GC BLOCKS LOST 2",
D.VALUE "GC BLOCKS CORRUPT 2"
FROM GV$SYSSTAT A, GV$SYSSTAT B, GV$SYSSTAT C, GV$SYSSTAT D
WHERE A.INST_ID=1 AND A.NAME='gc blocks lost'
AND B.INST_ID=1 AND B.NAME='gc blocks corrupt'
AND C.INST_ID=2 AND C.NAME='gc blocks lost'
AND D.INST_ID=2 AND D.NAME='gc blocks corrupt';
2) DIAGNOSIS: The output below indicates that udp_max_buf and sq_max_size at OS level are not set to optimal values on either node.
a) On node 1: netstat showed no collisions and only a handful of input errors
$ netstat -I ce0   <-- netstat counters can be misleading when they show zeroes; see the kernel statistics (kstat) below
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode1-priv-physical1 clusternode1-priv-physical1 1086594021 6 3478599066 0 0 0
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxx 3010334968 1 162602450 0 0 0
$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 0
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 21   <-- indicates packet loss; a small number compared with the 186 on ce0 of node 2 below, which matches the higher gc blocks lost count on node 2 in the database query
ce:5:ce5:tx_nocanput 0
b) On node 2: netstat showed no collisions and only a handful of input errors
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 2018406385 0 3189380678 0 0 0
$ netstat -I ce0
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode2-priv-physical1 clusternode2-priv-physical1 3478616312 8 1086431534 0 0 0
$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 186   <-- high packet loss, matching the higher number of gc blocks lost on node 2
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 10
ce:5:ce5:tx_nocanput 0
3) SOLUTION / NEXT STEPS: Increase udp_max_buf and sq_max_size at the Solaris OS level.
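The diagnosis above rests on spotting non-zero rx_nocanput counters in the kstat output. A minimal Python sketch of that check follows. It assumes the "statistic value" line layout shown in the transcript, and the function name nonzero_nocanput is hypothetical.

```python
def nonzero_nocanput(kstat_output):
    """Parse `kstat -p` style lines and return {statistic: count} for every
    nocanput counter greater than zero.

    Assumes each line starts with "module:instance:name:statistic" followed
    by whitespace and an integer, as in the transcript; other lines are
    skipped.
    """
    hits = {}
    for line in kstat_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or "nocanput" not in fields[0]:
            continue
        try:
            value = int(fields[1])
        except ValueError:
            continue  # not a counter line
        if value > 0:
            hits[fields[0]] = value
    return hits


if __name__ == "__main__":
    # Sample values copied from the node 2 output above.
    sample = """ce:0:ce0:rx_nocanput 186
ce:0:ce0:tx_nocanput 0
ce:5:ce5:rx_nocanput 10
ce:5:ce5:tx_nocanput 0"""
    for name, count in sorted(nonzero_nocanput(sample).items()):
        print(name, count)
```

Feeding it the saved output of kstat -p -s '*nocanput*' from each node gives a quick per-interface comparison to set beside the gc blocks lost counts.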
Oracle Troubleshooting Snippet -When sys can not log in -DBA’s Courage Under Fire
23 Feb 2009
When sys cannot log in to the server, the database floor becomes as tense as it can be, with panicked production support managers and multiple DBAs all trying to help the fellow DBA in distress. While each situation demands its own solution, the steps below can help.
1) Stay calm
2) Log in with -prelim to open a sessionless connection
# sqlplus -prelim / as sysdba
3) SQL> oradebug setmypid
SQL> oradebug hanganalyze 12
4) Examine the trace files in the user_dump_dest directory
A good start ..