RESTORING THE ORACLE DB FROM THE LOSS/CORRUPTION OF AN ACTIVE REDO LOG FILE – AN EXTREME RECOVERY SCENARIO
July 7, 2009
RAC/CRS stack will not start after host reboot: Problem, Analysis, Resolution
July 7, 2009
In a two-node RAC environment, UNIX host reboots are known to cause a variety of problems
for the CRS stack. Usually the first node comes up clean, while the second one starts
writing conflicting and confusing messages to the evm, client, and crs logs.
OTN and other tech forums suggest a myriad of ways to address the issue, all based
on the same type of error messages. Nevertheless, none of those solutions worked for us.
Since one can spend a day creating an SR and then wait
another week for a resolution, I thought I would share this troubleshooting experience
to save fellow RAC-ites with a similar issue some time and energy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
If you see error messages like the following:
–[ COMMCRS][1]clsc_connect: (1002f4fe0)
–[ EVMD][1] EVMD waiting for CSS to be ready err = 3
–[ CRSRTI][1] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
– Voting disk offline
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
We approached the problem knowing that CRS was unable to start
even though there were no symptoms of OCR or voting disk corruption. Additionally,
the evmd daemon was waiting for CSS to come up and appeared to be hung.
What we noticed is that, while bringing up the CRS stack daemons, Oracle writes
socket files to the /var/tmp/.oracle directory. This directory should be
clean for CRS to come up. We cleaned up the existing socket files and rebooted node 2.
All RAC components then started working without any issues.
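The cleanup step can be sketched as a small shell function. This is only a minimal sketch, not the exact commands we ran: the function name clean_oracle_sockets and its -n (dry run) flag are hypothetical, and it assumes the CRS stack on the node is already stopped before anything is removed.

```shell
#!/bin/sh
# Hypothetical helper: remove leftover files from the CRS socket directory.
# Assumption: the CRS stack on this node is stopped before running it.
# Usage: clean_oracle_sockets [-n] [dir]    (-n = dry run, only list)
clean_oracle_sockets() {
    dry_run=0
    if [ "$1" = "-n" ]; then
        dry_run=1
        shift
    fi
    dir="${1:-/var/tmp/.oracle}"
    if [ ! -d "$dir" ]; then
        echo "directory not found: $dir" >&2
        return 1
    fi
    # Show what is there; socket files start with 's' in the ls -l mode column.
    ls -la "$dir"
    if [ "$dry_run" -eq 1 ]; then
        return 0
    fi
    # Remove regular and hidden entries, but keep the directory itself.
    for f in "$dir"/* "$dir"/.[!.]*; do
        [ -e "$f" ] || continue      # unmatched glob pattern
        [ -d "$f" ] && continue      # leave subdirectories alone
        rm -f "$f"
    done
}
```

Running it first as clean_oracle_sockets -n lists the directory without deleting anything, which is a cheap way to confirm that stale files are really there before the cleanup and reboot.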
We scrapped the SR draft for Oracle, and our resolve to fix the CRS issue ourselves paid off.
Hope this troubleshooting tip is useful…
Tagged: CRS, CSS, EVMD, RAC, RAC CRS REBOOT CSS EVMD, REBOOT
RAC evolution since 2003
March 12, 2009
Gartner’s report, Feb 2009
http://mediaproducts.gartner.com/reprints/oracle/article61/article61.html
RAC Tuning: Root cause of the 'global cache blocks lost' issue
March 12, 2009
This post is the result of work to identify the root cause of the mysterious 'global cache blocks lost' statistic, an uncommon issue that is known to degrade performance significantly and that also points to a sub-optimal interconnect configuration.
(Issue, Diagnosis and Solution)
1) ISSUE: We see a large number of global cache blocks lost (29 occurrences on node 1 and 287 on node 2), as reported by the query below:
SELECT
A.VALUE "GC BLOCKS LOST 1",
B.VALUE "GC BLOCKS CORRUPT 1",
C.VALUE "GC BLOCKS LOST 2",
D.VALUE "GC BLOCKS CORRUPT 2"
FROM GV$SYSSTAT A, GV$SYSSTAT B, GV$SYSSTAT C, GV$SYSSTAT D
WHERE A.INST_ID=1 AND A.NAME='gc blocks lost'
AND B.INST_ID=1 AND B.NAME='gc blocks corrupt'
AND C.INST_ID=2 AND C.NAME='gc blocks lost'
AND D.INST_ID=2 AND D.NAME='gc blocks corrupt';
2) DIAGNOSIS: The output below suggests that udp_max_buf and sq_max_size at the OS level are not set to optimal values on either node.
a) On Node 1: netstat shows no collisions and only a handful of input errors
$ netstat -I ce0 <-- netstat counters can be misleading and show near-zero values; see the kernel statistics (kstat) below
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode1-priv-physical1 clusternode1-priv-physical1 1086594021 6 3478599066 0 0 0
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxx 3010334968 1 162602450 0 0 0
$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 0
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 21 <-- Indicates packet loss; a small number compared with the 186 on ce0 of node 2 (below), which is consistent with node 2 reporting the higher gc blocks lost count in the database query.
ce:5:ce5:tx_nocanput 0
b) On Node 2: netstat shows no collisions and only a handful of input errors
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 2018406385 0 3189380678 0 0 0
$ netstat -I ce0
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode2-priv-physical1 clusternode2-priv-physical1 3478616312 8 1086431534 0 0 0
$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 186 <-- High packet loss, consistent with the higher number of blocks lost on node 2
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 10
ce:5:ce5:tx_nocanput 0
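To pick out the interesting counters from kstat listings like the ones above, a small filter can help. This is a minimal sketch: the function name nonzero_nocanput is hypothetical, and it assumes input in the `kstat -p` format shown in this post (statistic name, whitespace, value).

```shell
#!/bin/sh
# Hypothetical helper: print only the nocanput counters that are non-zero.
# Input: lines in `kstat -p` format, e.g. "ce:0:ce0:rx_nocanput 186".
nonzero_nocanput() {
    awk '$1 ~ /nocanput$/ && $2 + 0 > 0 { print $1, $2 }'
}

# On a Solaris host it would be fed from kstat directly:
#   kstat -p -s '*nocanput*' | nonzero_nocanput
```

Running the same filter on both nodes makes it easy to compare which interface is dropping packets against the gc blocks lost figures per instance.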
3) SOLUTION / NEXT STEPS: Increase udp_max_buf and sq_max_size at the Solaris OS level.
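As a rough sketch of what that change looks like on Solaris, the helper below prints the commands for review instead of running them. The function name print_tuning_steps is hypothetical, and the values passed to it are placeholders only; this post does not recommend specific numbers, so the right values should come from your own testing or from Oracle and Sun guidance for your release.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the Solaris steps for the two tunables.
# Both arguments are placeholder values, not recommendations from this post.
print_tuning_steps() {
    udp_max_buf="$1"
    sq_max_size="$2"
    echo "# check the current UDP buffer ceiling"
    echo "ndd -get /dev/udp udp_max_buf"
    echo "# raise it on the running system (does not persist across reboots)"
    echo "ndd -set /dev/udp udp_max_buf $udp_max_buf"
    echo "# line to add to /etc/system (takes effect after a reboot)"
    echo "set sq_max_size=$sq_max_size"
}
```

For example, print_tuning_steps 8388608 100 prints the ndd commands and the /etc/system line with those placeholder values filled in. Printing first keeps the change reviewable, since the ndd setting affects the live system and the /etc/system entry needs a reboot.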
Tagged: gc cr block lost, global cache, interconnect, RAC tuning, sq_max_size
Oracle Troubleshooting Snippet - When sys cannot log in - DBA's Courage Under Fire
February 23, 2009
When sys cannot log in to the server, the database floor becomes as tense as it can be, from panicked production support managers to the multiple DBAs trying to help the fellow DBA in distress. While each situation demands its own solution, the steps below can help.
1) Stay calm.
2) Log in with the -prelim option to open a sessionless connection:
$ sqlplus -prelim / as sysdba
3) Run a hang analysis:
SQL> oradebug setmypid
SQL> oradebug hanganalyze 12
4) Examine the trace files in the user_dump_dest directory.
This is a good start.
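Under pressure it can help to have the oradebug steps ready in a file. The sketch below is one way to do that, not part of the original procedure: the function name write_hanganalyze_script and the default file name hanganalyze.sql are hypothetical, and the script contains only the commands listed above.

```shell
#!/bin/sh
# Hypothetical helper: write the oradebug steps above to a SQL*Plus script file.
# Usage: write_hanganalyze_script [file]   (default file name is an assumption)
write_hanganalyze_script() {
    out="${1:-hanganalyze.sql}"
    cat > "$out" <<'EOF'
oradebug setmypid
oradebug hanganalyze 12
exit
EOF
    # Reminder of how to start the sessionless connection before running it.
    echo "connect with: sqlplus -prelim / as sysdba, then run @$out"
}
```

The file can be prepared ahead of time and kept with the other emergency scripts, so that during an incident only the prelim connection and a single @ command are needed.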
Tagged: DBA under stress, open session less sqlplus, oracle diagnostics