Oracle RAC Blog

RESTORING THE ORACLE DB FROM THE LOSS/CORRUPTION OF AN ACTIVE REDO LOG FILE – AN EXTREME RECOVERY SCENARIO

July 7, 2009

                                                                                             

 
INTRODUCTION:
 
We recently encountered a very uncommon recovery scenario in which an “active” redo log file was corrupted and, as a result, the crashed instance could not be brought back up. Recovering from the loss of an “inactive” redo log file is straightforward and covered in the Oracle docs. The loss of an “active” redo member, however, is not covered beyond a brief suggestion to call Oracle tech support.
 
PROBLEM:
 
When a redo log file is lost or corrupted, the instance detects a mismatch between the redo records and the rollback (undo) records and crashes with PMON error 472, as shown below.
 
OPIRIP: Uncaught error 1089. Error stack:
ORA-01089: immediate shutdown in progress - no operations are permitted
ORA-00600: internal error code, arguments: [4194], [52], [46], [], [], [], [], []
PMON: terminating instance due to error 472
Instance terminated by PMON, pid = 7623
 
IMBROGLIO, A.K.A. CATCH-22:
 
1. The database will not open after the loss or corruption of the redo log.
2. Unless the database is open, redo log commands such as switch, clear, drop, and create log file member or log file group will not work.
3. Opening with RESETLOGS will not work either, because the header information in redo and undo differs and the instance still expects a complete recovery.
 
ANALYSIS:
Given the mismatch between the redo and undo records, we need to recreate both the redo logs and the undo segments for the database to be functional.
 
RESOLUTION:
1.  In the mount state
 
SQL> SELECT GROUP#, MEMBERS, STATUS, ARCHIVED FROM V$LOG;
 
    GROUP#    MEMBERS STATUS           ARC
---------- ---------- ---------------- ---
         1          1 INACTIVE         NO
         2          1 CURRENT          NO
 
SQL> !ls -ltr /opt/oracle/oradata/test/redo02.log
ls: /opt/oracle/oradata/test/redo02.log: No such file or directory
 
 
2. SQL> ALTER DATABASE ADD LOGFILE GROUP 3 ('/opt/oracle/oradata/test/redo04.log', '/opt/oracle/oradata/test/redo05.log') SIZE 500K;
 
3. Shutdown the database.
 
4. Add the following three hidden parameters to the parameter file and start up in the mount state.
 
_ALLOW_RESETLOGS_CORRUPTION = true
_CORRUPTED_ROLLBACK_SEGMENTS = true
_ALLOW_READ_ONLY_CORRUPTION = true
 
5. Note that the “active” state in the STATUS column (shown as CURRENT) has moved from group 2 to group 1. Next, we need to bring the header and SCN information into sync for all the redo members and for the undo segments as well.
 
SQL> SELECT GROUP#, MEMBERS, STATUS, ARCHIVED FROM V$LOG;
 
    GROUP#    MEMBERS STATUS           ARC
---------- ---------- ---------------- ---
         1          1 CURRENT          NO
         2          1 INACTIVE         NO
         3          2 UNUSED           YES
 
 
6. SQL> ALTER DATABASE DROP LOGFILE GROUP 2;
 
 
7. SQL> ALTER DATABASE ADD LOGFILE GROUP 2;
     SQL> ALTER DATABASE ADD LOGFILE MEMBER '/opt/oracle/oradata/test/redo02.log' REUSE TO GROUP 2;
 
 
8. SQL> CREATE UNDO TABLESPACE UNDOTBS2 DATAFILE '/opt/oracle/oradata/test/undotbs02.dbf' SIZE 1024M REUSE AUTOEXTEND ON;
 
9. SQL> ALTER SYSTEM SET undo_tablespace = UNDOTBS2;
 
10. SQL> DROP TABLESPACE undotbs1 INCLUDING CONTENTS AND DATAFILES;
 
11. Finally, remove the three hidden underscore parameters from the parameter file, set the correct undo tablespace, and bring up the database.
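
The parameter file clean-up in step 11 could be scripted roughly as follows. This is only a sketch: it assumes a plain-text parameter file (init.ora style, not a binary spfile), and the function name strip_recovery_params is made up for illustration. It only filters text from stdin to stdout and never touches the original file.

```shell
# Hypothetical helper: read a plain-text parameter file on stdin and print it
# without the three hidden recovery parameters added in step 4.
# Assumption: one parameter per line, optionally prefixed with "*." or "<sid>.".
strip_recovery_params() {
    grep -v -i -E '^[[:space:]]*([^=[:space:]]*\.)?_(allow_resetlogs_corruption|corrupted_rollback_segments|allow_read_only_corruption)[[:space:]]*='
}
```

For example, `strip_recovery_params < inittest.ora > inittest.ora.clean` (both file names are placeholders) writes a cleaned copy that can be reviewed before it replaces the original.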
 
In summary, this procedure outlines a technique for recovering the database from the loss or corruption of an active redo log.
Hope fellow Oracle-ites find the information useful.
 
Disclaimer: If you ever get into a similar situation in a production environment, please contact Oracle tech support, as each Oracle setup is different.
 


RAC/CRS Stack will not start after host reboot: Problem, Analysis, Resolution

July 7, 2009

In a two-node RAC environment, UNIX host reboots are known to cause a variety of problems
for the CRS stack. Usually the first node comes up clean, and the second one starts
writing conflicting and confusing messages to the evm, client and crs logs.
OTN and other tech forums suggest myriad ways of addressing the issue, based
on the same type of error messages. Nevertheless, none of those solutions worked for us.
Since one can spend a day creating an SR and then wait
another week for a resolution, I thought I would share this troubleshooting experience
to save fellow RAC-ites some time and energy with a similar issue.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
If you see error messages like the following:
–[ COMMCRS][1]clsc_connect: (1002f4fe0)
–[    EVMD][1] EVMD waiting for CSS to be ready err = 3
–[ CRSRTI][1] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
– Voting disk offline
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
We approached the problem from the observation that CRS was unable to start
even though there were no symptoms of OCR or voting disk corruption. Additionally,
the evmd daemon was waiting for CSS to come up and appeared to be hung.
What we noticed is that, while bringing up the CRS stack daemons, Oracle writes
socket files to the /var/tmp/.oracle directory. This directory must be
clean for CRS to come up. We cleaned up the existing socket files and rebooted node 2,
after which all RAC components started working without any issues.
We scrapped the SR draft for Oracle; the resolve to fix the CRS issue ourselves paid off.
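
Before cleaning anything up, it helps to see which socket files are actually present. The sketch below lists them without deleting anything. The function name list_oracle_sockets is invented for illustration, and the default path is the /var/tmp/.oracle directory mentioned above, which may differ on other platforms.

```shell
# Hypothetical helper: print the socket files found under a directory
# (default: /var/tmp/.oracle). It only lists; it never removes anything.
list_oracle_sockets() {
    dir=${1:-/var/tmp/.oracle}
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # -type s restricts the listing to UNIX domain socket files
    find "$dir" -type s -print
}
```

Running `list_oracle_sockets` on the affected node with the CRS stack stopped shows the leftovers. Whether they are safe to remove depends on the CRS daemons really being down.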
 
Hope this troubleshooting tip is useful.


RAC evolution since 2003

March 12, 2009


RAC -Tuning- Root cause for global cache blocks lost issue.

March 12, 2009

This post is the result of work to identify the root cause of the mysterious ‘global cache blocks lost’ statistic, an uncommon issue known to degrade performance significantly and an indication of a sub-optimal interconnect configuration.

(Issue, Diagnosis and Solution)

1) ISSUE: We saw a large number of global cache blocks lost (29 occurrences on node 1 and 287 on node 2), as reported by the query below.

SELECT
A.VALUE "GC BLOCKS LOST 1",
B.VALUE "GC BLOCKS CORRUPT 1",
C.VALUE "GC BLOCKS LOST 2",
D.VALUE "GC BLOCKS CORRUPT 2"
FROM GV$SYSSTAT A, GV$SYSSTAT B, GV$SYSSTAT C, GV$SYSSTAT D
WHERE A.INST_ID=1 AND A.NAME='gc blocks lost'
AND B.INST_ID=1 AND B.NAME='gc blocks corrupt'
AND C.INST_ID=2 AND C.NAME='gc blocks lost'
AND D.INST_ID=2 AND D.NAME='gc blocks corrupt';

2) DIAGNOSIS: The output below suggests that udp_max_buf and sq_max_size are not set to optimal values at the OS level on either node.

a) On Node 1: netstat did not show any collisions or significant errors

$ netstat -I ce0 <-- network stats can be misleading, showing zeroes; see the kernel statistics (kstat) below
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode1-priv-physical1 clusternode1-priv-physical1 1086594021 6 3478599066 0 0 0
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxx 3010334968 1 162602450 0 0 0

$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 0
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 21 <-- indicates packet loss; a small number compared with the 186 on ce0 of node 2 below, which matches node 2 having the higher number of gc blocks lost in the database query
ce:5:ce5:tx_nocanput 0

b) On Node 2: netstat did not show any collisions or significant errors
$ netstat -I ce1
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce1 1500 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 2018406385 0 3189380678 0 0 0

$ netstat -I ce0
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
ce0 1500 clusternode2-priv-physical1 clusternode2-priv-physical1 3478616312 8 1086431534 0 0 0

$ kstat -p -s '*nocanput*'
ce:0:ce0:rx_nocanput 186 <-- high packet loss, matching the higher number of gc blocks lost on node 2
ce:0:ce0:tx_nocanput 0
ce:1:ce1:rx_nocanput 0
ce:1:ce1:tx_nocanput 0
ce:2:ce2:rx_nocanput 0
ce:2:ce2:tx_nocanput 0
ce:3:ce3:rx_nocanput 0
ce:3:ce3:tx_nocanput 0
ce:4:ce4:rx_nocanput 0
ce:4:ce4:tx_nocanput 0
ce:5:ce5:rx_nocanput 10
ce:5:ce5:tx_nocanput 0
3) SOLUTION / NEXT STEPS: Increase udp_max_buf and sq_max_size at the Solaris OS level.
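
The kstat check used in the diagnosis can be reduced to the interesting lines with a small filter, which is handy when comparing nodes before and after a tuning change. This is a sketch only. The function name nonzero_rx_nocanput is invented here, and it assumes the `module:instance:name:statistic value` layout shown in the kstat output above.

```shell
# Hypothetical helper: read `kstat -p` style lines on stdin and print
# "<interface> <count>" for every rx_nocanput counter that is above zero.
nonzero_rx_nocanput() {
    awk '
        $1 ~ /:rx_nocanput$/ && $2 + 0 > 0 {
            split($1, key, ":")   # key[3] is the interface name, e.g. ce0
            print key[3], $2
        }
    '
}
```

On each node this would be run as `kstat -p -s '*nocanput*' | nonzero_rx_nocanput`. An empty result means no receive-side drops were counted on any interface.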


Oracle Troubleshooting Snippet - When sys cannot log in - DBA’s Courage Under Fire

February 23, 2009

When sys cannot log in to the database, the database floor becomes as tense as it can be, with panicked production support managers and multiple DBAs trying to help the fellow DBA in distress. While each situation demands its own solution, the steps below can help.
1) Stay calm
2) Log in with -prelim to open a sessionless connection
# sqlplus -prelim / as sysdba
3) SQL> oradebug setmypid
SQL> oradebug hanganalyze 12
4) Examine the trace files in the user_dump_dest directory
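
For step 4, a quick way to find the trace that hanganalyze just wrote is to list the newest .trc files in the dump directory. This is a minimal sketch: the function name recent_traces is invented for illustration, and the directory has to be supplied by the caller since user_dump_dest differs per installation.

```shell
# Hypothetical helper: print the newest *.trc file names in a directory,
# newest first. Usage: recent_traces <dump_dir> [how_many, default 5]
recent_traces() {
    dir=$1
    count=${2:-5}
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # ls -t sorts by modification time, newest first
    ls -t "$dir" | grep '\.trc$' | head -n "$count"
}
```

Calling `recent_traces /path/to/udump 3` (the path is a placeholder) prints the three most recently modified trace files, which is usually enough to spot the hanganalyze output.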

A good start ..
