昨天通宵恢复一个1T的数据库,早上6点成功恢复,立即开始备份。7点回家睡觉下午三点起床,也许这就是一个DBA的生活。本想现在写写恢复过程,不过又想睡觉了,明天好好写吧。
Updated @ 2010-12-07
一、系统背景:
硬件: P690_1和P670_1
操作系统:A IX 5.2 ML04
数据库: Oracle 9.2.0.1.0 2 Nodes RAC HA
备库P570(非DG): 与DG原理类似,利用生产库的归档日志在P570上恢复
故障时间:2010-11-23 13:20
二、故障现象:
数据库系统在13:19报告SYSTEM表空间空间不够无法自动扩展,因为系统使用RAW设备所以无法自动扩展,需要管理员手工增加RAW设备文件。
| Tue Nov 23 13:18:41 2010
Completed: alter database backup controlfile to ‘/p690_1_arch Tue Nov 23 13:19:20 2010 ORA-1653: unable to extend table WWW.MLOG$_WWW_MONEY_T by 8192 in tablespace SYSTEM ORA-1653: unable to extend table WWW.MLOG$_WWW_MONEY_T by 8192 in tablespace SYSTEM ORA-1653: unable to extend table WWW.MLOG$_WWW_MONEY_T by 8192 in tablespace SYSTEM …… |
客户工程师在P670_1节点上新增RAW设备并增加到SYSTEM表空间:
RAW设备:/dev/rwwwdb_system04
SQL>alter tablespace system add ‘/dev/rwwwdb_system04’ reuse;
随后系统报错:
| ORA-01157: cannot identify/lock data file 424 – see DBWR trace file
ORA-01110: data file 424: ‘/dev/rwwwdb_system04′ ORA-27037: unable to obtain file status IBM AIX RISC System/6000 Error: 2: No such file or directory Additional information: 3 Tue Nov 23 14:17:03 2010 Errors in file /oracle/ora92/admin/wwwdb/bdump/wwwdb1_dbw0_516116.trc: ORA-01186: file 424 failed verification tests ORA-01157: cannot identify/lock data file 424 – see DBWR trace file ORA-01110: data file 424: ‘/dev/rwwwdb_system04′ Tue Nov 23 14:17:03 2010 File 424 not verified due to error ORA-01157 |
dbv检查数据文件,报文件坏块:
| DBVERIFY – Verification starting : FILE = /dev/rwwwdb_system04
Page 1 is influx – most likely media corrupt Corrupt block relative dba: 0×00000001 (file 0, block 1) Fractured block found during dbv: Data in bad block: type: 0 format: 0 rdba: 0×00000000 last change scn: 0×0000.00000000 seq: 0×0 flg: 0×00 spare1: 0×0 spare2: 0×0 spare3: 0×0 consistency value in tail: 0×00000000 check value in block header: 0×0 |
此时,由于系统表空间不足数据库系统被挂起,欲将增加失败的数据文件离线删除后重启数据库实例报错:
SQL>ALTER DATABASE DATAFILE ‘/dev/rwwwdb_system04′ offline drop;
| Tue Nov 23 14:56:18 2010
ORA-1541 signalled during: ALTER DATABASE DATAFILE ‘/dev/rwwwdb_system04′ off… This instance was first to open ORA-1147 signalled during: alter database open… ORA-01122: database file 424 failed verification check ORA-01110: data file 424: ‘/dev/rwwwdb_system04′ ORA-01251: Unknown File Header Version read for file number 26 |
不要受[ID 422031.1]ORA-01122, ORA-01251 or data blocks reported as corrupted影响,否则方向错了会花比较多的时间
时间:16:00
由于系统数据量大,在生产库不能短时间恢复情况下的
措施:启动P570备库
结果:启动备库P570失败
原因:在10月19增加数据库文件/dev/rphoto1026时出错,需要恢复。
时间:16:20
措施:利用生产系统RMAN备份来恢复
结果:无法恢复
原因:每晚8点的定时备份任务在14日由于备份空间不足备份失败,后续没有任务备份;系统归档日志保留策略为7天,若要恢复到14日备份前,缺少1天的归档日志文件。
时间:16:30
措施:向公司上级汇报情况,商议决策
结果:与Oracle原厂工程取得系统,商议决策如下:
1. 保护好生产系统现状
2. 使用操作系统级dd命令备份生产系统所有数据文件
3. 将备份的文件用dd命令恢复到备库P570
4. 在备库P570上做恢复测试
时间:24日2:30
措施:在P570备机上重建control file,剔除/dev/rwwwdb_system04数据文件,尝试打开数据库。
结果:open失败
原因:Oracle 9i 系统表空间数据文件不能真正删除(10g开始可以)。
再次如下恢复操作:
重新创建/dev/rwwwdb_system04文件
SQL> alter database create datafile ‘/dev/rwwwdb_system04′ as ‘/dev/rwwwdb_system04′;
结果:open成功
三、生产系统恢复过程:
1. 重建control file
获取重建控制文件的脚本,并修改相应部分
SQL> alter database backup controlfile to trace;
| CREATE CONTROLFILE REUSE DATABASE “WWWDB” NORESETLOGS FORCE LOGGING ARCHIVELOG
– SET STANDBY TO MAXIMIZE PERFORMANCE MAXLOGFILES 192 MAXLOGMEMBERS 3 MAXDATAFILES 1024 MAXINSTANCES 32 MAXLOGHISTORY 3403 LOGFILE GROUP 1 ‘/dev/rwwwdb_log24′ SIZE 268430848, GROUP 2 ‘/dev/rwwwdb_log25′ SIZE 268430848, GROUP 3 ‘/dev/rwwwdb_log26′ SIZE 268430848, GROUP 4 ‘/dev/rwwwdb_log27′ SIZE 268430848, GROUP 5 ‘/dev/rwwwdb_log28′ SIZE 268430848, GROUP 6 ‘/dev/rwwwdb_log29′ SIZE 268430848, GROUP 7 ‘/dev/rwwwdb_log14′ SIZE 268430848, GROUP 8 ‘/dev/rwwwdb_log15′ SIZE 268430848, GROUP 9 ‘/dev/rwwwdb_log16′ SIZE 268430848, GROUP 10 ‘/dev/rwwwdb_log17′ SIZE 268430848, GROUP 11 ‘/dev/rwwwdb_log18′ SIZE 268430848, GROUP 12 ‘/dev/rwwwdb_log19′ SIZE 268430848 …… |
2. 修改启动参数文件,将控制文件存放路径临时指定到一个新位置,并启动数据库到 nomount状态
3. 执行重建控制文件的脚本,启动数据库mount状态
SQL>@bldctl.sql
SQL>alter database mount;
4. 创建数据文件/dev/rwwwdb_system04
SQL>alter database create datafile ‘/dev/rwwwdb_system04′ as ‘/dev/rwwwdb_system04′;
5. 重启数据库到mount状态,恢复/dev/rwwwdb_system04文件
SQL>startup mount;
SQL>ALTER DATABASE RECOVER datafile ‘/dev/rwwwdb_system04′ ;
| Wed Nov 24 03:37:47 2010
Media Recovery Datafile: ‘/dev/rwwwdb_system04′ Media Recovery Start Starting datafile 424 recovery in thread 2 sequence 26019 Datafile 424: ‘/dev/rwwwdb_system04′ Media Recovery Log Recovery of Online Redo Log: Thread 2 Group 2 Seq 26019 Reading mem 0 Mem# 0 errs 0: /dev/rwwwdb_log25 Recovery of Online Redo Log: Thread 1 Group 12 Seq 31759 Reading mem 0 Mem# 0 errs 0: /dev/rwwwdb_log19 Media Recovery Complete Completed: ALTER DATABASE RECOVER datafile ‘/dev/rwwwdb_syst em04′ ; |
6. 尝试打开数据库
SQL>alter database open;
报错:ORA-1113 signalled during: alter database open…,数据库需要恢复
7. 恢复数据库
SQL>alter database recover database;
| Media Recovery Start
Starting datafile 1 recovery in thread 2 sequence 26019 Datafile 1: ‘/dev/rwwwdb_system’ Starting datafile 2 recovery in thread 2 sequence 26019 Datafile 2: ‘/dev/rwwwdb_index01′ …… Successfully onlined Undo Tablespace 27. Dictionary check beginning Tablespace ‘WWW_PHOTO04′ #16 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace ‘WWW_PHOTO06′ #17 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace ‘WWW_PHOTO07′ #18 found in data dictionary, but not in the controlfile. Adding to controlfile. …… Wed Nov 24 03:38:45 2010 File #370 found in data dictionary but not in controlfile. Creating OFFLINE file ‘MISSING00370′ in the controlfile. File #371 found in data dictionary but not in controlfile. Creating OFFLINE file ‘MISSING00371′ in the controlfile. File #372 found in data dictionary but not in controlfile. … |
8. 重启数据库
SQL>shutdown immediate;
SQL>startup;
| Wed Nov 24 04:00:11 2010
alter database open This instance was first to open Picked Lamport scheme to generate SCNs Wed Nov 24 04:00:11 2010 |
9. 将控制文件恢复到RAW设备上
SQL>shutdown immediate;
$dd if= /dbbak/cntrl7 of= /dev/rwwwdb_cntrl3 bs=2m
$dd if= /dbbak/cntrl8 of= /dev/rwwwdb_cntrl4 bs=2m
$dd if= /dbbak/cntrl9 of= /dev/rwwwdb_cntrl5 bs=2m
修改启动参数文件initwwwdb1.ora,还原控制文件存放位置
SQL>startup pfile=?/dbs/initwwwdb1.ora
SQL>create spfile from pfile;
SQL>shutdown immediate;
SQL>startup;
10. 执行在重建控制文件时没有执行的命令,并将数据库中只读表空间online
| # Files in read-only tablespaces are now named.
ALTER DATABASE RENAME FILE ‘MISSING00015′ TO ‘/dev/rwww_photo0611′; ALTER DATABASE RENAME FILE ‘MISSING00016′ TO ‘/dev/rwww_photo0612′; ALTER DATABASE RENAME FILE ‘MISSING00020′ TO ‘/dev/rwww_photo0626′; ALTER DATABASE RENAME FILE ‘MISSING00022′ TO ‘/dev/rwww_photo0627′; ALTER DATABASE RENAME FILE ‘MISSING00023′ …… # Online the files in read-only tablespaces. ALTER TABLESPACE “WWW_PHOTO04″ ONLINE; ALTER TABLESPACE “WWW_PHOTO06″ ONLINE; ALTER TABLESPACE “WWW_PHOTO07″ ONLINE; ALTER TABLESPACE “WWW_PHOTO05″ ONLINE; ALTER TABLESPACE “WWW_PHOTO08″ ONLINE; ALTER TABLESPACE “WWW_PHOTO09″ ONLINE; # No tempfile entries found to add. # |
至此,生产系统数据库成功打开。
P670_1节点处理:
检查P670_1节点,发现主机访问DS4500存储系统上共享VG出现问题,为数据安全起见,先在P690_1节点备份数据库再处理P670_1节点。
四、故障分析:
在备份成功后,处理P670_1节点发现,DS4500存储系统B控制器坏掉,而引发了增加SYSTEM表空间失败。在更换存储系统控制器后,P670_1节点正常工作。
五、改进与预防措施:
1. 完善备份策略
2. 改善P570备库方式(采用rman备份归档日志,再恢复归档日志)
3. 每天检查生产系统的备份情况,及时发现问题及时处理。
-The End-
期待~~
Taom
25 Nov 10 at 13:06
连续奋战一天一夜,辛苦了。仔细阅读,体会不少啊
这里我有一些疑问,
1、这就是说,最初这一步alter tablespace system add ‘/dev/rwwwdb_system04’ reuse时,
是由于B存储控制器故障引起的坏块
2、我觉得,即然datafile都已经添加成功了,DB肯定有数据同步到新添加的文件上了。
这个时候来OFFLINE DROP,肯定会出问题,restart db时肯定会去找这个文件啊,
3、如果最初不offline drop,直接alter database create datafile … as …是不是就可以恢复成功了?
Ning.Tam Jiang
9 Dec 10 at 0:13
1.是B控制器故障导致P670_1节点访问到的VG不正常
2.增加datafile时就报错了
3.9i中offline drop并没有真正删除这个文件,只是系统忽略该文件的SCN,随时可以online恢复。若当时在p690_1节点上再做alter database create datafile … as …操作就行了,关键是整个系统没有备份了,直接在生产上做是不可取的,不怕一万就怕万一啊,呵呵…
ochef
9 Dec 10 at 16:36
老杨。似乎是可以不用重制控制文件哦?
直接在mount状态,重建新增的数据文件也可以哦
Ning.Tam Jiang
9 Dec 10 at 21:43