天道酬勤

Oracle and My Life

无有效备份下恢复数据库一例

4 comments

昨天通宵恢复一个1T的数据库,早上6点成功恢复,立即开始备份。7点回家睡觉下午三点起床,也许这就是一个DBA的生活。本想现在写写恢复过程,不过又想睡觉了,明天好好写吧。

Updated  @ 2010-12-07

一、系统背景:

硬件:    P690_1和P670_1

操作系统:A IX 5.2 ML04

数据库:  Oracle 9.2.0.1.0  2 Nodes RAC  HA

备库P570(非DG): 与DG原理类似,利用生产库的归档日志在P570上恢复

故障时间:2010-11-23  13:20

二、故障现象:

数据库系统在13:19报告SYSTEM表空间空间不够无法自动扩展,因为系统使用RAW设备所以无法自动扩展,需要管理员手工增加RAW设备文件。

Tue Nov 23 13:18:41 2010

Completed: alter database backup controlfile to ‘/p690_1_arch

Tue Nov 23 13:19:20 2010

ORA-1653: unable to extend table WWW.MLOG$_WWW_MONEY_T by 8192 in           tablespace SYSTEM

ORA-1653: unable to extend table WWW.MLOG$_WWW_MONEY_T by 8192 in           tablespace SYSTEM

ORA-1653: unable to extend table WWW.MLOG$_WWW_MONEY_T by 8192 in           tablespace SYSTEM

……

客户工程师在P670_1节点上新增RAW设备并增加到SYSTEM表空间:

RAW设备:/dev/rwwwdb_system04

SQL>alter tablespace system add ‘/dev/rwwwdb_system04’ reuse;

随后系统报错:

ORA-01157: cannot identify/lock data file 424 – see DBWR trace file

ORA-01110: data file 424: ‘/dev/rwwwdb_system04′

ORA-27037: unable to obtain file status

IBM AIX RISC System/6000 Error: 2: No such file or directory

Additional information: 3

Tue Nov 23 14:17:03 2010

Errors in file /oracle/ora92/admin/wwwdb/bdump/wwwdb1_dbw0_516116.trc:

ORA-01186: file 424 failed verification tests

ORA-01157: cannot identify/lock data file 424 – see DBWR trace file

ORA-01110: data file 424: ‘/dev/rwwwdb_system04′

Tue Nov 23 14:17:03 2010

File 424 not verified due to error ORA-01157

dbv检查数据文件,报文件坏块:

DBVERIFY – Verification starting : FILE = /dev/rwwwdb_system04

Page 1 is influx – most likely media corrupt

Corrupt block relative dba: 0×00000001 (file 0, block 1)

Fractured block found during dbv:

Data in bad block:

type: 0 format: 0 rdba: 0×00000000

last change scn: 0×0000.00000000 seq: 0×0 flg: 0×00

spare1: 0×0 spare2: 0×0 spare3: 0×0

consistency value in tail: 0×00000000

check value in block header: 0×0

此时,由于系统表空间不足数据库系统被挂起,欲将增加失败的数据文件离线删除后重启数据库实例报错:

SQL>ALTER DATABASE DATAFILE ‘/dev/rwwwdb_system04′ offline drop;

Tue Nov 23 14:56:18 2010

ORA-1541 signalled during: ALTER DATABASE DATAFILE ‘/dev/rwwwdb_system04′ off…

This instance was first to open

ORA-1147 signalled during: alter database open…

ORA-01122: database file 424 failed verification check

ORA-01110: data file 424: ‘/dev/rwwwdb_system04′

ORA-01251: Unknown File Header Version read for file number 26

不要受[ID 422031.1]ORA-01122, ORA-01251 or data blocks reported as corrupted影响,否则方向错了会花比较多的时间

时间:16:00

由于系统数据量大,在生产库不能短时间恢复情况下的

措施:启动P570备库

结果:启动备库P570失败

原因:在10月19增加数据库文件/dev/rphoto1026时出错,需要恢复。

时间:16:20

措施:利用生产系统RMAN备份来恢复

结果:无法恢复

原因:每晚8点的定时备份任务在14日由于备份空间不足备份失败,后续没有任务备份;系统归档日志保留策略为7天,若要恢复到14日备份前,缺少1天的归档日志文件。

时间:16:30

措施:向公司上级汇报情况,商议决策

结果:与Oracle原厂工程取得系统,商议决策如下:

1.      保护好生产系统现状

2.      使用操作系统级dd命令备份生产系统所有数据文件

3.      将备份的文件用dd命令恢复到备库P570

4.      在备库P570上做恢复测试

时间:24日2:30

措施:在P570备机上重建control file,剔除/dev/rwwwdb_system04数据文件,尝试打开数据库。

结果:open失败

原因:Oracle 9i 系统表空间数据文件不能真正删除(10g开始可以)。

再次如下恢复操作:

重新创建/dev/rwwwdb_system04文件

SQL> alter database create datafile ‘/dev/rwwwdb_system04′ as ‘/dev/rwwwdb_system04′;

结果:open成功

三、生产系统恢复过程:

1.      重建control file

获取重建控制文件的脚本,并修改相应部分

SQL> alter database backup controlfile to trace;

CREATE CONTROLFILE REUSE DATABASE “WWWDB” NORESETLOGS FORCE LOGGING ARCHIVELOG

–  SET STANDBY TO MAXIMIZE PERFORMANCE

MAXLOGFILES 192

MAXLOGMEMBERS 3

MAXDATAFILES 1024

MAXINSTANCES 32

MAXLOGHISTORY 3403

LOGFILE

GROUP 1 ‘/dev/rwwwdb_log24′  SIZE 268430848,

GROUP 2 ‘/dev/rwwwdb_log25′  SIZE 268430848,

GROUP 3 ‘/dev/rwwwdb_log26′  SIZE 268430848,

GROUP 4 ‘/dev/rwwwdb_log27′  SIZE 268430848,

GROUP 5 ‘/dev/rwwwdb_log28′  SIZE 268430848,

GROUP 6 ‘/dev/rwwwdb_log29′  SIZE 268430848,

GROUP 7 ‘/dev/rwwwdb_log14′  SIZE 268430848,

GROUP 8 ‘/dev/rwwwdb_log15′  SIZE 268430848,

GROUP 9 ‘/dev/rwwwdb_log16′  SIZE 268430848,

GROUP 10 ‘/dev/rwwwdb_log17′  SIZE 268430848,

GROUP 11 ‘/dev/rwwwdb_log18′  SIZE 268430848,

GROUP 12 ‘/dev/rwwwdb_log19′  SIZE 268430848

……

2.      修改启动参数文件,将控制文件存放路径临时指定到一个新位置,并启动数据库到 nomount状态

3.      执行重建控制文件的脚本,启动数据库mount状态

SQL>@bldctl.sql

SQL>alter database mount;

4.      创建数据文件/dev/rwwwdb_system04

SQL>alter database create datafile ‘/dev/rwwwdb_system04′ as ‘/dev/rwwwdb_system04′;

5.      重启数据库到mount状态,恢复/dev/rwwwdb_system04文件

SQL>startup mount;

SQL>ALTER DATABASE RECOVER  datafile ‘/dev/rwwwdb_system04′ ;

Wed Nov 24 03:37:47 2010

Media Recovery Datafile: ‘/dev/rwwwdb_system04′

Media Recovery Start

Starting datafile 424 recovery in thread 2 sequence 26019

Datafile 424: ‘/dev/rwwwdb_system04′

Media Recovery Log

Recovery of Online Redo Log: Thread 2 Group 2 Seq 26019 Reading mem 0

Mem# 0 errs 0: /dev/rwwwdb_log25

Recovery of Online Redo Log: Thread 1 Group 12 Seq 31759 Reading mem 0

Mem# 0 errs 0: /dev/rwwwdb_log19

Media Recovery Complete

Completed: ALTER DATABASE RECOVER  datafile ‘/dev/rwwwdb_syst em04′ ;

6.      尝试打开数据库

SQL>alter database open;

报错:ORA-1113 signalled during: alter database open…,数据库需要恢复

7.      恢复数据库

SQL>alter database recover database;

Media Recovery Start

Starting datafile 1 recovery in thread 2 sequence 26019

Datafile 1: ‘/dev/rwwwdb_system’

Starting datafile 2 recovery in thread 2 sequence 26019

Datafile 2: ‘/dev/rwwwdb_index01′

……

Successfully onlined Undo Tablespace 27.

Dictionary check beginning

Tablespace ‘WWW_PHOTO04′ #16 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace ‘WWW_PHOTO06′ #17 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace ‘WWW_PHOTO07′ #18 found in data dictionary,

but not in the controlfile. Adding to controlfile.

……

Wed Nov 24 03:38:45 2010

File #370 found in data dictionary but not in controlfile.

Creating OFFLINE file ‘MISSING00370′ in the controlfile.

File #371 found in data dictionary but not in controlfile.

Creating OFFLINE file ‘MISSING00371′ in the controlfile.

File #372 found in data dictionary but not in controlfile.

8.      重启数据库

SQL>shutdown  immediate;

SQL>startup;

Wed Nov 24 04:00:11 2010

alter database open

This instance was first to open

Picked Lamport scheme to generate SCNs

Wed Nov 24 04:00:11 2010

9.      将控制文件恢复到RAW设备上

SQL>shutdown immediate;

$dd if= /dbbak/cntrl7 of= /dev/rwwwdb_cntrl3 bs=2m

$dd if= /dbbak/cntrl8 of= /dev/rwwwdb_cntrl4 bs=2m

$dd if= /dbbak/cntrl9 of= /dev/rwwwdb_cntrl5 bs=2m

修改启动参数文件initwwwdb1.ora,还原控制文件存放位置

SQL>startup  pfile=?/dbs/initwwwdb1.ora

SQL>create spfile from pfile;

SQL>shutdown immediate;

SQL>startup;

10.  执行在重建控制文件时没有执行的命令,并将数据库中只读表空间online

# Files in read-only tablespaces are now named.

ALTER DATABASE RENAME FILE ‘MISSING00015′

TO ‘/dev/rwww_photo0611′;

ALTER DATABASE RENAME FILE ‘MISSING00016′

TO ‘/dev/rwww_photo0612′;

ALTER DATABASE RENAME FILE ‘MISSING00020′

TO ‘/dev/rwww_photo0626′;

ALTER DATABASE RENAME FILE ‘MISSING00022′

TO ‘/dev/rwww_photo0627′;

ALTER DATABASE RENAME FILE ‘MISSING00023′

……

# Online the files in read-only tablespaces.

ALTER TABLESPACE “WWW_PHOTO04″ ONLINE;

ALTER TABLESPACE “WWW_PHOTO06″ ONLINE;

ALTER TABLESPACE “WWW_PHOTO07″ ONLINE;

ALTER TABLESPACE “WWW_PHOTO05″ ONLINE;

ALTER TABLESPACE “WWW_PHOTO08″ ONLINE;

ALTER TABLESPACE “WWW_PHOTO09″ ONLINE;

# No tempfile entries found to add.

#

至此,生产系统数据库成功打开。

P670_1节点处理:

检查P670_1节点,发现主机访问DS4500存储系统上共享VG出现问题,为数据安全起见,先在P690_1节点备份数据库再处理P670_1节点。

四、故障分析:

在备份成功后,处理P670_1节点发现,DS4500存储系统B控制器坏掉,而引发了增加SYSTEM表空间失败。在更换存储系统控制器后,P670_1节点正常工作。

五、改进与预防措施:

1. 完善备份策略

2. 改善P570备库方式(采用rman备份归档日志,再恢复归档日志)

3. 每天检查生产系统的备份情况,及时发现问题及时处理。

-The End-

Written by ochef

November 24th, 2010 at 9:30 pm

Posted in Backup & Recover,Database

Tagged with ,

4 Responses to '无有效备份下恢复数据库一例'

Subscribe to comments with RSS or TrackBack to '无有效备份下恢复数据库一例'.

  1. 期待~~

    Taom

    25 Nov 10 at 13:06

  2. 连续奋战一天一夜,辛苦了。仔细阅读,体会不少啊
    这里我有一些疑问,

    1、这就是说,最初这一步alter tablespace system add ‘/dev/rwwwdb_system04’ reuse时,
    是由于B存储控制器故障引起的坏块
    2、我觉得,即然datafile都已经添加成功了,DB肯定有数据同步到新添加的文件上了。
    这个时候来OFFLINE DROP,肯定会出问题,restart db时肯定会去找这个文件啊,
    3、如果最初不offline drop,直接alter database create datafile … as …是不是就可以恢复成功了?

    Ning.Tam Jiang

    9 Dec 10 at 0:13

  3. 1.是B控制器故障导致P670_1节点访问到的VG不正常
    2.增加datafile时就报错了
    3.9i中offline drop并没有真正删除这个文件,只是系统忽略该文件的SCN,随时可以online恢复。若当时在p690_1节点上再做alter database create datafile … as …操作就行了,关键是整个系统没有备份了,直接在生产上做是不可取的,不怕一万就怕万一啊,呵呵…

    ochef

    9 Dec 10 at 16:36

  4. 老杨。似乎是可以不用重制控制文件哦?
    直接在mount状态,重建新增的数据文件也可以哦

    Ning.Tam Jiang

    9 Dec 10 at 21:43

Leave a Reply

无觅相关文章插件,快速提升流量