天道酬勤

Oracle and My Life

Archive for the ‘Troubleshooting’ Category

ORA-00600:[2103],[1],[0],[1],[900]的处理

3 comments

上周四中午刚放下订餐电话,手机响起鸟,客户的一套AIX 5.2 ML04  +  Oracle 10.2.0.3.0 (+Dataguard)的系统挂了,业务中断。收拾东西下楼打车赶往客户机房,大致半小时后到达客户机房登录系统检查发现系统报:“ORA-00600: internal error code, arguments: [2103], [1], [0], [1], [900]”错误。[2103]的错误与controlfile有关,[900]是等待的超时时间S。进一步检查发现,问题是出现在每天晚上0点都有crontab调度的系统备份任务:0点将系统置为热备模式—>连接FTP服务器—>CP需要备份的所有件—>最后将系统退出热备模式。metalink [ID 567891.1]说:系统处在热备份模式下,在日志发生切换时获取控制文件队列信息时发生了超时,这个等待时间默认是15分钟(900s)。虽然这里有一个Bug 6018274,但我的习惯是只有在做完了尽可能的尝试之后再认为是bug。

继续检查发现,系统一直在进行kill动作:“Killing enqueue blocker (pid=827572) on resource CF-00000000-00000000”,但遗憾的是这个动作失败了,尝试手工kill,失败。而且此时在系统中只要一发起与log相关的动作系统就挂起,无赖之下想重启oracle(这个想法很邪恶啊,如果待会儿起不来,how?),FT……正常关闭失败(意料中的),abort关机失败,回到OS中kill -9失败。万般无赖之下,只能重启OS,没想到的是AIX重启到一般的时候挂起,此时已经13:30了,还没有吃饭(12点客户叫俺一起去吃饭看到她那表情俺也有点不好意思。)不管了,先搞定系统再说吧。由于这套系统是P690上的一个分区,只要在HMC上使用“Hard Reset”选项了,鼠标移到此分区上检查了又检查了,俺视力5.2的说,但就怕这时看错了的话,后果,你知道的。在这前后几分钟,真有心跳加速的感觉,就怕待会儿库打不开啊。10分钟之后,my god,库顺利的打开了,哎呀,肚子不饿了,起应用一切顺利,先让别人开工吧。

造成ORA-00600 [2103]错误的原因有下:

1) 控制文件存放在I/O非常慢的存储系统上。

2) 频繁的日志切换,或日志文件过小或日志文件组数目过少。

3) 同时使用了异步I/O或多个数据写进程。

4) Oracle软件内部Bug。

5) OS/硬件问题。

讯问过客户工程师,确实在前一天18:30的时候进行数据整理,发生过非常频繁的日志切换,高峰时1分钟达5次日志切换,但在其后一天是正常的,由于环境复杂,暂时先做如下测试:(以影响业务程度最小化且实施难易程度依次列出)

1. 增大日志组成员大小,同时增加日志组数目。

2. 由于业务的特殊性,商讨能否修改备份策略来避开热备份。

3. 根据官方建议打补丁:Patch 5923866,在失效的情况下,将系统升级到10.2.0.4.0之上(最新分布为10.2.0.5.0)

4. 将操作系统从AIX 5.2 ML04升级到AIX 5.2 ML05(Oracle在[ID 406191.1]中建议)

BTW:在我重启完系统后问客户,此套系统应该有Dataguard的吧,答案是肯定的。到这里,可能大家会说为什么不在出问题的时候将系统切换到备机上。其实切换系统更多决定因素不仅仅只在技术层面的。

-The End-

Written by ochef

July 19th, 2010 at 10:00 am

Posted in Troubleshooting

Tagged with

AIX network buffer参数设置引起RAC故障

one comment

故障日期:2010年3月23日 11:30 AM

生产环境:数据库:  Oracle 10.2.0.4  2Nodes RAC

操作系统:AIX 5309

故障现象:

现象1:在23日上午11:30,发现应用系统不能连接到RAC的实例1(Instance name:int1),此时实例2是正常的。

现象2:使用辅助工具TOAD也不能连接到实例1。

现象3:在实例1所在机器的本地使用SQLPLUS可以连接到实例1,此时也能在实例2上通过TNS连接到实例1。

现象4:15:19重启实例1后系统恢复正常。

故障分析:

1.根据实例1的alert log日志记载,在11:30记录的错误如下:

Tue Mar 23 11:30:08 2010

WARNING: inbound connection timed out (ORA-3136)

Tue Mar 23 11:32:05 2010

WARNING: inbound connection timed out (ORA-3136)

首先来了解ORA-3136这个错误,该错误表示客户端在sqlnet.ora文件中SQLNET.INBOUND_CONNECT_TIMEOUT参数定义的时间内没有完成登录认证,该参数默认值为60S,据Oracle官方文档记载,此默认值能够满足绝大多数条件;此外该错误还涉及到listener.ora文件中定义的参数INBOUND_CONNECT_TIMEOUT_LISTENER,Oracle 10.2.0.1之前默认值为0,从10.2.0.1开始默认值为60S,根据alert log日志记录的其它信息,目前暂时排除实例1的错误是由以上参数造成。

2.Alert log还记载

……

Tue Mar 23 12:15:36 2010

Errors in file /soft/oracle/admin/int/udump/int1_ora_2617378.trc:

ORA-00600: internal error code, arguments: [12333], [7], [2], [49], [], [], [], []

……

根据Oracle metalink文档[ID 35928.1]描述:“Fatal Two-Task Protocol Violation”

ORA-600 [12333]描述收到一个没有经过验证的无效的网络数据包,这里有二个可能:一是客户端多线程的应用发送了一个无顺序的OCI调用请求,二是网络缓冲区中的数据可能被覆盖,进一步查看trace文件,可以看到每个trace文件的开关处都有:PROTOCOL VIOLATION DETECTED。

另外,由贵行的带内网管软件Tivoli监控到故障当时RAC心跳网络(ent8)的通信流量信息证明,当时心跳网络流量确实比正常情况下高,RAC 采用UDP 协议进行节点间的互联通信,查询系统统计如下:

RACDB1# netstat -p udp -s

udp:

574337869 datagrams received

0 incomplete headers

0 bad data length fields

0 bad checksums

169617 dropped due to no socket

32335 broadcast/multicast datagrams dropped due to no socket

243 socket buffer overflows

574135674 delivered

500048775 datagrams output

RACDB2# netstat -p udp -s

udp:

500187207 datagrams received

0 incomplete headers

0 bad data length fields

0 bad checksums

171357 dropped due to no socket

32333 broadcast/multicast datagrams dropped due to no socket

2108 socket buffer overflows

499981409 delivered

574427147 datagrams output

以上信息可以看到,由于系统网络参数network buffer设置不当出现通信问题,查看涉及network buffer大小的参数:

#no -a |pg

sb_max = 1310720

udp_recvspace = 655360

udp_sendspace = 65536

sb_max被用来指定允许的TCP和UDP socket的最大缓冲区大小,默认值为1048576 bytes,1048576 bytes,很显然,udp_recvspace与udp_sendspace设置不对称且sb_max参数设置过小。

3.ORA-600 [12333]的错误也可以由JDBC驱动版本与Oracle数据库版本不一致造成,但贵行此套系统已上线很久,由此可以暂时先排除该原因。另外,根据trace文件的记录,在故障期间有大量的UNION联合查询操作,而这种大量的UNION操作会增加节点间的通信,ashrpt的报告也证实了gc buffer busy随故障时间增加,到最后被剔出RAC降下来。

初步结论:

基于以上情况分析,现初步判断此次故障为:由系统网络buffer参数设置不当引起RAC 节点间的互联网络故障,而节点间的互联网络用于协调各个节点的运行,包括全局锁(global locking) ,队列(enqueue) 和缓存管理(buffer cache management),建议udp_sendspace 的起始值为db_block_size * db_file_multiblock_read_count ,udp_recvspace 设为udp_sendspace 的4 倍,上限为1048576 。如果发生socket 缓存溢出( 可通过 netstat -s | grep “socket buffer overflows” 命令察看) udp_recvspace 参数值需要增加,netstat -p udp -s的结果也证实了这一点。

BTW:这里还有EYGLE大师的文章供参考:

IBM AIX Oracle 9i RAC 性能因素 – udp及其他

-The End-

Written by ochef

March 26th, 2010 at 12:49 pm

Posted in Troubleshooting

Tagged with ,

RMAN ORA-19504、ORA-27038错误解决方法

leave a comment

一客户目前正在上IBM TSM的项目,现在要求临时用RMAN将生产库的数据备份到本地, rman常规备份很简单,给了客户如下的脚本:

run{

allocate channel c1 type disk;

allocate channel c2 type disk;

backup full tag ‘dbfull’ format ‘/oracle/app/backup/full%Y_%M_%t_A’ database;

sql ‘alter system archive log current’;

backup filesperset 3 format ‘/oracle/app/backup/arch%Y_%M_%t_A’ archivelog all;

release channel c1;

release channel c2;

}

但昨天收到客户电话说备份有点问题,出错信息如下:

released channel: c1

released channel: c2

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571:

===========================================================

RMAN-03002: failure of release command at 08/01/2009 17:12:44

RMAN-06012: channel: c2 not allocated

系统提示没有分配channel c2,报要创建的文件已经存在,我在虚拟机分别在9i、10g的环境中测试也有类似的错误:

9i:

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-03009: failure of backup command on c1 channel at 08/04/2009 09:27:03

ORA-19504: failed to create file “/home/oracle/backup/arch2009_08_693998821″

ORA-27038: skgfrcre: file exists

10g:

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-03009: failure of backup command on c1 channel at 08/04/2009 08:34:43

ORA-19504: failed to create file “/home/oracle/backup/full2009_08_A”

ORA-27038: created file already exists

这个问题第一次在客户的测试机上执行是成功的,第二次报错,在我自己测试的时候,全新安装的ORACLE也有这样的问题,为什么会出现这样的问题,请看下面的分析:

1.错误中指定要创建的文件已经,而实际上是并不存在,也没有用先前的备份的数据恢复过数据库。考虑是不是在OS层面删除过备份文件,而没有在RMAN中更新,试着更新RMAN:

RMAN>crosscheck archivelog all;

RMAN>delete expired archivelog all;

RMAN>crosscheck backup;

RMAN>delete expired backup;

做了更新之后,问题依旧。

2.会不会是目录(/home/oracle/backup)权限的问题,修改目录权限:

#chmod –R 775 /home/oracle/backup

再执行备份,问题仍旧。

3.试着让oracle自动分配通道,但配置RMAN的并行度为2

RMAN>configure device type disk parallelism2;

尝试着再次备份,问题仍然是channel c2没有分配通道。

此时,伟大的google对此问题已经搜不到有用的信息,没办法,只能找metalink了,在Doc ID: 1082911.6

上找到答案,原文如下:

Solution Description:

=====================

Insert a %U into the format portion of the backup script in order to ensure a unique backup file name.

Problem Explanation:

====================

These errors occur because a backup file name already exists by the name specified in you backup script. For instance, if you are using the line “allocate channel c1 type disk format

‘/oracle/database/rman/backup/df_%d_%p_%c’;”, df_%d_%p_%c formats the backupstring like so;

df_ is simply a name. This could be any set of characters. In this case it means

database full.

%d_ is the database sid.

%p_is the backup piece number within the backup set.

%c_ specifies the copy number of the backup piece within a set of duplexed backup pieces.

There needs to be a %U added to the format string.

%U_ specifies a convenient shorthand that guarantees uniqueness in generated backup filenames. So, if the string were “db_%d_%U_%p_%c a unique name would be generated and it would not be necessary to either rename or move the backup file name prior to the next backup.

If the format is changed to include the %U, for instance;

allocate channel c1 type disk format ‘/oracle/database/rman/backup/df_%d_%U_%p_%c’;

The backup file will automatically have a unique name generated like;

df_JTG_0eblfm65_1_1_1_1

The next one would look like;

df_JTG_0fblfm76_1_1_1_1

增加%U参数后再次尝试备份:

RMAN> run{

2>  allocate channel c1 type disk;

3>  allocate channel c2 type disk;

4>  backup full tag ‘dbfull’ format ‘/home/oracle/backup/full%Y_%M_%U_A’ database;

5>  sql ‘alter system archive log current’;

6>  backup filesperset 3  format ‘/home/oracle/backup/arch%Y_%M_%U_A’ archivelog all;

7>  release channel c1;

8>  release channel c2;

9>  }

released channel: ORA_DISK_1

released channel: ORA_DISK_2

allocated channel: c1

channel c1: sid=144 devtype=DISK

allocated channel: c2

channel c2: sid=146 devtype=DISK

……

……

channel c1: starting piece 1 at 04-AUG-09

channel c1: finished piece 1 at 04-AUG-09

piece handle=/home/oracle/backup/arch2009_08_3pklr29q_1_1_A tag=TAG20090804T083659 comment=NONE

channel c1: backup set complete, elapsed time: 00:00:02

Finished backup at 04-AUG-09

released channel: c1

released channel: c2

RMAN>

至此,备份成功问题得到解决。

-The End-

Written by ochef

August 4th, 2009 at 11:34 am

EXP-00056 bug解决方法

leave a comment

今天在测试环境:Linux + Oracle 9.2.0.5.0帮开发人员把某个用户的所有数据全部EXP时,遇到下面的错误:

EXP-00056: ORACLE error 1422 encountered

ORA-01422: exact fetch returns more than requested number of rows

ORA-06512: at “XDB.DBMS_XDBUTIL_INT”, line 52

ORA-06512: at line 1

EXP-00056: ORACLE error 1422 encountered

ORA-01422: exact fetch returns more than requested number of rows

ORA-06512: at “XDB.DBMS_XDBUTIL_INT”, line 52

ORA-06512: at line 1

EXP-00000: Export terminated unsuccessfully

查了metalink解释如下:

This has been logged as bug 2685696. (viewable in Metalink). The issue is that the package body XDB.DBMS_XDBUTIL_INT is doing a SELECT object_type FROM all_objects and is expecting to return only 1 row. However, if you have a table and trigger of other object, named the same, it returns 2 rows, which causes the error.

Currently a patch for the Windows Platform is not available. The workaround is to rename one of the objects. You can also log an iTAR requesting a one-off patch be created for your platform.

原因是由同名的trigger引起的,解决问题的办法有二个:

1.使用下面的SQL语句找出系统中有问题的trigger,然后将其名称修正即可

SELECT object_type,owner||'.'||object_name "OWNER.OBJECT_NAME"
          FROM dba_objects
         WHERE owner||object_name IN
               (SELECT o.owner||object_name FROM dba_objects o, dba_triggers t
                    WHERE object_name = trigger_name
                    GROUP BY o.owner||object_name HAVING COUNT(*) > 1);

2.将系统升级至9.2.0.8.0即可

-The End-

Written by ochef

May 8th, 2009 at 1:49 pm

Posted in Troubleshooting

Tagged with

无觅相关文章插件,快速提升流量