Oracle 11gR2 RAC: Only One Node Starts After a Reboot (3)

###db01 /var/log/messages
Sep 29 15:29:27 db01 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Sep 29 15:29:27 db01 kernel: bonding: bond1: making interface eth2 the new active one.
Sep 29 15:29:31 db01 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:29:31 db01 kernel: bonding: bond1: link status definitely up for interface eth3.
Sep 29 15:31:28 db01 kernel: igb: eth2 NIC Link is Down
Sep 29 15:31:28 db01 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Sep 29 15:31:28 db01 kernel: bonding: bond1: making interface eth3 the new active one.
Sep 29 15:31:28 db01 kernel: igb: eth3 NIC Link is Down
Sep 29 15:31:29 db01 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Sep 29 15:31:29 db01 kernel: bonding: bond1: now running without any active interface !
Sep 29 15:31:54 db01 kernel: igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:31:54 db01 kernel: bonding: bond1: link status definitely up for interface eth2.
Sep 29 15:31:54 db01 kernel: bonding: bond1: making interface eth2 the new active one.
Sep 29 15:31:54 db01 kernel: bonding: bond1: first active interface up!
Sep 29 15:31:54 db01 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:31:54 db01 kernel: bonding: bond1: link status definitely up for interface eth3.
Sep 29 15:36:10 db01 shutdown[17047]: shutting down for system reboot
Sep 29 15:36:11 db01 gconfd (root-6536): Received signal 15, shutting down cleanly
Sep 29 15:36:11 db01 gconfd (root-6536): Exiting

###db02 /var/log/messages
Sep 29 15:36:54 db02 kernel: igb: eth2 NIC Link is Down
Sep 29 15:36:54 db02 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Sep 29 15:36:54 db02 kernel: bonding: bond1: making interface eth3 the new active one.
Sep 29 15:36:55 db02 kernel: igb: eth3 NIC Link is Down
Sep 29 15:36:55 db02 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Sep 29 15:36:55 db02 kernel: bonding: bond1: now running without any active interface !
Sep 29 15:37:10 db02 kernel: igb: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:37:10 db02 kernel: igb: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Sep 29 15:37:10 db02 kernel: bonding: bond1: link status definitely up for interface eth2.
Sep 29 15:37:10 db02 kernel: bonding: bond1: making interface eth2 the new active one.
Sep 29 15:37:10 db02 kernel: bonding: bond1: first active interface up!
Sep 29 15:37:10 db02 kernel: bonding: bond1: link status definitely up for interface eth3.

Problem analysis:

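From the detailed logs above, the sequence of events is easy to follow: after the shutdown was issued on db01, the ocssd and crsd processes each sent a message to the remote node announcing that the local node was about to shut down, and only then were the individual processes stopped.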
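To line up the ocssd and crsd messages with the shutdown on db01, it helps to cut each log down to the same time window. The following is a minimal sketch, assuming the 11gR2 log layout ($GRID_HOME/log/<hostname>/cssd/ocssd.log and $GRID_HOME/log/<hostname>/crsd/crsd.log) and that each entry starts with a "YYYY-MM-DD HH:MM:SS" timestamp; the function name log_window is hypothetical.

```sh
#!/bin/sh
# log_window FILE START END  (hypothetical helper)
# Print the entries of FILE whose leading "YYYY-MM-DD HH:MM:SS" timestamp
# falls between START and END (inclusive). Lines without a timestamp are
# treated as continuations of the previous entry and follow its decision.
log_window() {
    awk -v s="$2" -v e="$3" '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
        ts = substr($0, 1, 19)          # "YYYY-MM-DD HH:MM:SS"
        # Fixed-width, most-significant-first format, so a plain string
        # comparison orders timestamps correctly.
        inwin = (ts >= s && ts <= e)
    }
    inwin { print }
    ' "$1"
}
```

For example, on db02 one might run log_window "$GRID_HOME/log/db02/cssd/ocssd.log" "<date> 15:36:00" "<date> 15:38:00", filling in the date, which the logs in this post do not show.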

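On node 2, the messages log shows that the private-network interface bond1 had been Down; at 15:37, after the first node was shut down (shutdown -immediate), the private bond1 unexpectedly came back up on its own. Right after that, the ocssd and crsd logs show the cluster processes starting.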
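The bond1 transitions described above can be pulled out of a large messages file with two small filters. A minimal sketch, assuming the syslog line format shown in the logs above (messages from the igb and bonding drivers); the function names bond_outages and nic_link_ups are hypothetical. The second filter also makes it easy to compare the negotiated link speed between nodes (the db02 log above, for instance, shows 100 Mbps at 15:37:10).

```sh
#!/bin/sh
# Hypothetical helpers for scanning a syslog file such as /var/log/messages.

# bond_outages FILE [BOND]
# Print the timestamp and host of every line where the bonding driver
# reports that BOND (default: bond1) has no active slave left.
bond_outages() {
    awk -v pat="bonding: ${2:-bond1}: now running without any active interface" \
        'index($0, pat) { print $1, $2, $3, $4 }' "$1"
}

# nic_link_ups FILE
# Print "time interface speed unit" for every igb "NIC Link is Up" line.
# Fields: Mon Day Time Host kernel: igb: IFACE NIC Link is Up SPEED Mbps ...
nic_link_ups() {
    awk '/igb: [^ ]+ NIC Link is Up/ { print $3, $7, $12, $13 }' "$1"
}
```

Usage: bond_outages /var/log/messages, and nic_link_ups /var/log/messages, on each node.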

At this point, the analysis points to the private network, possibly a problem with the NIC bonding.
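Before digging into the NICs, it is worth confirming which interface Clusterware actually registers as the private interconnect. oifcfg getif lists each interface with its subnet and type, and the filter below keeps only the cluster_interconnect entries. A minimal sketch; the function name interconnect_ifaces is hypothetical.

```sh
#!/bin/sh
# interconnect_ifaces  (hypothetical helper)
# Read "oifcfg getif" output on stdin, whose lines look like
#   <iface> <subnet> global public|cluster_interconnect
# and print "<iface> <subnet>" for each private interconnect entry.
interconnect_ifaces() {
    awk '$NF == "cluster_interconnect" { print $1, $2 }'
}
```

Usage: su - grid -c "oifcfg getif" | interconnect_ifaces. In this setup one would expect bond1 on the 10.10.11.0 subnet, an assumption based on the pri01/pri02 addresses shown below.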

Troubleshooting:

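Since the logs point to a network problem, we started troubleshooting there. Once node 1 had rebooted, we first pinged across the private network to confirm. Node 1 was up, but once again the cluster services had not started: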

Pinging node 2's private address fails:
[root@db01 ~]# ping pri02
PING pri02.xmtvdb.com (10.10.11.2) 56(84) bytes of data.
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=193 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=194 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=195 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=197 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=198 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=199 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=201 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=202 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=203 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=204 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=205 Destination Host Unreachable
From pri01.xmtvdb.com (10.10.11.1) icmp_seq=206 Destination Host Unreachable
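When this check is scripted, for example to run it from both nodes after each reboot, a short summary is easier to read than the raw ping output. A minimal sketch; ping_summary is a hypothetical name, and it simply counts echo replies and "Destination Host Unreachable" lines. When that error is reported by the local host itself (pri01 here), it usually means the ARP request for the peer got no answer, which points at layer 2 rather than routing.

```sh
#!/bin/sh
# ping_summary  (hypothetical helper)
# Read ping output on stdin and print how many echo replies and how many
# "Destination Host Unreachable" errors it contains.
ping_summary() {
    awk '
    / bytes from /                 { ok++ }
    /Destination Host Unreachable/ { bad++ }
    END { printf "replies=%d unreachable=%d\n", ok, bad }
    '
}
```

Usage: ping -c 5 -I bond1 pri02 | ping_summary, where -I forces the probe out through the private bond.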

Checking the bonding status on both nodes shows nothing wrong:
###db01
[root@db01 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 40:f2:e9:db:c9:c4

Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 40:f2:e9:db:c9:c5

###db02
[root@db02 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 40:f2:e9:db:c9:fc

Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 40:f2:e9:db:c9:fd
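Reading /proc/net/bonding/bond1 by eye on two nodes is error-prone, so the relevant fields can be reduced to one line per slave. This makes it easy to compare the currently active slave and the link failure counters on db01 and db02 side by side. A minimal sketch, assuming the field layout of the v3.4.0 bonding driver output shown above; bond_slaves is a hypothetical name.

```sh
#!/bin/sh
# bond_slaves FILE  (hypothetical helper)
# Summarise a /proc/net/bonding/<bond> file as:
#   active <currently active slave>
#   <slave> <MII status> <speed in Mbps> <link failure count>
bond_slaves() {
    awk '
    /^Currently Active Slave:/ { print "active", $4 }
    /^Slave Interface:/        { slave = $3 }
    # The bond-level "MII Status" line comes before any slave, so only
    # per-slave fields (slave already set) are recorded.
    slave != "" && /^MII Status:/ { mii = $3 }
    slave != "" && /^Speed:/      { speed = $2 }
    slave != "" && /^Link Failure Count:/ { print slave, mii, speed, $4 }
    ' "$1"
}
```

Usage: bond_slaves /proc/net/bonding/bond1, run on each node.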

Next, we tried taking down eth3, one of the bond1 slaves on node 2. After that the ping succeeded and the cluster was able to start.
###db02
[root@db02 ~]# ifdown eth3
Sep 29 15:40:55 db02 kernel: bonding: bond1: Removing slave eth3
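ifdown and ifup go through the distribution's network scripts. On kernels whose bonding driver exposes its sysfs interface, roughly the same experiment can be done directly on the bond, and in active-backup mode the active slave can be switched without releasing either NIC. A minimal sketch; the function names are hypothetical, and BOND_SYSFS exists only so the functions can be pointed at a scratch directory for testing. The slaves and active_slave files are the bonding driver's sysfs controls; writing to them requires root.

```sh
#!/bin/sh
# Hypothetical helpers around the bonding driver's sysfs controls.
# BOND_SYSFS defaults to the real location; override it only for testing.
BOND_SYSFS=${BOND_SYSFS:-/sys/class/net}

# bond_release BOND SLAVE: detach SLAVE from BOND
# (the kernel then logs "Removing slave ...", as in the line above).
bond_release() {
    printf '%s\n' "-$2" > "$BOND_SYSFS/$1/bonding/slaves"
}

# bond_enslave BOND SLAVE: attach SLAVE to BOND again.
bond_enslave() {
    printf '%s\n' "+$2" > "$BOND_SYSFS/$1/bonding/slaves"
}

# bond_set_active BOND SLAVE: in active-backup mode, make SLAVE the
# active slave while the other slave stays enslaved as the backup.
bond_set_active() {
    printf '%s\n' "$2" > "$BOND_SYSFS/$1/bonding/active_slave"
}
```

For example, bond_set_active bond1 eth2 on db02 forces traffic onto eth2 without removing eth3 from the bond.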

###db01
[root@db01 ~]# ping pri02
PING pri02.xmtvdb.com (10.10.11.2) 56(84) bytes of data.
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=1 ttl=64 time=0.071 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=2 ttl=64 time=0.122 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=3 ttl=64 time=0.134 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=4 ttl=64 time=0.098 ms

At the same time, the cluster services came up:
[root@db01 ~]# su - grid -c "crsctl status res -t"
--------------------------------------------------------------------------------
NAME          TARGET  STATE        SERVER                  STATE_DETAILS     
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.BAK001.dg
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
ora.DATA001.dg
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
ora.FRA001.dg
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
ora.LISTENER.lsnr
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
ora.OCR_VOTE.dg
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
ora.asm
              ONLINE  ONLINE      db01                    Started           
              ONLINE  ONLINE      db02                    Started           
ora.gsd
              OFFLINE OFFLINE      db01                                       
              OFFLINE OFFLINE      db02                                       
ora.net1.network
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
ora.ons
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
ora.registry.acfs
              ONLINE  ONLINE      db01                                       
              ONLINE  ONLINE      db02                                       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE      db01                                       
ora.cvu
      1        ONLINE  ONLINE      db01                                       
ora.db01.vip
      1        ONLINE  ONLINE      db01                                       
ora.db02.vip
      1        ONLINE  ONLINE      db02                                       
ora.oc4j
      1        ONLINE  ONLINE      db01                                       
ora.scan1.vip
      1        ONLINE  ONLINE      db01                                       
ora.xmman.db
      1        ONLINE  ONLINE      db01                    Open               
      2        ONLINE  ONLINE      db02                    Open               
ora.xmman.taf.svc
      1        ONLINE  ONLINE      db01                                       
      2        ONLINE  ONLINE      db02                                       
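With many resources, checking crsctl status res -t output by eye is slow. The filter below prints only resources whose TARGET is ONLINE but whose STATE is something else, i.e. what is still missing after a node restart. A minimal sketch, assuming the tabular layout shown above (resource name on its own line, followed by one status line per server or instance); not_online is a hypothetical name.

```sh
#!/bin/sh
# not_online  (hypothetical helper)
# Read "crsctl status res -t" output on stdin and print
#   <resource> <server> <state>
# for every entry whose TARGET is ONLINE but whose STATE is not ONLINE.
not_online() {
    awk '
    /^-+[ \t]*$/ { next }                 # separator lines
    /^[^ \t]/    { name = $1; next }      # resource name (or section header)
    NF >= 2 {
        if ($1 ~ /^[0-9]+$/) {            # cluster resource: instance number first
            target = $2; state = $3; server = $4
        } else {                          # local resource: one line per server
            target = $1; state = $2; server = $3
        }
        if (server == "") server = "-"
        if (target == "ONLINE" && state != "ONLINE") print name, server, state
    }
    '
}
```

Usage: su - grid -c "crsctl status res -t" | not_online. Empty output means every resource that should be online is online; resources such as ora.gsd, whose target is OFFLINE, are ignored.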

Bringing eth3 back up again had no effect; the private address stays reachable:
###db02
[root@db02 ~]# ifup eth3

###db01
[root@db01 ~]# ping pri02
PING pri02.xmtvdb.com (10.10.11.2) 56(84) bytes of data.
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=1 ttl=64 time=0.161 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=2 ttl=64 time=0.022 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=3 ttl=64 time=0.034 ms
64 bytes from pri02.xmtvdb.com (10.10.11.2): icmp_seq=4 ttl=64 time=0.196 ms

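Following Oracle best practice, we then moved the two heartbeat cables, which had been connected directly between the nodes, onto a switch, and the problem has not recurred. The root cause is still unknown; if you know it, please let us know.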

Copyright notice: unless otherwise noted, all articles on this site are original.

When reposting, please credit the source: https://www.heiqu.com/61a7f66e756278f3cb6b3d417602b5d5.html