How do I solve a connection sync problem between LVS director nodes?

Question: I set up a local test system to test HTTP and MySQL clusters, but ran into a connection sync problem between the LVS director nodes. The problem has since been solved :)

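Environment:
LVS Director:
VIP: 192.168.100.1:80 for the HTTP cluster
VIP: 192.168.100.1:3306 for the MySQL cluster
Node 1: lvsdirector, IP address 192.168.100.20 (this node acts as the master)
Node 2: sqlnode1, IP address 192.168.100.30 (this node acts as the backup director and also serves as an SQL node; don't be misled by the node name)
Real Server:
Real Server 1: testnode, IP address 192.168.100.10 (runs both the HTTP and MySQL services)
Real Server 2: sqlnode, IP address 192.168.100.40 (runs both the HTTP and MySQL services)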

Operating system:
All nodes run CentOS 4.1 with kernel 2.6.9-34.EL.

LVS configuration:
The setup runs Ultra Monkey with LVS/DR.
Output of ipvsadm -l on the lvsdirector node:

[root@lvsdirector ha.d]# ipvsadm -l
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.168.100.1:http wlc
-> sqlnode:http Route 1 0 0
-> testnode:http Route 1 0 0
TCP 192.168.100.1:mysql wrr
-> testnode:mysql Route 10 0 0
-> sqlnode:mysql Route 10 0 0
[root@lvsdirector ha.d]#

Output of ipvsadm -l --daemon on the lvsdirector node:

[root@lvsdirector ha.d]# ipvsadm -l --daemon
master sync daemon (mcast=eth0, syncid=0)
[root@lvsdirector ha.d]#

Output of ipvsadm -l --daemon on the sqlnode1 node (the backup node):

[root@sqlnode1 ha.d]# ipvsadm -l --daemon
backup sync daemon (mcast=eth0, syncid=0)
[root@sqlnode1 ha.d]#

Test results:
HTTP load balancing test: OK :)
MySQL load balancing test: OK :)
ldirectord health check for HTTP: OK :)
ldirectord health check for MySQL: OK :)
Failover between the LVS director nodes: OK :)
Connection sync between the LVS director nodes: not working :(

Output of ipvsadm -l --daemon on the lvsdirector node:

[root@lvsdirector ha.d]# ipvsadm -l --daemon
master sync daemon (mcast=eth0, syncid=0)
[root@lvsdirector ha.d]#

Output of ipvsadm -l --daemon on the sqlnode1 node (the backup node):

[root@sqlnode1 ha.d]# ipvsadm -l --daemon
backup sync daemon (mcast=eth0, syncid=0)
[root@sqlnode1 ha.d]#

It looks like the connection sync daemons are running normally.

Output of ipvsadm -lcn on the lvsdirector node:

[root@lvsdirector ha.d]# ipvsadm -lcn
IPVS connection entries
pro expire state source virtual destination
TCP 14:32 ESTABLISHED 192.168.169.18:61504 192.168.100.1:3306 192.168.100.10:3306
TCP 13:47 ESTABLISHED 192.168.169.18:2436 192.168.100.1:80 192.168.100.40:80
TCP 13:49 ESTABLISHED 192.168.169.18:2438 192.168.100.1:80 192.168.100.40:80
TCP 14:39 ESTABLISHED 192.168.169.18:61505 192.168.100.1:3306 192.168.100.40:3306
TCP 13:45 ESTABLISHED 192.168.169.18:2432 192.168.100.1:80 192.168.100.40:80
TCP 14:21 ESTABLISHED 192.168.169.18:61503 192.168.100.1:3306 192.168.100.40:3306
TCP 13:49 ESTABLISHED 192.168.169.18:2439 192.168.100.1:80 192.168.100.10:80
TCP 13:48 ESTABLISHED 192.168.169.18:2437 192.168.100.1:80 192.168.100.10:80
TCP 13:46 ESTABLISHED 192.168.169.18:2434 192.168.100.1:80 192.168.100.40:80
TCP 13:45 ESTABLISHED 192.168.169.18:2433 192.168.100.1:80 192.168.100.10:80
TCP 14:14 ESTABLISHED 192.168.169.18:61502 192.168.100.1:3306 192.168.100.10:3306
TCP 13:47 ESTABLISHED 192.168.169.18:2435 192.168.100.1:80 192.168.100.10:80
[root@lvsdirector ha.d]#

Output of ipvsadm -lcn on sqlnode1 (the backup node):

[root@sqlnode1 ha.d]# ipvsadm -lcn
IPVS connection entries
pro expire state source virtual destination
[root@sqlnode1 ha.d]#

It looks like connection sync is not actually copying the connections from the master node to the backup node.
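
A minimal sketch for comparing the two connection tables without reading them line by line. The helper name count_ipvs_conns is hypothetical (it is not part of ipvsadm); it only counts the entry lines in ipvsadm -lcn output read from stdin and skips the two header lines.

```shell
# Hypothetical helper: count connection entries in `ipvsadm -lcn` output.
# Reads the output from stdin; header lines do not start with a protocol
# name, so only lines whose first field is TCP or UDP are counted.
count_ipvs_conns() {
    awk '$1 == "TCP" || $1 == "UDP" { n++ } END { print n + 0 }'
}
```

Run as root on each director, for example: ipvsadm -lcn | count_ipvs_conns. When connection sync is working, the count on the backup should be of the same order as the count on the master rather than zero.
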
Now I stop heartbeat on lvsdirector (the master node). sqlnode1 (the backup node) takes over and starts working.

Output of ipvsadm -l on sqlnode1 (the backup node):

[root@sqlnode1 ha.d]# ipvsadm -l
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.168.100.1:http wlc
-> sqlnode:http Route 1 0 0
-> testnode:http Route 1 0 0
TCP 192.168.100.1:mysql wrr
-> sqlnode:mysql Route 10 0 0
-> testnode:mysql Route 10 0 0
[root@sqlnode1 ha.d]#

Output of ipvsadm -lcn run again on sqlnode1 (the backup node):

[root@sqlnode1 ha.d]# ipvsadm -lcn
IPVS connection entries
pro expire state source virtual destination
[root@sqlnode1 ha.d]#

It looks like the connection information from the master node (lvsdirector) was not copied over. What should I do?

Forums:

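I can't spot the problem offhand, but try some troubleshooting steps:

1) Use ps to confirm that the ipvs_master kernel thread is running on the master director and the ipvs_backup kernel thread is running on the backup director.

2) Use ethereal/tcpdump to verify that the master director is sending multicast packets to 224.0.0.81:8848 and that the backup director is receiving them.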
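
The two checks above could be sketched roughly as follows. The function names are hypothetical, eth0 is taken from the ipvsadm -l --daemon output earlier in the thread, and the ps pattern matches any process name containing "ipvs" because the exact kernel thread names may differ between kernel versions.

```shell
# Multicast group and UDP port used by the IPVS sync daemon (from the post).
MCAST_GROUP=224.0.0.81
SYNC_PORT=8848

# Print the capture filter that selects IPVS sync traffic.
sync_filter() {
    echo "host $MCAST_GROUP and udp port $SYNC_PORT"
}

# Step 1: list kernel threads whose name contains "ipvs".
# The [i] keeps grep from matching its own command line.
list_sync_threads() {
    ps ax | grep '[i]pvs'
}

# Step 2: watch for sync packets on an interface (default eth0, an assumption).
# Needs root; stop with Ctrl-C. The filter is left unquoted on purpose so
# that tcpdump receives it as separate words.
watch_sync_traffic() {
    tcpdump -n -i "${1:-eth0}" $(sync_filter)
}
```

Running watch_sync_traffic on both directors while clients are connected should show packets leaving the master and arriving on the backup. Seeing nothing on the master points at the sender side; seeing packets on the master but not on the backup points at the network or the backup's own filtering.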

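Thanks to Dr. Zhang for the pointer. Running tcpdump showed that no multicast packets were leaving lvsdirector. I checked the firewall settings, and the firewall was indeed blocking the multicast packets. After configuring the firewall to allow traffic to 224.0.0.81:8848, connection sync works normally. Great! :-)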
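
The post does not show the actual firewall rules, so the following is only a sketch of what such a change might look like, assuming the firewall is plain iptables and that the built-in INPUT and OUTPUT chains are the ones doing the filtering (a stock CentOS ruleset may use its own chain instead). The function name sync_fw_rules is hypothetical. It only prints the commands so they can be reviewed first.

```shell
# Multicast group and UDP port used by the IPVS sync daemon (from the post).
MCAST_GROUP=224.0.0.81
SYNC_PORT=8848

# Print (do not run) iptables commands that accept sync traffic.
# OUTPUT matters on the master, which sends; INPUT on the backup, which receives.
sync_fw_rules() {
    for chain in INPUT OUTPUT; do
        echo "iptables -I $chain -d $MCAST_GROUP -p udp --dport $SYNC_PORT -j ACCEPT"
    done
}
```

Calling sync_fw_rules shows the two commands; sync_fw_rules | sh applies them as root. Rules inserted this way do not survive a reboot unless they are saved through the distribution's usual mechanism.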
