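Mr. Zhang, I would like to ask whether you have done any in-depth work with heartbeat and ldirectord. Below are my setup and test process and the problems I have run into.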

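I am currently building a web load-balancing system. The environment is as follows.
Hardware: 4 machines. Two act as a hot-standby pair, two as application servers, and one as the database server (apart from the hot-standby machines, the other two are virtual machines; the backup machine doubles as an application server).
Software: heartbeat 2.02, ldirectord, LVS
System environment: RHEL 4 + PHP 4.4.2 + MySQL 4.1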

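I have been working on this on and off for nearly a month.
The whole system is now basically up and running, and I am doing some stability and performance testing.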

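I ran into quite a few problems along the way. Some I still do not fully understand and some remain unsolved; I will keep writing them up. I hope Mr. Zhang can offer some pointers. Thanks in advance.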

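In my LVS tests the scheduling algorithm is rr/wrr.
The documentation says that adding the -p option is all it takes to support sessions. In my tests, however, I found that with -p there are two timeouts: one that can be set in LVS (15 minutes by default), and another that I am not sure where to set (2 minutes by default). I saw both in the output of ipvsadm -lc.
The problem is this: within that time window, all traffic is forwarded to the server the connection was originally established with. If that application server goes down during the window, ldirectord detects it and removes it from the list, yet packets are still forwarded to it because of the previously established connection. Since the machine is down, it cannot respond.
Of course, we can use ipvsadm --set to choose a more suitable timeout. I am testing that now.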
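The timeouts discussed above can be inspected and changed with ipvsadm. The following is a minimal sketch, not a recommended configuration: the function names are made up for illustration and the values passed in are placeholders. The 15-minute and 2-minute figures match ipvsadm's usual defaults of 900 and 120 seconds for the TCP and TCP-FIN timeouts, both of which --set controls, but that mapping is my reading rather than something confirmed in this thread.

```shell
# Sketch only; run as root on the director. Function names are illustrative.
# Set IPVSADM=echo to print the arguments instead of changing kernel state.
IPVSADM=${IPVSADM:-ipvsadm}

# Show the current TCP / TCP-FIN / UDP timeouts (seconds).
show_timeouts() {
    "$IPVSADM" -L --timeout
}

# Set the three timeouts: ipvsadm --set tcp tcpfin udp
set_timeouts() {
    "$IPVSADM" --set "$1" "$2" "$3"
}

# Add a round-robin virtual service with a persistence timeout (-p, seconds).
add_persistent_service() {
    "$IPVSADM" -A -t "$1" -s rr -p "$2"
}

# List connection entries together with their remaining expiry time.
list_connections() {
    "$IPVSADM" -L -c -n
}
```

For example, `set_timeouts 900 120 300` followed by `add_persistent_service 192.168.1.30:80 600` would keep the TCP timeouts at their usual values and give the VIP a 10-minute persistence window; `list_connections` then shows how long each entry has left.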

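When the -p option is not set,
the director basically has to keep a connection entry between the client and every real server. If that is the case, I am a little worried about how many connections the director can sustain under heavy traffic.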

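These are some of the problems I have run into over the last few days of testing. Of course, my configuration may still have problems, or may not be optimal.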

I will keep posting my ideas and test procedures, and I hope we can study and discuss them together.

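Problems with stress testing using ab (192.168.1.30 is the VIP)
Command used: ab -n 10000 -c 1000 http://192.168.1.30/info.php
I ran ab first against the director and then against an application server on its own.
Against the application server the test always completes, and the results are fairly satisfactory, as follows: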
[root@www ~]# ab -n 10000 -c 1000 http://192.168.1.32/info.php
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.32 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: Apache
Server Hostname: 192.168.1.32
Server Port: 80

Document Path: /info.php
Document Length: 27472 bytes

Concurrency Level: 1000
Time taken for tests: 35.187610 seconds
Complete requests: 10000
Failed requests: 23
(Connect: 0, Length: 23, Exceptions: 0)
Write errors: 0
Total transferred: 276521902 bytes
HTML transferred: 275108941 bytes
Requests per second: 284.19 [#/sec] (mean)
Time per request: 3518.761 [ms] (mean)
Time per request: 3.519 [ms] (mean, across all concurrent requests)
Transfer rate: 7674.29 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 731 2190.3 21 21017
Processing: 118 1042 2313.3 451 19089
Waiting: 60 798 1878.7 385 19072
Total: 137 1773 3492.1 481 22115

Percentage of the requests served within a certain time (ms)
50% 481
66% 610
75% 929
80% 1486
90% 3609
95% 9490
98% 16410
99% 21232
100% 22115 (longest request)

But when testing against the director, the test often fails to complete and shows the following:
[root@www ~]# ab -n 10000 -c 1000 http://192.168.1.30/info.php
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.30 (be patient)
Completed 1000 requests
apr_recv: No route to host (113)
Total of 1882 requests completed

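It reports no route to host, but in fact these machines can all reach one another.
I cannot find where the problem lies, and there is very little information online.
Even with the concurrency lowered, the same error occurs.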
[root@www ~]# ab -n 10000 -c 200 http://192.168.1.30/info.php
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.30 (be patient)
apr_recv: No route to host (113)
Total of 39 requests completed

Sometimes the test does complete, but about half of the requests fail:
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.30 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: Apache/2.0.55
Server Hostname: 192.168.1.30
Server Port: 80

Document Path: /info.php
Document Length: 26393 bytes

Concurrency Level: 1000
Time taken for tests: 36.689066 seconds
Complete requests: 10000
Failed requests: 5449
(Connect: 0, Length: 5449, Exceptions: 0)
Write errors: 0
Total transferred: 279894206 bytes
HTML transferred: 278257811 bytes
Requests per second: 272.56 [#/sec] (mean)
Time per request: 3668.907 [ms] (mean)
Time per request: 3.669 [ms] (mean, across all concurrent requests)
Transfer rate: 7450.01 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 4 424 1065.0 12 11008
Processing: 29 1899 3812.6 224 28118
Waiting: 3 493 1599.4 12 22987
Total: 36 2324 4134.6 375 32214

Percentage of the requests served within a certain time (ms)
50% 375
66% 1542
75% 2979
80% 3606
90% 7474
95% 10620
98% 17547
99% 21065
100% 32214 (longest request)
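
The share of failed requests can be read straight from a report like the one above. A small sketch, assuming the ab output has been saved to a file (the function name and the file name in the usage note are hypothetical):

```shell
# Print the percentage of failed requests from a saved ab report.
failure_rate() {
    awk '/^Complete requests:/ { total = $3 }
         /^Failed requests:/   { failed = $3 }
         END { if (total > 0) printf "%.2f\n", 100 * failed / total }' "$1"
}
```

Running `failure_rate ab-vip.txt` on the report above gives 54.49. All 5449 failures fall under Length, which ab uses for responses whose size differs from the first response it received. If info.php produces slightly different output on each real server, that alone could account for them; this is a guess, not something the thread establishes.
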
Mr. Zhang, could you recommend a good stress-testing tool?

2006/3/3
Regarding the error mentioned above:
Benchmarking 192.168.1.30 (be patient)
apr_recv: No route to host (113)
Total of 39 requests completed
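This problem is probably related to the virtual machines used in my tests. I cannot yet say exactly what the cause is; further testing is needed.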


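When persistent service (the -p option) is used to support HTTP sessions, requests from the same IP address are sent to the same server. In that case every request generated by a single ab instance is scheduled to one server, which defeats the purpose of the performance test. In real deployments, the persistence timeout is usually set to several hours.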

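When ldirectord detects a failed application server and removes it from the list, connections established earlier do continue to be forwarded to that machine; that is indeed how it works. IPVS does not immediately expire the connections of a server that has just been removed, because a server removed for being too busy may well be added back soon. If you need the connections of a removed server to be expired right away, you can use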

echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn
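
The same switch can be wrapped in two small helpers so that its state is easy to check before and after a failover test. This is only a sketch: the function names and the VS_PROC override are illustrative, not part of IPVS.

```shell
# Sketch only; needs root and the ip_vs module loaded on the director.
# VS_PROC can be pointed at a scratch directory for a dry run.
VS_PROC=${VS_PROC:-/proc/sys/net/ipv4/vs}

# Print the current value. With 0, packets for a removed server are dropped
# until the entry times out; with 1, the entry is expired as soon as a
# packet arrives for a destination that is no longer available.
get_expire_nodest_conn() {
    cat "$VS_PROC/expire_nodest_conn"
}

# Turn the behaviour on (1) or off (0).
set_expire_nodest_conn() {
    echo "$1" > "$VS_PROC/expire_nodest_conn"
}
```

A value written this way does not survive a reboot; the usual way to make it permanent is a `net.ipv4.vs.expire_nodest_conn = 1` line in /etc/sysctl.conf.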

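There is no need to worry about the memory used to record connections: each connection takes only 128 bytes, so 512 MB of available memory can hold about four million connections.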
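
The arithmetic behind that estimate, as a quick sketch (the function name is made up; the 128-byte figure is the one quoted above):

```shell
# Connections that fit in a given amount of memory (in MB), using the
# 128-byte per-connection figure by default. Integer arithmetic only.
max_connections() {
    mem_mb=$1
    bytes_per_conn=${2:-128}
    echo $(( mem_mb * 1024 * 1024 / bytes_per_conn ))
}

max_connections 512    # prints 4194304
```

512 MB divided by 128 bytes is 4,194,304, which is where the "four million" comes from.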

You could consider a distributed testing tool, or run ab from several machines at once.
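
One way to run ab from several machines at once is to start it over ssh on each load generator. The sketch below assumes ab is installed on every generator and that key-based ssh login works; the function name and host names are hypothetical.

```shell
# Sketch only: start ab on several load generators in parallel over ssh.
# Set SSH=echo to print the commands instead of running them.
SSH=${SSH:-ssh}

run_ab_everywhere() {
    url=$1; requests=$2; concurrency=$3
    shift 3
    for host in "$@"; do
        # One background ssh session per load generator.
        "$SSH" "$host" ab -n "$requests" -c "$concurrency" "$url" &
    done
    wait    # block until every generator has finished
}
```

For example, `run_ab_everywhere http://192.168.1.30/info.php 10000 200 loadgen1 loadgen2 loadgen3`. Because persistence is keyed on the client IP address, each generator would stick to a single real server, so several source addresses are needed before the load is actually spread.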

Thank you for the guidance, Mr. Zhang. I will keep working at it.

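Regarding "in real deployments, the persistence timeout is usually set to several hours": isn't several hours too long? If an application server goes down during that time and nobody notices promptly, wouldn't some clients' browsing be interrupted without anyone knowing?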

However, with

echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn

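that is no longer a problem. I do wonder whether it affects the balancing, though.
As for monitoring, I wonder whether there are any good tools or methods.
At the moment I just check with the ipvsadm tool.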

Could I have your QQ number? Mine is 271545151
