Question: keepalived health-check process crashes when there are too many real_servers

Environment: CentOS release 6.2, kernel 2.6.32-220.el6.x86_64
keepalived-1.2.7 ipvsadm v1.26 IPVS v1.2.1

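keepalived is used for health checking.
About 200 VIPs are currently in use, each with roughly 5-10 real servers behind it.
Once the number of real servers managed by a single keepalived instance reaches about 1100, the keepalived health-check process dies and then restarts in an endless loop.
Now every keepalived start or keepalived reload produces a large number of log entries like these: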

Jul 14 19:47:07 b02 Keepalived_healthcheckers[14203]: Cannot send get request to [10.15.200.200]:80.
Jul 14 19:47:07 b02 Keepalived_healthcheckers[14203]: Removing service [10.100.200.200]:80 from VS [10.15.177.177]:80
Jul 14 19:47:07 b02 Keepalived_healthcheckers[14203]: SMTP connection ERROR to [127.0.0.1]:25.
Jul 14 19:47:07 b02 Keepalived[13055]: Healthcheck child process(14203) died: Respawning
Jul 14 19:47:07 b02 Keepalived[13055]: Starting Healthcheck child process, pid=14525
Jul 14 19:47:07 b02 Keepalived_healthcheckers[14525]: Interface queue is empty
Jul 14 19:47:07 b02 Keepalived_healthcheckers[14525]: No such interface, eth1
Jul 14 19:47:07 b02 Keepalived_healthcheckers[14525]: No such interface, usb0
Jul 14 19:47:07 b02 Keepalived_healthcheckers[14525]: No such interface, bond0

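What could be determining this threshold? When the problem occurs, only keepalived's health-check process is affected; the vrrp process keeps running normally.
I could not find any hard limit on the number of real servers in the source code.
At first I suspected that the number of VIPs was the cause, but testing showed that it is the size of the real server list that matters.
Any ideas on how to approach this would be appreciated. :)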


int check_respawn_thread(thread_t * thread)
{
    pid_t pid;

    /* Fetch thread args */
    pid = THREAD_CHILD_PID(thread);

    /* Restart respawning thread */
    if (thread->type == THREAD_CHILD_TIMEOUT) {
        thread_add_child(master, check_respawn_thread, NULL,
                         pid, RESPAWN_TIMER);
        return 0;
    }

    /* We catch a SIGCHLD, handle it */
    log_message(LOG_ALERT, "Healthcheck child process(%d) died: Respawning", pid);
    start_check_child();
    return 0;
}

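Test 1: With more than about 1100 real servers, keepalived's Healthcheck process dies and is respawned over and over (same log output as above). The main process and the vrrp child process are not affected.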

Test 2: With about 1020 real servers, keepalived works normally. The parent process and both child processes are fine.

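Found the problem.

The cause is the __FD_SETSIZE limit. keepalived uses select() for its I/O multiplexing, and __FD_SETSIZE is a compile-time constant (1024 on Linux with glibc), so by default select() can only watch file descriptors numbered below 1024. This matches the tests: about 1020 real servers works, about 1100 does not.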
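
To illustrate the limit, here is a minimal sketch, assuming Linux with glibc where FD_SETSIZE is 1024. The helper names fd_fits_in_fdset and checked_fd_set are hypothetical and are not part of keepalived; they only show why a descriptor numbered 1024 or higher cannot safely be used with select(). Presumably each health check holds its own socket, so with around 1100 real servers some descriptors end up beyond that range.

```c
#include <sys/select.h>

/* Hypothetical helper (not from keepalived): returns 1 if fd can be
 * stored in an fd_set, 0 otherwise. Calling FD_SET() with an fd that is
 * >= FD_SETSIZE writes past the end of the fd_set, which is undefined
 * behavior and can crash the process. */
static int fd_fits_in_fdset(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: add fd to the set only if it is in range.
 * Returns 0 on success, -1 if the fd cannot be used with select(). */
static int checked_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fdset(fd))
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A caller would run FD_ZERO() on an fd_set and then use checked_fd_set() in place of a bare FD_SET(), treating a return value of -1 as "this socket cannot be monitored with select()". Raising the open-file limit with ulimit does not help here, because the size of fd_set is fixed when the program is compiled.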
