OOM problem on an AS5.3 server running LVS, asking for help!

System version:
Red Hat Enterprise Linux Server release 5.3 (Tikanga)

Kernel version:
2.6.18-128.el5PAE

heartbeat was installed with yum from the CentOS repository. The relevant packages are:
heartbeat-stonith-2.1.3-3.el5.centos
heartbeat-devel-2.1.3-3.el5.centos
heartbeat-2.1.3-3.el5.centos
heartbeat-ldirectord-2.1.3-3.el5.centos
heartbeat-gui-2.1.3-3.el5.centos
heartbeat-pils-2.1.3-3.el5.centos

I've run into a bizarre problem: some time after the services start, the machine hangs. This is what I found in the log:
Aug 25 19:23:37 xxx kernel: ldirectord invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Aug 25 19:23:37 xxx kernel: [] out_of_memory+0x72/0x1a5
Aug 25 19:23:37 xxx kernel: [] __alloc_pages+0x216/0x297
Aug 25 19:23:37 xxx kernel: [] tcp_sendmsg+0x504/0x9e9
Aug 25 19:23:37 xxx kernel: [] current_fs_time+0x4a/0x55
Aug 25 19:23:37 xxx kernel: [] core_sys_select+0x1c5/0x2ca
Aug 25 19:23:37 xxx kernel: [] inet_sendmsg+0x35/0x3f
Aug 25 19:23:37 xxx kernel: [] do_sock_write+0xa3/0xaa
Aug 25 19:23:37 xxx kernel: [] sock_aio_write+0x53/0x61
Aug 25 19:23:37 xxx kernel: [] release_sock+0xc/0x91
Aug 25 19:23:37 xxx kernel: [] do_sync_write+0xb6/0xf1
Aug 25 19:23:37 xxx kernel: [] autoremove_wake_function+0x0/0x2d
Aug 25 19:23:37 xxx kernel: [] audit_syscall_entry+0x14b/0x17d

Aug 25 19:24:13 xxx heartbeat: [3599]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 180 ms (> 30 ms) (GSource: 0x8261aa8)
Aug 25 19:24:13 xxx heartbeat: [8317]: info: Starting "/usr/lib/heartbeat/ipfail" as uid 498 gid 496 (pid 8317)
Aug 25 19:24:13 xxx kernel: cpu 3 hot: high 0, batch 1 used:0
Aug 25 19:24:13 xxx heartbeat: [3599]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 21210 ms (> 510 ms) before being called (GSource: 0x82678a8)
Aug 25 19:24:13 xxx kernel: cpu 3 cold: high 0, batch 1 used:0
Aug 25 19:24:13 xxx heartbeat: [3599]: info: Gmain_timeout_dispatch: started at 430172311 should have started at 430170190
Aug 25 19:24:13 xxx kernel: DMA32 per-cpu: empty
Aug 25 19:24:13 xxx heartbeat: [3599]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals took too long to execute: 100 ms (> 50 ms) (GSource: 0x82678a8)
Aug 25 19:24:13 xxx kernel: Normal per-cpu:
Aug 25 19:24:13 xxx heartbeat: [3599]: WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 18380 ms (> 5000 ms) before being called (GSource: 0x82677d8)
Aug 25 19:24:13 xxx kernel: cpu 0 hot: high 186, batch 31 used:10
Aug 25 19:24:13 xxx heartbeat: [3599]: info: Gmain_timeout_dispatch: started at 430172325 should have started at 430170487
Aug 25 19:24:13 xxx kernel: cpu 0 cold: high 62, batch 15 used:57
Aug 25 19:24:13 xxx kernel: cpu 1 hot: high 186, batch 31 used:21
Aug 25 19:24:13 xxx kernel: cpu 1 cold: high 62, batch 15 used:56
Aug 25 19:24:13 xxx kernel: cpu 2 hot: high 186, batch 31 used:15
Aug 25 19:24:13 xxx kernel: cpu 2 cold: high 62, batch 15 used:6
Aug 25 19:24:13 xxx kernel: cpu 3 hot: high 186, batch 31 used:7
Aug 25 19:24:13 xxx kernel: cpu 3 cold: high 62, batch 15 used:2
Aug 25 19:24:13 xxx kernel: HighMem per-cpu:
Aug 25 19:24:13 xxx kernel: cpu 0 hot: high 186, batch 31 used:166
Aug 25 19:24:13 xxx kernel: cpu 0 cold: high 62, batch 15 used:0
Aug 25 19:24:13 xxx kernel: cpu 1 hot: high 186, batch 31 used:82
Aug 25 19:24:13 xxx kernel: cpu 1 cold: high 62, batch 15 used:2
Aug 25 19:24:13 xxx kernel: cpu 2 hot: high 186, batch 31 used:162
Aug 25 19:24:14 xxx kernel: cpu 2 cold: high 62, batch 15 used:9
Aug 25 19:24:14 xxx kernel: cpu 3 hot: high 186, batch 31 used:181
Aug 25 19:24:14 xxx kernel: cpu 3 cold: high 62, batch 15 used:12
Aug 25 19:24:14 xxx kernel: Free pages: 3146460kB (3139712kB HighMem)
Aug 25 19:24:14 xxx kernel: Active:14156 inactive:17457 dirty:10 writeback:0 unstable:0 free:786615 slab:215678 mapped-file:3632 mapped-anon:10032 pagetables:589
Aug 25 19:24:14 xxx kernel: DMA free:3588kB min:68kB low:84kB high:100kB active:8kB inactive:0kB present:16384kB pages_scanned:28515 all_unreclaimable? yes
Aug 25 19:24:14 xxx kernel: lowmem_reserve[]: 0 0 880 4848
Aug 25 19:24:14 xxx kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Aug 25 19:24:14 xxx kernel: lowmem_reserve[]: 0 0 880 4848
Aug 25 19:24:14 xxx kernel: Normal free:3160kB min:3756kB low:4692kB high:5632kB active:128kB inactive:88kB present:901120kB pages_scanned:453617 all_unreclaimable? yes
Aug 25 19:24:14 xxx kernel: lowmem_reserve[]: 0 0 0 31744
Aug 25 19:24:14 xxx kernel: HighMem free:3139712kB min:512kB low:4748kB high:8988kB active:56616kB inactive:69740kB present:4063232kB pages_scanned:0 all_unreclaimable? no
Aug 25 19:24:14 xxx kernel: lowmem_reserve[]: 0 0 0 0
Aug 25 19:24:14 xxx kernel: DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Aug 25 19:24:14 xxx kernel: DMA32: empty
Aug 25 19:24:14 xxx kernel: Normal: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3160kB
Aug 25 19:24:14 xxx kernel: HighMem: 964*4kB 976*8kB 631*16kB 302*32kB 147*64kB 68*128kB 23*256kB 18*512kB 11*1024kB 6*2048

After that the oom-killer went on to kill a series of system processes, and then the machine hung.
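
A rough reading of the numbers in the dump, offered as a sketch rather than a diagnosis: `slab:215678` is a page count, and the buddy lists (`1*4kB` ...) indicate 4 kB pages. On a 32-bit PAE kernel the slab caches are allocated from low memory (the DMA and Normal zones), so it may be worth comparing the slab size with the `present:` sizes of those two zones. The variable names below are made up for illustration; the values are copied from the log.

```python
# Values copied from the oom-killer dump in this post.
# The 4 kB page size is inferred from the "1*4kB" buddy-list entries.
PAGE_KB = 4
slab_pages = 215678          # "slab:215678"
normal_present_kb = 901120   # "Normal ... present:901120kB"
dma_present_kb = 16384       # "DMA ... present:16384kB"

slab_kb = slab_pages * PAGE_KB
lowmem_kb = normal_present_kb + dma_present_kb
slab_share = slab_kb / lowmem_kb

print(f"slab: {slab_kb} kB ({slab_kb / 1024:.0f} MB)")
print(f"lowmem (DMA + Normal): {lowmem_kb} kB")
print(f"slab share of lowmem: {slab_share:.0%}")
```

If this reading is right, slab is holding about 842 MB, roughly 94% of low memory, which would fit with the Normal zone being exhausted while HighMem still has about 3 GB free.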

After rebooting I watched the lowmem figures:
cat /proc/meminfo |grep LowFree
LowFree: 84868 kB
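The value keeps going down, from a bit over 800 MB until it is used up. On our other LVS servers running AS4.x it stays stable at around 200: there the memory ldirectord takes from this area is released again automatically, but on the 5.3 system it never is, until nothing is left.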
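
To put numbers on the decline, something like the following could log `LowFree` together with `Slab` at a fixed interval. This is a minimal sketch, assuming a Python 3 interpreter is available on the box (RHEL 5.3 does not ship one by default); the function names, the 60-second interval and the sample count are arbitrary choices for illustration.

```python
import time

def parse_meminfo(text):
    """Return {field: value in kB} for lines such as 'LowFree:  84868 kB'."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip lines without a "name: number" shape.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def sample(path="/proc/meminfo", fields=("LowFree", "Slab")):
    """Read the requested fields once; missing fields come back as None."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {name: info.get(name) for name in fields}

def watch(interval_s=60, count=10):
    """Print a timestamped sample every interval_s seconds, count times."""
    for _ in range(count):
        print(time.strftime("%Y-%m-%d %H:%M:%S"), sample())
        time.sleep(interval_s)
```

Calling `watch()` prints one line per interval. If `Slab` grows at about the rate `LowFree` shrinks, `slabtop` or `/proc/slabinfo` should show which cache is growing.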

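Has anyone run into something similar? I don't think it is related to the kernel version, because we have AS4.7 servers whose kernel was upgraded to 2.6.18 and that run LVS without any problems.
Thanks in advance.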
