Red Hat Enterprise Linux 6 kernel: nf_conntrack: table full, dropping packet.

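While running a HAWQ stress test, I suddenly could not connect to the server. On inspection, the cluster had unexpectedly failed over to the standby node on its own.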

Checking the heartbeat log on the standby node:

Jul 24 15:55:17 big3hd02.corp.haier.com heartbeat: [23081]: info: Link big3hd01.corp.haier.com:bond0 dead.
Jul 24 15:55:17 big3hd02.corp.haier.com ipfail: [23133]: info: Link Status update: Link big3hd01.corp.haier.com/bond0 now has status dead
Jul 24 15:55:18 big3hd02.corp.haier.com ipfail: [23133]: info: Asking other side for ping node count.
Jul 24 15:55:18 big3hd02.corp.haier.com ipfail: [23133]: info: Checking remote count of ping nodes.
Jul 24 15:55:21 big3hd02.corp.haier.com ipfail: [23133]: info: Telling other node that we have more visible ping nodes.
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: big3hd01.corp.haier.com wants to go standby [all]
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: standby: other_holds_resources: 3
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: New standby state: 2
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: New standby state: 2
Jul 24 15:55:27 big3hd02.corp.haier.com heartbeat: [23081]: info: other_holds_resources: 0
Jul 24 15:55:41 big3hd02.corp.haier.com heartbeat: [23081]: info: Link big3hd01.corp.haier.com:bond0 up.

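The bond0 interface on the primary node had become unreachable, so heartbeat failed over to the standby node. That was puzzling, so I checked the system log on the primary node: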

Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 heartbeat: [3687]: ERROR: glib: Error sending packet: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=210 [-1]: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3687]: info: glib: euid=0 egid=0
Jul 24 15:54:59 big3hd01 heartbeat: [3687]: ERROR: write_child: write failure on ping 10.135.25.254.: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=198 [-1]: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3687]: ERROR: glib: Error sending packet: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=198 [-1]: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3687]: info: glib: euid=0 egid=0
Jul 24 15:55:01 big3hd01 heartbeat: [3687]: ERROR: write_child: write failure on ping 10.135.25.254.: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3687]: ERROR: glib: Error sending packet: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=197 [-1]: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3687]: info: glib: euid=0 egid=0
Jul 24 15:55:03 big3hd01 heartbeat: [3687]: ERROR: write_child: write failure on ping 10.135.25.254.: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:55:04 big3hd01 kernel: __ratelimit: 169 callbacks suppressed
Jul 24 15:55:04 big3hd01 kernel: nf_conntrack: table full, dropping packet.

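The log contains a large number of kernel: nf_conntrack: table full, dropping packet. messages. Once the connection tracking table is full, the kernel drops packets for new connections, which would explain why heartbeat's own sends failed with "Operation not permitted" and the peer declared the bond0 link dead.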

Checking the netfilter setting:

sysctl net.nf_conntrack_max

net.nf_conntrack_max = 65536

Checking the number of connections currently tracked:

wc -l /proc/net/nf_conntrack reported more than 50,000 entries.
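The two checks above can be combined into one snapshot of how full the table is. The following is a minimal sketch, assuming the nf_conntrack module is loaded and exposes its counters under /proc/sys/net/netfilter; the helper name usage_pct is made up for illustration.

```shell
#!/bin/sh
# usage_pct COUNT MAX: integer percentage of the conntrack table in use.
# (Hypothetical helper name, introduced only for this example.)
usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# These files are only present while the nf_conntrack module is loaded.
count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max

if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    count=$(cat "$count_file")
    max=$(cat "$max_file")
    echo "conntrack: $count / $max ($(usage_pct "$count" "$max")%)"
else
    echo "nf_conntrack is not loaded on this host"
fi
```

Reading nf_conntrack_count is cheaper than counting the lines of /proc/net/nf_conntrack, which matters when the table holds tens of thousands of entries.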

Raise net.nf_conntrack_max to 200000:

sysctl -w net.nf_conntrack_max=200000

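Monitoring during the concurrency test, wc -l /proc/net/nf_conntrack climbed to more than 70,000 entries, which is above the old limit of 65536. No wonder the log showed kernel: nf_conntrack: table full, dropping packet.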
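To capture the peak rather than a single reading, the counter can be sampled repeatedly while the test runs. This is a rough sketch, not the procedure used above: SAMPLES and INTERVAL are illustrative defaults and peak is a hypothetical helper name.

```shell
#!/bin/sh
# peak: print the largest integer read from standard input (one per line).
# (Hypothetical helper name, introduced only for this example.)
peak() {
    sort -n | tail -n 1
}

# Illustrative defaults; override from the environment to cover a whole test run.
SAMPLES=${SAMPLES:-5}
INTERVAL=${INTERVAL:-1}
count_file=/proc/sys/net/netfilter/nf_conntrack_count

if [ -r "$count_file" ]; then
    i=0
    while [ "$i" -lt "$SAMPLES" ]; do
        cat "$count_file"      # one reading per line
        sleep "$INTERVAL"
        i=$((i + 1))
    done | peak                # keep only the highest reading
else
    echo "nf_conntrack is not loaded on this host"
fi
```

Comparing the printed peak with net.nf_conntrack_max shows how much headroom the chosen limit leaves.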

sysctl -w net.nf_conntrack_max=200000

Running the test again went smoothly, and the problem with the concurrency test was solved.
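A value set with sysctl -w lasts only until the next reboot. One way to keep it is to record it in /etc/sysctl.conf. The sketch below assumes that file is where this host keeps its settings; persist_sysctl is a hypothetical helper, and the call against the real file is left commented out.

```shell
#!/bin/sh
# persist_sysctl KEY VALUE FILE: make FILE contain exactly one "KEY = VALUE" line.
# (Hypothetical helper name, introduced only for this example.)
persist_sysctl() {
    key=$1 value=$2 file=$3
    tmp=$(mktemp)
    # Drop any existing assignment of KEY, then append the new one.
    grep -v "^[[:space:]]*${key}[[:space:]]*=" "$file" > "$tmp" 2>/dev/null
    echo "$key = $value" >> "$tmp"
    cat "$tmp" > "$file"       # overwrite in place so FILE keeps its permissions
    rm -f "$tmp"
}

# As root, on the real host (assumed path; review the file afterwards):
# persist_sysctl net.nf_conntrack_max 200000 /etc/sysctl.conf && sysctl -p
```

If the nf_conntrack module is not yet loaded when the file is applied at boot, the key may not exist at that point and the setting may not take effect, so it is worth checking sysctl net.nf_conntrack_max after a reboot.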